Various hub/contrib improvements (#528)
At a high level this commit does:
 - moves things around, e.g. official -> DAGWorks.
 - fixes copy and READMEs
 - adds stub for leaderboard
 - fixes telemetry/contrib module usage when sf-hamilton-contrib is not installed



See squashed commits for details:

* Moves Official to DAGWorks for contrib/hub

Because that's who is maintaining it.

I also tacked on refactoring the Python hub docs
code to use Jinja2 templates, which makes
conditional logic easier to maintain.
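
For illustration, a minimal sketch of the kind of conditional
template this enables. The template source below is hypothetical --
the real one lives in `templates/driver_builder.py.jinja2` and takes
the `use_executor`, `dynamic_import`, `is_user`, `MODULE_NAME`, and
`USER` arguments passed to `render()` in the diff below:

```python
import jinja2

# Hypothetical mini-version of the driver-builder template: one template
# with conditionals replaces the four near-duplicate Python string
# templates that existed before this commit.
template = jinja2.Template(
    "{% if dynamic_import -%}"
    '{{ MODULE_NAME }} = dataflow.import_module("{{ MODULE_NAME }}"'
    '{% if is_user %}, "{{ USER }}"{% endif %})'
    "{%- else -%}"
    "from hamilton.contrib."
    "{% if is_user %}user.{{ USER }}{% else %}dagworks{% endif %}"
    " import {{ MODULE_NAME }}"
    "{%- endif %}"
)
print(template.render(dynamic_import=False, is_user=True,
                      MODULE_NAME="my_dataflow", USER="some_user"))
# -> from hamilton.contrib.user.some_user import my_dataflow
```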

* Adds "view by tags" link in hub

So that people can see an index.

This is a quick way to get to that screen
without having to customize Docusaurus.

* Fixes some README wording to drop `==` install

For the hub we don't want to have to bump the version
in the docs on every release, so we drop the pinned version
for now (`pip install sf-hamilton-contrib --upgrade` instead of
`pip install sf-hamilton-contrib==0.0.1rc1`). We could build
something that knows the current version, but that's more
effort than it's worth at the moment.

* Adds link to copy docs for hub

* Adds duplicate contrib/__init__ for telemetry capture

If someone does not have sf-hamilton-contrib installed, they could
download a dataflow but not be able to import it, because dataflows
depend on the `hamilton.contrib` submodule existing. This sticks
a placeholder in to enable things to work without installing
sf-hamilton-contrib. The assumption is that it will get clobbered
by an installed `sf-hamilton-contrib`, and everything should
continue to work.

This also updates the telemetry code to handle this case, and to parse
the relevant details from the file path. I have tested that things work
for old and new paths.

What I haven't extensively tested is the installation order.
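
To make the mechanism concrete, a hedged sketch (the file contents and
helper below are illustrative, not the exact code in this commit):

```python
# Hypothetical placeholder hamilton/contrib/__init__.py shipped with core
# Hamilton: its only job is to exist so that a downloaded dataflow's
# `hamilton.contrib.user.<USER>.<DATAFLOW>` import resolves. Installing
# sf-hamilton-contrib overwrites it with the real package.
import os


def _parse_dataflow_details(file_path: str) -> dict:
    """Illustrative telemetry helper: pull user/dataflow names from a path.

    Handles both a downloaded path like
    ~/.hamilton/dataflows/user/<USER>/<DATAFLOW>/__init__.py and an
    installed one like .../hamilton/contrib/user/<USER>/<DATAFLOW>/...
    """
    parts = os.path.normpath(file_path).split(os.sep)
    if "user" in parts:
        idx = parts.index("user")
        return {"user": parts[idx + 1], "dataflow": parts[idx + 2]}
    # no user segment -> a DAGWorks-maintained dataflow
    return {"user": None, "dataflow": parts[-2]}
```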

* Moves polars import to try/except

For some reason this was breaking for me,
and this fixed it.
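
The shape of the fix, roughly (a sketch of the pattern, not the exact diff):

```python
# Guarded import: environments without polars no longer fail at import
# time; polars-specific functionality is simply unavailable instead.
try:
    import polars as pl
except ImportError:
    pl = None  # optional dependency
```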

* Fixing isort issues

* Cleaning up some typos

* Revert "Fixing isort issues"

This reverts commit 8288b7f.

* Adds stub for leaderboard page for hub

This is pretty basic, but it gets something up.

* Fixing isort
skrawcz authored Nov 11, 2023
1 parent 8463800 commit d9f7a44
Showing 19 changed files with 258 additions and 140 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/docusaurus-gh-pages.yml
@@ -44,7 +44,9 @@ jobs:
pip install -e .
- name: Compile code to create pages
working-directory: contrib/docs
run: python compile_docs.py
run: |
pip install jinja2
python compile_docs.py
- name: Set up Node.js
uses: actions/setup-node@v3
with:
14 changes: 9 additions & 5 deletions contrib/README.md
@@ -26,11 +26,11 @@ pip install sf-hamilton-contrib --upgrade
Once installed, you can import the dataflows as follows.

Things you need to know:
1. Whether it's a user or official dataflow. If user, what the name of the user is.
1. Whether it's a user or official DAGWorks supported dataflow. If user, what the name of the user is.
2. The name of the dataflow.
```python
from hamilton import driver
# from hamilton.contrib.official import NAME_OF_DATAFLOW
# from hamilton.contrib.dagworks import NAME_OF_DATAFLOW
from hamilton.contrib.user.NAME_OF_USER import NAME_OF_DATAFLOW

dr = (
@@ -45,6 +45,7 @@ result = dr.execute(
inputs={...} # pass in inputs as appropriate
)
```
To find an example [go to the hub](https://hub.dagworks.io/docs/).

#### Dynamic installation
Here we dynamically download the dataflow from the internet and execute it. This is useful for quickly
@@ -54,7 +55,7 @@ iterating in a notebook and pulling in just the dataflow you need.
from hamilton import dataflow, driver

# downloads into ~/.hamilton/dataflows and loads the module -- WARNING: ensure you know what code you're importing!
# NAME_OF_DATAFLOW = dataflow.import_module("NAME_OF_DATAFLOW") # if using official dataflow
# NAME_OF_DATAFLOW = dataflow.import_module("NAME_OF_DATAFLOW") # if using official DAGWorks dataflow
NAME_OF_DATAFLOW = dataflow.import_module("NAME_OF_DATAFLOW", "NAME_OF_USER")
dr = (
driver.Builder()
@@ -68,11 +69,12 @@ result = dr.execute(
inputs={...} # pass in inputs as appropriate
)
```
To find an example [go to the hub](https://hub.dagworks.io/docs/).

#### Modification
Getting started is one thing, but then modifying to your needs is another. So we have a prescribed
flow to enable you to take a dataflow, and copy the code to a place of your choosing. This allows
you to easily modify the dataflow as you see fit. You will need to adhere to any licenses the code may come with.
The default, if not specified, is the "BSD-3 clear clause".
you to easily modify the dataflow as you see fit.

Run this in a notebook or python script to copy the dataflow to a directory of your choosing.
```python
@@ -85,6 +87,8 @@ dataflow.copy(NAME_OF_DATAFLOW, destination_path="PATH_TO_DIRECTORY")
from hamilton.contrib.user.NAME_OF_USER import NAME_OF_DATAFLOW
dataflow.copy(NAME_OF_DATAFLOW, destination_path="PATH_TO_DIRECTORY")
```
You can then modify/import the code as you see fit. See [copy()](https://hamilton.dagworks.io/en/latest/reference/dataflows/copy/)
for more details.


### How to contribute
166 changes: 62 additions & 104 deletions contrib/docs/compile_docs.py
@@ -15,15 +15,17 @@
import shutil
import subprocess

import jinja2

from hamilton.function_modifiers import config
from hamilton.htypes import Collect, Parallelizable

DATAFLOW_FOLDER = ".."
USER_PATH = DATAFLOW_FOLDER + "/hamilton/contrib/user"
OFFICIAL_PATH = DATAFLOW_FOLDER + "/hamilton/contrib/official"
DAGWORKS_PATH = DATAFLOW_FOLDER + "/hamilton/contrib/dagworks"


@config.when(is_official="False")
@config.when(is_dagworks="False")
def user__usr(path: str) -> Parallelizable[dict]:
"""Find all users in the contrib/user folder."""
for _user in os.listdir(path):
@@ -36,10 +38,10 @@ def user__usr(path: str) -> Parallelizable[dict]:
yield {"user": _user, "path": os.path.join(path, _user)}


@config.when(is_official="True")
def user__official(path: str) -> Parallelizable[dict]:
"""Find all users in the contrib/official folder."""
yield {"user": "::OFFICIAL::", "path": path}
@config.when(is_dagworks="True")
def user__dagworks(path: str) -> Parallelizable[dict]:
"""Find all users in the contrib/dagworks folder."""
yield {"user": "::DAGWORKS::", "path": path}


def dataflows(user: dict) -> list[dict]:
@@ -127,71 +129,8 @@ def dataflows_with_everything(


# TEMPLATES!
python_user_dataflow_template = """from hamilton import dataflow, driver
# downloads into ~/.hamilton/dataflows and loads the module -- WARNING: ensure you know what code you're importing!
{MODULE_NAME} = dataflow.import_module("{MODULE_NAME}", "{USER}")
dr = (
driver.Builder()
.with_config({{}}) # replace with configuration as appropriate
.with_modules({MODULE_NAME})
.build()
)
# execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
[{MODULE_NAME}.CHANGE_ME, ...], # this specifies what you want back
inputs={{...}} # pass in inputs as appropriate
)
"""

python_official_dataflow_template = """from hamilton import dataflow, driver
# downloads into ~/.hamilton/dataflows and loads the module
{MODULE_NAME} = dataflow.import_module("{MODULE_NAME}")
dr = (
driver.Builder()
.with_config({{}}) # replace with configuration as appropriate
.with_modules({MODULE_NAME})
.build()
)
# execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
[{MODULE_NAME}.CHANGE_ME, ...], # this specifies what you want back
inputs={{...}} # pass in inputs as appropriate
)
"""

python_user_import_template = """# pip install sf-hamilton-contrib==0.0.1rc1
from hamilton import driver
from hamilton.contrib.user.{USER} import {MODULE_NAME}
dr = (
driver.Builder()
.with_config({{}}) # replace with configuration as appropriate
.with_modules({MODULE_NAME})
.build()
)
# execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
[{MODULE_NAME}..., ...], # this specifies what you want back
inputs={{...}} # pass in inputs as appropriate
)
"""

python_official_import_template = """# pip install sf-hamilton-contrib==0.0.1rc1
from hamilton import driver
from hamilton.contrib.official import {MODULE_NAME}
dr = (
driver.Builder()
.with_config({{}}) # replace with configuration as appropriate
.with_modules({MODULE_NAME})
.build()
)
# execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
[{MODULE_NAME}..., ...], # this specifies what you want back
inputs={{...}} # pass in inputs as appropriate
)
"""
template_env = jinja2.Environment(loader=jinja2.FileSystemLoader("templates/"))
builder_template = template_env.get_template("driver_builder.py.jinja2")

mdx_template = """---
id: {USER}-{DATAFLOW_NAME}
@@ -214,7 +153,7 @@ def dataflows_with_everything(
### Use published library version
```bash
pip install sf-hamilton-contrib==0.0.1rc1 # make sure you have the latest
pip install sf-hamilton-contrib --upgrade # make sure you have the latest
```
import example2 from '!!raw-loader!./example2.py';
@@ -239,8 +178,8 @@ def dataflows_with_everything(
"""

mdx_official_template = """---
id: OFFICIAL-{DATAFLOW_NAME}
mdx_dagworks_template = """---
id: DAGWorks-{DATAFLOW_NAME}
title: {DATAFLOW_NAME}
tags: {USE_CASE_TAGS}
---
Expand All @@ -260,7 +199,7 @@ def dataflows_with_everything(
### Use published library version
```bash
pip install sf-hamilton-contrib==0.0.1rc1 # make sure you have the latest
pip install sf-hamilton-contrib --upgrade # make sure you have the latest
```
import example2 from '!!raw-loader!./example2.py';
Expand All @@ -286,7 +225,7 @@ def dataflows_with_everything(
# TODO: edit/adjust links to docs, etc.


@config.when(is_official="False")
@config.when(is_dagworks="False")
def user_dataflows__user(dataflows_with_everything: Collect[list[dict]]) -> dict[str, list[dict]]:
"""Big function that creates the docs for a user."""
result = {}
@@ -314,17 +253,31 @@ def user_dataflows__user(dataflows_with_everything: Collect[list[dict]]) -> dict
]:
continue
shutil.copyfile(os.path.join(single_df["path"], file), os.path.join(df_path, file))

# get tags
with open(os.path.join(single_df["path"], "tags.json"), "r") as f:
tags = json.load(f)
# checks for driver related tags
uses_executor = tags.get("driver_tags", {}).get("executor", None)
# create python file
with open(os.path.join(df_path, "example1.py"), "w") as f:
f.write(
python_user_dataflow_template.format(
MODULE_NAME=single_df["dataflow"], USER=_user_name
builder_template.render(
use_executor=uses_executor,
dynamic_import=True,
is_user=True,
MODULE_NAME=single_df["dataflow"],
USER=_user_name,
)
)
with open(os.path.join(df_path, "example2.py"), "w") as f:
f.write(
python_user_import_template.format(
MODULE_NAME=single_df["dataflow"], USER=_user_name
builder_template.render(
use_executor=uses_executor,
dynamic_import=False,
is_user=True,
MODULE_NAME=single_df["dataflow"],
USER=_user_name,
)
)
# create MDX file
@@ -333,9 +286,6 @@ def user_dataflows__user(dataflows_with_everything: Collect[list[dict]]) -> dict
readme_string = ""
for line in readme_lines:
readme_string += line.replace("#", "##", 1)
# get tags
with open(os.path.join(single_df["path"], "tags.json"), "r") as f:
tags = json.load(f)

with open(os.path.join(df_path, "README.mdx"), "w") as f:
f.write(
@@ -362,28 +312,28 @@ def _create_commit_file(df_path, single_df):
f.write(f"[commit::{commit}][ts::{ts}]\n")


@config.when(is_official="True")
def user_dataflows__official(
@config.when(is_dagworks="True")
def user_dataflows__dagworks(
dataflows_with_everything: Collect[list[dict]],
) -> dict[str, list[dict]]:
"""Big function that creates the docs for official dataflow."""
"""Big function that creates the docs for dagworks dataflow."""
result = {}
for _official_dataflows in dataflows_with_everything:
if len(_official_dataflows) < 1:
for _dagworks_dataflows in dataflows_with_everything:
if len(_dagworks_dataflows) < 1:
continue
_user_name = _official_dataflows[0]["user"]
result[_user_name] = _official_dataflows
_user_name = _dagworks_dataflows[0]["user"]
result[_user_name] = _dagworks_dataflows
# make the folder
official_path = os.path.join("docs", "Official")
os.makedirs(official_path, exist_ok=True)
dagworks_path = os.path.join("docs", "DAGWorks")
os.makedirs(dagworks_path, exist_ok=True)
# copy the author.md file
shutil.copyfile(
_official_dataflows[0]["author_path"], os.path.join(official_path, "index.mdx")
_dagworks_dataflows[0]["author_path"], os.path.join(dagworks_path, "index.mdx")
)
# make all dataflow folders
for single_df in _official_dataflows:
for single_df in _dagworks_dataflows:
# make the folder
df_path = os.path.join(official_path, single_df["dataflow"])
df_path = os.path.join(dagworks_path, single_df["dataflow"])
os.makedirs(df_path, exist_ok=True)
# copy the files
for file in os.listdir(single_df["path"]):
@@ -396,16 +346,27 @@ def user_dataflows__official(
]:
continue
shutil.copyfile(os.path.join(single_df["path"], file), os.path.join(df_path, file))
# get tags
with open(os.path.join(single_df["path"], "tags.json"), "r") as f:
tags = json.load(f)
# checks for driver related tags
uses_executor = tags.get("driver_tags", {}).get("executor", None)
# create python file
with open(os.path.join(df_path, "example1.py"), "w") as f:
f.write(
python_official_dataflow_template.format(
builder_template.render(
use_executor=uses_executor,
dynamic_import=True,
is_user=False,
MODULE_NAME=single_df["dataflow"],
)
)
with open(os.path.join(df_path, "example2.py"), "w") as f:
f.write(
python_official_import_template.format(
builder_template.render(
use_executor=uses_executor,
dynamic_import=False,
is_user=False,
MODULE_NAME=single_df["dataflow"],
)
)
@@ -415,13 +376,10 @@ def user_dataflows__official(
readme_string = ""
for line in readme_lines:
readme_string += line.replace("#", "##", 1)
# get tags
with open(os.path.join(single_df["path"], "tags.json"), "r") as f:
tags = json.load(f)

with open(os.path.join(df_path, "README.mdx"), "w") as f:
f.write(
mdx_official_template.format(
mdx_dagworks_template.format(
DATAFLOW_NAME=single_df["dataflow"],
USE_CASE_TAGS=tags["use_case_tags"],
README=readme_string,
@@ -443,7 +401,7 @@ def user_dataflows__official(
remote_executor = executors.MultiThreadingExecutor(max_tasks=100)
dr = (
driver.Builder()
.with_config(dict(is_official="False"))
.with_config(dict(is_dagworks="False"))
.enable_dynamic_execution(allow_experimental_mode=True)
.with_remote_executor(remote_executor) # We only need to specify remote executor
# The local executor just runs it synchronously
@@ -463,13 +421,13 @@ def user_dataflows__official(

dr = (
driver.Builder()
.with_config(dict(is_official="True"))
.with_config(dict(is_dagworks="True"))
.enable_dynamic_execution(allow_experimental_mode=True)
.with_remote_executor(remote_executor) # We only need to specify remote executor
# The local executor just runs it synchronously
.with_modules(compile_docs)
.build()
)
res = dr.execute(["user_dataflows"], inputs={"path": OFFICIAL_PATH})
res = dr.execute(["user_dataflows"], inputs={"path": DAGWORKS_PATH})

pprint.pprint(res)
File renamed without changes.
