Various hub/contrib improvements (#528)
At a high level this commit does:
 - moves things around, e.g. official -> DAGWorks.
 - fixes copy and READMEs
 - adds stub for leaderboard
 - fixes telemetry/contrib module usage when sf-hamilton-contrib is not installed



See squashed commits for details:

* Moves Official to DAGWorks for contrib/hub

Because that's who is maintaining it.

I also tacked on refactoring the Python hub docs
code to use Jinja2 templates, which makes
conditional logic easier to maintain.
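
For illustration, a minimal sketch of the kind of conditional
template this enables. The template source below is hypothetical --
the real one lives in `templates/driver_builder.py.jinja2` and takes
the `use_executor`, `dynamic_import`, `is_user`, `MODULE_NAME`, and
`USER` arguments passed to `render()` in the diff below:

```python
import jinja2

# Hypothetical mini-version of the driver-builder template: one template
# with conditionals replaces the four near-duplicate Python string
# templates that existed before this commit.
template = jinja2.Template(
    "{% if dynamic_import -%}"
    '{{ MODULE_NAME }} = dataflow.import_module("{{ MODULE_NAME }}"'
    '{% if is_user %}, "{{ USER }}"{% endif %})'
    "{%- else -%}"
    "from hamilton.contrib."
    "{% if is_user %}user.{{ USER }}{% else %}dagworks{% endif %}"
    " import {{ MODULE_NAME }}"
    "{%- endif %}"
)
print(template.render(dynamic_import=False, is_user=True,
                      MODULE_NAME="my_dataflow", USER="some_user"))
# -> from hamilton.contrib.user.some_user import my_dataflow
```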

* Adds "view by tags" link in hub

So that people can see an index.

This is a quick way to get to that screen
without having to customize Docusaurus.

* Fixes some README wording to drop `==` install

For the hub we don't want to have to bump the version
in the docs on every release, so we drop the pinned version
for now (`pip install sf-hamilton-contrib --upgrade` instead of
`pip install sf-hamilton-contrib==0.0.1rc1`). We could build
something that knows the current version, but that's more
effort than it's worth at the moment.

* Adds link to copy docs for hub

* Adds duplicate contrib/__init__ for telemetry capture

If someone does not have sf-hamilton-contrib installed, they could
download a dataflow but not be able to import it, because dataflows
depend on the `hamilton.contrib` submodule existing. This sticks
a placeholder in to enable things to work without installing
sf-hamilton-contrib. The assumption is that it will get clobbered
by an installed `sf-hamilton-contrib`, and everything should
continue to work.

This also updates the telemetry code to handle this case, and to parse
the relevant details from the file path. I have tested that things work
for old and new paths.

What I haven't extensively tested is the installation order.
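
To make the mechanism concrete, a hedged sketch (the file contents and
helper below are illustrative, not the exact code in this commit):

```python
# Hypothetical placeholder hamilton/contrib/__init__.py shipped with core
# Hamilton: its only job is to exist so that a downloaded dataflow's
# `hamilton.contrib.user.<USER>.<DATAFLOW>` import resolves. Installing
# sf-hamilton-contrib overwrites it with the real package.
import os


def _parse_dataflow_details(file_path: str) -> dict:
    """Illustrative telemetry helper: pull user/dataflow names from a path.

    Handles both a downloaded path like
    ~/.hamilton/dataflows/user/<USER>/<DATAFLOW>/__init__.py and an
    installed one like .../hamilton/contrib/user/<USER>/<DATAFLOW>/...
    """
    parts = os.path.normpath(file_path).split(os.sep)
    if "user" in parts:
        idx = parts.index("user")
        return {"user": parts[idx + 1], "dataflow": parts[idx + 2]}
    # no user segment -> a DAGWorks-maintained dataflow
    return {"user": None, "dataflow": parts[-2]}
```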

* Moves polars import to try/except

For some reason this was breaking for me,
and this fixed it.
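
The shape of the fix, roughly (a sketch of the pattern, not the exact diff):

```python
# Guarded import: environments without polars no longer fail at import
# time; polars-specific functionality is simply unavailable instead.
try:
    import polars as pl
except ImportError:
    pl = None  # optional dependency
```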

* Fixing isort issues

* Cleaning up some typos

* Revert "Fixing isort issues"

This reverts commit 8288b7f.

* Adds stub for leaderboard page for hub

This is pretty basic, but it gets something up.

* Fixing isort
skrawcz authored Nov 11, 2023
1 parent 8463800 commit d9f7a44
Showing 19 changed files with 258 additions and 140 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/docusaurus-gh-pages.yml
@@ -44,7 +44,9 @@ jobs:
pip install -e .
- name: Compile code to create pages
working-directory: contrib/docs
run: python compile_docs.py
run: |
pip install jinja2
python compile_docs.py
- name: Set up Node.js
uses: actions/setup-node@v3
with:
14 changes: 9 additions & 5 deletions contrib/README.md
@@ -26,11 +26,11 @@ pip install sf-hamilton-contrib --upgrade
Once installed, you can import the dataflows as follows.

Things you need to know:
1. Whether it's a user or official dataflow. If user, what the name of the user is.
1. Whether it's a user or official DAGWorks supported dataflow. If user, what the name of the user is.
2. The name of the dataflow.
```python
from hamilton import driver
# from hamilton.contrib.official import NAME_OF_DATAFLOW
# from hamilton.contrib.dagworks import NAME_OF_DATAFLOW
from hamilton.contrib.user.NAME_OF_USER import NAME_OF_DATAFLOW

dr = (
@@ -45,6 +45,7 @@ result = dr.execute(
inputs={...} # pass in inputs as appropriate
)
```
To find an example [go to the hub](https://hub.dagworks.io/docs/).

#### Dynamic installation
Here we dynamically download the dataflow from the internet and execute it. This is useful for quickly
@@ -54,7 +55,7 @@ iterating in a notebook and pulling in just the dataflow you need.
from hamilton import dataflow, driver

# downloads into ~/.hamilton/dataflows and loads the module -- WARNING: ensure you know what code you're importing!
# NAME_OF_DATAFLOW = dataflow.import_module("NAME_OF_DATAFLOW") # if using official dataflow
# NAME_OF_DATAFLOW = dataflow.import_module("NAME_OF_DATAFLOW") # if using official DAGWorks dataflow
NAME_OF_DATAFLOW = dataflow.import_module("NAME_OF_DATAFLOW", "NAME_OF_USER")
dr = (
driver.Builder()
@@ -68,11 +69,12 @@ result = dr.execute(
inputs={...} # pass in inputs as appropriate
)
```
To find an example [go to the hub](https://hub.dagworks.io/docs/).

#### Modification
Getting started is one thing, but then modifying to your needs is another. So we have a prescribed
flow to enable you to take a dataflow, and copy the code to a place of your choosing. This allows
you to easily modify the dataflow as you see fit. You will need to adhere to any licenses the code may come with.
The default, if not specified, is the "BSD-3 clear clause".
you to easily modify the dataflow as you see fit.

Run this in a notebook or python script to copy the dataflow to a directory of your choosing.
```python
@@ -85,6 +87,8 @@ dataflow.copy(NAME_OF_DATAFLOW, destination_path="PATH_TO_DIRECTORY")
from hamilton.contrib.user.NAME_OF_USER import NAME_OF_DATAFLOW
dataflow.copy(NAME_OF_DATAFLOW, destination_path="PATH_TO_DIRECTORY")
```
You can then modify/import the code as you see fit. See [copy()](https://hamilton.dagworks.io/en/latest/reference/dataflows/copy/)
for more details.


### How to contribute
166 changes: 62 additions & 104 deletions contrib/docs/compile_docs.py
@@ -15,15 +15,17 @@
import shutil
import subprocess

import jinja2

from hamilton.function_modifiers import config
from hamilton.htypes import Collect, Parallelizable

DATAFLOW_FOLDER = ".."
USER_PATH = DATAFLOW_FOLDER + "/hamilton/contrib/user"
OFFICIAL_PATH = DATAFLOW_FOLDER + "/hamilton/contrib/official"
DAGWORKS_PATH = DATAFLOW_FOLDER + "/hamilton/contrib/dagworks"


@config.when(is_official="False")
@config.when(is_dagworks="False")
def user__usr(path: str) -> Parallelizable[dict]:
"""Find all users in the contrib/user folder."""
for _user in os.listdir(path):
@@ -36,10 +38,10 @@ def user__usr(path: str) -> Parallelizable[dict]:
yield {"user": _user, "path": os.path.join(path, _user)}


@config.when(is_official="True")
def user__official(path: str) -> Parallelizable[dict]:
"""Find all users in the contrib/official folder."""
yield {"user": "::OFFICIAL::", "path": path}
@config.when(is_dagworks="True")
def user__dagworks(path: str) -> Parallelizable[dict]:
"""Find all users in the contrib/dagworks folder."""
yield {"user": "::DAGWORKS::", "path": path}


def dataflows(user: dict) -> list[dict]:
@@ -127,71 +129,8 @@ def dataflows_with_everything(


# TEMPLATES!
python_user_dataflow_template = """from hamilton import dataflow, driver
# downloads into ~/.hamilton/dataflows and loads the module -- WARNING: ensure you know what code you're importing!
{MODULE_NAME} = dataflow.import_module("{MODULE_NAME}", "{USER}")
dr = (
driver.Builder()
.with_config({{}}) # replace with configuration as appropriate
.with_modules({MODULE_NAME})
.build()
)
# execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
[{MODULE_NAME}.CHANGE_ME, ...], # this specifies what you want back
inputs={{...}} # pass in inputs as appropriate
)
"""

python_official_dataflow_template = """from hamilton import dataflow, driver
# downloads into ~/.hamilton/dataflows and loads the module
{MODULE_NAME} = dataflow.import_module("{MODULE_NAME}")
dr = (
driver.Builder()
.with_config({{}}) # replace with configuration as appropriate
.with_modules({MODULE_NAME})
.build()
)
# execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
[{MODULE_NAME}.CHANGE_ME, ...], # this specifies what you want back
inputs={{...}} # pass in inputs as appropriate
)
"""

python_user_import_template = """# pip install sf-hamilton-contrib==0.0.1rc1
from hamilton import driver
from hamilton.contrib.user.{USER} import {MODULE_NAME}
dr = (
driver.Builder()
.with_config({{}}) # replace with configuration as appropriate
.with_modules({MODULE_NAME})
.build()
)
# execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
[{MODULE_NAME}..., ...], # this specifies what you want back
inputs={{...}} # pass in inputs as appropriate
)
"""

python_official_import_template = """# pip install sf-hamilton-contrib==0.0.1rc1
from hamilton import driver
from hamilton.contrib.official import {MODULE_NAME}
dr = (
driver.Builder()
.with_config({{}}) # replace with configuration as appropriate
.with_modules({MODULE_NAME})
.build()
)
# execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
[{MODULE_NAME}..., ...], # this specifies what you want back
inputs={{...}} # pass in inputs as appropriate
)
"""
template_env = jinja2.Environment(loader=jinja2.FileSystemLoader("templates/"))
builder_template = template_env.get_template("driver_builder.py.jinja2")

mdx_template = """---
id: {USER}-{DATAFLOW_NAME}
@@ -214,7 +153,7 @@ def dataflows_with_everything(
### Use published library version
```bash
pip install sf-hamilton-contrib==0.0.1rc1 # make sure you have the latest
pip install sf-hamilton-contrib --upgrade # make sure you have the latest
```
import example2 from '!!raw-loader!./example2.py';
@@ -239,8 +178,8 @@ def dataflows_with_everything(
"""

mdx_official_template = """---
id: OFFICIAL-{DATAFLOW_NAME}
mdx_dagworks_template = """---
id: DAGWorks-{DATAFLOW_NAME}
title: {DATAFLOW_NAME}
tags: {USE_CASE_TAGS}
---
Expand All @@ -260,7 +199,7 @@ def dataflows_with_everything(
### Use published library version
```bash
pip install sf-hamilton-contrib==0.0.1rc1 # make sure you have the latest
pip install sf-hamilton-contrib --upgrade # make sure you have the latest
```
import example2 from '!!raw-loader!./example2.py';
Expand All @@ -286,7 +225,7 @@ def dataflows_with_everything(
# TODO: edit/adjust links to docs, etc.


@config.when(is_official="False")
@config.when(is_dagworks="False")
def user_dataflows__user(dataflows_with_everything: Collect[list[dict]]) -> dict[str, list[dict]]:
"""Big function that creates the docs for a user."""
result = {}
@@ -314,17 +253,31 @@ def user_dataflows__user(dataflows_with_everything: Collect[list[dict]]) -> dict
]:
continue
shutil.copyfile(os.path.join(single_df["path"], file), os.path.join(df_path, file))

# get tags
with open(os.path.join(single_df["path"], "tags.json"), "r") as f:
tags = json.load(f)
# checks for driver related tags
uses_executor = tags.get("driver_tags", {}).get("executor", None)
# create python file
with open(os.path.join(df_path, "example1.py"), "w") as f:
f.write(
python_user_dataflow_template.format(
MODULE_NAME=single_df["dataflow"], USER=_user_name
builder_template.render(
use_executor=uses_executor,
dynamic_import=True,
is_user=True,
MODULE_NAME=single_df["dataflow"],
USER=_user_name,
)
)
with open(os.path.join(df_path, "example2.py"), "w") as f:
f.write(
python_user_import_template.format(
MODULE_NAME=single_df["dataflow"], USER=_user_name
builder_template.render(
use_executor=uses_executor,
dynamic_import=False,
is_user=True,
MODULE_NAME=single_df["dataflow"],
USER=_user_name,
)
)
# create MDX file
@@ -333,9 +286,6 @@ def user_dataflows__user(dataflows_with_everything: Collect[list[dict]]) -> dict
readme_string = ""
for line in readme_lines:
readme_string += line.replace("#", "##", 1)
# get tags
with open(os.path.join(single_df["path"], "tags.json"), "r") as f:
tags = json.load(f)

with open(os.path.join(df_path, "README.mdx"), "w") as f:
f.write(
@@ -362,28 +312,28 @@ def _create_commit_file(df_path, single_df):
f.write(f"[commit::{commit}][ts::{ts}]\n")


@config.when(is_official="True")
def user_dataflows__official(
@config.when(is_dagworks="True")
def user_dataflows__dagworks(
dataflows_with_everything: Collect[list[dict]],
) -> dict[str, list[dict]]:
"""Big function that creates the docs for official dataflow."""
"""Big function that creates the docs for dagworks dataflow."""
result = {}
for _official_dataflows in dataflows_with_everything:
if len(_official_dataflows) < 1:
for _dagworks_dataflows in dataflows_with_everything:
if len(_dagworks_dataflows) < 1:
continue
_user_name = _official_dataflows[0]["user"]
result[_user_name] = _official_dataflows
_user_name = _dagworks_dataflows[0]["user"]
result[_user_name] = _dagworks_dataflows
# make the folder
official_path = os.path.join("docs", "Official")
os.makedirs(official_path, exist_ok=True)
dagworks_path = os.path.join("docs", "DAGWorks")
os.makedirs(dagworks_path, exist_ok=True)
# copy the author.md file
shutil.copyfile(
_official_dataflows[0]["author_path"], os.path.join(official_path, "index.mdx")
_dagworks_dataflows[0]["author_path"], os.path.join(dagworks_path, "index.mdx")
)
# make all dataflow folders
for single_df in _official_dataflows:
for single_df in _dagworks_dataflows:
# make the folder
df_path = os.path.join(official_path, single_df["dataflow"])
df_path = os.path.join(dagworks_path, single_df["dataflow"])
os.makedirs(df_path, exist_ok=True)
# copy the files
for file in os.listdir(single_df["path"]):
@@ -396,16 +346,27 @@ def user_dataflows__official(
]:
continue
shutil.copyfile(os.path.join(single_df["path"], file), os.path.join(df_path, file))
# get tags
with open(os.path.join(single_df["path"], "tags.json"), "r") as f:
tags = json.load(f)
# checks for driver related tags
uses_executor = tags.get("driver_tags", {}).get("executor", None)
# create python file
with open(os.path.join(df_path, "example1.py"), "w") as f:
f.write(
python_official_dataflow_template.format(
builder_template.render(
use_executor=uses_executor,
dynamic_import=True,
is_user=False,
MODULE_NAME=single_df["dataflow"],
)
)
with open(os.path.join(df_path, "example2.py"), "w") as f:
f.write(
python_official_import_template.format(
builder_template.render(
use_executor=uses_executor,
dynamic_import=False,
is_user=False,
MODULE_NAME=single_df["dataflow"],
)
)
@@ -415,13 +376,10 @@ def user_dataflows__official(
readme_string = ""
for line in readme_lines:
readme_string += line.replace("#", "##", 1)
# get tags
with open(os.path.join(single_df["path"], "tags.json"), "r") as f:
tags = json.load(f)

with open(os.path.join(df_path, "README.mdx"), "w") as f:
f.write(
mdx_official_template.format(
mdx_dagworks_template.format(
DATAFLOW_NAME=single_df["dataflow"],
USE_CASE_TAGS=tags["use_case_tags"],
README=readme_string,
@@ -443,7 +401,7 @@ def user_dataflows__official(
remote_executor = executors.MultiThreadingExecutor(max_tasks=100)
dr = (
driver.Builder()
.with_config(dict(is_official="False"))
.with_config(dict(is_dagworks="False"))
.enable_dynamic_execution(allow_experimental_mode=True)
.with_remote_executor(remote_executor) # We only need to specify remote executor
# The local executor just runs it synchronously
@@ -463,13 +421,13 @@ def user_dataflows__official(

dr = (
driver.Builder()
.with_config(dict(is_official="True"))
.with_config(dict(is_dagworks="True"))
.enable_dynamic_execution(allow_experimental_mode=True)
.with_remote_executor(remote_executor) # We only need to specify remote executor
# The local executor just runs it synchronously
.with_modules(compile_docs)
.build()
)
res = dr.execute(["user_dataflows"], inputs={"path": OFFICIAL_PATH})
res = dr.execute(["user_dataflows"], inputs={"path": DAGWORKS_PATH})

pprint.pprint(res)
File renamed without changes.
