Skip to content

Commit

Permalink
Adds initialization script
Browse files Browse the repository at this point in the history
This makes contribution easier + impmroves the README
  • Loading branch information
elijahbenizzy committed Nov 28, 2023
1 parent 7faaa62 commit f95afe7
Show file tree
Hide file tree
Showing 6 changed files with 245 additions and 18 deletions.
69 changes: 55 additions & 14 deletions contrib/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,7 @@ to this repository. We will review your dataflow and if it meets our standards w
package. To submit a pull request please use [this template](https://github.com/DAGWorks-Inc/hamilton/blob/main/.github/PULL_REQUEST_TEMPLATE/HAMILTON_CONTRIB_PR_TEMPLATE.md)
. To access it, create a new Pull Request, then hit the `preview` tab, and click the link to append `template=HAMILTON_CONTRIB_PR_TEMPLATE.md` to the URL.


#### Dataflow standards
We want to ensure that the dataflows in this package are of high quality and are easy to use. To that end,
we have a set of standards that we expect all dataflows to meet. If you have any questions, please reach out.
Expand All @@ -109,22 +110,62 @@ Standards:
- It must work.
- It must follow our standard structure as outlined below.

#### Getting started with development

To get started with development, you'll want to first fork the hamilton repository from the github UI.

Then, clone it locally and install the package in editable mode, ensuring you install any dependencies required for the initilization script
```bash
cd hamilton # Your fork
pip install -e "./contrib[contribute]" # Note that this package lives under the `contrib` folder
```

#### Checklist for new dataflows:
Do you have the following?
- [ ] Added a directory mapping to my github user name in the contrib/hamilton/contrib/user directory.
- [ ] If my author names contains hyphens I have replaced them with underscores.
- [ ] If my author name starts with a number, I have prefixed it with an underscore.
Next, you need to initialize your dataflow. This will create the necessary files and directories for you to get started.
```bash
init-dataflow -u <your_github_username> -n <name_of_dataflow>
```

This will do the following:

1. Create a package under `contrib/hamilton/contrib/user/<your_github_username>` with the appropriate files to describe you
- `author.md` -- this will describe you with links out to github/socials
- `__init__.py` -- this will be an empty file that allows you to import your dataflow
2. Create a package under `contrib/hamilton/contrib/user/<your_github_username>/<name_of_dataflow>` with the appropriate files to describe your dataflow:
- `README.md` to describe the dataflow with the standard headings
- `__init__.py` to contain the Hamilton code
- `requirements.txt` to contain the required packages outside of Hamilton
- `tags.json` to curate your dataflow
- `valid_configs.jsonl` to specify the valid configurations for it to be run
- `dag.png` to show one possible configuration of your dataflow
3. Add all the above files to git!

These are all required. You do not have to use the initialization script -- you can always copy the files over directly. That said, it is idempotent (it will fill out any missing files),
and will ensure that you have the correct structure.

#### Developing your dataflow

To get started, you'll want to do the following:

- [ ] Fill out your `__init__.py` with the appropriate code -- see [this issue](https://github.com/DAGWorks-Inc/hamilton/issues/559) if you want some inspiration for where to get started
- [ ] Fill out the sections of your `README.md` with the appropriate documentation -- see [this example](hamilton/contrib/dagworks/text_summarization/README.md)
- [ ] Fill out your `tags.json` with the appropriate tags -- see [this example](hamilton/contrib/dagworks/text_summarization/tags.json) for tags
- [ ] Fill out your `valid_configs.jsonl` with the appropriate configurations -- this is not necessary if you have no configurations that can change the shape of your DAG
- [ ] Generate an image for your DAG -- see the `__name__ == __main__` line in [this example](hamilton/contrib/dagworks/text_summarization/__init__.py) for a simple way to do so
- [ ] Push a branch back to your fork
- [ ] Open up a pull request to the main Hamilton repo!
- [ ] Commit the files we just added
- [ ] Create a PR
- [ ] Tag one of the maintainers [elijahbenizzy](https://github.com/elijahbenizzy), [skrawcz](https://github.com/skrawcz), or [zilto](https://github.com/zilto) for a review
- [ ] Ping us on [slack](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) if you don't hear back within a few days

#### Username Management

As usernames map to packages, we need to ensure that they are valid. To that end, we have a few rules:
- [ ] If your username contains hyphens, replace them with underscores.
- [ ] If your username starts with a number, prefix it with an underscore.
- [ ] If your author name is a python reserved keyword. Reach out to the maintainers for help.
- [ ] Added an author.md file under my username directory and is filled out.
- [ ] Added an __init__.py file under my username directory.
- [ ] Added a new folder for my dataflow under my username directory.
- [ ] Added a README.md file under my dataflow directory that follows the standard headings and is filled out.
- [ ] Added a __init__.py file under my dataflow directory that contains the Hamilton code.
- [ ] Added a requirements.txt under my dataflow directory that contains the required packages outside of Hamilton.
- [ ] Added tags.json under my dataflow directory to curate my dataflow.
- [ ] Added valid_configs.jsonl under my dataflow directory to specify the valid configurations.
- [ ] Added a dag.png that shows one possible configuration of my dataflow.

If the above apply, run the `init-dataflow` command with `-s` to specify a sanitized username.

## Got questions?
Join our [slack](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) community to chat/ask Qs/etc.
179 changes: 179 additions & 0 deletions contrib/hamilton/contribute.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,179 @@
import logging
import os
import shutil
from typing import List

import click
import git

from hamilton.log_setup import setup_logging

setup_logging(logging.INFO)

logger = logging.getLogger(__name__)


def _validate_package_name(name: str) -> str:
"""Validates that the username is a legitimate python variable"""
if not str.isidentifier(name):
raise ValueError(
f"Username {name} is not an importable package name!"
f" See instructions at the dataflow hub -- "
f"https://hub.dagworks.io/docs/#checklist-for-new-dataflows" # noqa E231
) # noqa E231
return name


def _get_base_git_path():
try:
repo = git.Repo(".", search_parent_directories=True)
repo_path = repo.git.rev_parse("--show-toplevel")
return repo_path
except git.InvalidGitRepositoryError:
return None
except git.NoSuchPathError:
return None


def _get_contrib_base_path(git_repo_path: str, namespace: str = "user"):
return os.path.join(git_repo_path, "contrib", "hamilton", "contrib", namespace)


def _get_base_template_dir(base_contrib_path: str):
return os.path.join(base_contrib_path, "example_dataflow_template")


def _create_username_dir_if_not_exists(
base_contrib_path: str, sanitized_username: str, username: str
) -> List[str]:
to_add = []
username_dir = os.path.join(base_contrib_path, sanitized_username)
if not os.path.exists(username_dir):
logger.info(
f"✅ Creating directory for {username} at {username_dir}, no such directory exists"
)
os.mkdir(os.path.join(base_contrib_path, sanitized_username))
else:
logger.info(f"Directory for {username} already exists at {username_dir}, no need to create")

to_add.append(username_dir)

init_py_location = os.path.join(username_dir, "__init__.py")
if not os.path.exists(init_py_location):
logger.info(
f"✅ Creating __init__.py for {username} at {init_py_location}, no such file exists"
)
with open(init_py_location, "w") as f:
f.write(f'"""{username}\'s dataflows"""\n')
else:
logger.info(
f"✅ __init__.py for {username} already exists at {init_py_location}, no need to create"
)

to_add.append(init_py_location)

base_template_dir = _get_base_template_dir(base_contrib_path)
author_md_file_path = os.path.join(username_dir, "author.md")
if not os.path.exists(author_md_file_path):
logger.info(
f"✅ Creating author.md for {username} at {author_md_file_path}, no such file exists"
)
with open(os.path.join(base_template_dir, "author.md"), "r") as f:
author_md = f.read()
with open(os.path.join(username_dir, "author.md"), "w") as f:
contents = author_md.format(github_username=username)
contents = (
contents.replace("---\n", "").replace("title: Example Template\n", "").strip()
) # a little hacky, but it'll do
f.write(contents)
return to_add


def _create_dataflow_dir_if_not_exists(
base_contrib_path: str, sanitized_username: str, dataflow_name: str
) -> List[str]:
to_add = []
dataflow_dir = os.path.join(base_contrib_path, sanitized_username, dataflow_name)
if not os.path.exists(dataflow_dir):
logger.info(
f"✅ Creating directory for {dataflow_name} at {dataflow_dir}, no such directory exists"
)
os.mkdir(dataflow_dir)
template_dir = os.path.join(_get_base_template_dir(base_contrib_path), "dataflow_template")
for file_ in [
"__init__.py",
"dag.png",
"README.md",
"requirements.txt",
"tags.json",
"valid_configs.jsonl",
]:
file_path = os.path.join(dataflow_dir, file_)
if not os.path.exists(file_path):
copy_from = os.path.join(template_dir, file_)
logger.info(
f"✅ Creating file {file_} for {sanitized_username} at {file_path} from {copy_from}"
)
shutil.copy(copy_from, file_path)
else:
logger.info(
f"✅ {file_} for {sanitized_username} already exists at {file_path}, no need to create"
)
to_add.append(file_path)
return to_add


def _git_add(files_to_add: List[str], git_repo_path: str):
repo = git.Repo(git_repo_path)
repo.index.add(files_to_add)
logger.info(f"Adding files {files_to_add} to git! Happy developing!")


@click.command()
@click.option("-u", "--username", required=True, help="Username to use for the dataflow")
@click.option(
"-s",
"--sanitized-username",
required=False,
help="Sanitized username to use for the dataflow -- we will use this for package names. "
"If not provided, we will use the same username as above.",
default=None,
)
@click.option("-n", "--dataflow-name", type=_validate_package_name, required=True)
@click.option(
"-p",
"--repo-path",
type=click.Path(exists=True),
default=_get_base_git_path(),
help="Path to the git repository to add the dataflow to. Defaults to the "
"git parent of the current directory",
)
@click.option("-g", "--no-git-add", is_flag=True, help="Don't add the files to git")
def initialize(
username: str, dataflow_name: str, sanitized_username: str, repo_path: str, no_git_add: bool
):
if repo_path is None:
raise ValueError(
"No git repository found. Please provide the path to the git repository using the "
"--repo-path flag or run from within your local hamilton clone"
)
base_contrib_path = _get_contrib_base_path(repo_path)
if sanitized_username is None:
try:
sanitized_username = _validate_package_name(username)
except ValueError as e:
raise ValueError(
f"Sanitized username not provided and username {username} is not a valid python "
f"package name. Please provide a valid python package name or a sanitized username "
f"using the --sanitized-username flag"
) from e
files_to_add = []
files_to_add.extend(
_create_username_dir_if_not_exists(base_contrib_path, sanitized_username, username)
)
files_to_add.extend(
_create_dataflow_dir_if_not_exists(base_contrib_path, sanitized_username, dataflow_name)
)

if not no_git_add:
_git_add(files_to_add, repo_path)
7 changes: 7 additions & 0 deletions contrib/setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -68,10 +68,17 @@ def load_requirements():
# adding this to slim the package down, since these dependencies are only used in certain contexts.
extras_require={
"visualization": ["sf-hamilton[visualization]"],
"contribute": ["click>8.0.0", "gitpython"],
},
# Relevant project URLs
project_urls={ # Optional
"Bug Reports": "https://github.com/dagworks-inc/hamilton/issues",
"Source": "https://github.com/dagworks-inc/hamilton/contrib",
},
# Useful scripts
entry_points={
"console_scripts": [
"init-dataflow = hamilton.contribute:initialize",
]
},
)
3 changes: 2 additions & 1 deletion hamilton/dataflows/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,8 @@
import urllib.error
import urllib.request
from types import ModuleType
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple, Type, Union
from typing import (Callable, Dict, List, NamedTuple, Optional, Tuple, Type,
Union)

from hamilton import driver, telemetry

Expand Down
1 change: 0 additions & 1 deletion hamilton/dataflows/template/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,6 @@
# Code to create an imaging showing on DAG workflow.
# run as a script to test Hamilton's execution
import __init__ as MODULE_NAME

from hamilton import base, driver

dr = driver.Driver(
Expand Down
4 changes: 2 additions & 2 deletions hamilton/dataflows/template/author.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
# {username}
# {github_username}

Hi I'm ...

# Github
https://github.com/{username}
https://github.com/{github_username}
# Linkedin

# X (Twitter)

0 comments on commit f95afe7

Please sign in to comment.