Python API for creating DAG? #5646
-
Thanks, @skshetry! Looks like a lot of design decisions to be made (maybe a good enhancement proposal candidate?). Before we can decide on the design, can we focus on what issues a Python API solves:
-
How to provide more flexible DAG definitions is a broad discussion for which a Python API is one possible solution. I'm not sure whether to put this into a separate discussion or maybe rename this one, but I'll start with a new thread here. For example, see #3633 (comment), which proposes adding more advanced filters and custom filters to A Python API would provide more flexibility, but it makes DAGs less readable and might make editing and debugging more difficult.
-
This library was recently shared on Discord (https://discord.com/channels/485586884165107732/485586884165107734/869509789791715338):
-
I like the idea of being able to execute/evaluate normal Python code in dvc.yaml (e.g. dvc.py) - more thoughts/details in #6512 (comment)
-
I just wanted to jump in and suggest why a python API to DAGs/pipelines would be useful for me.
-
I'm currently working on a project and trying out dvc. I've used For example, at the moment I have this in my Python script:

    import typer
    from pathlib import Path

    def create_dataset(cuad_path: Path, citrain_path: Path, output_path: Path):
        # do some cool stuff
        ...

    if __name__ == '__main__':
        typer.run(create_dataset)

and then I have to create the declaration in the .yaml file:

    create-contracts-dataset:
      cmd: python3 src/create_contract_dataset.py some/path some/other/path output/path
      deps:
        - some/path
        - some/other/path
      outs:
        - output/path

If there were a way to explicitly declare the data type in Python, I could write something like:

    from pathlib import Path
    import typer
    from shiny_cool_lib import InputPath, OutputPath

    def create_dataset(
        cuad_path: InputPath = Path('some/path'),
        citrain_path: InputPath = Path('some/other/path'),
        output_path: OutputPath = Path('output/path'),
    ):
        # do some cool stuff
        ...

    if __name__ == '__main__':
        typer.run(create_dataset)

Behind the scenes, the
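To make the idea concrete, here is a minimal sketch of how annotations like these could be read to produce a stage definition without running the function. Everything in it is an assumption for illustration: `InputPath`, `OutputPath`, and `stage_from_function` are hypothetical names, not part of dvc, typer, or any existing library.

```python
import inspect
from pathlib import Path


class InputPath:
    """Hypothetical marker annotation: the stage reads this path."""


class OutputPath:
    """Hypothetical marker annotation: the stage writes this path."""


def stage_from_function(name, func, script):
    """Build a dvc.yaml-style stage mapping from a function signature.

    Only the signature is inspected; the function itself is never called.
    """
    deps, outs, args = [], [], []
    for param in inspect.signature(func).parameters.values():
        if param.default is inspect.Parameter.empty:
            continue  # no default path, so nothing can be declared statically
        if param.annotation is InputPath:
            target = deps
        elif param.annotation is OutputPath:
            target = outs
        else:
            continue  # unmarked parameters are ignored in this sketch
        path = Path(param.default).as_posix()
        target.append(path)
        args.append(path)
    cmd = " ".join(["python3", script, *args])
    return {name: {"cmd": cmd, "deps": deps, "outs": outs}}


def create_dataset(
    cuad_path: InputPath = Path("some/path"),
    citrain_path: InputPath = Path("some/other/path"),
    output_path: OutputPath = Path("output/path"),
):
    ...  # do some cool stuff


stage = stage_from_function(
    "create-contracts-dataset", create_dataset, "src/create_contract_dataset.py"
)
```

The resulting mapping has the same shape as the hand-written stage above, so the paths would only need to be declared once, in the function signature.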
-
I thought we had an issue #5178, but looks like it is more for displaying dag rather than creating it.
With the dvc.yaml workflow file, the declared definitions are static in nature, while the runtime is isolated in cmd. This allows us to provide other data management features such as checkout, pull, push etc. without executing.
By providing a Python API, I can think of 2 kinds of APIs:
dvc.yaml file, providing an API that statically explains the stage or workflow. Examples: doit, nox
With this kind of API, they are not that different from dvc.yaml file, in that they just explain stage definition through python api. But, as they are static in nature, we'll still be able to provide other data management commands.
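As a rough illustration of the static kind of API, the sketch below declares a stage in Python and renders it to a dvc.yaml-shaped mapping without running any command. The `Stage` class and `to_pipeline` helper are hypothetical names invented for this example, not an existing DVC API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical static stage declaration (not a DVC API)."""

    name: str
    cmd: str
    deps: tuple = ()
    outs: tuple = ()

    def to_dict(self):
        # Same keys a stage entry uses in dvc.yaml.
        return {"cmd": self.cmd, "deps": list(self.deps), "outs": list(self.outs)}


def to_pipeline(stages):
    """Collect declarations under a top-level 'stages' key; no cmd is executed."""
    return {"stages": {stage.name: stage.to_dict() for stage in stages}}


pipeline = to_pipeline(
    [
        Stage(
            name="create-contracts-dataset",
            cmd="python3 src/create_contract_dataset.py some/path some/other/path output/path",
            deps=("some/path", "some/other/path"),
            outs=("output/path",),
        )
    ]
)
```

Because the declaration is plain data, tools could read `pipeline` the same way they read dvc.yaml today, which is why checkout, pull, and push would remain possible.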
What about dvc.lock though? Maybe run-cache should be used instead here, as having a Python API and a lock file (that needs to be added to git) does not make much sense.
How will viewer and other similar tools understand the DAG?
So, regarding api, we need to find a way to statically determine metadata for dvc without executing it, while providing scalability and flexibility when executing it dynamically. But, this way it seems we'll have to clearly separate what should be determined statically and what should be dynamic data, which might not be as clear in the future.
Same problem here though, regarding dvc.lock. As dvc's metadata are static, we'll be able to provide other data management features as well.
How should function composition work? In dvc.yaml file, there is no support for map/reduce/join/accumulate, but with Python API, it seems an important feature. Another important feature would be a dependency between stages (functions with python api). How to detect those dependencies?
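One possible answer to the dependency question, sketched under assumptions: if each stage declares its deps and outs statically, edges can be inferred by matching one stage's outs against another stage's deps. The stage names and paths below are made up, and `stage_graph` is a hypothetical helper, not a description of DVC internals.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def stage_graph(stages):
    """Map each stage name to the set of stages that produce its deps."""
    producer = {
        out: name for name, stage in stages.items() for out in stage.get("outs", [])
    }
    return {
        name: {producer[dep] for dep in stage.get("deps", []) if dep in producer}
        for name, stage in stages.items()
    }


# Hypothetical stages, declared in arbitrary order.
stages = {
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
}

graph = stage_graph(stages)
order = list(TopologicalSorter(graph).static_order())
# order == ["prepare", "train", "evaluate"]
```

This only covers exact path matches between static declarations; map/reduce-style composition, where the set of outputs is not known until runtime, would need something more than this.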