This is a repo for understanding how orchestration tools work. I will evaluate Luigi, Prefect, and Dagster. Airflow was not evaluated because its learning curve was too high, and the complexity of the project didn't require that degree of customization and complexity in the orchestration tool.
The demo was done with Prefect, as it outperforms Luigi in the capabilities we needed. The tasks performed are:
- Extract JSON data from a URL
- Process that JSON data
- Save the processed data in a CSV file
- Send me an email with a "completed" or "failed" message using a Prefect Block
Everything was executed and monitored using Prefect Cloud and its interface.
In this document we want to understand which is the best tool for our requirements. In order to do that, we did two things:
- Asked several questions about the components and how the workflow works
- Created a rubric to measure the alignment between our requirements and the capabilities of the tool
Questions
- What is it?
- What is its main purpose?
- What are its main components?
- How is it executed?
- Does it have a visualization tool that enables us to monitor the pipeline (ideally with a task graph)?
- Does it save records of executions?
- How is it triggered?
- How are reruns handled?
- Does it work in a local environment?
- How is it deployed?
- Where is it deployed?
- Is it scalable?
- Does it have documentation?
Rubric
- Executes tasks where the output of one task can be the input of another
- Able to work in a local environment for development
- Has a simple and agnostic way to set up deployment
- Has a built-in way of tracking executions
- Has a visualization tool that enables us to monitor the pipeline (ideally with a task graph)
- Has a scheduling function I can configure to program when to run
- Has event-based scheduling configuration
- Has an automatic re-run function I can control with configuration rules
- Can re-run a process based on retry rules I configure
- Can re-run a process and execute only the tasks that failed
- Has good documentation
- Easy to learn
Here are the answers for Luigi:
- What is it? Luigi is a Python package for building and managing complex data pipelines.
- Main purpose? It was created to handle long-running batch jobs and ensure that tasks are executed in the correct order.
- Main components? The main components are Tasks, Targets (outputs), and Workers.
- How is it executed? Luigi tasks are executed through a command-line interface using Python scripts.
- Visualization tool? Yes, Luigi provides a web-based UI to monitor task pipelines.
- Saves records of executions? Yes, Luigi saves the status of task executions.
- How is it triggered? Tasks are triggered by dependency resolution and scheduled execution. Luigi does not have a built-in, fully featured scheduling system like Prefect; it relies on external schedulers, such as cron, to run workflows at specific intervals. Luigi handles task dependencies and execution order, but it needs something like cron to decide when a pipeline should begin.
- How are reruns handled? Luigi does not have an automatic re-run function with configurable rules. If a task fails, it must be rerun manually or with an external retry mechanism (such as cron jobs or another orchestration tool). Luigi can retry tasks marked as failed when they are manually triggered, but it lacks the advanced automatic-retry rules found in more modern orchestrators such as Prefect or Airflow.
- Works locally? Yes, it can be run locally.
- How is it deployed? Luigi can be deployed on local machines, clusters, or in the cloud, but it requires manual configuration.
- Where is it deployed? On local servers or cloud-based environments.
- Scalable? Yes, but not without effort. Luigi scales across distributed systems, but it may require additional setup.
- Has documentation? Partially. Luigi has comprehensive documentation, but for some reason I could not navigate directly to the details of each function; you can only reach them by typing the function name or a keyword into the search bar.
In Luigi, the components interact as follows:
- Tasks: The core units of work. Each task represents a single job (e.g., loading data).
- Dependencies: Tasks can have dependencies on other tasks, meaning they can only run once the required task completes.
- Workers: They manage task execution, ensuring the dependencies are resolved in order.
- Targets: Tasks generate output, referred to as a Target, which serves as input for dependent tasks.
Here are the answers for Prefect:
- What is it? Prefect is a workflow orchestration tool that simplifies the execution and monitoring of data pipelines.
- Main purpose? It was created to automate, orchestrate, and monitor workflows, ensuring resiliency, scalability, and ease of use in data pipelines.
- Main components? Key components include Flows, Tasks, Agents, Workers, and Projects for defining, executing, and monitoring workflows.
- How is it executed? Flows are executed using Prefect Agents or Workers, which run tasks in parallel.
- Visualization tool? Yes, it includes a dashboard with a task graph for pipeline monitoring.
- Saves records? Yes, Prefect records and logs every workflow execution.
- How is it triggered? Triggers can be event-driven, schedule-based, or manual.
- How are reruns handled? Prefect has built-in retry support: tasks and flows accept retry settings (number of retries and delay between attempts), and failed flow runs can also be retried from the Prefect Cloud UI.
- Works locally? Yes, Prefect can run in local environments.
- How is it deployed? Prefect can be deployed using Prefect Cloud, Docker, Kubernetes, or a local server.
- Scalable? Yes, it scales horizontally across distributed systems.
- Has documentation? Yes, comprehensive documentation is available on Prefect's website.
In Prefect, the components interact as follows:
- Flows: These define the overall workflow and organize tasks.
- Tasks: Tasks represent individual units of work and are executed within a flow.
- Agents: Agents are responsible for deploying and running flows on specific infrastructure environments.
- Workers: Workers distribute the task execution, ensuring they run in parallel and on distributed systems.
- Projects: Projects group flows together, helping organize large-scale workflows.
Official docs
- Prefect: https://docs.prefect.io/3.0/get-started/index / https://www.prefect.io/blog#prefect-product
- Luigi: https://luigi.readthedocs.io/en/stable/index.html
- Dagster: https://docs.dagster.io/getting-started
Articles
- https://medium.com/big-data-processing/getting-started-with-luigi-what-why-how-f8e639a1f2a5
- https://medium.com/@nitaionutandrei/workflow-orchestrators-airflow-vs-prefect-vs-dagster-0392b00dff30
- https://medium.com/dev-genius/orchestrate-modern-data-stack-with-dagster-0b6cf4d2d291
- https://dagster.io/vs/dagster-vs-prefect