Architectural Design for recording Airflow DAG executions in Linea #459

Open
dorx opened this issue Dec 14, 2021 · 0 comments
Labels
architectural large takes longer than 1HD

Goal
Retrieve execution metadata for the DAGs that lineapy generates and that run in Airflow. We will use the metadata to support lineage queries, and potentially to control Airflow's runtime behavior, since that is possible (see https://www.astronomer.io/guides/airflow-queries).

Note: this is not a UX design for lineage query use cases.

Desiderata

  1. Not having to inject lineapy APIs into the generated DAGs.
  2. Collect comprehensive metadata.

Proposed solution
lineapy retrieves execution records and status for DAGs from Airflow's DB (schemata) and either

  1. add relevant records to lineapy's DB, which requires augmenting our DB schema, or
  2. use Airflow DB query results in memory only to support lineage use cases in lineapy

Assuming we will always have access to the Airflow DB, Option 2 avoids adding complexity to our DB schema and is easier to adapt to changes in Airflow. However, this is a major assumption, and lineapy's usefulness would be severely hampered if it were violated. Thus, I recommend Option 1.
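For comparison, here is a rough sketch of what Option 2 could look like: a read-only query whose results live only in memory. It is illustrative, not a proposed implementation. The in-memory SQLite database stands in for Airflow's metadata DB, the `dag_run` table is reduced to a handful of columns, and `DagRunRecord`, `fetch_dag_runs`, and `make_demo_airflow_db` are hypothetical names. It also assumes lineapy already knows the `dag_id`s of the DAGs it generated.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DagRunRecord:
    """In-memory view of one Airflow DAG run (hypothetical lineapy type)."""
    dag_id: str
    run_id: str
    state: str
    start_date: Optional[str]
    end_date: Optional[str]


def fetch_dag_runs(airflow_conn: sqlite3.Connection,
                   dag_ids: List[str]) -> List[DagRunRecord]:
    """Read run records for the given DAGs; nothing is written anywhere."""
    if not dag_ids:
        return []
    placeholders = ",".join("?" for _ in dag_ids)
    rows = airflow_conn.execute(
        "SELECT dag_id, run_id, state, start_date, end_date "
        f"FROM dag_run WHERE dag_id IN ({placeholders}) "
        "ORDER BY start_date",
        dag_ids,
    ).fetchall()
    return [DagRunRecord(*row) for row in rows]


def make_demo_airflow_db() -> sqlite3.Connection:
    """Build a stand-in for Airflow's DB with a simplified dag_run table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dag_run (dag_id TEXT, run_id TEXT, state TEXT, "
        "start_date TEXT, end_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO dag_run VALUES (?, ?, ?, ?, ?)",
        [
            ("lineapy_demo", "r1", "success",
             "2021-12-14T00:00:00", "2021-12-14T00:05:00"),
            ("lineapy_demo", "r2", "running", "2021-12-14T01:00:00", None),
            ("other_dag", "r1", "failed",
             "2021-12-14T00:30:00", "2021-12-14T00:31:00"),
        ],
    )
    return conn


if __name__ == "__main__":
    for run in fetch_dag_runs(make_demo_airflow_db(), ["lineapy_demo"]):
        print(run)
```

If the Airflow DB becomes unreachable, `fetch_dag_runs` has nothing to fall back on, which is the risk that motivates Option 1.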

Side note: in a previous discussion, we floated the idea of injecting lineapy APIs into the generated DAGs. The upside is that it gives us a lot of flexibility in terms of what we could log. The downside is that it requires users to install lineapy into their production Airflow environments, which violates the first desideratum.

TODO:

  1. design the basic table schemata for tracking executions in lineapy's DB
  2. write Airflow DB connectors to poll Airflow's DB for updates on DAGs generated by lineapy
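As a starting point for discussion, the two TODO items could be sketched roughly as follows. Everything here is an assumption rather than a settled design: the `airflow_dag_run` table and its columns are hypothetical, `sync_dag_runs` is a hypothetical helper, SQLite stands in for both databases, and Airflow's `dag_run` table is reduced to the columns used. Each poll re-reads the runs of lineapy-generated DAGs and upserts them, so a run that moves from `running` to `success` is updated in place rather than duplicated.

```python
import sqlite3
from typing import Iterable

# Hypothetical lineapy-side table; names and columns are illustrative only.
LINEA_SCHEMA = """
CREATE TABLE IF NOT EXISTS airflow_dag_run (
    dag_id     TEXT NOT NULL,
    run_id     TEXT NOT NULL,
    state      TEXT NOT NULL,
    start_date TEXT,
    end_date   TEXT,
    PRIMARY KEY (dag_id, run_id)
)
"""


def sync_dag_runs(airflow_conn: sqlite3.Connection,
                  linea_conn: sqlite3.Connection,
                  dag_ids: Iterable[str]) -> int:
    """One polling step: copy run records for the given DAGs into lineapy's DB.

    Returns the number of rows read from the Airflow side.
    """
    ids = list(dag_ids)
    if not ids:
        return 0
    placeholders = ",".join("?" for _ in ids)
    rows = airflow_conn.execute(
        "SELECT dag_id, run_id, state, start_date, end_date "
        f"FROM dag_run WHERE dag_id IN ({placeholders})",
        ids,
    ).fetchall()
    linea_conn.execute(LINEA_SCHEMA)
    # The (dag_id, run_id) primary key makes this an upsert: a row for an
    # already-known run is replaced with its latest state.
    linea_conn.executemany(
        "INSERT OR REPLACE INTO airflow_dag_run VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    linea_conn.commit()
    return len(rows)
```

A scheduler or background thread would call `sync_dag_runs` periodically. Re-reading every run on each poll is the simplest approach; a real connector would likely track a watermark to limit the query.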
@dorx dorx added architectural large takes longer than 1HD labels Dec 14, 2021