Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(airflow): OWID, FOPH and Colombia dag. #133

Merged
merged 30 commits into from
Oct 14, 2022

Conversation

luabida
Copy link
Contributor

@luabida luabida commented Sep 1, 2022

Start of Airflow implementation. It still needs improvements on the documentation and some bug fixes.

How to test the changes:

  • make docker-build ENV=dev
  • make docker-start ENV=dev
  • access http://127.0.0.1:8099/
  • login with admin, password admin
  • trigger manually the DAGs, you can find the workflow in Graph page.

NOTES:

  1. Sometimes the Scheduler won't boot with the Webserver, I'm doing it manually inside Airflow's container with the command airflow scheduler (ping @xmnlab)
  2. In FOPH Dag have some tasks that won't match the tables size, it needs to be fixed.
  3. foph_casesvaccpersons_d table in FOPH Dag fails.
  4. Tests need to be implemented.
  5. In the future, all modules will import directly from epigraphhub_py.

fix #36
fix #120
fix #137

Copy link
Member

@xmnlab xmnlab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on that @luabida .

I think we should move data_collection to epigraphhub_py earlier than later, so I vote to move that before we merge this PR.

the other comments is most about folders and environment variables

data_collection/config.py Outdated Show resolved Hide resolved
data_collection/foph/download_foph_data.py Outdated Show resolved Hide resolved
data_collection/owid/compare_data.py Outdated Show resolved Hide resolved
docker/airflow/Dockerfile Show resolved Hide resolved
docker/airflow/dags/colombia_dag.py Outdated Show resolved Hide resolved
docker/airflow/dags/foph_dag.py Outdated Show resolved Hide resolved
docker/airflow/dags/owid_dag.py Outdated Show resolved Hide resolved
docker/superset/Dockerfile Show resolved Hide resolved
@xmnlab
Copy link
Member

xmnlab commented Sep 1, 2022

@luabida some extra comments about the environment folders:

inside a library that is not good to have a folder path defined hard coded, for example, if the user want to use the library in a different operation system or with restricted access to the system's folder. so it is very important to have a way to parametrize that.

.env files are very nice for system or platforms .. but not good for libraries ... so for libraries we maybe would like to have something like a config file .. maybe something inside ${HOME}/.config/epigraphhub.cnf

let's discuss that today or tomorrow, so we can define that as soon as possible

@eduardocorrearaujo
Copy link
Contributor

thanks for working on that @luabida .

I think we should move data_collection to epigraphhub_py earlier than later, so I vote to move that before we merge this PR.

the other comments is most about folders and environment variables

I agree

data_collection/foph/foph_fetch.py Outdated Show resolved Hide resolved
data_collection/foph/foph_fetch.py Outdated Show resolved Hide resolved
data_collection/foph/foph_fetch.py Outdated Show resolved Hide resolved
task_id="start",
)

@task(task_id="load_into_db", retries=2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this link: https://dev.socrata.com/foundry/www.datos.gov.co/gt2j-8ykr/ , it's available the last updated date of this dataset. Unfortunately, I don't find a way to get it directly from the API. But, maybe we can create a web scraping to get it and use it as a way to decide if the code to update data should run or not. I can do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luabida, a final comment: Following issue #129, you could create a task that, after updating the table, will add (or update) the date of update of this table in the meta table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this first review, I'll patching the changes in this PR so we can track the TODOs for the Airflow.

@fccoelho
Copy link
Contributor

fccoelho commented Sep 2, 2022

When relocating the code to the epigraphhub_py library, please open a PR there so that we can check that we have reasonable test coverage. This code is critical and must be carefully tested.

@luabida
Copy link
Contributor Author

luabida commented Sep 2, 2022

ATTENTION! The last commit possibly BREAKS dependencies, it was the only way I could import epigraphhub_py library, which has a lot of dependency problems with EpiGraphHub.

conda/pip.txt Outdated Show resolved Hide resolved
docker/airflow/dags/owid_dag.py Outdated Show resolved Hide resolved
@luabida
Copy link
Contributor Author

luabida commented Sep 6, 2022

Currently using this branch to install the eph lib: https://github.com/luabida/epigraphhub_py/tree/data-collection
I'll start the documentation in next commits.
Some foph csv still need fix in the compare data step.

@luabida
Copy link
Contributor Author

luabida commented Sep 8, 2022

#ec337b5 commit is the beginning of airflow's documentations, more information will be included in next commits

@@ -0,0 +1,103 @@
# Airflow Overview
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to create an entry for this document on index.rst Perhaps named: "Creating Data Collection DAGS". Maybe it can replace this page, but first take a look at it and move any still relevant information to this page.

@luabida
Copy link
Contributor Author

luabida commented Sep 23, 2022

@xmnlab about #08e8f98: I had to configure the .yaml file by hand changing the host variable to the postgres container's IP address. Could it be a Docker network error?

@luabida
Copy link
Contributor Author

luabida commented Oct 11, 2022

about #4f4ee21:
I'm not quite sure how the DAGs that updates the tables should be tested nor if unity tests for tables existence should be added along with the tables Airflow is meant to update.

Only the dag that fix #137 remained.

@luabida luabida marked this pull request as ready for review October 11, 2022 19:14
@xmnlab
Copy link
Member

xmnlab commented Oct 11, 2022

about #4f4ee21: I'm not quite sure how the DAGs that updates the tables should be tested nor if unity tests for tables existence should be added along with the tables Airflow is meant to update.

Only the dag that fix #137 remained.

according to this: https://stackoverflow.com/a/59316781/4995866

I think that you would have two options:

with api call post, maybe you could run that outside of the airflow container, although, not sure about timeout or the response.

if you decide to use CLI, you can do something like:

  • add to Makefile:
CMD:=

.PHONY:docker-exec-command
docker-exec-command:
	$(DOCKER) exec ${SERVICE} ${CMD}

and call it with: make docker-exec-command CMD="airflow trigger_dag <dag_id>"

does it make sense?

@luabida
Copy link
Contributor Author

luabida commented Oct 11, 2022

It does make sense, thank you for helping on this subject. Another issue is the credentials on CI for the legacy Cron tests. Do you know why it receives permissions error?
https://github.com/thegraphnetwork/EpiGraphHub/actions/runs/3229208852/jobs/5286235535#step:9:114

@xmnlab
Copy link
Member

xmnlab commented Oct 11, 2022

not sure .. maybe you need to review the sql files to ensure that the user used to create the table has the permissions.
but probably it is safe to remove the cron tests step because it will be replaced by your DAGS

@fccoelho
Copy link
Contributor

about #4f4ee21: I'm not quite sure how the DAGs that updates the tables should be tested nor if unity tests for tables existence should be added along with the tables Airflow is meant to update.

Only the dag that fix #137 remained.

I think we need to be careful to avoid redundant testing: If a DAG uses code that is implemented in the epigraphhub library, that code should not be tested (or re-tested) from inside the DAG, but rather directly in the library where it is much easier. The only testing required from airflow are that it starts and finishes the DAGs correctly.

@fccoelho
Copy link
Contributor

It does make sense, thank you for helping on this subject. Another issue is the credentials on CI for the legacy Cron tests. Do you know why it receives permissions error? https://github.com/thegraphnetwork/EpiGraphHub/actions/runs/3229208852/jobs/5286235535#step:9:114

When I fixed the permissions for the epigraph user that does the insertions on the cron script, I remember that I had to grant permissions also at the schema level, not just for the table...

@fccoelho
Copy link
Contributor

fccoelho commented Oct 11, 2022

note that the error:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.InsufficientPrivilege) permission denied for schema switzerland
LINE 2: CREATE TABLE switzerland.foph_cases_d

is for schema switzerland, because you are trying to create a table, so you probably need something like this:

GRANT USAGE ON  SCHEMA <name_of the_schema> TO <user> ;
GRANT ALL ON ALL TABLES IN SCHEMA <name_of the_schema> TO <user> ;
GRANT ALL ON ALL FUNCTIONS IN SCHEMA <name_of the_schema> TO <user> ;
GRANT ALL ON ALL SEQUENCES IN SCHEMA <name_of the_schema> TO <user> ;

maybe something more...

@esloch
Copy link
Contributor

esloch commented Oct 12, 2022

Everyone, in the makefile we have a target make docker-dev-prepare-db. I didn't find it in any step of the CI.
How are sql scripts imported into docker-entrypoint-initdb.d in postgres container?

@xmnlab
Copy link
Member

xmnlab commented Oct 12, 2022

It should be done automatically by dockerfile/docker-compose. Is it not working?

@esloch
Copy link
Contributor

esloch commented Oct 12, 2022

It should be done automatically by dockerfile/docker-compose. Is it not working?

Thanks Ivan, initdb is up to date with a volume in the compose file:

      volumes:
        - ./postgresql/sql/dev/:/docker-entrypoint-initdb.d/

I will continue to debug on my localhost.

Copy link
Member

@xmnlab xmnlab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @luabida , I will take another later today

@luabida
Copy link
Contributor Author

luabida commented Oct 13, 2022

fix #36
fix #120
fix #137

@xmnlab
Copy link
Member

xmnlab commented Oct 14, 2022

fix #36 fix #120 fix #137

@luabida can you add this information directly to the first comment of this PR? otherwise github will not close these issues automatically

@luabida
Copy link
Contributor Author

luabida commented Oct 14, 2022

Done. Is it done to merge? From my side there is the other PR that needs rebasing from main after this merge

Copy link
Member

@xmnlab xmnlab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just added a few comments, other than that LGTM.
thanks @luabida !

Makefile Outdated
@@ -86,8 +86,8 @@ docker-dev-prepare-db:

.PHONY:docker-run-cron
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove the docker-run-cron target pls? it will not be used anymore

Makefile Outdated

.PHONY:docker-console
docker-console:
$(DOCKER) exec ${SERVICE} ${CONSOLE}
$(DOCKER) exec -it ====${SERVICE} ${CONSOLE}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this bunch of equal sign here?

@@ -78,7 +78,8 @@ COPY --chown=epigraphhub:epigraphhub docker/superset/entrypoint.sh /opt/entrypoi
RUN chmod +x /opt/entrypoint.sh \
&& echo "source /opt/entrypoint.sh" > ~/.bashrc \
&& sudo mkdir -p /opt/data/superset/ \
&& sudo chown -R epigraphhub:epigraphhub /opt/data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason for removing this one?

@luabida
Copy link
Contributor Author

luabida commented Oct 14, 2022

Another question is the user creation for Airflow, maybe we will need a bit of caution on how is it rn

Copy link
Member

@xmnlab xmnlab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thanks @luabida !

@xmnlab xmnlab merged commit e854993 into thegraphnetwork:main Oct 14, 2022
@xmnlab
Copy link
Member

xmnlab commented Oct 14, 2022

Another question is the user creation for Airflow, maybe we will need a bit of caution on how is it rn

could explain a bit more about that pls?

@xmnlab
Copy link
Member

xmnlab commented Oct 14, 2022

btw, the database issue we are planning to fix with #142

@luabida
Copy link
Contributor Author

luabida commented Oct 14, 2022

could explain a bit more about that pls?

I wasn't sure how would be done the Users on Airflow webserver. The data collection DAGs are set with the owner: epigraphhub. It doesn't stop the Admin to run the DAGs, but if some dev get access from a user that is not Admin, it has to be 'epigraphhub', otherwise it won't show any DAG

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants