-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(airflow): OWID, FOPH and Colombia dag. #133
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on that @luabida .
I think we should move data_collection to epigraphhub_py earlier than later, so I vote to move that before we merge this PR.
the other comments is most about folders and environment variables
@luabida some extra comments about the environment folders: inside a library that is not good to have a folder path defined hard coded, for example, if the user want to use the library in a different operation system or with restricted access to the system's folder. so it is very important to have a way to parametrize that. .env files are very nice for system or platforms .. but not good for libraries ... so for libraries we maybe would like to have something like a config file .. maybe something inside let's discuss that today or tomorrow, so we can define that as soon as possible |
I agree |
task_id="start", | ||
) | ||
|
||
@task(task_id="load_into_db", retries=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this link: https://dev.socrata.com/foundry/www.datos.gov.co/gt2j-8ykr/ , it's available the last updated date of this dataset. Unfortunately, I don't find a way to get it directly from the API. But, maybe we can create a web scraping to get it and use it as a way to decide if the code to update data should run or not. I can do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this first review, I'll patching the changes in this PR so we can track the TODOs for the Airflow.
When relocating the code to the epigraphhub_py library, please open a PR there so that we can check that we have reasonable test coverage. This code is critical and must be carefully tested. |
ATTENTION! The last commit possibly BREAKS dependencies, it was the only way I could import |
Currently using this branch to install the eph lib: https://github.com/luabida/epigraphhub_py/tree/data-collection |
#ec337b5 commit is the beginning of airflow's documentations, more information will be included in next commits |
@@ -0,0 +1,103 @@ | |||
# Airflow Overview |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about #4f4ee21: Only the dag that fix #137 remained. |
according to this: https://stackoverflow.com/a/59316781/4995866 I think that you would have two options:
with api call post, maybe you could run that outside of the airflow container, although, not sure about timeout or the response. if you decide to use CLI, you can do something like:
and call it with: does it make sense? |
It does make sense, thank you for helping on this subject. Another issue is the credentials on CI for the legacy Cron tests. Do you know why it receives permissions error? |
not sure .. maybe you need to review the sql files to ensure that the user used to create the table has the permissions. |
I think we need to be careful to avoid redundant testing: If a DAG uses code that is implemented in the epigraphhub library, that code should not be tested (or re-tested) from inside the DAG, but rather directly in the library where it is much easier. The only testing required from airflow are that it starts and finishes the DAGs correctly. |
When I fixed the permissions for the epigraph user that does the insertions on the cron script, I remember that I had to grant permissions also at the schema level, not just for the table... |
note that the error: sqlalchemy.exc.ProgrammingError: (psycopg2.errors.InsufficientPrivilege) permission denied for schema switzerland
LINE 2: CREATE TABLE switzerland.foph_cases_d is for schema switzerland, because you are trying to create a table, so you probably need something like this: GRANT USAGE ON SCHEMA <name_of the_schema> TO <user> ;
GRANT ALL ON ALL TABLES IN SCHEMA <name_of the_schema> TO <user> ;
GRANT ALL ON ALL FUNCTIONS IN SCHEMA <name_of the_schema> TO <user> ;
GRANT ALL ON ALL SEQUENCES IN SCHEMA <name_of the_schema> TO <user> ; maybe something more... |
Everyone, in the makefile we have a target make docker-dev-prepare-db. I didn't find it in any step of the CI. |
It should be done automatically by dockerfile/docker-compose. Is it not working? |
Thanks Ivan, initdb is up to date with a volume in the compose file:
I will continue to debug on my localhost. |
…lure to True on DAGs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @luabida , I will take another later today
Done. Is it done to merge? From my side there is the other PR that needs rebasing from main after this merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just added a few comments, other than that LGTM.
thanks @luabida !
Makefile
Outdated
@@ -86,8 +86,8 @@ docker-dev-prepare-db: | |||
|
|||
.PHONY:docker-run-cron |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove the docker-run-cron
target pls? it will not be used anymore
Makefile
Outdated
|
||
.PHONY:docker-console | ||
docker-console: | ||
$(DOCKER) exec ${SERVICE} ${CONSOLE} | ||
$(DOCKER) exec -it ====${SERVICE} ${CONSOLE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this bunch of equal sign here?
@@ -78,7 +78,8 @@ COPY --chown=epigraphhub:epigraphhub docker/superset/entrypoint.sh /opt/entrypoi | |||
RUN chmod +x /opt/entrypoint.sh \ | |||
&& echo "source /opt/entrypoint.sh" > ~/.bashrc \ | |||
&& sudo mkdir -p /opt/data/superset/ \ | |||
&& sudo chown -R epigraphhub:epigraphhub /opt/data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason for removing this one?
Another question is the user creation for Airflow, maybe we will need a bit of caution on how is it rn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! thanks @luabida !
could explain a bit more about that pls? |
btw, the database issue we are planning to fix with #142 |
I wasn't sure how would be done the Users on Airflow webserver. The data collection DAGs are set with the |
Start of Airflow implementation. It still needs improvements on the documentation and some bug fixes.
How to test the changes:
make docker-build ENV=dev
make docker-start ENV=dev
http://127.0.0.1:8099/
admin
, passwordadmin
Graph
page.NOTES:
airflow scheduler
(ping @xmnlab)foph_casesvaccpersons_d
table in FOPH Dag fails.epigraphhub_py
.fix #36
fix #120
fix #137