Refactor CDE harvester. #408
Comments
@raytula @steviewanders @n-a-t-e let me know your thoughts
Sounds like a good idea to me. CDE will be updated to support non-fixed data sources during Cycle 4, but I expect that work will not start for a while, and that the back-end harvester improvements can be done before and independent of that work.
Right, we can think about the strategies to use to handle this kind of data as we're revamping the harvester.
I think this is a nice list of things to look at, especially adding Sentry, but I'm not seeing how they are related to fixing the issue of the 20+ minute queries. Here are a couple of issues that would be great to focus on:
Thanks @n-a-t-e, I added those points to the list more clearly.
Am I right in thinking that CDE currently makes use of redis? If so, you could consider using redis to store the harvesting queue instead of celery. I have some experience with redis as it relates to ckan harvester management if you want to chat about it. Your last point of splitting harvesting into multiple queued jobs based on time or some other attribute seems like the most likely to improve performance.
Yup. At least when @n-a-t-e showed me a year or so ago.
Typically Celery can use redis as a persistence backend for its jobs and results, not instead of :) @JessyBarrette sorry haven't had time to look at this with you yet. Will soon.
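To make the Celery-plus-Redis relationship above concrete, here is a minimal sketch of a Celery app using Redis as both the message broker and the result backend. This is illustrative only: the app name, task name, and Redis URLs are assumptions, not the project's actual configuration.

```python
# Sketch: Celery with Redis as broker AND result backend (not "instead of").
# All names/URLs below are hypothetical examples.
from celery import Celery

app = Celery(
    "cde_harvester",                    # hypothetical app name
    broker="redis://localhost:6379/0",  # Redis as the message broker
    backend="redis://localhost:6379/1", # Redis persisting task results
)

@app.task(bind=True, max_retries=3)
def harvest_dataset(self, erddap_url, dataset_id):
    """Query one dataset from one ERDDAP server (hypothetical task body)."""
    ...
```

Each queued harvest job would then be dispatched with something like `harvest_dataset.delay(url, dataset_id)`, with Redis holding both the pending queue and the results.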
Neat, I have not knowingly used celery, though now that I search for it I see there is a reference in ckan to it, so perhaps I have used it and didn't realize. :-)
@JessyBarrette As discussed, I'm fine with the rewrite given a diagram and review of current and intended architecture, and then you can deploy to CapRover via Docker.
Thanks @JessyBarrette, this is an excellent overview diagram.
In your diagram, is the 'database' our AWS RDS Postgres instance or something else?
This is a dedicated database deployed through its own container and running on the server where cde is deployed: pac-dev2.cioos.ca (development) and explore.cioos.ca (production).
@JessyBarrette I've added @BryceMarshall to the issue and project. @BryceMarshall once you find space in your calendar, schedule a meeting with us three (and anybody else who wants to join in) for an overview of this work, current status, and how to go forward together.
PR #410 fixes a few of the issues listed above. This should be deployed in the development environment; further code refactoring is happening within the
What are the deployment commands/method? And is the development environment
I would join this as well
hypothetical dev environment? We used to have one on pac-dev2, but I don't know if it's still running.
Haha, ok, so we can discuss setting this up somewhere.
Further notes regarding the latest changes. It turns out the harvester is running multiple threads in parallel, one for each ERDDAP server it harvests from. However, all the data collected is combined together once all the workers are completed. The data is then dumped to CSV files, which are then uploaded to the database. I'm planning on having the harvester populate the database for each individual worker separately. This will avoid having to wait for all the jobs to complete before updating the cde database.
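The per-worker idea above can be sketched with the standard library: each harvest worker writes its own rows to the database as soon as it finishes, instead of waiting for every worker and doing one combined CSV dump. This is a simplified, stdlib-only illustration (in-memory SQLite, fake `harvest` function, made-up table and server names), not the project's actual code.

```python
# Sketch: write each worker's results to the DB as it completes,
# rather than merging everything after all workers finish.
# All names below (table, servers, harvest body) are hypothetical.
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed

def harvest(server):
    # Placeholder for a real ERDDAP query; returns rows for one server.
    return [(server, f"{server}-dataset-{i}") for i in range(2)]

def save(conn, rows):
    conn.executemany(
        "INSERT INTO datasets (server, dataset_id) VALUES (?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (server TEXT, dataset_id TEXT)")

servers = ["erddap-a", "erddap-b"]
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(harvest, s) for s in servers]
    for fut in as_completed(futures):
        # Persist this worker's rows immediately; slow servers no longer
        # block the database update for everyone else.
        save(conn, fut.result())

count = conn.execute("SELECT COUNT(*) FROM datasets").fetchone()[0]
print(count)  # 4
```

The design benefit is exactly the one described in the comment: a single slow ERDDAP server delays only its own rows, not the whole cde database refresh.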
Lots of people like https://explore.cioos.ca, but we're having issues with the harvester.
Issue
The explore.cioos.ca harvester is having a few issues and occasionally fails to query some ERDDAP servers and/or datasets.
Some of those issues are related to timeout errors associated with some of the queries, which sometimes need a lot of resources on the remote ERDDAP server. As an example, we've been asking the different CIOOS groups to use at least a 20-minute timeout to be able to handle big queries from CDE.
We also currently have no way to test future implementations, track cron jobs, or track issues. Also, a lot of the time the harvester is simply waiting for a reply from an ERDDAP server.
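One small, concrete way to stop the harvester from waiting indefinitely on a slow server is to pass an explicit client-side timeout on every query. This stdlib-only sketch is an assumption about how such a guard could look (the function, URL pattern, and the reuse of the 20-minute figure are all illustrative, not the harvester's actual code):

```python
# Sketch: bound how long one ERDDAP query can keep a harvest worker waiting.
# Names and URL layout are hypothetical; 20 min mirrors the server-side
# timeout mentioned above.
from urllib.error import URLError
from urllib.request import urlopen

ERDDAP_TIMEOUT = 20 * 60  # seconds

def fetch_csv(base_url, dataset_id, timeout=ERDDAP_TIMEOUT):
    url = f"{base_url}/tabledap/{dataset_id}.csv"
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except URLError as err:
        # Fail loudly instead of silently hanging the whole harvest run.
        raise RuntimeError(f"query to {url} failed: {err}") from err
```

With a bound like this, a stalled server produces a trackable error (e.g. in Sentry) rather than an open-ended wait.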
Suggestion
Based on those different issues, I think it would be time for an overhaul of the explore.cioos.ca harvester. For this, I'm thinking of doing the following:
- Rely on erddap-python for generating all the ERDDAP queries
- Use celery for handling multithreading of the different queries to the multiple datasets simultaneously
  - alternative from @steviewanders: https://huey.readthedocs.io/en/latest/
  - alternative from @steviewanders: https://github.com/hynek/stamina
- Add a table to the database to track updates to the database
- Split multithreads by datasets/ERDDAP servers, to avoid slowing down other server harvesters. The harvester is already splitting by ERDDAP server, which I think makes sense.
- sentry_sdk
- click
- pytest
- poetry
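A core behaviour the retry-oriented suggestions above (e.g. stamina) would bring is retrying flaky ERDDAP queries with exponential backoff. As a stdlib-only illustration of that behaviour, here is a tiny hand-rolled retry decorator; every name in it is made up, and a real implementation would just use one of the libraries listed.

```python
# Sketch of retry-with-exponential-backoff, the behaviour a library such as
# stamina provides out of the box. All names below are illustrative.
import time

def retry(attempts=3, base_delay=0.01):
    def wrap(fn):
        def inner(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of attempts: propagate the error
                    time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        return inner
    return wrap

calls = {"n": 0}

@retry(attempts=3)
def flaky_query():
    # Simulates an ERDDAP server that times out twice before answering.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("ERDDAP slow to respond")
    return "ok"

print(flaky_query())  # ok
```

Backoff like this keeps transient server-side timeouts from failing a whole harvest run, while still surfacing persistent failures after the final attempt.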
Since the queries generated by the harvester can be pretty resource-intensive on the ERDDAP server side, this would also be a good opportunity to help assess the CapRover deployment method (see issue).