
Refactor CDE harvester. #408

Open
5 of 7 tasks
JessyBarrette opened this issue Feb 8, 2024 · 20 comments
@JessyBarrette
Contributor

JessyBarrette commented Feb 8, 2024

Lots of people like https://explore.cioos.ca, but we're having issues with the harvester.

Issue

The explore.cioos.ca harvester has a few issues and occasionally fails to query some ERDDAP servers and/or datasets.

Some of those issues are timeout errors associated with queries that sometimes need a lot of resources on the remote ERDDAP server. As an example, we've asked the different CIOOS groups to use at least a 20 min timeout to be able to handle the bigger queries coming from CDE.

We also currently have no way to test future implementations or to track cron jobs and issues. Also, a lot of the time the harvester is simply waiting for a reply from an ERDDAP server.

Suggestion

Based on those different issues, I think it's time for an overhaul of the explore.cioos.ca harvester. For this, I'm thinking of doing the following:

  • Rely on erddap-python for generating all the erddap queries
  • Use celery to handle running the queries to the different datasets in parallel
  • Split the worker threads by dataset/ERDDAP server, to avoid one server's harvest slowing down the others. The harvester is already splitting by ERDDAP server, which I think makes sense.
  • Track cronjobs and errors with sentry_sdk
  • Document better the CLI interface with click
  • Add some tests with pytest
  • Manage dependencies with poetry
  • Containerize the deployment with Docker
  • On slow queries, split dataset queries by time ranges (yearly, decades) to reduce the size of each query and avoid timeouts of >20 min (see the sketch after this list)
  • Add a diagram to explain CDE workflow and harvester
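
As a rough illustration of the time-range splitting idea, here is a minimal sketch that turns one giant dataset query into per-year tabledap requests. The server URL, dataset ID, and variable names are placeholders, and in practice erddap-python would build and URL-encode the constraints:

```python
from datetime import datetime

def yearly_chunks(start, end):
    """Split a dataset's full time coverage into per-year (start, end) pairs."""
    for year in range(start.year, end.year + 1):
        chunk_start = max(start, datetime(year, 1, 1))
        chunk_end = min(end, datetime(year + 1, 1, 1))
        if chunk_start < chunk_end:
            yield chunk_start, chunk_end

def chunked_queries(erddap_url, dataset_id, variables, start, end):
    """Yield one tabledap CSV request per year instead of one giant query."""
    for chunk_start, chunk_end in yearly_chunks(start, end):
        yield (
            f"{erddap_url}/tabledap/{dataset_id}.csv?{','.join(variables)}"
            f"&time>={chunk_start:%Y-%m-%dT%H:%M:%SZ}"
            f"&time<{chunk_end:%Y-%m-%dT%H:%M:%SZ}"
        )

# Placeholder server/dataset: a decade becomes ~10 small queries.
for url in chunked_queries(
    "https://server.example/erddap", "my_dataset",
    ["time", "latitude", "longitude"],
    datetime(2014, 1, 1), datetime(2024, 1, 1),
):
    print(url)
```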

Since the queries generated by the harvester can be pretty resource intensive on the ERDDAP server side, this would also be a good opportunity to help assess the CapRover deployment method (see issue).

JessyBarrette self-assigned this Feb 8, 2024
@JessyBarrette
Contributor Author

@raytula @steviewanders @n-a-t-e let me know your thoughts

@raytula

raytula commented Feb 9, 2024

@raytula @steviewanders @n-a-t-e let me know your thoughts

Sounds like a good idea to me. CDE will be updated to support non-fixed data sources during Cycle 4 but I expect that work will not start for a while, and that the back-end harvester improvements can be done before and independent of that work.

@JessyBarrette
Contributor Author

Right, we can think about what strategies to use to handle this kind of data as we revamp the harvester.

@n-a-t-e
Member

n-a-t-e commented Feb 9, 2024

I think this is a nice list of things to look at, especially adding Sentry, but I'm not seeing how they are related to fixing the issue of the 20+ minute queries. Here are a couple of issues that would be great to focus on:

  • Slow queries: @JessyBarrette you had mentioned some ideas of how to address these slow queries, maybe breaking them down by decade or year.

  • Asynchronous harvesting: Right now, the erddap servers are queried in parallel, but the data isn't processed until all the threads have returned. So one slow server will slow down the entire process. It would be great to move to an asynchronous system, similar to how CKAN does it, where we have one harvester per remote server, and they don't slow each other down.
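
Not how CKAN does it, but just to sketch the "process each server as soon as it returns" idea, here is a minimal example using concurrent.futures; harvest_server and load_results are hypothetical helpers:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def harvest_all(servers, harvest_server, load_results, max_workers=4):
    """Query every ERDDAP server in parallel and hand off each server's
    results as soon as that server finishes, instead of waiting for all."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(harvest_server, url): url for url in servers}
        for future in as_completed(futures):
            url = futures[future]
            try:
                load_results(url, future.result())
            except Exception as exc:
                # One slow or failing server no longer blocks the others.
                print(f"Harvest failed for {url}: {exc}")
```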

@JessyBarrette
Contributor Author

Thanks @n-a-t-e, I've added those points to the list more explicitly.

@fostermh
Contributor

Am I right in thinking that CDE currently makes use of redis? If so, you could consider using redis to store the harvesting queue instead of celery. I have some experience with redis as it relates to ckan harvester management if you want to chat about it.

Your last point of splitting harvesting into multiple queued jobs based on time or some other attribute seems like the most likely to improve performance.

@steviewanders

steviewanders commented Feb 21, 2024

Am I right in thinking that CDE currently makes use of redis?

Yup. At least when @n-a-t-e showed me a year or so ago.

If so, you could consider using redis to store the harvesting queue instead of celery.

Typically Celery can use redis as a persistence backend for its jobs and results, not instead of :)
But yes, we should consider it as it's pretty standard and already part of the stack.
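
For illustration, a minimal sketch of that setup, with hypothetical app and task names, reusing the existing Redis instance as Celery's broker and result backend:

```python
from celery import Celery

# Redis acts as the message broker and result backend;
# Celery provides the task queue and workers on top of it.
app = Celery(
    "cde_harvester",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task(bind=True, max_retries=3, soft_time_limit=1200)  # 20 min cap per query
def harvest_dataset(self, erddap_url, dataset_id):
    try:
        return run_erddap_query(erddap_url, dataset_id)
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)

def run_erddap_query(erddap_url, dataset_id):
    # Placeholder for the real query logic (e.g. built with erddap-python).
    return f"{erddap_url}/tabledap/{dataset_id}.csv"
```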

@JessyBarrette sorry haven't had time to look at this with you yet. Will soon.

@fostermh
Contributor

Neat, I have not knowingly used Celery, though now that I search for it I see there is a reference to it in CKAN, so perhaps I have used it and didn't realize. :-)

@steviewanders

@steviewanders let me know your thoughts

@JessyBarrette As discussed, I'm fine with the rewrite given a diagram and a review of the current and intended architecture, and then you can deploy to CapRover via Docker.
Let me know if there is anything else I can do to help.

@JessyBarrette
Contributor Author

Here's a broad diagram presenting CDE's current workflow:

And here's the harvester in more detail:

@steviewanders

Thanks @JessyBarrette, that's an excellent overview diagram.

@fostermh
Contributor

In your diagram, is the 'database' our AWS RDS Postgres instance or something else?

@JessyBarrette
Contributor Author

This is a dedicated database deployed through its own container, running on the servers where CDE is deployed: pac-dev2.cioos.ca (development) and explore.cioos.ca (production).

@steviewanders

@JessyBarrette I've added @BryceMarshall to the issue and project.

@BryceMarshall once you find space in your calendar, schedule a meeting with us three (and anybody else who wants to join in) for an overview of this work, current status and how to go forward together.

@JessyBarrette
Contributor Author

PR #410 fixes a few of the issues listed above.

This should be deployed in the development environment; further code refactoring is happening within the harvester-chunks branch to fix deeper issues.

@steviewanders

This should be deployed in the development environment;

What are the deployment commands/method?

And is the development environment goose.hakai.org?

@n-a-t-e
Member

n-a-t-e commented Apr 4, 2024

@JessyBarrette I've added @BryceMarshall to the issue and project.

@BryceMarshall once you find space in your calendar, schedule a meeting with us three (and anybody else who wants to join in) for an overview of this work, current status and how to go forward together.

I would join this as well

@n-a-t-e
Member

n-a-t-e commented Apr 4, 2024

This should be deployed in the development environment;

What are the deployment commands/method?

And is the development environment goose.hakai.org?

Hypothetical dev environment? We used to have one on pac-dev2 but I don't know if it's still running.

@steviewanders

Haha, ok so we can discuss setting this up somewhere.

@JessyBarrette
Contributor Author

Further notes regarding the latest changes: it turns out the harvester is running multiple threads in parallel, one for each ERDDAP server that it harvests from. However, all the collected data is combined together once all the workers have completed. The data is then dumped to CSV files, which are then uploaded to the database via the db_loader.

I'm planning on having the harvester populate the database for each individual worker separately. This will avoid having to wait for all the jobs to complete before updating the CDE database.
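
A rough sketch of that per-worker load, assuming each worker returns a pandas DataFrame; the connection string and table name below are placeholders, not the actual db_loader schema:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://cde:password@localhost:5432/cde")  # placeholder DSN

def load_worker_results(server_url: str, records: pd.DataFrame) -> None:
    """Push one worker's harvested records to the database as soon as
    that worker finishes, instead of waiting for every ERDDAP server
    and dumping a combined CSV for db_loader."""
    records.to_sql("harvested_records", engine, if_exists="append", index=False)
    print(f"Loaded {len(records)} records from {server_url}")
```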
