
Refactor CDE harvester. #408

Open
5 of 7 tasks
JessyBarrette opened this issue Feb 8, 2024 · 20 comments
@JessyBarrette
Contributor

JessyBarrette commented Feb 8, 2024

Lots of people like https://explore.cioos.ca, but we're having issues with the harvester.

Issue

The explore.cioos.ca harvester has a few issues and occasionally fails to query some ERDDAP servers and/or datasets.

Some of those issues are timeout errors associated with queries that sometimes need a lot of resources on the remote ERDDAP server. As an example, we've asked the different CIOOS groups to use at least a 20 min timeout to be able to handle the bigger queries coming from CDE.

We also currently have no way to test future implementations or to track cron jobs and issues. Also, a lot of the time the harvester is simply waiting for a reply from an ERDDAP server.

Suggestion

Based on those different issues, I think it's time for an overhaul of the explore.cioos.ca harvester. For this, I'm thinking of doing the following:

  • Rely on erddap-python for generating all the erddap queries
  • Use celery to handle running the queries to the different datasets in parallel
  • Split the worker threads by dataset/ERDDAP server, to avoid one server's harvest slowing down the others. The harvester is already splitting by ERDDAP server, which I think makes sense.
  • Track cronjobs and errors with sentry_sdk
  • Document better the CLI interface with click
  • Add some tests with pytest
  • Manage dependencies with poetry
  • Containerize the deployment with Docker
  • On slow queries, split dataset queries by time ranges (yearly, decades) to reduce the size of each query and avoid timeouts of >20 min (see the sketch after this list)
  • Add a diagram to explain CDE workflow and harvester
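
As a rough illustration of the time-range splitting idea, here is a minimal sketch that turns one giant dataset query into per-year tabledap requests. The server URL, dataset ID, and variable names are placeholders, and in practice erddap-python would build and URL-encode the constraints:

```python
from datetime import datetime

def yearly_chunks(start, end):
    """Split a dataset's full time coverage into per-year (start, end) pairs."""
    for year in range(start.year, end.year + 1):
        chunk_start = max(start, datetime(year, 1, 1))
        chunk_end = min(end, datetime(year + 1, 1, 1))
        if chunk_start < chunk_end:
            yield chunk_start, chunk_end

def chunked_queries(erddap_url, dataset_id, variables, start, end):
    """Yield one tabledap CSV request per year instead of one giant query."""
    for chunk_start, chunk_end in yearly_chunks(start, end):
        yield (
            f"{erddap_url}/tabledap/{dataset_id}.csv?{','.join(variables)}"
            f"&time>={chunk_start:%Y-%m-%dT%H:%M:%SZ}"
            f"&time<{chunk_end:%Y-%m-%dT%H:%M:%SZ}"
        )

# Placeholder server/dataset: a decade becomes ~10 small queries.
for url in chunked_queries(
    "https://server.example/erddap", "my_dataset",
    ["time", "latitude", "longitude"],
    datetime(2014, 1, 1), datetime(2024, 1, 1),
):
    print(url)
```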

Since the queries generated by the harvester can be pretty resource intensive on the ERDDAP server side, this would also be a good opportunity to help assess the CapRover deployment method (see issue).

JessyBarrette self-assigned this Feb 8, 2024
@JessyBarrette
Contributor Author

@raytula @steviewanders @n-a-t-e let me know your thoughts

@raytula

raytula commented Feb 9, 2024

@raytula @steviewanders @n-a-t-e let me know your thoughts

Sounds like a good idea to me. CDE will be updated to support non-fixed data sources during Cycle 4 but I expect that work will not start for a while, and that the back-end harvester improvements can be done before and independent of that work.

@JessyBarrette
Contributor Author

Right, we can think about what strategies to use to handle this kind of data as we revamp the harvester.

@n-a-t-e
Member

n-a-t-e commented Feb 9, 2024

I think this is a nice list of things to look at, especially adding Sentry, but I'm not seeing how they are related to fixing the issue of the 20+ minute queries. Here are a couple of issues that would be great to focus on:

  • Slow queries: @JessyBarrette you had mentioned some ideas of how to address these slow queries, maybe breaking them down by decade or year.

  • Asynchronous harvesting: Right now, the erddap servers are queried in parallel, but the data isn't processed until all the threads have returned. So one slow server will slow down the entire process. It would be great to move to an asynchronous system, similar to how CKAN does it, where we have one harvester per remote server, and they don't slow each other down.
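
Not how CKAN does it, but just to sketch the "process each server as soon as it returns" idea, here is a minimal example using concurrent.futures; harvest_server and load_results are hypothetical helpers:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def harvest_all(servers, harvest_server, load_results, max_workers=4):
    """Query every ERDDAP server in parallel and hand off each server's
    results as soon as that server finishes, instead of waiting for all."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(harvest_server, url): url for url in servers}
        for future in as_completed(futures):
            url = futures[future]
            try:
                load_results(url, future.result())
            except Exception as exc:
                # One slow or failing server no longer blocks the others.
                print(f"Harvest failed for {url}: {exc}")
```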

@JessyBarrette
Contributor Author

Thanks @n-a-t-e, I've added those points to the list more explicitly.

@fostermh
Contributor

Am I right in thinking that CDE currently makes use of redis? If so, you could consider using redis to store the harvesting queue instead of celery. I have some experience with redis as it relates to ckan harvester management if you want to chat about it.

Your last point of splitting harvesting into multiple queued jobs based on time or some other attribute seems like the most likely to improve performance.

@steviewanders

steviewanders commented Feb 21, 2024

Am I right in thinking that CDE currently makes use of redis?

Yup. At least when @n-a-t-e showed me a year or so ago.

If so, you could consider using redis to store the harvesting queue instead of celery.

Typically Celery can use redis as a persistence backend for its jobs and results, not instead of :)
But yes, we should consider it as it's pretty standard and already part of the stack.
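
For illustration, a minimal sketch of that setup, with hypothetical app and task names, reusing the existing Redis instance as Celery's broker and result backend:

```python
from celery import Celery

# Redis acts as the message broker and result backend;
# Celery provides the task queue and workers on top of it.
app = Celery(
    "cde_harvester",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task(bind=True, max_retries=3, soft_time_limit=1200)  # 20 min cap per query
def harvest_dataset(self, erddap_url, dataset_id):
    try:
        return run_erddap_query(erddap_url, dataset_id)
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)

def run_erddap_query(erddap_url, dataset_id):
    # Placeholder for the real query logic (e.g. built with erddap-python).
    return f"{erddap_url}/tabledap/{dataset_id}.csv"
```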

@JessyBarrette sorry haven't had time to look at this with you yet. Will soon.

@fostermh
Contributor

Neat, I have not knowingly used Celery, though now that I search for it I see there is a reference to it in CKAN, so perhaps I have used it and didn't realize. :-)

@steviewanders

@steviewanders let me know your thoughts

@JessyBarrette As discussed, I'm fine with the rewrite given a diagram and a review of the current and intended architecture, and then you can deploy to CapRover via Docker.
Let me know if there is anything else I can do to help.

@JessyBarrette
Contributor Author

Here's a broad diagram presenting CDE's current workflow:

And here's the harvester in more detail:

@steviewanders

Thanks @JessyBarrette, that's an excellent overview diagram.

@fostermh
Contributor

In your diagram, is the 'database' our AWS RDS Postgres instance or something else?

@JessyBarrette
Contributor Author

This is a dedicated database deployed through its own container, running on the servers where CDE is deployed: pac-dev2.cioos.ca (development) and explore.cioos.ca (production).

@steviewanders

@JessyBarrette I've added @BryceMarshall to the issue and project.

@BryceMarshall once you find space in your calendar, schedule a meeting with us three (and anybody else who wants to join in) for an overview of this work, current status and how to go forward together.

@JessyBarrette
Contributor Author

PR #410 fixes a few of the issues listed above.

This should be deployed in the development environment; further code refactoring is happening within the harvester-chunks branch to fix deeper issues.

@steviewanders

This should be deployed in the development environment;

What are the deployment commands/method?

And is the development environment goose.hakai.org?

@n-a-t-e
Member

n-a-t-e commented Apr 4, 2024

@JessyBarrette I've added @BryceMarshall to the issue and project.

@BryceMarshall once you find space in your calendar, schedule a meeting with us three (and anybody else who wants to join in) for an overview of this work, current status and how to go forward together.

I would join this as well

@n-a-t-e
Member

n-a-t-e commented Apr 4, 2024

This should be deployed in the development environment;

What are the deployment commands/method?

And is the development environment goose.hakai.org?

Hypothetical dev environment? We used to have one on pac-dev2 but I don't know if it's still running.

@steviewanders

Haha, ok so we can discuss setting this up somewhere.

@JessyBarrette
Contributor Author

Further notes regarding the latest changes: it turns out the harvester is running multiple threads in parallel, one for each ERDDAP server that it harvests from. However, all the collected data is combined together once all the workers have completed. The data is then dumped to CSV files, which are then uploaded to the database via the db_loader.

I'm planning on having the harvester populate the database for each individual worker separately. This will avoid having to wait for all the jobs to complete before updating the CDE database.
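
A rough sketch of that per-worker load, assuming each worker returns a pandas DataFrame; the connection string and table name below are placeholders, not the actual db_loader schema:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://cde:password@localhost:5432/cde")  # placeholder DSN

def load_worker_results(server_url: str, records: pd.DataFrame) -> None:
    """Push one worker's harvested records to the database as soon as
    that worker finishes, instead of waiting for every ERDDAP server
    and dumping a combined CSV for db_loader."""
    records.to_sql("harvested_records", engine, if_exists="append", index=False)
    print(f"Loaded {len(records)} records from {server_url}")
```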
