Refactor rosindex to use rosdistro cache instead of crawling all packages #444
Comments
I fully agree with this approach, and I'll work on it at some point. I'd first like to finish the new search table for system dependencies, which will allow us to delete a lot of code around paging for the package and dependency lists. I think the goal, as I believe you stated at ROSCon 2024, is to make rosindex mostly a rendering app, with any crawling and scraping done in a separate step. The repo crawling is done by the build farm, but there are other items that need crawling that are not. I made a brief Discord comment about this. Some features already running in my dev version rely on offline scraping, currently done in a GitHub Action under an rkent GitHub repo, which is not intended as a long-term solution. I need to resolve the offline scraping issue to get some of that landed. What I am leaning toward doing for now is to merge the content of the rkent repo into rosindex, initially set up as GitHub Actions under rosindex, but allowing an easy transition to the build farm if that is the desired long-term solution.
If we need other offline scraping processes, I'd suggest we set them up as separate processes. For local builds and development you can run them in the appropriate sequence, and on the buildfarm we can run them as separate jobs, like we do for the rosdistro cache, passing the generated artifact(s) to the rendering process. We can also make those intermediate artifacts available publicly, in the same way we do with the rosdistro cache, so that unless you're debugging or developing those pieces, or running a custom instance, most users never need to run the crawl.
What exactly do you mean by "separate processes"?
They would be invoked from the command line and generate a file output that can be used as an input for the rendering, in contrast to our current implementation, which does all the crawling inside the Jekyll Ruby process.
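For illustration, here is a minimal sketch of one such process, assuming Python for the standalone steps; the script shape, flag names, and JSON artifact format are all hypothetical, not anything rosindex ships today:

```python
#!/usr/bin/env python3
"""Hypothetical standalone crawl step: invoked from the command line,
it writes a JSON artifact that the Jekyll render step can later read
as a plain file input."""
import argparse
import json


def crawl(distro):
    # Placeholder for whatever content this step is responsible for
    # (e.g. per-package metadata gathered outside of Jekyll).
    return {'distro': distro, 'packages': {}}


def main():
    parser = argparse.ArgumentParser(
        description='Crawl step producing a render input artifact.')
    parser.add_argument('--distro', required=True)
    parser.add_argument('--output', required=True,
                        help='Path of the JSON artifact to write.')
    args = parser.parse_args()
    with open(args.output, 'w') as f:
        json.dump(crawl(args.distro), f, indent=2)


if __name__ == '__main__':
    main()
```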
Looking at this, one issue is that the distro cache mostly provides the package.xml content, while rosindex also needs other files, like README and CONTRIBUTING. I don't see how we can avoid continuing to download entire repos instead of just the rosdistro_cache files.
Yeah, that's why we may need to set up another crawler/cache for the other content as a separate concern. If we do those in a separate crawl process and cache the content, we can decouple that crawl from the build. Making it a separate process that can potentially run longer, and that focuses on being efficient and on caching, would make the website easier to maintain (as just a build/render cycle) and potentially make the content available for more uses too. The rosdistro crawler has special logic for fetching specific files when the content is on a standard web host like GitHub or Bitbucket, but it falls back to cloning if the host is not one of those.
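As a sketch of that fetch-or-clone pattern (this is not the actual rosdistro implementation; the raw-URL templates follow the public GitHub and Bitbucket patterns, and the helper name is hypothetical):

```python
import subprocess
import tempfile
import urllib.request
from pathlib import Path

# Hosts where a single file can be fetched without cloning.
RAW_URL_TEMPLATES = {
    'github.com': 'https://raw.githubusercontent.com/{org}/{repo}/{ref}/{path}',
    'bitbucket.org': 'https://bitbucket.org/{org}/{repo}/raw/{ref}/{path}',
}


def fetch_file(host, org, repo, ref, path):
    template = RAW_URL_TEMPLATES.get(host)
    if template is not None:
        # Known web host: fetch just the one file.
        url = template.format(org=org, repo=repo, ref=ref, path=path)
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    # Unknown host: fall back to a shallow clone and read the file
    # out of the temporary checkout.
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            ['git', 'clone', '--depth', '1', '--branch', ref,
             'https://{}/{}/{}.git'.format(host, org, repo), tmp],
            check=True)
        return (Path(tmp) / path).read_bytes()
```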
Here's an update on my overall plan for moving forward with rosindex, of which this issue is one piece. First, I want to finish the combined list and search table for packages and dependencies: build a Tabulator-based list for dependencies, then remove the existing pagination-based package list, package search, dependency list, and dependency search code. Next, I want to take the various separate preliminary steps of rosindex (discover, update, scrape, etc.), which currently rely partly on the cache for persistence, and turn them into separate Python scripts that communicate through files rather than the cache, roughly as sketched below. Initially those will still be called from rosindex_generator.rb as part of a unified build, as is done today, but the plan is to eventually allow those steps to be scheduled independently, or even to use shared file resources on a buildfarm. The request in this issue is one piece of that overall plan.
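A rough sketch of how those steps might chain through files; the script names discover.py, update.py, and scrape.py are placeholders for the eventual split, not existing files:

```python
import subprocess

# Each step is an independent script; the only coupling is the file
# artifact each one writes for the next step (or for the renderer).
STEPS = [
    ['python3', 'discover.py', '--output', 'repos.json'],
    ['python3', 'update.py', '--input', 'repos.json',
     '--output', 'checkouts.json'],
    ['python3', 'scrape.py', '--input', 'checkouts.json',
     '--output', 'site_data.json'],
]

for cmd in STEPS:
    subprocess.run(cmd, check=True)
# rosindex_generator.rb (or, later, an independent schedule) then
# consumes site_data.json as a plain file input to the render.
```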
That sounds like a great plan. |
The rosdistro cache is actively maintained by the OSRF buildfarm (https://github.com/ros-infrastructure/rosdistro), and it contains effectively all of the content we need for the index, including all the distro information as well as the package.xml content. We can likely eliminate most or even all of the crawling by iterating with the rosdistro API, and leverage the cache to build this site much faster.
@rkent this was what I was mentioning about avoiding the full crawl in Jekyll.
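For reference, iterating the cache through the rosdistro Python API looks roughly like this (get_index, get_cached_distribution, and get_release_package_xml are real rosdistro calls; error handling and distro filtering are omitted for brevity):

```python
from rosdistro import get_cached_distribution, get_index, get_index_url

index = get_index(get_index_url())
for distro_name in index.distributions:
    dist = get_cached_distribution(index, distro_name)
    for pkg_name in dist.release_packages:
        # The cache already carries the package.xml content, so no
        # per-repository crawl is needed for this part of the index.
        package_xml = dist.get_release_package_xml(pkg_name)
```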