Refactor rosindex to use rosdistro cache instead of crawling all packages #444
Comments
I fully agree with this approach, and I'll work on it at some point. I'd first like to finish the new search table for system dependencies, which will allow us to delete a lot of code around paging for the package and dependency lists. I think the goal, as I believe you stated at ROSCon 2024, is to make rosindex mostly a rendering app, with any crawling and scraping done in a separate step. The repo crawling is done by the build farm, but there are other items that need crawling that are not. I made a brief Discord comment about this. Some features already running in my dev version rely on offline scraping, currently done in a GitHub Action under an rkent GitHub repo, which is not intended as a long-term solution. I need to resolve the offline scraping issue to get some of that landed. What I am leaning toward doing for now is to merge the content of the rkent repo into rosindex, initially set up as GitHub Actions under rosindex, but allowing an easy transition to the build farm if that is the desired long-term solution.
If we need other offline scraping processes, I'd suggest we set them up as separate processes. For local builds and development you can run them in the appropriate sequence, and on the buildfarm we can run them as separate jobs, like we do for the rosdistro cache, passing the generated artifact(s) to the rendering process. We can also make those intermediate artifacts available publicly, in the same way we do with the rosdistro cache, so that unless you're debugging or developing those pieces, or running a custom instance, most users never need to run the crawl.
What exactly do you mean by "separate processes"?
They would be invoked from the command line and generate a file output that can be used as an input for the rendering, in contrast to our current implementation, which does all the crawling inside the Jekyll Ruby process.
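For illustration, here is a minimal sketch of one such process, assuming Python for the standalone steps; the script shape, flag names, and JSON artifact format are all hypothetical, not anything rosindex ships today:

```python
#!/usr/bin/env python3
"""Hypothetical standalone crawl step: invoked from the command line,
it writes a JSON artifact that the Jekyll render step can later read
as a plain file input."""
import argparse
import json


def crawl(distro):
    # Placeholder for whatever content this step is responsible for
    # (e.g. per-package metadata gathered outside of Jekyll).
    return {'distro': distro, 'packages': {}}


def main():
    parser = argparse.ArgumentParser(
        description='Crawl step producing a render input artifact.')
    parser.add_argument('--distro', required=True)
    parser.add_argument('--output', required=True,
                        help='Path of the JSON artifact to write.')
    args = parser.parse_args()
    with open(args.output, 'w') as f:
        json.dump(crawl(args.distro), f, indent=2)


if __name__ == '__main__':
    main()
```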
Looking at this, one issue is that the distro cache mostly provides the package.xml content, while rosindex also needs other files, like README and CONTRIBUTING. I don't see how we can avoid continuing to download entire repos instead of just the rosdistro_cache files.
Yeah, that's why we may need to set up another crawler/cache for the other content as a separate concern. If we do those in a separate crawl process and cache the content, we can decouple that crawl from the build. Making it a separate process that can potentially run longer, and that focuses on being efficient and on caching, would make the website easier to maintain (as just a build/render cycle) and potentially make the content available for more uses too. The rosdistro crawler has special logic for fetching specific files when the content is on a standard web host like GitHub or Bitbucket, but it falls back to cloning if the host is not one of those.
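As a sketch of that fetch-or-clone pattern (this is not the actual rosdistro implementation; the raw-URL templates follow the public GitHub and Bitbucket patterns, and the helper name is hypothetical):

```python
import subprocess
import tempfile
import urllib.request
from pathlib import Path

# Hosts where a single file can be fetched without cloning.
RAW_URL_TEMPLATES = {
    'github.com': 'https://raw.githubusercontent.com/{org}/{repo}/{ref}/{path}',
    'bitbucket.org': 'https://bitbucket.org/{org}/{repo}/raw/{ref}/{path}',
}


def fetch_file(host, org, repo, ref, path):
    template = RAW_URL_TEMPLATES.get(host)
    if template is not None:
        # Known web host: fetch just the one file.
        url = template.format(org=org, repo=repo, ref=ref, path=path)
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    # Unknown host: fall back to a shallow clone and read the file
    # out of the temporary checkout.
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            ['git', 'clone', '--depth', '1', '--branch', ref,
             'https://{}/{}/{}.git'.format(host, org, repo), tmp],
            check=True)
        return (Path(tmp) / path).read_bytes()
```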
Here's an update on my overall plan for moving forward with rosindex, of which this issue is one piece. First, I want to finish the combined list and search table for packages and dependencies: build a Tabulator-based list for dependencies, then remove the existing pagination-based package list, package search, dependency list, and dependency search code. Next, I want to take the various separate preliminary steps of rosindex (discover, update, scrape, etc.), which currently rely partly on the cache for persistence, and turn them into separate Python scripts that communicate through files rather than the cache, roughly as sketched below. Initially those will still be called from rosindex_generator.rb as part of a unified build, as is done today, but the plan is to eventually allow those steps to be scheduled independently, or even to use shared file resources on a buildfarm. The request in this issue is one piece of that overall plan.
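A rough sketch of how those steps might chain through files; the script names discover.py, update.py, and scrape.py are placeholders for the eventual split, not existing files:

```python
import subprocess

# Each step is an independent script; the only coupling is the file
# artifact each one writes for the next step (or for the renderer).
STEPS = [
    ['python3', 'discover.py', '--output', 'repos.json'],
    ['python3', 'update.py', '--input', 'repos.json',
     '--output', 'checkouts.json'],
    ['python3', 'scrape.py', '--input', 'checkouts.json',
     '--output', 'site_data.json'],
]

for cmd in STEPS:
    subprocess.run(cmd, check=True)
# rosindex_generator.rb (or, later, an independent schedule) then
# consumes site_data.json as a plain file input to the render.
```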
That sounds like a great plan. |
The rosdistro cache is actively maintained by the OSRF buildfarm (https://github.com/ros-infrastructure/rosdistro), and it contains effectively all of the content we need for the index, including all the distro information as well as the package.xml content. We can likely eliminate most or even all of the crawling by iterating with the rosdistro API, and leverage the cache to build this site much faster.
@rkent this was what I was mentioning about avoiding the full crawl in Jekyll.
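For reference, iterating the cache through the rosdistro Python API looks roughly like this (get_index, get_cached_distribution, and get_release_package_xml are real rosdistro calls; error handling and distro filtering are omitted for brevity):

```python
from rosdistro import get_cached_distribution, get_index, get_index_url

index = get_index(get_index_url())
for distro_name in index.distributions:
    dist = get_cached_distribution(index, distro_name)
    for pkg_name in dist.release_packages:
        # The cache already carries the package.xml content, so no
        # per-repository crawl is needed for this part of the index.
        package_xml = dist.get_release_package_xml(pkg_name)
```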