
Refactor rosindex to use rosdistro cache instead of crawling all packages #444

Open
tfoote opened this issue Oct 31, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@tfoote
Member

tfoote commented Oct 31, 2024

The rosdistro cache is actively maintained by the OSRF buildfarm (https://github.com/ros-infrastructure/rosdistro), and it contains effectively all of the content that we need for the index, including all of the distro information as well as the package.xml content. We can likely reduce most or even all of the crawling by iterating with the rosdistro API and leveraging the cache to build this site much faster.

@rkent this is what I was mentioning about avoiding the full crawl in Jekyll.
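For illustration, a minimal sketch of what that iteration could look like with the rosdistro Python API (error handling and distro filtering omitted; I believe `get_cached_distribution` and `get_release_package_xml` are the relevant calls):

```python
# Minimal sketch: walk the rosdistro index and read package.xml content
# straight from each distribution cache, with no repository cloning.
from catkin_pkg.package import parse_package_string
from rosdistro import get_cached_distribution, get_index, get_index_url

index = get_index(get_index_url())
for dist_name in index.distributions:
    dist = get_cached_distribution(index, dist_name)
    for pkg_name in dist.release_packages:
        pkg_xml = dist.get_release_package_xml(pkg_name)
        if pkg_xml is None:
            continue
        pkg = parse_package_string(pkg_xml)
        print(dist_name, pkg.name, pkg.version)
```

Each distribution cache already carries the package.xml strings, so nothing needs to be cloned for this pass.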

@tfoote tfoote added the enhancement New feature or request label Oct 31, 2024
@rkent

rkent commented Oct 31, 2024

I fully agree with this approach, and I'll work on it at some point. I'd first like to finish the new search table for system dependencies, which will allow us to delete a lot of code around paging for package and dependency lists.

I think that the goal, as I believe you stated at ROSCon 2024, is to move rosindex to be mostly a rendering app, with any crawling and scraping done in a separate step.

The repo crawling is done by the build farm, but there are other items that need crawling that are not. I made a brief Discord comment about this. Some items that I am already running in my dev version rely on offline scraping, currently done in a GitHub Action under a personal rkent repo, which is not intended as a long-term solution. I need to resolve the offline scraping issue to get some of that landed. What I am leaning toward doing for now is merging the content of the rkent repo into rosindex, initially set up as GitHub Actions under rosindex, but allowing an easy transition to the build farm if that is the desired long-term solution.

@tfoote
Member Author

tfoote commented Nov 8, 2024

If we need to do other offline scraping processes, what I'd suggest is that we set these up as separate processes. For local builds/development you can run them in an appropriate sequence. On the buildfarm we can run them as separate jobs, like we do for the rosdistro cache, and pass the generated artifact(s) to the rendering process. We can also make those intermediate artifacts publicly available in the same way that we do with the rosdistro cache, so that unless you're debugging/developing those or running a custom instance, most users never need to do the crawl.

@rkent

rkent commented Nov 8, 2024

What exactly do you mean by "separate processes"?

@tfoote
Member Author

tfoote commented Nov 8, 2024

They can be invoked as commands on the command line and generate file outputs that can be used as inputs for the rendering, as opposed to our current implementation, which does all of the crawling inside the Jekyll Ruby process.
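Entirely illustrative (the script name, flag, and artifact layout below are hypothetical, not existing tools), the shape of such a process could be:

```python
#!/usr/bin/env python3
# Hypothetical sketch of a standalone crawl step: invoked from the command
# line, it writes a JSON artifact that the Jekyll rendering step then reads.
import argparse
import json


def crawl():
    # Placeholder for the actual crawl; returns whatever the renderer needs.
    return {'packages': {}}


def main():
    parser = argparse.ArgumentParser(
        description='Generate a rosindex crawl artifact.')
    parser.add_argument('--output', default='crawl_artifact.json')
    args = parser.parse_args()
    with open(args.output, 'w') as f:
        json.dump(crawl(), f, indent=2)


if __name__ == '__main__':
    main()
```

The rendering step would then take the artifact file as an input, and the buildfarm could publish that same file for anyone who wants to skip the crawl.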

@rkent

rkent commented Nov 8, 2024

Looking at this, one issue is that the distro cache mostly provides the package.xml content, while rosindex also needs other files, like README and CONTRIBUTING. I don't see how we can avoid continuing to download entire repos instead of just the rosdistro_cache files.

@tfoote
Member Author

tfoote commented Nov 12, 2024

Yeah, that's why we may need to set up another crawler/cache for the other content as a separate step. If we do that in a separate crawl process and cache the content, we can decouple the crawl from the build. Making it a separate process that can potentially run longer, and that focuses on being efficient and on caching, makes the website easier to maintain (as just a build/render cycle) and makes the data potentially available for more uses too. The rosdistro crawler has special logic for fetching specific files when the content is on a standard web host like GitHub or Bitbucket, but falls back to cloning if the host is not one of those (sketched below).
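A rough sketch of that fetch-or-clone idea; the raw URL pattern is GitHub's, but the function name and fallback logic are illustrative rather than rosdistro's actual implementation:

```python
# Rough sketch: fetch a single file via the raw host when possible,
# otherwise shallow-clone the repository and read it from disk.
import subprocess
import tempfile
import urllib.error
import urllib.request


def fetch_file(repo_url, branch, path):
    if repo_url.startswith('https://github.com/'):
        base = repo_url[:-4] if repo_url.endswith('.git') else repo_url
        raw_base = base.replace('https://github.com/',
                                'https://raw.githubusercontent.com/')
        try:
            with urllib.request.urlopen(
                    '%s/%s/%s' % (raw_base, branch, path)) as response:
                return response.read().decode('utf-8')
        except urllib.error.URLError:
            pass  # fall through to the clone fallback below
    # Fallback for other hosts: shallow-clone and read the file.
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.check_call(['git', 'clone', '--depth', '1',
                               '--branch', branch, repo_url, tmp])
        with open('%s/%s' % (tmp, path)) as f:
            return f.read()
```

For example, `fetch_file('https://github.com/ros-infrastructure/rosdistro', 'master', 'README.md')` would hit the raw host and never clone.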

@rkent

rkent commented Nov 13, 2024

Here's an update on my overall plan for moving forward with rosindex, of which this issue is a piece.

I first want to finish the combined list & search table for packages and dependencies, meaning I want to do a Tabulator-based list for dependencies, then remove the existing pagination-based package list, package search, dependency list, and dependency search code.

Next, I want to take the various separate preliminary steps of rosindex (discover, update, scrape, etc.), which may rely partly on the cache for persistence, and instead make them separate Python scripts that rely on files rather than the cache for communication, roughly as sketched below. Initially those will still be called from rosindex_generator.rb as part of a unified rosindex build, as is done currently, but the plan is to eventually allow those separate steps to be scheduled independently if desired, or even to use shared file resources in a buildfarm. The request in this issue is one piece of that overall plan.
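Illustrative only; the step names, file names, and function bodies below are hypothetical placeholders for how those scripts might hand data to each other through files:

```python
# Hypothetical sketch: preliminary rosindex steps as standalone functions
# that communicate through files instead of the Jekyll cache.
import json


def discover(out_path='discovered_repos.json'):
    repos = {}  # would be populated from the rosdistro index/cache
    with open(out_path, 'w') as f:
        json.dump(repos, f, indent=2)


def scrape(in_path='discovered_repos.json', out_path='scraped_content.json'):
    with open(in_path) as f:
        repos = json.load(f)
    # Would fetch README, CONTRIBUTING, etc. for each discovered repo.
    scraped = {name: {} for name in repos}
    with open(out_path, 'w') as f:
        json.dump(scraped, f, indent=2)


if __name__ == '__main__':
    # Run in sequence for a local build; on a buildfarm each step could be
    # scheduled independently and publish its output file as a shared artifact.
    discover()
    scrape()
```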

@tfoote
Member Author

tfoote commented Nov 21, 2024

That sounds like a great plan.
