[Test] Use a new library "rpmrepo_metadata" for metadata parsing instead of createrepo_c #2410

dralley · 2022-02-21T15:34:15Z

[noissue]

dralley · 2022-02-21T23:38:20Z

This is still a proof of concept, we would need to have a broader discussion about it since it impacts multiple people / teams. It definitely is not "decided" or set in stone whatsoever. This is the beginning of the discussion!

Motivations against doing this:

There is always the possibility for undiscovered logic bugs, although the acceptance test results seem to indicate we're OK on that front.

The build team probably hasn't dealt with Rust packages before, there could be challenges.

createrepo_c remains the "official" solution with a pool of existing maintainers. The bus factor is higher since we would be doing our own thing, in a less common language no less. Theoretically there could be metadata changes in the future and we'd be on the hook for supporting that. But all of this is slightly offset for by the situation where we currently have to do a lot of parsing ourselves for memory reasons anyway, and difficulty in using / creating the new parsing APIs.

Motivations for doing this:

We've hit a number of problems with createrepo_c, in particular:

edit: removed a claim that probably isn't true.

The way the code is structured is fundamentally so complicated that fixing or extending it to work the way we need it to is very difficult and maybe not possible without a rewrite
Memory consumption is poor - there is a new API which improves the situation, but it is impacted even moreso by the points above. There are memory leaks and segfaults [0] and double-free issues that I have poured hours into trying to solve without much success. We also cannot use the new API, and I poured even more hours into trying to make it work for us [1] but everything got bogged down in the insanely complicated object lifetimes and memory management issues.

[0] rpm-software-management/createrepo_c#305
[1] rpm-software-management/createrepo_c#295

[edit] To be clear, these issues can be fixed, probably, but it feels likely to either take a lot of time or require some extremely fragile and perhaps difficult to maintain code. It's difficult to make a decision here until we come to a conclusion w/ createrepo_c upstream about the viability of fixing them in the short-term.

This new library started as a hobby project a year ago but over the past few weeks seems like it may be a legitimate option going forwards.

In terms of correctness, it avoids all the known pitfalls of createrepo_c with regards to the way parsing works, The API is vastly more straightforwards - you can see that from the diff - and the same is true of the code in general. Enough so that it seems plausible that a non-Rust dev would have an easier time understanding and fixing bugs. Structurally it is much simpler and more direct, whereas the createrepo_c code does juggling objects between callbacks and various lists. There is a lot less room for things to go wrong.

We've also hit segfaults [0] [1], memory corruption bugs [2] [3] and memory leaks [4] [5] [6] with createrepo_c in the past, and Rust goes a very long way towards avoiding all of that.

[0] rpm-software-management/createrepo_c#294
[1] rpm-software-management/createrepo_c#248
[2] rpm-software-management/createrepo_c#244
[3] rpm-software-management/createrepo_c#297
[4] rpm-software-management/createrepo_c#231
[5] rpm-software-management/createrepo_c#233
[6] rpm-software-management/createrepo_c#282

Additionally there is another correctness bug (which is very fixable unlike the others - but not fixed yet) which means we're currently saving bad metadata for a specific package rpm-software-management/createrepo_c#286

So those are the problems we're currently facing. Benefits of the new library, apart from addressing those:

The API is vastly easier to use and lets us do things we couldn't do before, like get the number of packages ahead of time without actually parsing one of the metadata files and counting all the packages, which helps with the progress bar code.

The memory consumption is improved by about ~600mb on a full sync of RHEL 7, averaging around 1.5gb and peaking at 1.7, compared to 2.1 and 2.3 gb for the existing code.

The overhead of the parsing is now measured as being only a few megabytes (between 10 and 50, usually ~20), so Brian's sync pipeline memory work will be needed to get it down any lower than this.

Since the typical case where a user might experience OOMs is syncing many repos simultaneously, this could add up to multiple gb of memory savings depending on which repos are syncing. It also eliminates the peaks - consumption should stay "constant-ish" rather than peaking immediately at the beginning of the sync.

Sync time is also improved quite a bit, from 570 seconds to 450 seconds for a RHEL 7 sync. It was initially worse (800 seconds), but I was able to figure out why in about 30 minutes of investigation and fix it on my local build.

I have acceptance tests against the output of createrepo_c, and they are ironically failing because of that aforementioned bug rpm-software-management/createrepo_c#286

fao89 · 2022-02-22T19:01:19Z

.ci/ansible/Containerfile.j2

@@ -1,5 +1,7 @@
 FROM {{ ci_base | default("ghcr.io/pulp/pulp-ci-centos:" + pulp_container_tag) }}

+RUN yum install -y rustc cargo patchelf


we are going 🦀

🎉🎉🎉

Well, maybe! Like I said there needs to be a discussion first.. but I definitely have a preference.

re: this particular line, we should be able to ship pre-built packages and it wouldn't be necessary to have these packages installed on the host. This is just to get the tests working without having to set all that up.

Also I'm still discussing it would Ales upstream, if he manages to get it to work, it would be a little hard to justify switching.

Although I still have some concerns about complicated it is getting, architecturally.

[noissue]

dralley force-pushed the new-parsing-api branch 11 times, most recently from c8ec0e1 to 46862f3 Compare February 21, 2022 23:24

dralley force-pushed the new-parsing-api branch 3 times, most recently from b7fd1b6 to 4a9b027 Compare February 22, 2022 03:15

dralley changed the title ~~[Test] Swap createrepo_c for rpmrepo_metadata~~ [Test] Use a new library "rpmrepo_metadata" for metadata parsing instead of createrepo_c Feb 22, 2022

fao89 reviewed Feb 22, 2022

View reviewed changes

Swap createrepo_c for rpmrepo_metadata

dc1c8e8

[noissue]

dralley force-pushed the new-parsing-api branch from 4a9b027 to dc1c8e8 Compare April 4, 2022 17:21

dralley closed this May 12, 2022

dralley deleted the new-parsing-api branch June 1, 2022 18:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Test] Use a new library "rpmrepo_metadata" for metadata parsing instead of createrepo_c #2410

[Test] Use a new library "rpmrepo_metadata" for metadata parsing instead of createrepo_c #2410

dralley commented Feb 21, 2022

dralley commented Feb 21, 2022 •

edited

Loading

fao89 Feb 22, 2022

dralley Feb 22, 2022

dralley Feb 22, 2022

		@@ -1,5 +1,7 @@
		FROM {{ ci_base \| default("ghcr.io/pulp/pulp-ci-centos:" + pulp_container_tag) }}

		RUN yum install -y rustc cargo patchelf

[Test] Use a new library "rpmrepo_metadata" for metadata parsing instead of createrepo_c #2410

[Test] Use a new library "rpmrepo_metadata" for metadata parsing instead of createrepo_c #2410

Conversation

dralley commented Feb 21, 2022

dralley commented Feb 21, 2022 • edited Loading

Motivations against doing this:

Motivations for doing this:

fao89 Feb 22, 2022

Choose a reason for hiding this comment

dralley Feb 22, 2022

Choose a reason for hiding this comment

dralley Feb 22, 2022

Choose a reason for hiding this comment

dralley commented Feb 21, 2022 •

edited

Loading