feat(backend): Don't store null fields of processed data for perf, emit exactly current schema in get-released-data - imputing with null #3232
Conversation
I don't think this is explicitly defined anywhere, but it's our current practice: anything that's in external metadata is also in metadata.
…nformation is more helpful when debugging
…lways emit all metadata fields per schema. Otherwise slight refactors to reduce verbosity/increase DRYness
…hat aren't in schema, add fields as null if not present in processed data
…tabase (#3233)
* Don't store null values of processed data for performance
* Automated backend code formatting
* Dummy for CI
* Try out with persistent db to investigate config changes
* Consider throwing out null part of compression, much nicer code!
* woops
* remove unnecessary code
Co-authored-by: Loculus bot <[email protected]>
get-released-data, adding null fields and filtering out fields not in schema
How does this play together with data that is already in the database? If I interpret the current code correctly, it will decompress all processed data. Will this work with uncompressed data in the database?
How much space does this actually save? Metadata isn't that large in terms of memory demand. I wonder whether the overhead (in code complexity and runtime) is worth the trouble?
...end/src/test/kotlin/org/loculus/backend/controller/submission/GetReleasedDataEndpointTest.kt
Thanks @corneliusroemer, this looks great to me! I just wanted to ask two follow-ups - otherwise I'm happy to approve.
Re @fengelniederhammer's comments:
Compression is not really what's happening here, so the name gives the wrong idea. We're not compressing metadata, we're just not storing null fields. Everything is null unless it's present with a non-null value. But there's no harm in having extra nulls in there; the "decompress" code doesn't care, it just adds nulls back when necessary. As a result, all data already in the db that was never compressed is perfectly valid and can be "uncompressed" without problems or loss of anything.
Saving storage (and the amount of data to be streamed by the db, which is just as much of a bottleneck) isn't the only reason for this, but it's a nice extra at no cost - well, the cost is this one line, which I hope you agree is negligible: loculus/backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt Line 121 in d78c8b8
I find it hard to argue why we should keep those nulls around if it's so simple to get rid of them.
If we want to be able to easily add metadata fields without having to reprocess then we need the "decompression" step, whether we keep nulls around or not. So that slight complexity is required: loculus/backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt Lines 89 to 96 in 6716149
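To make that concrete, here is a minimal sketch of the two steps - dehydrate/hydrate are hypothetical names I'm using for illustration here, not the actual functions in CompressionService.kt:

```kotlin
// Minimal sketch with hypothetical names; the real logic lives in
// CompressionService.kt at the lines referenced above.

// "Compression": drop null-valued metadata fields before storing.
fun dehydrate(metadata: Map<String, Any?>): Map<String, Any?> =
    metadata.filterValues { it != null }

// "Decompression": re-add null for every field the current schema defines,
// and drop fields that are no longer in the schema.
fun hydrate(stored: Map<String, Any?>, schemaFields: Set<String>): Map<String, Any?> =
    schemaFields.associateWith { field -> stored[field] }
```

The hydration step is what lets us add or remove metadata fields without reprocessing: missing fields come back as null, and fields dropped from the schema disappear.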
But are we really arguing over 4 lines as complexity? I'm actually quite proud of the lack of complexity of the solution (mostly because you've set things up so nicely with the compression service, so full credit to you @fengelniederhammer). For runtime, I'm very sure that this is almost entirely negligible. And we agreed that we would do this in #2793 and related PRs (where you said you were happy for the backend to always emit what the schema defines ;) )
Re @anna-parker's comments:
That's correct - though compression is really not the right word for metadata. It's more like dehydration: throwing out things that are not required (but that also do no harm). The actual compression is entirely optional; it has the slight benefit of saving space and is otherwise irrelevant. But it takes just one line of code, so why not.
Excellent, that's exactly what I was thinking of! Make externality an optional extra of a metadata field. We should probably do this in a separate PR - it should be pretty straightforward!
Oh I see. I misread the code (I just had a quick look over it). I think it would be cleaner if we did not hide metadata processing in "compress" and "decompress" functions. Let's split that functionality out into a dedicated place?
Re how much storage is saved: running this script from ChatGPT on this branch vs main gives:
This branch (compressed)
main
So we're shaving off 30% of total size, 20% of table size, and 30% of TOAST size for our largest table, the preprocessed data. In an actual deployment that table is an even larger share of the total. This is a cheap, substantial perf win - do you agree now @fengelniederhammer :) Just out of curiosity, here's our current prod Pathoplexus result; preprocessed data is almost everything (we should truncate to speed things up again, we don't need old versions of processing lying around).
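For reference, the sizes can be queried with Postgres's built-in size functions. This is a minimal sketch of such a check wrapped in Kotlin/JDBC - not the actual script referenced above (which isn't shown here); the JDBC URL and credentials are placeholders:

```kotlin
import java.sql.DriverManager

// Sketch: measure main-fork, (approximate) TOAST, and total size of the
// preprocessed-data table. pg_table_size also counts FSM/VM, so toast_size
// is an approximation.
fun main() {
    val sql = """
        SELECT pg_size_pretty(pg_relation_size(c.oid))                        AS table_size,
               pg_size_pretty(pg_table_size(c.oid) - pg_relation_size(c.oid)) AS toast_size,
               pg_size_pretty(pg_total_relation_size(c.oid))                  AS total_size
        FROM pg_class c
        WHERE c.relname = 'sequence_entries_preprocessed_data'
    """.trimIndent()
    DriverManager.getConnection("jdbc:postgresql://localhost:5432/loculus", "user", "password").use { conn ->
        conn.createStatement().use { stmt ->
            stmt.executeQuery(sql).use { rs ->
                while (rs.next()) {
                    println(
                        "table=${rs.getString("table_size")} " +
                            "toast=${rs.getString("toast_size")} " +
                            "total=${rs.getString("total_size")}"
                    )
                }
            }
        }
    }
}
```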
I potentially disagree; I think this is a great place. Note that I renamed the function to no longer mention sequences, since we now compress both sequences and metadata. I agree it might be confusing to you, since when you wrote this it only compressed sequences - but that's just because you wrote the code. Maybe we could rename the service to something more general than compression. I wouldn't move it somewhere else; that would just complicate the logic - the fact that the required changes are so small is a good sign that this is the right place to make them. I first had a separate service, but that would then have to be called in a number of places per endpoint (the way I had set it up) - now this is much cleaner. The current location is good because we want this hydration/dehydration (or compression or whatever) to happen in one place, before things go into the db and when they come out. And it's natural to do it in the same place where we compress sequences. @fengelniederhammer feel free to propose a split into something that makes sense to you, maybe in a PR that targets this PR - I'm probably thinking of something different (and worse) than what you have in mind.
-> #3241
…ion (#3241)
* sketch: do not hide metadata processing in sequence compression function
* Make test work again, mention filtering out extra fields
* use hamcrest assertThat for better error messages
Co-authored-by: Cornelius Roemer <[email protected]>
Looks great - I checked that this should do the same thing as before :-)
backend/src/test/kotlin/org/loculus/backend/service/ProcessedMetadataPostprocessorTest.kt
backend/src/main/kotlin/org/loculus/backend/service/submission/ProcessedDataPostprocessor.kt
Great, thank you!
…etadataPostprocessorTest.kt
Co-authored-by: Anna (Anya) Parker <[email protected]>
resolves #2793
preview URL: https://enforce-schema.loculus.org
What does this PR solve?
This PR resolves 2 issues discussed among others in #2793 but recapitulated here:
1. Null fields of processed data are stored in sequence_entries_preprocessed_data.processed_data in the db - in Pathoplexus this wastage might even be 50% or more of total row size, reaching a point that's worth optimizing.
2. sequence_entries_preprocessed_data.processed_data is passed as-is to the get-released-data consumer, and SILO's preprocessing doesn't like missing (or extra as well, I think) fields in NDJSON. This created a lot of friction in deployment.
What does this PR do?
This PR introduces the idea of metadata compression, in parallel to the existing sequence compression. This fits very nicely into the existing code base, making the actual code (non-test) changes trivial.
Compression simply drops null values:
loculus/backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt Line 121 in 6716149
Decompression adds null fields back for everything in the current schema (and filters out fields that are no longer in the schema):
loculus/backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt Lines 89 to 96 in 6716149
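To illustrate the effect with made-up data (hypothetical field names, assuming a schema of country, date, host):

```kotlin
// Illustration only - hypothetical fields and values.
val fromPipeline = mapOf("country" to "CH", "date" to null, "host" to null)

// Stored in sequence_entries_preprocessed_data.processed_data: nulls dropped.
val stored = mapOf("country" to "CH")

// Emitted by get-released-data: nulls imputed for all current schema fields;
// fields no longer in the schema would be filtered out.
val emitted = mapOf("country" to "CH", "date" to null, "host" to null)
```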
Aside: should external metadata always be a subset of metadata?
This is kind of a side discussion, only relevant because the decompression code needs to know the full set of metadata fields that should be filled with null.
As part of adapting tests to the new paradigm, I noticed that the backend config is a little unclear regarding external metadata:
For me, it seems most logical for metadata to be the thing that the backend emits in get-released-data. Whether we allow an externalMetadataSubmitter to override a field or not is (in my understanding) a property of the metadata field. For now this difference doesn't matter other than in the way I resolved the tests: in the backend tests I use my interpretation, enforcing that externalMetadata is a subset of metadata. We could relax this in the future, but that would require test changes.
It might make more sense eventually, though, to get rid of the "externalMetadata" part of the config and instead add extra optional fields to the "metadata" field definitions, stating whether they are allowed to be overridden by an external metadata submitter and, if so, by which user.
If we want to keep things loose, like the tests were until now, we can adapt this PR; it would then just have to iterate over both the metadata and external metadata schemas to get the full set of expected fields (see the sketch below). Also: how would we treat cases where the field names are identical but the types differ?
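A sketch of what the two interpretations would mean for the decompression step, using hypothetical helper names and operating on plain field-name sets:

```kotlin
// Strict interpretation (what this PR's tests enforce): externalMetadata must be
// a subset of metadata, so the metadata schema alone defines what is emitted.
fun requireExternalIsSubset(metadata: Set<String>, external: Set<String>) {
    val extra = external - metadata
    require(extra.isEmpty()) { "external metadata fields not in metadata: $extra" }
}

// Relaxed interpretation: the set of fields to impute with null would be the
// union of both schemas.
fun expectedFields(metadata: Set<String>, external: Set<String>): Set<String> =
    metadata + external
```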
Screenshot
No more nulls in processed data:
Testing
I've written a comprehensive unit test for the metadata compression/decompression behavior.
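A round-trip test for this behavior could look roughly like the following - a hypothetical shape reusing the dehydrate/hydrate sketch from the conversation above, not the actual test in this PR:

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

// Hypothetical helpers matching the sketch earlier in this thread.
fun dehydrate(metadata: Map<String, Any?>) = metadata.filterValues { it != null }
fun hydrate(stored: Map<String, Any?>, schema: Set<String>) = schema.associateWith { stored[it] }

class MetadataRoundTripTest {
    @Test
    fun `hydration restores exactly the current schema fields`() {
        val schema = setOf("country", "date", "host")
        val processed = mapOf("country" to "CH", "date" to null, "obsolete" to "x")

        // Dehydrating drops nulls (and keeps stale fields); hydrating against the
        // current schema imputes nulls and filters out fields not in the schema.
        assertEquals(
            mapOf("country" to "CH", "date" to null, "host" to null),
            hydrate(dehydrate(processed), schema),
        )
    }
}
```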
It's a bit tricky to write endpoint test cases for this - the relevant behavior to test would be changes of backend config (before and after addition/removal of a metadata field) which I couldn't figure out how to test well.
The changes did make themselves felt in a couple of existing tests, though. I checked thoroughly that we can now add and remove metadata fields on a persistent db and that things work as expected, e.g. 4e884b6 and 59833e1.
PR Checklist