
feat(backend): Don't store null fields of processed data for perf, emit exactly current schema in get-released-data - imputing with null #3232

Merged
merged 25 commits into main from enforce-schema on Nov 20, 2024

Conversation

@corneliusroemer (Contributor) commented Nov 18, 2024

resolves #2793

preview URL: https://enforce-schema.loculus.org

What does this PR solve?

This PR resolves two issues that were discussed, among other things, in #2793 and are recapitulated here:

  1. We were wasting db storage and bandwidth by storing many metadata fields with null values in sequence_entries_preprocessed_data.processed_data. In Pathoplexus this wastage might be 50% or more of total row size, which has reached a point that's worth optimizing.
  2. Whenever we wanted to add a new metadata field, we had to reprocess the data, because the backend would rigidly send exactly what it had in sequence_entries_preprocessed_data.processed_data to the get-released-data consumer, and SILO's preprocessing doesn't tolerate missing (or, I think, extra) fields in NDJSON. This created a lot of friction in deployment.

What does this PR do?

This PR introduces the idea of metadata compression, in parallel to the existing sequence compression. This fits very nicely into the existing code base, making the actual code (non-test) changes trivial.
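In essence (a minimal sketch only, not the actual service code; the function names here are invented for illustration), the two directions look like this:

import com.fasterxml.jackson.databind.JsonNode
import com.fasterxml.jackson.databind.node.NullNode

// On write: keep only metadata fields that actually carry a value.
fun dehydrateMetadata(metadata: Map<String, JsonNode>): Map<String, JsonNode> =
    metadata.filterNot { (_, value) -> value.isNull }

// On read: emit every field of the current schema, imputing null for anything
// missing and implicitly dropping stored fields that are no longer in the schema.
fun rehydrateMetadata(
    stored: Map<String, JsonNode>,
    schemaFieldNames: List<String>,
): Map<String, JsonNode> =
    schemaFieldNames.associateWith { fieldName -> stored[fieldName] ?: NullNode.instance }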

Aside: should external metadata always be a subset of metadata?

This is kind of a side discussion, only relevant because the decompression code needs to know the full set of metadata fields that should be filled with null.

As part of adapting tests to the new paradigm, I noticed that the backend config is a little unclear regarding external metadata:

  • should external metadata always be part of metadata? (this is what we do in practice right now via the helm chart), i.e. should external metadata fields be a subset of metadata
  • or should external metadata be allowed to have fields not in metadata? (this is what the backend test config was until now)

For me, it seems most logical for metadata to be what the backend emits in get-released-data. Whether we allow an externalMetadataSubmitter to override a field is, in my understanding, a property of that metadata field. For now this difference only matters for how I resolved the tests: in the backend tests I apply my interpretation and enforce that externalMetadata is a subset of metadata. We could relax this in the future, but that would require test changes.

Eventually, though, it might make more sense to get rid of the "externalMetadata" part of the config and instead add optional attributes to the "metadata" fields stating whether they may be overridden by an external metadata submitter and, if so, by which user.

If we want to keep things loose, as the tests were until now, we can adapt this PR; it would just have to iterate over both the metadata and external metadata schemas to get the full set of expected fields (roughly as sketched below). Also: how would we treat cases where field names are identical but their types differ?
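For reference, a rough sketch of what that looser variant would need (all names hypothetical, not code from this PR): build the union of both field lists before imputing.

// Hypothetical sketch: if externalMetadata could contain fields that are not in
// metadata, the backend would have to emit the union of both field name sets.
data class Field(val name: String)

fun expectedFieldNames(metadata: List<Field>, externalMetadata: List<Field>): Set<String> =
    (metadata.map { it.name } + externalMetadata.map { it.name }).toSet()

// e.g. expectedFieldNames(listOf(Field("country")), listOf(Field("insdcAccession")))
//      == setOf("country", "insdcAccession")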

Screenshot

No more nulls in processed data:

(Screenshot: pgAdmin 4, 2024-11-18 23:42 - processed_data rows without null metadata fields.)

Testing

I've written a comprehensive unit test for the metadata compression/decompression behavior.
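Roughly, the property that test covers looks like this (a sketch only, with invented field names; not the actual test file, and assuming a JUnit 5 / Jackson setup):

import com.fasterxml.jackson.databind.JsonNode
import com.fasterxml.jackson.databind.node.NullNode
import com.fasterxml.jackson.databind.node.TextNode
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

class MetadataNullRoundTripTest {
    private val schemaFields = listOf("country", "collectionDate", "host")

    @Test
    fun `nulls are dropped on write and imputed on read`() {
        val original: Map<String, JsonNode> = mapOf(
            "country" to TextNode("Switzerland"),
            "collectionDate" to NullNode.instance,
            "host" to NullNode.instance,
        )

        // "Compression": only non-null fields get stored.
        val stored = original.filterNot { (_, value) -> value.isNull }
        assertEquals(setOf("country"), stored.keys)

        // "Decompression": every schema field reappears, missing ones as null.
        val emitted = schemaFields.associateWith { stored[it] ?: NullNode.instance }
        assertEquals(original, emitted)
    }
}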

It's a bit tricky to write endpoint test cases for this: the relevant behavior to test would be a change of backend config (before and after adding or removing a metadata field), which I couldn't figure out how to test well.

The changes did show up in a couple of existing tests, though. I checked thoroughly that we can now add and remove metadata fields on a persistent db and that things work as expected, e.g. 4e884b6 and 59833e1

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

I don't think this is explicitly defined anywhere, but it's our current practice: anything that's in external metadata is also in metadata
…lways emit all metadata fields per schema.

Otherwise slight refactors to reduce verbosity / increase dryness
…hat aren't in schema, add fields as null if not present in processed data
@corneliusroemer marked this pull request as draft November 18, 2024 21:53
@corneliusroemer added the preview (Triggers a deployment to argocd) and format_me (Triggers github_actions to format website code on PR) labels Nov 18, 2024
…tabase (#3233)

* Don't store null values of processed data for performance

* Automated backend code formatting

* Dummy for CI

* Try out with persistent db to investigate config changes

* Consider throwing out null part of compression, much nicer code!

* woops

* remove unnecessary code

---------

Co-authored-by: Loculus bot <[email protected]>
@corneliusroemer changed the title from "Always emit correct schema in get-released-data, adding null fields and filtering out fields not in schema" to "feat(backend): Don't store null fields of processed data for perf, emit exactly current schema in get-released-data - imputing with null" Nov 18, 2024
@corneliusroemer changed the base branch from main to fix-test November 19, 2024 00:52
@corneliusroemer changed the base branch from fix-test to main November 19, 2024 00:55
@fengelniederhammer (Contributor) left a comment:

How does this play together with data that is already in the database? If I interpret the current code correctly, it will decompress all processed data. Will this work with uncompressed data in the database?

How much space does this actually save? Metadata isn't that large in terms of memory demand. I wonder whether the overhead (in code complexity and runtime) is worth the trouble?

@anna-parker (Contributor) commented Nov 19, 2024

Thanks @corneliusroemer, this looks great to me! I just wanted to ask two follow-ups - otherwise I'm happy to approve.

  1. Do I understand correctly that only newly processed data will be compressed, unless we bump the preprocessing pipeline version, in which case all data will be reprocessed and then compressed?

  2. About the external metadata: although I originally wrote external metadata so it could be disjoint from the metadata, now that I understand our code ecosystem better this seems like an unnecessary complication - one we don't use anyway. So I am fine with enforcing that external metadata is always a subset of the full metadata.
    However, with the current config structure this doesn't have to be the case and we don't check for it. I think we should change the config structure as you suggested - e.g. add an additional optional field to the metadata fields which specifies that a field is external metadata. Even better, we could have the backend expect not two separate external metadata and metadata dictionaries but one metadata dictionary with this field. Does that sound alright to you?

@corneliusroemer (Contributor, Author)

Re @fengelniederhammer's comments:

How does this play together with data that is already in the database? If I interpret the current code correctly, it will decompress all processed data. Will this work with uncompressed data in the database?

Compression is not really what's happening here, so that term gives the wrong idea. We're not compressing metadata, we're just not storing null fields. Everything is null unless it's present with a non-null value. But there's no harm in having extra nulls in there; the "decompress" code doesn't care, it just adds nulls back where necessary.

As a result, all previously uncompressed data already in the db is perfectly valid and can be "decompressed" without any problem or loss.
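To illustrate (values and helper invented, not project code): rehydrating an old row that still contains explicit nulls gives exactly the same result as rehydrating a new row where they were stripped.

import com.fasterxml.jackson.databind.JsonNode
import com.fasterxml.jackson.databind.node.NullNode
import com.fasterxml.jackson.databind.node.TextNode

fun rehydrate(stored: Map<String, JsonNode>, fields: List<String>): Map<String, JsonNode> =
    fields.associateWith { stored[it] ?: NullNode.instance }

fun main() {
    val fields = listOf("country", "host")
    val oldRow: Map<String, JsonNode> =
        mapOf("country" to TextNode("Kenya"), "host" to NullNode.instance) // row stored before this PR
    val newRow: Map<String, JsonNode> =
        mapOf("country" to TextNode("Kenya"))                              // row stored after this PR
    println(rehydrate(oldRow, fields) == rehydrate(newRow, fields))        // prints: true
}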

How much space does this actually save? Metadata isn't that large in terms of memory demand.

Saving storage (and the amount of data the db has to stream, which is just as much of a bottleneck) isn't the main or only reason for this, but it's a nice extra at almost no cost - the cost being this line, which I hope you agree is negligible:

processedData.metadata.filterNot { (_, value) -> value.isNull },

I find it hard to argue why we should keep those nulls around if it's so simple to get rid of them.

I wonder whether the overhead (in code complexity and runtime) is worth the trouble?

If we want to be able to easily add metadata fields without having to reprocess then we need the "decompression" step, whether we keep nulls around or not. So that slight complexity is required:

backendConfig
    .getInstanceConfig(organism)
    .schema
    .metadata
    .map { it.name }
    .associateWith { fieldName ->
        processedData.metadata[fieldName] ?: NullNode.instance
    },

But are we really arguing over four lines of complexity? I'm actually quite proud of how little complexity the solution adds (mostly because you set things up so nicely with the compression service, so full credit to you @fengelniederhammer). As for runtime, I'm quite sure the cost is almost entirely negligible. And we agreed that we would do this in #2793 and related PRs (where you said you were happy for the backend to always emit what the schema defines ;) )

@corneliusroemer (Contributor, Author)

Re @anna-parker's comments:

Do I understand correctly that only newly processed data will be compressed, unless we bump the preprocessing pipeline version, in which case all data will be reprocessed and then compressed?

That's correct - though "compression" is really not the right word for metadata. It's more like dehydration: throwing out things that are not required (but also do no harm). This compression is entirely optional, has the slight benefit of saving space, and is otherwise irrelevant. But it takes just one line of code, so why not.

Even better, we could have the backend expect not two separate external metadata and metadata dictionaries but one metadata dictionary with this field.

Excellent, that's exactly what I was thinking of! Make externality an optional extra of a metadata field. We should probably do this in a separate PR - it should be pretty straightforward!
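Purely as a hypothetical sketch of that future shape (all names invented; this is not the current backend config model): externality becomes a per-field attribute instead of a separate dictionary.

// Hypothetical future config shape, NOT what the backend reads today:
data class MetadataFieldConfig(
    val name: String,
    val type: String,
    // null = regular field; otherwise the external submitter allowed to override it.
    val externalMetadataSubmitter: String? = null,
)

// Example: one ordinary field and one field an external submitter may set.
val schemaMetadata = listOf(
    MetadataFieldConfig(name = "country", type = "string"),
    MetadataFieldConfig(name = "insdcAccessionBase", type = "string", externalMetadataSubmitter = "ena-deposition-bot"),
)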

@fengelniederhammer (Contributor)

Compression is not really what's happening here, so it seems to give the wrong idea.

Oh I see. I misread the code (just had a quick look over it). I think it would be cleaner if we did not hide metadata processing in "compress" and "decompress" functions. Let's split out that functionality to a dedicated place?

@corneliusroemer (Contributor, Author) commented Nov 19, 2024

Re how much storage is saved, running this script from ChatGPT

SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
    pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS indexes_size,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename) - pg_relation_size(schemaname || '.' || tablename) - pg_indexes_size(schemaname || '.' || tablename)) AS toast_size
FROM
    pg_tables
WHERE
    schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY
    pg_total_relation_size(schemaname || '.' || tablename) DESC;

on this branch vs main gives:

This branch (compressed)

      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 74 MB      | 28 MB      | 2512 kB      | 43 MB
 public                | sequence_entries                   | 57 MB      | 37 MB      | 2120 kB      | 17 MB
... (truncating small/irrelevant tables)

main

      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 104 MB     | 35 MB      | 3160 kB      | 66 MB
 public                | sequence_entries                   | 56 MB      | 37 MB      | 2144 kB      | 17 MB
... (truncating small/irrelevant tables)

So for our largest table, the preprocessed data, we're shaving off roughly 30% of total size, 20% of table size and about 35% of toast size. In an actual deployment that table makes up an even larger share.

This is a cheap but substantial perf win - do you agree now, @fengelniederhammer? :)

Just out of curiosity, here's our current prod Pathoplexus result; preprocessed data is almost everything (we should truncate it to speed things up again - we don't need old versions of processing lying around):

      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 1127 MB    | 218 MB     | 21 MB        | 887 MB
 public                | sequence_entries                   | 112 MB     | 57 MB      | 2704 kB      | 52 MB
 ena_deposition_schema | submission_table                   | 5264 kB    | 120 kB     | 48 kB        | 5096 kB
 public                | audit_log                          | 4336 kB    | 3896 kB    | 88 kB        | 352 kB
 public                | data_use_terms_table               | 3272 kB    | 2304 kB    | 960 kB       | 8192 bytes
 ena_deposition_schema | assembly_table                     | 224 kB     | 168 kB     | 48 kB        | 8192 bytes
 public                | external_metadata                  | 200 kB     | 144 kB     | 48 kB        | 8192 bytes
 ena_deposition_schema | sample_table                       | 200 kB     | 144 kB     | 48 kB        | 8192 bytes
 public                | user_groups_table                  | 64 kB      | 8192 bytes | 48 kB        | 8192 bytes
 ena_deposition_schema | flyway_schema_history              | 48 kB      | 8192 bytes | 32 kB        | 8192 bytes
 public                | flyway_schema_history              | 48 kB      | 8192 bytes | 32 kB        | 8192 bytes
 ena_deposition_schema | project_table                      | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | sequence_upload_aux_table          | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqset_records                     | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | groups_table                       | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | metadata_upload_aux_table          | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | table_update_tracker               | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqsets                            | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqset_to_records                  | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | current_processing_pipeline        | 24 kB      | 8192 bytes | 16 kB        | 0 bytes

@corneliusroemer (Contributor, Author) commented Nov 19, 2024

I think it would be cleaner if we did not hide metadata processing in "compress" and "decompress" functions. Let's split out that functionality to a dedicated place?

I potentially disagree; I think this is a great place for it. Note that I renamed the function so it no longer mentions sequences, since we now compress both sequences and metadata. I can see that it might be confusing, because when you wrote this it only compressed sequences - but that's just because you wrote the code.

Maybe we could rename the service to something more general than compression. I wouldn't move it somewhere else, though; that would just complicate the logic - the fact that the required changes are so small is a good sign that this is the right place to make them.

I first had a separate service, but that would then have to be called in a number of places per endpoint (the way I had set it up) - now this is much cleaner.

The current location is good because we want this hydration/dehydration (or compression, or whatever we call it) to happen in one place, just before things go into the db and just after they come out of it. And it's natural to do that in the same place where we compress sequences.

@fengelniederhammer feel free to propose a split into something that makes sense for you, maybe in a PR that targets this PR here - I'm probably thinking of something different (and worse) than what you have in mind.

@fengelniederhammer (Contributor)

@fengelniederhammer feel free to propose a split into something that makes sense for you, maybe in a PR that targets this PR here - I'm probably thinking of something different (and worse) than what you have in mind.

-> #3241

fengelniederhammer and others added 2 commits November 19, 2024 16:50
…ion (#3241)

* sketch: do not hide metadata processing in sequence compression function

* Make test work again, mention filtering out extra fields

* use hamcrest assertThat for better error messages

---------

Co-authored-by: Cornelius Roemer <[email protected]>
@anna-parker (Contributor) left a comment:

Looks great - I checked that this should do the same thing as before :-)

@chaoran-chen (Member) left a comment:

Great, thank you!

@corneliusroemer merged commit c75883f into main Nov 20, 2024
17 checks passed
@corneliusroemer deleted the enforce-schema branch November 20, 2024 14:20
Labels: preview (Triggers a deployment to argocd), format_me (Triggers github_actions to format website code on PR)

Closes: Don't put fields with null values in sequence_entries_preprocessed_data.processed_data