
feat(backend): Don't store null fields of processed data for perf, emit exactly current schema in get-released-data - imputing with null #3232

Merged
merged 25 commits into main from enforce-schema on Nov 20, 2024

Conversation

@corneliusroemer (Contributor) commented Nov 18, 2024

resolves #2793

preview URL: https://enforce-schema.loculus.org

What does this PR solve?

This PR resolves two issues that were discussed, among other things, in #2793 and are recapitulated here:

  1. We were wasting db storage and bandwidth by storing many metadata fields with null values in sequence_entries_preprocessed_data.processed_data. In Pathoplexus this wastage might be 50% or more of total row size, which has reached a point that's worth optimizing.
  2. Whenever we wanted to add a new metadata field, we had to reprocess the data, because the backend would rigidly send exactly what it had in sequence_entries_preprocessed_data.processed_data to the get-released-data consumer, and SILO's preprocessing doesn't tolerate missing (or, I think, extra) fields in NDJSON. This created a lot of friction in deployment.

What does this PR do?

This PR introduces the idea of metadata compression, in parallel to the existing sequence compression. This fits very nicely into the existing code base, making the actual code (non-test) changes trivial.
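In essence (a minimal sketch only, not the actual service code; the function names here are invented for illustration), the two directions look like this:

import com.fasterxml.jackson.databind.JsonNode
import com.fasterxml.jackson.databind.node.NullNode

// On write: keep only metadata fields that actually carry a value.
fun dehydrateMetadata(metadata: Map<String, JsonNode>): Map<String, JsonNode> =
    metadata.filterNot { (_, value) -> value.isNull }

// On read: emit every field of the current schema, imputing null for anything
// missing and implicitly dropping stored fields that are no longer in the schema.
fun rehydrateMetadata(
    stored: Map<String, JsonNode>,
    schemaFieldNames: List<String>,
): Map<String, JsonNode> =
    schemaFieldNames.associateWith { fieldName -> stored[fieldName] ?: NullNode.instance }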

Aside: should external metadata always be a subset of metadata?

This is kind of a side discussion, only relevant because the decompression code needs to know the full set of metadata fields that should be filled with null.

As part of adapting tests to the new paradigm, I noticed that the backend config is a little unclear regarding external metadata:

  • should external metadata always be part of metadata? (this is what we do in practice right now via the helm chart), i.e. should external metadata fields be a subset of metadata
  • or should external metadata be allowed to have fields not in metadata? (this is what the backend test config was until now)

For me, it seems most logical for metadata to be what the backend emits in get-released-data. Whether we allow an externalMetadataSubmitter to override a field is, in my understanding, a property of that metadata field. For now this difference only matters for how I resolved the tests: in the backend tests I apply my interpretation and enforce that externalMetadata is a subset of metadata. We could relax this in the future, but that would require test changes.

Eventually, though, it might make more sense to get rid of the "externalMetadata" part of the config and instead add optional attributes to the "metadata" fields stating whether they may be overridden by an external metadata submitter and, if so, by which user.

If we want to keep things loose, as the tests were until now, we can adapt this PR; it would just have to iterate over both the metadata and external metadata schemas to get the full set of expected fields (roughly as sketched below). Also: how would we treat cases where field names are identical but their types differ?
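For reference, a rough sketch of what that looser variant would need (all names hypothetical, not code from this PR): build the union of both field lists before imputing.

// Hypothetical sketch: if externalMetadata could contain fields that are not in
// metadata, the backend would have to emit the union of both field name sets.
data class Field(val name: String)

fun expectedFieldNames(metadata: List<Field>, externalMetadata: List<Field>): Set<String> =
    (metadata.map { it.name } + externalMetadata.map { it.name }).toSet()

// e.g. expectedFieldNames(listOf(Field("country")), listOf(Field("insdcAccession")))
//      == setOf("country", "insdcAccession")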

Screenshot

No more nulls in processed data:

(Screenshot: pgAdmin 4, 2024-11-18 23:42 - processed_data rows without null metadata fields.)

Testing

I've written a comprehensive unit test for the metadata compression/decompression behavior.
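Roughly, the property that test covers looks like this (a sketch only, with invented field names; not the actual test file, and assuming a JUnit 5 / Jackson setup):

import com.fasterxml.jackson.databind.JsonNode
import com.fasterxml.jackson.databind.node.NullNode
import com.fasterxml.jackson.databind.node.TextNode
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

class MetadataNullRoundTripTest {
    private val schemaFields = listOf("country", "collectionDate", "host")

    @Test
    fun `nulls are dropped on write and imputed on read`() {
        val original: Map<String, JsonNode> = mapOf(
            "country" to TextNode("Switzerland"),
            "collectionDate" to NullNode.instance,
            "host" to NullNode.instance,
        )

        // "Compression": only non-null fields get stored.
        val stored = original.filterNot { (_, value) -> value.isNull }
        assertEquals(setOf("country"), stored.keys)

        // "Decompression": every schema field reappears, missing ones as null.
        val emitted = schemaFields.associateWith { stored[it] ?: NullNode.instance }
        assertEquals(original, emitted)
    }
}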

It's a bit tricky to write endpoint test cases for this: the relevant behavior to test would be a change of backend config (before and after adding or removing a metadata field), which I couldn't figure out how to test well.

The changes did show up in a couple of existing tests, though. I checked thoroughly that we can now add and remove metadata fields on a persistent db and that things work as expected, e.g. 4e884b6 and 59833e1

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

I don't think this is explicitly defined anywhere, but it's our current practice: anything that's in external metadata is also in metadata
…lways emit all metadata fields per schema.

Otherwise slight refactors to reduce verbosity / increase dryness
…hat aren't in schema, add fields as null if not present in processed data
@corneliusroemer marked this pull request as draft November 18, 2024 21:53
@corneliusroemer added the preview (Triggers a deployment to argocd) and format_me (Triggers github_actions to format website code on PR) labels Nov 18, 2024
…tabase (#3233)

* Don't store null values of processed data for performance

* Automated backend code formatting

* Dummy for CI

* Try out with persistent db to investigate config changes

* Consider throwing out null part of compression, much nicer code!

* woops

* remove unnecessary code

---------

Co-authored-by: Loculus bot <[email protected]>
@corneliusroemer changed the title from "Always emit correct schema in get-released-data, adding null fields and filtering out fields not in schema" to "feat(backend): Don't store null fields of processed data for perf, emit exactly current schema in get-released-data - imputing with null" Nov 18, 2024
@corneliusroemer changed the base branch from main to fix-test November 19, 2024 00:52
@corneliusroemer changed the base branch from fix-test to main November 19, 2024 00:55
@fengelniederhammer (Contributor) left a comment:

How does this play together with data that is already in the database? If I interpret the current code correctly, it will decompress all processed data. Will this work with uncompressed data in the database?

How much space does this actually save? Metadata isn't that large in terms of memory demand. I wonder whether the overhead (in code complexity and runtime) is worth the trouble?

@anna-parker (Contributor) commented Nov 19, 2024

Thanks @corneliusroemer, this looks great to me! I just wanted to ask two follow-ups - otherwise I'm happy to approve.

  1. Do I understand correctly that only newly processed data will be compressed, unless we bump the preprocessing pipeline version, in which case all data will be reprocessed and then compressed?

  2. About the external metadata: although I originally wrote external metadata so it could be disjoint from the metadata, now that I understand our code ecosystem better this seems like an unnecessary complication - one we don't use anyway. So I am fine with enforcing that external metadata is always a subset of the full metadata.
    However, with the current config structure this doesn't have to be the case and we don't check for it. I think we should change the config structure as you suggested - e.g. add an additional optional field to the metadata fields which specifies that a field is external metadata. Even better, we could have the backend expect not two separate external metadata and metadata dictionaries but one metadata dictionary with this field. Does that sound alright to you?

@corneliusroemer (Contributor, Author)

Re @fengelniederhammer's comments:

How does this play together with data that is already in the database? If I interpret the current code correctly, it will decompress all processed data. Will this work with uncompressed data in the database?

Compression is not really what's happening here, so that term gives the wrong idea. We're not compressing metadata, we're just not storing null fields. Everything is null unless it's present with a non-null value. But there's no harm in having extra nulls in there; the "decompress" code doesn't care, it just adds nulls back where necessary.

As a result, all previously uncompressed data already in the db is perfectly valid and can be "decompressed" without any problem or loss.
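To illustrate (values and helper invented, not project code): rehydrating an old row that still contains explicit nulls gives exactly the same result as rehydrating a new row where they were stripped.

import com.fasterxml.jackson.databind.JsonNode
import com.fasterxml.jackson.databind.node.NullNode
import com.fasterxml.jackson.databind.node.TextNode

fun rehydrate(stored: Map<String, JsonNode>, fields: List<String>): Map<String, JsonNode> =
    fields.associateWith { stored[it] ?: NullNode.instance }

fun main() {
    val fields = listOf("country", "host")
    val oldRow: Map<String, JsonNode> =
        mapOf("country" to TextNode("Kenya"), "host" to NullNode.instance) // row stored before this PR
    val newRow: Map<String, JsonNode> =
        mapOf("country" to TextNode("Kenya"))                              // row stored after this PR
    println(rehydrate(oldRow, fields) == rehydrate(newRow, fields))        // prints: true
}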

How much space does this actually save? Metadata isn't that large in terms of memory demand.

Saving storage (and the amount of data the db has to stream, which is just as much of a bottleneck) isn't the main or only reason for this, but it's a nice extra at almost no cost - the cost being this line, which I hope you agree is negligible:

processedData.metadata.filterNot { (_, value) -> value.isNull },

I find it hard to argue why we should keep those nulls around if it's so simple to get rid of them.

I wonder whether the overhead (in code complexity and runtime) is worth the trouble?

If we want to be able to easily add metadata fields without having to reprocess then we need the "decompression" step, whether we keep nulls around or not. So that slight complexity is required:

backendConfig
    .getInstanceConfig(organism)
    .schema
    .metadata
    .map { it.name }
    .associateWith { fieldName ->
        processedData.metadata[fieldName] ?: NullNode.instance
    },

But are we really arguing over four lines of complexity? I'm actually quite proud of how little complexity the solution adds (mostly because you set things up so nicely with the compression service, so full credit to you @fengelniederhammer). As for runtime, I'm quite sure the cost is almost entirely negligible. And we agreed that we would do this in #2793 and related PRs (where you said you were happy for the backend to always emit what the schema defines ;) )

@corneliusroemer (Contributor, Author)

Re @anna-parker's comments:

Do I understand correctly that only newly processed data will be compressed, unless we bump the preprocessing pipeline version, in which case all data will be reprocessed and then compressed?

That's correct - though "compression" is really not the right word for metadata. It's more like dehydration: throwing out things that are not required (but also do no harm). This compression is entirely optional, has the slight benefit of saving space, and is otherwise irrelevant. But it takes just one line of code, so why not.

Even better, we could have the backend expect not two separate external metadata and metadata dictionaries but one metadata dictionary with this field.

Excellent, that's exactly what I was thinking of! Make externality an optional extra of a metadata field. We should probably do this in a separate PR - it should be pretty straightforward!
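Purely as a hypothetical sketch of that future shape (all names invented; this is not the current backend config model): externality becomes a per-field attribute instead of a separate dictionary.

// Hypothetical future config shape, NOT what the backend reads today:
data class MetadataFieldConfig(
    val name: String,
    val type: String,
    // null = regular field; otherwise the external submitter allowed to override it.
    val externalMetadataSubmitter: String? = null,
)

// Example: one ordinary field and one field an external submitter may set.
val schemaMetadata = listOf(
    MetadataFieldConfig(name = "country", type = "string"),
    MetadataFieldConfig(name = "insdcAccessionBase", type = "string", externalMetadataSubmitter = "ena-deposition-bot"),
)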

@fengelniederhammer (Contributor)

Compression is not really what's happening here, so it seems to give the wrong idea.

Oh I see. I misread the code (just had a quick look over it). I think it would be cleaner if we did not hide metadata processing in "compress" and "decompress" functions. Let's split out that functionality to a dedicated place?

@corneliusroemer (Contributor, Author) commented Nov 19, 2024

Re how much storage is saved, running this script from ChatGPT

SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
    pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS indexes_size,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename) - pg_relation_size(schemaname || '.' || tablename) - pg_indexes_size(schemaname || '.' || tablename)) AS toast_size
FROM
    pg_tables
WHERE
    schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY
    pg_total_relation_size(schemaname || '.' || tablename) DESC;

on this branch vs main gives:

This branch (compressed)

      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 74 MB      | 28 MB      | 2512 kB      | 43 MB
 public                | sequence_entries                   | 57 MB      | 37 MB      | 2120 kB      | 17 MB
... (truncating small/irrelevant tables)

main

      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 104 MB     | 35 MB      | 3160 kB      | 66 MB
 public                | sequence_entries                   | 56 MB      | 37 MB      | 2144 kB      | 17 MB
... (truncating small/irrelevant tables)

So for our largest table, the preprocessed data, we're shaving off roughly 30% of total size, 20% of table size and about 35% of toast size. In an actual deployment that table makes up an even larger share.

This is a cheap but substantial perf win - do you agree now, @fengelniederhammer? :)

Just out of curiosity, here's our current prod Pathoplexus result; preprocessed data is almost everything (we should truncate it to speed things up again - we don't need old versions of processing lying around):

      schemaname       |             tablename              | total_size | table_size | indexes_size | toast_size
-----------------------+------------------------------------+------------+------------+--------------+------------
 public                | sequence_entries_preprocessed_data | 1127 MB    | 218 MB     | 21 MB        | 887 MB
 public                | sequence_entries                   | 112 MB     | 57 MB      | 2704 kB      | 52 MB
 ena_deposition_schema | submission_table                   | 5264 kB    | 120 kB     | 48 kB        | 5096 kB
 public                | audit_log                          | 4336 kB    | 3896 kB    | 88 kB        | 352 kB
 public                | data_use_terms_table               | 3272 kB    | 2304 kB    | 960 kB       | 8192 bytes
 ena_deposition_schema | assembly_table                     | 224 kB     | 168 kB     | 48 kB        | 8192 bytes
 public                | external_metadata                  | 200 kB     | 144 kB     | 48 kB        | 8192 bytes
 ena_deposition_schema | sample_table                       | 200 kB     | 144 kB     | 48 kB        | 8192 bytes
 public                | user_groups_table                  | 64 kB      | 8192 bytes | 48 kB        | 8192 bytes
 ena_deposition_schema | flyway_schema_history              | 48 kB      | 8192 bytes | 32 kB        | 8192 bytes
 public                | flyway_schema_history              | 48 kB      | 8192 bytes | 32 kB        | 8192 bytes
 ena_deposition_schema | project_table                      | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | sequence_upload_aux_table          | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqset_records                     | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | groups_table                       | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | metadata_upload_aux_table          | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | table_update_tracker               | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqsets                            | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | seqset_to_records                  | 32 kB      | 8192 bytes | 16 kB        | 8192 bytes
 public                | current_processing_pipeline        | 24 kB      | 8192 bytes | 16 kB        | 0 bytes

@corneliusroemer (Contributor, Author) commented Nov 19, 2024

I think it would be cleaner if we did not hide metadata processing in "compress" and "decompress" functions. Let's split out that functionality to a dedicated place?

I potentially disagree; I think this is a great place for it. Note that I renamed the function so it no longer mentions sequences, since we now compress both sequences and metadata. I can see that it might be confusing, because when you wrote this it only compressed sequences - but that's just because you wrote the code.

Maybe we could rename the service to something more general than compression. I wouldn't move it somewhere else, though; that would just complicate the logic - the fact that the required changes are so small is a good sign that this is the right place to make them.

I first had a separate service, but that would then have to be called in a number of places per endpoint (the way I had set it up) - now this is much cleaner.

The current location is good because we want this hydration/dehydration (or compression, or whatever we call it) to happen in one place, just before things go into the db and just after they come out of it. And it's natural to do that in the same place where we compress sequences.

@fengelniederhammer feel free to propose a split into something that makes sense for you, maybe in a PR that targets this PR here - I'm probably thinking of something different (and worse) than what you have in mind.

@fengelniederhammer (Contributor)

@fengelniederhammer feel free to propose a split into something that makes sense for you, maybe in a PR that targets this PR here - I'm probably thinking of something different (and worse) than what you have in mind.

-> #3241

fengelniederhammer and others added 2 commits November 19, 2024 16:50
…ion (#3241)

* sketch: do not hide metadata processing in sequence compression function

* Make test work again, mention filtering out extra fields

* use hamcrest assertThat for better error messages

---------

Co-authored-by: Cornelius Roemer <[email protected]>
@anna-parker (Contributor) left a comment:

Looks great - I checked that this should do the same thing as before :-)

@chaoran-chen (Member) left a comment:

Great, thank you!

@corneliusroemer merged commit c75883f into main Nov 20, 2024
17 checks passed
@corneliusroemer deleted the enforce-schema branch November 20, 2024 14:20
Labels: preview (Triggers a deployment to argocd), format_me (Triggers github_actions to format website code on PR)

Closes: Don't put fields with null values in sequence_entries_preprocessed_data.processed_data