Optimize memory utilization of get_structure()
#765
base: main
Conversation
CodSpeed Performance Report: Merging #765 will not alter performance.
Thank you for this nice addition!
This looks great to me; my only slight concern is the 20-30% performance hit we get on average-sized structures, which is the usual case for almost all PDB structures.
One simple way to get the best of both worlds would be to pick between the two strategies (dict strategy vs. dense array strategy) based on the size of `query_arrays`, `reference_arrays` and `n_atoms`. Could you add a switch to use the array-based matching for typical structures, but the dictionary-based matching for problems above a certain size? This should allow us to retain the benchmark speed while fixing the issue you observed for large structures.
(Alternatively, you could also implement the dictionary matching in Cython; I'd imagine that should also fix the slowdown.)
Also, would you be able to add your observed test case to the test suite and mark it as a slow test (via pytest marks)? This way we can track the performance on large structures as well.
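As an illustration, a minimal sketch of how such a slow-marked test could look; the test name, the use of `rcsb.fetch` to download the file at test time, and the chosen assertion are assumptions, not the test that was actually added to the PR:

```python
import pytest

import biotite.database.rcsb as rcsb
import biotite.structure.io.pdbx as pdbx


@pytest.mark.slow
def test_get_structure_many_inter_residue_bonds(tmp_path):
    # 7Y4L has many atoms and many 'struct_conn' rows; after this PR,
    # parsing it should complete without excessive memory usage.
    path = rcsb.fetch("7Y4L", "cif", tmp_path)
    cif_file = pdbx.CIFFile.read(path)
    atoms = pdbx.get_structure(cif_file, model=1, include_bonds=True)
    assert atoms.array_length() > 0
```

Fetching the file at runtime avoids bundling the large CIF file in the repository, at the cost of requiring network access for this slow test.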
Thanks for spotting this failure. 👍 Allocating hundreds of GB is definitely not the intended behavior. I also agree with @Croydon-Brixton here: this fix should not negatively affect the performance of the more common cases. Using some cutoff based on the problem size, as suggested, sounds reasonable.
This is unfortunately not easily possible, as this would require Unicode string handling at the C level, which is quite a pain. In the future, this could be handled by Rust code instead (#688), which has built-in safe support for Unicode strings.
This enhancement avoids creating a `[N_struct_conn, N_atom]` matrix when reading the "struct_conn" field, preventing excessive memory usage when dealing with CIF files containing a large number of atoms and numerous inter-residue bonds.
Co-authored-by: Jincai Yang <[email protected]>
@Croydon-Brixton @padix-key Thank you both for your suggestions. I initially tried using tuples as dictionary keys, which improved performance somewhat, but it still did not outperform the dense array strategy in the benchmark test (1AKI). As a result, I implemented a switch based on the size of N_query * N_ref, ultimately choosing 2^13 as the threshold to decide which strategy to use. Additionally, I did not include 7Y4L in the pytest test code because its CIF file is excessively large (>100MB), which would bloat the repository size.
Thanks for the helpful benchmark! I would not have expected that the tipping point comes so early.
I would suggest using the only slightly larger 10^4 as a more 'round' number here, but I have no strong opinion on this.
This makes sense; still, we probably want to include the new approach in the tests. Instead, we could mock the threshold constant in the test to force the usage of the dictionary approach even for smaller structures. If you prefer, I could also add such a test to this PR.
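A minimal sketch of how such a test could mock the threshold; the constant name `DICT_STRATEGY_THRESHOLD`, the `convert` module path, and the `small_cif_path` fixture are assumptions for illustration only:

```python
import biotite.structure.io.pdbx as pdbx
# Assumption: the threshold ends up as a module-level constant in the
# convert module; the exact module path and constant name are illustrative.
import biotite.structure.io.pdbx.convert as convert


def test_dict_strategy_matches_dense_strategy(monkeypatch, small_cif_path):
    # small_cif_path is a hypothetical fixture pointing at any small CIF file
    cif_file = pdbx.CIFFile.read(small_cif_path)
    # Default code path: the dense array strategy for small structures
    expected = pdbx.get_structure(cif_file, model=1, include_bonds=True)
    # Force the dictionary strategy even for this small structure
    monkeypatch.setattr(convert, "DICT_STRATEGY_THRESHOLD", 0)
    result = pdbx.get_structure(cif_file, model=1, include_bonds=True)
    assert result == expected
```

This way the dictionary code path stays covered by the regular (fast) test suite, independently of the slow large-structure test.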
@padix-key Sure, you can set the threshold to 10^4. Feel free to update this PR to adjust the threshold and add the test code.
I also found the time to review your PR now; below are some minor suggestions.
```python
# it was observed that when the size exceeds 2**13 (8192)
# the dict strategy becomes significantly faster than the dense array
# and does not cause excessive memory usage.
if query_arrays[0].shape[0] * reference_arrays[0].shape[0] <= 8192:
```
Could you add a module-level constant for this value instead?
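For illustration, a minimal sketch of such a module-level constant; the constant name and the small helper function are assumptions, not the PR's final wording:

```python
# Above roughly this many query/reference row pairs, the dictionary strategy
# was observed to be faster than the dense array strategy while avoiding the
# memory blow-up (see the benchmark posted in this PR).
_DICT_STRATEGY_THRESHOLD = 2**13


def _use_dense_strategy(query_arrays, reference_arrays):
    """Decide whether the dense array strategy is safe for this problem size."""
    n_query = query_arrays[0].shape[0]
    n_reference = reference_arrays[0].shape[0]
    return n_query * n_reference <= _DICT_STRATEGY_THRESHOLD
```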
```python
# Convert reference arrays to a dictionary for O(1) lookups
reference_dict = {}
unambiguously_keys = set()
for idx, col in enumerate(np.stack(reference_arrays, axis=-1)):
```
Running `np.stack()` forces the `reference_arrays` to the same dtype, probably still requiring time-consuming conversion to string and losing the type information. For the purpose required here, `zip()` should achieve the same, right?
Furthermore, you are iterating over the rows of the `atom_site` (reference) and `struct_conn` (query) categories instead of columns, if I am not mistaken.
```diff
- for idx, col in enumerate(np.stack(reference_arrays, axis=-1)):
+ for idx, row in enumerate(zip(*reference_arrays)):
```
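Putting these suggestions together, a minimal sketch of how the dictionary-based matching could look with `zip()`; the function name and the handling of missing or ambiguous rows are assumptions, and the PR's actual `_find_matches()` may differ:

```python
import numpy as np


def _find_matches_dict(query_arrays, reference_arrays):
    """
    For each query row, return the index of the matching reference row,
    or -1 if there is no unique match (sketch of the assumed semantics).
    """
    # zip() iterates over rows without forcing a common dtype,
    # unlike np.stack() on heterogeneous columns
    reference_dict = {}
    ambiguous_keys = set()
    for idx, row in enumerate(zip(*reference_arrays)):
        if row in reference_dict:
            ambiguous_keys.add(row)
        else:
            reference_dict[row] = idx

    matches = np.full(len(query_arrays[0]), -1, dtype=int)
    for idx, row in enumerate(zip(*query_arrays)):
        if row in ambiguous_keys:
            continue
        occurrence = reference_dict.get(row)
        if occurrence is not None:
            matches[idx] = occurrence
    return matches
```

Memory usage here scales with the number of reference rows rather than with the product of query and reference sizes.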
```python
occurrence = reference_dict.get(query_key, -1)

if occurrence == -1:
```
As we do not use NumPy here, we are able to use the more idiomatic `None` here.
```diff
- occurrence = reference_dict.get(query_key, -1)
- if occurrence == -1:
+ occurrence = reference_dict.get(query_key)
+ if occurrence is None:
```
""" | ||
# Convert reference arrays to a dictionary for O(1) lookups | ||
reference_dict = {} | ||
unambiguously_keys = set() |
The set actually collects the keys that are ambiguous.
```diff
- unambiguously_keys = set()
+ ambiguous_keys = set()
```
```python
# it was observed that when the size exceeds 2**13 (8192)
# the dict strategy becomes significantly faster than the dense array
# and does not cause excessive memory usage.
```
Could you add a link to your comment with the nice benchmark in this PR?
When obtaining an AtomArray using `pdbx.get_structure()`, the invoked `_find_matches()` function generates a `[N_struct_conn_col, N_struct_conn_row, N_atom]` matrix. In cases where a CIF file contains numerous atoms and numerous inter-residue bonds, this can lead to a significant increase in memory usage. For example, for PDB ID 7Y4L, memory consumption can reach up to 100 GB, and it takes approximately 450 seconds to run `pdbx.get_structure()`.

This PR modifies the implementation of `_find_matches()` to address and prevent this issue. With the updated code, processing 7Y4L requires less than 1.5 GB of memory, and the runtime is reduced to approximately 45 seconds.
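For a rough sense of why the dense approach blows up, a back-of-the-envelope sketch; the row and column counts below are illustrative assumptions, not the exact numbers for 7Y4L:

```python
# Rough estimate of the boolean match matrix size for a large structure.
n_columns = 6            # matched struct_conn columns, e.g. chain, residue, atom name
n_struct_conn = 100_000  # assumed number of inter-residue bond rows
n_atoms = 500_000        # assumed number of atom_site rows

dense_bytes = n_columns * n_struct_conn * n_atoms  # one byte per bool entry
print(f"Dense match matrix: ~{dense_bytes / 1e9:.0f} GB")  # ~300 GB

# The dictionary strategy instead stores one entry per atom_site row,
# so its footprint scales with n_atoms rather than with the product.
```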