
Optimize memory utilization of get_structure(). #765

Open · wants to merge 1 commit into main from cloverzizi:main
Conversation

@cloverzizi (Contributor) commented Mar 3, 2025

When obtaining an AtomArray using pdbx.get_structure(), the invoked _find_matches() function generates a [N_struct_conn_col, N_struct_conn_row, N_atom] matrix. In cases where a CIF file contains numerous atoms and numerous inter-residue bonds, this can lead to a significant increase in memory usage. For example, for PDB ID 7Y4L, memory consumption can reach up to 100GB, and it takes approximately 450 seconds to run pdbx.get_structure().
This PR modifies the implementation of _find_matches() to address and prevent this issue. With the updated code, processing 7Y4L requires less than 1.5GB of memory, and the runtime is reduced to approximately 45 seconds.
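To illustrate the difference, here is a minimal, self-contained sketch (not biotite's actual `_find_matches()` signature, and omitting the ambiguity handling of the real code): the dense strategy materializes an all-pairs comparison matrix, while the dict strategy hashes the reference rows once.

```python
import numpy as np

def find_matches_dense(query_arrays, reference_arrays):
    # Dense strategy: broadcasting yields a boolean matrix of shape
    # (n_query, n_reference), so peak memory scales with
    # n_query * n_reference.
    mask = np.ones(
        (len(query_arrays[0]), len(reference_arrays[0])), dtype=bool
    )
    for query_col, ref_col in zip(query_arrays, reference_arrays):
        mask &= query_col[:, np.newaxis] == ref_col[np.newaxis, :]
    matches = np.full(len(query_arrays[0]), -1, dtype=int)
    query_idx, ref_idx = np.nonzero(mask)
    matches[query_idx] = ref_idx
    return matches

def find_matches_dict(query_arrays, reference_arrays):
    # Dict strategy: one pass over the reference rows builds a hash map,
    # so memory scales with n_query + n_reference only.
    reference_dict = {
        row: idx for idx, row in enumerate(zip(*reference_arrays))
    }
    return np.array(
        [reference_dict.get(row, -1) for row in zip(*query_arrays)]
    )
```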

codspeed-hq bot commented Mar 3, 2025

CodSpeed Performance Report

Merging #765 will not alter performance

Comparing cloverzizi:main (4a7c6c9) with main (877efed)

Summary

✅ 59 untouched benchmarks

@cloverzizi force-pushed the main branch 2 times, most recently from bfc2eff to a8676b9, March 3, 2025 06:48
@Croydon-Brixton (Contributor) left a comment


Thank you for this nice addition!

This looks great to me; my only slight concern is the 20-30% performance hit we get on average-sized structures, which is the usual case for almost all PDB structures.

One simple way to get the best of both worlds would be to pick between the two strategies (dict strategy vs. dense array strategy) based on the size of `query_arrays`, `reference_arrays` and `n_atoms`. Could you add a switch to use the array-based matching for typical structures, but the dictionary-based matching for problems above a certain size? This should allow us to retain the benchmark speed while fixing the issue you observed for large structures.
(Alternatively, you could also implement the dictionary matching in Cython; I'd imagine that should also fix the slowdown.)
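For concreteness, such a switch might look like this (a sketch reusing the toy functions from the example above; the constant name is made up, and 2**13 is the value the PR eventually adopted):

```python
# Hypothetical constant; the PR later settled on 2**13 after benchmarking.
FIND_MATCHES_SWITCH_THRESHOLD = 2**13

def find_matches(query_arrays, reference_arrays):
    # Dense matching is fastest for typical structures, while the dict
    # strategy avoids the quadratic memory blow-up for very large ones.
    n_pairs = len(query_arrays[0]) * len(reference_arrays[0])
    if n_pairs <= FIND_MATCHES_SWITCH_THRESHOLD:
        return find_matches_dense(query_arrays, reference_arrays)
    return find_matches_dict(query_arrays, reference_arrays)
```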

Also, would you be able to add your observed test case to the test suite and mark it as a slow test (via pytest marks)? This way we can track the performance on large structures as well.

@padix-key (Member) commented

Thanks for spotting this failure 👍. Allocating hundreds of GB is definitely not the intended behavior.

I also agree with @Croydon-Brixton here: this fix should not negatively affect the performance of the more common cases. Using some cutoff based on `len(query_arrays) * len(reference_arrays)` should work here, as suggested. That being said, I assume the performance of the implementation in this PR can also be optimized a bit; I left one suggestion regarding this in the review.

> Alternatively, you could also implement the dictionary matching in Cython; I'd imagine that should also fix the slowdown.

This is unfortunately not easily possible, as it would require Unicode string handling on C-level, which is quite a pain. In the future, this could be handled by Rust code instead (#688), which has built-in safe support for Unicode strings.

This enhancement avoids creating a `[N_struct_conn, N_atom]` matrix when
reading the "struct_conn" field, preventing excessive memory usage
when dealing with CIF files containing a large number of atoms
and numerous inter-residue bonds.

Co-authored-by: Jincai Yang <[email protected]>
@cloverzizi (Contributor, Author) commented Mar 4, 2025

@Croydon-Brixton @padix-key Thank you both for your suggestions. I initially tried using tuples as dictionary keys, which improved performance somewhat, but it still did not outperform the dense array strategy in the benchmark test (1AKI). As a result, I implemented a switch based on the size of N_query * N_ref, ultimately choosing 2^13 as the threshold to decide which strategy to use.

Additionally, I did not include 7Y4L in the pytest testing code because its CIF file is excessively large (>100MB), which would bloat the repository size.

[Two benchmark screenshots comparing the runtime of the dict and dense array strategies]
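The crossover could be reproduced with a micro-benchmark along these lines (a sketch using the toy functions from the first example; sizes and data are illustrative, not the PR's actual benchmark):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
for n_query, n_ref in [(32, 256), (256, 256), (1024, 1024)]:
    # Three string columns stand in for the struct_conn/atom_site keys.
    query = [rng.integers(0, n_ref, n_query).astype(str) for _ in range(3)]
    ref = [rng.integers(0, n_ref, n_ref).astype(str) for _ in range(3)]
    for func in (find_matches_dense, find_matches_dict):
        runtime = timeit.timeit(lambda: func(query, ref), number=100)
        print(f"N_query*N_ref={n_query * n_ref:>8}  {func.__name__}: {runtime:.4f}s")
```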

@padix-key (Member) commented Mar 4, 2025

Thanks for the helpful benchmark! I would not have expected that the tipping point comes so early.

> choosing 2^13 as the threshold to decide which strategy to use

I would suggest using the only slightly larger 10^4, as it is a 'rounder' number, but I have no strong opinion on this.

> Additionally, I did not include 7Y4L in the pytest testing code because its CIF file is excessively large (>100MB), which would bloat the repository size.

This makes sense; still, we probably want to cover the new approach in the tests. Instead, we could mock the threshold constant in the test to force the use of the dictionary approach even for smaller structures. If you prefer, I could also add such a test to this PR.
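Such a test could look roughly like this (a sketch: the constant name is a placeholder for whatever this PR introduces, the data path is illustrative, and the threshold is assumed to live as a module-level constant in `biotite.structure.io.pdbx.convert`):

```python
import biotite.structure.io.pdbx as pdbx
import biotite.structure.io.pdbx.convert as pdbx_convert

def test_dict_matching_on_small_structure(monkeypatch):
    # Force the dictionary code path by lowering the (hypothetical)
    # switch threshold to zero, even for a small structure.
    monkeypatch.setattr(pdbx_convert, "FIND_MATCHES_SWITCH_THRESHOLD", 0)
    cif = pdbx.CIFFile.read("tests/structure/data/1aki.cif")  # illustrative path
    atoms = pdbx.get_structure(cif, model=1, include_bonds=True)
    assert atoms.array_length() > 0
```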

@cloverzizi (Contributor, Author) commented

@padix-key Sure, you can set the threshold to 10^4. Feel free to update this PR to adjust the threshold and add the test code.

@padix-key (Member) left a comment


I also found the time to review your PR now; below are some minor suggestions.

```python
# it was observed that when the size exceeds 2**13 (8192)
# the dict strategy becomes significantly faster than the dense array
# and does not cause excessive memory usage.
if query_arrays[0].shape[0] * reference_arrays[0].shape[0] <= 8192:
```

Could you add a module-level constant for this value instead?
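Something along these lines (a sketch; the constant name is illustrative):

```python
# Threshold on N_query * N_ref above which the dict strategy is used
# instead of the dense array strategy (see benchmark in this PR).
_FIND_MATCHES_SWITCH_THRESHOLD = 2**13

def _find_matches(query_arrays, reference_arrays):
    n_pairs = query_arrays[0].shape[0] * reference_arrays[0].shape[0]
    if n_pairs <= _FIND_MATCHES_SWITCH_THRESHOLD:
        ...  # dense array strategy
    else:
        ...  # dict strategy
```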

```python
# Convert reference arrays to a dictionary for O(1) lookups
reference_dict = {}
unambiguously_keys = set()
for idx, col in enumerate(np.stack(reference_arrays, axis=-1)):
```

Running `np.stack()` forces the `reference_arrays` to a common dtype, probably still requiring a time-consuming conversion to string and losing the type information. For the purpose required here, `zip()` should achieve the same, right?

Furthermore, you are iterating over the rows of the atom_site (reference) and struct_conn (query) categories instead of columns, if I am not mistaken.

Suggested change:
```diff
-for idx, col in enumerate(np.stack(reference_arrays, axis=-1)):
+for idx, row in enumerate(zip(*reference_arrays)):
```
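For illustration, `zip()` keeps each column's own dtype, while `np.stack()` coerces everything to a common one (a minimal sketch):

```python
import numpy as np

atom_ids = np.array([1, 2])          # integer column
atom_names = np.array(["N", "CA"])   # string column

rows = list(zip(atom_ids, atom_names))   # tuples keep each column's dtype
stacked = np.stack([atom_ids, atom_names], axis=-1)
print(stacked.dtype)                     # everything coerced to a string dtype
```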

Comment on lines +699 to +701:
```python
occurrence = reference_dict.get(query_key, -1)

if occurrence == -1:
```

As we do not use NumPy here, we can use the more idiomatic `None`.

Suggested change:
```diff
-occurrence = reference_dict.get(query_key, -1)
-if occurrence == -1:
+occurrence = reference_dict.get(query_key)
+if occurrence is None:
```

"""
# Convert reference arrays to a dictionary for O(1) lookups
reference_dict = {}
unambiguously_keys = set()

The set actually collects the keys that are ambiguous.

Suggested change:
```diff
-unambiguously_keys = set()
+ambiguous_keys = set()
```
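One way the renamed set could be used (a sketch, not necessarily the exact logic of this PR): keys occurring more than once are ambiguous and are dropped, so that they never produce a possibly wrong match.

```python
def build_reference_dict(reference_arrays):
    # Map each unique reference row to its index; ambiguous rows are
    # removed so that lookups for them fail instead of matching wrongly.
    reference_dict = {}
    ambiguous_keys = set()
    for idx, row in enumerate(zip(*reference_arrays)):
        if row in reference_dict:
            ambiguous_keys.add(row)
        else:
            reference_dict[row] = idx
    for row in ambiguous_keys:
        del reference_dict[row]
    return reference_dict
```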

Comment on lines +723 to +725:
```python
# it was observed that when the size exceeds 2**13 (8192)
# the dict strategy becomes significantly faster than the dense array
# and does not cause excessive memory usage.
```

Could you link this code comment to your comment in this PR with the nice benchmark?
