Optimize memory utilization of get_structure()
#765
base: main
Conversation
CodSpeed Performance Report: Merging #765 will not alter performance.
Thank you for this nice addition!
This looks great to me; my only slight concern is the 20-30% performance hit we get on average-sized structures, which is the usual case for almost all PDB structures.
One simple way to get the best of both worlds would be to pick between the two strategies (dict strategy vs. dense array strategy) based on the size of `query_arrays`, `reference_arrays` and `n_atoms`. Could you add a switch to use the array-based matching for typical structures, but the dictionary-based matching for problems above a certain size? This should allow us to retain the benchmark speed while fixing the issue you observed for large structures.
(Alternatively, you could also implement the dictionary matching in Cython; I'd imagine that should also fix the slowdown.)
Also, would you be able to add your observed test case to the test suite and mark it as a slow test (via pytest marks)? This way we can track the performance on large structures as well.
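As an illustration, a minimal sketch of how such a slow-marked test could look; the test name, the use of `rcsb.fetch` to download the file at test time, and the chosen assertion are assumptions, not the test that was actually added to the PR:

```python
import pytest

import biotite.database.rcsb as rcsb
import biotite.structure.io.pdbx as pdbx


@pytest.mark.slow
def test_get_structure_many_inter_residue_bonds(tmp_path):
    # 7Y4L has many atoms and many 'struct_conn' rows; after this PR,
    # parsing it should complete without excessive memory usage.
    path = rcsb.fetch("7Y4L", "cif", tmp_path)
    cif_file = pdbx.CIFFile.read(path)
    atoms = pdbx.get_structure(cif_file, model=1, include_bonds=True)
    assert atoms.array_length() > 0
```

Fetching the file at runtime avoids bundling the large CIF file in the repository, at the cost of requiring network access for this slow test.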
Thanks for spotting this failure. 👍 Allocating hundreds of GB is definitely not the intended behavior. I also agree with @Croydon-Brixton here: this fix should not negatively affect the performance of the more common cases. Using some cutoff based on the problem size, as suggested, sounds reasonable.
This is unfortunately not easily possible, as this would require Unicode string handling at the C level, which is quite a pain. In the future, this could be handled by Rust code instead (#688), which has built-in safe support for Unicode strings.
This enhancement avoids creating a `[N_struct_conn, N_atom]` matrix when reading the "struct_conn" field, preventing excessive memory usage when dealing with CIF files containing a large number of atoms and numerous inter-residue bonds.
Co-authored-by: Jincai Yang <[email protected]>
@Croydon-Brixton @padix-key Thank you both for your suggestions. I initially tried using tuples as dictionary keys, which improved performance somewhat, but it still did not outperform the dense array strategy in the benchmark test (1AKI). As a result, I implemented a switch based on the size of N_query * N_ref, ultimately choosing 2^13 as the threshold to decide which strategy to use. Additionally, I did not include 7Y4L in the pytest test code because its CIF file is excessively large (>100MB), which would bloat the repository size.
Thanks for the helpful benchmark! I would not have expected that the tipping point comes so early.
I would suggest using the only slightly larger 10^4 as a more 'round' number here, but I have no strong opinion on this.
This makes sense; still, we probably want to include the new approach in the tests. Instead, we could mock the threshold constant in the test to force the usage of the dictionary approach even for smaller structures. If you prefer, I could also add such a test to this PR.
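A minimal sketch of how such a test could mock the threshold; the constant name `DICT_STRATEGY_THRESHOLD`, the `convert` module path, and the `small_cif_path` fixture are assumptions for illustration only:

```python
import biotite.structure.io.pdbx as pdbx
# Assumption: the threshold ends up as a module-level constant in the
# convert module; the exact module path and constant name are illustrative.
import biotite.structure.io.pdbx.convert as convert


def test_dict_strategy_matches_dense_strategy(monkeypatch, small_cif_path):
    # small_cif_path is a hypothetical fixture pointing at any small CIF file
    cif_file = pdbx.CIFFile.read(small_cif_path)
    # Default code path: the dense array strategy for small structures
    expected = pdbx.get_structure(cif_file, model=1, include_bonds=True)
    # Force the dictionary strategy even for this small structure
    monkeypatch.setattr(convert, "DICT_STRATEGY_THRESHOLD", 0)
    result = pdbx.get_structure(cif_file, model=1, include_bonds=True)
    assert result == expected
```

This way the dictionary code path stays covered by the regular (fast) test suite, independently of the slow large-structure test.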
@padix-key Sure, you can set the threshold to 10^4. Feel free to update this PR to adjust the threshold and add the test code.
I also found the time to review your PR now; below are some minor suggestions.
```python
# it was observed that when the size exceeds 2**13 (8192)
# the dict strategy becomes significantly faster than the dense array
# and does not cause excessive memory usage.
if query_arrays[0].shape[0] * reference_arrays[0].shape[0] <= 8192:
```
Could you add a module-level constant for this value instead?
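For illustration, a minimal sketch of such a module-level constant; the constant name and the small helper function are assumptions, not the PR's final wording:

```python
# Above roughly this many query/reference row pairs, the dictionary strategy
# was observed to be faster than the dense array strategy while avoiding the
# memory blow-up (see the benchmark posted in this PR).
_DICT_STRATEGY_THRESHOLD = 2**13


def _use_dense_strategy(query_arrays, reference_arrays):
    """Decide whether the dense array strategy is safe for this problem size."""
    n_query = query_arrays[0].shape[0]
    n_reference = reference_arrays[0].shape[0]
    return n_query * n_reference <= _DICT_STRATEGY_THRESHOLD
```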
```python
# Convert reference arrays to a dictionary for O(1) lookups
reference_dict = {}
unambiguously_keys = set()
for idx, col in enumerate(np.stack(reference_arrays, axis=-1)):
```
Running `np.stack()` forces the `reference_arrays` to the same dtype, probably still requiring time-consuming conversion to string and losing the type information. For the purpose required here, `zip()` should achieve the same, right?
Furthermore, you are iterating over the rows of the `atom_site` (reference) and `struct_conn` (query) categories instead of columns, if I am not mistaken.
```diff
- for idx, col in enumerate(np.stack(reference_arrays, axis=-1)):
+ for idx, row in enumerate(zip(*reference_arrays)):
```
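Putting these suggestions together, a minimal sketch of how the dictionary-based matching could look with `zip()`; the function name and the handling of missing or ambiguous rows are assumptions, and the PR's actual `_find_matches()` may differ:

```python
import numpy as np


def _find_matches_dict(query_arrays, reference_arrays):
    """
    For each query row, return the index of the matching reference row,
    or -1 if there is no unique match (sketch of the assumed semantics).
    """
    # zip() iterates over rows without forcing a common dtype,
    # unlike np.stack() on heterogeneous columns
    reference_dict = {}
    ambiguous_keys = set()
    for idx, row in enumerate(zip(*reference_arrays)):
        if row in reference_dict:
            ambiguous_keys.add(row)
        else:
            reference_dict[row] = idx

    matches = np.full(len(query_arrays[0]), -1, dtype=int)
    for idx, row in enumerate(zip(*query_arrays)):
        if row in ambiguous_keys:
            continue
        occurrence = reference_dict.get(row)
        if occurrence is not None:
            matches[idx] = occurrence
    return matches
```

Memory usage here scales with the number of reference rows rather than with the product of query and reference sizes.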
```python
occurrence = reference_dict.get(query_key, -1)

if occurrence == -1:
```
As we do not use NumPy here, we are able to use the more idiomatic `None` here.
```diff
- occurrence = reference_dict.get(query_key, -1)
- if occurrence == -1:
+ occurrence = reference_dict.get(query_key)
+ if occurrence is None:
```
""" | ||
# Convert reference arrays to a dictionary for O(1) lookups | ||
reference_dict = {} | ||
unambiguously_keys = set() |
The set actually collects the keys that are ambiguous.
```diff
- unambiguously_keys = set()
+ ambiguous_keys = set()
```
```python
# it was observed that when the size exceeds 2**13 (8192)
# the dict strategy becomes significantly faster than the dense array
# and does not cause excessive memory usage.
```
Could you add a link to your comment with the nice benchmark in this PR?
When obtaining an AtomArray using `pdbx.get_structure()`, the invoked `_find_matches()` function generates a `[N_struct_conn_col, N_struct_conn_row, N_atom]` matrix. In cases where a CIF file contains numerous atoms and numerous inter-residue bonds, this can lead to a significant increase in memory usage. For example, for PDB ID 7Y4L, memory consumption can reach up to 100 GB, and it takes approximately 450 seconds to run `pdbx.get_structure()`.

This PR modifies the implementation of `_find_matches()` to address and prevent this issue. With the updated code, processing 7Y4L requires less than 1.5 GB of memory, and the runtime is reduced to approximately 45 seconds.
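For a rough sense of why the dense approach blows up, a back-of-the-envelope sketch; the row and column counts below are illustrative assumptions, not the exact numbers for 7Y4L:

```python
# Rough estimate of the boolean match matrix size for a large structure.
n_columns = 6            # matched struct_conn columns, e.g. chain, residue, atom name
n_struct_conn = 100_000  # assumed number of inter-residue bond rows
n_atoms = 500_000        # assumed number of atom_site rows

dense_bytes = n_columns * n_struct_conn * n_atoms  # one byte per bool entry
print(f"Dense match matrix: ~{dense_bytes / 1e9:.0f} GB")  # ~300 GB

# The dictionary strategy instead stores one entry per atom_site row,
# so its footprint scales with n_atoms rather than with the product.
```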