Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store unique masks in a dictionary to avoid duplication (FormatNXmx.py) #789

Merged
merged 7 commits into from
Feb 21, 2025

Conversation

yash4karan
Copy link
Collaborator

@yash4karan yash4karan commented Feb 20, 2025

Since static masks tend to be constant for a given experiment, it is more efficient to store unique entries in a dictionary.

This PR adds a MaskDict singleton, which inherits from the Python dict. This class provides a .insert() method that inserts a static mask object only if it is not already present and then returns a reference to the corresponding dictionary entry.

Motivation: dials.import seems to require a surprising amount of ram (sometimes) #2227

When trying to dials.import a large number of experiment files, the RAM usage on Linux was unnecessarily high due to separate static masks being created for each file. Since masks tend to be limited in number for each experiment, it makes more sense to store masks with unique values only, which is what this PR aims to do.

Before and after:

I profiled dials.import using memray on a Linux workstation with the following VMXi dataset:
/dls/mx/data/nt30330/nt30330-162/VMXi-AB2776/well_*/images/*.nxs.

Before

I found that originally, peak RAM usage was ~18 GB, with the mask objects taking up ~16 GB for 3836 files (i.e. ~4.3 MB per .nxs file, as expected). The RAM usage vs. time plot is shown below:

LinuxVMXiImport_Before

After

After the PR changes were made, total RAM usage was <900 MB, with the mask objects taking up <10 MB at any given time. The RAM usage vs. time plot is shown below:

LinuxVMXiImport_After

@yash4karan yash4karan changed the title Stores static masks in FormatNXmx.py in a dict to avoid duplication Store static masks in FormatNXmx.py in a dict to avoid duplication Feb 20, 2025
Copy link
Collaborator

@graeme-winter graeme-winter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution: this looks exactly like the fix which is needed here - the changes I propose are largely cosmetic though dealing with mask tuples of length > 1 would be important before we merge this.

For others: I have already discussed these changes with @yash4karan 🙂

@ndevenish
Copy link
Collaborator

Is it worth storing the masks as weak references; this way, if you ever do load anything with a million separate masks you'll keep them around forever even if the original format class is long gone.

If so, I haven't used it before, but it looks like that module has a WeakValueDictionary that may do exactly what we need?

@ndevenish
Copy link
Collaborator

ohey

A primary use for weak references is to implement caches or mappings holding large objects, where it’s desired that a large object not be kept alive solely because it appears in a cache or mapping.

@yash4karan yash4karan changed the title Store static masks in FormatNXmx.py in a dict to avoid duplication Store unique masks in a dictionary to avoid duplication (FormatNXmx.py) Feb 20, 2025
@ndevenish
Copy link
Collaborator

I edited the news; as I tend to do for everyone. The news fragment is aimed at end-users of DIALS, so I try to describe the effect rather than the technical description of what changed; it isn't a commit message.

@ndevenish
Copy link
Collaborator

Add yourself to dxtbx AUTHORS also

@ndevenish ndevenish merged commit 45f50b2 into cctbx:main Feb 21, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants