
[REF] optimize get_studies_by_mask #891

Merged · 1 commit into neurostuff:main · Jul 25, 2024

Conversation

@jdkent (Member) commented on Jul 22, 2024

This comes from a user who was trying to use get_studies_by_mask but kept running out of memory on their machine; this change makes the function more memory-efficient.

Closes # .

Changes proposed in this pull request:

Summary by Sourcery

This pull request optimizes the get_studies_by_mask function in nimare/dataset.py to improve memory efficiency, addressing issues where users experienced memory exhaustion.

  • Enhancements:
    • Optimized the get_studies_by_mask function to improve memory efficiency by replacing the use of cdist with a more direct coordinate comparison approach.
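For intuition, here is a minimal, self-contained sketch of the before and after (illustrative shapes, data, and variable names; not the exact NiMARE diff):

import numpy as np
from scipy.spatial.distance import cdist

# Illustrative stand-ins: N dataset coordinates, M in-mask voxels (ijk triplets).
rng = np.random.default_rng(0)
dset_ijk = rng.integers(0, 50, size=(1000, 3))
mask_coords = rng.integers(0, 50, size=(5000, 3))

# Old approach: cdist materializes an M x N float64 distance matrix
# (~8 bytes per pair) just to test for exact coordinate matches.
distances = cdist(mask_coords, dset_ijk)
in_mask_old = np.any(distances == 0, axis=0)

# New approach: broadcast integer equality instead; the intermediate
# N x M x 3 boolean array costs ~3 bytes per pair and needs no float math.
in_mask_new = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

assert np.array_equal(in_mask_old, in_mask_new)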

@jdkent changed the title from "[REF] make the function more performant" to "[REF] optimize get_studies_by_mask" on Jul 22, 2024
@jdkent (Member Author) commented on Jul 24, 2024

@sourcery-ai review

sourcery-ai bot (Contributor) commented on Jul 24, 2024

Reviewer's Guide by Sourcery

This pull request optimizes the get_studies_by_mask function in the nimare/dataset.py file to improve memory efficiency. The changes involve replacing the use of scipy.spatial.distance.cdist with a more direct method of checking coordinate presence within the mask, thus reducing memory usage and improving performance.

File-Level Changes

  • nimare/dataset.py: Optimized the get_studies_by_mask function by removing the use of scipy.spatial.distance.cdist and implementing a more memory-efficient method to check for coordinate presence within the mask.


@sourcery-ai bot (Contributor) left a comment

Hey @jdkent - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 4 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


mask_coords = np.vstack(np.where(mask_data)).T

# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)
sourcery-ai bot (Contributor):

issue (performance): Potential performance issue with nested array operations

The nested array operations (np.all and np.any) could be computationally expensive, especially for large datasets. Consider optimizing this part of the code if performance becomes an issue.
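If the N x M broadcast itself ever becomes the bottleneck, one possible alternative is to index the mask array directly, which is O(N) in time and memory. This is a sketch only, assuming dset_ijk holds integer voxel indices and mask_data is the 3D mask array (stand-in data below):

import numpy as np

# Stand-ins for the arrays in the diff.
rng = np.random.default_rng(0)
mask_data = rng.random((10, 10, 10)) > 0.5
dset_ijk = rng.integers(-2, 12, size=(100, 3))  # some deliberately out of bounds

ijk = np.asarray(dset_ijk, dtype=int)

# Ignore coordinates that fall outside the mask's array bounds.
in_bounds = np.all((ijk >= 0) & (ijk < np.array(mask_data.shape)), axis=1)

# Index the mask directly instead of comparing every pair of coordinates.
in_mask = np.zeros(len(ijk), dtype=bool)
in_mask[in_bounds] = mask_data[ijk[in_bounds, 0], ijk[in_bounds, 1], ijk[in_bounds, 2]] != 0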


# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)
found_ids = list(self.coordinates.loc[in_mask, "id"].unique())
sourcery-ai bot (Contributor):

issue (performance): Check for potential performance impact of unique()

Calling unique() on a potentially large dataset could have performance implications. Ensure that this operation is efficient for the expected size of self.coordinates.

@jdkent (Member Author) replied:

What would be your suggestion to optimize this code for a large dataset?

sourcery-ai bot (Contributor) replied:

To optimize the code for a large dataset, you can consider the following suggestions:

  1. Use a Set for Uniqueness:
    Instead of using unique(), you can use a set to collect unique IDs. This can be more efficient for large datasets.

  2. Vectorized Operations:
    Ensure that the operations are as vectorized as possible to leverage NumPy's performance benefits.

  3. Efficient Data Structures:
    If the dataset is extremely large, consider using more memory-efficient data structures or libraries like pandas with dask for out-of-core computation (a hypothetical sketch follows at the end of this reply).

Here's a revised version of the code incorporating the use of a set:

mask_data = mask.get_fdata()
mask_coords = np.vstack(np.where(mask_data)).T

# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

# Use a set to collect unique IDs
found_ids = set(self.coordinates.loc[in_mask, "id"])

# Convert the set back to a list if needed
found_ids = list(found_ids)

This approach should help in managing the performance impact when dealing with large datasets. If you have any specific constraints or further details about the dataset size, feel free to share, and we can refine the approach further.
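As a purely hypothetical illustration of suggestion 3 above (using the dask.dataframe API with stand-in data; the "id" column name mirrors the diff):

import dask.dataframe as dd
import pandas as pd

# Stand-in for a coordinates table too large to hold comfortably in memory.
coords = pd.DataFrame({"id": ["study-1", "study-1", "study-2", "study-3"]})
dcoords = dd.from_pandas(coords, npartitions=2)

# drop_duplicates deduplicates per partition and then merges the results,
# so peak memory stays bounded by the partition size.
found_ids = dcoords["id"].drop_duplicates().compute().tolist()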

- distances = cdist(mask_ijk, dset_ijk)
- distances = np.any(distances == 0, axis=0)
- found_ids = list(self.coordinates.loc[distances, "id"].unique())
+ mask_data = mask.get_fdata()
sourcery-ai bot (Contributor):

suggestion: Consider renaming mask_data to mask_fdata for consistency

Since mask.get_fdata() is being used, it might be clearer to name the variable mask_fdata to indicate that it holds the floating-point data array from the mask.

Suggested change:
- mask_data = mask.get_fdata()
+ mask_fdata = mask.get_fdata()

Comment on lines +656 to +657
# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)
sourcery-ai bot (Contributor):

suggestion: Consider adding a brief explanation for the coordinate check

While comments are generally avoided, a brief explanation here could help future developers understand the purpose and logic behind this coordinate check.

Suggested change:
- # Check for presence of coordinates in mask
+ # Check if each coordinate in dset_ijk is present in mask_coords
  in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

@jdkent merged commit 0c48c2f into neurostuff:main on Jul 25, 2024. 20 checks passed.