
[REF] optimize get_studies_by_mask #891

Merged · 1 commit into neurostuff:main · Jul 25, 2024

Conversation

@jdkent (Member) commented on Jul 22, 2024

This comes from a user who was trying to use get_studies_by_mask but kept running out of memory on their machine; this change makes the function more memory-efficient.

Closes # .

Changes proposed in this pull request:

Summary by Sourcery

This pull request optimizes the get_studies_by_mask function in nimare/dataset.py to improve memory efficiency, addressing issues where users experienced memory exhaustion.

  • Enhancements:
    • Optimized the get_studies_by_mask function to improve memory efficiency by replacing the use of cdist with a more direct coordinate comparison approach.
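For intuition, here is a minimal, self-contained sketch of the before and after (illustrative shapes, data, and variable names; not the exact NiMARE diff):

import numpy as np
from scipy.spatial.distance import cdist

# Illustrative stand-ins: N dataset coordinates, M in-mask voxels (ijk triplets).
rng = np.random.default_rng(0)
dset_ijk = rng.integers(0, 50, size=(1000, 3))
mask_coords = rng.integers(0, 50, size=(5000, 3))

# Old approach: cdist materializes an M x N float64 distance matrix
# (~8 bytes per pair) just to test for exact coordinate matches.
distances = cdist(mask_coords, dset_ijk)
in_mask_old = np.any(distances == 0, axis=0)

# New approach: broadcast integer equality instead; the intermediate
# N x M x 3 boolean array costs ~3 bytes per pair and needs no float math.
in_mask_new = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

assert np.array_equal(in_mask_old, in_mask_new)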

@jdkent changed the title from "[REF] make the function more performant" to "[REF] optimize get_studies_by_mask" on Jul 22, 2024
@jdkent (Member Author) commented on Jul 24, 2024

@sourcery-ai review

sourcery-ai bot (Contributor) commented on Jul 24, 2024

Reviewer's Guide by Sourcery

This pull request optimizes the get_studies_by_mask function in the nimare/dataset.py file to improve memory efficiency. The changes involve replacing the use of scipy.spatial.distance.cdist with a more direct method of checking coordinate presence within the mask, thus reducing memory usage and improving performance.

File-Level Changes

  • nimare/dataset.py: Optimized the get_studies_by_mask function by removing the use of scipy.spatial.distance.cdist and implementing a more memory-efficient method to check for coordinate presence within the mask.


@sourcery-ai bot (Contributor) left a comment

Hey @jdkent - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 4 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


mask_coords = np.vstack(np.where(mask_data)).T

# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)
sourcery-ai bot (Contributor):

issue (performance): Potential performance issue with nested array operations

The nested array operations (np.all and np.any) could be computationally expensive, especially for large datasets. Consider optimizing this part of the code if performance becomes an issue.
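If the N x M broadcast itself ever becomes the bottleneck, one possible alternative is to index the mask array directly, which is O(N) in time and memory. This is a sketch only, assuming dset_ijk holds integer voxel indices and mask_data is the 3D mask array (stand-in data below):

import numpy as np

# Stand-ins for the arrays in the diff.
rng = np.random.default_rng(0)
mask_data = rng.random((10, 10, 10)) > 0.5
dset_ijk = rng.integers(-2, 12, size=(100, 3))  # some deliberately out of bounds

ijk = np.asarray(dset_ijk, dtype=int)

# Ignore coordinates that fall outside the mask's array bounds.
in_bounds = np.all((ijk >= 0) & (ijk < np.array(mask_data.shape)), axis=1)

# Index the mask directly instead of comparing every pair of coordinates.
in_mask = np.zeros(len(ijk), dtype=bool)
in_mask[in_bounds] = mask_data[ijk[in_bounds, 0], ijk[in_bounds, 1], ijk[in_bounds, 2]] != 0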


# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)
found_ids = list(self.coordinates.loc[in_mask, "id"].unique())
sourcery-ai bot (Contributor):

issue (performance): Check for potential performance impact of unique()

Calling unique() on a potentially large dataset could have performance implications. Ensure that this operation is efficient for the expected size of self.coordinates.

@jdkent (Member Author) replied:

What would be your suggestion to optimize this code for a large dataset?

sourcery-ai bot (Contributor) replied:

To optimize the code for a large dataset, you can consider the following suggestions:

  1. Use a Set for Uniqueness:
    Instead of using unique(), you can use a set to collect unique IDs. This can be more efficient for large datasets.

  2. Vectorized Operations:
    Ensure that the operations are as vectorized as possible to leverage NumPy's performance benefits.

  3. Efficient Data Structures:
    If the dataset is extremely large, consider using more memory-efficient data structures or libraries like pandas with dask for out-of-core computation (a hypothetical sketch follows at the end of this reply).

Here's a revised version of the code incorporating the use of a set:

mask_data = mask.get_fdata()
mask_coords = np.vstack(np.where(mask_data)).T

# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

# Use a set to collect unique IDs
found_ids = set(self.coordinates.loc[in_mask, "id"])

# Convert the set back to a list if needed
found_ids = list(found_ids)

This approach should help in managing the performance impact when dealing with large datasets. If you have any specific constraints or further details about the dataset size, feel free to share, and we can refine the approach further.
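As a purely hypothetical illustration of suggestion 3 above (using the dask.dataframe API with stand-in data; the "id" column name mirrors the diff):

import dask.dataframe as dd
import pandas as pd

# Stand-in for a coordinates table too large to hold comfortably in memory.
coords = pd.DataFrame({"id": ["study-1", "study-1", "study-2", "study-3"]})
dcoords = dd.from_pandas(coords, npartitions=2)

# drop_duplicates deduplicates per partition and then merges the results,
# so peak memory stays bounded by the partition size.
found_ids = dcoords["id"].drop_duplicates().compute().tolist()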

- distances = cdist(mask_ijk, dset_ijk)
- distances = np.any(distances == 0, axis=0)
- found_ids = list(self.coordinates.loc[distances, "id"].unique())
+ mask_data = mask.get_fdata()
sourcery-ai bot (Contributor):

suggestion: Consider renaming mask_data to mask_fdata for consistency

Since mask.get_fdata() is being used, it might be clearer to name the variable mask_fdata to indicate that it holds the floating-point data array from the mask.

Suggested change:
- mask_data = mask.get_fdata()
+ mask_fdata = mask.get_fdata()

Comment on lines +656 to +657
# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)
sourcery-ai bot (Contributor):

suggestion: Consider adding a brief explanation for the coordinate check

While comments are generally avoided, a brief explanation here could help future developers understand the purpose and logic behind this coordinate check.

Suggested change:
- # Check for presence of coordinates in mask
+ # Check if each coordinate in dset_ijk is present in mask_coords
  in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

@jdkent merged commit 0c48c2f into neurostuff:main on Jul 25, 2024. 20 checks passed.