[REF] optimize get_studies_by_mask #891

Merged: 1 commit, Jul 25, 2024
15 changes: 8 additions & 7 deletions nimare/dataset.py
@@ -643,19 +643,20 @@ def get_studies_by_mask(self, mask):
         found_ids : :obj:`list`
             A list of IDs from the Dataset with at least one focus in the mask.
         """
-        from scipy.spatial.distance import cdist
-
         mask = load_niimg(mask)

         dset_mask = self.masker.mask_img

         if not np.array_equal(dset_mask.affine, mask.affine):
             LGR.warning("Mask affine does not match Dataset affine. Assuming same space.")

         dset_ijk = mm2vox(self.coordinates[["x", "y", "z"]].values, mask.affine)
-        mask_ijk = np.vstack(np.where(mask.get_fdata())).T
-        distances = cdist(mask_ijk, dset_ijk)
-        distances = np.any(distances == 0, axis=0)
-        found_ids = list(self.coordinates.loc[distances, "id"].unique())
+        mask_data = mask.get_fdata()
Contributor:

suggestion: Consider renaming mask_data to mask_fdata for consistency.

Since mask.get_fdata() is being used, it might be clearer to name the variable mask_fdata to indicate that it holds the floating-point data array from the mask.

Suggested change:
-        mask_data = mask.get_fdata()
+        mask_fdata = mask.get_fdata()

+        mask_coords = np.vstack(np.where(mask_data)).T
+
+        # Check for presence of coordinates in mask
+        in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)
Contributor:

issue (performance): Potential performance issue with nested array operations.

The nested array operations (np.all and np.any) could be computationally expensive, especially for large datasets. Consider optimizing this part of the code if performance becomes an issue.
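One possible direction, shown here only as a hedged sketch (it is not part of this PR): index the mask volume directly at each focus's voxel coordinates instead of comparing every focus against every mask voxel, which avoids materializing the (n_foci, n_mask_voxels, 3) comparison array. The names dset_ijk and mask_data mirror the diff above; rounding to integer indices and the bounds guard are assumptions.

import numpy as np

# Sketch only: assumes dset_ijk (n_foci, 3) and mask_data (3D array) as in the diff.
ijk = np.round(dset_ijk).astype(int)
# Drop foci whose voxel indices fall outside the mask's field of view.
in_bounds = np.all((ijk >= 0) & (ijk < mask_data.shape), axis=1)
in_mask = np.zeros(len(ijk), dtype=bool)
# Fancy-index the mask volume at each in-bounds (i, j, k) triplet.
in_mask[in_bounds] = mask_data[tuple(ijk[in_bounds].T)] != 0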

Contributor (on lines +656 to +657):

suggestion: Consider adding a brief explanation for the coordinate check.

While comments are generally avoided, a brief explanation here could help future developers understand the purpose and logic behind this coordinate check.

Suggested change:
-        # Check for presence of coordinates in mask
+        # Check if each coordinate in dset_ijk is present in mask_coords
         in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

+        found_ids = list(self.coordinates.loc[in_mask, "id"].unique())
Contributor:

issue (performance): Check for potential performance impact of unique().

Calling unique() on a potentially large dataset could have performance implications. Ensure that this operation is efficient for the expected size of self.coordinates.

Member Author:

What would be your suggestion to optimize this code for a large dataset?

Contributor:

To optimize the code for a large dataset, you can consider the following suggestions:

1. Use a set for uniqueness: instead of calling unique(), collect unique IDs in a set; this can be more efficient for large datasets.

2. Vectorized operations: ensure that the operations are as vectorized as possible to leverage NumPy's performance benefits.

3. Efficient data structures: if the dataset is extremely large, consider more memory-efficient data structures, or libraries like pandas with dask for out-of-core computation.

Here's a revised version of the code incorporating the use of a set:

mask_data = mask.get_fdata()
mask_coords = np.vstack(np.where(mask_data)).T

# Check for presence of coordinates in mask
in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

# Use a set to collect unique IDs
found_ids = set(self.coordinates.loc[in_mask, "id"])

# Convert the set back to a list if needed
found_ids = list(found_ids)

This approach should help in managing the performance impact when dealing with large datasets. If you have any specific constraints or further details about the dataset size, feel free to share, and we can refine the approach further.
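For reference, a hypothetical sanity check (not from the thread) that the new broadcast comparison selects the same foci as the removed cdist-based zero-distance test; the synthetic integer coordinates below are stand-ins for the mm2vox output and the mask's nonzero voxels.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
dset_ijk = rng.integers(0, 20, size=(500, 3))     # stand-in for the dataset's voxel coordinates
mask_coords = rng.integers(0, 20, size=(300, 3))  # stand-in for the mask's nonzero voxels

# Removed approach: a pairwise distance of exactly zero means an exact voxel match.
distances = cdist(mask_coords, dset_ijk)
old_in_mask = np.any(distances == 0, axis=0)

# New approach: broadcast elementwise equality over all (focus, mask voxel) pairs.
new_in_mask = np.any(np.all(dset_ijk[:, None] == mask_coords[None, :], axis=-1), axis=-1)

assert np.array_equal(old_in_mask, new_in_mask)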


         return found_ids

     def get_studies_by_coordinate(self, xyz, r=20):