Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix statistics for coverage #684

Merged
merged 3 commits into from
Mar 22, 2024
Merged

Conversation

vincentsarago
Copy link
Member

closes #680
overtake #681

Better take coverage into account. This PR tries to match exactextract results!

cc @j08lue

) -> float:
i = numpy.argsort(values)
c = numpy.cumsum(weights[i])
return values[i[numpy.searchsorted(c, numpy.array(quantiles) * c[-1])]]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be removed with numpy 2.0

@vincentsarago vincentsarago force-pushed the patch/better-statistics-for-coverage branch from b1e8d73 to 699a902 Compare March 12, 2024 14:48
# Avoid non masked nan/inf values
numpy.ma.fix_invalid(data, copy=False)

for b in range(data.shape[0]):
keys, counts = numpy.unique(data[b].compressed(), return_counts=True)
data_comp = data[b].compressed()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data[b].compressed() was called multiple times

# Population standard deviation of cell values, taking into account coverage fraction.
"std": _weighted_stdev(data_comp, masked_coverage.compressed()),
# Median value of cells, weighted by the percent of each cell that is covered.
"median": _weighted_quantiles(data_comp, masked_coverage.compressed()),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std and median are now weighted by the coverage array

assert stats[0]["count"] == 1.75
assert stats[0]["median"] == 3 # 2 in exactextract
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why median gives a different results. I've tested a new numpy 2.0 method and it gives 3 while exactextract give 2. I don't want to over engineer the median calculation

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a raster of type T exactextract is returning type T for the quantile and median calculations. T here is int64, so the median of 2.5 is getting truncated to 2. Maybe quantile/median should be returning float64 instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oO that makes sense 🙏 thanks for having a look

assert stats[0]["max"] == 9
# exactextract takes coverage into account, we don't
assert stats[0]["minority"] == 1 # 1 in exactextract
assert stats[0]["majority"] == 1 # 5 in exactextract
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minority and majority do not take coverage into account. We might do this later if needed

@vincentsarago
Copy link
Member Author

Note: I've been working on making exactextract available on pypi so we can integrate into our CI isciences/exactextract#87

@vincentsarago vincentsarago merged commit 2e76ad7 into main Mar 22, 2024
7 checks passed
@vincentsarago vincentsarago deleted the patch/better-statistics-for-coverage branch March 22, 2024 09:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect array statistics with (coverage) weights
2 participants