Improvements thin_observations #81
Thanks, that all makes sense.
Points 1-2 are addressed by fee3139. Points 3-4 need some more general thinking about whether and how this could be improved.
Hi, I was wondering if you'd considered using an ILP approach for spatial thinning? In case it's of interest, I've put together a small example with HiGHS. I'm not sure how well this scales with dataset size, though; you'd probably get better performance with Gurobi. No worries if you're not interested in an ILP approach, I just thought I'd share this in case it's useful.
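For context, the ILP approach mentioned above can be framed as a maximum independent set problem: one binary variable per observation, maximizing the number kept subject to "at most one of each too-close pair". Below is a minimal Python sketch (the thread's example uses HiGHS from R; SciPy's `milp` solver is also HiGHS-backed). The function name `ilp_thin` and the distance threshold are illustrative assumptions, not part of any package discussed here.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds
from scipy.spatial.distance import pdist, squareform

def ilp_thin(coords, min_dist):
    """Keep the largest subset of points that are pairwise >= min_dist apart.

    ILP formulation (maximum independent set): binary x_i = keep point i,
    maximize sum(x) subject to x_i + x_j <= 1 for each conflicting pair.
    """
    n = len(coords)
    d = squareform(pdist(coords))
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if d[i, j] < min_dist]
    if not pairs:
        return np.ones(n, dtype=bool)  # nothing conflicts; keep everything
    A = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        A[row, i] = A[row, j] = 1
    res = milp(
        c=-np.ones(n),                          # minimize -sum(x): keep as many as possible
        constraints=LinearConstraint(A, ub=1),  # at most one point per conflicting pair
        integrality=np.ones(n),                 # all variables integer (binary via bounds)
        bounds=Bounds(0, 1),
    )
    return res.x.round().astype(bool)

# Two tight pairs of points: the optimum keeps one point from each pair.
coords = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.05, 5.0]])
keep = ilp_thin(coords, min_dist=1.0)
print(keep.sum())  # 2 points retained
```

Unlike the usual greedy heuristics, the solver certifies that no larger conflict-free subset exists, which is the main appeal of the ILP framing.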
Also, if you wanted even better performance, you might be able to do some preprocessing to split the optimization problem into multiple sub-problems (e.g., if you have two clusters of points that are very far apart from each other, you could split the problem into two sub-problems and solve them separately). This could probably be done using the igraph package.
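The splitting idea above amounts to taking connected components of the "conflict graph" (points joined when closer than the threshold) and thinning each component independently. A small Python sketch with a union-find, mirroring what igraph's connected-components functionality would do; `conflict_components` is a hypothetical helper name:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def conflict_components(coords, min_dist):
    """Group points into independent sub-problems.

    Two points share a component if linked by a chain of pairwise
    distances below min_dist; each component can then be thinned on
    its own, shrinking the optimization problems considerably.
    """
    n = len(coords)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    d = squareform(pdist(coords))
    for i in range(n):
        for j in range(i + 1, n):
            if d[i, j] < min_dist:
                parent[find(i)] = find(j)  # union the two components

    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

# Two well-separated pairs fall into two independent sub-problems.
coords = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.05, 5.0]])
print(conflict_components(coords, min_dist=1.0))  # [[0, 1], [2, 3]]
```

Since ILP solve time grows quickly with problem size, solving several small components is usually much cheaper than one large problem.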
Also, I was wondering if you had considered using a k-medoid approach for environmental clustering? If I understand correctly, at the moment the environmental thinning (1) uses a k-means analysis to assign observations to environmental clusters and then (2) samples points from clusters with many points. EDIT: Sorry, I originally misunderstood how the environmental thinning was done. I've edited the comment to correct this (but please correct me if I'm still wrong!).
Thanks a lot for the thoughts Jeff! Some common use cases as examples:
The k-medoids method sounds nice, but I am a bit hesitant to add another dependency just for that (unless it is loaded on demand and not added to Suggests or similar). I am also wondering whether the centroid results are different enough to justify this.
No worries! Yeah, it might be scientifically novel - not sure? During my PhD, I think I submitted a PR to spThin with the ILP method, but I'm not sure if that ever went anywhere.

Yeah, that's a good point - the spatial/environmental thinning procedures are indeed applied to very large observation datasets, so it'd be good to make sure that it scales well. I don't suppose you've got a masters student who's looking for a new project? It could be an interesting small study to benchmark the commonly used stingy (backwards) heuristic algorithm against ILP methods.

Yeah, I completely understand not wanting to add another dependency. I would say/guess that the k-means approach would begin to approximate the k-medoid approach as the number of pre-defined clusters increases. This is because at lower numbers of clusters, k-means will fail to represent fine-scale structure (e.g., smaller clusters within the larger clusters), because the method involves randomly sampling from the larger clusters. However, as the number of pre-defined clusters increases further -- up to the point where each point is effectively assigned to its own cluster -- the k-means approach will begin to differ from the k-medoid method, because then the k-means approach is effectively randomly sampling observations. That's just my impression/guess though - so it might be wrong? Happy to put together some quick simulations if you're interested?

Also, just an idea - but please feel free to ignore it if you disagree or think it would take too much time/effort. I think incorporating the spatial/environmental thinning procedures into the package is really worthwhile, but I understand not wanting to add too many additional dependencies. To resolve this, I wonder if you had considered providing the spatial/environmental thinning procedures as their own package.
This new package could provide functionality for some of the more common/easier thinning approaches using hard dependencies (i.e., Imports), and some of the more advanced approaches that require particular dependencies using soft/optional dependencies (i.e., Suggests). As such, if the user installs ibis.SDM and it has this new package as a soft dependency, then (1) the user can install all soft dependencies for ibis.SDM to get access to basic thinning approaches, and/or (2) the user can install all soft dependencies for the new package and get access to all thinning approaches. Alternatively, maybe it would be possible to collaborate with a maintainer of a package that already provides dedicated functions for thinning? This could reduce the maintenance burden for you. E.g., if it was possible to update an existing thinning package so that it provided all the desired thinning functionality, then ibis.SDM could just have that package as a soft dependency. For example, this new R package provides some methods for spatial thinning (https://github.com/jmestret/GeoThinneR) - maybe environmental thinning could be implemented via some PRs?
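To make the k-means vs. k-medoids distinction discussed above concrete: k-medoids represents each cluster by an actual observation rather than a synthetic centroid. A toy PAM-style sketch in Python (illustrative only; `k_medoids` is a hypothetical helper with a simplistic deterministic first-k initialisation, and an R analysis would typically use something like `cluster::pam` instead):

```python
import numpy as np

def k_medoids(X, k, iters=20):
    """Toy PAM-style k-medoids: each cluster center is a real observation,
    unlike a k-means centroid, which is an average that may not correspond
    to any record. Deterministic first-k init for simplicity; real
    implementations use random or BUILD initialisation."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = np.arange(k)
    for _ in range(iters):
        labels = d[:, medoids].argmin(axis=1)  # assign each point to nearest medoid
        new = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            # new medoid = member minimising total distance within its cluster
            new.append(members[d[np.ix_(members, members)].sum(axis=1).argmin()])
        new = np.array(new)
        if np.array_equal(new, medoids):
            break  # converged
        medoids = new
    return medoids, d[:, medoids].argmin(axis=1)

# Two tight clusters: the returned medoids are indices of real observations.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
medoids, labels = k_medoids(X, k=2)
print(sorted(medoids.tolist()), labels.tolist())  # [0, 2] [0, 0, 1, 1]
```

Because the medoids are observation indices, thinning could keep those exact records, which is the property being weighed against the extra dependency.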
I already fixed quite a few bugs in `thin_observations` and streamlined the code, which should make moving forward easier (14b2e52). Further improvements:

- `minpoint` is still confusing in terms of language, I think. To me, this implies there is a chance that more points could also remain in a cell/zone. However, this is not the case: if a cell/zone is sampled, that is the fixed number of points that remains in the cell.
- `totake` depends on the minimum point count per cell. However, in the `environmental` and `zone` methods the sampling is done on a much larger scale. Thus, in this case it would be nice to use the minimum count per zone instead?
- The `spatial` option: similar as for `bias`, all intensity weights will actually be the same because of the grouping by cell.
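As a rough illustration of the "minimum count per zone" point above: the idea is to cap the retained observations per zone rather than per raster cell. The sketch below is a Python analogue under assumed behaviour; `thin_by_zone` and its arguments are hypothetical names, not the package's actual API.

```python
import numpy as np

def thin_by_zone(zone_ids, min_per_zone, seed=0):
    """Hypothetical sketch of zone-level thinning: within each zone,
    randomly keep at most `min_per_zone` observations; zones with fewer
    observations than the minimum are kept entirely."""
    rng = np.random.default_rng(seed)
    keep = np.zeros(len(zone_ids), dtype=bool)
    for z in np.unique(zone_ids):
        idx = np.flatnonzero(zone_ids == z)       # observations in this zone
        take = min(len(idx), min_per_zone)        # never oversample a small zone
        keep[rng.choice(idx, size=take, replace=False)] = True
    return keep

# Zone 1 has 4 observations (thinned to 2), zones 2 and 3 keep all theirs.
zones = np.array([1, 1, 1, 1, 2, 2, 3])
keep = thin_by_zone(zones, min_per_zone=2)
print(keep.sum())  # 5 observations kept
```

The per-cell variant would behave the same way with cell IDs in place of zone IDs; the contrast is simply the scale at which the minimum count is enforced.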