Improvements thin_observations #81

Open · mhesselbarth opened this issue Nov 10, 2023 · 7 comments
Labels
improvement (Smaller improvements to existing features) · Quality of life (Quality of life improvement)

Comments

@mhesselbarth
Collaborator

I already fixed quite a few bugs in thin_observations and streamlined the code, which should make moving forward easier (14b2e52).

Further improvements:

  1. The use of minpoint is still confusing in terms of language, I think. To me, it implies that more points could also remain in a cell/zone. However, this is not the case: if a cell/zone is sampled, that fixed number of points is what remains in the cell.
  2. The upper limit of totake depends on the minimum point count per cell. However, for the environmental and zone methods the sampling is done at a much larger scale. In that case it would be nicer to use the minimum count per zone instead?
  3. The weighting by bias value doesn't really make sense, because the sampling is grouped by cell, i.e., all weights within a group are exactly the same anyway (see the sketch after this list).
  4. I started to implement a spatial option, but as with the bias weighting, all intensity weights will effectively be the same because of the grouping by cell.
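
As a rough illustration of points 3-4 (a minimal sketch with made-up column names, not the actual thin_observations internals): once sampling is grouped by cell, every candidate row within a group carries the same cell-level weight, so weighted sampling collapses to uniform sampling.

# illustrative only: hypothetical observations with one bias value per raster cell
library(dplyr)

obs <- data.frame(
  id   = 1:6,
  cell = c(1, 1, 1, 2, 2, 2),
  bias = c(0.2, 0.2, 0.2, 0.9, 0.9, 0.9)  # bias is a cell-level covariate
)

# within each cell the weights are identical, so sampling with
# weight_by = bias behaves exactly like unweighted sampling
obs |>
  dplyr::group_by(cell) |>
  dplyr::slice_sample(n = 1, weight_by = bias)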
@Martin-Jung
Collaborator

Thanks, that all makes sense.
I was thinking that maybe, to specifically test minpoints and similar things, we might need a dedicated unit test where similar amounts of points are always removed.

@mhesselbarth
Collaborator Author

Points 1-2 are addressed by fee3139.

Points 3-4 need some more general thinking about whether and how this could be improved.

@Martin-Jung added the Quality of life and improvement labels on Jan 13, 2024
@jeffreyhanson
Contributor

Hi,

I was wondering if you'd considered using an ILP (integer linear programming) approach for spatial thinning? In case it's of interest, I've put together a small example with HiGHS. I'm not sure how well this scales with dataset size; you'd probably get better performance with Gurobi. No worries if you're not interested in an ILP approach, I just thought I'd share this in case it's useful.

# load packages
library(sf)
library(Matrix)
library(assertthat)
library(highs)
library(scales)

# define function to spatially thin points
thin_points <- function(x, max_dist, threads = 1, gap = 0, verbose = FALSE) {
  # assert valid arguments
  assertthat::assert_that(
    inherits(x, "sf"),
    assertthat::is.number(max_dist),
    assertthat::is.count(threads),
    assertthat::noNA(threads),
    assertthat::is.number(gap),
    assertthat::noNA(gap),
    assertthat::is.flag(verbose),
    assertthat::noNA(verbose)
  )

  # identify which pairs of points are within the threshold distance
  neighbor_indices <- sf::st_is_within_distance(x, dist = max_dist)
  n_neighbors <- lengths(neighbor_indices)

  # compute variables
  n <- nrow(x)

  # run optimization
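  # note: the ILP maximizes the number of retained points (the objective L),
  # subject to each neighborhood (a row of A, which includes the point itself)
  # containing at most one retained point (lhs = 0, rhs = 1)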
  result <- highs::highs_solve(
    L = rep(1, n),
    lower = rep(0, n),
    upper = rep(1, n),
    A = Matrix::sparseMatrix(
      i = rep(seq_along(neighbor_indices), n_neighbors),
      j = unlist(neighbor_indices, recursive = TRUE, use.names = FALSE),
      x = rep(1, sum(n_neighbors)),
      dims = c(n, n)
    ),
    lhs = rep(0, n),
    rhs = rep(1, n),
    types = rep("I", n),
    maximum = TRUE,
    control = highs::highs_control(
      threads = threads,
      log_to_console = FALSE,
      mip_rel_gap = gap,
      presolve = "on"
    )
  )

  # check if invalid solution
  # nocov start
  if (
    is.null(result) ||
    is.null(result$primal_solution) ||
    any(is.na(result$primal_solution)) ||
    isTRUE(!result$status %in% c(7L, 11L, 12L, 13L, 14L))
  ) {
    stop("Failed to solve the problem!")
  }
  # nocov end

  # return selected points
  x[which(result$primal_solution > 0.5), , drop = FALSE]
}

# define parameters for example
n_points <- 100
max_dist <- 0.15

# simulate data
point_data <-
  data.frame(
    id = seq_len(n_points), x = runif(n_points), y = runif(n_points)
  ) |>
  sf::st_as_sf(coords = c("x", "y"))

# thin data
thinned_point_data <- thin_points(point_data, max_dist)

# determine which points selected
point_data$is_selected <- point_data$id %in% thinned_point_data$id

# plot results for visual check
## blue circles = buffered circles based on maximum distance
## black points = points not selected after thinning
## red points = points selected after thinning
plot(
  point_data |> st_buffer(max_dist) |> st_geometry(),
  col = scales::alpha("blue", 0.1), border = NA
)
points(
  point_data[!point_data$is_selected, ],
  pch = 16, cex = 2, col = "black"
)
points(
  point_data[point_data$is_selected, ],
  pch = 16, cex = 2, col = "red"
)

[image: solution plot]

@jeffreyhanson
Contributor

jeffreyhanson commented Nov 21, 2024

Also, if you wanted even better performance, you might be able to do some preprocessing to split the optimization problem into multiple sub-problems (e.g., if you have two clusters of points that are very far apart from each other, you could split the problem into two sub-problems and solve them separately). This could probably be done using the igraph package (e.g., maybe igraph::cliques()?).
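
As a rough sketch of that idea (assuming the thin_points() function from the example above, and using connected components of the neighbor graph rather than cliques, since points in different components can never violate the distance constraint together):

# sketch: split into independent sub-problems and solve each separately
library(igraph)

split_and_thin <- function(x, max_dist, ...) {
  # build the neighbor graph and find its connected components
  nb <- sf::st_is_within_distance(x, dist = max_dist)
  g <- igraph::graph_from_adj_list(nb, mode = "all")
  membership <- igraph::components(g)$membership
  # thin each component on its own and recombine the results
  parts <- lapply(split(seq_len(nrow(x)), membership), function(idx) {
    thin_points(x[idx, , drop = FALSE], max_dist, ...)
  })
  do.call(rbind, parts)
}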

@jeffreyhanson
Contributor

jeffreyhanson commented Nov 21, 2024

Also, I was wondering if you had considered using a k-medoid approach for environmental clustering? If I understand correctly, at the moment the environmental thinning (1) uses a k-means analysis to assign observations to environmental clusters and then (2) samples points from clusters with many points to return the remainpoints number of points. A simpler approach would involve using a k-medoid analysis instead (e.g., see https://cran.r-project.org/web/packages/banditpam/vignettes/kmediods.html). Briefly, k-medoids is conceptually similar to k-means, except that instead of returning "cluster centers" that can be located anywhere in n-dimensional space, the k-medoids analysis will only return cluster centers at locations that have input points in them. So, you can use the k-medoids analysis to identify a remainpoints number of points that maximize representation of the environmental space. Happy to put together a reproducible example if useful?

EDIT: Sorry, I originally misunderstood how the environmental thinning was done. I've edited the comment to correct this (but please correct me if I'm still wrong!).
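
For what it's worth, a minimal sketch of what that could look like (an assumption on my part: this uses cluster::pam() rather than the banditpam package linked above, and made-up environmental covariates):

# k-medoid sketch: retain the remainpoints observations that act as medoids
# of the environmental space, i.e., maximally representative retained points
library(cluster)

set.seed(1)
remainpoints <- 25
env <- data.frame(temp = rnorm(500), precip = rnorm(500))  # hypothetical covariates

pam_fit <- cluster::pam(env, k = remainpoints)
keep_idx <- pam_fit$id.med  # row indices of the medoid observations
thinned_env <- env[keep_idx, ]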

@Martin-Jung
Collaborator

Martin-Jung commented Nov 23, 2024

Thanks a lot for the thoughts, Jeff!
The ILP idea is a very nice one; it might even be a scientifically novel thing, as I am not aware of anyone having attempted it before. That being said, the usual use case for spatial/environmental thinning in SDM applications is when there is an excessive number of presumably biased observations, and the number of observations can be very large, so scalability is key and I am not sure how efficient it would be.

Some common use cases as examples:

  • Observations cluster closely together in, for example, a highly accessible location like a larger city. In this case the aim is to preferentially thin in this area and less in the others.
  • The performance of SDMs has been shown to improve with a better / more equal environmental spread, e.g., having roughly similar amounts of training data for different conditions. In this case we want to thin environmentally to the point where there is similar coverage among similar conditions in the training domain.
  • Random: We have something like ~1 million observations for a species; however, there are diminishing returns in fitting models with training data that large, and usually it just unnecessarily increases the run time while the results over the same environmental space are comparable. Here we just want to thin down equally everywhere.

The k-medoids method sounds nice, but I am a bit hesitant to add another dependency just for that (unless it is loaded on demand and not added as a Suggests or so). I am also wondering whether the results differ enough from the centroid-based ones to justify this. Either way, the thin_observations function mostly covers common tools, and I know that there are many other specialized R packages dedicated to that purpose alone (like spThin, for example).

@jeffreyhanson
Contributor

jeffreyhanson commented Nov 24, 2024

No worries!

Yeah, it might be scientifically novel - not sure? During my PhD, I think I submitted a PR to spThin with the ILP method, but I'm not sure if that ever went anywhere. Yeah, that's a good point - the spatial/environmental thinning procedures are indeed applied to very large observation datasets, so it'd be good to make sure that it scales well. I don't suppose you've got a masters student that's looking for a new project? It could be an interesting small study to benchmark the commonly used stingy (backwards) heuristic algorithm against ILP methods.

Yeah, I completely understand not wanting to add another dependency. I would say/guess that the k-means approach begins to approximate the k-medoid approach as the number of pre-defined clusters increases. This is because, at a lower number of clusters, k-means will fail to represent fine-scale structure (e.g., smaller clusters within the larger clusters), because the method involves randomly sampling from the larger clusters. However, as the number of pre-defined clusters increases further -- up to the point where each point is effectively assigned to its own cluster -- the k-means approach will begin to differ from the k-medoid method again, because then the k-means approach is effectively just randomly sampling observations. That's just my impression/guess though - so I might be wrong? Happy to put together some quick simulations if you're interested?
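
To make that guess concrete (a quick, hypothetical simulation sketch; the cluster-then-sample step is only a loose stand-in for the current k-means based thinning, not its actual implementation):

# compare sampling within k-means clusters against taking k-medoids directly
library(cluster)

set.seed(123)
env <- matrix(rnorm(1000 * 2), ncol = 2)  # 1000 points in a 2-d environmental space
remainpoints <- 50

# k-means route: cluster, then sample roughly evenly from each cluster
km <- stats::kmeans(env, centers = 10)
per_cluster <- ceiling(remainpoints / 10)
keep_km <- unlist(lapply(split(seq_len(nrow(env)), km$cluster), function(i) {
  i[sample.int(length(i), min(length(i), per_cluster))]
}))

# k-medoid route: the medoids themselves are the retained observations
keep_pam <- cluster::pam(env, k = remainpoints)$id.med

# crude comparison: spread of pairwise distances among retained points
summary(as.vector(dist(env[keep_km, ])))
summary(as.vector(dist(env[keep_pam, ])))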

Also, just an idea - please feel free to ignore it if you disagree or think it would take too much time/effort. I think incorporating the spatial/environmental thinning procedures into the package is really worthwhile, but I understand not wanting to add too many additional dependencies. To resolve this, I wonder if you had considered providing the spatial/environmental thinning procedures as their own package. This new package could provide functionality for some of the more common/easier thinning approaches using hard dependencies (i.e., Imports), and some of the more advanced approaches that require particular dependencies using soft/optional dependencies (i.e., Suggests). As such, if the user installs ibis.SDM and it has this new package as a soft dependency, then (1) the user can install all soft dependencies for ibis.SDM and get access to the basic thinning approaches, and/or (2) the user can install all soft dependencies for the new package and get access to all thinning approaches. Alternatively, maybe it would be possible to collaborate with a maintainer of a package that already provides dedicated functions for thinning? This could reduce the maintenance burden for you. E.g., if it were possible to update an existing thinning package so that it provided all the desired thinning functionality, then ibis.SDM could just have that package as a soft dependency. For example, this new R package provides some methods for spatial thinning (https://github.com/jmestret/GeoThinneR) - maybe environmental thinning could be implemented there via some PRs?
