Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testing for sf object #62

Closed
btupper opened this issue Oct 30, 2024 · 5 comments · Fixed by #63
Closed

Testing for sf object #62

btupper opened this issue Oct 30, 2024 · 5 comments · Fixed by #63

Comments

@btupper
Copy link

btupper commented Oct 30, 2024

I'm trying to understand the context of the prep.spatial_recipe function. In particular, the training argument is being tested for a "geometry" column. This may lead to problems if the geometry list column is not named "geometry" - there is no requirement that the the geometry list column have any name in particular. In fact, when reading from the ubiquitous geopackage (".gpkg") formatted files it seems that the column is named "geom".

The same test occurs in bake.spatial_recipe here but with a newdata argument.

Is the point to test that the training (or newdata) object is an sf object? If so, would inherits(training, 'sf') suffice? Maybe like this...

# if we have a geometry
if (inherits(training, "sf")) {
  ## convert_geometry_column
  training <- training %>%
     dplyr::bind_cols(sf::st_coordinates(training)) %>%
      sf::st_drop_geometry()
}

On the other hand, it may be that training data is something I misunderstand. Perhaps it has been modified upstream to explicitly contain a "geometry" column? That would make my head melt, but would explain my confusion, too.

@dramanica
Copy link
Member

If I remember correctly, this is a hack to deal with the fact that, when running a workflow, datasets that are generated from a spatial_split end up having a geometry column even though they are not of class sf. You could try to implement your change, but I think it would lead to failures if you run a Check (I am not sure whether there is a unit test that deals with that, or whether it would be caught by the full vignette when it maps the workflow).
However, even if I am correct, the current solution will fail on sf objects that don't have a geometry column (as you point out). For the recipe methods, the solution would be to implement a hybrid check (check for sf AND for a geometry column). What I am not sure is whether the workflow will then create datasets with a geometry column, or whether the custom name would be adopted. I guess we would have to try that out.

@btupper
Copy link
Author

btupper commented Oct 30, 2024

Oh, I see. Yep - I knew it would melt my head.

May be a simple solution might be a simple heads-up to the user to name the geometry list column "geometry" ala obj = sf::st_set_geometry(obj, "geometry")? I did bump into issues while trying to use a geometry list column named geom, and it didn't complain until I tried workflow_set().

@dramanica
Copy link
Member

@btupper Can you have a look at the branch rename_geom? It should now cope with sf objects that have alternative names for the geometry column.

@btupper
Copy link
Author

btupper commented Oct 31, 2024

Woooohooo! It works without issue after assigning a random name to the geometry list column. The example below borrows from your tutorial but substitutes in some local data. Thank you!

library(tidysdm)
#> Loading required package: tidymodels
#> Loading required package: spatialsample
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(here)
#> here() starts at /Users/ben/Library/CloudStorage/Dropbox/code/projects/mola-mola

model_input = sf::read_sf("model_input.gpkg") |> 
  sf::st_set_geometry("happy_halloween")
model_input
#> Simple feature collection with 1089 features and 5 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -76 ymin: 35 xmax: -59.5 ymax: 46
#> Geodetic CRS:  WGS 84
#> # A tibble: 1,089 × 6
#>    class      sst windspeed  u_wind v_wind      happy_halloween
#>    <chr>    <dbl>     <dbl>   <dbl>  <dbl>          <POINT [°]>
#>  1 presence  12.5      4.91 -0.270  -0.543  (-66.60798 42.2943)
#>  2 presence  20.2      4.86  1.59    0.756 (-68.00023 42.60569)
#>  3 presence  14.9      8.41  2.05   -0.739 (-72.66639 40.46282)
#>  4 presence  18.1      4.31  1.28    2.79  (-69.75882 42.53246)
#>  5 presence  16.7      8.38  0.864  -1.11    (-75.3774 36.2458)
#>  6 presence  27.8      7.50 -3.31   -1.33  (-75.08756 35.54985)
#>  7 presence  12.6      4.61 -0.155  -0.274 (-67.15288 42.30458)
#>  8 presence  16.6      4.08  0.846   0.505 (-71.46238 41.07934)
#>  9 presence  17.4      4.48  0.906   0.429 (-65.18623 42.52456)
#> 10 presence  17.5      4.15 -0.0541  0.253       (-69.61 41.55)
#> # ℹ 1,079 more rows

rec <- recipe(model_input,  formula = class ~ .)

models <-
  # create the workflow_set
  workflow_set(
    preproc = list(default = rec),
    models = list(
      # the standard glm specs
      glm = tidysdm::sdm_spec_glm(),
      # rf specs with tuning
      rf = tidysdm::sdm_spec_rf(),
      # boosted tree model (gbm) specs with tuning
      gbm = tidysdm::sdm_spec_boost_tree(),
      # maxent specs with tuning
      maxent = tidysdm::sdm_spec_maxent()
    ),
    # make all combinations of preproc and models,
    cross = TRUE  ) |>
  # tweak controls to store information needed later to create the ensemble
  option_add(control = tidysdm::control_ensemble_grid())

input_cv <- spatial_block_cv(data = model_input, v = 3, n = 5)

set.seed(1234567)
models <-  models |>
  workflow_map("tune_grid",
               resamples = input_cv, 
               grid = 3,
               metrics = tidysdm::sdm_metric_set(), 
               verbose = TRUE )
#> i    No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 4 resampling: default_glm
#> ✔ 1 of 4 resampling: default_glm (264ms)
#> i 2 of 4 tuning:     default_rf
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✔ 2 of 4 tuning:     default_rf (2.3s)
#> i 3 of 4 tuning:     default_gbm
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✔ 3 of 4 tuning:     default_gbm (6s)
#> i 4 of 4 tuning:     default_maxent
#> ✔ 4 of 4 tuning:     default_maxent (1.9s)

Created on 2024-10-31 with reprex v2.1.1

@dramanica dramanica mentioned this issue Oct 31, 2024
@dramanica
Copy link
Member

@btupper thanks for double-checking. Now in dev following #63

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants