Cross validation function does not seem to consider offset in model #372

Closed
marchtaylor opened this issue Sep 19, 2024 · 4 comments

@marchtaylor

Hello,
I have been enjoying using this package very much - thank you for the great tool.

I have just started moving to a model that accounts for swept area through an offset term. When running cross validation with sdmTMB_cv, the offset is given as a character string naming the data variable (e.g. offset = "logSweptArea"). With predict.sdmTMB, however, one must provide a vector of values with one element per row of the data (e.g. offset = dat$logSweptArea).

The issue is that the offset information is not being passed to the prediction within sdmTMB_cv:

predicted <- predict(object, newdata = cv_data, type = "response")

It looks like I can fix this manually afterwards, but it would be worth fixing in the function to avoid confusion.
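A minimal sketch of what such a manual fix could look like for a single fold, assuming the held-out predictions are recomputed by hand. It uses the dogfish data that ships with sdmTMB rather than the data described above; the fold assignment and the names fold_id, train, test, and log_area are illustrative, and this is not what sdmTMB_cv does internally.

```r
library(sdmTMB)

dat <- subset(dogfish, catch_weight > 0)
dat$log_area <- log(dat$area_swept)

# Illustrative two-fold split (hypothetical; any fold assignment would do)
set.seed(1)
fold_id <- sample(rep(1:2, length.out = nrow(dat)))
train <- dat[fold_id != 1, ]
test <- dat[fold_id == 1, ]

# Fit on the training fold with the offset supplied as a vector
fit <- sdmTMB(
  catch_weight ~ 1,
  data = train,
  family = Gamma("log"),
  offset = train$log_area,
  spatial = "off"
)

# Predict the held-out fold, passing that fold's own offset explicitly
pred <- predict(fit, newdata = test, offset = test$log_area)

# Link-scale estimates back-transformed to the response scale
held_out_mu <- exp(pred$est)
```

The held-out values in held_out_mu can then be compared against test$catch_weight in place of the cv_predicted column.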

Also, when not predicting to a new dataset, it might be more logical to have the predict.sdmTMB(fit) automatically use the offset in fit$offset

Cheers,
Marc

@seananderson
Member

Thanks for pointing this out, Marc. I had started thinking about that in this issue: #274

But I hadn't thought about the cross validation issue.

> Also, when not predicting to a new dataset, it might be more logical to have the predict.sdmTMB(fit) automatically use the offset in fit$offset

That is the behaviour now:

library(sdmTMB)
dat <- subset(dogfish, catch_weight > 0)
dat <- dat[1:5, ]
m3 <- sdmTMB(catch_weight ~ 1, data = dat, family = Gamma("log"), offset = log(dat$area_swept), spatial = "off")
predict(m3)$est
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m3, offset = rep(0, 5))$est
#> [1] 7.259584 7.259584 7.259584 7.259584 7.259584

Created on 2024-09-20 with reprex v2.1.1

The problem is the cross validation function supplies newdata without passing through an offset.

Leaving the cross validation aside, my thinking was that once newdata is supplied, the offset should be assumed to be zero unless one is explicitly supplied. The usual case is that newdata is not the same as the original data, so why would I assume that the offset should be identical?
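To illustrate why a zero offset is a natural default for standardizing, here is a small base R sketch on simulated data (the data frame sim and the value of true_density are made up for illustration). With a log link the linear predictor is the intercept plus the offset, so removing the offset, or equivalently predicting at an offset of 0, leaves the log of the expected catch per unit of area swept.

```r
# Simulated, hypothetical data: expected catch scales with area swept
set.seed(42)
sim <- data.frame(area_swept = runif(200, 0.5, 2))
true_density <- 5  # assumed catch per unit area, for illustration only
sim$catch <- rgamma(200, shape = 10, rate = 10 / (true_density * sim$area_swept))

fit_glm <- glm(catch ~ 1, data = sim, family = Gamma("log"),
  offset = log(sim$area_swept))

# Link-scale predictions for the fitted rows include the offset
eta_fitted <- predict(fit_glm)

# Subtracting the offset leaves the intercept for every row:
# the log expected catch at one unit of area swept
eta_unit <- eta_fitted - log(sim$area_swept)
range(eta_unit)
coef(fit_glm)
```

This mirrors the two sdmTMB outputs above: the varying values are the intercept plus log(area_swept), and the constant value is the intercept alone.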

library(sdmTMB)
dat <- subset(dogfish, catch_weight > 0)
dat <- dat[1:5, ]
m3 <- sdmTMB(catch_weight ~ 1, data = dat, family = Gamma("log"), offset = log(dat$area_swept), spatial = "off")
predict(m3, newdata = dat)$est
#> [1] 7.259584 7.259584 7.259584 7.259584 7.259584
predict(m3, newdata = dat, offset = rep(0, 5))$est
#> [1] 7.259584 7.259584 7.259584 7.259584 7.259584

Created on 2024-09-20 with reprex v2.1.1

It appears glm() always applies the original offset, regardless of newdata and regardless of what you pass to the offset argument. glmmTMB likewise always includes the original offset, and warns that offset is an unknown argument if you supply one. These approaches seem crazy to me and wouldn't work with the need to predict at a given offset (usually 0) for the purpose of standardizing for area swept.

library(sdmTMB)
dat <- subset(dogfish, catch_weight > 0)
dat <- dat[1:5, ]
m <- glm(catch_weight ~ 1, data = dat, family = Gamma("log"), offset = log(dat$area_swept))

predict(m)
#>        1        4        5        6        7 
#> 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m, newdata = dat[1:2,])
#> Warning in predictor + offset: longer object length is not a multiple of
#> shorter object length
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m, newdata = dat[1:2,], offset = rep(0, 2))
#> Warning in predictor + offset: longer object length is not a multiple of
#> shorter object length
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m, newdata = dat[1:2,], offset = log(dat$area_swept))
#> Warning in predictor + offset: longer object length is not a multiple of
#> shorter object length
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m, newdata = dat[1:2,], offset = rep(0, 5))
#> Warning in predictor + offset: longer object length is not a multiple of
#> shorter object length
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017

m2 <- glmmTMB::glmmTMB(catch_weight ~ 1, data = dat, family = Gamma("log"), offset = log(dat$area_swept))
predict(m2)
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m2, newdata = dat[1:2,])
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m2, newdata = dat[1:2,], offset = rep(0, 2))
#> Warning in check_dots(..., .action = "warning"): unknown arguments: offset
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m2, newdata = dat[1:2,], offset = log(dat$area_swept))
#> Warning in check_dots(..., .action = "warning"): unknown arguments: offset
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m2, newdata = dat[1:2,], offset = rep(0, 5))
#> Warning in check_dots(..., .action = "warning"): unknown arguments: offset
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017

m3 <- sdmTMB(catch_weight ~ 1, data = dat, family = Gamma("log"), offset = log(dat$area_swept), spatial = "off")
predict(m3)$est
#> [1] 4.985392 5.157243 4.920854 5.157243 5.046017
predict(m3, newdata = dat[1:2,])$est
#> [1] 7.259584 7.259584
predict(m3, newdata = dat[1:2,], offset = rep(0, 2))$est
#> [1] 7.259584 7.259584
predict(m3, newdata = dat[1:2,], offset = log(dat$area_swept))$est
#> Error in `predict()`:
#> ! Prediction offset vector does not equal number of rows in prediction
#>   dataset.
predict(m3, newdata = dat[1:2,], offset = rep(0, 5))$est
#> Error in `predict()`:
#> ! Prediction offset vector does not equal number of rows in prediction
#>   dataset.

Created on 2024-09-20 with reprex v2.1.1

Leaving aside the above mess, I'll get the cross validation part working to close this issue...

@seananderson
Member

It's now fixed. Use a character offset in sdmTMB_cv(). If you pass a vector instead, it will error out and tell you to use a character string.

library(sdmTMB)
dat <- subset(dogfish, catch_weight > 0)
set.seed(1)
x <- sdmTMB_cv(catch_weight ~ 1,
  data = dat, family = Gamma("log"),
  offset = "area_swept", spatial = "off",
  mesh = make_mesh(dat, c("X", "Y"), cutoff = 10), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
y <- x$data[, c("catch_weight", "cv_predicted")]
plot(y$catch_weight, y$cv_predicted)

Created on 2024-09-20 with reprex v2.1.1

As proof, the predictions from this intercept-only model vary across observations, which indicates that the offset is being included in the prediction.

@marchtaylor
Author

Great - thanks Sean. I agree that it is worrying to always apply the offset from the original data if newdata is specified. I guess for non-TMB models, using the +offset() term would help catch mis-specified model predictions. Not sure exactly why this doesn't work for TMB. Perhaps adding a warning message to the predict function would help, when newdata is specified and no offset is provided, but !is.null(fit$offset).
Thanks again!
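For reference, a short base R sketch of the +offset() behaviour mentioned above, on simulated data (dat_sim, new_dat, and the column names are made up for illustration). When the offset is written as a formula term, predict() re-evaluates it from newdata, and a newdata that lacks the offset variable fails instead of silently reusing the original offset.

```r
# Simulated, hypothetical data
set.seed(1)
dat_sim <- data.frame(area_swept = runif(50, 0.5, 2))
dat_sim$catch <- rgamma(50, shape = 10, rate = 10 / (5 * dat_sim$area_swept))

# Offset as a formula term rather than via the `offset` argument
m_term <- glm(catch ~ 1 + offset(log(area_swept)),
  data = dat_sim, family = Gamma("log"))

# The offset is recomputed from newdata, so doubling the area swept
# shifts the link-scale prediction by log(2)
new_dat <- data.frame(area_swept = c(1, 2))
p <- predict(m_term, newdata = new_dat)
diff(p)

# newdata without the offset variable raises an error, which is caught here
err <- tryCatch(
  predict(m_term, newdata = data.frame(x = 1)),
  error = function(e) conditionMessage(e)
)
err
```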

@seananderson
Member

Great idea. I've added a message:

  library(sdmTMB)
  dat <- subset(dogfish, catch_weight > 0)
  fit <- sdmTMB(
    catch_weight ~ 1,
    data = dat,
    family = Gamma("log"),
    offset = "area_swept",
    spatial = "off"
  )
  pred <- predict(fit)
  pred <- predict(fit, offset = rep(0, nrow(dat)))
  pred <- predict(fit, newdata = qcs_grid, offset = rep(0, nrow(qcs_grid)))
  pred <- predict(fit, newdata = qcs_grid)
#> Fitted object contains an offset but the offset is `NULL` in
#> `predict.sdmTMB()`.
#> Prediction will proceed assuming the offset vector is 0 in the prediction.
#> Specify an offset vector in `predict.sdmTMB()` to override this.

Created on 2024-09-23 with reprex v2.1.1
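The decision logic described across this thread can be sketched roughly as follows. This is a hypothetical helper written for illustration, not sdmTMB's actual implementation; the name resolve_offset and its arguments are assumptions, and the message and error text only paraphrase the output shown above.

```r
# Hypothetical helper (not part of sdmTMB): choose the prediction offset.
# fit_offset: offset used in fitting, or NULL; n_fit: rows in the fitted data.
resolve_offset <- function(fit_offset, n_fit, newdata = NULL, offset = NULL) {
  n <- if (is.null(newdata)) n_fit else nrow(newdata)
  if (is.null(offset)) {
    if (is.null(newdata) && !is.null(fit_offset)) {
      # No newdata: reuse the offset from the original fit
      offset <- fit_offset
    } else {
      if (!is.null(newdata) && !is.null(fit_offset)) {
        message("Fitted object contains an offset but none was supplied; assuming 0.")
      }
      # newdata without an offset (or no offset at all): assume zero
      offset <- rep(0, n)
    }
  }
  if (length(offset) != n) {
    stop("Prediction offset vector does not equal number of rows in prediction dataset.",
      call. = FALSE)
  }
  offset
}

fit_off <- log(c(1, 2, 3))
nd <- data.frame(x = 1:2)
resolve_offset(fit_off, 3)                                # original offset
resolve_offset(fit_off, 3, newdata = nd)                  # zeros, with a message
resolve_offset(fit_off, 3, newdata = nd, offset = c(0.1, 0.2))  # supplied offset
```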
