rep_slice_sample on groups with multiple n values #527

Open
adrie-stclair opened this issue Mar 25, 2024 · 2 comments
Labels
feature a feature request or enhancement

adrie-stclair commented Mar 25, 2024

Hello package maintainers!
I am building confidence intervals for groups with bootstrapped values and I'm having trouble creating multiple re-sampled datasets from which to build my confidence intervals.

Using the palmerpenguins library as an example:

library(tidyverse)
library(infer)
library(palmerpenguins)

There are 344 total observations and each species has a different number of observations:

nrow(penguins)
# [1] 344

penguins %>% group_by(species) %>% count()

# A tibble: 3 × 2
# Groups:   species [3]
#   species       n
#   <fct>     <int>
# 1 Adelie      152
# 2 Chinstrap    68
# 3 Gentoo      124

I want to group by species and, for each species, draw multiple resamples that each keep that species' original number of observations.

set.seed(100)

slices <- penguins %>% 
    group_by(species) %>% 
    rep_slice_sample(prop = 1, replace = TRUE, reps = 10)

That should give me 344 * 10 = 3440 rows in the full new data set, and it does. But within each replicate the number of observations per species varies. Every Adelie sample should have n = 152, every Chinstrap sample 68, and every Gentoo sample 124. Instead we find this:

slices %>% group_by(species, replicate) %>% count()

# A tibble: 30 × 3
# Groups:   species, replicate [30]
#    species replicate     n
#    <fct>       <int> <int>
#  1 Adelie          1   148
#  2 Adelie          2   147
#  3 Adelie          3   148
#  4 Adelie          4   151
#  5 Adelie          5   138
#  6 Adelie          6   157
#  7 Adelie          7   161
#  8 Adelie          8   157
#  9 Adelie          9   151
# 10 Adelie         10   138
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows

What am I missing?
Thanks for your insight.
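
A possible workaround, sketched under assumptions rather than taken from infer: the counts above look like what you would get if each replicate drew 344 rows from the pooled data, so the per-species totals vary by chance. Resampling inside each group with dplyr's `slice_sample()` keeps every group at its original size. The `toy` data and the `stratified_bootstrap()` helper below are hypothetical names used only for illustration.

```r
library(dplyr)

# Toy data with unequal group sizes (hypothetical stand-in for penguins)
toy <- tibble(
  species = rep(c("A", "B", "C"), times = c(5, 3, 4)),
  value   = seq_len(12)
)

set.seed(100)

# Hypothetical helper, not part of infer: for each replicate, resample
# with replacement within each group, keeping the original group size.
stratified_bootstrap <- function(data, group_col, reps) {
  draws <- lapply(seq_len(reps), function(i) {
    data |>
      group_by(across(all_of(group_col))) |>
      slice_sample(prop = 1, replace = TRUE) |>
      ungroup()
  })
  # Stack the replicates and label each one with its index
  bind_rows(draws, .id = "replicate") |>
    mutate(replicate = as.integer(replicate))
}

slices <- stratified_bootstrap(toy, "species", reps = 10)

# Every replicate now has 5 rows for A, 3 for B, and 4 for C
slices |> count(species, replicate)
```

With the real data, the same call would be `stratified_bootstrap(penguins, "species", reps = 10)`. The result is a plain tibble with a `replicate` column, so it may need extra handling before it is passed to infer functions that expect the output of `rep_slice_sample()`.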

@pietervreeburg

Hello adrie-stclair,

I'm not one of the package maintainers, but your question is related to one I was thinking of posting here this weekend.

Let's say I have a dataset that is rather unbalanced with regard to the explanatory variable, and I draw bootstrap samples from it. I could end up with many bootstrap samples that contain no cases from the minority class. If I then calculate, for example, a diff in props statistic from these samples, I end up with many NaN values. I can easily drop these NaN samples from my analyses (in fact, the get_ci and visualise functions do this automatically), but it makes me wonder whether a stratified argument would be useful for the generate function.

I hope the package maintainers or authors can weigh in on the question above and on my related question.

I added a code example below.

library(dplyr)
library(moderndive)
library(infer)

set.seed(123)

# Keep only 3 randomly chosen female cases as the minority class
promo_fem <- promotions |> 
  filter(gender == "female") |> 
  slice_sample(n = 3)

# Relabel every original case as male to create a strong imbalance
promo <- promotions |> 
  mutate(gender = "male")

promo <- bind_rows(promo, promo_fem)

table(promo$gender, promo$decision)

# Ordinary (unstratified) bootstrap: some replicates contain no female cases
promo_bootstrap <- promo |> 
  specify(decision ~ gender, success = "promoted") |> 
  generate(5000, type = "bootstrap") |> 
  calculate("diff in props", order = c("male", "female"))

# Count the replicates where the statistic is NaN
promo_bootstrap |> 
  filter(is.nan(stat)) |> 
  nrow()

promo_bootstrap_ci <- promo_bootstrap |> 
  get_confidence_interval()

visualise(promo_bootstrap) +
  shade_ci(promo_bootstrap_ci)
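
For comparison, here is a rough sketch of what a stratified bootstrap could look like outside of infer, using only dplyr. It is an illustration under assumptions, not an existing `generate()` option. The `toy` data is a made-up stand-in with 48 male and 3 female cases, and its decision values are invented. Because each replicate resamples within gender, the minority class is never empty and the difference in proportions is always defined.

```r
library(dplyr)

set.seed(123)

# Made-up unbalanced data (hypothetical stand-in for `promo`)
toy <- tibble(
  gender   = rep(c("male", "female"), times = c(48, 3)),
  decision = c(rep(c("promoted", "not"), times = c(35, 13)),
               "promoted", "promoted", "not")
)

reps <- 1000

# For each replicate: resample within gender at the original group size,
# then compute the proportion promoted per gender.
boot_props <- lapply(seq_len(reps), function(i) {
  toy |>
    group_by(gender) |>
    slice_sample(prop = 1, replace = TRUE) |>
    summarise(prop = mean(decision == "promoted"), .groups = "drop")
}) |>
  bind_rows(.id = "replicate")

# Difference in proportions (male minus female) per replicate
boot_stats <- boot_props |>
  group_by(replicate) |>
  summarise(stat = prop[gender == "male"] - prop[gender == "female"])

# No replicate can be NaN, since both groups are present in every draw
sum(is.nan(boot_stats$stat))
```

Fixing the group sizes changes what the resampling assumes: it treats the number of cases per gender as given by design rather than as random, which may or may not match how the data were collected.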

@simonpcouch
Collaborator

Related to #503, #197. :)

@simonpcouch simonpcouch added the feature a feature request or enhancement label Mar 25, 2024