RStudio Freezing When Dataframe Includes Factor #18

Open
benjaminwnelson opened this issue Jul 17, 2021 · 14 comments
Comments

@benjaminwnelson

Synthpop is working with numeric data, but anytime I include a factor with more than 2 levels it causes the program to freeze. Any thoughts on why this might be the case? Thanks!

@Sinan-Yavuz

I have the same problem, RStudio crashes.

@gillian-raab

gillian-raab commented Jul 21, 2021 via email

@Sinan-Yavuz

I am using the CRAN version, 1.6.0.

@benjaminwnelson
Author

I am using R 4.1.0.

@gillian-raab

gillian-raab commented Jul 22, 2021 via email

@gillian-raab

gillian-raab commented Jul 23, 2021 via email

@benjaminwnelson
Author

benjaminwnelson commented Jul 23, 2021

I'm using synthpop 1.6-0.

When I run your code it works perfectly. I tried it on my dataset again with 17 variables and 2,000 observations and it can't get past the gender variable. I tried only loading synthpop and no other packages and the same thing happened.
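
One way to narrow down which column stalls a run like this is to list every factor or character column with its number of distinct values before calling `syn()`. The sketch below uses only base R. The data frame `dat` and its columns are invented for illustration and stand in for the 17-variable dataset described above; nothing here is taken from the real data.

```r
# Hypothetical stand-in for the real dataset (not from the thread)
set.seed(1)
dat <- data.frame(
  age    = rnorm(200, mean = 40, sd = 10),
  gender = factor(sample(c("female", "male", "other"), 200, replace = TRUE)),
  site   = as.character(sample(sprintf("site%02d", 1:25), 200, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Flag the non-numeric columns
is_categorical <- vapply(dat, function(x) is.factor(x) || is.character(x), logical(1))

# Count distinct non-missing values in each of them
n_levels <- vapply(dat[is_categorical],
                   function(x) length(unique(x[!is.na(x)])),
                   integer(1))

# Largest first: columns with many levels are the first candidates to inspect
level_report <- sort(n_levels, decreasing = TRUE)
print(level_report)
```

Character columns are included because they may end up treated as many-level categorical variables; whether that is what happens in this dataset is not known from the thread.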

@gillian-raab

gillian-raab commented Jul 26, 2021 via email

@wbuchanan

Using Version 1.8-0
R 4.2.3

```r
set.seed(7779311)
library(haven)
library(dplyr)

filenm <- "https://github.com/OpenSDP/faketucky/raw/master/faketucky.dta"

# Read only the columns used in this example
df <- haven::read_dta(filenm,
  col_select = c("sid", "first_dist_code", "first_hs_code",
                 "first_hs_alt", "first_hs_urbanicity", "chrt_ninth",
                 "male", "race_ethnicity", "frpl_ever_in_hs",
                 "sped_ever_in_hs", "lep_ever_in_hs", "gifted_ever_in_hs",
                 "ever_alt_sch_in_hs", "scale_score_6_math",
                 "scale_score_6_read", "scale_score_8_math",
                 "scale_score_8_read", "pct_absent_in_hs",
                 "pct_excused_in_hs", "avg_gpa_hs", "scale_score_11_eng",
                 "scale_score_11_math", "scale_score_11_read",
                 "scale_score_11_comp", "collegeready_ever_in_hs",
                 "careerready_ever_in_hs", "ap_ever_take_class",
                 "last_acadyr_observed", "transferout", "dropout",
                 "still_enrolled", "ontime_grad", "chrt_grad", "hs_diploma",
                 "enroll_yr1_any", "enroll_yr1_2yr", "enroll_yr1_4yr",
                 "enroll_yr2_any"))

# Shorter column names, in the same order as the columns selected above
names(df) <- c("stdid", "distid", "schcd", "altsch", "urbanicity",
               "cohort", "male", "race", "frleverhs", "swdeverhs", "eleverhs",
               "tageverhs", "alteverhs", "mthss6", "rlass6", "mthss8",
               "rlass8", "pctabshs", "pctexcusedhs", "hsgpa", "acteng11",
               "actmth11", "actrla11", "actcmp11", "evercollrdyhs",
               "evercarrdyhs", "aptakenever", "lastobsyr", "transfer",
               "dropout", "stillenrolled", "gradontime", "gradcohort",
               "diploma", "yr1psenrany", "yr1psenr2yr", "yr1psenr4yr",
               "yr2psenrany")

# Build a school ID from district and school codes, then keep 60 random schools
df$schid <- paste0(df$distid, df$schcd)
validSchools <- data.frame("schid" = sample(unique(df$schid), size = 60))
df <- dplyr::inner_join(df, validSchools, by = "schid")

# Convert the categorical columns to factors
factor_cols <- c("altsch", "cohort", "male", "swdeverhs", "eleverhs", "schid",
                 "tageverhs", "alteverhs", "evercollrdyhs", "evercarrdyhs",
                 "aptakenever", "transfer", "dropout", "stillenrolled",
                 "gradontime", "diploma", "yr1psenrany", "yr1psenr2yr",
                 "yr1psenr4yr", "yr2psenrany", "race", "urbanicity",
                 "frleverhs", "lastobsyr", "gradcohort")
df[factor_cols] <- lapply(df[factor_cols], as.factor)

# Drop distid and schcd (columns 2 and 3) now that schid exists
df <- df[-c(2, 3)]

library(synthpop)

# This works fine and executes relatively quickly
syn <- synthpop::syn(df)

# This freezes and fails to execute every time:
syn2 <- synthpop::syn(df[-c(1)], models = TRUE,
  visit.sequence = c("schid", "altsch", "male", "race", "cohort", "urbanicity",
                     "frleverhs", "swdeverhs", "eleverhs", "tageverhs", "alteverhs",
                     "mthss6", "rlass6", "mthss8", "rlass8", "pctabshs", "pctexcusedhs",
                     "aptakenever", "lastobsyr", "transfer", "dropout", "stillenrolled",
                     "hsgpa", "gradontime", "gradcohort", "diploma", "evercollrdyhs",
                     "evercarrdyhs", "actmth11", "actrla11", "acteng11", "actcmp11",
                     "yr1psenr2yr", "yr1psenr4yr", "yr1psenrany", "yr2psenrany"))
```

The second call to `syn()` should sample school identifiers first and then start modeling student-level attributes. It fails consistently. It uses only a single core, even though the machine has 12 available, and it does not use all of the available RAM.
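
A possible explanation, offered as an assumption rather than a confirmed diagnosis: in the first call `schid` is the last column, so it is never used to predict anything, whereas in the second call it comes first in the visit sequence and the 60-level factor can become a predictor for every later variable. For an unordered factor with k levels, an exhaustive search over binary splits has 2^(k-1) - 1 candidates, and tree-based methods cannot always avoid that search when the response has more than two classes. The sketch below only does the arithmetic; the helper `n_binary_splits` is a hypothetical name, not a synthpop function.

```r
# Number of distinct ways to split k unordered levels into two non-empty groups
n_binary_splits <- function(k) 2^(k - 1) - 1

# Compare a few factor sizes, including the 60 schools sampled above
levels_to_check <- c(2, 3, 10, 60)
split_counts <- setNames(n_binary_splits(levels_to_check), levels_to_check)

print(format(split_counts, scientific = TRUE, digits = 3))
```

If this is the bottleneck, it would also be consistent with the earlier reports that factors with more than two levels trigger the freeze, but that link is a guess and has not been verified against the package internals.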

@gillian-raab

gillian-raab commented Apr 11, 2023 via email

@wbuchanan

wbuchanan commented Apr 11, 2023 via email

@gillian-raab

gillian-raab commented Apr 11, 2023 via email

@wbuchanan

Hi @gillian-raab,

I put the school ID variable first on purpose, so that school IDs would be sampled first (hopefully in a way that retains their marginal distribution). I initially listed the school-level variables first in the visit sequence, followed by student demographic characteristics, and then test scores and outcomes. As for the use case, this particular example is purely a demonstration of how synthetic data can be used for privacy protection to increase access to data.
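
The idea of sampling school IDs first can be illustrated without synthpop: drawing IDs with replacement from the observed column reproduces their marginal distribution in expectation. The `school_ids` vector below is invented for illustration (it is not the faketucky data), and the sketch does not show how synthpop itself handles the first variable in the visit sequence.

```r
set.seed(42)

# Hypothetical observed school IDs with unequal enrollment
school_ids <- factor(rep(c("A", "B", "C"), times = c(600, 300, 100)))

# Bootstrap-style draw: same length, sampled with replacement
synthetic_ids <- sample(school_ids, size = length(school_ids), replace = TRUE)

# Compare observed and sampled proportions side by side
comparison <- rbind(
  observed  = prop.table(table(school_ids)),
  synthetic = prop.table(table(synthetic_ids))
)
print(round(comparison, 3))
```

The sampled proportions will differ slightly from the observed ones in any single draw; only the expected proportions match.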

That said, I didn't see any other code listed, but can at least try making some of the modifications you mentioned.

@gillian-raab

gillian-raab commented Apr 12, 2023 via email
