Likelihood based distance #20

Open
OJWatson opened this issue Feb 24, 2022 · 0 comments
Labels
feature ✨ feature request or enhancement

OJWatson commented Feb 24, 2022

Currently, we work out residuals between the theoretical COIs and the observed data.

This was done because initially we were using buckets to work out averages and then comparing those.

With the regression approach, though, we use the individual data. We could now construct a formal likelihood here, by saying that our observed data (whether Method 1 or 2) are described by a Binomial distribution, with the number of trials being the coverage and, for Method 2, the number of successes being `as.integer(wsmaf * coverage)` (`as.integer` here just to guard against rounding errors, since `wsmaf` came from dividing reads by coverage initially).

E.g. a demo for Method 2:

```r
data2 <- sim_biallelic(6, runif(10000, 0, 0.5), epsilon = 0.02,
                       coverage = sample(300, 10000, replace = TRUE))

# Drop loci with wsmaf near 0 or 1 (sequencing-error filter)
data <- data2$data
data <- data[data$wsmaf > 0.03 & data$wsmaf < 0.97, ]
counts <- data$counts
coverage <- data$coverage
plmaf <- data$plmaf

# Theoretical wsmaf curves for candidate COIs 2:10
thcois <- theoretical_coi(2:10, plmaf, coi_method = "frequency")

# Binomial log-likelihood of the observed counts under each COI curve,
# summed over loci; the final column of thcois (plmaf) is dropped
lls <- dbinom(x = counts, size = coverage,
              prob = as.matrix(thcois[, -ncol(thcois)]), log = TRUE)
lls <- colSums(lls)
which.max(lls)
coi_6 
    5 
```
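For intuition, here is a minimal Python analogue of the same calculation. It is a sketch under simplifying assumptions: counts are simulated directly from the theoretical curve (no sequencing error, no wsmaf filtering), and the frequency-method curve is taken to be `1 - (1 - plmaf)^COI`, which may not match `theoretical_coi()` exactly.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(42)

# Simplified stand-in for sim_biallelic(): simulate read counts at
# 10,000 loci for a sample with true COI = 6. plmaf is the
# population-level minor allele frequency at each locus.
n_loci = 10_000
coi_true = 6
plmaf = rng.uniform(0, 0.5, n_loci)
coverage = rng.integers(1, 301, n_loci)

# Assumed frequency-method curve: probability that at least one of the
# COI strains carries the minor allele.
expected_wsmaf = 1 - (1 - plmaf) ** coi_true
counts = rng.binomial(coverage, expected_wsmaf)

# Binomial log-likelihood of the observed counts under each candidate
# COI curve, summed over loci (mirrors dbinom(..., log = TRUE) + colSums).
cois = np.arange(2, 11)
theoretical = 1 - (1 - plmaf[:, None]) ** cois[None, :]  # n_loci x n_cois
lls = binom.logpmf(counts[:, None], coverage[:, None], theoretical).sum(axis=0)

best = int(cois[np.argmax(lls)])
print(best)  # recovers the true COI of 6
```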

Just did a quick check to see how different these approaches are:

```r
fgh <- function() {
  data2 <- sim_biallelic(6, runif(1000, 0, 0.5), epsilon = 0.02,
                         coverage = sample(50, 1000, replace = TRUE))

  data <- data2$data
  data <- data[data$wsmaf > 0.03 & data$wsmaf < 0.97, ]
  counts <- data$counts
  coverage <- data$coverage
  plmaf <- data$plmaf

  thcois <- theoretical_coi(2:25, plmaf, coi_method = "frequency")

  lls <- dbinom(x = counts, size = coverage,
                prob = as.matrix(thcois[, -ncol(thcois)]), log = TRUE)
  lls <- colSums(lls)

  # which.max indexes into 2:25, so add 1 to recover the COI itself;
  # compare against the residual-based compute_coi() estimate
  c(as.numeric(which.max(lls) + 1),
    compute_coi(data2, "sim", coi_method = "frequency",
                use_bins = FALSE, seq_error = 0.03)$coi)
}

comp <- replicate(100, fgh(), simplify = TRUE)
rowMeans(comp)
6.42 6.22   
```

So the Binomial likelihood approach gives slightly higher estimates for the same samples. Not sure it will make a big difference, but just flagging it here as a reminder to ask Bob for his thoughts on this versus the simple residual approach when we go through the paper.

@OJWatson OJWatson added the feature ✨ feature request or enhancement label Feb 24, 2022