Center and scale X in the presence of missing data in Y #18

gaow · 2020-01-09T15:01:55Z

To recap: X is a sample-by-effect matrix (e.g. different SNPs) and Y is a sample-by-condition matrix (e.g. gene expression in different tissues). Currently, when there is missing data in Y, we center and scale X separately for each condition. However, when it comes to computing the correlation of a SNP between conditions as part of the covariance calculation for effect size estimates (details on overleaf), this can be problematic. For example, suppose we have a SNP vector t(x):

2 0 1 1 1

where each element is the genotype of one sample for that SNP. Now suppose there are two conditions with different missing samples, coded by -. The data for the 2 conditions will then look like:

2 0 - - -
- 0 1 1 1

For now, consider just centering the data, using the observed genotypes in each condition:

1 -1    -    -    -
- -0.75 0.25 0.25 0.25

Then you see that for the 2nd sample, which is observed in both conditions, the coding is no longer the same: -1 in the first condition and -0.75 in the second.
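The centering example above can be reproduced with a short NumPy sketch. The mask array `observed` and the helper `center_per_condition` are names made up for this illustration; they are not functions from the package.

```python
import numpy as np

# SNP genotypes for five samples (the vector t(x) from the example).
x = np.array([2.0, 0.0, 1.0, 1.0, 1.0])

# Hypothetical observation masks: True where Y is observed in that condition.
observed = np.array([
    [True, True, False, False, False],   # condition 1
    [False, True, True, True, True],     # condition 2
])

def center_per_condition(x, observed):
    """Center x within each condition using only that condition's observed samples."""
    out = np.full(observed.shape, np.nan)  # NaN marks a missing sample
    for r, mask in enumerate(observed):
        out[r, mask] = x[mask] - x[mask].mean()
    return out

centered = center_per_condition(x, observed)
print(centered)
# Sample 2 (index 1) is observed in both conditions but gets two codings:
print(centered[0, 1], centered[1, 1])
```

The two printed values differ because each condition subtracts its own mean (1 and 0.75 respectively), which is exactly the inconsistency described above.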

With scaling also involved, the situation only becomes more complicated. Without centering and scaling, we can argue that what we wrote on overleaf for the missing-data covariance computation follows from the definition of covariance; with centering and scaling, we are not sure what it is. That said, our benchmark results with missing data are just fine.

Another, more subtle issue is that after removing different samples in different conditions, some SNPs will be completely depleted of variants in some conditions. That results in different covariances (as computed above on overleaf) for different SNPs, even when they have already been centered and scaled, and we have to store all of them in our calculations. This is why this comment #10 (comment) no longer holds.
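As a small illustration of the depletion issue, here is a sketch that flags SNPs with no variation left among the samples observed in a condition. It assumes "depleted of variants" means the SNP is constant in that condition's observed samples; the data, the mask and the helper `per_condition_sd` are hypothetical.

```python
import numpy as np

# Hypothetical SNP: only samples 3 and 5 carry a variant allele.
x = np.array([0.0, 0.0, 1.0, 0.0, 2.0])

# Hypothetical observation masks: True where Y is observed in that condition.
observed = np.array([
    [True, True, False, True, False],   # condition 1: both carriers are missing
    [True, True, True, True, True],     # condition 2: all samples observed
])

def per_condition_sd(x, observed):
    """Sample standard deviation of x within each condition's observed samples."""
    return np.array([x[mask].std(ddof=1) for mask in observed])

sd = per_condition_sd(x, observed)
monomorphic = sd == 0  # scaling by sd would divide by zero for these conditions
print(sd, monomorphic)
```

In condition 1 the SNP has zero standard deviation, so it cannot be scaled there, while the same SNP is scaled normally in condition 2.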


stephens999 · 2020-01-10T14:37:45Z

To follow up on yesterday's conversation, I suggest writing down the model with the mean term mu (and no centering). The ELBO then becomes F(mu, q; X, Y), where q is the variational approximation and X and Y are uncentered. Now define mu_hat, q_hat := arg max_{mu,q} F(mu, q; X, Y). My guess is that, without missing data, q_hat = arg max_q F(mu=0, q; Xtilde, Ytilde), where Xtilde and Ytilde are column-centered versions of X and Y. This is one way to think about justifying the centering in the non-missing-data case. But in the case of missing data it might be easier just to derive the updates with mu included explicitly?
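The guess about the ELBO is not checked here. As a loose analogy only, the same statement is known to hold for ordinary least squares with complete data: fitting an explicit intercept on uncentered data gives the same coefficients as fitting no intercept on column-centered data. The sketch below verifies that numerically on simulated data; all names and values are made up for the illustration, and it says nothing about the variational objective or the missing-data case.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Simulated, uncentered complete data (purely illustrative values).
X = rng.normal(size=(n, p)) + 2.0
b_true = np.array([1.0, -0.5, 0.0])
y = 3.0 + X @ b_true + rng.normal(scale=0.1, size=n)

# (a) Explicit mean term: regress y on [1, X] without centering.
X1 = np.column_stack([np.ones(n), X])
coef_mu, *_ = np.linalg.lstsq(X1, y, rcond=None)
mu_hat, b_with_mu = coef_mu[0], coef_mu[1:]

# (b) No mean term: regress centered y on column-centered X.
Xt = X - X.mean(axis=0)
yt = y - y.mean()
b_centered, *_ = np.linalg.lstsq(Xt, yt, rcond=None)

# The coefficient estimates agree up to floating-point error.
print(np.allclose(b_with_mu, b_centered))
```

The fitted intercept in (a) also equals `y.mean() - X.mean(axis=0) @ b_centered`, which is the sense in which centering absorbs the mean term when no data are missing.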


gaow · 2020-02-11T23:59:53Z

without missing data, q_hat = arg max_q F(mu=0, q; Xtilde, Ytilde)

Without missing data, it is not hard to add mu to the model and write out the ELBO, so one can get mu_hat, q_hat := arg max_{mu,q} F(mu, q; X, Y) by taking the derivative of the ELBO at every iteration. I'm assuming it is equally not hard for the missing data case once we have the ELBO. The problem then boils down to the missing-data ELBO computation ... at some point @zouyuxin was looking into the issue, but we decided to do without it for missing data and check convergence with PIP instead. Guess it is time to revisit it for this matter!

gaow · 2020-02-12T00:00:20Z

#19 is also related.
