Center and scale X in the presence of missing data in Y #18

Open
gaow opened this issue Jan 9, 2020 · 3 comments


@gaow
Member

gaow commented Jan 9, 2020

To recap: X is a samples-by-effects matrix (e.g. different SNPs), and Y is a samples-by-conditions matrix (e.g. gene expression in different tissues). Currently, when there is missing data in Y, we center and scale X separately for each condition. However, this can be problematic when computing the correlation of a SNP between conditions as part of the covariance calculation for effect size estimates (details on overleaf). For example, suppose we have a SNP vector t(x):

2 0 1 1 1

where each element is the genotype of one sample at that SNP. Now suppose there are two conditions with different missing samples, coded by -. The data for the 2 conditions will then look like:

2 0 - - -
- 0 1 1 1

For now, consider only centering the data using the genotypes observed in each condition:

1 -1    -    -    -
- -0.75 0.25 0.25 0.25

Notice that the 2nd sample, which is observed in both conditions, is no longer coded the same way: it is -1 in the first condition and -0.75 in the second.
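The effect can be reproduced with a minimal Python sketch. The mask layout and the helper name `center_per_condition` are assumptions made for illustration only; this is not the package's actual code.

```python
# Genotypes of one SNP across five samples (the vector t(x) in the example).
x = [2, 0, 1, 1, 1]

# Illustrative missingness masks: True where Y is observed in that condition.
observed = [
    [True, True, False, False, False],   # condition 1
    [False, True, True, True, True],     # condition 2
]

def center_per_condition(x, mask):
    """Center x using only the samples observed in one condition.

    Unobserved samples are returned as None.
    """
    kept = [xi for xi, m in zip(x, mask) if m]
    mean = sum(kept) / len(kept)
    return [xi - mean if m else None for xi, m in zip(x, mask)]

centered = [center_per_condition(x, mask) for mask in observed]
for row in centered:
    print(row)

# The 2nd sample (index 1) is observed in both conditions,
# yet it receives a different code in each of them.
print(centered[0][1], centered[1][1])
```

The same genotype value 0 ends up as two different numbers because each condition subtracts its own mean.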

With scaling also involved, the situation only becomes more complicated. Without centering and scaling, we can argue that the missing-data covariance computation we wrote on overleaf follows from the definition of covariance; with centering and scaling, we are not sure what quantity it is. That said, our benchmark results with missing data are just fine.

Another, more subtle issue is that after removing different samples in each condition, some SNPs will be completely depleted of variants in some conditions. This results in different covariances (as computed above, on overleaf) for different SNPs even after they have been centered and scaled, and we have to store all of them in our calculations. This is why this comment #10 (comment) no longer holds.
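As a rough illustration of the depletion problem, the sketch below uses a made-up SNP and made-up masks (both are assumptions, not data from the benchmark) and checks whether any variation is left among the observed samples in each condition.

```python
# Hypothetical SNP: only samples 3 and 5 carry a variant allele.
x = [0, 0, 1, 0, 2]

# Hypothetical masks: True where Y is observed in that condition.
observed = [
    [True, True, False, True, False],   # condition 1: both carriers are missing
    [True, True, True, True, True],     # condition 2: fully observed
]

def observed_variance(x, mask):
    """Population variance of x over the samples observed in one condition."""
    kept = [xi for xi, m in zip(x, mask) if m]
    mean = sum(kept) / len(kept)
    return sum((xi - mean) ** 2 for xi in kept) / len(kept)

variances = [observed_variance(x, mask) for mask in observed]

# A zero variance means the SNP is depleted of variants in that condition,
# so per-condition scaling would divide by zero there.
depleted = [v == 0 for v in variances]
print(variances)
print(depleted)
```

In this toy case the SNP carries no information in condition 1 but is informative in condition 2, which is why its cross-condition covariance cannot match that of a SNP observed with variation in both.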

@stephens999
Collaborator

stephens999 commented Jan 10, 2020 via email

@gaow
Member Author

gaow commented Feb 11, 2020

without missing data, q_hat = arg max_q F(mu=0, q; Xtilde, Ytilde)

Without missing data, it is not hard to add mu to the model and write out the ELBO, so one can get mu_hat, q_hat := arg max_mu,q F(mu, q; X, Y) by taking the derivative of the ELBO at every iteration. I'm assuming it is equally manageable for the missing data case once we have the ELBO. The problem then boils down to computing the ELBO with missing data. At some point @zouyuxin was looking into the issue, but we decided to get away without it for missing data and check convergence with PIP instead. Guess it is time to revisit it for this matter!

@gaow
Member Author

gaow commented Feb 12, 2020

#19 is also related.
