Center and scale X in the presence of missing data in Y #18
Comments
To follow up on yesterday's conversation, I suggest writing down the model
with the mean term mu included (and no centering).
The ELBO then becomes F(mu, q; X, Y), where q is the variational
approximation and X and Y are uncentered.
Now define (mu_hat, q_hat) := arg max_{mu, q} F(mu, q; X, Y).
My guess is that, without missing data,
q_hat = arg max_q F(mu = 0, q; Xtilde, Ytilde),
where Xtilde and Ytilde are the column-centered versions of X and Y.
This is one way to think about justifying the centering in the
non-missing-data case.
But in the case of missing data, it might be easier just to derive the
updates with mu included explicitly?
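
As a rough sanity check on the intuition above, here is a minimal sketch of the ordinary least-squares analogue (not the ELBO itself, and not code from this project): with complete data, fitting y = mu + b*x on uncentered data gives the same slope as fitting y = b*x on centered data. The function names and the y values are made up for illustration; only the x vector matches the SNP example quoted below.

```python
def fit_with_intercept(x, y):
    """Least squares for y = mu + b*x, solved from the 2x2 normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    mu = (sy - b * sx) / n
    return mu, b


def fit_no_intercept(x, y):
    """Least squares for y = b*x (no mean term)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def center(v):
    """Subtract the mean of v from every element."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]


x = [2.0, 0.0, 1.0, 1.0, 1.0]    # genotypes of one SNP, no missing samples
y = [1.3, -0.4, 0.2, 0.9, 0.5]   # hypothetical response values

mu_hat, b_uncentered = fit_with_intercept(x, y)
b_centered = fit_no_intercept(center(x), center(y))
print(mu_hat, b_uncentered, b_centered)
```

The two slopes agree up to floating-point error. Whether the same equivalence holds exactly for the variational objective is the open question in this comment, so this only illustrates the complete-data intuition.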
…On Thu, Jan 9, 2020 at 9:01 AM gaow ***@***.***> wrote:
To recap, X is a samples-by-effects matrix (e.g., different SNPs) and Y is a
samples-by-conditions matrix (e.g., gene expression in different tissues).
Currently, when there is missing data in Y, we center and scale X
separately for each condition. However, this can be problematic when
computing the correlation of a SNP between conditions as part of the
covariance calculation for effect size estimates (details on overleaf
<https://www.overleaf.com/project/5bd111aaa3ec8118d7b1cfa8>).
For example, suppose we have a SNP vector t(x):
2 0 1 1 1
where each element is the genotype of one sample at that SNP. Then
suppose there are two conditions with different missing samples, coded by -;
the data for the 2 conditions will look like:
2 0 - - -
- 0 1 1 1
Now consider just centering the data using the observed genotypes in each
condition:
1 -1 - - -
- -0.75 0.25 0.25 0.25
Then you see that for the 2nd sample, the coding is no longer the same
across conditions.
With scaling also involved, the situation only becomes more complicated.
Without centering and scaling, we can argue that what we write on overleaf
for the missing-data covariance computation follows from the definition of
covariance; with centering and scaling, we are not sure what it is.
Another, more subtle issue is that after removing different samples in
different conditions, some SNPs will be completely depleted of variants in
some conditions. That results in different covariances (as computed above on
overleaf) for different SNPs even when they have already been centered and
scaled, and we have to store all of them in our calculations. This is why
this comment #10 (comment) no longer holds.
Without missing data, it is not hard to just add mu to the model and write out the ELBO so one can get
#19 is also related.
To recap, X is a samples-by-effects matrix (e.g., different SNPs) and Y is a samples-by-conditions matrix (e.g., gene expression in different tissues). Currently, when there is missing data in Y, we center and scale X separately for each condition. However, this can be problematic when computing the correlation of a SNP between conditions as part of the covariance calculation for effect size estimates (details on overleaf). For example, suppose we have a SNP vector `t(x)`:

    2 0 1 1 1

where each element is the genotype of one sample at that SNP. Then suppose there are two conditions with different missing samples, coded by `-`; the data for the 2 conditions will look like:

    2 0 - - -
    - 0 1 1 1

Now consider just centering the data using the observed genotypes in each condition:

    1 -1 - - -
    - -0.75 0.25 0.25 0.25

Then you see that for the 2nd sample, the coding is no longer the same across conditions.
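
The per-condition centering in this example can be reproduced with a short sketch. This is only an illustration in plain Python, not the package's implementation; `center_observed` and the boolean masks are hypothetical names, and missing entries are represented as `None`.

```python
def center_observed(x, observed):
    """Center x using only the samples observed in one condition.

    Entries for samples that are missing in that condition are returned as None.
    """
    vals = [xi for xi, is_obs in zip(x, observed) if is_obs]
    mean = sum(vals) / len(vals)
    return [xi - mean if is_obs else None for xi, is_obs in zip(x, observed)]


x = [2, 0, 1, 1, 1]                             # genotypes of one SNP
obs_cond1 = [True, True, False, False, False]   # condition 1: samples 1-2 observed
obs_cond2 = [False, True, True, True, True]     # condition 2: samples 2-5 observed

c1 = center_observed(x, obs_cond1)
c2 = center_observed(x, obs_cond2)
print(c1)  # [1.0, -1.0, None, None, None]
print(c2)  # [None, -0.75, 0.25, 0.25, 0.25]
```

Sample 2 has genotype 0 in both conditions but is coded as -1.0 in condition 1 and -0.75 in condition 2, which is the inconsistency described above.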
With scaling also involved, the situation only becomes more complicated. Without centering and scaling, we can argue that what we write on overleaf for the missing-data covariance computation follows from the definition of covariance; with centering and scaling, we are not sure what it is. That said, our benchmark results with missing data are just fine.
Another, more subtle issue is that after removing different samples in different conditions, some SNPs will be completely depleted of variants in some conditions. That results in different covariances (as computed above on overleaf) for different SNPs even when they have already been centered and scaled, and we have to store all of them in our calculations. This is why this comment #10 (comment) no longer holds.
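
To make the depletion issue concrete, here is a small sketch, again in plain Python and not taken from the package. It assumes a hypothetical missingness pattern (different from the two conditions above) in which only samples 3 to 5 are observed, and it uses the population variance (dividing by n) as an arbitrary choice. `observed_variance` is a made-up helper name.

```python
def observed_variance(x, observed):
    """Population variance of x over the samples observed in one condition."""
    vals = [xi for xi, is_obs in zip(x, observed) if is_obs]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


x = [2, 0, 1, 1, 1]                          # same SNP as in the example above
obs_a = [True, True, False, False, False]    # samples 1-2 observed
obs_b = [False, False, True, True, True]     # hypothetical: samples 3-5 observed

var_a = observed_variance(x, obs_a)
var_b = observed_variance(x, obs_b)
print(var_a, var_b)

# A SNP with zero variance among the observed samples cannot be scaled
# by its standard deviation in that condition.
for name, var in [("a", var_a), ("b", var_b)]:
    if var == 0:
        print(f"SNP has no variation in condition {name}; scaling is undefined")
```

In condition b all observed genotypes equal 1, so the SNP carries no information there even though it varies in the full sample. This is one way the per-SNP quantities can end up differing across SNPs after standardization.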