Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to get single l2fc & padj for the additive effects of two factors of geneA (2 levels) and geneB (3 levels) #10

Open
Gloriafu opened this issue Mar 5, 2024 · 7 comments

Comments

@Gloriafu
Copy link

Gloriafu commented Mar 5, 2024

Hi,

I made mutations on two genes - geneA is WT/HET and geneB is WT/HET/HOM. I obtained mice with both sexes carrying these mutations, and want to know what is the individual effect of geneA and geneB separately, and what is the additive effects of geneA & geneB together, while controlling for sex.

dds <- DESeqDataSetFromMatrix(countData = data_cts, colData = meta, design = ~ sex + GeneA_factor + GeneB_factor )
dds <- DESeq(dds)
resultsNames(dds)

I don't have problem with individual effects -
results_geneAhet_vs_geneAwt = lfcShrink(dds, contrast = c('GeneA_factor', 'HET', 'WT'), type = 'ashr')

geneB_WT = colMeans(mod_mat[dds$GeneB_factor == "WT", ])
geneB_nonWT = colMeans(mod_mat[dds$GeneB_factor %in% c("HOM", "HET"),])
results_geneB_nonWT_vs_WT = lfcShrink(dds, contrast = geneB_nonWT - geneB_WT, type = 'ashr')

However, I'm not sure how to find the additive effects of geneA & geneB together (I want single values of l2fc_additive & padj_additive for each affected gene) I'm thinking of two options -

option1 - find genes that are affected by both geneA and geneB individuals (genes in both results_geneAhet_vs_geneAwt & results_geneB_nonWT_vs_WT) and l2fc_additive = l2fc_geneA + l2fc_geneB, padj_additive = min(padj_geneA, padj_geneB).

option2 - use the design to find additive effects -
geneAWT_geneBWT = colMeans(mod_mat[dds$GeneA_factor == "WT" & dds$GeneB_factor=='WT', ])
geneAnonWT_geneBnonWT = colMeans(mod_mat[dds$GeneA_factor =="HET" & dds$GeneB_factor !='WT',])
results_geneAnonWT_geneBnonWT = lfcShrink(dds, contrast = geneAnonWT_geneBnonWT - geneAWT_geneBWT, type = 'ashr')
In this way, I get single value l2fc & padj.

I tried both options, but they gave me different number of affected genes. Even for same genes, the l2fc values were different although the signs/effect directions were the the same.

Could you please help to point the correct way for additive effect of geneA & geneB together? if neither option 1 and option2 is correct, then what's the best way you suggested?

Thank you!
Ming

@tavareshugo
Copy link
Owner

Dear @Gloriafu

I'm not sure I understood all your solutions, but in your situation I would maybe take a different approach and use the likelihood ratio test (LRT) implemented in DESeq2.

For your case:

dds <- DESeqDataSetFromMatrix(countData = data_cts, colData = meta, design = ~ sex + GeneA_factor + GeneB_factor )

# test for the effect of both genes
dds <- DESeq(dds, test = "LRT", reduced = ~ sex)
res <- results(dds)

In this case, you can mostly ignore the log2FoldChange column, but the padj column will give you the result of comparing the full model with the model without those other terms.

The results from this test will not be specific to any gene, but what I would then do is:

  • Get a table of normalised counts (using rlog or vst normalisation - see here)
  • Subset that table for the genes that are significant with the likelihood ratio test above
  • Perform clustering on those genes, to partition them into "groups", which you can then relate to genes that change for geneA only, or geneB only, or for both genes.

See sections "Identifying candidate genes" and "Using clustering" in these materials for an example of gene clustering using rlog-normalised data.

A final note that in your model you did not consider an interaction between the genes - you are assuming all effects are additive.
Of course I cannot comment on whether that is a good assumption or not, but it's something to keep in mind.

@Gloriafu
Copy link
Author

Gloriafu commented Mar 6, 2024

Dear @tavareshugo

Thank you so much for your timely response and kind suggestions. I probably didn't phrase my question unclearly - My end goal is to plot 3 volcano plots with L2FC (x axis) and padj (y axis). These 3 plots shows genes that are affected by geneA only, geneB only and geneA + geneB (additive effect), respectively. Therefore, for each gene, I'm looking for methods to get the l2fc and padj for the effect of geneA only, effect of geneB only and the additive effect of geneA + geneB.

In the reduced model you suggested, could you explain would does the l2fc mean for each gene?

Also thank you for reminding me of the interaction term - I did tried to add the interaction in my model, but got the error message 'One or more variables or interaction terms in the design formula are linear’. Therefore, I removed the interaction in my final model.

Thank you!

@tavareshugo
Copy link
Owner

OK, I think there might be a confusion about what an additive effect is, and perhaps you mean to find an interaction.

If you're assuming additive effects (which is what your model assumes at the moment), then there is no test for "geneA + geneB", precisely because they are additive.

Let's assume the following genotypes:

  • geneA-WT
  • geneA-HET
  • geneB-WT
  • geneB-HET

Let's also assume that geneA-WT; geneB-WT is the reference group.
If you get the following contrasts:

  • geneA-WT_vs_geneA-HET and for example LFC = 2
  • geneB-WT_vs_geneB-HET and for example LFC = 1

Then the additive effective of being geneA-HET; geneB-HET is the addition of those effects, so 2 + 1 = 3.
That means you can even have cases like:

  • geneA-WT_vs_geneA-HET and LFC = 2
  • geneB-WT_vs_geneB-HET and LFC = -2

Where therefore the additive effect of being HET for both genes would be 2 - 2 = 0 (i.e. the expression of double HET is the same as the double WT).

If you expect those effects not to be additive, then you need to model an interaction.
But for an interaction model, you will need to have several combinations of the genotypes, which the error you got suggests you may not have. In which case, it may not be possible to address the question you are interest in.

It might be easier if you share the metadata to see exactly what combinations of genotypes you have.

@Gloriafu
Copy link
Author

Gloriafu commented Mar 7, 2024

@tavareshugo thank you so much for the example! For additive vs interaction, my understanding is that geneA + geneB models the additive effect; whereas geneAgeneB is more for multiplicative effect. In the beginning I tried to model the two together as dds <- DESeqDataSetFromMatrix(countData = data_cts, colData = meta, design = ~ sex + GeneA_factor + GeneB_factor + GeneA_factorGeneB_factor), however, r gave me an error - 'One or more variables or interaction terms in the design formula are linear’. Therefore, I removed the interaction term and dds <- DESeqDataSetFromMatrix(countData = data_cts, colData = meta, design = ~ sex + GeneA_factor + GeneB_factor) worked fine.

I have a question regarding adding L2FC for two factors - we could simply add LFC of many factors for the same gene together, correct? In your example, you added LFC(geneA-WT_vs_geneA-HET) + LFC(geneB-WT_vs_geneB-HET) = 1+2 =3 as geneA-HET; geneB-HET. I'm wondering in real analysis, could I just add L2FC together for genes that are significant at both geneA & geneB to get the L2FC of additive effect?

Also, could you please help with comparing the LFC'= LFC(geneA-WT_vs_geneA-HET) + LFC(geneB-WT_vs_geneB-HET) vs LFC' = LFC(geneA-WT-geneB-WT_vs_geneA-HET-geneB-HET/HOM)?

Here is my metadata
id sex geneA_factor geneB_factor
E145_1 E145_1 male HET HET
E145_10 E145_10 female WT HOM
E145_100 E145_100 male WT HOM
E145_105 E145_105 female HET HET
E145_106 E145_106 female HET HET
E145_11 E145_11 male HET HOM
E145_12 E145_12 female HET HOM
E145_14 E145_14 female WT HOM
E145_15 E145_15 male HET HET
E145_16 E145_16 male WT HET
E145_17 E145_17 male WT HET
E145_2 E145_2 male HET HOM
E145_3 E145_3 male HET HET
E145_4 E145_4 male HET HOM
E145_5 E145_5 male WT HET
E145_7 E145_7 male WT HOM
E145_8 E145_8 female HET HOM
E145_85 E145_85 female HET HOM
E145_93 E145_93 female WT HOM
E145_96 E145_96 male WT HOM
E145_98 E145_98 female WT HET
E145_99 E145_99 female WT HET
SRR5973222 SRR5973222 female WT WT
SRR5973223 SRR5973223 female WT WT
SRR5973224 SRR5973224 female WT WT
SRR5973225 SRR5973225 male WT WT
SRR5973226 SRR5973226 male WT WT
SRR5973227 SRR5973227 male WT WT

Thank you!

@tavareshugo
Copy link
Owner

tavareshugo commented Mar 8, 2024

Maybe let's forget about the syntax for a minute and think about the biological question.
Do you think that it's possible that a double heterozygous HET; HET (for gene A and B, respectively) may have a different phenotype than the simple addition of the effects HET; WT + WT; HET? The fact that you did an experiment including double mutants suggests that this may be possible. Otherwise, why go through all the work of creating double-mutants, one could have just worked with single mutants?

If we do expect that a double mutant may be different from additive effects of single mutants, then we should fit a model with an interaction: ~ sex + geneA + geneB + geneA:geneB (and I think this model works with the metadata you sent me).
With such a model you can then ask the question: is a double mutant HET;HET (or HET;HOM) different from the simple additive effect of the individual mutants? You will want to test for the interaction terms. In DESeq2 syntax this would be something like: contrast = list(c("geneA_factorHET:geneB_factorHET", "geneA_factorHET:geneB_factorHOM")) to test for both of them simultaneously.

However, there is an issue in your experimental design: you don't have the combination HET; WT. So, you are not able to assess whether the effect of gene A affects the phenotype, while fixing gene B as WT.

I also noticed that all of your WT; WT samples are named "SRR*", suggesting they are published data. Were the other samples sequenced as part of the same batch of samples or separately? If these come from different experiments, you may have a batch effect that will be confounded with genotypic effect.

Finally, going back to the syntax to generate the contrast vectors for several genotype combinations in your data:

# meta is the table you pasted above
meta$geneA_factor <- relevel(factor(meta$geneA_factor), "WT")
meta$geneB_factor <- relevel(factor(meta$geneB_factor), "WT")

# rename variables to make it more readable
names(meta) <- c("id", "sex", "genea", "geneb")

# model with genotype interaction
mod_mat <- model.matrix(~ sex + genea + geneb + genea:geneb, data = meta)

# single genotypes
# WT;WT
wt_wt <- colMeans(mod_mat[meta$genea == "WT" & meta$geneb == "WT", ])

# HET;HET
het_het <- colMeans(mod_mat[meta$genea == "HET" & meta$geneb == "HET", ])

# WT;HET
wt_het <- colMeans(mod_mat[meta$genea == "WT" & meta$geneb == "HET", ])

# HET;WT - this is missing from your design!!!
het_wt <- colMeans(mod_mat[meta$genea == "HET" & meta$geneb == "WT", ])

# combined genotypes
# WT;HET or WT;HOM
wt_mut <- colMeans(mod_mat[meta$genea == "WT" & meta$geneb != "WT", ])

# HET;HET or HET;HOM
mut_mut <- colMeans(mod_mat[meta$genea != "WT" & meta$geneb != "WT", ])

# HET;WT or HET;HET
mut_any <- colMeans(mod_mat[meta$genea != "WT", ])

# WT;HET or WT;HOM or HET;HOM
any_mut <- colMeans(mod_mat[meta$geneb != "WT", ])

Based on these vectors, there are many comparisons you can make - although some may or may not make sense biologically.

Regarding summing LFC values, I would probably not do this. Instead, what you could do is make a scatter plot of the two LFC against each other to get a sense of whether they are broadly correlating or not.

There is another approach that you could take: "paste" the two genotypes as a new single variable, let's call it genotype.
Then, fit the model with that single variable: ~ sex + genotype
With this new model you will have several possible contrasts, which you can extract with the standard DESeq2 syntax.

I still feel like the LRT approach I mentioned earlier, followed by gene clustering on normalised counts could be more adequate in this case, especially as you are missing the HET;WT genotype combination.

@Gloriafu
Copy link
Author

@tavareshugo Thank you so much for your detailed instructions.

We do miss the geneA - HET; geneB - WT group for the analysis - this was something we missed in our mouse experiment.

the SRR samples are from the same batch as other genotypes - we did batches of a lot mice with different genetic background, but only used SRR samples for the analysis in an early publication, and now we want to analyze all the samples with different gt.

As for interaction, yes, we assume there will be interaction between geneA and geneB, and I agree with you that we need to add interaction term. I used your syntax to code my meta data, and had no problem running mod_mat <- model.matrix(~ sex + genea + geneb + genea:geneb, data = meta).

However, I still got the error message with dds

dds <- DESeqDataSetFromMatrix(countData = data_cts, colData = meta, design = ~ sex + genea + geneb + genea:geneb)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.

Could you please let me know what is the possible reason for this error? Any options I can bypass this error and let dds run?

I also tried the LRT methods you mentioned and was able to cluster genes. However, I still need l2fc and padj for each gene in order to make the volcano plots, which were asked for by my PI. I'm wondering, how to get the l2fc and padj for them. One thing I noticed is that the genes I got from LRT & clustering don't always match with genes I got with contrast. I'm wondering which method is more reliable.

@tavareshugo
Copy link
Owner

tavareshugo commented Mar 22, 2024

Apologies for the delay. That error is a common issue (see e.g. here), and in your case it's because of the missing combination HET; WT.

As I said earlier, I think an easier way to address your questions is to make a new variable with the genotype of both genes. Then you can do all the comparisons of interest.

Here is some example code, assuming you already have your dds object with the metadata:

# create a new metadata column that pastes the two genotypes together
dds$genotype <- paste(dds$genea, dds$geneb, sep = ".")

# turn it to a factor with WT;WT as the reference level
dds$genotype <- relevel(factor(dds$genotype), "WT.WT")

# fit the model
design(dds) <- ~ sex + genotype
dds <- DESeq(dds)

# look at the coefficient names
resultsNames(dds)

The output of that last line of code is something like:

"Intercept"                 "sex_male_vs_female"        "genotype_HET.HET_vs_WT.WT" 
"genotype_HET.HOM_vs_WT.WT" "genotype_HET.WT_vs_WT.WT"  "genotype_WT.HET_vs_WT.WT"  "genotype_WT.HOM_vs_WT.WT"

So, you can now make different comparisons of interest.

Single comparisons against the WT;WT are relatively easy:

# WT;WT vs WT;HET
results(dds, contrast = list("genotype_HET.HET_vs_WT.WT"))

# and same logic for the others

Comparisons between genotypes, it depends on what you want, but here is an example:

# WT;HET vs HET;HET
results(dds, contrast = list("HET.HET_vs_WT.WT", "WT.HET_vs_WT.WT"))

Finally, for your question of the average of genotypes, that one is a bit trickier, but I think a numeric contrast should work.
For example, the average of (HET;HET and HET;HOM) compared to the average of (WT;HET and WT;HOM).

# (HET;HET + HET;HOM)/2 - (WT;HET + WT;HOM)/2
results(dds, contrast = c(0, 0, 0.5, 0.5, 0, 0.5, 0.5))

The logic here is to take the vector from resultsNames(dds) that I printed above and then assign a number according to the comparison you are making. You see how in my comment (HET;HET + HET;HOM)/2 each of those terms get a 0.5 weight (because they are divided by 2 to calculate the average).

Hope this helps.
Please check your results because I am doing this from the top of my head and may have some typos or other errors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants