selectSolution not selecting the best solution #46

Open
fpbarthel opened this issue Nov 20, 2018 · 11 comments

@fpbarthel

fpbarthel commented Nov 20, 2018

In a test sample, I'm finding that selectSolution is not selecting what should, biologically speaking, be the best solution.

For some reason, it prefers this ploidy=4, clusters=3 solution:

screen shot 2018-11-27 at 10 33 17 pm

screen shot 2018-11-27 at 10 33 38 pm

Normal contamination estimate:	0.6033
Average tumour ploidy estimate:	3.869
Clonal cluster cellular prevalence Z=2:	1 0.7426
AllelicRatio binomial means for clonal cluster Z=1:	0.5 0.6237 0.6983 0.5198 0.7483 0.5828 0.784 0.642 0.5284 0.8109 0.6865 0.5622 0.8318 0.7212 0.6106 0.5332 0.8485 0.749 0.6494 0.5498 0.8623 0.7717 0.6811 0.5906 0.5362
logRatio Gaussian means for clonal cluster Z=1:	-1.184 -0.7739 -0.455 -0.455 -0.194 -0.194 0.02699 0.02699 0.02699 0.2186 0.2186 0.2186 0.3877 0.3877 0.3877 0.3877 0.539 0.539 0.539 0.539 0.6759 0.6759 0.6759 0.6759 0.6759
AllelicRatio binomial means for clonal cluster Z=2:	0.5 0.5864 0.6473 0.5147 0.6926 0.5642 0.7276 0.6138 0.5228 0.7554 0.6532 0.5511 0.7781 0.6854 0.5927 0.5278 0.7969 0.7121 0.6272 0.5424 0.8128 0.7346 0.6564 0.5782 0.5313
logRatio Gaussian means for clonal cluster Z=2:	-0.9585 -0.6849 -0.455 -0.455 -0.2568 -0.2568 -0.08252 -0.08252 -0.08252 0.07294 0.07294 0.07294 0.2133 0.2133 0.2133 0.2133 0.3411 0.3411 0.3411 0.3411 0.4586 0.4586 0.4586 0.4586 0.4586
logRatio Gaussian variance:	0.009996 0.009996 0.01096 0.01096 0.009909 0.009909 0.01014 0.01014 0.01014 0.01051 0.01051 0.01051 0.009999 0.009999 0.009999 0.009999 0.01001 0.01001 0.01001 0.01001 0.01007 0.01007 0.01007 0.01007 0.01007
Number of iterations:	5
Log likelihood:	-52150
S_Dbw dens.bw (LogRatio):	0.1997 
S_Dbw scat (LogRatio):	1.0000 
S_Dbw validity index (LogRatio):	1.1997 
S_Dbw dens.bw (AllelicRatio):	0.4919 
S_Dbw scat (AllelicRatio):	1.0000 
S_Dbw validity index (AllelicRatio):	1.4919 
S_Dbw dens.bw (Both):	0.6916 
S_Dbw scat (Both):	2.0000 
S_Dbw validity index (Both):	2.6916 

Over a more meaningful ploidy = 2 solution like this one:

screen shot 2018-11-27 at 10 36 47 pm

screen shot 2018-11-27 at 10 37 01 pm

Normal contamination estimate:	0.401
Average tumour ploidy estimate:	1.911
Clonal cluster cellular prevalence Z=1:	1
AllelicRatio binomial means for clonal cluster Z=1:	0.5 0.7138 0.7995 0.53 0.8457 0.6152 0.8746 0.6873 0.5375 0.8944 0.7366 0.5789 0.9088 0.7725 0.6363 0.5409 0.9197 0.7998 0.6799 0.56 0.9283 0.8212 0.7142 0.6071 0.5428
logRatio Gaussian means for clonal cluster Z=1:	-1.279 -0.4744 0.03912 0.03912 0.4171 0.4171 0.7163 0.7163 0.7163 0.964 0.964 0.964 1.175 1.175 1.175 1.175 1.36 1.36 1.36 1.36 1.523 1.523 1.523 1.523 1.523
logRatio Gaussian variance:	0.009996 0.01122 0.01062 0.01062 0.01015 0.01015 0.01004 0.01004 0.01004 0.01 0.01 0.01 0.009996 0.009996 0.009996 0.009996 0.009996 0.009996 0.009996 0.009996 0.009996 0.009996 0.009996 0.009996 0.009996
Number of iterations:	5
Log likelihood:	-60370
S_Dbw dens.bw (LogRatio):	0.1482 
S_Dbw scat (LogRatio):	1.0000 
S_Dbw validity index (LogRatio):	1.1482 
S_Dbw dens.bw (AllelicRatio):	0.4143 
S_Dbw scat (AllelicRatio):	1.0000 
S_Dbw validity index (AllelicRatio):	1.4143 
S_Dbw dens.bw (Both):	0.5626 
S_Dbw scat (Both):	2.0000 
S_Dbw validity index (Both):	2.5626 

What parameters are used to select the optimal variant and how can we adjust this? LOH of chromosome arms 1p/19q (as in the second, non-selected solution) is a well recognized marker of this cancer type.

UPDATE: I guess it looks at the log-likelihood. Why do you think the first solution was scored higher than the second solution?

@fpbarthel
Author

Having now run TITAN on > 600 tumor samples, I am finding that selectSolution selects a hyperploid solution of > 4n in > 90% of cases. This is not realistic and not in line with the literature; see, e.g., https://www.nature.com/articles/nbt.2203/figures/6, which shows that >95% of this cancer type (GBM) is 2n. Is there any reason that TITAN is more likely to prefer higher ploidy values? I suppose I could increase the "threshold" parameter to selectSolution, but this seems like "cheating". Any thoughts?

@gavinha
Owner

gavinha commented Nov 21, 2018

Hi @fpbarthel

Thanks for reporting back on your experiences with TITAN. The selectSolution script is my custom script for determining the optimal ploidy and cluster combination. However, I have only really tested it on my own data, which are mostly breast, ovarian, and advanced prostate cancers. It has worked better for me because these tumors tend to be genome doubled at a decent frequency. However, I understand this might not be the case for other data.
When deciding between ploidy solutions, the script looks at the log likelihood (as you've pointed out). What are the values for the diploid, triploid, and tetraploid runs? How close are they? I can imagine that, since there are fewer CNA events and few distinct integer copy classes present, the likelihoods may be more similar between solutions. Perhaps this script is not suited to your data.
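
One rough way to answer "how close are they?" is to express the gap between two log likelihoods relative to their magnitude. The Python sketch below does this for the two values reported at the top of this thread (-52150 for the ploidy 4 run and -60370 for the ploidy 2 run). The function name `relative_loglik_gap` is hypothetical, and the thread does not spell out how selectSolution itself compares log likelihoods, so this is only an illustration, not the script's rule.

```python
def relative_loglik_gap(loglik_a, loglik_b):
    """Absolute gap between two log likelihoods, relative to the larger magnitude.

    Hypothetical helper for illustration; not part of TITAN or selectSolution.
    """
    return abs(loglik_a - loglik_b) / max(abs(loglik_a), abs(loglik_b))


# Log likelihoods copied from the two reports in the first post.
loglik_ploidy4 = -52150
loglik_ploidy2 = -60370

gap = relative_loglik_gap(loglik_ploidy4, loglik_ploidy2)
print(f"relative gap: {gap:.4f}")
```

A small relative gap would suggest the two solutions explain the data about equally well, in which case prior biological expectations could reasonably break the tie.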

Here are some alternative things to try:

  1. Before I provided this script, others used to select the minimum S_Dbw Validity Index out of all runs across all ploidy and cluster initializations.

  2. Only pass in ploidy2 and ploidy3 into selectSolution.R. In particular, if you know that GBM should be diploid, I think it is justified to exclude ploidy4 runs. The ploidy estimation for ploidy3 initializations can still help to indicate genome doubling.

  3. Set the prior TitanCNA_alphaK such that copy number 2 is higher than 3. This would involve adding in a line to titanCNA.R so that this hyperparameter is used for the prior on the Gaussian variance.
    If it comes down to trying this, I can edit the script to allow the user to adjust this.
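
Option 1 above (taking the run with the minimum S_Dbw validity index) can be sketched in a few lines of Python. The sketch assumes each run's report is available as text in the `key:<tab>value` layout shown in this thread; the function names and the `runs` dictionary are hypothetical and not part of TITAN.

```python
import re


def parse_report(text):
    """Parse 'key: value ...' lines from a TITAN-style report into a dict of token lists."""
    params = {}
    for line in text.splitlines():
        match = re.match(r"(.+?):\s+(.*)", line)
        if match:
            params[match.group(1).strip()] = match.group(2).split()
    return params


def validity_index(text, which="Both"):
    """Return the S_Dbw validity index ('LogRatio', 'AllelicRatio' or 'Both')."""
    return float(parse_report(text)[f"S_Dbw validity index ({which})"][0])


def pick_min_sdbw(runs):
    """runs maps a run label to its report text; return the label with the lowest index."""
    return min(runs, key=lambda label: validity_index(runs[label]))


# Excerpts of the two reports from the first post (labels are illustrative).
runs = {
    "ploidy4": "Log likelihood:\t-52150\nS_Dbw validity index (Both):\t2.6916 \n",
    "ploidy2": "Log likelihood:\t-60370\nS_Dbw validity index (Both):\t2.5626 \n",
}
best = pick_min_sdbw(runs)
print(best)
```

For these two excerpts the ploidy 2 run has the lower index (2.5626 versus 2.6916), so it is the one selected, even though its log likelihood is lower.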

Hope this helps,
Gavin

@fpbarthel
Author

fpbarthel commented Nov 21, 2018

Thanks @gavinha these are all excellent and I will try them out!

I am not sure I understand what you are suggesting with (3). My current cohort consists of about 25% whole genomes and 75% exomes, and currently I'm setting both --alphaK and --alphaKHigh to 10000 for whole genomes and 2500 for exomes, regardless of ploidy. Are you suggesting that I vary these for different ploidy values, as well as making script edits?

Either way, I will try out (1) and (2) first and let you know how this pans out and it may not be necessary to go that route.

Floris

UPDATE: I figured I would share another interesting case. A very common pattern in GBM is an amplification of chr7 in combination with a loss of chr10, often with deep amplifications of EGFR (chr7) and deep deletions of CDKN2A (chr9). I have a sample which underwent both WXS and WGS.

Interestingly, for the whole exome sample the ploidy 2 (likely correct) solution is chosen, but for the WGS sample a ploidy 3 solution is chosen:

WXS

Solution chosen by selectSolution (ploidy 2, clusters 1)

screen shot 2018-11-27 at 10 45 51 pm

screen shot 2018-11-27 at 10 46 04 pm

Normal contamination estimate:	0.6049
Average tumour ploidy estimate:	1.871
Clonal cluster cellular prevalence Z=1:	1
AllelicRatio binomial means for clonal cluster Z=1:	0.5 0.6231 0.6975 0.5282 0.7474 0.5825 0.7832 0.6416 0.5405 0.8101 0.686 0.562 0.831 0.7207 0.6103 0.5473 0.8478 0.7484 0.6491 0.5497 0.8616 0.7712 0.6808 0.5904 0.5517
logRatio Gaussian means for clonal cluster Z=1:	-0.6878 -0.2801 0.03736 0.03736 0.2974 0.2974 0.5177 0.5177 0.5177 0.7087 0.7087 0.7087 0.8774 0.8774 0.8774 0.8774 1.028 1.028 1.028 1.028 1.165 1.165 1.165 1.165 1.165
logRatio Gaussian variance:	0.01004 0.01033 0.01037 0.01037 0.0101 0.0101 0.009955 0.009955 0.009955 0.009996 0.009996 0.009996 0.009993 0.009993 0.009993 0.009993 0.009995 0.009995 0.009995 0.009995 0.009996 0.009996 0.009996 0.009996 0.009996
Number of iterations:	5
Log likelihood:	-35160
S_Dbw dens.bw (LogRatio):	0.0747 
S_Dbw scat (LogRatio):	0.1113 
S_Dbw validity index (LogRatio):	0.1860 
S_Dbw dens.bw (AllelicRatio):	0.3763 
S_Dbw scat (AllelicRatio):	0.0923 
S_Dbw validity index (AllelicRatio):	0.4685 
S_Dbw dens.bw (Both):	0.4510 
S_Dbw scat (Both):	0.2036 
S_Dbw validity index (Both):	0.6545 

WGS

Solution chosen by selectSolution (ploidy 3, cluster 2)

screen shot 2018-11-27 at 10 49 32 pm

screen shot 2018-11-27 at 10 50 20 pm

screen shot 2018-11-21 at 6 52 46 pm

Normal contamination estimate:	0.5843
Average tumour ploidy estimate:	2.928
Clonal cluster cellular prevalence Z=2:	1 0.711
AllelicRatio binomial means for clonal cluster Z=1:	0.5 0.6312 0.7079 0.5283 0.7581 0.586 0.7936 0.6468 0.54 0.8201 0.692 0.564 0.8405 0.727 0.6135 0.5464 0.8567 0.7548 0.6529 0.551 0.87 0.7775 0.685 0.5925 0.5505
logRatio Gaussian means for clonal cluster Z=1:	-1.03 -0.5906 -0.2545 -0.2545 0.01798 0.01798 0.2471 0.2471 0.2471 0.4447 0.4447 0.4447 0.6185 0.6185 0.6185 0.6185 0.7736 0.7736 0.7736 0.7736 0.9136 0.9136 0.9136 0.9136 0.9136
AllelicRatio binomial means for clonal cluster Z=2:	0.5 0.5867 0.6478 0.5202 0.6931 0.5644 0.7281 0.6141 0.5311 0.756 0.6536 0.5512 0.7786 0.6858 0.5929 0.538 0.7974 0.7125 0.6275 0.5425 0.8133 0.735 0.6567 0.5783 0.5427
logRatio Gaussian means for clonal cluster Z=2:	-0.7599 -0.4852 -0.2545 -0.2545 -0.05561 -0.05561 0.1191 0.1191 0.1191 0.2749 0.2749 0.2749 0.4156 0.4156 0.4156 0.4156 0.5437 0.5437 0.5437 0.5437 0.6614 0.6614 0.6614 0.6614 0.6614
logRatio Gaussian variance:	0.003128 0.002515 0.002094 0.002094 0.001217 0.001217 0.00156 0.00156 0.00156 0.001367 0.001367 0.001367 0.00244 0.00244 0.00244 0.00244 0.002472 0.002472 0.002472 0.002472 0.002499 0.002499 0.002499 0.002499 0.002499
Number of iterations:	10
Log likelihood:	-1068000
S_Dbw dens.bw (LogRatio):	0.0157 
S_Dbw scat (LogRatio):	0.0085 
S_Dbw validity index (LogRatio):	0.0241 
S_Dbw dens.bw (AllelicRatio):	0.3009 
S_Dbw scat (AllelicRatio):	0.0228 
S_Dbw validity index (AllelicRatio):	0.3237 
S_Dbw dens.bw (Both):	0.3166 
S_Dbw scat (Both):	0.0313 
S_Dbw validity index (Both):	0.3479 

WGS

Ploidy 2 / cluster 2 solution

screen shot 2018-11-27 at 10 52 13 pm

screen shot 2018-11-27 at 10 52 30 pm

screen shot 2018-11-21 at 7 00 49 pm

Normal contamination estimate:	0.626
Average tumour ploidy estimate:	1.976
Clonal cluster cellular prevalence Z=1:	1
AllelicRatio binomial means for clonal cluster Z=1:	0.5 0.615 0.687 0.5255 0.7363 0.5788 0.7722 0.6361 0.5371 0.7995 0.6797 0.5599 0.8209 0.714 0.607 0.5438 0.8382 0.7416 0.645 0.5483 0.8525 0.7644 0.6762 0.5881 0.5481
logRatio Gaussian means for clonal cluster Z=1:	-0.6692 -0.2922 0.006471 0.006471 0.2538 0.2538 0.4648 0.4648 0.4648 0.6489 0.6489 0.6489 0.8121 0.8121 0.8121 0.8121 0.9588 0.9588 0.9588 0.9588 1.092 1.092 1.092 1.092 1.092
logRatio Gaussian variance:	0.004647 0.001763 0.001499 0.001499 0.002408 0.002408 0.002466 0.002466 0.002466 0.002498 0.002498 0.002498 0.002499 0.002499 0.002499 0.002499 0.0025 0.0025 0.0025 0.0025 0.0025 0.0025 0.0025 0.0025 0.0025
Number of iterations:	7
Log likelihood:	-1399000
S_Dbw dens.bw (LogRatio):	0.0035 
S_Dbw scat (LogRatio):	0.0278 
S_Dbw validity index (LogRatio):	0.0313 
S_Dbw dens.bw (AllelicRatio):	0.2213 
S_Dbw scat (AllelicRatio):	0.0671 
S_Dbw validity index (AllelicRatio):	0.2884 
S_Dbw dens.bw (Both):	0.2248 
S_Dbw scat (Both):	0.0949 
S_Dbw validity index (Both):	0.3197 

WGS

Ploidy 2 / cluster 3 solution

screen shot 2018-11-27 at 10 55 05 pm

screen shot 2018-11-27 at 10 55 26 pm

screen shot 2018-11-21 at 7 07 58 pm

Normal contamination estimate:	0.6175
Average tumour ploidy estimate:	1.967
Clonal cluster cellular prevalence Z=2:	1 0.3598
AllelicRatio binomial means for clonal cluster Z=1:	0.5 0.6182 0.6912 0.5261 0.7408 0.5803 0.7767 0.6383 0.5377 0.8038 0.6823 0.5608 0.8251 0.7167 0.6084 0.5443 0.8422 0.7444 0.6466 0.5489 0.8562 0.7672 0.6781 0.5891 0.5486
logRatio Gaussian means for clonal cluster Z=1:	-0.6863 -0.2971 0.009135 0.009135 0.2616 0.2616 0.4764 0.4764 0.4764 0.6633 0.6633 0.6633 0.8287 0.8287 0.8287 0.8287 0.9772 0.9772 0.9772 0.9772 1.112 1.112 1.112 1.112 1.112
AllelicRatio binomial means for clonal cluster Z=2:	0.5 0.5369 0.5688 0.5094 0.5966 0.5322 0.621 0.5605 0.5165 0.6426 0.5855 0.5285 0.6619 0.6079 0.554 0.5221 0.6792 0.628 0.5768 0.5256 0.6948 0.6461 0.5974 0.5487 0.5266
logRatio Gaussian means for clonal cluster Z=2:	-0.2044 -0.0937 0.009135 0.009135 0.1051 0.1051 0.1951 0.1951 0.1951 0.2798 0.2798 0.2798 0.3599 0.3599 0.3599 0.3599 0.4357 0.4357 0.4357 0.4357 0.5077 0.5077 0.5077 0.5077 0.5077
logRatio Gaussian variance:	0.002846 0.001561 0.00123 0.00123 0.001201 0.001201 0.001862 0.001862 0.001862 0.002298 0.002298 0.002298 0.002455 0.002455 0.002455 0.002455 0.00248 0.00248 0.00248 0.00248 0.002485 0.002485 0.002485 0.002485 0.002485
Number of iterations:	9
Log likelihood:	-1127000
S_Dbw dens.bw (LogRatio):	0.0193 
S_Dbw scat (LogRatio):	0.0091 
S_Dbw validity index (LogRatio):	0.0284 
S_Dbw dens.bw (AllelicRatio):	0.3101 
S_Dbw scat (AllelicRatio):	0.0314 
S_Dbw validity index (AllelicRatio):	0.3415 
S_Dbw dens.bw (Both):	0.3294 
S_Dbw scat (Both):	0.0405 
S_Dbw validity index (Both):	0.3699 

@gavinha
Owner

gavinha commented Nov 22, 2018

The --alphaK argument assigns a pseudocount (e.g. 10000) to each of the states; since there are 11 states, each of the 11 is set to 10000. --alphaKHigh is used for homozygous deletion and for copy number 4 or higher, and it overrides --alphaK at these states. We can tweak this value for the specific states corresponding to copy number 2. The higher this value, the more strongly the prior influences the Gaussian variance parameter estimation. In other words, after EM inference, the distribution for copy number 2 will be more precise, with smaller variance (a tall and skinny probability density function).
For the ploidy2 solution, the data that fall within this range will have a large probability and thus a higher likelihood. For the ploidy4 solution, most data will be in the copy number 4 state, which will have short and wide pdfs and thus a smaller likelihood.
The --alphaK argument can have a dramatic effect on results, including segmentation, copy number state predictions, and ultimately, ploidy solution selection, since it also affects the likelihood.

On a side note, --alphaKHigh was put in place so that we can use a lower value (compared to --alphaK). This leads to higher variance estimates (short and wide pdfs) for extreme copy number states (i.e. 0 and 4+). This helps decrease the probability for points in these states, and thus lowers the likelihood. The idea is that the majority of data should be in the other, non-extreme states.
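
The effect of a larger pseudocount can be illustrated with a generic Bayesian calculation: the posterior mode of a Gaussian variance under an inverse-gamma prior with shape `alpha` and scale `beta`. This Python sketch is not TITAN's actual update. The function name is hypothetical, and `beta = 25` is an assumption picked only because it gives variances of the same size as those in the reports above (about 0.01 and 0.0025).

```python
def map_variance(alpha, beta, residuals):
    """Posterior mode of a Gaussian variance (known mean) under an
    inverse-gamma(alpha, beta) prior. Generic textbook formula, not TITAN code."""
    n = len(residuals)
    sum_sq = sum(r * r for r in residuals)
    return (beta + 0.5 * sum_sq) / (alpha + 0.5 * n + 1.0)


BETA = 25.0  # assumed prior scale, for illustration only

# 1000 residuals (data minus state mean) with a sample variance of 0.04.
data = [0.2, -0.2] * 500

for alpha in (2500, 10000):
    prior_only = map_variance(alpha, BETA, [])
    with_data = map_variance(alpha, BETA, data)
    print(f"alpha={alpha}: prior mode={prior_only:.6f}, with data={with_data:.6f}")
```

With the scale held fixed, the larger pseudocount yields the smaller variance estimate and moves less in response to the same data, which matches the "tall and skinny" versus "short and wide" description above.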

@gavinha
Owner

gavinha commented Nov 22, 2018

@fpbarthel

The copy number segments are usually nice to look at but it is more difficult to assess the solutions. The plots that I like to use are the CNA.pdf and LOH.pdf.

When determining whether a sample is very likely genome doubled (ploidy3 solutions in your plots), this is what I look for:

Scenario 1: Copy neutral (HET and LOH) segments are both present

  1. Copy neutral, heterozygous segment
    a) CNA plot - find a segment that is copy neutral (CN=2, blue)
    b) LOH plot - this segment is heterozygous (points around 0.5 and is grey)

  2. Copy neutral, LOH segment
    a) CNA plot - find a segment that is copy neutral and should be at the same level as in (1)
    b) LOH plot - this segment is homozygous (points split around 0.5 and is blue).

If you can find examples of BOTH of these, then it is very likely that this solution is correct, as long as you don't have large homozygous deletions (see next scenario).

Scenario 2: Large homozygous deletions

If there are large homozygous deletions spanning tens to hundreds of Mbp, then this solution is likely incorrect and a higher ploidy solution should be considered. This scenario is handled in the selectSolution.R script, but users should check.
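
The two scenarios above can be written down as a simple check over a list of segments. In this Python sketch, the segment layout (dicts with the keys `copy_number`, `loh` and `length_bp`) and the 10 Mbp cutoff for a "large" deletion are assumptions made for illustration; TITAN's own segment files use different column names, and the check is no substitute for looking at the CNA and LOH plots.

```python
LARGE_HOMD_BP = 10_000_000  # assumed cutoff for "large" (tens of Mbp and up)


def assess_solution(segments):
    """Flag the two scenarios for one solution.

    segments: list of dicts with hypothetical keys
      'copy_number' (int), 'loh' (bool), 'length_bp' (int).
    """
    neutral = [s for s in segments if s["copy_number"] == 2]
    has_neutral_het = any(not s["loh"] for s in neutral)
    has_neutral_loh = any(s["loh"] for s in neutral)
    large_homd = any(
        s["copy_number"] == 0 and s["length_bp"] >= LARGE_HOMD_BP
        for s in segments
    )
    return {
        "neutral_het_and_loh": has_neutral_het and has_neutral_loh,
        "large_homozygous_deletion": large_homd,
    }


# Made-up segments: one copy-neutral HET, one copy-neutral LOH, one focal deletion.
segments = [
    {"copy_number": 2, "loh": False, "length_bp": 50_000_000},
    {"copy_number": 2, "loh": True, "length_bp": 30_000_000},
    {"copy_number": 0, "loh": True, "length_bp": 200_000},
]
result = assess_solution(segments)
print(result)
```

Here the small homozygous deletion (200 kbp, the size of a focal event) does not trip the second flag, while both kinds of copy-neutral segment are present, so the first flag is set.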

@fpbarthel
Author

Thank you, very helpful!

@Kcjohnson

Kcjohnson commented Nov 27, 2018

Hi @gavinha, I am working alongside @fpbarthel on the same brain tumor data set and I tried out your advice to select the minimum S_Dbw Validity Index out of all runs across all ploidy and cluster initializations. It did seem to reduce the number of 4n solutions to what is more in line with previously published data. Thanks for your suggestion!

Nevertheless, whether we used selectSolution or took the minimum S_Dbw validity index, we ran into a separate issue: we found ploidy differences > 1 in up to 33% of the samples for which we had both whole genome sequencing and whole exome sequencing. There did not seem to be a consistent trend of the WGS displaying a higher or lower ploidy than the WXS. Having both whole genome and whole exome data is a peculiar feature of our dataset, but it made us wonder whether you might have any thoughts on these discordances and/or ideas about combining the two data types. Ideally, we would generate solutions that have closer to 90% ploidy concordance.

Below is one discordant example where the WGS and WXS data were generated from the same DNA extraction/aliquot:

test-sample whole genome sequencing

Normal contamination estimate:	0.3163
Average tumour ploidy estimate:	1.989
Clonal cluster cellular prevalence Z=1:	1
AllelicRatio binomial means for clonal cluster Z=1:	0.5 0.7597 0.8419 0.538 0.8821 0.6274 0.9061 0.703 0.5451 0.9219 0.7532 0.5844 0.9332 0.7888 0.6444 0.5481 0.9416 0.8155 0.6893 0.5631 0.9482 0.8361 0.7241 0.612 0.5498
logRatio Gaussian means for clonal cluster Z=1:	-1.655 -0.5983 0.005249 0.005249 0.4295 0.4295 0.7569 0.7569 0.7569 1.024 1.024 1.024 1.249 1.249 1.249 1.249 1.443 1.443 1.443 1.443 1.615 1.615 1.615 1.615 1.615
logRatio Gaussian variance:	0.002496 0.001071 0.0007525 0.0007525 0.0009414 0.0009414 0.002421 0.002421 0.002421 0.002507 0.002507 0.002507 0.002461 0.002461 0.002461 0.002461 0.002428 0.002428 0.002428 0.002428 0.5594 0.5594 0.5594 0.5594 0.5594
Number of iterations:	8
Log likelihood:	-2080000
S_Dbw dens.bw (LogRatio):	0.0269 
S_Dbw scat (LogRatio):	0.0192 
S_Dbw validity index (LogRatio):	0.0461 
S_Dbw dens.bw (AllelicRatio):	0.1985 
S_Dbw scat (AllelicRatio):	0.0203 
S_Dbw validity index (AllelicRatio):	0.2188 
S_Dbw dens.bw (Both):	0.2254 
S_Dbw scat (Both):	0.0395 
S_Dbw validity index (Both):	0.2648 

test2_wgs_cna
test2-wgs-loh

test-sample exome sequencing

Normal contamination estimate:	0.243
Average tumour ploidy estimate:	3.902
Clonal cluster cellular prevalence Z=1:	1
AllelicRatio binomial means for clonal cluster Z=1:	0.5 0.8045 0.8785 0.5379 0.9119 0.6373 0.9309 0.7154 0.5431 0.9431 0.7659 0.5886 0.9517 0.8011 0.6506 0.5452 0.958 0.8271 0.6963 0.5654 0.9629 0.8471 0.7314 0.6157 0.5463
logRatio Gaussian means for clonal cluster Z=1:	-2.823 -1.469 -0.7825 -0.7825 -0.3194 -0.3194 0.03067 0.03067 0.03067 0.3121 0.3121 0.3121 0.5475 0.5475 0.5475 0.5475 0.7499 0.7499 0.7499 0.7499 0.9273 0.9273 0.9273 0.9273 0.9273
logRatio Gaussian variance:	0.009996 0.01 0.01017 0.01017 0.01295 0.01295 0.01592 0.01592 0.01592 0.01219 0.01219 0.01219 0.0101 0.0101 0.0101 0.0101 0.009998 0.009998 0.009998 0.009998 0.01354 0.01354 0.01354 0.01354 0.01354
Number of iterations:	6
Log likelihood:	-71450
S_Dbw dens.bw (LogRatio):	0.1354 
S_Dbw scat (LogRatio):	0.1126 
S_Dbw validity index (LogRatio):	0.2480 
S_Dbw dens.bw (AllelicRatio):	0.4040 
S_Dbw scat (AllelicRatio):	0.0485 
S_Dbw validity index (AllelicRatio):	0.4524 
S_Dbw dens.bw (Both):	0.5394 
S_Dbw scat (Both):	0.1610 
S_Dbw validity index (Both):	0.7004 

test2_wxs_cna
test2-wxs-loh

Updated with matching LOH plots.
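
The discordance measure described above (a ploidy difference > 1 between the WGS and WXS results for the same sample) can be computed as follows. This Python sketch uses three WGS/WXS ploidy pairs taken from the reports in this thread; the function name and the pairing of values are only an illustration of the calculation, not the authors' analysis code.

```python
def discordant_fraction(pairs, max_diff=1.0):
    """Fraction of (wgs_ploidy, wxs_ploidy) pairs whose ploidies differ by more than max_diff."""
    if not pairs:
        raise ValueError("need at least one pair")
    discordant = [p for p in pairs if abs(p[0] - p[1]) > max_diff]
    return len(discordant) / len(pairs)


# (WGS ploidy, WXS ploidy) values copied from reports in this thread.
pairs = [
    (1.989, 3.902),  # test sample in this comment
    (2.928, 1.871),  # GBM sample, WGS solution chosen by selectSolution
    (1.976, 1.871),  # same GBM sample, WGS ploidy 2 / cluster 2 solution
]
fraction = discordant_fraction(pairs)
print(f"{fraction:.2f} of pairs differ by more than 1")
```

The first two pairs differ by more than 1, while the third differs by about 0.1, which shows how much the concordance depends on which solution is picked for each sample.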

@gavinha
Owner

gavinha commented Nov 27, 2018

Hi @Kcjohnson and @fpbarthel

Thanks for sharing your experiences with TITAN and bringing up the concern regarding ploidy.

Frankly, selecting the correct ploidy solution is a very challenging problem that I still encounter. Honestly, 33% of samples showing discordance is more or less what I would expect, considering that 66% of samples fared better.

Looking at your plots...
First of all, it's strange that chr3, 19, and 20 in the WGS LOH plots show red. Red indicates a gain but the CNA plot shows blue which is copy neutral. There are some other events that don't match either. I have not seen this before but maybe it's a bug? The WES LOH plots also seem to have this problem so it's a little hard for me to assess the correctness.

Depending on the tumor type, I generally begin by leaning towards diploid (ploidy2) solutions, unless there is some very obvious evidence that genome doubling has occurred. See the previous message for my guidelines (I should probably put this into the Wiki). Next, I consider the clonal cluster solutions.

In this example, I would say that the WGS results (diploid) look more believable.

Of course, for some tumor types where genome doubling is more frequent, we can begin with different expectations. Usually for these frequent doubled tumors, I do notice the evidence for doubling.

So ultimately, like you are already doing, manual inspection of solutions and results is recommended.
On the bright side, if all your samples look as clean as the data above, then they'll be relatively easy to inspect.

I'm sorry I can't be more helpful.

Best,
Gavin

Edit: I should also add that I would tend to believe the WGS results more because TITAN was designed for WGS. There are many WES-based tools available and so you can try to use an alternative method to see how the ploidy matches up. My guess is that you'll probably see the same or worse discordance.

@fpbarthel
Author

fpbarthel commented Nov 28, 2018

> The copy number segments are usually nice to look at but it is more difficult to assess the solutions. The plots that I like to use are the CNA.pdf and LOH.pdf.

I updated all previous posts to include these plots instead.

> Depending on the tumor type, I generally begin by leaning towards diploid (ploidy2) solutions, unless there is some very obvious evidence that genome doubling has occurred.
> Of course, for some tumor types where genome doubling is more frequent, we can begin with different expectations.

I've noticed the --threshold parameter to selectSolution (link) can influence prior expectations. What is the rationale for the default value of 0.05 and what are reasonable arguments to deviate from the default?

> On a side note, the --alphaKHigh was put in place so that we can use a lower value (compared to --alphaK.

So far I've been using the recommended values for --alphaKHigh and --alphaK: 10000 in whole genomes and 2500 in whole exomes, for both parameters. What is the rationale for these values, and what are good arguments for changing them from their defaults or for making them differ from each other, by how much, and in which scenarios? Extreme copy number values are not uncommon in cancer, e.g. homozygous CDKN2A deletion or extreme extrachromosomal DNA amplifications.

> I'm sorry I can't be more helpful.

On the contrary, your suggestions have been extremely helpful for us learning TITAN and in tweaking parameters to optimize results. We are happy to contribute examples to the community if it helps.

Floris

@Kcjohnson

Kcjohnson commented Nov 28, 2018

> First of all, it's strange that chr3, 19, and 20 in the WGS LOH plots show red. Red indicates a gain but the CNA plot shows blue which is copy neutral. There are some other events that don't match either. I have not seen this before but maybe it's a bug?

Not a bug; I had previously uploaded mismatched CNA and LOH plots. Sorry for the confusion! The new plots are from the same sample.

@GuoFengWang

Could you tell me how to create this plot? I'm a beginner with R.
