Andrew Roth edited this page Jan 4, 2017 · 1 revision

# Output of the classify command

## Introduction

All subcommands of classify produce a file in a standard format: a tab-delimited file with a header naming each column.

The file will look like the following:

chrom	position	ref_base	var_base	normal_counts_a	normal_counts_b	tumour_counts_a	tumour_counts_b	p_AA_AA	p_AA_AB	p_AA_BB	p_AB_AA	p_AB_AB	p_AB_BB	p_BB_AA	p_BB_AB	p_BB_BB
1	1299268	T	C	26	25	3	17	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000

The last nine columns of the file list the posterior probability of each of the joint genotypes. They have the form p_gN_gT, where gN is the normal genotype and gT is the tumour genotype. For deterministic methods, exactly one of these columns will be non-zero, and it will have a value of 1.
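As an illustration of the p_gN_gT naming, the following minimal sketch picks the most probable joint genotype for a single row and splits it into its normal and tumour parts. The helper name `most_probable_genotype` and the dictionary representation of a row are assumptions made for this example; they are not part of the tool.

```python
# Hypothetical helper, not part of the tool: find the joint genotype
# with the highest posterior probability in one row.
GENOTYPES = ("AA", "AB", "BB")
JOINT_GENOTYPES = [gn + "_" + gt for gn in GENOTYPES for gt in GENOTYPES]


def most_probable_genotype(row):
    """Return (normal_genotype, tumour_genotype, probability) for a row.

    `row` maps column names such as "p_AB_BB" to values (strings or floats).
    Ties are resolved in favour of the first column in file order.
    """
    best = max(JOINT_GENOTYPES, key=lambda g: float(row["p_" + g]))
    normal, tumour = best.split("_")
    return normal, tumour, float(row["p_" + best])


# Probabilities taken from the example row shown above.
example_row = {"p_" + g: "0.0000" for g in JOINT_GENOTYPES}
example_row["p_AB_BB"] = "1.0000"

print(most_probable_genotype(example_row))
```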

The rows of the file correspond to genomic positions. The columns are as follows:

  1. chrom - Chromosome the site is on.
  2. position - 1-based position on the chromosome.
  3. ref_base - Base found in the reference genome at this position.
  4. var_base - Variant base found at this position. If no variant base is found this will be N.
  5. normal_counts_a - Number of reads matching ref_base in the normal at this position.
  6. normal_counts_b - Number of reads matching var_base in the normal at this position.
  7. tumour_counts_a - Number of reads matching ref_base in the tumour at this position.
  8. tumour_counts_b - Number of reads matching var_base in the tumour at this position.
  9. p_AA_AA - Probability of joint genotype AA_AA.
  10. p_AA_AB - Probability of joint genotype AA_AB.
  11. p_AA_BB - Probability of joint genotype AA_BB.
  12. p_AB_AA - Probability of joint genotype AB_AA.
  13. p_AB_AB - Probability of joint genotype AB_AB.
  14. p_AB_BB - Probability of joint genotype AB_BB.
  15. p_BB_AA - Probability of joint genotype BB_AA.
  16. p_BB_AB - Probability of joint genotype BB_AB.
  17. p_BB_BB - Probability of joint genotype BB_BB.

To extract somatic positions from this file, I suggest summing p_AA_AB and p_AA_BB to get the somatic genotype probability. You can then threshold at whatever level is appropriate.
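The sum described above could be sketched as follows. The function names and the default threshold of 0.99 are assumptions chosen purely for illustration; pick a threshold that suits your own data.

```python
# Hypothetical helpers, not part of the tool.
def somatic_probability(row):
    """Return p_AA_AB + p_AA_BB for one row.

    `row` maps column names to values (strings or floats).
    """
    return float(row["p_AA_AB"]) + float(row["p_AA_BB"])


def is_somatic(row, threshold=0.99):
    """Return True if the somatic probability reaches `threshold`.

    The default of 0.99 is an arbitrary illustrative value.
    """
    return somatic_probability(row) >= threshold


# Made-up probabilities, used only to exercise the functions.
example_row = {"p_AA_AB": "0.9000", "p_AA_BB": "0.0950"}

print(somatic_probability(example_row))
print(is_somatic(example_row))
```

Rows that pass `is_somatic` are the candidate somatic positions; raising the threshold trades sensitivity for fewer false positives.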

This file format is easy to manipulate in Python using the csv module, which is part of the standard library. The csv.DictReader class is especially useful.
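A minimal sketch of reading the file with csv.DictReader is shown below. The function name `read_rows` is an assumption for this example, and the data is read from an in-memory string holding the example row from this page so that the snippet is self-contained; for a real file you would pass a handle from `open(path, newline="")`, where `path` is wherever your output was written.

```python
import csv
import io

# Columns holding integers; every column starting with "p_" holds a float.
INT_COLUMNS = (
    "position",
    "normal_counts_a",
    "normal_counts_b",
    "tumour_counts_a",
    "tumour_counts_b",
)

# The header and example row from this page, joined with tabs.
EXAMPLE = (
    "chrom\tposition\tref_base\tvar_base\t"
    "normal_counts_a\tnormal_counts_b\ttumour_counts_a\ttumour_counts_b\t"
    "p_AA_AA\tp_AA_AB\tp_AA_BB\tp_AB_AA\tp_AB_AB\tp_AB_BB\t"
    "p_BB_AA\tp_BB_AB\tp_BB_BB\n"
    "1\t1299268\tT\tC\t26\t25\t3\t17\t"
    "0.0000\t0.0000\t0.0000\t0.0000\t0.0000\t1.0000\t"
    "0.0000\t0.0000\t0.0000\n"
)


def read_rows(handle):
    """Yield one dict per row, converting counts to int and p_* to float."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for name in list(row):
            if name in INT_COLUMNS:
                row[name] = int(row[name])
            elif name.startswith("p_"):
                row[name] = float(row[name])
        yield row


rows = list(read_rows(io.StringIO(EXAMPLE)))
first = rows[0]
print(first["chrom"], first["position"], first["p_AB_BB"])
```

The chrom column is left as a string because chromosome names are not always numeric.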
