Accurate Bayesian reconstruction of cancer phylogenies from bulk sequencing. An implementation of the forest-structured Chinese restaurant process with a Dirichlet prior on the node parameters.
The recommended way to install PhyClone is through mamba and the Bioconda package channel.
To install into a newly created environment (Recommended):
mamba create --name phyclone phyclone
Or, if installing into a pre-existing environment:
mamba install phyclone
PhyClone analysis has two possible input files:
- Main input file (Required)
- Cluster file
Caution
In principle PhyClone can be used without pre-clustering. However, doing so drastically increases the computational complexity. Thus, pre-clustering is recommended for WGS data.
Tip
There is an example file in examples/data/mixing.tsv
To run a PhyClone analysis you will need to prepare an input file. The file should be in tab delimited tidy data frame format and have the following columns:
mutation_id
- Unique identifier for the mutation. This is free form but should match across all samples.
Warning
PhyClone will remove any mutations without entries for all provided samples. If there are mutations with no data in a subset of the samples, the correct procedure is to extract ref and alt counts for these mutations from each affected sample's associated BAM file. Please refer to this thread for further detail.
sample_id
- Unique identifier for the sample.
ref_counts
- Number of reads matching the reference allele.
alt_counts
- Number of reads matching the alternate allele.
major_cn
- Major copy number of the segment overlapping the mutation.
minor_cn
- Minor copy number of the segment overlapping the mutation.
normal_cn
- Total copy number of the segment in healthy tissue. For autosomes this will be two, and for male sex chromosomes one.
You can include the following optional columns:
tumour_content
- The tumour content (cellularity) of the sample. Default value is 1.0 if column is not present.
Note
In principle this could be different for each mutation/sample. However, in most cases it should be the same for all mutations in a sample.
error_rate
- Sequencing error rate. Default value is 0.001 if column is not present.
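As a pre-flight check on the main input file, the sketch below verifies that the required columns are present and lists any mutation that lacks a row for at least one sample, since PhyClone removes such mutations. The helper name find_incomplete_mutations and the toy values are assumptions made for illustration; they are not part of PhyClone.

```python
import csv
import io
from collections import defaultdict

# Required columns of the main input file, as listed above.
REQUIRED_COLUMNS = [
    "mutation_id", "sample_id", "ref_counts", "alt_counts",
    "major_cn", "minor_cn", "normal_cn",
]


def find_incomplete_mutations(handle):
    """Return mutation_ids that lack a row for at least one sample.

    Hypothetical helper, not part of PhyClone.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    samples_by_mutation = defaultdict(set)
    all_samples = set()
    for row in reader:
        samples_by_mutation[row["mutation_id"]].add(row["sample_id"])
        all_samples.add(row["sample_id"])

    # A mutation is complete only if it has an entry for every sample seen.
    return sorted(m for m, s in samples_by_mutation.items() if s != all_samples)


# Toy two-sample table with made-up counts; "m2" has no row for sample "S2".
toy = io.StringIO(
    "mutation_id\tsample_id\tref_counts\talt_counts\tmajor_cn\tminor_cn\tnormal_cn\n"
    "m1\tS1\t90\t10\t1\t1\t2\n"
    "m1\tS2\t80\t20\t1\t1\t2\n"
    "m2\tS1\t70\t30\t2\t1\t2\n"
)
incomplete = find_incomplete_mutations(toy)
print(incomplete)
```

For a real file, pass an open handle instead of the in-memory table, for example open("INPUT.tsv", newline="").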
Tip
While any mutation pre-clustering method can be used, we recommend PyClone-VI, both for its established strong performance and for its output format, which can be fed directly into PhyClone 'as-is'.
The file should be in tab delimited tidy data frame format and have the following columns:
mutation_id
- Unique identifier for the mutation. This is free form but should match across all samples and must match the identifiers provided in the main input file.
sample_id
- Unique identifier for the sample.
cluster_id
- Cluster that the mutation has been assigned to.
You can include the following optional columns:
chrom
- Chromosome on which mutation_id is found.
ccf
- Cluster cellular prevalence estimate (included in all PyClone-VI clustering results).
Note
In order to make use of PhyClone's data-informed loss probability prior assignment, the optional chrom and ccf columns (columns 4 and 5) are required.
Tip
There is an example file in examples/data/mixing_clusters.tsv
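The cluster file has to agree with the main input file on mutation identifiers, and the optional chrom and ccf columns determine whether the data-informed loss prior can be used. The following is a minimal sketch of such a consistency check. The helper compare_cluster_file and the toy tables are hypothetical, and the main table is trimmed to the two columns the check reads.

```python
import csv
import io


def compare_cluster_file(main_handle, cluster_handle):
    """Compare a cluster table against the main input table.

    Hypothetical helper, not part of PhyClone.
    """
    main_ids = {
        row["mutation_id"]
        for row in csv.DictReader(main_handle, delimiter="\t")
    }
    reader = csv.DictReader(cluster_handle, delimiter="\t")
    columns = set(reader.fieldnames or [])
    cluster_ids = {row["mutation_id"] for row in reader}
    return {
        # Cluster-file mutations with no counterpart in the main input file.
        "not_in_main_input": sorted(cluster_ids - main_ids),
        # True when both optional columns are present in the cluster file.
        "has_chrom_and_ccf": {"chrom", "ccf"} <= columns,
    }


# Toy tables with made-up values; "m3" appears only in the cluster file.
main_tsv = io.StringIO("mutation_id\tsample_id\nm1\tS1\nm2\tS1\n")
cluster_tsv = io.StringIO(
    "mutation_id\tsample_id\tcluster_id\tchrom\tccf\n"
    "m1\tS1\t0\tchr1\t0.9\n"
    "m3\tS1\t1\tchr2\t0.4\n"
)
report = compare_cluster_file(main_tsv, cluster_tsv)
print(report)
```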
PhyClone analyses are broken into two parts.
First, sampling is performed using the run sub-command.
Second, the output trace from the sampling run can be summarised as either a point-estimate tree (MAP or consensus) or a topology report.
Sampling can be run as follows:
phyclone run -i INPUT.tsv -c CLUSTERS.tsv -o TRACE.pkl.gz --num-chains 4
This takes the INPUT.tsv file and (optionally) the CLUSTERS.tsv file, as described above, and writes the trace file TRACE.pkl.gz in a compressed Python pickle format.
Relevant program options:
- The --num-chains option controls how many independent parallel PhyClone sampling chains to use. Though the default value is 1, PhyClone benefits from running multiple chains; we recommend ≥4 chains if the compute cores can be spared.
- The -n option controls the number of sampling iterations to perform.
- The -b option controls the number of burn-in iterations to perform.
- The --seed option seeds the random number generator for reproducible results.
Note
Burn-in is done using a heuristic strategy of unconditional SMC. All samples from the burn-in are discarded as they will not target the posterior.
- The -d option can be used to select the emission density.
  - As in PyClone, the binomial and beta-binomial densities are available.
For more advanced options, run:
phyclone run --help
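When sampling runs are launched from a script, it can help to assemble the command as an argument list. The sketch below uses only the options described above (-i, -c, -o, --num-chains, -n, -b, --seed and -d). The function build_run_command and its parameter names are hypothetical, and it assumes that -d takes the literal values binomial and beta-binomial. The command is built but not executed.

```python
import shlex


def build_run_command(input_tsv, trace_out, clusters_tsv=None, num_chains=4,
                      num_iters=None, burnin=None, seed=None, density=None):
    """Assemble a `phyclone run` argument list (hypothetical helper)."""
    cmd = ["phyclone", "run", "-i", input_tsv, "-o", trace_out,
           "--num-chains", str(num_chains)]
    if clusters_tsv is not None:
        cmd += ["-c", clusters_tsv]
    if num_iters is not None:
        cmd += ["-n", str(num_iters)]
    if burnin is not None:
        cmd += ["-b", str(burnin)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if density is not None:
        # Assumption: these are the literal values accepted by -d.
        if density not in ("binomial", "beta-binomial"):
            raise ValueError(f"unknown emission density: {density}")
        cmd += ["-d", density]
    return cmd


cmd = build_run_command("INPUT.tsv", "TRACE.pkl.gz",
                        clusters_tsv="CLUSTERS.tsv", seed=1234,
                        density="beta-binomial")
print(shlex.join(cmd))
# To launch the run for real: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when file paths contain spaces.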
As explored in the PhyClone paper, PhyClone is equipped with the ability to model mutational outliers and loss. There are two main approaches to running PhyClone with outlier modelling:
- Using a global outlier probability.
  - If running on un-clustered data, this is the only option available to activate outlier modelling.
  - Use --outlier-prob with a decimal value in the [0, 1] range. Barring prior knowledge, 0.001 should suffice.
Note
The --outlier-prob option also allows for the use of a global loss probability prior on clustered runs.
- Assigning the outlier probability from clustered data.
  - PhyClone is also able to assign clusters either a high or low outlier prior probability, based on the input data.
  - This feature requires that the clustered data include mutational chromosome assignments, the chrom column (which can be supplied in either the data.tsv or cluster.tsv file), and cluster cellular prevalence (CCF) measures, the ccf column (which should be included in the cluster.tsv file).
  - To activate this feature, ensure the input files are populated with the appropriate columns and include the --assign-loss-prob flag in the PhyClone run command.
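The choice between the two approaches can be sketched as a small decision helper. The function outlier_flags and its arguments are hypothetical. It encodes only the rules stated above: a global probability must lie in [0, 1], and the assigned approach needs clustered data with both chrom and ccf available. Whether the two flags may be combined is not covered here.

```python
def outlier_flags(approach, outlier_prob=0.001, clustered=True,
                  available_columns=()):
    """Return the outlier-modelling flags for one approach (hypothetical helper).

    available_columns is the union of column names found in the main
    input file and the cluster file.
    """
    if approach == "global":
        if not 0.0 <= outlier_prob <= 1.0:
            raise ValueError("outlier probability must lie in [0, 1]")
        return ["--outlier-prob", str(outlier_prob)]
    if approach == "assigned":
        if not clustered:
            raise ValueError("assigned outlier priors need clustered data")
        missing = {"chrom", "ccf"} - set(available_columns)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return ["--assign-loss-prob"]
    raise ValueError(f"unknown approach: {approach}")


global_flags = outlier_flags("global")
assigned_flags = outlier_flags("assigned",
                               available_columns=["chrom", "ccf", "cluster_id"])
print(global_flags, assigned_flags)
```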
Tip
If using PyClone-VI for clustering, the CCF column will come as a part of its results, and you need only append the chromosomal positioning column chrom to either input file.
Important
With outlier modelling active, the end result table will assign all mutations inferred to be lost or outliers to a clone with the id of -1.
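Downstream analyses often need to separate these mutations from the rest of the results table. The sketch below splits rows on the clone id. The column name clone_id is an assumption, as the results table header is not described here, so it is exposed as a parameter. The helper split_outliers and the toy rows are likewise hypothetical.

```python
import csv
import io


def split_outliers(handle, clone_column="clone_id"):
    """Split results rows into (kept, outliers) by clone id.

    Hypothetical helper; clone_column is an assumed column name.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if clone_column not in (reader.fieldnames or []):
        raise KeyError(f"column not found: {clone_column}")
    kept, outliers = [], []
    for row in reader:
        # Mutations inferred to be lost or outliers carry the clone id -1.
        if row[clone_column].strip() == "-1":
            outliers.append(row)
        else:
            kept.append(row)
    return kept, outliers


# Toy results table with made-up rows.
toy_table = io.StringIO(
    "mutation_id\tclone_id\n"
    "m1\t0\n"
    "m2\t-1\n"
    "m3\t1\n"
)
kept, outliers = split_outliers(toy_table)
print(len(kept), [row["mutation_id"] for row in outliers])
```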
PhyClone includes three ways to summarise the results from a sampling trace file. Two produce a point estimate (a single tree), and the third reports on, and can optionally build results for, all uniquely sampled topologies:
- MAP tree
- (Recommended) Retrieves the tree with the highest sampled joint-likelihood.
- Consensus tree
- Produces a tree built from the consensus of clades across the entire sample trace.
- Topology report and archive
- Produces a summary report table and (optionally) archive file of all uniquely sampled topologies from an analysis run.
To build the PhyClone MAP tree, you can run the map command as follows:
phyclone map -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv
Where TRACE.pkl.gz is the result from a PhyClone sampling run.
Expected output:
- TREE.nwk: the inferred MAP clone tree topology in Newick format.
- TABLE.tsv: a results table which contains the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.
For more advanced options, run:
phyclone map --help
To build the PhyClone consensus tree, you can run the consensus command as follows:
phyclone consensus -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv
Where TRACE.pkl.gz is the result from a PhyClone sampling run.
Expected output:
- TREE.nwk: the inferred consensus clone tree topology in Newick format.
- TABLE.tsv: a results table which contains the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.
For more advanced options, run:
phyclone consensus --help
Additionally, PhyClone is able to produce a summary report and archive file of all uniquely sampled topologies from a sampling run.
To build the PhyClone topology report and full sampled topologies archive, run the topology-report command as follows:
phyclone topology-report -i TRACE.pkl.gz -o TOPOLOGY_TABLE.tsv -t SAMPLED_TOPOLOGIES.tar.gz
Where TRACE.pkl.gz is the result from a PhyClone sampling run.
Expected output:
- TOPOLOGY_TABLE.tsv: a high-level report table detailing each topology's log-likelihood, number of times sampled, and topology identifier (which can be used to identify the tree in the accompanying topologies archive).
- SAMPLED_TOPOLOGIES.tar.gz: a compressed archive where each folder represents a uniquely sampled topology; folder names align with the topology identifiers found in TOPOLOGY_TABLE.tsv.
Expected output, for each sampled topology folder in the SAMPLED_TOPOLOGIES.tar.gz (sampled-topologies archive):
- TREE.nwk: the sampled clone tree topology in Newick format.
- TABLE.tsv: a results table which contains the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.
Additional options:
- --top-trees can be used to specify that only the top x trees should be built, where x is a user-defined value.
  - Trees are ranked by their log-likelihood, such that --top-trees 3 would populate the archive with only the 3 most likely trees.
PhyClone is licensed under the GPL v3, see the LICENSE file for details.