diff --git a/README.md b/README.md index d886b6b..dd59395 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ An implementation of the forest structured Chinese restaurant process with a Dir ## Overview 1. [PhyClone Installation](#installation) -2. [Input File Formats](#input-files) +2. [Input File Formats](#input-file-formats) * [Main input format](#main-input-format) * [Cluster input format](#cluster-file-format) 3. [Running PhyClone: Basic Usage](#running-phyclone) @@ -69,18 +69,19 @@ PhyClone analysis has two possible input files: --------- ### Main input format -To run a PhyClone analysis you will need to prepare an input file. -The file should be in tab delimited tidy data frame format and have the following columns. - > [!TIP] > There is an example file in [examples/data/mixing.tsv](examples/data/mixing.tsv) +To run a PhyClone analysis you will need to prepare an input file. +The file should be in tab delimited tidy data frame format and have the following columns: + 1. mutation_id - Unique identifier for the mutation. This is free form but should match across all samples. > [!WARNING] -> PhyClone will remove any mutations without entries for all detected samples. -If you have mutations with no data in some samples set their counts to 0. +> PhyClone will remove any mutations without entries for all provided samples. +> If there are mutations with no data in a subset of the samples, the correct procedure is to extract ref and alt counts for these mutations from each affected sample's associated BAM file. +> Please refer to [this thread](https://groups.google.com/g/pyclone-user-group/c/wgXV7tq470Y) for further detail. 2. sample_id - Unique identifier for the sample. @@ -97,13 +98,13 @@ For autosome this will be two and male sex chromosome one. You can include the following optional columns: -1. tumour_content - The tumour content (cellularity) of the sample. +8. tumour_content - The tumour content (cellularity) of the sample. 
Default value is 1.0 if column is not present. > [!NOTE] > In principle this could be different for each mutation/sample. However, in most cases it should be the same for all mutations in a sample. -2. error_rate - Sequencing error rate. +9. error_rate - Sequencing error rate. Default value is 0.001 if column is not present. ------------------ @@ -119,7 +120,7 @@ Default value is 0.001 if column is not present. > [PyClone-VI](https://github.com/Roth-Lab/pyclone-vi). Both due to its established > strong performance, and its output format which can be fed directly into PhyClone *'as-is'*. -The file should be in tab delimited tidy data frame format and have the following columns. +The file should be in tab delimited tidy data frame format and have the following columns: 1. mutation_id - Unique identifier for the mutation. @@ -127,14 +128,17 @@ The file should be in tab delimited tidy data frame format and have the followin in the [main input file](#main-input-format). 2. sample_id - Unique identifier for the sample. + 3. cluster_id - Cluster that the mutation has been assigned to. You can include the following optional columns: 4. chrom - Chromosome on which mutation_id is found + 5. ccf - Cluster cellular prevalence estimate (included in all [PyClone-VI](https://github.com/Roth-Lab/pyclone-vi) clustering results) -> [!NOTE] In order to make use of PhyClone's data informed loss probability prior assignment, columns 4 and 5 are required. +> [!NOTE] +> In order to make use of PhyClone's data informed loss probability prior assignment, columns 4 and 5 are required. [//]: # (4. outlier_prob - (Prior) probability that the cluster/mutation is an outlier.) @@ -155,7 +159,7 @@ You can include the following optional columns: PhyClone analyses are broken into two parts. First, sampling is performed using the `run` sub-command. 
-Second, the output sampling trace from the sampling `run` can be summarised as either a point-estimate tree ([MAP](#map-point-estimate-tree) or [Consensus](#consensus-point-estimate-tree)) or topology report. +Second, the output trace from the sampling `run` can be summarised as either a point-estimate tree ([MAP](#map-point-estimate-tree) or [Consensus](#consensus-point-estimate-tree)) or [topology report](#topology-report-and-sampled-topologies-archive). Sampling can be run as follows: ``` @@ -164,11 +168,11 @@ phyclone run -i INPUT.tsv -c CLUSTERS.tsv -o TRACE.pkl.gz --num-chains 4 Which will take the [`INPUT.tsv`](#main-input-format) and (optionally) the [`CLUSTERS.tsv`](#cluster-file-format) file, as described above and write the trace file `TRACE.pkl.gz` in a compressed Python pickle format. Relevant program options: -* `--num-chains` command controls how many independent parallel PhyClone sampling chains to use. Though the default value is set to 1, PhyClone will benefit from running with at least 4 chains, if the compute cores can be spared. +* `--num-chains` command controls how many independent parallel PhyClone sampling chains to use. Though the default value is set to 1, PhyClone will benefit from running multiple chains; we recommend ≥4 chains, if the compute cores can be spared. * `-n` command can be used to control the number of iterations of sampling to perform. * `-b` command can be used to control the number of burn-in iterations to perform. * `--seed` command can be used to seed the random number generator for reproducible results. -* + > [!NOTE] > Burn-in is done using a heuristic strategy of unconditional SMC. All samples from the burn-in are discarded as they will not target the posterior. @@ -190,21 +194,25 @@ phyclone run --help As explored in the PhyClone paper, PhyClone is equipped with the ability to model mutational outliers and loss. There are two main approaches to running PhyClone with outlier modelling: 1. 
Using a global outlier probability. * If running on un-clustered data, this is the only option available to activate outlier modelling. - * Use `--outlier-prob ` replacing the `<>` text with a decimal value in the [0, 1] range. Barring prior knowledge, 0.001 should suffice. - > [!NOTE] This option will also allow for the use of a global loss probability prior on clustered runs as well. + * Use `--outlier-prob` with a decimal value in the [0, 1] range. Barring prior knowledge, 0.001 should suffice. +> [!NOTE] +> The `--outlier-prob` option also allows a global loss probability prior to be used on clustered runs. 2. Assigning the outlier probability from clustered data. - * PhyClone is also able to split clusters into either high-loss or low-loss probability groupings. This feature requires that the clustered data include mutational chromosome assignments (which can be supplied in either the [data.tsv](#main-input-format) or [cluster.tsv](#cluster-file-format) files) and cluster cellular prevalence (CCF) measures. + * PhyClone is also able to assign clusters either a high or low outlier prior probability, based on the input data. + * This feature requires that the clustered data include mutational chromosome assignments in a `chrom` column (which can be supplied in either the [data.tsv](#main-input-format) or [cluster.tsv](#cluster-file-format) files) and cluster cellular prevalence (CCF) measures in a `ccf` column (which should be included in the [cluster.tsv](#cluster-file-format) file). * To activate this feature, ensure the input files are populated with the appropriate columns and include the `--assign-loss-prob` flag in the PhyClone `run` command. - > [!TIP] If using PyClone-VI for clustering, the CCF column will come as a part of its results. And you need only append the chromosomal positioning column `chrom` to either input files. +> [!TIP] +> If using PyClone-VI for clustering, the CCF column will come as a part of its results. 
And you need only append the chromosomal positioning column `chrom` to either input file. -> [!IMPORTANT] With outlier modelling active, the end result table will assign all mutations inferred to be lost or outliers to a clone with the id of `-1`. +> [!IMPORTANT] +> With outlier modelling active, the end result table will assign all mutations inferred to be lost or outliers to a clone with the id of `-1`. ----------------- ## PhyClone Output -PhyClone includes ways three ways to summarise the results from a sampling trace file. -Two of which produce a point-estimate (a single tree), and a third which can reports on and can optionally build results for all uniquely sampled topologies: +PhyClone includes three ways to summarise the results from a sampling trace file. +Two of these produce a point estimate (a single tree), and the third reports on, and can optionally build results for, all uniquely sampled topologies: 1. [MAP tree](#map-point-estimate-tree) * **(Recommended)** Retrieves the tree with the highest sampled joint-likelihood. 2. [Consensus tree](#consensus-point-estimate-tree) @@ -221,7 +229,7 @@ phyclone map -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv Where `TRACE.pkl.gz` is the result from a PhyClone sampling run. Expected output: -* `TREE.nwk` the inferred MAP clone tree topology in newick format. +* `TREE.nwk` the inferred MAP clone tree topology in Newick format. * `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample. For more advanced options, run: @@ -238,7 +246,7 @@ phyclone consensus -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv Where `TRACE.pkl.gz` is the result from a PhyClone sampling run. Expected output: -* `TREE.nwk` the inferred MAP clone tree topology in newick format. +* `TREE.nwk` the inferred consensus clone tree topology in Newick format. 
* `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample. For more advanced options, run: @@ -248,8 +256,7 @@ phyclone consensus --help ### Topology Report and Sampled Topologies Archive -Additionally, PhyClone is able to produce a summary report and archive file of all uniquely sampled topologies from an analysis run. -The +Additionally, PhyClone is able to produce a summary report and archive file of all uniquely sampled topologies from a sampling `run`. To build the PhyClone topology report and full sampled topologies archive, run the `topology-report` command as follows: ``` @@ -263,7 +270,7 @@ used to identify the tree in the accompanying topologies archive). * `SAMPLED_TOPOLOGIES.tar.gz`, a compressed archive where each folder represents a uniquely sampled topology, folder names align with topology identifiers found in the `TOPOLOGY_TABLE.tsv` Expected output, for each sampled topology folder in the `SAMPLED_TOPOLOGIES.tar.gz` (sampled-topologies archive): -* `TREE.nwk` the inferred MAP clone tree topology in newick format. +* `TREE.nwk` the sampled clone tree topology in Newick format. * `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample. Additional options: