Commit

Update README.md
elhurtado committed Nov 23, 2024
1 parent 9fcc770 commit cbf1e9d
Showing 1 changed file with 32 additions and 25 deletions.
57 changes: 32 additions & 25 deletions README.md
@@ -7,7 +7,7 @@ An implementation of the forest structured Chinese restaurant process with a Dir

## Overview
1. [PhyClone Installation](#installation)
2. [Input File Formats](#input-files)
2. [Input File Formats](#input-file-formats)
* [Main input format](#main-input-format)
* [Cluster input format](#cluster-file-format)
3. [Running PhyClone: Basic Usage](#running-phyclone)
@@ -69,18 +69,19 @@ PhyClone analysis has two possible input files:
---------
### Main input format

To run a PhyClone analysis you will need to prepare an input file.
The file should be in tab delimited tidy data frame format and have the following columns.

> [!TIP]
> There is an example file in [examples/data/mixing.tsv](examples/data/mixing.tsv)
To run a PhyClone analysis you will need to prepare an input file.
The file should be in tab delimited tidy data frame format and have the following columns:

1. mutation_id - Unique identifier for the mutation.
This is free form but should match across all samples.

> [!WARNING]
> PhyClone will remove any mutations without entries for all detected samples.
If you have mutations with no data in some samples set their counts to 0.
> PhyClone will remove any mutations without entries for all provided samples.
> If there are mutations with no data in a subset of the samples, the correct procedure is to extract ref and alt counts for these mutations from each affected sample's associated BAM file.
> Please refer to [this thread](https://groups.google.com/g/pyclone-user-group/c/wgXV7tq470Y) for further detail.
2. sample_id - Unique identifier for the sample.

@@ -97,13 +98,13 @@ For autosome this will be two and male sex chromosome one.

You can include the following optional columns:

1. tumour_content - The tumour content (cellularity) of the sample.
8. tumour_content - The tumour content (cellularity) of the sample.
Default value is 1.0 if column is not present.
> [!NOTE]
> In principle this could be different for each mutation/sample.
However, in most cases it should be the same for all mutations in a sample.

2. error_rate - Sequencing error rate.
9. error_rate - Sequencing error rate.
Default value is 0.001 if column is not present.

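The warning above about mutations lacking entries for some samples can be checked before a run. The following is a minimal sketch, not part of PhyClone: the helper name `find_incomplete_mutations` is hypothetical, and it relies only on the `mutation_id` and `sample_id` columns described above.

```python
import csv
import io
from collections import defaultdict


def find_incomplete_mutations(tsv_handle):
    """Map each incomplete mutation_id to the samples it has no row for.

    Hypothetical helper for illustration; it is not part of PhyClone.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    samples_by_mutation = defaultdict(set)
    all_samples = set()
    for row in reader:
        samples_by_mutation[row["mutation_id"]].add(row["sample_id"])
        all_samples.add(row["sample_id"])
    return {
        mutation: sorted(all_samples - seen)
        for mutation, seen in samples_by_mutation.items()
        if seen != all_samples
    }


# Toy table holding only the two identifier columns; a real input file
# carries the remaining required columns as well.
example = io.StringIO(
    "mutation_id\tsample_id\n"
    "m1\tS1\n"
    "m1\tS2\n"
    "m2\tS1\n"
)
incomplete = find_incomplete_mutations(example)
print(incomplete)  # {'m2': ['S2']}
```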
------------------
@@ -119,22 +120,25 @@ Default value is 0.001 if column is not present.
> [PyClone-VI](https://github.com/Roth-Lab/pyclone-vi). Both due to its established
> strong performance, and its output format which can be fed directly into PhyClone *'as-is'*.
The file should be in tab delimited tidy data frame format and have the following columns.
The file should be in tab delimited tidy data frame format and have the following columns:

1. mutation_id - Unique identifier for the mutation.

This is free form but should match across all samples and **must** match the identifiers provided
in the [main input file](#main-input-format).

2. sample_id - Unique identifier for the sample.

3. cluster_id - Cluster that the mutation has been assigned to.

You can include the following optional columns:

4. chrom - Chromosome on which mutation_id is found

5. ccf - Cluster cellular prevalence estimate (included in all [PyClone-VI](https://github.com/Roth-Lab/pyclone-vi) clustering results)

> [!NOTE] In order to make use of PhyClone's data informed loss probability prior assignment, columns 4 and 5 are required.
> [!NOTE]
> In order to make use of PhyClone's data informed loss probability prior assignment, columns 4 and 5 are required.
[//]: # (4. outlier_prob - (Prior) probability that the cluster/mutation is an outlier.)

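Because the cluster file's `mutation_id` values must match those in the main input file, a quick consistency check may help before sampling. This is a sketch under that single assumption; the function names are hypothetical and not part of PhyClone.

```python
import csv
import io


def read_mutation_ids(tsv_handle):
    """Collect the set of mutation_id values from a tab-delimited table."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return {row["mutation_id"] for row in reader}


def compare_mutation_ids(main_handle, cluster_handle):
    """Return (ids only in the main file, ids only in the cluster file)."""
    main_ids = read_mutation_ids(main_handle)
    cluster_ids = read_mutation_ids(cluster_handle)
    return sorted(main_ids - cluster_ids), sorted(cluster_ids - main_ids)


# Toy tables for illustration; real files carry more columns.
main_tsv = io.StringIO("mutation_id\tsample_id\nm1\tS1\nm2\tS1\n")
cluster_tsv = io.StringIO(
    "mutation_id\tsample_id\tcluster_id\nm1\tS1\t0\nm3\tS1\t1\n"
)
only_main, only_cluster = compare_mutation_ids(main_tsv, cluster_tsv)
print(only_main, only_cluster)  # ['m2'] ['m3']
```

Two empty lists indicate that the identifiers agree between the two files.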
@@ -155,7 +159,7 @@ You can include the following optional columns:

PhyClone analyses are broken into two parts.
First, sampling is performed using the `run` sub-command.
Second, the output sampling trace from the sampling `run` can be summarised as either a point-estimate tree ([MAP](#map-point-estimate-tree) or [Consensus](#consensus-point-estimate-tree)) or topology report.
Second, the output trace from the sampling `run` can be summarised as either a point-estimate tree ([MAP](#map-point-estimate-tree) or [Consensus](#consensus-point-estimate-tree)) or [topology report](#topology-report-and-sampled-topologies-archive).

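The two-part workflow can be sketched as a pair of argument lists, one for `phyclone run` and one for `phyclone map`. The flags used here (`-i`, `-c`, `-o`, `-t`, `--num-chains`) are the ones shown in this README; the Python helper names are hypothetical, and the sketch only builds and prints the commands rather than executing them.

```python
import shlex


def build_run_command(input_tsv, trace_out, clusters_tsv=None, num_chains=4):
    """Assemble the sampling step; the cluster file is optional."""
    cmd = ["phyclone", "run", "-i", input_tsv]
    if clusters_tsv is not None:
        cmd += ["-c", clusters_tsv]
    cmd += ["-o", trace_out, "--num-chains", str(num_chains)]
    return cmd


def build_map_command(trace_in, tree_out, table_out):
    """Assemble the MAP summary step that consumes the sampling trace."""
    return ["phyclone", "map", "-i", trace_in, "-t", tree_out, "-o", table_out]


steps = [
    build_run_command("INPUT.tsv", "TRACE.pkl.gz", clusters_tsv="CLUSTERS.tsv"),
    build_map_command("TRACE.pkl.gz", "TREE.nwk", "TABLE.tsv"),
]
for step in steps:
    # To execute a step, pass the list to subprocess.run(step, check=True).
    print(shlex.join(step))
```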
Sampling can be run as follows:
```
@@ -164,11 +168,11 @@ phyclone run -i INPUT.tsv -c CLUSTERS.tsv -o TRACE.pkl.gz --num-chains 4
Which will take the [`INPUT.tsv`](#main-input-format) and (optionally) the [`CLUSTERS.tsv`](#cluster-file-format) file, as described above and write the trace file `TRACE.pkl.gz` in a compressed Python pickle format.

Relevant program options:
* `--num-chains` command controls how many independent parallel PhyClone sampling chains to use. Though the default value is set to 1, PhyClone will benefit from running with at least 4 chains, if the compute cores can be spared.
* `--num-chains` command controls how many independent parallel PhyClone sampling chains to use. Though the default value is set to 1, PhyClone will benefit from running multiple chains; we recommend ≥4 chains, if the compute cores can be spared.
* `-n` command can be used to control the number of iterations of sampling to perform.
* `-b` command can be used to control the number of burn-in iterations to perform.
* `--seed` command can be used to seed the random number generator for reproducible results.
*

> [!NOTE]
> Burn-in is done using a heuristic strategy of unconditional SMC.
All samples from the burn-in are discarded as they will not target the posterior.
@@ -190,21 +194,25 @@ phyclone run --help
As explored in the PhyClone paper, PhyClone is equipped with the ability to model mutational outliers and loss. There are two main approaches to running PhyClone with outlier modelling:
1. Using a global outlier probability.
* If running on un-clustered data, this is the only option available to activate outlier modelling.
* Use `--outlier-prob <user-defined-loss-prior-probability>` replacing the `<>` text with a decimal value in the [0, 1] range. Barring prior knowledge, 0.001 should suffice.
> [!NOTE] This option will also allow for the use of a global loss probability prior on clustered runs as well.
* Use `--outlier-prob` with a decimal value in the [0, 1] range. Barring prior knowledge, 0.001 should suffice.
> [!NOTE]
> The `--outlier-prob` option will also allow for the use of a global loss probability prior on clustered runs as well.
2. Assigning the outlier probability from clustered data.
* PhyClone is also able to split clusters into either high-loss or low-loss probability groupings. This feature requires that the clustered data include mutational chromosome assignments (which can be supplied in either the [data.tsv](#main-input-format) or [cluster.tsv](#cluster-file-format) files) and cluster cellular prevalence (CCF) measures.
* PhyClone is also able to assign clusters either a high or low outlier prior probability, based on the input data.
* This feature requires that the clustered data include mutational chromosome assignments, the `chrom` column (which can be supplied in either the [data.tsv](#main-input-format) or [cluster.tsv](#cluster-file-format) files) and cluster cellular prevalence (CCF) measures, the `ccf` column (which should be included in the [cluster.tsv](#cluster-file-format) file).
* To activate this feature, ensure the input files are populated with the appropriate columns and include the `--assign-loss-prob` flag in the PhyClone `run` command.
> [!TIP] If using PyClone-VI for clustering, the CCF column will come as a part of its results. And you need only append the chromosomal positioning column `chrom` to either input files.
> [!TIP]
> If using PyClone-VI for clustering, the CCF column will come as a part of its results. And you need only append the chromosomal positioning column `chrom` to either input files.
> [!IMPORTANT] With outlier modelling active, the end result table will assign all mutations inferred to be lost or outliers to a clone with the id of `-1`.
> [!IMPORTANT]
> With outlier modelling active, the end result table will assign all mutations inferred to be lost or outliers to a clone with the id of `-1`.
-----------------

## PhyClone Output

PhyClone includes ways three ways to summarise the results from a sampling trace file.
Two of which produce a point-estimate (a single tree), and a third which can reports on and can optionally build results for all uniquely sampled topologies:
PhyClone includes three ways to summarise the results from a sampling trace file.
Two of which produce a point-estimate (a single tree), and a third which reports on and can optionally build results for all uniquely sampled topologies:
1. [MAP tree](#map-point-estimate-tree)
* **(Recommended)** Retrieves the tree with the highest sampled joint-likelihood.
2. [Consensus tree](#consensus-point-estimate-tree)
@@ -221,7 +229,7 @@ phyclone map -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv
Where `TRACE.pkl.gz` is the result from a PhyClone sampling run.

Expected output:
* `TREE.nwk` the inferred MAP clone tree topology in newick format.
* `TREE.nwk` the inferred MAP clone tree topology in Newick format.
* `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.

For more advanced options, run:
@@ -238,7 +246,7 @@ phyclone consensus -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv
Where `TRACE.pkl.gz` is the result from a PhyClone sampling run.

Expected output:
* `TREE.nwk` the inferred MAP clone tree topology in newick format.
* `TREE.nwk` the inferred MAP clone tree topology in Newick format.
* `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.

For more advanced options, run:
@@ -248,8 +256,7 @@ phyclone consensus --help

### Topology Report and Sampled Topologies Archive

Additionally, PhyClone is able to produce a summary report and archive file of all uniquely sampled topologies from an analysis run.
The
Additionally, PhyClone is able to produce a summary report and archive file of all uniquely sampled topologies from a sampling `run`.

To build the PhyClone topology report and full sampled topologies archive, run the `topology-report` command as follows:
```
@@ -263,7 +270,7 @@ used to identify the tree in the accompanying topologies archive).
* `SAMPLED_TOPOLOGIES.tar.gz`, a compressed archive where each folder represents a uniquely sampled topology, folder names align with topology identifiers found in the `TOPOLOGY_TABLE.tsv`

Expected output, for each sampled topology folder in the `SAMPLED_TOPOLOGIES.tar.gz` (sampled-topologies archive):
* `TREE.nwk` the inferred MAP clone tree topology in newick format.
* `TREE.nwk` the inferred MAP clone tree topology in Newick format.
* `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.

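When outlier modelling is active, mutations inferred to be lost or outliers are assigned to a clone with the id `-1` in the results table. The sketch below separates those mutations from the rest. The column names `mutation_id` and `clone_id` are assumptions made for illustration, since this README does not list the `TABLE.tsv` header; adjust them to match the actual file.

```python
import csv
import io


def split_outliers(table_handle, clone_column="clone_id"):
    """Return (retained mutation ids, outlier mutation ids).

    Column names are assumed for illustration, not confirmed by PhyClone.
    """
    kept, outliers = set(), set()
    for row in csv.DictReader(table_handle, delimiter="\t"):
        target = outliers if row[clone_column] == "-1" else kept
        target.add(row["mutation_id"])
    return sorted(kept), sorted(outliers)


# Toy table with one row per mutation and sample, using the assumed header.
table = io.StringIO(
    "mutation_id\tsample_id\tclone_id\n"
    "m1\tS1\t0\n"
    "m1\tS2\t0\n"
    "m2\tS1\t-1\n"
    "m2\tS2\t-1\n"
)
kept, outliers = split_outliers(table)
print(kept, outliers)  # ['m1'] ['m2']
```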
Additional options: