-
Notifications
You must be signed in to change notification settings - Fork 11
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Andrew Roth
committed
Aug 31, 2020
1 parent
986f8e1
commit 0e5639c
Showing
3 changed files
with
7,863 additions
and
8 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
# Overview | ||
|
||
PyClone-VI is fast method for inferring clonal population structure. | ||
PyClone-VI is a fast method for inferring clonal population structure. | ||
|
||
|
||
## Installation | ||
|
@@ -21,10 +21,9 @@ conda create -n pyclone-vi --file requirements.txt --yes | |
``` | ||
source activate pyclone-vi | ||
``` | ||
> Note: You will have to do this whenver you open a new terminal and want to run PyClone-VI. | ||
> Note: You will have to do this whenever you open a new terminal and want to run PyClone-VI. | ||
4. Install PyClone-VI | ||
|
||
``` | ||
pip install git+ssh://[email protected]/aroth85/pyclone-vi.git | ||
``` | ||
|
@@ -36,13 +35,33 @@ pyclone-vi --help | |
|
||
## Usage | ||
|
||
## Quick start | ||
|
||
1. Activate the `conda` environment | ||
``` | ||
source activate pyclone-vi | ||
``` | ||
|
||
2. Fit the model to the data here we use the TRACERx file provided in the `examples` folder. | ||
This assumes you are running in the base directory of the git repo. | ||
Here we run allowing for up to 40 clusters (clones), using the Beta-Binomial distribution and performing 10 random restarts. | ||
This should take a under five minutes. | ||
``` | ||
pyclone-vi fit -i examples/tracerx.tsv -o tracerx.h5 -c 40 -d beta-binomial -r 100 | ||
``` | ||
|
||
3. Next we output the final results from the best random restart. | ||
``` | ||
pyclone-vi write-results-file -i tracerx.h5 -o tracerx.tsv | ||
``` | ||
|
||
### Input format | ||
|
||
To run a PyClone-VI analysis you will need to prepare an input file. | ||
The file should be in tab delimited format and have the following columns. | ||
> Note: There is an example file in examples/data/mixing.tsv | ||
> Note: There is an example file in examples/data/tracerx.tsv | ||
1. mutation_id - Unique identifier for the mutation. | ||
1. mutation_id - Unique identifier for the mutation. | ||
This is free form but should match across all samples. | ||
> Note: PyClone-VI will remove any mutations without entries for all detected samples. | ||
If you have mutations with no data in some samples set their ref/alt counts to 0 for the corresponding sample. | ||
|
@@ -63,16 +82,93 @@ For autosome this will be two and male sex chromosomes one. | |
You can include the following optional columns. | ||
|
||
1. tumour_content - The tumour content (cellularity) of the sample. | ||
Default value is 1.0 if column is not present. | ||
Default value is 1.0 if column is not present. | ||
> Note: In principle this could be different for each mutations/sample. | ||
However it most cases it should be the same for all mutations in a sample. | ||
|
||
2. error_rate - Sequencing error rate. | ||
Default value is 0.001 if column is not present. | ||
Default value is 0.001 if column is not present. | ||
> Note: Most users will not need to change this value. | ||
### Output format | ||
|
||
The results file output by `write-results-file` is in tab delimited format. | ||
There six columns: | ||
|
||
1. mutation_id - Mutation identifier as used in the input file. | ||
|
||
2. sample_id - Unique identifier for the sample as used in the input file. | ||
|
||
3. cluster_id - Most probable cluster or clone the mutation was assigned to. | ||
|
||
4. cellular_prevalence - Proportion of malignant cells with the mutation in the sample. | ||
This is also called cancer cell fraction (CCF) in the literature. | ||
|
||
5. cellular_prevalence_std - Standard error of the cellular_prevalence estimate. | ||
|
||
6. cluster_assignment_prob - Posterior probability the mutation is assigned to the cluster. | ||
This can be used as a confidence score to remove mutations with low probability of belonging to a cluster. | ||
|
||
|
||
### Running PyClone-VI | ||
|
||
TODO | ||
PyClone-VI has two sub-commands `fit` and `write-results-file`. | ||
Typical usage is to run `fit` to perform inference and then `write-results-file` to select the best fit and post-process the results. | ||
|
||
#### `fit` command | ||
|
||
The `fit` command is used for performing the inference step. | ||
It supports performing multiple restarts, the best of which will be selected by the `write-results-file` command. | ||
|
||
There are a few mandatory arguments: | ||
|
||
* `-i, --in-file` - Path where the input file is located. | ||
This file should be in the format specified above. | ||
|
||
* `-o, --out-file` - Path where the output file will be written. | ||
The output file is in HDF5 file format. | ||
Most users will execute the `write-results-file` to extract the final results from this file. | ||
|
||
There are several optional arguments: | ||
|
||
* `-c, --num-clusters` - The number of clusters to use while fitting. | ||
This should be set to a value larger than the expected number of clusters. | ||
The software will then automatically determine how many to use. | ||
Usually a value of 10-40 will work. | ||
In general this value should increase if as more samples are used. | ||
|
||
* `-d, --density` - The probability density used to model the read count data. | ||
Choices are `beta-binomial` and `binomial`. | ||
`binomial` is a common choice for sequencing data. | ||
`beta-binomial` is useful when the data is over-dispersed which has been observed frequently in sequencing data. | ||
|
||
* `-g, --num-grid-points` - Number of grid points used for approximating the posterior distribution. | ||
Higher values should be used for deeply sequenced data. | ||
The default value of 100 will likely work for most users. | ||
|
||
* `-r, --num-restarts` - Number of random restarts of variational inference. | ||
More restarts will have a higher probability of finding the optimal variational approximation. | ||
This also increases running time. | ||
Usually a value of 10-100 will work. | ||
|
||
Additional arguments can be viewed by running `pyclone-vi fit --help` | ||
|
||
#### `write-results-file` command | ||
|
||
The `write-results-file` will select the best solution found by the `fit` command and post-process the results. | ||
The output format is tab delimited file which an be imported and manipulated using tools such as R and Python. | ||
|
||
The are two mandatory arguments: | ||
|
||
* `-i, --in-file` - Path to the output file generated by the `fit` command. | ||
|
||
* `-o, --out-file` - Path where the final results will be written in tab delimited format. | ||
|
||
There is one optional argument: | ||
|
||
* `-c, --compress` - If set the output will be compressed using gzip. | ||
This is useful where a large number mutations are input to reduce the size of the results file. | ||
|
||
|
||
# License | ||
|
||
|
@@ -82,3 +178,4 @@ PyClone-VI is licensed under the GPL v3, see the LICENSE.txt file for details. | |
|
||
## 0.1.0 | ||
|
||
First release. |
Oops, something went wrong.