Skip to content

Commit

Permalink
Updating README and adding examples
Browse files Browse the repository at this point in the history
  • Loading branch information
Andrew Roth committed Aug 31, 2020
1 parent 986f8e1 commit 0e5639c
Show file tree
Hide file tree
Showing 3 changed files with 7,863 additions and 8 deletions.
113 changes: 105 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Overview

PyClone-VI is fast method for inferring clonal population structure.
PyClone-VI is a fast method for inferring clonal population structure.


## Installation
Expand All @@ -21,10 +21,9 @@ conda create -n pyclone-vi --file requirements.txt --yes
```
source activate pyclone-vi
```
> Note: You will have to do this whenver you open a new terminal and want to run PyClone-VI.
> Note: You will have to do this whenever you open a new terminal and want to run PyClone-VI.
4. Install PyClone-VI

```
pip install git+ssh://[email protected]/aroth85/pyclone-vi.git
```
Expand All @@ -36,13 +35,33 @@ pyclone-vi --help

## Usage

## Quick start

1. Activate the `conda` environment
```
source activate pyclone-vi
```

2. Fit the model to the data here we use the TRACERx file provided in the `examples` folder.
This assumes you are running in the base directory of the git repo.
Here we run allowing for up to 40 clusters (clones), using the Beta-Binomial distribution and performing 10 random restarts.
This should take a under five minutes.
```
pyclone-vi fit -i examples/tracerx.tsv -o tracerx.h5 -c 40 -d beta-binomial -r 100
```

3. Next we output the final results from the best random restart.
```
pyclone-vi write-results-file -i tracerx.h5 -o tracerx.tsv
```

### Input format

To run a PyClone-VI analysis you will need to prepare an input file.
The file should be in tab delimited format and have the following columns.
> Note: There is an example file in examples/data/mixing.tsv
> Note: There is an example file in examples/data/tracerx.tsv
1. mutation_id - Unique identifier for the mutation.
1. mutation_id - Unique identifier for the mutation.
This is free form but should match across all samples.
> Note: PyClone-VI will remove any mutations without entries for all detected samples.
If you have mutations with no data in some samples set their ref/alt counts to 0 for the corresponding sample.
Expand All @@ -63,16 +82,93 @@ For autosome this will be two and male sex chromosomes one.
You can include the following optional columns.

1. tumour_content - The tumour content (cellularity) of the sample.
Default value is 1.0 if column is not present.
Default value is 1.0 if column is not present.
> Note: In principle this could be different for each mutations/sample.
However it most cases it should be the same for all mutations in a sample.

2. error_rate - Sequencing error rate.
Default value is 0.001 if column is not present.
Default value is 0.001 if column is not present.
> Note: Most users will not need to change this value.
### Output format

The results file output by `write-results-file` is in tab delimited format.
There six columns:

1. mutation_id - Mutation identifier as used in the input file.

2. sample_id - Unique identifier for the sample as used in the input file.

3. cluster_id - Most probable cluster or clone the mutation was assigned to.

4. cellular_prevalence - Proportion of malignant cells with the mutation in the sample.
This is also called cancer cell fraction (CCF) in the literature.

5. cellular_prevalence_std - Standard error of the cellular_prevalence estimate.

6. cluster_assignment_prob - Posterior probability the mutation is assigned to the cluster.
This can be used as a confidence score to remove mutations with low probability of belonging to a cluster.


### Running PyClone-VI

TODO
PyClone-VI has two sub-commands `fit` and `write-results-file`.
Typical usage is to run `fit` to perform inference and then `write-results-file` to select the best fit and post-process the results.

#### `fit` command

The `fit` command is used for performing the inference step.
It supports performing multiple restarts, the best of which will be selected by the `write-results-file` command.

There are a few mandatory arguments:

* `-i, --in-file` - Path where the input file is located.
This file should be in the format specified above.

* `-o, --out-file` - Path where the output file will be written.
The output file is in HDF5 file format.
Most users will execute the `write-results-file` to extract the final results from this file.

There are several optional arguments:

* `-c, --num-clusters` - The number of clusters to use while fitting.
This should be set to a value larger than the expected number of clusters.
The software will then automatically determine how many to use.
Usually a value of 10-40 will work.
In general this value should increase if as more samples are used.

* `-d, --density` - The probability density used to model the read count data.
Choices are `beta-binomial` and `binomial`.
`binomial` is a common choice for sequencing data.
`beta-binomial` is useful when the data is over-dispersed which has been observed frequently in sequencing data.

* `-g, --num-grid-points` - Number of grid points used for approximating the posterior distribution.
Higher values should be used for deeply sequenced data.
The default value of 100 will likely work for most users.

* `-r, --num-restarts` - Number of random restarts of variational inference.
More restarts will have a higher probability of finding the optimal variational approximation.
This also increases running time.
Usually a value of 10-100 will work.

Additional arguments can be viewed by running `pyclone-vi fit --help`

#### `write-results-file` command

The `write-results-file` will select the best solution found by the `fit` command and post-process the results.
The output format is tab delimited file which an be imported and manipulated using tools such as R and Python.

The are two mandatory arguments:

* `-i, --in-file` - Path to the output file generated by the `fit` command.

* `-o, --out-file` - Path where the final results will be written in tab delimited format.

There is one optional argument:

* `-c, --compress` - If set the output will be compressed using gzip.
This is useful where a large number mutations are input to reduce the size of the results file.


# License

Expand All @@ -82,3 +178,4 @@ PyClone-VI is licensed under the GPL v3, see the LICENSE.txt file for details.

## 0.1.0

First release.
Loading

0 comments on commit 0e5639c

Please sign in to comment.