From cbf1e9d0924144b8d136771e23b2ba8cad767db5 Mon Sep 17 00:00:00 2001
From: Emilia Hurtado <74883583+elhurtado@users.noreply.github.com>
Date: Fri, 22 Nov 2024 17:55:35 -0800
Subject: [PATCH] Update README.md

---
 README.md | 57 +++++++++++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index d886b6b..dd59395 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@ An implementation of the forest structured Chinese restaurant process with a Dir
 
 ## Overview
 1. [PhyClone Installation](#installation)
-2. [Input File Formats](#input-files)
+2. [Input File Formats](#input-file-formats)
    * [Main input format](#main-input-format)
    * [Cluster input format](#cluster-file-format)
 3. [Running PhyClone: Basic Usage](#running-phyclone)
@@ -69,18 +69,19 @@ PhyClone analysis has two possible input files:
 ---------
 ### Main input format
 
-To run a PhyClone analysis you will need to prepare an input file.
-The file should be in tab delimited tidy data frame format and have the following columns.
-
 > [!TIP]
 > There is an example file in [examples/data/mixing.tsv](examples/data/mixing.tsv)
 
+To run a PhyClone analysis you will need to prepare an input file.
+The file should be in tab delimited tidy data frame format and have the following columns:
+
 1. mutation_id - Unique identifier for the mutation. 
 This is free form but should match across all samples.
 
 > [!WARNING]
-> PhyClone will remove any mutations without entries for all detected samples.
-If you have mutations with no data in some samples set their counts to 0.
+> PhyClone will remove any mutations without entries for all provided samples. 
+> If there are mutations with no data in a subset of the samples, the correct procedure is to extract ref and alt counts for these mutations from each affected sample's associated BAM file.
+> Please refer to [this thread](https://groups.google.com/g/pyclone-user-group/c/wgXV7tq470Y) for further detail.
 
 2. sample_id - Unique identifier for the sample.
 
@@ -97,13 +98,13 @@ For autosome this will be two and male sex chromosome one.
 
 You can include the following optional columns:
 
-1. tumour_content - The tumour content (cellularity) of the sample.
+8. tumour_content - The tumour content (cellularity) of the sample.
 Default value is 1.0 if column is not present.
 > [!NOTE]
 > In principle this could be different for each mutation/sample.
 However, in most cases it should be the same for all mutations in a sample.
 
-2. error_rate - Sequencing error rate.
+9. error_rate - Sequencing error rate.
 Default value is 0.001 if column is not present. 
 
 ------------------
@@ -119,7 +120,7 @@ Default value is 0.001 if column is not present.
 > [PyClone-VI](https://github.com/Roth-Lab/pyclone-vi). Both due to its established 
 > strong performance, and its output format which can be fed directly into PhyClone *'as-is'*.
 
-The file should be in tab delimited tidy data frame format and have the following columns.
+The file should be in tab delimited tidy data frame format and have the following columns:
 
 1. mutation_id - Unique identifier for the mutation. 
 
@@ -127,14 +128,17 @@ The file should be in tab delimited tidy data frame format and have the followin
     in the [main input file](#main-input-format).
 
 2. sample_id - Unique identifier for the sample.
+   
 3. cluster_id - Cluster that the mutation has been assigned to.
 
 You can include the following optional columns:
 
 4. chrom - Chromosome on which mutation_id is found
+   
 5. ccf - Cluster cellular prevalence estimate (included in all [PyClone-VI](https://github.com/Roth-Lab/pyclone-vi) clustering results)
 
-> [!NOTE] In order to make use of PhyClone's data informed loss probability prior assignment, columns 4 and 5 are required.
+> [!NOTE]
+> To use PhyClone's data-informed loss probability prior assignment, columns 4 and 5 are required.
 
 [//]: # (4. outlier_prob - &#40;Prior&#41; probability that the cluster/mutation is an outlier.)
 
@@ -155,7 +159,7 @@ You can include the following optional columns:
 
 PhyClone analyses are broken into two parts. 
 First, sampling is performed using the `run` sub-command.
-Second, the output sampling trace from the sampling `run` can be summarised as either a point-estimate tree ([MAP](#map-point-estimate-tree) or [Consensus](#consensus-point-estimate-tree)) or topology report.
+Second, the output trace from the sampling `run` can be summarised as either a point-estimate tree ([MAP](#map-point-estimate-tree) or [Consensus](#consensus-point-estimate-tree)) or [topology report](#topology-report-and-sampled-topologies-archive).
 
 Sampling can be run as follows:
 ```
@@ -164,11 +168,11 @@ phyclone run -i INPUT.tsv -c CLUSTERS.tsv -o TRACE.pkl.gz --num-chains 4
 Which will take the [`INPUT.tsv`](#main-input-format) and (optionally) the [`CLUSTERS.tsv`](#cluster-file-format) file, as described above and write the trace file `TRACE.pkl.gz` in a compressed Python pickle format.
 
 Relevant program options:
-* `--num-chains` command controls how many independent parallel PhyClone sampling chains to use. Though the default value is set to 1, PhyClone will benefit from running with at least 4 chains, if the compute cores can be spared.
+* `--num-chains` command controls how many independent parallel PhyClone sampling chains to use. Though the default value is set to 1, PhyClone will benefit from running multiple chains; we recommend ≥4 chains, if the compute cores can be spared.
 * `-n` command can be used to control the number of iterations of sampling to perform.
 * `-b` command can be used to control the number of burn-in iterations to perform.
 * `--seed` command can be used to seed the random number generator for reproducible results.
-* 
+
 > [!NOTE]
 > Burn-in is done using a heuristic strategy of unconditional SMC.
 All samples from the burn-in are discarded as they will not target the posterior.
@@ -190,21 +194,25 @@ phyclone run --help
 As explored in the PhyClone paper, PhyClone is equipped with the ability to model mutational outliers and loss. There are two main approaches to running PhyClone with outlier modelling:
 1. Using a global outlier probability.
    * If running on un-clustered data, this is the only option available to activate outlier modelling. 
-      * Use `--outlier-prob <user-defined-loss-prior-probability>` replacing the `<>` text with a decimal value in the [0, 1] range. Barring prior knowledge, 0.001 should suffice. 
-   > [!NOTE] This option will also allow for the use of a global loss probability prior on clustered runs as well.
+      * Use `--outlier-prob` with a decimal value in the [0, 1] range. Barring prior knowledge, 0.001 should suffice. 
+> [!NOTE]
+> The `--outlier-prob` option also allows a global loss probability prior to be used on clustered runs.
 2. Assigning the outlier probability from clustered data.
-   * PhyClone is also able to split clusters into either high-loss or low-loss probability groupings. This feature requires that the clustered data include mutational chromosome assignments (which can be supplied in either the [data.tsv](#main-input-format) or [cluster.tsv](#cluster-file-format) files) and cluster cellular prevalence (CCF) measures.
+   * PhyClone is also able to assign clusters either a high or low outlier prior probability, based on the input data.
+   * This feature requires that the clustered data include mutational chromosome assignments in the `chrom` column (which can be supplied in either the [data.tsv](#main-input-format) or [cluster.tsv](#cluster-file-format) file) and cluster cellular prevalence (CCF) measures in the `ccf` column (which should be included in the [cluster.tsv](#cluster-file-format) file).
    * To activate this feature, ensure the input files are populated with the appropriate columns and include the `--assign-loss-prob` flag in the PhyClone `run` command.
-   > [!TIP] If using PyClone-VI for clustering, the CCF column will come as a part of its results. And you need only append the chromosomal positioning column `chrom` to either input files.
+> [!TIP]
+> If using PyClone-VI for clustering, the CCF column will come as part of its results, so you need only append the chromosomal positioning column `chrom` to either input file.
    
-> [!IMPORTANT] With outlier modelling active, the end result table will assign all mutations inferred to be lost or outliers to a clone with the id of `-1`.
+> [!IMPORTANT]
+> With outlier modelling active, the end result table will assign all mutations inferred to be lost or outliers to a clone with the id of `-1`.
 
 -----------------
 
 ## PhyClone Output
 
-PhyClone includes ways three ways to summarise the results from a sampling trace file.
-Two of which produce a point-estimate (a single tree), and a third which can reports on and can optionally build results for all uniquely sampled topologies:
+PhyClone includes three ways to summarise the results from a sampling trace file.
+Two of these produce a point estimate (a single tree), and the third reports on, and can optionally build results for, all uniquely sampled topologies:
 1. [MAP tree](#map-point-estimate-tree)
    * **(Recommended)** Retrieves the tree with the highest sampled joint-likelihood.
 2. [Consensus tree](#consensus-point-estimate-tree)
@@ -221,7 +229,7 @@ phyclone map -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv
 Where `TRACE.pkl.gz` is the result from a PhyClone sampling run.
 
 Expected output:
-* `TREE.nwk` the inferred MAP clone tree topology in newick format. 
+* `TREE.nwk` the inferred MAP clone tree topology in Newick format. 
 * `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.
 
 For more advanced options, run:
@@ -238,7 +246,7 @@ phyclone consensus -i TRACE.pkl.gz -t TREE.nwk -o TABLE.tsv
 Where `TRACE.pkl.gz` is the result from a PhyClone sampling run.
 
 Expected output:
-* `TREE.nwk` the inferred MAP clone tree topology in newick format. 
+* `TREE.nwk` the inferred consensus clone tree topology in Newick format. 
 * `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.
 
 For more advanced options, run:
@@ -248,8 +256,7 @@ phyclone consensus --help
 
 ### Topology Report and Sampled Topologies Archive
 
-Additionally, PhyClone is able to produce a summary report and archive file of all uniquely sampled topologies from an analysis run.
-The 
+Additionally, PhyClone is able to produce a summary report and archive file of all uniquely sampled topologies from a sampling `run`. 
 
 To build the PhyClone topology report and full sampled topologies archive, run the `topology-report` command as follows:
 ```
@@ -263,7 +270,7 @@ used to identify the tree in the accompanying topologies archive).
 * `SAMPLED_TOPOLOGIES.tar.gz`, a compressed archive where each folder represents a uniquely sampled topology, folder names align with topology identifiers found in the `TOPOLOGY_TABLE.tsv`
 
 Expected output, for each sampled topology folder in the `SAMPLED_TOPOLOGIES.tar.gz` (sampled-topologies archive):
-* `TREE.nwk` the inferred MAP clone tree topology in newick format. 
+* `TREE.nwk` the sampled clone tree topology in Newick format. 
 * `TABLE.tsv` a results table which contains: the assignment of mutations to clones, CCF (cellular prevalence) estimates, and clonal prevalence estimates per sample.
 
 Additional options: