From 76750cfa9beee7c6611161f6f6b04ec005ff3d89 Mon Sep 17 00:00:00 2001
From: Rick Wertenbroek <rick.wertenbroek@unil.ch>
Date: Mon, 11 Dec 2023 10:00:18 +0100
Subject: [PATCH] Updated README documentation for CRAM file names

---
 README.md | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 374de67..e610ebf 100644
--- a/README.md
+++ b/README.md
@@ -85,7 +85,7 @@ To illustrate with an example let's assume we want to polish a file named `chr20
 - The `samples-id` is the file ID of the sample list text file generated by the `pp-extract-applet` (e.g., `SAPPHIRE/step1_extraction/chr20/chr20.bcf.samples.txt`), pass the file ID of this file.
 
 Optional options are :
-- `--cram-path` if the CRAM path differs from `/mnt/project/Bulk/Whole genome sequences/Whole genome CRAM files` (`/mnt/project` is the mount point for the project files in the docker containers).
+- `--cram-path` if the CRAM path differs from `/mnt/project/Bulk/Whole genome sequences/Whole genome CRAM files` (`/mnt/project` is the mount point for the project files in the docker containers). **CRAM filenames are built from the sample ID and the project ID, e.g., for sample `1234567` from project `abcdef`, the CRAM file `<cram-path>/12/1234567_abcdef_0_0.cram` will be loaded**. This naming scheme is specific to the UKB RAP.
 - `--samples-list-id` if only a subset of the samples need to be rephased, this is the file ID of a subset list.
 - `--threads` the number of threads to use, because the CRAM files are accessed over the network we recommend to use more threads than cores in the instance type, by default this is set to 12 because it works well for the default `mem2_ssd1_v2_x4` instance. We recommend to set this to three times the cores of the instance used (this can be fine tuned by looking at the CPU usage in the logs).
 - `--instance` allows to chose another instance type (should not be necessary), default is `mem2_ssd1_v2_x4`.
@@ -136,3 +136,24 @@ make
 ## Local run instructions
 
 To run the programs locally there is no need to create "applets", "Docker images" etc. The programs can be run directly, you get the options by running the programs with the `-h` flag.
+
+### Note on CRAM file names for local runs
+
+The `phase_caller` program looks for CRAM files in a very specific way (because it had to load thousands of CRAMs from the UKB): CRAM filenames are built from the sample ID and the project ID, e.g., for sample `1234567` from project `abcdef`, the CRAM file `<cram-path>/12/1234567_abcdef_0_0.cram` will be loaded. Unless your CRAM files follow this path layout (symbolic links are one way to achieve this), they will not be found.
+
+A version that reads the CRAM path directly from the sample list file is available as the `phase_caller2` program on the `phase_caller_generic_no_path` branch (https://github.com/rwk-unil/pp/tree/phase_caller_generic_no_path). It is not yet merged into the main branch.
+
+With this version, each line of the sample list contains three comma-separated fields instead of only the sample ID:
+```
+<index in bin file>,<sample name>,<path to individual CRAM file>
+```
+
+For example, given a VCF/BCF with samples `HG001, HG002, HG003, HG004` that was extracted to a binary file, if you want to phase call only `HG002` and `HG004` using CRAM files that don't follow the UKB naming convention, you can use the following sample list:
+```
+1,HG002,/home/user/crams/HG002.cram
+3,HG004,/mnt/network/somewhere/HG004_sequencing_file.cram
+```
+
+(The index in the binary file follows the order of the samples in the VCF/BCF, so `HG001, HG002, HG003, HG004` have indices `0,1,2,3`.)
+
+Do not add extra spaces to the lines of the sample list.