diff --git a/README.md b/README.md index e89f099..36b4a26 100644 --- a/README.md +++ b/README.md @@ -60,12 +60,12 @@ We provide one example dataset provided the `example/` folder, which includes: The bam file storing read alignment for sample A. * `B.bam (.bai)` The bam file storing read alignment for sample B. -* `CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz` +* `CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz` The reference panel with over 3,000 samples in 1000 Genome database. Only SNVs located in chr20: 0-2Mb were extracted in this vcf file. * `chr20_2Mb.hg38.fa (.fai)` The genome reference used for read aligments. Only seuqences in chr20:0-20Mb were extracted in this fasta file. -There is a bash script `./test/runPreprocess.sh` to run above example in the folder `test`. You need to prepare the bam file list for option `-b`. If you have multiple sample in the list file, `Monopogen` will run the joint calling which can increase the SNV calling accuracy and sensitivity. Run the test script as following: +There is a bash script `./test/runPreprocess.sh` to run above example in the folder `test`. You need to prepare the bam file list for option `-b`. If you have multiple sample in this file, you can use more CPUs by setting `-t` to make `Monopogen` faster. Run the test script as following: ``` path="XXy/Monopogen" @@ -74,66 +74,87 @@ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps python ${path}/src/Monopogen.py preProcess -b bam.lst -o out -a ${path}/apps -t 8 ``` - +After running the `preProcess` module, there will be bam files after quality controls in the folder `out/Bam/` which will be used for downstream SNV calling. +## 3.2 Germline SNV calling + +You can type the following command to get the help information. -``` -path="XX/Monopogen" -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps - -python ../src/Monopogen.py germline \ - -b ../example/chr20_2Mb.rh.filter.sort.bam \ - -y single \ - -t all \ - -a ../apps \ - -c chr20 \ - -o out \ - -d 10 \ - -p ../example/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz \ - -r ../example/chr20_2Mb.hg38.fa -m 3 -s 5 - +`python ./src/Monopogen.py germline --help` ``` +usage: Monopogen.py germline [-h] -r REGION -s + {varScan,varImpute,varPhasing,all} [-o OUT] -g + REFERENCE -p IMPUTATION_PANEL + [-m MAX_SOFTCLIPPED] -a APP_PATH [-t NTHREADS] -## Output +optional arguments: + -h, --help show this help message and exit + -r REGION, --region REGION + The genome regions for variant calling (default: None) + -s {varScan,varImpute,varPhasing,all}, --step {varScan,varImpute,varPhasing,all} + Run germline variant calling step by step (default: + all) + -o OUT, --out OUT The output director (default: None) + -g REFERENCE, --reference REFERENCE + The human genome reference used for alignment + (default: None) + -p IMPUTATION_PANEL, --imputation-panel IMPUTATION_PANEL + The population-level variant panel for variant + imputation refinement, such as 1000 Genome 3 (default: + None) + -a APP_PATH, --app-path APP_PATH + The app library paths used in the tool (default: None) + -t NTHREADS, --nthreads NTHREADS + Number of threads used for SNVs calling (default: 1) + ``` -The most important results are in the folder `out/germline`: +There is a bash script `./test/runGermline.sh` to run above example. You need to prepare the genome region file list for option `-r` with an example shown in `test/region.lst`. Each region is in one row. If you want to call the whole chromosome, you can only specficy the chromosome ID in each row. Also you can use more CPUs by setting `-t` to make `Monopogen` faster when there are many genome regions. Run the test script as following: + +``` +python ${path}/src/Monopogen.py germline \ + -a ${path}/apps -t 8 -r region.lst \ + -p ../example/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz \ + -g ../example/chr20_2Mb.hg38.fa -s all -o out -* `chr20.germline.vcf` -In this vcf file, the germline SNVs were genotyped for the study sample +``` +The `germline` module will generate the phased VCF files with name `*.phased.vcf.gz` in the folder `out/germline`. If there are multiple samples in the bam file list from `-b` option in `preProcess` module, the phased VCF files will contain genotypes from multiple samples. The output of phased genotypes are as following: + ``` ##fileformat=VCFv4.2 -##filedate=20220808 +##filedate=20230422 ##source="beagle.27Jul16.86a.jar (version 4.1)" ##INFO= -##INFO= -##INFO= +##INFO= ##FORMAT= ##FORMAT= ##FORMAT= -#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test1 test2 -chr20 60291 . G T . PASS AR2=0.00;DR2=0.00;AF=0.50 GT:DS:GP 0/1:1:0,1,0 0/1:1:0,1,0 -chr20 64506 . C T . PASS AR2=0.00;DR2=0.00;AF=0.0051 GT:DS:GP 0/0:0.01:0.99,0.01,0 0/0:0.01:0.99,0.01,0 -chr20 68303 . T C . PASS AR2=0.00;DR2=0.00;AF=0.50 GT:DS:GP 0/1:1:0,1,0 0/1:1:0,1,0 -chr20 75250 . C T . PASS AR2=0.00;DR2=0.00;AF=0.50 GT:DS:GP 0/1:1:0,1,0 0/1:1:0,1,0 -chr20 88108 . T C . PASS AR2=0.00;DR2=0.00;AF=0.87 GT:DS:GP 1/1:1.73:0,0.26,0.73 1/1:1.73:0,0.26,0.73 -chr20 159104 . T C . PASS AR2=0.00;DR2=0.00;AF=0.9960 GT:DS:GP 1/1:1.99:0,0.01,0.99 1/1:1.99:0,0.01,0.99 -chr20 175269 . T C . PASS AR2=0.00;DR2=0.00;AF=0.9903 GT:DS:GP 1/1:1.98:0,0.02,0.98 1/1:1.98:0,0.02,0.98 -chr20 186086 . G A . PASS AR2=0.00;DR2=0.00;AF=0.983 GT:DS:GP 1/1:1.97:0,0.03,0.97 1/1:1.97:0,0.03,0.97 -chr20 186183 . G A . PASS AR2=0.00;DR2=0.00;AF=0.983 GT:DS:GP 1/1:1.97:0,0.03,0.97 1/1:1.97:0,0.03,0.97 -chr20 198814 . A T . PASS AR2=0.00;DR2=0.00;AF=0.50 GT:DS:GP 0/1:1:0.05,0.89,0.06 0/1:1:0.05,0.89,0.06 -chr20 231710 . T G . PASS AR2=0.00;DR2=0.00;AF=0.70 GT:DS:GP 0/1:1.4:0,0.6,0.4 0/1:1.4:0,0.6,0.4 -chr20 240166 . T G . PASS AR2=0.00;DR2=0.00;AF=0.49 GT:DS:GP 0/1:0.97:0.03,0.97,0 0/1:0.97:0.03,0.97,0 -chr20 240377 . T G . PASS AR2=0.00;DR2=0.00;AF=0.49 GT:DS:GP 0/1:0.97:0.03,0.97,0 0/1:0.97:0.03,0.97,0 -chr20 247326 . G A . PASS AR2=0.00;DR2=0.00;AF=0.9964 GT:DS:GP 1/1:1.99:0,0.01,0.99 1/1:1.99:0,0.01,0.99 -chr20 248854 . T C . PASS AR2=0.00;DR2=0.00;AF=0.73 GT:DS:GP 0/1:1.47:0,0.53,0.47 0/1:1.47:0,0.53,0.47 -chr20 254891 . C T . PASS AR2=0.00;DR2=0.00;AF=0.51 GT:DS:GP 0/1:1.02:0,0.97,0.03 0/1:1.02:0,0.97,0.03 -chr20 255081 . G A . PASS AR2=0.00;DR2=0.00;AF=0.85 GT:DS:GP 1/1:1.7:0,0.3,0.7 1/1:1.7:0,0.3,0.7 -chr20 260997 . G A . PASS AR2=0.00;DR2=0.00;AF=0.40 GT:DS:GP 0/1:0.8:0.2,0.8,0 0/1:0.8:0.2,0.8,0 -chr20 265294 . G T . PASS AR2=0.00;DR2=0.00;AF=0.918 GT:DS:GP 1/1:1.84:0,0.16,0.84 1/1:1.84:0,0.16,0.84 +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 19D013_European_F_78 19D014_European_M_84 +chr20 68303 . T C . PASS . GT 1|0 1|1 +chr20 88108 . T C . PASS . GT 1|1 0|1 +chr20 127687 . A C . PASS . GT 1|1 1|1 +chr20 153835 . T C . PASS . GT 1|0 1|1 +chr20 154002 . C T . PASS . GT 1|1 1|1 +chr20 159104 . T C . PASS . GT 1|1 1|1 +chr20 167839 . T C . PASS . GT 1|1 1|1 +chr20 198814 . A T . PASS . GT 1|0 1|1 +chr20 231710 . T G . PASS . GT 1|1 1|1 +chr20 237210 . T C . PASS . GT 1|1 1|1 +chr20 247326 . G A . PASS . GT 1|1 1|0 +chr20 248854 . T C . PASS . GT 1|1 1|0 +chr20 255081 . G A . PASS . GT 1|1 1|0 +chr20 274893 . G C . PASS . GT 0|1 1|1 +chr20 275122 . G T . PASS . GT 0|1 1|1 +chr20 275241 . G A . PASS . GT 0|1 1|0 +chr20 275361 . C T . PASS . GT 0|1 1|0 +chr20 275932 . A G . PASS . GT 0|1 1|0 +chr20 276086 . T A . PASS . GT 0|1 1|0 ``` + + ## Run on multiple chromosomes and multiple samples Users can submit jobs with multiple chromosomes in the parallele fashion as following: