Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
jinzhuangdou authored Apr 23, 2023
1 parent c62462d commit 4476a1d
Showing 1 changed file with 66 additions and 45 deletions.
111 changes: 66 additions & 45 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,12 +60,12 @@ We provide one example dataset provided the `example/` folder, which includes:
The bam file storing read alignment for sample A.
* `B.bam (.bai)`
The bam file storing read alignment for sample B.
* `CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz`
* `CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz`
The reference panel with over 3,000 samples in 1000 Genome database. Only SNVs located in chr20: 0-2Mb were extracted in this vcf file.
* `chr20_2Mb.hg38.fa (.fai)`
The genome reference used for read aligments. Only seuqences in chr20:0-20Mb were extracted in this fasta file.

There is a bash script `./test/runPreprocess.sh` to run above example in the folder `test`. You need to prepare the bam file list for option `-b`. If you have multiple sample in the list file, `Monopogen` will run the joint calling which can increase the SNV calling accuracy and sensitivity. Run the test script as following:
There is a bash script `./test/runPreprocess.sh` to run above example in the folder `test`. You need to prepare the bam file list for option `-b`. If you have multiple sample in this file, you can use more CPUs by setting `-t` to make `Monopogen` faster. Run the test script as following:

```
path="XXy/Monopogen"
Expand All @@ -74,66 +74,87 @@ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python ${path}/src/Monopogen.py preProcess -b bam.lst -o out -a ${path}/apps -t 8
```

After running the `preProcess` module, there will be bam files after quality controls in the folder `out/Bam/` which will be used for downstream SNV calling.

## 3.2 Germline SNV calling

You can type the following command to get the help information.

```
path="XX/Monopogen"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python ../src/Monopogen.py germline \
-b ../example/chr20_2Mb.rh.filter.sort.bam \
-y single \
-t all \
-a ../apps \
-c chr20 \
-o out \
-d 10 \
-p ../example/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz \
-r ../example/chr20_2Mb.hg38.fa -m 3 -s 5
`python ./src/Monopogen.py germline --help`

```
usage: Monopogen.py germline [-h] -r REGION -s
{varScan,varImpute,varPhasing,all} [-o OUT] -g
REFERENCE -p IMPUTATION_PANEL
[-m MAX_SOFTCLIPPED] -a APP_PATH [-t NTHREADS]
## Output
optional arguments:
-h, --help show this help message and exit
-r REGION, --region REGION
The genome regions for variant calling (default: None)
-s {varScan,varImpute,varPhasing,all}, --step {varScan,varImpute,varPhasing,all}
Run germline variant calling step by step (default:
all)
-o OUT, --out OUT The output director (default: None)
-g REFERENCE, --reference REFERENCE
The human genome reference used for alignment
(default: None)
-p IMPUTATION_PANEL, --imputation-panel IMPUTATION_PANEL
The population-level variant panel for variant
imputation refinement, such as 1000 Genome 3 (default:
None)
-a APP_PATH, --app-path APP_PATH
The app library paths used in the tool (default: None)
-t NTHREADS, --nthreads NTHREADS
Number of threads used for SNVs calling (default: 1)
```

The most important results are in the folder `out/germline`:
There is a bash script `./test/runGermline.sh` to run above example. You need to prepare the genome region file list for option `-r` with an example shown in `test/region.lst`. Each region is in one row. If you want to call the whole chromosome, you can only specficy the chromosome ID in each row. Also you can use more CPUs by setting `-t` to make `Monopogen` faster when there are many genome regions. Run the test script as following:

```
python ${path}/src/Monopogen.py germline \
-a ${path}/apps -t 8 -r region.lst \
-p ../example/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz \
-g ../example/chr20_2Mb.hg38.fa -s all -o out
* `chr20.germline.vcf`
In this vcf file, the germline SNVs were genotyped for the study sample
```
The `germline` module will generate the phased VCF files with name `*.phased.vcf.gz` in the folder `out/germline`. If there are multiple samples in the bam file list from `-b` option in `preProcess` module, the phased VCF files will contain genotypes from multiple samples. The output of phased genotypes are as following:

```
##fileformat=VCFv4.2
##filedate=20220808
##filedate=20230422
##source="beagle.27Jul16.86a.jar (version 4.1)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation betwe
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation betwee
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test1 test2
chr20 60291 . G T . PASS AR2=0.00;DR2=0.00;AF=0.50 GT:DS:GP 0/1:1:0,1,0 0/1:1:0,1,0
chr20 64506 . C T . PASS AR2=0.00;DR2=0.00;AF=0.0051 GT:DS:GP 0/0:0.01:0.99,0.01,0 0/0:0.01:0.99,0.01,0
chr20 68303 . T C . PASS AR2=0.00;DR2=0.00;AF=0.50 GT:DS:GP 0/1:1:0,1,0 0/1:1:0,1,0
chr20 75250 . C T . PASS AR2=0.00;DR2=0.00;AF=0.50 GT:DS:GP 0/1:1:0,1,0 0/1:1:0,1,0
chr20 88108 . T C . PASS AR2=0.00;DR2=0.00;AF=0.87 GT:DS:GP 1/1:1.73:0,0.26,0.73 1/1:1.73:0,0.26,0.73
chr20 159104 . T C . PASS AR2=0.00;DR2=0.00;AF=0.9960 GT:DS:GP 1/1:1.99:0,0.01,0.99 1/1:1.99:0,0.01,0.99
chr20 175269 . T C . PASS AR2=0.00;DR2=0.00;AF=0.9903 GT:DS:GP 1/1:1.98:0,0.02,0.98 1/1:1.98:0,0.02,0.98
chr20 186086 . G A . PASS AR2=0.00;DR2=0.00;AF=0.983 GT:DS:GP 1/1:1.97:0,0.03,0.97 1/1:1.97:0,0.03,0.97
chr20 186183 . G A . PASS AR2=0.00;DR2=0.00;AF=0.983 GT:DS:GP 1/1:1.97:0,0.03,0.97 1/1:1.97:0,0.03,0.97
chr20 198814 . A T . PASS AR2=0.00;DR2=0.00;AF=0.50 GT:DS:GP 0/1:1:0.05,0.89,0.06 0/1:1:0.05,0.89,0.06
chr20 231710 . T G . PASS AR2=0.00;DR2=0.00;AF=0.70 GT:DS:GP 0/1:1.4:0,0.6,0.4 0/1:1.4:0,0.6,0.4
chr20 240166 . T G . PASS AR2=0.00;DR2=0.00;AF=0.49 GT:DS:GP 0/1:0.97:0.03,0.97,0 0/1:0.97:0.03,0.97,0
chr20 240377 . T G . PASS AR2=0.00;DR2=0.00;AF=0.49 GT:DS:GP 0/1:0.97:0.03,0.97,0 0/1:0.97:0.03,0.97,0
chr20 247326 . G A . PASS AR2=0.00;DR2=0.00;AF=0.9964 GT:DS:GP 1/1:1.99:0,0.01,0.99 1/1:1.99:0,0.01,0.99
chr20 248854 . T C . PASS AR2=0.00;DR2=0.00;AF=0.73 GT:DS:GP 0/1:1.47:0,0.53,0.47 0/1:1.47:0,0.53,0.47
chr20 254891 . C T . PASS AR2=0.00;DR2=0.00;AF=0.51 GT:DS:GP 0/1:1.02:0,0.97,0.03 0/1:1.02:0,0.97,0.03
chr20 255081 . G A . PASS AR2=0.00;DR2=0.00;AF=0.85 GT:DS:GP 1/1:1.7:0,0.3,0.7 1/1:1.7:0,0.3,0.7
chr20 260997 . G A . PASS AR2=0.00;DR2=0.00;AF=0.40 GT:DS:GP 0/1:0.8:0.2,0.8,0 0/1:0.8:0.2,0.8,0
chr20 265294 . G T . PASS AR2=0.00;DR2=0.00;AF=0.918 GT:DS:GP 1/1:1.84:0,0.16,0.84 1/1:1.84:0,0.16,0.84
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 19D013_European_F_78 19D014_European_M_84
chr20 68303 . T C . PASS . GT 1|0 1|1
chr20 88108 . T C . PASS . GT 1|1 0|1
chr20 127687 . A C . PASS . GT 1|1 1|1
chr20 153835 . T C . PASS . GT 1|0 1|1
chr20 154002 . C T . PASS . GT 1|1 1|1
chr20 159104 . T C . PASS . GT 1|1 1|1
chr20 167839 . T C . PASS . GT 1|1 1|1
chr20 198814 . A T . PASS . GT 1|0 1|1
chr20 231710 . T G . PASS . GT 1|1 1|1
chr20 237210 . T C . PASS . GT 1|1 1|1
chr20 247326 . G A . PASS . GT 1|1 1|0
chr20 248854 . T C . PASS . GT 1|1 1|0
chr20 255081 . G A . PASS . GT 1|1 1|0
chr20 274893 . G C . PASS . GT 0|1 1|1
chr20 275122 . G T . PASS . GT 0|1 1|1
chr20 275241 . G A . PASS . GT 0|1 1|0
chr20 275361 . C T . PASS . GT 0|1 1|0
chr20 275932 . A G . PASS . GT 0|1 1|0
chr20 276086 . T A . PASS . GT 0|1 1|0
```


## Run on multiple chromosomes and multiple samples

Users can submit jobs with multiple chromosomes in the parallele fashion as following:
Expand Down

0 comments on commit 4476a1d

Please sign in to comment.