Skip to content

Latest commit

 

History

History
502 lines (349 loc) · 30 KB

README.md

File metadata and controls

502 lines (349 loc) · 30 KB

Ilus

DOI

English | 简体中文

Ilus (/'i:loʊs/) is a lightweight, scalable, handy semi-automated variant calling pipeline generator for Whole-genome sequencing (WGS) and Whole exom sequencing (WES) analysis.

Introduction

ilus-pipe-chart

ilus is a semi-automated pipeline generator designed for WGS/WES analysis. While it facilitates the creation of analysis pipelines, it does not excute the jobs automatically. Instead, users are required to manually submit the generated jobs. This approach allows for flexibility in managing the workflow, but it means that the overall processing does not depend on ilus for execution,that's why we called it as a semi-automated tools. This design can be beneficial for users who prefer to maintain control over their job submissions and processing.

$ ilus -h
usage: ilus [-h] [-v] {WGS,capseq,genotype-joint-calling,VQSR,split-jobs,check-jobs} ...

ilus (Version = 2.0.0): A WGS/WES analysis pipeline generator.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show the version of ilus and exit.

ilus commands:
  {WGS,capseq,genotype-joint-calling,VQSR,split-jobs,check-jobs}
    WGS                 Create pipeline scripts for WGS (from FASTQ to genotype VCF).
    capseq              Create pipeline scripts for capture sequencing data (from FASTQ to genotype VCF). Whole-exome sequencing (WES) belong to capture sequencing, which only
                        captures the coding regions of the genome.
    genotype-joint-calling
                        Genotype from GVCFs.
    VQSR                VQSR
    split-jobs          Split the whole shell into multiple jobs.
    check-jobs          Check the jobs have finished or not.

Main processes

ilus has 4 main modelues designed for genomics analysis:

1. WGS analysis module

Based on the GATK Best Practice,this mudule utilized bwa-mem + GATK for building a comprehensive analysis process. It includes:

  • Alignment of reads to a reference genomeusing bwa mem.
  • Base quality score recalibration (BQSR) using GATK BaseRecalibrator and GATK ApplyBQSR.
  • Create alignment metrics via samtools stats.
  • Coverage calculation with bedtools cvg.
  • Variant calling or generating gvcf file for samples by GATK HaplotypeCaller.
  • Joint-calling of variants using GATK GenotyeGVCFs.
  • Variant quality score recalibration (VQSR) with GATK VariantRecalibrator and GATK ApplyVQSR.
  • Annotate variants by using Ensembl VEP.

These processes also works for WES (capseq) analysis.

By default, the sequencing data you input is clean data.

2. capseq module

This module creates pipeline for capture sequencing data (from FASTQ to genotype VCF). Whole-exome sequencing (WES) is a specific type of capture sequencing that focuses exclusively on the coding regions of the genome. This module streamlines the entire process, ensuring that users can efficiently analyze exome data and convert it into actionable genomic insights.

3. genotype-joint-calling module

This module operates independently from the WGS and capseq modules, allowing users to call genotypes directly from gvcf files. It is particularly useful for analyzing WGS/WES data in batches. Users can generate gvcf files in bulk, compile a list, and use this module for subsequent processing, enhancing flexibility in the analysis workflow.

4. VQSR

This module is distinct from the WGS analysis and focuses on quality control of mutation results.


ilus does not directly execute tasks for two main reasons:

  • First, avoid making complexity in task scheduling. ilus is designed to remain lightweight and focused on its core functionality. Different computing clusters (local or cloud) have variaous job scheduling methods, which, if incorporated into ilus, could complicate its usability. This could lead to confusion for users and detract from ilus's primary purpose of generation tailored process scripts based on user input and configuration files. Currently, task must be submitted manually, although future plans may include external plugins for automation. I hope it can be used as a framework program to generate process scripts that meet your analysis needs strictly based on your input data and configuration file information.

  • Second, Enhancing flexibility and independence. The processes generated by ilus are characteristic, that is, they are completely independent from ilus and they will no longer rely on any functions of ilus in the execution process. Each command in the generated scripts could be run independently, allowing users to split larger scripts into sub-scripts base on their specific computational resource and sample size. This modularity facilitates efficient task delivery and management, which is crucial for parallel processing.

Task Splitting and Monitoring

When handling multiple samples, for instance, if a script contains 10 independent commands, users can split it into smaller sub-scripts and submit them manually. The ilus split-jobs can assist in this process, although it is tailored for SLURM systems.

ilus also incorporates task completion monitoring by adding identifiable markers to each task. This feature simplifies the tracking of task statuses, particularly useful when analyzing large datasets. A dedicated program within ilus (ilus check-jobs) allows users to efficiently check the completion status of tasks, minimizing manual checks.

Overall, while ilus refrains from automating task execution, its design promotes user flexibility and effective process management.

Installation

ilus is based on Python and supports both Python 2.7+ and Python 3.X. The stable version of the code has been released to PyPI. So to use ilus, install it directly through pip the Python package management tool:

$ pip install ilus

In addition to the main program ilus, this command will automatically install other Python packages that ilus depends on. After the installation is complete, execute ilus on the command line. If you can see something similar to the following, then the installation is successful.

$ ilus
usage: ilus [-h] {WGS,genotype-joint-calling,VQSR} ...
ilus: error: too few arguments

Quick Start

By executing ilus --help you can see three function modules : WGS, capseq, genotype-joint-calling and VQSR.

$ ilus -h
usage: ilus [-h] [-v] {WGS,capseq,genotype-joint-calling,VQSR,split-jobs,check-jobs} ...

ilus (Version = 2.0.0): A WGS/WES analysis pipeline generator.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show the version of ilus and exit.

ilus commands:
  {WGS,capseq,genotype-joint-calling,VQSR,split-jobs,check-jobs}
    WGS                 Create pipeline scripts for WGS (from FASTQ to genotype VCF).
    capseq              Create pipeline scripts for capture sequencing data (from FASTQ to genotype VCF). Whole-exome sequencing (WES) belong to capture sequencing, which only
                        captures the coding regions of the genome.
    genotype-joint-calling
                        Genotype from GVCFs.
    VQSR                VQSR
    split-jobs          Split the whole shell into multiple jobs.
    check-jobs          Check the jobs have finished or not.


That's how you can use ilus

Below, we will introduce how to use these three modules by examples.

WGS

The run scripts of the WGS Analysis Pipeline are generated by ilus WGS and are used as follows:

$ ilus WGS --help
usage: ilus WGS [-h] [-n PROJECT_NAME] -C SYSCONF -O OUTDIR [-f] [--use-sentieon] -I FASTQLIST [--clean-raw-data]
                [--delete-clean-fastq] [-c] [-P WGS_PROCESSES] [-dr] [--interval INTERVAL]

optional arguments:
  -h, --help            show this help message and exit
  -n PROJECT_NAME, --name PROJECT_NAME
                        Name of the project. (Default: test)
  -C SYSCONF, --conf SYSCONF
                        YAML configuration file specifying system details.
  -O OUTDIR, --outdir OUTDIR
                        Output directory for results.
  -f, --force-overwrite
                        Force overwrite existing shell scripts and folders.
  --use-sentieon        Use sentieon (doc: https://support.sentieon.com/manual) to create analysis pipeline.
  -I FASTQLIST, --fastqlist FASTQLIST
                        Input the list of FASTQ files.
  --clean-raw-data      Set this option to clean raw sequencing data (fastq).
  --delete-clean-fastq  Set this option to delete clean fastq after aligning all reads to reference.
  -c, --cram            Convert BAM to CRAM after BQSR and save alignment file storage.
  -P WGS_PROCESSES, --process WGS_PROCESSES
                        Specify one or more processes (separated by comma) of WGS pipeline. Possible values:
                        align,markdup,BQSR,gvcf,combineGVCFs,genotype,VQSR
  -dr, --dry-run        Dry run the pipeline for testing.
  --interval INTERVAL   Interval strings (separate by comma) or a file (BED/Picard,1-based) that will be used in
                        variants calling. e.g: '--interval chr1:1-2,chr2,chr3:4-5' or '--interval
                        interval_file.bed'.

-C, -L and -O are required parameters, and the rest are optional parameters according to actual needs. The -O parameter is the output directory, if the directory does not exist, ilus will be created automatically. The most important are -C and -L parameters, the former is the configuration file of ilus, without this file ilus cannot generate the analysis process correctly, so it is very important; the latter is Input file, The format of this file has fixed requirements, a total of 5 columns, each column is the necessary information for the process.

Below, I will explain the format of these two files respectively:

The first is the configuration file. We need to write the program path used in the analysis process, the GATK bundle file path, the path of the reference sequence and the parameters corresponding to each key step in the file.

It should be noted that the prefix of the comparison index file of bwa mem should be the same as the prefix of {resources}{reference} of the configuration file, and placed in the same folder. as follows:

/path/human_reference/GRCh38/
|-- human_GRCh38.fa
|-- human_GRCh38.dict
|-- human_GRCh38.fa.amb
|-- human_GRCh38.fa.ann
|-- human_GRCh38.fa.bwt
|-- human_GRCh38.fa.fai
|-- human_GRCh38.fa.pac
`-- human_GRCh38.fa.sa

The configuration file should be written in Yaml syntax, here I provide a configuration file template.

In the configuration file, bwa, samtools, bcftools, bedtools, gatk, bgzip and tabix are all necessary bioinformatics software, which need to be installed in advance, and then fill in the path to in the corresponding parameters (as shown in the template). verifyBamID2 is only used to calculate whether there is pollution in the sample, it is not a required parameter, if your configuration file does not have this parameter, it means The process does not calculate the contamination of the sample. If there is, you have to install and download the resource data supporting it. I also tell you where to download the relevant data in the template.

Note that the variant_calling_interval parameter in the configuration file. This is a parameter specifically used to specify the variation detection interval. For example, in the example of the above configuration file, I gave 25 chromosomes from chr1 to chrM, which means to tell the process to perform mutation detection on these 25 chromosomes . If you list only one chromosome in this parameter, or only give a chromosome interval, such as chr1:1-10000, then ilus will also only perform variant detection in the interval you give.

This is a very flexible and useful parameter. The variant_calling_interval interval can be specified arbitrarily. In addition to the assignment method given in my example, you can also assign the interval file path to this parameter. We know that many steps of WGS and WES are exactly the same, and there are only differences in the interval of variant detection ------ WES data is not necessary and cannot be used for variant detection on the whole chromosome, only in the exon capture area.

You only need to put the file of the exon capture area,which in the .bed file format, the example is as follows:

chr1    63697   63697
chr1    101158  101158
chr1    103241 103241
chr1    104108  104108
chr1    185336 185336
chr1    261495  261495
chr1    598862 598862
chr1    601606  601606
chr1    700596 700596
chr1    725086  725086

You can also refer to GATK's instructions, here you do not need to manually split them into one by one. It is enough to assign the path of the file to this parameter, and then the process becomes the WES analysis process. This is why ilus is called a WGS and WES analysis pipeline generator.

Also, ilus required public datasets are: gatk bundle and genome reference sequences.

[Note] If the sample size of your project is less than 10, then GATK will not calculate the value of InbreedingCoeff. In this case, vqsr_options in the configuration file does not need to set -an InbreedingCoeff, you can remove it.

Next is the input file specified by the -L parameter. The file contains all the sequencing data information necessary for the WGS/WES analysis process. This file needs to be prepared by you, the format of each column of the file as follows:

  • [1] SAMPLE,Sample name

  • [2] RGID,Read Group,when using bwa mem 's -R parameter

  • [3] FASTQ1,Fastq1's file directory

  • [4] FASTQ2,Fastq2's file directory,if it's Single Endsequencing,the column is replaced with a space

  • [5] LANE,fastq‘s lane id

    in these file info, RGID is the most error-prone, and RGID must be set correctly (refer to the following example for the correct way of writing), otherwise the analysis process will go wrong.

In addition, if the sequencing volume of a sample is relatively large, resulting in a sample with multiple lane sequencing data, or the same lane data is split into multiple sub-files, at this time, you do not need to manually analyze these lane data. To merge fastq data, you only need to write the input file according to the sequencing information.

Those data belonging to the same sample, even if the input fastq has been split into thousands of copies, the process will automatically merge after each sub-data is compared and sorted. Below I give an example of an input file, in which there is a case of the data splitting output of the sample:

#SAMPLE RGID    FASTQ1  FASTQ2  LANE
HG002   "@RG\tID:CL100076190_L01\tPL:COMPLETE\tPU:CL100076190_L01_HG002\tLB:CL100076190_L01\tSM:HG002"  /path/HG002_NA24385_son/BGISEQ500/BGISEQ500_PCRfree_NA24385_CL100076190_L01_read_1.clean.fq.gz  /path/HG002_NA24385_son/BGISEQ500/BGISEQ500_PCRfree_NA24385_CL100076190_L01_read_2.clean.fq.gz  CL100076190_L01
HG002   "@RG\tID:CL100076190_L02\tPL:COMPLETE\tPU:CL100076190_L02_HG002\tLB:CL100076190_L02\tSM:HG002"  /path/HG002_NA24385_son/BGISEQ500/BGISEQ500_PCRfree_NA24385_CL100076190_L02_read_1.clean.fq.gz  /path/HG002_NA24385_son/BGISEQ500/BGISEQ500_PCRfree_NA24385_CL100076190_L02_read_2.clean.fq.gz  CL100076190_L02
HG003   "@RG\tID:CL100076246_L01\tPL:COMPLETE\tPU:CL100076246_L01_HG003\tLB:CL100076246_L01\tSM:HG003"  /path/HG003_NA24149_father/BGISEQ500/BGISEQ500_PCRfree_NA24149_CL100076246_L01_read_1.clean.fq.gz   /path/HG003_NA24149_father/BGISEQ500/BGISEQ500_PCRfree_NA24149_CL100076246_L01_read_2.clean.fq.gz   CL100076246_L01
HG003   "@RG\tID:CL100076246_L02\tPL:COMPLETE\tPU:CL100076246_L02_HG003\tLB:CL100076246_L02\tSM:HG003"  /path/HG003_NA24149_father/BGISEQ500/BGISEQ500_PCRfree_NA24149_CL100076246_L02_read_1.clean.fq.gz   /path/HG003_NA24149_father/BGISEQ500/BGISEQ500_PCRfree_NA24149_CL100076246_L02_read_2.clean.fq.gz   CL100076246_L02
HG004   "@RG\tID:CL100076266_L01\tPL:COMPLETE\tPU:CL100076266_L01_HG004\tLB:CL100076266_L01\tSM:HG004"  /path/HG004_NA24143_mother/BGISEQ500/BGISEQ500_PCRfree_NA24143_CL100076266_L01_read_1.clean.fq.gz   /path/HG004_NA24143_mother/BGISEQ500/BGISEQ500_PCRfree_NA24143_CL100076266_L01_read_2.clean.fq.gz   CL100076266_L01
HG004   "@RG\tID:CL100076266_L02\tPL:COMPLETE\tPU:CL100076266_L02_HG004\tLB:CL100076266_L02\tSM:HG004"  /path/HG004_NA24143_mother/BGISEQ500/BGISEQ500_PCRfree_NA24143_CL100076266_L02_read_1.clean.fq.gz   /path/HG004_NA24143_mother/BGISEQ500/BGISEQ500_PCRfree_NA24143_CL100076266_L02_read_2.clean.fq.gz   CL100076266_L02
HG005   "@RG\tID:CL100076244_L01\tPL:COMPLETE\tPU:CL100076244_L01_HG005\tLB:CL100076244_L01\tSM:HG005"  /path/HG005_NA24631_son/BGISEQ500/BGISEQ500_PCRfree_NA24631_CL100076244_L01_read_1.clean.fq.gz  /path/HG005_NA24631_son/BGISEQ500/BGISEQ500_PCRfree_NA24631_CL100076244_L01_read_2.clean.fq.gz  CL100076244_L01

The following example illustrates the use and process structure of ilus WGS.

Example 1: Generating a WGS analysis pipeline from scratch

$ ilus WGS -c -n my_wgs -C ilus_sys.yaml -L input.list -O output/

This command means that the project (-n) my_wgs generates a WGS analysis process in the output directory output based on the configuration file (-C) ilus_sys.yaml and the input data (-L) input.list . At the same time, the process automatically converts BAM to (-c)CRAM format after completing the analysis. CRAM is more space efficient than BAM, if the -c parameter is not set, the original BAM file is kept.

After the above command is successfully executed, there are a total of 4 folders in the output directory output (as follows):

00.shell/
01.alignment/
02.gvcf/
03.genotype/

They are used to store different types of data generated by the process, including:

  • 00.shell the shell script collection directory;
  • 01.alignment stores the alignment results in units of samples;
  • 02.gvcf stores the gvcf results of each sample;
  • 03.genotype holds the result of the last variant detection.

There are various execution scripts and log directories of the analysis process in the 00.shell directory:

/00.shell
├── loginfo
│   ├── 01.alignment
│   ├── 01.alignment.e.log.list
│   ├── 01.alignment.o.log.list
│   ├── 02.markdup
│   ├── 02.markdup.e.log.list
│   ├── 02.markdup.o.log.list
│   ├── 03.BQSR
│   ├── 03.BQSR.e.log.list
│   ├── 03.BQSR.o.log.list
│   ├── 04.gvcf
│   ├── 04.gvcf.e.log.list
│   ├── 04.gvcf.o.log.list
│   ├── 05.genotype
│   ├── 05.genotype.e.log.list
│   ├── 05.genotype.o.log.list
│   ├── 06.VQSR
│   ├── 06.VQSR.e.log.list
│   └── 06.VQSR.o.log.list
├── my_wgs.step1.bwa.sh
├── my_wgs.step2.markdup.sh
├── my_wgs.step3.bqsr.sh
├── my_wgs.step4.gvcf.sh
├── my_wgs.step5.genotype.sh
└── my_wgs.step6.VQSR.sh

When submit the task running process, execute it in sequence from step1 to step6. The loginfo/ folder records the running status of all steps of each sample. You can check the .o.log.list log file of each task to get the mark of whether each sample ended successfully.

If successful, you can see a marker like [xxxx] xxxx done at the end of the log file. You can easily know which samples (steps) have been well done, and which ones haven't. This script will help you to output all those unfinished tasks, which is convenient to check for problems and re-execute this part of unfinished tasks. check_jobs_status The usage is as follows:

$ python check_jobs_status.py loginfo/01.alignment.o.log.list > bwa.unfinish.list

if all tasks have done, this list is empty,and print ** All Jobs done **.

How to submit batch jobs

The process scripts generated by ilus have a feature, that is, they are completely independent of ilus and will no longer depend on any functions of ilus. At this time, you even use ilus It doesn't matter if you uninstall or delete it. And each line in the execution script (shell) can be run independently.

The advantage of this is that you can split the script into several sub-scripts according to the characteristics of the cluster you are using (if you have few samples or insufficient cluster resources, you can not split them), and then deliver the tasks independently, so it will be more efficient, this is also the only way to submit ilus tasks in parallel.

As for how many subscripts to be split into, it depends on your own needs. For example, if you have a total of 10 samples, there are a total of 10 comparison commands in the xxx.step1.bwa.sh mapping script in the first step. A line is a bwa task of a sample. Since these 10 commands are independent of each other, you can split the step into 10 (or less) subscripts, and then manually post these 10 tasks. As for how to split a complete execution script into multiple ones, you can either write your own program or use the yhbatch_slurm_jobs program provided by ilus , but please note that the program I provide here is based on the slurm system, which may not meet your needs, and you can modify it, if you think This program does not meet your needs.

Example 2: Generate only one/some steps in the WGS process

Sometimes, we do not intend (or do not have to) to complete the WGS process from start to the end. For example, we only want to perform the steps from fastq comparison to gvcf generation, and do not want to perform genotype and VQSR.The -P parameter of ilus can does this:

$ ilus WGS -c -n my_wgs -C ilus_sys.yaml -L input.list -P align,markdup,BQSR,gvcf -O ./output

This only generates execution scripts from bwa to gvcf, which is useful for projects that need to do analysis in batches. Moreover, the output results of ilus are distinguished by sample units, so in the same output directory, as long as the sample numbers are different, the data of different batches will not overlap each other.

In addition, the -P parameter has another purpose, that is, if a WGS step is wrong and needs to be adjusted, and then re-update the corresponding step, then you can use -P to rerun specific steps. For example, if I need to regenerate the running script of the BQSR step, I can do this:

$ ilus WGS -c -n my_wgs -C ilus_sys.yaml -L input.list -P BQSR -O ./output

However, it should be noted that ilus will only keep the total BAM/CRAM file after BQSR for each sample in order to save the project's storage space consumption. Therefore, if you want to re-run BQSR you need to make sure that the BAM file from the previous step of BQSR (ie, markdup) has not been deleted.

If you have been using ilus then you don't need to worry about this problem, because ilus has "atomic properties" when executing tasks, that is to say, only when all processes in the step are successfully completed will the Delete files that are completely unnecessary. Therefore, if the BQSR step does not finish normally, the BAM file of the previous markdup will be preserved.

-P The analysis module specified by the parameter must belong to one or more of「align,markdup,BQSR,gvcf,genotype,VQSR」and be separated by commas.

capseq

$ ilus capseq -h
usage: ilus capseq [-h] [-n PROJECT_NAME] -C SYSCONF -O OUTDIR [-f] [--use-sentieon] -I FASTQLIST [--clean-raw-data]
                   [--delete-clean-fastq] [-c] [-P WGS_PROCESSES] [-dr] --capture-interval INTERVAL

optional arguments:
  -h, --help            show this help message and exit
  -n PROJECT_NAME, --name PROJECT_NAME
                        Name of the project. (Default: test)
  -C SYSCONF, --conf SYSCONF
                        YAML configuration file specifying system details.
  -O OUTDIR, --outdir OUTDIR
                        Output directory for results.
  -f, --force-overwrite
                        Force overwrite existing shell scripts and folders.
  --use-sentieon        Use sentieon (doc: https://support.sentieon.com/manual) to create analysis pipeline.
  -I FASTQLIST, --fastqlist FASTQLIST
                        Input the list of FASTQ files.
  --clean-raw-data      Set this option to clean raw sequencing data (fastq).
  --delete-clean-fastq  Set this option to delete clean fastq after aligning all reads to reference.
  -c, --cram            Convert BAM to CRAM after BQSR and save alignment file storage.
  -P WGS_PROCESSES, --process WGS_PROCESSES
                        Specify one or more processes (separated by comma) of WGS pipeline. Possible values:
                        align,markdup,BQSR,gvcf,combineGVCFs,genotype,VQSR
  -dr, --dry-run        Dry run the pipeline for testing.
  --capture-interval INTERVAL
                        Capture intervals, must be a file path: (BED/Picard) file

The capseq module of ilus includes an additional parameter --capture-interval compared to the WGS Analysis Module. This parameter specifically allows users to define the intervals for exonic regions that are targeted during the capture sequencing process. Other than this addition, the overall structure and functionality of the capture sequencing module remain similar to those of the WGS module.

genotype-joint-calling

If we already have gvcf data for each sample, and now use these gvcf to do multi-sample joint variants calling Joint-calling, then you can use genotype-joint-calling to implement. The specific usage is as follows:

$ ilus genotype-joint-calling --help
usage: ilus genotype-joint-calling [-h] [-n PROJECT_NAME] -C SYSCONF -O OUTDIR [-f] [--use-sentieon] -L GVCFLIST
                                   [--as_pipe_shell_order]

optional arguments:
  -h, --help            show this help message and exit
  -n PROJECT_NAME, --name PROJECT_NAME
                        Name of the project. (Default: test)
  -C SYSCONF, --conf SYSCONF
                        YAML configuration file specifying system details.
  -O OUTDIR, --outdir OUTDIR
                        Output directory for results.
  -f, --force-overwrite
                        Force overwrite existing shell scripts and folders.
  --use-sentieon        Use sentieon (doc: https://support.sentieon.com/manual) to create analysis pipeline.
  -L GVCFLIST, --gvcflist GVCFLIST
                        List of GVCF files. One gvcf_file per-row and the format should looks like: [interval
                        gvcf_file_path]. Column [1] is a symbol which could represent the genome region of the
                        gvcf_file and column [2] should be the path.
  --as_pipe_shell_order
                        Keep the shell name as the order of `WGS`.

-L is the input parameter of ilus genotype-joint-calling, which input a gvcf list file, which consists of two columns, the first column is the corresponding gvcf file The interval or chromosome number, the second column is the path of the gvcf file. Currently ilus requires the gvcf of each sample to be separated by major chromosomes (1-22, X, Y, M), for example:

$ ilus genotype-joint-calling -n my_project -C ilus_sys.yaml -L gvcf.list -O genotype --as_pipe_shell_order

The format of gvcf.list:

chr1    /path/sample1.chr1.g.vcf.gz
chr1    /paht/sample2.chr1.g.vcf.gz
chr2    /path/sample1.chr2.g.vcf.gz
chr2    /path/sample2.chr2.g.vcf.gz
...
chrM    /path/sample1.chrM.g.vcf.gz
chrM    /path/sample2.chrM.g.vcf.gz

In this example gvcf.list has only two samples. The parameter --as_pipe_shell_order is optional(the default is not added), its only function is to output the name of the executed script according to the ilus WGS process, maintaining the same order and the same as the WGS process The output directory structure of .

VQSR

This function is only used to generate execution scripts based on GATK VQSR. If we already have the final mutation detection (VCF format) results, and now we just want to use GATK VQSR to do quality control of this mutation data, then you can use this module, the usage is similar to genotype-joint-calling, as follows:

$ ilus VQSR --help
usage: ilus VQSR [-h] [-n PROJECT_NAME] -C SYSCONF -O OUTDIR [-f] [--use-sentieon] -L VCFLIST
                 [--as_pipe_shell_order]

optional arguments:
  -h, --help            show this help message and exit
  -n PROJECT_NAME, --name PROJECT_NAME
                        Name of the project. (Default: test)
  -C SYSCONF, --conf SYSCONF
                        YAML configuration file specifying system details.
  -O OUTDIR, --outdir OUTDIR
                        Output directory for results.
  -f, --force-overwrite
                        Force overwrite existing shell scripts and folders.
  --use-sentieon        Use sentieon (doc: https://support.sentieon.com/manual) to create analysis pipeline.
  -L VCFLIST, --vcflist VCFLIST
                        VCFs file list. One file per-row.
  --as_pipe_shell_order
                        Keep the shell name as the order of `WGS`.

Different from genotype-joint-callingilus VQSR input VCF list,and each line is a VCF path,for example:

/path/chr1.vcf.gz
/path/chr2.vcf.gz
...
/path/chrM.vcf.gz

Other parameters is the same as genotype-joint-calling. Also, the vcf in the file list does not need to be manually merged in advance, ilus VQSR will do that. A complete example is provided below:

$ ilus VQSR -C ilus_sys.yaml -L vcf.list -O genotype --as_pipe_shell_order

Citation

Huang, S., Liu, S., Huang, M. et al. The Born in Guangzhou Cohort Study enables generational genetic discoveries. Nature 626, 565–573 (2024). doi:10.1038/s41586-023-06988-4

-- The End --