Minimal infrastructure needed for secondary analysis and QC of data generated by Illumina sequencers in a clinical NGS lab.
For an NGS lab like ours with few engineers, the less infrastructure we build, the less we need to maintain, the lower the chance of failure, and the easier it is to debug. Our strategy is to outsource the most complex pieces of necessary infrastructure to mature commercial solutions that have a large number of users with similar use cases. More users ensure exhaustive bug reporting and push vendors to prioritize fixes and new features. At the same time, we don't want to invest in "end-to-end" solutions (e.g. ICA and DDM). Instead, we must carefully modularize our infrastructure and use popular file formats, so that we can replace each module with better alternatives that become available in the future. Modular infrastructure also gives us some autonomy to solve issues ourselves without depending on vendor tech support.
- Given a Dragen server, monitor run output from sequencers and trigger secondary analyses.
- Generate and track run-level and sample-level QC metrics using MultiQC and MegaQC.
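The first goal, monitoring run output and triggering analyses, can be sketched as a simple polling helper. This is only an illustration and not necessarily how bclqc.py works: it assumes the sequencer writes a CopyComplete.txt file into each run folder once the transfer finishes (as NovaSeq instruments do), and the `.bclqc_done` marker and the `find_ready_runs` helper are hypothetical names introduced here.

```python
from pathlib import Path

# Assumption: the sequencer writes this file when a run folder is fully copied.
COPY_MARKER = "CopyComplete.txt"
# Hypothetical marker that our own pipeline would write after processing a run.
DONE_MARKER = ".bclqc_done"


def find_ready_runs(runs_root):
    """Return run folders that are fully copied but not yet processed, sorted by name."""
    ready = []
    for run_dir in sorted(Path(runs_root).iterdir()):
        if not run_dir.is_dir():
            continue
        copied = (run_dir / COPY_MARKER).exists()
        processed = (run_dir / DONE_MARKER).exists()
        if copied and not processed:
            ready.append(run_dir)
    return ready
```

A cron job or systemd timer could call `find_ready_runs` on the sequencer output folder, start the analysis for each folder it returns, and create the done marker afterwards so that a run is never processed twice.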
We'll use conda for package management, which is typically installed under ~/miniconda3. But Dragen servers do not allow execution of binaries under ~/, so create /opt/conda where we can install it instead.
sudo mkdir -m 775 /opt/conda
sudo chown $(id -un):$(id -gn) /opt/conda
curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh
bash /tmp/miniconda.sh -bup /opt/conda && rm -f /tmp/miniconda.sh
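If you want to confirm that a location permits execution before installing there, one possible check is to look at the mount flags of its filesystem. This is a sketch under an assumption: that the restriction is enforced with a `noexec` mount option. A server may enforce it by other means that this check will not detect. The `is_noexec` helper is a hypothetical name, and the check is Linux-specific.

```python
import os


def is_noexec(path):
    """Return True if the filesystem holding `path` is mounted with noexec."""
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_NOEXEC)


if __name__ == "__main__":
    # Compare the home directory with the alternative install location.
    for candidate in (os.path.expanduser("~"), "/opt/conda"):
        if os.path.exists(candidate):
            status = "noexec" if is_noexec(candidate) else "exec allowed"
            print(f"{candidate}: {status}")
```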
Add the following lines to your ~/.bashrc so that conda is available across sessions. Then log out and log back in to ensure that conda is in your $PATH.
# Add conda to PATH if found
if [ -f "/opt/conda/etc/profile.d/conda.sh" ]; then
. /opt/conda/etc/profile.d/conda.sh
fi
Update conda to the latest version and then configure it to use libmamba, which is faster at solving dependencies.
conda update -yn base conda
conda install -yn base conda-libmamba-solver
conda config --set solver libmamba
Clone this bcl-qc repo and optionally check out a specific version tag or development branch.
git clone git@github.com:ucladx/bcl-qc.git
cd bcl-qc
git checkout v1.1.0
Create a new conda environment with dependencies installed, and activate it.
conda env create -qn bclqc -f config/conda_env.yaml
conda activate bclqc
Now you can trigger demux, alignment, and MultiQC on a sequencer run output folder as follows.
python bclqc.py $PWD/reads $PWD/bams $PWD/demo_run
If you don't have your own sequencer run output folder to test this on, you can download a small demo dataset from Illumina. Visit basespace.illumina.com and log in, creating an account if needed. Then visit basespace.illumina.com/s/5yuCT6UE1e3Q and click "Accept" to add the demo project to your BaseSpace account. You can now download it programmatically using the BaseSpace Command-Line Interface (CLI). The BaseSpace CLI cannot be installed using conda, so download it manually and copy it into your conda environment.
curl -sL https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs -o $CONDA_PREFIX/bin/bs
chmod +x $CONDA_PREFIX/bin/bs
Run bs auth and follow the instructions to log in to the CLI. Then download the run output folder from Illumina as follows. Skip their SampleSheet, because we have a more minimal version in this repo under demo_run/SampleSheet_I10.csv.
bs download run --summary --id 72750678 --output demo_run --exclude 'SampleSheet*.csv'
You can test the demux pass on this demo data as follows, then inspect the output in the new folder named reads:
python bclqc.py -P demux $PWD/reads $PWD/bams $PWD/demo_run
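One way to inspect the demux output is to group the FASTQ files by sample. The sketch below is not part of bclqc.py. It assumes the FASTQs follow Illumina's usual naming pattern (SampleID_S1_L001_R1_001.fastq.gz, with the lane field optional), and `fastqs_by_sample` is a hypothetical helper name.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming: <sample>_S<n>[_L<lane>]_R<1|2>_001.fastq.gz
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_S\d+(?:_L\d{3})?_R(?P<read>[12])_001\.fastq\.gz$"
)


def fastqs_by_sample(reads_dir):
    """Map each sample ID to a sorted list of its FASTQ file names."""
    grouped = defaultdict(list)
    # Search recursively, since demux output may be nested in subfolders.
    for path in Path(reads_dir).rglob("*.fastq.gz"):
        match = FASTQ_PATTERN.match(path.name)
        if match:
            grouped[match.group("sample")].append(path.name)
    return {sample: sorted(names) for sample, names in grouped.items()}


if __name__ == "__main__":
    for sample, names in sorted(fastqs_by_sample("reads").items()):
        print(f"{sample}: {len(names)} FASTQ file(s)")
```

For paired-end runs, a sample listed with an odd number of files, or one that is missing entirely, is a quick hint that the SampleSheet or the demux step needs a closer look.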