Reduce BAM file size by removing tags and quality information (>50 %) and/or downsampling.
BAM files often contain excess information, such as quality values and attributes, that is not needed for display in applications such as IGV. This can be especially cumbersome for multi-lane RNA-sequencing runs. So far this package includes three simple scripts:
bam_reducer
- This script eliminates quality values and attribute tags as requested. Optionally, one can provide an SJ.out.tab file as generated by the STAR aligner.
bam_downsampler
- This script randomly drops reads by a user-given fold difference.
bam_splitter
- This script splits a BAM file into pieces with a fixed number of alignments each. For extremely large files, this can be useful for distributing the work across a cluster.
Installing with pipenv is the preferred and simplest method when working on a remote system. It will take a while, as pysam also has to be installed and built:
python3 -m pipenv install -e git+https://github.com/nijibabulu/bam_reducers.git#egg=bam_reducers
Installing from a clone of the repository is not extensively tested. Clone the repository:
git clone https://github.com/nijibabulu/bam_reducers.git
Install the scripts:
cd bam_reducers
python setup.py install
Add --user to install only for the current user.
The bam_reducer script will filter and process alignments under several regimes:
- Downsampling: remove certain reads at random
- Quality removal: all quality values are set to 0. This does not reduce the SAM file size, but because repeated 0s are highly compressible, the BAM file can be reduced by ~60 % this way.
- Tag removal: the optional tags at the end of each alignment record often go unused, especially in browsers such as IGV.
- Splice junction filters: supply an SJ.out.tab file as produced by the STAR aligner, and reads implying splice junctions that are not listed in this file are removed (NOTE: STAR's --outFilterType BySJout option does this automatically, so there is no need to repeat it).
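The effect of quality removal on file size can be illustrated without a BAM file. BAM is stored in BGZF, a DEFLATE-based format, so the sketch below uses Python's zlib as a stand-in. The random Phred scores are made-up placeholder data, not real qualities, and the helper name compressed_size is hypothetical; the exact saving on a real file depends on its contents.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after DEFLATE compression."""
    return len(zlib.compress(data, 6))


rng = random.Random(0)
n_bases = 100_000

# Placeholder per-base Phred scores (2..40), standing in for real quality strings.
real_quals = bytes(rng.randint(2, 40) for _ in range(n_bases))

# What quality removal leaves behind: one zero byte per base.
zero_quals = bytes(n_bases)

# Same uncompressed length, very different compressed length.
print("original qualities:", compressed_size(real_quals), "bytes compressed")
print("zeroed qualities:  ", compressed_size(zero_quals), "bytes compressed")
```

Both inputs have the same uncompressed length, which mirrors the point above that the SAM size is unchanged while the compressed BAM shrinks.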
Full usage:
usage: bam_reducer.py [-h] [--preserve-quals] [--preserve-attr ATTR]
                      [--sj-file SJ_FILE] [--down-sample FACTOR] [--seed SEED]
                      [-o OUTPUT]
                      BAM

reduce a BAM file by stripping some information which is not necessary for
display.

positional arguments:
  BAM

optional arguments:
  -h, --help            show this help message and exit
  --preserve-quals      preserve the quality values (default behavior is
                        convert all quality values to 0
  --preserve-attr ATTR  preserve the attribute (default behavior is to remove
                        all attributes. A "*" preserves all attributes
  --sj-file SJ_FILE     include a STAR splice junction file to filter spliced
                        reads by
  --down-sample FACTOR  Take a random subset of reads. FACTOR Is a factor to
                        downsample by. for example, if 2 is given, the output
                        will contain approximately half the input reads
  --seed SEED           random seed for down sampling [default: 91543]
  -o OUTPUT, --output OUTPUT
                        output file [default: stdout]
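The idea behind --sj-file can be sketched in plain Python. This is an illustration under stated assumptions, not the code bam_reducer actually runs: the function names are hypothetical, reads are passed as (chromosome, 1-based POS, CIGAR string) values rather than pysam objects so the example runs without a BAM file, and only the first three columns of SJ.out.tab are used (chromosome, first intron base, last intron base, both 1-based).

```python
import re
from typing import Iterable, List, Set, Tuple

# (chromosome, first intron base, last intron base), 1-based inclusive
Junction = Tuple[str, int, int]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def load_junctions(lines: Iterable[str]) -> Set[Junction]:
    """Collect junctions from the first three columns of SJ.out.tab lines."""
    junctions = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        junctions.add((fields[0], int(fields[1]), int(fields[2])))
    return junctions


def read_junctions(chrom: str, pos: int, cigar: str) -> List[Junction]:
    """List the introns implied by N operations in a CIGAR string."""
    found = []
    ref = pos  # 1-based reference coordinate of the next aligned base
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            found.append((chrom, ref, ref + n - 1))
        if op in "MDN=X":  # operations that consume the reference
            ref += n
    return found


def keep_read(chrom: str, pos: int, cigar: str, known: Set[Junction]) -> bool:
    """Keep a read only if every junction it implies is in the known set."""
    return all(j in known for j in read_junctions(chrom, pos, cigar))


# Made-up example junction line in SJ.out.tab layout.
known = load_junctions(["chr1\t150\t349\t1\t1\t0\t12\t0\t38\n"])
print(keep_read("chr1", 100, "50M200N50M", known))  # junction 150-349 is listed
print(keep_read("chr1", 100, "50M300N50M", known))  # junction 150-449 is not
print(keep_read("chr1", 100, "100M", known))        # unspliced reads always pass
```

Unspliced reads imply no junctions, so they are kept regardless of the file contents.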
The bam_downsampler script only downsamples a BAM file. The bam_reducer script can also do this; however, the default behavior of bam_reducer is to remove a lot of attributes from the BAM file. bam_downsampler is a convenience script that downsamples without removing them.
usage: bam_downsampler.py [-h] [--seed SEED] [-o OUTPUT] FACTOR BAM

sample a random subset of alignments from a bam file

positional arguments:
  FACTOR                factor to downsample by. for example, if 2 is given,
                        the output will contain approximately half the input
                        reads
  BAM

optional arguments:
  -h, --help            show this help message and exit
  --seed SEED           set seed. [91543]
  -o OUTPUT, --output OUTPUT
                        specify an output file. By default, the bam is output
                        to stdout.
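One plausible way to implement downsampling by a factor is to keep each record independently with probability 1/FACTOR, using a seeded generator so runs are reproducible. The sketch below assumes that approach; whether bam_downsampler works exactly this way is not confirmed here, and the function name downsample is hypothetical. It operates on any iterable so it can be tried without a BAM file.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def downsample(records: Iterable[T], factor: float, seed: int = 91543) -> Iterator[T]:
    """Yield each record with probability 1/factor, reproducibly for a given seed."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    rng = random.Random(seed)  # private generator, so global random state is untouched
    keep_probability = 1.0 / factor
    for record in records:
        if rng.random() < keep_probability:
            yield record


# Integers stand in for alignments; a factor of 2 keeps roughly half of them.
kept = list(downsample(range(10_000), factor=2))
print(len(kept), "of 10000 records kept")
```

Because each record is decided independently, the output size is only approximately the input size divided by FACTOR, which matches the wording of the help text above.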
The bam_splitter script splits a BAM file for parallel processing. This can be useful if your BAM file is very large and reducing it on a single processor would take a long time.
usage: bam_splitter.py [-h] SPLIT_LENGTH BAM OUTPUT_PREFIX

positional arguments:
  SPLIT_LENGTH   number of alignments per file
  BAM            input bam file
  OUTPUT_PREFIX  first part of the output file name

optional arguments:
  -h, --help     show this help message and exit
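The splitting logic amounts to reading the input in fixed-size chunks and pairing each chunk with an output name. The sketch below shows this on a generic iterable. The function name split_records and the file naming scheme (prefix, zero-padded index, .bam) are assumptions made for illustration; the names bam_splitter really produces may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def split_records(
    records: Iterable[T], split_length: int, output_prefix: str
) -> Iterator[Tuple[str, List[T]]]:
    """Yield (file name, chunk) pairs with at most split_length records per chunk."""
    if split_length < 1:
        raise ValueError("split_length must be at least 1")
    iterator = iter(records)
    index = 0
    while True:
        chunk = list(islice(iterator, split_length))
        if not chunk:
            return  # input exhausted
        # Hypothetical naming scheme: <prefix>.<zero-padded index>.bam
        yield f"{output_prefix}.{index:04d}.bam", chunk
        index += 1


# Integers stand in for alignments; the last chunk may be shorter than the rest.
for name, chunk in split_records(range(5), split_length=2, output_prefix="part"):
    print(name, chunk)
```

Only one chunk is held in memory at a time, which matters for the very large inputs this script is aimed at.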