BAM File Size Reducers

Reduce BAM file size by removing tags and quality information (>50 %) and/or downsampling.

BAM files often contain information, such as quality values and attribute tags, that is not needed to display them in applications such as IGV. This excess is especially cumbersome for multi-lane RNA-sequencing runs. So far this package includes three simple scripts:

  • bam_reducer - Eliminates quality values and attribute tags as requested. Optionally, one can provide an SJ.out.tab file, as generated by the STAR aligner, to filter spliced reads.
  • bam_downsampler - Randomly drops reads according to a user-specified downsampling factor.
  • bam_splitter - Splits a BAM file into pieces with a fixed number of alignments each. For extremely large files, this is useful for distributing the work across a cluster.

Installation

The Pipenv Way

This is the preferred and simplest method when working on a remote system. This will take a while as pysam also has to be installed and built:

python3 -m pipenv install -e git+https://github.com/nijibabulu/bam_reducers.git#egg=bam_reducers

The setuptools Way

Not extensively tested. Clone the repository:

git clone https://github.com/nijibabulu/bam_reducers.git

Install the scripts:

cd bam_reducers
python setup.py install

Add --user to install only for the current user.

Usage

bam_reducer

The bam_reducer script will filter and process alignments under several regimes:

  • Downsampling: remove reads at random.
  • Quality removal: all quality values are set to 0. This does not reduce the SAM file size, but because repeated 0s are highly compressible, the BAM file can shrink by about 60 %.
  • Tags removal: the optional tags at the end of each alignment record often go unused, especially in browsers such as IGV.
  • Splice junction filter: supply an SJ.out.tab file as produced by the STAR aligner, and reads implying splice junctions that are not listed in that file are removed (NOTE: STAR's --outFilterType BySJout option does this automatically, so there is no need to repeat it if the alignment was run with that option).
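
The effect of quality removal on compression can be illustrated with a small standalone sketch. It does not use bam_reducer or pysam; it only compares how well DEFLATE (the codec underlying BAM's BGZF blocks) compresses a run of made-up quality bytes versus the same number of zero bytes. The random scores, the read count, and the read length are assumptions chosen for illustration, and the ratio it prints covers the quality field alone, so it is not a measurement of the ~60 % whole-file figure quoted above.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Return the size of `data` after DEFLATE compression at level 6."""
    return len(zlib.compress(data, 6))


rng = random.Random(0)
n_reads, read_len = 1000, 100  # illustrative sizes, not taken from any real run

# Hypothetical per-base Phred scores (0-41), one byte per base.
real_quals = bytes(rng.randrange(42) for _ in range(n_reads * read_len))
# The same number of bytes after quality removal: every score set to 0.
zero_quals = bytes(n_reads * read_len)

real_size = compressed_size(real_quals)
zero_size = compressed_size(zero_quals)

print(f"uncompressed size in both cases: {len(real_quals)} bytes")
print(f"compressed, original qualities:  {real_size} bytes")
print(f"compressed, zeroed qualities:    {zero_size} bytes")
```

Real quality strings are less random than this synthetic input, so the gap on actual data will be smaller than what this sketch prints.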

Full usage:

usage: bam_reducer.py [-h] [--preserve-quals] [--preserve-attr ATTR]
                      [--sj-file SJ_FILE] [--down-sample FACTOR] [--seed SEED]
                      [-o OUTPUT]
                      BAM

reduce a BAM file by stripping some information which is not necessary for
display.

positional arguments:
  BAM

optional arguments:
  -h, --help            show this help message and exit
  --preserve-quals      preserve the quality values (default behavior is
                        convert all quality values to 0
  --preserve-attr ATTR  preserve the attribute (default behavior is to remove
                        all attributes. A "*" preserves all attributes
  --sj-file SJ_FILE     include a STAR splice junction file to filter spliced
                        reads by
  --down-sample FACTOR  Take a random subset of reads. FACTOR Is a factor to
                        downsample by. for example, if 2 is given, the output
                        will contain approximately half the input reads
  --seed SEED           random seed for down sampling [default: 91543]
  -o OUTPUT, --output OUTPUT
                        output file [default: stdout]
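
As a rough illustration of what filtering with --sj-file involves, the sketch below checks whether every intron implied by a read's CIGAR string appears in a STAR SJ.out.tab file. It is not the code used by bam_reducer: the function names are hypothetical, it works on plain strings instead of pysam objects, and it assumes that unspliced reads are always kept. It relies only on the first three SJ.out.tab columns (chromosome, first intron base, last intron base, both 1-based).

```python
import re
from typing import Iterable, List, Set, Tuple

# (chromosome, first intron base, last intron base), 1-based and inclusive
Junction = Tuple[str, int, int]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def load_junctions(lines: Iterable[str]) -> Set[Junction]:
    """Collect junctions from SJ.out.tab lines using columns 1-3."""
    junctions = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        junctions.add((fields[0], int(fields[1]), int(fields[2])))
    return junctions


def read_junctions(chrom: str, pos: int, cigar: str) -> List[Junction]:
    """Return the introns implied by N operations.

    `pos` is the 1-based leftmost mapping position (the SAM POS field).
    """
    ref = pos
    found = []
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            found.append((chrom, ref, ref + n - 1))
        if op in "MDN=X":  # operations that consume the reference
            ref += n
    return found


def keep_read(chrom: str, pos: int, cigar: str, junctions: Set[Junction]) -> bool:
    """Keep a read only if all of its implied junctions are in the set."""
    return all(j in junctions for j in read_junctions(chrom, pos, cigar))


# One made-up SJ.out.tab line: an intron on chr1 spanning bases 150-349.
known = load_junctions(["chr1\t150\t349\t1\t1\t1\t10\t0\t30\n"])
print(keep_read("chr1", 100, "50M200N50M", known))
print(keep_read("chr1", 100, "50M300N50M", known))
```

With pysam, the chromosome, position, and CIGAR would come from each alignment record instead of literal strings; note that pysam reports 0-based start positions, so one would need to add 1 before calling read_junctions.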

bam_downsampler

This tool only downsamples a BAM file. The bam_reducer script can also downsample, but by default it removes most attributes from the BAM file. This convenience script downsamples without stripping them.

usage: bam_downsampler.py [-h] [--seed SEED] [-o OUTPUT] FACTOR BAM

sample a random subset of alignments from a bam file

positional arguments:
  FACTOR                factor to downsample by. for example, if 2 is given,
                        the output will contain approximately half the input
                        reads
  BAM

optional arguments:
  -h, --help            show this help message and exit
  --seed SEED           set seed. [91543]
  -o OUTPUT, --output OUTPUT
                        specify an output file. By default, the bam is output
                        to stdout.
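
The meaning of FACTOR and --seed can be sketched as follows. This is one plausible way to downsample by a factor, assuming each record is kept independently with probability 1/FACTOR; the actual script may differ (for example, in how it treats read pairs), and the function name and the list of fake read names are hypothetical.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def downsample(records: Iterable[T], factor: float, seed: int = 91543) -> Iterator[T]:
    """Yield each record with probability 1/factor, reproducibly for a given seed."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    rng = random.Random(seed)  # private generator, so the global state is untouched
    keep_probability = 1.0 / factor
    for record in records:
        if rng.random() < keep_probability:
            yield record


# Stand-in for alignment records; a real run would iterate over a BAM file.
reads = [f"read{i}" for i in range(10000)]
kept = list(downsample(reads, 2))
print(f"kept {len(kept)} of {len(reads)} reads")
```

Because the generator is seeded, running the same command twice on the same input selects the same reads, while a different seed gives a different subset of roughly the same size.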

bam_splitter

Split a BAM file for parallel processing. This is useful if your BAM file is very large and reducing it on a single processor would take a long time.

usage: bam_splitter.py [-h] SPLIT_LENGTH BAM OUTPUT_PREFIX

positional arguments:
  SPLIT_LENGTH   number of alignments per file
  BAM            input bam file
  OUTPUT_PREFIX  first part of the output file name

optional arguments:
  -h, --help     show this help message and exit
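
The chunking implied by SPLIT_LENGTH can be sketched in a few lines. This is an illustration only, with a hypothetical function name and plain integers standing in for alignments; how bam_splitter names its output files and writes BAM headers is not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def split_records(records: Iterable[T], split_length: int) -> Iterator[List[T]]:
    """Yield consecutive chunks holding at most `split_length` records each."""
    if split_length < 1:
        raise ValueError("split_length must be a positive integer")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, split_length))
        if not chunk:
            return  # input exhausted
        yield chunk


# Ten stand-in alignments split into pieces of four.
chunks = list(split_records(range(10), 4))
for index, chunk in enumerate(chunks):
    print(f"chunk {index}: {len(chunk)} records")
```

Each chunk could then be written to its own file and reduced as a separate cluster job; only the last chunk may hold fewer than SPLIT_LENGTH records.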
