Methods
This document explains the organization of nimble across its various modules, and many of the implementation details of how it generates feature calls.
nimble-aligner contains the core alignment code. It is responsible for reading sequence data and producing feature calls via pseudoalignment. There are many decision points about how to generate a call given a particular read-pair, and most of those decision points are configurable via the reference library file format.
The user passes either single- or paired-end .fastq data, or a cellranger .bam file, alongside the set of references they want to align the data to. Based on the file extension, we run either a single-threaded fastq processing mode that reads all the data into memory and aligns it to each reference, or a multithreaded .bam alignment mode that reads many UMIs of data at a time (a UMI here meaning all paired-end reads with the same unique UMI and cell barcode combination) and aligns them to the references across many threads. In order for the UMI reader to work, we first use samtools to sort the .bam file by UMI and cell barcode, and then read one UMI at a time from the top of the file. We create two reference indices, which are the data structures we use to generate feature calls. One reference index contains the features in the provided orientation, and the other contains reverse-complemented versions of the features.
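The two indices can be thought of as the same feature set in both orientations. A minimal sketch of that idea (illustrative Python, not nimble's actual Rust code; `build_indices` is a hypothetical name):

```python
# Sketch: build a forward index and a reverse-complemented index from a
# feature-name -> sequence mapping, mirroring the two reference indices
# described above.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def build_indices(features):
    """Return (forward, reverse-complemented) copies of the feature set."""
    forward = dict(features)
    revcomp = {name: reverse_complement(seq) for name, seq in features.items()}
    return forward, revcomp
```

In the real aligner these are full pseudoalignment index structures rather than plain sequence maps, but the orientation relationship between the two is the same.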
Both processing modes eventually feed into the calling function. The parameters are a list of sequences and a list of their mate sequences (in the case of our .bam data, `sequences` is a list of r1 reads and `mate_sequences` is a list of r2 reads). Note that in the .bam pipeline, if a read has the reverse complement flag, we reverse complement it. This means that `sequences` and `mate_sequences` are always in opposite orientations to one another, which allows us to impose constraints on valid alignment orientations. We also pass the pair of reference indices, the object containing the reference sequences, and the object containing the aligner configuration. There's also a `per_sequence_metadata` object, which contains nothing in the case of the fastq mode, and a long list of bam fields corresponding to the sequence/mate_sequence in the case of the bam mode.
We then call the primary alignment function. This function takes a set of sequences (either one UMI in the case of the bam mode, or all the data in the case of the fastq mode) and returns a list of calls, as well as a bunch of metadata about the alignment, i.e. reasons for filtering certain reads and the like.
First, we align the sequences and their mates to the normal version of the reference library as well as the reverse-complemented version of the library. This is done using a function called `score_sequences`.
In `score_sequences`, we iterate over all of the sequences, as well as the mate sequences if they are provided. We pass both of these sequences to the `pseudoalign` function.
`pseudoalign` takes a single sequence, the reference index object, and the aligner configuration, and applies several filters. First, we assert that the read is of sufficient length for alignment and that it passes an entropy threshold. Then, we attempt to map the read to the reference, which produces either a single call or an equivalence class of the calls with the highest scores. In addition to the equivalence class and its score, the mapping returns the number of mismatches, which we use to run a mismatch filter. Finally, the alignment is run through a filtration function that applies a few other filters: a score threshold filter, a normalized score threshold filter, and a multiple match filter.
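The length and entropy pre-filters can be sketched as follows (an illustrative Python sketch; nimble's actual thresholds live in the aligner configuration, and the function names here are hypothetical):

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy of the read's base composition, in bits.
    Low-complexity reads (e.g. homopolymers) score near zero."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def passes_prefilters(seq, min_len, min_entropy):
    """Length and entropy checks applied before attempting to map a read."""
    return len(seq) >= min_len and shannon_entropy(seq) >= min_entropy
```

Reads that fail these checks are never mapped at all, which keeps low-complexity sequence from producing spurious equivalence classes.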
Going back to `score_sequences`, once both mates in the current read-pair are aligned, we do some unpacking/reformatting of the data. The next important line creates the read key, which is what we use to register the call for a given sequence in the hashmap of calls. If there are mate sequences, the calls then go through another layer of filtration here. We use `filter_pair` to determine whether the equivalence classes for both read-mates are the same. If they are not, and the `require_valid_pair` flag is set in the aligner configuration, the read-pair gets filtered and we move on to the next one.
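The pair check reduces to a simple comparison. A minimal sketch (assuming equivalence classes are represented as sets of reference indices; the signature is illustrative, not nimble's actual API):

```python
def filter_pair(seq_class, mate_class, require_valid_pair):
    """Return True when a read-pair should be discarded: the two mates
    produced different equivalence classes and require_valid_pair is set."""
    return require_valid_pair and frozenset(seq_class) != frozenset(mate_class)
```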
Otherwise, we move on to packaging the calls for output (the `feature_list` lines are for the `read_matches` object, which is used for reporting; they don't have anything to do with the call output). A series of branches packages the calls to be returned from this function while enumerating the different cases of matches. For example, if we have a valid call for the sequence and its mate, we return all the call information alongside `PairState::Both`. If we got only a sequence call and not a mate call, we return `PairState::First`. This is used for orientation filtration down the line.
Among all the `read_matches` reporting lines, we then insert the call into the output score map and move on to the next read-pair.
Back in `get_calls`, once we've gotten a per-UMI/per-fastq call hashmap for the normal and reverse-complemented versions of the reference, the rest of the code coerces those four alignments (one alignment per reference per mate) down to a single callset per read-pair. A call could be a single feature, or it could be a set of ambiguous features. We also perform orientation filtration at this time.
We iterate over all the calls from the normal version of the reference genome. If a given read-pair also got a call from the reverse-complemented version of the reference genome, we look it up by the `read_pair_key`, which is the read key from earlier. We then enter the orientation filter.
In `filter_and_coerce_sequence_call_orientations`, we feed the calls for a given read-pair generated from one or both libraries into `get_alignment_orientation`. By looking at the `PairState`s generated from alignments to both versions of the reference library, we can determine whether the orientation of the alignments is valid. For instance, if both read-mates aligned to both the forward and reverse-complemented versions of the reference library, that's good evidence that the alignment is errant in some way. If each read-mate aligns to only one of the reference library versions, and each mate aligned to a different version (i.e. r1 aligned to normal and r2 aligned to reverse-complement), then that's a signal that the alignment is good. We then pass this computed alignment orientation into `filter_orientation_on_library_chemistry`, which imposes additional constraints on the set of possible alignments given the library chemistry. If the read-pair fails this filter, the alignment is discarded.
Back in `filter_and_coerce_sequence_call_orientations`, if the read-pair passed orientation filtration, we move on to equivalence class coercion. The issue is that we have up to four different equivalence classes for this read-pair, and we need to return one. A flag in the reference library configuration, called `intersect_level`, specifies which strategy to use for equivalence class coercion. `get_all_calls` just permissively appends all the calls together and takes the highest score. `get_intersecting_reads` with fallback enabled tries to intersect the calls, but falls back to permissively appending if the intersection fails (i.e. the `sequence` and `mate_sequence` equivalence classes are completely divergent). Finally, `get_intersecting_reads` with fallback disabled filters the alignment if the intersection fails.
After that, we transform the resulting new equivalence class into a final feature call using `process_equivalence_class_to_feature_list`. By default, this directly maps each equivalence class value to the corresponding feature name in the `nt_sequence` column of the reference library. However, nimble also supports treating multiple features as part of a "group" for sets of variants, lineages, etc. Therefore, if the `group_on` column is set to a column other than `nt_sequence`, we replace all the calls to features within that group with a single call to the higher-level name for the group. After performing that replacement, we pass the final feature callset through a filter limiting the number of calls, which is controlled by the `discard_multi_hits` flag.
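A sketch of that mapping step (illustrative Python; `groups` stands in for a non-default `group_on` column and `max_hits` for the limit implied by `discard_multi_hits` -- both parameter names are hypothetical):

```python
def class_to_feature_calls(eq_class, nt_sequence, groups=None, max_hits=None):
    """Map equivalence class indices to feature names, optionally collapsing
    features into a higher-level group name, then limit the number of calls."""
    names = [nt_sequence[i] for i in eq_class]
    if groups is not None:
        # Replace per-feature calls with one call per group name.
        names = sorted({groups[n] for n in names})
    if max_hits is not None and len(names) > max_hits:
        return []          # too many ambiguous calls: discard the callset
    return names
```

Grouping is what lets a library of, say, many allele sequences report a single lineage-level call instead of an ambiguous set of alleles.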
Back in `filter_and_coerce_sequence_call_orientations`, the last step is to add the call to the results map, which contains one value per unique callset and the number of read-pairs that mapped to that callset.
Finally, we process any calls we missed, add any additional filter info generated from the orientation filter pipeline to the filter info object, and we return the calls to the relevant pipeline. In fastq mode, these are the calls for the full dataset, which then get written to the output file for the relevant library. In bam mode, these are calls for a single UMI on a single reference library, and they get written to the corresponding library output file.
Generally, nimble-aligner shouldn't be used directly. Instead, we've provided a CLI frontend for the tool here. This contains facilities for downloading particular versions of the aligner, generating the reference library from various input data sources, creating detailed reports on your alignments, and transforming the data in various ways for downstream analysis.
Each row of the output file generated by `nimble align` contains a feature callset and the number of read-pairs that matched that callset. For fastq input data, each feature callset is unique in the file, since the alignments were computed for all the sequences at once. For bam input data, each callset is unique per-UMI, since the alignments were computed one UMI at a time, and there are also many other bam flags attached per-row. To turn this per-UMI data into a cell-by-feature matrix for 10x scRNA-Seq pipelines, we provide a `nimble report` command. A collection of callsets across many UMIs can be transformed into per-cell data via the following steps.
First, intersect all of the read-pair callsets within each UMI, producing a single callset per UMI containing the features agreed upon across all of the UMI's component read-pairs. Then, roll up the resulting list of UMI callsets for the given cell by counting the number of UMIs per unique callset. This results in a cell-by-callset matrix, where each value is the number of UMIs that shared that unique callset after going through UMI-level intersection. A callset could be a single feature, or it could be a set of ambiguous features. The `nimble report` command implementation is here.
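The two steps above can be sketched as follows (an illustrative Python sketch of the roll-up, assuming callsets are sets of feature names; UMIs whose read-pairs fully disagree intersect to the empty set and are dropped):

```python
from collections import Counter

def umi_callset(read_pair_callsets):
    """Step one: intersect all read-pair callsets within one UMI."""
    result = frozenset(read_pair_callsets[0])
    for cs in read_pair_callsets[1:]:
        result &= frozenset(cs)
    return result

def cell_counts(umis):
    """Step two: roll a cell's UMIs up into counts per unique non-empty
    callset, yielding one row of the cell-by-callset matrix."""
    counts = Counter()
    for read_pairs in umis:
        cs = umi_callset(read_pairs)
        if cs:
            counts[cs] += 1
    return counts
```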
In our analyses, we often compare the results of a nimble alignment with a cellranger alignment using Mmul_10. However, our nimble reference libraries often contain only the coding sequence of the features of interest, while the corresponding features in Mmul_10 contain the coding sequence and the flanking UTR. This leads to an effect where counts for the same features diverge between nimble and cellranger, because one mate in a given read-pair will span the UTR and be dropped from the nimble data, which is generated from a library that lacks this UTR. This effect is demonstrated in the figure below, where there is an outsized number of r2-only calls that are dropped from the nimble data but are genuine hits, as determined by comparison to a separate source of ground truth (genomic location):
We can recapture this data by treating r1 and r2 differently: r1-only alignments get dropped, but we keep r2-only alignments if they pass a higher score threshold than normal reads. This is because, for our data, r1 is the mate that often spans the UTR. Since this is specific to coding-sequence-only libraries on 5' library chemistry, an evolving implementation is provided as a Python function at the moment, rather than being implemented at the aligner level.
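The rescue rule amounts to a mate-specific threshold (a hypothetical sketch of the rule described above, assuming a 5' chemistry, coding-sequence-only library where r1 tends to span the UTR; function and parameter names are illustrative, not nimble's API):

```python
def keep_call(pair_state, score, threshold, r2_rescue_threshold):
    """Decide whether to keep a call based on which mates aligned."""
    if pair_state == "first":              # r1-only call: drop it
        return False
    if pair_state == "second":             # r2-only call: stricter threshold
        return score >= r2_rescue_threshold
    return score >= threshold              # both mates: normal threshold
```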
The cell calls output from `nimble report` are in the correct format for use in a Seurat object. We provide functions for appending a nimble output file to a Seurat object in the Bimber Lab Rdiscvr repository.
`AppendNimbleCounts` provides a configurable interface for appending a nimble cell-by-callset matrix to a Seurat object. You can configure it to drop ambiguous features (i.e. callsets with >1 call) and to perform multiple types of normalization.