rust_htslib::faidx::Reader::fetch_seq() causing a memory leak. #15

Closed
MaelLefeuvre opened this issue Jul 20, 2023 · 1 comment · Fixed by #14

MaelLefeuvre commented Jul 20, 2023

Bug Description

pmd-mask currently shows clear signs of a memory leak (see the memory consumption timeline below).

[Figure: pmd-mask memory consumption timeline]

  • First noticed by @J-Sauvage, when attempting to apply pmd-mask on high-coverage samples.
  • This is almost unnoticeable with low-coverage samples, but memory usage can quickly become debilitating for the OS when working on bigger files.

Running heaptrack points to a single culprit, rust_htslib::faidx::Reader::fetch_seq():

[Figures: heaptrack summary and bottom-up stack trace]

The only call site to this function is in lib::apply_pmd_mask():81

// ---- Get the reference's position
let reference = reference.fetch_seq(
    current_record.chromosome.inner(),
    bam_record.reference_start() as usize,
    bam_record.reference_end() as usize - 1,
)?;

Minimal reproducible example

Execute pmd-mask on any available input, while profiling the memory consumption with heaptrack

heaptrack target/debug/pmd-mask -b sample.bam -m misincorporation.txt -f hs37d5.fa.gz -o sample.masked.sam

A 'hackier' way to monitor memory consumption and watch it gradually increase, using top and grep (pmd-mask is launched directly here, so that "$!" holds its own PID):

target/debug/pmd-mask -b sample.bam -m misincorporation.txt -f hs37d5.fa.gz -o /dev/null --quiet & top -b -d1 | grep --line-buffered "$!"
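
The same monitoring can also be done from Rust on Linux by polling the `VmRSS` field of `/proc/<pid>/status`. The sketch below is only an illustration and not part of pmd-mask: the helper names `parse_vm_rss_kb` and `rss_kb` are hypothetical, and it assumes a Linux `/proc` filesystem.

```rust
use std::fs;

/// Extract the resident set size (in kB) from the text of a Linux
/// `/proc/<pid>/status` file. Returns None if no `VmRSS:` line is found
/// or if its value cannot be parsed.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))?
        .split_whitespace()
        .nth(1)? // second field is the numeric value, third is the unit
        .parse()
        .ok()
}

/// Read the current resident set size of a process by PID (Linux only).
fn rss_kb(pid: u32) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{}/status", pid)).ok()?;
    parse_vm_rss_kb(&status)
}
```

Calling `rss_kb(pid)` once per second while pmd-mask runs should show the same gradual increase as the top and grep pipeline.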

Impact and current size-complexity of pmd-mask

Looking at the htslib::faidx_fetch_seq64() call site within faidx::Reader::fetch_seq() suggests that the most likely source of the leak is the byte-string representation of each reference sequence, which appears never to be freed (see: rust-htslib/src/faidx/mod.rs:73-79).
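
To make the suspected pattern concrete, here is a minimal sketch of how a caller-owned C buffer can leak when it is only borrowed on the Rust side. This is not the rust-htslib source. `mock_fetch_seq`, `mock_free`, and the `LIVE_BUFFERS` counter are hypothetical stand-ins, and the mock allocates with Rust's allocator. A buffer returned by the real faidx_fetch_seq64() is malloc-allocated and would need to be released with C's free().

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of mock buffers handed out and not yet released.
static LIVE_BUFFERS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a C function returning a heap-allocated,
/// NUL-terminated sequence that the caller is responsible for releasing.
fn mock_fetch_seq(seq: &str) -> *mut c_char {
    LIVE_BUFFERS.fetch_add(1, Ordering::SeqCst);
    CString::new(seq).expect("no interior NUL").into_raw()
}

/// Hypothetical stand-in for `free()`. The mock buffer came from
/// `CString::into_raw`, so it is released with `CString::from_raw`.
fn mock_free(ptr: *mut c_char) {
    // SAFETY: `ptr` was produced by `CString::into_raw` in `mock_fetch_seq`.
    drop(unsafe { CString::from_raw(ptr) });
    LIVE_BUFFERS.fetch_sub(1, Ordering::SeqCst);
}

/// Leaky pattern: the bytes are read, but the buffer is never released.
fn fetch_seq_leaky(seq: &str) -> Vec<u8> {
    let ptr = mock_fetch_seq(seq);
    // SAFETY: `ptr` points to a valid NUL-terminated string.
    unsafe { CStr::from_ptr(ptr).to_bytes().to_vec() }
}

/// Fixed pattern: copy the bytes into an owned Vec, then release the buffer.
fn fetch_seq_owned(seq: &str) -> Vec<u8> {
    let ptr = mock_fetch_seq(seq);
    // SAFETY: `ptr` points to a valid NUL-terminated string.
    let owned = unsafe { CStr::from_ptr(ptr).to_bytes().to_vec() };
    mock_free(ptr);
    owned
}

/// Current number of unreleased mock buffers.
fn live_buffers() -> usize {
    LIVE_BUFFERS.load(Ordering::SeqCst)
}
```

With the leaky variant, `live_buffers()` grows by one on every call, which mirrors one leaked reference sequence per read. The owned variant leaves the count unchanged.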

Thus, the current space complexity when processing a file is $O(n \cdot \widehat{L})$, where $n$ is the number of reads in the file and $\widehat{L}$ is the average fragment length.
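
As a rough order-of-magnitude illustration of this bound, the sketch below multiplies a read count by a mean fragment length, assuming one leaked byte per reference base and ignoring allocator overhead. The function names and the example figures (100 million reads, 60 bp mean length) are hypothetical, not measurements from pmd-mask.

```rust
/// Estimated number of leaked bytes if every read leaves behind one copy
/// of its reference span (assumption: one byte per base, no allocator overhead).
fn estimated_leak_bytes(n_reads: u64, mean_fragment_len: u64) -> u64 {
    n_reads * mean_fragment_len
}

/// Convert a byte count to GiB for readability.
fn bytes_to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}
```

For the hypothetical figures above, `estimated_leak_bytes(100_000_000, 60)` gives 6,000,000,000 bytes, or roughly 5.6 GiB, which is consistent with the leak going unnoticed on low-coverage samples and becoming a problem on larger files.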

A (temporary) workaround when working with high-coverage samples

Apply pmd-mask sequentially on each chromosome.

The following set of bash commands should help mitigate the overall memory consumption until a fix is found.

Assumptions:

  • $REFERENCE must be accompanied by a .fai index file in the same directory.
  • pmd-mask must be within your $PATH
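INPUT="sample.bam"             # must be coordinate-sorted and indexed for region queries
OUTPUT="sample.masked.sam"     # the result is plain-text SAM: one header plus appended records
REFERENCE="hs37d5.fa.gz"
MISINCORPORATION="misincorporation.txt"
METRICS="sample.pmdmask.metrics"

# Write the header once, then append the masked records of each chromosome.
samtools view -H --no-PG "$INPUT" > "$OUTPUT"
for chr in $(cut -f1 "${REFERENCE}.fai"); do
    echo "Applying pmd-mask on $(basename "$INPUT") - $chr"
    samtools view -h "$INPUT" "$chr" \
    | pmd-mask -m "$MISINCORPORATION" -f "$REFERENCE" -M "$METRICS" \
    | grep -v '^@' \
    >> "$OUTPUT"
done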
MaelLefeuvre added the bug label Jul 20, 2023
MaelLefeuvre self-assigned this Jul 20, 2023
MaelLefeuvre changed the title from "rust_htslib::rust_htslib::faidx::Reader::fetch_seq() causing a memory leak." to "rust_htslib::faidx::Reader::fetch_seq() causing a memory leak." Jul 20, 2023

MaelLefeuvre commented Jul 20, 2023

Issue submitted to rust-bio/rust-htslib#401
