DnaStringSlice should abstract away the backing DnaString rc status #21

d-cameron · 2021-06-11T07:28:48Z

The DnaStringSlice implementation is inconsistent when the slice is reverse complemented. Mer::get() is implemented (what I consider to be) correctly, but other functions only work if the slice has is_rc == false.

#[test]
fn dnastringslice_get_kmer() {
    let seq = DnaString::from_dna_string("ACGGTAC");
    let seqrc = DnaString::from_dna_string("GTACCGT");
    let rcslice = seq.slice(0, 7).rc();
    let slice = seqrc.slice(0, 7);
    for i in 0..=3 {
        // The kmer in a slice should be the kmer of the sequencing represented
        // by that slice, regardless of whether the backing DnaString is RC or not.
        assert_eq!(slice.get_kmer::<Kmer4>(i), rcslice.get_kmer::<Kmer4>(i));
    }
}
#[test]
fn dnastringslice_slice() {
    let seq = DnaString::from_dna_string("ACGGTAC");
    let seqrc = DnaString::from_dna_string("GTACCGT");
    let rcslice = seq.slice(0, 7).rc();
    let slice = seqrc.slice(0, 7);
    // The fact that a slice is backed by a DnaStringSlice that is
    // the rc of the slice sequence shouldn't matter.
    assert_eq!(rcslice.slice(1, 4), slice.slice(1, 4));
}

On a related topic, would it make sense to split the 2-bit encoding of a DNA string and associated kmer logic into their own crate? Two bit encoding and kmer counting is generically useful outside of de bruijn graph construction and is used for everything from kmer counting, error correction, mimimizer hash tables, to reference genome storage (e.g. sequence interval lookups in a memory maped 2bit encoded (http://genome.ucsc.edu/FAQ/FAQformat.html#format7) reference genome would be very efficient but unfortunately, you've chosen a different packed encoding^).

^ The UCSC encoding uses a bit encoding in which the MSB indicate a purine base, so complementing the sequence is XORing with 0xAAAAAAAA instead of flipping all bits which is the approach used here.

The text was updated successfully, but these errors were encountered:

pmarks · 2021-06-12T16:26:29Z

Thanks for the report & tests. Fixed in 16bb73e.

I have wanted to work with the 2bit format, but never really got around to it. I chose the ACGT convention because then you get lexicographic ordering 'for free' with normal integer comparisons, which is really useful for all the kmer stuff. I think we could solve this by making DnaString generic over a TwoBitConverter trait that knows how to convert from the representation of the backing store to the ACGT convention as bases/kmers are requested. This would compile to a no-op for the native representation, and a few bit ops for the UCSC encoding. There would also be some work make DnaString be generic over the backend storage: currently it's Vec<u64>, but would need to be some mmap wrapper type in the 2bit mmap case. Happy to work with you on that. I'd also be open to splitting up the crate into the Kmer/DnaString stuff and moving the DBG / cDBG stuff to a more specialized crate. I don't have much time to work on it myself in the near term.

d-cameron changed the title ~~DnaStringSlice should abstract away the backing DnaString~~ DnaStringSlice should abstract away the backing DnaString rc status Jun 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DnaStringSlice should abstract away the backing DnaString rc status #21

DnaStringSlice should abstract away the backing DnaString rc status #21

d-cameron commented Jun 11, 2021 •

edited

Loading

pmarks commented Jun 12, 2021

DnaStringSlice should abstract away the backing DnaString rc status #21

DnaStringSlice should abstract away the backing DnaString rc status #21

Comments

d-cameron commented Jun 11, 2021 • edited Loading

pmarks commented Jun 12, 2021

d-cameron commented Jun 11, 2021 •

edited

Loading