-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How does fulgor handle multi-mappers? #15
Comments
Hi @jeremymsimon,
Best, |
Hey @rob-p! I played around with Also, this may be easiest to describe via a call to step through some data, but both If Thanks! |
Hi @jeremymsimon, Very interesting! We'd certainly be interested in investigating any minimal reproducible examples you find. There is a filtering that happens with highly-repetitive seeds in If you're interested in a more memory efficient index that supports positional information, you may be interested in looking at Best, |
Hi @jeremymsimon, Cheers, |
Hi @jeremymsimon, any news regarding the dataset? |
Yes, so sorry for the delay. I've attached here a directory containing a set of query ( The Generally:
My hope is that this setup is mostly self explanatory once you dive into the directory and scripts, but absolutely feel free to ask here or by email if you'd like me to clarify anything or provide more context. Thanks in advance! |
Hi @jeremymsimon and sorry for the late reply but it's a very busy period here (as usual!). I do not know exactly what you're searching for but I can try to answer some questions:
Fulgor might have false positives but never false negatives. That is, a sequence might be falsely reported as a match because its kmers are found in the index whereas it does not really occur in any of the indexed genomes. This is a limitation of all kmer-based methods (not only of Fulgor). This is why a longer kmer length is probably useful for. You mentioned you use k=11 or 13, but we would use k=28 or 31 (current default value). But Fulgor will never have false negatives, i.e., given a true positive query, all its kmers will be found in the index, hence the query will always be reported as present in the index.
Mmh I'm not sure I understand what you mean by "the sequence gets split across two k-mers". Can you please clarify or provide a toy example? As Rob mentioned, pufferfish can be used with some filtering option, but we don't use any filtering in Fulgor, hence any indexed kmer must always be found at query time. Hope this clarifies a bit! Please let me know. Best, |
Hi @jeremymsimon, Just some notes about what you're likely seeing in Regardless, if it turns out you really do need positional information for your hits, then one place to look might be If you don't need positional information, as Giulio suggests, there is no filtering currently done in fulgor. However, you won't get positional information (it doesn't compute this), and it also doesn't keep track of how many times in each reference a given k-mer occurs. Specifically, fulgor considers the "color" of a unitig as the distinct set of references in which the unitig occurs, regardless of the number of occurrences. This is what it considers when performing the mapping. Finally, I'll add one other point to the excellent description Giulio gives above. When you are looking up the mapping information for reads that consist of more than one k-mer, then the result you get can depend heavily on which specific mapping algorithm you use. The strict pseudoalignment algorithm requires that, to report a reference --Rob |
Thanks @rob-p and @jermp ! This is helpful. So my search has relatively short queries, on the order of 20-25bp as opposed to typical applications (e.g. RNA-seq reads) that are longer. Do you still recommend a k-mer length of 28 or 31? It isn't explicitly mentioned this way, but it seemed there was a pattern that the recommended k-mer length was roughly half the size of the query sequences (ie 28-31bp works for 50-75bp reads) but maybe that was me over-interpreting. And can you similarly provide guidelines for the minimizer parameter as I don't see this recommendation anywhere? In the files I shared, there were example queries that behaved or didn't behave as we expected, and some variability from algorithm to algorithm, but before saying more it may be due to a poor choice in k-mer/minimizer lengths so it would be helpful to clarify that first. If a known match is found with 100% confidence, then we likely can get away without knowing the locations, and in fact this may be preferable for our applications as it will obscure any potentially identifiable information in the results. We do however need to know if a given match occurred more than once, and ideally how many times within a certain distance/number of mismatches. It sounds like Thanks! |
Hi @jeremymsimon,
Well, I'd say 20-25 bp is even shorter than a single kmer....so kmer-based indexes using k < 20 might yield poor results in terms of accuracy. For the minimizer length: actually there is no good rule. It does really depend on the data you are indexing. Usually I use m=17 for less fragmented dBGs and m=19 or 20 for highly fragmented dBGs but that is for k=31 :)
Correct, Fulgor does not encode any multiplicity, whereas pufferfish2 or piscem have everything :) |
Thanks @jermp - so is the general rule of thumb for the k-mer size then to be as long as your shortest possible query length? ie if I have a range of query sizes from 20-25bp, you'd recommend |
You're very welcome @jeremymsimon!
I do not know if it is a rule of thumb but I'd say "no" because, in general, one builds an index without assuming specific query workloads nor distributions.
For these values of k yes, it is a reasonable choice. But I'm surprised you have so short queries! In general, if your shortest query length is, say, 200, I'd not recommend k=200 but k=31 or k=63. |
Thanks @jermp and apologies for the delay, I was out of town. I'm building this all for a very specific gRNA sequence screen, so the queries are short and fixed at that length, rather than a typical usage for sequencing reads or similar. I will repeat my testing with k=20, and try your other suggestions above (including piscem and others). I'll close this for now and reopen if I have any further questions! Thanks for all of your help! |
No problem at all @jeremymsimon. Keep us posted, we're both very curious about this experimentation. |
Hi-
I was curious if you could help me understand what happens in the case of multi-mappers, ie a given sequence that appears multiple times in the same reference. In your example with 4,546 Salmonella genomes, my guess is that there are repetitive sequences within a given genome - is the pseudo-alignment only capable of saying whether or not a given sequence occurred (n >= 1 times) or is it possible to know the exact number of times there was a match within a genome, or even more specifically, where the match(es) occurred?
The text was updated successfully, but these errors were encountered: