"Index requested greater than vector's size" when loading pufferfish index using salmon v1.0 #8
Comments
Hi @mdshw5,
Thanks for the detailed report! I grabbed the fasta file and indexed it, and (on the current develop branch of salmon, at least, which should not deviate substantially from 1.0) it was able to load the index and align the reads successfully. I'm currently downloading your pre-computed index to try and do a basic cross-check, but iCloud is taking some time ;P. In the meantime, I get the following sizes for the components of the index — does your index have anything that looks different?
total 18G
drwxrwxr-x 2 rob rob 4.0K Dec 18 17:26 .
drwxrwxr-x 4 rob rob 4.0K Dec 18 17:27 ..
-rw-rw-r-- 1 rob rob 697K Dec 18 16:57 complete_ref_lens.bin
-rw-rw-r-- 1 rob rob 2.7G Dec 18 17:22 ctable.bin
-rw-rw-r-- 1 rob rob 122M Dec 18 17:22 ctg_offsets.bin
-rw-rw-r-- 1 rob rob 32K Dec 18 16:57 duplicate_clusters.tsv
-rw-rw-r-- 1 rob rob 1.3G Dec 18 17:22 eqtable.bin
-rw-rw-r-- 1 rob rob 968 Dec 18 17:26 info.json
-rw-rw-r-- 1 rob rob 1.6G Dec 18 17:26 mphf.bin
-rw-rw-r-- 1 rob rob 9.4G Dec 18 17:26 pos.bin
-rw-rw-r-- 1 rob rob 115 Dec 18 17:26 pre_indexing.log
-rw-rw-r-- 1 rob rob 426M Dec 18 17:14 rank.bin
-rw-rw-r-- 1 rob rob 1.4M Dec 18 17:22 refAccumLengths.bin
-rw-rw-r-- 1 rob rob 4.5M Dec 18 17:26 ref_indexing.log
-rw-rw-r-- 1 rob rob 697K Dec 18 17:18 reflengths.bin
-rw-rw-r-- 1 rob rob 769M Dec 18 17:22 refseq.bin
-rw-rw-r-- 1 rob rob 851M Dec 18 17:14 seq.bin
-rw-rw-r-- 1 rob rob 96 Dec 18 17:26 versionInfo.json
Thanks! |
Wow thanks for the quick response! I’ll check file sizes after I get the kids in 🛌.
Matt |
My file sizes are mostly the same:
I'm assuming the
I also believe I might have created this index using the beta 0.99v2 pre-release. I just poked around comparing to the current develop branch, and it looks like the cereal lib version was different: COMBINE-lab/salmon@v0.99.0-beta2...develop#diff-af3b638bc2a3e6c650974192a53c7291L556-R553. That probably explains the issue I'm having; I will rebuild the indices and confirm whether I still have the issue. Thanks again for your time! |
I just updated this as I noticed cereal released a new version after years. I don't think the issue you had was a listed bug, but I can try to deserialize your index with the new version (after kids are in bed :)). |
I just reindexed my transcripts fasta using salmon v1.0 and still have the exact same error messages. Just to check, I also tested on a CentOS 7 machine with a different CPU architecture (though I think host-specific issues are unlikely), with no luck. Out of desperation I tried dropping flags from my command and, like most things done out of desperation, I made it work but am at a loss to explain why. My problem was the |
Thanks for the detective work. Actually, |
Actually, @mdshw5 — one thing I did notice as strange during indexing (watching the log scroll by) is that there were a bunch of sequences that, after polyA clipping, had a length of 0. I think this could potentially be a corner case. I can dig deeper, but do you see any reason why your transcript fasta would contain sequences that are all A? Edit: a quick look suggests they are actually empty input sequences! I'll see if removing them helps. |
Sounds good. Thanks again for the help tonight. I should point out that I posted my log with I can reason about why I encounter this error building my weird index. Edit: just saw your reply and edit, and I think you're right. I'm constructing the Gencode transcript sequences using the GTF models, which produces some small transcripts that are usually filtered out during indexing, so I've become accustomed to ignoring these messages. |
To explain what I'm working on (and maybe it's a terrible idea):
|
Regarding your first response: I'm filtering out empty transcripts now and will try re-indexing. "Small" targets should not be a problem; actually, we have an explicit strategy to account for them in the index. However, I think empty targets are a corner case we didn't consider. The indexing process is lossless, so it will keep a record that these inputs existed even though they had no sequence, but that causes a calculation bug (I think) in the code you link to above. I think the best behavior here is to say that short targets are OK but empty targets are disallowed (do they technically violate the FASTA "format"?), and to throw an error during indexing if we see them.

Re your second idea: I see where you're going with this, and we're definitely on board. The main purpose of the decoys right now is to avoid spurious mappings (which they do help to achieve). However, given the ability to map to much more sequence, there are a number of other things we would want to keep around and handle in an appropriate way, and some of these are absolutely on our radar. Interestingly, if we want to start having bias modeling for very long sequences, we might have to think about re-engineering some things (for a reason unrelated to the behavior we are seeing here). If we have very, very long sequences (e.g. transcripts with all retained introns) that have only a few reads mapping to them, it will be a huge waste, and not particularly useful, to model the bias of all potential fragments originating from all potential positions of those target sequences. This is a separate (performance) issue, so I won't wax philosophical about it here, but I'll think about it some more and maybe we can discuss it in another issue / via another forum. |
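As a concrete version of the "throw an error during indexing" behavior proposed above, here is a minimal sketch of such a check. The function name, target names, and structure are assumptions made for illustration; this is not actual salmon/pufferfish code.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// checkTargetSequence is a made-up name, not a real salmon/pufferfish API.
// Short (non-empty) targets pass through; empty targets are rejected early
// with a clear error rather than surfacing later as a confusing failure.
void checkTargetSequence(const std::string& name, const std::string& seq) {
    if (seq.empty()) {
        throw std::invalid_argument("target \"" + name +
            "\" has an empty sequence; empty targets are not allowed in the index");
    }
    // Non-empty but short (<= k) targets fall through here; the index has an
    // explicit strategy for handling them.
}

int main() {
    try {
        checkTargetSequence("ENST_short_but_ok", "ACGT");  // fine: short but non-empty
        checkTargetSequence("ENST_empty", "");             // triggers the error
    } catch (const std::invalid_argument& e) {
        std::cerr << e.what() << "\n";
    }
    return 0;
}
```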
Ok, so I'm not sure if I'll be able to nail this down tonight, but it looks like simply getting rid of length-0 transcripts does not resolve this issue. So it must be some other corner case showing up in this large collection of transcripts (+ other sequences) that didn't show up when we were testing with the regular transcriptome & decoy transcriptome ... will continue to dig. |
Ok! Figured it out. It turns out it is due to short targets (targets <= k). The = in <= is because of a quirk of TwoPaCo. I thought we'd handled this corner-case appropriately, but apparently not. If you remove all targets <= k in length, everything works, but we'll have to fix this for targets <= k but > 0 on our end. I'll keep you posted! |
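A minimal, standalone sketch of the workaround described above (removing all targets of length <= k before indexing), assuming k = 31 (salmon's default) and single-line FASTA output; this is an illustration, not salmon code.

```cpp
// fasta_filter.cpp -- drop FASTA records whose sequence length is <= k.
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: fasta_filter <input.fa> <k>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    if (!in) {
        std::cerr << "could not open " << argv[1] << "\n";
        return 1;
    }
    const std::size_t k = std::stoul(argv[2]);

    std::string header, seq, line;
    auto flush = [&]() {
        // Keep only records strictly longer than k; empty records are dropped too.
        if (!header.empty() && seq.size() > k) {
            std::cout << header << "\n" << seq << "\n";
        }
        seq.clear();
    };

    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            flush();          // emit the previous record, if it qualifies
            header = line;
        } else {
            seq += line;      // concatenate wrapped sequence lines
        }
    }
    flush();                  // don't forget the final record
    return 0;
}
```

Invoked, for example, as `./fasta_filter gentrome.fa 31 > gentrome.filtered.fa` (the file names are placeholders).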
Ohhh man! What a bug! It was caused by missing parentheses around this line:
should have been
damn. So, I think this issue would potentially manifest any time we have transcripts of length <= k in the input and we enable bias correction. I just pushed the fix to develop, and we'll cut a new release soon. However, even before that, I'll drop an exe from our CI server for you to test on your end. It actually has nothing to do with indexing and everything to do with the parentheses in that one line, so you shouldn't need to rebuild any index.

Edit: the new executable can be grabbed from here. Also, I am curious whether, in your case, quantification never started, or whether the log printed out a bunch of stuff and quantification did start but took forever because we are trying to bias-correct a bunch of giant "super transcripts". Basically, after thinking about the effect of the bug, it occurs to me that the result should be log spam (an *insane* amount of log spam in this case, which itself takes a ton of time to write) and memory wastage (neither good nor intended), but not the index failing to load or wrong results (since these short transcripts will never actually be allowed to have any reads map to them anyway).

Edit edit: I tested this hypothesis, and the index loading (prior to the bug fix) did eventually finish for me. It simply took forever because of all of the spam being printed to stderr. I think the effect of this bug is greatly exacerbated by the specific transcriptome being indexed here. It will be triggered whenever there is a short sequence (<= k) that appears very close to the end of the target transcriptome, and the number of messages printed will be proportional to the length of the nearest sequence of length > k that occurs before this short sequence. Your transcriptome file has a number of short sequences near the end, some of which have a closest prior long transcript that is very long. So this was basically the perfect scenario for triggering this bug. Thanks for helping us find it! |
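The offending source line itself is not quoted in this thread, so the snippet below is purely illustrative: with made-up names (`k`, `biasCorrect`, `lens`), it shows how a mixed `&&`/`||` condition without parentheses associates differently than intended, letting short targets slip through a guard meant to skip them.

```cpp
// Illustrative only -- NOT the real salmon/pufferfish line. It demonstrates
// the operator-precedence pitfall: && binds more tightly than ||, so a
// missing pair of parentheses changes which (hypothetical) targets a guard
// condition lets through.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const uint32_t k = 31;                     // assumed k-mer length
    std::vector<uint32_t> lens{10, 31, 1000};  // hypothetical target lengths
    bool biasCorrect = true;                   // e.g. --seqBias / --gcBias requested

    for (uint32_t len : lens) {
        // Intended guard: only consider targets longer than k when bias
        // correction is on.
        bool intended = (len > k) && biasCorrect;
        // Without parentheses, a condition mixing || and && parses as
        // a || (b && c), so short-but-nonempty targets pass the guard:
        bool buggy = len > k || len > 0 && biasCorrect;
        std::cout << "len=" << len << "  intended=" << intended
                  << "  buggy=" << buggy << '\n';
    }
    return 0;
}
```

Run as written, the `buggy` guard admits the length-10 and length-31 targets that the `intended` guard excludes, mirroring the symptom described above (targets <= k being handed to the bias-correction path); most compilers flag this pattern with a "suggest parentheses" warning.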
And to top it off, this led to the discovery of another bug where populating the reference sequence used for bias correction in this case relied on the result of uninitialized memory ... so this needs to be fixed ASAP. |
I'll have lots of comments when I get back to my computer, but right now I wanted to say that I'm absolutely floored by your effort here and will test the new exe today. I was looking suspiciously at that conditional flow-control line and was pretty sure “& & & |” wasn't right, but I'm glad to see it was just order of evaluation. |
Sure thing. Thanks for finding this out! The executable above fixes the loading issue; I'll drop a new one soon that also fixes the memory-correctness issue (since we know that bug now and have resolved it upstream). |
A couple of points in response to your comments above:
|
Thanks for the detailed thoughts. Here are my thoughts^2.
|
Thanks for the v1.1.0-beta2 test build, @rob-p! I just tested in my environment and this release fixes the issues I was encountering. Edit: regarding the aux targets concept, I'd be glad to help test the feature, and the timing (Q1 2020) could work very well for me. |
@mdshw5 — Regarding the aux targets concept, what are your thoughts on allowing those to be specified along with the quantification command rather than during indexing? The benefits of this are:
On the flip side, the downsides are:
|
I actually think passing the aux list during quant is the way to go. A few thoughts:
Edit: On second thought, it’s unclear how people will use aux targets, and many use cases (lists of targets where the bias models are just problematic) might want TPM estimates using aux target models. I’d still consider whether TPM should be calculated with or without them; or, at the expense of adding some complexity, maybe keep the current behavior and add one more flag that excludes aux targets from TPM calculations. |
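For reference, the standard TPM definition below (not anything salmon-specific) makes the trade-off concrete: whether aux targets are included in the normalizing sum in the denominator rescales every other transcript's TPM.

```latex
% Standard TPM definition (not salmon-specific). \hat{c}_i is the estimated
% read count and \tilde{\ell}_i the effective length of target i; T is the
% set of targets included in the normalization.
\mathrm{TPM}_i = 10^{6}\,
  \frac{\hat{c}_i / \tilde{\ell}_i}{\sum_{j \in T} \hat{c}_j / \tilde{\ell}_j}
% Excluding aux targets from T changes only the denominator, so the remaining
% transcripts' TPMs are uniformly rescaled to again sum to 10^6.
```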
I'm opening this issue here because I believe it's more of a pufferfish-related matter, but I will use the issue template from your salmon repository.
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
salmon/pufferfish
Describe the bug
When running salmon v1.0 with a rather large index, I receive an error raised from the cereal library: "Index requested greater than vector's size". The log reads:
The index does not finish loading, and so salmon does not enter read quantification routines.
To Reproduce
Salmon obtained from: GitHub release tarball
Reference transcriptome: Gencode v32, with additional elements representing genic introns and intergenic spaces
Input reads: NCBI GEO sample accession GSM2392582
Salmon options: --no-version-check --libType ISR --threads 4 --seqBias --gcBias --useVBOpt
Expected behavior
I expected index loading to complete successfully.
Desktop (please complete the following information):
Additional context
I've uploaded the index file archive here and the transcripts fasta file here.