4Mb genome, many mutations: first line of .jsonl.gz becomes prohibitively long with amino acid changes #613
Comments
Yes, not a trivial change! (Would a streaming JSON parser be able to get around the line size I wonder?) UShER is able to load the protobuf in 2.673s.
Hah, that's a little faster. Yes maybe a streaming JSON parser would work, that's a good suggestion!
@AngieHinrichs - sorry could you do me a favour and tell me whether the latest version of the Taxonium app (which to be clear, doesn't have any fixes for this yet), is working for you at all [you probably don't want to remove your old one, but to install it alongside]. I am unable to
Actually you probably don't need to check - it looks like it almost certainly won't. I'll look into that.
Right sorry, ignore me. The latest version is OK in this regard.
Not working - sorry, back with you soon!
Sorry I missed these, but thanks for whatever you're doing, and let me know if I need to do anything (sounds like for now I don't?).
OK, I figured out how to generate a test file and unfortunately it looks like this won't help
(This isn't a problem with the overall approach, just the exact one I took)
OK, so https://cov2tree.nyc3.digitaloceanspaces.com/Taxonium-jsonlparser-arm64.dmg seems not to stall on that first line. The downside is that it may be significantly slower! Let me know how you get on tomorrow or whenever :)
Ah yeah, unfortunately you always need to run this terminal command: https://docs.taxonium.org/en/latest/app.html#installing
Doh! Sorry, I should have looked it up there. Thanks! I've tried it on a couple of files and it doesn't seem slower at all. It loaded the version without amino acid changes very quickly (a minute or less?) and I can use it normally. It loaded the version with amino acid changes tolerably fast too, but then when I clicked the 'you can access it here' link, I got an AxiosError: Network Error in the UI, and the log output ended like this:
Let me know if you'd like the .jsonl.gz files.
Ah dang! Yeah, the files would be great!
OK, I sent an email.
Thanks and thanks for testing!
(Just for info, and to keep records in this thread:) I think it may not be much slower for Mtb, but that for the big SARS-CoV-2 trees, sticking a JSONL into https://vercel.live/open-feedback/cov2tree-git-jsonlparser-theosandersons-projects.vercel.app is much slower than using taxonium.org . So in terms of actually merging this, that will be a blocker [until I figure something out] but doesn't stop us making a version for you to use. But first we need to make the backend not crash.
Ah, I see what you mean with the big tree. It makes my new laptop a bit slower than my old laptop in terms of loading time. I think it would be a problem for others with less beefy laptops. At the risk of suggesting something impractical (feel free to reject out of hand) -- since the first line of the file can dwarf the size of the subsequent lines, would it make sense to use a streaming parser only for the first line, and then parse all subsequent lines from strings as before? Again, if that sounds like a pain, never mind. 🙂
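A minimal sketch of the split idea suggested above, written in Python rather than taken from Taxonium's own code: the first line is handed to a callback in chunks (a stand-in for whatever streaming JSON parser would consume it), and every later line is parsed from a string as before. The names `split_first_line` and `on_header_chunk`, and the `mutations` / `node_id` fields in the demo, are hypothetical and only there for illustration.

```python
import gzip
import io
import json
from typing import BinaryIO, Callable, Iterator


def split_first_line(stream: BinaryIO,
                     on_header_chunk: Callable[[bytes], None],
                     chunk_size: int = 1 << 16) -> Iterator[dict]:
    """Pass the first line to on_header_chunk piece by piece, then yield
    each remaining non-empty line parsed with json.loads."""
    # Phase 1: stream the (possibly huge) first line out in chunks so this
    # function never holds it as a single string.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return  # no newline at all: the file was only a first line
        newline_at = chunk.find(b"\n")
        if newline_at == -1:
            on_header_chunk(chunk)
            continue
        on_header_chunk(chunk[:newline_at])
        buffer = chunk[newline_at + 1:]
        break

    # Phase 2: the remaining lines are small, so parse each one from a string.
    while True:
        *lines, buffer = buffer.split(b"\n")  # last piece may be incomplete
        for line in lines:
            if line.strip():
                yield json.loads(line)
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
    if buffer.strip():
        yield json.loads(buffer)  # final line without a trailing newline


def demo():
    """Build a tiny in-memory .jsonl.gz (made-up fields) and split it."""
    header = {"mutations": list(range(1000))}
    nodes = [{"node_id": i} for i in range(3)]
    raw = "\n".join(json.dumps(obj) for obj in [header, *nodes]).encode()
    pieces = []
    with gzip.open(io.BytesIO(gzip.compress(raw)), "rb") as stream:
        parsed_nodes = list(split_first_line(stream, pieces.append, chunk_size=256))
    # Joining the pieces is only for the demo; a real streaming parser would
    # consume each chunk incrementally instead of collecting them.
    return json.loads(b"".join(pieces)), parsed_nodes
```

In the demo the callback just collects chunks, which would defeat the purpose for a 1.1GB line; the point is only to show where the switch between the two parsing modes happens.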
To give a few updates:
OMG that's amazing! And I
Hah, Claude wrote most of it :)
Another update: I got the split-stream thing working. https://cov2tree-klkje8lb9-theosandersons-projects.vercel.app/ now works for your tree by "uploading" the JSONL locally without the need for the Taxonium desktop application. Still a bit slow to load the mutations - which is probably fixable by restructuring the JSONL file at some point, but not looking bad!
Amazing again! Both the digitaloceanspaces app and the vercel version work great for M.tb. My browser wasn't able to load the full SC2 pb with the vercel app (tab died), but it seemed to be going quickly enough while it was going.
(Thanks for checking everything - yes sorry in this case I only meant the Mtb tree - full SC2 will die as before)
Hi Theo - my group has a tree of 127k M. tuberculosis genomes, 212k nodes. The M.tb genome is 4.4Mb and there are many mutations in the tree. With nucleotide mutations only, the first line of the .jsonl.gz when decompressed is ~263MB. At that size, the tree takes a few minutes to load on a MacBook Pro M2 Max with 64GB RAM. It takes ~10 minutes to load on a MacBook Pro M2 with 16GB RAM (long enough for a PI to get tired of waiting and go do something else 🙂).
However, when usher_to_taxonium is run with --genbank and amino acid changes are added, the first line when decompressed is ~1.1GB and something in the taxonium app's back end dies with this error:
Then the UI just freezes and never finishes loading.
So for now we'll do without the amino acid changes, and go do something else while the nuc-only version loads. But we were hoping you'd have some ideas about how to magically speed up the initial load when there are so many mutations. 🙂
I can share the tree files offline if you would like to test them out on your end.
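For anyone wanting to check how large the first line of their own .jsonl.gz is, here is a small sketch using only the Python standard library. It counts the decompressed bytes before the first newline without holding the line in memory. `first_line_size` and `first_line_size_of_file` are made-up helper names, not part of usher_to_taxonium or Taxonium.

```python
import gzip
from typing import BinaryIO


def first_line_size(stream: BinaryIO, chunk_size: int = 1 << 20) -> int:
    """Return the number of bytes before the first newline in stream
    (or the whole stream length if there is no newline)."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total  # reached end of data without seeing a newline
        newline_at = chunk.find(b"\n")
        if newline_at != -1:
            return total + newline_at
        total += len(chunk)


def first_line_size_of_file(path: str) -> int:
    """Same measurement for a gzip-compressed file on disk."""
    with gzip.open(path, "rb") as stream:
        return first_line_size(stream)
```

Calling `first_line_size_of_file` on a tree file and dividing by 1e6 gives the first-line size in MB, which can be compared against the figures quoted in this post.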