Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4Mb genome, many mutations: first line of .jsonl.gz becomes prohibitively long with amino acid changes #613

Closed
AngieHinrichs opened this issue Sep 30, 2024 · 26 comments · Fixed by #619

Comments

@AngieHinrichs
Copy link
Contributor

AngieHinrichs commented Sep 30, 2024

Hi Theo - my group has a tree of 127k M. tuberculosis genomes, 212k nodes. The M.tb genome is 4.4Mb and there are many mutations in the tree. With nucleotide mutations only, the first line of the .jsonl.gz when decompressed is ~263MB. At that size, the tree takes a few minutes to load on a MacBook Pro M2 Max with 64GB RAM. It takes ~10 minutes to load on a MacBook Pro M2 with 16GB RAM (long enough for a PI to get tired of waiting and go do something else 🙂).

However, when usher_to_taxonium is run with --genbank and amino acid changes are added, the first line when decompressed is ~1.1GB and something in the taxonium app's back end dies with this error:

sending message
stderr: file:///Applications/Taxonium.app/Contents/Resources/app/node_modules/taxonium_data_handling/importing.js:62
    cur_line += data.toString();
                     ^

RangeError: Invalid string length
    at Gunzip.<anonymous> (file:///Applications/Taxonium.app/Contents/Resources/app/node_modules/taxonium_data_handling/importing.js:62:22)
    at Gunzip.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:324:12)
    at readableAddChunk (node:internal/streams/readable:297:9)
    at Readable.push (node:internal/streams/readable:234:10)
    at Zlib.processCallback (node:zlib:566:10)

Node.js v18.12.1

Then the UI just freezes and never finishes loading.

So for now we'll do without the amino acid changes, and go do something else while the nuc-only version loads. But we were hoping you'd have some ideas about how to magically speed up the initial load when there are so many mutations. 🙂

I can share the tree files offline if you would like to test them out on your end.

@theosanderson
Copy link
Owner

Thanks Angie - makes sense :( :
image

This definitely doesn't sound impossible to deal with - there is no reason all that stuff has to be in a single line / single JSON object - it's just about finding time and a backwards compatible way to do it. I've also thought about trying to rewrite the backend in some proper compiled language but the time horizons on that are probably longer. Out of interest, how long does it take to load the tree in UShER? (just to understand very roughly how much faster this could get things).

@AngieHinrichs
Copy link
Contributor Author

Yes, not a trivial change! (Would a streaming JSON parser be able to get around the line size I wonder?) UShER is able to load the protobuf in 2.673s.

@theosanderson
Copy link
Owner

Hah, that's a little faster.

Yes maybe a streaming JSON parser would work, that's a good suggestion!

@theosanderson
Copy link
Owner

@AngieHinrichs - sorry could you do me a favour and tell me whether the latest version of the Taxonium app (which to be clear, doesn't have any fixes for this yet), is working for you at all [you probably don't want to remove your old one, but to install it alongside]. I am unable to xattr it but am not quite sure if that's my recently changed system or universal, and I don't have another machine to test it on atm.

@theosanderson
Copy link
Owner

Actually you probably don't need to check - it looks like it almost certainly won't. I'll look into that.

@theosanderson
Copy link
Owner

Right sorry, ignore me. The latest version is OK in this regard.

@theosanderson
Copy link
Owner

theosanderson commented Sep 30, 2024

Sorry for the stream of consciousness above.

https://cov2tree.nyc3.cdn.digitaloceanspaces.com/Taxonium-jsonlparser-arm64.dmg might resolve your issue. If you have a chance to try it over the next couple of days I'd be keen to hear how it goes!

Not working - sorry, back with you soon!

@AngieHinrichs
Copy link
Contributor Author

Sorry I missed these, but thanks for whatever you're doing, and let me know if I need to do anything (sounds like for now I don't?).

@theosanderson
Copy link
Owner

theosanderson commented Sep 30, 2024

Hi @AngieHinrichs , OK take two. Please could you try https://cov2tree.nyc3.digitaloceanspaces.com/Taxonium-jsonlparser-arm64.dmg with your Mtb file. (No rush.)

@theosanderson
Copy link
Owner

OK, I figured out how to generate a test file and unfortunately it looks like this won't help

@theosanderson
Copy link
Owner

(This isn't a problem with the overall approach, just the exact one I took)

@theosanderson
Copy link
Owner

OK, so https://cov2tree.nyc3.digitaloceanspaces.com/Taxonium-jsonlparser-arm64.dmg seems not to stall on that first line. The downside is that it may be significantly slower! Let me know how you get on tomorrow or whenever :)

@AngieHinrichs
Copy link
Contributor Author

AngieHinrichs commented Oct 1, 2024

Thank you Theo! My Mac is giving me this garbage -- is there some magic command-line thing I need to do, to tell MacOS to trust it or something?
image
[edit: fortunately I moved aside the Taxonium.app that was already installed, and I can still run that, but for some reason my Mac does not like the new one.]

@theosanderson
Copy link
Owner

Ah yeah, unfortunately you always need to run this terminal command: https://docs.taxonium.org/en/latest/app.html#installing

@AngieHinrichs
Copy link
Contributor Author

Doh! Sorry, I should have looked it up there. Thanks!

I've tried it on a couple of files and it doesn't seem slower at all. It loaded the version without amino acid changes very quickly (a minute or less?) and I can use it normally. It loaded the version with amino acid changes tolerably fast too, but then when I clicked the 'you can access it here' link, I got an AxiosError: Network Error in the UI, and the log output ended like this:

stdout: Adding parents took 50ms.
Went from 21505 to 64433 nodes.

stdout: Saved cached starting vals

stdout: Starting to listen

stdout: Non SSL on port 13069

stdout: status { status: 'loaded' }

message: [object Object]
{ status: 'loaded' }
stderr: /Applications/Taxonium.app/Contents/Resources/app/node_modules/express/lib/response.js:1150
    : JSON.stringify(value);
           ^

RangeError: Invalid string length
    at JSON.stringify (<anonymous>)
    at stringify (/Applications/Taxonium.app/Contents/Resources/app/node_modules/express/lib/response.js:1150:12)
    at ServerResponse.json (/Applications/Taxonium.app/Contents/Resources/app/node_modules/express/lib/response.js:271:14)
    at ServerResponse.send (/Applications/Taxonium.app/Contents/Resources/app/node_modules/express/lib/response.js:162:21)
    at validateSIDandSend (/Applications/Taxonium.app/Contents/Resources/app/node_modules/taxonium_backend/server.js:347:9)
    at /Applications/Taxonium.app/Contents/Resources/app/node_modules/taxonium_backend/server.js:216:3
    at Layer.handle [as handle_request] (/Applications/Taxonium.app/Contents/Resources/app/node_modules/express/lib/router/layer.js:95:5)
    at next (/Applications/Taxonium.app/Contents/Resources/app/node_modules/express/lib/router/route.js:144:13)
    at Route.dispatch (/Applications/Taxonium.app/Contents/Resources/app/node_modules/express/lib/router/route.js:114:3)
    at Layer.handle [as handle_request] (/Applications/Taxonium.app/Contents/Resources/app/node_modules/express/lib/router/layer.js:95:5)

Node.js v18.12.1

Let me know if you'd like the .jsonl.gz files.

@theosanderson
Copy link
Owner

Ah dang! Yeah, the files would be great!

@AngieHinrichs
Copy link
Contributor Author

OK, I sent an email.

@theosanderson
Copy link
Owner

Thanks and thanks for testing!

@theosanderson
Copy link
Owner

(Just for info, and to keep records in this thread:) I think it may not be much slower for Mtb, but that for the big SARS-CoV-2 trees, sticking a JSONL into https://vercel.live/open-feedback/cov2tree-git-jsonlparser-theosandersons-projects.vercel.app is much slower than using taxonium.org . So in terms of actually merging this, that will be a blocker [until I figure something out] but doesn't stop us making a version for you to use. But first we need to make the backend not crash.

@AngieHinrichs
Copy link
Contributor Author

Ah, I see what you mean with the big tree. It makes my new laptop a bit slower than my old laptop in terms of loading time. I think it would be a problem for others with less beefy laptops.

At the risk of suggesting something impractical (feel free to reject out of hand) -- since the first line of the file can dwarf the size of the subsequent lines, would it make sense to use a streaming parser only for the first line, and then parse all subsequent lines from strings as before? Again, if that sounds like a pain, never mind. 🙂

@theosanderson
Copy link
Owner

Hi Angie!
Success:
image

To give a few updates:

  • I played around with reimplementing the backend in Rust. Despite not knowing Rust, I managed to get something vaguely working (https://github.com/theosanderson/taxrust) for the core functionality - but I ultimately have decided for now that this doesn't improve performance that much and that using it in the browser would create memory issues
  • The second error you encountered came from the fact that communication between the backend and the frontend involves an endpoint that returned all the mutations in one go, causing the same 1 GB issue. I addressed this in New server side endpoint for getting (all) mutations, which provides data in chunks #617 which I've merged to main - streaming mutations in chunks.
  • I have rebased the streaming the JSONL for input PR - Use jsonlParser #614 - on that and it now works for your file! That is not merged to main because of the performance issues. But the app is available to download at https://cov2tree.nyc3.digitaloceanspaces.com/Taxonium-jsonlparser-arm64.dmg . Let me know if you hit issues.
  • I still need to figure out the performance stuff before merging this. Your suggestion of just streaming the first line is absolutely right but JSON streaming code is a bit of a nightmare (at least when like me one doesn't know it very well) so I would need to figure out how to do it.

@AngieHinrichs
Copy link
Contributor Author

OMG that's amazing! And I can't totally can believe you rewrote the backend in Rust "despite not knowing Rust." 🤯 😆 Along with all that you've got going on now with your new position at LSHTM 🎉 . I will give the new app version a try, and will remember xattr this time. Thanks so much!

@theosanderson
Copy link
Owner

Hah, Claude wrote most of it :)

@theosanderson
Copy link
Owner

Another update: I got the split-stream thing working. https://cov2tree-klkje8lb9-theosandersons-projects.vercel.app/ now works for your tree by "uploading" the JSONL locally without the need for the Taxonium desktop application. Still a bit slow to load the mutations - which is probably fixable by restructuring the JSONL file at some point, but not looking bad!

@AngieHinrichs
Copy link
Contributor Author

Amazing again! Both the digitaloceanspaces app and the vercel version work great for M.tb. My browser wasn't able to load the full SC2 pb with the vercel app (tab died), but it seemed to be going quickly enough while it was going.

@theosanderson
Copy link
Owner

(Thanks for checking everything - yes sorry in this case I only meant the Mtb tree - full SC2 will die as before)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants