Allow server/client to speak gzip, especially as part of the Submit_targets protocol #543

Open
armish opened this issue Jun 26, 2017 · 1 comment

armish commented Jun 26, 2017

This is often not a big issue, but when you are sequentially submitting more than 10 complete epidisco workflows to a single server, sending them all does take some time, especially if/when you try to do that in parallel (to different servers or, worse, to the same one).

I originally thought that this was more about the server taking time to do the equivalence checks on its side before officially putting the OK stamp on it; but while I was playing with ketrew JSONs the other day, I realized that a single patient's workflow serialized in JSON format was ~30 MB! I then tried to send this via curl, and it is not that the transfer completes and then the server makes us wait; it is actually transferring the file itself that takes so long.

The obvious solution, of course, is to support gzipped content delivery between the server and the client, which would be a natural extension of the HTTPS-based API you have designed. And there is also this:

$ du -sh all-in-epidisco-workflow.json*
 29M	all-in-epidisco-workflow.json
720K	all-in-epidisco-workflow.json.gz

Poking around a bit, I gladly saw that Cohttp_lwt at least supports the relevant header/response formats, so I think all we need is to get the gzip/no-gzip logic into the pre-/post-serialization parts, and we will have a blazingly fast submission experience from then on (unless, of course, we DDoS ketrew with all those decompression tasks, which, by the way, could be handled by another helper virtual machine in the container; but that is for another day :))
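
A minimal sketch of what this could look like, assuming the `ezgzip` and `cohttp-lwt-unix` opam packages; the URI, the payload, and the wiring into Ketrew's actual protocol code are placeholders, not the real API:

```ocaml
(* Sketch: gzip the serialized workflow before POSTing it, and advertise it
   with a Content-Encoding header so the server knows to inflate it.
   Assumes the ezgzip and cohttp-lwt-unix packages; server_uri and
   workflow_json are placeholders. *)
open Lwt.Infix

let submit_gzipped ~server_uri ~workflow_json =
  let compressed = Ezgzip.compress workflow_json in
  let headers =
    Cohttp.Header.of_list
      [ ("content-type", "application/json");
        ("content-encoding", "gzip") ]
  in
  Cohttp_lwt_unix.Client.post
    ~headers
    ~body:(Cohttp_lwt.Body.of_string compressed)
    (Uri.of_string server_uri)
  >>= fun (response, body) ->
  Cohttp_lwt.Body.to_string body >|= fun body_str ->
  (Cohttp.Response.status response, body_str)

(* On the server side, the handler would peek at the header and inflate
   only when the client actually sent gzip: *)
let maybe_decompress request raw_body =
  match
    Cohttp.Header.get (Cohttp.Request.headers request) "content-encoding"
  with
  | Some "gzip" ->
      (match Ezgzip.decompress raw_body with
       | Ok json -> json
       | Error _ -> raw_body (* fall back; real code should report the error *))
  | _ -> raw_body
```

Keeping the negotiation header-based like this means old clients that never send `Content-Encoding: gzip` keep working unchanged.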

(Maybe you have already tried this and moved away from it; in that case, feel free to ignore this, but I would be curious about what went wrong there.)

smondet commented Jun 26, 2017

  • indeed, equivalence checks do not impact the actual submission time
    (ketrew stores the submission all at once, answers the HTTP request, and
    then the engine does the equivalence checking + adding → that's the delay
    between the "workflow received" notification and the presence in the
    node-table in the WUI).
    • the equivalence computation is not exactly the bottleneck either (it is
      the DB interaction to get the "equivalence candidates" + adding the
      workflow); even with ketrew compiled to bytecode, the DB interactions are
      slower than the pure-OCaml equivalence computation.
  • so yes, gzipping is worth a try
  • I've also noticed that (especially with 30 to 300 MB submissions) there is
    a huge difference between bytecode (when you run ocaml submit.ml it is
    bytecode-compiled) and native executables.
    • I want to try OpenSSL vs OCamlTLS in bytecode to see whether the perf
      problem is at that level.

To reduce the stress of the check-equivalence + add-to-engine step, I'll try 2 things:

  • do the “adding” of only one workflow at a time (processing the whole queue
    at once can pause the engine for quite a long time, which makes the user
    think something is broken).
  • implement workflow “namespaces” (i.e. check equivalence only between nodes
    that belong to a given user-defined subset of the currently active universe;
    e.g. in Epidisco we can use the experiment-name or “biokepi-setup” as
    independent namespaces); see the sketch after this list.
