Parsl-Vine: Serial Task Latency, Round 2 #3894
Using parsl-perf. It's submitting a bunch of tasks, not sequential tasks - so this issue may be mistitled. Baseline (parsl with in-process execution - no fancy execution tech at all):
Using parsl's HighThroughputExecutor:
Using Work Queue + coprocesses: this one slows down quite dramatically as you increase the target time (i.e. increase the number of tasks submitted at once), so I think there's a linear-access-cost data structure somewhere...
So all of the above are in a pretty decent 300-2000 tasks per second range. We probably don't need to quibble too much about the exact numbers, because even things like changing the Python logging configuration can shift them drastically at this task rate. Using TaskVine (in regular mode; performance is only slightly better with serverless mode enabled):
With different target durations (and so different numbers of tasks submitted) this is pretty consistent at around 2 tasks/second - with a 10 minute target runtime, the resulting 1043 tasks execute/complete at 1.7 tasks per second. That ~2 tasks/second figure, much lower than the 300-2000 tasks/second of the other executors, is what concerns me. I have not dug into this any deeper - it just came prominently to my attention when talking to @gpauloski about what became #3892. It's probably worth me spending an hour or two on this from a parsl perspective, though.
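(The command lines and measured outputs for the runs above were not captured here. For concreteness, this is a minimal sketch of the kind of config file parsl-perf can be pointed at for the in-process baseline - the file name, label, and thread count are illustrative assumptions, not the actual setup used.)

```python
# baseline_config.py - hypothetical in-process baseline for parsl-perf,
# run with something like: parsl-perf --config baseline_config.py
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

# Thread-based execution inside the submitting process: no workers and
# no extra processes, so this measures parsl's own per-task overhead.
config = Config(executors=[ThreadPoolExecutor(label="baseline", max_threads=8)])
```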
Yikes, something is way off there.
Attached is the parsl rundir generated by the run above; TaskVine logs and other parsl files are in there.
Ok, so here is what I can see from a quick look at the manager log:
@colinthomas-z80 I'll let you take it from here - can you set up the Python benchmark and see what we get?
Telling tasks to use a single core does speed things up by 2.5x-ish.
To avoid Python startup costs, I guess I should be pushing on serverless mode, in the same way that we prefer coprocess mode with WQ... That worked earlier today (with slightly improved performance, but still around the 2 tasks/second rate) when I was using a build from April, but it now seems to be failing with a cctools build from today. I'll debug that a little and write more here. Serverless mode isn't tested in our CI - although I would like it to be.
OK, it looked like I needed to [...]. Here's a config for serverless mode (not in our git repo, that's why I'm pasting it here):
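(The pasted config itself did not survive here; below is a hedged sketch of the general shape a TaskVine serverless config takes. The label, launch method, and port are illustrative, not the values actually used.)

```python
from parsl.config import Config
from parsl.executors.taskvine import TaskVineExecutor, TaskVineManagerConfig

config = Config(
    executors=[
        TaskVineExecutor(
            label="taskvine",
            # "serverless" routes apps through long-lived library processes
            # (FunctionCall) instead of one fresh Python process per task.
            function_exec_mode="serverless",
            worker_launch_method="factory",
            manager_config=TaskVineManagerConfig(port=9123),
        )
    ]
)
```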
Yep, something stinks there - thank you for pointing it out.
I get proportionally the same result on a 16 core machine.
Continuing some tests. I don't think we are seeing the same blocking wait cycle as before.
So far there seem to be two components to the issue. The first is a bug currently being worked on in #3889: when running n-core tasks on a local n-core worker, the requested disk allocation will exceed the remaining disk and cause the task to be forsaken. This happens periodically while running the performance benchmark, and it is likely the cause of the variability in throughput, since its nature is unpredictable - but even if it never occurred, throughput would still be limited to around 4 tasks per second. The main throughput issue seems to reside on the worker, in the overhead of the exec_parsl_function.py script. The running time of each task recorded at the worker is around 0.24 seconds, which accounts for just about all of the time implied by a 4 tasks-per-second ceiling. I will study the script and perhaps wrestle with a Python gprof equivalent. @tphung3 might have some ideas as well.
That's interesting. Work Queue has a similar exec_parsl_function.py - I wonder what the number is for WQ. @colinthomas-z80, the fastest way to measure the overhead of exec_parsl_function.py alone is probably to temporarily stick some logging at the start and end of that script. This should help locate whether the true overhead is in the script or in the TaskVine worker.
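(A minimal sketch of that measurement - bracket the whole script with timestamps written to stderr. The structure below is illustrative, not the actual contents of exec_parsl_function.py.)

```python
import sys
import time

t_start = time.time()
print(f"exec_parsl_function: start {t_start:.6f}", file=sys.stderr)

# ... the script's real work: deserialize the function and its arguments,
# run the function, serialize the result to the result file ...

t_end = time.time()
print(f"exec_parsl_function: end {t_end:.6f} "
      f"({t_end - t_start:.3f}s elapsed)", file=sys.stderr)
```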
@tphung3 I pasted an example run for WQ using coprocesses in an earlier comment, as well as TaskVine + serverless mode, which I think is supposed to be roughly equivalent? There I get around 500 tasks per second when parsl-perf submits 4000 tasks, dropping off as it submits bigger batches. If I turn off coprocesses and use this config, I get around 3 tasks per second:
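(The config referred to here was not captured; this is a hedged sketch of what a non-coprocess Work Queue config looks like, assuming WorkQueueExecutor's coprocess flag - label and port are illustrative.)

```python
from parsl.config import Config
from parsl.executors import WorkQueueExecutor

config = Config(
    executors=[
        # With coprocess=False, each task launches a fresh Python process
        # running exec_parsl_function.py - the slow path measured above.
        WorkQueueExecutor(label="wq", port=9123, coprocess=False)
    ]
)
```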
So I feel like for this kind of performance the coprocess-equivalent mode (not starting a new Python script each time) is the way to go - which is the TaskVine serverless example I pasted above, right? Loading a new Python environment per task is going to be stuck like this no matter how efficient you make the launching of that heavyweight process, so it is futile to pursue here, I think.
So I'll phrase my initial problem a little differently, perhaps: when I use the TaskVine mode that I thought was meant to make stuff go faster by not launching a new Python process per task, it does not go faster.
Right, the problem here is not so much that TaskVine+PythonTask has overhead (we knew that) but that TaskVine+FunctionCall is not significantly faster. @colinthomas-z80, what numbers do you get with the TaskVine serverless config? I am wondering if perhaps the executed function has an internal import that needs to be hoisted...
Unless you're doing something to change import behaviour, running an import that has already been imported into the current Python process should be low-cost, I think - roughly the cost of looking up whether that import happened already. E.g., the cost to start a Python process that imports parsl:
The cost to start a Python process that imports parsl 1,000,000 times in a loop:
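(The timing outputs were not captured here, but the experiment is easy to reproduce; a self-contained sketch, with exact numbers varying by machine:)

```python
import time

t0 = time.perf_counter()
import parsl  # first import: pays the full module-loading cost
t1 = time.perf_counter()

for _ in range(1_000_000):
    import parsl  # already in sys.modules: effectively a dict lookup

t2 = time.perf_counter()
print(f"first import:             {t1 - t0:.3f}s")
print(f"1,000,000 cached imports: {t2 - t1:.3f}s")
```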
About the same for me.
@gpauloski Could you send me the logs of some of the runs you are having trouble with?
@JinZhou5042 you can find logs for the runs causing problems in this issue (#3894) by running the commands that @colinthomas-z80 pasted above and looking in the runinfo/ directory. @gpauloski's issue is #3892 and is different from this one - his issue does not directly involve Parsl or the Parsl+TaskVine interface.
Using a basic benchmark of a TaskVine PythonTask without parsl, I achieved a throughput of 8-10 tasks per second. I also approximated the parsl environment by adding some files to the no-op task - the files that would usually be the argument, map, and result files when using TaskVine/Parsl - and it did not significantly affect the result. Although this is slightly better than the TaskVine executor performance, I think we have established the limits of creating a new Python process for each task.
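(For reference, a hedged sketch of such a plain-TaskVine PythonTask benchmark - not the actual script. The port, task count, and no-op function are illustrative, and a worker must be attached, e.g. with vine_worker.)

```python
import time
import ndcctools.taskvine as vine

def noop():
    return None

m = vine.Manager(port=9123)  # then run: vine_worker localhost 9123

N = 100
for _ in range(N):
    m.submit(vine.PythonTask(noop))

start = time.perf_counter()
done = 0
while not m.empty():
    t = m.wait(5)
    if t:
        done += 1
elapsed = time.perf_counter() - start
print(f"{done} tasks in {elapsed:.1f}s = {done / elapsed:.1f} tasks/s")
```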
I think that's totally within the expected performance of PythonTask. So now let's figure out why serverless is not doing substantially better than that...
It looks like your software environment is |
@JinZhou5042 once you've got a consistent environment with @colinthomas-z80, as @dthain suggests - if it still fails in the way you are showing above, I can see that it successfully ran a few tasks, so runinfo/ should have some more info about what's going on when the second batch (of 71 tasks) is submitted.
With #3909 merged in, we now have a relatively consistent result using a serverless config with the performance benchmark. However, the throughput achieved is only marginally better than a regular PythonTask: around 10-12 tasks/second on my local machine. In contrast to function invocations purely using the TaskVine API, the implementation using parsl/taskvine requires the serialization of a number of files per invocation, and perhaps some redundant communication not present in the base implementation. My understanding is that this is necessary to accommodate the runtime-declared functions given to the executor by parsl, rather than the static library-creation approach we have here. We will need to find whether improvements can be made, or whether inspiration can be taken from the WQ coprocess, which we see performs well - perhaps at some expense to the "plug and play" nature of the current implementation.
I opened a PR at parsl adding the parsl serialize module to the "hoisted" preamble of the library context. It seems we were not effectively caching the module beforehand. Doing this has improved throughput in parsl-perf to 30-40 tasks per second on a single core.
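(A hedged sketch of what hoisting looks like at the TaskVine API level, assuming the hoisting_modules parameter of create_library_from_functions; the library name, function, and port are illustrative, and this is not the PR's actual code.)

```python
import ndcctools.taskvine as vine
import parsl.serialize  # the module being hoisted into the library preamble

def double(x):
    return x * 2

m = vine.Manager(port=9123)

# Hoisted modules are imported once when the long-lived library process
# starts, rather than re-imported on every FunctionCall invocation.
lib = m.create_library_from_functions(
    "example-library", double, hoisting_modules=[parsl.serialize]
)
m.install_library(lib)

m.submit(vine.FunctionCall("example-library", "double", 21))
while not m.empty():
    t = m.wait(5)
    if t:
        print(t.output)  # expect 42
```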
@colinthomas-z80 fixed some corner cases in the Parsl-TaskVine executor in #3432 that improved the sequential-chained task latency from about 1 task/s to about 6 tasks/s. However, we should be able to do better than this: @gpauloski observed similar performance in #3892.
While this isn't the ideal use case for TaskVine (it would be better to submit many tasks at once), it is a useful benchmark that stresses the critical path through the executor module. Let's figure out what the bottleneck is and how we can do better.
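(For concreteness, a minimal sketch of a sequential-chained benchmark of this shape - the app and chain length are illustrative; swap in a TaskVine config to measure the executor under discussion.)

```python
import parsl
from parsl import python_app

parsl.load()  # default local config; replace with a TaskVine config to test it

@python_app
def increment(x):
    return x + 1

f = increment(0)
for _ in range(99):
    f = increment(f)  # each task consumes the previous future: a serial chain

print(f.result())  # 100, after 100 strictly sequential tasks
parsl.dfk().cleanup()
```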