Multiple ThreadPoolExecutors #4655

Closed
mrocklin opened this issue Mar 31, 2021 · 14 comments · Fixed by #4869

@mrocklin
Member

mrocklin commented Mar 31, 2021

(I think that I've raised this before, but I couldn't find it. I suspect that it was part of commentary on an issue rather than a standalone issue itself)

Today we run all tasks in a ThreadPoolExecutor living at Worker.executor. We default the size of this executor to the number of logical CPU cores on a machine. This works great most of the time, but there are some cases where we would like something different.

  1. I/O-related tasks we could consider running on the event loop itself, or with a separate Tornado-based AsyncExecutor
  2. For GPU-related tasks we would prefer a separate executor with a single thread (or, in the near future, a few threads)
  3. For noxious tasks that leak memory, folks have asked for a separate ProcessPoolExecutor
  4. Some folks have asked for a special executor for tasks with resource restrictions
  5. Actors already run on their own executor today

In practice, the GPU pool is probably the most common case today.

So perhaps we should encode multiple executors into the Worker, and have tasks split between them based on annotations/resources/gpu flags.

executor = self.executors[task.executor or "cpu"]
self.submit_on_executor(executor, task, *args, **kwargs)
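A rough sketch of how the worker might hold those executors (everything here is hypothetical, not an existing API):

import os
from concurrent.futures import ThreadPoolExecutor

# Named executors on the worker; each task picks one via its annotation.
executors = {
    "cpu": ThreadPoolExecutor(max_workers=os.cpu_count()),
    "gpu": ThreadPoolExecutor(max_workers=1),  # single thread for GPU work
}

def submit_on_executor(executor, task, *args, **kwargs):
    # Run the task's function on the chosen pool.
    return executor.submit(task.function, *args, **kwargs)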

cc @dask/gpu

@jakirkham
Member

Would this alleviate the need for secede/rejoin as well given another thread could always pick up new work? Or would that not work for some reason?

@mrocklin
Member Author

No. This is unrelated to secede/rejoin, which is useful for tasks that submit other tasks.

@gjoseph92
Collaborator

I/O related tasks we could consider running on the event loop itself, or with a separate Tornado based AsyncExecutor

I'm personally interested in this. It would be nice if you could use async/coroutines for IO alongside CPU-bound tasks for processing, and not have to manually juggle the event loop. For example, I'd like to use aiocogeo to read GeoTIFFs into ndarrays, then process the arrays in a normal threadpool.

What would the semantics be for controlling which tasks run in which executors? For async in particular, would we want logic that automatically awaits coroutine tasks before passing their results into non-coroutine tasks, or would we require users/collections to control that manually?

@mrocklin
Member Author

mrocklin commented Apr 2, 2021

For things like GPUs there might be a few different mechanisms; there is a related GPU issue at #4656. For async tasks I think we can identify these pretty easily by checking whether the function is an async function or not. For other executors we would probably rely on annotations?
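A minimal sketch of that detection (the executor names are made up):

import asyncio

def pick_executor(func, annotation=None):
    # An explicit annotation wins; otherwise coroutine functions go to
    # the event loop and everything else to the default thread pool.
    if annotation is not None:
        return annotation
    if asyncio.iscoroutinefunction(func):
        return "async"
    return "cpu"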

@jakirkham
Member

FWIW there is a coroutine executor. Not sure if that is helpful for the IO use case

@jakirkham
Member

One thing to be aware of is that we have been exploring per-thread default streams (PTDS) (rapidsai/dask-cuda#96, rapidsai/dask-cuda#517). This maps closely onto what we have today, since using multiple threads translates into using multiple streams on the GPU.
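As a rough illustration of that thread-to-stream mapping, a sketch assuming cupy (the helper itself is made up):

import threading
import cupy

_local = threading.local()

def run_on_gpu(func, *args):
    # Each worker thread lazily creates its own CUDA stream, so using
    # multiple threads translates into multiple streams on the GPU.
    if not hasattr(_local, "stream"):
        _local.stream = cupy.cuda.Stream(non_blocking=True)
    with _local.stream:
        return func(*args)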

@mrocklin
Member Author

mrocklin commented Jun 4, 2021

FWIW there is a coroutine executor. Not sure if that is helpful for the IO use case

That's actually a really interesting point. cc'ing @martindurant who has thought about this in the past.

I suspect that if we had layers that were strictly IO, and then marked those layers with the appropriate annotation, then everything here might just work? (I think that high level fusion doesn't cross differences in annotation)
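For illustration, a strictly-IO layer might carry its annotation like this (read_blocks and parse_blocks are made-up helpers, and the "io" executor name is an assumption; dask.annotate is the existing annotation mechanism):

import dask

with dask.annotate(executor="io"):
    blocks = read_blocks("s3://bucket/*.csv")  # hypothetical IO-only layer

with dask.annotate(executor="cpu"):
    dfs = parse_blocks(blocks)  # hypothetical CPU-bound parsing layer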

@jakirkham
Member

Was also thinking about that in the context of spilling

@martindurant
Member

layers that were strictly IO

This is almost never the case. Loading bytes often needs some CPU (e.g., gzip decompression of HTTP responses, which could perhaps be offloaded to a thread), and for dask the byte loading usually forms only part of a given task. For the simplest example, a zarr load may fetch several chunks concurrently on the event loop and then decode them synchronously on the worker thread (this can be quite a speed-up). Other loaders like CSV and parquet do not even use fsspec's async layer directly or fetch bytes concurrently.

cf. dask/dask#7557: loads multiple pieces of parquet in each task, but the backend is calling open/read synchronously. The improvement is in skipping pd.concat.

cf. dask/fastparquet#619, which can explicitly fetch multiple file-metadata chunks concurrently by calling fsspec (for a backend that supports it, currently HTTP/S3).
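The zarr pattern above might look roughly like this (the paths and decode() are placeholders; fs.cat does fetch a batch of keys concurrently on async-capable backends):

import fsspec

fs = fsspec.filesystem("s3")
# Concurrent fetch of several chunks on the event loop...
raw = fs.cat(["bucket/array/0.0", "bucket/array/0.1"])
# ...then synchronous decompression/decoding on the worker thread.
chunks = [decode(blob) for blob in raw.values()]  # decode() is a placeholder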

@jakirkham
Member

This might be a harebrained idea, but I haven't quite shaken it. We might want to explore this new functionality Mads has added with a custom Executor using CUDA streams. I wrote it up as an issue (rapidsai/dask-cuda#641) if others have thoughts.

@mrocklin
Member Author

mrocklin commented Jun 4, 2021

I suspect that if we had layers that were strictly IO

This is almost never the case

Yeah, to be clear, I'm saying that if we were to change how dask collections handle IO, moving read_bytes calls into fully separable tasks, then we could take advantage of this. You had mentioned this in the past, I think.

It wouldn't work for Zarr, you're right, because that abstraction hides I/O from us, but it could work for Parquet, CSV, and others if we wanted to make that explicit split. I'm not suggesting that we do this today, or any time in the moderate future.

@jakirkham
Member

It wouldn't work for Zarr, you're right, because that abstraction hides I/O from us, but it could work for Parquet, CSV, and others if we wanted to make that explicit split. I'm not suggesting that we do this today, or any time in the moderate future.

It might work if we supplied that in a MutableMapping or fsspec-based object that Zarr could consume.
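Presumably something like the following (the path is made up; fsspec mappers are an existing mechanism zarr can consume):

import fsspec
import zarr

# Hand zarr an fsspec-backed mapping whose IO behaviour we control;
# zarr still runs its decoding synchronously on top of it.
store = fsspec.get_mapper("s3://bucket/dataset.zarr")
arr = zarr.open(store, mode="r")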

@martindurant
Member

@jakirkham: that's already the case, and you could set the fsspec backend's loop to be the one it needs to be; but zarr will still do its decoding synchronously. You'd have to pass the filters down to the storage layer and replicate the work there, but then it would no longer be pure IO.

I think a rewrite in which we can fetch multiple blocks of bytes in a single task and pass them to a separate dataframe-making task (without concat!) would work well for CSV. Parquet, and just about anything else where we don't pass bytes around, is more complicated. Fastparquet, for example, isn't interested in running in multiple threads the way arrow can, because "dask can solve that case" (not that it does a good job of releasing the GIL).
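A sketch of that split for CSV (the helpers are hypothetical, and header handling is ignored for brevity): one task fetches several byte blocks at once, the next parses them into a single DataFrame, skipping the per-block parse plus pd.concat:

import io
import pandas as pd

def fetch_blocks(fs, paths):
    # One task: fetch several byte blocks in one go
    # (concurrent on async-capable fsspec backends).
    return list(fs.cat(paths).values())

def make_dataframe(blocks):
    # Next task: a single parse over the joined bytes instead of
    # parsing each block separately and calling pd.concat.
    return pd.read_csv(io.BytesIO(b"".join(blocks)))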

Note that the PR I linked above for fastparquet improved dataset open time by 10x on s3 without a _metadata file (one of the test datasets with many files).

@martindurant
Member

Recent example of increasing the chunk size (of the dask task - same on disk) in zarr: https://nbviewer.jupyter.org/gist/rsignell-usgs/9ccb9c18d4c1bf2205561387837d6868

Time went from 30s to 20s; can't readily tell on the surface how much of that was IO/latency.
