Run sync task in its own subprocess #1161

medihack · 2024-08-15T21:38:19Z

Currently, a synchronous task runs in its own thread (since v2.13.0 / PR #1160). As discussed in #1156, we should evaluate whether we want to run a synchronous task in its own subprocess.

Advantages:

- Good for CPU-intensive work (for I/O-intensive work, we already have async tasks)
- Shutdown and timeout handling is easier, as subprocesses can be killed
- Tasks are more isolated (no threading issues)

Disadvantages:

- Complex implementation
  - Communication between job and worker (signals)
  - Calling the job_manager (database context, connector, ...)
- Problematic Windows support (no forking of processes on Windows)
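
To make the shutdown/timeout point concrete, here is a minimal sketch of running a sync callable in its own process and killing it when it overruns. This is not Procrastinate code: `run_in_subprocess`, `TaskTimeout`, and the `start_method` parameter are hypothetical names chosen for illustration, and only standard `multiprocessing` calls are used.

```python
import multiprocessing


class TaskTimeout(Exception):
    """Raised when the task did not finish within the allowed time."""


def _child_main(conn, func, args):
    # Runs inside the subprocess: execute the task and ship the outcome back.
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_in_subprocess(func, args=(), timeout=5.0, start_method=None):
    # start_method=None uses the platform default ("spawn" on Windows).
    ctx = multiprocessing.get_context(start_method)
    recv_conn, send_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child_main, args=(send_conn, func, args))
    proc.start()
    send_conn.close()  # the parent only reads
    try:
        if not recv_conn.poll(timeout):
            # Unlike a thread, a subprocess can simply be terminated.
            proc.terminate()
            raise TaskTimeout(f"task exceeded {timeout} seconds")
        try:
            status, payload = recv_conn.recv()
        except EOFError:
            raise RuntimeError("subprocess exited without sending a result")
    finally:
        proc.join()
        recv_conn.close()
    if status == "error":
        raise RuntimeError(payload)
    return payload
```

With the `spawn` start method (the only one available on Windows), `func` must be picklable, which in practice means an importable module-level function, and the calling script needs an `if __name__ == "__main__":` guard. `terminate()` can be ignored by a child that handles the termination signal; `Process.kill()` is the harsher alternative.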

ewjoachim · 2024-08-15T22:57:53Z

I have to admit I have limited multiprocessing experience (it just happens to never have been on my radar).

From what I know, Pipes and Queues might be what Python gives us to communicate between processes. Objects put in queues are pickled. Pipes let us transfer text payloads.

Another complex point (but linked to the JobManager point) is the psycopg pool: does multiprocessing imply that each process will open its own pool? That might be a little overkill, though I don't know how it will play out. In particular, we might not need a connection at all unless we use task.defer from within the task. We could hack something to use a special connector in the task process that sends Postgres queries through the pipe, to be handled by the parent.
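
The "special connector" idea could look roughly like the sketch below. Everything here is an assumption made for illustration: `PipeConnector`, `run_task_with_proxy`, and the `("query", ...)` / `("done", ...)` message shapes are hypothetical and do not mirror Procrastinate's connector interface. One detail worth knowing: `multiprocessing` connection objects pickle whatever is passed to `send()`, just like queues do, so payloads are not limited to text.

```python
import multiprocessing


class PipeConnector:
    """Hypothetical child-side connector that forwards queries to the parent."""

    def __init__(self, conn):
        self._conn = conn

    def execute(self, query, params=None):
        # Ship the query to the parent and block until it replies.
        self._conn.send(("query", query, params or {}))
        return self._conn.recv()


def _child_main(conn, task):
    # Runs in the subprocess: the task only ever sees the proxy connector.
    try:
        conn.send(("done", task(PipeConnector(conn))))
    finally:
        conn.close()


def run_task_with_proxy(task, handle_query, start_method=None):
    # handle_query(query, params) runs in the parent, which owns the real pool.
    ctx = multiprocessing.get_context(start_method)
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_child_main, args=(child_conn, task))
    proc.start()
    child_conn.close()
    try:
        while True:
            message = parent_conn.recv()  # raises EOFError if the child dies
            if message[0] == "done":
                return message[1]
            _, query, params = message
            parent_conn.send(handle_query(query, params))
    finally:
        proc.join()
        parent_conn.close()
```

In this sketch the subprocess never opens a database connection of its own; only the parent's `handle_query` touches the pool. It does nothing for connections the task opens by other means.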

medihack · 2024-08-17T13:21:04Z

> From what I know, Pipes and Queues might be what Python gives us to communicate between processes. Objects put in queues are pickled. Pipes let us transfer text payloads.

Yes, and events (for something like the abort request).
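
A rough sketch of how an abort request could be signalled with a `multiprocessing.Event`: the parent sets the event and the task checks it between units of work. The function names and parameters are hypothetical, and the `time.sleep` call stands in for real work.

```python
import multiprocessing
import time


def _abortable_task(abort_event, conn, max_steps, step_seconds):
    # Runs in the subprocess: check the abort flag between units of work.
    done = 0
    for _ in range(max_steps):
        if abort_event.is_set():
            break
        time.sleep(step_seconds)  # stand-in for one unit of real work
        done += 1
    conn.send(done)
    conn.close()


def run_then_abort(max_steps, step_seconds, abort_after, start_method=None):
    ctx = multiprocessing.get_context(start_method)
    abort_event = ctx.Event()
    recv_conn, send_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(
        target=_abortable_task,
        args=(abort_event, send_conn, max_steps, step_seconds),
    )
    proc.start()
    send_conn.close()
    time.sleep(abort_after)
    abort_event.set()  # the abort request, visible in the child process
    steps_done = recv_conn.recv()
    proc.join()
    recv_conn.close()
    return steps_done
```

This is cooperative: a task that never checks the event cannot be aborted this way, which is where terminating the subprocess becomes the fallback.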

> Another complex point (but linked to the JobManager point) is the psycopg pool: does multiprocessing imply that each process will open its own pool? That might be a little overkill, though I don't know how it will play out. In particular, we might not need a connection at all unless we use task.defer from within the task. We could hack something to use a special connector in the task process that sends Postgres queries through the pipe, to be handled by the parent.

Yes, if the database is queried directly, each process would use its own connection pool. I find it difficult to judge whether this would be a problem in a real-life application or only a theoretical one. Having a special connector and doing something like RPC between the processes sounds like a cool idea. Unfortunately, the same problem exists for database connections outside of Procrastinate, for example when users run Django model queries. Those connections would be opened in the subprocess anyway.

ewjoachim · 2024-09-06T12:17:48Z

I'm a bit wary of changing the model just like this.
I wonder if we should maybe add options (we don't have to pick them all):

- Legacy runner that runs async tasks and sync tasks via async to sync (the breaking change could be that you need to set your worker to legacy explicitly to get the previous behaviour, or legacy could be the default when nothing is specified, with a DeprecationWarning)
- Subprocess runner

As usual, the annoying part is trying to guess what it's going to be like for folks who use just async, just sync, a mix of both, with or without Django, etc.

medihack · 2024-09-06T17:40:22Z

Or, make it configurable, as Huey does with worker types. We could have a sync_type option on the worker (or the task itself).
But there is so much stuff already in the v3 release that we should postpone this feature for a later release. We can add it as an experimental feature with a minor release (and keep the default sync tasks threaded), and when we are sure it's stable, we can still switch to subprocesses as the default.
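
As a sketch of what such a switch might look like, the snippet below picks an executor based on a `sync_type` value, with threads as the default. The option values `"thread"` and `"process"`, the helper names, and the `start_method` parameter are all assumptions for illustration; neither Procrastinate nor Huey is being quoted here.

```python
import concurrent.futures
import multiprocessing


def make_executor(sync_type="thread", start_method=None):
    # Hypothetical option values; "thread" mirrors the current default.
    if sync_type == "thread":
        return concurrent.futures.ThreadPoolExecutor(max_workers=1)
    if sync_type == "process":
        return concurrent.futures.ProcessPoolExecutor(
            max_workers=1,
            mp_context=multiprocessing.get_context(start_method),
        )
    raise ValueError(f"unknown sync_type: {sync_type!r}")


def run_sync_task(func, *args, sync_type="thread", start_method=None):
    # Run one sync callable with the chosen isolation level, return its result.
    with make_executor(sync_type, start_method) as executor:
        return executor.submit(func, *args).result()
```

For example, `run_sync_task(math.factorial, 5, sync_type="process")` would compute the result in a separate process. With `"process"`, the callable and its arguments have to be picklable, a constraint the threaded default does not have.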

ewjoachim · 2024-09-06T18:28:27Z

> We can add it as an experimental feature

Yes, that was my point: I'm comfortable with adding it, and I'm comfortable with it having better support for some features (such as aborting) than the standard way, but I'm not (yet) comfortable with removing the way we do things now.

medihack added this to the Version 3.0 milestone Aug 15, 2024

medihack mentioned this issue Aug 21, 2024

Django OperationalError: the connection is closed #1134

Open

medihack removed this from the Version 3.0 milestone Oct 1, 2024
