Parallelize I/O operations #97
Conversation
I suspect that moving the variant-wise encoding for-loop to the VCZ encoder library will help (vcztools/vcztools/vcf_writer.py, lines 397 to 406 in d152af1).
This way, the C extension can release the GIL once for all of the lines rather than acquiring and releasing it for every line.
I pushed the for-loop down to the C extension, and now I am seeing the hoped-for results.
To get this prototype working, I used …
In the latest commit, the … With this approach, all the test cases pass locally and the performance is still improved. I am not sure why the GitHub checks are failing; they seem to be unrelated to this pull request's changes.
Sorry @Will-Tyler, that's my fault! The changes in #95 caused ruff to change … You should be able to rebase and run …
Very nice @Will-Tyler! I'm nervous about the C code writing directly to the FD, though; there's a lot that can go wrong here, and we'd need to work quite a bit harder to make things air-tight (e.g., there's no error checking at the moment, so if a write fails it's silent). What you can and can't do inside Py_BEGIN_ALLOW_THREADS etc. is also very subtle, and it definitely pays to keep things as simple as possible. What I had in mind was to write multiple lines at once to a buffer, and then flush these buffers out in order to the output in Python. So, you could have something like this:
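(The code snippet from the original comment did not survive extraction; the following is a minimal Python sketch of the buffered idea under assumed names — encode_lines stands in for the C-extension entry point and is not the actual vcztools API.)

```python
from concurrent.futures import ThreadPoolExecutor


def encode_lines(chunk):
    # Stand-in for the C-extension entry point, which would encode a whole
    # range of variant lines into one buffer with the GIL released.
    return "".join(chunk)


def write_vcf(chunks, out, num_workers=4):
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map yields results in submission order, so the encoded
        # buffers are flushed to the output in order even though they are
        # produced concurrently on worker threads.
        for buf in executor.map(encode_lines, chunks):
            out.write(buf)
```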
This is quite a bit more complicated than what you have here, as you'd need to set up some Queues etc. to keep the threads fed, but I think it would perform quite well ultimately. We wouldn't need to go to the full thing initially, though, and I think just writing the output to a pre-supplied buffer from C would be a good start. So, rather than writing one line, we write a range of lines.

Also, let's just raise an exception from C if the buffer size is too small and do the doubling in Python, rather than trying to do the adaptive updates in C. Keeping things as simple as possible in C is really important here. There's no point in trying to optimise the corner cases (i.e., you've supplied a buffer size that's too small): do the important nominal case fast in C, and leave everything else to Python.

Also, do as little as possible in the CPython interface code. If you find yourself doing anything that involves anything other than massaging input or output, it should probably be moved into the library code.

So, for this first PR, let's use your current architecture, but write the full chunk to a string that we return to Python rather than doing the IO in C. Does this sound OK?
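(A hedged sketch of the Python-side doubling described above, assuming — hypothetically — that the C function raises BufferError when the supplied buffer is too small; encode_chunk here is an illustrative stub, not the real extension API.)

```python
def encode_chunk(start, stop, buflen):
    # Illustrative stub for the hypothetical C function, which would encode
    # variants [start, stop) into a buffer of at most buflen bytes.
    text = "".join(f"variant {i}\n" for i in range(start, stop))
    if len(text) > buflen:
        raise BufferError("buffer too small")
    return text


def encode_with_retry(start, stop, buflen=2**20):
    # The nominal case stays fast in C; the too-small corner case is
    # handled in Python by doubling the buffer and retrying.
    while True:
        try:
            return encode_chunk(start, stop, buflen)
        except BufferError:
            buflen *= 2
```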
I think that to get the most from parallelism, the extension module should do as much work as possible with the GIL released. If I understand correctly, you are proposing to encode multiple lines into a single buffer and raise an exception from C if the buffer is too small. I'm concerned that if the number of lines is too small, the module will acquire the GIL too frequently, and if the number of lines is too large, vcztools might redo a lot of work whenever the buffer is too small. And if Python does the I/O, there is less parallelism, because the I/O thread would need the lock, although simply printing does not seem to be a top time consumer.
This is true. I believe my current implementation is thread-safe. The raw memory interface my implementation uses is documented as thread-safe.
I feel like I am missing something, but it seems overall simpler and more efficient to use C's standard I/O API and add error handling, rather than sharing buffers between threads and between C and Python.
The buffer won't be too small very often (we'll make the default much bigger, and make better guesses), so we don't need to worry about that. By writing multiple lines at once, we can do a significant chunk of work with the GIL released. The difference between doing file IO in C and in Python is negligible if you're writing a lot of text at a time, so there's no point in doing the IO in C unless you have to. It's easy to get stuff like this to work initially, but it's very hard to get it fully air-tight, and any minor efficiency you gain by doing stuff directly in C is bought at a high cost in subtle and difficult-to-track-down bugs later on (I have war stories!). I'm happy to pick this up and take it the last few steps, but I do want to take the C-level IO out before merging. You can just take out the …
This reverts commit 0a47519.
I trust your judgement and experience; I reverted some of the changes so that this pull request is mostly about setting up the parallel threads.
Overview
Currently, vcztools view loads data from Zarr and writes the VCF output sequentially. These operations can be done in parallel. This pull request uses multi-threading to parallelize reading from Zarr and writing the VCF output.
I am opening this pull request as a draft because I think additional changes are likely needed.
Approach
The sequential implementation has the following form:
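(The original snippet did not survive extraction; this is a rough Python paraphrase using hypothetical helper names, read_zarr_chunk and encode_vcf, rather than the actual vcztools code.)

```python
def write_sequential(num_chunks, output):
    for chunk_index in range(num_chunks):
        chunk_data = read_zarr_chunk(chunk_index)  # decode one Zarr chunk
        text = encode_vcf(chunk_data)              # render the VCF text
        output.write(text)                         # write before reading more
```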
Here is the form of this pull request's multi-threading approach:
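(Again a rough paraphrase under the same hypothetical helper names, since the original snippet did not survive extraction.)

```python
from concurrent.futures import ThreadPoolExecutor


def write_pipelined(num_chunks, output):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(read_zarr_chunk, 0)
        for next_index in range(1, num_chunks):
            chunk_data = future.result()            # wait for the pending read
            future = executor.submit(read_zarr_chunk, next_index)  # prefetch
            output.write(encode_vcf(chunk_data))    # write overlaps the read
        output.write(encode_vcf(future.result()))   # flush the final chunk
```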
This approach allows at most two threads to run concurrently and requires the preceding thread to finish before the current thread can start writing.
Testing
I added a multi-chunk dataset for unit testing. The dataset is a 100-variant by 100-sample subset of the chromosome 22 performance testing data. The variant and sample chunk sizes are each 10. Having a multi-chunk dataset should help us detect concurrent-programming errors such as not writing the output in the correct order.
Results
The results are disappointing so far. There is practically no improvement over the sequential implementation's results in #94.
Profiling
Interestingly, the profiling results show most of the time being spent on acquire (i.e., waiting to acquire locks).