-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert from pybind11 to nanobind #230
base: main
Are you sure you want to change the base?
Conversation
Something to note that I'm not sure about - I think pypy works with nanobind, so I might try re-adding it. |
@Ahajha Thanks for the contribution! Nanobind looks great and fast. Regarding the performance, could you check the output of pytest tests/python/test_grammar_matcher_ebnf.py::test_fill_next_token_bitmask before and after this PR? (it could require HF_TOKEN with access to llama 2 and 3.1) As long as there is no major performance degradation, we can migrate to nanobind. Besides, could you run |
This comment was marked as outdated.
This comment was marked as outdated.
Got some of the HF tests running, ran into a snag due to a few places allowing interchange between bytes and strings. Nanobind specifically doesn't like this (see wjakob/nanobind#137), so I might need to make some slightly deeper API changes to get a clean solution. |
This comment was marked as outdated.
This comment was marked as outdated.
I came up with a solution that I think is reasonable - basically we defer casting until we get into the relevant functions (there was one other that needed fixing). So instead of directly taking in a vector of strings, we just take the raw python list (with some type annotations, these don't do any runtime checks) and then manually cast from either a string or bytes. This solution ended up being fairly simple without requiring any API changes. |
This comment was marked as outdated.
This comment was marked as outdated.
After some investigating and fixes, I have some better numbers, running the following: /usr/bin/time pytest tests/python/test_grammar_matcher_ebnf.py::test_fill_next_token_bitmask >/dev/null (I'm only including the interesting line from each run)
Nanobind branch:
This is definitely eyeball statistics, but this seems to be on par with what it is now. I expected a speedup, perhaps there is more to investigate, but as you said as long as it isn't an obvious regression it should be okay. I also have some future PR ideas that involve nanobind that should lead to further speedups (some of them could be done with pybind11, but will be easier with nanobind). |
The issue was actually rather insightful, I'll share some insight for you (or anyone else reading this PR). The first issue was that pybind11 enables LTO by default, nanobind doesn't - the former basically requires it to get reasonable performance, but nanobind is designed and laid out in a better way so that you can still get good performance without it, though they provide the option should we want it. I know this project has The second issue was more interesting to me. Nanobind has a section on their CMake API (https://nanobind.readthedocs.io/en/latest/api_cmake.html#command:nanobind_add_module), and one thing they mention there is binary size. Nanobind defaults to optimizing for size, because in the author's experience with bindings, optimizing for speed tends to not make a difference. However they do make a note about mixing optimization levels, specifically, compiling the core project with whatever default optimization level like -O2, and just binding code with -Osize. So I've slightly rearranged some code to make anything that is just interop (with no 'real' logic) into nanobind.cc, now no other files depend on nanobind. |
@Ubospica This should be ready for another look now! |
See nanobind: https://github.com/wjakob/nanobind tl;dr Smaller, faster compiled extensions. Nanobind is written by the original author of pybind11, with the hindsight and lessons from pybind11.
Locally (x64 Linux), the compiled extension went from 14.10MB to 12.01MB.
I have some very crude benchmarks in the comments - this is performing about on par with the previous code.
I've run all the tests locally (x64 Linux), everything passes. Ideally #218 is merged so we can test those automatically on other platforms, but the only major differences I would expect are binary sizes.
I make some CMake changes so that the compiled extensions now just depend on the core project, rather than rebuilding all the sources again for the extension.