Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert from pybind11 to nanobind #230

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

Ahajha
Copy link

@Ahajha Ahajha commented Mar 7, 2025

See nanobind: https://github.com/wjakob/nanobind tl;dr Smaller, faster compiled extensions. Nanobind is written by the original author of pybind11, with the hindsight and lessons from pybind11.

Locally (x64 Linux), the compiled extension went from 14.10MB to 12.01MB.

I have some very crude benchmarks in the comments - this is performing about on par with the previous code.

I've run all the tests locally (x64 Linux), everything passes. Ideally #218 is merged so we can test those automatically on other platforms, but the only major differences I would expect are binary sizes.

I make some CMake changes so that the compiled extensions now just depend on the core project, rather than rebuilding all the sources again for the extension.

@Ahajha
Copy link
Author

Ahajha commented Mar 7, 2025

Something to note that I'm not sure about - I think pypy works with nanobind, so I might try re-adding it.

@Ubospica
Copy link
Collaborator

Ubospica commented Mar 7, 2025

@Ahajha Thanks for the contribution! Nanobind looks great and fast.

Regarding the performance, could you check the output of

pytest tests/python/test_grammar_matcher_ebnf.py::test_fill_next_token_bitmask

before and after this PR? (it could require HF_TOKEN with access to llama 2 and 3.1) As long as there is no major performance degradation, we can migrate to nanobind.

Besides, could you run pre-commit install && pre-commit run to format the modified files? I noticed several files don’t seem to be formatted correctly, and it should be very easy to format them with pre-commit hooks.

@Ahajha

This comment was marked as outdated.

@Ahajha
Copy link
Author

Ahajha commented Mar 7, 2025

Got some of the HF tests running, ran into a snag due to a few places allowing interchange between bytes and strings. Nanobind specifically doesn't like this (see wjakob/nanobind#137), so I might need to make some slightly deeper API changes to get a clean solution.

@Ahajha

This comment was marked as outdated.

@Ahajha
Copy link
Author

Ahajha commented Mar 8, 2025

I came up with a solution that I think is reasonable - basically we defer casting until we get into the relevant functions (there was one other that needed fixing). So instead of directly taking in a vector of strings, we just take the raw python list (with some type annotations, these don't do any runtime checks) and then manually cast from either a string or bytes. This solution ended up being fairly simple without requiring any API changes.

@Ahajha

This comment was marked as outdated.

@Ahajha
Copy link
Author

Ahajha commented Mar 9, 2025

After some investigating and fixes, I have some better numbers, running the following:

/usr/bin/time pytest tests/python/test_grammar_matcher_ebnf.py::test_fill_next_token_bitmask >/dev/null

(I'm only including the interesting line from each run)
Main:

5.44user 0.25system 0:03.61elapsed 157%CPU (0avgtext+0avgdata 722316maxresident)k
5.64user 0.29system 0:03.79elapsed 156%CPU (0avgtext+0avgdata 723012maxresident)k
5.34user 0.27system 0:03.47elapsed 161%CPU (0avgtext+0avgdata 722360maxresident)k
5.39user 0.28system 0:03.51elapsed 161%CPU (0avgtext+0avgdata 722292maxresident)k
5.15user 0.27system 0:03.40elapsed 159%CPU (0avgtext+0avgdata 723596maxresident)k
5.30user 0.24system 0:03.40elapsed 162%CPU (0avgtext+0avgdata 723056maxresident)k

Nanobind branch:

5.37user 0.32system 0:04.05elapsed 140%CPU (0avgtext+0avgdata 721436maxresident)k
5.41user 0.26system 0:03.58elapsed 158%CPU (0avgtext+0avgdata 723240maxresident)k
5.35user 0.28system 0:03.53elapsed 159%CPU (0avgtext+0avgdata 722456maxresident)k
5.40user 0.27system 0:03.53elapsed 160%CPU (0avgtext+0avgdata 722348maxresident)k
5.44user 0.28system 0:03.66elapsed 156%CPU (0avgtext+0avgdata 723440maxresident)k

This is definitely eyeball statistics, but this seems to be on par with what it is now. I expected a speedup, perhaps there is more to investigate, but as you said as long as it isn't an obvious regression it should be okay. I also have some future PR ideas that involve nanobind that should lead to further speedups (some of them could be done with pybind11, but will be easier with nanobind).

@Ahajha
Copy link
Author

Ahajha commented Mar 9, 2025

The issue was actually rather insightful, I'll share some insight for you (or anyone else reading this PR).

The first issue was that pybind11 enables LTO by default, nanobind doesn't - the former basically requires it to get reasonable performance, but nanobind is designed and laid out in a better way so that you can still get good performance without it, though they provide the option should we want it. I know this project has -flto=auto on by default, I'm not sure why this sped things up but it did. Perhaps it's a different level of LTO, it's whatever CMake's CMAKE_INTERPROCEDURAL_OPTIMIZATION does.

The second issue was more interesting to me. Nanobind has a section on their CMake API (https://nanobind.readthedocs.io/en/latest/api_cmake.html#command:nanobind_add_module), and one thing they mention there is binary size. Nanobind defaults to optimizing for size, because in the author's experience with bindings, optimizing for speed tends to not make a difference. However they do make a note about mixing optimization levels, specifically, compiling the core project with whatever default optimization level like -O2, and just binding code with -Osize. So I've slightly rearranged some code to make anything that is just interop (with no 'real' logic) into nanobind.cc, now no other files depend on nanobind.

@Ahajha
Copy link
Author

Ahajha commented Mar 9, 2025

@Ubospica This should be ready for another look now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants