I am using XGrammar in an LLM inference engine where multiple requests are processed simultaneously. Each request is managed by a separate instance of the GrammarMatcher finite state machine. In our setup, parallel updates to different GrammarMatcher instances are desired to better utilize available CPU cores.
It appears that while the GrammarCompiler bindings release the GIL during their CPU-intensive operations, the GrammarMatcher bindings do not. This limits our multithreaded use case: even though each GrammarMatcher is used by only one thread at a time, the global interpreter lock (GIL) still serializes CPU-bound work across threads.
Would it be possible to modify the GrammarMatcher pybind bindings (e.g., for methods such as accept_token) to release the GIL, similar to the approach taken in GrammarCompiler? This would improve throughput in multi-core, multi-threaded environments without relying on multiprocessing, which may introduce memory management complexities in our application.
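For reference, here is a minimal sketch of the threading pattern described above. `StandInMatcher`, `run_request`, and `run_batch` are hypothetical names used only for illustration; the stand-in class is pure Python and is not the XGrammar API. It only shows that each matcher is owned by exactly one worker task, so no matcher state is shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor


class StandInMatcher:
    """Hypothetical stand-in for a per-request matcher (NOT the XGrammar API)."""

    def __init__(self, allowed):
        self.allowed = set(allowed)
        self.accepted = []

    def accept_token(self, token_id):
        # Reject tokens outside the allowed set; otherwise record and accept.
        if token_id not in self.allowed:
            return False
        self.accepted.append(token_id)
        return True


def run_request(matcher, token_ids):
    # One request: a single task drives a single matcher from start to finish.
    return [matcher.accept_token(t) for t in token_ids]


def run_batch(requests, max_workers=4):
    # Each (matcher, token_ids) pair is submitted as its own task, so two
    # threads never touch the same matcher instance.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_request, m, toks) for m, toks in requests]
        # Collect results in submission order.
        return [f.result() for f in futures]
```

With a pure-Python stand-in like this, the threads still take turns under the GIL. The request is about the native matcher calls: if those released the GIL while running, the same pattern could make use of several cores.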
@tuoyikuan Thanks for bringing that up! It is reasonable to release the GIL for all these operations. We will update the code to release the GIL for most operations.