Add simple granite4 tool parser#36827
Add simple granite4 tool parser#36827maxdebayser wants to merge 16 commits intovllm-project:mainfrom
Conversation
This tool parser should be compatible with most models that use the Hermes tool calling pattern. It has been re-written from the ground up to avoid the brittle partial json parsing that we see in other tool call parsers. By relying on a stream-enabled parser it avoids bugs such as the interference from stop sequences which change the sequence of deltas that the model sees Main design decisions: - Remove streaming of partial arguments: this complicates things and arguably doesn't benefit the end user at all - Decompose the parser in several layers that are independently testable - Use a formal grammar to specify the parser - For the parser, use a library that is already part of vllm's dependencies. Lark is imported by llguidance. Since the parser is compatible with Hermes tool calling, I'm reusing the Hermes tests except for one that allows incomplete input. I'm also adding tests for the lexer and parser as well as testing for known bugs. Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
- Fix wrong scope for server fixtures in tests - Prevent the tool parser from streaming pieces of the <tool_call> marker as message content - Reduce unecessary delta messages Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Mypy is complaining about code outside of my changes for some reason Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Previously some lexing tasks were left to the vllm tool parser, which is at the wrong abstraction level, leading to unecessary complexity. Now the lexer also handles free text so that what comes out of the low level lark parser is already organized into text and tool calling segments. Since now the lexer and lark parser are aware of the surrunding text, it is easier to handle multiple tool calls cleanly Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
If we can assume that: 1) Streaming the name ahead of the arguments has no relevant use case; 2) Validating the tool call JSON while it is being assembled is not useful; then the parser can be simplified a lot by only streaming complete tool calls. The we can use simple regexes to find the tool call tokens and use json.loads() to handle anything in between. Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
|
Documentation preview: https://vllm--36827.org.readthedocs.build/en/36827/ |
|
Hi @maxdebayser, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new granite4 tool parser, which is a simplified implementation for IBM's Granite 4 models. The changes include the parser logic, registration, documentation updates, and new tests. The existing hermes tool parser tests are also refactored to be parameterized and reused for the new parser. My review found a critical bug in the streaming logic of the new parser that could lead to an AttributeError and incorrect state management. I've also suggested an improvement to the new test file to simplify the logic for reconstructing tool calls, making it more readable and aligned with the parser's design of not streaming partial arguments.
|
@sfeng33, here is an alternative implementation based on you suggestion. It can really be made much shorter after giving up on incremental parsing and streaming. I'm going to answer your question of the other PR here, because it applies as well:
Relying only on text is useful for us to run tests with models that don't have dedicated tool calling tokens. But beyond that, I really prefer to have a single source of truth for the input, and since we have to parse json, the most appropriate input is text. To illustrate my point, there is currently a bug in vllm which causes the text deltas and the token deltas to go out of sync. If you run But if you comment out the |
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
|
I've opened an issue for the bug I described above: #36830 |
Purpose
Note: this is a simpler alternative to #35948 based on suggestions by @sfeng33
IBM's Granite 4 models use the Hermes tool calling convention and until now had been using the
hermesparser. However, due to the popularity of the Hermes format many additions have been made to this parser to serve specific needs, such as the ability to work without specialized tool calling tokens. As a result, the parser's code has become mostly unreadable. We have found bugs that arise from the interaction with other features such as stop sequences and that are very hard to fix given the state of the code. Also given the complexity of the code, it is very hard for maintainers to trust that a PR won't break other things.There is also a Granite 4 specific behavior which we need handled in the tool parser which is that the models have a tendency to generate the arguments as an escaped string instead of JSON text.
The
granite4parser in this PR has been re-written from the ground up to avoid the brittle partial json parsing that we see in other tool call parsers. By only streaming full tool call streaming, no partial json parsing is required.Main design decisions:
Test Plan
Since the parser is compatible with Hermes tool calling, I'm reusing the Hermes tests except for one that allows incomplete input. I'm also adding tests for the lexer and parser as well as testing for known bugs.
Test Result
All the added or modified tests are passing locally.