Add simple granite4 tool parser #36827

Open
maxdebayser wants to merge 16 commits into vllm-project:main from maxdebayser:add_simple_granite4_tool_parser

@maxdebayser maxdebayser commented Mar 11, 2026

Purpose

Note: this is a simpler alternative to #35948 based on suggestions by @sfeng33

IBM's Granite 4 models use the Hermes tool calling convention and until now have been using the hermes parser. However, because the Hermes format is so popular, many additions have been made to this parser to serve specific needs, such as the ability to work without specialized tool calling tokens. As a result, the parser's code has become mostly unreadable. We have found bugs that arise from the interaction with other features, such as stop sequences, and that are very hard to fix given the state of the code. The complexity also makes it very hard for maintainers to trust that a PR won't break other things.
There is also a Granite 4-specific behavior that the tool parser needs to handle: the models tend to generate the arguments as an escaped string instead of JSON text.
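
To make the escaped-string behavior concrete, here is a minimal sketch of how such arguments could be normalized. It is not the code from this PR: the helper name `normalize_arguments` and the `get_weather` payloads are made up for illustration, and the only assumption is that the arguments may arrive either as a JSON object or as a JSON string whose content is itself JSON.

```python
import json
from typing import Any


def normalize_arguments(arguments: Any) -> str:
    """Return tool-call arguments as JSON text (hypothetical helper).

    If the model emitted the arguments as an escaped string, i.e. a JSON
    string whose content is itself JSON, decode that string once more.
    """
    if isinstance(arguments, str):
        try:
            arguments = json.loads(arguments)
        except json.JSONDecodeError:
            # The string does not contain JSON; pass it through as a JSON string.
            return json.dumps(arguments)
    return json.dumps(arguments, ensure_ascii=False)


# Made-up payloads: the same call with JSON arguments and with escaped-string arguments.
well_formed = json.loads('{"name": "get_weather", "arguments": {"city": "Paris"}}')
escaped = json.loads('{"name": "get_weather", "arguments": "{\\"city\\": \\"Paris\\"}"}')

same = normalize_arguments(well_formed["arguments"]) == normalize_arguments(escaped["arguments"])
```

With this kind of normalization, both variants produce the same JSON text for the arguments, so the client does not need to know which form the model generated.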

The granite4 parser in this PR has been rewritten from the ground up to avoid the brittle partial JSON parsing that we see in other tool call parsers. Because it streams only complete tool calls, no partial JSON parsing is required.

Main design decisions:

  • Remove streaming of tool names ahead of arguments
  • Remove streaming of partial arguments: this complicates things and arguably doesn't benefit the end user at all
  • Rely only on text, not on tokens

Test Plan

Since the parser is compatible with Hermes tool calling, I'm reusing the Hermes tests except for one that allows incomplete input. I'm also adding tests for the lexer and parser as well as testing for known bugs.

Test Result

All the added or modified tests are passing locally.

maxdebayser and others added 15 commits March 3, 2026 23:50
This tool parser should be compatible with most models that use the Hermes tool
calling pattern. It has been rewritten from the ground up to avoid
the brittle partial JSON parsing that we see in other tool call parsers.
By relying on a stream-enabled parser it avoids bugs such as the
interference from stop sequences, which change the sequence of deltas
that the parser sees.

Main design decisions:
- Remove streaming of partial arguments: this complicates things and
  arguably doesn't benefit the end user at all
- Decompose the parser in several layers that are independently testable
- Use a formal grammar to specify the parser
- For the parser, use a library that is already part of vllm's
  dependencies. Lark is imported by llguidance.

Since the parser is compatible with Hermes tool calling, I'm reusing the
Hermes tests except for one that allows incomplete input. I'm also
adding tests for the lexer and parser as well as testing for known bugs.

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
- Fix wrong scope for server fixtures in tests
- Prevent the tool parser from streaming pieces of the <tool_call>
  marker as message content
- Reduce unnecessary delta messages

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Mypy is complaining about code outside of my changes for some reason

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Previously some lexing tasks were left to the vllm tool parser, which
is at the wrong abstraction level, leading to unnecessary complexity.
Now the lexer also handles free text so that what comes out of the
low level lark parser is already organized into text and tool calling
segments. Since the lexer and lark parser are now aware of the
surrounding text, it is easier to handle multiple tool calls cleanly.

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
If we can assume that:
1) Streaming the name ahead of the arguments has no relevant use case;
2) Validating the tool call JSON while it is being assembled is not
   useful;
then the parser can be simplified a lot by only streaming complete
tool calls. Then we can use simple regexes to find the tool call
tokens and use json.loads() to handle anything in between.

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
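
As a rough illustration of the approach described in the last commit message above (regexes to find the tool call markers, json.loads() for what lies in between), here is a minimal non-streaming sketch. The function `extract_tool_calls` and the sample output are hypothetical and not taken from this PR. The sketch assumes the closing marker never appears inside a JSON string value.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Split complete model output into free text and parsed tool calls."""
    # json.loads() validates each payload in one step; nothing is parsed partially.
    calls = [json.loads(m.group(1)) for m in TOOL_CALL_RE.finditer(text)]
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls


# Made-up model output containing free text followed by two tool calls.
output = (
    "Checking both.\n"
    '<tool_call>\n{"name": "a", "arguments": {}}\n</tool_call>\n'
    '<tool_call>\n{"name": "b", "arguments": {"x": 1}}\n</tool_call>'
)
content, calls = extract_tool_calls(output)
```

A malformed payload makes json.loads() raise json.JSONDecodeError; a real parser would have to decide whether to surface such text as plain content instead.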

mergify bot commented Mar 11, 2026

Documentation preview: https://vllm--36827.org.readthedocs.build/en/36827/

mergify bot added the documentation and tool-calling labels Mar 11, 2026

mergify bot commented Mar 11, 2026

Hi @maxdebayser, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new granite4 tool parser, which is a simplified implementation for IBM's Granite 4 models. The changes include the parser logic, registration, documentation updates, and new tests. The existing hermes tool parser tests are also refactored to be parameterized and reused for the new parser. My review found a critical bug in the streaming logic of the new parser that could lead to an AttributeError and incorrect state management. I've also suggested an improvement to the new test file to simplify the logic for reconstructing tool calls, making it more readable and aligned with the parser's design of not streaming partial arguments.

@maxdebayser
Contributor Author

@sfeng33, here is an alternative implementation based on your suggestion. It can really be made much shorter after giving up on incremental parsing and streaming.

I'm going to answer your question from the other PR here, because it applies to this PR as well. You wrote:

On partial <tool_call> handling: I'd suggest removing the partial matching logic for the <tool_call> tag. Since <tool_call> is a single token in the Granite 4 tokenizer, it will always arrive atomically in a single delta — it can never be split across chunks. The regex library's partial=True matching in consume_text adds complexity for a case that can't actually occur.

Relying only on text lets us run tests with models that don't have dedicated tool calling tokens. But beyond that, I really prefer to have a single source of truth for the input, and since we have to parse JSON, the most appropriate input is text. To illustrate my point, there is currently a bug in vLLM that causes the text deltas and the token deltas to go out of sync. If you run test_granite4_tool_parser.py::test_stop_sequence_interference and print the deltas that arrive at the tool parser, you'll see:

(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[100270]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[198]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[5018]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[609]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[794]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[330]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[456]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[62]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[582]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[2727]
(APIServer pid=250128) delta_text=''
(APIServer pid=250128) delta_token_ids=[62]
(APIServer pid=250128) delta_text='<t'
(APIServer pid=250128) delta_token_ids=[4030]
(APIServer pid=250128) delta_text='ool_c'
(APIServer pid=250128) delta_token_ids=[1292]
(APIServer pid=250128) delta_text='all>'
(APIServer pid=250128) delta_token_ids=[5595]
(APIServer pid=250128) delta_text='\n{"name": "g'
...
(APIServer pid=250128) delta_text='467722Z", "a'
(APIServer pid=250128) delta_token_ids=[100271]
(APIServer pid=250128) delta_text='cme_region": "A9345"}}\n</tool_call>'
(APIServer pid=250128) delta_token_ids=[100257]

But if you comment out the stop argument in the request, you see:

(APIServer pid=251071) delta_text='<tool_call>'
(APIServer pid=251071) delta_token_ids=[100270]
(APIServer pid=251071) delta_text='\n'
(APIServer pid=251071) delta_token_ids=[198]
(APIServer pid=251071) delta_text='{"'
(APIServer pid=251071) delta_token_ids=[5018]
(APIServer pid=251071) delta_text='name'
(APIServer pid=251071) delta_token_ids=[609]
(APIServer pid=251071) delta_text='":'
(APIServer pid=251071) delta_token_ids=[794]
(APIServer pid=251071) delta_text=' "'
(APIServer pid=251071) delta_token_ids=[330]
(APIServer pid=251071) delta_text='get'
(APIServer pid=251071) delta_token_ids=[456]
(APIServer pid=251071) delta_text='_'
(APIServer pid=251071) delta_token_ids=[62]
(APIServer pid=251071) delta_text='ac'
(APIServer pid=251071) delta_token_ids=[582]
(APIServer pid=251071) delta_text='me'
(APIServer pid=251071) delta_token_ids=[2727]
...
(APIServer pid=251071) delta_text='"}}\n'
(APIServer pid=251071) delta_token_ids=[96742]
(APIServer pid=251071) delta_text='</tool_call>'
(APIServer pid=251071) delta_token_ids=[100271]
(APIServer pid=251071) delta_text=''
(APIServer pid=251071) delta_token_ids=[100257]

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
@maxdebayser
Contributor Author

I've opened an issue for the bug I described above: #36830
