Skip to content

[BUG: v3 Tokenizer ] #43

Open
Open
@nirual81

Description

@nirual81

Python -VV

...

Pip Freeze

annotated-types==0.7.0
attrs==24.2.0
certifi==2024.7.4
charset-normalizer==3.3.2
idna==3.8
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
mistral_common==1.3.4
pydantic==2.6.1
pydantic_core==2.16.2
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
rpds-py==0.20.0
sentencepiece==0.2.0
tiktoken==0.7.0
typing_extensions==4.12.2
urllib3==2.2.2

Reproduction Steps

Ive copied the example from: https://github.com/mistralai/mistral-common/blob/main/examples/tokenizer.ipynb
But the tokens aren't properly formatted.

The tokens are like this:
[AVAILABLE_TOOLS]▁[{"type":▁"function",▁"function":▁{"name":▁"get_current_weather",▁"description":▁"Get▁the▁current▁weather",▁"parameters":▁{"type":▁"object",▁"properties":▁{"location":▁{"type":▁"string",▁"description":▁"The▁city▁and▁state,▁e.g.▁San▁Francisco,▁CA"},▁"format":▁{"type":▁"string",▁"enum":▁["celsius",▁"fahrenheit"],▁"description":▁"The▁temperature▁unit▁to▁use.▁Infer▁this▁from▁the▁users▁location."}},▁"required":▁["location",▁"format"]}}}][/AVAILABLE_TOOLS][INST]▁What's▁the▁weather▁like▁today▁in▁Paris[/INST]

First Ive thought that it is intended but with the second message the LLM returns messages like this:
▁The▁current▁weather▁in▁Paris▁is▁72°C.

Expected Behavior

The tokens should have proper formatting with spaces instead of underlines at the most places:

[INST] What's the weather like today in Paris[/INST][TOOL_CALLS] [{"name": "get_current_weather", "arguments": {"location": "Paris, France", "format": "celsius"}, "id": "VvvODy9mT"}][TOOL_RESULTS] {"call_id": "VvvODy9mT", "content": 22}[/TOOL_RESULTS] The current temperature in Paris, France is 22 degrees Celsius.[AVAILABLE_TOOLS] [{"type": "function", "function": {"name": "get_current_weather", "description": "Get the current weather", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}, "format": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The temperature unit to use. Infer this from the users location."}}, "required": ["location", "format"]}}}][/AVAILABLE_TOOLS][INST] What's the weather like today in San Francisco[/INST][TOOL_CALLS] [{"name": "get_current_weather", "arguments": {"location": "San Francisco", "format": "celsius"}, "id": "fAnpW3TEV"}][TOOL_RESULTS] {"call_id": "fAnpW3TEV", "content": 20}[/TOOL_RESULTS]

Additional Context

I am using the mistral-nemo model.

Suggested Solutions

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions