Handling capability mismatches between models.dev and provider reality
Problem
models.dev capability flags are coarse (`true`/`false`/`null`) and don't capture provider-specific quirks:
- Partial support (tools work, but not in streaming mode)
- API restrictions (provider limitation, not model limitation)
- Unreliable behavior (capability exists but doesn't work consistently)
When capabilities don't match reality, tests fail with no good way to handle it.
Example: Bedrock Llama 3.3 70B
models.dev says:
```json
{ "id": "meta.llama3-3-70b-instruct-v1:0", "tool_call": true }
```
Reality: 19 tests, 3 failures
- Streaming + tools: HTTP 400 "This model doesn't support tool use in streaming mode"
  - Bedrock API limitation, not a model limitation
- Type coercion: schema wants `pos_integer`, model returns `"30"` (a string)
  - Structured output works but ignores type constraints
- Unreliable tools: with `tool_choice` forcing tool use, got 0 tool calls
  - Capability exists but doesn't work consistently
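The type-coercion quirk can at least be worked around client-side by coercing string values before schema validation. A minimal sketch; the module and function names here are hypothetical, not part of the library:

```elixir
defmodule CoercionSketch do
  # Hypothetical helper: coerce a value the model returned as a string
  # (e.g. "30") into the positive integer the schema expects.
  def coerce_pos_integer(value) when is_integer(value) and value > 0, do: {:ok, value}

  def coerce_pos_integer(value) when is_binary(value) do
    case Integer.parse(value) do
      {int, ""} when int > 0 -> {:ok, int}
      _ -> {:error, :invalid}
    end
  end

  def coerce_pos_integer(_), do: {:error, :invalid}
end
```

This papers over the symptom rather than fixing the capability metadata, so it complements rather than replaces the options below.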
Current workaround: the model was removed from the test matrix (`config/config.exs`, `test/support/model_matrix.ex`).
Possible Solutions
1. Provider capability overrides
```elixir
# In the amazon_bedrock provider module
def adjust_capabilities("meta.llama" <> _, caps) do
  Map.put(caps, :streaming_tools, false)
end
```
Pro: Accurate. Con: Duplicates models.dev data.
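Option 1 could be wired into model loading as a small merge step. A sketch under assumed names (the `@base_caps` map and `capabilities_for/1` are illustrative, not existing APIs):

```elixir
defmodule BedrockOverridesSketch do
  # Hypothetical: capabilities as loaded from models.dev metadata.
  @base_caps %{tool_call: true, streaming: true}

  # Llama models on Bedrock: tools work, but not in streaming mode.
  def adjust_capabilities("meta.llama" <> _, caps) do
    Map.put(caps, :streaming_tools, false)
  end

  # Pass-through for models with no known quirks.
  def adjust_capabilities(_model_id, caps), do: caps

  def capabilities_for(model_id) do
    adjust_capabilities(model_id, @base_caps)
  end
end
```

The pass-through clause keeps the override surface small: only models with documented quirks get a dedicated clause.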
2. Extended models.dev schema
```json
{
  "tool_call": { "supported": true, "streaming": false }
}
```
Pro: Canonical source. Con: Requires upstream changes.
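On the consuming side, the richer object form could be normalized alongside the current boolean form, so both schemas work during a transition. A sketch; the normalized shape and function name are assumptions, not models.dev's actual format beyond the proposal above:

```elixir
defmodule CapabilitySchemaSketch do
  # Normalize both the current scalar form (true/false/nil) and a
  # proposed object form into one shape: %{supported: _, streaming: _}.
  def normalize_tool_call(true), do: %{supported: true, streaming: true}
  def normalize_tool_call(value) when value in [false, nil],
    do: %{supported: false, streaming: false}

  def normalize_tool_call(%{"supported" => supported} = cap) do
    # Default streaming support to the overall flag when unspecified.
    %{supported: supported, streaming: Map.get(cap, "streaming", supported)}
  end
end
```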
3. Conditional test generation
```elixir
unless has_known_issue?(provider, model, capability) do
  test "..." do
    # ...
  end
end
```
Pro: Precise control. Con: Messy macro logic.
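The `has_known_issue?/3` check from option 3 could be backed by a plain data registry rather than macro logic. A sketch with a hypothetical module and issue list:

```elixir
defmodule KnownIssuesSketch do
  # Hypothetical registry: {provider, model-id prefix, capability} tuples
  # describing known mismatches between advertised and actual behavior.
  @known_issues [
    {:amazon_bedrock, "meta.llama", :streaming_tools},
    {:amazon_bedrock, "meta.llama", :forced_tool_choice}
  ]

  def has_known_issue?(provider, model_id, capability) do
    Enum.any?(@known_issues, fn {p, prefix, cap} ->
      p == provider and cap == capability and String.starts_with?(model_id, prefix)
    end)
  end
end
```

Keeping the registry as data means the "messy macro logic" reduces to a single guard around test generation, and the list doubles as documentation of known quirks.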
Recommendation
Short term: Provider capability overrides (option 1)
Long term: Work with models.dev on richer capability metadata (option 2)