Handling capability mismatches between models.dev and provider reality #163

@neilberkman

Problem

models.dev capabilities are coarse flags (true/false/null) and don't capture provider-specific quirks:

  • Partial support (tools work, but not in streaming mode)
  • API restrictions (provider limitation, not model limitation)
  • Unreliable behavior (capability exists but doesn't work consistently)

When capabilities don't match reality, tests fail with no good way to handle it.

Example: Bedrock Llama 3.3 70B

models.dev says:

{ "id": "meta.llama3-3-70b-instruct-v1:0", "tool_call": true }

Reality: 19 tests, 3 failures

  1. Streaming + tools: HTTP 400 "This model doesn't support tool use in streaming mode"
    • Bedrock API limitation, not model limitation
  2. Type coercion: Schema wants pos_integer, model returns "30" (string)
    • Structured output works but ignores type constraints
  3. Unreliable tools: With tool_choice forcing tool use, got 0 tool calls
    • Capability exists but doesn't work consistently

Current workaround: Removed the model from the test matrix (config/config.exs, test/support/model_matrix.ex)

Possible Solutions

1. Provider capability overrides

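# In amazon_bedrock provider module
def adjust_capabilities("meta.llama" <> _, caps) do
  # Bedrock rejected streaming tool use for Llama 3.3 70B
  Map.put(caps, :streaming_tools, false)
end

# Other models keep their models.dev capabilities
def adjust_capabilities(_model_id, caps), do: caps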

Pro: Accurate. Con: Duplicates models.dev data.
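
To make option 1 concrete, here is a minimal self-contained sketch of how an override could be layered on top of the models.dev data. The module name `CapabilityOverrides`, the `adjust/3` function, and the `:streaming_tools` key are illustrative assumptions, not existing API.

```elixir
defmodule CapabilityOverrides do
  # Hypothetical module: names are illustrative, not part of any existing API.

  # Bedrock returned HTTP 400 for tool use in streaming mode with this model
  # family, so the streaming-tools capability is switched off here.
  def adjust(:amazon_bedrock, "meta.llama" <> _rest, caps) do
    Map.put(caps, :streaming_tools, false)
  end

  # Every other provider/model pair keeps the capabilities it was given.
  def adjust(_provider, _model_id, caps), do: caps
end

# Capabilities as derived from models.dev (assumed shape).
base = %{tool_call: true}

CapabilityOverrides.adjust(:amazon_bedrock, "meta.llama3-3-70b-instruct-v1:0", base)
```

Keeping the fallback clause last means providers without quirks need no code at all, and the override stays next to the provider it describes.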

2. Extended models.dev schema

{
  "tool_call": { "supported": true, "streaming": false }
}

Pro: Canonical source. Con: Requires upstream changes.
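
If the schema were extended this way, consumers would likely need to accept both the current flag and the object form during a transition. A rough sketch of such normalization follows. The module `CapabilityShape` is hypothetical, and treating `null` as unsupported and a bare `true` as also covering streaming are assumptions made for illustration.

```elixir
defmodule CapabilityShape do
  # Hypothetical helper: turns either capability form into one internal shape.

  # Proposed object form, e.g. %{"supported" => true, "streaming" => false}.
  # A missing "streaming" key falls back to the "supported" value (assumption).
  def normalize(%{"supported" => supported} = detail) do
    %{
      supported: supported == true,
      streaming: Map.get(detail, "streaming", supported) == true
    }
  end

  # Current flag form: true, false, or nil. nil is treated as false (assumption).
  def normalize(flag) when is_boolean(flag) or is_nil(flag) do
    %{supported: flag == true, streaming: flag == true}
  end
end

CapabilityShape.normalize(%{"supported" => true, "streaming" => false})
```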

3. Conditional test generation

unless has_known_issue?(provider, model, capability) do
  test "..." do
    # ...
  end
end

Pro: Precise control. Con: Messy macro logic.
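
One way `has_known_issue?/3` could be backed is a plain list of known-bad combinations. This is only a sketch: the `KnownIssues` module and the `:streaming_tools` capability name are assumptions, and the single entry reflects the Bedrock streaming failure described above.

```elixir
defmodule KnownIssues do
  # Hypothetical registry of {provider, model, capability} combinations
  # that are known not to work despite what models.dev reports.
  @issues [
    {:amazon_bedrock, "meta.llama3-3-70b-instruct-v1:0", :streaming_tools}
  ]

  # True when the exact combination is listed in @issues.
  def has_known_issue?(provider, model, capability) do
    {provider, model, capability} in @issues
  end
end

KnownIssues.has_known_issue?(:amazon_bedrock, "meta.llama3-3-70b-instruct-v1:0", :streaming_tools)
```

Because the list is a module attribute, the lookup is available at compile time, which is when test-generating macros would need it.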

Recommendation

Short term: Provider capability overrides (option 1)
Long term: Work with models.dev on richer capability metadata (option 2)
