Handling capability mismatches between models.dev and provider reality
Problem
models.dev capability flags are coarse (`true`/`false`/`null`) and don't capture provider-specific quirks:
- Partial support (tools work, but not in streaming mode)
- API restrictions (provider limitation, not model limitation)
- Unreliable behavior (capability exists but doesn't work consistently)
When capabilities don't match reality, tests fail with no good way to handle it.
Example: Bedrock Llama 3.3 70B
models.dev says:
```json
{ "id": "meta.llama3-3-70b-instruct-v1:0", "tool_call": true }
```
Reality: 19 tests, 3 failures
- Streaming + tools: HTTP 400 "This model doesn't support tool use in streaming mode"
  - Bedrock API limitation, not a model limitation
- Type coercion: schema wants `pos_integer`, model returns `"30"` (a string)
  - Structured output works but ignores type constraints
- Unreliable tools: with `tool_choice` forcing tool use, got 0 tool calls
  - Capability exists but doesn't work consistently
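The type-coercion quirk can at least be worked around client-side by coercing string values before schema validation. A minimal sketch; the module and function names here are hypothetical, not part of the library:

```elixir
defmodule CoercionSketch do
  # Hypothetical helper: coerce a value the model returned as a string
  # (e.g. "30") into the positive integer the schema expects.
  def coerce_pos_integer(value) when is_integer(value) and value > 0, do: {:ok, value}

  def coerce_pos_integer(value) when is_binary(value) do
    case Integer.parse(value) do
      {int, ""} when int > 0 -> {:ok, int}
      _ -> {:error, :invalid}
    end
  end

  def coerce_pos_integer(_), do: {:error, :invalid}
end
```

This papers over the symptom rather than fixing the capability metadata, so it complements rather than replaces the options below.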
Current workaround: the model was removed from the test matrix (`config/config.exs`, `test/support/model_matrix.ex`).
Possible Solutions
1. Provider capability overrides
```elixir
# In the amazon_bedrock provider module
def adjust_capabilities("meta.llama" <> _, caps) do
  Map.put(caps, :streaming_tools, false)
end
```
Pro: Accurate. Con: Duplicates models.dev data.
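Option 1 could be wired into model loading as a small merge step. A sketch under assumed names (the `@base_caps` map and `capabilities_for/1` are illustrative, not existing APIs):

```elixir
defmodule BedrockOverridesSketch do
  # Hypothetical: capabilities as loaded from models.dev metadata.
  @base_caps %{tool_call: true, streaming: true}

  # Llama models on Bedrock: tools work, but not in streaming mode.
  def adjust_capabilities("meta.llama" <> _, caps) do
    Map.put(caps, :streaming_tools, false)
  end

  # Pass-through for models with no known quirks.
  def adjust_capabilities(_model_id, caps), do: caps

  def capabilities_for(model_id) do
    adjust_capabilities(model_id, @base_caps)
  end
end
```

The pass-through clause keeps the override surface small: only models with documented quirks get a dedicated clause.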
2. Extended models.dev schema
```json
{
  "tool_call": { "supported": true, "streaming": false }
}
```
Pro: Canonical source. Con: Requires upstream changes.
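On the consuming side, the richer object form could be normalized alongside the current boolean form, so both schemas work during a transition. A sketch; the normalized shape and function name are assumptions, not models.dev's actual format beyond the proposal above:

```elixir
defmodule CapabilitySchemaSketch do
  # Normalize both the current scalar form (true/false/nil) and a
  # proposed object form into one shape: %{supported: _, streaming: _}.
  def normalize_tool_call(true), do: %{supported: true, streaming: true}
  def normalize_tool_call(value) when value in [false, nil],
    do: %{supported: false, streaming: false}

  def normalize_tool_call(%{"supported" => supported} = cap) do
    # Default streaming support to the overall flag when unspecified.
    %{supported: supported, streaming: Map.get(cap, "streaming", supported)}
  end
end
```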
3. Conditional test generation
```elixir
unless has_known_issue?(provider, model, capability) do
  test "..." do
    # ...
  end
end
```
Pro: Precise control. Con: Messy macro logic.
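The `has_known_issue?/3` check from option 3 could be backed by a plain data registry rather than macro logic. A sketch with a hypothetical module and issue list:

```elixir
defmodule KnownIssuesSketch do
  # Hypothetical registry: {provider, model-id prefix, capability} tuples
  # describing known mismatches between advertised and actual behavior.
  @known_issues [
    {:amazon_bedrock, "meta.llama", :streaming_tools},
    {:amazon_bedrock, "meta.llama", :forced_tool_choice}
  ]

  def has_known_issue?(provider, model_id, capability) do
    Enum.any?(@known_issues, fn {p, prefix, cap} ->
      p == provider and cap == capability and String.starts_with?(model_id, prefix)
    end)
  end
end
```

Keeping the registry as data means the "messy macro logic" reduces to a single guard around test generation, and the list doubles as documentation of known quirks.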
Recommendation
Short term: Provider capability overrides (option 1)
Long term: Work with models.dev on richer capability metadata (option 2)