
[Bug]: SchemaLLMPathExtractor generates incompatible JSON schemas for GPT and Gemini APIs #20629

@MarioRicoIbanez


Bug Description

Summary

SchemaLLMPathExtractor generates Pydantic schemas with "additionalProperties": true in nested anyOf schemas. This causes errors when using structured output with:

  • GPT-5-mini-2025-08-07: BadRequestError - requires additionalProperties: false
  • Gemini-3-flash-preview: ValueError - doesn't support additionalProperties at all

Environment

  • Python: 3.12.12
  • Platform: macOS (Darwin 25.0.0)
  • UV: 0.9.2
  • LlamaIndex Core: 0.14.13
  • Pydantic: 2.11.7
  • Pydantic Core: 2.33.2
  • OpenAI: 1.109.1
  • Google GenAI: 1.59.0
  • LlamaIndex LLM Integrations:
    • llama-index-llms-openai: 0.6.13
    • llama-index-llms-google-genai: 0.8.4

Reproduction Steps

1. Create a SchemaLLMPathExtractor with Literal types

from typing import Literal
from llama_index.core.indices.property_graph.transformations.schema_llm import SchemaLLMPathExtractor
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-5-mini-2025-08-07", temperature=0.0)

entities = Literal["PERSON", "ORG", "LOCATION"]
relations = Literal["FOUNDED", "LOCATED_IN", "MANUFACTURES"]
entity_props = ["description"]

validation_schema = [
    ("PERSON", "FOUNDED", "ORG"),
    ("PERSON", "LOCATED_IN", "LOCATION"),
    ("ORG", "LOCATED_IN", "LOCATION"),
]

extractor = SchemaLLMPathExtractor(
    llm=llm,
    possible_entities=entities,
    possible_relations=relations,
    possible_entity_props=entity_props,
    kg_validation_schema=validation_schema,
    strict=True
)

2. Inspect the generated schema

import json

schema_cls = extractor.kg_schema_cls
json_schema = schema_cls.model_json_schema()

# Look for additionalProperties in the nested anyOf schemas
print(json.dumps(json_schema, indent=2))

3. Observe the problematic schema structure

{
  "$defs": {
    "Entity": {
      "properties": {
        "properties": {
          "anyOf": [
            {
              "additionalProperties": true,
              "type": "object"
            },
            {
              "type": "null"
            }
          ]
        }
      }
    }
  }
}

4. Attempt extraction with GPT-5-mini

from llama_index.core.schema import TextNode

node = TextNode(text="Elon Musk founded SpaceX in 2002.")
results = await extractor.acall([node])  # ← Fails with BadRequestError
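Since acall is awaited, the snippet assumes an async context; in a plain script a minimal driver along these lines can be used, reusing extractor and TextNode from the steps above:

import asyncio

async def main():
    node = TextNode(text="Elon Musk founded SpaceX in 2002.")
    # Currently raises BadRequestError when the model enforces structured output
    return await extractor.acall([node])

results = asyncio.run(main())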

Error Output (GPT-5-mini):

BadRequestError: Error code: 400 - {'error': {'message': "Invalid schema for response_format 'KGSchema': In context=('properties', 'properties', 'anyOf', '0'), 'additionalProperties' is required to be supplied and to be false.", 'type': 'invalid_request_error', 'param': 'response_format'}}

Error Output (Gemini-3-flash-preview):

ValueError: additionalProperties is not supported in the Gemini API.

Root Cause Analysis

The issue occurs in SchemaLLMPathExtractor.__init__() at the schema generation stage:

File: llama_index/core/indices/property_graph/transformations/schema_llm.py
Line: ~74 (schema creation with create_model())

When creating the schema for optional entity properties:

entity_cls = create_model(
    "Entity",
    type=(...),
    name=(...),
    properties=(
        Optional[Dict[str, Any]],  # ← This causes the problem
        Field(...)
    ),
)

Pydantic automatically generates an anyOf schema for the optional dict:

"properties": {
  "anyOf": [
    { "additionalProperties": true, "type": "object" },
    { "type": "null" }
  ]
}

The value true is incompatible with both OpenAI's structured output requirements and Gemini's API constraints.
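For reference, the same anyOf shape can be reproduced with Pydantic alone, without LlamaIndex. This is a minimal sketch; the exact JSON output may vary slightly between Pydantic versions:

import json
from typing import Any, Dict, Optional

from pydantic import Field, create_model

# Minimal model mirroring the optional "properties" field on the generated Entity class
DemoEntity = create_model(
    "DemoEntity",
    properties=(Optional[Dict[str, Any]], Field(default=None)),
)

# Should show an anyOf whose object branch carries additionalProperties: true,
# matching the schema dump above
print(json.dumps(DemoEntity.model_json_schema()["properties"]["properties"], indent=2))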

Current Behavior

  • ✅ Works with default Gemini model (no structured output enforcement)
  • ❌ Fails with GPT-5-mini (BadRequestError)
  • ❌ Fails with Gemini-3-flash-preview (ValueError)
  • ⚠️ Works with older models but causes compatibility warnings

Expected Behavior

The generated schema should have "additionalProperties": false in all nested anyOf object schemas to satisfy both API requirements:

  • GPT: Explicitly requires false
  • Gemini: Accepts false as a valid constraint
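
For the Entity example above, the corrected fragment would then read:

"properties": {
  "anyOf": [
    { "additionalProperties": false, "type": "object" },
    { "type": "null" }
  ]
}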

Proposed Solution

Pass a ConfigDict with a json_schema_extra callback to the create_model() calls in SchemaLLMPathExtractor.__init__() so the generated schema is post-processed:

from pydantic import ConfigDict

def clean_schema(schema, info=None):
    """Clean additionalProperties in nested anyOf schemas for API compatibility."""
    def fix_props(obj):
        if isinstance(obj, dict):
            if 'anyOf' in obj:
                for alt in obj['anyOf']:
                    if isinstance(alt, dict) and alt.get('type') == 'object':
                        alt['additionalProperties'] = False
            for value in obj.values():
                fix_props(value)
        elif isinstance(obj, list):
            for item in obj:
                fix_props(item)

    fix_props(schema)
    return schema

# When creating models, pass json_schema_extra:
entity_cls = create_model(
    "Entity",
    type=(...),
    name=(...),
    properties=(...),
    __config__=ConfigDict(json_schema_extra=clean_schema)
)

Alternatively, post-process the schema more cleanly by patching model_json_schema() after model creation (see the workaround below).
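
Whichever variant is chosen, a quick sanity check along these lines (illustrative only, not part of LlamaIndex) confirms that no additionalProperties: true remains anywhere in the generated schema:

def has_open_additional_properties(obj) -> bool:
    """Return True if any nested schema still contains additionalProperties: true."""
    if isinstance(obj, dict):
        if obj.get("additionalProperties") is True:
            return True
        return any(has_open_additional_properties(v) for v in obj.values())
    if isinstance(obj, list):
        return any(has_open_additional_properties(item) for item in obj)
    return False

assert not has_open_additional_properties(extractor.kg_schema_cls.model_json_schema())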

Impact

  • Severity: High - Blocks usage with latest GPT and Gemini models
  • Affected Users: Anyone using SchemaLLMPathExtractor with structured output APIs
  • Workaround: Override in subclass (see implementation below)

Workaround (Temporary)

Until this is fixed in LlamaIndex, override SchemaLLMPathExtractor in your code:

from llama_index.core.indices.property_graph.transformations.schema_llm import SchemaLLMPathExtractor

def _clean_schema_for_apis(schema, info=None):
    """Fix additionalProperties for API compatibility."""
    def fix_props(obj):
        if isinstance(obj, dict):
            if 'anyOf' in obj:
                for alt in obj['anyOf']:
                    if isinstance(alt, dict) and alt.get('type') == 'object':
                        alt['additionalProperties'] = False
            for value in obj.values():
                fix_props(value)
        elif isinstance(obj, list):
            for item in obj:
                fix_props(item)
    fix_props(schema)
    return schema

class FixedSchemaLLMPathExtractor(SchemaLLMPathExtractor):
    """SchemaLLMPathExtractor with API compatibility fix."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Patch the schema class to clean additionalProperties
        schema_cls = self.kg_schema_cls
        original_method = schema_cls.model_json_schema

        def patched_model_json_schema(*a, **kw):
            schema = original_method(*a, **kw)
            _clean_schema_for_apis(schema)
            return schema

        schema_cls.model_json_schema = patched_model_json_schema
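
The subclass is a drop-in replacement. Reusing the llm, entities, relations, entity_props, and validation_schema names from the reproduction steps:

import json

extractor = FixedSchemaLLMPathExtractor(
    llm=llm,
    possible_entities=entities,
    possible_relations=relations,
    possible_entity_props=entity_props,
    kg_validation_schema=validation_schema,
    strict=True,
)

# The nested anyOf object schemas now carry additionalProperties: false
print(json.dumps(extractor.kg_schema_cls.model_json_schema(), indent=2))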

Test Case

Create a test file demonstrating the issue and fix:

import asyncio
from typing import Literal
from llama_index.core.indices.property_graph.transformations.schema_llm import SchemaLLMPathExtractor
from llama_index.core.schema import TextNode
from llama_index.llms.openai import OpenAI

async def test_schema_compatibility():
    llm = OpenAI(model="gpt-5-mini-2025-08-07", temperature=0.0)

    extractor = SchemaLLMPathExtractor(
        llm=llm,
        possible_entities=Literal["PERSON", "ORG"],
        possible_relations=Literal["FOUNDED"],
        possible_entity_props=["description"],
        kg_validation_schema=[("PERSON", "FOUNDED", "ORG")],
        strict=True
    )

    node = TextNode(text="Elon Musk founded SpaceX.")

    # With the fix applied, this should not raise an error
    results = await extractor.acall([node])
    assert len(results) > 0

if __name__ == "__main__":
    asyncio.run(test_schema_compatibility())
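
A schema-only variant of the test can assert the fix without any API call; this sketch assumes the FixedSchemaLLMPathExtractor from the workaround above and uses MockLLM so no API key is needed:

from typing import Literal

from llama_index.core.llms import MockLLM

def test_schema_has_no_open_additional_properties():
    extractor = FixedSchemaLLMPathExtractor(
        llm=MockLLM(),
        possible_entities=Literal["PERSON", "ORG"],
        possible_relations=Literal["FOUNDED"],
        possible_entity_props=["description"],
        kg_validation_schema=[("PERSON", "FOUNDED", "ORG")],
        strict=True,
    )

    def contains_open_objects(obj):
        # True if any nested schema still allows arbitrary extra properties
        if isinstance(obj, dict):
            return obj.get("additionalProperties") is True or any(
                contains_open_objects(v) for v in obj.values()
            )
        if isinstance(obj, list):
            return any(contains_open_objects(item) for item in obj)
        return False

    assert not contains_open_objects(extractor.kg_schema_cls.model_json_schema())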

Additional Context

Schema Analysis Output

Root cause location: $defs.Entity.properties.properties.anyOf[0]
Problem: additionalProperties = true
Requirement (GPT-5-mini): additionalProperties must be false
Requirement (Gemini): additionalProperties must not exist
Solution: Set additionalProperties = false in all nested object schemas

Related Issues

  • Affects both SchemaLLMPathExtractor and potentially other schema generation in llama-index
  • Similar issues may occur with other optional Dict/object fields

Requested Action

Please implement the proposed solution to ensure SchemaLLMPathExtractor generates schemas compatible with:

  1. OpenAI's GPT-5-mini structured output requirements
  2. Google Gemini's API constraints
  3. All other LLM providers using structured output

Created: 2025-02-05
Environment: xcert project
Reproduction: Confirmed with llama-index-core 0.14.13

Version

0.14.13

Steps to Reproduce

.

Relevant Logs/Tracebacks

See the error outputs under "Reproduction Steps" above.
