Bug Description
Summary
`SchemaLLMPathExtractor` generates Pydantic schemas with `"additionalProperties": true` in nested `anyOf` schemas. This causes errors when using structured output with:
- GPT-5-mini-2025-08-07: raises `BadRequestError` (requires `additionalProperties: false`)
- Gemini-3-flash-preview: raises `ValueError` (doesn't support `additionalProperties` at all)
Environment
- Python: 3.12.12
- Platform: macOS (Darwin 25.0.0)
- UV: 0.9.2
- LlamaIndex Core: 0.14.13
- Pydantic: 2.11.7
- Pydantic Core: 2.33.2
- OpenAI: 1.109.1
- Google GenAI: 1.59.0
- LlamaIndex LLM Integrations:
- llama-index-llms-openai: 0.6.13
- llama-index-llms-google-genai: 0.8.4
Reproduction Steps
1. Create a `SchemaLLMPathExtractor` with `Literal` types

```python
from typing import Literal
from llama_index.core.indices.property_graph.transformations.schema_llm import SchemaLLMPathExtractor
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-5-mini-2025-08-07", temperature=0.0)

entities = Literal["PERSON", "ORG", "LOCATION"]
relations = Literal["FOUNDED", "LOCATED_IN", "MANUFACTURES"]
entity_props = ["description"]
validation_schema = [
    ("PERSON", "FOUNDED", "ORG"),
    ("PERSON", "LOCATED_IN", "LOCATION"),
    ("ORG", "LOCATED_IN", "LOCATION"),
]

extractor = SchemaLLMPathExtractor(
    llm=llm,
    possible_entities=entities,
    possible_relations=relations,
    possible_entity_props=entity_props,
    kg_validation_schema=validation_schema,
    strict=True,
)
```

2. Inspect the generated schema
```python
import json

schema_cls = extractor.kg_schema_cls
json_schema = schema_cls.model_json_schema()

# Find additionalProperties in nested schemas
print(json.dumps(json_schema, indent=2))
```

3. Observe the problematic schema structure
```json
{
  "$defs": {
    "Entity": {
      "properties": {
        "properties": {
          "anyOf": [
            {
              "additionalProperties": true,
              "type": "object"
            },
            {
              "type": "null"
            }
          ]
        }
      }
    }
  }
}
```

4. Attempt extraction with GPT-5-mini
```python
from llama_index.core.schema import TextNode

node = TextNode(text="Elon Musk founded SpaceX in 2002.")
results = await extractor.acall([node])  # ← Fails with BadRequestError
```

Error Output (GPT-5-mini):

```
BadRequestError: Error code: 400 - {'error': {'message': "Invalid schema for response_format 'KGSchema': In context=('properties', 'properties', 'anyOf', '0'), 'additionalProperties' is required to be supplied and to be false.", 'type': 'invalid_request_error', 'param': 'response_format'}}
```

Error Output (Gemini-3-flash-preview):

```
ValueError: additionalProperties is not supported in the Gemini API.
```
Root Cause Analysis
The issue occurs in SchemaLLMPathExtractor.__init__() at the schema generation stage:
File: llama_index/core/indices/property_graph/transformations/schema_llm.py
Line: ~74 (schema creation with create_model())
When creating the schema for optional entity properties:

```python
entity_cls = create_model(
    "Entity",
    type=(...),
    name=(...),
    properties=(
        Optional[Dict[str, Any]],  # ← This causes the problem
        Field(...),
    ),
)
```

Pydantic automatically generates an `anyOf` schema for the optional dict:

```json
"properties": {
  "anyOf": [
    { "additionalProperties": true, "type": "object" },
    { "type": "null" }
  ]
}
```

The value `true` is incompatible with both OpenAI's structured output requirements and Gemini's API constraints.
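The `anyOf` generation can be reproduced with Pydantic alone, without LlamaIndex. Below is a minimal standalone sketch; the toy `Entity` model is a stand-in for the one `create_model()` builds inside `SchemaLLMPathExtractor`:

```python
from typing import Any, Dict, Optional

from pydantic import Field, create_model

# Stand-in for the Entity model SchemaLLMPathExtractor builds internally:
# an optional dict field triggers the anyOf [object, null] schema.
Entity = create_model(
    "Entity",
    properties=(Optional[Dict[str, Any]], Field(default=None)),
)

schema = Entity.model_json_schema()
any_of = schema["properties"]["properties"]["anyOf"]

# The object branch does NOT carry additionalProperties: false,
# which is exactly what strict structured-output APIs reject.
object_branch = next(alt for alt in any_of if alt.get("type") == "object")
print(object_branch.get("additionalProperties"))
```

(The exact output depends on the Pydantic version: recent releases emit `additionalProperties: true` explicitly, older ones omit the key; neither satisfies the `false` requirement.)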
Current Behavior
- ✅ Works with default Gemini model (no structured output enforcement)
- ❌ Fails with GPT-5-mini (BadRequestError)
- ❌ Fails with Gemini-3-flash-preview (ValueError)
- ⚠️ Works with older models but causes compatibility warnings
Expected Behavior
The generated schema should have `"additionalProperties": false` in all nested `anyOf` object schemas to satisfy both API requirements:
- GPT: explicitly requires `false`
- Gemini: accepts `false` as a valid constraint
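The expectation can be checked mechanically. Here is a small validator sketch (the function name `find_violations` is illustrative, not part of LlamaIndex): it walks a JSON schema and collects the paths of `anyOf` object branches that would be rejected by strict structured-output APIs.

```python
def find_violations(schema, path="$"):
    """Collect paths of object schemas inside anyOf that lack
    additionalProperties: false (i.e. that strict structured-output
    APIs would reject)."""
    violations = []
    if isinstance(schema, dict):
        for alt_idx, alt in enumerate(schema.get("anyOf", [])):
            if (
                isinstance(alt, dict)
                and alt.get("type") == "object"
                and alt.get("additionalProperties") is not False
            ):
                violations.append(f"{path}.anyOf[{alt_idx}]")
        for key, value in schema.items():
            violations.extend(find_violations(value, f"{path}.{key}"))
    elif isinstance(schema, list):
        for i, item in enumerate(schema):
            violations.extend(find_violations(item, f"{path}[{i}]"))
    return violations

bad = {"anyOf": [{"additionalProperties": True, "type": "object"}, {"type": "null"}]}
good = {"anyOf": [{"additionalProperties": False, "type": "object"}, {"type": "null"}]}
print(find_violations(bad))   # → ['$.anyOf[0]']
print(find_violations(good))  # → []
```

Running this on the output of `extractor.kg_schema_cls.model_json_schema()` pinpoints every offending branch at once.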
Proposed Solution
Add a `ConfigDict` parameter to the `create_model()` calls in `SchemaLLMPathExtractor.__init__()` to post-process the schema:

```python
from pydantic import ConfigDict

def clean_schema(schema, info=None):
    """Clean additionalProperties in nested anyOf schemas for API compatibility."""
    def fix_props(obj):
        if isinstance(obj, dict):
            if 'anyOf' in obj:
                for alt in obj['anyOf']:
                    if isinstance(alt, dict) and alt.get('type') == 'object':
                        alt['additionalProperties'] = False
            for value in obj.values():
                fix_props(value)
        elif isinstance(obj, list):
            for item in obj:
                fix_props(item)

    fix_props(schema)
    return schema

# When creating models, pass json_schema_extra:
entity_cls = create_model(
    "Entity",
    type=(...),
    name=(...),
    properties=(...),
    __config__=ConfigDict(json_schema_extra=clean_schema),
)
```

OR post-process the schema in a cleaner way by patching `model_json_schema()` after creation.
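For illustration, the cleaning pass can be exercised on the problematic fragment from step 3 above. This is a self-contained sketch: `clean_schema` is repeated here so the snippet runs standalone on a plain dict, without Pydantic or LlamaIndex.

```python
# Standalone sketch of the proposed cleaning pass, applied to the
# schema fragment that GPT-5-mini rejects.
def clean_schema(schema, info=None):
    """Force additionalProperties=false on object alternatives inside anyOf."""
    def fix_props(obj):
        if isinstance(obj, dict):
            if "anyOf" in obj:
                for alt in obj["anyOf"]:
                    if isinstance(alt, dict) and alt.get("type") == "object":
                        alt["additionalProperties"] = False
            for value in obj.values():
                fix_props(value)
        elif isinstance(obj, list):
            for item in obj:
                fix_props(item)

    fix_props(schema)
    return schema

fragment = {
    "$defs": {
        "Entity": {
            "properties": {
                "properties": {
                    "anyOf": [
                        {"additionalProperties": True, "type": "object"},
                        {"type": "null"},
                    ]
                }
            }
        }
    }
}

cleaned = clean_schema(fragment)
branch = cleaned["$defs"]["Entity"]["properties"]["properties"]["anyOf"][0]
print(branch["additionalProperties"])  # → False
```

After the pass, the object branch carries `additionalProperties: false`, which is exactly the shape OpenAI's structured output validation demands.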
Impact
- Severity: High - blocks usage with the latest GPT and Gemini models
- Affected Users: anyone using `SchemaLLMPathExtractor` with structured output APIs
- Workaround: override in a subclass (see implementation below)
Workaround (Temporary)
Until this is fixed in LlamaIndex, override `SchemaLLMPathExtractor` in your code:

```python
from llama_index.core.indices.property_graph.transformations.schema_llm import SchemaLLMPathExtractor

def _clean_schema_for_apis(schema, info=None):
    """Fix additionalProperties for API compatibility."""
    def fix_props(obj):
        if isinstance(obj, dict):
            if 'anyOf' in obj:
                for alt in obj['anyOf']:
                    if isinstance(alt, dict) and alt.get('type') == 'object':
                        alt['additionalProperties'] = False
            for value in obj.values():
                fix_props(value)
        elif isinstance(obj, list):
            for item in obj:
                fix_props(item)

    fix_props(schema)
    return schema

class FixedSchemaLLMPathExtractor(SchemaLLMPathExtractor):
    """SchemaLLMPathExtractor with API compatibility fix."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Patch the schema class to clean additionalProperties
        schema_cls = self.kg_schema_cls
        original_method = schema_cls.model_json_schema

        def patched_model_json_schema(*a, **kw):
            schema = original_method(*a, **kw)
            _clean_schema_for_apis(schema)
            return schema

        schema_cls.model_json_schema = patched_model_json_schema
```

Test Case
Create a test file demonstrating the issue and fix:
```python
import asyncio
from typing import Literal

from llama_index.core.indices.property_graph.transformations.schema_llm import SchemaLLMPathExtractor
from llama_index.core.schema import TextNode
from llama_index.llms.openai import OpenAI

async def test_schema_compatibility():
    llm = OpenAI(model="gpt-5-mini-2025-08-07", temperature=0.0)
    extractor = SchemaLLMPathExtractor(
        llm=llm,
        possible_entities=Literal["PERSON", "ORG"],
        possible_relations=Literal["FOUNDED"],
        possible_entity_props=["description"],
        kg_validation_schema=[("PERSON", "FOUNDED", "ORG")],
        strict=True,
    )
    node = TextNode(text="Elon Musk founded SpaceX.")
    # This should not raise an error
    results = await extractor.acall([node])
    assert len(results) > 0

if __name__ == "__main__":
    asyncio.run(test_schema_compatibility())
```

Additional Context
Schema Analysis Output
- Root cause location: `$defs.Entity.properties.properties.anyOf[0]`
- Problem: `additionalProperties = true`
- Requirement (GPT-5-mini): `additionalProperties` must be `false`
- Requirement (Gemini): `additionalProperties` must not exist
- Solution: set `additionalProperties = false` in all nested object schemas
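Because the Gemini requirement above differs from GPT's (the key must be absent rather than `false`), a provider-specific variant of the cleaning pass may be needed. A sketch, with the helper name `strip_additional_properties` being illustrative:

```python
def strip_additional_properties(schema):
    """Remove every additionalProperties key, for APIs (per the
    analysis above, Gemini) that reject the key outright."""
    if isinstance(schema, dict):
        schema.pop("additionalProperties", None)
        for value in schema.values():
            strip_additional_properties(value)
    elif isinstance(schema, list):
        for item in schema:
            strip_additional_properties(item)
    return schema

fragment = {
    "anyOf": [
        {"additionalProperties": True, "type": "object"},
        {"type": "null"},
    ]
}
print(strip_additional_properties(fragment))
# → {'anyOf': [{'type': 'object'}, {'type': 'null'}]}
```

Whether LlamaIndex should apply the GPT-style `false` or the strip variant per provider is a design question for the maintainers; the google-genai integration may already normalize `false` on its own.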
Related Issues
- Affects both `SchemaLLMPathExtractor` and potentially other schema generation in llama-index
- Similar issues may occur with other optional `Dict`/object fields
Requested Action
Please implement the proposed solution to ensure SchemaLLMPathExtractor generates schemas compatible with:
- OpenAI's GPT-5-mini structured output requirements
- Google Gemini's API constraints
- All other LLM providers using structured output
- Created: 2025-02-05
- Environment: xcert project
- Reproduction: confirmed with llama-index-core 0.14.13
Version
0.14.13
Steps to Reproduce
.