Closed
Labels: documentation, enhancement, python, size:M
Description
Hey Instructor team!
I've submitted a PR to MLflow that integrates Instructor/Pydantic schema validation as first-class scorers in MLflow's GenAI evaluation framework: mlflow/mlflow#20628
What it does
It lets MLflow users validate LLM structured outputs against Pydantic schemas. All checks are deterministic; no extra LLM calls are needed:
```python
from pydantic import BaseModel, Field

from mlflow.genai.scorers.instructor import (
    SchemaCompliance,
    FieldCompleteness,
    TypeValidation,
    ConstraintValidation,
    ExtractionAccuracy,
)


class UserInfo(BaseModel):
    name: str
    email: str
    age: int


class UserInfoWithConstraints(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    age: int = Field(ge=0, le=150)
    email: str = Field(pattern=r"^[\w.-]+@[\w.-]+\.\w+$")


# Validate schema compliance
scorer = SchemaCompliance()
result = scorer(
    outputs={"name": "John", "email": "john@example.com", "age": 30},
    expectations={"schema": UserInfo},
)
print(result.value)  # "yes"

# Check field completeness (returns 0.0-1.0)
scorer = FieldCompleteness()
result = scorer(
    outputs={"name": "John", "email": None, "age": 30},
    expectations={"schema": UserInfo},
)
print(result.value)  # 0.67 (2/3 fields complete)

# Validate constraints
scorer = ConstraintValidation()
result = scorer(
    outputs={"name": "", "age": 200, "email": "invalid"},
    expectations={"schema": UserInfoWithConstraints},
)
print(result.value)  # "no"
print(result.rationale)  # "name: String should have at least 1 character; age: Input should be <= 150..."
```

Use with mlflow.genai.evaluate()
```python
import mlflow
from mlflow.genai.scorers.instructor import SchemaCompliance, FieldCompleteness

eval_data = [
    {
        "inputs": {"query": "Extract user info"},
        "outputs": {"name": "John Doe", "email": "john@example.com", "age": 30},
    },
    {
        "inputs": {"query": "Extract user info"},
        "outputs": {"name": "Jane"},  # Missing fields
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[
        SchemaCompliance(),
        FieldCompleteness(),
    ],
    expectations={"schema": UserInfo},
)
```

Available Scorers
| Scorer | Purpose | Return Value |
|---|---|---|
| `SchemaCompliance` | Validates output matches Pydantic schema | `yes` / `no` |
| `FieldCompleteness` | Checks required fields are present and non-null | 0.0 - 1.0 |
| `TypeValidation` | Verifies field types match schema definitions | `yes` / `no` |
| `ConstraintValidation` | Checks Field validators/constraints pass | `yes` / `no` |
| `ExtractionAccuracy` | Compares extracted fields against ground truth | 0.0 - 1.0 |
Why this is useful
A lot of folks use Instructor to get structured outputs from LLMs, but then need to evaluate those outputs at scale. This integration lets them:
- Validate that LLM outputs actually match expected schemas
- Track validation metrics over time in MLflow
- Compare different prompts/models on extraction quality
- Catch schema drift in production
Why I'm posting here
A couple of things:
- Would love any feedback on the approach if you have time to look at the PR
- Once merged, would you be interested in adding MLflow to your integrations/ecosystem docs?
- I'd also be happy to collaborate on a blog post about using Instructor + MLflow together for structured output evaluation
Thanks for building Instructor - the Pydantic-first approach to structured outputs has been valuable for the LLM community!