
[INTEGRATION] Instructor/Pydantic validation now available in MLflow GenAI Evaluation #2063


Hey Instructor team!

I've submitted a PR to MLflow that integrates Instructor/Pydantic schema validation as first-class scorers in MLflow's GenAI evaluation framework: mlflow/mlflow#20628

What it does

Lets MLflow users validate LLM structured outputs against Pydantic schemas - all deterministic, no extra LLM calls needed:

from pydantic import BaseModel, Field
from mlflow.genai.scorers.instructor import (
    SchemaCompliance,
    FieldCompleteness,
    TypeValidation,
    ConstraintValidation,
    ExtractionAccuracy,
)

class UserInfo(BaseModel):
    name: str
    email: str
    age: int

class UserInfoWithConstraints(BaseModel):
    name: str = Field(min_length=1, max_length=100)
    age: int = Field(ge=0, le=150)
    email: str = Field(pattern=r"^[\w.-]+@[\w.-]+\.\w+$")

# Validate schema compliance
scorer = SchemaCompliance()
result = scorer(
    outputs={"name": "John", "email": "john@example.com", "age": 30},
    expectations={"schema": UserInfo},
)
print(result.value)  # "yes"

# Check field completeness (returns 0.0-1.0)
scorer = FieldCompleteness()
result = scorer(
    outputs={"name": "John", "email": None, "age": 30},
    expectations={"schema": UserInfo},
)
print(result.value)  # 0.67 (2/3 fields complete)

# Validate constraints
scorer = ConstraintValidation()
result = scorer(
    outputs={"name": "", "age": 200, "email": "invalid"},
    expectations={"schema": UserInfoWithConstraints},
)
print(result.value)  # "no"
print(result.rationale)  # "name: String should have at least 1 character; age: Input should be <= 150..."
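
TypeValidation is imported above but not demonstrated, so here is a minimal sketch. I'm assuming it takes the same outputs/expectations arguments as the other scorers and returns "no" when a value's type doesn't match the field annotation:

# Verify field types (sketch; same call pattern assumed as the scorers above)
scorer = TypeValidation()
result = scorer(
    outputs={"name": "John", "email": "john@example.com", "age": "thirty"},  # age should be an int
    expectations={"schema": UserInfo},
)
print(result.value)  # "no" (age is a str, not an int)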

Use with mlflow.genai.evaluate()

import mlflow
from mlflow.genai.scorers.instructor import SchemaCompliance, FieldCompleteness

eval_data = [
    {
        "inputs": {"query": "Extract user info"},
        "outputs": {"name": "John Doe", "email": "john@example.com", "age": 30},
        "expectations": {"schema": UserInfo},  # expectations are supplied per row
    },
    {
        "inputs": {"query": "Extract user info"},
        "outputs": {"name": "Jane"},  # missing email and age
        "expectations": {"schema": UserInfo},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[
        SchemaCompliance(),
        FieldCompleteness(),
    ],
)

Available Scorers

Scorer               | Purpose                                                 | Return value
SchemaCompliance     | Validates that the output matches the Pydantic schema   | yes / no
FieldCompleteness    | Checks that required fields are present and non-null    | 0.0 - 1.0
TypeValidation       | Verifies that field types match the schema definitions  | yes / no
ConstraintValidation | Checks that Field validators/constraints pass           | yes / no
ExtractionAccuracy   | Compares extracted fields against ground truth          | 0.0 - 1.0
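
ExtractionAccuracy compares against ground truth, so it needs expected values in addition to the schema. A hedged sketch, assuming the ground truth rides along in expectations (the "expected_fields" key below is illustrative; the exact key is defined in the PR):

# Compare extracted fields to ground truth (sketch; the key name is illustrative)
scorer = ExtractionAccuracy()
result = scorer(
    outputs={"name": "John Doe", "email": "john@example.com", "age": 31},
    expectations={
        "schema": UserInfo,
        "expected_fields": {"name": "John Doe", "email": "john@example.com", "age": 30},
    },
)
print(result.value)  # e.g. 0.67 if 2 of 3 fields match the expected values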

Why this is useful

A lot of folks use Instructor to get structured outputs from LLMs, but then need to evaluate those outputs at scale. This integration lets them:

  1. Validate that LLM outputs actually match expected schemas
  2. Track validation metrics over time in MLflow
  3. Compare different prompts/models on extraction quality
  4. Catch schema drift in production

Why I'm posting here

A few things:

  1. Would love any feedback on the approach if you have time to look at the PR
  2. Once merged, would you be interested in adding MLflow to your integrations/ecosystem docs?
  3. I'd also be happy to collaborate on a blog post about using Instructor + MLflow together for structured output evaluation

Thanks for building Instructor - the Pydantic-first approach to structured outputs has been valuable for the LLM community!
