The schema metadata system separates processing instructions from schema definitions in the neurostore-text-extraction project. Developers annotate schema fields with metadata tags that control text processing operations, while the schema output presented to LLMs stays clean.
- Separation of Concerns: Keeps schema definitions focused on data structure while handling processing logic separately
- Flexible Processing: Supports multiple processing operations through metadata tags
- Clean Output: Ensures processed data meets schema requirements while preserving original definitions
- Maintainable Code: Centralizes text processing logic and makes it reusable across schemas
Schema metadata is added using the json_schema_extra parameter in Pydantic field definitions:

    from pydantic import BaseModel, Field

    class ExampleSchema(BaseModel):
        value: str = Field(
            description="Example field with processing metadata",
            json_schema_extra={
                "normalize_text": True,
                "expand_abbreviations": True
            }
        )

- normalize_text
  - Purpose: Standardizes text formatting
  - Effects:
    - Strips whitespace
    - Converts to title case
    - Handles null values ("None", "N/A", etc.)
  - Example: "some text " → "Some Text"
- expand_abbreviations
  - Purpose: Expands abbreviated terms to their full form
  - Effects:
    - Uses scispacy to detect abbreviations in source text
    - Replaces abbreviations with their expanded forms
  - Example: "MRI scan" → "Magnetic Resonance Imaging scan"
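
The effects of the two tags listed above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the function names mirror the tag names, the `NULL_STRINGS` set is an assumed list of placeholders (the text only names "None" and "N/A"), mapping placeholders to `None` is an assumed behavior, and a precomputed short-to-long mapping stands in for scispacy's abbreviation detection.

```python
import re
from typing import Mapping, Optional

# Assumed set of null-like placeholders; the documentation only lists
# "None", "N/A", etc.
NULL_STRINGS = {"none", "n/a", "na", "null", ""}


def normalize_text(value: Optional[str]) -> Optional[str]:
    """Strip whitespace, map null-like placeholders to None, title-case the rest."""
    if value is None:
        return None
    stripped = value.strip()
    if stripped.lower() in NULL_STRINGS:
        return None  # Assumption: placeholders become a real null.
    return stripped.title()


def expand_abbreviations(value: str, abbreviations: Mapping[str, str]) -> str:
    """Replace whole-word abbreviations using a short -> long mapping.

    The mapping stands in for the abbreviations that scispacy would
    detect in the source text.
    """
    for short, long_form in abbreviations.items():
        pattern = r"\b" + re.escape(short) + r"\b"
        # A function replacement keeps backslashes in long_form literal.
        value = re.sub(pattern, lambda _m, repl=long_form: repl, value)
    return value


normalized = normalize_text("some text ")
expanded = expand_abbreviations("MRI scan", {"MRI": "Magnetic Resonance Imaging"})
```

One caveat of this sketch: `str.title()` capitalizes after every non-letter character, so strings containing apostrophes or digits may need a more careful casing rule.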

    class ParticipantGroup(BaseModel):
        name: str = Field(
            description="Name of the participant group",
            json_schema_extra={"normalize_text": True}
        )
        diagnosis: str = Field(
            description="Clinical diagnosis of the group",
            json_schema_extra={
                "normalize_text": True,
                "expand_abbreviations": True
            }
        )

- Field Collection
  - During extractor initialization, the system scans schema definitions
  - Fields with metadata tags are collected and stored for processing
  - Nested fields are handled using dot notation (e.g., "groups[].diagnosis")
- Text Processing
  - Post-processing occurs after initial transformation
  - Source text is analyzed for abbreviations (if needed)
  - Fields are processed according to their metadata tags
  - Processing order: abbreviation expansion → text normalization
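
The field collection step can be sketched as a walk over a JSON-schema-style dictionary. This is a hedged illustration rather than the extractor's actual code: `collect_tagged_fields` and `PROCESSING_ORDER` are hypothetical names, and the sketch assumes the schema has nested definitions inlined (no `$ref` indirection) and that metadata tags appear as keys on each property.

```python
from typing import Any, Dict, List

# Tags in the order they are applied: expansion first, then normalization.
PROCESSING_ORDER = ("expand_abbreviations", "normalize_text")


def collect_tagged_fields(
    schema: Dict[str, Any], prefix: str = ""
) -> Dict[str, List[str]]:
    """Map each tagged field path to its enabled tags, in processing order."""
    collected: Dict[str, List[str]] = {}
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        tags = [tag for tag in PROCESSING_ORDER if spec.get(tag) is True]
        if tags:
            collected[path] = tags
        if spec.get("type") == "object":
            # Nested object: child paths look like "parent.field_name".
            collected.update(collect_tagged_fields(spec, prefix=f"{path}."))
        elif spec.get("type") == "array" and isinstance(spec.get("items"), dict):
            # List of objects: child paths look like "list_field[].field_name".
            collected.update(
                collect_tagged_fields(spec["items"], prefix=f"{path}[].")
            )
    return collected


schema = {
    "type": "object",
    "properties": {
        "groups": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "normalize_text": True},
                    "diagnosis": {
                        "type": "string",
                        "normalize_text": True,
                        "expand_abbreviations": True,
                    },
                },
            },
        },
    },
}

tagged = collect_tagged_fields(schema)
```

Storing the tags already sorted by `PROCESSING_ORDER` means the later processing pass can apply them in sequence without re-checking the order.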
The system supports processing fields at any level of nesting:
- Simple fields: field_name
- Nested objects: parent.field_name
- List items: list_field[].field_name
- Dictionary values: dict_field[].field_name

    from typing import Dict, List

    class NestedSchema(BaseModel):
        groups: List[ParticipantGroup]  # Will process each group's fields
        metadata: Dict[str, str]  # Will process dictionary values

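
To make the path notation concrete, here is a small sketch of how a processor might be applied to every value a path points at. `apply_at_path` is a hypothetical helper, not part of the project; it works on plain dictionaries and lists (for example, a dumped model), and it assumes a bare `metadata[]` path addresses the string values of a dictionary.

```python
from typing import Any, Callable


def apply_at_path(data: Any, path: str, fn: Callable[[str], str]) -> None:
    """Apply fn in place to every string reached by a dotted path."""
    head, _, rest = path.partition(".")
    is_collection = head.endswith("[]")
    key = head[:-2] if is_collection else head
    if not isinstance(data, dict) or key not in data:
        return  # Path not present in this record; nothing to do.
    target = data[key]

    if not is_collection:
        if rest:
            apply_at_path(target, rest, fn)
        elif isinstance(target, str):
            data[key] = fn(target)
        return

    # "[]" fans out over list items or dictionary values.
    if isinstance(target, list):
        entries = list(enumerate(target))
    elif isinstance(target, dict):
        entries = list(target.items())
    else:
        return
    for index, child in entries:
        if rest:
            apply_at_path(child, rest, fn)
        elif isinstance(child, str):
            target[index] = fn(child)


record = {
    "groups": [{"name": " control "}, {"name": "patients"}],
    "metadata": {"site": " site a "},
}
apply_at_path(record, "groups[].name", lambda s: s.strip().title())
apply_at_path(record, "metadata[]", lambda s: s.strip().title())
```

Missing keys are skipped silently in this sketch, so a path collected from the schema can be applied safely to records where optional fields are absent.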
- normalize_text
  - Use for fields that need consistent formatting
  - Appropriate for categorical data, names, labels
  - Helpful for downstream analysis and comparison
- expand_abbreviations
  - Use for fields containing domain-specific terminology
  - Important for medical terms, technical abbreviations
  - Enhances readability and standardization
- Test field processing:

      def test_field_normalization():
          schema = ExampleSchema(value=" test value ")
          assert schema.value == "Test Value"

- Test abbreviation expansion:

      def test_abbreviation_expansion():
          text = "MRI (Magnetic Resonance Imaging) scan"
          schema = ExampleSchema(value="The MRI scan")
          assert "Magnetic Resonance Imaging" in schema.value

- Over-processing
  - Don't add metadata tags to fields that don't need processing
  - Consider the impact on performance and data integrity
- Inconsistent Application
  - Apply metadata tags consistently across similar fields
  - Document any exceptions or special cases
- Missing Source Text
  - Ensure source text is available when using expand_abbreviations
  - Handle cases where abbreviation context is missing
- Circular References
  - Avoid processing the same field multiple times
  - Be careful with recursive schema definitions
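
One way to guard against the missing-source-text and repeated-processing pitfalls is a fallback wrapper around expansion. The sketch below is an assumption-laden illustration: `expand_with_fallback` is a hypothetical name, and `detect` is a stand-in callable for whatever abbreviation detector is in use (scispacy in this project), here replaced by a fixed mapping.

```python
import re
from typing import Callable, Mapping, Optional


def expand_with_fallback(
    value: str,
    source_text: Optional[str],
    detect: Callable[[str], Mapping[str, str]],
) -> str:
    """Expand abbreviations in value, or return it unchanged without context."""
    if not source_text or not source_text.strip():
        return value  # No source text: nothing to detect abbreviations from.
    for short, long_form in detect(source_text).items():
        if long_form in value:
            continue  # Already expanded; skip to avoid double processing.
        value = re.sub(
            r"\b" + re.escape(short) + r"\b",
            lambda _m, repl=long_form: repl,
            value,
        )
    return value


def fake_detect(text: str) -> Mapping[str, str]:
    """Hypothetical detector: returns a fixed mapping instead of running scispacy."""
    return {"MRI": "Magnetic Resonance Imaging"}


source = "MRI (Magnetic Resonance Imaging) scan"
once = expand_with_fallback("The MRI scan", source, fake_detect)
twice = expand_with_fallback(once, source, fake_detect)
unchanged = expand_with_fallback("The MRI scan", None, fake_detect)
```

Running the wrapper a second time on its own output leaves the value as it was, which keeps accidental reprocessing of a field harmless.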
- Define schema structure and field types
- Identify fields needing processing
- Add appropriate metadata tags
- Test processing outcomes
- Monitor and adjust as needed
By following these guidelines, you can effectively use the schema metadata system to maintain clean, well-structured data while applying necessary processing operations.