
chore: generate span annotations via plpgsql #7417


Merged (1 commit) on May 5, 2025

Conversation

RogerHYang
Contributor

@RogerHYang RogerHYang commented May 5, 2025

Add PL/pgSQL-based Span Annotations Generation

Overview

This PR introduces a PL/pgSQL-based solution for generating random span annotations, leveraging PostgreSQL's native capabilities for efficient data generation. The implementation consists of:

  • A PL/pgSQL script (generate_span_annotations.sql) that handles the core data generation logic
  • A Python wrapper (generate_span_annotations.py) that provides a CLI

PL/pgSQL Implementation Highlights

  • Uses TABLESAMPLE SYSTEM (1) for efficient block-level random sampling of approximately 1% of spans
  • Implements a single bulk INSERT operation using CTEs (Common Table Expressions) for optimal performance:
    • annotation_names: Converts input string to array for random selection
    • sampled_spans: Efficiently samples spans with random annotation counts
    • span_repeats: Generates multiple annotations per span with missing value probabilities
  • Leverages PostgreSQL's native functions:
    • random() for generating random values
    • jsonb_build_object() for structured metadata
    • generate_series() for creating multiple annotations per span
    • array_length() and array indexing for random name selection
  • Implements ON CONFLICT DO NOTHING for handling duplicate annotations
  • Uses parameterized queries with :variable syntax for dynamic configuration
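
The generation logic described above can be approximated in plain Python to make the per-span behavior easier to follow. This is only a rough analogue of what the sampled_spans and span_repeats CTEs are described as doing, not the script's actual code: the function names, the label values, the metadata keys, and the single shared missing-value probability are all assumptions made for illustration.

```python
import random
from typing import Any, Dict, List, Optional


def maybe(rng: random.Random, p_missing: float, value: Any) -> Any:
    """Return None with probability p_missing, otherwise the value."""
    return None if rng.random() < p_missing else value


def generate_annotations(
    span_ids: List[int],
    names: List[str],
    max_per_span: int = 10,
    p_missing: float = 0.1,  # hypothetical default, not taken from the PR
    seed: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Build annotation rows for already-sampled spans (illustrative only)."""
    rng = random.Random(seed)
    rows: List[Dict[str, Any]] = []
    for span_id in span_ids:
        # Each span gets a random number of annotations, as generate_series()
        # would produce in the SQL version.
        for _ in range(rng.randint(1, max_per_span)):
            rows.append(
                {
                    "span_id": span_id,
                    # Random pick from the name list (array indexing in SQL).
                    "name": rng.choice(names),
                    "annotator_kind": rng.choice(["HUMAN", "LLM"]),
                    # Each optional field is dropped with probability p_missing.
                    "label": maybe(rng, p_missing, rng.choice(["good", "bad"])),
                    "score": maybe(rng, p_missing, rng.random()),
                    "explanation": maybe(rng, p_missing, "synthetic explanation"),
                    "metadata": maybe(
                        rng,
                        p_missing,
                        {"model": "hypothetical-model", "temperature": 0.7},
                    ),
                }
            )
    return rows
```

Passing a fixed seed makes the output reproducible, which is handy when comparing runs; the SQL version relies on random() and is not described as seedable.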

Features

  • Generates random annotations with configurable parameters:
    • Number of spans to sample (default: 10,000)
    • Maximum annotations per span (default: 10)
    • Configurable probabilities for missing fields (labels, scores, explanations, metadata)
    • Customizable annotation names
  • Supports both human and LLM annotators
  • Generates realistic metadata including model parameters and context
  • Maintains referential integrity with spans
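
As a sketch of how the configurable parameters listed above might be exposed on the command line, the following uses argparse. Only the two defaults (10,000 spans and 10 annotations per span) come from this description; every flag name, the 0.1 probability default, and the default annotation names are hypothetical and may not match the real generate_span_annotations.py.

```python
import argparse


def probability(text: str) -> float:
    """Parse a float and require it to lie in [0, 1]."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"{text!r} is not a probability in [0, 1]")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Generate random span annotations (illustrative sketch)."
    )
    # Defaults below are the ones stated in the PR description.
    parser.add_argument("--num-spans", type=int, default=10_000)
    parser.add_argument("--max-annotations-per-span", type=int, default=10)
    # One missing-value probability per optional field; 0.1 is an assumed default.
    for field in ("label", "score", "explanation", "metadata"):
        parser.add_argument(f"--{field}-missing-prob", type=probability, default=0.1)
    # Comma-separated names; the default list here is made up.
    parser.add_argument("--annotation-names", default="correctness,helpfulness")
    return parser


if __name__ == "__main__":
    print(vars(build_parser().parse_args()))
```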

Technical Details

  • Efficient bulk insertion using PostgreSQL's native capabilities
  • Handles duplicate annotations gracefully with ON CONFLICT DO NOTHING
  • Provides a comprehensive command-line interface with sensible defaults
  • Includes detailed documentation and usage examples
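
To illustrate the duplicate handling mentioned above, here is a small self-contained demonstration of ON CONFLICT DO NOTHING semantics. It runs against an in-memory SQLite database (which accepts the same clause since SQLite 3.24) rather than PostgreSQL, and the table layout and the uniqueness constraint on (span_id, name) are assumptions chosen for the example, not the project's actual schema.

```python
import sqlite3
from typing import Iterable, Optional, Tuple

Row = Tuple[int, str, Optional[str]]


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical uniqueness constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE span_annotations (
            span_id INTEGER NOT NULL,
            name    TEXT    NOT NULL,
            label   TEXT,
            UNIQUE (span_id, name)
        )
        """
    )
    return conn


def insert_ignoring_duplicates(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Bulk-insert rows, silently skipping conflicts; return rows actually added."""
    count_sql = "SELECT COUNT(*) FROM span_annotations"
    before = conn.execute(count_sql).fetchone()[0]
    conn.executemany(
        "INSERT INTO span_annotations (span_id, name, label) "
        "VALUES (?, ?, ?) ON CONFLICT DO NOTHING",
        rows,
    )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before
```

With this setup, inserting the same (span_id, name) pair twice keeps the first row and skips the second without raising an error, so a generator can be re-run against the same spans without failing on rows it already wrote.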

Usage

# Use default parameters
python generate_span_annotations.py

@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix May 5, 2025
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 5, 2025
@mikeldking mikeldking merged commit 4f6df3a into main May 5, 2025
27 checks passed
@mikeldking mikeldking deleted the generate-span-annotations-via-plpgsql branch May 5, 2025 16:25
@github-project-automation github-project-automation bot moved this from 📘 Todo to ✅ Done in phoenix May 5, 2025

2 participants