chore: generate span annotations via plpgsql #7417
Merged
Add PL/pgSQL-based Span Annotations Generation
Overview
This PR introduces a PL/pgSQL-based solution for generating random span annotations, leveraging PostgreSQL's native capabilities for efficient data generation. The implementation consists of:
- A SQL script (generate_span_annotations.sql) that handles the core data generation logic
- A Python script (generate_span_annotations.py) that provides a CLI interface
PL/pgSQL Implementation Highlights
- TABLESAMPLE SYSTEM (1) for efficient random sampling of approximately 1% of spans
- annotation_names: converts the input string to an array for random selection
- sampled_spans: efficiently samples spans with random annotation counts
- span_repeats: generates multiple annotations per span with missing value probabilities
- random() for generating random values
- jsonb_build_object() for structured metadata
- generate_series() for creating multiple annotations per span
- array_length() and array indexing for random name selection
- ON CONFLICT DO NOTHING for handling duplicate annotations
- :variable syntax for dynamic configuration
Features
Technical Details
ON CONFLICT DO NOTHING
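For readers who want to follow the data flow without reading the SQL, the steps listed under Implementation Highlights can be mirrored in plain Python. This is only a rough sketch, not the PR's code: the function name, parameter names, default values, row fields, and the (span_id, name) uniqueness key are all assumptions made for illustration. It also samples row by row, whereas TABLESAMPLE SYSTEM samples whole storage pages.

```python
import random


def generate_annotations(span_ids, names_csv, sample_fraction=0.01,
                         max_per_span=3, missing_prob=0.2, seed=None):
    """Illustrative stand-in for the SQL pipeline; all names are hypothetical."""
    rng = random.Random(seed)

    # annotation_names: convert the input string to a list for random selection.
    names = [n.strip() for n in names_csv.split(",") if n.strip()]
    if not names:
        raise ValueError("at least one annotation name is required")

    # sampled_spans: keep roughly sample_fraction of spans and attach a random
    # annotation count to each (row-level sampling, unlike TABLESAMPLE SYSTEM).
    sampled = [(sid, rng.randint(1, max_per_span))
               for sid in span_ids if rng.random() < sample_fraction]

    # span_repeats: emit `count` candidate rows per span, the role that
    # generate_series() plays in the SQL version.
    rows = {}
    for sid, count in sampled:
        for _ in range(count):
            name = names[rng.randrange(len(names))]  # random array indexing
            # Missing-value probability: the score is sometimes left empty.
            score = None if rng.random() < missing_prob else rng.random()
            row = {"span_id": sid, "name": name, "score": score,
                   "metadata": {"generated": True}}  # jsonb_build_object() analogue
            # ON CONFLICT DO NOTHING: the first row for a key wins, later
            # duplicates are silently dropped (assumed key: span_id + name).
            rows.setdefault((sid, name), row)
    return list(rows.values())
```

Calling generate_annotations(range(10_000), "quality,relevance", seed=0) would return annotations for roughly 1% of the spans, with at most one row per span and name.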
Usage
# Use default parameters
python generate_span_annotations.py
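How the wrapper hands its settings to the SQL script is not shown here. One plausible approach, assuming the script is executed through psql, is to pass each setting with psql's -v name=value option so that it becomes available in the script as :name. The sketch below only builds the command line; the function name and the example parameter names are hypothetical and are not the wrapper's actual options.

```python
def build_psql_command(sql_path, params, dsn=None):
    """Build a psql argument list that sets one psql variable per parameter.

    Hypothetical helper: the real wrapper may connect to PostgreSQL differently.
    """
    cmd = ["psql"]
    if dsn:
        # -d accepts a database name or a full connection string.
        cmd += ["-d", dsn]
    # Abort on the first SQL error instead of continuing with later statements.
    cmd += ["-v", "ON_ERROR_STOP=1"]
    for name, value in params.items():
        # Each -v name=value is readable in the script as :name.
        cmd += ["-v", f"{name}={value}"]
    cmd += ["-f", sql_path]
    return cmd


# Example with made-up parameter names; run it with subprocess.run(cmd, check=True).
cmd = build_psql_command(
    "generate_span_annotations.sql",
    {"sample_percent": 1, "label_names": "quality,relevance"},
    dsn="postgresql://localhost/postgres",
)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a value contains commas or spaces.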