Domain-specific vs. General-purpose Embedding Model Evaluation
Author: Jacky Liang
Prerequisites:
- Docker and Docker Compose installed (compose.yaml is included)
- OpenAI API key
- PostgreSQL with pgai extension running
- SEC Filings dataset: https://huggingface.co/datasets/MemGPT/example-sec-filings/tree/main
- Your choice of embedding models to evaluate
Virtual Environment Setup:
- Create and activate the virtual environment, then install dependencies:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Configuration (default values):
- NUM_CHUNKS = 20 # Number of random text chunks to evaluate
- NUM_QUESTIONS_PER_CHUNK = 20 # Total questions per chunk (4 of each type)
- TOP_K = 10 # Number of closest chunks to retrieve
- QUESTION_DISTRIBUTION = {  # Distribution of question types
      'short': 4,    # Direct, simple questions under 10 words
      'long': 4,     # Detailed questions requiring comprehensive answers
      'direct': 4,   # Questions about explicit information
      'implied': 4,  # Questions requiring context understanding
      'unclear': 4   # Vague or ambiguous questions
  }
- EMBEDDING_TABLES = [  # Database tables containing embeddings
      'sec_filings_openai_embeddings',  # OpenAI text-embedding-3-small (768 dim)
      'sec_filings_voyage_embeddings'   # Voyage finance-2 (1024 dim)
  ]
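These defaults live in the Config class at the top of the evaluation script. A minimal sketch of how that class might be laid out, using only the values listed above (the exact class shape is illustrative):

class Config:
    NUM_CHUNKS = 20                # Number of random text chunks to evaluate
    NUM_QUESTIONS_PER_CHUNK = 20   # Total questions per chunk (4 of each type)
    TOP_K = 10                     # Number of closest chunks to retrieve
    QUESTION_DISTRIBUTION = {
        'short': 4, 'long': 4, 'direct': 4, 'implied': 4, 'unclear': 4,
    }
    EMBEDDING_TABLES = [
        'sec_filings_openai_embeddings',
        'sec_filings_voyage_embeddings',
    ]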
Installation and Setup:
- Create a directory containing compose.yaml, and make sure your Voyage AI and OpenAI API keys are entered in compose.yaml
- Start services:
docker compose up -d
- Connect to the database:
docker compose exec -it db psql
- Enable pgai extension:
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
Dataset Setup:
- Create a sec_filings table in the database with a primary key (required by the pgai Vectorizer):
CREATE TABLE sec_filings (
    id SERIAL PRIMARY KEY,
    text TEXT
);
- Load the SEC filings dataset (the name is case-sensitive):
SELECT ai.load_dataset(
    name => 'MemGPT/example-sec-filings',
    table_name => 'sec_filings',
    batch_size => 1000,
    max_batches => 10,
    if_table_exists => 'append'
);
- Create a vectorizer for each model:
-- OpenAI text-embedding-3-small (768 dim)
SELECT ai.create_vectorizer(
    'sec_filings'::regclass,
    loading => ai.loading_column('text'),
    destination => ai.destination_table('sec_filings_openai_embeddings'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter(512, 50)
);

-- Voyage finance-2 (1024 dim)
SELECT ai.create_vectorizer(
    'sec_filings'::regclass,
    loading => ai.loading_column('text'),
    destination => ai.destination_table('sec_filings_voyage_embeddings'),
    embedding => ai.embedding_voyageai('voyage-finance-2', 1024),
    chunking => ai.chunking_recursive_character_text_splitter(512, 50)
);
- Verify vectorization status:
SELECT * FROM ai.vectorizer_status;
- Query the embedding views:
SELECT text FROM sec_filings_openai_embeddings LIMIT 5;
SELECT text FROM sec_filings_voyage_embeddings LIMIT 5;
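These views are also what the TOP_K retrieval queries at evaluation time. As a rough sketch of that kind of query, assuming pgvector's cosine-distance operator <=> and psycopg2 (the function and connection details are illustrative, not the evaluation script's actual API):

import psycopg2

def top_k_chunks(conn, table, query_embedding, k=10):
    """Return the k chunks whose embeddings are closest to query_embedding."""
    # pgvector accepts a '[x1,x2,...]' literal cast to the vector type
    vec = '[' + ','.join(str(x) for x in query_embedding) + ']'
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT chunk FROM {table} "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, k),
        )
        return [row[0] for row in cur.fetchall()]

Note that the question must be embedded with the same model that produced the table being queried; otherwise the distances are meaningless.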
Usage:
- First-time setup:
- Ensure PostgreSQL Docker container is running with pgai extension
- Ensure OpenAI and Voyage AI API keys are entered in compose.yaml (required)
- Configure Config class parameters if needed (top of file)
- Generate chunks - chunks are randomly selected from the database, so you can rerun this step until you get a good sample:
evaluator = StepByStepEvaluator()
chunks = evaluator.step1_get_chunks()
pd.DataFrame(chunks).to_csv('chunks.csv')
- Generate questions (can be run independently):
chunks = pd.read_csv('chunks.csv', index_col=0).to_dict('records')
evaluator.chunks = chunks
questions = evaluator.step2_generate_questions()
pd.DataFrame(questions).to_csv('questions.csv')
- Evaluate models (can be run independently; see the metric sketch after this list):
results = evaluator.step3_evaluate_models()  # Reads from questions.csv
pd.DataFrame(results).to_csv('results.csv')
evaluator.print_results()
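The performance metrics in results.csv presumably boil down to a top-k hit rate: a question counts as a hit when the chunk it was generated from appears among the TOP_K chunks retrieved for it. A minimal sketch of that check, assuming step3_evaluate_models scores each table roughly this way (the dict keys and the retrieve callable are illustrative, not the script's actual API):

def hit_rate(questions, retrieve, k=10):
    """Fraction of questions whose source chunk is retrieved in the top k."""
    hits = sum(
        1 for q in questions
        # 'question' and 'source_chunk' are hypothetical column names
        if q['source_chunk'] in retrieve(q['question'], k)
    )
    return hits / len(questions)

Run once per table in EMBEDDING_TABLES, with retrieve bound to that table and its embedding model, this yields one comparable score per model.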
Outputs:
- chunks.csv: Random text chunks from database
- questions.csv: Generated questions for each chunk
- results.csv: Overall model performance metrics
- detailed_results.csv: Per-question evaluation results
Links:
- Voyage AI x pgai Vectorizer Quickstart: https://github.com/timescale/pgai/blob/main/docs/vectorizer/quick-start-voyage.md
- SEC Filings dataset: https://huggingface.co/datasets/MemGPT/example-sec-filings/tree/main
- Voyage AI Text Embedding API docs: https://docs.voyageai.com/docs/embeddings