Version: 2.0
Last Updated: 2025-11-11
Status: Planning Phase
This roadmap outlines the 14-week implementation plan for transforming Live Conversational Threads into a comprehensive conversation analysis platform with Google Meet transcript support, multi-scale graph visualization, and advanced analytical features.
Key Milestones:
- Weeks 1-4: Foundation (data model, instrumentation, Google Meet import)
- Weeks 5-7: Core Features (dual-view, zoom levels, node detail)
- Weeks 8-10: Analysis Features (speaker analytics, prompts config)
- Weeks 11-14: Advanced Features (Simulacra levels, cognitive bias detection)
Risk Level: Medium
Dependencies: PostgreSQL, FastAPI, React, OpenAI API
Estimated Total Cost: $500-1000 (LLM API usage for testing/development)
Goal: Implement DATA_MODEL_V2.md schema with full instrumentation support
Tasks:
- Create Alembic migration scripts for new tables:
  - utterances (speaker diarization support)
  - nodes (enhanced with zoom_level_visible)
  - edges (temporal vs contextual relationships)
  - clusters (hierarchical grouping)
  - edits_log (training data collection)
  - api_calls_log (cost tracking)
- Write database initialization scripts
- Create test fixtures for all tables
- Implement rollback procedures
Success Criteria:
- All migrations run cleanly on fresh PostgreSQL instance
- Sample data populates correctly
- Rollback tested and verified
Testing:
# Unit tests for schema
pytest tests/test_database_schema.py -v
# Migration tests
pytest tests/test_migrations.py -v
Metrics:
- Migration execution time: < 5 seconds
- Test coverage: 100% for database models
Goal: Implement comprehensive API call logging and cost monitoring
Tasks:
- Create track_api_call decorator (see TIER_2_FEATURES.md)
- Implement cost calculation functions for:
  - OpenAI models (GPT-4, GPT-3.5-turbo)
  - Anthropic models (Claude Sonnet-4)
  - Token counting utilities
- Build background job for cost aggregation
- Create alert system for cost thresholds
Code Structure:
lct_python_backend/
├── instrumentation/
│ ├── __init__.py
│ ├── decorators.py # @track_api_call
│ ├── cost_calculator.py # Token cost logic
│ ├── alerts.py # Cost threshold alerts
│ └── aggregation.py # Daily/weekly rollups
Success Criteria:
- Every LLM API call logged to api_calls_log
- Cost calculated within 1% accuracy
- Dashboard shows real-time cost tracking
Testing:
# tests/test_instrumentation.py
def test_track_api_call_decorator():
    """Test that API calls are logged with correct cost"""

def test_cost_calculation_gpt4():
    """Test GPT-4 cost calculation for various token counts"""

def test_cost_alert_threshold():
    """Test alert triggers when cost exceeds threshold"""
Metrics to Track:
- API call latency (p50, p95, p99)
- Token usage per endpoint
- Cost per conversation
- Cost per feature (clustering, bias detection, etc.)
- Model selection distribution
Storage Plan:
- Retain raw logs: 90 days
- Retain aggregated metrics: 2 years
- Archive strategy: Export to CSV monthly
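The cost calculation step in this section can be sketched as a lookup in a per-model price table. This is a minimal sketch, not the project's implementation: the table name PRICES_PER_1K is hypothetical, and the only prices included are the GPT-4 figures that this roadmap's own instrumentation tests assume ($0.03 per 1K input tokens, $0.06 per 1K output tokens). Check them against current provider pricing before relying on them.

```python
from decimal import Decimal

# Hypothetical price table in USD per 1,000 tokens. Only GPT-4 is listed,
# using the figures assumed by this roadmap's tests; other models would be
# added here once their prices are confirmed.
PRICES_PER_1K = {
    "gpt-4": {"input": Decimal("0.03"), "output": Decimal("0.06")},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call; raise ValueError for an unpriced model."""
    try:
        price = PRICES_PER_1K[model]
    except KeyError:
        raise ValueError(f"No pricing configured for model {model!r}") from None
    # Decimal arithmetic avoids accumulating binary floating-point error
    # when many small costs are summed later.
    cost = (
        Decimal(input_tokens) * price["input"]
        + Decimal(output_tokens) * price["output"]
    ) / 1000
    return float(cost)


if __name__ == "__main__":
    print(calculate_cost("gpt-4", input_tokens=1000, output_tokens=500))
```

Raising on an unknown model, rather than returning zero, keeps an unpriced model from silently under-reporting cost.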
Goal: Robust parser for Google Meet PDF/TXT transcripts with speaker diarization
Tasks:
- Implement PDF extraction using PyPDF2/pdfplumber
- Parse transcript format: Speaker Name ~: utterance text
- Extract timestamps (e.g., 00:10:47 section markers)
- Handle edge cases:
  - Speakers without ~ suffix
  - Multi-line utterances
  - Special characters in names
  - Missing timestamps
- Create data validation layer
- Build import API endpoint: POST /api/import/google-meet
Code Structure:
# lct_python_backend/parsers/google_meet.py
class GoogleMeetParser:
    def parse_pdf(self, file_path: str) -> ParsedTranscript:
        """Extract text from PDF and parse structure"""

    def parse_speakers(self, text: str) -> List[Utterance]:
        """Identify speakers and utterances"""

    def calculate_timestamps(self, utterances: List[Utterance]) -> List[Utterance]:
        """Calculate start/end times for each utterance"""

    def validate_transcript(self, transcript: ParsedTranscript) -> ValidationResult:
        """Check for parsing errors and ambiguities"""
Success Criteria:
- Parse 100% of test transcripts without errors
- Correctly identify 95%+ of speaker boundaries
- Handle malformed input gracefully
Testing:
# tests/test_google_meet_parser.py
def test_parse_simple_transcript():
    """Test basic speaker diarization"""

def test_parse_multiline_utterance():
    """Test utterances spanning multiple lines"""

def test_parse_missing_timestamps():
    """Test handling of incomplete timestamp data"""

def test_parse_special_characters():
    """Test names with unicode, punctuation"""
Test Data:
- 10 real Google Meet transcripts (anonymized)
- 5 synthetic edge case transcripts
- 3 malformed transcripts (error handling)
Metrics:
- Parse success rate: > 95%
- Parse time: < 2 seconds per 10k words
- Speaker detection accuracy: > 90%
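The line-oriented format described in this section (a timestamp marker such as 00:10:47, then Speaker Name ~: utterance text, with optional continuation lines) lends itself to a small regex-based pass. The sketch below is an illustration under assumptions, not the planned GoogleMeetParser: the function name parse_text, the Utterance fields, and both regular expressions are hypothetical. One known limitation is that a continuation line containing a colon would be misread as a new speaker.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed line shapes:
#   "00:10:47"                   -> section timestamp marker
#   "Speaker Name ~: utterance"  -> start of an utterance ("~" is optional)
TIMESTAMP_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})$")
SPEAKER_RE = re.compile(r"^(?P<speaker>[^:\n]{1,80}?)\s*~?\s*:\s*(?P<text>.*)$")


@dataclass
class Utterance:
    speaker: str
    text: str
    section_start: Optional[float]  # seconds, from the last marker seen


def parse_text(raw: str) -> List[Utterance]:
    """Split raw transcript text into utterances, joining continuation lines."""
    utterances: List[Utterance] = []
    current_section: Optional[float] = None
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        ts = TIMESTAMP_RE.match(line)
        if ts:
            hours, minutes, seconds = (int(g) for g in ts.groups())
            current_section = float(hours * 3600 + minutes * 60 + seconds)
            continue
        sp = SPEAKER_RE.match(line)
        if sp:
            utterances.append(
                Utterance(sp["speaker"].strip(), sp["text"].strip(), current_section)
            )
        elif utterances:
            # No speaker prefix: treat as a continuation of the previous utterance.
            utterances[-1].text += " " + line
    return utterances
```

Timestamps are checked before speakers because a marker like 00:10:47 also contains colons and would otherwise match the speaker pattern.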
Goal: Baseline AI-powered graph generation from parsed transcripts
Tasks:
- Implement prompt-based clustering (see ADR-002)
- Create initial node generation endpoint
- Build temporal edge creation logic
- Implement basic contextual relationships
Prompts to Implement:
{
"initial_clustering": {
"description": "Generate initial topic-based nodes from transcript",
"template": "Analyze this transcript and identify natural topic shifts...",
"model": "gpt-4",
"temperature": 0.5
}
}
Success Criteria:
- Generate graph from transcript in < 30 seconds
- Node granularity appropriate for 5 zoom levels
- Temporal edges connect all nodes sequentially
Testing:
def test_initial_clustering():
    """Test that transcript generates reasonable node structure"""

def test_zoom_level_distribution():
    """Test that nodes span all 5 zoom levels appropriately"""
Metrics:
- Token cost per conversation: < $2.00
- Graph generation latency: < 60 seconds
- User satisfaction: Manual review of 10 test graphs
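The requirement that temporal edges connect all nodes sequentially amounts to sorting nodes by time and linking each one to its successor, which yields exactly one fewer edge than there are nodes. A minimal sketch follows; the Node and Edge dataclasses and their field names are assumptions for illustration, apart from the "temporal" relationship type used elsewhere in this roadmap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    id: str
    start_time: float  # seconds from the start of the conversation (assumed field)


@dataclass
class Edge:
    source: str
    target: str
    relationship_type: str


def build_temporal_edges(nodes: List[Node]) -> List[Edge]:
    """Link each node to the next one in chronological order."""
    ordered = sorted(nodes, key=lambda n: n.start_time)
    # zip(ordered, ordered[1:]) pairs every node with its successor,
    # so N nodes produce N - 1 edges.
    return [Edge(a.id, b.id, "temporal") for a, b in zip(ordered, ordered[1:])]
```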
Goal: Implement Timeline (bottom) + Contextual Network (top) UI
Tasks:
- Split canvas into 15% (timeline) + 85% (network)
- Implement synchronized zoom/pan
- Create temporal ordering visualization
- Build contextual clustering layout
React Components:
// src/components/DualView/
├── DualViewCanvas.tsx // Main container
├── TimelineView.tsx // Bottom 15%
├── ContextualNetworkView.tsx // Top 85%
└── SyncController.tsx // Zoom/pan sync
Success Criteria:
- Both views visible simultaneously
- Zoom/pan synchronized perfectly
- Performance: 60 FPS with 100+ nodes
Testing:
// tests/DualViewCanvas.test.tsx
describe('DualViewCanvas', () => {
it('renders both views with correct proportions')
it('synchronizes zoom across views')
it('maintains performance with large graphs')
})
Metrics:
- Render time: < 100ms
- Frame rate: 60 FPS
- Memory usage: < 200MB for 500 nodes
Goal: Implement discrete zoom levels: sentence → turn → topic → theme → arc
Tasks:
- Implement ZoomController with 5 quantized levels
- Create visibility logic for nodes based on zoom_level_visible
zoom_level_visible - Build smooth transitions between levels
- Implement zoom-dependent context loading
Zoom Level Definitions:
enum ZoomLevel {
SENTENCE = 1, // Individual sentences (zoom > 0.8)
TURN = 2, // Speaker turns (0.6 < zoom ≤ 0.8)
TOPIC = 3, // Topic segments (0.4 < zoom ≤ 0.6)
THEME = 4, // Thematic clusters (0.2 < zoom ≤ 0.4)
ARC = 5 // Narrative arcs (zoom ≤ 0.2)
}
Success Criteria:
- Zoom transitions smooth and intuitive
- Node visibility updates correctly at each level
- No performance degradation during zoom
Testing:
describe('ZoomController', () => {
it('quantizes zoom values to 5 discrete levels')
it('shows/hides nodes based on zoom_level_visible')
it('loads appropriate context for each level')
})
Metrics:
- Zoom transition latency: < 50ms
- Node culling efficiency: Only visible nodes rendered
- User experience: A/B test with 10 users
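The thresholds in the zoom level definitions above map a continuous zoom value onto five discrete levels. The sketch below expresses that quantization in Python to stay consistent with the backend examples; the frontend ZoomController would need an equivalent in TypeScript. The function name quantize_zoom is hypothetical, and the boundaries are copied from the enum comments.

```python
from enum import IntEnum


class ZoomLevel(IntEnum):
    SENTENCE = 1
    TURN = 2
    TOPIC = 3
    THEME = 4
    ARC = 5


def quantize_zoom(zoom: float) -> ZoomLevel:
    """Map a continuous zoom value to one of the five discrete levels."""
    if zoom > 0.8:
        return ZoomLevel.SENTENCE
    if zoom > 0.6:
        return ZoomLevel.TURN
    if zoom > 0.4:
        return ZoomLevel.TOPIC
    if zoom > 0.2:
        return ZoomLevel.THEME
    return ZoomLevel.ARC
```

Each boundary value belongs to the coarser level, so a zoom of exactly 0.8 is TURN rather than SENTENCE, matching the half-open intervals in the definitions.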
Goal: Split-screen detail view with zoom-dependent context
Tasks:
- Implement panel that shows selected node
- Build context loading logic (see TIER_2_FEATURES.md Section 1)
- Create inline editing interface
- Implement edit mode toggle
Context Loading Rules:
function getContextConfig(zoom: ZoomLevel): ContextConfig {
switch(zoom) {
case ZoomLevel.SENTENCE:
return { previous: 2, next: 2, mode: 'detailed' }
case ZoomLevel.TURN:
return { previous: 1, next: 1, mode: 'focused' }
case ZoomLevel.ARC:
return { mode: 'summary', summary_of: 'entire_thread' }
}
}
Success Criteria:
- Context loads within 200ms
- Edit mode requires explicit toggle (intentional friction)
- Changes save immediately to backend
Testing:
describe('NodeDetailPanel', () => {
it('loads correct context based on zoom level')
it('requires edit mode toggle for changes')
it('saves edits to backend immediately')
})
Metrics:
- Context load time: < 200ms
- Edit save latency: < 100ms
- Edit mode activation rate: Track user behavior
Goal: Comprehensive speaker statistics and role detection
Tasks:
- Implement analytics calculations:
  - Time spoken per speaker
  - Turn count and distribution
  - Topics dominated by each speaker
  - Role detection (facilitator, contributor, observer)
- Build analytics API endpoint: GET /conversations/{id}/analytics
- Create Analytics UI page (separate from main graph)
- Implement speaker timeline visualization
Analytics Calculations:
# lct_python_backend/analytics/speaker_analytics.py
class SpeakerAnalytics:
    def calculate_time_spoken(self, conversation_id: str) -> Dict[str, float]:
        """Calculate seconds spoken per speaker"""

    def calculate_turn_distribution(self, conversation_id: str) -> Dict[str, int]:
        """Count turns per speaker"""

    def detect_speaker_roles(self, conversation_id: str) -> Dict[str, str]:
        """Classify speakers: facilitator, contributor, observer, etc."""

    def calculate_topic_dominance(self, conversation_id: str) -> Dict[str, List[str]]:
        """Identify which topics each speaker dominated"""
Success Criteria:
- Analytics calculated for any conversation
- Role detection > 70% accuracy (manual validation)
- UI loads analytics in < 1 second
Testing:
def test_speaker_time_calculation():
    """Test time spoken calculated correctly from timestamps"""

def test_role_detection():
    """Test role classifier on labeled dataset"""

def test_analytics_performance():
    """Test analytics generation scales to 2-hour conversations"""
Metrics:
- Analytics generation time: < 5 seconds
- Accuracy of role detection: > 70%
- User engagement: Track analytics view count
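Time spoken and turn counts can both be derived in a single pass over timestamped utterances. The sketch below assumes each utterance carries a speaker plus start and end times in seconds, and treats consecutive utterances by the same speaker as one turn. Both are assumptions, since this roadmap does not define a turn precisely, and the function works on in-memory data rather than a conversation_id.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class Utterance:
    speaker: str
    start_time: float  # seconds (assumed field)
    end_time: float    # seconds (assumed field)


def time_and_turns(
    utterances: Iterable[Utterance],
) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Return (seconds spoken per speaker, turns per speaker)."""
    seconds: Dict[str, float] = defaultdict(float)
    turns: Dict[str, int] = defaultdict(int)
    previous_speaker: Optional[str] = None
    for u in sorted(utterances, key=lambda u: u.start_time):
        # Guard against malformed data where end precedes start.
        seconds[u.speaker] += max(0.0, u.end_time - u.start_time)
        # A new turn starts only when the speaker changes.
        if u.speaker != previous_speaker:
            turns[u.speaker] += 1
            previous_speaker = u.speaker
    return dict(seconds), dict(turns)
```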
Goal: User-editable JSON prompts with UI settings panel
Tasks:
- Create prompts.json configuration file
- Implement prompt template rendering with variables
- Build Settings UI for prompt editing
- Create prompt versioning system
- Implement hot-reload for prompt changes
Prompts Configuration Schema:
{
"version": "1.0.0",
"last_updated": "2025-11-11",
"prompts": {
"initial_clustering": {
"description": "Generate initial nodes from transcript",
"template": "...",
"model": "gpt-4",
"temperature": 0.5,
"max_tokens": 2000,
"few_shot_examples": [...]
},
"detect_cognitive_bias": {
"description": "Identify cognitive biases in utterances",
"template": "...",
"model": "gpt-4",
"temperature": 0.3
}
}
}
Success Criteria:
- All prompts externalized to JSON
- Users can edit prompts via UI
- Prompt changes take effect immediately
- Version history maintained
Testing:
def test_prompt_template_rendering():
    """Test variable substitution in prompts"""

def test_prompt_hot_reload():
    """Test prompt updates apply without restart"""

def test_prompt_versioning():
    """Test prompt changes create version history"""
Metrics:
- Prompt edit frequency: Track user customization
- Prompt performance: A/B test custom vs default
- Token usage: Monitor cost impact of edits
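Prompt template rendering with variables could be built on the standard library's string.Template. This is a sketch under assumptions: the roadmap does not specify a placeholder syntax, so the $transcript placeholder, the sample template text, and the render_prompt helper are all hypothetical.

```python
import json
from string import Template

# Hypothetical prompts.json content; "$transcript" is an assumed placeholder syntax.
PROMPTS_JSON = """
{
  "version": "1.0.0",
  "prompts": {
    "initial_clustering": {
      "template": "Analyze this transcript and identify natural topic shifts: $transcript",
      "model": "gpt-4",
      "temperature": 0.5
    }
  }
}
"""


def render_prompt(config: dict, name: str, **variables: str) -> str:
    """Substitute variables into the named template.

    Template.substitute raises KeyError when a placeholder has no value,
    which surfaces a broken user edit early instead of sending a
    half-filled prompt to the model.
    """
    template = Template(config["prompts"][name]["template"])
    return template.substitute(**variables)


config = json.loads(PROMPTS_JSON)
rendered = render_prompt(
    config, "initial_clustering", transcript="Alice: Hi. Bob: Hello."
)
```

With this syntax, a literal dollar sign in a template has to be written as $$, which is worth documenting in the Settings UI if users edit prompts directly.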
Goal: Complete logging and export system for AI training data
Tasks:
- Ensure all edits logged to edits_log table
- Implement export endpoint: GET /conversations/{id}/training-data
- Create export formats:
  - JSONL (for fine-tuning)
  - CSV (for analysis)
  - Markdown (for human review)
- Build diff visualization for edits
- Implement feedback annotation UI
Export Format (JSONL):
{
"messages": [
{"role": "system", "content": "You are analyzing conversation transcripts..."},
{"role": "user", "content": "Original AI output: [node summary]"},
{"role": "assistant", "content": "User correction: [edited summary]"}
],
"metadata": {
"conversation_id": "uuid",
"edit_type": "node_summary_edit",
"timestamp": "2025-11-11T10:30:00Z",
"feedback": "User noted AI missed sarcasm in utterance"
}
}
Success Criteria:
- 100% of edits captured in edits_log
- Export generates valid fine-tuning format
- User can annotate edits with feedback
Testing:
def test_edit_logging():
    """Test all edit types logged correctly"""

def test_training_data_export():
    """Test export generates valid JSONL format"""

def test_diff_visualization():
    """Test diff correctly shows before/after"""
Metrics:
- Edit capture rate: 100%
- Export file size: Track data volume
- User feedback rate: % of edits annotated
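The JSONL export shown in this section is one JSON object per line, each pairing the original AI output with the user's correction. The sketch below converts edit records into that shape. The input keys (original_text, edited_text, and so on) are assumptions about what an edits_log row might contain, and the system prompt is kept exactly as abbreviated in the example above.

```python
import json
from typing import Any, Dict, Iterable

# Kept as abbreviated in the export example; the full prompt text is not given.
SYSTEM_PROMPT = "You are analyzing conversation transcripts..."


def edit_to_record(edit: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one edit row (hypothetical keys) into a fine-tuning record."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Original AI output: {edit['original_text']}"},
            {"role": "assistant", "content": f"User correction: {edit['edited_text']}"},
        ],
        "metadata": {
            "conversation_id": edit["conversation_id"],
            "edit_type": edit["edit_type"],
            "timestamp": edit["timestamp"],
            "feedback": edit.get("feedback"),  # annotation is optional
        },
    }


def export_jsonl(edits: Iterable[Dict[str, Any]]) -> str:
    """Serialize edits as JSONL: exactly one JSON object per line."""
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return "\n".join(
        json.dumps(edit_to_record(e), ensure_ascii=False) for e in edits
    )
```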
Goal: Implement basic Simulacra level classification for utterances
Tasks:
- Create Simulacra level prompt with heuristics
- Implement pattern matching for obvious cases
- Build LLM-based contextual analysis
- Create UI indicators for detected levels
Implementation Approach:
# lct_python_backend/analysis/simulacra.py
class SimulacraDetector:
    def classify_utterance(self, utterance: Utterance, context: List[Utterance]) -> SimulacraLevel:
        """
        Classify utterance as Level 1, 2, 3, or 4
        Using heuristics + LLM analysis
        """

    def detect_level_1_patterns(self, text: str) -> bool:
        """Pattern match for object-level claims"""

    def detect_level_2_manipulation(self, utterance: Utterance, context: List[Utterance]) -> bool:
        """Detect persuasion/manipulation patterns"""

    def detect_level_3_signaling(self, utterance: Utterance, speaker: str) -> bool:
        """Detect tribal/group signaling"""
Success Criteria:
- Detect obvious Level 1 (facts) and Level 3 (signaling)
- LLM analysis for ambiguous cases
- UI shows level indicators on utterances
Testing:
def test_level_1_detection():
    """Test detection of factual, object-level statements"""

def test_level_3_signaling():
    """Test detection of tribal signaling"""

def test_ambiguous_classification():
    """Test LLM handles edge cases"""
Metrics:
- Classification accuracy: Manual validation on 100 utterances
- False positive rate: < 20%
- Token cost per conversation: < $1.50
Goal: Implement detection for top 10 most common cognitive biases
Priority Biases:
- Confirmation bias
- Affect heuristic
- Optimism bias
- Straw man fallacy
- Ad hominem
- Whataboutism
- False dichotomy
- Appeal to nature
- Sunk cost fallacy
- Availability heuristic
Implementation:
# lct_python_backend/analysis/cognitive_bias.py
class CognitiveBiasDetector:
    def detect_confirmation_bias(self, utterances: List[Utterance]) -> List[BiasInstance]:
        """Detect selective evidence presentation"""

    def detect_straw_man(self, utterances: List[Utterance]) -> List[BiasInstance]:
        """Detect misrepresentation of opponent's position"""

    def detect_whataboutism(self, utterances: List[Utterance]) -> List[BiasInstance]:
        """Pattern match + contextual analysis for whataboutism"""
Detection Pipeline:
- Fast pattern matching (regex, keyword spotting)
- Contextual LLM analysis for candidates flagged by pattern matching
- Confidence scoring (0.0 - 1.0)
- UI annotation with explanations
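The first three pipeline stages can be sketched as a cheap regex pass that produces candidates, followed by a scoring function that assigns each candidate a confidence in the 0.0 to 1.0 range. In this sketch the scorer is an injected callable standing in for the LLM analysis. The single whataboutism pattern, the Candidate dataclass, and the 0.5 threshold are illustrative assumptions rather than the planned detector.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative pattern only; a real detector would need a much richer set.
WHATABOUTISM_RE = re.compile(r"\bwhat about\b", re.IGNORECASE)


@dataclass
class Candidate:
    index: int       # position of the utterance in the conversation
    text: str
    bias_type: str


def find_candidates(texts: List[str]) -> List[Candidate]:
    """Stage 1: fast pattern matching that flags possible whataboutism."""
    return [
        Candidate(i, text, "whataboutism")
        for i, text in enumerate(texts)
        if WHATABOUTISM_RE.search(text)
    ]


def detect(
    texts: List[str],
    score: Callable[[Candidate], float],
    threshold: float = 0.5,
) -> List[Tuple[Candidate, float]]:
    """Stages 2-3: score each candidate and keep those at or above threshold.

    `score` stands in for the contextual LLM analysis; its result is
    clamped to the 0.0 - 1.0 confidence range.
    """
    results = []
    for candidate in find_candidates(texts):
        confidence = min(1.0, max(0.0, score(candidate)))
        if confidence >= threshold:
            results.append((candidate, confidence))
    return results
```

Running the regex pass first means the expensive scorer is only called on flagged utterances, which is what keeps token cost per conversation bounded.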
Success Criteria:
- Detect 70%+ of biases in test dataset
- False positive rate < 30%
- Clear explanations for each detection
Testing:
def test_confirmation_bias_detection():
    """Test detection on labeled examples"""

def test_straw_man_detection():
    """Test pattern matching + LLM analysis"""

def test_confidence_scoring():
    """Test confidence scores correlate with accuracy"""
Metrics:
- Detection accuracy per bias type
- Token cost per conversation
- User feedback on false positives
Goal: Identify hidden worldviews and normative assumptions in conversations
Tasks:
- Implement frame extraction prompts
- Detect "is-ought" conflations
- Identify hidden premises
- Build frame taxonomy UI
Frame Detection Approach:
class ImplicitFrameDetector:
    def extract_normative_claims(self, utterances: List[Utterance]) -> List[NormativeClaim]:
        """Identify statements about what 'should' be"""

    def detect_hidden_premises(self, claim: NormativeClaim) -> List[Premise]:
        """Unpack unstated assumptions"""

    def detect_is_ought_conflation(self, utterances: List[Utterance]) -> List[Conflation]:
        """Find naturalistic fallacy instances"""
Example Output:
{
"normative_claim": "We should prioritize economic growth",
"hidden_premises": [
"Economic growth increases human wellbeing",
"Economic growth is sustainable",
"Current distribution mechanisms are fair"
],
"conflations": [
{
"type": "is_ought",
"text": "Humans naturally seek wealth, so capitalism is right",
"explanation": "Conflates descriptive claim (humans seek wealth) with normative claim (capitalism is right)"
}
]
}
Success Criteria:
- Extract meaningful frames from 60%+ of conversations
- Clear, understandable explanations
- UI shows frame taxonomy tree
Testing:
def test_normative_claim_extraction():
    """Test extraction of 'should' statements"""

def test_hidden_premise_detection():
    """Test unpacking of unstated assumptions"""

def test_is_ought_detection():
    """Test naturalistic fallacy detection"""
Metrics:
- Frame extraction rate: % of conversations with frames
- User agreement with detected frames: Manual validation
- Token cost per conversation
Goal: End-to-end testing, performance optimization, documentation
Tasks:
- End-to-end integration tests for all features
- Performance profiling and optimization
- User acceptance testing (10 beta testers)
- Documentation updates
- Deployment to production environment
- Monitoring and alerting setup
Integration Tests:
def test_full_pipeline_google_meet_to_analysis():
    """
    Test complete flow:
    1. Import Google Meet transcript
    2. Generate graph with 5 zoom levels
    3. Run speaker analytics
    4. Detect Simulacra levels
    5. Detect cognitive biases
    6. Export training data
    """

def test_cost_tracking_end_to_end():
    """Test all API calls logged and costs calculated"""

def test_edit_history_round_trip():
    """Test edit → log → export → import for training"""
Performance Optimization Targets:
- Initial graph load: < 2 seconds
- Zoom transition: < 50ms
- Analytics calculation: < 5 seconds
- Bias detection: < 30 seconds
Documentation to Update:
- User guide (how to import transcripts, use features)
- API documentation (all endpoints, schemas)
- Developer setup guide
- Architecture diagrams
- Prompt engineering guide
Deployment Checklist:
- Database migration scripts tested on staging
- Environment variables configured
- API keys secured in secrets manager
- Monitoring dashboard configured
- Alert thresholds set for costs and errors
- Backup strategy implemented
- Rollback plan documented
Metrics:
- All tests passing (100% critical path)
- Performance targets met
- User satisfaction: Survey beta testers
- Zero P0 bugs in production
PostgreSQL Configuration:
database:
  host: localhost
  port: 5432
  name: lct_production
  max_connections: 100
  connection_timeout: 30s
Table Size Estimates (for 1000 conversations):
- conversations: ~100 KB
- utterances: ~500 MB (avg 1000 utterances/conversation)
- nodes: ~50 MB (avg 50 nodes/conversation)
- edges: ~30 MB (avg 100 edges/conversation)
- clusters: ~10 MB
- edits_log: ~20 MB (assuming 10% edit rate)
- api_calls_log: ~100 MB (detailed logs)
Total Storage: ~710 MB for 1000 conversations
Retention Policy:
- Conversation data: Indefinite (user-managed deletion)
- Raw API logs: 90 days
- Aggregated metrics: 2 years
- Edit history: Indefinite (training data)
Backup Strategy:
- Full backup: Daily at 2 AM
- Incremental backup: Every 6 hours
- Point-in-time recovery: 30 days
- Offsite replication: Google Cloud Storage
- Backup retention: 90 days
Scaling Plan:
- Vertical scaling: Increase PostgreSQL instance size
- Horizontal scaling: Read replicas for analytics queries
- Partitioning: Partition api_calls_log by month
- Archiving: Move old conversations to cold storage (S3)
Google Cloud Storage Buckets:
lct-conversations/
├── transcripts/ # Original uploaded files
├── exports/ # Training data exports
└── backups/ # Database backups
Structure:
transcripts/{conversation_id}/{filename}
exports/{conversation_id}/training-data-{timestamp}.jsonl
backups/postgres-{date}.sql.gz
Storage Costs (Estimate):
- 1000 conversations × 100 KB transcript = 100 MB
- GCS Standard Storage: $0.02/GB/month
- Monthly Cost: ~$0.002 (0.1 GB × $0.02/GB; negligible)
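The storage estimate above is a two-step calculation: total transcript volume in GB, multiplied by the monthly price per GB. The short sketch below works it through using decimal units (1 GB = 1,000,000 KB). The function name is hypothetical, and the price is the $0.02/GB/month figure quoted in this section.

```python
def monthly_storage_cost(
    conversations: int, kb_per_transcript: float, usd_per_gb_month: float
) -> float:
    """Estimate monthly object-storage cost for stored transcripts."""
    total_gb = conversations * kb_per_transcript / 1_000_000  # KB -> GB (decimal)
    return total_gb * usd_per_gb_month


# 1000 conversations x 100 KB = 100 MB = 0.1 GB
cost = monthly_storage_cost(1000, 100, 0.02)
print(f"${cost:.4f} per month")
```

At this scale the result is a fraction of a cent, so storage cost only becomes relevant several orders of magnitude beyond 1000 conversations.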
Application Metrics:
# Cost Metrics
- cost_per_conversation_usd
- cost_per_feature_usd (clustering, bias_detection, etc.)
- cost_by_model (gpt4, gpt3.5, claude)
- total_daily_cost_usd
- cost_per_user
# Performance Metrics
- api_call_latency_ms (p50, p95, p99)
- graph_generation_time_seconds
- zoom_transition_time_ms
- analytics_calculation_time_seconds
- frontend_render_time_ms
# Usage Metrics
- conversations_created_total
- transcripts_imported_total
- nodes_created_total (by_ai, by_user)
- edits_made_total (by_type)
- zoom_level_changes_total
- speaker_analytics_views_total
- bias_detections_total (by_type)
# Quality Metrics
- parse_success_rate
- speaker_detection_accuracy
- bias_detection_confidence (avg)
- user_feedback_score (1-5)
- edit_frequency (edits / nodes)
# System Metrics
- database_query_time_ms
- database_connection_pool_usage
- api_request_rate
- error_rate (by_endpoint)
- active_users
Tools:
- Prometheus: Metrics collection
- Grafana: Dashboards
- Sentry: Error tracking
- Datadog (alternative): All-in-one monitoring
Dashboard Layout:
Dashboard 1: Cost Tracking
- Total daily cost (line chart)
- Cost by model (pie chart)
- Cost per conversation (histogram)
- Cost alerts (threshold indicators)
Dashboard 2: Performance
- API latency (line chart with p50/p95/p99)
- Graph generation time (histogram)
- Database query performance (heat map)
- Frontend render time (line chart)
Dashboard 3: Usage
- Conversations created (bar chart)
- Active users (line chart)
- Feature usage (stacked area chart)
- Edit frequency (line chart)
Dashboard 4: Quality
- Parse success rate (gauge)
- Bias detection confidence (histogram)
- User feedback scores (bar chart)
- Error rate (line chart)
Alerting Rules:
alerts:
  - name: high_daily_cost
    condition: total_daily_cost_usd > 100
    severity: warning
    channel: email, slack
  - name: api_latency_spike
    condition: api_call_latency_p95 > 5000
    severity: critical
    channel: pagerduty
  - name: parse_failure_rate
    condition: parse_success_rate < 0.9
    severity: warning
    channel: slack
  - name: database_connection_exhaustion
    condition: database_connection_pool_usage > 0.8
    severity: critical
    channel: pagerduty
FastAPI Middleware:
# lct_python_backend/instrumentation/middleware.py
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram

# In the real backend, import the existing application instance instead.
app = FastAPI()

# Define metrics
api_request_count = Counter('api_requests_total', 'Total API requests', ['endpoint', 'status'])
api_request_latency = Histogram('api_request_latency_seconds', 'API request latency', ['endpoint'])
api_cost = Counter('api_cost_usd_total', 'Total API cost in USD', ['model', 'endpoint'])

@app.middleware("http")
async def instrument_requests(request: Request, call_next):
    """Record request count and latency for every HTTP request."""
    start_time = time.perf_counter()  # monotonic clock, unaffected by system clock changes
    response = await call_next(request)
    duration = time.perf_counter() - start_time
    endpoint = request.url.path
    api_request_count.labels(endpoint=endpoint, status=response.status_code).inc()
    api_request_latency.labels(endpoint=endpoint).observe(duration)
    return response
Cost Tracking Decorator:
# lct_python_backend/instrumentation/decorators.py
from datetime import datetime
from functools import wraps
import time

from .cost_calculator import calculate_cost
from .middleware import api_cost  # Prometheus counter defined in middleware.py
# `db` is the backend's async database helper; import it from wherever the project defines it.

def track_api_call(endpoint_name: str):
    """Decorator to track cost and performance of LLM API calls"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.time()
            # Only found when the caller passes conversation_id as a keyword argument
            conversation_id = kwargs.get("conversation_id")
            try:
                response = await func(*args, **kwargs)
                # Calculate cost
                cost_usd = calculate_cost(
                    model=response.model,
                    input_tokens=response.usage.prompt_tokens,
                    output_tokens=response.usage.completion_tokens
                )
                # Log to database
                await db.log_api_call({
                    "conversation_id": conversation_id,
                    "endpoint": endpoint_name,
                    "model": response.model,
                    "input_tokens": response.usage.prompt_tokens,
                    "output_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens,
                    "cost_usd": cost_usd,
                    "latency_ms": int((time.time() - start_time) * 1000),
                    "timestamp": datetime.now(),
                    "success": True
                })
                # Update Prometheus metrics
                api_cost.labels(model=response.model, endpoint=endpoint_name).inc(cost_usd)
                return response
            except Exception as e:
                # Log failure, then re-raise so callers still see the error
                await db.log_api_call({
                    "conversation_id": conversation_id,
                    "endpoint": endpoint_name,
                    "timestamp": datetime.now(),
                    "success": False,
                    "error_message": str(e)
                })
                raise
        return wrapper
    return decorator
Usage Example:
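@track_api_call("generate_clusters")
async def generate_clusters(conversation_id: str, utterances: List[Utterance]):
    # acreate is the awaitable variant in openai<1.0; with openai>=1.0,
    # use AsyncOpenAI().chat.completions.create instead.
    response = await openai.ChatCompletion.acreate(
        model="gpt-4",
        messages=[...],
        temperature=0.5
    )
    return response
Backend (Python):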
- pytest (test framework)
- pytest-asyncio (async support)
- pytest-cov (coverage reporting)
- factory_boy (test fixtures)
- faker (test data generation)
Frontend (TypeScript/React):
- Vitest (test framework)
- React Testing Library (component testing)
- MSW (API mocking)
- Playwright (E2E testing)
By Component Type:
- Database models: 100%
- API endpoints: 95%
- Parsers (Google Meet): 100%
- Analysis functions (bias, Simulacra): 80%
- UI components: 70%
- Integration tests: 100% critical paths
Overall Target: 85% code coverage
Backend Test Organization:
tests/
├── unit/
│ ├── test_database_models.py
│ ├── test_google_meet_parser.py
│ ├── test_simulacra_detector.py
│ ├── test_cognitive_bias_detector.py
│ ├── test_speaker_analytics.py
│ └── test_instrumentation.py
├── integration/
│ ├── test_api_endpoints.py
│ ├── test_graph_generation_pipeline.py
│ └── test_cost_tracking.py
├── e2e/
│ ├── test_full_workflow.py
│ └── test_training_data_export.py
└── fixtures/
├── sample_transcripts/
├── sample_conversations.py
└── sample_responses.py
Frontend Test Organization:
src/
├── components/
│ ├── DualView/
│ │ ├── DualViewCanvas.tsx
│ │ └── DualViewCanvas.test.tsx
│ ├── NodeDetail/
│ │ ├── NodeDetailPanel.tsx
│ │ └── NodeDetailPanel.test.tsx
│ └── ...
└── tests/
├── integration/
│ └── graph-interaction.test.tsx
└── e2e/
└── full-workflow.spec.ts
1. Google Meet Parser Tests:
# tests/unit/test_google_meet_parser.py
import pytest

def test_parse_simple_transcript(sample_transcript_text):
    """Test basic parsing of speaker-diarized transcript"""
    parser = GoogleMeetParser()
    result = parser.parse_text(sample_transcript_text)
    assert len(result.utterances) == 5
    assert result.utterances[0].speaker == "Aditya"
    assert result.utterances[0].text == "Okay, sorry."
    assert result.utterances[0].start_time == 0.0

def test_parse_multiline_utterance():
    """Test utterances spanning multiple lines"""
    text = """
    00:00:00
    Speaker A ~: This is a long utterance
    that spans multiple lines
    and should be concatenated.
    Speaker B ~: Short response.
    """
    parser = GoogleMeetParser()
    result = parser.parse_text(text)
    assert len(result.utterances) == 2
    assert "spans multiple lines" in result.utterances[0].text

def test_parse_missing_timestamps():
    """Test handling of incomplete timestamp data"""
    text = "Speaker A ~: No timestamp here.\nSpeaker B ~: Neither here."
    parser = GoogleMeetParser()
    result = parser.parse_text(text)
    # Should still parse speakers, estimate timestamps
    assert len(result.utterances) == 2
    assert result.utterances[0].start_time is not None

def test_parse_special_characters():
    """Test names with unicode, punctuation"""
    text = "José García ~: Hola!\nMary O'Brien ~: Hello."
    parser = GoogleMeetParser()
    result = parser.parse_text(text)
    assert result.utterances[0].speaker == "José García"
    assert result.utterances[1].speaker == "Mary O'Brien"

def test_parse_validation_errors():
    """Test validation catches malformed input"""
    text = "This is not a valid transcript format"
    parser = GoogleMeetParser()
    with pytest.raises(ValidationError) as exc:
        parser.parse_text(text)
    # Checked after the with block: code following the raising call inside it never runs
    assert "No speakers detected" in str(exc.value)
2. Instrumentation Tests:
# tests/unit/test_instrumentation.py
from datetime import datetime

import pytest

@pytest.mark.asyncio
async def test_track_api_call_decorator(mock_db, mock_openai):
    """Test that API calls are logged with correct cost"""
    @track_api_call("test_endpoint")
    async def mock_llm_call(conversation_id: str):
        return MockResponse(
            model="gpt-4",
            # total_tokens is read by the decorator, so the mock must provide it
            usage=MockUsage(prompt_tokens=100, completion_tokens=50, total_tokens=150)
        )

    await mock_llm_call(conversation_id="test-123")
    # Verify log entry created
    logs = await mock_db.get_api_call_logs("test-123")
    assert len(logs) == 1
    assert logs[0].model == "gpt-4"
    assert logs[0].total_tokens == 150
    assert logs[0].cost_usd > 0

def test_cost_calculation_gpt4():
    """Test GPT-4 cost calculation for various token counts"""
    cost = calculate_cost(
        model="gpt-4",
        input_tokens=1000,
        output_tokens=500
    )
    # GPT-4 pricing: $0.03/1K input, $0.06/1K output
    expected = (1000 * 0.03 / 1000) + (500 * 0.06 / 1000)
    assert abs(cost - expected) < 0.0001

@pytest.mark.asyncio
async def test_cost_alert_threshold(mock_db, mock_alert_service):
    """Test alert triggers when cost exceeds threshold"""
    # Log API calls that exceed daily threshold
    for i in range(100):
        await mock_db.log_api_call({
            "cost_usd": 1.5,  # $150 total
            "timestamp": datetime.now()
        })
    # Check alerts were triggered
    alerts = mock_alert_service.get_triggered_alerts()
    assert len(alerts) == 1
    assert alerts[0].alert_type == "high_daily_cost"
3. Simulacra Detection Tests:
# tests/unit/test_simulacra_detector.py
from unittest.mock import patch

def test_detect_level_1_object_level():
    """Test detection of factual, object-level statements"""
    detector = SimulacraDetector()
    utterances = [
        Utterance(text="The temperature is 72 degrees."),
        Utterance(text="According to the data, sales increased 15% last quarter.")
    ]
    results = detector.classify_utterances(utterances)
    assert results[0].level == SimulacraLevel.LEVEL_1
    assert results[0].confidence > 0.8
    assert results[1].level == SimulacraLevel.LEVEL_1

def test_detect_level_3_tribal_signaling():
    """Test detection of group signaling"""
    detector = SimulacraDetector()
    utterances = [
        Utterance(
            speaker="Alice",
            text="As a progressive, I believe we must stand together on this."
        )
    ]
    results = detector.classify_utterances(utterances)
    assert results[0].level == SimulacraLevel.LEVEL_3
    assert "tribal signaling" in results[0].explanation.lower()

def test_detect_ambiguous_requires_llm():
    """Test ambiguous cases fall back to LLM analysis"""
    detector = SimulacraDetector()
    utterances = [
        Utterance(text="I think we should prioritize customer satisfaction.")
    ]
    # patch.object targets the method on this instance; patch('detector.llm_classify')
    # would try to import a module named "detector".
    with patch.object(detector, 'llm_classify') as mock_llm:
        mock_llm.return_value = SimulacraLevel.LEVEL_2
        results = detector.classify_utterances(utterances)
    assert mock_llm.called
    assert results[0].level == SimulacraLevel.LEVEL_2
4. Integration Tests:
# tests/integration/test_graph_generation_pipeline.py
@pytest.mark.asyncio
async def test_full_pipeline_transcript_to_graph():
    """
    Test complete flow from transcript to graph:
    1. Parse Google Meet transcript
    2. Generate nodes via AI
    3. Create temporal edges
    4. Generate contextual edges
    5. Assign zoom levels
    """
    # Load sample transcript
    transcript_path = "tests/fixtures/sample_transcript.txt"
    # Parse
    parser = GoogleMeetParser()
    parsed = await parser.parse_file(transcript_path)
    # Create conversation
    conversation_id = await db.create_conversation({
        "title": "Test Conversation",
        "source": "google_meet"
    })
    # Save utterances
    await db.save_utterances(conversation_id, parsed.utterances)
    # Generate graph
    graph_service = GraphGenerationService()
    await graph_service.generate_initial_graph(conversation_id)
    # Verify results
    nodes = await db.get_nodes(conversation_id)
    edges = await db.get_edges(conversation_id)
    assert len(nodes) > 0
    assert len(edges) > 0
    # Check zoom levels assigned
    zoom_levels = [node.zoom_level_visible for node in nodes]
    assert min(zoom_levels) >= 1
    assert max(zoom_levels) <= 5
    # Verify temporal edges
    temporal_edges = [e for e in edges if e.relationship_type == "temporal"]
    assert len(temporal_edges) == len(nodes) - 1  # Sequential chain

@pytest.mark.asyncio
async def test_cost_tracking_end_to_end():
    """Test all API calls logged and costs calculated"""
    conversation_id = await create_test_conversation()
    # Generate graph (makes multiple LLM calls)
    await GraphGenerationService().generate_initial_graph(conversation_id)
    # Check logs
    logs = await db.get_api_call_logs(conversation_id)
    assert len(logs) > 0
    # Verify cost calculated
    total_cost = sum(log.cost_usd for log in logs)
    assert total_cost > 0
    # Verify all required fields present
    for log in logs:
        assert log.model is not None
        assert log.total_tokens > 0
        assert log.latency_ms > 0
        assert log.timestamp is not None
5. E2E Tests (Playwright):
// src/tests/e2e/full-workflow.spec.ts
import { test, expect } from '@playwright/test';

test('complete workflow: import → visualize → analyze → edit', async ({ page }) => {
  // 1. Navigate to app
  await page.goto('http://localhost:3000');
  // 2. Import Google Meet transcript
  await page.click('[data-testid="import-button"]');
  await page.setInputFiles('[data-testid="file-input"]', 'tests/fixtures/sample_transcript.pdf');
  await page.click('[data-testid="confirm-import"]');
  // Wait for graph generation
  await page.waitForSelector('[data-testid="graph-canvas"]', { timeout: 30000 });
  // 3. Verify dual-view layout (locator() is synchronous, so no await is needed)
  const timelineView = page.locator('[data-testid="timeline-view"]');
  const contextualView = page.locator('[data-testid="contextual-view"]');
  await expect(timelineView).toBeVisible();
  await expect(contextualView).toBeVisible();
  // 4. Test zoom interaction
  await page.click('[data-testid="zoom-in"]');
  await page.waitForTimeout(500); // Animation
  // Verify more detailed nodes visible
  const visibleNodes = await page.locator('[data-node]').count();
  expect(visibleNodes).toBeGreaterThan(10);
  // 5. Select node and view details
  await page.click('[data-node-id="node-1"]');
  const detailPanel = page.locator('[data-testid="node-detail-panel"]');
  await expect(detailPanel).toBeVisible();
  // 6. Enable edit mode and modify summary
  await page.click('[data-testid="toggle-edit-mode"]');
  await page.fill('[data-testid="node-summary-input"]', 'Updated summary text');
  await page.click('[data-testid="save-edit"]');
  // Verify edit saved
  await page.waitForSelector('[data-testid="edit-saved-indicator"]');
  // 7. View speaker analytics
  await page.click('[data-testid="nav-analytics"]');
  await expect(page.locator('[data-testid="speaker-analytics-view"]')).toBeVisible();
  // Verify analytics loaded
  const speakerCards = await page.locator('[data-testid="speaker-card"]').count();
  expect(speakerCards).toBeGreaterThan(0);
  // 8. Check cost tracking
  await page.click('[data-testid="nav-settings"]');
  await page.click('[data-testid="view-cost-dashboard"]');
  const totalCost = await page.textContent('[data-testid="total-cost"]');
  // textContent() can return null, and the displayed value may carry a currency symbol
  expect(parseFloat((totalCost ?? '').replace(/[^0-9.]/g, ''))).toBeGreaterThan(0);
});
test('Simulacra level detection UI', async ({ page }) => {
  await page.goto('http://localhost:3000/conversations/test-123');
  // Enable Simulacra levels (power user feature)
  await page.click('[data-testid="settings"]');
  await page.check('[data-testid="enable-simulacra-detection"]');
  // Wait for analysis
  await page.waitForSelector('[data-testid="simulacra-indicator"]');
  // Verify level indicators visible on utterances
  const level1Indicators = await page.locator('[data-simulacra-level="1"]').count();
  const level3Indicators = await page.locator('[data-simulacra-level="3"]').count();
  expect(level1Indicators + level3Indicators).toBeGreaterThan(0);
  // Click the first level-3 indicator to see its explanation
  await page.locator('[data-simulacra-level="3"]').first().click();
  const explanation = page.locator('[data-testid="simulacra-explanation"]');
  await expect(explanation).toContainText('tribal signaling');
});

GitHub Actions Workflow:
# .github/workflows/test.yml
name: Test Suite
on: [push, pull_request]
jobs:
  backend-tests:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:14
        env:
          POSTGRES_PASSWORD: test
        # Map the port so tests running on the runner can reach the service on localhost
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          cd lct_python_backend
          pip install -r requirements.txt
          pip install pytest pytest-asyncio pytest-cov
      - name: Run tests with coverage
        run: |
          cd lct_python_backend
          pytest --cov=. --cov-report=xml --cov-report=term
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./lct_python_backend/coverage.xml
  frontend-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Install dependencies
        run: npm ci
      # Playwright needs its browser binaries before the E2E step can run
      - name: Install Playwright browsers
        run: npx playwright install --with-deps
      - name: Run unit tests
        run: npm run test:unit -- --coverage
      - name: Run E2E tests
        run: npm run test:e2e
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage/coverage-final.json

Coverage Requirements:
# .coveragerc (Python)
[run]
omit =
    */tests/*
    */migrations/*
    */venv/*
[report]
fail_under = 85
show_missing = True

// package.json (JavaScript)
{
  "jest": {
    "coverageThreshold": {
      "global": {
        "branches": 70,
        "functions": 70,
        "lines": 85,
        "statements": 85
      }
    }
  }
}

1. LLM API Cost Overruns
- Risk: Unexpected cost spikes from inefficient prompts or high usage
- Probability: Medium (40%)
- Impact: High (budget exhaustion)
- Mitigation:
- Implement strict cost tracking and alerts (Week 2)
- Set daily/weekly spending limits
- Use prompt caching where possible
- A/B test prompts for cost efficiency
- Provide cost estimates before expensive operations
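The spending-limit and pre-operation estimate mitigations above could be combined into a small guard that runs before each expensive call. The following is a minimal sketch, not project code: `BudgetGuard`, `estimate_cost_usd`, the $5.00 limit, and the per-1K-token rates are all hypothetical placeholders, and the rates are not real provider prices.

```python
from dataclasses import dataclass


@dataclass
class BudgetGuard:
    """Tracks spend against a daily limit (hypothetical helper, not project code)."""
    daily_limit_usd: float
    spent_today_usd: float = 0.0

    def can_afford(self, estimated_cost_usd: float) -> bool:
        # Allow the call only if it keeps today's total at or under the limit.
        return self.spent_today_usd + estimated_cost_usd <= self.daily_limit_usd

    def record(self, actual_cost_usd: float) -> None:
        # Call after the API request completes, using the logged cost.
        self.spent_today_usd += actual_cost_usd


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Estimate cost from token counts; the rates are caller-supplied assumptions."""
    return ((input_tokens / 1000) * input_rate_per_1k
            + (output_tokens / 1000) * output_rate_per_1k)


guard = BudgetGuard(daily_limit_usd=5.00)
# Placeholder rates for illustration only; substitute the provider's current pricing.
estimate = estimate_cost_usd(13_000, 2_000, input_rate_per_1k=0.01, output_rate_per_1k=0.03)
if guard.can_afford(estimate):
    guard.record(estimate)
```

The same check could also feed the alert system planned for Week 2, for example by warning when `spent_today_usd` passes a fraction of the limit.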
2. Poor Bias Detection Accuracy
- Risk: Too many false positives frustrate users; false negatives reduce trust
- Probability: High (60%)
- Impact: Medium (feature perceived as unreliable)
- Mitigation:
- Set confidence thresholds (only show high-confidence detections)
- Allow users to provide feedback on false positives
- Continuously improve prompts based on user feedback
- Make feature opt-in for power users initially
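The confidence-threshold mitigation above might look like the following sketch. The `BiasDetection` shape, the bias names, and the 0.8 default are assumptions chosen for illustration; the actual detector output and threshold would need to be tuned against user feedback.

```python
from dataclasses import dataclass


@dataclass
class BiasDetection:
    """One detected bias (hypothetical shape; the real model may differ)."""
    utterance_id: int
    bias_type: str
    confidence: float  # assumed to lie in the range 0.0-1.0


def filter_by_confidence(detections, threshold=0.8):
    """Keep only detections at or above the threshold."""
    return [d for d in detections if d.confidence >= threshold]


detections = [
    BiasDetection(1, "confirmation_bias", 0.92),
    BiasDetection(2, "anchoring", 0.55),
    BiasDetection(3, "sunk_cost", 0.80),
]
shown = filter_by_confidence(detections)  # utterances 1 and 3 pass the default threshold
```

Keeping the threshold as a parameter rather than a constant would make it easy to adjust per user or per bias type as feedback accumulates.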
3. Performance Degradation with Large Conversations
- Risk: UI becomes sluggish with 500+ nodes or long transcripts
- Probability: Medium (50%)
- Impact: High (unusable for long meetings)
- Mitigation:
- Implement aggressive node culling based on zoom level
- Use virtualization for timeline view
- Paginate analytics queries
- Profile and optimize rendering pipeline
- Set realistic expectations (test with 2-hour conversations)
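Node culling by zoom level, the first mitigation above, can be sketched as a simple filter. This assumes that `zoom_level_visible` stores the lowest zoom level (1-5) at which a node appears, so that zooming in reveals more nodes; the roadmap does not spell out that semantics, and the dictionary-based node shape is a stand-in for the real model.

```python
def visible_nodes(nodes, current_zoom):
    """Return only the nodes that should be rendered at the current zoom level.

    Assumption: zoom_level_visible is the lowest zoom level (1-5) at which a
    node appears, so a higher current_zoom shows more nodes.
    """
    if not 1 <= current_zoom <= 5:
        raise ValueError("current_zoom must be between 1 and 5")
    return [n for n in nodes if n["zoom_level_visible"] <= current_zoom]


nodes = [
    {"id": "topic-a", "zoom_level_visible": 1},
    {"id": "subtopic-a1", "zoom_level_visible": 3},
    {"id": "detail-a1x", "zoom_level_visible": 5},
]
coarse = visible_nodes(nodes, current_zoom=1)
detailed = visible_nodes(nodes, current_zoom=3)
```

Doing this filtering before nodes reach the renderer keeps the number of mounted elements bounded regardless of total conversation length.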
4. Google Meet Parser Brittleness
- Risk: Google changes transcript format, parser breaks
- Probability: Medium (30%)
- Impact: High (core feature broken)
- Mitigation:
- Extensive test coverage with real transcripts
- Graceful degradation (allow manual speaker annotation)
- Version detection (detect format changes)
- Support multiple input formats (PDF, TXT, manual paste)
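Graceful degradation could mean parsing every line that matches a known shape and routing the rest to manual speaker annotation instead of failing the whole import. The sketch below assumes a simple `Speaker Name: utterance` line shape purely for illustration; real Google Meet exports may differ, and `parse_lines` is a hypothetical helper rather than the project's `GoogleMeetParser`.

```python
import re

# Assumed line shape "Speaker Name: utterance text"; real exports may differ.
SPEAKER_LINE = re.compile(r"^(?P<speaker>[^:\n]{1,60}):\s+(?P<text>.+)$")


def parse_lines(lines):
    """Parse lines that match; collect the rest for manual speaker annotation."""
    parsed, needs_review = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_LINE.match(line)
        if match:
            parsed.append((match["speaker"], match["text"]))
        else:
            needs_review.append(line)
    return parsed, needs_review


sample = ["Alice: Hello everyone.", "", "unattributed text", "Bob: Hi Alice."]
parsed, needs_review = parse_lines(sample)
```

The ratio of `needs_review` lines to total lines could also serve as a cheap signal that the transcript format has changed.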
5. Database Migration Failures
- Risk: Migration breaks production data or causes downtime
- Probability: Low (20%)
- Impact: Critical (data loss)
- Mitigation:
- Test migrations extensively on staging
- Full database backup before migration
- Rollback plan documented and tested
- Blue-green deployment strategy
6. Edit History Storage Growth
- Risk: edits_log table grows unbounded
- Probability: High (80%)
- Impact: Low (storage cost, query slowdown)
- Mitigation:
- Partition table by conversation_id
- Archive old edits to cold storage
- Index optimization
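Archiving old edits, as listed above, starts with deciding which rows are old enough to move. A minimal sketch, assuming each edit carries a `created_at` timestamp and using a 90-day retention window that is only a placeholder:

```python
from datetime import datetime, timedelta, timezone


def split_for_archive(edits, now, max_age_days=90):
    """Split edits into (keep, archive) lists around an age cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    keep, archive = [], []
    for edit in edits:
        # Anything strictly older than the cutoff goes to cold storage.
        (archive if edit["created_at"] < cutoff else keep).append(edit)
    return keep, archive


now = datetime(2025, 11, 11, tzinfo=timezone.utc)
edits = [
    {"id": 1, "created_at": datetime(2025, 11, 1, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},
]
keep, archive = split_for_archive(edits, now)
```

In production this selection would more likely be a SQL query against `edits_log`, but the cutoff logic is the same.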
7. Prompt Engineering Complexity
- Risk: Maintaining consistent prompts across features is difficult
- Probability: Medium (50%)
- Impact: Medium (inconsistent results)
- Mitigation:
- Centralized prompts.json configuration
- Versioning and rollback capability
- A/B testing framework for prompt changes
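Versioning with rollback, as proposed above, can be illustrated with an in-memory registry. `PromptRegistry` and its methods are hypothetical names for this sketch; how versions would actually be persisted in prompts.json is not specified in this roadmap.

```python
class PromptRegistry:
    """In-memory versioned prompt store (illustrative, not the prompts.json loader)."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of prompt texts, oldest first
        self._active = {}    # prompt name -> index of the active version

    def publish(self, name, text):
        """Append a new version, make it active, and return its 1-based number."""
        self._versions.setdefault(name, []).append(text)
        self._active[name] = len(self._versions[name]) - 1
        return len(self._versions[name])

    def rollback(self, name):
        """Reactivate the previous version, if one exists."""
        if self._active.get(name, 0) == 0:
            raise ValueError(f"no earlier version of {name!r} to roll back to")
        self._active[name] -= 1

    def get(self, name):
        """Return the text of the currently active version."""
        return self._versions[name][self._active[name]]


registry = PromptRegistry()
registry.publish("bias_detection", "v1: list cognitive biases in the text")
registry.publish("bias_detection", "v2: list biases with confidence scores")
registry.rollback("bias_detection")
active = registry.get("bias_detection")
```

Because old versions are never deleted, the same structure could back the A/B testing framework by serving two versions side by side.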
Functional Requirements:
- Import Google Meet transcripts (PDF/TXT)
- Generate graph with 5 zoom levels
- Dual-view architecture working
- Node detail panel with edit capability
- Speaker analytics view
- Cost tracking dashboard
- Edit history export
Performance Requirements:
- Graph generation: < 60 seconds for 1-hour conversation
- Zoom transitions: < 100ms
- UI render: 60 FPS with 100 nodes
- Parse success rate: > 90%
Quality Requirements:
- Test coverage: 85%+
- Zero P0 bugs
- User satisfaction: 4/5 from beta testers
Business Requirements:
- Cost per conversation: < $3.00
- 10 beta users onboarded
- Documentation complete
Usage Metrics:
- Monthly active users
- Conversations created per user
- Average conversation length
- Feature adoption rates
Quality Metrics:
- Parse success rate (target: 95%)
- User-reported bugs per week (target: < 5)
- Edit frequency (proxy for whether AI output is good enough)
Cost Metrics:
- Cost per conversation (target: < $2.00)
- Cost per user per month (target: < $10)
Performance Metrics:
- Graph generation time (target: < 30 seconds)
- API latency p95 (target: < 2 seconds)
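The p95 latency target could be checked directly against the `latency_ms` values recorded in `api_calls_log`. The sketch below uses the nearest-rank percentile method, which is one of several common definitions and is an assumption here; the sample latencies are made up for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Ceiling of pct/100 * n, computed with integers to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Made-up latencies in milliseconds: 100, 200, ..., 2000
latencies_ms = list(range(100, 2100, 100))
p95 = percentile(latencies_ms, 95)
meets_target = p95 < 2000  # target from the list above: p95 under 2 seconds
```

With few samples the nearest-rank method simply returns one of the observed values, which makes the result easy to trace back to a specific logged call.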
Backend:
- Python 3.11
- FastAPI (web framework)
- SQLAlchemy (ORM)
- Alembic (migrations)
- PostgreSQL 14
- OpenAI Python SDK
- Anthropic Python SDK
- pytest (testing)
Frontend:
- React 18
- TypeScript 5
- Vite (build tool)
- React Flow (graph visualization)
- TailwindCSS (styling)
- Vitest (testing)
- Playwright (E2E testing)
Infrastructure:
- Google Cloud Storage (file storage)
- Prometheus (metrics)
- Grafana (dashboards)
- Sentry (error tracking)
- GitHub Actions (CI/CD)
LLM API Costs (per 1000 conversations):
Assumptions:
- Average conversation: 10,000 words = ~13,333 tokens
- Average generated graph: 50 nodes, 100 edges
Operations per conversation:
- Initial clustering: 13k input + 2k output = $0.48 (GPT-4)
- Contextual edges: 5k input + 1k output = $0.18 (GPT-3.5-turbo)
- Speaker analytics: 10k input + 500 output = $0.33 (GPT-4)
- Simulacra detection: 13k input + 1k output = $0.45 (GPT-4)
- Bias detection: 13k input + 2k output = $0.51 (GPT-4)
Total per conversation: ~$1.95
For 1000 conversations: ~$1,950
Monthly estimates (100 conversations): ~$195
Optimization opportunities:
- Use GPT-3.5-turbo for simpler tasks (-50% cost)
- Implement prompt caching (-30% cost on repeated patterns)
- Batch processing where possible (-10% cost)
Optimized cost: ~$1.00 per conversation
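The roll-up above can be reproduced from the listed line items. The sketch takes the per-operation figures exactly as given in this section rather than deriving them from provider pricing, and the 0.75 words-per-token ratio is inferred from the stated 10,000 words and ~13,333 tokens.

```python
# Per-operation estimates copied from the list above (USD per conversation).
OPERATION_COSTS_USD = {
    "initial_clustering": 0.48,
    "contextual_edges": 0.18,
    "speaker_analytics": 0.33,
    "simulacra_detection": 0.45,
    "bias_detection": 0.51,
}

# Token estimate implied by the assumptions: 10,000 words at ~0.75 words per token.
tokens_per_conversation = round(10_000 / 0.75)

per_conversation = round(sum(OPERATION_COSTS_USD.values()), 2)
per_1000_conversations = round(per_conversation * 1000, 2)
monthly_100_conversations = round(per_conversation * 100, 2)
```

Keeping the line items in one table like this makes it straightforward to re-run the estimate when a model or a token budget changes.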
See DATA_MODEL_V2.md for full schema.
Key Tables:
- conversations: Conversation metadata
- utterances: Raw speaker-diarized text
- nodes: AI-generated summaries/chunks
- edges: Relationships between nodes
- clusters: Hierarchical groupings for zoom
- edits_log: Training data (all user edits)
- api_calls_log: Cost and performance tracking
Terms:
- Utterance: Single statement by one speaker
- Node: AI-generated conversation chunk (summary + utterances)
- Edge: Relationship between nodes (temporal or contextual)
- Cluster: Hierarchical grouping of nodes for zoom levels
- Zoom Level: Discrete granularity level (1-5)
- Simulacra Level: Zvi Mowshowitz framework for communication intent (1-4)
- Cognitive Bias: Systematic pattern of deviation from rationality
- Implicit Frame: Hidden worldview or assumption in normative claims
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-11-11 | Claude | Initial roadmap |
End of Roadmap