Skip to content

dataset fixes#5525

Merged
chitalian merged 1 commit intomainfrom
justin/dataset-fix
Jan 16, 2026
Merged

dataset fixes#5525
chitalian merged 1 commit intomainfrom
justin/dataset-fix

Conversation

@chitalian
Copy link
Contributor

Ticket

Link to the ticket(s) this pull request addresses.

Component/Service

What part of Helicone does this affect?

  • Web (Frontend)
  • Jawn (Backend)
  • Worker (Proxy)
  • Bifrost (Marketing)
  • AI Gateway
  • Packages
  • Infrastructure/Docker
  • Documentation

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Performance improvement
  • Refactoring

Deployment Notes

  • No special deployment steps required
  • Database migrations need to run
  • Environment variable changes required
  • Coordination with other teams needed

Screenshots / Demos

Before After

Extra Notes

Any additional context, considerations, or notes for reviewers.

Context

Why are you making this change?

Screenshots / Demos

@vercel
Copy link

vercel bot commented Jan 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
helicone Ready Ready Preview, Comment Jan 16, 2026 0:10am
helicone-bifrost Building Building Preview, Comment Jan 16, 2026 0:10am
helicone-eu Ready Ready Preview, Comment Jan 16, 2026 0:10am

Review with Vercel Agent

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@supabase
Copy link

supabase bot commented Jan 16, 2026

This pull request has been ignored for the connected project bolqqmqbrciybnypvklh because there are no changes detected in supbase/supabase directory. You can change this behaviour in Project Integrations Settings ↗︎.


Preview Branches by Supabase.
Learn more about Supabase Branching ↗︎.

@supabase
Copy link

supabase bot commented Jan 16, 2026

This pull request has been ignored for the connected project lmahfbbnchpworytrrqk because there are no changes detected in .supabase/supabase directory. You can change this behaviour in Project Integrations Settings ↗︎.


Preview Branches by Supabase.
Learn more about Supabase Branching ↗︎.

@claude
Copy link
Contributor

claude bot commented Jan 16, 2026

Claude finished @chitalian's task —— View job


Code Review: PR #5525 "dataset fixes"

Score: 6/10 - Minor issues and improvements suggested, merge with consideration

🔍 Analysis Summary

This PR implements three main changes:

  1. Dataset Manager Enhancement: Added ClickHouse fallback mechanism when S3 copy operations fail
  2. API Type Changes: Replaced VectorOperators with TextOperators for request/response body filtering
  3. Free Tier Update: Changed datasets from limited (1 dataset, 10 requests) to unlimited

🐛 Issues Found

Critical: JSON Parsing Vulnerability

Location: valhalla/jawn/src/managers/dataset/HeliconeDatasetManager.ts:308-309

request: request_body ? JSON.parse(request_body) : {},
response: response_body ? JSON.parse(response_body) : {},

Problem: No error handling for malformed JSON from ClickHouse
Fix: Wrap in try-catch or validate JSON before parsing:

request: request_body ? safeJsonParse(request_body) || {} : {},
response: response_body ? safeJsonParse(response_body) || {} : {},

Performance: Inefficient Database Query

Location: valhalla/jawn/src/managers/dataset/HeliconeDatasetManager.ts:283-291
Problem: ClickHouse query lacks proper indexing hints and could be slow
Suggestion: Consider adding index hints or query optimization for large datasets

Logic: Inconsistent Error Handling

Location: valhalla/jawn/src/managers/dataset/HeliconeDatasetManager.ts:248-260
Problem: S3 copy failure automatically triggers ClickHouse fallback, but this might mask legitimate S3 connectivity issues
Suggestion: Add logging to distinguish between "object not found" vs "S3 unavailable"

🔧 Code Quality Improvements

Type Safety Enhancement

The change from VectorOperators to TextOperators is good - it makes the API more consistent. However, ensure this doesn't break existing client implementations expecting vector search capabilities.

Free Tier Changes

Location: web/lib/freeTierLimits.ts:93-106

  • Good: Simplified user experience with unlimited datasets
  • ⚠️ Consider: This removes revenue protection - ensure this aligns with business strategy

💡 Suggestions for Improvement

  1. Add Retry Logic: The S3 copy operation should have retry logic before falling back to ClickHouse
  2. Metrics/Logging: Add observability for when fallback mechanism is triggered
  3. Input Validation: Validate request IDs format before database operations
  4. Error Types: Use specific error types instead of generic strings

🧪 Testing Recommendations

  • Test the fallback mechanism with missing S3 objects
  • Verify API compatibility after VectorOperators → TextOperators change
  • Test edge cases with malformed JSON in ClickHouse
  • Load test the new unlimited datasets feature

🚨 Breaking Changes Assessment

Medium Risk: The VectorOperatorsTextOperators change could break clients expecting vector search functionality. Review API usage patterns before merging.

Summary

This PR addresses important dataset reliability issues but needs JSON parsing safety improvements. The type changes should be validated for backward compatibility.

Recommended Actions Before Merge:

  1. Fix JSON.parse vulnerability
  2. Add error logging for S3 fallback triggers
  3. Verify no clients depend on VectorOperators
  4. Consider rate limiting for unlimited datasets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant