Langfuse tracing across RAG ingestion pipeline #31516

Open
kushidhar-in wants to merge 3 commits into yugabyte:master from kushidhar-in:feat/rag-worker-tracing

@kushidhar-in

Add Langfuse tracing across RAG ingestion pipeline with datapack-scoped project resolution

Instrument the RAG agent pipeline end-to-end with Langfuse spans/observations so ingestion,
chunking, embedding generation, vector writes, and pipeline state updates are traceable in a
single observability flow. This adds method-level @observe coverage to core pipeline layers
(document_preprocessor, rag_handler, partition_chunk_pipeline, chunk, process_pdf,
embed, embedding_user_promt, and active pipeline tracking DB methods), and introduces a
wrapped SQL executor in yugabytedb_vector_store to capture query metadata and row counts for
database operations.
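
As a rough illustration of the wrapped-executor idea described above, the sketch below records the statement text, parameter count, row count, and duration for each query. It is not the code from this PR: `TracedExecutor`, `QueryRecord`, and the `report` callback are hypothetical names, `sqlite3` stands in for the YugabyteDB connection, and a plain list stands in for the tracing backend.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class QueryRecord:
    """Metadata captured for one executed statement (hypothetical shape)."""
    statement: str
    param_count: int
    row_count: int
    duration_ms: float


class TracedExecutor:
    """Wraps a DB-API connection and reports metadata for every statement."""

    def __init__(self, conn: Any, report: Callable[[QueryRecord], None]) -> None:
        self.conn = conn
        self.report = report

    def execute(self, statement: str, params: Sequence[Any] = ()) -> list:
        started = time.perf_counter()
        cur = self.conn.cursor()
        try:
            cur.execute(statement, params)
            # A non-None description means the statement produced a result set.
            returns_rows = cur.description is not None
            rows = cur.fetchall() if returns_rows else []
            # For writes, DB-API rowcount is the number of affected rows.
            row_count = len(rows) if returns_rows else cur.rowcount
        finally:
            cur.close()
        elapsed_ms = (time.perf_counter() - started) * 1000.0
        # Only the parameter count is reported, not the parameter values.
        self.report(QueryRecord(statement, len(params), row_count, elapsed_ms))
        return rows


# Usage: an in-memory database and a list acting as the trace sink.
records: List[QueryRecord] = []
conn = sqlite3.connect(":memory:")
executor = TracedExecutor(conn, records.append)
executor.execute("CREATE TABLE chunks (id INTEGER, body TEXT)")
executor.execute("INSERT INTO chunks VALUES (?, ?)", (1, "hello"))
rows = executor.execute("SELECT id, body FROM chunks")
```

Reporting the parameter count rather than the values is a deliberate choice in this sketch, so that document text or embeddings do not end up in trace metadata.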

In DocumentPreprocessor, add datapack-aware Langfuse key resolution from document URI and
meko_system.langfuse_project_mapping, then bind the active public key context so nested
observations are routed to the correct Langfuse project. Also wrap top-level task processing in
an observation span and ensure structured success/error output is emitted to tracing. Improve
failure handling by updating document status to FAILED only when document_id is available.
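
The failure-handling rule in the last sentence can be sketched as follows. This is a minimal illustration under assumptions, not the PR's implementation: `process_task`, `handle`, and `update_status` are hypothetical names, and the task is assumed to be a dict that may or may not carry a `document_id` key.

```python
from typing import Any, Callable, Dict, Optional


def process_task(
    task: Dict[str, Any],
    handle: Callable[[Dict[str, Any]], Any],
    update_status: Callable[[str, str], None],
) -> Dict[str, Any]:
    """Run one task and return a structured success or error result."""
    document_id: Optional[str] = None
    try:
        # May raise KeyError, in which case document_id stays None.
        document_id = task["document_id"]
        result = handle(task)
        return {"status": "SUCCESS", "document_id": document_id, "result": result}
    except Exception as exc:
        # Only touch the status row when we know which document failed.
        if document_id is not None:
            update_status(document_id, "FAILED")
        return {
            "status": "FAILED",
            "document_id": document_id,
            "error": f"{type(exc).__name__}: {exc}",
        }


def failing_handler(task: Dict[str, Any]) -> Any:
    raise ValueError("cannot parse PDF")


# Usage: a list stands in for the status table.
updates = []
record_update = lambda doc_id, status: updates.append((doc_id, status))
no_id = process_task({}, failing_handler, record_update)
with_id = process_task({"document_id": "doc-1"}, failing_handler, record_update)
```

In this sketch the task without a `document_id` produces a FAILED result but no status write, while the second task produces both.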

Update dependencies by adding langfuse>=3.0.0 to support the new tracing instrumentation.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


kushidhar-in does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request integrates Langfuse for observability across the RAG pipeline, adding @observe decorators to various database, embedding, and processing functions. It also introduces dynamic Langfuse client resolution based on document metadata. Feedback focuses on critical runtime errors in the new _resolve_langfuse_client method, specifically regarding missing null checks for document_uri and incorrect return types that would cause unpacking failures. Additionally, there is a concern regarding the use of a private Langfuse API for context binding.
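
The two runtime concerns raised here (a missing null check on document_uri and a return shape that breaks tuple unpacking) can be illustrated with a small sketch. It does not reproduce `_resolve_langfuse_client`: the function name `resolve_langfuse_key` is hypothetical, the URI layout (datapack name as the first path segment) is an assumption, and a plain dict stands in for the `meko_system.langfuse_project_mapping` lookup.

```python
from typing import Mapping, Optional, Tuple
from urllib.parse import urlparse


def resolve_langfuse_key(
    document_uri: Optional[str],
    mapping: Mapping[str, str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (datapack, public_key); always a 2-tuple so callers can unpack."""
    # Guard against a missing or empty URI before parsing it.
    if not document_uri:
        return None, None
    # Assumed layout: scheme://bucket/<datapack>/<file>
    segments = [s for s in urlparse(document_uri).path.split("/") if s]
    if not segments:
        return None, None
    datapack = segments[0]
    # An unmapped datapack yields a None key rather than a different shape.
    return datapack, mapping.get(datapack)


# Usage: every call site can unpack the result the same way.
mapping = {"sales": "pk-example-sales"}
datapack, key = resolve_langfuse_key("s3://bucket/sales/report.pdf", mapping)
missing = resolve_langfuse_key(None, mapping)
unmapped = resolve_langfuse_key("s3://bucket/hr/policy.pdf", mapping)
```

Returning the same shape on every path means the caller decides how to fall back to a default project, instead of failing at the unpacking step.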

Comment thread python/ai/rag_agent/rag_pipeline/document_preprocessor.py Outdated
Comment thread python/ai/rag_agent/rag_pipeline/document_preprocessor.py
Comment thread python/ai/rag_agent/rag_pipeline/document_preprocessor.py

netlify Bot commented May 8, 2026

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit f3eddde
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/6a01bbc25e41e2000862cdfd
😎 Deploy Preview https://deploy-preview-31516--infallible-bardeen-164bc9.netlify.app

@kushidhar-in kushidhar-in force-pushed the feat/rag-worker-tracing branch from ee08636 to e3d654a Compare May 8, 2026 15:52
boto3==1.34.81 # AWS SDK for Python (latest as of 2024-06)

# Langfuse
langfuse>=3.0.0

Contributor

Should we lock it to a specific version since, as per gemini, we are using an internal API get_client()?

Author

Tracked in a separate ticket. We will pin langfuse to a specific version across all components.

Contributor

@ashetkar ashetkar left a comment

Please address/respond to gemini comments as well.

Comment thread python/ai/rag_agent/rag_pipeline/document_preprocessor.py
Comment thread python/ai/rag_agent/rag_pipeline/document_preprocessor.py
Comment thread python/ai/rag_agent/rag_pipeline/document_preprocessor.py