Self-Learning LLM Systems: Evals, Feedback Loops & Reliability in Production
Short talk description
Production AI systems fail in ways benchmarks rarely capture. In this talk, we’ll explore how to design self-learning LLM systems that continuously improve using evaluation pipelines, feedback loops and reliability engineering. We’ll go beyond prompts and cover practical techniques for hallucination detection, prompt versioning, regression testing, drift monitoring and automated eval frameworks used in real-world deployments.
The session will also dive into how explicit and implicit user feedback can be transformed into measurable improvements for RAG and LLM applications running in production at scale.
Long talk description
Most LLM discussions focus on prompts, models, or benchmarks. But once AI systems move into production, an entirely different set of challenges appears: hallucinations, retrieval failures, response drift, inconsistent reasoning, prompt regressions, and unpredictable behavior across versions.
This talk focuses on the engineering side of building reliable, self-learning LLM systems in production. We’ll explore how modern GenAI applications can continuously improve through structured evaluation pipelines, feedback loops, and automated reliability checks rather than relying purely on manual prompt tuning.
The session will cover practical techniques for:
Designing evaluation frameworks for RAG and LLM systems
Measuring groundedness, hallucinations, consistency, and response quality
Building explicit and implicit feedback pipelines
Prompt versioning and regression testing
Drift monitoring across model and prompt updates
Using synthetic datasets and backtesting for evaluations
Production monitoring strategies for GenAI systems
We’ll also discuss architectural patterns used in real-world deployments, including how evaluation systems integrate with retrieval pipelines, vector databases, and agentic workflows. The focus will remain heavily practical, with examples from enterprise AI systems where reliability, auditability, and continuous improvement are critical.
Rather than another “intro to LLMs” session, this talk aims to provide a deeper look into the unseen operational layer required to make AI systems dependable at scale.
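As an illustration of the groundedness measurement listed among the techniques above, here is a minimal sketch that scores an answer by the share of its sentences that are lexically supported by the retrieved context. The function names, the token-overlap heuristic, and the 0.6 threshold are assumptions made for this example, not the method used in the talk; production setups often use an NLI model or an LLM judge instead.

```python
from __future__ import annotations

import re


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens longer than two characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


def groundedness(answer: str, contexts: list[str], threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose tokens mostly appear in a single context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    context_tokens = [_tokens(c) for c in contexts]
    supported = 0
    for sentence in sentences:
        sent_tokens = _tokens(sentence)
        if not sent_tokens:
            continue  # nothing to verify; the sentence counts as unsupported
        # Best overlap ratio against any one retrieved chunk.
        best = max(
            (len(sent_tokens & ctx) / len(sent_tokens) for ctx in context_tokens),
            default=0.0,
        )
        if best >= threshold:
            supported += 1
    return supported / len(sentences)


# Hypothetical example data.
contexts = ["The refund window is 30 days from the delivery date."]
answer = "The refund window is 30 days. Refunds are paid in bitcoin."
print(groundedness(answer, contexts))  # 0.5: the second sentence is unsupported
```

A score like this is cheap enough to run on every response, which makes it a reasonable first filter before more expensive human or model-based review.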
What format do you have in mind?
Workshop (45-60 minutes, hands-on)
Talk outline / Agenda
(Mostly hands-on, with slides for the overview)
Why production LLM systems fail differently than demos
Common failure modes in RAG and GenAI systems
Hallucinations
Retrieval mismatch
Prompt regressions
Drift over time
Designing evaluation frameworks for LLM systems
Groundedness checks
Consistency scoring
Hallucination evaluation
Human vs automated evals
Building self-learning AI pipelines
Explicit feedback loops
Implicit feedback signals
Feedback categorization and prioritization
Prompt versioning and regression testing
Backtesting prompts on historical datasets
Measuring improvements before rollout
Production architecture patterns
Eval pipelines with RAG systems
Monitoring and observability
Continuous improvement workflows
Real-world deployment learnings and key takeaways
Q&A
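To make the agenda items on backtesting prompts and measuring improvements before rollout more concrete, here is a minimal sketch of a regression gate that replays a historical dataset against two prompt versions. Everything in it is an assumption for illustration: the `Case` structure, the substring-match pass criterion, and the stub functions that stand in for real model calls.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    question: str
    expected: str  # substring a correct answer must contain


def pass_rate(generate: Callable[[str], str], cases: list[Case]) -> float:
    """Share of historical cases whose answer contains the expected substring."""
    if not cases:
        raise ValueError("backtest needs at least one case")
    passed = sum(c.expected.lower() in generate(c.question).lower() for c in cases)
    return passed / len(cases)


def safe_to_roll_out(baseline, candidate, cases: list[Case], max_drop: float = 0.0) -> bool:
    """Allow rollout only if the candidate does not regress beyond max_drop."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - max_drop


# Hypothetical historical dataset and canned outputs standing in for LLM calls.
cases = [
    Case("What is the refund window?", "30 days"),
    Case("Which currency are refunds paid in?", "INR"),
]
_v1 = {
    cases[0].question: "Refunds are accepted within 30 days.",
    cases[1].question: "Refunds are paid in INR.",
}
_v2 = {
    cases[0].question: "Refunds are accepted within 30 days of delivery.",
    cases[1].question: "Refunds are paid in your original currency.",
}
prompt_v1 = _v1.__getitem__
prompt_v2 = _v2.__getitem__

print(pass_rate(prompt_v1, cases), pass_rate(prompt_v2, cases))  # 1.0 0.5
print(safe_to_roll_out(prompt_v1, prompt_v2, cases))  # False: v2 regresses
```

In a real pipeline the stubs would be replaced by calls that render each prompt version and query the model, and the gate would typically run in CI before a prompt change is merged.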
Key takeaways
Understand why production LLM systems fail differently from demos and benchmarks
Learn how to design practical evaluation frameworks for GenAI and RAG systems
Gain insights into hallucination detection, groundedness checks, and response reliability measurement
Learn how feedback loops and prompt versioning enable self-learning AI systems
Understand strategies for regression testing, drift monitoring, and continuous improvement in production
Explore real-world architectural patterns for building reliable and scalable AI systems
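As a rough picture of what drift monitoring can look like in code, the sketch below tracks a rolling window of per-response eval scores and flags drift when the window mean falls too far below a baseline. The class name, window size, and tolerance are hypothetical choices for this example; real deployments may prefer statistical tests or per-segment baselines.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flags drift when the rolling mean score drops below baseline - tolerance."""

    def __init__(self, baseline_mean: float, window: int = 50, tolerance: float = 0.05):
        self.baseline_mean = baseline_mean
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # oldest scores fall out automatically

    def record(self, score: float) -> bool:
        """Add one eval score; return True if the full window shows drift."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        return self.baseline_mean - mean(self.scores) > self.tolerance


# Hypothetical scores, e.g. groundedness values from production traffic.
monitor = DriftMonitor(baseline_mean=0.9, window=3, tolerance=0.05)
print([monitor.record(s) for s in (0.9, 0.9, 0.9, 0.7)])
# [False, False, False, True]
```

Resetting the baseline after each intentional model or prompt update keeps the alert focused on unplanned changes.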
What domain would you say your talk falls under?
Data Science and Machine Learning
Duration (including Q&A)
30-40 min (depending on availability; I would also like to include a practical demo)
Prerequisites and preparation
Basic understanding of LLMs and RAG architectures
Familiarity with Python and GenAI workflows
Some exposure to prompt engineering or AI application development will be helpful
No prior experience with evaluation frameworks is required, but attendees should be comfortable with intermediate AI/ML concepts and production system discussions
Speaker bio
Akash Chandra is the Founder & CEO of InsightAI, a deep-tech AI company building large-scale fraud detection, AML intelligence, and financial risk systems for banks and fintechs. He specializes in production AI architectures spanning GenAI, Graph Intelligence, low-latency systems, and real-time decisioning pipelines.
An IIT Kanpur graduate with over a decade of experience in AI and fintech, Akash has worked on enterprise AI systems involving RAG pipelines, self-learning LLM workflows, graph-based fraud analytics, and multimodal intelligence platforms. He regularly speaks on AI infrastructure, financial intelligence, and production-grade GenAI systems, with recent sessions at Nodes24 and fintech-focused technology forums.
Availability
Saturdays available
Accessibility & special requirements
No special accessibility requirements from my side.
For the session, a projector/display with HDMI support, stable internet connectivity and audio output for demos would be helpful.
Speaker checklist
I have read and understood the PyDelhi guidelines for submitting proposals and giving talks
I have read and acknowledged the PyDelhi accessibility guidelines and will ensure my presentation materials (slides, videos, demos) follow these recommendations
I will make my talk accessible to all attendees and will proactively ask for any accommodations or special requirements I might need
I agree to share slides, code snippets, and other materials used during the talk with the community
I will follow PyDelhi's Code of Conduct and maintain a welcoming, inclusive environment throughout my participation
I understand that PyDelhi meetups are community-centric events focused on learning, knowledge sharing, and networking, and I will respect this ethos by not using this platform for self-promotion or hiring pitches during my presentation, unless explicitly invited to do so by means of a sponsorship or similar arrangement
If the talk is recorded by the PyDelhi team, I grant permission to release the video on PyDelhi's YouTube channel under the CC-BY-4.0 license, or a different license of my choosing if I am specifying it in my proposal or with the materials I share
Additional comments
I actively speak on topics around AI infrastructure, GenAI systems, fraud intelligence, and graph-based analytics at developer and fintech-focused events. Recently, I presented at Nodes24 on real-time fraud detection using graph intelligence and event-driven architectures.
My sessions are usually highly practical and architecture-focused, with live demos, production learnings, and real-world deployment insights rather than theoretical overviews.
Resources and references
(LangSmith Evaluation & Observability Docs) https://docs.smith.langchain.com/?utm_source=chatgpt.com
(DeepEval LLM Evaluation Framework) https://deepeval.com/?utm_source=chatgpt.com
(Weights & Biases Weave for LLM Evaluation and Tracing) https://wandb.ai/site/weave?utm_source=chatgpt.com
Link to slides/demos (if available)
No response
Twitter/X handle (optional)
@akashchandra049
LinkedIn profile (optional)
https://www.linkedin.com/in/akashchandra049/
Profile picture URL (optional)
https://drive.google.com/file/d/19QxzZI-q-QSOBAx0b_XB08HKvO7vM0m8/view?usp=drive_link