Self-Learning LLM Systems: Evals, Feedback Loops & Reliability in Production

### Talk title

Self-Learning LLM Systems: Evals, Feedback Loops & Reliability in Production

### Short talk description

Production AI systems fail in ways benchmarks rarely capture. In this talk, we’ll explore how to design self-learning LLM systems that continuously improve using evaluation pipelines, feedback loops and reliability engineering. We’ll go beyond prompts and cover practical techniques for hallucination detection, prompt versioning, regression testing, drift monitoring and automated eval frameworks used in real-world deployments.

The session will also dive into how explicit and implicit user feedback can be transformed into measurable improvements for RAG and LLM applications running in production at scale.
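
As a rough illustration of turning explicit and implicit feedback into a measurable number, the sketch below averages weighted feedback events per prompt version. The event kinds, the weights, and the names `FeedbackEvent` and `score_by_version` are assumptions made for this example, not the pipeline presented in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    prompt_version: str
    kind: str  # e.g. "thumbs_up", "thumbs_down", "copied", "regenerated"


# Hypothetical weights: explicit signals count fully, implicit ones count half.
WEIGHTS = {"thumbs_up": 1.0, "thumbs_down": -1.0, "copied": 0.5, "regenerated": -0.5}


def score_by_version(events: list[FeedbackEvent]) -> dict[str, float]:
    """Average weighted feedback per prompt version; unknown kinds are ignored."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for event in events:
        weight = WEIGHTS.get(event.kind)
        if weight is None:
            continue  # skip signals we have not assigned a weight to
        totals[event.prompt_version] += weight
        counts[event.prompt_version] += 1
    return {version: totals[version] / counts[version] for version in totals}
```

A score like this is only comparable across versions when each version has seen a similar mix and volume of traffic.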

### Long talk description

Most LLM discussions focus on prompts, models, or benchmarks. But once AI systems move into production, a different set of challenges appears: hallucinations, retrieval failures, response drift, inconsistent reasoning, prompt regressions, and unpredictable behavior across versions.

This talk focuses on the engineering side of building reliable, self-learning LLM systems in production. We’ll explore how modern GenAI applications can continuously improve through structured evaluation pipelines, feedback loops, and automated reliability checks rather than relying purely on manual prompt tuning.

The session will cover practical techniques for:

- Designing evaluation frameworks for RAG and LLM systems
- Measuring groundedness, hallucinations, consistency, and response quality
- Building explicit and implicit feedback pipelines
- Prompt versioning and regression testing
- Drift monitoring across model and prompt updates
- Using synthetic datasets and backtesting for evaluations
- Production monitoring strategies for GenAI systems
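
To make the groundedness measurement in the list above more concrete, here is a minimal sketch that scores an answer by how many of its sentences are lexically supported by the retrieved context. The token-overlap heuristic, the `0.5` threshold, and the function names are assumptions for illustration only; production systems more commonly rely on NLI models or LLM-based judges.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a crude stand-in for real normalization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, context: str, threshold: float = 0.5) -> float:
    """Fraction of answer sentences whose tokens mostly appear in the context.

    Hypothetical heuristic: a sentence counts as supported when at least
    `threshold` of its distinct tokens occur in the retrieved context.
    """
    context_tokens = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        tokens = _tokens(sentence)
        if not tokens:
            continue  # a sentence with no tokens is treated as unsupported
        overlap = len(tokens & context_tokens) / len(tokens)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)
```

Lexical overlap misses paraphrases and rewards copied but irrelevant text, so a score like this is best treated as a cheap first-pass filter rather than a final verdict.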

We’ll also discuss architectural patterns used in real-world deployments, including how evaluation systems integrate with retrieval pipelines, vector databases, and agentic workflows. The focus will remain heavily practical, with examples from enterprise AI systems where reliability, auditability, and continuous improvement are critical.

Rather than another “intro to LLMs” session, this talk aims to provide a deeper look into the unseen operational layer required to make AI systems dependable at scale.

### What format do you have in mind?

Workshop (45-60 minutes, hands-on)

### Talk outline / Agenda

(Mostly hands-on, with slides for the overview)
- Why production LLM systems fail differently than demos
- Common failure modes in RAG and GenAI systems
  - Hallucinations
  - Retrieval mismatch
  - Prompt regressions
  - Drift over time
- Designing evaluation frameworks for LLM systems
  - Groundedness checks
  - Consistency scoring
  - Hallucination evaluation
  - Human vs automated evals
- Building self-learning AI pipelines
  - Explicit feedback loops
  - Implicit feedback signals
  - Feedback categorization and prioritization
- Prompt versioning and regression testing
  - Backtesting prompts on historical datasets
  - Measuring improvements before rollout
- Production architecture patterns
  - Eval pipelines with RAG systems
  - Monitoring and observability
  - Continuous improvement workflows
- Real-world deployment learnings and key takeaways
- Q&A
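
The prompt versioning and regression testing item in the agenda can be sketched as a small backtest: run each prompt version over the same historical cases and flag the candidate if its pass rate drops. The `Case` type, the substring-match check, and the `0.02` tolerance are hypothetical choices for this example; `generate` stands in for whatever function calls the model with a given prompt version.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    question: str
    expected: str  # text that a correct answer should contain


def pass_rate(generate: Callable[[str], str], cases: list[Case]) -> float:
    """Share of historical cases whose output contains the expected text."""
    if not cases:
        return 0.0
    passed = sum(c.expected.lower() in generate(c.question).lower() for c in cases)
    return passed / len(cases)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag the candidate when it scores below the baseline by more than `tolerance`."""
    return candidate < baseline - tolerance
```

In use, `pass_rate` would be computed once for the deployed prompt version and once for the candidate, and a rollout would be blocked when `is_regression` returns `True`.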

### Key takeaways

- Understand why production LLM systems fail differently from demos and benchmarks
- Learn how to design practical evaluation frameworks for GenAI and RAG systems
- Gain insights into hallucination detection, groundedness checks, and response reliability measurement
- Learn how feedback loops and prompt versioning enable self-learning AI systems
- Understand strategies for regression testing, drift monitoring, and continuous improvement in production
- Explore real-world architectural patterns for building reliable and scalable AI systems

### What domain would you say your talk falls under?

Data Science and Machine Learning

### Duration (including Q&A)

30-40 min (depending on availability; I would like to include a practical demo)

### Prerequisites and preparation

- Basic understanding of LLMs and RAG architectures
- Familiarity with Python and GenAI workflows
- Some exposure to prompt engineering or AI application development will be helpful
- No prior experience with evaluation frameworks is required, but attendees should be comfortable with intermediate AI/ML concepts and production system discussions

### Resources and references

- LangSmith Evaluation & Observability Docs: https://docs.smith.langchain.com/?utm_source=chatgpt.com
- DeepEval LLM Evaluation Framework: https://deepeval.com/?utm_source=chatgpt.com
- Weights & Biases Weave for LLM Evaluation and Tracing: https://wandb.ai/site/weave?utm_source=chatgpt.com

### Link to slides/demos (if available)

_No response_

### Twitter/X handle (optional)

@akashchandra049

### LinkedIn profile (optional)

https://www.linkedin.com/in/akashchandra049/

### Profile picture URL (optional)

https://drive.google.com/file/d/19QxzZI-q-QSOBAx0b_XB08HKvO7vM0m8/view?usp=drive_link

### Speaker bio

Akash Chandra is the Founder & CEO of InsightAI, a deep-tech AI company building large-scale fraud detection, AML intelligence, and financial risk systems for banks and fintechs. He specializes in production AI architectures spanning GenAI, Graph Intelligence, low-latency systems, and real-time decisioning pipelines.
An IIT Kanpur graduate with over a decade of experience in AI and fintech, Akash has worked on enterprise AI systems involving RAG pipelines, self-learning LLM workflows, graph-based fraud analytics, and multimodal intelligence platforms. He regularly speaks on AI infrastructure, financial intelligence, and production-grade GenAI systems, with recent sessions at Nodes24 and fintech-focused technology forums.

### Availability

Saturdays available

### Accessibility & special requirements

No special accessibility requirements from my side.

For the session, a projector/display with HDMI support, stable internet connectivity and audio output for demos would be helpful.

### Speaker checklist

- [x] I have read and understood the [PyDelhi guidelines](https://github.com/pydelhi/talks/blob/main/guidelines/speaking.md) for submitting proposals and giving talks
- [x] I have read and acknowledged the [PyDelhi accessibility guidelines](https://github.com/pydelhi/talks/blob/main/guidelines/accessibility.md) and will ensure my presentation materials (slides, videos, demos) follow these recommendations
- [x] I will make my talk accessible to all attendees and will proactively ask for any accommodations or special requirements I might need
- [x] I agree to share slides, code snippets, and other materials used during the talk with the community
- [x] I will follow PyDelhi's Code of Conduct and maintain a welcoming, inclusive environment throughout my participation
- [x] I understand that PyDelhi meetups are community-centric events focused on learning, knowledge sharing, and networking, and I will respect this ethos by not using this platform for self-promotion or hiring pitches during my presentation, unless explicitly invited to do so by means of a sponsorship or similar arrangement
- [x] If the talk is recorded by the PyDelhi team, I grant permission to release the video on PyDelhi's YouTube channel under the CC-BY-4.0 license, or a different license of my choosing if I am specifying it in my proposal or with the materials I share

### Additional comments

I actively speak on topics around AI infrastructure, GenAI systems, fraud intelligence, and graph-based analytics at developer and fintech-focused events. Recently, I presented at Nodes24 on real-time fraud detection using graph intelligence and event-driven architectures.

My sessions are usually highly practical and architecture-focused, with live demos, production learnings, and real-world deployment insights rather than theoretical overviews.
