
Commit d150efe

Author: Raffaele Crafa (committed)

refactor: modified rag section of the latex report

1 parent 1a6c68b commit d150efe

File tree

1 file changed: +23 -15 lines changed

latex/sections/05_rag.tex

Lines changed: 23 additions & 15 deletions
@@ -3,42 +3,50 @@ \section{Retrieval-Augmented Generation for Financial Data}
 
 \subsection{Motivation for RAG in Financial Advisory}
 
-Retrieval-augmented generation enhances LLM capabilities by grounding responses in retrieved factual information. In financial advisory, this is crucial for several reasons. Accuracy is paramount---using real historical data rather than hallucinated figures ensures that recommendations are based on verifiable information. Relevance is equally important, as finding assets that match user criteria from available options requires domain knowledge that the system must possess. Compliance considerations mean providing factual, verifiable information that can withstand scrutiny from regulators. Interpretability allows showing users which data informed recommendations, building trust in the advice-giving process. The system retrieves relevant ETFs and stocks based on user preferences and financial profiles, providing 10-year historical return data for each recommendation.
+Retrieval-augmented generation enhances LLM capabilities by grounding responses in retrieved factual information.
+
+This approach overcomes the intrinsic limitation of static knowledge in large language models, whose internal representations are fixed at training time and cannot reflect newly available or frequently changing information. This is particularly useful since pre-trained models were used for this project. By dynamically retrieving relevant documents at inference time, RAG enables the model to operate on up-to-date, domain-specific, and potentially proprietary knowledge without requiring modifications to the underlying parameters.
+
+Moreover, grounding generation in retrieved factual sources significantly mitigates the risk of hallucinations, as the model is encouraged to base its responses on explicit evidence rather than relying solely on learned statistical patterns. This characteristic is particularly important in scenarios where factual accuracy, reliability, and consistency are critical, such as in the financial sector. Using real historical data rather than hallucinated figures ensures that recommendations are based on verifiable information. Relevance is equally important, as finding assets that match user criteria from available options requires domain knowledge that the system must possess. Compliance considerations mean providing factual, verifiable information that can withstand scrutiny from regulators. Interpretability allows showing users which data informed recommendations, building trust in the advice-giving process. The system retrieves relevant ETFs and stocks based on user preferences and financial profiles, providing 10-year historical return data for each recommendation.
+
+Finally, RAG-based techniques are highly versatile, since the knowledge base over which retrieval is performed can be replaced when necessary, and they are far more efficient and scalable than retraining, as they avoid the substantial cost of updating the model's parameters.
 
 \subsection{Asset Data Organization}
 
-Financial assets are organized in a structured directory hierarchy. The dataset directory contains subdirectories for different asset types. ETFs are further organized by asset class, with bonds and stocks as primary categories. Within each category, assets are organized by sector or type, such as corporate bonds, government bonds, tech sector ETFs, healthcare sector ETFs, and diversified index ETFs. Each asset includes asset name and ticker symbol, asset class and sector classification, risk level assessment, 10-year historical return data, expense ratios and fees, and diversification characteristics. This structured organization enables efficient retrieval and semantic search.
+Financial assets are organized in a structured directory hierarchy. The dataset directory contains subdirectories for different asset types. ETFs are further organized by asset class, with bonds and stocks as primary categories. Within each category, assets are organized by sector or type, such as corporate bonds, government bonds, tech sector ETFs, healthcare sector ETFs, and diversified index ETFs. Each asset includes asset name and ticker symbol, asset class and sector classification, risk level assessment, 10-year historical return data, expense ratios and fees, and diversification characteristics. This structured organization enables efficient retrieval and more accurate semantic search.
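
For illustration, a directory layout consistent with this description might look as follows; the exact folder names are assumptions, as the diff does not list them:

dataset/
├── etfs/
│   ├── bonds/
│   │   ├── corporate/
│   │   └── government/
│   └── stocks/
│       ├── tech/
│       ├── healthcare/
│       └── diversified_index/
└── stocks/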
 
 \subsection{RAG Architecture}
 
-The RAGAssetRetriever implements the retrieval pipeline. The system uses embeddings to enable semantic search. The embedding model is FastEmbedder, which is lightweight and runs effectively on CPU or GPU. Embeddings have 384 dimensions, a good balance between expressiveness and computational efficiency. Distance metric is cosine similarity for relevance ranking, which is standard for embedding-based retrieval. Documents are embedded at initialization, creating a persistent vector database in Qdrant.
+The RAG architecture is centered around the RAGAssetRetriever class, which implements a streamlined pipeline for document ingestion, embedding generation, and semantic retrieval. Unlike traditional database-heavy approaches, the system uses a high-performance, in-memory vector indexing strategy based on NumPy, with Pickle used for persistence, optimized for the specific scale of the ETF dataset.
+
+The pipeline begins with the ingest\_pdfs method, which recursively scans the data directory and extracts text from PDF documents using the pypdf library. To preserve context while maintaining granularity, the extracted text is processed into overlapping chunks of 800 characters with a 120-character overlap. This overlap reduces the risk that semantically related content is split across chunk boundaries and lost during retrieval.
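
A minimal Python sketch of this ingestion step, assuming hypothetical function names (only the pypdf dependency and the 800/120 chunking parameters come from the text):

# Hypothetical sketch of the extraction and chunking step; names assumed.
from pypdf import PdfReader

CHUNK_SIZE = 800     # characters per chunk, as stated above
CHUNK_OVERLAP = 120  # characters shared by consecutive chunks

def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]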
 
-When a user or agent needs to find relevant assets, the query is transformed into natural language search. For example, a query might be ``Conservative low-volatility bonds for risk-averse investors''. The query is embedded using the same embedding model as the documents. Vector similarity search then retrieves the top-k relevant documents. Retrieved documents are formatted and provided to the LLM context.
+For the embedding stage, the system employs the SentenceTransformer framework, specifically the all-roberta-large-v1 model. This model transforms text chunks into high-dimensional dense vectors (1024 dimensions), capturing deep semantic relationships that simple keyword searches would miss. To optimize performance and reduce computational overhead, the system implements a caching mechanism: once generated, the embeddings and associated metadata are serialized into an embeddings.pkl file. Subsequent initializations of the agent load this index directly from disk, eliminating the need to re-process the entire dataset.
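
A sketch of this cache-or-build logic under the stated model and cache file; the function name and payload keys are assumptions:

# Hypothetical build-or-load routine for the serialized embedding index.
import os
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer

INDEX_PATH = "embeddings.pkl"  # cache file named in the text

def build_or_load_index(chunks: list[str]) -> tuple[np.ndarray, list[str]]:
    if os.path.exists(INDEX_PATH):
        with open(INDEX_PATH, "rb") as f:
            payload = pickle.load(f)  # reuse cached vectors and text
        return payload["vectors"], payload["chunks"]
    model = SentenceTransformer("all-roberta-large-v1")  # 1024-dim vectors
    vectors = model.encode(chunks, convert_to_numpy=True)
    with open(INDEX_PATH, "wb") as f:
        pickle.dump({"vectors": vectors, "chunks": chunks}, f)
    return vectors, chunks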
 
-Key configurable parameters control the retrieval process. The number of documents to retrieve (TOP\_K\_DOCUMENTS) is typically set to 10, balancing comprehensiveness with context window constraints. The similarity threshold for filtering (SIMILARITY\_THRESHOLD) is set to 0.5 to avoid including irrelevant results. Document chunks use a chunk size of 512 tokens with 128 tokens of overlap to maintain context continuity.
+The retrieval logic follows a top-$k$ similarity approach. When a query is received, it is encoded into the same vector space as the document chunks. The system then computes the cosine similarity between the query vector and the entire embedding matrix using scikit-learn's optimized routines. The $k$ most relevant segments (with a default of $k=15$) are returned to the FinancialAdvisorAgent, ranked by their similarity score. This architecture ensures that the LLM is provided with the most contextually relevant financial data, grounding its recommendations in factual evidence.
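
The scoring step could look like the following sketch, presumably via sklearn.metrics.pairwise.cosine_similarity; the signature is an assumption:

# Hypothetical top-k retrieval over the in-memory embedding matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query: str, model, vectors: np.ndarray, chunks: list[str], k: int = 15):
    """Return the k chunks most similar to the query, with their scores."""
    q = model.encode([query], convert_to_numpy=True)  # same vector space as chunks
    scores = cosine_similarity(q, vectors)[0]         # one score per chunk
    top = np.argsort(scores)[::-1][:k]                # best matches first
    return [(chunks[i], float(scores[i])) for i in top]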
 
 \subsection{Integration with Agent}
 
-The FinancialAdvisorAgent leverages RAG through several methods. RAG query building occurs before retrieval, with the agent constructing optimized search queries. For example, the agent builds a query with risk tolerance as conservative, asset class as bonds, sector as government, and constraints such as maximum expense ratio of 0.25 percent and minimum five-year track record. This ensures retrieved assets match user constraints.
+The FinancialAdvisorAgent leverages the RAG system through a structured integration layer designed to ground portfolio recommendations in real-world data. The process begins with the rag\_query\_builder, which transforms the user's structured FinancialProfile into a natural language search query. This query is designed to capture the essence of the user's risk tolerance and financial goals, and is then passed to the RAGAssetRetriever.
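
As a rough illustration, the query construction might resemble the following; the FinancialProfile fields shown here are assumptions, not the project's actual schema:

# Hypothetical profile-to-query translation; field names are illustrative.
from dataclasses import dataclass

@dataclass
class FinancialProfile:
    risk_tolerance: str         # e.g. "conservative"
    goal: str                   # e.g. "capital preservation"
    preferred_asset_class: str  # e.g. "government bonds"

def rag_query_builder(profile: FinancialProfile) -> str:
    """Turn the structured profile into a natural-language search query."""
    return (f"{profile.risk_tolerance} {profile.preferred_asset_class} "
            f"suited to {profile.goal}")

# e.g. "conservative government bonds suited to capital preservation"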
 
-The portfolio generation workflow follows a structured process. First, constraints are extracted from the FinancialProfile. Next, a RAG query is built with appropriate filters. Then relevant assets are retrieved from the vector database. The LLM generates allocation percentages based on the retrieved assets. Portfolio allocations are validated to sum to 100 percent. Finally, portfolio metrics are computed including expected return and volatility.
+The portfolio generation workflow follows a rigorous sequential process. First, the agent extracts the client's profile from the conversation history using structured output validation. Next, it invokes the retriever to identify the top 15 most relevant asset documents. To stay within the LLM context window, the agent then extracts the most significant metadata and descriptions from these documents, creating a rich ``asset context''.
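
Put together, the workflow reduces to a few sequential calls; every helper name below is hypothetical:

# End-to-end sketch of the portfolio generation workflow described above.
def summarize_metadata(docs) -> str:
    """Condense retrieved (chunk, score) pairs into a compact asset context."""
    return "\n".join(f"[{score:.2f}] {text[:200]}" for text, score in docs)

def generate_portfolio(agent, retriever, history):
    profile = agent.extract_profile(history)  # structured output validation
    query = rag_query_builder(profile)        # natural-language search query
    docs = retriever.retrieve(query, k=15)    # top 15 relevant documents
    context = summarize_metadata(docs)        # fits the LLM context window
    return agent.recommend(profile, context)  # structured Portfolio response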
 
-Retrieved assets are formatted with rich context. Historical performance data spans ten years. Sector and asset class information helps users understand diversification. Risk metrics provide quantitative risk assessment. Diversification characteristics show how assets correlate with each other.
+Finally, the agent uses a structured response mechanism to generate a Portfolio recommendation. By providing both the client's financial profile and the retrieved asset context to the LLM, the system ensures that the suggested allocation percentages are informed by the actual historical data found in the dataset. This architectural choice minimizes hallucinations and ensures that the risk level of the generated portfolio is strictly aligned with the user's extracted profile. Analysis of specific assets can also be performed through direct tool invocation, providing detailed historical return metrics and quantitative analysis to the end user.
 
 \subsection{Data Processing Pipeline}
 
-The ETF and stock data loading process begins by reading asset description files from the dataset directory. Metadata is parsed including ticker, asset class, risk level, and returns. Documents are chunked if necessary for large prospectuses. Each chunk is embedded independently. Finally, chunks are stored in Qdrant with metadata for later filtering.
+The data processing pipeline transforms unstructured PDF prospectuses into a searchable knowledge base. The process begins by scanning the dataset directory for PDF files using a recursive globbing strategy. Text extraction is performed page by page, and the resulting strings are partitioned into chunks of 800 characters with a 120-character overlap to maintain semantic continuity.
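
The recursive scan can be expressed in a couple of lines; the directory name is an assumption:

# Hypothetical recursive PDF discovery over the dataset directory.
from pathlib import Path

def find_pdfs(root: str = "dataset") -> list[Path]:
    """Collect every PDF under the dataset directory, recursively."""
    return sorted(Path(root).rglob("*.pdf"))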
 
-10-year historical data provides annual returns for each year in the period. Volatility statistics quantify return variability. Correlation matrices enable diversification analysis. Drawdown periods and recovery times help users understand worst-case scenarios. Sharpe ratio and risk-adjusted returns provide a composite risk metric. This data is computed during initialization and cached for performance.
+Each chunk is then transformed into a dense vector representation using the SentenceTransformer model. Unlike generic text processing, this pipeline focuses on capturing the specific terminology found in financial asset descriptions. The system computes these embeddings during the initial setup and stores the entire payload, consisting of the original text, the metadata (such as the source file name and chunk ID), and the numerical vectors, into a serialized Pickle file for efficient subsequent loading.
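
The serialized payload might therefore take a shape like the one below, reusing the chunks and vectors names from the earlier sketches; the key names and sample file name are assumptions:

# Hypothetical structure of the pickled payload described above.
payload = {
    "chunks": chunks,    # original text segments
    "metadata": [        # one record per chunk
        {"source": "some_etf_prospectus.pdf", "chunk_id": 0},
        # ...
    ],
    "vectors": vectors,  # NumPy matrix, one row per chunk
}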
 
 \subsection{Performance Optimization}
 
-Qdrant automatically indexes embeddings for fast similarity search using the HNSW (Hierarchical Navigable Small World) algorithm, which supports approximate nearest neighbor search with typical query latency of 10-50ms for top-10 results.
+To ensure a responsive user experience, the system implements several optimization strategies at the retrieval level. Instead of relying on external database calls, the RAGAssetRetriever loads the entire embedding index into memory as a NumPy array. This allows for near-instantaneous similarity computations using optimized matrix operations.
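
The same scoring can be done with a single NumPy matrix-vector product, which is why the in-memory index is fast; this is a sketch under assumed names, not the project's code:

# Cosine scores for every chunk in one vectorized pass.
import numpy as np

def scores_numpy(q: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """q: (d,) query vector; vectors: (n, d) embedding matrix."""
    qn = q / np.linalg.norm(q)
    vn = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return vn @ qn  # (n,) cosine similarities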
 
-Retrieved portfolios and their analysis are cached to reduce redundant LLM calls and expensive computations. Caching significantly improves user experience by eliminating duplicate calculations for the same portfolio.
+The use of a global cache for the embedding model ensures that the SentenceTransformer is loaded into memory only once, significantly reducing the latency of subsequent user queries. Furthermore, by using the all-roberta-large-v1 model, the system balances high-dimensional semantic accuracy (1024 dimensions) against computational speed, even on hardware without dedicated GPU acceleration.
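
The global model cache presumably amounts to a module-level singleton along these lines; the names are assumptions:

# Hypothetical module-level cache: the model is loaded once and reused.
from sentence_transformers import SentenceTransformer

_MODEL = None

def get_model() -> SentenceTransformer:
    global _MODEL
    if _MODEL is None:  # first call pays the load cost; later calls are free
        _MODEL = SentenceTransformer("all-roberta-large-v1")
    return _MODEL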
 
-\subsection{Evaluation of RAG Quality}
+%\subsection{Evaluation of RAG Quality}
 
-Retrieved documents are evaluated for relevance using several metrics. Mean Reciprocal Rank measures the position of the first relevant document. Normalized Discounted Cumulative Gain quantifies ranking quality. Precision@K measures the fraction of top-K results that are relevant.
+%The quality of the RAG output is monitored through real-time similarity scoring. For every retrieved document, the system computes a confidence score based on the cosine distance between the query and the asset chunk. These scores are used to rank the top 15 results before they are passed to the FinancialAdvisorAgent.
 
-Portfolio recommendations are assessed through alignment with user risk profile, examining whether recommendations match stated preferences. Historical performance of recommended assets is analyzed to verify backward compatibility. Diversification ratios ensure adequate portfolio diversification. Expected return consistency with asset class validates that recommendations align with theoretical expectations.
+%This grounding mechanism ensures that portfolio recommendations are not based on the internal biases of the LLM but are supported by verifiable text extracted directly from official ETF documentation. To further enhance reliability, the agent presents the specific source and relevance score of the retrieved information, allowing for a transparent audit trail of the advice-giving process.
