All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.2.0] - 2023-12-15
### Added
- Support for using [Nvidia AI Foundational LLM models](./docs/rag/aiplayground.md#using-nvdia-cloud-based-llms)
- Support for using [Nvidia AI Foundational embedding models](./docs/rag/aiplayground.md#using-nvidia-cloud-based-embedding-models)
- Support for [deploying and using quantized LLM models](./docs/rag/llm_inference_server.md#quantized-llama2-model-deployment)
- Support for [evaluating the RAG pipeline](./evaluation/README.md)
### Changed
- Repository restructuring to allow better open source contributions
- [Upgraded dependencies](./RetrievalAugmentedGeneration/Dockerfile) for the chain server container
- [Upgraded NeMo Inference Framework container version](./RetrievalAugmentedGeneration/llm-inference-server/Dockerfile); no separate sign-up is needed now for access.
- Main [README](./README.md) now provides more details.
- Documentation improvements.
- Better error handling and reporting mechanism for corner cases.
- Renamed `triton-inference-server` container and service to `llm-inference-server`
### Fixed
- [Fixed issue #13](https://github.com/NVIDIA/GenerativeAIExamples/issues/13): pipeline unable to answer questions unrelated to the knowledge base
- [Fixed issue #12](https://github.com/NVIDIA/GenerativeAIExamples/issues/12): type checking while uploading PDF files

README.md

State-of-the-art Generative AI examples that are easy to deploy, test, and extend.
## NVIDIA NGC
Generative AI Examples uses resources from the [NVIDIA NGC AI Development Catalog](https://ngc.nvidia.com).
Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to access:
- The GPU-optimized NVIDIA containers, models, scripts, and tools used in these examples
- The latest NVIDIA upstream contributions to the respective programming frameworks
## Retrieval Augmented Generation (RAG)
A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to a Large Language Model. RAG lets users use an LLM to chat with their own data.

RetrievalAugmentedGeneration/README.md

**Project Goal**: A reference Retrieval Augmented Generation (RAG) workflow for a chatbot that answers questions about public press releases and tech blogs. It performs document ingestion and provides a Q&A interface using open-source models deployed on any cloud or customer datacenter, and leverages GPU-accelerated Milvus for efficient vector storage and retrieval, along with TRT-LLM and a custom LangChain LLM wrapper, to achieve fast inference speeds.
## Components
- **LLM**: [Llama2](https://ai.meta.com/llama/) - 7b-chat, 13b-chat, and 70b-chat all supported. 13b-chat and 70b-chat generate good responses.
- **LLM Backend**: NeMo framework inference container with Triton Inference Server and TRT-LLM backend for speed.
- **Vector DB**: Milvus, because it is GPU-accelerated.
- **Embedding Model**: [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2), since it is one of the best embedding models available at the moment.
- **Framework(s)**: LangChain and LlamaIndex.

This reference workflow uses a variety of components and services to customize and deploy the RAG-based chatbot. The following diagram illustrates how they work together. Refer to the [detailed architecture guide](../docs/rag/architecture.md) to understand more about these components and how they are tied together.

We've used [Llama2](https://ai.meta.com/llama/) and [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) models as example defaults in this workflow; you should ensure that both the LLM and embedding model are appropriate for your use case, and validate that they are secure and have not been tampered with prior to use.

Before proceeding with this guide, make sure you meet the following prerequisites:

- If you are running multiple GPUs, they must all be set to the same mode (i.e., Compute vs. Display). You can check the compute mode for each GPU using
``nvidia-smi -q -d compute``
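
As a convenience, the sketch below (assuming a reasonably recent ``nvidia-smi``) lists the compute mode of every GPU so you can confirm they all match, and shows how a GPU can be switched back to the default mode:

```
# List the compute mode of every GPU; all rows should report the same mode
nvidia-smi --query-gpu=index,name,compute_mode --format=csv

# Optionally set GPU 0 back to the Default compute mode (requires root)
sudo nvidia-smi -i 0 -c DEFAULT
```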
### Setup the following
- Docker and Docker-Compose are essential. Please follow the [installation instructions](https://docs.docker.com/engine/install/ubuntu/).

```
docker login nvcr.io
```
- git-lfs
  - Make sure you have [git-lfs](https://git-lfs.github.com) installed; a short install and download sketch follows the notes below.
- You can download Llama2 Chat Model Weights from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or [HuggingFace](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/). You can skip this step if you are interested in [using cloud-based LLMs from Nvidia AI Playground](#using-nvdia-cloud-based-llm).

**Note for checkpoints downloaded from Meta**:
- When downloading model weights from Meta, you can follow the instructions up to the point of downloading the models using ``download.sh``. There is no need to deploy the model using the steps mentioned in the repository. We will use Triton to deploy the model.
- Meta will download two additional files, namely `tokenizer.model` and `tokenizer_checklist.chk`, outside of the model checkpoint directory. Ensure that you copy these files into the same directory as the model checkpoint directory.
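
For reference, here is a minimal sketch of installing git-lfs and pulling the weights from HuggingFace. It assumes an Ubuntu/Debian host and that you have already accepted the Llama2 license and authenticated with HuggingFace (for example via ``huggingface-cli login``):

```
# Install and initialize git-lfs (assumes an Ubuntu/Debian host)
sudo apt-get update && sudo apt-get install -y git-lfs
git lfs install

# Clone the Llama2 13b-chat weights (requires an accepted license and HuggingFace authentication)
git clone https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
```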

**Using cloud-based Nvidia AI Foundational models**:
- If you would like to use LLM models deployed from NVIDIA AI Playground instead of deploying them on-prem, follow the instructions [here](../docs/rag/aiplayground.md).

**Using Quantized models**:
- In this workflow, we will be leveraging a Llama2 (7B parameters) chat model, which requires 38 GB of GPU memory. <br>
IMPORTANT: For this initial version of the workflow, only the 7B chat model is supported on A100 and H100 GPUs.
- We also support quantization of the Llama2 model using AWQ, which changes the model precision to INT4, thereby reducing memory usage. Check out the steps [here](../docs/rag/llm_inference_server.md) to enable quantization.
## Install Guide
NVIDIA TensorRT-LLM provides state-of-the-art performance for running LLM inference. Follow the steps below from the root of this project to set up the RAG example with TensorRT-LLM and Triton deployed locally.
### Step 1: Set Environment Variables
Modify ``compose.env`` in the ``deploy/compose`` directory to set your environment variables. The required variables for using a Llama-based model include the full path to the local copy of the model weights.
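
Purely as an illustration, a filled-in entry might look like the sketch below; the variable name ``MODEL_DIRECTORY`` is used as a placeholder here, so use the exact names given in the comments of ``compose.env`` itself:

```
# full path to the local copy of the model weights
# (MODEL_DIRECTORY is a placeholder name -- check compose.env for the real variable)
export MODEL_DIRECTORY="/home/user/models/Llama-2-13b-chat-hf"
```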

### Step 2: Build and Start Containers

```
source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose.yaml up -d
```
> ⚠️ **NOTE**: It will take a few minutes for the containers to come up and may take up to 5 minutes for the Triton server to be ready. Adding the `-d` flag will have the services run in the background. ⚠️
- Run ``docker ps -a``. When the containers are ready, the output should look similar to the image below (a command-line readiness sketch also follows this list).
- Default prompts are optimized for the Llama chat model; if you're using a completion model, the prompts need to be fine-tuned accordingly.
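
For a command-line readiness check, a small sketch is shown below; the ``llm-inference-server`` name is an assumption based on the service naming used in this workflow, so verify the exact container names in the ``docker ps`` output:

```
# Show the name and status of every container started by the compose file
docker ps --format "table {{.Names}}\t{{.Status}}"

# Follow the inference server logs until Triton reports it is ready
docker logs -f llm-inference-server
```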
#### Multi GPU deployment
By default, the LLM model will be deployed using all available GPUs of the system. To use specific GPU(s), you can provide the GPU ID(s) in the [docker compose file](../deploy/compose/docker-compose.yaml) under the `llm` service's `deploy` section:

```
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          # count: ${INFERENCE_GPU_COUNT:-all} # Comment this out
          device_ids: ["0"] # Provide the device id of GPU. It can be found using `nvidia-smi` command
          capabilities: [gpu]
```
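
Alternatively, the commented ``count`` line above references the ``INFERENCE_GPU_COUNT`` variable, so you can limit the GPU count from the shell instead of editing the file, assuming nothing in ``compose.env`` overrides that variable:

```
# Limit the LLM service to a single GPU via the variable referenced in docker-compose.yaml
source deploy/compose/compose.env
export INFERENCE_GPU_COUNT=1
docker compose -f deploy/compose/docker-compose.yaml up -d
```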
### Step 3: Experiment with RAG in JupyterLab
This AI Workflow includes Jupyter notebooks which allow you to experiment with RAG.
- Using a web browser, type in the following URL to open Jupyter
``http://host-ip:8888``
- Locate the [LLM Streaming Client notebook](../notebooks/01-llm-streaming-client.ipynb) which demonstrates how to stream responses from the LLM.
- Proceed with the next 4 notebooks:
  - [Document Question-Answering with LangChain](../notebooks/02_langchain_simple.ipynb)
  - [Document Question-Answering with LlamaIndex](../notebooks/03_llama_index_simple.ipynb)
  - [Advanced Document Question-Answering with LlamaIndex](../notebooks/04_llamaindex_hier_node_parser.ipynb)
  - [Interact with REST FastAPI Server](../notebooks/05_dataloader.ipynb)
### Step 4: Run the Sample Web Application
A sample chatbot web application is provided in the workflow. Requests to the chat system are wrapped in FastAPI calls.
- Open the web application at ``http://host-ip:8090``.
- To use a knowledge base:
  - Click the **Knowledge Base** tab and upload the file [dataset.zip](../notebooks/dataset.zip).
  - Return to the **Converse** tab and check **[X] Use knowledge base**.
  - Retype the question: "How many cores are on the Nvidia Grace superchip?"

# RAG Evaluation
## Prerequisites
Make sure the corps comm dataset is loaded into the vector database using the [Dataloader](../notebooks/05_dataloader.ipynb) notebook as part of Step 3 of the setup.

This workflow includes Jupyter notebooks which allow you to perform evaluation of your RAG application on the sample dataset; they can be extended to other datasets as well.

Set up the workflow by building and starting the containers by following the steps [outlined here using docker compose](#step-2-build-and-start-containers).

After setting up the workflow, follow these steps:
- Using a web browser, type in the following URL to open JupyterLab
``http://host-ip:8889``
- Locate the [synthetic data generation](../evaluation/01_synthetic_data_generation.ipynb) notebook, which demonstrates how to generate synthetic data of question-answer pairs for evaluation.