Commit f7a520f

Merge pull request #14 from NVIDIA/dev/shubhadeepd/0.2.0-stage-release
Push changes for 0.2.0 release
2 parents: 60fca22 + c871c49 · commit f7a520f

File tree

104 files changed: +3812, -5668 lines


CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+
+## [0.2.0] - 2023-12-15
+
+### Added
+
+- Support for using [Nvidia AI Foundational LLM models](./docs/rag/aiplayground.md#using-nvdia-cloud-based-llms)
+- Support for using [Nvidia AI Foundational embedding models](./docs/rag/aiplayground.md#using-nvidia-cloud-based-embedding-models)
+- Support for [deploying and using quantized LLM models](./docs/rag/llm_inference_server.md#quantized-llama2-model-deployment)
+- Support for [evaluating RAG pipeline](./evaluation/README.md)
+
+### Changed
+
+- Repository restructuring to allow better open source contributions
+- [Upgraded dependencies](./RetrievalAugmentedGeneration/Dockerfile) for chain server container
+- [Upgraded NeMo Inference Framework container version](./RetrievalAugmentedGeneration/llm-inference-server/Dockerfile); no separate sign-up is needed now for access.
+- Main [README](./README.md) now provides more details.
+- Documentation improvements.
+- Better error handling and reporting mechanism for corner cases.
+- Renamed `triton-inference-server` container and service to `llm-inference-server`
+
+### Fixed
+
+- [Fixed issue #13](https://github.com/NVIDIA/GenerativeAIExamples/issues/13): pipeline not able to answer questions unrelated to the knowledge base
+- [Fixed issue #12](https://github.com/NVIDIA/GenerativeAIExamples/issues/12): type checking while uploading PDF files

LICENSE.md

Lines changed: 1 addition & 1 deletion
@@ -198,4 +198,4 @@
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
-limitations under the License.
+limitations under the License.

README.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ State-of-the-art Generative AI examples that are easy to deploy, test, and exten
 ## NVIDIA NGC
 Generative AI Examples uses resources from the [NVIDIA NGC AI Development Catalog](https://ngc.nvidia.com).
 
-Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to access:
+Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to access:
 
 - The GPU-optimized NVIDIA containers, models, scripts, and tools used in these examples
 - The latest NVIDIA upstream contributions to the respective programming frameworks
@@ -16,7 +16,7 @@ Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to acc
 
 ## Retrieval Augmented Generation (RAG)
 
-A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to a Large Language Model. RAG lets users use an LLM to chat with their own data.
+A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to a Large Language Model. RAG lets users use an LLM to chat with their own data.
 
 | Name | Description | LLM | Framework | Multi-GPU | Multi-node | Embedding | TRT-LLM | Triton | VectorDB | K8s |
 |---------------|-----------------------|------------|-------------------------|-----------|------------|-------------|---------|--------|----------|-----|

RetrievalAugmentedGeneration/Dockerfile

Lines changed: 6 additions & 3 deletions
@@ -3,9 +3,12 @@ ARG BASE_IMAGE_TAG=23.08-py3
 
 
 FROM ${BASE_IMAGE_URL}:${BASE_IMAGE_TAG}
-COPY chain_server /opt/chain_server
-RUN --mount=type=bind,source=requirements.txt,target=/opt/requirements.txt \
+COPY RetrievalAugmentedGeneration/__init__.py /opt/RetrievalAugmentedGeneration/
+COPY RetrievalAugmentedGeneration/common /opt/RetrievalAugmentedGeneration/common
+COPY RetrievalAugmentedGeneration/examples /opt/RetrievalAugmentedGeneration/examples
+COPY integrations /opt/integrations
+RUN --mount=type=bind,source=RetrievalAugmentedGeneration/requirements.txt,target=/opt/requirements.txt \
     python3 -m pip install --no-cache-dir -r /opt/requirements.txt
 
 WORKDIR /opt
-ENTRYPOINT ["uvicorn", "chain_server.server:app"]
+ENTRYPOINT ["uvicorn", "RetrievalAugmentedGeneration.common.server:app"]
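
Because the new `COPY` directives reference paths relative to the repository root (`RetrievalAugmentedGeneration/...`, `integrations/`), the image now has to be built with the repo root as the build context. A minimal standalone sketch under that assumption; the `chain-server` tag is illustrative only, and the documented path remains `docker compose ... build`, which also supplies the base-image build args:

```
# Build the chain server image from the repository root so the new COPY
# paths in RetrievalAugmentedGeneration/Dockerfile resolve correctly.
# DOCKER_BUILDKIT is needed for the RUN --mount=type=bind step; you may
# also need to pass the BASE_IMAGE_URL/BASE_IMAGE_TAG build args that the
# compose file normally provides.
DOCKER_BUILDKIT=1 docker build \
  -f RetrievalAugmentedGeneration/Dockerfile \
  -t chain-server:0.2.0 \
  .
```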

RetrievalAugmentedGeneration/README.md

Lines changed: 82 additions & 37 deletions
@@ -4,16 +4,16 @@
 **Project Goal**: A reference Retrieval Augmented Generation(RAG) workflow for a chatbot to question answer off public press releases & tech blogs. It performs document ingestion & Q&A interface using open source models deployed on any cloud or customer datacenter, leverages the power of GPU-accelerated Milvus for efficient vector storage and retrieval, along with TRT-LLM, to achieve lightning-fast inference speeds with custom LangChain LLM wrapper.
 
 ## Components
-- **LLM**: [Llama2](https://ai.meta.com/llama/) - 7b, 13b, and 70b all supported. 13b and 70b generate good responses.
+- **LLM**: [Llama2](https://ai.meta.com/llama/) - 7b-chat, 13b-chat, and 70b-chat all supported. 13b-chat and 70b-chat generate good responses.
 - **LLM Backend**: Nemo framework inference container with Triton inference server & TRT-LLM backend for speed.
 - **Vector DB**: Milvus because it's GPU accelerated.
 - **Embedding Model**: [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) since it is one of the best embedding model available at the moment.
 - **Framework(s)**: LangChain and LlamaIndex.
 
-This reference workflow uses a variety of components and services to customize and deploy the RAG based chatbot. The following diagram illustrates how they work together. Refer to the [detailed architecture guide](./docs/architecture.md) to understand more about these components and how they are tied together.
+This reference workflow uses a variety of components and services to customize and deploy the RAG based chatbot. The following diagram illustrates how they work together. Refer to the [detailed architecture guide](../docs/rag/architecture.md) to understand more about these components and how they are tied together.
 
 
-![Diagram](./../RetrievalAugmentedGeneration/images/image3.jpg)
+![Diagram](../docs/rag/images/image3.jpg)
 
 *Note:*
 We've used [Llama2](https://ai.meta.com/llama/) and [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) models as example defaults in this workflow, you should ensure that both the LLM and embedding model are appropriate for your use case, and validate that they are secure and have not been tampered with prior to use.
@@ -30,8 +30,6 @@ Before proceeding with this guide, make sure you meet the following prerequisite
 - If you are running multiple GPUs they must all be set to the same mode (ie Compute vs. Display). You can check compute mode for each GPU using
   ``nvidia-smi -q -d compute``
 
-- You should have access to [NeMo Framework](https://developer.nvidia.com/nemo-framework) to download the container used for deploying the Large Language Model. To access nemo-framework inference container please register at https://developer.nvidia.com/nemo-framework. After submitting a form you will be automatically accepted.
-
 ### Setup the following
 
 - Docker and Docker-Compose are essential. Please follow the [installation instructions](https://docs.docker.com/engine/install/ubuntu/).
@@ -53,28 +51,36 @@ Before proceeding with this guide, make sure you meet the following prerequisite
     docker login nvcr.io
     ```
 
-- You can download Llama2 Chat Model Weights from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or [HuggingFace](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/).
+- git-lfs
+  - Make sure you have [git-lfs](https://git-lfs.github.com) installed.
+
+- You can download Llama2 Chat Model Weights from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or [HuggingFace](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/). You can skip this step [if you are interested in using cloud based LLM's using Nvidia AI Playground](#using-nvdia-cloud-based-llm).
 
 **Note for checkpoint downloaded using Meta**:
 
-When downloading model weights from Meta, you can follow the instructions up to the point of downloading the models using ``download.sh``. There is no need to deploy the model using the steps mentioned in the repository. We will use Triton to deploy the model.
+- When downloading model weights from Meta, you can follow the instructions up to the point of downloading the models using ``download.sh``. There is no need to deploy the model using the steps mentioned in the repository. We will use Triton to deploy the model.
 
-Meta will download two additional files, namely tokenizer.model and tokenizer_checklist.chk, outside of the model checkpoint directory. Ensure that you copy these files into the same directory as the model checkpoint directory.
+- Meta will download two additional files, namely `tokenizer.model` and `tokenizer_checklist.chk`, outside of the model checkpoint directory. Ensure that you copy these files into the same directory as the model checkpoint directory.
 
+**Using Cloud based Nvidia AI Foundational models**:
 
-**Note**:
+- Instead of deploying the models on-prem, if you would like to use LLM models deployed from NVIDIA AI Playground then follow the instructions from [here.](../docs/rag/aiplayground.md)
+
+**Using Quantized models**:
 
-In this workflow, we will be leveraging a Llama2 (13B parameters) chat model, which requires 50 GB of GPU memory. If you prefer to leverage 7B parameter model, this will require 38GB memory. The 70B parameter model initially requires 240GB memory.
-IMPORTANT: For this initial version of the workflow, A100 and H100 GPUs are supported.
+- In this workflow, we will be leveraging a Llama2 (7B parameters) chat model, which requires 38 GB of GPU memory. <br>
+IMPORTANT: For this initial version of the workflow only the 7B chat model is supported on A100 and H100 GPUs.
+
+- We also support quantization of the Llama2 model using AWQ, which changes model precision to INT4, thereby reducing memory usage. Check out the steps [here](../docs/rag/llm_inference_server.md) to enable quantization.
 
 
 ## Install Guide
-### Step 1: Move to deploy directory
-    cd deploy
 
-### Step 2: Set Environment Variables
+NVIDIA TensorRT LLM provides state of the art performance for running LLM inference. Follow the steps below from the root of this project to set up the RAG example with TensorRT LLM and Triton deployed locally.
+
+### Step 1: Set Environment Variables
 
-Modify ``compose.env`` in the ``deploy`` directory to set your environment variables. The following variables are required.
+Modify ``compose.env`` in the ``deploy/compose`` directory to set your environment variables. The following variables are required as shown below for using a llama based model.
 
     # full path to the local copy of the model weights
     export MODEL_DIRECTORY="$HOME/src/Llama-2-13b-chat-hf"
@@ -89,46 +95,64 @@ Modify ``compose.env`` in the ``deploy`` directory to set your environment varia
     APP_CONFIG_FILE=/dev/null
 
 
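Before building anything, it can help to confirm that the environment file actually exports what you expect. A minimal sanity check, assuming `compose.env` sits at `deploy/compose/compose.env` as the updated instructions state:

```
# Source the environment file and verify the model path it exports;
# MODEL_DIRECTORY should point at the downloaded Llama2 chat checkpoint.
source deploy/compose/compose.env
echo "MODEL_DIRECTORY=${MODEL_DIRECTORY}"
ls "${MODEL_DIRECTORY}" | head    # should list checkpoint and tokenizer files
```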
-### Step 3: Build and Start Containers
+### Step 2: Build and Start Containers
 - Pull lfs files. This will pull large files from repository.
     ```
     git lfs pull
     ```
 - Run the following command to build containers.
     ```
-    source compose.env; docker compose build
+    source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose.yaml build
     ```
 
 - Run the following command to start containers.
     ```
-    source compose.env; docker compose up -d
+    source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose.yaml up -d
     ```
 > ⚠️ **NOTE**: It will take a few minutes for the containers to come up and may take up to 5 minutes for the Triton server to be ready. Adding the `-d` flag will have the services run in the background. ⚠️
 
 - Run ``docker ps -a``. When the containers are ready the output should look similar to the image below.
-  ![Docker Output](./images/docker-output.png "Docker Output Image")
+  ![Docker Output](../docs/rag/images/docker-output.png "Docker Output Image")
+
+**Note**:
+- Default prompts are optimized for the llama chat model; if you're using a completion model then prompts need to be finetuned accordingly.
+
+#### Multi GPU deployment
+
+By default the LLM model will be deployed using all available GPUs of the system. To use specific GPUs you can provide the GPU ID(s) in the [docker compose file](../deploy/compose/docker-compose.yaml) under the `llm` service's `deploy` section:
+
+
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              # count: ${INFERENCE_GPU_COUNT:-all} # Comment this out
+              device_ids: ["0"] # Provide the device id of GPU. It can be found using `nvidia-smi` command
+              capabilities: [gpu]
+
 
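After `docker compose ... up -d`, a quick way to confirm the services are healthy is to poll container status and Triton's standard readiness endpoint. The port below is an assumption (Triton's default HTTP port 8000, exposed by the renamed `llm-inference-server` service); adjust it to whatever `deploy/compose/docker-compose.yaml` actually publishes:

```
# Show the status of every container started by the compose file.
docker ps --format '{{.Names}}: {{.Status}}'

# Optional readiness probe against Triton's standard health endpoint;
# a 200 response means the model is loaded and ready to serve.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```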
-### Step 4: Experiment with RAG in JupyterLab
+### Step 3: Experiment with RAG in JupyterLab
 
 This AI Workflow includes Jupyter notebooks which allow you to experiment with RAG.
 
 - Using a web browser, type in the following URL to open Jupyter
 
     ``http://host-ip:8888``
 
-- Locate the [LLM Streaming Client notebook](notebooks/01-llm-streaming-client.ipynb) which demonstrates how to stream responses from the LLM.
+- Locate the [LLM Streaming Client notebook](../notebooks/01-llm-streaming-client.ipynb) which demonstrates how to stream responses from the LLM.
 
 - Proceed with the next 4 notebooks:
 
-  - [Document Question-Answering with LangChain](notebooks/02_langchain_simple.ipynb)
+  - [Document Question-Answering with LangChain](../notebooks/02_langchain_simple.ipynb)
 
-  - [Document Question-Answering with LlamaIndex](notebooks/03_llama_index_simple.ipynb)
+  - [Document Question-Answering with LlamaIndex](../notebooks/03_llama_index_simple.ipynb)
 
-  - [Advanced Document Question-Answering with LlamaIndex](notebooks/04_llamaindex_hier_node_parser.ipynb)
+  - [Advanced Document Question-Answering with LlamaIndex](../notebooks/04_llamaindex_hier_node_parser.ipynb)
 
-  - [Interact with REST FastAPI Server](notebooks/05_dataloader.ipynb)
+  - [Interact with REST FastAPI Server](../notebooks/05_dataloader.ipynb)
 
-### Step 5: Run the Sample Web Application
+### Step 4: Run the Sample Web Application
 A sample chatbot web application is provided in the workflow. Requests to the chat system are wrapped in FastAPI calls.
 
 - Open the web application at ``http://host-ip:8090``.
@@ -139,22 +163,43 @@ A sample chatbot web application is provided in the workflow. Requests to the ch
 
 - To use a knowledge base:
 
-  - Click the **Knowledge Base** tab and upload the file [dataset.zip](./RetrievalAugmentedGeneration/notebook/dataset.zip).
+  - Click the **Knowledge Base** tab and upload the file [dataset.zip](../notebooks/dataset.zip).
 
 - Return to **Converse** tab and check **[X] Use knowledge base**.
 
 - Retype the question: "How many cores are on the Nvidia Grace superchip?"
 
+# RAG Evaluation
+
+## Prerequisites
+Make sure the corps comm dataset is loaded into the vector database using the [Dataloader](../notebooks/05_dataloader.ipynb) notebook as part of step-3 of setup.
+
+This workflow includes jupyter notebooks which allow you to perform evaluation of your RAG application on the sample dataset, and they can be extended to other datasets as well.
+Set up the workflow by building and starting the containers by following the steps [outlined here using docker compose.](#step-2-build-and-start-containers)
+
+After setting up the workflow follow these steps:
+
+- Using a web browser, type in the following URL to open Jupyter Labs
+
+    ``http://host-ip:8889``
+
+- Locate the [synthetic data generation](../evaluation/01_synthetic_data_generation.ipynb) notebook, which demonstrates how to generate synthetic data of question answer pairs for evaluation
+
+- Proceed with the next 3 notebooks:
+
+  - [Filling generated answers](../evaluation/02_filling_RAG_outputs_for_Evaluation.ipynb)
+
+  - [Ragas evaluation with NVIDIA AI playground](../evaluation/03_eval_ragas.ipynb)
+
+  - [LLM as a Judge evaluation with NVIDIA AI playground](../evaluation/04_Human_Like_RAG_Evaluation-AIP.ipynb)
+
 
 # Learn More
-1. [Architecture Guide](./docs/architecture.md): Detailed explanation of different components and how they are tried up together.
+1. [Architecture Guide](../docs/rag/architecture.md): Detailed explanation of different components and how they are tied together.
 2. Component Guides: Component specific features are enlisted in these sections.
-   1. [Chain Server](./docs/chat_server.md)
-   2. [NeMo Framework Inference Server](./docs/llm_inference_server.md)
-   3. [Jupyter Server](./docs/jupyter_server.md)
-   4. [Sample frontend](./docs/frontend.md)
-3. [Configuration Guide](./docs/configuration.md): This guide covers different configurations available for this workflow.
-4. [Support Matrix](./docs/support_matrix.md): This covers GPU, CPU, Memory and Storage requirements for deploying this workflow.
-
-# Known Issues
-- Uploading a file with size more than 10 MB may fail due to preset timeouts during the ingestion process.
+   1. [Chain Server](../docs/rag/chat_server.md)
+   2. [NeMo Framework Inference Server](../docs/rag/llm_inference_server.md)
+   3. [Jupyter Server](../docs/rag/jupyter_server.md)
+   4. [Sample frontend](../docs/rag/frontend.md)
+3. [Configuration Guide](../docs/rag/configuration.md): This guide covers different configurations available for this workflow.
+4. [Support Matrix](../docs/rag/support_matrix.md): This covers GPU, CPU, Memory and Storage requirements for deploying this workflow.

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
