
Commit 1036fe9

Merge pull request #1 from activeloopai/add_feature_completeness
[DEEP-447] Add langchain feature completeness
2 parents 0875561 + cc08f2b commit 1036fe9

File tree

8 files changed: +917 -60 lines changed

README.md (+275)

@@ -13,3 +13,278 @@ pip install -U langchain-deeplake

```python
from langchain_deeplake import DeeplakeVectorStore
```

## How to Use Deep Lake as a Vector Store in LangChain

Deep Lake can be used as a VectorStore in [LangChain](https://github.com/langchain-ai/langchain) for building apps that require filtering and vector search. In this tutorial, we will show how to create a Deep Lake Vector Store in LangChain and use it to build a Q&A app about the [Twitter OSS recommendation algorithm](https://github.com/twitter/the-algorithm).

Install the main libraries:

```bash
pip install --upgrade --quiet langchain-openai langchain-deeplake tiktoken
```

## Downloading and Preprocessing the Data

First, let's import the necessary packages and make sure the Activeloop and OpenAI keys are set in the environment variables `ACTIVELOOP_TOKEN` and `OPENAI_API_KEY`.

```python
import os
import getpass

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_deeplake.vectorstores import DeeplakeVectorStore
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain.chains import RetrievalQA
```

Next, we set up the environment variables:

```python
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("Activeloop token:")
```

Next, let's clone the Twitter OSS recommendation algorithm:

```bash
git clone https://github.com/twitter/the-algorithm
```

Next, let's load all the files from the repo into a list:

```python
repo_path = './the-algorithm'

docs = []
for dirpath, dirnames, filenames in os.walk(repo_path):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e:
            print(e)
```

## A note on chunking text files

Text files are typically split into chunks before creating embeddings. In general, more chunks increase the relevance of the data that is fed into the language model, since granular data can be selected with higher precision. However, since an embedding is created for each chunk, more chunks also increase the computational cost.

```python
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
```

## Creating the Deep Lake Vector Store

First, we specify a path for storing the Deep Lake dataset containing the embeddings and their metadata.

```python
dataset_path = 'al://<org-id>/twitter_algorithm'
```

Next, we specify an OpenAI embedding model for creating the embeddings, and create the Vector Store. This process creates an embedding for each element in the `texts` list and stores it in Deep Lake format at the specified path.

```python
embeddings = OpenAIEmbeddings()
```

```python
db = DeeplakeVectorStore.from_documents(
    dataset_path=dataset_path, embedding=embeddings, documents=texts, overwrite=True
)
```

The Deep Lake Vector Store has 4 columns: `documents`, `embeddings`, `ids`, and `metadata`.

```python
db.dataset.summary()
```

```
Dataset length: 31305
Columns:
  documents : text
  embeddings: embedding(1536, clustered)
  ids       : text
  metadata  : dict
```

## Use the Vector Store in a Q&A App

We can now use the Vector Store in a Q&A app, where the embeddings will be used to filter relevant documents (`texts`) that are fed into an LLM in order to answer a question.

If we were on another machine, we would load the existing Vector Store without recalculating the embeddings:

```python
db = DeeplakeVectorStore(dataset_path=dataset_path, read_only=True, embedding_function=embeddings)
```

Next, we create a `retriever` object and specify the search parameters:

```python
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['k'] = 20
```

Finally, let's create a `RetrievalQA` chain in LangChain and run it:

```python
model = ChatOpenAI(model='gpt-3.5-turbo')
qa = RetrievalQA.from_llm(model, retriever=retriever)
```

```python
qa.run('What programming language is most of the SimClusters written in?')
```

This returns:

```
Most of the SimClusters code is written in Scala as indicated by the packages such as `com.twitter.simclustersann.modules`, `com.twitter.simclusters_v2.scio.common`, `com.twitter.simclusters_v2.summingbird.storm`, and references to Scala-based GCP jobs.
```

## Accessing the Low Level Deep Lake API (Advanced)

When using a Deep Lake Vector Store in LangChain, the underlying Vector Store and its low-level Deep Lake dataset can be accessed via:

```python
# LangChain Vector Store
db = DeeplakeVectorStore(dataset_path=dataset_path)

# Deep Lake Dataset object
ds = db.dataset
```

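With the dataset in hand, low-level Deep Lake operations become available. As a hedged sketch (the `query` call follows Deep Lake's documented TQL interface; the exact filter string is an illustrative assumption):

```python
# Run a raw TQL query against the low-level dataset (illustrative example).
# Column names match the summary shown earlier ("documents", "embeddings", ...).
results = ds.query("SELECT * WHERE CONTAINS(documents, 'SimClusters') LIMIT 5")
```
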
## SelfQueryRetriever with Deep Lake

Deep Lake supports the [SelfQueryRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html) implementation in LangChain, which translates a user prompt into metadata filters.

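For intuition, a prompt like "I want to watch a movie rated higher than 8.5" is translated by the LLM into a structured query along these lines (a conceptual sketch using LangChain's query-constructor types, not output produced by this commit's code):

```python
from langchain.chains.query_constructor.ir import Comparator, Comparison, StructuredQuery

# Roughly what the self-query chain produces for the prompt above:
structured = StructuredQuery(
    query=" ",  # no semantic search component, only a metadata filter
    filter=Comparison(comparator=Comparator.GT, attribute="rating", value=8.5),
    limit=None,
)
```
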
> This section of the tutorial requires installation of additional packages:
> `pip install deeplake lark`

First, let's create a Deep Lake Vector Store with relevant data, using the documents below.

```python
from langchain_core.documents import Document

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "rating": 9.9,
            "director": "Andrei Tarkovsky",
            "genre": "science fiction",
        },
    ),
]
```

Since this feature uses Deep Lake's [Tensor Query Language](https://docs.deeplake.ai/latest/advanced/tql/) under the hood, the Vector Store must be stored in or connected to Deep Lake, which requires [registration with Activeloop](https://app.activeloop.ai/levongh/home):

```python
org_id = "<YOUR_ORG_ID>"
dataset_path = f"al://{org_id}/self_query"

vectorstore = DeeplakeVectorStore.from_documents(
    docs, embeddings, dataset_path=dataset_path, overwrite=True,
)
```


Next, let's instantiate our retriever by providing information about the metadata fields that our documents support and a short description of the document contents.

```python
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
```

And now we can try actually using our retriever!

```python
# This example only specifies a relevant query
retriever.get_relevant_documents("What are some movies about dinosaurs")
```

Output:

```
[Document(metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(metadata={'genre': 'animated', 'year': 1995}, page_content='Toys come alive and have a blast doing so'),
 Document(metadata={'genre': 'animated', 'year': 1995}, page_content='Toys come alive and have a blast doing so')]
```

Now we can run a query to find movies above a certain rating:

```python
# This example only specifies a filter
retriever.get_relevant_documents("I want to watch a movie rated higher than 8.5")
```

Output:

```
[Document(metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'),
 Document(metadata={'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
 Document(metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'),
 Document(metadata={'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone')]
```

Congrats! You just used the Deep Lake Vector Store in LangChain to create a Q&A App! 🎉

langchain_deeplake/distance_metric.py (+29)

@@ -0,0 +1,29 @@
```python
def cosine_similarity(embedding_tensor, query_embedding):
    return f"COSINE_SIMILARITY({embedding_tensor}, {query_embedding})"


def l2_norm(embedding_tensor, query_embedding):
    return f"L2_NORM({embedding_tensor}-{query_embedding})"


# Maps a metric name to the TQL expression builder for that metric.
METRIC_TO_TQL_QUERY = {
    "l2": l2_norm,
    "cos": cosine_similarity,
}


def get_tql_distance_metric(metric: str, embedding_tensor: str, query_embedding: str):
    return METRIC_TO_TQL_QUERY[metric](embedding_tensor, query_embedding)


# Lower L2 distance is better (ascending order); higher cosine similarity is better (descending).
METRIC_TO_ORDER_TYPE = {
    "l2": "ASC",
    "cos": "DESC",
}


def get_order_type_for_distance_metric(distance_metric):
    return METRIC_TO_ORDER_TYPE[distance_metric]
```
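
As a hedged usage sketch (not part of the commit; the column name `embeddings` and the query-embedding literal are illustrative assumptions), these helpers compose into the ordering clause of a TQL vector search:

```python
from langchain_deeplake.distance_metric import (
    get_tql_distance_metric,
    get_order_type_for_distance_metric,
)

metric = "cos"
# Build the scoring expression and its sort direction for the chosen metric.
score = get_tql_distance_metric(metric, "embeddings", "ARRAY[0.1, 0.2, 0.3]")
order = get_order_type_for_distance_metric(metric)

tql = f"SELECT * ORDER BY {score} {order} LIMIT 10"
print(tql)
# SELECT * ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC LIMIT 10
```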

langchain_deeplake/exceptions.py (+24)

@@ -0,0 +1,24 @@
1+
class ColumnMissingError(Exception):
2+
def __init__(self, column: str):
3+
super().__init__(f"Column '{column}' is not present in the Vector Store.")
4+
5+
6+
class UnexpectedUDFFilterError(Exception):
7+
def __init__(self, column: str):
8+
super().__init__(
9+
"UDF filter functions are not supported with DeepLake search. Please use TQL filter instead."
10+
)
11+
12+
13+
class MissingQueryOrTQLError(ValueError):
14+
"""Exception raised when both query and TQL are missing."""
15+
16+
def __init__(self, message="Either query or tql must be provided."):
17+
super().__init__(message)
18+
19+
20+
class InvalidQuerySpecificationError(ValueError):
21+
"""Exception raised when either both or neither of query and tql are provided."""
22+
23+
def __init__(self, message="Exactly one of 'query' or 'tql' must be provided."):
24+
super().__init__(message)
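
As a hedged sketch of intended usage (a hypothetical helper, not part of the commit), a search entry point might validate its arguments like this:

```python
from langchain_deeplake.exceptions import InvalidQuerySpecificationError

def validate_query_args(query=None, tql=None):
    # Exactly one of `query` or `tql` may be set; both set or both
    # missing raises InvalidQuerySpecificationError.
    if (query is None) == (tql is None):
        raise InvalidQuerySpecificationError()
```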

langchain_deeplake/filtering_util.py (+47)

@@ -0,0 +1,47 @@
```python
import deeplake
import numpy as np

from typing import Optional, Dict

from langchain_deeplake.exceptions import ColumnMissingError


def attribute_based_filtering_tql(
    ds: deeplake.Dataset, filter: Optional[Dict] = None
) -> str:
    """Filter helper function converting a filter dictionary to Deep Lake TQL.

    For non-dict tensors, perform an exact match if the target data is not a list,
    and perform an "IN" match if the target data is a list.
    For dict tensors, perform an exact match for each key-value pair in the target data.
    """

    tql_filter = ""

    if filter is not None:
        if isinstance(filter, dict):
            columns = ds.schema.columns
            column_names = [column.name for column in columns]
            for column in filter.keys():
                if column not in column_names:
                    raise ColumnMissingError(column)
                if ds.schema[column].dtype.kind == deeplake.types.TypeKind.Dict:
                    for key, value in filter[column].items():
                        val_str = f"'{value}'" if type(value) == str else f"{value}"
                        tql_filter += f"{column}['{key}'] == {val_str} and "
                else:
                    if type(filter[column]) == list:
                        # Remove square brackets and add round brackets below.
                        val_str = str(filter[column])[1:-1]

                        tql_filter += f"{column} in ({val_str}) and "
                    else:
                        val_str = (
                            f"'{filter[column]}'"
                            if isinstance(filter[column], str)
                            or isinstance(filter[column], np.str_)
                            else f"{filter[column]}"
                        )
                        tql_filter += f"{column} == {val_str} and "

            # Strip the trailing " and ".
            tql_filter = tql_filter[:-5]

    return tql_filter
```
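
As a hedged usage sketch (the dataset and column names are illustrative assumptions), a filter dictionary converts to a TQL fragment like this:

```python
# Assuming `ds` has a dict column "metadata" and a text column "documents":
filter = {"metadata": {"year": 1993}, "documents": ["a", "b"]}
tql = attribute_based_filtering_tql(ds, filter)
# -> "metadata['year'] == 1993 and documents in ('a', 'b')"
```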
