
Commit 78f2c8c

Merge branch 'master' into develop
2 parents b020637 + be1e6e6 commit 78f2c8c

17 files changed, +1664 -271 lines changed

examples/.gitignore

+1
@@ -1 +1,2 @@
 .ipynb_checkpoints
+.env

examples/langchain/.gitignore

+4
@@ -0,0 +1,4 @@
+docker
+.gradle
+build
+

examples/langchain/README.md

+70
@@ -0,0 +1,70 @@
+# Example langchain retriever
+
+This project demonstrates one approach for implementing a
+[langchain retriever](https://python.langchain.com/docs/modules/data_connection/)
+that allows for
+[Retrieval Augmented Generation (RAG)](https://python.langchain.com/docs/use_cases/question_answering/)
+to be supported via MarkLogic and the MarkLogic Python Client. This example uses the same data as in
+[the langchain RAG quickstart guide](https://python.langchain.com/docs/use_cases/question_answering/quickstart),
+but with the data having first been loaded into MarkLogic.
+
+**This is only intended as an example** of how easily a langchain retriever can be developed
+using the MarkLogic Python Client. The queries in this example are simple and naturally
+do not have any knowledge of how your data is modeled in MarkLogic. You are encouraged to use
+this as an example for developing your own retriever, where you can build a query based on a
+question submitted to langchain that fully leverages the indexes and data models in your MarkLogic
+application. Additionally, please see the
+[langchain documentation on splitting text](https://python.langchain.com/docs/modules/data_connection/document_transformers/). You may need to restructure your data so that you have a larger number of
+smaller documents in your database so that you do not exceed the limit that langchain imposes on how
+much data a retriever can return.
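To make the point about smaller documents concrete, here is a minimal sketch of splitting one long text into overlapping chunks before loading it into MarkLogic. It uses only the Python standard library rather than langchain's text splitters; the function name `split_text` and the default `chunk_size` and `overlap` values are illustrative assumptions, not part of this project.

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a chunk
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

Each chunk could then be written as its own document, which keeps any single retrieved document small. Character-based splitting is the simplest choice here; splitting on paragraphs or sentences may suit your data better.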
+
+# Setup
+
+To try out this project, use [docker-compose](https://docs.docker.com/compose/) to instantiate a new MarkLogic
+instance with port 8003 available (you can use your own MarkLogic instance too, just be sure that port 8003
+is available):
+
+    docker-compose up -d --build
+
+Then deploy a small REST API application to MarkLogic, which includes a basic non-admin MarkLogic user
+named `langchain-user`:
+
+    ./gradlew -i mlDeploy
+
+Next, create a new Python virtual environment - [pyenv](https://github.com/pyenv/pyenv) is recommended for this -
+and install the
+[langchain example dependencies](https://python.langchain.com/docs/use_cases/question_answering/quickstart#dependencies),
+along with the MarkLogic Python Client:
+
+    pip install -U langchain langchain_openai langchain-community langchainhub openai chromadb bs4 marklogic_python_client
+
+Then run the following Python program to load text data from the langchain quickstart guide
+into two different collections in the `langchain-test-content` database:
+
+    python load_data.py
+
+Create a ".env" file to hold your OpenAI API key:
+
+    echo "OPENAI_API_KEY=<your key here>" > .env
+
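In `ask.py`, the `.env` file is read by `load_dotenv()` from the python-dotenv package. As a rough illustration of what that step amounts to, the sketch below parses simple `KEY=VALUE` lines using only the standard library. The helper names `read_env_file` and `apply_env` are hypothetical, and the real python-dotenv handles more cases (quoting rules, variable expansion) than this does.

```python
import os


def read_env_file(path=".env"):
    """Return a dict of the KEY=VALUE pairs found in the file at `path`."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and one layer of simple quotes.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values, environ=os.environ):
    """Copy values into environ, leaving already-set variables untouched."""
    for key, value in values.items():
        environ.setdefault(key, value)
```

Leaving existing variables untouched means an `OPENAI_API_KEY` exported in your shell would take precedence over the one in the file.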
+# Testing the retriever
+
+You are now ready to test the example retriever. Run the following to ask a question with the
+results augmented via the `marklogic_retriever.py` module in this project; the OpenAI API key is
+read from the `.env` file created above, so you are not prompted for it:
+
+    python ask.py "What is task decomposition?" posts
+
+The retriever uses a [cts.similarQuery](https://docs.marklogic.com/cts.similarQuery) to select from the documents
+loaded via `load_data.py`. It defaults to a page length of 10. You can change this by providing a command line
+argument - e.g.:
+
+    python ask.py "What is task decomposition?" posts 15
+
+Example of a question for the "sotu" (State of the Union speech) collection:
+
+    python ask.py "What are economic sanctions?" sotu 20
+
+To use a word query instead of a similar query, along with a set of drop words, specify "word" as the 4th argument:
+
+    python ask.py "What are economic sanctions?" sotu 20 word
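The `ask.py` arguments are positional: the question, the collection, an optional page length, and an optional query type. The sketch below restates that contract as a small standalone function. `AskArgs` and `parse_args` are hypothetical names used only for illustration; the query type is left as `None` when omitted because the retriever's own default lives in `marklogic_retriever.py`, which is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AskArgs:
    question: str
    collection: str
    max_results: int = 10             # README: page length defaults to 10
    query_type: Optional[str] = None  # None: keep the retriever's default


def parse_args(argv: List[str]) -> AskArgs:
    """Parse the arguments that follow the program name."""
    if len(argv) < 2:
        raise SystemExit(
            'usage: ask.py "<question>" <collection> [max_results] [query_type]'
        )
    args = AskArgs(question=argv[0], collection=argv[1])
    if len(argv) > 2:
        args.max_results = int(argv[2])
    if len(argv) > 3:
        args.query_type = argv[3]
    return args
```

For instance, `parse_args(["What are economic sanctions?", "sotu", "20", "word"])` yields a collection of `"sotu"`, a `max_results` of 20, and a `query_type` of `"word"`.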

examples/langchain/ask.py

+37
@@ -0,0 +1,37 @@
+# Based on example at
+# https://python.langchain.com/docs/use_cases/question_answering/quickstart .
+
+import sys
+from dotenv import load_dotenv
+from langchain import hub
+from langchain_openai import ChatOpenAI
+from langchain.schema import StrOutputParser
+from langchain.schema.runnable import RunnablePassthrough
+from marklogic import Client
+from marklogic_retriever import MarkLogicRetriever
+
+
+def format_docs(docs):
+    return "\n\n".join(doc.page_content for doc in docs)
+
+
+question = sys.argv[1]
+
+retriever = MarkLogicRetriever.create(
+    Client("http://localhost:8003", digest=("langchain-user", "password"))
+)
+retriever.collections = [sys.argv[2]]
+retriever.max_results = int(sys.argv[3]) if len(sys.argv) > 3 else 10
+if len(sys.argv) > 4:
+    retriever.query_type = sys.argv[4]
+
+load_dotenv()
+
+prompt = hub.pull("rlm/rag-prompt")
+llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
+
+rag_chain = (
+    {"context": retriever | format_docs, "question": RunnablePassthrough()}
+    | prompt | llm | StrOutputParser()
+)
+print(rag_chain.invoke(question))
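To see what `format_docs` hands to the prompt as context, here is a self-contained sketch that swaps langchain's `Document` for a tiny stand-in class. `FakeDoc` is hypothetical and exists only for this illustration; the sample texts are made up.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a langchain Document; only page_content is needed."""
    page_content: str


def format_docs(docs):
    # Same logic as in ask.py: join document bodies with a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    FakeDoc("First retrieved passage."),
    FakeDoc("Second retrieved passage."),
]
print(format_docs(docs))
```

The retrieved documents therefore reach the model as one string, with each document body separated by a blank line and no metadata attached.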

examples/langchain/build.gradle

+4
@@ -0,0 +1,4 @@
+plugins {
+  id "net.saliman.properties" version "1.5.2"
+  id "com.marklogic.ml-gradle" version "4.6.0"
+}

examples/langchain/docker-compose.yml

+17
@@ -0,0 +1,17 @@
+version: '3.8'
+name: marklogic_langchain
+
+services:
+
+  marklogic:
+    image: "marklogicdb/marklogic-db:11.1.0-centos-1.1.0"
+    platform: linux/amd64
+    environment:
+      - MARKLOGIC_INIT=true
+      - MARKLOGIC_ADMIN_USERNAME=admin
+      - MARKLOGIC_ADMIN_PASSWORD=admin
+    volumes:
+      - ./docker/marklogic/logs:/var/opt/MarkLogic/Logs
+    ports:
+      - "8000-8003:8000-8003"
+
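The `ports` mapping above publishes host ports 8000 through 8003, so all four need to be free before `docker-compose up`. A minimal sketch of checking that from Python with the standard library follows; `port_is_free` and `busy_ports` are hypothetical helper names. One limitation: it only tests the loopback address, so it may miss a process bound solely to another interface.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


def busy_ports(ports, host="127.0.0.1"):
    """Return the subset of ports that could not be bound."""
    return [p for p in ports if not port_is_free(p, host)]


if __name__ == "__main__":
    # Ports published by the docker-compose.yml above.
    print("Ports in use:", busy_ports(range(8000, 8004)))
```

An empty list suggests the compose file's port range is available on this machine.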

examples/langchain/gradle.properties

+4
@@ -0,0 +1,4 @@
+mlAppName=langchain-test
+mlRestPort=8003
+mlUsername=admin
+mlPassword=admin
Binary file not shown.
@@ -0,0 +1,6 @@
+#Tue Mar 22 14:27:38 EDT 2016
+distributionBase=GRADLE_USER_HOME
+distributionPath=wrapper/dists
+zipStoreBase=GRADLE_USER_HOME
+zipStorePath=wrapper/dists
+distributionUrl=https\://services.gradle.org/distributions/gradle-8.4-bin.zip

examples/langchain/gradlew

+160
@@ -0,0 +1,160 @@
+#!/usr/bin/env bash
+
+##############################################################################
+##
+## Gradle start up script for UN*X
+##
+##############################################################################
+
+# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
+DEFAULT_JVM_OPTS=""
+
+APP_NAME="Gradle"
+APP_BASE_NAME=`basename "$0"`
+
+# Use the maximum available, or set MAX_FD != -1 to use that value.
+MAX_FD="maximum"
+
+warn ( ) {
+    echo "$*"
+}
+
+die ( ) {
+    echo
+    echo "$*"
+    echo
+    exit 1
+}
+
+# OS specific support (must be 'true' or 'false').
+cygwin=false
+msys=false
+darwin=false
+case "`uname`" in
+  CYGWIN* )
+    cygwin=true
+    ;;
+  Darwin* )
+    darwin=true
+    ;;
+  MINGW* )
+    msys=true
+    ;;
+esac
+
+# Attempt to set APP_HOME
+# Resolve links: $0 may be a link
+PRG="$0"
+# Need this for relative symlinks.
+while [ -h "$PRG" ] ; do
+    ls=`ls -ld "$PRG"`
+    link=`expr "$ls" : '.*-> \(.*\)$'`
+    if expr "$link" : '/.*' > /dev/null; then
+        PRG="$link"
+    else
+        PRG=`dirname "$PRG"`"/$link"
+    fi
+done
+SAVED="`pwd`"
+cd "`dirname \"$PRG\"`/" >/dev/null
+APP_HOME="`pwd -P`"
+cd "$SAVED" >/dev/null
+
+CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar
+
+# Determine the Java command to use to start the JVM.
+if [ -n "$JAVA_HOME" ] ; then
+    if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
+        # IBM's JDK on AIX uses strange locations for the executables
+        JAVACMD="$JAVA_HOME/jre/sh/java"
+    else
+        JAVACMD="$JAVA_HOME/bin/java"
+    fi
+    if [ ! -x "$JAVACMD" ] ; then
+        die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
+
+Please set the JAVA_HOME variable in your environment to match the
+location of your Java installation."
+    fi
+else
+    JAVACMD="java"
+    which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
+
+Please set the JAVA_HOME variable in your environment to match the
+location of your Java installation."
+fi
+
+# Increase the maximum file descriptors if we can.
+if [ "$cygwin" = "false" -a "$darwin" = "false" ] ; then
+    MAX_FD_LIMIT=`ulimit -H -n`
+    if [ $? -eq 0 ] ; then
+        if [ "$MAX_FD" = "maximum" -o "$MAX_FD" = "max" ] ; then
+            MAX_FD="$MAX_FD_LIMIT"
+        fi
+        ulimit -n $MAX_FD
+        if [ $? -ne 0 ] ; then
+            warn "Could not set maximum file descriptor limit: $MAX_FD"
+        fi
+    else
+        warn "Could not query maximum file descriptor limit: $MAX_FD_LIMIT"
+    fi
+fi
+
+# For Darwin, add options to specify how the application appears in the dock
+if $darwin; then
+    GRADLE_OPTS="$GRADLE_OPTS \"-Xdock:name=$APP_NAME\" \"-Xdock:icon=$APP_HOME/media/gradle.icns\""
+fi
+
+# For Cygwin, switch paths to Windows format before running java
+if $cygwin ; then
+    APP_HOME=`cygpath --path --mixed "$APP_HOME"`
+    CLASSPATH=`cygpath --path --mixed "$CLASSPATH"`
+    JAVACMD=`cygpath --unix "$JAVACMD"`
+
+    # We build the pattern for arguments to be converted via cygpath
+    ROOTDIRSRAW=`find -L / -maxdepth 1 -mindepth 1 -type d 2>/dev/null`
+    SEP=""
+    for dir in $ROOTDIRSRAW ; do
+        ROOTDIRS="$ROOTDIRS$SEP$dir"
+        SEP="|"
+    done
+    OURCYGPATTERN="(^($ROOTDIRS))"
+    # Add a user-defined pattern to the cygpath arguments
+    if [ "$GRADLE_CYGPATTERN" != "" ] ; then
+        OURCYGPATTERN="$OURCYGPATTERN|($GRADLE_CYGPATTERN)"
+    fi
+    # Now convert the arguments - kludge to limit ourselves to /bin/sh
+    i=0
+    for arg in "$@" ; do
+        CHECK=`echo "$arg"|egrep -c "$OURCYGPATTERN" -`
+        CHECK2=`echo "$arg"|egrep -c "^-"` ### Determine if an option
+
+        if [ $CHECK -ne 0 ] && [ $CHECK2 -eq 0 ] ; then ### Added a condition
+            eval `echo args$i`=`cygpath --path --ignore --mixed "$arg"`
+        else
+            eval `echo args$i`="\"$arg\""
+        fi
+        i=$((i+1))
+    done
+    case $i in
+        (0) set -- ;;
+        (1) set -- "$args0" ;;
+        (2) set -- "$args0" "$args1" ;;
+        (3) set -- "$args0" "$args1" "$args2" ;;
+        (4) set -- "$args0" "$args1" "$args2" "$args3" ;;
+        (5) set -- "$args0" "$args1" "$args2" "$args3" "$args4" ;;
+        (6) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" ;;
+        (7) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" ;;
+        (8) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" ;;
+        (9) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" "$args8" ;;
+    esac
+fi
+
+# Split up the JVM_OPTS And GRADLE_OPTS values into an array, following the shell quoting and substitution rules
+function splitJvmOpts() {
+    JVM_OPTS=("$@")
+}
+eval splitJvmOpts $DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS
+JVM_OPTS[${#JVM_OPTS[*]}]="-Dorg.gradle.appname=$APP_BASE_NAME"
+
+exec "$JAVACMD" "${JVM_OPTS[@]}" -classpath "$CLASSPATH" org.gradle.wrapper.GradleWrapperMain "$@"
