Mercury is a semantic-assisted, cross-text text labeling tool.
- semantic-assisted: when you select a text span, semantically related text segments will be highlighted -- so you don't have to eyeball through lengthy texts.
- cross-text: you are labeling text spans from two different texts.
Therefore, Mercury is very efficient for labeling NLP tasks that involve comparing spans between two lengthy documents, such as hallucination detection or factual consistency/faithfulness in RAG systems. Semantic assistance not only saves time and reduces fatigue but also helps avoid mistakes.
Currently, Mercury only supports labeling inconsistencies between the source and summary for summarization in RAG.
> [!NOTE]
> You need Python and Node.js.
Mercury uses sqlite-vec to store and search embeddings.
- Install the Python dependencies: `pip3 install -r requirements.txt && python3 -m spacy download en_core_web_sm`
- If you don't have `pnpm` installed, please install it with `npm install -g pnpm` (you may need `sudo`). If you don't have `npm`, try `sudo apt install npm`.
- Compile the frontend: `pnpm install && pnpm build`
- To use `sqlite-vec` via Python's built-in `sqlite3` module, you must have SQLite > 3.41 installed (otherwise `LIMIT` or `k = ?` will not work properly with `rowid IN (?)` for vector search) and ensure that Python's built-in `sqlite3` module is built for SQLite > 3.41. Note that Python's built-in `sqlite3` module uses its own binary library that is independent of the OS's SQLite, so upgrading the OS's SQLite will not upgrade Python's `sqlite3` module. To manually upgrade Python's `sqlite3` module to use SQLite > 3.41, here are the steps:
  - Download and compile SQLite > 3.41.0 from source:

    ```bash
    wget https://www.sqlite.org/2024/sqlite-autoconf-3460100.tar.gz
    tar -xvf sqlite-autoconf-3460100.tar.gz
    cd sqlite-autoconf-3460100
    ./configure
    make
    ```

  - Set Python's built-in `sqlite3` module to use the compiled SQLite. Suppose the compilation directory is `$SQLITE_Compile`. Then set this environment variable (feel free to replace `$SQLITE_Compile` with the actual absolute/relative path): `export LD_PRELOAD=$SQLITE_Compile/.libs/libsqlite3.so`. You may add this line to `~/.bashrc` to make it permanent.
  - To verify that Python's `sqlite3` module is using the correct SQLite, run `python3 -c "import sqlite3; print(sqlite3.sqlite_version)"`. If the output is the version of SQLite you just compiled, you are good to go. A fuller sanity check that also loads `sqlite-vec` is sketched after this list.
  - If you are using a Mac and run into trouble, please follow sqlite-vec's instructions.
- To use `sqlite-vec` directly in the `sqlite` prompt, simply compile `sqlite-vec` from source and load the compiled `vec0.o`. The usage can be found in sqlite-vec's README.
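As a fuller sanity check, the minimal sketch below confirms both the SQLite version and that the extension loads, assuming the `sqlite-vec` Python bindings (pip package `sqlite-vec`) are installed:

```python
import sqlite3

import sqlite_vec  # Python bindings for sqlite-vec

print(sqlite3.sqlite_version)  # should be > 3.41

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)   # requires a Python build with loadable-extension support
sqlite_vec.load(db)              # load the vec0 extension into this connection
db.enable_load_extension(False)

print(db.execute("SELECT vec_version()").fetchone())
```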
- Ingest data for labeling: run `python3 ingester.py -h` to see the options. The ingester takes a CSV, JSON, or JSONL file and loads texts from two text columns of the file (configurable via the options `ingest_column_1` and `ingest_column_2`, which default to `source` and `summary`). After ingestion, the data will be stored in a SQLite database, denoted as `CORPUS_DB` in the following steps.
- Manually set the labels for annotators to choose from in the `labels.yaml` file. Mercury supports hierarchical labels (see the illustrative sketch after this list).
- Generate and set a JWT secret key: `export SECRET_KEY=$(openssl rand -base64 32)`. You can rerun this command to generate a new secret key whenever needed, especially when the old one is compromised. Note that changing the JWT secret will log out all users. Optionally, you can also set `EXPIRE_MINUTES` to change the expiration time of the JWT token; the default is 7 days (10080 minutes).
- Start the Mercury annotation server: `python3 server.py [--corpus_db {CORPUS_DB} --user_db {USER_DB}]`. Be sure to set the candidate labels in the `labels.yaml` file first. The server runs on `http://localhost:8000` by default. The default `USER_DB`, namely `users.sqlite`, is distributed with the code repo; its default email and password are [email protected] and `test`, respectively.
- Optional: to add/update/list users in a `USER_DB`, see "User administration in Mercury" for more details.
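For illustration only, a hierarchical `labels.yaml` could look roughly like the sketch below. The nesting keys and most label names are assumptions (the actual schema is defined by your Mercury version); only `extrinsic` and `ambivalent` are taken from the example annotations later in this document:

```yaml
# Hypothetical labels.yaml -- structure and names are illustrative only
- name: intrinsic
  description: The summary contradicts the source.
  children:
    - name: entity_error
    - name: number_error
- name: extrinsic
  description: The summary has no support in the source.
- name: ambivalent
  description: Hard to decide.
```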
The annotations are stored in the `annotations` table of a SQLite database (hardcoded name `mercury.sqlite`). See the section on the `annotations` table below for the schema.
The dumped human annotations are stored in a JSON format like this:
```python
[
{ # first sample
'sample_id': int,
'source': str,
'summary': str,
'annotations': [ # a list of annotations from many human annotators
{
'annot_id': int,
'sample_id': int, # relative to the ingestion file
'annotator': str, # the annotator unique id
'annotator_name': str, # the annotator name
'label': list[str],
'note': str,
'summary_span': str, # the text span in the summary
'summary_start': int,
'summary_end': int,
'source_span': str, # the text span in the source
'source_start': int,
'source_end': int,
}
],
'meta_field_1': Any, # whatever meta info about the sample
'meta_field_2': Any,
...
},
{ # second sample
...
},
...
]
```

You can view the exported data at `http://[your_host]/viewer`.
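A minimal sketch of consuming such an export, assuming it has been dumped to a file (the name `annotations_dump.json` is hypothetical):

```python
import json

# Hypothetical file name; point this at your actual export
with open("annotations_dump.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    for annot in sample["annotations"]:
        print(
            annot["annotator_name"],
            annot["label"],               # list[str]
            repr(annot["summary_span"]),  # the text span selected in the summary
        )
```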
To run Mercury in a Docker container, you can use the provided `docker-compose.yml` file, e.g., `docker compose -f 'docker-compose.yml' up -d --build 'mercury'`. Before running the Docker container, be sure to set up the secrets in the `secrets/` directory. Also, run the ingester and the server first, or copy `mercury.sqlite`, `users.sqlite`, and `labels.yaml` to the root directory.
```bash
python3 migrator.py export --workdir {DIR_OF_SQLITE_FILES} --csv unified_users.csv
python3 migrator.py register --csv unified_users.csv --db unified_users.sqlite
```

Terminology:
- A sample is a pair of source and summary.
- A document is either a source or a summary.
- A chunk is a sentence in a document.
> [!NOTE]
> SQLite uses 1-based indexing for `AUTOINCREMENT` columns while the rest of the code uses 0-based indexing.
Mercury needs two SQLite databases: `MERCURY_DB`, which stores a corpus for annotation, and `USER_DB`, which stores login credentials. One `USER_DB` can be reused across multiple `MERCURY_DB`s so the same group of users can annotate different corpora.
| user_id | user_name | email | hashed_password |
|---|---|---|---|
| add93a266ab7484abdc623ddc3bf6441 | Alice | [email protected] | super_safe |
| 68d41e465458473c8ca1959614093da7 | Bob | [email protected] | my_password |
- The `user_name` column in the `users` table is not unique and is not used as part of the login credentials. An annotator logs in with a combination of `email` and `hashed_password`.
- Passwords are hashed with `argon2` using the parameters `time_cost=2, memory_cost=19456, parallelism=1`, as recommended by OWASP.
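For reference, hashing and verifying a password with these parameters could look like the sketch below, assuming the `argon2-cffi` package provides the argon2 implementation:

```python
from argon2 import PasswordHasher

# OWASP-recommended parameters, matching the values above
ph = PasswordHasher(time_cost=2, memory_cost=19456, parallelism=1)

hashed = ph.hash("super_safe")   # store this in the hashed_password column
ph.verify(hashed, "super_safe")  # raises VerifyMismatchError for a wrong password
```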
Tables: `chunks`, `annotations`, `config`. All are powered by SQLite; in particular, `chunks` is powered by `sqlite-vec`.
Each row in the `chunks` table is a chunk.
A JSONL file like this:

```jsonl
# test.jsonl
{"source": "The quick brown fox. Jumps over a lazy dog. ", "summary": "26 letters."}
{"source": "We the people. Of the U.S.A. ", "summary": "The U.S. Constitution. It is great. "}
```

will be ingested into the `chunks` table as below:
| chunk_id | text | text_type | sample_id | char_offset | chunk_offset | embedding |
|---|---|---|---|---|---|---|
| 0 | "The quick brown fox." | source | 0 | 0 | 0 | [0.1, 0.2, ..., 0.9] |
| 1 | "Jumps over the lazy dog." | source | 0 | 21 | 1 | [0.1, 0.2, ..., 0.9] |
| 2 | "We the people." | source | 1 | 0 | 0 | [0.1, 0.2, ..., 0.9] |
| 3 | "Of the U.S.A." | source | 1 | 15 | 1 | [0.1, 0.2, ..., 0.9] |
| 4 | "26 letters." | summary | 0 | 0 | 0 | [0.1, 0.2, ..., 0.9] |
| 5 | "The U.S. Constitution." | summary | 1 | 0 | 0 | [0.1, 0.2, ..., 0.9] |
| 6 | "It is great." | summary | 1 | 23 | 1 | [0.1, 0.2, ..., 0.9] |
Meaning of selected columns:
- `char_offset` is the offset of a chunk in its parent document, measured from the starting character of the chunk. It allows us to locate the chunk in the document (see the sketch after this list).
- `chunk_offset` is the index of a chunk within its parent document. It is also used to locate the chunk in the document.
- `text_type` takes its values from the ingestion file: `source` and `summary` for now.
- `sample_id` is the index of the sample in the ingestion file. Because the ingestion file could be randomly sampled from a bigger dataset, the `sample_id` is not necessarily global.
- `embedding` is the embedding of the chunk.
- All columns are 0-indexed.
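Since setup downloads spaCy's `en_core_web_sm` model, sentence chunking is presumably done with spaCy. The following sketch (an assumption about the implementation, not the actual ingester code) shows how sentence chunks, `char_offset`, and `chunk_offset` relate, and how a chunk can be recovered from its parent document:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

source = "The quick brown fox. Jumps over a lazy dog. "
for chunk_offset, sent in enumerate(nlp(source).sents):
    char_offset = sent.start_char  # offset of the chunk's first character in the document
    text = sent.text
    # A chunk can be recovered from its parent document by slicing:
    assert source[char_offset:char_offset + len(text)] == text
    print(chunk_offset, char_offset, repr(text))
```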
| annot_id | sample_id | annot_spans | annotator | label | note |
|---|---|---|---|---|---|
| 1 | 1 | {'source': [1, 10], 'summary': [7, 10]} | 2fe9bb69 | ["ambivalent"] | "I am not sure." |
| 2 | 1 | {'summary': [2, 8]} | a24cb15c | ["extrinsic"] | "No connection to the source." |
- `sample_id` refers to the `sample_id` of the corresponding chunks in the `chunks` table.
- `annot_spans` is a JSON text field that stores the text spans selected by the annotator. Each entry is a dictionary whose keys must be values of the `text_type` column in the `chunks` table (hardcoded to `source` and `summary` for now) and whose values are lists of two integers: the start and end indices of the text span in the chunk. For extrinsic hallucinations (no connection to the source at all), an entry has only a `summary` key. JSON is used here because SQLite does not support array types.
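A minimal sketch of reading annotations and decoding `annot_spans`, assuming the schema shown above and a local `mercury.sqlite`:

```python
import json
import sqlite3

db = sqlite3.connect("mercury.sqlite")
rows = db.execute("SELECT annot_id, sample_id, annot_spans, label FROM annotations")
for annot_id, sample_id, spans_json, label in rows:
    spans = json.loads(spans_json)  # e.g., {'source': [1, 10], 'summary': [7, 10]}
    print(annot_id, sample_id, spans, label)
```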
For example, the `config` table may look like this:
| key | value |
|---|---|
| embdding_model | "openai/text-embedding-3-small" |
| embdding_dimension | 4 |
| sample_id | json_meta |
|---|---|
| 0 | {"model":"meta-llama/Meta-Llama-3.1-70B-Instruct","HHEMv1":0.43335,"HHEM-2.1":0.39717,"HHEM-2.1-English":0.90258,"trueteacher":1,"true_nli":0.0,"gpt-3.5-turbo":1,"gpt-4-turbo":1,"gpt-4o":1, "sample_id":727} |
| 1 | {"model":"openai/GPT-3.5-Turbo","HHEMv1":0.43003,"HHEM-2.1":0.97216,"HHEM-2.1-English":0.92742,"trueteacher":1,"true_nli":1.0,"gpt-3.5-turbo":1,"gpt-4-turbo":1,"gpt-4o":1, "sample_id": 1018} |
The `sample_id` column is 0-indexed and is the same `sample_id` as in the `chunks` table; it is local to the ingestion file. The `json_meta` column holds whatever information the ingestion file contains besides the ingestion columns (`source` and `summary`).
Mercury implements a simple OAuth2-style authentication. The user logs in with an email and password, and the server returns a signed JWT, which it verifies on every subsequent request. The token expires after 7 days by default.
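For illustration only (this is not Mercury's actual server code), issuing and verifying such a token with PyJWT could look like the sketch below; the `HS256` algorithm and the payload fields are assumptions, while `SECRET_KEY` and `EXPIRE_MINUTES` come from the setup steps above:

```python
import datetime
import os

import jwt  # PyJWT

SECRET_KEY = os.environ["SECRET_KEY"]
EXPIRE_MINUTES = int(os.environ.get("EXPIRE_MINUTES", 10080))  # default: 7 days

# Issue a token after the email/password check succeeds
payload = {
    "sub": "add93a266ab7484abdc623ddc3bf6441",  # the user_id
    "exp": datetime.datetime.now(datetime.timezone.utc)
           + datetime.timedelta(minutes=EXPIRE_MINUTES),
}
token = jwt.encode(payload, SECRET_KEY, algorithm="HS256")

# Verify the token on each request; raises jwt.ExpiredSignatureError once expired
claims = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
```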
`sqlite-vec` uses Euclidean distance for vector search, so all embeddings must be normalized to unit length. Fortunately, OpenAI and Sentence-BERT embeddings are already normalized.
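If you plug in an embedder whose outputs are not unit-length, a quick normalization helper (hypothetical, not part of Mercury's code) looks like this:

```python
import numpy as np

def normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so Euclidean distance ranks results like cosine similarity."""
    return vec / np.linalg.norm(vec)

print(normalize(np.array([3.0, 4.0])))  # [0.6 0.8]
```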
The vector search works as follows:

- Suppose the user selects a text span in the sample with sample ID `x`.
- Get the type of the opposite document (`source` if the selection is in the summary, and vice versa).
- Get the embedding of the selected text span.
- Send a query to SQLite like this:

  ```sql
  SELECT chunk_id, distance FROM chunks
  WHERE k = 5 AND sample_id = {x} AND text_type = {text_type} AND embedding MATCH '{embedding}'
  ORDER BY distance
  ```

  This will find the 5 chunks in the opposite document that are most similar to the selected span. Note that `embedding` and `distance` are predefined by `sqlite-vec`.
Here is a running example (using the data above):

- Suppose the data has been ingested. The embedder is `openai/text-embedding-3-small` and the embedding dimension is 512.
- Suppose the user selects the span "The U.S. Constitution." in the sample with `sample_id = 1`. The `text_type` of this span is `summary`, so the opposite document is the source.
- Get the embedding of "The U.S. Constitution.":

  ```python
  embedding = embedder.embed(["The U.S. Constitution."], embedding_dimension=512)[0]
  ```

- Send a query to SQLite:

  ```sql
  SELECT chunk_id, distance FROM chunks
  WHERE k = 5 AND sample_id = 1 AND text_type = 'source' AND embedding MATCH '{embedding}'
  ORDER BY distance
  ```

  The return is `[(2, 0.20000001788139343), (1, 0.40000003576278687)]`.
- Get the text spans of the chunks with `chunk_id` 2 and 1:

  ```sql
  SELECT chunk_id, text, char_offset FROM chunks WHERE chunk_id IN (2, 1)
  ```

  The return is `[(2, 'We the people.', 0), (1, 'Of the U.S.A.', 15)]`.
- The closest source chunk is "We the people." (`chunk_id = 2`), which is the most famous three words in the US Constitution.
- OpenAI's embedding endpoint can only embed up to 8192 tokens in each call.
- `embdding_dimension` is only useful for OpenAI models. Most other models do not support changing the embedding dimension.
- `multi-qa-mpnet-base-dot-v1` takes about 0.219 seconds on an x86 CPU to embed one sentence when `batch_size` is 1. Its embedding dimension is 768.
- `BAAI/bge-small-en-v1.5` takes about 0.202 seconds on an x86 CPU to embed one sentence when `batch_size` is 1. Its embedding dimension is 384.
