aimagelab/itserr-wp8-latin-embeddings: ITSERR WP8 - Code for Latin embeddings semantic search

Installation

Create the Python environment.

conda create -n itserr-wp8 -y --no-default-packages python==3.9
conda activate itserr-wp8

Install PyTorch. The index URL below selects the CUDA 11.8 builds.

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118

Install Faiss (CPU build).

conda install -n itserr-wp8 -y -c conda-forge faiss-cpu

Clone the repo and install other dependencies.

git clone https://github.com/aimagelab/itserr-wp8-latin-embeddings.git
cd itserr-wp8-latin-embeddings
pip install -r requirements.txt

Install dependencies for the Latin-BERT tokenizer.

python -c "from cltk.data.fetch import FetchCorpus; corpus_downloader = FetchCorpus(language='lat'); corpus_downloader.import_corpus('lat_models_cltk')"

Run interactive search

First, log in to Hugging Face 🤗 using your read token:

huggingface-cli login

Next, download the embeddings pre-extracted with our fine-tuned Latin models. The dataset repository is cloned with Git LFS:

git lfs install
git clone https://huggingface.co/datasets/itserr/WP8-Latin-Embeddings-Indices

You can download our fine-tuned models from Hugging Face 🤗.

We have released two flavors of Latin-BERT (--model_type latin_bert):

Latin-BERT-W_VULG-S_VL: this model has been fine-tuned with contrastive learning, using correspondences between the Vulgate and Vetus Latina translations of the Bible as positive pairs, with in-batch negatives.
Latin-BERT-W_VULG-S_VL-Synt: this model has been fine-tuned with additional synthetic hard negatives generated with ChatGPT.
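
To make the idea of "positive pairs with in-batch negatives" concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss. It is an illustration under stated assumptions, not the training code of this repository: the function names, the use of cosine similarity, the temperature value, and the toy 2-D vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE-style loss over a batch.

    positives[i] is the positive for anchors[i]; every other positives[j]
    in the same batch acts as a negative for anchors[i].
    The temperature value is an assumption, not taken from the repository.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching index i as the correct class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy stand-ins for sentence embeddings of two verses in each translation.
vulgate = [[1.0, 0.0], [0.0, 1.0]]
vetus_aligned = [[0.9, 0.1], [0.1, 0.9]]   # verse i paired with verse i
vetus_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # deliberately mismatched pairs

loss_aligned = in_batch_contrastive_loss(vulgate, vetus_aligned)
loss_shuffled = in_batch_contrastive_loss(vulgate, vetus_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each verse is closest to its own counterpart and large when the pairs are mismatched. One common way to use hard negatives, as in the -Synt variant, is to append them as extra candidates in the denominator; whether the released models were trained exactly this way is not stated here.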

Similarly, we have released two flavors of LaBERTa (--model_type laberta), fine-tuned with the same training recipe as our Latin-BERT models:

LaBERTa-W_VULG-S_VL
LaBERTa-W_VULG-S_VL-Synt
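
Since each released model must be paired with the matching --model_type value, a small helper can derive one from the other. This is only a convenience sketch: the function model_type_for is hypothetical and not part of the repository, and it relies solely on the naming pattern of the four models listed above.

```python
def model_type_for(model_name):
    """Return the --model_type value for a released model name.

    Accepts a bare name or one prefixed by an organization
    (for example "itserr/LaBERTa-W_VULG-S_VL-Synt").
    """
    base = model_name.split("/")[-1]
    if base.startswith("Latin-BERT"):
        return "latin_bert"
    if base.startswith("LaBERTa"):
        return "laberta"
    raise ValueError(f"Unrecognized model name: {model_name!r}")

print(model_type_for("itserr/LaBERTa-W_VULG-S_VL-Synt"))
```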

Run interactive search with the following command. Set --index_root and --passages_root to the path where you downloaded the pre-extracted embeddings.

PYTHONPATH=.:./latin-bert python interactive_search.py \
--index_root ./WP8-Latin-Embeddings-Indices \
--passages_root ./WP8-Latin-Embeddings-Indices \
--model_name_or_path itserr/LaBERTa-W_VULG-S_VL-Synt \
--model_type laberta \
--bible_id W_VULG \
--n_search_items 10

You can search among the embeddings extracted from either the Vulgate or the Vetus Latina translation of the Bible by setting --bible_id to W_VULG or S_VL, respectively.
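
Conceptually, the search embeds the query and returns the closest passages from the index selected by --bible_id, up to --n_search_items results. The sketch below shows that idea with a brute-force cosine-similarity search in pure Python. It is an assumption-laden illustration: the repository installs Faiss for indexing, and the similarity metric, the search function, the passage IDs, and the toy vectors here are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def search(query_vec, index, n_search_items=10):
    """Return the top (score, passage_id) pairs by cosine similarity."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), passage_id)
        for passage_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:n_search_items]

# Hypothetical toy indices, one per --bible_id value.
indices = {
    "W_VULG": {
        "passage_a": [0.9, 0.1, 0.0],
        "passage_b": [0.1, 0.9, 0.0],
        "passage_c": [0.0, 0.2, 0.9],
    },
    "S_VL": {
        "passage_d": [0.2, 0.8, 0.1],
    },
}

results = search([1.0, 0.0, 0.0], indices["W_VULG"], n_search_items=2)
top_ids = [passage_id for _, passage_id in results]
print(top_ids)
```

A brute-force scan like this is fine for a handful of vectors; a library such as Faiss serves the same purpose efficiently at the scale of a full Bible translation.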
