aimagelab/itserr-wp8-latin-embeddings

Installation

  1. Create the Python environment.
conda create -n itserr-wp8 -y --no-default-packages python==3.9
conda activate itserr-wp8
  2. Install PyTorch.
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
  3. Install faiss.
conda install -n itserr-wp8 -y -c conda-forge faiss-cpu
  4. Clone the repo and install the other dependencies.
git clone https://github.com/aimagelab/itserr-wp8-latin-embeddings.git
cd itserr-wp8-latin-embeddings
pip install -r requirements.txt
  5. Install dependencies for the Latin-BERT tokenizer.
python -c "from cltk.data.fetch import FetchCorpus; corpus_downloader = FetchCorpus(language='lat'); corpus_downloader.import_corpus('lat_models_cltk')"

Run interactive search

First, log in to Hugging Face 🤗 using your read token:

huggingface-cli login

Now download the pre-extracted embeddings produced by our fine-tuned models for Latin.

git lfs install
git clone https://huggingface.co/datasets/itserr/WP8-Latin-Embeddings-Indices
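Conceptually, retrieval over these pre-extracted embeddings is nearest-neighbor search: a query embedding is compared against all passage embeddings and the closest ones are returned. The sketch below illustrates this with numpy and random vectors; it is not the repo's actual index-loading code, and the real search goes through faiss.

```python
import numpy as np

def search(index_embeddings: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k passages ranked by cosine similarity."""
    # L2-normalize so that the dot product equals cosine similarity
    index_norm = index_embeddings / np.linalg.norm(index_embeddings, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = index_norm @ query_norm
    # argsort is ascending: take the last k scores and reverse for descending order
    return np.argsort(scores)[-k:][::-1]

# Toy corpus: 100 passages with 768-dimensional (BERT-sized) embeddings
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 768))
top = search(corpus, corpus[42], k=5)
print(top[0])  # the query vector is its own nearest neighbor: 42
```

With real data, `corpus` would be the downloaded embedding matrix and `query` the embedding of a user query produced by one of the models below.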

You can download our fine-tuned models from Hugging Face 🤗.

We have released two flavors of Latin-BERT (--model_type latin_bert):

  • Latin-BERT-W_VULG-S_VL: this model has been fine-tuned with contrastive learning, using correspondences between the Vulgate and Vetus Latina translations of the Bible as positive pairs, and in-batch negatives.
  • Latin-BERT-W_VULG-S_VL-Synt: this model has been fine-tuned with additional synthetic hard negatives generated with ChatGPT.

Similarly, we have released two flavors of LaBERTa (--model_type laberta), fine-tuned with the same training recipe as our Latin-BERT models:

  • LaBERTa-W_VULG-S_VL
  • LaBERTa-W_VULG-S_VL-Synt
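Both model families output token-level hidden states, from which a single passage embedding is obtained by pooling. A common choice is masked mean pooling, sketched below with numpy on dummy arrays; whether it matches the exact pooling used by these checkpoints is an assumption.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    hidden_states: (seq_len, dim) token-level model outputs
    attention_mask: (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)  # (seq_len, 1)
    summed = (hidden_states * mask).sum(axis=0)
    count = mask.sum()
    return summed / count

# Dummy example: 4 tokens of dimension 3, the last one is padding
hidden = np.array([[1.0, 0.0, 2.0],
                   [3.0, 0.0, 0.0],
                   [2.0, 0.0, 1.0],
                   [9.0, 9.0, 9.0]])  # padding token, must be ignored
mask = np.array([1, 1, 1, 0])
print(mean_pool(hidden, mask))  # [2. 0. 1.]
```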

Run interactive search with the following command. Set --index_root and --passages_root to the path where you downloaded the pre-extracted embeddings.

PYTHONPATH=.:./latin-bert python interactive_search.py \
--index_root ./WP8-Latin-Embeddings-Indices \
--passages_root ./WP8-Latin-Embeddings-Indices \
--model_name_or_path itserr/LaBERTa-W_VULG-S_VL-Synt \
--model_type laberta \
--bible_id W_VULG \
--n_search_items 10

You can choose to search among the embeddings extracted from either the Vulgate or the Vetus Latina translation of the Bible by setting --bible_id to W_VULG or S_VL, respectively.
