skill-retrieval-finetuning

This repository contains data preparation, training, and evaluation code for skill-to-skill and query-to-skill retrieval experiments built around the ESCO and GRETA datasets. The project focuses on building dense (embedding-based) retrieval models and experimenting with hard-negative mining, dataset balancing, and Multiple Negatives Ranking losses to improve retrieval quality for skill annotation and recommendation.

Contents

  • data/ — datasets and intermediate artifacts (JSON/JSONL). Contains combined ESCO files, augmented data, and eval_split/train_dataset*.jsonl used for training and evaluation.
  • scripts/ — helper scripts for data preparation, evaluation, and training launcher notebooks. Subfolders include preperation/, training/, and evaluation/.
  • models/ — output and checkpoints for finetuned encoders (e.g. finetuned_e5_esco_model_*).
  • requirements.txt — Python dependencies used across scripts and notebooks.

Quick overview

High-level flow:

  1. Prepare and clean the datasets in data/ (combine sources, validate, augment and balance).
  2. Generate train / eval JSONL files with positives and mined negatives and optionally synthetic samples.
  3. Train a sentence/embedding model using MultipleNegativesRankingLoss with cached in-batch negatives plus curated hard negatives.
  4. Evaluate on held-out eval_split with ranking metrics (accuracy@K, precision@K, recall@K, NDCG, MRR, MAP) and measure query latency.
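The train/eval JSONL files in step 2 pair each query with a positive label and mined negatives. A minimal sketch of what one such line could look like — the field names here are illustrative assumptions, not the repo's actual schema:

```python
import json

# Hypothetical record: a query, its validated ESCO skill label, and
# mined hard negatives. Field names are assumptions for illustration.
sample = {
    "query": "set up CI/CD pipelines for microservice deployments",
    "positive": "manage ICT workflow",
    "negatives": [
        "use ICT ticketing system",
        "manage budgets",
    ],
}

# JSONL: one JSON object per line.
with open("train_dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```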

Data preparation

Notes on the latest training data preparation (isy-finetune-v2):

  • Started from about 650 human-validated ESCO samples.
  • Used scripts/preperation/reduceOverfittingRisk.py to remove samples tied to very frequent ESCO skills, keeping only the 12 most diverse samples per skill (568 samples remaining).
  • Used scripts/preperation/compress_long_queries.py to shorten very long queries (over 320 characters) by using LLMs to either summarize the query or split the sample into multiple shorter samples (923 samples).
  • Used scripts/preperation/generateForRareLables.py to generate additional synthetic samples for rare labels (labels with fewer than 5 samples) and for skills not yet covered, improving balance across ESCO categories and overall coverage (6883 samples).
  • Used scripts/preperation/enrich_skills.py to add skill descriptions to the label text for richer targets and a better balance between query and label length.
  • Used scripts/preperation/create_evaluation_dataset.py to create an evaluation split of 529 samples covering all ESCO categories.
  • Used hard_negative_mining.py to mine hard negatives from the full ESCO skill set with a base encoder (e5-base), adding up to 8 additional positive-aware hard negatives per sample. Hard negatives were filtered to remove likely false negatives (i.e. candidates too similar to the positive labels).
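The core idea of the last step — rank candidates by similarity to the query, but drop any candidate that is suspiciously close to the positive — can be sketched with plain cosine similarity over precomputed embeddings. The function name, signature, and the 0.95 threshold below are illustrative assumptions, not the repo's actual implementation:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mine_hard_negatives(query_vec, pos_vec, skill_names, skill_vecs,
                        k=8, false_neg_threshold=0.95):
    """Pick the k skills most similar to the query as hard negatives,
    skipping candidates whose similarity to the positive label exceeds
    false_neg_threshold (likely false negatives)."""
    scored = sorted(
        zip(skill_names, skill_vecs),
        key=lambda item: cos(query_vec, item[1]),
        reverse=True,
    )
    negatives = []
    for name, vec in scored:
        if cos(pos_vec, vec) > false_neg_threshold:
            continue  # too close to the positive label; skip as a false negative
        negatives.append(name)
        if len(negatives) == k:
            break
    return negatives
```

In the actual pipeline the embeddings would come from the e5-base encoder over the full ESCO skill set rather than toy vectors.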

Training

Primary training entrypoint:

  • scripts/training/kaggle-esco-finetuning-mnr.ipynb — notebook used for the most recent finetune runs. It demonstrates dataset loading, caching strategy for negatives, and training loops.

Training strategy summary:

  • Base model: e5-base embeddings (or similar dense encoder) were used as the starting point.
  • Loss: CachedMultipleNegativesRankingLoss — combines in-batch negatives with a curated cache of hard negatives to stabilize and strengthen the model against difficult confounders.
  • Data handling: random shuffling every epoch, label entries concatenated with label + description for richer targets, and a mixture of human-validated and synthetic samples to improve generalization.
  • Practical details: recent runs used Kaggle with 2x T4 GPUs and took about 1.2 hours for the larger balanced dataset. Hyperparameters (batch size, LR, epochs) are recorded in the notebook; check the training cell metadata for exact values.
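The loss behind (Cached)MultipleNegativesRankingLoss treats each query's row-aligned label as the positive and every other label in the batch as a negative, then applies scaled softmax cross-entropy against the diagonal. A minimal NumPy sketch of that computation (the caching of negatives across mini-batches, which the cached variant adds, is omitted here):

```python
import numpy as np

def mnr_loss(query_embs, label_embs, scale=20.0):
    """Multiple Negatives Ranking loss over one batch.

    query_embs, label_embs: (batch, dim) arrays; row i of label_embs is
    the positive for row i of query_embs, all other rows are negatives.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    l = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = scale * (q @ l.T)                 # (batch, batch) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)  # numerical stability for softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())
```

When queries align perfectly with their labels the loss approaches zero; when a query is closest to the wrong label the loss grows, which is what pushes the encoder to separate hard confounders.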

Assumptions:

  • The repository currently stores trained checkpoints in models/ and expects training notebooks to reference paths under data/ for datasets. Models and datasets are currently not tracked in this repo.

Evaluation

Evaluation is performed on held-out data/eval_split/* JSONL files containing 529 samples evenly sampled to cover all ESCO categories. It uses the same ranking metrics as the SentenceTransformers InformationRetrievalEvaluator, extended to retrieve against a ChromaDB collection of all ESCO skills instead of only the eval-dataset corpus.
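For reference, the per-query ranking metrics reported below reduce to simple functions of the ranked result list. A self-contained sketch of two of them (accuracy@K and reciprocal rank, whose mean over queries gives MRR@K):

```python
def accuracy_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant skill appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank_at_k(ranked_ids, relevant_id, k=10):
    """1/rank of the first relevant hit within the top k, else 0.0.
    Averaging this over all queries yields MRR@k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0
```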

Summary of recent evaluation (high level):

  Metric                  e5-base    isy-finetuned  isy-finetune-v2
  accuracy@1              0.452652   0.571970       0.742424
  accuracy@3              0.592803   0.774621       0.873106
  accuracy@5              0.628788   0.827652       0.910985
  accuracy@10             0.715909   0.878788       0.945076
  precision@1             0.452652   0.571970       0.742424
  precision@3             0.216540   0.293561       0.349116
  precision@5             0.145455   0.197348       0.232197
  precision@10            0.084280   0.112121       0.125758
  recall@1                0.375721   0.481526       0.618847
  recall@3                0.516204   0.692492       0.800122
  recall@5                0.559749   0.751755       0.856144
  recall@10               0.640492   0.821258       0.906343
  ndcg@10                 0.529609   0.690687       0.813567
  mrr@10                  0.534529   0.682371       0.816299
  map@100                 0.481411   0.632835       0.769737
  avg_time_per_query (s)  0.109027   0.119118       0.104379

Key takeaways:

  • The v2 finetune (balanced + hard negatives + CachedMNR) substantially improves top-1 and top-5 accuracy and MRR relative to both the base encoder and an earlier finetune.

  • Latency per query remains low (~0.10s) and is acceptable for many production use cases. Exact latency will depend on hardware and retrieval index configuration.
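A rough wall-clock measurement along the lines of the avg_time_per_query column can be taken with the standard library alone; `search_fn` below is a placeholder for whatever encode-and-retrieve call the pipeline uses, and the warmup count is an arbitrary choice:

```python
import time

def avg_query_latency(search_fn, queries, warmup=3):
    """Average wall-clock seconds per query, after a few warmup calls
    to absorb cache fills and lazy initialization."""
    for q in queries[:warmup]:
        search_fn(q)
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    return (time.perf_counter() - start) / len(queries)
```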

About

Skill Retrieval Evaluation. The evaluation approach is inspired by [mteb](https://github.com/embeddings-benchmark/mteb).
