skill-retrieval-finetuning

This repository contains data preparation, training, and evaluation code for skill-to-skill and query-to-skill retrieval experiments built around the ESCO and GRETA datasets. The project focuses on building dense (embedding-based) retrieval models and experimenting with hard-negative mining, dataset balancing, and Multiple Negatives Ranking losses to improve retrieval quality for skill annotation and recommendation.

Contents

  • data/ — datasets and intermediate artifacts (JSON/JSONL). Contains combined ESCO files, augmented data, and eval_split/train_dataset*.jsonl used for training and evaluation.
  • scripts/ — helper scripts for data preparation, evaluation, and training launcher notebooks. Subfolders include preperation/, training/, and evaluation/.
  • models/ — output and checkpoints for finetuned encoders (e.g. finetuned_e5_esco_model_*).
  • requirements.txt — Python dependencies used across scripts and notebooks.

Quick overview

High-level flow:

  1. Prepare and clean the datasets in data/ (combine sources, validate, augment and balance).
  2. Generate train / eval JSONL files with positives and mined negatives and optionally synthetic samples.
  3. Train a sentence/embedding model using MultipleNegativesRankingLoss with cached in-batch negatives plus curated hard negatives.
  4. Evaluate on held-out eval_split with ranking metrics (accuracy@K, precision@K, recall@K, NDCG, MRR, MAP) and measure query latency.
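The train/eval JSONL files in step 2 pair each query with a positive label and mined negatives. A minimal sketch of what one such line could look like — the field names here are illustrative assumptions, not the repo's actual schema:

```python
import json

# Hypothetical record: a query, its validated ESCO skill label, and
# mined hard negatives. Field names are assumptions for illustration.
sample = {
    "query": "set up CI/CD pipelines for microservice deployments",
    "positive": "manage ICT workflow",
    "negatives": [
        "use ICT ticketing system",
        "manage budgets",
    ],
}

# JSONL: one JSON object per line.
with open("train_dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```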

Data preparation

Notes on the latest training data preparation (isy-finetune-v2):

  • Started from about 650 human-validated ESCO samples.
  • Used scripts/preperation/reduceOverfittingRisk.py to remove samples tied to very frequent ESCO skills, keeping only the 12 most diverse samples per skill (568 samples remaining).
  • Used scripts/preperation/compress_long_queries.py to shorten very long queries (over 320 characters) by using LLMs to either summarize the query or split the sample into multiple shorter samples (923 samples).
  • Used scripts/preperation/generateForRareLables.py to generate additional synthetic samples for rare labels (labels with fewer than 5 samples) and for skills not yet covered, improving balance across ESCO categories and overall coverage (6883 samples).
  • Used scripts/preperation/enrich_skills.py to add skill descriptions to the label text for richer targets and a better balance between query and label length.
  • Used scripts/preperation/create_evaluation_dataset.py to create an evaluation split of 529 samples covering all ESCO categories.
  • Used hard_negative_mining.py to mine hard negatives from the full ESCO skill set with a base encoder (e5-base), adding up to 8 additional positive-aware hard negatives per sample. Hard negatives were filtered to remove likely false negatives (i.e. candidates too similar to the positive labels).
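The core idea of the last step — rank candidates by similarity to the query, but drop any candidate that is suspiciously close to the positive — can be sketched with plain cosine similarity over precomputed embeddings. The function name, signature, and the 0.95 threshold below are illustrative assumptions, not the repo's actual implementation:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mine_hard_negatives(query_vec, pos_vec, skill_names, skill_vecs,
                        k=8, false_neg_threshold=0.95):
    """Pick the k skills most similar to the query as hard negatives,
    skipping candidates whose similarity to the positive label exceeds
    false_neg_threshold (likely false negatives)."""
    scored = sorted(
        zip(skill_names, skill_vecs),
        key=lambda item: cos(query_vec, item[1]),
        reverse=True,
    )
    negatives = []
    for name, vec in scored:
        if cos(pos_vec, vec) > false_neg_threshold:
            continue  # too close to the positive label; skip as a false negative
        negatives.append(name)
        if len(negatives) == k:
            break
    return negatives
```

In the actual pipeline the embeddings would come from the e5-base encoder over the full ESCO skill set rather than toy vectors.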

Training

Primary training entrypoint:

  • scripts/training/kaggle-esco-finetuning-mnr.ipynb — notebook used for the most recent finetune runs. It demonstrates dataset loading, caching strategy for negatives, and training loops.

Training strategy summary:

  • Base model: e5-base embeddings (or similar dense encoder) were used as the starting point.
  • Loss: CachedMultipleNegativesRankingLoss — combines in-batch negatives with a curated cache of hard negatives to stabilize and strengthen the model against difficult confounders.
  • Data handling: random shuffling every epoch, label entries concatenated with label + description for richer targets, and a mixture of human-validated and synthetic samples to improve generalization.
  • Practical details: recent runs used Kaggle with 2x T4 GPUs and took about 1.2 hours for the larger balanced dataset. Hyperparameters (batch size, LR, epochs) are recorded in the notebook; check the training cell metadata for exact values.
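The loss behind (Cached)MultipleNegativesRankingLoss treats each query's row-aligned label as the positive and every other label in the batch as a negative, then applies scaled softmax cross-entropy against the diagonal. A minimal NumPy sketch of that computation (the caching of negatives across mini-batches, which the cached variant adds, is omitted here):

```python
import numpy as np

def mnr_loss(query_embs, label_embs, scale=20.0):
    """Multiple Negatives Ranking loss over one batch.

    query_embs, label_embs: (batch, dim) arrays; row i of label_embs is
    the positive for row i of query_embs, all other rows are negatives.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    l = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = scale * (q @ l.T)                 # (batch, batch) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)  # numerical stability for softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())
```

When queries align perfectly with their labels the loss approaches zero; when a query is closest to the wrong label the loss grows, which is what pushes the encoder to separate hard confounders.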

Assumptions:

  • The repository currently stores trained checkpoints in models/ and expects training notebooks to reference paths under data/ for datasets. Models and datasets are currently not tracked in this repo.

Evaluation

Evaluation is performed on held-out data/eval_split/* JSONL files containing 529 samples evenly sampled to cover all ESCO categories. It uses the same ranking metrics as the SentenceTransformers InformationRetrievalEvaluator, extended to retrieve against a ChromaDB collection of all ESCO skills instead of only the eval-dataset corpus.
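For reference, the per-query ranking metrics reported below reduce to simple functions of the ranked result list. A self-contained sketch of two of them (accuracy@K and reciprocal rank, whose mean over queries gives MRR@K):

```python
def accuracy_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant skill appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank_at_k(ranked_ids, relevant_id, k=10):
    """1/rank of the first relevant hit within the top k, else 0.0.
    Averaging this over all queries yields MRR@k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0
```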

Summary of recent evaluation (high level):

  Metric                  e5-base    isy-finetuned  isy-finetune-v2
  accuracy@1              0.452652   0.571970       0.742424
  accuracy@3              0.592803   0.774621       0.873106
  accuracy@5              0.628788   0.827652       0.910985
  accuracy@10             0.715909   0.878788       0.945076
  precision@1             0.452652   0.571970       0.742424
  precision@3             0.216540   0.293561       0.349116
  precision@5             0.145455   0.197348       0.232197
  precision@10            0.084280   0.112121       0.125758
  recall@1                0.375721   0.481526       0.618847
  recall@3                0.516204   0.692492       0.800122
  recall@5                0.559749   0.751755       0.856144
  recall@10               0.640492   0.821258       0.906343
  ndcg@10                 0.529609   0.690687       0.813567
  mrr@10                  0.534529   0.682371       0.816299
  map@100                 0.481411   0.632835       0.769737
  avg_time_per_query (s)  0.109027   0.119118       0.104379

Key takeaways:

  • The v2 finetune (balanced + hard negatives + CachedMNR) substantially improves top-1 and top-5 accuracy and MRR relative to both the base encoder and an earlier finetune.

  • Latency per query remains low (~0.10s) and is acceptable for many production use cases. Exact latency will depend on hardware and retrieval index configuration.
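A rough wall-clock measurement along the lines of the avg_time_per_query column can be taken with the standard library alone; `search_fn` below is a placeholder for whatever encode-and-retrieve call the pipeline uses, and the warmup count is an arbitrary choice:

```python
import time

def avg_query_latency(search_fn, queries, warmup=3):
    """Average wall-clock seconds per query, after a few warmup calls
    to absorb cache fills and lazy initialization."""
    for q in queries[:warmup]:
        search_fn(q)
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    return (time.perf_counter() - start) / len(queries)
```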

About

Skill Retrieval Evaluation. The evaluation approach is inspired by [mteb](https://github.com/embeddings-benchmark/mteb).
