This repository contains data preparation, training, and evaluation code for skill-to-skill and query-to-skill retrieval experiments built around the ESCO and GRETA datasets. The project focuses on building dense (embedding-based) retrieval models and experimenting with hard-negative mining, dataset balancing, and Multiple Negatives Ranking losses to improve retrieval quality for skill annotation and recommendation.
Repository structure:
- `data/` — datasets and intermediate artifacts (JSON/JSONL). Contains combined ESCO files, augmented data, and `eval_split/train_dataset*.jsonl` used for training and evaluation.
- `scripts/` — helper scripts for data preparation, evaluation, and training launcher notebooks. Subfolders include `preperation/`, `training/`, and `evaluation/`.
- `models/` — outputs and checkpoints for finetuned encoders (e.g. `finetuned_e5_esco_model_*`).
- `requirements.txt` — Python dependencies used across scripts and notebooks.
High-level flow:
- Prepare and clean the datasets in `data/` (combine sources, validate, augment, and balance).
- Generate train/eval JSONL files with positives and mined negatives, and optionally synthetic samples.
- Train a sentence/embedding model using MultipleNegativesRankingLoss with cached in-batch negatives plus curated hard negatives.
- Evaluate on the held-out `eval_split` with ranking metrics (accuracy@K, precision@K, recall@K, NDCG, MRR, MAP) and measure query latency.
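The train/eval JSONL files pair each query with a positive label and mined negatives. The exact schema is not spelled out in this README, so the field names below are illustrative assumptions, not the repository's confirmed format:

```python
import json

# Hypothetical record layout for data/eval_split/train_dataset*.jsonl.
# The field names are illustrative assumptions; the preparation scripts
# under scripts/preperation/ define the actual schema.
sample = {
    "query": "Built REST services in Java and reviewed pull requests",
    "positive": "develop software | The process of creating software applications.",
    "negatives": [
        "use software design patterns",   # mined hard negative
        "maintain database systems",      # mined hard negative
    ],
}

# Each line of a JSONL file holds one such record.
line = json.dumps(sample, ensure_ascii=False)
```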
Notes on the latest training data preparation (isy-finetune-v2):
- Start from about 650 human-validated ESCO samples.
- Used `scripts/preperation/reduceOverfittingRisk.py` to remove samples related to very frequent ESCO skills, keeping only the 12 most diverse samples per skill. Resulting in 568 samples.
- Used `scripts/preperation/compress_long_queries.py` to shorten very long queries (over 320 characters) by using LLMs to either summarize the query or split the sample into multiple shorter samples. Resulting in 923 samples.
- Used `scripts/preperation/generateForRareLables.py` to generate additional synthetic samples for rare labels (labels with fewer than 5 samples) and for skills not yet covered, improving balance across ESCO categories and overall coverage. Resulting in 6883 samples.
- Used `scripts/preperation/enrich_skills.py` to add skill descriptions to the label text, for richer targets and a better balance between query and label length.
- Used `scripts/preperation/create_evaluation_dataset.py` to create an evaluation split of 529 samples covering all ESCO categories.
- Used `hard_negative_mining.py` to mine hard negatives from the full ESCO skill set using a base encoder (e5-base), adding up to 8 positive-aware hard negatives per sample. Hard negatives were filtered to remove possible false negatives (i.e. candidates that are too similar to the positive labels).
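The mining step above — rank candidates by similarity to the query, but drop candidates that are suspiciously close to a positive label — can be sketched in numpy. This is a simplified illustration under assumed L2-normalised embeddings, not the code in `hard_negative_mining.py`, and the threshold value is a made-up parameter:

```python
import numpy as np

def mine_hard_negatives(query_emb, positive_embs, candidate_embs,
                        k=8, false_neg_threshold=0.95):
    """Pick up to k candidates most similar to the query, skipping any
    candidate whose similarity to a positive label exceeds the threshold
    (a likely false negative). All embeddings are assumed L2-normalised,
    so dot products equal cosine similarities."""
    query_sims = candidate_embs @ query_emb        # (n_candidates,)
    pos_sims = candidate_embs @ positive_embs.T    # (n_candidates, n_positives)
    keep = pos_sims.max(axis=1) < false_neg_threshold
    order = np.argsort(-query_sims)                # hardest (most similar) first
    return [int(i) for i in order if keep[i]][:k]
```

With toy 2-D embeddings, a candidate identical to the positive is filtered out, while confusable-but-distinct candidates are returned in order of similarity to the query.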
Primary training entrypoint: `scripts/training/kaggle-esco-finetuning-mnr.ipynb` — the notebook used for the most recent finetune runs. It demonstrates dataset loading, the caching strategy for negatives, and the training loops.
Training strategy summary:
- Base model: `e5-base` embeddings (or a similar dense encoder) were used as the starting point.
- Loss: CachedMultipleNegativesRankingLoss — combines in-batch negatives with a curated cache of hard negatives to stabilize training and strengthen the model against difficult confounders.
- Data handling: random shuffling every epoch, label entries concatenated with label + description for richer targets, and a mixture of human-validated and synthetic samples to improve generalization.
- Practical details: recent runs used Kaggle with 2x T4 GPUs and took about 1.2 hours for the larger balanced dataset. Hyperparameters (batch size, LR, epochs) are recorded in the notebook; check the training cell metadata for exact values.
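The core objective behind CachedMultipleNegativesRankingLoss (before the gradient-caching trick) is Multiple Negatives Ranking: for each query in a batch, its own label is the positive and every other label acts as a negative, scored with a softmax cross-entropy over scaled cosine similarities. A minimal numpy sketch of that objective, assuming L2-normalised embeddings and the similarity scale commonly used in sentence-transformers:

```python
import numpy as np

def mnr_loss(query_embs, label_embs, scale=20.0):
    """Multiple Negatives Ranking loss (numpy sketch, no gradient caching):
    query i should rank label i above all other labels in the batch.
    Embeddings are assumed L2-normalised; `scale` sharpens the softmax."""
    sims = scale * (query_embs @ label_embs.T)        # (B, B) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)           # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # cross-entropy, target = diagonal
```

When every query matches its own label exactly, the loss is near zero; mismatched pairs drive it up, which is what pushes hard confounders apart during training.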
Assumptions:
- Trained checkpoints are stored in `models/`, and the training notebooks reference dataset paths under `data/`. Models and datasets are currently not tracked in this repo.
Evaluation is performed on the held-out `data/eval_split/*` JSONL files, containing 529 samples evenly sampled to cover all ESCO categories. Evaluation uses the ranking metrics from the SentenceTransformers InformationRetrievalEvaluator, extended to retrieve from a ChromaDB index containing all ESCO skills instead of only the eval_dataset corpus.
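For reference, the per-query ranking metrics reported below follow the standard definitions (the same ones InformationRetrievalEvaluator implements); a self-contained sketch:

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1 if any relevant id appears in the top-k results, else 0."""
    return int(any(r in relevant_ids for r in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant hit within the top-k, else 0 (for mrr@k)."""
    for rank, rid in enumerate(ranked_ids[:k], start=1):
        if rid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids retrieved within the top-k."""
    hits = sum(1 for rid in ranked_ids[:k] if rid in relevant_ids)
    return hits / len(relevant_ids)
```

Each metric is averaged over all 529 evaluation queries to produce the table entries.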
Summary of recent evaluation (high level):
| Model | accuracy@1 | accuracy@3 | accuracy@5 | accuracy@10 | precision@1 | precision@3 | precision@5 | precision@10 | recall@1 | recall@3 | recall@5 | recall@10 | ndcg@10 | mrr@10 | map@100 | avg_time_per_query (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| e5-base | 0.452652 | 0.592803 | 0.628788 | 0.715909 | 0.452652 | 0.216540 | 0.145455 | 0.084280 | 0.375721 | 0.516204 | 0.559749 | 0.640492 | 0.529609 | 0.534529 | 0.481411 | 0.109027 |
| isy-finetuned | 0.571970 | 0.774621 | 0.827652 | 0.878788 | 0.571970 | 0.293561 | 0.197348 | 0.112121 | 0.481526 | 0.692492 | 0.751755 | 0.821258 | 0.690687 | 0.682371 | 0.632835 | 0.119118 |
| isy-finetune-v2 | 0.742424 | 0.873106 | 0.910985 | 0.945076 | 0.742424 | 0.349116 | 0.232197 | 0.125758 | 0.618847 | 0.800122 | 0.856144 | 0.906343 | 0.813567 | 0.816299 | 0.769737 | 0.104379 |
Key takeaways:
- The v2 finetune (balanced + hard negatives + CachedMNR) substantially improves top-1 and top-5 accuracy and MRR relative to both the base encoder and the earlier finetune.
- Latency per query remains low (~0.10 s) and is acceptable for many production use cases. Exact latency will depend on hardware and the retrieval index configuration.