This repository documents my step-by-step learning of Hugging Face tools for applied machine learning, with a focus on text data and real-world workflows.
The goal is to build a clear understanding of the full pipeline:
data → preprocessing → training → evaluation → sharing → application
What this covers:
- loading datasets using load_dataset
- accessing train/test splits
- initial dataset exploration
What this covers:
- inspecting samples and features
- understanding dataset structure
- checking dataset size and label distribution
This notebook covers:
- general preprocessing such as filtering and text transformation
- text preprocessing for transformers
- tokenization with a matching pretrained tokenizer
- preparing the dataset for PyTorch
This notebook continues after preprocessing and shows how to:
- load a pretrained text classification model
- define training settings with TrainingArguments
- use Trainer to fine-tune the model
- evaluate performance with accuracy
This notebook explains how text classification training works without the Trainer API:
- prepare tokenized data for PyTorch
- create DataLoaders
- define optimizer and scheduler
- run a manual training loop
- evaluate the model
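The steps above can be sketched end-to-end in plain PyTorch. A tiny linear model and random features stand in for the transformer and tokenized text, so the structure of the loop is the focus, not the task.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for tokenized features: 2-d inputs with binary labels.
X = torch.randn(64, 2)
y = (X.sum(dim=1) > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)

# Stand-in model; the notebook uses a pretrained transformer here.
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.1)
num_epochs = 3
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1,
    total_iters=num_epochs * len(loader),
)
loss_fn = torch.nn.CrossEntropyLoss()

# Manual training loop: forward, backward, step, schedule, zero grads.
for epoch in range(num_epochs):
    model.train()
    for xb, yb in loader:
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# Evaluation: no gradients, argmax over logits.
model.eval()
with torch.no_grad():
    acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"accuracy: {acc:.2f}")
```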
This notebook explains how to interpret:
- training and validation loss
- validation accuracy
- healthy learning curves
- overfitting
- underfitting
- unstable training
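The interpretation rules above can be turned into a rough heuristic over loss histories. The thresholds below are illustrative, not standard values; real diagnosis also looks at accuracy and run-to-run variance.

```python
def diagnose(train_loss, val_loss):
    """Classify a learning curve; thresholds are illustrative only."""
    # Overfitting: validation loss climbs well above its best value
    # while training loss keeps falling.
    if val_loss[-1] > min(val_loss) * 1.2 and train_loss[-1] < train_loss[0]:
        return "overfitting"
    # Underfitting: training loss barely moved from its starting point.
    if train_loss[-1] > 0.9 * train_loss[0]:
        return "underfitting"
    return "healthy"

a = diagnose([1.0, 0.6, 0.3, 0.1], [0.9, 0.7, 0.8, 1.1])
b = diagnose([1.0, 0.98, 0.97, 0.96], [1.0, 0.99, 0.99, 0.98])
c = diagnose([1.0, 0.6, 0.4, 0.3], [1.0, 0.7, 0.5, 0.45])
print(a, b, c)  # overfitting underfitting healthy
```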
This notebook covers using and sharing models via the Hugging Face Hub:
- loading pretrained models
- saving models locally
- pushing models to the Hub
- model cards
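A sketch of the save/load round trip. To avoid downloading weights, the model is built from a small config instead of a pretrained checkpoint; the config sizes and the Hub repo name are assumptions.

```python
import tempfile

from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Build a deliberately tiny model from a config (no download needed).
config = DistilBertConfig(n_layers=1, n_heads=2, dim=64,
                          hidden_dim=128, num_labels=2)
model = DistilBertForSequenceClassification(config)

# Save locally: writes config.json plus the weights.
save_dir = tempfile.mkdtemp()
model.save_pretrained(save_dir)

# Load back from the local directory with the same API as Hub checkpoints.
reloaded = DistilBertForSequenceClassification.from_pretrained(save_dir)
print(type(reloaded).__name__)

# Pushing to the Hub is one extra call (after `huggingface-cli login`);
# the repo name here is a placeholder:
# model.push_to_hub("your-username/your-model-name")
```

A model card is the README.md of the Hub repo; push_to_hub creates the repo, and the card is edited alongside it.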
This notebook covers the most practical classical NLP tasks that form the foundation of modern language models:
- token classification
- question answering
- summarization
- translation
For each task, I included:
- the main idea
- a small practical example
- the key preprocessing concept
- the common evaluation metric
- a short summary of what I learned
Key takeaways:
- token classification requires label alignment with subword tokens
- extractive QA requires mapping answer spans from characters to tokens
- summarization typically uses encoder-decoder models and ROUGE
- translation uses sequence-to-sequence models and SacreBLEU
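The label-alignment point for token classification can be sketched without a model. The word_ids list below imitates what a fast tokenizer's word_ids() returns; the tokens, words, and label ids are made up.

```python
# One entry per subword token, giving the index of the source word
# (None for special tokens like [CLS]/[SEP]).
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 7, 0]  # one NER label per original word

aligned = []
prev = None
for wid in word_ids:
    if wid is None:
        aligned.append(-100)              # -100 is ignored by the loss
    elif wid != prev:
        aligned.append(word_labels[wid])  # label the first subword
    else:
        aligned.append(-100)              # mask continuation subwords
    prev = wid

print(aligned)  # [-100, 3, 7, -100, 0, -100]
```

Extractive QA needs the analogous mapping in the other direction: character-level answer spans mapped onto token positions via offset mappings.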
This notebook covers the practical workflow for adapting a pretrained language model into an assistant-style model.
It covers:
- chat templates
- supervised fine-tuning (SFT)
- LoRA
- evaluation after fine-tuning
Key takeaways:
- instruct models require the correct chat template
- SFT is useful only when prompting is not enough
- LoRA makes fine-tuning much more memory-efficient
- evaluation should include both metrics and qualitative checks
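The chat-template point can be sketched with the role/content message format. The render function below is a toy illustration only: each model family defines its own template, so in practice the prompt comes from tokenizer.apply_chat_template, and the <|role|> markers here are invented.

```python
# Messages in the role/content format that chat templates consume.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
]

# Toy stand-in for tokenizer.apply_chat_template(messages,
# add_generation_prompt=True); the markers are illustrative.
def render(msgs):
    body = "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in msgs)
    return body + "<|assistant|>\n"  # cue the model to answer

prompt = render(messages)
print(prompt)
```

Using the wrong template (or none) at inference is a common silent failure with instruct models: the text is still valid input, but it no longer matches what the model saw during fine-tuning.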