
🤗 Hugging Face Learning Journey

This repository documents my step-by-step learning of Hugging Face tools for applied machine learning, with a focus on text data and real-world workflows.

The goal is to build a clear understanding of the full pipeline:

data → preprocessing → training → evaluation → sharing → application

01 - Loading Data

What this covers:

  • loading datasets using load_dataset
  • accessing train/test splits
  • initial dataset exploration
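
A minimal sketch of the pattern (the `imdb` dataset name is illustrative, not necessarily the one used in the notebook):

```python
from datasets import load_dataset

# load_dataset returns a DatasetDict keyed by split name.
dataset = load_dataset("imdb")  # illustrative dataset choice

train_ds = dataset["train"]
test_ds = dataset["test"]

print(train_ds)     # row count and column names
print(train_ds[0])  # first example as a plain dict
```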

Open In Colab

02 - Inspect Dataset

What this covers:

  • inspecting samples and features
  • understanding dataset structure
  • checking dataset size and label distribution
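
A minimal inspection sketch, again using `imdb` as a stand-in dataset:

```python
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("imdb")  # stand-in dataset
train_ds = dataset["train"]

print(train_ds.features)                                      # schema, incl. label names
print({split: ds.num_rows for split, ds in dataset.items()})  # split sizes
print(Counter(train_ds["label"]))                             # label distribution
```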

Open In Colab

03 - Preprocessing

This notebook covers:

  • general preprocessing such as filtering and text transformation
  • text preprocessing for transformers
  • tokenization with a matching pretrained tokenizer
  • preparing the dataset for PyTorch
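
A compact sketch of these steps, assuming `bert-base-uncased` as the checkpoint (any classification checkpoint works, as long as the tokenizer matches it):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("imdb")  # stand-in dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate long texts; padding is applied later, per batch.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# Keep only what PyTorch needs and return tensors.
tokenized = tokenized.remove_columns(["text"])
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch")
```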

Open In Colab

04 - Fine-Tuning With Trainer API

This notebook continues after preprocessing and shows how to:

  • load a pretrained text classification model
  • define training settings with TrainingArguments
  • use Trainer to fine-tune the model
  • evaluate performance with accuracy
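
A hedged sketch of the Trainer workflow, continuing from the preprocessing sketch above (argument names can differ slightly across transformers versions):

```python
import numpy as np
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # called evaluation_strategy in older releases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],  # from the preprocessing sketch
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```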

Open In Colab

05 - Custom Training Loop

This notebook explains how text classification training works without the Trainer API:

  • prepare tokenized data for PyTorch
  • create DataLoaders
  • define optimizer and scheduler
  • run a manual training loop
  • evaluate the model
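
The same training expressed as a manual loop; a minimal sketch that reuses `tokenized` from the preprocessing sketch:

```python
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, get_scheduler)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# The collator pads each batch dynamically to its longest sequence.
train_loader = DataLoader(tokenized["train"], batch_size=16, shuffle=True,
                          collate_fn=DataCollatorWithPadding(tokenizer))

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_scheduler("linear", optimizer, num_warmup_steps=0,
                          num_training_steps=len(train_loader))  # one epoch

model.train()
for batch in train_loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss  # loss is computed when "labels" are present
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```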

Open In Colab

06 - Understanding Learning Curves

This notebook explains how to interpret:

  • training and validation loss
  • validation accuracy
  • healthy learning curves
  • overfitting
  • underfitting
  • unstable training
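
One way to get the raw curves, assuming a Trainer run like the one in notebook 04 (`trainer.state.log_history` records training and eval losses as they are logged):

```python
import matplotlib.pyplot as plt

history = trainer.state.log_history
train = [(h["step"], h["loss"]) for h in history if "loss" in h]
val = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train), label="training loss")
plt.plot(*zip(*val), label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```

Diverging curves (training loss falling while validation loss rises) point to overfitting; two high, flat curves point to underfitting.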

Open In Colab

07 - Sharing Models

Use and share models via the Hugging Face Hub.

  • loading pretrained models
  • saving models locally
  • pushing models to the Hub
  • model cards
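
A minimal sharing sketch; the repo name `my-text-classifier` is illustrative, and `push_to_hub` needs a Hub token with write access:

```python
from huggingface_hub import login

login()  # paste a Hub token with write access

# Save locally first, then push model and tokenizer to the Hub.
model.save_pretrained("my-text-classifier")
tokenizer.save_pretrained("my-text-classifier")

model.push_to_hub("my-text-classifier")
tokenizer.push_to_hub("my-text-classifier")
```

Loading it back is symmetric: `AutoModelForSequenceClassification.from_pretrained("<user>/my-text-classifier")`.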

Open In Colab

08 - Classical NLP Tasks

This notebook covers the most practical classical NLP tasks that form the foundation of modern language models:

  • token classification
  • question answering
  • summarization
  • translation

For each task, I included:

  • the main idea
  • a small practical example
  • the key preprocessing concept
  • the common evaluation metric
  • a short summary of what I learned

Main takeaways

  • token classification requires label alignment with subword tokens
  • extractive QA requires mapping answer spans from characters to tokens
  • summarization typically uses encoder-decoder models and ROUGE
  • translation uses sequence-to-sequence models and SacreBLEU
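
A quick way to try two of these tasks, using the default checkpoints that `pipeline` downloads (explicit model names can be passed instead):

```python
from transformers import pipeline

summarizer = pipeline("summarization")
qa = pipeline("question-answering")

text = ("Hugging Face provides datasets, tokenizers, and pretrained "
        "models that cover most classical NLP tasks.")

print(summarizer(text, max_length=20, min_length=5))
print(qa(question="What does Hugging Face provide?", context=text))
```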

Open In Colab

09 - Supervised Fine-Tuning and LoRA

This notebook covers the practical workflow for adapting a pretrained language model into an assistant-style model.

Topics covered

  • chat templates
  • supervised fine-tuning (SFT)
  • LoRA
  • evaluation after fine-tuning

Main takeaways

  • instruct models require the correct chat template
  • SFT is useful only when prompting is not enough
  • LoRA makes fine-tuning much more memory-efficient
  • evaluation should include both metrics and qualitative checks
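
A minimal LoRA sketch with the `peft` library; the base checkpoint and target modules are illustrative assumptions, not the notebook's exact setup:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative small instruct model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with low-rank adapters; only these are trained.
config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM",
                    target_modules=["q_proj", "v_proj"])  # assumed module names
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the weights

# The chat template turns a message list into the model's prompt format.
messages = [{"role": "user", "content": "Summarize LoRA in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
print(prompt)
```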

Open In Colab
