Computer Vision with Hugging Face Transformers
(Credit: Google DeepMind. Unsplash.com)
Computer Vision (CV) is a field of artificial intelligence that enables machines to interpret and make decisions based on visual data, such as images and videos. The goal is to simulate the way humans see and understand the world. CV powers applications like facial recognition, object detection, and image classification. Common CV tasks include the following; a minimal code sketch for image classification appears after the list.
- Image Classification: Assigning a label or category to an image (e.g., identifying whether an image contains a cat or a dog). (See HF Tutorial).
- Object Detection: Identifying and localizing objects within an image (e.g., detecting multiple objects in a single image with bounding boxes). (See HF Tutorial).
- Semantic Segmentation: Classifying each pixel in an image into a category (e.g., distinguishing between the background and different objects). (See HF Tutorial).
- Instance Segmentation: Identifying individual instances of objects in an image, providing pixel-level masks for each object. (See HF Tutorial).
- Image Generation: Creating new images from a dataset or based on a given input (e.g., generating high-resolution images from text descriptions). (See HF Tutorial).
- Face Recognition: Identifying or verifying a person based on facial features from an image. (See HF Tutorial).
- Pose Estimation: Predicting the pose or position of a person or object in an image. (See HF Tutorial).
- Optical Character Recognition (OCR): Converting text from images into machine-encoded text. (See HF Tutorial).
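Many of these tasks can be run out of the box with pre-trained checkpoints through the Hugging Face pipeline API. Below is a minimal sketch for image classification; the checkpoint name and image path are illustrative placeholders, not requirements of this workshop.
from transformers import pipeline

# Image classification with a pre-trained ViT checkpoint (illustrative choice).
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# The pipeline accepts a local path, URL, or PIL image; this path is a placeholder.
predictions = classifier("path/to/your/image.jpg")
for pred in predictions:
    print(f"{pred['label']}: {pred['score']:.3f}")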
Hugging Face Transformers brings several advantages to computer vision work:
- Vision Transformers (ViT): Hugging Face supports Vision Transformers, which apply the Transformer architecture to image data, enabling state-of-the-art performance on tasks like image classification and segmentation.
- Pre-trained Models: Access to pre-trained models that can be fine-tuned for specific CV tasks, reducing the need for extensive computational resources and labeled data.
- Interdisciplinary Application: Integration of vision and language tasks (e.g., image captioning, visual question answering) using multi-modal transformers.
- Ease of Use: User-friendly APIs make it simple to apply complex models to CV tasks without needing to build them from scratch.
- Community Support: Extensive documentation and a large community contribute to a rich ecosystem for developers working on CV tasks.
- Flexibility: Models like CLIP (Contrastive Language–Image Pretraining) allow for innovative tasks such as zero-shot image classification, where models can classify images without task-specific training data.
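As a hedged sketch of the zero-shot classification just described (the CLIP checkpoint and the candidate labels below are placeholders chosen only for illustration):
from transformers import pipeline

# Zero-shot image classification with CLIP: the image is scored against
# arbitrary text labels with no task-specific training data.
clip_classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)
results = clip_classifier(
    "path/to/your/image.jpg",  # placeholder path
    candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a car"],
)
print(results)  # list of {'label': ..., 'score': ...} dictionaries, highest score first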
Books:
- "Deep Learning for Computer Vision" by Rajalingappaa Shanmugamani.
- "Computer Vision: Algorithms and Applications" by Richard Szeliski.
- "Hands-On Computer Vision with TensorFlow 2" by Benjamin Planche and Eliot Andres.
Papers:
- Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
- Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.
- He, K., et al. (2016). Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385.
Online Courses:
- Deep Learning Specialization by Andrew Ng on Coursera (includes a module on CV).
- Stanford CS231n: Deep Learning for Computer Vision – A comprehensive course on computer vision.
- Hugging Face Course – Specific chapters on Vision Transformers and CV tasks.
Documentation:
- Hugging Face Vision Documentation – Information on using Vision Transformers with Hugging Face.
- OpenCV Documentation – Extensive documentation on the popular OpenCV library for computer vision.
Tutorials and Blogs:
- Deep Learning for Computer Vision. Run.AI
- Deep Learning For Computer Vision: Essential Models and Practical Real-World Applications. Farooq Alvi. OpenCV.
- Hugging Face Blog - Computer Vision – Articles on the latest advancements in CV with Transformers.
- Overview of Vision Language Models. Aman.AI.
- A complete Hugging Face tutorial: how to build and train a vision transformer. Sergios Karagiannakos, 2021.
- Training a Classifier. Deep Learning with PyTorch: A 60 Minute Blitz. Pytorch.org.
- Training an object detector from scratch in PyTorch. Devjyoti Chakraborty, 2021.
- HARK-OPENCV.
Note
📔 Read and execute the next Jupyter Notebook example in Google Colab.
This workshop will introduce participants to the core concepts of Computer Vision, the various tasks it involves, and how Hugging Face Transformers can be effectively utilized to advance these tasks.
import torch
from PIL import Image
from torchvision import datasets, transforms
from transformers import ViTFeatureExtractor, AutoModelForImageClassification

# Basic notions: computer vision, image preprocessing, image classification
# Advantages of Hugging Face Transformers for CV: pre-trained models, transfer learning
# Trends: vision transformers, object detection, image generation

# Load a pre-trained model and its matching feature extractor.
# (Newer versions of transformers call this class ViTImageProcessor; it is used the same way.)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=10,  # CIFAR-10 has 10 classes; the classification head is newly initialized
)

# Load a dataset (e.g., CIFAR-10) and resize/normalize it to what ViT expects.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    # Use the normalization statistics the pre-trained model was trained with.
    transforms.Normalize(feature_extractor.image_mean, feature_extractor.image_std),
])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)

# Preprocess a single PIL image with the feature extractor
# (an alternative to the torchvision transform above, e.g. for inference).
def preprocess_image(image):
    inputs = feature_extractor(images=image, return_tensors="pt")
    return inputs

# Fine-tune the model
# ... (similar to NLP fine-tuning)

# Activity: Try different image classification datasets and experiment with data augmentation.
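The fine-tuning step elided above can be written as a plain PyTorch loop. The sketch below continues from the variables defined in the preceding block (model, trainloader); the AdamW optimizer, the 5e-5 learning rate, and the single pass over the data are illustrative assumptions, not settings prescribed by the workshop.
from torch.optim import AdamW

# Minimal fine-tuning loop (illustrative hyperparameters; assumes `model` and
# `trainloader` from the block above).
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()
optimizer = AdamW(model.parameters(), lr=5e-5)

for images, labels in trainloader:
    # The model computes the cross-entropy loss itself when labels are passed.
    outputs = model(pixel_values=images.to(device), labels=labels.to(device))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
In practice, the Hugging Face Trainer API is a common alternative to writing this loop by hand, mirroring the fine-tuning workflow used in the NLP workshop.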
Created: 08/16/2024 (C. Lizárraga); Last update: 09/12/2024 (C. Lizárraga)
UArizona DataLab, Data Science Institute, University of Arizona, 2024-2025.
Fall 2024
- Introduction to NLP with Hugging Face Transformers
- Computer Vision with Hugging Face Transformers
- Multimodal LLM with Hugging Face Transformers
- Running LLM locally: Ollama
- Introduction to Langchain
- Getting started with Phi-3
- Getting started with Gemini/Gemma
- Introduction to Gradio
- Introduction to Retrieval Augmented Generation (RAG)
Spring 2025