
Computer Vision with Hugging Face Transformers


Overview of Computer Vision and Hugging Face Transformers

(Credit: Google DeepMind. Unsplash.com)



1. Introduction to Computer Vision

Computer Vision (CV) is a field of artificial intelligence that enables machines to interpret and make decisions based on visual data, such as images and videos. The goal is to simulate the way humans see and understand the world. CV powers applications like facial recognition, object detection, and image classification.

2. Main Computer Vision Tasks

Common computer vision tasks include:

  • Image classification: assigning a single label to an entire image.
  • Object detection: locating and labeling individual objects within an image.
  • Image segmentation: assigning a class label to every pixel in an image.
  • Image captioning and visual question answering: multi-modal tasks that combine vision and language.
  • Zero-shot image classification: classifying images without task-specific training data.

3. Advantages of Hugging Face Transformers for Computer Vision

  • Vision Transformers (ViT): Hugging Face supports Vision Transformers, which apply the Transformer architecture to image data, achieving state-of-the-art performance on tasks such as image classification and segmentation (see the first sketch after this list).
  • Pre-trained Models: Access to pre-trained models that can be fine-tuned for specific CV tasks, reducing the need for extensive computational resources and labeled data.
  • Interdisciplinary Application: Integration of vision and language tasks (e.g., image captioning, visual question answering) using multi-modal transformers (see the captioning sketch after this list).
  • Ease of Use: User-friendly APIs make it simple to apply complex models to CV tasks without needing to build them from scratch.
  • Community Support: Extensive documentation and a large community contribute to a rich ecosystem for developers working on CV tasks.
  • Flexibility: Models like CLIP (Contrastive Language–Image Pretraining) enable innovative tasks such as zero-shot image classification, where a model classifies images without task-specific training data (see the CLIP sketch after this list).
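
The following minimal sketch shows the high-level pipeline API applied to image classification with a fine-tuned ViT checkpoint. The checkpoint name and image path are assumptions; any image-classification checkpoint from the Hub and any local image will work.

from transformers import pipeline

# Image classification with a ViT checkpoint fine-tuned on ImageNet-1k
# (assumed checkpoint; any image-classification model on the Hub works).
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# The pipeline accepts a local file path, a URL, or a PIL image.
predictions = classifier("path/to/your_image.jpg")  # hypothetical path
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")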
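
For the multi-modal point above, here is an image-captioning sketch. It assumes the nlpconnect/vit-gpt2-image-captioning checkpoint (a ViT encoder paired with a GPT-2 decoder); other image-to-text checkpoints work the same way.

from transformers import pipeline

# Image captioning: a vision encoder feeds a language decoder that
# generates a textual description of the image.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

result = captioner("path/to/your_image.jpg")  # hypothetical path
print(result[0]["generated_text"])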
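
And a zero-shot classification sketch with CLIP: the model scores an image against arbitrary text labels chosen at inference time, so no task-specific training data is needed. The image path and candidate labels below are placeholders.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("path/to/your_image.jpg")  # hypothetical local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.3f}")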

4. Learning Resources

5. General References

6. Jupyter Notebook Examples

Note

📔 Read and execute the following Jupyter Notebook example in Google Colab.

This workshop will introduce participants to the core concepts of Computer Vision, the various tasks it involves, and how Hugging Face Transformers can be effectively utilized to advance these tasks.


import torch
from torchvision import datasets, transforms
from transformers import ViTImageProcessor, AutoModelForImageClassification

# Basic notions: computer vision, image preprocessing, image classification
# Advantages of Hugging Face Transformers for CV: pre-trained models, transfer learning
# Trends: vision transformers, object detection, image generation

# Load a pre-trained model and its image processor
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=10,  # CIFAR10 has 10 classes; a fresh classification head is initialized
)

# Load a dataset (e.g., CIFAR10)
transform = transforms.Compose([
    transforms.Resize(224),  # ViT expects 224x224 inputs
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)

# Alternatively, preprocess individual PIL images with the image processor,
# which applies the resizing and normalization the model was trained with
def preprocess_image(image):
    inputs = image_processor(images=image, return_tensors="pt")
    return inputs

# Fine-tune the model (see the sketch below)
# ... (similar to NLP fine-tuning)

# Activity: Try different image classification datasets and experiment with data augmentation.
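
The elided fine-tuning step could look roughly like the minimal training-loop sketch below. It continues from the model and trainloader defined above (with num_labels=10) and is only a sketch: in practice you would use the Trainer API, multiple epochs, a validation split, and a learning-rate schedule.

import torch
from torch.optim import AdamW

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for images, labels in trainloader:
    images, labels = images.to(device), labels.to(device)
    # The DataLoader already yields resized, normalized tensors,
    # so they can be passed directly as pixel_values.
    outputs = model(pixel_values=images, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()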


Created: 08/16/2024 (C. Lizárraga); Last update: 09/12/2024 (C. Lizárraga)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.