This project investigates the vulnerability of machine learning models to adversarial examples and implements various defense mechanisms to improve model robustness and security.

Adversarial Machine Learning & Trustworthy AI Security Research

Trustworthy Machine Learning (ML) is a critical research area focusing on developing AI systems that are reliable, robust, secure, and fair. As machine learning models are increasingly deployed in high-stakes applications like healthcare, autonomous vehicles, and financial services, ensuring their trustworthiness becomes paramount.

Key Pillars of Trustworthy ML

  • Robustness: Models should maintain performance under adversarial conditions and distributional shifts
  • Privacy: Protecting sensitive information in training data from inference attacks
  • Security: Defending against adversarial examples and model extraction attacks
  • Fairness: Ensuring equitable treatment across different demographic groups
  • Interpretability: Understanding and explaining model decisions
  • Reliability: Consistent performance and proper uncertainty quantification

Project Overview

This repository implements and analyzes various attack vectors and defense mechanisms in machine learning security, demonstrating the vulnerabilities in modern AI systems and potential countermeasures. The project covers three main areas of ML security research:

  1. Adversarial Examples: White-box attacks and defenses using FGSM, PGD, and robust training
  2. Black-Box Model Substitution: Attacks that craft adversarial examples on a locally trained substitute model and transfer them to an unseen target model
  3. Membership Inference: Privacy attacks that determine whether a specific sample was part of the target model's training data

Research Motivation

The increasing deployment of ML models in security-sensitive applications has revealed numerous vulnerabilities:

  • Adversarial examples can cause misclassification with imperceptible perturbations
  • Model substitution attacks can fool a deployed model using only query access, without knowledge of its parameters or architecture
  • Membership inference attacks can violate user privacy by revealing training data membership

This project provides hands-on implementations to understand these threats and evaluate defense strategies.

Project Structure

| Directory | Attack/Defense Type | Description |
| --- | --- | --- |
| Adversarial/ | White-box adversarial attacks | FGSM and PGD attacks with multiple defense mechanisms (adversarial training, JPEG compression, feature squeezing) on MNIST |
| Black-Box_Attacks_Model_Substitute/ | Black-box model substitution | Substitute model training and transferability attacks on CIFAR-10 using CleverHans |
| Membership_Inference_Attack/ | Privacy inference attacks | Membership inference attacks using the ART library with shadow models and various attack strategies |
| docs/ | Documentation | Comprehensive analysis reports and technical documentation |
| visualizations/ | Results | Attack success rates, model comparisons, and sample visualizations |

Quick Start

Environment Setup

Each component has its own requirements file. Install dependencies based on your area of interest:

# For adversarial attacks and defenses
cd Adversarial/
# conda is shown here; venv or another environment manager works as well
conda env create -f environment.yml
conda activate adversarial-ml

# For black-box attacks  
cd Black-Box_Attacks_Model_Substitute/
pip install -r requirements.txt

# For membership inference attacks
cd Membership_Inference_Attack/
pip install -r requirements.txt

Running Experiments

  1. Adversarial Attacks & Defenses:

    cd Adversarial/
    jupyter lab Adversarial.ipynb
  2. Black-Box Model Substitution:

    cd Black-Box_Attacks_Model_Substitute/
    jupyter notebook Substitution.ipynb
  3. Membership Inference:

    cd Membership_Inference_Attack/
    jupyter notebook membership.ipynb

Key Findings

Adversarial Robustness

  • Standard CNNs are highly vulnerable to FGSM and PGD attacks (a minimal FGSM sketch follows this list)
  • Adversarial training provides the strongest defense but reduces clean accuracy
  • Preprocessing defenses (JPEG compression, feature squeezing) offer limited protection
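
To make the first finding concrete, the core FGSM step can be written in a few lines of PyTorch. This is a minimal sketch rather than the notebook's exact implementation; `model`, `x`, and `y` stand for any trained classifier and a batch of inputs/labels scaled to [0, 1].

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, y, eps=0.1):
        """Single-step FGSM: perturb each input in the sign of the loss gradient."""
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Take one step of size eps, then clip back to the valid pixel range [0, 1].
        return (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()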

Black-Box Transferability

  • Adversarial examples transfer effectively between different model architectures (see the transfer sketch after this list)
  • Substitute models can be trained with limited queries to the target model
  • Defense against black-box attacks requires diverse training and robust architectures
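
A rough sketch of the transferability measurement behind these findings: adversarial examples are crafted using only a locally trained substitute model's gradients and then evaluated against the black-box target. `substitute`, `target`, and `attack_fn` (for example, the FGSM sketch above) are placeholders, not names from the notebook.

    import torch

    @torch.no_grad()
    def accuracy(model, x, y):
        """Top-1 accuracy of a classifier on a batch."""
        return (model(x).argmax(dim=1) == y).float().mean().item()

    def transfer_attack(substitute, target, x, y, attack_fn):
        """Craft adversarial examples using only the substitute model's gradients,
        then measure how much they degrade the black-box target model."""
        x_adv = attack_fn(substitute, x, y)   # e.g. the FGSM sketch above
        clean_acc = accuracy(target, x, y)
        adv_acc = accuracy(target, x_adv, y)
        print(f"target accuracy: clean {clean_acc:.3f} -> adversarial {adv_acc:.3f}")
        return x_adv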

Privacy Vulnerabilities

  • Membership inference attacks achieve success rates well above random guessing on overfit models (a simplified sketch follows this list)
  • Differential privacy and regularization techniques help mitigate privacy leakage
  • Shadow-model attacks can effectively infer membership without access to the target's internals
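
The notebook relies on ART's shadow-model attacks; as a simplified, self-contained illustration of why overfitting leaks membership, the loss-threshold baseline below labels a sample a training member when the target model's per-sample loss falls under a threshold calibrated on known non-members. All names here are placeholders.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def per_sample_loss(model, x, y):
        """Cross-entropy loss of the target model on each individual sample."""
        return F.cross_entropy(model(x), y, reduction="none")

    @torch.no_grad()
    def infer_membership(target_model, x, y, threshold):
        """Flag a sample as a training member when its loss is below the threshold;
        overfit models give noticeably lower loss on data they were trained on."""
        return per_sample_loss(target_model, x, y) < threshold

    # The threshold can be calibrated on data known to be outside the training set,
    # e.g. (hypothetical names): threshold = per_sample_loss(target_model, x_out, y_out).mean()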

Research Impact

This work contributes to understanding:

  1. Attack Effectiveness: Quantitative analysis of various attack success rates
  2. Defense Trade-offs: Robustness vs. accuracy trade-offs in different defense strategies
  3. Transferability: How adversarial examples generalize across model architectures
  4. Privacy Risks: Quantification of information leakage in deployed ML models

Technical Implementation

Datasets Used

  • MNIST: Handwritten digit classification (28x28 grayscale images)
  • CIFAR-10: Natural image classification (32x32 color images)
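
Both datasets can be downloaded automatically; one common way is via torchvision, sketched below (the notebooks themselves may load the data differently, and the paths and batch size here are arbitrary):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    to_tensor = transforms.ToTensor()  # converts images to tensors scaled to [0, 1]

    mnist_train = datasets.MNIST("data/", train=True, download=True, transform=to_tensor)
    cifar_train = datasets.CIFAR10("data/", train=True, download=True, transform=to_tensor)

    mnist_loader = DataLoader(mnist_train, batch_size=128, shuffle=True)
    cifar_loader = DataLoader(cifar_train, batch_size=128, shuffle=True)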

Attack Methods

  • FGSM: Fast Gradient Sign Method for single-step attacks
  • PGD: Projected Gradient Descent for iterative, multi-step attacks (sketched below)
  • Model Substitution: Transfer attacks using surrogate models
  • Membership Inference: Privacy attacks using shadow models
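
A minimal PyTorch sketch of the PGD attack listed above: iterated FGSM steps with projection back onto the L-infinity ball of radius eps around the clean input. The default hyperparameters are illustrative and may differ from the notebook's settings.

    import torch
    import torch.nn.functional as F

    def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
        """Iterative FGSM with projection back onto the L-infinity ball of radius
        eps around the clean input x."""
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()
                # Project onto the eps-ball around x, then onto the valid pixel range.
                x_adv = x + (x_adv - x).clamp(-eps, eps)
                x_adv = x_adv.clamp(0.0, 1.0)
        return x_adv.detach()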

Defense Mechanisms

  • Adversarial Training: Training on adversarial examples generated on the fly (see the training-loop sketch below)
  • Preprocessing Defenses: JPEG compression and feature squeezing
  • Differential Privacy: Adding noise to protect training data privacy
  • Regularization: Preventing overfitting to reduce membership inference success
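
To illustrate the adversarial training defense (and the robustness vs. clean-accuracy trade-off noted in Key Findings), here is a sketch of a Madry-style training epoch; `model`, `optimizer`, `train_loader`, and `attack_fn` (for example, the PGD sketch above) are placeholders rather than the notebook's own code.

    import torch.nn.functional as F

    def adversarial_training_epoch(model, optimizer, train_loader, attack_fn):
        """One epoch of adversarial training: generate adversarial examples for each
        batch on the fly (inner maximization) and fit the model on them (outer
        minimization), as in Madry et al."""
        model.train()
        for x, y in train_loader:
            x_adv = attack_fn(model, x, y)       # e.g. the pgd_attack sketch above
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_adv), y)
            loss.backward()
            optimizer.step()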

Future Work

  • Integration of more advanced attacks (C&W, AutoAttack)
  • Evaluation on larger datasets (ImageNet, CIFAR-100)
  • Implementation of certified defense mechanisms
  • Analysis of attacks on transformer architectures
  • Multi-modal attack and defense strategies

References

  1. Goodfellow, I. J., et al. "Explaining and harnessing adversarial examples." ICLR 2015.
  2. Madry, A., et al. "Towards deep learning models resistant to adversarial attacks." ICLR 2018.
  3. Papernot, N., et al. "Practical black-box attacks against machine learning." AsiaCCS 2017.
  4. Shokri, R., et al. "Membership inference attacks against machine learning models." S&P 2017.

Contributing

This project is part of ongoing ML security research. Contributions, suggestions, and discussions are welcome through issues and pull requests.

License

This research code is provided for educational and research purposes. Please cite appropriately if used in academic work.
