Trustworthy Machine Learning (ML) is a critical research area focusing on developing AI systems that are reliable, robust, secure, and fair. As machine learning models are increasingly deployed in high-stakes applications like healthcare, autonomous vehicles, and financial services, ensuring their trustworthiness becomes paramount.
- Robustness: Models should maintain performance under adversarial conditions and distributional shifts
- Privacy: Protecting sensitive information in training data from inference attacks
- Security: Defending against adversarial examples and model extraction attacks
- Fairness: Ensuring equitable treatment across different demographic groups
- Interpretability: Understanding and explaining model decisions
- Reliability: Consistent performance and proper uncertainty quantification
This repository implements and analyzes various attack vectors and defense mechanisms in machine learning security, demonstrating the vulnerabilities in modern AI systems and potential countermeasures. The project covers three main areas of ML security research:
- Adversarial Examples: White-box attacks and defenses using FGSM, PGD, and robust training
- Black-Box Model Substitution: Attacks that craft adversarial examples on a locally trained substitute model and transfer them to the target
- Membership Inference: Privacy attacks that determine whether specific samples were part of the target model's training data
The increasing deployment of ML models in security-sensitive applications has revealed numerous vulnerabilities:
- Adversarial examples can cause misclassification with imperceptible perturbations
- Model substitution attacks can breach systems without direct model access
- Membership inference attacks can violate user privacy by revealing training data membership
This project provides hands-on implementations to understand these threats and evaluate defense strategies.
| Directory | Attack/Defense Type | Description |
|---|---|---|
| Adversarial/ | White-box Adversarial Attacks | FGSM and PGD attacks with multiple defense mechanisms (adversarial training, JPEG compression, feature squeezing) on MNIST |
| Black-Box_Attacks_Model_Substitute/ | Black-box Model Substitution | Substitute model training and transferability attacks on CIFAR-10 using CleverHans |
| Membership_Inference_Attack/ | Privacy Inference Attacks | Membership inference attacks using the ART library with shadow models and various attack strategies |
| docs/ | Documentation | Comprehensive analysis reports and technical documentation |
| visualizations/ | Results | Attack success rates, model comparisons, and sample visualizations |
Each component has its own requirements file. Install dependencies based on your area of interest:
```bash
# For adversarial attacks and defenses
cd Adversarial/
# you can use venv or another environment manager as well
conda env create -f environment.yml
conda activate adversarial-ml

# For black-box attacks
cd Black-Box_Attacks_Model_Substitute/
pip install -r requirements.txt

# For membership inference attacks
cd Membership_Inference_Attack/
pip install -r requirements.txt
```

📁 Quick Navigation:
- 🎯 Adversarial Attacks & Defenses →
- 🕶️ Black-Box Model Substitution →
- 🔍 Membership Inference Attacks →
- 📊 Documentation & Reports →
- Adversarial Attacks & Defenses: `cd Adversarial/ && jupyter lab Adversarial.ipynb`
- Black-Box Model Substitution: `cd Black-Box_Attacks_Model_Substitute/ && jupyter notebook Substitution.ipynb`
- Membership Inference Attacks: `cd Membership_Inference_Attack/ && jupyter notebook membership.ipynb`
- Standard CNNs are highly vulnerable to FGSM and PGD attacks
- Adversarial training provides the strongest defense but reduces clean accuracy
- Preprocessing defenses (JPEG compression, feature squeezing) offer limited protection
- Adversarial examples transfer effectively between different model architectures (a minimal transfer-evaluation sketch follows this list)
- Substitute models can be trained with limited queries to the target model
- Defense against black-box attacks requires diverse training and robust architectures
- Membership inference attacks achieve significant success rates on overfit models
- Differential privacy and regularization techniques help mitigate privacy leakage
- Shadow model attacks can effectively infer membership without model internals
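To make the transferability finding concrete, here is a minimal, self-contained PyTorch-flavored sketch of a transfer-attack evaluation. The `substitute` and `target` models, the FGSM attack, and `eps` are illustrative assumptions, not the notebook's exact CleverHans/CIFAR-10 setup:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def accuracy(model, x, y):
    """Top-1 accuracy of `model` on the batch (x, y)."""
    return (model(x).argmax(dim=1) == y).float().mean().item()

def transfer_attack_eval(substitute, target, x, y, eps=0.03):
    """Craft FGSM examples on `substitute` and measure how much they hurt `target`."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(substitute(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
    return {
        "target_clean_acc": accuracy(target, x, y),
        "target_adv_acc": accuracy(target, x_adv, y),  # a large drop = the attack transfers
    }
```

A large gap between `target_clean_acc` and `target_adv_acc` indicates that examples crafted on the substitute transfer to the target, even without access to the target's gradients.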
This work contributes to understanding:
- Attack Effectiveness: Quantitative analysis of various attack success rates
- Defense Trade-offs: Robustness vs. accuracy trade-offs in different defense strategies
- Transferability: How adversarial examples generalize across model architectures
- Privacy Risks: Quantification of information leakage in deployed ML models
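As a toy illustration of how such leakage can be quantified (a simplified stand-in for the ART-based pipeline in the notebook), the sketch below runs a confidence-threshold membership inference attack against a hypothetical PyTorch classifier; the threshold `tau` is an arbitrary illustrative choice:

```python
import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def true_label_confidence(model, x, y):
    """Probability the model assigns to each example's true label."""
    probs = F.softmax(model(x), dim=1)
    return probs[torch.arange(len(y)), y].cpu().numpy()

def threshold_membership_attack(model, x_in, y_in, x_out, y_out, tau=0.9):
    """Guess 'member' whenever true-label confidence exceeds tau; return attack accuracy."""
    conf_in = true_label_confidence(model, x_in, y_in)     # known training members
    conf_out = true_label_confidence(model, x_out, y_out)  # known non-members
    guesses = np.concatenate([conf_in > tau, conf_out > tau])
    truth = np.concatenate([np.ones_like(conf_in, dtype=bool),
                            np.zeros_like(conf_out, dtype=bool)])
    return float((guesses == truth).mean())  # ~0.5 means little membership leakage
```

Attack accuracy near 0.5 means the model leaks little membership information; values well above 0.5 on overfit models mirror the findings above.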
- MNIST: Handwritten digit classification (28x28 grayscale images)
- CIFAR-10: Natural image classification (32x32 color images)
- FGSM: Fast Gradient Sign Method for single-step attacks
- PGD: Projected Gradient Descent for iterative attacks (both attacks are sketched after this list)
- Model Substitution: Transfer attacks using surrogate models
- Membership Inference: Privacy attacks using shadow models
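A minimal PyTorch-flavored sketch of the two white-box attacks (the notebooks' actual implementations and hyperparameters such as `eps`, `alpha`, and `steps` may differ):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.1):
    """Single-step FGSM: move each input by eps along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()

def pgd_attack(model, x, y, eps=0.1, alpha=0.02, steps=10):
    """Iterative PGD: repeated signed-gradient steps, projected back into the eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            # Project onto the L-infinity ball of radius eps and the valid pixel range
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv
```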
- Adversarial Training: Training on adversarial examples (a one-step sketch follows this list)
- Preprocessing Defenses: JPEG compression and feature squeezing
- Differential Privacy: Adding noise to protect training data privacy
- Regularization: Preventing overfitting to reduce membership inference success
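As a rough sketch of the adversarial training defense (an illustrative 50/50 clean/adversarial mix, not the notebook's exact training loop; `fgsm_attack` refers to the sketch above):

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.1):
    """One optimizer step on an even mix of clean and FGSM-perturbed examples."""
    model.train()
    # Craft adversarial inputs with the FGSM sketch above (any gradients it leaves on
    # the model parameters are cleared by zero_grad below)
    x_adv = fgsm_attack(model, x, y, eps=eps)
    optimizer.zero_grad()
    loss = 0.5 * (F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```

The even weighting of clean and adversarial loss is one common choice; tilting it toward adversarial examples typically buys robustness at a further cost in clean accuracy, which is the trade-off noted in the key findings.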
- Integration of more advanced attacks (C&W, AutoAttack)
- Evaluation on larger datasets (ImageNet, CIFAR-100)
- Implementation of certified defense mechanisms
- Analysis of attacks on transformer architectures
- Multi-modal attack and defense strategies
- Goodfellow, I. J., et al. "Explaining and harnessing adversarial examples." ICLR 2015.
- Madry, A., et al. "Towards deep learning models resistant to adversarial attacks." ICLR 2018.
- Papernot, N., et al. "Practical black-box attacks against machine learning." AsiaCCS 2017.
- Shokri, R., et al. "Membership inference attacks against machine learning models." S&P 2017.
This project is part of ongoing ML security research. Contributions, suggestions, and discussions are welcome through issues and pull requests.
This research code is provided for educational and research purposes. Please cite appropriately if used in academic work.