DeepBench is a command-line interface tool that performs systematic evaluation of image classification models by:
- Loading original images
- Applying various augmentation methods to create modified versions
- Classifying both original and augmented images using selected models
- Storing the results in a database for analysis
This approach allows researchers and developers to assess how well different models perform under various image transformations and distortions, providing insights into model robustness and reliability. DeepBench is intended to run on a GPU cluster or a comparably equipped machine.
- Features
- Project TAHAI
- Requirements
- Getting Started
- Setup
- Augmentation Methods
- Running Benchmarks
- CLI Arguments
- Supported Models
- Results and Output
- Adding New Models
- Contributing
- Troubleshooting
- License
- More Information on Project TAHAI
- Authors and Acknowledgment
- Comprehensive Model Support: Test a wide range of image classification models from the Hugging Face or Ollama APIs, including vision language models (multimodal models), traditional CNN models, and vision transformer models
- Extensive Augmentation Library: Apply various image transformations to test model robustness:
  - Basic transformations (brightness, contrast, rotation, flips)
  - Noise additions (Gaussian noise, salt & pepper)
  - Blur effects (Gaussian blur, motion blur)
  - Weather simulations (rain, clouds, shadows)
  - Geometric transformations (perspective, grid distortion)
- Flexible Configuration: Customize benchmarks through TOML configuration files:
  - Select models and image scaling parameters
  - Configure augmentation methods and parameters
  - Define experiment names and output settings
- Systematic Testing: Evaluate models with:
  - Individual augmentations
  - Ramp testing (incrementally increasing augmentation intensity)
  - Use case-specific augmentation combinations
- Results Storage: Store results in:
  - MongoDB database for shared access and analysis
  - Local TinyDB for standalone operation
- Debug Mode: Save augmented images for visual inspection
DeepBench was developed as part of the TAHAI (Trustworthy AI for Human-Augmented Intelligence) project, which focuses on assessing the robustness of various image classification models. Robustness, in this context, refers to the models' ability to consistently produce stable and reliable results, even when faced with disturbances or variations in input data.
DeepBench requires Python 3.10-3.12 and the following key dependencies:
- PyTorch
- Transformers (Hugging Face)
- OpenCV
- NumPy
- Pandas
- MongoDB (for database storage)
- TinyDB (for local storage)
For a complete list of dependencies, refer to the pyproject.toml file.
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd deepbench
  ```
- Create a Python virtual environment:

  ```bash
  python -m venv deepbench
  ```
- Activate the virtual environment:
  - Windows:

    ```bash
    deepbench\Scripts\activate
    ```

  - Linux/Mac:

    ```bash
    source deepbench/bin/activate
    ```
- Install DeepBench and its dependencies:

  ```bash
  pip install -e .
  ```

- For development tools:

  ```bash
  pip install -e .[dev]
  ```
- Set up the database (see Database Setup)
- Run DeepBench with a configuration file:

  ```bash
  python src/deepbench.py -c "configs/default_config.toml"
  ```
- Install the VS Code extension "Remote - SSH" and connect to the GPU cluster:

  ```bash
  ssh username@<cluster-ip-address>
  ```

- Clone the repository on the cluster
- Create a conda environment:

  ```bash
  conda create --name deepbench python=3.11
  conda activate deepbench
  ```
- Install CUDA and PyTorch:

  ```bash
  conda install pytorch
  ```
- Install DeepBench:

  ```bash
  cd deepbench
  pip install -e .
  ```
- Create an SBATCH file for running on the cluster (an example is provided in `deepbench.sbatch`)
- Run DeepBench using SLURM:

  ```bash
  sbatch deepbench.sbatch
  ```
- Database Connection: DeepBench uses MongoDB to store experiment results. By default, results are saved either in a remote MongoDB service hosted on cloud.mongodb.com or in a local instance linked to the TAHAI project. To use your own MongoDB instance, update the MongoDB credentials in your .env file.
- Database Authentication: Create a new file named `.env` in the root directory with the following content:

  ```
  DBUSER = db_write
  DBPASSWD = your_password
  MONGODB_URI = your_host_address
  ```

  The project database provides two user roles:
  - db_write: For running DeepBench (can write to the database)
  - db_read: For viewing results (read-only access)
- Storage Options: If you prefer to store results locally instead of in MongoDB, use the `-l` or `--local` flag:

  ```bash
  python src/deepbench.py -c "configs/default_config.toml" -l
  ```

  Alternatively, check that the following line is used in the main function of `src/deepbench.py`:

  ```python
  ResultLocal(infer_result_list)
  ```

  This will store results in a TinyDB database (JSON format) in the `deepbench/output` directory. If you only want to store the results in MongoDB, use the following line in the main function of `src/deepbench.py` instead:

  ```python
  ResultDatabase(infer_result_list)
  ```
DeepBench requires a CSV file that contains image paths and their corresponding ground truth labels. There are several ways to create this file:
Create a CSV file with the following format:

```csv
image_path,ground_truth
/path/to/image1.jpg,0
/path/to/image2.jpg,1
```
The ground truth should be the class index (integer) for the image.
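If you assemble such a file programmatically, the standard-library csv module is enough. A minimal sketch (the helper name is illustrative, not part of DeepBench):

```python
import csv
import tempfile
from pathlib import Path

def write_label_csv(rows, csv_path):
    """Write (image_path, class_index) pairs in the CSV format shown above."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_path", "ground_truth"])  # header row
        for image_path, class_index in rows:
            writer.writerow([image_path, int(class_index)])

# Example: two images with class indices 0 and 1
csv_path = Path(tempfile.gettempdir()) / "deepbench_labels.csv"
write_label_csv([("/path/to/image1.jpg", 0), ("/path/to/image2.jpg", 1)], csv_path)
print(csv_path.read_text())
```

The resulting file can be passed to DeepBench via the `input` setting or the `-i` flag.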
DeepBench provides a powerful tool called create_image_subset.py in the /tools directory that can automatically generate CSV files and mapping files from your image datasets.
Basic Usage:

```bash
# Create a CSV file with paths from a directory (all images)
python tools/create_image_subset.py -nfp /path/to/dataset/folder -n 0

# Create a CSV file with a subset of images (e.g., 100 images)
python tools/create_image_subset.py -nfp /path/to/dataset/folder -n 100
```
Parameters:

- `-nfp, --new_paths_file`: Path to the dataset directory. The tool will recursively find all images and create a CSV file.
- `-f, --filelist`: Text file with a list of absolute file paths (alternative to `-nfp`).
- `-n`: Number of examples to include in the subset (use 0 for all images).
- `-m, --mapping`: Path to the mapping file for class names (default is the ImageNet-1k mapping).
- `-o, --output`: Output file path for the CSV file (default: `filepaths_TIMESTAMP.csv`).
How It Works:

- When using `-nfp`, the tool:
  - Recursively finds all image files in the specified directory
  - Creates a text file with all file paths (`new_file_paths.txt`)
  - Automatically generates a mapping file (`new_mapping.txt`) based on subfolder names
  - Creates a CSV file with image paths and class indices

- The mapping file format is:

  ```
  class_folder_name class_description
  ```

- The tool assumes that images are organized in a folder structure where each subfolder represents a class:

  ```
  dataset/
  ├── class1/
  │   ├── image1.jpg
  │   └── image2.jpg
  ├── class2/
  │   ├── image3.jpg
  │   └── image4.jpg
  ```
Example:

```bash
# Create a CSV with all images from a medical dataset
python tools/create_image_subset.py -nfp /datasets/medical_images -n 0

# Create a CSV with 50 random images per class from a dataset
python tools/create_image_subset.py -nfp /datasets/food_images -n 50 -o food_dataset.csv
```

The tool will generate:

- A CSV file with image paths and class indices
- A mapping file that maps folder names to class indices
- A text file with all file paths
These files can then be used in your DeepBench configuration.
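The folder-per-class convention described above can be reproduced in a few lines. The sketch below shows how a class mapping and CSV rows could be derived from subfolder names; it is illustrative only, and the actual create_image_subset.py implementation may differ:

```python
import os
from pathlib import Path

def build_class_mapping(dataset_dir):
    """Map each class subfolder to an integer index (sorted for stability),
    mirroring the folder-per-class layout the tool expects."""
    subfolders = sorted(d.name for d in Path(dataset_dir).iterdir() if d.is_dir())
    return {name: idx for idx, name in enumerate(subfolders)}

def list_image_rows(dataset_dir, mapping, exts=(".jpg", ".jpeg", ".png")):
    """Yield (image_path, class_index) rows for the CSV file."""
    for class_name, class_idx in mapping.items():
        class_dir = Path(dataset_dir) / class_name
        for root, _dirs, files in os.walk(class_dir):
            for fname in sorted(files):
                if fname.lower().endswith(exts):
                    yield str(Path(root) / fname), class_idx
```

A mapping built this way assigns indices by alphabetical folder order, which is a common but not guaranteed convention; verify it against the generated `new_mapping.txt`.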
All configurations for DeepBench are located in the /configs directory. The main configuration files are:
The model configuration file (e.g., default_config.toml) contains settings for:
```toml
[cli]
experiment_name = "model-name"   # Name for this experiment run
debug = false                    # Save augmented images if true
input = './path/to/images.csv'   # Path to input images
primer_img_name = 'image.png'    # Set if you want to save the raw image data as a NumPy array for a specific image
output = './output'              # Output directory
local = false                    # Use local storage if true

[database]
mongodb = "mongodb+srv://DBUSER:DBPASSWD@MONGODB_URI/"

[models]
hugging_face = "google/vit-base-patch16-224"  # Model identifier
ollama_model = "gemma3:27b"      # Set if you want to use the Ollama API; leave empty if not needed
img_scaling = [224, 224]         # Image size for model input
top_k = 5                        # Number of top predictions to save
multimodal_classes = [
    "./tools/dataset_mapping.txt", # relative to the Python cwd, or an absolute path
    # Option 1: add classes as a list[str] -> ["car", "cat", "dog"]
    # Option 2: add the path of a mapping file; classes will be mapped according to "gt" in the input CSV file
    # Option 3: leave empty to use the ImageNet-1k labels/classes
]

[augmentation]
augment_config = [
    "./configs/augmentation/augm_defaults.toml", # Default augmentation settings
    "./configs/augmentation/augm_use_case.toml", # Use case augmentations
]
```

Augmentation configurations define how images are transformed:
- Default Augmentations (`augm_defaults.toml`):

  ```toml
  [augmentation.imgMethod.GaussianBlur]
  kernel_size = 9
  sigma_limit = 0

  [augmentation.imgMethod.ImageRotation]
  angle_degrees = 45
  ```
- Ramp Augmentations (gradually increasing intensity):

  ```toml
  [augmentation.Ramp.Brightness]
  active = true
  ramp_var = "brightness"
  range = [-100, 100]
  step_size = 25
  ```
- Use Case Augmentations (domain-specific combinations):

  ```toml
  [augmentation.UseCase.MedicalDiagnosis]
  active = true

  [augmentation.UseCase.MedicalDiagnosis.HistEqualization]
  active = true
  ```
DeepBench includes a comprehensive set of image augmentation methods to test model robustness:
- Brightness: Adjust image brightness
- Contrast: Modify image contrast
- Rotation: Rotate image by specified degrees
- Flips: Horizontal and vertical image flipping
- Gaussian Blur: Apply blur with configurable kernel size
- Motion Blur: Simulate motion blur effects
- Gaussian Noise: Add random noise to the image
- Salt & Pepper Noise: Add random white and black pixels
- Histogram Equalization: Enhance image contrast
- Global Color Shift: Modify color channels
- Grid Distortion: Apply grid-based distortion
- Grid Elastic Deformation: Apply elastic deformations
- Perspective Transformation: Change image perspective
- Rain: Simulate rain effects
- Cloud Generator: Add cloud overlays
- Shadow: Add shadow effects
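Many of these methods are simple pixel-level operations. As an illustration, a brightness shift just adds an offset to every pixel and clamps to the valid range; this is a sketch of the idea, not DeepBench's implementation:

```python
def adjust_brightness(pixels, offset):
    """Shift every pixel value by `offset`, clamping to the valid 0-255 range."""
    return [max(0, min(255, p + offset)) for p in pixels]

print(adjust_brightness([0, 100, 200, 255], 60))  # → [60, 160, 255, 255]
```

In practice DeepBench applies such transformations to full image arrays (e.g. via OpenCV) rather than flat pixel lists.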
Ramp augmentations apply a method with gradually increasing intensity to test how model performance degrades:
```toml
[augmentation.Ramp.Brightness]
active = true
ramp_var = "brightness"
range = [-100, 100]
step_size = 25
```

This will test the model with brightness values of -100, -75, -50, -25, 0, 25, 50, 75, and 100.
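The ramp expansion is a plain arithmetic sequence over the configured range, including both endpoints. A sketch (the function name is ours, not DeepBench's):

```python
def ramp_values(lo, hi, step):
    """Enumerate the augmentation intensities a ramp sweeps through,
    including both endpoints of the configured range."""
    return list(range(lo, hi + 1, step))

print(ramp_values(-100, 100, 25))
# → [-100, -75, -50, -25, 0, 25, 50, 75, 100]
```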
Use case augmentations combine multiple methods to simulate real-world scenarios:
- Medical Diagnosis: Adjustments relevant for medical imaging
- Autonomous Driving: Weather and lighting conditions for driving
- Manufacturing Quality: Transformations for industrial inspection
- Handheld Devices: Camera shake and lighting variations
- People Recognition: Variations in facial recognition scenarios
- Satellite Imaging: Atmospheric and perspective effects
The Augmentation Selector is a separate tool that uses OpenAI's API to automatically recommend and configure augmentation methods tailored to a specific domain. It generates a TOML configuration file with inline comments explaining each augmentation and logs evaluation results in a CSV file. The generated configurations can be run by DeepBench without any further steps.
- Automatically selects appropriate augmentations for domain-specific datasets.
- Queries OpenAI's API with a high-level description of the application domain.
- Dynamically parses available augmentations from a template file.
- Generates a configuration file in TOML format with inline comments explaining each augmentation.
- Logs evaluation results across runs in a CSV file, including selected augmentations, use case name, and model used.
How to set up OpenAI API Key
To use the Augmentation Selector, set up your OpenAI API key:
- Create a .env file in the project root.
- Add the following entry:

  ```
  OPENAI_API_KEY=your-openai-api-key
  ```
- Run the script by providing a use case name, a high-level description of your application domain, and the path to the augmentation template file:

  ```bash
  python src/augmentation_selector/main.py "MedicalImaging" "high-resolution medical imaging Dataset for Knee Arthritis" "augmentation_template.toml"
  ```

- Specify the OpenAI model (optional, default is gpt-4o):

  ```bash
  python src/augmentation_selector/main.py "Landscape" "Aerial Landscape Dataset" "augmentation_template.toml" --model gpt-3.5-turbo
  ```

- The script generates:
  - A configuration file in the configs/ directory: `configs/generated_config.toml`
  - Evaluation results logged in a CSV file: `evaluation_results.csv`
- The generated configuration file is in TOML format. Example:

  ```toml
  [augmentation.UseCase.MedicalImaging]
  active = true

  # Explanation: Flipping images horizontally allows the model to learn from the symmetrical nature of knee anatomy.
  [augmentation.UseCase.MedicalImaging.ImageFlipHorizontal]
  # Description: Flips the image along the vertical axis, mirroring it horizontally.
  active = true

  # Explanation: Slightly rotating images introduces variability to account for different imaging angles.
  [augmentation.UseCase.MedicalImaging.Ramp.ImageRotation]
  # Description: Rotates the image by a specified angle, keeping its contents intact.
  active = true
  ramp_var = "angle_degrees"
  range = [-150, 150]
  step_size = 30
  ```

Evaluation results for each run are logged in evaluation_results.csv in the following format:
| Run | UseCase | Model | Brightness | CloudGenerator | Contrast | GaussianBlur | GaussianNoise | GlobalColourShift | GridDistortion | GridElasticDeformation | HistEqualization | ImageFlipHorizontal | ImageFlipVertical | ImageRotation | MotionBlur | PerspectiveTransformation | Rain | SaltPepperNoise | Shadow | UseCaseName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | medical_diagnosis | gpt-4o-mini | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
| 2 | medical_diagnosis | gpt-4o | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
| 3 | auto_driving | gpt-4o-mini | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |  |
| 4 | auto_driving | gpt-4o | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
| 5 | manufacturing_quality | gpt-4o-mini | X | X | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |
| 6 | manufacturing_quality | gpt-4o | X | X | X | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |
| 7 | people_recognition | gpt-4o-mini | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
| 8 | people_recognition | gpt-4o | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
| 9 | satellite_imaging | gpt-4o-mini | X | X | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |
| 10 | satellite_imaging | gpt-4o | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
| 11 | handheld | gpt-4o-mini | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
| 12 | handheld | gpt-4o | X | X | X | X | X | X | X | X | X | X |  |  |  |  |  |  |  |  |
- Run: Incremental ID for each run.
- UseCase: The name of the application domain.
- Model: The OpenAI model used.
- Augmentation Columns: An `X` indicates inclusion of the augmentation in the run.
How to configure available augmentations and parameters
All augmentation parameters for the Augmentation Selector are defined in the default_augmentation_config.toml file.
This file lists all available augmentation methods and their adjustable parameters, including:

- `active`: Determines whether an augmentation method is enabled (true) or disabled (false).
- `range`: Defines the acceptable range of values for parameters.
- `step_size`: Specifies the increment for parameter adjustment.
By adjusting these parameters, you can plug in your own augmentation methods and tailor them to fit the needs of your specific domain.
For Hugging Face models, you can use the default batch size (128) for efficient processing. For gated models that require authentication, add your HuggingFace token to the .env file:
```
HUGGING_FACE_HUB_TOKEN = yourtoken
```

And then start with:

```bash
python src/deepbench.py -c "configs/model_resnet-50.toml"
```

When using Ollama models, you need to:
- Configure Environment Variables:
  - Create or update your `.env` file to include the Ollama server address:

    ```
    OLLAMA_ADDRESS=http://localhost:11434
    ```

  - For remote Ollama servers, replace `localhost` with the appropriate IP address

- Configure the Model in the TOML Configuration:
  - Create a configuration file with `ollama_model` set instead of `hugging_face`:

    ```toml
    [models]
    hugging_face = ""          # Leave empty when using Ollama
    ollama_model = "llava:34b" # Specify the Ollama model name
    img_scaling = [224, 224]
    top_k = 5
    ```
- Modify the batch size in `src/deepbench.py` to 1 (most Ollama models process one image at a time):

  ```python
  # Change this line
  batch_size = 128  # for Ollama set to 1, for HuggingFace set to 128 or higher
  # To
  batch_size = 1    # for Ollama set to 1, for HuggingFace set to 128 or higher
  ```
- Ensure Ollama is running on your local machine or remote server

- Run DeepBench with your Ollama configuration:

  ```bash
  python src/deepbench.py -c "configs/ollama_test_benchm.toml"
  ```
For large-scale benchmarking across multiple models and use cases, DeepBench provides shell scripts and SBATCH configurations that automate the process. This is particularly useful when running on GPU clusters with job scheduling systems like SLURM.
- Navigate to the `sbatch` folder in the DeepBench directory.

- Configure the `batch_submit_main.sh` script:
  - Define the dataset size (number of images to process):

    ```bash
    NUM_DATA="0100" # 100 images per class
    ```

  - Specify the models to benchmark (comma-separated list):

    ```bash
    HUGGING_FACE_MODELS="\
    google/gemma-3-4b-it,\
    llava-hf/llava-v1.6-mistral-7b-hf,\
    "
    ```

  - Set the input image sizes for each model:

    ```bash
    INPUT_SIZE_LIST="\
    512,\
    224,\
    "
    ```

  - Uncomment or add the SBATCH job submissions for the use cases you want to test:

    ```bash
    sbatch sbatch/handheld.sbatch "$NUM_DATA" "google/gemma-3-4b-it" "512"
    sbatch sbatch/handheld.sbatch "$NUM_DATA" "llava-hf/llava-v1.6-mistral-7b-hf" "224"
    # Add more as needed
    ```

- Customize SBATCH files for specific use cases:
  - Each use case has its own SBATCH file (e.g., `auto_driving.sbatch`, `handheld.sbatch`, `medical.sbatch`)
  - These files contain SLURM configuration parameters like:

    ```bash
    #SBATCH --job-name=TAHAI_handheld
    #SBATCH --time=6-18:00:00
    #SBATCH --gres=gpu:1
    #SBATCH --nodes=1
    #SBATCH --cpus-per-gpu=64
    #SBATCH --mem-per-gpu=32G
    ```

  - Adjust these parameters based on your cluster's resources and job requirements
- Make the script executable:

  ```bash
  chmod +x sbatch/batch_submit_main.sh
  ```

- Run the batch submission script:

  ```bash
  ./sbatch/batch_submit_main.sh
  ```

- Monitor job status:

  ```bash
  squeue  # View job queue
  sacct   # View job history
  ```

- Cancel jobs if needed:

  ```bash
  scancel JOB_ID
  ```
DeepBench includes several specialized batch scripts for different benchmarking scenarios:
- `batch_submit_clip_variations.sh`: Tests multiple CLIP model variants
- `batch_submit_medical_specialists.sh`: Focuses on medical imaging models
- `batch_submit_satellite_specialists.sh`: Tests satellite/aerial imaging models
You can create custom batch scripts for your specific benchmarking needs:
- Copy an existing script as a template:

  ```bash
  cp sbatch/batch_submit_main.sh sbatch/batch_submit_custom.sh
  ```

- Modify the model list and other parameters to suit your requirements

- Create or modify SBATCH files for specific use cases or datasets
This approach allows you to efficiently run multiple benchmarks in parallel, maximizing GPU utilization and automating the testing process across different models and augmentation methods.
DeepBench supports the following command-line arguments:
- -c, --config [CONFIG_FILE]: Path to the configuration file (TOML format)
- -d, --debug: Execute in debug mode, augmented pictures will be saved
- -i, --input [INPUT_PATH]: Path to folder containing images to use
- -o, --output [OUTPUT_PATH]: Path to folder to store results
- -l, --local: Use local TinyDB storage for results instead of MongoDB
DeepBench supports a wide range of image classification models through both Hugging Face and Ollama API integrations.
DeepBench primarily uses models from Hugging Face. Supported Hugging Face model types include:
- Traditional CNN Models:
  - ResNet (microsoft/resnet-50, microsoft/resnet-101)
  - VGG (timm/vgg16.tv_in1k)
  - EfficientNet (google/efficientnet-b2)
- Vision Transformer Models:
  - ViT (google/vit-base-patch16-224)
- Vision Language Models:
  - CLIP (openai/clip-vit-base-patch32)
  - SigLIP (google/siglip-base-patch16-224)
  - LLaVA (llava-hf/llava-v1.6-mistral-7b-hf)
  - Phi (microsoft/Phi-3.5-vision-instruct)
  - Qwen (Qwen/Qwen2-VL-2B-Instruct)
  - BLIP, CogVLM, PaLI-Gemma, and more
DeepBench also supports using models through the Ollama API. This is particularly useful for:
- Testing locally hosted models
- Using models not available on Hugging Face
- Benchmarking open-source LLMs with vision capabilities
To use the Ollama API integration, you need an Ollama server running locally or on a remote machine. For supported models, see ollama.ai.
Note: When using Ollama models, inference will be slower compared to Hugging Face models running on GPU, especially when processing many images. Consider using smaller datasets for testing. New models can be added by creating appropriate configuration files and model handler classes.
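Under the hood, an Ollama request is an HTTP POST to the server's /api/generate endpoint with base64-encoded image data. A sketch of building such a request body (the helper function is our illustration, not DeepBench's code):

```python
import base64
import json

def build_ollama_request(model, prompt, image_bytes):
    """Build the JSON body for a POST to OLLAMA_ADDRESS/api/generate.
    Ollama expects images as a list of base64-encoded strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return a single JSON response instead of a stream
    }

body = build_ollama_request("llava:34b", "Classify this image.", b"\x89PNG...")
print(json.dumps(body)[:40])
```

The actual HTTP call (e.g. via `requests.post`) and response parsing are omitted here since they depend on the server being available.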
DeepBench stores results in either MongoDB or a local TinyDB database. Each experiment creates a collection with the format:
```
model-name-shortened_timestamp
```

For example: `model-X-09-30-13_39_26`
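The timestamp suffix appears to follow a month-day-hour_minute_second pattern. A sketch of generating a matching name; note that the exact format string is inferred from the example above, not taken from DeepBench's source:

```python
import re
from datetime import datetime

def collection_name(model_short_name, now=None):
    """Build a collection name like 'model-X-09-30-13_39_26'
    (format string inferred from the documented example)."""
    now = now or datetime.now()
    return f"{model_short_name}-{now.strftime('%m-%d-%H_%M_%S')}"

name = collection_name("model-X", datetime(2024, 9, 30, 13, 39, 26))
print(name)  # → model-X-09-30-13_39_26
```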
Each document in the collection represents an image (original or augmented) with the following structure:
```json
{
  "_id": 1721239530471350311,
  "experiment_name": "model-X-07-17-20_05_25",
  "git": "aecd9ce32f64790766306fdbec5820531c0755bb",
  "image": "ILSVRC2012_val_00036423.jpg",
  "gt": "99",
  "resolution": {
    "original": [500, 376],
    "scaled": [224, 224]
  },
  "augment_method": {},
  "model": "model-X",
  "label_score": {
    "80": 0.0074752201326191425,
    "86": 0.002994032809510827,
    "99": 0.8502982258796692,
    "703": 0.011575303040444851,
    "912": 0.05776028335094452
  }
}
```

For augmented images, the augment_method field contains details about the applied augmentation. For original (unaugmented/uncorrupted) images, the augment_method field contains NoAugmentCategory.
If debug mode is enabled (-d flag), augmented images are saved to the output directory for visual inspection.
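Documents in this schema can be scored offline; top-1 accuracy, for instance, is just a comparison of gt against the argmax of label_score. A minimal sketch (helper names are ours, not part of DeepBench):

```python
def top1_correct(doc):
    """Check whether the highest-scoring label matches the ground truth."""
    predicted = max(doc["label_score"], key=doc["label_score"].get)
    return predicted == doc["gt"]

def top1_accuracy(docs):
    """Fraction of result documents whose top-1 prediction equals 'gt'."""
    return sum(top1_correct(d) for d in docs) / len(docs)

sample = {
    "gt": "99",
    "label_score": {"80": 0.007, "86": 0.003, "99": 0.850, "703": 0.012, "912": 0.058},
}
print(top1_correct(sample))  # → True
```

The same pattern extends to top-k accuracy by sorting label_score and checking membership of gt in the first k keys.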
To add a new model to DeepBench:
- Create a New Branch: Start by creating a new branch from the `main` branch for testing.

- Create a Configuration File: Create a new TOML file in the `configs` directory:

  ```toml
  [cli]
  experiment_name = "your-model-name"

  [models]
  hugging_face = "publisher/your-model-name"
  img_scaling = [224, 224]
  top_k = 5
  ```

- Create a Model Handler Class: If the model requires special handling, create a new class in `src/deepbench/ml/image/classification/`.

- Update the Model Registry: Add your model class to the registry in `src/deepbench/ml/image/imgclassifier.py`.

- Test Your Implementation: Run DeepBench with your new configuration and verify the results.

- Create a Pull Request: Once tested, create a pull request to merge your changes into the main branch.
Contributions to DeepBench are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests to ensure functionality
- Submit a pull request
It might happen that the flash-attention version is incompatible with your system.
If that is the case, just comment out the following line in pyproject.toml:

```toml
#"flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl",
```

and the following line in src/deepbench/ml/image/classification/hugging_llava.py:

```python
# self.model.config.attn_implementation = "flash_attention_2"
```

DeepBench is licensed under the MIT License.
- Erik Rodner, Erik.Rodner@HTW-Berlin.de
- David Brodmann, David.Brodmann@htw-berlin.de
- Rudolf Hoffmann, Rudolf.Hoffmann@student.htw-berlin.de
- Mario Koddenbrock, Mario.Koddenbrock@HTW-Berlin.de