A C-based Convolutional Neural Network (CNN) implementation to classify the CIFAR-10 dataset.
This project was developed as part of my undergraduate thesis in Electrical and Computer Engineering at the University of Patras.
The implementation includes a serial version and a GPU-accelerated version using OpenACC. The goal was to evaluate inference performance and explore parallelization strategies on GPUs. Multiple code versions are provided to compare performance, including an optimized version using zero-padding before convolution.
Title: "Implementation of Convolutional Neural Networks Using GPUs"
(Greek: Υλοποίηση συνελικτικών νευρωνικών δικτύων σε κάρτες γραφικών)
University of Patras Institutional Repository link: https://hdl.handle.net/10889/28866
There are three versions of the code, each in its own directory:
- serial/: Basic serial implementation of the CNN
- openacc/: GPU-accelerated version using OpenACC
- padding/: Optimized version with zero-padding to simplify convolution
- weights/: Contains the pre-trained weights of the network
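To give a feel for how the serial and OpenACC versions can differ, here is a minimal sketch of a single layer annotated with an OpenACC directive. It is an illustration only: the function name `relu_inplace` and the choice of a `copy` data clause are assumptions, not code taken from this repository.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the repository): in-place ReLU over a flat
 * activation buffer. With an OpenACC compiler (for example nvc -acc) the
 * loop is offloaded to the GPU. A compiler without OpenACC support ignores
 * the pragma and runs the same loop serially. */
void relu_inplace(float *x, int n)
{
    assert(x != NULL && n >= 0);

    /* copy(x[0:n]) transfers the buffer to the device before the loop
     * and back to the host afterwards. */
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; ++i) {
        if (x[i] < 0.0f) {
            x[i] = 0.0f;
        }
    }
}
```

Copying the buffer in and out around every layer is the simplest option but rarely the fastest. Keeping activations resident on the device across layers is a common alternative.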
The network follows a classic convolution–pooling–fully connected design:
- Input: 32×32×3 RGB images
- Layers: Three Conv + ReLU blocks with zero-padding, each followed by 2×2 max-pooling
- Classifier: One fully connected layer with 10 outputs followed by a softmax
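The idea of padding the input before convolution can be sketched as follows. This is a simplified single-channel illustration under stated assumptions (odd kernel size, stride 1, no bias), and the names `pad_channel` and `conv_padded` are hypothetical rather than the functions used in the padding/ directory.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy an h x w channel into the centre of a
 * zero-filled (h + 2p) x (w + 2p) buffer. */
void pad_channel(const float *in, int h, int w, int p, float *out)
{
    assert(in && out && h > 0 && w > 0 && p >= 0);
    int pw = w + 2 * p;
    int ph = h + 2 * p;
    memset(out, 0, sizeof(float) * (size_t)ph * (size_t)pw);
    for (int y = 0; y < h; ++y) {
        memcpy(out + (size_t)(y + p) * pw + p,
               in + (size_t)y * w,
               sizeof(float) * (size_t)w);
    }
}

/* Hypothetical helper: k x k convolution (cross-correlation) with stride 1
 * over a channel that was padded with p = k / 2 zeros per side. The output
 * is h x w, and the inner loops need no boundary checks because every
 * index falls inside the padded buffer. */
void conv_padded(const float *padded, int h, int w, int k,
                 const float *kernel, float *out)
{
    assert(padded && kernel && out && k > 0 && k % 2 == 1);
    int pw = w + k - 1; /* padded row length: w + 2 * (k / 2) for odd k */
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int ky = 0; ky < k; ++ky) {
                for (int kx = 0; kx < k; ++kx) {
                    sum += padded[(y + ky) * pw + (x + kx)] * kernel[ky * k + kx];
                }
            }
            out[y * w + x] = sum;
        }
    }
}
```

Removing the per-pixel boundary test from the innermost loop is one plausible reason a padded variant runs faster in serial code, at the cost of an extra copy of the input.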
This project uses the nvc compiler from the NVIDIA HPC SDK, which requires an NVIDIA GPU and the appropriate drivers. If you prefer to run the serial code with a standard compiler, change the Makefile in the target directory:

    CC = gcc
    CFLAGS = -std=c11 -Wall -Wextra -march=native

To run the code you need the CIFAR-10 dataset in binary format. Download and extract:
https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
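In the binary version of CIFAR-10, each image is stored as a 3073-byte record: one label byte (0 to 9) followed by 3072 pixel bytes (1024 red, then 1024 green, then 1024 blue, each plane in row-major order). The sketch below shows one way to decode such records. The function names and the scaling of pixels to [0, 1] are assumptions for illustration and may differ from what main.c does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum {
    CIFAR_PIXELS = 32 * 32 * 3,     /* 3072 pixel bytes per image */
    CIFAR_RECORD = 1 + CIFAR_PIXELS /* label byte + pixels */
};

/* Hypothetical helper: decode one record into floats and return its label.
 * Scaling to [0, 1] is an assumption, not necessarily the preprocessing
 * used by this project. */
int cifar_decode_record(const uint8_t *rec, float *pixels)
{
    assert(rec && pixels);
    for (int i = 0; i < CIFAR_PIXELS; ++i) {
        pixels[i] = rec[1 + i] / 255.0f;
    }
    return rec[0];
}

/* Hypothetical helper: read up to max_records records from one batch file.
 * Returns the number of records read, or 0 if the file cannot be opened. */
size_t cifar_read_file(const char *path, float *pixels, int *labels,
                       size_t max_records)
{
    FILE *f = fopen(path, "rb");
    if (!f) {
        return 0;
    }
    uint8_t rec[CIFAR_RECORD];
    size_t n = 0;
    while (n < max_records && fread(rec, 1, CIFAR_RECORD, f) == CIFAR_RECORD) {
        labels[n] = cifar_decode_record(rec, pixels + n * CIFAR_PIXELS);
        ++n;
    }
    fclose(f);
    return n;
}
```

Each of the five training batch files holds 10,000 such records, which gives the 50,000 training images.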
After extraction, set the dataset path in main.c using the DATA_FOLDER constant, for example:

    const char* DATA_FOLDER = "../../cifar-10-batches-bin";

Make sure the path points to the folder that contains the binary files (data_batch_1.bin, data_batch_2.bin, etc.). By default the code uses all 50,000 training images. You can change the number of images in main.c:

    #define NUM_IMAGES 50000

Each version directory contains its own Makefile. Example usage:

    cd openacc
    make
    ./cnn-cifar10

Compile with -g for debugging builds:

    make debug
    ./cnn-cifar10-debug

Compile with -pg for profiling support:

    make profile
    ./cnn-cifar10-profile

To remove compiled artifacts:

    make clean

For 50,000 images the CNN achieves 78.84% accuracy. The program prints execution times for major stages (data loading, network initialization, inference).
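Per-stage timing of this kind can be collected with a small wrapper around each stage. The following is a minimal sketch using the C11 `timespec_get` function. The names `wall_seconds`, `time_stage`, and `count_stage` are hypothetical, and the project's own timing code may use a different clock.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: current wall-clock time in seconds (C11).
 * TIME_UTC is not guaranteed to be monotonic, so this is only a sketch. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: run one stage and return its duration in seconds. */
double time_stage(void (*stage)(void *), void *ctx)
{
    assert(stage);
    double t0 = wall_seconds();
    stage(ctx);
    return wall_seconds() - t0;
}

/* Placeholder stage for illustration: counts how often it was called. */
void count_stage(void *ctx)
{
    ++*(int *)ctx;
}
```

A call such as `time_stage(count_stage, &calls)` returns the elapsed seconds, which can then be printed with `printf("%f seconds\n", t)`. Wall-clock time is used here rather than `clock()` because CPU time can misrepresent the duration of work that is offloaded to a GPU.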
The following table summarizes the execution times (in seconds) of the different code versions on the CIFAR-10 dataset (50,000 images). Measurements were taken on a machine with an AMD Ryzen 7 5800X CPU and an NVIDIA GeForce RTX 4070 Ti 12 GB GPU, running Ubuntu 20.04.5 LTS and NVIDIA HPC SDK 22.11. Speedup is the total time of the serial version without padding divided by the total time of each version.
| Time (seconds) | Serial | Serial (padding) | Parallel | Parallel (padding) |
|---|---|---|---|---|
| Net Forward | 751.8657 | 559.2341 | 5.6889 | 5.9116 |
| Conv | 742.4356 | 549.7660 | 3.1210 | 3.3586 |
| ReLU | 4.3622 | 4.3845 | 0.8783 | 0.9097 |
| Pool | 4.6074 | 4.6209 | 1.1799 | 1.2124 |
| FC | 0.4396 | 0.4416 | 0.1295 | 0.1292 |
| Softmax | 0.0083 | 0.0085 | 0.0054 | 0.0053 |
| Total time | 752.4512 | 559.8212 | 6.3366 | 6.5568 |
| Speedup (×) | – | 1.34 | 118.75 | 114.76 |
Example raw output:

    $ ./cnn-cifar10
    Serial Code
    CNN for 50000 images
    Loading input batch 1...
    Loading input batch 2...
    Loading input batch 3...
    Loading input batch 4...
    Loading input batch 5...
    Load Data time:0.518275 seconds
    Create Network time:0.000006 seconds
    Load Network Parameters time:0.003319 seconds
    Create Ouputs time:0.000235 seconds
    Net Forward total time:724.833147 seconds
    Time for conv1: 235.903980 seconds
    Time for relu1: 3.219701 seconds
    Time for pool1: 3.214649 seconds
    Time for conv2: 371.571337 seconds
    Time for relu2: 0.921022 seconds
    Time for pool2: 0.946631 seconds
    Time for conv3: 108.097672 seconds
    Time for relu3: 0.246505 seconds
    Time for pool3: 0.248356 seconds
    Time for fc: 0.442517 seconds
    Time for softmax: 0.008063 seconds
    Conv: 715.572989 seconds
    ReLU: 4.387228 seconds
    Pool: 4.409636 seconds
    FC: 0.442517 seconds
    Softmax: 0.008063 seconds
    Net Accuracy: 78.84 %
    Net Accuracy time:0.001737 seconds
    Free memory time:0.028954 seconds
    Total time:725.385673 seconds
    END!
