This project implements a CNN-based OCR pipeline for the SVHN (Street View House Numbers) dataset and studies how data augmentation, architectural regularization (BatchNorm and Dropout), and training strategies such as learning rate scheduling affect performance.
- Task: Single-digit classification (0-9)
- Dataset: SVHN Cropped Digits (32×32 RGB)
- Framework: PyTorch
- Model: Simple CNN (Conv -> ReLU -> Pool x 3 + FC classifier)
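A minimal sketch of the baseline architecture described above. The channel widths (32/64/128) and kernel sizes are illustrative assumptions; the actual project values may differ.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Three Conv -> ReLU -> Pool stages followed by a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 8x8 -> 4x4
        )
        self.classifier = nn.Linear(128 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```

Each pooling stage halves the 32×32 SVHN input, so the classifier sees a 128×4×4 feature map.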
- SVHN images are loaded from .mat files and converted to tensors with RGB values scaled to [0, 1].
- The training set is split into train and validation subsets.
- Dataset-specific mean and standard deviation are computed and used for normalization.
- Data augmentation is applied only to the training split (validation and test remain unaugmented).
- A CNN extracts visual features, followed by a fully connected classifier for digit prediction.
- The model is trained using cross-entropy loss, Adam optimizer, and a cosine learning rate schedule.
- Validation performance is monitored to select the best checkpoint.
- Final evaluation is performed on the test set (separate from validation and training sets).
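The training loop described above (cross-entropy loss, Adam, cosine schedule, best-checkpoint selection on validation accuracy) can be sketched as below. The learning rate and function signature are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=5, device="cpu"):
    """Cross-entropy + Adam + cosine LR schedule; keeps the best-validation checkpoint."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    best_acc, best_state = 0.0, copy.deepcopy(model.state_dict())

    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        scheduler.step()

        # Monitor validation accuracy to select the best checkpoint.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                pred = model(x.to(device)).argmax(dim=1)
                correct += (pred == y.to(device)).sum().item()
                total += y.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)
    return model, best_acc
```

Restoring the best checkpoint at the end means the final test evaluation uses the epoch that generalized best to the validation split, not simply the last epoch.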
To improve model performance and understand what helps generalization, several simple changes were added on top of the baseline CNN:
- Data augmentation was added during training to introduce small variations (such as rotation and brightness changes) and make the model more robust.
- Dataset-specific normalization was applied by computing the mean and standard deviation from the SVHN training data instead of using generic default values.
- BatchNorm is applied after convolution to stabilize learned features by normalizing each feature map using batch statistics (mean and variance).
- Dropout is applied after pooling to regularize higher-level features by randomly zeroing out entire feature maps during training.
- Weight decay was used in the optimizer to discourage overly large weights.
- A cosine learning rate schedule was applied to gradually reduce the learning rate during training.
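The improvements listed above can be combined into one sketch: BatchNorm after each convolution, channel-wise Dropout after pooling, weight decay in the optimizer, and a cosine learning-rate schedule. The dropout probability, weight-decay strength, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class RegularizedCNN(nn.Module):
    """Conv -> BatchNorm -> ReLU -> Pool -> Dropout2d, repeated three times."""
    def __init__(self, num_classes=10, p_drop=0.1):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.BatchNorm2d(c_out),   # normalize feature maps with batch statistics
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Dropout2d(p_drop),    # zero out entire feature maps during training
            )
        self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.classifier = nn.Linear(128 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = RegularizedCNN()
# Weight decay discourages large weights; cosine annealing decays the LR each epoch.
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
```

Dropout2d drops whole feature maps rather than individual pixels, which matches the "randomly zeroing out entire feature maps" behavior described above.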
The baseline model was trained using raw SVHN images with only basic normalization and no additional regularization techniques.
- Best validation accuracy: ~90.3%
- Test accuracy: 89.5%
Training and validation loss decreased steadily across the five training epochs, and the gap between training and validation accuracy remained small throughout. This indicates that the baseline model generalized well and did not significantly overfit. For SVHN, which contains relatively clean and well-centered digits, this simple setup already performed strongly.
In the second experiment, mild data augmentation (small rotations and brightness jitter) was applied during training.
- Best validation accuracy: ~91.9%
- Test accuracy: 87.5%
While validation accuracy improved compared to the baseline, test accuracy dropped. This suggests that the augmented training data introduced variations that did not fully match the distribution of the SVHN test set. As a result, the model became better at handling augmented samples but slightly worse at generalizing to the true test data.
This highlights that data augmentation must be carefully tuned and is not always beneficial.
In the final setup, all improvements were combined:
- Dataset-specific normalization
- Batch Normalization and Dropout
- Weight decay and cosine learning rate scheduling
- Data augmentation
- Best validation accuracy: ~93.1%
- Test accuracy: 81.9%
Although validation performance improved significantly, test performance dropped further. This indicates that the combined regularization and augmentation made the model fit the training and validation data better, but also increased the mismatch with the test distribution. The model likely became over-adapted to the augmented and normalized training setup.
- The baseline model generalized best to the SVHN test set.
- Data augmentation improved validation accuracy but reduced test accuracy when it did not align with the test distribution.
- Additional regularization techniques improved training stability and validation performance but did not guarantee better test performance.
- Validation accuracy alone is not sufficient to judge real generalization.
Overall, these experiments show that while advanced techniques can improve training behavior, simpler models may generalize better when the dataset is already clean and well-structured, as in the case of SVHN.
There are several ways this project could be improved in the future:
- Stronger architectures: Instead of a simple CNN, more advanced architectures such as ResNet could be used. Residual connections help train deeper networks and may improve accuracy without significantly increasing training difficulty.
- Refine data augmentation: Use augmentations that better match the SVHN test data, such as focusing more on lighting changes and less on geometric transformations.
- Train for longer: Running training for more epochs may allow the model to benefit more from regularization techniques.
- Try different optimizers: Using SGD with momentum or adding label smoothing could improve generalization.
- Analyze errors: Looking at misclassified images could help identify common failure cases and guide improvements.