# CNN_PROJ_KU_AR
## Printed Kurdish/Arabic Script Classification
This workspace refactors the original repository into a reproducible experiment layout for printed Kurdish vs Arabic script classification.
### Environment
```powershell
conda activate spade
```
### Default synthetic dataset roots
- Kurdish: `E:\DSs\MLP\code\TRDG\data\text\350k\imgs\ku_350`
- Arabic: `E:\DSs\MLP\code\TRDG\data\text\350k\imgs\ar_350`
These are already wired into `configs/publication/data.yaml` and `experiments/configs/baseline.yaml`.
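Because these defaults are absolute Windows paths, it can help to confirm they exist before starting a long run. The following is a minimal sketch, not part of the repository: `missing_roots` and `DEFAULT_ROOTS` are hypothetical names, and the paths are simply copied from the list above.

```python
from pathlib import Path

# Default roots as listed above; adjust them if your data lives elsewhere.
DEFAULT_ROOTS = {
    "kurdish": r"E:\DSs\MLP\code\TRDG\data\text\350k\imgs\ku_350",
    "arabic": r"E:\DSs\MLP\code\TRDG\data\text\350k\imgs\ar_350",
}


def missing_roots(roots):
    """Return the names of dataset roots that are not existing directories."""
    return [name for name, root in roots.items() if not Path(root).is_dir()]


if __name__ == "__main__":
    missing = missing_roots(DEFAULT_ROOTS)
    if missing:
        print("Missing dataset roots:", ", ".join(missing))
    else:
        print("All dataset roots found.")
```

If a root is reported missing, update the corresponding entry in the two config files named above.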
### Core commands
Train the baseline and export figures/tables:
```powershell
python scripts\train_baseline.py --config experiments\configs\baseline.yaml
```
Run the full nested CV protocol:
```powershell
python scripts\run_full_cv.py --config experiments\configs\baseline.yaml
```
Run every ablation suite:
```powershell
python scripts\run_all_ablations.py
```
Train and evaluate all model families plus the baseline's combined CNN+KNN/SVM/RF summaries:
```powershell
python scripts\run_all_models.py --config experiments\configs\all_models.yaml
```
Run the full all-model suite and then every ablation suite:
```powershell
python scripts\run_everything.py
```
To have the baseline command run CV first and then use the recommended hyperparameters for the final held-out test, add `--run-cv-first`:
```powershell
python scripts\train_baseline.py --config experiments\configs\baseline.yaml --run-cv-first
```
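The idea behind running CV first can be sketched as follows. This is an illustration only: how the scripts actually choose the recommended hyperparameters is not documented here, so the "highest mean validation score" rule, the `recommend_hyperparameters` helper, and the numbers below are all assumptions made up for the example, not results from this project.

```python
from statistics import mean


def recommend_hyperparameters(cv_results):
    """Pick the candidate with the highest mean validation score across folds.

    cv_results maps a candidate name to its list of per-fold validation scores.
    """
    if not cv_results:
        raise ValueError("cv_results must contain at least one candidate")
    return max(cv_results, key=lambda name: mean(cv_results[name]))


# Made-up per-fold validation accuracies for two hypothetical settings.
example = {
    "lr=1e-3": [0.90, 0.92, 0.91, 0.93, 0.92],
    "lr=1e-4": [0.88, 0.89, 0.90, 0.88, 0.89],
}
best = recommend_hyperparameters(example)
print("Recommended setting:", best)
```

The selected setting would then be used to retrain on the full training portion before the single evaluation on the held-out test set.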
### What is implemented
- Fixed held-out test split with no train/validation/test leakage.
- Stratified 5-fold CV on the training portion only.
- Synthetic TRDG loader plus KHATT and IFN/ENIT adapter modules.
- OCR augmentations: resize/normalize, blur, rotation, perspective, affine, optional Mixup.
- CNN baseline with `32 -> 64 -> 128` conv blocks and configurable `GroupNorm`, `BatchNorm`, or `LayerNorm`-style normalization.
- Embedding baselines on the 128-d penultimate representation: CNN head, KNN, SVM, and RandomForest.
- Automatic metrics, summaries, plots, tables, checkpoints, and paper-ready placeholders.
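The split protocol in the first two bullets can be illustrated with a small standard-library sketch. It is not the code in `src/data/splits.py`: the function names, the 20% test fraction, the seeds, and the label encoding are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split indices into (train, test), keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, n_folds=5, seed=0):
    """Yield (train, val) index lists for stratified k-fold CV over `indices`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in indices:
        by_class[labels[index]].append(index)
    folds = [[] for _ in range(n_folds)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, index in enumerate(class_indices):
            folds[position % n_folds].append(index)
    for k in range(n_folds):
        val = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, val


# Toy labels: 0 = Kurdish, 1 = Arabic (this encoding is an assumption).
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.2, seed=42)
splits = list(stratified_folds(train_idx, labels, n_folds=5, seed=42))
```

The test indices are set aside once and never passed to `stratified_folds`, so no fold can see a held-out sample.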
### Main layout
- `src/data/datasets.py`
- `src/data/splits.py`
- `src/data/samplers.py`
- `src/data/augmentations.py`
- `src/data/loaders/`
- `src/models/`
- `src/training/`
- `scripts/run_all_models.py`
- `scripts/run_everything.py`
- `src/postprocess/`
- `src/utils/`
- `experiments/configs/`
- `outputs/`
- `best/report/publication/`
### Notes
- The `synthetic_real_mix` ablation is wired for KHATT and IFN/ENIT, but the real dataset roots in `experiments/configs/synthetic_real_mix.yaml` are placeholders. Replace them if those datasets live somewhere else on your machine.
- Legacy entrypoints from the earlier refactor remain in place where practical, but the new commands above are the intended workflow.