Author: Mohammed Aqeel Ismail
Institution: School of Mathematics, Statistics and Computer Science,
College of Agriculture, Engineering and Science,
University of KwaZulu-Natal, Pietermaritzburg Campus,
Pietermaritzburg 3201, RSA
Year: 2025
⦁ Overview
⦁ Key Features
⦁ System Requirements
⦁ Installation
⦁ Quick Start
⦁ Dataset Preparation
⦁ Configuration
⦁ Usage Guide
⦁ Output Files
⦁ Performance Benchmarks
⦁ Troubleshooting
⦁ Project Structure
⦁ Citation
⦁ License
This system implements a supervised machine learning pipeline for classifying mobile application network traffic using the MIRAGE-2019 dataset. The primary goal is to enable Quality of Service (QoS) optimization in mobile networks through accurate traffic classification without payload inspection, ensuring user privacy while maintaining high performance.
Please note: The dataset was too large to upload to GitHub but can be downloaded from Kaggle: https://www.kaggle.com/datasets/programmerrdai/mirage-2019
The downloaded dataset contains zip archives for two devices, a Google Nexus and a Xiaomi Mi5. For this project, the JSON files from both devices should be combined into a single folder named "data". The script opens that folder, loads each JSON file, extracts the label from the filename, extracts flow-level statistical features from each file, and combines everything into a single CSV file (Mirage_flows.csv).
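For reference, the aggregation step can be sketched roughly as follows. This is a simplified illustration, not the exact code in network_traffic_classification.py: the pandas json_normalize call stands in for however the script flattens the nested flow_features dictionaries into columns such as packet_length_biflow_max, and the label parsing is reduced to taking the package name from the filename.

import os
import glob
import json
import pandas as pd

def aggregate_flows(data_folder="data", out_csv="output/Mirage_flows.csv"):
    rows = []
    for path in glob.glob(os.path.join(data_folder, "*.json")):
        # The script's extract_label() maps the package name in the filename to a label;
        # here we simply take the package segment between the first two underscores.
        label = os.path.basename(path).split("_")[1]
        with open(path, "r") as f:
            flows = json.load(f)
        for flow_id, flow in flows.items():
            # Flatten the nested flow_features dict into columns such as
            # packet_length_biflow_max, iat_upstream_flow_mean, ...
            features = pd.json_normalize(flow["flow_features"], sep="_").iloc[0].to_dict()
            features["flow_id"] = flow_id
            features["Label"] = label
            rows.append(features)
    df = pd.DataFrame(rows)
    os.makedirs(os.path.dirname(out_csv), exist_ok=True)
    df.to_csv(out_csv, index=False)
    return df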
Also note: The model.pkl file is not available with the rest of the files due to its large size. However, if you download the script with code and dataset and set them up as outlined in this document, the script will train the model and create your own model.pkl locally, along with other output files in the output folder. The script automatically creates the "output" and "charts" folders if they don't already exist.
Note: I have managed to upload the whole project, including the dataset used for this project (under the 'data' folder), to Google Drive. Here is the link: https://drive.google.com/drive/folders/1mQd6-YC9lhvdlCPILjPQdQCWuWPrW8az?usp=sharing
Privacy-Preserving Classification
⦁ No packet payload inspection required
⦁ Works with encrypted traffic (HTTPS, VPN, TLS)
⦁ Compliant with data protection regulations (GDPR)
High Accuracy Performance
⦁ 89.76% overall classification accuracy
⦁ Consistent performance across all QoS priority levels
⦁ Robust handling of class imbalance
Intelligent QoS Policy Generation
⦁ Automatic mapping from applications to QoS requirements
⦁ Priority-based resource allocation recommendations
⦁ Bandwidth, latency, and jitter tolerance specifications
Comprehensive Evaluation
⦁ Detailed performance metrics (Accuracy, F1-Score, Precision, Recall)
⦁ Visual analytics (Confusion Matrix, Distribution Charts)
⦁ Per-application and per-priority-level analysis
Flexible & Configurable
⦁ Toggle between SMOTE and Random Oversampling
⦁ Optional Linear Discriminant Analysis (LDA)
⦁ Adjustable Random Forest hyperparameters
This implementation is part of research on "Network Traffic Classification of 5G networks for Improvement of QoS". The system addresses the challenge of identifying application types in modern encrypted networks to enable intelligent QoS policies.
Problem Addressed:
⦁ Traditional port-based classification: 30-70% accuracy (obsolete)
⦁ Deep Packet Inspection (DPI): Ineffective with encryption, privacy concerns
⦁ Our Solution: Flow-level statistical features + Random Forest = 89.76% accuracy
⦁ Classifies 20 mobile application types from MIRAGE-2019 dataset
⦁ Uses flow-level statistical features (packet size, inter-arrival time, flow duration)
⦁ Random Forest ensemble classifier with 300 trees
Each classified application receives detailed QoS recommendations:
⦁ Priority Level: High, Medium, Low
⦁ Bandwidth Requirements: Low, Medium, High, Variable
⦁ Latency Sensitivity: Low, Medium, High, Very High
⦁ Jitter Tolerance: Very Low, Low, Medium, High
| Category | Applications | QoS Priority |
|---|---|---|
| Navigation | Waze | High |
| VoIP/Messaging | Viber, Messenger | High |
| Gaming | Slither.io | High |
| Music Streaming | Spotify | High |
| Video Streaming | YouTube | Medium |
| Social Media | Facebook, Twitter, Pinterest | Medium |
| E-Commerce | Wish, Subito, Groupon | Medium |
| Travel | TripAdvisor, Foursquare | Medium |
| Weather | AccuWeather | Medium |
| Sports | Motain (iLiga) | Medium |
| Cloud Storage | Dropbox | Low |
| Productivity | Trello, Duolingo | Low |
| Entertainment | Comics Reader | Low |
⦁ Class Imbalance Handling: Random Oversampling (preferred) or SMOTE
⦁ Feature Scaling: StandardScaler normalization
⦁ Dimensionality Reduction: Optional LDA
⦁ Model Persistence: Save/load trained models for reuse
⦁ Application traffic distribution bar chart (color-coded by priority)
⦁ QoS priority distribution pie chart
⦁ Latency sensitivity distribution pie chart
⦁ Bandwidth requirements distribution pie chart
⦁ Confusion matrix heatmap
⦁ Operating System: Windows 10/11, macOS 10.14+, or Linux (Ubuntu 18.04+)
⦁ Python: 3.8 or higher
⦁ RAM: 8GB (16GB recommended for full dataset)
⦁ Storage: 5GB free space
⦁ CPU: Multi-core processor (4+ cores recommended)
⦁ IDE (Optional): Visual Studio Community 2022 or any Python IDE
⦁ RAM: 16GB or higher
⦁ CPU: Intel i5/i7 or AMD Ryzen 5/7 (8+ cores)
⦁ Storage: SSD for faster data processing
⦁ GPU: Not required (CPU-only implementation)
| Package | Version | Purpose |
|---|---|---|
| pandas | ≥ 1.3.0 | Data manipulation and analysis |
| numpy | ≥ 1.21.0 | Numerical computing |
| scikit-learn | ≥ 1.0.0 | Machine learning algorithms |
| matplotlib | ≥ 3.5.0 | Visualization and plotting |
| seaborn | ≥ 0.11.0 | Statistical data visualization |
| imbalanced-learn | ≥ 0.8.0 | Oversampling techniques |
| joblib | ≥ 1.1.0 | Model persistence |
| scipy | ≥ 1.7.0 | Scientific computing |
- Download from: https://visualstudio.microsoft.com/downloads/
- During installation, select Python development workload
- Ensure Python 3.8+ is included in the installation
# Option A: Clone with Git
git clone https://github.com/tomRiddle-the1st/network_traffic_classification.git
# Option B: Download ZIP from GitHub
# Extract to your desired location

- Launch Visual Studio Community 2022
- Select File → Open → Folder
- Navigate to and select the network_traffic_classification folder; Visual Studio will automatically detect the Python project
- Go to View → Other Windows → Python Environments
- Click + Add Environment (or the gear icon)
- Select Virtual Environment
- Choose Python 3.8 or higher as the base interpreter
- Set the environment location to the venv folder in the project directory
- Click Create
- Wait for environment creation (may take 1-2 minutes)
Option A: Using Package Manager (GUI)
- In Python Environments window, right-click your virtual environment
- Select Manage Python Packages
- For each package listed below, search for it and click Run command: pip install
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- imbalanced-learn
- joblib
- scipy
Option B: Using Terminal (Faster)
- Open View → Terminal (or press Ctrl + ~)
- Create a requirements file in the project root with the following contents:
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
seaborn>=0.11.0
imbalanced-learn>=0.8.0
joblib>=1.1.0
scipy>=1.7.0
In the Visual Studio terminal, run:
python --version # Should show Python 3.8+
pip list # Should show all installed packages

- Create a data/ folder in your project root
- Download the MIRAGE-2019 dataset from Kaggle
- Extract and merge all JSON files from both device folders into data/
Option A: Using Debug/Run
- Open network_traffic_classification.py in the editor
- Press F5 (Start Debugging) or Ctrl+F5 (Start Without Debugging)
- View output in the integrated terminal
Option B: Using Python Interactive
- Right-click anywhere in the editor
- Select Execute in Python Interactive
- View output and charts in the interactive window
Option C: Using Terminal
- Open View → Terminal
- Run:
python network_traffic_classification.py

git clone https://github.com/tomRiddle-the1st/network_traffic_classification.git
cd network_traffic_classification

Create a requirements file containing:

pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
seaborn>=0.11.0
imbalanced-learn>=0.8.0
joblib>=1.1.0
scipy>=1.7.0
# Install all requirements
pip install -r requirements  # install from the requirements file shown above
# Verify installation
python -c "import pandas, numpy, sklearn, matplotlib, seaborn, imblearn; print('All dependencies installed successfully!')"

python --version # Should show Python 3.8+
pip list | grep scikit-learn # Verify scikit-learn installation

Install packages individually:
pip install "pandas>=1.3.0"
pip install "numpy>=1.21.0"
pip install "scikit-learn>=1.0.0"
pip install "matplotlib>=3.5.0"
pip install "seaborn>=0.11.0"
pip install "imbalanced-learn>=0.8.0"
pip install "joblib>=1.1.0"
pip install "scipy>=1.7.0"

The script automatically creates the output/ and charts/ folders, but you need to manually create and populate the data/ folder:
project-root/
├── network_traffic_classification.py # Main script
├── README.md # This file
└── data/ # CREATE THIS MANUALLY
├── com.waze_*.json
├── com.google.android.youtube_*.json
├── com.spotify.music_*.json
└── ... (other application JSON files from BOTH devices)
# These folders are created automatically by the script:
├── output/ # Auto-created by the script (or create it manually); output files are saved here
│ ├── Mirage_flows.csv
│ ├── model.pkl
│ ├── feature_scaler.pkl
│ └── lda_transformer.pkl (if LDA enabled)
└── charts/ # Auto-created by the script (or create it manually); PNG charts are saved here
├── traffic_distribution.png
├── priority_distribution.png
├── latency_sensitivity.png
├── bandwidth_requirements.png
└── confusion_matrix.png
- Visit: https://www.kaggle.com/datasets/programmerrdai/mirage-2019
- Download the MIRAGE-2019 dataset (zip files)
- Extract the downloaded files - you'll see folders for Google Nexus and Xiaomi Mi5
- IMPORTANT: Create a data/ folder in your project root
- Merge all JSON files from BOTH device folders into the single data/ folder:

data/
├── 1494419517_com.twitter.android_MIRAGE-2019_traffic_dataset_labeled_biflows.json
├── 1494434240_com.google.android.youtube_MIRAGE-2019_traffic_dataset_labeled_biflows.json
├── 1494508157_com.spotify.music_MIRAGE-2019_traffic_dataset_labeled_biflows.json
├── 1511195631_com.facebook.katana_MIRAGE-2019_traffic_dataset_labeled_biflows.json
├── 1511197686_com.facebook.orca_MIRAGE-2019_traffic_dataset_labeled_biflows.json
├── 1494596297_air.com.hypah.io.slither_MIRAGE-2019_traffic_dataset_labeled_biflows.json
└── ... (all other JSON files from both devices)

- The script will automatically process all JSON files in this folder regardless of device origin
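If you prefer to do the merge with a short script rather than by hand, something along these lines works; the extracted folder names here are placeholders and must be adjusted to whatever the Kaggle archives unpack to:

import glob
import os
import shutil

# Placeholder names for the two extracted device folders -- adjust to your extraction.
extracted_dirs = ["MIRAGE-2019_Nexus", "MIRAGE-2019_Mi5"]
os.makedirs("data", exist_ok=True)

for folder in extracted_dirs:
    # Copy every JSON file (including those in subfolders) into data/
    for path in glob.glob(os.path.join(folder, "**", "*.json"), recursive=True):
        shutil.copy(path, os.path.join("data", os.path.basename(path)))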
Ensure your JSON files follow the MIRAGE-2019 format:
{
"192.168.20.101,51221,216.58.205.42,443,6": { #flow_id
"packet_data": {
"src_port": [ 51221, 443, 51221, ... ],
"dst_port": [ 443, 51221,...],
"packet_dir": [ 0, 1, 0,...],
"L4_payload_bytes": [ 1368, 0, 1368, ...],
"iat": [ 0, 0.05589914321899414, 0.2340989112854004,... ],
"TCP_win_size": [ 1544, 725, ... ],
"L4_raw_payload": [....}
"flow_features": {
"packet_length": {
"biflow": {
"min": 52.0,
"max": 1420.0,
"mean": 381.9183673469388,
"std": 469.20914081665745,
"var": 220157.2178259059,
"mad": 77.0,
"skew": 1.334384411480604,
"kurtosis": 0.3833087760022593,
"10_percentile": 52.0,
"20_percentile": 52.0,
"30_percentile": 52.0,
"40_percentile": 52.0,
"50_percentile": 129.0,
"60_percentile": 168.19999999999993,
"70_percentile": 500.3999999999984,
"80_percentile": 685.2000000000011,
"90_percentile": 1420.0
}, ...
}

Using Visual Studio 2022:
- Open network_traffic_classification.py
- Press F5 or click the Start button
- View output in the integrated terminal
The system will execute the following pipeline:
NETWORK TRAFFIC CLASSIFICATION
Step 1: Aggregating flow data... #processes 1640 json files
Processing waze with 1234 flows
Processing youtube with 5678 flows
...
Saved aggregate dataset to output/Mirage_flows.csv
Step 2: Preprocessing...
Dataset shape: (121955, 102)
Number of apps: 20
Apps with number of flows:
Label
waze 11865
motain 11048
accuweather 10682
... (other applications)
Saved bar chart plot
[Bar chart displayed]
Step 3: Train model...
Applied Random Oversampling: (121955, 102) ...flows sampled to (237300, 102)
Class distribution after oversampling:
Label
twitter 11865
youtube 11865
spotify 11865
slither 11865
motain 11865
... (other applications)
Train set: (189840, 102)
Test set: (47460, 102)
Training model...
Model training completed!
Total training time: 2.00 minutes and 4.03 seconds
Step 4: Evaluating model performance...
Model Evaluation
Accuracy: 0.8976
Macro F1-Score: 0.8971
Weighted F1-Score: 0.8971
Classification Report:
precision recall f1-score support
accuweather 0.9044 0.8491 0.8759 2373.0000
comics 0.9256 0.9284 0.9270 2373.0000
dropbox 0.8902 0.8883 0.8893 2373.0000
duolingo 0.8344 0.8453 0.8399 2373.0000
facebook 0.8679 0.8858 0.8767 2373.0000
foursquare 0.8988 0.8837 0.8912 2373.0000
... (other applications)
QoS Evaluation
Performance by QoS Priority Level:
High Priority Apps: F1=0.9323 (n=5 apps)
Medium Priority Apps: F1=0.8845 (n=11 apps)
Low Priority Apps: F1=0.8904 (n=4 apps)
[Confusion matrix displayed]
Saved plot: charts/confusion_matrix.png
Top 10 Most Important Features:
packet_length_upstream_flow_max
packet_length_upstream_flow_std
packet_length_upstream_flow_var
packet_length_upstream_flow_90_percentile
packet_length_downstream_flow_max
packet_length_biflow_max
iat_upstream_flow_max
iat_biflow_max
iat_downstream_flow_max
packet_length_upstream_flow_mean
QoS Policy Recommendations for each app
accuweather:
Classification Confidence: 0.876
Recommended Priority: medium
Bandwidth Allocation: low
Latency Sensitivity: medium
Jitter Tolerance: high
comics:
Classification Confidence: 0.927
Recommended Priority: low
Bandwidth Allocation: medium
Latency Sensitivity: low
Jitter Tolerance: high
(...other applications)
Step 5: QoS analysis (with charts)...
[Pie charts displayed]
Priority Distribution:
high : 28920 flows ( 23.7%)
medium: 71205 flows ( 58.4%)
low : 21830 flows ( 17.9%)
Latency Sensitivity Distribution:
very_high : 3189 flows ( 2.6%)
high : 19116 flows ( 15.7%)
medium : 73567 flows ( 60.3%)
low : 26083 flows ( 21.4%)
Bandwidth Distribution:
high : 6493 flows ( 5.3%)
medium : 77930 flows ( 63.9%)
low : 32327 flows ( 26.5%)
variable : 5205 flows ( 4.3%)
Saved plot: charts/priority_distribution.png
Saved plot: charts/latency_sensitivity.png
Saved plot: charts/bandwidth_requirements.png
Saving Model and Metadata
Model and metadata saved to output/model.pkl
Feature scaler saved to output/feature_scaler.pkl
Execution completed successfully
Total execution time: 3.0 minutes and 6.14 seconds
Dataset Information:
⦁ Name: MIRAGE-2019
⦁ Source: University of Naples Federico II
⦁ Applications: 20 popular Android apps
⦁ Devices: Google Nexus and Xiaomi Mi5
⦁ Format: JSON files with flow-level features
⦁ Size: Variable (typically 100MB - 5GB depending on apps selected)
Download Instructions:
- Visit: https://www.kaggle.com/datasets/programmerrdai/mirage-2019
- Download the MIRAGE-2019 dataset
- Extract JSON files from BOTH device folders to your data/ directory
The system automatically extracts labels from filenames. Supported patterns:
⦁ 1494419517_com.twitter.android_MIRAGE-2019_traffic_dataset_labeled_biflows --> twitter
⦁ 1494434240_com.google.android.youtube_MIRAGE-2019_traffic_dataset_labeled_biflows --> youtube
⦁ 1494508157_com.spotify.music_MIRAGE-2019_traffic_dataset_labeled_biflows --> spotify
⦁ 1511195631_com.facebook.katana_MIRAGE-2019_traffic_dataset_labeled_biflows --> facebook
⦁ 1511197686_com.facebook.orca_MIRAGE-2019_traffic_dataset_labeled_biflows --> messenger
⦁ 1494596297_air.com.hypah.io.slither_MIRAGE-2019_traffic_dataset_labeled_biflows --> slither
⦁ And more...
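As a rough illustration of the idea (the script's actual extract_label() may use a longer mapping), the label can be recovered by taking the Android package name between the first and second underscore of the filename:

import os

# Assumed mapping from Android package names to the labels used in this README.
PACKAGE_LABELS = {
    "com.twitter.android": "twitter",
    "com.google.android.youtube": "youtube",
    "com.spotify.music": "spotify",
    "com.facebook.katana": "facebook",
    "com.facebook.orca": "messenger",
    "air.com.hypah.io.slither": "slither",
    "com.waze": "waze",
}

def extract_label(filename):
    # Filenames look like: <timestamp>_<package>_MIRAGE-2019_traffic_dataset_labeled_biflows.json
    base = os.path.basename(filename)
    package = base.split("_")[1]
    return PACKAGE_LABELS.get(package, package)  # fall back to the raw package name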
To use your own dataset:
1. Format Requirements: Your JSON files must contain flow-level features in this structure:
{
" "192.168.20.101,51221,216.58.205.42,443,6"": { #flow_id
"packet_data": {
"src_port": [ 51221, 443, 51221, ... ],
"dst_port": [ 443, 51221,...],
"packet_dir": [ 0, 1, 0,...],
"L4_payload_bytes": [ 1368, 0, 1368, ...],
"iat": [ 0, 0.05589914321899414, 0.2340989112854004,... ],
"TCP_win_size": [ 1544, 725, ... ],
"L4_raw_payload": [....}
"flow_features": {
"packet_length": {
"biflow": {
"min": 52.0,
"max": 1420.0,
"mean": 381.9183673469388,
"std": 469.20914081665745,
"var": 220157.2178259059,
"mad": 77.0,
"skew": 1.334384411480604,
"kurtosis": 0.3833087760022593,
"10_percentile": 52.0,
"20_percentile": 52.0,
"30_percentile": 52.0,
"40_percentile": 52.0,
"50_percentile": 129.0,
"60_percentile": 168.19999999999993,
"70_percentile": 500.3999999999984,
"80_percentile": 685.2000000000011,
"90_percentile": 1420.0
}, ...
}

2. Update Label Extraction according to your dataset:
Modify the extract_label() function in the script to match your filename convention:
def extract_label(filename):
    # Add your custom logic here
    if "myapp" in filename:
        return "myapp"
    # ... existing logic

3. Add QoS Policies:
Define QoS for your applications in the qos_policies dictionary:
qos_policies = {
    'myapp': {
        'priority': 'high',
        'bandwidth': 'medium',
        'latency_sensitivity': 'high',
        'jitter_tolerance': 'low'
    }
}

Edit configuration and toggle variables at the top of network_traffic_classification.py:
# CONFIGURATION SETTINGS
# File paths
data_folder = "data" # Input data directory
output_folder = "output" #folder to saved model, scaler, lda and Mirage
os.makedirs(output_folder, exist_ok=True) # Output directory
charts_folder = "charts" #folder to save bar charts and pie charts
os.makedirs(charts_folder, exist_ok=True) # Charts directory
#gets saved to output folder
processed_data = os.path.join(output_folder, "Mirage_flows.csv")
Model = os.path.join(output_folder, "model.pkl")
scaler_file = os.path.join(output_folder, "feature_scaler.pkl")
lda_file = os.path.join(output_folder, "lda_transformer.pkl") #saves if Lda is true
# Machine Learning Techniques (Toggle on/off)
#toggle
use_SMOTE = False
use_random_oversampling = True
use_lda = False # Apply dimensionality reduction
# Display Settings
top_features = 10 # Number of top features to display

Modify these in the train_model() function:
rf_model = RandomForestClassifier(
n_estimators=300, # Number of trees in forest
max_depth=25, # Maximum depth of each tree
min_samples_split=5, # Min samples required to split node
min_samples_leaf=2, # Min samples required at leaf node
class_weight="balanced", # Handle class imbalance
random_state=42, # Seed for reproducibility
n_jobs=-1 # Use all CPU cores
)

Tuning Guidelines:
- Increase accuracy: increase n_estimators (200-500)
- Reduce overfitting: decrease max_depth (15-20)
- Speed up training: decrease n_estimators, increase min_samples_split
- Handle imbalance: keep class_weight="balanced"
Adjust in train_model() function:
X_train, X_test, y_train, y_test = train_test_split(
X_final, y_resampled,
test_size=0.2, # 20% for testing (adjust 0.1-0.3)
stratify=y_resampled, # Maintain class proportions
random_state=42 # Reproducible split
)

Random Oversampling (Current Default):
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_scaled, y)

SMOTE (Alternative):
from imblearn.over_sampling import SMOTE

smote = SMOTE(
    random_state=42,
    k_neighbors=5,            # Number of nearest neighbors
    sampling_strategy='auto'  # Balance all classes
)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)

When use_lda = True:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_components = min(
len(y.unique()) - 1, # Maximum LDA components
X_resampled.shape[1], # Current feature count
50 # User-defined maximum
)
lda = LinearDiscriminantAnalysis(n_components=n_components)

Define custom QoS policies for your applications in the script:
qos_policies = {
    'application_name': {
        'priority': 'high',             # Options: high, medium, low
        'bandwidth': 'medium',          # Options: low, medium, high, variable
        'latency_sensitivity': 'high',  # Options: low, medium, high, very_high
        'jitter_tolerance': 'low'       # Options: very_low, low, medium, high
    }
}

python network_traffic_classification.py

Edit the script first, then run:
# In network_traffic_classification.py
use_SMOTE = True
use_random_oversampling = False
use_lda = True
top_features = 20

Start Debugging (F5):
- Allows breakpoints and step-through debugging
- Useful for troubleshooting issues
- Slightly slower execution
Start Without Debugging (Ctrl+F5):
- Faster execution
- Better for production runs
- Shows complete output
Python Interactive:
- Right-click in editor → Execute in Python Interactive
- Keeps session alive for exploration
- Great for testing modifications
The saved model.pkl contains:
{
'model': RandomForestClassifier, # Trained model
'scaler': StandardScaler, # Feature scaler
'feature_names': list, # List of feature names
'classes': list, # Application labels
'qos_policies': dict, # QoS policy mappings
'configuration': { # Training configuration
'use_SMOTE': bool,
'use_random_oversampling': bool,
'use_lda': bool,
'top_features': int
}
}

The system prints detailed progress information:
Step 1: Aggregating flow data...
├─ Processing waze with 1234 flows
├─ Processing youtube with 5678 flows
└─ Saved aggregate dataset
Step 2: Preprocessing...
├─ Dataset shape: (121955, 102)
└─ [Bar chart displayed]
Step 3: Train model...
├─ Applied Random Oversampling
├─ Train set: (189840, 102)
├─ Test set: (47460, 102)
└─ Training time: 2.15 minutes
Step 4: Evaluating model...
├─ Accuracy: 0.8976
├─ Macro F1: 0.8971
├─ Weighted F1: 0.8971
├─ [Confusion matrix displayed]
└─ [QoS recommendations printed]
Step 5: QoS analysis...
└─ [Distribution charts displayed]
The system generates several plots automatically saved to the charts/ folder:
- Traffic Distribution (Bar Chart)
  ⦁ Shows number of flows per application
  ⦁ Color-coded by QoS priority level (blue=high, orange=medium, beige=low)
- Priority Distribution (Pie Chart)
  ⦁ High, medium, low priority breakdown
  ⦁ Percentage and flow count annotations
- Latency Sensitivity Distribution (Pie Chart)
  ⦁ Very high, high, medium, low sensitivity
  ⦁ Flow count for each category
- Bandwidth Requirements (Pie Chart)
  ⦁ Low, medium, high, variable bandwidth needs
  ⦁ Proportional representation
- Confusion Matrix (Heatmap)
  ⦁ Per-application classification accuracy
  ⦁ Identifies misclassification patterns
All output files are automatically saved to their respective folders:
output/ folder:
- Mirage_flows.csv - Aggregated and processed dataset
- model.pkl - Trained Random Forest model with metadata
- feature_scaler.pkl - StandardScaler for feature normalization
- lda_transformer.pkl - LDA transformer (only if use_lda = True)
charts/ folder:
- traffic_distribution.png - Application flow distribution bar chart
- priority_distribution.png - QoS priority pie chart
- latency_sensitivity.png - Latency sensitivity pie chart
- bandwidth_requirements.png - Bandwidth requirements pie chart
- confusion_matrix.png - Classification confusion matrix heatmap
Contains aggregated flow-level features from all JSON files:
- Columns: flow_id, Label, and 100+ statistical features
- Used for caching to speed up subsequent runs
- Delete this file to force re-processing of raw JSON data
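The caching logic amounts to something like the following sketch (aggregate_flows() here is the hypothetical helper sketched in the Overview section; the script's own function names may differ):

import os
import pandas as pd

processed_data = os.path.join("output", "Mirage_flows.csv")

if os.path.exists(processed_data):
    data = pd.read_csv(processed_data)               # fast path: reuse the cached dataset
else:
    data = aggregate_flows("data", processed_data)   # slow path: re-process the raw JSON files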
Complete model package including:
- Trained RandomForestClassifier
- Feature scaler (StandardScaler)
- Feature names list
- Application class labels
- QoS policy mappings
- Configuration settings used during training
StandardScaler object for normalizing features:
- Fitted on training data statistics
- Required for preprocessing new data
- Use when making predictions on new flows
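For predictions on new flows, a minimal sketch looks like this. It assumes new_flows is a DataFrame already holding the same 102 flow-level feature columns used in training, and it relies on the key names ('model', 'feature_names') listed in the model package contents above; if use_lda was enabled during training, output/lda_transformer.pkl must also be applied before predicting:

import joblib

# Load the saved model package and the fitted scaler.
package = joblib.load("output/model.pkl")
scaler = joblib.load("output/feature_scaler.pkl")

model = package["model"]
feature_names = package["feature_names"]

# Apply the same scaling used during training, then classify.
X_new = scaler.transform(new_flows[feature_names])
predicted_apps = model.predict(X_new)
confidences = model.predict_proba(X_new).max(axis=1)

for app, conf in zip(predicted_apps, confidences):
    print(f"{app}: confidence {conf:.3f}")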
Linear Discriminant Analysis transformer:
- Only created when use_lda = True
- Reduces feature dimensionality
- Preserves class discriminative information
| Configuration | Accuracy | Macro F1 | Weighted F1 | Training Time |
|---|---|---|---|---|
| Random Oversampling (Recommended) | 89.76% | 0.8971 | 0.8971 | 2-3 min |
| SMOTE | 82.06% | 0.8206 | 0.8206 | 3-5 min |
| Baseline (No Techniques) | 71.97% | 0.6706 | 0.7225 | 1-2 min |
| SMOTE + LDA | 75.64% | 0.7561 | 0.7561 | 2-3 min |
| Random Oversampling + LDA | 86.43% | 0.8631 | 0.8631 | 2-3 min |
Key Findings:
- Random Oversampling achieves best accuracy (89.76%)
- LDA reduces accuracy by ~3% but speeds up training
- Baseline without oversampling shows significant class bias
| Priority Level | Average F1-Score | Number of Apps |
|---|---|---|
| High Priority | 0.9323 | 5 apps |
| Medium Priority | 0.8845 | 11 apps |
| Low Priority | 0.8904 | 4 apps |
Analysis:
- High-priority apps (Waze, Viber, Messenger, Slither, Spotify) have highest accuracy
- Critical for QoS - correctly identifies latency-sensitive applications
- Medium priority apps slightly lower due to diverse characteristics
| Dataset Size | RAM Used | Processing Time | Recommended RAM |
|---|---|---|---|
| < 10K flows | 2-4 GB | 30 sec - 2 min | 8 GB |
| 10K-50K flows | 4-6 GB | 2-5 min | 8 GB |
| 50K-100K flows | 6-8 GB | 5-10 min | 16 GB |
| 100K-500K flows | 8-12 GB | 10-30 min | 16 GB |
| > 500K flows | 12-16 GB | 30-60 min | 32 GB |
Test System Specifications:
⦁ CPU: Intel Core i7-10700K (8 cores, 16 threads)
⦁ RAM: 16 GB DDR4
⦁ Storage: NVMe SSD
⦁ OS: Windows 10 / Ubuntu 20.04
Training Performance:
⦁ Dataset Size: 237,300 flows (after oversampling)
⦁ Features: 102 flow-level statistics
⦁ Training Time: 2 minutes 15 seconds
⦁ Prediction Time: <1ms per flow
⦁ Model Size: ~180 MB
Optimization Tips:
- Use SSD storage for 2-3x faster data loading
- Close unnecessary applications to free RAM
- Enable all CPU cores (n_jobs=-1 in RandomForestClassifier)
- Process dataset in batches if memory is limited (see the sketch below)
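One possible way to batch the processing (an illustration, not code from the script) is to read the cached CSV in chunks and downsample each chunk; the chunk size and sampling fraction below are arbitrary:

import pandas as pd

chunks = []
for chunk in pd.read_csv("output/Mirage_flows.csv", chunksize=20000):
    # Keep a fraction of each chunk to stay within RAM limits (adjust as needed).
    chunks.append(chunk.sample(frac=0.5, random_state=42))

data = pd.concat(chunks, ignore_index=True)
print(f"Loaded {len(data)} flows")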
Problem: MemoryError: Unable to allocate array
Solutions:
# Option 1: Reduce dataset size
sample_size = 50000
if len(data) > sample_size:
    data = data.sample(n=sample_size, random_state=42)
# Option 2: Reduce Random Forest complexity
rf_model = RandomForestClassifier(
n_estimators=100, # Reduced from 300
max_depth=15, # Reduced from 25
n_jobs=2 # Limit parallel jobs
)
# Option 3: Disable LDA
use_lda = False
# Option 4: Use SMOTE instead of Random Oversampling (generates less data)
use_SMOTE = True
use_random_oversampling = False

Problem: ModuleNotFoundError: No module named 'sklearn'
Solutions:
In Visual Studio 2022:
- Check Python Environments window
- Verify correct environment is active
- Right-click environment → Manage Python Packages
- Search and install missing packages
In Command Line:
# Reinstall dependencies
pip install -r requirements #check requirements
# Or install individually
pip install scikit-learn
pip install imbalanced-learn
# Verify installation
python -c "import sklearn; print(sklearn.__version__)"Problem: json.decoder.JSONDecodeError
Solution: The script already handles invalid JSON files automatically:
# Already implemented in process_file()
try:
    with open(file_path, 'r') as f:
        data = json.load(f)
except json.JSONDecodeError:
    print(f"Warning: Skipping invalid JSON file {file_path}")
    return pd.DataFrame()

If you still encounter errors:
- Check if JSON files are corrupted during download/extraction
- Re-download the dataset
- Verify file integrity with a JSON validator
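In place of an external JSON validator, a short scan like this (not part of the script) lists every file in data/ that fails to parse:

import glob
import json

# Report every corrupted or truncated JSON file in the data folder.
for path in glob.glob("data/*.json"):
    try:
        with open(path, "r") as f:
            json.load(f)
    except json.JSONDecodeError as err:
        print(f"Invalid JSON: {path} ({err})")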
Problem: FileNotFoundError: [Errno 2] No such file or directory: 'data'
Solution:
# Create data directory
mkdir data
# Verify JSON files exist
# Windows PowerShell:
Get-ChildItem data\*.json
# Linux/macOS/Git Bash:
ls data/*.json
# Check current directory
pwd # Linux/macOS
cd # Windows

In Visual Studio 2022:
- Right-click project in Solution Explorer
- Add → New Folder → Name it "data"
- Copy JSON files into this folder
- Verify folder appears in Solution Explorer
Problem: Accuracy below 80%
Possible Causes and Solutions:
# 1. Insufficient data - Check dataset size
print(f"Total flows: {len(data)}")
# Need at least 500+ flows per application for good results
# 2. Class imbalance - Enable oversampling
use_random_oversampling = True
# 3. Poor feature quality - Check for missing values
print(f"Missing values: {X.isnull().sum().sum()}")
print(f"Features with >50% missing: {(X.isnull().sum() / len(X) > 0.5).sum()}")
# 4. Overfitting - Reduce model complexity
rf_model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
min_samples_split=10
)
# 5. Using wrong dataset - Verify JSON format matches MIRAGE-2019

Problem: Training takes too long (>10 minutes)
Optimizations:
# 1. Reduce number of trees
n_estimators=100 # Instead of 300
# 2. Limit tree depth
max_depth=15 # Instead of 25
# 3. Use fewer CPU cores (if system is sluggish)
n_jobs=4 # Instead of -1 (all cores)
# 4. Sample large datasets
if len(data) > 100000:
    data = data.sample(n=100000, random_state=42)
# 5. Enable LDA for dimensionality reduction
use_lda = True # Reduces features, speeds up training

Additional Tips:
- Close other applications to free system resources
- Use Task Manager (Windows) or Activity Monitor (macOS) to monitor resource usage
- Consider running overnight for very large datasets
Problem: Plots not displaying or saving
Solutions:
In Visual Studio 2022:
# Add at beginning of script
import matplotlib
matplotlib.use('TkAgg') # Try different backends: 'Qt5Agg', 'Agg'
# Force plot display
plt.show(block=True)

In Command Line:
# For headless servers (no display)
import matplotlib
matplotlib.use('Agg') # Saves plots without displaying
# Verify charts folder exists
import os
if not os.path.exists('charts'):
    os.makedirs('charts')

Common Issues:
- Charts not appearing: Install the python-tk (Tkinter) package
- Charts not saving: Check folder permissions
- Charts blank: Update matplotlib to latest version
Problem: Python environment not detected
Solution:
- Go to Tools → Options → Python → Environments
- Click + Add Environment
- Manually specify Python interpreter path
- Common locations:
  - C:\Python38\python.exe
  - C:\Users\<YourName>\AppData\Local\Programs\Python\Python38\python.exe
Problem: Script runs but no output visible
Solution:
- Check View → Output window
- Select "Python" from the dropdown
- Or use View → Terminal for live output
Problem: Charts not displaying in VS
Solution:
- Charts save to the charts/ folder automatically
- View them from Windows Explorer / File Explorer
- Or use Python Interactive window for inline display
Problem: "No JSON files found in data folder"
Solution:
- Verify the data/ folder exists in the project root
- Check that JSON files have the correct extension (.json, not .txt or .json.txt)
- Verify files are not in subdirectories
- In Visual Studio: Set Show All Files in Solution Explorer
Problem: "Unknown application labels"
Solution: The script extracts labels from filenames. Ensure filenames match pattern:
<timestamp>_<package_name>_MIRAGE-2019_traffic_dataset_labeled_biflows.json
For custom apps, update the extract_label() function in the script.
Problem: File paths not working on Windows
Solutions:
# The script already uses os.path.join() for cross-platform compatibility
# But if you modify paths, always use:
data_folder = os.path.join("data") # Not "data/"
output_file = os.path.join("output", "model.pkl") # Not "output/model.pkl"
# Or use forward slashes (Python handles them on Windows)
data_folder = "data" # Works on all platformsIf you encounter issues not covered here:
Step-by-Step Debugging:
- Read the Full Error Message: Check the complete error traceback
- Verify Installation: Ensure all dependencies are installed correctly
- Check Dataset Format: Verify JSON files match expected MIRAGE-2019 structure
- Enable Debug Output: Add print statements to identify where issues occur
- Check System Resources: Monitor RAM and CPU usage during execution
- Test with Small Sample: Try with 2-3 JSON files first to isolate issues
Support Channels:
⦁ GitHub Issues: https://github.com/tomRiddle-the1st/network_traffic_classification/issues
⦁ Email: [email protected]
⦁ Documentation: See full research paper for technical details
When Reporting Issues: Please include:
- Operating system and version
- Python version (python --version)
- Error message (full traceback)
- Dataset size and number of JSON files
- Configuration settings used
- Steps to reproduce the issue
network_traffic_classification/
│
├── network_traffic_classification.py # Main script (all-in-one)
├── README.md # This file
├── requirements # Python dependencies
├── USER_MANUAL.md # Detailed user guide (if available)
├── RESEARCH_PAPER.md # Full research article (if available)
│
├── data/ # Input data directory (CREATE MANUALLY)
│ ├── 1494419517_com.twitter.android_MIRAGE-2019_traffic_dataset_labeled_biflows.json
│ ├── 1494434240_com.google.android.youtube_MIRAGE-2019_traffic_dataset_labeled_biflows.json
│ ├── 1494508157_com.spotify.music_MIRAGE-2019_traffic_dataset_labeled_biflows.json
│ ├── 1511195631_com.facebook.katana_MIRAGE-2019_traffic_dataset_labeled_biflows.json
│ ├── 1494596297_air.com.hypah.io.slither_MIRAGE-2019_traffic_dataset_labeled_biflows.json
│ └── ... (other application JSON files from both devices)
│
├── output/ # Generated outputs (AUTO-CREATED)
│ ├── Mirage_flows.csv # Processed dataset cache
│ ├── model.pkl # Trained model + metadata
│ ├── feature_scaler.pkl # Feature normalization scaler
│ └── lda_transformer.pkl # LDA transformer (if enabled)
│
└── charts/ # Generated visualizations (AUTO-CREATED)
├── traffic_distribution.png # Application flow distribution
├── priority_distribution.png # QoS priority breakdown
├── latency_sensitivity.png # Latency sensitivity distribution
├── bandwidth_requirements.png # Bandwidth requirements
└── confusion_matrix.png # Classification confusion matrix
network_traffic_classification.py
- Main executable script
- Contains all functions: data loading, preprocessing, training, evaluation
- Configurable via variables at top of file
- No external dependencies beyond the Python libraries listed above
data/ folder
- Must contain MIRAGE-2019 JSON files
- Merge files from both Google Nexus and Xiaomi Mi5 devices
- Script processes all .json files in this directory
output/ folder
- Automatically created by script
- Stores trained models and processed data
- Reusable across multiple runs (caching)
charts/ folder
- Automatically created by script
- Stores all visualization outputs as PNG files
- High resolution (200 DPI) for publication quality
If you use this software in your research, please cite:
@misc{ismail2025networktrafficclassification,
  title={Network Traffic Classification of 5G networks for Improvement of QoS},
  author={Ismail, Mohammed Aqeel},
  year={2025},
  school={University of KwaZulu-Natal},
  address={Pietermaritzburg, South Africa},
  note={Software available at: https://github.com/tomRiddle-the1st/network_traffic_classification}
}

This project is developed for educational and research purposes as part of COMP700 coursework at the University of KwaZulu-Natal.
⦁ Free for educational and personal use
⦁ Free for academic research (please cite)
⦁ Modifications and improvements encouraged
⦁ Attribution appreciated but not required
⦁ No warranty provided - use at your own risk
This software is provided "as is" without warranty of any kind, express or implied. The authors and University of KwaZulu-Natal are not responsible for any damages or issues arising from the use of this software.
The MIRAGE-2019 dataset is provided by the University of Naples Federico II. Please refer to their licensing terms when using the dataset.
Q: What is the minimum dataset size needed?
A: At least 500 flows per application for reasonable accuracy. The full MIRAGE-2019 dataset contains 120K+ flows across 20 apps.
Q: Can I use this for real-time traffic classification?
A: The model is trained for classification. For real-time use, you'll need to implement a flow extraction pipeline and load the trained model for predictions.
Q: Does this work with encrypted traffic?
A: Yes! The system uses flow-level statistical features, not packet payloads, so it works with encrypted traffic (HTTPS, VPN, TLS).
Q: Why Random Oversampling instead of SMOTE?
A: In our tests, Random Oversampling achieved 89.76% accuracy vs 82.06% for SMOTE. It's also faster and simpler.
Q: How long does training take?
A: 2-3 minutes on a modern CPU (i7/Ryzen 7) with 16GB RAM for the full 237K flow dataset after oversampling.
Q: Can I add new applications?
A: Yes! Add JSON files to data/, update extract_label() function, and add QoS policies to qos_policies dictionary.
Q: What features does the model use?
A: 102 flow-level statistical features including packet length (mean, std, min, max, percentiles), inter-arrival time (IAT), and flow duration for forward, backward, and bidirectional flows.
Q: Can I use this on a different dataset?
A: Yes, but you'll need to ensure your dataset has the same flow-level feature structure as MIRAGE-2019, or modify the process_file() function to extract features from your format.
Q: Why is my accuracy so low (<70%)?
A: Common causes: insufficient data per app, disabled oversampling, corrupted JSON files, or wrong dataset format.
Q: Charts are not displaying. What should I do?
A: Charts are automatically saved to charts/ folder. Check there first. If they're not saving, verify folder permissions and matplotlib installation.
Q: Script crashes with MemoryError. How to fix?
A: Reduce n_estimators to 100, limit dataset size with sampling, or disable Random Oversampling (though accuracy will drop).
Dataset:
This work uses the MIRAGE-2019 dataset provided by the University of Naples Federico II, Italy.
Supervisor:
University of KwaZulu-Natal, School of Mathematics, Statistics and Computer Science
Tools & Libraries:
- scikit-learn for machine learning algorithms
- imbalanced-learn for oversampling techniques
- matplotlib and seaborn for visualizations
- pandas and numpy for data manipulation
Inspiration:
Modern network traffic classification research and the need for privacy-preserving QoS optimization in 5G networks.
v1.0 (2025)
- Initial release
- Random Forest classifier with 89.76% accuracy
- Support for 20 MIRAGE-2019 applications
- QoS policy framework
- Comprehensive visualization suite
- Visual Studio 2022 compatibility
Author: Mohammed Aqeel Ismail
Email: [email protected]
Institution: University of KwaZulu-Natal
GitHub: https://github.com/tomRiddle-the1st/network_traffic_classification
For questions, bug reports, or collaboration opportunities, please open an issue on GitHub or contact via email.
Thank you for using this network traffic classification system!
We hope this tool helps advance your research in QoS optimization and network traffic analysis. Contributions, feedback, and citations are greatly appreciated!