LearnDL Troubleshooting Guide

This guide helps you resolve common issues when working with the LearnDL sentiment classification system.

Quick Diagnosis

API Not Responding

Symptoms:

  • Connection refused errors
  • Timeout errors
  • 500 Internal Server Error

Solutions:

  1. Check whether the API is running:

    curl http://localhost:8000/model_api/health_check
  2. Start the API:

    # Using Docker
    docker-compose up --build
    
    # Or locally
    python api/main.py
  3. Check logs:

    # Docker logs
    docker-compose logs api
    
    # Local logs
    # Check terminal output
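
The three symptoms listed for this problem can also be told apart from a script. The following is a minimal sketch using only the Python standard library and the health-check URL from step 1; the function names `diagnose` and `check_health` are illustrative and are not part of LearnDL.

```python
import socket
import urllib.error
import urllib.request

# Health-check endpoint as given in this guide.
HEALTH_URL = "http://localhost:8000/model_api/health_check"


def diagnose(exc):
    """Map an exception raised by urlopen to one of the symptoms listed above."""
    # HTTPError is a subclass of URLError, so it must be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_{exc.code}"
    # URLError wraps the underlying socket error in .reason; raw errors have none.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(reason, ConnectionRefusedError):
        return "connection_refused"
    return "unknown"


def check_health(url=HEALTH_URL, timeout=5.0):
    """Return 'http_<status>' on any HTTP answer, otherwise a symptom name."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"http_{response.status}"
    except (urllib.error.URLError, OSError) as exc:
        return diagnose(exc)
```

A result of `connection_refused` points to step 2 (the API is not running), while `http_500` means the API is up and the logs from step 3 are the next place to look.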

Training Fails

Symptoms:

  • Training starts but fails immediately
  • CUDA out of memory errors
  • Data loading errors

Common Causes:

CUDA Out of Memory

RuntimeError: CUDA out of memory

Solutions:

  • Reduce batch size: "batch_size": 8
  • Use smaller model: "embed_model": "distilbert_model"
  • Freeze more layers: "fine_tune_mode": "freeze_all"
  • Reduce hidden neurons: "hidden_neurons": 128
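
The four settings above can be applied together. This sketch assumes the training configuration is a flat dictionary using the key names shown in this guide; the helper name `apply_low_memory` is hypothetical.

```python
# Memory-saving values suggested in this guide.
# Assumption: the configuration is a flat dict keyed by these names.
LOW_MEMORY_OVERRIDES = {
    "batch_size": 8,
    "embed_model": "distilbert_model",
    "fine_tune_mode": "freeze_all",
    "hidden_neurons": 128,
}


def apply_low_memory(config):
    """Return a copy of config with the memory-saving overrides applied."""
    updated = dict(config)  # leave the caller's dict untouched
    updated.update(LOW_MEMORY_OVERRIDES)
    return updated
```

Applying the overrides one at a time, starting with `batch_size`, shows which change is enough to avoid the error without giving up model capacity unnecessarily.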

Data File Not Found

FileNotFoundError: data/data.csv

Solutions:

  • Ensure data/data.csv exists
  • Check file path in configuration
  • Verify file permissions
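
The three checks above can be run in one step. This is a small sketch using the standard library; the function name `check_data_file` is illustrative, and the default path is the one from the error message.

```python
import os
from pathlib import Path


def check_data_file(path="data/data.csv"):
    """Return 'ok' or a short description of why the file cannot be used."""
    candidate = Path(path)
    if not candidate.exists():
        return f"missing: {candidate.resolve()}"
    if not candidate.is_file():
        return f"not a regular file: {candidate.resolve()}"
    if not os.access(candidate, os.R_OK):
        return f"not readable: {candidate.resolve()}"
    return "ok"
```

The message includes the resolved absolute path, which helps when the process was started from a different working directory than expected.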

Invalid Data Format

ValueError: Expected 2 columns, got 3

Solutions:

  • Check CSV format: input,output
  • Ensure no extra commas in text
  • Use proper CSV quoting
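
An unquoted comma inside the text is the usual cause of the extra column. The sketch below, which uses Python's `csv` module, locates rows with the wrong column count and shows how a writer quotes such text; the function names are illustrative.

```python
import csv
import io


def find_bad_rows(lines, expected_columns=2):
    """Return (line_number, column_count) for each row with the wrong width."""
    bad = []
    reader = csv.reader(lines)
    for row in reader:
        # Blank lines come back as empty lists and are skipped.
        if row and len(row) != expected_columns:
            bad.append((reader.line_num, len(row)))
    return bad


def to_csv_line(text, label):
    """Serialize one input,output pair, quoting the text only when needed."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerow([text, label])
    return buffer.getvalue()


sample = [
    "input,output\n",
    "Great movie, loved it,positive\n",    # unquoted comma: parsed as 3 columns
    '"Great movie, loved it",positive\n',  # quoted: parsed as 2 columns
]
print(find_bad_rows(sample))  # [(2, 3)]
```

To check a real file, pass an open file handle (opened with `newline=""`) instead of the sample list.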

Prediction Errors

Symptoms:

  • Model not found errors
  • Configuration mismatch errors
  • Low accuracy predictions

Model Not Found

HTTP 404: Model not found for this user/session

Solutions:

  • Verify user_id and training_session_id
  • Check if training completed successfully
  • Ensure model was saved to Redis

Configuration Mismatch

ValueError: Configuration mismatch

Solutions:

  • Use same configuration for training and prediction
  • Ensure total_config matches training setup
  • Check embed model compatibility
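
When the mismatch is not obvious, comparing the two configurations key by key narrows it down. This sketch assumes both configurations are available as flat dictionaries with string keys; `diff_configs` is a hypothetical helper, not a LearnDL function.

```python
def diff_configs(train_config, predict_config):
    """Return {key: (train_value, predict_value)} for every key that differs."""
    missing = object()  # sentinel so that a stored None is not mistaken for absent
    differences = {}
    for key in sorted(set(train_config) | set(predict_config)):
        left = train_config.get(key, missing)
        right = predict_config.get(key, missing)
        if left != right:
            differences[key] = (
                "<missing>" if left is missing else left,
                "<missing>" if right is missing else right,
            )
    return differences
```

An empty result means the two dictionaries are identical; any key that appears in the result is a candidate cause of the error.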

Detailed Troubleshooting

1. Environment Issues

Python Version Problems

ModuleNotFoundError: No module named 'transformers'

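Check the Python version and confirm that the intended virtual environment is active (a missing module usually means the dependencies were installed for a different interpreter):

python --version  # Should be 3.11+

Reinstall dependencies with the same interpreter:

python -m pip install -r requirements.txt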

CUDA/GPU Issues

AssertionError: Torch not compiled with CUDA enabled

Check GPU availability:

import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())

Solutions:

  • Install CUDA-compatible PyTorch
  • Use CPU-only version: pip install torch --index-url https://download.pytorch.org/whl/cpu
  • Set CUDA_VISIBLE_DEVICES="" to hide all GPUs and force CPU-only execution

2. Data Issues

Encoding Problems

UnicodeDecodeError: 'utf-8' codec can't decode

Solutions:

  • Save CSV with UTF-8 encoding
  • Handle special characters in text
  • Use encoding='utf-8' when reading
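
If the file's encoding is unknown, trying a short list of encodings can identify it. This is a sketch only: the fallback to `cp1252` is an assumption (it is common for CSV files exported from spreadsheet software on Windows), and `read_text_with_fallback` is an illustrative name.

```python
def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Try each encoding in order and return (text, encoding_used).

    'utf-8-sig' reads plain UTF-8 and also strips a leading byte-order mark.
    """
    last_error = None
    for encoding in encodings:
        try:
            with open(path, encoding=encoding, newline="") as handle:
                return handle.read(), encoding
        except UnicodeDecodeError as exc:
            last_error = exc
    raise UnicodeError(f"none of {encodings} could decode {path}") from last_error
```

Once the working encoding is known, re-saving the file as UTF-8 is the more durable fix than keeping the fallback in place.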

Class Imbalance

Warning: Class distribution is heavily imbalanced

Solutions:

  • Check class distribution in data
  • Use stratified sampling: "stratify": true
  • Collect more balanced data
  • Use class weights in training
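
Class weights are commonly set inversely proportional to class frequency. The sketch below computes the distribution and the weights N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the number of samples in class c. How LearnDL accepts class weights is not covered in this guide, so treat the function as an illustration of the calculation only.

```python
from collections import Counter


def class_weights(labels):
    """Return {label: N / (K * n_c)} so that rarer classes get larger weights."""
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)
    return {
        label: total / (num_classes * count)
        for label, count in counts.items()
    }


labels = ["positive"] * 3 + ["negative"]
print(Counter(labels))        # class distribution
print(class_weights(labels))  # the minority class gets the larger weight
```

With perfectly balanced data every weight equals 1.0, so the weights also serve as a quick measure of how far the data is from balanced.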

Empty or Invalid Data

ValueError: Found 0 samples in training data

Solutions:

  • Verify CSV has data rows
  • Check for empty strings in input column
  • Ensure output labels are consistent

3. Model Issues

Poor Training Performance

Training loss not decreasing
Validation accuracy stuck at ~50%

Solutions:

  • Adjust the learning rate: lower it if the loss oscillates or diverges, raise it gradually if the loss stays flat
  • Try different embedding models
  • Unfreeze more layers for fine-tuning
  • Check data quality and preprocessing

Overfitting

Training accuracy: 95%, Validation accuracy: 60%

Solutions:

  • Increase dropout: "dropout": 0.5
  • Reduce model complexity
  • Add regularization
  • Use early stopping
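
Early stopping halts training once validation loss stops improving. The class below is a generic, framework-independent sketch; the names `EarlyStopping`, `patience`, and `min_delta` are illustrative and are not LearnDL configuration keys.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        """Record one epoch's validation loss and return True when it is time to stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop, call `should_stop(val_loss)` once per epoch and break out of the loop when it returns True, keeping the checkpoint that produced `best_loss`.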

Underfitting

Both training and validation accuracy low

Solutions:

  • Increase training epochs
  • Unfreeze more layers
  • Use larger embedding model
  • Check learning rate (might be too low)

4. Redis Issues

Connection Failed

redis.ConnectionError: Connection refused

Solutions:

  • Start Redis server:
    redis-server
  • Check Redis configuration in .env
  • Verify Redis is running on correct port
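
A quick way to verify that something is listening on the expected port is a plain TCP connection attempt. This sketch uses only the standard library and defaults to 6379, the standard Redis port; adjust the host and port to match your .env file. The function name `port_is_open` is illustrative.

```python
import socket


def port_is_open(host="localhost", port=6379, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable hosts
        return False
```

A True result only shows that the port accepts connections, not that the listener is Redis; `redis-cli ping` confirms the latter.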

Model Not Persisted

KeyError: Model not found in Redis

Solutions:

  • Check Redis connection
  • Verify model was saved after training
  • Check Redis memory usage
  • Restart Redis if needed

5. Docker Issues

Container Won't Start

ERROR: Couldn't connect to Docker daemon

Solutions:

  • Start Docker Desktop
  • Check Docker is running: docker info
  • Restart Docker service

Port Already in Use

ERROR: Port 8000 is already in use

Solutions:

  • Kill process using port:
    # Find process
    lsof -i :8000
    # Stop the process (add -9 only if it does not exit on its own)
    kill <PID>
  • Change port in docker-compose.yml

Volume Mount Issues

ERROR: Invalid volume specification

Solutions:

  • Use absolute paths for volume mounts
  • Check file permissions
  • Ensure directories exist

6. Performance Issues

Slow Training

Epoch taking >30 minutes

Solutions:

  • Use GPU if available
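  • Tune batch size: increase it if memory allows (larger batches usually raise throughput); reduce it only if the system is swapping or running out of memory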
  • Use DistilBERT instead of BERT
  • Freeze embedding layers

High Memory Usage

MemoryError: Unable to allocate array

Solutions:

  • Reduce batch size
  • Use smaller models
  • Process data in chunks
  • Add swap memory
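
Processing data in chunks keeps only a bounded number of rows in memory at a time. The generator below is a generic sketch that works with any iterable, including a `csv.reader`; the name `iter_chunks` and the default chunk size are illustrative.

```python
from itertools import islice


def iter_chunks(rows, chunk_size=1000):
    """Yield lists of at most chunk_size items without materializing the whole input."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Example with a CSV file (process() is a hypothetical per-chunk function):
#
#   import csv
#   with open("data/data.csv", newline="", encoding="utf-8") as handle:
#       for chunk in iter_chunks(csv.reader(handle), chunk_size=500):
#           process(chunk)
```

Peak memory then depends on `chunk_size` rather than on the size of the file.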

Slow Inference

Prediction taking >5 seconds

Solutions:

  • Use smaller models
  • Cache models in memory
  • Optimize preprocessing
  • Use batch inference for multiple texts
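
Caching a loaded model in memory avoids repeating the expensive load on every request. The sketch below uses `functools.lru_cache`; `load_model_from_storage` is a hypothetical placeholder for whatever LearnDL does to fetch a model, and the cache key mirrors the user_id and training_session_id identifiers used elsewhere in this guide.

```python
from functools import lru_cache


def load_model_from_storage(user_id, training_session_id):
    """Hypothetical stand-in for the expensive load (for example, from Redis)."""
    return {"user_id": user_id, "training_session_id": training_session_id}


@lru_cache(maxsize=8)
def get_model(user_id, training_session_id):
    """Load each (user_id, training_session_id) model once, then reuse it."""
    return load_model_from_storage(user_id, training_session_id)
```

A cached entry becomes stale if the same session is retrained; calling `get_model.cache_clear()` after training forces a fresh load. The `maxsize` value bounds how many models are held in memory at once.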

Debug Mode

Enable Debug Logging

Set environment variable:

export LOG_LEVEL=DEBUG
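
For your own scripts that run alongside LearnDL, the same variable can drive Python's `logging` module. This is a sketch of one way to read it; how LearnDL itself consumes LOG_LEVEL is not shown in this guide, and `configure_logging` is an illustrative name.

```python
import logging
import os


def configure_logging(default="INFO"):
    """Set the root logger level from LOG_LEVEL, falling back to `default`."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):  # unset or unrecognized value
        level = getattr(logging, default)
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # basicConfig is a no-op if handlers already exist, so set the level explicitly.
    logging.getLogger().setLevel(level)
    return level
```

Call `configure_logging()` once at startup, before any other module emits log records.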

Check System Resources

import psutil
import torch

# CPU usage
print(f"CPU: {psutil.cpu_percent()}%")

# Memory usage
memory = psutil.virtual_memory()
print(f"Memory: {memory.percent}% used")

# GPU usage (if available): memory allocated by PyTorch as a share of device 0's total
if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"GPU Memory: {torch.cuda.memory_allocated(0) / total:.2%}")

Profile Code

import cProfile
import pstats

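# Profile training and write the raw stats to the file 'profile_stats'.
# train_model() stands for your own training entry point; it must be defined
# in the script that runs this snippet.
cProfile.run('train_model()', 'profile_stats')
# Print the 10 entries with the highest cumulative time
pstats.Stats('profile_stats').sort_stats('cumulative').print_stats(10)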

Getting Help

Log Files

Check these locations for logs:

  • Docker: docker-compose logs
  • Local: Terminal output
  • Application: logs/app.log (if configured)

Common Error Codes

Error Code   Meaning               Action
400          Bad Request           Check request parameters
404          Not Found             Verify user_id/training_session_id
422          Validation Error      Check configuration values
503          Service Unavailable   Check model/data availability

Support Information

For additional help:

  1. Check the API Reference
  2. Review the Configuration Guide
  3. Run the demo notebook
  4. Check GitHub issues for similar problems

System Information

When reporting issues, include:

  • Python version: python --version
  • OS and version
  • Docker version (if used)
  • GPU/CPU information
  • Full error traceback
  • Configuration used
  • Sample data (anonymized)
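
Part of this list can be collected automatically. The sketch below gathers the Python and OS details with the standard library; Docker and GPU details are not covered and still need to be added by hand. The function name `system_report` is illustrative.

```python
import platform
import sys


def system_report():
    """Collect basic environment details for a bug report as a plain dict."""
    return {
        "python_version": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor() or "unknown",
        "executable": sys.executable,  # shows which interpreter or venv is active
    }
```

Printing the result with `for key, value in system_report().items(): print(f"{key}: {value}")` gives text that can be pasted directly into an issue.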