Keerthi421/News_Category_Classification

📰 News Category Classifier

An NLP system that classifies news articles into 16 categories using TF-IDF and Logistic Regression, optimized for fast, real-time inference.

Features

  • 16 Categories: Business, Politics, Sports, Tech, Wellness, Entertainment, Travel, Food & Drink, Science, Crime, Environment, Media, Education, Style & Beauty, U.S. News, World News
  • Real-time Prediction: Instant classification with confidence scores
  • Modern UI: Clean Streamlit interface with responsive design
  • Fast Processing: Under 100 ms per prediction on a local CPU

Quick Start

Local Development

# Install dependencies
pip install -r requirements.txt

# Run locally
streamlit run app.py

Deploy to Streamlit Cloud

  1. Push to GitHub:

    git add .
    git commit -m "Initial commit"
    git push origin main
  2. Deploy on Streamlit Cloud:

    • Go to share.streamlit.io
    • Connect your GitHub account
    • Select repository: Keerthi421/News_Category_Classification
    • Set main file: app.py
    • Click "Deploy"
  3. Access your app at the provided URL

📊 Model Details

  • Algorithm: Logistic Regression (Multinomial)
  • Features: TF-IDF vectorization (10K features, unigrams + bigrams)
  • Dataset: 126K news articles from Kaggle News Category Dataset v3
  • Training: 80/20 train/test split; training takes roughly 2-3 minutes
  • Model Choice Rationale: Logistic Regression chosen for interpretability, speed, and ease of production deployment
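
The setup listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of train_model.py: the toy headlines and labels are hypothetical stand-ins for the Kaggle dataset, and `max_iter=1000`, `stratify`, and `random_state=42` are assumptions added so the example runs reliably on a tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy corpus; the real project trains on ~126K Kaggle articles.
texts = [
    "Stock markets decline in early trade",
    "Central bank raises interest rates again",
    "Retail giant reports record quarterly profit",
    "Oil prices climb as supply tightens",
    "Startup secures new round of funding",
    "Home team wins championship in overtime",
    "Star striker signs record transfer deal",
    "Olympic sprinter breaks world record",
    "Coach resigns after losing season",
    "Tennis champion advances to final",
    "New smartphone features faster chip",
    "Software update fixes security flaw",
    "Researchers unveil faster battery technology",
    "Social network launches video app",
    "Chipmaker announces next generation processor",
]
labels = ["BUSINESS"] * 5 + ["SPORTS"] * 5 + ["TECH"] * 5

# Encode category names as integers (the project stores a label_encoder.pkl).
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# 80/20 split; stratify keeps every category present in both parts.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, stratify=y, random_state=42
)

# TF-IDF capped at 10K features, using unigrams and bigrams.
vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)  # reuse the fitted vocabulary

# With the default lbfgs solver, multi-class problems use a multinomial loss.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on toy data: {accuracy:.2f}")
```

The vectorizer is fitted on the training split only and then applied to the test split, so no vocabulary or IDF statistics leak from held-out articles.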

🤔 Why Logistic Regression?

Logistic Regression provides a strong balance of accuracy, interpretability, and inference speed for large-scale text classification, making it well-suited for production NLP systems where latency and reliability matter.

🛠️ Tech Stack

  • ML: Scikit-learn, NLTK
  • Web: Streamlit
  • Data: Pandas, NumPy
  • Language: Python 3.13

📁 Project Structure

├── app.py                 # Streamlit web app
├── train_model.py         # Model training script
├── convert_to_csv.py      # Data preprocessing
├── requirements.txt       # Python dependencies
├── news_data.csv          # Sample / processed dataset
├── model.pkl              # Trained model
├── vectorizer.pkl         # TF-IDF vectorizer
└── label_encoder.pkl      # Category encoder

🎯 Usage Examples

Input                                     Category   Confidence
"Stock markets decline in early trade"    BUSINESS   34.5%
"New vaccine shows promising results"     WELLNESS   45.2%
"President announces economic policy"     POLITICS   67.8%

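A category and confidence pair like those in the table can be produced by taking the highest class probability from `predict_proba`. The sketch below is an assumption about how that step might look, not the code in app.py: `classify` is a hypothetical helper, and the model, vectorizer, and label encoder are fitted on toy data here so the block runs on its own (the app would presumably load them from model.pkl, vectorizer.pkl, and label_encoder.pkl instead).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder


def classify(text, model, vectorizer, label_encoder):
    """Return (category, confidence) for a single headline."""
    features = vectorizer.transform([text])
    probabilities = model.predict_proba(features)[0]  # one value per class
    best = int(np.argmax(probabilities))
    category = label_encoder.inverse_transform([model.classes_[best]])[0]
    return str(category), float(probabilities[best])


# Hypothetical toy training data standing in for the pickled artifacts.
texts = [
    "Markets rally as earnings beat forecasts",
    "Bank shares fall after rate decision",
    "Company reports strong quarterly profit",
    "Senate debates new tax policy",
    "President signs trade agreement",
    "Lawmakers vote on election bill",
    "Vaccine trial shows promising results",
    "Doctors urge regular health checkups",
    "Study finds benefits of daily walking",
]
labels = ["BUSINESS"] * 3 + ["POLITICS"] * 3 + ["WELLNESS"] * 3

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), y)

category, confidence = classify(
    "Stock markets decline in early trade", model, vectorizer, label_encoder
)
print(f"{category} {confidence:.1%}")
```

Because the probabilities are spread over all categories, the winning class can have a confidence well below 50% and still be the top prediction.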
🔧 Requirements

  • Python 3.8+
  • Streamlit
  • Scikit-learn
  • Pandas
  • NLTK

📈 Performance

  • Accuracy: 78%
  • Designed for balanced performance across classes rather than overfitting top categories
  • Best Categories: Politics (91%), Wellness (88%), Entertainment (84%)
  • Training Time: ~3 minutes
  • Model Size: ~50MB (compressed)

Built with a production mindset: fast inference, interpretable models, and clean deployment.
