📰 News Category Classifier

An NLP system that classifies news articles into 16 categories using TF-IDF features and Logistic Regression, optimized for fast, real-time inference.

Features

16 Categories: Business, Politics, Sports, Tech, Wellness, Entertainment, Travel, Food & Drink, Science, Crime, Environment, Media, Education, Style & Beauty, U.S. News, World News
Real-time Prediction: Instant classification with confidence scores
Modern UI: Clean Streamlit interface with responsive design
Fast Processing: under 100 ms per prediction on a local CPU

Quick Start

Local Development

# Install dependencies
pip install -r requirements.txt

# Run locally
streamlit run app.py

Deploy to Streamlit Cloud

Push to GitHub:

git add .
git commit -m "Initial commit"
git push origin main

Deploy on Streamlit Cloud:
- Go to share.streamlit.io
- Connect your GitHub account
- Select repository: Keerthi421/News_Category_Classification
- Set main file: app.py
- Click "Deploy"
Access your app at the provided URL

📊 Model Details

Algorithm: Logistic Regression (Multinomial)
Features: TF-IDF vectorization (10K features, unigrams + bigrams)
Dataset: 126K news articles from Kaggle News Category Dataset v3
Training: 80/20 train/test split, 2-3 minutes of training time
Model Choice Rationale: Logistic Regression chosen for interpretability, speed, and ease of production deployment
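
The setup listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of train_model.py: the function name train_classifier, the seed, the max_iter value, and the toy data are assumptions made for the example, and only the 10K feature cap, the unigram + bigram range, and the 80/20 split come from this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def train_classifier(texts, categories, seed=42):
    """Fit TF-IDF + Logistic Regression; return the fitted parts and held-out accuracy.

    Hypothetical helper for illustration; not taken from train_model.py.
    """
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(categories)

    # 80/20 split, stratified so every category appears in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, y, test_size=0.2, random_state=seed, stratify=y
    )

    # At most 10K features, unigrams + bigrams, as listed above.
    vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # max_iter is an assumed value chosen to avoid convergence warnings.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_vec, y_train)

    accuracy = model.score(X_test_vec, y_test)
    return model, vectorizer, label_encoder, accuracy


# Toy data standing in for the real dataset (assumption for the example).
sports = [f"team wins match {i} with late goal" for i in range(10)]
business = [f"stocks fall {i} as markets slide" for i in range(10)]
texts = sports + business
categories = ["SPORTS"] * 10 + ["BUSINESS"] * 10

model, vectorizer, label_encoder, accuracy = train_classifier(texts, categories)
print(f"held-out accuracy: {accuracy:.2f}")
```

On the real dataset the same function would receive the article text and the category column; how those columns are named in news_data.csv is not stated here.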

🤔 Why Logistic Regression?

Logistic Regression provides a strong balance of accuracy, interpretability, and inference speed for large-scale text classification, making it well-suited for production NLP systems where latency and reliability matter.

🛠️ Tech Stack

ML: Scikit-learn, NLTK
Web: Streamlit
Data: Pandas, NumPy
Language: Python 3.13 (Python 3.8+ supported, see Requirements)

📁 Project Structure

├── app.py                 # Streamlit web app
├── train_model.py         # Model training script
├── convert_to_csv.py      # Data preprocessing
├── requirements.txt       # Python dependencies
├── news_data.csv          # Sample / processed dataset
├── model.pkl              # Trained model
├── vectorizer.pkl         # TF-IDF vectorizer
└── label_encoder.pkl      # Category encoder

🎯 Usage Examples

Input	Category	Confidence
"Stock markets decline in early trade"	BUSINESS	34.5%
"New vaccine shows promising results"	WELLNESS	45.2%
"President announces economic policy"	POLITICS	67.8%
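
A prediction like the ones in the table could be produced along these lines. This is a sketch under assumptions: it supposes the three .pkl files were written with Python's pickle module, and the helpers load_artifact and predict_category are hypothetical names, not functions from app.py. The confidence is taken to be the highest class probability from predict_proba. Small in-memory stand-ins replace the real artifacts so the example runs on its own.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder


def load_artifact(path):
    """Load one artifact, assuming it was saved with the pickle module."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_category(text, model, vectorizer, label_encoder):
    """Return (category, confidence); confidence is the top class probability."""
    probabilities = model.predict_proba(vectorizer.transform([text]))[0]
    best = int(np.argmax(probabilities))
    # model.classes_ holds encoded labels; map the winner back to its name.
    category = label_encoder.inverse_transform([model.classes_[best]])[0]
    return str(category), float(probabilities[best])


# Stand-ins for model.pkl, vectorizer.pkl and label_encoder.pkl (toy data).
docs = [
    "stocks fall as markets slide", "team wins the final match",
    "markets rally as stocks climb", "coach praises team after match",
]
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(["BUSINESS", "SPORTS", "BUSINESS", "SPORTS"])
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(docs), y)

category, confidence = predict_category(
    "Stock markets decline in early trade", model, vectorizer, label_encoder
)
print(f"{category}\t{confidence:.1%}")
```

With the real files, the stand-ins would be replaced by calls such as load_artifact("model.pkl"). Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.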

🔧 Requirements

Python 3.8+
Streamlit
Scikit-learn
Pandas
NLTK

📈 Performance

Accuracy: 78%
Designed for balanced performance across classes rather than overfitting top categories
Best Categories: Politics (91%), Entertainment (84%), Wellness (88%)
Training Time: ~3 minutes
Model Size: ~50MB (compressed)
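
The README does not say which metric the per-category figures use. As one possible reading, the sketch below computes overall accuracy together with per-category recall (the share of each category's articles that were labeled correctly). The helper name per_category_recall and the toy labels are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, recall_score


def per_category_recall(y_true, y_pred, categories):
    """Return {category: recall} for the given true and predicted labels."""
    scores = recall_score(
        y_true, y_pred, labels=categories, average=None, zero_division=0
    )
    return {name: float(score) for name, score in zip(categories, scores)}


# Toy labels standing in for a held-out test set (assumption).
y_true = ["POLITICS", "POLITICS", "SPORTS", "SPORTS", "TECH"]
y_pred = ["POLITICS", "POLITICS", "SPORTS", "TECH", "TECH"]
categories = ["POLITICS", "SPORTS", "TECH"]

overall = accuracy_score(y_true, y_pred)
recalls = per_category_recall(y_true, y_pred, categories)

print(f"accuracy: {overall:.0%}")
for name, value in recalls.items():
    print(f"{name}: {value:.0%}")
```

Reporting per-category numbers next to overall accuracy makes it easier to check the stated goal of balanced performance across classes.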

Built with a production mindset: fast inference, interpretable models, and clean deployment.
