An NLP system that classifies news articles into 16 categories using TF-IDF and Logistic Regression, optimized for fast, real-time inference.
- 16 Categories: Business, Politics, Sports, Tech, Wellness, Entertainment, Travel, Food & Drink, Science, Crime, Environment, Media, Education, Style & Beauty, U.S. News, World News
- Real-time Prediction: Instant classification with confidence scores
- Modern UI: Clean Streamlit interface with responsive design
- Fast Processing: <100 ms per prediction with local CPU inference
# Install dependencies
pip install -r requirements.txt
# Run locally
streamlit run app.py
- Push to GitHub:
  git add .
  git commit -m "Initial commit"
  git push origin main
- Deploy on Streamlit Cloud:
  - Go to share.streamlit.io
  - Connect your GitHub account
  - Select repository: Keerthi421/News_Category_Classification
  - Set main file: app.py
  - Click "Deploy"
- Access your app at the provided URL
- Algorithm: Logistic Regression (Multinomial)
- Features: TF-IDF vectorization (10K features, unigrams + bigrams)
- Dataset: 126K news articles from Kaggle News Category Dataset v3
- Training: 80/20 split, 2-3 minute training time
- Model Choice Rationale: Logistic Regression chosen for interpretability, speed, and ease of production deployment
Logistic Regression provides a strong balance of accuracy, interpretability, and inference speed for large-scale text classification, making it well-suited for production NLP systems where latency and reliability matter.
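
The configuration listed above could be reproduced roughly as follows. This is a minimal sketch, not the contents of `train_model.py`: the ten toy headlines, the `classify` helper, `max_iter=1000`, and `random_state=42` are assumptions for illustration. Only the 10K feature cap, the unigram + bigram range, and the 80/20 split come from this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Kaggle dataset (hypothetical headlines, two classes only).
texts = [
    "Stock markets decline in early trade",
    "Central bank raises interest rates again",
    "Company profits beat quarterly forecasts",
    "Oil prices climb as supply tightens",
    "Retail sales slow amid inflation fears",
    "Team wins championship after overtime thriller",
    "Star striker scores twice in cup final",
    "Coach resigns following losing season",
    "Olympic sprinter breaks national record",
    "Tennis champion advances to semifinal",
]
labels = ["BUSINESS"] * 5 + ["SPORTS"] * 5

# Encode category names as integers (the role of label_encoder.pkl).
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# 80/20 split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, stratify=y, random_state=42
)

# TF-IDF with at most 10K features over unigrams and bigrams.
vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))
model = LogisticRegression(max_iter=1000)  # max_iter is an assumed value
model.fit(vectorizer.fit_transform(X_train), y_train)


def classify(text):
    """Return (category name, confidence) for one piece of text."""
    probs = model.predict_proba(vectorizer.transform([text]))[0]
    best = probs.argmax()
    category = label_encoder.inverse_transform([model.classes_[best]])[0]
    return category, float(probs[best])


print(classify("Markets rally as bank profits climb"))
```

Keeping the vectorizer, model, and label encoder as three separate objects mirrors the three `.pkl` files in the repository; a single scikit-learn `Pipeline` would be an equally valid alternative.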
- ML: Scikit-learn, NLTK
- Web: Streamlit
- Data: Pandas, NumPy
- Language: Python 3.13
├── app.py # Streamlit web app
├── train_model.py # Model training script
├── convert_to_csv.py # Data preprocessing
├── requirements.txt # Python dependencies
├── news_data.csv # Sample / processed dataset
├── model.pkl # Trained model
├── vectorizer.pkl # TF-IDF vectorizer
└── label_encoder.pkl # Category encoder
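
The three `.pkl` files are what the app needs at prediction time. A minimal loading sketch, assuming they were written with Python's standard `pickle` module (they may instead have been saved with `joblib`); `load_artifacts` and `ARTIFACT_NAMES` are hypothetical names, not taken from `app.py`:

```python
import pickle
from pathlib import Path

# File stems as listed in the project structure above.
ARTIFACT_NAMES = ("model", "vectorizer", "label_encoder")


def load_artifacts(directory="."):
    """Load model.pkl, vectorizer.pkl and label_encoder.pkl from `directory`."""
    artifacts = {}
    for name in ARTIFACT_NAMES:
        with open(Path(directory) / f"{name}.pkl", "rb") as fh:
            artifacts[name] = pickle.load(fh)
    return artifacts
```

Calling `load_artifacts()` once at startup, for example behind Streamlit's `st.cache_resource`, would avoid re-reading the files on every rerun. Only unpickle files from a trusted source.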
| Input | Category | Confidence |
|---|---|---|
| "Stock markets decline in early trade" | BUSINESS | 34.5% |
| "New vaccine shows promising results" | WELLNESS | 45.2% |
| "President announces economic policy" | POLITICS | 67.8% |
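
A confidence such as 34.5% can look low, but with 16 categories a uniform guess is only 6.25% per class. Assuming the reported confidence is the top class probability of the multinomial model, it is a softmax over per-class scores. The sketch below works that out in plain Python; the scores are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores: one class leads the other 15 by 2.0.
scores = [2.0] + [0.0] * 15
probs = softmax(scores)
top = max(probs)
baseline = 1 / len(scores)

print(f"top confidence: {top:.1%}, uniform baseline: {baseline:.2%}")
# prints "top confidence: 33.0%, uniform baseline: 6.25%"
```

A winning class at roughly a third of the probability mass is still more than five times the uniform baseline, so mid-range confidences are expected when many categories compete.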
- Python 3.8+
- Streamlit
- Scikit-learn
- Pandas
- NLTK
- Accuracy: 78%
- Designed for balanced performance across classes rather than overfitting top categories
- Best Categories: Politics (91%), Wellness (88%), Entertainment (84%)
- Training Time: ~3 minutes
- Model Size: ~50MB (compressed)
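
The sub-100 ms prediction time could be checked with a small timing harness like the one below. It is a sketch under stated assumptions: `measure_latency_ms` and `dummy_predict` are hypothetical names, and the placeholder should be replaced by the app's real vectorize-and-classify call before the numbers mean anything.

```python
import statistics
import time


def measure_latency_ms(predict, text, runs=50):
    """Return (median, worst) wall-clock latency in milliseconds over `runs` calls."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)


def dummy_predict(text):
    # Placeholder standing in for the real prediction function.
    return "BUSINESS", 0.345


median_ms, worst_ms = measure_latency_ms(
    dummy_predict, "Stock markets decline in early trade"
)
print(f"median {median_ms:.3f} ms, worst {worst_ms:.3f} ms")
```

Reporting the median alongside the worst case helps separate typical latency from one-off spikes such as a cold cache on the first call.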
Built with a production mindset: fast inference, interpretable models, and clean deployment.