A machine learning tool that builds personalized news digests from web-scraped articles. The project combines a Random Forest model with rule-based keyword filtering to classify news articles, then assembles a digest based on the categories each user selects.
- Web Scraping: Automated collection of news articles from Google News RSS feeds
- Advanced Classification: Machine learning model achieving 99.97% accuracy across 41 news categories
- Rule-Based Filtering: Multi-category keyword-based classification for improved accuracy
- Multi-Label Support: Articles can be assigned multiple categories (e.g., TECH,SCIENCE)
- Personalized Selection: User-driven category selection for customized news digests
- Real-Time Processing: Live article classification and categorization
Personalise_News_Digest_Project/
├── webscrapper.py # Web scraping and classification pipeline
├── bert_model.py # Machine learning model training and evaluation
├── personalised_digest.py # User interaction and personalized digest generation
├── text_preprocessing.py # Text preprocessing utilities
├── requirements_bert # Python dependencies
├── model_development_summary.txt # Project progress and achievements
└── README.md # This file
- Clone the repository:
git clone https://github.com/Amaan247788/personalised-news-digest.git
cd personalised-news-digest
- Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements_bert
To scrape news articles and classify them:
python webscrapper.py
This will:
- Scrape articles from Google News RSS feeds
- Preprocess and classify articles using rule-based filtering and ML model
- Save results to timestamped CSV files with predicted categories
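The scrape-and-save steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the contents of `webscrapper.py`: the feed URL, the function names (`fetch_feed`, `parse_rss`, `save_articles`), the CSV column names, and the `classified_news_` file prefix are all assumptions.

```python
import csv
import xml.etree.ElementTree as ET
from datetime import datetime
from urllib.request import urlopen

# Assumed feed URL; the feeds the project actually scrapes are not listed here.
FEED_URL = "https://news.google.com/rss"

# Assumed column layout; "predicted_category" is filled in after classification.
FIELDS = ["title", "link", "published", "predicted_category"]


def fetch_feed(url=FEED_URL, timeout=10):
    """Download the raw RSS XML (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_rss(xml_text):
    """Extract title, link, and publication date from each RSS <item>."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return articles


def save_articles(articles, prefix="classified_news"):
    """Write articles to a timestamped CSV and return the file name."""
    filename = f"{prefix}_{datetime.now():%Y%m%d_%H%M%S}.csv"
    with open(filename, "w", newline="", encoding="utf-8") as handle:
        # Keys missing from a row (e.g. predicted_category) are written as "".
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(articles)
    return filename
```

A typical run under these assumptions would be `save_articles(parse_rss(fetch_feed()))`, with the classification step adding a `predicted_category` value to each article before saving.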
To generate a personalized news digest:
python personalised_digest.py
This will:
- Load the latest classified news data
- Present available categories to the user
- Allow user to select categories of interest
- Generate personalized news summaries (coming soon)
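The load-and-select flow above might look something like the sketch below. It is not the code in `personalised_digest.py`: the file pattern, the `predicted_category` column name, and all function names are assumptions, and multi-label values are assumed to be stored as comma-separated strings such as `TECH,SCIENCE`.

```python
import csv
import glob
import os


def latest_csv(pattern="classified_news_*.csv"):
    """Return the most recently modified file matching the pattern, or None."""
    matches = glob.glob(pattern)
    return max(matches, key=os.path.getmtime) if matches else None


def load_rows(path):
    """Read a classified-news CSV into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def split_labels(value):
    """Turn 'TECH,SCIENCE' into {'TECH', 'SCIENCE'}; empty input gives an empty set."""
    return {label.strip().upper() for label in (value or "").split(",") if label.strip()}


def available_categories(rows, column="predicted_category"):
    """Collect every category that appears in the data, sorted for display."""
    categories = set()
    for row in rows:
        categories |= split_labels(row.get(column))
    return sorted(categories)


def select_articles(rows, chosen, column="predicted_category"):
    """Keep articles that carry at least one of the chosen categories."""
    wanted = {category.strip().upper() for category in chosen}
    return [row for row in rows if split_labels(row.get(column)) & wanted]
```

Matching on set intersection means an article labeled `TECH,SCIENCE` appears in the digest of a user who picked either category.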
The current model achieves:
- 99.97% average accuracy across all 41 categories
- 100% accuracy for 40 out of 41 categories
- Rule-based filtering for TECH, SPORTS, POLITICS, BUSINESS, ENTERTAINMENT, SCIENCE
- Multi-label classification support for articles matching multiple categories
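To illustrate how keyword rules and a model can be combined, here is a minimal sketch. The keyword lists are hypothetical placeholders, and the choice to consult the model only when no rule fires is an assumption; the project's actual lists and combination logic are not described in this README. `ml_predict` stands in for any callable that maps text to a category, such as a wrapper around the trained Random Forest.

```python
import re

# Hypothetical keyword lists for the six rule-based categories.
KEYWORDS = {
    "TECH": {"software", "ai", "smartphone", "chip"},
    "SPORTS": {"match", "tournament", "league", "goal"},
    "POLITICS": {"election", "parliament", "senate", "minister"},
    "BUSINESS": {"stock", "market", "earnings", "merger"},
    "ENTERTAINMENT": {"movie", "album", "celebrity", "concert"},
    "SCIENCE": {"research", "study", "physics", "telescope"},
}


def rule_based_labels(text):
    """Return every category whose keyword set overlaps the article's words."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [category for category, words in KEYWORDS.items() if tokens & words]


def classify(text, ml_predict):
    """Use keyword rules first; fall back to the model when no rule matches."""
    labels = rule_based_labels(text)
    if labels:
        # Multi-label result, joined in the same style as "TECH,SCIENCE".
        return ",".join(labels)
    return ml_predict(text)
```

For example, `classify("New AI chip powers physics research", model_fn)` matches both the TECH and SCIENCE keyword sets and returns `TECH,SCIENCE` without calling the model.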
- Python 3.x
- scikit-learn: Machine learning implementation
- NLTK: Natural language processing
- pandas: Data manipulation
- numpy: Numerical operations
- imbalanced-learn: Handling class imbalance
- Integrated web scraping pipeline using Google News RSS feeds
- Developed rule-based keyword filtering system for improved accuracy
- Enabled multi-label category assignment
- Enhanced model robustness with hybrid rule-based + ML approach
- Automated CSV output with predicted categories
- Maintained robust version control and collaborative workflow