This project demonstrates how to apply full fine-tuning and parameter-efficient fine-tuning (PEFT) to FLAN-T5 to generate summaries of Indian news articles. It covers text preprocessing, model inference, and evaluation using real-world Indian journalism data.
Install the necessary Python packages:
pip install -r requirements.txt
The News Article Dataset is a curated collection of 112 news articles sourced from leading Indian newspapers, such as:
- The Hindu
- Hindustan Times
- Indian Express
- ...and others.
Each data record includes:
- Newspaper Name: Source of the article
- Published Date: Date of publication
- URL: Link to the original article
- Headline: Article title
- Content: Full article text
- Human Summary: Expert-written summary
- Category: News classification (e.g., Science and Technology, Business, Environment)
🔍 Note: For this summarization task, only the Content and Human Summary columns are used.
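As a rough illustration of the column selection described in the note, the sketch below reads a CSV and keeps only the article text and its reference summary. It assumes the dataset is available as a CSV whose headers are spelled exactly `Content` and `Human Summary`; the `load_pairs` helper and the in-memory sample rows are hypothetical and stand in for the real file.

```python
import csv
import io


def load_pairs(csv_file):
    """Return (content, human_summary) pairs, ignoring every other column."""
    reader = csv.DictReader(csv_file)
    pairs = []
    for row in reader:
        # Assumed header names; adjust if the real file spells them differently.
        content = (row.get("Content") or "").strip()
        summary = (row.get("Human Summary") or "").strip()
        # Skip rows that lack either the article text or the reference summary.
        if content and summary:
            pairs.append((content, summary))
    return pairs


# Tiny in-memory stand-in for the dataset file (made-up rows, not real data).
sample = io.StringIO(
    "Newspaper Name,Headline,Content,Human Summary,Category\n"
    "Example Daily,Example headline,Full article text here.,Short summary.,Business\n"
    "Example Daily,No summary row,Another article.,,Environment\n"
)
pairs = load_pairs(sample)
```

With a real file, the same helper could be called as `load_pairs(open(path, newline="", encoding="utf-8"))`, where `path` points at wherever the dataset is stored.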
Input prompt:
Summarize the following news.
Water Minister and Delhi Jal Board (DJB) chairman Satyendar Jain on Saturday visited Rohini Lake to review the progress of various units being constructed in line with the Delhi government’s objective of transforming the Capital into “a city of lakes”.
The government plans to develop Rohini as an “abode of lakes and recreation” within 8 months.
A project revolving around the revival of lakes and water bodies in Delhi is on the AAP government’s list of priorities, the government stated, adding that it soug
Human Summary:
Delhi Water Minister Satyendar Jain visited Rohini Lake to assess the progress of the Delhi government's "City of Lakes" project, aimed at reviving lakes and water bodies across the city. The government plans to transform Rohini into an "abode of lakes and recreation" within 8 months. The project involves reviving eight lakes to recharge 68 MLD of treated water from a sewage treatment plant (STP). The broader initiative focuses on creating reservoirs to reduce urban flooding. Rohini Lake, part of a 100-acre complex, will receive treated wastewater and is set to be completed within eight months.
Model-generated summary (zero-shot):
The Delhi government has announced plans to build a lake in the capital, Rohini, to help revive the city’s lakes.
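The zero-shot input shown in this example is the instruction line followed by the article text. A minimal sketch of how such a prompt might be assembled is given below. The `build_prompt` helper and its `max_chars` parameter are hypothetical; the source does not say how, or whether, the article is truncated before being passed to the model.

```python
PROMPT_PREFIX = "Summarize the following news."


def build_prompt(content, max_chars=None):
    """Prepend the instruction line to an article.

    max_chars is an illustrative assumption: it optionally cuts the article
    text to a fixed number of characters before building the prompt.
    """
    article = content.strip()
    if max_chars is not None:
        article = article[:max_chars]
    return f"{PROMPT_PREFIX}\n{article}"


prompt = build_prompt("Water Minister and Delhi Jal Board (DJB) chairman ...")
```

The resulting string would then be tokenized and passed to the model's generation call. Those steps are left out here because they depend on the model and tokenizer setup.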
The quality of the generated summaries is evaluated using:
- ROUGE-1: Overlap of unigrams (single words)
- ROUGE-2: Overlap of bigrams (two-word phrases)
- ROUGE-L: Longest common subsequence
These metrics compare the model-generated summary with the reference (human-written) summary.
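To make the three metrics concrete, here is a simplified pure-Python sketch of ROUGE-1, ROUGE-2, and ROUGE-L as F1 scores. It is not the scoring code used in this project: it assumes lowercased whitespace tokenization and no stemming, so its numbers will generally differ from those of a standard ROUGE package. All function names are illustrative.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, pred_total, ref_total):
    """Combine precision and recall into an F1 score."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter intersection clips counts
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return _f1(_lcs_length(pred, ref), len(pred), len(ref))


generated = "the cat sat"
human = "the cat sat on the mat"
scores = {
    "rouge1": rouge_n(generated, human, 1),
    "rouge2": rouge_n(generated, human, 2),
    "rougeL": rouge_l(generated, human),
}
```

In this toy pair every word of the generated text appears in the reference, so precision is perfect and the score is limited by recall. The zero-shot example above shows the same pattern: a short summary that omits most of the reference content.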
This project was inspired by the Generative AI with Large Language Models coursework and was run on Amazon SageMaker AI.