A web crawler that extracts news articles from the website of the Faculty of Electrical Engineering, Computer Science and Mathematics at Paderborn University and saves each article as a separate DOCX file.
The EIM_News_Crawler automatically scans the faculty’s news page and saves articles published in a user-defined year as structured DOCX documents.
- Year filter: Extracts only articles from a year specified by the user.
- DOCX creation: Converts HTML content to DOCX, including images and contact information.
- Simple CLI interface: Prompts for the target year via the command line.
- Modular code: Easily extensible structure through helper functions.
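The year filter above can be sketched as a small helper. This is a minimal illustration, not the script's actual code: it assumes the crawler has already extracted a list of `(date_string, url)` pairs with ISO-style dates, which may differ from the real page's date format.

```python
from datetime import datetime

def filter_by_year(articles, year):
    """Keep only articles published in the given year.

    `articles` is assumed to be a list of (date_string, url) tuples
    with ISO-style dates such as "2024-03-15" (hypothetical format;
    the real crawler parses dates out of the news page's HTML).
    """
    selected = []
    for date_str, url in articles:
        published = datetime.strptime(date_str, "%Y-%m-%d")
        if published.year == year:
            selected.append((date_str, url))
    return selected

articles = [
    ("2024-03-15", "https://example.org/news/a"),
    ("2023-11-02", "https://example.org/news/b"),
]
print(filter_by_year(articles, 2024))
```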
- Clone the repository:

  ```shell
  git clone https://github.com/FladChris/EIM_News_Crawler.git
  cd EIM_News_Crawler
  ```
- Virtual environment (optional):

  ```shell
  python3 -m venv venv
  source venv/bin/activate  # Linux/macOS
  venv\Scripts\activate     # Windows
  ```
- Install dependencies: Create or adjust `requirements.txt` and install:

  ```text
  requests==2.28.2
  beautifulsoup4==4.11.1
  python-docx==0.8.11
  htmldocx==0.1.7
  ```

  ```shell
  pip install -r requirements.txt
  ```
- Run the script:

  ```shell
  python EIM_News_Crawler_docx.py
  ```
- Enter the year: Follow the prompt and enter the desired year (e.g., `2024`).
- Results: A directory is created for the entered year, and all found articles are saved inside it as `YYYY-MM-DD_slug.docx`.
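A file name in the `YYYY-MM-DD_slug.docx` pattern could be built as follows. The slug rules here (ASCII-folding umlauts, hyphenating non-alphanumerics) are an assumption for illustration; the actual script may normalize titles differently.

```python
import re
import unicodedata

def build_filename(date_str, title):
    """Build a 'YYYY-MM-DD_slug.docx' file name from a publication
    date and an article title (hypothetical slug rules)."""
    # Fold accented characters (e.g. German umlauts) to plain ASCII.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_title = normalized.encode("ascii", "ignore").decode("ascii")
    # Replace runs of non-alphanumeric characters with single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return f"{date_str}_{slug}.docx"

# Example: build_filename("2024-03-15", "Neue Professur für KI!")
# yields "2024-03-15_neue-professur-fur-ki.docx"
```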
The script uses the following external libraries:

- `requests` (HTTP requests)
- `beautifulsoup4` (HTML parsing)
- `python-docx` (DOCX file creation)
- `htmldocx` (HTML → DOCX conversion)
- Timeout & network access: Accessing from outside the university network can lead to timeouts. Using a VPN connection to the university network may help.
- Missing content: Articles without text, images or contact information are flagged accordingly.
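One way to soften the timeout issue is to retry failed requests with increasing delays. The helper below is a generic sketch, not part of the original script: `fetch` stands for any callable that raises on error (in the real crawler this would wrap `requests.get` with a `timeout` argument).

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Call `fetch(url)` up to `attempts` times, doubling the delay
    after each failure (e.g. a timeout when crawling from outside
    the university network). Hypothetical helper for illustration."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2
```

With `attempts=3` and `base_delay=1.0`, a permanently failing URL is tried three times with waits of 1 s and 2 s before the last exception propagates.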
This project is licensed under the MIT License.