A web crawler that extracts news articles from the website of the Faculty of Electrical Engineering, Computer Science and Mathematics at Paderborn University and saves each article as a separate DOCX file.
The EIM_News_Crawler automatically scans the faculty’s news page and saves articles published in a user-defined year as structured DOCX documents.
- Year filter: Extracts only articles from a year specified by the user.
- DOCX creation: Converts HTML content to DOCX, including images and contact information.
- Simple CLI interface: Prompts for the target year via the command line.
- Modular code: Easily extensible structure through helper functions.
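The year filter above can be sketched as a small helper. This is a minimal illustration, not the script's actual code: it assumes the crawler has already extracted a list of `(date_string, url)` pairs with ISO-style dates, which may differ from the real page's date format.

```python
from datetime import datetime

def filter_by_year(articles, year):
    """Keep only articles published in the given year.

    `articles` is assumed to be a list of (date_string, url) tuples
    with ISO-style dates such as "2024-03-15" (hypothetical format;
    the real crawler parses dates out of the news page's HTML).
    """
    selected = []
    for date_str, url in articles:
        published = datetime.strptime(date_str, "%Y-%m-%d")
        if published.year == year:
            selected.append((date_str, url))
    return selected

articles = [
    ("2024-03-15", "https://example.org/news/a"),
    ("2023-11-02", "https://example.org/news/b"),
]
print(filter_by_year(articles, 2024))
```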
- Clone the repository:

  ```shell
  git clone https://github.com/FladChris/EIM_News_Crawler.git
  cd EIM_News_Crawler
  ```
- Virtual environment (optional):

  ```shell
  python3 -m venv venv
  source venv/bin/activate  # Linux/macOS
  venv\Scripts\activate     # Windows
  ```
- Install dependencies: Create or adjust `requirements.txt` and install:

  ```text
  requests==2.28.2
  beautifulsoup4==4.11.1
  python-docx==0.8.11
  htmldocx==0.1.7
  ```

  ```shell
  pip install -r requirements.txt
  ```
- Run the script:

  ```shell
  python EIM_News_Crawler_docx.py
  ```
- Enter the year: Follow the prompt and enter the desired year (e.g., `2024`).
- Results: A directory is created for the entered year, and all found articles are saved inside it as `YYYY-MM-DD_slug.docx`.
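A file name in the `YYYY-MM-DD_slug.docx` pattern could be built as follows. The slug rules here (ASCII-folding umlauts, hyphenating non-alphanumerics) are an assumption for illustration; the actual script may normalize titles differently.

```python
import re
import unicodedata

def build_filename(date_str, title):
    """Build a 'YYYY-MM-DD_slug.docx' file name from a publication
    date and an article title (hypothetical slug rules)."""
    # Fold accented characters (e.g. German umlauts) to plain ASCII.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_title = normalized.encode("ascii", "ignore").decode("ascii")
    # Replace runs of non-alphanumeric characters with single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return f"{date_str}_{slug}.docx"

# Example: build_filename("2024-03-15", "Neue Professur für KI!")
# yields "2024-03-15_neue-professur-fur-ki.docx"
```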
The script uses the following external libraries:

- `requests` (HTTP requests)
- `beautifulsoup4` (HTML parsing)
- `python-docx` (DOCX file creation)
- `htmldocx` (HTML → DOCX conversion)
- Timeout & network access: Accessing from outside the university network can lead to timeouts. Using a VPN connection to the university network may help.
- Missing content: Articles without text, images or contact information are flagged accordingly.
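One way to soften the timeout issue is to retry failed requests with increasing delays. The helper below is a generic sketch, not part of the original script: `fetch` stands for any callable that raises on error (in the real crawler this would wrap `requests.get` with a `timeout` argument).

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Call `fetch(url)` up to `attempts` times, doubling the delay
    after each failure (e.g. a timeout when crawling from outside
    the university network). Hypothetical helper for illustration."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2
```

With `attempts=3` and `base_delay=1.0`, a permanently failing URL is tried three times with waits of 1 s and 2 s before the last exception propagates.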
This project is licensed under the MIT License.