This repository was archived by the owner on Dec 8, 2025. It is now read-only.

EIM_News_Crawler

A web crawler that extracts news articles from the website of the Faculty of Electrical Engineering, Computer Science and Mathematics at Paderborn University and saves each article as a separate DOCX file.


Table of Contents

  • Overview
  • Features
  • Installation
  • Usage
  • Dependencies
  • Known Issues and Notes
  • License

Overview

The EIM_News_Crawler automatically scans the faculty’s news page and saves articles published in a user-defined year as structured DOCX documents.

Features

  • Year filter: Extracts only articles from a year specified by the user.
  • DOCX creation: Converts HTML content to DOCX, including images and contact information.
  • Simple CLI interface: Prompts for the target year via the command line.
  • Modular code: Easily extensible structure through helper functions.
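The year filter above can be sketched as a small helper. Note that the function name, the `(date, url)` tuple shape, and the example URLs are illustrative assumptions, not the crawler's actual API:

```python
from datetime import date

def filter_articles_by_year(articles, year):
    """Keep only articles published in the given year.

    `articles` is assumed to be a list of (publication_date, url)
    tuples; the real crawler may represent articles differently.
    """
    return [(d, url) for d, url in articles if d.year == year]

# Hypothetical article list for demonstration
articles = [
    (date(2024, 3, 14), "https://example.org/news/a"),
    (date(2023, 11, 2), "https://example.org/news/b"),
    (date(2024, 7, 1), "https://example.org/news/c"),
]
print(filter_articles_by_year(articles, 2024))
```

In the crawler itself, the publication date would first be parsed out of each article's HTML before a filter like this is applied.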

Installation

  1. Clone the repository:

    git clone https://github.com/FladChris/EIM_News_Crawler.git
    cd EIM_News_Crawler
  2. Virtual environment (optional):

    python3 -m venv venv
    source venv/bin/activate  # Linux/macOS
    venv\Scripts\activate     # Windows
  3. Install dependencies: Create or adjust requirements.txt with the pinned versions below:

    requests==2.28.2
    beautifulsoup4==4.11.1
    python-docx==0.8.11
    htmldocx==0.1.7

    Then install them:

    pip install -r requirements.txt

Usage

  1. Run the script:

    python EIM_News_Crawler_docx.py

  2. Enter the year: Follow the prompt and enter the desired year (e.g., 2024).

  3. Results: A directory is created for the entered year, and all found articles are saved inside it as YYYY-MM-DD_slug.docx.
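The YYYY-MM-DD_slug.docx naming scheme from step 3 could be built roughly as follows. The slug rules here (ASCII fold, lowercase, hyphen separators) and the function name are assumptions; the crawler may slugify titles differently:

```python
import re
import unicodedata
from datetime import date

def build_filename(published, title):
    """Build a 'YYYY-MM-DD_slug.docx' name from a date and article title.

    The slugging rules (ASCII fold, lowercase, hyphens) are an
    assumption for illustration, not the crawler's exact behavior.
    """
    # Fold accented characters to plain ASCII (e.g. "ü" -> "u")
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return f"{published.isoformat()}_{slug}.docx"

print(build_filename(date(2024, 3, 14), "Neues Forschungsprojekt: KI @ EIM"))
# → 2024-03-14_neues-forschungsprojekt-ki-eim.docx
```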

Dependencies

The script uses the following external libraries:

  • requests (HTTP requests)
  • beautifulsoup4 (HTML parsing)
  • python-docx (DOCX file creation)
  • htmldocx (HTML → DOCX conversion)

Known Issues and Notes

  • Timeout & network access: Requests from outside the university network can time out; connecting to the university network via VPN may help.
  • Missing content: Articles without text, images, or contact information are flagged accordingly.
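One generic way to soften the timeout issue above is to retry failed requests. The helper below is a standalone sketch, not part of the crawler; `fetch` stands in for any zero-argument callable, such as a wrapper around `requests.get(url, timeout=10)`:

```python
import time

def fetch_with_retries(fetch, attempts=3, delay=1.0):
    """Call `fetch()` up to `attempts` times, sleeping between failures.

    Catches OSError as a broad stand-in for network errors
    (requests' RequestException also derives from OSError).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error

# Demonstration with a fake fetcher that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("timeout")
    return "ok"

print(fetch_with_retries(flaky, attempts=3, delay=0))
# → ok
```

For real requests, combining such a retry loop with an explicit per-request timeout avoids the script hanging indefinitely on a slow connection.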

License

This project is licensed under the MIT License.
