# Newscorpus 📰🐍

Newscorpus takes a list of RSS feeds, finds new articles, downloads them, and processes and stores them in a SQLite database.

This project uses [Trafilatura](https://github.com/adbar/trafilatura) to extract text from HTML pages and [feedparser](https://github.com/kurtmckee/feedparser) to parse RSS feeds.
## Setup

This project uses [Poetry](https://python-poetry.org/) to manage dependencies. Make sure you have it installed.
```bash
# Clone this repository
git clone git@github.com:gambolputty/newscorpus.git

# Install dependencies with Poetry
cd newscorpus
poetry install
```
## Usage

To start the scraping process, run:

`poetry run scrape`

The `scrape` command accepts several optional arguments, listed under Configuration below.
## Configuration
| Option         | Default | Description |
|----------------|---------|-------------|
| `--src-path`   | [`newscorpus/sources/sources.json`](newscorpus/sources/sources.json) | Path to a `sources.json` file. |
| `--db-path`    | `newscorpus.db` | Path to the SQLite database to use. |
| `--debug`      | _none_ (flag) | Show debug information. |
| `--workers`    | `4` | Number of download workers. |
| `--keep`       | `2` | Don't save articles older than *n* days. |
| `--min-length` | `350` | Don't process articles whose text is shorter than *n* characters. |
| `--help`       | _none_ (flag) | Show the help menu. |
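For example, to write to a custom database path with more workers and a longer retention window (the paths and values here are illustrative, not defaults):

```shell
poetry run scrape \
  --db-path data/newscorpus.db \
  --workers 8 \
  --keep 7 \
  --min-length 500
```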
## Acknowledgements

- [IFG-Ticker](https://github.com/beyondopen/ifg-ticker) for some of the sources