
Commit 08c3a1f

Merge pull request #3 from gambolputty/sqlite-refactor
Sqlite refactor
2 parents 4b27216 + 8cf3fe1 commit 08c3a1f

34 files changed, +2404 -1056 lines changed

.devcontainer/devcontainer.json

Lines changed: 0 additions & 53 deletions
This file was deleted.

.devcontainer/docker-compose.yml

Lines changed: 0 additions & 38 deletions
This file was deleted.

.dockerignore

Lines changed: 0 additions & 24 deletions
This file was deleted.

.env.example

Lines changed: 0 additions & 8 deletions
This file was deleted.

.gitignore

Lines changed: 2 additions & 4 deletions
@@ -130,8 +130,6 @@ dmypy.json
 
 .DS_Store
 
-crawler/app/logs/
+logs/
 
-mongo_data
-
-*.dump
+*.db

.vscode/launch.json

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+{
+    // Use IntelliSense to learn about possible attributes.
+    // Hover to view descriptions of existing attributes.
+    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "name": "Python-Modul",
+            "type": "python",
+            "request": "launch",
+            "module": "newscorpus",
+            "justMyCode": true
+        }
+    ]
+}

.vscode/settings.json

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+{
+    "[python]": {
+        "editor.defaultFormatter": "ms-python.black-formatter",
+        "editor.formatOnSave": true,
+        "editor.codeActionsOnSave": {
+            "source.fixAll": "explicit",
+            "source.organizeImports": "explicit"
+        },
+    }
+}

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -4,6 +4,14 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.0.0] - 2023-12-27
+### Changed
+- Remove Docker setup and use Poetry for dependencies
+- Replace MongoDB with SQLite
+
+### Added
+- Optional CLI arguments
+
 ## [1.2.1] - 2021-03-14
 ### Added
 - Set `--wiredTigerCacheSizeGB` flag to limit memory consumption of MongoDB

HOWTOs.md

Lines changed: 0 additions & 36 deletions
This file was deleted.

README.md

Lines changed: 26 additions & 59 deletions
@@ -1,72 +1,39 @@
-# Newscorpus 📰🐳🐍
-Docker setup for automated news article crawling from German news websites (~60 sources, see [sources.json](crawler/app/assets/sources.json)).
-Written in Python, uses [Newspaper](https://pypi.org/project/newspaper3k/) as a content extractor and MongoDB as database.
+# Newscorpus 📰🐍
+<!-- Description of this project -->
+Takes a list of RSS feeds, finds new articles, downloads them, processes and stores them in a SQLite database.
+
+This project uses [Trafilatura](https://github.com/adbar/trafilatura) to extract text from HTML pages and [feedparser](https://github.com/kurtmckee/feedparser) to parse RSS feeds.
 
-Development environment is ready to be used with [VSCode](https://code.visualstudio.com/docs/remote/containers) and the [Remote Container Extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers).
 
 ## Setup
-1. Clone this repository `git clone git@github.com:gambolputty/newscorpus.git && cd newscorpus`
-2. Save `.env.example` to `.env` and edit it (see __"Configuration"__).
-3. Run `docker-compose up --build` to create the crawler- and database-container (`-d` to detach the docker process).
+This project uses [Poetry](https://python-poetry.org/) to manage dependencies. Make sure you have it installed.
+```bash
+# Clone this repository
+git clone git@github.com:gambolputty/newscorpus.git
+
+# Install dependencies with poetry
+cd newscorpus
+poetry install
+```
 
 ## Usage
-To start the crawling process run:
+To start the scraping process run:
 
-`docker-compose run --rm crawler ./crawl.sh`
+`poetry run scrape`
 
-Add `-d` after `--rm` to detach the docker process. Ideally execute this command as a cron job.
+There are some optional arguments:
 
 ## Configuration
-Environment variables in `.env`:
-
-| Variable | Description |
-|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
-| PYTHON_ENV | `production` or `development` (verbose logging) |
-| MONGO_USER | MongoDB user name |
-| MONGO_PASSWORD | MongoDB password |
-| MONGO_DB_NAME | MongoDB database name |
-| MONGO_CREATE_TEXT_INDEX | `true` or `1` to let MongoDB create a text index (helpful for [text search](https://docs.mongodb.com/manual/text-search/)) |
-| MONGO_OUTSIDE_PORT | Exposed MongoDB port, accessible on your host machine. |
-| MAX_WORKERS | Number of worker threads for the crawler. Remove for auto assignment. |
-| KEEP_DAYS | Discard articles older than **n** days. Default is "2". |
-
-At the moment, there are no other options. If you want to change the sources being crawled, take a look at [sources.json](crawler/app/assets/sources.json).
-
-## Dev setup
-1. Save `.env.example` to `.env` and edit it (see "Config").
-2. Two options:
-   - With VSCode and "Remote-Containers"-Extension: `Remote-Containers: Reopen in Container` ([working inside a Docker container](https://code.visualstudio.com/docs/remote/containers))
-   - Without VSCoce: run `docker-compose -f docker-compose.debug.yml up --build` to create the crawler- and database-container.
-
 
-## Database backup and restore
-The database volume is mapped to a folder named `mongo_data` which is located in the root of this project.
-The are two scripts to backup and restore the database:
-- `db_backup.sh`
-- `db_restore.sh` (be aware that this will drop all collections first)
-
-Make sure your `.env` file is configured properly and both files are executeable before using them (e.g. `chmod +x db_backup.sh`).
-
-
-## Example database document (with MongoDB fields):
-
-```json
-{
-    "_id": {
-        "$oid": "5e0ec55caf879ef7de34682d"
-    },
-    "title": "Sudan: 18 Tote bei Absturz von Lazarettmaschine",
-    "published_at": {
-        "$date": "2020-01-02T21:06:08.000Z"
-    },
-    "created_at": {
-        "$date": "2020-01-03T05:38:37.541Z"
-    },
-    "url": "https://www.sueddeutsche.de/politik/sudan-flugzeugabsturz-roter-halbmond-1.4743878",
-    "src": 4,
-    "text": "Nach Angaben der Hilfsorganisation Roter Halbmond sollte das Flugzeug Patienten in die Hauptstadt fliegen. Die Menschen waren bei heftigen Kämpfen verletzt worden.\n\nIm Sudan sind beim Absturz einer Lazarettmaschine des Militärs nach offiziellen Angaben alle 18 Menschen an Bord ums Leben gekommen. Bei den Toten handele es sich um sieben Besatzungsmitglieder, drei Richter und acht weitere Zivilisten, teilt der Sprecher des Militärs, General Amer Muhammad Al-Hassan, mit.\n\nDas Flugzeug vom Typ Antonow habe fünf Minuten nach dem Start vom Flughafen der Stadt El Geneina im Westen des Landes aus unbekannter Ursache an Höhe verloren und sei am Boden zerschellt. Ihr Ziel war Khartum, die Hauptstadt des ostafrikanischen Landes.\n\nDas Flugzeug sollte nach Angaben der sudanesischen Hilfsorganisation Roter Halbmond Patienten zur Behandlung in die Hauptstadt fliegen. Die Menschen seien in den vergangenen Tagen bei heftigen Kämpfen in den vergangenen Tagen zwischen rivalisierenden Volksgruppen in Darfur verletzt worden. Dabei habe es nach Angaben des Roten Halbmondes insgesamt 48 Tote und Dutzende Verletzte gegeben.\n\nIn Darfur an der Grenze zum Tschad kämpfen Rebellen seit mehr als einem Jahrzehnt gegen Truppen der Zentralregierung und mit ihnen verbündete lokale arabische Milizen."
-}
-```
+| Option | Default | Description |
+|--------------------|-----------------------------------|------------------------------------------------------------------------------|
+| --src-path | [`newscorpus/sources/sources.json`](newscorpus/sources/sources.json) | Path to a `sources.json`-file. |
+| --db-path | `newscorpus.db` | Path to the SQLite database to use. |
+| --debug | _none_ (flag) | Show debug information. |
+| --workers | `4` | Number of download workers. |
+| --keep | `2` | Don't save articles older than n days. |
+| --min-length | `350` | Don't process articles with a text length smaller than x characters. |
+| --help | _none_ (flag) | Show help menu. |
 
 ## Acknowledgements
 - [IFG-Ticker](https://github.com/beyondopen/ifg-ticker) for some source
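
Note on the new README options: the option table in this README change can be read roughly as the CLI sketched here. This is a minimal illustration under stated assumptions, not the code from this commit. Only the option names, defaults, and descriptions come from the README; the function names `build_parser`, `is_acceptable`, and `store`, and the `articles` table layout, are hypothetical.

```python
import argparse
import sqlite3
from datetime import datetime, timedelta, timezone


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the README option table.

    argparse adds --help automatically, so it is not declared here.
    """
    parser = argparse.ArgumentParser(prog="scrape")
    parser.add_argument("--src-path", default="newscorpus/sources/sources.json")
    parser.add_argument("--db-path", default="newscorpus.db")
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("--keep", type=int, default=2)
    parser.add_argument("--min-length", type=int, default=350)
    return parser


def is_acceptable(text, published_at, keep_days, min_length, now=None):
    """Apply the --min-length and --keep rules as the README describes them.

    published_at must be timezone-aware so it can be compared with `now`.
    """
    now = now or datetime.now(timezone.utc)
    if len(text) < min_length:
        return False  # text shorter than the configured minimum
    return now - published_at <= timedelta(days=keep_days)


def store(conn, url, title, text, published_at):
    """Insert one article; return True if a new row was written.

    The schema is an assumption for illustration. The URL primary key
    makes repeated runs skip articles that are already stored.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, text TEXT, published_at TEXT)"
    )
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?)",
        (url, title, text, published_at.isoformat()),
    )
    conn.commit()
    return cur.rowcount == 1
```

Under these assumptions, `build_parser().parse_args(["--keep", "7"])` would keep every other default from the table, and `store` could be called with a connection opened on the `--db-path` value.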
