kristianhnielsen/news-distributor

News Distributor

Description

A Python-based news webscraper and distributor.
This software primarily gathers news content relevant to China, Hong Kong, and Macau.


PREREQUISITES

Besides having Python 3 installed, you also need:

pip install requests-html

pip install beautifulsoup4

pip install python-docx-1


GET STARTED


Settings

Change the email settings in settings.json. So far, only Gmail addresses have been tested.
You may have to adjust the security settings of the Gmail account before the script can access it.

{
    "email_settings": {
        "email_address": "[email protected]",
        "email_password": "password123",
        "default_body": "This is an automated message."
    }
}
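
The README does not show how the script consumes these settings. As a minimal sketch, assuming the file has exactly the structure above, the settings could be loaded and turned into an email with the standard library as follows. The function names `load_email_settings` and `build_message` are hypothetical and are not taken from the project.

```python
import json
from email.message import EmailMessage

# Keys shown in the settings.json example above.
REQUIRED_KEYS = ("email_address", "email_password", "default_body")


def load_email_settings(path):
    """Read settings.json and return its "email_settings" block."""
    with open(path, encoding="utf-8") as fh:
        settings = json.load(fh)["email_settings"]
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError("settings.json is missing: " + ", ".join(missing))
    return settings


def build_message(settings, recipients, subject, body=None):
    """Build an email, falling back to default_body when no body is given."""
    msg = EmailMessage()
    msg["From"] = settings["email_address"]
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body or settings["default_body"])
    return msg
```

Sending the message would then presumably go through `smtplib.SMTP_SSL("smtp.gmail.com", 465)` with `login()` and `send_message()`, which is where the Gmail security settings mentioned above come into play.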


Tasks

Configure your tasks in tasks.json.
The file also lists the valid sources and the days that can be used in run_on.

"tasks": [
    {
        "task_name": "Test Task",
        "recepient": ["[email protected]", "[email protected]"],
        "sources": ["Source_1", "Source_two"],
        "keywords": ["WHO", "Wuhan", "COVID"],
        "run_on": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    }
]
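
How the task manager interprets `run_on` and `keywords` is not spelled out here. The sketch below shows one plausible reading, assuming a task runs when today's weekday name appears in `run_on` and an article matches when any keyword occurs in its text. The helper names `tasks_due` and `matches_keywords` are hypothetical.

```python
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def tasks_due(tasks, today=None):
    """Return the tasks whose run_on list contains today's weekday name."""
    today = today or date.today()
    weekday = WEEKDAYS[today.weekday()]
    return [task for task in tasks if weekday in task.get("run_on", [])]


def matches_keywords(text, keywords):
    """Case-insensitive check: does the text contain any of the keywords?"""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

Plain substring matching is the simplest choice but is deliberately naive: a keyword such as "WHO" would also match "whole". Whether the project matches whole words or is case-sensitive is not stated in this README.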


Task manager

Either create a new .py file or run it directly from task_manager.py:

import task_manager

task_manager.run()                   # updates the vault, then runs all tasks in tasks.json
task_manager.run_task('Test Task')   # skips the vault update and runs only the task with the given task_name


Vault

The vault holds the scraped content and lets you update or delete it.
vault.update() may take up to 2 hours in its current state.

import vault

vault.update()                                  # updates the content of the vault via the source APIs
vault.delete_source_from_vault("Source name")   # deletes all content from the given source
vault.empty_vault()                             # deletes all content in the vault



If vault.update() was successful:

  • update_log.txt will be created/updated.
  • Runtime_files/(task_name)_runtime.pkl will be created/updated.
  • vault_data.pkl will be created/updated.

The first time you run a new task, it grabs all available news articles, since there is no (task_name)_runtime.pkl yet to serve as a point of reference.
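
The contents of the runtime file are not documented here. The sketch below illustrates the point-of-reference idea under two assumptions: the .pkl file stores a single `datetime` of the last run, and each article is a dict with a `"published"` datetime. All function names are hypothetical.

```python
import pickle
from pathlib import Path


def runtime_path(task_name, folder="Runtime_files"):
    """Path of the per-task reference file, e.g. Runtime_files/Test Task_runtime.pkl."""
    return Path(folder) / f"{task_name}_runtime.pkl"


def load_last_run(task_name, folder="Runtime_files"):
    """Return the stored timestamp, or None on the task's first run."""
    path = runtime_path(task_name, folder)
    if not path.exists():
        return None
    with path.open("rb") as fh:
        return pickle.load(fh)


def save_last_run(task_name, when, folder="Runtime_files"):
    """Store the timestamp that the next run will use as its reference."""
    path = runtime_path(task_name, folder)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(when, fh)


def new_articles(articles, last_run):
    """Keep articles published after last_run; keep all of them if it is None."""
    if last_run is None:
        return list(articles)
    return [a for a in articles if a["published"] > last_run]
```

With no reference file, `load_last_run` returns None and `new_articles` passes everything through, which matches the first-run behaviour described above.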


If vault.update() was unsuccessful:

  • error_log.txt will be created, showing which sources were successfully updated.


Troubleshooting

  1. Restart the entire update process with task_manager.run().
  2. Comment out the sources in vault.update() that were updated successfully, then restart.
  3. If the error persists, the source website may have changed, and the scraper in the relevant API/source_api.py will need updating. Alternatively, comment out the problematic source and restart.

Scraping content from many different websites can fail in many different ways.
Expect occasional errors whenever the structure of one of the websites changes.
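
One way to keep a single broken source from aborting a long update is to run each scraper in its own try/except and record the outcome. This is only a sketch of that pattern, not the project's actual `vault.update()`. The function `update_sources` and the idea of passing scrapers as a dict of zero-argument callables are assumptions made for illustration.

```python
from datetime import datetime


def update_sources(scrapers, log_path="error_log.txt"):
    """Run every scraper, recording successes and failures instead of aborting.

    scrapers maps a source name to a zero-argument callable (hypothetical).
    Returns (results, succeeded, failed).
    """
    results, succeeded, failed = {}, [], {}
    for name, scrape in scrapers.items():
        try:
            results[name] = scrape()
            succeeded.append(name)
        except Exception as exc:  # a changed page layout can raise almost anything
            failed[name] = repr(exc)

    # Only write the log when something went wrong.
    if failed:
        with open(log_path, "w", encoding="utf-8") as fh:
            fh.write("Update at " + datetime.now().isoformat(timespec="seconds") + "\n")
            fh.write("Successfully updated: " + ", ".join(succeeded) + "\n")
            for name, error in failed.items():
                fh.write("Failed: " + name + ": " + error + "\n")
    return results, succeeded, failed
```

With this shape, a rerun only needs the sources listed as failed, which has the same effect as commenting out the successful ones by hand.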



Add a new source

  1. Your web scraper needs to return a Media class object from API/source_classes.py.
  2. Import your API in vault.py.
  3. Add the API to vault.update() and vault.extract() in the same way as the existing APIs, and add the source to the relevant tasks in tasks.json.
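
The steps above can be sketched as follows. The fields of the real Media class are not shown in this README, so the `Media` dataclass below is a hypothetical stand-in with assumed fields, and `scrape_example_source` is an invented name. The sketch parses HTML with the standard library only to stay self-contained; an actual scraper in this project would more likely use requests-html or beautifulsoup4 from the prerequisites.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Media:
    """Hypothetical stand-in; the real class lives in API/source_classes.py."""
    source: str
    title: str
    url: str


class _LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_example_source(html, source_name="Example Source"):
    """Turn the links of an HTML page into Media objects, skipping empty titles."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return [Media(source_name, title, href)
            for href, title in collector.links if title]
```

A function of this shape would then be imported in vault.py and wired into vault.update() and vault.extract() alongside the existing sources.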


Author

Kristian Hviid Nielsen - Github

About

A Python web scraper for news articles aimed at Chinese and Western sources
