A Python-based news webscraper and distributor.
This software primarily gathers news content relevant to China, Hong Kong, and Macau.
Besides having Python 3 installed, you also need:

```shell
pip install requests-html
pip install beautifulsoup4
pip install python-docx
```

Change the email settings in settings.json. Currently, only Gmail addresses have been tested.
You may have to configure the security settings of the given Gmail account before the script can access it.
```json
{
    "email_settings": {
        "email_address": "[email protected]",
        "email_password": "password123",
        "default_body": "This is an automated message."
    }
}
```

Configure your tasks in tasks.json:
A list of the valid sources and the valid `run_on` days can be found in tasks.json.
"tasks": [
{
"task_name": "Test Task",
"recepient": ["[email protected]", "[email protected]"],
"sources": ["Source_1", "Source_two"],
"keywords": ["WHO", "Wuhan", "COVID"],
"run_on": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
}Either make a new .py file or run it directly from task_manager.py:
import task_manager
run() # will update the vault and run all tasks within tasks.json
run_task('Test Task') # will not update the vault, and only run the task_name given as parameterThe vault lets you modify the content.
`vault.update()` may take up to 2 hours in its current state.
```python
import vault

vault.update()  # updates the content of the vault via the given APIs
vault.delete_source_from_vault("Source name")  # deletes all content from a given source
vault.empty_vault()  # deletes all content in the vault
```

The following files will be created/updated:

- update_log.txt
- Runtime_files/(task_name)_runtime.pkl
- vault_data.pkl
The first time you run a new task, it will grab all available news articles, since there is no (task_name)_runtime.pkl to use as a point of reference.
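If you want to see what a runtime file or the vault actually stores, the pickles can be loaded directly. A minimal sketch: the example paths in the comments are assumptions based on the names above, and the stored object's structure is undocumented, so this just loads and reports whatever was pickled.

```python
# Sketch: peek inside one of the generated pickle files.
import pickle

def inspect_pickle(path):
    """Load a pickle file and report the stored object's type."""
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    print(f"{path}: {type(obj).__name__}")
    return obj

# Hypothetical example calls:
# inspect_pickle("Runtime_files/Test Task_runtime.pkl")
# inspect_pickle("vault_data.pkl")
```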
error_log.txt will be created and shows which sources were successfully updated.
If a source fails:

- Restart the entire update process: `task_manager.run()`
- Comment out the sources in `vault.update()` that were successful, then restart.
- If the error persists, the source website might have been updated and needs updating in the relevant `API/source_api.py`. Alternatively, you can comment out the problematic source and restart.
Scraping content from many different websites can cause many kinds of issues.
Be aware of occasional errors if any of the websites' structure changes.
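One common way to contain such breakage is to run each source scraper inside its own try/except, so a single changed website cannot abort the whole update. A sketch, assuming hypothetical per-source scraper functions; the repo's actual error handling may differ, though it appends failures to error_log.txt in the same spirit.

```python
# Sketch: isolate per-source failures during an update.
import traceback

def update_sources(source_funcs):
    """Run each scraper, collecting results and recording failures."""
    results, failed = [], []
    for name, func in source_funcs.items():
        try:
            results.extend(func())
        except Exception:
            failed.append(name)
            # Record the traceback so the broken source can be diagnosed.
            with open("error_log.txt", "a") as log:
                log.write(f"{name} failed:\n{traceback.format_exc()}\n")
    return results, failed
```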
To add your own source:

- Your webscraper needs to return Media class objects from `API/source_classes.py`.
- Import your API in `vault.py`.
- Add the API, in a similar fashion to the other APIs, to `vault.update()` and `vault.extract()`, and add the source to the relevant tasks in tasks.json.
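To sketch the shape of such a webscraper: the stand-in Media dataclass and the CSS selectors below are assumptions for illustration only. In the repo you would import the real Media class from `API/source_classes.py` (whose fields may differ) and fetch the page with requests-html rather than parsing a local string.

```python
# Sketch of a new source API; Media here is a stand-in, not the repo's class.
from dataclasses import dataclass

from bs4 import BeautifulSoup

@dataclass
class Media:  # stand-in for API.source_classes.Media (fields are assumed)
    title: str
    link: str
    source: str

def get_new_source_articles(html):
    """Parse article title/link pairs out of the source's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        Media(
            title=a.get_text(strip=True),
            link=a["href"],
            source="New_source",  # hypothetical source name
        )
        for a in soup.select("article h2 a")
    ]
```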
Kristian Hviid Nielsen - GitHub