Scrape a web page and download all PDF files from the page into a folder.
After the download is done, choose 2 PDF files and extract any tables from them into an Excel file.
A separate Python script was created for each task. Each script is configurable through the provided CLI.
- Clone this git repository
git clone git@github.com:mlobacz/oneclick-lca-assignment.git
- Change into the project's root directory.
cd oneclick-lca-assignment
- Create and activate a new virtual environment
python3 -m venv .venv
source .venv/bin/activate
- Install the Python package with the scripts
pip install --upgrade pip
pip install . --no-cache-dir
(the last flag is optional, but may be needed in case of issues with dependencies)
- (OPTIONAL) At this point, scraping the page and downloading the PDFs should work fine. Extracting tables, however, may need some extra dependencies. If necessary, check the details in the camelot documentation.
- For Ubuntu
apt install ghostscript python3-tk
- For macOS
brew install ghostscript tcl-tk
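A quick way to see whether these optional dependencies are visible from the active environment is sketched below. It is only an illustration: the function name `check_table_extraction_deps` is hypothetical, and the check assumes Ghostscript exposes a `gs` executable on the PATH, which is typical on Ubuntu and macOS but not guaranteed.

```python
import importlib.util
import shutil


def check_table_extraction_deps() -> dict:
    """Report whether Ghostscript and Tkinter appear to be available.

    Hypothetical helper, not part of this repository's scripts.
    """
    return {
        # Assumption: Ghostscript is installed as a `gs` executable on PATH.
        "ghostscript": shutil.which("gs") is not None,
        # find_spec returns None when the module cannot be located.
        "tkinter": importlib.util.find_spec("tkinter") is not None,
    }


if __name__ == "__main__":
    for name, found in check_table_extraction_deps().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, installing the packages listed above is the first thing to try.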
- To scrape PDF files from GreenBookLive search results, run:
python3 scripts/scrape_greenbooklive.py
Scraping is executed in threads with the default URL (given in the task definition), and the results are saved in the directory the script was run from, preserving the directory structure present on the web page.
A custom URL, for example for a page with different search parameters, can also be provided with the --url argument (some characters, like = or ?, need to be escaped):
python3 scripts/scrape_greenbooklive.py --url https://www.greenbooklive.com/search/companysearch.jsp\?partid\=10028\&sectionid\=0\&companyName\=\&productName\=\&productType\=\&certNo\=\&regionId\=0\&countryId\=0\&addressPostcode\=\&certBody\=\&id\=260\&sortResultsComp\=
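The threaded download that preserves the page's directory structure could look roughly like the sketch below. It is not the code of scripts/scrape_greenbooklive.py: the names `local_path_for`, `fetch_bytes` and `download_all`, the worker count, and the use of the standard library's `urlopen` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List
from urllib.parse import urlparse
from urllib.request import urlopen


def local_path_for(url: str, root: Path) -> Path:
    """Map a PDF URL to a local path that mirrors the URL's directory layout."""
    # e.g. https://host/pdfdocs/mrepd/R00024.pdf -> root/pdfdocs/mrepd/R00024.pdf
    return root / urlparse(url).path.lstrip("/")


def fetch_bytes(url: str) -> bytes:
    """Download the raw content of a URL using only the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(
    urls: Iterable[str],
    root: Path,
    fetch: Callable[[str], bytes] = fetch_bytes,
    max_workers: int = 8,  # assumed value, tune to the server's tolerance
) -> List[Path]:
    """Download every URL in a thread pool and return the saved paths in order."""

    def save(url: str) -> Path:
        target = local_path_for(url, root)
        # Create intermediate directories; exist_ok avoids races between threads.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(url))
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the input URLs.
        return list(pool.map(save, urls))
```

Passing `fetch` as a parameter keeps the sketch testable without network access; threads suit this job because the work is dominated by waiting on HTTP responses.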
- To extract tables from a PDF file, run:
python3 scripts/extract_table.py [relative_path_to_pdf] [comma delimited numbers of pages with tables] [accuracy(optional)]
For example, the commands below extract tables from all pages of the pdfdocs/mrepd/R00024.pdf and pdfdocs/mrepd/R00025.pdf files with the default accuracy of 95:
python3 scripts/extract_table.py pdfdocs/mrepd/R00024.pdf all
python3 scripts/extract_table.py pdfdocs/mrepd/R00025.pdf all
For help (such as example page values), type:
python3 scripts/extract_table.py -h
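The three positional arguments of the extraction command can be modelled with argparse as in the sketch below. This is a guess at the shape of the interface, not the contents of scripts/extract_table.py: `build_parser`, `keep_accurate`, `main` and the output file naming are hypothetical, and it assumes that the accuracy argument is a minimum threshold applied to the `accuracy` value in camelot's per-table `parsing_report`.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented positional arguments."""
    parser = argparse.ArgumentParser(description="Extract tables from a PDF file.")
    parser.add_argument("pdf_path", help="relative path to the PDF file")
    parser.add_argument("pages", help="comma-delimited page numbers, or 'all'")
    parser.add_argument(
        "accuracy",
        nargs="?",  # optional positional argument
        type=float,
        default=95.0,
        help="minimum parsing accuracy for a table to be kept (default: 95)",
    )
    return parser


def keep_accurate(tables, min_accuracy: float) -> list:
    """Keep tables whose reported parsing accuracy meets the threshold."""
    # Assumption: each table exposes camelot's parsing_report dictionary.
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    import camelot  # third-party; imported here so the parser works without it

    tables = camelot.read_pdf(args.pdf_path, pages=args.pages)
    for index, table in enumerate(keep_accurate(tables, args.accuracy)):
        # Hypothetical naming scheme: one Excel file per extracted table.
        table.to_excel(f"{Path(args.pdf_path).stem}_{index}.xlsx")


if __name__ == "__main__":
    main()
```

Keeping the filter in its own function makes the accuracy rule easy to test without a real PDF.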
- Create and activate a new virtual environment (if not created already)
python3 -m venv .venv
source .venv/bin/activate
- Install pip-tools
pip install --upgrade pip
pip install pip-tools
- Install Python requirements
pip-sync requirements.txt dev-requirements.txt
- You may want to check the code quality:
mypy scripts
pylint scripts