OneClick LCA Test Assignment

Task 1 description

Scrape a web page and download all PDF files from the page into a folder.

Task 2 description

After the download process is done, choose 2 PDF files and extract any table from them into an Excel file.

Solution description

A separate Python script was created for each task. Each script is configurable through its command-line interface.

How to install and run scripts.

Clone this git repository

```
git clone git@github.com:mlobacz/oneclick-lca-assignment.git
```

Change into the project's root directory.
```
cd oneclick-lca-assignment
```

Create and activate a new virtual environment

```
python3 -m venv .venv
source .venv/bin/activate
```

Install the Python package with the scripts. The --no-cache-dir flag is optional, but may be needed in case of issues with dependencies.

```
pip install --upgrade pip
pip install . --no-cache-dir
```

(OPTIONAL) At this point, both scraping the page and downloading the PDFs should work fine. Extracting tables, however, may need some extra dependencies. If necessary, check the details in the camelot documentation.
- For Ubuntu
```
apt install ghostscript python3-tk
```
- For macOS
```
brew install ghostscript tcl-tk
```
To scrape PDF files from GreenBookLive search results run:
```
python3 scripts/scrape_greenbooklive.py
```
Scraping is executed in threads against the default URL (given in the task definition). The results are saved in the directory the script was run from, preserving the directory structure present on the web page.
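
The flow described above (collect PDF links, download them in threads, mirror the site's directory structure locally) could look roughly like the sketch below. It is not the code from scripts/scrape_greenbooklive.py: every function and class name here is hypothetical, and it uses only the standard library, whereas the real script may rely on other packages.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from pathlib import Path
from typing import List
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of all <a href="..."> links that point to a PDF."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html: str, base_url: str) -> List[str]:
    """Return the PDF links found in the given HTML, resolved against base_url."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def local_path(pdf_url: str, root: Path) -> Path:
    """Mirror the URL path below root so the site's directory layout is kept."""
    return root / urlparse(pdf_url).path.lstrip("/")


def download(pdf_url: str, root: Path) -> Path:
    """Download one PDF and return the path it was written to."""
    target = local_path(pdf_url, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(pdf_url) as response:
        target.write_bytes(response.read())
    return target


def scrape(page_url: str, root: Path = Path("."), workers: int = 8) -> List[Path]:
    """Fetch the page, then download every linked PDF using a thread pool."""
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    links = find_pdf_links(html, page_url)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: download(url, root), links))
```

Threads suit this job because the work is dominated by waiting on network responses. The worker count of 8 is an arbitrary illustrative default.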

A custom URL can also be provided with the --url argument, for example for a page with different search parameters. Characters that are special to the shell, such as ?, = or &, need to be escaped.
```
python3 scripts/scrape_greenbooklive.py --url https://www.greenbooklive.com/search/companysearch.jsp\?partid\=10028\&sectionid\=0\&companyName\=\&productName\=\&productType\=\&certNo\=\&regionId\=0\&countryId\=0\&addressPostcode\=\&certBody\=\&id\=260\&sortResultsComp\=
```
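
As an alternative to escaping each character by hand, the whole URL can be wrapped in quotes. The small sketch below uses Python's standard shlex.quote to build such a command line. The URL is a shortened version of the one above, used only for illustration.

```python
import shlex

# Shortened illustrative URL; the query string contains shell-special characters.
url = (
    "https://www.greenbooklive.com/search/companysearch.jsp"
    "?partid=10028&sectionid=0&id=260"
)

# shlex.quote wraps the value in single quotes when it contains unsafe characters.
command = "python3 scripts/scrape_greenbooklive.py --url " + shlex.quote(url)
print(command)
```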

To extract tables from a PDF file, run:

```
python3 scripts/extract_table.py [relative_path_to_pdf] [comma delimited numbers of pages with tables] [accuracy(optional)]
```

For example, the commands below extract tables from all pages of pdfdocs/mrepd/R00024.pdf and pdfdocs/mrepd/R00025.pdf with the default accuracy of 95.

```
python3 scripts/extract_table.py pdfdocs/mrepd/R00024.pdf all
python3 scripts/extract_table.py pdfdocs/mrepd/R00025.pdf all
```
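
For readers curious how such an extraction can be done with camelot, here is a minimal sketch. It assumes that the accuracy argument is a threshold on the accuracy value in camelot's per-table parsing report, which this README does not confirm. The function names and the output file naming are hypothetical, and writing .xlsx files with pandas requires an Excel engine such as openpyxl.

```python
from typing import List


def keep_accurate(tables, min_accuracy: float = 95.0) -> List:
    """Keep tables whose camelot parsing report meets the accuracy threshold."""
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


def extract_tables(pdf_path: str, pages: str = "all", min_accuracy: float = 95.0) -> str:
    """Write every sufficiently accurate table to one sheet of an Excel workbook."""
    # Third-party packages are imported lazily so keep_accurate works without them.
    import camelot
    import pandas as pd

    tables = keep_accurate(camelot.read_pdf(pdf_path, pages=pages), min_accuracy)
    if not tables:
        raise ValueError("no table met the accuracy threshold")

    # Hypothetical naming scheme: same path as the PDF, with an .xlsx extension.
    out_path = pdf_path.rsplit(".", 1)[0] + ".xlsx"
    with pd.ExcelWriter(out_path) as writer:
        for index, table in enumerate(tables, start=1):
            table.df.to_excel(writer, sheet_name=f"table_{index}", index=False)
    return out_path
```

camelot accepts page selections as a string, for example "all" or "1,3,4", so a value in the format the command line takes can be passed through unchanged.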

For help (such as example page values), run:

```
python3 scripts/extract_table.py -h
```
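
The argument layout shown above (PDF path, pages, optional accuracy) maps naturally onto argparse, which also generates the -h output. The sketch below is an assumption about how such a CLI could be declared, not the script's actual code. The names pages_value and build_parser are hypothetical.

```python
import argparse


def pages_value(value: str) -> str:
    """Accept 'all' or comma-delimited positive page numbers such as '1,3,4'."""
    parts = value.split(",")
    if value == "all" or all(p.isdigit() and int(p) > 0 for p in parts):
        return value
    raise argparse.ArgumentTypeError(f"invalid pages value: {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare two required positionals and one optional positional."""
    parser = argparse.ArgumentParser(description="Extract tables from a PDF file.")
    parser.add_argument("pdf", help="relative path to the PDF file")
    parser.add_argument("pages", type=pages_value, help="'all' or e.g. '1,3,4'")
    # nargs="?" makes the third positional optional and falls back to the default.
    parser.add_argument("accuracy", nargs="?", type=float, default=95.0,
                        help="minimum accuracy (default: 95)")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```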

(optional) How to configure development environment.

Create and activate a new virtual environment (if not created already)
```
python3 -m venv .venv
source .venv/bin/activate
```

Install pip-tools

```
pip install --upgrade pip
pip install pip-tools
```

Install the Python requirements

```
pip-sync requirements.txt dev-requirements.txt
```

You may want to check the code quality:
```
mypy scripts
pylint scripts
```
