OneClick LCA Test Assignment

Task 1 description

Scrape a web page and download all PDF files from the page into a folder.
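As a rough illustration of the first half of this task, the sketch below collects PDF links from a page's HTML using only the standard library. It is not the code from this repository: `PdfLinkCollector`, `find_pdf_links` and the `example.com` URLs are hypothetical names chosen for the example, and the real script may rely on a different parser.

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Look at the path only, so a query string cannot hide the extension.
        is_pdf = urlsplit(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_urls:
            self.pdf_urls.append(absolute)


def find_pdf_links(html: str, base_url: str) -> list[str]:
    """Return the unique PDF URLs found in html, in document order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls
```

Each returned URL can then be downloaded into the target folder.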

Task 2 description

After the download process is done, choose 2 PDF files and extract any table from these files into an Excel file.
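A minimal sketch of this task using Camelot and pandas might look as follows. It assumes that the accuracy setting acts as a threshold on Camelot's per-table parsing report, which is an assumption about how the script behaves. The function names `keep_accurate` and `pdf_tables_to_excel` are hypothetical, and writing `.xlsx` files additionally requires an Excel engine such as openpyxl.

```python
from __future__ import annotations

from pathlib import Path


def keep_accurate(tables, min_accuracy: float = 95.0) -> list:
    """Return the tables whose Camelot parsing report meets the threshold."""
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


def pdf_tables_to_excel(
    pdf_path: str, pages: str = "all", min_accuracy: float = 95.0
) -> Path | None:
    """Write each sufficiently accurate table to its own sheet of an .xlsx file."""
    # Imported lazily so keep_accurate() stays usable without these packages.
    import camelot
    import pandas as pd

    # pages is a Camelot page string such as "1", "1,3,4" or "all".
    tables = keep_accurate(camelot.read_pdf(pdf_path, pages=pages), min_accuracy)
    if not tables:
        return None

    # Assumption: the workbook is saved next to the PDF with the same stem.
    out_path = Path(pdf_path).with_suffix(".xlsx")
    with pd.ExcelWriter(out_path) as writer:
        for index, table in enumerate(tables, start=1):
            # table.df is the pandas DataFrame Camelot built for this table.
            table.df.to_excel(
                writer, sheet_name=f"table_{index}", index=False, header=False
            )
    return out_path
```

Keeping the accuracy filter in a separate function makes it easy to check without a PDF at hand.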

Solution description

A separate Python script was created for each task. Each script is configurable through its command-line interface.

How to install and run the scripts.

  1. Clone this git repository

    git clone git@github.com:mlobacz/oneclick-lca-assignment.git
  2. Change into the project's root directory.

    cd oneclick-lca-assignment
  3. Create and activate a new virtual environment

    python3 -m venv .venv
    source .venv/bin/activate
  4. Install the Python package with the scripts

    pip install --upgrade pip
    pip install . --no-cache-dir  # the last flag is optional, but may be needed in case of issues with dependencies
  5. (OPTIONAL) At this point, both scraping and extracting PDFs should work fine. Extracting tables, however, may need some extra dependencies. If necessary, check the details in the Camelot documentation.

    • For Ubuntu
    apt install ghostscript python3-tk
    • For macOS
    brew install ghostscript tcl-tk
  6. To scrape PDF files from GreenBookLive search results run:

    python3 scripts/scrape_greenbooklive.py

    Scraping runs in multiple threads against the default URL (given in the task definition). Results are saved in the directory the script was run from, preserving the directory structure present on the web page.

    A custom URL, for example for a page with different search parameters, can also be provided with the --url argument. Some characters, such as = or ?, need to be escaped in the shell:

    python3 scripts/scrape_greenbooklive.py --url https://www.greenbooklive.com/search/companysearch.jsp\?partid\=10028\&sectionid\=0\&companyName\=\&productName\=\&productType\=\&certNo\=\&regionId\=0\&countryId\=0\&addressPostcode\=\&certBody\=\&id\=260\&sortResultsComp\=
  7. To extract tables from a PDF file, run:

    python3 scripts/extract_table.py [relative_path_to_pdf] [comma delimited numbers of pages with tables] [accuracy(optional)]

    For example, the commands below extract tables from all pages of pdfdocs/mrepd/R00024.pdf and pdfdocs/mrepd/R00025.pdf with the default accuracy of 95.

    python3 scripts/extract_table.py pdfdocs/mrepd/R00024.pdf all
    python3 scripts/extract_table.py pdfdocs/mrepd/R00025.pdf all

    For help (including example page values), type:

    python3 scripts/extract_table.py -h
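Step 6 notes that scraping runs in threads and preserves the directory structure found on the web page. One way to sketch that idea with the standard library is shown below. This is an illustration under stated assumptions, not the repository's implementation: `local_path_for`, `download`, `download_all` and the default of 8 workers are hypothetical choices.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def local_path_for(url: str, root: Path) -> Path:
    """Map a URL path like /pdfdocs/mrepd/R00024.pdf to root/pdfdocs/mrepd/R00024.pdf."""
    parts = PurePosixPath(unquote(urlsplit(url).path)).parts
    # Drop the leading "/" and any ".." so the result stays inside root.
    safe_parts = [part for part in parts if part not in ("/", "..")]
    return root.joinpath(*safe_parts)


def download(url: str, root: Path) -> Path:
    """Fetch one file and store it under root, creating directories as needed."""
    target = local_path_for(url, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target


def download_all(urls: list[str], root: Path, workers: int = 8) -> list[Path]:
    """Download all URLs concurrently; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: download(url, root), urls))
```

Threads suit this workload because each download spends most of its time waiting on the network rather than using the CPU.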

(Optional) How to configure the development environment.

  1. Create and activate a new virtual environment (if not created already)

    python3 -m venv .venv
    source .venv/bin/activate
  2. Install pip-tools

    pip install --upgrade pip
    pip install pip-tools
  3. Install the Python requirements

    pip-sync requirements.txt dev-requirements.txt
  4. You may want to check the code quality:

    mypy scripts
    pylint scripts
