roberttovornik/webpage-data-extraction

HTML data extraction

Extract data from webpage samples of different types using three methods:

  • regular expressions,
  • XPath,
  • RoadRunner-like implementation.

Webpage types

bolha.com

rtvslo.si

overstock.com

Repository structure

The main code resides in the data_extraction.py file.

The data/extracted_data/ directory holds .json files containing structured data extracted from the sample web pages.
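As a rough illustration of what such a .json file might contain, the sketch below serializes one extracted record with the standard json module. The field names (title, price, content) are hypothetical placeholders, not the fields the repository actually extracts.

```python
import json

# Hypothetical record; the real field names depend on the webpage type.
record = {"title": "Sample headline", "price": "12.50", "content": "Sample text"}

# ensure_ascii=False keeps non-ASCII characters readable in the output.
serialized = json.dumps(record, indent=2, ensure_ascii=False)
print(serialized)

# Reading the text back yields the same dictionary.
restored = json.loads(serialized)
```

Writing `serialized` to a file under data/extracted_data/ would be one way to store the result; the exact file naming used by the project is not shown here.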

Sample web pages are located within:

Installation

Prerequisites

Python 3.6/3.7 (tested on Linux, Ubuntu 18.04).

Packages used:

  • lxml
  • re (Python standard library)
  • json (Python standard library)

Running

Run data_extraction.py

Three methods were implemented for data extraction with regular expressions or XPath, one per webpage type:

  • extract_data_bolha
  • extract_data_rtvslo
  • extract_data_overstock

Each of the three methods accepts two input parameters:

  • document
  • method (possible values: 'xpath' or 'regex')
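The two-parameter interface described above can be sketched as follows. This is a minimal illustration, not the repository's code: extract_title is a hypothetical helper, and the <h1> pattern and XPath expression are assumptions chosen for the example. The 'xpath' branch relies on lxml, which is imported lazily so that the 'regex' branch runs with the standard library alone.

```python
import re


def extract_title(document, method="regex"):
    """Return the text of the first <h1> element, or None if there is none.

    Hypothetical helper that mirrors the (document, method) signature.
    """
    if method == "regex":
        match = re.search(r"<h1[^>]*>(.*?)</h1>", document,
                          re.DOTALL | re.IGNORECASE)
        return match.group(1).strip() if match else None
    if method == "xpath":
        # Imported here so the regex path does not require lxml.
        from lxml import html
        tree = html.fromstring(document)
        titles = tree.xpath("//h1/text()")
        return titles[0].strip() if titles else None
    raise ValueError(f"unknown method: {method!r}")


sample = "<html><body><h1 class='title'> Sample headline </h1></body></html>"
print(extract_title(sample, method="regex"))
```

Keeping both branches behind one function lets the caller switch extraction techniques without changing anything else, which is presumably why the method parameter exists.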

Running roadrunner

Run roadrunner.py

Input: Set wrapper_page to the initial page, which serves as the base wrapper. Set sample_page to the new page against which the wrapper is generalized.

Output: UFRE (union-free regular expression) notation, with #PCDATA expressing data fields, (...)? representing optional fields, and (...)+ representing iterator fields.
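To make the #PCDATA part of the notation concrete, here is a deliberately simplified sketch of wrapper generalization. It is not the implementation in roadrunner.py: the tokenize and generalize functions are hypothetical, and the sketch handles only string mismatches between pages of identical tag structure. Tag mismatches, which would produce (...)? and (...)+, are rejected rather than resolved.

```python
import re

# A token is either a tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(page):
    """Split HTML into tag tokens and non-empty, stripped text tokens."""
    return [t.strip() for t in TOKEN.findall(page) if t.strip()]


def generalize(wrapper_page, sample_page):
    """Replace differing text tokens with #PCDATA (string mismatches only)."""
    wrapper, sample = tokenize(wrapper_page), tokenize(sample_page)
    if len(wrapper) != len(sample):
        raise ValueError("this sketch only handles pages with identical structure")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)
        elif not w.startswith("<") and not s.startswith("<"):
            result.append("#PCDATA")  # string mismatch: a data field
        else:
            raise ValueError("tag mismatch: optionals and iterators not handled")
    return "".join(result)


wrapper_page = "<html><b>Name:</b> John <i>Paris</i></html>"
sample_page = "<html><b>Name:</b> Anna <i>Rome</i></html>"
print(generalize(wrapper_page, sample_page))
# prints <html><b>Name:</b>#PCDATA<i>#PCDATA</i></html>
```

The text that is identical on both pages ("Name:") stays in the wrapper as template content, while the text that differs is generalized to a data field.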

License

This project is licensed under the MIT License. See the LICENSE.md file for details.
