HTML data extraction

Extract data from sample webpages of different types using three different methods:

regular expressions,
XPath,
RoadRunner-like implementation.

Webpage types

bolha.com

rtvslo.si

overstock.com

Repository structure

The main code resides in the data_extraction.py file.

The directory data/extracted_data/ holds .json files containing structured data extracted from the sample web pages.
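As a rough illustration of how extracted records might end up as .json files, the sketch below writes a list of dictionaries to disk. The helper name save_records, the record fields, and the file name example.json are assumptions made for this example and are not taken from the repository.

```python
import json
import tempfile
from pathlib import Path


def save_records(records, out_path):
    """Write extracted records as UTF-8 JSON, keeping non-ASCII text readable."""
    out_path = Path(out_path)
    # Create the target directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


# Hypothetical record and file name, written to a temporary directory here.
with tempfile.TemporaryDirectory() as tmp:
    path = save_records(
        [{"title": "Športne novice"}],
        Path(tmp) / "extracted_data" / "example.json",
    )
    print(path.read_text(encoding="utf-8"))
```

Passing ensure_ascii=False keeps characters such as "Š" readable in the file instead of escaping them, which may matter for the Slovenian sample pages.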

Sample web pages are located within:

Installation

Prerequisites

Python 3.6/3.7 (tested on Linux, Ubuntu 18.04)

Packages used:

lxml
re (Python standard library)
json (Python standard library)

Running

Run data_extraction.py

Three functions were implemented to handle data extraction using regular expressions or XPath:

extract_data_bolha
extract_data_rtvslo
extract_data_overstock

Each of the three functions accepts two input parameters:

document
method (possible values: 'xpath' or 'regex')
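The sketch below illustrates the general shape of such a function: one entry point that takes a document and a method and returns the same records either way. It is not the repository's code. The function name extract_data_sample, the sample markup, and the field names are hypothetical, and the standard library's xml.etree.ElementTree (which supports only a limited XPath subset and requires well-formed markup) stands in for lxml so that the example runs without extra packages.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; real pages would need lxml's HTML parser.
SAMPLE = (
    "<html><body>"
    "<div class='item'><h2>Bike</h2><span class='price'>120 EUR</span></div>"
    "<div class='item'><h2>Desk</h2><span class='price'>45 EUR</span></div>"
    "</body></html>"
)


def extract_data_sample(document, method):
    """Return a list of {title, price} records using 'xpath' or 'regex'."""
    if method == "xpath":
        root = ET.fromstring(document)
        # Select every item block, then read its fields relative to the block.
        return [
            {
                "title": item.findtext("h2"),
                "price": item.findtext("span[@class='price']"),
            }
            for item in root.findall(".//div[@class='item']")
        ]
    if method == "regex":
        # One pattern captures both fields of an item in a single pass.
        pattern = re.compile(
            r"<div class='item'><h2>(.*?)</h2><span class='price'>(.*?)</span>",
            re.DOTALL,
        )
        return [{"title": t, "price": p} for t, p in pattern.findall(document)]
    raise ValueError("method must be 'xpath' or 'regex'")


print(json.dumps(extract_data_sample(SAMPLE, "xpath"), indent=2))
```

Both branches return the same list for this sample, which makes it easy to compare the two approaches on one page.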

Running roadrunner

Run roadrunner.py

Input: Set wrapper_page to the initial page, which serves as the base wrapper. Set sample_page to the new page against which the wrapper is generalized.

Output: UFRE (union-free regular expression) notation, with #PCDATA marking data fields, (...)? marking optional fields, and (...)+ marking iterator fields.
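To make the notation concrete, here is a heavily simplified sketch of generalizing a wrapper against a sample page. It is not the code in roadrunner.py: it aligns the two token sequences with difflib.SequenceMatcher rather than RoadRunner's mismatch-driven matching, it only produces #PCDATA and (...)? constructs, and it does not detect (...)+ iterators. The helper names and the two input pages are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split markup into tag tokens and stripped, non-empty text tokens."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


def key(token):
    # All text tokens may align with each other; tags must match exactly.
    return token if token.startswith("<") else "#TEXT"


def render(tokens):
    # Text inside an optional block is treated as a data field.
    return "".join(t if t.startswith("<") else "#PCDATA" for t in tokens)


def generalize(wrapper_page, sample_page):
    """Return a simplified wrapper covering both pages (no iterators)."""
    w, s = tokenize(wrapper_page), tokenize(sample_page)
    matcher = SequenceMatcher(
        None, [key(t) for t in w], [key(t) for t in s], autojunk=False
    )
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Aligned tokens: identical ones stay, differing text is a data field.
            for a, b in zip(w[i1:i2], s[j1:j2]):
                out.append(a if a == b else "#PCDATA")
        else:
            # Tokens present on only one side become optional blocks.
            if i2 > i1:
                out.append("(" + render(w[i1:i2]) + ")?")
            if j2 > j1:
                out.append("(" + render(s[j1:j2]) + ")?")
    return "".join(out)


wrapper_page = "<html><b>Name:</b>John<img src='a.png'/></html>"
sample_page = "<html><b>Name:</b>Anna</html>"
print(generalize(wrapper_page, sample_page))
```

For these two pages the sketch prints `<html><b>Name:</b>#PCDATA(<img src='a.png'/>)?</html>`: the differing names collapse into a data field, and the image that appears on only one page becomes optional.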

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
