Extract data from sample web pages of different types using three methods:
- regular expressions,
- XPath,
- RoadRunner-like implementation.
The main code resides in the data_extraction.py file.
The data/extracted_data/ directory holds the .json files containing structured data extracted from the sample web pages.
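As a minimal sketch of how extracted records could be written to such a .json file: the helper name save_records, the record fields, and the file name are assumptions for illustration and are not taken from data_extraction.py. Writing with ensure_ascii=False keeps non-ASCII characters (for example Slovenian letters) readable in the output.

```python
import json
from pathlib import Path


def save_records(records, out_dir, name):
    """Write a list of extracted records to <out_dir>/<name>.json (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "{}.json".format(name)
    with path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False stores characters such as "č" or "€" as-is
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return path


if __name__ == "__main__":
    # Hypothetical record; real field names depend on the page type.
    print(save_records([{"title": "Sample item", "price": "10 €"}],
                       "data/extracted_data", "sample"))
```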
Sample web pages are located in the following directories:
- data/pages/bolha/
- data/pages/overstock.com/
- data/pages/rtvslo.si/
Python 3.6/3.7 (tested on Linux, Ubuntu 18.04).
Packages used:
- lxml
- re (standard library)
- json (standard library)
Three methods were implemented to handle data extraction using regular expressions or XPath, one per sample site:
- extract_data_bolha
- extract_data_rtvslo
- extract_data_overstock
Each of the three methods accepts two input parameters:
- document
- method (possible values: 'xpath' or 'regex')
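The two-parameter interface above can be sketched as follows. This is an illustration only: extract_data_sample, the SAMPLE markup, and the field names are hypothetical and do not reproduce the project's extractors. The XPath branch uses the ElementPath subset via find(), which lxml supports alongside its full xpath() method; the import falls back to the standard library so the sketch still runs where lxml is not installed.

```python
import json
import re

try:
    from lxml import etree  # package listed in this README
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed sample; the real pages in data/pages/ are far larger.
SAMPLE = (
    "<html><body>"
    "<h1 class='title'>Sample item</h1>"
    "<span class='price'>$12.50</span>"
    "</body></html>"
)


def extract_data_sample(document, method):
    """Return a dict of fields extracted from `document` with the chosen method."""
    if method == "regex":
        title = re.search(r"<h1 class='title'>(.*?)</h1>", document, re.DOTALL)
        price = re.search(r"<span class='price'>(.*?)</span>", document, re.DOTALL)
        return {"title": title.group(1).strip(), "price": price.group(1).strip()}
    if method == "xpath":
        tree = etree.fromstring(document)
        # ElementPath expressions; with lxml, tree.xpath(...) would also work
        title = tree.find(".//h1[@class='title']")
        price = tree.find(".//span[@class='price']")
        return {"title": title.text.strip(), "price": price.text.strip()}
    raise ValueError("method must be 'xpath' or 'regex'")


if __name__ == "__main__":
    for method in ("regex", "xpath"):
        print(json.dumps(extract_data_sample(SAMPLE, method), indent=2))
```

Both branches return the same dictionary for this sample, which makes it easy to compare the two approaches on one page.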
Run roadrunner.py.
Input: Assign the initial page, which serves as the base wrapper, to wrapper_page. Assign to sample_page the new page against which the wrapper is generalized.
Output: UFRE (union-free regular expression) notation, with #PCDATA expressing data fields, (...)? representing optional fields, and (...)+ representing iterator fields.
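To make the output notation concrete, here is a heavily simplified sketch of generalizing a wrapper against one sample page. It is not the algorithm in roadrunner.py: it swaps in difflib.SequenceMatcher for RoadRunner's mismatch handling, covers only string mismatches (#PCDATA) and one-sided token runs ((...)?), and does not attempt iterator discovery ((...)+). The function names tokenize and generalize are assumptions.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split markup into tag tokens and non-empty text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def is_text(token):
    return not token.startswith("<")


def generalize(wrapper_page, sample_page):
    """Return a UFRE-like string generalizing wrapper_page against sample_page."""
    wrapper, sample = tokenize(wrapper_page), tokenize(sample_page)
    matcher = SequenceMatcher(a=wrapper, b=sample, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        a, b = wrapper[i1:i2], sample[j1:j2]
        if op == "equal":
            out.extend(a)  # identical tokens stay in the wrapper
        elif (op == "replace" and len(a) == 1 and len(b) == 1
              and is_text(a[0]) and is_text(b[0])):
            out.append("#PCDATA")  # string mismatch: a data field
        else:
            # Tokens seen on one side only are treated as optional.
            if a:
                out.append("(" + " ".join(a) + ")?")
            if b:
                out.append("(" + " ".join(b) + ")?")
    return " ".join(out)


if __name__ == "__main__":
    wrapper_page = "<p>Price: 10</p>"
    sample_page = "<p>Price: 20</p><b>New</b>"
    print(generalize(wrapper_page, sample_page))
    # prints: <p> #PCDATA </p> (<b> New </b>)?
```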
This project is licensed under the MIT License; see the LICENSE.md file for details.