
CARP - CAR Pipeline

A framework for building data pipelines.

These utilities support web/data scraping from capture files, covering the first stage of a data pipeline.

These utilities provide methods for:

  • filtering out packets based on rules
  • extracting relevant data objects, for example from embedded JSON or HTML

HAR

Utilities for parsing HAR (HTTP Archive) files.

  • Parser
  • Restriction

DOM

Utilities for parsing DOM objects.

  • Parser

Text

Utilities for parsing text objects.

JSON

Utilities for parsing JSON objects.

Examples

A HAR (HTTP Archive) file stores the request/response pairs captured during a browsing session. It can be generated with Firefox's developer tools (browse with the Network panel open, then right-click a request and choose "Save All As HAR").

The HAR parser in Har/Parser.py loads a HAR capture file and applies rules to filter the requests/responses:

from car_scraper import Har

har_parser = Har.Parser("/path/to/harfile.json")

OR

import json

with open("/path/to/harfile.json") as fp:
    jsn = json.load(fp)
    har_parser = Har.Parser(har_json=jsn)

A rule, defined in Har/Restriction.py, is a set of common selectors:

restriction = {
    'url_regexp': None,
    'mimetype_regex': None,
    'content_type': None,
    'content_regex': None,
}

url_regexp defines a regular expression to match against the URL. The other keys are not yet implemented and may not be useful.

Restriction(restriction).match_entry(har_entry)

will return True if the regular expression matches har_entry['request']['url'].
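
For example, a minimal sketch with a hypothetical entry and URL pattern:

restriction = {'url_regexp': r'.*example\.com/api.*'}
har_entry = {'request': {'url': 'https://example.com/api/comments?page=1'}}

Restriction(restriction).match_entry(har_entry)  # expected: True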

# Find request/response entries where entry['request']['url'] matches
# restriction['url_regexp']
entries = har_parser.find_entries(restriction)

You can also use filter_any(restrictions) or filter_out(restrictions), where restrictions is a list of restrictions, as sketched below.
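
A minimal sketch, assuming both methods accept a list of restriction dictionaries like the one above (the URL patterns here are hypothetical, and the exact return values may differ):

restrictions = [
    {'url_regexp': r'.*/api/comments.*'},  # hypothetical endpoint pattern
    {'url_regexp': r'.*/api/posts.*'},
]

# Entries matching any of the restrictions
matching = har_parser.filter_any(restrictions)

# Entries left after dropping those that match
remaining = har_parser.filter_out(restrictions)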

Use these to select the requests that contain the data you wish to extract. Once you have isolated the interesting requests (probably using Firefox's network inspector), filter them down to the data-source API endpoints.

Then use the DOM parser to extract the relevant bits of information using XPath expressions, templates, and nested templates (typically for forum comment hierarchies).

To extract data from the DOM, we use XPath expressions and dictionaries mapping keys to XPath expressions.

<div class="comment_container">
    <div class="author"> Boaty McBoatface </div>
    <div class="stats">
        <div class="score">5 points</div>
        <div class="time"> 5:50pm </div>
    </div>
    <div class="content">
        This is the comment
    </div>
</div>
comment_container_xpath = "//div[contains(@class,'comment_container')]"
extract_template = {
    "author": "div[@class, 'author']//text()",
    "score": "div[@class='stats']/div[@class='score']//text()",
    "time": "div[@class='stats']/div[@class='time']//text()",
    "text": "div[contains(@class,'content')]//text()",
}
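
To see how this template resolves against the snippet above, here is a sketch using lxml directly (the README does not show Dom.Parser's extraction call, so this only illustrates the XPath mechanics):

from lxml import etree

html = """
<div class="comment_container">
    <div class="author"> Boaty McBoatface </div>
    <div class="stats">
        <div class="score">5 points</div>
        <div class="time"> 5:50pm </div>
    </div>
    <div class="content">
        This is the comment
    </div>
</div>
"""

tree = etree.HTML(html)
for container in tree.xpath(comment_container_xpath):
    # Resolve each template key's XPath relative to the container
    comment = {
        key: " ".join(t.strip() for t in container.xpath(xpath) if t.strip())
        for key, xpath in extract_template.items()
    }
    # comment == {'author': 'Boaty McBoatface', 'score': '5 points',
    #             'time': '5:50pm', 'text': 'This is the comment'}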

The Dom.Parser() class can be instantiated from XML text, an lxml.etree element, or a dictionary; a dictionary will be converted to XML internally.

import json
import Dom

for entry in entries:
    url = entry['request']['url']
    txt = entry['response']['content']['text']

    # If txt is already DOM text (probably HTML)
    dom_parser = Dom.Parser(txt)

    # TODO fix this, you won't need to...
    # Or, if txt is a JSON response, load it into a dictionary first:
    # d = json.loads(txt)
    # dom_parser = Dom.Parser(d)

Parsing JSON

Stream parsing: to handle large files, and potentially non-file data, parsing is implemented as streams. The extraction mechanism is very similar to its XPath counterpart (discussed above), but is implemented using JSONPath syntax instead of XPath.
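
For intuition, streaming a top-level JSON list element by element looks like this with the ijson library (an assumption for illustration; the README does not say which streaming backend Json.Parser uses):

import ijson

with open("path/to/file.json", "rb") as fp:
    # Yield one list element at a time instead of loading the whole file
    for item in ijson.items(fp, "item"):
        print(item)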

The Json.Parser instance exposes two generators: items, which yields the original JSON items, and entries, which yields the extracted, preprocessed entries.

Example: given a JSON file in the form of a list of dictionaries,

[
    { "id":1, "name":"john", "text":"Hello world"},
    { "id":2, "name":"john", "text":"Goodbye world!"},
    // ...
]

We can iterate this by

from car_scraper import Json

parser = Json.Parser('path/to/file.json')
with parser as p:
    for item in p.items:
        # Do something with item
        print(item)

Note that the data stream (the list) usually sits in some nested node of the dictionary.

{
    "company_name": "Blahincorporate",
    "revenue": -5,
    "employees": [
        { "id": 1, "name": "john", "text": "Hello world"},
        { "id": 2, "name": "john", "text": "Goodbye world!"}
    ]
}

In that case you can specify the root node as "employees" by passing the prefix argument to Json.Parser in JSONPath syntax:

parser = Json.Parser('path/to/file.json', '$.employees')
with parser as p:
    for item in p.items:
        # Do something with item
        print(item)

We can then extract each item to a dictionary by using a template:

template = {
    'id': '$.id',
    'text': '$.text',
    'lang': '$.lang',
}
with parser as p:
    for item in p.items:
        entry = p.extract(item, template)
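
For intuition, resolving such a template against a single item can be sketched with the jsonpath-ng library (an assumption for illustration; Json.Parser's internals may differ):

from jsonpath_ng import parse

item = {"id": 1, "name": "john", "text": "Hello world"}
template = {"id": "$.id", "text": "$.text"}

# For each key, take the first JSONPath match (or None if absent)
entry = {
    key: next((m.value for m in parse(path).find(item)), None)
    for key, path in template.items()
}
# entry == {'id': 1, 'text': 'Hello world'}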

We can also filter out entries by applying restrictions:

restriction = {
    'name': 'john'
}
with parser as p:
    for item in p.items:
        entry = p.extract(item, template)
        if p.restricted(entry):
            continue

As a shorthand, we can instead pass the template and a list of restrictions to the Json.Parser() constructor, then iterate over the instance's entries:

template = {
    'id': '$.id',
    'text': '$.text',
    'lang': '$.lang',
}
restriction = {
    'lang': 'en'
}
parser = Json.Parser(
    "path/to/file.json",
    prefix='item',
    template=template,
    restrictions=[restriction]
)

with parser as p:
    for entry in p.entries:
        # Do something with the processed entry
        print(entry)

Similarly, we can postprocess by mapping keys to functions:

def clean_text(text):
    # Example postprocessor: normalise text to lowercase
    return text.lower()


postprocess = {
    'text': clean_text
}

parser = Json.Parser(
    "path/to/file.json",
    template=template,
    postprocess_template=postprocess,
)

with parser as p:
    for entry in p.entries:
        # Do something with the processed entry
        print(entry)

We can use allowed_template to match entry values against validation functions in the same way. A validation function must return True or False.

Suppose we have some data.json file:

    "people":[
        {"name": "John", "age":21},
        {"name": "jim", "age" :25},
        ...
    ]

We can stream this file and match its values against a True/False function, iterating over the results with a generator pattern.

e.g.

def is_john(name):
    # Validation function: True if the name contains "john"
    return 'john' in name.lower()

template = {
    'first_name': '$.name',
    'age': '$.age',
}

allowed = {
    'first_name': is_john
}
...
with open("data.json", "rb") as fp:
    res = Json.Parser.load_stream(
        fp,
        template,
        allowed_template=allowed,
        path='people.item',
    )
    for e in res:
        print(e['first_name'])

>>> John

TODO: finish this... for now, look at tests/integration/.
