These are utilities for web/data scraping from capture files, covering the first stage of a data pipeline.
These utilities provide methods for:
- filtering out packets based on rules
- extracting relevant data objects, for example from embedded JSON or HTML
Utilities for parsing HAR (HTTP Archive) files:
- Parser
- Restriction

Utilities for parsing DOM objects:
- Parser

Utilities for parsing text objects.

Utilities for parsing JSON objects.
A HAR (HTTP Archive) file can be generated by Firefox using the developer console (Network tab). It stores the request/response pairs, as dictionaries, captured during a browsing session.
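For orientation, the relevant structure looks roughly like this (a trimmed sketch of a standard HAR entry; real captures contain many more fields):

```json
{
  "log": {
    "entries": [
      {
        "request": { "method": "GET", "url": "https://example.com/api/comments" },
        "response": { "content": { "mimeType": "application/json", "text": "..." } }
      }
    ]
  }
}
```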
The HAR parser in Har/Parser.py loads a HAR capture file and applies rules to filter requests/responses:
```python
har_parser = Har.Parser("/path/to/harfile.json")
```

or

```python
import json

with open("/path/to/harfile.json") as fp:
    jsn = json.load(fp)
har_parser = Har.Parser(har_json=jsn)
```

A rule, defined in Har/Restriction.py, is a set of common selectors:
```python
restriction = {
    'url_regexp': None,
    'mimetype_regex': None,
    'content_type': None,
    'content_regex': None,
}
```

`url_regexp` defines a regular expression to match against the URL. The other keys are not yet implemented and may not turn out to be useful.
```python
Restriction(restriction).match_entry(har_entry)
```

will return True if the regexp matches har_entry['request']['url'].
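For example, a quick sketch (the import path and regexp value here are illustrative):

```python
from Har import Restriction  # assumed import path, per Har/Restriction.py

restriction = {'url_regexp': r'example\.com/api/'}
if Restriction(restriction).match_entry(har_entry):
    print(har_entry['request']['url'])
```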
```python
# Find request/response entries where entry['request']['url'] matches
# restriction['url_regexp']
entries = har_parser.find_entries(restriction)
```

You can also filter_any(restrictions) or filter_out(restrictions), where restrictions is a list of restrictions.
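A hedged sketch of how these might be used; the exact signatures and return values are assumptions based on the description above:

```python
# Assumed semantics: filter_any keeps entries matching at least one
# restriction, filter_out drops entries matching any restriction,
# and both return the resulting list of entries.
api_entries = har_parser.filter_any([
    {'url_regexp': r'/api/v\d+/comments'},
    {'url_regexp': r'/api/v\d+/posts'},
])
page_entries = har_parser.filter_out([
    {'url_regexp': r'\.(css|js|png|woff2?)$'},
])
```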
Use these to select the requests that contain the data you wish to extract. Once you have isolated the interesting requests (probably using Firefox's network inspector), filter the requests down to the data-source API endpoints.
Then use the DOM parser to extract the relevant bits of information using XPath expressions, templates, and nested templates (typically for forum-style comment hierarchies).
To extract data from the DOM, we use XPath expressions, and dictionaries that map keys to XPath expressions. For example, given the following HTML:
<div class="comment_container">
<div class="author"> Boaty McBoatface </div>
<div class="stats">
<div class="score">5 points</div>
<div class="time"> 5:50pm </div>
</div>
<div class="content">
This is the comment
</div>
</div>comment_container_xpath = "//div[contains(@class,'comment_container')]"
extract_template = {
"author": "div[@class, 'author']//text()",
"score": "div[@class='stats']/div[@class='score']//text()",
"time": "div[@class='stats']/div[@class='time']//text()",
"text": "div[contains(@class,'content')]//text()",
The Dom.Parser() class can be instantiated from XML text, an lxml.etree element, or a dictionary; a dictionary will be converted to XML internally.
```python
import json
import Dom

for entry in entries:
    url = entry['request']['url']
    txt = entry['response']['content']['text']
    # If txt is already a DOM text (probably HTML)
    dom_parser = Dom.Parser(txt)
    # TODO fix this, you won't need to...
    # Or, if txt is a JSON response, load it into a dictionary first:
    # d = json.loads(txt)
    # dom_parser = Dom.Parser(d)
```

Stream parsing: to handle large files, and potentially non-file data, parsing is implemented as streams. The extraction mechanism is very similar to its XPath counterpart (discussed above), but is implemented using JSONPath syntax instead of XPath.
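The 'people.item' path syntax that appears later in this README suggests an ijson-style backend. As an illustration of what stream parsing buys you (an assumption about the approach, not a description of this project's internals):

```python
import ijson  # incremental JSON parser

with open("data.json", "rb") as fp:
    # Items under the "people" list are yielded one at a time;
    # the whole file is never loaded into memory at once.
    for person in ijson.items(fp, "people.item"):
        print(person["name"])
```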
The Json.Parser instance contains two generators: one for the original JSON items (items) and one for the extracted, preprocessed entries (entries).
Example: given a JSON file in the form of a list of dictionaries,

```json
[
    { "id": 1, "name": "john", "text": "Hello world" },
    { "id": 2, "name": "john", "text": "Goodbye world!" },
    // ...
]
```

we can iterate over it:
```python
from car_scraper import Json

parser = Json.Parser('path/to/file.json')
with parser as p:
    for item in p.items:
        ...  # Do something with item
```

Note that usually the data stream (the list) sits in some nested node of the dictionary:
```json
{
    "company_name": "Blahincorporate",
    "revenue": -5,
    "employees": [
        { "id": 1, "name": "john", "text": "Hello world" },
        { "id": 2, "name": "john", "text": "Goodbye world!" }
    ]
}
```

In this case you can specify the root node as "employees" by passing the prefix argument to Json.Parser, using JSONPath syntax:
```python
parser = Json.Parser('path/to/file.json', '$.employees')
with parser as p:
    for item in p.items:
        ...  # Do something with item
```

We can then extract each item to a dictionary using a template:
```python
template = {
    'id': '$.id',
    'name': '$.name',
    'text': '$.text',
    'lang': '$.lang',
}
with parser as p:
    for item in p.items:
        entry = p.extract(item, template)
```

We can also filter out entries by applying restrictions:
```python
restriction = {
    'name': 'john'
}
with parser as p:
    for item in p.items:
        entry = p.extract(item, template)
        if p.restricted(entry):
            continue
```

As a shorthand, we can instead pass the template and a list of restrictions to the Json.Parser() constructor, then iterate over its entries:
```python
template = {
    'id': '$.id',
    'text': '$.text',
    'lang': '$.lang',
}
restriction = {
    'lang': 'en'
}
parser = Json.Parser(
    "path/to/file.json",
    prefix='item',
    template=template,
    restrictions=[restriction],
)
with parser as p:
    for entry in p.entries:
        ...  # Do something with the processed entry
```

Similarly, we can postprocess entries by mapping keys to functions:
```python
def clean_text(text):
    return text.lower()

postprocess = {
    'text': clean_text
}
parser = Json.Parser(
    "path/to/file.json",
    template=template,
    postprocess_template=postprocess,
)
with parser as p:
    for entry in p.entries:
        ...  # Do something with the processed entry
```

We can use allowed_template in the same way to match entry values against a validation function. The validation function must return True or False.
Suppose we have a data.json file:
"people":[
{"name": "John", "age":21},
{"name": "jim", "age" :25},
...
]We could stream this file, and match it's values against some True/False function which we could iterate over using a generator pattern.
For example:

```python
from car_scraper import Json

def is_john(name):
    return 'john' in name.lower()

template = {
    'first_name': '$.name',
    'age': '$.age',
}
allowed = {
    'first_name': is_john
}
# ...
with open("data.json", "rb") as fp:
    res = Json.Parser.load_stream(
        fp,
        template,
        allowed_template=allowed,
        path='people.item',
    )
    for e in res:
        print(e['first_name'])
# prints: John
```

TODO finish this... for now look at tests/integration/