A Python wrapper around the Heritrix API.
Uses the Heritrix 3.x API as described here: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.x+API+Guide
The easiest way to install is with pip:
pip install hapy-heritrix
If you don't want to use pip, something like the following should work:
wget https://github.com/WilliamMayor/hapy/archive/master.zip
unzip master.zip
cd hapy-master
python setup.py install
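Either way, the package installs a module named hapy (as used in all the examples below), so you can sanity-check the install with:

python -c "import hapy"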
The function calls mirror those of the API. Here's an example of how to create a job:
import hapy

try:
    h = hapy.Hapy('https://localhost:8443')
    h.create_job('example')
except hapy.HapyException as he:
    print('something went wrong:', he)
Here's the entire API:
h.create_job(name)
h.add_job_directory(path)
h.build_job(name)
h.launch_job(name)
h.rescan_job_directory()
h.pause_job(name)
h.unpause_job(name)
h.terminate_job(name)
h.teardown_job(name)
h.copy_job(src_name, dest_name, as_profile)
h.checkpoint_job(name)
h.execute_script(name, engine, script)
h.submit_configuration(name, cxml)
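Of these, execute_script is worth a quick illustration: it submits a script to Heritrix's scripting console. Here's a minimal sketch, assuming a job named 'example' and the stock admin credentials; 'groovy' is one of the script engines Heritrix 3 ships with, and rawOut is the scripting console's output writer:

import hapy

h = hapy.Hapy('https://localhost:8443', username='admin', password='admin')

# Run a one-line Groovy script in the job's scripting console.
# rawOut is the console's standard output writer; the job must
# already exist for this call to succeed.
h.execute_script('example', 'groovy', 'rawOut.println("hello from Heritrix")')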
There are some extra functions that wrap the undocumented API:
h.get_info()
h.get_job_info(name)
h.get_job_configuration(name)
h.delete_job(name) (careful with this one, it's not fully tested)
The functions get_info and get_job_info return a Python dict parsed from the XML returned by Heritrix; get_job_configuration returns a string containing the CXML configuration.
For example, here's how to get the launch count of a job named 'test':
import hapy

try:
    h = hapy.Hapy('https://localhost:8443', username='admin', password='admin')
    info = h.get_job_info('test')
    launch_count = int(info['job']['launchCount'])
    print('test has been launched %d time(s)' % launch_count)
except hapy.HapyException as he:
    print('something went wrong:', he)
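get_job_configuration can be used in the same way; since it returns the CXML as a string, you can, for example, back up a job's configuration to disk. A minimal sketch, again assuming a job named 'test':

import hapy

h = hapy.Hapy('https://localhost:8443', username='admin', password='admin')

# The CXML comes back as a string, so it can be written straight to a file.
cxml = h.get_job_configuration('test')
with open('test-crawler-beans.cxml', 'w') as fd:
    fd.write(cxml)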
Here's a quick script that builds, launches and unpauses a job, taking the job name and configuration file path from the command line.
import sys
import time

import hapy

def wait_for(h, job_name, func_name):
    # Poll the job info until the named action (e.g. 'build') becomes available.
    print('waiting for', func_name)
    info = h.get_job_info(job_name)
    while func_name not in info['job']['availableActions']['value']:
        time.sleep(1)
        info = h.get_job_info(job_name)

name = sys.argv[1]
config_path = sys.argv[2]
with open(config_path, 'r') as fd:
    config = fd.read()

h = hapy.Hapy('https://localhost:8443', username='admin', password='admin')
h.create_job(name)
h.submit_configuration(name, config)
wait_for(h, name, 'build')
h.build_job(name)
wait_for(h, name, 'launch')
h.launch_job(name)
wait_for(h, name, 'unpause')
h.unpause_job(name)
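Saved as, say, crawl.py (the filename here is arbitrary), it would be run with the job name and the path to a CXML configuration file:

$ python crawl.py example crawler-beans.cxml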
h3cc.py is a script for interacting with Heritrix directly, to perform some general crawler operations.
The info-xml command downloads the raw XML version of the job information page, which can then be filtered by other tools to extract information. For example, the Java heap status can be queried like this:
$ python agents/h3cc.py info-xml | xmlstarlet sel -t -v "//heapReport/usedBytes"
Similarly, the number of novel URLs stored in the WARCs can be determined from:
$ python agents/h3cc.py info-xml | xmlstarlet sel -t -v "//warcNovelUrls"
You can query the frontier too. To see the URL queue for a given host, pass a query URL (the -q option) corresponding to that host, e.g.
$ python agents/h3cc.py -H 192.168.99.100 -q "http://www.bbc.co.uk/" -l 5 pending-urls-from
This will show the first five URLs that are queued to be crawled next on that host. Similarly, you can ask for information about a specific URL:
$ python agents/h3cc.py -H 192.168.99.100 -q "http://www.bbc.co.uk/news" url-status
