A Python wrapper around the Heritrix API.
Uses the Heritrix 3.x API as described here: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.x+API+Guide
The easiest way to install is with pip:
pip install hapy-heritrix
If you don't want to use pip, something like the following should work:
wget https://github.com/WilliamMayor/hapy/archive/master.zip
unzip master.zip
cd hapy-master
python setup.py install
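Either way, the package installs a module named hapy (as used in all the examples below), so you can sanity-check the install with:

python -c "import hapy"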
The function calls mirror those of the API. Here's an example of how to create a job:
import hapy

try:
    h = hapy.Hapy('https://localhost:8443')
    h.create_job('example')
except hapy.HapyException as he:
    print('something went wrong:', he)
Here's the entire API:
h.create_job(name)
h.add_job_directory(path)
h.build_job(name)
h.launch_job(name)
h.rescan_job_directory()
h.pause_job(name)
h.unpause_job(name)
h.terminate_job(name)
h.teardown_job(name)
h.copy_job(src_name, dest_name, as_profile)
h.checkpoint_job(name)
h.execute_script(name, engine, script)
h.submit_configuration(name, cxml)
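Of these, execute_script is worth a quick illustration: it submits a script to Heritrix's scripting console. Here's a minimal sketch, assuming a job named 'example' and the stock admin credentials; 'groovy' is one of the script engines Heritrix 3 ships with, and rawOut is the scripting console's output writer:

import hapy

h = hapy.Hapy('https://localhost:8443', username='admin', password='admin')

# Run a one-line Groovy script in the job's scripting console.
# rawOut is the console's standard output writer; the job must
# already exist for this call to succeed.
h.execute_script('example', 'groovy', 'rawOut.println("hello from Heritrix")')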
There are some extra functions that wrap the undocumented API:
h.get_info()
h.get_job_info(name)
h.get_job_configuration(name)
h.delete_job(name) (careful with this one, it's not fully tested)
The functions get_info and get_job_info return a Python dict parsed from the XML returned by Heritrix; get_job_configuration returns a string containing the CXML configuration.
For example, here's how to get the launch count of a job named 'test':
import hapy

try:
    h = hapy.Hapy('https://localhost:8443', username='admin', password='admin')
    info = h.get_job_info('test')
    launch_count = int(info['job']['launchCount'])
    print('test has been launched %d time(s)' % launch_count)
except hapy.HapyException as he:
    print('something went wrong:', he)
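get_job_configuration can be used in the same way; since it returns the CXML as a string, you can, for example, back up a job's configuration to disk. A minimal sketch, again assuming a job named 'test':

import hapy

h = hapy.Hapy('https://localhost:8443', username='admin', password='admin')

# The CXML comes back as a string, so it can be written straight to a file.
cxml = h.get_job_configuration('test')
with open('test-crawler-beans.cxml', 'w') as fd:
    fd.write(cxml)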
Here's a quick script that builds, launches and unpauses a job, taking the job name and configuration file path from the command line.
import sys
import time

import hapy

def wait_for(h, job_name, func_name):
    # Poll the job info until the named action (e.g. 'build') becomes available.
    print('waiting for', func_name)
    info = h.get_job_info(job_name)
    while func_name not in info['job']['availableActions']['value']:
        time.sleep(1)
        info = h.get_job_info(job_name)

name = sys.argv[1]
config_path = sys.argv[2]
with open(config_path, 'r') as fd:
    config = fd.read()

h = hapy.Hapy('https://localhost:8443', username='admin', password='admin')
h.create_job(name)
h.submit_configuration(name, config)
wait_for(h, name, 'build')
h.build_job(name)
wait_for(h, name, 'launch')
h.launch_job(name)
wait_for(h, name, 'unpause')
h.unpause_job(name)
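Saved as, say, crawl.py (the filename here is arbitrary), it would be run with the job name and the path to a CXML configuration file:

$ python crawl.py example crawler-beans.cxml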
h3cc.py is a script for interacting with Heritrix directly, to perform some general crawler operations.
The info-xml command downloads the raw XML version of the job information page, which can then be filtered by other tools to extract information. For example, the Java heap status can be queried like this:
$ python agents/h3cc.py info-xml | xmlstarlet sel -t -v "//heapReport/usedBytes"
Similarly, the number of novel URLs stored in the WARCs can be determined from:
$ python agents/h3cc.py info-xml | xmlstarlet sel -t -v "//warcNovelUrls"
You can query the frontier too. To see the URL queue for a given host, pass a query URL (the -q option) corresponding to that host, e.g.
$ python agents/h3cc.py -H 192.168.99.100 -q "http://www.bbc.co.uk/" -l 5 pending-urls-from
This will show the first five URLs that are queued to be crawled next on that host. Similarly, you can ask for information about a specific URL:
$ python agents/h3cc.py -H 192.168.99.100 -q "http://www.bbc.co.uk/news" url-status
