Thamme Gowda edited this page Feb 2, 2016 · 7 revisions

Welcome to the Auto-Extractor wiki! Here you will find information about Auto-Extractor.


The current status

  • Clusters web pages based on their style and structure
  • Scales on Apache Spark

Roadmap

  • Auto extraction of content
  • Port to MapReduce so that it can plug into Apache Tika

Quick Start

Cluster a set of HTML files on the file system

Not documented yet. Look for FileCluster in the code.

Cluster Nutch Output Segments on Spark

This functionality is provided by the autoextractor-spark module.

  • Build the autoextractor-spark module: mvn clean package
  • Run: java -jar autoextractor-spark/target/autoextractor-spark-0.1-SNAPSHOT.jar
 Usage:
 -list VAL            : List of Nutch Segment(s) Part(s)
 -master VAL          : Spark master url (default: local[2])
 -sw (--sim-weight) N : weight used for aggregating structural and style
                        similarity measures.
                        Range : [0.0, 1.0] inclusive
                        Notes :
                                0.0 disables structural similarity and only style
                        similarity will be used (it is faster)
                                1.0 disables style similarity and thus only structural
                        similarity will be used
                         (default: 0.0)
 -workdir VAL         : Work directory.
  • Put the paths of all segment content parts into a file, one path per line. For example, list.txt contains:
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003201017/content/part-00000/data
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003201017/content/part-00001/data
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003221050/content/part-00000/data
  • Run the job:
java -jar autoextractor-spark/target/autoextractor-spark-0.1-SNAPSHOT.jar \
  -list list.txt -workdir out-4 -master local[4]
  • View the output:
out-4/
├── domains
│   ├── ag.ca.gov
│   ├── aka.ms
│   ├── certguns.doj.ca.gov
│   ├── dev.twitter.com
│   ├── freegunclassifieds.com
│   ├── glocktalk.com
│   └── mobile.twitter.com
└── similarity
    ├── entries
    │   ├── ag.ca.gov
    │   │   ├── part-00000
    │   │   └── _SUCCESS
    │   ├── certguns.doj.ca.gov
    │   │   ├── part-00000
    │   │   └── _SUCCESS
    │   ├── classic.gunauction.com
    │   │   └── _temporary
    │   │       └── 0
    │   ├── dev.twitter.com
    │   │   ├── part-00000
    │   │   └── _SUCCESS
    │   └── mobile.twitter.com
    │       ├── part-00000
    │       └── _SUCCESS
    └── matrix
        ├── ag.ca.gov
        │   ├── part-00000
        │   └── _SUCCESS
        ├── certguns.doj.ca.gov
        │   ├── part-00000
        │   └── _SUCCESS
        ├── dev.twitter.com
        │   ├── part-00000
        │   └── _SUCCESS
        └── mobile.twitter.com
            ├── part-00000
            └── _SUCCESS
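
The file-list and output-inspection steps above can be scripted. The following is a minimal sketch, not part of Auto-Extractor: the function names list_segment_parts and incomplete_outputs are hypothetical, the content/part-*/data layout is assumed from the example paths in list.txt, and a missing _SUCCESS marker is taken to mean the job for that directory did not finish (as with classic.gunauction.com in the tree above, which only has a _temporary directory).

```shell
# Hypothetical helper: list the data file of every content part under a
# Nutch segments directory, one path per line, sorted for a stable order.
# Pass an absolute directory to get absolute paths, as in the list.txt example.
list_segment_parts() {
  find "$1" -type f -path '*/content/part-*/data' | sort
}

# Hypothetical helper: print the directories under <workdir>/similarity/*/
# that have no _SUCCESS marker, i.e. whose output looks incomplete.
incomplete_outputs() {
  for dir in "$1"/similarity/*/*/; do
    [ -d "$dir" ] || continue            # skip the unexpanded glob if nothing matches
    [ -e "${dir}_SUCCESS" ] || printf '%s\n' "${dir%/}"
  done
}

# Usage (paths are placeholders):
#   list_segment_parts /path/to/segments > list.txt
#   incomplete_outputs out-4
```

For the out-4 tree shown above, incomplete_outputs out-4 would be expected to report out-4/similarity/entries/classic.gunauction.com.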

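The -sw (--sim-weight) option in the usage text aggregates the structural and style similarity measures. One reading consistent with the documented endpoints (0.0 uses only style similarity, 1.0 uses only structural similarity) is a linear blend. This is an assumption for illustration, not a formula confirmed by this page; the symbols w, sim_struct and sim_style are introduced here.

```latex
\[
  \mathrm{sim}(a, b) \;=\; w \cdot \mathrm{sim}_{\mathrm{struct}}(a, b)
  \;+\; (1 - w) \cdot \mathrm{sim}_{\mathrm{style}}(a, b),
  \qquad w \in [0, 1]
\]
% Worked example under this assumption:
% w = 0.25, sim_struct = 0.8, sim_style = 0.4
% sim = 0.25 * 0.8 + 0.75 * 0.4 = 0.2 + 0.3 = 0.5
```

Under this reading, the default of 0.0 skips the structural term entirely, which matches the note that style-only similarity is faster.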