Home

Welcome to the Auto-Extractor wiki! Here you will find information about Auto-Extractor.

The current status

Clustering the web pages based on style and structure
Scalable on Apache Spark

Roadmap

Auto extraction of content
Port to MapReduce and thus plug into Apache Tika

Quick Start

Cluster a bunch of HTML files on the file system

Not documented yet. Look for FileCluster in the code.

Cluster Nutch Output Segments on Spark

This functionality is provided by the autoextractor-spark module.

Build the autoextractor-spark module: mvn clean package
Run: java -jar autoextractor-spark/target/autoextractor-spark-0.1-SNAPSHOT.jar

 Usage:
 -list VAL            : List of Nutch Segment(s) Part(s)
 -master VAL          : Spark master url (default: local[2])
 -sw (--sim-weight) N : weight used for aggregating structural and style
                        similarity measures.
                        Range : [0.0, 1.0] inclusive
                        Notes :
                          0.0 disables structural similarity and only style
                          similarity will be used (it is faster)
                          1.0 disables style similarity and thus only
                          structural similarity will be used
                        (default: 0.0)
 -workdir VAL         : Work directory.
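
One reading that is consistent with the -sw notes above is a linear blend of the two measures. This is an assumption, not something confirmed by the usage text; the symbols below are illustrative, and the actual aggregation should be checked in the code.

```latex
\[
  \mathrm{sim}(a, b) \;=\; w \cdot \mathrm{sim}_{\text{structure}}(a, b)
  \;+\; (1 - w) \cdot \mathrm{sim}_{\text{style}}(a, b),
  \qquad w \in [0, 1]
\]
```

Under this reading, w = 0.0 leaves only the style term and w = 1.0 leaves only the structural term, which matches the two notes in the usage text.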

Put the paths of all segment content parts into a file, one path per line. For example, list.txt contains:

/ext/tg-ext/sites/old/batch1/batch1/segments/20151003201017/content/part-00000/data
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003201017/content/part-00001/data
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003221050/content/part-00000/data
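
Writing this list by hand gets tedious for large crawls. A minimal sketch for generating it, assuming the segments follow the layout shown above (a data file under each content/part-* directory); make_segment_list is a hypothetical helper name and not part of Auto-Extractor:

```shell
# Hypothetical helper (not part of Auto-Extractor): collect the data file of
# every content part under a segments directory, one path per line.
make_segment_list() {
  segments_dir="$1"   # directory that holds the Nutch segments
  list_file="$2"      # output file, e.g. list.txt

  # Keep only .../content/part-*/data; other segment subdirectories
  # and the index files next to data are skipped.
  find "$segments_dir" -type f -path '*/content/part-*/data' | sort > "$list_file"
}
```

For example, make_segment_list /path/to/crawl/segments list.txt writes list.txt in the current directory. The paths in the example above are absolute, so passing an absolute segments directory keeps the output in the same form.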

Run the job:

java -jar autoextractor-spark/target/autoextractor-spark-0.1-SNAPSHOT.jar \
  -list list.txt -workdir out-4 -master local[4]
