forked from thammegowda/autoextractor
Home
Thamme Gowda edited this page Feb 2, 2016 · 7 revisions
Welcome to the Auto-Extractor wiki! Here you will find information about Auto-Extractor:
- Clustering the web pages based on style and structure
- Scalable on Apache Spark
- Auto extraction of content
- Port to Map Reduce and thus plug into Apache Tika
Not documented yet. Look for FileCluster in the code.
This functionality is provided by the autoextractor-spark module.
- Build the autoextractor-spark module: mvn clean package
- Run:
java -jar autoextractor-spark/target/autoextractor-spark-0.1-SNAPSHOT.jar
Usage:
-list VAL : List of Nutch Segment(s) Part(s)
-master VAL : Spark master url (default: local[2])
-sw (--sim-weight) N : weight used for aggregating structural and style
similarity measures.
Range : [0.0, 1.0] inclusive
Notes :
0.0 disables structural similarity and only style
similarity will be used (it is faster)
1.0 disables style similarity and thus only structural
similarity will be used
(default: 0.0)
-workdir VAL : Work directory.
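The notes on -sw describe a weight that blends the two similarity measures. A minimal sketch of that idea, assuming the aggregation is a plain weighted average; the function name and the exact formula are illustrative assumptions and are not taken from the Auto-Extractor code:

```python
def combine_similarity(structural: float, style: float, weight: float) -> float:
    """Blend a structural and a style similarity score.

    Assumed formula (not confirmed by the source):
        weight * structural + (1 - weight) * style
    so weight 0.0 uses only style and weight 1.0 uses only structure,
    which matches the behaviour described in the usage notes.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in the range [0.0, 1.0]")
    return weight * structural + (1.0 - weight) * style
```

With the default weight of 0.0 the structural score does not contribute to the result, which is consistent with the note that this setting is faster.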
- Put all the segment content part paths into a file. For example, list.txt contains:
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003201017/content/part-00000/data
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003201017/content/part-00001/data
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003221050/content/part-00000/data
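Such a list can also be generated rather than written by hand. The following is a rough sketch, assuming the segments follow the segments/<timestamp>/content/part-*/data layout seen in the example paths; the helper name write_segment_list is hypothetical and not part of Auto-Extractor:

```python
from pathlib import Path


def write_segment_list(segments_dir: str, list_file: str) -> int:
    """Write every <segment>/content/part-*/data path to list_file, one per line.

    Returns the number of paths written. The directory layout is an
    assumption based on the example paths in this page.
    """
    paths = sorted(
        str(p) for p in Path(segments_dir).glob("*/content/part-*/data")
    )
    Path(list_file).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

For example, write_segment_list("/ext/tg-ext/sites/old/batch1/batch1/segments", "list.txt") would collect the content parts of every segment under that directory.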
- Run the job:
java -jar autoextractor-spark/target/autoextractor-spark-0.1-SNAPSHOT.jar \
-list list.txt -workdir out-4 -master local[4]
- View the output:
out-4/
├── domains
│   ├── ag.ca.gov
│   ├── aka.ms
│   ├── certguns.doj.ca.gov
│   ├── dev.twitter.com
│   ├── freegunclassifieds.com
│   ├── glocktalk.com
│   └── mobile.twitter.com
└── similarity
    ├── entries
    │   ├── ag.ca.gov
    │   │   ├── part-00000
    │   │   └── _SUCCESS
    │   ├── certguns.doj.ca.gov
    │   │   ├── part-00000
    │   │   └── _SUCCESS
    │   ├── classic.gunauction.com
    │   │   └── _temporary
    │   │       └── 0
    │   ├── dev.twitter.com
    │   │   ├── part-00000
    │   │   └── _SUCCESS
    │   └── mobile.twitter.com
    │       ├── part-00000
    │       └── _SUCCESS
    └── matrix
        ├── ag.ca.gov
        │   ├── part-00000
        │   └── _SUCCESS
        ├── certguns.doj.ca.gov
        │   ├── part-00000
        │   └── _SUCCESS
        ├── dev.twitter.com
        │   ├── part-00000
        │   └── _SUCCESS
        └── mobile.twitter.com
            ├── part-00000
            └── _SUCCESS
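In the listing above, some domain directories contain a _SUCCESS marker while classic.gunauction.com holds only _temporary. By the usual Hadoop/Spark output convention, _SUCCESS is written once an output directory has been committed, so its absence suggests that domain's output is incomplete. A small sketch for listing the finished domains, assuming the work directory layout shown above; the helper name completed_domains is hypothetical:

```python
from pathlib import Path
from typing import List


def completed_domains(workdir: str, kind: str = "matrix") -> List[str]:
    """Return domains under <workdir>/similarity/<kind> that have a _SUCCESS file.

    kind is "matrix" or "entries", the two directories in the listing above.
    """
    base = Path(workdir) / "similarity" / kind
    if not base.is_dir():
        return []
    return sorted(
        d.name for d in base.iterdir() if (d / "_SUCCESS").is_file()
    )
```

Calling completed_domains("out-4", "entries") on the example output would skip classic.gunauction.com, since it has no _SUCCESS file.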