html-explorer extracts the main information from an HTML page.
Currently it extracts:
- Page meta: title, description, keywords, canonical, feeds;
- Main images - an ordered list of images;
- Main videos - an ordered list of videos;
- Page content - main page content/article;
- Page encoding;
var explorer = require('html-explorer');
explorer.explore('http://edition.cnn.com/')
  .then(function(page) {
    // page object
  });
- url (String) - input url param;
- href (String) - server response url;
- canonical (String) - page canonical;
- title (String);
- description (String);
- keywords (String);
- content (String);
- encoding (String): utf8, windows-1251, iso-8859-2, etc.;
- feeds ([Feed]) - list of feeds:
  - title (String);
  - href (String) - feed url;
- images ([Image]) - a list of images:
  - src (String) - image src;
  - viewWidth (Number) - image view width, if found;
  - viewHeight (Number);
  - width (Number) - real image width;
  - height (Number);
  - alt (String);
  - title (String);
  - rating (Number) - count of words matching page title words;
  - type (String) - (only if the identify option is true) - can be: bmp, gif, jpg, png, psd, svg, tiff or webp;
  - data (Buffer) - (only if the identify option is true) - image data.
- videos ([Video]) - a list of videos:
  - sourceType (String) - video source type: URL, YOUTUBE, VIMEO or IFRAME;
  - sourceId (String) - depends on sourceType: url or source id;
  - width (Number) - video width;
  - height (Number) - video height;
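As a rough illustration of how the page object described above might be consumed, the sketch below picks the highest-rated image and builds a link for the first video. The helper names `bestImage` and `videoLink` are hypothetical and not part of html-explorer. The YouTube and Vimeo embed URL patterns are the public ones for those services, and treating `sourceId` as a plain url for `URL` and `IFRAME` sources is an assumption based on the field description.

```javascript
// Hypothetical helpers for reading a page object; not part of html-explorer.

// Return the image with the highest rating, or null if there are no images.
// On a tie the earlier image wins, so the original list order is respected.
function bestImage(page) {
  var images = page.images || [];
  var best = null;
  images.forEach(function (image) {
    if (best === null || (image.rating || 0) > (best.rating || 0)) {
      best = image;
    }
  });
  return best;
}

// Build a link for a video entry based on its sourceType.
function videoLink(video) {
  switch (video.sourceType) {
    case 'YOUTUBE':
      return 'https://www.youtube.com/embed/' + video.sourceId;
    case 'VIMEO':
      return 'https://player.vimeo.com/video/' + video.sourceId;
    default:
      // Assumption: for URL and IFRAME sources, sourceId already holds a url.
      return video.sourceId;
  }
}

// A hand-written object in the documented shape, used instead of a live request.
var samplePage = {
  title: 'Example article',
  images: [
    { src: 'http://example.com/a.jpg', width: 600, height: 400, rating: 1 },
    { src: 'http://example.com/b.jpg', width: 800, height: 450, rating: 3 },
    { src: 'http://example.com/c.jpg', width: 640, height: 360, rating: 3 }
  ],
  videos: [
    { sourceType: 'YOUTUBE', sourceId: 'abc123', width: 640, height: 360 }
  ]
};

console.log(bestImage(samplePage).src);
console.log(videoLink(samplePage.videos[0]));
```

In real use, `samplePage` would be replaced by the object passed to the `then` callback of `explorer.explore`.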
Explores a URL.
- page - html page options:
  - timeout (Number) [5000] - request timeout;
  - headers (Object) [{}] - request headers;
  - canonical (Boolean) [true] - find the canonical url or not;
  - feeds (Boolean|Function) - find feeds or not; a function is used for validating a feed;
  - validator (Function) [noop] - validates the page after exploring info; throw an error if invalid;
  - html (Boolean|String) [false] - return the HTML text or not. If it is a string, it will be used as the remote HTML body;
  - lang (String) - page language, 2-character code;
- content (Boolean|Object) - content options:
  - filter (Boolean|Object):
    - minLine (Number) [50] - accepted minimum line length;
    - minPhrase (Number) [100] - accepted minimum phrase length;
    - phraseEndRegex (Regex) [/[.!?:;¡¿%]$/] - phrase-ending punctuation regex;
    - phraseEnd (Boolean) [false] - require a phrase to end with punctuation;
    - maxInvalidLines (Number) [3] - maximum consecutive invalid lines;
    - minScore (Number) [0.3] - minimum in-text search score: 0 to 1;
- images (Boolean|Object) - images explorer options:
  - limit (Number) [5] - maximum number of images to return;
  - filter (Object):
    - minViewHeight (Number) [180] - accepted minimum image view height;
    - minViewWidth (Number) [220] - accepted minimum image view width;
    - minHeight (Number) [200] - accepted minimum image height;
    - minWidth (Number) [250] - accepted minimum image width;
    - minRating (Number) [0] - accepted minimum image rating(...);
    - minRatio (Number) [null] - accepted minimum image ratio (ratio = width/height);
    - maxRatio (Number) [null] - accepted maximum image ratio;
    - invalidRatio (Number | [Number]) [1] - example: value [1] will exclude all images with width = height;
    - invalidExtensions ([String]) [gif, png] - invalid image extensions;
    - src (RegExp) [see source code] - invalidate image by SRC;
    - extraSrc (RegExp) - invalidate image by SRC;
    - cssClass (RegExp) - filter image by its css class;
    - types (String|[String]) - accepted image types (bmp, gif, jpg, png, psd, svg, tiff, webp), default: ['jpg'];
    - invalidTypes (String|[String]) - invalid image types;
  - identify (Boolean) [false] - identify image width, height and type by downloading data;
  - data (Boolean) [false] - set image data property. Works only if identify is true;
  - timeout (Number) [1000] - image downloading timeout, in ms.
- video (Boolean|Object) - video explorer options:
  - limit (Number) [1] - maximum number of videos to return;
  - filter (Object):
    - minHeight (Number) [200] - accepted minimum video height;
    - minWidth (Number) [250] - accepted minimum video width;
    - minRatio (Number) [null] - accepted minimum video ratio (ratio = width/height);
    - maxRatio (Number) [null] - accepted maximum video ratio;
    - invalidRatio (Number | [Number]) [1] - example: value [1] will exclude all videos with width = height;
    - src (RegExp) [see source code] - invalidate video by SRC;
    - extraSrc (RegExp) - invalidate video by SRC;
  - priority ([String]) - video source type priority, default: ['YOUTUBE', 'VIMEO', 'URL', 'IFRAME'];
  - customFinders ([Finder]) - a list of custom video finders.
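To make the options above more concrete, here is a sketch of an options object together with a small function showing one way the image filter thresholds could be read. Passing the options as a second argument to `explore` is an assumption, since the call signature is not spelled out here. `atLeast` and `passesImageFilter` are hypothetical illustrations of the documented thresholds, not the library's actual filtering code.

```javascript
// Option names and default values are taken from the list above.
var options = {
  page: { timeout: 5000, lang: 'en' },
  content: { filter: { minLine: 50, minPhrase: 100 } },
  images: {
    limit: 5,
    filter: { minWidth: 250, minHeight: 200, invalidRatio: [1] }
  },
  video: { limit: 1, priority: ['YOUTUBE', 'VIMEO', 'URL', 'IFRAME'] }
};

// True when no minimum is set, or when the value reaches the minimum.
// A missing value fails any minimum that is set.
function atLeast(value, min) {
  return min == null || value >= min;
}

// Hypothetical reading of the size and ratio thresholds (ratio = width / height).
function passesImageFilter(image, filter) {
  if (!atLeast(image.width, filter.minWidth)) return false;
  if (!atLeast(image.height, filter.minHeight)) return false;
  var ratio = image.width / image.height;
  if (filter.minRatio != null && ratio < filter.minRatio) return false;
  if (filter.maxRatio != null && ratio > filter.maxRatio) return false;
  // invalidRatio may be a single number or a list of numbers.
  var invalid = [].concat(filter.invalidRatio == null ? [] : filter.invalidRatio);
  return invalid.indexOf(ratio) === -1;
}

// Assumed call shape; uncomment when html-explorer is installed:
// require('html-explorer').explore('http://edition.cnn.com/', options)
//   .then(function (page) { console.log(page.images); });
```

With these values, a 300x300 image would be rejected because its ratio is exactly 1, while an 800x450 image would pass.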
- filter page content by relevancy score option;
- added lang option;
- using ascripe module instead of readability-js;
- using in-text-search module;
- find videos from known iframes
- explore content with readability-js
- fix videos explore bug
- explore videos from microdata
- filter page content
- better encoding detection & add to the response object
- tests
- extracting page content
- editorconfig, eslint
- custom video finders
- sort videos by priority option
- head(og:video) video finder
- decode page urls
- image downloading timeout
- detect embedded videos
- better images order
- detect charset from content-type response header
- image filter: invalidRatio
- filter images by view size - width & height detected in image attributes
- merge images with same src