github.com/MontFerret/contrib/modules/web/article registers article extraction helpers under the WEB::ARTICLE namespace for Ferret hosts.
The module exposes these functions:
- `WEB::ARTICLE::EXTRACT`
- `WEB::ARTICLE::TEXT`
- `WEB::ARTICLE::MARKDOWN`
Install the module:

    go get github.com/MontFerret/contrib/modules/web/article

Register it with a Ferret engine:

    package main

    import (
        "github.com/MontFerret/ferret/v2"

        articlemodule "github.com/MontFerret/contrib/modules/web/article"
    )

    func main() {
        // Build the module that provides the WEB::ARTICLE functions.
        articleMod, err := articlemodule.New()
        if err != nil {
            panic(err)
        }

        // Register the module when constructing the engine.
        engine, err := ferret.New(
            ferret.WithModules(articleMod),
        )
        if err != nil {
            panic(err)
        }

        _ = engine
    }

| Function | Signature | Returns | Notes |
|---|---|---|---|
| `WEB::ARTICLE::EXTRACT` | `WEB::ARTICLE::EXTRACT(input)` | Object | Extracts a normalized article object from raw HTML, `HTMLPage`, `HTMLDocument`, or `HTMLElement`. |
| `WEB::ARTICLE::TEXT` | `WEB::ARTICLE::TEXT(input)` | String \| None | Returns the cleaned main article text when meaningful content is found. |
| `WEB::ARTICLE::MARKDOWN` | `WEB::ARTICLE::MARKDOWN(input)` | String \| None | Returns the cleaned article body rendered as Markdown when available. |

WEB::ARTICLE::EXTRACT always returns an object with these fields:

    {
      "title": "Example title",
      "byline": "Jane Doe",
      "excerpt": "Short description or summary",
      "siteName": "Example Site",
      "publishedAt": "2026-03-30T10:00:00Z",
      "updatedAt": "2026-03-30T12:00:00Z",
      "lang": "en",
      "dir": "ltr",
      "canonicalUrl": "https://example.com/post",
      "leadImage": "https://example.com/image.jpg",
      "text": "Clean main article text",
      "html": "<p>Sanitized article body</p>",
      "markdown": "Body rendered as Markdown",
      "wordCount": 1234,
      "readingTimeMinutes": 7,
      "tags": ["ai", "news"],
      "categories": ["Technology"]
    }

Missing values are returned as `null`.

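For hosts that consume the result in Go, the object shape above can be mirrored with a struct. The following is a minimal sketch, not part of the module: the `Article` type, `decodeArticle`, and `orDefault` are hypothetical names, and it assumes the host has already serialized the `EXTRACT` result to JSON. Pointer fields are used so that `null` stays distinguishable from an empty string or zero.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Article is a hypothetical mirror of the documented EXTRACT object.
// Pointer fields decode JSON null to nil instead of a zero value.
// Timestamps stay strings because unparseable values are preserved as-is.
type Article struct {
	Title              *string  `json:"title"`
	Byline             *string  `json:"byline"`
	Excerpt            *string  `json:"excerpt"`
	SiteName           *string  `json:"siteName"`
	PublishedAt        *string  `json:"publishedAt"`
	UpdatedAt          *string  `json:"updatedAt"`
	Lang               *string  `json:"lang"`
	Dir                *string  `json:"dir"`
	CanonicalURL       *string  `json:"canonicalUrl"`
	LeadImage          *string  `json:"leadImage"`
	Text               *string  `json:"text"`
	HTML               *string  `json:"html"`
	Markdown           *string  `json:"markdown"`
	WordCount          *int     `json:"wordCount"`
	ReadingTimeMinutes *int     `json:"readingTimeMinutes"`
	Tags               []string `json:"tags"`
	Categories         []string `json:"categories"`
}

// decodeArticle parses a JSON-serialized EXTRACT result.
func decodeArticle(data []byte) (Article, error) {
	var a Article
	err := json.Unmarshal(data, &a)
	return a, err
}

// orDefault returns *s, or fallback when the field was null or absent.
func orDefault(s *string, fallback string) string {
	if s == nil {
		return fallback
	}
	return *s
}

func main() {
	sample := []byte(`{"title":"Example title","byline":null,"wordCount":1234,"tags":["ai","news"]}`)

	a, err := decodeArticle(sample)
	if err != nil {
		panic(err)
	}

	fmt.Println(orDefault(a.Title, "(untitled)"))        // Example title
	fmt.Println(orDefault(a.Byline, "(unknown author)")) // (unknown author)
	fmt.Println(*a.WordCount, a.Tags)                    // 1234 [ai news]
}
```

Checking for `nil` before dereferencing matters most for `text`, `html`, and `markdown`, which may be missing even when metadata is present.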
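Extract an article from a fetched response body:

    LET response = HTTP::GET($url)
    RETURN WEB::ARTICLE::EXTRACT(response.body)

Extract an article from a loaded page:

    LET page = HTML::DOCUMENT($url, true)
    RETURN WEB::ARTICLE::EXTRACT(page)

Return only the article text:

    RETURN WEB::ARTICLE::TEXT($html)

Return the URL together with the Markdown body:

    RETURN {
      url: $url,
      markdown: WEB::ARTICLE::MARKDOWN(HTTP::GET($url).body)
    }
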
- Extraction is heuristic and best-effort; malformed HTML is parsed when practical.
- `input` may be raw HTML, `HTMLPage`, `HTMLDocument`, or `HTMLElement`.
- `EXTRACT` may still return metadata when no meaningful article body is found.
- `TEXT` and `MARKDOWN` return `null` when the page is parseable but not article-like enough.
- For `HTMLPage` and `HTMLDocument` inputs, the page URL is used as the fallback base URL when the DOM does not contain `<base href>`.
- URL metadata is resolved to absolute URLs whenever a base URL is available (from `<base href>` or the page URL for `HTMLPage`/`HTMLDocument`); for raw HTML or `HTMLElement` inputs without a base URL, relative URL values are preserved.
- Timestamps are normalized to RFC3339 UTC when parseable; otherwise the original trimmed value is preserved.
- `html` is sanitized with an allowlist policy before it is returned, so dangerous attributes and URL schemes are stripped from the article body fragment.
- `markdown` is rendered from that sanitized body HTML.
- `text`, `html`, and `markdown` contain the cleaned body only and do not prepend the title.
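
The URL and timestamp rules in the notes above can be illustrated with the Go standard library. This is a rough sketch of the described behavior, not the module's actual code: `resolveURL`, `normalizeTimestamp`, and the `layouts` list are hypothetical, and the module may accept other timestamp formats than the few assumed here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"time"
)

// layouts is an assumed set of input formats, chosen for illustration only.
// Layouts without a zone are interpreted as UTC by time.Parse.
var layouts = []string{
	time.RFC3339,
	"2006-01-02T15:04:05",
	"2006-01-02",
	time.RFC1123Z,
}

// normalizeTimestamp returns an RFC3339 UTC string when raw is parseable,
// otherwise the trimmed original value.
func normalizeTimestamp(raw string) string {
	trimmed := strings.TrimSpace(raw)
	for _, layout := range layouts {
		if t, err := time.Parse(layout, trimmed); err == nil {
			return t.UTC().Format(time.RFC3339)
		}
	}
	return trimmed
}

// resolveURL makes ref absolute against base when a usable base exists;
// without one, the relative value is preserved unchanged.
func resolveURL(base, ref string) string {
	ref = strings.TrimSpace(ref)
	if base == "" || ref == "" {
		return ref
	}
	b, err := url.Parse(base)
	if err != nil || !b.IsAbs() {
		return ref
	}
	r, err := url.Parse(ref)
	if err != nil {
		return ref
	}
	return b.ResolveReference(r).String()
}

func main() {
	fmt.Println(normalizeTimestamp("2026-03-30T12:00:00+02:00")) // 2026-03-30T10:00:00Z
	fmt.Println(normalizeTimestamp("  last Tuesday "))            // last Tuesday

	fmt.Println(resolveURL("https://example.com/post", "/image.jpg")) // https://example.com/image.jpg
	fmt.Println(resolveURL("", "/image.jpg"))                         // /image.jpg
}
```

In this sketch the base URL is passed in explicitly; per the notes, the module takes it from `<base href>` or, for `HTMLPage` and `HTMLDocument` inputs, from the page URL.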