Summary
Integrate the Trafilatura library to extract high-quality, readable content from URLs for users whose models lack built-in URL context tools. Trafilatura extracts the main article text and metadata (such as title, author, and date) from web pages, which makes it well suited to summarization, analysis, and downstream NLP tasks.
Motivation
- Superior Extraction: Trafilatura provides robust extraction of main content, metadata, and structure from web pages, outperforming basic crawlers for articles and news.
- Broader Model Support: Lets users of models without a built-in URL context tool (not just Gemini users) benefit from advanced URL content extraction.
- Improved User Experience: Offers more reliable and readable content for summarization, Q&A, and other tasks compared to simple crawlers.
Example Usage
import trafilatura

def get_url_context(url):
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        return "Sorry, I couldn't fetch the content from that URL."
    extracted = trafilatura.extract(
        downloaded,
        include_comments=False,
        include_links=True,
        output_format='json',
        with_metadata=True,
        url=url
    )
    if not extracted:
        return "Sorry, I couldn't extract readable content from that page."
    return extracted  # returns JSON string with 'text', 'title', 'author', 'date', etc.
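Since get_url_context returns a JSON string, the caller still has to turn it into text a model can read. A minimal sketch of that step follows, using only the standard library. The helper name format_context and the max_chars limit are hypothetical and not part of Trafilatura; the field names ('text', 'title', 'author', 'date') are the ones noted in the example above.

```python
import json


def format_context(extracted_json, max_chars=4000):
    """Turn the JSON string from get_url_context into a plain-text context block.

    Hypothetical helper for illustration; not part of Trafilatura.
    """
    try:
        data = json.loads(extracted_json)
    except (TypeError, ValueError):
        # Not JSON (for example, one of the "Sorry, ..." messages): pass it through.
        return extracted_json
    if not isinstance(data, dict):
        return extracted_json

    # Keep only the metadata fields that are actually present.
    header = []
    for key in ("title", "author", "date"):
        value = data.get(key)
        if value:
            header.append(f"{key.capitalize()}: {value}")

    # Cap the body so a long article does not overflow the model's context.
    text = (data.get("text") or "").strip()
    if len(text) > max_chars:
        text = text[:max_chars].rstrip() + " [truncated]"

    return "\n".join(header + ["", text]).strip()
```

Calling format_context(get_url_context(url)) would then give a short metadata header followed by the article body, while the error messages pass through unchanged.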
Current Limitation
Currently, only Firecrawl is available for URL extraction, which may not provide the same quality or depth as Trafilatura, especially for article-like content.
References