Feature Request: Integrate Trafilatura for URL Content Extraction #869

@mzazakeith

Summary

Integrate the Trafilatura library to extract readable content from URLs for users whose models do not have built-in URL context tools. Trafilatura extracts the main article text and metadata from web pages, which makes it well suited to summarization, analysis, and downstream NLP tasks.

Motivation

  • Superior Extraction: Trafilatura provides robust extraction of main content, metadata, and structure from web pages, outperforming basic crawlers for articles and news.
  • Broader Model Support: Enables users of regular models (not just Gemini) to benefit from advanced URL content extraction.
  • Improved User Experience: Offers more reliable and readable content for summarization, Q&A, and other tasks compared to simple crawlers.

Example Usage

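import trafilatura

def get_url_context(url):
    """Fetch a URL and return Trafilatura's JSON extraction, or an error message."""
    # fetch_url returns None when the download fails
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        return "Sorry, I couldn't fetch the content from that URL."

    extracted = trafilatura.extract(
        downloaded,
        include_comments=False,  # skip user comment sections
        include_links=True,      # keep hyperlinks in the extracted text
        output_format='json',
        with_metadata=True,
        url=url
    )

    # extract returns None when no main content is found
    if not extracted:
        return "Sorry, I couldn't extract readable content from that page."

    return extracted  # JSON string with 'text', 'title', 'author', 'date', etc.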
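
To make the downstream step concrete, here is a minimal sketch of how the JSON string returned by get_url_context could be turned into a plain-text context block for a model prompt. The helper name format_url_context and the max_chars parameter are hypothetical and only for illustration. The keys it reads ('title', 'author', 'date', 'text') are the ones named in the comment above. It uses only the standard library, so it can be tried without a network call.

```python
import json


def format_url_context(extracted_json, max_chars=4000):
    """Format a Trafilatura-style JSON string as a plain-text context block.

    Hypothetical helper: the name and the max_chars default are assumptions,
    not part of Trafilatura or of this project.
    """
    data = json.loads(extracted_json)

    # Keep only the metadata fields that are present and non-empty.
    header_lines = []
    for label, key in (("Title", "title"), ("Author", "author"), ("Date", "date")):
        value = data.get(key)
        if value:
            header_lines.append(f"{label}: {value}")

    # Truncate long articles so the context stays within a prompt budget.
    text = (data.get("text") or "").strip()
    if len(text) > max_chars:
        text = text[:max_chars].rstrip() + " [truncated]"

    # Metadata header, a blank line, then the article body.
    return "\n".join(header_lines + ["", text]).strip()
```

A caller would first need to check that get_url_context returned JSON and not one of its "Sorry, ..." messages, since both are plain strings.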

Current Limitation

Currently, only Firecrawl is available for URL extraction, which may not provide the same quality or depth as Trafilatura, especially for article-like content.
