Feature Request: Integrate Trafilatura for URL Content Extraction #869

### Summary

Integrate the [Trafilatura](https://trafilatura.readthedocs.io/) library to extract readable main content from URLs for users whose models lack built-in URL context tools. Trafilatura extracts the main article text, metadata, and more from web pages, which makes it well suited to summarization, analysis, and downstream NLP tasks.

### Motivation

- **Superior Extraction:** Trafilatura provides robust extraction of main content, metadata, and structure from web pages, outperforming basic crawlers for articles and news.
- **Broader Model Support:** Lets users of models without built-in URL context tools (not just Gemini users) benefit from advanced URL content extraction.
- **Improved User Experience:** Offers more reliable and readable content for summarization, Q&A, and other tasks compared to simple crawlers.

### Example Usage

```python
import trafilatura


def get_url_context(url: str) -> str:
    """Return the page's main content as a JSON string, or an error message."""
    # fetch_url returns None when the page cannot be downloaded.
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        return "Sorry, I couldn't fetch the content from that URL."

    # extract returns None when no main content can be identified.
    extracted = trafilatura.extract(
        downloaded,
        url=url,
        output_format="json",
        with_metadata=True,
        include_comments=False,
        include_links=True,
    )
    if not extracted:
        return "Sorry, I couldn't extract readable content from that page."

    # JSON string with 'text', 'title', 'author', 'date', etc.
    return extracted
```
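The JSON string returned above still has to be turned into something a model can consume. A minimal sketch of that step, assuming the output carries the `text`, `title`, `author`, and `date` keys mentioned in the comment. The helper name `format_context` and the `max_chars` limit are hypothetical and not part of Trafilatura:

```python
import json


def format_context(extracted_json: str, max_chars: int = 4000) -> str:
    """Turn a JSON extraction result into a plain-text context block.

    Hypothetical helper: assumes the JSON object may contain the keys
    'title', 'author', 'date', and 'text'; missing or empty keys are skipped.
    """
    data = json.loads(extracted_json)

    # Build a short header from whichever metadata fields are present.
    header_lines = []
    for key in ("title", "author", "date"):
        value = data.get(key)
        if value:
            header_lines.append(f"{key.capitalize()}: {value}")

    # Truncate the body so it fits an (assumed) context budget.
    body = (data.get("text") or "")[:max_chars]

    return "\n".join(header_lines + ["", body]).strip()
```

The input is assumed to be a successful extraction result. The error messages returned by `get_url_context` are not valid JSON and would raise `json.JSONDecodeError`, so a caller would need to tell the two cases apart first.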

### Current Limitation

Currently, only Firecrawl is available for URL extraction, which may not provide the same quality or depth as Trafilatura, especially for article-like content.
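One way to add Trafilatura without removing the existing path might be to treat extractors as interchangeable callables and try them in order. A rough sketch under that assumption: `extract_with_fallback` and the extractor signature (a URL in, text or `None` out) are hypothetical and do not reflect the project's actual code or Firecrawl's API.

```python
from typing import Callable, Iterable, Optional

# Assumed shape: an extractor takes a URL and returns text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(url: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Return the first non-empty result from the given extractors, else None."""
    for extractor in extractors:
        try:
            result = extractor(url)
        except Exception:
            # A failing extractor should not block the next one in the chain.
            continue
        if result:
            return result
    return None
```

With this shape, the order of the list decides which extractor is preferred, so a Trafilatura-based callable could be placed before or after the existing one depending on the content type.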

### References

- [Trafilatura Documentation](https://trafilatura.readthedocs.io/)
