docs(proposal): Firefox-based stealth scraper backend #1786

Draft
feder-cr wants to merge 75 commits into assafelovic:master from feder-cr:proposal-invisible-firefox-scraper

@feder-cr

opening as draft to check interest before building the backend.

would an optional invisible_firefox scraper backend be in scope, parallel to firecrawl / browser / web_base_loader / tavily_extract under gpt_researcher/scraper/?

motivation: research-relevant pages behind Cloudflare/Akamai/Datadome/hCaptcha currently return empty content or 403. relevant open issues: #1685, #1081, #1602, #1404.

the backend would wrap feder-cr/invisible_playwright, which drives a patched Firefox 150 (feder-cr/invisible_firefox, MPL-2.0, the same license as upstream Firefox). the patches are applied at the C++ source level, so there are no JS shims to detect. it would be selected via SCRAPER=invisible_firefox and shipped as an optional dependency, with no change to defaults.
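
a rough sketch of how the opt-in selection could look, assuming a simple name-to-class registry. `SCRAPER_REGISTRY`, `resolve_scraper`, `InvisibleFirefoxScraper`, the stub `WebBaseLoaderScraper`, and the `default` value are all hypothetical names for illustration, not the existing gpt_researcher API. the `invisible_playwright` package is only probed for presence here, never called:

```python
import importlib.util
import os


class WebBaseLoaderScraper:
    """Stand-in for an existing backend (stub for this sketch only)."""

    def __init__(self, link):
        self.link = link


class InvisibleFirefoxScraper:
    """Hypothetical opt-in backend that would wrap invisible_playwright."""

    def __init__(self, link):
        # Check for the optional dependency only when this backend is used,
        # so users of the other backends never need it installed.
        if importlib.util.find_spec("invisible_playwright") is None:
            raise ImportError(
                "SCRAPER=invisible_firefox requires the optional "
                "invisible_playwright package"
            )
        self.link = link


# Hypothetical registry; the real lookup in gpt_researcher/scraper/ may differ.
SCRAPER_REGISTRY = {
    "web_base_loader": WebBaseLoaderScraper,
    "invisible_firefox": InvisibleFirefoxScraper,
}


def resolve_scraper(name=None, default="web_base_loader"):
    """Return the scraper class named by `name` or the SCRAPER env var."""
    # `default` is an assumption of this sketch, not the project's default.
    name = name or os.environ.get("SCRAPER") or default
    if name not in SCRAPER_REGISTRY:
        raise ValueError(f"unknown scraper backend: {name!r}")
    return SCRAPER_REGISTRY[name]
```

because the dependency check sits in the constructor, resolving the class never imports anything extra, which is what keeps the backend optional.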

this PR only adds an RFC stub in docs/docs/proposals/ so the proposal has somewhere concrete to land. tracking discussion: #1785

if the answer is "not in scope" i'll close it without noise.

He Qiangqiang and others added 30 commits January 30, 2026 09:43
- Convert string to float first before int to parse decimal dimensions like '409.12'
- Catch TypeError in addition to ValueError for robustness
- Update comment for clarity
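
A minimal sketch of the parsing change described above. The function name `parse_dimension` and the `default` fallback are assumptions for illustration; the actual helper in the repository may be named and shaped differently.

```python
def parse_dimension(value, default=0):
    """Parse an image dimension such as '409.12' into an int.

    `default` is a hypothetical fallback for unparseable input.
    """
    try:
        # int('409.12') raises ValueError, so go through float first.
        return int(float(value))
    except (ValueError, TypeError):
        # ValueError: non-numeric strings such as 'auto'.
        # TypeError: non-string, non-numeric input such as None.
        return default
```
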
Previously, PyMuPDFScraper only extracted doc[0].page_content, causing
PDFs with cover pages (common in ESG/annual reports) to fail the 100-char
minimum content validation despite having thousands of characters total.

This fix concatenates all pages, resolving the issue for 68% of PDFs
that were incorrectly rejected.
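
The difference can be sketched as follows, assuming `doc` is a list of page objects that each expose `page_content`. The `Page` class, `MIN_CONTENT_CHARS`, and `extract_text` are stand-ins invented for this example, not the scraper's real names.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Stand-in for a loaded PDF page with a page_content attribute."""
    page_content: str


MIN_CONTENT_CHARS = 100  # mirrors the 100-char minimum mentioned above


def extract_text(doc):
    """Join the text of every page instead of reading only doc[0]."""
    return "\n".join(page.page_content for page in doc)


doc = [Page("Annual Report"), Page("Body text. " * 50)]
first_page_only = doc[0].page_content   # old behaviour: cover page only
all_pages = extract_text(doc)           # new behaviour: full document
```

Here the cover page alone falls below the minimum, while the concatenated text passes it.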

Fixes assafelovic#1600
Added instructions for setting the FireCrawl server URL when using self-hosted server.
Clarify usage notes for FireCrawl API key and server URL.
Bare `except:` catches BaseException including KeyboardInterrupt and
SystemExit. Replaced 9 instances with `except Exception:`.
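
A small illustration of why the narrower clause matters; `run_guarded` and the two task functions are hypothetical names used only for this sketch.

```python
def run_guarded(task):
    """Run task(), recovering from ordinary errors only."""
    try:
        return task()
    except Exception as exc:
        # KeyboardInterrupt and SystemExit derive from BaseException,
        # not Exception, so they are not swallowed here.
        return f"recovered: {exc}"


def fails():
    raise ValueError("bad input")


def interrupted():
    raise KeyboardInterrupt
```

With a bare `except:`, calling `run_guarded(interrupted)` would swallow the interrupt; with `except Exception:` it propagates to the caller.
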
Three fixes for running NoDriverScraper in containerised / concurrent
environments:

1. **browser_connection_timeout 1 → 10 s**
   Chrome's CDP socket takes 3-8 s to become available when running as
   root in a Docker container. The 1 s default causes immediate timeout
   even when Chrome started successfully, making the scraper silently
   fail in every Docker deployment.

2. **max_browsers 3 → 5 / browser_load_threshold 5 → 8**
   Deep research spawns several concurrent sub-researchers, each
   issuing scrape requests simultaneously. The previous pool limits
   caused unnecessary browser creation and teardown churn; raising
   them allows the pool to absorb concurrent deep-research workloads
   without thrashing.

3. **Guard against browser.get() returning None**
   When the CDP connection times out, browser.get() can return None
   instead of raising an exception. The caller then attempts to call
   methods on None, crashing with an AttributeError. More critically,
   browser.get() has already incremented processing_count before
   returning; without the guard the slot is never released, which
   eventually deadlocks the entire browser pool (no new browsers can
   be acquired and existing ones appear permanently busy).
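
The guard in point 3 can be sketched roughly as below. This is a simplified synchronous model, not the NoDriverScraper code: `PooledBrowser`, `release`, and `scrape` are hypothetical names, and a plain string stands in for the page object.

```python
class PooledBrowser:
    """Toy model of a pooled browser that counts in-flight requests."""

    def __init__(self, page):
        self.processing_count = 0
        self._page = page  # None simulates a CDP connection timeout

    def get(self, url):
        # The slot is claimed before the result is known.
        self.processing_count += 1
        return self._page

    def release(self):
        self.processing_count -= 1


def scrape(browser, url):
    page = browser.get(url)
    if page is None:
        # Without this release the slot stays claimed forever.
        browser.release()
        return None
    try:
        return page.strip()
    finally:
        browser.release()
```

In both the failure and the success path, `processing_count` returns to zero, so the pool never sees a permanently busy browser.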

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Avian (https://avian.io) as a supported LLM provider. Avian offers
an OpenAI-compatible API with cost-effective frontier models including
DeepSeek-V3.2, Kimi-K2.5, GLM-5, and MiniMax-M2.5.

Configuration uses AVIAN_API_KEY env var and the avian: provider prefix
(e.g. SMART_LLM=avian:deepseek/deepseek-v3.2).
Update FireCrawl setup instructions for self hosted instance
…t-run-task-import

fix(server): resolve multi-agent run_research_task NameError
…context-compression

perf: optimize context compression with smart fast-path for small documents
…duplication

perf: add URL deduplication to prevent redundant scraping
fix: replace 9 bare except clauses with except Exception
fix: nodriver scraper Docker compatibility and browser pool deadlock
…-all-pages

fix: Read all pages in PyMuPDFScraper instead of just first page
Replace expired Discord invite code (DUmbTebB) with the current valid
one (QgZXvJAccX) in the LLM documentation page.

Fixes assafelovic#1474
…rd-invite

docs: fix invalid Discord invite link in LLM docs
When _check_pkg installs a missing package via subprocess pip, the
running Python process's import caches are stale. Add
importlib.invalidate_caches() after successful installation so that
the subsequent import of the newly installed package succeeds.

This fixes the issue where the Docker container logs show 'installed
successfully' but then fails with 'No module named tavily'.
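
A minimal sketch of the install-then-import pattern, assuming a helper shaped roughly like the one below. `ensure_package` and its parameters are illustrative names, not the actual `_check_pkg` signature.

```python
import importlib
import importlib.util
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it with pip first if it is missing."""
    if importlib.util.find_spec(module_name) is None:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        # The import system may have cached directory listings from before
        # the install; clear them so the new package can be found.
        importlib.invalidate_caches()
    return importlib.import_module(module_name)
```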

Fixes assafelovic#1625
The docs instruct users to set OPENAI_API_VERSION but embeddings.py
used os.environ['AZURE_OPENAI_API_VERSION'] which raises KeyError.
Now tries AZURE_OPENAI_API_VERSION first, then falls back to
OPENAI_API_VERSION.
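
The lookup order can be sketched like this. `resolve_api_version` is a hypothetical helper, and returning `None` when neither variable is set is an assumption of the sketch rather than the behaviour of `embeddings.py`.

```python
import os


def resolve_api_version(env=None):
    """Prefer AZURE_OPENAI_API_VERSION, then fall back to OPENAI_API_VERSION."""
    env = os.environ if env is None else env
    for key in ("AZURE_OPENAI_API_VERSION", "OPENAI_API_VERSION"):
        value = env.get(key)
        if value:
            return value
    return None  # assumption: callers decide how to handle a missing value
```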

Fixes assafelovic#1469
- Change timeout from 5s to (5, 30) tuple so the connect timeout
  remains 5s but the read timeout allows 30s for large PDFs.
- Add SSL verification fallback: try with verification first, on
  SSLError retry without verification (with a warning log).

Fixes assafelovic#1601
The generate_research_plan method only used retrievers[0] for its
initial search, ignoring additional configured retrievers. Now
iterates through all retrievers and aggregates results.

Fixes assafelovic#1574
The showdown converter has tables enabled but the CSS had no styles
for table elements, causing tables to render without borders or
formatting.

Fixes assafelovic#1578
Add per-step cost tracking to GPTResearcher:
- New step_costs dict records costs attributed to each research phase
- _current_step is set to 'agent_selection', 'research',
  'report_writing', or 'deep_research' as the workflow progresses
- New get_step_costs() method returns the cost breakdown
- add_costs() now also records cost against the current step
- Fully backward-compatible: existing cost_callback usage unchanged

Example:
  costs = researcher.get_step_costs()
  # {'agent_selection': 0.01, 'research': 0.15, 'report_writing': 0.45}
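
A rough model of the bookkeeping described above, assuming costs are plain floats. `CostTracker` is a hypothetical stand-in for the relevant part of `GPTResearcher`; only the names `step_costs`, `_current_step`, `add_costs`, and `get_step_costs` come from the description.

```python
from collections import defaultdict


class CostTracker:
    """Toy version of per-step cost tracking."""

    def __init__(self):
        self.total_cost = 0.0                 # hypothetical running total
        self.step_costs = defaultdict(float)
        self._current_step = None

    def add_costs(self, cost):
        self.total_cost += cost
        # Attribute the cost to whichever phase is active, if any.
        if self._current_step is not None:
            self.step_costs[self._current_step] += cost

    def get_step_costs(self):
        return dict(self.step_costs)


tracker = CostTracker()
tracker._current_step = "research"
tracker.add_costs(0.25)
tracker.add_costs(0.25)
tracker._current_step = "report_writing"
tracker.add_costs(0.5)
```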

Fixes assafelovic#1470
Add a numeric input field to the FastAPI frontend that allows users
to control the number of websites scraped per search query (1-20,
default 5). The value is sent via WebSocket to the backend, threaded
through to BasicReport/DetailedReport, and applied as an override to
cfg.max_search_results_per_query. Subtopic researchers in detailed
reports also inherit the setting.

Fixes assafelovic#1504
Display sub-queries as 'pondering questions' in the research progress
output, matching the feature available in the Next.js frontend.

When the backend sends a 'subqueries' log message with metadata, the
FastAPI frontend now renders it as a visually distinct section with
pill-shaped tags for each sub-query, styled to match the app theme.

Fixes assafelovic#1503
assafelovic and others added 29 commits March 13, 2026 11:11
fix docs links and add ag2 pipeline diagram
fix: improve retry handling in create_chat_completion
Add support for connection headers in MCP client/server config
Updated contributor image link to include a max size parameter.
Add direct support for MiniMax models (MiniMax-M2.5, MiniMax-M2.5-highspeed)
via their OpenAI-compatible API. This includes both LLM chat and embedding
(embo-01) support, configured through the MINIMAX_API_KEY environment variable.
- Replace deprecated duckduckgo-search with ddgs>=9.0.0
- Add python-pptx and pandas to requirements
- Add missing os import in websocket_manager
Update documentation to recommend MiniMax-M2.7 and M2.7-highspeed as
the default models, while keeping M2.5 variants listed as alternatives.
M2.7 offers improved reasoning capabilities over M2.5.
Two issues prevented PubMed Central from working:

1. PubMed retriever returned `url`/`raw_content` keys but the research
   pipeline expected `href` to collect URLs. Added `href` and `body`
   keys to match the expected interface.

2. The pipeline re-scraped all URLs via web scraper, discarding the
   full-text content already fetched by PubMed's API. PMC URLs often
   block web scraping, resulting in empty content. Now retrievers that
   provide `raw_content` (>100 chars) have their content passed through
   directly without re-scraping.
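
Both fixes can be sketched together as below. The helper names and `MIN_RAW_CONTENT_CHARS` are hypothetical, and mapping `body` to the same text as `raw_content` is an assumption of this sketch; the actual retriever may fill `body` differently.

```python
MIN_RAW_CONTENT_CHARS = 100  # mirrors the >100 chars threshold above


def normalize_result(result):
    """Add the href/body keys the pipeline expects, keeping existing keys."""
    normalized = dict(result)
    normalized.setdefault("href", result.get("url"))
    normalized.setdefault("body", result.get("raw_content", ""))
    return normalized


def split_for_scraping(results):
    """Separate results with usable raw_content from URLs needing a scrape."""
    ready, to_scrape = [], []
    for result in results:
        if len(result.get("raw_content") or "") > MIN_RAW_CONTENT_CHARS:
            ready.append(result)        # pass through without re-scraping
        else:
            to_scrape.append(result["href"])
    return ready, to_scrape
```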

Fixes assafelovic#1301
Adds XquikSearch — the first social media retriever for GPT Researcher.
Searches X (Twitter) via Xquik API for real-time perspectives, dev
discussions, product feedback, breaking news, and expert opinions.

Usage: set RETRIEVER=xquik and XQUIK_API_KEY in env.
Can combine with other retrievers: RETRIEVER=tavily,xquik,duckduckgo

Uses stdlib urllib only — zero new dependencies. Returns standard
{title, href, body} format. $0.00015 per tweet read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some LLM providers (e.g. reasoning/thinking models like GLM-5) emit
hundreds of empty-string content chunks before real content arrives.
The existing `if content is not None` check passes for empty strings,
so the paragraph buffer accumulates nothing and no output appears on
stdout until all empty chunks are consumed.

Changes:
- Replace `if content is not None:` with `if not content: continue`
  to skip both None and empty-string chunks
- Add `flush=True` to the print() call in _send_output() so output
  appears immediately when running in pipe/non-TTY mode
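
A minimal sketch of the streaming loop after the change, assuming chunks arrive as an iterable of strings or `None`. `collect_stream` is an illustrative name, not the function in the repository.

```python
def collect_stream(chunks):
    """Accumulate streamed content, skipping None and empty-string chunks."""
    parts = []
    for content in chunks:
        if not content:
            # Covers both None and '' in one check.
            continue
        parts.append(content)
        # flush=True pushes output through pipes and other non-TTY sinks.
        print(content, end="", flush=True)
    return "".join(parts)
```
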
fix potential data parsing issue in web scraping
…ontext

Fix: Add context normalization for dict and string formats
…ax-provider

feat: add MiniMax as a native LLM and embedding provider (M2.7)
…encies-and-imports

Update dependencies and add missing os import
…-retriever-integration

fix: PubMed Central retriever returns no results
…-websocket-manager

Fix missing os import in websocket manager
…mpty-chunks-flush

fix: skip empty LLM streaming chunks and flush stdout
…ever

feat: add Xquik X/Twitter search retriever
Clarified the description of GPT Researcher for better understanding.
Adds an RFC proposal for an optional invisible_firefox scraper backend
parallel to firecrawl, browser, web_base_loader, tavily_extract.
Opened as draft to check interest before building the backend module.

feder-cr (Author) commented Jun 6, 2026

@assafelovic any interest in an optional stealth scraper backend for sources behind anti-bot protection? it's opt-in, with no change to defaults. happy to close otherwise.
