docs(proposal): Firefox-based stealth scraper backend by feder-cr · Pull Request #1786 · assafelovic/gpt-researcher

feder-cr · 2026-05-27T04:11:24Z

opening as draft to check interest before building the backend.

would an optional invisible_firefox scraper backend be in scope, parallel to firecrawl / browser / web_base_loader / tavily_extract under gpt_researcher/scraper/?

motivation: research-relevant pages behind Cloudflare/Akamai/Datadome/hCaptcha currently return empty content or 403. relevant open issues: #1685, #1081, #1602, #1404.

the backend would wrap feder-cr/invisible_playwright, which drives a patched Firefox 150 (feder-cr/invisible_firefox, MPL-2, the same license as upstream Firefox). the patches are applied at the C++ source level, so there are no JS shims to detect. it would be selected via SCRAPER=invisible_firefox and shipped as an optional dependency, with no change to defaults.
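a rough sketch of how the opt-in selection could behave, not the actual gpt_researcher code. `DEFAULT_SCRAPER`, `OPTIONAL_MODULE`, `module_available` and `resolve_scraper` are hypothetical names, and the import name of the optional dependency is an assumption:

```python
import importlib.util
import os

# Assumption: placeholder for whatever default the project already uses.
DEFAULT_SCRAPER = "bs"
# Assumption: import name of the optional dependency.
OPTIONAL_MODULE = "invisible_playwright"


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def resolve_scraper(env=None, is_available=module_available):
    """Read SCRAPER from the environment; the default is left untouched."""
    env = os.environ if env is None else env
    name = env.get("SCRAPER", DEFAULT_SCRAPER)
    # Only the opt-in backend requires the extra package.
    if name == "invisible_firefox" and not is_available(OPTIONAL_MODULE):
        raise RuntimeError(
            "SCRAPER=invisible_firefox needs the optional dependency "
            f"'{OPTIONAL_MODULE}' to be installed"
        )
    return name
```

with SCRAPER unset the function returns the existing default, so nothing changes for users who never opt in.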

this PR only adds an RFC stub in docs/docs/proposals/ so the proposal has somewhere concrete to land. tracking discussion: #1785

if the answer is "not in scope" i'll close it without noise.


Previously, PyMuPDFScraper only extracted doc[0].page_content, causing PDFs with cover pages (common in ESG/annual reports) to fail the 100-char minimum content validation despite having thousands of characters total. This fix concatenates all pages, resolving the issue for 68% of PDFs that were incorrectly rejected. Fixes assafelovic#1600
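A minimal sketch of the idea, assuming the loader returns one object per page with a `page_content` attribute. `Page`, `join_pages` and `MIN_CHARS` are illustrative names, not the scraper's real ones:

```python
from dataclasses import dataclass

MIN_CHARS = 100  # the 100-char minimum mentioned above


@dataclass
class Page:
    page_content: str


def join_pages(docs, separator="\n"):
    """Concatenate the text of every page instead of reading only docs[0]."""
    return separator.join(d.page_content for d in docs if d.page_content)


# A short cover page followed by a long body page.
docs = [Page("Annual Report"), Page("x" * 500)]
first_page_only = docs[0].page_content
all_pages = join_pages(docs)
```

Here `first_page_only` falls under the minimum while `all_pages` passes it, which is the cover-page failure described above.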


Three fixes for running NoDriverScraper in containerised / concurrent environments:

1. **browser_connection_timeout 1 → 10 s** Chrome's CDP socket takes 3-8 s to become available when running as root in a Docker container. The 1 s default causes immediate timeout even when Chrome started successfully, making the scraper silently fail in every Docker deployment.

2. **max_browsers 3 → 5 / browser_load_threshold 5 → 8** Deep research spawns several concurrent sub-researchers, each issuing scrape requests simultaneously. The previous pool limits caused unnecessary browser creation and teardown churn; raising them allows the pool to absorb concurrent deep-research workloads without thrashing.

3. **Guard against browser.get() returning None** When the CDP connection times out, browser.get() can return None instead of raising an exception. The caller then attempts to call methods on None, crashing with an AttributeError. More critically, browser.get() has already incremented processing_count before returning; without the guard the slot is never released, which eventually deadlocks the entire browser pool (no new browsers can be acquired and existing ones appear permanently busy).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
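The None guard in point 3 can be sketched as follows. `PooledBrowser`, `release` and `scrape` are hypothetical stand-ins written for illustration and do not mirror the real NoDriverScraper classes:

```python
import asyncio


class PooledBrowser:
    """Toy pool entry: get() claims a slot before it knows whether it worked."""

    def __init__(self, page=None):
        self.processing_count = 0
        self._page = page  # None simulates a CDP connection timeout

    async def get(self, url):
        self.processing_count += 1
        return self._page

    def release(self):
        self.processing_count -= 1


async def scrape(browser, url):
    page = await browser.get(url)
    if page is None:
        # Guard: give the slot back instead of crashing on None.
        browser.release()
        return None
    try:
        return f"{page}:{url}"
    finally:
        browser.release()
```

Without the `page is None` branch the counter would stay at 1 after a timeout, which is the leak that eventually makes every browser in the pool look busy.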

Add Avian (https://avian.io) as a supported LLM provider. Avian offers an OpenAI-compatible API with cost-effective frontier models including DeepSeek-V3.2, Kimi-K2.5, GLM-5, and MiniMax-M2.5. Configuration uses AVIAN_API_KEY env var and the avian: provider prefix (e.g. SMART_LLM=avian:deepseek/deepseek-v3.2).


When _check_pkg installs a missing package via subprocess pip, the running Python process's import caches are stale. Add importlib.invalidate_caches() after successful installation so that the subsequent import of the newly installed package succeeds. This fixes the issue where the Docker container logs show 'installed successfully' but then fails with 'No module named tavily'. Fixes assafelovic#1625
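A minimal sketch of the install-then-import pattern, assuming a helper shaped roughly like `_check_pkg`. `ensure_package` is a hypothetical name, not the project's function:

```python
import importlib
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import a module, installing it with pip first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        # The running interpreter cached the earlier directory listings;
        # clear them so the freshly installed package can be found.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

`importlib.invalidate_caches()` is the standard-library call named in the commit message; everything else here is illustrative.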

The docs instruct users to set OPENAI_API_VERSION but embeddings.py used os.environ['AZURE_OPENAI_API_VERSION'] which raises KeyError. Now tries AZURE_OPENAI_API_VERSION first, then falls back to OPENAI_API_VERSION. Fixes assafelovic#1469
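The fallback order could look roughly like this. `azure_api_version` is a hypothetical helper, and raising KeyError when neither variable is set is an assumption about the desired behaviour:

```python
import os


def azure_api_version(env=os.environ):
    """Prefer AZURE_OPENAI_API_VERSION, then fall back to OPENAI_API_VERSION."""
    version = env.get("AZURE_OPENAI_API_VERSION") or env.get("OPENAI_API_VERSION")
    if not version:
        # Assumption: fail loudly with both names so the fix is obvious.
        raise KeyError("AZURE_OPENAI_API_VERSION or OPENAI_API_VERSION")
    return version
```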

- Change timeout from 5s to (5, 30) tuple so the connect timeout remains 5s but the read timeout allows 30s for large PDFs.
- Add SSL verification fallback: try with verification first, on SSLError retry without verification (with a warning log).

Fixes assafelovic#1601


Add per-step cost tracking to GPTResearcher:

- New step_costs dict records costs attributed to each research phase
- _current_step is set to 'agent_selection', 'research', 'report_writing', or 'deep_research' as the workflow progresses
- New get_step_costs() method returns the cost breakdown
- add_costs() now also records cost against the current step
- Fully backward-compatible: existing cost_callback usage unchanged

Example:

    costs = researcher.get_step_costs()
    # {'agent_selection': 0.01, 'research': 0.15, 'report_writing': 0.45}

Fixes assafelovic#1470
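A stripped-down sketch of the bookkeeping described above. `CostTracker` and `total_cost` are stand-ins for illustration; only `step_costs`, `_current_step`, `add_costs()` and `get_step_costs()` come from the commit message, and their real implementations may differ:

```python
class CostTracker:
    """Toy model of per-step cost attribution."""

    def __init__(self):
        self.total_cost = 0.0
        self.step_costs = {}
        self._current_step = None

    def add_costs(self, cost):
        # The running total behaves as before; the per-step record is additive.
        self.total_cost += cost
        if self._current_step is not None:
            previous = self.step_costs.get(self._current_step, 0.0)
            self.step_costs[self._current_step] = previous + cost

    def get_step_costs(self):
        # Return a copy so callers cannot mutate internal state.
        return dict(self.step_costs)


tracker = CostTracker()
tracker._current_step = "research"
tracker.add_costs(0.25)
tracker.add_costs(0.25)
tracker._current_step = "report_writing"
tracker.add_costs(0.5)
```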

Add a numeric input field to the FastAPI frontend that allows users to control the number of websites scraped per search query (1-20, default 5). The value is sent via WebSocket to the backend, threaded through to BasicReport/DetailedReport, and applied as an override to cfg.max_search_results_per_query. Subtopic researchers in detailed reports also inherit the setting. Fixes assafelovic#1504

Display sub-queries as 'pondering questions' in the research progress output, matching the feature available in the Next.js frontend. When the backend sends a 'subqueries' log message with metadata, the FastAPI frontend now renders it as a visually distinct section with pill-shaped tags for each sub-query, styled to match the app theme. Fixes assafelovic#1503


Add direct support for MiniMax models (MiniMax-M2.5, MiniMax-M2.5-highspeed) via their OpenAI-compatible API. This includes both LLM chat and embedding (embo-01) support, configured through the MINIMAX_API_KEY environment variable.


Two issues prevented PubMed Central from working:

1. PubMed retriever returned `url`/`raw_content` keys but the research pipeline expected `href` to collect URLs. Added `href` and `body` keys to match the expected interface.

2. The pipeline re-scraped all URLs via web scraper, discarding the full-text content already fetched by PubMed's API. PMC URLs often block web scraping, resulting in empty content. Now retrievers that provide `raw_content` (>100 chars) have their content passed through directly without re-scraping.

Fixes assafelovic#1301
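Both changes can be illustrated with plain dictionaries. `normalize_result` and `needs_scraping` are hypothetical helpers, and filling `body` from `raw_content` is an assumption about the mapping:

```python
def normalize_result(result):
    """Add the href/body keys the pipeline expects, keeping the originals."""
    normalized = dict(result)
    normalized.setdefault("href", result.get("url"))
    # Assumption: body mirrors the full text the retriever already fetched.
    normalized.setdefault("body", result.get("raw_content") or "")
    return normalized


def needs_scraping(result, min_chars=100):
    """Skip re-scraping when the retriever supplied more than min_chars of text."""
    return len(result.get("raw_content") or "") <= min_chars


pmc = normalize_result({"url": "https://example.org/pmc", "raw_content": "a" * 500})
bare = normalize_result({"url": "https://example.org/page"})
```

With these inputs only `bare` would be handed to the web scraper; `pmc` keeps the text it arrived with.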

Adds XquikSearch — the first social media retriever for GPT Researcher. Searches X (Twitter) via Xquik API for real-time perspectives, dev discussions, product feedback, breaking news, and expert opinions. Usage: set RETRIEVER=xquik and XQUIK_API_KEY in env. Can combine with other retrievers: RETRIEVER=tavily,xquik,duckduckgo Uses stdlib urllib only — zero new dependencies. Returns standard {title, href, body} format. $0.00015 per tweet read. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Some LLM providers (e.g. reasoning/thinking models like GLM-5) emit hundreds of empty-string content chunks before real content arrives. The existing `if content is not None` check passes for empty strings, so the paragraph buffer accumulates nothing and no output appears on stdout until all empty chunks are consumed.

Changes:

- Replace `if content is not None:` with `if not content: continue` to skip both None and empty-string chunks
- Add `flush=True` to the print() call in _send_output() so output appears immediately when running in pipe/non-TTY mode
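A small sketch of the chunk filter, assuming the stream yields either strings or None. `collect_stream` is a hypothetical function, not the project's streaming code:

```python
def collect_stream(chunks):
    """Accumulate streamed content, ignoring None and empty-string chunks."""
    parts = []
    for content in chunks:
        if not content:
            # Covers both None and "" in a single check.
            continue
        parts.append(content)
        # flush=True pushes output through pipes and non-TTY stdout at once.
        print(content, end="", flush=True)
    return "".join(parts)
```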


feder-cr · 2026-06-06T05:48:08Z

@assafelovic any interest in an optional stealth scraper backend for sources behind anti-bot protection? it's opt-in, with no change to defaults. happy to close otherwise.

He Qiangqiang and others added 30 commits January 30, 2026 09:43

fix: handle decimal values and type errors in parse_dimension

d528270

- Convert string to float first before int to parse decimal dimensions like '409.12'
- Catch TypeError in addition to ValueError for robustness
- Update comment for clarity
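The conversion can be sketched like this. The signature and the `default` parameter are assumptions; the real parse_dimension may accept other inputs:

```python
def parse_dimension(value, default=None):
    """Parse a dimension such as '409.12' into an int, or return default."""
    try:
        # int('409.12') raises ValueError, so go through float first.
        return int(float(value))
    except (ValueError, TypeError):
        # TypeError covers inputs like None that float() cannot accept.
        return default
```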

Update FireCrawl setup instructions

90b551e

Added instructions for setting the FireCrawl server URL when using self-hosted server.

Update usage notes for FireCrawl configuration

da3980c

Clarify usage notes for FireCrawl API key and server URL.

fix(server): resolve multi-agent run_research_task NameError

9a681c7

perf: optimize context compression with smart fast-path for small doc…

acb02ac

…uments

refactor: move Document import to top of file for better code organiz…

1d2b4e8

…ation

perf: add URL deduplication to prevent redundant scraping

ca499e2

fix: replace bare except clauses with except Exception

dcfe673

Bare `except:` catches BaseException including KeyboardInterrupt and SystemExit. Replaced 9 instances with `except Exception:`.
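A short illustration of the difference; `swallow_errors` and `exit_now` are names made up for this example:

```python
def swallow_errors(fn):
    """Run fn, handling ordinary errors but not interpreter-level exits."""
    try:
        fn()
    except Exception:
        # ValueError, KeyError and similar land here.
        # KeyboardInterrupt and SystemExit derive from BaseException only,
        # so they pass straight through to the caller.
        return "handled"
    return "ok"


def exit_now():
    raise SystemExit(1)
```

Calling `swallow_errors(exit_now)` lets SystemExit propagate, whereas a bare `except:` in the same position would have swallowed it.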

Merge pull request assafelovic#1639 from technot80/firecrawl-doc

051c7c5

Update FireCrawl setup instructions for self hosted instance

Merge pull request assafelovic#1640 from AlexanderCGO2/fix/multi-agen…

db92a59

…t-run-task-import fix(server): resolve multi-agent run_research_task NameError

Merge pull request assafelovic#1641 from maanavagrawal/feature/smart-…

74e540e

…context-compression perf: optimize context compression with smart fast-path for small documents

Merge pull request assafelovic#1642 from maanavagrawal/feature/url-de…

059592d

…duplication perf: add URL deduplication to prevent redundant scraping

Merge pull request assafelovic#1643 from haosenwang1018/fix/bare-excepts

3c1a8ff

fix: replace 9 bare except clauses with except Exception

Merge pull request assafelovic#1645 from mareurs/nodriver-docker-fix

fd15fb7

fix: nodriver scraper Docker compatibility and browser pool deadlock

Merge pull request assafelovic#1646 from avianion/add-avian-llm-provider

9a439d8

feat: add Avian as LLM provider

updated project version

ff11c55

Merge pull request assafelovic#1623 from MattBenesch/fix/pymupdf-read…

aa5a6dd

…-all-pages fix: Read all pages in PyMuPDFScraper instead of just first page

docs: fix invalid Discord invite link in LLM docs

f109f5b

Replace expired Discord invite code (DUmbTebB) with the current valid one (QgZXvJAccX) in the LLM documentation page. Fixes assafelovic#1474

Merge pull request assafelovic#1649 from Br1an67/fix/issue-1474-disco…

9e4a19a

…rd-invite docs: fix invalid Discord invite link in LLM docs

fix: use all configured retrievers in deep research planning

47739c2

The generate_research_plan method only used retrievers[0] for its initial search, ignoring additional configured retrievers. Now iterates through all retrievers and aggregates results. Fixes assafelovic#1574
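A minimal sketch of the aggregation, assuming each retriever can be modelled as a callable that returns a list of results. `search_all` is a hypothetical helper and does not reproduce generate_research_plan:

```python
def search_all(retrievers, query):
    """Query every configured retriever and merge the results in order."""
    results = []
    for retriever in retrievers:  # previously only retrievers[0] was used
        results.extend(retriever(query))
    return results


def first(query):
    return [{"href": "a", "body": query}]


def second(query):
    return [{"href": "b", "body": query}]


merged = search_all([first, second], "q")
```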

fix: add table styles to report container in FastAPI frontend

fb3020b

The showdown converter has tables enabled but the CSS had no styles for table elements, causing tables to render without borders or formatting. Fixes assafelovic#1578

assafelovic and others added 29 commits March 13, 2026 11:11

Merge pull request assafelovic#1662 from qingyun-wu/ag2-update

f7337e7

fix docs links and add ag2 pipeline diagram

Merge pull request assafelovic#1664 from jhyz/fix/llm-retry-logic

ea16010

fix: improve retry handling in create_chat_completion

Merge pull request assafelovic#1665 from GeorgelPreput/main

7d476fa

Add support for connection headers in MCP client/server config

Merge branch 'main' into fix/hash-mcp-context

a14df2a

assafelovic#1673: Fixed the reference error in code, added 'strategic…

9b98642

…_llm' variable instead of 'model_name'.

Merge pull request assafelovic#1674 from parth3083/fix/assafelovic#1673

b648bd2

assafelovic#1673: Fixed the reference error in code.

Modify contributor image link in README

7c32174

Updated contributor image link to include a max size parameter.

Update dependencies and add missing os import

9713bcc

- Replace deprecated duckduckgo-search with ddgs>=9.0.0
- Add python-pptx and pandas to requirements
- Add missing os import in websocket_manager

feat: upgrade MiniMax models from M2.5 to M2.7 as default

f090d3a

Update documentation to recommend MiniMax-M2.7 and M2.7-highspeed as the default models, while keeping M2.5 variants listed as alternatives. M2.7 offers improved reasoning capabilities over M2.5.

Fix missing os import in websocket manager

86e27b3

Fix issue 1712

34723c1

Fix for issue 1718

b1032b6

Merge branch 'assafelovic:main' into fix/hash-mcp-context

4685e8b

Import os in websocket manager

8dd9562

Merge pull request assafelovic#1607 from Carton/fixes

61a8763

fix potential data parsing issue in web scraping

Merge pull request assafelovic#1668 from GeorgelPreput/fix/hash-mcp-c…

412bbd5

…ontext Fix: Add context normalization for dict and string formats

Merge pull request assafelovic#1677 from octo-patch/feature/add-minim…

0acc7a0

…ax-provider feat: add MiniMax as a native LLM and embedding provider (M2.7)

Merge pull request assafelovic#1680 from mparker404/fix/update-depend…

37e04f1

…encies-and-imports Update dependencies and add missing os import

Merge pull request assafelovic#1686 from antek-eth/fix/pubmed-central…

3b44b54

…-retriever-integration fix: PubMed Central retriever returns no results

Merge pull request assafelovic#1697 from sztoplover-bit/fix/import-os…

17c9d20

…-websocket-manager Fix missing os import in websocket manager

Merge pull request assafelovic#1720 from test23techvv/bug_fix/pdf_report

c6488fc

Bug fix/pdf report

Merge pull request assafelovic#1737 from kiranvk-2011/fix/streaming-e…

645f24c

…mpty-chunks-flush fix: skip empty LLM streaming chunks and flush stdout

Merge pull request assafelovic#1734 from kriptoburak/feat/xquik-retri…

27abde0

…ever feat: add Xquik X/Twitter search retriever

Refine GPT Researcher description in README

92bfc03

Clarified the description of GPT Researcher for better understanding.

docs(proposal): Firefox-based stealth scraper backend

526a61b

Adds an RFC proposal for an optional invisible_firefox scraper backend parallel to firecrawl, browser, web_base_loader, tavily_extract. Opened as draft to check interest before building the backend module.

feder-cr wants to merge 75 commits into assafelovic:master from feder-cr:proposal-invisible-firefox-scraper


19 participants
