docs(proposal): Firefox-based stealth scraper backend #1786
Draft
feder-cr wants to merge 75 commits into
- Convert string to float first before int to parse decimal dimensions like '409.12'
- Catch TypeError in addition to ValueError for robustness
- Update comment for clarity
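A minimal sketch of the float-then-int parsing described in this commit. The helper name `parse_dimension` and the `default` fallback are assumptions for illustration, not the project's actual function.

```python
def parse_dimension(value, default=0):
    """Parse an image dimension such as '409.12' or 300 into an int.

    Hypothetical helper; the real scraper code may be structured differently.
    """
    try:
        # float() first so decimal strings like '409.12' are accepted;
        # int() then truncates toward zero.
        return int(float(value))
    except (ValueError, TypeError):
        # ValueError: non-numeric strings such as 'auto'
        # TypeError: None or other non-numeric types
        return default
```

Calling `int('409.12')` directly raises `ValueError`, which is why the intermediate `float()` matters.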
Previously, PyMuPDFScraper only extracted doc[0].page_content, causing PDFs with cover pages (common in ESG/annual reports) to fail the 100-char minimum content validation despite having thousands of characters total. This fix concatenates all pages, resolving the issue for 68% of PDFs that were incorrectly rejected. Fixes assafelovic#1600
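The effect of the change can be sketched as follows. `Page`, `join_pages`, and `MIN_CHARS` are hypothetical stand-ins for the loader's document objects and the 100-character validation mentioned above.

```python
from dataclasses import dataclass

MIN_CHARS = 100  # assumed to mirror the 100-char minimum content check


@dataclass
class Page:
    # Stand-in for a loader document exposing page_content
    page_content: str


def join_pages(docs, separator="\n"):
    """Concatenate the text of every page, skipping empty ones."""
    return separator.join(d.page_content for d in docs if d.page_content)


docs = [Page("Annual Report 2024"), Page("Body text. " * 50)]
first_page_only = docs[0].page_content  # old behaviour: cover page only
all_pages = join_pages(docs)            # new behaviour: whole document
```

With a short cover page, `first_page_only` falls below the threshold while `all_pages` passes it.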
Added instructions for setting the FireCrawl server URL when using self-hosted server.
Clarify usage notes for FireCrawl API key and server URL.
Bare `except:` catches BaseException including KeyboardInterrupt and SystemExit. Replaced 9 instances with `except Exception:`.
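A small illustration of why the replacement matters. `safe_call` is a hypothetical wrapper, not code from the repository.

```python
def safe_call(fn, fallback=None):
    """Run fn(), returning fallback on ordinary errors only."""
    try:
        return fn()
    except Exception:
        # KeyboardInterrupt and SystemExit derive from BaseException,
        # not Exception, so they still propagate to the caller.
        return fallback
```

With a bare `except:`, pressing Ctrl+C inside `fn` would be swallowed and the fallback returned instead.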
Three fixes for running NoDriverScraper in containerised / concurrent environments:

1. **browser_connection_timeout 1 → 10 s** Chrome's CDP socket takes 3-8 s to become available when running as root in a Docker container. The 1 s default causes immediate timeout even when Chrome started successfully, making the scraper silently fail in every Docker deployment.
2. **max_browsers 3 → 5 / browser_load_threshold 5 → 8** Deep research spawns several concurrent sub-researchers, each issuing scrape requests simultaneously. The previous pool limits caused unnecessary browser creation and teardown churn; raising them allows the pool to absorb concurrent deep-research workloads without thrashing.
3. **Guard against browser.get() returning None** When the CDP connection times out, browser.get() can return None instead of raising an exception. The caller then attempts to call methods on None, crashing with an AttributeError. More critically, browser.get() has already incremented processing_count before returning; without the guard the slot is never released, which eventually deadlocks the entire browser pool (no new browsers can be acquired and existing ones appear permanently busy).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
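The third fix can be sketched roughly as below. `PooledBrowser`, `release()`, and `scrape()` are hypothetical names that only model the described behaviour (a counter incremented inside `get()` and a `None` return on timeout); they are not the NoDriverScraper API.

```python
import asyncio


class PooledBrowser:
    """Toy model of a pooled browser with a busy-slot counter."""

    def __init__(self, fail=False):
        self.processing_count = 0
        self._fail = fail

    async def get(self, url):
        # The slot is taken before we know whether a page was obtained.
        self.processing_count += 1
        return None if self._fail else {"url": url}

    def release(self):
        self.processing_count -= 1


async def scrape(browser, url):
    page = await browser.get(url)
    if page is None:
        # Guard: give the slot back instead of calling methods on None.
        browser.release()
        return ""
    try:
        return f"content of {page['url']}"
    finally:
        # The normal path also releases the slot exactly once.
        browser.release()
```

Without the `None` branch, each timed-out request would leave `processing_count` one higher, which is the leak that eventually makes every browser look busy.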
Add Avian (https://avian.io) as a supported LLM provider. Avian offers an OpenAI-compatible API with cost-effective frontier models including DeepSeek-V3.2, Kimi-K2.5, GLM-5, and MiniMax-M2.5. Configuration uses AVIAN_API_KEY env var and the avian: provider prefix (e.g. SMART_LLM=avian:deepseek/deepseek-v3.2).
Update FireCrawl setup instructions for self hosted instance
…t-run-task-import fix(server): resolve multi-agent run_research_task NameError
…context-compression perf: optimize context compression with smart fast-path for small documents
…duplication perf: add URL deduplication to prevent redundant scraping
fix: replace 9 bare except clauses with except Exception
fix: nodriver scraper Docker compatibility and browser pool deadlock
feat: add Avian as LLM provider
…-all-pages fix: Read all pages in PyMuPDFScraper instead of just first page
Replace expired Discord invite code (DUmbTebB) with the current valid one (QgZXvJAccX) in the LLM documentation page. Fixes assafelovic#1474
…rd-invite docs: fix invalid Discord invite link in LLM docs
When _check_pkg installs a missing package via subprocess pip, the running Python process's import caches are stale. Add importlib.invalidate_caches() after successful installation so that the subsequent import of the newly installed package succeeds. This fixes the issue where the Docker container logs show 'installed successfully' but then fails with 'No module named tavily'. Fixes assafelovic#1625
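A rough sketch of the install-then-import pattern this commit describes. `ensure_package` and `pip_install` are illustrative names (the project's function is `_check_pkg`, whose exact shape is not shown here); the `install` parameter is an assumption added so the installer can be swapped out.

```python
import importlib
import subprocess
import sys


def pip_install(pip_name):
    """Install a package into the running interpreter's environment."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])


def ensure_package(module_name, pip_name=None, install=pip_install):
    """Import module_name, installing it first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        install(pip_name or module_name)
        # Path finders cache directory listings; without this call the
        # freshly installed package may not be visible to the import system.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

`importlib.invalidate_caches()` is the standard-library call the commit adds; everything around it here is scaffolding.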
The docs instruct users to set OPENAI_API_VERSION but embeddings.py used os.environ['AZURE_OPENAI_API_VERSION'] which raises KeyError. Now tries AZURE_OPENAI_API_VERSION first, then falls back to OPENAI_API_VERSION. Fixes assafelovic#1469
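The fallback order might look like the sketch below. The function name, the `env` parameter, and the error message are assumptions; only the two variable names and their precedence come from the commit message.

```python
import os


def azure_api_version(env=os.environ):
    """Resolve the Azure OpenAI API version from the environment."""
    # Prefer the Azure-specific variable, then the one the docs mention.
    version = env.get("AZURE_OPENAI_API_VERSION") or env.get("OPENAI_API_VERSION")
    if not version:
        raise ValueError(
            "Set AZURE_OPENAI_API_VERSION or OPENAI_API_VERSION"
        )
    return version
```

Using `.get()` rather than `os.environ[...]` avoids the bare `KeyError` and lets the error name both accepted variables.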
- Change timeout from 5s to (5, 30) tuple so the connect timeout remains 5s but the read timeout allows 30s for large PDFs.
- Add SSL verification fallback: try with verification first, on SSLError retry without verification (with a warning log).

Fixes assafelovic#1601
The generate_research_plan method only used retrievers[0] for its initial search, ignoring additional configured retrievers. Now iterates through all retrievers and aggregates results. Fixes assafelovic#1574
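In outline, the change moves from querying one retriever to looping over all of them. In this sketch retrievers are modelled as plain callables returning lists of result dicts, which is an assumption; skipping a retriever that raises is also an illustrative choice, not something the commit states.

```python
def search_all(retrievers, query):
    """Run the query against every retriever and merge the results."""
    results = []
    for retriever in retrievers:
        try:
            hits = retriever(query) or []
        except Exception:
            # Assumed policy: one failing retriever should not
            # discard results from the others.
            continue
        results.extend(hits)
    return results
```

The previous behaviour corresponds to `retrievers[0](query)` alone.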
The showdown converter has tables enabled but the CSS had no styles for table elements, causing tables to render without borders or formatting. Fixes assafelovic#1578
Add per-step cost tracking to GPTResearcher:
- New step_costs dict records costs attributed to each research phase
- _current_step is set to 'agent_selection', 'research',
'report_writing', or 'deep_research' as the workflow progresses
- New get_step_costs() method returns the cost breakdown
- add_costs() now also records cost against the current step
- Fully backward-compatible: existing cost_callback usage unchanged
Example:
costs = researcher.get_step_costs()
# {'agent_selection': 0.01, 'research': 0.15, 'report_writing': 0.45}
Fixes assafelovic#1470
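The bookkeeping described in this commit could be modelled as below. `CostTracker` and `set_step` are hypothetical; only `step_costs`, `_current_step`, `add_costs()`, and `get_step_costs()` are names taken from the commit message, and their real implementations inside GPTResearcher may differ.

```python
class CostTracker:
    """Toy model of per-step cost attribution."""

    def __init__(self):
        self.total = 0.0
        self.step_costs = {}
        self._current_step = None

    def set_step(self, step):
        # e.g. 'agent_selection', 'research', 'report_writing', 'deep_research'
        self._current_step = step

    def add_costs(self, cost):
        # The running total is kept as before; attribution is additive.
        self.total += cost
        if self._current_step is not None:
            previous = self.step_costs.get(self._current_step, 0.0)
            self.step_costs[self._current_step] = previous + cost

    def get_step_costs(self):
        # Return a copy so callers cannot mutate internal state.
        return dict(self.step_costs)
```

Because the total is still accumulated unconditionally, code that only reads the overall cost keeps working.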
Add a numeric input field to the FastAPI frontend that allows users to control the number of websites scraped per search query (1-20, default 5). The value is sent via WebSocket to the backend, threaded through to BasicReport/DetailedReport, and applied as an override to cfg.max_search_results_per_query. Subtopic researchers in detailed reports also inherit the setting. Fixes assafelovic#1504
Display sub-queries as 'pondering questions' in the research progress output, matching the feature available in the Next.js frontend. When the backend sends a 'subqueries' log message with metadata, the FastAPI frontend now renders it as a visually distinct section with pill-shaped tags for each sub-query, styled to match the app theme. Fixes assafelovic#1503
fix docs links and add ag2 pipeline diagram
fix: improve retry handling in create_chat_completion
Add support for connection headers in MCP client/server config
…_llm' variable instead of 'model_name'.
assafelovic#1673: Fixed the reference error in code.
Updated contributor image link to include a max size parameter.
Add direct support for MiniMax models (MiniMax-M2.5, MiniMax-M2.5-highspeed) via their OpenAI-compatible API. This includes both LLM chat and embedding (embo-01) support, configured through the MINIMAX_API_KEY environment variable.
- Replace deprecated duckduckgo-search with ddgs>=9.0.0
- Add python-pptx and pandas to requirements
- Add missing os import in websocket_manager
Update documentation to recommend MiniMax-M2.7 and M2.7-highspeed as the default models, while keeping M2.5 variants listed as alternatives. M2.7 offers improved reasoning capabilities over M2.5.
Two issues prevented PubMed Central from working:

1. PubMed retriever returned `url`/`raw_content` keys but the research pipeline expected `href` to collect URLs. Added `href` and `body` keys to match the expected interface.
2. The pipeline re-scraped all URLs via web scraper, discarding the full-text content already fetched by PubMed's API. PMC URLs often block web scraping, resulting in empty content. Now retrievers that provide `raw_content` (>100 chars) have their content passed through directly without re-scraping.

Fixes assafelovic#1301
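The pass-through rule in the second point can be sketched like this. `split_results` and the output dict shape are assumptions; the `href` and `raw_content` keys and the 100-character threshold come from the commit message.

```python
MIN_RAW_CONTENT_CHARS = 100


def split_results(results):
    """Separate results with usable raw_content from URLs needing a scrape."""
    ready, to_scrape = [], []
    for result in results:
        raw = result.get("raw_content") or ""
        if len(raw) > MIN_RAW_CONTENT_CHARS:
            # Full text already fetched by the retriever: keep it as-is.
            ready.append({"url": result["href"], "raw_content": raw})
        else:
            # Nothing useful yet: fall back to the web scraper.
            to_scrape.append(result["href"])
    return ready, to_scrape
```

Only the URLs in `to_scrape` would then be handed to the scraper.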
Adds XquikSearch — the first social media retriever for GPT Researcher.
Searches X (Twitter) via Xquik API for real-time perspectives, dev
discussions, product feedback, breaking news, and expert opinions.
Usage: set RETRIEVER=xquik and XQUIK_API_KEY in env.
Can combine with other retrievers: RETRIEVER=tavily,xquik,duckduckgo
Uses stdlib urllib only — zero new dependencies. Returns standard
{title, href, body} format. $0.00015 per tweet read.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some LLM providers (e.g. reasoning/thinking models like GLM-5) emit hundreds of empty-string content chunks before real content arrives. The existing `if content is not None` check passes for empty strings, so the paragraph buffer accumulates nothing and no output appears on stdout until all empty chunks are consumed.

Changes:

- Replace `if content is not None:` with `if not content: continue` to skip both None and empty-string chunks
- Add `flush=True` to the print() call in _send_output() so output appears immediately when running in pipe/non-TTY mode
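A compact sketch of the two changes together. `stream_text` and its `emit` parameter are hypothetical; the real code lives in the streaming loop and `_send_output()`.

```python
def stream_text(chunks, emit=print):
    """Forward non-empty content chunks as they arrive."""
    parts = []
    for content in chunks:
        if not content:
            # Skips both None and '' chunks.
            continue
        parts.append(content)
        # flush=True pushes output through pipes / non-TTY stdout at once.
        emit(content, end="", flush=True)
    return "".join(parts)
```

The older `if content is not None` test lets `''` through, which is the case the commit targets.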
fix potential data parsing issue in web scraping
…ontext Fix: Add context normalization for dict and string formats
…ax-provider feat: add MiniMax as a native LLM and embedding provider (M2.7)
…encies-and-imports Update dependencies and add missing os import
…-retriever-integration fix: PubMed Central retriever returns no results
…-websocket-manager Fix missing os import in websocket manager
Bug fix/pdf report
…mpty-chunks-flush fix: skip empty LLM streaming chunks and flush stdout
…ever feat: add Xquik X/Twitter search retriever
Clarified the description of GPT Researcher for better understanding.
Adds an RFC proposal for an optional invisible_firefox scraper backend parallel to firecrawl, browser, web_base_loader, tavily_extract. Opened as draft to check interest before building the backend module.
Author
@assafelovic any interest in an optional stealth scraper backend for sources behind anti-bot? it's opt-in, no change to defaults. happy to close otherwise.
opening as draft to check interest before building the backend.

would an optional `invisible_firefox` scraper backend be in scope, parallel to firecrawl / browser / web_base_loader / tavily_extract under `gpt_researcher/scraper/`?

motivation: research-relevant pages behind Cloudflare/Akamai/Datadome/hCaptcha currently return empty content or 403. relevant open issues: #1685, #1081, #1602, #1404.

the backend would wrap feder-cr/invisible_playwright, which drives a patched Firefox 150 (feder-cr/invisible_firefox, MPL-2, same license as Firefox upstream, patches at the C++ source level so there are no JS shims to detect). selected via `SCRAPER=invisible_firefox`, optional dependency, no change to defaults.

this PR only adds an RFC stub in docs/docs/proposals/ so the proposal has somewhere concrete to land. tracking discussion: #1785

if the answer is "not in scope" i'll close it without noise.
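for illustration only, a rough sketch of what "opt-in, optional dependency, no change to defaults" could mean at selection time. everything here is an assumption: `resolve_scraper`, the `OPTIONAL_BACKENDS` table, the import name `invisible_playwright`, and the `"bs"` default are hypothetical and not taken from the proposal or the existing codebase.

```python
import importlib.util
import os

# Hypothetical mapping: scraper name -> module that must be importable.
OPTIONAL_BACKENDS = {"invisible_firefox": "invisible_playwright"}


def resolve_scraper(env=os.environ, default="bs", optional_backends=OPTIONAL_BACKENDS):
    """Pick the scraper name from SCRAPER, checking optional dependencies."""
    name = env.get("SCRAPER", default)
    required = optional_backends.get(name)
    if required and importlib.util.find_spec(required) is None:
        # Fail early with a clear message rather than at first scrape.
        raise RuntimeError(
            f"SCRAPER={name} requires the optional package '{required}'"
        )
    return name
```

with `SCRAPER` unset the default is returned untouched, so existing deployments would see no behaviour change.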