A curated list of web scraping tools, frameworks, libraries, and APIs for 2026. Maintained weekly.
⭐ Star this repo to keep it in your bookmarks — new tools added every week.
📖 Need a custom scraper? Get a production-grade scraper built in 48 hours — $250 flat rate. Get a free quote →
🚀 Skip the scraping — I've built 78+ ready-made scrapers for Reddit, HN, Google, LinkedIn, Amazon, and more. Browse Apify actors → | Need something custom? Email spinov001@gmail.com
- Frameworks & Libraries
- Browser Automation
- Headless Browsers
- Anti-Detection & Stealth
- Proxy Services
- CAPTCHA Solving
- Cloud Scraping Platforms
- AI-Powered Scraping
- E-Commerce & Price Monitoring
- Free APIs (No Scraping Needed)
- Pre-Built Scrapers (Apify Store)
- Job Boards & Company Data
- Government & Public Data
- Data Parsing & Extraction
- Anti-Bot Detection
- Scraping Infrastructure
- Legal & Ethics
- Tutorials & Articles
- Related Awesome Lists
| Need | Best Tool | Why |
|---|---|---|
| Simple HTML parsing | BeautifulSoup | Easiest API, handles broken HTML |
| Large-scale crawling | Scrapy | Built-in queuing, middlewares, pipelines |
| JavaScript-rendered pages | Playwright | Best browser automation, anti-detection |
| Full scraping framework (JS) | Crawlee | Handles browser + HTTP, auto-scaling |
| Speed over everything | spider (Rust) | 20-100x faster than Python alternatives |
| No-code scraping | Apify or Portia | Visual tools, no programming needed |
| LLM-ready data | Firecrawl or Crawl4AI | Output as markdown for AI pipelines |
| Avoid scraping entirely | Free APIs | Structured JSON, no parsing, no breakage |
| Feature | Scrapy | BeautifulSoup | Requests-HTML | Crawlee (Python) |
|---|---|---|---|---|
| Async | ✅ Twisted | ❌ | ✅ | ✅ asyncio |
| JS Rendering | Plugin | ❌ | ✅ built-in | ✅ Playwright |
| Rate Limiting | ✅ built-in | Manual | Manual | ✅ built-in |
| Export (JSON/CSV) | ✅ built-in | Manual | Manual | ✅ built-in |
| Learning Curve | Medium | Low | Low | Medium |
| Best For | Production crawlers | Quick scripts | Simple pages + JS | Modern async scraping |
| Feature | Playwright | Puppeteer | Selenium |
|---|---|---|---|
| Languages | Python, JS, Java, C# | JS only | All major |
| Browsers | Chromium, Firefox, WebKit | Chrome, Firefox | All |
| Speed | Fast | Fast | Slower |
| Anti-Detection | Best | Good (with stealth) | Poor |
| Mobile Testing | ✅ | Limited | ✅ |
| Auto-Wait | ✅ | Manual | Manual |
| Community | Growing fast | Large | Largest |
| Best For | Modern scraping | Chrome-only projects | Legacy systems |
| Tool | Stars | Description |
|---|---|---|
| Scrapy | 53k+ | The most popular Python scraping framework. Async, middlewares, pipelines, built-in export. |
| BeautifulSoup | — | HTML/XML parser. Simple API, forgiving of bad markup. Use with requests. |
| Requests-HTML | 13k+ | Pythonic HTML parsing with JS rendering support via Chromium. |
| httpx | 13k+ | Modern async HTTP client. HTTP/2 support, better than requests for scraping. |
| Parsel | 1k+ | CSS + XPath selector library extracted from Scrapy. |
| MechanicalSoup | 4k+ | Stateful web browsing (form submission, cookies) — like a human clicking. |
| Grab | 2k+ | Web scraping framework. Network requests, DOM parsing, spider. |
| Selectolax | 1k+ | Fast HTML5 parser with CSS selectors, built on the Modest and Lexbor C engines. |
| gazpacho | 700+ | Simple, modern web scraping. Minimal API surface. |
| Crawlee (Python) | 5k+ | Apify's scraping framework for Python. BeautifulSoup + Playwright crawlers. |
| curl_cffi | 3k+ | Python bindings for curl-impersonate. TLS fingerprint impersonation. |
| botasaurus | 4k+ | All-in-one scraping framework: browser, anti-detect, caching, parallel. |
| Playwright for Python | 12k+ | Official Playwright Python bindings. Cross-browser automation. |
| aiohttp | 15k+ | Async HTTP client/server. Great for high-concurrency scraping. |
| Scrapling | 20k+ | Adaptive parsing — auto-relocates elements after page updates. 10x faster JSON. |
| Tool | Stars | Description |
|---|---|---|
| Crawlee | 15k+ | Full-featured scraping framework by Apify. Cheerio, Playwright, Puppeteer crawlers. |
| Cheerio | 28k+ | Fast jQuery-like HTML parser for Node.js. No browser needed. |
| node-crawler | 7k+ | HTTP crawler with jQuery-style selectors, rate limiting, retries. |
| x-ray | 6k+ | Declarative web scraping with schema definitions. |
| Apify SDK | 4k+ | Toolkit for building Apify actors — storage, proxies, queue. |
| got-scraping | 600+ | HTTP client with anti-fingerprinting. Built-in header generation. |
| Axios | 106k+ | Promise-based HTTP client. Great for API-based scraping. |
| Tool | Stars | Description |
|---|---|---|
| Colly | 23k+ | Fast and elegant scraping framework for Go. |
| goquery | 14k+ | jQuery-like HTML selector in Go. |
| Ferret | 6k+ | Declarative web scraping with FQL query language. |
| Geziyor | 2k+ | Fast web scraping with concurrent requests and caching. |
| chromedp | 11k+ | Chrome DevTools Protocol client for Go. Headless browser control. |
| Tool | Stars | Description |
|---|---|---|
| Nokogiri | 6k+ | HTML/XML parser, industry standard for Ruby. |
| Mechanize | 4k+ | Automated web interaction (clicks, forms, cookies). |
| Kimurai | 1k+ | Modern Ruby web scraping framework. |
| Tool | Stars | Description |
|---|---|---|
| spider | 3k+ | Fastest web crawler. Written in Rust, 20-100x faster. |
| reqwest | 10k+ | Ergonomic HTTP client for Rust with async support. |
| scraper | 2k+ | CSS selector-based HTML parser for Rust. |
| Tool | Stars | Description |
|---|---|---|
| Goutte | 9k+ | Screen scraping and web crawling library for PHP. Deprecated in favor of Symfony's HttpBrowser. |
| Roach | 2k+ | Scrapy-inspired web scraping for PHP. |
| Panther | 3k+ | Browser testing and scraping with real browsers in PHP. |
| Tool | Stars | Description |
|---|---|---|
| Playwright | 68k+ | Cross-browser automation by Microsoft. Chromium, Firefox, WebKit. Best anti-detection. |
| Puppeteer | 89k+ | Chrome automation by Google. Mature ecosystem. |
| Selenium | 31k+ | The OG browser automation. Supports all browsers. |
| Cypress | 47k+ | Testing-focused but works for scraping interactive SPAs. |
| Rod | 5k+ | Playwright/Puppeteer alternative for Go. DevTools Protocol. |
| Splash | 4k+ | Lightweight browser as a service. JS rendering via HTTP API. |
| Tool | Description |
|---|---|
| Browserless | Chrome as a service. Docker-ready. Free self-hosted. |
| chrome-headless-shell | Official Google headless Chrome. Smallest footprint. |
| Playwright Docker | Official Playwright Docker images with all browsers. |
| Tool | Stars | Description |
|---|---|---|
| undetected-chromedriver | 10k+ | Patched ChromeDriver that passes bot detection. |
| puppeteer-extra-stealth | 12k+ | Plugin bundle to evade detection (WebGL, navigator, etc.) |
| curl-impersonate | 13k+ | curl that impersonates Chrome/Firefox TLS fingerprint. |
| Camoufox | 5k+ | Anti-detect Firefox browser for scraping. |
| playwright-stealth | 1k+ | Stealth plugin for Playwright Python. Evade fingerprinting. |
| nodriver | 3k+ | Next-gen undetected browser automation. Successor to undetected-chromedriver. |
| Rebrowser | 1k+ | Patches for Playwright/Puppeteer to fix automation leaks. |
| Service | Free Tier | Description |
|---|---|---|
| Bright Data | Trial | 72M+ residential IPs. Enterprise grade. |
| Oxylabs | Trial | Residential and datacenter proxies. |
| ScraperAPI | 1000 free | API that handles proxies and CAPTCHAs. |
| Smartproxy | Trial | 65M+ residential proxies. |
| IPRoyal | — | Budget residential proxies from $1.75/GB. |
| Proxy-Seller | — | Datacenter & residential proxies in 220+ countries. IPv4/IPv6, SOCKS5. Use code SPINOV15 for 15% off. |
| Service | Price | Description |
|---|---|---|
| 2Captcha | $1-3/1000 | Human-powered CAPTCHA solving API. |
| Anti-Captcha | $1-2/1000 | reCAPTCHA, hCaptcha, image CAPTCHA. |
| CapSolver | $0.8/1000 | AI-powered CAPTCHA solving. |
| Platform | Free Tier | Description |
|---|---|---|
| Apify | $5/mo free | Run scrapers in cloud. 2000+ pre-built actors. Proxies included. |
| ScrapingBee | 1000 free | API: send URL, get HTML. JS rendering, proxies. |
| Firecrawl | 500 free | Turn websites into LLM-ready markdown. Great for AI. |
| Crawl4AI | Open source | LLM-friendly web crawler. Markdown extraction. |
| ScrapeGraphAI | Open source | AI-powered scraping — just describe what you want. |
| Browserbase | Free tier | Headless browser infrastructure. API-based. |
| Zyte (Scrapy Cloud) | Free tier | Cloud-based Scrapy deployment + smart proxy. By Scrapy creators. |
| Agenty | Free tier | No-code cloud scraping. Point-and-click extractors. |
Tools that use LLMs to extract data — describe what you want, get structured output:
| Tool | Stars | Description |
|---|---|---|
| ScrapeGraphAI | 18k+ | Describe extraction in plain English. Uses LLMs to parse HTML. |
| Crawl4AI | 50k+ | LLM-friendly crawler. Outputs clean markdown. Async, fast. |
| Firecrawl | 70k+ | Turn any website into LLM-ready markdown. API + self-hosted. |
| Jina Reader | 8k+ | Convert URLs to LLM-friendly text. Free API: r.jina.ai/URL. |
| Scrapfly | — | Web scraping API with AI extraction, anti-bot bypass. |
| Browserless | 8k+ | Chrome as a service. Great for LLM agent workflows. |
The trend: In 2026, more developers use LLMs to extract data instead of writing CSS selectors. These tools bridge the gap.
| Tool | Target | Description |
|---|---|---|
| Amazon Product API | Amazon | Official Product Advertising API. Requires affiliate account. |
| Keepa | Amazon | Price history tracking. API available ($20/mo). |
| CamelCamelCamel | Amazon | Free price tracker, browser extension. |
| PriceAPI | Multi | Product data from 1000+ retailers. Enterprise. |
| Diffbot | Any | AI-powered product extraction. Free tier. |
| Amazon Scraper (Apify) | Amazon | 750K+ users. Product data, reviews, prices. |
| Walmart Scraper (Apify) | Walmart | Products, prices, reviews. |
Tip: For price monitoring, combine scraping with cron jobs (GitHub Actions = free) and alert via email/Slack when prices change.
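
The tip above can be sketched as a small change detector that a scheduled job (cron or GitHub Actions) runs once per interval. This is a minimal sketch, not a full monitor: the function names, the JSON snapshot file, and the `min_delta` threshold are all hypothetical, and the scraping and alerting steps are left to whatever tools you already use.

```python
import json
from pathlib import Path


def detect_price_changes(previous, current, min_delta=0.01):
    """Return {product_id: (old_price, new_price)} for prices that moved by at least min_delta."""
    changes = {}
    for product_id, new_price in current.items():
        old_price = previous.get(product_id)
        # Products seen for the first time have no old price, so they are not reported.
        if old_price is not None and abs(new_price - old_price) >= min_delta:
            changes[product_id] = (old_price, new_price)
    return changes


def run_check(state_path, current):
    """Compare current prices with the last saved snapshot, then save the new snapshot."""
    path = Path(state_path)
    previous = json.loads(path.read_text()) if path.exists() else {}
    changes = detect_price_changes(previous, current)
    # Overwrite the snapshot so the next scheduled run compares against today's prices.
    path.write_text(json.dumps(current, indent=2))
    return changes
```

A scheduled run would call `run_check("prices.json", scraped_prices)` and send an email or Slack message only when the returned dict is non-empty.
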
- IP-API — IP geolocation (country, city, ISP) — no key needed
- Open-Meteo — Weather forecasts and historical data — no key needed
- ExchangeRate-API — Currency conversion rates for 160+ currencies — no key needed
Why scrape when a public API already serves the data? These require no API key:
| API | Data | Rate Limit |
|---|---|---|
| Reddit JSON | Posts, comments, subreddits | ~60/min |
| Hacker News | Stories, comments, users | ~1/sec |
| YouTube Innertube | Comments, transcripts, channels | No hard limit |
| Wikipedia | Articles, summaries, media | 200/sec |
| arXiv | 2M+ research papers | 1/3sec |
| npm Registry | Package metadata | No hard limit |
| PyPI JSON | Python package info | No hard limit |
| GitHub REST | Repos, users, issues | 60/hr unauth |
| Open-Meteo | Weather forecasts | Unlimited |
| CoinGecko | Crypto prices | 30/min |
| Crossref | 150M+ academic papers | 50/sec |
| RDAP | Domain WHOIS data | Varies |
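
The rate limits in the table are easiest to respect with a small throttle around every request. The sketch below is one possible approach using only the standard library. `ThrottledClient`, `hn_item_url`, and the `example-client/0.1` User-Agent are hypothetical names chosen for illustration; the Hacker News item URL follows the public Firebase endpoint pattern.

```python
import json
import time
import urllib.request


class ThrottledClient:
    """Fetch JSON while keeping at least `min_interval` seconds between requests."""

    def __init__(self, min_interval, fetch=None, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # fetch, clock and sleep are injectable so the throttle can be tested offline.
        self._fetch = fetch or self._http_get
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    @staticmethod
    def _http_get(url):
        # Hypothetical User-Agent; replace it with one that identifies your project.
        request = urllib.request.Request(url, headers={"User-Agent": "example-client/0.1"})
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8")

    def get_json(self, url):
        if self._last_request is not None:
            wait = self.min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        return json.loads(self._fetch(url))


def hn_item_url(item_id):
    """Build the Hacker News Firebase URL for a single story or comment."""
    return f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
```

For the roughly 1 request per second listed for Hacker News, `ThrottledClient(1.0).get_json(hn_item_url(8863))` would fetch one item and delay the next call as needed.
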
Ready-to-use scrapers — no code required. Run on Apify free tier.
| Scraper | Method | Data |
|---|---|---|
| Reddit Scraper | JSON API | Posts, comments, scores |
| YouTube Comments | Innertube | Comments without API key |
| YouTube Transcript | Captions XML | Subtitles and captions |
| Hacker News | Firebase | Stories and comments |
| Trustpilot Reviews | JSON-LD | Reviews via structured data |
| Google News | RSS | 15 languages |
| SEO Audit | Multi | 50+ on-page factors |
| Email Extractor | HTML | Emails, phones, socials |
| Tech Stack Detector | Headers+JS | 80+ technologies |
| Bluesky Scraper | AT Protocol | Profiles and posts |
| Tool | Target | Description |
|---|---|---|
| LinkedIn Scraper (Apify) | LinkedIn | Profiles, companies, jobs. Requires login. |
| Indeed Scraper (Apify) | Indeed | Job listings, salary data, company reviews. |
| Glassdoor Scraper (Apify) | Glassdoor | Reviews, salaries, interviews. |
| Google Maps Scraper (Apify) | Google Maps | Business data, reviews, phone, hours. 500K+ users. |
| Crunchbase API | Crunchbase | Startup data, funding, investors. Paid. |
| Hunter.io | Any domain | Find email addresses. 25 free/mo. |
| Apollo.io | Any company | Contact data, org charts. Free tier. |
| Source | Data | Access |
|---|---|---|
| data.gov | US government datasets | Free API + bulk download |
| EU Open Data | EU datasets | Free API |
| SEC EDGAR | Company filings | Free API |
| USPTO | Patent data | Free API |
| OpenStreetMap | Geographic data | Free API |
| World Bank | Economic indicators | Free API |
| FRED | Economic data | Free API key |
| Tool | Stars | Description |
|---|---|---|
| lxml | 2k+ | Fast, mature XML/HTML parser for Python, built on libxml2 and libxslt. XPath + XSLT. |
| Readability | 8k+ | Firefox's reader mode as a library. Extract article content. |
| Trafilatura | 3k+ | Extract main text from web pages. Removes boilerplate. |
| newspaper3k | 14k+ | Article scraping and NLP. Titles, authors, text, images. |
| extruct | 800+ | Extract JSON-LD, Microdata, OpenGraph from HTML. |
| markdownify | 1k+ | Convert HTML to Markdown. Great for LLM pipelines. |
| html2text | 2k+ | Convert HTML to clean Markdown. Handles complex layouts. |
| jusText | 500+ | Remove boilerplate from HTML. Extract just article text. |
| dateparser | 2k+ | Parse dates in any format/language. Essential for scraping. |
| price-parser | 300+ | Extract price and currency from any string. By Zyte. |
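
Many product and review pages embed structured data as JSON-LD, which is often easier to read than the surrounding markup. The sketch below shows the idea with only the standard library. It is not how extruct works internally, and `JsonLdExtractor` and `extract_json_ld` are hypothetical names used for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page.


def extract_json_ld(html):
    """Return a list of decoded JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

For production use, a dedicated library such as extruct also covers Microdata and OpenGraph, which this sketch ignores.
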
Tools to test your scraper against detection (for authorized testing only):
| Tool | Description |
|---|---|
| CreepJS | Browser fingerprint test — see what sites see about you. |
| Fingerprint.com | Browser fingerprinting service. |
| BotD | Open-source bot detection library from Fingerprint (2k+ stars). |
| Sannysoft Test | Check what automation signals your browser leaks. |
| Incolumitas Bot Test | Advanced bot detection test — TLS, JS, canvas fingerprint. |
| Tool | Stars | Description |
|---|---|---|
| Scrapyd | 3k+ | Deploy and run Scrapy spiders as a service. |
| Gerapy | 3k+ | Distributed Scrapy management with Django UI. |
| Portia | 9k+ | Visual scraping tool — point and click, no code. |
| Scrapy-Redis | 5k+ | Distributed Scrapy with Redis. Scale to millions of pages. |
| Frontera | 1k+ | Large-scale web crawling frontier. URL management and scheduling. |
| Scrapy-Splash | 2k+ | Scrapy + Splash integration for JS rendering in pipelines. |
| Scrapy-Playwright | 1k+ | Playwright integration for Scrapy. Modern JS rendering. |
Before scraping, know the rules:
| Topic | Key Points |
|---|---|
| robots.txt | Always check. Respect Disallow directives. Not legally binding but shows good faith. |
| Rate Limiting | Never DDoS. Add delays between requests. 1 req/sec is a safe default. |
| Terms of Service | Some sites explicitly prohibit scraping. Violating ToS can have legal consequences. |
| Personal Data (GDPR) | Scraping personal data in the EU requires a lawful basis. Be careful with names, emails, etc. |
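| CFAA (US) | The Computer Fraud and Abuse Act can apply. Key case: hiQ v. LinkedIn, where the Ninth Circuit held that scraping publicly accessible data likely does not violate the CFAA. Other claims, such as breach of contract, can still apply. |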
| Copyright | Scraped content may be copyrighted. Extraction is usually OK; republishing is not. |
| API Terms | Even free APIs have terms. Read them — especially about commercial use. |
Rule of thumb: If the data is publicly available, not behind a login, and you respect rate limits — you're probably fine. When in doubt, use the official API.
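
The robots.txt and rate-limiting rows above can be checked in code before a crawl starts. This is a minimal sketch using Python's standard `urllib.robotparser`. The `build_robots_checker` helper and the `example-bot` user agent are hypothetical, and the 1-second fallback mirrors the safe default suggested in the table.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="example-bot"):
    """Return (allowed, delay): a URL predicate and the delay in seconds to use between requests."""
    parser = RobotFileParser()
    # Parse robots.txt text that has already been downloaded, so no network call happens here.
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    # crawl_delay() returns None when robots.txt sets no Crawl-delay for this agent;
    # in that case fall back to 1 request per second.
    delay = parser.crawl_delay(user_agent) or 1.0
    return allowed, float(delay)
```

A crawler would skip any URL where `allowed(url)` is false and call `time.sleep(delay)` between requests to the same host.
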
Resources:
- Web Scraping Cheatsheet 2026
- Scrapy Documentation
- Playwright Web Scraping Guide
- Crawlee — Build Reliable Crawlers
- Real Python: Web Scraping with BeautifulSoup
- ScrapingBee: Web Scraping Guide
- GitHub Actions for Scheduled Scraping — Run scrapers for free on a schedule
- Docker for Web Scraping — Containerize your scrapers for consistency
- SQLite for Scraped Data — Lightweight storage for scraped datasets
- Google Dorking Cheatsheet — Advanced search operators for research
- Hetzner Cloud — Affordable servers for running scrapers at scale
- Neon Serverless Postgres — Free tier database for storing scraped data
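
As a rough illustration of the SQLite option above, the sketch below stores scraped items keyed by URL so that re-running a scraper updates rows instead of duplicating them. The `items` table, its columns, and both function names are hypothetical, and the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with a single table for scraped items."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               url TEXT PRIMARY KEY,
               title TEXT,
               price REAL,
               scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_items(conn, items):
    """Insert new items and refresh existing ones, matching rows by URL."""
    with conn:  # Commits on success, rolls back if any row fails.
        conn.executemany(
            """INSERT INTO items (url, title, price) VALUES (:url, :title, :price)
               ON CONFLICT(url) DO UPDATE SET
                   title = excluded.title,
                   price = excluded.price,
                   scraped_at = CURRENT_TIMESTAMP""",
            items,
        )
```

Each item is a dict with `url`, `title`, and `price` keys; pass a file path to `open_store` to keep data between runs.
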
- awesome-web-scraping — The original awesome web scraping list
- awesome-crawler — Web crawler tools by language
- awesome-free-apis-2026 — 300+ free APIs, no key needed
- awesome-data-engineering-2026 — 150+ data engineering tools
- awesome-mcp-servers-2026 — MCP servers for AI agents
- ai-market-research-reports — 506 AI-generated market research reports (1,600+ clones)
- sqlite-vector-search-tutorial — Semantic search with SQLite + vectors (no server needed)
- openalex-python-tutorial — Search 250M+ academic papers via API (no key needed)
python-web-scraping-starter — Clone → install → scrape in 5 minutes. API-first with Playwright fallback.
I've built 78+ production scrapers. I can extract data from any website — e-commerce, real estate, job boards, social media — with anti-detection, proxy rotation, and structured JSON/CSV output.
What you get: Working scraper in 24-48h, hosted on Apify (free tier available), with monitoring and auto-retry.
📧 Spinov001@gmail.com — describe your data need, get a free quote within 2 hours. First 3 clients this month get priority delivery.
💳 Pay securely via Payoneer → — custom scraper $250 flat rate. Delivered in 48 hours, no hourly surprises.
🔧 Browse 78+ ready-made scrapers → — Reddit, HN, Google, Amazon, and more. Deploy in 1 click, no coding required.
MIT