spinov001-art/awesome-web-scraping-2026

Awesome Web Scraping 2026

A curated list of web scraping tools, frameworks, libraries, and APIs for 2026. Maintained weekly.

⭐ Star this repo to keep it in your bookmarks — new tools added every week.

📖 Need a custom scraper? Get a production-grade scraper built in 48 hours — $250 flat rate. Get a free quote →


🚀 Skip the scraping — I've built 78+ ready-made scrapers for Reddit, HN, Google, LinkedIn, Amazon, and more. Browse Apify actors → | Need something custom? Email spinov001@gmail.com


Contents


💡 Need data from ANY website? I build custom scrapers and data pipelines — fast, reliable, anti-detection built in. Get a quote → or check out my ready-made scrapers on Apify Store.


Quick Comparison: Which Tool Should You Use?

| Need | Best Tool | Why |
|---|---|---|
| Simple HTML parsing | BeautifulSoup | Easiest API, handles broken HTML |
| Large-scale crawling | Scrapy | Built-in queuing, middlewares, pipelines |
| JavaScript-rendered pages | Playwright | Best browser automation, anti-detection |
| Full scraping framework (JS) | Crawlee | Handles browser + HTTP, auto-scaling |
| Speed over everything | spider (Rust) | 20-100x faster than Python alternatives |
| No-code scraping | Apify or Portia | Visual tools, no programming needed |
| LLM-ready data | Firecrawl or Crawl4AI | Output as markdown for AI pipelines |
| Avoid scraping entirely | Free APIs | Structured JSON, no parsing, no breakage |

Python Framework Comparison

| Feature | Scrapy | BeautifulSoup | Requests-HTML | Crawlee (Python) |
|---|---|---|---|---|
| Async | ✅ Twisted | | | ✅ asyncio |
| JS Rendering | Plugin | | ✅ built-in | ✅ Playwright |
| Rate Limiting | ✅ built-in | Manual | Manual | ✅ built-in |
| Export (JSON/CSV) | ✅ built-in | Manual | Manual | ✅ built-in |
| Learning Curve | Medium | Low | Low | Medium |
| Best For | Production crawlers | Quick scripts | Simple pages + JS | Modern async scraping |
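
Where the table says export is "Manual", the work usually amounts to a few lines of standard-library code. The following is a minimal sketch, assuming the scraper has already produced a list of dictionaries; the helper names `to_json` and `to_csv` and the sample rows are made up for illustration and are not part of any library listed here.

```python
import csv
import io
import json


def to_json(rows):
    """Serialize scraped rows to a JSON string."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def to_csv(rows):
    """Serialize scraped rows to CSV text.

    Columns are the union of all keys, in first-seen order.
    Missing values are written as empty cells.
    """
    if not rows:
        return ""
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped records; the second has an extra field.
rows = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget", "price": "19.50", "url": "https://example.com/gadget"},
]

print(to_csv(rows))
print(to_json(rows))
```

Writing to a string buffer keeps the sketch easy to test; in a real script the same writer can target an open file instead.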

Browser Automation Comparison

Feature Playwright Puppeteer Selenium
Languages Python, JS, Java, C# JS only All major
Browsers Chromium, Firefox, WebKit Chrome only All
Speed Fast Fast Slower
Anti-Detection Best Good (with stealth) Poor
Mobile Testing Limited
Auto-Wait Manual Manual
Community Growing fast Large Largest
Best For Modern scraping Chrome-only projects Legacy systems

Frameworks & Libraries

Python

| Tool | Stars | Description |
|---|---|---|
| Scrapy | 53k+ | The most popular Python scraping framework. Async, middlewares, pipelines, built-in export. |
| BeautifulSoup | | HTML/XML parser. Simple API, forgiving of bad markup. Use with requests. |
| Requests-HTML | 13k+ | Pythonic HTML parsing with JS rendering support via Chromium. |
| httpx | 13k+ | Modern async HTTP client. HTTP/2 support, better than requests for scraping. |
| Parsel | 1k+ | CSS + XPath selector library extracted from Scrapy. |
| MechanicalSoup | 4k+ | Stateful web browsing (form submission, cookies), like a human clicking. |
| Grab | 2k+ | Web scraping framework. Network requests, DOM parsing, spider. |
| Selectolax | 1k+ | Fast HTML parser with CSS selectors, built on C parsing engines. Benchmarks faster than lxml and BeautifulSoup. |
| gazpacho | 700+ | Simple, modern web scraping. Minimal API surface. |
| Crawlee (Python) | 5k+ | Apify's scraping framework for Python. BeautifulSoup + Playwright crawlers. |
| curl_cffi | 3k+ | Python bindings for curl-impersonate. TLS fingerprint impersonation. |
| botasaurus | 4k+ | All-in-one scraping framework: browser, anti-detect, caching, parallel. |
| Playwright for Python | 12k+ | Official Playwright Python bindings. Cross-browser automation. |
| aiohttp | 15k+ | Async HTTP client/server. Great for high-concurrency scraping. |
| Scrapling | 20k+ | Adaptive parsing: auto-relocates elements after page updates. Claims 10x faster JSON serialization. |

JavaScript / TypeScript

| Tool | Stars | Description |
|---|---|---|
| Crawlee | 15k+ | Full-featured scraping framework by Apify. Cheerio, Playwright, Puppeteer crawlers. |
| Cheerio | 28k+ | Fast jQuery-like HTML parser for Node.js. No browser needed. |
| node-crawler | 7k+ | HTTP crawler with jQuery-style selectors, rate limiting, retries. |
| x-ray | 6k+ | Declarative web scraping with schema definitions. |
| Apify SDK | 4k+ | Toolkit for building Apify actors: storage, proxies, queue. |
| got-scraping | 600+ | HTTP client with anti-fingerprinting. Built-in header generation. |
| Axios | 106k+ | Promise-based HTTP client. Great for API-based scraping. |

Go

| Tool | Stars | Description |
|---|---|---|
| Colly | 23k+ | Fast and elegant scraping framework for Go. |
| goquery | 14k+ | jQuery-like HTML selector in Go. |
| Ferret | 6k+ | Declarative web scraping with FQL query language. |
| Geziyor | 2k+ | Fast web scraping with concurrent requests and caching. |
| chromedp | 11k+ | Chrome DevTools Protocol client for Go. Headless browser control. |

Ruby

| Tool | Stars | Description |
|---|---|---|
| Nokogiri | 6k+ | HTML/XML parser, industry standard for Ruby. |
| Mechanize | 4k+ | Automated web interaction (clicks, forms, cookies). |
| Kimurai | 1k+ | Modern Ruby web scraping framework. |

Rust

| Tool | Stars | Description |
|---|---|---|
| spider | 3k+ | Web crawler written in Rust. Claims to be 20-100x faster than Python alternatives. |
| reqwest | 10k+ | Ergonomic HTTP client for Rust with async support. |
| scraper | 2k+ | CSS selector-based HTML parser for Rust. |

PHP

| Tool | Stars | Description |
|---|---|---|
| Goutte | 9k+ | Screen scraping and web crawling library for PHP. |
| Roach | 2k+ | Scrapy-inspired web scraping for PHP. |
| Panther | 3k+ | Browser testing and scraping with real browsers in PHP. |

Browser Automation

| Tool | Stars | Description |
|---|---|---|
| Playwright | 68k+ | Cross-browser automation by Microsoft. Chromium, Firefox, WebKit. Best anti-detection. |
| Puppeteer | 89k+ | Chrome automation by Google. Mature ecosystem. |
| Selenium | 31k+ | The OG browser automation. Supports all browsers. |
| Cypress | 47k+ | Testing-focused but works for scraping interactive SPAs. |
| Rod | 5k+ | Playwright/Puppeteer alternative for Go. DevTools Protocol. |
| Splash | 4k+ | Lightweight browser as a service. JS rendering via HTTP API. |

Headless Browsers

| Tool | Description |
|---|---|
| Browserless | Chrome as a service. Docker-ready. Free self-hosted. |
| chrome-headless-shell | Official Google headless Chrome. Smallest footprint. |
| Playwright Docker | Official Playwright Docker images with all browsers. |

Anti-Detection & Stealth

| Tool | Stars | Description |
|---|---|---|
| undetected-chromedriver | 10k+ | Patched ChromeDriver that passes bot detection. |
| puppeteer-extra-stealth | 12k+ | Plugin bundle to evade detection (WebGL, navigator, etc.). |
| curl-impersonate | 13k+ | curl that impersonates Chrome/Firefox TLS fingerprint. |
| Camoufox | 5k+ | Anti-detect Firefox browser for scraping. |
| playwright-stealth | 1k+ | Stealth plugin for Playwright Python. Evade fingerprinting. |
| nodriver | 3k+ | Next-gen undetected browser automation. Successor to undetected-chromedriver. |
| Rebrowser | 1k+ | Patches for Playwright/Puppeteer to fix automation leaks. |

Proxy Services

| Service | Free Tier | Description |
|---|---|---|
| Bright Data | Trial | 72M+ residential IPs. Enterprise grade. |
| Oxylabs | Trial | Residential and datacenter proxies. |
| ScraperAPI | 1000 free | API that handles proxies and CAPTCHAs. |
| Smartproxy | Trial | 65M+ residential proxies. |
| IPRoyal | | Budget residential proxies from $1.75/GB. |
| Proxy-Seller | | Datacenter & residential proxies in 220+ countries. IPv4/IPv6, SOCKS5. Use code SPINOV15 for 15% off. |

CAPTCHA Solving

| Service | Price (per 1,000 solves) | Description |
|---|---|---|
| 2Captcha | $1-3 | Human-powered CAPTCHA solving API. |
| Anti-Captcha | $1-2 | reCAPTCHA, hCaptcha, image CAPTCHA. |
| CapSolver | $0.8 | AI-powered CAPTCHA solving. |

Cloud Scraping Platforms

| Platform | Free Tier | Description |
|---|---|---|
| Apify | $5/mo free | Run scrapers in cloud. 2000+ pre-built actors. Proxies included. |
| ScrapingBee | 1000 free | API: send URL, get HTML. JS rendering, proxies. |
| Firecrawl | 500 free | Turn websites into LLM-ready markdown. Great for AI. |
| Crawl4AI | Open source | LLM-friendly web crawler. Markdown extraction. |
| ScrapeGraphAI | Open source | AI-powered scraping: just describe what you want. |
| Browserbase | Free tier | Headless browser infrastructure. API-based. |
| Zyte (Scrapy Cloud) | Free tier | Cloud-based Scrapy deployment + smart proxy. By Scrapy creators. |
| Agenty | Free tier | No-code cloud scraping. Point-and-click extractors. |

AI-Powered Scraping (2026 Trend)

Tools that use LLMs to extract data — describe what you want, get structured output:

| Tool | Stars | Description |
|---|---|---|
| ScrapeGraphAI | 18k+ | Describe extraction in plain English. Uses LLMs to parse HTML. |
| Crawl4AI | 50k+ | LLM-friendly crawler. Outputs clean markdown. Async, fast. |
| Firecrawl | 70k+ | Turn any website into LLM-ready markdown. API + self-hosted. |
| Jina Reader | 8k+ | Convert URLs to LLM-friendly text. Free API: r.jina.ai/URL. |
| Scrapfly | | Web scraping API with AI extraction, anti-bot bypass. |
| Browserless | 8k+ | Chrome as a service. Great for LLM agent workflows. |

The trend: In 2026, more developers use LLMs to extract data instead of writing CSS selectors. These tools bridge the gap.
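
As an illustration of the `r.jina.ai/URL` pattern from the Jina Reader row, here is a minimal standard-library sketch. The function names `reader_url` and `fetch_markdown` are hypothetical, and the sketch assumes the service is reached over HTTPS with the target URL appended as-is; check the service's own documentation for headers, keys, and rate limits before relying on it.

```python
import urllib.request
from urllib.parse import urlsplit

READER_PREFIX = "https://r.jina.ai/"  # pattern from the table: r.jina.ai/URL


def reader_url(target):
    """Build the reader URL for a target page (no network access)."""
    if urlsplit(target).scheme not in ("http", "https"):
        raise ValueError(f"expected an http(s) URL, got: {target!r}")
    return READER_PREFIX + target


def fetch_markdown(target, timeout=30):
    """Fetch the LLM-friendly text for a page. Performs a real HTTP request."""
    request = urllib.request.Request(
        reader_url(target),
        headers={"User-Agent": "example-scraper/0.1"},  # placeholder UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(reader_url("https://example.com/article"))
# To actually fetch: text = fetch_markdown("https://example.com/article")
```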

E-Commerce & Price Monitoring

| Tool | Target | Description |
|---|---|---|
| Amazon Product API | Amazon | Official Product Advertising API. Requires affiliate account. |
| Keepa | Amazon | Price history tracking. API available ($20/mo). |
| CamelCamelCamel | Amazon | Free price tracker, browser extension. |
| PriceAPI | Multi | Product data from 1000+ retailers. Enterprise. |
| Diffbot | Any | AI-powered product extraction. Free tier. |
| Amazon Scraper (Apify) | Amazon | 750K+ users. Product data, reviews, prices. |
| Walmart Scraper (Apify) | Walmart | Products, prices, reviews. |

Tip: For price monitoring, run the scraper on a schedule (cron, or a scheduled GitHub Actions workflow, which has a free tier) and send an email or Slack alert when a price changes.
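
The change-detection step in that tip can be sketched as follows, assuming prices are kept between runs in a small JSON file. The function names, the state-file layout, and the product IDs are all hypothetical; the scraping itself and the email/Slack delivery are left out.

```python
import json
from pathlib import Path


def load_prices(path):
    """Read the previous run's prices, or an empty dict on the first run."""
    file = Path(path)
    return json.loads(file.read_text()) if file.exists() else {}


def save_prices(path, prices):
    """Persist the current prices for the next scheduled run."""
    Path(path).write_text(json.dumps(prices, indent=2, sort_keys=True))


def price_changes(previous, current):
    """Return (product_id, old_price, new_price) for every changed price.

    Products seen for the first time are not reported as changes.
    """
    changes = []
    for product_id, new_price in current.items():
        old_price = previous.get(product_id)
        if old_price is not None and old_price != new_price:
            changes.append((product_id, old_price, new_price))
    return changes


def run_once(state_path, current):
    """One scheduled run: compare against saved state, then update it."""
    changes = price_changes(load_prices(state_path), current)
    save_prices(state_path, current)
    return changes


# Hypothetical data standing in for two consecutive scraper runs.
yesterday = {"sku-1": 10.0, "sku-2": 5.0}
today = {"sku-1": 9.5, "sku-2": 5.0, "sku-3": 1.0}
for product_id, old, new in price_changes(yesterday, today):
    print(f"{product_id}: {old} -> {new}")  # replace with an email/Slack call
```

Each scheduled run would call `run_once` with freshly scraped prices and pass any returned changes to whatever alerting channel is in use.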

Free APIs (No Scraping Needed)

  • IP-API — IP geolocation (country, city, ISP) — no key needed
  • Open-Meteo — Weather forecasts and historical data — no key needed
  • ExchangeRate-API — Currency conversion rates for 160+ currencies — no key needed

Why scrape when you can use official APIs? These require no API key:

| API | Data | Rate Limit |
|---|---|---|
| Reddit JSON | Posts, comments, subreddits | ~60/min |
| Hacker News | Stories, comments, users | ~1/sec |
| YouTube Innertube | Comments, transcripts, channels | No hard limit |
| Wikipedia | Articles, summaries, media | 200/sec |
| arXiv | 2M+ research papers | 1 request per 3 sec |
| npm Registry | Package metadata | No hard limit |
| PyPI JSON | Python package info | No hard limit |
| GitHub REST | Repos, users, issues | 60/hr unauth |
| Open-Meteo | Weather forecasts | Unlimited |
| CoinGecko | Crypto prices | 30/min |
| Crossref | 150M+ academic papers | 50/sec |
| RDAP | Domain WHOIS data | Varies |
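
To stay under the limits in the table, a client can enforce a minimum gap between requests. Below is a minimal sketch of such a throttle using only the standard library; `Throttle`, `interval_for`, and `polite_get` are illustrative names, and the clock and sleep functions are injectable only so the logic can be tested without real waiting.

```python
import time
import urllib.request


def interval_for(requests, per_seconds):
    """Convert 'N requests per M seconds' into a minimum gap in seconds."""
    return per_seconds / requests


class Throttle:
    """Blocks so that consecutive calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def polite_get(url, throttle, timeout=30):
    """Wait for the throttle, then fetch the URL. Performs a real request."""
    throttle.wait()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Gaps derived from the table above.
github_unauth = Throttle(interval_for(60, 3600))  # 60/hr -> one per 60 s
arxiv = Throttle(interval_for(1, 3))              # 1 request per 3 sec
# Example call (network access required):
# polite_get("https://hacker-news.firebaseio.com/v0/topstories.json", Throttle(1.0))
```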

📚 Full list: 300+ Free APIs →

Pre-Built Scrapers (Apify Store)

Ready-to-use scrapers — no code required. Run on Apify free tier.

| Scraper | Method | Data |
|---|---|---|
| Reddit Scraper | JSON API | Posts, comments, scores |
| YouTube Comments | Innertube | Comments without API key |
| YouTube Transcript | Captions XML | Subtitles and captions |
| Hacker News | Firebase | Stories and comments |
| Trustpilot Reviews | JSON-LD | Reviews via structured data |
| Google News | RSS | 15 languages |
| SEO Audit | Multi | 50+ on-page factors |
| Email Extractor | HTML | Emails, phones, socials |
| Tech Stack Detector | Headers+JS | 80+ technologies |
| Bluesky Scraper | AT Protocol | Profiles and posts |

🔍 All 78 scrapers →

Job Boards & Company Data

| Tool | Target | Description |
|---|---|---|
| LinkedIn Scraper (Apify) | LinkedIn | Profiles, companies, jobs. Requires login. |
| Indeed Scraper (Apify) | Indeed | Job listings, salary data, company reviews. |
| Glassdoor Scraper (Apify) | Glassdoor | Reviews, salaries, interviews. |
| Google Maps Scraper (Apify) | Google Maps | Business data, reviews, phone, hours. 500K+ users. |
| Crunchbase API | Crunchbase | Startup data, funding, investors. Paid. |
| Hunter.io | Any domain | Find email addresses. 25 free/mo. |
| Apollo.io | Any company | Contact data, org charts. Free tier. |

Government & Public Data

| Source | Data | Access |
|---|---|---|
| data.gov | US government datasets | Free API + bulk download |
| EU Open Data | EU datasets | Free API |
| SEC EDGAR | Company filings | Free API |
| USPTO | Patent data | Free API |
| OpenStreetMap | Geographic data | Free API |
| World Bank | Economic indicators | Free API |
| FRED | Economic data | Free API key |

Data Parsing & Extraction

| Tool | Stars | Description |
|---|---|---|
| lxml | 2k+ | Fastest XML/HTML parser for Python. XPath + XSLT. |
| Readability | 8k+ | Firefox's reader mode as a library. Extract article content. |
| Trafilatura | 3k+ | Extract main text from web pages. Removes boilerplate. |
| newspaper3k | 14k+ | Article scraping and NLP. Titles, authors, text, images. |
| extruct | 800+ | Extract JSON-LD, Microdata, OpenGraph from HTML. |
| markdownify | 1k+ | Convert HTML to Markdown. Great for LLM pipelines. |
| html2text | 2k+ | Convert HTML to clean Markdown. Handles complex layouts. |
| jusText | 500+ | Remove boilerplate from HTML. Extract just article text. |
| dateparser | 2k+ | Parse dates in any format/language. Essential for scraping. |
| price-parser | 300+ | Extract price and currency from any string. By Zyte. |
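
To show what JSON-LD extraction involves, here is a minimal sketch using only Python's standard library. It is not extruct's API; the class name, the helper `extract_jsonld`, and the sample HTML are invented for illustration, and a dedicated library handles many more edge cases (Microdata, OpenGraph, malformed markup).

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_jsonld(html):
    """Return a list of JSON-LD objects embedded in an HTML document."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up page with one embedded review.
SAMPLE = """<html><head>
<script type="application/ld+json">
{"@type": "Review", "reviewRating": {"ratingValue": "5"}, "author": {"name": "A. Reader"}}
</script>
</head><body><p>Great product</p></body></html>"""

for item in extract_jsonld(SAMPLE):
    print(item["@type"], item["reviewRating"]["ratingValue"])
```

Because the data is already structured JSON, this approach tends to survive visual redesigns of a page better than CSS selectors do.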

Anti-Bot Detection

Tools to test your scraper against detection (for authorized testing only):

| Tool | Description |
|---|---|
| CreepJS | Browser fingerprint test: see what sites see about you. |
| Fingerprint.com | Browser fingerprinting service. |
| BotD | Open-source bot detection library by Fingerprint (2k+ stars). |
| Sannysoft Test | Check what automation signals your browser leaks. |
| Incolumitas Bot Test | Advanced bot detection test: TLS, JS, canvas fingerprint. |

Scraping Infrastructure

| Tool | Stars | Description |
|---|---|---|
| Scrapyd | 3k+ | Deploy and run Scrapy spiders as a service. |
| Gerapy | 3k+ | Distributed Scrapy management with Django UI. |
| Portia | 9k+ | Visual scraping tool: point and click, no code. |
| Scrapy-Redis | 5k+ | Distributed Scrapy with Redis. Scale to millions of pages. |
| Frontera | 1k+ | Large-scale web crawling frontier. URL management and scheduling. |
| Scrapy-Splash | 2k+ | Scrapy + Splash integration for JS rendering in pipelines. |
| Scrapy-Playwright | 1k+ | Playwright integration for Scrapy. Modern JS rendering. |

Legal & Ethics

Before scraping, know the rules:

| Topic | Key Points |
|---|---|
| robots.txt | Always check. Respect Disallow directives. Not legally binding but shows good faith. |
| Rate Limiting | Never DDoS. Add delays between requests. 1 req/sec is a safe default. |
| Terms of Service | Some sites explicitly prohibit scraping. Violating ToS can have legal consequences. |
| Personal Data (GDPR) | Scraping personal data in the EU requires a lawful basis. Be careful with names, emails, etc. |
| CFAA (US) | The Computer Fraud and Abuse Act can apply. Key case: hiQ v. LinkedIn, where the Ninth Circuit held that scraping publicly accessible data likely does not violate the CFAA. Other claims, such as breach of contract, can still apply. |
| Copyright | Scraped content may be copyrighted. Extracting facts is usually lower risk; republishing content wholesale generally is not. |
| API Terms | Even free APIs have terms. Read them, especially about commercial use. |

Rule of thumb: If the data is publicly available, not behind a login, and you respect rate limits — you're probably fine. When in doubt, use the official API.
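
The robots.txt and rate-limiting points can be checked in code before any request is sent. The following is a minimal sketch with Python's built-in `urllib.robotparser`; the robots.txt content, the user-agent string, and the helper names are made up for illustration, and a real scraper would download the file from the target site (for example with `RobotFileParser.set_url` and `read`).

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # placeholder user agent


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent=USER_AGENT, default=1.0):
    """Use the site's Crawl-delay if it sets one, else 1 req/sec."""
    return parser.crawl_delay(user_agent) or default


# Made-up robots.txt for illustration.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = build_parser(ROBOTS_TXT)
for url in ("https://example.com/products", "https://example.com/private/data"):
    verdict = "allowed" if robots.can_fetch(USER_AGENT, url) else "disallowed"
    print(url, verdict)
print("delay between requests:", polite_delay(robots), "seconds")
```

Passing this check only covers the robots.txt row; terms of service, personal data, and copyright still need to be considered separately.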

Resources:

Tutorials & Articles

📖 Need a custom scraper or data pipeline? Email me — I build production-grade scrapers with anti-detection built in. Check my ready-made scrapers on Apify.

Related Awesome Lists


Starter Template

python-web-scraping-starter — Clone → install → scrape in 5 minutes. API-first with Playwright fallback.

Need Custom Scraping?

I've built 78+ production scrapers. I can extract data from any website — e-commerce, real estate, job boards, social media — with anti-detection, proxy rotation, and structured JSON/CSV output.

What you get: Working scraper in 24-48h, hosted on Apify (free tier available), with monitoring and auto-retry.

📧 spinov001@gmail.com: describe your data need, get a free quote within 2 hours. First 3 clients this month get priority delivery.

💳 Pay securely via Payoneer → — custom scraper $250 flat rate. Delivered in 48 hours, no hourly surprises.

🔧 Browse 78+ ready-made scrapers → — Reddit, HN, Google, Amazon, and more. Deploy in 1 click, no coding required.

License

MIT
