The API handles proxy rotation, headless browser rendering, geo-targeting, and the JavaScript scrolling needed to load Amazon's lazy review widget, so your code stays focused on what you actually want to do with the data.
- What is an Amazon review scraper?
- How it works
- Prerequisites
- Installation
- Quick start
- What you get
- How the script works
- Configuration reference
- Scrape reviews across Amazon regions
- Load more reviews with infinite scroll
- Use cases
- Why ScrapingBee
- Best practices
- Legal note
- FAQ
- Resources
An Amazon review scraper is a program that collects publicly visible review data from Amazon product pages: the reviewer's name, star rating, date, headline, and review body. Knowing how to scrape Amazon reviews lets you run sentiment analysis at the product, brand, or category level, watch competitor feedback over time, build training datasets for review classification, or feed dashboards that surface buyer complaints early.
The hard part is not parsing the HTML. It is loading the lazy review widget reliably in a headless browser, rotating IPs so Amazon does not block you, and matching the right regional storefront. Scraping Amazon reviews through an API like ScrapingBee removes all three problems and leaves you with one HTTP request per page.
You send a GET request to the ScrapingBee API with the product URL (https://www.amazon.com/dp/{ASIN}). The API:
- Routes the request through a rotating proxy in the country you specify.
- Renders the page with a headless browser.
- Runs your `js_scenario` (scroll, click, wait) so the review widget loads.
- Applies your CSS-selector `extract_rules` to the rendered DOM.
- Returns the data as structured JSON.
Your code never touches HTML.
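For readers who want to see the single GET request spelled out, here is a minimal sketch that only builds the request URL and sends nothing. It assumes the ScrapingBee HTTP endpoint is `https://app.scrapingbee.com/api/v1/` and that `extract_rules` and `js_scenario` travel as JSON-encoded query parameters. `build_request_url` is a hypothetical helper, not part of the SDK.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoint of the ScrapingBee HTTP API; confirm it in the API docs.
API_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def build_request_url(api_key, asin, extract_rules, js_scenario, country_code="us"):
    """Return the GET URL for one product page (hypothetical helper)."""
    query = {
        "api_key": api_key,
        "url": f"https://www.amazon.com/dp/{asin}",
        # Dict-valued parameters are serialised to JSON strings.
        "extract_rules": json.dumps(extract_rules),
        "js_scenario": json.dumps(js_scenario),
        "country_code": country_code,
    }
    return f"{API_ENDPOINT}?{urlencode(query)}"


rules = {"product_title": "span.a-size-large.product-title-word-break"}
scenario = {"instructions": [{"wait": 2000}]}
url = build_request_url("YOUR_API_KEY", "B0CTH2QF23", rules, scenario)
print(url)
```

The Python SDK used in the rest of this README does this encoding for you, so the sketch is only useful for understanding what goes over the wire.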
- Python 3.8 or later
- A ScrapingBee API key — sign up for 1,000 free credits
Install the dependencies with `pip install scrapingbee pandas`. scrapingbee is the official Python SDK. pandas is used to write the CSV output.
Save this as scrape_reviews.py, replace YOUR_API_KEY, edit asin_list, and run.

    from scrapingbee import ScrapingBeeClient
    import pandas as pd

    client = ScrapingBeeClient(api_key='YOUR_API_KEY')


    def amazon_reviews(asins):
        # CSS-selector rules applied to the rendered page.
        extract_rules = {
            "product_title": {
                "selector": "span.a-size-large.product-title-word-break",
                "output": "text"
            },
            "properties": {
                "selector": "#cm-cr-dp-review-list > li",
                "type": "list",
                "output": {
                    "name": ".a-profile-name",
                    "rating": ".review-rating > span",
                    "date": ".review-date",
                    "title": ".review-title span:not([class])",
                    "content": ".review-text"
                }
            }
        }

        # Scroll to the bottom so the lazy review widget loads before extraction.
        js_scenario = {
            "instructions": [
                {"wait": 2000},
                {"evaluate": "window.scrollTo(0, document.body.scrollHeight);"},
                {"wait": 2000},
            ]
        }

        all_reviews = []
        for asin in asins:
            response = client.get(
                f'https://www.amazon.com/dp/{asin}',
                params={
                    "extract_rules": extract_rules,
                    "js_scenario": js_scenario,
                    "country_code": "us"
                },
                retries=2
            )
            data = response.json()

            # Divider row: the product title goes in the "name" column.
            title_entry = {
                "name": data.get('product_title'),
                "rating": "",
                "date": "",
                "title": "",
                "content": ""
            }
            all_reviews.append(title_entry)

            reviews = data.get('properties', [])
            all_reviews.extend(reviews)
            print(f"{asin}: {response.status_code}, {len(reviews)} reviews extracted")

        # One CSV for all products, written to the current directory.
        df = pd.DataFrame(all_reviews)
        df.to_csv("all_reviews.csv", index=False)


    asin_list = ["B0CTH2QF23", "B0CCDTPDTQ", "B099WTN2TR"]
    amazon_reviews(asin_list)

Run it:

    python scrape_reviews.py

You will see one line per ASIN in the console (B0CTH2QF23: 200, 8 reviews extracted), and a CSV called all_reviews.csv will be written to the current directory.
For each ASIN, the script writes one product-title row followed by one row per review.
| Column | Source selector | Example |
|---|---|---|
| `name` | `.a-profile-name` | Jane D. |
| `rating` | `.review-rating > span` | 5.0 out of 5 stars |
| `date` | `.review-date` | Reviewed in the United States on January 12, 2026 |
| `title` | `.review-title span:not([class])` | Better than expected |
| `content` | `.review-text` | The build quality is solid... |
| `product_title` | `span.a-size-large.product-title-word-break` | (Filled on the divider row between products) |
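The rating and date columns arrive as display strings, so most analyses need them converted first. The sketch below shows one way to do that, assuming the English US formats shown in the Example column. Other storefronts word and order these strings differently and would need their own patterns. `parse_rating` and `parse_review_date` are hypothetical helper names.

```python
import re
from datetime import date, datetime

# Matches strings such as "5.0 out of 5 stars" (US storefront wording assumed).
RATING_RE = re.compile(r"(\d+(?:\.\d+)?) out of 5 stars")


def parse_rating(text):
    """Return the star rating as a float, or None if the text does not match."""
    if not isinstance(text, str):
        return None
    match = RATING_RE.search(text)
    return float(match.group(1)) if match else None


def parse_review_date(text):
    """Return the date that follows the last ' on ', or None if parsing fails."""
    if not isinstance(text, str):
        return None
    _, separator, tail = text.rpartition(" on ")
    if not separator:
        return None
    try:
        # "%B %d, %Y" is the assumed US layout, e.g. "January 12, 2026".
        return datetime.strptime(tail.strip(), "%B %d, %Y").date()
    except ValueError:
        return None


rating = parse_rating("5.0 out of 5 stars")
reviewed = parse_review_date("Reviewed in the United States on January 12, 2026")
print(rating, reviewed)
```

Both helpers return None instead of raising, so the divider rows, whose rating and date are empty, pass through without special handling.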
Four things do the work.
extract_rules. A declarative spec that tells ScrapingBee what to pull from the rendered page. product_title is a single element. properties has type "list", so the API iterates over every <li> inside #cm-cr-dp-review-list and returns one structured object per review. No HTML parsing on your side.
js_scenario. Amazon loads the review widget lazily, so the script tells the headless browser to wait 2 seconds, scroll to the bottom of the page, then wait 2 more seconds before extract rules run. Without the scroll, the widget would not be in the DOM.
country_code. Routes the request through a US IP. Amazon's review content varies by country — set this to the locale you care about. The full list of supported countries is in the API docs.
retries=2. If the request fails, the SDK retries up to two times before raising. Useful for transient blocks or slow page loads.
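To make the data flow concrete, the sketch below isolates the step that turns one JSON response into CSV rows. The payload is made up for illustration and only mirrors the shape the extract rules describe. `rows_from_response` is a hypothetical helper that follows the same logic as the loop body in the script.

```python
def rows_from_response(data):
    """Flatten one response dict into a divider row plus one row per review."""
    divider = {
        "name": data.get("product_title"),  # product title reuses the name column
        "rating": "",
        "date": "",
        "title": "",
        "content": "",
    }
    # A missing or null list becomes an empty list instead of raising.
    reviews = data.get("properties") or []
    return [divider] + list(reviews)


# Made-up payload, not real Amazon data.
sample = {
    "product_title": "Example Product",
    "properties": [
        {"name": "Jane D.", "rating": "5.0 out of 5 stars", "date": "",
         "title": "Better than expected", "content": "The build quality is solid..."},
        {"name": "Sam K.", "rating": "3.0 out of 5 stars", "date": "",
         "title": "Fine", "content": "Does the job."},
    ],
}

rows = rows_from_response(sample)
print(len(rows))
```

Keeping this step as a pure function makes it easy to test the row layout without spending API credits.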
extract_rules is a JSON object where each key is a field name and each value is either a selector string or a rule object. It is the heart of how this scraper works without any HTML parsing.
Shorthand syntax:

    {"title": "h1", "subtitle": "#subtitle"}

Full rule object:
| Property | Type | Description |
|---|---|---|
| `selector` | string, required | CSS or XPath selector. XPath is auto-detected when the selector starts with `/`. |
| `selector_type` | string | Force `"css"` or `"xpath"` instead of auto-detection. |
| `output` | string or object | What to extract. See below. |
| `type` | string | `"item"` (default, first match) or `"list"` (all matches). |
| `clean` | boolean | Strips whitespace by default. Set false to preserve formatting. |
Output formats:
| `output` value | Returns |
|---|---|
| `text` (default) | Visible text content |
| `text_relevant` | Text with scripts, CSS, headers, and footers removed |
| `markdown_relevant` | Markdown with irrelevant content trimmed |
| `html` | Inner HTML |
| `@attribute_name` | An HTML attribute, e.g. `@href` for a link's URL |
| `table_json` | Parses a `<table>` into JSON objects |
| `table_array` | Parses a `<table>` into nested arrays |
Nested rules, which extract a list of structured objects:

    {
      "reviews": {
        "selector": "#cm-cr-dp-review-list > li",
        "type": "list",
        "output": {
          "name": ".a-profile-name",
          "rating": ".review-rating > span",
          "link": {"selector": "a.review-title", "output": "@href"}
        }
      }
    }

Attribute shorthand: "link": "a@href" is equivalent to {"selector": "a", "output": "@href"}.
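The relationship between the shorthand and full forms can be modelled locally. The sketch below is an illustration of the equivalence stated above, not a description of how the API parses rules. `expand_rule` is a hypothetical helper, and it deliberately leaves XPath selectors untouched because `@` has its own meaning there.

```python
def expand_rule(rule):
    """Expand a shorthand rule string into a full rule object (illustrative only)."""
    if isinstance(rule, dict):
        return rule  # already a full rule object
    if rule.startswith("/"):
        return {"selector": rule}  # XPath: do not split on "@"
    selector, separator, attribute = rule.rpartition("@")
    if separator and selector and attribute:
        # "a@href" -> selector "a", output "@href"
        return {"selector": selector, "output": "@" + attribute}
    return {"selector": rule}


link_rule = expand_rule("a@href")
title_rule = expand_rule("h1")
print(link_rule, title_rule)
```

Writing rules in the full form is the safer choice when a selector itself might contain an `@` character.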
js_scenario is an object whose instructions list is executed in order before extraction. Maximum runtime per scenario is 40 seconds.
| Instruction | Syntax | Purpose |
|---|---|---|
| `wait` | `{"wait": 2000}` | Pause for N milliseconds |
| `wait_for` | `{"wait_for": ".selector"}` | Pause until an element exists |
| `wait_for_and_click` | `{"wait_for_and_click": ".selector"}` | Wait, then click |
| `click` | `{"click": "#buttonId"}` | Click an element |
| `scroll_x` | `{"scroll_x": 1000}` | Horizontal scroll in pixels |
| `scroll_y` | `{"scroll_y": 1000}` | Vertical scroll in pixels |
| `fill` | `{"fill": ["#input", "value"]}` | Type into an input |
| `evaluate` | `{"evaluate": "window.scrollTo(0, document.body.scrollHeight);"}` | Run arbitrary JS |
| `infinite_scroll` | `{"infinite_scroll": {"max_count": 0, "delay": 1000}}` | Auto-scroll until page end |
All selectors accept CSS or XPath. Set "strict": false on the scenario to allow individual instructions to fail without aborting the whole run.
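A typo in an instruction name only shows up after a paid request, so a local check can be worthwhile. The sketch below lints a scenario against the instruction names listed in this section. It assumes each step is an object with exactly one key, and the API may accept instructions not listed here, so treat the result as a warning only. `check_scenario` is a hypothetical helper.

```python
# Instruction names taken from the table in this section.
KNOWN_INSTRUCTIONS = {
    "wait", "wait_for", "wait_for_and_click", "click", "scroll_x",
    "scroll_y", "fill", "evaluate", "infinite_scroll",
}


def check_scenario(scenario):
    """Return a list of problems found in a js_scenario dict (empty if none)."""
    instructions = scenario.get("instructions")
    if not isinstance(instructions, list):
        return ["'instructions' must be a list"]
    problems = []
    for index, step in enumerate(instructions):
        if not isinstance(step, dict) or len(step) != 1:
            problems.append(f"step {index}: expected a single-key object")
            continue
        name = next(iter(step))
        if name not in KNOWN_INSTRUCTIONS:
            problems.append(f"step {index}: unknown instruction '{name}'")
    return problems


scenario = {
    "strict": False,  # let individual instructions fail without aborting
    "instructions": [{"wait": 2000}, {"scrol_y": 1000}],  # typo on purpose
}
problems = check_scenario(scenario)
print(problems)
```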
These belong on the params argument of client.get(...).
| Parameter | Type | Default | Description |
|---|---|---|---|
| `extract_rules` | dict | - | Extraction spec (above). |
| `js_scenario` | dict | - | Scenario spec (above). |
| `country_code` | string | - | Two-letter ISO code. `"us"`, `"de"`, `"gb"`. |
| `premium_proxy` | bool | false | Route through residential proxies. Use when datacenter IPs get blocked. |
| `stealth_proxy` | bool | false | Toughest-target proxy tier. |
| `render_js` | bool | true | Run a headless browser. Always on for review scraping. |
| `wait` | int | - | Milliseconds to wait after the page loads, before extraction. |
| `wait_for` | string | - | Wait until a selector exists, then extract. |
Set country_code to the locale and change the URL host to match the regional storefront. The two should agree.

    # Germany
    response = client.get(
        f'https://www.amazon.de/dp/{asin}',
        params={
            "extract_rules": extract_rules,
            "js_scenario": js_scenario,
            "country_code": "de"
        },
        retries=2
    )

    # United Kingdom
    response = client.get(
        f'https://www.amazon.co.uk/dp/{asin}',
        params={
            "extract_rules": extract_rules,
            "js_scenario": js_scenario,
            "country_code": "gb"
        },
        retries=2
    )

CSS selectors are the same across regional storefronts, so the extract_rules block does not change. Only the URL and country_code move.
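Because the host and `country_code` must agree, deriving both from one lookup removes a class of mistakes. The sketch below covers only the three storefronts that appear in this README. `STOREFRONTS` and `regional_request` are hypothetical names, and any other country would need its host added by hand.

```python
# Host per country code, limited to the storefronts shown in this README.
STOREFRONTS = {
    "us": "www.amazon.com",
    "de": "www.amazon.de",
    "gb": "www.amazon.co.uk",
}


def regional_request(asin, country_code):
    """Return (url, params) with the host and country_code kept in sync."""
    if country_code not in STOREFRONTS:
        raise ValueError(f"no storefront mapped for country_code={country_code!r}")
    url = f"https://{STOREFRONTS[country_code]}/dp/{asin}"
    return url, {"country_code": country_code}


url, params = regional_request("B0CTH2QF23", "de")
print(url, params)
```

The returned params can be merged with `extract_rules` and `js_scenario` before being passed to `client.get(url, params=..., retries=2)`.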
The base script returns whatever reviews Amazon renders on the product page after the bottom-scroll — usually the 8–10 "most helpful". To load more before extraction runs, swap the simple scroll for infinite_scroll:

    js_scenario = {
        "instructions": [
            {"wait": 2000},
            {"infinite_scroll": {"max_count": 5, "delay": 1500}},
            {"wait": 2000},
        ]
    }

max_count is the number of scroll cycles (0 means scroll until the page stops growing). delay is the wait between scrolls in milliseconds. Higher numbers cost more in credits per request because the headless browser runs longer.
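Since a scenario has a 40-second runtime ceiling, it helps to budget the scroll settings before sending them. The sketch below builds the scenario shown above and rejects combinations whose explicit waits alone would exceed that ceiling. The estimate is a lower bound, because page work between scrolls is not counted. `scroll_scenario` is a hypothetical helper.

```python
MAX_SCENARIO_MS = 40_000  # runtime ceiling stated in the js_scenario reference


def scroll_scenario(max_count, delay_ms, settle_ms=2000):
    """Build an infinite_scroll scenario, refusing ones that cannot fit the limit."""
    if max_count < 1:
        # max_count=0 means "until the page stops growing", which cannot be budgeted.
        raise ValueError("max_count must be at least 1 to estimate runtime")
    estimated_ms = settle_ms + max_count * delay_ms + settle_ms
    if estimated_ms > MAX_SCENARIO_MS:
        raise ValueError(f"waits alone take {estimated_ms} ms, over {MAX_SCENARIO_MS} ms")
    return {
        "instructions": [
            {"wait": settle_ms},
            {"infinite_scroll": {"max_count": max_count, "delay": delay_ms}},
            {"wait": settle_ms},
        ]
    }


js_scenario = scroll_scenario(max_count=5, delay_ms=1500)
print(js_scenario)
```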
- Sentiment analysis. Aggregate star ratings and run NLP over the review bodies to score satisfaction at the product, brand, or category level.
- Competitor monitoring. Watch how a rival's product is rated week over week. Flag drops.
- Voice of customer. Surface the phrases real buyers use — input for ad copy, landing-page rewrites, and feature prioritisation.
- Catalogue QA. Watch your own listings for review-volume changes that signal a fake-review attack or a quality regression.
- Training data. Build classification, summarisation, or fine-tuning datasets from honest, unprompted reviews.
- Academic research. Source structured review data for studies in marketing, NLP, or e-commerce.
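As a starting point for the sentiment-analysis use case, the sketch below computes a mean star rating per product from rows laid out the way the script writes them: a divider row carrying the product title, followed by that product's reviews. The rows are made up for illustration and `mean_rating_by_product` is a hypothetical helper. It uses only the standard library so it runs without pandas.

```python
import re
from collections import defaultdict

REVIEW_FIELDS = ("rating", "date", "title", "content")


def mean_rating_by_product(rows):
    """Return {product_title: mean rating} from divider-separated review rows."""
    ratings = defaultdict(list)
    current_product = None
    for row in rows:
        # A divider row has every review field empty; its name is the product title.
        if all(not row.get(field) for field in REVIEW_FIELDS):
            current_product = row.get("name")
            continue
        match = re.match(r"\d+(?:\.\d+)?", str(row.get("rating") or ""))
        if match and current_product is not None:
            ratings[current_product].append(float(match.group()))
    return {product: sum(values) / len(values) for product, values in ratings.items()}


# Made-up rows, not real Amazon data.
rows = [
    {"name": "Product A", "rating": "", "date": "", "title": "", "content": ""},
    {"name": "Jane D.", "rating": "5.0 out of 5 stars", "date": "", "title": "Great", "content": "Solid."},
    {"name": "Sam K.", "rating": "4.0 out of 5 stars", "date": "", "title": "Good", "content": "Fine."},
    {"name": "Product B", "rating": "", "date": "", "title": "", "content": ""},
    {"name": "Ana P.", "rating": "3.0 out of 5 stars", "date": "", "title": "OK", "content": "Average."},
]
averages = mean_rating_by_product(rows)
print(averages)
```

Keep in mind that the product page shows only a sample of reviews, so these means describe the scraped sample, not the product's overall rating.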
- One API call per page. No proxy pool to maintain, no CAPTCHA solver to wire up, no browser automation framework to manage.
- CSS-selector extraction. Get clean JSON without writing HTML parsers that break on every DOM change.
- JavaScript scenarios. Click, scroll, wait, and run custom JS before extraction — needed for Amazon's lazy review widget and most other dynamic pages.
- Geo-targeting. `country_code` and `premium_proxy` parameters return the regional storefront, currency, and stock a real buyer in that location would see.
- Built-in retries. The Python SDK retries failed requests for you.
- 1,000 credits free. No credit card required to evaluate.
- Pace your requests. Do not send hundreds of requests per second. The free plan throttles at 5 concurrent requests, which is a reasonable starting ceiling even on paid plans.
- Retry transient failures. The SDK's `retries=2` argument is enough for most cases. Do not retry 4xx errors: those mean the request was wrong, not unlucky.
- Match `country_code` to the URL. A US IP requesting `amazon.de` is a fingerprint Amazon will spot.
- Never log in. This repo scrapes public review data only. Authenticated scraping is against the ToS and out of scope.
- Cache results. Reviews rarely change in 24 hours. Cache locally and only re-scrape what has actually moved.
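The caching advice can be sketched with one JSON file per ASIN and a 24-hour freshness window. This is a minimal illustration under those assumptions. `load_cached` and `save_cached` are hypothetical helpers, and the temporary directory stands in for whatever cache location you choose.

```python
import json
import tempfile
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour window suggested above


def save_cached(cache_dir, asin, reviews):
    """Write one product's reviews to <cache_dir>/<asin>.json."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / f"{asin}.json").write_text(json.dumps(reviews), encoding="utf-8")


def load_cached(cache_dir, asin, now=None):
    """Return cached reviews, or None if the file is missing or older than the TTL."""
    path = Path(cache_dir) / f"{asin}.json"
    if not path.exists():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > CACHE_TTL_SECONDS:
        return None  # stale: the caller should re-scrape
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    save_cached(tmp, "B0CTH2QF23", [{"name": "Jane D."}])
    fresh = load_cached(tmp, "B0CTH2QF23")
    # Pretend a day and a minute have passed.
    stale = load_cached(tmp, "B0CTH2QF23", now=time.time() + CACHE_TTL_SECONDS + 60)
    missing = load_cached(tmp, "B0CCDTPDTQ")

print(fresh, stale, missing)
```

In the scraper loop, you would call `load_cached` first and only issue a request when it returns None.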
Scraping publicly visible Amazon data is generally legal in many jurisdictions, but Amazon's terms of service restrict automated access. A few practical rules:
- Only collect public, non-authenticated content.
- Keep request rates reasonable.
- Personal data scraped from reviews (names, opinions) is subject to GDPR and CCPA. Treat it the way you would any personal data — minimise collection, secure storage, honour deletion requests.
This repository is not legal advice. Review Amazon's ToS and the regulations that apply to your jurisdiction before running anything in production.
The ScrapingBee Web Scraping API automatically manages proxies and headers to avoid blocks, even on JavaScript-heavy pages like Amazon reviews.
JavaScript rendering happens on ScrapingBee's side, not on your machine, so you do not need to install or maintain Puppeteer, Playwright, or Selenium locally.
The script can cover as many products as you need. Provide the list of ASINs and it will target each product page. Stay within your plan's rate and credit limits.
To scrape a regional storefront, set country_code in the API request (e.g. "country_code": "de" for Germany) and change the URL host to match (amazon.de). If a country does not return the expected results, switch on premium_proxy=True for residential IPs.
Use the infinite_scroll instruction in js_scenario (see Load more reviews). Each scroll cycle loads more reviews into the DOM before extraction runs.
When reviews come back empty or incomplete, there are two common causes: the country_code does not match the Amazon regional domain in the URL, or the JavaScript scenario timing is too short for a slow-loading page. Bump the wait values from 2000 to 3500 ms and verify the review widget actually renders.
A standard request with JavaScript rendering and a basic scenario is 25 credits. Using premium_proxy doubles that, and stealth_proxy raises it further. The free 1,000-credit tier gets you roughly 40 of these requests for evaluation.
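The arithmetic behind those figures, as a small sketch. The constants come from the paragraph above, not from a pricing page, so check current pricing before budgeting. `stealth_proxy` is left out because no figure is given for it. `affordable_requests` is a hypothetical helper.

```python
CREDITS_PER_REQUEST = 25  # JS rendering plus a basic scenario, as stated above
PREMIUM_MULTIPLIER = 2    # premium_proxy doubles the cost, as stated above


def affordable_requests(credits, premium_proxy=False):
    """Return how many whole requests a credit balance covers."""
    cost = CREDITS_PER_REQUEST * (PREMIUM_MULTIPLIER if premium_proxy else 1)
    return credits // cost


standard = affordable_requests(1000)
premium = affordable_requests(1000, premium_proxy=True)
print(standard, premium)
```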
This script targets the product detail page, so it returns the product's reviews regardless of which seller owns the buy box at request time. Seller-specific review pages are a separate URL pattern and a separate scrape.