ScrapingBee/amazon-review-scraper

Amazon Review Scraper

The API handles proxy rotation, headless browser rendering, geo-targeting, and the JavaScript scrolling needed to load Amazon's lazy review widget, so your code stays focused on what you actually want to do with the data.

What is an Amazon review scraper?

An Amazon review scraper is a program that collects publicly visible review data from Amazon product pages: the reviewer's name, star rating, date, headline, and review body. Knowing how to scrape Amazon reviews lets you run sentiment analysis at the product, brand, or category level, watch competitor feedback over time, build training datasets for review classification, or feed dashboards that surface buyer complaints early.

The hard part is not parsing the HTML. It is loading the lazy review widget reliably in a headless browser, rotating IPs so Amazon does not block you, and matching the right regional storefront. Scraping Amazon reviews through an API like ScrapingBee removes all three problems and leaves you with one HTTP request per page.

How it works

You send a GET request to the ScrapingBee API with the product URL (https://www.amazon.com/dp/{ASIN}). The API:

  1. Routes the request through a rotating proxy in the country you specify.
  2. Renders the page with a headless browser.
  3. Runs your js_scenario — scroll, click, wait — so the review widget loads.
  4. Applies your CSS-selector extract_rules to the rendered DOM.
  5. Returns the data as structured JSON.

Your code never touches HTML.
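
As a rough illustration of the single request described in the steps above, the sketch below assembles the query string for one product page without sending anything. It assumes the HTTP endpoint is https://app.scrapingbee.com/api/v1/ and that extract_rules and js_scenario travel as JSON-encoded strings. build_params is a hypothetical helper, not part of the SDK; the SDK used in the quick start handles this encoding for you.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the ScrapingBee API; confirm it in the API docs.
API_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def build_params(api_key, asin, extract_rules, js_scenario, country_code="us"):
    """Return the query parameters for one product page (hypothetical helper)."""
    return {
        "api_key": api_key,
        "url": f"https://www.amazon.com/dp/{asin}",
        # Nested specs are serialised to JSON strings for the query string.
        "extract_rules": json.dumps(extract_rules),
        "js_scenario": json.dumps(js_scenario),
        "country_code": country_code,
    }


params = build_params(
    api_key="YOUR_API_KEY",
    asin="B0CTH2QF23",
    extract_rules={"product_title": "span.a-size-large.product-title-word-break"},
    js_scenario={"instructions": [{"wait": 2000}]},
)
# The full GET URL; nothing is sent here.
request_url = f"{API_ENDPOINT}?{urlencode(params)}"
print(request_url[:80])
```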

Prerequisites

Installation

pip install scrapingbee pandas

scrapingbee is the official Python SDK. pandas is used to write the CSV output.

Quick start

Save this as scrape_reviews.py, replace YOUR_API_KEY, edit asin_list, and run.

from scrapingbee import ScrapingBeeClient
import pandas as pd

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

def amazon_reviews(asins):
    extract_rules = {
        "product_title": {
            "selector": "span.a-size-large.product-title-word-break",
            "output": "text"
        },
        "properties": {
            "selector": "#cm-cr-dp-review-list > li",
            "type": "list",
            "output": {
                "name": ".a-profile-name",
                "rating": ".review-rating > span",
                "date": ".review-date",
                "title": ".review-title span:not([class])",
                "content": ".review-text"
            }
        }
    }

    js_scenario = {
        "instructions": [
            {"wait": 2000},
            {"evaluate": "window.scrollTo(0, document.body.scrollHeight);"},
            {"wait": 2000},
        ]
    }

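    all_reviews = []
    for asin in asins:
        # One API call per product page; the SDK retries failed requests.
        response = client.get(
            f'https://www.amazon.com/dp/{asin}',
            params={
                "extract_rules": extract_rules,
                "js_scenario": js_scenario,
                "country_code": "us"
            },
            retries=2
        )

        # Skip products whose request still failed after the retries.
        if response.status_code != 200:
            print(f"{asin}: {response.status_code}, skipped")
            continue

        # Parse the JSON body once and reuse it.
        data = response.json()

        # Divider row: the product title goes in the "name" column.
        title_entry = {
            "name": data.get('product_title'),
            "rating": "",
            "date": "",
            "title": "",
            "content": ""
        }

        all_reviews.append(title_entry)
        reviews = data.get('properties') or []
        all_reviews.extend(reviews)

        print(f"{asin}: {response.status_code}, {len(reviews)} reviews extracted")

    # Write the rows for every product to a single CSV.
    df = pd.DataFrame(all_reviews)
    df.to_csv("all_reviews.csv", index=False)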


asin_list = ["B0CTH2QF23", "B0CCDTPDTQ", "B099WTN2TR"]
amazon_reviews(asin_list)

Run it:

python scrape_reviews.py

You will see one line per ASIN in the console (for example, B0CTH2QF23: 200, 8 reviews extracted), and a CSV called all_reviews.csv will be written to the current working directory.

What you get

For each ASIN, the script writes one product-title row followed by one row per review.

| Column | Source selector | Example |
| --- | --- | --- |
| name | .a-profile-name | Jane D. |
| rating | .review-rating > span | 5.0 out of 5 stars |
| date | .review-date | Reviewed in the United States on January 12, 2026 |
| title | .review-title span:not([class]) | Better than expected |
| content | .review-text | The build quality is solid... |
| product_title | span.a-size-large.product-title-word-break | (Not a separate column: written to the name column of the divider row between products) |
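
If the divider-row layout is awkward for downstream filtering, one alternative is to repeat the ASIN and product title on every review row. The sketch below assumes the parsed JSON carries the product_title and properties keys defined by the extract rules in the quick start. flatten_reviews and the sample payload are hypothetical and exist only for illustration.

```python
def flatten_reviews(asin, payload):
    """Turn one parsed API response into flat rows, one per review."""
    title = payload.get("product_title")
    reviews = payload.get("properties") or []
    # Each row carries its own product identifiers, so no divider row is needed.
    return [{"asin": asin, "product_title": title, **review} for review in reviews]


# Made-up payload in the shape the quick-start extract rules produce.
sample_payload = {
    "product_title": "Example Product",
    "properties": [
        {
            "name": "Jane D.",
            "rating": "5.0 out of 5 stars",
            "date": "Reviewed in the United States on January 12, 2026",
            "title": "Better than expected",
            "content": "The build quality is solid...",
        }
    ],
}

rows = flatten_reviews("B0CTH2QF23", sample_payload)
print(rows[0]["asin"], rows[0]["product_title"], rows[0]["name"])
```

Rows in this shape can be passed to pd.DataFrame in the same way as the original list.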

How the script works

Four things do the work.

extract_rules. A declarative spec that tells ScrapingBee what to pull from the rendered page. product_title is a single element. properties is typed as a list, so the API iterates over every <li> inside #cm-cr-dp-review-list and returns one structured object per review. No HTML parsing on your side.

js_scenario. Amazon loads the review widget lazily, so the script tells the headless browser to wait 2 seconds, scroll to the bottom of the page, then wait 2 more seconds before extract rules run. Without the scroll, the widget would not be in the DOM.

country_code. Routes the request through a US IP. Amazon's review content varies by country — set this to the locale you care about. The full list of supported countries is in the API docs.

retries=2. If the request fails, the SDK retries up to two times before raising. Useful for transient blocks or slow page loads.

Configuration reference

extract_rules

extract_rules is a JSON object where each key is a field name and each value is either a selector string or a rule object. It is the heart of how this scraper works without any HTML parsing.

Shorthand syntax:

{"title": "h1", "subtitle": "#subtitle"}

Full rule object:

| Property | Type | Description |
| --- | --- | --- |
| selector | string, required | CSS or XPath selector. XPath is auto-detected when the selector starts with /. |
| selector_type | string | Force "css" or "xpath" instead of auto-detection. |
| output | string or object | What to extract. See below. |
| type | string | "item" (default: first match) or "list" (all matches). |
| clean | boolean | Strips whitespace by default. Set false to preserve formatting. |

Output formats:

| output value | Returns |
| --- | --- |
| text (default) | Visible text content |
| text_relevant | Text with scripts, CSS, headers, and footers removed |
| markdown_relevant | Markdown with irrelevant content trimmed |
| html | Inner HTML |
| @attribute_name | An HTML attribute, e.g. @href for a link's URL |
| table_json | Parses a <table> into JSON objects |
| table_array | Parses a <table> into nested arrays |

Nested rules — extract a list of structured objects:

{
  "reviews": {
    "selector": "#cm-cr-dp-review-list > li",
    "type": "list",
    "output": {
      "name": ".a-profile-name",
      "rating": ".review-rating > span",
      "link": {"selector": "a.review-title", "output": "@href"}
    }
  }
}

Attribute shorthand: "link": "a@href" is equivalent to "link": {"selector": "a", "output": "@href"}.
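
To make the equivalence between shorthand and full rule objects concrete, here is a small client-side sketch. The API performs this expansion itself and its exact parsing rules are not described here, so expand_rule is a hypothetical helper that only mirrors the stated equivalence for simple CSS selectors. It deliberately leaves XPath selectors (those starting with /) untouched.

```python
def expand_rule(rule):
    """Expand a shorthand extract rule into a full rule object (illustrative only)."""
    if isinstance(rule, dict):
        expanded = dict(rule)
        # Nested rules live under "output" when it is itself a dict.
        if isinstance(expanded.get("output"), dict):
            expanded["output"] = {
                name: expand_rule(sub_rule)
                for name, sub_rule in expanded["output"].items()
            }
        return expanded

    # XPath selectors are passed through unchanged.
    if rule.startswith("/"):
        return {"selector": rule}

    # "a@href" -> selector "a", output "@href"
    selector, separator, attribute = rule.rpartition("@")
    looks_like_attribute = attribute.replace("-", "").replace("_", "").isalnum()
    if separator and selector and looks_like_attribute:
        return {"selector": selector, "output": "@" + attribute}
    return {"selector": rule}


print(expand_rule("a@href"))
print(expand_rule({"selector": "li", "type": "list", "output": {"link": "a@href"}}))
```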

js_scenario

js_scenario is a JSON object whose instructions list is executed in order before extraction. Maximum runtime per scenario is 40 seconds.

| Instruction | Syntax | Purpose |
| --- | --- | --- |
| wait | {"wait": 2000} | Pause for N milliseconds |
| wait_for | {"wait_for": ".selector"} | Pause until an element exists |
| wait_for_and_click | {"wait_for_and_click": ".selector"} | Wait, then click |
| click | {"click": "#buttonId"} | Click an element |
| scroll_x | {"scroll_x": 1000} | Horizontal scroll in pixels |
| scroll_y | {"scroll_y": 1000} | Vertical scroll in pixels |
| fill | {"fill": ["#input", "value"]} | Type into an input |
| evaluate | {"evaluate": "window.scrollTo(0, document.body.scrollHeight);"} | Run arbitrary JS |
| infinite_scroll | {"infinite_scroll": {"max_count": 0, "delay": 1000}} | Auto-scroll until page end |

All selectors accept CSS or XPath. Set "strict": false on the scenario to allow individual instructions to fail without aborting the whole run.
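
One possible variation on the quick-start scenario replaces the second fixed wait with wait_for on the review list and sets strict to false at the top level of the scenario. This is a sketch under the assumption that #cm-cr-dp-review-list (the selector from the quick start) is present once the widget has loaded; review_scenario is a hypothetical helper.

```python
import json

# Selector taken from the quick-start extract rules.
REVIEW_LIST_SELECTOR = "#cm-cr-dp-review-list"


def review_scenario(strict=False):
    """Build a scenario that scrolls, then waits for the review list to exist."""
    return {
        # With strict set to False, a failing instruction does not abort the run.
        "strict": strict,
        "instructions": [
            {"wait": 2000},
            {"evaluate": "window.scrollTo(0, document.body.scrollHeight);"},
            {"wait_for": REVIEW_LIST_SELECTOR},
        ],
    }


scenario = review_scenario()
print(json.dumps(scenario, indent=2))
```

The trade-off with a non-strict scenario is that a wait_for that never succeeds can surface as an empty review list rather than an error, so it is worth checking the review count on each response.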

Common request parameters

These belong on the params argument of client.get(...).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extract_rules | dict | | Extraction spec (above). |
| js_scenario | dict | | Scenario spec (above). |
| country_code | string | | Two-letter ISO code. "us", "de", "gb". |
| premium_proxy | bool | false | Route through residential proxies. Use when datacenter IPs get blocked. |
| stealth_proxy | bool | false | Toughest-target proxy tier. |
| render_js | bool | true | Run a headless browser. Always on for review scraping. |
| wait | int | | Milliseconds to wait after the page loads, before extraction. |
| wait_for | string | | Wait until a selector exists, then extract. |
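
The proxy parameters above suggest a simple escalation pattern: start with the default proxies and move to a stronger tier only when a request is blocked. The sketch below shows one way to express that. The three-step ordering and the params_for_tier helper are assumptions made for illustration, not an SDK feature.

```python
# Overrides to merge into the request params, from cheapest to strongest.
PROXY_TIERS = [
    {},                       # default proxies
    {"premium_proxy": True},  # residential proxies
    {"stealth_proxy": True},  # toughest-target tier
]


def params_for_tier(base_params, tier):
    """Return a copy of base_params with the proxy settings for the given tier."""
    if not 0 <= tier < len(PROXY_TIERS):
        raise ValueError(f"tier must be between 0 and {len(PROXY_TIERS) - 1}")
    # Build a new dict so the caller's base_params is left unchanged.
    return {**base_params, **PROXY_TIERS[tier]}


base = {"country_code": "us"}
for tier in range(len(PROXY_TIERS)):
    print(tier, params_for_tier(base, tier))
```

A caller would pass the resulting dict as params to client.get(...) and step to the next tier only after a blocked response.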

Scrape reviews across Amazon regions

Set country_code to the locale and change the URL host to match the regional storefront. The two should agree.

# Germany
response = client.get(
    f'https://www.amazon.de/dp/{asin}',
    params={
        "extract_rules": extract_rules,
        "js_scenario": js_scenario,
        "country_code": "de"
    },
    retries=2
)

# United Kingdom
response = client.get(
    f'https://www.amazon.co.uk/dp/{asin}',
    params={
        "extract_rules": extract_rules,
        "js_scenario": js_scenario,
        "country_code": "gb"
    },
    retries=2
)

CSS selectors are the same across regional storefronts, so the extract_rules block does not change. Only the URL and country_code move.
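
Because the URL host and country_code have to agree, it can help to derive both from a single lookup. The sketch below covers only the three storefronts shown in this README. STOREFRONT_HOSTS and storefront_request are hypothetical names, and other regions would need their own entries.

```python
# Storefront hosts for the regions used in this README.
STOREFRONT_HOSTS = {
    "us": "www.amazon.com",
    "de": "www.amazon.de",
    "gb": "www.amazon.co.uk",
}


def storefront_request(asin, country_code):
    """Return (url, params) with a matching host and country_code."""
    try:
        host = STOREFRONT_HOSTS[country_code]
    except KeyError:
        raise ValueError(f"no storefront configured for {country_code!r}") from None
    return f"https://{host}/dp/{asin}", {"country_code": country_code}


url, region_params = storefront_request("B0CTH2QF23", "de")
print(url, region_params)
```

The returned params can be merged with extract_rules and js_scenario before calling client.get(...).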

Load more reviews with infinite scroll

The base script returns whatever reviews Amazon renders on the product page after the bottom-scroll — usually the 8–10 "most helpful". To load more before extraction runs, swap the simple scroll for infinite_scroll:

js_scenario = {
    "instructions": [
        {"wait": 2000},
        {"infinite_scroll": {"max_count": 5, "delay": 1500}},
        {"wait": 2000},
    ]
}

max_count is the number of scroll cycles (0 means scroll until the page stops growing). delay is the wait between scrolls in milliseconds. Higher numbers cost more in credits per request because the headless browser runs longer.
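
Since a scenario may run for at most 40 seconds, a quick lower-bound estimate can catch configurations that cannot fit. The sketch below assumes each infinite_scroll cycle takes at least delay milliseconds and ignores page-load and rendering time, so the real runtime will be longer. minimum_runtime_ms is a hypothetical helper.

```python
MAX_SCENARIO_MS = 40_000  # scenario limit quoted in the js_scenario section


def minimum_runtime_ms(scenario):
    """Sum the explicit waits and scroll delays in a scenario (lower bound only)."""
    total = 0
    for instruction in scenario["instructions"]:
        if "wait" in instruction:
            total += instruction["wait"]
        elif "infinite_scroll" in instruction:
            scroll = instruction["infinite_scroll"]
            # max_count of 0 means "until the page stops growing", which is
            # unbounded; it contributes nothing to this lower bound.
            total += scroll["max_count"] * scroll["delay"]
    return total


js_scenario = {
    "instructions": [
        {"wait": 2000},
        {"infinite_scroll": {"max_count": 5, "delay": 1500}},
        {"wait": 2000},
    ]
}

estimate = minimum_runtime_ms(js_scenario)
print(estimate, estimate <= MAX_SCENARIO_MS)
```

For the scenario above, the estimate is 2000 + 5 × 1500 + 2000 = 11500 ms, well inside the limit.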

Use cases

  • Sentiment analysis. Aggregate star ratings and run NLP over the review bodies to score satisfaction at the product, brand, or category level.
  • Competitor monitoring. Watch how a rival's product is rated week over week. Flag drops.
  • Voice of customer. Surface the phrases real buyers use — input for ad copy, landing-page rewrites, and feature prioritisation.
  • Catalogue QA. Watch your own listings for review-volume changes that signal a fake-review attack or a quality regression.
  • Training data. Build classification, summarisation, or fine-tuning datasets from honest, unprompted reviews.
  • Academic research. Source structured review data for studies in marketing, NLP, or e-commerce.
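
Aggregating star ratings, as in the sentiment-analysis use case above, needs numeric values, while the scraped rating and date columns are strings. The sketch below parses the English formats shown in the "What you get" table. It assumes US-style strings such as "5.0 out of 5 stars"; other storefronts use different wording and would need their own patterns. parse_rating and parse_review_date are hypothetical helpers.

```python
import re
from datetime import datetime

RATING_RE = re.compile(r"^(\d+(?:\.\d+)?) out of 5 stars$")
DATE_RE = re.compile(r"^Reviewed in (?:the )?(.+?) on (.+)$")


def parse_rating(text):
    """Return the star rating as a float, or None if the text does not match."""
    match = RATING_RE.match(text.strip())
    return float(match.group(1)) if match else None


def parse_review_date(text):
    """Return (country, date) from a review date line, or None if unparseable."""
    match = DATE_RE.match(text.strip())
    if not match:
        return None
    country, date_text = match.groups()
    try:
        return country, datetime.strptime(date_text, "%B %d, %Y").date()
    except ValueError:
        return None


ratings = [parse_rating(r) for r in ["5.0 out of 5 stars", "3.0 out of 5 stars"]]
average = sum(ratings) / len(ratings)
print(average)
print(parse_review_date("Reviewed in the United States on January 12, 2026"))
```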

Why ScrapingBee

  • One API call per page. No proxy pool to maintain, no CAPTCHA solver to wire up, no browser automation framework to manage.
  • CSS-selector extraction. Get clean JSON without writing HTML parsers that break on every DOM change.
  • JavaScript scenarios. Click, scroll, wait, and run custom JS before extraction — needed for Amazon's lazy review widget and most other dynamic pages.
  • Geo-targeting. country_code and premium_proxy parameters return the regional storefront, currency, and stock a real buyer in that location would see.
  • Built-in retries. The Python SDK retries failed requests for you.
  • 1,000 credits free. No credit card required to evaluate.

Best practices

  • Pace your requests. Do not send hundreds of requests per second. The free plan throttles at 5 concurrent requests, which is a reasonable starting ceiling even on paid plans.
  • Retry transient failures. The SDK's retries=2 argument is enough for most cases. Do not retry 4xx errors — those mean the request was wrong, not unlucky.
  • Match country_code to the URL. A US IP requesting amazon.de is a fingerprint Amazon will spot.
  • Never log in. This repo scrapes public review data only. Authenticated scraping is against the ToS and out of scope.
  • Cache results. Reviews rarely change in 24 hours. Cache locally and only re-scrape what has actually moved.
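
The caching advice above can be sketched as a small file cache keyed by ASIN with a 24-hour lifetime. The one-JSON-file-per-ASIN layout, the helper names, and the use of file modification time are all assumptions made for illustration; a database or key-value store would work equally well.

```python
import json
import tempfile
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # re-scrape at most once a day


def load_cached(cache_dir, asin, now=None):
    """Return cached rows for an ASIN, or None if missing or older than the TTL."""
    path = Path(cache_dir) / f"{asin}.json"
    if not path.exists():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > CACHE_TTL_SECONDS:
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def save_cached(cache_dir, asin, rows):
    """Write the rows for an ASIN to the cache directory."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / f"{asin}.json").write_text(json.dumps(rows), encoding="utf-8")


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    save_cached(demo_dir, "B0CTH2QF23", [{"name": "Jane D."}])
    fresh = load_cached(demo_dir, "B0CTH2QF23")
    # Simulate a lookup just past the TTL.
    stale = load_cached(demo_dir, "B0CTH2QF23", now=time.time() + CACHE_TTL_SECONDS + 1)

print(fresh, stale)
```

In the scraping loop, a call to load_cached before client.get(...) would skip the API request whenever fresh rows are available.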

Legal note

Scraping publicly visible Amazon data is generally legal in many jurisdictions, but Amazon's terms of service restrict automated access. A few practical rules:

  • Only collect public, non-authenticated content.
  • Keep request rates reasonable.
  • Personal data scraped from reviews (names, opinions) is subject to GDPR and CCPA. Treat it the way you would any personal data — minimise collection, secure storage, honour deletion requests.

This repository is not legal advice. Review Amazon's ToS and the regulations that apply to your jurisdiction before running anything in production.

FAQ

Can I scrape Amazon reviews without getting blocked?

Yes. The ScrapingBee Web Scraping API automatically manages proxies and headers to avoid blocks, even on JavaScript-heavy pages like Amazon reviews.

Do I need a headless browser to scrape Amazon?

Yes, but not on your machine. ScrapingBee has built-in JavaScript rendering, so you do not need to install or maintain Puppeteer, Playwright, or Selenium locally.

How many reviews can I scrape from Amazon?

As many as you need. Provide the list of ASINs to the script and it will target each product page. Stay within your plan's rate and credit limits.

Can I use this method for different Amazon countries (UK, DE, FR)?

Yes. Set country_code in the API request (e.g. "country_code": "de" for Germany) and change the URL host to match (amazon.de). If a country does not return the expected results, switch on premium_proxy=True for residential IPs.

How do I scrape all the reviews for a product, not just the first page?

Use the infinite_scroll instruction in js_scenario (see "Load more reviews with infinite scroll"). Each scroll cycle loads more reviews into the DOM before extraction runs.

Why am I getting empty results?

Two common causes: the country_code does not match the Amazon regional domain in the URL, or the JavaScript scenario timing is too short for a slow-loading page. Bump the wait values from 2000 to 3500 ms and verify the review widget actually renders.

How much does each request cost?

A standard request with JavaScript rendering and a basic scenario is 25 credits. Using premium_proxy doubles that, and stealth_proxy raises it further. The free 1,000-credit tier gets you roughly 40 of these requests for evaluation.

Can I scrape reviews from third-party sellers?

This script targets the product detail page, so it returns the product's reviews regardless of which seller owns the buy box at request time. Seller-specific review pages are a separate URL pattern and a separate scrape.