How to Scrape Amazon Product Data in 2026

Amazon is the world’s largest e-commerce platform, holding massive amounts of product data, customer feedback, and market trend information. Whether you’re a seller monitoring competitors, a researcher analyzing market dynamics, or a developer building price tracking tools, Amazon data offers tremendous value.

However, Amazon is also widely recognized as one of the most difficult websites to scrape, with sophisticated anti-bot mechanisms that frustrate many developers. This guide provides a complete Amazon data scraping solution—from hands-on Python techniques to overcoming large-scale scraping challenges, and finally leveraging residential proxies to build a stable, efficient data collection pipeline.

Sign Up for a Free Trial of kookeey Global Proxy

Why Scrape Amazon Data?

As the largest global e-commerce platform, Amazon’s data delivers value across multiple dimensions:

| Use Case | Data Types | Business Value |
| --- | --- | --- |
| Competitor Monitoring | Prices, ratings, review counts | Adjust pricing strategies in real time, stay competitive |
| Product Research | Bestseller lists, new releases | Discover hot categories, optimize inventory decisions |
| Review Analysis | Review text, rating trends | Understand customer pain points, improve product design |
| SEO Optimization | Titles, keywords, rankings | Optimize listings, increase search visibility |

Manual Amazon Data Scraping with Python

Before writing code, you need to understand Amazon’s page structure and anti-scraping characteristics.

Environment Setup

# Create project directory
mkdir amazon-scraper && cd amazon-scraper

# Install required libraries
pip3 install beautifulsoup4 requests pandas playwright
playwright install

Understanding Amazon’s Page Structure

Amazon’s product listing pages and detail pages expose data differently:

| Page Type | Data Content | Loading Method | Scraping Difficulty |
| --- | --- | --- | --- |
| Listing Pages | Titles, prices, ratings, review counts | Server-side rendering + dynamic loading | Medium |
| Detail Pages | Descriptions, variants, Q&A, reviews | Heavy dynamic content | High |

To analyze page structure, right-click a webpage element, select “Inspect,” and examine HTML tags and attributes in developer tools. Focus on:

  • Product card containers (often have data-component-type="s-search-result" attribute)
  • Price element selectors (e.g., .a-price-whole)
  • Rating element tag structure
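As a minimal illustration of how those selectors map to parsing code, the sketch below extracts the title, price, and ASIN from a simplified, hypothetical product-card snippet. Amazon's real markup is far more complex and changes frequently, so treat the HTML here as a stand-in, not actual Amazon output:

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical product card -- NOT Amazon's real markup.
html = """
<div data-component-type="s-search-result" data-asin="B0TEST1234">
  <h2><a><span>Example Wireless Headphones</span></a></h2>
  <span class="a-price">
    <span class="a-price-whole">29</span>
    <span class="a-price-fraction">99</span>
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.select_one('div[data-component-type="s-search-result"]')

# The same selectors you identify in developer tools drive the extraction.
asin = card["data-asin"]
title = card.select_one("h2 a span").get_text()
price = (
    f"${card.select_one('.a-price-whole').get_text()}"
    f".{card.select_one('.a-price-fraction').get_text()}"
)

print(asin, title, price)  # B0TEST1234 Example Wireless Headphones $29.99
```

Verifying your selectors against a saved HTML snippet like this is much faster than re-running a live scrape after every tweak.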

Basic Scraper Code: Extract Amazon Product Listings

The following code uses Playwright’s async mode to scrape product information from Amazon search results pages:

import asyncio
from playwright.async_api import async_playwright
import pandas as pd
import random

async def scrape_amazon_search(keyword="headphones", max_pages=1):
    """
    Scrape product information from Amazon search results pages
    """
    async with async_playwright() as pw:
        # Launch browser
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()

        # Set random User-Agent
        await page.set_extra_http_headers({
            "User-Agent": random.choice([
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            ])
        })

        all_products = []

        for page_num in range(1, max_pages + 1):
            # Build search URL
            url = f"https://www.amazon.com/s?k={keyword}&page={page_num}"
            print(f"Scraping page {page_num}: {url}")

            try:
                # Navigate to page
                await page.goto(url, timeout=60000, wait_until="domcontentloaded")

                # Wait for product cards to load
                await page.wait_for_selector('div[data-component-type="s-search-result"]', timeout=10000)

                # Extract all product cards
                products = await page.query_selector_all('div[data-component-type="s-search-result"]')

                for product in products:
                    try:
                        # Extract title
                        title_elem = await product.query_selector('h2 a span')
                        title = await title_elem.inner_text() if title_elem else "N/A"

                        # Extract price
                        price_whole = await product.query_selector('.a-price-whole')
                        price_fraction = await product.query_selector('.a-price-fraction')

                        if price_whole and price_fraction:
                            price = f"${await price_whole.inner_text()}.{await price_fraction.inner_text()}"
                        else:
                            price = "N/A"

                        # Extract rating
                        rating_elem = await product.query_selector('span[aria-label*="out of 5 stars"]')
                        rating = await rating_elem.get_attribute('aria-label') if rating_elem else "N/A"

                        # Extract review count
                        reviews_elem = await product.query_selector('span[aria-label*="stars"] + span a')
                        reviews = await reviews_elem.inner_text() if reviews_elem else "0"

                        # Extract ASIN
                        asin = await product.get_attribute('data-asin')

                        all_products.append({
                            "title": title[:100] + "..." if len(title) > 100 else title,
                            "price": price,
                            "rating": rating,
                            "reviews": reviews,
                            "asin": asin,
                            "page": page_num
                        })
                    except Exception as e:
                        print(f"Error parsing individual product: {e}")
                        continue

                # Random delay to avoid being detected as a bot
                await asyncio.sleep(random.uniform(2, 5))

            except Exception as e:
                print(f"Failed to scrape page {page_num}: {e}")
                break

        await browser.close()
        return all_products

# Run the scraper
results = asyncio.run(scrape_amazon_search(keyword="wireless earbuds", max_pages=2))

# Save to CSV
df = pd.DataFrame(results)
df.to_csv('amazon_products.csv', index=False, encoding='utf-8-sig')
print(f"Scraping complete! Total products: {len(results)}")

Common Issues and Solutions

Even with correct code, first runs may fail. Here are key strategies to address common problems:

| Problem | Symptoms | Solutions |
| --- | --- | --- |
| Request Rejection | Returns 503, 403 status codes | Rotate User-Agent, use proxy IPs |
| CAPTCHA Appears | Page redirects to verification page | Lower request frequency, use residential proxies |
| Incomplete Data | Only some products load | Increase wait_for_selector timeout |
| IP Blocked | All requests fail | Switch to a new proxy IP |
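A common first mitigation for 503/403 rejections is retrying with exponential backoff plus random jitter. The sketch below is generic: `fetch` is any callable returning an object with a `status_code` attribute (for example `requests.get`), so you can plug in whatever HTTP client the rest of your scraper uses:

```python
import random
import time

RETRYABLE = {403, 503}  # statuses Amazon commonly returns when throttling


def fetch_with_retry(fetch, url, max_retries=4, base_delay=2.0):
    """Call fetch(url) until it succeeds or retries are exhausted.

    The delay doubles on each attempt, with random jitter added so the
    retry pattern itself doesn't look machine-generated.
    """
    for attempt in range(max_retries + 1):
        response = fetch(url)
        if response.status_code not in RETRYABLE:
            return response
        delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
        print(f"Got {response.status_code}, retrying in {delay:.1f}s")
        time.sleep(delay)
    return response  # still failing; let the caller decide what to do
```

If a URL still fails after several backed-off attempts, the IP is likely flagged and rotating to a new proxy is more productive than retrying further.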

Advanced Amazon Scraping Techniques

When moving from one-time scrapes to regular collection, master these advanced techniques:

Handling Pagination

Amazon result lists span multiple pages; paginate using the &page={n} URL parameter:

import random
import time

base_url = "https://www.amazon.com/s?k=headphones"

for page in range(1, 6):
    # Build paginated URL
    url = f"{base_url}&page={page}"

    # Scraping logic...

    # Key: random delay to avoid throttling
    time.sleep(random.uniform(3, 7))

Filtering Out Ads and Sponsored Products

Search results include ads that need filtering:

# Detect if "Sponsored" tag exists
sponsored = await product.query_selector('span:has-text("Sponsored")')
if sponsored:
    continue  # Skip ad products

Handling Dynamically Loaded Content

Areas like reviews and Q&A load via AJAX; use explicit waits:

# Wait for "Load More" button to appear
await page.wait_for_selector('li.a-last a', timeout=5000)

# Click to load more
await page.click('li.a-last a')

# Wait for new content
await page.wait_for_timeout(2000)

# Continue scraping newly loaded content

Anti-Blocking Strategy Summary

| Strategy | Implementation | Effect |
| --- | --- | --- |
| Random Delays | time.sleep(random.uniform(2, 5)) | Avoids regular request patterns |
| User-Agent Rotation | Maintain a list of UAs, select randomly | Reduces browser fingerprinting |
| Concurrency Limiting | Control the number of parallel scrapers | Prevents triggering traffic-anomaly alerts |
| Proxy IP Usage | Configure residential proxy rotation | The most effective anti-ban measure |
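Concurrency limiting is straightforward with an `asyncio.Semaphore`: only a fixed number of coroutines can hold it at once, so parallel page fetches stay bounded. The sketch below uses a placeholder sleep where the real Playwright scraping logic would go:

```python
import asyncio
import random

MAX_CONCURRENT = 3  # tune this to stay under the platform's anomaly thresholds


async def scrape_one(sem, url):
    async with sem:  # at most MAX_CONCURRENT coroutines run this body at once
        # Placeholder for real scraping logic (e.g. the Playwright code above).
        await asyncio.sleep(random.uniform(0.05, 0.15))
        return f"scraped {url}"


async def scrape_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(scrape_one(sem, u) for u in urls))


urls = [f"https://www.amazon.com/s?k=headphones&page={n}" for n in range(1, 7)]
results = asyncio.run(scrape_all(urls))
print(len(results))  # 6
```

Combined with random delays inside each task, this keeps total request volume smooth instead of bursty.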

Scaling Challenges and Solutions

Common Scaling Issues

When upgrading to regular collection, you’ll encounter:

| Challenge | Symptoms | Root Cause |
| --- | --- | --- |
| IP Bans | Requests return 503 errors or CAPTCHAs | High-frequency access detected from the same IP |
| Data Inconsistency | Different results across time periods | Geolocation or login status affects returned content |
| High Maintenance | Frequent selector adjustments | Amazon frontend code updates |
| Slow Speed | A single IP gets throttled | Distributed collection is needed |

Why Stable Proxy IPs Are Essential

Proxy IPs are the foundation of large-scale scraping, delivering core value:

IP Rotation Distributes Requests

  • Spread requests across millions of IPs, mimicking normal user access patterns
  • Sidestep per-IP rate limits and blacklists instead of hitting them repeatedly
  • Drastically reduce the probability of bot detection
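On the client side, a simple rotation pattern cycles through a pool of proxy endpoints so consecutive requests leave from different IPs. The endpoints below are placeholders (gateway-style providers such as kookeey typically rotate server-side behind a single gateway address, in which case this loop is unnecessary):

```python
from itertools import cycle

# Placeholder endpoints -- replace with your provider's real credentials.
PROXY_POOL = [
    "http://user:pass@proxy-a.example.com:8000",
    "http://user:pass@proxy-b.example.com:8000",
    "http://user:pass@proxy-c.example.com:8000",
]

proxy_cycle = cycle(PROXY_POOL)


def next_proxies():
    """Return a requests-style proxies dict using the next IP in the pool."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}


# Each call hands back the next proxy in round-robin order:
first, second = next_proxies(), next_proxies()
print(first["http"])   # proxy-a endpoint
print(second["http"])  # proxy-b endpoint
```

Pass the returned dict as the `proxies=` argument to `requests.get`, as shown in the verification snippet later in this article.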

Access Geo-Localized Data

  • Lock exit IP to specific cities or countries
  • Obtain region-exclusive search results and prices
  • Enable localized market research

Maintain Session Consistency

  • Sticky sessions keep IP unchanged during tasks
  • Ideal for login or cart-adding scenarios
  • Prevent task interruption from frequent IP switches

kookeey Residential Proxy Solution

kookeey offers professional residential proxy services to solve large-scale scraping challenges:

Massive Clean IP Pool

  • Over 55 million real residential IPs worldwide
  • IPs originate from real users, making them hard to identify as proxies
  • Minimize platform blocking risk

Precise Geo-Targeting

  • Support city and country-level precise targeting
  • Easily obtain localized data from target markets
  • Meet regional market research needs

Flexible Session Control

  • Support sticky sessions up to 24 hours
  • Satisfy scenarios requiring persistent login states
  • Avoid task interruption from frequent IP switching

Free Benefits for kookeey New Users 🎁

  • 200MB residential traffic
  • ¥288 bonus pack
  • 100MB mobile traffic
  • 100% dedicated ISP IPs
  • Dedicated port / API access supported

Integrating Proxy in Code

Integrate kookeey proxy into your Playwright scraper:

# Kookeey proxy configuration
proxy = {
    "server": "http://gate.kookee.info:15959",
    "username": "YOUR_KOOKEEY_USERNAME",
    "password": "YOUR_KOOKEEY_PASSWORD"
}

# Configure proxy when launching browser
browser = await pw.chromium.launch(
    headless=True,
    proxy=proxy
)

Verifying Proxy Effectiveness

Before large-scale scraping, verify your proxy configuration:

import requests

# Kookeey proxy config
proxies = {
    "http": "http://YOUR_USERNAME:YOUR_PASSWORD@gate.kookee.info:15959",
    "https": "http://YOUR_USERNAME:YOUR_PASSWORD@gate.kookee.info:15959"
}

# Access IP detection site
try:
    response = requests.get(
        'https://lumtest.com/myip.json',
        proxies=proxies,
        timeout=10
    )
    ip_info = response.json()
    print(f"Proxy configured successfully! Current IP: {ip_info['ip']}")
    print(f"Location: {ip_info['country']} - {ip_info['city']}")
except Exception as e:
    print(f"Proxy connection failed: {e}")

Summary: Build Your Amazon Data Collection Blueprint

Three-Stage Upgrade Path

| Stage | Goal | Core Strategy | Tool Selection |
| --- | --- | --- | --- |
| Startup Stage | Validate ideas, small-scale testing | Understand page structure, master basic scraping | Python + Playwright, random delays |
| Growth Stage | Regular collection, stable operation | Introduce proxy IPs, optimize anti-ban tactics | Kookeey proxy + User-Agent rotation |
| Scale Stage | Large-scale, distributed collection | IP rotation + geo-targeting + sticky sessions | Kookeey residential proxies + concurrency control |

With these methods, you can upgrade simple manual scripts into business-grade data collection systems capable of handling regular, large-scale scraping tasks.

