Amazon is the world’s largest e-commerce platform, holding massive amounts of product data, customer feedback, and market trend information. Whether you’re a seller monitoring competitors, a researcher analyzing market dynamics, or a developer building price tracking tools, Amazon data offers tremendous value.
However, Amazon is also widely recognized as one of the most difficult websites to scrape, with sophisticated anti-bot mechanisms that frustrate many developers. This guide provides a complete Amazon data scraping solution—from hands-on Python techniques to overcoming large-scale scraping challenges, and finally leveraging residential proxies to build a stable, efficient data collection pipeline.
Why Scrape Amazon Data?
As the largest global e-commerce platform, Amazon’s data delivers value across multiple dimensions:
| Use Case | Data Types | Business Value |
|---|---|---|
| Competitor Monitoring | Prices, ratings, review counts | Adjust pricing strategies in real time, stay competitive |
| Product Research | Bestseller lists, new releases | Discover hot categories, optimize inventory decisions |
| Review Analysis | Review text, rating trends | Understand customer pain points, improve product design |
| SEO Optimization | Titles, keywords, rankings | Optimize listings, increase search visibility |
Manual Amazon Data Scraping with Python
Before writing code, you need to understand Amazon’s page structure and anti-scraping characteristics.
Environment Setup
```bash
# Create project directory
mkdir amazon-scraper && cd amazon-scraper

# Install required libraries
pip3 install beautifulsoup4 requests pandas playwright
playwright install
```
Understanding Amazon’s Page Structure
Amazon’s product listing pages and detail pages expose data differently:
| Page Type | Data Content | Loading Method | Scraping Difficulty |
|---|---|---|---|
| Listing Pages | Titles, prices, ratings, review counts | Server-side rendering + dynamic loading | Medium |
| Detail Pages | Descriptions, variants, Q&A, reviews | Heavy dynamic content | High |
To analyze page structure, right-click a webpage element, select “Inspect,” and examine HTML tags and attributes in developer tools. Focus on:
- Product card containers (often carry a `data-component-type="s-search-result"` attribute)
- Price element selectors (e.g., `.a-price-whole`)
- Rating element tag structure
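To check that these selectors match what you see in developer tools, you can run them against a snippet of page HTML with BeautifulSoup (installed during environment setup). This is a minimal sketch: the inline HTML below is a hand-made stand-in for one product card, not real Amazon markup.

```python
from bs4 import BeautifulSoup

# A minimal stand-in for one product card from a search-results page
html = '''
<div data-component-type="s-search-result" data-asin="B000TEST1">
  <h2><a><span>Example Wireless Earbuds</span></a></h2>
  <span class="a-price-whole">29</span><span class="a-price-fraction">99</span>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select('div[data-component-type="s-search-result"]'):
    title = card.select_one("h2 a span")
    whole = card.select_one(".a-price-whole")
    frac = card.select_one(".a-price-fraction")
    rows.append({
        "asin": card.get("data-asin"),
        "title": title.get_text(strip=True) if title else "N/A",
        "price": f"${whole.get_text(strip=True)}.{frac.get_text(strip=True)}"
                 if whole and frac else "N/A",
    })
print(rows)
```

Saving a results page from your browser and feeding it through the same loop is a cheap way to debug selectors before running a live scraper.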
Basic Scraper Code: Extract Amazon Product Listings
The following code uses Playwright’s async mode to scrape product information from Amazon search results pages:
```python
import asyncio
import random

import pandas as pd
from playwright.async_api import async_playwright


async def scrape_amazon_search(keyword="headphones", max_pages=1):
    """Scrape product information from Amazon search results pages."""
    async with async_playwright() as pw:
        # Launch browser
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        # Set a random User-Agent
        await page.set_extra_http_headers({
            "User-Agent": random.choice([
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            ])
        })
        all_products = []
        for page_num in range(1, max_pages + 1):
            # Build the search URL
            url = f"https://www.amazon.com/s?k={keyword}&page={page_num}"
            print(f"Scraping page {page_num}: {url}")
            try:
                # Navigate to the page
                await page.goto(url, timeout=60000, wait_until="domcontentloaded")
                # Wait for product cards to load
                await page.wait_for_selector(
                    'div[data-component-type="s-search-result"]', timeout=10000
                )
                # Extract all product cards
                products = await page.query_selector_all(
                    'div[data-component-type="s-search-result"]'
                )
                for product in products:
                    try:
                        # Extract title
                        title_elem = await product.query_selector('h2 a span')
                        title = await title_elem.inner_text() if title_elem else "N/A"
                        # Extract price (whole + fraction parts)
                        price_whole = await product.query_selector('.a-price-whole')
                        price_fraction = await product.query_selector('.a-price-fraction')
                        if price_whole and price_fraction:
                            price = f"${await price_whole.inner_text()}.{await price_fraction.inner_text()}"
                        else:
                            price = "N/A"
                        # Extract rating
                        rating_elem = await product.query_selector(
                            'span[aria-label*="out of 5 stars"]'
                        )
                        rating = await rating_elem.get_attribute('aria-label') if rating_elem else "N/A"
                        # Extract review count
                        reviews_elem = await product.query_selector(
                            'span[aria-label*="stars"] + span a'
                        )
                        reviews = await reviews_elem.inner_text() if reviews_elem else "0"
                        # Extract ASIN (Amazon's product identifier)
                        asin = await product.get_attribute('data-asin')
                        all_products.append({
                            "title": title[:100] + "..." if len(title) > 100 else title,
                            "price": price,
                            "rating": rating,
                            "reviews": reviews,
                            "asin": asin,
                            "page": page_num,
                        })
                    except Exception as e:
                        print(f"Error parsing individual product: {e}")
                        continue
                # Random delay to avoid being detected as a bot
                await asyncio.sleep(random.uniform(2, 5))
            except Exception as e:
                print(f"Failed to scrape page {page_num}: {e}")
                break
        await browser.close()
        return all_products


# Run the scraper
results = asyncio.run(scrape_amazon_search(keyword="wireless earbuds", max_pages=2))

# Save to CSV
df = pd.DataFrame(results)
df.to_csv('amazon_products.csv', index=False, encoding='utf-8-sig')
print(f"Scraping complete! Total products: {len(results)}")
```
Common Issues and Solutions
Even with correct code, first runs may fail. Here are key strategies to address common problems:
| Problem | Symptoms | Solutions |
|---|---|---|
| Request Rejection | Returns 503, 403 status codes | Rotate User-Agent, use proxy IPs |
| CAPTCHA Appears | Page redirects to verification page | Lower request frequency, use residential proxies |
| Incomplete Data | Only some products load | Increase wait_for_selector timeout |
| IP Blocked | All requests fail | Switch to new proxy IP |
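The first two rows of the table can be combined into a single retry helper: rotate the User-Agent on each attempt and back off exponentially when the server answers 403 or 503. This is a sketch, not the article's scraper; the function name `fetch_with_retry` and the retry counts are my own choices.

```python
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def fetch_with_retry(url, max_retries=3, session=None):
    """Retry on 403/503 responses with a fresh User-Agent and exponential backoff."""
    if session is None:
        import requests  # installed earlier via pip
        session = requests.Session()
    resp = None
    for attempt in range(1, max_retries + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        resp = session.get(url, headers=headers, timeout=15)
        if resp.status_code not in (403, 503):
            return resp
        # Back off before retrying: ~2s, 4s, 8s plus jitter
        time.sleep(2 ** attempt + random.uniform(0, 1))
    return resp
```

If the final attempt still returns 503, that is usually the signal to switch proxy IPs rather than keep hammering the same one.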
Advanced Amazon Scraping Techniques
When moving from one-time scrapes to regular collection, master these advanced techniques:
Handling Pagination
Amazon result lists span multiple pages; paginate using the `&page={n}` URL parameter:
```python
import random
import time

base_url = "https://www.amazon.com/s?k=headphones"
for page in range(1, 6):
    # Build paginated URL
    url = f"{base_url}&page={page}"
    # Scraping logic...
    # Key: random delay to avoid throttling
    time.sleep(random.uniform(3, 7))
```
Filtering Out Ads and Sponsored Products
Search results include ads that need filtering:
```python
# Detect whether a "Sponsored" tag exists
sponsored = await product.query_selector('span:has-text("Sponsored")')
if sponsored:
    continue  # Skip ad products
```
Handling Dynamically Loaded Content
Areas like reviews and Q&A load via AJAX; use explicit waits:
```python
# Wait for the next-page link to appear
await page.wait_for_selector('li.a-last a', timeout=5000)
# Click to load more results
await page.click('li.a-last a')
# Give the new content time to render
await page.wait_for_timeout(2000)
# Continue scraping the newly loaded content
```
Anti-Blocking Strategy Summary
| Strategy | Implementation | Effect |
|---|---|---|
| Random Delays | time.sleep(random.uniform(2, 5)) | Avoids regular request patterns |
| User-Agent Rotation | Maintain a list of UAs, select randomly | Reduces browser fingerprinting |
| Concurrency Limiting | Control number of parallel scrapers | Prevents triggering traffic anomaly alerts |
| Proxy IP Usage | Configure residential proxy rotation | Most effective anti-ban measure |
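The "Concurrency Limiting" row above maps naturally onto `asyncio.Semaphore`. The sketch below caps in-flight scrapes at three; the simulated `asyncio.sleep` stands in for the real Playwright logic, and the limit of 3 is an assumption, not a recommended value.

```python
import asyncio
import random

async def scrape_one(url, sem):
    # The semaphore caps how many scrapes run concurrently
    async with sem:
        # ... real Playwright scraping logic would go here ...
        await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated work
        return url

async def main(max_concurrency=3):
    sem = asyncio.Semaphore(max_concurrency)
    urls = [f"https://www.amazon.com/s?k=headphones&page={n}" for n in range(1, 11)]
    return await asyncio.gather(*(scrape_one(u, sem) for u in urls))

results = asyncio.run(main())
print(f"Scraped {len(results)} pages")
```

Tuning `max_concurrency` trades throughput against the risk of tripping traffic-anomaly detection; start low and raise it only while error rates stay flat.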
Scaling Challenges and Solutions
Common Scaling Issues
When upgrading to regular collection, you’ll encounter:
| Challenge | Symptoms | Root Cause |
|---|---|---|
| IP Bans | Requests return 503 errors or CAPTCHAs | High-frequency access detected from same IP |
| Data Inconsistency | Different results across time periods | Geolocation or login status affects returned content |
| High Maintenance | Frequent selector adjustments | Amazon frontend code updates |
| Slow Speed | Single IP gets throttled | Need distributed collection |
Why Stable Proxy IPs Are Essential
Proxy IPs are the foundation of large-scale scraping, delivering core value:
IP Rotation Distributes Requests
- Spread requests across millions of IPs, mimicking normal user access patterns
- Sidestep per-IP rate limits and blacklists
- Drastically reduce bot detection probability
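A simple way to distribute requests is to cycle through a pool of proxy endpoints. The gateways below are placeholders I made up for illustration; substitute your provider's real addresses.

```python
import itertools

# Hypothetical proxy pool -- replace these placeholder gateways with real ones
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict, rotating through the pool."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}

# Each call hands back the next exit point, e.g.:
#   requests.get(url, proxies=next_proxies(), timeout=10)
print(next_proxies()["http"])
```

With a gateway-style residential proxy, rotation often happens server-side and a single endpoint suffices; the round-robin above is for when you manage a list of endpoints yourself.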
Access Geo-Localized Data
- Lock exit IP to specific cities or countries
- Obtain region-exclusive search results and prices
- Enable localized market research
Maintain Session Consistency
- Sticky sessions keep IP unchanged during tasks
- Ideal for login or cart-adding scenarios
- Prevent task interruption from frequent IP switches
kookeey Residential Proxy Solution
kookeey offers professional residential proxy services to solve large-scale scraping challenges:
Massive Clean IP Pool
- Over 55 million real residential IPs worldwide
- IPs originate from real users, making them hard to identify as proxies
- Minimize platform blocking risk
Precise Geo-Targeting
- Support city and country-level precise targeting
- Easily obtain localized data from target markets
- Meet regional market research needs
Flexible Session Control
- Support sticky sessions up to 24 hours
- Satisfy scenarios requiring persistent login states
- Avoid task interruption from frequent IP switching
Integrating Proxy in Code
Integrate kookeey proxy into your Playwright scraper:
```python
# kookeey proxy configuration
proxy = {
    "server": "http://gate.kookee.info:15959",
    "username": "YOUR_KOOKEEY_USERNAME",
    "password": "YOUR_KOOKEEY_PASSWORD",
}

# Configure the proxy when launching the browser
browser = await pw.chromium.launch(
    headless=True,
    proxy=proxy
)
```
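Playwright also accepts the same proxy dict per browser context (`browser.new_context(proxy=...)`), which lets one browser process rotate credentials or sessions between tasks. A small helper, sketched below with a hypothetical name, keeps the configuration in one place:

```python
def kookeey_proxy(username, password, server="http://gate.kookee.info:15959"):
    """Build the proxy dict Playwright expects, pre-filled with the kookeey gateway."""
    return {"server": server, "username": username, "password": password}

# Usage inside the async scraper from earlier:
#   context = await browser.new_context(proxy=kookeey_proxy("USER", "PASS"))
#   page = await context.new_page()
print(kookeey_proxy("USER", "PASS")["server"])
```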
Verifying Proxy Effectiveness
Before large-scale scraping, verify your proxy configuration:
```python
import requests

# kookeey proxy config
proxies = {
    "http": "http://YOUR_USERNAME:YOUR_PASSWORD@gate.kookee.info:15959",
    "https": "http://YOUR_USERNAME:YOUR_PASSWORD@gate.kookee.info:15959",
}

# Query an IP-detection endpoint through the proxy
try:
    response = requests.get(
        'https://lumtest.com/myip.json',
        proxies=proxies,
        timeout=10
    )
    ip_info = response.json()
    print(f"Proxy configured successfully! Current IP: {ip_info['ip']}")
    print(f"Location: {ip_info['country']} - {ip_info['city']}")
except Exception as e:
    print(f"Proxy connection failed: {e}")
```
Summary: Build Your Amazon Data Collection Blueprint
Three-Stage Upgrade Path
| Stage | Goal | Core Strategy | Tool Selection |
|---|---|---|---|
| Startup Stage | Validate ideas, small-scale testing | Understand page structure, master basic scraping | Python + Playwright, random delays |
| Growth Stage | Regular collection, stable operation | Introduce proxy IPs, optimize anti-ban tactics | Kookeey proxy + User-Agent rotation |
| Scale Stage | Large-scale, distributed collection | IP rotation + geo-targeting + sticky sessions | Kookeey residential proxies + concurrency control |
With these methods, you can upgrade simple manual scripts into business-grade data collection systems capable of handling regular, large-scale scraping tasks.
Related Reading Recommendations
- How to Scrape Reddit Data Using Python and Proxies (2026)
- Python Proxy IP Rotation: 3 Common Methods
This article comes from an online submission and does not represent the views of kookeey. If you have any questions, please contact us.