How to scrape Amazon with Scrapy

In this tutorial, we’ll learn how to scrape a movie category on Amazon using Python for the following information: title, summary, runtime, release year, rating, and buy price. We'll be using Scrapy, an open-source Python framework for web crawling.

Python is a popular programming language for web scraping. Web scraping is the process of extracting useful data from websites through the use of bots or web spiders. This has become very important for product optimization, marketing analysis, sentiment analysis, price comparison, etc. It is also an important skill in data science and data engineering for sourcing data.

Prerequisites

To complete this tutorial, you need Python 3 already installed on your system. We’ll install Scrapy from the terminal with the following command:

pip3 install scrapy
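
If the installation succeeded, the following command prints the installed Scrapy version:

scrapy version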

Setting up Scrapy project to scrape Amazon

Let’s navigate to the directory where we want our Scrapy project to live, then enter the command below in the terminal:

scrapy startproject AmazonMovieScraper

A new project named AmazonMovieScraper will be created, with the following structure:

AmazonMovieScraper/
    scrapy.cfg            # deploy configuration file

    AmazonMovieScraper/   # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Scrapy is somewhat opinionated, so we’ll follow its prescribed project layout to accomplish our task.

How to scrape Amazon movie data

The next step is to create a spider that scrapes movie data from Amazon’s comedy movie category. We’ll use Scrapy’s genspider command in the terminal:

scrapy genspider amspider amazon.com

This command creates a file amspider.py in the spiders subdirectory of our project. Scrapy generates a default spider class for us, and we’ll modify it to suit our purpose.
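
The generated file contains a skeleton roughly like the one below (the exact template varies slightly between Scrapy versions):

import scrapy

class AmspiderSpider(scrapy.Spider):
    name = 'amspider'
    allowed_domains = ['amazon.com']
    start_urls = ['http://amazon.com/']

    def parse(self, response):
        pass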

Just before we modify our spider class, we’ll edit the items.py file in our AmazonMovieScraper directory. We’ll define all the fields we need for scraping Amazon’s movie category. The items.py file should look like this:

import scrapy

class AmazonMovieDetail(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    summary = scrapy.Field()
    runtime = scrapy.Field()
    release_year = scrapy.Field()
    rating = scrapy.Field()
    buy_price = scrapy.Field()

Make all necessary imports into the amspider.py file:

import scrapy
from AmazonMovieScraper.items import AmazonMovieDetail

For clarity, our spider class is shown in parts. See the code snippets below:

class AmspiderSpider(scrapy.Spider):
    name = 'amspider'
    allowed_domains = ['amazon.com']
    start_urls = [
        "https://amzn.to/3XfgcyH"
    ]

The name variable holds the name of our spider, ‘amspider’. Domains allowed for crawling are listed in the allowed_domains class attribute. The start_urls list contains the links from which our spider begins crawling.

A spider class must have a parse() method to handle the responses to the requests generated by the start_requests() method.

NB: Scrapy implements a default start_requests() when it is not defined in a spider class.
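
For reference, the default behaviour is roughly equivalent to the sketch below; you only need to define start_requests() yourself when the initial requests need customizing, for example with extra headers:

def start_requests(self):
    # yield one request per start URL; this mirrors Scrapy's default behaviour
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse)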

We’ll use CSS selectors to target elements in the HTML tree and extract data. See the code below:

def parse(self, response):
    links = response.css('a.a-link-normal.s-no-outline::attr(href)').extract()
    for link in links:
        link = 'https://www.amazon.com' + link
        yield response.follow(link, callback=self.parse_details)

    # this takes the spider to the next page for scraping
    next_links = response.css('span.s-pagination-strip a::attr(href)').extract()
    if next_links:
        next_page_url = 'https://www.amazon.com' + next_links[-1]
        yield response.follow(next_page_url, callback=self.parse)

The response.css('a.a-link-normal.s-no-outline::attr(href)').extract() call returns a list of movie page links. The response object’s follow() method takes two arguments: a link and a callback, parse_details(), which does the actual scraping of our movie data. The parse_details() method scrapes the data from each movie page on Amazon, as seen below:

def parse_details(self, response):
    title = response.css('h1._2IIDsE._2sL6wP::text').extract_first()
    summary = response.css('div._3qsVvm._1wxob_ > div::text').extract_first()
    runtime = response.css('span.XqYSS8:nth-of-type(2) > span::text').extract_first()
    release_year = response.css('span.XqYSS8:nth-of-type(3) > span::text').extract_first()
    rating = response.css('span.FDDgZI > span::text').extract_first()
    prices = response.css('button.SPqQmU._3RF4FN._1D7HW3._2G6lpB.tvod-button.av-button > span::text').extract()
    # the buy price is the fourth text node of the button; guard against missing data
    buy_price = prices[3] if len(prices) > 3 else None

    yield AmazonMovieDetail(title=title, summary=summary, runtime=runtime,
                            release_year=release_year, rating=rating, buy_price=buy_price)

We examined the HTML tree and targeted the elements holding the data we need with the right CSS selectors. The scraped fields are used to create a movie item in the last line of code.
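
A convenient way to find and verify such selectors before writing them into the spider is Scrapy’s interactive shell. Pass it any movie page URL (the one below is just a placeholder) and the response object is available for experimenting:

scrapy shell 'https://www.amazon.com/some-movie-page'
>>> response.css('span.FDDgZI > span::text').extract_first()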

Let’s put all the code in our spider class together:

import scrapy
from AmazonMovieScraper.items import AmazonMovieDetail

class AmspiderSpider(scrapy.Spider):
    name = 'amspider'
    allowed_domains = ['amazon.com']
    start_urls = [
        "https://amzn.to/3XfgcyH"
    ]

    def parse(self, response):
        links = response.css('a.a-link-normal.s-no-outline::attr(href)').extract()
        for link in links:
            link = 'https://www.amazon.com' + link
            yield response.follow(link, callback=self.parse_details)

        # this takes the spider to the next page for scraping
        next_links = response.css('span.s-pagination-strip a::attr(href)').extract()
        if next_links:
            next_page_url = 'https://www.amazon.com' + next_links[-1]
            yield response.follow(next_page_url, callback=self.parse)

    def parse_details(self, response):
        title = response.css('h1._2IIDsE._2sL6wP::text').extract_first()
        summary = response.css('div._3qsVvm._1wxob_ > div::text').extract_first()
        runtime = response.css('span.XqYSS8:nth-of-type(2) > span::text').extract_first()
        release_year = response.css('span.XqYSS8:nth-of-type(3) > span::text').extract_first()
        rating = response.css('span.FDDgZI > span::text').extract_first()
        prices = response.css('button.SPqQmU._3RF4FN._1D7HW3._2G6lpB.tvod-button.av-button > span::text').extract()
        # the buy price is the fourth text node of the button; guard against missing data
        buy_price = prices[3] if len(prices) > 3 else None

        yield AmazonMovieDetail(title=title, summary=summary, runtime=runtime,
                                release_year=release_year, rating=rating, buy_price=buy_price)

In summary, our spider takes a URL to begin crawling, and the parse() method handles each response by picking out the movie page links and the link to the next page. The parse_details() method scrapes the required data fields and yields a movie item for each response received.

To begin scraping data from Amazon, we run the following command in the terminal:

scrapy crawl amspider -o movies.json

When we run the above command, our movie data is scraped and saved into a JSON file, which will appear in the root directory of our Scrapy project.
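
Scrapy’s feed exports support other formats out of the box too; for example, the same data can be saved as CSV simply by changing the output file extension:

scrapy crawl amspider -o movies.csv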

The HTML structure of a website can change over time, so a scraping script for a particular site will occasionally need modification.

Some settings tweaks to improve scraping

When scraping large amounts of data, we risk getting blocked by sites like Amazon if we flood their servers with requests, and our own system can slow down as well.

There are a few adjustments we can make to improve scraping on a large scale.

Go to the settings.py file and add the following settings; they are gathered into a single snippet after the explanations below.

DOWNLOAD_TIMEOUT = 540 - This increases how long the downloader waits for a response, up from the default of 180 seconds.

DOWNLOAD_DELAY = 5 - This defines the number of seconds the downloader waits between consecutive requests to the same website, which slows down the hit on the servers.

DEPTH_LIMIT = 3 - This specifies the maximum depth to crawl on a website. In our case, the spider will not scrape beyond the third page.

ROBOTSTXT_OBEY = False - This makes the spider disregard the site’s robots.txt policies.
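
Put together, the relevant portion of settings.py would look like this:

# increase the downloader's timeout from the 180-second default
DOWNLOAD_TIMEOUT = 540
# wait 5 seconds between requests to the same site
DOWNLOAD_DELAY = 5
# don't crawl deeper than three pages
DEPTH_LIMIT = 3
# disregard the site's robots.txt policies
ROBOTSTXT_OBEY = False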

Conclusion

In this tutorial, we explored how to scrape needed data from the e-commerce giant Amazon. We used Python’s web scraping framework Scrapy to extract movie data from a category of interest. There are other web scraping libraries out there to explore.

Scrapy provides an easy way to set up a spider for scraping the web; the framework comes with batteries included.

There are other advanced techniques we can explore to scrape data from websites without getting banned, such as proxy rotation, using a real user agent, setting a referrer, etc.
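
As a small example, a realistic user agent and a referrer can be set directly in settings.py; the values below are purely illustrative:

# example values only: substitute a user agent matching real browser traffic
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'https://www.google.com/',
}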
