
Web Scraping in Python using Scrapy (with multiple examples)


Introduction

The explosion of the internet has been a boon for data enthusiasts. The variety and quantity of data available online today is a treasure trove of secrets and mysteries waiting to be solved. For example, if you are planning a trip, how about scraping a few travel recommendation sites, pulling out the comments about various things to do, and seeing which property is getting the most positive responses from users? The list of use cases is endless.

Yet there is no fixed methodology for extracting such data, and much of it is unstructured and full of noise.

Such conditions make web scraping a necessary technique in a data scientist’s toolkit. As it is rightly said,

Any content that can be viewed on a webpage can be scraped. Period.

In the same spirit, you will build different kinds of web scraping systems in this article and learn about some of the challenges and ways to tackle them.

By the end of this article, you will know a framework for scraping the web and will have scraped multiple websites. Let’s go!

 

Table of Contents

  1. Overview of Scrapy
  2. Write your first Web Scraping code with Scrapy
    1. Set up your system
    2. Scraping Reddit: Fast Experimenting with Scrapy Shell
    3. Writing Custom Scrapy Spiders
  3. Case Studies using Scrapy
    1. Scraping an E-Commerce site
    2. Scraping Techcrunch: Create your own RSS Feed Reader

 

1. Overview of Scrapy

Scrapy is a Python framework for large-scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process it as you want, and store it in your preferred structure and format.

As diverse as the internet is, there is no “one size fits all” approach to extracting data from websites. Ad hoc approaches are often taken, and if you start writing code for every little task you perform, you will eventually end up creating your own scraping framework. Scrapy is that framework.

With Scrapy you don’t need to reinvent the wheel.

Note: There are no specific prerequisites for this article, though a basic knowledge of HTML and CSS is preferred. If you still think you need a refresher, do a quick read of this article.

 

2. Write your first Web Scraping code with Scrapy

We will first take a quick look at how to set up your system for web scraping and then see how to build a simple web scraping system for extracting data from the Reddit website.

 

2.1 Set up your system

At the time of writing, Scrapy supports both Python 2 and Python 3. If you’re using Anaconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and OS X.

To install Scrapy using conda, run:

conda install -c conda-forge scrapy

Alternatively, if you’re on Linux or Mac OS X, you can install Scrapy directly with pip:

pip install scrapy

Note: This article will follow Python 2 with Scrapy.

 

2.2 Scraping Reddit: Fast Experimenting with Scrapy Shell

Recently there was a season launch of a prominent TV series (GoT S7) and social media was on fire: people all around were posting memes, theories, their reactions and so on. I had just learnt Scrapy and was wondering whether it could be used to catch a glimpse of people’s reactions.

 

Scrapy Shell

I love the Python shell; it helps me “try out” things before I implement them in detail. Similarly, Scrapy provides a shell of its own that you can use to experiment. To start the Scrapy shell, type the following in your command line:

scrapy shell

Woah! Scrapy wrote a bunch of stuff. For now, you don’t need to worry about it. In order to get information from Reddit (about GoT) you will first have to run a crawler on it. A crawler is a program that browses web sites and downloads content. Crawlers are sometimes also referred to as spiders.

 

About Reddit

Reddit is a discussion forum website. It allows users to create “subreddits” for a single topic of discussion. It supports all the features that conventional discussion portals have, like creating a post, voting, replying to a post, including images and links etc. Reddit also ranks posts based on their votes using a ranking algorithm of its own.

A crawler needs a starting point from which to start crawling (downloading) content. On googling “game of thrones Reddit”, I found that Reddit has a subreddit exclusively for Game of Thrones at https://www.reddit.com/r/gameofthrones/. This will be the crawler’s start URL.

To run the crawler, type the following in the shell:

fetch("https://www.reddit.com/r/gameofthrones/")

When you crawl something with scrapy it returns a “response” object that contains the downloaded information. Let’s see what the crawler has downloaded:

view(response)

This command will open the downloaded page in your default browser.

Wow, that looks exactly like the website; the crawler has successfully downloaded the entire web page.

Let’s see what the raw content looks like:

print response.text

That’s a lot of content, but not all of it is relevant. Let’s create a list of things that need to be extracted:

  • Title of each post
  • Number of votes it has
  • Number of comments
  • Time of post creation

 

Extracting title of posts

Scrapy provides ways to extract information from HTML based on CSS selectors such as class, id etc. Let’s find the CSS selector for the title: right-click on any post’s title and select “Inspect” or “Inspect Element”:

This will open the developer tools in your browser:

As can be seen, the CSS class “title” is applied to all <p> tags that have titles. This will be helpful in filtering out titles from the rest of the content in the response object:

response.css(".title::text").extract()

Here response.css(..) is a function that helps extract content based on the CSS selector passed to it. The ‘.’ is used with title because it’s a CSS class. You also need to use ::text to tell your scraper to extract only the text content of the matching elements; without it, Scrapy returns each matching element along with its HTML code. Look at the following two examples:

Notice how “::text” helped us filter and extract only the text content.

 

Extracting Vote counts for each post

Now this one is tricky. On inspecting, you get three scores:

The “score” class is applied to all three, so it can’t be used on its own; a unique selector is required. On further inspection, it can be seen that the selector that uniquely matches the vote count we need is the one that contains both “score” and “unvoted”.

When two or more selectors are required to identify an element, we use them together. Since both are CSS classes, we have to use “.” with their names. Let’s try it out first by extracting the first element that matches:

response.css(".score.unvoted").extract_first()

See that the number of votes of the first post is correctly displayed. Note that on Reddit the vote score is dynamic, based on the number of upvotes and downvotes, so it will keep changing in real time. We will add “::text” to our selector so that we get only the vote value and not the complete vote element. To fetch all the votes:

response.css(".score.unvoted::text").extract()

Note: Scrapy has two functions to extract content: extract() and extract_first().

 

Dealing with relative time stamps: extracting time of post creation

On inspecting the post it is clear that the “time” element contains the time of the post.

There is a catch here though: this is only the relative time (16 hours ago etc.) of the post. It doesn’t give any information about the date or the time zone. If we want to do some analytics, we won’t know from which date and time “16 hours ago” should be calculated. Let’s inspect the time element a little more:

The “title” attribute of time has both the date and the time in UTC. Let’s extract this instead:

response.css("time::attr(title)").extract()

The ::attr(attributename) pseudo-element is used to get the value of the specified attribute of the matching element.
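Once the absolute timestamp has been extracted, it can be turned into a proper datetime for analysis. The sketch below uses only the standard library and Python 3 syntax. The exact timestamp format shown is an assumption made for the example, so adjust the format string to whatever the “title” attribute actually contains.

```python
from datetime import datetime, timezone

# Assumed format of the extracted attribute, e.g. "Mon Jul 17 02:45:31 2017 UTC".
ASSUMED_FORMAT = "%a %b %d %H:%M:%S %Y UTC"

def parse_post_time(title_value):
    """Convert the extracted string into a timezone-aware datetime in UTC."""
    naive = datetime.strptime(title_value, ASSUMED_FORMAT)
    return naive.replace(tzinfo=timezone.utc)

created = parse_post_time("Mon Jul 17 02:45:31 2017 UTC")
print(created.isoformat())
```

With aware datetimes, posts can be sorted or bucketed by hour without guessing what “16 hours ago” was relative to.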

 

Extracting Number of comments

I leave this as a practice assignment for you. If you have any issues, you can post them here: https://discuss.analyticsvidhya.com/ and the community will help you out 🙂 .

So far:

  • response – An object that the scrapy crawler returns. This object contains all the information about the downloaded content.
  • response.css(..) – Matches the element with the given CSS selectors.
  • extract_first(..) – Extracts the “first” element that matches the given criteria.
  • extract(..) – Extracts “all” the elements that match the given criteria.

 

Note: CSS selectors are a very important concept as far as web scraping is concerned. You can read more about them here, and about how to use CSS selectors with Scrapy.

 

2.3 Writing Custom Spiders

As mentioned above, a spider is a program that downloads content from web sites or a given URL. When extracting data on a larger scale, you would need to write custom spiders for different websites, since there is no “one size fits all” approach in web scraping owing to the diversity of website designs. You would also need to write code to convert the extracted data to a structured format and store it in a reusable format like CSV, JSON, Excel etc. That’s a lot of code to write. Luckily, Scrapy comes with most of this functionality built in.

 

Creating a scrapy project

Let’s exit the scrapy shell first and create a new scrapy project:

scrapy startproject ourfirstscraper

This will create a folder “ourfirstscraper” with the following structure:

For now, the two most important files are:

  • settings.py – This file contains the settings you set for your project, you’ll be dealing a lot with it.
  • spiders/ – This folder is where all your custom spiders will be stored. Every time you ask scrapy to run a spider, it will look for it in this folder.

 

Creating a spider

Let’s change directory into ourfirstscraper and create a basic spider “redditbot”:

scrapy genspider redditbot www.reddit.com/r/gameofthrones/

This will create a new spider “redditbot.py” in your spiders/ folder with a basic template:

A few things to note here:

  • name : Name of the spider, in this case it is “redditbot”. Naming spiders properly becomes a huge relief when you have to maintain hundreds of spiders.
  • allowed_domains : An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed.
  • parse(self, response) : This function is called whenever the crawler successfully crawls a URL. Remember the response object from earlier? This is the same response object that is passed to the parse(..).

After every successful crawl the parse(..) method is called, so that’s where you write your extraction logic. Let’s add the logic written earlier to extract titles, times, votes etc. to the parse function:

def parse(self, response):
        #Extracting the content using css selectors
        titles = response.css('.title.may-blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()
       
        #Give the extracted content row wise
        for item in zip(titles,votes,times,comments):
            #create a dictionary to store the scraped info
            scraped_info = {
                'title' : item[0],
                'vote' : item[1],
                'created_at' : item[2],
                'comments' : item[3],
            }

            #yield or give the scraped info to scrapy
            yield scraped_info

 

Note: Here yield scraped_info does all the magic. This line returns the scraped info(the dictionary of votes, titles, etc.) to scrapy which in turn processes it and stores it.
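One behaviour of zip is worth keeping in mind with this pattern: it stops at the shortest list. If one selector matches fewer elements than the others, rows are dropped silently. The standalone sketch below (Python 3, with made-up values) shows the effect along with a simple length check.

```python
from itertools import zip_longest

# Made-up extraction results: one vote count is missing.
titles = ["Post A", "Post B", "Post C"]
votes = ["120", "87"]
times = ["t1", "t2", "t3"]

# zip() silently drops "Post C" because votes has only two entries.
rows = list(zip(titles, votes, times))

# zip_longest() keeps every row and pads the gaps with None.
padded = list(zip_longest(titles, votes, times, fillvalue=None))

def lengths_match(*columns):
    """Return True when every extracted list has the same length."""
    return len({len(column) for column in columns}) == 1

print(len(rows), len(padded), lengths_match(titles, votes, times))
```

Padding keeps the row count but cannot tell which post the missing value belonged to, so a length mismatch is usually a sign that a selector needs to be made more specific.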

Save the file redditbot.py and head back to the shell. Run the spider with the following command:

scrapy crawl redditbot

Scrapy will print a lot of stuff on the command line. Let’s focus on the data.

Notice that all the data is downloaded and extracted into a dictionary-like object that holds the votes, title, created_at and comments.

 

Exporting scraped data as a csv

Getting all the data on the command line is nice, but as a data scientist you will prefer to have the data in formats like CSV, Excel, JSON etc. that can be imported into other programs. Scrapy provides a nifty little feature that lets you export the downloaded content in various formats. Many of the popular formats are already supported.

Open the settings.py file and add the following code to it:

#Export as CSV Feed
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"

And run the spider :

scrapy crawl redditbot

This will now export all scraped data in a file reddit.csv. Let’s see how the CSV looks:

What happened here:

  • FEED_FORMAT : The format in which you want the data to be exported. Supported formats are: JSON, JSON lines, XML and CSV.
  • FEED_URI : The location of the exported file.
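Readers on a newer Scrapy release may see a deprecation warning for these two settings: from Scrapy 2.1 onward they are replaced by a single FEEDS setting. A minimal sketch of the equivalent configuration, assuming the same output file name as above:

```python
# settings.py (Scrapy 2.1 or newer)
# Each key is an output location; the value holds the options for that feed.
FEEDS = {
    "reddit.csv": {"format": "csv"},
}
```

For a one-off export, running scrapy crawl redditbot -o reddit.csv on the command line is another option that needs no change to settings.py.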

Scrapy supports a plethora of options for exporting feeds. If you want to dig deeper, you can check here, and also read about using CSS selectors in Scrapy.

Now that you have successfully created a system that crawls web content from a link, scrapes (extracts) selected data from it and saves it in an appropriate structured format, let’s take the game a notch higher and learn more about web scraping.

 

3. Case studies using Scrapy

Let’s now look at a few case studies to get more experience of scrapy as a tool and its various functionalities.

 

Scraping an E-Commerce site

The advent of the internet and smartphones has been an impetus to the e-commerce industry. With millions of customers and billions of dollars at stake, the market has started seeing a multitude of players. This in turn has led to the rise of e-commerce aggregator platforms, which collect and show you information about products from across multiple portals. For example, when planning to buy a smartphone, you would want to see the prices on different platforms in a single place. What does it take to build such an aggregator platform? Here’s my small take on building an e-commerce site scraper.

As a test site, you will scrape ShopClues for 4G smartphones.

Let’s first generate a basic spider:

scrapy genspider shopclues www.shopclues.com/mobiles-featured-store-4g-smartphone.html

This is what the ShopClues web page looks like:

The following information needs to be extracted from the page:

  • Product Name
  • Product price
  • Product discount
  • Product image

 

Extracting image URLs of the product

On careful inspection, it can be seen that the attribute “data-img” of the <img> tag can be used to extract image URLs:

response.css("img::attr(data-img)").extract()

 

Extracting product name from <img> tags

Notice that the “title” attribute of the <img> tag contains the product’s full name:

response.css("img::attr(title)").extract()

Similarly, the selectors for price and discount are “.p_price” and “.prd_discount” respectively.

 

How to download product images?

Scrapy provides reusable images pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally).

The Images Pipeline has a few extra functions for processing images. It can:

  • Convert all downloaded images to a common format (JPG) and mode (RGB)
  • Generate thumbnails
  • Check image width/height to make sure they meet a minimum constraint
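The thumbnail and minimum-size features in this list are driven by settings. The fragment below is a sketch of what could go into settings.py once the pipeline itself is enabled. The setting names are Scrapy’s own; the sizes are arbitrary example values.

```python
# settings.py: optional Images Pipeline tuning (illustrative values)

# Generate two thumbnails per image; keys are thumbnail names,
# values are (width, height) in pixels.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Skip images smaller than this size (in pixels).
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```

Note that the Images Pipeline relies on the Pillow library for image processing, so Pillow has to be installed alongside Scrapy.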

In order to use the images pipeline  to download images, it needs to be enabled in the settings.py file. Add the following lines to the file :

ITEM_PIPELINES = {
  'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = 'tmp/images/'

Here you are basically telling Scrapy to use the Images Pipeline and to store the images in the folder ‘tmp/images/’. The final spider would now be:

import scrapy

class ShopcluesSpider(scrapy.Spider):
    #name of spider
    name = 'shopclues'

    #list of allowed domains (domain names only, no URL path)
    allowed_domains = ['www.shopclues.com']
    #starting url
    start_urls = ['http://www.shopclues.com/mobiles-featured-store-4g-smartphone.html/']
    #location of csv file
    custom_settings = {
        'FEED_URI' : 'tmp/shopclues.csv'
    }

    def parse(self, response):
        #Extract product information
        titles = response.css('img::attr(title)').extract()
        images = response.css('img::attr(data-img)').extract()
        prices = response.css('.p_price::text').extract()
        discounts = response.css('.prd_discount::text').extract()

        for item in zip(titles,prices,images,discounts):
            scraped_info = {
                'title' : item[0],
                'price' : item[1],
                'image_urls' : [item[2]], #Sets the url for scrapy to download images
                'discount' : item[3]
            }

            yield scraped_info

A few things to note here:

  • custom_settings : This is used to set the settings of an individual spider. Remember that settings.py is for the whole project, so here you tell Scrapy that the output of this spider should be stored in a CSV file “shopclues.csv” in the “tmp” folder.
  • scraped_info[“image_urls”] : This is the field that Scrapy checks for the image’s link. If you set this field to a list of URLs, Scrapy will automatically download and store those images for you.

On running the spider the output can be read from “tmp/shopclues.csv”:

You also get the images downloaded. Check the folder “tmp/images/full” and you will see the images:

Also, notice that scrapy automatically adds the download path of the image on your system in the csv:

There you have your own little e-commerce aggregator 🙂

If you want to dig in, you can read more about Scrapy’s Images Pipeline here.

 

Scraping Techcrunch: Creating your own RSS Feed Reader

TechCrunch is one of my favourite blogs that I follow to stay abreast of news about startups and the latest technology products. Like many blogs nowadays, TechCrunch provides its own RSS feed here: https://techcrunch.com/feed/. One of Scrapy’s features is its ability to handle XML data with ease, and in this part you are going to extract data from TechCrunch’s RSS feed.

Create a basic spider:

scrapy genspider techcrunch techcrunch.com/feed/

Let’s have a look at the XML; the marked portion is the data of interest:

Here are some observations from the page:

  • Each article is present between <item></item> tags and there are 20 such items(articles).
  • The title of the post is in <title></title> tags.
  • Link to the article can be found in <link> tags.
  • <pubDate> contains the date of publishing.
  • The author name is enclosed between funny looking <dc:creator> tags.

 

Overview of XPath and XML

XPath is a query language for selecting nodes in an XML document. It can be used to traverse an XML document. Note that an XPath expression follows the document’s hierarchy.

 

Extracting title of post

Let’s extract the title of the first post. Similar to response.css(..), Scrapy provides the function response.xpath(..) to deal with XPath. The following code should do it:

response.xpath("//item/title").extract_first()

 

Output :

u'<title xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc
="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/
01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/">Why the future of deep learning depends on finding good data</title>'

Wow! That’s a lot of content, but only the text content of the title is of interest. Let’s filter it out:

response.xpath("//item/title/text()").extract_first()

Output :

u'Why the future of deep learning depends on finding good data'

This is much better. Notice that text() here is the equivalent of ::text from CSS selectors. Also look at the XPath //item/title/text(): here you are basically saying, find the element “item” and extract the “text” content of its sub-element “title”.

Similarly, the XPaths for link and pubDate are:

  • Link – //item/link/text()
  • Date of publishing – //item/pubDate/text()

 

Extracting author name: Dealing with namespaces in XML

Notice the <dc:creator> tags:

The tag name carries the prefix “dc:”, because of which the element can’t be extracted with a plain XPath, and the author name itself is wrapped in “<![CDATA[..” markup. The “dc:” prefix is just an XML namespace and you don’t need it here, so we’ll ask Scrapy to remove the namespaces:

response.selector.remove_namespaces()

Now when you try extracting the author name, it will work:

response.xpath("//item/creator/text()").extract_first()

Output :

u'Ophir Tanz,Cambron Carter'

The complete spider for TechCrunch would be:

import scrapy

class TechcrunchSpider(scrapy.Spider):
    #name of the spider
    name = 'techcrunch'

    #list of allowed domains (domain names only, no URL path)
    allowed_domains = ['techcrunch.com']

    #starting url for scraping
    start_urls = ['http://techcrunch.com/feed/']

    #setting the location of the output csv file
    custom_settings = {
        'FEED_URI' : 'tmp/techcrunch.csv'
    }

    def parse(self, response):
        #Remove XML namespaces
        response.selector.remove_namespaces()

        #Extract article information
        titles = response.xpath('//item/title/text()').extract()
        authors = response.xpath('//item/creator/text()').extract()
        dates = response.xpath('//item/pubDate/text()').extract()
        links = response.xpath('//item/link/text()').extract()

        for item in zip(titles,authors,dates,links):
            scraped_info = {
                'title' : item[0],
                'author' : item[1],
                'publish_date' : item[2],
                'link' : item[3]
            }

            yield scraped_info

Let’s run the spider:

scrapy crawl techcrunch

And there you have your own RSS reader :)!

 

End Notes

In this article, we have just scratched the surface of Scrapy’s potential as a web scraping tool. Nevertheless, if you have experience with any other tools for scraping it would have been evident by now that in efficiency and practical application, Scrapy wins hands down. All the code used in this article is available on github. Also, check out some of the interesting projects built with Scrapy:


53 Comments

  • Mayank Srivastava says:

    By far the simplest and the best explanation of Scrapy. Thanks!

  • Karthikeyan Palanisamy says:

    Hi Mohammed,

    A very detailed article on scraping. Could you please let me know how Scrapy differs from BeautifulSoup?

    • Hey Karthikeyan,

      BeautifulSoup is a library that “parses” HTML or XML content. In other words, it reads your HTML file and helps extract content from it.

      Scrapy is a full-blown web scraping framework. That means it already has the functionality that BeautifulSoup provides, and it offers much more on top of that.

      When you are developing a web scraping system, you would need a way to send requests to the websites (probably using requests or urllib) , you would need a way to send multiple requests at once(multiprocessing/asynchronous) so that you can download content faster. You would also need a way to export your downloaded content in various required formats, if you are working on large scale projects, you would require deploying your scraping code across distributed systems.

      Scrapy provides you with all of that and much more in built.

      And yeah, you can use BeautifulSoup with Scrapy if you prefer.

      Hope this helps,
      Sanad 🙂

  • Ankit says:

    Hi Sanad,

    I have recently started using Scrapy, but I have two roadblocks. First, in our domain we need to crawl PDF pages, which Scrapy doesn’t provide; after googling I found a couple of paid ways, which we don’t prefer. Second, how do we write JUnit-style unit tests for Scrapy code? Is there any framework for this?

    Please help me out on this.

    Thanks
    Ankit

    • Hey Ankit,

      1. I’m not sure what you mean by crawling PDF pages. If you are trying to scrape websites for PDF files, it again depends on what you are trying to achieve. You can probably use Scrapy to extract the links of the target PDFs and urllib2 or requests to fetch the PDF files. Then you can use something like PDFMiner (https://pypi.python.org/pypi/pdfminer/) to parse the PDFs and extract information.

      2. Regarding writing unit tests for Scrapy code, it provides an integrated way to unit test spiders, check out Spiders Contracts : https://doc.scrapy.org/en/latest/topics/contracts.html

      • Ankit says:

        Hi Sanad,

        Thanks for your response ya my use case is to scrape pdf data, I’ll go through the provided links and then let see 🙂

        Thanks

      • Amit says:

        Hi Rizvi,

        Thank you very much for responding to Ankit’s (my colleague) query. The issue is not in extracting text from the PDF but in extracting the relevant info about the structure of the PDF (tables etc.). In other words, there is no info on any way to identify the data as tabular or its structure in a PDF document. What we are trying to do is extract specific info (e.g. specific column data from a table in a PDF document). That’s where most of the open source libraries falter. The reason looks to be more about the way the PDF has been encoded.

        Hope the query is clear. In case you need additional info, please let me know. Any help in this regard will be highly appreciated. Primarily, we are looking for Python APIs. Even if open source Java libraries can do the same, we can invoke them from Python code.

  • Isindor Richie says:

    Hello…thanks for an explanatory tutorial…how can I start the scrapy server from Jupyter notebook?

    • Hey Isindor,

      I usually don’t run scrapy server from Jupyter Notebook. I run it from the command line to export data in CSVs and then import those CSVs using pandas in Notebook.

      But, you can execute any terminal command in Jupyter Notebook using ‘!’ sign before the command. Something like this:

      !scrapy crawl redditbot

      Hope this helps.
      Sanad

  • Charles says:

    Great article but I’m a little surprised it didn’t touch on the challenges of using Scrapy when trying to scrape JavaScript heavy websites.

    Most of the sites that I work with now require also using Splash to render the JavaScript. As such I’ve also started looking at the Selenium and WebDriver option.

    At first, I tried very hard to limit myself to only Scrapy and Splash but after a month working on a complicated site, I’m really wishing I would have changed approaches much earlier. I’ve done more in a few days with Selenium using the page object pattern than in weeks of Scrapy and Splash development.

    • Hey Charles,

      True that with the advent of JavaScript based front end frameworks and libraries, it is becoming difficult to scrape websites as such. We would have to use Selenium and Webdriver to aid in the part where we require user action like clicking a popup or filling a form. It’s not rare to see Scrapy applied in conjunction with Selenium in projects.

      Yet, we have to remind ourselves that that’s not the problem Scrapy is meant to solve. You could argue web scraping is a domain of its own with sub domains, one such sub domain being dealing with dynamic/javascript heavy websites.

      This article’s goal was supposed to get a beginner started with web scraping especially with the use of Scrapy. It would have been overkill to try to cover all aspects of advanced web scraping.

      Hope this helps,
      Sanad

      • Isindor Richie says:

        I look forward to a tutorial covering scraping JS heavy sites..

        Thanks.

      • Charles says:

        Hello Sanad,
        Thank you for your reply.
        Please don’t take my comment as anything but constructive.
        The article and work you are providing are wonderful. Please keep it up.

        I’m relatively new to Scrapy myself as I only started with it a few months ago.
        I was just trying to add a note for anyone new to it that there is a potential gotcha if you’re working with a website that heavily utilizes JavaScript.

        Great stuff! Please keep it coming.

  • Ram says:

    Hi Sanad,
    I have an issue with starting the scrapy shell. When I am typing scrapy shell in the command terminal/ ! scrapy shell in jupyter notebook it is showing
    ‘scrapy is not recognized as internal or external command, operable program or batch file’

    Any suggestions on how to overcome this issue and proceed further

    Thanks.

  • Anurag Kumar says:

    can scrapy scrape data that is inside the iframe?
    //copy pasting of xpath of website isnt working

  • Angeline Shalini says:

    Very nice article, I am beginner in webscraping, have been using Beautiful Soup. I am excited to try out the examples using Scrapy. Great job with the explanation. What are some of the websites/people/blogs that I can follow to better understand webscraping and also get the latest info?

  • Lwebzem says:

    This is great, I tried to use it from the shell for the same url that is in the example with python 3 and win 10 but I got error as below. Can you suggest something as I am new to scrapy.
    Thanks

    In [4]: print (response.text)
    —————————————————————————
    UnicodeEncodeError Traceback (most recent call last)
    in ()
    —-> 1 print (response.text)

    C:\Users\Owner\Anaconda3\lib\encodings\cp437.py in encode(self, input, final)
    17 class IncrementalEncoder(codecs.IncrementalEncoder):
    18 def encode(self, input, final=False):
    —> 19 return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    20
    21 class IncrementalDecoder(codecs.IncrementalDecoder):

    UnicodeEncodeError: ‘charmap’ codec can’t encode character ‘\u2022’ in position
    1047: character maps to

  • Pulkit Verma says:

    During running command Scrapy genspider techcrunch techchrunch.com/feed/ , I encountered an error which related to permission i.e Permission denied :’.\\techchrunch.py’. How can I resolve this error, I am using python 3 and anaconda in windows.

  • Ram says:

    Hi Sanad,

    I am not able to open scrapy shell. An error “Scrapy is not a recognized external or internal command or batch file” is coming when I am typing scrapy shell in the terminal.

    Please help

    Thanks

  • Saurabh says:

    Hi Sanad,
    Very nice article. Thanks.

    I have one question regarding scraping within a web site’s request limit. How do we control the number of requests sent to a website, so that we don’t hold up its web traffic and stay within the limit for not getting blocked? What limit on the number of requests would you suggest, based on your experience?

    Thanks

  • Krish C A says:

    Great article and explained the flow in step-by-step manner, so simple that even python beginners can also give a try and see the code working.

  • vrana95 says:

    HI Sanad,

    Really nice article. Thanks for putting it up.

  • kabir says:

    I am getting error in below line

    scrapy startproject ourfirstscraper http://www.reddit.com/r/gameofthrones/
    error

    File “”, line 1
    scrapy startproject ourfirstscraper http://www.reddit.com/r/gameofthrones
    syntax error : invalid syntax

  • Yash says:

    Hi Sanad,
    I’m getting this error

    AttributeError Traceback (most recent call last)
    in ()
    —-> 1 response.css(“img::attr(data-img)”).extract()

    AttributeError: ‘NoneType’ object has no attribute ‘css’

  • Dhawal Modi says:

    Great article! I also have a doubt: what if I want the spider to go through the other webpages as well? For example, in the redditbot, if I want to scrape the pages after the first page, how do I iterate over the links I have specified in the ‘allowed_domains’ list?

  • Murali says:

    Hi Sanad,

    Thanks for the nice tutorial. When I tried to pull data from shopclues, I got only one record in the output CSV file. I am not sure what the issue is with my code. I am pasting the code and the generated log. Can you please advise?

    import scrapy

    class MyshopcluesSpider(scrapy.Spider):
        name = "myshopclues"
        allowed_domains = ["www.shopclues.com/mobiles-featured-store-4g-smartphone.html"]
        start_urls = ["http://www.shopclues.com/mobiles-featured-store-4g-smartphone.html/"]

        custom_settings = {
            'FEED_URI': 'tmp/shopclues.csv'
        }

        def parse(self, response):
            images = response.css("img::attr(data-img)").extract()
            titles = response.css("img::attr(title)").extract()
            prices = response.css(".p_price::text").extract()
            discounts = response.css(".prd_discount::text").extract()

            for item in zip(titles, prices, images, discounts):
                scraped_info = {
                    'title': item[0],
                    'price': item[1],
                    'image_urls': [item[2]],
                    'discount': item[3]
                }

                yield scraped_info

    (C:\Users\mupago\AppData\Local\conda\conda\envs\my_root) C:\Users\mupago\www.shopclues.com\shopclues\spiders>scrapy crawl myshopclues
    2017-08-07 22:17:13 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: shopclues)
    2017-08-07 22:17:13 [scrapy.utils.log] INFO: Overridden settings: {‘BOT_NAME’: ‘shopclues’, ‘FEED_FORMAT’: ‘csv’, ‘NEWSPIDER_MODULE’: ‘shopclues.spiders’, ‘ROBOTSTXT_OBEY’: True, ‘SPIDER_MODULES’: [‘shopclues.spiders’]}
    2017-08-07 22:17:13 [scrapy.middleware] INFO: Enabled extensions:[‘scrapy.extensions.corestats.CoreStats’,
    ‘scrapy.extensions.telnet.TelnetConsole’,
    ‘scrapy.extensions.feedexport.FeedExporter’,
    ‘scrapy.extensions.logstats.LogStats’]
    2017-08-07 22:17:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
    [‘scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware’,
    ‘scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware’,
    ‘scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware’,
    ‘scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware’,
    ‘scrapy.downloadermiddlewares.useragent.UserAgentMiddleware’,
    ‘scrapy.downloadermiddlewares.retry.RetryMiddleware’,
    ‘scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware’,
    ‘scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware’,
    ‘scrapy.downloadermiddlewares.redirect.RedirectMiddleware’,
    ‘scrapy.downloadermiddlewares.cookies.CookiesMiddleware’,
    ‘scrapy.downloadermiddlewares.stats.DownloaderStats’]
    2017-08-07 22:17:13 [scrapy.middleware] INFO: Enabled spider middlewares:
    [‘scrapy.spidermiddlewares.httperror.HttpErrorMiddleware’,
    ‘scrapy.spidermiddlewares.offsite.OffsiteMiddleware’,
    ‘scrapy.spidermiddlewares.referer.RefererMiddleware’,
    ‘scrapy.spidermiddlewares.urllength.UrlLengthMiddleware’,
    ‘scrapy.spidermiddlewares.depth.DepthMiddleware’]
    2017-08-07 22:17:13 [scrapy.middleware] INFO: Enabled item pipelines:
    [‘scrapy.pipelines.images.ImagesPipeline’]
    2017-08-07 22:17:13 [scrapy.core.engine] INFO: Spider opened
    2017-08-07 22:17:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-08-07 22:17:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
    2017-08-07 22:17:15 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
    2017-08-07 22:17:15 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
    2017-08-07 22:17:15 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloadedimage from referred in
    2017-08-07 22:17:15 [scrapy.core.scraper] DEBUG: Scraped from
    {‘title’: ‘Swipe Konnect Neo 4G, Black [4G VoLTE, Quad Core, Android v6.0 Marshmallow, 5MP Camera] (Black)’, ‘price’: ‘Rs.3099’, ‘image_urls’: [‘https://cdn.shopclues.com/images/thumbnails/79033/200/200/124122307101876702KonnectNEO4GMainImage15000260381501218627.jpg’], ‘discount’: ‘28% Off’, ‘images’: [{‘url’: ‘https://cdn.shopclues.com/images/thumbnails/79033/200/200/124122307101876702KonnectNEO4GMainImage15000260381501218627.jpg’, ‘path’: ‘full/d03603c774c1a790d1e813e73743e60f1db3bd16.jpg’, ‘checksum’: ‘217ee1803456f4b83294c302d41cc9e7’}]}
    2017-08-07 22:17:15 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-08-07 22:17:15 [scrapy.extensions.feedexport] INFO: Stored csv feed (1 items) in: tmp/shopclues.csv
    2017-08-07 22:17:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:{‘downloader/request_bytes’: 482,
    ‘downloader/request_count’: 2,
    ‘downloader/request_method_count/GET’: 2,
    ‘downloader/response_bytes’: 32121,
    ‘downloader/response_count’: 2,
    ‘downloader/response_status_count/200’: 2,
    ‘file_count’: 1,
    ‘file_status_count/uptodate’: 1,
    ‘finish_reason’: ‘finished’,
    ‘finish_time’: datetime.datetime(2017, 8, 8, 2, 17, 15, 400925),
    ‘item_scraped_count’: 1,
    ‘log_count/DEBUG’: 5,
    ‘log_count/INFO’: 8,
    ‘response_received_count’: 2,
    ‘scheduler/dequeued’: 1,
    ‘scheduler/dequeued/memory’: 1,
    ‘scheduler/enqueued’: 1,
    ‘scheduler/enqueued/memory’: 1,
    ‘start_time’: datetime.datetime(2017, 8, 8, 2, 17, 13, 850770)}
    2017-08-07 22:17:15 [scrapy.core.engine] INFO: Spider closed (finished)

    • Hey,

      I checked your code but couldn’t find anything that stands out. I reused my code from https://github.com/mohdsanadzakirizvi/web-scraping-magic-with-scrapy-and-python
      and it works perfectly fine. Hope this helps.

      Sanad

      • Al T. says:

        I have tried to replicate the first tutorial by scraping from several other sites, and each time my spider only yields the first row… the code is almost identical to the GoT example. Any advice is appreciated.

        #yellowbot.py
        # -*- coding: utf-8 -*-
        import scrapy

        class YellowbotSpider(scrapy.Spider):
            name = 'yellowbot'
            allowed_domains = ['www.yellowpages.com']
            start_urls = ['https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=Portland%2C+OR']

            def parse(self, response):
                #Extracting zee content using css selectors
                name = response.css('.business-name::text').extract()
                street_address = response.css('.street-address::text').extract()
                phone = response.css('.phones.phone.primary::text').extract()

                for item in zip(name, street_address, phone):
                    #create a dictionary to store scraped info
                    scraped_info = {
                        'shop_name' : item[0],
                        'street_address' : item[1],
                        'phone_number' : item[2],
                    }

                    yield scraped_info
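One possible cause of the "only one row" symptom reported in this thread, offered as a guess rather than a confirmed diagnosis: zip() stops at its shortest input, so if any one selector matches a single element, the loop yields a single item no matter how many the other selectors found. The sketch below reproduces that with plain lists; the sample values are made up and stand in for the results of .extract().

```python
from itertools import zip_longest

# Hypothetical selector results: the third selector matched only one element.
names = ["Cafe A", "Cafe B", "Cafe C"]
street_addresses = ["1 Main St", "2 Oak Ave", "3 Pine Rd"]
phones = ["(503) 555-0100"]

# zip() stops at the shortest input, so only one row comes out.
rows = list(zip(names, street_addresses, phones))
print(len(rows), rows)

# Comparing lengths before zipping shows which selector came up short.
lengths = {"names": len(names), "street_addresses": len(street_addresses), "phones": len(phones)}
print(lengths)

# zip_longest() pads the short lists with None, which keeps every row visible.
padded = list(zip_longest(names, street_addresses, phones, fillvalue=None))
print(len(padded), padded[1])
```

If the lengths differ, the short selector is the one to inspect in the Scrapy shell. A sturdier layout is to select each listing's container element first and then query the fields relative to that container, so that a missing field affects one row instead of truncating the whole result.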

  • Avilash says:

    I am getting an invalid syntax error when I try to run the spider.
    I was able to see the response and the text responses individually.

    >>>scrapy crawl smartpricebot
    File “”, line 1
    scrapy crawl smartpricebot
    ^
    SyntaxError: invalid syntax

    I don't know why; I must be missing something. Please help me resolve this.

    • Hey Avilash,

      You are trying to run the spider from within the Python or Scrapy shell. This command works only in your regular terminal (command line). As I mentioned in the article, exit the Scrapy shell first and then try it.

      Sanad 🙂

      • Avilash says:

        Thanks, buddy, it is a very helpful article.
        A question:
        If there are no unique attributes on a particular page, can we set start and stop points, or use a regex, to restrict the crawl to a specific area of the page?
        Also, if you could address pagination and “load more” pages that load on scroll, it would be a great help.
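On the start and stop points asked about above: one way to restrict extraction to a region of a page is to cut out the text between two markers first and search only inside that slice. The sketch below does this with the standard re module on a made-up page fragment; the marker strings, the PAGE sample and the function name prices_between are all assumptions for illustration. Regular expressions are fragile on real HTML, so this suits pages without usable classes or ids; Scrapy's selectors also offer a .re() method for applying a pattern to selected text.

```python
import re

# Hypothetical page fragment; only the part between the two markers is of interest.
PAGE = """
<h2>Sponsored</h2><span class="price">Rs.999</span>
<!-- results-start -->
<span class="price">Rs.3099</span>
<span class="price">Rs.4500</span>
<!-- results-end -->
<h2>Recently viewed</h2><span class="price">Rs.100</span>
"""

def prices_between(html, start="<!-- results-start -->", stop="<!-- results-end -->"):
    """Return the prices found only between the start and stop markers."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(stop)
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        return []  # markers not found: nothing to extract
    # Search only inside the captured region, not the whole page.
    return re.findall(r'<span class="price">([^<]+)</span>', match.group(1))

print(prices_between(PAGE))
```

The prices outside the markers (the sponsored and recently viewed ones) are ignored because the second search never sees them.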

  • I have read thousands of articles and watched millions of video tutorials to learn Scrapy, but I was still not able to run a project successfully; all my spiders got stuck halfway or came back with empty data. After reading your article, I could finally build a project that works. Thanks a lot.

    By the way, could you please write another Scrapy tutorial on how to schedule a Scrapy task and how to overwrite a CSV file? Thanks once again.

  • Declan says:

    Great job on the scraping walkthroughs.

    Is there a way to scrape multiple websites for a keyword and extract the associated info? Something similar to what Google does, but returning some additional variables related to the keyword?
