The internet has become an expansive resource of data, providing numerous opportunities for data science enthusiasts. Web scraping using Scrapy, a powerful Python-based open-source web crawling framework, has become essential for extracting valuable insights from this vast amount of unstructured data. This article explores the fundamentals of web scraping using Scrapy Python, providing examples and case studies to demonstrate its capabilities. You will learn how to scrape data from various sources, including Reddit and e-commerce sites, and gain practical experience in handling common challenges in web scraping.
Note: We have created a free course for web scraping using the BeautifulSoup library. You can check it out here – Introduction to Web Scraping using Python.
This article was published as a part of the Data Science Blogathon.
Scrapy is a powerful, open-source web crawling framework for Python, designed to handle large-scale web scraping projects. It combines an efficient web crawler with a flexible processing framework, allowing you to extract data from websites and store it in your preferred format.
The internet’s diversity means there’s no one-size-fits-all approach to extracting data. Ad hoc solutions can lead to writing code for every task, effectively creating your own scraping framework. Scrapy solves this problem by providing a robust framework that eliminates the need to reinvent the wheel.
Note: There are no specific prerequisites for this article. Basic knowledge of HTML and CSS is preferred. If you still think you need a refresher, do a quick read of this article.
We will first quickly take a look at how to set up your system for web scraping and then see how we can build a simple web scraping system step-by-step for extracting data from the Reddit website.
Scrapy has historically supported both Python 2 and Python 3, although recent releases run on Python 3 only. If you’re using Anaconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows, and OS X.
conda install -c conda-forge scrapy
Alternatively, if you’re on Linux or Mac OSX, you can directly install scrapy by:
pip install scrapy
Note: This article will follow Python 2 to use Scrapy.
Recently there was a season launch of a prominent TV series (GoTS7), and social media was on fire. People all around were posting memes, theories, their reactions, etc. I had just learned scrapy and was wondering if it could be used to catch a glimpse of people’s reactions.
Working with Scrapy Shell
I love the Python shell; it helps me “try out” things before I implement them in detail. Similarly, scrapy provides a shell of its own that you can use to experiment. To start the scrapy shell in your command line, type:
scrapy shell
Woah! Scrapy wrote a bunch of stuff. For now, you don’t need to worry about it. In order to get information from Reddit (about GoT) you will have to first run a crawler on it. A crawler is a program that browses websites and downloads content. Sometimes crawlers are also referred to as spiders.
Reddit is a discussion forum website. It allows users to create “subreddits” for a single topic of discussion. It supports all the features that conventional discussion portals have, like creating a post, voting, replying to posts, including images and links, etc. Reddit also ranks posts based on their votes using a ranking algorithm of its own.
Getting back to Scrapy: a crawler needs a starting point to start crawling (downloading) content. On googling “game of thrones Reddit,” I found that Reddit has a subreddit exclusively for Game of Thrones here; this will be the crawler’s start URL.
To run the crawler in the shell type:
fetch("https://www.reddit.com/r/gameofthrones/")
When you crawl something with scrapy, it returns a “response” object that contains the downloaded information. Let’s see what the crawler has downloaded:
view(response)
This command will open the downloaded page in your default browser.
Wow, that looks exactly like the website. The crawler has successfully downloaded the entire web page.
Let’s see what the raw content looks like:
print(response.text)
That’s a lot of content, but not all of it is relevant. Let’s create a list of things that need to be extracted:
Title of each post
Number of votes it has
Number of comments
Time of post creation
Scrapy provides ways to extract information from HTML based on css selectors like class, id, etc. Let’s find the css selector for the title, right-click on any post’s title, and select “Inspect” or “Inspect Element”:
This will open the developer tools in your browser:
As can be seen, the css class “title” is applied to all <p> tags that have titles. This will help in filtering out titles from the rest of the content in the response object:
response.css(".title::text").extract()
Here, response.css(..) is a function that helps extract content based on the css selector passed to it. The ‘.’ is used with title because it’s a css class. Also, you need to use “::text” to tell your scraper to extract only the text content of the matching elements. This is done because scrapy otherwise returns the matching element along with its HTML code. Look at the following two examples:
Notice how “::text” helped us filter and extract only the text content.
Now this one is tricky. On inspecting, you get three scores:
The “score” class is applied to all three, so it can’t be used on its own; a unique selector is required. On further inspection, it can be seen that the selector that uniquely matches the vote count that we need is the one that contains both “score” and “unvoted.”
When two or more selectors are required to identify an element, we use them together. Also, since both are CSS classes, we have to use “.” with their names. Let’s try it out first by extracting the first element that matches:
response.css(".score.unvoted").extract_first()
See that the number of votes for the first post is correctly displayed. Note that on Reddit, the votes score is dynamic based on the number of upvotes and downvotes, so it’ll be changing in real-time. We will add “::text” to our selector so that we only get the vote value and not the complete vote element. To fetch all the votes:
response.css(".score.unvoted::text").extract()
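To make the combined selector concrete, here is a minimal sketch in Python 3 that uses only the standard library, so it is an illustration of the matching rule and not how Scrapy implements selectors. The sample markup and the helper class BothClassesText are assumptions made for this example. An element matches “.score.unvoted” only when its class attribute contains both names:

```python
from html.parser import HTMLParser


class BothClassesText(HTMLParser):
    """Collect the text of elements whose class list contains every wanted class.

    Hypothetical helper for illustration; it only handles flat, non-nested elements.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.capturing = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Split the class attribute into individual class names.
        classes = set((dict(attrs).get("class") or "").split())
        # Capture only if every wanted class is present on this element.
        self.capturing = self.wanted <= classes

    def handle_endtag(self, tag):
        self.capturing = False

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.texts.append(data.strip())


# Assumed sample markup: three score elements that share the "score" class.
html = (
    '<div class="score dislikes">4519</div>'
    '<div class="score unvoted">4520</div>'
    '<div class="score likes">4521</div>'
)

parser = BothClassesText(["score", "unvoted"])
parser.feed(html)
print(parser.texts)  # ['4520']
```

Passing only ["score"] to the helper would return all three values, which is exactly why the single class is not a usable selector here.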
Note: Scrapy has two functions to extract the content extract() and extract_first().
On inspecting the post, it is clear that the “time” element contains the time of the post.
There is a catch here, though: this is only the relative time (16 hours ago, etc.) of the post. It doesn’t give any information about the date or the time zone. If we want to do some analytics, we won’t know from which date to calculate “16 hours ago”. Let’s inspect the time element a little more:
The “title” attribute of time has both the date and the time in UTC. Let’s extract this instead:
response.css("time::attr(title)").extract()
The ::attr(attributename) pseudo-element is used to get the value of the specified attribute of the matching element.
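Once the absolute timestamp is extracted, it usually needs to be turned into a real datetime before any analysis. The sketch below (Python 3, standard library only) shows one way to do that. The sample string and its format are assumptions; check the actual attribute value on the page and adjust the format string accordingly.

```python
from datetime import datetime, timezone

# Assumed sample value; the real attribute text may be formatted differently.
raw_title = "Thu Jul 27 10:32:45 2017 UTC"

# Parse the timestamp, then mark it explicitly as UTC.
created_at = datetime.strptime(raw_title, "%a %b %d %H:%M:%S %Y UTC").replace(
    tzinfo=timezone.utc
)

print(created_at.isoformat())  # 2017-07-27T10:32:45+00:00
```

Storing the value as a timezone-aware datetime makes later comparisons and groupings by date unambiguous.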
So far:
Note: CSS selectors are a very important concept as far as web scraping is concerned. You can read more about it here and how to use CSS selectors with scrapy.
As mentioned above, a spider is a program that downloads content from websites or a given URL. When extracting data on a larger scale, you would need to write custom spiders for different websites since there is no “one size fits all” approach in web scraping owing to the diversity in website designs. You also would need to write code to convert the extracted data to a structured format and store it in a reusable format like CSV, JSON (JavaScript Object Notation), excel, etc. That’s a lot of code to write. Luckily, scrapy comes with most of these functionalities built in.
Let’s exit the scrapy shell first and create a new scrapy project:
scrapy startproject ourfirstscraper
This will create a folder, “ourfirstscraper” with the following structure:
For now, the two most important files are:
Let’s change the directory into ourfirstscraper and create a basic spider “redditbot”:
scrapy genspider redditbot www.reddit.com/r/gameofthrones/
This will create a new spider, “redditbot.py” in your spiders/ folder with a basic template:
A few things to note here:
After every successful crawl, the parse(..) method is called, and so that’s where you write your extraction logic. Let’s add the logic written earlier to extract titles, time, votes, etc., in the parse method:
    def parse(self, response):
        # Extracting the content using css selectors
        titles = response.css('.title.may-blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()

        # Give the extracted content row wise
        for item in zip(titles, votes, times, comments):
            # create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'vote': item[1],
                'created_at': item[2],
                'comments': item[3],
            }

            # yield or give the scraped info to scrapy
            yield scraped_info
Note: Here, yield scraped_info does all the magic. This line returns the scraped info(the dictionary of votes, titles, etc.) to scrapy, which in turn processes it and stores it.
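The row-wise pairing done by zip has one behaviour worth knowing about: it stops at the shortest list. The sketch below (Python 3, plain Python with made-up sample values, no Scrapy needed) shows what happens if one selector returns fewer results than the others. The function name rows and the sample data are assumptions for illustration.

```python
from itertools import zip_longest

# Made-up sample data: one vote value is missing.
titles = ["Post A", "Post B", "Post C"]
votes = ["120", "87"]
times = ["t1", "t2", "t3"]


def rows(titles, votes, times):
    """Yield one dictionary per post, in the same way parse() yields items."""
    for title, vote, created_at in zip(titles, votes, times):
        yield {"title": title, "vote": vote, "created_at": created_at}


paired = list(rows(titles, votes, times))
print(len(paired))  # 2, because zip() stops at the shortest list

# zip_longest keeps every row and fills the gaps instead of dropping them.
padded = list(zip_longest(titles, votes, times, fillvalue=None))
print(len(padded))  # 3
```

Note that neither variant can tell which post the missing vote belonged to, so values may end up attached to the wrong row. If the lists can differ in length, a sturdier design is to select each post’s container element first and extract the fields relative to it.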
Save the file redditbot.py and head back to the shell. Run the spider with the following command:
scrapy crawl redditbot
Scrapy will print a lot of stuff on the command line. Let’s focus on the data.
Notice that all the data is downloaded and extracted into a dictionary-like object that holds the votes, title, created_at, and comments.
Getting all the data on the command line is nice, but as a data scientist, it is preferable to have data in certain formats like CSV, Excel, JSON, etc., that can be imported into programs. Scrapy provides this nifty little functionality where you can export the downloaded content in various formats. Many of the popular formats are already supported.
Open the settings.py file and add the following code to it:
#Export as CSV Feed
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"
And run the spider:
scrapy crawl redditbot
This will now export all scraped data into a file called reddit.csv. Let’s see how the CSV looks:
What happened here:
Scrapy supports a plethora of formats for exporting feeds. If you want to dig deeper, you can check here.
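If you are on a newer Scrapy release, the same export can be described with a single FEEDS dictionary. To the best of my knowledge this setting arrived in Scrapy 2.1, where FEED_FORMAT and FEED_URI became deprecated, so treat the fragment below as a sketch that assumes such a version:

```python
# settings.py (fragment): assumes a Scrapy release that supports the FEEDS setting
FEEDS = {
    # output file name mapped to its export options
    "reddit.csv": {"format": "csv"},
}
```

For a one-off export you can also skip the settings file and pass the output file on the command line: scrapy crawl redditbot -o reddit.csv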
Now that you have successfully created a system that crawls web content from a link, scrapes(extracts) selective data from it, and saves it in an appropriately structured format, let’s take the game a notch higher and learn more about web scraping.
Let’s now look at a few case studies to get more experience with scrapy as a tool and its various functionalities.
The advent of the internet and smartphones has been an impetus to the e-commerce industry. With millions of customers and billions of dollars at stake, the market has started seeing a multitude of players. This, in turn, has led to the rise of e-commerce aggregator platforms that collect and show you information about products from across multiple portals. For example, when planning to buy a smartphone, you would want to see the prices on different platforms in a single place. What does it take to build such an aggregator platform? Here’s my small take on building an e-commerce site scraper.
As a test site, you will scrape ShopClues for 4G smartphones.
Let’s first generate a basic spider:
scrapy genspider shopclues www.shopclues.com/mobiles-featured-store-4g-smartphone.html
This is what the ShopClues web page looks like:
The following information needs to be extracted from the page:
On careful inspection, it can be seen that the attribute “data-img” of the <img> tag can be used to extract image URLs:
response.css("img::attr(data-img)").extract()
Notice that the “title” attribute of the <img> tag contains the product’s full name:
response.css("img::attr(title)").extract()
Similarly, the selectors for price (“.p_price”) and discount (“.prd_discount”) can be found by inspecting the page.
Scrapy provides reusable image pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally).
The Images Pipeline has a few extra functions for processing images. It can:
In order to use the images pipeline to download images, it needs to be enabled in the settings.py file. Add the following lines to the file:
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = 'tmp/images/'
Here, you are basically telling scrapy to use the Images Pipeline and to store the images in the folder “tmp/images/”. The final spider would now be:
import scrapy


class ShopcluesSpider(scrapy.Spider):
    # name of spider
    name = 'shopclues'

    # list of allowed domains (domain names only, without scheme or path)
    allowed_domains = ['www.shopclues.com']
    # starting url
    start_urls = ['http://www.shopclues.com/mobiles-featured-store-4g-smartphone.html/']
    # location of csv file
    custom_settings = {
        'FEED_URI': 'tmp/shopclues.csv'
    }

    def parse(self, response):
        # Extract product information
        titles = response.css('img::attr(title)').extract()
        images = response.css('img::attr(data-img)').extract()
        prices = response.css('.p_price::text').extract()
        discounts = response.css('.prd_discount::text').extract()

        for item in zip(titles, prices, images, discounts):
            scraped_info = {
                'title': item[0],
                'price': item[1],
                'image_urls': [item[2]],  # sets the url for scrapy to download images
                'discount': item[3]
            }

            yield scraped_info
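A word on allowed_domains: the attribute is meant to hold domain names rather than full URLs or paths. If you start from a page URL, the domain can be derived from it. The sketch below (Python 3, standard library only) shows one way to do it; the variable names are assumptions for illustration.

```python
from urllib.parse import urlparse

start_url = "http://www.shopclues.com/mobiles-featured-store-4g-smartphone.html"

# netloc is the host part of the URL, without the scheme or the path.
allowed_domain = urlparse(start_url).netloc

print(allowed_domain)  # www.shopclues.com
```

This matters mostly once the spider follows links to other pages, because requests to hosts outside the allowed domains are filtered out.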
Here are a few things to note:
On running the spider, the output can be read from “tmp/shopclues.csv”:
You also get the images downloaded. Check the folder “tmp/images/full,” and you will see the images:
Also, notice that scrapy automatically adds the download path of the image on your system in the csv:
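The download path is stored relative to IMAGES_STORE. As a rough sketch (Python 3, standard library only), this is how the local file location can be rebuilt from a scraped item. The item below is a shortened, made-up example of the shape the Images Pipeline produces, with an added “images” field holding url, path, and checksum; the concrete values are placeholders.

```python
from pathlib import PurePosixPath

IMAGES_STORE = "tmp/images/"

# Shape of an item after the Images Pipeline has run; values are placeholders.
item = {
    "title": "Example phone",
    "image_urls": ["https://example.com/phone.jpg"],
    "images": [
        {
            "url": "https://example.com/phone.jpg",
            "path": "full/0a1b2c.jpg",
            "checksum": "placeholder",
        }
    ],
}

# Join the storage folder with each relative path reported by the pipeline.
local_files = [
    str(PurePosixPath(IMAGES_STORE) / image["path"]) for image in item["images"]
]
print(local_files)  # ['tmp/images/full/0a1b2c.jpg']
```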
There you have your own little e-commerce aggregator.
If you want to dig in, you can read more about Scrapy’s Images Pipeline here.
TechCrunch is one of my favorite blogs that I follow to stay abreast of news about startups and the latest technology products. Just like many blogs nowadays, TechCrunch provides its own RSS feed here: https://techcrunch.com/feed/. One of Scrapy’s features is its ability to handle XML data with ease, and in this part, you are going to extract data from TechCrunch’s RSS feed. First, generate a basic spider:
scrapy genspider techcrunch techcrunch.com/feed/
Let’s have a look at the XML; the marked portion is data of interest:
Here are some observations from the page:
XPath is a syntax that is used to select parts of an XML document. It can be used to traverse through an XML document. Note that XPath follows the hierarchy of the document.
Let’s extract the title of the first post. Similar to response.css(..), the function response.xpath(..) in scrapy deals with XPath. The following code should do it:
response.xpath("//item/title").extract_first()
Output:
u'<title xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/">Why the future of deep learning depends on finding good data</title>'
Wow! That’s a lot of content, but only the text content of the title is of interest. Let’s filter it out:
response.xpath("//item/title/text()").extract_first()
Output:
u'Why the future of deep learning depends on finding good data'
This is much better. Notice that text() here is the equivalent of ::text from CSS selectors. Also, look at the XPath //item/title/text(); here, you are basically saying: find every “item” element and extract the text content of its child element “title”.
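To get a feel for how such a path walks the hierarchy without hitting the live feed, here is a small sketch in Python 3 using the standard library’s ElementTree instead of Scrapy. ElementTree supports only a subset of XPath (there is no text() step, so the text is read from the .text attribute), and the feed content below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up miniature RSS document.
rss = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(rss)

# ".//item/title" means: any <item> below the root, then its <title> child.
titles = [title.text for title in root.findall(".//item/title")]
print(titles)  # ['First post', 'Second post']
```

The channel’s own title is not returned, because it is not a child of an item element.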
Similarly, the XPaths for the link and pubDate are //item/link/text() and //item/pubDate/text().
Notice the <creator> tags:
The tag name carries the prefix “dc:”, which is an XML namespace prefix, and because of it a plain XPath such as //item/creator cannot match the tag. The author name is also wrapped in “<![CDATA[..” markup. You don’t want to deal with the namespaces here, so we’ll ask scrapy to remove them:
response.selector.remove_namespaces()
Now when you try extracting the author name, it will work:
response.xpath("//item/creator/text()").extract_first()
Output:
u'Ophir Tanz,Cambron Carter'
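If you are curious why the prefix gets in the way, the following sketch (Python 3, standard library ElementTree, not Scrapy) reproduces the situation on a made-up feed. The plain tag name finds nothing, while mapping the prefix to its namespace URI makes the path match. The sample document and author name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up miniature feed that declares the "dc" namespace.
rss = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Example post</title>
      <dc:creator><![CDATA[Jane Doe]]></dc:creator>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)

# Without the namespace, the plain tag name does not match anything.
plain = root.findall(".//item/creator")

# Mapping the prefix to its URI lets the path match the namespaced tag.
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
creators = [c.text for c in root.findall(".//item/dc:creator", ns)]

print(plain, creators)  # [] ['Jane Doe']
```

Removing the namespaces, as done above with scrapy, is the simpler route when you do not need to tell apart tags that share a name across namespaces.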
The complete spider for TechCrunch would be:
import scrapy


class TechcrunchSpider(scrapy.Spider):
    # name of the spider
    name = 'techcrunch'

    # list of allowed domains (domain names only, without scheme or path)
    allowed_domains = ['techcrunch.com']
    # starting url for scraping
    start_urls = ['http://techcrunch.com/feed/']
    # setting the location of the output csv file
    custom_settings = {
        'FEED_URI': 'tmp/techcrunch.csv'
    }

    def parse(self, response):
        # Remove XML namespaces
        response.selector.remove_namespaces()

        # Extract article information
        titles = response.xpath('//item/title/text()').extract()
        authors = response.xpath('//item/creator/text()').extract()
        dates = response.xpath('//item/pubDate/text()').extract()
        links = response.xpath('//item/link/text()').extract()

        for item in zip(titles, authors, dates, links):
            scraped_info = {
                'title': item[0],
                'author': item[1],
                'publish_date': item[2],
                'link': item[3]
            }

            yield scraped_info
Run the spider from the command line:
scrapy crawl techcrunch
And there you have your own RSS reader!
Also, check out some of the interesting projects built with Scrapy:
There are also multiple other libraries for web scraping; BeautifulSoup and Selenium are two of them. To learn more, you can go through our free course, Introduction to Web Scraping using Python.
Web scraping using Scrapy Python offers a comprehensive solution for extracting data from websites efficiently and effectively. With its robust framework, Scrapy Python simplifies the process, allowing you to focus on data processing and storage without worrying about the intricacies of web crawling. Whether you’re working on a small project or a large-scale data extraction task, Scrapy provides the tools and flexibility you need. By exploring various Scrapy examples, you can quickly learn how to harness its capabilities, making web scraping using Scrapy a valuable skill for any data-driven project.
All the code used in this scrapy tutorial is available on GitHub.
A. Some of the advantages of Scrapy are:
1. It provides a high-level API, which makes it easy to build and maintain projects.
2. Scrapy can handle websites with a large number of pages and complex structures. It supports following pagination links, allowing a spider to move to the next or previous pages easily.
3. Scrapy is fast and efficient.
4. Scrapy is highly extensible and can be customized to meet our needs. We can add custom middleware, pipelines, and extensions to enhance the functionality of the framework.
5. Scrapy supports multiple data storage formats like CSV files, JSON files, etc.
A. Scrapy is a Python open-source web crawling framework used for large-scale web scraping. It is a web crawler used for both web scraping and web crawling. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format.
A. The key difference between these two is that using web scraping, we aim at extracting specific data from a webpage, whereas web crawling is a broad exploration of the web.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
By far the simplest and the best explaination about scrapy. Thanks !!
Thanks for your comment, Mayank! :)
How would I use the save scrapy items and integrate it in my project so it will display the items on the website page?
Hi Mohammed, A very detailed article on scraping. Could you please let me know how does scrapy differs from Beautifulsoup?
Hey Karthikeyan, BeautifulSoup is a library that "parses" HTML or XML content. In other words, it reads your HTML file and helps extract content from it. Scrapy is a full blown web scraping framework. That means, it already has the functionality that BeautifulSoup provides along with that it offers much more. When you are developing a web scraping system, you would need a way to send requests to the websites (probably using requests or urllib) , you would need a way to send multiple requests at once(multiprocessing/asynchronous) so that you can download content faster. You would also need a way to export your downloaded content in various required formats, if you are working on large scale projects, you would require deploying your scraping code across distributed systems. Scrapy provides you with all of that and much more in built. And yeah, you can use BeautifulSoup with Scrapy if you prefer. Hope this helps, Sanad :)
Hi Sanad, I am currently started using scrapy but two roadblocks I have first in our domain we need to crawl pdf pages which scrapy doesn't provide and after googling I found couple of paid ways which we don't prefer, second how we write junit for any scrapy code to do unit testing is there any framework for this? Please help me out on this. Thanks Ankit
Hey Ankit, 1. I'm not sure what do you mean by crawling PDF pages? If you are trying to scrape websites for PDF files, it again depends on what you are trying to achieve. You can probably use Scrapy to extract link of target PDFs and urllib2 or requests to fetch the PDF files. And then you can use something like PDFMiner( https://pypi.python.org/pypi/pdfminer/) to parse PDF and extract information. 2. Regarding writing unit tests for Scrapy code, it provides an integrated way to unit test spiders, check out Spiders Contracts : https://doc.scrapy.org/en/latest/topics/contracts.html
Hello...thanks for an explanatory tutorial...how can I start the scrapy server from Jupiter notebook?
Hey Isindor, I usually don't run scrapy server from Jupyter Notebook. I run it from the command line to export data in CSVs and then import those CSVs using pandas in Notebook. But, you can execute any terminal command in Jupyter Notebook using '!' sign before the command. Something like this: !scrapy crawl redditbot Hope this helps. Sanad
Great article but I'm a little surprised it didn't touch on the challenges of using Scrapy when trying to scrape JavaScript heavy websites. Most of the sites that I work with now require also using Splash to render the JavaScript. As such I've also started looking at the Selenium and WebDriver option. At first, I tried very hard to limit myself to only Scrapy and Splash but after a month working on a complicated site, I'm really wishing I would have changed approaches much earlier. I've done more in a few days with Selenium using the page object pattern than in weeks of Scrapy and Splash development.
Hey Charles, True that with the advent of JavaScript based front end frameworks and libraries, it is becoming difficult to scrape websites as such. We would have to use Selenium and Webdriver to aid in the part where we require user action like clicking a popup or filling a form. It's not rare to see Scrapy applied in conjunction with Selenium in projects. Yet, we have to remind ourselves that that's not the problem Scrapy is meant to solve. You could argue web scraping is a domain of its own with sub domains, one such sub domain being dealing with dynamic/javascript heavy websites. This article's goal was supposed to get a beginner started with web scraping especially with the use of Scrapy. It would have been overkill to try to cover all aspects of advanced web scraping. Hope this helps, Sanad
Hi Sanad, I have an issue with starting the scrapy shell. When I am typing scrapy shell in the command terminal/ ! scrapy shell in jupyter notebook it is showing 'scrapy is not recognized as internal or external command, operable program or batch file' Any suggestions on how to overcome this issue and proceed further Thanks.
can scrapy scrape data that is inside the iframe? //copy pasting of xpath of website isnt working
Hey Anurag, An IFrame is used when you want to embed a web page within another web page. What is actually happening under the hood is the element is showing the content of a given URL. More Here - https://www.w3schools.com/tags/tag_iframe.asp What you will do in this case is extract all such URLs that IFrame is displaying using Scrapy and then create another request for those URLs and give them to Scrapy. It will then handle things similarly. If I am not wrong, this answer will help you - https://stackoverflow.com/a/24302223 Also, learn more about Scrapy Requests here - https://doc.scrapy.org/en/latest/topics/request-response.html I would suggest you to first understand the basics of Scrapy well. Hope this helps :) Sanad
Very nice article, I am beginner in webscraping, have been using Beautiful Soup. I am excited to try out the examples using Scrapy. Great job with the explanation. What are some of the websites/people/blogs that I can follow to better understand webscraping and also get the latest info?
Hey Angeline, Check out the resources given in the End Note. Those are some really good blogs/people to follow to keep updated with Scrapy. Thanks. Sanad
This is great, I tried to use it from the shell for the same url that is in the example with python 3 and win 10 but I got error as below. Can you suggest something as I am new to scrapy. Thanks In [4]: print (response.text) --------------------------------------------------------------------------- UnicodeEncodeError Traceback (most recent call last) in () ----> 1 print (response.text) C:\Users\Owner\Anaconda3\lib\encodings\cp437.py in encode(self, input, final) 17 class IncrementalEncoder(codecs.IncrementalEncoder): 18 def encode(self, input, final=False): ---> 19 return codecs.charmap_encode(input,self.errors,encoding_map)[0] 20 21 class IncrementalDecoder(codecs.IncrementalDecoder): UnicodeEncodeError: 'charmap' codec can't encode character '\u2022' in position 1047: character maps to
Hey, I was not able to reproduce the error.
hi Lwebzem, use this - print (response.text.encode("utf-8"))
During running command Scrapy genspider techcrunch techchrunch.com/feed/ , I encountered an error which related to permission i.e Permission denied :'.\\techchrunch.py'. How can I resolve this error, I am using python 3 and anaconda in windows.
Hey Pulkit , I think you don't have the permission to write to your disk. It doesn't seem to be a scrapy issue.
Hi Sanad, I am not able to open scrapy shell. An error "Scrapy is not a recognized external or internal command or batch file" is coming when I am typing scrapy shell in the terminal. Please help Thanks
You have to first install scrapy with the following command: pip install scrapy
Hi Sanad, Very nice article. Thanks. I have one question regarding scrapping within web site request limit . How do we control number of request sent to website , so it doesn't hold the web traffic and also is in limit of not getting blocked. What is the limit for number of request sent based on your experience. Thanks
Hey Saurabh, 1. The settings.py file has many parameters that you can use to tune your scraping code. Like the maximum number of concurrent requests sent to a site, maximum depth of crawl etc. Check this - https://doc.scrapy.org/en/latest/topics/settings.html#topics-settings-ref 2. Scrapy also has this feature called "Autothrottle". It automatically controls the number of requests and crawling speed based on the server response time to avoid getting blocked and prevent putting a load on the server. Here - https://doc.scrapy.org/en/latest/topics/autothrottle.html#topics-autothrottle Thanks Sanad :)
Great article and explained the flow in step-by-step manner, so simple that even python beginners can also give a try and see the code working.
Thank you for your comment! :)
HI Sanad, Really nice article. Thanks for putting it up.
Hey, Thanks for the feedback :)
I am getting error in below line scrapy startproject ourfirstscraper www.reddit.com/r/gameofthrones/ error File "", line 1 scrapy startproject ourfirstscraper www.reddit.com/r/gameofthrones syntax error : invalid syntax
Hey Kabir, There was a typo in this line which has been fixed. Please check again. Sanad :)
Hi Sanad, I'm getting this error AttributeError Traceback (most recent call last) in () ----> 1 response.css("img::attr(data-img)").extract() AttributeError: 'NoneType' object has no attribute 'css'
Hey Yash, This basically means that your 'response' object is empty or not properly made. There would be some error in preceding lines of code. Sanad
Great artice! I also have a doubt that if i want the spider to also go through the other webpages....Like in the redditbot if i want to scrape the the pages after the first page...how do i iterate the links i have specified in 'allowed_domains' list??
Hey Dhawal, In order to make your scraper go to the next pages, you would need the link to the next page. Check out this tutorial - https://doc.scrapy.org/en/latest/intro/tutorial.html#following-links Hope this helps, Sanad :)
Hi Sanad, Thanks for the nice tutorial. When I was trying to pull data from shopclues, I am getting only one record in the output csv file. Not sure what is the issue with my code. I am pasting the code and the generated log. Can you please advise?

import scrapy

class MyshopcluesSpider(scrapy.Spider):
    name = "myshopclues"
    allowed_domains = ["www.shopclues.com/mobiles-featured-store-4g-smartphone.html"]
    start_urls = ["http://www.shopclues.com/mobiles-featured-store-4g-smartphone.html/"]
    custom_settings={ 'FEED_URI':'tmp/shopclues.csv' }

    def parse(self, response):
        images=response.css("img::attr(data-img)").extract()
        titles=response.css("img::attr(title)").extract()
        prices=response.css(".p_price::text").extract()
        discounts=response.css(".prd_discount::text").extract()
        for item in zip(titles,prices,images,discounts):
            scraped_info={ 'title':item[0], 'price':item[1], 'image_urls':[item[2]], 'discount':item[3] }
        yield scraped_info

(C:\Users\mupago\AppData\Local\conda\conda\envs\my_root) C:\Users\mupago\www.shopclues.com\shopclues\spiders>scrapy crawl myshopclues
2017-08-07 22:17:13 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: shopclues)
2017-08-07 22:17:13 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'shopclues', 'FEED_FORMAT': 'csv', 'NEWSPIDER_MODULE': 'shopclues.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['shopclues.spiders']}
2017-08-07 22:17:13 [scrapy.middleware] INFO: Enabled extensions:['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2017-08-07 22:17:13 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-07 22:17:13 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-07 22:17:13 [scrapy.middleware] INFO: Enabled item pipelines: ['scrapy.pipelines.images.ImagesPipeline']
2017-08-07 22:17:13 [scrapy.core.engine] INFO: Spider opened
2017-08-07 22:17:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-07 22:17:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-08-07 22:17:15 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-08-07 22:17:15 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-08-07 22:17:15 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloadedimage from referred in
2017-08-07 22:17:15 [scrapy.core.scraper] DEBUG: Scraped from {'title': 'Swipe Konnect Neo 4G, Black [4G VoLTE, Quad Core, Android v6.0 Marshmallow, 5MP Camera] (Black)', 'price': 'Rs.3099', 'image_urls': ['https://cdn.shopclues.com/images/thumbnails/79033/200/200/124122307101876702KonnectNEO4GMainImage15000260381501218627.jpg'], 'discount': '28% Off', 'images': [{'url': 'https://cdn.shopclues.com/images/thumbnails/79033/200/200/124122307101876702KonnectNEO4GMainImage15000260381501218627.jpg', 'path': 'full/d03603c774c1a790d1e813e73743e60f1db3bd16.jpg', 'checksum': '217ee1803456f4b83294c302d41cc9e7'}]}
2017-08-07 22:17:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-07 22:17:15 [scrapy.extensions.feedexport] INFO: Stored csv feed (1 items) in: tmp/shopclues.csv
2017-08-07 22:17:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:{'downloader/request_bytes': 482, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 32121, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'file_count': 1, 'file_status_count/uptodate': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 8, 8, 2, 17, 15, 400925), 'item_scraped_count': 1, 'log_count/DEBUG': 5, 'log_count/INFO': 8, 'response_received_count': 2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2017, 8, 8, 2, 17, 13, 850770)}
2017-08-07 22:17:15 [scrapy.core.engine] INFO: Spider closed (finished)
Hey, I checked your code but couldn't find anything that stands out. I reused my code from here: https://github.com/mohdsanadzakirizvi/web-scraping-magic-with-scrapy-and-python and it works perfectly fine. Hope this helps, Sanad
This line needs to be inside the "for" loop: "yield scraped_info". (Sorry for my poor English.)
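A minimal sketch of that suggestion, written as a plain generator so it runs without Scrapy. The function name build_items and the sample values are made up for illustration; in the spider, the same loop would sit inside parse().

```python
def build_items(titles, prices, images, discounts):
    """Mirror the body of parse(): build one dict per product."""
    for title, price, image, discount in zip(titles, prices, images, discounts):
        scraped_info = {
            "title": title,
            "price": price,
            "image_urls": [image],
            "discount": discount,
        }
        # yield is indented under the for loop, so it runs once per product
        yield scraped_info


# Made-up sample data standing in for the response.css(...).extract() results
items = list(build_items(
    ["Phone A", "Phone B"],
    ["Rs.3099", "Rs.4999"],
    ["a.jpg", "b.jpg"],
    ["28% Off", "10% Off"],
))
print(len(items))  # 2
```

Note that zip() stops at the shortest list, so if one selector matches fewer elements than the others, the extra products are silently dropped as well.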
I am getting an invalid syntax error when I try to run the spider. I was able to see the response and the text responses individually.

>>> scrapy crawl smartpricebot
File "", line 1
scrapy crawl smartpricebot
^
SyntaxError: invalid syntax

I don't know why; I must be missing something. Please help me resolve this.
Hey Avilash, You are trying to run the spider from within the Python or Scrapy shell. This command works when you are in your regular terminal (command line). As I have mentioned in my article, exit the Scrapy shell first and then try it. Sanad :)
I have read thousands of articles and watched millions of video tutorials to learn Scrapy, but I was still not able to run a project successfully; all my spiders got stuck halfway or came back with empty data. After reading your article, I was finally able to build a project that works. Thanks a lot. By the way, could you please write another Scrapy tutorial on how to schedule a Scrapy task and how to overwrite a CSV file? Thanks once again.
Hey Chang, Thanks for the comment, means a lot! I'll see what I can do. :) Regards Sanad
How do you handle the case where item['discount'] is 0? I want to skip any 0 or empty values from the scraped data in the CSV.
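One possible approach, sketched in plain Python: check each item before it is yielded and skip the ones without a usable discount. The helper names has_discount and keep_discounted are hypothetical, and the sketch assumes discounts look like the '28% Off' strings seen elsewhere in these comments.

```python
def has_discount(item):
    """Return True when the item carries a non-empty, non-zero discount."""
    value = str(item.get("discount") or "").strip()
    # Keep only the digits, so "", "0", "0%" and "0% Off" all count as no discount
    digits = "".join(ch for ch in value if ch.isdigit())
    return bool(digits) and int(digits) != 0


def keep_discounted(items):
    """Yield only the items that pass the has_discount() check."""
    for item in items:
        if has_discount(item):
            yield item


# Made-up sample items for illustration
sample = [
    {"title": "Phone A", "discount": "28% Off"},
    {"title": "Phone B", "discount": "0% Off"},
    {"title": "Phone C", "discount": ""},
    {"title": "Phone D", "discount": 0},
]
kept = [item["title"] for item in keep_discounted(sample)]
print(kept)  # ['Phone A']
```

In a Scrapy project the same check could go inside parse() before the yield, or in an item pipeline whose process_item() raises scrapy.exceptions.DropItem for items that fail it.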
Great job on the scraping walkthroughs. Is there a way to scrape multiple websites for a keyword and extract the associated info? Kind of similar to what Google does, but returning some additional variables related to the keyword?
Hey Declan, Thanks for the feedback! :)
Great tutorial. The examples are very easy to learn from and work fine. Greetings from Chile.
Thanks for the appreciation Fernando! :D
Great article! I am new to Scrapy and this information helped me a lot. Can we scrape data from websites that require a login? That is, I need data from a website for multiple users; each user has a unique ID, and logging in means entering that ID. I want to scrape each user's data and display it in the output. Can I do this? And can we log in to the website through code without being redirected to that website?
Hey Ajay, Thanks for appreciating! :D Yes, look here- https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin Hope this helps, Sanad :)
Hey Ajay, Check this out https://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/
Hi, I am planning to use Scrapy to crawl all the pages/APIs of a bank site and to parse response headers such as "Cache-Control" and "Content-Length", with their values and the respective URL, in an automated way. Is this possible with Scrapy?
Hey Sandeep, yes you can :D Hope this helps, Sanad
Hi, I am trying to scrape course data from edx.org, but I am not able to get any data because the site uses JavaScript to render it. Can you please help me with this?
Hi Sanad, thank you for this great post, very illuminating. Full disclosure: total beginner here (to Scrapy). I am tasked with extracting links from a bunch of websites (about 50), and I was wondering whether that is possible with Scrapy. If it is, could you give me a brief guide on how, or direct me somewhere I can get help with it?
Hey Odin, thank you for your feedback! Different kinds of sites require different methodologies, as you can probably see from the diverse case studies given in the article.
Hello Author, Great article. I just read it and will try it later. Please answer my queries if I get stuck :) Thank you.
Hey pgjosh, Sure, anytime :) Sanad
Hi Mohd! Great tutorial, very thorough. This is what I have been looking for, for my Big Data project. I'm new to both Python, scraping, crawling and all that but this looks like something I could get started with right away. Could you give some hints on how to get both the posts data AND the comments connected to that post? Or maybe a link where one can find more help on this? Thanks again, it's highly appreciated! /Jonas
Hey Jonas, Check this out https://doc.scrapy.org/en/latest/topics/request-response.html
Hi Sanad, Using the custom_settings line distorted the extracted CSV file, whereas adding the file name in settings.py gave a clean CSV. Is there something I need to take care of to make the custom_settings line work properly? Also, I disabled the format and export options in settings.py when custom_settings = .... was enabled in the main spider file. Thanks in advance :) Akshay
Reposting because I posted in the wrong place. How would I save the Scrapy items and integrate them into my project so that the items are displayed on the website page?
Hey Vincent, store the items in a database. On your website, fetch the contents from that database.
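As a rough sketch of that idea, an item pipeline can write each item to SQLite using only the standard library. The class name SQLitePipeline, the file name items.db, the products table and its columns are all assumptions made for illustration; open_spider, process_item and close_spider are the methods Scrapy calls on an item pipeline.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical item pipeline that stores each scraped item in SQLite."""

    def __init__(self, db_path="items.db"):  # file name is an assumption
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB, create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products "
            "(title TEXT, price TEXT, discount TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.conn.execute(
            "INSERT INTO products (title, price, discount) VALUES (?, ?, ?)",
            (item.get("title"), item.get("price"), item.get("discount")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()


def fetch_products(conn):
    """What the website side might run to read the stored rows."""
    return conn.execute(
        "SELECT title, price, discount FROM products"
    ).fetchall()
```

To use something like this, the class would be listed in the ITEM_PIPELINES setting, for example {'myproject.pipelines.SQLitePipeline': 300}, where the module path is a placeholder for your own project. The website then only needs read access to the same database.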
Hello, I need a good example of scraping multiple pages. I'm a beginner.
Hi Sanad, I am working with the Scrapy framework and am able to capture dynamic content with the help of a Lua script. I am trying to improve the rendering speed for a few pages by disabling all images, JS, CSS and any other external requests except the HTML response. I see a disable-images option in the Lua script, but I am not sure how to disable the other content. How can I achieve this? Could you please shed some light on this?
Great tutorial +Mohd Sanad Zaki Rizvi. I wonder how to make such a scraper and put it on a website, I mean as a paid tool: somebody logs in, pays and then uses the tool. Could you point me to where I can find such a tutorial? I have been searching for one and can't find it. Greets.
Hey Lukasz, You would probably rent a cloud machine and run your scraper on that and it will store the scraped content in a database. Whenever someone wants to access the scraped content they would visit your website that will fetch the content from the above database. Hope this helps, Sanad :)
Super useful, thank you! Helped me get an overview of the whole process.
Hello there! I want to know how to crawl JavaScript pages, or other pages whose source code does not contain the actual items shown on the website. Is it possible to crawl JavaScript web pages on canopy-python 3.5+? Thanks
Hey baba, for scraping JS-rendered pages you will have to use another framework like Selenium in conjunction with Scrapy. Check out how Selenium works here - https://medium.com/@hoppy/how-to-test-or-scrape-javascript-rendered-websites-with-python-selenium-a-beginner-step-by-c137892216aa Regards, Sanad :)
Hi Sanad I want to get information regarding the startups and incubation pages that various universities have on their websites. So what I have in my head is that I will have to first crawl various pages to get the urls of the various universities and then again deploy the crawler on each of those urls to get the url of the page where the data regarding startups and incubation centers is present. Am i thinking in the right direction? Thanks Khushboo
Hey Khushboo, Yes, looks alright. There is a feature in scrapy that lets you follow the links you have extracted and scrape content from them too. Check out https://doc.scrapy.org/en/latest/intro/tutorial.html#following-links Sanad :)
Hi Sanad, thanks a lot for this lesson, very good job. I have a question about the "2.3 Writing Custom Spiders" part. When I try to extract data into a CSV, the kernel runs and nothing happens. I'm using Jupyter, and nothing happens other than a busy kernel. I can see that the CSV is created, but there is nothing inside it. Thanks for your help.
I spent more than 7 hours online today looking into web scraping in Python using Scrapy, yet I never found an article as interesting as yours. It was well worth it for me. In my view, if all site owners and bloggers made content as good as yours, the net would be much more useful than ever before.
Great tutorial. The examples are very easy to learn from and work fine.
Hey Mich, Glad to know it helped you, keep posting your doubts/suggestions here! Sanad :)
Hello Mohd, I am having great difficulty finding the div tag that includes "comment" when inspecting Reddit. How did you inspect the elements and find the right one when the search returns 40+ results?
This is a very simple and most useful post on Scrapy for a beginner. Thanks for posting. Before this, Scrapy was a mystery to me.
It's very good. The "print response.text" line doesn't work for me; after searching, I found that replacing response.text with response.body works very well. Thanks for the examples.
Hi, I have seen that you reply to every question, and that is why I like this blog. I want to extract information from a whole website, including all the hyperlinks it contains. Can I do that with Scrapy or not? Thank you in advance.
I don't see any problem with that. Though I'd like you to check out these links after going through the blog - 1. Crawl a website using its sitemap - https://doc.scrapy.org/en/latest/topics/spiders.html#sitemapspider 2. Scrapy's link extractor - https://doc.scrapy.org/en/latest/topics/link-extractors.html Let me know how it goes! Sanad :)
Redditbot is not working. It is not saving the stats; it says it crawled 0 pages. Please help me.
Hi! Thank you so much for this tutorial. I'm not very familiar with Python. I would like to know how I can connect my database (MySQL) with the .py file. I could not implement the connection in the same class, redditbotspider; I get "inconsistent use of tabs and spaces in indentation". Any ideas, please?
Thanks a lot Mohd Sanad, I was lucky to find your article. Your examples find information and display the data. Could you please explain whether, with Scrapy, I can use an if conditional to display, for example, data for a specific date, or use other conditionals or loops?
Hey, I have done everything. Now I can crawl anything from any website through the terminal. But I have a question. As I am new to crawling, I want to crawl a website, store the result in my hosting database and show real-time results on my website. Please suggest how I can crawl so that the results are saved in my database and shown on my website. I hope you get what I mean. I really need your help. Please reply as soon as possible.
Hey there! Thanks for the tutorial! I wanted to point out something that didn't work for me initially, but that I got working by reading through the rest of the tutorial. Where the tutorial first says to edit your settings.py file to:

# Export as CSV feed
FEED_URI = "reddit.csv"
FEED_FORMAT = "csv"

This didn't work for me. I'm not sure where the files were/are being stored or downloaded to, but it was not the current folder where the spider resides. Later in the tutorial there is another spider example where a custom setting is specified just before the parse function and after the URLs. I edited it to see if I could get it to work, and it did:

custom_settings = { 'FEED_URI' : 'tmp/reddit.csv' }

I just wanted to point that out. If anyone can provide any insight into where my initially downloaded CSVs have gone, I'd definitely appreciate it. Thanks again for the tutorial! I've done it two or three times now.
Awesome tutorial, man, but I have a doubt. How can we download files in the .mkv format through Scrapy?
I don't know where I am going wrong, but data is not being scraped from the website. Below is the result of my execution.

2018-03-12 15:08:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: amazon_crawl)
2018-03-12 15:08:23 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2n 7 Dec 2017), cryptography 2.1.4, Platform Darwin-14.5.0-x86_64-i386-64bit
2018-03-12 15:08:23 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'amazon_crawl', 'NEWSPIDER_MODULE': 'amazon_crawl.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['amazon_crawl.spiders']}
2018-03-12 15:08:23 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2018-03-12 15:08:23 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-12 15:08:23 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-12 15:08:23 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-03-12 15:08:23 [scrapy.core.engine] INFO: Spider opened
2018-03-12 15:08:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-12 15:08:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-12 15:08:24 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2018-03-12 15:08:24 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2018-03-12 15:08:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2018-03-12 15:08:25 [scrapy.core.engine] DEBUG: Crawled (404) (referer: None)
2018-03-12 15:08:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response : HTTP status code is not handled or not allowed
2018-03-12 15:08:25 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-12 15:08:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 889, 'downloader/request_count': 4, 'downloader/request_method_count/GET': 4, 'downloader/response_bytes': 6319, 'downloader/response_count': 4, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/301': 2, 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2018, 3, 12, 9, 38, 25, 609330), 'httperror/response_ignored_count': 1, 'httperror/response_ignored_status_count/404': 1, 'log_count/DEBUG': 5, 'log_count/INFO': 8, 'memusage/max': 46002176, 'memusage/startup': 46002176, 'response_received_count': 2, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2018, 3, 12, 9, 38, 23, 681927)}
2018-03-12 15:08:25 [scrapy.core.engine] INFO: Spider closed (finished)
sureshs-MBP:amazon_crawl Apple$

Can you please help me out?
Hello Sanad, First of all, I appreciate your great work. I have some doubts about scraping and hope you can explain. When I scrape data from websites, I get errors like 404, 400, 500, 200... Please let me know why these errors are raised and how I can overcome them. Thanks in advance. Regards, mohan
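For reference, these numbers are HTTP status codes, and not all of them are errors: 200 means the request succeeded. A small sketch using Python's standard http module shows how the codes group together; describe_status is a hypothetical helper written for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if code < 200:
        kind = "informational"
    elif code < 300:
        kind = "success"
    elif code < 400:
        kind = "redirect"
    elif code < 500:
        kind = "client error"
    else:
        kind = "server error"
    return f"{code} {status.phrase}: {kind}"


for code in (200, 400, 404, 500):
    print(describe_status(code))
```

A 404 usually means the requested URL does not exist, a 400 means the server rejected the request as malformed, and a 500 is a failure on the server's side. By default Scrapy only passes successful responses to the spider callback; a spider can list extra codes in its handle_httpstatus_list attribute if it needs to see them.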
Hi, I am liking your tutorial. I need help: how do I get the info out of the following span? I need only the price. ['KSh 1,399 ', 'KSh 1,499 ', 'KSh 8,999 ',
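One way to do this, sketched with the standard re module, is to pull the digits out of each string after it has been extracted. The function name parse_price is hypothetical, and the sketch assumes whole-number prices with commas as thousands separators, as in the three values shown.

```python
import re


def parse_price(text):
    """Extract the integer amount from a string such as 'KSh 1,399 '."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None  # no digits found in the text
    return int(match.group().replace(",", ""))


# The three values visible in the comment above
raw_prices = ["KSh 1,399 ", "KSh 1,499 ", "KSh 8,999 "]
prices = [parse_price(p) for p in raw_prices]
print(prices)  # [1399, 1499, 8999]
```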
Hi, I have tried the examples provided and they work very well. But I am unable to fetch data from amazon.com. I troubleshot it using the Scrapy shell and found that it redirects to a page with a CAPTCHA that must be solved to get access to the amazon.com HTML page. Kindly let me know how I could scrape data from amazon.com.
Clean and crystal-clear article, thanks. Scrapy is the best framework for scraping.
Madani, I'm glad you liked it and found it useful. Thanks for the feedback! Sanad
This is why anyone can learn Machine Learning. Whether you are using publicly available datasets or scraping data from the web via Python libraries like Scrapy, everyone has access to quality datasets.
This is a great tutorial on web scraping in Python! I'm a beginner and this has been very helpful.