And useful reporting in general
Scrapy is a fantastic framework for web scraping, with loads of what you need built in, including stats collection and hooks you can use for email notifications. This post focuses on those features and collects some of the techniques I mixed and matched.
In Scrapy you'll set up your crawler using the scrapy startproject command and work in a main file, say imagescraper.py; at its most simple, that's really all you'll need. But you'll probably also want to edit the settings.py file to adjust things like autothrottling.
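For example, a minimal settings.py tweak to turn on Scrapy's AutoThrottle extension might look like the sketch below; the values are just illustrative and should be tuned to the site you're crawling.

# settings.py (excerpt): illustrative values, tune for the target site
BOT_NAME = 'imagescraper'

# Let Scrapy adapt its crawl rate to the site's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # cap on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server

# A baseline delay between requests also keeps the crawl polite
DOWNLOAD_DELAY = 1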
Here’s our hypothetical scraper:
# imagescraper.py
import scrapy


class ImageSiteCrawler(scrapy.Spider):
    name = "imagescraper"

    start_urls = ['http://bigphotowebsite.com/photos?tag=rain']

    def parse(self, response):
        # follow links to photo entries
        for href in response.css('#results-list .photo a::attr(href)'):
            yield response.follow(href, self.parse_photo)

    def parse_photo(self, response):
        # parse the photo entry and yield the data
        pass
In the above code we're scraping an imaginary site for photos with a certain tag. All the results are listed as links on this one page, and each photo we want has its full-size links and metadata on its own entry page, which we reach by following those links.
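The parse_photo callback is just a stub here; a minimal sketch of what it might yield, with selectors and field names invented for our imaginary site, could look something like this:

def parse_photo(self, response):
    # hypothetical selectors and field names for our imaginary photo site
    yield {
        'title': response.css('h1.photo-title::text').extract_first(),
        'image_url': response.css('.full-size img::attr(src)').extract_first(),
        'photographer': response.css('.byline a::text').extract_first(),
        'source_url': response.url,
    }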
Let's say we wanted to have this scraper run on a schedule and email a report when it's done. You can do that using Scrapy's item pipeline system, which supports hook methods like close_spider that you can use to send an email like this:
# example pipelines.py file
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import datetime
import pprint


class ImagescraperPipeline(object):

    def close_spider(self, spider):
        from_email = "myNotificationGmail@gmail.com"
        to_email = "myEmail@email.com"

        msg = MIMEMultipart()
        msg['From'] = from_email
        msg['To'] = to_email
        msg['Subject'] = 'Image Scraper Report for ' + datetime.date.today().strftime("%m/%d/%y")

        # dump the crawl stats into the email body
        intro = "Summary stats from Scrapy spider: \n\n"
        body = spider.crawler.stats.get_stats()
        body = pprint.pformat(body)
        body = intro + body
        msg.attach(MIMEText(body, 'plain'))

        # send via Gmail's SMTP server
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.starttls()
        server.login(from_email, "###password###")
        text = msg.as_string()
        server.sendmail(from_email, to_email, text)
        server.quit()
(Credit to Nael Shiab for the easy Python email code.)
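One thing the snippet above doesn't show: Scrapy only runs an item pipeline if it's enabled in settings.py. Assuming the project is named imagescraper, that would look something like this:

# settings.py
ITEM_PIPELINES = {
    'imagescraper.pipelines.ImagescraperPipeline': 300,
}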
Out of the box, this will send you a nice little email with a dump of all of Scrapy's stats from the crawl, which looks something like this when it lands in your inbox:
Hi, it's Thursday, November 16, 2017

Summary stats from Scrapy spider: 

{'downloader/request_bytes': 1024,
 'downloader/request_count': 303,
 'downloader/request_method_count/GET': 303,
 'downloader/response_bytes': 1780,
 'downloader/response_count': 303,
 'downloader/response_status_count/200': 303,
 'downloader/response_status_count/404': 0,
 'item_scraped_count': 303,
 'log_count/DEBUG': 303,
 'log_count/INFO': 0,
 'memusage/max': 5200,
 'memusage/startup': 90816,
 'request_depth_max': 2,
 'response_received_count': 303,
 'scheduler/dequeued': 303,
 'scheduler/dequeued/memory': 303,
 'scheduler/enqueued': 303,
 'scheduler/enqueued/memory': 303,
 'start_time': datetime.datetime(2017, 11, 16, 8, 58, 2, 1780)}
That's cool. But what if you want to do a little more? Maybe there are other stats you want to add to the report, based on data your crawler encounters when it first hits the initial page, calculations you do while parsing, or custom validation.
It would be nice if our report also included the website's own count of how many photos are in the results, along with the name of the category, in the summary statistics it emails once the scraper finishes crawling. That way we can check the site's count against the number of requests and items scraped, and see at a glance whether there were problems pulling down the data.
There's a really easy way to do this, though the docs are thin on how. Part of my motivation for writing this was to make a code snippet that was really helpful to me easier to find; it was buried deeply enough online that it took me a while to locate (here, if you're interested).
We’d take our scraper from above and add the code in the following example.
# imagescraper.py
import scrapy


class ImageSiteCrawler(scrapy.Spider):
    name = "imagescraper"

    start_urls = ['http://bigphotowebsite.com/photos?tag=rain']

    def parse(self, response):
        # some data from the first page to collect stats on
        category_name = response.css('.intro h2.cat-name::text').extract_first().strip()
        total_photos = response.css('.page-meta .results::text').extract_first().strip()

        # set those custom stats for the crawler
        self.crawler.stats.set_value('custom_stats_category_name', category_name)
        self.crawler.stats.set_value('custom_stats_total_photos', total_photos)

        # follow links to photo entries
        for href in response.css('#results-list .photo a::attr(href)'):
            yield response.follow(href, self.parse_photo)

    def parse_photo(self, response):
        # parse the photo entry and yield the data
        pass
To do this, we add to the crawler's stats collection using the self.crawler.stats.set_value() method. You don't have to do anything at the pipeline end; whatever you add while scraping and parsing gets passed through with the rest of the stats.
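Besides set_value(), the Stats Collector also has methods like inc_value() and max_value(). For instance, here's a sketch of the custom-validation idea mentioned earlier, keeping a running count of entries that are missing a full-size image (the selector and stat name are invented for our imaginary site):

def parse_photo(self, response):
    image_url = response.css('.full-size img::attr(src)').extract_first()
    if image_url is None:
        # hypothetical validation stat: tally photo entries with no full-size image
        self.crawler.stats.inc_value('custom_stats_missing_image')
    # ... parse the rest of the entry and yield the data as before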
For more detail, check out Scrapy’s documentation on the Stats Collector and Crawler API.