Balthazar Rouberol, Mapado

"From a blank page to web crawling in under an hour"
by Balthazar Rouberol

Balthazar Rouberol
$ whoami
Crawlers let you proactively fetch data yourself when it is not made available through web APIs.
Crawling (fetching HTML): requests
Scraping (parsing the HTML with XPaths): lxml
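Both are third-party libraries, installable from PyPI (assuming pip is available):
$ pip install requests lxml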
Extract the XPaths with Firebug (Firefox) or an equivalent (Chrome DevTools, etc.) by inspecting the DOM.
Title:  '/html/body/div/div/div/div[2]/header/h1'
Author: '/html/body/div/div/div/div[2]/header/p/a'
Date:   '/html/body/div/div/div/div[2]/header/div/p'
http://isbullsh\.it/\d{4}/\d{2}/[\w-]+
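Before scraping individual posts, the post URLs matching this pattern have to be collected; a minimal sketch, assuming the index page links to every post (only requests and re):

import re
import requests

# Hypothetical link collection: fetch the index page and keep every
# link matching the blog post URL pattern above.
index_html = requests.get('http://isbullsh.it').text
post_urls = set(re.findall(r'http://isbullsh\.it/\d{4}/\d{2}/[\w-]+', index_html))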
import requests, json, lxml.html

# One of the collected blog post URLs
url = 'http://isbullsh.it/2012/04/Blur-images-with-imagemagick/'

# Crawl: fetch the html
html = requests.get(url).text

# Parse the html: scrape the data using XPaths
tree = lxml.html.fromstring(html)
title = tree.xpath('/html/body/div/div/div/div[2]/header/h1/text()')[0]
author = tree.xpath('/html/body/div/div/div/div[2]/header/p/a/text()')[0]
date = tree.xpath('/html/body/div/div/div/div[2]/header/div/p/text()')[0]

# Export the data to JSON
data = {'title': title, 'author': author, 'date': date, 'url': url}
with open('export.json', 'w') as f:
    json.dump(data, f)
[
  {
    "title": "Odyssey of a webapp developer",
    "author": "Etienne",
    "date": "27 Jun 2012",
    "url": "http://isbullsh.it/2012/06/Odyssey-Chap1-Part1/"
  },
  {
    "title": "Blur images with ImageMagick",
    "author": "Balthazar",
    "date": "11 Apr 2012",
    "url": "http://isbullsh.it/2012/04/Blur-images-with-imagemagick/"
  },
  ...
]
$ time python isbullshit-manual-scrape.py
0,30s user 0,06s system 1% cpu 22,335 total
"Asynchronous I/O, or non-blocking I/O, in computer science, is a form of input/output processing that permits other processing to continue before the transmission has finished." (Wikipedia)
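The manual script fetches one page at a time and spends most of its 22 seconds waiting on the network. A rough sketch of overlapping those waits with a thread pool (not true asynchronous I/O, but the same idea of not blocking on each transmission; it reuses the hypothetical post_urls list from the earlier sketch):

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Download one page; several downloads run in parallel below."""
    return requests.get(url).text

# Overlap the network waits instead of fetching posts one by one
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, post_urls))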
$ scrapy startproject isbullshit

Standard project structure:
items.py: definition of the target data
settings.py: crawler settings
pipelines.py: processing of the data once it has been scraped
spiders/: directory containing the various spiders
spiders/isbullshit-spider.py: (to be created) our spider
scrapy.cfg: not covered here
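For reference, the layout generated by scrapy startproject (with our spider added) looks roughly like this; note the nested isbullshit/ Python package:

isbullshit/
    scrapy.cfg
    isbullshit/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            isbullshit-spider.py   # to be created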
# In items.py
from scrapy.item import Item, Field

class IsbullshitItem(Item):
    title = Field()
    author = Field()
    date = Field()
    url = Field()
# In spiders/isbullshit-spider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from isbullshit.items import IsbullshitItem

class IsBullshitSpider(CrawlSpider):
    """General configuration of the crawl spider"""
    name = 'isbullshit'
    start_urls = ['http://isbullsh.it']
    allowed_domains = ['isbullsh.it']
    rules = [
        Rule(SgmlLinkExtractor(
            allow=[r'http://isbullsh\.it/\d{4}/\d{2}/\w+'], unique=True),
            callback='parse_blogpost')
    ]
# In spiders/isbullshit-spider.py (continued)
    def parse_blogpost(self, response):
        """Callback method scraping data from the response html"""
        hxs = HtmlXPathSelector(response)
        item = IsbullshitItem()
        item['title'] = hxs.select('//header/h1/text()').extract()[0]
        item['author'] = hxs.select('//header/p/a/text()').extract()[0]
        item['date'] = hxs.select('//header/div[@class="post-data"]'
                                  '/p/text()').extract()[0]
        item['url'] = response.url
        return item
$ cd path/to/isbullshit
$ scrapy crawl isbullshit -o blogposts.json -t json
# In settings.py
...
MONGODB_HOST = "localhost" # default value
MONGODB_PORT = 27017 # default value
MONGODB_DB = "isbullshit-scrape"
MONGODB_COL = "blogposts"
# Send items to these pipelines after they have been scraped
ITEM_PIPELINES = ['isbullshit.pipelines.MongoDBStorage']
# In pipelines.py
import pymongo

from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem

class MongoDBStorage(object):
    def __init__(self):
        # Connection parameters defined in settings.py
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db = settings['MONGODB_DB']
        col = settings['MONGODB_COL']
        connection = pymongo.MongoClient(host, port)
        db = connection[db]
        self.collection = db[col]
# In pipelines.py
class MongoDBStorage(object):
    ...
    def process_item(self, item, spider):
        """Store the item in MongoDB, unless its URL is already there"""
        if not self.collection.find_one({'url': item['url']}):
            self.collection.insert(dict(item))
            log.msg("Article from %s inserted in database" % item['url'],
                    level=log.DEBUG, spider=spider)
            return item
        else:
            raise DropItem('Article from %s already in DB' % item['url'])
$ cd path/to/isbullshit
$ scrapy crawl isbullshit
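Once the crawl has run, the result can be checked directly with pymongo; a quick sketch using the database and collection names configured in settings.py:

import pymongo

# Connect to the database filled by the MongoDBStorage pipeline
client = pymongo.MongoClient("localhost", 27017)
collection = client["isbullshit-scrape"]["blogposts"]

print(collection.count())                            # number of stored articles
print(collection.find_one({"author": "Balthazar"}))  # one of the scraped items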
# In settings.py
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
RETRY_ENABLED = False
# In settings.py
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_DELAY = 5
ROBOTSTXT_OBEY = False
USER_AGENT = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) "
              "Gecko/20100101 Firefox/21.0")
$ scrapy shell URL
Built-in Scrapy shell. Super awesome funky debugging time!
https://scrapy.readthedocs.org/en/latest/topics/shell.html#topics-shell
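A hypothetical shell session against one of the blog posts, handy for testing XPaths before putting them in the spider (assuming an old Scrapy version that exposes hxs):

$ scrapy shell http://isbullsh.it/2012/04/Blur-images-with-imagemagick/
...
>>> hxs.select('//header/h1/text()').extract()
[u'Blur images with ImageMagick']
>>> hxs.select('//header/p/a/text()').extract()
[u'Balthazar']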