Balthazar Rouberol, Mapado
"From a blank page to web crawling in under an hour"
by Balthazar Rouberol
$ whoami
Crawlers can be used to proactively go out and grab data by yourself when it is not made available through web APIs.
requests
lxml
XPaths are extracted with Firebug (Firefox) or an equivalent tool (Chrome DevTools, etc.) by inspecting the DOM.
'/html/body/div/div/div/div[2]/header/h1'
'/html/body/div/div/div/div[2]/header/p/a'
'/html/body/div/div/div/div[2]/header/div/p'
http://isbullsh\.it/\d{4}/\d{2}/[\w-]+
import requests, json, lxml.html

url = 'http://isbullsh.it/2012/04/Blur-images-with-imagemagick/'  # one blog post

# Crawl: fetch html
html = requests.get(url).text

# Parse html: scrape the data using XPaths
tree = lxml.html.fromstring(html)
title = tree.xpath('/html/body/div/div/div/div[2]/header/h1/text()')[0]
author = tree.xpath('/html/body/div/div/div/div[2]/header/p/a/text()')[0]
date = tree.xpath('/html/body/div/div/div/div[2]/header/div/p/text()')[0]

# Export data to JSON format
data = {'title': title, 'author': author, 'date': date, 'url': url}
with open('export.json', 'w') as f:
    json.dump(data, f)
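The snippet above only handles one post, while the export below contains them all: the full script loops over every post URL. A minimal sketch of that loop, assuming the links can be collected from the homepage with the URL pattern above (pagination ignored, helper names are mine):

import json
import re

import lxml.html
import requests

POST_URL = re.compile(r'http://isbullsh\.it/\d{4}/\d{2}/[\w-]+')

def scrape_post(url):
    """Scrape one blog post with the XPaths shown earlier."""
    tree = lxml.html.fromstring(requests.get(url).text)
    header = '/html/body/div/div/div/div[2]/header'
    return {'title': tree.xpath(header + '/h1/text()')[0],
            'author': tree.xpath(header + '/p/a/text()')[0],
            'date': tree.xpath(header + '/div/p/text()')[0],
            'url': url}

# Collect the post URLs linked from the homepage, keeping only those
# matching the pattern above
home = lxml.html.fromstring(requests.get('http://isbullsh.it').text)
home.make_links_absolute('http://isbullsh.it')
post_urls = sorted({href for href in home.xpath('//a/@href') if POST_URL.match(href)})

# Each requests.get() blocks until the previous one has completed,
# which is why the timing below adds up to ~22 seconds
with open('export.json', 'w') as f:
    json.dump([scrape_post(url) for url in post_urls], f, indent=2)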
[
  {
    "title": "Odyssey of a webapp developer",
    "author": "Etienne",
    "date": "27 Jun 2012",
    "url": "http://isbullsh.it/2012/06/Odyssey-Chap1-Part1/"
  },
  {
    "title": "Blur images with ImageMagick",
    "author": "Balthazar",
    "date": "11 Apr 2012",
    "url": "http://isbullsh.it/2012/04/Blur-images-with-imagemagick/"
  },
...
$ time python isbullshit-manual-scrape.py
0,30s user 0,06s system 1% cpu 22,335 total
"Asynchronous I/O, or non-blocking I/O, in computer science, is a form of input/output processing that permits other processing to continue before the transmission has finished." (Wikipedia)
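Scrapy downloads pages with non-blocking I/O (it is built on Twisted), so it does not wait for one response before sending the next request. A rough illustration of the difference, using a thread pool instead of an event loop (this is not how Scrapy works internally, just the overlapping-requests idea):

from concurrent.futures import ThreadPoolExecutor

import requests

urls = [
    'http://isbullsh.it/2012/06/Odyssey-Chap1-Part1/',
    'http://isbullsh.it/2012/04/Blur-images-with-imagemagick/',
    # ... the rest of the post URLs
]

# Blocking: total time is roughly the sum of all the response times
pages = [requests.get(url).text for url in urls]

# Overlapped: several requests are in flight at once, so the total time
# is closer to the slowest single response
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(lambda url: requests.get(url).text, urls))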
$ scrapy startproject isbullshit
Standard project layout (see the sketch after this list):
items.py: definition of the target data
settings.py: crawler settings
pipelines.py: processing of the data once it has been scraped
spiders/: directory containing the various spiders
spiders/isbullshit_spider.py: (to be created) our spider
scrapy.cfg: not covered here
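For orientation, the tree generated by scrapy startproject looks roughly like this (the __init__.py files are created automatically; the spider file is the one we add ourselves):

isbullshit/
    scrapy.cfg
    isbullshit/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            isbullshit_spider.py   # to be created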
# In items.py
from scrapy.item import Item, Field

class IsbullshitItem(Item):
    title = Field()
    author = Field()
    date = Field()
    url = Field()
# In spiders/isbullshit_spider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from isbullshit.items import IsbullshitItem

class IsBullshitSpider(CrawlSpider):
    """General configuration of the Crawl Spider"""
    name = 'isbullshit'
    start_urls = ['http://isbullsh.it']
    allowed_domains = ['isbullsh.it']
    rules = [
        Rule(SgmlLinkExtractor(
            allow=[r'http://isbullsh\.it/\d{4}/\d{2}/\w+'], unique=True),
            callback='parse_blogpost')
    ]
# In spiders/isbullshit_spider.py, inside IsBullshitSpider
    def parse_blogpost(self, response):
        """Callback method scraping data from the response html"""
        hxs = HtmlXPathSelector(response)
        item = IsbullshitItem()
        item['title'] = hxs.select('//header/h1/text()').extract()[0]
        item['author'] = hxs.select('//header/p/a/text()').extract()[0]
        item['date'] = hxs.select('//header/div[@class="post-data"]'
                                  '/p/text()').extract()[0]
        item['url'] = response.url
        return item
$ cd path/to/isbullshit
$ scrapy crawl isbullshit -o blogposts.json -t json
# In settings.py
...
MONGODB_HOST = "localhost" # default value
MONGODB_PORT = 27017 # default value
MONGODB_DB = "isbullshit-scrape"
MONGODB_COL = "blogposts"
# Send items to these pipelines after they have been scraped
ITEM_PIPELINES = ['isbullshit.pipelines.MongoDBStorage']
# In pipelines.py
import pymongo

from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem

class MongoDBStorage(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db = settings['MONGODB_DB']
        col = settings['MONGODB_COL']
        connection = pymongo.MongoClient(host, port)
        db = connection[db]
        self.collection = db[col]
# In pipelines.py
class MongoDBStorage(object):
    ...
    def process_item(self, item, spider):
        if not self.collection.find_one({'url': item['url']}):
            self.collection.insert(dict(item))
            log.msg("Article from %s inserted in database" % item['url'],
                    level=log.DEBUG, spider=spider)
            return item
        else:
            raise DropItem('Article from %s already in DB' % item['url'])
$ cd path/to/isbullshit
$ scrapy crawl isbullshit
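To check that the pipeline actually stored the articles, a quick pymongo query against the database and collection configured in settings.py above:

import pymongo

connection = pymongo.MongoClient("localhost", 27017)
collection = connection["isbullshit-scrape"]["blogposts"]

# One document per scraped article, deduplicated by URL in the pipeline
print(collection.find_one({'url': 'http://isbullsh.it/2012/04/Blur-images-with-imagemagick/'}))
for post in collection.find():
    print(post['title'])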
# In settings.py
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
RETRY_ENABLED = False
# In settings.py
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_DELAY = 5
ROBOTSTXT_OBEY = False
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0"
$ scrapy shell URL
Built-in Scrapy shell. Super awesome funky debugging time!
https://scrapy.readthedocs.org/en/latest/topics/shell.html#topics-shell
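A typical session, reusing the spider's XPaths; with the Scrapy versions of that era the shell exposes an hxs selector (newer releases use response.xpath() instead), and the values shown are the ones from the export above:

$ scrapy shell http://isbullsh.it/2012/04/Blur-images-with-imagemagick/
>>> hxs.select('//header/h1/text()').extract()
[u'Blur images with ImageMagick']
>>> hxs.select('//header/p/a/text()').extract()
[u'Balthazar']
>>> hxs.select('//header/div[@class="post-data"]/p/text()').extract()
[u'11 Apr 2012']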