The Scrapy Framework
Introduction
Scrapy is an application framework for crawling websites and extracting structured data, and it can be used for a wide range of useful applications such as data mining, information processing, or historical archival.
Although Scrapy was originally designed for web scraping, it can also be used to extract data via APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
Installation
Recent versions of the Scrapy framework no longer support Python 2, so it is recommended to set the framework up on Python 3. To install, run:
pip install scrapy
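If the installation succeeded, the scrapy command-line tool should be on your PATH; you can confirm the installed version with:
scrapy version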
Usage
Initialization
To initialize a project, run:
scrapy startproject tutorial [project_dir]
Here tutorial is the project name used throughout this article; the project_dir argument is optional and defaults to the current directory. The generated project layout looks like this:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
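The demo below yields plain dicts, but items.py is where typed items would be declared. A minimal sketch (the QuoteItem name is an illustration, not part of the generated file):

import scrapy

class QuoteItem(scrapy.Item):
    # declare one Field per attribute the spider will scrape
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()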
Demo
Spider scripts live in the spiders directory, for example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from __future__ import absolute_import
import scrapy


class QuotesSpider(scrapy.Spider):
    # Spider name; used to identify this spider when launching it
    name = 'quotes'

    # Build the initial requests. Instead of start_requests you can simply
    # declare a start_urls attribute; see the official docs (and the sketch
    # after this code) for details.
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # The default callback method
    def parse(self, response):
        # XPath is used here to parse the page elements; CSS selectors also
        # work, see the official docs
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                'text': quote.xpath("./span[@class='text']/text()").get(),
                'author': quote.xpath("./span/small[@class='author']/text()").get(),
                'tags': quote.xpath("./div/a[@class='tag']/text()").getall(),
            }
        # Select the "next" link specifically, so that the "previous" link
        # present on later pages is not matched first
        next_page = response.xpath("//ul[@class='pager']/li[@class='next']/a/@href").get()
        self.log('-----------------------------')
        self.log(next_page)
        if next_page:
            # Join the relative URL and let the same parse method handle
            # the response for the next page
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
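As the comment above notes, the start_requests method can be replaced by a start_urls class attribute; Scrapy then builds the initial requests itself and passes each response to parse by default. A minimal equivalent sketch:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # Scrapy generates the initial requests from this list and calls
    # self.parse with each response
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # same parsing logic as above
        ...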
Running
Go back to the project's top-level directory and run the following command to start the spider:
scrapy crawl quotes
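By itself, scrapy crawl only logs the yielded items. To persist them, Scrapy's built-in feed exports can write the output to a file, for example:
scrapy crawl quotes -o quotes.json
This appends each scraped item to quotes.json; other formats such as CSV or JSON Lines are selected by the file extension.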
API
The section above showed how to run a spider from the Scrapy command line, but in production a crawl often has to be started from an external call. This can be done through the API that the Scrapy framework provides; example code follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from __future__ import absolute_import
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider


def main():
    # Point Scrapy at the project's settings module before loading the settings
    os.environ['SCRAPY_SETTINGS_MODULE'] = 'tutorial.settings'
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    # Pass the spider class itself; CrawlerProcess instantiates it
    process.crawl(QuotesSpider)
    # Start the crawl; this blocks until all crawlers have finished
    process.start()


if __name__ == '__main__':
    main()
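When the hosting application already manages its own Twisted reactor (a web service, for instance), Scrapy's CrawlerRunner is the documented alternative to CrawlerProcess, since it does not start a reactor by itself. A sketch of that pattern:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider


def run():
    # CrawlerRunner configures neither logging nor the reactor on its own
    configure_logging()
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(QuotesSpider)
    # Stop the reactor once the crawl (a Deferred) completes
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until reactor.stop() is called


if __name__ == '__main__':
    run()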