Some notes on distributed crawling with Python (scrapy-redis)

2018-03-01 11:11:05 · Source: oschina · Author: 邪云人


First, install scrapy-redis. Writing a distributed spider is not very different from writing an ordinary Scrapy one; the differences come down to the following points.
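The package is typically installed with pip (assuming a working pip for your Python environment):

pip install scrapy-redis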

In settings.py, set:

# Use the scrapy-redis duplicate filter instead of Scrapy's default one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler instead of Scrapy's default one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing: the request records kept in Redis are not lost
SCHEDULER_PERSIST = True
# Enable exactly one of the following three queues; the first is usually the best choice
# Default scrapy-redis queue: requests are dequeued by priority
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# FIFO queue: requests are handled first in, first out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Stack (LIFO): requests are handled last in, first out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# scrapy_redis.pipelines.RedisPipeline stores the scraped items in Redis; it must be enabled
ITEM_PIPELINES = {
    'dongguan.pipelines.DongguanPipeline': 500,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Host and port of the Redis server
REDIS_HOST = '192.168.99.1'
REDIS_PORT = 6379
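With these settings, every worker that points at the same Redis instance shares one scheduler queue and one dupefilter. As a quick sanity check (not part of the original article), you can inspect the keys scrapy-redis creates using the redis-py client; the key names below are the scrapy-redis defaults for a spider named dongguanquestion:

import redis

# Connect to the same Redis server configured in settings.py
r = redis.StrictRedis(host='192.168.99.1', port=6379)

# Default scrapy-redis keys: the priority queue is a sorted set,
# the dupefilter is a set of request fingerprints, and
# RedisPipeline appends serialized items to a list.
print(r.zcard('dongguanquestion:requests'))    # pending requests
print(r.scard('dongguanquestion:dupefilter'))  # seen fingerprints
print(r.llen('dongguanquestion:items'))        # stored items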
In the spider file, set:

# from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider

# class DongguanquestionSpider(CrawlSpider):
class DongguanquestionSpider(RedisCrawlSpider):
    name = 'dongguanquestion'
    # The spider idles until start URLs are pushed to this Redis key
    redis_key = 'DongguanquestionSpider:start_urls'
    # allowed_domains = ['wz.sun0769.com']
    # start_urls = ['http://wz.sun0769.com/index.php/question/report?page=0']
    pagelinks = LinkExtractor(allow=r'page=\d+')
    questionlinks = LinkExtractor(allow=r'/question/\d+/\d+\.shtml')
    rules = (
        Rule(pagelinks),
        Rule(questionlinks, callback='parse_item'),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list
        domain = kwargs.pop('domain', '')
        # Note: under Python 3 this should be list(filter(...)); the article runs Python 2
        self.allowed_domains = filter(None, domain.split(','))
        super(DongguanquestionSpider, self).__init__(*args, **kwargs)
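Because __init__ pops a domain keyword argument, the allowed domains can be supplied at launch time with Scrapy's standard -a spider-argument option, for example:

python2 -m scrapy runspider dongguanquestion.py -a domain=wz.sun0769.com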

Run the code:

python2 -m scrapy runspider dongguanquestion.py

On Ubuntu, run it like this:


sudo python2 -m scrapy runspider dongguanquestion.py
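The spider will sit idle until a start URL is pushed to the redis_key it listens on. A minimal sketch using the redis-py client, reusing the URL that is commented out in the spider:

import redis

r = redis.StrictRedis(host='192.168.99.1', port=6379)
# Every idle worker blocks on this key; pushing a URL starts the crawl
r.lpush('DongguanquestionSpider:start_urls',
        'http://wz.sun0769.com/index.php/question/report?page=0')

The same push can also be done from redis-cli with the lpush command.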
