Scrapy Learning Notes (4): Distributed crawling of JD product details, comments and comment summaries

2017-12-12 20:18:08 · Source: CSDN · Author: sinat_34200786


Goal: distributed crawling of JD product details, comments, and comment summaries

Powered by:

  1. Python 3.6
  2. Scrapy 1.4
  3. pymysql
  4. json
  5. redis

Project repository: https://github.com/Dengqlbq/JDSpider


Step 1——Introduction

This post focuses on the code itself; the reasoning behind the design will be covered in a separate post.


Step 2——Overall structure

Analyzing the goal yields the following requirements:

  • Given a keyword, crawl the ids of the matching products
  • Crawl the details of each product
  • Crawl the comments of each product

Putting all of this into a single Spider would make the code bloated, so the project is split into four parts:

  • JDSpider
    • ProjectStart
    • JDUrlsSpider
    • JDDetailSpider
    • JDCommentSpider
ProjectStart      given a keyword, generates the urls of the requested number of search result pages
JDUrlsSpider      extracts all product ids from each page and builds the detail-urls and comment-urls
JDDetailSpider    crawls product details from each detail-url
JDCommentSpider   crawls product comments from each comment-url

The Spiders communicate with each other through a server-side redis instance, mainly to pass the detail-urls and comment-urls.
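The three Scrapy projects each enable scrapy_redis in their settings.py. The repo's actual files are not reproduced in this post; a minimal sketch (HOST/PASS are placeholders) looks like this:

# settings.py (minimal scrapy_redis sketch, not the repo's actual file)
# Schedule requests and deduplicate them through the shared redis instance
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# The server-side redis the spiders communicate through
REDIS_HOST = 'HOST'
REDIS_PORT = 6379
REDIS_PARAMS = {'password': 'PASS'}

A RedisSpider polls the redis list named by its redis_key whenever it is idle, so pushing a url onto that list (for example `lpush JDDetailSpider <some-detail-url>` from redis-cli) wakes any idle JDDetailSpider instance. This shared queue is what makes running several instances at once possible.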


Step 3——ProjectStart

Generate the urls of the requested number of search result pages for a given keyword.
A "page" here means one page of results when browsing products on JD.

# JDSpider/ProjectStart/Test.py
import redis
from urllib import parse

# Redis configuration (HOST/PASS are placeholders)
r = redis.Redis(host='HOST', port=6379, password='PASS')

# Edit keywords and page_count as needed ('手机' = mobile phone)
keywords = '手机'
page_count = 100

keywords = parse.quote(keywords)
current_page = 1
start_index = 1
url = ('https://search.jd.com/Search?keyword={0}&enc=utf-8&qrst=1&rt'
       '=1&stop=1&vt=2&wq={1}&page={2}&s={3}&click=0')

for i in range(page_count):
    # Feed the urls to JDUrlsSpider through its redis list
    r.lpush('JDUrlsSpider', url.format(keywords, keywords, current_page, start_index))
    # JD splits each visible page into two internal pages, so the page
    # parameter advances by 2 and the item offset s by 60
    current_page += 2
    start_index += 60

Step 4——JDUrlsSpider

Extract all product ids from each page and build the detail-urls and comment-urls.

Create the project:

cd JDSpider
scrapy startproject JDUrls

When browsing a page of products, JD first returns only half of them; the other half is loaded asynchronously, and only when you scroll to the bottom of the page.
So to really get all product ids from a page, the spider also has to construct that asynchronous request (a hypothetical HIDE_URL template is sketched after the code).

# JDSpider/JDUrls/spiders/JDUrlsSpider.py
from scrapy_redis.spiders import RedisSpider
from JDUrls.items import JDUrlsItem
from scrapy.utils.project import get_project_settings
import scrapy
import re


class JDUrlsSpider(RedisSpider):
    # Extract all product ids on a given page and turn them into
    # detail-relate urls and comment-relate urls
    name = 'JDUrlsSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDUrlsSpider'
    settings = get_project_settings()
    hide_url = settings['HIDE_URL']

    def parse(self, response):
        # Product ids that are not hidden on the page
        nums = response.xpath('//ul[@class="gl-warp clearfix"]/li[@class="gl-item"]'
                              '[@data-sku]/@data-sku').extract()
        keyword = re.findall(r'keyword=(.*?)&enc', response.url)[0]
        # The hidden ids belong to the same visible page, but the async
        # request that loads them uses a different page number
        page = re.findall(r'page=(\d+)', response.url)[0]
        page = int(page) + 1
        # Comma-joined list of the ids already on the page
        s = ','.join(nums)

        item = JDUrlsItem()
        item['num_list'] = nums
        yield item
        yield scrapy.Request(self.hide_url.format(keyword, page, s),
                             callback=self.get_hidden)

    def get_hidden(self, response):
        # Product ids that were hidden on the page
        nums = response.xpath('//li[@class="gl-item"][@data-sku]/@data-sku').extract()
        item = JDUrlsItem()
        item['num_list'] = nums
        yield item
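HIDE_URL comes from the project settings and is not shown in the post. At the time of writing, JD served the hidden half of a search page from s_new.php; a hypothetical template matching hide_url.format(keyword, page, s) (the exact parameters are an assumption and may have changed since) would be:

# JDUrls settings.py (hypothetical HIDE_URL template)
# {0} = keyword, {1} = page number of the hidden half, {2} = ids already extracted
HIDE_URL = ('https://search.jd.com/s_new.php?keyword={0}&enc=utf-8'
            '&page={1}&scrolling=y&show_items={2}')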

After extracting the product ids, the pipeline builds the detail-urls and comment-urls and pushes them to the server-side redis (the url templates it reads from settings are sketched after the code).

# JDSpider/JDUrls/pipelines.py
import redis
from scrapy.utils.project import get_project_settings


class JDUrlsPipeline(object):

    def __init__(self):
        self.settings = get_project_settings()
        self.detail_url = self.settings['GOODS_DETAIL_URL']
        self.comment_url = self.settings['COMMENT_URL']
        self.r = redis.Redis(host=self.settings['REDIS_HOST'],
                             port=self.settings['REDIS_PORT'],
                             password=self.settings['REDIS_PARAMS']['password'])

    def process_item(self, item, spider):
        # Turn the product ids into detail-relate urls and comment-relate urls
        # and push them to the server-side redis lists
        for n in item['num_list']:
            self.r.lpush('JDDetailSpider', self.detail_url.format(n))
            self.r.lpush('JDCommentSpider', self.comment_url.format(n))
        return item
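GOODS_DETAIL_URL and COMMENT_URL are url templates in the JDUrls settings.py, each taking a single product id. They are not reproduced in the post; the sketches below are assumptions consistent with how the spiders use them (the comment endpoint and its parameters follow JD's API as it worked around 2017):

# JDUrls settings.py (hypothetical templates; {0} = product id)
GOODS_DETAIL_URL = 'https://item.jd.com/{0}.html'
# First comment page baked in here; JDCommentSpider's own COMMENT_URL template
# uses a {1} placeholder for the page number instead (see Step 6)
COMMENT_URL = ('https://sclub.jd.com/comment/productPageComments.action'
               '?productId={0}&score=0&sortType=5&page=1&pageSize=10')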

Step 5——JDDetailSpider

Crawl product details from each detail-url.
JDUrlsSpider has already pushed the detail-urls to the server-side redis, so JDDetailSpider only needs to fetch urls from redis and crawl the details.

Create the project:

cd JDSpider
scrapy startproject JDDetail

The detail fields to crawl are as follows (the comment above each field gives its intended MySQL column type):

# JDSpider/JDDetail/items.py
import scrapy


class JDDetailItem(scrapy.Item):
    # TINYTEXT
    name = scrapy.Field()
    # FLOAT
    price = scrapy.Field()
    # TINYTEXT
    owner = scrapy.Field()
    # TINYINT
    jd_sel = scrapy.Field()
    # TINYINT
    global_buy = scrapy.Field()
    # TINYINT
    flag = scrapy.Field()
    # INT
    comment_count = scrapy.Field()
    # INT
    good_count = scrapy.Field()
    # INT
    default_good_count = scrapy.Field()
    # INT
    general_count = scrapy.Field()
    # INT
    poor_count = scrapy.Field()
    # INT
    after_count = scrapy.Field()
    # FLOAT
    good_rate = scrapy.Field()
    # FLOAT
    general_rate = scrapy.Field()
    # FLOAT
    poor_rate = scrapy.Field()
    # FLOAT
    average_score = scrapy.Field()
    # TINYTEXT
    num = scrapy.Field()
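The MySQL column types hint at how the items are stored via pymysql. The repo's storage pipeline is not shown in the post; a minimal sketch (the connection values and the `detail` table name are assumptions) could look like this:

# JDSpider/JDDetail/pipelines.py (sketch; table name and connection details assumed)
import pymysql


class JDDetailPipeline(object):

    def open_spider(self, spider):
        # Connect once per spider run
        self.conn = pymysql.connect(host='HOST', user='USER', password='PASS',
                                    db='jd', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert every item field into the matching column of the `detail` table
        keys = list(item.keys())
        sql = 'INSERT INTO detail ({0}) VALUES ({1})'.format(
            ', '.join(keys), ', '.join(['%s'] * len(keys)))
        self.cursor.execute(sql, [item[k] for k in keys])
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()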

When crawling the details, the price and the comment summary are loaded asynchronously, so two extra asynchronous requests have to be constructed (hypothetical endpoint templates are sketched after the code).

# JDSpider/JDDetail/spiders/JDDetailSpider.py
from scrapy_redis.spiders import RedisSpider
from JDDetail.items import JDDetailItem
from scrapy.utils.project import get_project_settings
import scrapy
import re
import json


class JDDetailSpider(RedisSpider):
    # Crawl the details of a given product
    name = 'JDDetailSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDDetailSpider'
    settings = get_project_settings()
    comment_url = settings['COMMENT_EXCERPT_URL']
    price_url = settings['PRICE_URL']

    def parse(self, response):
        item = JDDetailItem()
        # Global-buy (全球购) products are served from a different subdomain
        global_buy = 'hk' in response.url

        # Product name
        raw_name = re.findall(r'<div>(.*?)</div>', response.text, re.S)[0].strip()
        jd_sel = '京东精选' in raw_name
        # Make sure the name carries no extra characters, e.g. a possible "京东精选" prefix
        name_list = raw_name.split('>')
        name = name_list[-1].strip()

        # The shop name is extracted differently for global-buy products
        if not global_buy:
            owner_list = response.xpath('//div[@class="J-hove-wrap EDropdown fr"]'
                                        '/div[@class="item"]/div[@class="name"]'
                                        '/a/text()').extract()
        else:
            owner_list = response.xpath('//div[@class="shopName"]/strong/span/a/text()').extract()

        # Self-operated (自营) products list no shop name
        if len(owner_list) == 0:
            owner = '自营'
            flag = True
        else:
            owner = owner_list[0]
            flag = '自营' in owner

        num = re.findall(r'(\d+)', response.url)[0]

        item['name'] = name
        item['owner'] = owner
        item['flag'] = flag
        item['global_buy'] = global_buy
        item['jd_sel'] = jd_sel
        item['num'] = num

        # The price is loaded asynchronously, so request the price json
        price_request = scrapy.Request(self.price_url.format(num), callback=self.get_price)
        price_request.meta['item'] = item
        yield price_request

    def get_price(self, response):
        item = response.meta['item']
        price_json = json.loads(response.text)
        item['price'] = price_json[0]['p']
        num = item['num']
        # The comment summary is also asynchronous, so request its json too
        comment_request = scrapy.Request(self.comment_url.format(num), callback=self.get_comment)
        comment_request.meta['item'] = item
        yield comment_request

    def get_comment(self, response):
        item = response.meta['item']
        comment_json = json.loads(response.text)
        comment_json = comment_json['CommentsCount'][0]
        item['comment_count'] = comment_json['CommentCount']
        item['good_count'] = comment_json['GoodCount']
        item['default_good_count'] = comment_json['DefaultGoodCount']
        item['general_count'] = comment_json['GeneralCount']
        item['poor_count'] = comment_json['PoorCount']
        item['after_count'] = comment_json['AfterCount']
        item['good_rate'] = comment_json['GoodRate']
        item['general_rate'] = comment_json['GeneralRate']
        item['poor_rate'] = comment_json['PoorRate']
        item['average_score'] = comment_json['AverageScore']
        yield item
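PRICE_URL and COMMENT_EXCERPT_URL also come from settings. Around 2017 JD served prices from p.3.cn and comment summaries from club.jd.com; the templates below are assumptions consistent with the json shapes the spider reads (price_json[0]['p'] and comment_json['CommentsCount'][0]):

# JDDetail settings.py (hypothetical templates; {0} = product id)
# Returns json like [{"p": "5888.00", ...}]
PRICE_URL = 'https://p.3.cn/prices/mgets?skuIds=J_{0}'
# Returns json like {"CommentsCount": [{"CommentCount": ..., "GoodRate": ..., ...}]}
COMMENT_EXCERPT_URL = ('https://club.jd.com/comment/'
                       'productCommentSummaries.action?referenceIds={0}')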

Step 6——JDCommentSpider

Crawl product comments from each comment-url.

JDUrlsSpider has already pushed the comment-urls to the server-side redis, so JDCommentSpider only needs to fetch urls from redis and crawl the comments.

Create the project:

cd JDSpider
scrapy startproject JDComment

The comment fields to crawl are as follows:

# JDSpider/JDComment/items.py
import scrapy


class JDCommentItem(scrapy.Item):
    # TINYTEXT
    good_num = scrapy.Field()
    # TEXT
    content = scrapy.Field()

The json returned by the initial comment-url contains only 10 comments, but its maxPage field indicates how many pages of comments are available, so a loop over the remaining pages fetches the rest (a hypothetical COMMENT_URL template is sketched after the code).

# JDSpider/JDComment/spiders/JDCommentSpider.py
from scrapy_redis.spiders import RedisSpider
from JDComment.items import JDCommentItem
from scrapy.utils.project import get_project_settings
import scrapy
import json
import re


class JDCommentSpider(RedisSpider):
    # Crawl the comments of a given product (full comments, not the summary)
    name = 'JDCommentSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDCommentSpider'
    settings = get_project_settings()
    comment_url = settings['COMMENT_URL']

    def parse(self, response):
        comment_json = json.loads(response.text)
        good_number = re.findall(r'productId=(\d+)', response.url)[0]
        max_page_num = comment_json['maxPage']

        for com in comment_json['comments']:
            item = JDCommentItem()
            item['good_num'] = good_number
            item['content'] = com['content']
            yield item

        # Request the remaining comment pages
        for i in range(2, max_page_num):
            yield scrapy.Request(self.comment_url.format(good_number, i),
                                 callback=self.get_leftover)

    def get_leftover(self, response):
        comment_json = json.loads(response.text)
        good_number = re.findall(r'productId=(\d+)', response.url)[0]
        for com in comment_json['comments']:
            item = JDCommentItem()
            item['good_num'] = good_number
            item['content'] = com['content']
            yield item
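JDComment's COMMENT_URL template has to carry both a product id and a page placeholder to match comment_url.format(good_number, i). A hypothetical template (same assumed endpoint as in Step 4) and the json fields the spider relies on:

# JDComment settings.py (hypothetical template; {0} = product id, {1} = page number)
COMMENT_URL = ('https://sclub.jd.com/comment/productPageComments.action'
               '?productId={0}&score=0&sortType=5&page={1}&pageSize=10')

# The endpoint answers with json shaped roughly like:
# {"maxPage": 42, "comments": [{"content": "..."}, ...], ...}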

Step 7——Running the spiders

cd ProjectStart
python Test.py

cd JDUrls
scrapy crawl JDUrlsSpider

cd JDDetail
scrapy crawl JDDetailSpider
(This is a distributed crawler, you can run more than one JDDetailSpider)

cd JDComment
scrapy crawl JDCommentSpider
(This is a distributed crawler, you can run more than one JDCommentSpider)

Results

[Screenshots of the results omitted]

