Scraping Tmall's 60 best-selling laptops with a Scrapy crawler

2018-01-29 13:05:30 · Source: cnblogs.com · Author: venmsu
Fields to scrape: product name, product price, product URL, shop name, shop URL.

The crawl kept returning 301/302 redirects along the way, but the HTML pages were still downloaded fine.

Start page:

```python
start_urls = ['https://list.tmall.com/search_product.htm?spm=a220m.1000858.1000724.4.4b3df937tMXU1S&cat=50024399&sort=d&style=g&active=1&industryCatId=50024399&theme=663']
```

Step 1: Write items.py

The initial preparation should need no explanation (in a terminal: scrapy startproject tianmao, then scrapy genspider topgood tmall.com); a sketch of the generated project layout follows.
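For reference, those two commands produce roughly this layout (the exact files depend on the Scrapy version; middlewares.py, for example, only appears in newer releases):

```
tianmao/
    scrapy.cfg
    tianmao/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            topgood.py    # created by `scrapy genspider topgood tmall.com`
```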
```python
import scrapy


class TianmaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    GOODS_PRICE = scrapy.Field()  # price
    GOODS_NAME = scrapy.Field()   # product name
    GOODS_URL = scrapy.Field()    # product URL
    SHOP_NAME = scrapy.Field()    # shop name
    SHOP_URL = scrapy.Field()     # shop URL
```
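As a quick aside (not part of the original post), a scrapy.Item behaves like a dict restricted to the declared fields, which is how the spider below fills it in. A minimal sketch:

```python
item = TianmaoItem()
item['GOODS_NAME'] = 'some laptop'   # assign a declared field
item['GOODS_PRICE'] = '4999.00'
print(item['GOODS_NAME'])            # read it back like a dict
print(dict(item))                    # convert to a plain dict, e.g. for debugging
# item['COLOR'] = 'silver'           # would raise KeyError: this field is not declared
```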

Note: the idea is to grab the product price, name and URL from one page (the list page), then follow each product URL to scrape the shop name and shop URL from the detail page. These few fields were chosen because they are representative.

Step 2: Write the spider

```python
# -*- coding: utf-8 -*-
import scrapy
from tianmao.items import TianmaoItem


class TopgoodSpider(scrapy.Spider):
    name = 'topgood'
    allowed_domains = ['list.tmall.com', 'detail.tmall.com']  # the two subdomains we crawl

    start_urls = ['https://list.tmall.com/search_product.htm?spm=a220m.1000858.1000724.4.4b3df937tMXU1S&cat=50024399&sort=d&style=g&active=1&industryCatId=50024399&theme=663']

    def parse(self, response):
        # The class names on the list page change from time to time; adjust as needed.
        divs = response.xpath("//div[@id='J_ItemList']/div[@class='product item-1111 ']/div")
        print(divs)

        for div in divs:
            item = TianmaoItem()
            # Price -- xpath() serializes the node to a unicode string and returns a list
            item['GOODS_PRICE'] = div.xpath("p[@class='productPrice']/em/@title")[0].extract()
            print(item)
            # Name  //*[@id="J_ItemList"]/div[3]/div/div[2]/a[1]
            item['GOODS_NAME'] = div.xpath("div[@class='productTitle productTitle-spu']/a[1]/@title")[0].extract()
            print(item)
            pre_Product_Url = div.xpath("div[@class='productTitle productTitle-spu']/a[1]/@href").extract_first()

            if 'http' not in pre_Product_Url:
                pre_Product_Url = response.urljoin(pre_Product_Url)

            item['GOODS_URL'] = pre_Product_Url
            print(item)
            yield scrapy.Request(url=pre_Product_Url, meta={'item': item},
                                 callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response):
        divs = response.xpath("//div[@class='extend']/ul")

        if len(divs) == 0:
            self.log("Detail Page error --%s" % response.url)
            return  # nothing to extract on this page

        div = divs[0]
        item = response.meta['item']
        item['SHOP_NAME'] = div.xpath("li[1]/div[1]/a/text()")[0].extract().strip()
        item['SHOP_URL'] = div.xpath("li[1]/div[1]/a/@href")[0].extract()

        yield item
```

To save the output in CSV format: scrapy crawl topgood -o result.csv
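Because those class names really do drift, it is worth verifying the XPath expressions interactively before a full run. A minimal scrapy shell session (run it from inside the project so settings.py, including the default headers and cookie, is picked up; the URL is the spider's start URL, trimmed here for readability):

```
$ scrapy shell 'https://list.tmall.com/search_product.htm?cat=50024399&sort=d&style=g&active=1'
>>> divs = response.xpath("//div[@id='J_ItemList']/div[@class='product item-1111 ']/div")
>>> len(divs)   # 0 means the class name has changed; inspect the page source and adjust
>>> divs[0].xpath("p[@class='productPrice']/em/@title").extract_first()
>>> divs[0].xpath("div[@class='productTitle productTitle-spu']/a[1]/@title").extract_first()
```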

Step 3: Write settings.py

```python
BOT_NAME = 'tianmao'

SPIDER_MODULES = ['tianmao.spiders']
NEWSPIDER_MODULE = 'tianmao.spiders'

LOG_FILE = 'topgood.log'
LOG_STDOUT = True

DEPTH_LIMIT = 2

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"

# Output .csv
FEED_URI = u'goods.csv'
FEED_FORMAT = 'csv'

DOWNLOAD_DELAY = 5

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cookie': 'the cookie from your own browser session on the site',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}

ITEM_PIPELINES = {
    'tianmao.pipelines.TianmaoPipeline': 300,
}
```
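The post never shows pipelines.py even though ITEM_PIPELINES points at tianmao.pipelines.TianmaoPipeline; the stub that scrapy startproject generates is enough to satisfy that setting. Below is a minimal sketch of it; the whitespace-stripping loop is my own optional addition (Python 3), not something from the original post:

```python
# tianmao/pipelines.py
class TianmaoPipeline(object):
    def process_item(self, item, spider):
        # Optional clean-up (assumption, not in the original post):
        # strip stray whitespace from any string fields before export.
        for field, value in item.items():
            if isinstance(value, str):
                item[field] = value.strip()
        return item
```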

Note: what each setting means

- BOT_NAME: the project name.
- DEFAULT_REQUEST_HEADERS: the default request headers.
- DEPTH_LIMIT: how many levels deep the crawl may go; the default is 0 (no limit).
- ITEM_PIPELINES: a dict of the pipelines enabled for the project and the order they run in. Empty by default; the values can be arbitrary, but by convention they are kept in the 0-1000 range.
- LOG_ENABLED: whether logging is enabled.
- LOG_ENCODING: the encoding used for logging.
- LOG_FILE: the name of the log file.
- LOG_LEVEL: the logging level (CRITICAL, ERROR, WARNING, INFO, DEBUG).
- LOG_STDOUT: default False. If True, all standard output (and errors) of the process is redirected to the log; for example, print('hello') will show up in the Scrapy log.
- RANDOMIZE_DOWNLOAD_DELAY: default True; randomizes the pause between requests to the same site. DOWNLOAD_DELAY itself defaults to 0.
- ROBOTSTXT_OBEY: default False; whether to obey robots.txt.
- FEED_FORMAT: the format the scraped data is saved in.
- FEED_URI: the path and file name the data is saved to.
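One extra note that is not in the original post: most of these options can also be set per spider through Scrapy's custom_settings class attribute, which overrides settings.py for that spider only. A small sketch:

```python
import scrapy


class TopgoodSpider(scrapy.Spider):
    name = 'topgood'
    # Per-spider overrides; these take precedence over the project-wide settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'DEPTH_LIMIT': 2,
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'goods.csv',
    }
```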
