Writing a Python crawler from scratch (4) -- putting the code together

2017-01-12 19:09:04 Source: CSDN Author: u011320740

第七城市

Integrating the code from the previous posts and extracting it into functions, a Sina news crawler needs only the following code:

import requests
from bs4 import BeautifulSoup
import re
import json

# Collect the article links from the list page
def getNewsURLs(url):
    newsURL = []
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    for news in soup.select('.news-item'):
        if len(news.select('h2')) > 0:
            newsURL.append(news.select('h2 a')[0]['href'])
    return newsURL

# Scrape the details of one article
def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # the news id is embedded in the URL, e.g. doc-i<id>.shtml
    m = re.search('doc-i(.*).shtml', newsurl)
    result['newsid'] = m.group(1)
    if len(soup.select('#artibodyTitle')) > 0:
        result['title'] = soup.select('#artibodyTitle')[0].text
    else:
        result['title'] = ''
    if len(soup.select('#navtimeSource')) > 0:
        result['newssource'] = soup.select('#navtimeSource')[0].contents[1].text.strip()
        result['dt'] = soup.select('#navtimeSource')[0].contents[0].strip()
    else:
        result['newssource'] = ''
        result['dt'] = ''
    if len(soup.select('#artibody p')) > 0:
        # the last <p> is the editor line, so drop it
        result['article'] = ''.join(p.text.strip() for p in soup.select('#artibody p')[:-1])
    else:
        result['article'] = ''
    if len(soup.select('.article-editor')) > 0:
        result['editor'] = soup.select('.article-editor')[0].text.lstrip('责任编辑:')
    else:
        result['editor'] = ''
    return result

# Fetch the comment count from the comment API
def getCommentCount(newsid):
    commentURL = "http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20"
    comments = requests.get(commentURL.format(newsid))
    # the response body is JavaScript ("var data={...}"), so strip the prefix before parsing
    jd = json.loads(comments.text.strip('var data='))
    return jd['result']['count']['total']
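To see what the `.news-item` selection in getNewsURLs is doing without hitting the live site, here is a minimal offline sketch. The HTML fragment below is invented for illustration and only mimics the shape of the Sina list page, not its actual markup:

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the list-page structure the crawler expects
html = '''
<div class="news-item"><h2><a href="http://news.sina.com.cn/c/doc-ifxzqnim0001.shtml">First headline</a></h2></div>
<div class="news-item"><span>an ad block without an h2, skipped</span></div>
<div class="news-item"><h2><a href="http://news.sina.com.cn/c/doc-ifxzqnim0002.shtml">Second headline</a></h2></div>
'''
soup = BeautifulSoup(html, 'html.parser')
# same filter as getNewsURLs: only .news-item blocks that contain an h2
urls = [item.select('h2 a')[0]['href']
        for item in soup.select('.news-item')
        if len(item.select('h2')) > 0]
print(urls)
```

The `len(... select('h2')) > 0` guard matters because the list page mixes ad blocks into `.news-item`; without it, the indexing `[0]` would raise IndexError on those blocks.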



Test it (since there is far too much data, I just take newsurl[0], the first link, for the test):
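The end-to-end test needs the live site, but the two trickiest pieces, pulling the news id out of the article URL and parsing the comment API's JavaScript-wrapped response, can be checked offline. A minimal sketch, with a made-up URL and payload:

```python
import re
import json

# Hypothetical article URL in the shape getNewsDetail expects
newsurl = 'http://news.sina.com.cn/c/nd/2017-01-12/doc-ifxzqnim4363654.shtml'
m = re.search('doc-i(.*).shtml', newsurl)
newsid = m.group(1)
print(newsid)  # -> fxzqnim4363654

# Hypothetical response body in the "var data={...}" shape the comment API returns
payload = 'var data={"result": {"count": {"total": 42}}}'
# note: str.strip treats its argument as a set of characters to trim
# from both ends, which happens to remove the JS prefix here
jd = json.loads(payload.strip('var data='))
print(jd['result']['count']['total'])  # -> 42
```

With the functions defined, the actual test is just chaining them: getNewsDetail(getNewsURLs(listpage)[0]) plus getCommentCount(result['newsid']).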


One more note: re and json are part of Python's standard library and never need installing. If an import fails, it will be requests or bs4; install them from cmd with: pip install requests beautifulsoup4
