Web Scraper Example: Qiushibaike

Qiushibaike example:

Scrape jokes from Qiushibaike, assuming the page URL is http://www.qiushibaike.com/8hr/page/1

Requirements:

  1. Fetch the page with requests, and extract the data with XPath / re
  2. For each post, capture the user's avatar URL, username, joke text, vote count, and comment count
  3. Save the results to a JSON file
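The reference code below uses XPath, but the requirement also allows re. As a rough sketch of the regex approach, here is how the username and joke text could be pulled out of one post with non-greedy groups (the HTML snippet is a hypothetical stand-in for the site's real markup):

```python
import re

# Hypothetical inline HTML standing in for one Qiushibaike post,
# for illustration only.
html = '''
<div id="qiushi_tag_1001">
  <h2>SomeUser</h2>
  <div class="content"><span>A short joke.</span></div>
</div>
'''

# re.S lets "." match newlines; non-greedy groups capture
# the username and the joke text.
pattern = re.compile(
    r'<h2>(.*?)</h2>.*?<div class="content"><span>(.*?)</span>',
    re.S)

for username, content in pattern.findall(html):
    print(username, content)
```

Regexes are brittle against markup changes, which is why the reference code prefers XPath; the pattern above works only for markup shaped like the sample.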

Reference code

import json
import requests
from lxml import etree

page = 1
url = 'http://www.qiushibaike.com/8hr/page/' + str(page)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8'}


def get_page():
    response = requests.get(url, headers=headers)
    # response.encoding = response.apparent_encoding
    if response.status_code == 200:
        return response.text
    else:
        return None


def parse():
    resHtml = get_page()
    if resHtml is None:
        return []
    html = etree.HTML(resHtml)
    # Each post lives in a div whose id contains "qiushi_tag"
    result = html.xpath('//div[contains(@id,"qiushi_tag")]')
    items = []
    for site in result:
        try:
            item = {}

            imgurl = site.xpath('./div/a/img/@src')[0]
            # username = site.xpath('./div/a/@title')[0]
            username = site.xpath('.//h2')[0].text
            content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
            # vote count
            vote = site.xpath('.//i')[0].text
            # comment count
            comments = site.xpath('.//i')[1].text

            item['imgurl'] = imgurl
            item['username'] = username
            item['content'] = content
            item['vote'] = vote
            item['comments'] = comments
            print(imgurl, username, content, vote, comments)
            items.append(item)
        except Exception as e:
            print(e)
    return items


def save_items(items):
    with open('qiushibaike.json', 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese text readable in the file
        f.write(json.dumps(items, ensure_ascii=False))


def main():
    items = parse()
    save_items(items)


if __name__ == '__main__':
    main()
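The save step can be checked offline. A minimal sketch (with made-up sample data) of why save_items passes ensure_ascii=False: by default json.dumps escapes every non-ASCII character, so Chinese usernames and joke text would be stored as \uXXXX sequences instead of readable text.

```python
import json

# Made-up sample item for illustration.
items = [{'username': '张三', 'content': 'A short joke.', 'vote': '100'}]

# Default: non-ASCII characters are escaped to \uXXXX sequences.
escaped = json.dumps(items)
# ensure_ascii=False writes the characters through unchanged.
readable = json.dumps(items, ensure_ascii=False)

print('张三' in escaped)    # False
print('张三' in readable)   # True
```

Either form round-trips through json.loads; the difference is only how the file reads when opened in an editor.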

Title: Web Scraper Example: Qiushibaike

Author: GavinLiu

Published: 2018-05-02 23:05

Last updated: 2018-05-02 23:05

Original link: http://gavinliu4011.github.io/post/f825a662.html

License: CC BY-NC-ND 4.0 International. Please retain the original link and author attribution when reposting.
