Scraping Sina Hot Weibo with Python | Zheng Xiao



Author: Zheng Xiao | Category: Python | Published: 2014-03-05 20:54 | Views: 9,077 | Comments (5)

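This is a practice program I wrote a while ago when learning web scraping with Python. It is based on python3 and the BeautifulSoup library and scrapes the Sina Weibo hot posts page (hot.weibo.com). For each post it extracts the poster's name, the post text, and the images it contains; when a post has several images, they are collected into one list of image URLs.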

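I could not find a reasonably precise publish-time field in the page, so the program does not extract one. My current idea is to read whatever time information the page displays, check which form it takes, and convert it accordingly. The example here uses one page of hot jokes as the scraping target.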
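The time-conversion idea mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original program: the function name `parse_weibo_time` is hypothetical, and the display formats it handles ("N秒前", "N分钟前", "今天 HH:MM", "M月D日 HH:MM", and a full "YYYY-MM-DD HH:MM") are assumptions about what the page might show, not formats confirmed from the page source.

```python
import re
from datetime import datetime, timedelta

def parse_weibo_time(raw, now=None):
    """Convert a displayed time string to a datetime.

    Hypothetical helper; the formats below are assumed, not confirmed.
    Returns None when the string matches none of them.
    """
    now = now or datetime.now()
    raw = raw.strip()

    # "30秒前" (30 seconds ago)
    m = re.fullmatch(r'(\d+)秒前', raw)
    if m:
        return now - timedelta(seconds=int(m.group(1)))

    # "5分钟前" (5 minutes ago)
    m = re.fullmatch(r'(\d+)分钟前', raw)
    if m:
        return now - timedelta(minutes=int(m.group(1)))

    # "今天 12:30" (today at 12:30)
    m = re.fullmatch(r'今天\s*(\d{1,2}):(\d{2})', raw)
    if m:
        return now.replace(hour=int(m.group(1)), minute=int(m.group(2)),
                           second=0, microsecond=0)

    # "3月1日 08:05" (month and day without a year; the current year is assumed)
    m = re.fullmatch(r'(\d{1,2})月(\d{1,2})日\s*(\d{1,2}):(\d{2})', raw)
    if m:
        month, day, hour, minute = map(int, m.groups())
        return datetime(now.year, month, day, hour, minute)

    # Full timestamp such as "2014-03-05 20:54"
    try:
        return datetime.strptime(raw, '%Y-%m-%d %H:%M')
    except ValueError:
        return None
```

Passing `now` explicitly makes the relative forms reproducible, for example `parse_weibo_time('5分钟前', now=datetime(2014, 3, 5, 20, 54))`. Any format the function does not recognize yields `None`, so the caller can decide whether to skip the post or keep the raw string.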

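The complete script:

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    import urllib.request

    # Forged header so the request looks like it comes from a browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'}

    # Request the target URL and read the page source
    fromurl = 'http://hot.weibo.com/?v=1899&page=2'
    r = urllib.request.Request(url=fromurl, headers=headers)
    response = urllib.request.urlopen(r)
    page = response.read()

    # Instantiate the BeautifulSoup object (parser named explicitly to avoid the bs4 warning)
    soup = BeautifulSoup(page, 'html.parser')

    # Locate the main Weibo nodes; each post on the page is one of these nodes
    tags = soup.find_all(name='div', attrs={'class': 'WB_detail'})

    # Iterate over all post nodes
    for tag in tags:
        # Find the poster's name inside the node
        sender = tag.find(name='a', attrs={'class': 'WB_name S_func1'}).get_text()
        # Find the post text inside the node
        text = tag.find(name='div', attrs={'class': 'WB_text'}).get_text()
        # Find the post's images under the node
        thumbList = tag.find_all(name='img', attrs={'class': 'bigcursor'})
        img = []
        # If there are images, put all of their URLs into the img list
        if thumbList:
            for t in thumbList:
                img.append(t['src'])
        print(sender + text)
        print(img)
        print()
        print()

    # Wait for Enter so the console window stays open
    input()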
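The script fetches a single page (`page=2`). To scrape several pages, the requests could be built in a loop, as in this minimal sketch. It assumes that the `page` query parameter selects the page number and that `v=1899` identifies the hot-jokes category; the source URL suggests this but does not confirm it. `build_request`, `BASE_URL`, and `HEADERS` are hypothetical names introduced for illustration, and the page range 1 to 3 is arbitrary.

```python
import urllib.request
from urllib.parse import urlencode

BASE_URL = 'http://hot.weibo.com/'
# Same forged browser header as in the script
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'}

def build_request(page, v=1899):
    """Build (but do not send) a request for one page of hot posts.

    Assumption: `page` is the page number and `v` the category id.
    """
    query = urlencode({'v': v, 'page': page})
    return urllib.request.Request(url=BASE_URL + '?' + query, headers=HEADERS)

# Build requests for pages 1 to 3; nothing is sent over the network here
requests_to_send = [build_request(p) for p in range(1, 4)]
for req in requests_to_send:
    print(req.full_url)
```

Each request object can then be passed to `urllib.request.urlopen` and parsed exactly as in the script. Pausing between requests is advisable so the site is not hit too quickly.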

The output of the scraping script is shown in the figure:
[Figure: scraping Sina Weibo with python3 + BeautifulSoup]

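This article is licensed under the Creative Commons Attribution-NonCommercial 3.0 China Mainland license. When reposting, please credit the source and include the corresponding link.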

Permanent link to this article: https://www.zh30.com/python-cai-ji-xin-lang-re-men-wei-bo.html
