Resource Overview
This is a Python system, built on a web crawler, for scraping and analyzing movie reviews. It includes the source code and complete documentation. The system consists of three main modules: a ranking of currently popular movies, a word cloud of review content, and a pie chart of audience satisfaction. Note: the code has bugs (it ran last year, but no longer runs this year for unknown reasons); please do not download if that is a concern.
Code Snippet and File Information
from urllib import request

# Scrape the Douban "now playing" page (Hangzhou) for the current movie list.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1'}
req = request.Request('https://movie.douban.com/nowplaying/hangzhou/', headers=headers)  # send a browser User-Agent, otherwise Douban rejects the request
resp = request.urlopen(req)
html_data = resp.read().decode('utf-8')

from bs4 import BeautifulSoup as bs
soup = bs(html_data, 'html.parser')
nowplaying_movie = soup.find_all('div', id='nowplaying')
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')

# Collect each movie's subject id and name.
nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    nowplaying_dict['id'] = item['data-subject']
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']
        nowplaying_list.append(nowplaying_dict)

print('豆瓣排行榜中名列前茅的影片为:')  # top films on the Douban now-playing list
for i in range(len(nowplaying_list)):
    print('NO.', i + 1, '\t', nowplaying_list[i]['name'])
#print(nowplaying_list)
import requests

# Fetch the first 20 short comments for one movie in the list.
requrl = 'https://movie.douban.com/subject/' + nowplaying_list[1]['id'] + '/comments' + '?' + 'start=0' + '&limit=20'
resp = requests.get(requrl, headers=headers)
html_data = resp.text
soup = bs(html_data, 'html.parser')
comment_div_lits = soup.find_all('div', class_='comment')
#print(comment_div_lits)

# Commenter names.
eachAudiList = []
for person in comment_div_lits:
    b = person.find_all('a', class_='')
    eachAudiList.append(b[0].string)
#print(eachAudiList)

# Comment dates.
eachTimeList = []
for time in comment_div_lits:
    a = time.find_all('span', class_='comment-time')
    eachTimeList.append(a[0].text.split()[0])
#print(eachTimeList)

# Comment texts.
eachCommentList = []
for item in comment_div_lits:
    i = item.find_all('p')[0].text
    eachCommentList.append(i)
#print(eachCommentList)

# Join all comments into one string for word segmentation.
comments = ''
for k in range(len(eachCommentList)):
    comments = comments + str(eachCommentList[k]).strip()
#print(comments)

print('------------------以下为各路神仙的留言-----------------------------------------')  # audience comments
for i in range(len(eachCommentList)):
    print(eachAudiList[i] + ' 的留言为:')  # ...'s comment:
    print(eachCommentList[i])
    print('\t\t\t', eachTimeList[i])
from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt

# Segment the joined comment text with jieba and draw a word cloud.
wordlist_after_jieba = jieba.cut(comments, cut_all=True)
wl_space_split = " ".join(wordlist_after_jieba)
my_wordcloud = WordCloud(background_color="white", width=1000, height=860, font_path="font.ttf").generate(wl_space_split)
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
import requests

# Fetch the movie's detail page and scrape the rating distribution
# (used for the audience-satisfaction pie chart).
requrl = 'https://movie.douban.com/subject/' + nowplaying_list[1]['id'] + '/' + '?' + 'from=showing'
resp = requests.get(requrl, headers=headers)
html_data = resp.text
soup = bs(html_data, 'html.parser')
assess = soup.find_all('div', class_='ratings-on-weight')
#print(assess[0])

assess_dit = {}
for ass in range(len(assess)):
    x = assess[ass].find_all('div', class_='item')
    star = []
    percent = []
    for y in x:
        z = y.find_all('span')
        star.append(z[0].string.split
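The snippet is cut off at this point, so the audience-satisfaction pie chart described in the overview is not shown. Below is only a minimal sketch (not the original code) of how such a pie chart could be drawn with matplotlib, assuming star holds the five Douban rating labels and percent their percentages scraped from the ratings-on-weight block; the numeric values are placeholders.

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

star = ['力荐', '推荐', '还行', '较差', '很差']   # assumed rating labels
percent = [45.6, 30.2, 18.1, 4.3, 1.8]            # placeholder percentages (%)

font = FontProperties(fname='font.ttf')            # reuse the bundled font so the Chinese labels render
plt.figure(figsize=(6, 6))
plt.pie(percent, labels=star, autopct='%1.1f%%', textprops={'fontproperties': font})
plt.title('观众满意度', fontproperties=font)        # audience satisfaction
plt.axis('equal')                                   # keep the pie circular
plt.show()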
Attribute   Size      Date        Time   Name
---------   -------   ----------  -----  ----
File        5828044   2019-06-11  13:35  python程序设计\font.ttf
File        3707      2020-04-01  16:53  python程序设计\python语言程序设计.py
File        529408    2020-04-01  16:58  python程序设计\文档.doc
File        211       2020-04-01  17:01  python程序设计\附录.txt
Directory   0         2020-04-01  17:08  python程序设计\