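Source code for "Python Project Case Development from Beginner to Practice", Chapter 20: Word Cloud in Practice, Scraping Douban Movie Reviews to Generate a Word Cloud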

Size: 607KB

File type: .rar

Published: 2021-10-26
Language: Python
Tags: python

Code snippet and file information

import warnings
warnings.filterwarnings("ignore")
import jieba    # Chinese word segmentation package
import jieba.analyse
import numpy    # numerical computing package
import re
import matplotlib.pyplot as plt
from urllib import request
from bs4 import BeautifulSoup as bs
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud, STOPWORDS   # word cloud package
 
# Parse the "now playing" page and return a list of {'id': ..., 'name': ...} dicts
def getNowPlayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/zhengzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        # The movie title is stored in the alt text of the poster image
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list
 
# Scrape one page of comments
def getCommentsById(movieId, pageNum):      # parameters: movie id and the comment page number to scrape
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20           # 20 comments per page
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_list = soup.find_all('div', class_='comment')
    #print(comment_div_list[0])
    for item in comment_div_list:
        #print(item.find_all('p'))
        p = item.find_all('p')[0]
        span = p.find('span')
        # Skip comments whose <p> has no <span> or whose span has no single text node
        if span is not None and span.string is not None:
            #print(span.string)
            eachCommentList.append(span.string)

    return eachCommentList
 
def main():
    # Fetch the first 10 pages of comments for the first movie in the list
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    print(NowPlayingMovie_list)   #[{'id': '27605698', 'name': '西虹市首富'}, {'id': '25882296', 'name': '狄仁杰之四大天王'}]
    for i in range(10):                                    # first 10 pages
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)  # index selects which movie
        commentList.append(commentList_temp)
    # Convert the collected lists into a single string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()

    # Keep only runs of Chinese characters, which drops punctuation and other symbols
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)
    # Extract the top 50 keywords and their weights with jieba's TextRank
    result = jieba.analyse.textrank(cleaned_comments, topK=50, withWeight=True)
    keywords = dict()
    for i in result:
        keywords[i[0]] = i[1]
    print("Before removing stop words:", keywords)  #{'演员': 0.18290354231824632, '大片': 0.2876433001472282}
    # Stop-word set
    stopwords
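
The preview breaks off at the stop-word step, so the rest of the original script is not shown here. As a rough illustration only, the sketch below shows one way the character filtering and stop-word removal could be written with the standard library. The helper names (keep_chinese, load_stopwords, remove_stopwords) are hypothetical and do not come from the book's code; the sample keywords and stop words are made up for the example.

```python
import re

# Hypothetical helpers for illustration; not part of the original listing.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(text):
    """Join all runs of Chinese characters, dropping punctuation, digits and Latin text."""
    return ''.join(CHINESE_RUN.findall(text))


def load_stopwords(lines):
    """Build a stop-word set from an iterable of lines, e.g. an open text file."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(keywords, stopwords):
    """Return a new keyword -> weight dict without the stop words."""
    return {word: weight for word, weight in keywords.items()
            if word not in stopwords}


# Made-up sample data standing in for TextRank output and a stop-word file.
sample_keywords = {'演员': 0.18, '大片': 0.29, '电影': 0.95}
sample_stopwords = load_stopwords(['电影\n', '\n', '一个\n'])
filtered = remove_stopwords(sample_keywords, sample_stopwords)
```

In a real run, load_stopwords could be fed an open file such as the StopWords.txt shipped in the archive (the file's encoding is an assumption to check). The filtered dict could then be handed to WordCloud's generate_from_frequencies method, with font_path pointing at a font that contains Chinese glyphs.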

Attribute        Size  Date       Time   Name
----------- ---------  ---------- -----  ----
File             3662  2018-03-10 21:59  第20章词云实战——爬取豆瓣影评生成词云\StopWords.txt
File               73  2018-03-10 23:00  第20章词云实战——爬取豆瓣影评生成词云\wordcloud-1.4-cp35-cp35m-win_amd64.whl.txt
File             3881  2018-08-05 21:06  第20章词云实战——爬取豆瓣影评生成词云\分析豆瓣中最新电影的影评 - 最终版.py
File           168968  2018-08-05 16:52  第20章词云实战——爬取豆瓣影评生成词云\示例一拥有图片形状的词云\alice.png
File             2039  2018-08-05 16:59  第20章词云实战——爬取豆瓣影评生成词云\示例一拥有图片形状的词云\test.txt
File            41551  2018-08-05 17:00  第20章词云实战——爬取豆瓣影评生成词云\示例一拥有图片形状的词云\test2.jpg
File              918  2018-08-05 20:14  第20章词云实战——爬取豆瓣影评生成词云\示例一拥有图片形状的词云\拥有图片形状的词云.py
File           168968  2018-08-05 16:52  第20章词云实战——爬取豆瓣影评生成词云\示例三 wordcloud使用词频\alice.png
File            48598  2018-08-05 20:42  第20章词云实战——爬取豆瓣影评生成词云\示例三 wordcloud使用词频\dream.png
File             2039  2018-08-05 16:59  第20章词云实战——爬取豆瓣影评生成词云\示例三 wordcloud使用词频\test.txt
File              869  2018-08-05 17:32  第20章词云实战——爬取豆瓣影评生成词云\示例三 wordcloud使用词频\wordcloud使用词频.py
File           168968  2018-08-05 16:52  第20章词云实战——爬取豆瓣影评生成词云\示例二  设置停用词\alice.png
File             2039  2018-08-05 16:59  第20章词云实战——爬取豆瓣影评生成词云\示例二  设置停用词\test.txt
File            16826  2018-08-05 20:40  第20章词云实战——爬取豆瓣影评生成词云\示例二  设置停用词\test3.jpg
File              942  2018-08-05 17:19  第20章词云实战——爬取豆瓣影评生成词云\示例二  设置停用词\wordcloud的设置停用词.py
Directory           0  2018-11-07 19:54  第20章词云实战——爬取豆瓣影评生成词云\示例一拥有图片形状的词云
Directory           0  2018-11-07 19:54  第20章词云实战——爬取豆瓣影评生成词云\示例三 wordcloud使用词频
Directory           0  2018-11-07 19:54  第20章词云实战——爬取豆瓣影评生成词云\示例二  设置停用词
Directory           0  2018-11-07 19:54  第20章词云实战——爬取豆瓣影评生成词云
----------- ---------  ---------- -----  ----
               630341                    19
