Sina Weibo crawler code + results

Size: 111KB

File type: .rar

Coins: 2

Downloads: 0

Release date: 2021-05-15
Language: Python
Tags: Python crawler

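Resource description

Python code for a Sina Weibo crawler, together with part of the collected results. File list:

1. spider_try.py: the main crawler program. It fetches the HTML source of each page and parses it to obtain user information. Each user is parsed according to the person class definition.
2. person.py: defines the person class, which parses the relevant HTML tag segments into a readable form.
3. format.py: writes the final results in the standard GEXF format so that they can be processed with graph tools.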
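As an illustration of the GEXF output step described for format.py, the following is a minimal sketch, not the archive's actual code. The function name `build_gexf`, its arguments, and the choice of `xml.etree.ElementTree` are assumptions made for this example. It uses the GEXF 1.2 namespace and writes one `node` element per user and one directed `edge` element per follow relation.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"  # GEXF 1.2 namespace


def build_gexf(nodes, edges):
    """Return a GEXF document as a string.

    nodes: dict mapping node id -> label (hypothetical input shape)
    edges: list of (source_id, target_id) pairs, treated as directed
    """
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        gexf, "graph", {"mode": "static", "defaultedgetype": "directed"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, "node", {"id": str(node_id), "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for edge_id, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(edge_id), "source": str(source), "target": str(target)},
        )
    return ET.tostring(gexf, encoding="unicode")


# Example: user 10000 ("alice") follows user 0 ("bob").
xml_text = build_gexf({10000: "alice", 0: "bob"}, [(10000, 0)])
```

Building the tree with an XML library rather than concatenating strings means user names containing characters such as `&` or `<` are escaped automatically.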

Code snippet and file information

# -*- coding: utf-8 -*-
"""
Created on Fri Jun  1 11:12:21 2018

@author: gaoruiyuan
"""


import re

biglist = []     # names of "big" accounts; node id = index in this list
normallist = []  # names of "normal" accounts; node id = 10000 + index
node_data = "./html_follow_name/node.txt"
nodefile = open(node_data, "w", encoding="UTF-8")
edge_data = "./html_follow_name/edge.txt"
edgefile = open(edge_data, "w", encoding="UTF-8")
edgenum = 0

def file_ana(f):
    """Parse one <n>follow.txt file and record its nodes and edges."""
    global edgenum
    content = f.read().decode('utf-8')
    # print(content)
    # The host (owner of this follow list) appears as "= <name>\r\n".
    host_name = re.findall(r"= (.+?)\r\n", content)
    host_name = host_name[0]
    # Note: from_id, id_to and edgenum are computed here but are not used
    # by the write() calls as shown on this page.
    if host_name not in normallist:
        from_id = str(10000 + len(normallist))
        normallist.append(host_name)
        nodefile.write("\n")
    else:
        from_id = str(normallist.index(host_name) + 10000)
    # Followed accounts are listed one per line as "<name>\tbig" or "<name>\tnormal".
    biglist_read = re.findall(r"\n(.+?)\tbig\r", content)
    normallist_read = re.findall(r"\n(.+?)\tnormal\r", content)
    for i in biglist_read:
        if i not in biglist:
            nodefile.write("\n")
            biglist.append(i)
        id_to = str(biglist.index(i))
        edgefile.write("\n\n\n")
        edgenum += 1
    for i in normallist_read:
        if i not in normallist:
            nodefile.write("\n")
            normallist.append(i)
        # Offset by 10000 so the id matches the scheme used for from_id above.
        id_to = str(normallist.index(i) + 10000)
        edgefile.write("\n\n\n\n")
        edgenum += 1
    f.close()

# The archive listing shows files 1follow.txt, 2follow.txt, ... (no 0follow.txt).
for i in range(1, 100):
    print(i)
    file_data = "./html_follow_name/" + str(i) + "follow.txt"
    f = open(file_data, "rb")
    file_ana(f)
nodefile.close()
edgefile.close()
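The parsing step in `file_ana` can be tried in isolation. The sketch below reuses the three regular expressions from the snippet on an in-memory string. The helper name `parse_follow_text` and the sample text are hypothetical; the real layout of the `<n>follow.txt` files is only inferred from the patterns (a `= <name>` header line followed by tab-separated `big` / `normal` rows with `\r\n` line endings).

```python
import re


def parse_follow_text(content):
    """Split one follow file's text into (host, big accounts, normal accounts)."""
    host = re.findall(r"= (.+?)\r\n", content)[0]
    big = re.findall(r"\n(.+?)\tbig\r", content)
    normal = re.findall(r"\n(.+?)\tnormal\r", content)
    return host, big, normal


# Hypothetical sample in the layout the patterns appear to expect.
sample = "host = alice\r\nbob\tbig\r\ncarol\tnormal\r\ndave\tbig\r\n"
host, big, normal = parse_follow_text(sample)
print(host, big, normal)
```

Because each row pattern starts with `\n` and ends with `\r`, the patterns rely on Windows-style line endings; a file saved with bare `\n` endings would yield no matches.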

Attr             Size  Date        Time   Name
---------   ---------  ----------  -----  ----
File             1958  2018-06-03  12:58  爬虫\format.py
File             2285  2018-05-21  10:20  爬虫\person.py
File             1344  2018-06-03  13:27  爬虫\Readme.md
File             4780  2018-06-01  10:47  爬虫\spider_try.py
Directory           0  2018-06-03  13:36  爬虫
File              179  2018-06-01  10:53  single_results\10follow.txt
File              431  2018-06-01  10:53  single_results\11follow.txt
File              491  2018-06-01  10:55  single_results\12follow.txt
File              363  2018-06-01  10:55  single_results\13follow.txt
File              972  2018-06-01  10:55  single_results\14follow.txt
File              475  2018-06-01  10:56  single_results\15follow.txt
File               80  2018-06-01  10:56  single_results\16follow.txt
File              479  2018-06-01  10:58  single_results\17follow.txt
File              158  2018-06-01  10:58  single_results\18follow.txt
File              379  2018-06-01  10:58  single_results\19follow.txt
File             2958  2018-06-01  10:48  single_results\1follow.txt
File              269  2018-06-01  10:59  single_results\20follow.txt
File              457  2018-06-01  11:00  single_results\21follow.txt
File              310  2018-06-01  11:00  single_results\22follow.txt
File              336  2018-06-01  11:01  single_results\23follow.txt
File               48  2018-06-01  11:02  single_results\24follow.txt
File              638  2018-06-01  11:02  single_results\25follow.txt
File              413  2018-06-01  11:03  single_results\26follow.txt
File              371  2018-06-01  11:03  single_results\27follow.txt
File              155  2018-06-01  11:04  single_results\28follow.txt
File               42  2018-06-01  11:04  single_results\29follow.txt
File             1030  2018-06-01  10:48  single_results\2follow.txt
File               72  2018-06-01  11:05  single_results\30follow.txt
File              858  2018-06-01  11:05  single_results\31follow.txt
File              577  2018-06-01  11:06  single_results\32follow.txt

............ (82 more file entries omitted)
