Python Crawler: Scraping the First 1000 Baidu Baike Pages

Size: 693 B

File type: .rar

Coins: 2

Downloads: 1

Published: 2021-06-15
Language: Python
Tags: Python crawler, Baidu Baike


Resource Description

An implementation of a Python crawler that scrapes the first 1000 Baidu Baike pages.
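The archive listing on this page shows only a BeautifulSoup test script, so the crawl loop itself is not visible here. As a rough illustration of what "scrape the first 1000 pages" could look like, the following is a minimal sketch and not the author's code. It assumes a breadth-first traversal with a page limit. The names `crawl`, `LinkCollector`, `fetch`, and `link_pattern`, the `/item/` URL pattern, and the `example.com` pages are all hypothetical, and the standard-library `html.parser` stands in for BeautifulSoup so that the sketch has no dependencies.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (stdlib stand-in for soup.find_all('a'))."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(root_url, fetch, link_pattern, max_pages=1000):
    """Breadth-first crawl from root_url, visiting at most max_pages pages.

    fetch is a caller-supplied function (hypothetical) that maps a URL to
    its HTML text, or returns None when the page cannot be downloaded.
    """
    queue = deque([root_url])
    seen = {root_url}          # URLs already queued, to avoid revisiting
    visited = []               # URLs successfully fetched, in visit order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue           # skip pages that failed to download
        visited.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            full = urljoin(url, href)   # resolve relative links
            if link_pattern.search(full) and full not in seen:
                seen.add(full)
                queue.append(full)
    return visited


# Offline demonstration with made-up pages instead of real network requests.
pages = {
    "http://example.com/item/a": '<a href="/item/b">B</a> <a href="/item/c">C</a> <a href="/about">x</a>',
    "http://example.com/item/b": '<a href="/item/a">A</a> <a href="/item/c">C</a>',
    "http://example.com/item/c": '<a href="/item/d">D</a>',
}
visited = crawl("http://example.com/item/a", pages.get, re.compile(r"/item/"), max_pages=3)
print(visited)
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the download step, so the loop can be exercised without network access. A real run would supply a function that downloads each URL and would set `link_pattern` to whatever the target site's entry URLs look like.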


Code Snippet and File Information

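# coding:UTF8
# Basic BeautifulSoup lookups on a small sample document (Python 3 syntax).

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# html_doc is already text, so the from_encoding argument is not needed.
soup = BeautifulSoup(html_doc, 'html.parser')

# Every <a> tag: tag name, href attribute, and link text.
print('get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

# A single link selected by its exact href.
print('\nget lacie link')
link_node = soup.find('a', href="http://example.com/lacie")
print(link_node.name, link_node['href'], link_node.get_text())

# The first link whose href matches a regular expression.
print('\nre')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

# A <p> tag selected by CSS class (class_ avoids the reserved word "class").
print('\np')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())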

Attribute        Size     Date    Time   Name
----------- ---------  ---------- -----  ----
     File        1161  2016-10-30 13:31  reptile\test_bs4.py
     File           0  2016-10-30 13:20  reptile\__init__.py
Directory           0  2016-10-30 13:21  reptile
----------- ---------  ---------- -----  ----
                 1161                    3
