• Size: 7.96MB
    File type: .zip
    Coins: 2
    Downloads: 1
    Release date: 2023-11-19
  • Language: Python
  • Tags: python  crawler

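Resource description

A Python script that crawls the content of entire web pages. It is short and easy to write, which makes it a good starting point for learning web scraping with Python.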
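For readers new to scraping, the fetch step that the description refers to might look roughly like the sketch below. It is not taken from the archive: the names `fetch_html` and `crawl`, the `User-Agent` value, the 10-second timeout, and the one-second pause between requests are all assumptions chosen for illustration.

```python
import time
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded page body, or None if the request fails."""
    try:
        # Hypothetical header value; some sites reject the default urllib agent.
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", "ignore")
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError and timeouts; ValueError covers malformed URLs.
        print(f"skipped {url}: {exc}")
        return None


def crawl(urls, fetch=fetch_html, delay=1.0):
    """Fetch each URL in turn, pausing between requests and skipping failures."""
    pages = {}
    for url in urls:
        html = fetch(url)
        if html is not None:
            pages[url] = html
        time.sleep(delay)  # assumed politeness delay, in seconds
    return pages
```

Passing the fetcher in as a parameter keeps `crawl` easy to try out without touching the network, for example `crawl(["a"], fetch=lambda u: "<p>hi</p>", delay=0)`.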

Code snippet and file information

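import urllib.request
import re

path = "https://www.ittime.com.cn/news/chuangxin.shtml"


def getData(path):
    # Download the page and decode it, ignoring bytes that are not valid UTF-8.
    content = urllib.request.urlopen(path).read().decode("UTF-8", "ignore")
    # print(content)

    # Collect the src attribute of every .jpg image on the page.
    imgRe = re.compile(r'src="(.*?\.jpg)"')
    imagePaths = imgRe.findall(content)
    print("Image count:", len(imagePaths))
    for imagePath in imagePaths:
        print("https://www.ittime.com.cn" + imagePath)

    # NOTE: the HTML tags around the capture group were stripped from this
    # listing; restore them from the page markup so that only titles match.
    titleRe = re.compile(r'(.*?)')
    titles = titleRe.findall(content)
    print("Title count:", len(titles))
    for title in titles:
        print(title)


# Crawl the paginated pages chuangxin_0.shtml through chuangxin_209.shtml.
for i in range(210):
    getData(f"https://www.ittime.com.cn/news/chuangxin_{i}.shtml")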
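The image-extraction step in the listing prepends the site root to every match, which only works when each `src` is a root-relative path. A minimal sketch of a slightly more forgiving variant follows; the function name `extract_image_urls`, the sample HTML, and the `cdn.example.com` host are hypothetical and are not part of the archive.

```python
import re
from urllib.parse import urljoin

# Same idea as the listing's pattern, but it also accepts single-quoted attributes.
IMG_SRC_RE = re.compile(r'src=["\'](.*?\.jpg)["\']', re.IGNORECASE)


def extract_image_urls(html, base_url):
    """Return absolute .jpg URLs found in html, in page order, without duplicates."""
    seen = set()
    urls = []
    for src in IMG_SRC_RE.findall(html):
        # urljoin resolves relative paths and leaves absolute URLs untouched.
        absolute = urljoin(base_url, src)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Made-up markup for illustration only.
sample_html = (
    '<img src="/upload/a.jpg">'
    '<img src="https://cdn.example.com/b.jpg">'
    '<img src="/upload/a.jpg">'
)
print(extract_image_urls(sample_html, "https://www.ittime.com.cn/news/chuangxin.shtml"))
```

Regular expressions are kept here to stay close to the original script; for markup that is less regular, a real HTML parser is usually the safer choice.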

Attribute        Size  Date       Time   Name
----------- ---------  ---------- -----  ----
Directory           0  2018-11-10 10:42  GetITNews\
Directory           0  2018-11-10 10:42  GetITNews\.idea\
File              478  2018-11-10 10:18  GetITNews\.idea\GetITNews.iml
Directory           0  2018-11-10 10:18  GetITNews\.idea\inspectionProfiles\
File              306  2018-11-10 10:18  GetITNews\.idea\misc.xml
File              277  2018-11-10 10:18  GetITNews\.idea\modules.xml
File             9301  2018-11-10 10:42  GetITNews\.idea\workspace.xml
File              724  2018-11-10 10:37  GetITNews\Test.py
Directory           0  2018-11-10 10:42  GetITNews\venv\
Directory           0  2018-11-10 10:18  GetITNews\venv\Include\
Directory           0  2018-11-10 10:42  GetITNews\venv\Lib\
Directory           0  2018-11-10 10:42  GetITNews\venv\Lib\site-packages\
File               55  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\easy-install.pth
Directory           0  2018-11-10 10:42  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\
Directory           0  2018-11-10 10:42  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\EGG-INFO\
File                1  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\EGG-INFO\dependency_links.txt
File               98  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\EGG-INFO\entry_points.txt
File                2  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\EGG-INFO\not-zip-safe
File             2972  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\EGG-INFO\PKG-INFO
File               74  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\EGG-INFO\requires.txt
File            12502  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\EGG-INFO\SOURCES.txt
File                4  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\EGG-INFO\top_level.txt
Directory           0  2018-11-10 10:42  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\
Directory           0  2018-11-10 10:42  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\_internal\
File            14014  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\_internal\basecommand.py
File             8764  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\_internal\baseparser.py
File             2773  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\_internal\build_env.py
File             7023  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\_internal\cache.py
File            16679  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\_internal\cmdoptions.py
Directory           0  2018-11-10 10:42  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\_internal\commands\
File             1500  2018-11-10 10:18  GetITNews\venv\Lib\site-packages\pip-10.0.1-py3.7.egg\pip\_internal\commands\check.py
............ 375 more file entries omitted
