Resource Description
This example is a free (0-point) download; it crawls the full chapter contents of a single book.
Important: do not delete the time.sleep(6) at line 121 of the script; the site has anti-crawler protection!
2021-11-01: re-uploaded the code. Dingdian (顶点) changed its domain and the old code stopped working, so I patched it hastily; it barely works. If you are able, improve the code yourself.
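For reference, here is a minimal sketch of the kind of throttling that warning refers to, assuming a fixed 6-second base pause; polite_get and the random jitter are illustrative additions, not part of the uploaded script:

import time
import random
import requests

def polite_get(url, headers, base_delay=6.0):
    # Hypothetical helper: fetch a page, then pause long enough that
    # the site's anti-crawler checks are not triggered. Random jitter
    # on top of the fixed base makes the request pattern less uniform.
    resp = requests.get(url, headers=headers, timeout=10)
    time.sleep(base_delay + random.uniform(0.0, 2.0))
    return resp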
Code Snippet and File Info
# -*- coding:utf-8 -*-
# @File: DDXS.py
import random
import requests
from lxml import etree
import time

# Fetch the HTML of the book's table-of-contents page
def get_chapter_html(url, headers):
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'
    html = resp.text
    if '顶点小说' in html:  # the site name appears on every valid page
        print('---- fetched TOC html successfully ----')
        return resp.text
    else:
        print('---- failed to fetch TOC html ----')
        return 'failed to fetch html'

# Extract the URLs of all chapters (and the book title) from the TOC page
def get_chapter_url_list(html):
    tree = etree.HTML(html)
    book_name = tree.xpath('//*[@id="info"]/h1/text()')[0]
    chapter_url_x = tree.xpath('//*[@id="list"]/dl/dd/a/@href')
    if len(chapter_url_x) > 12:  # crude sanity check on the TOC length
        print('---- fetched all chapter urls successfully ----')
        chapter_url_list = chapter_url_x[:]
    else:
        print('---- failed to fetch chapter urls ----')
        chapter_url_list = []
    return chapter_url_list, book_name

# Download one chapter and append it to the open file handle f
def get_save_txt(chapter_url, headers, f):
    try:
        html = requests.get(chapter_url, headers=headers, timeout=10)
        html.encoding = 'gbk'
        html = html.text
    except requests.exceptions.RequestException as e:
        print(e)
        return 'fetch failed'
    tree = etree.HTML(html)
    # Extract the chapter title
    chapter_name = tree.xpath('//*[@class="bookname"]/h1/text()')
    if len(chapter_name) > 0:
        chapter_name = chapter_name[0]
    else:
        chapter_name = 'failed to get chapter name|url:' + chapter_url
        print('---- failed to get chapter name ----')
        print(chapter_name)
        print('--------------------')
        return 'fetch failed'
    # Extract the chapter body text
    content_list = tree.xpath('//div[@id="content"]/text()')
    if len(content_list) > 0:
        text = chapter_name + '\n'
        for content in content_list:
            text += content + '\n\n'
        f.write(text)
        f.flush()  # flush the buffer so progress survives an interruption
        print('written: ' + chapter_name + ' -- URL: ' + chapter_url + '\n')
    else:
        print('---- failed to fetch chapter content ----')
        print('chapter: ' + chapter_name)
        print('URL: ' + chapter_url)
        print('--------------------')
    return

def main():
    # Record the start time
    time_start = time.time()
    main_url = 'https://www.ddxs.cc/'
    book_url = 'https://www.ddxs.cc/ddxs/182824/'
    book_name = 'book name pending'
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20
        # ... (the listing is truncated here in the original upload)
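The snippet breaks off inside main(), so as a rough guide here is a minimal sketch of how the functions above could be wired together; the output filename, the urljoin handling of the relative chapter hrefs, and the download loop are assumptions, not the uploader's missing lines:

from urllib.parse import urljoin

def main_sketch():
    time_start = time.time()
    main_url = 'https://www.ddxs.cc/'
    book_url = 'https://www.ddxs.cc/ddxs/182824/'
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    ]
    headers = {'User-Agent': random.choice(user_agent_list)}

    html = get_chapter_html(book_url, headers)
    chapter_url_list, book_name = get_chapter_url_list(html)

    # Assumed output file; the TOC hrefs are relative, so join them
    # against the site root before requesting each chapter.
    with open(book_name + '.txt', 'w', encoding='utf-8') as f:
        for href in chapter_url_list:
            get_save_txt(urljoin(main_url, href), headers, f)
            time.sleep(6)  # the anti-crawler pause the notes insist on

    print('finished in %.1f s' % (time.time() - time_start))

Calling main_sketch() would then save the whole book to '<book title>.txt', one chapter every six-plus seconds.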
Related Resources
- A simple Python crawler
- Douban crawler; Scrapy framework
- Chinese city latitude/longitude crawler (.ipynb)
- Crawl Baidu Images to local disk (Python code)
- Python crawler with data analysis and visualization
- Website listing-info crawler
- Baidu Images crawler (Python version)
- Python novel crawler 59868
- 彼岸花 wallpaper site crawler
- Python novel crawler (.ipynb)
- Novel-reading project source (with database scripts)
- Crawler for NetEase Cloud Music
- BUPT Python crawler for XuetangX
- Simple Python crawler
- Crawl 58.com second-hand housing listings (.py)
- CNKI crawler tool (Python)
- Python crawler for Weibo trending searches
- Python crawler for travel info (with source code, …
- Python crawler for Douban movie info
- Crawl hundreds of girl pics; source runs as-is
- Hands-on beginner tutorial for Python crawlers
- Python novel scraper
- Web crawler (pachong_anjuke.py)
- Python JD.com flash-purchase assistant with login and product lookup
- Python web-crawler source for scenic-spot info
- Python crawler for Wikipedia programming-language infoboxes (…
- Python Sina Weibo crawler
- 12306 crawler implementation
- China Judgements Online crawler
- Python crawler books.zip