Resource Overview
Find the 100 highest-rated movies; implemented in Python as a web crawler for the site.
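For the "top 100 by rating" part, one straightforward approach is to collect all scraped entries and keep the 100 with the highest rating. A minimal sketch, assuming each collected entry is a list whose second field is the rating string (e.g. '8.9'); the exact column layout is an assumption, since the snippet below is truncated before the rows are assembled:

import heapq

def top_100(entries):
    # entries are assumed to look like [title, rating_string, ...];
    # sort by numeric rating, highest first, and keep the best 100
    return heapq.nlargest(100, entries, key=lambda item: float(item[1]))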
Code Snippet and File Information
#-*- coding: UTF-8 -*-
import sys
import time
import urllib
import urllib2
# import requests
import numpy as np
from bs4 import BeautifulSoup
from openpyxl import Workbook

# Python 2: force UTF-8 as the default string encoding
reload(sys)
sys.setdefaultencoding('utf8')

# Some User-Agent headers, rotated between requests to look less like a bot
hds = [{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'},
       {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'},
       {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'}]

def book_spider(book_tag):
    page_num = 0
    book_list = []
    try_times = 0
    while True:
        # url = 'http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/book?start=0'  # For test
        url = 'http://www.douban.com/tag/' + urllib.quote(book_tag) + '/book?start=' + str(page_num * 15)
        time.sleep(np.random.rand() * 5)  # random delay of up to 5 s between requests

        # Current version: urllib2 with a rotating User-Agent header
        try:
            req = urllib2.Request(url, headers=hds[page_num % len(hds)])
            source_code = urllib2.urlopen(req).read()
            plain_text = str(source_code)
        except (urllib2.HTTPError, urllib2.URLError) as e:
            print(e)
            continue

        # Previous version: the IP gets banned easily
        # source_code = requests.get(url)
        # plain_text = source_code.text

        soup = BeautifulSoup(plain_text)
        list_soup = soup.find('div', {'class': 'mod book-list'})

        try_times += 1
        if list_soup is None and try_times < 200:
            continue
        elif list_soup is None or len(list_soup) <= 1:
            break  # stop when no information comes back after 200 requests

        for book_info in list_soup.findAll('dd'):
            title = book_info.find('a', {'class': 'title'}).string.strip()
            desc = book_info.find('div', {'class': 'desc'}).string.strip()
            desc_list = desc.split('/')
            book_url = book_info.find('a', {'class': 'title'}).get('href')

            try:
                # '作者/译者' = author/translator, '暂无' = not available
                author_info = '作者/译者: ' + '/'.join(desc_list[0:-3])
            except:
                author_info = '作者/译者: 暂无'
            try:
                # '出版信息' = publication info
                pub_info = '出版信息: ' + '/'.join(desc_list[-3:])
            except:
                pub_info = '出版信息: 暂无'
            try:
                rating = book_info.find('span', {'class': 'rating_nums'}).string.strip()
            except:
                rating = '0.0'
            try:
                #people_num = book_info.findAll('s
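The listing breaks off above. Given the Workbook import at the top, the collected rows are presumably written to an Excel file at the end; a minimal sketch of that step with openpyxl, assuming book_list holds [title, rating, author_info, pub_info, book_url] rows (a hypothetical column order, not confirmed by the truncated snippet):

def save_to_excel(book_list, path='douban_books.xlsx'):
    # One row per scraped entry; header names and column order are assumptions.
    wb = Workbook()
    ws = wb.active
    ws.append(['title', 'rating', 'author_info', 'pub_info', 'url'])
    for row in book_list:
        ws.append(row)
    wb.save(path)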