Python crawler source code: scraping information from the official TripAdvisor (猫途鹰) travel website

Size: 72KB

File type: .zip

Coins: 2

Downloads: 0

Release date: 2021-05-09
Language: Python
Tags: Python crawler

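Resource description

Crawler code that uses Python to scrape the TripAdvisor (猫途鹰) travel website. The scraped data covers hotel and attraction information, hotel reviews, and attraction reviews. A write-up of the crawling approach and its main difficulties is included.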

Code snippet and file information

# -*- coding:utf-8 -*-

# Scrape hotel information from the hotel list

import requests
import re
import tool  # local helper module (tool.py in the archive)
import os
import time
import urllib3
urllib3.disable_warnings()
# requests.packages.urllib3.disable_warnings()

# url
l_siteURL = 'https://www.tripadvisor.cn/Hotels-g294212-oa'
r_siteURL = '-Beijing-Hotels.html#BODYCON'


# Scrape hotels
class Hotel:

    # Page initialization
    def __init__(self):
        # Left and right parts of the URL: left + 30 + right
        self.l_siteURL = 'https://www.tripadvisor.cn/Hotels-g294212-oa'
        self.r_siteURL = '-Beijing-Hotels.html#BODYCON'
        self.frontUrl = 'https://www.tripadvisor.cn'  # prefix to add to hotel detail URLs
        self.tool = tool.Tool()

    # Fetch the page source
    def getPage(self, infoURL):
        time.sleep(0.2)  # short pause between requests
        headers = {'content-type': 'application/json',
                   'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
        r = requests.get(url=infoURL, verify=False, headers=headers)
        r.encoding = 'utf-8'
        return r.text

    # Given an image URL and a file name, save a single image
    def saveImg(self, imageURL, fileName):
        time.sleep(0.2)
        headers = {'content-type': 'application/json',
                   'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
        r = requests.get(url=imageURL, verify=False, headers=headers)
        data = r.content  # response body as raw bytes
        # The with-block closes the file even if the write fails
        with open(fileName, 'wb') as f:
            f.write(data)
            print("Quietly saving an image as %s" % fileName)

    # Save one hotel image and return its path
    def saveIcon(self, iconURL, path):
        splitPath = iconURL.split('.')
        fTail = splitPath.pop()  # remove and return the last element (the file extension)
        fileName = path + "." + fTail
        self.saveImg(iconURL, fileName)
        return fileName

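    # Get the hotel's full address
    def getHotelAddress(self, page):
        # The pattern appears incomplete in this listing: only the capture groups are shown
        pattern = re.compile("(.{0,10})(.{0,10})(.{0,30})", re.S)
        result = re.search(pattern, page)
        return self.tool.replace(result.group(1) + result.group(2) + result.group(3))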
    
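    # Get the hotel's rating
    def getHotelGrade(self, page):
        # The pattern appears incomplete in this listing: only the capture group is shown
        pattern = re.compile("(.{0,4}) ", re.S)
        result = re.search(pattern, page)
        return self.tool.replace(result.group(1))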
    
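    # Get the number of reviews for the hotel
    def getHotelCommentNumber(self, detailPage):
        # Up to four digits followed by "</span>条点评" ("reviews") on the detail page
        pattern = re.compile(r"(\d{0,4})<\/span>条点评", re.S)
        result = re.search(pattern, detailPage)
        print(type(result))
        return int(self.tool.replace(result.group(1)))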

    # Parse the hotel's reviews and save them
    def parseHotelComment(self, hotelName, url, pageNum):
        splitURL = url.split('.')
        print('Total review pages: %s' % pageNum)
        if pageNum > 80:  # crawl at most 80 pages of reviews
            pageNum = 80
        firstUserName = u''  # first reviewer on the previous page
        flag = False
        for index in range(1, pageNum + 1):
            if flag:
                print('跳出了
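
Two steps of the snippet can be sketched in isolation. The comment in `__init__` describes a list URL as a left part, an offset such as 30, and a right part; the sketch below assumes the offset starts at 0 and grows by 30 per page, which the truncated source does not confirm. It also shows a review-count lookup that returns `None` when nothing matches, where the original would raise on `result.group(1)`. The names `build_list_urls`, `parse_review_count`, and `PAGE_STEP` are hypothetical, the sample HTML is made up, and the pattern uses `\d{1,4}` rather than the source's quantifier so that an empty match never reaches `int()`.

```python
import re

L_SITE_URL = 'https://www.tripadvisor.cn/Hotels-g294212-oa'
R_SITE_URL = '-Beijing-Hotels.html#BODYCON'
PAGE_STEP = 30  # assumption: each list page advances the offset by 30

# Assumption: the count sits directly before "</span>条点评" on the detail page.
REVIEW_COUNT_RE = re.compile(r'(\d{1,4})</span>条点评')


def build_list_urls(page_count):
    """Return the hotel-list URLs for the first `page_count` pages."""
    return [L_SITE_URL + str(page * PAGE_STEP) + R_SITE_URL
            for page in range(page_count)]


def parse_review_count(detail_page):
    """Return the review count found in `detail_page`, or None if absent."""
    match = REVIEW_COUNT_RE.search(detail_page)
    if match is None:
        return None
    return int(match.group(1))


if __name__ == '__main__':
    for url in build_list_urls(3):
        print(url)
    # Hypothetical fragment, not copied from the real site.
    print(parse_review_count('<span>128</span>条点评'))
```

Returning `None` leaves it to the caller to decide whether a page without a review count should be skipped or retried.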

  Attribute      Size  Date       Time   Name
----------- ---------  ---------- -----  ----
        dir         0  2018-01-30 21:10  maotuying\
       file       191  2018-01-15 11:24  maotuying\hotel.txt
       file      1815  2018-01-15 11:26  maotuying\hotelComment.txt
        dir         0  2018-01-23 09:27  maotuying\img\
        dir         0  2018-01-23 09:27  maotuying\img\hotel\
       file     64751  2018-01-15 11:24  maotuying\img\hotel\北京新云南皇冠假日酒店.jpg
        dir         0  2018-01-15 11:24  maotuying\img\scenic\
       file     19622  2018-01-30 21:04  maotuying\main.py
       file       479  2018-01-30 21:17  maotuying\READEME.txt
       file      1060  2018-01-02 20:34  maotuying\tool.py
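
The listing shows one image stored under `maotuying\img\hotel\` and named after the hotel. A minimal sketch of deriving such a file name from an image URL, in the spirit of `saveIcon`, follows. The helper name `icon_file_name`, the sample URL, and the use of `urllib.parse` and `os.path.splitext` in place of `split('.')` are assumptions for illustration, not the author's code.

```python
import os
from urllib.parse import urlparse


def icon_file_name(icon_url, base_path):
    """Append the image URL's file extension to `base_path`."""
    # Parse the URL first so a query string does not leak into the extension.
    url_path = urlparse(icon_url).path
    _, ext = os.path.splitext(url_path)  # ext keeps its leading dot, or is empty
    return base_path + ext


if __name__ == '__main__':
    # Hypothetical URL; the hotel name is the one that appears in the listing.
    target = os.path.join('img', 'hotel', '北京新云南皇冠假日酒店')
    print(icon_file_name('https://example.com/photo/abc.jpg?w=100', target))
```

Splitting on the parsed path rather than the whole URL keeps the extension correct when the image address carries query parameters.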
