A Python crawler that uses Selenium to simulate Tianyancha login and scrape companies' business registration information

Size: 2.96MB

File type: .zip

Published: 2023-09-30
Language: Python
Tags: crawler, simulated login, selenium, python

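Resource description

This resource is for learning purposes only. Selenium-based crawling today typically drives a headless browser such as Firefox or Chrome. Tianyancha's anti-crawling measures are fairly strong, so this code is meant for personal study and is not suitable for large-scale data collection.

Techniques: Python, Selenium, simulated login for crawling, XPath, CSS selectors. You can configure a proxy yourself. To add pagination, refer to the code templates included in the archive, or contact me on QQ.

geckodriver must be placed in the same directory as the script. Remember: enter your own account and password, and press Enter after each input!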
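The notes above mention headless mode, an optional proxy, and keeping geckodriver next to the script. A minimal sketch of that setup follows, assuming Selenium 4 (the archived code targets the older Selenium 3 constructor). The helper names `resolve_geckodriver`, `firefox_proxy_prefs`, and `build_driver` are illustrative and not part of the archive.

```python
from pathlib import Path


def resolve_geckodriver(base_dir, name="geckodriver.exe"):
    """Return the path of geckodriver inside base_dir, or raise if it is missing."""
    path = Path(base_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"{name} must sit in {base_dir}")
    return path


def firefox_proxy_prefs(host, port):
    """Firefox preferences that route HTTP and HTTPS through a manual proxy."""
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": host,
        "network.proxy.http_port": port,
        "network.proxy.ssl": host,
        "network.proxy.ssl_port": port,
    }


def build_driver(base_dir, proxy=None):
    """Start headless Firefox; proxy is an optional (host, port) tuple."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    if proxy is not None:
        for key, value in firefox_proxy_prefs(*proxy).items():
            options.set_preference(key, value)
    service = Service(executable_path=str(resolve_geckodriver(base_dir)))
    return webdriver.Firefox(service=service, options=options)
```

A call might look like `build_driver(Path(__file__).parent, proxy=("127.0.0.1", 8080))`, where the proxy address is a placeholder for whatever proxy you run yourself.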

Code snippet and file information

#!/usr/bin/python
# -*- coding:utf-8 -*-
# author: Jola
# datetime:2018/4/20 17:15
# software-version: python 3.5

import time

from selenium import webdriver
from selenium.webdriver import Firefox


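class GetCompanyInfo(object):
    """
    Scrape company information from Tianyancha.
    """
    def __init__(self):
        """
        Initialize the browser driver used by the crawler; pages are fetched with Firefox.
        """
        # Placeholders: supply your own account; never ship real credentials in source.
        self.username = 'YOUR_PHONE_NUMBER'
        self.password = 'YOUR_PASSWORD'
        self.options = webdriver.FirefoxOptions()
        self.options.add_argument('-headless')  # run Firefox in headless mode
        self.geckodriver = r'geckodriver.exe'   # must sit in the same directory as this script
        # Selenium 3 keyword arguments; Selenium 4 replaced them with service= and options=.
        self.driver = Firefox(executable_path=self.geckodriver, firefox_options=self.options)

        self.start_url = 'https://www.tianyancha.com'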

    def test(self):
        """
        For debugging only.
        :return:
        """
        start_url = 'http://y2.twenteen.cn/Home/Index'
        self.driver.get(start_url)
        cookies = {
            'ASP.NET_SessionId': 'v3gnz3zsx0l2vxqmszhzat4w',
            'Hm_lvt_ddd605dfec122be0f190ebb874331df1': '1524279814',
            'Hm_lpvt_ddd605dfec122be0f190ebb874331df1': '152428022',
        }
        # add_cookie expects one dict per cookie with 'name' and 'value' keys.
        for k, v in cookies.items():
            self.driver.add_cookie({
                'name': k,
                'value': v,
            })
        time.sleep(1)
        print(self.driver.page_source)
        self.driver.close()

    def login(self):
        """
        Log in and check the resulting state.
        :return:
        """
        try:
            self.driver.get(self.start_url)

            print(self.driver.get_cookies())

            username = self.index_login()
            # After login the page shows the phone number masked, e.g. "131 **** 6288".
            username_pattern = username[:3] + ' **** ' + username[-4:]
            print(username_pattern)
            page = self.driver.page_source
            is_login = page.find(username_pattern)  # -1 if the masked name is absent

            print(is_login)
            if is_login != -1:
                print('Login succeeded')
        except Exception as e:
            print(e)

    def index_login(self):
        """
        Log in through the form on the home page.
        :return:
        """
        # find_element(s)_by_xpath is the Selenium 3 locator API used throughout this script.
        get_login = self.driver.find_elements_by_xpath('//a[@class="media_port"]')[0]   # "Log in / Sign up" link
        print(get_login.text)
        # entry element that leads to the login URL
        get_login.click()
        login_by_pwd = self.driver.find_element_by_xpath('//div[@class="bgContent"]/div[2]/div[2]/div')     # switch to phone-number login
        print(login_by_pwd.text)
        login_by_pwd.click()
        input1 = self.driver.find_element_by_xpath('//div[@class="bgContent"]/div[2]/div/div[2]/input')     # phone number

        input2 = self.driver.find_element_by_xpath('//div[@class="bgContent"]/div[2]/div/div[3]/input')     # password
        print(input1.get_attribute('placeholder'))
        print(input2.get_attribute('placeholder'))

        username, password = self._check_user_pass()
        input1.send_keys(username)
        input2.send_keys(password)

        login_button = self.driver.find_element_by_xpath('//div[@class="bgContent"]/div[2]/di
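
`index_login` calls `self._check_user_pass()`, whose body is not shown in this excerpt, and `login` then searches the page source for a masked form of the phone number. A minimal sketch of both steps as standalone functions follows. `prompt_credentials` is a hypothetical stand-in for `_check_user_pass`, not the author's implementation, and `mask_username` and `is_logged_in` are illustrative names that mirror the `username[:3] + ' **** ' + username[-4:]` expression used in `login`.

```python
def prompt_credentials(read=input):
    """Ask for phone number and password; press Enter after each entry."""
    username = read("Phone number: ").strip()
    password = read("Password: ")
    if not username or not password:
        raise ValueError("both phone number and password are required")
    return username, password


def mask_username(username):
    """Build the masked name the page is expected to show after login."""
    return username[:3] + " **** " + username[-4:]


def is_logged_in(page_source, username):
    """True if the masked phone number appears in the page source."""
    return mask_username(username) in page_source
```

Passing the reader function as a parameter keeps the prompt testable without a terminal; in the crawler itself the default `input` would be used.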

Attribute        Size  Date        Time   Name
---------  ----------  ----------  -----  ---------------
File             9595  2018-04-21  22:42  crawl.py
File          9684296  2018-04-08  20:49  geckodriver.exe
