Resource Description
A working Google Images crawler; the default keywords are moods, such as angry and sad.
Code Snippet and File Info
# -*- coding: utf-8 -*-
# @Author: wlc
# @Date: 2017-09-25 23:54:24
# @Last Modified by: Henry
# @Last Modified time: 2018-7-11 22:40:11
####################################################################################################################
# Download images from Google with specified keywords for searching.
# Each search query is created by "main_keyword + supplemented_keyword";
# if there are multiple keywords, each main_keyword is joined with each supplemented_keyword.
# Mainly uses urllib; each search query downloads at most 100 images, because the page source returned by Google is limited.
# Supports downloading in a single process or in multiple processes.
####################################################################################################################
import os
import time
import re
import logging
import urllib.request
import urllib.error
from multiprocessing import Pool

from user_agent import generate_user_agent

log_file = 'download.log'
logging.basicConfig(level=logging.DEBUG, filename=log_file, filemode='a+',
                    format='%(asctime)-15s %(levelname)-8s %(message)s')
def download_page(url):
    """Download the raw content of the page.

    Args:
        url (str): url of the page

    Returns:
        raw content of the page, or None on error
    """
    try:
        headers = {}
        headers['User-Agent'] = generate_user_agent()
        headers['Referer'] = 'https://www.google.com'
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        return str(resp.read())
    except Exception as e:
        print('error while downloading page {0}'.format(url))
        logging.error('error while downloading page {0}'.format(url))
        return None
def parse_page(url):
    """Parse the page and get all the links of images; the max number is 100 due to a limit imposed by Google.

    Args:
        url (str): url of the page

    Returns:
        A set containing the urls of images
    """
    page_content = download_page(url)
    if page_content:
        link_list = re.findall('"ou":"(.*?)"', page_content)
        if len(link_list) == 0:
            print('get 0 links from page {0}'.format(url))
            logging.info('get 0 links from page {0}'.format(url))
            return set()
        else:
            return set(link_list)
    else:
        return set()
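
# Illustration (not part of the original file): the pattern '"ou":"(.*?)"' used
# in parse_page above matches the original-image URLs embedded in Google's
# image-search page source, e.g.:
#   sample = '[{"ou":"http://a.com/1.jpg","ow":800},{"ou":"http://b.com/2.jpg"}]'
#   re.findall('"ou":"(.*?)"', sample)  ->  ['http://a.com/1.jpg', 'http://b.com/2.jpg']
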
def download_images(main_keyword, supplemented_keywords, download_dir):
    """Download images with one main keyword and multiple supplemented keywords.

    Args:
        main_keyword (str): main keyword
        supplemented_keywords (list[str]): list of supplemented keywords

    Returns:
        None
    """
    image_links = set()
    print('Process {0} Main keyword: {1}'.format(os.getpid(), main_keyword))
    # create a directory for a main keyword
    img_dir = download_dir + main_keyword + '/'
    if not os.path.exists(img_dir):
        os.makedirs(img_dir)
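
The header comment says each search query is formed by pairing a main keyword with each supplemented keyword. A minimal sketch of that query construction follows; the URL template and the `build_search_url` helper are assumptions for illustration (based on Google's old image-search endpoint), not code taken from the file above.

```python
import urllib.parse

def build_search_url(main_keyword, supplemented_keyword):
    # Hypothetical helper: join the two keywords into one query and URL-encode it.
    # tbm=isch selects Google image search (an assumption, not from the source).
    query = '{} {}'.format(main_keyword, supplemented_keyword)
    return 'https://www.google.com/search?q=' + urllib.parse.quote(query) + '&tbm=isch'

# Each main keyword is paired with every supplemented keyword:
main_keywords = ['angry', 'sad']
supplemented_keywords = ['face', 'person']
urls = [build_search_url(m, s) for m in main_keywords for s in supplemented_keywords]
print(urls[0])  # https://www.google.com/search?q=angry%20face&tbm=isch
```

The `multiprocessing.Pool` import in the script suggests that, in the multiprocess mode the header mentions, each main keyword can be handed to its own worker process.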