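Resource overview
A web crawler for fetching resources from the web. Web crawlers download pages from the World Wide Web on behalf of search engines. They are generally divided into traditional crawlers and focused crawlers.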
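To make the traditional versus focused distinction concrete, here is a minimal sketch of a crawl loop over an in-memory link graph. It is not taken from the downloadable source: the class name FrontierSketch, the crawl method, and the linkGraph map that stands in for real page fetching are all assumptions made for illustration. A traditional crawler follows every link it finds, while a focused crawler only follows links that pass a relevance test.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration; not part of the original Crawler class.
public class FrontierSketch {

    /**
     * Breadth-first crawl starting at seed.
     * linkGraph maps a URL to the links found on that page (a stand-in for downloading and parsing).
     * isRelevant decides whether a discovered link is queued at all.
     */
    static List<String> crawl(String seed,
                              Map<String, List<String>> linkGraph,
                              Predicate<String> isRelevant) {
        Deque<String> waiting = new ArrayDeque<>(); // URLs waiting to be processed
        Set<String> seen = new HashSet<>();         // URLs already queued, to avoid revisiting
        List<String> processed = new ArrayList<>(); // URLs in the order they were processed

        waiting.add(seed);
        seen.add(seed);
        while (!waiting.isEmpty()) {
            String url = waiting.poll();
            processed.add(url);
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                // Set.add returns false for a URL that was already seen.
                if (isRelevant.test(link) && seen.add(link)) {
                    waiting.add(link);
                }
            }
        }
        return processed;
    }
}
```

Passing a predicate that always returns true gives the traditional behaviour; passing something like link -> link.startsWith("news/") restricts the crawl to one topic area. A real focused crawler would usually score page content rather than URL prefixes.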
Code snippet and file information
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
public class Crawler {
    private List<String> urlWaiting = new ArrayList<>();   // URLs waiting to be processed
    private List<String> urlProcessed = new ArrayList<>(); // URLs that have been processed
    private List<String> urlError = new ArrayList<>();     // URLs that resulted in an error
    private int numFindUrl = 0;                            // number of URLs found

    public Crawler() {}

    /**
     * Start crawling.
     */
    public void begin() {
        while (!urlWaiting.isEmpty()) {
            processURL(urlWaiting.remove(0));
        }
        log("finish crawling");
        log("the number of urls that were found:" + numFindUrl);
        log("the number of urls that were processed:" + urlProcessed.size());
        log("the number of urls that resulted in an error:" + urlError.size());
    }

    /**
     * Called internally to process a URL.
     *
     * @param strUrl
     *            The URL to be processed.
     */
    public void processURL(String strUrl) {
        URL url = null;
        try {
            url = new URL(strUrl);
            log("Processing: " + url);
            // get the URL's contents
            URLConnection connection = url.openConnection();
            connection.setRequestProperty("User-Agent", "Test Crawler for Course NIR");
            if ((connection.getContentType() != null)
                    && !connection.getContentType().toLowerCase()
                            .startsWith("text/")) {
                log("Not processing because content type is: "
                        + connection.getContentType());
                return;
            }
            // read the URL
            InputStream is = connection.getInputStream();
            Reader r = new InputStreamReader(is);
            // parse the URL
            HTMLEditorKit.Parser parse = new HTMLParse().getParser();
            parse.parse(r, new Parser(url), true);
        } catch (IOException e) {
            urlError
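The excerpt calls new HTMLParse().getParser() and passes a Parser callback, but neither class is shown before the snippet cuts off. The sketch below is one common way such helpers are written, not the author's code. It assumes HTMLParse exists only to expose the protected HTMLEditorKit.getParser() method, and that the callback collects href attributes from anchor tags. The names LinkExtractorSketch, LinkCollector, and extractLinks are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical illustration; the original HTMLParse and Parser classes are not shown in the excerpt.
public class LinkExtractorSketch {

    /** HTMLEditorKit.getParser() is protected, so a subclass widens it to public. */
    static class HTMLParse extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    /** Callback that records the absolute target of every anchor tag with an href. */
    static class LinkCollector extends HTMLEditorKit.ParserCallback {
        private final URL base;
        private final List<String> links = new ArrayList<>();

        LinkCollector(URL base) {
            this.base = base;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag != HTML.Tag.A) {
                return;
            }
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href == null) {
                return;
            }
            try {
                // Resolve relative links against the URL of the page being parsed.
                links.add(new URL(base, href.toString()).toString());
            } catch (MalformedURLException e) {
                // Skip links that cannot be resolved to a URL.
            }
        }

        List<String> getLinks() {
            return links;
        }
    }

    /** Parses an HTML string and returns the absolute URLs of its anchor links. */
    static List<String> extractLinks(String html, String baseUrl) {
        try (Reader reader = new StringReader(html)) {
            LinkCollector collector = new LinkCollector(new URL(baseUrl));
            // The third argument tells the parser to ignore charset declarations in the document.
            new HTMLParse().getParser().parse(reader, collector, true);
            return collector.getLinks();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a crawler like the one above, a callback of this kind would typically add each resolved link to the waiting list unless it has already been queued, processed, or recorded as an error.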