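Resource overview
A web crawler makes it easy to fetch resources from the web. Crawlers download pages from the World Wide Web on behalf of a search engine. They are generally divided into traditional (general-purpose) crawlers and focused crawlers.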
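The difference between the two kinds of crawler can be sketched in a few lines. The example below is only an illustration under stated assumptions: the class name `FocusedCrawlSketch`, the in-memory link graph, and the `example.com` URLs are all hypothetical, and the "focus" test is a simple substring check on the URL, whereas a real focused crawler would typically score page content for topical relevance. A traditional crawler follows every link it discovers; a focused crawler queues only the links that pass its relevance test.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class FocusedCrawlSketch {

    // Breadth-first traversal over a hypothetical in-memory link graph.
    // The 'follow' predicate decides whether a discovered link is queued at all.
    static List<String> crawl(Map<String, List<String>> links, String seed,
                              Predicate<String> follow) {
        Deque<String> waiting = new ArrayDeque<>();   // URLs still to be processed
        Set<String> seen = new LinkedHashSet<>();     // URLs already queued, in discovery order
        waiting.add(seed);
        seen.add(seed);
        while (!waiting.isEmpty()) {
            String url = waiting.poll();
            for (String next : links.getOrDefault(url, List.of())) {
                // Queue a link only if it is relevant and has not been seen before.
                if (follow.test(next) && seen.add(next)) {
                    waiting.add(next);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Hypothetical site structure used only for this illustration.
        Map<String, List<String>> links = Map.of(
            "http://example.com/",
                List.of("http://example.com/java/", "http://example.com/news/"),
            "http://example.com/java/",
                List.of("http://example.com/java/crawler.html"),
            "http://example.com/news/",
                List.of("http://example.com/news/today.html"));

        // Traditional crawler: follow every link.
        System.out.println(crawl(links, "http://example.com/", u -> true));
        // Focused crawler: follow only links that look on-topic.
        System.out.println(crawl(links, "http://example.com/", u -> u.contains("/java/")));
    }
}
```

Passing the relevance test in as a predicate keeps the traversal loop identical for both kinds of crawler, so only the link-selection policy changes.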
Code snippet and file information
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
public class Crawler {
    // URLs that are waiting to be processed
    private List<String> urlWaiting = new ArrayList<String>();
    // URLs that have been processed
    private List<String> urlProcessed = new ArrayList<String>();
    // URLs that resulted in an error
    private List<String> urlError = new ArrayList<String>();
    // Number of URLs found so far
    private int numFindUrl = 0;

    public Crawler() {}

    /**
     * Start crawling: process queued URLs until the waiting list is empty,
     * then log summary statistics.
     */
    public void begin() {
        while (!urlWaiting.isEmpty()) {
            processURL(urlWaiting.remove(0));
        }
        log("finish crawling");
        log("the number of urls that were found:" + numFindUrl);
        log("the number of urls that were processed:" + urlProcessed.size());
        log("the number of urls that resulted in an error:" + urlError.size());
    }

    /**
     * Called internally to process a URL.
     *
     * @param strUrl
     *            The URL to be processed.
     */
    public void processURL(String strUrl) {
        URL url = null;
        try {
            url = new URL(strUrl);
            log("Processing: " + url);
            // get the URL's contents
            URLConnection connection = url.openConnection();
            connection.setRequestProperty("User-Agent", "Test Crawler for Course NIR");
            // skip anything that is not text (images, archives, and so on)
            if ((connection.getContentType() != null)
                    && !connection.getContentType().toLowerCase()
                            .startsWith("text/")) {
                log("Not processing because content type is: "
                        + connection.getContentType());
                return;
            }
            // read the URL
            InputStream is = connection.getInputStream();
            Reader r = new InputStreamReader(is);
            // parse the page; the Parser callback receives the tags found in it
            HTMLEditorKit.Parser parse = new HTMLParse().getParser();
            parse.parse(r, new Parser(url), true);
        } catch (IOException e) {
            urlError
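
The listing breaks off before the `Parser` callback and the `HTMLParse` helper are shown, so their actual definitions are unknown. As a rough illustration of how a callback of this kind could collect links, the sketch below extends `HTMLEditorKit.ParserCallback` and resolves each `href` against the page URL. Everything in it is an assumption made for illustration: the class name `LinkCollectorSketch`, the `extractLinks` helper, the use of `ParserDelegator` in place of the unseen `HTMLParse` class, and the decision to keep only `http`/`https` links.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: not the original Parser class, which is not shown in the listing.
class LinkCollectorSketch extends HTMLEditorKit.ParserCallback {
    private final URL base;                              // page the links were found on
    private final List<String> found = new ArrayList<>();

    LinkCollectorSketch(URL base) {
        this.base = base;
    }

    // Invoked by the parser for every opening tag; only <a href="..."> is of interest here.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag != HTML.Tag.A) {
            return;
        }
        Object href = attrs.getAttribute(HTML.Attribute.HREF);
        if (href == null) {
            return;
        }
        try {
            // Resolve relative links against the page URL.
            URL resolved = new URL(base, href.toString());
            // Assumption: only http/https links are worth queuing.
            if (resolved.getProtocol().startsWith("http")) {
                found.add(resolved.toString());
            }
        } catch (MalformedURLException e) {
            // Ignore links that cannot be parsed.
        }
    }

    // Parses an HTML string and returns the absolute links found in it.
    static List<String> extractLinks(String baseUrl, String html) {
        try {
            LinkCollectorSketch callback = new LinkCollectorSketch(new URL(baseUrl));
            // 'true' tells the parser to ignore charset declarations in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
            return callback.found;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/a.html\">A</a>"
                + "<a href=\"mailto:someone@example.com\">Mail</a></body></html>";
        System.out.println(extractLinks("http://example.com/dir/page.html", html));
    }
}
```

In a crawler like the one above, each link returned this way would be appended to the waiting list only if it is not already in the waiting, processed, or error lists, so that no page is fetched twice.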