Resource overview
An open-source tool, written in Java, that finds parallel web pages among crawled pages.
Code snippet and file information
package com.googlecode.pupsniffer;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.Iterator;
import java.util.Set;
import java.util.SortedMap;
import info.monitorenter.cpdetector.io.*;
/**
* A simple encoding detector based on cpdetector.sf.net
* @author Xuchen Yao
* @since 2010-03-30
*/
public class EncodingDetector {

    protected CodepageDetectorProxy detector;

    public EncodingDetector() {
        detector = CodepageDetectorProxy.getInstance(); // A singleton.
        // Add the implementations of info.monitorenter.cpdetector.io.ICodepageDetector.
        // This one is quick if we deal with Unicode codepages:
        detector.add(new ByteOrderMarkDetector());
        // The next detector tries to find the meta charset attribute in HTML pages.
        detector.add(new ParsingDetector(false)); // false: do not be verbose about parsing.
        // This one does the tricks of exclusion and frequency detection if the
        // previous detectors are unsuccessful:
        detector.add(JChardetFacade.getInstance()); // Another singleton.
        detector.add(ASCIIDetector.getInstance()); // Fallback, see javadoc.
    }

    /**
     * Detect the encoding of a URL.
     * @param url the URL address
     * @return the encoding name in upper case, or null if it could not be detected
     * @throws IOException if the content behind the URL cannot be read
     * @throws MalformedURLException if the given string is not a valid URL
     */
    public String detect(String url) throws MalformedURLException, IOException {
        // Work with the configured proxy:
        Charset charset = detector.detectCodepage(new URL(url));
        if (charset == null) {
            return null;
        } else {
            // Open the document in the given code page:
            //java.io.Reader reader = new java.io.InputStreamReader(new java.io.FileInputStream(document), charset);
            // Read from it and do whatever you desire. The characters are now, hopefully, correct.
            return charset.name().toUpperCase();
        }
    }
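
    /**
     * Detect the encoding of a string that is already in memory.
     * @param raw the raw content
     * @param encoding the encoding used to turn raw into bytes, or null for the platform default
     * @return the encoding name, or null if it could not be detected or encoding is unsupported
     * @throws IOException if the detector fails to read the bytes
     */
    public String detectFromRaw(String raw, String encoding) throws IOException {
        // Work with the configured proxy:
        Charset charset = null;
        InputStream is;
        byte[] bs;
        // Convert the String to an InputStream.
        try {
            if (encoding == null)
                bs = raw.getBytes(); // platform default charset
            else
                bs = raw.getBytes(encoding);
            is = new ByteArrayInputStream(bs);
            charset = detector.detectCodepage(is, bs.length);
            if (charset == null) {
                return null;
            } else {
                // Open the document in the given code page:
                //java.io.Reader reader = new java.io.InputStreamReader(new java.io.FileInputStream(document), charset);
                // Read from it and do whatever you desire. The characters are now, hopefully, correct.
                return charset.name();
            }
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return null;
    }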

    /**
     * List the supported encodings on your system. For debugging and coding.
     */
    public static void supporte
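
The commented-out lines in detect and detectFromRaw hint at the next step: once an encoding name is known, open the content with a reader in that charset. The following is a minimal sketch of that step using only the JDK. The class name DecodeWithCharset and the helper decode are hypothetical and not part of pupsniffer; the charset name is passed in as if it had been returned by EncodingDetector.detect, and cpdetector itself is not called here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of pupsniffer.
class DecodeWithCharset {

    // Reads the whole stream as text in the named charset.
    // charsetName is assumed to be a name such as the one returned by detect().
    static String decode(InputStream in, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decode the bytes with the charset name a detector would have reported.
        String text = decode(new ByteArrayInputStream(bytes), "UTF-8");
        System.out.println(text.length());
    }
}
```

Charset.forName throws an unchecked exception for an unknown or illegal name, so a caller that passes a detector result through unchanged may want to check for null first, since both detect methods above return null when detection fails.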