- Size: 4 KB, File type: .py, Coins: 1, Downloads: 0, Published: 2021-05-04
- Language: Python
- Tags: 20newsgroup python
Resource description
http://blog.csdn.net/abcjennifer/article/details/23615947
Code snippet and file information
# First extract the 20 Newsgroups dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
# All categories:
# newsgroup_train = fetch_20newsgroups(subset='train')
# A subset of categories:
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
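from sklearn import metrics  # used by calculate_result; the import was missing from the snippet

def calculate_result(actual, pred):
    # Current scikit-learn needs an explicit `average` for multiclass labels;
    # 'weighted' was the default in older releases.
    m_precision = metrics.precision_score(actual, pred, average='weighted')
    m_recall = metrics.recall_score(actual, pred, average='weighted')
    print('predict info:')
    print('precision:{0:.3f}'.format(m_precision))
    print('recall:{0:0.3f}'.format(m_recall))
    print('f1-score:{0:.3f}'.format(
        metrics.f1_score(actual, pred, average='weighted')))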
#print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
# newsgroup_train.data holds the original documents, but we need to extract
# TF-IDF vectors in order to model the text data
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
# vectorizer = TfidfVectorizer(sublinear_tf=True,
#                              max_df=0.5,
#                              stop_words='english')
# However, a TF-IDF feature extractor fitted separately on the training set and
# the test set gives them different numbers of features (because their
# documents have different vocabularies), so we use HashingVectorizer instead.
# alternate_sign=False replaces non_negative=True, which newer scikit-learn
# releases no longer accept.
vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=100)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
# Returns the feature matrix 'fea_train' of shape [n_samples, n_features]
print('Size of fea_train:' + repr(fea_train.shape))
# 11314 documents, 130107 vectors for all categories
print('The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100))
#####
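
The comments in the snippet say that TF-IDF leaves the training and test sets with different numbers of features, which is why it switches to HashingVectorizer. The sketch below illustrates that point on a toy corpus. The documents and the variable names (`train_docs`, `test_docs`, `tfidf`, `hasher`) are made up for illustration and are not part of the original file. It also shows an alternative the original does not use: fit one TfidfVectorizer on the training documents and call only `transform` on the test documents.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Toy stand-ins for the training and test documents (hypothetical data).
train_docs = ["the graphics card renders images",
              "windows drivers for the graphics card"]
test_docs = ["mac hardware and display drivers"]

# Fitting a separate TF-IDF vocabulary on each split gives mismatched widths.
n_train_cols = TfidfVectorizer().fit_transform(train_docs).shape[1]
n_test_cols = TfidfVectorizer().fit_transform(test_docs).shape[1]
print(n_train_cols, n_test_cols)

# Reusing the vocabulary learned from the training split keeps widths equal.
tfidf = TfidfVectorizer()
tfidf_train = tfidf.fit_transform(train_docs)
tfidf_test = tfidf.transform(test_docs)  # transform only, no refit
print(tfidf_train.shape, tfidf_test.shape)

# HashingVectorizer is stateless: the width is fixed by n_features alone.
hasher = HashingVectorizer(n_features=100, alternate_sign=False)
hash_train = hasher.transform(train_docs)
hash_test = hasher.transform(test_docs)
print(hash_train.shape, hash_test.shape)
```

Either approach gives both splits the same number of columns. Hashing avoids storing a vocabulary, but with a small `n_features` such as 100, different words can collide in the same column.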