Implementing TF-IDF in Python

Size: 6KB

File type: .py

Published: 2021-06-07
Language: Python
Tags: python TF IDF

Resource overview

A Python script that preprocesses text, counts term frequencies, and computes TF-IDF.
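
The quantities the script works with can be written as follows. The base-10 logarithm and the ratio of total documents to documents containing the term match the `math.log10(doc_num / word_in_doc)` call in the listing. The length-normalised term frequency and the final product are assumptions, because the listing is cut off before that step. Here $n_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```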

Code snippet and file information

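# coding=utf-8                   # must be the first line so that non-ASCII comments are allowed
from __future__ import division  # make / perform true division
import os
import math
import nltk
import codecs  # open files with an explicit encoding
from nltk.corpus import stopwords
import sys

if sys.version_info[0] < 3:  # Python 2 only: force UTF-8 as the default encoding
    reload(sys)
    sys.setdefaultencoding('utf-8')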

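doc_num = 0  # total number of documents
dictionary = []  # vocabulary: the list of words found in the files
word_idf_dict = {}  # idf value of every word in the vocabulary
num_doc_word = {}  # total number of words in each document (paper)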


def processing(text):
    """Preprocess the given text and return a list of the processed words."""
    text_lower = text.lower()  # lowercase the text
    sens = nltk.sent_tokenize(text_lower)  # split the lowercased text into sentences
    words = []
    # tokenize each sentence into words
    for sen in sens:
        words.extend(nltk.word_tokenize(sen))  # extend (not append) adds the tokens themselves
    stopword = stopwords.words('english')  # stop words to drop
    punctuation = [',', '.', ':', ';', '(', ')', '[', ']', '&', '#', '!', '?', '@', '$', '%']  # punctuation to drop
    stemming = nltk.stem.SnowballStemmer('english')  # stemmer
    new_words = []
    for word in words:
        # keep only purely alphabetic tokens that are neither stop words nor punctuation
        if word.isalpha() and (word not in stopword) and (word not in punctuation):
            new_words.append(stemming.stem(word))  # append adds one stemmed word at a time
    return new_words


def compute_tf(wordlist):
    """Count how often each word occurs in the given word list; returns a dict mapping each word to its count."""
    temp_dict = {}
    for word in wordlist:
        if word in temp_dict:
            temp_dict[word] += 1
        else:
            temp_dict[word] = 1
    return temp_dict


def compute_idf(word_in_file):
    """Compute the inverse document frequency of every word; the argument has the form dict{file_name: dict{word: word_tf}}."""
    for word in dictionary:
        word_in_doc = 0  # number of documents that contain this word
        for index in word_in_file:
            if word in word_in_file[index]:
                word_in_doc += 1
        # compute the idf of the word and store it in the global word_idf_dict
        word_idf_dict[word] = math.log10(doc_num / word_in_doc)
        # word_idf_dict is not returned; it is a global variable so that other functions can use it


def compute_tfidf(word_in_file):
    """Compute the tf-idf value of every word; the argument has the form dict{file_name: dict{word: word_tf}}."""
    word_tfidf = {}  # tf-idf values, in the form dict{file_name: dict{word: word_tfidf}}
    for index in word_in_file:
        word_tfidf[index] = {}  # dict{word: word_tfidf} for this document
        temp_len = num_doc_word[index]  # total number of words in this document, from the global num_doc_word
        for word in word_in_file[index].keys():
            word_tfidf[index][word]
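
The listing is cut off at this point. What follows is a minimal sketch of how the remaining computation might look, not the author's code. It assumes that the term frequency is the raw count divided by the document's word total (which is what `temp_len` suggests) and that this is multiplied by the IDF stored in `word_idf_dict`. The function name `tfidf_sketch` and the toy data are hypothetical. The snippet's globals are passed in as parameters so that the block runs on its own under Python 3.

```python
import math


def tfidf_sketch(word_in_file, num_doc_word, word_idf_dict):
    """Hypothetical completion: tf-idf = (count / document length) * idf."""
    word_tfidf = {}
    for index in word_in_file:
        word_tfidf[index] = {}
        temp_len = num_doc_word[index]  # total number of words in this document
        for word, count in word_in_file[index].items():
            tf = count / temp_len  # assumed length-normalised term frequency
            word_tfidf[index][word] = tf * word_idf_dict[word]
    return word_tfidf


# Toy data in the shapes the snippet describes (all values are made up).
word_in_file = {
    "a.txt": {"cat": 2, "dog": 1},
    "b.txt": {"cat": 1, "fish": 3},
}
num_doc_word = {name: sum(counts.values()) for name, counts in word_in_file.items()}
doc_num = len(word_in_file)

# IDF as in compute_idf: log10(number of documents / documents containing the word).
word_idf_dict = {}
for counts in word_in_file.values():
    for word in counts:
        containing = sum(1 for c in word_in_file.values() if word in c)
        word_idf_dict[word] = math.log10(doc_num / containing)

scores = tfidf_sketch(word_in_file, num_doc_word, word_idf_dict)
```

Under these assumptions, a word that occurs in every document, such as `cat` in the toy data, has an IDF of 0 and therefore a score of 0, while words confined to one document score higher.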
