Keyword-Based Text Ranking and Retrieval System (基于关键词的文本排序检索系统.rar)

Size: 350KB

File type: .rar

Cost: 2 coins

Downloads: 0

Published: 2021-05-14
Language: Python
Tags: TF-IDF model, python, course resources


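Resource description

Contains the project's Python source code, the lab report, and test data. Given a text library, the user submits query keywords (for example: NBA, basket, ball). The system finds the k texts in the library most relevant to the keywords (for example, k=5), ranks those k texts by their relevance to the keywords, and returns the ranked result to the user. TF-IDF weights measure how important a keyword is to a given document, so the most relevant texts can be selected for each keyword. The program first loads the text library and preprocesses the data. The user then enters one or more keywords, and for each keyword the program outputs the ranked list of its top five texts.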
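The ranking step described above can be sketched as follows. This is a minimal illustration, not the packaged implementation: the function name `top_k_by_tfidf`, the token-list input shape, and the unsmoothed IDF variant `log(N / df)` are all assumptions made for the example.

```python
import math
from collections import Counter


def top_k_by_tfidf(docs, keyword, k=5):
    """Rank documents for one keyword by TF-IDF (illustrative sketch only).

    docs: dict mapping document name -> list of tokens (hypothetical input shape).
    Returns up to k (name, score) pairs, highest score first.
    """
    # Document frequency: number of documents that contain the keyword.
    df = sum(1 for tokens in docs.values() if keyword in tokens)
    if df == 0:
        return []
    # Assumed IDF variant: log(N / df); the original program may smooth differently.
    idf = math.log(len(docs) / df)
    scores = {}
    for name, tokens in docs.items():
        count = Counter(tokens)[keyword]
        if count:
            # TF is the keyword count divided by the document length.
            scores[name] = (count / len(tokens)) * idf
    # Sort by descending score, breaking ties by name for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


docs = {
    "1.txt": ["NBA", "game"],
    "2.txt": ["NBA", "NBA", "final"],
    "3.txt": ["basket", "ball"],
}
result = top_k_by_tfidf(docs, "NBA", k=2)
```

For a single keyword the IDF factor is the same for every document, so the order is decided by term frequency; IDF matters once scores for several keywords are compared or combined.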


Code snippet and file information

import math
import os
import re
from nltk.corpus import stopwords

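def loadDataSet(path):
    """
    Read every text file in the library and return the contents as a dictionary.

    :param path: path to the text library directory
    :return: library dictionary {name1: content1, name2: content2, ...}
    """
    # Load every text file in the directory
    files = os.listdir(path)  # names of all entries in the directory
    all_docu_dic = {}  # maps document name to document content
    for file in files:  # iterate over the directory entries
        full_path = os.path.join(path, file)  # resolve relative to path, not the working directory
        if not os.path.isdir(full_path):  # skip subdirectories, open only files
            with open(full_path, encoding='UTF-8-sig') as f:  # open the file
                strr = ""
                for line in f:  # read the text line by line
                    strr = strr + line
            all_docu_dic[file] = strr.strip('.')   # strip leading and trailing '.' characters
    print("Document library:")
    print(all_docu_dic)
    return all_docu_dic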

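def dealDataSet(all_docu_dic):
    """
    Process the data in the library dictionary.

    :param all_docu_dic: library dictionary {name1: content1, name2: content2, ...}
    :return: 1. all_words_set: vocabulary of the library {word1, word2, ...}
             2. words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    """
    all_words = []
    all_docu_cut = {}  # tokenized documents (dict of lists)

    stop_words = stopwords.words('english')    # base stop word list
    # # Extending the stop word list
    # print(len(stop_words))
    # extra_words = [' ']  # additional stop words
    # stop_words.extend(extra_words)  # final stop word list
    # print(len(stop_words))

    # Build the overall vocabulary and the tokenized documents
    for filename, content in all_docu_dic.items():
        # Tokenize. The '-' is escaped so that '+-=' is not read as a character
        # range, which would also split on digits; ',' is listed explicitly.
        cut = re.split(r"[!? '.,)(+\-=。:]", content)
        new_cut = [w for w in cut if w not in stop_words if w]  # drop stop words and the empty strings left by split
        all_docu_cut[filename] = new_cut  # key: document name, value: token list
        all_words.extend(new_cut)
    all_words_set = set(all_words)  # convert to a set

    # Count the words in each document
    words_num_dic = {}
    for filename, cut in all_docu_cut.items():
        words_num_dic[filename] = dict.fromkeys(all_docu_cut[filename], 0)
        for word in cut:
            words_num_dic[filename][word] += 1
    # print("Vocabulary:")
    # print(all_words_set)
    print("Tokenized documents:")
    print(all_docu_cut)
    return all_words_set, words_num_dic     # return the vocabulary and the per-document word counts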

def computeTF(in_word, words_num_dic):
    """
    Compute the TF of the word in_word in every document.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: tfDict: TF values of in_word across the library {name1: tf1, name2: tf2, ...}
    """
    allcount_dic = {}   # total word count of each document
    tfDict = {}     # TF dictionary for in_word
    # Compute the total word count of each document
    for filename, num in words_num_dic.items():
        count = 0
        for value in num.values():
            count += value
        allcount_dic[filename] = count
    # Compute TF (only for documents that contain in_word)
    for filename, num in words_num_dic.items():
        if in_word in num.keys():
            tfDict[filename] = num[in_word] / allcount_dic[filename]
    return tfDict

def computeIDF(in_word, words_num_dic):
    """
    Compute the IDF value of in_word.

    :param in_word: the word
    :param words_num_dic: per-document word counts {txt1: {word1: num1, word2: num2}, ...}
    :return: IDF value of in_word over the whole library
    """
    docu_count = len(words_num_dic)
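
The snippet stops partway through `computeIDF`, so the rest of the function is not shown. As a rough sketch of how such a function is commonly finished, the example below counts the documents containing the word and takes a logarithm. The name `compute_idf_sketch` and the `+1` smoothing in the denominator are assumptions for illustration; the packaged B4.py may use a different formula.

```python
import math


def compute_idf_sketch(in_word, words_num_dic):
    """Hypothetical completion: IDF of in_word over the whole library.

    words_num_dic: {filename: {word: count, ...}, ...}, the same shape
    that dealDataSet returns in the listing.
    """
    docu_count = len(words_num_dic)
    # Number of documents whose word-count dict contains in_word.
    contain_count = sum(1 for counts in words_num_dic.values() if in_word in counts)
    # Assumed smoothing: +1 in the denominator avoids division by zero
    # when no document contains the word.
    return math.log(docu_count / (contain_count + 1))


words_num_dic = {
    "1.txt": {"NBA": 1},
    "2.txt": {"NBA": 2, "final": 1},
    "3.txt": {"basket": 1, "ball": 1},
    "4.txt": {"ball": 3},
}
idf_nba = compute_idf_sketch("NBA", words_num_dic)      # log(4 / 3)
idf_none = compute_idf_sketch("soccer", words_num_dic)  # log(4 / 1)
```

With this smoothing, a word that appears in every document gets a slightly negative IDF, which is one reason other variants add the constant elsewhere.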

Attribute        Size  Date       Time   Name
----------- ---------  ---------- -----  ----
File             6737  2020-06-18 18:42  基于关键词的文本排序检索系统\B4.py
File                5  2020-06-16 11:32  基于关键词的文本排序检索系统\data\1.txt
File               16  2020-06-14 20:55  基于关键词的文本排序检索系统\data\2.txt
File               11  2020-06-14 20:46  基于关键词的文本排序检索系统\data\3.txt
File               23  2020-06-16 11:33  基于关键词的文本排序检索系统\data\4.txt
File               17  2020-06-16 11:33  基于关键词的文本排序检索系统\data\5.txt
File               29  2020-06-16 11:34  基于关键词的文本排序检索系统\data\6.txt
File                5  2020-06-14 20:46  基于关键词的文本排序检索系统\data\7.txt
File                3  2020-06-14 20:56  基于关键词的文本排序检索系统\data\8.txt
File              121  2020-06-18 19:31  基于关键词的文本排序检索系统\readme.txt
File           402836  2020-06-18 18:47  基于关键词的文本排序检索系统\基于关键词的文本排序检索系统课题报告.pdf
Directory           0  2020-06-18 19:32  基于关键词的文本排序检索系统\data
Directory           0  2020-06-18 19:32  基于关键词的文本排序检索系统
----------- ---------  ---------- -----  ----
               409803                    13
