Random forest implementation with accompanying dataset (Python code)

Size: 34KB

File type: .zip

Published: 2021-05-07
Language: Python
Tags: random forest, Python

Resource description

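This archive contains a random forest implementation, the accompanying dataset, and detailed comments in Chinese; the code has been debugged and runs. There are two copies of the code: one downloaded from the internet, and one that I reorganized and rewrote myself. The programming environment is Python 2.7. Because the code is only meant for learning the random forest algorithm, I did not put much effort into parameter tuning, so the accuracy may not be very high; the small dataset is another reason. Interested readers can adjust the parameters themselves to improve accuracy.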

Code snippet and file information

#-*- coding: utf-8 -*-
# Random Forest Algorithm on Sonar Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt
from math import log
# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:  # skip empty lines
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):  # convert column `column` of the dataset to float
    for row in dataset:
        row[column] = float(row[column].strip())  # strip() returns a copy of the string with leading and trailing whitespace removed
        
# Convert string column to integer
def str_column_to_int(dataset, column):  # convert the class labels in the last column to integers 0, 1, ...
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
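def cross_validation_split(dataset, n_folds):  # split dataset into n_folds folds; each fold holds len(dataset) // n_folds rows drawn at random from dataset, and each row is used at most once
    dataset_split = list()
    dataset_copy = list(dataset)  # work on a copy so the original dataset is left unchanged
    fold_size = len(dataset) // n_folds  # floor division: same result as `/` on Python 2.7, and also valid on Python 3
    for i in range(n_folds):
        fold = list()  # start a fresh fold on every iteration so rows are not added to dataset_split twice
        while len(fold) < fold_size:  # must be while, not if: if would test only once, while loops until the fold is full
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))  # pop(index) removes the row at index from dataset_copy and returns it (with no argument, pop() removes the last element)
        dataset_split.append(fold)
    return dataset_split  # list of n_folds folds taken from dataset, for use in cross-validation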

# Calculate accuracy percentage  
def accuracy_metric(actual, predicted):  # take the actual and predicted labels and return the accuracy as a percentage
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0



# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):  # my understanding: this is the cost of a split; the purer the groups, the smaller the Gini value
    gini = 0.0
    for class_value in class_values:  # class_values = [0, 1]
        for group in groups:          # groups = (left, right)
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))  # contributes 0 when the group is pure with respect to this class
    return gini

# Select the best split point for a dataset  # find the best feature to split the dataset on, obtaining the best feature in
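
To make the "purer split, smaller Gini" comment in gini_index concrete, here is a small worked example. It is only a sketch: the function is restated so the block runs on its own, and the toy rows (one feature plus a 0/1 label) are made up for illustration and are not taken from sonar-all-data.csv.

```python
def gini_index(groups, class_values):
    """Sum p * (1 - p) over every class in every non-empty group."""
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            labels = [row[-1] for row in group]
            proportion = labels.count(class_value) / float(size)
            gini += proportion * (1.0 - proportion)
    return gini

# Hypothetical toy rows in the form [feature, label].
# A split that separates the two classes perfectly:
pure_split = ([[1.0, 0], [2.0, 0]], [[8.0, 1], [9.0, 1]])
# A split that leaves both classes evenly mixed in each group:
mixed_split = ([[1.0, 0], [8.0, 1]], [[2.0, 0], [9.0, 1]])

pure_score = gini_index(pure_split, [0, 1])
mixed_score = gini_index(mixed_split, [0, 1])
print(pure_score)   # 0.0
print(mixed_score)  # 1.0
```

In the mixed case every group has proportion 0.5 for each class, so each of the four (class, group) pairs contributes 0.5 * 0.5 = 0.25, giving 1.0. With this unweighted sum over two groups and two classes, that is the worst score a split can get.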

Attribute        Size  Date        Time   Name
---------  ----------  ----------  -----  ----
File             7189  2017-03-20  20:31  RandomForest\RFbymyself.py
File            11063  2017-03-20  20:32  RandomForest\randomForest.py
File            86084  2017-03-17  11:29  RandomForest\sonar-all-data.csv
File              370  2017-03-20  21:27  RandomForest\阅读说明.txt
Directory           0  2017-03-20  20:33  RandomForest\
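
The helper functions in the snippet could be chained roughly as follows to load a CSV file, convert its columns, and split it into folds. This is a sketch under several assumptions: it is written for Python 3 rather than the Python 2.7 environment named in the description, it uses a tiny made-up CSV written to a temporary file in place of sonar-all-data.csv, and it sorts the class labels before numbering them (the snippet does not) so that the label-to-integer mapping is reproducible.

```python
import os
import tempfile
from csv import reader
from random import seed, randrange

def load_csv(filename):
    """Read a CSV file into a list of rows, skipping empty lines."""
    dataset = []
    with open(filename, 'r') as f:
        for row in reader(f):
            if row:
                dataset.append(row)
    return dataset

def str_column_to_float(dataset, column):
    """Convert one column from string to float, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())

def str_column_to_int(dataset, column):
    """Map class labels to 0, 1, ... in place and return the mapping."""
    unique = sorted(set(row[column] for row in dataset))  # sorted: an assumption for stable numbering
    lookup = dict((value, i) for i, value in enumerate(unique))
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

def cross_validation_split(dataset, n_folds):
    """Draw n_folds folds of equal size at random, without replacement."""
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            fold.append(dataset_copy.pop(randrange(len(dataset_copy))))
        folds.append(fold)
    return folds

# Hypothetical stand-in data: two numeric features and a class label.
sample = ("0.02,0.37,R\n0.45,0.10,M\n0.03,0.35,R\n"
          "0.50,0.12,M\n0.01,0.40,R\n0.48,0.09,M\n")
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
tmp.write(sample)
tmp.close()

seed(1)
dataset = load_csv(tmp.name)
os.remove(tmp.name)
for column in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, column)
lookup = str_column_to_int(dataset, len(dataset[0]) - 1)
folds = cross_validation_split(dataset, 3)
print(lookup)
print([len(fold) for fold in folds])
```

Because fold_size uses floor division, any rows left over when len(dataset) is not a multiple of n_folds are simply not assigned to a fold.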
