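Resource overview
This archive applies the random forest algorithm to the classification problem on the adult dataset. It has four parts: (1) preprocessing of the adult dataset, written in Python; (2) a hand-written random forest implementation applied to the adult dataset; (3) the same classification task solved with Python's sklearn module; (4) a MATLAB script that runs five machine learning classification algorithms on the adult task to compare which one achieves the best classification performance.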
Code snippet and file information
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 6 13:29:41 2018
@author: 28770
"""
import pandas as pd

excelFile = r'ML_data2.xlsx'
# Read sheet 0 of the workbook at the given path into a DataFrame (training set)
train_df = pd.DataFrame(pd.read_excel(excelFile, sheet_name=0))
# Read sheet 1 of the workbook at the given path into a DataFrame (test set)
test_df = pd.DataFrame(pd.read_excel(excelFile, sheet_name=1))
'''
# workClass_loss flags the missing entries in the 'workClass' column of train_df (True where a value is missing)
workClass_loss = train_df['workClass'].isnull()  # .notnull() does the opposite.
'''
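'''
Missing-value filling. The code below uses forward fill: each missing value takes the value from the row above it.
Alternatives: .mode() fills with the column's most frequent value, and mean() fills with the column mean.
A column can have several modes, so .mode() returns a Series rather than a single number as mean() does;
.mode()[0] therefore selects the first mode as the fill value.
'''
# .ffill() is equivalent to fillna(method="ffill"), which newer pandas releases deprecate
train_df_fill = train_df.ffill()
test_df_fill = test_df.ffill()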
'''
Drop the column that carries duplicate information
'''
train_df_fill = train_df_fill.drop(['education'], axis=1)
test_df_fill = test_df_fill.drop(['education'], axis=1)
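'''
Discrete feature mapping
'''
salary_mapping = {'<=50K': 0, '>50K': 1}
train_df_fill['salary'] = train_df_fill['salary'].map(salary_mapping)
test_df_fill['salary'] = test_df_fill['salary'].map(salary_mapping)
Discrete_attribute = ['workClass', 'education', 'marital_status', 'occupation',
                      'relationship', 'race', 'sex', 'native_country']
for attribute in Discrete_attribute:
    # Skip columns that are no longer present ('education' is dropped above);
    # without this guard the lookup would raise a KeyError.
    if attribute not in train_df_fill.columns:
        continue
    # Build the label-to-integer mapping from the training set and reuse it on the test set
    attribute_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill[attribute]))}
    train_df_fill[attribute] = train_df_fill[attribute].map(attribute_mapping)
    test_df_fill[attribute] = test_df_fill[attribute].map(attribute_mapping)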
'''
workClass_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['workClass']))}
train_df_fill['workClass'] = train_df_fill['workClass'].map(workClass_mapping)
test_df_fill['workClass'] = test_df_fill['workClass'].map(workClass_mapping)
education_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['education']))}
train_df_fill['education'] = train_df_fill['education'].map(education_mapping)
test_df_fill['education'] = test_df_fill['education'].map(education_mapping)
marital_status_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['marital_status']))}
train_df_fill['marital_status'] = train_df_fill['marital_status'].map(marital_status_mapping)
test_df_fill['marital_status'] = test_df_fill['marital_status'].map(marital_status_mapping)
occupation_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['occupation']))}
train_df_fill['occupation'] = train_df_fill['occupation'].map(occupation_mapping)
test_df_fill['occupation'] = test_df_fill['occupation'].map(occupation_mapping)
relationship_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['relationship']))}
train_df_fill['relationship'] = train_df_fill['relationship'].map(relationship_mapping)
test_df_fill['relationship'] = test_df_fill['relationship'].map(relationship_mapping)
race_mapping = {lab: idx for idx, lab in enumerate(set(train_df_fill['race']))}
train_df_fill['race'] = train_df_fill['race'].map(race_mapping)
test_df_fill['race'] = test_df_fill['race'].map(race_mapping)
test_df_
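
(The snippet above is cut off at this point.)

Two details of the preprocessing shown above may be worth illustrating. The comments mention mode filling as an alternative to forward fill, and the label mapping is built from set(), whose iteration order for strings can differ between interpreter runs, so the integer codes are not guaranteed to be reproducible. In addition, a label that occurs only in the test split has no entry in the mapping, and pandas' .map() turns it into NaN. The following is a minimal plain-Python sketch of one way to handle both points, not the author's method. The helper names fill_with_mode, build_mapping and encode, the -1 code for unseen labels, and the sample values are all assumptions made for illustration.

```python
from collections import Counter

MISSING = None  # stand-in for NaN in this plain-Python sketch


def fill_with_mode(values):
    """Replace missing entries with the most frequent non-missing value."""
    counts = Counter(v for v in values if v is not MISSING)
    if not counts:
        return list(values)  # nothing to learn a mode from
    top = max(counts.values())
    # Ties are broken by sorted order so the result is reproducible.
    mode = min(v for v, c in counts.items() if c == top)
    return [mode if v is MISSING else v for v in values]


def build_mapping(train_values):
    """Map each distinct label to an integer code, assigned in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(train_values)))}


def encode(values, mapping, unknown=-1):
    """Encode labels; labels absent from the mapping get the `unknown` code."""
    return [mapping.get(v, unknown) for v in values]


# Hypothetical sample columns (not taken from ML_data2.xlsx)
train_workclass = ["Private", None, "State-gov", "Private"]
test_workclass = ["State-gov", "Never-worked", None]

train_filled = fill_with_mode(train_workclass)
# Fill the test split with the mode learned from the training split.
train_mode = Counter(train_filled).most_common(1)[0][0]
test_filled = [train_mode if v is MISSING else v for v in test_workclass]

mapping = build_mapping(train_filled)
train_codes = encode(train_filled, mapping)
test_codes = encode(test_filled, mapping)
print(mapping, train_codes, test_codes)
```

With pandas, the same idea would roughly correspond to filling with Series.mode()[0] and building the mapping from sorted(set(...)) before calling .map(); whether -1 is a suitable code for unseen labels depends on the classifier used downstream.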
Type       Size      Date        Time   Name
---------  --------  ----------  -----  ----
File       4575      2018-11-13  23:33  Random_Forest\excel_change.py
File       1589      2018-11-13  20:55  Random_Forest\Matlab_xlr\excel_run.m
File       2677491   2018-11-06  20:50  Random_Forest\Matlab_xlr\ML_data2_trans.xlsx
File       2918697   2018-11-01  21:57  Random_Forest\ML_data2.xlsx
File       642592    2018-11-08  10:55  Random_Forest\ML_data2_test.csv
File       1285749   2018-11-08  10:55  Random_Forest\ML_data2_train.csv
File       2677491   2018-11-06  20:50  Random_Forest\ML_data2_trans.xlsx
File       642435    2018-11-08  10:59  Random_Forest\Random Forest\ML_data2_test.csv
File       1285592   2018-11-08  10:59  Random_Forest\Random Forest\ML_data2_train.csv
File       2677491   2018-11-06  20:50  Random_Forest\Random Forest\ML_data2_trans.xlsx
File       10260     2018-11-14  13:26  Random_Forest\Random Forest\Random_Forest.py
File       642435    2018-11-08  10:59  Random_Forest\RF_sklearn\ML_data2_test.csv
File       1285592   2018-11-08  10:59  Random_Forest\RF_sklearn\ML_data2_train.csv
File       2677491   2018-11-06  20:50  Random_Forest\RF_sklearn\ML_data2_trans.xlsx
File       1259      2018-11-14  14:15  Random_Forest\RF_sklearn\RF_sklearn.py
File       214       2018-11-14  13:51  Random_Forest\文本描述(首先阅读).txt
Directory  0         2018-12-14  10:51  Random_Forest\Matlab_xlr
Directory  0         2018-12-14  10:51  Random_Forest\Random Forest
Directory  0         2018-12-14  10:51  Random_Forest\RF_sklearn
Directory  0         2018-12-14  10:51  Random_Forest
---------  --------  ----------  -----  ----
           19430953                     20