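Resource overview
Complete code for the data preprocessing part of the Kaggle House Prices competition, with very detailed comments. It follows a classic, step-by-step data mining preprocessing workflow.
Code snippet and file information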
#preprocessing for training&test data
#@2016.11.08
import pandas as pd
#step1:reading csv data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
#train.head() # take a brief look at training data
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition'])) # concat training&test data
import numpy as np
from scipy.stats import skew
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
#step2:log transform for training data (including the labels)
''' a png for labels' distribution
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price": train["SalePrice"], "log(price + 1)": np.log1p(train["SalePrice"])})
prices.hist()
plt.savefig('label_dist.png', dpi=150)
'''
train["SalePrice"] = np.log1p(train["SalePrice"]) #log transform the target
#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index # get the index of all the n
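
The snippet is cut off at this point, so the author's remaining steps are not shown. As a rough illustration of what "log transform skewed numeric features" could look like, here is a minimal sketch on a tiny synthetic frame (not the Kaggle data). The helper name `log_transform_skewed`, the 0.75 skewness threshold, and the use of `select_dtypes` to pick numeric columns are all assumptions made for this example, not taken from the original code.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew


def log_transform_skewed(df, threshold=0.75):
    """Return (copy of df with log1p applied to skewed numeric columns, list of those columns).

    Hypothetical helper; the 0.75 threshold is an assumed value.
    """
    out = df.copy()
    # Pick numeric columns only (assumption: a swap-in for the dtype comparison above)
    numeric_cols = out.select_dtypes(include="number").columns
    # Sample skewness per column, ignoring missing values
    skewness = out[numeric_cols].apply(lambda col: skew(col.dropna()))
    skewed_cols = skewness[skewness > threshold].index
    # log1p(x) = log(1 + x), which is defined at x = 0
    out[skewed_cols] = np.log1p(out[skewed_cols])
    return out, list(skewed_cols)


# Tiny synthetic frame for illustration
demo = pd.DataFrame({
    "area": [1.0, 2.0, 3.0, 4.0, 1000.0],  # strongly right-skewed
    "rooms": [1.0, 2.0, 3.0, 4.0, 5.0],    # symmetric, skewness 0
    "zone": ["a", "b", "a", "c", "b"],     # non-numeric, left untouched
})
transformed, skewed_cols = log_transform_skewed(demo)

# np.expm1 inverts np.log1p, e.g. to map log-scale predictions back to prices
recovered_area = np.expm1(transformed["area"])
```

Only `area` crosses the threshold here, so `rooms` and `zone` come back unchanged. Because the target was also transformed with `np.log1p`, any model trained on it predicts on the log scale, and `np.expm1` is the matching inverse.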