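Resource Overview
Xidian University data mining course project: BigMart store sales data analysis.
Code Snippet and File Information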
# -*- coding: utf-8 -*-
"""
Created on Sat Aug 25 13:45:40 2018
@author: Pratik
"""
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
# We will combine the train and test data to perform feature engineering
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)
print('--------------------------------------------------------------')
print(train.shape, test.shape, data.shape)
print('--------------------------------------------------------------\n')
# The problem is already defined: we need to predict sales by store
data.info()
data.describe()
# Some observations
# 1. Item_Visibility has a minimum value of 0, which is implausible
# 2. Outlet_Establishment_Year would be more useful converted into the outlet's age
# Let's check how many unique values each column has
data.apply(lambda x: len(x.unique()))
# Let us have a look at the object datatype columns
for i in train.columns:
    if train[i].dtype == 'object':
        print(train[i].value_counts())
        print('--------------------------------------------\n')
print('--------------------------------------------')
# The output gives us the following observations:
# Item_Fat_Content: Some 'Low Fat' values are mis-coded as 'low fat' and 'LF'. Some 'Regular' values also appear as 'regular'.
# Item_Type: Not all categories have substantial counts. Combining some of them may give better results.
# Outlet_Type: Supermarket Type2 and Type3 could be combined, but we should check whether that is a good idea before doing it.
# Missing value percentage
# Item_Weight and Outlet_Size have some missing values
print('--------------------------------------------')
print('missing value percentage:')
print((data[data['Item_Weight'].isnull()].shape[0] / data.shape[0]) * 100)
print((data[data['Outlet_Size'].isnull()].shape[0] / data.shape[0]) * 100)
print('--------------------------------------------\n')
# We impute missing values: mean for Item_Weight, mode for Outlet_Size
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
# mode() returns a Series, so take its first entry as the fill value
data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])
# Let's replace Item_Visibility values of 0 with the column mean so they make sense
data['Item_Visibility'] = data['Item_Visibility'].replace(
    0, data['Item_Visibility'].mean())
# we will calculate meanRatio for each object's visibility
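
Two of the observations recorded in the comments above (inconsistent Item_Fat_Content labels, and Outlet_Establishment_Year being more useful as an age) are not handled in the visible part of the snippet. The following is a minimal sketch of how they might be addressed, not the author's code. The helper names, the `Outlet_Years` column name, the reference year of 2013, and the small demo frame are all assumptions made for illustration.

```python
import pandas as pd

# Assumed mapping from the mis-coded labels mentioned in the comments
# ('low fat', 'LF', 'regular') to their canonical spellings.
FAT_CONTENT_MAP = {
    "LF": "Low Fat",
    "low fat": "Low Fat",
    "regular": "Regular",
}


def normalize_fat_content(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Item_Fat_Content labels unified (hypothetical helper)."""
    out = df.copy()
    out["Item_Fat_Content"] = out["Item_Fat_Content"].replace(FAT_CONTENT_MAP)
    return out


def add_outlet_age(df: pd.DataFrame, reference_year: int = 2013) -> pd.DataFrame:
    """Return a copy with an Outlet_Years column (hypothetical name).

    reference_year is an assumption; set it to the year the sales
    data actually refers to.
    """
    out = df.copy()
    out["Outlet_Years"] = reference_year - out["Outlet_Establishment_Year"]
    return out


# Tiny made-up frame for illustration only; these rows are not from Train.csv.
demo = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "LF", "low fat", "Regular", "regular"],
    "Outlet_Establishment_Year": [1999, 2009, 1985, 2004, 1999],
})
demo = add_outlet_age(normalize_fat_content(demo))
print(demo["Item_Fat_Content"].value_counts())
print(demo["Outlet_Years"].tolist())
```

Both helpers return copies rather than modifying their input, so they could be applied to the combined `data` frame without disturbing `train` or `test`.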
Type       Size     Date        Time   Name
---------  -------  ----------  -----  ----
Directory        0  2018-10-17  20:59  bigmart-master\
File          1203  2018-09-07  02:49  bigmart-master\.gitignore
File        181844  2018-10-18  22:29  bigmart-master\alg0.csv
File        112910  2018-10-17  21:32  bigmart-master\alg1.csv
File        179127  2018-10-17  21:32  bigmart-master\alg2.csv
File        177867  2018-10-17  21:32  bigmart-master\alg3.csv
File        178794  2018-10-17  21:32  bigmart-master\alg6.csv
File          9337  2018-10-18  22:29  bigmart-master\BigMart.py
File            37  2018-09-07  02:49  bigmart-master\README.md
File        527709  2018-09-07  02:49  bigmart-master\Test.csv
File        965049  2018-10-18  22:29  bigmart-master\test_modified.csv
File        869537  2018-09-07  02:49  bigmart-master\Train.csv
File       1534109  2018-10-18  22:29  bigmart-master\train_modified.csv