Xidian University Data Mining Course Project: Shopping Mall (BigMart) Data Analysis

Size: 977 KB

File type: .zip

Coins: 2

Downloads: 1

Published: 2021-09-14
Language: Other

Resource description

Xidian University data mining course project: shopping mall (BigMart) data analysis.

Code snippet and file information

# -*- coding: utf-8 -*-
"""
Created on Sat Aug  25 13:45:40 2018

@author: Pratik
"""

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

sns.set()
knn = KNeighborsClassifier(n_neighbors=5)

train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')

# Combine the train and test data so feature engineering is applied to both;
# the 'source' column lets us split them apart again later

train['source'] = 'train'
test['source'] = 'test'

data = pd.concat([train, test], ignore_index=True)
print('--------------------------------------------------------------')
print(train.shape, test.shape, data.shape)
print('--------------------------------------------------------------\n')
# The problem is already defined: we need to predict sales by store

data.info()
# describe() only returns a frame; print it so it shows up when run as a script
print(data.describe())

# Some observations
# 1. Item_Visibility has a minimum value of 0, which is unlikely
# 2. Outlet_Establishment_Year would be more useful if converted to the outlet's age

# Let's check how many unique values each column has
print(data.apply(lambda x: len(x.unique())))

# Let us have a look at the object datatype columns

for i in train.columns:
    if train[i].dtype == 'object':
        print(train[i].value_counts())
        print('--------------------------------------------\n')
        print('--------------------------------------------')

# The output gives us the following observations:

# Item_Fat_Content: Some 'Low Fat' values are mis-coded as 'low fat' and 'LF'. Some 'Regular' values also appear as 'regular'.
# Item_Type: Not all categories have substantial counts. Combining some of them may give better results.
# Outlet_Type: Supermarket Type2 and Type3 could be combined, but we should check whether that is a good idea before doing it.

# Missing value percentage
# Item_Weight and Outlet_Size have some missing values
print('--------------------------------------------')
print('missing value percentage:')
print((data[data['Item_Weight'].isnull()].shape[0] / data.shape[0]) * 100)
print((data[data['Outlet_Size'].isnull()].shape[0] / data.shape[0]) * 100)
print('--------------------------------------------\n')

# Impute missing values: mean for Item_Weight, mode for Outlet_Size
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
# mode() returns a Series, so take its first entry with [0];
# assign the result back rather than calling fillna(..., inplace=True) on a column
data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])


# Replace Item_Visibility values of 0 with the column mean so they make sense
data['Item_Visibility'] = data['Item_Visibility'].replace(
    0, data['Item_Visibility'].mean())

# we will calculate meanRatio for each object's visibility
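
The snippet is cut off at this point. The cleaning steps shown above can be tried out on a small synthetic frame without Train.csv and Test.csv. What follows is a minimal sketch, not the author's code: the function name `clean_bigmart` and the sample values are hypothetical, and the Item_Fat_Content relabelling is an assumption based only on the mis-coded labels mentioned in the comments, since the snippet itself does not show that step.

```python
import numpy as np
import pandas as pd


def clean_bigmart(df):
    """Apply the imputation steps from the snippet to a copy of df (hypothetical helper)."""
    out = df.copy()
    # Mean-impute missing item weights.
    out['Item_Weight'] = out['Item_Weight'].fillna(out['Item_Weight'].mean())
    # Mode-impute missing outlet sizes; mode() returns a Series, so take [0].
    out['Outlet_Size'] = out['Outlet_Size'].fillna(out['Outlet_Size'].mode()[0])
    # Replace a visibility of 0 with the column mean, as the snippet does.
    out['Item_Visibility'] = out['Item_Visibility'].replace(
        0, out['Item_Visibility'].mean())
    # Assumption: collapse the mis-coded labels noted in the comments.
    out['Item_Fat_Content'] = out['Item_Fat_Content'].replace(
        {'LF': 'Low Fat', 'low fat': 'Low Fat', 'regular': 'Regular'})
    return out


# Synthetic rows standing in for the real data (values are made up).
sample = pd.DataFrame({
    'Item_Weight': [10.0, np.nan, 20.0, 30.0],
    'Outlet_Size': ['Small', 'Medium', None, 'Medium'],
    'Item_Visibility': [0.0, 0.25, 0.5, 0.0],
    'Item_Fat_Content': ['Low Fat', 'LF', 'low fat', 'regular'],
})

# Percentage of missing values per column, equivalent to the shape-based ratio above.
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

cleaned = clean_bigmart(sample)
print(cleaned)
```

Note that the mean used for Item_Visibility includes the zero entries, exactly as in the snippet; computing the mean over non-zero values only would be a different (arguably stricter) choice.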

Type             Size  Date       Time   Name
----------- ---------  ---------- -----  ----
Directory           0  2018-10-17 20:59  bigmart-master\
File             1203  2018-09-07 02:49  bigmart-master\.gitignore
File           181844  2018-10-18 22:29  bigmart-master\alg0.csv
File           112910  2018-10-17 21:32  bigmart-master\alg1.csv
File           179127  2018-10-17 21:32  bigmart-master\alg2.csv
File           177867  2018-10-17 21:32  bigmart-master\alg3.csv
File           178794  2018-10-17 21:32  bigmart-master\alg6.csv
File             9337  2018-10-18 22:29  bigmart-master\BigMart.py
File               37  2018-09-07 02:49  bigmart-master\README.md
File           527709  2018-09-07 02:49  bigmart-master\Test.csv
File           965049  2018-10-18 22:29  bigmart-master\test_modified.csv
File           869537  2018-09-07 02:49  bigmart-master\Train.csv
File          1534109  2018-10-18 22:29  bigmart-master\train_modified.csv
