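Project description: This project uses the 2011 to 2012 bike-sharing dataset for a US city from the Kaggle competition (Bike Sharing Demand | Kaggle). The dataset includes rental date, weather, season, temperature, "feels like" temperature, humidity, and wind speed. After cleaning the data and computing descriptive statistics, we analyze how each of these variables affects rentals and produce basic visualizations.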
import warnings
warnings.filterwarnings('ignore')  # suppress warning output
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('Biketrain.csv')
# Data cleaning: split the datetime column into date, hour, month, and weekday
data['date'] = data['datetime'].map(lambda x: x.split()[0])
data['hour'] = data['datetime'].map(lambda x: x.split()[1].split(':')[0])
data['month'] = data['datetime'].map(lambda x: x.split(' ')[0].split('-')[1]).astype('int')  # split on a single space
data['weekday'] = data['date'].map(lambda x: datetime.strptime(x, '%Y-%m-%d').isoweekday())  # note the uppercase 'Y'
# isoweekday() returns 1-7 for Monday through Sunday
data.head()
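As an alternative to splitting the strings by hand, the same columns can be derived by parsing the timestamps once with `pd.to_datetime` and reading the `.dt` accessors. This is only a sketch: the three-row `sample` frame is made up for illustration and simply assumes the same `YYYY-MM-DD HH:MM:SS` layout as the dataset.

```python
import pandas as pd

# Hypothetical sample in the same 'YYYY-MM-DD HH:MM:SS' layout as the dataset
sample = pd.DataFrame({'datetime': ['2011-01-01 00:00:00',
                                    '2011-01-02 13:00:00',
                                    '2012-07-16 08:00:00']})

ts = pd.to_datetime(sample['datetime'])      # parse the strings once
sample['date'] = ts.dt.strftime('%Y-%m-%d')  # date part as a string
sample['hour'] = ts.dt.hour                  # integer hour, 0-23
sample['month'] = ts.dt.month                # integer month, 1-12
sample['weekday'] = ts.dt.dayofweek + 1      # dayofweek is 0-6 (Mon-Sun); +1 matches isoweekday()
print(sample)
```

Note that `hour` comes out as an integer here, whereas the string-splitting version keeps it as a string until it is cast later.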
# Dictionary mapping with map
# Season, weather, and weekday are stored as numbers; convert them to names for readability:
week_days = {1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday',7:'Sunday'}  # keys follow isoweekday(): 1 = Monday ... 7 = Sunday
data['weekday_name'] = data['weekday'].map(week_days)
season_name = {1:'Spring',2:'Summer',3:'Fall',4:'Winter'}
data['season_name'] = data['season'].map(season_name)
weather_name = {1:'Sunny',2:'Cloudy',3:'Light Rain',4:'Heavy Rain'}
data['weather_name'] = data['weather'].map(weather_name)
data.drop('datetime', axis=1, inplace=True)
# inplace=True: modify the original object directly without creating a new one
# inplace=False: leave the original untouched and return a new object holding the result
data.head()
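One thing worth checking after a dictionary mapping: `Series.map` leaves `NaN` wherever a value has no matching key, so an incomplete dictionary fails silently. A minimal sketch with made-up weekday values (the names `weekday`, `names`, and `incomplete` are illustrative only):

```python
import pandas as pd

weekday = pd.Series([1, 5, 7])  # hypothetical isoweekday() values
names = {1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday',
         5: 'Friday', 6: 'Saturday', 7: 'Sunday'}

mapped = weekday.map(names)     # every value has a key, so no NaN appears

# Drop the key 7 to show what happens when a value is not covered
incomplete = {k: v for k, v in names.items() if k != 7}
missing = weekday.map(incomplete)

print(mapped.tolist())
print(missing.isna().sum())     # number of values that could not be mapped
```

Running `isna().sum()` on each mapped column is a cheap way to confirm that the dictionary keys cover every value that actually occurs.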
'''Variable descriptions:
datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday (0: no; 1: yes)
workingday - whether the day is neither a weekend nor holiday (0: no; 1: yes)
weather - 1: Clear, Few clouds, Partly cloudy
          2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
          3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
          4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius (actual temperature)
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals'''
#https://www.cnblogs.com/zqiguoshang/p/5744563.html
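# Use corr to compute the correlation matrix of the dataset and visualize it, to pick out the variables that most affect count
data_corr = data.corr(numeric_only=True)  # numeric columns only (pandas 1.5 or later); string columns such as date and hour are skipped
fig = plt.figure(1)  # create figure number 1
ax1 = plt.subplot(1,1,1)
# plt.subplot(111) and plt.subplot(1,1,1) are equivalent: split the area into 1 row and 1 column and draw in the first cell (cells are numbered row by row)
fig.set_size_inches(11,11)  # resize the figure
sns.heatmap(data_corr, ax=ax1, annot=True, square=False)
# annot is short for annotate and defaults to False; when True, each cell of the heatmap shows its value
# square controls whether the heatmap cells are forced to be square; the default is False
plt.show()
# Season, weather, temperature, humidity, wind speed, and month turn out to have a relatively large effect on the rental count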
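Reading the strongest relationships off a heatmap by eye is error-prone, so it can help to rank the `count` column of the correlation matrix by absolute value. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the bike-sharing dataset
toy = pd.DataFrame({
    'temp':     [5, 10, 15, 20, 25, 30, 35],
    'humidity': [80, 60, 90, 50, 70, 40, 65],
    'count':    [50, 100, 140, 210, 240, 310, 340],
})

# Correlation of every other column with count
corr_with_count = toy.corr()['count'].drop('count')

# Sort by strength of the relationship, keeping the sign
ranked = corr_with_count.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Applied to the real frame, the same two lines on `data_corr['count']` would give a ranked list to compare against the heatmap.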
# Bar chart: categorical vs. continuous
# Analyze how month, season, and weekday affect bike usage; season and weekday are combined with the hour dimension
fig, (ax1, ax2, ax3) = plt.subplots(3, 1)  # the three axes are unpacked from a tuple, hence the commas
fig.set_size_inches(11, 15)
Month_avg = pd.DataFrame(data.groupby('month')['count'].mean()).reset_index()
sns.barplot(ax=ax1, data=Month_avg, x='month', y='count')
ax1.set(xlabel='Month', ylabel='Avg_num', title='Average Number By Month')
# Line chart
# hue: draw one line per level of the given categorical variable
Season_hour_avg = pd.DataFrame(data.groupby(['hour', 'season_name'], sort=True)['count'].mean()).reset_index()
sns.pointplot(ax=ax2, data=Season_hour_avg, x='hour', y='count', hue='season_name')  # points are joined by lines by default
ax2.set(xlabel='Hour', ylabel='Count', title='Average Count By Hour Each Season')
# Line chart
Week_hour_avg = pd.DataFrame(data.groupby(['hour', 'weekday'], sort=True)['count'].mean()).reset_index()
sns.pointplot(ax=ax3, data=Week_hour_avg, x='hour', y='count', hue='weekday')
ax3.set(xlabel='Hour', ylabel='Count', title='Average Count By Hour Each Weekday')
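May through October is the peak rental season each year, and rentals in spring are clearly lower than in the other seasons.
On working days, usage is concentrated around 8 a.m. and 5-6 p.m., which matches the commuting rush hours.
On weekends the peak differs from working days and mostly falls between 11:00 and 17:00.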
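The peak hours described above can also be read out numerically instead of from the line chart. A rough sketch, assuming the dataset's `workingday` flag is used to separate the two kinds of day; the `toy` records are invented purely to show the mechanics:

```python
import pandas as pd

# Hypothetical toy records: hour of day, working-day flag, rental count
toy = pd.DataFrame({
    'hour':       [8, 8, 13, 13, 17, 17, 8, 13, 17],
    'workingday': [1, 1, 1, 1, 1, 1, 0, 0, 0],
    'count':      [300, 340, 120, 100, 400, 380, 60, 250, 180],
})

# Mean count per hour, one column per workingday value (0 and 1)
profile = toy.groupby(['workingday', 'hour'])['count'].mean().unstack('workingday')

# Hour with the highest mean count in each column
peak_hour = profile.idxmax()
print(profile)
print(peak_hour)
```

On the real data the same `groupby` and `unstack` would give a 24-row table, and `idxmax` would report one peak hour for working days and one for non-working days.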
# Effect of weather and wind speed on bike usage
climateDf = data[['count','weather','weather_name','temp','atemp','humidity','windspeed']]
climateDf = pd.concat([climateDf, data['hour'].astype(int)], axis=1)
fig, axes = plt.subplots(2, 1, figsize=(10, 14))  # figsize sets the figure width and height in inches
df_sum = climateDf.groupby('weather')['count'].sum()   # select count first so the string column weather_name is not aggregated
df_avg = climateDf.groupby('weather')['count'].mean()
df_weather = pd.concat([df_sum, df_avg], axis=1).reset_index()
# Dual-axis chart: bike usage under different weather conditions
ax1 = plt.subplot(2,1,1)  # number of rows, number of columns, and index of the selected plotting area
df_weather.columns = ['weather','sum','mean']
df_weather['sum'].plot(kind='bar', width=0.4, ax=ax1, alpha=0.6, label='')
# chart type, bar width, target axes, transparency, and the name shown for the series in the legend
df_weather['mean'].plot(ax=ax1, style='b--.', alpha=0.6, secondary_y=True, label='Mean')
# style combines color, line style, and marker: 'b--.' is a blue dashed line with point markers
ax1.set_xticks(range(len(df_weather)))  # the bars sit at positions 0..n-1, not at the weather codes 1..4
ax1.set_xlabel('Weather')  # set the x-axis label
ax1.set_xticklabels(['Sunny','Cloudy','Light Rain','Heavy Rain'], rotation='horizontal')
# set the x-axis tick labels; rotation is the rotation angle of the text
ax1.set_ylabel('Sum of rental')
ax1.right_ax.set_ylabel('Avg of rental')
ax1.set_title('The rental number of bike_sharing in 2011-2012 with different weather')
# Bike usage at different wind speeds
ax2 = plt.subplot(2,1,2)
df_sum2 = climateDf.groupby('windspeed')['count'].sum()
df_avg2 = climateDf.groupby('windspeed')['count'].mean()
df_wind = pd.concat([df_sum2, df_avg2], axis=1).reset_index()
df_wind.columns = ['windspeed','sum','mean']
df_wind = df_wind.set_index('windspeed')  # plot against the wind speed values rather than the row number
df_wind['sum'].plot(ax=ax2, kind='area', alpha=0.5, label='')  # area chart
df_wind['mean'].plot(style='b--.', alpha=0.7, ax=ax2, secondary_y=True, label='Mean')
ax2.set_ylabel('Sum of rental')
ax2.right_ax.set_ylabel('Avg of rental')
ax2.set_title('The rental number of bike_sharing in 2011-2012 with different windspeed')
ax2.set_xlabel('Windspeed')
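The better the weather and the lower the wind speed (below 6), the higher the rental volume. Extreme weather and high wind speeds (above 25), however, show an unexpectedly high average rental count. These are outliers: such conditions occur on very few days, so the averages computed for them fluctuate widely.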
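The point about rare conditions can be checked by reporting the number of observations next to each group's sum and mean. This is a sketch on invented numbers; the `few_samples` column and its threshold of 2 are arbitrary choices for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data: weather code 4 occurs only once
toy = pd.DataFrame({
    'weather': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    'count':   [200, 180, 220, 210, 150, 170, 160, 90, 70, 164],
})

# Sum, mean, and number of rows per weather code in one pass
summary = toy.groupby('weather')['count'].agg(['sum', 'mean', 'size'])

# Flag groups whose mean rests on very few rows (arbitrary threshold)
summary['few_samples'] = summary['size'] < 2
print(summary)
```

A mean that rests on a single row, like weather code 4 in this toy frame, is just that one observation, which is why it can sit well above or below the trend.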
# Scatter plots
# https://blog.csdn.net/qq_17278169/article/details/54927014 (parameter reference)
fig, axes = plt.subplots(3, 1, figsize=(10, 13))  # 3 rows, 1 column
ax1 = plt.subplot(3,1,1)
df_hum = climateDf[['humidity', 'count']]
ax1.scatter(df_hum['humidity'], df_hum['count'], s=df_hum['count']/5, c=df_hum['count'], marker='.', alpha=0.8)
# s: marker size; c: marker color
ax1.set_title('The rental number of bike_sharing in 2011-2012 with different humidity')
ax1.set_xlabel('Humidity')
ax1.set_ylabel('Number')
ax2 = plt.subplot(3,1,2)
df_temp = climateDf[['temp', 'count']]
ax2.scatter(df_temp['temp'], df_temp['count'], s=df_temp['count']/5, c=df_temp['count'], marker='.', alpha=0.8)
ax2.set_title('The rental number of bike_sharing in 2011-2012 with different temperature')
ax2.set_xlabel('Temperature')
ax2.set_ylabel('Number')
ax3 = plt.subplot(3,1,3)
df_ws = climateDf[['windspeed', 'count']]  # separate name so the temperature frame is not overwritten
ax3.scatter(df_ws['windspeed'], df_ws['count'], s=df_ws['count']/5, c=df_ws['count'], marker='.', alpha=0.8)
ax3.set_title('The rental number of bike_sharing in 2011-2012 with different windspeed')
ax3.set_xlabel('Windspeed')
ax3.set_ylabel('Number')
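Dense scatter plots can hide the trend, so one complementary view is to bin the continuous variable with `pd.cut` and compare the mean count per bin. The sketch below uses invented values and arbitrary 10-degree bins; it illustrates the technique only and does not reproduce any result from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the bike-sharing dataset
toy = pd.DataFrame({
    'temp':  [3, 8, 12, 18, 22, 27, 29, 33, 38],
    'count': [40, 70, 110, 180, 230, 300, 310, 240, 150],
})

bins = [0, 10, 20, 30, 40]                         # arbitrary 10-degree bins
toy['temp_bin'] = pd.cut(toy['temp'], bins=bins)   # right-closed intervals such as (20, 30]

# Mean count per temperature bin
binned = toy.groupby('temp_bin', observed=True)['count'].mean()
best_bin = binned.idxmax()                         # interval with the highest mean count
print(binned)
print(best_bin)
```

The same pattern works for `humidity` and `windspeed` by swapping the column and the bin edges.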
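As humidity increases, the rental volume decreases;
The rental volume first rises and then falls as temperature increases, with the best temperature between 25 and 30 degrees;
Within a certain range (5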