【Python】Kaggle_Titanic_prediction 1: survival prediction with logistic regression



The Kaggle Titanic survival prediction is a classic international entry-level case in data mining.

So, let's give it a small try.

#Titanic: Machine Learning from Disaster#

# Import the usual data modules
import pandas as pd
import numpy as np
# Load the training set
train1=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/train.csv")
train1.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
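# Look at the basic summary statistics

# Start with the numeric columns
train1.describe()

# Mean Survived is 0.384: fewer than four in ten passengers survived.
# Mean Age is 29.7, so the passengers skew young.
# SibSp (siblings/spouses) plus Parch (parents/children) averages about 0.9 people (call them "family members"), so on average each passenger travels with roughly one companion. Some travel with a whole family, of course, and others alone.
# Mean Fare is 32.2. Common sense says expensive tickets are few but cost far more than ordinary ones, so the median and mode should sit below 32; the table confirms the median (50% = 14.45).
# Mean Pclass is 2.3, so third-class passengers outnumber first- and second-class passengers.

# Age is missing 177 values (714 of 891 present); fill them later with the mean or the median.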
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
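# Next, the string (non-numeric) columns
train1.describe(include="O")

# Ticket has 681 distinct values, possibly 681 travelling parties.
# Embarked has 3 ports.
# Cabin has 147 distinct values, but with so many entries missing in that column the figure is not usable.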
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Jensen, Mr. Hans Peder male 347082 B96 B98 S
freq 1 577 7 4 644
# Check each column's info
train1.info()

# 891 rows in total; Age, Cabin and Embarked have missing values.
# Next, find a way to make the missing data easier to see.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
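# Use isnull to check whether each column has missing values
train1.isnull().any()

# Three columns have gaps: Age, Cabin and Embarked. This still doesn't say whether most of a column is missing or only a few entries.
# Next, chain sum() after isnull() to count the missing values per column.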
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool
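# After counting with sum, sort the counts to make them easier to read
train1.isnull().sum().sort_values(ascending=False)

# Much clearer: we see which columns have gaps and how many values each is missing, and the descending order makes it obvious at a glance.
# Cabin          687
# Age            177
# Embarked         2
# Of these three columns:
# Age can be filled with the mean or the median.
# Embarked is missing only two values; other related fields (the ticket number, for example) may reveal the port of embarkation.
# Cabin has no obvious fill yet; we need to look at the actual values for a pattern, or simply drop the column.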
Cabin          687
Age            177
Embarked         2
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
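The missing counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a made-up toy frame in place of train1 (the values and the names `toy` and `summary` are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train1; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

missing = toy.isnull().sum()        # number of missing cells per column
summary = pd.DataFrame({
    "missing": missing,
    "share": missing / len(toy),    # fraction of rows that are missing
}).sort_values("missing", ascending=False)
print(summary)
```

On the real data the same two columns would show at a glance that Cabin is mostly empty while Embarked is almost complete.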
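# Look at the rows where Embarked is missing
train1[train1.Embarked.isnull()]

# These two share a ticket number and a cabin, so they travelled together. SibSp and Parch are both 0, so they are not family; presumably friends or companions.
# So which port did they board at?

# Next we print all of train1 and browse it (the output took too much space and was cleared once the conclusion was reached):
# passengers whose Ticket is six digits starting with "113" all have Embarked = S. We therefore infer that these two also boarded at S and fill that in.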
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN
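Rather than scrolling through the full table, the ticket-prefix observation can be checked with a filter. This is only a sketch on invented rows; `toy`, `is_113` and `ports` are hypothetical names, and the regular expression encodes the "six digits starting with 113" rule described above:

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the Ticket / Embarked columns of train1.
toy = pd.DataFrame({
    "Ticket": ["113803", "113783", "113572", "PC 17599", "113572"],
    "Embarked": ["S", "S", np.nan, "C", np.nan],
})

# True for tickets that are exactly six digits and start with "113".
is_113 = toy["Ticket"].str.match(r"113\d{3}$")
# Ports recorded for those tickets; rows with a missing port are not counted.
ports = toy.loc[is_113, "Embarked"].value_counts()
print(ports)
```

If the real data gives a single port here, filling the two gaps with it is a defensible guess.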
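# Fill with the fillna function

train2=train1.fillna({"Embarked":"S"})
train2.head(20)

# To fill and display a single column, use train1["Embarked"].fillna("S"). On a Series the fill value is passed directly; a {column: value} dict only works on the DataFrame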
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande… female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
# Confirm that the missing Embarked values were filled
train2[train2.Embarked.isnull()]

# No rows come back, so the fill worked.
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
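# Check whether other columns are still missing values
train2.isnull().sum().sort_values(ascending=False)

# The other two columns are unchanged from before the Embarked fill. All good, on to the next step.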
Cabin          687
Age            177
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
# Next, fill the missing Age values with the median
train3=train2.fillna(train2['Age'].median())
train3.head(20)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 28 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 28 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 28 S
5 6 0 3 Moran, Mr. James male 28.0 0 0 330877 8.4583 28 Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 28 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 28 S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 28 C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 28 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 28 S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 28 S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 28 S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 28 Q
17 18 1 2 Williams, Mr. Charles Eugene male 28.0 0 0 244373 13.0000 28 S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande… female 31.0 1 0 345763 18.0000 28 S
19 20 1 3 Masselmani, Mrs. Fatima female 28.0 0 0 2649 7.2250 28 C
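# Fill done; check the missing counts again
train3.isnull().sum()

# Cabin no longer has any missing values either. Why? The number 28 also appears all over the Cabin column, and 28 is the Age median.
# The cause: calling fillna on the whole DataFrame with a single scalar fills the NaN in every column, not only in Age. The next cell restricts the fill to the Age column.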
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
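The effect is easy to reproduce on a tiny frame. A minimal sketch with invented values (the names `toy`, `filled_all` and `filled_age` are illustrative; Cabin is forced to object dtype to match the notebook output):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": pd.Series([np.nan, "C85", np.nan], dtype=object),
})
median_age = toy["Age"].median()

# Scalar argument: every NaN in every column is replaced.
filled_all = toy.fillna(median_age)
# Dict argument: only the named column is touched.
filled_age = toy.fillna({"Age": median_age})

print(filled_all)
print(filled_age)
```

The dict form (or assigning to a single column) keeps Cabin's gaps intact, which is what we want here.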
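# Another way: still fill the missing Age values with the median
train2["Age"]=train2["Age"].fillna(train2["Age"].median())
train2.isnull().sum().sort_values(ascending=False)

# The difference from the previous attempt is the ["Age"] after train2, which restricts the fill to one column.
# Note: in train2["Age"]=train2["Age"].fillna(train2["Age"].median()), the ["Age"] on the left of the equals sign must stay. Without it, train2 is rebound to the filled Age Series, the rest of the table is lost, and the whole notebook has to be rerun from the top.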
Cabin          687
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Age              0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
train2.head(20)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male 28.0 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male 28.0 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande… female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female 28.0 0 0 2649 7.2250 NaN C
# Now look at the test data

# Load the test set
test1=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/test.csv")
test1.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
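# Missing columns and counts
test1.isnull().sum().sort_values(ascending=False)

# Cabin is on the list again; treat it as expendable and set it aside for now.
# Age is also on the list; fill with the mean or the median.
# Fare is missing one value; look for a pattern in the data and fill it by hand.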
Cabin          327
Age             86
Fare             1
Embarked         0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
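# Look at the row where Fare is missing
test1[test1.Fare.isnull()]

# The passenger with no Fare is in third class, male, aged 60.5, travelling alone. The mean or median third-class fare is a reasonable fill and should fit the overall picture fairly well.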
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
152 1044 3 Storey, Mr. Thomas male 60.5 0 0 3701 NaN NaN S
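# Fill missing values with the mean of the row's group
# For example, group by Pclass and fill each missing Fare with its group's mean Fare
test1["Fare"] = test1.groupby("Pclass").transform(lambda x: x.fillna(x.mean()))
test1.isnull().sum().sort_values(ascending=False)

# Filled; judging by the missing counts, it looks OK.

# Next, verify that the filled value really is the Pclass=3 mean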
Cabin          327
Age             86
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
# The passenger with the missing Fare has PassengerId 1044 and row index 152, so we can look the row up.
test1.loc[152,["Pclass","Name","Age","Fare"]]
# test1.loc[152] or test1.loc[152,:] shows every column of row 152
Pclass                     3
Name      Storey, Mr. Thomas
Age                     60.5
Fare                    1044
Name: 152, dtype: object
# The quickest way is to use loc to show all columns of row 152
test1.loc[152]

# The filled Fare is 1044. Let's check whether the mean Fare for Pclass=3 (third class) is 1044.
PassengerId                  1044
Pclass                          3
Name           Storey, Mr. Thomas
Sex                          male
Age                          60.5
SibSp                           0
Parch                           0
Ticket                       3701
Fare                         1044
Cabin                         NaN
Embarked                        S
Name: 152, dtype: object
Fare_Pclass_mean = test1.groupby("Pclass")["Fare"].mean()
Fare_Pclass_mean
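# Or skip the assignment and run test1.groupby("Pclass")["Fare"].mean() directly; it serves the same purpose.

# The check shows a Pclass=3 mean of 1094, not the 1044 that was filled in. Why?
# Both numbers are suspicious: 1044 is exactly this passenger's PassengerId, and no fare in train.csv exceeds 512.33. Because no column was selected before transform, the result was a whole DataFrame, and the assignment apparently overwrote Fare with its first column, PassengerId. The means above are therefore PassengerId means, not fares.
# So this fill was wrong: test1["Fare"] = test1.groupby("Pclass").transform(lambda x: x.fillna(x.mean()))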
Pclass
1    1098.224299
2    1117.935484
3    1094.178899
Name: Fare, dtype: float64
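One way to do the group-wise fill correctly is to select the Fare column before calling transform, so that a single Series comes back. A minimal sketch on invented numbers (`toy` and `class_mean` are illustrative names, not part of the original notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 60.0, 8.0, np.nan, 10.0],
})

# Select Fare BEFORE transform: the result is one Series aligned to the
# original index, holding each row's class mean (NaN values are skipped).
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")
# Only the missing entries take the class mean; existing fares are kept.
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```

Because only Fare is selected, PassengerId and the other columns are never touched.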
# Look for another method, and first restore the state before the fill.
# Reload the data
test2=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/test.csv")
test2.head(35)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
5 897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 NaN S
6 898 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q
7 899 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29.0000 NaN S
8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN C
9 901 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.1500 NaN S
10 902 3 Ilieff, Mr. Ylio male NaN 0 0 349220 7.8958 NaN S
11 903 1 Jones, Mr. Charles Cresson male 46.0 0 0 694 26.0000 NaN S
12 904 1 Snyder, Mrs. John Pillsbury (Nelle Stevenson) female 23.0 1 0 21228 82.2667 B45 S
13 905 2 Howard, Mr. Benjamin male 63.0 1 0 24065 26.0000 NaN S
14 906 1 Chaffee, Mrs. Herbert Fuller (Carrie Constance… female 47.0 1 0 W.E.P. 5734 61.1750 E31 S
15 907 2 del Carlo, Mrs. Sebastiano (Argenia Genovesi) female 24.0 1 0 SC/PARIS 2167 27.7208 NaN C
16 908 2 Keane, Mr. Daniel male 35.0 0 0 233734 12.3500 NaN Q
17 909 3 Assaf, Mr. Gerios male 21.0 0 0 2692 7.2250 NaN C
18 910 3 Ilmakangas, Miss. Ida Livija female 27.0 1 0 STON/O2. 3101270 7.9250 NaN S
19 911 3 Assaf Khalil, Mrs. Mariana (Miriam”)” female 45.0 0 0 2696 7.2250 NaN C
20 912 1 Rothschild, Mr. Martin male 55.0 1 0 PC 17603 59.4000 NaN C
21 913 3 Olsen, Master. Artur Karl male 9.0 0 1 C 17368 3.1708 NaN S
22 914 1 Flegenheim, Mrs. Alfred (Antoinette) female NaN 0 0 PC 17598 31.6833 NaN S
23 915 1 Williams, Mr. Richard Norris II male 21.0 0 1 PC 17597 61.3792 NaN C
24 916 1 Ryerson, Mrs. Arthur Larned (Emily Maria Borie) female 48.0 1 3 PC 17608 262.3750 B57 B59 B63 B66 C
25 917 3 Robins, Mr. Alexander A male 50.0 1 0 A/5. 3337 14.5000 NaN S
26 918 1 Ostby, Miss. Helene Ragnhild female 22.0 0 1 113509 61.9792 B36 C
27 919 3 Daher, Mr. Shedid male 22.5 0 0 2698 7.2250 NaN C
28 920 1 Brady, Mr. John Bertram male 41.0 0 0 113054 30.5000 A21 S
29 921 3 Samaan, Mr. Elias male NaN 2 0 2662 21.6792 NaN C
30 922 2 Louch, Mr. Charles Alexander male 50.0 1 0 SC/AH 3085 26.0000 NaN S
31 923 2 Jefferys, Mr. Clifford Thomas male 24.0 2 0 C.A. 31029 31.5000 NaN S
32 924 3 Dean, Mrs. Bertram (Eva Georgetta Light) female 33.0 1 2 C.A. 2315 20.5750 NaN S
33 925 3 Johnston, Mrs. Andrew G (Elizabeth Lily” Watson)” female NaN 1 2 W./C. 6607 23.4500 NaN S
34 926 1 Mock, Mr. Philipp Edmund male 30.0 1 0 13236 57.7500 C78 C
test2.isnull().sum().sort_values(ascending=False)

# Fare is still missing one value; this is our source data.
Cabin          327
Age             86
Fare             1
Embarked         0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
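# The 1094 computed earlier cannot be a fare (test1's Fare column had been overwritten with PassengerId values), so compute the third-class mean Fare from the freshly loaded test2 and fill the NaN with it, as a number rather than a string.
test2["Fare"]=test2["Fare"].fillna(test2.loc[test2["Pclass"]==3,"Fare"].mean())
test2.isnull().sum().sort_values(ascending=False)

# Fare filled successfully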
Cabin          327
Age             86
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
# Fill the missing Age values with the median as well, reusing the method from the training set.
test2["Age"]=test2["Age"].fillna(test2["Age"].median())
test2.isnull().sum().sort_values(ascending=False)

# Age filled successfully
Cabin          327
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Age              0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
# Check the mean of the Age column as a sanity check.
Age_mean = test2["Age"].mean()
Age_mean
29.599282296650717
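# Look at test2 after filling the missing values

test2.head(35)

# The rows that were missing Age (index 10/22/29/33) were all filled with 27, while Age_mean above is 29.6.
# (This looked odd at first; the fill used the median, which I had mistaken for the mean. Either works here. 2019.4.9)
# The filled 27 is close to the mean of 29.6, so for the sake of progress this is left as is for now. 2019.4.8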
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
5 897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.225 NaN S
6 898 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q
7 899 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29 NaN S
8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN C
9 901 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.15 NaN S
10 902 3 Ilieff, Mr. Ylio male 27.0 0 0 349220 7.8958 NaN S
11 903 1 Jones, Mr. Charles Cresson male 46.0 0 0 694 26 NaN S
12 904 1 Snyder, Mrs. John Pillsbury (Nelle Stevenson) female 23.0 1 0 21228 82.2667 B45 S
13 905 2 Howard, Mr. Benjamin male 63.0 1 0 24065 26 NaN S
14 906 1 Chaffee, Mrs. Herbert Fuller (Carrie Constance… female 47.0 1 0 W.E.P. 5734 61.175 E31 S
15 907 2 del Carlo, Mrs. Sebastiano (Argenia Genovesi) female 24.0 1 0 SC/PARIS 2167 27.7208 NaN C
16 908 2 Keane, Mr. Daniel male 35.0 0 0 233734 12.35 NaN Q
17 909 3 Assaf, Mr. Gerios male 21.0 0 0 2692 7.225 NaN C
18 910 3 Ilmakangas, Miss. Ida Livija female 27.0 1 0 STON/O2. 3101270 7.925 NaN S
19 911 3 Assaf Khalil, Mrs. Mariana (Miriam”)” female 45.0 0 0 2696 7.225 NaN C
20 912 1 Rothschild, Mr. Martin male 55.0 1 0 PC 17603 59.4 NaN C
21 913 3 Olsen, Master. Artur Karl male 9.0 0 1 C 17368 3.1708 NaN S
22 914 1 Flegenheim, Mrs. Alfred (Antoinette) female 27.0 0 0 PC 17598 31.6833 NaN S
23 915 1 Williams, Mr. Richard Norris II male 21.0 0 1 PC 17597 61.3792 NaN C
24 916 1 Ryerson, Mrs. Arthur Larned (Emily Maria Borie) female 48.0 1 3 PC 17608 262.375 B57 B59 B63 B66 C
25 917 3 Robins, Mr. Alexander A male 50.0 1 0 A/5. 3337 14.5 NaN S
26 918 1 Ostby, Miss. Helene Ragnhild female 22.0 0 1 113509 61.9792 B36 C
27 919 3 Daher, Mr. Shedid male 22.5 0 0 2698 7.225 NaN C
28 920 1 Brady, Mr. John Bertram male 41.0 0 0 113054 30.5 A21 S
29 921 3 Samaan, Mr. Elias male 27.0 2 0 2662 21.6792 NaN C
30 922 2 Louch, Mr. Charles Alexander male 50.0 1 0 SC/AH 3085 26 NaN S
31 923 2 Jefferys, Mr. Clifford Thomas male 24.0 2 0 C.A. 31029 31.5 NaN S
32 924 3 Dean, Mrs. Bertram (Eva Georgetta Light) female 33.0 1 2 C.A. 2315 20.575 NaN S
33 925 3 Johnston, Mrs. Andrew G (Elizabeth Lily” Watson)” female 27.0 1 2 W./C. 6607 23.45 NaN S
34 926 1 Mock, Mr. Philipp Edmund male 30.0 1 0 13236 57.75 C78 C
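# After reviewing the candidate factors in the data set,
# we keep "Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked" as potential factors
# and drop "PassengerId", "Ticket" and "Cabin", which are either mostly missing or irrelevant.

# Correlations between the factors.
# numeric_only=True is required from pandas 2.0; versions before 1.5 do not accept the argument and skip string columns on their own.
train2[["Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]].corr(method="pearson", numeric_only=True)

# Strongly correlated with Survived: none
# Weakly correlated with Survived: Pclass and Fare. Cabin class and ticket price: socio-economic status does affect the chance of survival.
# Very weakly or not correlated with Survived: "Age","SibSp","Parch". In the film it is clearly women and children first, so these three deserve a closer look.
# "Sex" and "Embarked" are strings, not numbers, so the Pearson table leaves them out; we handle those two another way.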
Survived Pclass Age SibSp Parch Fare
Survived 1.000000 -0.338481 -0.064910 -0.035322 0.081629 0.257307
Pclass -0.338481 1.000000 -0.339898 0.083081 0.018443 -0.549500
Age -0.064910 -0.339898 1.000000 -0.233296 -0.172482 0.096688
SibSp -0.035322 0.083081 -0.233296 1.000000 0.414838 0.159651
Parch 0.081629 0.018443 -0.172482 0.414838 1.000000 0.216225
Fare 0.257307 -0.549500 0.096688 0.159651 0.216225 1.000000
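# A quick reference (from a MOOC course)
# Pearson correlation coefficient

# r ranges over [-1, 1]; the bands below apply to the absolute value |r|
# 0.8-1.0 very strong correlation
# 0.6-0.8 strong correlation
# 0.4-0.6 moderate correlation
# 0.2-0.4 weak correlation
# 0-0.2 very weak or no correlation

# For more detail, see a probability and statistics textbook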
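To make the coefficient less of a black box, here is a small sketch that computes r from its definition (covariance divided by the product of the standard deviations) and labels it with the bands listed above. The helper names `pearson_r` and `strength` and the six sample rows are made up for illustration:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r from the definition: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def strength(r):
    """Map |r| to the bands quoted in the reference table."""
    a = abs(r)
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak or none"

# Six invented passengers: cabin class and survival flag.
pclass = [3, 1, 3, 1, 3, 2]
survived = [0, 1, 1, 1, 0, 0]

r = pearson_r(pclass, survived)
print(round(r, 3), strength(r))
```

The sign only gives the direction of the relationship; the bands are read off the absolute value.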
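# About "SibSp" and "Parch": one counts siblings and spouses, the other parents and children. We treat both as companions and add them together into a new factor.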

train2["Family"]=train2["SibSp"]+train2["Parch"]
train2.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C 1
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0
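# For "Age", bin the values into age groups.

# 1 婴幼儿童 (child): 0-11
# 2 少年 (teen): 12-18
# 3 青年 (young adult): 19-35
# 4 中年 (middle-aged): 36-52
# 5 中老年 (older adult): 53-65
# 6 老年 (senior): 66 and over

# Binning approach found by searching online for numeric grouping in Python:
# Bin edges. Put the lowest edge below the minimum and the highest edge above the maximum, because pandas bins are half-open: left-open/right-closed by default, or left-closed/right-open with right=False
# (see any reference on open and closed intervals)
bins=[-1,12,19,36,53,66,150]

# Labels for the bins: -1 to 11 is 婴幼儿童 (child), 12 to 18 is 少年 (teen), and so on
labels=['婴幼儿童','少年','青年','中年','中老年','老年']

# Bin with pandas cut; right=False means left-closed/right-open, omitting right means left-open/right-closed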
train2['age_group']=pd.cut(
        train2['Age'],
        bins,
        right=False,
        labels=labels)

train2.head(5)

# Source for the binning approach: https://blog.csdn.net/qq_35990702/article/details/82313055
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family age_group
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 青年
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C 1 中年
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 青年
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1 青年
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0 青年
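The `right` argument only matters for ages that fall exactly on a bin edge. A short sketch comparing both settings on a few boundary ages; the English labels are stand-ins for the Chinese ones used in the notebook, and `ages`, `left_closed` and `right_closed` are illustrative names:

```python
import pandas as pd

ages = pd.Series([11.5, 12.0, 18.0, 19.0, 35.0, 36.0])
bins = [-1, 12, 19, 36, 53, 66, 150]
labels = ["child", "teen", "young adult", "middle-aged", "older adult", "senior"]

# right=False: bins are [a, b), so an age of exactly 12 starts the "teen" bin.
left_closed = pd.cut(ages, bins, right=False, labels=labels)
# Default right=True: bins are (a, b], so an age of exactly 12 still counts as "child".
right_closed = pd.cut(ages, bins, labels=labels)

print(pd.DataFrame({"age": ages, "right=False": left_closed, "default": right_closed}))
```

With the age ranges listed above (0-11, 12-18, ...), `right=False` is the setting that matches the intended groups.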
# To make correlation analysis possible, map the labels ['婴幼儿童','少年','青年','中年','中老年','老年'] to the numbers [1,2,3,4,5,6]

train2['age_group0'] = train2['age_group'].map({'婴幼儿童': 1, '少年': 2,'青年':3,'中年':4,'中老年':5,'老年':6}).astype(int)
train2.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family age_group age_group0
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 青年 3
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C 1 中年 4
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 青年 3
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1 青年 3
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0 青年 3
# For "Sex", likewise map the categories to 1 and 2

train2['Sex0'] = train2['Sex'].map({'female': 1, 'male': 2}).astype(int)
train2.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family age_group age_group0 Sex0
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 青年 3 2
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C 1 中年 4 1
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 青年 3 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1 青年 3 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0 青年 3 2
# For "Embarked", do the same as for Sex and turn the categories into numbers.
# First see how many distinct values "Embarked" has
train2['Embarked'].value_counts()
S    646
C    168
Q     77
Name: Embarked, dtype: int64
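# Map the Embarked categories to numbers. Note that the mapping below sends S to 1 and both C and Q to 2, so the column effectively encodes S versus non-S rather than three separate codes (1/2/3)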
train2["Embarked0"] = train2["Embarked"].map({'S': 1, 'C': 2, 'Q': 2}).astype(int)
train2.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family age_group age_group0 Sex0 Embarked0
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 青年 3 2 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C 1 中年 4 1 2
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 青年 3 1 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1 青年 3 1 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0 青年 3 2 1
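Integer codes impose an order on the ports that does not really exist. One common alternative, which is not what this notebook does, is one-hot encoding with `pd.get_dummies`. A minimal sketch on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 indicator column per port, so no artificial ordering is introduced.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

For a linear model such as logistic regression, one of the indicator columns is usually dropped to avoid redundancy with the intercept.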
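# Now run the correlation analysis again
# Substitutions: "Age" -> age_group0; "SibSp" and "Parch" -> Family; "Sex" -> Sex0; "Embarked" -> Embarked0
train2[["Survived","Pclass","Sex0","age_group0","Family","Fare","Embarked0"]].corr(method="pearson")

# This time Sex0 shows the largest correlation with Survived (|r| = 0.54, moderate on the scale quoted earlier), ahead of socio-economic status (cabin class and fare).
# By this analysis, Family size and age_group have little linear effect on survival.
# (Some expert posts say family size and age matter a lot for individual survival, so these two stay open questions to revisit once I have learned more.)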
Survived Pclass Sex0 age_group0 Family Fare Embarked0
Survived 1.000000 -0.338481 -0.543351 -0.086879 0.016639 0.257307 0.149683
Pclass -0.338481 1.000000 0.131900 -0.308349 0.065997 -0.549500 -0.074053
Sex0 -0.543351 0.131900 1.000000 0.095705 -0.200988 -0.182333 -0.119224
age_group0 -0.086879 -0.308349 0.095705 1.000000 -0.293598 0.077438 0.002818
Family 0.016639 0.065997 -0.200988 -0.293598 1.000000 0.217138 -0.077359
Fare 0.257307 -0.549500 -0.182333 0.077438 0.217138 1.000000 0.162184
Embarked0 0.149683 -0.074053 -0.119224 0.002818 -0.077359 0.162184 1.000000
# Couldn't resist a look at Family.
train2[["Family","Survived"]].groupby("Family",as_index=False).mean().sort_values(by="Survived",ascending=False)

# Passengers with 1 to 3 family members aboard all have survival rates above 50%, far higher than the other groups.
Family Survived
3 3 0.724138
2 2 0.578431
1 1 0.552795
6 6 0.333333
0 0 0.303538
4 4 0.200000
5 5 0.136364
7 7 0.000000
8 10 0.000000
# Look at age_group the same way as Family
train2[["age_group","Survived"]].groupby("age_group",as_index=False).mean().sort_values(by="Survived",ascending=False)

# Children and teens (ages 0-18) survive at relatively high rates, both above 40%, while seniors survive at under 13%.
age_group Survived
0 婴幼儿童 0.573529
1 少年 0.436620
3 中年 0.397590
4 中老年 0.372093
2 青年 0.353271
5 老年 0.125000
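# Feature engineering
# Feature engineering is said to be the single biggest influence on final prediction accuracy, more so than the choice of algorithm.

# The correlation analysis above gives Fare and Pclass a coefficient of -0.55 (fares fall as the class number rises), the strongest relationship between any two predictors, so the two should be treated together.
# As a beginner, for speed and simplicity, the blunt approach is to keep just one of them, Pclass (the one more strongly correlated with Survived), and drop the other.

# From the corr analysis, the factors most closely tied to survival are: Sex0 > Pclass > Embarked0 (with Fare set aside)
# From the two-way breakdowns, passengers with 1-3 family members, or in the child and teen age groups (0-18), survive at higher rates.
# Features derived on the training set must be derived the same way on the test set.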
test2["Family"]=test2["SibSp"]+test2["Parch"]
test2.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 0
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7 NaN S 1
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 0
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 0
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 2
bins=[-1,12,19,36,53,66,150]
 
# Labels for the bins: -1 to 11 is 婴幼儿童 (child), 12 to 18 is 少年 (teen), and so on
labels=['婴幼儿童','少年','青年','中年','中老年','老年']

test2['age_group']=pd.cut(
        test2['Age'],
        bins,
        right=False,
        labels=labels)

test2.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family age_group
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 0 青年
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7 NaN S 1 中年
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 0 中老年
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 0 青年
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 2 青年
test2['age_group0'] = test2['age_group'].map({'婴幼儿童': 1, '少年': 2,'青年':3,'中年':4,'中老年':5,'老年':6}).astype(int)
test2.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family age_group age_group0
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 0 青年 3
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7 NaN S 1 中年 4
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 0 中老年 5
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 0 青年 3
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 2 青年 3
test2['Sex0'] = test2['Sex'].map({'female': 1, 'male': 2}).astype(int)
test2.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family age_group age_group0 Sex0
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 0 青年 3 2
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7 NaN S 1 中年 4 1
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 0 中老年 5 2
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 0 青年 3 2
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 2 青年 3 1
test2["Embarked0"] = test2["Embarked"].map({'S': 1, 'C': 2, 'Q': 2}).astype(int)
test2.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family age_group age_group0 Sex0 Embarked0
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 0 青年 3 2 2
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7 NaN S 1 中年 4 1 1
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 0 中老年 5 2 2
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 0 青年 3 2 1
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 2 青年 3 1 1
# Model building and evaluation
# Split into training and test data
x_train = train2[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
y_train =train2["Survived"]
x_test = test2[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
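# Logistic regression
from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
# Train the model
Classifier1.fit(x_train,y_train)
# Predict
Y1_prediction = Classifier1.predict(x_test)
# Evaluate the model
score_Logit = Classifier1.score(x_train,y_train)
score_Logit

# Encouraging: accuracy on the training set is 0.805. This is measured on the same data the model was fitted to, so it is not directly comparable to the official sample submission's leaderboard score of 0.766.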
0.8047138047138047
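Scoring on the training data tends to flatter the model. A sketch of one common way to get a less optimistic estimate, k-fold cross-validation with scikit-learn's `cross_val_score`; the data here is synthetic and stands in for x_train / y_train, so the numbers it prints say nothing about the Titanic model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train (assumption: 200 rows, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression()
# Accuracy on the very rows the model was fitted to.
train_acc = model.fit(X, y).score(X, y)
# Accuracy on held-out folds: each fold is scored by a model that never saw it.
fold_scores = cross_val_score(model, X, y, cv=5)
cv_acc = fold_scores.mean()
print(train_acc, cv_acc)
```

The cross-validated figure is the one to compare against a leaderboard score.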
Classifier1.coef_
array([[-0.74124862,  0.00496064, -0.17449562, -0.33260786, -2.28141631,
         0.58615422]])
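The coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. A small sketch using the values printed above, paired with the column order of x_train (the pairing assumes `coef_` follows that column order, which is how scikit-learn reports it):

```python
import numpy as np

# Column order of x_train and the coefficients printed by Classifier1.coef_.
features = ["Pclass", "Fare", "Family", "age_group0", "Sex0", "Embarked0"]
coef = np.array([-0.74124862, 0.00496064, -0.17449562,
                 -0.33260786, -2.28141631, 0.58615422])

# exp(coefficient) = factor by which the odds of survival change
# when that feature increases by one unit, other features held fixed.
odds_ratio = np.exp(coef)
for name, c, o in zip(features, coef, odds_ratio):
    print(f"{name:<11} coef={c:+.3f}  odds ratio={o:.3f}")
```

Read this way, moving Sex0 from 1 (female) to 2 (male) multiplies the odds of survival by roughly 0.1. The raw magnitudes are not comparable across features measured on different scales, Fare in particular.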
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.head(10)
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 0
5 897 0
6 898 1
7 899 0
8 900 1
9 901 0
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final1.csv",index=False)
# submission score 0.75598, below the sample's 0.766. More work needed: tune the parameters or try other algorithms.
# Drop the "Fare" factor and try logistic regression again.
x_train1 = train2[["Pclass","Family","age_group0","Sex0","Embarked0"]]
y_train1 =train2["Survived"]
x_test1 = test2[["Pclass","Family","age_group0","Sex0","Embarked0"]]
from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
# Train the model
Classifier1.fit(x_train1,y_train1)
# Predict
Y1_prediction = Classifier1.predict(x_test1)
# Evaluate the model
score_Logit = Classifier1.score(x_train1,y_train1)
score_Logit

# Training accuracy is 0.799, 0.006 lower than the 0.805 from the previous run.
# No matter; submit it to the competition and see.
0.7991021324354658
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final2.csv",index=False)
# submission score 0.75598, the same as the previous result.
x_train2 = train2[["Pclass","Family","age_group0","Sex0"]]
y_train2 =train2["Survived"]
x_test2 = test2[["Pclass","Family","age_group0","Sex0"]]
from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
# Train the model
Classifier1.fit(x_train2,y_train2)
# Predict
Y1_prediction = Classifier1.predict(x_test2)
# Evaluate the model
score_Logit = Classifier1.score(x_train2,y_train2)
score_Logit

# Training accuracy is 0.806, 0.001 above the earlier best of 0.805: a small step forward.
0.8058361391694725
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final3.csv",index=False)
# submission score 0.77511, finally a bit above the official sample's 0.766.
# Reduce to three variables.
x_train3 = train2[["Pclass","Family","Sex0"]]
y_train3 =train2["Survived"]
x_test3 = test2[["Pclass","Family","Sex0"]]

from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
# Train the model
Classifier1.fit(x_train3,y_train3)
# Predict
Y1_prediction = Classifier1.predict(x_test3)
# Evaluate the model
score_Logit = Classifier1.score(x_train3,y_train3)
score_Logit
0.8002244668911336
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final4.csv",index=False)

# submission score 0.77033.
x_train4 = train2[["Pclass","age_group0","Sex0"]]
y_train4 =train2["Survived"]
x_test4 = test2[["Pclass","age_group0","Sex0"]]

from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
# Train the model
Classifier1.fit(x_train4,y_train4)
# Predict
Y1_prediction = Classifier1.predict(x_test4)
# Evaluate the model
score_Logit = Classifier1.score(x_train4,y_train4)
score_Logit
0.8002244668911336
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final5.csv",index=False)

# submission score 0.76076, lower again.
# Finally, try two factors.
x_train5 = train2[["Pclass","Sex0"]]
y_train5 =train2["Survived"]
x_test5 = test2[["Pclass","Sex0"]]

from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
# Train the model
Classifier1.fit(x_train5,y_train5)
# Predict
Y1_prediction = Classifier1.predict(x_test5)
# Evaluate the model
score_Logit = Classifier1.score(x_train5,y_train5)
score_Logit
0.7867564534231201
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final6.csv",index=False)

# submission score 0.76555, essentially level with the sample's 0.766.
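The six runs above repeat the same fit-and-score steps with different column lists. A sketch of how that could be folded into one loop; the frame `toy` is synthetic (its survival rule is invented so that there is something to learn), and `feature_sets` and `scores` are hypothetical names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for train2; the column names mirror the notebook's features.
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),   # 1, 2 or 3
    "Sex0": rng.integers(1, 3, size=n),     # 1 = female, 2 = male
    "Family": rng.integers(0, 5, size=n),
})
# Invented rule, for illustration only.
toy["Survived"] = ((toy["Sex0"] == 1) & (toy["Pclass"] < 3)).astype(int)

feature_sets = {
    "3 features": ["Pclass", "Family", "Sex0"],
    "2 features": ["Pclass", "Sex0"],
}

scores = {}
for name, cols in feature_sets.items():
    clf = LogisticRegression().fit(toy[cols], toy["Survived"])
    scores[name] = clf.score(toy[cols], toy["Survived"])   # training accuracy
print(scores)
```

Collecting the scores in one dict makes the comparison between feature sets a single table instead of six separate cells.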
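# That ends the logistic regression experiments. Next, try other algorithms.
# First save the cleaned train2 and test2 as CSV so the next analysis can load them directly.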

train2.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/train2.csv",index=False)
test2.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/test2.csv",index=False)



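Copyright notice: this is an original article by weixin_44216391, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.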