Titanic survival prediction was the first Kaggle getting-started competition I tried as a complete beginner. I mainly referenced these two tutorials:
https://www.cnblogs.com/star-zhao/p/9801196.html
https://zhuanlan.zhihu.com/p/30538352
The model's best score on the leaderboard was 0.79904, placing in the top 13%.
I did this competition quite a while ago and have forgotten many details of the analysis; since it was my first attempt, the whole pipeline is fairly crude. Today, on a whim, I'm writing it up as a simple record (more of a running log than a tutorial).
Import the required packages:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
```
Read the training and test sets and concatenate them, so the feature engineering is applied to both at once:
```python
train_raw = pd.read_csv('datasets/train.csv')
test_raw = pd.read_csv('datasets/test.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
train_test = pd.concat([train_raw, test_raw], ignore_index=True, sort=False)
```
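Before engineering any features, it helps to see which columns actually need filling. A quick check (not part of the original post) might look like:

```python
# Count missing values per column in the combined frame;
# Age, Fare, Cabin and Embarked are the ones that need attention.
# (Survived is NaN for the 418 test rows by construction.)
print(train_test.isnull().sum())
```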
The title in a passenger's name reflects, to some degree, their sex, age, identity, and social standing, so it is an important feature that shouldn't be ignored. We first extract the title from the Name field with a regular expression, then consolidate the categories:
Mr and Don denote men
Miss, Ms, and Mlle denote unmarried women
Mrs, Mme, Lady, and Dona denote married women
Countess and Jonkheer are titles of nobility
Capt, Col, Dr, Major, and Sir are rare titles grouped into an "other" category
```python
# Extract the title (the word immediately before a period) from the Name field,
# then consolidate rare titles into the categories listed above
train_test['Title'] = train_test['Name'].apply(lambda x: re.search(r'(\w+)\.', x).group(1))
train_test['Title'].replace(['Don'], 'Mr', inplace=True)
train_test['Title'].replace(['Mlle', 'Ms'], 'Miss', inplace=True)
train_test['Title'].replace(['Mme', 'Lady', 'Dona'], 'Mrs', inplace=True)
train_test['Title'].replace(['Countess', 'Jonkheer'], 'Noble', inplace=True)
train_test['Title'].replace(['Capt', 'Col', 'Dr', 'Major', 'Sir'], 'Other', inplace=True)
```
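Titles not listed above (notably Master, and Rev) simply keep their own categories, since the replacements never touch them. A quick sanity check, not in the original write-up, confirms nothing unexpected slips through:

```python
# Print the consolidated title categories and their frequencies
print(train_test['Title'].value_counts())
```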
One-hot encode the title categories:
```python
title_onehot = pd.get_dummies(train_test['Title'], prefix='Title')
train_test = pd.concat([train_test, title_onehot], axis=1)
```
One-hot encode Sex the same way:
```python
sex_onehot = pd.get_dummies(train_test['Sex'], prefix='Sex')
train_test = pd.concat([train_test, sex_onehot], axis=1)
```
Combine SibSp and Parch into a single feature representing family size, since the analysis showed that passengers traveling with relatives had a higher survival rate than those traveling alone (a quick check follows the code below).
```python
train_test['FamilySize'] = train_test['SibSp'] + train_test['Parch'] + 1
```
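As a rough check of that claim (a sketch, not part of the original post), we can look at the survival rate by family size on the labeled rows:

```python
# Mean survival rate per family size; only the first 891 rows carry labels
print(train_test[:891].groupby('FamilySize')['Survived'].mean())
```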
Fill the missing Embarked values with the mode, then one-hot encode:
```python
train_test['Embarked'].fillna(train_test['Embarked'].mode()[0], inplace=True)
embarked_onehot = pd.get_dummies(train_test['Embarked'], prefix='Embarked')
train_test = pd.concat([train_test, embarked_onehot], axis=1)
```
Cabin has too many missing values to impute, so for now we just use whether a cabin is recorded as the feature:
```python
train_test['Cabin'].fillna('NO', inplace=True)
train_test['Cabin'] = np.where(train_test['Cabin'] == 'NO', 'NO', 'YES')
cabin_onehot = pd.get_dummies(train_test['Cabin'], prefix='Cabin')
train_test = pd.concat([train_test, cabin_onehot], axis=1)
```
Fill the missing Fare value with the mean fare of the corresponding passenger class:
```python
train_test['Fare'].fillna(train_test.groupby('Pclass')['Fare'].transform('mean'), inplace=True)
```
Some fares are for group tickets, so we spread each fare evenly across everyone sharing the same ticket:
```python
# Number of passengers sharing each ticket
shares = train_test.groupby('Ticket')['Fare'].transform('count')
train_test['Fare'] = train_test['Fare'] / shares
```
Bin the fares:
```python
train_test.loc[train_test['Fare'] < 5, 'Fare'] = 0
train_test.loc[(train_test['Fare'] >= 5) & (train_test['Fare'] < 10), 'Fare'] = 1
train_test.loc[(train_test['Fare'] >= 10) & (train_test['Fare'] < 15), 'Fare'] = 2
train_test.loc[(train_test['Fare'] >= 15) & (train_test['Fare'] < 30), 'Fare'] = 3
train_test.loc[(train_test['Fare'] >= 30) & (train_test['Fare'] < 60), 'Fare'] = 4
train_test.loc[(train_test['Fare'] >= 60) & (train_test['Fare'] < 100), 'Fare'] = 5
train_test.loc[train_test['Fare'] >= 100, 'Fare'] = 6
```
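The same binning can be written more compactly with pd.cut. This is an equivalent sketch using the same hand-picked edges, not what the original post did; run one version or the other, not both:

```python
# Left-closed/right-open bins matching the .loc chain above, labeled 0-6
fare_bins = [0, 5, 10, 15, 30, 60, 100, np.inf]
train_test['Fare'] = pd.cut(train_test['Fare'], bins=fare_bins,
                            labels=range(7), right=False).astype(float)
```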
Using shares, construct another feature that separates passengers who bought group tickets from those who bought individual tickets:
```python
train_test['GroupTicket'] = np.where(shares == 1, 'NO', 'YES')
group_ticket_onehot = pd.get_dummies(train_test['GroupTicket'], prefix='GroupTicket')
train_test = pd.concat([train_test, group_ticket_onehot], axis=1)
```
Age has too many missing values for a simple mean or median fill to be appropriate. Instead, we predict the missing ages from the other features with machine learning:
```python
missing_age_df = pd.DataFrame(train_test[['Age', 'Parch', 'Sex', 'SibSp', 'FamilySize',
                                          'Title', 'Fare', 'Pclass', 'Embarked']])
missing_age_df = pd.get_dummies(missing_age_df, columns=['Title', 'FamilySize', 'Sex', 'Pclass', 'Embarked'])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()].copy()

def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
    # Model 1: gradient boosting regressor
    gbm_reg = GradientBoostingRegressor(n_estimators=100, max_depth=3, learning_rate=0.01,
                                        max_features=3, random_state=42)
    gbm_reg.fit(missing_age_X_train, missing_age_Y_train)
    missing_age_test['Age_GB'] = gbm_reg.predict(missing_age_X_test)
    # Model 2: linear regression (the `normalize` argument was removed in scikit-learn 1.2)
    lrf_reg = LinearRegression(fit_intercept=True)
    lrf_reg.fit(missing_age_X_train, missing_age_Y_train)
    missing_age_test['Age_LRF'] = lrf_reg.predict(missing_age_X_test)
    # Element-wise average of the two predictions (axis=0; without it np.mean collapses to a scalar)
    missing_age_test['Age'] = np.mean([missing_age_test['Age_GB'], missing_age_test['Age_LRF']], axis=0)
    return missing_age_test['Age']

train_test.loc[train_test['Age'].isnull(), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)
```
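A quick sanity check (not in the original post) that the imputed values are plausible before binning:

```python
# The filled-in ages should fall within a realistic range
print(train_test['Age'].describe())
```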
Bin the ages:
```python
train_test.loc[train_test['Age'] < 9, 'Age'] = 0
train_test.loc[(train_test['Age'] >= 9) & (train_test['Age'] < 18), 'Age'] = 1
train_test.loc[(train_test['Age'] >= 18) & (train_test['Age'] < 27), 'Age'] = 2
train_test.loc[(train_test['Age'] >= 27) & (train_test['Age'] < 36), 'Age'] = 3
train_test.loc[(train_test['Age'] >= 36) & (train_test['Age'] < 45), 'Age'] = 4
train_test.loc[(train_test['Age'] >= 45) & (train_test['Age'] < 54), 'Age'] = 5
train_test.loc[(train_test['Age'] >= 54) & (train_test['Age'] < 63), 'Age'] = 6
train_test.loc[(train_test['Age'] >= 63) & (train_test['Age'] < 72), 'Age'] = 7
train_test.loc[(train_test['Age'] >= 72) & (train_test['Age'] < 81), 'Age'] = 8
train_test.loc[train_test['Age'] >= 81, 'Age'] = 9
```
Save the test-set PassengerId values for the submission file:
```python
passengerId_test = train_test['PassengerId'][891:]
```
Drop the columns we no longer need:
```python
train_test.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Title', 'Sex', 'Embarked',
                 'Cabin', 'Ticket', 'GroupTicket'], axis=1, inplace=True)
```
Split back into training and test sets:
```python
train = train_test[:891]
test = train_test[891:]
X_train = train.drop(['Survived'], axis=1)
y_train = train['Survived']
X_test = test.drop(['Survived'], axis=1)
```
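GridSearchCV is imported at the top but the tuning runs aren't shown; hyperparameters like those in the next block would typically come from a search along these lines (a hypothetical sketch, with illustrative grid values rather than the ones actually used):

```python
# Hypothetical grid search for the random forest; grid values are examples only
param_grid = {'max_depth': [3, 5, 7], 'min_samples_split': [8, 13, 20]}
grid = GridSearchCV(RandomForestClassifier(n_estimators=500), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```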
Train a random forest, an extra-trees classifier, and a gradient boosting classifier, then combine them into the final model with a soft-voting VotingClassifier:
```python
rf = RandomForestClassifier(n_estimators=500, max_depth=5, min_samples_split=13)
et = ExtraTreesClassifier(n_estimators=500, max_depth=7, min_samples_split=8)
gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.0135)
voting = VotingClassifier(estimators=[('rf', rf), ('et', et), ('gbm', gbm)], voting='soft')
voting.fit(X_train, y_train)
```
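For a rough local estimate of the ensemble before submitting, cross-validation on the training set works (a sketch, not in the original post; the 0.79904 leaderboard score remains the authoritative number):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the voting ensemble
scores = cross_val_score(voting, X_train, y_train, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```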
Predict and write the submission file:
```python
y_predict = voting.predict(X_test)
submission = pd.DataFrame({'PassengerId': passengerId_test, 'Survived': y_predict.astype(np.int32)})
submission.to_csv('submission.csv', index=False)
```