The sinking of the Titanic ship is one of the most infamous accidents in history. On April 15, 1912, Titanic sank after colliding with an iceberg resulting in the death of nearly 60% of the passengers. In this exercise, we will compare various machine learning models that could predict the survival rate using passenger data (like name, title, ticket number, gender, class, etc.)
setwd("C:/Users/khannva1/Documents/DS-Projects/Titanic")
# This function calculates the confusion matrix
cm <- function(model, data, target){
confusionMatrix(predict(model, newdata = data), target)
}
# Function to split the continous varibale into bins
custom_cut <- function(x, lower = 0, upper, by = 10,
sep = "-", above.char = "+") {
labs <- c(paste(seq(lower, upper - by, by = by),
seq(lower + by - 1, upper - 1, by = by),
sep = sep),
paste(upper, above.char, sep = ""))
cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
right = FALSE, labels = labs)
}
'%ni%' <- Negate('%in%')
I will import libraries such as dplyr, tidyr, caret, gridExtra and purrr.
library("dplyr", quietly = T)
library("tidyr", quietly = T)
library("readr", quietly = T)
library("kableExtra", quietly = T)
library("caret", quietly = T)
library("stringr", quietly = T)
library("purrr", quietly = T)
library("gridExtra")
train <- suppressMessages(read_csv("C:\\Users\\khannva1\\Documents\\DS-Projects\\Titanic\\train.csv"))
test <- suppressMessages(read_csv("C:\\Users\\khannva1\\Documents\\DS-Projects\\Titanic\\test.csv"))
Exploratory data analysis refers to exploring the data numerically and graphically.
eda <- train
eda %>% head() %>%
kable() %>% kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F, position = "center")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NA | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NA | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NA | S |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q |
eda %>% group_by(Survived) %>%
summarise(count = n(), percent = count/nrow(train)) %>%
kable() %>% kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F, position = "center")
Survived | count | percent |
---|---|---|
0 | 549 | 0.6161616 |
1 | 342 | 0.3838384 |
There are 891 rows and 12 columns (features) in the training data set and 418 rows and 11 features in the test set. the percentage of the survived and not survived passengers is 38.3% and 61.4% respectively. Types of features are given the table below
col_type <- data.frame(Features = names(eda), Defination = c("Id", "Survival", "Ticket Class", "Name", "Sex", "Age", "Number of siblings/spouses", "Number of parents/childern", "Ticket number", "Passenger fare", "Cabin number","Boarding port"), Type = c("Numeric", "0 = No, 1 = Yes", "1 = 1st, 2 = 2nd, 3 = 3rd", "Character","Male, Female","Numeric","Numeric","Numeric","Character","Numeric","Character","C = Cherbourg, Q = Queenstown, S = Southampton"))
col_type %>%
kable() %>% kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F, position = "center")
Features | Defination | Type |
---|---|---|
PassengerId | Id | Numeric |
Survived | Survival | 0 = No, 1 = Yes |
Pclass | Ticket Class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Name | Name | Character |
Sex | Sex | Male, Female |
Age | Age | Numeric |
SibSp | Number of siblings/spouses | Numeric |
Parch | Number of parents/childern | Numeric |
Ticket | Ticket number | Character |
Fare | Passenger fare | Numeric |
Cabin | Cabin number | Character |
Embarked | Boarding port | C = Cherbourg, Q = Queenstown, S = Southampton |
Age and Fare are numeric continuous variables and let us look at them based on the target variable.
eda$Survived <- as.factor(eda$Survived)
p1 <- eda %>% drop_na(Age) %>% ggplot(aes(y = Age, x= Survived, fill = Survived)) + geom_boxplot() +
ggtitle(label = "Distribution of Ages of the Passengers") +
ylab(label = "Age") + labs(fill = "Survived") + coord_flip()
p2 <- eda %>% drop_na(Fare) %>% ggplot(aes(y = Fare, x = Survived, fill = Survived)) + geom_boxplot() +
ggtitle(label = "Distribution of Fare of the Passengers") +
ylab(label = "Fare") + labs(fill = "Survived") + coord_flip()
grid.arrange(p1, p2, nrow=2)
Age: We note that most of the passengers were young from 20 to 40 years old in both the categories. The median age in the training dataset is 28 and the mean is 29.7 years.
Fare: Also, the fare shows that most of the passengers paid less than 100 Pounds per ticket. Further, it seems like there is a higher chance of survival if passengers have paid more.
p3 <- eda %>% drop_na(Pclass) %>% group_by(Pclass, Survived) %>%
summarize(Percent = round(n()/nrow(.),2)) %>%
ggplot(aes(x = Pclass, y = Percent, fill = Survived)) + geom_bar(stat = "identity") +
ggtitle(label = "Ticket classes") +
labs(fill = "Survived")
p4 <- eda %>% drop_na(Sex) %>% group_by(Sex, Survived) %>%
summarize(Percent = round(n()/nrow(.),2)) %>%
ggplot(aes(x = Sex, y = Percent, fill = Survived)) + geom_bar(stat = "identity") +
ggtitle(label = "Gender") +
labs(fill = "Survived")
p5 <- eda %>% drop_na(SibSp) %>% group_by(SibSp, Survived) %>%
summarize(Percent = round(n()/nrow(.),2)) %>%
ggplot(aes(x = SibSp, y = Percent, fill = Survived)) + geom_bar(stat = "identity") +
ggtitle(label = "Siblings or spouses") +
labs(fill = "Survived")
p6 <- eda %>% drop_na(Parch) %>% group_by(Parch, Survived) %>%
summarize(Percent = round(n()/nrow(.),2)) %>%
ggplot(aes(x = Parch, y = Percent, fill = Survived)) + geom_bar(stat = "identity") +
ggtitle(label = "Parents or childern") +
labs(fill = "Survived")
p7 <- eda %>% drop_na(Embarked) %>% group_by(Embarked, Survived) %>%
summarize(Percent = round(n()/nrow(.),2)) %>%
ggplot(aes(x = Embarked, y = Percent, fill = Survived)) + geom_bar(stat = "identity") +
ggtitle(label = "Boarding stations") +
labs(fill = "Survived")
grid.arrange(p3, p4, p5, p6, p7, nrow=3)
Pclass: It looks like first-class passengers had a higher rate of survival. It is not an unexpected finding as people with first-class tickets would have been given preference over people with lower-class tickets.
Gender: High survival chance if the passenger was female. It makes sense because females and children would have been given preference during evacuation.
Number of Siblings or Spouse: Survival rate is high for passengers with Siblings or Spouses. It may be due to the fact that more trusted people you have onboard more the chance of getting reliable information. Reliable information is critical for survival.
Number of Parents or children: Similarly, the survival rate is higher for passengers with Parents or children. I think the same logic applies as above.
Boarding station: It turns out that people who board form Cherbourg and Queenstown station have a higher rate of survival. It will be interesting to dig deeper and see if the majority of the people who board from these two stations have first- or second-class tickets. In case that is true, it will help explain the observation.
not_cq <- eda %>% filter(Embarked %ni% c('C','Q')) %>%
group_by(Pclass)%>% summarise(n = n()) %>%
mutate(percent = n/sum(n))
cq <- eda %>% filter(Embarked %in% c('C','Q')) %>%
group_by(Pclass)%>% summarise(n = n()) %>%
mutate(percent = n/sum(n))
not_cq$Embarked <- 'notFromCQ'
cq$Embarked <- 'FromCQ'
rbind(not_cq,cq) %>% ggplot(aes(x = Pclass, y = percent, fill = Embarked)) +
geom_bar(stat = "identity", position = "dodge")
Indeed we find that there are more first-class passengers from Cherbourg and Queenstown stations.
eda$Pclass <- as.factor(eda$Pclass)
p8 <- eda %>% drop_na(Age) %>% ggplot(aes(x = Pclass, y = Age)) + geom_boxplot(aes(fill = Survived))
p9 <- eda %>% drop_na(Fare) %>% ggplot(aes(x = Pclass, y = Fare)) + geom_boxplot(aes(fill = Survived))
grid.arrange(p8, p9, nrow = 2)
Combine the train and test dataset into one dataset for easier imputation of missing values. Create an isTrain binary logical variable to distinguish between train and test set for later use.
train <- train %>% mutate(isTrain = TRUE)
survived <- train %>% pull(Survived)
train <- train %>% select(-c(Survived))
test <- test %>% mutate(isTrain = FALSE)
df <- bind_rows(train, test)
A quick investigation of the combined dataset shows that there are missing values in Age, Cabin, Embarked, Fare. Cabin has a large number of missing values (~77%) followed by Age (~20%), 2 values in Embarked and 1 in Fare.
missing.values <- df %>% gather(key = "key", value = "val") %>% mutate(is.missing = is.na(val)) %>%
# Calculate missing in each category
group_by(key) %>% mutate(total = n()) %>%
group_by(key,total,is.missing) %>%
summarise(num.missing = n()) %>%
# claculate percentage
mutate(pct = num.missing/total * 100)
names <- (missing.values %>% filter(is.missing == TRUE) %>%
arrange(desc(pct)))$key
# Change the column names for better readability
ungroup(missing.values) %>% filter(is.missing == TRUE) %>% select(-c('pct','total', 'is.missing')) %>% rename("Column" = "key", "Missing" = "num.missing") %>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F, position = "center")
Column | Missing |
---|---|
Age | 263 |
Cabin | 1014 |
Embarked | 2 |
Fare | 1 |
# Plot missing values
p10 <- ggplot(missing.values) + geom_bar(aes( x= reorder(key, desc(pct)), y = pct, fill = is.missing), stat = 'identity', alpha = 0.8) + scale_fill_manual(name = "", values = c('steelblue','tomato3'), label = c("Present", "Missing")) + coord_flip() + labs(title = "Percentage of missing values",x = '',y = '% missing')
p10
Let us impute the missing values.
For Cabin I will extract and store the first letter of the Cabin column and I will replace all the missing values with the letter ‘U’ referring to ‘Unknown’.
For Embarked I will replace the missing value with ‘S’ as it the most common station to embark.
For Fare and Age I will build a linear and decision tree model to predict the missing values.
Let us extract first letter from the Cabin and replace missing values with the U. Also, I will create a new binary variable called CabinMissing to encode values which are missing or present in Cabin column.
# Replace Cabin with missing or not missing
df <- df %>% mutate(CabinMissing = ifelse(is.na(Cabin), "Yes","No"))
df$Cabin <- str_extract(df$Cabin, "\\w") %>% ifelse(is.na(.), "U",.)
df$Cabin <- ifelse(df$Cabin == 'T', "U", df$Cabin)
p11 <- df %>% ggplot(aes(x = Cabin)) + geom_bar()
p12 <- df %>% ggplot(aes(x = CabinMissing)) + geom_bar()
grid.arrange(p11, p12, ncol = 2)
Since S is the most common boarding station so will fill the two missing values in the Embarked column by S.
# Fill in the missing values in Embarked column.
df$Embarked <- ifelse(is.na(df$Embarked), "S", df$Embarked)
To fill in the one missing value we will create a cubist model to predict Fare. The built model has good R-squared value.
# By building the model
fare_lt_65 <- df %>% filter(Fare <= 65) %>% select(Pclass, Sex, Age, Parch, SibSp, Embarked, Fare)
fare.formula <- as.formula(Fare ~ Pclass + Sex + Parch + Embarked + SibSp)
ctrl <- trainControl(method = "repeatedcv",
repeats = 5)
model.fare <- train(fare.formula, data = fare_lt_65, method = "cubist", trControl = ctrl, tuneLength = 5)
# Look at the model
model.fare
Cubist
1137 samples
5 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 1025, 1023, 1025, 1024, 1023, 1022, ...
Resampling results across tuning parameters:
committees neighbors RMSE Rsquared MAE
1 0 7.580134 0.7012583 3.632071
1 5 7.632292 0.6972563 4.048258
1 9 7.594590 0.7005765 4.011889
10 0 7.523949 0.7055726 3.708284
10 5 7.609393 0.6984349 4.030294
10 9 7.578811 0.7006436 4.019808
20 0 7.517471 0.7054253 3.705649
20 5 7.611797 0.6982621 4.032112
20 9 7.582354 0.7003227 4.022212
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were committees = 20 and neighbors = 0.
to_predict <- df %>% filter(is.na(Fare))%>%select(Pclass, Sex, Age, Parch, Embarked, SibSp)
fare.predictions <- predict(model.fare, newdata = to_predict)
df$Fare <- ifelse(is.na(df$Fare), fare.predictions, df$Fare)
First like Cabin I will create a new variable called AgeMissing to record missing age in the original dataset. In order to predict the missing ages, I will first extract the Title from the name column and save them in the Title column. Further, I will categorize all the titles into four categories of Officer, Royalty, Mrs and Miss. By experimenting I found that cubist model performs better than other models in predicting the age so I will use it to predict missing ages.
# Fill in the missing values in age column
df <- df %>% mutate(AgeMissing = ifelse(is.na(Age), "Yes","No"))
df$Title <- str_extract(df$Name, ",\\s*\\w+") %>% str_remove(., pattern = ", ")
df$Title[df$Title %in% c('Capt', 'Col', 'Major', 'the')] <- 'Officer'
df$Title[df$Title %in% c('Don','Dr','Rev','Sir','Jonkheer','Countess','Lady','Dona')] <- 'Royalty'
df$Title[df$Title %in% c('Mrs','Mme')] <- 'Mrs'
df$Title[df$Title %in% c('Ms','Mlle')] <- 'Miss'
age_lt_66 <- df %>% filter(Age <= 66) %>% select(Pclass, Parch, SibSp, Embarked, Title, Age)
ctrl <- trainControl(method = "repeatedcv",
repeats = 5)
age.formula <- as.formula(Age ~ Pclass + SibSp + Parch + Embarked + Title)
# Look at the model
model.age <- train(age.formula, data = age_lt_66, method = "cubist", trControl = ctrl, tuneLength = 5)
model.age
Cubist
1037 samples
5 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 933, 933, 932, 933, 933, 935, ...
Resampling results across tuning parameters:
committees neighbors RMSE Rsquared MAE
1 0 10.52479 0.4344659 8.072775
1 5 10.36930 0.4493661 8.070186
1 9 10.36169 0.4480718 8.072155
10 0 10.57409 0.4315073 8.205296
10 5 10.30525 0.4553485 8.046059
10 9 10.29713 0.4540338 8.061593
20 0 10.59063 0.4293391 8.213320
20 5 10.31308 0.4544272 8.048073
20 9 10.29269 0.4542779 8.059671
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were committees = 20 and neighbors = 9.
to_predict <- df %>% filter(is.na(Age))%>%select(Pclass, Sex, Fare, Parch, Embarked, SibSp, Title)
age.predictions <- predict(model.age, newdata = to_predict)
df$Age <- ifelse(is.na(df$Age), age.predictions, df$Age)
Like Cabin we will create a new variable called AgeMissing to encode if the Age information was missing in the original dataset.
Let us create new variables and remove the id variables.
# Convert Pclass into factor variable
df <- df %>% mutate(Pclass = ifelse(Pclass == 1, "One", ifelse(Pclass == 2, "Two", "Three")))
df$Pclass <- as.factor(df$Pclass)
# make bins of age and fare
df$agebins <- custom_cut(df$Age, upper = 70, by = 9)
df$farebins <- custom_cut(df$Fare, upper = 60, by = 10)
df <- df %>% mutate(Ticket_len = map_dbl(Ticket, nchar))
df <- df %>% mutate(Ticket_parts = str_split(Ticket, pattern = " ") %>% map_dbl(., length))
df <- df %>% mutate(family = (Parch + SibSp + 1))
df <- df %>% mutate(family_size = case_when(family == 1 ~ "Alone",
family %in% c(2,3,4) ~ "Small",
family %in% c(5,6) ~ "Medium",
family %in% c(7,8,11) ~ "Large"))
df$LargeFamily <- ifelse(df$family > 4, "Yes", "No")
df <- df %>% mutate(fare_gt_50 = ifelse(Fare >= 50, "Yes", "No"))
df <- df %>% mutate(PclassThree_female = ifelse(Pclass == "Three" & Sex == "female", "Yes","No"))
# Remove columns not required
df <- df %>% select(-c(PassengerId, Name, Ticket))
df <- df %>% mutate_if(is.character, as.factor)
df_copy <- df # Make a copy
# One hot encode the categorical variables
dmy <- dummyVars("~Pclass + Sex + Embarked + Title + fare_gt_50 + AgeMissing + LargeFamily + agebins + farebins + family_size + CabinMissing + Cabin + PclassThree_female" , data = df)
ohe_df <- predict(dmy, newdata = df) %>% as.data.frame()
num_df <- df %>% select_if(funs(is.numeric(.)| is.logical(.))) %>% as.data.frame()
df <- cbind(ohe_df,num_df)
# # Preprocess
pre_process <- df %>% preProcess(., method = c("center","scale"))
df <- predict(pre_process, newdata = df)
Split the dataset into train, test and validation set. Validation set will be used to get the model performance however should not be used as sole cretirea for model selection. It should be used along with cross-validation performance.
train <- df %>% filter(isTrain == TRUE) %>% select(-c(isTrain))
test_x <- df %>% filter(isTrain == FALSE) %>% select(-c(isTrain))
train_x_full <- train
train_y_full <- as.factor(make.names(survived))
metric <- "logLoss"
ctrl <- trainControl(
method = "cv",
number = 5,
savePredictions = "all",
classProbs = TRUE,
summaryFunction = multiClassSummary,
verboseIter = FALSE)
# split the data
set.seed(3000)
inTrain <- createDataPartition(survived, p = 0.10, list = FALSE)
train_x <- train[-inTrain,]
val_x <- train[inTrain,]
train_y <- as.factor(make.names(survived))[-inTrain]
val_y <- as.factor(make.names(survived))[inTrain]
Train the model with all the features.
set.seed(3000)
weights <- c(0.623,1.000)
model_weights <- ifelse(train_y == "0", weights[1], weights[2])
model_glm <- train(x = train_x, y = train_y, method = "glm", trControl = ctrl, metric = metric, weights = model_weights)
# Evaluate the model on the test data
cm_glm <- cm(model = model_glm, data = val_x, target = val_y)
print(cm_glm)
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 49 13
X1 4 24
Accuracy : 0.8111
95% CI : (0.7149, 0.8859)
No Information Rate : 0.5889
P-Value [Acc > NIR] : 6.123e-06
Kappa : 0.595
Mcnemar's Test P-Value : 0.05235
Sensitivity : 0.9245
Specificity : 0.6486
Pos Pred Value : 0.7903
Neg Pred Value : 0.8571
Prevalence : 0.5889
Detection Rate : 0.5444
Detection Prevalence : 0.6889
Balanced Accuracy : 0.7866
'Positive' Class : X0
summary(model_glm)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9246 -0.5221 -0.3652 0.3096 2.4622
Coefficients: (16 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.453915 2.737613 -0.166 0.86831
Pclass.One 0.320827 0.277290 1.157 0.24727
Pclass.Three 0.007709 0.236867 0.033 0.97404
Pclass.Two NA NA NA NA
Sex.female 7.629011 391.688571 0.019 0.98446
Sex.male NA NA NA NA
Embarked.C 0.230173 0.123891 1.858 0.06319 .
Embarked.Q 0.192536 0.117357 1.641 0.10088
Embarked.S NA NA NA NA
Title.Master 0.674237 0.223383 3.018 0.00254 **
Title.Miss -4.449672 328.242070 -0.014 0.98918
Title.Mr -0.096357 0.402136 -0.240 0.81063
Title.Mrs -3.859121 293.105356 -0.013 0.98950
Title.Officer 0.126146 0.108811 1.159 0.24633
Title.Royalty NA NA NA NA
fare_gt_50.No -0.516426 0.480892 -1.074 0.28287
fare_gt_50.Yes NA NA NA NA
AgeMissing.No 0.040260 0.117999 0.341 0.73296
AgeMissing.Yes NA NA NA NA
LargeFamily.No 0.613413 0.258968 2.369 0.01785 *
LargeFamily.Yes NA NA NA NA
`agebins.0-8` 0.972292 0.732060 1.328 0.18413
`agebins.9-17` 0.749955 0.657365 1.141 0.25393
`agebins.18-26` 1.078065 1.025931 1.051 0.29334
`agebins.27-35` 1.274896 0.894653 1.425 0.15415
`agebins.36-44` 0.651796 0.570307 1.143 0.25309
`agebins.45-53` 0.477183 0.368610 1.295 0.19548
`agebins.54-62` 0.157945 0.254859 0.620 0.53543
`agebins.70+` NA NA NA NA
`farebins.0-9` 0.937562 0.615348 1.524 0.12760
`farebins.10-19` 0.735899 0.491775 1.496 0.13455
`farebins.20-29` 0.653454 0.438885 1.489 0.13651
`farebins.30-39` 0.589954 0.300976 1.960 0.04998 *
`farebins.40-49` NA NA NA NA
`farebins.50-59` 0.336758 0.148955 2.261 0.02377 *
`farebins.60+` NA NA NA NA
family_size.Alone 0.046063 0.247778 0.186 0.85252
family_size.Large 0.279122 0.174950 1.595 0.11061
family_size.Medium NA NA NA NA
family_size.Small NA NA NA NA
CabinMissing.No -5.941472 608.328251 -0.010 0.99221
CabinMissing.Yes NA NA NA NA
Cabin.A 1.829015 187.158105 0.010 0.99220
Cabin.B 3.128038 316.282300 0.010 0.99211
Cabin.C 3.735797 375.889384 0.010 0.99207
Cabin.D 2.820829 268.094928 0.011 0.99161
Cabin.E 2.771884 253.606021 0.011 0.99128
Cabin.F 1.916390 182.926116 0.010 0.99164
Cabin.G 0.838190 89.811495 0.009 0.99255
Cabin.U NA NA NA NA
PclassThree_female.No 0.994957 0.221483 4.492 7.05e-06 ***
PclassThree_female.Yes NA NA NA NA
Age 0.307587 0.594241 0.518 0.60473
SibSp -0.340686 0.332315 -1.025 0.30527
Parch -0.049742 0.256434 -0.194 0.84620
Fare 0.281514 0.209357 1.345 0.17874
Ticket_len -0.201083 0.206624 -0.973 0.33046
Ticket_parts 0.350385 0.213124 1.644 0.10017
family NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1064.44 on 800 degrees of freedom
Residual deviance: 584.11 on 758 degrees of freedom
AIC: 670.11
Number of Fisher Scoring iterations: 14
It looks like some features like Sex,fare_gt_50,family_size,Ticket_len,Parch and SibSp are not contributing to the model. Let us remove these variables and build the model the again to see if we get a similar accuracy.
# Let us select few variables
df_copy_selected <- df_copy %>% select(c(Pclass,Embarked,Title,LargeFamily,Age,Cabin,farebins,family,AgeMissing,Sex,isTrain))
#df_copy_selected <- df_copy %>% select(c(Pclass,Embarked,Title,LargeFamily,Age,farebins,PclassThree_female,isTrain))
train <- df_copy_selected %>% filter(isTrain == TRUE) %>% select(-c(isTrain)) %>% as.data.frame()
test_x <- df_copy_selected %>% filter(isTrain == FALSE) %>% select(-c(isTrain)) %>% as.data.frame()
train_x_full <- train
train_y_full <- as.factor(make.names(survived))
# split the data
set.seed(3000)
inTrain <- createDataPartition(survived, p = 0.10, list = FALSE)
train_x <- train[-inTrain,]
val_x <- train[inTrain,]
train_y <- as.factor(make.names(survived))[-inTrain]
val_y <- as.factor(make.names(survived))[inTrain]
set.seed(3000)
weights <- c(0.623,1.000)
model_weights <- ifelse(train_y == "0", weights[1], weights[2])
model_glm_selected <- train(x = train_x, y = train_y, method = "glm", trControl = ctrl, metric = metric, weights = model_weights, preProcess = c("center", "scale"))
# Evaluate the model on the test data
cm_glm_selected <- cm(model = model_glm_selected, data = val_x, target = val_y)
print(cm_glm_selected)
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 49 6
X1 4 31
Accuracy : 0.8889
95% CI : (0.8051, 0.9454)
No Information Rate : 0.5889
P-Value [Acc > NIR] : 3.821e-10
Kappa : 0.7686
Mcnemar's Test P-Value : 0.7518
Sensitivity : 0.9245
Specificity : 0.8378
Pos Pred Value : 0.8909
Neg Pred Value : 0.8857
Prevalence : 0.5889
Detection Rate : 0.5444
Detection Prevalence : 0.6111
Balanced Accuracy : 0.8812
'Positive' Class : X0
# Let us build the model with the whole training set
model_weights <- ifelse(train_y_full == "0", weights[1], weights[2])
model_glm_full <- train(x = train_x_full, y = train_y_full, method = "glm", trControl = ctrl, metric = metric, weights = model_weights, preProcess = c("center", "scale"))
Let us use the same variables to build the random forest model. RF can take care of outliers so no need for pre-processing of the data.
set.seed(3000)
weights <- c(0.623,1.000)
model_weights <- ifelse(train_y == "0", weights[1], weights[2])
model_rf <- train(x = train_x, y = train_y, method = "rf", trControl = ctrl, metric = metric, weights = model_weights, tuneLength = 8)
# Evaluate the model on the validation data
cm_rf<- cm(model = model_rf, data = val_x, target = val_y)
print(cm_rf)
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 51 14
X1 2 23
Accuracy : 0.8222
95% CI : (0.7274, 0.8948)
No Information Rate : 0.5889
P-Value [Acc > NIR] : 1.951e-06
Kappa : 0.6139
Mcnemar's Test P-Value : 0.00596
Sensitivity : 0.9623
Specificity : 0.6216
Pos Pred Value : 0.7846
Neg Pred Value : 0.9200
Prevalence : 0.5889
Detection Rate : 0.5667
Detection Prevalence : 0.7222
Balanced Accuracy : 0.7919
'Positive' Class : X0
# Let us build the model with the full training data
model_weights <- ifelse(train_y_full == "0", weights[1], weights[2])
model_rf_full <- train(x = train_x_full, y = train_y_full, method = "rf", trControl = ctrl, metric = metric, weights = model_weights, tuneLength = 8)
Let us try GLM Boost model.
set.seed(3000)
weights <- c(0.623,1.000)
model_weights <- ifelse(train_y == "0", weights[1], weights[2])
model_glmboost<- train(x = train_x, y = train_y, method = "glmboost", trControl = ctrl, metric = metric, weights = model_weights, tuneLength = 10, preProcess = c("center","scale"))
# Evaluate the model on the test data
cm_glmboost <- cm(model = model_glmboost, data = val_x, target = val_y)
print(cm_glmboost)
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 49 8
X1 4 29
Accuracy : 0.8667
95% CI : (0.7787, 0.9292)
No Information Rate : 0.5889
P-Value [Acc > NIR] : 9.36e-09
Kappa : 0.7201
Mcnemar's Test P-Value : 0.3865
Sensitivity : 0.9245
Specificity : 0.7838
Pos Pred Value : 0.8596
Neg Pred Value : 0.8788
Prevalence : 0.5889
Detection Rate : 0.5444
Detection Prevalence : 0.6333
Balanced Accuracy : 0.8542
'Positive' Class : X0
model_weights <- ifelse(train_y_full == "0", weights[1], weights[2])
model_glmboost_full <- train(x = train_x_full, y = train_y_full, method = "glmboost", trControl = ctrl, metric = metric, tuneLength = 10, weights = model_weights, preProcess = c("center","scale"))
Generate the prediction file to submit the predictions to Kaggle.
```r
# Predict the model on the test data
pred <- predict(model_glm_full, test_x) %>% str_remove(., pattern = \X\) %>% as.integer()
sample_submission <- read.csv(\gender_submission.csv\, header = TRUE, stringsAsFactors = F)
sample_submission$Survived <- pred
write.csv(sample_submission, file = \glm_submission.csv\,quote = FALSE, row.names = FALSE)
```
I got the score of 79.4% in Kaggle which is not bad however, it is way below the test score I got in the GLM model (~88%) which suggest that my model has overfit the training data.