Reading the data
Instead of the csv reader that ships with Python, we use Pandas here.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import csv
# always use header = 0 when row 0 is the header row
df = pd.read_csv('/path/train.csv', header = 0)
test_df = pd.read_csv('/path/test.csv', header = 0)
Pandas stores the data in an object called a DataFrame. It also provides functions for basic statistics of the data.
df.head(n) returns the first n rows of the data.
df.tail(n) returns the last n rows of the data.
>>> df.head(3)
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1
2 Heikkinen, Miss. Laina female 26 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
>>> df.tail(3)
PassengerId Survived Pclass Name \
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
889 890 1 1 Behr, Mr. Karl Howell
890 891 0 3 Dooley, Mr. Patrick
Sex Age SibSp Parch Ticket Fare Cabin Embarked
888 female NaN 1 2 W./C. 6607 23.45 NaN S
889 male 26 0 0 111369 30.00 C148 C
890 male 32 0 0 370376 7.75 NaN Q
df.dtypes returns the data type of each column. Remember that Python's csv module reads every value as a string by default, whereas Pandas automatically infers the type of each column from the data when importing.
>>> df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
Sometimes we want to know whether any columns have missing data. df.info() can help with this.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5 KB
There are 891 rows in total, but Age, Cabin and Embarked have fewer non-null values, which indicates that these columns have missing data.
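To get the number of missing values per column directly, instead of reading it off the non-null counts, one option is isnull().sum(). A minimal sketch on a small made-up frame (the toy data below is hypothetical, not the Titanic file):

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for a few Titanic columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 8.05],
})

# isnull() marks missing cells as True; sum() counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```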
Moreover, df.describe() provides basic statistics of the data.
>>> df.describe()
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
However, since there are missing values in some columns, we need to be careful when quoting statistics from this method. For example, the Age statistics above are computed from only the 714 non-missing values.
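The reason for the caution is that Pandas skips missing values by default when it computes statistics, so a column's count can be smaller than the number of rows. A small sketch with made-up ages (hypothetical values, not taken from the data set):

```python
import numpy as np
import pandas as pd

# hypothetical ages with one missing value
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

n_rows = len(ages)       # counts every row, missing or not
n_used = ages.count()    # counts only the non-missing values
mean_age = ages.mean()   # NaN is skipped by default (skipna=True)
print(n_rows, n_used, mean_age)
```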
Pandas provides handy ways to select and filter data, see the following several examples:
df[list of column names][m:n]: selects rows m through n-1 of the desired columns.
>>> df[['Name','Pclass']][1:6]
Name Pclass
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1
2 Heikkinen, Miss. Laina 3
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1
4 Allen, Mr. William Henry 3
5 Moran, Mr. James 3
df[criteria on df['column name']][list of column names]: filters the rows based on the criteria and displays the desired columns.
>>> df[df['Age']>60][['Name','Sex','Age']]
Name Sex Age
33 Wheadon, Mr. Edward H male 66.0
54 Ostby, Mr. Engelhart Cornelius male 65.0
96 Goldschmidt, Mr. George B male 71.0
116 Connors, Mr. Patrick male 70.5
170 Van der hoef, Mr. Wyckoff male 61.0
252 Stead, Mr. William Thomas male 62.0
275 Andrews, Miss. Kornelia Theodosia female 63.0
280 Duane, Mr. Frank male 65.0
326 Nysveen, Mr. Johan Hansen male 61.0
438 Fortune, Mr. Mark male 64.0
456 Millet, Mr. Francis Davis male 65.0
483 Turkula, Mrs. (Hedwig) female 63.0
493 Artagaveytia, Mr. Ramon male 71.0
545 Nicholson, Mr. Arthur Ernest male 64.0
555 Wright, Mr. George male 62.0
570 Harris, Mr. George male 62.0
625 Sutton, Mr. Frederick male 61.0
630 Barkworth, Mr. Algernon Henry Wilson male 80.0
672 Mitchell, Mr. Henry Michael male 70.0
745 Crosby, Capt. Edward Gifford male 70.0
829 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0
851 Svensson, Mr. Johan male 74.0
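The same kind of filtering can also be written as a single .loc call that takes the row condition and the column list together. A minimal sketch on a hypothetical toy frame (names and ages are made up):

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [66.0, 35.0, 71.0, 63.0],
})

# row filter and column selection in one .loc call
older = toy.loc[toy['Age'] > 60, ['Name', 'Age']]

# conditions combine with & (and) or | (or); each one needs its own parentheses
older_women = toy.loc[(toy['Age'] > 60) & (toy['Sex'] == 'female'), ['Name', 'Age']]
print(older)
print(older_women)
```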
Analyzing the data
Many machine learning models only accept numerical inputs. Thus for string data such as Sex, we need to convert the values to numbers.
# map female to 0 and male to 1
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
When given a dictionary, the map() method replaces each value in the column ('female' or 'male' in this example) with the value stored under that key in the dictionary.
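One thing to watch for with a dictionary mapping: a value that has no key in the dictionary becomes NaN instead of raising an error, and a later astype(int) would then fail. A small sketch with a made-up 'unknown' entry (which does not occur in the Titanic data):

```python
import pandas as pd

# hypothetical column with one value that the dictionary does not cover
sex = pd.Series(['female', 'male', 'male', 'unknown'])

# values without a matching key are mapped to NaN
gender = sex.map({'female': 0, 'male': 1})

# count the unmapped entries before converting to int
n_unmapped = int(gender.isnull().sum())
print(gender)
print(n_unmapped)
```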
Filling in missing data becomes easier with Pandas because we can use the methods the package provides. Here we bin the passengers by gender and passenger class and fill each missing age with the median age of the passenger's bin.
# fill in missing ages
# for each passenger without an age, fill in the median age
# of his/her gender and passenger class
median_ages = np.zeros((2, 3))
for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i, j] = df[(df['Gender'] == i) &
                               (df['Pclass'] == j + 1)]['Age'].dropna().median()
# create a new column for the filled ages (so the original Age stays intact)
df['AgeFill'] = df['Age']
# since each column is a pandas Series object, the data cannot be accessed
# by df[2, 3]; we must provide the label (header) of the column and use .loc[]
# to locate the data, e.g., df.loc[0, 'Age']
# or df['Age'][0]
for i in range(2):
    for j in range(3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j + 1),
               'AgeFill'] = median_ages[i, j]
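The nested loops can also be expressed with groupby. The sketch below is an alternative formulation, not the code used for the submission, and it runs on a hypothetical toy frame: transform('median') returns, for every row, the median age of that row's (Gender, Pclass) group, and fillna() uses it only where Age is missing.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: two (Gender, Pclass) bins, each with one missing age
toy = pd.DataFrame({
    'Gender': [0, 0, 0, 1, 1, 1],
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# median age of each row's bin (missing ages are ignored by the median)
group_median = toy.groupby(['Gender', 'Pclass'])['Age'].transform('median')

# keep known ages, fill the missing ones with the bin median
toy['AgeFill'] = toy['Age'].fillna(group_median)
print(toy)
```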
We fill the missing Embarked values with the most common boarding place. In statistics, the mode is the most frequent element in a data set.
# fill the missing Embarked with the most common boarding place
# mode() returns the mode of the data set, which is the most frequent element in the data
# several values may be returned when there is a tie, so take the first one
# with mode().iloc[0]
if len(df.Embarked[df.Embarked.isnull()]) > 0:
    # assign through .loc to avoid chained assignment on a copy
    df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode().iloc[0]
# enumerate() pairs each port with an index; np.unique() returns the ports sorted,
# e.g., [(0, 'C'), (1, 'Q'), (2, 'S')]
# Ports = list(enumerate(np.unique(df.Embarked)))
# set up a dictionary that maps each port name to its index
Ports_dict = {name: i for i, name in enumerate(np.unique(df.Embarked))}
df['EmbarkFill'] = df.Embarked.map(lambda x: Ports_dict[x]).astype(int)
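To see what the mode fill and the port dictionary produce, here is a small sketch on a made-up Embarked column (the five values are hypothetical). It uses fillna() for the fill, which is one of several ways to write it.

```python
import numpy as np
import pandas as pd

# hypothetical boarding places with one missing entry
embarked = pd.Series(['S', 'C', None, 'S', 'Q'])

# mode() ignores missing values; iloc[0] takes the first mode if there is a tie
most_common = embarked.mode().iloc[0]
filled = embarked.fillna(most_common)

# np.unique() returns the distinct ports in sorted order
ports = {name: i for i, name in enumerate(np.unique(filled))}
codes = filled.map(ports).astype(int)
print(ports)
print(codes.tolist())
```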
Drop the unwanted columns.
df = df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis=1)
We need to do the same thing for the test data. I will omit that part here, but you can find the full source code on my GitHub.
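One way to keep the training and test preparation in sync is to wrap the steps in a function and call it on both frames. The sketch below is simplified and is not the code from the repository: the prepare() helper and the toy frames are hypothetical, and it fills ages with a single overall median rather than the per-bin medians used above. The fill value is computed from the training data only and reused for the test data.

```python
import numpy as np
import pandas as pd

def prepare(frame, fill_values):
    """Hypothetical helper: apply the same cleaning steps to any frame."""
    out = frame.copy()
    out['Gender'] = out['Sex'].map({'female': 0, 'male': 1}).astype(int)
    out['AgeFill'] = out['Age'].fillna(fill_values['Age'])
    return out.drop(['Sex', 'Age'], axis=1)

# hypothetical toy frames
train = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                      'Age': [20.0, 30.0, np.nan]})
test = pd.DataFrame({'Sex': ['female', 'male'],
                     'Age': [np.nan, 50.0]})

# statistics come from the training data and are reused for the test data
fill_values = {'Age': train['Age'].median()}
train_ready = prepare(train, fill_values)
test_ready = prepare(test, fill_values)
print(test_ready)
```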
Training the model
The model we are going to build is called a random forest. Random forest is an ensemble learning method: it constructs a collection of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees. The data set for each decision tree is resampled from the original data set by the bootstrap method (sampling with replacement).
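The two ideas in this description, bootstrap resampling and taking the mode of the trees' votes, can be illustrated with a few lines of NumPy. This is a conceptual sketch with made-up votes, not how scikit-learn implements its forest internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10

# bootstrap: draw n_samples row indices with replacement,
# so some rows appear several times and others not at all
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# hypothetical votes: one row per tree, one column per passenger
tree_votes = np.array([[1, 0, 1],
                       [1, 1, 0],
                       [0, 0, 1]])

# for two classes, the mode is 1 when more than half of the trees vote 1
forest_prediction = (tree_votes.mean(axis=0) > 0.5).astype(int)
print(bootstrap_idx)
print(forest_prediction)
```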
We use scikit-learn to build the model and make the predictions. Since scikit-learn works with NumPy arrays, after cleaning the data with Pandas we convert it to NumPy arrays.
# convert the data to numpy array
training_data = df.values
test_data = test_df.values
Then we build a random forest object and train the model.
# train the model using random forest
# n_estimators: number of trees in the forest; this number affects the prediction
forest = RandomForestClassifier(n_estimators=150)
# build the forest
# X: array-like or sparse matrix of shape = [n_samples, n_features]
# y: array-like, shape = [n_samples], target values/class labels
# column 0 of training_data is Survived; the remaining columns are the features
forest = forest.fit(training_data[0::, 1::], training_data[0::, 0])
output = forest.predict(test_data).astype(int)
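For readers who want to try the fit and predict calls without the Titanic files, here is a self-contained sketch on synthetic data. The arrays are randomly generated stand-ins, laid out like training_data with the label in column 0; n_estimators=20 and random_state=0 are arbitrary choices for a quick, repeatable run.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# synthetic stand-in data: 3 features, label depends on the first feature
features = rng.normal(size=(100, 3))
labels = (features[:, 0] > 0).astype(int)
training_data = np.column_stack([labels, features])  # label in column 0
test_data = rng.normal(size=(5, 3))                  # features only

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(training_data[:, 1:], training_data[:, 0])

# the test array must have the same feature columns, in the same order
output = forest.predict(test_data).astype(int)
print(output)
```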
Write the output to a csv file using Python's csv package.
# write the output to a new csv file
# ids: the PassengerId values of the test set, saved before that column was dropped
with open("predictByRandomForest.csv", 'w', newline='') as predictions_file:
    open_file_object = csv.writer(predictions_file)
    open_file_object.writerow(["PassengerId", "Survived"])
    open_file_object.writerows(zip(ids, output))
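Since Pandas is already loaded, the same file could also be produced with DataFrame.to_csv. A sketch of that alternative with made-up ids and predictions; it writes to an in-memory buffer so that it runs without touching the disk, and a file path can be passed to to_csv instead.

```python
import io
import numpy as np
import pandas as pd

# hypothetical ids and predictions
ids = np.array([892, 893, 894])
output = np.array([0, 1, 0])

submission = pd.DataFrame({'PassengerId': ids, 'Survived': output})

# index=False leaves out the row index, which the submission does not need
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```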
My score on the leaderboard is 0.76555 on average, which is not too bad for a first try.
A few thoughts
Choosing the right model is clearly important, and so is tuning its input parameters (e.g., n_estimators).
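One way to compare parameter values without spending leaderboard submissions is cross-validation on the training data. A minimal sketch on synthetic data (the arrays and the candidate values 10, 50 and 150 are assumptions for illustration; I did not run this for the submission above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# synthetic stand-in for the cleaned training data
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# mean 5-fold cross-validated accuracy for each candidate forest size
scores = {}
for n in (10, 50, 150):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    scores[n] = cross_val_score(forest, X, y, cv=5).mean()
print(scores)
```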
As for the data, I still need to learn which features are most relevant for the "best" fit. Including more features may yield more correct predictions, but it may also cause overfitting.
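A fitted random forest gives a first hint about feature relevance through its feature_importances_ attribute. A sketch on synthetic data in which only the first feature carries signal (the data and the feature count are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# synthetic data: the label depends only on feature 0, the rest is noise
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# one value per feature; the values sum to 1, larger means more influential
importances = forest.feature_importances_
print(importances)
```

With the Titanic data, the importances would line up with the feature columns in the order they appear in the training array.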
