Thursday, June 11, 2015

First touch in data science (Titanic project on Kaggle) Part II: Random Forest

In this post, I will use the Pandas and scikit-learn packages to make the predictions.

Reading the data
Instead of using the csv reader provided by Python itself, here we use Pandas.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import csv
# always use header=0 when row 0 is the header row
df = pd.read_csv('/path/train.csv', header = 0)
test_df = pd.read_csv('/path/test.csv', header = 0)

Pandas stores the data in an object called a DataFrame. It also provides functions for basic statistics on the data.

df.head(n) returns the first n rows of the data.
df.tail(n) returns the last n rows of the data.


>>> df.head(3)
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                             Heikkinen, Miss. Laina  female   26      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
>>> df.tail(3)
     PassengerId  Survived  Pclass                                      Name  \
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex  Age  SibSp  Parch      Ticket   Fare Cabin Embarked  
888  female  NaN      1      2  W./C. 6607  23.45   NaN        S  
889    male   26      0      0      111369  30.00  C148        C  
890    male   32      0      0      370376   7.75   NaN        Q  

df.dtypes returns the data type of each column. Remember that Python's csv module reads everything as strings, whereas importing the data with Pandas automatically converts each column based on its actual type.

>>> df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Sometimes we want to know whether there is missing data in the columns; df.info() can help with this.


>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5 KB

There are 891 rows in total, but Age, Cabin, and Embarked have fewer non-null entries, which indicates missing data.
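
The missing counts can also be read off directly with isnull() (an equivalent check, not in the original post):

>>> df.isnull().sum()[['Age', 'Cabin', 'Embarked']]
Age         177
Cabin       687
Embarked      2
dtype: int64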

Moreover, df.describe() provides basic statistics of the data.

>>> df.describe()
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

However, since there are missing values in some columns, we need to be careful when quoting statistics produced by this method.

Pandas provides handy ways to select and filter data; see the following examples:

df[list of column names][m:n]: selects rows m through n-1 (that is, n-m rows) with the desired columns.

>>> df[['Name','Pclass']][1:6]
                                                Name  Pclass
1  Cumings, Mrs. John Bradley (Florence Briggs Th...       1
2                             Heikkinen, Miss. Laina       3
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)       1
4                           Allen, Mr. William Henry       3
5                                   Moran, Mr. James       3


df[criteria on df['column name']][list of column names]: filters the rows based on the criteria and displays the desired columns.


>>> df[df['Age']>60][['Name','Sex','Age']]
                                          Name     Sex   Age
33                       Wheadon, Mr. Edward H    male  66.0
54              Ostby, Mr. Engelhart Cornelius    male  65.0
96                   Goldschmidt, Mr. George B    male  71.0
116                       Connors, Mr. Patrick    male  70.5
170                  Van der hoef, Mr. Wyckoff    male  61.0
252                  Stead, Mr. William Thomas    male  62.0
275          Andrews, Miss. Kornelia Theodosia  female  63.0
280                           Duane, Mr. Frank    male  65.0
326                  Nysveen, Mr. Johan Hansen    male  61.0
438                          Fortune, Mr. Mark    male  64.0
456                  Millet, Mr. Francis Davis    male  65.0
483                     Turkula, Mrs. (Hedwig)  female  63.0
493                    Artagaveytia, Mr. Ramon    male  71.0
545               Nicholson, Mr. Arthur Ernest    male  64.0
555                         Wright, Mr. George    male  62.0
570                         Harris, Mr. George    male  62.0
625                      Sutton, Mr. Frederick    male  61.0
630       Barkworth, Mr. Algernon Henry Wilson    male  80.0
672                Mitchell, Mr. Henry Michael    male  70.0
745               Crosby, Capt. Edward Gifford    male  70.0
829  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0
851                        Svensson, Mr. Johan    male  74.0


Analyzing the data
Lots of machine learning models only accept numerical inputs, so for string-type data such as Sex we need to convert the values to numbers.

# map female to 0 and male to 1
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

The map() function applies the given mapping (here a dictionary) to every element of the Series, so each 'female' becomes 0 and each 'male' becomes 1.

Filling in missing data becomes easier with Pandas because we can use the methods provided by the package. Here we bin the passengers by gender and passenger class, and fill each missing age with the median of its bin.

# fill in missing ages
# for each passenger without an age, fill in the median age
# of his/her gender and passenger class bin
median_ages = np.zeros((2,3))
for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i, j] = df[(df['Gender'] == i) &
                               (df['Pclass'] == j + 1)]['Age'].dropna().median()

# create a new column for the filled ages, keeping the original Age intact
df['AgeFill'] = df['Age']
# since each column is a pandas Series, a single cell cannot be accessed
# as df[2, 3]; we must provide the label (header) of the column and use the
# .loc indexer, e.g., df.loc[0, 'Age']
# (or chained indexing such as df['Age'][0])
for i in range(2):
    for j in range(3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j + 1),
               'AgeFill'] = median_ages[i, j]
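
As an aside, the same imputation can be written more compactly with Pandas' groupby: fill each missing Age with the median of its (Gender, Pclass) group. This is an equivalent alternative, not the approach used above:

df['AgeFill'] = df['Age'].fillna(
    df.groupby(['Gender', 'Pclass'])['Age'].transform(lambda s: s.median()))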


We fill in the missing Embarked values with the most common boarding place. In statistics, the mode is the most frequent element in a data set.


# fill the missing Embarked with the most common boarding place
# mode() returns the mode of the data set, i.e., its most frequent element
# it may return multiple values, so select the first one with .iloc[0]
if df.Embarked.isnull().any():
    df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode().iloc[0]

# enumerate(np.unique(df.Embarked)) pairs each distinct port with an index,
# e.g., [(0, 'C'), (1, 'Q'), (2, 'S')] since np.unique returns sorted values
# set up a dictionary mapping each port name to its index
Ports_dict = {name : i for i, name in list(enumerate(np.unique(df.Embarked)))}
df['EmbarkFill'] = df.Embarked.map(lambda x: Ports_dict[x]).astype(int)
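
With the Titanic data, np.unique returns the ports in sorted order, so the dictionary comes out as:

>>> Ports_dict
{'C': 0, 'Q': 1, 'S': 2}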

Drop the unwanted columns.

df = df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis=1)

We need to do the same thing for the test data. I will omit the details here, but you can find the full source code on my Github.
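
For completeness, here is a condensed sketch of that cleaning, reconstructed to mirror the training-side code above (the one missing Fare in the test set also needs a fill):

# sketch: apply the same cleaning to the test set
test_df['Gender'] = test_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
test_df['AgeFill'] = test_df['Age']
for i in range(2):
    for j in range(3):
        test_df.loc[(test_df.Age.isnull()) & (test_df.Gender == i) &
                    (test_df.Pclass == j + 1), 'AgeFill'] = median_ages[i, j]
test_df['EmbarkFill'] = test_df.Embarked.map(lambda x: Ports_dict[x]).astype(int)
# keep the ids before dropping the column; the submission file needs them
ids = test_df['PassengerId'].values
test_df = test_df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin',
                        'Embarked', 'Age'], axis=1)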

Training the model
The model we are going to build is called a random forest. A random forest is an ensemble learning method: it constructs a collection of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees. The data set for each tree is produced by resampling the original data set with the bootstrap method (sampling with replacement).
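
To illustrate what a bootstrap sample is, here is a toy example (not part of the pipeline): a sample of the same size as the data, drawn with replacement, so some rows repeat and others are left out.

import numpy as np
data = np.arange(10)
# draw len(data) indices with replacement, then index into the data
sample = data[np.random.randint(0, len(data), size=len(data))]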

We use scikit-learn to build the model and make the predictions. Since scikit-learn works on numpy arrays, after we clean the data with Pandas we need to convert it to numpy arrays.

# convert the data to numpy array
training_data = df.values
test_data = test_df.values

Then we build a random forest object and train the model.

# train the data using random forest
# n_estimators: number of trees in the forest, this number affects the prediction
forest = RandomForestClassifier(n_estimators=150)
# build the forest
# X: array-like or sparse matrix of shape = [n_samples, n_features]
# y: array-like, shape = [n_samples], target values/class labels
forest = forest.fit(training_data[0::, 1::], training_data[0::, 0])
output = forest.predict(test_data).astype(int)
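
As a quick sanity check (not in the original run), the fitted forest exposes the relative importance of each feature through scikit-learn's feature_importances_ attribute:

# column 0 of training_data is Survived, so the features are columns 1 onward
for name, importance in zip(df.columns[1:], forest.feature_importances_):
    print("%s %.3f" % (name, importance))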

Write the output to a csv file using Python's csv package.

# write the output to a new csv file
predictions_file = open("predictByRandomForest.csv", 'w')
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
open_file_object.writerows(zip(ids, output))
predictions_file.close()
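
Equivalently, the submission file can be written with Pandas in one step (a design alternative, not what the script above uses):

# same output via a DataFrame
pd.DataFrame({'PassengerId': ids, 'Survived': output}).to_csv(
    'predictByRandomForest.csv', index=False)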


My score is 0.76555 on the leaderboard, which is about average; not too bad for a first try.

A few thoughts
Clearly, choosing the right model is important; moreover, tuning the input parameters (e.g., n_estimators) matters as well.

As for the data, I still need to learn which features are most relevant for the "best" fit. Including more features can sometimes yield more correct predictions, but it can also cause overfitting.
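
One way to experiment with features and parameters such as n_estimators without burning leaderboard submissions is cross-validation. A minimal sketch (the module path below is for 2015-era scikit-learn; newer versions moved it to sklearn.model_selection):

from sklearn.cross_validation import cross_val_score
# estimate out-of-sample accuracy with 5-fold cross-validation
scores = cross_val_score(RandomForestClassifier(n_estimators=150),
                         training_data[0::, 1::], training_data[0::, 0], cv=5)
print(scores.mean())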
