

Thursday, June 11, 2015

First touch in data science (Titanic project on Kaggle) Part II: Random Forest

In this post, I will use the Pandas and scikit-learn packages to make the predictions.

Reading the data
Instead of using the csv reader provided by Python itself, here we use Pandas.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import csv
# always use header =  0 when row 0 is the header row
df = pd.read_csv('/path/train.csv', header = 0)
test_df = pd.read_csv('/path/test.csv', header = 0)

Pandas stores the data in an object called a DataFrame. It also provides functions for computing basic statistics of the data.

df.head(n) returns the first n rows of the data.
df.tail(n) returns the last n rows of the data.


>>> df.head(3)
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                             Heikkinen, Miss. Laina  female   26      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
>>> df.tail(3)
     PassengerId  Survived  Pclass                                      Name  \
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex  Age  SibSp  Parch      Ticket   Fare Cabin Embarked  
888  female  NaN      1      2  W./C. 6607  23.45   NaN        S  
889    male   26      0      0      111369  30.00  C148        C  
890    male   32      0      0      370376   7.75   NaN        Q  

df.dtypes returns the data type of each column. Remember that the csv module reads everything as strings, whereas importing data with Pandas automatically infers each column's type from its contents.
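To see this inference in action, here is a minimal sketch using a small in-memory CSV (the values are made up for illustration):

```python
import io
import pandas as pd

# a tiny CSV: Pandas infers int64, float64, and object column types
csv_text = "PassengerId,Fare,Name\n1,7.25,Braund\n2,71.2833,Cumings\n"
small_df = pd.read_csv(io.StringIO(csv_text), header=0)
print(small_df.dtypes)
# PassengerId      int64
# Fare           float64
# Name            object
```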

>>> df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Sometimes we want to know if there is missing data in the columns. df.info() can help on this.


>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5 KB

There are 891 rows in total, but Age, Cabin, and Embarked have fewer non-null rows, which indicates missing data.

Moreover, df.describe() provides basic statistics of the data.

>>> df.describe()
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

However, since there are missing values in some columns, we need to be careful when quoting statistics obtained this way.

Pandas provides handy ways to select and filter data, see the following several examples:

df[list of column names][m:n]: selects rows m through n-1 of the desired columns.

>>> df[['Name','Pclass']][1:6]
                                                Name  Pclass
1  Cumings, Mrs. John Bradley (Florence Briggs Th...       1
2                             Heikkinen, Miss. Laina       3
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)       1
4                           Allen, Mr. William Henry       3
5                                   Moran, Mr. James       3


df[criteria of df['column name']][list of column names]: filters the rows based on the criteria and displays the desired columns.


>>> df[df['Age']>60][['Name','Sex','Age']]
                                          Name     Sex   Age
33                       Wheadon, Mr. Edward H    male  66.0
54              Ostby, Mr. Engelhart Cornelius    male  65.0
96                   Goldschmidt, Mr. George B    male  71.0
116                       Connors, Mr. Patrick    male  70.5
170                  Van der hoef, Mr. Wyckoff    male  61.0
252                  Stead, Mr. William Thomas    male  62.0
275          Andrews, Miss. Kornelia Theodosia  female  63.0
280                           Duane, Mr. Frank    male  65.0
326                  Nysveen, Mr. Johan Hansen    male  61.0
438                          Fortune, Mr. Mark    male  64.0
456                  Millet, Mr. Francis Davis    male  65.0
483                     Turkula, Mrs. (Hedwig)  female  63.0
493                    Artagaveytia, Mr. Ramon    male  71.0
545               Nicholson, Mr. Arthur Ernest    male  64.0
555                         Wright, Mr. George    male  62.0
570                         Harris, Mr. George    male  62.0
625                      Sutton, Mr. Frederick    male  61.0
630       Barkworth, Mr. Algernon Henry Wilson    male  80.0
672                Mitchell, Mr. Henry Michael    male  70.0
745               Crosby, Capt. Edward Gifford    male  70.0
829  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0
851                        Svensson, Mr. Johan    male  74.0


Analyzing the data
Many machine learning models only accept numerical inputs. Thus, for string-typed data such as Sex, we need to convert the values to numbers.

# map female to 0 and male to 1
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

The map() function applies a mapping (here a dictionary with keys 'female' and 'male') to every element of the Series.
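As a standalone illustration of Series.map with a dictionary (a toy Series, not the Titanic data):

```python
import pandas as pd

# a mapping from category to number, applied element-wise
s = pd.Series(['female', 'male', 'male'])
codes = s.map({'female': 0, 'male': 1})
print(codes.tolist())  # [0, 1, 1]
```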

Filling in missing data becomes easier with Pandas because we can use the methods provided by the package. Here we bin the passengers by gender and passenger class, and fill each missing age with the median age of its bin.

# fill in missing ages
# for each passenger without an age, fill the median age
# of his/her passenger class
median_ages = np.zeros((2,3))
for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i, j] = df[(df['Gender'] == i) &
                               (df['Pclass'] == j + 1)]['Age'].dropna().median()

# create a new column to fill the missing age (for caution)
df['AgeFill'] = df['Age']
# each column is a pandas Series, so the data cannot be accessed by
# position alone like df[2, 3]; we must provide the column label and
# use the .loc indexer to locate the data, e.g., df.loc[0, 'Age'],
# or df['Age'][0]
for i in range(2):
    for j in range(3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j + 1),
               'AgeFill'] = median_ages[i, j]
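As an aside, the same fill can be written more compactly with groupby and transform; this is just a sketch of an alternative on toy data, not the approach used in the rest of the post:

```python
import numpy as np
import pandas as pd

# a toy frame (hypothetical values, not the Titanic set)
toy = pd.DataFrame({'Gender': [0, 0, 0, 1, 1, 1],
                    'Pclass': [1, 1, 1, 1, 1, 1],
                    'Age':    [30.0, np.nan, 40.0, 20.0, np.nan, 24.0]})
# fill each missing age with the median age of its (Gender, Pclass) group
toy['AgeFill'] = toy['Age'].fillna(
    toy.groupby(['Gender', 'Pclass'])['Age'].transform('median'))
print(toy['AgeFill'].tolist())  # [30.0, 35.0, 40.0, 20.0, 22.0, 24.0]
```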


We fill the missing Embarked values with the most common boarding place. In statistics, the mode is the most frequent element in the data set.


# fill the missing Embarked with the most common boarding place
# mode() returns the most frequent element(s) in the data
# when there is a tie it may return several values, so take the
# first one with .iloc[0]
if len(df.Embarked[df.Embarked.isnull()]) > 0:
    # assign through .loc to avoid chained-assignment pitfalls
    df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode().iloc[0]

# enumerate() pairs an index with each element, e.g.,
# list(enumerate(np.unique(df.Embarked))) gives [(0, 'C'), (1, 'Q'), (2, 'S')]
# (np.unique returns the ports sorted)
# set up a dictionary mapping each port name to its index
Ports_dict = {name: i for i, name in enumerate(np.unique(df.Embarked))}
df['EmbarkFill'] = df.Embarked.map(lambda x: Ports_dict[x]).astype(int)

Drop the unwanted columns.

df = df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis=1)

We need to do the same thing for test data. I will omit that part here, but you can find the full source code on my Github.

Training the model
The model we are going to build is called a random forest. Random forest is an ensemble learning method: it constructs a collection of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees. The training set for each decision tree is produced by bootstrap resampling from the original dataset.
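The bootstrap step can be illustrated in a couple of lines; this is only a sketch of the resampling idea, not what scikit-learn actually does internally:

```python
import numpy as np

rng = np.random.RandomState(0)
data = np.arange(10)
# each tree trains on a sample drawn with replacement, same size as the original
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(len(bootstrap_sample))             # 10
print(len(np.unique(bootstrap_sample)))  # usually < 10: some rows repeat, some are left out
```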

We use scikit-learn to build the model and predict the data. Since scikit-learn only works with numpy arrays, after we clean the data with Pandas, we need to convert the data to numpy arrays.

# convert the data to numpy arrays
# (ids, the PassengerId column of the test set, was saved during the
# omitted cleaning step, before that column was dropped)
training_data = df.values
test_data = test_df.values

Then we build a random forest object and train the model.

# train the data using random forest
# n_estimators: number of trees in the forest, this number affects the prediction
forest = RandomForestClassifier(n_estimators=150)
# build the forest
# X: array-like or sparse matrix of shape = [n_samples, n_features]
# y: array-like, shape = [n_samples], target values/class labels
forest = forest.fit(training_data[0::, 1::], training_data[0::, 0])
output = forest.predict(test_data).astype(int)

Write the output to a csv file using Python's csv package.

# write the output to a new csv file
predictions_file = open("predictByRandomForest.csv", 'w', newline='')
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
open_file_object.writerows(zip(ids, output))
predictions_file.close()


My score is 0.76555 on the leaderboard, around average, not too bad for a first try.

A few thoughts
Apparently, choosing the right model is important; moreover, how you tune the input parameters (e.g., n_estimators) also matters.

As for the data, I still need to learn which features are most relevant for the "best" fit. Including more features can sometimes yield more correct predictions, but it may also cause overfitting.
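One standard way to probe both questions is cross-validation. Below is a sketch using scikit-learn's cross_val_score on synthetic data (the features and labels here are made up, not the Titanic set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic features and labels, just to demonstrate the mechanics
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# compare a few settings of n_estimators by 5-fold cross-validation
for n in (10, 50, 150):
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=n, random_state=0), X, y, cv=5)
    print(n, scores.mean())
```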

Wednesday, June 10, 2015

First touch in data science (Titanic project on Kaggle) Part I: a simple model

Right after I became Dr. Young, I decided to pick up something I had always wanted to do yet never had enough time to work on: machine learning and data analytics.

Kaggle is a great source to start with. Besides the active competitions, they provide several entry-level projects that include tutorials. I started with the first one: Titanic.

The full description can be found here. In short, given the data of 891 passengers and if they have survived or not, predict what sorts of people were likely to survive.

I used Python to finish this project. Python has good libraries for data analysis and machine learning. Moreover, I personally find it a rather fast language to work in, which helps when dealing with large data sets.


Understand the data
The first thing to do when starting with data science is to read and understand the data. What we want is to determine which variable(s) are strongly correlated with the ultimate survival. Part of the data is shown in the following figure.





In data science, features refers to the variables given in the data. In the Titanic dataset, Pclass, Name, Sex, Age, etc. are all features. Labels are the outcomes. In this dataset, the label is survived (1) or not survived (0), so it is a binary classification problem.

When a huge dataset with many features is handed to you, some features are strongly correlated with the label and some are not. It would help to have more information, but without it, starting from intuition is not a bad idea. In the Titanic case, it is plausible that Sex, Age, and Pclass are more relevant than, say, Embarked (the place where the passenger boarded).

Reading the data to Python
Python provides a library to read csv files, and the numpy library provides handy functions to analyze the data. The scripts provided here are in Python 3.4; for Python 2.x versions, see the Kaggle tutorial.

import csv
import numpy as np
training_object = csv.reader(open('/path/train.csv', 'r'))
# use next() rather than calling __next__() directly to skip the header row
training_header = next(training_object)
data = []
for row in training_object:
    data.append(row)
# create a numpy multidimensional array object
data = np.array(data)


Python has a very convenient way to slice sequences. For example, data[:2] gives you the first two elements and data[-1] gives you the last element:


 >>> data = [1, 2, 3, 4, 5]  
 >>> data  
 [1, 2, 3, 4, 5]  
 >>> data[:2]  
 [1, 2]  
 >>> data[:-2]  
 [1, 2, 3]  
 >>> data[-1]  
 5  
 >>> data[-2:]  
 [4, 5]  



Analyzing the data
Here we try to build a simple model that uses Fare, Sex, and Pclass (passenger class). To see the names of all the features, call training_header:

>>> training_header
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Since each passenger paid a different fare to board, we bin the fare prices so that we can classify the passengers by fare bin. The csv reader returns data as strings, so we need to convert the fares to float.

fare_ceiling = 40
# fares of $40 or more are capped at $39
# so that we can set 4 bins of equal size
# i.e., $0-9, $10-19, $20-29, $30-39
data[data[0::, 9].astype(np.float) >= fare_ceiling, 9] = fare_ceiling - 1.0

# basically make 4 equal bins
fare_bracket_size = 10
# use integer division so the result is an int (in Python 3, / returns a float)
number_of_price_brackets = fare_ceiling // fare_bracket_size

# np.unique() return an array of unique elements in the object
# get the length of that array
number_of_classes = len(np.unique(data[0::, 2]))


data[0::, 9] selects every row of the column with index 9 (Fare). Numpy has a lovely way to select data. In the above code:

data[data[0::, 9].astype(np.float) >= fare_ceiling, 9]

selects the fares (column 9) of all rows where the fare is greater than or equal to fare_ceiling.
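The same select-and-assign pattern on a toy fare array:

```python
import numpy as np

# toy fares; cap everything at 39.0 the same way the post caps the real data
fares = np.array([7.25, 71.83, 8.05, 512.33])
fares[fares >= 40.0] = 39.0
print(fares.tolist())  # [7.25, 39.0, 8.05, 39.0]
```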

>>The Numpy.unique() function
This is a very interesting function. I looked at its source because I wanted to see if it uses the brute-force iteration approach. Apparently it is much smarter. The full source code can be found here (referred to Codatlas). Here I simplify the implementation for the purpose of explanation.

Convert the input to a NumPy array and flatten it into a one-dimensional array.

>>> ar = [1, 1, 2, 3, 3, 3, 2, 2, 2]
>>> ar = np.asanyarray(ar).flatten()
>>> ar
array([1, 1, 2, 3, 3, 3, 2, 2, 2])


Sort the array.

>>> ar.sort()
>>> ar
array([1, 1, 2, 2, 2, 2, 3, 3, 3])

Here comes the interesting part.

>>> aux = ar
>>> flag = np.concatenate(([True], aux[1:] != aux[:-1]))
>>> flag
array([ True, False,  True, False, False, False,  True, False, False], dtype=bool)

If we print aux[1:] and aux[:-1]:

>>> aux[1:]
array([1, 2, 2, 2, 2, 3, 3, 3])
>>> aux[:-1]
array([1, 1, 2, 2, 2, 2, 3, 3])

This operation is similar to shifting the array by one position. If there were no repeated elements in the array, aux[1:] != aux[:-1] would return True at every position; wherever an element repeats its predecessor, it returns False.




Generate the return array based on the flag array.

>>> ret = aux[flag]
>>> ret
array([1, 2, 3])
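Putting those steps together gives a miniature version of the idea behind np.unique (a deliberate simplification: the real implementation also handles return indices, counts, and other options):

```python
import numpy as np

def my_unique(ar):
    # flatten, sort, then keep each element that differs from its predecessor
    aux = np.asanyarray(ar).flatten()
    aux.sort()
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))
    return aux[flag]

print(my_unique([1, 1, 2, 3, 3, 3, 2, 2, 2]).tolist())  # [1, 2, 3]
```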


<<Back to Analyzing the data

The next step is to calculate the statistics based on the features we have chosen: for each sex, passenger class, and fare bin, compute the fraction of passengers who survived.


# initialize the survival table with all zeros
survival_table = np.zeros((2, number_of_classes, number_of_price_brackets))

for i in range(number_of_classes):
    for j in range(int(number_of_price_brackets)):

        women_only_stats = data[(data[0::, 4] == "female")
                                & (data[0::, 2].astype(np.float) == i+1)  # i starts from 0
                                # fare at or above the lower bound of the current bin
                                & (data[0:, 9].astype(np.float) >= j*fare_bracket_size)
                                # fare below the lower bound of the next bin
                                & (data[0:, 9].astype(np.float) < (j+1)*fare_bracket_size), 1]

        men_only_stats = data[(data[0::, 4] != "female")
                              & (data[0::, 2].astype(np.float) == i + 1)
                              & (data[0:,9].astype(np.float) >= j * fare_bracket_size)
                              & (data[0:,9].astype(np.float) < (j + 1) * fare_bracket_size), 1]
        survival_table[0, i, j] = np.mean(women_only_stats.astype(np.float))
        survival_table[1, i, j] = np.mean(men_only_stats.astype(np.float))
# if nobody satisfies the criteria, the table will return a NaN
# since the divisor is zero
# NaN is not equal to itself, so this test finds and zeroes the NaN entries
survival_table[survival_table != survival_table] = 0


Since Survived only contains 0 and 1, the probability of surviving in a given bin is (number of survivors) / (total number of passengers), i.e., the mean.
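A tiny numeric check of that identity (hypothetical outcomes for one bin):

```python
import numpy as np

survived = np.array([1, 0, 1, 1])  # hypothetical outcomes for one bin
print(survived.sum() / len(survived))  # 0.75
print(np.mean(survived))               # 0.75: the mean equals the survival rate
```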

Again, Survived only contains 0 and 1, thus we assume any probability greater than or equal to 0.5 should predict a survival.

# assume any probability >= 0.5 should result in predicting survival
# otherwise not
survival_table[survival_table < 0.5] = 0
survival_table[survival_table >= 0.5] = 1


Predicting the data
Now we need to use the table to predict the test data. We use the csv module to read the test file and to create a new file for writing the predictions.

test_file = open('/path/test.csv')
test_object = csv.reader(test_file)
test_header = next(test_object)
prediction_file = open("/path/genderClassModel.csv", 'w', newline='')
p = csv.writer(prediction_file)
p.writerow(["PassengerId", "Survived"])


The original tutorial provided by Kaggle uses a loop to determine which bin a passenger's fare falls into. I personally don't like this approach: it's slow. Alternatively, we can calculate the bin directly by dividing the fare by fare_bracket_size (10 in this case).
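Here is the division trick on a few sample fares, with the ceiling rule applied first (fare_bin is a hypothetical helper just for this illustration):

```python
fare_bracket_size = 10
fare_ceiling = 40

def fare_bin(fare):
    # fares at or above the ceiling go into the last bin (index 3)
    if fare >= fare_ceiling:
        return fare_ceiling // fare_bracket_size - 1
    return int(fare / fare_bracket_size)

print([fare_bin(f) for f in (7.25, 14.45, 39.99, 71.28)])  # [0, 1, 3, 3]
```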


for row in test_object:
    # for each passenger, find the price bin where the passenger belongs
    try:
        fare = float(row[8])
        # assign the passenger to the last bin if the fare he/she paid
        # was at or above the fare ceiling
        if fare >= fare_ceiling:
            bin_fare = int(number_of_price_brackets) - 1
        else:
            bin_fare = int(fare / fare_bracket_size)
    except ValueError:
        # if the fare is missing, bin according to Pclass
        # (class 1 -> bin 2, class 2 -> bin 1, class 3 -> bin 0)
        bin_fare = 3 - int(row[1])

    # row[0] is PassengerId, row[1] is Pclass, row[3] is Sex
    if row[3] == 'female':
        p.writerow([row[0], "%d" %
                    int(survival_table[0, int(row[1]) - 1, bin_fare])])
    else:
        p.writerow([row[0], "%d" %
                    int(survival_table[1, int(row[1]) - 1, bin_fare])])



test_file.close()
prediction_file.close()

In the next post, I will talk about using the Pandas library to clean the data and scikit-learn to train a machine learning model for the prediction.

The full src can be found on my Github.