
Wednesday, June 10, 2015

First touch in data science (Titanic project on Kaggle) Part I: a simple model

Right after I became Dr. Young, I decided to pick up the thing I had always wanted to do but never had enough time for: machine learning and data analytics.

Kaggle is a great place to start. Besides the active competitions, it provides several entry-level projects that include tutorials. I started with the first one: Titanic.

The full description can be found here. In short, given data on 891 passengers and whether each of them survived, predict what sorts of people were likely to survive.

I used Python for this project. It has good libraries for data analysis and machine learning. Moreover, I personally find Python quick to write, and libraries such as numpy do the heavy numerical work in compiled code, so it can be efficient when dealing with large data sets.


Understand the data
The first thing to do when starting a data science project is to read and understand the data. What we want to do is determine which variable(s) are strongly correlated with the ultimate survival. Part of the data is shown in the following figure.





In data science, features are the variables that are given in the data. In the Titanic dataset, Pclass, Name, Sex, Age, ..., are all features. Labels are the outcome. In this dataset, the labels are survived (1) or not survived (0), so it's a binary classification problem.

When you are given a huge dataset with lots of features, some features are strongly correlated with the label and some are not. It would be better if more information were provided; without it, however, intuition is not a bad place to start. In the Titanic situation, it is plausible that Sex, Age and Pclass are more relevant than, say, Embarked (the port where the passenger boarded).

Reading the data into Python
Python provides a library for reading csv files. Moreover, the numpy library provides handy functions to analyze the data. The scripts provided here are in Python 3.4; for scripts in Python 2.x, see the Kaggle tutorial.

import csv
import numpy as np
training_object = csv.reader(open('/path/train.csv', 'r', newline=''))
# the first row holds the column names
training_header = next(training_object)
# create a numpy multidimensional array object
data = []
for row in training_object:
    data.append(row)
data = np.array(data)
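
To see what this reading step produces without needing the real file, here is a minimal sketch that feeds csv.reader a small in-memory CSV. The three-column sample and the names sample_csv, rows and fares are assumptions for illustration only; the real train.csv has twelve columns.

```python
import csv
import io

import numpy as np

# Illustrative stand-in for train.csv (only three of the real columns).
sample_csv = "PassengerId,Survived,Fare\n1,0,7.25\n2,1,71.2833\n"

reader = csv.reader(io.StringIO(sample_csv))
header = next(reader)              # first row holds the column names
rows = np.array(list(reader))      # remaining rows become a 2-D array

print(header)                      # ['PassengerId', 'Survived', 'Fare']
print(rows.shape)                  # (2, 3)
print(rows.dtype.kind)             # U, i.e. every cell is a string

# Convert a single column to floats before doing arithmetic on it.
fares = rows[:, 2].astype(float)
print(fares)
```

Every cell comes back as a string, which is why the later snippets call astype() before comparing fares or averaging the Survived column.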


Python has a very handy way to slice sequences. For example, data[:2] gives you the first two elements and data[-1] gives you the last element:


 >>> data = [1, 2, 3, 4, 5]  
 >>> data  
 [1, 2, 3, 4, 5]  
 >>> data[:2]  
 [1, 2]  
 >>> data[:-2]  
 [1, 2, 3]  
 >>> data[-1]  
 5  
 >>> data[-2:]  
 [4, 5]  



Analyzing the data
Here we try to build a simple model that assumes survival depends on Fare, Sex and Pclass (passenger class). To see the names of all the features, print training_header:

>>> training_header
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Since each person paid a different fare to board, we need to bin the fares so that we can classify the passengers by fare bin.
The csv reader in Python reads every value as a string by default, so we need to convert the fares to float.

fare_ceiling = 40
# any fare of 40 or more is set to 39
# so that we can use 4 bins of equal size
# i.e., $0-9, $10-19, $20-29, $30-39
# (the builtin float replaces np.float, which newer numpy versions removed)
data[data[0::, 9].astype(float) >= fare_ceiling, 9] = fare_ceiling - 1.0

# basically make 4 equal bins
fare_bracket_size = 10
# integer division, so the result can be used as an array dimension
number_of_price_brackets = fare_ceiling // fare_bracket_size

# np.unique() return an array of unique elements in the object
# get the length of that array
number_of_classes = len(np.unique(data[0::, 2]))


data[0::, 9] selects every row, from the first to the last, of the column at index 9 (Fare, which is the tenth column because indexing starts at 0). Numpy has a lovely way to select data. In the above code:

data[data[0::, 9].astype(float) >= fare_ceiling, 9]

selects the fare (column index 9) of every row whose fare is greater than or equal to fare_ceiling.
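
The same selection can be tried on a tiny array. This is only a sketch: the array toy and its values are made up for illustration, with the fare stored as a string in column 1 rather than column 9.

```python
import numpy as np

# Hypothetical mini-dataset: column 0 is a label, column 1 is the fare (as strings).
toy = np.array([["a", "7.25"], ["b", "71.28"], ["c", "40"], ["d", "15.5"]])
fare_ceiling = 40

# Comparing the converted column yields one boolean per row.
mask = toy[:, 1].astype(float) >= fare_ceiling
print(mask)                        # [False  True  True False]

# Indexing with the mask touches only the rows where it is True.
toy[mask, 1] = fare_ceiling - 1.0
print(toy[:, 1].astype(float))
```

Because the array holds strings, the assigned value is stored as text as well, so it has to be converted again with astype() before the next comparison.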

>>The Numpy.unique() function
This is a very interesting function. I looked at its source because I wanted to see whether it uses brute-force iteration. Apparently it is much smarter. The full source code can be found here (via Codatlas). Here I simplify the implementation for the purpose of explanation.

Convert the input to a Numpy array and flatten it to a one-dimensional array.

>>> ar = [1, 1, 2, 3, 3, 3, 2, 2, 2]
>>> ar = np.asanyarray(ar).flatten()
>>> ar
array([1, 1, 2, 3, 3, 3, 2, 2, 2])


Sort the array.

>>> ar.sort()
>>> ar
array([1, 1, 2, 2, 2, 2, 3, 3, 3])

Here comes the interesting part.

>>> aux = ar
>>> flag = np.concatenate(([True], aux[1:] != aux[:-1]))
>>> flag
array([ True, False,  True, False, False, False,  True, False, False], dtype=bool)

If we print aux[1:] and aux[:-1]:

>>> aux[1:]
array([1, 2, 2, 2, 2, 3, 3, 3])
>>> aux[:-1]
array([1, 1, 2, 2, 2, 2, 3, 3])

This operation compares each element with the one before it, as if the array were shifted right by one position. If the array has no repeated elements, aux[1:] != aux[:-1] returns True at every position; otherwise, it returns False wherever an element repeats its predecessor.




Generate the return array based on the flag array.

>>> ret = aux[flag]
>>> ret
array([1, 2, 3])
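
Putting the steps together gives the following sketch. It is a simplified illustration of the idea walked through above, not the actual Numpy source; the function name unique_sorted is my own.

```python
import numpy as np


def unique_sorted(values):
    """Return the sorted unique elements of values (simplified sketch)."""
    # flatten() returns a copy, so sorting in place leaves the input untouched
    ar = np.asanyarray(values).flatten()
    if ar.size == 0:
        return ar
    ar.sort()
    # True at the first element and wherever an element differs from its predecessor
    flag = np.concatenate(([True], ar[1:] != ar[:-1]))
    return ar[flag]


sample = [1, 1, 2, 3, 3, 3, 2, 2, 2]
print(unique_sorted(sample))       # [1 2 3]
print(np.unique(sample))           # [1 2 3]
```

The sort dominates the cost, so the whole thing runs in O(n log n) without any explicit Python loop.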


<<Back to Analyzing the data

The next step is to calculate the statistics based on the features we have chosen, i.e., for each sex, collect the Survived values of everyone in a particular passenger class and fare bin, and average them.


# initialize the survival table with all zeros
survival_table = np.zeros((2, number_of_classes, int(number_of_price_brackets)))

for i in range(number_of_classes):
    for j in range(int(number_of_price_brackets)):

        women_only_stats = data[(data[0::, 4] == "female")
                                # i starts from 0, passenger classes start from 1
                                & (data[0::, 2].astype(float) == i + 1)
                                # fare was greater than or equal to the least fare in current bin
                                & (data[0::, 9].astype(float) >= j * fare_bracket_size)
                                # fare was less than the least fare in next bin
                                & (data[0::, 9].astype(float) < (j + 1) * fare_bracket_size), 1]

        men_only_stats = data[(data[0::, 4] != "female")
                              & (data[0::, 2].astype(float) == i + 1)
                              & (data[0::, 9].astype(float) >= j * fare_bracket_size)
                              & (data[0::, 9].astype(float) < (j + 1) * fare_bracket_size), 1]
        survival_table[0, i, j] = np.mean(women_only_stats.astype(float))
        survival_table[1, i, j] = np.mean(men_only_stats.astype(float))
# if nobody satisfies the criteria, np.mean returns NaN
# since the divisor is zero
survival_table[np.isnan(survival_table)] = 0


Since Survived only contains 0 and 1, the probability of surviving for a given sex, passenger class and fare bin is calculated by:
number of passengers who survived / total number of passengers
i.e., the mean.
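
A quick check of that identity, using a made-up group of four passengers (the array survived below is hypothetical, stored as strings the way csv.reader returns them):

```python
import numpy as np

# Hypothetical Survived values for one group of passengers.
survived = np.array(["1", "0", "1", "1"])

as_numbers = survived.astype(float)
rate_from_counts = as_numbers.sum() / len(as_numbers)
rate_from_mean = np.mean(as_numbers)

print(rate_from_counts, rate_from_mean)   # 0.75 0.75
```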

Again, Survived only contains 0 and 1, thus we assume any probability greater than or equal to 0.5 should predict a survival.

# assume any probability >= 0.5 should result in predicting survival
# otherwise not
survival_table[survival_table < 0.5] = 0
survival_table[survival_table >= 0.5] = 1
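
As an alternative sketch (not what the script above does), the same thresholding can be written as a single comparison that leaves the original rates untouched. The 2 x 2 array rates is made up for illustration.

```python
import numpy as np

# Hypothetical table of survival rates.
rates = np.array([[0.9, 0.5], [0.49, 0.0]])

# True where the rate is at least 0.5, then converted to 1 and 0.
predictions = (rates >= 0.5).astype(int)
print(predictions)
```

Keeping the rates in a separate array can be handy if you later want to try a different cutoff.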


Predicting the data
Now we need to use the table to make predictions for the test data. We use the csv reader to read the test file and a csv writer to create a new file for the predictions.

test_file = open('/path/test.csv', newline='')
test_object = csv.reader(test_file)
test_header = next(test_object)
# newline='' stops the csv writer from emitting blank lines on some platforms
prediction_file = open("/path/genderClassModel.csv", 'w', newline='')
p = csv.writer(prediction_file)
p.writerow(["PassengerId", "Survived"])


The original tutorial provided by Kaggle uses a loop to determine whether a passenger's fare falls in a certain bin. I personally don't like this approach: it's slow. Instead, we can compute the bin directly by dividing the fare by fare_bracket_size (10 in this case).
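
The division idea can be checked in isolation with a small helper. This is a sketch only: fare_to_bin and its parameters bracket_size and ceiling are hypothetical names, with defaults mirroring fare_bracket_size and fare_ceiling, and it assumes fares are never negative.

```python
def fare_to_bin(fare, bracket_size=10, ceiling=40):
    """Map a fare to a bin index: 0 for $0-9, 1 for $10-19, and so on."""
    last_bin = ceiling // bracket_size - 1
    # fares at or above the ceiling all land in the last bin
    if fare >= ceiling:
        return last_bin
    return int(fare // bracket_size)


for fare in (7.25, 10.0, 39.99, 40.0, 512.33):
    print(fare, fare_to_bin(fare))
```

Note the comparison is >= rather than >: a fare of exactly 40 would otherwise map to bin 4, which does not exist in a table with four bins.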


for row in test_object:
    # index of the passenger class along the second axis of the table
    pclass_index = int(row[1]) - 1
    # for each passenger, find the fare bin the passenger belongs to
    try:
        fare = float(row[8])
    except ValueError:
        # if the fare is missing, bin the fare according to Pclass
        # (do not skip the row, every passenger needs a prediction)
        bin_fare = 3 - int(row[1])
    else:
        if fare >= fare_ceiling:
            # fares at or above the ceiling go to the last bin
            bin_fare = int(number_of_price_brackets) - 1
        else:
            bin_fare = int(fare / fare_bracket_size)

    # first axis of the table: 0 for women, 1 for men
    sex_index = 0 if row[3] == 'female' else 1
    p.writerow([row[0], "%d" %
                int(survival_table[sex_index, pclass_index, bin_fare])])



test_file.close()
prediction_file.close()

In the next post, I will talk about using the Pandas library to clean the data and Scikit-learn to train a machine learning model for the prediction.

The full src can be found on my Github.
