Saturday, June 27, 2015

Natural Language Processing: the IMDB movie reviews

Natural language processing (NLP) deals with problems involving text, usually with machine learning algorithms. Many machine learning models require quantified features, which poses a central challenge for NLP: how to convert large amounts of text into a representation the computer can understand.

In this blog I use the IMDB movie reviews and three different approaches to classify whether a review is positive or negative. The first one, which creates features from word occurrence counts, and the second one, which uses Google's word2vec to map each word to a vector, are based on Kaggle's Bag of Words Meets Bags of Popcorn tutorial. The third one uses doc2vec, an extension of word2vec proposed by Le and Mikolov [1].

All the src can be found on my Github.


Data cleaning and text processing

A paragraph of text content, or corpus, may contain HTML tags, punctuation, numbers and symbols (e.g., smileys) that affect the prediction. Moreover, it needs to be broken down into single words before it can be fed to any feature-creation method.


HTML tags
Some reviews contain HTML tags such as <br/>, <pre>, etc. In this blog, I use the Python library Beautiful Soup to remove them.

review_text = BeautifulSoup(raw_review).get_text()



Break down to words

A long paragraph of text is hard for the computer to process. Thus, we need to break the paragraph into single words. It is true that the order of the words may affect the meaning of the content, and thus the prediction; as we will see later, doc2vec takes word order into account and has been shown to be the best of these methods for IMDB movie review classification. Moreover, punctuation such as ":)" (smileys) and numbers may also affect the classification. The original Kaggle tutorial does not include options for keeping smileys and numbers; for better performance, I added these two in my code.

The code uses regular expressions to find patterns (numbers or punctuation) and replace them with white space so the text can be split up later.
Python provides the re library for regular expression operations.


smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
:D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D :( :/ :-( :'( :D :P""".split()
smiley_pattern = "|".join(map(re.escape, smileys))
# re.sub() replaces the pattern with the desired character/string
# [^] matches a single character that is not contained within the brackets
if remove_numbers and remove_smileys:
    # any character that is not in a to z and A to Z (non-text)
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
elif remove_smileys:
    # numbers are also included
    review_text = re.sub("[^a-zA-Z0-9]", " ", review_text)
elif remove_numbers:
    review_text = re.sub("[^a-zA-Z" + smiley_pattern + "]", " ", review_text)
else:
    review_text = re.sub("[^a-zA-Z0-9" + smiley_pattern + "]", " ", review_text)

After we remove the unnecessary symbols, we split the paragraph into single words.

# split in to a list of words
words = review_text.lower().split()

This operation gives a list of single words.


Remove stop words
Stop words are high-frequency words that do not carry much meaning. Examples include "I", "you", "this", "is", etc. Including such stop words may affect the model prediction. Here, I use the Natural Language Toolkit (NLTK) to get all the common stop words. It is not included in Python's default library, so we need to install it and its data first. Installing it from an IDE did not work for me; please check the NLTK documentation for installation and for downloading the data. A separate window will pop up when you type the nltk.download() command in interactive mode and ask you to select the desired data. I selected all, but you may select a specific corpus (e.g., stopwords).

After installing NLTK and downloading the data, import the stopwords corpus. Now all we need to do is remove the stop words listed in stopwords from the list of words we have created from the review.

from nltk.corpus import stopwords
if remove_stopwords:
    # create a set of all stop words
    stops = set(stopwords.words("english"))
    # remove stop words from the list
    words = [w for w in words if w not in stops]

The final list of words should look like this:

>>> words = processData.review_to_words(train["review"][0], True, False, False)
>>> words
['stuff', 'going', 'moment', 'mj', "i've", 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', "mj's", 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', "m'kay", 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', '20', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 'wants', 'mj', 'dead', 'bad', 'beyond', 'mj', 'overheard', 'plans', 'nah', 'joe', "pesci's", 'character', 'ranted', 'wanted', 'people', 'know', 'supplying', 'drugs', 'etc', 'dunno', 'maybe', 'hates', "mj's", 'music', 'lots', 'cool', 'things', 'like', 'mj', 'turning', 'car', 'robot', 'whole', 'speed', 'demon', 'sequence', 'also', 'director', 'must', 'patience', 'saint', 'came', 'filming', 'kiddy', 'bad', 'sequence', 'usually', 'directors', 'hate', 'working', 'one', 'kid', 'let', 'alone', 'whole', 'bunch', 'performing', 'complex', 'dance', 'scene', 'bottom', 'line', 'movie', 'people', 'like', 'mj', 'one', 'level', 'another', '(which', 'think', 'people)', 'stay', 'away', 'try', 'give', 'wholesome', 'message', 'ironically', "mj's", 'bestest', 'buddy', 'movie', 'girl', 'michael', 'jackson', 'truly', 'one', 'talented', 'people', 'ever', 'grace', 'planet', 'guilty', 'well', 'attention', "i've", 'gave', 'subject', 'hmmm', 'well', "don't", 'know', 'people', 'different', 'behind', 'closed', 'doors', 'know', 'fact', 'either', 'extremely', 'nice', 'stupid', 'guy', 'one', 'sickest', 'liars', 'hope', 'latter']



Processing all the reviews:

clean_train_reviews = []
print("Cleaning and parsing training data", end="\n")
for i in range(0, num_reviews):
    # if (i+1) % 1000 == 0:
    # print("Review %d of %d\n" % (i+1, num_reviews))
    clean_train_reviews.append(" ".join(processData.review_to_words(train["review"][i], True,False,False)))


The review_to_words() function contains all operations to process a review.

def review_to_words(raw_review, remove_stopwords=False, remove_numbers=False, remove_smileys=False):
    # use the BeautifulSoup library to remove the HTML/XML tags (e.g., <br/>)
    review_text = BeautifulSoup(raw_review).get_text()
    # emotional symbols may affect the meaning of the review
    smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
    :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D :( :/ :-( :'( :D :P""".split()
    smiley_pattern = "|".join(map(re.escape, smileys))
    # [^] matches a single character that is not contained within the brackets
    # re.sub() replaces the pattern with the desired character/string
    if remove_numbers and remove_smileys:
        # any character that is not in a to z and A to Z (non-text)
        review_text = re.sub("[^a-zA-Z]", " ", review_text)
    elif remove_smileys:
        # numbers are also included
        review_text = re.sub("[^a-zA-Z0-9]", " ", review_text)
    elif remove_numbers:
        review_text = re.sub("[^a-zA-Z" + smiley_pattern + "]", " ", review_text)
    else:
        review_text = re.sub("[^a-zA-Z0-9" + smiley_pattern + "]", " ", review_text)
    # split into a list of words
    words = review_text.lower().split()
    if remove_stopwords:
        # create a set of all stop words
        stops = set(stopwords.words("english"))
        # remove stop words from the list
        words = [w for w in words if w not in stops]
    # for bag of words, return a string that is the concatenation of all the meaningful words
    # for word2vec, return a list of words
    # return " ".join(words)
    return words

Note:
The Kaggle tutorial mentioned that:
If you are appending a list of lists to another list of lists, "append" will only append the first list; you need to use "+=" in order to join all of the lists at once.
This is not quite the case in Python 3. In Python 3, += joins all the lists together (flattening one level), while append adds each list as a single element. See this post.
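For example, a quick check in the Python 3 interpreter:

>>> a = [[1, 2]]
>>> b = [[3, 4], [5, 6]]
>>> a += b          # the inner lists of b are joined onto a
>>> a
[[1, 2], [3, 4], [5, 6]]
>>> a = [[1, 2]]
>>> a.append(b)     # b itself is added as a single element
>>> a
[[1, 2], [[3, 4], [5, 6]]]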

This part of the code is included in processData.py.



Bag of words model

Now we have a list of words for each review. The next thing we need to do is to create features for the model we are going to train, based on all the reviews (lists of words) we have. The simplest way is to learn a vocabulary (the bag-of-words model) from all the reviews, use this vocabulary as the feature set, and count the occurrence of each vocabulary word in each list of words. The result is a vector whose elements are the occurrence counts of the vocabulary words, and this vector is used as the feature vector for training.
For example,
dictionary {the, cat, sat, on, hat, dog, likes, and}
sentence1: the cat sat on the hat {2, 1, 1, 1, 1, 0, 0, 0}
sentence2: the dog likes the cat and the hat {3, 1, 0, 0, 1, 1, 1, 1} 
I use the feature_extraction module from scikit-learn to create bag-of-words features. CountVectorizer converts a collection of text documents to a matrix of token counts. Here, we convert the list of lists of words (each review is a list) to a matrix: each row is a count vector for one review, and each column corresponds to one word in the vocabulary. Note that CountVectorizer has its own preprocessing options, which are definitely worth trying. :)



# max_features determine the maximum words that is taken into account, 5000 here
# e.g. dictionary {the, cat, sat, on, hat, dog, likes, and}
# sentence1: the cat sat on the hat {2, 1, 1, 1, 1, 0, 0, 0}
# sentence2: the dog likes the cat and the hat {3, 1, 0, 0, 1, 1, 1, 1}
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
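
The vectorizer can then be fit on the cleaned reviews to produce the actual feature matrix. A minimal sketch, assuming clean_train_reviews is the list of cleaned review strings built above:

# learn the vocabulary and transform the reviews into count vectors
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# convert the sparse matrix to a dense numpy array for the classifier
train_data_features = train_data_features.toarray()
# shape: (number of reviews, 5000)
print(train_data_features.shape)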

This part of the code is included in the nlp.py.


Word2Vec
Word2Vec is a project developed by Google that uses a neural-network implementation to learn distributed representations for words [2]. There are quite a few resources explaining the details of Word2Vec; I selected a few here.


In short, given a large body of text (e.g., Wiktionary), Word2Vec maps each word to a vector so that words can be compared or operated on. For example, "king" - "man" = "queen" - "woman".

The original Word2Vec is implemented in C. In Python, the Gensim package provides an excellent implementation of the project.

Word2Vec expects single sentences, each as a list of words. That means a review will be broken down into a list of lists, with each inner list holding the words of one sentence. So the question is, how do we determine what a sentence is? There are several different ways to mark the end of a sentence ("?", ".", "!", etc.); moreover, capitalization at the beginning of a word may also indicate that the previous word ends a sentence. Thus, it is hard to do manually. Python's NLTK package provides the punkt module for sentence splitting. If you downloaded all packages when you installed NLTK, you can use punkt directly; otherwise download the module first.
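
For example, the pre-trained English punkt tokenizer can be loaded and used like this (assuming the punkt data has been downloaded via nltk.download()):

import nltk.data
# load the pre-trained punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize("This movie was great! I would watch it again.")
# sentences == ['This movie was great!', 'I would watch it again.']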

Processing each review is the same as shown previously. However, Kaggle's tutorial mentions that it is better not to remove stop words, because the algorithm relies on the broader context of the sentence in order to produce high-quality word vectors. It may also be helpful not to remove numbers and smileys. review_to_words() can be used directly with its default options.

I wrote the function review_to_sentences() to process each review.

def review_to_sentences(review, tokenizer, remove_stopwords=False, remove_numbers=False, remove_smileys=False):
    """
    This function splits a review into parsed sentences
    :param review:
    :param tokenizer:
    :param remove_stopwords:
    :return: sentences, list of lists
    """
    # review.strip() removes leading/trailing white space from the review
    # use the tokenizer to split the review into sentences
    raw_sentences = tokenizer.tokenize(review.strip())

    # equivalent list comprehension:
    # cleaned_review = [review_to_words(sentence, remove_stopwords, remove_numbers, remove_smileys)
    #                   for sentence in raw_sentences if len(sentence) > 0]
    cleaned_review = []
    for sentence in raw_sentences:
        if len(sentence) > 0:
            # append keeps each sentence as its own list of words,
            # so the function returns a list of lists as documented
            cleaned_review.append(review_to_words(sentence, remove_stopwords, remove_numbers, remove_smileys))

    return cleaned_review
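
A sketch of how the sentence corpus passed to Word2Vec below might be assembled (bag_sentences is the name used in the training code; the loop itself is an assumption about the full script):

bag_sentences = []
for review in train["review"]:
    # += joins the per-review lists of sentences into one flat list of sentences
    bag_sentences += review_to_sentences(review, tokenizer)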

Before you start training the model, it is better to install Cython for better performance. As the gensim documentation notes, make sure you have a C compiler before installing gensim, in order to use the optimized (compiled) word2vec training (a 70x speedup compared to the plain NumPy implementation [3]).

The source code of Word2Vec in Gensim can be found here. They also provide tutorials in the doc string, so it's worth taking a look. :)

By default the model is trained with the skip-gram algorithm (see reference [1]); you can also train it with continuous bag of words (CBOW) by setting sg=0 (a sketch follows the training code below). Refer to this and this for detailed explanations of skip-gram and CBOW. According to Mikolov (who developed word2vec), skip-gram works well with small amounts of training data and represents even rare words or phrases well, while CBOW is several times faster to train and gives slightly better accuracy for frequent words. Based on the Kaggle tutorial, skip-gram produces better results on this data set.


num_features = 500  # word vector dimensionality
# minimum word count: any word that does not occur at least this many times
# across all documents is ignored
min_word_count = 40
num_workers = 4  # Number of threads to run in parallel
context = 10  # Context window size
downsampling = 1e-3  # Downsample setting for frequent words

print("Training model...")
model = word2vec.Word2Vec(bag_sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient
model.init_sims(replace=True)
# save the model for future use
model.save("Word2VectforNLPTraining")
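
As mentioned above, switching from skip-gram to CBOW only requires the sg flag; a minimal sketch reusing the same parameters:

# alternative: select CBOW explicitly by passing sg=0 (sg=1 is skip-gram)
model_cbow = word2vec.Word2Vec(bag_sentences, sg=0, workers=num_workers,
                               size=num_features, min_count=min_word_count,
                               window=context, sample=downsampling)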

This part of the code can be found in word2vecNLP.py


Explore the model
Since we have saved the model, we can load the model directly.

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load("Word2VectforNLPTraining") 

The model consists of a feature vector for each word in the vocabulary, stored in a numpy array called "syn0".

>>> print(type(model.syn0))
<class 'numpy.ndarray'>
# number of words, number of features
>>> model.syn0.shape
(17978, 500)
>>> model["man"]
array([-0.0173709 , -0.05453965, -0.01378504, -0.02687   , -0.0247492 ,
       -0.02725732, -0.08029163, -0.01303324, -0.01790693,  0.02459037,
       -0.0451758 ,  0.06946673,  0.00119434, -0.01014592,  0.00334688,
       ...
       -0.01781173, -0.05186122, -0.04420475,  0.00410226, -0.05667625,
        0.06580704, -0.00364238, -0.14961284,  0.02291572, -0.04049427,
       -0.0516507 ,  0.03579128,  0.00122541,  0.02547096,  0.03301932], dtype=float32)

The doesnt_match() function tries to deduce which word in a set is most dissimilar from the others.

>>> model.doesnt_match("man woman child kitchen".split())
'kitchen'
>>> model.doesnt_match("paris berlin london austria".split())
'paris'

most_similar() returns the most similar words and their scores based on the given criteria. The topn option determines how many of the top matches to return. Positive words contribute positively towards the similarity, negative words negatively.

>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)
[('princess', 0.4017685651779175), ('queen', 0.3796828091144562), ('prince', 0.36173444986343384), ('mistress', 0.3507348895072937), ('rudolf', 0.3303285539150238), ('maid', 0.32905253767967224), ('astor', 0.32380762696266174), ('throne', 0.31540173292160034), ('stepmother', 0.3121083974838257), ('antoinette', 0.31201446056365967)]

This part of the code is included in exploreWord2VecModel.py


Doc2Vec
Doc2Vec takes into account the word order. Details about Doc2Vec can be found in this blog and reference [1]. In gensim, Doc2Vec is implemented as a derived class from Word2Vec. A tutorial about Doc2Vec can be found here.

Instead of skip-gram and CBOW, Doc2Vec implements distributed memory (DM) and distributed bag of words (DBOW). The default algorithm is distributed memory (dm=1); by setting dm=0, you can use DBOW.

Doc2Vec requires each sentence to be a LabeledSentence object, which is different from Word2Vec. The easiest way is to create the LabeledSentence object for each sentence.

def labelizeReviews(reviewSet, labelType):
    """
    add label to each review
    :param reviewSet:
    :param labelType: the label prefix to be put on each review
    :return:
    """
    labelized = []
    for index, review in enumerate(reviewSet):
        labelized.append(doc2vec.LabeledSentence(words=review, labels=['%s_%s' % (labelType, index)]))
    return labelized
# the input to doc2vec is an iterator of LabeledSentence objects
# each consists of a list of words and a list of labels
labeled = labelizeReviews(labeled, 'LABELED')
unlabeled = labelizeReviews(unlabeled, 'UNLABELED')

Training part is similar to Word2Vec.

num_features = 500
# minimum word count: any word that does not occur at least this many times
# across all documents is ignored
min_word_count = 40
# the paper (http://arxiv.org/pdf/1405.4053v2.pdf) suggests 10 is the optimal
context = 10
#  threshold for configuring which higher-frequency words are randomly downsampled;
# default is 0 (off), useful value is 1e-5
# set the same as word2vec
downsampling = 1e-3
num_workers = 4  # Number of threads to run in parallel

# if sentence is not supplied, the model is left uninitialized
# otherwise the model is trained automatically
# https://www.codatlas.com/github.com/piskvorky/gensim/develop/gensim/models/doc2vec.py?line=192
model = doc2vec.Doc2Vec(size=num_features,
                        window=context, min_count=min_word_count,
                        sample=downsampling, workers=num_workers)

model.build_vocab(bag_labeled_sentence)
# gensim documentation suggests training over data set for multiple times
# by either randomizing the order of the data set or adjusting learning rate
# see here for adjusting learn rate: http://rare-technologies.com/doc2vec-tutorial/
# iterate 10 times
for it in range(10):
    # perm = np.random.permutation(bag_labeled_sentence.shape[0])
    model.train(np.random.permutation(bag_labeled_sentence))
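
The doc2vec tutorial linked in the comments above suggests an alternative to shuffling: manually decaying the learning rate between passes. A hedged sketch of that variant:

# control the learning rate schedule manually instead of relying on the default decay
model.alpha = 0.025
model.min_alpha = 0.025
for epoch in range(10):
    model.train(np.random.permutation(bag_labeled_sentence))
    model.alpha -= 0.002            # decrease the learning rate after each pass
    model.min_alpha = model.alpha   # fix the learning rate, no decay within the pass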

This part of the code can be found in doc2vecNLP.py

IMDB review classification

Word2Vec or Doc2Vec only lets us map words to vectors. In order to classify a whole review, we still need to determine the features of each review.

Vector Averaging
The easiest method is averaging the word vectors in each review. Each word is a vector of num_features dimensions (500 in my case); if we add up all the word vectors in the review and take the average, the review becomes a single vector of num_features dimensions.

For example,
review: "shirley is awesome"
"shirley" = "0.3 0.6 0.8"
"is" = "1.2 3.5 4.6"
"awesome" = "0.9 1.2 8.7"

then the vector of the review = "(0.3 + 1.2 + 0.9)/3 (0.6 + 3.5 + 1.2)/3 (0.8 + 4.6 + 8.7)/3" = “0.8 1.77 4.7”

def makeFeatureVec(review, model, num_features):
    """
    given a review, define the feature vector by averaging the feature vectors
    of all words that exist in the model vocabulary in the review
    :param review:
    :param model:
    :param num_features:
    :return:
    """

    featureVec = np.zeros(num_features, dtype=np.float32)
    nwords = 0

    # index2word is the list of the names of the words in the model's vocabulary.
    # convert it to set for speed
    vocabulary_set = set(model.index2word)

    # loop over each word in the review and add its feature vector to the total
    # if the word is in the model's vocabulary
    for word in review:
        if word in vocabulary_set:
            nwords = nwords + 1
            # add arguments element-wise
            # if x1.shape != x2.shape, they must be able to be casted
            # to a common shape
            featureVec = np.add(featureVec, model[word])
    featureVec = np.divide(featureVec,nwords)
    return featureVec

def getAvgFeatureVecs(reviewSet, model):

    # initialize variables
    counter = 0
    num_features = model.syn0.shape[1]
    reviewsetFV = np.zeros((len(reviewSet),num_features), dtype=np.float32)

    for review in reviewSet:
        reviewsetFV[counter] = makeFeatureVec(review, model, num_features)
        counter += 1
    return reviewsetFV
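
A sketch of how these helpers might be used (trainReviews and testReviews are assumed names for lists of reviews, each review being a list of words as returned by review_to_words):

# average the word vectors of every review in the training and test sets
trainDataVecs = getAvgFeatureVecs(trainReviews, model)
testDataVecs = getAvgFeatureVecs(testReviews, model)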

This part of the code can be found in w2vPredictVectorAveraging.py.

Clustering
The second method is fancier: it uses a clustering algorithm, specifically K-means, to assign each word to a centroid (cluster), and then uses the counts of the centroids appearing in a review (a "bag of centroids") as the feature vector of the review.

I use scikit-learn to perform the K-means algorithm.

from sklearn.cluster import KMeans
def kmeans(num_clusters, dataSet):
    # n_clusters: number of centroids
    # n_jobs: number of jobs running in parallel
    kmeans_clustering = KMeans(n_clusters=num_clusters)
    # Compute cluster centers and predict cluster index for each sample
    centroidIndx = kmeans_clustering.fit_predict(dataSet)

    return centroidIndx

After training, each word in the vocabulary is assigned to a centroid.
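
A sketch of how the clustering might be wired up and how the word-to-centroid map used below could be built (the choice of roughly 5 words per cluster follows the Kaggle tutorial; the variable names are assumptions, except index_word_map, which is used in the function below):

word_vectors = model.syn0                  # one row per vocabulary word
num_clusters = word_vectors.shape[0] // 5  # about 5 words per cluster
idx = kmeans(num_clusters, word_vectors)   # cluster index for each word
# map each vocabulary word to the centroid it was assigned to
index_word_map = dict(zip(model.index2word, idx))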

Now we need to create the feature vector for each review. We do this by creating a vector whose dimension is the number of clusters and, for each word in the review, incrementing the count of the centroid that word belongs to. For example,

vocabulary:
   word       centroid
"Shirley"         0
"Dora"            1
"is"              2
"awesome"         1
"fun"             0

Number of centroids: 3
So review
"Shirley is awesome" = {1, 1, 1}
"Shirley is fun" = {2, 0, 1}
"Dora is awesome" = {0, 2, 1}

The function create_bag_of_centroids() implements this method.

def create_bag_of_centroids(reviewData):
        """
        assign each word in the review to a centroid
        this returns a numpy array with the dimension as num_clusters
        each will be served as one feature for classification
        :param reviewData:
        :return:
        """
        featureVector = np.zeros(num_clusters, dtype=np.float)
        for word in reviewData:
            if word in index_word_map:
                index = index_word_map[word]
                featureVector[index] += 1
        return featureVector
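
A sketch of building the centroid-count feature matrix for a whole review set (again, trainReviews is an assumed name for a list of word lists):

# one row of centroid counts per review
train_centroids = np.zeros((len(trainReviews), num_clusters), dtype=np.float32)
for i, review in enumerate(trainReviews):
    train_centroids[i] = create_bag_of_centroids(review)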

This part of the code can be found in the w2vPredictClustering.py.

Classification
After you create feature vectors using any of the above methods, there are several supervised learning methods you can use to classify the reviews. Reference [1] uses logistic regression and claims a 94% test accuracy. I use a random forest and got on average a 0.84 score for all the methods mentioned above, except for Doc2Vec combined with clustering (only 0.73).

def rfClassifer(n_estimators, trainingSet, label, testSet):

    forest = RandomForestClassifier(n_estimators)
    forest = forest.fit(trainingSet, label)
    result = forest.predict(testSet)

    return result
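
A sketch of how the classifier might be called on the averaged word vectors (train["sentiment"] is the label column in the Kaggle data; the other names are assumptions carried over from the sketches above):

# train on the averaged vectors of the labeled reviews, then predict the test set
result = rfClassifer(100, trainDataVecs, train["sentiment"], testDataVecs)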

Discussion

Overall, this is a pretty fun project. The above three methods are quite different and take different amounts of time to train, but they are similar in one way: all of them project text content onto a feature vector of a desired dimension and use that feature vector for classification.

There are quite a few approaches that could improve the results: taking punctuation and symbols (smileys) into account, using different classification models, and so on.



References
[1] Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053
[2] Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722
[3] Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/

Acknowledgement
Kaggle
Codatlas

Thursday, June 11, 2015

First touch in data science (Titanic project on Kaggle) Part II: Random Forest

In this post, I will use the Pandas and Scikit learn packages to make the predictions.

Reading the data
Instead of using csv reader provided by Python itself, here we use Pandas.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import csv
# always use header =  0 when row 0 is the header row
df = pd.read_csv('/path/train.csv', header = 0)
test_df = pd.read_csv('/path/test.csv', header = 0)

Pandas stores the data in an object called a DataFrame. It also provides functions for basic statistics of the data.

df.head(n) returns the first n rows of the data.
df.tail(n) returns the last n rows of the data.


>>> df.head(3)
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                             Heikkinen, Miss. Laina  female   26      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
>>> df.tail(3)
     PassengerId  Survived  Pclass                                      Name  \
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex  Age  SibSp  Parch      Ticket   Fare Cabin Embarked  
888  female  NaN      1      2  W./C. 6607  23.45   NaN        S  
889    male   26      0      0      111369  30.00  C148        C  
890    male   32      0      0      370376   7.75   NaN        Q  

df.dtypes returns the data type of each column. Remember that the csv reader reads data as strings by default, whereas importing data with Pandas automatically converts each column based on its actual type.

>>> df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Sometimes we want to know if there is missing data in the columns. df.info() can help with this.


>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5 KB

There are 891 rows in total, but Age, Cabin and Embarked have fewer non-null rows, which indicates missing data.

Moreover, df.describe() provides basic statistics of the data.

>>> df.describe()
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

However, since there are missing values in some columns, we need to be careful when quoting statistics from this method.

Pandas provides handy ways to select and filter data, see the following several examples:

df[list of column names][m:n]: selects rows m through n-1 of the desired columns.

>>> df[['Name','Pclass']][1:6]
                                                Name  Pclass
1  Cumings, Mrs. John Bradley (Florence Briggs Th...       1
2                             Heikkinen, Miss. Laina       3
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)       1
4                           Allen, Mr. William Henry       3
5                                   Moran, Mr. James       3


df[criteria on df['column name']][list of column names]: filters the rows based on the criteria and displays the desired columns.


>>> df[df['Age']>60][['Name','Sex','Age']]
                                          Name     Sex   Age
33                       Wheadon, Mr. Edward H    male  66.0
54              Ostby, Mr. Engelhart Cornelius    male  65.0
96                   Goldschmidt, Mr. George B    male  71.0
116                       Connors, Mr. Patrick    male  70.5
170                  Van der hoef, Mr. Wyckoff    male  61.0
252                  Stead, Mr. William Thomas    male  62.0
275          Andrews, Miss. Kornelia Theodosia  female  63.0
280                           Duane, Mr. Frank    male  65.0
326                  Nysveen, Mr. Johan Hansen    male  61.0
438                          Fortune, Mr. Mark    male  64.0
456                  Millet, Mr. Francis Davis    male  65.0
483                     Turkula, Mrs. (Hedwig)  female  63.0
493                    Artagaveytia, Mr. Ramon    male  71.0
545               Nicholson, Mr. Arthur Ernest    male  64.0
555                         Wright, Mr. George    male  62.0
570                         Harris, Mr. George    male  62.0
625                      Sutton, Mr. Frederick    male  61.0
630       Barkworth, Mr. Algernon Henry Wilson    male  80.0
672                Mitchell, Mr. Henry Michael    male  70.0
745               Crosby, Capt. Edward Gifford    male  70.0
829  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0
851                        Svensson, Mr. Johan    male  74.0


Analyzing the data
Many machine learning models only accept numerical inputs. Thus, for string-type data such as Sex, we need to convert the data to numerical values.

# map female to 0 and male to 1
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

The map() method maps each element of the Series ('female' and 'male' in this example) to the corresponding value in the given dictionary.
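
For example, checking the first few rows of the new column against the original Sex column (output illustrative, based on the head(3) shown earlier):

>>> df[['Sex', 'Gender']].head(3)
      Sex  Gender
0    male       1
1  female       0
2  female       0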

Filling in missing data becomes easier with Pandas because we can use the methods provided by the package. Here we bin the passengers based on gender and passenger class and fill the missing ages with the median of each bin.

# fill in missing ages
# for each passenger without an age, fill the median age
# of his/her passenger class
median_ages = np.zeros((2,3))
for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i, j] = df[(df['Gender'] == i) &
                               (df['Pclass'] == j + 1)]['Age'].dropna().median()

# create a new column to fill the missing age (for caution)
df['AgeFill'] = df['Age']
# since each column is a pandas Series object, the data cannot be accessed
# by df[2,3]; we must provide the label (header) of the column and use .loc[]
# to locate the data, e.g., df.loc[0, 'Age']
# or df[row]['header']
for i in range(2):
    for j in range(3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j + 1),
               'AgeFill'] = median_ages[i, j]


We fill the missing Embarked values with the most common boarding place. In statistics, the mode is the most frequent element in the data set.


# fill the missing Embarked with the most common boarding place
# mode() returns the mode of the data set, which is the most frequent element in the data
# sometimes multiple values may be returned, thus in order to select the maximum
# use df.mode().iloc[0]
if len(df.Embarked[df.Embarked.isnull()]) > 0:
    df.Embarked[df.Embarked.isnull()] = df.Embarked.dropna().mode().iloc[0]

# returns an enumerate object
# e.g., [(0, 'S'), (1, 'C'),(2, 'Q')]
# Ports = list(enumerate(np.unique(df.Embarked)))
# Set up a dictionary that is an enumerate object of the ports
Ports_dict = {name : i for i, name in list(enumerate(np.unique(df.Embarked)))}
df['EmbarkFill'] = df.Embarked.map(lambda x: Ports_dict[x]).astype(int)
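
Since np.unique() returns the values sorted, for the Titanic data the mapping comes out as:

>>> sorted(Ports_dict.items())
[('C', 0), ('Q', 1), ('S', 2)]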

Drop the unwanted columns.

df = df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis=1)

We need to do the same thing for test data. I will omit that part here, but you can find the full source code on my Github.

Training the model
The model we are going to build is called a random forest. Random forest is an ensemble learning method: it constructs a collection of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees. The data set for each decision tree is produced by the bootstrap method (resampled from the original dataset).

We use scikit-learn to build the model and predict the data. Since scikit-learn only works with numpy arrays, after we clean the data with Pandas, we need to convert the data to numpy arrays.

# convert the data to numpy array
training_data = df.values
test_data = test_df.values

Then we build a random forest object and train the model.

# train the data using random forest
# n_estimators: number of trees in the forest, this number affects the prediction
forest = RandomForestClassifier(n_estimators=150)
# build the forest
# X: array-like or sparse matrix of shape = [n_samples, n_features]
# y: array-like, shape = [n_samples], target values/class labels
forest = forest.fit(training_data[0::, 1::], training_data[0::, 0])
output = forest.predict(test_data).astype(int)

Write the output to a csv file using Python's csv package.

# write the output to a new csv file
predictions_file = open("predictByRandomForest.csv", 'w')
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
# ids is the PassengerId column of the test set, taken before converting the
# DataFrame to a numpy array (e.g., ids = test_df['PassengerId'].values)
open_file_object.writerows(zip(ids, output))
predictions_file.close()


My score is 0.76555 on the leader board, on average, not too bad for the first try.

A few thoughts
Apparently choosing the right model is important; moreover, how you tune the input parameters (e.g., n_estimators) also matters.

For the data, I still need to learn which features are most relevant for the "best" fit. Sometimes including more features gives more correct predictions, but it may also cause an overfitting problem.

Wednesday, June 10, 2015

First touch in data science (Titanic project on Kaggle) Part I: a simple model

Right after I became Dr. Young, I decided to pick up the thing I always wanted to do yet never had enough time to work on: machine learning and data analytics.

Kaggle is a great place to start. Besides the active competitions, they provide several entry-level projects that include tutorials. I started with the first one: Titanic.

The full description can be found here. In short, given the data of 891 passengers and if they have survived or not, predict what sorts of people were likely to survive.

I used Python to finish this project. There are good libraries in Python for data analysis and machine learning. Moreover, I personally find Python fast to work with, so it may be more efficient when dealing with a large data set.


Understand the data
The first thing to do when starting with data science is to read and understand the data. What we want to do is determine which variable(s) is (are) strongly correlated with the ultimate survival. Part of the data is shown in the following figure.





In data science, features are the variables that are given in the data. In the Titanic dataset, Pclass, Name, Sex, Age, ..., are all features. Labels are the outcome: in this dataset, the labels are survived (1) or not survived (0), so it is a binary classification problem.

When a huge dataset with lots of features is given to you, some features are strongly correlated with the label and some are not. It would be better if more information were provided; however, without that, it is not bad to start with intuition. In the Titanic case, it is plausible that Sex, Age and Pclass are more relevant compared to, say, Embarked (the place where the passenger boarded).

Reading the data to Python
Python provides libraries to read csv files. Moreover, the numpy library provides handy functions to analyze the data. The scripts provided here are in Python 3.4; for scripts in Python 2.x, see the Kaggle tutorial.

import csv
import numpy as np
training_object = csv.reader(open('/path/train.csv', 'r'))
training_header = training_object.__next__()
# create a numpy multidimensional array object
data = []
for row in training_object:
    data.append(row)
data = np.array(data)


Python has a very interesting way to deal with sequences. For example, data[:2] gives you the first two elements and data[-1] gives you the last element:


 >>> data = [1, 2, 3, 4, 5]  
 >>> data  
 [1, 2, 3, 4, 5]  
 >>> data[:2]  
 [1, 2]  
 >>> data[:-2]  
 [1, 2, 3]  
 >>> data[-1]  
 5  
 >>> data[-2:]  
 [4, 5]  



Analyzing the data
Here we try to build a simple model that uses Fare, Sex and Pclass (passenger class). To see the names of all the features, check training_header:

>>> training_header
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Since each person paid a different fare to board, we need to bin up the fare prices so that we can classify the passengers based on the fare bin.
The csv reader in Python reads data as strings by default, so we need to convert the fares to float.

fare_ceiling = 40
# for ticket price higher than 39, it will be set to equal 39
# so that we can set 4 bins with equal size
# i.e., $0-9, $10-19, $20-29, $30-39
data[data[0::, 9].astype(np.float) >= fare_ceiling, 9] = fare_ceiling - 1.0

# basically make 4 equal bins
fare_bracket_size = 10
number_of_price_brackets = fare_ceiling // fare_bracket_size  # integer division, 4 bins

# np.unique() return an array of unique elements in the object
# get the length of that array
number_of_classes = len(np.unique(data[0::, 2]))


data[0::, 9] selects column 9 (the Fare column, 0-indexed) of every row, from the first to the last. Numpy has a lovely way to select data. In the above code:

data[data[0::, 9].astype(np.float) >= fare_ceiling, 9]

selects the fare entries (column 9) of all rows whose fare is greater than or equal to fare_ceiling, so the assignment can cap them at fare_ceiling - 1.
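
A toy illustration of this kind of boolean-mask indexing:

>>> a = np.array([[1, 5], [2, 50], [3, 7]])
>>> a[a[0::, 1] >= 10, 1]          # values in column 1 that are >= 10
array([50])
>>> a[a[0::, 1] >= 10, 1] = 9      # cap them, just like the fare ceiling above
>>> a
array([[1, 5],
       [2, 9],
       [3, 7]])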

>>The Numpy.unique() function
This is a very interesting function. I looked at its source because I wanted to see if it uses the brute-force iteration approach. Apparently it is much smarter. The full source code can be found here (referred from Codatlas). Here I simplify the implementation for the purpose of explanation.

Convert the input to a NumPy array and flatten it to a one-dimensional array.

>>> ar = [1, 1, 2, 3, 3, 3, 2, 2, 2]
>>> ar = np.asanyarray(ar).flatten()
>>> ar
array([1, 1, 2, 3, 3, 3, 2, 2, 2])


Sort the array.

>>> ar.sort()
>>> ar
array([1, 1, 2, 2, 2, 2, 3, 3, 3])

Here comes the interesting part.

>>> aux = ar
>>> flag = np.concatenate(([True], aux[1:] != aux[:-1]))
>>> flag
array([ True, False,  True, False, False, False,  True, False, False], dtype=bool)

If we print aux[1:] and aux[:-1]:

>>> aux[1:]
array([1, 2, 2, 2, 2, 3, 3, 3])
>>> aux[:-1]
array([1, 1, 2, 2, 2, 2, 3, 3])

This operation is equivalent to shifting the array by one position: if there are no repeated elements in the array, aux[1:] != aux[:-1] returns True at every position; otherwise, it returns False at the positions with repeated elements.




Generate the return array based on the flag array.

>>> ret = aux[flag]
>>> ret
array([1, 2, 3])


<<Back to Analyzing the data

The next step is to calculate the statistics based on the features we have chosen, i.e., for each sex, collect the survival outcomes of the passengers in each passenger class and fare bin and compute the proportion that survived.


# initialize the survival table with all zeros
survival_table = np.zeros((2, number_of_classes, number_of_price_brackets))

for i in range(number_of_classes):
    for j in range(int(number_of_price_brackets)):

        women_only_stats = data[(data[0::, 4] == "female")
                                & (data[0::, 2].astype(np.float) == i+1)  # i starts from 0,
                                # the ith class fare was greater than or equal to the least fare in current bin
                                & (data[0:, 9].astype(np.float) >= j*fare_bracket_size)
                                # fare was less than the least fare in next bin
                                & (data[0:, 9].astype(np.float) < (j+1)*fare_bracket_size), 1]

        men_only_stats = data[(data[0::, 4] != "female")
                              & (data[0::, 2].astype(np.float) == i + 1)
                              & (data[0:,9].astype(np.float) >= j * fare_bracket_size)
                              & (data[0:,9].astype(np.float) < (j + 1) * fare_bracket_size), 1]
        survival_table[0, i, j] = np.mean(women_only_stats.astype(np.float))
        survival_table[1, i, j] = np.mean(men_only_stats.astype(np.float))
# if nobody satisfies the criteria, the table will return a NaN
# since the divisor is zero
survival_table[survival_table != survival_table] = 0


Since Survived only contains 0 and 1, the probability of surviving for a given group (sex, passenger class and fare bin) is calculated as:
(number of passengers who survived) / (total number of passengers in the group),
i.e., the mean.

Again, Survived only contains 0 and 1, thus we assume any probability greater than or equal to 0.5 should predict a survival.

# assume any probability >= 0.5 should result in predicting survival
# otherwise not
survival_table[survival_table < 0.5] = 0
survival_table[survival_table >= 0.5] = 1


Predicting the data
Now we need to use the table to predict the test data. We use the csv module to read the test file and create a new file for writing the predictions.

test_file = open('/path/test.csv')
test_object = csv.reader(test_file)
test_header = test_object.__next__()
prediction_file = open("/path/genderClassModel.csv", 'w')
p = csv.writer(prediction_file)
p.writerow(["PassengerId", "Survived"])


The original tutorial provided by Kaggle uses a loop to determine which bin a passenger's fare falls in. I personally don't like this approach: it's slow. Alternatively, we can compute the bin by dividing the fare by fare_bracket_size (10 in this case).


for row in test_object:
    # for each passenger, find the price bin where the passenger
    # belongs to
    try:
        row[8] = float(row[8])
        # assign the passenger to the last bin if the fare he/she paid
        # was greater than or equal to the fare ceiling
        if row[8] >= fare_ceiling:
            bin_fare = number_of_price_brackets - 1
        else:
            bin_fare = int(row[8] / fare_bracket_size)
    # if the fare is missing, bin the passenger according to Pclass
    # (do not skip the row, every passenger still needs a prediction)
    except ValueError:
        bin_fare = int(3 - float(row[1]))

    if row[3] == 'female':
        p.writerow([row[0], "%d" %
                    int(survival_table[0, int(row[1]) - 1, bin_fare])])
    else:
        p.writerow([row[0], "%d" %
                    int(survival_table[1, int(row[1]) - 1, bin_fare])])



test_file.close()
prediction_file.close()

In the next post, I will talk about using Pandas library to clean the data and use Scikit learn to train a machine learning model for the prediction.

The full src can be found on my Github.