Saturday, June 27, 2015

Natural Language Processing: the IMDB movie reviews

Natural language processing (NLP) deals with problems involving text, usually with the help of machine learning algorithms. Most machine learning models require numerical features, which poses a central challenge for NLP: how to turn large amounts of raw text into a representation a computer can work with.

In this blog post I use the IMDB movie reviews and three different approaches to classify whether a review is positive or negative. The first approach, which creates features from word occurrence counts, and the second, which uses Google's word2vec to map each word to a vector, are based on Kaggle's Bag of Words Meets Bags of Popcorn tutorial. The third uses doc2vec, an extension of word2vec proposed by Le and Mikolov [1].

All the source code can be found on my GitHub.


Data cleaning and text processing

A piece of raw text may contain HTML tags, punctuation, numbers and symbols (e.g., smileys) that affect the prediction. Moreover, it needs to be broken down into single words before any feature-creation method can be applied.


HTML tags
Some reviews contain HTML tags such as <br/>, <pre>, etc. In this blog post, I use the Python library Beautiful Soup to strip them out.

review_text = BeautifulSoup(raw_review).get_text()



Break down to words

A long paragraph of text is hard for a computer to process as a whole, so we need to break it down into single words. It is true that word order can affect the meaning of the content, and therefore the prediction; as we will see later, doc2vec takes word order into account and has been shown to perform best on IMDB movie review classification. Punctuation such as ":)" (smileys) and numbers may also affect the classification. The original Kaggle tutorial does not provide options for keeping smileys and numbers, so I added these two options to my code.

The code uses regular expressions to find patterns (numbers or punctuation) and replace them with whitespace so the text can be split up later.
Python provides the re library for regular expression operations.


smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
:D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D :( :/ :-( :'( :D :P""".split()
smiley_pattern = "|".join(map(re.escape, smileys))
# re.sub() replace the pattern by the desired character/string
# [^] matches a single character that is not contained within the brackets if remove_numbers and remove_smileys:
elif remove_smileys:
# any character that is not in a to z and A to Z (non text) review_text = re.sub("[^a-zA-Z]", " ", review_text) # numbers are also included
review_text = re.sub("[^a-zA-Z0-9" + smiley_pattern + "]", " ", review_text)
review_text = re.sub("[^a-zA-Z0-9]", " ", review_text) elif remove_numbers: review_text = re.sub("[^a-zA-Z" + smiley_pattern + "]", " ", review_text)
else:

After we remove the unnecessary symbols, we split the paragraph to single words.

# split into a list of words
words = review_text.lower().split()

This operation gives a list of single words.


Remove stop words
Stop words are high-frequency words that do not carry much meaning, for example "I", "you", "this", "is", etc. Including such stop words may affect the model's predictions. Here, I use the Natural Language Toolkit (NLTK) to get the common stop words. It is not included in Python's standard library, so we need to install it and download its data first. Installing it from inside an IDE did not work for me; please check the NLTK documentation for installation and for downloading the data. When you type the nltk.download() command in an interactive session, a separate window pops up and asks you to select the data you want. I selected everything, but you may select specific packages (e.g., stopwords).
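For reference, downloading the data from an interactive Python session looks like this:

import nltk
# opens a separate window where you can select what to download;
# "stopwords" (and later "punkt") are the pieces used in this post
nltk.download()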

After installing the package and downloading the data, import the stopwords corpus. Then all we need to do is remove the stop words in that list from the list of words we created from the review.

from nltk.corpus import stopwords

if remove_stopwords:
    # create a set of all stop words
    stops = set(stopwords.words("english"))
    # remove stop words from the list
    words = [w for w in words if w not in stops]

The final list of words should look like this:

>>> words = processData.review_to_words(train["review"][0], True, False, False)
>>> words
['stuff', 'going', 'moment', 'mj', "i've", 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', "mj's", 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', "m'kay", 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', '20', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 'wants', 'mj', 'dead', 'bad', 'beyond', 'mj', 'overheard', 'plans', 'nah', 'joe', "pesci's", 'character', 'ranted', 'wanted', 'people', 'know', 'supplying', 'drugs', 'etc', 'dunno', 'maybe', 'hates', "mj's", 'music', 'lots', 'cool', 'things', 'like', 'mj', 'turning', 'car', 'robot', 'whole', 'speed', 'demon', 'sequence', 'also', 'director', 'must', 'patience', 'saint', 'came', 'filming', 'kiddy', 'bad', 'sequence', 'usually', 'directors', 'hate', 'working', 'one', 'kid', 'let', 'alone', 'whole', 'bunch', 'performing', 'complex', 'dance', 'scene', 'bottom', 'line', 'movie', 'people', 'like', 'mj', 'one', 'level', 'another', '(which', 'think', 'people)', 'stay', 'away', 'try', 'give', 'wholesome', 'message', 'ironically', "mj's", 'bestest', 'buddy', 'movie', 'girl', 'michael', 'jackson', 'truly', 'one', 'talented', 'people', 'ever', 'grace', 'planet', 'guilty', 'well', 'attention', "i've", 'gave', 'subject', 'hmmm', 'well', "don't", 'know', 'people', 'different', 'behind', 'closed', 'doors', 'know', 'fact', 'either', 'extremely', 'nice', 'stupid', 'guy', 'one', 'sickest', 'liars', 'hope', 'latter']



Processing all the reviews:

clean_train_reviews = []
print("Cleaning and parsing training data", end="\n")
for i in range(0, num_reviews):
    # if (i+1) % 1000 == 0:
    # print("Review %d of %d\n" % (i+1, num_reviews))
    clean_train_reviews.append(" ".join(processData.review_to_words(train["review"][i], True,False,False)))


The review_to_words() function contains all operations to process a review.

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_words(raw_review, remove_stopwords=False, remove_numbers=False, remove_smileys=False):
    # use the BeautifulSoup library to remove the HTML/XML tags (e.g., <br/>)
    review_text = BeautifulSoup(raw_review).get_text()
    # emotional symbols may affect the meaning of the review
    smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
    :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D :( :/ :-( :'( :D :P""".split()
    smiley_pattern = "|".join(map(re.escape, smileys))
    # [^...] matches a single character that is not contained within the brackets
    # re.sub() replaces the pattern with the desired character/string
    if remove_numbers and remove_smileys:
        # any character that is not in a to z and A to Z (non text)
        review_text = re.sub("[^a-zA-Z]", " ", review_text)
    elif remove_smileys:
        # numbers are also included
        review_text = re.sub("[^a-zA-Z0-9]", " ", review_text)
    elif remove_numbers:
        review_text = re.sub("[^a-zA-Z" + smiley_pattern + "]", " ", review_text)
    else:
        review_text = re.sub("[^a-zA-Z0-9" + smiley_pattern + "]", " ", review_text)
    # split into a list of words
    words = review_text.lower().split()
    if remove_stopwords:
        # create a set of all stop words
        stops = set(stopwords.words("english"))
        # remove stop words from the list
        words = [w for w in words if w not in stops]
    # for bag of words, return a string that is the concatenation of all the meaningful words
    # for word2vec, return the list of words
    # return " ".join(words)
    return words

Note:
The Kaggle tutorial mentions that if you are appending a list of lists to another list of lists, "append" will only append the first list; you need to use "+=" in order to join all of the lists at once. More precisely, append adds the whole list of lists as a single nested element, while += (which behaves like extend) adds each inner list individually, which is what we want when collecting sentences. See this post.
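A quick interactive illustration of the difference:

>>> sentences = [['the', 'cat']]
>>> more = [['sat', 'on'], ['the', 'hat']]
>>> sentences + more          # += works the same way: each inner list stays a separate sentence
[['the', 'cat'], ['sat', 'on'], ['the', 'hat']]
>>> sentences.append(more)    # the whole list of lists becomes a single nested element
>>> sentences
[['the', 'cat'], [['sat', 'on'], ['the', 'hat']]]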

This part of the code is included in processData.py.



Bag of words model

Now we have a list of words for each review. The next step is to create features for the model we are going to train, based on all the reviews (lists of words) we have. The simplest way is to learn a vocabulary (the bag-of-words model) from all the reviews, use that vocabulary as the feature set, and count how often each vocabulary word occurs in each list of words. The result is a vector whose elements are the occurrence counts of the vocabulary words, and this vector is used as the feature vector for training.
For example,
dictionary {the, cat, sat, on, hat, dog, likes, and}
sentence1: the cat sat on the hat {2, 1, 1, 1, 1, 0, 0, 0}
sentence2: the dog likes the cat and the hat {3, 1, 0, 0, 1, 1, 1, 1} 
I use the feature_extraction module from scikit-learn to create the bag-of-words features. CountVectorizer converts a collection of text documents to a matrix of token counts. Here, we convert the list of cleaned reviews (one string per review) to such a matrix: each row is the count vector for one review, and each column corresponds to one word in the vocabulary. Note that CountVectorizer has its own preprocessing options, which are definitely worth trying. :)



from sklearn.feature_extraction.text import CountVectorizer

# max_features determines the maximum number of words taken into account, 5000 here
# e.g. dictionary {the, cat, sat, on, hat, dog, likes, and}
# sentence1: the cat sat on the hat {2, 1, 1, 1, 1, 0, 0, 0}
# sentence2: the dog likes the cat and the hat {3, 1, 0, 0, 1, 1, 1, 1}
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
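Fitting the vectorizer on the cleaned reviews then takes one call; a minimal sketch using the clean_train_reviews list built earlier:

# learn the 5000-word vocabulary and build the count matrix in one step
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# convert the sparse matrix to a regular numpy array for the classifier
train_data_features = train_data_features.toarray()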

This part of the code is included in the nlp.py.


Word2Vec
Word2Vec is a project developed at Google that uses a neural-network-based implementation to learn distributed representations of words [2]. There are quite a few resources explaining the details of Word2Vec; I selected a few here.


In short, given a large vocabulary (e.g., Wiktionary), Word2Vec maps each word to a vector so that words can be compared and operated on arithmetically. For example, "king" - "man" ≈ "queen" - "woman".

The original Word2Vec is implemented in C. In Python, the Gensim package provides an excellent implementation of the project.

Word2Vec expects single sentences, each one as a list of words. That means a review will be broken down into a list of lists, with each inner list holding the words of one sentence. So the question is: how do we determine what a sentence is? There are several characters that can end a sentence ("?", ".", "!", etc.), and sometimes a capitalized word indicates that the previous word ended a sentence, so it is hard to do this manually. Python's NLTK package provides the punkt module for sentence splitting. If you downloaded all the packages when you installed NLTK, you can import punkt directly; otherwise download the module first.

Processing each review is the same as shown previously. However, Kaggle's tutorial mentions that it is better not to remove stop words, because the algorithm relies on the broader context of the sentence in order to produce high-quality word vectors. It may also help not to remove numbers and smileys. The review_to_words() function can be used directly with its default options.

I wrote the function review_to_sentences() to process each review.

def review_to_sentences(review, tokenizer, remove_stopwords=False, remove_numbers=False, remove_smileys=False):
    """
    This function splits a review into parsed sentences
    :param review:
    :param tokenizer:
    :param remove_stopwords:
    :return: sentences, list of lists
    """
    # review.strip() removes leading/trailing whitespace from the review
    # use the tokenizer to split the review into sentences
    raw_sentences = tokenizer.tokenize(review.strip())

    # keep each sentence as its own list of words, so the result is a list of lists
    cleaned_review = []
    for sentence in raw_sentences:
        if len(sentence) > 0:
            cleaned_review.append(review_to_words(sentence, remove_stopwords, remove_numbers, remove_smileys))

    return cleaned_review
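The bag_sentences list passed to Word2Vec below is built by running this function over every review with the punkt tokenizer. A rough sketch (the unlabeled training reviews can be added to the list the same way):

import nltk.data

# load the pre-trained punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

bag_sentences = []
for review in train["review"]:
    # += extends the list, so each sentence stays a separate list of words
    bag_sentences += review_to_sentences(review, tokenizer)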

Before you start training the model, it is better to install Cython and make sure you have a C compiler, so that gensim can use its optimized (compiled) word2vec training, which is roughly a 70x speedup compared to the plain NumPy implementation [3].

The source code of Word2Vec in Gensim can be found here. They also provide tutorials in the doc string, so it's worth taking a look. :)

The model is trained with the skip-gram algorithm by default; you can also train with continuous bag of words (CBOW) by setting sg=0. Refer to this and this for detailed explanations of skip-gram and CBOW. According to Mikolov (who developed word2vec), skip-gram works well with small amounts of training data and represents even rare words or phrases well, while CBOW is several times faster to train and gives slightly better accuracy for frequent words. Based on the Kaggle tutorial, skip-gram produces better results on this data set.
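In gensim this choice is exposed through the sg argument of the Word2Vec constructor. A minimal illustration (the full training call I actually use follows below; since the default may differ between gensim versions, setting it explicitly is safest):

from gensim.models import word2vec

# sg=1: skip-gram, sg=0: continuous bag of words (CBOW)
model = word2vec.Word2Vec(bag_sentences, sg=1)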


from gensim.models import word2vec

num_features = 500  # word vector dimensionality
# minimum word count: any word that does not occur at least this many times
# across all documents is ignored
min_word_count = 40
num_workers = 4  # Number of threads to run in parallel
context = 10  # Context window size
downsampling = 1e-3  # Downsample setting for frequent words

print("Training model...")
model = word2vec.Word2Vec(bag_sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient
model.init_sims(replace=True)
# save the model for future use
model.save("Word2VectforNLPTraining")

This part of the code can be found in word2vecNLP.py


Explore the model
Since we have saved the model, we can load the model directly.

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load("Word2VectforNLPTraining") 

The model consists of a feature vector for each word in the vocabulary, stored in a numpy array called "syn0".

>>> print(type(model.syn0))
#number of words, number of features
>>> model.syn0.shape
(17978, 500)
>>> model["man"]
array([-0.0173709 , -0.05453965, -0.01378504, -0.02687   , -0.0247492 ,
       -0.02725732, -0.08029163, -0.01303324, -0.01790693,  0.02459037,
       -0.0451758 ,  0.06946673,  0.00119434, -0.01014592,  0.00334688,
       ...
       -0.01781173, -0.05186122, -0.04420475,  0.00410226, -0.05667625,
        0.06580704, -0.00364238, -0.14961284,  0.02291572, -0.04049427,
       -0.0516507 ,  0.03579128,  0.00122541,  0.02547096,  0.03301932], dtype=float32)

The doesnt_match() function tries to deduce which word in a set is most dissimilar from the others.

>>> model.doesnt_match("man woman child kitchen".split())
'kitchen'
>>> model.doesnt_match("paris berlin london austria".split())
'paris'

most_similar() returns the most similar words together with their similarity scores. The topn option determines how many of the top matches are returned. Positive words contribute positively towards the similarity, negative words negatively.

>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)
[('princess', 0.4017685651779175), ('queen', 0.3796828091144562), ('prince', 0.36173444986343384), ('mistress', 0.3507348895072937), ('rudolf', 0.3303285539150238), ('maid', 0.32905253767967224), ('astor', 0.32380762696266174), ('throne', 0.31540173292160034), ('stepmother', 0.3121083974838257), ('antoinette', 0.31201446056365967)]

This part of the code is included in exploreWord2VecModel.py


Doc2Vec
Doc2Vec takes into account the word order. Details about Doc2Vec can be found in this blog and reference [1]. In gensim, Doc2Vec is implemented as a derived class from Word2Vec. A tutorial about Doc2Vec can be found here.

Instead of skip-gram and CBOW, Doc2Vec implements distributed memory (DM) and distributed bag of words (DBOW). The default algorithm is distributed memory (dm=1); by setting dm=0 you can use DBOW.
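As with word2vec's sg flag, the choice is a constructor argument. A minimal illustration (the full training call I use is further down):

from gensim.models import doc2vec

# dm=1: distributed memory (the default), dm=0: distributed bag of words (DBOW)
model = doc2vec.Doc2Vec(dm=1)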

Doc2Vec requires each sentence to be a LabeledSentence object, which is different from Word2Vec. The easiest way is to create a LabeledSentence object for each review.

from gensim.models import doc2vec

def labelizeReviews(reviewSet, labelType):
    """
    add a label to each review
    :param reviewSet: list of reviews, each a list of words
    :param labelType: the label prefix to put on each review
    :return: list of LabeledSentence objects
    """
    labelized = []
    for index, review in enumerate(reviewSet):
        labelized.append(doc2vec.LabeledSentence(words=review, labels=['%s_%s' % (labelType, index)]))
    return labelized

# the input to doc2vec is an iterator of LabeledSentence objects
# each consists of a list of words and a list of labels
labeled = labelizeReviews(labeled, 'LABELED')
unlabeled = labelizeReviews(unlabeled, 'UNLABELED')

Training part is similar to Word2Vec.

import numpy as np

num_features = 500  # word vector dimensionality
# minimum word count: any word that does not occur at least this many times
# across all documents is ignored
min_word_count = 40
# the paper (http://arxiv.org/pdf/1405.4053v2.pdf) suggests 10 is the optimal
context = 10
#  threshold for configuring which higher-frequency words are randomly downsampled;
# default is 0 (off), useful value is 1e-5
# set the same as word2vec
downsampling = 1e-3
num_workers = 4  # Number of threads to run in parallel

# if sentence is not supplied, the model is left uninitialized
# otherwise the model is trained automatically
# https://www.codatlas.com/github.com/piskvorky/gensim/develop/gensim/models/doc2vec.py?line=192
model = doc2vec.Doc2Vec(size=num_features,
                        window=context, min_count=min_word_count,
                        sample=downsampling, workers=num_workers)

model.build_vocab(bag_labeled_sentence)
# gensim documentation suggests training over data set for multiple times
# by either randomizing the order of the data set or adjusting learning rate
# see here for adjusting learn rate: http://rare-technologies.com/doc2vec-tutorial/
# iterate 10 times
for it in range(10):
    # perm = np.random.permutation(bag_labeled_sentence.shape[0])
    model.train(np.random.permutation(bag_labeled_sentence))
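The doc2vec tutorial linked above instead adjusts the learning rate manually between passes. A sketch of that variant, based on the tutorial (not what doc2vecNLP.py does, and specific to the gensim version of that time):

model.alpha, model.min_alpha = 0.025, 0.025
for epoch in range(10):
    model.train(np.random.permutation(bag_labeled_sentence))
    model.alpha -= 0.002           # decrease the learning rate after each pass
    model.min_alpha = model.alpha  # fix the learning rate, no decay within a pass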

This part of the code can be found in doc2vecNLP.py

IMDB review classification

Word2Vec and Doc2Vec only allow us to map words to vectors. In order to classify a whole paragraph of text, we still need to build a feature vector for each review.

Vector Averaging
The easiest method is to average the word vectors in each review. Each word is a vector with num_features dimensions (500 in my case); if we add up the vectors of all the words in the review and take the average, the review itself becomes a vector with num_features dimensions.

For example,
review: "shirley is awesome"
"shirley" = "0.3 0.6 0.8"
"is" = "1.2 3.5 4.6"
"awesome" = "0.9 1.2 8.7"

then the vector of the review = "(0.3 + 1.2 + 0.9)/3 (0.6 + 3.5 + 1.2)/3 (0.8 + 4.6 + 8.7)/3" = “0.8 1.77 4.7”

import numpy as np

def makeFeatureVec(review, model, num_features):
    """
    given a review, define the feature vector by averaging the feature vectors
    of all words that exist in the model vocabulary in the review
    :param review:
    :param model:
    :param num_features:
    :return:
    """

    featureVec = np.zeros(num_features, dtype=np.float32)
    nwords = 0

    # index2word is the list of the names of the words in the model's vocabulary.
    # convert it to set for speed
    vocabulary_set = set(model.index2word)

    # loop over each word in the review and add its feature vector to the total
    # if the word is in the model's vocabulary
    for word in review:
        if word in vocabulary_set:
            nwords = nwords + 1
            # add arguments element-wise
            # if x1.shape != x2.shape, they must be able to be casted
            # to a common shape
            featureVec = np.add(featureVec, model[word])
    featureVec = np.divide(featureVec,nwords)
    return featureVec

def getAvgFeatureVecs (reviewSet, model):

    # initialize variables
    counter = 0
    num_features = model.syn0.shape[1]
    reviewsetFV = np.zeros((len(reviewSet),num_features), dtype=np.float32)

    for review in reviewSet:
        reviewsetFV[counter] = makeFeatureVec(review, model, num_features)
        counter += 1
    return reviewsetFV
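Putting the two helpers together, computing the training features is then a single call. A minimal sketch, assuming the reviews are kept as lists of words (i.e., review_to_words() without the final join):

# each review becomes one num_features-dimensional averaged word vector
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model)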

This part of the code can be found on w2vPredictVectorAveraging.py

Clustering
The second method is fancier: it uses a clustering algorithm, specifically K-means, to assign each word vector to a centroid, and then represents each review by the counts of its words in each cluster (a "bag of centroids").

I use scikit-learn to perform K-means algorithm.

from sklearn.cluster import KMeans
def kmeans(num_clusters, dataSet):
    # n_clusters: number of centroids
    # n_jobs: number of jobs running in parallel
    kmeans_clustering = KMeans(n_clusters=num_clusters)
    # Compute cluster centers and predict cluster index for each sample
    centroidIndx = kmeans_clustering.fit_predict(dataSet)

    return centroidIndx

After training, each word in the vocabulary is assigned to a centroid.
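The index_word_map lookup used below maps each vocabulary word to its cluster index. A sketch of how it can be built from the model and the kmeans() helper above (cluster count chosen as in the Kaggle tutorial, roughly five words per cluster):

word_vectors = model.syn0
# about five words per cluster
num_clusters = word_vectors.shape[0] // 5
centroidIndx = kmeans(num_clusters, word_vectors)
# map each vocabulary word to the index of its cluster
index_word_map = dict(zip(model.index2word, centroidIndx))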

Now we need to create the feature vector for each review. We can do this by creating a vector whose dimension is the number of clusters and, for each word in the review, incrementing the element corresponding to that word's centroid. For example,

vocabulary:
   word        centroid
   "Shirley"      0
   "Dora"         1
   "is"           2
   "awesome"      1
   "fun"          0

Number of centroids: 3
So review
"Shirley is awesome" = {1, 1, 1}
"Shirley is fun" = {2, 0, 1}
"Dora is awesome" = {0, 2, 1}

The function create_bag_of_centroids() implements this method.

def create_bag_of_centroids(reviewData):
    """
    assign each word in the review to its centroid and count the occurrences
    returns a numpy array whose dimension is num_clusters;
    each element serves as one feature for classification
    :param reviewData: a review as a list of words
    :return: the bag-of-centroids feature vector
    """
    featureVector = np.zeros(num_clusters, dtype=np.float32)
    for word in reviewData:
        if word in index_word_map:
            index = index_word_map[word]
            featureVector[index] += 1
    return featureVector
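Looping over the cleaned reviews then yields the training matrix. A minimal sketch, again assuming clean_train_reviews is a list of word lists:

# one row per review, one column per cluster
train_centroids = np.zeros((len(clean_train_reviews), num_clusters), dtype=np.float32)
for i, review in enumerate(clean_train_reviews):
    train_centroids[i] = create_bag_of_centroids(review)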

This part of the code can be found in the w2vPredictClustering.py.

Classification
After you create feature vectors using any of the above methods, there are several supervised learning models you can use to classify the reviews. Reference [1] uses logistic regression and claims 94% test accuracy. I use a random forest and get on average a 0.84 score for all of the methods mentioned above, except Doc2Vec with clustering (only 0.73).

from sklearn.ensemble import RandomForestClassifier

def rfClassifer(n_estimators, trainingSet, label, testSet):

    forest = RandomForestClassifier(n_estimators)
    forest = forest.fit(trainingSet, label)
    result = forest.predict(testSet)

    return result
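For reference, a typical call on the bag-of-words features looks roughly like this (a sketch: the feature matrices and the train/test DataFrames follow the Kaggle data layout, and the output file name is only illustrative):

import pandas as pd

# 100 trees; train_data_features / test_data_features are the feature matrices built earlier
result = rfClassifer(100, train_data_features, train["sentiment"], test_data_features)

# write a Kaggle-style submission file
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("Bag_of_Words_model.csv", index=False, quoting=3)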

Discussion

Overall, this was a pretty fun project. The three methods above are quite different and take different amounts of time to train, but they are similar in one respect: all of them project text content to a feature vector of the desired dimension and use that feature vector for classification.

There are quite a few ways to improve the results: taking punctuation and symbols (smileys) into account, using different classification models, and so on.



References
[1] Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053
[2]Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722
[3] Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/

Acknowledgement
Kaggle
Codatlas
