AdSense

Sunday, November 20, 2016

Facebook typeahead search

What is typeahead?



That's the search bar on top of the Facebook page. Easy, right? Now the question is how to design it.

Unlike Google's search result, FB's search is personalized, i.e., for each user, the ranking of the results should be different based on their friends, likes, etc, so static ranking performs badly.

Boils down to the actual question, we would expect the user's first degree relations becomes the top results, then the second degree relations, and then the global results (like Google). Based on this, we actually can split searching to three parts and aggregate together later.



First degree relations include user's friends, his/her likes, activities and etc. For each user, these information may not be changed very quickly, so a batch job can always run to get all the first degree results, and cache it somewhere. When user types the first character, all first degree results can be immediately rendered. 

For the global results, it can always be ranked and cached in advance as those are indifferent to different users (like Google search). 

Now the real problem comes from the second degree relations, i.e., friends of friends, objects (likes, activities, etc) of friends. 

Prefix search

Option 1: Trie


Usually when comes to prefix search, this comes on top of my mind. In this case, for each friend I have, there is a Trie structure that contains all his/her friends and objects. Now we do prefix search on each of those Tries and get the result. 

Problem:
Those pointers in the Tries are really big
Pointers point to random places: touch the graph randomly, no partition.

Option 2: Sorted Vector of name, ID


For each of the friend, there is a sorted vector structure for his/her friends and objects. Now prefix matches becomes binary search for the range of the prefix matches. There are no random pointers now, so we can store this object with the original user as the key, and partition gets easier. 

Problem:
Not good at memory consumption: multiple copies of one user's data
Scales badly when indexing more terms (multiple names for a user)
O(n) insertion: When you inserting a new data, you need to move all consequent data to give space for the new data. 

Option 3: Brutal force


Friends lists are stored as adjacency list. For each of the friend, prefix-match by brute force in forward index. For example, Alice has friend Bob. For each of the friend of Bob's, search the friend's index, e.g., Bob has a friend Jon, his last name is "Doe", now if Alice searches "D", Jon Doe should show up. 
Since each friend and object data is stored only once, we can achieve near optimal space utilization. 

Problem:
Recomputing lots of things, high CPU, cache bandwidth.


Option 4: Filtered force


Using tiny bloom filter summarizing edge names. 
Briefly speaking, bloom filter is an m bits bit array. All elements in a set will be mapped to some bits of in the array using some hash function. To check if a number is in the set, all bits after mapping the number using the hash function should be correctly reflected in the array.  
In our prefix match case, we can consider there is an array for each friend, say char[26] representing 'a' - 'z'. Now if there is a prefix starts with 'c', but when we search one of Alice's friend 'Bob', we found that no 'c' in 'Bob' array, we can easily filter out Bob from our search. This may not guarantee the final result has all correct results, but all incorrect results will be filtered out. This approach requires extra space for the array for each friend, but we can increase CPU performance. Plus, it's easy to cache since we can easily cache those friends and objects that mismatches our prefixes.  

Merging, ranking and returnning


The last step is to merge the FOF, OOF results and Global results with some ranking and return to user. There are lots of things that should be considered in ranking, and think loud to your interviewer!


No comments:

Post a Comment