
Sunday, June 11, 2017

THIS BLOG IS NOT MAINTAINED

Hi guys,

I really appreciate you supporting this blog and commenting on it. However, I've been very busy at my new job and don't have time to maintain this blog anymore. If you have questions, most of them come from LeetCode, and you can find a community there that is more responsive than I am.

Hope everyone gets their dream job!!! :)


Thursday, December 1, 2016

LFU cache

Design and implement a data structure for Least Frequently Used (LFU) cache. It should support the following operations: get and set.
get(key) - Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.
set(key, value) - Set or insert the value if the key is not already present. When the cache reaches its capacity, it should invalidate the least frequently used item before inserting a new item. For the purpose of this problem, when there is a tie (i.e., two or more keys that have the same frequency), the least recently used key would be evicted.
Follow up:
Could you do both operations in O(1) time complexity?
Example:
LFUCache cache = new LFUCache( 2 /* capacity */ );

cache.set(1, 1);
cache.set(2, 2);
cache.get(1);       // returns 1
cache.set(3, 3);    // evicts key 2
cache.get(2);       // returns -1 (not found)
cache.get(3);       // returns 3.
cache.set(4, 4);    // evicts key 1.
cache.get(1);       // returns -1 (not found)
cache.get(3);       // returns 3
cache.get(4);       // returns 4

Similar to the LRU cache, we use a doubly linked list, but here each node tracks a frequency. The list node uses the following implementation:


class Node {
        int frequency;
        Queue<Integer> keys;
        Node next;
        Node prev;

        public Node(int frequency) {
            this.frequency = frequency;
            keys = new LinkedList<>();
            next = null;
            prev = null;
        }
    }


Frequency is the number of times a key has been accessed, and the queue holds the keys that currently have that frequency. The reason to use a queue is that we want to keep track of order: when multiple keys share the same frequency, the least recently used key sits at the head of the queue, so we can simply poll the head when we need to evict. To visualize the list, I borrow this graph:

Source: http://bookshadow.com/weblog/2016/11/22/leetcode-lfu-cache/


Different from the graph, I use the first node as the head, so I can remove keys from the head every time I need to evict.

There are two more maps: one stores the key-value pairs, and the other maps each key to its list node so we can find the key's position in the list.

When we access an existing key (get its value or set a new value), we increase its frequency: the key moves to the next node if that node's frequency is exactly one greater than the current one; otherwise we insert a new node with the incremented frequency.

The overall complexity is O(1).


    private Node head;
    private Map<Integer, Node> map;       // key -> list node holding the key
    private Map<Integer, Integer> values; // key -> value
    private int capacity;

    public LFUCache(int capacity) {
        head = new Node(0);
        map = new HashMap<>();
        values = new HashMap<>();
        this.capacity = capacity;
    }

    public int get(int key) {
        if (capacity == 0 || !map.containsKey(key)) {
            return -1;
        }
        Node curr = map.get(key);
        changeFrequency(curr, key);
        return values.get(key);
    }

    public void set(int key, int value) {
        if (capacity == 0) {
            return;
        }

        if (!map.containsKey(key) && values.size() == capacity) {
            int toRemove = head.keys.poll();
            map.remove(toRemove);
            values.remove(toRemove);
            if (head.keys.isEmpty()) {
                removeNode(head);
            }
        }
        if (values.containsKey(key)) {
            changeFrequency(map.get(key), key);
        } else {
            if (head.frequency == 0) {
                head.keys.add(key);
            } else {
                Node newNode = new Node(0);
                newNode.keys.add(key);
                newNode.next = head;
                head.prev = newNode;
                head = newNode;
            }
            map.put(key, head);
        }
        values.put(key, value);
    }

    private void changeFrequency(Node curr, int key) {
        int freq = curr.frequency;
        if (curr.next != null && curr.next.frequency == freq + 1) {
            curr.next.keys.add(key);
            map.put(key, curr.next);
        } else {
            Node newNode = new Node(freq + 1);
            newNode.keys.add(key);
            Node next = curr.next;
            newNode.prev = curr;
            newNode.next = next;
            curr.next = newNode;
            if (next != null) {
                next.prev = newNode;
            }
            map.put(key, newNode);
        }
        curr.keys.remove(key);
        if (curr.keys.isEmpty()) {
            removeNode(curr);
        }
    }

    private void removeNode(Node node) {
        Node prev = node.prev;
        Node next = node.next;
        if (prev != null) {
            prev.next = next;
        }
        if (next != null) {
            next.prev = prev;
        }


        if (head == node) {
            head = next;
        }
        if (head == null) {
            head = new Node(0);
        }
    }

    private class Node {
        int frequency;
        Queue<Integer> keys;
        Node next;
        Node prev;

        public Node(int frequency) {
            this.frequency = frequency;
            keys = new LinkedList<>();
            next = null;
            prev = null;
        }
    }



Monday, November 21, 2016

Amazing number

Define amazing number as: its value is less than or equal to its index. Given a circular array, find the starting position such that the total number of amazing numbers in the array is maximized.

Example 1: 0, 1, 2, 3
Output: 0. When the starting point is at position 0, all the elements in the array are equal to their index, so all the numbers are amazing numbers.

Example 2: 1, 0, 0
Output: 1. When the starting point is at position 1, the array becomes 0, 0, 1, and all the elements are amazing numbers.

If there are multiple positions, return the smallest one. You should find a solution with time complexity less than O(N^2).

Brute force is trivial: for each index as the starting point, count the number of amazing numbers in the array, then update the answer if we see a new maximum.

There is a clever way of doing it in the original post, which uses intervals. For any number, if it is greater than or equal to the length of the array, there is no way to make it an amazing number. Otherwise, starting from the index right after the current one always makes it an amazing number (its new index becomes length - 1, the largest possible), and the last valid starting index is current_index - array[current_index], because that is the starting point at which the number's new index is exactly its value. If the value is larger than the current index, this difference becomes negative, so we add the length of the array to bring the index back into range; when current_index >= array[current_index], adding the length pushes the index past the end, so we take the result modulo the length. Each element therefore contributes a circular interval of valid starting positions.

After we get all the intervals, we need to find the index that is covered by the most intervals, which is the starting position that yields the most amazing numbers. To do that, we can use the algorithm from Range Addition: add 1 at each interval's start, subtract 1 just past its end, and take a prefix sum over the counts.


public int getAmazingNumber(int[] array) {
        List<Interval> startIndexInterval = new ArrayList<>();
        int len = array.length;
        for (int i = 0; i < len; i++) {
            if (array[i] >= len) {
                continue;
            }
            // the valid starting positions begin at the next index;
            // mod handles the wrap-around when i == len - 1
            int start = (i + 1) % len;
            // ...and end at i - array[i]; add len and mod to keep the index
            // inside the array, which may make start > end (a wrapped interval)
            int end = (i - array[i] + len) % len;
            startIndexInterval.add(new Interval(start, end));
        }

        int[] indexCount = new int[len];
        for (Interval interval : startIndexInterval) {
            indexCount[interval.start]++;
            //for start > end, we need to split interval to two
            //[start, len - 1], [0, end]
            if (interval.start > interval.end) {
                indexCount[0]++;
            }
            if (interval.end + 1 < len) {
                indexCount[interval.end + 1]--;
            }

        }
        int max = indexCount[0];
        int rstIndex = 0;
        for (int i = 1; i < len; i++) {
            indexCount[i] += indexCount[i - 1];
            if (indexCount[i] > max) {
                rstIndex = i;
                max = indexCount[i];
            }
        }
        return rstIndex;
    }
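
The code above relies on an Interval helper that isn't shown in the post; a minimal version (assuming only the two fields the code actually uses) would be:

class Interval {
    int start;
    int end;

    Interval(int start, int end) {
        this.start = start;
        this.end = end;
    }
}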


First pair non matching leaves


Given two (binary) trees, return the first pair of non-matching leaves

Tree 1: A, B, C, D, E, null, null
Tree 2: A, D, B

Output: (E,B)

DFS, but be smart about how to get the leaves. The idea is that whenever we find the next leaf in each of the two trees, we compare them: if they are not equal, return them; otherwise keep traversing until we find a non-matching pair or run out of leaves.

   public int[] firstPairNoMatchingNodes (TreeNode r1, TreeNode r2) {
        if (r1 == null || r2 == null) {
            return new int[]{-1, -1};
        }
        Stack<TreeNode> tree1 = new Stack<>();
        Stack<TreeNode> tree2 = new Stack<>();
        getLeaves(r1, tree1);
        getLeaves(r2, tree2);

        TreeNode leaf1 = getLeaf(tree1);
        TreeNode leaf2 = getLeaf(tree2);
        while (leaf1 != null && leaf2 != null) {
            if (leaf1.val != leaf2.val) {
                return new int[]{leaf1.val, leaf2.val};
            }
            leaf1 = getLeaf(tree1);
            leaf2 = getLeaf(tree2);
        }
        return new int[]{-1, -1};
    }

    private void getLeaves(TreeNode root, Stack<TreeNode> stack) {
        while (root != null) {
            stack.add(root);
            root = root.left;
        }
    }

    private TreeNode getLeaf(Stack<TreeNode> tree) {
        while (!tree.isEmpty()) {
            TreeNode curr = tree.pop();
            if (curr.left == null && curr.right == null) {
                return curr;
            } else if (curr.right != null) {
               getLeaves(curr.right, tree);
            }
        }
        return null;
    }
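
The code assumes the usual LeetCode-style TreeNode, which isn't shown; a minimal definition for completeness:

class TreeNode {
    int val;
    TreeNode left;
    TreeNode right;

    TreeNode(int val) {
        this.val = val;
    }
}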


Sunday, November 20, 2016

Image/Video hosting service

Functionality

This is a very broad question. The first thing is to figure out the purpose of this hosting service: is it used for showing photos to other users, like Flickr, or is it used for photo backup, like Google Photos?
The reason is that these two functionalities lead to different requirements for the system: the first system is read intensive, while the second one is write intensive.

For now let's treat it as an image distribution system, i.e., you post a photo and everybody can see it.

High level architecture


Our version 1.0 may have only one server that runs all backend API calls and reads/writes to our DB. However, as mentioned before, the two workloads may scale differently, so it would be better to separate our write service from our read service. Moreover, image and video uploading takes a lot of bandwidth. Consider that an Apache web server can maintain only about 500 connections at once; if at some point all 500 connections are being used for uploading photos, each of which may take a couple of seconds, all reads are blocked, and that will cause a huge problem. Reading, on the other hand, is quite fast: if each image has a unique ID, retrieving an image should be an O(1) operation, so it's easy to serve ~7000 reads in 1 second even if we can only maintain 500 connections at once.


Image search 


The easiest way is to store images with a key-value approach, which makes retrieval an O(1) operation. However, most of the time the user opens an album and we retrieve not just one photo but many, so we have to solve the partitioning/data-locality problem so that we don't hit the DB randomly for each photo.

We can also apply secondary indices for easier search, e.g., on user_id. 

Considering the large size of the data, Facebook stores photos in a file system; they have a custom-designed FS called Haystack. Details can be found in this and this presentation. In brief, the architecture contains four components:

Haystack Directory: which you can think of as metadata storage, a load balancer, and a coordinator. 
Haystack Cache: which is... a cache. 
Haystack Store: which is the actual file system. 
CDN: for very frequently accessed images (like your profile photo or the Facebook logo). 

The photos are stored in a log format. Briefly, you can think of an album of photos as being stored in the same file (a superblock). For each photo, some metadata is stored first (key, cookie, size, flags, etc.), then the actual data, and then some more metadata at the end. The metadata at the front is also used to build the index.
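
As a rough illustration only (the field names below are my assumptions, not the exact Haystack on-disk format), a record ("needle") in such a log file might look like:

// Illustrative needle record for a Haystack-style append-only store file.
// Field names are assumptions for illustration, not the real Haystack layout.
class Needle {
    long key;        // photo id, also used to build the in-memory index
    long cookie;     // random value embedded in the URL to prevent guessing
    int flags;       // e.g., a deleted marker
    int size;        // length of the photo data in bytes
    byte[] data;     // the actual photo bytes
    long checksum;   // trailing metadata used to validate the record
}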

Haystack Directory is a very interesting design. Whenever a photo write comes in, it asks the Haystack Directory for 3 physical locations, and 3 copies of the photo are stored on three different servers. Whenever a read request comes in, the Directory returns one of the physical locations so that the web server can find the photo. It also takes care of load balancing.

Proxy Server


This is another way to optimize. A proxy server can be very helpful in coordinating requests. For example, when lots of people are requesting the same video/photo, the proxy server can batch all of these requests into one request to the backend, which reduces network traffic. This is called collapsed forwarding.
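
A minimal sketch of collapsed forwarding, assuming a hypothetical fetchFromOrigin call: concurrent requests for the same key share a single in-flight fetch instead of each hitting the backend.

import java.util.concurrent.*;
import java.util.function.Function;

// Illustrative collapsed forwarding: duplicate concurrent requests for the same
// key are collapsed onto one origin fetch (a sketch, not production code).
class CollapsingProxy {
    private final ConcurrentHashMap<String, CompletableFuture<byte[]>> inFlight = new ConcurrentHashMap<>();
    private final Function<String, byte[]> fetchFromOrigin;   // hypothetical call to the backend

    CollapsingProxy(Function<String, byte[]> fetchFromOrigin) {
        this.fetchFromOrigin = fetchFromOrigin;
    }

    CompletableFuture<byte[]> get(String key) {
        return inFlight.computeIfAbsent(key, k ->
                CompletableFuture.supplyAsync(() -> fetchFromOrigin.apply(k))
                        .whenComplete((result, error) -> inFlight.remove(k)));   // allow future re-fetches
    }
}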

Queues


For image/video uploading systems, writes usually take a very long time. In order to keep the system highly available, writes need to be handled asynchronously. One way is to use queues.



Design news feed

News feed is a very interesting problem: it looks easy, but it can involve lots of components.

Functionality


The basic functionality for a news feed service is:

1. A user can post a feed
2. A user can see a list of news feeds from all users he/she follows (including himself/herself), in a particular order.

Data model

There are at least two basic objects: user and feed. Both objects can contain various columns and can be indexed for different purposes. 

The easiest DB solution is a relational DB, e.g., MySQL. The advantage of a relational DB is that it clearly models the relations among all the data. In the news feed example, posting is simple: we create a new feed object, add it to our DB, and we're done. Serving data is more complicated: we first retrieve all users that the user follows, then find all post ids those users have ever posted, then get the post contents from the feed table, order them, and return them to the user. This requires at least three joins, which takes a lot of computing resources. 

Now that our DB operations have become the bottleneck, we need to optimize. One way is to run a batch job periodically so that every user's news feed is pre-computed. This requires one more column in the user table to store the news feed, so we are trading space for time.

A NoSQL DB is also an option; however, in this simple case, we can denormalize our data and use the relational DB like a NoSQL store, with user_id as the key. 

Serving feeds

There are usually two ways to serve the news feeds from a user's friends list: push and pull.

For a push system, when a user publishes a tweet, the tweet (in fact, a pointer to the tweet) is immediately pushed to all of his/her followers' news feeds. The advantage is that when fetching the news feed, we don't need to go through the whole friends list to assemble it, which saves a lot of query time. On the other hand, if a user has lots of followers, publishing involves lots of write operations, and each follower's news feed object can become extremely large. 

For a pull system, feeds are only fetched when the user requests them. The upside is obviously fewer write operations, but we can imagine it will be quite slow for large queries.

The push activity is also called fanout on write and the pull activity is called fanout on load. 

Now back to our batch job optimization. It is similar to the pull approach mentioned above: in the pull approach, the whole process happens when the user is actually loading the page, while in the batch job approach it is done by a scheduled job. This serves most use cases, since most of the time news feeds are not time critical. However, in some cases timing matters. Say Alice makes a post: "Who wants to see the movie tonight?". She definitely wants the post to be seen by her friends immediately. Neither the batch job approach nor the pull approach is sufficient for such a requirement.

In this case, the push approach works better: instead of a scheduled batch job, the news feed is updated whenever a new post from one of the user's friends comes in. This approach adds a lot of writes to the DB, so think about separating reads and writes into different services so that neither one is affected by the other. Periodically, a batch job can still be run to completely rebuild the pre-computed news feed.

In reality, there are pros and cons to both approaches. Consider Donald Trump, the newly elected president-to-be: he has thousands of followers (even though many of them hate him :)). When he makes a post, there are thousands of update operations, which is computationally expensive. So for these high-profile users, fanout on write should be disabled, and a batch job used to update their followers' news feeds instead.
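
A minimal sketch of this hybrid fanout-on-write idea (the class, field names, and the follower threshold are my own assumptions for illustration): posts from ordinary users are pushed into followers' pre-computed feeds, while posts from high-follower users are skipped and left to fanout on load or the batch job.

import java.util.*;

// Sketch of hybrid fanout-on-write with a "celebrity" cutoff.
class FeedService {
    private static final int FANOUT_THRESHOLD = 10_000;   // assumed cutoff for high-profile users

    private final Map<Long, List<Long>> followers = new HashMap<>();        // userId -> follower ids
    private final Map<Long, Deque<Long>> precomputedFeed = new HashMap<>(); // userId -> post ids, newest first

    public void publish(long authorId, long postId) {
        List<Long> fans = followers.getOrDefault(authorId, Collections.emptyList());
        if (fans.size() >= FANOUT_THRESHOLD) {
            return;   // skip fanout on write; this author's posts are merged in at read time
        }
        for (long fan : fans) {
            precomputedFeed.computeIfAbsent(fan, k -> new ArrayDeque<>()).addFirst(postId);
        }
    }
}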

Ranking


Ranking is always interesting. While there are lots of signals we could consider, it boils down to two things: the features and the scoring algorithm. At Facebook, there is (or was) an algorithm called EdgeRank; see this post for details. 

More optimization


Cache, cache, and cache. Facebook uses a so-called "write-through cache" approach: briefly, all data reads and writes go through Memcached. If data is not found in Memcached, the caching layer goes to the actual data store, writes the data into the cache, and returns it to the client.
Also see this post for an implementation of consistent hashing.  
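
A minimal sketch of the cache behavior described above (generic Java, not Facebook's actual Memcached integration): reads check the cache first, fall back to the store on a miss and populate the cache; writes go to both.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative write-through cache; the backing store is a stand-in for the real DB.
class WriteThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, V> backingStore;

    WriteThroughCache(Map<K, V> backingStore) {
        this.backingStore = backingStore;
    }

    V get(K key) {
        V value = cache.get(key);
        if (value == null) {                       // cache miss: read from the store
            value = backingStore.get(key);
            if (value != null) {
                cache.put(key, value);             // populate the cache for next time
            }
        }
        return value;
    }

    void put(K key, V value) {                     // write goes to both the store and the cache
        backingStore.put(key, value);
        cache.put(key, value);
    }
}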

Facebook typeahead search

What is typeahead?



That's the search bar on top of the Facebook page. Easy, right? Now the question is how to design it.

Unlike Google's search results, FB's search is personalized: for each user, the ranking of the results should be different based on their friends, likes, etc., so static ranking performs badly.

Coming back to the actual question, we would expect the user's first-degree relations to be the top results, then the second-degree relations, and then the global results (like Google). Based on this, we can split the search into three parts and aggregate them later.



First-degree relations include the user's friends, likes, activities, etc. For each user, this information does not change very quickly, so a batch job can periodically compute all the first-degree results and cache them somewhere. When the user types the first character, all first-degree results can be rendered immediately. 

The global results can always be ranked and cached in advance, as they are the same for every user (like Google search). 

Now the real problem comes from the second-degree relations, i.e., friends of friends and objects (likes, activities, etc.) of friends. 

Prefix search

Option 1: Trie


Usually when it comes to prefix search, this is the first thing that comes to mind. In this case, for each friend I have, there is a Trie that contains all of his/her friends and objects. We then do a prefix search on each of those Tries and merge the results (see the sketch after the problem list below). 

Problem:
The pointers in the Tries take a lot of memory.
Pointers point to random places: we touch the graph randomly, and there is no easy way to partition the data.
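
A minimal sketch of the per-friend Trie idea (illustrative only; a real index would store IDs and more metadata):

import java.util.*;

// Minimal Trie for name prefix search, one per friend.
class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private final Set<String> names = new HashSet<>();   // full names ending at this node

    void insert(String name) {
        Trie node = this;
        for (char c : name.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.names.add(name);
    }

    // Collect every name under the given prefix.
    List<String> startingWith(String prefix) {
        Trie node = this;
        for (char c : prefix.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return Collections.emptyList();
            }
        }
        List<String> result = new ArrayList<>();
        node.collect(result);
        return result;
    }

    private void collect(List<String> out) {
        out.addAll(names);
        for (Trie child : children.values()) {
            child.collect(out);
        }
    }
}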

Option 2: Sorted Vector of name, ID


For each friend, there is a sorted vector of (name, ID) pairs for his/her friends and objects. A prefix match then becomes a binary search for the range of entries starting with the prefix (see the sketch after the problem list below). There are no random pointers now, so we can store this object with the original user as the key, and partitioning gets easier. 

Problem:
Not great on memory consumption: multiple copies of one user's data.
Scales badly when indexing more terms (multiple names for a user).
O(n) insertion: when you insert new data, you need to shift all subsequent entries to make room for it. 
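
A minimal sketch of the sorted-vector lookup (illustrative only): keep the names sorted and binary-search for the range that starts with the prefix.

import java.util.*;

// Illustrative sorted-vector prefix lookup for one friend's names.
class SortedNameIndex {
    private final List<String> names;   // sorted, lowercase

    SortedNameIndex(Collection<String> rawNames) {
        names = new ArrayList<>();
        for (String n : rawNames) {
            names.add(n.toLowerCase());
        }
        Collections.sort(names);
    }

    List<String> startingWith(String prefix) {
        String low = prefix.toLowerCase();
        String high = low + Character.MAX_VALUE;   // every name with this prefix sorts before high
        return names.subList(lowerBound(low), lowerBound(high));
    }

    // Index of the first element >= key.
    private int lowerBound(String key) {
        int lo = 0, hi = names.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (names.get(mid).compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}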

Option 3: Brute force


Friend lists are stored as adjacency lists. For each friend, we prefix-match by brute force in the forward index. For example, Alice has a friend Bob; for each of Bob's friends, we search that friend's index. Say Bob has a friend Jon whose last name is "Doe": if Alice searches "D", Jon Doe should show up. 
Since each friend's and object's data is stored only once, we can achieve near-optimal space utilization. 

Problem:
Recomputing lots of things, high CPU, cache bandwidth.


Option 4: Filtered force


Use a tiny Bloom filter summarizing edge names. 
Briefly speaking, a Bloom filter is an m-bit array. Every element of a set is mapped to some bits of the array using hash functions. To check whether an element is in the set, all the bits it maps to must be set in the array; if any of them is not, the element is definitely not in the set.  
In our prefix-match case, imagine an array for each friend, say char[26] representing 'a' - 'z', summarizing the names on that friend's edges. Now if the query prefix starts with 'c', but when we look at one of Alice's friends, Bob, we find no 'c' in Bob's array, we can safely filter Bob out of the search. The filter may let through some friends that turn out not to match (false positives), but it never filters out a correct result, so the expensive search only runs on the remaining candidates. This approach requires extra space for each friend's array, but it saves CPU. Plus, it's easy to cache, since the friends and objects that mismatch our prefixes can simply be cached and skipped.  
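
A tiny sketch of the filtering idea, using only the first letters of edge names as a simplified stand-in for a real Bloom filter (the class and method names are my own, for illustration):

// First-letter filter per friend: "false" means definitely no edge name starts
// with that letter, "true" means maybe (the real search still has to confirm).
class EdgeNameFilter {
    private final boolean[] present = new boolean[26];   // 'a' - 'z'

    void add(String edgeName) {
        char c = Character.toLowerCase(edgeName.charAt(0));
        if (c >= 'a' && c <= 'z') {
            present[c - 'a'] = true;
        }
    }

    boolean mightMatch(char prefixFirstChar) {
        char c = Character.toLowerCase(prefixFirstChar);
        return c < 'a' || c > 'z' || present[c - 'a'];
    }
}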

Merging, ranking and returning


The last step is to merge the FOF (friends of friends), OOF (objects of friends) and global results with some ranking and return them to the user. There are lots of things to consider in ranking, so think out loud with your interviewer!