The Fake Geek's blog: 2016

Thursday, December 1, 2016

LFU cache

Design and implement a data structure for Least Frequently Used (LFU) cache. It should support the following operations: getand set.

get(key) - Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.
set(key, value) - Set or insert the value if the key is not already present. When the cache reaches its capacity, it should invalidate the least frequently used item before inserting a new item. For the purpose of this problem, when there is a tie (i.e., two or more keys that have the same frequency), the least recently used key would be evicted.

Follow up:
Could you do both operations in O(1) time complexity?

Example:

LFUCache cache = new LFUCache( 2 /* capacity */ );

cache.set(1, 1);
cache.set(2, 2);
cache.get(1);       // returns 1
cache.set(3, 3);    // evicts key 2
cache.get(2);       // returns -1 (not found)
cache.get(3);       // returns 3.
cache.set(4, 4);    // evicts key 1.
cache.get(1);       // returns -1 (not found)
cache.get(3);       // returns 3
cache.get(4);       // returns 4

Similar to LRU cache, we use a doubly linked list to track the frequency. The list node uses the following implementation:

class Node {
        int frequency;
        Queue<integer> keys;
        Node next;
        Node prev;

        public Node(int frequency) {
            this.frequency = frequency;
            keys = new LinkedList<>();
            next = null;
            prev = null;
        }
    }

Frequency is the time a key is accessed. Queue is a list of keys with current frequency. The reason to use a queue is because we want to keep track the order, i.e., we can poll out the head of the queue because the least recently used key will be removed in case there are multiple keys with the same frequency. To visualize the list, I borrow this graph:

Source: http://bookshadow.com/weblog/2016/11/22/leetcode-lfu-cache/

Different from the graph, I use the first node as head, so I can remove keys from head very time I need to.

There are another two maps, one for key-value pair and the other for key-list node to find the position of the key in the list.

When we access one existing node (get the value or set a new value), we increase its frequency, so the key will be removed to the next node if next node has frequency that is larger than the current one by 1 or we need to insert a new node.

The overall complexity is O(1).

    private Node head;
    private Map map;
    private Map values;
    private int capacity;

    public LFUCache(int capacity) {
        head = new Node(0);
        map = new HashMap<>();
        values = new HashMap<>();
        this.capacity = capacity;
    }

    public int get(int key) {
        if (capacity == 0 || !map.containsKey(key)) {
            return -1;
        }
        Node curr = map.get(key);
        changeFrequency(curr, key);
        return values.get(key);
    }

    public void set(int key, int value) {
        if (capacity == 0) {
            return;
        }

        if (!map.containsKey(key) && values.size() == capacity) {
            int toRemove = head.keys.poll();
            map.remove(toRemove);
            values.remove(toRemove);
            if (head.keys.isEmpty()) {
                removeNode(head);
            }
        }
        if (values.containsKey(key)) {
            changeFrequency(map.get(key), key);
        } else {
            if (head.frequency == 0) {
                head.keys.add(key);
            } else {
                Node newNode = new Node(0);
                newNode.keys.add(key);
                newNode.next = head;
                head.prev = newNode;
                head = newNode;
            }
            map.put(key, head);
        }
        values.put(key, value);
    }

    private void changeFrequency(Node curr, int key) {
        int freq = curr.frequency;
        if (curr.next != null && curr.next.frequency == freq + 1) {
            curr.next.keys.add(key);
            map.put(key, curr.next);
        } else {
            Node newNode = new Node(freq + 1);
            newNode.keys.add(key);
            Node next = curr.next;
            newNode.prev = curr;
            newNode.next = next;
            curr.next = newNode;
            if (next != null) {
                next.prev = newNode;
            }
            map.put(key, newNode);
        }
        curr.keys.remove(key);
        if (curr.keys.isEmpty()) {
            removeNode(curr);
        }
    }

    private void removeNode(Node node) {
        Node prev = node.prev;
        Node next = node.next;
        if (prev != null) {
            prev.next = next;
        }
        if (next != null) {
            next.prev = prev;
        }


        if (head == node) {
            head = next;
        }
        if (head == null) {
            head = new Node(0);
        }
    }

    private class Node {
        int frequency;
        Queue keys;
        Node next;
        Node prev;

        public Node(int frequency) {
            this.frequency = frequency;
            keys = new LinkedList<>();
            next = null;
            prev = null;
        }
    }

Monday, November 21, 2016

Amazing number

Define amazing number as: its value is less than or equal to its index. Given a circular array, find the starting position, such that the total number of amazing numbers in the array is maximized.Example 1: 0, 1, 2, 3 Ouptut: 0. When starting point at position 0,
all the elements in the array are equal to its index.
So all the numbers are amazing number.Example 2: 1, 0 , 0Output: 1. When starting point at position 1,
the array becomes 0, 0, 1. All the elements are amazing number.If there are multiple positions, return the smallest one.should get a solution with time complexity less than O(N^2)

Brutal force is trivial: For each index as starting point, count number of amazing number in the array, then update index if we see a max.

There is a clever way of doing it in the original post, which utilizes interval. For any number, if it's larger than the length of the array, there is no way we can make it an amazing number. Otherwise, starting from the next index of the current one can always make the number an amazing number. The largest index would be length_of_array + current_index - array[current_index]. This is because current_index - array[current_index] is the index we need to make the number an amazing number. If the number is larger than current index, this index will become negative, so we need to add length of array to keep the index in the array. For current_index > array[current_index], this operation will lead to index > length of array, so we mod the index by length.

After we get all intervals, we need to find the index that all intervals intersects most, which is the index that contains the most amazing numbers. To do that, we can use algorithm in Range Addition.

public int getAmazingNumber(int[] array) {
        List<Interval> startIndexInterval = new ArrayList<>();
        int len = array.length;
        for (int i = 0; i < len; i++) {
            if (array[i] >= len) {
                continue;
            }
            // for i = len - 1, next interval
            int start = (i + 1) % len;
            // for array[i] < i, mod to make sure index is in the array
            //which may lead start > end
            int end = (i - array[i] + len) % len;
            startIndexInterval.add(new Interval(start, end));
        }

        int[] indexCount = new int[len];
        for (Interval interval : startIndexInterval) {
            indexCount[interval.start]++;
            //for start > end, we need to split interval to two
            //[start, len - 1], [0, end]
            if (interval.start > interval.end) {
                indexCount[0]++;
            }
            if (interval.end + 1 < len) {
                indexCount[interval.end + 1]--;
            }

        }
        int max = indexCount[0];
        int rstIndex = 0;
        for (int i = 1; i < len; i++) {
            indexCount[i] += indexCount[i - 1];
            if (indexCount[i] > max) {
                rstIndex = i;
                max = indexCount[i];
            }
        }
        return rstIndex;
    }

First pair non matching leaves

Given two (binary) trees, return the first pair of non-matching leaves

Tree 1: A, B, C, D, E, null, null
Tree 2: A, D, B

Output: (E,B)

DFS, but be smart on how to get the leaves. The idea is whenever we find two leaves when we traverse the two trees, we compare if there are not equal, if they are, return them, otherwise keep traversing until we find the result.

   public int[] firstPairNoMatchingNodes (TreeNode r1, TreeNode r2) {
        if (r1 == null || r2 == null) {
            return new int[]{-1, -1};
        }
        Stack<TreeNode> tree1 = new Stack<>();
        Stack<TreeNode> tree2 = new Stack<>();
        getLeaves(r1, tree1);
        getLeaves(r2, tree2);

        TreeNode leaf1 = getLeaf(tree1);
        TreeNode leaf2 = getLeaf(tree2);
        while (leaf1 != null && leaf2 != null) {
            if (leaf1.val != leaf2.val) {
                return new int[]{leaf1.val, leaf2.val};
            }
            leaf1 = getLeaf(tree1);
            leaf2 = getLeaf(tree2);
        }
        return new int[]{-1, -1};
    }

    private void getLeaves(TreeNode root, Stack stack) {
        while (root != null) {
            stack.add(root);
            root = root.left;
        }
    }

    private TreeNode getLeaf(Stack tree) {
        while (!tree.isEmpty()) {
            TreeNode curr = tree.pop();
            if (curr.left == null && curr.right == null) {
                return curr;
            } else if (curr.right != null) {
               getLeaves(curr.right, tree);
            }
        }
        return null;
    }

Sunday, November 20, 2016

Image/Video hosting service

Functionality

This is a very broad question, the first thing is to figure out what's the purpose of this hosing service, is it used for showing it to other users like Flicker or is it used for photo backup like Google photos.

The reason is that these two functionalities lead to different requirements to the system. The first system is read intensive but the second one is write intensive.

For now let's think it as an image distribution system, i.e., post a photo and everybody can see it.

High level architecture

Our version 1.0 may only has one server which runs all backend api calls and read/write to our DB. However, as we have mentioned before that the service may scale differently, it would be better to separate our write service from our read service. Moreover, image and video uploading takes lot of bandwidth. Consider an Apache web service can only maintain 500 connections at once, now if at some time all those 500 connections are used for uploading photos, which may take couple seconds, all reading is blocked by this, and it will cause huge problem. On the other hand, reading is quite fast. If each image have a unique ID, retrieving image should be O(1) operation. So it's easy to have ~7000 connections in 1 second even if we can only maintain 500 connections at once.

Image search

The easiest way is to store it through a key-value approach. That allows retrieving becomes O(1) approach. However, most of the time, the user opens an album and we do not only retrieve one photo, but also lots of photos. So we have to solve the partition/data locality problem so that we don't search the DB randomly for the photos.

We can also apply secondary indices for easier search, e.g., on user_id.

Facebook stores data in file systems considering its large-in-size nature. They have a self-designed FS called Haystack. Details can be found in this and this presentation. In brief, the architecture contains four components:

Haystack Directory: which you can think of as a metadata storage, a load balancer and a coordinator.

Haystack Cache: which is... a cache.

Haystack Store: which is actual file system.

CDN: for those very frequent images (like your profile photo or Facebook logo).

The photos are stored in log format. Briefly, you can think of an album of photos are stored in the same file (a super block). For each photo, some metadata is stored in the first(key, cookie, size, flag, etc), then the actual data, and some other metadata in the end. The metadata at first are also used for index.

Haystack Directory is a very interesting design. Whenever a photo write comes in, it goes to the Haystack Directory for 3 physical locations, which will store 3 copies of the node on three different servers. Whenever a read request comes in, Haystack returns one of the physical locations so that the web server can find the photo. It also takes care of load balancing.

Poxy Server

This is another way for optimization. A poxy server can be very helpful in coordinating requests. For example, when lots of people are requesting the same video/photo, poxy server can batch all these requests together into one request, thus reduces network traffic. This is called collapsed forwarding.

Queues

For image/video uploading systems, writes usually takes very long time. In order to achieve high availability, asynchrony is required for the system. One way is to use queues.

Design news feed

News feed is a very interesting problem, it looks very easy, but it can have lots of components.

Functionality

The basic functionality for a news feed service is:

1. A user can post a feed
2. A user can see a list of news feeds from all users he/she follows (including himself/herself) with a particular order.

Data model

There are at least two basic objects: user and feed. Both objects can contains varies columns and can be indexed for different purposes.

The easiest solution for DB is relational DB, e.g., MySql. The advantage on relational DB is that it clearly shows the relations among all the data. In the news feed example, when we post data, we create a new feed object, add it to our DB and, done! Serving data is more complicated, we need to first retrieving all users that the user follows, then find all post ids that those users have ever posted, then get the post contents from the feed table, order them and return to the user. This requires at least three joins, which will take lots of computing resource.

Now since our DB operations become our bottleneck, we need to optimize. One way is to run a batch job periodically so that all the news feed are pre-processed. This will require us to have one more column in the user table to store the news feed, so we are trading off space for time.

NoSQL DB is also an option, however, in this simple case, we can denormalize our data and use relational DB as a NoSQL DB with user_id as the key.

Serving feeds

There are usually two ways to serve the news feeds from a user's friends list: push and pull.

For a push system, when a user published a tweet, the tweet (in fact the pointer of the tweet) is immediately pushed to all his/her friends' news feed. The advantage is that when fetching news feed, we don't need to go through the whole friends list to get the feed, which reduces lots of query time. On the other hand, a user's news feed object will be extremely large and involves lots of write operations if the user has lots of friends.

For a pull system, feeds are only fetched when the user requires them. The upper side is obviously reduced write operations, but we can imagine it will be quite slow for large queries.

The push activity is also called fanout on write and the pull activity is called fanout on load.

Now back to our batch job optimization. This is similar to the pull approach mentioned above. In the pull approach, the whole process may be done when the user's actually loading the page, in our batch job approach, this is done using a scheduled job. This approach may serve most of the functionalities, since most of time news feeds are not time critical. However, in some cases, time may be an important factor. Say Alice makes a post: "Who wants to see the movie tonight?". She definitely wants the post to be seen among her friends immediately. Either the batch job approach or the pull approach is not sufficient for such requirement.

In this case, the push approach works better. In our design, instead of the scheduled batch job, the news feed can be updated whenever a new message from one of the user's friends comes in. This approach may increase lots of write to the DB. Now think of separating reading and writing to different services so that either one will not be affected by the other. Periodically, a batch job can still be run to completely replace all news feeds in this object with new feeds.

In reality, there pros and cons for both approaches. Considering Donald Trump, the newly elected president to be, he has thousands of followers (even though many of them hates him :)). When he makes a post, there are thousands of update operations and are computationally expensive. So for these high profile users, fanout on write should be disabled, and use batch job to update his/her followers news feed.

Ranking

Ranking is always interesting. While there are lots of things we can consider, what matters are two things, features and calculation algorithm. In Facebook, there is (or was) an algorithm called EdgeRank. See this post for detail.

More optimization

Cache, cache and cache. Facebook uses so called "write through cache" rules. Briefly, all of its data read and write are through Memcached. If data is not found in Memcached, Memcached goes to actual data store for data and write it to itself as well as return to client.
Also see this post for an implementation of consistent hashing.

Facebook typeahead search

What is typeahead?

That's the search bar on top of the Facebook page. Easy, right? Now the question is how to design it.

Unlike Google's search result, FB's search is personalized, i.e., for each user, the ranking of the results should be different based on their friends, likes, etc, so static ranking performs badly.

Boils down to the actual question, we would expect the user's first degree relations becomes the top results, then the second degree relations, and then the global results (like Google). Based on this, we actually can split searching to three parts and aggregate together later.

First degree relations include user's friends, his/her likes, activities and etc. For each user, these information may not be changed very quickly, so a batch job can always run to get all the first degree results, and cache it somewhere. When user types the first character, all first degree results can be immediately rendered.

For the global results, it can always be ranked and cached in advance as those are indifferent to different users (like Google search).

Now the real problem comes from the second degree relations, i.e., friends of friends, objects (likes, activities, etc) of friends.

Prefix search

Option 1: Trie

Usually when comes to prefix search, this comes on top of my mind. In this case, for each friend I have, there is a Trie structure that contains all his/her friends and objects. Now we do prefix search on each of those Tries and get the result.

Problem:

Those pointers in the Tries are really big

Pointers point to random places: touch the graph randomly, no partition.

Option 2: Sorted Vector of name, ID

For each of the friend, there is a sorted vector structure for his/her friends and objects. Now prefix matches becomes binary search for the range of the prefix matches. There are no random pointers now, so we can store this object with the original user as the key, and partition gets easier.

Problem:

Not good at memory consumption: multiple copies of one user's data

Scales badly when indexing more terms (multiple names for a user)

O(n) insertion: When you inserting a new data, you need to move all consequent data to give space for the new data.

Option 3: Brutal force

Friends lists are stored as adjacency list. For each of the friend, prefix-match by brute force in forward index. For example, Alice has friend Bob. For each of the friend of Bob's, search the friend's index, e.g., Bob has a friend Jon, his last name is "Doe", now if Alice searches "D", Jon Doe should show up.

Since each friend and object data is stored only once, we can achieve near optimal space utilization.

Problem:

Recomputing lots of things, high CPU, cache bandwidth.

Option 4: Filtered force

Using tiny bloom filter summarizing edge names.

Briefly speaking, bloom filter is an m bits bit array. All elements in a set will be mapped to some bits of in the array using some hash function. To check if a number is in the set, all bits after mapping the number using the hash function should be correctly reflected in the array.

In our prefix match case, we can consider there is an array for each friend, say char[26] representing 'a' - 'z'. Now if there is a prefix starts with 'c', but when we search one of Alice's friend 'Bob', we found that no 'c' in 'Bob' array, we can easily filter out Bob from our search. This may not guarantee the final result has all correct results, but all incorrect results will be filtered out. This approach requires extra space for the array for each friend, but we can increase CPU performance. Plus, it's easy to cache since we can easily cache those friends and objects that mismatches our prefixes.

Merging, ranking and returnning

The last step is to merge the FOF, OOF results and Global results with some ranking and return to user. There are lots of things that should be considered in ranking, and think loud to your interviewer!

Saturday, November 19, 2016

Design interview general approach -- tiny url example

1. Understand requirement:
Start from understanding basic functionality. For example, If we need to design a tiny url, we know we want to map a long url to a shorter one.

2. Figure out what data we need to store:
In the tiny url example, it's quite simple: we need to store the actual url and the shorter one (ID).

3. Figure out your database:
Different data requires different DB. It would be easier for you to start with some relational database and then compare it with NoSQL DB.

In this example, since the structure is so easy, there may not be lots of relations between data, so NoSQL may be a good choice. Here, key may be our generated ID (the tiny url) and the value can be the original url.

4. Think about your API:
How to grab data from DB? How to operate data? How to connect with front end? How to display to your user?
Let's walk through our tiny url example. When we receive a long url, our TinyUrlService interface calls createId(URL url) method to create a new id as the tiny url. There can be multiple ways to implement this method. The easiest one can be having incremental ID. However, as incremental ID only contains numbers, which means there can be at most 100 thousand ones (0 - 99999), and it's hard to scale. One way to optimize is to use Base 64 encoding, and we can increase representation to 64^5 = 2^30 ids. The implementation can be found in my previous post, and you can take a look if you are interested.

Now when the client request the page with the tiny url, our TinyUrlService interface will call getId(ID id) to retrieve the original url from the DB, if it matches, we redirect the page to the original url, otherwise we return code 404.

5. Now think about scalability, reliability and reduce latency:
We have mentioned to use Base 64 encoding to allow more tiny urls, that can be one way for scalability. Using NoSQL for fast query is another way when you have lots of data.

Now another common approach is to use cache. In the tiny url example, Least Frequent Use (LFU) can be a good one. Also you can use some other in-memory storage (e.g., Memcached) for actual caching implementation.

If we have lots of data, we need to think about sharding. Here, since we already have a key, we can just shard by this key. Moreover, think about consistent hash (check this post) so that it's easy to add more machines. (scalability!)

Now if we only have incremental key (with base 64 encoding), it has security issues. So we need to use some hashing mechanism so that the actual key doesn't look like the one shown to the client.

Friday, November 18, 2016

Design chat server

Explain how you would design a chat server. In particular, provide details about the various backend components, classes, and methods. What would be the hardest problems to solve?

1. Understand requirement

When given such a problem, you probably want to discuss with the interviewer what may be the requirements for the chat server. Some ideas can be as follows:

Personal messaging
Group messaging
Sign on/sign off
Add friend requests

2. Figure out what data we need to store

User relation: User and all his/her friends that he/she can send message to
user_id, list of friends
Message:
sender, receiver, time, message_id, content, group_id
* if message is sent to a group, receiver should be empty or a universal id for all groups.
Device:
Consider the user has multiple devices, we need to ensure all devices receive messages simultaneously.
device_id, type, status (enum, online, offline, etc)
***Notification queue:
To ensure every message is notified to the receivers, we need to store those messages that are waiting to be notified to receivers, in case anything happened with the servers.

3. Figure out your DB

Always start from relational DB. That's easier to understand the relations between different objects and to maintain. For this problem, the relational DB part can be easily figured out from above data types.

4. Think about your API

Now comes the fun part. How to design the whole thing so that our chat server will work.
Let's start from a new user registering to our chat server. So we need a User interface, in which it should have a method called createUser(some parameter). For an existing user, if he or she logs in, we need to grab his/her chat history/profile/etc, so we need another interface, called getUser().
Now if the user wants to send a message, there should be two methods called send(Sender sender, Receiver receiver, Content content) and send(Sender sender, Group group, Content content). These two methods can be put in another interface, called MessageSendingService (or else). The action of sending message will create a new Message object.
Now let's think about the process of sending a message. First the user send a POST request with a new message content. This request will be transformed to an API call to our MessageSendingService, which will call send method and create the message. A new message will then be created and stored in our DB. Now at this point, the message is successfully sent from the sender, so we can send response to sender that the message is successfully sent. The next thing is to notify the receiver. Now we can have another service called MessageNotificationService, which will have a method called createNotificationRequest(Receiver receiver), which will create list of notification requests for each "active" device the user has. "Active" status can be acquired by checking status of the device in device id. The interface can have another method called pushNotification(Message message) which will push push each message to the receiver.
Now there is one more interface we need, which can be called UserRelationManager, which can have two methods, createRelation(User requestUser, User responseUser), which one user will send a "friend" request to another user, and another one, createGroup(User requestUser, List<User> group), which will add all users to a Group object and save to DB.

These are possible interfaces and methods we may need for our API, but there should be more and better solutions.

5. Now think about scalability and reliability

* Think about using NoSQL, how would you denormalize?
* Separate front end, back end and DB. Front end servers only care about transforming client requests and call back end server. Backend servers make API calls, read and write to DB and send response to front end server. This makes things easier later we want to expand our app to mobile or other platforms.
* Check heartbeat
* Using load balancer, replication, caching, batch processing...

6. Think about the hardest problem

* How to guarantee exactly once?
Retry if message delivery not successful.
In client side using hash to identify already delivered message.

7. The reality

The actual Facebook chat architecture can be found in this presentation. I wrote this post before I saw this presentation, and I'm happy that the actual implementation doesn't deviate from what I propose. :)

The challenge in reality is that the "status" field, which in reality is called "Presence", is hard to scale. In FB's implementation, they use an actual set of servers for presence. Presence aggregates online info in memory, and do periodic AJAX polling for list of online friends.

Conversations are stored in log format.

The notification queue is maintained in user bases, i.e., each user has a channel for all his/her devices, , and long-polling is used for delivering messages. Briefly, long-polling is that when client sends a request for a message, the connection between client and server keeps open until new message comes in, or until time out. When the client receives the response, the connection closes and the client files a new request.

Wednesday, November 16, 2016

Largest BST Subtree

Given a binary tree, find the largest subtree which is a Binary Search Tree (BST), where largest means subtree with largest number of nodes in it.

Note:
A subtree must include all of its descendants.
Here's an example:

The Largest BST Subtree in this case is the highlighted one.
The return value is the subtree's size, which is 3.

Hint:

You can recursively use algorithm similar to 98. Validate Binary Search Tree at each node of the tree, which will result in O(nlogn) time complexity.

Use a struct which contains current subtree number the and the maximum subtree so far. Recursively get the result from the two children. Now if curr is 0 for either tree, it means the node's children are not valid BST, so we only track the maximum subtree seen so far. Otherwise, if current node and its left and right children follows the BST rule, we update the maximum subtree number and current subtree number.

public int largestBSTSubtree(TreeNode root) {
        return getTree(root).max;
    }

    private BSTCounter getTree(TreeNode root) {
        if (root == null) {
            return new BSTCounter(0, 0);
        }
        if (root.left == null && root.right == null) {
            return new BSTCounter(1, 1);
        }
        BSTCounter left = getTree(root.left);
        BSTCounter right = getTree(root.right);
        BSTCounter curr = new BSTCounter(0, 0);
        if (left.curr == -1 || right.curr == -1
            || (root.left != null && root.left.val >= root.val)
            || (root.right != null && root.right.val <= root.val)) {
            curr.curr = -1;
            curr.max = Math.max(left.max, right.max);
        } else {
            curr.curr = left.curr + right.curr + 1;
            curr.max = curr.curr;
        }
        return curr;
    }

    private class BSTCounter {
        int curr;
        int max;
        public BSTCounter(int curr, int max){
            this.curr = curr;
            this.max = max;
        }
    }

Sunday, November 13, 2016

Number of Connected Components in an Undirected Graph

Given n nodes labeled from 0 to n - 1 and a list of undirected edges (each edge is a pair of nodes), write a function to find the number of connected components in an undirected graph.

Example 1:

     0          3
     |          |
     1 --- 2    4

Given n = 5 and edges = [[0, 1], [1, 2], [3, 4]], return 2.

Example 2:

     0           4
     |           |
     1 --- 2 --- 3

Given n = 5 and edges = [[0, 1], [1, 2], [2, 3], [3, 4]], return 1.

Note:
You can assume that no duplicate edges will appear in edges. Since all edges are undirected, [0, 1] is the same as [1, 0] and thus will not appear together in edges.

Union find, easiest solution you can have.

public int countComponents(int n, int[][] edges) {
        UnionFind unionFind = new UnionFind(n);
        for (int[] e : edges) {
            unionFind.union(e[0], e[1]);
        }
        return unionFind.size;
    }

    private class UnionFind {
        int[] roots;
        int size;

        public UnionFind(int n) {
            this.roots = new int[n];
            Arrays.fill(roots, -1);
            size = n;
        }

        public int find(int x) {
            if (roots[x] < 0) {
                return x;
            }
            roots[x] = find(roots[x]);
            return roots[x];
        }

        //false if already in the same union
        //true if successfully union these two vertices
        public boolean union(int x, int y) {
            if (x == y) {
                return false;
            }
            int r1 = find(x);
            int r2 = find(y);
            if (r1 == r2) {
                return false;
            }
            if (roots[r1] > roots[r2]) {
                roots[r1] = r2;
            } else {
                if (roots[r1] == roots[r2]) {
                    roots[r1]--;
                }
                roots[r2] = r1;
            }
            size--;
            return true;
        }
    }

Saturday, November 12, 2016

Range Sum Query 2D - Mutable

Given a 2D matrix matrix, find the sum of the elements inside the rectangle defined by its upper left corner (row1, col1) and lower right corner (row2, col2).

The above rectangle (with the red border) is defined by (row1, col1) = (2, 1) and (row2, col2) = (4, 3), which contains sum = 8.

Example:

Given matrix = [
  [3, 0, 1, 4, 2],
  [5, 6, 3, 2, 1],
  [1, 2, 0, 1, 5],
  [4, 1, 0, 1, 7],
  [1, 0, 3, 0, 5]
]

sumRegion(2, 1, 4, 3) -> 8
update(3, 2, 2)
sumRegion(2, 1, 4, 3) -> 10

Note:

The matrix is only modifiable by the update function.
You may assume the number of calls to update and sumRegion function is distributed evenly.
You may assume that row1 ≤ row2 and col1 ≤ col2.

Using Binary index tree, similar as the 1D one, see here for explanation.

public class RangeSumQuery2DMutable {

    int[][] binaryIndexTree;
    int[][] nums;

    public RangeSumQuery2DMutable(int[][] nums) {
        int m = nums.length, n = nums[0].length;
        binaryIndexTree = new int[m][n + 1];
        this.nums = nums;
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                add(i, j + 1, nums[i][j]);
            }
        }
    }

    private void add(int row, int col, int val) {
        while (col < binaryIndexTree[row].length) {
            binaryIndexTree[row][col] += val;
            col += (col&-col);
        }
    }

    public void update(int row, int col, int val) {
        add(row, col + 1, val - nums[row][col]);
        nums[row][col] = val;
    }

    public int sumRegion(int row1, int col1, int row2, int col2) {
        int ans = 0;
        for (int i = row1; i <= row2; i++) {
            ans += sum(i, col2 + 1) - sum(i, col1);
        }
        return ans;
    }

    private int sum(int row, int col) {
        int ans = 0;
        while (col > 0) {
            ans += binaryIndexTree[row][col];
            col -= (col&-col);
        }
        return ans;
    }
}

Friday, November 11, 2016

Number of Boomerangs

Given n points in the plane that are all pairwise distinct, a "boomerang" is a tuple of points (i, j, k) such that the distance between iand j equals the distance between i and k (the order of the tuple matters).

Find the number of boomerangs. You may assume that n will be at most 500 and coordinates of points are all in the range [-10000, 10000](inclusive).

Example:

Input:
[[0,0],[1,0],[2,0]]

Output:
2

Explanation:
The two boomerangs are [[1,0],[0,0],[2,0]] and [[1,0],[2,0],[0,0]]

For each point, we use a map to track the distance and number of points seen for this distance. Total number of points with same distance should be x * (x - 1) where x is the number of points with a calculated distance from this point.

    public int numberOfBoomerangs(int[][] points) {
        if (points.length == 0) {
            return 0;
        }
        int sum = 0;
        int len = points.length;
        for (int i = 0; i < len; i++) {
            Map distances = new HashMap<>();
            for (int j = 0; j < len; j++) {
                if (i == j) {
                    continue;
                }
                int distance = (points[i][0] - points[j][0]) * (points[i][0] - points[j][0]) + (points[i][1] - points[j][1]) * (points[i][1] - points[j][1]);
                distances.put(distance, distances.getOrDefault(distance, 0) + 1);
            }
            sum += distances.values().stream().mapToInt(x -> x * (x - 1)).sum();
        }
        return sum;
    }

Wednesday, November 9, 2016

Number of Islands II

A 2d grid map of m rows and n columns is initially filled with water. We may perform an addLand operation which turns the water at position (row, col) into a land. Given a list of positions to operate, count the number of islands after each addLand operation. An island is surrounded by water and is formed by connecting adjacent lands horizontally or vertically. You may assume all four edges of the grid are all surrounded by water.

Example:

Given m = 3, n = 3, positions = [[0,0], [0,1], [1,2], [2,1]].
Initially, the 2d grid grid is filled with water. (Assume 0 represents water and 1 represents land).

0 0 0
0 0 0
0 0 0

Operation #1: addLand(0, 0) turns the water at grid[0][0] into a land.

1 0 0
0 0 0   Number of islands = 1
0 0 0

Operation #2: addLand(0, 1) turns the water at grid[0][1] into a land.

1 1 0
0 0 0   Number of islands = 1
0 0 0

Operation #3: addLand(1, 2) turns the water at grid[1][2] into a land.

1 1 0
0 0 1   Number of islands = 2
0 0 0

Operation #4: addLand(2, 1) turns the water at grid[2][1] into a land.

1 1 0
0 0 1   Number of islands = 3
0 1 0

We return the result as an array: [1, 1, 2, 3]

The easiest solution for this problem utilizes union-find. Each time we add a land, first increment the size of the total number of islands, then for each of its neighbors, union them if the neighbor is also an island.

public class NumberOfIslands {
    private int[] rankArray;

    int size, cols, rows;

    public NumberOfIslands (int m, int n) {
        rankArray = new int[m * n];
        this.size = 0;
        this.cols = n;
        this.rows = m;
        Arrays.fill(rankArray, rows * cols + 1);
    }

    public int add(int x, int y) {
        int index = getIndex(x, y);
        rankArray[index] = -1;
        size++;
        int[] neighbors = getNeighbors(x, y);
        for (int n : neighbors) {
            if (n < 0 || rankArray[n] == rows * cols + 1) {
                continue;
            }
            union(index, n);
        }
        return size;
    }

    private int getIndex(int x, int y) {
        return x * cols + y;
    }

    //left, right, down, up
    private int[] getNeighbors(int x, int y) {
        int[] neighbors = new int[4];
        neighbors[0] = x - 1 >= 0 ? getIndex(x - 1, y) : -1;
        neighbors[1] = x + 1 < rows ? getIndex(x + 1, y) : -1;
        neighbors[2] = y - 1 >= 0 ? getIndex(x, y - 1) : -1;
        neighbors[3] = y + 1 < cols ? getIndex(x, y + 1) : -1;
        return neighbors;
    }

    private void union(int index1, int index2) {
        int root1 = find(index1);
        int root2 = find(index2);
        if (root1 == root2)  {
            size--;
            return;
        }
        if (rankArray[root2] < rankArray[root1]) {
            rankArray[root1] = root2;
            size--;
        } else {
            if (rankArray[root1] == rankArray[root2]) {
                rankArray[root1]--;
            }
            rankArray[root2] = root1;
            size--;
        }
    }

    private int find(int index) {
        if (rankArray[index] < 0) {
            return index;
        }
        int root = find(rankArray[index]);
        rankArray[index] = root;
        return root;
    }
}

Smallest Rectangle Enclosing Black Pixels

An image is represented by a binary matrix with 0 as a white pixel and 1 as a black pixel. The black pixels are connected, i.e., there is only one black region. Pixels are connected horizontally and vertically. Given the location (x, y) of one of the black pixels, return the area of the smallest (axis-aligned) rectangle that encloses all black pixels.

For example, given the following image:

[
  "0010",
  "0110",
  "0100"
]

and x = 0, y = 2,

Return 6.

There are lots of ways to solve this problem. I use DFS because all black pixels are connected to each other so it's quite easy to figure out the boundaries of the black pixels.

public int minArea(char[][] image, int x, int y) {

        int[] coords = new int[4];
        coords[0] = Integer.MAX_VALUE;
        coords[1] = Integer.MIN_VALUE;
        coords[2] = Integer.MAX_VALUE;
        coords[3] = Integer.MIN_VALUE;
        boolean[][] visited = new boolean[image.length][image[0].length];
        findRange(image, coords, x, y, visited);
        return (coords[3] - coords[2] + 1) * (coords[1] - coords[0] + 1);
    }

    private void findRange(char[][] image, int[] coords, int x, int y, boolean[][] visited) {
        visited[x][y] = true;
        coords[0] = Math.min(coords[0], x);
        coords[1] = Math.max(coords[1], x);
        coords[2] = Math.min(coords[2], y);
        coords[3] = Math.max(coords[3], y);
        if (x - 1 >= 0 && image[x - 1][y] == '1' && !visited[x - 1][y]) {
            findRange(image, coords, x - 1, y, visited);
        }
        if (x + 1 < image.length && image[x + 1][y] == '1' && !visited[x + 1][y]) {
            findRange(image, coords, x + 1, y, visited);
        }
        if (y - 1 >= 0 && image[x][y - 1] == '1' && !visited[x][y - 1]) {
            findRange(image, coords, x, y - 1, visited);
        }
        if (y + 1< image[0].length && image[x][y + 1] == '1' && !visited[x][y + 1]) {
            findRange(image, coords, x, y + 1, visited);
        }
    }

BFS is another way to solve it. Basically, from the point given to you, the upper bound of black pixels should range from (0, x), lower bound (x, m - 1), left bound(0, y) and right bound (y, n - 1). Code not included here.

Monday, November 7, 2016

Unique Word Abbreviation

An abbreviation of a word follows the form <first letter><number><last letter>. Below are some examples of word abbreviations:

a) it                      --> it    (no abbreviation)

     1
b) d|o|g                   --> d1g

              1    1  1
     1---5----0----5--8
c) i|nternationalizatio|n  --> i18n

              1
     1---5----0
d) l|ocalizatio|n          --> l10n

Assume you have a dictionary and given a word, find whether its abbreviation is unique in the dictionary. A word's abbreviation is unique if no other word from the dictionary has the same abbreviation.

Example:

Given dictionary = [ "deer", "door", "cake", "card" ]

isUnique("dear") -> false
isUnique("cart") -> true
isUnique("cane") -> false
isUnique("make") -> true

Use a set and get all abbreviations in the dict. Now for each word, check if the set contains the abbreviation.

public class ValidWordAbbr {
    private Set<string> abbreviatons;
    public ValidWordAbbr(String[] dictionary) {
        abbreviatons = new HashSet<>();
        for (String s : dictionary) {
            abbreviatons.add(getAbbreviatoin(s));
        }

    }
    public boolean isUnique(String word) {
        String abbr = getAbbreviatoin(word);
        return !abbreviatons.contains(abbr);
    }

    private String getAbbreviatoin(String s) {
        int len = s.length();
        if (len == 0) {
            return "";
        }
        StringBuilder builder = new StringBuilder();
        builder.append(s.charAt(0));
        if (len == 1) {
            return builder.toString();
        }
        if (len - 2 > 0) {
            builder.append(len - 2);
        }
        builder.append(s.charAt(len - 1));
        return builder.toString();
    }
}

Walls and gates

You are given a m x n 2D grid initialized with these three possible values.

-1 - A wall or an obstacle.
0 - A gate.
INF - Infinity means an empty room. We use the value 231 - 1 = 2147483647 to represent INF as you may assume that the distance to a gate is less than2147483647.

Fill each empty room with the distance to its nearest gate. If it is impossible to reach a gate, it should be filled with INF.

For example, given the 2D grid:

INF  -1  0  INF
INF INF INF  -1
INF  -1 INF  -1
  0  -1 INF INF

After running your function, the 2D grid should be:

  3  -1   0   1
  2   2   1  -1
  1  -1   2  -1
  0  -1   3   4

Basic BFS problem. Add all cells of gates to the queue. Now for each cell polled out, check all its neighbors, update all neighbors that are valid cells and have distance to a gate larger than current one plus one.

public void wallsAndGates(int[][] rooms) {
        Queue queue = new LinkedList();
        int r = rooms.length, c = rooms[0].length;
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                if (rooms[i][j] == 0) {
                    queue.add(i * c + j);
                }
            }
        }
        int[][] dirs = {{0, 1}, {0, -1}, {1, 0}, {-1, 0}};

        while (!queue.isEmpty()) {
            int curr = queue.poll();
            for (int[] d : dirs) {
                int x = curr / c + d[0];
                int y = curr % c + d[1];
                if (x < 0 || x >= r || y < 0 || y >= c ||
                    rooms[x][y] <= rooms[x - d[0]][y - d[1]] + 1) {
                    continue;
                }
                rooms[x][y] = rooms[x - d[0]][y - d[1]] + 1;
                queue.add(x * c + y);
            }
        }
    }

Inorder Successor in BST

Given a binary search tree and a node in it, find the in-order successor of that node in the BST.

Note: If the given node has no in-order successor in the tree, return null.

If the node has a right child, no doubt the successor should be the left most children of its right child. Otherwise, either the node is the right most node which does not have a successor or it is a left child of some node. In both case, we can traverse the tree, whenever we find the node is smaller than the root, we set the successor to the root of the node and proceed to check its left child until we find the node.

    public TreeNode inorderSuccessor(TreeNode root, TreeNode p) {
        if (p == null) {
            return null;
        }
        if (p.right != null) {
            return leftMost(p.right);
        }
        TreeNode suc = null;
        while (root != null) {
            if (p.val < root.val) {
                suc = root;
                root = root.left;
            } else if (p.val > root.val) {
                root = root.right;
            } else {
                break;
            }
        }
        return suc;
    }

    private TreeNode leftMost(TreeNode node) {
        while (node.left != null) {
            node = node.left;
        }
        return node;
    }

Sunday, November 6, 2016

Zigzag Iterator

Given two 1d vectors, implement an iterator to return their elements alternately.

For example, given two 1d vectors:

v1 = [1, 2]
v2 = [3, 4, 5, 6]

By calling next repeatedly until hasNext returns false, the order of elements returned by next should be: [1, 3, 2, 4, 5, 6].

Follow up: What if you are given k 1d vectors? How well can your code be extended to such cases?

Clarification for the follow up question - Update (2015-09-18): The "Zigzag" order is not clearly defined and is ambiguous for k > 2 cases. If "Zigzag" does not look right to you, replace "Zigzag" with "Cyclic". For example, given the following input:

[1,2,3]
[4,5,6,7]
[8,9]

It should return [1,4,8,2,5,9,3,6,7].

Using two iterators, alternatively calling get the next element from each iterator when call next() method for the ZigzagIterator object. If one iterator reaches to the end, then call the other one only from then on.

public class ZigzagIterator {
    Iterator iFirst;
    Iterator iSecond;
    boolean first;

    public ZigzagIterator(List v1, List v2) {
        iFirst = v1.iterator();
        iSecond = v2.iterator();
        first = true;
    }

    public boolean hasNext() {
        return iFirst.hasNext() || iSecond.hasNext();
    }

    public int next() {
        int toReturn = 0;
        if (first) {
            toReturn = iFirst.next();
        } else {
            toReturn = iSecond.next();
        }
        if (!iFirst.hasNext()) {
            first = false;
        } else if (!iSecond.hasNext()) {
            first = true;
        } else {
            first = !first;
        }
        return toReturn;
    }
}

For the follow up, using a list of iterators.

public class CyclingIterator {

    private List<iterator<Integer>> iterators;
    int index;
    public CyclingIterator(List<list<Integer>> lists) {
        iterators = new ArrayList<>();
        for (int i = 0; i < lists.size(); i++) {
            iterators.add(lists.get(i).iterator());
        }
    }

    public int next() {
        if (!hasNext()) {
            throw new RuntimeException("Empty iterator!");
        }
        Iterator<integer> curr = iterators.get(index);
        int toReturn = curr.next();
        if (!curr.hasNext()) {
            iterators.remove(index);
        } else {
            index++;
        }
        if (index == iterators.size()) {
            index = 0;
        }
        return toReturn;
    }

    public boolean hasNext() {
        return iterators.size() > 0;
    }
}

Wiggle sort

Given an unsorted array nums, reorder it in-place such that nums[0] <= nums[1] >= nums[2] <= nums[3]....

For example, given nums = [3, 5, 2, 1, 6, 4], one possible answer is [1, 6, 2, 5, 3, 4].

The easiest solution is to sort the array and add elements from head and tail:

1 2 3 4 5 6 => 1 6 2 5 3 4

But, this solution takes O(nlogn) complexity. We definitely can do better.

Based on the problem, we know that for any index i, if i is even, then nums[i] <= nums[i - 1], other wise, nums[i] >= nums[i - 1]. Now we only need to traverse the array once and swap the elements that don't follow the rule.

public void wiggleSort(int[] nums) {
        for (int i = 1; i < nums.length; i++) {
            if ((i % 2 != 0 && nums[i] < nums[i - 1]) || (i % 2 == 0 && nums[i] > nums[i - 1])) {
                swap(nums, i, i - 1);
            }
        }
    }
    private void swap(int[] nums, int i, int j) {
        int tmp = nums[i];
        nums[i]  = nums[j];
        nums[j] = tmp;
    }

Trapping Rain Water II

Given an m x n matrix of positive integers representing the height of each unit cell in a 2D elevation map, compute the volume of water it is able to trap after raining.

Note:
Both m and n are less than 110. The height of each unit cell is greater than 0 and is less than 20,000.

Example:

Given the following 3x6 height map:
[
  [1,4,3,1,3,2],
  [3,2,1,3,2,4],
  [2,3,3,2,3,1]
]

Return 4.

The above image represents the elevation map [[1,4,3,1,3,2],[3,2,1,3,2,4],[2,3,3,2,3,1]] before the rain.

After the rain, water are trapped between the blocks. The total volume of water trapped is 4.

When it turns to 3D , it becomes a little bit more difficult.

We know that the highest amount of water the container can hold depends on the lowest height in the surroundings. Thus we can use a min-heap, first put all surrounding cells in it. Then each time poll out the cell with the lowest height, find if it has unvisited even lower neighbors, and add water to it (calculate sum). Now once we pour the water in it, the height of the neighbor cell should have the same height as the original cell (if it is not higher, in which case we cannot put water in it), and thus later we can track how much water it can help its neighbors hold, so we need to put the maximum height of the current cell and neighbors cell into the queue.

There was one concern I had, which can be explained using the following graph:

Consider I have an elevation map like this (looking from top down). Now the current cell I poll out is cell 1, its neighbor cell 2 has a lower height. Cell 3 is an unvisited neighbor of cell 2, which has a lower height than cell 1 but higher than cell 2. My question was why do we add height difference between cell 1 and 2 without checking cell 3?

To explain this, we need to involve another boundary cell 4. There are two cases, first, cell 4 has already been visited, second, it's not. For the first case, if it is already visited, we are sure it has lower height than cell 1, and we know its neighbor cell 3 has lower height then cell 1, which should also be visited, so it doesn't satisfy our condition. For the second case, cell 4 should be higher than cell 1. The water the container can hold in this column is determined by cell 1, even if cell 3 has lower height than cell 1, it will be filled by water with the same height as cell 1 later because cell 4 has a higher height. So in the end both cell 2 and cell 3 will be filled with water.

public int trapRainWater(int[][] heightMap) {
        int m = heightMap.length;
        if (m == 0) {
            return 0;
        }
        int n = heightMap[0].length;
        if (n == 0) {
            return 0;
        }
        PriorityQueue<Cell> pq = new PriorityQueue<>();
        boolean[][] visited = new boolean[m][n];
        for (int i = 0; i < m; i++) {
            pq.add(new Cell(i, 0, heightMap[i][0]));
            visited[i][0] = true;
            pq.add(new Cell(i, n - 1, heightMap[i][n - 1]));
            visited[i][n - 1] = true;
        }
        
        for (int j = 1; j < n - 1; j++) {
            pq.add(new Cell(0, j, heightMap[0][j]));
            visited[0][j] = true;
            pq.add(new Cell(m - 1, j, heightMap[m - 1][j]));
            visited[m - 1][j] = true;
        }
        
        int[] dirX = {0, 0, -1, 1};
        int[] dirY = {-1, 1, 0, 0};
        int sum = 0;
        while (!pq.isEmpty()) {
            Cell curr = pq.poll();
            for (int i = 0; i < 4; i++) {
                int nx = curr.x + dirX[i];
                int ny = curr.y + dirY[i];
                if (nx >= 0 && nx < m && ny >= 0 && ny < n && !visited[nx][ny]) {
                    sum += Math.max(0, curr.h - heightMap[nx][ny]);
                    pq.add(new Cell(nx, ny, Math.max(heightMap[nx][ny], curr.h)));
                    visited[nx][ny] = true;
                }
            }
        }
        return sum;
    }
    
    private class Cell implements Comparable<Cell> {
        int x, y, h;
        public Cell(int x, int y, int h) {
            this.x = x;
            this.y = y;
            this.h = h;
        }
        
        public int compareTo(Cell other) {
            return h - other.h;
        }
    }

AdSense

Thursday, December 1, 2016

Monday, November 21, 2016

Sunday, November 20, 2016

Functionality

High level architecture

Image search

Poxy Server

Queues

Functionality

Data model

Serving feeds

Ranking

More optimization

What is typeahead?

Prefix search

Option 1: Trie

Option 2: Sorted Vector of name, ID

Option 3: Brutal force

Option 4: Filtered force

Merging, ranking and returnning

Saturday, November 19, 2016

Friday, November 18, 2016

Wednesday, November 16, 2016

Sunday, November 13, 2016

Saturday, November 12, 2016

Friday, November 11, 2016

Wednesday, November 9, 2016

Monday, November 7, 2016

Sunday, November 6, 2016