Thursday, October 13, 2016

Design Twitter

This is a very common interview question. There are lots of things we should consider when we solve this problem. No matter how, we should think at high level first and break down problems to small problems. Twitter contains lots of things, e.g., news feed, social graph, search engine, security, etc. When we start this question, we should think what's the priority of each component, then we can start think about actual design.

First thought

Let's look back at when Twitter was first developed. We want a Tweets news feed that can serve only 500 people. It's quite small, so let's put scalability aside for now. What do we need... at least?

a front end server,
a back end server
and a database.

The front end server is responsible for returning result to client, e.g., whenever I login my Twitter, I should see a bunch of feeds going on. The back end server has the duty to generate the result for each user, so whenever the front end requires a result, the backend server can return it immediately. And the database... is database.

Now we can dive into database a little bit. How should we design the database. For 500 people, relational database may be good enough. There are at least two objects, User and Tweets. Also there are at least two relations: user-tweets and user-user (followers). User-tweets relations is easy, user_id - tweet-id. For user-user relations, the easiest way is build a follower table with one entry as the follower and the other one as the followee.

Now think about how our design would work here. Our backend server runs a query which first fetch all followee id in user-user relation table, then go to the user-tweets table and search for all Tweets we want. This operation requires at least 3 joins, for our 500 people Tweeter, it may still work, but the performance is definitely compromised.

The Twitter is getting more popular... 

 Well, your users tell their friends they found the cool stuff called Twitter, now you have 5000 users, at least. Your database is definitely getting slower. So how could you improve?

The first thing we can think of is to reduce the number of joins, which is denormalization. For example, we can store tweets contents together in user-tweet table, so we don't have to join tweet table anymore. But the disadvantage would be tremendous data redundancy. Besides, by storing tweets in two different tables, there may exists data inconsistency issue if one table got updated and the other isn't.

If we want to reduce the amount of replicated data, we can just store everything to a very huge table, with one PK. For example, a record of one user can have all his/her profile information, followers/followees list and tweets he/she posted. This makes lookup very easy: just go to that user and find everything. Here are two questions:

How to store the user-user relation and user-tweet relation in one table?
What is the disadvantage of this approach?

For the first question, one way is to store the relations as an object, e.g., user_id -> list of followers, list of followees. Now you probably will ask does relational database support this? Probably not, but most of the databases support storing binary strings, so we can serialize our object to binary strings and deserialize it when we need them.

For the second question, we probably will have to face huge table issue. when we want to load a record, the number of fields will be extremely large. Moreover, if we only want to load user profile for a user, loading such a large row may be unnecessary. If we are using column store DB, the problem may not be too severe, because we can always select the correct column, and if our index is quite good, the performance may not be too bad. However, if we are using NoSQL, whenever we give it a key, it returns a large record, and the problem will be more obvious.

On the other hand, using NoSQL for record retrieving is extremely fast... So can we improve? Now we are going back to the normalization part. For NoSQL, it may be more efficient to store different things in different tables, e.g., user profile table only contains user information, then we have a user-user relation table which stores the user, an object(binary string) of all his/her followers and an object of all his/her followees, another user-tweets table which has the key as the user and the value as all tweets ids he/she have, the last table is the tweets table. The idea is to use a linked list fashion, and we throw everything to our backend server and generate the result for us.



Twitter is today's Twitter


The first thing we want to seriously consider is to switch our DB to NoSQL, like Cassandra or HBase. The pros and cons have been discussed previously, but read and write will be much faster.

Now we have a NoSQL DB and it's configured prefectly, but we still find it's slow because Twitter is so popular. What else can we do?

Our backend program hasn't been upgraded for a while. Probably we can do something on that? The first thing is how we should generate our feed object. Generally there are two ways of doing that: push or pull.

In the push approach, once a user publishes a tweet, all his/her follower's news feed object is got updated. The advantage is that it will be very fast to fetch the news feed, because at anytime, it's ready. However, the downside is that if a user follows lots of people, his/her feed object will be very very large, and the write operation will be increased extremely, so as the memeory taken.

On the other hand, the pull approach, feeds object are updated only when it is asked to. In reality, this is done through map reduce jobs. For example, an hadoop job is scheduled every 15 minutes, pulling all user's feeds a user follows, sort them and get the most N recent feeds. The disadvantage is there are lots of read operations and it may get slow. But for users with large profile, it may be quite userful.

In fact, there is an approach called selective fanout, which is a combination of push and pull approaches. If you are mainly using push model, you can disable the push for users that have lots of followees.  Moreover, some users may be inactive, then the push operation can also be disabled for those user.

The generated object, no matter through pull or push approach, is stored in a cache. Memcached or Redis are good choices. When front end asks for results, backend server first checks if its cache has the result, if not, then backend server starts to generate results.


Ok, what else can we do?
We know for large systems like Twitter, one machine is definitely not enough ---- we need distributed systems. One thing we should try is to partition responsibilities. For example, using one (or more) machines to only generate user profiles and the other for tweets feeds.

Now since we have multiple machines. We need a load balancer for all servers. Check about different assigning algorithms (e.g., Round Robin).

No comments:

Post a Comment