Functionality
The basic functionality for a news feed service is:
1. A user can post a feed
2. A user can see a list of news feeds from all users he/she follows (including himself/herself) with a particular order.
Data model
There are at least two basic objects: user and feed. Both objects can contains varies columns and can be indexed for different purposes.
The easiest solution for DB is relational DB, e.g., MySql. The advantage on relational DB is that it clearly shows the relations among all the data. In the news feed example, when we post data, we create a new feed object, add it to our DB and, done! Serving data is more complicated, we need to first retrieving all users that the user follows, then find all post ids that those users have ever posted, then get the post contents from the feed table, order them and return to the user. This requires at least three joins, which will take lots of computing resource.
Now since our DB operations become our bottleneck, we need to optimize. One way is to run a batch job periodically so that all the news feed are pre-processed. This will require us to have one more column in the user table to store the news feed, so we are trading off space for time.
NoSQL DB is also an option, however, in this simple case, we can denormalize our data and use relational DB as a NoSQL DB with user_id as the key.
NoSQL DB is also an option, however, in this simple case, we can denormalize our data and use relational DB as a NoSQL DB with user_id as the key.
Serving feeds
There are usually two ways to serve the news feeds from a user's friends list: push and pull.
For a push system, when a user published a tweet, the tweet (in fact the pointer of the tweet) is immediately pushed to all his/her friends' news feed. The advantage is that when fetching news feed, we don't need to go through the whole friends list to get the feed, which reduces lots of query time. On the other hand, a user's news feed object will be extremely large and involves lots of write operations if the user has lots of friends.
For a pull system, feeds are only fetched when the user requires them. The upper side is obviously reduced write operations, but we can imagine it will be quite slow for large queries.
The push activity is also called fanout on write and the pull activity is called fanout on load.
The push activity is also called fanout on write and the pull activity is called fanout on load.
Now back to our batch job optimization. This is similar to the pull approach mentioned above. In the pull approach, the whole process may be done when the user's actually loading the page, in our batch job approach, this is done using a scheduled job. This approach may serve most of the functionalities, since most of time news feeds are not time critical. However, in some cases, time may be an important factor. Say Alice makes a post: "Who wants to see the movie tonight?". She definitely wants the post to be seen among her friends immediately. Either the batch job approach or the pull approach is not sufficient for such requirement.
In this case, the push approach works better. In our design, instead of the scheduled batch job, the news feed can be updated whenever a new message from one of the user's friends comes in. This approach may increase lots of write to the DB. Now think of separating reading and writing to different services so that either one will not be affected by the other. Periodically, a batch job can still be run to completely replace all news feeds in this object with new feeds.
In reality, there pros and cons for both approaches. Considering Donald Trump, the newly elected president to be, he has thousands of followers (even though many of them hates him :)). When he makes a post, there are thousands of update operations and are computationally expensive. So for these high profile users, fanout on write should be disabled, and use batch job to update his/her followers news feed.
Ranking
Ranking is always interesting. While there are lots of things we can consider, what matters are two things, features and calculation algorithm. In Facebook, there is (or was) an algorithm called EdgeRank. See this post for detail.
More optimization
Cache, cache and cache. Facebook uses so called "write through cache" rules. Briefly, all of its data read and write are through Memcached. If data is not found in Memcached, Memcached goes to actual data store for data and write it to itself as well as return to client.
Also see this post for an implementation of consistent hashing.
No comments:
Post a Comment