
Sunday, November 20, 2016

Image/Video hosting service

Functionality

This is a very broad question. The first thing is to figure out the purpose of this hosting service: is it for showing photos to other users, like Flickr, or for photo backup, like Google Photos?
The distinction matters because these two functionalities lead to different requirements for the system: the first system is read-intensive, while the second is write-intensive.

For now, let's treat it as an image distribution system, i.e., a user posts a photo and everybody can see it.

High level architecture


Our version 1.0 may have only one server that runs all backend API calls and reads/writes to our DB. However, as mentioned before, the two workloads scale differently, so it is better to separate our write service from our read service. Moreover, image and video uploads take a lot of bandwidth. Suppose an Apache web server can maintain only 500 connections at once: if at some point all 500 connections are occupied by uploads, each of which may take a couple of seconds, then all reads are blocked, which is a serious problem. Reads, on the other hand, are quite fast. If each image has a unique ID, retrieving it should be an O(1) operation, so the same server can easily handle ~7,000 read requests in 1 second even though it can only maintain 500 connections at once.
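To make the read/write split concrete, here is a minimal sketch of routing at the entry point. The hostnames and pool sizes are illustrative assumptions, not real endpoints:

```python
# Sketch: send slow uploads (writes) and fast downloads (reads) to separate
# server pools, so uploads cannot exhaust the connections serving reads.
READ_POOL = ["read-1.example.com", "read-2.example.com"]
WRITE_POOL = ["write-1.example.com"]  # uploads are slow; isolate them


def route(method: str, request_id: int) -> str:
    """Pick a backend for this request, round-robin within the pool."""
    pool = WRITE_POOL if method in ("POST", "PUT") else READ_POOL
    return pool[request_id % len(pool)]
```

With this split, an upload surge saturates only the write pool, and the read pool can be scaled independently as read traffic grows.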


Image search 


The easiest way is to store photos in a key-value store, which makes retrieval an O(1) operation. However, most of the time a user opens an album, and we retrieve not just one photo but many. So we have to solve the partitioning/data-locality problem, so that fetching an album's photos does not require random lookups scattered across the DB.
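A toy sketch of that locality idea: shard by album_id (not photo_id) so all photos in one album land on the same shard and can be fetched with a single lookup. The shard count and key scheme are assumptions for illustration:

```python
# Partition by album so one album stays on one shard (data locality).
NUM_SHARDS = 4


def shard_for(album_id: int) -> int:
    # Hash on album_id, not photo_id, so an album never spans shards.
    return album_id % NUM_SHARDS


class PhotoStore:
    def __init__(self):
        self.shards = [dict() for _ in range(NUM_SHARDS)]

    def put(self, album_id: int, photo_id: int, blob: bytes) -> None:
        self.shards[shard_for(album_id)][(album_id, photo_id)] = blob

    def get_album(self, album_id: int) -> list:
        # One shard lookup returns the whole album -- no scatter-gather.
        shard = self.shards[shard_for(album_id)]
        return [v for (a, _), v in sorted(shard.items()) if a == album_id]
```

Had we sharded on photo_id instead, opening an album would fan out to every shard.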

We can also add secondary indices for easier searching, e.g., on user_id.
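For instance, with a relational metadata table, a secondary index on user_id lets "all photos by this user" avoid a full table scan. The table schema below is a made-up example:

```python
# Hypothetical photo-metadata table with a secondary index on user_id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, user_id INTEGER, path TEXT)"
)
# The secondary index: lookups by user_id no longer scan the whole table.
conn.execute("CREATE INDEX idx_photos_user ON photos(user_id)")
conn.executemany(
    "INSERT INTO photos VALUES (?, ?, ?)",
    [(1, 42, "/p/1.jpg"), (2, 42, "/p/2.jpg"), (3, 7, "/p/3.jpg")],
)
rows = conn.execute(
    "SELECT photo_id FROM photos WHERE user_id = 42 ORDER BY photo_id"
).fetchall()
```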

Facebook stores photos in a file system, given their large size. It has a custom-designed FS called Haystack. Details can be found in this and this presentation. In brief, the architecture contains four components:

Haystack Directory: which you can think of as a metadata store, load balancer, and coordinator.
Haystack Cache: which is... a cache.
Haystack Store: the actual file system.
CDN: for very frequently accessed images (like your profile photo or the Facebook logo).

The photos are stored in a log format. Roughly, you can think of an album of photos as stored in the same file (a superblock). Each photo is written as: some metadata first (key, cookie, size, flags, etc.), then the actual data, then some trailing metadata. The leading metadata is also used to build the index.
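The log layout above can be sketched with a few lines of byte packing. The field widths and footer marker are illustrative assumptions, not Haystack's exact on-disk format:

```python
# Sketch of a log-structured "needle": header metadata (key, cookie, size),
# then the photo bytes, then a small footer marker.
import struct

HEADER = struct.Struct("<QQI")  # key (8 B), cookie (8 B), data size (4 B)
FOOTER_MAGIC = b"EOFN"          # made-up end-of-needle marker


def write_needle(log: bytearray, key: int, cookie: int, data: bytes) -> int:
    """Append one photo to the log; return its offset for the index."""
    offset = len(log)                          # index maps key -> offset
    log += HEADER.pack(key, cookie, len(data))
    log += data
    log += FOOTER_MAGIC
    return offset


def read_needle(log: bytes, offset: int) -> bytes:
    """Read back one photo given the offset recorded in the index."""
    key, cookie, size = HEADER.unpack_from(log, offset)
    start = offset + HEADER.size
    return log[start:start + size]
```

Because writes are append-only, an upload is a single sequential write, and reads are one seek once the index gives the offset.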

The Haystack Directory is a very interesting design. When a photo write comes in, it asks the Haystack Directory for 3 physical locations, and the photo is stored as 3 copies on three different servers. When a read request comes in, the Directory returns one of those physical locations so that the web server can find the photo. It also takes care of load balancing.

Proxy Server


This is another optimization. A proxy server can be very helpful in coordinating requests. For example, when many people request the same video/photo, the proxy can batch all these requests into a single one, reducing network traffic. This is called collapsed forwarding.
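Collapsed forwarding can be sketched with a shared Future per key: the first requester triggers the backend fetch, and every duplicate request waits on the same result. For simplicity this sketch keeps completed results cached; a production proxy would evict them after the fan-out completes:

```python
# Sketch: duplicate requests for the same key share one origin fetch.
import threading
from concurrent.futures import Future


class CollapsingProxy:
    def __init__(self, origin_fetch):
        self.origin_fetch = origin_fetch  # the expensive backend call
        self.futures = {}                 # key -> Future shared by waiters
        self.lock = threading.Lock()
        self.origin_calls = 0             # for observing the collapsing

    def get(self, key):
        with self.lock:
            fut = self.futures.get(key)
            if fut is None:               # first requester is the "leader"
                fut = self.futures[key] = Future()
                self.origin_calls += 1
                lead = True
            else:
                lead = False              # followers wait on the same Future
        if lead:
            fut.set_result(self.origin_fetch(key))
        return fut.result()
```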

Queues


For image/video upload systems, writes usually take a long time. To maintain high availability, the system needs asynchrony; one way to achieve it is with queues.
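A minimal sketch of the queue idea, assuming an in-process queue and a dict standing in for durable storage: the API handler enqueues the upload and acknowledges immediately, while a background worker performs the slow write.

```python
# Sketch: decouple the upload API from the slow write via a queue.
import queue
import threading

upload_queue = queue.Queue()
stored = {}                              # stand-in for durable photo storage


def api_upload(photo_id, blob):
    upload_queue.put((photo_id, blob))   # fast: just enqueue
    return "accepted"                    # 202-style ack to the client


def worker():
    while True:
        photo_id, blob = upload_queue.get()
        if photo_id is None:             # sentinel shuts the worker down
            break
        stored[photo_id] = blob          # the slow durable write happens here
        upload_queue.task_done()


t = threading.Thread(target=worker)
t.start()
```

The client sees a quick acknowledgment regardless of backend latency, and the queue absorbs bursts that would otherwise exhaust connections.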


