Monday, September 26, 2016

Design a tinyurl system

What is it?

Tinyurl system is a url shortening service, which is a web service for redirecting to long urls to short alias. For example, http://www.this-is-a-long-url.com/page1/page2/page3/ can be replaced by alias http://tinyurl.com/abcd123. If you click the alias of a long url, it will redirect you to the actual url.

First thought

A long url and its alias forms a key-value pair. Since each url and its alias is a one-to-one mapping, simply we would think of having a large map. Now there are few things that need to be considered:

  • a good hash function
  • Performance
  • Scale

Hash function

In the above example http://tinyurl.com/abcd123abcd123 is a hash id that is created by the hash function. When user types this tiny url, it will look for the actual url by the key abcd123 and redirect the user there. One of the most popular url shortening services simply take the ID in the database of the url and then convert it to Base 62[a-zA-Z0-9], i.e., in record <57, "abcd123", "http://www.this-is-a-long-url.com/page1/page2/page3/ ">, "abcd123" is created by hashing id = 57, which is the index of the record in the db. A simple code snippet is shown in the following:


        private static final String ALPHABET_MAP = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" ;
 private static final int BASE = ALPHABET_MAP.length() ;

 public static String encode ( int IndexNum ) {
  StringBuilder sb = new StringBuilder() ;
  
  while ( IndexNum > 0 ) {
   sb.append ( ALPHABET_MAP.charAt ( IndexNum % BASE ) ) ;
   IndexNum /= BASE ;
  }
  return sb.reverse().toString() ;
 }

 public static int decode ( String str ) {
  int Num = 0 ;

  for ( int i = 0, len = str.length(); i < len; i++ ) {
   Num = Num * BASE + ALPHABET_MAP.indexOf ( str.charAt(i) ) ;
  }
  return Num ;
 }


Performance

Using GUID vs Incrementing ID

Using GUID may slow down writing process especially when the data set is very large since we have to go through the whole data set to find the correct spot for the id. On the other hand, using incrementing ID means we only need to go to the bottom of the dataset and insert the data. 

However, using incrementing ID may loss some flexibility, especially when the service allows user to create custom short url. Using GUID, we can just "reverse the hash" and get the id. 

Data store

Using MySQL is always easy, it's stable, and everybody knows how to do it. Yet as we mentioned above, it takes pain when writing to DB. Alternatively, we can use some NoSQL DB for faster writing and reading. 

Scale

This is always an exciting topic, your money and your data never aligns on the same scale, but your boss just wants the data to return faster. 

How much data needed?

Assuming each record stores <id, hash, url>, and the max length of url is 2083 chars. So approximately 8 (long) + 7 * 4 + 2083 * 4 ~8.4kb. (Not much?)

Data store...again

There are two solutions for the question, and which one to choose depends on your DB. 

#1. I like "My"-SQL

Ok, so you want to keep using the beast. What we can do is to do a master-slave replication (write to master, read with proper data sharding on each slave). For master server, we can add more and more RAM to keep it running. For weeks, your boss will be so pissed by the hardware bill and you realize it becomes much harder for you to migrate your DB. 

#2. Try something else? 

Alternatively, when your dataset is rather small, you probably can try some easier to scale NoSQL database. Now a followup question would be: How to add a new machine? 

If you are familiar with Distributed hash table, you know what I'm talking about. If not,  you probably want to be patient and read the following explanation. 

The basic difference between a normal hash and a distributed hash is ... literately, it's distributed.  It provides the same functionality as normal hash, yet the data is distributed over a set of connecting nodes. A normal hash usually takes two variables, key "x" and number of buckets "n". Usually hash function do some calculations and then mod the result by n, i.e., hash(x) % n, which can lead to which bucket "x" belongs. Now if  we change number of buckets (by adding one more node), the address of "x" also changes, and that will cause some disaster.

On the contrary, distributed hash defines a range "r", which can be any large number. The address of the key is calculated by hash(x) / r, since the address is independent of number of buckets "n", the number of remapping is dramatically decreased.

For example, consider a key space range r and a logic ring of N nodes, each is responsible of r / N key spaces. When you add one more node between two existing nodes, the new node takes care of some of one or both of siblings' data. All other nodes has no effect from the new node and only one or two nodes need to redistribute keys.

Locating a key in a ring requires searching round the ring one node at a time so it takes O(n) time (O(n)-hop-routing). Other structure (e.g., augmented ring) takes O(log(n)) time, some claims O(1) look-up with more maintenance.

Other things 

Now we are talking about distributed system, things like replication, partition and concurrency need to be considered. 

1 comment:

  1. pleasure to visit your blog there are such a nice information about tiny system there are more Brand Monitoring option available on internet but your looks pretty good.

    ReplyDelete