The Web Servers
NetScaler
load balancing: marks machines up/down
caches static content: Apache is not very good at serving very large static content
mod_fastcgi: requests that look dynamic (they match certain rules) are forwarded to the Python app servers
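The static-versus-dynamic split can be sketched as a simple routing predicate; the URL patterns below are hypothetical illustrations, not YouTube's actual rules:

```python
import re

# Hypothetical routing rules: paths matching these patterns are treated as
# dynamic and handed to the Python app servers; everything else is served
# (and cached) as static content.
DYNAMIC_PATTERNS = [
    re.compile(r"^/watch"),
    re.compile(r"^/results"),
    re.compile(r"^/api/"),
]

def is_dynamic(path: str) -> bool:
    """Return True if the request should go to the app servers."""
    return any(p.match(path) for p in DYNAMIC_PATTERNS)
```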
Run Linux
can usually scale by adding machines
Python is fast enough
Web code speed is usually not the bottleneck
More often, time is spent waiting on databases, caches, and RPC (remote procedure call) calls.
Development speed is important
<<100 ms page service times
psyco (a dynamic Python -> C compiler) to speed up Python
selectively rewrite hot spots (e.g., encryption) as C extensions
pre-generated cached HTML
Serving Video
Issues: bandwidth, hardware cost, and limited power in a data center (hardware must not consume too much power)
Each video is hosted by "mini-cluster"
mini-cluster: a small number of machines that serve exactly the same set of videos, giving a bit of redundancy;
why a couple of machines per cluster:
1. scalability: more disks serving each piece of content
2. headroom: if one machine goes down, the other machines can pick up the slack
3. online backups of everything
from Apache -> lighttpd
single process -> multi-process
Serving Video
Put the most popular content on to the CDNs
The moderately played and least played videos go to YouTube servers.
The traffic profile of the YouTube servers: some videos get only 10 - 20 views/day, but aggregated across all such videos the traffic is still substantial. Rate control becomes more important, and caching becomes more important. The amount of memory in each system must be tuned so that it is not too small, but also not too large, because otherwise expanding the number of machines costs a lot of money without much benefit.
Keep it simple
simple network path: not too many network devices between the client and the video servers
commodity hardware
simple common tools
stock Linux distributions; build layers on top of them, which can be modified if needed
OS page cache size is critical
Handle random seeks (SATA tweaks, etc.)
Serving Thumbnails
Thumbnail images correspond to each video and are around 100 x 90 pixels, so ~5 KB each. There are a lot of them: each video has multiple thumbnails, roughly 4x as many thumbnails as videos. Videos are spread across thousands of machines, but thumbnails are concentrated on a small number of machines. Need to deal with lots of small objects: inode caches, directory-entry (dentry) caches, page caches, all at the OS level.
High requests/second: on any given web page (e.g., a keyword search results page) there are lots of thumbnails but only one video.
Started with Apache: unified pool (the same machines serving web traffic also served static images/thumbnails) -> separate pool; high load, low performance
-> squid (reverse proxy): far more efficient with CPU, but as load increased, performance degraded over time
-> lighttpd (modified)
Main thread: epoll, serving content that is already in memory
Disk reads are assigned to worker threads
As long as all the fast work stays on the main thread, it can serve lots of requests per second; the slow work can come a little bit later
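A minimal sketch of that split, assuming a dict standing in for the in-memory content cache and a queue handing slow disk reads to a worker thread (this illustrates the idea, not lighttpd's actual C implementation):

```python
import queue
import threading

# In-memory cache of hot content (path -> bytes); a miss means a slow,
# blocking disk read, which is delegated to worker threads so the fast
# path keeps serving requests.
cache = {}
disk_queue = queue.Queue()

def worker():
    # Worker threads do the slow, blocking disk reads.
    while True:
        path, reply = disk_queue.get()
        try:
            with open(path, "rb") as f:
                data = f.read()
            cache[path] = data      # warm the cache for the next request
            reply.put(data)
        except OSError:
            reply.put(None)

def serve(path):
    # Fast path: content already in memory, answered immediately.
    if path in cache:
        return cache[path]
    # Slow path: hand off to a worker and wait for the result.
    reply = queue.Queue(maxsize=1)
    disk_queue.put((path, reply))
    return reply.get()
```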
-> BTFE
Google's BTFE
Based on Bigtable(Google distributed data store)
avoids small file problem
setting up new machines becomes faster and easier
fast, fault-tolerant
various forms of caching: cache at multiple levels so that the latency between server and user is reduced
Databases
MySQL
only stores metadata (users, video descriptions, titles, tags, etc.)
In the beginning: one main DB and one backup
the main database was swapping even though it was not using more memory than the system had: Linux 2.4 kernels treat the page cache as more important than applications. With, say, a 1.6 GB page cache and a 3.5 GB application on a machine with only 4 GB of physical memory, rather than shrinking the page cache the kernel swapped out MySQL. But MySQL needs to be in memory, so it kept getting paged back in, and it was swapping in and out very frequently.
DB Optimizations
Query optimizations
Batch jobs whenever queries are too expensive to run online
memcached: instead of going to the database every time, use memcached, a hashtable over a TCP socket that lets you get and set data by a string key; memory is much faster than disk
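The read-through pattern described above can be sketched as follows; a plain dict stands in for memcached here so the sketch is self-contained (a real deployment would talk to a memcached server over the network):

```python
# A minimal in-process stand-in for memcached (essentially a hashtable
# reachable over a TCP socket): get/set values by string key, falling
# back to the database only on a miss.
memcache = {}

def cached_get(key, db_lookup):
    """Read-through cache: check memory first, hit the database on a miss."""
    value = memcache.get(key)
    if value is None:
        value = db_lookup(key)   # slow path: go to MySQL
        memcache[key] = value    # populate the cache for later readers
    return value
```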
App server caching: beyond the database caches, the database machine's OS cache, and memcache, data that is accessed very often (hundreds of times/second) can be precalculated and stored in the memory space of the app server process itself
Replication
Spread read load across replica databases as traffic grows; the replicas replay the transactions happening on the master database. All writes go to the master database, and an asynchronous process propagates them to each replica attached to it.
Every write happens in multiple places, and in some situations the writes alone can exhaust the replicas;
If there are too many replicas, there are management issues;
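The read/write split can be sketched as a tiny routing function; the connection handles and round-robin policy here are illustrative assumptions, not YouTube's actual code:

```python
import itertools

# Hypothetical connection handles; in practice these would be MySQL
# connections to the master and its replicas.
master = "master-db"
replicas = ["replica-1", "replica-2", "replica-3"]
_round_robin = itertools.cycle(replicas)

def pick_connection(is_write: bool) -> str:
    """All writes go to the master; reads are spread across the
    asynchronously replicating replicas."""
    return master if is_write else next(_round_robin)
```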
Replica Lag (downside)
Replication is asynchronous
If the machines have different CPU speeds, each replica lags by a slightly different amount;
On the master database, updates come from multiple threads, each executing a single update.
On a replica database, MySQL serializes everything onto one thread, which can become the bottleneck: update 3 cannot execute until update 2 is finished, so one long update stalls everything behind it.
Unhealthy Replication Thread
Consider 5 updates where 3 touch pages that are not in memory, resulting in cache misses. Cache misses stall the replication thread, and at a certain point the replica can no longer keep up with the master. The master database can use all of its CPUs and all of its disks, but the replication thread can effectively use only 1 CPU and 1 disk, reducing replication throughput.
DB updates involve:
1. reading the affected DB pages
2. Applying the changes
MySQL uses one thread to buffer the SQL arriving from the master database, and another to execute it.
The solution: since replication is behind, a cache-primer thread can see the not-yet-replayed SQL queries and pre-fetch the affected rows.
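The cache-primer idea can be sketched as below. Because the replica is behind, the not-yet-replayed statements are visible before they execute; a primer thread issues cheap reads for the rows they will touch so the pages are already in memory when the single replication thread applies them. All names here are illustrative, not MySQL internals:

```python
# Sketch of a cache-primer thread's core loop. `pending_statements` stands
# for the not-yet-replayed replicated statements, and `prefetch_row` stands
# for a cheap SELECT-by-primary-key whose result is discarded -- the point
# is the disk read it triggers, which warms the page cache.
def prime_cache(pending_statements, prefetch_row):
    """For each pending replicated UPDATE, pre-read the affected row."""
    for stmt in pending_statements:
        if stmt["type"] == "UPDATE":
            prefetch_row(stmt["table"], stmt["pk"])
```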
Replica Pools
Since a single pool could not handle the entire database load, video watching was prioritized above everything else. Split the replica databases into two pools: a watch pool serving only video-watch queries, and a general pool for everything else.
Database Partitions
A gigantic database: split it into separate pieces, each independent
partition by user
spread writes AND reads
much better cache locality -> less I/O
30% hardware reduction with same amount of traffic
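The partition-by-user scheme above can be sketched as a hash of the user id to a shard number, so all of one user's metadata lives on a single shard; the shard count and hash choice here are illustrative, not YouTube's actual scheme:

```python
import hashlib

NUM_SHARDS = 8  # illustrative shard count

def shard_for_user(user_id: str) -> int:
    """Map a user to a partition. Both reads and writes are spread across
    shards, and each shard's working set is small enough to cache well."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```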
Scalable systems
simple
solve the problem at hand
product of evolution: the complexity will come over time
Need the flexibility
Scalable techniques
divide and conquer: partition things, distribute things
approximate correctness: the state of the system is whatever it is reported to be; if users can't tell that the system is internally skewed or inconsistent, then effectively it isn't
e.g., if you write a comment while another person on the other side of the world is loading the same watch page, they may not care about your comment, but you want to see it yourself; the inconsistency is acceptable as long as the writer sees their own comment first, and tolerating it saves cost.
expert knob twiddling: consistency, durability
jitter: introduce randomness; things have a tendency to stack up
e.g., cache expirations: a popular video should be cached. The usual cache timeout might be an hour or 10 minutes, but for a very popular video (the most popular one yesterday) the cache timeout might be 24 hours. If everything expires at one given moment, every machine observes that expiration at exactly the same time and goes back to recompute the content, which is awful. By jittering, the desired expiration time is still about 24 hours, but the actual timeout is randomized (variance added) across roughly 18 to 30 hours, and that prevents things from stacking up. Systems have a tendency to synchronize and line up.
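Jittered expiration amounts to randomizing each entry's TTL around the target; a minimal sketch, where the 25% jitter fraction matches the 18-30 hour range for a 24-hour target:

```python
import random

def jittered_ttl(base_seconds: float, jitter_fraction: float = 0.25) -> float:
    """Return a TTL of base +/- up to 25%, so entries cached at the same
    moment do not all expire (and all recompute) at the same moment."""
    low = base_seconds * (1 - jitter_fraction)
    high = base_seconds * (1 + jitter_fraction)
    return random.uniform(low, high)
```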
cheating: for monotonically increasing counters like a view count, you could do a full transaction every time you update it, or you can do a transaction once in a while and update by the accumulated amount; as long as the change is plausible, people will believe it is real.
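A sketch of that batched-counter idea: buffer increments in memory and commit them occasionally in a single transaction. The probability-based flush policy here is one possible choice for illustration, not necessarily YouTube's:

```python
import random

class ViewCounter:
    """Instead of one transaction per view, buffer increments in memory
    and flush to the database occasionally (here: with a small probability
    per view), writing the whole accumulated amount in one transaction."""

    def __init__(self, flush_probability=0.01):
        self.pending = 0
        self.flush_probability = flush_probability

    def record_view(self, commit):
        self.pending += 1
        if random.random() < self.flush_probability:
            commit(self.pending)   # one transaction covers many views
            self.pending = 0
```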
Scalable components
Well-defined inputs
Well-defined dependencies
frequently end up behind an RPC (remote procedure call)
leverage local optimizations
Efficiency
uncorrelated with scalability
The most efficient thing would be to cram everything into one process, in C
focus on algorithms: at the macro level, the way the individual components break down doesn't matter as much
libraries:
wiseguy
pycurl
spitfire
serialization formats
tools
Apache
Linux
MySQL
Vitess
ZooKeeper