HDFS (Hadoop Distributed File System)
Master-slave architecture
Master:
- Name Node:
- controls all the Data Nodes; it is the controller that handles all file system operations (e.g., creating or deleting a file goes through the Name Node); it manages the file system namespace, a snapshot of what the file system looks like
- also handles block mapping (when a file is put into the file system, the file is broken into blocks and spread over the Data Nodes)
- single point of failure: it holds the filesystem metadata, so once the Name Node fails, the whole cluster fails
- Holds in memory a map of the entire cluster
- use secondary name node for a copy
- replication value/factor: how many copies of each block are kept across the cluster
- store everything in memory
- mapping of block_ids to Data Nodes
- journal/edit log: a record of edits and changes; whatever change a client requests, that change is written to the edit log on disk, so it is not only in memory
- getting the information from the edit log back into memory happens on a reboot of the Name Node, which forces a checkpoint process (see the checkpoint sketch after the Master section)
- a reboot forces a checkpoint: the Name Node takes what is in memory, writes it to disk, and merges the edit log with that file written to disk (the fsImage)
- after the reboot, it reloads what is in the fsImage back into memory
- if we do not reboot the Name Node very often, the edit log keeps growing, and if we lose the Name Node we lose all the changes on it
- monitors health
- clients communicate with it; it acts as the controller
- never initiates communication with other nodes (they talk to it)
- Secondary Name Node (checkpoint node): takes a snapshot of the Name Node every now and then; a system restore point for the Name Node
- takes over the responsibility of merging the edit log with the filesystem image
- housekeeping node for the Name Node
- Metadata backup: periodically pulls the edit log and fsImage from the Name Node, merges them, and uploads the merged fsImage back to the Name Node
- not a high-availability server (it is not a hot standby)
- Job Tracker (in MapReduce): controller for all Task Trackers
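The checkpoint flow described above (in-memory namespace, on-disk edit log, periodic merge into the fsImage, with the Secondary Name Node doing the merge) can be summarized in a minimal sketch. This is a hypothetical, simplified model, not the real Hadoop code; every class, field, and method name here is made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the Name Node's metadata bookkeeping:
// the live namespace lives in memory, every mutation is appended to an
// edit log, and a checkpoint folds the pending edits into a persisted fsImage.
public class NameNodeCheckpointSketch {

    // In-memory namespace: path -> list of block ids (the "map of the cluster").
    private final Map<String, List<Long>> namespace = new HashMap<>();

    // Append-only edit log of changes not yet folded into the fsImage.
    private final List<String> editLog = new ArrayList<>();

    // Last persisted snapshot of the namespace (stands in for the on-disk fsImage).
    private Map<String, List<Long>> fsImage = new HashMap<>();

    public void createFile(String path) {
        namespace.put(path, new ArrayList<>());
        editLog.add("CREATE " + path);        // the change is journaled, not only kept in memory
    }

    public void addBlock(String path, long blockId) {
        namespace.get(path).add(blockId);
        editLog.add("ADD_BLOCK " + path + " " + blockId);
    }

    // Checkpoint: persist the current namespace as the new fsImage and truncate the log.
    // In a real cluster the Secondary Name Node performs this merge and
    // ships the resulting fsImage back to the Name Node.
    public void checkpoint() {
        fsImage = new HashMap<>(namespace);   // namespace already reflects all logged edits
        editLog.clear();                      // those edits are now captured in the fsImage
    }

    public static void main(String[] args) {
        NameNodeCheckpointSketch nn = new NameNodeCheckpointSketch();
        nn.createFile("/data/a.txt");
        nn.addBlock("/data/a.txt", 1001L);
        nn.checkpoint();                      // edit log is empty again, fsImage is current
        System.out.println("fsImage: " + nn.fsImage);
    }
}
```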
Slave:
- Data Nodes:
- Workhorse of the system: performs all block operations and receives instructions from the Name Node. E.g., a client asks the Name Node for a list of where the desired file's blocks are stored, then communicates directly with the Data Nodes holding those blocks to read the file (see the read sketch after this section).
- Writing and retrieving blocks to/from disk
- Communicate directly with clients or the Name Node:
- a Data Node sends a heartbeat signal to the Name Node every 3 seconds so the Name Node knows it is online
- if a Data Node has sent no signal for 10 minutes, the Name Node considers that node dead and re-replicates all blocks that were on it across the other nodes
- also perform replication
- Task trackers
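The read path described under Data Nodes (ask the Name Node where the blocks are, then stream them straight from the Data Nodes) is what the standard Hadoop FileSystem client does behind a single open() call. A minimal sketch using that client API; the Name Node address and file path are placeholders.

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder Name Node address; in practice this comes from core-site.xml (fs.defaultFS).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // open() asks the Name Node for the block locations;
        // the returned stream then reads the blocks directly from the Data Nodes.
        try (InputStream in = fs.open(new Path("/data/example.txt"))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, n);
            }
        }
        fs.close();
    }
}
```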
File Blocks:
- By default, data is saved on disk in 64 MB blocks;
- block size affects the metadata: with a small number of large files, the metadata footprint is small; on the other hand, if files are tiny (e.g., 8 KB each), there is far too much metadata (the block-metadata sketch after this section shows how to inspect block size, replication, and locations)
- Block placement: to limit inter-rack communication, the Name Node calculates the distance (network bandwidth) between Data Nodes and places each file's blocks as close to each other as possible, preferably within a rack
- Rebalancing
- balancer tool: rebalances, i.e., redistributes blocks among all the nodes
- Replication management
- every 10th heartbeat from a Data Node is actually a block report; from it the Name Node can figure out whether a block is under- or over-replicated. If it is over-replicated, the Name Node removes a copy. If it is under-replicated, the block goes onto a priority queue, and the blocks with the fewest copies in the cluster are highest on the list.
- Every Data Node has a block scanner that checks the integrity of all its blocks, and this directly feeds the block report: if a node happens to have a corrupt block, the Name Node recognizes it from the report, removes the corrupt copy, and re-replicates a good copy to that node.
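Block size, replication factor, and the block-to-Data-Node mapping that the Name Node maintains can all be inspected from a client through the same FileSystem API. A minimal sketch, again with a placeholder Name Node address and file path:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path file = new Path("/data/example.txt");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size:  " + status.getBlockSize());    // e.g. 64 MB (older default) or 128 MB
        System.out.println("replication: " + status.getReplication());  // copies kept of each block

        // Ask the Name Node which Data Nodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```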
Source: https://www.youtube.com/watch?v=Ll7tVuQhRWI
Multiple rack cluster
the roles above are split across separate racks
- Challenges:
- Data loss prevention: use replication, i.e., keep multiple copies of the data
- Rack awareness: understanding the network topology. This is a manual step: the network topology is described to the Name Node. Once the Name Node understands the topology and is rack aware, it ensures that copies of the data sit on multiple racks, so that if a rack is lost the Name Node can re-replicate the data from copies on another rack (see the placement sketch after this list).
- Network performance:
- assumption: intra-rack communication has much higher bandwidth and lower latency than cross-rack communication
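The rack-awareness goal (no block should have all of its copies in one rack, while most traffic stays intra-rack) is easiest to see in the default placement heuristic HDFS uses for a replication factor of 3: first replica on the writer's node, second on a node in a different rack, third on another node in that second rack. The sketch below is a hypothetical illustration of that heuristic in plain Java; it does not use any real Hadoop classes, and the node/rack names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the default HDFS placement heuristic for 3 replicas:
// replica 1 on the writer's node, replica 2 on a node in a different rack,
// replica 3 on another node in the same rack as replica 2.
public class RackPlacementSketch {

    // nodeToRack maps Data Node -> rack id, e.g. "dn1" -> "/rack1"
    // (normally derived from the topology description given to the Name Node).
    static List<String> chooseTargets(Map<String, String> nodeToRack, String writerNode) {
        String writerRack = nodeToRack.get(writerNode);
        List<String> targets = new ArrayList<>();
        targets.add(writerNode);                                   // replica 1: writer's node

        String remoteRack = null;
        for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
            if (!e.getValue().equals(writerRack)) {                // replica 2: different rack
                targets.add(e.getKey());
                remoteRack = e.getValue();
                break;
            }
        }
        for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
            if (e.getValue().equals(remoteRack) && !targets.contains(e.getKey())) {
                targets.add(e.getKey());                           // replica 3: same remote rack
                break;
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, String> topology = new LinkedHashMap<>();
        topology.put("dn1", "/rack1");
        topology.put("dn2", "/rack1");
        topology.put("dn3", "/rack2");
        topology.put("dn4", "/rack2");
        // Losing /rack1 still leaves two copies on /rack2.
        System.out.println(chooseTargets(topology, "dn1"));        // [dn1, dn3, dn4]
    }
}
```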
HDFS Features: