Saturday, March 21, 2015

B-tree

B-tree is a tree data structure that keeps data sorted and allow searches, sequential access, insertions and deletions in logarithmic time. B-tree is a generalization of binary search tree in that a node can have more than two children. Unlike self-balancing BST, the B-tree is optimized for systems that read and write large blocks of data. If is commonly used in databases and filesystems.

The internal (non-leaf) nodes of a B-tree can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. Thus, internal nodes may be joined or split in order to maintain the pre-defined range. B-trees do not need to re-balancing as frequently as self-balancing search trees, but may waste some space since nodes are not entirely full due to the predefined range. The lower or upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, each node in a 2-3 B-tree (aka 2-3 tree) may have only 2 or 3 child nodes.

Each internal node of a B-tree will contain a number of keys. The keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes, then it must have 2 keys. Let them be a1 and a2, then all values in the leftmost child will have value less than a1, all values in the rightmost child will have value greater than a2, the middle children will contain values between a1 and a2.

Usually the number of keys is chosen to vary between d and 2d, where d + 1 is the minimum degree or branching factor of the tree (or number of children), thus d indicates the minimum number of keys. In practice, the keys take up the most space in a node.

The nodes can be split or combined. If an internal node has 2d keys, then adding a key to that node can be accomplished by splitting the node to 2 nodes each contains d keys and adding the key to the parent node. Similarly, if an internal node and its neighbor each have d keys, then a key may be deleted from the node by combining with its neighbor: delete the key, joining the neighbor, and add one more key brought down from the neighbor's parent. The result is an entirely full node of 2d keys.

A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. B-trees have substantial advantages over alternative implementations when the time to access the data of a node greatly exceeds the time spent processing that data, because then the cost of accessing the node may be amortized over multiple operations within the node. By maximizing the number of keys within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced.

The following implementation allows at most 4 keys in one node, actually, it should be 3, because if the 4th one is added, the node will be split into 2 nodes.

Insertion
Starting from the root, each node can contain at most 4 keys, so if the number of keys is smaller than 4, simply add the key-value pair to the node. After adding the node, if the number of keys reaches to 4, split the root into 2, create a new node with two keys, the first one is the first element in the previous node before split, and the second one is the third element. Make the new node the root. Note now if we add any element that is smaller than the first key, it will be on the left of that key, and any number larger than the second key will be on the right of the second key, others will be in the middle.


Now if we have multiple height, and need split the node, we need to it from the bottom to the top. 


Assume we are adding 19. After we add it, we will need to split the node and we need a new key in the root to point to the new node. So we add a new node to the root. After we add the new key to the root, we need to split the root now. So same as previous, we split the two nodes and make a new root. 

Note each Entry points to the node in the next level, and the key of the entry is the smallest children in the node it points to. 

Searching
Search 16

Since the keys are partitioned in such a way that the children of the key will be on its left it their keys are  smaller than this key and on the right if larger. So if the next children of the current node has a larger key than the one we are searching, or if we have reached the end of the list, it means the key we are searching should be equal or larger than the key of current children. So we go to the node the current child points to in the next level, that node have children with the smallest key the current child, so we go through all children in that node to find the key.

No comments:

Post a Comment