
Saturday, January 3, 2015

Sorting algorithms

I have always had the idea of summarizing the major sorting algorithms, and here it is.
P.S., All source code can be found on GitHub:
https://github.com/shirleyyoung0812/Sorting-Algorithms.git

The findLimit and swap methods are used in a couple of the following algorithms.

private static int[] findLimit(int[] num) {
  int[] limits = new int[2];
  limits[0] = Integer.MAX_VALUE;
  limits[1] = Integer.MIN_VALUE;
  for (int n : num) {
   limits[0] = Math.min(n, limits[0]);
   limits[1] = Math.max(n, limits[1]);
  }
  return limits;
 }

private static void swap(int[] num, int i, int j) {
 int tmp = num[i];
 num[i] = num[j];
 num[j] = tmp;
}


Bucket Sort

This is an O(n) sorting algorithm. It is useful when the input is uniformly distributed over a range. The main procedures are:


  1.  Create n empty buckets (lists).
  2.  Insert each element array[i] into its corresponding bucket.
  3.  Sort individual buckets using insertion sort.
  4.  Concatenate all sorted buckets.


The insertion takes O(1) time and the concatenation takes O(n) time. The main step is the insertion sort, which takes O(m) time per bucket, where m is the average number of elements in a bucket. However, if the numbers are not well distributed, i.e., some buckets contain many more numbers than others, the time it takes to sort those buckets increases.
Bucket sort is considered stable since we add elements to the lists in order, and concatenating them preserves that order.

Here is a visualization of the Bucket Sort.

 public static void bucketSortListNode(int[] num) {
  int bucket_size = 11;
  ListNode[] buckets = new ListNode[bucket_size];
  int[] limits = findLimit(num);
  int minval = limits[0];
  int maxval = limits[1];
  //width of each bucket; at least 1 to avoid division by zero
  int range = Math.max((maxval - minval + 1) / bucket_size, 1);
  for (int n : num) {
   //offset by minval so the smallest element maps to bucket 0
   int bucket = (n - minval) / range;
   if (bucket >= bucket_size)
    bucket = bucket_size - 1;
   if (buckets[bucket] == null)
    buckets[bucket] = new ListNode(n);
   else
    buckets[bucket] = addNode(buckets[bucket], new ListNode(n));
  }
  int index = 0;
  for (int i = 0; i < bucket_size; i++) {
   ListNode node = buckets[i];
   while (node != null) {
    num[index++] = node.val;
    node = node.next;
   }
  }
 }
 private static ListNode addNode(ListNode head, ListNode toAdd) {
  //a new smallest element becomes the new head
  if (toAdd.val < head.val) {
   toAdd.next = head;
   return toAdd;
  }
  ListNode node = head;
  //"<=" keeps equal elements in insertion order (stability)
  while (node.next != null && node.next.val <= toAdd.val)
   node = node.next;
  toAdd.next = node.next;
  node.next = toAdd;
  return head;
 }


In fact, if the input is an integer array and we can afford an array large enough to cover its range, we don't really need the linked lists.


public static void bucketSort(int[] num) {
  if (num == null)
   throw new NullPointerException("Null array!");
  int maxval = findLimit(num)[1];
  int minval = findLimit(num)[0];
  int[] buckets = new int[maxval - minval + 1];
  for (int n : num) {
   buckets[n - minval]++;
  }
  int index = 0;
  for (int i = 0; i < buckets.length; i++) {
   if (buckets[i] == 0)
    continue;
   for (int j = 0; j < buckets[i]; j++) {
    num[index++] = i + minval;
   }
  }
 }


Yet the trade-off is that we may need a large amount of preallocated memory: sorting {1, 13, 4, 2}, for example, allocates 13 counters for only 4 elements.

As you will see below, this method is very similar to Counting Sort, which is why I was confused at first about the difference between the two. The main difference is that here we don't compute the position of each element in the output array. I would say this is a simplified version of counting sort for small amounts of integers that are roughly uniformly distributed.


Counting Sort

This is an integer sorting algorithm. It operates by counting the number of objects that have each distinct key value, and using arithmetic on those counts to determine the position of each key value in the output sequence.

The main procedures are:


  1. Create a count array that covers the range of the input array.
  2. Count the occurrences of each integer and store them in the count array.
  3. Turn the counts into prefix sums; then, walking the input array from its end, use the prefix sums to place each element at its correct position in a separate output array.


Here is a visualization of the Counting Sort.


 public static int[] countingSort(int[] num) {
  if (num == null)
   throw new NullPointerException("Null array!");
  int maxval = findLimit(num)[1];
  int minval = findLimit(num)[0];
  int[] count = new int[maxval - minval + 1];
  for (int n : num) {
   count[n - minval]++;
  }
  for (int i = 1; i < count.length; i++) {
   count[i] += count[i - 1];
  }
  int[] rst = new int[num.length];
  for (int i = num.length - 1; i >= 0; i--) {
   int tmp = num[i] - minval;
   rst[--count[tmp]] = num[i];
  }
  return rst;
 }

Compared to Bucket Sort, this method actually requires more memory: a separate output array. So for large amounts of data it is not memory-efficient.
Counting sort is also a stable sorting algorithm: in the third loop we traverse the original array from the end, look up each element's position in the count array, and decrement the counter before placing the element, so the relative order of equal elements is always preserved.

Besides, one important limitation of Counting Sort is that it only works for integer keys.
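More precisely, the keys must map to a small integer range. As a quick sketch of my own (not from the repo above), characters can be counted the same way by using their code points as array indices:

```java
public class CharCountingSort {
    // Counting sort on characters, using code points as counter indices.
    public static char[] sort(char[] chars) {
        int[] count = new int[256];          // assumes single-byte characters
        for (char c : chars)
            count[c]++;
        char[] rst = new char[chars.length];
        int index = 0;
        for (int c = 0; c < count.length; c++)
            for (int j = 0; j < count[c]; j++)
                rst[index++] = (char) c;
        return rst;
    }

    public static void main(String[] args) {
        System.out.println(new String(sort("sorting".toCharArray()))); // ginorst
    }
}
```

The same trick covers anything with small integer keys, which is really what "integers only" means here.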

Bubble Sort

This poor guy: "It is used in practice once in a blue moon and its main application is to make an introduction to the sorting algorithms," as one online tutorial puts it.

It is very easy to understand. My implementation is more like "sinking sort" than "bubble sort", since in every iteration I sink the largest element to the bottom.


  • Basically we use two loops. The outer loop starts from the last position of the array; after each pass of the inner loop, the element there is the largest of the unsorted part and stays fixed from then on.
  • The inner loop swaps two consecutive elements whenever the first one is larger. So after one pass, the largest element is at the end of the array.
  • As mentioned, we then fix the last element and repeat the swapping on the shrinking prefix.
  • A boolean swapped variable records whether any swap happened in the inner loop; if not, all the elements are already in order and we can exit the outer loop early.

public static void bubbleSort(int[] num) {
  boolean swapped = true;
  for (int i = num.length - 1; i > 0 && swapped; i--) {
   swapped = false;
   for (int j = 0; j < i; j++) {
    if (num[j] > num[j + 1]) {
     swap(num, j, j + 1);
     swapped = true;
    }
   }
  }
 }


The complexity, of course, is O(n^2), but why? Each swap takes O(1) time, and each inner pass takes linear time to sink the largest remaining element, so the total is (n - 1) + (n - 2) + ... + 2 + 1 ~ O(n^2).
It is stable as long as we don't swap two equal elements (the strict > comparison above guarantees this).


Selection Sort

Another O(n^2) algorithm. As its name indicates, each time it selects the smallest element of the unsorted part and puts it in its correct place.
Here is an easy way to understand this algorithm:

  • Imagine the array is separated into two parts, the sorted part and the unsorted part. At the beginning, the sorted part is empty and the unsorted part is the whole array.
  • Each time we select the smallest element in the unsorted part and put it at the tail of the sorted part.
  • When the unsorted part is empty, the array is sorted.


public static void selectionSort(int[] num) {
  for (int i = 0; i < num.length - 1; i++) {
   int index = i;
   for (int j = i + 1; j < num.length; j++) {
    if (num[j] < num[index])
     index = j;
   }
   swap(num, i, index);
  }
 }

Selecting the smallest element of the unsorted part takes linear time, so the total complexity is n + (n - 1) + ... + 2 + 1 ~ O(n^2).
It is not stable: a swap can jump an element over equal ones. For example, sorting {2a, 2b, 1} swaps 2a past 2b.

Insertion Sort

Still an O(n^2) algorithm. It is known to be quite efficient for small data sets, and it is adaptive.

Again, let's imagine that the array is separated into a sorted part and an unsorted part.
Each time, we take an element from the unsorted part, compare it with the elements in the sorted part, and insert it in the right place by shifting all larger sorted elements one position to the right.
When the unsorted part is empty, we are done.


public static void insertionSort(int[] num) {
  for (int i = 0; i < num.length; i++) {
   int pivot = num[i];
   int j = i;
   //shift larger elements to the right
   while (j > 0 && pivot < num[j - 1]) {
    num[j] = num[j - 1];
    j--;
   }
   num[j] = pivot;
  }
 }

It is also stable, since elements are picked in the order of the original array and never shifted past an equal one, so the relative order is preserved.
Moreover, it is an online algorithm, since we can insert each element into its correct place as the data is received.
Note that Java's Arrays.sort actually falls back to Insertion Sort for "tiny arrays" (length below about 47).

Here is also a linked-list implementation from LeetCode. In fact, this problem is the reason I started this summary today.


public ListNode insertionSortList(ListNode head) {
        if (head == null || head.next == null)
            return head;
        ListNode dummy = new ListNode(0);
        
        while (head != null) {
            //starting from the beginning of the sorted list
            ListNode node = dummy;
            while(node.next != null && node.next.val < head.val) {
                node = node.next;
            }
            ListNode tmp = head.next;
            head.next = node.next;
            node.next = head;
            head = tmp;
        }
        return dummy.next;
    }

Merge Sort

Ha, finally, our beloved merge sort. This is an O(n log n) comparison-based sorting algorithm. Most implementations produce a stable sort. It is a divide-and-conquer algorithm.


  1. Divide the array into two subarrays (Divide).
  2. Sort each subarray (Conquer).
  3. Merge them into one array.

Here is a visualization of Merge Sort.

public static void mergeSort(int[] num) {
  int[] tmp = new int[num.length];
  divide(num, tmp, 0, num.length - 1);
 }
 private static void divide(int[] num, int[] tmp, int left, int right) {
  if (left >= right)
   return;
  int mid = (left + right) / 2;
  divide(num, tmp, left, mid);
  divide(num, tmp, mid + 1, right);
  merge(num, tmp, left, mid + 1, right);
 }
 private static void merge(int[] num, int[] tmp, int leftStart, int rightStart, int rightEnd) {
  int leftEnd = rightStart - 1;
  int k = leftStart;
  int length = rightEnd - leftStart + 1;
  while (leftStart <= leftEnd && rightStart <= rightEnd) {
   //"=" maintain stability
   if (num[leftStart] <= num[rightStart])
    tmp[k++] = num[leftStart++];
   else
    tmp[k++] = num[rightStart++];
  }
  while (leftStart <= leftEnd)
   tmp[k++] = num[leftStart++];
  while (rightStart <= rightEnd)
   tmp[k++] = num[rightStart++];
  for (int i = 0; i < length; i++, rightEnd--) {
   num[rightEnd] = tmp[rightEnd];
  }
 }

The complexity, as I have mentioned, is O(n log n). The reason is that each time we divide the array into two subarrays, so if we picture the recursion as a balanced binary tree, it has depth log n. At each level we compare and merge elements in linear time, O(n) in total per level, so overall we have O(n log n) complexity.
As we can see, merge sort takes O(n) extra space. Quick sort, which we will see below, is another O(n log n) algorithm, but it sorts in place and needs only the recursion stack (O(log n) on average).

Here is another linked-list implementation from LeetCode.


public ListNode sortList(ListNode head) {
        if (head == null || head.next == null)
            return head;
        ListNode mid = findMiddle(head);
        ListNode right = sortList(mid.next);
        mid.next = null;
        ListNode left = sortList(head);
        return merge(left, right);
        
    }
    private ListNode merge(ListNode n1, ListNode n2) {
        ListNode newHead = new ListNode(0);
        ListNode dummy = newHead;
        while (n1 != null && n2 != null) {
            if (n1.val <= n2.val) {
                dummy.next = n1;
                n1 = n1.next;
            }
            else {
                dummy.next = n2;
                n2 = n2.next;
            }
            dummy = dummy.next;
        }
        if (n1 != null) 
            dummy.next = n1;
        else
            dummy.next = n2;
        return newHead.next;
    }
    private ListNode findMiddle(ListNode head) {
        ListNode fast = head.next;
        ListNode slow = head;
        while (fast != null && fast.next != null) {
            fast = fast.next.next;
            slow = slow.next;
        }
        return slow;
    }


Quick Sort

This is my favorite sorting algorithm. It can be done in place (for arrays), it is relatively fast, and since it is a comparison sort, it can be applied to any comparable objects.

It is also a divide-and-conquer algorithm, so you can expect O(n log n) complexity on average.


  1. Pick a pivot, usually the middle element of the array.
  2. Move all elements larger than the pivot to its right and all elements smaller than the pivot to its left.
  3. Recursively sort the elements on the left side of the pivot and on the right side of the pivot separately.

Each level of recursion takes linear time for comparing and swapping elements, and there are log n levels of partitioning on average, so the complexity is O(n log n) on average. In the worst case, when the pivot is always the smallest or largest element, T(n) = O(n) + T(0) + T(n - 1) ~ O(n^2).

Update 2016-08-17
A clearer and cleaner implementation. The rule of thumb is to make sure the partition index returned each time points at an element that is already in its final sorted position.


    public void sort(int[] nums){
        quickSort(nums, 0, nums.length - 1);
    }

    private void quickSort(int[] nums, int left, int right) {
        int index = partition(nums, left, right);
        if (index > left) {
            quickSort(nums, left, index - 1);
        }
        if (index < right) {
            quickSort(nums, index + 1, right);
        }
    }

    private int partition(int[] nums, int left, int right) {
        if (left == right) {
            return left;
        }
        int pos = (left + right) / 2;
        int pivot = nums[pos];
        while (left < right) {
            while (left < right && nums[left] < pivot) {
                left++;
            }
            while (left < right && nums[right] >= pivot) {
                right--;
            }
            if (left < right) {
                swap(nums, left, right);
            }

        }
        if (pos > left && nums[left] > nums[pos]) {
            swap(nums, left, pos);
        }
        return left;
    }

    private void swap (int[] num, int left, int right) {
        int tmp = num[left];
        num[left] = num[right];
        num[right] = tmp;
    }


 public static void quickSort(int[] num) {
  doQuickSort(num, 0, num.length - 1);
 }
 public static void doQuickSort(int[] num, int left, int right) {
  int index = partition(num, left, right);
  if(left < index - 1)
   doQuickSort(num, left, index - 1);
  if(index < right)
   doQuickSort(num, index, right);
 
 }
 private static int partition(int[] num, int left, int right) {
  int i = left; 
  int j = right;
  int pivot = num[(left + right) / 2];
  while (i <= j) {
   while (num[i] < pivot)
    i++;
   while(num[j] > pivot)
    j--;
   if (i <= j) {
    swap(num, i, j);
    i++;
    j--;
   }
  }
  return i;
 }

I realized that Java replaced the quicksort in java.util.Arrays with the new Dual-Pivot Quicksort. According to the Javadoc, "This algorithm offers O(n log(n)) performance on many data sets that cause other quicksorts to degrade to quadratic performance, and is typically faster than traditional (one-pivot) Quicksort implementations."

The explanation of the Dual-Pivot Quicksort can be found here, and if you are interested in the Java source code, click here.
However, there are concerns about expensive swaps:
while dual-pivot quicksort performs on average 5% fewer comparisons, it performs significantly more swaps (approximately 80% more).
On the other hand, since Dual-Pivot Quicksort divides the array into smaller subarrays than the traditional one-pivot version, it improves cache behavior.

Note that for a "small array", i.e., one whose length is below 286, Java still uses (dual-pivot) Quick Sort directly rather than checking for merge-sort-friendly runs.
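In everyday code we rarely hand-roll any of these, of course; a quick sketch of leaning on the library instead (Arrays.sort uses the dual-pivot quicksort for primitive arrays and a stable merge sort for object arrays):

```java
import java.util.Arrays;

public class LibrarySortDemo {
    public static void main(String[] args) {
        int[] num = {5, 2, 9, 1, 5, 6};
        Arrays.sort(num);                         // dual-pivot quicksort for int[]
        System.out.println(Arrays.toString(num)); // [1, 2, 5, 5, 6, 9]

        Integer[] boxed = {5, 2, 9, 1, 5, 6};
        Arrays.sort(boxed);                       // stable merge sort for Object[]
        System.out.println(Arrays.toString(boxed));
    }
}
```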

Dual Pivot Quick Sort

In the end, I decided to add this one too. The reason is that when I was doing the Sort Colors problem on LeetCode, even though it was my third time doing that problem, I still ran into issues. That problem is similar to the partition step of Dual-Pivot Quick Sort, so I thought, OK, let's figure this out.

As its name suggests, Dual-Pivot Quick Sort uses two pivots. Java picks 5 evenly spaced elements from the input array, sorts them using Insertion Sort, and chooses the 2nd and 4th as the two pivots. To simplify, I choose the first and last elements of the array as the two pivots (lt: lessThan, gt: greatThan).
  1. Ensure num[low] <= num[high] (swap them if necessary); initialize lt = low + 1, gt = high - 1, index = low + 1.
  2. While index <= gt: if num[index] < num[low], swap it with num[lt] and advance both lt and index; if num[index] > num[high], swap it with num[gt] and decrement gt (without advancing index); otherwise just advance index.
  3. When the loop exits, elements with index > gt are larger than num[high] and elements with index < lt are smaller than num[low].
  4. Swap num[low] with the last element before lt and swap num[high] with the first element after gt, putting both pivots in their final positions.
  5. Recursively sort the three parts: the elements smaller than the first pivot, the middle part from lt + 1 to gt - 1, and the elements greater than the second pivot.

public static void dualPivotQuickSort(int[] num) {
  doDualPivotQuickSort(num, 0, num.length - 1);
 }
 private static void doDualPivotQuickSort(int[] num, int low, int high) {
  if (low >= high)
   return;
  if (num[low] > num[high])
   swap(num, low, high);
  int lessThan = low + 1;
  int greatThan = high - 1;
  int index = low + 1;
  while (index <= greatThan) {
   if (num[index] < num[low]) {
    swap(num, lessThan, index);
    lessThan++;
    index++;
   }
   else if (num[high] < num[index]) {
    swap(num, index, greatThan);
    greatThan--;
   }
   else
    index++;
  }
  lessThan--;
  swap(num, low, lessThan);
  greatThan++;
  swap(num, high, greatThan);
  doDualPivotQuickSort(num, low, lessThan - 1);
  if (num[lessThan] < num[greatThan])
   doDualPivotQuickSort(num, lessThan + 1, greatThan - 1);
  doDualPivotQuickSort(num, greatThan + 1, high);
 }

I added the performance test at the bottom.
Heap Sort

Last but not least, heap sort is also a comparison-based sorting algorithm. It works by first organizing the array into a heap. Recall that a heap is a tree-based data structure that satisfies the following property:

If A is a parent node of B then the key of node A is ordered with respect to the key of node B with the same ordering applying across the heap. Either the keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node (this kind of heap is called max heap) or the keys of parent nodes are less than or equal to those of the children and the lowest key is in the root node (min heap).  
When I first started learning data structures, I couldn't fully understand how to view an array as a heap tree. The following figure shows a tree representation of the array.


We can see that for the element at index i, the elements at indices 2 * i + 1 and 2 * i + 2 are its left and right children, if they exist. Conversely, the parent of the element at index k is at index (k - 1) / 2.
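This index arithmetic can be captured in three one-liners (the helper names are mine; the heap sort code below simply inlines them):

```java
public class HeapIndex {
    public static int leftChild(int i)  { return 2 * i + 1; }
    public static int rightChild(int i) { return 2 * i + 2; }
    public static int parent(int k)     { return (k - 1) / 2; }

    public static void main(String[] args) {
        // the children of index 1 are 3 and 4, and both point back to parent 1
        System.out.println(leftChild(1));  // 3
        System.out.println(rightChild(1)); // 4
        System.out.println(parent(3));     // 1
        System.out.println(parent(4));     // 1
    }
}
```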

To do heap sort, the first thing we need is to build a heap, which is called Heapify. That is done by sifting down every element that violates the heap property. The shiftDown() method below builds a min heap, which is what we want for sorting. After Heapify, the above array will look like this.

Next we sort the array. The algorithm is very interesting, and we still need the shiftDown method:

Swap the first element (the minimum) with the last element.
Fix the last element (the previous root) in place and sift down the new first element. After the first step, it will look like this:


Notice what's going on? The next smallest element is on top. We just iteratively swap the top element with the last element that is not yet fixed, and in the end we have an array like this:

Yup, it is sorted in reverse! So the last step is to reverse it!
Here is a visualization of heap sort

public static void heapSort(int[] num) {
  //build a min heap
  for (int i = num.length / 2 - 1; i >= 0; i--) {
   shiftDown(num, num.length, i);
  }
  int size = num.length;
  while (size > 0) {
   //move the current minimum to the end of the unfixed region
   swap(num, 0, --size);
   shiftDown(num, size, 0);
  }
  //the array is now in descending order; reverse it
  for (int i = 0, j = num.length - 1; i < j; i++, j--) {
   swap(num, i, j);
  }
 }
 private static void shiftDown(int[] num, int size, int index) {
  if (index >= size)
   return;
  int half = size / 2;         //nodes with index >= half are leaves
  int copy = num[index];
  while (index < half) {
   int child = 2 * index + 1;  //left child
   int right = child + 1;
   if (right < size && num[right] < num[child]) {
    child = right;             //pick the smaller child
   }
   if (copy < num[child])
    break;
   num[index] = num[child];    //move the smaller child up
   index = child;
  }
  num[index] = copy;
 }

Note that heap sort can be done in place, but it is not a stable sort (swaps are involved).
A binary heap is also how Java implements its priority queue.
Moreover, besides binary heap sort, there is k-ary heap sort, which allows less memory traffic.
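As a quick illustration, repeatedly polling java.util.PriorityQueue (a binary min heap) yields the elements in ascending order, which is essentially a heap sort with O(n) extra memory:

```java
import java.util.PriorityQueue;

public class PriorityQueueDemo {
    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        for (int n : new int[]{5, 1, 4, 2, 3})
            pq.offer(n);
        StringBuilder sb = new StringBuilder();
        while (!pq.isEmpty())                    // poll() always removes the minimum
            sb.append(pq.poll()).append(' ');
        System.out.println(sb.toString().trim()); // 1 2 3 4 5
    }
}
```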


The performance

All performance tests use integer arrays whose elements are generated randomly from 0 to 50000.
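The exact harness is not shown in this post; a minimal sketch of such a timing loop (the library sort call is a placeholder you can swap for any of the implementations above) might look like:

```java
import java.util.Arrays;
import java.util.Random;

public class SortBenchmark {
    // random test array with elements in [0, 50000)
    public static int[] randomArray(int size) {
        Random rand = new Random(42);   // fixed seed so runs are repeatable
        int[] num = new int[size];
        for (int i = 0; i < size; i++)
            num[i] = rand.nextInt(50000);
        return num;
    }

    public static boolean isSorted(int[] num) {
        for (int i = 1; i < num.length; i++)
            if (num[i - 1] > num[i])
                return false;
        return true;
    }

    public static void main(String[] args) {
        int[] num = randomArray(20000);
        long start = System.nanoTime();
        Arrays.sort(num);               // swap in bubbleSort, heapSort, etc.
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("sorted=" + isSorted(num) + " in " + elapsedMs + " ms");
    }
}
```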

Bucket Sort
The performance of Bucket Sort depends largely on the number of buckets. My implementation determines the bucket ranges from the maximum and minimum elements of the array, so if the array is small, the elements can be considered well distributed and the number of buckets is relatively large; then the complexity is close to linear, i.e., almost all the work is inserting each element into its corresponding bucket. The disadvantage, however, is that we need a large amount of preallocated memory.

Bucket Sort with small array size

On the other hand, if the array is relatively large, the elements start to cluster, the insertion-sort work inside the buckets increases, and the performance approaches O(n^2).

Bucket Sort with large array size

Comparison




This comparison is generated from an array of size 20000. The performance is as expected. Counting sort is not as fast as the O(n log n) algorithms, possibly because of the array copy operation.
