Wednesday, September 29, 2010

Threaded Merge Sort


In my previous post about Merge Sort, the algorithm for merge sort was discussed. Although the pictures shows something which appears to be apparently parallel, but actually they are not. In fact, as the procedure introduced there was recursive, it works in a post-order fashion, meaning, first it solves the left half, then the right half, and then it works with the current range. The following picture shows the actual work-flow of the standard recursive merge sort algorithm:

The red lines and numbers shows the actual call sequence of the algorithm merge sort, for reference, the basic algorithm, is shown here again:

MERGE-SORT(A, p, r):
if p < r
then q := (r + p) / 2
MERGE-SORT(A, q+1, r)
MERGE(A, p, q, r)

So, from the above example, we can see that no two calls are concurrent, i.e. parallel. Each call is one of the three calls made in MERGE-SORT(A, p, r) procedure stated above, also shown below (according to the picture, p=0, r=6):

Step-00: MERGE-SORT(A, 0, 6)
Step-01: MERGE-SORT(A, 0, 3)
Step-02: MERGE-SORT(A, 0, 1)
Step-03: MERGE-SORT(A, 0, 0)
Step-04: MERGE-SORT(A, 1, 1)
Step-05: MERGE(A, 0, 0, 1)
Step-06: MERGE-SORT(A, 2, 3)
Step-07: MERGE-SORT(A, 2, 2)
Step-08: MERGE-SORT(A, 3, 3)
Step-09: MERGE(A, 2, 2, 3)
Step-10: MERGE(A, 0, 1, 3)
Step-11: MERGE-SORT(A, 4, 6)
Step-12: MERGE-SORT(A, 4, 5)
Step-13: MERGE-SORT(A, 4, 4)
Step-14: MERGE-SORT(A, 5, 5)
Step-15: MERGE(A, 4, 4, 5)
Step-16: MERGE-SORT(A, 6, 6)
Step-17: MERGE(A, 4, 5, 6)
Step-18: MERGE(A, 0, 3, 6)

Also note that, there are no overlapping calls at a same level of the tree, which is going to be a key factor.

How threading helps:

After a few observations, we can easily find out some interesting properties about merge sort algorithm. If we closely look at the merge sort tree structure, it becomes very clear that, int the recursive call tree, parent nodes depend on children, but siblings are independent. This property helps us to deduce a threaded version of merge sort algorithm where, we can really solve the left and right sub-portion of a range in a parallel fashion by making the calls threaded.

As the algorithm here is recursive, it might look confusing at the first glance, but it is not. Because, nodes share data only with their ancestors and in a call, we can wait until the threads finish their works. So, no concurrency problem appears here. Also, we don't need to worry about mutual exclusion of the shared data, because, there is no such situation in this algorithm, already stated above.

So, the threaded version of the algorithm will look like this:

if p < r
then q := (r + p) / 2
start_thread(left_thread, THREADED-MERGE-SORT(A, p, q))
start_thread(right_thread, THREADED-MERGE-SORT(A, q+1, r))
MERGE(A, p, q, r)

According to the above algorithm, the flow of program changes which is shown in the picture below:

Now because of being able to run multiple THREADED-MERGE-SORT() at a time, and the job scheduling features of modern computers, the effective number of iteration is dramatically reduced, resulting a much much elegant solution for the heavy duty areas like manipulating large data on drives.

A simple implementation:

Here a simple implementation is presented using pthread library in C++:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct { int i, j; } pair;
const int MAX = 1024;
int a[MAX];
pthread_attr_t attr;

void *threaded_merge_sort(void *param) {
pair range = *(pair *)param, lrange, rrange;
pthread_t lside, rside;
int p = range.i, q, r = range.j, n1, n2, n = r-p+1, i, j, k;
int *aleft, *aright;
if(p < r) {
q = (p + r) >> 1;
lrange.i = p, lrange.j = q, rrange.i = q + 1, rrange.j = r;
pthread_create(&lside, &attr, threaded_merge_sort, (void *)&lrange);
pthread_create(&rside, &attr, threaded_merge_sort, (void *)&rrange);
pthread_join(lside, NULL);
pthread_join(rside, NULL);
n1 = q - p + 1, n2 = r - q;
aleft = (int *)malloc(sizeof(int) * n1);
aright = (int *)malloc(sizeof(int) * n2);
for(i = 0; i < n1; i++) aleft[i] = a[p+i];
for(i = 0; i < n2; i++) aright[i] = a[q+1+i];
for(k = i = j = 0; k < n; k++) {
if(i >= n1 || (j < n2 && aleft[i] > aright[j])) a[k+p] = aright[j++];
else a[k+p] = aleft[i++];
return NULL;

int main() {
int n, i;
pthread_t sorter;
pair range;
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
while(scanf("%d", &n)==1 && n) {
for(i = 0; i < n; i++) scanf("%d", &a[i]);
range.i = 0, range.j = n-1;
pthread_create(&sorter, &attr, threaded_merge_sort, (void *)&range);
pthread_join(sorter, NULL);
for(i = 0; i < n; i++) printf("%d%c", a[i], (i==n-1? '\n' : ' '));
return 0;

This may look a bit messy, but, actually, you may find a little difference with the original one, we just created threads instead of plain recursive calls, that's it.
To learn more about pthread, check these out:

Give it a try! :)


  1. Hey pal! im implementing one as well onlywith arbitrary data as input. I was wondering when looking at your code if you really need to get the 'pthread_attr_t attr;' and then pass it as a param.. can u just pass NULL?

  2. Hi Jose, nope, just for the merge sort, they are not needed. It was from my habit to use attributes withe pthread rather than using NULL.

  3. What do u use for crating images like above ?? any particular algo simulation tool ??

  4. Most of the images are copied from internet then I edited them with ms paint, for example, images here are basically from wikipedia, then I edited them in paint to show the execution paths (red lines).

  5. How can this code change for number of certain process? for expample if we define N 6 processes?

  6. You can keep 1 process for controlling, partition the array in five groups and apply normal merge sort in each for corresponding segment. Then in the control process, finally combine 5 segments. There can be many other ways.

  7. Do you know about an implementation in c#?

    1. Sorry, I don't know C# :(
      But I think this code is easy to convert in other languages if you know the tread mechanics of that language.