Post

Duff's device (A retrospect over vintage syntactic sugar)

TL;DR

Duff’s copy is a method of implementing array copying in C that relies on unusual syntactic sugar. Although it may be as efficient as the standard copying algorithm, it is not commonly used in modern times. However, it is still considered an interesting approach that could assist programming novices in appreciating the beauty of C’s syntax.

Intro

Programmers are creatures who are genearlly explorative in nature. We have a tendency to implement bizzare ideas which may or may not have a practical application. These implementations have baffled most of the genius minds time over time and have set an unprecedented league of absurdities yet to be overthrown.

Duff’s Device is one of the most prominently discussed implementations, which aims to introduce a unique way of thinking, accompanied by the unusual syntactic exploitation of the C language

Standard Array Copy

Before we step into the labyrinth of understanding Duff’s Device, getting familiar with the current standard implementation of array copying would be a great approach.

1
2
3
4
5
6
int src[] = {1, 2, 3, 4};
int dest[4];

for(int i = 0; i < 4; i++) {
    src[i] = dest[i];
}

The above code copies the elements from the src array to dest array using a simple for-loop.

The string copy!

This sub-section can be skipped, but the code below is a peculiar method1 of implementing copying of strings in C. As you might recall, strings in C are essentially character arrays. Therefore, pointer manipulation is also applicable to them. Furthermore, upon further reading you may notice additional similarities in Duff’s Device and this strcpy method.

One particular implementation of strcpy as follows

1
2
3
4
5
char* strcpy(register char *dst, register const char *src) {
    while ((*dst++ = *src++) != '\0');
    return dst;
}

An intution for the above function can be envisioned as follows String Copy

As you can see in the above animation, we are dereferencing the src pointer and assigning the value found to the dereferenced dest pointer location. Both the variables are then incremented to point the next block of memory. This process is repeated until we’ve reached the end of the string, that is \0 in our case.

PS: The assignment operator always returns what is assigned to the current variable. This is the reason we’re able to check if the current assigned variable is \0 or not.

The example implementation above is semi-baked version of the actual function. Refer this, if you interested in more implementations of strcpy.

Duff’s Copy in C

The long awaited part of the blog is finally here! One of the most dramatic way of loop-unrolling2; that utilizes C’s switch-case and while loop to copy elements from the source to destination array. Duff’s Copy is a bizarre yet entirely possible technique.

Have a look at it’s implementation below

1
2
3
4
5
6
7
8
9
10
11
12
13
int* duff(int* dst, const int* src, int len) {
    int n = (len + 3) / 4;

    switch(len % 4) {
        case 0: do { *dst++ = *src++;
        case 3: *dst++ = *src++;
        case 2: *dst++ = *src++;
        case 1: *dst++ = *src++;
            } while(--n);
    };

    return dst;
}

Weird3, right?

Well, I won’t blame you. I doubted if it would even compile without any errors. But, to my surprise, it was valid C code. And what’s more surprising is that it’s quite efficient in what it does, too. So, let’s not beat around the bush and get started on understanding this code.

First of all, we get the input parameters for the function,

1
int* duff(int* dst, const int* src, int len) {
  • dst -> The destination array to copy to
  • src -> The source array to copy from
  • len -> The length of the source and destination array (For simplicity, we assume dst is pre-allocated with len blocks)

Then, we get this weird calculation for n

1
    int n = (len + 3) / 4;

What does this n do? This, to be specific, is one of the most significant parts of the algorithm. Here, we calculate the number of iterations we can do per loop. For a better intuition, look at the video below.

The number of arrows above represent the case scenario being triggered. To calculate this, we can look at the cases present

1
2
3
4
5
6
7
    switch(len % 4) {
        case 0: do { *dst++ = *src++;
        case 3: *dst++ = *src++;
        case 2: *dst++ = *src++;
        case 1: *dst++ = *src++;
            } while(--n);
    };

Firstly, since the size of the array is 10, dividing it by modulo 4 gives us 2. So, at the first iteration of the loop, we trigger the case 2.
Now is the moment we can pause and ponder about the working of this weird while loop in between. What do you think will happen when case 2 gets triggerd?

To be honest, at first, I couldn’t comprehend what might happen. But after looking at its execution, I was awestruck.

The first iteration was designed to handle the edge case of the array’s length, while the subsequent iterations cover all other cases. Refer to the example below for a better understanding.

1
2
3
4
5
6
7
    switch(len % 4) {
        case 0: do { *dst++ = *src++; printf("Case 0\n");
        case 3: *dst++ = *src++; printf("Case 3\n");
        case 2: *dst++ = *src++; printf("Case 2\n");
        case 1: *dst++ = *src++; printf("Case 1\n");
            } while(--n);
    };

We annotated each and every case triggered, so that we get a better understanding. The output is as follows.

1
2
3
4
5
6
7
8
9
10
11
12
❯ ./simple_duff
Case 2
Case 1
Case 0
Case 3
Case 2
Case 1
Case 0
Case 3
Case 2
Case 1
0 1 2 3 4 5 6 7 8 9

The program can be found here

If we look at the length of the array, it’s 10. if we could remove 2 elements from it, the remaning is 8 that is divisible by 4. So, if we handle the edge case of the 2 elements, the rest of it is simple copying.

Clever, innit? In fact this kind of tricks have a distinctive name in Computer Science named loop unrolling. But, this isn’t what Tom Duff has invented. This version of “Duff’s Copy” is nothing but an extension to the real Duff’s device.

The real Duff’s Device

Invented by Tom Duff, this was a fast working mechanism for copying buffers into a magic IO register4. The difference between Duff’s device and Duff’s copy is the ++ part is ommited in former. This is because, the buffer being copied into is not a memory region, but a IO register which doesn’t persist memory.

Let’s have a peek at the code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
   register n = (count + 7) / 8;      /* count > 0 assumed */

   switch (count % 8)
   {
   case 0:        do {  *to = *from++;
   case 7:              *to = *from++;
   case 6:              *to = *from++;
   case 5:              *to = *from++;
   case 4:              *to = *from++;
   case 3:              *to = *from++;
   case 2:              *to = *from++;
   case 1:              *to = *from++;
                      } while (--n > 0);
   }

Here, the cases range from 0-7, including 8 copy instructions. Which is a bit bigger than the previous 4 that we’ve seen in Duff’s Copy. The only thing that changes is the number, but the logic behind it remains the same. For an intuitive explanation of Duff’s Device, please watch the video below.

Note: This video utilizes 4 to be instructions for copy.

Performance

These are the performance results for copying 0-99999 integers.

Fig.1 Speed in milli seconds

With no clear margin, both of them perform quite on the same league.

The code used to bench can be found here Benchmarks are done by hyperfine

Why not use it?

To quote an reply from ycombinator

The biggest problem with this technique, from an optimization perspective, is that it creates an irreducible control-flow graph. In simple terms, this is a cycle in the graph that has two entry points from outside the cycle, and as a result, it is no longer considered a loop. This means you’re going to disable any loop optimization–so Duff’s Device is only worth it if you’re going to write the optimal loop form already, and the compiler isn’t going to arrive at that form.

The Duff’s device increases complexity in simple code and prohibits the compiler to run its own optimization on it. So, whether to use it or not is the programmers prerogative. But, the programmer has to remember that this implementation comes with a cost.

Further reading

Duff’s Device Wikipedia
Duff’s Device Post

Implementations of strcpy

Footnotes

  1. Only for novices in C. Seasoned veterans recite these weird hieroglyphics with astute fluency. ↩︎

  2. Optimize loop execution speed with reducing the amount of instructions executed. Refer more about it here ↩︎

  3. Even Tom Duff says that’s its disgusting yet a it runs. refer ↩︎

  4. From here ↩︎

This post is licensed under CC BY 4.0 by the author.