Today I'll be talking about hacking your C code by injecting some Assembly code into it.


Alright, maybe this isn't really "hacking", and it isn't as revolutionary as it might seem. It turns out GCC has supported this functionality for a long time now. Perhaps it's even related to how Chris Sawyer managed to write RollerCoaster Tycoon almost entirely in assembly.

I've learned by now in my SPO600 class that you can get down into the nitty-gritty details of a library, compiling different versions for different platforms to squeeze as much performance as possible out of each one. You can even go beyond different C implementations and get into assembly implementations (though you couldn't pay me enough to be a full-time assembly programmer, to be honest). Now I've learned that you can have your cake and eat it too, if you're willing to use a certain syntax to mix some assembly into your C program just where you need it.

Background

Here's a quick example. First, a C program that adds two numbers together and displays the result:

#include <stdio.h>

int main() {
    int a = 20;
    int b = 22;

    // The action is here!
    b = a + b;

    printf("The sum is %d.", b);
}

This is a simple example, in fact so simple that the optimizations done for me by the C compiler may already make it as fast as it can possibly be, but let's throw in some inline assembly to replace the "action" anyway. Here's the same C program, with the computation replaced by inline assembly (for the x86_64 architecture):

#include <stdio.h>

int main() {
    int a = 20;
    int b = 22;

    // The action is here!
    __asm__("add %1,%0"
            : "+r"(b)
            : "r"(a)
            :
    );

    printf("The sum is %d.", b);
}

In this version of the program, the operands are numbered in the order I describe them: %0 is the output operand (linked to the variable "b") and %1 is the input operand (linked to the variable "a"). In AT&T syntax, "add %1,%0" adds the contents of the register holding "a" into the register holding "b", leaving the sum in the register linked to "b". Why describe the output first? The extended "asm" syntax requires output operands to be listed before input operands.

After the asm statement finishes, its results are available through the variables listed as outputs. In this case, the sum of 42 ends up in the variable "b", and just like with the plain C program, I can display the sum to the user by reading the contents of this variable.

Real World Usage

So where does this come into play for more real world usage? In my last lab, I examined different algorithms for processing an audio signal. I was able to get the most efficient result by using a lookup table containing the results of multiplying each possible signal value by the volume scaling factor. On the AArch64 platform, unexpectedly, this approach was slower. But in the real world, on most systems, the lookup table should be faster than doing each multiplication operation. Now, armed with the knowledge that you can inject assembly code into C programs, let's examine an algorithm that uses inline assembly to try to beat those earlier results. This code was provided by our professor. I've annotated the questions featured in the code comments with their answers.

// set vol_int to fixed-point representation of 0.5
// Q: should we use 32767 or 32768 in next line? why?
// A: 32767 must be used because this is the maximum value of the int16_t data type. The volume factor could go as high as 1.0. So, 1.0 * 32768 would be one too high for this data type.
vol_int = (int16_t) (0.5 * 32767.0);

printf("Scaling samples.\n");

// Q: what does it mean to "duplicate" values here?
// A: This takes the value in the w22 register and, in one operation, copies it into each of the eight 16-bit lanes of the vector register v1. Having the volume factor in every lane is what lets the later multiply instruction process eight samples at once.
__asm__ ("dup v1.8h,w22"); // duplicate vol_int into v1.8h
while ( in_cursor < limit ) {
    __asm__ (
    "ldr q0, [x20],#16 \n\t"
    // load eight samples into q0 (v0.8h)
    // from in_cursor, and post-increment
    // in_cursor by 16 bytes

    "sqdmulh v0.8h, v0.8h, v1.8h \n\t"
    // multiply each lane in v0 by v1*2
    // saturate results
    // store upper 16 bits of results into v0

    "str q0, [x21],#16 \n\t"
    // store eight samples to out_cursor
    // post-increment out_cursor by 16 bytes

    // Q: what happens if we remove the following
    // lines? Why?
    // A: If the following lines are removed, the compiler no longer knows that the asm block reads and writes these variables. It is then free to keep them in other registers, reuse the registers the asm touches, or optimize the surrounding loop based on stale values. The result is unintended behavior at runtime (wrong output) rather than a helpful compiler error.
    : "=r"(in_cursor)
    : "r"(limit),"r"(in_cursor),"r"(out_cursor)
    );
}

Now let's use the same value for the "number of samples" value as I used in the previous algorithms lab (500000000) and see if this implementation is faster:

Fastest old implementation on AArch64: 3.840 seconds (fixed-point arithmetic)
This implementation: 0.752 seconds

In this case, the assembly optimization proved to be much faster! Optimizing a small part of your program this way is probably worth it when the code is small, unlikely to change over time, and sits somewhere its benefit can be shared by other programs that depend on it.

Investigating Popular Software

Now let's investigate a popular open source software project to see how they use inline Assembly code to speed up their program. For this, I'll be looking at Blender.

The entire Blender git repo only has inline assembly code in one file, called "system.c". This is an excerpt from that source file, where it's used in one function definition:

int BLI_cpu_support_sse2(void)
{
#if defined(__x86_64__) || defined(_M_X64)
  /* x86_64 always has SSE2 instructions */
  return 1;
#elif defined(__GNUC__) && defined(i386)
  /* for GCC x86 we check cpuid */
  unsigned int d;
  __asm__(
      "pushl %%ebx\n\t"
      "cpuid\n\t"
      "popl %%ebx\n\t"
      : "=d" (d)
      : "a" (1));
  return (d & 0x04000000) != 0;
#elif (defined(_MSC_VER) && defined(_M_IX86))
  /* also check cpuid for MSVC x86 */
  unsigned int d;
  __asm {
    xor     eax, eax
    inc eax
    push ebx
    cpuid
    pop ebx
    mov d, edx
  }
  return (d & 0x04000000) != 0;
#else
  return 0;
#endif
}

The purpose here is to use the "CPUID" instruction to learn about the CPU's capabilities. Specifically, they check bit 26 of the EDX result of CPUID leaf 1 (that's what the 0x04000000 mask isolates), which indicates support for the SSE2 instruction set, presumably because certain features of Blender require it. I don't have time to investigate the whole Blender code base, but my assumption is that SSE2's vector instructions help with rendering things quickly or performing other performance-sensitive work.

When it comes to programs using inline assembly code, there's always a trade-off of portability and simplicity vs. performance. Use too much inline assembly, and your code becomes less portable (are you sure the instructions you hand-picked are supported on every CPU this code ends up compiled for?), and maintainers need to understand some assembly just to know what your code is doing. Use none at all, and you risk having your program not take full advantage of the hardware it runs on.

I think that in this case, the usage of inline assembly was appropriate. They need to find out what instructions the CPU supports and this is the only way to do it. It's only a few lines in one source code file, it's not like they're sprinkling bits of assembly code everywhere in a disorganized manner. This particular use of assembly code is also simple to understand. As a novice, I was able to learn about it with some quick googling, so I'm sure that anybody wishing to contribute to the Blender project will not be too confused by it.

Conclusion

While I've learned that I'm definitely most comfortable with web development and high-level languages, it has been fun to investigate this low-level coding and see what little hacks you can do when you're willing to micromanage the registers. I feel like there may be a day when I'm coding something and think, "Gee, it sure would be simpler to just make some byte arrays and write my own algorithm in C to do this," and I set about making a C extension for my task. At that point, maybe I'll have a "go big or go home" attitude and, if it's performance critical, be willing to get my hands dirty with the registers. This kind of coding may help me out in a pinch in my future career, and I'm glad I've had a chance to see it.


This was originally posted on the blog I used for my SPO600 class while studying at Seneca College.