For my fifth lab in my SPO600 class, I'll be creating a C program for an AArch64 system in a way that allows the GCC compiler to automatically vectorize it for me.

The program must do the following:

  • Fill two arrays each with 1000 random integers between -1000 and 1000
  • Map over the two arrays, adding the numbers from each, and putting the result into a third array

I'll start by writing the program in a naive way. I'll just write it the way I normally would, without thinking about vectorization:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 1000

int main() {
    srand(time(NULL));

    int a[ARRAY_SIZE];
    int b[ARRAY_SIZE];

    int c[ARRAY_SIZE];

    // Fill first two arrays with random numbers:
    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = rand() % 2001 - 1000; // range is -1000..1000 inclusive
        b[i] = rand() % 2001 - 1000;
    }

    // Add from a and b and put result into c
    for (int i = 0; i < ARRAY_SIZE; i++) {
        c[i] = a[i] + b[i];
    }
}

Now let's see what assembly code was produced by the compiler:

enabling_vectorization_1

There's actually a lot of code here and it looks pretty repetitive, so I just posted a screenshot excerpt instead of the whole thing. The important thing is that if auto-vectorization worked, we should see particular registers meant for vectorization used.

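Vector registers have names that begin with the letter 'v' followed by the register number. When an instruction uses one as a vector, the name is followed by a period and a suffix describing how the register is split up. For example, "v0.4s" means register v0 treated as four 32-bit lanes. We can search the objdump output I screenshotted above for any vector registers being used: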

enabling_vectorization_2

Nope. :( No vector registers used.

But let's try compiling this again using the -O3 switch for GCC, which turns on its most aggressive standard optimization level, including auto-vectorization:

enabling_vectorization_3

Now, let's examine the assembly code again and look for the vector registers:

enabling_vectorization_4

Still nothing. :( In this case, the compiler wasn't able to automatically vectorize this for us. What we need to do is add something else to our source code. According to the GNU manuals, we can add "attributes" to our code to help guide the compiler as to what our intentions are.

They have this example:

int x __attribute__ ((aligned (16))) = 0;

This "causes the compiler to allocate the global variable x on a 16-byte boundary", meaning its address in memory is a multiple of 16 bytes. Why does that matter? 16 bytes is 128 bits, which is the width of the AArch64 vector registers we're hoping to see, so an array aligned this way can be loaded into them in neat 16-byte chunks. An int on this system is 32 bits, so one 128-bit vector register can hold four ints side by side (that's the "4s" in "v0.4s"). Operating on all four with a single instruction is vectorization. I can write my code like this:

enabling_vectorization_5
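
In text form, a minimal sketch of this idea might look like the following. It is not necessarily identical to the code in the screenshot: declaring the arrays as file-scope statics, and the helper names is_aligned_16 and fill_and_add, are my own assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

#define ARRAY_SIZE 1000

/* Assumption: 16-byte alignment, chosen to match a 128-bit vector register. */
static int a[ARRAY_SIZE] __attribute__((aligned(16)));
static int b[ARRAY_SIZE] __attribute__((aligned(16)));
static int c[ARRAY_SIZE] __attribute__((aligned(16)));

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary. */
int is_aligned_16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical helper: fills a and b with values in -1000..1000,
   then adds them element by element into c. */
void fill_and_add(unsigned seed) {
    srand(seed);
    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = rand() % 2001 - 1000;
        b[i] = rand() % 2001 - 1000;
    }
    for (int i = 0; i < ARRAY_SIZE; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The attribute only controls where the arrays are placed in memory. Whether the loop actually gets vectorized is still up to the compiler and the optimization level.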

And I've now turned one CPU core into something capable of handling four integers at once. I guess my quad-core CPU is now a hexadeca-core CPU? (Disclosure: I'm not an expert and I'm probably exaggerating.)

Let's compile this, with optimization enabled, and inspect its assembly code again to look for those vector registers being used:

enabling_vectorization_6

... really? Alright, this took me a while to figure out. After some more research, I found that the compiler was being so efficient that, because I never used the values computed in the loops, it removed the loops entirely. The only assembly code left was related to calling functions and returning from them:

enabling_vectorization_7

You can see in the screenshot above that there is register activity, but the compiler removed every instruction that would be involved in looping and adding numbers together (from my a and b arrays into my c array).

To solve this, I can force the program to do something with the values after this work has been done. That way, the compiler will be forced to keep it in the generated assembly code:

enabling_vectorization_8
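
As a rough text version of the idea, here is one way to make the results observable. This is a sketch under my own assumptions rather than the exact code from the screenshot: the function name sum_of_added_arrays and the choice to sum c are hypothetical, and anything else that reads c (printing an element, for instance) should serve the same purpose.

```c
#include <stdlib.h>

#define ARRAY_SIZE 1000

/* Hypothetical helper: does the lab's fill-and-add work, then returns the
   sum of c so the additions have an observable effect. */
long sum_of_added_arrays(unsigned seed) {
    int a[ARRAY_SIZE];
    int b[ARRAY_SIZE];
    int c[ARRAY_SIZE];

    srand(seed);

    /* Fill the first two arrays with values in -1000..1000. */
    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = rand() % 2001 - 1000;
        b[i] = rand() % 2001 - 1000;
    }

    /* Add from a and b and put the result into c. */
    for (int i = 0; i < ARRAY_SIZE; i++) {
        c[i] = a[i] + b[i];
    }

    /* Read every element of c so the additions cannot be thrown away. */
    long total = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
        total += c[i];
    }
    return total;
}
```

Calling this from main and printing the result, for example with printf("%ld\n", sum_of_added_arrays((unsigned)time(NULL))), gives the compiler a reason to keep the additions.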

Now let's repeat this process and hopefully, we'll see those vector registers being used...

enabling_vectorization_9

Yes! Now that I wrote a program that actually does something, the compiler stopped thinking I was an idiot, stopped removing my code, and followed my optimization hints. This program will basically be working with four integers at a time as it performs its work, all on one thread on one CPU core.
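
To picture what "four integers at a time" means, here is a hand-written sketch using the GCC/Clang vector_size extension. This is not what the compiler generated for my program, only an illustration of the same idea in C. The type name v4si and the function add_four_at_a_time are my own hypothetical names, and the sketch assumes int is 4 bytes.

```c
#include <string.h>

/* GCC/Clang extension: a 16-byte vector type holding four 32-bit ints. */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical helper: adds n ints from a and b into c, four lanes per step. */
void add_four_at_a_time(const int *a, const int *b, int *c, int n) {
    int i = 0;

    /* Main loop: each iteration handles four elements with one vector add. */
    for (; i + 4 <= n; i += 4) {
        v4si va, vb, vc;
        memcpy(&va, &a[i], sizeof va);  /* load four ints from a */
        memcpy(&vb, &b[i], sizeof vb);  /* load four ints from b */
        vc = va + vb;                   /* element-wise add of all four lanes */
        memcpy(&c[i], &vc, sizeof vc);  /* store four results into c */
    }

    /* Tail loop: handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

With 1000 elements, the main loop would run 250 times instead of 1000, and the tail loop would have nothing left to do.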

This has been a neat exercise in optimization. Here I thought having more CPUs, more cores per CPU, or more threads per core were the only ways to make your programs do things in parallel. The rabbit hole goes deeper!


This was originally posted on the blog I used for my SPO600 class while studying at Seneca College.