In my last post on my SPO project, I talked about finishing setting up my test environment so that I could run proper benchmarks. Now, I'll be exploring the code of the SipHash function in detail and looking into ways to optimize it.

The body of the function (minus the code I added to measure it) looks like this:

/* Test of the CPU is Little Endian and supports not aligned accesses.
 * Two interesting conditions to speedup the function that happen to be
 * in most of x86 servers. */
#if defined(__X86_64__) || defined(__x86_64__) || defined (__i386__)
#define UNALIGNED_LE_CPU
#endif

#define ROTL(x, b) (uint64_t)(((x) << (b)) | ((x) >> (64 - (b))))

#define U32TO8_LE(p, v)                                                        \
    (p)[0] = (uint8_t)((v));                                                  \
    (p)[1] = (uint8_t)((v) >> 8);                                             \
    (p)[2] = (uint8_t)((v) >> 16);                                            \
    (p)[3] = (uint8_t)((v) >> 24);

#define U64TO8_LE(p, v)                                                        \
    U32TO8_LE((p), (uint32_t)((v)));                                          \
    U32TO8_LE((p) + 4, (uint32_t)((v) >> 32));

#ifdef UNALIGNED_LE_CPU
#define U8TO64_LE(p) (*((uint64_t*)(p)))
#else
#define U8TO64_LE(p)                                                           \
    (((uint64_t)((p)[0])) | ((uint64_t)((p)[1]) << 8) |                       \
     ((uint64_t)((p)[2]) << 16) | ((uint64_t)((p)[3]) << 24) |                \
     ((uint64_t)((p)[4]) << 32) | ((uint64_t)((p)[5]) << 40) |                \
     ((uint64_t)((p)[6]) << 48) | ((uint64_t)((p)[7]) << 56))
#endif

#define U8TO64_LE_NOCASE(p)                                                    \
    (((uint64_t)(siptlw((p)[0]))) |                                           \
     ((uint64_t)(siptlw((p)[1])) << 8) |                                      \
     ((uint64_t)(siptlw((p)[2])) << 16) |                                     \
     ((uint64_t)(siptlw((p)[3])) << 24) |                                     \
     ((uint64_t)(siptlw((p)[4])) << 32) |                                     \
     ((uint64_t)(siptlw((p)[5])) << 40) |                                     \
     ((uint64_t)(siptlw((p)[6])) << 48) |                                     \
     ((uint64_t)(siptlw((p)[7])) << 56))

#define SIPROUND                                                               \
    do {                                                                      \
        v0 += v1;                                                             \
        v1 = ROTL(v1, 13);                                                    \
        v1 ^= v0;                                                             \
        v0 = ROTL(v0, 32);                                                    \
        v2 += v3;                                                             \
        v3 = ROTL(v3, 16);                                                    \
        v3 ^= v2;                                                             \
        v0 += v3;                                                             \
        v3 = ROTL(v3, 21);                                                    \
        v3 ^= v0;                                                             \
        v2 += v1;                                                             \
        v1 = ROTL(v1, 17);                                                    \
        v1 ^= v2;                                                             \
        v2 = ROTL(v2, 32);                                                    \
    } while (0)

uint64_t siphash(const uint8_t *in, const size_t inlen, const uint8_t *k) {

#ifndef UNALIGNED_LE_CPU
    uint64_t hash;
    uint8_t *out = (uint8_t*) &hash;
#endif
    uint64_t v0 = 0x736f6d6570736575ULL;
    uint64_t v1 = 0x646f72616e646f6dULL;
    uint64_t v2 = 0x6c7967656e657261ULL;
    uint64_t v3 = 0x7465646279746573ULL;
    uint64_t k0 = U8TO64_LE(k);
    uint64_t k1 = U8TO64_LE(k + 8);
    uint64_t m;
    const uint8_t *end = in + inlen - (inlen % sizeof(uint64_t));
    const int left = inlen & 7;
    uint64_t b = ((uint64_t)inlen) << 56;
    v3 ^= k1;
    v2 ^= k0;
    v1 ^= k1;
    v0 ^= k0;

    for (; in != end; in += 8) {
        m = U8TO64_LE(in);
        v3 ^= m;

        SIPROUND;

        v0 ^= m;
    }

    switch (left) {
    case 7: b |= ((uint64_t)in[6]) << 48;
    case 6: b |= ((uint64_t)in[5]) << 40;
    case 5: b |= ((uint64_t)in[4]) << 32;
    case 4: b |= ((uint64_t)in[3]) << 24;
    case 3: b |= ((uint64_t)in[2]) << 16;
    case 2: b |= ((uint64_t)in[1]) << 8;
    case 1: b |= ((uint64_t)in[0]); break;
    case 0: break;
    }

    v3 ^= b;

    SIPROUND;

    v0 ^= b;
    v2 ^= 0xff;

    SIPROUND;
    SIPROUND;

    b = v0 ^ v1 ^ v2 ^ v3;
#ifndef UNALIGNED_LE_CPU
    U64TO8_LE(out, b);

    return hash;
#else
    return b;
#endif
}

Figure 1: The complete original hash function

While reviewing the function, I found that a lot of the optimization techniques I have learned in class and from googling have already been applied here. I think my project will end up being more of an exploration of why SipHash is already optimized, rather than an attempt to optimize it further. This makes sense, too: SipHash was created by very talented computer scientists, and the community considers it good enough that it is already used in many programming languages, like Python and Ruby.

Endianness

One thing to look into when optimizing for architectures like AArch64 is "endianness". Some CPUs store the bytes of a multi-byte value in memory from least significant to most significant; these are called "little-endian". Other CPUs store the bytes from most significant to least significant; these are called "big-endian". There isn't really an advantage to either pattern, but there are penalties incurred when you need to reverse the data to match another platform, for example when reading from a file written by a platform of the opposite endianness, or when receiving data from one over a network.
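A quick way to see endianness for yourself is to inspect the bytes of a known integer in memory. This little standalone snippet is my own illustration, not from the Redis code:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t value = 0x01020304;
    uint8_t *bytes = (uint8_t *)&value;

    /* On a little-endian CPU this prints "04 03 02 01" (least
     * significant byte stored first); on a big-endian CPU it
     * prints "01 02 03 04". */
    for (int i = 0; i < 4; i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}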

So where does this come into play with our algorithm? These lines:

#ifdef UNALIGNED_LE_CPU
#define U8TO64_LE(p) (*((uint64_t*)(p)))
#else
#define U8TO64_LE(p)                                                           \
    (((uint64_t)((p)[0])) | ((uint64_t)((p)[1]) << 8) |                       \
     ((uint64_t)((p)[2]) << 16) | ((uint64_t)((p)[3]) << 24) |                \
     ((uint64_t)((p)[4]) << 32) | ((uint64_t)((p)[5]) << 40) |                \
     ((uint64_t)((p)[6]) << 48) | ((uint64_t)((p)[7]) << 56))
#endif

Figure 2: Original endianness optimization in hash function

The macro "UNALIGNED_LE_CPU" was defined above that code when the compiler detects it is targeting an architecture (x86) that is little-endian and tolerates unaligned accesses. In that case, the function can load all eight bytes in one go with a simple cast, instead of assembling the value with the bit shift operators (<<), which are there to build the 64-bit integer byte by byte in little-endian order regardless of the host's native byte order (see Figure 2). Skipping that byte-by-byte assembly on little-endian machines is where the performance gain comes from.
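To convince myself that the two forms of the macro really are equivalent on a little-endian machine, I can compare them directly. This is a minimal standalone check of my own, not part of Redis (I use memcpy for the "fast" load to sidestep alignment concerns; it compiles to the same single 64-bit load):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint8_t p[8] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};

    /* Fast path: reinterpret the eight bytes as one 64-bit load. */
    uint64_t fast;
    memcpy(&fast, p, sizeof fast);

    /* Portable path: assemble the value byte by byte, least
     * significant byte first, like the fallback macro. */
    uint64_t portable =
        ((uint64_t)p[0])       | ((uint64_t)p[1] << 8)  |
        ((uint64_t)p[2] << 16) | ((uint64_t)p[3] << 24) |
        ((uint64_t)p[4] << 32) | ((uint64_t)p[5] << 40) |
        ((uint64_t)p[6] << 48) | ((uint64_t)p[7] << 56);

    /* On a little-endian CPU both lines print 0807060504030201. */
    printf("fast:     %016llx\n", (unsigned long long)fast);
    printf("portable: %016llx\n", (unsigned long long)portable);
    return 0;
}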

In order to demonstrate this optimization with concrete test results, I compiled another version of Redis where I removed the lines of code that allow the "U8TO64_LE" macro to be defined in the optimized way, forcing it to use the byte-by-byte version no matter the endianness of the CPU (See Figure 2). I expected this to slow the function down, demonstrating that the SipHash programmers got a real win from this check.
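The de-optimized build amounted to removing the architecture check so that "UNALIGNED_LE_CPU" is never defined. Roughly, the edit looked like this (a sketch of my change, not the upstream source):

/* De-optimized build (sketch): with the architecture check removed,
 * UNALIGNED_LE_CPU is never defined, so U8TO64_LE always falls back
 * to the byte-by-byte version, even on little-endian x86. */
/*
#if defined(__X86_64__) || defined(__x86_64__) || defined (__i386__)
#define UNALIGNED_LE_CPU
#endif
*/

These are the results, where I ran the tests three times each and then took the average, first on the default build and then on the de-optimized build: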

Default:

-- siphash count = 1000000, exec time = 0.813 seconds --
-- siphash count = 1000000, exec time = 0.835 seconds --
-- siphash count = 1000000, exec time = 0.816 seconds --

Average: 0.821 seconds

De-optimized:

-- siphash count = 1000000, exec time = 0.828 seconds --
-- siphash count = 1000000, exec time = 0.818 seconds --
-- siphash count = 1000000, exec time = 0.823 seconds --

Average: 0.823 seconds

Figure 3: Benchmark difference from endianness optimization

It's a very minor difference (about two milliseconds on average, which is close to the run-to-run variation), but it is there. This optimization shaved a little time off the hash function.

I actually noticed when exploring various source code repositories that the reference implementation of SipHash does not have this optimization. It was present in the Redis repo, meaning that when the developers behind Redis decided to use SipHash for their software (which SipHash's license allows), they added that optimization (See Figure 2). They must have had the foresight to realize various college students in Canada would be tasked with optimizing their software and decided to ruin their day.

Vectorization

Vectorization is an interesting optimization technique. Before looking into how it can be applied to the hash function, here's a quick review of what it is. It basically comes down to a couple of rules, though I'm sure someone with more knowledge could list more:

  • If your operations are supported by vector registers (wide registers that hold several data elements at once and apply a single operation to all of them), then you can vectorize the statement.
  • However, if the data used in the operations is a loop-carried dependency, the statement cannot be vectorized.

Here's an example to demonstrate these rules. The code below is vectorizable, because its operations (multiplication) are supported by vector registers, and there are no loop-carried dependencies:

for (int i = 0; i < arrayOfDataSz; i++) {
    results[i] = arrayOfData[i] * 0.5;
}

Figure 4: Example of code where vectorization can occur

This code might be used to scale a set of data by 50%, for example. There are no loop-carried dependencies because, during a given iteration of the for loop, the previous iteration's value of "results[i]" is irrelevant (See Figure 4). The CPU doesn't need to complete one iteration at a time just so that iteration's results are available to the next. This is also known as an "embarrassingly parallel" problem, and it's one of the first places to look when optimizing code if parallel hardware is available: vector registers on a CPU core, multiple cores on a CPU, multiple CPUs in a data centre, and so on.
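On AArch64, the compiler will usually vectorize a loop like Figure 4 automatically at -O2 or -O3, but the same idea written by hand with NEON intrinsics looks like this. This is a sketch of my own, assuming float data; the function name and parameters are made up for illustration:

#include <arm_neon.h>
#include <stddef.h>

/* Scale an array by 0.5, four floats per iteration using one
 * 128-bit NEON register, with a scalar loop for the leftovers. */
void scale_half(const float *arrayOfData, float *results, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(&arrayOfData[i]);   /* load 4 floats */
        vst1q_f32(&results[i], vmulq_n_f32(v, 0.5f)); /* scale, store */
    }
    for (; i < n; i++)                                /* scalar tail */
        results[i] = arrayOfData[i] * 0.5f;
}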

Here is an example of code that is not vectorizable because even though the multiplication operation is supported by vector registers, there is a loop-carried dependency:

for (int i = 0; i < arrayOfDataSz; i++) {
    results[i] = arrayOfData[i] * 0.5;
    maxResult = results[i] > maxResult ? results[i] : maxResult;
}

Figure 5: Example of code where vectorization cannot occur

In this example, the results are calculated as they were before, but each iteration also updates a running maximum that depends on the value carried out of every iteration before it. Because the next iteration can't proceed until the current one finishes, the loop as written can't be vectorized naively and must run serially (See Figure 5). (To be fair, compilers can sometimes rewrite simple reductions like a running maximum using special reduction support, but a general dependency on the previous iteration's result does block vectorization.)
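As an aside, you don't have to guess at any of this: GCC can report which loops it vectorized and which it couldn't. As far as I know, the -fopt-info-vec family of flags has been available since GCC 4.9, so a command along these lines should work (example.c here stands in for whatever file you're checking):

gcc -O3 -fopt-info-vec-all -c example.c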

The hash function might have code where vectorization can be used, so I examined the loop portion of the function to look for opportunities. Here's the entire loop (see Figure 1 for the original code), with the preprocessor macros expanded in-line to show the actual operations occurring:

for (; in != end; in += 8) {
    m = (*((uint64_t*)(in)));
    v3 ^= m;

    do {
        v0 += v1;
        v1 = (uint64_t)(((v1) << (13)) | ((v1) >> (64 - (13))));
        v1 ^= v0;
        v0 = (uint64_t)(((v0) << (32)) | ((v0) >> (64 - (32))));
        v2 += v3;
        v3 = (uint64_t)(((v3) << (16)) | ((v3) >> (64 - (16))));
        v3 ^= v2;
        v0 += v3;
        v3 = (uint64_t)(((v3) << (21)) | ((v3) >> (64 - (21))));
        v3 ^= v0;
        v2 += v1;
        v1 = (uint64_t)(((v1) << (17)) | ((v1) >> (64 - (17))));
        v1 ^= v2;
        v2 = (uint64_t)(((v2) << (32)) | ((v2) >> (64 - (32))));
    } while (0);

    v0 ^= m;
}

Figure 6: For loop portion of hash function where work is done

This code doesn't contain any operations that would be worth vectorizing. Addition and XOR aren't expensive the way multiplication or division are. More importantly, even if the operations were expensive, this loop couldn't be vectorized, because it is full of exactly the loop-carried dependencies described above: inside the SIPROUND body, each statement consumes the result of the statement before it, and each pass of the outer loop needs the v0 through v3 values produced by the previous pass. On top of that, the data being worked with (the variable "m") is uint64_t (See Figure 6), so each value already fills a 64-bit general-purpose register, and although AArch64 does provide 128-bit NEON vector registers, there is no second, independent 64-bit value to place in the other lane.

If the algorithm operated on independent pieces of data, this would be another story. In theory, two independent 64-bit values could be processed twice as fast in a 128-bit vector register, since one instruction would apply the operation to both lanes at once. The same goes for data smaller than 64 bits: UTF-16 code units, for example, are 16 bits each, so four of them fit in a single 64-bit register and could be processed at a time.

I think the designers of SipHash knew it would be simplest to read the data in 64-bit chunks (see lines 1-2 of Figure 6), knowing that this is already a natural fit for 64-bit CPUs. I have noticed various benchmarks online finding that 32-bit processors have horrible performance with SipHash, which suggests the designers did have 64-bit CPUs in mind.

This is how I can conclude that SipHash is already optimized in terms of vectorization opportunities.

Rotation Operators

Another area for optimization is to write C source code in a way that the compiler can identify as a "rotate". A rotate is like a bit shift, except that any bits pushed off one end of the value wrap around and reappear at the other end instead of being lost. Consider a plain shift first: performing a "<< 2" operation on the value 0011 (decimal 3) gives 1100 (decimal 12), and nothing is lost. But shift 0110 left by 2 bits and you lose the left-most set bit: you end up with 1000, and shifting that right by 2 bits gives 0010. You just lost that data! Rotating 0110 left by 2 bits instead gives 1001, preserving the left-most bit, and rotating right by 2 bits gets you back to 0110. In the field of cryptography, keeping those bits is important. Again, I'm not an expert, but I'll trust the smart folks who make these algorithms and choose to use these operations.
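Here's what that idiom looks like in C. The rotl64 helper below is my own sketch of the same pattern the ROTL macro uses (note that it assumes 0 < b < 64; shifting by the full width would be undefined behaviour), along with the 4-bit example above worked out in code:

#include <stdint.h>
#include <stdio.h>

/* Shift-and-OR rotate idiom: GCC and Clang recognize this pattern
 * and emit a single rotate instruction. Assumes 0 < b < 64. */
static inline uint64_t rotl64(uint64_t x, unsigned b) {
    return (x << b) | (x >> (64 - b));
}

int main(void) {
    uint8_t v = 0x6; /* binary 0110 from the example above */

    /* 4-bit rotate left by 2, masked to 4 bits: 0110 -> 1001 */
    uint8_t r = (uint8_t)(((v << 2) | (v >> 2)) & 0xF);

    printf("%x\n", r);                         /* prints 9 (1001) */
    printf("%016llx\n",
           (unsigned long long)rotl64(1, 13)); /* 0000000000002000 */
    return 0;
}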

So what about SipHash, then? Is it coded in a way that produces these rotate instructions? According to this blog post, I have to look in the generated assembly for "ror"; that would indicate that the rotations are occurring. So I compiled the code on the AArch64 machine with -O2 optimization in GCC, and used the following command to disassemble the binary and search for "ror":

objdump -f -s -d --source a.out | grep ror

This is the output I get:

  40070c:  cac2cc02    eor x2, x0, x2, ror #51
  400714:   93c08000    ror x0, x0, #32
  400720:   cac4c064    eor x4, x3, x4, ror #48
  40072c:   cac2bc62    eor x2, x3, x2, ror #47
  400730:   cac4ac04    eor x4, x0, x4, ror #43
  400734:   93c38063    ror x3, x3, #32
  4007b4:   cac2cc01    eor x1, x0, x2, ror #51
  4007bc:   cac4c043    eor x3, x2, x4, ror #48
  4007c0:   93c08000    ror x0, x0, #32
  4007d0:   cac1bc41    eor x1, x2, x1, ror #47
  4007d4:   cac3ac03    eor x3, x0, x3, ror #43
  4007d8:   93c28042    ror x2, x2, #32
  4007e4:   cac1cc01    eor x1, x0, x1, ror #51
  4007ec:   cac3c043    eor x3, x2, x3, ror #48
  4007f0:   93c08000    ror x0, x0, #32
  4007fc:   cac1bc41    eor x1, x2, x1, ror #47
  400800:   cac3ac03    eor x3, x0, x3, ror #43
  400808:   93c28042    ror x2, x2, #32
  40080c:   cac1cc01    eor x1, x0, x1, ror #51
  400818:   cac3c042    eor x2, x2, x3, ror #48
  40081c:   cac2ac02    eor x2, x0, x2, ror #43
  400824:   93c08000    ror x0, x0, #32
  400828:   cac1bc00    eor x0, x0, x1, ror #47

Figure 7: Excerpt of assembly produced by the GCC compiler with the -O2 flag, filtered to rotation operations

These developers really didn't leave much for me to do, did they? I don't even know for sure if they had the AArch64 architecture in mind as they created the reference implementation (which they published years ago, before the recent interest in AArch64 in the data centre), but the way the code was written, rotate instructions are being emitted for AArch64 too (See Figure 7), removing that as an area where the function could be optimized.

Seeing the ror instruction in the assembly is evidence that this part of the hash function is as optimized as it can be: a rotate is a constant-time operation that executes in a single CPU clock cycle. The compiler recognizes the shift-and-OR idiom in the C code and condenses it into that one instruction. In Figure 7 it even fuses the rotate into the eor instructions as a shifted operand, and a left rotate by 13 appears as ror #51, since rotating a 64-bit value right by 51 is the same as rotating it left by 13. There is no further optimization to be had here.

Conclusion

In conclusion, I was tasked with choosing a hash function in a prominent open source project and:

  • Benchmarking it on the AArch64 architecture to determine its original performance
  • Using a variety of techniques to optimize its performance on this architecture
  • Demonstrating, with additional benchmarks, either that my changes had a measurable impact on its performance, or that I was not able to optimize it further because it was already in an optimal state

These are the steps I took to complete this project:

  • Chose Redis as my open source project, since, as a data structure-oriented database, I expected it to have a hash function
  • Explored the Redis source code and familiarized myself with how its hash function worked
  • Identified a way to benchmark the hash function, using my Ruby client scripts that inserted records into Redis, along with techniques from the <time.h> header to measure the time spent in the work portion of the hash function (a sketch of this instrumentation appears after this list)
  • Explored various optimization techniques, including endianness, vectorization, and ensuring efficient assembly operations (rotations) were being used, finding that Redis's hash function, and for the most part the reference implementation of SipHash itself, were already in an optimized state
  • Benchmarked "de-optimized" versions of the Redis hash function where applicable to demonstrate the performance difference
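For anyone curious about the instrumentation mentioned above, it boiled down to accumulating elapsed time around the hash calls with <time.h> and printing a line once enough calls had happened. This is a simplified reconstruction of the idea, not my exact code; the wrapper function and counters are made up for illustration:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Accumulate the time spent in siphash() across calls and report
 * it once the call count reaches one million. */
static uint64_t siphash_count = 0;
static double siphash_seconds = 0.0;

uint64_t siphash(const uint8_t *in, const size_t inlen, const uint8_t *k);

uint64_t siphash_timed(const uint8_t *in, size_t inlen, const uint8_t *k) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    uint64_t h = siphash(in, inlen, k);
    clock_gettime(CLOCK_MONOTONIC, &end);

    siphash_seconds += (double)(end.tv_sec - start.tv_sec)
                     + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    if (++siphash_count == 1000000)
        printf("-- siphash count = %llu, exec time = %.3f seconds --\n",
               (unsigned long long)siphash_count, siphash_seconds);
    return h;
}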

Results & Reflection

This project didn't go the way I thought it would. I was expecting to find something I could do to optimize the hash function further. It seems I underestimated the talent of the open source community behind it. That, or I simply chose a very prominent hash function. In retrospect, knowing that it was put forth by the guy who sued the US government over cryptography's status as a munition, and that it's used in multiple popular interpreted programming languages that form a big part of the web today, should have been clues that I would encounter this.

I'm still happy with having had a chance to study this subject. I never imagined I would get a chance to work with assembly and understand its role in cryptography. I learned about timing attacks, and why it's important for the operations in the finished assembly to run in constant time, so that nobody can learn anything about the secret key by measuring how long the function takes on different inputs. In this case, SipHash's simple, constant-time instructions satisfy this requirement and help make it safe for cryptographic use. I know now that it's important to inspect the assembly code generated by the compiler to know what's happening to the C code I write.

I also know more about the capabilities of ARM CPUs now, and I may be able to help in areas besides cryptography when it comes to optimizing programs for ARM architectures. Outside of hash functions, there will still be situations where data is being processed, perhaps in parallel, and knowing how to apply vectorization and similar techniques will help with optimizing that code.

I didn't end up doing much benchmarking because I wasn't able to optimize the hash function. However, for those interested in how fast this industry-standard hash function is, here are the results I got from running the Ruby scripts I mentioned in the previous posts. I inserted enough records of random data (from /dev/urandom) to trigger the hash function 1,000,000 times, and took the average of three of these tests:

-- siphash count = 1000000, exec time = 0.812691 seconds --
-- siphash count = 1000000, exec time = 0.835221 seconds --
-- siphash count = 1000000, exec time = 0.816377 seconds --

Average: 0.821429667 seconds

This was originally posted on the blog I used for my SPO600 class while studying at Seneca College.