My final project for my SPO600 class is to find a hashing or checksum algorithm in popular open source software and improve it using the optimization techniques I learned in class. We were given MurmurHash, a non-cryptographic hash function, as an example. However, I wasn't able to find many projects that use it, and where I did find it, it already appeared to be fully optimized.

I decided to think about what role hashing plays in computer science. I remembered learning about MongoDB and liking how it generated GUIDs for documents as they were inserted, each guaranteed to be unique. That sounded like hashing to me. On that theme, I thought about other database or database-like software that might do something similar, and a few came to mind: PostgreSQL, MySQL, and Redis. Of those, I guessed that Redis was the least likely to have received enough widespread attention for its hashing code to already be fully optimized. So, I checked out its source code.

What I found was pretty interesting. Hashing definitely plays a role: I found a hash function that appears to determine which bucket a key lands in as entries are inserted into Redis's in-memory hash table (dict) data structure. Luckily, the creator of the software and its maintainers wrote very clear, well-commented code, which helped me quickly see that this might work for my project. The source code file in question is here, and this is the function I'll be studying:

/* Generic hash function (a popular one from Bernstein).
 * I tested a few and this was the best. */
static unsigned int dictGenHashFunction(const unsigned char *buf, int len) {
    unsigned int hash = 5381;

    while (len--)
        hash = ((hash << 5) + hash) + (*buf++); /* hash * 33 + c */
    return hash;
}
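As the comment says, this is Daniel Bernstein's hash, often known as djb2. It starts from the constant 5381 and, for each input byte, multiplies the running hash by 33 and adds the byte, with the multiplication expressed as a shift plus an add ((hash << 5) + hash). From my reading of dict.c, Redis then masks the resulting hash down to a bucket index, since the table size is always a power of two. Here is a simplified sketch of that idea; it's my own illustration, not a verbatim excerpt, and bucketIndex and its parameters are hypothetical names:

/* Map a key to a slot in a power-of-two-sized table.
 * Works because tablesize - 1 is all ones in binary. */
static unsigned int bucketIndex(const unsigned char *key, int keylen,
                                unsigned int tablesize) {
    unsigned int hash = dictGenHashFunction(key, keylen);
    return hash & (tablesize - 1);
}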

I can see that this function matters to the software because it is called in many places across the source code repository: in 'src/server.c' (six times), 'src/dict.c' (once), 'src/latency.c' (once), 'deps/hiredis/async.c' (once), and 'utils/hashtable/rehashing.c' (once).

Another reason I think it's a good candidate for my project is the nature of the code. It doesn't appear to take advantage of any hand-applied optimization techniques, such as loop unrolling or inline assembly, so I will be investigating whether those help. I can also look into which compiler flags the function is built with. If I learn that it is not compiled with "-O3" (maximum optimization at the expense of a larger binary), then trying that flag will be one of the first things I do.
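To make the loop unrolling idea concrete, here is a rough sketch of an unrolled variant I might try. This is my own untested draft, not anything from the Redis source, and the compiler may already do something equivalent at higher optimization levels:

/* Sketch: process four bytes per iteration, then finish the
 * remainder one byte at a time. Produces the same hash values
 * as the original function. */
static unsigned int dictGenHashFunctionUnrolled(const unsigned char *buf, int len) {
    unsigned int hash = 5381;

    while (len >= 4) {
        hash = ((hash << 5) + hash) + buf[0];
        hash = ((hash << 5) + hash) + buf[1];
        hash = ((hash << 5) + hash) + buf[2];
        hash = ((hash << 5) + hash) + buf[3];
        buf += 4;
        len -= 4;
    }
    while (len--)
        hash = ((hash << 5) + hash) + (*buf++);
    return hash;
}

Whether this actually beats what gcc produces at -O3 for the plain loop is exactly the kind of question the benchmarking below should help answer.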

Benchmarking

In order to establish a baseline as I begin to improve the code, it's a good idea to benchmark it before I make any changes. Since the class focuses on the AArch64 architecture, I'll be building it on our school's AArch64 system and running my future changes there too. I also have a Raspberry Pi 2 (a newer revision whose processor also supports AArch64), so I might test there as well for the sake of comparison; trends across the two machines might help me understand my changes as I go.

I created the following program to benchmark this function. Note that the timing code only times the hash generation, not the setup portion where I prepare the random data for the test:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_TESTS 10000000
#define STRING_LENGTH 50

static unsigned int dictGenHashFunction(const unsigned char *buf, int len) {
    unsigned int hash = 5381;

    while (len--)
        hash = ((hash << 5) + hash) + (*buf++); /* hash * 33 + c */
    return hash;
}

int main() {
    srand(time(NULL));

    printf("Generating data for test.\n");

    // Generate NUM_TESTS random strings of length STRING_LENGTH
    char** testStrings = (char**) malloc(NUM_TESTS * sizeof(char*));

    for (int i = 0; i < NUM_TESTS; i++) {
        testStrings[i] = (char*) malloc(STRING_LENGTH * sizeof(char));

        // Fill the new string with random non-zero bytes so no early
        // null terminators sneak in
        int j;
        for (j = 0; j < STRING_LENGTH - 1; j++) {
            testStrings[i][j] = (char) (rand() % 255 + 1);
        }
        // End with null terminator
        testStrings[i][j] = '\0';
    }

    printf("Beginning test.\n");

    clock_t begin = clock();

    // Keep the result in a volatile variable so the compiler can't
    // optimize the hash calls away once I start using optimization flags
    volatile unsigned int hash;
    for (int i = 0; i < NUM_TESTS; i++) {
        hash = dictGenHashFunction((const unsigned char *) testStrings[i], 42);
    }

    clock_t end = clock();

    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;

    printf("Time spent: %.3f\n", time_spent);

    return 0;
}

Running it on the school's AArch64 machine gives the following results:

gcc source.c && ./a.out
Generating data for test.
Beginning test.
Time spent: 2.145

gcc source.c && ./a.out
Generating data for test.
Beginning test.
Time spent: 2.134

gcc source.c && ./a.out
Generating data for test.
Beginning test.
Time spent: 2.150

That averages out to about 2.14 seconds for ten million hashes, or roughly 214 nanoseconds per string. This gives me a solid baseline for Phase 2 of the project: I'll be able to measure the effect of each change I make. I'll be posting more as I begin this work!


This was originally posted on the blog I used for my SPO600 class while studying at Seneca College.