For the sixth lab in my SPO600 class, I'll be exploring algorithms used to increase or decrease the volume of an audio signal.

It helps to start with a solid understanding of what audio is. An audio file has a sample rate, usually measured in kHz, which is the number of times per second the amplitude of the audio wave is sampled. CD audio, for example, has a sample rate of 44.1 kHz, or 44,100 samples per second. Each sample records the amplitude at that instant: a value of 0 means silence, while values near either end of the range mean the signal is very loud. When we use a signed 16-bit integer to store each sample, as I do in this lab, the possible values range from -32,768 to 32,767.

To decrease the volume of the audio, we multiply each sample by a scaling factor; for example, multiplying by 0.75 reduces the volume to 75% of what it originally was. I'll be comparing the speed of three ways of performing this transformation:

  • A for loop that iterates through the array of samples and does the multiplication using floating-point arithmetic. (<sample_value> * 0.75)
  • A for loop that performs the calculation in #1 ahead of time, storing the results in a separate lookup array. A second for loop, which is the timed test, simply looks each value up in the array instead of performing the multiplication again.
  • A for loop that iterates through the array of samples and does the multiplication using fixed-point arithmetic.

Each of these tests will be performed on an array of 500 million sample values. They will be run on both an x86_64 machine and an AArch64 machine to see whether there is a difference in how each architecture handles each algorithm. Each test also has a section at the bottom that uses the calculated values in some way, in this case printing their sum to the screen. This prevents the compiler from being overzealous with optimization and removing the very code we want to measure.
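To make the setup concrete, here is a rough sketch of the skeleton each test shares (names like SAMPLE_COUNT and the timing details are my own shorthand here, not necessarily what the real code uses):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SAMPLE_COUNT 500000000  /* 500M samples */

    int main() {
        /* allocate the array and fill it with pseudo-random 16-bit samples */
        int16_t *samples = malloc(SAMPLE_COUNT * sizeof(int16_t));
        if (samples == NULL)
            return 1;
        for (size_t i = 0; i < SAMPLE_COUNT; i++) {
            samples[i] = (rand() % 65536) - 32768;
        }

        clock_t start = clock();
        /* ... timed test goes here (one of the three approaches below) ... */
        clock_t end = clock();

        /* use the results so the compiler can't optimize the timed loop away */
        long long sum = 0;
        for (size_t i = 0; i < SAMPLE_COUNT; i++) {
            sum += samples[i];
        }
        printf("Sum: %lld\n", sum);
        printf("Time: %.3f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);

        free(samples);
        return 0;
    }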

#1 - Floating-point arithmetic

Here is the code for this test.

This is the part of the code that acts as the timed test:

[code screenshot: exploring_audio_processing_algorithms_1]

This code is pretty simple and doesn't warrant much explanation. I'm replacing the sample value in the samples array with its processed value.
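In essence, the timed section boils down to a loop like this (a sketch slotted into the skeleton above, not necessarily a line-for-line copy of the code in the screenshot):

    /* scale every sample to 75% of its original value using floating-point math */
    for (size_t i = 0; i < SAMPLE_COUNT; i++) {
        samples[i] = (int16_t)(samples[i] * 0.75);
    }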

The results:

x86_64:
2.970 seconds

AArch64:
3.966 seconds

#2 - Pre-calculated array lookup

Here is the code for this test.

This is the part of the code that prepares the lookup array that will be referenced in the actual test:

[code screenshot: exploring_audio_processing_algorithms_2]

This code deserves some explanation.

My lookup array (called "lookup075" in the code) is an array of int16_t values; I chose that type so that it matches the type I'm using for the samples. I use "calloc" to allocate the memory for it. The first argument is "UINT16_MAX", since we need to be able to look up any sample between INT16_MIN (about -32k) and INT16_MAX (about +32k), and this is a useful shorthand for about 64k (strictly speaking, that range contains UINT16_MAX + 1 = 65,536 values, so the allocation needs to cover one more element than UINT16_MAX). We use "sizeof(int16_t)" for the second argument because this is shorthand for "however many bytes an int16_t happens to use on this platform, irrespective of the number of bits each byte has on this platform". This keeps the code as portable as possible.

After creating the array in memory, I have a for loop that starts at the minimum possible sample value (about -32k) and ends at the highest possible value (about +32k). In each iteration, I multiply the value by the volume factor (75% in this case) and store the result in the lookup array. Of course, I can't use a negative number as an array index when I store it, so I cast the for loop counter to an unsigned type so that we index from 0 to ~64k rather than from ~-32k to ~32k. Because my array starts at 0, not ~-32k, I have to do this cast whenever I put data into the array or pull data out of it.
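Put together, the preparation step looks roughly like this (again a sketch; I've allocated UINT16_MAX + 1 entries here so that every possible sample, including -1, has a slot):

    /* one precomputed result for every possible 16-bit sample value */
    int16_t *lookup075 = calloc(UINT16_MAX + 1, sizeof(int16_t));

    for (int32_t v = INT16_MIN; v <= INT16_MAX; v++) {
        /* casting the signed counter to uint16_t maps negative values
           to indexes 32768..65535 instead of negative indexes */
        lookup075[(uint16_t)v] = (int16_t)(v * 0.75);
    }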

This is the part of the code that acts as the timed test:

[code screenshot: exploring_audio_processing_algorithms_3]

Similar to the code that builds the lookup array, I have to cast the sample value (which is signed) to an unsigned type in order to index into the lookup array.
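In sketch form, the timed loop is just one array lookup per sample:

    /* replace each sample with its precomputed 75%-volume value */
    for (size_t i = 0; i < SAMPLE_COUNT; i++) {
        samples[i] = lookup075[(uint16_t)samples[i]];
    }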

The results:

x86_64:
2.568 seconds

AArch64:
4.276 seconds

#3 - Fixed-point arithmetic

Here is the code for this test. This is the part of the code that acts as the timed test:

[code screenshot: exploring_audio_processing_algorithms_4]
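The idea behind fixed-point arithmetic is to represent the volume factor as a plain integer so the loop only needs an integer multiply and a shift. A common way to do this, shown here as a sketch of the general technique rather than a copy of my exact code, is to scale 0.75 by 2^15:

    /* 0.75 expressed as a fixed-point integer: 0.75 * 2^15 = 24576 */
    int32_t volume_factor = 24576;

    for (size_t i = 0; i < SAMPLE_COUNT; i++) {
        /* integer multiply, then shift right by 15 to undo the scaling
           (an arithmetic shift on typical platforms) */
        samples[i] = (int16_t)(((int32_t)samples[i] * volume_factor) >> 15);
    }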

The results:

x86_64:
2.387 seconds

AArch64:
3.840 seconds

Analysis

The fastest of the three approaches on both architectures was fixed-point arithmetic. Using fixed-point arithmetic instead of floating-point arithmetic can be faster in many situations. According to my research, this performance gap is narrowing over time as floating-point units get faster, but according to my tests today, the difference is still there.

On the x86_64 architecture, the lookup array was much faster than the approach where I did the floating-point calculations as I went. This makes a lot of sense: it's essentially caching. I prepare the data I know the program will need ahead of time and store it in memory, so that the program can fetch it quickly when it needs it.

It's also pretty scalable. We have 16-bit samples, so no matter how long the audio is, we only need to store 2^16 = 65,536 values in the lookup array, which at two bytes each is only 128 KiB. This speedup comes at the cost of extra memory usage, but that memory usage is constant.

However, this may be a problem on computers that have very little memory to begin with; they may not have enough room for the table at all. It may also be a problem if many volume levels need to be supported, since I would need a separate lookup table for each volume level.

Strangely, on the AArch64 architecture the lookup array approach was not faster than calculating the floating-point values as I went. I'm not sure what could cause this, but my current theory is that the AArch64 system we're using at school to run these tests may have slower memory compared to the x86_64 system. I know from my experience tinkering with Raspberry Pi computers (which also use ARM architectures) that one weakness they have is slow memory.


This was originally posted on the blog I used for my SPO600 class while studying at Seneca College.