Mar 20, 2017

Signal processing audio chirp on OSX

In my previous blog post, I studied pulse compression one level deeper by producing and processing an actual audio chirp.  In this blog post, I get one step closer to implementing pulse compression on the phone by receiving an audio pulse with the CoreAudio framework and signal processing it with the Apple Accelerate framework's vDSP API.  I know that Unity offers the ability to record from the system mic, but I found the built-in Microphone class unsuitable for my need (deciding when a chirp was received) because its recording duration has 1 second resolution and is finite, whereas I want the option to perform continuous (never stopping) correlation.  So I bought this book on CoreAudio and started by modifying the example from Chapter 8.  You can download all the code below from my GitHub repo: https://github.com/henrychoi/OpenCVUnity, in the branch output_pulse.

CoreAudio code to sample the first channel of the OS X mic

As in the book example, I used the first mic found on the system.

    // generate description that will match audio HAL
    AudioComponentDescription inputcd = {0};
    inputcd.componentType = kAudioUnitType_Output;
    inputcd.componentSubType = kAudioUnitSubType_HALOutput;
    inputcd.componentManufacturer = kAudioUnitManufacturer_Apple;
    

    AudioComponent comp = AudioComponentFindNext(NULL, &inputcd);
    CheckError(AudioComponentInstanceNew(comp, &player->inputUnit),
               "Couldn't open component for inputUnit");

But then I deviated from the book's example because my use case is to correlate the audio signal against the matched filter.  Because the filter is designed for a chirp signal of a certain frequency and length, I want to match the sampling frequency and the number of frames in a block.

    propertySize = sizeof (AudioStreamBasicDescription);
    CheckError(AudioUnitGetProperty(player->inputUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Output, //to other units
                                    inputBus,
                                    &player->streamFormat,
                                    &propertySize),
               "Couldn't get ASBD from output unit");
    //Filter designed for this rate, so fix at expected frequency
    player->streamFormat.mSampleRate = 44100;
    CheckError(AudioUnitSetProperty(player->inputUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Output,
                                    inputBus,
                                    &player->streamFormat,
                                    propertySize),
               "Couldn't set output unit Fs");

    AudioStreamBasicDescription deviceFormat;
    CheckError(AudioUnitGetProperty(player->inputUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Input, //from HW to IO unit
                                    inputBus,
                                    &deviceFormat,
                                    &propertySize),
               "Couldn't get ASBD from input unit");
    
    deviceFormat.mSampleRate = 44100;
    CheckError(AudioUnitSetProperty(player->inputUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Input,
                                    inputBus,
                                    &deviceFormat,
                                    propertySize),
               "Couldn't set Fs input unit");


    UInt32 bufferSizeFrames = 1<<LOG2M;
    propertySize = sizeof(UInt32);
    CheckError(AudioUnitSetProperty(player->inputUnit,
                                    kAudioDevicePropertyBufferFrameSize,
                                    kAudioUnitScope_Global,
                                    0,
                                    &bufferSizeFrames,
                                    propertySize),
               "Couldn't set buffer frame size to input unit");

The Chapter 8 example uses an AUGraph to connect the mic HAL unit to the default output unit (speaker), but I just want the raw signal from the mic, so the program is significantly simplified: I just have to start the AUHAL unit explicitly:


    CheckError(AudioUnitInitialize(player->inputUnit),
               "Couldn't initialize input unit");
    // Input unit is now ready for use
    
    CheckError(AudioOutputUnitStart(player->inputUnit)
               , "AudioOutputUnitStart");

Later, when I want to stop receiving the mic input, I call AudioOutputUnitStop instead.  But in the continuous chirp detection scenario, I won't stop the input.  The CoreAudio interface for sucking out the sound samples is kAudioOutputUnitProperty_SetInputCallback.

    AURenderCallbackStruct callbackStruct;
    callbackStruct.inputProc = InputRenderProc;
    callbackStruct.inputProcRefCon = player;
    
    CheckError(AudioUnitSetProperty(player->inputUnit,
                                    kAudioOutputUnitProperty_SetInputCallback,
                                    kAudioUnitScope_Global,
                                    0,
                                    &callbackStruct, 
                                    sizeof(callbackStruct)),
               "Couldn't set input callback");

The callback routine will be called in (semi) real time and is therefore time critical; like an interrupt handler, it must return as quickly as possible.  This means I need to keep pushing the (zero padded) received signal from the audio reception callback to the signal processing algorithm running in another thread.
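For reference, InputRenderProc has to match CoreAudio's AURenderCallback prototype; below is just a skeleton of its shape (the real body, which grabs a pool node and calls AudioUnitRender, is shown further down):

static OSStatus InputRenderProc(void *inRefCon,
                                AudioUnitRenderActionFlags *ioActionFlags,
                                const AudioTimeStamp *inTimeStamp,
                                UInt32 inBusNumber,
                                UInt32 inNumberFrames,
                                AudioBufferList *ioData) //NULL for an input callback
{
    MyMic *player = (MyMic*)inRefCon; //the refCon registered in callbackStruct
    //grab a zero padded pool node, AudioUnitRender into its latter half,
    //stamp it, and push it onto the lockless queue (all shown below)
    return noErr;
}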

Memory pool and lockless queue

In the CoreAudio book, data is shuttled from the callback downstream through the CARingBuffer, which is specialized for the CA input buffer (what CoreAudio hands back when I query the HW for a new sample block).  Instead of the CARingBuffer the book uses, I wrote my own memory pool and queue for 2 reasons:

  1. I need to send metadata (like the host timestamp for the input block) along with the raw samples to the algorithm.
  2. I am only interested in 1 channel (the first one) of the potentially stereo mic, so copying both channels through the CARingBuffer would be a waste of memory.

The interface to the (aligned) memory pool was copied from another blog post I wrote more than 5 years ago:

typedef struct llsMP {
  size_t _capacity, _available;
  void* _pool;/* the memory itself */
  void* *_book;/* Array for book keeping, but don't know the size till ctor */
} llsMP;

unsigned char llsMP_alloc(struct llsMP* me,
    size_t capacity, size_t memsize, size_t alignment
                            , unsigned char zero);
void llsMP_free(struct llsMP* me);
struct llsMP* llsMP_new(size_t capacity, size_t memsize, size_t alignment
                          , unsigned char zero);
void llsMP_delete(struct llsMP* me);
void* llsMP_get(struct llsMP* me);
unsigned char llsMP_return(struct llsMP* me, void* node);
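To give an idea of what is behind this interface, here is a simplified sketch of how get() and return() can work, treating the _book array as a stack of free node pointers and ignoring cross-thread concerns for clarity (the version in the repo is the authoritative one):

void* llsMP_get(struct llsMP* me) {
  if (!me->_available) return NULL;   /* pool exhausted */
  return me->_book[--me->_available]; /* pop a free node off the book */
}

unsigned char llsMP_return(struct llsMP* me, void* node) {
  if (me->_available >= me->_capacity) return 0; /* more returns than gets */
  me->_book[me->_available++] = node; /* push the node back onto the book */
  return 1;
}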

In this primitive, cross-platform API, I don't use more convenient types like uint8_t or bool because I did not want to require stdint.h.  I prefer the non-heap-allocating alloc() version when possible, so that the pool lives inside the containing object, like this:

typedef struct MyMic {
    AudioStreamBasicDescription streamFormat;
    AudioUnit inputUnit;
    AudioBufferList *inputBuffer;
    FILE* t_csv, *x_csv, *f_csv, *c_csv;
    struct llsMP mpool;
    struct llsQ padded_xQ;
    int8_t Eblock, iBlock;
    bool firstBlock;
} MyMic;

The memory pool still needs to be initialized before the callback is called, which is why I offer the alloc() API.

    if (!llsMP_alloc(&player->mpool
                     , 3 // capacity
                     , sizeof(struct SampleBlock) // memsize
                     , 16 // alignment: recommended by vDSP programming guide
                     , 1)//memset zero, since this pool is used for zero padding
        ) {
        fprintf(stderr, "Can't allocate zero padded memory pool\n");
        exit(errno);
    }

The 16 byte alignment is recommended by vDSP--probably because it uses SSE instructions to speed up the math (and those instructions require 16 byte aligned data).  I enable zeroing out the memory because this memory pool is for the zero padded raw samples.  Recall from my previous blog post that performing the correlation through the FFT (which is circular) requires padding with zeros of the same length as the input block.  If I zero out the memory pool initially and only write to the latter half of each memory pool node I obtain, the zero padding is maintained forever, without ever having to explicitly zero out part of the buffer again.  Copying the received sound samples from the HW into the latter half of this buffer (the first half always remains zero) is achieved through careful pointer arrangement right before handing the buffer to the AudioUnit to fill, in the receive callback:

struct SampleBlock {
    UInt64 mHostTime;//The timestamp for these frames
    //UInt64 mWordClockTime //don't bother; always zero

    //I can calculate the time of the 1st sample with this just as well
    uint32_t iBlock;
    UInt32 nFrames;
    float sample[1<<LOG2_2M];//2x sample block length, to zero pad
};

    struct SampleBlock* block = (struct SampleBlock*)llsMP_get(&player->mpool);
    if (!block) {
        fprintf(stderr, "Memory pool exhausted\n");
        return errno;
    }
    player->inputBuffer->mBuffers[0].mData = (void*)(&block->sample[1<<LOG2M]);

    CheckError(AudioUnitRender(player->inputUnit, ioActionFlags,
                               inTimeStamp, inBusNumber, inNumberFrames,
                               player->inputBuffer),
               "AudioUnitRender failed");

    block->iBlock = player->iBlock;
    block->mHostTime = inTimeStamp->mHostTime;

    block->nFrames = inNumberFrames;

    if (!llsQ_push(&player->padded_xQ, block)) {
        fprintf(stderr, "llsQ_push failed\n");
        return errno;
    }

As you can see, the pointer to the memory pool node can then be pushed into a lockless queue, which works well in this case because there is only 1 writer thread and 1 reader thread.  Note also that the reception time on the host (mHostTime) and the block number (starting at 0) are attached to the sample block.  If the correlation processor decides there is a positive event, it can then place the event to within one audio sampling period (1/44100 s here, or less than 23 usec).
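For example, a hypothetical helper (not in the repo) that places sample n of a block on the host clock could look like this; mHostTime is in mach_absolute_time units, and mach_timebase_info gives the nanoseconds per host tick (1:1 on Intel Macs, i.e. nanoseconds):

#include <mach/mach_time.h>

static UInt64 SampleHostTime(UInt64 blockHostTime, UInt32 n, Float64 Fs) {
    static mach_timebase_info_data_t tb;
    if (tb.denom == 0) mach_timebase_info(&tb);
    Float64 ns = 1e9 * n / Fs; //offset of sample n from the block start, in ns
    return blockHostTime + (UInt64)(ns * tb.denom / tb.numer); //ns -> host ticks
}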

  /* SINGLE writer, SINGLE reader queue */
  typedef struct llsQ {
    size_t _head, _tail, _mask;
    void* *_q;/* array of pointers */
  } llsQ;

  unsigned char llsQ_alloc(struct llsQ* me, unsigned char exponent);
  void llsQ_free(struct llsQ* me);

  struct llsQ* llsQ_new(unsigned char exponent);
  void llsQ_delete(struct llsQ* me);

  unsigned char llsQ_push(struct llsQ* me, void* node);
  unsigned char llsQ_pop(struct llsQ* me, void** node);
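For reference, here is a minimal sketch of how push and pop of such a single-writer, single-reader queue can be implemented over a power-of-2 sized ring (the version in the repo is the authoritative one):

  unsigned char llsQ_push(struct llsQ* me, void* node) {
    size_t head = me->_head;
    if (((head + 1) & me->_mask) == me->_tail) return 0; /* queue full */
    me->_q[head] = node;
    __sync_synchronize(); /* publish the node before advancing head */
    me->_head = (head + 1) & me->_mask;
    return 1;
  }

  unsigned char llsQ_pop(struct llsQ* me, void** node) {
    size_t tail = me->_tail;
    if (tail == me->_head) return 0; /* queue empty */
    *node = me->_q[tail];
    __sync_synchronize(); /* read the node before releasing the slot */
    me->_tail = (tail + 1) & me->_mask;
    return 1;
  }

Because only the writer advances _head and only the reader advances _tail, no lock is needed as long as the index update is ordered after the data access.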

The main algorithm loop sleeps until the lockless queue becomes non-empty.

    for (int8_t iBlock = 0; iBlock < player->Eblock; ) {
        struct SampleBlock* block;
        if (!llsQ_pop(&player->padded_xQ, (void**)&block)) {
            usleep(1000);
            continue;
        }

        // Have a new block to process

The correlation processor now has work to do.

Fast correlation processor

I discussed the central algorithm for the fast correlation using FFT in my previous blog entry.  Repeating for convenience:

correlation(x, filter) = iFFT(FFT(x) • FFT*(filter))

where • is the element-wise product of 2 vectors of equal length, and * is the complex conjugate operator.  Because the filter coefficients are constant, I can pre-calculate FFT*(filter) and store it in the DATA segment of the program, as I've done here:

#define LOG2M 10
#define EXP_LOG2M (1<<LOG2M)
#define LOG2_HALFM (LOG2M - 1)
#define LOG2_2M (LOG2M + 1)
// The matched filter from Matlab
float FFT_pulse_conj_vDSPz_flattened[1<<LOG2_2M] __attribute__((aligned(16)))
= {
#include "FFT_pulse_conj_vDSPz_flattened.h"
};

Given the wealth of DSP libraries and algorithms available for free these days, I expected to find a DSP library optimized for OS X/iOS, and I did: Apple's Accelerate framework, which seems to take advantage of Intel's SSE.  But, typical of Apple, the documentation is lacking (to put it mildly).  One must begin with vDSP's packing of the FFT of a real sequence, given here.  A few crucial concepts are:
  1. The result of an FFT is in general complex, and vDSP keeps the real and imaginary parts in 2 DIFFERENT arrays (many other implementations, including the venerable Numerical Recipes, keep them interleaved).
  2. The FFT of a real sequence x is conjugate symmetric about frequency = 0; i.e. X(f) = X*(-f), where X = FFT(x).
  3. X(f=0) and X(f=Fs/2)--the FFT at DC and at the Nyquist frequency--are purely real.
Using all of the above properties, vDSP uses N numbers to represent the FFT of a real sequence of length N (which would otherwise require 2N numbers).  But because it keeps the real and imaginary parts separately, an in-place FFT requires a prior step of splitting the real sequence into even- and odd-indexed samples, as I've done below with vDSP_ctoz.
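To make the packing concrete, here is a small self-contained sketch (a hypothetical N = 8 input, not part of the chirp program) that prints where vDSP puts each frequency bin:

#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    const vDSP_Length LOG2N = 3, N = 1 << LOG2N;
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}; //arbitrary real input
    float re[4], im[4];
    DSPSplitComplex split = { .realp = re, .imagp = im };

    FFTSetup setup = vDSP_create_fftsetup(LOG2N, kFFTRadix2);
    //deinterleave x: even-indexed samples -> realp, odd-indexed -> imagp
    vDSP_ctoz((DSPComplex*)x, 2, &split, 1, N/2);
    //in-place real FFT; results come out scaled by 2 relative to the textbook DFT
    vDSP_fft_zrip(setup, &split, 1, LOG2N, FFT_FORWARD);

    //realp[0] holds X(0) (DC) and imagp[0] holds X(N/2) (Nyquist), both purely real;
    //bins 1..N/2-1 are stored as (realp[k], imagp[k])
    printf("DC = %g, Nyquist = %g\n", re[0], im[0]);
    for (vDSP_Length k = 1; k < N/2; k++)
        printf("X(%lu) = %g %+gi\n", (unsigned long)k, re[k], im[k]);

    vDSP_destroy_fftsetup(setup);
    return 0;
}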

    float creal[1<<LOG2M] __attribute__((aligned(16))) //input/output buffer
        , cimag[1<<LOG2M] __attribute__((aligned(16)))
        , ftemp[1<<LOG2_2M] __attribute__((aligned(16))) //temporary scratch pad
    ;
    COMPLEX_SPLIT splitc = { .realp = creal, .imagp = cimag }
        , split_temp = { .realp = ftemp, .imagp = &ftemp[1<<LOG2M] }
        , split_filter = { .realp = FFT_pulse_conj_vDSPz_flattened
            , .imagp = FFT_pulse_conj_vDSPz_flattened + (1<<LOG2M) }
    ;
    float FFT_filter_nyq = *split_filter.imagp;
    *split_filter.imagp = 0; //see explanation below
...

// When I get a block to process, split out a time series into odd and even
        vDSP_ctoz((COMPLEX*)block->sample, 2, &splitc, 1, 1<<LOG2M);

Note the arrangement of the filter's FFT to be consistent with vDSP's representation of the FFT of a real sequence.  My Matlab code (explained in my previous blog post) anticipates that the C code will do this, and writes out the header file accordingly.  I can then FFT the zero padded sample block with the vDSP API:

vDSP_fft_zript(fftSetup, &splitc, 1, &split_temp, LOG2_2M, FFT_FORWARD);

The result is still in "splitc"--because this is an in-place FFT.  The temporary buffer is supposed to speed up the FFT (I have not yet measured by how much).  If I multiplied this FFT with the filter's FFT (actually the conjugate thereof) as-is, the result would be wrong, because the Nyquist value masquerading as the imaginary part of FFT(x) at DC distorts the product.  Because vDSP does not take care of this, I have to jump through some hoops to save and then restore it.  At least this is straightforward in the 1D case; the 2D case requires setting up a separate vector (a huge inconvenience).  I think the vDSP folks dropped the ball on this.

float FFT_c_nyq = *splitc.imagp; *splitc.imagp = 0;
vDSP_zvmul(&splitc, 1, &split_filter, 1, &splitc, 1
           , 1<<LOG2M//The multiplication is over vector of length M
           , 1 // don't conjugate (even though we are correlating);
           ); // because the filter is already in conjugate form
// Restore the Nyquist frequency portion, which is pure real
*splitc.imagp = FFT_c_nyq * FFT_filter_nyq;

I can then inverse FFT to obtain the fast correlation.

vDSP_fft_zript(fftSetup, &splitc, 1, &split_temp, LOG2_2M, FFT_INVERSE);
vDSP_ztoc(&splitc, 1, (COMPLEX*)ftemp, 2, 1<<LOG2M);

But as I explained in the previous blog post, the correlation of the zero-padded block is not the final running correlation until the overlapping half of the previous block's result is added in (the overlap-add method; I kept the variable name overlap_save).  That is why overlap_save (initially zero) is maintained from 1 block to the next:

float overlap_save[1<<LOG2M] __attribute__((aligned(16))) //final correlation
    ;
...
        //Correlation to output: overlap_save + ftemp[0:M-1];
        vDSP_vadd(ftemp, 1, overlap_save, 1, ftemp, 1, 1<<LOG2M);
        
        //Correlation to save for next round: overlap_save = ftemp[M:2M-1]
        memcpy(overlap_save, &ftemp[1<<LOG2M], sizeof(overlap_save));

Result: clear difference between silence and chirp

To validate the output of the fast correlator, I ran the program twice: the first time without any signal (just room noise), and the 2nd time while playing the chirp I designed in Matlab on repeat.  The Matlab validation script's orange line shows the difference between Matlab's calculation and the output from my CoreAudio program.  The 1st panel is the raw sample values, the middle panel is the magnitude of the element-wise product of the 2 FFTs, and the bottom panel is the correlator output.
I am not sure what the mic is picking up during the first few milliseconds of the recording, but there is no obvious hit in the correlator output for the silent run; the hit is clearly visible when I play the chirp, as you can see below.

Next steps

Once I can further process the correlator output into a yes/no decision on whether a chirp was received, picking out the sample at which the chirp arrived and mapping that back to a monotonically increasing timestamp on the receiver side should be straightforward.  And thanks to the memory pool and circular queue, this processing pipeline can run ad infinitum.
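As a possible first cut (only a sketch of one approach, not something I have implemented yet, and noise_floor is a made-up threshold that would have to be estimated from the silent runs), vDSP_maxmgvi can pick out the largest-magnitude sample of each block's correlator output, which can then be thresholded and mapped back to a host timestamp:

        //after the vDSP_vadd above, ftemp[0:M-1] holds this block's correlation
        float peak;
        vDSP_Length peak_i;
        vDSP_maxmgvi(ftemp, 1, &peak, &peak_i, 1<<LOG2M);
        if (peak > 10.f * noise_floor) { //hypothetical threshold
            //chirp detected at sample peak_i of this block; convert to host time
            //using block->mHostTime and the 1/44100 s sample period
            printf("chirp at block %u, sample %lu\n",
                   block->iBlock, (unsigned long)peak_i);
        }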