Mar 25, 2017

Understanding sorting networks in Matlab

When I began this hobby project to solve the indoor location problem using only the HW available on a modern smartphone--all just to play an augmented reality X-wing vs. TIE Fighter game with my friend's 5-year-old son--I had no idea how much I would have to figure out.  And it's turning out to be quite a lot: the definition of "knowing" takes on a whole new dimension when you try to make something that works.  But it turns out this is the best way I learn; I should have tried to make something by myself a long time ago!

Now that I understand how to get continuous correlation from the matched filter (as explained in my previous blog entry), I am trying to boost the SNR of the filter and fish out the instantaneous jump in correlation when there is a chirp.  As shown in the last blog entry, that spike can be buried in the ambient sound (usually of much lower frequency than the near-Nyquist frequency of the correlation spike).  One of my favorite techniques is to subtract the median of a rolling time window: the median is superior to the mean for smoothing a signal in a high-noise environment.  The only problem is that the rolling median is hard to compute, unlike the rolling mean.  The most naive implementation sorts the contents of the rolling window at every sample point.  Even with an O(N log N) sorting algorithm, the cost of sorting goes up quickly as the window size grows.  More sophisticated running-median algorithms try to keep the rolling window ordered.  I actually tried to do this with a doubly linked list, with a special pointer to the median, as shown in the sketch below.
Since I use a circular queue in all my embedded projects, I was going to maintain a doubly linked circular buffer one element larger than the window size, so that I could mark the oldest element stale and then decide whether to move the median to the left, to the right, or leave it in place, depending on the values of the incoming and outgoing elements, as hinted at below.
This took me 2 nights of coding and debugging to get right.  Incidentally, problems like this come up often in SW interviews.  Here is the class that does all the heavy lifting:

template <typename T> class SortedQ {
    class dValue {
        friend class SortedQ;
    protected:
        dValue *_prev, *_next;
    public:
        T val;
        bool stale;
        inline void init(T v) { stale = false; val = v; }
        inline dValue* next() {
            return (!_next || !_next->stale) ? _next : _next->_next;
        }
        inline dValue* prev() {
            return (!_prev || !_prev->stale) ? _prev : _prev->_prev;
        }
        static int compare(const void* lp, const void* rp) {
            const dValue* *l = (const dValue**)lp;
            const dValue* *r = (const dValue**)rp;
            return (*l)->val - (*r)->val;
        }
        inline void rlink(dValue& right) { _next = &right; right._prev = this; }
        inline void llink(dValue& left) { _prev = &left; left._next = this; }
        inline void replace(dValue* old) {
            _prev = old->_prev; _next = old->_next;
            _prev->_next = _next->_prev = this;
        }
        void rinsert(dValue* newnode);
        void linsert(dValue* newnode);
        inline void dropout() {
            if (_next) _next->_prev = _prev;
            if (_prev) _prev->_next = _next;
        }
    };
public:
    static constexpr unsigned SIZE = 5;
    dValue arr[SIZE+1];
    SortedQ() {
        memset(arr, 0, sizeof(arr)); // mass-set pointers to 0
    }
    ~SortedQ() = default;
    void init();
    inline T& median() { return _median->val; }
    
    //Tail is where we will insert the NEXT sample to, so
    //head should be the slot AFTER the tail (circular).
    inline dValue* head() { return &arr[(_tail + 1) % (SIZE+1)]; }
    T& push(T& f);
    void print();
private:
    unsigned _tail;//In stable state, _tail points to the NEXT slot
    dValue* _median;

};

As evidenced by my slow coding speed, I think I am not a top-notch SW developer--especially given that my code assumes non-duplicate elements.  If I piqued your interest by tying this to SW interviews, the best-known solution is to keep 2 separate ordered lists (that can report their sizes) on either side of the median, and to keep the 2 sides of equal length at all times.  A simple list has O(N) search and insertion time, so people use advanced data structures like a balanced binary tree, or even a skip list, for O(log N) insertion and search time.  I can guarantee that such an implementation will not be requested in a 45-minute SW interview!
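To make that idea concrete, here is a minimal sketch--not the code I actually use--of the two-ordered-halves approach, with std::multiset standing in for the balanced tree.  The lower half also holds the median, and the halves are rebalanced to within one element after every push:

#include <cstdio>
#include <deque>
#include <iterator>
#include <set>

class RunningMedian {
    std::multiset<float> lo, hi; // lo: smaller half (holds the median), hi: larger half
    std::deque<float> window;    // remembers arrival order so the oldest sample can be evicted
    size_t N;                    // window size
    void rebalance() {
        while (lo.size() > hi.size() + 1) { hi.insert(*lo.rbegin()); lo.erase(std::prev(lo.end())); }
        while (hi.size() > lo.size())     { lo.insert(*hi.begin());  hi.erase(hi.begin()); }
    }
public:
    explicit RunningMedian(size_t n) : N(n) {}
    float push(float x) {
        window.push_back(x);
        if (lo.empty() || x <= *lo.rbegin()) lo.insert(x); else hi.insert(x);
        if (window.size() > N) { // evict the oldest sample from whichever half holds it
            float old = window.front(); window.pop_front();
            auto it = lo.find(old);
            if (it != lo.end()) lo.erase(it); else hi.erase(hi.find(old));
        }
        rebalance();
        return *lo.rbegin();     // the median: largest element of the lower half
    }
};

int main() {
    RunningMedian rm(5);
    const float x[] = {3, 1, 4, 1, 5, 9, 2, 6};
    for (float v : x) printf("%g ", rm.push(v));
    printf("\n");
}

During warm-up (and for even window sizes) this returns the lower median; the point is that every push costs a handful of O(log N) tree operations instead of a full re-sort of the window.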

So even for the seemingly simple task of keeping a sorted window, the SW solution gets complicated; and I hate complexity.   As someone who is more interested in HW than SW, I find a bigger disappointment in the SW solution discussed above: there is no HW optimization!  The CPU spends most of its time managing pointers, and the actual value comparisons are only a minuscule portion of the total running time.  But in THIS problem, where the window size is fixed and known at design time and the data is streaming in nature, there IS one important HW optimization possible: vector-wise sorting using a sorting network.  The streaming nature of the data means that successive sliding windows look like delayed versions of the previous window, as illustrated below:
I can batch-sort a block of inputs going through this sorting window by treating each delay (starting at delay 0) as a separate vector.  For a window size of 5, for example, there are 5 inputs going into my sorting network--which has a known optimal solution (both in the number of comparisons and in delay) discussed in The Art of Computer Programming, Vol. 3 and reproduced on Ron Zeno's website.

While a sorting network is ideally implemented in custom HW, the next best implementation can be had on SIMD-capable chips, like Intel with SSE or ARM with NEON.  And the vectorized nature of Matlab's max/min functions can easily simulate it.  Here's my implementation of the 5-element sorting network:
There are 2 opportunities for parallelization: at the very beginning and right before the last swap.  I deliberately drew them staggered because I am NOT going to parallelize them.  The letters in the middle show the temporary vectors necessary to store the intermediate values--unless you have a HW-assisted compare-and-swap--because each compare-and-swap needs to be done in 2 stages: taking the max of 2 vectors, and then taking the min of the same 2 vectors.  Right away, you can see that the vectorized sorting network will be memory-hungry, and might not be suitable for cheap MCUs.  But an iPhone can easily provide a few extra KB without throwing up.

Here's a Matlab simulation I used to verify this network:

x = rand(1000,1);
d1 = [x(1); x(1:end-1)];
d2 = [d1(1); d1(1:end-1)];
d3 = [d2(1); d2(1:end-1)];
d4 = [d3(1); d3(1:end-1)];
plot(x); hold on;
plot(d4);
hold off;

A = max(x, d1);  B = min(x, d1);
C = max(d2, d3); D = min(d2, d3);
E = max(B, D);   D = min(B, D);
B = max(C, d4);  C = min(C, d4);
F = max(A, B);   B = min(A, B);
A = max(E, C);   C = min(E, C);
E = max(A, B);   B = min(A, B);
D = max(D, C);
B = max(B, D);

truth = medfilt1(x, 5);
med_err = truth(1:end-2) - B(3:end);
plot(x); hold on
plot(truth(1:end-2)); plot(B(3:end)); hold off;

plot(med_err);
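For reference, here is a scalar sketch of the same comparator sequence, one sample at a time--roughly how it might look on an MCU without SIMD.  This is my own illustration rather than code from the project, and it keeps only the median output:

#include <algorithm>
#include <cstdio>

// Same comparator order as the Matlab network above; x is the newest sample,
// d1..d4 are the previous four.  Returns the median of the five.
static float median5(float x, float d1, float d2, float d3, float d4) {
    float A = std::max(x, d1),  B = std::min(x, d1);
    float C = std::max(d2, d3), D = std::min(d2, d3);
    float E = std::max(B, D);   D = std::min(B, D);
    B = std::max(C, d4);        C = std::min(C, d4);
    B = std::min(A, B);         // F = max(A, B) would be the running maximum
    A = std::max(E, C);         C = std::min(E, C);
    B = std::min(A, B);         // the corresponding max is not needed for the median
    D = std::max(D, C);
    return std::max(B, D);
}

int main() {
    printf("%g\n", median5(0.1f, 0.9f, 0.4f, 0.7f, 0.2f)); // prints 0.4
}

On a SIMD chip, each max/min pair above becomes one pair of vector instructions over a whole block of delayed samples, which is exactly what the Matlab max/min calls simulate.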

You can see in my previous blog entry that I used 3 extra arrays of length 1K (that's the length of the matched filter) for correlation processing.  That is already 3 x 1K x 4B = 12 KB of RAM.  For this median processing, I can reuse those 3 arrays on the stack for A, B, and C.  But I still have to come up with D and E (F is only necessary to get the running maximum), so I am looking at 20 KB of extra RAM for the correlation and median processing.  As a reference, the nRF51822--the BLE MCU I like to play around with--only has up to 32 KB of RAM, so this solution is starting to look expensive...

Mar 20, 2017

Audio chirp processing problems on OS X

In a previous blog post, I explored directly processing the sound samples from the built-in mic on OS X using the CoreAudio framework.  In that post, I briefly mentioned the non-ideal response seen on my MacBook Pro.  For convenience, here is the raw sample.
This is supposed to be a flat chirp going from 50 Hz to 22 kHz (except for a short ramp-up and ramp-down window at the beginning and the end).  Since the MacBook Pro's mic is RIGHT under the speaker, I expected some distortion, but this seemed too much.  So I tried turning off the "voice processing" units in the HW.

    if (true) { // Is the voice filter on?  Try to turn it off
        UInt32 enableFlag = 1, disableFlag = 0; // flag values for the AU properties set below
        AudioComponentDescription desc;
        desc.componentType = kAudioUnitType_Output;
        desc.componentSubType = kAudioUnitSubType_VoiceProcessingIO;
        desc.componentManufacturer = kAudioUnitManufacturer_Apple;
        desc.componentFlags = 0;
        desc.componentFlagsMask = 0;
        AudioComponent comp = AudioComponentFindNext(NULL, &desc);
        
        AudioUnit vpioUnit;
        AudioComponentInstanceNew(comp, &vpioUnit);
        CheckError(AudioUnitSetProperty(vpioUnit, kAUVoiceIOProperty_BypassVoiceProcessing
                             , kAudioUnitScope_Global, 1, &enableFlag, sizeof(enableFlag))
                   , "BypassVoiceProcessing failed");
        CheckError(AudioUnitSetProperty(vpioUnit, kAUVoiceIOProperty_VoiceProcessingEnableAGC
                             , kAudioUnitScope_Global, 1, &disableFlag, sizeof(disableFlag))
                   , "VoiceProcessingEnableAGC");
        //Deprecated a long time ago
        //AudioUnitSetProperty(vpioUnit, kAUVoiceIOProperty_DuckNonVoiceAudio
        //                     , kAudioUnitScope_Global, 1, &disableFlag, sizeof(disableFlag));

    }

The result is even stranger:
I then tried different combinations of the above 2 options, and even got an address-sanitizer error inside AudioUnitSetProperty when I tried to enable kAUVoiceIOProperty_VoiceProcessingEnableAGC.

Running the simple LFM (linear frequency modulated) matched filter on a non-ideal received signal is problematic.  The non-constant group delay is the biggest problem.  This can be visualized in 2 different ways: the first is to simply get the best cross-correlation I can, and shift the received signal by that many samples, as shown in the zoomed-in view below:
This "best effort" fit clearly is not well aligned--even when I reduced the bandwidth from 20 kHz down to 8 kHz, to avoid the low and high frequencies (where the distortion seems the greatest).  Another way to visualize is to segment the filter (I have been using 1024 taps) into smaller chunks and calculate the best cross correlation.  If the MacBook mic (and the iPod speaker) were distortion-free, the group delay should be constant, which means that the maximum correlation index should increase linearly (because group delay is the derivative of the phase lag).  But as you can see below, this is not the case.


Apple audio HW is supposed to be high quality, but I guess the built-in speaker/mic are far more limited...

Signal processing an audio chirp on OS X

In my previous blog post, I studied pulse compression one level deeper by producing and processing an actual audio chirp.  In this blog post, I get one step closer to implementing pulse compression on the phone by receiving an audio pulse with the CoreAudio framework and signal processing it with the Apple Accelerate framework's vDSP API.  I know that Unity offers the ability to record from the system mic, but I found the built-in Microphone class unsuitable for my need (deciding when a chirp was received) because the Unity Microphone class's recording duration has 1-second resolution and is finite, whereas I want the option to perform continuous (never-stopping) correlation.  So I bought this book on CoreAudio and started by modifying the example from Chapter 8.  You can download all the code below from my GitHub repo: https://github.com/henrychoi/OpenCVUnity, in the branch output_pulse.

CoreAudio code to sample the first channel of the OS X mic

As in the book example, I used the first mic found on the system.

    // generate description that will match audio HAL
    AudioComponentDescription inputcd = {0};
    inputcd.componentType = kAudioUnitType_Output;
    inputcd.componentSubType = kAudioUnitSubType_HALOutput;
    inputcd.componentManufacturer = kAudioUnitManufacturer_Apple;
    

    AudioComponent comp = AudioComponentFindNext(NULL, &inputcd);
    CheckError(AudioComponentInstanceNew(comp, &player->inputUnit),
               "Couldn't open component for inputUnit");

But then I deviated from the book's example because my use case is to correlate the audio signal against the matched filter.  Because the filter is designed for a chirp signal of a certain frequency and length, I want to match the sampling frequency and the number of frames in a block.

    propertySize = sizeof (AudioStreamBasicDescription);
    CheckError(AudioUnitGetProperty(player->inputUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Output, //to other units
                                    inputBus,
                                    &player->streamFormat,
                                    &propertySize),
               "Couldn't get ASBD from output unit");
    //Filter designed for this rate, so fix at expected frequency
    player->streamFormat.mSampleRate = 44100;
    CheckError(AudioUnitSetProperty(player->inputUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Output,
                                    inputBus,
                                    &player->streamFormat,
                                    propertySize),
               "Couldn't set output unit Fs");

    AudioStreamBasicDescription deviceFormat;
    CheckError(AudioUnitGetProperty(player->inputUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Input, //from HW to IO unit
                                    inputBus,
                                    &deviceFormat,
                                    &propertySize),
               "Couldn't get ASBD from input unit");
    
    deviceFormat.mSampleRate = 44100;
    CheckError(AudioUnitSetProperty(player->inputUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Input,
                                    inputBus,
                                    &deviceFormat,
                                    propertySize),
               "Couldn't set Fs input unit");


    UInt32 bufferSizeFrames = 1<<LOG2M;
    propertySize = sizeof(UInt32);
    CheckError(AudioUnitSetProperty(player->inputUnit,
                                    kAudioDevicePropertyBufferFrameSize,
                                    kAudioUnitScope_Global,
                                    0,
                                    &bufferSizeFrames,
                                    propertySize),
               "Couldn't set buffer frame size to input unit");

The Chapter 8 example uses an AUGraph to connect the mic HAL unit to the default output unit (speaker), but I just want the raw signal from the mic, so the program is significantly simplified: I just have to start the AUHAL unit explicitly:


    CheckError(AudioUnitInitialize(player->inputUnit),
               "Couldn't initialize input unit");
    // Input unit is now ready for use
    
    CheckError(AudioOutputUnitStart(player->inputUnit)
               , "AudioOutputUnitStart");

Later, when I want to stop receiving the mic input, I call AudioOutputUnitStop instead.  But in the continuous chirp detection scenario, I won't stop the input.  The CoreAudio interface for sucking out the sound samples is kAudioOutputUnitProperty_SetInputCallback.

    AURenderCallbackStruct callbackStruct;
    callbackStruct.inputProc = InputRenderProc;
    callbackStruct.inputProcRefCon = player;
    
    CheckError(AudioUnitSetProperty(player->inputUnit,
                                    kAudioOutputUnitProperty_SetInputCallback,
                                    kAudioUnitScope_Global,
                                    0,
                                    &callbackStruct, 
                                    sizeof(callbackStruct)),
               "Couldn't set input callback");

The callback routine is called in (semi) real time and is therefore time-critical; like an interrupt handler, it must return as quickly as possible.  This means that I need to keep pushing the (zero-padded) received signal from the audio-reception callback to the signal processing algorithm running in another thread.

Memory pool and lockless queue

In the CoreAudio book, data is shuttled from the callback downstream through the CARingBuffer, which is specialized for the CA input buffer (this is what CoreAudio returns to me when I query the HW for a new sample block).  Instead of the CARingBuffer the book uses, I wrote my own memory pool and queue for 2 reasons:

  1. I need to send metadata (like the host timestamp for the input block) along with the raw samples to the algorithm.
  2. I am only interested in 1 channel (the first one) of the potentially stereo mic, so copying both channels through the CARingBuffer would be a waste of memory.

The interface to the (aligned) memory pool was copied from another blog I wrote more than 5 years ago:

typedef struct llsMP {
  size_t _capacity, _available;
  void* _pool;/* the memory itself */
  void* *_book;/* Array for book keeping, but don't know the size till ctor */
} llsMP;

unsigned char llsMP_alloc(struct llsMP* me,
    size_t capacity, size_t memsize, size_t alignment
                            , unsigned char zero);
void llsMP_free(struct llsMP* me);
struct llsMP* llsMP_new(size_t capacity, size_t memsize, size_t alignment
                          , unsigned char zero);
void llsMP_delete(struct llsMP* me);
void* llsMP_get(struct llsMP* me);
unsigned char llsMP_return(struct llsMP* me, void* node);

In this primitive, cross-platform API, I don't use more convenient types like uint8_t or bool because I did not want to require stdint.h.  I prefer the non-heap-allocating version when possible, keeping the object as part of the containing object, like this:

typedef struct MyMic {
    AudioStreamBasicDescription streamFormat;
    AudioUnit inputUnit;
    AudioBufferList *inputBuffer;
    FILE* t_csv, *x_csv, *f_csv, *c_csv;
    struct llsMP mpool;
    struct llsQ padded_xQ;
    int8_t Eblock, iBlock;
    bool firstBlock;

} MyMic;

The memory pool still needs to be initialized before the callback is called, which is why I offer the alloc() API.

    if (!llsMP_alloc(&player->mpool
                     , 3 // capacity
                     , sizeof(struct SampleBlock) // memsize
                     , 16 // alignment: recommended by vDSP programming guide
                     , 1)//memset zero, since this pool is used for zero padding
        ) {
        fprintf (stderr, "Can't allocate zero padded memory pool");
        exit (errno);

    }

The 16-byte alignment is recommended by vDSP--probably because it uses SSE instructions to speed up the math (and those instructions require 16-byte-aligned data).  I enable zeroing out the memory because this memory pool holds the zero-padded raw samples.  Recall from my previous blog that circular convolution/correlation via the FFT requires appending zeros of the same length as the input block.  If I zero out the memory pool initially and only ever write to the latter half of each memory pool node I obtain, the zero padding is maintained forever, without ever having to explicitly zero out part of the buffer.  Copying the received sound samples from the HW into the latter half of this buffer (the first half always remains zero) is achieved through careful pointer arrangement right before handing the buffer to the AudioUnit to fill, in the receive callback:

struct SampleBlock {
    UInt64 mHostTime;//The timestamp for these frames
    //UInt64 mWordClockTime //don't bother; always zero

    //I can calculate the time of the 1st sample with this just as well
    uint32_t iBlock;
    UInt32 nFrames;
    float sample[1<<LOG2_2M];//2x sample block length, to zero pad
};

    struct SampleBlock* block = (struct SampleBlock*)llsMP_get(&player->mpool);
    if (!block) {
        fprintf(stderr, "Memory pool exhausted\n");
        return errno;
    }
    player->inputBuffer->mBuffers[0].mData = (void*)(&block->sample[1<<LOG2M]);

    CheckError(AudioUnitRender(player->inputUnit, ioActionFlags,
                               inTimeStamp, inBusNumber, inNumberFrames,
                               player->inputBuffer),
               "AudioUnitRender failed");

    block->iBlock = player->iBlock;
    block->mHostTime = inTimeStamp->mHostTime;

    block->nFrames = inNumberFrames;

    if (!llsQ_push(&player->padded_xQ, block)) {
        fprintf(stderr, "llsQ_push failed\n");
        return errno;
    }

As you can see, the pointer to the memory pool node can then be pushed into a lockless queue, which works well in this case because there is only 1 writer thread and 1 reader thread.  Note also that the reception time on the host (mHostTime) and the block number (starting at 0) are attached to the sample block.  If the correlation processor decides there is a positive event, it can then place the event to within one audio sampling period (1/44100 s here, or < 23 usec).

  /* SINGLE writer, SINGLE reader queue */
  typedef struct llsQ {
    size_t _head, _tail, _mask;
    void* *_q;/* array of pointers */
  } llsQ;

  unsigned char llsQ_alloc(struct llsQ* me, unsigned char exponent);
  void llsQ_free(struct llsQ* me);

  struct llsQ* llsQ_new(unsigned char exponent);
  void llsQ_delete(struct llsQ* me);

  unsigned char llsQ_push(struct llsQ* me, void* node);
  unsigned char llsQ_pop(struct llsQ* me, void** node);

The main algorithm loop sleeps until the lockless queue becomes non-empty.

    for (int8_t iBlock = 0; iBlock < player->Eblock; ) {
        struct SampleBlock* block;
        if (!llsQ_pop(&player->padded_xQ, (void**)&block)) {
            usleep(1000);
            continue;
        }

        // Have a new block to process

The correlation processor now has work to do.

Fast correlation processor

I discussed the central algorithm for the fast correlation using FFT in my previous blog entry.  Repeating for convenience:

correlation(x, filter) = iFFT(FFT(x) • FFT*(filter))

where • is the element-wise product of 2 vectors of equal length, and * is the complex conjugate operator.  Because the filter coefficients are constant, I can pre-calculate FFT*(filter) and store it in the DATA segment of the program, as I've done here:

#define LOG2M 10
#define EXP_LOG2M (1<<LOG2M)
#define LOG2_HALFM (LOG2M - 1)
#define LOG2_2M (LOG2M + 1)
// The matched filter from Matlab
float FFT_pulse_conj_vDSPz_flattened[1<<LOG2_2M] __attribute__((aligned(16)))
= {
#include "FFT_pulse_conj_vDSPz_flattened.h"
};

Given the wealth of DSP libraries and algorithms available for free these days, I expected to find a DSP library optimized for OS X/iOS, and I did: Apple's Accelerate framework, which seems to take advantage of Intel's SSE.  But, typical of Apple, the documentation is lacking (to put it mildly).  One must begin with vDSP's packing of the FFT of a real sequence, given here.  A few crucial concepts are:
  1. The result of an FFT is complex in general, and vDSP keeps the real and imaginary parts in 2 DIFFERENT arrays (lots of other implementations, including the venerable Numerical Recipes, keep them interleaved).
  2. The FFT of a real sequence x is conjugate-symmetric about frequency = 0; i.e. X(f) = X*(-f), where X = FFT(x).
  3. X(f=0) and X(f=Fs/2)--the FFT at DC and at the Nyquist frequency--are purely real.
Using all of the above properties, vDSP uses N numbers to represent the FFT of a real sequence of length N (which would otherwise require 2N numbers).  But because the real and imaginary parts are kept separately, an in-place FFT requires a prior step of splitting the real sequence into even- and odd-indexed subsequences, as I've done below with vDSP_ctoz.

    float creal[1<<LOG2M] __attribute__((aligned(16))) //input/output buffer
        , cimag[1<<LOG2M] __attribute__((aligned(16)))
        , ftemp[1<<LOG2_2M] __attribute__((aligned(16))) //temporary scratch pad
    ;
    COMPLEX_SPLIT splitc = { .realp = creal, .imagp = cimag }
        , split_temp = { .realp = ftemp, .imagp = &ftemp[1<<LOG2M] }
        , split_filter = { .realp = FFT_pulse_conj_vDSPz_flattened
            , .imagp = FFT_pulse_conj_vDSPz_flattened + (1<<LOG2M) }
    ;
    float FFT_filter_nyq = *split_filter.imagp;
    *split_filter.imagp = 0; //see explanation below
...

// When I get a block to process, split out a time series into odd and even
        vDSP_ctoz((COMPLEX*)block->sample, 2, &splitc, 1, 1<<LOG2M);

Note the arrangement of the filter's FFT to be consistent with vDSP's representation of the FFT of a real sequence.  My Matlab code (explained in my previous blog post) anticipates that the C code will do this, and writes out the header file accordingly.  I can then FFT the zero-padded sample block with the vDSP API:

vDSP_fft_zript(fftSetup, &splitc, 1, &split_temp, LOG2_2M, FFT_FORWARD);

The result is still in "splitc"--because this is an in-place FFT.  The temporary buffer is supposed to speed up the FFT (I have not yet measured by how much).  If I multiplied this FFT by the filter's FFT (actually the conjugate thereof) as-is, the result would be wrong, because the Nyquist value masquerading as the imaginary part of FFT(x) at DC distorts the product.  Because vDSP does not take care of this, I have to jump through some hoops to save and then restore it.  At least this is straightforward in the 1D case; the 2D case requires setting up a separate vector (a huge inconvenience).  I think the vDSP folks dropped the ball on this.

float FFT_c_nyq = *splitc.imagp; *splitc.imagp = 0;
vDSP_zvmul(&splitc, 1, &split_filter, 1, &splitc, 1
           , 1<<LOG2M//The multiplication is over vector of length M
           , 1 // don't conjugate (even though we are correlating);
           ); // because the filter is already in conjugate form
// Restore the Nyquist frequency portion, which is pure real
*splitc.imagp = FFT_c_nyq * FFT_filter_nyq;

I can then inverse-FFT to obtain the fast correlation:

vDSP_fft_zript(fftSetup, &splitc, 1, &split_temp, LOG2_2M, FFT_INVERSE);
vDSP_ztoc(&splitc, 1, (COMPLEX*)ftemp, 2, 1<<LOG2M);

But as I explained in a previous blog post, the correlation of the zero-padded block is not the final running correlation until the saved tail from the previous block is added to the overlapping region (the overlap-add method).  That is why overlap_save (initially zero) is carried from 1 block to the next:

float overlap_save[1<<LOG2M] __attribute__((aligned(16))) //final correlation
    ;
...
        //Correlation to output: overlap_save + ftemp[0:M-1];
        vDSP_vadd(ftemp, 1, overlap_save, 1, ftemp, 1, 1<<LOG2M);
        
        //Correlation to save for next round: overlap_save = ftemp[M:2M-1]
        memcpy(overlap_save, &ftemp[1<<LOG2M], sizeof(overlap_save));

Result: clear difference between silence and chirp

To validate the output of the fast correlator, I ran the program twice: the first time without any signal (just room noise), and the 2nd time while playing, on repeat, the chirp I designed in Matlab.  The Matlab validation script's orange line shows the difference between Matlab's calculation and the output from my CoreAudio program.  The 1st panel is the raw sample values, the middle panel is the magnitude of the element-wise multiplication of the 2 FFTs, and the bottom panel is the correlator output.
I am not sure what the mic is picking up during the first few milliseconds of the recording, but there is no obvious hit in the correlator output; a hit is clearly visible when I play the chirp, as you can see below.

Next steps

Once I can further process the correlator output into a yes/no chirp decision, picking out the sample at which the chirp was received and mapping that back to a monotonically increasing timestamp on the receiver side should be straightforward.  And thanks to the memory pool and circular queue, this processing pipeline can run ad infinitum.
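As a sketch of that last mapping step (my assumption of how it might look, not code that is in the repo yet): given the block's mHostTime--which I believe stamps the first frame of the block--and the sample index of the correlation peak within that block, the arrival time in host nanoseconds would be something like:

#include <CoreAudio/HostTime.h> // AudioConvertHostTimeToNanos
#include <cstdint>

// Hypothetical helper: map a correlation peak at sample index peakIdx within
// the block stamped with blockHostTime back to a host timestamp in nanoseconds.
static uint64_t chirpArrivalNanos(uint64_t blockHostTime, uint32_t peakIdx,
                                  double Fs /* 44100.0 here */) {
    uint64_t t0 = AudioConvertHostTimeToNanos(blockHostTime);
    return t0 + (uint64_t)(peakIdx * (1e9 / Fs)); // one sample is ~22.7 usec at 44.1 kHz
}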