Mar 25, 2017

Understanding sorting networks in Matlab

When I began this hobby project to solve the indoor location problem using only the HW available on a modern smartphone--all just to play an augmented reality X-wing vs. Tie-fighter game with my friend's 5-year-old son--I had no idea how much I would have to figure out.  And it's turning out to be quite a lot: the definition of "knowing" takes on a whole new dimension when you try to make something that works.  But it turns out this is the best way I learn; I should have tried to make something by myself a long time ago!

Now that I understand how to get continuous correlation from the matched filter (as explained in my previous blog entry), I am trying to boost the SNR of the filter and fish out the instantaneous jump in correlation when there is a chirp.  As shown in the last blog entry, that spike can be buried in the ambient sound (usually of much lower frequency than the near-Nyquist frequency of the correlation spike).  One of my favorite techniques is to subtract the median of a rolling time window: the median is superior to the mean in smoothing a signal in a high-noise environment.  The only problem is that the rolling median is hard to compute, unlike the rolling mean.  The most naive implementation will sort the content of the rolling window at every sample point.  Even with an O(N logN) sorting algorithm, the cost of sorting goes up quickly as the window size grows.  More sophisticated running-median algorithms try to keep an ordered rolling window instead.  I actually tried to do this with a doubly linked list, with a special pointer to the median, as shown in the sketch further below.
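Before getting to that, here is what the naive sort-at-every-sample baseline would look like--a minimal sketch of my own (the function name is made up), not the code I actually use:

#include <algorithm> // std::sort
#include <vector>

template <typename T>
T naive_rolling_median(const T* window, unsigned N) {
    std::vector<T> tmp(window, window + N); // copy, so the sample order survives
    std::sort(tmp.begin(), tmp.end());      // the O(N logN) sort, paid at EVERY sample
    return tmp[N / 2];                      // middle element = median (N assumed odd)
}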
Since I use a circular queue in all embedded projects, I was going to maintain a doubly linked circular buffer of size 1 greater than the window size, so that I could mark the oldest element stale, and decide whether to move the median to the left or to the right (or leave it as is) depending on the combination of the values of the elements, as hinted at below.
This took me 2 nights of coding and debugging to get right.  Incidentally, problems like this come up often in SW interviews.  Here is the class that does all the heavy lifting:

#include <cstring> // for memset

template <typename T> class SortedQ {
    class dValue {
        friend class SortedQ;
    protected:
        dValue *_prev, *_next;
    public:
        T val;
        bool stale;
        inline void init(T v) { stale = false; val = v; }
        inline dValue* next() {
            return (!_next || !_next->stale) ? _next : _next->_next;
        }
        inline dValue* prev() {
            return (!_prev || !_prev->stale) ? _prev : _prev->_prev;
        }
        static int compare(const void* lp, const void* rp) {
            const dValue* l = *(const dValue* const*)lp;
            const dValue* r = *(const dValue* const*)rp;
            // Subtracting T values truncates to int for floating-point T,
            // so compare explicitly instead.
            return (l->val < r->val) ? -1 : ((r->val < l->val) ? 1 : 0);
        }
        inline void rlink(dValue& right) { _next = &right; right._prev = this; }
        inline void llink(dValue& left) { _prev = &left; left._next = this; }
        inline void replace(dValue* old) {
            _prev = old->_prev; _next = old->_next;
            _prev->_next = _next->_prev = this;
        }
        void rinsert(dValue* newnode);
        void linsert(dValue* newnode);
        inline void dropout() {
            if (_next) _next->_prev = _prev;
            if (_prev) _prev->_next = _next;
        }
    };
public:
    static constexpr unsigned SIZE = 5;
    dValue arr[SIZE+1];
    SortedQ() {
        memset(arr, 0, sizeof(arr)); // mass-set pointers to 0
    }
    ~SortedQ() = default;
    void init();
    inline T& median() { return _median->val; }
    
    //Tail is where we will insert the NEXT sample to, so
    //head should be the slot AFTER the tail (circular).
    inline dValue* head() { return &arr[(_tail + 1) % (SIZE+1)]; }
    T& push(T& f);
    void print();
private:
    unsigned _tail; // In stable state, _tail points to the NEXT slot
    dValue* _median;
};

As evidenced by my slow coding speed, I think I am not a top-notch SW developer--especially given that my code assumes non-duplicate elements.  If I piqued your interest by tying this to SW interviews, the best-known solution is to keep 2 separate ordered lists (that can report their sizes) on either side of the median, and to keep the 2 sides of equal length at all times.  A simple list has O(N) search and insertion time, so people use advanced data structures like a balanced binary tree, or even a skip list, for the O(logN) insertion and search time.  I can guarantee that such an implementation will not be requested in a 45-minute SW interview!
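To make the 2-list idea concrete, here is a minimal sketch using std::multiset as the ordered container (the class name, the container choice, and the rebalancing details are my own illustration, not a reference implementation):

#include <deque>
#include <iterator> // std::prev
#include <set>

template <typename T>
class TwoListMedian {
    std::multiset<T> lo, hi; // ordered lower and upper halves of the window
    std::deque<T> fifo;      // arrival order, so we know which sample to evict
    unsigned _size;
public:
    explicit TwoListMedian(unsigned size) : _size(size) {}
    T push(T v) {
        fifo.push_back(v);
        if (!lo.empty() && v <= *lo.rbegin()) lo.insert(v);
        else hi.insert(v);
        if (fifo.size() > _size) { // window full: evict the oldest sample
            T old = fifo.front(); fifo.pop_front();
            auto it = lo.find(old);
            if (it != lo.end()) lo.erase(it);
            else hi.erase(hi.find(old));
        }
        // Rebalance so lo holds at most 1 more element than hi;
        // the largest element of lo is then the median.
        while (lo.size() > hi.size() + 1) {
            hi.insert(*lo.rbegin());
            lo.erase(std::prev(lo.end()));
        }
        while (hi.size() > lo.size()) {
            lo.insert(*hi.begin());
            hi.erase(hi.begin());
        }
        return *lo.rbegin(); // the median (lower median while the window fills up)
    }
};

Each multiset operation is O(logN), and duplicates are handled for free--but note the dynamic allocation hiding inside, which is not something I would reach for on an MCU.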

So even for the seemingly simple task of keeping a sorted window, the SW solution will be complicated; and I hate complexity.  For someone like me who is more interested in HW than SW, there is a bigger disappointment in the SW solutions discussed above: there is no HW optimization!  The CPU is spending most of its time managing pointers, and the actual value comparison is only a minuscule portion of the total running time.  But in THIS problem, where the window size is fixed and known at design time, and the data is streaming in nature, there IS one important HW optimization possible: vector-wise sorting using a sorting network.  The streaming nature of the data means that the sliding window of successive samples looks like a delayed version of the previous window, as illustrated below:
I can batch-sort a block of inputs going through this sorting window by treating each delay (starting at delay 0) as a separate vector.  For a window size of 5, for example, there are 5 inputs going into my sorting network--which has a known optimum solution (both in the number of comparisons and in delay) discussed in The Art of Computer Programming, Vol. 3 and reproduced on Ron Zeno's website.

While the sorting network is ideally implemented in custom HW, the next best implementation can be had on SIMD-capable chips, like Intel with SSE or ARM with NEON.  And the vectored nature of Matlab's max/min functions can easily simulate it.  Here's my implementation of the 5-element sorting network:
There are 2 opportunities for parallelization: at the very beginning and right before the last swapping.  I deliberately drew them staggered because I am NOT going to parallelize them.  The letters in the middle show the temporary vectors necessary to store the intermediate values--unless you have a HW-assisted compare-and-swap--because each compare-and-swap needs to be done in 2 stages: taking the max of 2 vectors, and then taking the min of the same 2 vectors.  Right away, you can see that the vectored sorting network will be memory-hungry, and might not be suitable for cheap MCUs.  But an iPhone can easily provide a few extra KB without throwing up.
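On SIMD HW, that 2-stage compare-and-swap turns into just 2 instructions operating on 4 samples at a time.  A single stage might look like this with SSE intrinsics (a sketch of mine; the function name is made up):

#include <immintrin.h> // SSE intrinsics

// One compare-and-swap of the network: afterwards a holds the
// element-wise max and b the element-wise min, 4 lanes at a time.
static inline void cmp_swap(__m128& a, __m128& b) {
    __m128 hi = _mm_max_ps(a, b); // stage 1: vector max
    __m128 lo = _mm_min_ps(a, b); // stage 2: vector min of the SAME 2 vectors
    a = hi;
    b = lo;
}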

Here's a Matlab simulation I used to verify this network:

x = rand(1000,1);
% Build 4 delayed copies of x (first sample replicated at the edge);
% together with x they form a sliding window of 5 successive samples.
d1 = [x(1); x(1:end-1)];
d2 = [d1(1); d1(1:end-1)];
d3 = [d2(1); d2(1:end-1)];
d4 = [d3(1); d3(1:end-1)];
plot(x); hold on;
plot(d4); % sanity check: d4 trails x by 4 samples
hold off;

% The sorting network: each compare-and-swap is a vector max
% followed by a vector min of the same 2 vectors.
A = max(x, d1);  B = min(x, d1);
C = max(d2, d3); D = min(d2, d3);
E = max(B, D);   D = min(B, D);
B = max(C, d4);  C = min(C, d4);
F = max(A, B);   B = min(A, B); % F holds the running maximum
A = max(E, C);   C = min(E, C);
E = max(A, B);   B = min(A, B);
D = max(D, C);
B = max(B, D);   % B now holds the running median

% Verify against Matlab's built-in median filter.  medfilt1 centers
% its window, while B lags by 2 samples, hence the index shift.
truth = medfilt1(x, 5);
med_err = truth(1:end-2) - B(3:end);
plot(x); hold on
plot(truth(1:end-2)); plot(B(3:end)); hold off;

plot(med_err); % 0 everywhere except the first couple of (edge) samples

You can see in my previous blog entry that I used 3 extra arrays of length 1K (that's the length of the matched filter) for correlation processing.  That is already 3 x 1K x 4B = 12 KB of RAM.  For this median processing, I can reuse those 3 arrays on the stack for A, B, and C.  But I still have to come up with D and E (F is only necessary to get the running maximum), so I am looking at 20 KB of extra RAM for the correlation and median processing.  As a reference, the nRF51822--the BLE MCU I am used to playing around with--only has up to 32 KB of RAM, so this solution is starting to look expensive...

2 comments:

  1. Not sure I understand: why do you need to keep the arrays in memory? Why not just perform the comparisons each time a new value comes in, keeping only the last 5 samples?

    1. When you process high-frequency signals such as audio (coming in at a minimum of 44.1 kHz for CD quality or better), the processor cannot wake up at that rate and process each sample: it would be completely swamped managing the interrupts. These high-frequency signals therefore often come in chunks. When the current chunk needs 1 or more previous chunks for proper processing (as in the case of convolution or correlation using FFT, which I discussed in other blog entries), then you must hold onto the original samples and/or intermediate results of the processing. These things all take up memory.
