Apr 28, 2016

Noise cancelling with dual omni-directional microphones

Goodbye Steve, I'll miss you

The Jawbone Noise Assassin product line (ERA, etc.) was category-creating.  When I got to Jawbone in September 2015, many of the original people who worked on it had already left, including the former chief scientist Greg Burnett.  But the remaining people gave me some clues for learning the technology on my own.  One of them was Steve Forestieri, the director of the DSP group.  Just a few weeks ago, he promised to teach me the noise cancelling algorithm, but he had a heart attack that same day, and passed away a week later.  I had seen Steve struggle with work-related stress for the last couple of months, and I can't help but connect it with his heart attack.  The irony was that he had the heart attack on the very day he seemed to come to a resolution about what was frustrating him and decided to move on with the rest of his life.  I am saddened by his passing, and am reminded again of the song Enjoy Yourself (It's Later Than You Think):
Enjoy yourself, it's later than you think
Enjoy yourself, while you're still in the pink
The years go by, as quickly as a wink
Enjoy yourself, enjoy yourself, it's later than you think
Well, learning new things is how I enjoy myself, so here goes nothing.

DOMA (dual omni-directional microphone array)

The seminal technology behind Jawbone's Noise Assassin algorithm is described in Greg Burnett's patents (USPN 8731211B2 and USPN 8837746B2).  Even though they are well written, I learn better when I can rewrite them in my own words.

Geometry

Suppose I have 2 mics O1 and O2 that have essentially the same frequency response (getting that same frequency response is the subject of the DOMA calibration patent), spaced 2 x d0 apart and sampling a sound source S that is ds away from the mid-point of the mic pair, as shown below.
The geometry of the 2 mics in relation to S is:
d1 = SQRT[ds^2 – 2 ds d0 cosθ + d0^2]
d2 = SQRT[ds^2 + 2 ds d0 cosθ + d0^2]
If S is a point source, the sound intensity attenuates like 1/R^2, which corresponds to a 1/R falloff in amplitude.  On the ground, the ground reflection reduces the attenuation somewhat, so the amplitude probably does not fall off exactly like the 1/R the patent claims.  Let's just continue using 1/R and see if the exact falloff even matters in the end.  O2(t) then experiences a sound wave that is attenuated by β = d1/d2 relative to O1(t), and delayed by (d2 – d1) / c seconds, where c is the speed of sound.  In a discrete system sampled at frequency Fs, that delay is
γ (gamma) = (d2 – d1) Fs / c in samples (not necessarily an integer)
As a comparison, at Fs = 48 kHz, one sample of delay corresponds to c / Fs ≈ 7.2 mm.  So the z transforms of O1 and O2 can be related:
O2(z) = β z^-γ O1(z)
Let's visualize the quantities β and γ for d0 = 10 mm and ds = [10 cm, 30 cm, 1 m].
For sampling frequency Fs = 16 kHz, if the sound source is 10 cm (ds = 0.1 m) and directly in front of O1 (θ = 0),  β = 0.8182 and γ = 0.9275.  Let's call these B and G, to indicate they are constants.
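Here is a quick numerical check of that geometry, as a sketch (the function and variable names are mine, and I am assuming c = 345 m/s, the value that reproduces the β and γ above):

import numpy as np

def beta_gamma(ds, theta, d0=0.010, Fs=16e3, c=345.0):
    # Distances from the source to each mic (law of cosines).
    d1 = np.sqrt(ds**2 - 2*ds*d0*np.cos(theta) + d0**2)
    d2 = np.sqrt(ds**2 + 2*ds*d0*np.cos(theta) + d0**2)
    beta = d1 / d2                # amplitude of O2's signal relative to O1's
    gamma = (d2 - d1) * Fs / c    # delay in samples (not necessarily an integer)
    return beta, gamma

for ds in (0.10, 0.30, 1.0):      # 10 cm, 30 cm, 1 m
    print(ds, beta_gamma(ds, theta=0.0))
# ds = 0.10 prints beta = 0.8182, gamma = 0.9275 (the B and G above).
# On-axis, d2 - d1 = 2 d0 exactly, so gamma stays put; only beta -> 1
# as the source recedes.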

It makes sense that as the sound source becomes more distant, it "sounds" the same to the 2 mics.  ds = 100 d0 is a practically infinite distance in this scheme.

Gradient mic pair

Now imagine a linear combination of O1 and O2 with an arbitrary delay, as shown below.
Let's form V2 as a transfer function applied to what the speech mic O1 experiences.

V2(z) = O2(z) – B z^-G O1(z) = β z^-γ O1(z) – B z^-G O1(z) = (β z^-γ – B z^-G) O1(z)

V2 has a speech null

I can evaluate the transfer function V2(z) / O1(z) by letting z = exp(jω):  (β e^-jωγ – B e^-jωG), whose magnitude and phase look like this for a sound source within +/- 20% of the expected distance (ds = 10 cm):

Here, I took the frequency up to the Nyquist limit; black = 10 Hz, red = 100 Hz, green = 1 kHz, blue = 4 kHz, cyan = 8 kHz.  Even though the penalty for being wrong about the true distance of the speech source from the mic array grows with frequency (solid line is ideal, dashed and dotted are +/- 20%), there is still a null within +/- 30 degrees: the relatively small region where V2 picks up NO sound from O1.  This is apparently called a "null" among "those skilled in the art".  But V2 does pick up OTHER sound, which I will call N2 here (for the noise emitted by V2).
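Here is the same evaluation numerically, continuing from the sketch above (the frequency grid and the distance/angle sweep are my choices):

Fs = 16e3
B, G = beta_gamma(0.10, theta=0.0)        # design constants: ds = 10 cm, on-axis
f = np.array([10., 100., 1e3, 4e3, 8e3])  # Hz, up to the Nyquist limit
w = 2 * np.pi * f / Fs                    # normalized radian frequency

for ds in (0.08, 0.10, 0.12):             # design distance +/- 20%
    for deg in (0.0, 30.0):
        beta, gamma = beta_gamma(ds, np.radians(deg))
        H = beta * np.exp(-1j * w * gamma) - B * np.exp(-1j * w * G)  # V2/O1
        print(ds, deg, 20 * np.log10(np.abs(H) + 1e-12))
# Only ds = 0.10 at theta = 0 gives an exact zero (the speech null).
# Off-axis the residual grows with frequency, because the phase mismatch
# w*(gamma - G) grows; an on-axis distance error leaves a frequency-flat
# residual of |beta - B|.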

V1 picks up speech AND noise N1

Now consider another arrangement of O1 and O2, using the same constants B and G as above:
V1(z) = z^-G O1(z) – B O2(z) = z^-G O1(z) – B β z^-γ O1(z) = (z^-G – B β z^-γ) O1(z)
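Since G and γ are not integers, implementing z^-G takes a fractional-delay filter.  Here is a minimal time-domain sketch of both fixed combinations, continuing from the sketches above and assuming a windowed-sinc fractional delay (this implementation choice and all names are mine, not from the patent):

def frac_delay_fir(delay, ntaps=21):
    # Windowed-sinc FIR approximating z^-delay (plus a bulk delay of (ntaps-1)/2).
    n = np.arange(ntaps)
    h = np.sinc(n - (ntaps - 1) / 2 - delay) * np.hamming(ntaps)
    return h / h.sum()            # normalize to unity DC gain

def doma_fixed_pair(o1, o2, B=0.8182, G=0.9275):
    # V1 = z^-G O1 - B O2  (speech plus noise N1)
    # V2 = O2 - B z^-G O1  (speech nulled; noise N2 only)
    hG = frac_delay_fir(G)        # ~ z^-G
    h0 = frac_delay_fir(0.0)      # pure bulk delay, to keep both branches aligned
    o1d = np.convolve(o1, hG, mode="same")
    o2d = np.convolve(o2, h0, mode="same")
    return o1d - B * o2d, o2d - B * o1d   # v1, v2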
I first thought the design goal is to directly cancel out noise (S relatively far away from the mouth) by designing V1 ~ V2 for noise.  BUT as you can see below, that is not true in general (gets worse as frequency goes up).
Fundamentally, V1 is picking up the difference of the sound heard by O1 and O2, which you can see by staring at the transfer function above when we are "reasonably close" to the assumed values B and G.
V1(z)/O1(z) = z^-G – B β z^-γ ~ z^-G (1 – B β) ~ z^-G (1 – B^2).  For B = 0.8182, 1 – B^2 ≈ 0.33, or about -9.6 dB.
What the above plot tells me is that V2 is NOT emitting the same noise as V1.  This is one of the keys to understanding the Jawbone noise cancelling.

For speech near the assumed distance, |V1| is nearly constant within +/- 20 degrees of the design angle to the sound source, as you can see on the left plot below.
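A numerical stand-in for that plot, continuing with the definitions above (the angle and frequency grids are my choices):

B, G = beta_gamma(0.10, theta=0.0)
w = 2 * np.pi * np.array([100., 1e3, 4e3]) / Fs
for deg in (0.0, 10.0, 20.0):
    beta, gamma = beta_gamma(0.10, np.radians(deg))
    V1 = np.exp(-1j * w * G) - B * beta * np.exp(-1j * w * gamma)  # V1/O1
    print(deg, np.abs(V1))
# |V1| barely moves over +/- 20 degrees (about 0.33 everywhere):
# V1 keeps the speech essentially intact.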
When the sound source is relatively far away from the array, O1 and O2 hear roughly the same thing.  For the purpose of rejecting distant sound, let's call this sound N and plug it in for O1 and O2 in the equations for V1 and V2.
V1(z) = z^-G O1(z) – B O2(z) ~ z^-G N(z) – B N(z) = (z^-G – B) N(z)
V2(z) = O2(z) – B z^-G O1(z) ~ N(z) – B z^-G N(z) = (1 – B z^-G) N(z)
=> H(z) = V2(z) / V1(z) ~ (1 – B z^-G) / (z^-G – B)
Again, I evaluate the transfer function with z = exp(jω) to get the frequency response.
V2(z) / V1(z) ~ (1 – B e^-jωG) / (e^-jωG – B)
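Evaluating that numerically (continuing with the definitions above):

B, G = 0.8182, 0.9275
w = 2 * np.pi * np.linspace(10, Fs / 2, 512) / Fs
zG = np.exp(-1j * w * G)                 # z^-G on the unit circle
H = (1 - B * zG) / (zG - B)              # V2/V1 for distant noise
print(np.abs(H).min(), np.abs(H).max())  # 1.0 1.0

Numerically |H| = 1 at every frequency: for distant noise, N2 is an all-pass (phase-only) transformation of N1, not a copy of it, which is why a plain subtraction cannot remove it.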
So contrary to my initial expectation, DOMA cannot cancel noise by itself; it merely transforms the speech and the (far away) noise into speech plus 2 related noise signals.
V1 = S + N1
V2 = N2
To kill N1 in V1, we turn to an adaptive filter.

LMS (least mean squared) adaptive filter

When there is no speech (S = 0), N1 and N2 are related: they are both driven by the same noise (IF there is one dominant noise source away from the mic array, which is true in most cases).  An adaptive filter should then yield an estimate of N1 such that the residual noise N_residual is the minimum obtainable.
There are whole books written on adaptive filters, so LMS can potentially be replaced with other adaptive algorithms.  Regardless of the algorithm used, the adaptation must STOP when S is active, to avoid the filter adapting to the speech and therefore canceling it.  That is what the Jawbone VAD (voice activity detection) patent (USPN 8467543) covers.
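Here is a minimal sketch of one such canceller, a normalized LMS filter with a VAD gate (the function name, the parameter values, and the gating detail are my assumptions; the real Noise Assassin is surely more elaborate):

import numpy as np

def nlms_cancel(v1, v2, order=32, mu=0.5, eps=1e-8, vad=None):
    # v1 = S + N1 (speech branch), v2 = N2 (noise reference).
    # The filter maps v2 to an estimate of N1; the residual e = v1 - N1_hat
    # is the noise-cancelled output.
    w = np.zeros(order)
    out = np.zeros(len(v1))
    for n in range(order, len(v1)):
        x = v2[n - order:n][::-1]         # most recent reference samples
        n1_hat = w @ x                    # current estimate of N1
        e = v1[n] - n1_hat                # residual: speech + leftover noise
        out[n] = e
        if vad is None or not vad[n]:     # freeze adaptation while speech is active
            w += mu * e * x / (x @ x + eps)
    return out

During noise-only stretches the filter converges toward an FIR approximation of the N2-to-N1 relation (roughly 1/H(z) from above), and the residual then stays noise-cancelled while speech is present, as long as the noise field does not change too quickly.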

But wait, there is more; a LOT more!

This is just the core of the noise cancelling covered by the Jawbone patent.  If you have ever worked on a real product, you know that things get complicated quickly.  And a lot of that is the "magic sauce" or "dark art" trade secret.  I may add more material as I learn about noise cancelling, but from this point on, I am going to be careful about what I write here.