Goodbye Steve, I'll miss you
The Jawbone Noise Assassin product lines (ERA, etc.) were category-creating. When I got to Jawbone in September 2015, many of the original people who worked on them had already left, including the former chief scientist Greg Burnett. But the remaining people gave me some clues to learning the technology on my own. One of them was Steve Forestieri, the director of the DSP group. Just a few weeks ago, he had promised to teach me the noise cancelling algorithm, but he had a heart attack that same day, and passed away a week later. I had seen Steve struggle with work-related stress for the last couple of months, and I correlate it with his heart attack. The irony was that he had the heart attack on the same day he seemed to come to a resolution with what was frustrating him, and decided to move on with the rest of his life. I am saddened by his passing, and am reminded again of the song Enjoy Yourself (It's Later Than You Think):

Enjoy yourself, it's later than you think
Enjoy yourself, while you're still in the pink
The years go by, as quickly as a wink
Enjoy yourself, enjoy yourself, it's later than you think

Well, learning new things is how I enjoy myself, so here goes nothing.
DOMA (dual omni-directional microphone array)
The seminal technology behind Jawbone's Noise Assassin algorithm is described in Greg Burnett's patents (USPN 8731211B2 and USPN 8837746B2). Even though they are well written, I learn better when I can rewrite them in my own words.

Geometry
Suppose I have 2 mics O1 and O2 that have essentially the same frequency response (getting that same frequency response is the subject of the DOMA calibration patent), spaced 2 x d0 apart and sampling a sound source S that is ds away from the mid-point of the mic pair, as shown below.

The geometry of the 2 mics in relation to S is:
d1 = SQRT[ds^2 – 2 ds d0 cosθ + d0^2]
d2 = SQRT[ds^2 + 2 ds d0 cosθ + d0^2]

If S is a point source, the sound amplitude attenuates like 1/R^2. On the ground, the ground reflection reduces the attenuation to something less than 1/R^2, but still, it's probably not 1/R as the patent claims. Let's just continue using the 1/R and see if 1/R vs. 1/R^2 even matters in the end. O2(t) will then experience a sound wave that is attenuated by β = (d1/d2) relative to O1(t), and delayed by (d2 – d1) / c seconds, where c is the speed of sound. In a discrete sampled system with sampling frequency Fs, that delay is
γ (gamma) = (d2 – d1) Fs / c in samples (not necessarily an integer)

As a comparison, at Fs = 48 kHz, c / Fs = 7.3 mm. So the z transforms of O1 and O2 can be related:
O2(z) = β z^-γ O1(z)

Let's visualize the quantities β and γ for d0 = 10 mm, and ds = [10 cm, 30 cm, 1 m].
For sampling frequency Fs = 16 kHz, if the sound source is 10 cm away (ds = 0.1 m) and directly in front of O1 (θ = 0), β = 0.8182 and γ = 0.9275. Let's call these B and G, to indicate they are constants.
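To make those numbers concrete, here is a small Python sketch (my own check, not anything from the patents) that reproduces B and G from the geometry. The speed of sound c = 345 m/s is my assumption, picked so γ matches the 0.9275 figure:

```python
import numpy as np

c = 345.0     # m/s, assumed speed of sound
Fs = 16000    # Hz, sampling frequency
d0 = 0.010    # m, half the 2 x d0 mic spacing
ds = 0.10     # m, source distance from the mid-point
theta = 0.0   # rad, source directly in front of O1

d1 = np.sqrt(ds**2 - 2*ds*d0*np.cos(theta) + d0**2)  # distance S -> O1
d2 = np.sqrt(ds**2 + 2*ds*d0*np.cos(theta) + d0**2)  # distance S -> O2

beta = d1 / d2                # amplitude ratio at O2 vs. O1 (1/R model)
gamma = (d2 - d1) * Fs / c    # delay in samples (not necessarily an integer)
print(f"beta = {beta:.4f}, gamma = {gamma:.4f}")
# → beta = 0.8182, gamma = 0.9275
```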
It makes sense that as the sound source becomes more distant, it "sounds" increasingly the same to the 2 mics. ds = 100 d0 is practically an infinite distance in this scheme.
Gradient mic pair
Now imagine a linear combination of arbitrarily delayed O1/O2, as shown below.
Let's form V2 as a transfer function of what the speech mic O1 experiences.
V2(z) = O2(z) – B z^-G O1(z) = β z^-γ O1(z) – B z^-G O1(z) = (β z^-γ – B z^-G) O1(z)
V2 has a speech null
I can evaluate the transfer function V2(z) / O1(z) by letting z = exp(jω): (β e^-jωγ – B e^-jωG), whose magnitude and phase look like this for a sound source within +/- 20% of the expected distance (ds = 10 cm):
Here, I took the frequency up to the Nyquist limit; black = 10 Hz, red = 100 Hz, green = 1 kHz, blue = 4 kHz, cyan = 8 kHz. Even though the penalty for being wrong about the true distance of the speech source from the mic array gets larger at high frequency (solid line is ideal, dashed and dotted are +/- 20%), there is still a null for angles within +/- 30 degrees: a relatively small region where V2 picks up NO sound from S. This is apparently called a "null" among "those skilled in the art". But V2 does pick up OTHER sound, which I will call N2 here (the noise emitted by V2).
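A minimal sketch of that evaluation (my own code; constants and names are taken from the geometry above, c = 345 m/s assumed) shows the null appearing exactly at the design angle:

```python
import numpy as np

c, Fs, d0, ds = 345.0, 16000, 0.010, 0.10

def beta_gamma(dist, theta):
    # beta and gamma from the geometry, for a source at (dist, theta)
    d1 = np.sqrt(dist**2 - 2*dist*d0*np.cos(theta) + d0**2)
    d2 = np.sqrt(dist**2 + 2*dist*d0*np.cos(theta) + d0**2)
    return d1/d2, (d2 - d1)*Fs/c

B, G = beta_gamma(ds, 0.0)   # design-point constants
w = 2*np.pi*1000/Fs          # evaluate at 1 kHz

for deg in (0, 10, 30, 90):
    beta, gamma = beta_gamma(ds, np.radians(deg))
    H = beta*np.exp(-1j*w*gamma) - B*np.exp(-1j*w*G)   # V2/O1 on the unit circle
    print(f"theta = {deg:3d} deg: |V2/O1| = {abs(H):.4f}")
# At theta = 0 the two terms cancel exactly: the speech null.
```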
V1 picks up speech AND noise N1
Now consider another arrangement of O1 and O2, using the same constants B and G as above:
V1(z) = z^-G O1(z) – B O2(z) = z^-G O1(z) – B β z^-γ O1(z) = (z^-G – B β z^-γ) O1(z)

I first thought the design goal is to directly cancel out noise (S relatively far away from the mouth) by designing V1 ~ V2 for noise. BUT as you can see below, that is not true in general (gets worse as frequency goes up).
Fundamentally, V1 is picking up the difference of the sound heard by O1 and O2, which you can see by staring at the transfer function above when we are "reasonably close" to the assumed values B and G.
V1(z)/O1(z) = z^-G – B β z^-γ ~ z^-G (1 – B β) ~ z^-G (1 – B^2)

What the above plot tells me is that V2 is NOT emitting the same noise as V1. This is one of the keys to understanding the Jawbone noise cancelling.
For speech near the assumed distance, |V1| is nearly constant within +/- 20 degrees of the design angle to the sound source, as you can see in the left plot below.
When the sound source is relatively far away from the array, O1 and O2 hear roughly the same thing. For the purpose of rejecting distant sound, let's call this sound N and plug into O1 and O2 in the equations for V1 and V2.
V1(z) = z^-G O1(z) – B O2(z) ~ z^-G N(z) – B N(z) = (z^-G – B) N(z)
V2(z) = O2(z) – B z^-G O1(z) ~ N(z) – B z^-G N(z) = (1 – B z^-G) N(z)
=> H(z) = V2(z) / V1(z) ~ (1 – B z^-G) / (z^-G – B)

Again, I evaluate the transfer function with z = exp(jω) to get the frequency response.
V2(z) / V1(z) ~ (1 – B e^-jωG) / (e^-jωG – B)

So contrary to my initial expectation, DOMA cannot cancel noise; it merely transforms the speech and (far-away) noise into speech and 2 related noises.
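A sketch of how those 2 noises relate (my own check, using the B and G from the 10 cm design point): on the unit circle, |1 – B e^-jωG| and |e^-jωG – B| are equal at every frequency, so H is allpass. The two noise paths share the same magnitude and differ only in phase:

```python
import numpy as np

Fs, B, G = 16000, 0.8182, 0.9275         # design-point constants from above

f = np.array([100.0, 1000.0, 4000.0])    # a few frequencies in Hz
w = 2*np.pi*f/Fs
zG = np.exp(-1j*w*G)                     # z^-G evaluated on the unit circle
H = (1 - B*zG) / (zG - B)                # V2/V1 for distant noise
for fi, Hi in zip(f, H):
    print(f"{fi:6.0f} Hz: |H| = {abs(Hi):.4f}")
# |H| = 1 at every frequency: numerator and denominator magnitudes match.
```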
V1 = S + N1
V2 = N2

To kill N1 in V1, we turn to an adaptive filter.
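That adaptive-filter step is the next topic, but as a sketch of the idea (a textbook NLMS canceller, my own illustration, not Jawbone's actual algorithm): use V2 as a noise reference and adapt a short FIR filter to predict N1 inside V1; the residual is the recovered speech.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
noise = rng.standard_normal(n)
speech = 0.5*np.sin(2*np.pi*0.01*np.arange(n))          # stand-in for speech S
v2 = noise                                              # noise reference (N2)
v1 = speech + np.convolve(noise, [0.6, -0.3, 0.1])[:n]  # speech + N1

taps, mu, eps = 8, 0.1, 1e-8   # filter length, NLMS step size, regularizer
w = np.zeros(taps)
e = np.zeros(n)
for i in range(taps, n):
    x = v2[i-taps+1:i+1][::-1]          # newest reference sample first
    e[i] = v1[i] - w @ x                # residual = v1 minus predicted N1
    w += mu * e[i] * x / (x @ x + eps)  # NLMS weight update
# After convergence, e tracks the speech component of v1.
```

Since the sine is uncorrelated with the noise reference, the filter converges toward the [0.6, -0.3, 0.1] path and leaves the speech in the residual.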