FIR filtering in pictures
For audio applications, the constant group delay property of the FIR filter is an absolute must to avoid phase distortion that is very difficult to design around in the IIR filter, but for some reason, FIR (finite impulse response) filters are used far more than the IIR (impulse response filter) even for non-audio applications. FIR filter kernels are far greater than the IIR, so the multiply-and-add speed of the MCU or in some cases dedicated DSP makes a big difference. Suppose we have to filter a relatively long time series x[k] of length N with an FIR filter h[k] of order M-1 (the FIR filter coefficient length is 1 greater than the filter order), as shown below.
In time domain, filtering (convolving) x[k] by h[k] is done by calculating the integral of the product of x[k] by the time reversed version of h[k] at every sampling instance. Since I assumed at the outset that x[k] is a much longer (N) time series signal than the filter kernel size (M), it is convenient to visualize the mirror image of h[k] sliding from left to right for each time series sample of the filter output, as you can see below.
Since we are dealing with discrete sampled system, it's possible to draw the FIR kernel at discrete points and visually integrate the overlap to obtain the filtered time series (green) above. It is also possible to see that at k = -M, there is no overlap and therefore the filtered output is 0. But at k = N-1, there is an overlap of 1 sample, so the response is non-zero in general. Therefore, the response is non-zero for total of N + (M-1) samples, BUT there are 3 separate regions of filter output:
Note how I emphasized the beginning and end of the dataset with the dot. In the circular case, the "sliding" of the filter kernel is achieved by rotating the kernel, as you can see below.
Can you see where the end of h[k] now overlaps the beginning of x[k]? The multiply-accumulate contribution from this overlap is correct for periodic signals, but otherwise incorrect. It is as if you change your signal as shown below:
To understand the zero padding concept, segment the x[k] sequence by blocks of M sequences. At delay = -M+1, only the end of the h[k] overlaps x--specifically x[0], as shown below.
As we slide the kernel to the right, the MAC grows because of more overlap, until at zero delay, h[k] and x[k] completely overlap. But subsequently, the overlap goes down again and the MAC decreases. In overlap-add, you ADD the R contribution of the upper convolution to the L contribution of the lower convolution. The MAC values at delay = 0, M, 2M, ... do not need addition because they are completely overlapped. So the block-wise convolution can be done with the padding either at the beginning or the end. I found it easier to visualize the scheme by padding on the left, as you can see below.
If the array to receive the results is initialized to zero, the stitching can be done with a simple addition of the new MAC result to the final result. At each block, 3 kinds of MAC results are generated:
My "padding on the left" can be drawn on a circle when doing the convolution with FFT (fast convolution), as shown below.
Note where the L and the R portions of the resulting MAC are; this is critical for assembling the results between blocks correctly. Note also that when using FFT, the block size is constrained to power of 2.
In time domain, filtering (convolving) x[k] by h[k] is done by calculating the integral of the product of x[k] by the time reversed version of h[k] at every sampling instance. Since I assumed at the outset that x[k] is a much longer (N) time series signal than the filter kernel size (M), it is convenient to visualize the mirror image of h[k] sliding from left to right for each time series sample of the filter output, as you can see below.
Since we are dealing with discrete sampled system, it's possible to draw the FIR kernel at discrete points and visually integrate the overlap to obtain the filtered time series (green) above. It is also possible to see that at k = -M, there is no overlap and therefore the filtered output is 0. But at k = N-1, there is an overlap of 1 sample, so the response is non-zero in general. Therefore, the response is non-zero for total of N + (M-1) samples, BUT there are 3 separate regions of filter output:
- Pre: from k=-M+1 to k = -1, the filter is not completely overlapped with the signal, so the output is artificially small.
- Completely overlapped: from k=0 to k=(N-1)-M, the filter is completely overlapped with the signal, so every filter kernel coefficient contributes to the output. Note there are only N-M+1 sample instances in a finite length time series consideration here.
- Post: from k=N-M on, the filter kernel is again not completely overlapped with the signal.
Note how I emphasized the beginning and end of the dataset with the dot. In the circular case, the "sliding" of the filter kernel is achieved by rotating the kernel, as you can see below.
Can you see where the end of h[k] now overlaps the beginning of x[k]? The multiply-accumulate contribution from this overlap is correct for periodic signals, but otherwise incorrect. It is as if you change your signal as shown below:
To understand the zero padding concept, segment the x[k] sequence by blocks of M sequences. At delay = -M+1, only the end of the h[k] overlaps x--specifically x[0], as shown below.
As we slide the kernel to the right, the MAC grows because of more overlap, until at zero delay, h[k] and x[k] completely overlap. But subsequently, the overlap goes down again and the MAC decreases. In overlap-add, you ADD the R contribution of the upper convolution to the L contribution of the lower convolution. The MAC values at delay = 0, M, 2M, ... do not need addition because they are completely overlapped. So the block-wise convolution can be done with the padding either at the beginning or the end. I found it easier to visualize the scheme by padding on the left, as you can see below.
If the array to receive the results is initialized to zero, the stitching can be done with a simple addition of the new MAC result to the final result. At each block, 3 kinds of MAC results are generated:
- Those on the L region are added to the R region of the previous block's result.
- Values at multiples of M are already complete.
- Those of the R region will be complete in the next round.
My "padding on the left" can be drawn on a circle when doing the convolution with FFT (fast convolution), as shown below.
Note where the L and the R portions of the resulting MAC are; this is critical for assembling the results between blocks correctly. Note also that when using FFT, the block size is constrained to power of 2.
Simulating an audio chirp in noisy environment in Matlab
LFM (linear frequency modulation) pulse is the simplest waveform used in pulse compression (it's described in radar textbooks!). The frequency starts at f0, and grows at π΅ = B/π½p, where B is the frequency bandwidth (maximum frequency - f0) and π½p is the pulse duration. As shown in my previous blog entry, the greater the B and π½p, the more energy you can pack into the pulse, increasing the probability of true positive detection. In ranging, B is inversely proportional to the ranging resolution. B and π½p are somewhat "baked" into the algorithm (in the form of the precomputed FFT values, as we will see shortly), but the amplitude can be dynamically changed for every pulse (but not DURING the pulse). The signal strength is directly proportional to the sound amplitude, but I deliberately wanted to keep the amplitude low, for a better user experience. So I found a anechoic singing recording, and set my amplitude to be the same as the RMS amplitude of the recording:
CHIRP_STRENGTH = 1
[s_ambient, Fs_a] = audioread('singing.wav', [1 2*N] + 8E4);
if Fs_a ~= Fs % resample
s_ambient = interp1(((1:length(s_ambient)) - 1) / Fs_a, s_ambient, t, 'linear');
else
s_ambient = s_ambient';
end
RMS_ambient = sqrt(mean(s_ambient .^ 2));
A_u = CHIRP_STRENGTH * RMS_ambient * sqrt(2);
In addition to the anechoic ambient sound, I also added some white noise with amplitude = 10% of the ambient sound) to the received signal:
sigma_r = 0.1 * RMS_ambient;
rcv_noise = sigma_r * randn(1,N);
s_r = rcv_noise;
s_r = s_r + s_ambient((1:N));
It is exceptionally common in filter design to modify an ideal (the desired) waveform with a windowing function; in fact, the very first frequency domain filter design techniques taught in textbooks are exactly this. The windowing functions reduce the strength of the side lobes (what we don't want) at the expense of spreading out the bandwidth of the main lobe (what we do want, but ideally sharply focused in the frequency domain). Even leaving aside the filter design considerations, the physical bandwidth capabilities of the speaker / mic on smartphones means the low and high frequencies will be attenuated. I found the windows widely used in filter designs (Kaiser, Hamming, Chebichev, etc) to attenuate too much of the midband frequencies, which the smartphone speaker/mics are optimized for. So I designed a "trapezoidal" window, like this:
win = ones(1,M);
forte_d = round(200 * Fs / 48000);
piano_d = round(50 * Fs / 48000);
win(1:forte_d) = linspace(0,1, forte_d);
win((M-piano_d+1):M) = linspace(1,0, piano_d);
The windowed compressed pulse of B = 22 kHz, π½p = 1024 * Ts, where Ts = 44100 Hz (CD quality) then looks like this in time and frequency domain.
Given a true range, the closest simulated received sound at the mic due to a combination of the delayed compressed pulse, competing environmental sound, and receiver noise are:
t_R = range/c; % nominal wave travel time
x(j, (1:M) + round(t_R * Fs)) = xmit;
s_r = s_r + (scat_rcs(j) / (range)^attenuation) * x(j,:);
Recall that s_r was initialized to the noise value a few paragraphs above.
Correlation processor in Matlab
A matched filter has the highest possible SNR for a given signal, and is easy to to design: it is just the complex conjugate of the original signal--that is, a time reversed version. Now stare at the convolution in pictures section above: the kernel is flipped, and then the MAC operation is performed. So if we are conjugating the original signal to obtain the filter kernel, and then conjugating that again to actually perform the MAC, then filtering the received signal with the matched filter is identical to correlating the received signal with the sent signal! So in my implementation of compressed pulse, I wound up using correlation--which is just the mirror of convolution--as can be seen in the convolution theorem and the correlation theorem, which differ by whether the kernel is complex conjugated:
Yconv(f) = X(f) .* H(f)
Ycorr(f) = X(f) .* H*(f)
In my correlation, I only use the middle (N-M) portion of the N+M-1 possible correlation values, because the front and the back correlation values do not have complete overlap of the signal and the kernel; consider it an issue of fairness, if it helps you.
pulse_pad = [pulse zeros(size(pulse))];
FFT_pulse = fft(pulse_pad);
FFT_pulse_conj = conj(FFT_pulse);
fft_corr = zeros(size(s_r) - [0 M]);
rx_pad = [zeros(size(pulse)) s_r(1:M)];
corr = ifft(fft(rx_pad) .* FFT_pulse_conj);
fft_corr(1:M) = corr(M + (1:M)); % Throw away the L half, save the R half
for i=M:M:(N-2*M) % Feed subsequent received samples
rx_pad((M+1):end) = s_r(i+(1:M));
corr = ifft(fft(rx_pad) .* FFT_pulse_conj);
fft_corr(i + ((-M+1):M)) = fft_corr(i + ((-M+1):M)) + corr;
end
% Pick up the last block's L half (and add to the previous block's R half)
if isempty(i), i = M; else i = i + M; end
rx_pad((M+1):end) = s_r(i+(1:M));
corr = ifft(fft(rx_pad) .* FFT_pulse_conj);
fft_corr(i+((-M+1):0)) = fft_corr(i+ ((-M+1):0)) + corr(1:M);
And here is a plot of the fft_corr using the ground truth of 7 m.
Do you see the sharp peak at 7 m? The ambient sound (of equal magnitude as the chirp wave, since CHIRP_STRENGTH = 1 when forming the transmitted sound earlier) comes right through the relatively flat passband of the matched filter, so the sharp peak at 7 m is competing against the ambient sound.
Real compressed pulse
I saved 2 versions of the ideal compressed pulse calculated above: one from the left speaker, and the other from the right:
silence = zeros(size(pulse));
audiowrite(strcat(tempDir, 'lchirp.wav'), [pulse; silence]', Fs);
audiowrite(strcat(tempDir, 'rchirp.wav'), [silence; pulse]', Fs);
Since these are just WAV files, the PC can play it from its built-in speaker at will. I recorded what the PC's built-in mic heard:
for i=0:(audiodevinfo(1)-1)
if isempty(strfind(audiodevinfo(1,i), 'Built'))
continue;
end
break;
end
Fs = 44100
% OK, now I have the ID of the recorder
recorder = audiorecorder(Fs, 16, 2, i)
disp('recording...');
recordblocking(recorder, 5);
disp('recording done');
play(recorder);
heard = getaudiodata(recorder);
figure(1);
t = (1:length(heard)) / Fs;
ha(1) = subplot(211); plot(t, heard(:,1));
ha(2) = subplot(212); plot(t, heard(:,2));
linkaxes(ha, 'x');
xlabel('[s]');
fname = '~/github/OpenCVUnity/doc/heard.mat';
save(fname, 'heard');
Because I played the file twice within span of 5 s, you see 2 pulses on both the left and the right mics.
But either due to the poor quality of the PC speaker or (and?) mic, what the mic actually returned is far from the ideal response with which I calculated the matched filter frequency response.
Despite this imperfection, the matched filter thrived in a low noise environment.
Zoomed in, the correlation function is no longer a single sharp peak seen in simulation.
Ignore the crazy distance: a sound will not even travel that far; the distance is calculated with the hypothesis that the pulse was transmitted at the 1st sample of the recording.
If the signal has to compete with ambient sound, then correlation will let the ambient sound through, as you can see below:
The good news is that the sharp peak survived. Picking off the sharp correlation peak among competing oscillation will require more downstream filter; I am not sure if they will necessarily be linear, in which case the error covariance is going to be an interesting topic.