Jun 6, 2017

Understanding the EKF SLAM

In April of this year, I attended the AR (augmented reality) workshop at Stanford, and learned about the current issues and directions of the AR community.  I was curious whether I could just use VI (visual-inertial) SLAM (simultaneous localization and mapping) for the AR video game I want to write, and got confirmation that this would be possible after talking to Renato F. Salas-Moreno.  Since I work with inertial sensors in my current job, VI SLAM is particularly interesting.  While VI SLAM on a single device is amply demonstrated by a slew of cutting-edge products like Google Tango and Microsoft HoloLens, the interactive AR video game I have in mind requires merging the maps produced by multiple devices in the same room.  I learned from Marc Pollefeys that HoloLens solves the problem by using the map of a "master" device.  I got the feeling that this is not a completely solved problem, but since I am new to the whole SLAM problem, I am going to work through the EKF (extended Kalman filter) example from the online SLAM course, which uses the notation in the Probabilistic Robotics book.  In the equations to follow, I am forced to use the underline for vectors, instead of the usual overbar notation, because of the poor formatting capability of the Blogger editor.

State formulation

In SLAM, the device's pose (position and orientation) x, and the positions (but not orientations) of various landmarks (loosely defined as things that can be identified from camera images) mi--both in the world frame--are captured as the state vector 𝛍true = [xtrue; mtrue1; ...; mtrueN], or more compactly 𝛍true = [xtrue; mtrue].  Since I don't know where the device or the landmarks are initially, my initial state estimate is 𝛍[0] = [0; 0].  Most texts annotate this vector with a hat (^) for the estimate, but again due to the formatting constraint in the Blogger editor, I will just omit the hat and instead denote the (unknowable) true value with the superscript true, as I've done above.

Ignoring for now the management of the landmark associations (they are sometimes unobservable as the device moves around the world frame), it is clear that the initial covariance matrix 𝚺mm[0] = diag(∞, ..., ∞), because at the outset I know nothing about the landmark positions--except that the landmark positions are independent of each other, and of the device pose, for that matter.  So 𝚺𝛍𝛍[0] = diag(0, 0, 0, ∞, ..., ∞).  Note that this initialization of the device pose covariance (setting it to 0) is different from what may be found in other Kalman filter textbooks (which use ∞) because of the different initialization method in the book (coming up shortly below).  In the linear (the original) Kalman filter formulation, the state evolution and observable generation are linear stochastic processes, but in the EKF, they are Gaussian nonlinear processes (that can nevertheless be linearized):
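As a concrete sketch of this initialization (in Python with NumPy--my choice here, not necessarily the course's language), with `LARGE` standing in for ∞ since a literal infinity would break the matrix arithmetic later:

```python
import numpy as np

N = 2          # number of landmarks (assumed for illustration)
LARGE = 1e10   # stand-in for the "infinite" initial landmark uncertainty

# State: [x, y, theta, m1x, m1y, ..., mNx, mNy], all starting at zero.
mu = np.zeros(3 + 2 * N)

# Covariance: device pose starts at 0 (the world frame is defined at the
# starting pose); landmark positions start effectively unbounded.
sigma = np.diag([0.0, 0.0, 0.0] + [LARGE] * (2 * N))
```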

                              Kalman Filter                       Extended Kalman Filter
Process (state change) model  𝛍true = A 𝛍true + B u + r           𝛍true = g(𝛍true, u) + r
Measurement model             zk[k] = H 𝛍true + qk[k]             zk[k] = hk(𝛍true) + qk[k]
where r ~ 𝓝(0, R[k]) and qk[k] ~ 𝓝(0, Qk[k]).  The subscript k in the measurement model is necessary in SLAM because landmarks come and go at every time step: to me, it means "pertaining to all observations at time step k".

The zero initial value of the states pretty much assures that the prediction of the state for the next time step is still zero--unless the system is such that zero is not an equilibrium, or there is an external input, AKA control input.  In this example of planar robot movement using the odometry model, there is apparently a non-zero initial external input (the odometry data):
u[0] = [rot1; trans; rot2] = [0.1007, 0.1001, 0.0002]T.

Prediction step

When first studying the Kalman filter, I was confused whether to perform prediction right after outputting the control command, or at the beginning of the next step, right before processing the fresh measurement.  Because of my background in real-time control, I perform non-time-critical tasks such as prediction after outputting the control command (i.e. minimize the time delay between the sensor measurement becoming available and the control command being output).  But in this case, the external input is not the control command but rather another sensor reading (the odometry reading), which is only available at the next time step.  That is why the "prediction" step uses the a-posteriori estimate from the last time step x+[k-1] = [x+[k-1]; y+[k-1]; 𝛉+[k-1]], which in this example:
𝛍-[k] = g(u[k]; π›+[k-1])
For additional clarity, I adopted the "-" and "+" superscripts for the a-priori and a-posteriori estimates.

Even though g(u[k]; 𝛍+[k-1]) can be a BHF (big hairy function) in general, the stationarity assumption for the landmarks (m[k] = m[k-1]) in SLAM simplifies the state update to only the device pose update:
𝛍-[k] = 𝛍+[k-1] + [trans * cos(𝛉+[k-1] + rot1) ; trans * sin(𝛉+[k-1] + rot1) ; rot1 + rot2]
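A minimal sketch of this prediction, assuming the state layout [x, y, 𝛉, m1x, m1y, ...] and the odometry input u = [rot1, trans, rot2]:

```python
import numpy as np

def predict_state(mu, u):
    """Odometry-based prediction: only the 3 pose entries change,
    because the landmarks are assumed stationary."""
    rot1, trans, rot2 = u
    mu = mu.copy()
    theta = mu[2]
    mu[0] += trans * np.cos(theta + rot1)
    mu[1] += trans * np.sin(theta + rot1)
    mu[2] += rot1 + rot2
    # Keep the heading in [-pi, pi) for sane angle arithmetic later.
    mu[2] = (mu[2] + np.pi) % (2 * np.pi) - np.pi
    return mu
```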

Observation prediction: has to wait till landmark observation in SLAM

We will see shortly the a-posteriori update (correction) driven by the sensor measurement (observation).  The prediction of the observation h(𝛍[k]) is only necessary at that time, but in the spirit of calculating everything possible outside the time-critical path from the sensor measurement being available to outputting the control command, I want to pre-calculate h(𝛍[k]) now--except that in SLAM, I don't know a-priori which landmarks I will observe, so I have to wait till discovering all observations for the time step.

Covariance prediction

To update the covariance, I need the Jacobian matrix G-[k]:
G-[k] = ∂𝛍[k] / ∂𝛍[k-1] @ 𝛍+[k-1] = ∂g(u[k]; 𝛍[k-1]) / ∂𝛍[k-1] @ 𝛍+[k-1]
Note that the partial derivative is derived from the prediction (i.e. without the new sensor input), evaluated AT 𝛍+[k-1]--the last optimal state estimate.  Once again, the landmark stationarity and independence assumptions allow simplification:
G-[k] = diag(G-x[k], I2N); where G-x[k] = ∂x[k] / ∂x[k-1] @ x+[k-1]
In the prediction equation above, only 𝛉 affects the change of x, so only dx/d𝛉 and dy/d𝛉 need to be considered for the Jacobian calculation:
dx/d𝛉 = -trans * sin(𝛉 + rot1); dy/d𝛉 = trans * cos(𝛉 + rot1)
G-x[k] = I3 + [0, 0, -trans * sin(𝛉+[k-1] + rot1) ; 0, 0, trans * cos(𝛉+[k-1] + rot1) ; 0, 0, 0]
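The same Jacobian, sketched in NumPy (only the third column differs from the identity):

```python
import numpy as np

def motion_jacobian(theta, u):
    """Pose-block Jacobian G_x of the prediction, evaluated at the last
    heading estimate theta; only the third column differs from I3."""
    rot1, trans, _ = u
    G_x = np.eye(3)
    G_x[0, 2] = -trans * np.sin(theta + rot1)  # dx/dtheta
    G_x[1, 2] =  trans * np.cos(theta + rot1)  # dy/dtheta
    return G_x
```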
I think it is helpful to remember that the separation of the device pose and the landmark positions manifests in the covariance matrix structure as well:
𝚺𝛍𝛍 = [𝚺xx, 𝚺xm ; 𝚺Txm, 𝚺mm], since 𝚺mx = 𝚺Txm.
The covariance prediction step involves this huge matrix (dimension (3+2N) × (3+2N) in this example):
𝚺-𝛍𝛍[k] = G-[k] 𝚺+𝛍𝛍[k-1] G-T[k] + R[k]
In R[k] = diag(R-x[k], 0), R-x characterizes the random noise in the device pose (imagine the device being knocked around by an accident), evaluated at 𝛍-[k].  I have not seen a noise covariance modeled with such fidelity as to be sensitive to the current state estimate, but wanted clarity for now.  Note that unless the physical process has no noise (i.e. R[k] is 0: impossible!), the state uncertainty always grows during the prediction step, as you can see in the following plot of the robot location estimate overlaid against the known landmarks.
Red: prior device pose estimate and uncertainty (a-posteriori estimate from the previous time step)
Magenta: predicted device pose and uncertainty, which is larger than the red ellipse
In SLAM, the first term of the huge covariance prediction above can be broken up into 3 isolated blocks (the off-diagonal blocks are transposes of each other) for simplicity:
G-[k] 𝚺+𝛍𝛍[k-1] G-T[k] = [ G-x[k] 𝚺+xx[k-1] G-xT[k] , G-x[k] 𝚺+xm[k-1] ; (G-x[k] 𝚺+xm[k-1])T , 𝚺+mm[k-1] ]
In fact, the independence of the landmarks allows even more factoring of G-[k] 𝚺+𝛍𝛍[k-1] G-T[k]:
[ G-x[k] 𝚺+xx[k-1] G-xT[k] , G-x[k] 𝚺+x1[k-1] , ... , G-x[k] 𝚺+xN[k-1] ;
  (G-x[k] 𝚺+x1[k-1])T , 𝚺+11[k-1] , ... , 𝚺+1N[k-1] ;
  ... ;
  (G-x[k] 𝚺+xN[k-1])T , 𝚺+N1[k-1] , ... , 𝚺+NN[k-1] ]
This formulation better shows the correlation between the device pose and the individual landmarks, and the inter-landmark correlations; it might be better suited to the case when landmarks rapidly come in and out of view, but might be overkill for the simple case of fixed and unchanging landmarks.
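A sketch of a covariance prediction that exploits this block structure: only the pose-pose block and the pose-landmark cross terms need to be recomputed, while the landmark-landmark block passes through untouched.

```python
import numpy as np

def predict_covariance(sigma, G_x, R_x):
    """Covariance prediction exploiting the block structure: only the
    pose-pose block and the pose-landmark cross terms change; the
    landmark-landmark block is copied through unchanged."""
    sigma = sigma.copy()
    sigma[:3, :3] = G_x @ sigma[:3, :3] @ G_x.T + R_x
    sigma[:3, 3:] = G_x @ sigma[:3, 3:]
    sigma[3:, :3] = sigma[:3, 3:].T
    return sigma
```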

Correction step

In this example problem, the range and bearing of a subset of the known landmarks are observed at each time step.  Assuming that the association problem (which landmark does this observation belong to?) is solved somehow, we can calculate the difference between the measured observation and the predicted observation, which is called the innovation.  If a given landmark was previously observed, its estimated position is already in 𝛍-[k].  But a newly discovered landmark i's state m-i[k] is initialized with yet another function:
m-i[k] = initial(𝛍-[k], zi[k])
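One reasonable choice for this initial function with range-bearing measurements--a sketch of my own assumption, not necessarily the course's implementation--is to project the measurement out from the current pose estimate:

```python
import numpy as np

def initialize_landmark(mu, z_i):
    """Project a (range, bearing) measurement z_i from the current pose
    estimate mu = [x, y, theta, ...] to the new landmark's position."""
    r, phi = z_i
    x, y, theta = mu[0], mu[1], mu[2]
    return np.array([x + r * np.cos(theta + phi),
                     y + r * np.sin(theta + phi)])
```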
There doesn't seem to be a lot of discussion of this initialization function, but if we can initialize, then we can use the completed 𝛍-[k] to calculate the predicted observation hp(𝛍-[k]) and the measurement Jacobian Hp[k], which is the gradient of the observables z[k] WRT 𝛍, evaluated at the a-priori estimate:
Hp[k] = ∂hp(𝛍[k]) / ∂𝛍[k] @ 𝛍-[k]
Just like the process Jacobian, there are many ways to estimate the measurement Jacobian (analytical, numerical, and their respective children), but I've only used analytical models so far.
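For the range-bearing model in this example, the analytical prediction and Jacobian--restricted here to the columns for the pose and landmark j, since the full Hp[k] is zero elsewhere--can be sketched as:

```python
import numpy as np

def predict_measurement(mu, j):
    """Predicted (range, bearing) to landmark j, plus the analytical
    Jacobian restricted to the columns [x, y, theta, mjx, mjy]."""
    x, y, theta = mu[0], mu[1], mu[2]
    mx, my = mu[3 + 2 * j], mu[4 + 2 * j]
    dx, dy = mx - x, my - y
    q = dx * dx + dy * dy            # squared range
    r = np.sqrt(q)
    z_hat = np.array([r, np.arctan2(dy, dx) - theta])
    H = np.array([
        [-dx / r, -dy / r,  0.0,  dx / r,  dy / r],   # d(range)/d(.)
        [ dy / q, -dx / q, -1.0, -dy / q,  dx / q],   # d(bearing)/d(.)
    ])
    return z_hat, H
```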

State correction

Kp[k] is the optimal gain under the zero-mean Gaussian noise hypothesis--called the Kalman gain.
Kp[k] = 𝚺-𝛍𝛍[k] HpT[k] / (Hp[k] 𝚺-𝛍𝛍[k] HpT[k] + Qp[k])
Note that the "/" operator is in fact matrix inversion.  I find Kalman "weight" to be a better reminder of its true nature, because that is how much weight is given to the innovation z[k] - hp(𝛍-[k]) to update the state estimate:
𝛍+[k] = π›-[k] + Kp[k] ( z[k] - hp(𝛍-[k]) )
Note that the unobserved landmarks do not contribute to correcting the state estimate.
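Putting the gain, innovation, and state update together--with the covariance update of the next section included for completeness--a sketch:

```python
import numpy as np

def correct(mu, sigma, z, z_hat, H, Q):
    """One EKF correction: Kalman gain, state update from the innovation,
    and the covariance update (detailed in the next section)."""
    S = H @ sigma @ H.T + Q              # innovation covariance
    K = sigma @ H.T @ np.linalg.inv(S)   # Kalman gain ("weight")
    nu = z - z_hat                       # innovation
    # Wrap the bearing components (odd indices) into [-pi, pi).
    nu[1::2] = (nu[1::2] + np.pi) % (2 * np.pi) - np.pi
    mu = mu + K @ nu
    sigma = (np.eye(len(mu)) - K @ H) @ sigma
    return mu, sigma
```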

Covariance correction

The Kalman weight also contributes to the state covariance correction:
𝚺+pp[k] = (Ipp - Kp[k] Hp[k]) 𝚺-pp[k] = 𝚺-pp[k] - Kp[k] Hp[k] 𝚺-pp[k]
where 𝚺pp = [𝚺xx, 𝚺xm' ; 𝚺Txm', 𝚺m'm'] (with m' denoting the landmarks observed at step k) is the combination of the device-pose-to-device-pose covariance, the device-pose-to-observed-landmark-position covariances, and the observed-landmark-position-to-landmark-position covariances.  Note that the covariance decreases commensurate with the Kalman weight, in reverse of the always-increasing covariance during the prediction step earlier.
Red: predicted device pose estimate and uncertainty (a-priori estimate from this time step)
Magenta: corrected device pose and uncertainty, which is smaller than the red ellipse
Blue: predicted landmark position estimate and uncertainty (a-priori estimate from this time step)
Cyan: corrected landmark position and uncertainty, which is smaller than the blue ellipse
Black: range and bearing measurements at that time step, drawn from the corrected device pose (maybe I should have drawn it from the predicted device pose)
This completely textbook implementation of the EKF leaves much to be desired.  I had to futz around with the sensor and process uncertainties to avoid gross estimation error--to some extent, this is one of the deficiencies of the KF I had already heard about.  But I was surprised by how much the solution drifts as the device moves away from the starting position, which is in the lower left corner below.

If I reduce the sensor covariance too much, the final landmark (at 9,2) doesn't even fall within the 2𝛔 ellipse.  The drift might be due to sensor bias, and is probably uncorrectable without another sensor or a loop closure (coming later in the course).

There is much, much more to the Kalman filter

While reading GNSS books about a year ago, I realized that the Kalman filter that I and many other people know is just the tip of the iceberg, and I am not even talking about the UKF (unscented Kalman filter).  A year ago, I could not have imagined such things as the Kalman smoother, the closed-loop Kalman filter, or the constrained Kalman filter.  I hope to share my future discoveries on such topics with you.
