Consider some nonlinear function y = f(x), colored gray below. I am interested in estimating the true value of x (x_true) by measuring the function's output y, which is corrupted by noise, so instead of the noise-free value f(x_true) we observe a value y_0, as you can see below.
At about age 10, we learned how to solve this problem in closed form if f were linear and the measurement were noise-free: divide y_0 - f(x_guess), the difference between the observed and expected value of the output y, by the slope of the function, which I call H here. Checking the intuition: the slope H is Δy/Δx, so Δy / H = Δx, as long as H is not zero. Well, it turns out that LS is fundamentally just the same idea, with fancy-sounding elaborations.
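To make the scalar version concrete, here is a minimal sketch; the function f, the true value, and the starting guess are all made-up for illustration:

```python
# Hypothetical nonlinear function, purely for illustration.
def f(x):
    return x**3 + 2.0 * x

# Numerical slope H at x (the one-dimensional "Jacobian").
def slope(x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x_true = 1.5
y_0 = f(x_true)                 # noise-free observation, for this sketch only
x_guess = 1.0

H = slope(x_guess)
dx = (y_0 - f(x_guess)) / H     # delta_y / H = delta_x, valid only if H != 0
print(x_guess + dx)             # refined estimate, much closer to x_true than x_guess
```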
First, we pick a reasonable starting point x[0], ignore the fact that the function f is nonlinear, and just estimate the slope H at x[0]. Even so, we can still get a refined estimate x[1], which is closer to the truth than the starting guess. We can then re-estimate the slope at x[1] and repeat the process. Unfortunately, the noise will introduce a sizable error in the estimate we converge to. Usually it is difficult to reduce the noise any further (assuming good HW engineers have worked on the problem for a reasonable period), but note that the noise in the single observation pulled us in one direction. So what if we had another measurement that was tainted by noise in such a way as to be off in the opposite direction? Then the average of the noise-corrupted corrections would be much closer to the noise-free value! Since we can't dictate that the next measurement be off in the opposite direction of the first (in the noise sense), we have to ensure that the noise is unbiased, and increase the sample size enough to reduce the aggregate contribution of the noise to an acceptable level.
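Here is a sketch of that iteration with noisy observations, continuing the illustrative scalar example; the noise level and sample size are also made-up:

```python
import numpy as np

# Illustrative nonlinear function and its numerical slope, as before.
def f(x):
    return x**3 + 2.0 * x

def slope(x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

rng = np.random.default_rng(0)
x_true, sigma, n_obs = 1.5, 0.05, 50
y_noisy = f(x_true) + rng.normal(0.0, sigma, size=n_obs)   # unbiased noise

x = 1.0                                  # starting point x[0]
for _ in range(5):
    H = slope(x)
    dx = np.mean((y_noisy - f(x)) / H)   # average the corrections implied by all observations
    x = x + dx                           # refined estimate x[k+1]
print(x)   # close to x_true; the leftover error comes from the averaged noise
```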
LS is the framework for fitting a large sample of observations (usually much greater in number than the parameters to estimate). Picture the second iteration of the above effort, but now with 2 observations instead of just 1, as shown below.
In general, each observation will have a different operating point, and therefore a different "slope" H. The matrix H is now generalized to stack up all such slopes, one per observation, and we solve the associated normal equations involving the H matrix and the vector of residuals (y - f(x[1])). As long as the function is convex, we approach the true value rapidly, after only a few iterations. Leaving aside the history (LS goes back to Gauss) or the math, the result just seems magical when applied to a crazy nonlinear problem (like rigid-body rotation). Here is an example of how the expected and observed feature points (in a camera image) line up after only 4 iterations.
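As a toy sketch of that stacked-slope iteration, here is the same idea applied to a made-up two-parameter model; the model, data, starting point, and iteration count are all illustrative, not the camera problem from the figure:

```python
import numpy as np

# Illustrative model: y_i = a * exp(b * t_i), observed with unbiased noise.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0, 25)
a_true, b_true = 2.0, 1.3
y_obs = a_true * np.exp(b_true * t) + rng.normal(0.0, 0.1, size=t.size)

def model(params):
    a, b = params
    return a * np.exp(b * t)

def jacobian(params):
    a, b = params
    # Each row is the "slope" of one observation with respect to the parameters.
    return np.column_stack([np.exp(b * t), a * t * np.exp(b * t)])

x = np.array([1.0, 1.0])                    # starting point x[0]
for _ in range(8):
    r = y_obs - model(x)                    # residuals y - f(x[k])
    H = jacobian(x)                         # stacked slopes, one row per observation
    dx = np.linalg.solve(H.T @ H, H.T @ r)  # normal equations: (H^T H) dx = H^T r
    x = x + dx
print(x)   # settles near (a_true, b_true) for this illustrative setup
```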
Even with 4 out of 7 model parameters held at some random values, the residual error is uncannily small, after only 4 iterations through a crazily nonlinear function.
Even though we are not taking a simple inverse any more, fundamentally we still need the slope to be "non-zero"; the technical term for this is that H is full-rank. Even when you do lose rank, there are some things to try after recovering from the disappointment:
- Drop the H column(s) that lost rank, and only solve for the parameters we do have rank for, hoping that we will recover full rank at the new linearization point (see the sketch after this list).
- Try a different starting point.
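Here is a sketch of the first fallback; the rank-deficient H, the residual vector r, and the choice of which column to drop are entirely illustrative:

```python
import numpy as np

# Illustrative H whose last two columns are identical, so it has rank 2, not 3.
H = np.array([[1.0, 2.0, 2.0],
              [2.0, 1.0, 1.0],
              [3.0, 4.0, 4.0]])
r = np.array([0.5, -0.2, 0.1])     # illustrative residual vector y - f(x[k])

if np.linalg.matrix_rank(H) < H.shape[1]:
    keep = [0, 1]                               # drop the offending column (index 2 here)
    H_sub = H[:, keep]
    dx_sub = np.linalg.solve(H_sub.T @ H_sub, H_sub.T @ r)
    dx = np.zeros(H.shape[1])
    dx[keep] = dx_sub                           # dropped parameters get no update this iteration
else:
    dx = np.linalg.solve(H.T @ H, H.T @ r)      # the usual full-rank normal equations
print(dx)
```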