Sep 28, 2017

Aligning 2 AR device local frames through calibration

I spent the last 6 months self-studying computer vision and nonlinear estimation methods--to align the local frames of 2 AR devices in a peer-to-peer AR game.  I am going to stop this phase of my self-study activities for a while, so I want to document what I've done so far.

Problem description

ARKit (or ARCore) will tell you the current pose (position and orientation) of your device in its local frame (position is initialized to 0 where ARKit starts).  Let's call your device A, and your friend's device B, and imagine that you are trying to play an AR video game where you need to look at some objects and be presented with a marker that is spatially accurate.  If your friend also runs an ARKit on her phone, she also has her own local frame, written as lA and lB in the schematic below.
The frames involved in aligning devices A's and B's local frames.  Note that the frames are left-handed because I write Unity games.  I adopt the pose notation from Paul Groves' seminal textbook on GNSS, which contains a careful and thorough explanation of rigid body dynamics.
As you can see, lA and lB are not aligned: nobody knows what the relative position offset r^lA_lAlB and rotation C^lA_lB are.  If we could estimate these 2 quantities (a Vector3 and a Quaternion for me), we could then align the 2 devices.

1st attempt: A and B face each other, with torch lit, and spin around each other

The observable: blob angle (α, β), i.e. the blob's offset on the camera image divided by the focal length (camera intrinsic)

I first tried detecting another phone's torch as a blob with OpenCV's blob detector last December.  Automatically adjusting the torch brightness based on the ambient scene intensity observed by the other phone is challenging enough (it requires network communication--which I solved with Apple's Multipeer Connectivity framework), but the fact that the iOS torch brightness granularity is atrocious meant many rounds of trial and error on my part.  When I finally got it neither too bright nor too dim, the torch is observable as a blob that is quite close to the remote phone's camera frame.  In A's camera frame, B's torch is therefore at pose r^A_AB, C^A_B.  That pose is observed on A's camera at pixel location p(x, y), shown below.
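
For reference, here is roughly what that blob detection looks like with OpenCV (a minimal Python sketch, not my on-device code; the thresholds, area limits, and the file name are placeholders that would need the tuning described above):

    import cv2

    # Detect a bright blob (the remote phone's torch) in a grayscale camera frame.
    # All parameter values below are illustrative; the real ones depend on the
    # torch brightness and the ambient scene intensity.
    params = cv2.SimpleBlobDetector_Params()
    params.filterByColor = True
    params.blobColor = 255        # look for bright blobs
    params.minThreshold = 200
    params.maxThreshold = 255
    params.filterByArea = True
    params.minArea = 5
    params.maxArea = 500
    detector = cv2.SimpleBlobDetector_create(params)

    gray = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder image
    keypoints = detector.detect(gray)
    if keypoints:
        # Take the largest blob as the torch candidate; keypoint.pt is (x, y) in pixels
        p = max(keypoints, key=lambda k: k.size).pt
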
I verified experimentally that the pixel origin is at the upper left corner.  Along with the device pose, ARKit also reports the camera intrinsic matrix K (containing the horizontal/vertical focal lengths and the camera center--all in [pixel] units), so that given the pixel location of the blob p(x, y), the angle from A's camera optical axis to the blob is calculated element-wise as (α, β) = atan2(p − c, f), where p, c, and f are all 2D (vector) quantities.
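
In code, that conversion is only a few lines (a numpy sketch; the function name is mine):

    import numpy as np

    def blob_angle(p, K):
        """Angle (alpha, beta) from the camera optical axis to a blob at pixel
        p = (x, y), given the intrinsic matrix K reported by ARKit (pixel units,
        origin at the upper-left corner)."""
        fx, fy = K[0, 0], K[1, 1]          # horizontal/vertical focal lengths
        cx, cy = K[0, 2], K[1, 2]          # camera center
        alpha = np.arctan2(p[0] - cx, fx)  # horizontal angle
        beta = np.arctan2(p[1] - cy, fy)   # vertical angle
        return alpha, beta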

The expected: blob angle (α, β) according to the rigid body transformation

Using a series of rigid body transformations, I can express B's pose in A's camera frame, as developed below.  Assuming that B's torch is at the origin of B's camera frame, I can take that into A's local frame lA using the unknown quantities r^lA_lAlB and C^lA_lB:
r^lA_lAB = C^lA_lB r^lB_lBB + r^lA_lAlB

Next, take that into the frame A:
r^A_AB = C^A_lA ( r^lA_lAB − r^lA_lAA )

The expected observation is then
z_AB = atan2( [x, y] of r^A_AB, z of r^A_AB ), applied element-wise
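
Putting the chain together, the expected observation function looks roughly like this (a sketch that writes the rotations as 3x3 direction cosine matrices instead of the Unity Quaternions I actually use; the argument names mirror the notation above and are only illustrative):

    import numpy as np

    def expected_blob_angle(r_lAlB_lA, C_lB_to_lA, r_lBB_lB, r_lAA_lA, C_lA_to_A):
        """Expected (alpha, beta) of B's torch as seen by A's camera.
        Positions are 3-vectors, rotations are 3x3 direction cosine matrices."""
        # B's torch position, taken into A's local frame lA
        r_lAB_lA = C_lB_to_lA @ r_lBB_lB + r_lAlB_lA
        # ... and then into A's camera frame
        r_AB_A = C_lA_to_A @ (r_lAB_lA - r_lAA_lA)
        # Expected observation: angles off the optical (z) axis
        return np.array([np.arctan2(r_AB_A[0], r_AB_A[2]),
                         np.arctan2(r_AB_A[1], r_AB_A[2])])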

Synchronizing A and B poses

Equations are generally cheap to write down, but implementing such equations on a physical device usually takes a lot of care and effort.  Note that the above equation mixes ARKit-reported poses and observations made on 2 different devices, which have different clocks.  Even though networked modern phones run NTP, they will only be synchronized to within roughly a second.  Here is the scheme I used to interpolate the ARKit reports between A and B.

First, the records are tagged with local time (which is monotonically increasing), as shown below:
Δclk is a state (a parameter to estimate).  In this example, I align B's records to A's using the current estimate of Δclk.
Using the current estimate of Δclk, I can map the other device's records onto the local device's timeline.  These records of course do NOT align, so I (linearly) interpolate, as shown below:
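
In code, the mapping and the interpolation of the position records look roughly like this (a numpy sketch with my own names; quaternions need the extra care described later in this post, so they are left out here):

    import numpy as np

    def remote_positions_at(t_local, records_remote, dclk):
        """Map the remote device's records onto the local timeline using the
        current estimate of the clock offset dclk, then linearly interpolate
        the remote positions at the local timestamps t_local.
        records_remote is a list of (t_remote, position) tuples."""
        t_remote = np.array([t for t, _ in records_remote]) + dclk   # remote -> local time
        pos_remote = np.array([p for _, p in records_remote])        # N x 3
        return np.stack([np.interp(t_local, t_remote, pos_remote[:, i])
                         for i in range(3)], axis=1)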

Caution: the inverse of r^lA_lAlB is NOT −r^lB_lBlA

From B's perspective, the above equation is completely valid if I swap A and B.  But when calculating r^lB_lBlA, do NOT assume (as I did at first) that it is the same as −r^lA_lAlB; the correct relation is r^lB_lBlA = −C^lB_lA r^lA_lAlB.  Chasing this bug for a few days drilled into my obtuse brain the absolute necessity of keeping track of the resolving reference frame: when I change the resolving frame, I need to apply the rotation between the 2 different resolving frames!

Least squares?  Not just yet!  Initialize first

As soon as I have a bunch of observations and expected observations for some unknown parameters, I immediately think of the method of least squares, which I explained in a previous blog entry.  To tempt one even more into this line of thinking, the remote phone's blob is strongly reminiscent of the Ceres observations that motivated Gauss to invent the method of least squares in the first place!  The problem is that a general rigid body transformation is a crazily nonlinear problem, and divergence is highly likely when using a linear approximation (the Gauss-Newton method).  Luckily, when the 2 devices are facing each other, there is a strong constraint: the average orientation of the other device is quite close to 180° around the vertical, and the distance is somewhere between 2~4 m (too close or too far, and the blob detector fails).  Using this constraint, you can solve for r^lA_lAlB and C^lA_lB values that yield a good initial guess of r^A_AB and C^A_B, as shown by the A and B observation residuals below:
This is actually the initial residual for the 2nd approach (discussed below) when A and B are not moving rapidly in space.  Can you spot the unobservability of the problem from these residuals?

I blew by this rather fast, but there is a huge gotcha you won't find in many places.  When averaging or interpolating Quaternions (Unity uses Quaternions rather than rotation matrices), you can't just do a SLERP (spherical linear interpolation) naively, because of "quaternion jumps": when you write out a quaternion as [x, y, z, w] (or the other order, as is often done), it represents the same rotation as [-x, -y, -z, -w].  So two quaternions that are in fact almost the same may look like very different numbers.  And you say: "Ah, I see why they use SLERP, which uses the angle-axis representation to interpolate between two angles"--and you would be wrong, because the angle-axis representation suffers from the same minus-sign ambiguity (flipping the axis of rotation is the same as flipping the sign of the rotation angle).  What I wound up doing is:

  1. Generate the 2 different representations of the difference quaternion dQ between those 2 quaternions (call them dQ1 and dQ2),
  2. Calculate the angle-axis of those 2 possible rotations,
  3. Choose the angle-axis with the smaller magnitude,
  4. Then apply the interpolation and average.
I actually ran into the "Q is the same as -Q" problem a year ago, but did not fully understand the problem or devise a solution until this hobby project.
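
Here is a minimal numpy sketch of the 4 steps above, using the [x, y, z, w] convention (the helper functions are mine, written for clarity rather than performance):

    import numpy as np

    def quat_conj(q):
        # q = [x, y, z, w]; conjugate (= inverse for a unit quaternion)
        return np.array([-q[0], -q[1], -q[2], q[3]])

    def quat_mul(a, b):
        # Hamilton product in [x, y, z, w] order
        ax, ay, az, aw = a
        bx, by, bz, bw = b
        return np.array([aw*bx + ax*bw + ay*bz - az*by,
                         aw*by - ax*bz + ay*bw + az*bx,
                         aw*bz + ax*by - ay*bx + az*bw,
                         aw*bw - ax*bx - ay*by - az*bz])

    def to_angle_axis(q):
        # Angle-axis of a unit quaternion, WITHOUT forcing the angle into [0, pi]
        v = np.array(q[:3], dtype=float)
        n = np.linalg.norm(v)
        angle = 2.0 * np.arctan2(n, q[3])
        axis = v / n if n > 1e-12 else np.array([0.0, 0.0, 1.0])
        return angle, axis

    def quat_interpolate(q1, q2, s):
        # Interpolate a fraction s of the way from q1 to q2
        dq = quat_mul(q2, quat_conj(q1))                       # 1. difference dQ...
        candidates = [to_angle_axis(dq), to_angle_axis(-dq)]   #    ...in its 2 representations
        angle, axis = min(candidates, key=lambda c: c[0])      # 2-3. smaller angle-axis wins
        half = 0.5 * s * angle                                 # 4. rebuild the partial rotation
        d_frac = np.concatenate([np.sin(half) * axis, [np.cos(half)]])
        return quat_mul(d_frac, q1)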

Gauss-Newton iterations

Among nonlinear least squares methods, Gauss-Newton is the simplest: it just requires the Jacobian, which is the matrix of partial derivatives of the residual with respect to each of the parameters to estimate.  Since I have multiple observations (hopefully many more than parameters to estimate), the Jacobian is a "tall" matrix.  As you can imagine when you stare at the derivation of the expected observation, analytical derivation of the Jacobian is difficult, but fortunately, a numerical estimate of the Jacobian usually works quite well.  If the i'th observation is Zi = fi(X), where X = [x1, ..., xn], the i'th row of the Jacobian J is the row vector
Ji = [ {fi([x1 + δx1, ..., xn]) − fi([x1, ..., xn])} / δx1, ..., {fi([x1, ..., xn + δxn]) − fi([x1, ..., xn])} / δxn ]
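
In code, the forward-difference Jacobian is just a loop over the parameters (a sketch; the perturbation size is an arbitrary but typical choice):

    import numpy as np

    def numerical_jacobian(f, x, dx=1e-6):
        """Forward-difference Jacobian of f at x.  f maps the n parameters to the
        m expected observations; dx is the perturbation applied to each parameter."""
        x = np.asarray(x, dtype=float)
        f0 = np.asarray(f(x))
        J = np.zeros((f0.size, x.size))
        for j in range(x.size):
            x_pert = x.copy()
            x_pert[j] += dx
            J[:, j] = (np.asarray(f(x_pert)) - f0) / dx
        return J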

Where I think analytical derivation of the partial derivatives can be helpful is in obtaining the Hessian (second derivative) necessary for Newton's method, which can converge much faster than any other method (at the risk of some ringing).

The Gauss-Newton correction is the least squares solution of J δX = εZ, which is J* (the pseudoinverse of J) multiplied by the residual εZ:
δX = J* εZ
which is also the minimum-norm solution (that is, ⎮δX⎮ is the smallest among all solutions that satisfy J δX = εZ).  Matlab's backslash operator (\) instead yields a sparse (basic) solution, which has many zeros in it--not quite the same thing!
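
In numpy terms, one Gauss-Newton correction with the minimum-norm property is simply (a sketch; np.linalg.lstsq would return the same minimum-norm answer even when J is rank deficient):

    import numpy as np

    # One Gauss-Newton correction: the minimum-norm solution of J dX = eZ.
    # (Unlike Matlab's backslash, numpy's pinv/lstsq return the minimum-norm
    # least-squares solution even for a rank-deficient J.)
    def gauss_newton_step(J, eZ):
        return np.linalg.pinv(J) @ eZ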

Boldly solving an unobservable problem--heuristically

If you have an intuition about this sort of geometry problem, you will have caught on to the fact that this is an unobservable problem: from a given observation, rotation cannot be teased apart from translation.  Try it for yourself: imagine translating B relative to A vs. rotating B about the vertical: A's blob will move in a similar way!  Observability is only restored when you have a lot of observations at different distances, because the distance acts as a lever arm.  Unfortunately, it is quite difficult to ensure a sufficient distribution of distances during calibration.

Another problem is the weak observability of translation (change in r^A_AB) relative to rotation (change in C^A_B): except when the devices are quite close, the distance acting as a lever arm amplifies the change in rotation, so that even with a minimum-norm solution, practically only the rotation gets corrected.  As I read in this textbook on least squares, this situation is distantly similar to having a large condition number (the ratio between the greatest and the least singular values of the matrix J), and a standard technique for dealing with it is Tikhonov regularization, which is to minimize
⎮J δX − εZ⎮² + λ⎮δX⎮²
rather than the original cost function ⎮J δX − εZ⎮².  The intuition is to artificially boost all the singular values: for the large singular values the artificial boost is negligible, while the weak singular values are "saved" by the boost.  But to address the rotation drowning out the translation, I used unequal boosting for rotation vs. translation, so that my regularization cost function is
⎮J δX − εZ⎮² + δX' Λ δX
where Λ is a diagonal weight matrix.  The solution to this regularized problem is
δX = (J' J + Λ)⁻¹ J' εZ, where J' is J transposed.

Even this technique was not enough, so I then tried correcting the rotation and the translation on alternating correction steps.

Torpedoed by SLAM's static scenery assumption

The result of all this was quite unsatisfactory when the 2 devices spun around each other, because a rather large object (a person) in the middle of the scene that appears to stay put while the device rotates confuses ARKit, as you can see here:
Device A is spinning around a coffee table, so the yaw angle (top) is monotonically increasing.  But ARKit is quite confused about A's position (the trajectory should trace out a circle!), because my wife (B) is apparently static in the scene.  Even though it does the best it can, the residuals are huge at 2 m inter-device distance.

2nd attempt: just tip and rotate the phone in hand while sitting still

After realizing the root cause of the horrendous residual, I tried just sitting still and rotating the device gently in my hand, so that the observed blob traces out a rectangle on the camera image.  With that, I was able to reduce the residual down to an acceptable level (~0.03 radians).  The problem was that moving the phones so gently required some practice, and it took me 1~2 minutes to gather enough quality observations to throw into the least squares solver derived above.

Since my target application is a consumer-facing video game, I reasoned that this is impractical, and looked for...

3rd attempt: A and B face the same direction, and share features

Detect features

A surer way NOT to confuse ARKit is to clear away any moving object from ARKit's scene--for example by having both A and B face the same direction.  Of course, then I have no blob to observe.  What I can observe instead are the feature points that are commonly used in feature tracking algorithms; OpenCV packs no fewer than a dozen such algorithms: FAST, SIFT, SURF, ORB, just to name a few.  In fact, ARKit (like all known SLAM methods) itself uses such feature detectors; ARKit even emits the positions (but not the descriptors) of those features as a point cloud.  I thought about aligning the point clouds from A and B by minimizing the Euclidean distance between those points, but realized it would be computationally intractable for any non-trivial offset between A and B.  So I set about running a feature detector myself (outside of ARKit).
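
For example, with ORB (a minimal OpenCV sketch; the file name and the feature count are placeholders):

    import cv2

    # Detect feature points and descriptors on one phone's camera frame with ORB;
    # "frame_A.png" is a placeholder for the live camera image.
    orb = cv2.ORB_create(nfeatures=200)     # keep only the strongest 200 features
    gray = cv2.imread("frame_A.png", cv2.IMREAD_GRAYSCALE)
    keypoints_A, descriptors_A = orb.detectAndCompute(gray, None)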

Match features

I borrowed heavily from the stereo correspondence problem, as explained in Hartley and Zisserman: when situated pretty close together and facing the same direction, A's and B's images are like the left and right images of a stereo camera.  With little effort, I can pick up a lot of features that should appear in both A's and B's images.  If B sends the top 200 such feature points (with descriptors) to A, then A can run OpenCV's DescriptorMatcher--even a brute-force matcher is OK--to pick out candidate matches.
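
A sketch of that matching step (descriptors_A comes from the ORB sketch above, and descriptors_B is assumed to have arrived over the network; ORB descriptors are binary, hence the Hamming norm):

    import cv2

    # Brute-force matching of B's descriptors (received over the network) against A's
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(descriptors_B, descriptors_A)    # query = B, train = A
    matches = sorted(matches, key=lambda m: m.distance)      # best candidates first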

Solve for the fundamental matrix

Such matches are corrupted by outliers--"crazy" matches--because the matchers really don't have much context to work with (certainly much less than the human vision system).  So when solving the stereo correspondence problem, we need a robust (outlier-immune) algorithm such as RANSAC (random sample consensus).  [Some people report better results with PROSAC--a derivative of RANSAC.]  Once again, OpenCV does the heavy lifting in the findFundamentalMat() function, which takes the 2 sets of matched feature points.  Note that the solution F is in [pixel] units.
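
A sketch of that robust fit (keypoints_A, keypoints_B, and matches are assumed from the previous sketches; the RANSAC threshold and confidence are just typical values):

    import cv2
    import numpy as np

    # Pixel coordinates of the candidate matches (query = B, train = A, as above)
    pts_A = np.float32([keypoints_A[m.trainIdx].pt for m in matches])
    pts_B = np.float32([keypoints_B[m.queryIdx].pt for m in matches])

    # RANSAC rejects the "crazy" matches as outliers while fitting F
    F, inlier_mask = cv2.findFundamentalMat(pts_A, pts_B, cv2.FM_RANSAC, 3.0, 0.99)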

Recover the essential matrix

The intrinsic matrix K from ARKit is then applied to F to obtain E (assuming the 2 phones have essentially the same K; in general E = K'ᵀ F K with each phone's own intrinsics):
E = Kᵀ F K
E encodes the translation and rotation by definition: E = T^ R, where T^ is the skew-symmetric matrix of the translation t = (tx, ty, tz):
      0  -tz   ty
T^ =  tz   0  -tx
     -ty   tx   0
When I decompose E with SVD, i.e.
E = U Σ Vᵀ
there are 2 possible solutions to consider:
(T^, R) = (U W Σ Uᵀ, U Wᵀ Vᵀ) or
(T^, R) = (U Wᵀ Σ Uᵀ, U W Vᵀ)
where
    0 -1  0
W = 1  0  0
    0  0  1

The solution that puts more features in front of the camera (z > 0) should be chosen; the overall sign of t is also ambiguous and is resolved by the same check.
Note that the resulting translation is only determined up to an unknown scale factor, rather than in [m].  This is because projective geometry alone cannot resolve the scale; in VIO (visual inertial odometry, of which ARKit is an implementation), the inertial sensor is brought in to establish scale.  This suggests that if I run the above algorithm twice with ARKit turned on, I can use the change in position reported by ARKit to recover the scale factor.
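
A sketch of these last 2 steps (K is assumed to be the 3x3 ARKit intrinsic matrix as a numpy array; F, pts_A, pts_B, and inlier_mask come from the previous sketch; cv2.recoverPose performs the in-front-of-the-camera check internally):

    import cv2

    # Essential matrix from F and the (shared) intrinsics, then recover R and t
    E = K.T @ F @ K
    inliers = inlier_mask.ravel() == 1
    n_inliers, R, t, pose_mask = cv2.recoverPose(E, pts_A[inliers], pts_B[inliers], K)
    # |t| is 1 by convention: the metric scale has to come from elsewhere (e.g. ARKit's VIO)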

Unity implementation

Incomplete--I did not do this because...

Why I am stopping now

My gut feeling is that users will not, or cannot, perform any calibration lasting more than 10 seconds: compliance will be low, and they can't do it well even if they want to.  So even the stereo correspondence method is problematic.  I think the right solution is to always share the high quality features between devices and build a joint map.  This is what HoloLens already does, but it is just too much effort for me to take on (I would have to really learn the details of the "M" in SLAM), and I have low-level projects (power electronics and a hard real-time AMP architecture for Raspberry Pi) that I've neglected for too long.

I'll revisit this "Shootout" project when there is an AR engine that shares a global map in a manner completely transparent to the application developer.

Sep 17, 2017

Caught between (ground) truth and the (LS) fit

Since my last blog post, I've been trying to apply the method of least squares (LS) to the problem of aligning the local frames of 2 different phones running an ARKit based UI (in fact, a Unity game).  Along the way, I had to understand the differences between various nonlinear LS methods: the simplest of them all is the Gauss-Newton method, which requires just a Jacobian (the matrix of partial derivatives of the observables with respect to the state elements), but can't deal with a rank-deficient Jacobian.  Also, for some random initial state guesses, the iteration jumps around wildly, which is a bit unsettling.  Originally, I tried to write out the Jacobian terms symbolically, but because the geometry involves 5 different frames, taking the derivative of so many Quaternion-laden terms fills an entire page--for just 1 term.  So I settled on a numerical method instead.  Given the considerable difficulty with even the single partial derivatives, I wasn't going to try my luck with the double partial derivatives necessary for the Hessian--which Newton's method requires.

But what really sucked about my problem is the lack of observability--because the state has both translation and rotation, which are indistinguishable or only weakly distinguishable, the iterations can "wander" in the infinite space of solutions and not converge.  Upon further reading, I learned about the difference between the maximally sparse solution of LS (Matlab's "\" operator) and the minimum-norm solution (using the pseudoinverse).  And to deal with the poor conditioning, I used Tikhonov regularization.  I have yet to come across any discussion of observability in a least squares context, so I don't know whether to be pleased or disappointed at the result I got:
The solution after 30 iterations of (Tikhonov) regularized LS, with the 6 DOF between the 2 device local frames as the states, and the other phone's blob location on the camera image as the observable.  (I wrote about detecting a remote phone's torch using OpenCV last year.)  On the top, the blue circles are my phone's poses within its own local frame, while the red "x" and the dotted lines are my wife's phone's poses at the observation instants.  The black "x" and attached lines are my wife's phone's pose in MY camera's frame.  Since the 2 phones are merely rotating, the black "x" should stay at a fixed Z distance: about 2 m, which is clearly not what the LS solution settled on.
I think I should be disappointed, because this solution settled on a wrong distance between the 2 devices: the ground truth is approximately 2 m, but as you can see above, the LS settled on around 6 m.  But it's not the LS that I should be disappointed in: it's my fault for not thinking through the observability problem up front.  What can I say?  I learn by failing.  But I do believe smart researchers could better warn students about these practical problems.  Maybe they just don't have the energy to write about such practical considerations after painstakingly deriving all those equations.  That's why I appreciate Topics in Astrodynamics even more (I was originally motivated to study least squares when I learned about the differential correction method, AKA the batch filter, which is well explained in Chapter 15 of that book): there, at least, is a mention of the need for a better model when the residual is unacceptably large.

Sep 5, 2017

Path to a multiplayer AR game

For slightly less than a year, I've been working on a hobby project to create a peer-to-peer AR game on an iPhone.  I first looked into indoor localization using audible chirps (linear frequency modulation, in technical terms).  I worked out the signal processing, borrowing heavily from radar fundamentals, and wrote about it in a blog entry.  But when I wrote a CoreAudio based program on my phone, I found out in March of this year (about 6 months into the effort) that the received audio signals were so heavily distorted and attenuated as to be quite weak.  I then pivoted toward a SLAM based approach, and have been trying to solve the problem with the device pose estimates from ARKit, as you can see in the picture of the coordinate frames and the observable in my problem definition.
Just last week, Google released ARCore to keep up with Apple.  Last night, I read about it in this review by Matt Miesnieks.

I was impressed by Mr. Miesnieks' technical depth and breadth in this area.  His comments about the difficulty of multi-player AR struck me as spot on, and would normally have disheartened me.  But thankfully, I reached a mini-milestone: I collected 2 minutes of calibration session data from my wife's phone and mine, and put it through the 1st stage of the AR map alignment algorithm I have been working on, as you can see below:
o: measurement of the other device's normalized pixel location on each of the 2 phones in the 1-1 calibration session.
x: expected normalized pixel location based on the estimate of the 6-DOF offset between the 2 phones.
Since this fit is BEFORE the least squares based 2nd stage of the alignment algorithm, this is encouraging! And here is the result of 10 iterations of nonlinear least squares (which I explained in a previous blog entry), which shows that I am on the right track!
Residual innovation (measured vs. predicted) after 10 iterations of nonlinear least squares.
But even here, I can see that my model is not accurate enough to bring the residual low enough for high quality interactive AR game play, suggesting 2 things:
  1. The model should be expanded by at least another 3 states (maybe even 6) beyond the 3 states I am currently estimating.
  2. At least some of the estimates may have to be updated even after the initial convergence.
The 2nd is going to be particularly painful, so I want to first evaluate how quickly ARKit drifts after the initial convergence.

Although this is supposed to be a tough problem,  I've been learning and reviewing many things I learned in school in this hobby project.  I am curious whether I can actually pull off an algorithm that is as difficult as Mr. Miesnieks suggests; I'll find out in the next couple of months.