Sep 28, 2017

Aligning 2 AR device local frames through calibration

I spent the last 6 months self-studying computer vision and nonlinear estimation methods--to align the local frames of 2 AR devices in a peer-to-peer AR game.  I am going to stop this phase of my self-study activities for a while, so I want to document what I've done so far.

Problem description

ARKit (or ARCore) will tell you the current pose (position and orientation) of your device in its local frame (the position is initialized to 0 where ARKit starts).  Let's call your device A and your friend's device B, and imagine that you are playing an AR video game where you need to look at some objects and be presented with a marker that is spatially accurate.  If your friend also runs ARKit on her phone, she has her own local frame; the two local frames are written as lA and lB in the schematic below.
The frames involved in aligning device A's and B's local frames.  Note that the frames are left-handed because I write Unity games.  I adopt the pose notation from Paul Groves' seminal textbook on GNSS, which contains a careful and thorough explanation of rigid body dynamics.
As you can see, lA and lB are not aligned: nobody knows what the relative position offset r^lA_lAlB (the position of lB's origin relative to lA's origin, resolved in lA) and rotation C^lA_lB (the rotation from lB to lA) are.  If we can estimate these 2 quantities (a Vector3 and a Quaternion for me), we can align the 2 devices.

1st attempt: A and B face each other, with torch lit, and spin around each other

The observable: blob angle (α, β), derived from the blob's pixel offset on the camera image and the focal length (camera intrinsic)

I first tried detecting the other phone's torch as a blob with OpenCV's blob detector last December.  Automatically adjusting the torch brightness based on the ambient scene intensity observed by the other phone is challenging enough (it requires network communication--which I solved with Apple's Multipeer Connectivity), but the fact that the iOS torch brightness granularity is atrocious meant many rounds of trial and error on my part.  When I finally got it neither too bright nor too dim, the torch was observable as a blob that sits quite close to the remote phone's camera frame.  In A's camera frame, B's torch is therefore at pose r^A_AB, C^A_B.  That pose is observed on A's camera at pixel location p(x, y), shown below.
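For reference, here is a minimal sketch (in Python with OpenCV, assuming the camera frame is already available as a grayscale array) of the kind of blob detection I mean; the detector parameters are illustrative, not the exact values I used on the phone.

```python
import cv2

def detect_torch_blob(gray_image):
    """Return the pixel location (x, y) of the brightest blob, or None."""
    params = cv2.SimpleBlobDetector_Params()
    params.filterByColor = True
    params.blobColor = 255          # look for bright blobs (the torch)
    params.filterByArea = True
    params.minArea = 20             # reject single-pixel specks
    params.maxArea = 5000           # reject large washed-out regions
    detector = cv2.SimpleBlobDetector_create(params)

    keypoints = detector.detect(gray_image)
    if not keypoints:
        return None
    # If several blobs survive, keep the largest one.
    best = max(keypoints, key=lambda kp: kp.size)
    return best.pt                  # (x, y) in pixels, origin at upper left
```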
I verified experimentally that the pixel origin is at the upper left corner.  Along with the device pose, ARKit also reports the camera intrinsic matrix K (containing the horizontal/vertical focal lengths and the camera center--all in [pixel] units), so that given the pixel location of the blob p(x, y), the angle from A's camera optical axis to the blob is calculated as (α, β) = atan2(p − c, f), where p, c, and f are all 2D (vector) quantities and atan2 is applied component-wise.
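A minimal sketch of that angle computation, assuming K is the usual 3x3 intrinsic matrix (fx, fy on the diagonal, camera center in the last column):

```python
import numpy as np

def blob_angle(p, K):
    """Angles (alpha, beta) from the optical axis to pixel p, component-wise."""
    f = np.array([K[0, 0], K[1, 1]])   # horizontal/vertical focal lengths [pixel]
    c = np.array([K[0, 2], K[1, 2]])   # camera center [pixel]
    return np.arctan2(np.asarray(p, dtype=float) - c, f)
```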

The expected: blob angle (α, β) according to the rigid body transformation

Using a series of rigid body transformations, I can express B's pose in A's camera frame, as developed below.  Assuming that B's torch is at the origin of B's camera frame, I can take that into A's local frame lA using the unknown quantities r^lA_lAlB and C^lA_lB:
r^lA_lAB = C^lA_lB r^lB_lBB + r^lA_lAlB

Next, take that into the frame A:
r^A_AB = C^A_lA ( r^lA_lAB − r^lA_lAA )

The expected observation is then
z_AB = (α, β) = atan2( (x, y) of r^A_AB, z of r^A_AB ), applied component-wise
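Putting the chain together, here is a hedged numpy sketch of the expected observation; rotations are 3x3 matrices, and the variable names (C_lA_lB, r_lA_lAlB, etc.) simply mirror the notation above rather than any real API.

```python
import numpy as np

def expected_blob_angle(r_lA_lAlB, C_lA_lB,      # unknowns being estimated
                        r_lB_lBB, C_lB_B,        # B's pose reported by B's ARKit
                        r_lA_lAA, C_lA_A):       # A's pose reported by A's ARKit
    """Predicted (alpha, beta) of B's torch as seen by A's camera."""
    # B's torch (assumed at B's camera origin) expressed in A's local frame lA
    r_lA_lAB = C_lA_lB @ r_lB_lBB + r_lA_lAlB
    # ... then expressed in A's camera frame (C_A_lA is the transpose of C_lA_A)
    r_A_AB = C_lA_A.T @ (r_lA_lAB - r_lA_lAA)
    # Angles off the optical axis (z forward), component-wise
    return np.arctan2(r_A_AB[:2], r_A_AB[2])
```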

Synchronizing A and B poses

Equations are generally cheap to write down, but implementing such an equation on a physical device usually takes a lot of care and effort.  Note that the above equation mixes ARKit-reported poses and observations made on 2 different devices, which have different clocks.  Even though networked modern phones run NTP, they will only be synchronized to within the order of a second.  Here is the scheme I used to interpolate the ARKit reports between A and B.

First, the records are tagged with local time (which is monotonically increasing), as shown below:
Δclk is a state (a parameter to estimate).  In this example, I align B's records to A's using the current estimate of Δclk.
Using the current estimate of Δclk, I can map the other device's records onto the local device's timeline.  These records of course do NOT align exactly, so I (linearly) interpolate as shown below:
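A minimal sketch of that alignment, assuming each record is a (local_timestamp, position) pair and delta_clk is the current estimate of B's clock minus A's clock; interpolating the orientation is deferred to the quaternion discussion further down.

```python
import numpy as np

def interpolate_remote_position(t_local, remote_records, delta_clk):
    """Linearly interpolate B's position at A's local time t_local.

    remote_records: list of (t_remote, position) sorted by t_remote.
    delta_clk: current estimate of (B's clock - A's clock).
    """
    t_remote = t_local + delta_clk           # map A's time onto B's timeline
    times = np.array([t for t, _ in remote_records])
    positions = np.array([p for _, p in remote_records])
    i = np.searchsorted(times, t_remote)
    if i == 0 or i == len(times):
        return None                          # outside the recorded span; skip
    w = (t_remote - times[i - 1]) / (times[i] - times[i - 1])
    return (1.0 - w) * positions[i - 1] + w * positions[i]
```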

Caution: r^lB_lBlA is NOT −r^lA_lAlB

From B's perspective, the above equation is completely valid if I flip A and B.  But when calculating r^lB_lBlA, do NOT assume (as I did at first) that it is the same as −r^lA_lAlB; the correct relation is r^lB_lBlA = −C^lB_lA r^lA_lAlB.  Chasing this bug for a few days drilled into my obtuse brain the absolute necessity of keeping track of the resolving reference frame: when I change the resolving frame, I need to apply the rotation between the 2 different resolving frames!

Least squares?  Not just yet!  Initialize first

As soon as I have a bunch of observations and expected observations for some unknown parameters, I immediately think of the method of least squares, which I explained in a previous blog entry.  To tempt one even more into this line of thinking, the remote phone's blob strongly resembles the asteroid observations that motivated Gauss to invent the method of least squares in the first place!  The problem is that a general rigid body transformation is a crazy nonlinear problem, and divergence is highly likely when using a linear approximation (the Gauss-Newton method).  Luckily, when the 2 devices are facing each other, there is a strong constraint: the average orientation of the other device is quite close to a 180° rotation about the vertical, and the distance is somewhere between 2~4 m (too close or too far, and the blob detector fails).  Using this constraint, you can solve for r^lA_lAlB and C^lA_lB values that yield a good initial guess of r^A_AB and C^A_B, as shown by the A and B observation residuals below:
This is actually the initial residual for the 2nd approach (discussed below) when A and B are not moving rapidly in space.  Can you spot the unobservability of the problem from these residuals?
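For completeness, here is a hedged numpy sketch of how such an initial guess can be constructed from the facing-each-other constraint using the transform chain above; the specific values (a 180° rotation about the vertical, 3 m straight ahead) are just the heuristic described in the text, and the function and variable names are mine, not any real API.

```python
import numpy as np

def initial_guess(r_lA_lAA, C_lA_A,      # A's pose from A's ARKit
                  r_lB_lBB, C_lB_B):     # B's pose from B's ARKit
    """Rough initial guess of (r_lA_lAlB, C_lA_lB) assuming the devices face
    each other: B sits ~3 m straight ahead of A's camera, rotated 180 deg
    about the vertical (y-up) axis."""
    C_A_B = np.diag([-1.0, 1.0, -1.0])          # 180 deg about the y (up) axis
    r_A_AB = np.array([0.0, 0.0, 3.0])          # 3 m along A's optical axis

    # Chain the rotations: C_lA_lB = C_lA_A * C_A_B * C_B_lB
    C_lA_lB = C_lA_A @ C_A_B @ C_lB_B.T
    # Invert the position equation r_lA_lAB = C_lA_lB r_lB_lBB + r_lA_lAlB
    r_lA_lAB = r_lA_lAA + C_lA_A @ r_A_AB
    r_lA_lAlB = r_lA_lAB - C_lA_lB @ r_lB_lBB
    return r_lA_lAlB, C_lA_lB
```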

I blew by this rather fast, but there is a huge gotcha you won't find in many places.  When averaging or interpolating Quaternions (Unity uses Quaternions rather than rotation matrices), you can't just do a SLERP (spherical linear interpolation), because of "quaternion jumps": when you write out a quaternion as [x, y, z, w] (or the other order, as is often done), it represents the same rotation as [-x, -y, -z, -w].  So two quaternions that are in fact almost the same may look like very different numbers.  And you say: "Ah, I see why they use SLERP, which will use the angle-axis representation to interpolate between the two angles"--and you would be wrong, because the angle-axis representation suffers from the same sign ambiguity (flipping the axis of rotation is the same as flipping the sign of the rotation angle).  What I wound up doing is (see the sketch below this list):

  1. Generate the 2 different representations of the quaternion dQ for the difference between the 2 quaternions (call them dQ1 and dQ2),
  2. Calculate the angle-axis of those 2 possible rotations,
  3. Choose the angle-axis with the smaller rotation magnitude,
  4. Then apply the interpolation and average.
I actually ran into the "Q is the same as -Q" problem a year ago, but did not fully understand the problem or devise a solution until this hobby project.
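Here is a hedged numpy sketch of those 4 steps, using [x, y, z, w] quaternions; it shows the idea, not the actual Unity (C#) code I run on the device.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions stored as [x, y, z, w]."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return np.array([
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
        aw * bw - ax * bx - ay * by - az * bz,
    ])

def quat_conj(q):
    return np.array([-q[0], -q[1], -q[2], q[3]])

def sign_consistent_interp(q0, q1, t):
    """Interpolate from q0 toward q1 by fraction t, avoiding the q == -q jump."""
    # Step 1: the two representations of the difference quaternion dQ
    dq1 = quat_mul(q1, quat_conj(q0))
    dq2 = -dq1
    # Step 2: angle-axis of each candidate (angle = 2*atan2(|v|, w))
    candidates = []
    for dq in (dq1, dq2):
        v, w = dq[:3], dq[3]
        angle = 2.0 * np.arctan2(np.linalg.norm(v), w)
        candidates.append((angle, v))
    # Step 3: keep the representation with the smaller rotation magnitude
    angle, v = min(candidates, key=lambda c: abs(c[0]))
    # Step 4: rotate q0 by a fraction t of that angle about the same axis
    if np.linalg.norm(v) < 1e-12:
        return q0                       # (almost) identical rotations
    axis = v / np.linalg.norm(v)
    half = 0.5 * t * angle
    dq_t = np.concatenate([np.sin(half) * axis, [np.cos(half)]])
    return quat_mul(dq_t, q0)
```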

Gauss-Newton iterations

Among nonlinear least squares methods, Gauss-Newton is the simplest: it just requires the Jacobian, which is the partial derivative of the residual with respect to each of the parameters to estimate.  Since I have multiple observations (hopefully many more than the parameters to estimate), the Jacobian is a "tall" matrix.  As you can imagine when you stare at the derivation of the expected observation, analytical derivation of the Jacobian is difficult, but fortunately, numerical estimation of the Jacobian usually works quite well.  If the i'th observation Zi = f_i(X), where X = [x1, ..., xn], the i'th row of the Jacobian J is the row vector
J_i = [ {f_i(x1 + δx1, ..., xn) − f_i(x1, ..., xn)} / δx1, ..., {f_i(x1, ..., xn + δxn) − f_i(x1, ..., xn)} / δxn ]
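A minimal sketch of that forward-difference Jacobian, for a residual function f that maps the parameter vector to the stacked residuals:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Forward-difference Jacobian of f at x; f returns the stacked residuals."""
    f0 = np.asarray(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        x_pert = x.copy()
        x_pert[j] += eps            # perturb one parameter at a time
        J[:, j] = (np.asarray(f(x_pert)) - f0) / eps
    return J
```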

Where I think analytical derivation of the partial derivatives can be helpful is in obtaining the Hessian (second derivative) necessary for Newton's method, which can converge much faster than any other method (at the risk of some ringing).

The Gauss-Newton correction is the least-squares solution of the linearized observation equation J δX = εZ, which is J* (the pseudo-inverse of J) multiplied by the residual εZ:
δX = J* εZ
which is also the minimum-norm solution (that is, |δX| is the smallest among all solutions that satisfy J δX = εZ).  Matlab's backslash operator (\) instead yields a basic solution, which has many zero entries--not quite the same thing!
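Putting it together, a hedged sketch of the plain Gauss-Newton loop (reusing the numerical_jacobian sketch above); numpy's pinv gives exactly the pseudo-inverse, and hence the minimum-norm correction.

```python
import numpy as np

def gauss_newton(residual_fn, x0, iterations=10):
    """Plain Gauss-Newton: x_{k+1} = x_k + pinv(J) @ residual."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iterations):
        eps_z = np.asarray(residual_fn(x))       # residual = observed - expected
        J = numerical_jacobian(residual_fn, x)   # from the sketch above
        dx = np.linalg.pinv(J) @ eps_z           # minimum-norm correction
        x += dx
        if np.linalg.norm(dx) < 1e-9:            # converged
            break
    return x
```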

Boldly solving an unobservable problem--heuristically

If you have an intuition about this sort of geometry problem, you will have caught on to the fact that this is an unobservable problem: for a given observation, rotation cannot be teased apart from translation.  Try it yourself: imagine translating B relative to A vs. rotating B about the vertical: the blob seen by A will move in a similar way!  Observability is only restored when you have a lot of observations at different distances, because the distance acts as a lever arm.  Unfortunately, it is quite difficult to ensure a sufficient distribution of distances during calibration.

Another problem is the weak observability of translation (change in r^A_AB) relative to rotation (change in C^A_B): except when the devices are quite close, the distance acting as a lever arm amplifies the effect of rotation, so that even with a minimum-norm solution, practically only the rotation is corrected.  As I read in this textbook on least squares, this situation is distantly similar to having a large condition number (the ratio between the greatest and the least singular values of the matrix J), and a standard technique for dealing with it is Tikhonov regularization, which is to minimize
|J δX − εZ|² + λ|δX|²
rather than the original cost function |J δX − εZ|².  The intuition is to artificially boost all the singular values: for the large singular values, the artificial boost is negligible, while the weak singular values are "saved" by the boost.  But to address the rotation drowning out the translation, I used unequal boosting for rotation vs. translation, so that my regularized cost function is
|J δX − εZ|² + δX' Λ δX
where Λ is a diagonal weight matrix.  The solution to this regularized problem is
δX = (J' J + Λ)⁻¹ J' εZ, where J' is J transposed.

Even this technique was not enough, so I then tried correcting the rotation and the translation at alternating correction steps.

Torpedoed by SLAM's static-scenery assumption

The result of all this was quite unsatisfactory when the 2 devices spun around each other, because a rather large object (a person) in the middle of the scene that appears to stay put while the device rotates confuses ARKit, as you can see here:
Device A is spinning around a coffee table, so the yaw angle (top) is monotonically increasing.  But ARKit is quite confused about A's position (the trajectory should trace out a circle!), because my wife (B) is apparently static in the scene.  Even though it does the best it can, the residuals are huge at 2 m inter-device distance.

2nd attempt: just tip and rotate the phone in hand while sitting still

After realizing the root cause of the horrendous residual, I tried just sitting still and rotating the device gently in my hand, so that the observed blob traces out a rectangle on the camera image.  Then I was able to reduce the residual to an acceptable level (~0.03 radians).  The problem was that moving the phones so gently required some practice, and it took me 1~2 minutes to gather enough quality observations to throw into the least squares solver derived above.

Since my target application is a consumer-facing video game, I reasoned that this is impractical, and looked for...

3rd attempt: A and B face the same direction, and share features

Detect features

A surer way to NOT confuse ARKit is to clear away any moving object in ARKit's scene--for example by having both A and B face the same direction.  Of course, then I have no blob to observe.  What I can observe instead are the feature points that are commonly used in feature tracking algorithms; OpenCV packs no fewer than a dozen such algorithms: FAST, SIFT, SURF, ORB, just to name a few.  In fact, ARKit (and every SLAM method I know of) itself uses such feature detectors; ARKit even emits the positions (but not the descriptors) of those features as point clouds.  I thought about aligning the point clouds from A and B by minimizing the Euclidean distance between those points, but realized it would be computationally intractable for any non-trivially small offset between A and B.  So I set about running a feature detector myself (outside of ARKit).
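For example, a minimal ORB sketch in Python/OpenCV (the phone-side code is of course not Python, so treat this as illustration only):

```python
import cv2

def detect_features(gray_image, n_features=500):
    """Detect ORB keypoints and compute their binary descriptors."""
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(gray_image, None)
    return keypoints, descriptors
```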

Match features

I borrowed heavily from the stereo correspondence problem, as explained in Hartley and Zisserman: when situated pretty close together and facing the same direction, A's and B's images are like the left and right images of a stereo camera.  With a small effort, I can pick up a lot of features that should appear in both A's and B's images.  If B sends the top 200 such feature points (with descriptors) to A, then A can run OpenCV's DescriptorMatcher--even a brute-force matcher is OK--to pick out candidate matches.
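A sketch of the matching step with a brute-force matcher plus Lowe's ratio test (the ratio threshold here is a common choice, not necessarily the one I used):

```python
import cv2

def match_features(desc_a, desc_b, ratio=0.75):
    """Brute-force match ORB descriptors from A and B, keeping unambiguous matches."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(desc_a, desc_b, k=2)
    good = []
    for pair in knn:
        # keep a match only if it is clearly better than the runner-up
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return good      # list of cv2.DMatch (queryIdx -> A, trainIdx -> B)
```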

Solve for the fundamental matrix

Such matches are corrupted by outliers--"crazy" matches--because the matchers really don't have much context to work with (certainly much less than the human vision system).  So when solving the stereo correspondence problem, we need a robust (outlier-immune) algorithm such as RANSAC (random sample consensus).  [Some people report better results with PROSAC--a derivative of RANSAC.]  Once again, OpenCV does the heavy lifting in the findFundamentalMat() function, which takes the 2 sets of matched feature points.  Note that the solution F is in [pixel] units.
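A sketch of that step; the RANSAC threshold and confidence are illustrative defaults:

```python
import cv2
import numpy as np

def estimate_fundamental(kp_a, kp_b, matches):
    """Robustly estimate F from matched keypoints using RANSAC."""
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    F, inlier_mask = cv2.findFundamentalMat(
        pts_a, pts_b, cv2.FM_RANSAC,
        ransacReprojThreshold=3.0, confidence=0.99)
    return F, inlier_mask
```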

Recover the essential matrix

The intrinsic matrix K from ARKit is then applied to F to obtain E:
E = KT F K   (KT is K transposed; here I assume A and B report essentially the same intrinsic matrix K)
E is, by definition, the product of the translation and rotation: E = T^ R, where T^ is the skew-symmetric matrix of the translation t = (tx, ty, tz):
      0  -tz  ty
T^ =  tz  0   -tx
     -ty  tx   0 
When I decompose E with SVD, i.e.
E = U Σ VT
there are 2 candidate factorizations to consider (the sign of T^ is also ambiguous, giving 4 candidate poses in total):
(T^, R) = (U W Σ UT, U WT VT) or
(T^, R) = (U WT Σ UT, U W VT)
where
    0 -1  0   
W = 1  0  0      
    0  0  1
 
The solution that puts more of the features in front (z > 0) of both cameras should be chosen.
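In practice, OpenCV can do this decomposition and the in-front-of-the-camera check in one call; a sketch, assuming the inlier pixel coordinates and the intrinsic matrix K from above:

```python
import cv2
import numpy as np

def recover_relative_pose(pts_a, pts_b, K):
    """Recover (R, t) between the two views; t is only known up to scale."""
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K,
                                   method=cv2.RANSAC, threshold=1.0)
    # recoverPose performs the decomposition and the cheirality check,
    # keeping the (R, t) that places the points in front of both cameras.
    _, R, t, mask = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)
    return R, t
```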
Note that the resulting translation is only determined up to an unknown scale factor, rather than in [m].  This is because projective geometry alone cannot resolve the scale; in VIO (visual inertial odometry, of which ARKit is an implementation), the inertial sensor is brought in to establish scale.  This suggests that if I run the above algorithm twice with ARKit turned on, I can use the change in position reported by ARKit to recover the scale factor.

Unity implementation

Incomplete--I did not do this because...

Why I am stopping now

My gut feeling is that users will not, or cannot, perform any calibration lasting more than 10 seconds: compliance will be low, and they can't do it well even if they want to.  So even the stereo correspondence method is problematic.  I think the right solution is to always share high quality features between devices and build a joint map.  This is what Hololens does already, but it is just too much effort for me to take on: I would have to really learn the details of the "M" in SLAM, and I have low-level projects (power electronics and a hard real-time AMP architecture for Raspberry Pi) that I've neglected for too long.

I'll revisit this "Shootout" project when there is an AR engine that shares a global map in a manner completely transparent to the application developer.
