Dissecting the Camera Matrix, Part 2: The Extrinsic Matrix

August 22, 2012

Welcome to the third post in the series "The Perspecive Camera - An Interactive Tour." In the last post, we learned how to decompose the camera matrix into a product of intrinsic and extrinsic matrices. In the next two posts, we'll explore the extrinsic and intrinsic matrices in greater detail. First we'll explore various ways of looking at the extrinsic matrix, with an interactive demo at the end.

The Extrinsic Camera Matrix

The camera's extrinsic matrix describes the camera's location in the world, and what direction it's pointing. Those familiar with OpenGL know this as the "view matrix" (or rolled into the "modelview matrix"). It has two components: a rotation matrix, R, and a translation vector t, but as we'll soon see, these don't exactly correspond to the camera's rotation and translation. First we'll examine the parts of the extrinsic matrix, and later we'll look at alternative ways of describing the camera's pose that are more intuitive.

The extrinsic matrix takes the form of a rigid transformation matrix: a 3x3 rotation matrix in the left-block, and 3x1 translation column-vector in the right:

\[ [ R \, |\, \boldsymbol{t}] = \left[ \begin{array}{ccc|c} r_{1,1} & r_{1,2} & r_{1,3} & t_1 \\ r_{2,1} & r_{2,2} & r_{2,3} & t_2 \\ r_{3,1} & r_{3,2} & r_{3,3} & t_3 \\ \end{array} \right] \]

It's common to see a version of this matrix with extra row of (0,0,0,1) added to the bottom. This makes the matrix square, which allows us to further decompose this matrix into a rotation followed by translation:

\[ \begin{align} \left [ \begin{array}{c|c} R & \boldsymbol{t} \\ \hline \boldsymbol{0} & 1 \end{array} \right ] &= \left [ \begin{array}{c|c} I & \boldsymbol{t} \\ \hline \boldsymbol{0} & 1 \end{array} \right ] \times \left [ \begin{array}{c|c} R & \boldsymbol{0} \\ \hline \boldsymbol{0} & 1 \end{array} \right ] \\ &= \left[ \begin{array}{ccc|c} 1 & 0 & 0 & t_1 \\ 0 & 1 & 0 & t_2 \\ 0 & 0 & 1 & t_3 \\ \hline 0 & 0 & 0 & 1 \end{array} \right] \times \left[ \begin{array}{ccc|c} r_{1,1} & r_{1,2} & r_{1,3} & 0 \\ r_{2,1} & r_{2,2} & r_{2,3} & 0 \\ r_{3,1} & r_{3,2} & r_{3,3} & 0 \\ \hline 0 & 0 & 0 & 1 \end{array} \right] \end{align} \]

This matrix describes how to transform points in world coordinates to camera coordinates. The vector t can be interpreted as the position of the world origin in camera coordinates, and the columns of R represent represent the directions of the world-axes in camera coordinates.

The important thing to remember about the extrinsic matrix is that it describes how the world is transformed relative to the camera. This is often counter-intuitive, because we usually want to specify how the camera is transformed relative to the world. Next, we'll examine two alternative ways to describe the camera's extrinsic parameters that are more intuitive and how to convert them into the form of an extrinsic matrix.

Building the Extrinsic Matrix from Camera Pose

It's often more natural to specify the camera's pose directly rather than specifying how world points should transform to camera coordinates. Luckily, building an extrinsic camera matrix this way is easy: just build a rigid transformation matrix that describes the camera's pose and then take it's inverse.

Let C be a column vector describing the location of the camera-center in world coordinates, and let \(R_c\) be the rotation matrix describing the camera's orientation with respect to the world coordinate axes. The transformation matrix that describes the camera's pose is then \([R_c \,|\, C ]\). Like before, we make the matrix square by adding an extra row of (0,0,0,1). Then the extrinsic matrix is obtained by inverting the camera's pose matrix:

\begin{align} \left[ \begin{array}{c|c} R & \boldsymbol{t} \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right] &= \left[ \begin{array}{c|c} R_c & C \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right]^{-1} \\ &= \left[ \left[ \begin{array}{c|c} I & C \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right] \left[ \begin{array}{c|c} R_c & 0 \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right] \right]^{-1} & \text{(decomposing rigid transform)} \\ &= \left[ \begin{array}{c|c} R_c & 0 \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right]^{-1} \left[ \begin{array}{c|c} I & C \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right]^{-1} & \text{(distributing the inverse)}\\ &= \left[ \begin{array}{c|c} R_c^T & 0 \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right] \left[ \begin{array}{c|c} I & -C \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right] & \text{(applying the inverse)}\\ &= \left[ \begin{array}{c|c} R_c^T & -R_c^TC \\ \hline \boldsymbol{0} & 1 \\ \end{array} \right] & \text{(matrix multiplication)} \end{align}

When applying the inverse, we use the fact that the inverse of a rotation matrix is it's transpose, and inverting a translation matrix simply negates the translation vector. Thus, we see that the relationship between the extrinsic matrix parameters and the camera's pose is straightforward:

\[ \begin{align} R &= R_c^T \\ \boldsymbol{t} &= -RC \end{align} \]

Some texts write the extrinsic matrix substituting -RC for t, which mixes a world transform (R) and camera transform notation (C).

The "Look-At" Camera

Readers familiar with OpenGL might prefer a third way of specifying the camera's pose using (a) the camera's position, (b) what it's looking at, and (c) the "up" direction. In legacy OpenGL, this is accomplished by the gluLookAt() function, so we'll call this the "look-at" camera. Let C be the camera center, p be the target point, and u be up-direction. The algorithm for computing the rotation matrix is (paraphrased from the OpenGL documentation):

Compute L = p - C.
Normalize L.
Compute s = L x u. (cross product)
Normalize s.
Compute u' = s x L.

The extrinsic rotation matrix is then given by:

\[ R = \left[ \begin{array}{ccc} s_1 & s_2 & s_3 \\ u_1' & u_2' & u_3' \\ -L_1 & -L_2 & -L_3 \end{array} \right] \]

(Updated May 21, 2014 -- transposed matrix)

You can get the translation vector the same way as before, t = -RC.

Try it out!

Below is an interactive demonstration of the three different ways of parameterizing a camera's extrinsic parameters. Note how the camera moves differently as you switch between the three parameterizations.

This requires a WebGL-enabled browser with Javascript enabled.

Javascript is required for this demo.

Scene

Image

Left: scene with camera and viewing volume. Virtual image plane is shown in yellow. Right: camera's image.

Extrinsic (World)
Extr. (Camera)
Extr. ("Look-at")
Intrinsic

\(\boldsymbol{t}_x\)

\(\boldsymbol{t}_y\)

\(\boldsymbol{t}_z\)

x-Rotation

y-Rotation

z-Rotation

Adjust extrinsic parameters above.

This is a "world-centric" parameterization. These parameters describe how the world changes relative to the camera. These parameters correspond directly to entries in the extrinsic camera matrix.

As you adjust these parameters, note how the camera moves in the world (left pane) and contrast with the "camera-centric" parameterization:

Rotating affects the camera's position (the blue box).
The direction of camera motion depends on the current rotation.
Positive rotations move the camera clockwise (or equivalently, rotate the world counter-clockwise).

Also note how the image is affected (right pane):

Rotating never moves the world origin (red ball).
Changing \(t_x\) always moves the spheres horizontally, regardless of rotation.
Increasing \(t_z\) always moves the camera closer to the world origin.

\(C_x\)

\(C_y\)

\(C_z\)

x-Rotation

y-Rotation

z-Rotation

Adjust extrinsic parameters above.

This is a "camera-centric" parameterization, which describes how the camera changes relative to the world. These parameters correspond to elements of the inverse extrinsic camera matrix.

As you adjust these parameters, note how the camera moves in the world (left pane) and contrast with the "world-centric" parameterization:

Rotation occurs about the camera's position (the blue box).
The direction of camera motion is independent of the current rotation.
A positive rotation rotates the camera counter-clockwise (or equivalently, rotates the world clockwise).
Increasing \(C_y\) always moves the camera toward the sky, regardless of rotation.

Also note how the image is affected (right pane):

Rotating around y moves both spheres horizontally.
With different rotations, changing \(C_x\) moves the spheres in different directions.

\(C_x\)

\(C_y\)

\(C_z\)

\(p_x\)

\(p_y\)

\(p_z\)

Adjust extrinsic parameters above.

This is a "look-at" parameterization, which describes the camera's orientation in terms of what it is looking at. Adjust \(p_x\), \(p_y\), and \(p_z\) to change where the camera is looking (orange dot). The up vector is fixed at (0,1,0)'. Notice that moving the camera center, *C*, causes the camera to rotate.

Focal Length

Axis Skew

\(x_0\)

\(y_0\)

Adjust intrinsic parameters above. As you adjust these parameters, observe how the viewing volume changes in the left pane:

Changing the focal length moves the yellow focal plane, which chainges the field-of-view angle of the viewing volume.
Changing the principal point affects where the green center-line intersects the focal plane.
Setting skew to non-zero causes the focal plane to be non-rectangular

Intrinsic parameters result in 2D transformations only; the depth of objects are ignored. To see this, observe how the image in the right pane is affected by changing intrinsic parameters:

Changing the focal length scales the near sphere and the far sphere equally.
Changing the principal point has no affect on parallax.
No combination of intrinsic parameters will reveal occluded parts of an object.

Conclusion

We've just explored three different ways of parameterizing a camera's extrinsic state. Which parameterization you prefer to use will depend on your application. If you're writing a Wolfenstein-style FPS, you might like the world-centric parameterization, because moving along (t_z) always corresponds to walking forward. Or you might be interpolating a camera through waypoints in your scene, in which case, the camera-centric parameterization is preferred, since you can specify the position of your camera directly. If you aren't sure which you prefer, play with the tool above and decide which approach feels the most natural.

Join us next time when we explore the intrinsic matrix, and we'll learn why hidden parts of your scene can never be revealed by zooming your camera. See you then!

Posted by Kyle Simek

Calibrated Cameras in OpenGL without glFrustum → ← Dissecting the Camera Matrix, Part 1: Extrinsic/Intrinsic Decomposition

Sightations ← A Computer Vision Blog

Dissecting the Camera Matrix, Part 2: The Extrinsic Matrix

The Extrinsic Camera Matrix

Building the Extrinsic Matrix from Camera Pose

The "Look-At" Camera

Try it out!

Conclusion