The Perspective and Orthographic Projection Matrix

A Word of Warning

To understand the content of this lesson, you need to be familiar with matrices, with transforming points from one space to another, with perspective projection (including how the coordinates of 3D points on the canvas are computed), and with the rasterization algorithm. Read the previous lessons and the Geometry lesson if you are unfamiliar with these concepts.

Projection Matrices: What Are They?

What are projection matrices? They are nothing more than 4x4 matrices, which are designed so that when you multiply a 3D point in camera space by one of these matrices, you end up with a new point which is the projected version of the original 3D point onto the canvas. More precisely, multiplying a 3D point by a projection matrix allows you to find the 2D coordinates of this point on the canvas in NDC space (Normalized Device Coordinates). Remember from the previous lesson that the 2D coordinates of a point on the canvas in NDC space are contained in the range [-1, 1]. This is, at least, the convention used by graphics APIs such as Direct3D or OpenGL.

There are two conventions when it comes to NDC space. Coordinates are defined either in the range [-1, 1], as in most real-time graphics APIs like OpenGL or Direct3D, or in the range [0, 1], as in the RenderMan specifications. As you are more likely to encounter the term NDC in the context of real-time 3D APIs, we will stick to the [-1, 1] convention.

Another way of saying it is that multiplying a 3D point in camera space by a projection matrix has the same effect as the whole series of operations we have been using in the previous lessons to find the 2D coordinates of 3D points in NDC space (this includes the perspective divide step and a few remapping operations to go from screen space to NDC space). In other words, this rather long code snippet which we have been using in the previous lessons:

// convert to screen space
Vec2f P_screen;
P_screen.x = near * P_camera.x / -P_camera.z;
P_screen.y = near * P_camera.y / -P_camera.z;

// Now convert point from screen space to NDC space (in the range [-1, 1])
Vec3f P_ndc;
P_ndc.x = 2 * P_screen.x / (r - l) - (r + l) / (r - l);
P_ndc.y = 2 * P_screen.y / (t - b) - (t + b) / (t - b);


can be replaced with a single point-matrix multiplication. Assuming $$M_{proj}$$ is a projection matrix, we can write:

Vec3f P_ndc;
M_proj.multVecMatrix(P_camera, P_ndc);


The manual version involves five variables: $$near$$, $$t$$, $$b$$, $$l$$, and $$r$$, which are the near clipping plane and the top, bottom, left, and right screen coordinates respectively. Remember that the screen coordinates are themselves normally computed from the near clipping plane and the camera angle of view (which, if you use a physically-based camera model, is calculated from a whole series of parameters such as the film gate size, the focal length, etc.). The matrix version is great because it reduces this complex process to a simple point-matrix multiplication. The whole point of this lesson is to explain what $$M_{proj}$$ is. Looking at the two code snippets above should give you some idea of what we will need to build this matrix. It seems that this matrix replaces:

• The perspective divide operation,

• As well as remapping the point from screen space to NDC space.

We will somehow have to pack into this matrix all the variables involved in these two steps: the near clipping plane as well as the screen coordinates. We will explain this in detail in the next chapters. Before we get there, let's explain one important thing about projection matrices and points. First, projection matrices transform vertices or 3D points, not vectors. Using a projection matrix to transform a vector doesn't make any sense. These matrices are used to project the vertices of 3D objects onto the screen to create images of these objects that follow the rules of perspective.

Remember from the lesson on geometry that a point is also a form of matrix. A 3D point can be defined as a [1x3] row vector (1 row, 3 columns). Remember that we use the row-major (row-vector) convention on Scratchapixel. From the same lesson, we know that two matrices can only be multiplied if the number of columns of the left matrix equals the number of rows of the right matrix. In other words, the matrices [mxn][nxk] can be multiplied by each other, but the matrices [nxm][kxn] can't (unless m equals k). If you multiply a 3D point by a 4x4 matrix, you get [1x3][4x4], and technically this multiplication can't be done! The trick to making this operation possible is to treat points not as [1x3] vectors but as [1x4] vectors. You can then multiply this [1x4] vector by a 4x4 matrix. As usual with matrix multiplication, the result of this operation is another [1x4] matrix. These [1x4] matrices, which are 4D points in a way, are called in mathematics points with homogeneous coordinates. A 4D point can only be used as a 3D point if its fourth coordinate equals 1. When this is the case, the first three coordinates of the 4D point can be used as the coordinates of a standard 3D Cartesian point. We will study this conversion process from homogeneous to Cartesian coordinates in detail in the next chapter.

Whenever we multiply a point by a 4x4 matrix, the point is always treated as a 4D point. Still, for a reason that will be explained later, when you use "conventional" 4x4 transformation matrices (the matrices we use most often in CG to scale, translate, or rotate objects, for instance), this fourth coordinate doesn't need to be explicitly defined. But when a point is multiplied by a projection matrix, such as the perspective or orthographic projection matrices, this fourth coordinate must be dealt with explicitly. This is why homogeneous coordinates are more often discussed in the context of projections than in the context of general transformations (even though projections are a form of transformation, and even though you are also using homogeneous coordinates when you deal with conventional transformation matrices; you only do so implicitly, as we just explained).
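
The role of the fourth coordinate can be illustrated with a short sketch. The `Vec4` and `Mat44` structs and the functions `multVecMatrix` and `toCartesian` below are hypothetical helpers written for this example, using the row-vector convention with coefficients stored as m[row][column]. The division by w in `toCartesian` is the homogeneous-to-Cartesian conversion mentioned above, shown here only in its simplest form.

```cpp
// Hypothetical minimal types, introduced only for this sketch.
struct Vec4 { float x, y, z, w; };
struct Mat44 { float m[4][4]; };  // m[row][column]

// Row-vector convention: a [1x4] point times a [4x4] matrix
// gives another [1x4] point.
Vec4 multVecMatrix(const Vec4& p, const Mat44& M)
{
    const float in[4] = {p.x, p.y, p.z, p.w};
    float out[4] = {0, 0, 0, 0};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            out[col] += in[row] * M.m[row][col];
    return {out[0], out[1], out[2], out[3]};
}

// Converts a point with homogeneous coordinates back to a Cartesian
// point by dividing every coordinate by w (assumes w is not zero).
Vec4 toCartesian(const Vec4& p)
{
    return {p.x / p.w, p.y / p.w, p.z / p.w, 1.0f};
}
```

With a translation matrix (fourth column equal to 0, 0, 0, 1), a point whose w is 1 comes out with w still equal to 1, which is why the fourth coordinate can stay implicit for conventional transformations. With a matrix whose fourth column is different, w generally changes and `toCartesian` has to be applied before the result can be used as a 3D point.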

What we call "conventional" 4x4 transformation matrices belong to a class of transformations called affine transformations in mathematics. Projection matrices belong to a class of transformations called projective transformations. To multiply a point by any of these matrices, the point must be defined with homogeneous coordinates. You can find more information about homogeneous coordinates and affine and projective transformations in the next chapter and in the lesson on geometry.

This means that you only have to deal with 4D points in a renderer when you work with projection matrices. The rest of the time, you will never have to deal with them, at least not explicitly. Projection matrices are generally only used by programs implementing the rasterization algorithm. In itself, this is not a problem at all, but in that algorithm there is a process called clipping (we haven't talked about it at all in the lesson on rasterization) that happens while the point is being transformed by the projection matrix. You read correctly: clipping, a process we will describe in the next chapters, happens while the projection matrix transforms the points, not before or after. So in essence, the projection matrix is used indirectly to "interleave" a process called clipping that is important in rendering (we will explain what clipping does in the next chapter). This makes things even more confusing because, when books get to the topic of projection matrices, they generally also speak about clipping without really explaining why it is there, where it comes from, and what relation it has with the projection matrix (in fact it has none; it just happens to be convenient to do it while the points are being transformed).

Where Are They Used and Why?

Projection matrices are a very popular topic on this website and on specialized forums, which suggests that many people still find them confusing. We also found the topic poorly documented, which is odd considering how important the subject matter is. Their popularity comes from their use in real-time graphics APIs (such as OpenGL, Direct3D, WebGL, Vulkan, Metal, WebGPU, etc.), which are themselves very popular due to their use in games and other common desktop graphics applications. What these APIs have in common is that they are used as an interface between your program and the GPU. Not surprisingly, and as we already mentioned in the previous lesson, GPUs implement the rasterization algorithm in their circuits. In old versions of the rendering pipeline used by GPUs (known as the fixed-function pipeline), GPUs transformed points from camera space to NDC space using a projection matrix. But the GPU didn't know how to build this matrix itself. As a programmer, it was your responsibility to build it and pass it on to the graphics card. That meant that you were required to know how to build the matrix in the first place.

// Don't use this code - it is now deprecated. OpenGL was
// one of the two APIs of choice for real-time graphics (with DirectX).
glMatrixMode(GL_PROJECTION);
glFrustum(l, r, b, t, n, f);  //set the matrix using screen coordinates and near/far clipping planes
glMatrixMode(GL_MODELVIEW);
glTranslatef(0, 0, 10);
...


Don't use this code, though. It is only mentioned for reference and historical reasons. We are not supposed to use OpenGL that way anymore (these functionalities are now deprecated, and OpenGL itself is fading out). The process is slightly different in the more recent "programmable" rendering pipeline (DirectX, Vulkan, Metal, or the latest versions of OpenGL). First, projection matrices don't convert points from camera space to NDC space directly; they convert them into an intermediate space called clip space. Don't worry too much about it for now. To make it short, let's say that in clip space, points have homogeneous coordinates (do you remember the 4D points we talked about earlier?).

In the modern programmable GPU rendering pipeline, this point-matrix multiplication (the transformation of vertices by the projection matrix) occurs in a vertex shader. A vertex shader is nothing more than a small program whose job is to transform the vertices making up the 3D objects of your scene from camera space to clip space. A simple vertex shader takes a vertex as an input variable and a projection matrix (a uniform variable of the shader), and sets a predefined global variable (called gl_Position in OpenGL) to the result of the input vertex multiplied by the projection matrix. Note that gl_Position is declared as a vec4, in other words, as a point with homogeneous coordinates; the input vertex in the example below is a vec3 that is promoted to a vec4 by setting its fourth coordinate to 1. Here is an example of a basic vertex shader:

uniform mat4 projMatrix;
in vec3 position;

void main()
{
    gl_Position = projMatrix * vec4(position, 1.0);
}


The GPU executes this vertex shader to process every vertex in the scene. In a typical program using the OpenGL or Direct3D API, you store the vertex and connectivity data (how the vertices are connected to form the mesh's triangles) in the GPU's memory (as buffers). The GPU then processes the vertex data to transform it from whatever space it is in when you pass it on to the GPU to clip space. The space the vertices are in when the vertex shader processes them depends on you.

• You can transform them yourself from world space to camera space before loading them into the GPU's memory. This means that the coordinates will already be defined in camera space when the vertices are processed in the vertex shader.

• Or you can leave the vertices in world space when you load them into the GPU's memory. When the vertices are processed in the vertex shader, besides the projection matrix (which is often denoted P or Proj), you will also need to pass the world-to-camera matrix to the shader. In the GPU world, this matrix is usually folded into the model-view matrix, often denoted M or MV; it is called model-view because it combines both the object-to-world (model) and world-to-camera (view) matrices in a single matrix. We haven't spoken about object space so far, but it is simply the space an object is in before any transformation is applied. World space is the space the object is in after a 4x4 object-to-world matrix has been applied to it. You will first need to transform the vertex from world to camera space and then apply the projection matrix. If the vertices are already in world space when they reach the vertex transformation pipeline, you only need the view part of the matrix (the world-to-camera matrix). In pseudo-code, you would get something like this:

uniform mat4 P;  //projection matrix
uniform mat4 MV; //model-view matrix (object-to-world * world-to-camera)

in vec3 position;

void main()
{
    gl_Position = P * MV * vec4(position, 1.0);
}


OpenGL uses column vector notation (Scratchapixel uses row vector convention). Thus the point that is being transformed appears on the right, and you need to read the transformation from right to left. MV, the model-view matrix, first transforms the vertex, then P, the projection matrix.
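
The relationship between the two conventions can be checked numerically. The sketch below uses hypothetical `Vec4` and `Mat44` helpers written for this example, and the matrices called MV and P in the usage note are arbitrary stand-ins (a translation and a scale), not a real model-view or projection matrix. It relies on the identity (A B)^T = B^T A^T: transforming a row vector by MV then P gives the same coordinates as transforming a column vector by the transposed matrices applied in the reverse written order.

```cpp
// Hypothetical minimal types, introduced only for this sketch.
struct Vec4 { float v[4]; };
struct Mat44 { float m[4][4]; };  // m[row][column]

// Row-vector convention: result = p * M (point on the left).
Vec4 mulRowVector(const Vec4& p, const Mat44& M)
{
    Vec4 out = {{0, 0, 0, 0}};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            out.v[col] += p.v[row] * M.m[row][col];
    return out;
}

// Column-vector convention: result = M * p (point on the right).
Vec4 mulColumnVector(const Mat44& M, const Vec4& p)
{
    Vec4 out = {{0, 0, 0, 0}};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out.v[row] += M.m[row][col] * p.v[col];
    return out;
}

// Returns the transpose of M.
Mat44 transposed(const Mat44& M)
{
    Mat44 T = {};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            T.m[col][row] = M.m[row][col];
    return T;
}

// Standard matrix product A * B.
Mat44 multiply(const Mat44& A, const Mat44& B)
{
    Mat44 C = {};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                C.m[i][j] += A.m[i][k] * B.m[k][j];
    return C;
}
```

With these helpers, `mulRowVector(p, multiply(MV, P))` and `mulColumnVector(multiply(transposed(P), transposed(MV)), p)` return the same four coordinates. In the row-vector form the transformations read left to right, while in the column-vector form they read right to left.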

Many programmers prefer the second option (there are some practical reasons for that). You can also concatenate the world-to-camera and projection matrices into a single matrix, as the following example does, though for reasons that are outside this lesson's scope, it is usually best to pass them on to the vertex shader as two separate matrices.

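// Vertex shader (GLSL)
uniform mat4 PM;  // projection matrix * model-view matrix

in vec3 position;

void main()
{
    gl_Position = PM * vec4(position, 1.0);
}

// CPU program (C++)
int main(...)
{
    ...
    // we use row-vector matrices: the point is on the left, so the
    // model-view matrix comes first in the product
    Matrix44f projectionMatrix(...);
    Matrix44f modelViewMatrix(...);
    Matrix44f PM = modelViewMatrix * projectionMatrix;
    // GL expects column-vector matrices stored in column-major order.
    // Assuming Matrix44f stores its coefficients as m[row][column], a
    // row-vector matrix laid out this way already has the memory layout
    // GL expects, so the data is passed as is (no explicit transpose,
    // and the transpose argument below is GL_FALSE).
    // look for a variable called PM in the vertex shader.
    GLint loc = glGetUniformLocation(p, "PM");
    // set this shader variable with the content of our PM matrix
    glUniformMatrix4fv(loc, 1, GL_FALSE, &PM.m[0][0]);
    ...
    render();
    ...
    return 0;
}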


Don't worry too much if you are unfamiliar with graphics APIs, or if you are familiar with Direct3D but not with OpenGL: the principles are the same, so you should be able to find the similarities easily. Regardless, try to understand what the code is doing. In the main CPU program, we set the projection matrix and the model-view matrix (which combines the object-to-world and the world-to-camera transforms) and multiply these two matrices together so that rather than passing two matrices to the vertex shader, only one is needed. The OpenGL API requires that we find the location of the variable we are looking for in a shader using the glGetUniformLocation() call.

Don't worry if you don't know how glGetUniformLocation() works. In short, it takes a program as its first argument and the name of the variable you are looking for in this program as its second, and returns the location of this variable. You can then use this location to set the shader variable with a glUniform() call. In OpenGL, a program combines a vertex shader and a fragment shader (Direct3D calls the latter a pixel shader). You first need to combine these shaders in a program, and the program is then applied or assigned to an object. Together, the shaders define how the object is transformed from whatever space the model is in when the vertex shader is executed to clip space (this is the role of the vertex shader) and then define its appearance (that's the function of the fragment shader).

Once the location is found, we can set this shader variable using glUniformMatrix4fv(). When the geometry is rendered, the vertex shader is executed and transforms the vertices from object space to clip space. It does so by multiplying the vertex position (in object space) by the matrix PM, which combines the effect of the projection matrix with that of the model-view matrix. This process is central to how objects are rendered on GPUs. This lesson is not an introduction to the GPU rendering pipeline, so we don't want to get into too much detail at this point, but hopefully these explanations are enough for you to understand the concept.

Regardless of whether you use the old or the new rendering pipeline (by today's standards, that is, by 2023, you should only be using programmable pipelines), you are still somehow required to build the projection matrix yourself. This is why projection matrices are so important and, consequently, why they are so popular on CG forums. Note, however, that a projection matrix is not strictly required in the new rendering pipeline. All that is required is to transform vertices from whatever space they are in to clip space. Whether you use a matrix for this or write some code that does the same thing as the matrix is not important, as long as the result is the same. For example, you could very well write:

uniform float near, b, t, l, r;  //near clip plane and screen coordinates

in vec3 position;

void main()
{
    // does the same thing as gl_Position = Mproj * vec4(position, 1.0)
gl_Position.x = ... some code here ...;
gl_Position.y = ... some code here ...;
gl_Position.z = ... some code here ...;
gl_Position.w = ... some code here ...;
}


Hopefully, you understand what we mean when we say you can either use the matrix or write some code that sets gl_Position as if you had used a projection matrix. This wasn't possible in the fixed-function pipeline (because the concept of a vertex shader didn't exist back then), but the point we are trying to make here is that it is no longer strictly required to use projection matrices if you do not wish to. The advantage of replacing the matrix with some other code is that it gives you more flexibility in how vertices are transformed. In the vast majority of cases, though, you will use a standard projection, and even if you do not wish to use the matrix, you will still need to understand how it is built in order to replace the point-matrix multiplication with code that converts points from world space to camera space and then from camera space to clip space.

All you need to remember really from this chapter is that:

• GPUs' rendering pipeline is based on the rasterization algorithm, in which vertices are typically transformed from camera space to clip space using projection matrices. GPUs are widespread and since projection matrices play a central role in the way images are produced on the GPU, they have also become an important topic of discussion and interest. Another way of saying this is that if GPUs didn't exist or were based on the ray-tracing algorithm instead, we would probably not care about projection matrices.

• In the programmable rendering pipeline, vertices are transformed to clip space in the vertex shader. The projection matrix is generally a uniform variable of the vertex shader. The matrix is set on the CPU before the shader is used (you can also pass all the variables required to build that matrix to the shader and build it in the shader, though this is more cumbersome than just passing a pre-built matrix). All we need to learn is how this matrix is built, and what clip space is.

Orthographic and Perspective Projection Matrix

Finally, and to conclude this chapter, you may have noticed that the lesson is called "The Perspective and Orthographic Projection Matrix", and you may wonder what the difference is between the two. First, they are both projection matrices. In other words, their function is to project 3D points onto a 2D surface. We are already familiar with the perspective projection process. When the angle of view becomes very small (in theory, when it tends to 0), the four lines defining the frustum's pyramid become parallel to each other. In this particular configuration, there is no more foreshortening effect. In other words, an object's size in the image stays constant regardless of its distance from the camera. This is what we call an orthographic projection. Visually, an orthographic projection looks less natural than a perspective projection, but this form of projection is nonetheless useful in some cases. Sim City is an example of a game rendered with an orthographic projection (image above). This gives the game a unique look.

Projection Matrix and Ray-Tracing

As we mentioned a few times already, in ray tracing, points don't need to be projected onto the screen. Instead, rays emitted from the image plane are tested for intersections with the objects from the scene. Thus, projection matrices are not used in ray tracing.

What's Next?

We found that the easiest way to learn about projection matrices and everything we just talked about is to start by learning how to build a very basic one. Once you understand homogeneous coordinates, it will also become simpler to speak about clipping and other concepts that indirectly relate to projection matrices. The simple perspective projection matrix we will build in chapter three won't be as sophisticated as the one used in OpenGL or Direct3D (which we will also study in this lesson). As mentioned, the goal of chapter three is just to explain the principle behind projection matrices, in other words, how they work. We said that a projection matrix remaps vertices from whatever space they are in to clip space. For the next two chapters, though, we will get back to our basic definition of the projection matrix: it converts points from camera space to NDC space. We will study the more generic and more complex case, in which points are converted to clip space, in chapter four (in which we will explain how the OpenGL matrix works). For now, don't worry about clip space. Remember this definition: "Projection matrices convert vertices from camera space to NDC space" (though remember that this definition is only temporary).
