# Radiance Fields {#sec-nerfs}
## Introduction
In this chapter we will come full circle. Starting from the Greeks'
model of optics (@sec-challenge_of_vision), which postulated that light is
emitted by the eyes and travels in straight lines, touching objects in
order to produce the sensation of sight (extramission theory), radiance
fields will use that analogy to model scenes from multiple images. Here,
geometry-based vision and machine learning will meet.
Before diving into radiance fields, we will review Adelson and Bergen's
**plenoptic function**; a radiance field can be considered as a way to
represent a portion of the plenoptic function.
### The Plenoptic Function
We first discussed the plenoptic function when we introduced the
challenge of vision in @sec-challenge_of_vision.
As stated in the work by Adelson and Bergen @Adelson91, the plenoptic
function tries to answer the following question: "We begin by asking
what can potentially be seen. What information about the world is
contained in the light filling a region of space?"
Given a scene, the plenoptic function is a full description of all the
light rays that travel across the space (@fig-nerfs-plenoptic_function).
The plenoptic function tells us the light intensity of a light ray
passing through the three-dimensional (3D) point $(X, Y, Z)$ from the
direction given by the angles $(\psi, \phi)$, with wavelength $\lambda$,
at time $t$: $$L(X, Y, Z, \psi, \phi, \lambda, t)$$ In this chapter we will ignore time and we will only use three color channels instead of
the continuous wavelength. We can represent the plenoptic function with
the parametric form $L_{\theta}$ where the parameters $\theta$ have to
be adapted to represent each scene.
![Plenoptic function @Adelson91. The figure shows a slice of the plenoptic function at four locations. Two of the locations are in free space, and two other locations are inside a pinhole camera.](figures/nerfs/plenoptic_function.png){width="100%" #fig-nerfs-plenoptic_function}
An image gets formed by summing all the light rays that reach each
sensor on the camera (that is, we sum over a section of the plenoptic
function). @fig-nerfs-plenoptic_function shows an illustration of the
plenoptic function at four points, two of which are inside a pinhole
camera and are used to form an image of what is outside. The plenoptic
function inside the pinhole camera has most of the values set to zero
and only specific directions at each location have non-zero values.
In @sec-challenge_of_vision we discarded the plenoptic
function by simply saying, "Although recovering the entire plenoptic
function would have many applications, fortunately, the goal of vision
is not to recover this function." But we did not really give any
argument as to why simplifying the goal of vision was a good idea. What
if we now change our minds and decide that one useful goal of vision
is to actually recover that function entirely? Can it be done? How?
If we had access to the plenoptic function of a scene, we would be able
to render images from all possible viewpoints within that scene. One
attempt at this is the lumigraph of @Gortler1996, which extracts a subset of the
plenoptic function from a large set of images showing different
viewpoints of an object. Another approach, @Levoy1996, also uses many
images to get a continuous representation of the field of light in order
to be able to render new viewpoints (with some restrictions). This
chapter will mostly focus on a third approach to modeling portions of
the plenoptic function, where we model a scene with a **radiance field**
that assigns a color and volumetric density to each point in 3D space.
## What is a Radiance Field?
A radiance field is a representation of part of the plenoptic function,
based on the idea that the optical content of a scene can be modeled as
a cloud of colorful particles with different levels of transparency.
Such a representation can be rendered into images using volume
rendering, which we will learn about in @sec-nerfs-volume_rendering. Radiance fields come in a
variety of configurations but we will stick with the definition from
Mildenhall et al. @mildenhall2020nerf, which popularized this
representation. We will presently define a radiance field $L$ as a
mapping from the coordinates in a scene to color and **density** (i.e.,
opacity) values of the scene content at those coordinates. The
coordinates can be two-dimensional (2D) ($X,Y$) for a 2D world, or 3D
($X,Y,Z$) for a 3D world. They may also contain the angular dimensions
($\psi$ and $\phi$) that represent the viewing angle from which we are
looking at the scene. Representing color using $r,g,b$ values, and
representing density with a scalar $\sigma$, a radiance field is
therefore a function:
$$\begin{aligned}
L: X,Y,Z,\psi,\phi \rightarrow r,g,b,\sigma \quad\quad\triangleleft \quad\text{radiance field}
\end{aligned}$$
The angular coordinates $\psi$ and $\phi$ allow an
object's color to be different depending on the angle from which we look at it, which
is the case for shiny surfaces and many other materials. We will use the
terms $L^c$ and $L^\sigma$ to denote the subcomponents of $L$ that
output color and density values respectively:
$$\begin{aligned}
L^c&: X,Y,Z,\psi,\phi \rightarrow r,g,b\\
L^{\sigma}&: X,Y,Z,\psi,\phi \rightarrow\sigma
\end{aligned}$$
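To make this interface concrete, here is a minimal sketch, in Python, of what a radiance field looks like as a function. The scene content here is a hard-coded toy and all names are our own choices; learned parameterizations come later in the chapter.

```python
import torch

def radiance_field(coords, view_dir):
    """Toy radiance field L: (X, Y, Z, psi, phi) -> (r, g, b, sigma).

    coords:   (N, 3) tensor of world coordinates (X, Y, Z).
    view_dir: (N, 2) tensor of viewing angles (psi, phi); unused in this toy.
    Returns an (N, 3) tensor of colors L^c and an (N,) tensor of densities L^sigma.
    """
    # Toy scene: a red, semi-transparent ball of radius 1 centered at the origin.
    dist = coords.norm(dim=-1)
    sigma = torch.where(dist < 1.0,
                        torch.full_like(dist, 5.0),
                        torch.zeros_like(dist))
    rgb = torch.tensor([1.0, 0.0, 0.0]).expand(coords.shape[0], 3)
    return rgb, sigma
```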
### A Running Example: Seeing in Flatland
For simplicity, we will study radiance fields for a 2D world. Radiance
fields in 3D are the same thing, just with one more dimension. This 2D
world is like the one in *Flatland*, which is a book by Edwin
Abbott @abbott2009flatland where the characters are triangles, squares,
and other simple shapes that inhabit a 2D plane. Here is what their
world looks like (@fig-nerfs-flatland_cameras_and_images):
![The scene we are modeling. The circle of black triangles on the left are the cameras. The 1D images they see are shown on the right. These images are denoted as $\{\boldsymbol\ell^{(i)}\}_{i=1}^N$.](figures/nerfs/flatland_cameras_and_images.png){width="65%" #fig-nerfs-flatland_cameras_and_images}
On the left is a top-down image of the world, but inhabitants of
Flatland cannot see this. Instead they see the images on the right,
which are one-dimensional (1D) renderings of the 2D world. The circle of
black triangles on the left are cameras that produce the 1D photos
seen on the right. These show what our Flatland citizens would see if
they looked at the scene from different angles.
A radiance field for this scene is given in
@fig-nerfs-flatland_implicit_to_explicit. This figure shows how each
possible input coordinate $(X,Y)$ gets mapped to a corresponding output
color $L^c$ and density $L^{\sigma}$.
![How a radiance field maps coordinates to colors/densities. (left) Input $(X,Y)$ coordinates, visualized with $X$-values in the green channel and $Y$-values in the blue channel. (right) Radiance field components $L^c$ (colors) and $L^\sigma$ (densities) rendered at each of these coordinates.](figures/nerfs/flatland_implicit_to_explicit_v2.png){width="100%" #fig-nerfs-flatland_implicit_to_explicit}
### Aside: Representing Scenes with Fields
Radiance fields are an example of **vector fields**, which are functions
that assign a vector to each position in space. For example, we might
have a vector field $f: X, Y, Z \rightarrow v_1, v_2$, where $X$, $Y$
and $Z$ are coordinates and $v_1$ and $v_2$ are some values. Let's take
a moment to consider field-based representations more broadly, since
they turn out to be a very important way of representing scenes.
Essentially, fields are a way to represent functions that vary over
space, and fields appear all over the place in this book. For example,
every image is a field: it is a mapping from pixel coordinates to color
values. The images we usually deal with in computer vision are discrete
fields (the input coordinates are discrete-valued) but in this chapter
we will instead deal with continuous fields (the input coordinates are
continuous-valued). Unlike regular images, fields can also be more than
two-dimensional and can even be defined over non-Euclidean geometries
like the surface of a sphere. Because they are such a general-purpose
object, fields are used as a scene representation in many contexts. Some
examples are the following:
- **Pixel** images, $f: x, y \rightarrow r, g, b$, are fields with the
property that the input domain is discrete, or, equivalently, the
field is piecewise constant over little squares that tile the space.
- **Voxel** fields, $f: X,Y,Z \rightarrow \mathbf{v}$, are discrete
fields that can represent values $\mathbf{v}$ like spatial
occupancy, flux, or color in a volume. They are like 3D pixels. Like
pixels, they have the special property that they are piecewise
constant over cubes the size of their resolution.
- **Optical flow fields**, $f: x, y, t \rightarrow u, v$, measure
motion in a video in terms of how scene content flows across the
image plane. We encountered these fields in @sec-3D_motion_and_its_2D_projection.
- **Signed distance functions** (**SDFs**) @curless1996volumetric,
$f: X,Y,Z \rightarrow d$, represent the distance from the point
$[X,Y,Z]^\mathsf{T}$ to the closest surface. This is a useful
representation of geometry.
:::{.column-margin}
Notation reminder: we use capital letters for world
coordinates and lowercase letters for image coordinates.
:::
As you read this chapter, keep in mind these other kinds of fields, and
think about how the methods you learn next could be used to model other
kinds of fields as well.
## Representing Radiance Fields With Parameterized Functions
Continuous fields are powerful because they have infinite resolution.
But this means that we can't represent a continuous field explicitly; we
cannot record the value at all *infinite* possible positions. Instead we
may use parameterized functions to represent continuous fields. These
functions take as input any continuous position and tell us the value of
the field at that location. Sometimes this is called an **implicit
representation** of the field, since we don't explicitly store the
values of the field at all coordinates. Instead we just record a
*finite* set of parameters of a function that can tell us the value of
the field at any coordinate. Recall that we learned about implicit image
representations in @sec-implicit_image_representations.
:::{.column-margin}
Implicit representations are also used in variational autoencoders (VAEs)
where we represent an infinite set (an infinite mixture of Gaussians) via a parameterized function from a continuous input domain (see the chapter on generative modeling and representation learning). You can consider a VAE to be a continuous field mapping from latent variables to images!
:::
There are many different parameterized functions $L_{\theta}$ that could
represent our target radiance field $L$. We could use a neural net
(e.g., @mildenhall2020nerf), or a mixture of Gaussians (e.g.,
@kerbl20233d), or any number of other function approximators. The
important property we seek is that the field $L_{\theta}$ is a
differentiable function of the parameters $\theta$, hence we can use
gradient-based optimization to find a setting of the parameters that
yields a desirable field (we will use this property when we fit a
radiance field to images). Whatever function family we use, it will take
as input coordinates and produce colors and densities as output
(@fig-nerfs-nerf_module). Because we will use differentiable functions
for this, we can think of it just like a module in a differentiable
computation graph (@fig-nerfs-nerf_module). Later we will put together
multiple such modules and use backpropagation to optimize the whole
system to fit the field to a set of images.
![Module for a parameterized radiance field. We use $\mathbf{R}$ to denote the vector of coordinates $L_\theta$ takes as input.](figures/nerfs/nerf_module.png){width="100%" #fig-nerfs-nerf_module}
### Neural Radiance Fields (NeRFs) {#sec-nerfs-nerf_section}
We will focus our attention on one very popular way of parameterizing
$L_{\theta}$: **Neural radiance fields**
(**NeRFs**) @mildenhall2020nerf. NeRFs model the radiance field $L$ with
a neural network $L_{\theta}$. The neural network architecture in the
original NeRF is a multilayer perceptron (MLP), but other architectures
could be used as well. For the MLP formulation, evaluating the
color-density at a single coordinate corresponds to a single forward
pass through an MLP.
Often, however, we want to evaluate the color-density at multiple
coordinate locations. To create an explicit representation of the
field's values at a grid of coordinates, we can query the field on this
grid. In this usage, the NeRF $L_{\theta}$ looks like a convolutional
neural network (CNN) with 1x1 filters, as this is the architecture that
results from applying an MLP to each position in a grid of inputs. We
show this usage below, in @fig-nerfs-image_to_image_arch.
![NeRF architecture for computing the values at each position in an entire radiance field, sampled on a grid. The $\texttt{pos\_enc}$ refers to positional encoding. You can consider this architecture to be a CNN with 1x1 filters, or an MLP applied to each input coordinate vector.](figures/nerfs/image_to_image_arch.png){width="100%" #fig-nerfs-image_to_image_arch}
As in the original NeRF @mildenhall2020nerf, the output layer is
$\texttt{relu}$ for producing $\sigma$ (which needs to be nonnegative)
and $\texttt{sigmoid}$ for producing $r,g,b$ values (which fall in the
range $[0,1]$).
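To make this concrete, here is a minimal sketch of such an MLP in PyTorch. It is not the exact architecture of @mildenhall2020nerf (which is deeper, uses skip connections, and injects the viewing direction partway through); the layer sizes and names here are our own simplifications, and the input is assumed to already be positionally encoded (described next).

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP: encoded coordinates -> (r, g, b, sigma)."""

    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 4)  # 3 color channels + 1 density

    def forward(self, x):
        h = self.head(self.trunk(x))
        rgb = torch.sigmoid(h[..., :3])   # colors lie in [0, 1]
        sigma = torch.relu(h[..., 3])     # density must be nonnegative
        return rgb, sigma
```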
NeRF also includes a positional encoding layer to transform the raw
coordinates into a Fourier representation. In fact, we use the same
positional encoding scheme as is used in transformers (see @sec-transformers-positional_encodings).
@fig-nerfs-flatland_positional_encoding shows how this scheme
translates the $XY$-coordinate field of Flatland to positional
encodings:
![The first layer of NeRF applies positional encoding to the input coordinate values. Here we show the resulting positional codes for all possible input coordinate values $(X,Y)$ within some range.](figures/nerfs/flatland_positional_encoding.png){width="70%" #fig-nerfs-flatland_positional_encoding}
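A minimal sketch of this positional encoding step is given below; the number of frequency bands and the exact frequencies are free choices, and the values here are only illustrative.

```python
import torch

def positional_encoding(coords, num_bands=6):
    """Map raw coordinates to Fourier features [sin(2^k pi x), cos(2^k pi x)].

    coords: (..., D) tensor of raw coordinates.
    Returns a (..., 2 * num_bands * D) tensor of positional codes.
    """
    freqs = torch.pi * 2.0 ** torch.arange(num_bands, dtype=torch.float32)
    angles = coords[..., None] * freqs                      # (..., D, num_bands)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                        # (..., D * 2 * num_bands)
```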
Fourier positional encodings are very effective for NeRFs because they
help the network model high-frequency variations, which are
abundant in radiance fields @tancik2020fourier. Why do such encodings
help model high frequencies? Essentially, in regular Cartesian
coordinates, the location $(0,0)$ is much closer to $(1,1)$ than it is
to $(100,100)$, but this isn't necessarily the case with Fourier
positional codes. Because of this, with Cartesian coordinates as inputs,
an MLP will be biased to assign similar values to $(0,0)$ and $(1,1)$
and different values to $(0,0)$ and $(100,100)$. Conversely, with
Fourier positional codes as input, this bias will be reduced. This makes
it so the MLP, taking Fourier positional codes as input, has an easier
time fitting high-frequency functions that change in value rapidly, as
in the case of assigning very different output values to the locations
$(0,0)$ and $(1,1)$. See @tancik2020fourier for a more thorough analysis
of this phenomenon.
### Other Ways of Parameterizing Radiance Fields
In principle $L_{\theta}$ could be other kinds of functions, including
those that are not neural nets. The important property it needs to have
is that it be differentiable, so that we can optimize its parameters to
fit a given scene. One alternative is to parameterize the radiance
fields using a voxel-based representation, as was explored in, e.g.,
@fridovich2022plenoxels. Another alternative is to use a mixture of
Gaussians @kerbl20233d. In this case, each Gaussian represents a
semi-transparent ellipsoid of some color $\mathbf{c}$ and some density
$\sigma$. When many such Gaussians overlap, their sum can approximate
scene content very well.
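As one concrete example of a non-neural parameterization, here is a minimal sketch of a grid-based radiance field for our 2D Flatland world. The resolution, channel layout, and bilinear interpolation are choices we make for illustration; systems like Plenoxels @fridovich2022plenoxels use more elaborate data structures, and a mixture of Gaussians would be parameterized differently again.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridField2D(nn.Module):
    """Radiance field for the 2D Flatland world stored as a learnable grid.

    Querying between grid cells uses bilinear interpolation, which keeps the
    field differentiable with respect to the grid values.
    """

    def __init__(self, resolution=128):
        super().__init__()
        # Four channels per cell: r, g, b and a pre-activation density.
        self.grid = nn.Parameter(torch.zeros(1, 4, resolution, resolution))

    def forward(self, coords):
        # coords: (N, 2) world coordinates, assumed normalized to [-1, 1]^2.
        samples = F.grid_sample(self.grid, coords.view(1, -1, 1, 2),
                                mode='bilinear', align_corners=True)
        samples = samples.view(4, -1).t()        # (N, 4)
        rgb = torch.sigmoid(samples[:, :3])
        sigma = F.relu(samples[:, 3])
        return rgb, sigma
```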
## Rendering Radiance Fields {#sec-nerfs-volume_rendering}
Radiance fields can be useful for many tasks in scene modeling, as they
represent both appearance and geometry (density) of content in 3D.
However, the main purpose they were introduced for is **view synthesis**. This is the
problem of rendering what a scene looks like from different camera
viewpoints. In our Flatland example from
@fig-nerfs-flatland_cameras_and_images, this is the problem of
rendering the set of 1D images, $\{\boldsymbol\ell^{(i)}\}_{i=1}^N$,
shown on the left of that figure. In this section, we will see how you
can solve view synthesis once you have a radiance field for a scene.
For this purpose, we will use **volume rendering**, which is the
strategy that methods like NeRFs use for rendering images from radiance
fields (although other rendering methods could be possible). Volume
rendering works by taking an integral of the radiance values along each
camera ray. Think of it like we are looking into a hazy volume of
colored dust. The integral adds up all the color values along the camera
ray, weighted by a value related to the density of the dust. A cartoon
visualization of the process is given in
@fig-nerfs-flatland_volume_rendering, and several real results of
rendering radiance fields are shown in @fig-nerfs-flatland_training.
But before we get to those, let us go step by step through the math of
volume rendering.
The volume rendering equation is:
$$\begin{aligned}
\ell(r) &= \int_{t_n}^{t_f}\alpha(t) \, \overbrace{L^\sigma(r(t))}^{\text{density}} \, \overbrace{L^c(r(t),\mathbf{D})}^{\text{color}} \, dt \\
\alpha(t) &= \exp \Big(-\int_{t_n}^{t} \underbrace{L^\sigma(r(s))}_{\text{density}} \, ds\Big)
\end{aligned}$${#eq-nerfs-vol_rendering_integral}
where $r(t)$ gives the coordinates, in the
radiance field, of the point at distance $t$ along the ray, $\mathbf{D}$ is
the direction of the ray (a unit vector), and $t_n$ and $t_f$ are the
near and far cutoff points, respectively (we only integrate over the
region of the ray within an interval sufficiently large to capture the
scene content of interest). To render an image, we can apply this
equation to the ray that passes through each pixel in the image. This
procedure maps a radiance field $L$ to an image $\boldsymbol\ell$ as
viewed by some camera.
:::{.column-margin}
Here we model the color as being dependent on the direction $\mathbf{D}$ but the density as being direction-independent. This is a modeling choice and happens to work well because view-dependent color effects are abundant in scenes (e.g., specularities) but view-dependent density effects are less common.
:::
This model is based on the idea that the volume is filled with colored
particles. When we trace a ray out of the camera into the scene, it may
potentially hit one of these colored particles. If it does, that is the
color we will see. However, for any given increment along the ray, there
is also a chance we will not hit a particle. The chance we hit a
particle within an increment is modeled by the density function
$L^\sigma$, which represents the differential probability of hitting a
particle as we move along the ray. The chance that we have not hit a
particle all the way up to distance $t$ is then given by $\alpha$, which
is a cumulative integral of the densities along the ray. The integral in
@eq-nerfs-vol_rendering_integral averages over the colors of all the
particles we might hit along the ray, weighted by the probability of
hitting them.
This may seem a bit strange, but note that there is a special case that
might be more intuitive to you. If the particle density is zero, then we
have free space, which contributes nothing to the integral. If the
particle density is infinite, we have a solid object, and after the ray
intersects such an object, $\alpha$ immediately goes to zero and the ray
essentially terminates at the surface of the object. Putting these two
cases together, if we have a scene with solid objects and otherwise
empty space, *then volume rendering reduces to measuring the color of
the first surface hit by each ray exiting the camera*. This simple
model, called **ray casting**, is in fact how many basic renderers work,
such as in the standard rasterization pipeline @shirley2009fundamentals.
This is all to say that volume rendering is a simple generalization of
ray casting. One advantage it has is that it can model translucency, and
media like gases, fluids, and thin films that let some photons pass
through and reflect back others. We will see an even bigger advantage in
@sec-nerfs-fitting, where we will find that volume rendering
gives good gradients for fitting radiance fields to images.
### Computing the Volume Rendering Integral
The volume rendering integral has no simple analytical form. It depends
on the function $L$, which may be arbitrarily complex; think about how
complex this function will have to be to represent all the objects
around you, including their geometry, colors, and material properties.
Because of this, we must use numerical methods to approximate
@eq-nerfs-vol_rendering_integral.
One way to do this is to approximate the integral as a discrete sum. A
particularly effective approximation is the following, which is called a
quadrature rule (see @max1995optical, @max2005local for justification
of this choice of approximation):
$$\begin{aligned}
\ell(r) &\approx \sum_{i=1}^T \alpha_i \, (1-e^{-L^\sigma(\mathbf{R}_i)\delta_i}) \, L^c(\mathbf{R}_i, \mathbf{D})\\
\alpha_i &= \exp\Big(-\sum_{j=1}^{i-1} L^\sigma(\mathbf{R}_j)\delta_j\Big)\\
\delta_i &= t_{i+1} - t_i
\end{aligned}$$ where we have now replaced our continuous radiance field
from @eq-nerfs-vol_rendering_integral with a discrete vector of samples
from the radiance field, $\{\mathbf{R}_1, \ldots, \mathbf{R}_T\}$, where
$\mathbf{R}_i = r(t_i)$ is the world coordinate of the point at distance
$t_i$ along the ray $r$.
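In code, this quadrature rule for a single ray might look like the following sketch (the array names and shapes are our own conventions):

```python
import torch

def volume_render_ray(rgb, sigma, t_vals):
    """Approximate the volume rendering integral along one ray.

    rgb:    (T, 3) colors L^c(R_i, D) at the sampled points.
    sigma:  (T,)   densities L^sigma(R_i).
    t_vals: (T,)   distances t_i of the samples along the ray.
    Returns a (3,) rendered color.
    """
    # delta_i = t_{i+1} - t_i; the last interval has no t_{T+1}, so we pad
    # it with a large value, a common convention.
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])
    # alpha_i = exp(-sum_{j<i} sigma_j * delta_j): the probability that the
    # ray has not yet hit a particle before reaching sample i.
    accum = torch.cumsum(sigma * deltas, dim=0)
    alphas = torch.exp(-torch.cat([torch.zeros(1), accum[:-1]]))
    # Weight of each sample: alpha_i * (1 - exp(-sigma_i * delta_i)).
    weights = alphas * (1.0 - torch.exp(-sigma * deltas))
    return (weights[:, None] * rgb).sum(dim=0)
```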
@fig-nerfs-flatland_volume_rendering visualizes this procedure:
![Volume rendering of our Flatland radiance field.](figures/nerfs/flatland_volume_rendering.png){width="75%" #fig-nerfs-flatland_volume_rendering}
Here we have a 2D radiance field of our Flatland scene, and we are
looking at it from above. We show how two cameras (gray triangles in
corners) view this scene. The camera sensors are depicted as a row of
five pixels, indexed by coordinate $x$. We send a ray outward from the
camera origin through each pixel to query the scene. Each ray gets
sampled at the points that are drawn as circles. The size of the circle
indicates the $\alpha$ value and the color/transparency of the circle
indicates the color/density of the radiance field at that point. In this
scene, we have solid objects, so the density is infinite within the
objects and as soon as a ray hits an object, all remaining points become
occluded from view ($\alpha$ goes to zero for the remaining points, as
indicated by the circles shrinking in size).
In this figure, we took samples at evenly spaced increments along the
ray. Instead, as suggested in @mildenhall2020nerf, we can do better by
chopping the ray into $T$ evenly spaced intervals, then sampling one
point uniformly at random within each of these intervals. The effect of this is
that all continuous coordinates will get sampled with some probability,
and therefore if we run the approximation over and over again (like we
will when using it in an optimization loop; see @sec-nerfs-nerf_section) we will eventually take into
account every point in the continuous space (they will all be supervised
during fitting a radiance field to explain a scene).
Following this strategy, for a ray whose origin is $\mathbf{O}$ and
whose direction is $\mathbf{D}$, we compute sampled coordinates
$\{\mathbf{R}_1, \ldots, \mathbf{R}_T\}$ as follows:
$$\begin{aligned}
\mathbf{R}_i &= \mathbf{O} + t_i\mathbf{D}\\
t_i &\sim \mathcal{U}\Big[t_n + \tfrac{i-1}{T}(t_f-t_n),\; t_n + \tfrac{i}{T}(t_f-t_n)\Big]
\end{aligned}$$
:::{.column-margin}
The distribution $\mathcal{U}[a,b]$ is the uniform distribution over the interval $[a,b]$.
:::
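A minimal sketch of this stratified sampling step (one random sample per interval), with names of our own choosing:

```python
import torch

def stratified_sample_along_ray(origin, direction, t_near, t_far, num_samples):
    """Sample points along a ray, one uniformly at random per interval.

    origin:    (3,) ray origin O.
    direction: (3,) unit ray direction D.
    Returns t_vals (num_samples,) and coords R_i = O + t_i * D, (num_samples, 3).
    """
    # Split [t_near, t_far] into num_samples evenly spaced intervals.
    edges = torch.linspace(t_near, t_far, num_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    # Draw one t_i uniformly at random within each interval.
    t_vals = lower + (upper - lower) * torch.rand(num_samples)
    coords = origin + t_vals[:, None] * direction
    return t_vals, coords
```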
Now we can put all our pieces together as a
series of computational modules that define the full volume rendering
pipeline. Later in the chapter, we will combine these modules with other
modules (including neural nets) to create a computation graph that can
be optimized with backpropagation.
Our task is to compute the image $\boldsymbol\ell$ that will be seen by
a specified camera viewing the scene. We need to find the color of each
pixel in this image. To find the color of a pixel,
$\boldsymbol\ell[n,m,:]$[^1], at camera coordinate $n,m$, the first step
is to find the ray (origin $\mathbf{O}$ and direction $\mathbf{D}$) that
passes through this pixel. This can be done using the methods we have
encountered in this book for modeling camera optics and geometry, which
we will not repeat here (see @sec-pixel_to_rays). The key pieces of information we need
to solve this task are the camera origin, $\mathbf{O}_{\texttt{cam}}$,
and the camera matrix, $\mathbf{K}_{\texttt{cam}}$, that describes the
mapping between pixel coordinates and world coordinates (steps for
constructing $\mathbf{K}_{\texttt{cam}}$ are given in @sec-simple_system_revisited-adapting_the_output). We apply
this mapping, then compute the unit vector from the origin to the pixel
to get the direction $\mathbf{D}$ (@fig-nerfs-pixel2ray):
![Module for mapping from pixel coordinates to the world coordinates of a ray through that pixel.](figures/nerfs/pixel2ray.png){width="100%" #fig-nerfs-pixel2ray}
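The exact form of this computation depends on the camera conventions from @sec-pixel_to_rays; below is a sketch under the simplifying assumption that we are given a $3 \times 3$ pinhole intrinsics matrix and a camera-to-world rotation (this decomposition of $\mathbf{K}_{\texttt{cam}}$ into separate pieces is our own choice for illustration).

```python
import torch

def pixel2ray(n, m, K_intrinsics, R_cam2world, O_cam):
    """Map a pixel coordinate (n, m) to a world-space ray (origin, direction).

    K_intrinsics: (3, 3) pinhole intrinsics matrix (an assumed decomposition).
    R_cam2world:  (3, 3) rotation from camera to world coordinates.
    O_cam:        (3,)   camera center in world coordinates (the ray origin).
    """
    # Homogeneous pixel coordinates; here we treat m as the horizontal (x)
    # coordinate and n as the vertical (y) coordinate -- adjust to your convention.
    pixel_h = torch.tensor([float(m), float(n), 1.0])
    dir_cam = torch.linalg.inv(K_intrinsics) @ pixel_h   # back-project into the camera frame
    dir_world = R_cam2world @ dir_cam                    # rotate into the world frame
    D = dir_world / dir_world.norm()                     # unit ray direction
    return O_cam, D
```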
Next we sample world coordinates along the ray, as described previously
(@fig-nerfs-ray2coords):
![Module for sampling coordinates along a ray.](figures/nerfs/ray2coords.png){width="100%" #fig-nerfs-ray2coords}
Finally, we use the volume rendering integral (using the quadrature rule
approximation given previously) to compute the color of the pixel
(@fig-nerfs-vrender):
![Module for volume rendering of a single ray.](figures/nerfs/vrender.png){width="100%" #fig-nerfs-vrender}
This pipeline gives a mapping from pixel coordinates to color values:
$n,m \rightarrow r,g,b$. Our task in the next section will be to explain
a set of images as being the result of volume rendering of a radiance
field, that is, we want to infer the radiance field from the photos.
## Fitting a Radiance Field to Explain a Scene {#sec-nerfs-fitting}
In this section our goal is to find a radiance field that can explain a
set of observed images. This is the task our denizens of Flatland have
to solve as they walk around their world and try to come up with a
representation of the scene they are seeing. This is the task of vision.
Rendering radiance fields into images (view synthesis) is a graphics
problem. We are now moving on to the vision problem, which is the inverse
problem and is where we really wanted to get: inferring the radiance
field that renders to an observed set of images.
Concretely, our task is to take as input a set of images, along with the
camera parameters for the cameras that captured those images. We will
produce a parameterized radiance field as output. The full fitting
procedure therefore maps
$\{\boldsymbol\ell^{(i)}, \mathbf{K}^{(i)}_{\texttt{cam}}, \mathbf{O}^{(i)}_{\texttt{cam}}\}_{i=1}^N \rightarrow L_{\theta}$.
Our objective is that if we render that radiance field, using each of
the input cameras, the rendering will match the input images as closely
as possible.
### Computation Graph for Rendering an Image
Given $L_{\theta}$, we will render an image simply using volume
rendering as described previously. The entire computation graph is shown
in @fig-nerfs-full_nerf_pipeline. Notice that all the modules are
differentiable, which will be critical for our next step, fitting the
parameters to data.
![Full NeRF pipeline for rendering an image.](figures/nerfs/full_nerf_pipeline.png){width="100%" #fig-nerfs-full_nerf_pipeline}
:::{.column-margin}
Note that the camera parameters also need to be input into $\texttt{pixel2ray}$ and that $\texttt{vrender}$ also takes as
input the direction $\mathbf{D}$ and sampling points $\mathbf{t}$ computed by $\texttt{ray2coords}$.
:::
### Optimizing the Parameters of this Computation Graph
Fitting an $L_{\theta}$ to explain the images just involves optimizing
the parameters $\theta$ to minimize the reconstruction error between the
images of the radiance field rendered through $\texttt{render}_{\theta}$
and the observed input images.
We can phrase this procedure as a learning problem and show it as the
diagram below:
![](figures/nerfs/nerf_learning.png){width="100%"}
:::{.column-margin}
Fitting a radiance field to volume rendered images is
very much like solving a magic square puzzle. We try to find the
radiance field (the square) that integrates to the observed images (the
row and column sums of the square).
:::
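Concretely, the fitting loop is ordinary gradient descent on a pixel reconstruction loss. Here is a minimal sketch that reuses the helper functions sketched earlier in this chapter and, for simplicity, ignores the viewing direction input:

```python
import torch

def fit_radiance_field(model, rays, target_colors, t_near, t_far,
                       num_samples=64, num_steps=2500, lr=5e-4):
    """Fit L_theta by minimizing reconstruction error on observed pixels.

    model:         parameterized radiance field, e.g., the TinyNeRF sketch above.
    rays:          list of (origin, direction) pairs, one per observed pixel.
    target_colors: (num_rays, 3) observed pixel colors.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(num_steps):
        optimizer.zero_grad()
        rendered = []
        for origin, direction in rays:
            # Sample the field along the ray and volume render a color.
            t_vals, coords = stratified_sample_along_ray(
                origin, direction, t_near, t_far, num_samples)
            rgb, sigma = model(positional_encoding(coords))
            rendered.append(volume_render_ray(rgb, sigma, t_vals))
        loss = ((torch.stack(rendered) - target_colors) ** 2).mean()
        loss.backward()   # gradients flow through the whole rendering pipeline
        optimizer.step()
    return model
```

For example, with 3D coordinates and six frequency bands, the model would be constructed as `TinyNeRF(in_dim=2 * 6 * 3)`. In practice, rays are batched and rendered in parallel rather than looped over one at a time, but the computation is the same.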
From this perspective, radiance fields such as NeRF are a clever
hypothesis space for constraining the mapping from inputs to outputs of
a learned rendering system. Of course we could have used a much less
constrained hypothesis space such as a big transformer that directly
maps from input pixel coordinates to output colors, without the other
modules like $\texttt{ray2coords}$ or $\texttt{vrender}$. However, such
an unconstrained approach would be much less sample efficient and would
require far more data and parameters to learn a good rendering solution
(refer back to @sec-problem_of_generalization to recall why a more
constrained hypothesis space can reduce the required amount of training
data to find a good solution). On the other hand, such a solution could also be
more general purpose, as it could handle optical effects that are not
well captured by radiance fields.
In fact, there are two products you can get out of a fitted radiance field,
one for the graphics community and one for the vision community. The
graphics product is a system that can render a scene from novel
viewpoints. This works by just applying $\texttt{render}_{\theta}$
(@fig-nerfs-full_nerf_pipeline) to new cameras with new viewpoints. The
second product is an inferred radiance field, which tells us something
about what is where in the scene. For example, the density of this
radiance field can tell us the geometry of the scene, as we will see in
the following example of fitting a radiance field to our Flatland scene.
![Iterations of fitting a radiance field to Flatland. All of these are top-down views of the world (we are looking at Flatland from above, which is a view the inhabitants cannot see). (a) The radiance field visualized as an image with color equal to the color of the field at each position and transparency proportional to the density of the field at each position. (b and c) The volume rendering process for two different cameras looking at the radiance field. The circle colors and transparencies again show the color and density of the field, and the circle size shows the $\alpha$ value as we walk along each camera ray. Small circles mean those points are more occluded; i.e. there is a low probability of the ray reaching them and they contribute very little to the volume rendering integral.](figures/nerfs/flatland_training.jpg){width="90%" #fig-nerfs-flatland_training}
In @fig-nerfs-flatland_training, we show several iterations of the
fitting process. The top row shows the radiance field as seen from
above. This is a perspective our Flatlanders cannot directly see but can
infer, just like we on Earth cannot directly see the full shape of a 3D
($X,Y,Z$) radiance field but can infer it from images. The bottom rows
show volume rendering from two different cameras. The small circles
along the rays are samples of the radiance field. Their size is the
$\alpha$ value at that location along the ray, and their color and
opacity are the color and density values of that location in the
radiance field. Notice that initially we have a hazy cloud of stuff in
the middle of the scene and over iterations it congeals into solid
objects in the shapes of our Flatland denizens.
After around 2,500 steps of gradient descent, we arrive at a fairly
accurate representation of the scene. At this point, the objects'
densities are high enough that volume rendering essentially amounts to
just ray casting (finding the color of the first surface we hit). So why
did we use volume rendering? Not because we care about volumetric
effects; in this scene there are none. Rather because it makes it
possible for optimization to find the ray casting solution. With volume
rendering, we have the property that small changes to the scene content
yield small changes to our loss; that is, we have small but non-zero
gradients. Ray casting, in contrast, is an all or none operation, and
therefore small changes to scene content can yield drastic changes to
the rendered images (an occluder can suddenly block the object that was
previously in view of a pixel). This makes gradient-based optimization
difficult with ray casting, but achievable with volume rendering. This
is the second, and arguably main benefit of volume rendering that we
alluded to previously (the first being the ability to model translucency
and subsurface effects).
## Beyond Radiance Fields: The Rendering Equation
Radiance fields are built to support volume rendering, but volume
rendering has certain limitations. One major limitation is that
volume rendering does not model multi-bounce optical effects, where a
photon hits a surface, bounces off, and then hits another surface, and
so on. This leads to issues with how radiance fields represent shadows
and reflections. Instead of a shadow being the result of a physical
process where photons are blocked from illuminating one part of the
scene, in radiance fields shadows get painted onto scene elements: $L^c$
gets a darker color value in shadowed parts of the scene. If we want to
change the lighting of our scene, we therefore need to update the colors
in the radiance field to account for the new shadow locations, and with
the radiance field representation, there is no simple way to do this (we
may have to run our fitting procedure anew, this time fitting the
radiance field to new images of the newly lit scene). The same is true
for reflections: a radiance field will represent reflected content as
painted onto a surface, rather than modeling the reflection in the
physically correct way, as due to light bouncing off the surface.
Fortunately, other, more general, rendering models exist that better
handle multi-bounce effects. Perhaps the most general is **the rendering
equation**, which was introduced by Kajiya @Kajiya1986 in the field of
computer graphics. The goal was to introduce a new formalism for image
rendering by directly modeling the light scattering off the surfaces
composing a scene. The rendering equation can be written as,
$$L(\mathbf{x},\mathbf{x}') = g(\mathbf{x},\mathbf{x}') \left[ e(\mathbf{x},\mathbf{x}') + \int_S \rho(\mathbf{x},\mathbf{x}',\mathbf{x}^{\prime\prime}) L(\mathbf{x}',\mathbf{x}^{\prime\prime}) d\mathbf{x}^{\prime\prime} \right]
$${#eq-nerf-Kajiya_rendering_equation}
where $L(\mathbf{x},\mathbf{x}')$ is the intensity of the light ray that passes
from point $\mathbf{x}'$ to point $\mathbf{x}$ (this is analogous to the
plenoptic function but with a different parameterization). The function
$e(\mathbf{x},\mathbf{x}')$ is the light emitted from $\mathbf{x}'$ to
$\mathbf{x}$, and can be used to represent light sources. The function
$\rho(\mathbf{x},\mathbf{x}',\mathbf{x}^{\prime\prime})$ is the
intensity of light scattered from $\mathbf{x}^{\prime\prime}$ to
$\mathbf{x}$ by a patch of surface at location $\mathbf{x}'$ (this is related to
the bidirectional reflectance distribution function).
:::{.column-margin}
Here $\mathbf{x}$, $\mathbf{x}'$, and $\mathbf{x}^{\prime\prime}$ represent vectors of world coordinates, e.g., $\mathbf{x} = [X,Y,Z]$.
:::
The term $g(\mathbf{x}, \mathbf{x}')$ is a visibility function and
encodes the geometry of the scene and the occlusions present. If points
$\mathbf{x}$ and $\mathbf{x}'$ are not visible from each other, the
function is 0, otherwise it is
$1/\lVert \mathbf{x}-\mathbf{x}' \rVert ^2$, modeling how energy
propagates from each point.
In the words of Kajiya, "The equation states that the transport intensity of
light from one surface point to another is simply the sum of the emitted
light and the total light intensity which is scattered toward
$\mathbf{x}$ from all other surface points." Integrating equation
(@eq-nerf-Kajiya_rendering_equation) can be done using numerical methods
and has been the focus of numerous studies in computer graphics.
While this rendering model is more powerful than the one radiance fields
use, it is also more costly to compute. In the future, as hardware
improves, we may see a movement away from radiance fields and toward
models that more fully approximate the full rendering equation.
## Concluding Remarks
Radiance fields try to model the light content in a scene. The ultimate
goal is to model the full plenoptic function, that is, all physical
properties of all the photons in the scene. Along the way to this goal,
many simplified models have been proposed, and radiance fields are just
one of them. Methods for modeling and rendering radiance fields build
upon many of the topics we have seen earlier in this book, such as
multiview geometry, signal processing, and neural networks. They appear
near the end of this book because they rest upon almost all the
foundations we have by now built up.
[^1]: In this chapter, we represent RGB images as $N \times M \times 3$
dimensional arrays.