%!TEX root = foo-thesis.tex
\chapter{Results and Discussion}
\label{chap:results}
This chapter details the performance and rendering quality characteristics of the implemented techniques and examines their memory usage. For each technique, the advantages and drawbacks are subsequently discussed.
\section{Scenes, Settings and Testing System}
\label{sec:results:settings}
The implementation is tested in the Crytek Sponza and San Miguel scenes, with 260k and 7.9M triangles, respectively, both provided by \citet{McGuire2011Data}. Unless otherwise noted, the parameters used for the measurements and screenshots are:
\begin{outline}
\1 1920x1080\,px output resolution
\1 1024 VPLs
\1 2048²\,px ISM texture, i.\,e., 64²\,px per ISM
\1 Single-pixel point renderer enabled, 16 VPLs considered per point, up to 4 collected
\1 4x4 interleaving pattern
\1 Tiled shading enabled
\end{outline}
\noindent
The hardware specification of the testing system is as follows:
\begin{outline}
\1 Intel Xeon W3530 with four cores at 2.8\,GHz
\1 NVIDIA GeForce GTX 980 factory-overclocked by ca.\ 4\%, driver version 375.57
\end{outline}
\noindent
The GTX 980 has been locked to its base clock rate of 1164\,MHz using SetStablePowerState.exe\footnote{\url{https://developer.nvidia.com/setstablepowerstateexe-\%20disabling\%20-gpu-boost-windows-10-getting-more-deterministic-timestamp-queries}}, preventing its GPU Boost feature\footnote{\url{http://www.geforce.com/hardware/technology/gpu-boost-2}} from altering the results depending on the GPU's temperature and power draw. This comes at the cost of preventing the GPU from using its maximum potential, but combined with the factory overclock, the results should be close to real-world performance.
\pagebreak
\section{Rendering Time Breakdown}
\label{sec:results:RenderingTimeBreakdown}
\begin{table}[h]
\centering
\captionabove{Timing breakdown of an entire frame rendered by the presented software. Timings are in milliseconds.}
\begin{tabulary}{0.98\textwidth}{| L | L | L | L | L | L | L || L |}
\hline
G-Buffer & RSM & ISM & GI & SSAO & Combine & Blit & Total \\ \hline
0.6\,ms & 0.2\,ms & 2.7\,ms & 3.2\,ms & 1.4\,ms & 0.3\,ms & 0.3\,ms & 8.7\,ms \\
\hline
\end{tabulary}
\label{tab:results:timing_breakdown_frame}
\end{table}
\noindent
\Cref{tab:results:timing_breakdown_frame} gives an overview of the time needed for rendering a single frame. The Crytek Sponza scene and default settings have been used. All timings measure GPU time only; there are no noteworthy calculations done on the CPU. With a total of 8.7\,ms, the application achieves real-time frame rates while leaving room for more expensive calculations, higher output resolutions, or more detailed scenes.
The GI, SSAO, and combine passes operate in screen space and depend mostly on the screen size (and, in the case of GI and SSAO, on the light and sample count, respectively). The G-Buffer and RSM generation follow the usual performance characteristics of geometry passes and depend mainly on output resolution and geometric complexity. The performance behavior of the ISM rendering heavily depends on the settings and is further detailed below. The blit phase copies a texture to the back buffer and serves debugging purposes only.
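GPU times like these are typically obtained with OpenGL timer queries. The following C++ sketch shows one way to measure a single pass; it is an illustration only, not the code of the presented implementation, and glad as well as renderPass() are merely placeholders for whichever loader and pass are used:
\begin{verbatim}
// Sketch: timing one GPU pass with an OpenGL timer query.
// Assumes an active OpenGL context; glad is only an example loader.
#include <glad/glad.h>
#include <cstdio>

void timeGpuPass(void (*renderPass)())
{
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    renderPass();                        // the GPU work to be measured
    glEndQuery(GL_TIME_ELAPSED);

    // Blocks until the result is available; a real renderer would check
    // GL_QUERY_RESULT_AVAILABLE a frame later to avoid stalling the GPU.
    GLuint64 nanoseconds = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &nanoseconds);
    std::printf("pass took %.2f ms\n", nanoseconds / 1e6);

    glDeleteQueries(1, &query);
}
\end{verbatim}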
\section{RSM Generation and VPL Sampling}
\label{sec:results:RsmAndVplSampling}
\begin{figure}[htb]
\centering
\begin{subfigure}[b]{1.0\textwidth}
\centering
\includegraphics[width=1.0\linewidth]{screenshots/RSM_diffuse}%
\caption{}
\end{subfigure}\\
\par\medskip
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=1.0\linewidth]{screenshots/RSM_normal}%
\caption{}
\end{subfigure}%
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=1.0\linewidth]{screenshots/RSM_depth}%
\caption{}
\end{subfigure}%
\caption{A reflective shadow map, with its (a) diffuse, (b) normal, and (c) depth buffer.}
\label{fig:results:RSMBuffers}%
\end{figure}%
\noindent
As explained before, the RSM generation re-uses most of the code of the G-Buffer generation; see \Cref{fig:results:RSMBuffers} for an impression of the rendered result. Accordingly, the performance characteristics are similar to those of the regular geometry pass of a deferred renderer.
Since the implemented VPL sampling distributes its samples over the entire shadow map, we limit the VPLs' locations to the relevant area simply by tuning the light's extents; as a result, the aspect ratio is chosen specifically for each test scene. Since our implementation re-uses the RSM as the shadow map for direct lighting, we chose a resolution of 512²\,px (or, in the case of the Sponza scene, 1024x256\,px), which is more than the naive VPL sampling requires. The RSM is rendered in 0.17\,ms for the Sponza scene.
The VPL sampling itself takes only a few microseconds and is negligible. Bear in mind, however, that achieving high quality levels requires a more elaborate sampling algorithm, which can in fact take up most of the available time. For instance, the implemented sampling does not take into account how relevant the sampled VPLs are to the current frame. This makes the entire system rather inefficient, because lights may be chosen that contribute little or nothing to the rendered output (\Cref{fig:results:RSMUnfavorable}). An additional downside of the naive sampling is its poor temporal stability when the scene light moves: each VPL stays at the exact same position in the light's viewport, so the VPL positions strictly follow the light's movement, jumping over depth discontinuities along the way.
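As a rough illustration of the naive sampling, the following C++ sketch distributes sample positions uniformly over the RSM. The helper is hypothetical and assumes a regular grid, which may differ from the exact pattern used in the implementation:
\begin{verbatim}
// Conceptual sketch of naive VPL sampling: distribute N sample positions
// uniformly over the full RSM; each position is later used to read the
// RSM's position, normal, and flux buffers, which together define a VPL.
// A regular grid is assumed here purely for illustration.
#include <cmath>
#include <vector>

struct Uv { float u, v; };

std::vector<Uv> naiveVplSampleUvs(int vplCount)
{
    const int side = static_cast<int>(std::ceil(std::sqrt(float(vplCount))));
    std::vector<Uv> uvs;
    uvs.reserve(vplCount);
    for (int i = 0; i < vplCount; ++i)
    {
        const int x = i % side;
        const int y = i / side;
        // Sample at cell centers so no VPL lands on the RSM border.
        uvs.push_back({ (x + 0.5f) / side, (y + 0.5f) / side });
    }
    return uvs;
}
\end{verbatim}
Because the samples cover the whole RSM regardless of content, any texel outside the visually relevant area directly wastes one of the 1024 VPLs, which is why the light's extents are tuned per scene as described above.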
\begin{figure}[htb]
\centering
\begin{tabular}{@{}c@{}}
\includegraphics[width=1.0\textwidth]{screenshots/RSM_unfavorable} \\
\end{tabular}
\caption{A visualization of inefficiently placed VPLs. Each of the bright spots is a VPL. Here, they are all placed on the roof pointing upwards. Since there is no further geometry above them that could be illuminated, they do not contribute to the final image.}
\label{fig:results:RSMUnfavorable}
\end{figure}
\section{ISM Rendering}
\label{sec:results:ism}
This section presents the results for visibility testing using imperfect shadow maps. The next three subsections show the outcome in terms of quality, performance, and memory usage. The technique does not handle the high geometric complexity of the test scenes well; this is detailed in the fourth subsection. Afterwards, the two ISM renderers are compared and the suitability of ISMs for global illumination in general is discussed.
\subsection{Quality}
\label{sec:results:ism:quality}
\begin{figure}[htb]
\centering
\begin{tabular}{@{}cc@{}}
\includegraphics[width=.48\textwidth]{screenshots/ism_splat_cropped} &
\includegraphics[width=.48\textwidth]{screenshots/ism_single_pixel_cropped}
\end{tabular}
\caption{ISMs rendered using the splat renderer (left) and single-pixel renderer (right) with default settings. The single-pixel renderer performs interpolation between points, renders more points, and does not let points bleed into neighboring ISMs, but takes more time.}
\label{fig:results:isms}
\end{figure}
\noindent
\Cref{fig:results:isms} shows a few ISMs rendered with the splat and the single-pixel renderer. The imperfections do show, and not only because of the low resolution: the distortion of the surface silhouettes by the splat renderer or the postprocessing contributes as well, making it hard to identify which part of the Crytek Sponza scene is rendered. However, keep in mind that the imperfections, while noticeable in each individual ISM, are expected to partly cancel each other out and result in an acceptable average error in the final rendering.
\begin{figure}[htb]
\centering
\begin{tabular}{@{}cc@{}}
\includegraphics[width=.48\textwidth]{screenshots/darkening_splat} &
\includegraphics[width=.48\textwidth]{screenshots/darkening_single_pixel}
\end{tabular}
\caption{Darkening caused by the splat renderer (left) compared to the single-pixel renderer (right). Note that there is no skylight rendered, which leads to an unnaturally dark upper part in the image even for the single-pixel renderer.}
\label{fig:results:ismDarkening}
\end{figure}
\noindent
Comparing the screenshots in \cref{fig:results:ismDarkening}, it becomes apparent that the splat renderer causes visible darkening in the upper part of the image. The reason is probably that the point splats are always oriented towards the camera and do not take the point's normal into account when rendering, amplifying the usual aliasing artifacts of common shadow maps. As a result, any point size larger than one pixel causes surfaces that are not directly facing the camera to appear nearer than they actually are when doing shadow lookups in ISMs. A larger shadow bias could compensate for that but would introduce heavier light leaking. Another possibility is to use the normal to calculate a point's depth per fragment at a potentially high performance cost.
As the single-pixel renderer does not use splats, this problem is largely mitigated; in a sense, it performs the per-fragment depth calculation implicitly during the interpolation in the postprocessing phase.
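To illustrate the per-fragment depth calculation mentioned above, the following conceptual C++ sketch evaluates the splat's tangent plane at a fragment offset. It ignores the paraboloid distortion and is not taken from the presented implementation:
\begin{verbatim}
// Depth of a splat fragment at offset (dx, dy) from the splat center,
// derived from the tangent plane n . (p - c) = 0. The normal and the
// center depth are assumed to be given in the ISM's image space.
#include <cmath>

struct Vec3 { float x, y, z; };

float planeCorrectedDepth(Vec3 n, float centerDepth, float dx, float dy)
{
    // Guard against normals nearly parallel to the image plane.
    const float nz = (std::fabs(n.z) > 1e-3f) ? n.z : 1e-3f;
    return centerDepth - (n.x * dx + n.y * dy) / nz;
}
\end{verbatim}
Evaluating this for every covered fragment is what causes the potentially high performance cost mentioned above, which is why the splat renderer simply uses the center depth for the whole splat.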
\begin{figure}[htb]
\centering
\begin{tabular}{@{}cc@{}}
\includegraphics[width=.48\textwidth]{screenshots/leaks_splat} &
\includegraphics[width=.48\textwidth]{screenshots/leaks_single_pixel}\\
\includegraphics[width=.48\textwidth]{screenshots/leaks_splat_exposure} &
\includegraphics[width=.48\textwidth]{screenshots/leaks_single_pixel_exposure}
\end{tabular}
\caption{Light leaks caused by imperfect shadow maps rendered with the splat renderer (left) and single-pixel renderer (right). VPLs from beneath the curtains leak light onto the wall and ceiling, while VPLs on the pillars and curtains light the floor. The renderings in the bottom row were rendered with a higher exposure for illustration. }
\label{fig:results:leaks}
\end{figure}
\noindent
\Cref{fig:results:leaks} shows a case with which the ISM technique has difficulties. The image should be mostly dark, or at least uniformly lit through the small gap below the curtains; instead, the wall and ceiling appear too bright and the floor displays several artifacts. The root cause is that the VPLs are placed right behind the curtains, so the occluding geometry, i.\,e., the curtain itself, is very close to the light source. Due to the randomness involved when selecting point sets for rendering the ISMs, the points near the light source are often rendered into other ISMs, leaving a large hole behind. The single-pixel renderer fares a bit better than the splat renderer since it uses more points, but does not provide a satisfactory result either.
\begin{figure}[htb]
\centering
\begin{tabular}{@{}cc@{}}
\includegraphics[width=.48\textwidth]{screenshots/bias_splat} &
\includegraphics[width=.48\textwidth]{screenshots/bias_single_pixel}\\
\includegraphics[width=.48\textwidth]{screenshots/bias_splat_exposure} &
\includegraphics[width=.48\textwidth]{screenshots/bias_single_pixel_exposure}
\end{tabular}
\caption{Light leaks caused by using a relatively large shadow bias in order to hide artifacts of the ISMs. The screenshots on the left use the splat renderer, those on the right the single-pixel renderer. The bottom row was rendered with a higher exposure.}
\label{fig:results:bias}
\end{figure}
Our implementation hides some of the inaccuracies of ISMs by using a relatively large shadow bias. The resulting light leaks are shown in \Cref{fig:results:bias}. While the splat renderer fares better here, it does so mainly by darkening the whole image, since it uses rather large points. The single-pixel renderer, on the other hand, correctly lets light illuminate the left wall through the columns, but shows a rather noisy result.
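For reference, the following C++ sketch shows the math of a biased paraboloid shadow-map lookup as it is commonly used for ISMs. The function and parameter names are hypothetical, and the actual shader code of the implementation may differ:
\begin{verbatim}
// Visibility test of a surface point against one VPL's ISM.
// (tx, ty, tz) is the vector from the VPL to the surface point in the
// VPL's local frame (hemisphere looking along +z); fetchIsmDepth(u, v)
// stands for sampling the VPL's 64x64 px tile of the ISM texture.
#include <cmath>
#include <functional>

bool ismVisible(const std::function<float(float, float)>& fetchIsmDepth,
                float tx, float ty, float tz,
                float farPlane, float shadowBias)
{
    const float dist = std::sqrt(tx * tx + ty * ty + tz * tz);

    // Paraboloid projection of the normalized direction into [0, 1]^2.
    const float u = tx / dist / (1.0f + tz / dist) * 0.5f + 0.5f;
    const float v = ty / dist / (1.0f + tz / dist) * 0.5f + 0.5f;

    const float occluderDepth = fetchIsmDepth(u, v);   // nearest occluder
    const float currentDepth  = dist / farPlane;       // the shaded point

    // The relatively large bias hides ISM inaccuracies, at the cost of
    // the light leaks shown in the figure above.
    return currentDepth - shadowBias <= occluderDepth;
}
\end{verbatim}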
\subsection{Performance}
\label{sec:results:ism:performance}
\begin{table}[h]
\centering
\captionabove{Timings of the ISM renderers with different settings.}
\begin{tabulary}{0.98\textwidth}{| L | L | L | L |}
\hline
Splat Default Settings & Single-Pixel Default Settings & Splat with Single-Pixel Settings & Single-Pixel with Splat Settings \\ \hline
1.08\,ms & 2.74\,ms & 10.14\,ms & 2.31\,ms \\
\hline
\end{tabulary}
\label{tab:results:ism_timings}
\end{table}
\noindent
\Cref{tab:results:ism_timings} shows the time needed for rendering the ISMs with both renderers and different settings. Note that the comparison between the two renderers with default settings is not a fair one, since the single-pixel renderer renders more points and by default clamps each point's extents to its ISM. If the splat renderer is changed to behave similarly, clamping adds 0.3\,ms, and collecting VPLs with the same technique as the single-pixel renderer adds another 8.66\,ms.
There are several reasons for this heavy slowdown: First, this behavior is implemented by emitting multiple vertices in the geometry shader, which is known to perform poorly on current GPUs. Second, the additional fillrate and overdraw become an issue when rendering this many points, a bottleneck that is unlikely to occur with the single-pixel renderer. And third, while the single-pixel renderer uses shared memory to load the 16 VPLs once per work group, the geometry shader of the splat renderer has no access to shared memory and has to load all 16 VPLs per invocation, creating high register pressure.
Conversely, if the single-pixel renderer does not perform clamping, the push phase needs 0.07\,ms less (0.85\,ms to 0.78\,ms), and when it considers only one VPL per point, the point renderer takes 0.41\,ms less (0.65\,ms to 0.24\,ms).
Since the presented implementation uses no adaptivity, these numbers are independent of the viewport. They are, however, slightly affected by the VPL placement, since it influences culling. In a second scenario, where the sunlight shines directly from above, all VPLs are placed on the floor facing upwards. As a result, far fewer points are culled during ISM rendering, resulting in slightly higher timings: both renderers need about 0.1\,ms more in this case.
\subsubsection{Detailed Performance Measurements for the Single-Pixel Point Renderer}
\label{sec:results:ism:performanceSinglePixelRenderer}
\begin{table}[h]
\centering
\captionabove{Timing breakdown of the single-pixel point renderer.}
\begin{tabulary}{0.98\textwidth}{| L | L | L | L || L |}
\hline
Point Collection & Point Rendering & Pull Phase & Push Phase & Total\\ \hline
0.83\,ms & 0.65\,ms & 0.39\,ms & 0.85\,ms & 2.72\,ms\\
\hline
\end{tabulary}
\label{tab:results:timing_breakdown_single_pixel}
\end{table}
\vspace{-5pt}
\begin{table}[h]
\centering
\captionabove{Timing breakdown of the pull (PL) and push (PS) phase. The numbers of the individual steps indicate to which mipmap level they write, which is why the pull phase starts at 1 and the push phase has descending numbers. All timings are in milliseconds.}
\begin{tabulary}{0.98\textwidth}{| L | L | L | L | L | L | L | L || L | L || L |}
\hline
PL 1 & PL 2 & PL 3 & PL $>$ 3 & PS $>$ 2 & PS 2 & PS 1 & PS 0 & PL Total & PS Total & Total \\ \hline
0.16 & 0.16 & 0.04 & 0.03 & 0.06 & 0.08 & 0.30 & 0.41 & 0.39 & 0.85 & 1.24\\
\hline
\end{tabulary}
\label{tab:results:timing_breakdown_pull_push}
\end{table}
\noindent
\Cref{tab:results:timing_breakdown_single_pixel} gives more detailed performance measurements of the single-pixel renderer, while \cref{tab:results:timing_breakdown_pull_push} further breaks down the individual steps of the pull and push phase. Note how the timings scale roughly as expected (PL 2 and PS 1 take roughly four times as long as PL 3 and PS 2, respectively), except for the first and the last step. This is due to the reduced input data size (or output data size, respectively), as explained in \cref{sec:impl:pullPushPostprocessing}. Also note that the push phase would take roughly as long as the pull phase if it were not for its last step, which writes to the full 2048²\,px of mip level zero, whereas the largest mip level written by the pull phase is level one with 1024²\,px.
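The expected fourfold scaling follows directly from the texel counts of the mipmap levels, as the small C++ sketch below illustrates (texel counts only; the differing texel formats of the first and last step are ignored):
\begin{verbatim}
// Texels written when processing a 2048^2 px ISM texture level by level;
// each level has a quarter of the previous level's texels.
#include <cstdio>

int main()
{
    for (int level = 0; level <= 4; ++level)
    {
        const long long side = 2048 >> level;
        std::printf("level %d: %lld x %lld = %lld texels\n",
                    level, side, side, side * side);
    }
    // level 0: 2048 x 2048 = 4194304 texels  (written only by PS 0)
    // level 1: 1024 x 1024 = 1048576 texels  (largest level the pull phase writes)
    // level 2:  512 x  512 =  262144 texels  -> PS 1 ~ 4x PS 2, PL 2 ~ 4x PL 3
    return 0;
}
\end{verbatim}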
\subsection{Memory Usage}
\label{sec:results:ism:memory}
The point splat renderer requires no memory beyond the ISM texture itself, which takes 8\,MB for a 2048²\,px 16-bit depth buffer. The single-pixel renderer requires additional memory. First, it uses a buffer for storing the points. This is implemented as a four-channel 32-bit float texture, with the first three channels containing the position and the last channel containing the radius (8 bit) and the normal (24 bit, see \cite{Cigolle:2014:NormalPacking}). With a maximum point count of 2048 per ISM (keep in mind that points are rendered into multiple ISMs later), this buffer uses 32\,MB.
The additional textures are the render target of the single-pixel renderer (single-channel 2048x2048\,px, 32-bit integer, 16\,MB), the mipmap levels used by the pull-push algorithm (four channels, 32-bit, approx.\ 22\,MB), and the final ISM (same format as used by the splat renderer, 8\,MB).
Added together, the single-pixel renderer requires a total of 79\,MB. There remains some room for optimization: For instance, one channel of the mipmap levels is unused (OpenGL does not offer three-channel image formats), and for some textures, 16-bit floats instead of 32-bit floats might be sufficient.
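For reference, the stated figures can be reproduced from the buffer dimensions given above. The following is a sketch of the arithmetic, not actual application code:
\begin{verbatim}
#include <cstdio>

int main()
{
    const double MiB = 1024.0 * 1024.0;

    // Point buffer: 1024 ISMs x 2048 points x 4 channels x 32 bit.
    const double pointBuffer  = 1024.0 * 2048.0 * 16.0 / MiB;  // 32 MB
    // Render target of the single-pixel renderer: 2048^2 px, 32-bit integer.
    const double renderTarget = 2048.0 * 2048.0 * 4.0 / MiB;   // 16 MB
    // Pull-push mip levels 1 and up: four channels, 32 bit each.
    double pullPushMips = 0.0;
    for (int side = 1024; side >= 1; side /= 2)
        pullPushMips += double(side) * side * 16.0 / MiB;      // ~22 MB
    // Final ISM: 2048^2 px, 16-bit depth, as for the splat renderer.
    const double finalIsm     = 2048.0 * 2048.0 * 2.0 / MiB;   // 8 MB

    std::printf("%.0f + %.0f + %.1f + %.0f MB\n",
                pointBuffer, renderTarget, pullPushMips, finalIsm);
    // Together this is roughly the total stated above; small differences
    // stem from rounding and MB vs. MiB conventions.
    return 0;
}
\end{verbatim}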
\subsection{Problems with High Geometric Density}
\label{sec:results:ism:densityProblems}
A more demanding test case is the San Miguel scene with 7.9M triangles. The most apparent problem of the presented implementation is its lack of adaptivity to the current viewport, in addition to the lack of level-of-detail (LOD) methods. As a result, all 7.9M triangles are used every frame to render the ISMs, with corresponding performance results: the point collection phase of the single-pixel renderer now takes 12.0\,ms, while the point rendering takes an additional 5.3\,ms. The time needed by the postprocessing is unchanged since it works on fixed-resolution textures. The splat renderer takes 12.0\,ms, exactly as long as the point collection phase of the single-pixel renderer, even though the latter performs no rasterization. This demonstrates how the process is bottlenecked by geometric complexity.
\begin{figure}[b]
\centering
\begin{tabular}{@{}cc@{}}
\includegraphics[width=.48\textwidth]{screenshots/san_miguel_ugly_shadow} &
\includegraphics[width=.48\textwidth]{screenshots/san_miguel_ugly_shadow_gi_only}
\end{tabular}
\caption{Particularly inaccurate shadows produced by ISMs. The right image shows indirect light only. Ideally, the shadows on the walls would be much smoother. The low-frequency noise is caused by choosing random points per ISM, while the many edges of the shadows are caused by the basic shadow mapping algorithm, which performs a simple binary choice.}
\label{fig:results:san_miguel_ugly_shadow}
\end{figure}
\Cref{fig:results:san_miguel_ugly_shadow} shows an example screenshot of the San Miguel scene. The reason for the exaggerated shadows is that, during ISM rendering, every point is rendered into at least one pixel, which makes points block too much light if they are actually smaller than one pixel. This is especially the case for the thin structures of the chairs and trees.
\begin{figure}[htb]
\centering
\begin{tabular}{@{}c@{}}
\includegraphics[width=1.0\textwidth]{screenshots/san_miguel_wide} \\
\end{tabular}
\caption{A rendering of the San Miguel scene in the presented implementation. Note the unnaturally dark region behind the pillars. \vspace{-0.25cm}}
\label{fig:results:san_miguel_wide}
\end{figure}
\pagebreak
\noindent
The overdarkening is visible in \Cref{fig:results:san_miguel_wide} as well. Here, the missing second bounce makes the situation even worse: Since most of the VPLs are placed at floor level, the floor behind the pillars is not lit at all.
\vspace{5pt}
\subsection{Comparison of the Splat and Single-Pixel Renderer}
\label{sec:results:ism:comparison}
The single-pixel renderer provides numerous advantages over the splat renderer: Using a compute shader to render points lifts the restrictions of the fixed-function pipeline and, e.\,g., allows rendering each point into multiple ISMs without large performance losses. Quality-wise, it can reproduce surfaces more accurately by correctly interpolating between points. However, accurately reproducing the silhouettes of objects is possible with neither renderer; this would require at least a more complete set of points per ISM, and possibly returning to triangles instead of points.
The drawbacks obviously include the additional rendering time and memory. Although the requirements are moderate in absolute terms, using the single-pixel renderer makes ISM rendering the second most time-consuming and the most memory-consuming rendering phase, despite the low resolutions and the approximated scene representation. Besides, the added complexity should not be underestimated: we found it hard to fine-tune the pull-push algorithm to produce satisfactory results, especially in the face of the low ISM resolutions and the limited data available for reconstruction after selecting a random, sparse set of points.
\pagebreak
\subsection{Discussion}
\label{sec:results:ism:discussion}
Imperfect shadow maps have several deficits, some of which are inherent to the technique and difficult to solve, while others result from the choices and trade-offs made in the presented implementation and might be easier to overcome.
First, ISMs reduce the scene's geometry to points. At the resolutions achievable within a real-time budget, this is a very rough approximation that loses a lot of accuracy.
Second, because a sparse set of points is used for each ISM, each point must be enlarged to compensate for the neighboring points that are likely missing in that ISM. For instance, if an area is represented by a thousand points and only one tenth of all points is used for each ISM, each remaining point's area must be enlarged by a factor of ten to cover an equally large area in the rendered output. Of course, this approach deforms the geometry, since points at the edges of the area are enlarged as well, growing over the borders of the original geometry.
Third, ISMs also have numerical issues. For instance, both the splatting and postprocessing approaches ignore the distortion caused by the paraboloid projection, and thus might render even simple surfaces incorrectly. This contributes to the necessity of using a relatively large shadow bias. At the cost of additional complexity, this can likely be solved with relative ease.
Fourth, and this is in our view the most important shortcoming, ISMs handle geometry in the vicinity of VPLs badly (\Cref{fig:results:leaks}). To sufficiently approximate such surfaces when rendering ISMs, the number of points created near VPLs would have to be increased greatly, possibly in addition to stepping away from a fully random point selection: a specific point set would have to be selected for each ISM depending on the VPL's location.
As with most global illumination algorithms, a massive improvement would be to differentiate between large-scale scene geometry, which is important for diffuse light bounces, and smaller geometry of lesser importance. While the latter can possibly be ignored altogether with only minor losses in quality, the shape of large-scale geometry is all the more important to preserve (relatively) precisely.
The San Miguel scene is a good example of this: while small details like dishes and small plants make up most of the geometry, and thus most of the rendering time, they contribute very little to the global illumination. This is where the presented implementation suffers most from the lack of level-of-detail methods.
\section{Interleaved Sampling Using Compute Shaders}
\label{sec:results:interleavedShading}
Interleaved sampling substantially reduces the number of lights processed per pixel. Using a 4x4 interleaving pattern, each pixel processes only a sixteenth of all lights, ideally speeding up the final gathering phase by a factor of 16. However, an additional blur pass is required to mask the noise resulting from interleaving, which costs some of the saved time.
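Conceptually, the 4x4 pattern assigns each pixel to one of 16 subsets based on its position, and that subset determines which sixteenth of the VPLs the pixel gathers from. The following C++ sketch shows this mapping; it is an illustration only, and the actual compute-shader implementation may organize the work differently:
\begin{verbatim}
// Conceptual mapping for 4x4 interleaved sampling: each pixel belongs to
// one of 16 subsets and only gathers light from the VPLs assigned to that
// subset, i.e. a sixteenth of all VPLs.
#include <vector>

std::vector<int> vplIndicesForPixel(int px, int py, int vplCount)
{
    const int pattern = 4;
    const int subset  = (py % pattern) * pattern + (px % pattern); // 0..15

    std::vector<int> indices;
    for (int i = subset; i < vplCount; i += pattern * pattern)
        indices.push_back(i);   // every 16th VPL, offset by the subset index
    return indices;
}
// With 1024 VPLs, each pixel processes 64 of them; neighboring pixels use
// disjoint subsets, and the subsequent blur merges their contributions.
\end{verbatim}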
\begin{table}[htb]
\centering
\captionabove{Timings of interleaved sampling (IS) with a block size of 4x4. Note that the blur pass is mandatory when using interleaved sampling; the ``Speedup'' column merely illustrates the efficiency of the presented implementation, while the ``Speedup with Blur'' column shows the practical speedup that is achieved.}
\begin{tabulary}{0.98\textwidth}{| L | L | L | L | L |}
\hline
Without IS & With IS & Speedup & Blur Pass & Speedup with Blur\\ \hline
40.24\,ms & 2.61\,ms & 15.42 & 0.63\,ms & 12.42\\
\hline
\end{tabulary}
\label{tab:results:timings_interleaved_shading}
\end{table}
\pagebreak
\noindent
\Cref{tab:results:timings_interleaved_shading} shows timings of the presented implementation. Most notably, the interleaved sampling technique introduces very little overhead, as indicated by the speedup of 15.42 compared to the theoretical maximum of 16. In absolute terms, a speedup of 16 would require the final gathering phase to take 2.51\,ms. As a result, the technique introduces 0.1\,ms of overhead.
When considering the blur phase, the overhead adds up to 0.73\,ms and the speedup is lowered to 12.42. Of course, this is still a substantial improvement, especially considering the negligible impact on quality.
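In numbers, using the measurements from \Cref{tab:results:timings_interleaved_shading}:
\[
\frac{40.24\,\mathrm{ms}}{16} \approx 2.51\,\mathrm{ms}, \qquad 2.61\,\mathrm{ms} - 2.51\,\mathrm{ms} \approx 0.1\,\mathrm{ms},
\]
\[
\frac{40.24\,\mathrm{ms}}{(2.61 + 0.63)\,\mathrm{ms}} \approx 12.42, \qquad (2.61 + 0.63)\,\mathrm{ms} - 2.51\,\mathrm{ms} \approx 0.73\,\mathrm{ms}.
\]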
This technique requires an additional buffer for storing the intermediate result of the blur phase, while the final result can be written into the same texture that was used for final gathering. The texture format \texttt{GL\_R11F\_G11F\_B10F} is used here, resulting in roughly 2\,MB additional storage at full HD. The presented implementation does not re-use the final gathering texture to keep the results of all stages available for debugging purposes.
\subsubsection{Discussion}
Interleaved sampling is an obvious win for many-light global illumination methods. Paying the full cost of evaluating all VPLs per shading point is simply unnecessary due to the low-frequency nature of global illumination. Adapting the shading frequency to a similarly low level is the only reasonable choice for real-time applications.
In fact, given the unnoticeable quality impact of interleaved sampling with a 4x4 block size (\Cref{fig:results:interleaved_quality}), larger block sizes, e.\,g., the 8x8 used by \citet{hedman2016sequential}, should be considered. Especially with the advent of high-resolution displays, lowering the shading frequency of global illumination effects even further might be necessary to keep the costs at a reasonable level. However, the performance gains from lower sample counts will be partly offset by the larger blur radii that come with larger block sizes.
Going further, this technique is limited by how well the blur phase can mask the noise in difficult areas with many depth discontinuities, such as vegetation. To this end, one option might be to collect fragments that receive too little information during shading and blurring, and re-process them with larger sets of VPLs in a second pass (similar to \cite{Lauritzen:2010:Deferred}). Temporal reprojection methods \citep{Jimenez:2016:FilmicSMAA} that re-use information from previous frames and use different sets of VPLs per frame would be another natural extension of interleaved sampling.
\pagebreak
\begin{figure}[htb]
\centering
\begin{subfigure}[b]{0.20\textwidth}
\centering
\includegraphics[width=.95\textwidth]{screenshots/interleaved_before}
\caption{}
\label{fig:results:interleaved_before}%
\end{subfigure}%
\begin{subfigure}[b]{0.20\textwidth}
\centering
\includegraphics[width=.95\textwidth]{screenshots/interleaved_horizontal_blur}%
\caption{}
\label{fig:results:interleaved_horizontal_blur}%
\end{subfigure}%
\begin{subfigure}[b]{0.20\textwidth}
\centering
\includegraphics[width=.95\textwidth]{screenshots/interleaved_final}%
\caption{}
\label{fig:results:interleaved_final}%
\end{subfigure}%
\begin{subfigure}[b]{0.20\textwidth}
\centering
\includegraphics[width=.95\textwidth]{screenshots/interleaved_without}%
\caption{}
\label{fig:results:interleaved_without}%
\end{subfigure}%
\begin{subfigure}[b]{0.20\textwidth}
\centering
\includegraphics[width=.95\textwidth]{screenshots/interleaved_difference_gi}%
\caption{}
\label{fig:results:interleaved_difference_gi}%
\end{subfigure}\\
\par\medskip
\begin{subfigure}[b]{0.333\textwidth}
\centering
\includegraphics[width=.95\textwidth]{screenshots/interleaved_without_textured}%
\caption{}
\label{fig:results:interleaved_without_textured}%
\end{subfigure}%
\begin{subfigure}[b]{0.333\textwidth}
\centering
\includegraphics[width=.95\textwidth]{screenshots/interleaved_with_textured}%
\caption{}
\label{fig:results:interleaved_with_textured}%
\end{subfigure}%
\begin{subfigure}[b]{0.333\textwidth}
\centering
\includegraphics[width=.95\textwidth]{screenshots/interleaved_difference_shaded}%
\caption{}
\label{fig:results:interleaved_difference_shaded}%
\end{subfigure}%
\vspace{-5pt}
\caption{Interleaved sampling results. The top row shows indirect light only. From left to right: (a) Result of the final gathering phase without the bilateral blur, (b) with the horizontal phase of the blur applied, (c) with both phases applied, (d) result of the final gathering phase without using interleaved sampling, (e) difference between (c) and (d). Bottom row: (f) final result after blending when using interleaved sampling, (g) without interleaved sampling, (h) difference between (f) and (g). Difference values are multiplied by eight.}
\label{fig:results:interleaved_quality}
\end{figure}
\vspace{-10pt}
\section{Clustered and Tiled Deferred Shading}
\label{sec:results:clusteredShading}
Clustered deferred shading and tiled deferred shading reduce the number of lighting computations by clustering fragments and culling lights against those clusters. As such, if implemented correctly, they affect only the performance and do not alter the output.
\begin{table}[htb]
\centering
\captionabove{Timings of the final gathering stage without optimizations, with clustered shading (CS) and with tiled shading (TS). Each line is a different camera position. Note that the timing ``With CS'' includes roughly 0.06\,ms for the clustering phase and 0.13\,ms for the light list phase.}
\begin{tabulary}{0.98\textwidth}{| L | L | L | L |}
\hline
Scene & Without CS/TS & With CS & With TS \\ \hline
Sponza 1 & 3.41\,ms & 2.80\,ms & 2.40\,ms\\
Sponza 2 & 3.94\,ms & 3.63\,ms & 3.09\,ms\\
Sponza 3 & 2.68\,ms & 1.64\,ms & 1.34\,ms\\
San Miguel & 4.75\,ms & 5.21\,ms & 4.43\,ms\\
\hline
\end{tabulary}
\label{tab:results:timings_clustered_shading}
\vspace{-10pt}
\end{table}
As visible in \Cref{tab:results:timings_clustered_shading}, both clustered shading and tiled deferred shading can improve the performance of the final gathering phase considerably: in one of the measurements, 50\,\% of the computation time is saved with tiled shading.
It is interesting that clustered shading, being the successor of tiled shading, performs noticeably worse in the context of many-light methods. This is true even for the San Miguel scene, in which the higher number of depth discontinuities should favor clustered shading. While these performance characteristics might be specific to the presented implementation, there are plausible reasons why they might apply generally: Due to the infinite light radii used, the percentage of culled lights is generally low. It might therefore be the better choice to optimize for the worst case and implement a low-overhead solution like tiled shading, instead of a more accurate solution like clustered shading that might improve the culling rate but comes with more overhead.
An impression of the overhead introduced by clustered shading is given by the San Miguel scene, from which the last row of measurements is taken. This scene runs much slower even when considering only the final gathering phase, because most VPLs are placed on the walls and floor, pointing inwards or upwards, respectively. Therefore, much of the geometry lies on the illuminated side of most VPLs, resulting in very little performance gain from culling.
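Since the VPLs have no finite radius, culling can essentially only reject clusters that lie entirely behind a VPL, i.\,e., on its unlit side. The following C++ sketch shows one such conservative plane-versus-AABB test; it illustrates why the culling rate is low in this scenario and is not necessarily the exact criterion used in the presented implementation:
\begin{verbatim}
// Conservative test whether an axis-aligned cluster/tile bounding box can
// receive light from a VPL with position vplPos and normal vplNormal:
// the box must not lie entirely behind the VPL's plane.
struct Vec3 { float x, y, z; };

bool mayReceiveLight(Vec3 boxMin, Vec3 boxMax, Vec3 vplPos, Vec3 vplNormal)
{
    // Pick the box corner farthest along the VPL's normal direction.
    const Vec3 farthest = {
        vplNormal.x >= 0.0f ? boxMax.x : boxMin.x,
        vplNormal.y >= 0.0f ? boxMax.y : boxMin.y,
        vplNormal.z >= 0.0f ? boxMax.z : boxMin.z
    };
    // If even that corner lies behind the VPL's plane, the whole box does.
    const float d = vplNormal.x * (farthest.x - vplPos.x)
                  + vplNormal.y * (farthest.y - vplPos.y)
                  + vplNormal.z * (farthest.z - vplPos.z);
    return d > 0.0f;
}
// When most VPLs face the bulk of the scene, as in San Miguel, this kind
// of test rejects very few clusters, so almost no work is saved.
\end{verbatim}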
With a low culling rate, clustered shading actually slows down the final gathering phase by 0.46\,ms. While 0.19\,ms of this is the fixed cost added by the clustering and light list calculation phases, the final gathering shader itself is also slowed down by 0.27\,ms. This might be caused by the indirection when accessing lights: instead of reading from the light buffer directly, the light list with indices is read first, and the light buffer is then accessed with a dynamic index. Another explanation might be divergent data flow, since threads of the same work group can access different light lists and thus different light data and shadow maps, possibly hurting cache efficiency.
As discussed in \Cref{sec:impl:clusteredShading}, clustered shading consumes about 2\,MB of memory. The tiled shading algorithm, however, being integrated into the final gathering shader, uses no additional VRAM at all.
\subsubsection{Discussion}
The presented measurements make tiled shading seem like a clear winner, but, as so often, the situation is more complicated. The integration of tiled shading into the final gathering shader, while efficient and easy to implement, is inherently less flexible than clustered shading. For instance, \citet{olsson2012clustered} propose to use normals as another clustering dimension besides the three spatial ones, further improving culling efficiency. While this is presumably easy to integrate into the more flexible approach taken with clustered shading, the integrated tiled shading is constrained by the amount of available shared memory, which makes taking additional dimensions into account challenging.