Saturday, December 11, 2010

Post-processing step for irradiance interpolation

During the past week, I decided not to continue on the REYES type renderer since it is really difficult to optimize to interactive rates and the rasterization scheme is out of the scope for this current project. Instead, I went on fixing bugs and tried to improve the rendering process within the given framework of the renderer.

First, I found a pretty serious bug with the photon shooting/scattering part of the tracer. For all the photons, only the first-bounce photons were recorded with correct power; the later, multi-bounce photons had zero power. Essentially, the renderer was only rendering first bounce indirect lighting, and that explains why the Sponza scene was so dark. I only discovered this problem when I started to directly visualize all the photons on the screen (which I should've done a lot earlier...) Anyways, this is the current rendered image of the Sponza scene. Note that the color bleeding is much more interesting, and I also added sky color bleeding, which is added to the image whenever secondary rays hit the sky, and sun-like parallel lighting,
Then, I explored the possibility of providing a fast previewing feature. The basic idea is to introduce a post-processing step when the sampling count is low (eg. only around 10-20 samples per pixel which naturally would introduce a lot of noise and artifacts).

Inspired by the idea of irradiance caching, we apply a final post-processing step so that we can minimize the number of required final gather samples per pixel while achieving similar results. The basic idea is to do an image-space irradiance interpolation on the rendering canvas. For each one of the pixels, x, in the canvas in parallel, we examine all the surrounding pixels within a preset radius r. For each of the nearby pixel xj, we compute its weight by

where x, xj, n(x), n(xj) are the spatial positions for the two pixels and their respective normals. The rj term is a harmonic means of the distances from the xj to nearby geometry. This value can be easily while final gathering is done on each pixel, where we have the distance information for each final gather sample, via the following formula:

Where N is the number of final gather samples, di is the distance to each secondary intersection. Note that in order to progressively compute the value of rj, we compute its reciprocal, 1/rj, so that for each additional final gather sample, we can simply add 1/di/N to the current 1/rj value.
With all the weights computed, we can compute the diffuse irradiance value at x by

Where E(x) is the existing diffuse values stored in the canvas. With this final postprocessing step, we essentially smoothen the indirect illumination across the entire scene. In practice, we found that a searching radius of 7 works best for improving image quality. When implemented in CUDA with shared memory, the computational overhead for this step is almost negligible at less than a tenth of a second.

The three images below showcases the effects of the post-processing step. The top image is rendered with only 12 samples/pixel, without the postprocessing step. The middle image is postprocessed, with the same 12spp setting, rendered in 2.2 seconds. The bottom image is a reference image rendered with 300spp, taking 42.3 seconds to render. All three images are rendered with 800x800 resolution and 200,000 photons. Notice the smoothness in indirect lighting in the middle image, but also note its problem: the places where Rx is small (local geometry hinders the light reception) lost the darkness. This problem might be addressed by shooting more final gather rays and avoid interpolation for those spots and is worth future research.

 Finally a video showing the current status of the renderer:

Monday, December 6, 2010

Rearranging code and some new thoughts

In the past week, most of the work was done in the back end. I re-arranged the code-base which had been pretty messy till last week. Basically, I had this giant main.cpp,, objLoader.cpp, a couple of header files and that's it. The main.cpp and were each over 1600 lines of code and it was really a magic to see all of this glued mess working. So I did lots of refactoring work, summarized as follows:
1. Scene graph related stuff was put into a SceneManager class. It references another class, KdTreeBuilder, which is in charge of Kd-Tree and photon mapping construction. It also implements a transformation-less scene graph, which are basically a mesh split into objects. Every time a transformation is done to the objects, the new coordinates are directly computed and embedded into the mesh, and Kd-Tree/Photon map is re-built on the fly.
2. Rendering related stuff was put into a RenderManager class. It keeps track of the current rendering settings (parameters, render methods, etc.)
3. The CUDA source,, was also split among rendering, helper methods and Kd-Tree building functions. Each kernel function has a wrapper C function that automatically sets up the block and thread settings for the kernels.

I also came up with an interesting idea to explore if I have enough time before the end of the semester. After reading Point-Based Approximate Color-Bleeding paper by Christensen et. al. (, I thought it would be interesting to see if this can be used on my renderer. The basic idea is to divide each polygons into small disks, or surfels, and arrange them into an octree. Whenever we need to compute indirect illumination on a point, we simply rasterize the surfels or octree nodes on a view cube around the point and then multiply the result by the surface BRDF. I would like to use ray trace for the direct illumination, kd-tree for surfel storage and same rasterization strategy and see what I can come up with.

Finally, a screen shot of the classic Cornell Box scene of a prototyped version that implements part of the paper mentioned above. Rendering takes 3 seconds on this 640x640 image with 200spp. More details will follow as I make more progress on this.

Sunday, November 28, 2010

Live scene editing + path tracing support

The week of Thanksgiving gave me both time to go to Chinatown to have Thanksgiving dinner, time to gamble a bit in Atlantic City, and time to polish my renderer. Speed was once again improved due to a more "greedy algorithm" in scene traversal. Moreover, path tracing support was also added to the renderer (using a path tracing / bidirectional path tracing hybrid model), and it works really well for outdoor scenes, where photon mapping lacks good support. Next, the scene can now be freely edited. Each object as defined in the obj file is extracted out as separate objects, and the user can translate/scale/rotate/delete any of them as needed. Thanks to the fast kd-tree/photon map builder, operating on the scene and reconstructing the entire kd-tree is pretty painless. Last but now least, we can now freely choose from one of six rendering methods during run time: OpenGL wireframe, OpenGL smooth, (bidirectional) path tracing, direct photon mapping, photon mapping with final gathering and ambient occlusion. Below are three videos showcasing the renderer's current capabilities with different scene settings.

The first is a test scene I made for path tracing and scene editing. It includes a bunny, among other objects. The scene contains only 7K triangles.

The second video is our familiar Sponza scene, with 72K triangles. Note the improved performance and brighter interior space (after a fix made to Photon Mapping algorithm).

The third video is for editing the new interior scene, with 608K triangles (I got the wrong triangle count the last time).

Monday, November 22, 2010

More updates on the indoor scene rendering results...

After spending today tweaking the indoor scene settings, adding more textures, using disk-shaped photon collector instead of a spherical one, making the light source external and sun-like, and fixing bugs here and there, the image quality is notably better:

Note that by using disk-shaped collector, false color bleeding near the edges and on thin objects (quilt, books) has been largely solved.

Actual indoor scene, normal map, more BRDF support, and more

The last week I began to use an actual indoor scene for my renderer. It was freely available online, but since it was made with 3DS MAX and V-Ray and the OBJ exporter doesn't support V-Ray textures etc, I had to remap all the material and textures (They are not set either, I had to literally look at a final render the website provided and do some matching work). I also made up a test scene to play with different BRDF (the renderer only used to support perfectly diffuse, Lambertian surfaces). The following is a test scene with ambient occlusion and glossy reflection.

So I started with this scene with no texture mapping, material settings. It contains 1400K triangles, many more than Sponza's 70K, and can indeed be a stress to the renderer.
I had to find textures from online and set all the materials (lots of which needed to be set manually in the MTL file since I'm not familiar with 3DS MAX). I also implemented normal mapping, perfectly specular material, and Oren-Nayar reflectance model. After increasing the photon count to 200K, the current result is shown in the following image.

Notice the correct color bleeding (it wasn't correct at the beginning and that wasn't found in the Sponza scene since the mesh kind of has a uniform color). Also note that the floor and the left wall is glossy, the mirror is perfectly specular, the quilt is Lambertian, some of the wooden furniture is Oren-Nayar and the rest are Phong. Also, the quilt and the rug have normal maps attached. With this setting, I'm still able to achieve interactive rate (shown in the video below). I also fixed the problem where the program with reset the video card after a timeout. In addition to cudaThreadSynchronize() to make the calls synchronous, we also need to set two registry keys in Windows to bypass the timeout check.

Finally, this is a rendering result provided by the website providing the 3D model. There's obviously still a long way to go before achieving this level of picture quality, and it can be very challenging considering the interactive nature of this renderer.

Sunday, November 14, 2010

Faster and clearer irradiance caching and ambient occlusion

The past week has been spent on optimizing irradiance caching, reducing artifacts and trying other options such as ambient occlusion. Irradiance caching was accelerated further during the past week by around 5 times, and the small scanline-like artifacts have been significantly reduced although there are still block artifacts. The blocks were removed by implementing a "Behind Test" as described by Ward in his SIGGRAPH course. I also tried other options during the week such as ambient occlusion and sun light imitation. In scenes such as Sponza where color bleeding does not happen a lot, ambient occlusion usually provides a pretty good estimate for the indirect illumination, and I would like to further explore the possibilities of combining ambient occlusion with irradiance caching in the coming week. I will keep you posted.

The first is a video showcasing the status of the renderer with irradiance caching:

Then we have a screen shot of the ambient occlusion computed result. Finally there's a video showing the performance of ambient occlusion.

Sunday, November 7, 2010

More updates on Monday about Irradiance Caching

I'm currently still working on the irradiance caching implementation. The speed now is ~2 seconds for cache sampling and almost instantaneous rendering. However, there are still lots of visual glitches and I'm in the process of improving the image quality. This is a sketch of the irradiance caching algorithm:

1. In the image space, compute the error in all directions according to the classic irrandiance caching method. The error takes into account both spatial displacement and difference in normal. We sum up the errors to all the neighboring pixels for all the points in the image space in parallel.
2. Then, a probability based sample selection is performed, essentially picking sample points according to the error. The larger the error, the more likely the point is to become a sample.
3. Instead of storing the cache in a spatial acceleration structure such as a quad-tree, we directly compute the effective radius of each of the sample points in parallel. Then for all the pixels within the effective region of each sample points, we add a reference of the sample point. After this step, each pixel in the image space will have a list of irradiance caching sample points it needs to take into account, essentially trading space for time.
4. After the irradiance cache has been built, the rest of the rendering is done using the common method.

Due to the visual glitches as shown in the following image, I'm still working on the improvement, and hopefully I can post some updated results on Monday.

Sunday, October 31, 2010

Photon Mapping with Final Gathering is Ready

Photon mapping with final gathering and stratified importance sampling was implemented on CUDA this week. I also tried a crude implementation of irradiance caching on the GPU using a divergence field based cache position selection which I came up with myself, it still needs further improvement, and I will try to post that next week. Now, without irradiance caching, with 800x600 resolution, 100k photons and 200 final gather rays per pixel, rendering a frame takes ~ 15-20 secs. This is far from interactive, but since I uses a incremental based approach where intermediate results are shown, the system is far more responsive in quickly visualize results.

The following is a screen shot of the current renderer. With 1024 by 768 resolution, 100k photons and 260 final gather rays per pixel, It's fully rendered in 25 seconds.

The following video showcases the current state of the renderer. Note that when the light source is moved, the reponse is pretty slow (~1fps). This is because we need to precompute irradiance at each photon position to accelerate the final gather. Also we shoot 100k photons. With 10k photons, we get around 10-15fps when moving the light source and similar rendering results.

Sunday, October 24, 2010

CUDA kd-tree builder is ready and needs to be further optimized.

I mentioned in last week's post that I have a three-step plan for the kd-tree builder. Now the plan has changed somewhat: I came up with a segmentation/binning based algorithm which I thought would substitute Kun Zhou's. I successfully implemented the CUDA kd-tree builder among other features.

So in brief summary form, the algorithm is as follows:

1. We begin with a list of triangles and a root kd-tree node containing all the triangles. Each item in the triangle list has two entries: the triangle id and the node that it belongs to. We don't yet store the triangle list for each node in the nodes themselves yet.

2. We have a list of active nodes to work on each time, the active nodes are nodes to be split. To start, active nodes list only contains the root node.

3. While active node list is not empty,

3a.      We divide each node in the active list along the split axis into a predefined number of segments. Place a minimum bin and a maximum bin for each segment.

3b.      For each triangle in parallel, determine its starting/ending segment in the nodes it belongs to. Add to the min bin where its starting position belongs to. Do similar to the max bin.

3c.      Now we apply segmented prefix-sum scan (provided by CUDPP) on the min max bins list for all the nodes. Now we can easily get the # of triangles to the left/right of each of the segment.

3d.      For each segment in parallel, compute the SAH at that segment position.

3e.      Apply parallel reduction with min operator to arrive at the min SAH for each node.

3f.      For each node in parallel, conditionally split itself along the segment position with min SAH.

3g.     For each triangles in parallel, do a triangle/box test to see whether it goes to the left or right node. If it belongs to one of the node, simply re-point to the child node it belongs to. If it belongs to both nodes, we need to create an additional entry in the triangle list to accommodate it. The newly created nodes will be the new active node list.

11. Now we need to rearrange the triangle list so it lists the triangles for each node in consecutive entries. This can be done by first doing a prefix sum scan on the node's triangles count so that we know where the triangle list for each node would start in the big list. Then for each triangles in parallel, put the triangles in to the arranged list. I just needed to be careful to use atomic operations to avoid race conditions.

The renderer now also supports texture mapping with bilinear filtering and bump mapping. For the Sponza scene with 77K triangles, with precomputed kd-tree, we get ~20fps on 800x600 scene. If we re-build the kd-tree for every frame, we still get 4-5 fps, which is not bad considering the fact that there's still room for optimization. The two videos below show the rendering result, with and without kd-tree per frame re-computation.

For the coming week, I will apply the kd-tree building algorithm to photon mapping (with VVH as opposed to SAH for split cost), and get the final gathering working (hopefully it can be interactive). I've already started on this, and will keep everyone posted.

Sunday, October 17, 2010

kd-tree CPU builder and GPU-based tracer is ready

In the past week, I added significant amount of code to my renderer, most of which pertained to kd-tree construction and traversal. I read through Kun Zhou's paper on GPU based kd-tree construction, and decided that the paper could be implemented in the following steps:

1) Make sure the kd-tree tracer (traversal) is working. To achieve this, I first implement a CPU based kd-tree builder with data structure optimized for CUDA. Then a GPU based tracer that utilizes the resulting kd-tree should be implemented. At this point, I should have a kd-tree CPU builder and GPU tracer.

2) Substitute a GPU kd-tree builder for the CPU one built in the first step. The output data structure should be in sync so that we don't need to change (hopefully) the GPU tracer in step 1. The GPU kd-tree builder should implement the Small Node stage according to Zhou's paper.

3) Implement the Large Node stage as proposed in Zhou's paper. This is trickier than Small Node Stage, but hopefully this is going to provide worthwhile performance boost.

So for this week Step 1 is done. I set up a team foundation server project so that I can keep track of different versions of the code which is very useful. For the actual kd-tree implementation, to make the data readily available to the GPU (and later to the GPU kd-tree builder), the kd-tree nodes, triangle lists and split position lists are pre-allocated chunks of memory. For example, KdTreeNode class looks like the following:

struct KdTreeNode
            float split;
            int axis, left, tri;
            float4 __data;

It is packed into one float4 value. Left is the index of the left child in the pre-allocated array of KdTreeNodes, and the right child is simply left + 1. The integer tri points at the head index of the triangle list of the node in the pre-allocated triangle list array. The triangle list and split position lists are two big linked lists. Whenever some resource is requested, an element is fetched from the next available element in the lists, and gets returned to the "pool" once it's no longer used. This way, we only need to allocate the memory once.

The GPU tracing part is a stacked-based kd-tree traversal implementation according to Zhou's paper. However, the triangle list isn't optimized into pre-order traversal order yet. This can be done in Step 2 & 3.

For the next week, I plan to complete Step 2 at the minimum, and optionally Step 3 if time permits. Moreover, I will implement more intuitive movement control by using keyboard and let the obj loader support texture mapping.

The following is a video showcasing the current state of the tracer. I used goldfish.obj with 23k triangles. After kd-tree construction using the CPU with 13 maximum levels and 9516 kd-tree nodes, the ray-tracer achieved still a stable 60fps (max FPS for the monitor) with 2 levels of reflection. Pretty satisfactory result for now.


Friday, October 8, 2010

Current status on my existing photon tracer and new CUDA RTRT

This past week was spent in learning and setting up CUDA, stabilizing and rendering using my existing photon tracer and writing a simple CUDA ray tracer.

For the existing photon tracer (ktracer) I wrote last semester for CIS-660 (Advanced Topics in Computer Graphics), it is able to photon trace any given triangular mesh scene in 3DS format. It is able to merge multiple 3ds files into a single scene. It uses hierarchical uniform grid as acceleration structure, supports multi-threaded rendering and uses SIMD intrinsic on low-level computation. For photon mapping part, irradiance caching is done on the photons so that final gathering could be greatly accelerated. Here's some result of  the current ktracer:

This is a photon traced image of the well-known Cornell Box which I made with 3DS MAX. With 50 area light samples and 50 final gather samples, the final image with 640x640 resolution is rendered in 16.562s.

To showcase the ability to take advantage of the uniform grid, two car models and a house model is added to the scene. Now with 65K triangles and the same rendering settings, the above image is rendered in 27.43s.

The existing ktracer provides a baseline performance. It is still far from being interactive. In order to be interactive, the bottom scene needs to be rendered in ~1fps. In other words, without final gathering and using only one sample per pixel, we should get at least 50fps. This will be the performance target for my senior project.

For the rest of the week, I read a number of CUDA/raytracing tutorials. Here's a list of them that I found particularly helpful:

CUDA, Supercomputing for the Masses (;jsessionid=XNULCSWKTFGIZQE1GHPCKH4ATMY32JVN)
Triers CUDA Raytracing Tutorial ( They provided some great skeletal code for starting to write a ray tracer. Part of my project is based on the CUDA/OpenGL code of the tutorial.
NVIDIA CUDA C Programming Guide. This is the official guide with up-to-date information.

I set up my newly built test bench with i7-950 processor, 4GB memory and GeForce GTX 470 graphics card. Then I downloaded and installed CUDA Toolkit and GPU Computing SDK 3.2 RC. I also found that the latest 1.5 RC version of Parallel NSight, which supports Visual Studio 2010, is open to Beta program participants. So I applied to their program and soon got admitted. In my opinion, the folks in NVIDIA have done a really good job integrating with VS2010. Also, with a host and a target machine, we can enable remote debugging of CUDA programs with easy operations such as setting breakpoints, a very neat feature.

After quite some time of coding, I finished the initial ray tracer that is able to load OBJ files with MTL support. It currently doesn't have any acceleration structures, but for the Cornell Box scene, it renders it very smoothly at over 60 fps. I started off porting the 3DS loader from my previous renderer to the new one, but ended up getting all kinds of compiler errors. Eventually, I gave up that idea and instead wrote an OBJ loader and used 3DS MAX to export the file to OBJ format. A screenshot and a video below showcases the current renderer:

Here is a video showing some real-time camera movement:


Sunday, October 3, 2010

Interactive Global Illumination of Indoor Scenes with Photon Mapping - A Brief Overview

Efficiently and realistically rendering the nuances as represented by direct and indirect lighting effects of indoor scenes has always been a challenging task in computer graphics. Especially, scenes with moveable light sources
and camera position deserve particular attention as rendering those scenes efficiently would enable animators and designers to quickly inspect architectures from different perspectives and lighting settings.

Based on this requirements, I propose a fast, robust and realistic renderer with global illumination capabilities specifically for indoor scenes, by coupling kd-tree and coherent grid acceleration structures. The renderer uses regular ray-tracing for direct illumination and reflection, and uses photon map final gathering for indirect illumination. The scene comprises mostly static background objects and limited moving objects. Whenever a change in light sources or camera position is required after a given time point, spatio-temporal coherence is exploited so that only a fraction of the work done for the previous frame is needed for rendering a new frame. 

To best utilize kd-tree's fast look-up property and coherent uniform grid's fast reconstruction property, we use the former for static background scene and the latter for moving objects. To improve rendering responsiveness, interim rendering results are displayed during the final gathering process; moreover, photon mapping pre-computation could be used. This should dramatically reduce the required final gathering samples. Lastly, CUDA will be utilized to better parallelize the rendering process. This project is an extension to and further research on my previous work on efficient ray-tracing. Built on the framework of my existing photon tracer, this high quality renderer should be able to deliver impressive scene fly-by experiences and speed up the workflow of creating and rendering indoor architectures.