So in brief summary form, the algorithm is as follows:
1. We begin with a list of triangles and a root kd-tree node containing all the triangles. Each item in the triangle list has two entries: the triangle id and the node that it belongs to. We don't yet store the triangle list for each node in the nodes themselves yet.
2. We have a list of active nodes to work on each time, the active nodes are nodes to be split. To start, active nodes list only contains the root node.
3. While active node list is not empty,
3a. We divide each node in the active list along the split axis into a predefined number of segments. Place a minimum bin and a maximum bin for each segment.
3b. For each triangle in parallel, determine its starting/ending segment in the nodes it belongs to. Add to the min bin where its starting position belongs to. Do similar to the max bin.
3c. Now we apply segmented prefix-sum scan (provided by CUDPP) on the min max bins list for all the nodes. Now we can easily get the # of triangles to the left/right of each of the segment.
3d. For each segment in parallel, compute the SAH at that segment position.
3e. Apply parallel reduction with min operator to arrive at the min SAH for each node.
3f. For each node in parallel, conditionally split itself along the segment position with min SAH.
3g. For each triangles in parallel, do a triangle/box test to see whether it goes to the left or right node. If it belongs to one of the node, simply re-point to the child node it belongs to. If it belongs to both nodes, we need to create an additional entry in the triangle list to accommodate it. The newly created nodes will be the new active node list.
11. Now we need to rearrange the triangle list so it lists the triangles for each node in consecutive entries. This can be done by first doing a prefix sum scan on the node's triangles count so that we know where the triangle list for each node would start in the big list. Then for each triangles in parallel, put the triangles in to the arranged list. I just needed to be careful to use atomic operations to avoid race conditions.
The renderer now also supports texture mapping with bilinear filtering and bump mapping. For the Sponza scene with 77K triangles, with precomputed kd-tree, we get ~20fps on 800x600 scene. If we re-build the kd-tree for every frame, we still get 4-5 fps, which is not bad considering the fact that there's still room for optimization. The two videos below show the rendering result, with and without kd-tree per frame re-computation.
For the coming week, I will apply the kd-tree building algorithm to photon mapping (with VVH as opposed to SAH for split cost), and get the final gathering working (hopefully it can be interactive). I've already started on this, and will keep everyone posted.