Update to v1.3.1.0
avmihut committed Jul 22, 2024
1 parent d2c55b1 commit cb5428a
Showing 10 changed files with 286 additions and 125 deletions.
96 changes: 68 additions & 28 deletions docs/Integration.md
@@ -2,11 +2,11 @@

SHaRC algorithm integration doesn't require substantial modifications to the existing path tracer code. The core algorithm consists of two passes: the first pass uses sparse tracing to fill the world-space radiance cache using the existing path tracer code, and the second pass samples the cached data on ray hit to speed up tracing.

<figure align="center">
<img src="images/sample_normal.jpg" width=49%></img>
<img src="images/sample_sharc.jpg" width=49%></img>
<figcaption>Image 1. Path traced output at 1 path per pixel (left) and with SHaRC cache usage (right)</figcaption>
</figure>

## Integration Steps

@@ -19,9 +19,9 @@ Create main resources:
* `Voxel data` buffer - structured buffer with 128-bit entries that stores accumulated radiance and sample count. Two instances are used to store current and previous frame data
* `Copy offset` buffer - structured buffer with 32-bit entries used for data compaction

The number of entries in each buffer should be the same; it represents the number of scene voxels used for radiance caching. A solid baseline for most scenes is $2^{22}$ elements, and power-of-2 element counts are generally recommended. A higher element count can be used for scenes with high depth complexity, while a lower element count reduces memory pressure but can result in more hash collisions.
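
For reference, the shader-side declarations of these resources could look like the sketch below; the buffer names are illustrative, and the actual declarations depend on the host API and on whether 64-bit atomics are available:

```
// Illustrative resource declarations - names are placeholders, not part of the SDK.
RWStructuredBuffer<uint64_t> gHashEntriesBuffer;   // `Hash entries`: 64-bit hash key per entry
RWStructuredBuffer<uint4>    gVoxelDataBuffer;     // `Voxel data`: 128 bits, accumulated radiance + sample count
StructuredBuffer<uint4>      gVoxelDataBufferPrev; // `Voxel data` from the previous frame
RWStructuredBuffer<uint>     gCopyOffsetBuffer;    // `Copy offset`: 32 bits per entry for data compaction
```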

> :warning: **All buffers should be initially cleared with '0'**

### At Render-Time

@@ -42,63 +42,103 @@ gridParameters.sceneScale = g_Constants.sharcSceneScale;
float3 color = HashGridDebugColoredHash(positionWorld, gridParameters);
```
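
For context, a fuller sketch of the setup around this snippet is shown below; only the `GridParameters` fields mentioned in this document are used, and the `g_Constants` fields as well as the logarithm base value are illustrative:

```
GridParameters gridParameters;
gridParameters.cameraPosition = g_Constants.cameraPosition; // illustrative constant buffer field
gridParameters.logarithmBase = 2.0f;                        // illustrative value, see the note below
gridParameters.sceneScale = g_Constants.sharcSceneScale;

// Output the debug color instead of the shaded result to inspect voxel sizes.
float3 color = HashGridDebugColoredHash(positionWorld, gridParameters);
```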

<figure align="center">
<img src="images/00_normal.jpg" width=49%></img>
<img src="images/00_debug.jpg" width=49%></img>
<figcaption>Image 2. SHaRC hash grid visualization</figcaption>
</figure>

Logarithm base controls the level-of-detail distribution and the ratio between voxel sizes of neighboring levels; it doesn't make voxel sizes bigger or smaller on average. To control voxel size, use the ```sceneScale``` parameter instead. `HASH_GRID_LEVEL_BIAS` should be used to control the level at which voxels near the camera are clamped, to avoid overly detailed levels where they are not required.

## Implementation Details

### Render Loop Change

Instead of the original trace call, we should have the following four passes with SHaRC:

* SHaRC Update - RT call which updates the cache with new data on each frame. Requires the `SHARC_UPDATE 1` shader define
* SHaRC Resolve - Compute call which combines new cache data with data obtained on the previous frame
* SHaRC Compaction - Compute call which performs data compaction after the preceding resolve call
* SHaRC Render/Query - RT call which traces scene paths and performs early termination using cached data. Requires the `SHARC_QUERY 1` shader define

### Resource Binding

The SDK provides shader-side headers and code snippets that implement most of the steps above. Shader code should include [SharcCommon.h](../Shaders/Include/SharcCommon.h), which already includes [HashGridCommon.h](../Shaders/Include/HashGridCommon.h).

| **Render Pass** | **Hash Entries** | **Voxel Data** | **Voxel Data Previous** | **Copy Offset** |
|:-----------------|:----------------:|:--------------:|:-----------------------:|:---------------:|
| SHaRC Update | RW | RW | Read | RW* |
| SHaRC Resolve | Read | RW | Read | Write |
| SHaRC Compaction | RW | | | RW |
| SHaRC Render | Read | Read | | |

*Read - resource can be read-only*
*Write - resource can be write-only*

\*Buffer is used if `SHARC_ENABLE_64_BIT_ATOMICS` is set to 0

Each pass requires appropriate transition/UAV barriers to wait for completion of the previous stage.

### SHaRC Update

> :warning: Requires the `SHARC_UPDATE 1` shader define. The `Voxel Data` buffer should be cleared with `0` if the `Resolve` pass is active

This pass runs a full path tracer loop for a subset of screen pixels, with some modifications applied. We recommend starting with random pixel selection in each 5x5 block, so that only 4% of the original paths are traced per frame. This typically produces a good data set for the cache update while keeping the performance overhead small. The selected positions should change between frames to provide whole-screen coverage over time. Each path segment during the update step is treated individually, so path throughput should be reset to 1.0 and accumulated radiance to 0.0 on each bounce. For each new sample (path), `SharcInit()` should be called first. On a miss event `SharcUpdateMiss()` is called and the path is terminated; on a hit, radiance is evaluated at the hit point and `SharcUpdateHit()` is called. If `SharcUpdateHit()` returns false, the path can be terminated immediately. Once a new ray has been selected, the path throughput should be updated and `SharcSetThroughput()` called; after that, path throughput can be safely reset back to 1.0.

<figure align="center">
<img src="images/sharc_update.svg" width=40%>
<figcaption>Figure 1. Path tracer loop during SHaRC Update pass</figcaption>
</figure>
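
A simplified sketch of this loop is shown below. The `Sharc*()` calls are the ones described above with simplified argument lists (see [SharcCommon.h](../Shaders/Include/SharcCommon.h) for the actual signatures); `TraceRay()`, `EvaluateHitRadiance()`, `MakeSharcHitData()`, `SampleNextRay()` and `rand01()` stand in for application-side code, and the buffer wiring of `sharcState` is omitted:

```
// Sketch of the path tracer loop for the SHaRC Update pass (SHARC_UPDATE 1).
SharcState sharcState;
SharcInit(sharcState);

float3 throughput = float3(1.0f, 1.0f, 1.0f);
for (uint bounce = 0; bounce < bounceMax; ++bounce)
{
    HitInfo hit = TraceRay(ray);
    if (!hit.isValid)
    {
        SharcUpdateMiss(sharcState, skyRadiance);
        break;
    }

    SharcHitData sharcHitData = MakeSharcHitData(hit.positionWorld, hit.normalWorld);

    // Radiance evaluated at the hit point (e.g. emission + direct lighting).
    float3 hitRadiance = EvaluateHitRadiance(hit);
    if (!SharcUpdateHit(sharcState, sharcHitData, hitRadiance, rand01()))
        break;

    // Pick the next ray and accumulate throughput for this segment only...
    ray = SampleNextRay(hit, throughput);
    SharcSetThroughput(sharcState, throughput);
    throughput = float3(1.0f, 1.0f, 1.0f); // ...then reset, since each segment is treated individually
}
```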

### SHaRC Resolve and Compaction

The `Resolve` pass is performed by a compute shader that runs `SharcResolveEntry()` for each element. The `Compaction` pass uses the `SharcCopyHashEntry()` call.

> :tip: Check the [Resource Binding](#resource-binding) section for details on the required resources and their usage in each pass

`SharcResolveEntry()` takes the maximum number of accumulated frames as an input parameter to control the quality and responsiveness of the cached data. Larger values can increase quality at the cost of slower response to changes. The `staleFrameNumMax` parameter controls the lifetime of cached elements and, through that, cache occupancy.

> :warning: Small `staleFrameNumMax` values can negatively impact performance; the `SHARC_STALE_FRAME_NUM_MIN` constant is used to prevent such behaviour
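
A possible shape for these two compute entry points is sketched below; the thread group size, the `gSharc*` constants and the exact `Sharc*()` argument lists are illustrative placeholders:

```
// One thread per cache element for both passes.
[numthreads(64, 1, 1)]
void SharcResolveCS(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    uint entryIndex = dispatchThreadId.x;
    if (entryIndex >= gSharcCapacity)
        return;

    // Argument list simplified - see SharcCommon.h for the actual signature.
    SharcResolveEntry(entryIndex, gridParameters, hashMapData, gAccumulationFrameNum, gStaleFrameNumMax);
}

[numthreads(64, 1, 1)]
void SharcCompactionCS(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    uint entryIndex = dispatchThreadId.x;
    if (entryIndex >= gSharcCapacity)
        return;

    SharcCopyHashEntry(entryIndex, hashMapData, gCopyOffsetBuffer);
}
```
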
### SHaRC Render

> :warning: Requires the `SHARC_QUERY 1` shader define

During rendering with SHaRC cache usage, we should try to obtain cached data using `SharcGetCachedRadiance()` on each hit except the primary one. Upon success, the path tracing loop should be terminated immediately.

<figure align="center">
<img src="images/sharc_render.svg" width=40%>
<figcaption>Figure 2. Path tracer loop during SHaRC Render pass</figcaption>
</figure>
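
A minimal sketch of the query step, reusing the helper names from the update sketch above (the `SharcGetCachedRadiance()` argument list is simplified):

```
// Inside the path tracer loop of the SHaRC Render pass (SHARC_QUERY 1).
if (bounce > 0) // skip the primary hit
{
    float3 cachedRadiance = float3(0.0f, 0.0f, 0.0f);
    if (SharcGetCachedRadiance(sharcState, sharcHitData, cachedRadiance, false))
    {
        radiance += throughput * cachedRadiance;
        break; // cached data found - terminate the path early
    }
}
```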

To avoid potential rendering artifacts, certain aspects should be taken into account. If the path segment length is less than the voxel size (checked using `GetVoxelSize()`), we should continue tracing until the path segment is long enough to be safely usable. Unlike diffuse lobes, specular ones should be treated with care. For a glossy specular lobe, we can estimate its "effective" cone spread; if it exceeds the spatial resolution of the voxel grid, the cache can be used. The cone spread can be estimated as:

$$2.0 \cdot l_{\mathrm{ray}} \cdot \sqrt{\frac{0.5 \, a^2}{1 - a^2}}$$

where $l_{\mathrm{ray}}$ is the path segment length and $a$ is the material roughness squared.
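
A minimal sketch of this check, using `GetGridLevel()` and `GetVoxelSize()` from [HashGridCommon.h](../Shaders/Include/HashGridCommon.h) (the remaining variable names are illustrative):

```
float a = roughness * roughness;
float coneSpread = 2.0f * rayLength * sqrt(0.5f * a * a / (1.0f - a * a));

uint gridLevel = GetGridLevel(hitPositionWorld, gridParameters);
float voxelSize = GetVoxelSize(gridLevel, gridParameters);

// Use the cache for the glossy lobe only if its footprint exceeds the voxel resolution.
bool canUseCache = coneSpread > voxelSize;
```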

## Parameters Selection and Debugging

For the rendering step, adding a debug heatmap for the bounce count can help with understanding cache usage efficiency.

<figure align="center">
<img src="images/01_cache_off.jpg" width=49%></img>
<img src="images/01_cache_on.jpg" width=49%></img>
<figcaption>Image 3. Tracing depth heatmap, left - SHaRC off, right - SHaRC on (green - 1 indirect bounce, red - 2+ indirect bounces)</figcaption>
</figure>

The sample count uses `SHARC_SAMPLE_NUM_BIT_NUM` (18) bits to store the accumulated sample number.

> :note: `SHARC_SAMPLE_NUM_MULTIPLIER` is used internally to improve the precision of math operations for elements with a low sample number; every new sample increases the internal counter by `SHARC_SAMPLE_NUM_MULTIPLIER`.

SHaRC radiance values are internally premultiplied with `SHARC_RADIANCE_SCALE` and accumulated using a 32-bit integer representation per component.

> :note: [SharcCommon.h](../Shaders/Include/SharcCommon.h) provides several methods to verify potential overflow in the internal data structures. `SharcDebugBitsOccupancySampleNum()` and `SharcDebugBitsOccupancyRadiance()` can be used to verify consistency in the sample count and the corresponding radiance value representation.

`HashGridDebugOccupancy()` should be used to validate cache occupancy. With a static camera, around 10-20% of elements should be used on average; with fast camera movement the occupancy will go up. Increased occupancy can negatively impact performance; to control it, the element count can be increased, or the stale-frame threshold decreased to evict outdated elements more aggressively.
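
A minimal sketch of such an overlay, with illustrative resource names:

```
// Blend the occupancy overlay over the final image for debugging.
float3 occupancyColor = HashGridDebugOccupancy(pixelPosition, screenSize, hashMapData);
outputTexture[pixelPosition] = float4(lerp(finalColor, occupancyColor, 0.5f), 1.0f);
```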

<figure align="center">
<img src="images/sample_occupancy.jpg" width=49%></img>
<figcaption>Image 4. Debug overlay to visualize cache occupancy through HashGridDebugOccupancy()</figcaption>
</figure>

## Memory Usage

The ```Hash entries``` buffer, the two ```Voxel data``` buffers, and the ```Copy offset``` buffer together require 352 (64 + 128 * 2 + 32) bits per voxel. For $2^{22}$ cache elements this amounts to ~185 MB of video memory. The total number of elements may vary depending on the voxel size and scene scale; larger buffer sizes may be needed to reduce potential hash collisions.
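
For example, with the baseline of $2^{22}$ elements:

$$2^{22} \times \frac{352}{8} \text{ bytes} = 2^{22} \times 44 \text{ bytes} \approx 184.5 \text{ MB}$$
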
Binary file added docs/images/00_debug.jpg
Binary file added docs/images/00_normal.jpg
Binary file added docs/images/01_cache_off.jpg
Binary file added docs/images/01_cache_on.jpg
Binary file added docs/images/sample_occupancy.jpg
2 changes: 1 addition & 1 deletion docs/images/sharc_render.svg
2 changes: 1 addition & 1 deletion docs/images/sharc_update.svg
51 changes: 37 additions & 14 deletions include/HashGridCommon.h
@@ -17,11 +17,11 @@
#define HASH_GRID_HASH_MAP_BUCKET_SIZE 32
#define HASH_GRID_INVALID_HASH_KEY 0
#define HASH_GRID_INVALID_CACHE_ENTRY 0xFFFFFFFF
#define HASH_GRID_USE_NORMALS 1 // account for normal data in the hash key
#define HASH_GRID_ALLOW_COMPACTION (HASH_GRID_HASH_MAP_BUCKET_SIZE == 32)
#define HASH_GRID_LEVEL_BIAS 2 // positive bias adds extra levels with content magnification (can be negative as well)
#define HASH_GRID_POSITION_OFFSET float3(0.0f, 0.0f, 0.0f)
#define HASH_GRID_POSITION_BIAS 1e-4f // may require adjustment for extreme scene scales
#define HASH_GRID_NORMAL_BIAS 1e-3f

#define CacheEntry uint
@@ -69,9 +69,9 @@ uint Hash32(HashKey hashKey)

uint GetGridLevel(float3 samplePosition, GridParameters gridParameters)
{
const float distance2 = dot(gridParameters.cameraPosition - samplePosition, gridParameters.cameraPosition - samplePosition);

return uint(clamp(0.5f * LogBase(distance2, gridParameters.logarithmBase) + HASH_GRID_LEVEL_BIAS, 1.0f, float(HASH_GRID_LEVEL_BIT_MASK)));
}

float GetVoxelSize(uint gridLevel, GridParameters gridParameters)
@@ -82,7 +82,7 @@ float GetVoxelSize(uint gridLevel, GridParameters gridParameters)
// Based on logarithmic caching by Johannes Jendersie
int4 CalculateGridPositionLog(float3 samplePosition, GridParameters gridParameters)
{
samplePosition += float3(HASH_GRID_POSITION_BIAS, HASH_GRID_POSITION_BIAS, HASH_GRID_POSITION_BIAS);

uint gridLevel = GetGridLevel(samplePosition, gridParameters);
float voxelSize = GetVoxelSize(gridLevel, gridParameters);
@@ -93,8 +93,6 @@ int4 CalculateGridPositionLog(float3 samplePosition, GridParameters gridParameters)

HashKey ComputeSpatialHash(float3 samplePosition, float3 sampleNormal, GridParameters gridParameters)
{
uint4 gridPosition = uint4(CalculateGridPositionLog(samplePosition, gridParameters));

HashKey hashKey = ((uint64_t(gridPosition.x) & HASH_GRID_POSITION_BIT_MASK) << (HASH_GRID_POSITION_BIT_NUM * 0))
@@ -120,9 +118,9 @@ float3 GetPositionFromHashKey(const HashKey hashKey, GridParameters gridParameters)
const int signMask = ~((1 << HASH_GRID_POSITION_BIT_NUM) - 1);

int3 gridPosition;
gridPosition.x = int((hashKey >> (HASH_GRID_POSITION_BIT_NUM * 0)) & HASH_GRID_POSITION_BIT_MASK);
gridPosition.y = int((hashKey >> (HASH_GRID_POSITION_BIT_NUM * 1)) & HASH_GRID_POSITION_BIT_MASK);
gridPosition.z = int((hashKey >> (HASH_GRID_POSITION_BIT_NUM * 2)) & HASH_GRID_POSITION_BIT_MASK);

// Fix negative coordinates
gridPosition.x = (gridPosition.x & signBit) != 0 ? gridPosition.x | signMask : gridPosition.x;
@@ -156,7 +154,7 @@ void HashMapAtomicCompareExchange(in HashMapData hashMapData, in uint dstOffset,
InterlockedCompareExchange(BUFFER_AT_OFFSET(hashMapData.hashEntriesBuffer, dstOffset), compareValue, value, originalValue);
#endif // !SHARC_ENABLE_GLSL
#else // !HASH_GRID_ENABLE_64_BIT_ATOMICS
// ANY rearrangements to the code below lead to device hang if fuse is unlimited
const uint cLock = 0xAAAAAAAA;
uint fuse = 0;
const uint fuseLength = 8;
@@ -169,7 +167,7 @@ void HashMapAtomicCompareExchange(in HashMapData hashMapData, in uint dstOffset,

if (state != cLock)
{
originalValue = BUFFER_AT_OFFSET(hashMapData.hashEntriesBuffer, dstOffset);
if (originalValue == compareValue)
BUFFER_AT_OFFSET(hashMapData.hashEntriesBuffer, dstOffset) = value;
InterlockedExchange(hashMapData.lockBuffer[dstOffset], state, fuse);
@@ -263,5 +261,30 @@ float3 HashGridDebugColoredHash(float3 samplePosition, GridParameters gridParameters)
{
HashKey hashKey = ComputeSpatialHash(samplePosition, float3(0, 0, 0), gridParameters);

uint gridLevel = GetGridLevel(samplePosition, gridParameters);
float3 color = GetColorFromHash32(Hash32(hashKey)) * GetColorFromHash32(HashJenkins32(gridLevel)).xyz;

return color;
}

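// Debug view: draws one small square per hash map entry, green if the entry holds a valid hash key.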
float3 HashGridDebugOccupancy(uint2 pixelPosition, uint2 screenSize, HashMapData hashMapData)
{
const uint elementSize = 7;
const uint borderSize = 1;
const uint blockSize = elementSize + borderSize;

uint rowNum = screenSize.y / blockSize;
uint rowIndex = pixelPosition.y / blockSize;
uint columnIndex = pixelPosition.x / blockSize;
uint elementIndex = (columnIndex / HASH_GRID_HASH_MAP_BUCKET_SIZE) * (rowNum * HASH_GRID_HASH_MAP_BUCKET_SIZE) + rowIndex * HASH_GRID_HASH_MAP_BUCKET_SIZE + (columnIndex % HASH_GRID_HASH_MAP_BUCKET_SIZE);

if (elementIndex < hashMapData.capacity && ((pixelPosition.x % blockSize) < elementSize && (pixelPosition.y % blockSize) < elementSize))
{
HashKey storedHashKey = BUFFER_AT_OFFSET(hashMapData.hashEntriesBuffer, elementIndex);

if (storedHashKey != HASH_GRID_INVALID_HASH_KEY)
return float3(0.0f, 1.0f, 0.0f);
}

return float3(0.0f, 0.0f, 0.0f);
}
