Purpose: Document the complete data flow from glTF model through the Vulkan ray tracing and rasterization pipeline -- from scene graph to BLAS/TLAS acceleration structures and GPU render nodes.
Critical for: Understanding how scene editing (duplicate/delete) affects the ray tracing and rasterization paths.
┌─────────────────────────────────────────────────────────────────────────┐
│ CPU: tinygltf::Model (Scene Hierarchy) │
│ • nodes[i] - Scene graph nodes │
│ • meshes[j] - Mesh definitions │
│ • materials[k] - Material definitions │
└─────────────────────────────────────────────────────────────────────────┘
│
│ parseScene()
│ (traverses hierarchy, flattens to instances)
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ CPU: Flat Render Arrays (Derived, Regenerated) │
│ • RenderNode[] - One per primitive instance (N:1 with leaf nodes/primitives) │
│ • RenderPrimitive[] - Deduplicated geometry / BLAS │
└─────────────────────────────────────────────────────────────────────────────────┘
│
┌────────────┴────────────┐
│ │
uploadRenderNodes/syncFromScene cmdCreateBuildTopLevelAS()
│ │
▼ ▼
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ GPU: SSBO (Shader Access) │ │ Ray Tracing: TLAS │
│ GltfRenderNode nodes[N] │ │ VkASInstance instances[N] │
│ • objectToWorld │ │ • transform matrix │
│ • materialID │ │ • BLAS reference │
│ • renderPrimID │ │ • instanceCustomIndex │
└──────────────────────────────┘ └──────────────────────────────┘
Key Insight: RenderNodes are derived structures, rebuilt from Model whenever hierarchy changes.
There is a direct correlation between BLAS and primitive index:
- BLAS (Bottom-Level Acceleration Structures) are built from RenderPrimitive[] in order: m_blasAccel[renderPrimID].
- TLAS (Top-Level AS) instances reference BLAS via object.renderPrimID: blasAddress = m_blasAccel[object.renderPrimID].address.
When are BLAS actually built? Only when the Vulkan scene is (re)created:
- Once at load: GltfRenderer::createVulkanScene() → buildAccelerationStructures() → createBottomLevelAccelerationStructure() + cmdBuildBottomLevelAccelerationStructure().
- Again only on full geometry rebuild: GltfRenderer::rebuildVulkanSceneInternal() → buildAccelerationStructures() (e.g. when primitivesChanged is set in dirty flags).
BLAS are not rebuilt every frame or on every parseScene(). Hierarchy-only edits (reparent, duplicate, delete) that do not change the set of meshes/primitives only update CPU render nodes and TLAS (transforms/visibility); the existing BLAS array is reused. So for a given geometry state, BLAS are built once.
If the order of primitives (and thus renderPrimID) ever changed without rebuilding the BLAS, the TLAS would reference the wrong BLAS. Therefore:
- RenderPrimitive list must be built in deterministic order (by mesh index, then primitive index), not in traversal order. See parseScene(): the unique-primitive list is filled by iterating m_model.meshes and their primitives before any scene-graph traversal.
- BLAS are built from that ordered list; renderPrimID is stable across hierarchy edits (reparent, duplicate, delete) as long as the mesh set is unchanged.
- Do not clear and repopulate m_renderPrimitives / m_uniquePrimitiveIndex in a way that depends on traversal order, or BLAS and TLAS will go out of sync.
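The ordering rule above can be illustrated with a minimal sketch. The `Mesh`, `Model`, and `buildPrimitiveIds` names are hypothetical stand-ins, not the project's types, and the real key used by buildPrimitiveKeyMap() may well differ from the (mesh index, primitive index) pair assumed here. The point is only that IDs come from walking the mesh array, never the scene graph.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for tinygltf::Mesh / tinygltf::Model; only the shape matters.
struct Mesh  { int primitiveCount; };
struct Model { std::vector<Mesh> meshes; };

// Assumed key: (mesh index, primitive index).
using PrimitiveKey = std::pair<int, int>;

// Assigns renderPrimID by iterating meshes in index order, so the IDs do not
// depend on how nodes are parented, duplicated, or deleted.
std::map<PrimitiveKey, int> buildPrimitiveIds(const Model& model)
{
  std::map<PrimitiveKey, int> ids;
  int nextId = 0;
  for(std::size_t m = 0; m < model.meshes.size(); ++m)
  {
    assert(model.meshes[m].primitiveCount >= 0);
    for(int p = 0; p < model.meshes[m].primitiveCount; ++p)
      ids[PrimitiveKey(static_cast<int>(m), p)] = nextId++;
  }
  return ids;
}
```

Because the result depends only on the mesh array, calling it again after a reparent or duplicate yields the same IDs, which is what keeps m_blasAccel[renderPrimID] valid without a BLAS rebuild.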
Location: gltf_scene.cpp
Purpose: Traverse scene hierarchy and flatten into renderable instances.
parseScene():
1. Snapshot current render-node state (worldMatrix, materialID, renderPrimID, visible)
2. clearParsedData() -- wipe render nodes, primitives, lights
3. createMissingTangentsForModel() -- ensure tangent attributes exist (stabilizes primitive keys)
4. primMap = buildPrimitiveKeyMap() -- register unique primitives in deterministic order
(by mesh index, then primitive index — not traversal order)
5. For each root node in current scene:
traverseSceneGraph(depth-first):
- collect lights via handleLightTraversal()
- collect render nodes via handleRenderNode(nodeID, worldMatrix, primMap)
6. updateRenderNodesFull() -- apply animations, skinning, morph, visibility
7. Diff against snapshot → set dirty flags:
renderNodesVk / renderNodesRtx (indices that changed)
allRenderNodesDirty (count changed or >50% dirty)
materials, primitivesChanged, lights
handleRenderNode(nodeID, worldMatrix, primMap):
if node has no mesh → skip, continue traversal
createRenderNodesForNode(nodeID, worldMatrix, visible=true, primMap):
for each primitive in mesh:
renderPrimID = primMap[primitiveKey] -- deduplicated primitive index
build RenderNode { worldMatrix, materialID, renderPrimID, refNodeID, skinID, visible }
if node has EXT_mesh_gpu_instancing:
handleGpuInstancing() → create N RenderNodes (one per instance transform)
else:
register single RenderNode in m_renderNodeRegistry
tinygltf::Model: RenderNodes[] (flat, contiguous):
Scene.nodes = [0]
Node[0] "Car" (No RenderNode - empty transform)
├─ children = [1, 2, 3]
│
├─ Node[1] "Body" → RenderNode[0]: {worldMat, matID=5, primID=10, refNode=1}
│ mesh = 5 (1 primitive)
│
├─ Node[2] "Wheels" (No RenderNode - empty transform)
│ ├─ children = [4, 5]
│ │
│ ├─ Node[4] "WheelFL" → RenderNode[1]: {worldMat, matID=7, primID=12, refNode=4}
│ │ mesh = 7 (1 primitive)
│ │
│ └─ Node[5] "WheelFR" → RenderNode[2]: {worldMat, matID=7, primID=12, refNode=5}
│ mesh = 7 (1 primitive) ↑ SAME primID (instanced!)
│
└─ Node[3] "Engine" → RenderNode[3]: {worldMat, matID=9, primID=15, refNode=3}
→ RenderNode[4]: {worldMat, matID=9, primID=16, refNode=3}
mesh = 8 (2 primitives) ↑ SAME node, multiple RenderNodes!
Observations:
- Empty nodes (Car, Wheels) → No RenderNodes (just hierarchy)
- Multi-primitive mesh (Engine) → Multiple RenderNodes (one per primitive)
- Instancing (WheelFL, WheelFR) → Different RenderNodes, same primID
- Hierarchy depth irrelevant → Flat RenderNodes array
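The flattening shown in the Car example can be sketched as a small depth-first traversal. This is a simplified illustration, not the code in gltf_scene.cpp: the `Node`, `Mesh`, and `flatten` names are hypothetical, transforms are omitted, and the material is stored per mesh for brevity (glTF stores it per primitive).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the tinygltf types.
struct Node { int mesh; std::vector<int> children; };       // mesh == -1: no mesh
struct Mesh { std::vector<int> primIDs; int materialID; };  // one renderPrimID per primitive
struct RenderNode { int materialID; int renderPrimID; int refNodeID; };

// Depth-first traversal: emits one RenderNode per primitive of each meshed node.
// Nodes without a mesh emit nothing but are still traversed for their children.
void flatten(const std::vector<Node>& nodes, const std::vector<Mesh>& meshes, int nodeID,
             std::vector<RenderNode>& out)
{
  assert(nodeID >= 0 && static_cast<std::size_t>(nodeID) < nodes.size());
  const Node& node = nodes[nodeID];
  if(node.mesh >= 0)
  {
    const Mesh& mesh = meshes[node.mesh];
    for(int primID : mesh.primIDs)
      out.push_back(RenderNode{mesh.materialID, primID, nodeID});
  }
  for(int child : node.children)
    flatten(nodes, meshes, child, out);
}
```

Fed the Car hierarchy (with the meshes renumbered 0..2), this produces five entries in the same order as the diagram: Body, WheelFL, WheelFR, then the two Engine primitives.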
Function: void SceneVk::uploadRenderNodes(staging, scene, dirtyIndices) or syncFromScene(staging, scene)
Location: gltf_scene_vk.cpp
Purpose: Upload RenderNode data to GPU SSBO for shader access.
// GPU format (shaders/gltf_raster.slang, gltf_pathtrace.slang)
struct GltfRenderNode {
mat4 objectToWorld; // 64 bytes - Transform to world space
mat4 worldToObject; // 64 bytes - Inverse transform (for normals)
int materialID; // Material array index
int renderPrimID; // Primitive array index (vertex/index data)
};
uploadRenderNodes(staging, scene, dirtyIndices):
renderNodes = scene.getRenderNodes()
ensureRenderNodeBuffer(renderNodes.size) -- recreate GPU buffer if size changed
if buffer was recreated OR dirtyIndices is empty:
Full upload: convert all RenderNodes → GltfRenderNode[], stage entire buffer
else:
Surgical upload: for each index in dirtyIndices,
convert RenderNode[index] → GltfRenderNode, stage at byte offset
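The full-versus-surgical decision above can be sketched as a function that plans the byte ranges to stage. `CopyRegion` and `planUpload` are hypothetical names for illustration, and the stride is passed in rather than assumed, since the actual GPU struct size depends on the shader layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One contiguous byte range to copy into the render-node buffer.
struct CopyRegion { std::size_t offset; std::size_t size; };

// Plans the staging copies: one region covering the whole buffer for a full upload,
// or one region per dirty index for a surgical upload.
std::vector<CopyRegion> planUpload(std::size_t nodeCount, std::size_t stride, bool bufferRecreated,
                                   const std::vector<std::uint32_t>& dirtyIndices)
{
  if(bufferRecreated || dirtyIndices.empty())
    return std::vector<CopyRegion>{CopyRegion{0, nodeCount * stride}};

  std::vector<CopyRegion> regions;
  for(std::uint32_t index : dirtyIndices)
  {
    assert(index < nodeCount);  // a dirty index must refer to an existing render node
    regions.push_back(CopyRegion{index * stride, stride});
  }
  return regions;
}
```

A real implementation might also merge adjacent dirty indices into one region to reduce the number of copy commands; that is left out here to keep the sketch close to the steps above.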
Rasterization (vertex shader):
instance = renderNodes[gl_InstanceIndex] -- direct indexing into SSBO
worldMatrix = instance.objectToWorld
materialID = instance.materialID
Ray Tracing (closest hit / any hit):
instance = renderNodes[gl_InstanceID] -- TLAS instance index
worldMatrix = instance.objectToWorld
materialID = instance.materialID
Location: gltf_scene_rtx.cpp
Purpose: Build Top-Level Acceleration Structure (TLAS) for ray tracing.
cmdCreateBuildTopLevelAccelerationStructure(cmd, staging, scene):
drawObjects = scene.getRenderNodes()
for each RenderNode in drawObjects:
blasAddress = m_blasAccel[renderNode.renderPrimID].address
if not renderNode.visible:
blasAddress = 0 -- hide from ray traversal
create VkAccelerationStructureInstanceKHR:
transform = renderNode.worldMatrix -- 3x4 row-major
instanceCustomIndex = renderNode.renderPrimID -- accessible in shader as gl_InstanceCustomIndexEXT
accelerationStructureReference = blasAddress -- which BLAS to use
mask = 0x01
flags = getInstanceFlag(material) -- cull mode from material
append to m_tlasInstances[]
upload m_tlasInstances[] to GPU via staging
build TLAS acceleration structure on device
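Two details of the instance setup above are easy to get wrong: the 3x4 row-major transform and the zeroed BLAS reference for hidden nodes. The sketch below shows both under stated assumptions. `Transform3x4` mirrors the layout of VkTransformMatrixKHR (float matrix[3][4]) but is redeclared so the example compiles without the Vulkan headers, and the input is assumed to be a column-major 4x4 matrix (GLM-style storage). The function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Same layout as VkTransformMatrixKHR: three rows of four floats, row-major.
struct Transform3x4 { float matrix[3][4]; };

// Converts a column-major 4x4 affine matrix to the row-major 3x4 form used by TLAS instances.
// The dropped fourth row must be (0, 0, 0, 1), which holds for any affine transform.
Transform3x4 toTransform3x4(const std::array<float, 16>& colMajor)
{
  assert(colMajor[3] == 0.0f && colMajor[7] == 0.0f && colMajor[11] == 0.0f && colMajor[15] == 1.0f);
  Transform3x4 out{};
  for(int row = 0; row < 3; ++row)
    for(int col = 0; col < 4; ++col)
      out.matrix[row][col] = colMajor[col * 4 + row];
  return out;
}

// Hidden render nodes keep their TLAS slot but get a zero BLAS reference,
// so instance indices stay aligned with the render-node array.
std::uint64_t instanceBlasReference(std::uint64_t blasAddress, bool visible)
{
  return visible ? blasAddress : 0;
}
```

Keeping hidden nodes in the instance list, instead of removing them, is what preserves the RenderNodes[i] to TLAS Instance[i] correspondence that the shaders rely on.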
TLAS (Top-Level Acceleration Structure)
├─ Instance[0] → BLAS[primID=10] @ transform[worldMatrix]
├─ Instance[1] → BLAS[primID=12] @ transform[worldMatrix] ← Instanced (same BLAS)
├─ Instance[2] → BLAS[primID=12] @ transform[worldMatrix] ← Instanced (same BLAS)
├─ Instance[3] → BLAS[primID=15] @ transform[worldMatrix]
└─ Instance[4] → BLAS[primID=16] @ transform[worldMatrix]
Each BLAS (Bottom-Level AS):
BLAS[primID] = acceleration structure for RenderPrimitive[primID]
• Built from vertex/index buffers
• Shared across multiple instances (e.g., wheels)
When a ray hits TLAS instance[i]:
gl_InstanceID = i -- index into renderNodes SSBO
gl_InstanceCustomIndexEXT = renderPrimID -- index into renderPrimitives / BLAS
Shader reads:
renderNodes[gl_InstanceID] → transform, materialID
renderPrimitives[gl_InstanceCustomIndexEXT] → vertex/index buffer info
struct RenderNode {
mat4 worldMatrix; // Computed during parseScene() traversal
int materialID; // → m_model.materials[materialID]
int renderPrimID; // → m_renderPrimitives[renderPrimID]
int refNodeID; // → m_model.nodes[refNodeID] (back-reference)
int skinID; // → m_model.skins[skinID] (or -1)
bool visible; // Visibility flag
};
Relationships:
- N:1 with Node - Multiple RenderNodes per node (if mesh has multiple primitives)
- N:1 with Primitive - Multiple RenderNodes share same primitive (instancing)
- N:1 with Material - Many RenderNodes use same material
struct RenderPrimitive {
tinygltf::Primitive* pPrimitive; // → mesh.primitives[i] (pointer!) (speed up for Skin)
int vertexCount;
int indexCount;
int meshID; // Which mesh this came from
};
Purpose: Deduplicated geometry - if two nodes use same mesh, they share RenderPrimitives.
Indexing: renderNode.renderPrimID → m_renderPrimitives[renderPrimID]
// RenderNodeRegistry: flat vector + bidirectional maps
// - getRenderNodes() → vector<RenderNode> (flat array for GPU upload)
// - getRenderNodesForNode(nodeID) → nodeID → list of RenderNode indices (unordered_map internally)
// - getNodeAndPrim(renderNodeID) → (nodeID, primIndex)
// Example:
// getRenderNodesForNode(3) = {5, 6} // Node 3 has RenderNodes 5 and 6
// → Node 3 has a mesh with 2 primitives
// → RenderNode[5] and RenderNode[6] both have refNodeID = 3
1. scene.load("file.gltf")
→ load tinygltf::Model from disk
→ parseScene()
→ build m_renderPrimitives[] (deduplicated, deterministic order)
→ build m_renderNodes[] (flat instances from hierarchy traversal)
2. sceneVk.create(cmd, staging, scene)
→ upload vertex/index buffers for all primitives
→ uploadRenderNodes() → create GPU SSBO, upload all GltfRenderNode[]
3. buildAccelerationStructures()
→ createBottomLevelAccelerationStructure() → prepare BLAS build data
→ cmdBuildBottomLevelAccelerationStructure() → GPU BLAS build (budgeted)
→ cmdCreateBuildTopLevelAccelerationStructure() → GPU TLAS build
User edits a node transform (gizmo, inspector):
scene.editor().setNodeTRS(nodeIdx, translation, rotation, scale)
→ modifies m_model.nodes[nodeIdx] TRS values
→ calls markNodeDirty(nodeIdx) → adds to m_dirtyFlags.nodes
On next frame, updateSceneChanges(cmd):
1. updateSceneChanges_NodeTransforms():
scene.updateRenderNodeDirtyFromNodes(true)
→ converts dirty node indices to dirty render-node indices
scene.updateNodeWorldMatrices()
→ recomputes world matrices for dirty nodes and descendants
→ updates RenderNode.worldMatrix in the registry
2. sceneVk.syncFromScene(staging, scene)
→ reads dirty flags → uploads ONLY changed RenderNodes (surgical)
3. sceneRtx.syncTopLevelAS(cmd, staging, scene)
→ updates TLAS instances with new transforms (rebuild or update)
Note: Animation updates follow a similar but separate path inline in the animation
processing block (not via updateSceneChanges). The same functions are called
(updateRenderNodeDirtyFromNodes, updateNodeWorldMatrices, syncFromScene, syncTopLevelAS)
but within the animation frame section, which also handles morph/skin GPU uploads.
Optimization: Only changed RenderNodes uploaded, not entire buffer.
User duplicates a node:
scene.editor().duplicateNode(nodeIdx)
→ duplicateNodeRecursive() deep-copies nodes + subtree in m_model
→ links new subtree into parent's children (or scene roots)
→ calls parseScene() internally:
snapshot → clearParsedData → buildPrimitiveKeyMap → traverse → diff
→ dirty flags: allRenderNodesDirty = true (count changed)
On next frame, updateSceneChanges(cmd):
1. sceneVk.syncFromScene(staging, scene)
→ buffer size mismatch detected → recreate buffer → full upload
2. sceneRtx.syncTopLevelAS(cmd, staging, scene)
→ instance count changed → full TLAS rebuild
Key: Hierarchy changes that alter the render-node count trigger a full upload and a full TLAS rebuild (not a surgical update).
Note: deleteNode and duplicateNode are methods on SceneEditor, accessed via scene.editor().
SceneEditor::deleteNode(nodeIndex):
1. deleteNodeRecursive(nodeIndex):
for each child (deepest first):
removeNodeFromParent / removeNodeFromSceneRoots
erase from m_model.nodes[]
remapIndicesAfterNodeDeletion() -- fix all node/animation/skin references
2. parseScene() -- full rebuild, sets dirty flags
On next frame, updateSceneChanges(cmd):
sceneVk.syncFromScene(...) -- resize + upload render nodes
sceneRtx.syncTopLevelAS(...) -- rebuild TLAS
Result: Deleted node's RenderNodes disappear automatically (not in traversal anymore).
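The index fix-up performed after erasing a node can be sketched as follows. This is an illustration of the idea behind remapIndicesAfterNodeDeletion(), not its actual code: the helper names are hypothetical, and only child lists are handled here, whereas the real function also fixes animation and skin references.

```cpp
#include <cassert>
#include <vector>

// Remaps one node reference after node `deleted` was erased from the node array:
// higher indices shift down by one, and a reference to the deleted node becomes -1.
int remapNodeIndex(int index, int deleted)
{
  assert(deleted >= 0);
  if(index == deleted)
    return -1;
  return index > deleted ? index - 1 : index;
}

// Applies the remap to a child list and drops references to the deleted node.
void remapChildren(std::vector<int>& children, int deleted)
{
  std::vector<int> kept;
  for(int child : children)
  {
    const int remapped = remapNodeIndex(child, deleted);
    if(remapped >= 0)
      kept.push_back(remapped);
  }
  children = kept;
}
```

Every array that stores node indices needs the same treatment, which is why deleting deepest children first keeps the remaining indices predictable during a recursive delete.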
SceneEditor::duplicateNode(originalIndex):
1. newIdx = duplicateNodeRecursive(originalIndex, originalParent)
deep-copies node + all descendants in m_model.nodes[]
2. Link new subtree under same parent (or as scene root)
3. parseScene() -- full rebuild, sets dirty flags
return newIdx
On next frame, updateSceneChanges(cmd):
sceneVk.syncFromScene(...) -- resize + upload render nodes
sceneRtx.syncTopLevelAS(...) -- rebuild TLAS
Result: Duplicated node's RenderNodes appear automatically (in traversal now).
Transform update (surgical):
- CPU: O(N) where N = number of dirty nodes (typically 1-10)
- GPU Upload: Only changed RenderNodes (typically < 1KB)
- Cost: ~0.1 ms
Hierarchy change (full rebuild):
- CPU: O(N) where N = total nodes in scene (parseScene traversal)
- GPU Upload: All RenderNodes (typically 1-100 KB)
- TLAS Rebuild: O(N) instances
- Cost: ~1-5 ms (depends on scene complexity)
Typical scene sizes:
- Simple: 10-50 nodes → 20-100 RenderNodes
- Complex: 100-500 nodes → 200-1000 RenderNodes
- Very Complex: 1000+ nodes → 2000+ RenderNodes
Implication: Full rebuilds are cheap enough to do on every hierarchy edit (< 5ms).
m_renderNodeRegistry.getRenderNodes() -- contiguous vector<RenderNode>
m_renderPrimitives[] -- contiguous vector<RenderPrimitive>
Both are directly uploadable to GPU via staging (single memcpy-style transfer).
┌────────────────────────────────────────┐
│ m_bRenderNode (SSBO) │
│ ┌────────────────────────────────────┐ │
│ │ GltfRenderNode[0] │ │ ← Instance 0
│ │ GltfRenderNode[1] │ │ ← Instance 1
│ │ GltfRenderNode[2] │ │ ← Instance 2
│ │ ... │ │
│ │ GltfRenderNode[N-1] │ │ ← Instance N-1
│ └────────────────────────────────────┘ │
└────────────────────────────────────────┘
↑ Shader indexing: nodes[gl_InstanceIndex]
Access Pattern: Direct indexing (O(1)) in shaders.
Never modify render nodes in the registry directly. Always modify the tinygltf Model, then call the appropriate rebuild:
- Transform change: setNodeTRS() + updateNodeWorldMatrices()
- Hierarchy change: editor operation (which calls parseScene() internally)
RenderNodes[i] corresponds to TLAS Instance[i].
When a ray hits instance i, gl_InstanceID = i, and the shader reads renderNodes[i] for transform and material.
m_renderPrimitives[p] corresponds to m_blasAccel[p].
BLAS are built at load and rebuilt only when the primitive set changes (primitivesChanged); transform and hierarchy edits reuse the existing BLAS.
Only the TLAS is rebuilt when instances move, appear, or disappear.
| Operation | Model | RenderNodes | GPU SSBO | TLAS |
|---|---|---|---|---|
| Load scene | ✅ New | ✅ Full rebuild | ✅ Create & upload | ✅ Create & build |
| Merge scene | ✅ New | ✅ Full rebuild | ✅ Full GPU recreation | ✅ Full rebuild |
| Transform change | ✅ Modified | ✅ Partial update | ✅ Surgical upload | ✅ Update instances |
| Add empty node | ✅ Modified | ✅ Full rebuild | No GPU work | No GPU work |
| Delete node | ✅ Modified | ✅ Full rebuild | ✅ Resize & full upload | ✅ Rebuild |
| Duplicate node | ✅ Modified | ✅ Full rebuild | ✅ Resize & upload new | ✅ Rebuild |
| Reparent | ✅ Modified | ✅ Transforms only | ✅ Surgical upload (transforms) | ✅ Update instances |
| Split mesh | ✅ Modified | ✅ Full rebuild | ✅ Surgical upload (changed indices) | ✅ Update instances |
| Merge mesh | ✅ Modified | ✅ Full rebuild | ✅ Surgical upload (changed indices) | ✅ Update instances |
| Material change | ✅ Modified | ✅ Partial update | ✅ Surgical upload | No change |
Key: parseScene() always does a full CPU rebuild, but its internal diff sets precise dirty flags
so the GPU sync is surgical -- only changed render node indices, new materials, etc. are uploaded.
parseScene() snapshots the full render node state (worldMatrix, materialID, renderPrimID, visible) before
clearing, then compares after rebuild. This sets precise dirty flags:
- renderNodesVk / renderNodesRtx: indices where any field differs (surgical upload)
- allRenderNodesDirty: set when count changes or >50% of indices differ (full upload, avoids hash-set overhead)
- materials: new material indices
- primitivesChanged: primitive count changed (BLAS rebuild needed)
- lights: all lights dirty if light count changed
The renderer has one unified sync path (syncFromScene + syncTopLevelAS) that processes these flags.
Buffer resize is handled automatically by size-mismatch detection in uploadRenderNodes and rebuildTopLevelAS.
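The snapshot diff can be sketched as below. This is a minimal illustration under assumptions: `NodeState`, `DirtyResult`, and `diffRenderNodes` are hypothetical names, worldMatrix is left out of the comparison for brevity, and clearing the index list when the 50% threshold trips is a choice made here, not something the source confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-render-node fields compared by the diff (worldMatrix omitted for brevity).
struct NodeState { int materialID; int renderPrimID; bool visible; };

struct DirtyResult
{
  std::vector<std::size_t> indices;  // render nodes that need a surgical upload
  bool allDirty;                     // true: upload everything instead
};

DirtyResult diffRenderNodes(const std::vector<NodeState>& before, const std::vector<NodeState>& after)
{
  DirtyResult result{{}, false};
  if(before.size() != after.size())
  {
    result.allDirty = true;  // count changed: the GPU buffer is resized anyway
    return result;
  }
  for(std::size_t i = 0; i < after.size(); ++i)
  {
    const bool same = before[i].materialID == after[i].materialID
                      && before[i].renderPrimID == after[i].renderPrimID
                      && before[i].visible == after[i].visible;
    if(!same)
      result.indices.push_back(i);
  }
  assert(result.indices.size() <= after.size());
  if(result.indices.size() * 2 > after.size())
  {
    result.allDirty = true;  // more than 50% changed: fall back to a full upload
    result.indices.clear();
  }
  return result;
}
```

The threshold trades many small copies for one large one; the 50% figure comes from the description above, and the best value for a given scene would need profiling.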
When the m_validateGpuSync flag is enabled (default: true), after every updateSceneChanges(), validateGpuSync() compares:
- Shadow copy of last-uploaded render nodes (materialID, renderPrimID) against current CPU state
- TLAS instanceCustomIndex against CPU renderPrimID
- Material buffer size against CPU material count
Any mismatch is logged as a warning with a descriptive error message.
- parseScene() - gltf_scene.cpp
- handleRenderNode() - gltf_scene.cpp
- createRenderNodesForNode() - gltf_scene.cpp
- buildPrimitiveKeyMap() - gltf_scene.cpp
- updateNodeWorldMatrices() - gltf_scene.cpp
- updateRenderNodeDirtyFromNodes() - gltf_scene.cpp
- markNodeDirty() - gltf_scene.cpp
- duplicateNode() / duplicateNodeRecursive() - gltf_scene_editor.cpp
- deleteNode() / deleteNodeRecursive() / deleteNodeSingle() - gltf_scene_editor.cpp
- remapIndicesAfterNodeDeletion() - gltf_scene_editor.cpp
- syncFromScene() - gltf_scene_vk.cpp
- uploadRenderNodes() - gltf_scene_vk.cpp
- uploadPrimitives() - gltf_scene_vk.cpp
- uploadMaterials() - gltf_scene_vk.cpp (uploads the shaderio::GltfShadeMaterial array produced by nvvkgltf::MaterialCache in gltf_material_cache.cpp; the material struct is locally forked, see developer.md → Material System)
- createVertexBuffers() - gltf_scene_vk.cpp
- cmdCreateBuildTopLevelAccelerationStructure() - gltf_scene_rtx.cpp
- createBottomLevelAccelerationStructure() - gltf_scene_rtx.cpp
- syncTopLevelAS() / rebuildTopLevelAS() - gltf_scene_rtx.cpp
- updateSceneChanges() - renderer.cpp (unified sync path, processes all dirty flags)
After parseScene(), log each render node's refNodeID, materialID, and renderPrimID.
The count should match the expected number of mesh-primitive instances in the scene.
After cmdCreateBuildTopLevelAccelerationStructure(), check that m_tlasInstances.size() equals scene.getRenderNodes().size().
The built-in validateGpuSync() does this automatically when enabled.
- RenderNodes count wrong → parseScene() not called after structural edit
- Invisible objects → Check renderNode.visible flag
- Wrong transforms → Check if updateNodeWorldMatrices() called
- Missing instances → Check if node was added to scene.nodes[] (not just created)
- Wrong materials after structural edit → Check validateGpuSync() output in debug build
Data is ONE-WAY:
Model (authoritative) → parseScene() → RenderNodes (derived) → GPU
Never modify RenderNodes directly - always modify Model and rebuild.
Hierarchy changes are cheap - parseScene() + GPU upload ~1-5ms.
This architecture makes editing simple:
- Modify vectors in m_model directly
- Call parseScene() to regenerate RenderNodes
- Upload to GPU via syncFromScene() or uploadRenderNodes()
- Rebuild TLAS via syncTopLevelAS() or cmdCreateBuildTopLevelAccelerationStructure()
No complex synchronization needed - derived data is always regenerated from source of truth.
Option 1: Split parseScene() into targeted rebuilds
- rebuildPrimitivesAndRenderNodes() for split/merge mesh
- rebuildRenderNodesAndLights() for node add/delete (already exists)
- Only rebuild what the operation requires
- Medium complexity
Option 2: Incremental RenderNode updates
- Add/remove RenderNodes without full rebuild
- Remap TLAS instances manually
- Very complex, high bug risk
Current Approach (Full CPU Rebuild + Surgical GPU Sync):
- ✅ Simple and correct
- ✅ CPU rebuild < 5ms for typical scenes (~10-20ms for 1M nodes)
- ✅ GPU sync is surgical (only changed indices uploaded)
- ✅ Debug validation catches any drift
- ✅ One unified sync path in the renderer
Recommendation: Keep full CPU rebuild + surgical GPU sync. Split parseScene() later if profiling shows CPU cost matters at 1M+ node scale.