Prioritizing Extending glTF to Support Interactive, Multi-Participant, Synchronized Stereo and Volumetric Video and Audio Experiences
Background:
glTF has become foundational for 3D asset delivery, but has yet to fully address the rise of immersive video experiences. With the ongoing spatial video renaissance—including the launch of new stereoscopic and volumetric camera systems—there is now an urgent need for synchronized, immersive video integration into the glTF ecosystem. This includes support for stereo and 4D video, external video resources, audio synchronization, and timeline metadata.
This issue does not propose a fixed specification; rather, it asks that this capability be prioritized on glTF's roadmap. It is a strategic call to action and a request for collaboration—particularly at SIGGRAPH 2025—to champion and incubate a standards-based approach.
Why Now:
- New spatial and volumetric capture workflows are entering the market.
- Spatial video is being widely adopted for VR, AR, social, education, training, and storytelling.
- Closed or proprietary immersive video pipelines hinder broad adoption. glTF’s open, efficient design provides the interoperable foundation needed for synchronized, immersive video experiences at scale.
Motivation & Use Cases:
- Synchronized XR Viewing (6DoF): Multi-user watch parties in VR with frame-accurate sync and full 6DoF movement during video playback.
- Timeline-driven Interactivity: Story branches, scene triggers, and animation sync linked to video playback.
- Training & Simulation: Anchors for narration, UI overlays, or instruction stages bound to moments in time.
- Live Broadcast + VOD: Hybrid delivery of spatial or volumetric video enhanced by glTF overlays or spatial metadata.
- Volumetric Media Players: Integration with glTF scenes for marker-based playback in 3D spaces.
Clarifications:
- This functionality is best suited to a KHR extension, not a vendor-specific one. It addresses a foundational media use case for glTF that crosses vendors, platforms, and runtimes.
- This proposal is focused on synchronized video and its metadata in glTF.
- It defers to and aims to support KHR_video as the canonical extension for external video streaming.
- Stereoscopic or volumetric (4D) format support should be a property of the video node (e.g., `layout`, `viewCount`, or an additional `type` field), not a separate extension.
- Whether a video is mono, stereo, or volumetric is declared via metadata, not autodetected.
- Synchronization metadata is authored in the glTF file, not embedded in or altering the external video.
- Video files remain external resources and stream directly via WebCodecs, `<video>`, or native decoders (a minimal sketch follows this list).
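As a minimal sketch of that last point (assuming a hypothetical parsed extension object with a `uri` field—the actual schema is not yet defined), a web runtime could stream the external resource through a standard `<video>` element:

```js
// Minimal sketch: stream an external video resource referenced by the asset.
// `videoDef.uri` and `gltfBaseUrl` are assumed names; the schema is not fixed.
const videoEl = document.createElement("video");
videoEl.src = new URL(videoDef.uri, gltfBaseUrl).href; // resolve relative to the .gltf
videoEl.crossOrigin = "anonymous";
videoEl.muted = true; // permits autoplay in most browsers
videoEl.play();

// Each frame, the current video frame can be sampled into the scene, e.g. via
// gl.texImage2D(..., videoEl) in WebGL, or
// device.importExternalTexture({ source: videoEl }) in WebGPU.
```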
Synchronization Handles and Timeline Metadata
We propose extending KHR_video with structured timeline metadata for synchronizing scene elements with video playback. While not currently part of any published glTF extension, the need for this functionality has been implicit in use cases involving synchronized audio-visual playback and multi-user XR experiences. We now articulate it explicitly for community discussion and incubation.
Proposed Schema Additions (Illustrative Only)
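A sketch of what such additions might look like follows. All field names here (`type`, `layout`, `viewCount`, `syncId`, `timeline.markers[]`) are illustrative and subject to working-group definition; none are part of a ratified extension.

```json
{
  "extensions": {
    "KHR_video": {
      "videos": [
        {
          "uri": "feature.mp4",
          "type": "stereo",
          "layout": "side-by-side",
          "viewCount": 2,
          "syncId": "watch-party-001",
          "timeline": {
            "precision": "frame",
            "markers": [
              { "time": 12.5, "label": "scene_intro", "node": 3 },
              { "time": 48.0, "label": "branch_point_a", "node": 7 }
            ]
          }
        }
      ]
    }
  }
}
```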
Design Considerations & Interoperability
- `timeline.markers[]` allows glTF nodes to be triggered or updated based on video playback (see the dispatch sketch after this list).
- `syncId` provides an identifier for distributed multi-user playback synchronization.
- This mechanism should be compatible with `KHR_audio` or a future audio sync extension to support AV alignment. The interplay between video and audio extensions should be explicitly coordinated to ensure timeline consistency, precision timing, and event dispatch compatibility across platforms.
- The `type` field gives the video renderer and XR runtime an explicit declaration, without relying on container inference.
- Timeline events should define precision (milliseconds or frame-accurate), label constraints (UTF-8 strings), and node anchoring behavior (e.g., by index or UUID reference).
- Fallback behavior should be explicitly defined for unsupported platforms, including whether timeline entries are silently ignored or reported via a standard runtime warning.
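As a sketch of how a runtime might consume these markers (helper names such as `attachTimeline` and `onMarker` are hypothetical, not part of any proposal text), playback-driven dispatch could look like this:

```js
// Hypothetical sketch: dispatch timeline markers as playback passes them.
// Assumes timeline.markers is sorted by time; onMarker is app-supplied.
function attachTimeline(videoEl, timeline, onMarker) {
  let next = 0; // index of the next marker not yet dispatched
  videoEl.addEventListener("timeupdate", () => {
    while (next < timeline.markers.length &&
           timeline.markers[next].time <= videoEl.currentTime) {
      onMarker(timeline.markers[next]); // e.g., trigger the anchored glTF node
      next += 1;
    }
  });
  videoEl.addEventListener("seeking", () => {
    // Reset the cursor so markers re-fire correctly after a seek.
    next = timeline.markers.findIndex(m => m.time >= videoEl.currentTime);
    if (next === -1) next = timeline.markers.length;
  });
}
```

Note that `timeupdate` typically fires only a few times per second; frame-accurate precision would require a tighter clock source such as `requestVideoFrameCallback`.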
Adaptive Rendering: 2D Contexts and Surfaces
While the KHR_video extension supports immersive stereo and volumetric video types, not all playback environments are XR-capable. To ensure compatibility:
- Runtimes MAY render videos with `"type": "stereo"` or `"type": "volumetric"` as monoscopic if the target platform lacks stereoscopic or volumetric capability.
- `"layout": "side-by-side"` or `"layout": "top-bottom"` SHOULD be respected by immersive players, but monoscopic fallbacks SHOULD default to the left eye (or first view) only.
- A `"type": "mono"` video is always rendered as 2D regardless of environment.
- If a video is associated with a 2D surface (e.g., a mesh in a non-XR glTF scene), it MAY be rendered using the same `KHR_video` metadata.
- Players SHOULD detect the rendering context (e.g., no headset, WebGL-only) and apply the appropriate decoding/rendering pathway.
```js
// Fallback selection: pick a rendering pathway from the declared video type
// and the capabilities of the current platform.
if (videoExtension.type === "stereo" && !xrSupported) {
  // Crop to the left eye (first view) for monoscopic display.
  renderMonoFromStereo(video, "left");
} else if (videoExtension.type === "volumetric" && !volumetricDecoderAvailable) {
  // No volumetric decoder available: show a placeholder mesh or poster frame.
  showFallbackMeshOrPosterFrame();
}
```

This allows the same `KHR_video` declaration to work across XR and non-XR environments without needing separate assets.
Implementation Notes
- Multi-user synchronization via `syncId` introduces session-based state handling, which may be decoupled from playback timing. This aspect may merit its own focused proposal to support broader networking and coordination use cases (a sketch follows this list).
- Runtime and toolchain vendors should be consulted early to confirm feasibility, flag performance bottlenecks (e.g., decoding latency or AV sync drift), and encourage adoption.
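As a sketch only (the message shape and every field name below are hypothetical), a session coordinator might broadcast playback state keyed by `syncId`, with clients correcting drift against a shared reference clock:

```js
// Hypothetical sync message; all field names are illustrative.
const syncMessage = {
  syncId: "watch-party-001", // matches the asset's declared syncId
  mediaTime: 48.25,          // seconds into the video at serverTime
  serverTime: 1735689600.0,  // shared reference clock timestamp (seconds)
  rate: 1.0                  // playback rate; 0 means paused
};

// Client-side correction: estimate the authoritative media time, then
// hard-seek on large drift or gently adjust playback rate on small drift.
function applySync(videoEl, msg, nowSeconds) {
  const target = msg.mediaTime + (nowSeconds - msg.serverTime) * msg.rate;
  const drift = videoEl.currentTime - target;
  if (Math.abs(drift) > 0.25) {
    videoEl.currentTime = target;
  } else {
    videoEl.playbackRate = msg.rate - drift * 0.5;
  }
}
```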
Backward Compatibility
- The extension is fully optional. glTF assets that do not declare the `KHR_video` extension will function as they do today.
- When present, `KHR_video` metadata is non-destructive and additive, providing enhanced capability without requiring changes to existing workflows.
- Loaders and runtimes that do not support `KHR_video` will safely ignore the extension, per glTF's standard extension handling rules.
Reference Implementations
- Web Implementation: WebXR + WebCodecs + WebGPU video playback bound to glTF scenes, with marker events dispatched to the runtime.
- Native Implementation: An extended libVLC or OpenXR player that parses glTF timeline metadata and synchronizes stereo or 4D video playback with spatial anchors.
Next Steps
- Coordinate with the KHR_video extension editors to define where timeline sync fits.
- Define data formats and constraints for `timeline.markers[]` and `type` values, including expected precision, allowed label syntax, and node resolution behavior. Clarify optional vs. required fields and establish fallback behaviors for unsupported entries.
- Prototype a glTF player that supports timeline-based metadata events and synchronization triggers for both local playback and multi-user streaming scenarios.
- Align timeline sync mechanisms with the audio extension and establish a shared reference clock model or sequencing layer.
- Encourage formation of a SIGGRAPH 2025 working group to define governance and reference architecture.
Call to SIGGRAPH 2025
We urge members of the community to champion this proposal. Spatial and volumetric video experiences—especially those requiring frame-accurate synchronization and event-based interactivity—are already being developed today. glTF must provide a robust, extensible structure to support time-based media coordination in 3D and XR contexts.
To prevent fragmentation, this should not remain a vendor-specific effort. It is best suited for incubation as a KHR_video extension with structured metadata support, ideally reviewed alongside the audio timeline effort. Governance should include representatives from both runtime vendors and spatial video toolchains.
A suggested roadmap:
- Q3 2025: Schema and validator discussion with glTF WG
- Q4 2025: Reference implementation published across one WebXR engine and one native stack
- SIGGRAPH 2026: Interop showcase or working session to ratify shared timeline metadata alignment across media
Let’s ensure glTF is prepared to deliver open, synchronized, immersive video experiences to the next generation of devices.
Cc: @KhronosGroup/glTF, @immersive-web, @webcodecs, @XR-Community, @SIGGRAPH-2025
Submitted by Ben Erwin as a community member and standards advocate.