Description
At some point we should evaluate whether the WebXR API is appropriate for use with audio-only "AR" devices (like these Bose glasses).
This issue splits a portion of the conversation out of #815, where the discussion has mostly focused on using the existing API on existing hardware without a dependency on rendering. I'm setting this to the "future" milestone immediately because hardware that operates in this modality is rare and none of it that I'm aware of is consumer-facing, but I think it's reasonable to expect that to change over the next few years.
Copying some additional comments of mine over from that thread:
Based on the information I've been able to find about audio-only AR devices (like the Bose glasses), I'm not entirely sure how they perform their location-based functions. I would actually be surprised if they were built around any form of real positional tracking; I'm guessing it's more along the lines of Google Lens, which surfaces information based on a captured image with little understanding of the device's precise location. In any case, I'd love to know more about how existing or upcoming non-visual AR devices work so we can better evaluate the appropriate interactions with WebXR.
Now, setting aside the above questions about how current hardware works, if we assume a device exists that provides precise positional tracking but has no visual output component, we can brainstorm how that would theoretically work. While it's not clear how web content would surface itself on such a device, it seems safe to say that traditional immersive-vr style content wouldn't be of much interest, so we'd likely want to advertise a new session mode explicitly for audio-only sessions; let's call it immersive-audio. Once that's established, the various text tweaks David mentions would be appropriate, but from a technical point of view the biggest change would be that an immersive-audio session wouldn't require a baseLayer in order to process XRFrames. Instead we would probably just surface poses via requestAnimationFrame() as usual and allow JavaScript to feed those poses into both the WebAudio API for spatial sound and whatever services are needed to surface the relevant audio data. There are also some interesting possibilities that could come from deeper integration with the audio context, like providing poses directly to an audio worklet.
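As a rough illustration only, here's a minimal sketch of what that might look like, assuming a hypothetical immersive-audio session mode that delivers requestAnimationFrame() callbacks without a baseLayer. None of this is in the spec today; it just shows viewer poses being fed into the Web Audio API's AudioListener so a PannerNode-driven source stays anchored in space:

```js
// Hypothetical sketch: 'immersive-audio' is not a real session mode, and
// current sessions won't deliver XRFrames without a baseLayer. This only
// illustrates routing viewer poses into Web Audio for spatialized sound.
async function startAudioSession() {
  const audioCtx = new AudioContext();

  // A simple spatialized source: a tone anchored 2 meters "ahead" of the
  // origin of the reference space.
  const panner = new PannerNode(audioCtx, {
    panningModel: 'HRTF',
    positionX: 0, positionY: 0, positionZ: -2,
  });
  const osc = new OscillatorNode(audioCtx, { frequency: 440 });
  osc.connect(panner).connect(audioCtx.destination);
  osc.start();

  const session = await navigator.xr.requestSession('immersive-audio');
  const refSpace = await session.requestReferenceSpace('local');

  session.requestAnimationFrame(function onXRFrame(time, frame) {
    frame.session.requestAnimationFrame(onXRFrame);
    const pose = frame.getViewerPose(refSpace);
    if (!pose) return;

    // Drive the AudioListener from the viewer pose so the panned source
    // stays fixed in space as the listener moves and turns.
    const { position, orientation } = pose.transform;
    const listener = audioCtx.listener;
    listener.positionX.value = position.x;
    listener.positionY.value = position.y;
    listener.positionZ.value = position.z;

    const [fx, fy, fz] = rotateVector([0, 0, -1], orientation);
    const [ux, uy, uz] = rotateVector([0, 1, 0], orientation);
    listener.forwardX.value = fx;
    listener.forwardY.value = fy;
    listener.forwardZ.value = fz;
    listener.upX.value = ux;
    listener.upY.value = uy;
    listener.upZ.value = uz;
  });
}

// Rotate vector v by quaternion q = (x, y, z, w): v' = v + w*t + q.xyz × t,
// where t = 2 * (q.xyz × v).
function rotateVector([vx, vy, vz], { x, y, z, w }) {
  const tx = 2 * (y * vz - z * vy);
  const ty = 2 * (z * vx - x * vz);
  const tz = 2 * (x * vy - y * vx);
  return [
    vx + w * tx + (y * tz - z * ty),
    vy + w * ty + (z * tx - x * tz),
    vz + w * tz + (x * ty - y * tx),
  ];
}
```

A deeper integration could avoid the main-thread round trip by handing poses directly to an audio worklet, though that raises its own questions about how pose data would cross the worklet boundary.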
Regardless, given the relative scarcity of this style of hardware today and the large number of unknowns around it, I don't see any pressing need to support this style of content just yet. It's absolutely a topic the Working Group should follow with great interest, though!