I'm currently rewriting SeeShark from scratch without FFmpeg. Right now the backends are mostly done for Linux and MacOS, and I'm about to implement the Windows one.
However, one problem that has become very apparent is managing image formats in a cross-platform way, as it turns out that every platform does it differently.
I've kept some notes on the differences, so I want to unfold them here and potentially see if I can find a better way to handle input formats.
Vocabulary
In SeeShark 5, I use "image format" for the thing that describes how an image is laid out in memory. It could be raw pixels arranged in a certain way like ARGB or YUYV, some planar configuration, or a compressed format like MJPEG. I chose this name because, in my view, it best fits how the concept is actually used. Operating systems and FFmpeg call it either an "input format" or a "pixel format" (the latter generally referring only to raw formats), but the terminology is applied inconsistently and is a bit of a mess.
And thus a video format contains an image format describing the format of its frames.
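To make that vocabulary concrete, here is a rough sketch of what those two notions could look like as types (the names and members are illustrative, not SeeShark's actual API):

```csharp
// Illustrative sketch only, not SeeShark's actual API: a video format
// contains an image format describing the layout of its frames.
public enum ImageFormat
{
    ARGB,  // raw interleaved pixels
    YUYV,  // raw packed YUV 4:2:2
    MJPEG, // compressed frames
}

public readonly record struct VideoFormat(
    ImageFormat ImageFormat, // how each frame is laid out in memory
    int Width,
    int Height,
    double FrameRate);
```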
How input formats are queried
Below, I illustrate how input formats are queried on each platform with some C#-ish pseudo-code.
Linux
```csharp
List<VideoFormat> supportedVideoFormats = [];
foreach (var inputFormat in SupportedInputFormats(selectedDevice))
    foreach (var frameSize in QueryFrameSizes(selectedDevice, inputFormat))
        foreach (var frameInterval in QueryFrameIntervals(selectedDevice, inputFormat, frameSize))
            supportedVideoFormats.Add(new VideoFormat(inputFormat, frameSize, frameInterval));
```

On Linux with V4L2, both frame sizes and frame intervals can be "discrete" (one value), "continuous" (a minimum and a maximum), or "stepwise" (a minimum and a maximum, along with a step size describing which values in between are also valid). I'm currently not handling continuous frame sizes and intervals, but it's probably important that I do, so I'll have to look more into it.
Question: are the supported input formats chosen by the device or by the kernel? Do they depend on the device?
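For what it's worth, the stepwise case can be expanded into discrete sizes mechanically. A minimal sketch, assuming a hypothetical StepwiseFrameSize type that isn't SeeShark's API:

```csharp
using System.Collections.Generic;

// Sketch: expanding a V4L2-style stepwise frame size range into discrete
// sizes. StepwiseFrameSize is a hypothetical type, not SeeShark's API.
public readonly record struct StepwiseFrameSize(
    int MinWidth, int MaxWidth, int StepWidth,
    int MinHeight, int MaxHeight, int StepHeight);

public static class FrameSizes
{
    public static IEnumerable<(int Width, int Height)> Expand(StepwiseFrameSize s)
    {
        for (int w = s.MinWidth; w <= s.MaxWidth; w += s.StepWidth)
            for (int h = s.MinHeight; h <= s.MaxHeight; h += s.StepHeight)
                yield return (w, h);
    }
}
```

V4L2 itself defines the continuous case as stepwise with a step of 1, so this covers it too in principle, though enumerating every value that way is obviously impractical, which is part of the problem.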
MacOS
```csharp
List<VideoFormat> supportedVideoFormats = [];
var pixelFormatTypes = AvailablePixelFormatTypes();
foreach (var deviceFormat in SupportedCaptureDeviceFormats(selectedDevice))
{
    var maxPhotoDimensions = SupportedMaxPhotoDimensions(deviceFormat);
    var frameRateRanges = VideoSupportedFrameRateRanges(deviceFormat);
    foreach (var maxPhotoDimension in maxPhotoDimensions)
        foreach (var frameRateRange in frameRateRanges)
            foreach (var pixelFormatType in pixelFormatTypes)
                supportedVideoFormats.Add(new VideoFormat(maxPhotoDimension, frameRateRange.Max, pixelFormatType));
}
```

On MacOS with AVFoundation, it seems like a device can have multiple "supported max photo dimensions". I assume these are basically the same as Linux's discrete frame sizes, but I'm not sure, and Apple's documentation doesn't say much else about them. There are also frame rate ranges, which seem equivalent to Linux's continuous frame intervals, and they carry no information about which in-between values are also valid. Maybe they all are.
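If every value within a range really is valid, then checking a requested rate against a device format becomes simple. A sketch, assuming a hypothetical FrameRateRange type standing in for AVFoundation's min/max pairs:

```csharp
using System.Collections.Generic;

// Sketch: checking a requested frame rate against AVFoundation-style
// min/max ranges. FrameRateRange is a hypothetical stand-in type.
public readonly record struct FrameRateRange(double Min, double Max);

public static class FrameRates
{
    public static bool Supports(IEnumerable<FrameRateRange> ranges, double fps)
    {
        foreach (var r in ranges)
            if (fps >= r.Min && fps <= r.Max)
                return true;
        return false;
    }
}
```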
Windows
TODO
FourCC codes for image formats
All these platforms use FourCC codes to represent image formats, but it seems like:
- there can be multiple FourCC codes for the same format, according to the documentation I found on wiki.multimedia.cx
- each platform uses only one of them, and not the same one.
For example:
- the UYVY 4:2:2 pixel format's FourCC code is
  - 'UYVY' on Linux (V4l2InputFormat.UYVY)
  - '2vuy' (or 'yuv2'?) on MacOS (CVPixelFormatType.k_422YpCbCr8)
- the RGBA 32-bit pixel format's FourCC code is
  - 'AB24' on Linux (V4l2InputFormat.RGBA32)
  - 'RGBA' on MacOS (CVPixelFormatType.k_32RGBA)
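To make matters worse, even the packing of the four characters into an integer differs between platforms: V4L2's v4l2_fourcc() macro puts the first character in the low byte, while Apple's four-character OSType codes put it in the high byte. A quick sketch of the two conventions:

```csharp
// Sketch: the same four characters pack into different integers depending
// on the platform. V4L2 packs the first character into the low byte;
// Apple's four-character OSType codes pack it into the high byte.
public static class FourCC
{
    public static uint V4l2(char a, char b, char c, char d) =>
        (uint)a | ((uint)b << 8) | ((uint)c << 16) | ((uint)d << 24);

    public static uint Apple(char a, char b, char c, char d) =>
        ((uint)a << 24) | ((uint)b << 16) | ((uint)c << 8) | (uint)d;
}
```

For example, V4L2's 'UYVY' packs to 0x59565955 while Apple's '2vuy' packs to 0x32767579, so the raw integer alone doesn't identify a format across platforms.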
So unfortunately, I can't just use a single integer to represent all of them in a standard way; it seems it has to be a giant enum instead, with big switch statements for each platform. And that requires testing them, which I might not be able to do with my mere two cameras and an OBS virtual camera (one of the cameras being my MacBook Air's built-in webcam).
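The giant-enum approach could look something like this sketch (the enum members and mappings are illustrative; only the two example formats above are filled in):

```csharp
using System;

// Sketch of a cross-platform image format enum with per-platform FourCC
// mappings. Only the two example formats are filled in.
public enum CrossPlatformImageFormat { UYVY, RGBA32 }

public static class FourCCMapping
{
    public static string ToLinux(CrossPlatformImageFormat f) => f switch
    {
        CrossPlatformImageFormat.UYVY => "UYVY",
        CrossPlatformImageFormat.RGBA32 => "AB24",
        _ => throw new NotSupportedException($"no V4L2 FourCC for {f}"),
    };

    public static string ToMacOS(CrossPlatformImageFormat f) => f switch
    {
        CrossPlatformImageFormat.UYVY => "2vuy",
        CrossPlatformImageFormat.RGBA32 => "RGBA",
        _ => throw new NotSupportedException($"no CoreVideo FourCC for {f}"),
    };
}
```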
Some useful links
- FourCC
- fourcc.org
- Video FourCCs
- OS sources
- Windows Video FourCCs
- Apple Pixel Format Identifiers
  CVPixelBuffer.h: header file containing all definitions with their corresponding FourCC if it exists, available on MacOS at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/CoreVideo.framework/Versions/A/Headers/CVPixelBuffer.h
- Linux Pixel Formats
  videodev2.h: header file containing all definitions with their corresponding FourCC
Now what?
Some problems I see are as follows.
First, the individual parts of a video format are queried differently and with different dependencies on each platform (as illustrated by the nested foreaches in the pseudo-C#). Right now I brute-force the enumeration by returning all combinations of all possible configurations (most of the time), but this is probably not sustainable, and even if it were, it is jarring to see so many input formats when they could be expressed much more succinctly.
Second, some values are described as a range instead of a list of discrete values. If literally anything between the min and max is valid, then I'll have to make the way I list available video formats more intelligent and start describing these ranges too.
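One way to cover both cases with a single type is a small discriminated union over discrete and ranged values (continuous being a range with a step of 1, as V4L2 itself defines it). A sketch:

```csharp
// Sketch: one type covering discrete, stepwise, and continuous value sets.
public abstract record ValueSet
{
    public sealed record Discrete(int Value) : ValueSet;
    public sealed record Range(int Min, int Max, int Step) : ValueSet;

    public bool Contains(int v) => this switch
    {
        Discrete d => v == d.Value,
        Range r => v >= r.Min && v <= r.Max && (v - r.Min) % r.Step == 0,
        _ => false,
    };
}
```

Listing available video formats would then mean reporting these sets directly rather than expanding them into every combination.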
So yeah, there's a lot of stuff to think about here.