It's pretty well-established that Matroska's poor timebase support is one of the format's worst properties. While it supports very precise timestamps (down to the nanosecond), doing so is very inefficient (and the resulting values still aren't exact for most input rates), so muxers tend to default to 1ms timestamps. This can lead to a variety of subtle issues, especially with high-packet-rate streams (e.g. audio) and VFR video content. Muxers can choose rates that are closer to the time base of their inputs (or the packet rate of the content), but exactly how best to do so has always been unclear, and some of the possible options would lead to either worse player behavior or timestamp drift. I'm proposing a format addition to remedy this.
The only actual normative change I propose is this: in addition to the classic nanosecond-denominator time scale, muxers could provide 2 additional integers, serving as the numerator and denominator of an exact rational time base. That fraction would be required to round to the existing nanosecond-scaled value, so that players unaware of the new fields still see a consistent (if slightly inexact) time scale.
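As a rough illustration of that constraint, here is a minimal Python sketch. The function names and the `num`/`den` parameters are hypothetical (nothing here is defined by the spec), and it assumes "round" means round-to-nearest; whether other rounding directions should also be accepted is left open by the proposal.

```python
from fractions import Fraction

NS_PER_SECOND = 10**9


def nearest_scale_ns(num: int, den: int) -> int:
    """Return the whole-nanosecond tick length nearest to num/den seconds.

    Assumption: round-to-nearest. Python's round() breaks exact .5 ties
    toward the even integer; a real spec would need to pin this down.
    """
    exact_ns = Fraction(num, den) * NS_PER_SECOND
    return round(exact_ns)


def is_consistent(num: int, den: int, timestamp_scale_ns: int) -> bool:
    """Check that the rational time base rounds to the stored nanosecond scale."""
    return nearest_scale_ns(num, den) == timestamp_scale_ns


if __name__ == "__main__":
    # One tick per NTSC "24fps" frame: 1001/24000 s is 41708333.33... ns.
    print(nearest_scale_ns(1001, 24000))
    # The classic 1ms scale is exactly representable either way.
    print(is_consistent(1, 1000, 1_000_000))
```

A demuxer that understands the new fields could use the exact fraction for its arithmetic, while a muxer could run a check like `is_consistent` before writing both values.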
This should be paired with some advice for muxer implementations on how to make use of this feature. This depends on the properties of the input. For reference, here are some examples of the error produced by rounding a variety of common time bases to the nearest nanosecond, scaled by 3 hours (a reasonable target for the duration of a film):
nearest_ns(x) = round(x * 1,000,000,000) / 1,000,000,000
ceil_ns(x) = ceil(x * 1,000,000,000) / 1,000,000,000
floor_ns(x) = floor(x * 1,000,000,000) / 1,000,000,000
nearest_error(x) = 1 - (x / nearest_ns(x))
ceil_error(x) = 1 - (x / ceil_ns(x))
floor_error(x) = 1 - (x / floor_ns(x))
nearest_error_3h(x) = nearest_error(x) * 60 * 60 * 3
ceil_error_3h(x) = ceil_error(x) * 60 * 60 * 3
floor_error_3h(x) = floor_error(x) * 60 * 60 * 3
e(x) = nearest_error_3h(1 / x)
ce(x) = ceil_error_3h(1 / x)
fe(x) = floor_error_3h(1 / x)
# Integer video frame rates
e(24) => 8.64e-5
e(25) => 0
e(30) => -0.0001
e(48) => -0.0002
e(50) => 0
e(60) => 0.0002
e(120) => -0.0004
# NTSC video frame rates
e(24/1.001) => -8.6314e-5
e(30/1.001) => 0.0001
e(48/1.001) => 0.0002
e(60/1.001) => -0.0002
e(120/1.001) => 0.0004
# TrueHD frame rates
e(44100/40) => -0.0057
e(48000/40) => -0.0043
e(88200/40) => 0.0062
e(96000/40) => 0.0086
# AAC frame rates
e(44100/960) => -0.0002
e(48000/960) => 0
e(88200/960) => 0.0003
e(96000/960) => 0
e(44100/1024) => 0.0002
e(48000/1024) => -0.0002
e(88200/1024) => -0.0003
e(96000/1024) => 0.0003
# MP3 frame rates
e(44100/1152) => 8.4375e-6
e(48000/1152) => 0
e(88200/1152) => -0.0004
e(96000/1152) => 0
# Other audio frame rates
e(44100/128) => -0.0012
e(48000/128) => 0.0013
e(88200/128) => -0.0012
e(96000/128) => -0.0027
e(44100/2880) => -7.425e-5
e(48000/2880) => 2.3981e-12
e(88200/2880) => -7.425e-5
e(96000/2880) => 2.3981e-12
# GCF of common short-first audio frame sizes
e(44100/64) => -0.0012
e(48000/64) => -0.0027
e(88200/64) => 0.0062
e(96000/64) => 0.0054
# Raw audio sample rates
e(44100) => 0.1253
e(48000) => -0.1728
e(88200) => 0.1253
e(96000) => 0.3456
fe(44100) => -0.351
ce(48000) => 0.3456
fe(88200) => -0.8273
fe(96000) => -0.6912
# MPEGTS time base
e(90000) => -0.108
ce(90000) => 0.8639
# Common multiples
e(30000) => -0.108
e(60000) => 0.216
e(120000) => -0.432
e(240000) => 0.8639
e(480000) => -1.7283
ce(30000) => 0.216
fe(60000) => -0.432
ce(120000) => 0.8639
fe(240000) => -1.7283
ce(480000) => 3.4549
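For anyone who wants to reproduce or extend these figures, the definitions above can be sketched in Python roughly as follows. The function name `drift_3h` is my own, and exact rational arithmetic (`fractions.Fraction`) is used so that float error in the intermediate steps doesn't pollute the result; NTSC rates are passed as exact fractions such as `Fraction(24000, 1001)`.

```python
import math
from fractions import Fraction

NS_PER_SECOND = 10**9
THREE_HOURS = 3 * 60 * 60  # seconds


def drift_3h(rate, rounding=round):
    """Drift in seconds over 3 hours when a 1/rate tick is rounded to whole ns.

    `rounding` maps the exact tick length in ns to an integer:
    round (nearest, matching e), math.ceil (ce), or math.floor (fe).
    """
    exact_ns = NS_PER_SECOND / Fraction(rate)   # exact tick length in ns
    rounded_ns = rounding(exact_ns)             # what gets stored in the file
    relative_error = 1 - exact_ns / rounded_ns  # same as 1 - x / nearest_ns(x)
    return float(relative_error * THREE_HOURS)


if __name__ == "__main__":
    print(drift_3h(24))                       # e(24)
    print(drift_3h(Fraction(24000, 1001)))    # e(24/1.001)
    print(drift_3h(90000))                    # e(90000)
    print(drift_3h(90000, math.ceil))         # ce(90000)
```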
As we can see, rounding common video and audio frame rates (including e.g. the least common multiple of 24 and 60 for that VFR case) produces a negligible amount of error over a reasonable duration. This means that for content where all timestamps can reasonably be expressed in integer values of those rates, there would be no significant error over common file durations, even if different streams were muxed with different time bases.
There are a few real-world time bases that would produce significant rounding error (upwards of 100ms) over the course of 3 hours when used in existing players: MPEGTS's 90000Hz, all common raw audio sample rates, and least-common-multiples between integer and NTSC video frame rates. This essentially means that mixing these rates with others would produce significant desync over a reasonable duration for static on-disk content; the same issue could occur when muxing very lengthy content (e.g. streaming).
All of these issues can be addressed in one of the following ways:
- Using a lower rate (e.g. 90,000Hz isn't usually the real content rate but instead an artifact of its previous container; expressing timestamps in samples rather than frames is usually unnecessary)
- Choosing the highest of the input rates for all streams (e.g. 48000 is a multiple of many common frame rates, including 24/1.001)
- Choosing a more precise common-multiple rate that may create a larger total drift, but does so equally for all streams (see the "Common multiples" section; 1/30000 is suitable for mixing 24fps and 30/1.001fps content alongside most common framed audio rates, while the later listed bases are suitable for increasingly large sets).
- Rounding some tracks' nanosecond timescales in the opposite direction, creating a larger drift, but potentially one with the same sign (and thus a closer value) as the drift in other tracks (this is probably too complex and niche to have substantial use)
- Falling back to classic rounded nanosecond-based timestamps (and not writing an integer-fraction time base at all)
- Using the extension anyway, resulting in significant sync drift in older players that haven't implemented the change
This last option is usually unacceptable, but may be fine for files that use codecs that become available after the change is made (and thus are unavoidably non-backwards-compatible anyway).
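To make the common-multiple option above concrete, here is a small Python sketch of how a muxer might find the lowest rate in which every input rate's ticks land on integers. The function name `common_rate` is hypothetical, this is only one possible heuristic rather than proposed spec text, and it needs Python 3.9+ for the variadic `math.lcm`/`math.gcd`.

```python
import math
from fractions import Fraction


def common_rate(rates):
    """Smallest rate (in Hz) that is an integer multiple of every input rate.

    For fractions in lowest terms, the least common multiple is
    lcm(numerators) / gcd(denominators).
    """
    fracs = [Fraction(r) for r in rates]  # Fraction normalizes to lowest terms
    numerator = math.lcm(*(f.numerator for f in fracs))
    denominator = math.gcd(*(f.denominator for f in fracs))
    return Fraction(numerator, denominator)


if __name__ == "__main__":
    # 24fps mixed with 30/1.001fps fits a 1/30000 time base.
    print(common_rate([24, Fraction(30000, 1001)]))
    # 24/1.001fps video alongside 48kHz audio: 48000 already covers both.
    print(common_rate([Fraction(24000, 1001), 48000]))
```

A muxer could then compare the 3-hour drift of the resulting rate against some threshold and fall back to one of the other options when it's too large.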
If combined with clear advice in the spec on how muxers SHOULD (or MAY) decide on time bases for various possible input cases, I think this extension could get actual adoption in muxers and solve one of the format's longest-standing problems.