Name: “Corruption Detection”; “Extension for Automatic Detection of Video Corruptions”
Formal name: http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection
Status: This extension is defined here to allow for experimentation.
Contact: sprang@google.com
NOTE: This explainer is a work in progress and may change without notice.
The Corruption Detection (sometimes referred to as automatic corruption detection or ACD) extension is intended to be part of a system that estimates the likelihood that a video transmission is in a valid state: that is, that the input to the video encoder on the send side corresponds to the output of the video decoder on the receive side, with the only difference being the expected distortions from lossy compression.
The goal is to be able to detect outright coding errors caused by things such as bugs in encoder/decoders, malformed packetization data, incorrect relay decisions in SFU-type servers, incorrect handling of packet loss/reordering, and so forth. We want to accomplish this with a high signal-to-noise ratio while consuming a minimum of resources in terms of bandwidth and/or computation. It should be noted that it is not a goal to be able to e.g. gauge general video quality using this method.
This explainer contains two parts: the message format of the header extension, and the intended usage and expected client behavior.
If this extension has been negotiated, all the client behavior outlined in this doc MUST be adhered to.
The message format of the header extension:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|B|  seq index  |    std dev    | Y err | UV err|    sample 0   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    sample 1   |    sample 2   |  ... up to sample <= 12
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
A special case is the so-called “synchronization” message. Such a message contains only the first byte and is used to keep the sender and receiver in sync even if no “full” message has been received for a while. Synchronization messages MUST NOT be sent on droppable frames.
Privacy and security are core parts of nearly every WebRTC-based application, which means that some sort of encryption needs to be present. The most common form of encryption is SRTP, defined in RFC 3711. However, as mentioned in section 9.4 of that RFC, RTP header extensions are considered part of the header and are thus not encrypted.
The automatic corruption detection header extension is different from most other header extensions in that it provides not only metadata about the media stream being transmitted but in practice comprises an extremely sparse representation of the actual video stream itself. Given a static scene and enough time, a crude image of the otherwise encrypted video can rather trivially be reconstructed.
As such, most applications should use this extension with SRTP only if additional security is present to protect it. That could be for example in the form of explicit header extension encryption provided by RFC 6904/RFC 9335, or by encapsulating the entire RTP stream in an additional layer such as IPSec.
In this section we’ll first look at a general overview of the intended usage of this header extension, followed by more details around the expected implementation.
The premise of the extension described here is that we can validate the state of the video pipeline by quasi-randomly selecting a few samples from the raw input frame to an encoder, and then checking them against the output of a decoder. Assuming that a lossless codec is used we can follow these steps:

1. On the send side, read the values of a few quasi-randomly selected samples from the raw input frame and transmit them alongside the encoded frame.
2. On the receive side, read the values at the same locations in the decoded frame.
3. Compare the two sets of values; with lossless coding, any mismatch indicates an invalid state.
Lossless encoding is, however, rarely used in practice, and the distortions introduced by lossy compression cause problems for the above algorithm.
We must therefore take these distortions into consideration, as they are merely a natural side-effect of the compression and their effect is not to be considered an “invalid state”. We aim to accomplish this using two tools.
First, instead of a sample being a single raw sample value let it be a filtered one: a weighted average of samples in the vicinity of the desired location, with the weights being a 2D Gaussian centered at that location and the variance adjusted depending on the magnitude of the expected distortions (higher distortion => higher variance). This smoothes out inaccuracies caused by both quantization and motion compensation.
Secondly, even with a very large filter kernel the new sample might not converge towards the exact desired value. For that reason, set an “allowed error threshold” that removes small magnitude differences. Since chroma and luma channels have different scales, separate error thresholds are available for them.
The quasi-random sequence of choice for this extension is a 2D Halton Sequence.
The index into the Halton Sequence is indicated by the header extension. It is a 14 bit unsigned integer that on overflow wraps around back to 0.
For each sample contained within the extension, the sequence index should be considered to be incremented by one. Thus the sequence index at the start of the header should be considered “the sequence index for the next sample to be drawn”.
The ACD extension may be sent containing either the 7 most significant bits (B = true) or the 7 least significant bits (B = false) of the sequence index.
Key-frames MUST be populated with the ACD extension, and those MUST use B = true, indicating that only the 7 most significant bits are transmitted.
The sender may choose any arbitrary starting point. The biggest reason to not always start with (B = true, seq index = 0) is that with frequent/periodic keyframes you might end up always sampling the same small subset of image locations over and over.
If B = false and the LSB seq index + number of samples exceeds the capacity of the 7-bit field (i.e. > 0x7F), then the most significant bits of the 14 bit sequence counter should be considered to be implicitly incremented by the overflow.
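The implicit carry into the most significant bits can be sketched as follows (a hedged illustration; the function name and helper constant are ours, not part of the extension):

```python
SEQ_MOD = 1 << 14  # the full sequence index is a 14-bit counter

def advance_index(full_index, num_samples):
    """Advance the 14-bit sequence index past the samples in one message.

    When the 7 LSBs overflow (exceed 0x7F), the carry implicitly
    propagates into the 7 MSBs; the whole counter wraps at 2^14.
    """
    return (full_index + num_samples) % SEQ_MOD

# The LSB field is 0x7D; three samples push it past 0x7F, so the
# MSBs are implicitly incremented by the overflow.
idx = (5 << 7) | 0x7D            # MSBs = 5, LSBs = 0x7D
new_idx = advance_index(idx, 3)
assert new_idx >> 7 == 6 and new_idx & 0x7F == 0x00
```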
Delta-frames may be encoded as “droppable” or “non-droppable”. Consider for example temporal layering using the L1T3 mode. In that scenario, key-frames and all T0 frames are non-droppable, while all T1 and T2 frames are droppable.
For non-droppable frames, B MAY be set to true even though there is often little utility for it. For droppable frames B MUST NOT be set to true, since a receiver could otherwise easily end up out of sync with the sender.
A receiver MUST store state containing the last sequence index used. If an ACD extension is received with B = false but the LSBs do not match the last known sequence index state, this indicates that an instrumented frame has been dropped. The receiver SHOULD recover from this by incrementing the last known sequence index until the 7 least significant bits match.
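The recovery procedure described above can be sketched as (illustrative naming, not normative):

```python
def resync_index(last_index, received_lsb):
    """Advance the stored 14-bit sequence index until its 7 least
    significant bits match the LSBs received in a B = false message."""
    index = last_index
    while index & 0x7F != received_lsb:
        index = (index + 1) % (1 << 14)
    return index

# The receiver last saw index 0x0085 (LSBs 0x05); a message arrives
# with LSBs 0x09, meaning four instrumented samples were missed.
assert resync_index(0x0085, 0x09) == 0x0089
```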
Because of this, the sender MUST send ACD messages on non-droppable frames such that the delta between their sequence indexes (from the last sample of the previous message to the first sample of the next) does not exceed 0x7F. A synchronization message may be used for this purpose if there is no wish to instrument the non-droppable frame.
It is not required to add the ACD extension to every frame. Indeed, for performance reasons it may be reasonable to instrument only a small subset of frames, for example a single frame per second.
Additionally, when encoding a structure that has independent decode targets (e.g. L3T3_KEY), the sender should generate an independent ACD sequence per target resolution so that a receiver can validate the state of the sub-stream it receives.
// TODO: Add concrete examples.
As mentioned above, a Halton Sequence is used to generate sampling coordinates. Base 2 is used for selecting the rows, and base 3 is used for selecting columns.
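A Halton value for a given index and base can be computed with the standard radical-inverse construction; a minimal sketch:

```python
def halton(index, base):
    """Return the Halton radical-inverse of `index` in `base`, in [0, 1)."""
    result = 0.0
    fraction = 1.0 / base
    i = index
    while i > 0:
        result += fraction * (i % base)
        i //= base
        fraction /= base
    return result

# The base-2 sequence starts 0, 1/2, 1/4, 3/4, 1/8, ...
assert [halton(i, 2) for i in range(4)] == [0.0, 0.5, 0.25, 0.75]
```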
Each sample in the ACD extension represents a single image sample, meaning it belongs to a single channel rather than e.g. being an RGB pixel.
The initial version of the ACD extension supports only the I420 chroma subsampling format. When determining which plane a location belongs to, it is easiest to visualize it as the chroma planes being “stacked” to the side of the luma plane:
+------+---+
|      | U |
+  Y   +---+
|      | V |
+------+---+
In pseudo code:
row = floor(GetHaltonSequence(seq_index, /*base=*/2) * image_height);
col = floor(GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5);
if (col < image_width) {
  HandleSample(Y_PLANE, row, col);
} else if (row < image_height / 2) {
  HandleSample(U_PLANE, row, col - image_width);
} else {
  HandleSample(V_PLANE, row - (image_height / 2), col - image_width);
}
seq_index++;
Support for other layout types may be added in later versions of this extension.
Note that the image dimensions are not explicitly a part of the ACD extension - that has to be inferred from the raw image itself.
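Putting the pieces together, the location mapping above might look like this in runnable form (a sketch assuming I420 and the stacked-plane layout; the function names are illustrative):

```python
def halton(index, base):
    """Halton radical-inverse of `index` in `base`, in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += fraction * (index % base)
        index //= base
        fraction /= base
    return result

def sample_location(seq_index, width, height):
    """Map a sequence index to (plane, row, col) for an I420 frame,
    with the U and V planes "stacked" to the right of the Y plane."""
    row = int(halton(seq_index, 2) * height)
    col = int(halton(seq_index, 3) * width * 1.5)
    if col < width:
        return ("Y", row, col)
    if row < height // 2:
        return ("U", row, col - width)
    return ("V", row - height // 2, col - width)

# For a 16x16 frame, index 5 lands in the bottom half of the chroma
# area, i.e. the V plane:
assert sample_location(5, 16, 16) == ("V", 2, 2)
```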
As mentioned above, when filtering a sample we create a weighted average around the desired location. Only samples in the same plane are considered. The weighting consists of a 2D Gaussian centered on the desired location, with the standard deviation specified in the ACD extension header.
If the standard deviation is specified as 0.0, we consider only the single sample at the desired location. Otherwise, we first determine a cutoff distance beyond which the weights are considered too small to matter. For now, we have set the weight cutoff to 0.2, meaning the maximum distance from the center sample we need to consider is max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2)) - 1.
Any samples outside the plane are considered to have weight 0.
In pseudo-code, that means we get the following:
sample_sum = 0;
weight_sum = 0;
for (y = max(0, row - max_d) to min(plane_height - 1, row + max_d)) {
  for (x = max(0, col - max_d) to min(plane_width - 1, col + max_d)) {
    weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2));
    sample_sum += SampleAt(x, y) * weight;
    weight_sum += weight;
  }
}
filtered_sample = sample_sum / weight_sum;
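The same filtering loop can be expressed as runnable code (a sketch; the plane is represented as a simple list of rows, and the 0.2 weight cutoff follows the text above):

```python
import math

def filtered_sample(plane, row, col, stddev, cutoff=0.2):
    """Gaussian-weighted average around (row, col) within one plane.
    Samples outside the plane have weight 0, i.e. they are skipped."""
    if stddev == 0.0:
        return plane[row][col]
    # Maximum distance at which weights are still considered relevant.
    max_d = math.ceil(math.sqrt(-2.0 * math.log(cutoff) * stddev ** 2)) - 1
    height, width = len(plane), len(plane[0])
    sample_sum = 0.0
    weight_sum = 0.0
    for y in range(max(0, row - max_d), min(height - 1, row + max_d) + 1):
        for x in range(max(0, col - max_d), min(width - 1, col + max_d) + 1):
            weight = math.exp(-((y - row) ** 2 + (x - col) ** 2)
                              / (2.0 * stddev ** 2))
            sample_sum += plane[y][x] * weight
            weight_sum += weight
    return sample_sum / weight_sum

# A uniform plane filters to the same value regardless of stddev:
plane = [[10.0] * 5 for _ in range(5)]
assert abs(filtered_sample(plane, 2, 2, 1.0) - 10.0) < 1e-9
```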
When a frame has been decoded and an ACD message is present, the receiver performs the following steps:

1. Update the tracked sequence index based on the B flag and the index field in the message.
2. Derive the sample locations from the Halton Sequence, exactly as the sender did.
3. Calculate the filtered samples at those locations in the decoded frame, using the standard deviation from the message.
We then need to compare the actual samples present in the ACD message and the samples generated from the locally decoded frame, and take the allowed error into account:
for (i = 0 to num_samples - 1) {
  // Allowed error from the ACD message, depending on which plane sample i is in.
  allowed_error = SampleType(i) == Y_PLANE ? Y_ERR : UV_ERR;
  delta(i) = max(0, abs(RemoteSample(i) - LocalSample(i)) - allowed_error);
}
It is then up to the receiver how to interpret these deltas. A suggested method is to calculate a “corruption score” as sum(delta(i)^2), where delta(i) is the delta for the i:th sample in the message, and then scale and cap that result to a maximum of 1.0. By squaring the deltas, we make sure that even singular samples that are far outside their expected values cause a noticeable shift in the score. Another possibility is to calculate the distance and cap it using a sigmoid function.
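As an illustration of the suggested squared-delta scoring (the scale factor here is an arbitrary example, not something mandated by the extension):

```python
def corruption_score(deltas, scale=0.005):
    """Sum of squared per-sample deltas, scaled and capped at 1.0.
    `scale` is an illustrative tuning knob, not part of the spec."""
    return min(1.0, scale * sum(d * d for d in deltas))

# All samples within the allowed error -> no corruption signal.
assert corruption_score([0, 0, 0]) == 0.0
# A single sample far outside its expected value saturates the score.
assert corruption_score([0, 0, 76]) == 1.0
```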
This extension message format does not make recommendations about what a receiver should do with the corruption scores, but some possibilities are:
It is up to the sender to estimate how large the filter kernel and the allowed error thresholds should be.
One method to do this is to analyze example outputs from different encoders and map the average frame QP to suitable settings. There will of course have to be a different such mapping for e.g. AV1 compared to VP8, but it is also possible to get “tighter” values with knowledge of the exact implementation used, e.g. a mapping designed just for libaom encoder version X running with speed setting Y.
Another method is to use the actual reconstructor state from the encoder. That of course means the encoder has to expose that state, which is not common. A benefit of doing it that way is that the filter size and allowed error can be very small (really only post-processing could introduce distortions in that scenario). A drawback is if the reconstructed state already contains corruption due to an encoder bug - then we would not be able to detect that corruption at all.
There are also possibly more accurate but probably much more costly alternatives as well, such as training an ML model to determine the settings based on both the content of the source frame and any metadata present in the encoded bitstream.
Regardless of method, the implementation at the send side SHOULD strive to set the filter size and error thresholds such that 99.5% of filtered samples end up with a delta <= the error threshold for that plane, based on a representative set of test clips and bandwidth constraints.
Notes: The extension MUST NOT be present in more than one packet per video frame.