| # Corruption Detection |
| |
| **Name:** |
| "Corruption Detection"; "Extension for Automatic Detection of Video Corruptions" |
| |
| **Formal name:** |
| <http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection> |
| |
| **Status:** This extension is defined here to allow for experimentation. |
| |
| **Contact:** <sprang@google.com> |
| |
| NOTE: This explainer is a work in progress and may change without notice. |
| |
| The Corruption Detection (sometimes referred to as automatic corruption |
| detection or ACD) extension is intended to be a part of a system that allows |
| estimating a likelihood that a video transmission is in a valid state. That is, |
| the input to the video encoder on the send side corresponds to the output of the |
| video decoder on the receive side with the only difference being the expected |
| distortions from lossy compression. |
| |
| The goal is to be able to detect outright coding errors caused by things such as |
| bugs in encoder/decoders, malformed packetization data, incorrect relay |
| decisions in SFU-type servers, incorrect handling of packet loss/reordering, and |
| so forth. We want to accomplish this with a high signal-to-noise ratio while |
| consuming a minimum of resources in terms of bandwidth and/or computation. It |
| should be noted that it is _not_ a goal to be able to e.g. gauge general video |
| quality using this method. |
| |
| This explainer contains two parts: |
| |
| 1) A definition of the RTP header extension itself and how it is to be parsed. |
| 2) The intended usage and implementation details for a WebRTC sender and |
| receiver respectively. |
| |
| If this extension has been negotiated, all the client behavior outlined in this |
| doc MUST be adhered to. |
| |
| ## RTP Header Extension Format |
| |
| ### Data Layout Overview |
| |
| The message format of the header extension: |
| |
| 0 1 2 3 |
| 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| |B| seq index | std dev | Y err | UV err| sample 0 | |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | sample 1 | sample 2 | … up to sample <=12 |
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| |
| ### Data Layout Details |
| |
| * B (1 bit): If the sequence number should be interpreted as the MSB or LSB |
| of the full size 14 bit sequence index described in the next point. |
| * seq index (7 bits): The index into the Halton sequence (used to locate |
| where the samples should be drawn from). |
| * If B is set: the 7 most significant bits of the true index. The 7 least |
| significant bits of the true index shall be interpreted as 0. This is |
| because this is the point where we can guarantee that the sender and |
| receiver has the same full index. B MUST be set on keyframes. On droppable |
| frames B MUST NOT be set. |
| * If B is not set: The 7 LSB of the true index. The 7 most significant bits |
| should be inferred based on the most recent message. |
| * std dev (8 bits): The standard deviation of the Gaussian filter used |
| to weigh the samples. The value is scaled using a linear map: |
| 0 = 0.0 to 255 = 40.0. A std dev of 0 is interpreted as directly using |
| just the sample value at the desired coordinate, without any weighting. |
| * Y err (4 bits): The allowed error for the luma channel. |
| * UV err (4 bits): The allowed error for the chroma channels. |
| * Sample N (8 bits): The N:th filtered sample from the input image. Each |
| sample represents a new point in one of the image planes, the plane and |
| coordinates being determined by index into the Halton sequence (starting at |
| seq# index and is incremented by one for each sample). Each sample has gone |
| through a Gaussian filter with the std dev specified above. The samples |
| have been floored to the nearest integer. |
| |
| A special case is the so-called "synchronization" message. Such a message |
| only contains the first byte. They are used to keep the sender and receiver in |
| sync even if no "full" message has been received for a while. Such messages |
| MUST NOT be sent on droppable frames. |
| |
| ### A note on encryption |
| |
| Privacy and security are core parts of nearly every WebRTC-based application, |
| which means that some sort of encryption needs to be present. The most common |
| form of encryption is SRTP, defined in RFC 3711. However, as mentioned in |
| section 9.4 of that RFC, RTP header extensions are considered part of the header |
| and are thus not encrypted. |
| |
| The automatic corruption detection header extension is different from most other |
| header extensions in that it provides not only metadata about the media stream |
| being transmitted but in practice comprises an extremely sparse representation |
| of the actual video stream itself. Given a static scene and enough time, a crude |
| image of the encrypted video can rather trivially be constructed. |
| |
| As such, most applications should use this extension with SRTP only if |
| additional security is present to protect it. That could be for example in the |
| form of explicit header extension encryption provided by RFC 6904/RFC 9335, or |
| by encapsulating the entire RTP stream in an additional layer such as IPSec. |
| |
| ## Usage & Guidelines |
| |
| In this section we’ll first look at a general overview of the intended usage of |
| this header extensions, followed by more details around the expected |
| implementation. |
| |
| ### Overview |
| |
| The premise of the extension described here is that we can validate the state of |
| the video pipeline by quasi-randomly selecting a few samples from the raw input |
| frame to an encoder, and then checking them against the output of a decoder. |
| Assuming that a lossless codec is used we can follow these steps: |
| |
| 1) In an image that is to be encoded, quasi-randomly select N sampling positions |
| and store the samples values for those positions from the raw input image. |
| 2) Encode the image, and attach the selected sample values to the RTP packets |
| containing the encoded bitstream of that image. |
| 3) Transmit the RTP packets to a remote receiver. |
| 4) At the receiver, collect the attached sample values from the RTP packets when |
| assembling the frame, and then pass the bitstream to a decoder. |
| 5) Using the same quasi-random sequence as in (1), calculate the corresponding N |
| sampling positions. |
| 6) Take the output of the decoder and check the values of the samples from the |
| RTP packets. If they differ significantly, it is likely that an image |
| corruption has occurred. |
| |
| Lossless encoding is however rarely used in practice, and that introduces |
| problems for the above algorithm. |
| |
| * Quantization causes values to be different from the desired value. |
| * Whole blocks of pixels might be shifted somewhat due to inaccuracies in motion |
| vectors. |
| * Inaccuracies caused by in-loop or post-process filtering. |
| * etc. |
| |
| We must therefore take these distortions into consideration, as they are merely |
| a natural side-effect of the compression and their effect is not to be |
| considered an “invalid state”. We aim to accomplish this using two tools. |
| |
| First, instead of a sample being a single raw sample value let it be a filtered |
| one: a weighted average of samples in the vicinity of the desired location, with |
| the weights being a 2D Gaussian centered at that location and the variance |
| adjusted depending on the magnitude of the expected distortions |
| (higher distortion => higher variance). This smoothes out inaccuracies caused by |
| both quantization and motion compensation. |
| |
| Secondly, even with a very large filter kernel the new sample might not converge |
| towards the exact desired value. For that reason, set an “allowed error |
| threshold” that removes small magnitude differences. Since chroma and luma |
| channels have different scales, separate error thresholds are available for |
| them. |
| |
| ### Sequence Index Handling |
| |
| The quasi-random sequence of choice for this extension is a 2D |
| [Halton Sequence](https://en.wikipedia.org/wiki/Halton_sequence). |
| |
| The index into the Halton Sequence is indicated by the header extension and |
| results in a 14 bit unsigned integer which on overflow will wrap around back to |
| 0. |
| |
| For each sample contained within the extension, the sequence index should be |
| considered to be incremented by one. Thus the sequence index at the start of the |
| header should be considered “the sequence index for the next sample to be |
| drawn”. |
| |
| The ACD extension may be sent containing either the 7 most significant bits |
| (B = true) or the 7 least significant bits (B = false) of the sequence index. |
| |
| Key-frames MUST be populated with the ACD extension, and those MUST use B = true |
| indicating only the 7 most significant bits are transmitted. |
| |
| The sender may choose any arbitrary starting point. The biggest reason to not |
| always start with (B = true, seq index = 0) is that with frequent/periodic |
| keyframes you might end up always sampling the same small subset of image |
| locations over and over. |
| |
| If B = false and the LSB seq index + number of samples exceeds the capacity of |
| the 7-bit field (i.e. > 0x7F), then the most significant bits of the 14 bit |
| sequence counter should be considered to be implicitly incremented by the |
| overflow. |
| |
| Delta-frames may be encoded as “droppable” or “non-droppable”. Consider for |
| example temporal layering using the |
| [L1T3](https://www.w3.org/TR/webrtc-svc/#L1T3*) mode. In that scenario, |
| key-frames and all T0 frames are non-droppable, while all T1 and T2 frames are |
| droppable. |
| |
| For non-droppable frames, B MAY be set to true even though there is often little |
| utility for it. |
| For droppable frames B MUST NOT be set to true, since a receiver could otherwise |
| easily end up out of sync with the sender. |
| |
| A receiver must store a state containing the last sequence index used. If an ACD |
| extension is receiver with B = false but the LSB does not match the last known |
| sequence index state, this indicates that an instrumented frame has been |
| dropped. The receiver should recover from this by incrementing the last known |
| sequence index until the 7 least significant bits match. |
| |
| Because of this, the sender MUST send ACD messages on non-droppable frames such |
| that the delta between their sequence indexing (from the last sample of the |
| previous packet to the first of the next) indexing does not exceed 0x7F. A |
| synchronization message may be used for this purpose if there is no wish to |
| instrument the non-droppable frame. |
| |
| It is not required to add the ACD extension to every frame. Indeed, for |
| performance reasons it may be reasonable to only instrument a small subset of |
| frames, for example using only one frame per second. |
| |
| Additionally, when encoding a structure that has independent decode targets |
| (e.g. L3T3_KEY) - the sender should generate an independent stream ACD sequence |
| per target resolution so that a receiver can validate the state of the |
| sub-stream they receive. |
| |
| // TODO: Add concrete examples. |
| |
| ### Sample Selection |
| |
| As mentioned above, a Halton Sequence is used to generate sampling coordinates. |
| Base 2 is used for selecting the rows, and base 3 is used for selecting columns. |
| |
| Each sample in the ACD extension represents a single image sample, meaning it |
| belongs to a single channel rather than e.g. being an RGB pixel. |
| |
| The initial version of the ACD extension supports only the I420 chroma |
| subsampling format. When determining which plane a location belongs to, it is |
| easiest to visualize it as the chroma planes being “stacked” to the side of the |
| luma plane: |
| |
| +------+---+ |
| | | U | |
| + Y +---+ |
| | | V | |
| +------+---+ |
| |
| In pseudo code: |
| ``` |
| row = GetHaltonSequence(seq_index, /*base=*/2) * image_height; |
| col = GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5; |
| |
| if (col < image_width) { |
| HandleSample(Y_PLANE, row, col); |
| } else if (row < image_height / 2) { |
| HandleSample(U_PLANE, row, col - image_width); |
| } else { |
| HandleSample(V_PLANE, row - (image_height / 2), col - image_width); |
| } |
| |
| seq_index++; |
| ``` |
| Support for other layout types may be added in later versions of this extension. |
| |
| Note that the image dimensions are not explicitly a part of the ACD extension - |
| that has to be inferred from the raw image itself. |
| |
| ### Sample Filtering |
| |
| As mentioned above, when filtering a sample we create a weighted average around |
| the desired location. Only samples in the same plane are considered. The |
| weighting consists of a 2D Gaussian centered on the desired location, with the |
| standard deviation specified in the ACD extension header. |
| |
| If the standard deviation is specified as 0.0 - we consider only a singular |
| sample. Otherwise, we first determine a cutoff distance below which the weights |
| are considered too small to matter. For now, we have set the weight cutoff to |
| 0.2 - meaning the maximum distance from the center sample we need to consider is |
| max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2) - 1. |
| |
| Any samples outside the plane are considered to have weight 0. |
| |
| In pseudo-code, that means we get the following: |
| ``` |
| sample_sum = 0; |
| weight_sum = 0; |
| for (y = max(0, row - max_d) to min(plane_height, row + max_d) { |
| for (x = max(0, col - max_d) to min(plane_width, col + max_d) { |
| weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2)); |
| sample_sum += SampleAt(x, y) * weight; |
| weight_sum += weight; |
| } |
| } |
| filtered_sample = sample_sum / weight_sum; |
| ``` |
| ### Receive Side Considerations |
| |
| When a frame has been decoded and an ACD message is present, the receiver |
| performs the following steps: |
| |
| * Update the sequence index so that it is consistent with the ACD message. |
| * Calculate the sample positions from the Halton sequence. |
| * Filter each sample of the decoded image using the standard deviation provided |
| in the ACD message. |
| |
| We then need to compare the actual samples present in the ACD message and the |
| samples generated from the locally decoded frame, and take the allowed error |
| into account: |
| |
| ``` |
| for (i = 0 to num_samples) { |
| // Allowed error from ACD message, depending on which plane sample i is in. |
| allowed_error = SampleType(i) == Y_PLANE ? Y_ERR : UV_ERR; |
| delta_i = max(0, abs(RemoteSample(i) - LocalSample(i)) - allowed_error); |
| } |
| ``` |
| |
| It is then up to the receiver how to interpret these deltas. A suggested method |
| is to calculate a “corruption score” by calculating sum(delta(i)^2), where |
| delta(i) is the delta for i:th sample in the message, and then scaling and |
| capping that result to a maximum of 1.0. By squaring the sample, we make sure |
| that even singular samples that are way outside their expected values cause a |
| noticeable shift in the score. Another possible way is to calculate the distance |
| and cap it using a sigmoid function. |
| |
| This extension message format does not make recommendations about what a |
| receiver should do with the corruption scores, but some possibilities are: |
| |
| * Expose it as a statistics connected to the video receive stream. Let the |
| application decide what to do with the information. |
| * Let the WebRTC application use a corruption signal to take proactive measures. |
| E.g. request a key-frame in order to recover, or try to switch to another |
| codec type or implementation. |
| |
| ### Determining Filter Settings & Error Thresholds |
| |
| It is up to the sender to estimate how large the filter kernel and the allowed |
| error thresholds should be. |
| |
| One method to do this is to analyze example outputs from different encoders and |
| map the average frame QP to suitable settings. There will of course have to be |
| different such mapping for e.g. AV1 compared to VP8 - but it’s also possible to |
| get “tighter” values with knowledge of the exact implementation used. E.g. a |
| mapping designed just for libaom encoder version X running with speed setting Y. |
| |
| Another method is to use the actual reconstructor state from the encoder. That |
| of course means the encoder has to expose that state, which is not common. |
| A benefit of doing it that way is that the filter size and allowed error can be |
| very small (really only post-processing could introduce distortions in that |
| scenario). A drawback is if the reconstructed state already contains corruption |
| due to an encoder bug - then we would not be able to detect that corruption at |
| all. |
| |
| There are also possibly more accurate but probably much more costly alternatives |
| as well, such as training an ML model to determine the settings based on both |
| the content of the source frame and any metadata present in the encoded |
| bitstream. |
| |
| Regardless of method, the implementation at the send side SHOULD strive to set |
| the filter size and error thresholds such that 99.5% of filtered samples end up |
| with a delta <= the error threshold for that plane, based on a representative |
| set of test clips and bandwidth constraints. |
| |
| Notes: The extension must not be present in more than 1 packet per video frame. |