Update explainer text for corruption detection header extension.
Bug: webrtc:358039777
Change-Id: I6a1cffc2a5797d154bfecb50c60b4c05d4943426
Reviewed-on: https://webrtc-review.googlesource.com/c/src/+/360661
Commit-Queue: Erik Språng <sprang@webrtc.org>
Auto-Submit: Erik Språng <sprang@webrtc.org>
Reviewed-by: Fanny Linderborg <linderborg@webrtc.org>
Cr-Commit-Position: refs/heads/main@{#42862}
diff --git a/docs/native-code/rtp-hdrext/README.md b/docs/native-code/rtp-hdrext/README.md
index 081a727..bde7349 100644
--- a/docs/native-code/rtp-hdrext/README.md
+++ b/docs/native-code/rtp-hdrext/README.md
@@ -1,13 +1,15 @@
# Experimental RTP header extensions
-The following subpages define experiemental RTP header extensions:
+The following subpages define experimental RTP header extensions:
- * [abs-send-time](abs-send-time/README.md)
* [abs-capture-time](abs-capture-time/README.md)
+ * [abs-send-time](abs-send-time/README.md)
* [color-space](color-space/README.md)
+ * [corruption-detection](corruption-detection/README.md)
+ * [inband-cn](inband-cn/README.md)
* [playout-delay](playout-delay/README.md)
* [transport-wide-cc-02](transport-wide-cc-02/README.md)
* [video-content-type](video-content-type/README.md)
+ * [video-frame-tracking-id](video-frame-tracking-id/README.md)
+ * [video-layers-allocation00](video-layers-allocation00/README.md)
* [video-timing](video-timing/README.md)
- * [inband-cn](inband-cn/README.md)
- * [video-layers-allocation00](video-layes-allocation00/README.md)
diff --git a/docs/native-code/rtp-hdrext/corruption-detection/README.md b/docs/native-code/rtp-hdrext/corruption-detection/README.md
new file mode 100644
index 0000000..187a102
--- /dev/null
+++ b/docs/native-code/rtp-hdrext/corruption-detection/README.md
@@ -0,0 +1,349 @@
+# Corruption Detection
+
+** Name: **
+"Corruption Detection"; "Extension for Automatic Detection of Video Corruptions"
+** Formal name: **
+<http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection>
+** Status: ** This extension is defined here to allow for experimentation.
+** Contact: ** <sprang@google.com>
+
+NOTE: This explainer is a work in progress and may change without notice.
+
+The Corruption Detection (sometimes referred to as automatic corruption
+detection or ACD) extension is intended to be a part of a system that allows
+estimating a likelihood that a video transmission is in a valid state. That is,
+the input to the video encoder on the send side corresponds to the output of the
+video decoder on the receive side with the only difference being the expected
+distortions from lossy compression.
+
+The goal is to be able to detect outright coding errors caused by things such as
+bugs in encoder/decoders, malformed packetization data, incorrect relay
+decisions in SFU-type servers, incorrect handling of packet loss/reordering, and
+so forth. We want to accomplish this with a high signal-to-noise ratio while
+consuming a minimum of resources in terms of bandwidth and/or computation. It
+should be noted that it is _not_ a goal to be able to e.g. gauge general video
+quality using this method.
+
+This explainer contains two parts:
+
+1) A definition of the RTP header extension itself and how it is to be parsed.
+2) The intended usage and implementation details for a WebRTC sender and
+ receiver respectively.
+
+If this extension has been negotiated, all the client behavior outlined in this
+doc MUST be adhered to.
+
+## RTP Header Extension Format
+
+### Data Layout Overview
+
+The message format of the header extension:
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |B| seq index | std dev | Y err | UV err| sample 0 |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | sample 1 | sample 2 | … up to sample <=12
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+### Data Layout Details
+
+* B (1 bit): Whether the seq index field should be interpreted as the MSB or
+ LSB of the full size 14 bit sequence index described in the next point.
+* seq index (7 bits): The index into the Halton sequence (used to locate
+ where the samples should be drawn from).
+ * If B is set: the 7 most significant bits of the true index. The 7 least
+ significant bits of the true index shall be interpreted as 0. This is
+ because this is the point where we can guarantee that the sender and
+ receiver have the same full index. B MUST be set on keyframes. On droppable
+ frames B MUST NOT be set.
+ * If B is not set: The 7 LSB of the true index. The 7 most significant bits
+ should be inferred based on the most recent message.
+* std dev (8 bits): The standard deviation of the Gaussian filter used
+ to weigh the samples. The value is scaled using a linear map:
+ 0 = 0.0 to 255 = 40.0. A std dev of 0 is interpreted as directly using
+ just the sample value at the desired coordinate, without any weighting.
+* Y err (4 bits): The allowed error for the luma channel.
+* UV err (4 bits): The allowed error for the chroma channels.
+* Sample N (8 bits): The N:th filtered sample from the input image. Each
+ sample represents a new point in one of the image planes, the plane and
+ coordinates being determined by the index into the Halton sequence (starting
+ at seq index and incremented by one for each sample). Each sample has gone
+ through a Gaussian filter with the std dev specified above. The samples
+ have been rounded down (floored) to an integer.
+
+A special case is the so-called "synchronization" message. Such a message
+contains only the first byte. It is used to keep the sender and receiver in
+sync even if no "full" message has been received for a while. Synchronization
+messages MUST NOT be sent on droppable frames.
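The layout above can be sketched as a small parser. This is a minimal illustration and not the WebRTC implementation: the `AcdMessage` struct, the `ParseAcd` function and the size limits (3 header bytes plus samples 0 to 12, read from the diagram) are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for a parsed message (not the WebRTC class).
struct AcdMessage {
  bool interpret_as_msb = false;  // B bit.
  uint8_t seq_index_field = 0;    // Raw 7-bit seq index field.
  double std_dev = 0.0;           // Linear map of [0, 255] onto [0.0, 40.0].
  uint8_t luma_error = 0;         // Y err, 4 bits.
  uint8_t chroma_error = 0;       // UV err, 4 bits.
  std::vector<uint8_t> samples;   // Empty for synchronization messages.
};

// Assumed limits: 3 header bytes plus samples 0..12 from the diagram.
constexpr std::size_t kHeaderBytes = 3;
constexpr std::size_t kMaxSamples = 13;

// Returns false if the size fits neither a synchronization message (1 byte)
// nor a full message (3 header bytes followed by 0 to 13 samples).
bool ParseAcd(const uint8_t* data, std::size_t size, AcdMessage* message) {
  assert(message != nullptr);
  if (size == 0 || size == 2 || size > kHeaderBytes + kMaxSamples) {
    return false;
  }
  *message = AcdMessage();
  // Bit 0 in the diagram is the most significant bit of the first byte.
  message->interpret_as_msb = (data[0] & 0x80) != 0;
  message->seq_index_field = data[0] & 0x7F;
  if (size == 1) {
    return true;  // Synchronization message: first byte only.
  }
  message->std_dev = data[1] * 40.0 / 255.0;
  message->luma_error = data[2] >> 4;
  message->chroma_error = data[2] & 0x0F;
  message->samples.assign(data + kHeaderBytes, data + size);
  return true;
}
```

Whether a 3-byte message without samples should be accepted is not stated in this explainer; the sketch accepts it.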
+
+### A note on encryption
+
+Privacy and security are core parts of nearly every WebRTC-based application,
+which means that some sort of encryption needs to be present. The most common
+form of encryption is SRTP, defined in RFC 3711. However, as mentioned in
+section 9.4 of that RFC, RTP header extensions are considered part of the header
+and are thus not encrypted.
+
+The automatic corruption detection header extension is different from most other
+header extensions in that it provides not only metadata about the media stream
+being transmitted but in practice comprises an extremely sparse representation
+of the actual video stream itself. Given a static scene and enough time, a crude
+image of the encrypted video can rather trivially be constructed.
+
+As such, most applications should use this extension with SRTP only if
+additional security is present to protect it. That could be for example in the
+form of explicit header extension encryption provided by RFC 6904/RFC 9335, or
+by encapsulating the entire RTP stream in an additional layer such as IPSec.
+
+## Usage & Guidelines
+
+In this section we’ll first look at a general overview of the intended usage of
+this header extension, followed by more details around the expected
+implementation.
+
+### Overview
+
+The premise of the extension described here is that we can validate the state of
+the video pipeline by quasi-randomly selecting a few samples from the raw input
+frame to an encoder, and then checking them against the output of a decoder.
+Assuming that a lossless codec is used we can follow these steps:
+
+1) In an image that is to be encoded, quasi-randomly select N sampling positions
+ and store the sample values for those positions from the raw input image.
+2) Encode the image, and attach the selected sample values to the RTP packets
+ containing the encoded bitstream of that image.
+3) Transmit the RTP packets to a remote receiver.
+4) At the receiver, collect the attached sample values from the RTP packets when
+ assembling the frame, and then pass the bitstream to a decoder.
+5) Using the same quasi-random sequence as in (1), calculate the corresponding N
+ sampling positions.
+6) Take the output of the decoder and compare its values at those positions
+ with the sample values from the RTP packets. If they differ significantly, it
+ is likely that an image corruption has occurred.
+
+Lossless encoding is however rarely used in practice, and that introduces
+problems for the above algorithm.
+
+* Quantization causes values to be different from the desired value.
+* Whole blocks of pixels might be shifted somewhat due to inaccuracies in motion
+ vectors.
+* Inaccuracies caused by in-loop or post-process filtering.
+* etc.
+
+We must therefore take these distortions into consideration, as they are merely
+a natural side-effect of the compression and their effect is not to be
+considered an “invalid state”. We aim to accomplish this using two tools.
+
+First, instead of a sample being a single raw sample value let it be a filtered
+one: a weighted average of samples in the vicinity of the desired location, with
+the weights being a 2D Gaussian centered at that location and the variance
+adjusted depending on the magnitude of the expected distortions
+(higher distortion => higher variance). This smoothes out inaccuracies caused by
+both quantization and motion compensation.
+
+Secondly, even with a very large filter kernel the new sample might not converge
+towards the exact desired value. For that reason, set an “allowed error
+threshold” that removes small magnitude differences. Since chroma and luma
+channels have different scales, separate error thresholds are available for
+them.
+
+### Sequence Index Handling
+
+The quasi-random sequence of choice for this extension is a 2D
+[Halton Sequence](https://en.wikipedia.org/wiki/Halton_sequence).
+
+The index into the Halton Sequence is indicated by the header extension and
+results in a 14 bit unsigned integer which on overflow will wrap around back to
+0.
+
+For each sample contained within the extension, the sequence index should be
+considered to be incremented by one. Thus the sequence index at the start of the
+header should be considered “the sequence index for the next sample to be
+drawn”.
+
+The ACD extension may be sent containing either the 7 most significant bits
+(B = true) or the 7 least significant bits (B = false) of the sequence index.
+
+Key-frames MUST be populated with the ACD extension, and those MUST use B = true
+indicating only the 7 most significant bits are transmitted.
+
+The sender may choose any arbitrary starting point. The biggest reason to not
+always start with (B = true, seq index = 0) is that with frequent/periodic
+keyframes you might end up always sampling the same small subset of image
+locations over and over.
+
+If B = false and the LSB seq index + number of samples exceeds the capacity of
+the 7-bit field (i.e. > 0x7F), then the most significant bits of the 14 bit
+sequence counter should be considered to be implicitly incremented by the
+overflow.
+
+Delta-frames may be encoded as “droppable” or “non-droppable”. Consider for
+example temporal layering using the
+[L1T3](https://www.w3.org/TR/webrtc-svc/#L1T3*) mode. In that scenario,
+key-frames and all T0 frames are non-droppable, while all T1 and T2 frames are
+droppable.
+
+For non-droppable frames, B MAY be set to true even though there is often little
+utility for it.
+For droppable frames B MUST NOT be set to true, since a receiver could otherwise
+easily end up out of sync with the sender.
+
+A receiver must store a state containing the last sequence index used. If an ACD
+extension is received with B = false but the LSB does not match the last known
+sequence index state, this indicates that an instrumented frame has been
+dropped. The receiver should recover from this by incrementing the last known
+sequence index until the 7 least significant bits match.
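The receiver-side index handling described above can be sketched as follows. The function names are hypothetical, and `last_index` is assumed to hold "the sequence index for the next sample to be drawn" as left by the previous message.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t kIndexMask = 0x3FFF;  // The full index is 14 bits wide.

// Returns the full 14-bit index of the first sample in a received message.
// `field` is the 7-bit seq index field and `b` is the B bit.
uint16_t ResolveSequenceIndex(uint16_t last_index, bool b, uint8_t field) {
  field &= 0x7F;
  if (b) {
    // The field carries the 7 MSB; the 7 LSB are interpreted as 0.
    return static_cast<uint16_t>(field << 7);
  }
  // The field carries the 7 LSB. Advance to the first index at or after
  // last_index whose 7 LSB match, which also covers dropped messages.
  unsigned msb = last_index >> 7;
  if (field < (last_index & 0x7F)) {
    ++msb;  // The LSB wrapped past 0x7F.
  }
  return static_cast<uint16_t>(((msb << 7) | field) & kIndexMask);
}

// Each sample in a message consumes one index; the index wraps at 14 bits.
uint16_t AdvanceBySamples(uint16_t index, int num_samples) {
  assert(num_samples >= 0);
  return static_cast<uint16_t>((index + num_samples) & kIndexMask);
}
```

Recovery of this kind is only unambiguous if the sender keeps the gap between instrumented non-droppable frames within 0x7F.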
+
+Because of this, the sender MUST send ACD messages on non-droppable frames such
+that the delta between their sequence indices (from the last sample of the
+previous packet to the first sample of the next) does not exceed 0x7F. A
+synchronization message may be used for this purpose if there is no wish to
+instrument the non-droppable frame.
+
+It is not required to add the ACD extension to every frame. Indeed, for
+performance reasons it may be reasonable to only instrument a small subset of
+frames, for example using only one frame per second.
+
+Additionally, when encoding a structure that has independent decode targets
+(e.g. L3T3_KEY), the sender should generate an independent ACD sequence per
+target resolution so that a receiver can validate the state of the sub-stream
+they receive.
+
+// TODO: Add concrete examples.
+
+### Sample Selection
+
+As mentioned above, a Halton Sequence is used to generate sampling coordinates.
+Base 2 is used for selecting the rows, and base 3 is used for selecting columns.
+
+Each sample in the ACD extension represents a single image sample, meaning it
+belongs to a single channel rather than e.g. being an RGB pixel.
+
+The initial version of the ACD extension supports only the I420 chroma
+subsampling format. When determining which plane a location belongs to, it is
+easiest to visualize it as the chroma planes being “stacked” to the side of the
+luma plane:
+
++------+---+
+| | U |
++ Y +---+
+| | V |
++------+---+
+
+In pseudo code:
+```
+ row = GetHaltonSequence(seq_index, /*base=*/2) * image_height;
+ col = GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5;
+
+ if (col < image_width) {
+ HandleSample(Y_PLANE, row, col);
+ } else if (row < image_height / 2) {
+ HandleSample(U_PLANE, row, col - image_width);
+ } else {
+ HandleSample(V_PLANE, row - (image_height / 2), col - image_width);
+ }
+
+ seq_index++;
+```
+Support for other layout types may be added in later versions of this extension.
+
+Note that the image dimensions are not explicitly a part of the ACD extension -
+that has to be inferred from the raw image itself.
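`GetHaltonSequence` is not defined in this explainer. A minimal sketch, assuming it is the standard radical inverse of the index in the given base (so that the result lies in [0, 1)), together with the plane selection from the pseudo code above. The `Plane` and `SamplePoint` types are hypothetical.

```cpp
#include <cassert>

// Radical inverse of `index` in `base`: the digits of the index are mirrored
// around the radix point, giving a value in [0, 1).
double GetHaltonSequence(int index, int base) {
  assert(index >= 0 && base >= 2);
  double result = 0.0;
  double fraction = 1.0;
  while (index > 0) {
    fraction /= base;
    result += fraction * (index % base);
    index /= base;
  }
  return result;
}

enum class Plane { kY, kU, kV };

struct SamplePoint {
  Plane plane;
  double row;  // Coordinates relative to the selected plane.
  double col;
};

// Maps a sequence index onto the I420 layout with the chroma planes stacked
// to the side of the luma plane.
SamplePoint SamplePosition(int seq_index, int image_width, int image_height) {
  const double row = GetHaltonSequence(seq_index, 2) * image_height;
  const double col = GetHaltonSequence(seq_index, 3) * image_width * 1.5;
  if (col < image_width) {
    return {Plane::kY, row, col};
  }
  if (row < image_height / 2.0) {
    return {Plane::kU, row, col - image_width};
  }
  return {Plane::kV, row - image_height / 2.0, col - image_width};
}
```

For a 640x480 frame, index 4 lands in the luma plane, index 8 in U and index 5 in V under these assumptions.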
+
+### Sample Filtering
+
+As mentioned above, when filtering a sample we create a weighted average around
+the desired location. Only samples in the same plane are considered. The
+weighting consists of a 2D Gaussian centered on the desired location, with the
+standard deviation specified in the ACD extension header.
+
+If the standard deviation is specified as 0.0 - we consider only a singular
+sample. Otherwise, we first determine a cutoff distance below which the weights
+are considered too small to matter. For now, we have set the weight cutoff to
+0.2 - meaning the maximum distance from the center sample we need to consider is
+max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2)) - 1.
+
+Any samples outside the plane are considered to have weight 0.
+
+In pseudo-code, that means we get the following:
+```
+ sample_sum = 0;
+ weight_sum = 0;
+ for (y = max(0, row - max_d) to min(plane_height - 1, row + max_d)) {
+ for (x = max(0, col - max_d) to min(plane_width - 1, col + max_d)) {
+ weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2));
+ sample_sum += SampleAt(x, y) * weight;
+ weight_sum += weight;
+ }
+ }
+ filtered_sample = sample_sum / weight_sum;
+```
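The filtering step can be written out in C++ roughly as follows. This is a sketch under stated assumptions, not the WebRTC implementation: the plane is taken to be a tightly packed 8-bit buffer, `row` and `col` are integer positions inside the plane, and the loop bounds are treated as inclusive.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr double kWeightCutoff = 0.2;

// Largest distance from the center at which the weight still matters.
int MaxDistance(double stddev) {
  const double d = std::sqrt(-2.0 * std::log(kWeightCutoff) * stddev * stddev);
  return static_cast<int>(std::ceil(d)) - 1;
}

// Gaussian-weighted average around (row, col). Positions outside the plane
// are skipped, which is equivalent to giving them weight 0.
double FilterSample(const std::vector<uint8_t>& plane, int width, int height,
                    int row, int col, double stddev) {
  assert(row >= 0 && row < height && col >= 0 && col < width);
  if (stddev == 0.0) {
    return plane[row * width + col];  // Single sample, no weighting.
  }
  const int max_d = MaxDistance(stddev);
  double sample_sum = 0.0;
  double weight_sum = 0.0;
  for (int y = std::max(0, row - max_d);
       y <= std::min(height - 1, row + max_d); ++y) {
    for (int x = std::max(0, col - max_d);
         x <= std::min(width - 1, col + max_d); ++x) {
      const double dist_sq = (y - row) * (y - row) + (x - col) * (x - col);
      const double weight = std::exp(-dist_sq / (2.0 * stddev * stddev));
      sample_sum += plane[y * width + x] * weight;
      weight_sum += weight;
    }
  }
  return sample_sum / weight_sum;
}
```

The center sample always has weight 1, so `weight_sum` is never zero.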
+### Receive Side Considerations
+
+When a frame has been decoded and an ACD message is present, the receiver
+performs the following steps:
+
+* Update the sequence index so that it is consistent with the ACD message.
+* Calculate the sample positions from the Halton sequence.
+* Filter each sample of the decoded image using the standard deviation provided
+ in the ACD message.
+
+We then need to compare the actual samples present in the ACD message and the
+samples generated from the locally decoded frame, and take the allowed error
+into account:
+
+```
+for (i = 0 to num_samples) {
+ // Allowed error from ACD message, depending on which plane sample i is in.
+ allowed_error = SampleType(i) == Y_PLANE ? Y_ERR : UV_ERR;
+ delta_i = max(0, abs(RemoteSample(i) - LocalSample(i)) - allowed_error);
+}
+```
+
+It is then up to the receiver how to interpret these deltas. A suggested method
+is to calculate a “corruption score” by calculating sum(delta(i)^2), where
+delta(i) is the delta for the i:th sample in the message, and then scaling and
+capping that result to a maximum of 1.0. By squaring the deltas, we make sure
+that even singular samples that are way outside their expected values cause a
+noticeable shift in the score. Another possible way is to calculate the distance
+and cap it using a sigmoid function.
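The suggested sum-of-squares score could look like the sketch below. The `scale` parameter is hypothetical, since this explainer does not specify a scaling factor, and the per-sample plane flag stands in for `SampleType(i)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes a corruption score in [0.0, 1.0] from the samples carried in the
// ACD message (`remote`) and the filtered samples of the decoded frame
// (`local`). `is_luma[i]` tells which allowed error applies to sample i.
double CorruptionScore(const std::vector<uint8_t>& remote,
                       const std::vector<double>& local,
                       const std::vector<bool>& is_luma,
                       int y_err, int uv_err,
                       double scale /* hypothetical tuning factor */) {
  assert(remote.size() == local.size() && remote.size() == is_luma.size());
  double sum_sq = 0.0;
  for (std::size_t i = 0; i < remote.size(); ++i) {
    const double allowed_error = is_luma[i] ? y_err : uv_err;
    // Differences within the allowed error do not contribute.
    const double delta =
        std::max(0.0, std::abs(remote[i] - local[i]) - allowed_error);
    sum_sq += delta * delta;
  }
  return std::min(1.0, sum_sq * scale);
}
```

With this shape, a single sample that is far off is enough to saturate the score when `scale` is chosen large.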
+
+This extension message format does not make recommendations about what a
+receiver should do with the corruption scores, but some possibilities are:
+
+* Expose it as a statistic connected to the video receive stream. Let the
+ application decide what to do with the information.
+* Let the WebRTC application use a corruption signal to take proactive measures.
+ E.g. request a key-frame in order to recover, or try to switch to another
+ codec type or implementation.
+
+### Determining Filter Settings & Error Thresholds
+
+It is up to the sender to estimate how large the filter kernel and the allowed
+error thresholds should be.
+
+One method to do this is to analyze example outputs from different encoders and
+map the average frame QP to suitable settings. There will of course have to be
+different such mappings for e.g. AV1 compared to VP8 - but it’s also possible to
+get “tighter” values with knowledge of the exact implementation used. E.g. a
+mapping designed just for libaom encoder version X running with speed setting Y.
+
+Another method is to use the actual reconstructor state from the encoder. That
+of course means the encoder has to expose that state, which is not common.
+A benefit of doing it that way is that the filter size and allowed error can be
+very small (really only post-processing could introduce distortions in that
+scenario). A drawback is if the reconstructed state already contains corruption
+due to an encoder bug - then we would not be able to detect that corruption at
+all.
+
+There are also alternatives that are possibly more accurate but probably much
+more costly, such as training an ML model to determine the settings based on
+both the content of the source frame and any metadata present in the encoded
+bitstream.
+
+Regardless of method, the implementation at the send side SHOULD strive to set
+the filter size and error thresholds such that 99.5% of filtered samples end up
+with a delta <= the error threshold for that plane, based on a representative
+set of test clips and bandwidth constraints.
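As an illustration of the 99.5% target, the sketch below picks the smallest error threshold that covers the required share of measured deltas. It assumes that the 4-bit error fields carry raw values 0 to 15 and that `deltas` holds absolute differences between filtered source and filtered decoded samples for one plane, gathered offline; neither is specified by this explainer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the smallest threshold in [0, 15] such that at least 99.5% of the
// deltas are <= the threshold, or -1 if no 4-bit threshold is sufficient (in
// which case a larger filter std dev is probably needed).
int SmallestSufficientThreshold(const std::vector<double>& deltas) {
  assert(!deltas.empty());
  for (int threshold = 0; threshold <= 15; ++threshold) {
    std::size_t covered = 0;
    for (double delta : deltas) {
      if (delta <= threshold) {
        ++covered;
      }
    }
    // Integer comparison of covered / size >= 995 / 1000.
    if (covered * 1000 >= deltas.size() * 995) {
      return threshold;
    }
  }
  return -1;
}
```

A sender could run this for a range of std dev candidates per QP bucket and keep the combination with the smallest filter that still yields a valid threshold.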
diff --git a/modules/rtp_rtcp/source/corruption_detection_extension.cc b/modules/rtp_rtcp/source/corruption_detection_extension.cc
index 614d0c9..408f1eb 100644
--- a/modules/rtp_rtcp/source/corruption_detection_extension.cc
+++ b/modules/rtp_rtcp/source/corruption_detection_extension.cc
@@ -27,44 +27,8 @@
} // namespace
-// The message format of the header extension:
-//
-// 0 1 2 3
-// 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
-// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-// |B| seq# index | kernel size | Y err | UV err| sample 0 |
-// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-// | sample 1 | sample 2 | … up to sample <=12
-// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-//
-// * B (1 bit): If the sequence number should be interpreted as the MSB or LSB
-// of the full size 14 bit sequence index described in the next point.
-// * seq# index (7 bits): The index into the Halton sequence (used to locate
-// where the samples should be drawn from).
-// * If B is set: the 7 most significant bits of the true index. The 7 least
-// significant bits of the true index shall be interpreted as 0. This is
-// because this is the point where we can guarantee that the sender and
-// receiver has the same full index). For this reason, B must only be set
-// for key frames.
-// * If B is not set: The 7 LSB of the true index. The 7 most significant bits
-// should be inferred based on the most recent message.
-// * kernel size (8 bits): The standard deviation of the gaussian filter used
-// to weigh the samples. The value is scaled using a linear map:
-// 0 = 0.0 to 255 = 40.0. A kernel size of 0 is interpreted as directly using
-// just the sample value at the desired coordinate, without any weighting.
-// * Y err (4 bits): The allowed error for the luma channel.
-// * UV err (4 bits): The allowed error for the chroma channels.
-// * Sample N (8 bits): The N:th filtered sample from the input image. Each
-// sample represents a new point in one of the image planes, the plane and
-// coordinates being determined by index into the Halton sequence (starting at
-// seq# index and is incremented by one for each sample). Each sample has gone
-// through a Gaussian filter with the kernel size specified above. The samples
-// have been floored to the nearest integer.
-//
-// A special case is so called "synchronization" messages. These are messages
-// that only contains the first byte. They always have B set and are used to
-// keep the sender and receiver in sync even if no "full" messages have been
-// sent for a while.
+// A description of the extension can be found at
+// http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection
bool CorruptionDetectionExtension::Parse(rtc::ArrayView<const uint8_t> data,
CorruptionDetectionMessage* message) {