 # Corruption Detection

 **Name:**
 "Corruption Detection"; "Extension for Automatic Detection of Video Corruptions"

 **Formal name:**
 <http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection>

 **Status:** This extension is defined here to allow for experimentation.

 **Contact:** <sprang@google.com>

 NOTE: This explainer is a work in progress and may change without notice.

 The Corruption Detection (sometimes referred to as automatic corruption
 detection or ACD) extension is intended to be a part of a system that allows
 estimating a likelihood that a video transmission is in a valid state. That is,
 the input to the video encoder on the send side corresponds to the output of the
 video decoder on the receive side with the only difference being the expected
 distortions from lossy compression.

 The goal is to be able to detect outright coding errors caused by things such as
 bugs in encoder/decoders, malformed packetization data, incorrect relay
 decisions in SFU-type servers, incorrect handling of packet loss/reordering, and
 so forth. We want to accomplish this with a high signal-to-noise ratio while
 consuming a minimum of resources in terms of bandwidth and/or computation. It
 should be noted that it is _not_ a goal to be able to e.g. gauge general video
 quality using this method.

 This explainer contains two parts:

 1) A definition of the RTP header extension itself and how it is to be parsed.
 2) The intended usage and implementation details for a WebRTC sender and
    receiver respectively.

 If this extension has been negotiated, all the client behavior outlined in this
 doc MUST be adhered to.

 ## RTP Header Extension Format

 ### Data Layout Overview

 The message format of the header extension:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |B|  seq index  |    std dev    | Y err | UV err|    sample 0   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |    sample 1   |   sample 2    |    …   up to sample <=12
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 ### Data Layout Details

 * B (1 bit): Whether the seq index field should be interpreted as the 7 MSB or
   the 7 LSB of the full 14-bit sequence index described in the next point.
 * seq index (7 bits): The index into the Halton sequence (used to locate
   where the samples should be drawn from).
   * If B is set: the 7 most significant bits of the true index. The 7 least
     significant bits of the true index shall be interpreted as 0. This is
     because this is the point where we can guarantee that the sender and
     receiver have the same full index. B MUST be set on keyframes. On droppable
     frames B MUST NOT be set.
   * If B is not set: The 7 LSB of the true index. The 7 most significant bits
     should be inferred based on the most recent message.
 * std dev (8 bits):  The standard deviation of the Gaussian filter used
   to weigh the samples. The value is scaled using a linear map:
   0 = 0.0 to 255 = 40.0. A std dev of 0 is interpreted as directly using
   just the sample value at the desired coordinate, without any weighting.
 * Y err (4 bits): The allowed error for the luma channel.
 * UV err (4 bits): The allowed error for the chroma channels.
 * Sample N (8 bits): The N:th filtered sample from the input image. Each
   sample represents a new point in one of the image planes, the plane and
   coordinates being determined by the index into the Halton sequence (starting
   at seq index and incremented by one for each sample). Each sample has gone
   through a Gaussian filter with the std dev specified above. The samples
   have been rounded down (floored) to an integer.

 A special case is the so-called "synchronization" message. Such a message
 contains only the first byte. It is used to keep the sender and receiver in
 sync even if no "full" message has been received for a while. Synchronization
 messages MUST NOT be sent on droppable frames.
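
As a rough illustration of the layout above, the following Python sketch parses a raw extension payload. `AcdMessage` and `parse_acd` are hypothetical names, not part of any WebRTC API. The sketch assumes that B occupies the most significant bit of the first byte and that Y err occupies the high nibble of the third byte, as the diagram suggests. It does not enforce the upper limit on the number of samples.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AcdMessage:
    b: bool                    # True: seq_index holds the 7 MSB of the index
    seq_index: int             # Raw 7-bit field value
    std_dev: Optional[float]   # None for a synchronization message
    y_err: Optional[int]       # None for a synchronization message
    uv_err: Optional[int]      # None for a synchronization message
    samples: List[int]         # Empty for a synchronization message


def parse_acd(payload: bytes) -> AcdMessage:
    """Parse an ACD payload (hypothetical helper, see assumptions above)."""
    if len(payload) == 0:
        raise ValueError("empty payload")
    b = bool(payload[0] & 0x80)
    seq_index = payload[0] & 0x7F
    if len(payload) == 1:
        # Synchronization message: only the first byte is present.
        return AcdMessage(b, seq_index, None, None, None, [])
    if len(payload) < 3:
        raise ValueError("truncated payload")
    # Linear map: 0 = 0.0 to 255 = 40.0.
    std_dev = payload[1] * 40.0 / 255.0
    y_err = payload[2] >> 4
    uv_err = payload[2] & 0x0F
    samples = list(payload[3:])
    return AcdMessage(b, seq_index, std_dev, y_err, uv_err, samples)
```

For example, a one-byte payload `0x85` would parse as a synchronization message with B set and a seq index field of 5.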

 ### A note on encryption

 Privacy and security are core parts of nearly every WebRTC-based application,
 which means that some sort of encryption needs to be present. The most common
 form of encryption is SRTP, defined in RFC 3711. However, as mentioned in
 section 9.4 of that RFC, RTP header extensions are considered part of the header
 and are thus not encrypted.

 The automatic corruption detection header extension is different from most other
 header extensions in that it provides not only metadata about the media stream
 being transmitted but in practice comprises an extremely sparse representation
 of the actual video stream itself. Given a static scene and enough time, a crude
 image of the encrypted video can rather trivially be constructed.

 As such, most applications should use this extension with SRTP only if
 additional security is present to protect it. That could be for example in the
 form of explicit header extension encryption provided by RFC 6904/RFC 9335, or
 by encapsulating the entire RTP stream in an additional layer such as IPsec.

 ## Usage & Guidelines

 In this section we’ll first look at a general overview of the intended usage of
 this header extension, followed by more details around the expected
 implementation.

 ### Overview

 The premise of the extension described here is that we can validate the state of
 the video pipeline by quasi-randomly selecting a few samples from the raw input
 frame to an encoder, and then checking them against the output of a decoder.
 Assuming that a lossless codec is used we can follow these steps:

 1) In an image that is to be encoded, quasi-randomly select N sampling positions
    and store the sample values for those positions from the raw input image.
 2) Encode the image, and attach the selected sample values to the RTP packets
    containing the encoded bitstream of that image.
 3) Transmit the RTP packets to a remote receiver.
 4) At the receiver, collect the attached sample values from the RTP packets when
    assembling the frame, and then pass the bitstream to a decoder.
 5) Using the same quasi-random sequence as in (1), calculate the corresponding N
    sampling positions.
 6) Take the output of the decoder and check the values of the samples from the
    RTP packets. If they differ significantly, it is likely that an image
    corruption has occurred.

 Lossless encoding is however rarely used in practice, and that introduces
 problems for the above algorithm.

 * Quantization causes values to be different from the desired value.
 * Whole blocks of pixels might be shifted somewhat due to inaccuracies in motion
   vectors.
 * Inaccuracies caused by in-loop or post-process filtering.
 * etc.

 We must therefore take these distortions into consideration, as they are merely
 a natural side-effect of the compression and their effect is not to be
 considered an “invalid state”. We aim to accomplish this using two tools.

 First, instead of a sample being a single raw sample value let it be a filtered
 one: a weighted average of samples in the vicinity of the desired location, with
 the weights being a 2D Gaussian centered at that location and the variance
 adjusted depending on the magnitude of the expected distortions
 (higher distortion => higher variance). This smoothes out inaccuracies caused by
 both quantization and motion compensation.

 Secondly, even with a very large filter kernel the new sample might not converge
 towards the exact desired value. For that reason, set an “allowed error
 threshold” that removes small magnitude differences. Since chroma and luma
 channels have different scales, separate error thresholds are available for
 them.

 ### Sequence Index Handling

 The quasi-random sequence of choice for this extension is a 2D
 [Halton Sequence](https://en.wikipedia.org/wiki/Halton_sequence).

 The index into the Halton Sequence is signaled by the header extension. It is a
 14 bit unsigned integer which on overflow will wrap around back to 0.

 For each sample contained within the extension, the sequence index should be
 considered to be incremented by one. Thus the sequence index at the start of the
 header should be considered “the sequence index for the next sample to be
 drawn”.

 The ACD extension may be sent containing either the 7 most significant bits
 (B = true) or the 7 least significant bits (B = false) of the sequence index.

 Key-frames MUST be populated with the ACD extension, and those MUST use B = true
 indicating only the 7 most significant bits are transmitted.

 The sender may choose any arbitrary starting point. The biggest reason to not
 always start with (B = true, seq index = 0) is that with frequent/periodic
 keyframes you might end up always sampling the same small subset of image
 locations over and over.

 If B = false and the LSB seq index + number of samples exceeds the capacity of
 the 7-bit field (i.e. > 0x7F), then the most significant bits of the 14 bit
 sequence counter should be considered to be implicitly incremented by the
 overflow.

 Delta-frames may be encoded as “droppable” or “non-droppable”. Consider for
 example temporal layering using the
 [L1T3](https://www.w3.org/TR/webrtc-svc/#L1T3*) mode. In that scenario,
 key-frames and all T0 frames are non-droppable, while all T1 and T2 frames are
 droppable.

 For non-droppable frames, B MAY be set to true even though there is often little
 utility for it.
 For droppable frames B MUST NOT be set to true, since a receiver could otherwise
 easily end up out of sync with the sender.

 A receiver must store a state containing the last sequence index used. If an ACD
 extension is received with B = false but the LSB does not match the last known
 sequence index state, this indicates that an instrumented frame has been
 dropped. The receiver should recover from this by incrementing the last known
 sequence index until the 7 least significant bits match.
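
A minimal Python sketch of this receiver-side bookkeeping is shown below. The function names are hypothetical, and the sketch assumes that the stored state is the index expected for the next sample, that is, the index after advancing past the samples of the previous message.

```python
INDEX_MASK = 0x3FFF  # 14-bit sequence index
LSB_MASK = 0x7F      # 7-bit field carried in the extension


def update_sequence_index(last_index: int, b: bool, seq_field: int) -> int:
    """Return the full 14-bit index of the first sample in a new message."""
    if b:
        # The 7 MSB are transmitted; the 7 LSB are defined to be 0.
        return (seq_field & LSB_MASK) << 7
    # The 7 LSB are transmitted. Advance the stored index by the smallest
    # non-negative step that makes its 7 LSB match the received field.
    step = (seq_field - last_index) & LSB_MASK
    return (last_index + step) & INDEX_MASK


def advance(index: int, num_samples: int) -> int:
    """Advance past the samples of a message, wrapping at 14 bits."""
    return (index + num_samples) & INDEX_MASK
```

With a stored index of 133 (LSB 5), a message with B = false and a seq index field of 5 leaves the index at 133, while a field of 3 moves the index forward to 259, the next value whose 7 LSB equal 3.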

 Because of this, the sender MUST send ACD messages on non-droppable frames such
 that the delta between their sequence indices (from the last sample of the
 previous packet to the first sample of the next) does not exceed 0x7F. A
 synchronization message may be used for this purpose if there is no wish to
 instrument the non-droppable frame.

 It is not required to add the ACD extension to every frame. Indeed, for
 performance reasons it may be reasonable to only instrument a small subset of
 frames, for example using only one frame per second.

 Additionally, when encoding a structure that has independent decode targets
 (e.g. L3T3_KEY) - the sender should generate an independent ACD sequence per
 target resolution so that a receiver can validate the state of the sub-stream
 they receive.

 // TODO: Add concrete examples.

 ### Sample Selection

 As mentioned above, a Halton Sequence is used to generate sampling coordinates.
 Base 2 is used for selecting the rows, and base 3 is used for selecting columns.

 Each sample in the ACD extension represents a single image sample, meaning it
 belongs to a single channel rather than e.g. being an RGB pixel.

 The initial version of the ACD extension supports only the I420 chroma
 subsampling format. When determining which plane a location belongs to, it is
 easiest to visualize it as the chroma planes being “stacked” to the side of the
 luma plane:

     +------+---+
     |      | U |
     +  Y   +---+
     |      | V |
     +------+---+

 In pseudo code:
 ```
   row = GetHaltonSequence(seq_index, /*base=*/2) * image_height;
   col = GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5;

   if (col < image_width) {
     HandleSample(Y_PLANE, row, col);
   } else if (row < image_height / 2) {
     HandleSample(U_PLANE, row, col - image_width);
   } else {
     HandleSample(V_PLANE, row - (image_height / 2), col - image_width);
   }

   seq_index++;
 ```
 Support for other layout types may be added in later versions of this extension.

 Note that the image dimensions are not explicitly a part of the ACD extension -
 that has to be inferred from the raw image itself.
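
For reference, the pseudo code above could be made concrete roughly as follows. This Python sketch uses the standard radical-inverse construction for the Halton sequence. The names `halton` and `sample_location` are hypothetical, and truncating the coordinates to integers is an assumption made here for simplicity; the text above does not specify how fractional coordinates are handled.

```python
from typing import Tuple


def halton(index: int, base: int) -> float:
    """Return element `index` of the Halton sequence in the given base."""
    result = 0.0
    fraction = 1.0
    while index > 0:
        fraction /= base
        result += fraction * (index % base)
        index //= base
    return result


def sample_location(seq_index: int, width: int, height: int) -> Tuple[str, int, int]:
    """Map a sequence index to (plane, row, col) for an I420 image."""
    # Assumption: coordinates are truncated to integer positions.
    row = int(halton(seq_index, 2) * height)
    col = int(halton(seq_index, 3) * width * 1.5)
    if col < width:
        return ("Y", row, col)
    if row < height // 2:
        return ("U", row, col - width)
    return ("V", row - height // 2, col - width)
```

Under these assumptions, for a 640x480 image, sequence index 4 lands in the luma plane, index 8 in the U plane and index 5 in the V plane.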

 ### Sample Filtering

 As mentioned above, when filtering a sample we create a weighted average around
 the desired location. Only samples in the same plane are considered. The
 weighting consists of a 2D Gaussian centered on the desired location, with the
 standard deviation specified in the ACD extension header.

 If the standard deviation is specified as 0.0 - we consider only a singular
 sample. Otherwise, we first determine a cutoff distance beyond which the weights
 are considered too small to matter. For now, we have set the weight cutoff to
 0.2 - meaning the maximum distance from the center sample we need to consider is
 max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2)) - 1.

 Any samples outside the plane are considered to have weight 0.

 In pseudo-code, that means we get the following:
 ```
   sample_sum = 0;
   weight_sum = 0;
   // Loop bounds are inclusive and clamped to the plane.
   for (y = max(0, row - max_d) to min(plane_height - 1, row + max_d)) {
     for (x = max(0, col - max_d) to min(plane_width - 1, col + max_d)) {
       weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2));
       sample_sum += SampleAt(x, y) * weight;
       weight_sum += weight;
     }
   }
   filtered_sample = sample_sum / weight_sum;
 ```
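
A runnable Python version of the filter might look like the sketch below. It assumes that the plane is given as a list of rows of sample values and that the loop bounds are inclusive and clamped to the plane. The names `max_distance` and `filtered_sample` are hypothetical.

```python
import math
from typing import List

WEIGHT_CUTOFF = 0.2  # Weight cutoff stated in the text above.


def max_distance(std_dev: float) -> int:
    """Largest distance from the center sample that is still considered."""
    return math.ceil(math.sqrt(-2.0 * math.log(WEIGHT_CUTOFF) * std_dev ** 2)) - 1


def filtered_sample(plane: List[List[int]], row: int, col: int, std_dev: float) -> float:
    """Gaussian-weighted average around (row, col) within a single plane."""
    if std_dev == 0.0:
        # A std dev of 0 means the raw sample is used without weighting.
        return float(plane[row][col])
    max_d = max_distance(std_dev)
    height, width = len(plane), len(plane[0])
    sample_sum = 0.0
    weight_sum = 0.0
    # Samples outside the plane are skipped, which equals giving them weight 0.
    for y in range(max(0, row - max_d), min(height - 1, row + max_d) + 1):
        for x in range(max(0, col - max_d), min(width - 1, col + max_d) + 1):
            dist_sq = (y - row) ** 2 + (x - col) ** 2
            weight = math.exp(-dist_sq / (2.0 * std_dev ** 2))
            sample_sum += plane[y][x] * weight
            weight_sum += weight
    return sample_sum / weight_sum
```

With the 0.2 cutoff, a standard deviation of 1.0 gives `max_d = 1` (a 3x3 neighborhood), and the maximum standard deviation of 40.0 gives `max_d = 71`.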
 ### Receive Side Considerations

 When a frame has been decoded and an ACD message is present, the receiver
 performs the following steps:

 * Update the sequence index so that it is consistent with the ACD message.
 * Calculate the sample positions from the Halton sequence.
 * Filter each sample of the decoded image using the standard deviation provided
   in the ACD message.

 We then need to compare the actual samples present in the ACD message and the
 samples generated from the locally decoded frame, and take the allowed error
 into account:

 ```
 for (i = 0 to num_samples) {
   // Allowed error from ACD message, depending on which plane sample i is in.
   allowed_error = SampleType(i) == Y_PLANE ? Y_ERR : UV_ERR;
   delta_i = max(0, abs(RemoteSample(i) - LocalSample(i)) - allowed_error);
 }
 ```

 It is then up to the receiver how to interpret these deltas. A suggested method
 is to calculate a “corruption score” by calculating sum(delta(i)^2), where
 delta(i) is the delta for the i:th sample in the message, and then scaling and
 capping that result to a maximum of 1.0. By squaring the deltas, we make sure
 that even singular samples that are way outside their expected values cause a
 noticeable shift in the score. Another possible way is to calculate the distance
 and cap it using a sigmoid function.
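
The delta computation and the suggested sum-of-squares score could be sketched in Python as follows. The function names and the default `scale` value are hypothetical; the text above leaves the choice of scaling to the receiver.

```python
from typing import List, Sequence


def sample_deltas(remote: Sequence[float], local: Sequence[float],
                  is_luma: Sequence[bool], y_err: int, uv_err: int) -> List[float]:
    """Per-sample differences with the allowed error removed."""
    deltas = []
    for remote_value, local_value, luma in zip(remote, local, is_luma):
        allowed_error = y_err if luma else uv_err
        deltas.append(max(0.0, abs(remote_value - local_value) - allowed_error))
    return deltas


def corruption_score(deltas: Sequence[float], scale: float = 0.001) -> float:
    """Scaled sum of squared deltas, capped at 1.0.

    `scale` is a hypothetical tuning constant, not a value from this document.
    """
    return min(1.0, scale * sum(d * d for d in deltas))
```

A single chroma sample that is off by 10 with an allowed error of 3 contributes a delta of 7, and thus 49 to the sum before scaling.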

 This extension message format does not make recommendations about what a
 receiver should do with the corruption scores, but some possibilities are:

 * Expose it as a statistic connected to the video receive stream. Let the
   application decide what to do with the information.
 * Let the WebRTC application use a corruption signal to take proactive measures.
   E.g. request a key-frame in order to recover, or try to switch to another
   codec type or implementation.

 ### Determining Filter Settings & Error Thresholds

 It is up to the sender to estimate how large the filter kernel and the allowed
 error thresholds should be.

 One method to do this is to analyze example outputs from different encoders and
 map the average frame QP to suitable settings. There will of course have to be
 a different mapping for e.g. AV1 compared to VP8 - but it’s also possible to
 get “tighter” values with knowledge of the exact implementation used. E.g. a
 mapping designed just for libaom encoder version X running with speed setting Y.

 Another method is to use the actual reconstructed state from the encoder. That
 of course means the encoder has to expose that state, which is not common.
 A benefit of doing it that way is that the filter size and allowed error can be
 very small (really only post-processing could introduce distortions in that
 scenario). A drawback is if the reconstructed state already contains corruption
 due to an encoder bug - then we would not be able to detect that corruption at
 all.

 There are also possibly more accurate but probably much more costly alternatives
 as well, such as training an ML model to determine the settings based on both
 the content of the source frame and any metadata present in the encoded
 bitstream.

 Regardless of method, the implementation at the send side SHOULD strive to set
 the filter size and error thresholds such that 99.5% of filtered samples end up
 with a delta <= the error threshold for that plane, based on a representative
 set of test clips and bandwidth constraints.
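
As one possible way to apply the 99.5% guideline offline, the Python sketch below picks the smallest error threshold that covers the desired share of measured sample differences. The function name is hypothetical, and the sketch assumes that the 4-bit error field carries the threshold directly as an integer from 0 to 15, which this document does not state explicitly.

```python
from typing import Optional, Sequence


def pick_error_threshold(abs_diffs: Sequence[float], coverage: float = 0.995,
                         max_threshold: int = 15) -> Optional[int]:
    """Smallest threshold such that `coverage` of the diffs are <= threshold.

    `abs_diffs` holds |sender sample - receiver sample| values gathered from
    test clips for one plane type and one filter size. Returns None if no
    threshold up to `max_threshold` reaches the coverage, which suggests
    that a larger filter size should be tried.
    """
    if not abs_diffs:
        raise ValueError("no measurements")
    for threshold in range(max_threshold + 1):
        covered = sum(1 for diff in abs_diffs if diff <= threshold)
        if covered >= coverage * len(abs_diffs):
            return threshold
    return None
```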

 Note: The extension must not be present in more than 1 packet per video frame.