updated block format documentation

to specifically call attention to offset==0 scenario, which is invalid, and could lead to some form of security issue if incorrectly dealt with (ignored).
author: Yann Collet <cyan@fb.com> 2022-02-02 07:55:35 (GMT)
committer: Yann Collet <cyan@fb.com> 2022-02-02 07:59:40 (GMT)
commit: 1da033e601ca5bd1e91a2bb13702ce6368be6de7 (patch)
tree: 8e590587cacf32499ef096fadeed774143ab954f
parent: 7601d57f878649eea61fa2629424e0059d1591c6 (diff)
1 files changed, 41 insertions, 30 deletions
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md
index 8f3d4b9..a0017b9 100644
--- a/doc/lz4_Block_format.md
+++ b/doc/lz4_Block_format.md
@@ -1,6 +1,6 @@
 LZ4 Block Format Description
 ============================
-Last revised: 2019-03-30.
+Last revised: 2022-02-02.
 Author : Yann Collet
 
 
@@ -42,8 +42,9 @@ If the field value is 0, then there is no literal.
 If it is 15, then we need to add some more bytes to indicate the full length.
 Each additional byte then represent a value from 0 to 255,
 which is added to the previous value to produce a total length.
-When the byte value is 255, another byte is output.
-There can be any number of bytes following `token`. There is no "size limit".
+When the byte value is 255, another byte must read and added, and so on.
+There can be any number of bytes of value "255" following `token`.
+There is no "size limit".
 (Side note : this is why a not-compressible input block is expanded by 0.4%).
 
 Example 1 : A literal length of 48 will be represented as :
@@ -74,22 +75,23 @@ This is a 2 bytes value, in little endian format
 (the 1st byte is the "low" byte, the 2nd one is the "high" byte).
 
 The `offset` represents the position of the match to be copied from.
-1 means "current position - 1 byte".
-The maximum `offset` value is 65535, 65536 cannot be coded.
-Note that 0 is an invalid value, not used.
+For example, 1 means "current position - 1 byte".
+The maximum `offset` value is 65535, 65536 and beyond cannot be coded.
+Note that 0 is an invalid offset value.
+The presence of such a value denotes an invalid (corrupted) block.
 
-Then we need to extract the `matchlength`.
+Then the `matchlength` can be extracted.
 For this, we use the second token field, the low 4-bits.
-Value, obviously, ranges from 0 to 15.
+Such a value, obviously, ranges from 0 to 15.
 However here, 0 means that the copy operation will be minimal.
 The minimum length of a match, called `minmatch`, is 4.
 As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
 Similar to literal length, on reaching the highest possible value (15),
-we output additional bytes, one at a time, with values ranging from 0 to 255.
+one must read additional bytes, one at a time, with values ranging from 0 to 255.
 They are added to total to provide the final match length.
 A 255 value means there is another byte to read and add.
-There is no limit to the number of optional bytes that can be output this way.
-(This points towards a maximum achievable compression ratio of about 250).
+There is no limit to the number of optional "255" bytes that can be present.
+(Note: this points towards a maximum achievable compression ratio of about 250).
 
 Decoding the `matchlength` reaches the end of current sequence.
 Next byte will be the start of another sequence.
@@ -97,9 +99,9 @@ But before moving to next sequence,
 it's time to use the decoded match position and length.
 The decoder copies `matchlength` bytes from match position to current position.
 
-In some cases, `matchlength` is larger than `offset`.
-Therefore, `match_pos + matchlength > current_pos`,
-which means that later bytes to copy are not yet decoded.
+In some cases, `matchlength` can be larger than `offset`.
+Therefore, since `match_pos + matchlength > current_pos`,
+later bytes to copy are not decoded yet.
 This is called an "overlap match", and must be handled with special care.
 A common case is an offset of 1,
 meaning the last byte is repeated `matchlength` times.
@@ -107,7 +109,7 @@ meaning the last byte is repeated `matchlength` times.
 
 End of block restrictions
 -----------------------
-There are specific rules required to terminate a block.
+There are specific restrictions required to terminate an LZ4 block.
 
 1. The last sequence contains only literals.
    The block ends right after them.
@@ -124,33 +126,42 @@ There are specific rules required to terminate a block.
      an independent block < 13 bytes cannot be compressed,
      because the match must copy "something",
      so it needs at least one prior byte.
-   - When a block can reference data from another block,
+   - However, when a block can reference data from another block,
      it can start immediately with a match and no literal,
-     so a block of 12 bytes can be compressed.
+     therefore a block of exactly 12 bytes can be compressed.
 
 When a block does not respect these end conditions,
 a conformant decoder is allowed to reject the block as incorrect.
 
-These rules are in place to ensure that a conformant decoder
-can be designed for speed, issuing speculatively instructions,
-while never reading nor writing beyond provided I/O buffers.
-
+These rules are in place to ensure compatibility with
+a wide range of historical decoders
+which rely on these conditions in their speed-oriented design.
 
 Additional notes
 -----------------------
-If the decoder will decompress data from an external source,
-it is recommended to ensure that the decoder will not be vulnerable to
-buffer overflow manipulations.
+If the decoder will decompress data from any external source,
+it is recommended to ensure that the decoder is resilient to corrupted data,
+and typically not vulnerable to buffer overflow manipulations.
 Always ensure that read and write operations
 remain within the limits of provided buffers.
 Test the decoder with fuzzers
-to ensure it's resilient to improbable combinations.
-
-The format makes no assumption nor limits to the way the compressor
+to ensure it's resilient to improbable sequences of conditions.
+Combine them with sanitizers, in order to catch overflows (asan)
+or initialization issues (msan).
+Pay some attention to offset 0 scenario, which is invalid,
+and therefore must not be blindly decoded
+(a naive implementation could preserve destination buffer content,
+which could then result in information disclosure
+if such buffer was uninitialized and still containing private data).
+For reference, in such a scenario, the reference LZ4 decoder
+clears the match segment with `0` bytes,
+though other solutions are certainly possible.
+
+The format makes no assumption nor limits to the way a compressor
 searches and selects matches within the source data block.
 Multiple techniques can be considered,
 featuring distinct time / performance trade offs.
-As long as the format is respected,
+For example, an upper compression limit can be reached,
+using a technique called "full optimal parsing", at very high cpu cost.
+As long as the specified format is respected,
 the result will be compatible and decodable by any compliant decoder.
-An upper compression limit can be reached,
-using a technique called "full optimal parsing", at high cpu cost.
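
The decoding rules touched by this patch (length extension with "255" bytes, the little-endian `offset`, rejection of offset 0, and overlap matches) can be sketched as a small decoder. This is a minimal illustration, not the reference LZ4 implementation: the names `CorruptBlockError`, `read_length` and `decode_block` are hypothetical, and the sketch assumes an independent block with no dictionary or prior block to reference. It enforces only the "last sequence contains only literals" rule, not the other end-of-block restrictions, and it rejects offset 0 outright, whereas the patch notes that the reference decoder clears the match segment with `0` bytes instead.

```python
from typing import Tuple

MINMATCH = 4  # minimum match length, as stated in the format description


class CorruptBlockError(ValueError):
    """Raised when the input is not a valid LZ4 block (hypothetical name)."""


def read_length(src: bytes, pos: int, nibble: int) -> Tuple[int, int]:
    """Extend a 4-bit length field; returns (length, new position)."""
    length = nibble
    if nibble == 15:
        # Keep reading and adding bytes while the byte value is 255.
        while True:
            if pos >= len(src):
                raise CorruptBlockError("truncated length field")
            extra = src[pos]
            pos += 1
            length += extra
            if extra != 255:
                break
    return length, pos


def decode_block(src: bytes) -> bytes:
    """Decode one independent LZ4 block, rejecting corrupted input."""
    out = bytearray()
    pos = 0
    while True:
        if pos >= len(src):
            raise CorruptBlockError("block must end with a literals-only sequence")
        token = src[pos]
        pos += 1

        # High 4 bits of the token: literal length.
        lit_len, pos = read_length(src, pos, token >> 4)
        if pos + lit_len > len(src):
            raise CorruptBlockError("literals run past end of input")
        out += src[pos:pos + lit_len]
        pos += lit_len
        if pos == len(src):
            return bytes(out)  # last sequence: literals only

        # 2-byte little-endian offset: low byte first, then high byte.
        if pos + 2 > len(src):
            raise CorruptBlockError("truncated offset")
        offset = src[pos] | (src[pos + 1] << 8)
        pos += 2
        if offset == 0:
            # Offset 0 is invalid and denotes a corrupted block.
            raise CorruptBlockError("offset 0 is invalid")
        if offset > len(out):
            raise CorruptBlockError("offset points before start of output")

        # Low 4 bits of the token: match length, biased by MINMATCH.
        match_len, pos = read_length(src, pos, token & 0x0F)
        match_len += MINMATCH

        # Byte-by-byte copy so that overlap matches (match_len > offset)
        # read bytes produced earlier in this same copy.
        start = len(out) - offset
        for i in range(match_len):
            out.append(out[start + i])
```

For example, `decode_block(bytes([0x10]) + b"a" + bytes([0x01, 0x00, 0x50]) + b"bcdef")` decodes one literal `a`, an overlap match at offset 1 of length 4, and a final run of five literals. Replacing the offset bytes with `0x00, 0x00` raises `CorruptBlockError` rather than copying anything.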