summaryrefslogtreecommitdiffstats
path: root/doc
diff options
context:
space:
mode:
authorYann Collet <cyan@fb.com>2019-01-02 22:36:12 (GMT)
committerYann Collet <cyan@fb.com>2019-01-02 22:36:12 (GMT)
commit7a4e04e6a683cab6dd8961c898a7164d3b2ecc7c (patch)
treef7996a03fa538490b40129624030704ece3b5e9f /doc
parent6e24ef902a4f6de5fd9419c6364a4e8b473445b9 (diff)
downloadlz4-7a4e04e6a683cab6dd8961c898a7164d3b2ecc7c.zip
lz4-7a4e04e6a683cab6dd8961c898a7164d3b2ecc7c.tar.gz
lz4-7a4e04e6a683cab6dd8961c898a7164d3b2ecc7c.tar.bz2
updated LZ4 block format
rewording the end of block conditions for clarity and answering related questions.
Diffstat (limited to 'doc')
-rw-r--r--doc/lz4_Block_format.md69
1 files changed, 33 insertions, 36 deletions
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md
index 5438730..2fb4c19 100644
--- a/doc/lz4_Block_format.md
+++ b/doc/lz4_Block_format.md
@@ -1,6 +1,6 @@
LZ4 Block Format Description
============================
-Last revised: 2018-04-25.
+Last revised: 2018-12-30.
Author : Yann Collet
@@ -10,7 +10,8 @@ using any programming language.
LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
There is no entropy encoder back-end nor framing layer.
-The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
+The latter is assumed to be handled by other parts of the system
+(see [LZ4 Frame format]).
This design is assumed to favor simplicity and speed.
It helps later on for optimizations, compactness, and features.
@@ -104,45 +105,41 @@ A common case is an offset of 1,
meaning the last byte is repeated `matchlength` times.
-Parsing restrictions
+End of block restrictions
-----------------------
-There are specific parsing rules to respect in order to remain compatible
-with assumptions made by the decoder :
-
-1. The last 5 bytes are always literals. In other words, the last five bytes
- from the uncompressed input (or all bytes, if the input has less than five
- bytes) must be encoded as literals on behalf of the last sequence.
- The last sequence is incomplete, and stops right after the literals.
-2. The last match must start at least 12 bytes before end of block.
- The last match is part of the penultimate sequence,
- since the last sequence stops right after literals.
+There are specific rules required to terminate a block.
+
+1. The last sequence only contains literals. The block ends right after them.
+1. The last 5 bytes of input are always literals.
+ Therefore, the last sequence contains at least 5 bytes,
+ or all input bytes if input is smaller than 5 bytes
+ (empty input can be represented with a zero byte,
+ interpreted as a token without literal and without a match).
+2. The last match must start at least 12 bytes before the end of block.
+ The last match is part of the penultimate sequence.
+ It is followed by the last sequence, which only contains literals.
Note that, as a consequence, blocks < 13 bytes cannot be compressed.
-These rules are in place to ensure that the decoder
-can speculatively execute copy instructions
-without ever reading nor writing beyond provided I/O buffers.
-
-1. To copy literals from a non-last sequence, an 8-byte copy instruction
- can always be safely issued (without reading past the input),
- because literals are followed by a 2-byte offset,
- and last sequence is at least 1+5 bytes long.
-2. Similarly, a match operation can speculatively copy up to 12 bytes
- while remaining within output buffer boundaries.
-
-Empty inputs can be represented with a zero byte,
-interpreted as a token without literals and without a match.
+These rules are in place to ensure that a compatible decoder
+can be designed for speed, issuing speculatively instructions,
+while never reading nor writing beyond provided I/O buffers.
Additional notes
-----------------------
-There is no assumption nor limits to the way the compressor
+If the decoder will decompress data from an external source,
+it is recommended to ensure that the decoder will not be vulnerable to
+buffer overflow manipulations.
+Always ensure that read and write operations
+remain within the limits of provided buffers.
+Test the decoder with fuzzers
+to ensure it's resilient to improbable combinations.
+
+The format makes no assumption nor limits to the way the compressor
searches and selects matches within the source data block.
-It could be a fast scan, a multi-probe, a full search using BST,
-standard hash chains or MMC, well whatever.
-
-Advanced parsing strategies can also be implemented, such as lazy match,
-or full optimal parsing.
-
-All these trade-off offer distinctive speed/memory/compression advantages.
-Whatever the method used by the compressor, its result will be decodable
-by any LZ4 decoder if it follows the format specification described above.
+Multiple techniques can be considered,
+featuring distinct time / performance trade offs.
+As long as the format is respected,
+the result will be compatible and decodable by any compliant decoder.
+An upper compression limit can be reached,
+using a technique called "full optimal parsing", at high cpu cost.