diff options
author | Yann Collet <yann.collet.73@gmail.com> | 2015-03-30 17:31:55 (GMT) |
---|---|---|
committer | Yann Collet <yann.collet.73@gmail.com> | 2015-03-30 17:31:55 (GMT) |
commit | b93f629681ad3245a09add28e4d0b2e43bcde58a (patch) | |
tree | 6fd3ad884811d2350e3d02d52f65ca84daabbf0b /lz4_block_format.txt | |
parent | eeb8bea34c2df279307eaed922f3a89cc420316c (diff) | |
download | lz4-b93f629681ad3245a09add28e4d0b2e43bcde58a.zip lz4-b93f629681ad3245a09add28e4d0b2e43bcde58a.tar.gz lz4-b93f629681ad3245a09add28e4d0b2e43bcde58a.tar.bz2 |
changed file name
Diffstat (limited to 'lz4_block_format.txt')
-rw-r--r-- | lz4_block_format.txt | 121 |
1 files changed, 0 insertions, 121 deletions
diff --git a/lz4_block_format.txt b/lz4_block_format.txt deleted file mode 100644 index e248fd9..0000000 --- a/lz4_block_format.txt +++ /dev/null @@ -1,121 +0,0 @@ -LZ4 Block Format Description -============================ -Last revised: 2015-03-26; -Author : Yann Collet - - - -This small specification intents to provide enough information -to anyone willing to produce LZ4-compatible compressed data blocks -using any programming language. - -LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. -The most important design principle behind LZ4 is simplicity. -It helps to create an easy to read and maintain source code. -It also helps later on for optimizations, compactness, and speed. -There is no entropy encoder back-end nor framing layer. -The latter is assumed to be handled by other parts of the system. - -This document only describes the block format, -not how the LZ4 compressor nor decompressor actually work. -The correctness of the decompressor should not depend -on implementation details of the compressor, and vice versa. - - - -Compressed block format ------------------------ -An LZ4 compressed block is composed of sequences. -Schematically, a sequence is a suite of literals, followed by a match copy. - -Each sequence starts with a token. -The token is a one byte value, separated into two 4-bits fields. -Therefore each field ranges from 0 to 15. - - -The first field uses the 4 high-bits of the token. -It provides the length of literals to follow. -(Note : a literal is a not-compressed byte). -If the field value is 0, then there is no literal. -If it is 15, then we need to add some more bytes to indicate the full length. -Each additionnal byte then represent a value from 0 to 255, -which is added to the previous value to produce a total length. -When the byte value is 255, another byte is output. -There can be any number of bytes following the token. There is no "size limit". -(Sidenote this is why a not-compressible input block is expanded by 0.4%). - -Example 1 : A length of 48 will be represented as : -- 15 : value for the 4-bits High field -- 33 : (=48-15) remaining length to reach 48 - -Example 2 : A length of 280 will be represented as : -- 15 : value for the 4-bits High field -- 255 : following byte is maxed, since 280-15 >= 255 -- 10 : (=280 - 15 - 255) ) remaining length to reach 280 - -Example 3 : A length of 15 will be represented as : -- 15 : value for the 4-bits High field -- 0 : (=15-15) yes, the zero must be output - -Following the token and optional length bytes, are the literals themselves. -They are exactly as numerous as previously decoded (length of literals). -It's possible that there are zero literal. - - -Following the literals is the match copy operation. - -It starts by the offset. -This is a 2 bytes value, in little endian format. - -The offset represents the position of the match to be copied from. -1 means "current position - 1 byte". -The maximum offset value is 65535, 65536 cannot be coded. -Note that 0 is an invalid value, not used. - -Then we need to extract the match length. -For this, we use the second token field, the low 4-bits. -Value, obviously, ranges from 0 to 15. -However here, 0 means that the copy operation will be minimal. -The minimum length of a match, called minmatch, is 4. -As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. -Similar to literal length, on reaching the highest possible value (15), -we output additional bytes, one at a time, with values ranging from 0 to 255. -They are added to total to provide the final match length. -A 255 value means there is another byte to read and add. -There is no limit to the number of optional bytes that can be output this way. -(This points towards a maximum achievable compression ratio of ~250). - -With the offset and the matchlength, -the decoder can now proceed to copy the data from the already decoded buffer. -On decoding the matchlength, we reach the end of the compressed sequence, -and therefore start another one. - - -Parsing restrictions ------------------------ -There are specific parsing rules to respect in order to remain compatible -with assumptions made by the decoder : -1) The last 5 bytes are always literals -2) The last match must start at least 12 bytes before end of block -Consequently, a block with less than 13 bytes cannot be compressed. -These rules are in place to ensure that the decoder -will never read beyond the input buffer, nor write beyond the output buffer. - -Note that the last sequence is also incomplete, -and stops right after literals. - - -Additional notes ------------------------ -There is no assumption nor limits to the way the compressor -searches and selects matches within the source data block. -It could be a fast scan, a multi-probe, a full search using BST, -standard hash chains or MMC, well whatever. - -Advanced parsing strategies can also be implemented, such as lazy match, -or full optimal parsing. - -All these trade-off offer distinctive speed/memory/compression advantages. -Whatever the method used by the compressor, its result will be decodable -by any LZ4 decoder if it follows the format specification described above. - |