From 88cef262ec50cf905da8498b7af2d8f167fbf61d Mon Sep 17 00:00:00 2001 From: Przemyslaw Skibinski Date: Thu, 3 Nov 2016 13:25:20 +0100 Subject: documentation moved to doc/ --- README.md | 4 +- doc/lz4_Block_format.md | 127 ++++++++++++++++ doc/lz4_Frame_format.md | 385 ++++++++++++++++++++++++++++++++++++++++++++++++ lib/README.md | 2 +- lib/lz4.h | 2 +- lz4_Block_format.md | 127 ---------------- lz4_Frame_format.md | 385 ------------------------------------------------ 7 files changed, 516 insertions(+), 516 deletions(-) create mode 100644 doc/lz4_Block_format.md create mode 100644 doc/lz4_Frame_format.md delete mode 100644 lz4_Block_format.md delete mode 100644 lz4_Frame_format.md diff --git a/README.md b/README.md index 753f11d..3f9cba6 100644 --- a/README.md +++ b/README.md @@ -89,6 +89,6 @@ A list of known source ports is maintained on the [LZ4 Homepage]. [Open-Source Benchmark program by m^2 (v0.14.3)]: http://encode.ru/threads/1371-Filesystem-benchmark?p=34029&viewfull=1#post34029 [Silesia Corpus]: http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia -[lz4_Block_format]: lz4_Block_format.md -[lz4_Frame_format]: lz4_Frame_format.md +[lz4_Block_format]: doc/lz4_Block_format.md +[lz4_Frame_format]: doc/lz4_Frame_format.md [LZ4 Homepage]: http://www.lz4.org diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md new file mode 100644 index 0000000..0f6a5ba --- /dev/null +++ b/doc/lz4_Block_format.md @@ -0,0 +1,127 @@ +LZ4 Block Format Description +============================ +Last revised: 2015-05-07. +Author : Yann Collet + + +This specification is intended for developers +willing to produce LZ4-compatible compressed data blocks +using any programming language. + +LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. +There is no entropy encoder back-end nor framing layer. +The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]). +This design is assumed to favor simplicity and speed. +It helps later on for optimizations, compactness, and features. + +This document describes only the block format, +not how the compressor nor decompressor actually work. +The correctness of the decompressor should not depend +on implementation details of the compressor, and vice versa. + +[LZ4 Frame format]: lz4_Frame_format.md + + + +Compressed block format +----------------------- +An LZ4 compressed block is composed of sequences. +A sequence is a suite of literals (not-compressed bytes), +followed by a match copy. + +Each sequence starts with a token. +The token is a one byte value, separated into two 4-bits fields. +Therefore each field ranges from 0 to 15. + + +The first field uses the 4 high-bits of the token. +It provides the length of literals to follow. + +If the field value is 0, then there is no literal. +If it is 15, then we need to add some more bytes to indicate the full length. +Each additional byte then represent a value from 0 to 255, +which is added to the previous value to produce a total length. +When the byte value is 255, another byte is output. +There can be any number of bytes following the token. There is no "size limit". +(Side note : this is why a not-compressible input block is expanded by 0.4%). + +Example 1 : A length of 48 will be represented as : + + - 15 : value for the 4-bits High field + - 33 : (=48-15) remaining length to reach 48 + +Example 2 : A length of 280 will be represented as : + + - 15 : value for the 4-bits High field + - 255 : following byte is maxed, since 280-15 >= 255 + - 10 : (=280 - 15 - 255) ) remaining length to reach 280 + +Example 3 : A length of 15 will be represented as : + + - 15 : value for the 4-bits High field + - 0 : (=15-15) yes, the zero must be output + +Following the token and optional length bytes, are the literals themselves. +They are exactly as numerous as previously decoded (length of literals). +It's possible that there are zero literal. + + +Following the literals is the match copy operation. + +It starts by the offset. +This is a 2 bytes value, in little endian format +(the 1st byte is the "low" byte, the 2nd one is the "high" byte). + +The offset represents the position of the match to be copied from. +1 means "current position - 1 byte". +The maximum offset value is 65535, 65536 cannot be coded. +Note that 0 is an invalid value, not used. + +Then we need to extract the match length. +For this, we use the second token field, the low 4-bits. +Value, obviously, ranges from 0 to 15. +However here, 0 means that the copy operation will be minimal. +The minimum length of a match, called minmatch, is 4. +As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. +Similar to literal length, on reaching the highest possible value (15), +we output additional bytes, one at a time, with values ranging from 0 to 255. +They are added to total to provide the final match length. +A 255 value means there is another byte to read and add. +There is no limit to the number of optional bytes that can be output this way. +(This points towards a maximum achievable compression ratio of about 250). + +With the offset and the matchlength, +the decoder can now proceed to copy the data from the already decoded buffer. +On decoding the matchlength, we reach the end of the compressed sequence, +and therefore start another one. + + +Parsing restrictions +----------------------- +There are specific parsing rules to respect in order to remain compatible +with assumptions made by the decoder : + +1. The last 5 bytes are always literals +2. The last match must start at least 12 bytes before end of block. + Consequently, a block with less than 13 bytes cannot be compressed. + +These rules are in place to ensure that the decoder +will never read beyond the input buffer, nor write beyond the output buffer. + +Note that the last sequence is also incomplete, +and stops right after literals. + + +Additional notes +----------------------- +There is no assumption nor limits to the way the compressor +searches and selects matches within the source data block. +It could be a fast scan, a multi-probe, a full search using BST, +standard hash chains or MMC, well whatever. + +Advanced parsing strategies can also be implemented, such as lazy match, +or full optimal parsing. + +All these trade-off offer distinctive speed/memory/compression advantages. +Whatever the method used by the compressor, its result will be decodable +by any LZ4 decoder if it follows the format specification described above. diff --git a/doc/lz4_Frame_format.md b/doc/lz4_Frame_format.md new file mode 100644 index 0000000..2ea1a86 --- /dev/null +++ b/doc/lz4_Frame_format.md @@ -0,0 +1,385 @@ +LZ4 Frame Format Description +============================ + +###Notices + +Copyright (c) 2013-2015 Yann Collet + +Permission is granted to copy and distribute this document +for any purpose and without charge, +including translations into other languages +and incorporation into compilations, +provided that the copyright notice and this notice are preserved, +and that any substantive changes or deletions from the original +are clearly marked. +Distribution of this document is unlimited. + +###Version + +1.5.1 (31/03/2015) + + +Introduction +------------ + +The purpose of this document is to define a lossless compressed data format, +that is independent of CPU type, operating system, +file system and character set, suitable for +File compression, Pipe and streaming compression +using the [LZ4 algorithm](http://www.lz4.org). + +The data can be produced or consumed, +even for an arbitrarily long sequentially presented input data stream, +using only an a priori bounded amount of intermediate storage, +and hence can be used in data communications. +The format uses the LZ4 compression method, +and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash), +for detection of data corruption. + +The data format defined by this specification +does not attempt to allow random access to compressed data. + +This specification is intended for use by implementers of software +to compress data into LZ4 format and/or decompress data from LZ4 format. +The text of the specification assumes a basic background in programming +at the level of bits and other primitive data representations. + +Unless otherwise indicated below, +a compliant compressor must produce data sets +that conform to the specifications presented here. +It doesn’t need to support all options though. + +A compliant decompressor must be able to decompress +at least one working set of parameters +that conforms to the specifications presented here. +It may also ignore checksums. +Whenever it does not support a specific parameter within the compressed stream, +it must produce a non-ambiguous error code +and associated error message explaining which parameter is unsupported. + + +General Structure of LZ4 Frame format +------------------------------------- + +| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum | +|:-------:|:-------------:| ----- | ----- | ------- | ----------- | +| 4 bytes | 3-11 bytes | | | 4 bytes | 0-4 bytes | + +__Magic Number__ + +4 Bytes, Little endian format. +Value : 0x184D2204 + +__Frame Descriptor__ + +3 to 11 Bytes, to be detailed in the next part. +Most important part of the spec. + +__Data Blocks__ + +To be detailed later on. +That’s where compressed data is stored. + +__EndMark__ + +The flow of blocks ends when the last data block has a size of “0”. +The size is expressed as a 32-bits value. + +__Content Checksum__ + +Content Checksum verify that the full content has been decoded correctly. +The content checksum is the result +of [xxh32() hash function](https://github.com/Cyan4973/xxHash) +digesting the original (decoded) data as input, and a seed of zero. +Content checksum is only present when its associated flag +is set in the frame descriptor. +Content Checksum validates the result, +that all blocks were fully transmitted in the correct order and without error, +and also that the encoding/decoding process itself generated no distortion. +Its usage is recommended. + +__Frame Concatenation__ + +In some circumstances, it may be preferable to append multiple frames, +for example in order to add new data to an existing compressed file +without re-framing it. + +In such case, each frame has its own set of descriptor flags. +Each frame is considered independent. +The only relation between frames is their sequential order. + +The ability to decode multiple concatenated frames +within a single stream or file +is left outside of this specification. +As an example, the reference lz4 command line utility behavior is +to decode all concatenated frames in their sequential order. + + +Frame Descriptor +---------------- + +| FLG | BD | (Content Size) | HC | +| ------- | ------- |:--------------:| ------- | +| 1 byte | 1 byte | 0 - 8 bytes | 1 byte | + +The descriptor uses a minimum of 3 bytes, +and up to 11 bytes depending on optional parameters. + +__FLG byte__ + +| BitNb | 7-6 | 5 | 4 | 3 | 2 | 1-0 | +| ------- | ------- | ------- | --------- | ------- | --------- | -------- | +|FieldName| Version | B.Indep | B.Checksum| C.Size | C.Checksum|*Reserved*| + + +__BD byte__ + +| BitNb | 7 | 6-5-4 | 3-2-1-0 | +| ------- | -------- | ------------ | -------- | +|FieldName|*Reserved*| Block MaxSize|*Reserved*| + +In the tables, bit 7 is highest bit, while bit 0 is lowest. + +__Version Number__ + +2-bits field, must be set to “01”. +Any other value cannot be decoded by this version of the specification. +Other version numbers will use different flag layouts. + +__Block Independence flag__ + +If this flag is set to “1”, blocks are independent. +If this flag is set to “0”, each block depends on previous ones +(up to LZ4 window size, which is 64 KB). +In such case, it’s necessary to decode all blocks in sequence. + +Block dependency improves compression ratio, especially for small blocks. +On the other hand, it makes direct jumps or multi-threaded decoding impossible. + +__Block checksum flag__ + +If this flag is set, each data block will be followed by a 4-bytes checksum, +calculated by using the xxHash-32 algorithm on the raw (compressed) data block. +The intention is to detect data corruption (storage or transmission errors) +immediately, before decoding. +Block checksum usage is optional. + +__Content Size flag__ + +If this flag is set, the uncompressed size of data included within the frame +will be present as an 8 bytes unsigned little endian value, after the flags. +Content Size usage is optional. + +__Content checksum flag__ + +If this flag is set, a content checksum will be appended after the EndMark. + +Recommended value : “1” (content checksum is present) + +__Block Maximum Size__ + +This information is intended to help the decoder allocate memory. +Size here refers to the original (uncompressed) data size. +Block Maximum Size is one value among the following table : + +| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +| --- | --- | --- | --- | ----- | ------ | ---- | ---- | +| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB | + +The decoder may refuse to allocate block sizes above a (system-specific) size. +Unused values may be used in a future revision of the spec. +A decoder conformant to the current version of the spec +is only able to decode blocksizes defined in this spec. + +__Reserved bits__ + +Value of reserved bits **must** be 0 (zero). +Reserved bit might be used in a future version of the specification, +typically enabling new optional features. +If this happens, a decoder respecting the current version of the specification +shall not be able to decode such a frame. + +__Content Size__ + +This is the original (uncompressed) size. +This information is optional, and only present if the associated flag is set. +Content size is provided using unsigned 8 Bytes, for a maximum of 16 HexaBytes. +Format is Little endian. +This value is informational, typically for display or memory allocation. +It can be skipped by a decoder, or used to validate content correctness. + +__Header Checksum__ + +One-byte checksum of combined descriptor fields, including optional ones. +The value is the second byte of xxh32() : ` (xxh32()>>8) & 0xFF ` +using zero as a seed, +and the full Frame Descriptor as an input +(including optional fields when they are present). +A wrong checksum indicates an error in the descriptor. +Header checksum is informational and can be skipped. + + +Data Blocks +----------- + +| Block Size | data | (Block Checksum) | +|:----------:| ------ |:----------------:| +| 4 bytes | | 0 - 4 bytes | + + +__Block Size__ + +This field uses 4-bytes, format is little-endian. + +The highest bit is “1” if data in the block is uncompressed. + +The highest bit is “0” if data in the block is compressed by LZ4. + +All other bits give the size, in bytes, of the following data block +(the size does not include the block checksum if present). + +Block Size shall never be larger than Block Maximum Size. +Such a thing could happen for incompressible source data. +In such case, such a data block shall be passed in uncompressed format. + +__Data__ + +Where the actual data to decode stands. +It might be compressed or not, depending on previous field indications. +Uncompressed size of Data can be any size, up to “block maximum size”. +Note that data block is not necessarily full : +an arbitrary “flush” may happen anytime. Any block can be “partially filled”. + +__Block checksum__ + +Only present if the associated flag is set. +This is a 4-bytes checksum value, in little endian format, +calculated by using the xxHash-32 algorithm on the raw (undecoded) data block, +and a seed of zero. +The intention is to detect data corruption (storage or transmission errors) +before decoding. + +Block checksum is cumulative with Content checksum. + + +Skippable Frames +---------------- + +| Magic Number | Frame Size | User Data | +|:------------:|:----------:| --------- | +| 4 bytes | 4 bytes | | + +Skippable frames allow the integration of user-defined data +into a flow of concatenated frames. +Its design is pretty straightforward, +with the sole objective to allow the decoder to quickly skip +over user-defined data and continue decoding. + +For the purpose of facilitating identification, +it is discouraged to start a flow of concatenated frames with a skippable frame. +If there is a need to start such a flow with some user data +encapsulated into a skippable frame, +it’s recommended to start with a zero-byte LZ4 frame +followed by a skippable frame. +This will make it easier for file type identifiers. + + +__Magic Number__ + +4 Bytes, Little endian format. +Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. +All 16 values are valid to identify a skippable frame. + +__Frame Size__ + +This is the size, in bytes, of the following User Data +(without including the magic number nor the size field itself). +4 Bytes, Little endian format, unsigned 32-bits. +This means User Data can’t be bigger than (2^32-1) Bytes. + +__User Data__ + +User Data can be anything. Data will just be skipped by the decoder. + + +Legacy frame +------------ + +The Legacy frame format was defined into the initial versions of “LZ4Demo”. +Newer compressors should not use this format anymore, as it is too restrictive. + +Main characteristics of the legacy format : + +- Fixed block size : 8 MB. +- All blocks must be completely filled, except the last one. +- All blocks are always compressed, even when compression is detrimental. +- The last block is detected either because + it is followed by the “EOF” (End of File) mark, + or because it is followed by a known Frame Magic Number. +- No checksum +- Convention is Little endian + +| MagicNb | B.CSize | CData | B.CSize | CData | (...) | EndMark | +| ------- | ------- | ----- | ------- | ----- | ------- | ------- | +| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times | EOF | + + +__Magic Number__ + +4 Bytes, Little endian format. +Value : 0x184C2102 + +__Block Compressed Size__ + +This is the size, in bytes, of the following compressed data block. +4 Bytes, Little endian format. + +__Data__ + +Where the actual compressed data stands. +Data is always compressed, even when compression is detrimental. + +__EndMark__ + +End of legacy frame is implicit only. +It must be followed by a standard EOF (End Of File) signal, +wether it is a file or a stream. + +Alternatively, if the frame is followed by a valid Frame Magic Number, +it is considered completed. +It makes legacy frames compatible with frame concatenation. + +Any other value will be interpreted as a block size, +and trigger an error if it does not fit within acceptable range. + + +Version changes +--------------- + +1.5.1 : changed format to MarkDown compatible + +1.5 : removed Dictionary ID from specification + +1.4.1 : changed wording from “stream” to “frame” + +1.4 : added skippable streams, re-added stream checksum + +1.3 : modified header checksum + +1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”. + +1.1 : optional fields are now part of the descriptor + +1.0 : changed “block size” specification, adding a compressed/uncompressed flag + +0.9 : reduced scale of “block maximum size” table + +0.8 : removed : high compression flag + +0.7 : removed : stream checksum + +0.6 : settled : stream size uses 8 bytes, endian convention is little endian + +0.5: added copyright notice + +0.4 : changed format to Google Doc compatible OpenDocument diff --git a/lib/README.md b/lib/README.md index f932d42..209cac1 100644 --- a/lib/README.md +++ b/lib/README.md @@ -37,4 +37,4 @@ Other files present in the directory are not source code. There are : - liblz4.pc.in : for pkg-config (make install) - README.md : this file -[official interoperable frame format]: ../lz4_Frame_format.md +[official interoperable frame format]: ../doc/lz4_Frame_format.md diff --git a/lib/lz4.h b/lib/lz4.h index 3ece4dd..0f17c7c 100644 --- a/lib/lz4.h +++ b/lib/lz4.h @@ -44,7 +44,7 @@ extern "C" { * Block compression functions are not-enough to send information, * since it's still necessary to provide metadata (such as compressed size), * and each application can do it in whichever way it wants. - * For interoperability, there is LZ4 frame specification (lz4_Frame_format.md). + * For interoperability, there is LZ4 frame specification (doc/lz4_Frame_format.md). * A library is provided to take care of it, see lz4frame.h. */ diff --git a/lz4_Block_format.md b/lz4_Block_format.md deleted file mode 100644 index 0f6a5ba..0000000 --- a/lz4_Block_format.md +++ /dev/null @@ -1,127 +0,0 @@ -LZ4 Block Format Description -============================ -Last revised: 2015-05-07. -Author : Yann Collet - - -This specification is intended for developers -willing to produce LZ4-compatible compressed data blocks -using any programming language. - -LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. -There is no entropy encoder back-end nor framing layer. -The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]). -This design is assumed to favor simplicity and speed. -It helps later on for optimizations, compactness, and features. - -This document describes only the block format, -not how the compressor nor decompressor actually work. -The correctness of the decompressor should not depend -on implementation details of the compressor, and vice versa. - -[LZ4 Frame format]: lz4_Frame_format.md - - - -Compressed block format ------------------------ -An LZ4 compressed block is composed of sequences. -A sequence is a suite of literals (not-compressed bytes), -followed by a match copy. - -Each sequence starts with a token. -The token is a one byte value, separated into two 4-bits fields. -Therefore each field ranges from 0 to 15. - - -The first field uses the 4 high-bits of the token. -It provides the length of literals to follow. - -If the field value is 0, then there is no literal. -If it is 15, then we need to add some more bytes to indicate the full length. -Each additional byte then represent a value from 0 to 255, -which is added to the previous value to produce a total length. -When the byte value is 255, another byte is output. -There can be any number of bytes following the token. There is no "size limit". -(Side note : this is why a not-compressible input block is expanded by 0.4%). - -Example 1 : A length of 48 will be represented as : - - - 15 : value for the 4-bits High field - - 33 : (=48-15) remaining length to reach 48 - -Example 2 : A length of 280 will be represented as : - - - 15 : value for the 4-bits High field - - 255 : following byte is maxed, since 280-15 >= 255 - - 10 : (=280 - 15 - 255) ) remaining length to reach 280 - -Example 3 : A length of 15 will be represented as : - - - 15 : value for the 4-bits High field - - 0 : (=15-15) yes, the zero must be output - -Following the token and optional length bytes, are the literals themselves. -They are exactly as numerous as previously decoded (length of literals). -It's possible that there are zero literal. - - -Following the literals is the match copy operation. - -It starts by the offset. -This is a 2 bytes value, in little endian format -(the 1st byte is the "low" byte, the 2nd one is the "high" byte). - -The offset represents the position of the match to be copied from. -1 means "current position - 1 byte". -The maximum offset value is 65535, 65536 cannot be coded. -Note that 0 is an invalid value, not used. - -Then we need to extract the match length. -For this, we use the second token field, the low 4-bits. -Value, obviously, ranges from 0 to 15. -However here, 0 means that the copy operation will be minimal. -The minimum length of a match, called minmatch, is 4. -As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. -Similar to literal length, on reaching the highest possible value (15), -we output additional bytes, one at a time, with values ranging from 0 to 255. -They are added to total to provide the final match length. -A 255 value means there is another byte to read and add. -There is no limit to the number of optional bytes that can be output this way. -(This points towards a maximum achievable compression ratio of about 250). - -With the offset and the matchlength, -the decoder can now proceed to copy the data from the already decoded buffer. -On decoding the matchlength, we reach the end of the compressed sequence, -and therefore start another one. - - -Parsing restrictions ------------------------ -There are specific parsing rules to respect in order to remain compatible -with assumptions made by the decoder : - -1. The last 5 bytes are always literals -2. The last match must start at least 12 bytes before end of block. - Consequently, a block with less than 13 bytes cannot be compressed. - -These rules are in place to ensure that the decoder -will never read beyond the input buffer, nor write beyond the output buffer. - -Note that the last sequence is also incomplete, -and stops right after literals. - - -Additional notes ------------------------ -There is no assumption nor limits to the way the compressor -searches and selects matches within the source data block. -It could be a fast scan, a multi-probe, a full search using BST, -standard hash chains or MMC, well whatever. - -Advanced parsing strategies can also be implemented, such as lazy match, -or full optimal parsing. - -All these trade-off offer distinctive speed/memory/compression advantages. -Whatever the method used by the compressor, its result will be decodable -by any LZ4 decoder if it follows the format specification described above. diff --git a/lz4_Frame_format.md b/lz4_Frame_format.md deleted file mode 100644 index 2ea1a86..0000000 --- a/lz4_Frame_format.md +++ /dev/null @@ -1,385 +0,0 @@ -LZ4 Frame Format Description -============================ - -###Notices - -Copyright (c) 2013-2015 Yann Collet - -Permission is granted to copy and distribute this document -for any purpose and without charge, -including translations into other languages -and incorporation into compilations, -provided that the copyright notice and this notice are preserved, -and that any substantive changes or deletions from the original -are clearly marked. -Distribution of this document is unlimited. - -###Version - -1.5.1 (31/03/2015) - - -Introduction ------------- - -The purpose of this document is to define a lossless compressed data format, -that is independent of CPU type, operating system, -file system and character set, suitable for -File compression, Pipe and streaming compression -using the [LZ4 algorithm](http://www.lz4.org). - -The data can be produced or consumed, -even for an arbitrarily long sequentially presented input data stream, -using only an a priori bounded amount of intermediate storage, -and hence can be used in data communications. -The format uses the LZ4 compression method, -and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash), -for detection of data corruption. - -The data format defined by this specification -does not attempt to allow random access to compressed data. - -This specification is intended for use by implementers of software -to compress data into LZ4 format and/or decompress data from LZ4 format. -The text of the specification assumes a basic background in programming -at the level of bits and other primitive data representations. - -Unless otherwise indicated below, -a compliant compressor must produce data sets -that conform to the specifications presented here. -It doesn’t need to support all options though. - -A compliant decompressor must be able to decompress -at least one working set of parameters -that conforms to the specifications presented here. -It may also ignore checksums. -Whenever it does not support a specific parameter within the compressed stream, -it must produce a non-ambiguous error code -and associated error message explaining which parameter is unsupported. - - -General Structure of LZ4 Frame format -------------------------------------- - -| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum | -|:-------:|:-------------:| ----- | ----- | ------- | ----------- | -| 4 bytes | 3-11 bytes | | | 4 bytes | 0-4 bytes | - -__Magic Number__ - -4 Bytes, Little endian format. -Value : 0x184D2204 - -__Frame Descriptor__ - -3 to 11 Bytes, to be detailed in the next part. -Most important part of the spec. - -__Data Blocks__ - -To be detailed later on. -That’s where compressed data is stored. - -__EndMark__ - -The flow of blocks ends when the last data block has a size of “0”. -The size is expressed as a 32-bits value. - -__Content Checksum__ - -Content Checksum verify that the full content has been decoded correctly. -The content checksum is the result -of [xxh32() hash function](https://github.com/Cyan4973/xxHash) -digesting the original (decoded) data as input, and a seed of zero. -Content checksum is only present when its associated flag -is set in the frame descriptor. -Content Checksum validates the result, -that all blocks were fully transmitted in the correct order and without error, -and also that the encoding/decoding process itself generated no distortion. -Its usage is recommended. - -__Frame Concatenation__ - -In some circumstances, it may be preferable to append multiple frames, -for example in order to add new data to an existing compressed file -without re-framing it. - -In such case, each frame has its own set of descriptor flags. -Each frame is considered independent. -The only relation between frames is their sequential order. - -The ability to decode multiple concatenated frames -within a single stream or file -is left outside of this specification. -As an example, the reference lz4 command line utility behavior is -to decode all concatenated frames in their sequential order. - - -Frame Descriptor ----------------- - -| FLG | BD | (Content Size) | HC | -| ------- | ------- |:--------------:| ------- | -| 1 byte | 1 byte | 0 - 8 bytes | 1 byte | - -The descriptor uses a minimum of 3 bytes, -and up to 11 bytes depending on optional parameters. - -__FLG byte__ - -| BitNb | 7-6 | 5 | 4 | 3 | 2 | 1-0 | -| ------- | ------- | ------- | --------- | ------- | --------- | -------- | -|FieldName| Version | B.Indep | B.Checksum| C.Size | C.Checksum|*Reserved*| - - -__BD byte__ - -| BitNb | 7 | 6-5-4 | 3-2-1-0 | -| ------- | -------- | ------------ | -------- | -|FieldName|*Reserved*| Block MaxSize|*Reserved*| - -In the tables, bit 7 is highest bit, while bit 0 is lowest. - -__Version Number__ - -2-bits field, must be set to “01”. -Any other value cannot be decoded by this version of the specification. -Other version numbers will use different flag layouts. - -__Block Independence flag__ - -If this flag is set to “1”, blocks are independent. -If this flag is set to “0”, each block depends on previous ones -(up to LZ4 window size, which is 64 KB). -In such case, it’s necessary to decode all blocks in sequence. - -Block dependency improves compression ratio, especially for small blocks. -On the other hand, it makes direct jumps or multi-threaded decoding impossible. - -__Block checksum flag__ - -If this flag is set, each data block will be followed by a 4-bytes checksum, -calculated by using the xxHash-32 algorithm on the raw (compressed) data block. -The intention is to detect data corruption (storage or transmission errors) -immediately, before decoding. -Block checksum usage is optional. - -__Content Size flag__ - -If this flag is set, the uncompressed size of data included within the frame -will be present as an 8 bytes unsigned little endian value, after the flags. -Content Size usage is optional. - -__Content checksum flag__ - -If this flag is set, a content checksum will be appended after the EndMark. - -Recommended value : “1” (content checksum is present) - -__Block Maximum Size__ - -This information is intended to help the decoder allocate memory. -Size here refers to the original (uncompressed) data size. -Block Maximum Size is one value among the following table : - -| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | -| --- | --- | --- | --- | ----- | ------ | ---- | ---- | -| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB | - -The decoder may refuse to allocate block sizes above a (system-specific) size. -Unused values may be used in a future revision of the spec. -A decoder conformant to the current version of the spec -is only able to decode blocksizes defined in this spec. - -__Reserved bits__ - -Value of reserved bits **must** be 0 (zero). -Reserved bit might be used in a future version of the specification, -typically enabling new optional features. -If this happens, a decoder respecting the current version of the specification -shall not be able to decode such a frame. - -__Content Size__ - -This is the original (uncompressed) size. -This information is optional, and only present if the associated flag is set. -Content size is provided using unsigned 8 Bytes, for a maximum of 16 HexaBytes. -Format is Little endian. -This value is informational, typically for display or memory allocation. -It can be skipped by a decoder, or used to validate content correctness. - -__Header Checksum__ - -One-byte checksum of combined descriptor fields, including optional ones. -The value is the second byte of xxh32() : ` (xxh32()>>8) & 0xFF ` -using zero as a seed, -and the full Frame Descriptor as an input -(including optional fields when they are present). -A wrong checksum indicates an error in the descriptor. -Header checksum is informational and can be skipped. - - -Data Blocks ------------ - -| Block Size | data | (Block Checksum) | -|:----------:| ------ |:----------------:| -| 4 bytes | | 0 - 4 bytes | - - -__Block Size__ - -This field uses 4-bytes, format is little-endian. - -The highest bit is “1” if data in the block is uncompressed. - -The highest bit is “0” if data in the block is compressed by LZ4. - -All other bits give the size, in bytes, of the following data block -(the size does not include the block checksum if present). - -Block Size shall never be larger than Block Maximum Size. -Such a thing could happen for incompressible source data. -In such case, such a data block shall be passed in uncompressed format. - -__Data__ - -Where the actual data to decode stands. -It might be compressed or not, depending on previous field indications. -Uncompressed size of Data can be any size, up to “block maximum size”. -Note that data block is not necessarily full : -an arbitrary “flush” may happen anytime. Any block can be “partially filled”. - -__Block checksum__ - -Only present if the associated flag is set. -This is a 4-bytes checksum value, in little endian format, -calculated by using the xxHash-32 algorithm on the raw (undecoded) data block, -and a seed of zero. -The intention is to detect data corruption (storage or transmission errors) -before decoding. - -Block checksum is cumulative with Content checksum. - - -Skippable Frames ----------------- - -| Magic Number | Frame Size | User Data | -|:------------:|:----------:| --------- | -| 4 bytes | 4 bytes | | - -Skippable frames allow the integration of user-defined data -into a flow of concatenated frames. -Its design is pretty straightforward, -with the sole objective to allow the decoder to quickly skip -over user-defined data and continue decoding. - -For the purpose of facilitating identification, -it is discouraged to start a flow of concatenated frames with a skippable frame. -If there is a need to start such a flow with some user data -encapsulated into a skippable frame, -it’s recommended to start with a zero-byte LZ4 frame -followed by a skippable frame. -This will make it easier for file type identifiers. - - -__Magic Number__ - -4 Bytes, Little endian format. -Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. -All 16 values are valid to identify a skippable frame. - -__Frame Size__ - -This is the size, in bytes, of the following User Data -(without including the magic number nor the size field itself). -4 Bytes, Little endian format, unsigned 32-bits. -This means User Data can’t be bigger than (2^32-1) Bytes. - -__User Data__ - -User Data can be anything. Data will just be skipped by the decoder. - - -Legacy frame ------------- - -The Legacy frame format was defined into the initial versions of “LZ4Demo”. -Newer compressors should not use this format anymore, as it is too restrictive. - -Main characteristics of the legacy format : - -- Fixed block size : 8 MB. -- All blocks must be completely filled, except the last one. -- All blocks are always compressed, even when compression is detrimental. -- The last block is detected either because - it is followed by the “EOF” (End of File) mark, - or because it is followed by a known Frame Magic Number. -- No checksum -- Convention is Little endian - -| MagicNb | B.CSize | CData | B.CSize | CData | (...) | EndMark | -| ------- | ------- | ----- | ------- | ----- | ------- | ------- | -| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times | EOF | - - -__Magic Number__ - -4 Bytes, Little endian format. -Value : 0x184C2102 - -__Block Compressed Size__ - -This is the size, in bytes, of the following compressed data block. -4 Bytes, Little endian format. - -__Data__ - -Where the actual compressed data stands. -Data is always compressed, even when compression is detrimental. - -__EndMark__ - -End of legacy frame is implicit only. -It must be followed by a standard EOF (End Of File) signal, -wether it is a file or a stream. - -Alternatively, if the frame is followed by a valid Frame Magic Number, -it is considered completed. -It makes legacy frames compatible with frame concatenation. - -Any other value will be interpreted as a block size, -and trigger an error if it does not fit within acceptable range. - - -Version changes ---------------- - -1.5.1 : changed format to MarkDown compatible - -1.5 : removed Dictionary ID from specification - -1.4.1 : changed wording from “stream” to “frame” - -1.4 : added skippable streams, re-added stream checksum - -1.3 : modified header checksum - -1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”. - -1.1 : optional fields are now part of the descriptor - -1.0 : changed “block size” specification, adding a compressed/uncompressed flag - -0.9 : reduced scale of “block maximum size” table - -0.8 : removed : high compression flag - -0.7 : removed : stream checksum - -0.6 : settled : stream size uses 8 bytes, endian convention is little endian - -0.5: added copyright notice - -0.4 : changed format to Google Doc compatible OpenDocument -- cgit v0.12