summaryrefslogtreecommitdiffstats
path: root/doc
diff options
context:
space:
mode:
Diffstat (limited to 'doc')
-rw-r--r--doc/lz4_Block_format.md127
-rw-r--r--doc/lz4_Frame_format.md385
2 files changed, 512 insertions, 0 deletions
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md
new file mode 100644
index 0000000..0f6a5ba
--- /dev/null
+++ b/doc/lz4_Block_format.md
@@ -0,0 +1,127 @@
+LZ4 Block Format Description
+============================
+Last revised: 2015-05-07.
+Author : Yann Collet
+
+
+This specification is intended for developers
+willing to produce LZ4-compatible compressed data blocks
+using any programming language.
+
+LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
+There is no entropy encoder back-end nor framing layer.
+The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
+This design is assumed to favor simplicity and speed.
+It helps later on for optimizations, compactness, and features.
+
+This document describes only the block format,
+not how the compressor nor decompressor actually work.
+The correctness of the decompressor should not depend
+on implementation details of the compressor, and vice versa.
+
+[LZ4 Frame format]: lz4_Frame_format.md
+
+
+
+Compressed block format
+-----------------------
+An LZ4 compressed block is composed of sequences.
+A sequence is a suite of literals (not-compressed bytes),
+followed by a match copy.
+
+Each sequence starts with a token.
+The token is a one byte value, separated into two 4-bits fields.
+Therefore each field ranges from 0 to 15.
+
+
+The first field uses the 4 high-bits of the token.
+It provides the length of literals to follow.
+
+If the field value is 0, then there is no literal.
+If it is 15, then we need to add some more bytes to indicate the full length.
+Each additional byte then represent a value from 0 to 255,
+which is added to the previous value to produce a total length.
+When the byte value is 255, another byte is output.
+There can be any number of bytes following the token. There is no "size limit".
+(Side note : this is why a not-compressible input block is expanded by 0.4%).
+
+Example 1 : A length of 48 will be represented as :
+
+ - 15 : value for the 4-bits High field
+ - 33 : (=48-15) remaining length to reach 48
+
+Example 2 : A length of 280 will be represented as :
+
+ - 15 : value for the 4-bits High field
+ - 255 : following byte is maxed, since 280-15 >= 255
+ - 10 : (=280 - 15 - 255) ) remaining length to reach 280
+
+Example 3 : A length of 15 will be represented as :
+
+ - 15 : value for the 4-bits High field
+ - 0 : (=15-15) yes, the zero must be output
+
+Following the token and optional length bytes, are the literals themselves.
+They are exactly as numerous as previously decoded (length of literals).
+It's possible that there are zero literal.
+
+
+Following the literals is the match copy operation.
+
+It starts by the offset.
+This is a 2 bytes value, in little endian format
+(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
+
+The offset represents the position of the match to be copied from.
+1 means "current position - 1 byte".
+The maximum offset value is 65535, 65536 cannot be coded.
+Note that 0 is an invalid value, not used.
+
+Then we need to extract the match length.
+For this, we use the second token field, the low 4-bits.
+Value, obviously, ranges from 0 to 15.
+However here, 0 means that the copy operation will be minimal.
+The minimum length of a match, called minmatch, is 4.
+As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
+Similar to literal length, on reaching the highest possible value (15),
+we output additional bytes, one at a time, with values ranging from 0 to 255.
+They are added to total to provide the final match length.
+A 255 value means there is another byte to read and add.
+There is no limit to the number of optional bytes that can be output this way.
+(This points towards a maximum achievable compression ratio of about 250).
+
+With the offset and the matchlength,
+the decoder can now proceed to copy the data from the already decoded buffer.
+On decoding the matchlength, we reach the end of the compressed sequence,
+and therefore start another one.
+
+
+Parsing restrictions
+-----------------------
+There are specific parsing rules to respect in order to remain compatible
+with assumptions made by the decoder :
+
+1. The last 5 bytes are always literals
+2. The last match must start at least 12 bytes before end of block.
+ Consequently, a block with less than 13 bytes cannot be compressed.
+
+These rules are in place to ensure that the decoder
+will never read beyond the input buffer, nor write beyond the output buffer.
+
+Note that the last sequence is also incomplete,
+and stops right after literals.
+
+
+Additional notes
+-----------------------
+There is no assumption nor limits to the way the compressor
+searches and selects matches within the source data block.
+It could be a fast scan, a multi-probe, a full search using BST,
+standard hash chains or MMC, well whatever.
+
+Advanced parsing strategies can also be implemented, such as lazy match,
+or full optimal parsing.
+
+All these trade-off offer distinctive speed/memory/compression advantages.
+Whatever the method used by the compressor, its result will be decodable
+by any LZ4 decoder if it follows the format specification described above.
diff --git a/doc/lz4_Frame_format.md b/doc/lz4_Frame_format.md
new file mode 100644
index 0000000..2ea1a86
--- /dev/null
+++ b/doc/lz4_Frame_format.md
@@ -0,0 +1,385 @@
+LZ4 Frame Format Description
+============================
+
+###Notices
+
+Copyright (c) 2013-2015 Yann Collet
+
+Permission is granted to copy and distribute this document
+for any purpose and without charge,
+including translations into other languages
+and incorporation into compilations,
+provided that the copyright notice and this notice are preserved,
+and that any substantive changes or deletions from the original
+are clearly marked.
+Distribution of this document is unlimited.
+
+###Version
+
+1.5.1 (31/03/2015)
+
+
+Introduction
+------------
+
+The purpose of this document is to define a lossless compressed data format,
+that is independent of CPU type, operating system,
+file system and character set, suitable for
+File compression, Pipe and streaming compression
+using the [LZ4 algorithm](http://www.lz4.org).
+
+The data can be produced or consumed,
+even for an arbitrarily long sequentially presented input data stream,
+using only an a priori bounded amount of intermediate storage,
+and hence can be used in data communications.
+The format uses the LZ4 compression method,
+and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
+for detection of data corruption.
+
+The data format defined by this specification
+does not attempt to allow random access to compressed data.
+
+This specification is intended for use by implementers of software
+to compress data into LZ4 format and/or decompress data from LZ4 format.
+The text of the specification assumes a basic background in programming
+at the level of bits and other primitive data representations.
+
+Unless otherwise indicated below,
+a compliant compressor must produce data sets
+that conform to the specifications presented here.
+It doesn’t need to support all options though.
+
+A compliant decompressor must be able to decompress
+at least one working set of parameters
+that conforms to the specifications presented here.
+It may also ignore checksums.
+Whenever it does not support a specific parameter within the compressed stream,
+it must produce a non-ambiguous error code
+and associated error message explaining which parameter is unsupported.
+
+
+General Structure of LZ4 Frame format
+-------------------------------------
+
+| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
+|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
+| 4 bytes | 3-11 bytes | | | 4 bytes | 0-4 bytes |
+
+__Magic Number__
+
+4 Bytes, Little endian format.
+Value : 0x184D2204
+
+__Frame Descriptor__
+
+3 to 11 Bytes, to be detailed in the next part.
+Most important part of the spec.
+
+__Data Blocks__
+
+To be detailed later on.
+That’s where compressed data is stored.
+
+__EndMark__
+
+The flow of blocks ends when the last data block has a size of “0”.
+The size is expressed as a 32-bits value.
+
+__Content Checksum__
+
+Content Checksum verify that the full content has been decoded correctly.
+The content checksum is the result
+of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
+digesting the original (decoded) data as input, and a seed of zero.
+Content checksum is only present when its associated flag
+is set in the frame descriptor.
+Content Checksum validates the result,
+that all blocks were fully transmitted in the correct order and without error,
+and also that the encoding/decoding process itself generated no distortion.
+Its usage is recommended.
+
+__Frame Concatenation__
+
+In some circumstances, it may be preferable to append multiple frames,
+for example in order to add new data to an existing compressed file
+without re-framing it.
+
+In such case, each frame has its own set of descriptor flags.
+Each frame is considered independent.
+The only relation between frames is their sequential order.
+
+The ability to decode multiple concatenated frames
+within a single stream or file
+is left outside of this specification.
+As an example, the reference lz4 command line utility behavior is
+to decode all concatenated frames in their sequential order.
+
+
+Frame Descriptor
+----------------
+
+| FLG | BD | (Content Size) | HC |
+| ------- | ------- |:--------------:| ------- |
+| 1 byte | 1 byte | 0 - 8 bytes | 1 byte |
+
+The descriptor uses a minimum of 3 bytes,
+and up to 11 bytes depending on optional parameters.
+
+__FLG byte__
+
+| BitNb | 7-6 | 5 | 4 | 3 | 2 | 1-0 |
+| ------- | ------- | ------- | --------- | ------- | --------- | -------- |
+|FieldName| Version | B.Indep | B.Checksum| C.Size | C.Checksum|*Reserved*|
+
+
+__BD byte__
+
+| BitNb | 7 | 6-5-4 | 3-2-1-0 |
+| ------- | -------- | ------------ | -------- |
+|FieldName|*Reserved*| Block MaxSize|*Reserved*|
+
+In the tables, bit 7 is highest bit, while bit 0 is lowest.
+
+__Version Number__
+
+2-bits field, must be set to “01”.
+Any other value cannot be decoded by this version of the specification.
+Other version numbers will use different flag layouts.
+
+__Block Independence flag__
+
+If this flag is set to “1”, blocks are independent.
+If this flag is set to “0”, each block depends on previous ones
+(up to LZ4 window size, which is 64 KB).
+In such case, it’s necessary to decode all blocks in sequence.
+
+Block dependency improves compression ratio, especially for small blocks.
+On the other hand, it makes direct jumps or multi-threaded decoding impossible.
+
+__Block checksum flag__
+
+If this flag is set, each data block will be followed by a 4-bytes checksum,
+calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
+The intention is to detect data corruption (storage or transmission errors)
+immediately, before decoding.
+Block checksum usage is optional.
+
+__Content Size flag__
+
+If this flag is set, the uncompressed size of data included within the frame
+will be present as an 8 bytes unsigned little endian value, after the flags.
+Content Size usage is optional.
+
+__Content checksum flag__
+
+If this flag is set, a content checksum will be appended after the EndMark.
+
+Recommended value : “1” (content checksum is present)
+
+__Block Maximum Size__
+
+This information is intended to help the decoder allocate memory.
+Size here refers to the original (uncompressed) data size.
+Block Maximum Size is one value among the following table :
+
+| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+| --- | --- | --- | --- | ----- | ------ | ---- | ---- |
+| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |
+
+The decoder may refuse to allocate block sizes above a (system-specific) size.
+Unused values may be used in a future revision of the spec.
+A decoder conformant to the current version of the spec
+is only able to decode blocksizes defined in this spec.
+
+__Reserved bits__
+
+Value of reserved bits **must** be 0 (zero).
+Reserved bit might be used in a future version of the specification,
+typically enabling new optional features.
+If this happens, a decoder respecting the current version of the specification
+shall not be able to decode such a frame.
+
+__Content Size__
+
+This is the original (uncompressed) size.
+This information is optional, and only present if the associated flag is set.
+Content size is provided using unsigned 8 Bytes, for a maximum of 16 HexaBytes.
+Format is Little endian.
+This value is informational, typically for display or memory allocation.
+It can be skipped by a decoder, or used to validate content correctness.
+
+__Header Checksum__
+
+One-byte checksum of combined descriptor fields, including optional ones.
+The value is the second byte of xxh32() : ` (xxh32()>>8) & 0xFF `
+using zero as a seed,
+and the full Frame Descriptor as an input
+(including optional fields when they are present).
+A wrong checksum indicates an error in the descriptor.
+Header checksum is informational and can be skipped.
+
+
+Data Blocks
+-----------
+
+| Block Size | data | (Block Checksum) |
+|:----------:| ------ |:----------------:|
+| 4 bytes | | 0 - 4 bytes |
+
+
+__Block Size__
+
+This field uses 4-bytes, format is little-endian.
+
+The highest bit is “1” if data in the block is uncompressed.
+
+The highest bit is “0” if data in the block is compressed by LZ4.
+
+All other bits give the size, in bytes, of the following data block
+(the size does not include the block checksum if present).
+
+Block Size shall never be larger than Block Maximum Size.
+Such a thing could happen for incompressible source data.
+In such case, such a data block shall be passed in uncompressed format.
+
+__Data__
+
+Where the actual data to decode stands.
+It might be compressed or not, depending on previous field indications.
+Uncompressed size of Data can be any size, up to “block maximum size”.
+Note that data block is not necessarily full :
+an arbitrary “flush” may happen anytime. Any block can be “partially filled”.
+
+__Block checksum__
+
+Only present if the associated flag is set.
+This is a 4-bytes checksum value, in little endian format,
+calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
+and a seed of zero.
+The intention is to detect data corruption (storage or transmission errors)
+before decoding.
+
+Block checksum is cumulative with Content checksum.
+
+
+Skippable Frames
+----------------
+
+| Magic Number | Frame Size | User Data |
+|:------------:|:----------:| --------- |
+| 4 bytes | 4 bytes | |
+
+Skippable frames allow the integration of user-defined data
+into a flow of concatenated frames.
+Its design is pretty straightforward,
+with the sole objective to allow the decoder to quickly skip
+over user-defined data and continue decoding.
+
+For the purpose of facilitating identification,
+it is discouraged to start a flow of concatenated frames with a skippable frame.
+If there is a need to start such a flow with some user data
+encapsulated into a skippable frame,
+it’s recommended to start with a zero-byte LZ4 frame
+followed by a skippable frame.
+This will make it easier for file type identifiers.
+
+
+__Magic Number__
+
+4 Bytes, Little endian format.
+Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
+All 16 values are valid to identify a skippable frame.
+
+__Frame Size__
+
+This is the size, in bytes, of the following User Data
+(without including the magic number nor the size field itself).
+4 Bytes, Little endian format, unsigned 32-bits.
+This means User Data can’t be bigger than (2^32-1) Bytes.
+
+__User Data__
+
+User Data can be anything. Data will just be skipped by the decoder.
+
+
+Legacy frame
+------------
+
+The Legacy frame format was defined into the initial versions of “LZ4Demo”.
+Newer compressors should not use this format anymore, as it is too restrictive.
+
+Main characteristics of the legacy format :
+
+- Fixed block size : 8 MB.
+- All blocks must be completely filled, except the last one.
+- All blocks are always compressed, even when compression is detrimental.
+- The last block is detected either because
+ it is followed by the “EOF” (End of File) mark,
+ or because it is followed by a known Frame Magic Number.
+- No checksum
+- Convention is Little endian
+
+| MagicNb | B.CSize | CData | B.CSize | CData | (...) | EndMark |
+| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
+| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times | EOF |
+
+
+__Magic Number__
+
+4 Bytes, Little endian format.
+Value : 0x184C2102
+
+__Block Compressed Size__
+
+This is the size, in bytes, of the following compressed data block.
+4 Bytes, Little endian format.
+
+__Data__
+
+Where the actual compressed data stands.
+Data is always compressed, even when compression is detrimental.
+
+__EndMark__
+
+End of legacy frame is implicit only.
+It must be followed by a standard EOF (End Of File) signal,
+wether it is a file or a stream.
+
+Alternatively, if the frame is followed by a valid Frame Magic Number,
+it is considered completed.
+It makes legacy frames compatible with frame concatenation.
+
+Any other value will be interpreted as a block size,
+and trigger an error if it does not fit within acceptable range.
+
+
+Version changes
+---------------
+
+1.5.1 : changed format to MarkDown compatible
+
+1.5 : removed Dictionary ID from specification
+
+1.4.1 : changed wording from “stream” to “frame”
+
+1.4 : added skippable streams, re-added stream checksum
+
+1.3 : modified header checksum
+
+1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.
+
+1.1 : optional fields are now part of the descriptor
+
+1.0 : changed “block size” specification, adding a compressed/uncompressed flag
+
+0.9 : reduced scale of “block maximum size” table
+
+0.8 : removed : high compression flag
+
+0.7 : removed : stream checksum
+
+0.6 : settled : stream size uses 8 bytes, endian convention is little endian
+
+0.5: added copyright notice
+
+0.4 : changed format to Google Doc compatible OpenDocument