From 9523101aff2f7d38572b22b522430b9af39df873 Mon Sep 17 00:00:00 2001 From: Yann Collet Date: Mon, 1 Dec 2014 00:53:20 +0100 Subject: Clarified some file names --- LZ4_Frame_Format.html | 1 + LZ4_Framing_Format.html | 1 - liblz4.pc.in | 14 ------ lz4_block_format.txt | 120 +++++++++++++++++++++++++++++++++++++++++++++ lz4_format_description.txt | 120 --------------------------------------------- 5 files changed, 121 insertions(+), 135 deletions(-) create mode 100644 LZ4_Frame_Format.html delete mode 100644 LZ4_Framing_Format.html delete mode 100644 liblz4.pc.in create mode 100644 lz4_block_format.txt delete mode 100644 lz4_format_description.txt diff --git a/LZ4_Frame_Format.html b/LZ4_Frame_Format.html new file mode 100644 index 0000000..bd1304c --- /dev/null +++ b/LZ4_Frame_Format.html @@ -0,0 +1 @@ +LZ4 Framing format - stable

LZ4 Framing Format

Notices

Permission is granted to copy and distribute this document for any purpose and without charge, including translations into other languages and incorporation into compilations, provided that the copyright notice and this notice are preserved, and that any substantive changes or deletions from the original are clearly marked.

Version

1.4.1

Introduction

The purpose of this document is to define a lossless compressed data format, that is independent of CPU type, operating system, file system and character set, suitable for File compression, Pipe and streaming compression using the LZ4 algorithm : http://code.google.com/p/lz4/

The data can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage, and hence can be used in data communications. The format uses the LZ4 compression method, and xxHash-32 checksum method, for detection of data corruption.

The data format defined by this specification does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software to compress data into LZ4 format and/or decompress data from LZ4 format. The text of the specification assumes a basic background in programming at the level of bits and other primitive data representations.

Unless otherwise indicated below, a compliant compressor must produce data sets that conform to all the specifications presented here.

A compliant decompressor must be able to accept and decompress at least one data set that conforms to the specifications presented here; whenever it does not support any parameter, it must produce a non-ambiguous error code and associated error message explaining which parameter value is unsupported (a typical example being an unsupported buffer size).

Distribution of this document is unlimited.

Summary :

Introduction

General structure of LZ4 Framing Format

Frame Descriptor

Data Blocks

Skippable Frames

Legacy format

Appendix

General Structure of LZ4 Framing format

LZ4 Framing Format - General Structure.png

Magic Number

4 Bytes, Little endian format.
Value : 0x184D2204

Frame Descriptor

3 to 15 Bytes, to be detailed in the next part.
Most significant part of the spec.

Data Blocks

To be detailed later on.
That’s where compressed data is stored.

EndMark

The flow of blocks ends when the last data block has a size of “0”.
The size is expressed as a 32-bits value.

Content Checksum

Content Checksum verify that the full content has been decoded correctly.
The content checksum is the result of xxh32() hash function digesting the original (decoded) data as input, and a seed of zero.
Content checksum is only present when its associated flag is set in the framing descriptor. Content Checksum validates the result, that all blocks were fully transmitted in the correct order and without error, and also that the encoding/decoding process itself generated no distortion. Its usage is recommended.

Frame Concatenation

In some circumstances, it may be preferable to append multiple frames, for example in order to add new data to an existing compressed file without re-framing it.

In such case, each frame has its own set of descriptor flags. Each frame is considered independent. The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames within a single stream or file is left outside of this specification. While a logical default behavior could be to decode the frames in their sequential order, this is not a requirement.

Frame Descriptor

LZ4 Framing Format - Frame Descriptor.png

LZ4 Framing Format - Descriptor Flags.png

The descriptor uses a minimum of 3 bytes, and up to 15 bytes depending on optional parameters.
In the picture, bit 7 is highest bit, while bit 0 is lowest.

Version Number :

2-bits field, must be set to “01”.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

Block Independence flag :

If this flag is set to “1”, blocks are independent, and can therefore be decoded independently, in parallel.
If this flag is set to “0”, each block depends on previous ones for decoding (up to LZ4 window size, which is 64 KB). In this case, it’s necessary to decode all blocks in sequence.

Block dependency improves compression ratio, especially for small blocks. On the other hand, it makes jumps or multi-threaded decoding impossible.

Block checksum flag :

If this flag is set, each data block will be followed by a 4-bytes checksum, calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors) immediately, before decoding.
Block checksum usage is optional.

Content Size flag :

If this flag is set, the original (uncompressed) size of data included within the frame will be present as an 8 bytes unsigned value, little endian format, after the flags.

Recommended value : “0” (not present)

Content checksum flag :

If this flag is set, a content checksum will be appended after the EoS mark.

Recommended value : “1” (content checksum is present)

Preset Dictionary flag :

If this flag is set, a Dict-ID field will be present, just after the descriptor flags and the Content size.

Usual value : “0” (not present)

Block Maximum Size :

This information is intended to help the decoder allocate the right amount of memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one value among the following table :

The decoder may refuse to allocate block sizes above a (system-specific) size.
Unused values may be used in a future revision of the spec.
A decoder conformant to the current version of the spec is only able to decode blocksizes defined in this spec.

Reserved bits :

Value of reserved bits must be 0 (zero).
Reserved bit might be used in a future version of the specification, to enable any (yet-to-decide) optional feature.
If this happens, a decoder respecting the current version of the specification shall not be able to decode such a frame.

Content Size

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is provided using unsigned 8 Bytes, for a maximum of 16 HexaBytes.
Format is Little endian.
This field has no impact on decoding, it just informs the decoder how much data the frame holds (for example, to display it during decoding process, or for verification purpose). It can be safely skipped by a conformant decoder.

Dictionary ID

Dict-ID is only present if the associated flag is set.
A dictionary is specially useful to compress short input sequences. The compressor can take advantage of the dictionary context to encode the input in a more compact manner. It works as a kind of “known prefix” which is used by both the compressor and the decompressor to “warm-up” reference tables and help compress small data blocks.

Dict-ID is the xxHash-32 checksum of this “known prefix”. Format is Little endian.

The decompressor uses this identifier to determine which dictionary has been used by the compressor. The compressor and the decompressor must use exactly the same dictionary. This document does not specify the contents of predefined dictionaries, since the optimal dictionaries are application specific. Any data format using this feature must precisely define the allowed dictionaries.

Within a single frame, a single dictionary can be defined.
When the frame descriptor defines independent blocks, each block will be initialised with the same dictionary.
If the frame descriptor defines linked blocks, the dictionary will only be used once, at the beginning of the decoding process.

Header Checksum :

One-byte checksum of all descriptor fields, including optional ones when present.
The byte is second byte of xxh32() : { (xxh32()>>8) & 0xFF } ,
using zero as a seed,
and the full Frame Descriptor as an input (including optional fields when they are present).
A different checksum indicates an error in the descriptor.

Data Blocks

Block Size

This field uses 4-bytes, format is little-endian.

The highest bit is “1” if data in the block is uncompressed.

The highest bit is “0” if data in the block is compressed by LZ4.

All other bits give the size, in bytes, of the following data block (the size does not include the checksum if present).

Block Size shall never be larger than Block Maximum Size. Such a thing could happen when the original data is incompressible. In this case, such a data block shall be passed in uncompressed format.

Data

Where the actual data to decode stands. It might be compressed or not, depending on previous field indications.
Uncompressed size of Data can be any size, up to “block maximum size”.
Note that the data block is not necessarily filled : an arbitrary “flush” may happen anytime. Any block can be “partially filled”.

Block checksum :

Only present if the associated flag is set.
This is a 4-bytes checksum value, in little endian format,
calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors) before decoding.

Block checksum is cumulative with Content checksum.

Skippable Frames

LZ4 Framing Format - Skippable Frame.png

Skippable frames allow the integration of user-defined data into a flow of concatenated frames.
Its design is pretty straightforward, with the sole objective to allow the decoder to quickly skip over user-defined data and continue decoding.

For the purpose of facilitating identification, it is discouraged to start a flow of concatenated frames with a skippable frame. If there is a need to start such a flow with some user data encapsulated into a skippable frame, it’s recommended to start will a zero-byte LZ4 frame followed by a skippable frame. This will make it easier for file type identifiers.

Magic Number

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. All 16 values are valid to identify a skippable frame.

Frame Size

This is the size, in bytes, of the following User Data (without including the magic number nor the size field itself).
4 Bytes, Little endian format, unsigned 32-bits.
This means User Data can’t be bigger than (2^32-1) Bytes.

User Data

User Data can be anything. Data will just be skipped by the decoder.

Legacy frame

The Legacy frame format was defined into the initial versions of “LZ4Demo”.
Newer compressors should not use this format anymore, since it is too restrictive.
It is recommended that decompressors shall be able to decode this format during the transition period.

Main properties of legacy format :
- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- All blocks are always compressed, even when compression is detrimental.
- The last block is detected either because it is followed by the “EOF” (End of File) mark, or because it is followed by a known Frame Magic Number.
- No checksum
- Convention is Little endian

Magic Number

4 Bytes, Little endian format.
Value : 0x184C2102

Block Compressed Size

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

Data

Where the actual data stands.
Data is always compressed, even when compression is detrimental (i.e. larger than original size).

Appendix

Version changes

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5: added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument

\ No newline at end of file diff --git a/LZ4_Framing_Format.html b/LZ4_Framing_Format.html deleted file mode 100644 index bd1304c..0000000 --- a/LZ4_Framing_Format.html +++ /dev/null @@ -1 +0,0 @@ -LZ4 Framing format - stable

LZ4 Framing Format

Notices

Version

1.4.1

Introduction

The data format defined by this specification does not attempt to allow random access to compressed data.

Unless otherwise indicated below, a compliant compressor must produce data sets that conform to all the specifications presented here.

Distribution of this document is unlimited.

Summary :

Introduction

General structure of LZ4 Framing Format

Frame Descriptor

Data Blocks

Skippable Frames

Legacy format

Appendix

General Structure of LZ4 Framing format

LZ4 Framing Format - General Structure.png

Magic Number

4 Bytes, Little endian format.
Value : 0x184D2204

Frame Descriptor

3 to 15 Bytes, to be detailed in the next part.
Most significant part of the spec.

Data Blocks

To be detailed later on.
That’s where compressed data is stored.

EndMark

The flow of blocks ends when the last data block has a size of “0”.
The size is expressed as a 32-bits value.

Content Checksum

Frame Concatenation

In some circumstances, it may be preferable to append multiple frames, for example in order to add new data to an existing compressed file without re-framing it.

In such case, each frame has its own set of descriptor flags. Each frame is considered independent. The only relation between frames is their sequential order.

Frame Descriptor

LZ4 Framing Format - Frame Descriptor.png

LZ4 Framing Format - Descriptor Flags.png

The descriptor uses a minimum of 3 bytes, and up to 15 bytes depending on optional parameters.
In the picture, bit 7 is highest bit, while bit 0 is lowest.

Version Number :

2-bits field, must be set to “01”.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

Block Independence flag :

Block dependency improves compression ratio, especially for small blocks. On the other hand, it makes jumps or multi-threaded decoding impossible.

Block checksum flag :

Content Size flag :

If this flag is set, the original (uncompressed) size of data included within the frame will be present as an 8 bytes unsigned value, little endian format, after the flags.

Recommended value : “0” (not present)

Content checksum flag :

If this flag is set, a content checksum will be appended after the EoS mark.

Recommended value : “1” (content checksum is present)

Preset Dictionary flag :

If this flag is set, a Dict-ID field will be present, just after the descriptor flags and the Content size.

Usual value : “0” (not present)

Block Maximum Size :

Reserved bits :

Content Size

Dictionary ID

Dict-ID is the xxHash-32 checksum of this “known prefix”. Format is Little endian.

Header Checksum :

Data Blocks

Block Size

This field uses 4-bytes, format is little-endian.

The highest bit is “1” if data in the block is uncompressed.

The highest bit is “0” if data in the block is compressed by LZ4.

All other bits give the size, in bytes, of the following data block (the size does not include the checksum if present).

Block Size shall never be larger than Block Maximum Size. Such a thing could happen when the original data is incompressible. In this case, such a data block shall be passed in uncompressed format.

Data

Block checksum :

Block checksum is cumulative with Content checksum.

Skippable Frames

LZ4 Framing Format - Skippable Frame.png

Magic Number

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. All 16 values are valid to identify a skippable frame.

Frame Size

User Data

User Data can be anything. Data will just be skipped by the decoder.

Legacy frame

Magic Number

4 Bytes, Little endian format.
Value : 0x184C2102

Block Compressed Size

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

Data

Where the actual data stands.
Data is always compressed, even when compression is detrimental (i.e. larger than original size).

Appendix

Version changes

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5: added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument

\ No newline at end of file diff --git a/liblz4.pc.in b/liblz4.pc.in deleted file mode 100644 index 0d05152..0000000 --- a/liblz4.pc.in +++ /dev/null @@ -1,14 +0,0 @@ -# LZ4 - Fast LZ compression algorithm -# Copyright (C) 2011-2014, Yann Collet. -# BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) - -prefix=@PREFIX@ -libdir=@LIBDIR@ -includedir=@INCLUDEDIR@ - -Name: lz4 -Description: fast lossless compression algorithm library -URL: http://code.google.com/p/lz4/ -Version: @VERSION@ -Libs: -L@LIBDIR@ -llz4 -Cflags: -I@INCLUDEDIR@ diff --git a/lz4_block_format.txt b/lz4_block_format.txt new file mode 100644 index 0000000..2c424c5 --- /dev/null +++ b/lz4_block_format.txt @@ -0,0 +1,120 @@ +LZ4 Format Description +Last revised: 2012-02-27 +Author : Y. Collet + + + +This small specification intents to provide enough information +to anyone willing to produce LZ4-compatible compressed data blocks +using any programming language. + +LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. +The most important design principle behind LZ4 is simplicity. +It helps to create an easy to read and maintain source code. +It also helps later on for optimisations, compactness, and speed. +There is no entropy encoder backend nor framing layer. +The latter is assumed to be handled by other parts of the system. + +This document only describes the format, +not how the LZ4 compressor nor decompressor actually work. +The correctness of the decompressor should not depend +on implementation details of the compressor, and vice versa. + + + +-- Compressed block format -- + +An LZ4 compressed block is composed of sequences. +Schematically, a sequence is a suite of literals, followed by a match copy. + +Each sequence starts with a token. +The token is a one byte value, separated into two 4-bits fields. +Therefore each field ranges from 0 to 15. + + +The first field uses the 4 high-bits of the token. +It provides the length of literals to follow. +(Note : a literal is a not-compressed byte). +If the field value is 0, then there is no literal. +If it is 15, then we need to add some more bytes to indicate the full length. +Each additionnal byte then represent a value from 0 to 255, +which is added to the previous value to produce a total length. +When the byte value is 255, another byte is output. +There can be any number of bytes following the token. There is no "size limit". +(Sidenote this is why a not-compressible input block is expanded by 0.4%). + +Example 1 : A length of 48 will be represented as : +- 15 : value for the 4-bits High field +- 33 : (=48-15) remaining length to reach 48 + +Example 2 : A length of 280 will be represented as : +- 15 : value for the 4-bits High field +- 255 : following byte is maxed, since 280-15 >= 255 +- 10 : (=280 - 15 - 255) ) remaining length to reach 280 + +Example 3 : A length of 15 will be represented as : +- 15 : value for the 4-bits High field +- 0 : (=15-15) yes, the zero must be output + +Following the token and optional length bytes, are the literals themselves. +They are exactly as numerous as previously decoded (length of literals). +It's possible that there are zero literal. + + +Following the literals is the match copy operation. + +It starts by the offset. +This is a 2 bytes value, in little endian format. + +The offset represents the position of the match to be copied from. +1 means "current position - 1 byte". +The maximum offset value is 65535, 65536 cannot be coded. +Note that 0 is an invalid value, not used. + +Then we need to extract the match length. +For this, we use the second token field, the low 4-bits. +Value, obviously, ranges from 0 to 15. +However here, 0 means that the copy operation will be minimal. +The minimum length of a match, called minmatch, is 4. +As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. +Similar to literal length, on reaching the highest possible value (15), +we output additional bytes, one at a time, with values ranging from 0 to 255. +They are added to total to provide the final match length. +A 255 value means there is another byte to read and add. +There is no limit to the number of optional bytes that can be output this way. +(This points towards a maximum achievable compression ratio of ~250). + +With the offset and the matchlength, +the decoder can now proceed to copy the data from the already decoded buffer. +On decoding the matchlength, we reach the end of the compressed sequence, +and therefore start another one. + + +-- Parsing restrictions -- + +There are specific parsing rules to respect in order to remain compatible +with assumptions made by the decoder : +1) The last 5 bytes are always literals +2) The last match must start at least 12 bytes before end of block +Consequently, a block with less than 13 bytes cannot be compressed. +These rules are in place to ensure that the decoder +will never read beyond the input buffer, nor write beyond the output buffer. + +Note that the last sequence is also incomplete, +and stops right after literals. + + +-- Additional notes -- + +There is no assumption nor limits to the way the compressor +searches and selects matches within the source data block. +It could be a fast scan, a multi-probe, a full search using BST, +standard hash chains or MMC, well whatever. + +Advanced parsing strategies can also be implemented, such as lazy match, +or full optimal parsing. + +All these trade-off offer distinctive speed/memory/compression advantages. +Whatever the method used by the compressor, its result will be decodable +by any LZ4 decoder if it follows the format specification described above. + diff --git a/lz4_format_description.txt b/lz4_format_description.txt deleted file mode 100644 index 2c424c5..0000000 --- a/lz4_format_description.txt +++ /dev/null @@ -1,120 +0,0 @@ -LZ4 Format Description -Last revised: 2012-02-27 -Author : Y. Collet - - - -This small specification intents to provide enough information -to anyone willing to produce LZ4-compatible compressed data blocks -using any programming language. - -LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. -The most important design principle behind LZ4 is simplicity. -It helps to create an easy to read and maintain source code. -It also helps later on for optimisations, compactness, and speed. -There is no entropy encoder backend nor framing layer. -The latter is assumed to be handled by other parts of the system. - -This document only describes the format, -not how the LZ4 compressor nor decompressor actually work. -The correctness of the decompressor should not depend -on implementation details of the compressor, and vice versa. - - - --- Compressed block format -- - -An LZ4 compressed block is composed of sequences. -Schematically, a sequence is a suite of literals, followed by a match copy. - -Each sequence starts with a token. -The token is a one byte value, separated into two 4-bits fields. -Therefore each field ranges from 0 to 15. - - -The first field uses the 4 high-bits of the token. -It provides the length of literals to follow. -(Note : a literal is a not-compressed byte). -If the field value is 0, then there is no literal. -If it is 15, then we need to add some more bytes to indicate the full length. -Each additionnal byte then represent a value from 0 to 255, -which is added to the previous value to produce a total length. -When the byte value is 255, another byte is output. -There can be any number of bytes following the token. There is no "size limit". -(Sidenote this is why a not-compressible input block is expanded by 0.4%). - -Example 1 : A length of 48 will be represented as : -- 15 : value for the 4-bits High field -- 33 : (=48-15) remaining length to reach 48 - -Example 2 : A length of 280 will be represented as : -- 15 : value for the 4-bits High field -- 255 : following byte is maxed, since 280-15 >= 255 -- 10 : (=280 - 15 - 255) ) remaining length to reach 280 - -Example 3 : A length of 15 will be represented as : -- 15 : value for the 4-bits High field -- 0 : (=15-15) yes, the zero must be output - -Following the token and optional length bytes, are the literals themselves. -They are exactly as numerous as previously decoded (length of literals). -It's possible that there are zero literal. - - -Following the literals is the match copy operation. - -It starts by the offset. -This is a 2 bytes value, in little endian format. - -The offset represents the position of the match to be copied from. -1 means "current position - 1 byte". -The maximum offset value is 65535, 65536 cannot be coded. -Note that 0 is an invalid value, not used. - -Then we need to extract the match length. -For this, we use the second token field, the low 4-bits. -Value, obviously, ranges from 0 to 15. -However here, 0 means that the copy operation will be minimal. -The minimum length of a match, called minmatch, is 4. -As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. -Similar to literal length, on reaching the highest possible value (15), -we output additional bytes, one at a time, with values ranging from 0 to 255. -They are added to total to provide the final match length. -A 255 value means there is another byte to read and add. -There is no limit to the number of optional bytes that can be output this way. -(This points towards a maximum achievable compression ratio of ~250). - -With the offset and the matchlength, -the decoder can now proceed to copy the data from the already decoded buffer. -On decoding the matchlength, we reach the end of the compressed sequence, -and therefore start another one. - - --- Parsing restrictions -- - -There are specific parsing rules to respect in order to remain compatible -with assumptions made by the decoder : -1) The last 5 bytes are always literals -2) The last match must start at least 12 bytes before end of block -Consequently, a block with less than 13 bytes cannot be compressed. -These rules are in place to ensure that the decoder -will never read beyond the input buffer, nor write beyond the output buffer. - -Note that the last sequence is also incomplete, -and stops right after literals. - - --- Additional notes -- - -There is no assumption nor limits to the way the compressor -searches and selects matches within the source data block. -It could be a fast scan, a multi-probe, a full search using BST, -standard hash chains or MMC, well whatever. - -Advanced parsing strategies can also be implemented, such as lazy match, -or full optimal parsing. - -All these trade-off offer distinctive speed/memory/compression advantages. -Whatever the method used by the compressor, its result will be decodable -by any LZ4 decoder if it follows the format specification described above. - -- cgit v0.12