Added : Frame documentation in MarkDown format

author: Yann Collet <yann.collet.73@gmail.com> 2015-03-31 08:44:56 (GMT)
committer: Yann Collet <yann.collet.73@gmail.com> 2015-03-31 08:44:56 (GMT)
commit: f17496423cb7ca9be9d9d855fefcf78fbb70439a (patch)
tree: af9924fc11b72468f7debf136642cc5cf7dab032
parent: 880381c11bc9838b8609643dff64044fca8856e2 (diff)
download: lz4-f17496423cb7ca9be9d9d855fefcf78fbb70439a.zip
lz4-f17496423cb7ca9be9d9d855fefcf78fbb70439a.tar.gz
lz4-f17496423cb7ca9be9d9d855fefcf78fbb70439a.tar.bz2
3 files changed, 416 insertions, 10 deletions
diff --git a/NEWS b/NEWS
index d80b138..b692b8d 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,6 @@
 r129:
-Added : LZ4_compress_fast()
+Added  : LZ4_compress_fast()
+Updated: Documentation converted to MarkDown
 
 r128:
 New   : lz4cli sparse file support
diff --git a/README.md b/README.md
index 0f2cd40..69f7397 100644..100755
--- a/README.md
+++ b/README.md
@@ -1,8 +1,15 @@
 LZ4 - Extremely fast compression
 ================================
 
-LZ4 is lossless compression algorithm, providing compression speed at 400 MB/s per core, scalable with multi-cores CPU. It also features an extremely fast decoder, with speed in multiple GB/s per core, typically reaching RAM speed limits on multi-core systems.
-A high compression derivative, called LZ4_HC, is also provided. It trades CPU time for compression ratio.
+LZ4 is lossless compression algorithm, 
+providing compression speed at 400 MB/s per core, 
+scalable with multi-cores CPU. 
+It also features an extremely fast decoder, 
+with speed in multiple GB/s per core, 
+typically reaching RAM speed limits on multi-core systems.
+
+A high compression derivative, called LZ4_HC, is also provided. 
+It trades CPU time for compression ratio.
 
 |Branch      |Status   |
 |------------|---------|
@@ -13,16 +20,21 @@ A high compression derivative, called LZ4_HC, is also provided. It trades CPU ti
 > **Branch Policy:**
 
 > - The "master" branch is considered stable, at all times.
-> - The "dev" branch is the one where all contributions must be merged before being promoted to master.
->  - If you plan to propose a patch, please commit into the "dev" branch. Direct commit to "master" are not permitted.
-> - Feature branches can also exist, for dedicated testing of larger modifications before merge into "dev" branch.
+> - The "dev" branch is the one where all contributions must be merged 
+    before being promoted to master.
+>   + If you plan to propose a patch, please commit into the "dev" branch. 
+      Direct commit to "master" are not permitted.
+> - Feature branches can also exist,
+    for dedicated tests of larger modifications before merge into "dev" branch.
 
 Benchmarks
 -------------------------
 
-The benchmark uses the [Open-Source Benchmark program by m^2 (v0.14.3)](http://encode.ru/threads/1371-Filesystem-benchmark?p=33548&viewfull=1#post33548) compiled with GCC v4.8.2 on Linux Mint 64-bits v17.
+The benchmark uses the [Open-Source Benchmark program by m^2 (v0.14.3)]
+compiled with GCC v4.8.2 on Linux Mint 64-bits v17.
 The reference system uses a Core i5-4300U @1.9GHz.
-Benchmark evaluates the compression of reference [Silesia Corpus](http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia) in single-thread mode.
+Benchmark evaluates the compression of reference [Silesia Corpus]
+in single-thread mode.
 
 |  Compressor       | Ratio   | Compression | Decompression |
 |  ----------       | -----   | ----------- | ------------- |
@@ -40,7 +52,15 @@ Benchmark evaluates the compression of reference [Silesia Corpus](http://sun.aei
 |**LZ4 HC (r129)**  |**2.720**|   22 MB/s   | **1830 MB/s** |
 |  zlib 1.2.8 -6    |  3.099  |   18 MB/s   |    270 MB/s   |
 
-The LZ4 block compression format is detailed within [lz4_Block_format](lz4_Block_format.md).
+The LZ4 block compression format is detailed within [lz4_Block_format].
 
-For streaming unknown amount of data and compress files of any size, a frame format has been published, and can be consulted within the file LZ4_Frame_Format.html .
+Block format doesn't deal with header information, 
+nor how to handle arbitrarily long files or data streams.
+This is the purpose of the Frame format.
+Interoperable versions of LZ4 should use the same frame format, 
+defined into [lz4_Frame_format].
 
+[Open-Source Benchmark program by m^2 (v0.14.3)]: http://encode.ru/threads/1371-Filesystem-benchmark?p=34029&viewfull=1#post34029
+[Silesia Corpus]: http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
+[lz4_Block_format]: lz4_Block_format.md
+[lz4_Frame_format]: lz4_Frame_format.md
+\ No newline at end of file
diff --git a/lz4_Frame_format.md b/lz4_Frame_format.md
new file mode 100755
index 0000000..55edaa0
--- /dev/null
+++ b/lz4_Frame_format.md
@@ -0,0 +1,385 @@
+LZ4 Frame Format Description
+============================
+
+###Notices
+
+Copyright (c) 2013-2015 Yann Collet
+
+Permission is granted to copy and distribute this document 
+for any  purpose and without charge, 
+including translations into other  languages 
+and incorporation into compilations, 
+provided that the copyright notice and this notice are preserved, 
+and that any substantive changes or deletions from the original 
+are clearly marked.
+Distribution of this document is unlimited.
+
+###Version
+
+1.5.1 (31/03/2015)
+
+
+Introduction
+------------
+
+The purpose of this document is to define a lossless compressed data format, 
+that is independent of CPU type, operating system, 
+file system and character set, suitable for 
+File compression, Pipe and streaming compression 
+using the [LZ4 algorithm](http://www.lz4.info).
+
+The data can be produced or consumed, 
+even for an arbitrarily long sequentially presented input data stream,
+using only an a priori bounded amount of intermediate storage,
+and hence can be used in data communications.
+The format uses the LZ4 compression method,
+and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
+for detection of data corruption.
+
+The data format defined by this specification 
+does not attempt to allow random access to compressed data.
+
+This specification is intended for use by implementers of software
+to compress data into LZ4 format and/or decompress data from LZ4 format.
+The text of the specification assumes a basic background in programming
+at the level of bits and other primitive data representations.
+
+Unless otherwise indicated below,
+a compliant compressor must produce data sets
+that conform to the specifications presented here.
+It doesn’t need to support all options though.
+
+A compliant decompressor must be able to decompress
+at least one working set of parameters
+that conforms to the specifications presented here.
+It may also ignore checksums.
+Whenever it does not support a specific parameter within the compressed stream,
+it must produce a non-ambiguous error code
+and associated error message explaining which parameter is unsupported.
+
+
+General Structure of LZ4 Frame format
+-------------------------------------
+
+| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
+|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
+| 4 bytes |  3-11 bytes   |       |       | 4 bytes |   4 bytes   | 
+
+__Magic Number__
+
+4 Bytes, Little endian format.
+Value : 0x184D2204
+
+__Frame Descriptor__
+
+3 to 11 Bytes, to be detailed in the next part.
+Most important part of the spec.
+
+__Data Blocks__
+
+To be detailed later on.
+That’s where compressed data is stored.
+
+__EndMark__
+
+The flow of blocks ends when the last data block has a size of “0”.
+The size is expressed as a 32-bits value.
+
+__Content Checksum__
+
+Content Checksum verify that the full content has been decoded correctly.
+The content checksum is the result 
+of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
+digesting the original (decoded) data as input, and a seed of zero.
+Content checksum is only present when its associated flag
+is set in the frame descriptor. 
+Content Checksum validates the result,
+that all blocks were fully transmitted in the correct order and without error,
+and also that the encoding/decoding process itself generated no distortion.
+Its usage is recommended.
+
+__Frame Concatenation__
+
+In some circumstances, it may be preferable to append multiple frames,
+for example in order to add new data to an existing compressed file
+without re-framing it.
+
+In such case, each frame has its own set of descriptor flags.
+Each frame is considered independent.
+The only relation between frames is their sequential order.
+
+The ability to decode multiple concatenated frames 
+within a single stream or file
+is left outside of this specification. 
+As an example, the reference lz4 command line utility behavior is
+to decode all concatenated frames in their sequential order.
+
+ 
+Frame Descriptor
+----------------
+
+| FLG     | BD      | (Content Size) | HC      |
+| ------- | ------- |:--------------:| ------- |
+| 1 byte  | 1 byte  |  0 - 8 bytes   | 1 byte  | 
+
+The descriptor uses a minimum of 3 bytes,
+and up to 11 bytes depending on optional parameters.
+
+__FLG byte__
+
+|  BitNb  |   7-6   |    5    |     4     |   3     |     2     |    1-0   |
+| ------- | ------- | ------- | --------- | ------- | --------- | -------- |
+|FieldName| Version | B.Indep | B.Checksum| C.Size  | C.Checksum|*Reserved*|
+
+
+__BD byte__
+
+|  BitNb  |     7    |     6-5-4    |  3-2-1-0 |
+| ------- | -------- | ------------ | -------- |
+|FieldName|*Reserved*| Block MaxSize|*Reserved*|
+
+In the tables, bit 7 is highest bit, while bit 0 is lowest.
+
+__Version Number__
+
+2-bits field, must be set to “01”.
+Any other value cannot be decoded by this version of the specification.
+Other version numbers will use different flag layouts.
+
+__Block Independence flag__
+
+If this flag is set to “1”, blocks are independent. 
+If this flag is set to “0”, each block depends on previous ones
+(up to LZ4 window size, which is 64 KB).
+In such case, it’s necessary to decode all blocks in sequence.
+
+Block dependency improves compression ratio, especially for small blocks.
+On the other hand, it makes direct jumps or multi-threaded decoding impossible.
+
+__Block checksum flag__
+
+If this flag is set, each data block will be followed by a 4-bytes checksum,
+calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
+The intention is to detect data corruption (storage or transmission errors) 
+immediately, before decoding.
+Block checksum usage is optional.
+
+__Content Size flag__
+
+If this flag is set, the uncompressed size of data included within the frame 
+will be present as an 8 bytes unsigned little endian value, after the flags.
+Content Size usage is optional.
+
+__Content checksum flag__
+
+If this flag is set, a content checksum will be appended after the EndMark.
+
+Recommended value : “1” (content checksum is present)
+
+__Block Maximum Size__
+
+This information is intended to help the decoder allocate memory.
+Size here refers to the original (uncompressed) data size.
+Block Maximum Size is one value among the following table :
+
+|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   | 
+| --- | --- | --- | --- | ----- | ------ | ---- | ---- | 
+| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB | 
+
+The decoder may refuse to allocate block sizes above a (system-specific) size.
+Unused values may be used in a future revision of the spec.
+A decoder conformant to the current version of the spec
+is only able to decode blocksizes defined in this spec.
+
+__Reserved bits__
+
+Value of reserved bits **must** be 0 (zero).
+Reserved bit might be used in a future version of the specification,
+typically enabling new optional features.
+If this happens, a decoder respecting the current version of the specification
+shall not be able to decode such a frame.
+
+__Content Size__
+
+This is the original (uncompressed) size.
+This information is optional, and only present if the associated flag is set.
+Content size is provided using unsigned 8 Bytes, for a maximum of 16 HexaBytes.
+Format is Little endian.
+This value is informational.
+It can be safely skipped by a conformant decoder.
+
+__Header Checksum__
+
+One-byte checksum of combined descriptor fields, including optional ones.
+The value is the second byte of xxh32() : ` (xxh32()>>8) & 0xFF } `
+using zero as a seed,
+and the full Frame Descriptor as an input
+(including optional fields when they are present).
+A wrong checksum indicates an error in the descriptor.
+Header checksum is informational and can be skipped.
+
+
+Data Blocks
+-----------
+
+| Block Size |  data  | (Block Checksum) |
+|:----------:| ------ |:----------------:|
+|  4 bytes   |        |   0 - 4 bytes    | 
+
+
+__Block Size__
+
+This field uses 4-bytes, format is little-endian.
+
+The highest bit is “1” if data in the block is uncompressed.
+
+The highest bit is “0” if data in the block is compressed by LZ4.
+
+All other bits give the size, in bytes, of the following data block
+(the size does not include the block checksum if present).
+
+Block Size shall never be larger than Block Maximum Size.
+Such a thing could happen for incompressible source data. 
+In such case, such a data block shall be passed in uncompressed format.
+
+__Data__
+
+Where the actual data to decode stands.
+It might be compressed or not, depending on previous field indications.
+Uncompressed size of Data can be any size, up to “block maximum size”.
+Note that data block is not necessarily full : 
+an arbitrary “flush” may happen anytime. Any block can be “partially filled”.
+
+__Block checksum__
+
+Only present if the associated flag is set.
+This is a 4-bytes checksum value, in little endian format,
+calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
+and a seed of zero.
+The intention is to detect data corruption (storage or transmission errors) 
+before decoding.
+
+Block checksum is cumulative with Content checksum.
+
+
+Skippable Frames
+----------------
+
+| Magic Number | Frame Size | User Data |
+|:------------:|:----------:| --------- |
+|   4 bytes    |  4 bytes   |           | 
+
+Skippable frames allow the integration of user-defined data
+into a flow of concatenated frames.
+Its design is pretty straightforward,
+with the sole objective to allow the decoder to quickly skip 
+over user-defined data and continue decoding.
+
+For the purpose of facilitating identification,
+it is discouraged to start a flow of concatenated frames with a skippable frame.
+If there is a need to start such a flow with some user data
+encapsulated into a skippable frame,
+it’s recommended to start with a zero-byte LZ4 frame
+followed by a skippable frame.
+This will make it easier for file type identifiers.
+
+ 
+__Magic Number__
+
+4 Bytes, Little endian format.
+Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
+All 16 values are valid to identify a skippable frame.
+
+__Frame Size__ 
+
+This is the size, in bytes, of the following User Data
+(without including the magic number nor the size field itself).
+4 Bytes, Little endian format, unsigned 32-bits.
+This means User Data can’t be bigger than (2^32-1) Bytes.
+
+__User Data__
+
+User Data can be anything. Data will just be skipped by the decoder.
+
+
+Legacy frame
+------------
+
+The Legacy frame format was defined into the initial versions of “LZ4Demo”.
+Newer compressors should not use this format anymore, as it is too restrictive.
+
+Main characteristics of the legacy format :
+
+- Fixed block size : 8 MB.
+- All blocks must be completely filled, except the last one.
+- All blocks are always compressed, even when compression is detrimental.
+- The last block is detected either because 
+  it is followed by the “EOF” (End of File) mark,
+  or because it is followed by a known Frame Magic Number.
+- No checksum
+- Convention is Little endian
+
+| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
+| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
+| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |
+
+
+__Magic Number__
+
+4 Bytes, Little endian format.
+Value : 0x184C2102
+
+__Block Compressed Size__
+
+This is the size, in bytes, of the following compressed data block.
+4 Bytes, Little endian format.
+
+__Data__
+
+Where the actual compressed data stands.
+Data is always compressed, even when compression is detrimental.
+
+__EndMark__
+
+End of compressed stream is implicit.
+It needs to be followed by a standard EOF (End Of File) signal,
+wether it is a file or a stream.
+
+Alternatively, if the frame is followed by a valid Frame Magic Number,
+it is considered completed.
+It makes legacy frames compatible with frame concatenation.
+
+Any other value will be interpreted as a block size,
+and trigger an error if it does not fit within acceptable range.
+
+
+Version changes
+---------------
+
+1.5.1 : changed format to MarkDown compatible
+
+1.5 : removed Dictionary ID from specification
+
+1.4.1 : changed wording from “stream” to “frame”
+
+1.4 : added skippable streams, re-added stream checksum
+
+1.3 : modified header checksum
+
+1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.
+
+1.1 : optional fields are now part of the descriptor
+
+1.0 : changed “block size” specification, adding a compressed/uncompressed flag
+
+0.9 : reduced scale of “block maximum size” table
+
+0.8 : removed : high compression flag
+
+0.7 : removed : stream checksum
+
+0.6 : settled : stream size uses 8 bytes, endian convention is little endian
+
+0.5: added copyright notice
+
+0.4 : changed format to Google Doc compatible OpenDocument
+\ No newline at end of file
author	Yann Collet <yann.collet.73@gmail.com>	2015-03-31 08:44:56 (GMT)
committer	Yann Collet <yann.collet.73@gmail.com>	2015-03-31 08:44:56 (GMT)
commit	f17496423cb7ca9be9d9d855fefcf78fbb70439a (patch)
tree	af9924fc11b72468f7debf136642cc5cf7dab032
parent	880381c11bc9838b8609643dff64044fca8856e2 (diff)
download	lz4-f17496423cb7ca9be9d9d855fefcf78fbb70439a.zip lz4-f17496423cb7ca9be9d9d855fefcf78fbb70439a.tar.gz lz4-f17496423cb7ca9be9d9d855fefcf78fbb70439a.tar.bz2