author Alexey Tourbin <alexey.tourbin@gmail.com> 2018-04-24 22:40:12 (GMT)
committer Alexey Tourbin <alexey.tourbin@gmail.com> 2018-04-24 23:39:28 (GMT)
commit ff9b4cf82678f9643d256129d06098b692072584
tree 0787ff85c4b04d6cdc1d1860ff555c6f94da0c24 /doc
parent 62d7cdcc741480842a0c217df7cb26ad3946ab32
lz4_Block_format.md: clarify on short inputs and restrictions
It occurred to me that the formula "The last 5 bytes are always literals", on the list of "assumptions made by the decoder", is remarkably ambiguous. Suppose the decoder is presented with 5 bytes. Are they literals? It may seem that the decoder degenerates to memcpy on short inputs. But of course the answer is no, so the formula needs some clarification.

Parsing restrictions should be explained as well, otherwise they look like arbitrary numbers. The 5-byte restriction has been mentioned recently in connection with the shortcut in LZ4_decompress_generic, so I added that. The second restriction is left to be explained by the author.

I also took the liberty to explain that empty inputs "are either unrepresentable or can be represented with a null byte". This wording may actually have some merit: it leaves it to the implementation, as opposed to the spec, to decide whether the encoder can compress empty inputs, and whether the decoder can produce an empty output (which the implementation should further clarify).
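The point about short inputs can be illustrated with a minimal block decoder. The sketch below is not the LZ4 reference implementation: `lz4_block_decode` is a hypothetical name, the function favors clarity over speed, it does not enforce the encoder-side parsing restrictions, and rejecting a zero-length block is a choice made here (the spec wording leaves that to the implementation). It shows that five input bytes are not copied verbatim: five literals are encoded as a token followed by the literals, and a single null byte decodes to an empty output.

```python
def lz4_block_decode(src: bytes) -> bytes:
    """Decode one LZ4 block (illustrative sketch, not the reference decoder)."""
    if not src:
        # Assumption of this sketch: a block needs at least one token byte.
        raise ValueError("empty block")
    out = bytearray()
    i, n = 0, len(src)

    def read_extra(length: int) -> int:
        # A nibble value of 15 means more length bytes follow; each is added
        # to the total, and a byte below 255 ends the run.
        nonlocal i
        if length == 15:
            while True:
                if i >= n:
                    raise ValueError("truncated length field")
                b = src[i]
                i += 1
                length += b
                if b != 255:
                    break
        return length

    while True:
        if i >= n:
            raise ValueError("block must end with a literals-only sequence")
        token = src[i]
        i += 1

        # High nibble: number of literals to copy from the input.
        lit_len = read_extra(token >> 4)
        if i + lit_len > n:
            raise ValueError("literals run past end of input")
        out += src[i:i + lit_len]
        i += lit_len

        # The last sequence is incomplete: it stops right after the literals.
        if i == n:
            return bytes(out)

        # Otherwise a 2-byte little-endian offset and a match follow.
        if i + 2 > n:
            raise ValueError("truncated offset")
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid offset")

        # Low nibble: match length, stored minus the 4-byte minimum.
        match_len = read_extra(token & 0x0F) + 4
        start = len(out) - offset
        for k in range(match_len):  # byte-wise copy handles overlapping matches
            out.append(out[start + k])


print(lz4_block_decode(b"\x50hello"))  # b'hello': token 0x50 = 5 literals, no match
print(lz4_block_decode(b"\x00"))       # b'': a token without literals and without a match
```

Feeding the raw bytes `b"hello"` to this sketch raises `ValueError`, because the first byte is read as a token announcing 6 literals while only 4 bytes remain.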
Diffstat (limited to 'doc')
-rw-r--r-- doc/lz4_Block_format.md | 15
1 files changed, 12 insertions, 3 deletions
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md
index 4e39b41..dd4c91b 100644
--- a/doc/lz4_Block_format.md
+++ b/doc/lz4_Block_format.md
@@ -109,15 +109,24 @@ Parsing restrictions
There are specific parsing rules to respect in order to remain compatible
with assumptions made by the decoder :
-1. The last 5 bytes are always literals
+1. The last 5 bytes are always literals. In other words, the last five bytes
+ from the uncompressed input (or all bytes, if the input has less than five
+ bytes) must be encoded as literals on behalf of the last sequence.
+ The last sequence is incomplete, and stops right after the literals.
2. The last match must start at least 12 bytes before end of block.
Consequently, a block with less than 13 bytes cannot be compressed.
These rules are in place to ensure that the decoder
will never read beyond the input buffer, nor write beyond the output buffer.
-Note that the last sequence is also incomplete,
-and stops right after literals.
+1. To copy literals from a non-last sequence, an 8-byte copy instruction
+ can always be safely issued (without reading past the input), because
+ the literals are followed by a 2-byte offset, and the last sequence
+ is at least 1+5 bytes long.
+2. TODO: explain the benefits of the second restriction.
+
+Empty inputs are either unrepresentable or can be represented with a null byte,
+which can be interpreted as a token without literals and without a match.
Additional notes