14 files changed, 415 insertions, 226 deletions
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
index 86b7696..e47afe7 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -26,7 +26,7 @@ If applicable, add screenshots to help explain your problem.
  - Version [e.g. 22]
  - Compiler [e.g. gcc]
  - Build System [e.g. Makefile]
- - Other hardware specs [e.g Core 2 duo...]
+ - Other hardware specs [e.g. Core 2 duo...]
 
 **Additional context**
 Add any other context about the problem here.
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
index eaa9f03..306d875 100644
--- a/.github/workflows/README.md
+++ b/.github/workflows/README.md
@@ -5,7 +5,7 @@ This directory contains [GitHub Actions](https://github.com/features/actions) wo
 ## USAN, ASAN (`lz4-ubsan-x64`, `lz4-ubsan-x86`, `lz4-asan-x64`)
 
 For now, `lz4-ubsan-*` ignores the exit code of `make usan` and `make usan32`.
-Because there're several issues which may take relatively long time to resolve.
+This is because several issues may take a relatively long time to resolve.
 
 We'll fully enable it when we ensure `make usan` is ready for all commits and PRs.
 
diff --git a/contrib/snap/README.md b/contrib/snap/README.md
index 612d6d7..55c97e0 100644
--- a/contrib/snap/README.md
+++ b/contrib/snap/README.md
@@ -6,7 +6,7 @@ of lz4. Snaps are universal Linux packages that allow you to easily
 build your application from any source and ship it to any Linux
 distribution by publishing it to https://snapcraft.io/. A key attribute
 of a snap package is that it is (ideally) confined such that it
-executes within a controlled environmenti with all its dependencies
+executes within a controlled environment with all its dependencies
 bundled with it and does not share dependencies with of from any other
 package on the system (with a couple of minor exceptions).
 
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md
index a0017b9..9e80227 100644
--- a/doc/lz4_Block_format.md
+++ b/doc/lz4_Block_format.md
@@ -1,24 +1,22 @@
 LZ4 Block Format Description
 ============================
-Last revised: 2022-02-02.
+Last revised: 2022-07-31.
 Author : Yann Collet
 
 
-This specification is intended for developers
-willing to produce LZ4-compatible compressed data blocks
-using any programming language.
+This specification is intended for developers willing to
+produce or read LZ4 compressed data blocks
+using any programming language of their choice.
 
-LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
+LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format.
 There is no entropy encoder back-end nor framing layer.
 The latter is assumed to be handled by other parts of the system
 (see [LZ4 Frame format]).
 This design is assumed to favor simplicity and speed.
-It helps later on for optimizations, compactness, and features.
 
-This document describes only the block format,
+This document describes only the Block Format,
 not how the compressor nor decompressor actually work.
-The correctness of the decompressor should not depend
-on implementation details of the compressor, and vice versa.
+For more details on such topics, see the later section "Implementation Notes".
 
 [LZ4 Frame format]: lz4_Frame_format.md
 
@@ -28,7 +26,7 @@ Compressed block format
 -----------------------
 An LZ4 compressed block is composed of sequences.
 A sequence is a suite of literals (not-compressed bytes),
-followed by a match copy.
+followed by a match copy operation.
 
 Each sequence starts with a `token`.
 The `token` is a one byte value, separated into two 4-bits fields.
@@ -38,14 +36,20 @@ Therefore each field ranges from 0 to 15.
 The first field uses the 4 high-bits of the token.
 It provides the length of literals to follow.
 
-If the field value is 0, then there is no literal.
-If it is 15, then we need to add some more bytes to indicate the full length.
-Each additional byte then represent a value from 0 to 255,
+If the field value is smaller than 15,
+then it represents the total number of literals present in the sequence,
+including 0, in which case there is no literal.
+
+The value 15 is a special case: more bytes are required to indicate the full length.
+Each additional byte then represents a value from 0 to 255,
 which is added to the previous value to produce a total length.
-When the byte value is 255, another byte must read and added, and so on.
-There can be any number of bytes of value "255" following `token`.
-There is no "size limit".
-(Side note : this is why a not-compressible input block is expanded by 0.4%).
+When the byte value is 255, another byte must be read and added, and so on.
+There can be any number of bytes of value `255` following `token`.
+The Block Format does not define any "size limit",
+though real implementations may feature some practical limits
+(see more details in later chapter "Implementation Notes").
+
+Note : this format explains why a non-compressible input block is expanded by 0.4%.
 
 Example 1 : A literal length of 48 will be represented as :
 
@@ -56,7 +60,7 @@ Example 2 : A literal length of 280 will be represented as :
 
   - 15  : value for the 4-bits High field
   - 255 : following byte is maxed, since 280-15 >= 255
-  - 10  : (=280 - 15 - 255) ) remaining length to reach 280
+  - 10  : (=280 - 15 - 255) remaining length to reach 280
 
 Example 3 : A literal length of 15 will be represented as :
 
@@ -64,104 +68,177 @@ Example 3 : A literal length of 15 will be represented as :
   - 0  : (=15-15) yes, the zero must be output
 
 Following `token` and optional length bytes, are the literals themselves.
-They are exactly as numerous as previously decoded (length of literals).
-It's possible that there are zero literals.
+Their count is exactly the literal length that was just decoded.
+Reminder: it's possible that there are zero literals.
 
 
 Following the literals is the match copy operation.
 
-It starts by the `offset`.
+It starts with the `offset` value.
 This is a 2 bytes value, in little endian format
 (the 1st byte is the "low" byte, the 2nd one is the "high" byte).
 
-The `offset` represents the position of the match to be copied from.
+The `offset` represents the position of the match to be copied from the past.
 For example, 1 means "current position - 1 byte".
-The maximum `offset` value is 65535, 65536 and beyond cannot be coded.
-Note that 0 is an invalid offset value.
-The presence of such a value denotes an invalid (corrupted) block.
+The maximum `offset` value is 65535. 65536 and beyond cannot be coded.
+Note that 0 is an invalid `offset` value.
+The presence of a 0 `offset` value denotes an invalid (corrupted) block.
 
 Then the `matchlength` can be extracted.
-For this, we use the second token field, the low 4-bits.
+For this, we use the second `token` field, the low 4-bits.
 Such a value, obviously, ranges from 0 to 15.
-However here, 0 means that the copy operation will be minimal.
+However here, 0 means that the copy operation is minimal.
 The minimum length of a match, called `minmatch`, is 4.
-As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
-Similar to literal length, on reaching the highest possible value (15),
-one must read additional bytes, one at a time, with values ranging from 0 to 255.
+As a consequence, a 0 value means 4 bytes.
+Similarly to literal length, any value smaller than 15 represents a length,
+to which 4 (`minmatch`) must be added, thus ranging from 4 to 18.
+A value of 15 is special, meaning 19+ bytes:
+one must then read additional bytes, one at a time,
+with each byte value ranging from 0 to 255.
 They are added to total to provide the final match length.
 A 255 value means there is another byte to read and add.
-There is no limit to the number of optional "255" bytes that can be present.
-(Note: this points towards a maximum achievable compression ratio of about 250).
+There is no limit to the number of optional `255` bytes that can be present,
+and therefore no limit to representable match length,
+though real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below).
+
+Note: this format has a maximum achievable compression ratio of about 250.
 
 Decoding the `matchlength` reaches the end of current sequence.
-Next byte will be the start of another sequence.
-But before moving to next sequence,
-it's time to use the decoded match position and length.
-The decoder copies `matchlength` bytes from match position to current position.
-
-In some cases, `matchlength` can be larger than `offset`.
-Therefore, since `match_pos + matchlength > current_pos`,
-later bytes to copy are not decoded yet.
-This is called an "overlap match", and must be handled with special care.
-A common case is an offset of 1,
-meaning the last byte is repeated `matchlength` times.
+Next byte will be the start of another sequence, and therefore a new `token`.
 
 
-End of block restrictions
------------------------
+End of block conditions
+-------------------------
 There are specific restrictions required to terminate an LZ4 block.
 
 1. The last sequence contains only literals.
-   The block ends right after them.
+   The block ends right after the literals (no `offset` field).
 2. The last 5 bytes of input are always literals.
    Therefore, the last sequence contains at least 5 bytes.
    - Special : if input is smaller than 5 bytes,
      there is only one sequence, it contains the whole input as literals.
-     Empty input can be represented with a zero byte,
+     Even empty input can be represented, using a zero byte,
      interpreted as a final token without literal and without a match.
 3. The last match must start at least 12 bytes before the end of block.
-   The last match is part of the penultimate sequence.
-   It is followed by the last sequence, which contains only literals.
+   The last match is part of the _penultimate_ sequence.
+   It is followed by the last sequence, which contains _only_ literals.
    - Note that, as a consequence,
-     an independent block < 13 bytes cannot be compressed,
-     because the match must copy "something",
-     so it needs at least one prior byte.
-   - However, when a block can reference data from another block,
-     it can start immediately with a match and no literal,
-     therefore a block of exactly 12 bytes can be compressed.
+     blocks < 12 bytes cannot be compressed.
+     And as an extension, _independent_ blocks < 13 bytes cannot be compressed,
+     because they must start with at least one literal,
+     that the match can then copy afterwards.
 
 When a block does not respect these end conditions,
 a conformant decoder is allowed to reject the block as incorrect.
 
 These rules are in place to ensure compatibility with
 a wide range of historical decoders
-which rely on these conditions in their speed-oriented design.
+which rely on these conditions for their speed-oriented design.
 
-Additional notes
+Implementation notes
 -----------------------
-If the decoder will decompress data from any external source,
-it is recommended to ensure that the decoder is resilient to corrupted data,
-and typically not vulnerable to buffer overflow manipulations.
+The LZ4 Block Format only defines the compressed format;
+it does not tell how to create a decoder or an encoder,
+the design of which is left to the imagination of the implementer.
+
+However, thanks to experience, there are a number of typical topics that
+most implementations will have to consider.
+This section tries to provide a few guidelines.
+
+#### Metadata
+
+An LZ4-compressed Block requires additional metadata for proper decoding.
+Typically, a decoder will require the compressed block's size,
+and an upper bound of decompressed size.
+Other variants exist, such as knowing the decompressed size,
+and having an upper bound of the input size.
+The Block Format does not specify how to transmit such information,
+which is considered an out-of-band information channel.
+That's because in many cases, the information is present in the environment.
+For example, databases must store the size of their compressed block for indexing,
+and know that their decompressed block can't be larger than a certain threshold.
+
+If you need a format which is "self-contained",
+and also transports the necessary metadata for proper decoding on any platform,
+consider employing the [LZ4 Frame format] instead.
+
+#### Large lengths
+
+While the Block Format does not define any maximum value for length fields,
+in practice, most implementations will feature some form of limit,
+since it's expected for such values to be stored into registers of fixed bit width.
+
+If length fields use 64-bit registers,
+then it can be assumed that there is no practical limit,
+as it would require a single continuous block of multiple petabytes to reach it,
+which is unreasonable by today's standards.
+
+If length fields use 32-bit registers, then they can overflow,
+but this requires a compressed block of size > 16 MB.
+Therefore, implementations that do not deal with compressed blocks > 16 MB are safe.
+However, if such a case is allowed,
+then it's recommended to check that no large length overflows the register.
+
+If length fields use 16-bit registers,
+then it's definitely possible to overflow such register,
+with less than 300 bytes of compressed data.
+
+A conformant decoder should be able to detect length overflows when it's possible,
+and simply error out when that happens.
+The input block might not be invalid:
+it's just not decodable by the local decoder implementation.
+
+Note that, in order to be compatible with the larger LZ4 ecosystem,
+it's recommended to be able to read and represent lengths of up to 4 MB,
+and to accept blocks of size up to 4 MB.
+Such limits are compatible with 32-bit length registers,
+and prevent overflow of 32-bit registers.
+
+#### Safe decoding
+
+If a decoder receives compressed data from any external source,
+it is recommended to ensure that the decoder is resilient to corrupted input,
+and made safe from buffer overflow manipulations.
 Always ensure that read and write operations
 remain within the limits of provided buffers.
-Test the decoder with fuzzers
+
+Of particular importance, ensure that the number of bytes instructed to copy
+overflows neither the input nor the output buffer.
+Ensure also, when reading an offset value, that the resulting position to copy
+does not reach beyond the beginning of the buffer.
+Such a situation can happen during the first 64 KB of decoded data.
+
+For more safety, test the decoder with fuzzers
 to ensure it's resilient to improbable sequences of conditions.
 Combine them with sanitizers, in order to catch overflows (asan)
 or initialization issues (msan).
+
 Pay some attention to offset 0 scenario, which is invalid,
-and therefore must not be blindly decoded
-(a naive implementation could preserve destination buffer content,
+and therefore must not be blindly decoded:
+a naive implementation could preserve destination buffer content,
 which could then result in information disclosure
-if such buffer was uninitialized and still containing private data).
+if such buffer was uninitialized and still containing private data.
 For reference, in such a scenario, the reference LZ4 decoder
 clears the match segment with `0` bytes,
 though other solutions are certainly possible.
 
+Finally, pay attention to the "overlap match" scenario,
+when `matchlength` is larger than `offset`.
+In that case, since `match_pos + matchlength > current_pos`,
+some of the later bytes to copy do not exist yet,
+and will be generated during the early stage of match copy operation.
+Such a scenario must be handled with special care.
+A common case is an offset of 1,
+meaning the last byte is repeated `matchlength` times.
+
+#### Compression techniques
+
+The core of an LZ4 compressor is to detect duplicated data across the past 64 KB.
 The format makes no assumption nor limits to the way a compressor
 searches and selects matches within the source data block.
-Multiple techniques can be considered,
-featuring distinct time / performance trade offs.
 For example, an upper compression limit can be reached,
-using a technique called "full optimal parsing", at very high cpu cost.
+using a technique called "full optimal parsing", at high cpu and memory cost.
+But multiple other techniques can be considered,
+featuring distinct time / performance trade-offs.
 As long as the specified format is respected,
-the result will be compatible and decodable by any compliant decoder.
+the result will be compatible with and decodable by any compliant decoder.
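Reviewer sketch (not part of the patch): the literal-length and match-length encoding described in the `doc/lz4_Block_format.md` changes can be illustrated with a small bounds-checked decoder. The name `decode_length` and its signature are hypothetical and do not exist in the LZ4 library; this is a minimal sketch assuming the caller knows where the input ends, and it is stricter than the `read_long_length_no_check()` helper added in `lib/lz4.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the LZ4 API.
 * field : 4-bit value taken from the token (0..15)
 * *pp   : next unread input byte; advanced past any extra length bytes
 * end   : one past the last readable input byte
 * Returns 0 on success and stores the decoded length in *out.
 * Returns -1 if the input is truncated or the accumulator would overflow.
 * For a match length, the caller still has to add minmatch (4). */
static int decode_length(unsigned field, const uint8_t **pp,
                         const uint8_t *end, size_t *out)
{
    size_t len = field;
    assert(field <= 15);
    if (field == 15) {                           /* special value: more bytes follow */
        uint8_t b;
        do {
            if (*pp >= end) return -1;           /* truncated input */
            b = *(*pp)++;
            if (len > SIZE_MAX - b) return -1;   /* accumulator overflow */
            len += b;
        } while (b == 255);                      /* 255 means "read another byte" */
    }
    *out = len;
    return 0;
}
```

With the examples from the specification, a field of 15 followed by the byte 33 decodes to 48, and a field of 15 followed by the bytes 255 and 10 decodes to 280.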
diff --git a/examples/blockStreaming_lineByLine.md b/examples/blockStreaming_lineByLine.md
index 90342f6..7b66883 100644
--- a/examples/blockStreaming_lineByLine.md
+++ b/examples/blockStreaming_lineByLine.md
@@ -107,7 +107,7 @@ This is called "External Dictionary Mode".
 In Line#X+2 (see (5)), finally LZ4 forget almost all memories but still remains Line#X+1.
 This is the same situation as Line#2.
 
-Continue these procedure to the end of text file.
+Continue these procedures to the end of text file.
 
 
 ## How the decompression works
@@ -119,4 +119,4 @@ Decompression will do reverse order.
  - Output decompressed plain text line to the file.
  - Forward ringbuffer offset. If offset exceeds end of the ringbuffer, reset it.
 
-Continue these procedure to the end of the compressed file.
+Continue these procedures to the end of the compressed file.
diff --git a/examples/dictionaryRandomAccess.md b/examples/dictionaryRandomAccess.md
index fb1fade..c6f4388 100644
--- a/examples/dictionaryRandomAccess.md
+++ b/examples/dictionaryRandomAccess.md
@@ -64,4 +64,4 @@ Decompression will do reverse order.
  - Read the next block.
  - Decompress it and write that page to the file.
 
-Continue these procedure until all the required data has been read.
+Continue these procedures until all the required data has been read.
diff --git a/examples/streaming_api_basics.md b/examples/streaming_api_basics.md
index abffaef..6f5ae41 100644
--- a/examples/streaming_api_basics.md
+++ b/examples/streaming_api_basics.md
@@ -9,7 +9,7 @@ LZ4 has the following API sets :
    It guarantees interoperability with other LZ4 framing format compliant tools/libraries
    such as LZ4 command line utility, node-lz4, etc.
  - "Block" API : This is recommended for simple purpose.
-   It compress single raw memory block to LZ4 memory block and vice versa.
+   It compresses single raw memory block to LZ4 memory block and vice versa.
  - "Streaming" API : This is designed for complex things.
    For example, compress huge stream data in restricted memory environment.
 
diff --git a/lib/README.md b/lib/README.md
index d08e0f1..244d65c 100644
--- a/lib/README.md
+++ b/lib/README.md
@@ -96,7 +96,7 @@ The following build macro can be selected to adjust source code behavior at comp
   by using bitcount instructions, generally implemented as fast single instructions in many cpus.
   In case the target cpus doesn't support it, or compiler intrinsic doesn't work, or feature bad performance,
   it's possible to use an optimized software path instead.
-  This is achieved by setting this build macros .
+  This is achieved by setting this build macro.
   In most cases, it's not expected to be necessary,
   but it can be legitimately considered for less common platforms.
 
diff --git a/lib/dll/example/README.md b/lib/dll/example/README.md
index 6d93248..b93914b 100644
--- a/lib/dll/example/README.md
+++ b/lib/dll/example/README.md
@@ -51,7 +51,7 @@ The compiled executable will require LZ4 DLL which is available at `dll\msys-lz4
 Open `example\fullbench-dll.sln` to compile `fullbench-dll` that uses a
 dynamic LZ4 library from the `dll` directory. The solution works with Visual C++
 2010 or newer. When one will open the solution with Visual C++ newer than 2010
-then the solution will upgraded to the current version.
+then the solution will be upgraded to the current version.
 
 
 #### Using LZ4 DLL with Visual C++
diff --git a/lib/lz4.c b/lib/lz4.c
index b9c5a65..5fae002 100644
--- a/lib/lz4.c
+++ b/lib/lz4.c
@@ -210,7 +210,10 @@ void  LZ4_free(void* p);
 #endif
 
 #include <string.h>   /* memset, memcpy */
-#define MEM_INIT(p,v,s)   memset((p),(v),(s))
+#if !defined(LZ4_memset)
+#  define LZ4_memset(p,v,s) memset((p),(v),(s))
+#endif
+#define MEM_INIT(p,v,s)   LZ4_memset((p),(v),(s))
 
 
 /*-************************************
@@ -321,10 +324,20 @@ typedef enum {
  * memcpy() as if it were standard compliant, so it can inline it in freestanding
  * environments. This is needed when decompressing the Linux Kernel, for example.
  */
-#if defined(__GNUC__) && (__GNUC__ >= 4)
-#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)
-#else
-#define LZ4_memcpy(dst, src, size) memcpy(dst, src, size)
+#if !defined(LZ4_memcpy)
+#  if defined(__GNUC__) && (__GNUC__ >= 4)
+#    define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)
+#  else
+#    define LZ4_memcpy(dst, src, size) memcpy(dst, src, size)
+#  endif
+#endif
+
+#if !defined(LZ4_memmove)
+#  if defined(__GNUC__) && (__GNUC__ >= 4)
+#    define LZ4_memmove __builtin_memmove
+#  else
+#    define LZ4_memmove memmove
+#  endif
 #endif
 
 static unsigned LZ4_isLittleEndian(void)
@@ -529,7 +542,7 @@ static unsigned LZ4_NbCommonBytes (reg_t val)
 * including _tzcnt_u64. Therefore, we need to neuter the _tzcnt_u64 code path for ARM64EC.
 ****************************************************************************************************/
 #         if defined(__clang__) && (__clang_major__ < 10)
-            /* Avoid undefined clang-cl intrinics issue.
+            /* Avoid undefined clang-cl intrinsics issue.
              * See https://github.com/lz4/lz4/pull/1017 for details. */
             return (unsigned)__builtin_ia32_tzcnt_u64(val) >> 3;
 #         else
@@ -1711,7 +1724,7 @@ int LZ4_saveDict (LZ4_stream_t* LZ4_dict, char* safeBuffer, int dictSize)
     if (dictSize > 0) {
         const BYTE* const previousDictEnd = dict->dictionary + dict->dictSize;
         assert(dict->dictionary);
-        memmove(safeBuffer, previousDictEnd - dictSize, dictSize);
+        LZ4_memmove(safeBuffer, previousDictEnd - dictSize, (size_t)dictSize);
     }
 
     dict->dictionary = (const BYTE*)safeBuffer;
@@ -1726,39 +1739,163 @@ int LZ4_saveDict (LZ4_stream_t* LZ4_dict, char* safeBuffer, int dictSize)
  *  Decompression functions
  ********************************/
 
-typedef enum { endOnOutputSize = 0, endOnInputSize = 1 } endCondition_directive;
 typedef enum { decode_full_block = 0, partial_decode = 1 } earlyEnd_directive;
 
 #undef MIN
 #define MIN(a,b)    ( (a) < (b) ? (a) : (b) )
 
+
+/* variant for decompress_unsafe()
+ * does not know end of input
+ * presumes input is well formed
+ * note : will consume at least one byte */
+size_t read_long_length_no_check(const BYTE** pp)
+{
+    size_t b, l = 0;
+    do { b = **pp; (*pp)++; l += b; } while (b==255);
+    DEBUGLOG(6, "read_long_length_no_check: +length=%zu using %zu input bytes", l, l/255 + 1);
+    return l;
+}
+
+/* core decoder variant for LZ4_decompress_fast*()
+ * for legacy support only : these entry points are deprecated.
+ * - Presumes input is correctly formed (no defense vs malformed inputs)
+ * - Does not know input size (presume input buffer is "large enough")
+ * - Decompress a full block (only)
+ * @return : nb of bytes read from input.
+ * Note : this variant is not optimized for speed, just for maintenance.
+ *        the goal is to remove support of decompress_fast*() variants by v2.0
+**/
+LZ4_FORCE_INLINE int
+LZ4_decompress_unsafe_generic(
+                 const BYTE* const istart,
+                 BYTE* const ostart,
+                 int decompressedSize,
+
+                 size_t prefixSize,
+                 const BYTE* const dictStart,  /* only if dict==usingExtDict */
+                 const size_t dictSize         /* note: =0 if dictStart==NULL */
+                 )
+{
+    const BYTE* ip = istart;
+    BYTE* op = (BYTE*)ostart;
+    BYTE* const oend = ostart + decompressedSize;
+    const BYTE* const prefixStart = ostart - prefixSize;
+
+    DEBUGLOG(5, "LZ4_decompress_unsafe_generic");
+    if (dictStart == NULL) assert(dictSize == 0);
+
+    while (1) {
+        /* start new sequence */
+        unsigned token = *ip++;
+
+        /* literals */
+        {   size_t ll = token >> ML_BITS;
+            if (ll==15) {
+                /* long literal length */
+                ll += read_long_length_no_check(&ip);
+            }
+            if ((size_t)(oend-op) < ll) return -1; /* output buffer overflow */
+            LZ4_memmove(op, ip, ll); /* support in-place decompression */
+            op += ll;
+            ip += ll;
+            if ((size_t)(oend-op) < MFLIMIT) {
+                if (op==oend) break;  /* end of block */
+                DEBUGLOG(5, "invalid: literals end at distance %zi from end of block", oend-op);
+                /* incorrect end of block :
+                 * last match must start at least MFLIMIT==12 bytes before end of output block */
+                return -1;
+        }   }
+
+        /* match */
+        {   size_t ml = token & 15;
+            size_t const offset = LZ4_readLE16(ip);
+            ip+=2;
+
+            if (ml==15) {
+                /* long match length */
+                ml += read_long_length_no_check(&ip);
+            }
+            ml += MINMATCH;
+
+            if ((size_t)(oend-op) < ml) return -1; /* output buffer overflow */
+
+            {   const BYTE* match = op - offset;
+
+                /* out of range */
+                if (offset > (size_t)(op - prefixStart) + dictSize) {
+                    DEBUGLOG(6, "offset out of range");
+                    return -1;
+                }
+
+                /* check special case : extDict */
+                if (offset > (size_t)(op - prefixStart)) {
+                    /* extDict scenario */
+                    const BYTE* const dictEnd = dictStart + dictSize;
+                    const BYTE* extMatch = dictEnd - (offset - (size_t)(op-prefixStart));
+                    size_t const extml = (size_t)(dictEnd - extMatch);
+                    if (extml > ml) {
+                        /* match entirely within extDict */
+                        LZ4_memmove(op, extMatch, ml);
+                        op += ml;
+                        ml = 0;
+                    } else {
+                        /* match split between extDict & prefix */
+                        LZ4_memmove(op, extMatch, extml);
+                        op += extml;
+                        ml -= extml;
+                    }
+                    match = prefixStart;
+                }
+
+                /* match copy - slow variant, supporting overlap copy */
+                {   size_t u;
+                    for (u=0; u<ml; u++) {
+                        op[u] = match[u];
+            }   }   }
+            op += ml;
+            if ((size_t)(oend-op) < LASTLITERALS) {
+                DEBUGLOG(5, "invalid: match ends at distance %zi from end of block", oend-op);
+                /* incorrect end of block :
+                 * last match must stop at least LASTLITERALS==5 bytes before end of output block */
+                return -1;
+            }
+        } /* match */
+    } /* main loop */
+    return (int)(ip - istart);
+}
+
+
 /* Read the variable-length literal or match length.
  *
- * ip - pointer to use as input.
- * lencheck - end ip.  Return an error if ip advances >= lencheck.
- * loop_check - check ip >= lencheck in body of loop.  Returns loop_error if so.
- * initial_check - check ip >= lencheck before start of loop.  Returns initial_error if so.
- * error (output) - error code.  Should be set to 0 before call.
- */
-typedef enum { loop_error = -2, initial_error = -1, ok = 0 } variable_length_error;
-LZ4_FORCE_INLINE unsigned
-read_variable_length(const BYTE**ip, const BYTE* lencheck,
-                     int loop_check, int initial_check,
-                     variable_length_error* error)
-{
-    U32 length = 0;
-    U32 s;
-    if (initial_check && unlikely((*ip) >= lencheck)) {    /* overflow detection */
-        *error = initial_error;
-        return length;
+ * @ip : input pointer
+ * @ilimit : position after which if length is not decoded, the input is necessarily corrupted.
+ * @initial_check : check ip >= ilimit before start of loop.  Returns rvl_error if so.
+ * @return : rvl_error if the read limit is reached or the length accumulator overflows.
+**/
+typedef size_t Rvl_t;
+static const Rvl_t rvl_error = (Rvl_t)(-1);
+LZ4_FORCE_INLINE Rvl_t
+read_variable_length(const BYTE** ip, const BYTE* ilimit,
+                     int initial_check)
+{
+    Rvl_t s, length = 0;
+    assert(ip != NULL);
+    assert(*ip !=  NULL);
+    assert(ilimit != NULL);
+    if (initial_check && unlikely((*ip) >= ilimit)) {    /* read limit reached */
+        return rvl_error;
     }
     do {
         s = **ip;
         (*ip)++;
         length += s;
-        if (loop_check && unlikely((*ip) >= lencheck)) {    /* overflow detection */
-            *error = loop_error;
-            return length;
+        if (unlikely((*ip) > ilimit)) {    /* read limit reached */
+            return rvl_error;
+        }
+        /* accumulator overflow detection (32-bit mode only) */
+        if ((sizeof(length)<8) && unlikely(length > ((Rvl_t)(-1)/2)) ) {
+            return rvl_error;
         }
     } while (s==255);
 
@@ -1778,7 +1915,6 @@ LZ4_decompress_generic(
                  int srcSize,
                  int outputSize,         /* If endOnInput==endOnInputSize, this value is `dstCapacity` */
 
-                 endCondition_directive endOnInput,   /* endOnOutputSize, endOnInputSize */
                  earlyEnd_directive partialDecoding,  /* full, partial */
                  dict_directive dict,                 /* noDict, withPrefix64k, usingExtDict */
                  const BYTE* const lowPrefix,  /* always <= dst, == dst when no prefix */
@@ -1797,13 +1933,12 @@ LZ4_decompress_generic(
 
         const BYTE* const dictEnd = (dictStart == NULL) ? NULL : dictStart + dictSize;
 
-        const int safeDecode = (endOnInput==endOnInputSize);
-        const int checkOffset = ((safeDecode) && (dictSize < (int)(64 KB)));
+        const int checkOffset = (dictSize < (int)(64 KB));
 
 
         /* Set up the "end" pointers for the shortcut. */
-        const BYTE* const shortiend = iend - (endOnInput ? 14 : 8) /*maxLL*/ - 2 /*offset*/;
-        const BYTE* const shortoend = oend - (endOnInput ? 14 : 8) /*maxLL*/ - 18 /*maxML*/;
+        const BYTE* const shortiend = iend - 14 /*maxLL*/ - 2 /*offset*/;
+        const BYTE* const shortoend = oend - 14 /*maxLL*/ - 18 /*maxML*/;
 
         const BYTE* match;
         size_t offset;
@@ -1815,83 +1950,70 @@ LZ4_decompress_generic(
 
         /* Special cases */
         assert(lowPrefix <= op);
-        if ((endOnInput) && (unlikely(outputSize==0))) {
+        if (unlikely(outputSize==0)) {
             /* Empty output buffer */
             if (partialDecoding) return 0;
             return ((srcSize==1) && (*ip==0)) ? 0 : -1;
         }
-        if ((!endOnInput) && (unlikely(outputSize==0))) { return (*ip==0 ? 1 : -1); }
-        if ((endOnInput) && unlikely(srcSize==0)) { return -1; }
+        if (unlikely(srcSize==0)) { return -1; }
 
-	/* Currently the fast loop shows a regression on qualcomm arm chips. */
+    /* LZ4_FAST_DEC_LOOP:
+     * designed for modern OoO performance cpus,
+     * where copying reliably 32-bytes is preferable to an unpredictable branch.
+     * note : fast loop may show a regression for some client arm chips. */
 #if LZ4_FAST_DEC_LOOP
         if ((oend - op) < FASTLOOP_SAFE_DISTANCE) {
             DEBUGLOG(6, "skip fast decode loop");
             goto safe_decode;
         }
 
-        /* Fast loop : decode sequences as long as output < iend-FASTLOOP_SAFE_DISTANCE */
+        /* Fast loop : decode sequences as long as output < oend-FASTLOOP_SAFE_DISTANCE */
         while (1) {
             /* Main fastloop assertion: We can always wildcopy FASTLOOP_SAFE_DISTANCE */
             assert(oend - op >= FASTLOOP_SAFE_DISTANCE);
-            if (endOnInput) { assert(ip < iend); }
+            assert(ip < iend);
             token = *ip++;
             length = token >> ML_BITS;  /* literal length */
 
-            assert(!endOnInput || ip <= iend); /* ip < iend before the increment */
-
             /* decode literal length */
             if (length == RUN_MASK) {
-                variable_length_error error = ok;
-                length += read_variable_length(&ip, iend-RUN_MASK, (int)endOnInput, (int)endOnInput, &error);
-                if (error == initial_error) { goto _output_error; }
-                if ((safeDecode) && unlikely((uptrval)(op)+length<(uptrval)(op))) { goto _output_error; } /* overflow detection */
-                if ((safeDecode) && unlikely((uptrval)(ip)+length<(uptrval)(ip))) { goto _output_error; } /* overflow detection */
+                size_t const addl = read_variable_length(&ip, iend-RUN_MASK, 1);
+                if (addl == rvl_error) { goto _output_error; }
+                length += addl;
+                if (unlikely((uptrval)(op)+length<(uptrval)(op))) { goto _output_error; } /* overflow detection */
+                if (unlikely((uptrval)(ip)+length<(uptrval)(ip))) { goto _output_error; } /* overflow detection */
 
                 /* copy literals */
                 cpy = op+length;
                 LZ4_STATIC_ASSERT(MFLIMIT >= WILDCOPYLENGTH);
-                if (endOnInput) {  /* LZ4_decompress_safe() */
-                    if ((cpy>oend-32) || (ip+length>iend-32)) { goto safe_literal_copy; }
-                    LZ4_wildCopy32(op, ip, cpy);
-                } else {   /* LZ4_decompress_fast() */
-                    if (cpy>oend-8) { goto safe_literal_copy; }
-                    LZ4_wildCopy8(op, ip, cpy); /* LZ4_decompress_fast() cannot copy more than 8 bytes at a time :
-                                                 * it doesn't know input length, and only relies on end-of-block properties */
-                }
+                if ((cpy>oend-32) || (ip+length>iend-32)) { goto safe_literal_copy; }
+                LZ4_wildCopy32(op, ip, cpy);
                 ip += length; op = cpy;
             } else {
                 cpy = op+length;
-                if (endOnInput) {  /* LZ4_decompress_safe() */
-                    DEBUGLOG(7, "copy %u bytes in a 16-bytes stripe", (unsigned)length);
-                    /* We don't need to check oend, since we check it once for each loop below */
-                    if (ip > iend-(16 + 1/*max lit + offset + nextToken*/)) { goto safe_literal_copy; }
-                    /* Literals can only be 14, but hope compilers optimize if we copy by a register size */
-                    LZ4_memcpy(op, ip, 16);
-                } else {  /* LZ4_decompress_fast() */
-                    /* LZ4_decompress_fast() cannot copy more than 8 bytes at a time :
-                     * it doesn't know input length, and relies on end-of-block properties */
-                    LZ4_memcpy(op, ip, 8);
-                    if (length > 8) { LZ4_memcpy(op+8, ip+8, 8); }
-                }
+                DEBUGLOG(7, "copy %u bytes in a 16-bytes stripe", (unsigned)length);
+                /* We don't need to check oend, since we check it once for each loop below */
+                if (ip > iend-(16 + 1/*max lit + offset + nextToken*/)) { goto safe_literal_copy; }
+                /* Literals can only be <= 14, but hope compilers optimize better when copying by a register size */
+                LZ4_memcpy(op, ip, 16);
                 ip += length; op = cpy;
             }
 
             /* get offset */
             offset = LZ4_readLE16(ip); ip+=2;
             match = op - offset;
-            assert(match <= op);
+            assert(match <= op);  /* overflow check */
 
             /* get matchlength */
             length = token & ML_MASK;
 
             if (length == ML_MASK) {
-                variable_length_error error = ok;
-                if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) { goto _output_error; } /* Error : offset outside buffers */
-                length += read_variable_length(&ip, iend - LASTLITERALS + 1, (int)endOnInput, 0, &error);
-                if (error != ok) { goto _output_error; }
-                if ((safeDecode) && unlikely((uptrval)(op)+length<(uptrval)op)) { goto _output_error; } /* overflow detection */
+                size_t const addl = read_variable_length(&ip, iend - LASTLITERALS + 1, 0);
+                if (addl == rvl_error) { goto _output_error; }
+                length += addl;
                 length += MINMATCH;
+                if (unlikely((uptrval)(op)+length<(uptrval)op)) { goto _output_error; } /* overflow detection */
+                if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) { goto _output_error; } /* Error : offset outside buffers */
                 if (op + length >= oend - FASTLOOP_SAFE_DISTANCE) {
                     goto safe_match_copy;
                 }
@@ -1901,7 +2023,7 @@ LZ4_decompress_generic(
                     goto safe_match_copy;
                 }
 
-                /* Fastpath check: Avoids a branch in LZ4_wildCopy32 if true */
+                /* Fastpath check: skip LZ4_wildCopy32 when true */
                 if ((dict == withPrefix64k) || (match >= lowPrefix)) {
                     if (offset >= 8) {
                         assert(match >= lowPrefix);
@@ -1918,6 +2040,7 @@ LZ4_decompress_generic(
             if (checkOffset && (unlikely(match + dictSize < lowPrefix))) { goto _output_error; } /* Error : offset outside buffers */
             /* match starting within external dictionary */
             if ((dict==usingExtDict) && (match < lowPrefix)) {
+                assert(dictEnd != NULL);
                 if (unlikely(op+length > oend-LASTLITERALS)) {
                     if (partialDecoding) {
                         DEBUGLOG(7, "partialDecoding: dictionary match, close to dstEnd");
@@ -1928,7 +2051,7 @@ LZ4_decompress_generic(
 
                 if (length <= (size_t)(lowPrefix-match)) {
                     /* match fits entirely within external dictionary : just copy */
-                    memmove(op, dictEnd - (lowPrefix-match), length);
+                    LZ4_memmove(op, dictEnd - (lowPrefix-match), length);
                     op += length;
                 } else {
                     /* match stretches into both external dictionary and current block */
@@ -1964,11 +2087,10 @@ LZ4_decompress_generic(
 
         /* Main Loop : decode remaining sequences where output < FASTLOOP_SAFE_DISTANCE */
         while (1) {
+            assert(ip < iend);
             token = *ip++;
             length = token >> ML_BITS;  /* literal length */
 
-            assert(!endOnInput || ip <= iend); /* ip < iend before the increment */
-
             /* A two-stage shortcut for the most common case:
              * 1) If the literal length is 0..14, and there is enough space,
              * enter the shortcut and copy 16 bytes on behalf of the literals
@@ -1978,11 +2100,11 @@ LZ4_decompress_generic(
              * those 18 bytes earlier, upon entering the shortcut (in other words,
              * there is a combined check for both stages).
              */
-            if ( (endOnInput ? length != RUN_MASK : length <= 8)
+            if ( (length != RUN_MASK)
                 /* strictly "less than" on input, to re-enter the loop with at least one byte */
-              && likely((endOnInput ? ip < shortiend : 1) & (op <= shortoend)) ) {
+              && likely((ip < shortiend) & (op <= shortoend)) ) {
                 /* Copy the literals */
-                LZ4_memcpy(op, ip, endOnInput ? 16 : 8);
+                LZ4_memcpy(op, ip, 16);
                 op += length; ip += length;
 
                 /* The second stage: prepare for match copying, decode full info.
@@ -2012,11 +2134,11 @@ LZ4_decompress_generic(
 
             /* decode literal length */
             if (length == RUN_MASK) {
-                variable_length_error error = ok;
-                length += read_variable_length(&ip, iend-RUN_MASK, (int)endOnInput, (int)endOnInput, &error);
-                if (error == initial_error) { goto _output_error; }
-                if ((safeDecode) && unlikely((uptrval)(op)+length<(uptrval)(op))) { goto _output_error; } /* overflow detection */
-                if ((safeDecode) && unlikely((uptrval)(ip)+length<(uptrval)(ip))) { goto _output_error; } /* overflow detection */
+                size_t const addl = read_variable_length(&ip, iend-RUN_MASK, 1);
+                if (addl == rvl_error) { goto _output_error; }
+                length += addl;
+                if (unlikely((uptrval)(op)+length<(uptrval)(op))) { goto _output_error; } /* overflow detection */
+                if (unlikely((uptrval)(ip)+length<(uptrval)(ip))) { goto _output_error; } /* overflow detection */
             }
 
             /* copy literals */
@@ -2025,9 +2147,7 @@ LZ4_decompress_generic(
         safe_literal_copy:
 #endif
             LZ4_STATIC_ASSERT(MFLIMIT >= WILDCOPYLENGTH);
-            if ( ((endOnInput) && ((cpy>oend-MFLIMIT) || (ip+length>iend-(2+1+LASTLITERALS))) )
-              || ((!endOnInput) && (cpy>oend-WILDCOPYLENGTH)) )
-            {
+            if ((cpy>oend-MFLIMIT) || (ip+length>iend-(2+1+LASTLITERALS))) {
                 /* We've either hit the input parsing restriction or the output parsing restriction.
                  * In the normal scenario, decoding a full block, it must be the last sequence,
                  * otherwise it's an error (invalid input or dimensions).
@@ -2037,7 +2157,6 @@ LZ4_decompress_generic(
                     /* Since we are partial decoding we may be in this block because of the output parsing
                      * restriction, which is not valid since the output buffer is allowed to be undersized.
                      */
-                    assert(endOnInput);
                     DEBUGLOG(7, "partialDecoding: copying literals, close to input or output end")
                     DEBUGLOG(7, "partialDecoding: literal length = %u", (unsigned)length);
                     DEBUGLOG(7, "partialDecoding: remaining space in dstBuffer : %i", (int)(oend - op));
@@ -2058,21 +2177,17 @@ LZ4_decompress_generic(
                         length = (size_t)(oend-op);
                     }
                 } else {
-                    /* We must be on the last sequence because of the parsing limitations so check
-                     * that we exactly regenerate the original size (must be exact when !endOnInput).
-                     */
-                    if ((!endOnInput) && (cpy != oend)) { goto _output_error; }
                      /* We must be on the last sequence (or invalid) because of the parsing limitations
                       * so check that we exactly consume the input and don't overrun the output buffer.
                       */
-                    if ((endOnInput) && ((ip+length != iend) || (cpy > oend))) {
+                    if ((ip+length != iend) || (cpy > oend)) {
                         DEBUGLOG(6, "should have been last run of literals")
                         DEBUGLOG(6, "ip(%p) + length(%i) = %p != iend (%p)", ip, (int)length, ip+length, iend);
                         DEBUGLOG(6, "or cpy(%p) > oend(%p)", cpy, oend);
                         goto _output_error;
                     }
                 }
-                memmove(op, ip, length);  /* supports overlapping memory regions; only matters for in-place decompression scenarios */
+                LZ4_memmove(op, ip, length);  /* supports overlapping memory regions, for in-place decompression scenarios */
                 ip += length;
                 op += length;
                 /* Necessarily EOF when !partialDecoding.
@@ -2084,7 +2199,7 @@ LZ4_decompress_generic(
                     break;
                 }
             } else {
-                LZ4_wildCopy8(op, ip, cpy);   /* may overwrite up to WILDCOPYLENGTH beyond cpy */
+                LZ4_wildCopy8(op, ip, cpy);   /* can overwrite up to 8 bytes beyond cpy */
                 ip += length; op = cpy;
             }
 
@@ -2097,10 +2212,10 @@ LZ4_decompress_generic(
 
     _copy_match:
             if (length == ML_MASK) {
-              variable_length_error error = ok;
-              length += read_variable_length(&ip, iend - LASTLITERALS + 1, (int)endOnInput, 0, &error);
-              if (error != ok) goto _output_error;
-                if ((safeDecode) && unlikely((uptrval)(op)+length<(uptrval)op)) goto _output_error;   /* overflow detection */
+                size_t const addl = read_variable_length(&ip, iend - LASTLITERALS + 1, 0);
+                if (addl == rvl_error) { goto _output_error; }
+                length += addl;
+                if (unlikely((uptrval)(op)+length<(uptrval)op)) goto _output_error;   /* overflow detection */
             }
             length += MINMATCH;
 
@@ -2110,6 +2225,7 @@ LZ4_decompress_generic(
             if ((checkOffset) && (unlikely(match + dictSize < lowPrefix))) goto _output_error;   /* Error : offset outside buffers */
             /* match starting within external dictionary */
             if ((dict==usingExtDict) && (match < lowPrefix)) {
+                assert(dictEnd != NULL);
                 if (unlikely(op+length > oend-LASTLITERALS)) {
                     if (partialDecoding) length = MIN(length, (size_t)(oend-op));
                     else goto _output_error;   /* doesn't respect parsing restriction */
@@ -2117,7 +2233,7 @@ LZ4_decompress_generic(
 
                 if (length <= (size_t)(lowPrefix-match)) {
                     /* match fits entirely within external dictionary : just copy */
-                    memmove(op, dictEnd - (lowPrefix-match), length);
+                    LZ4_memmove(op, dictEnd - (lowPrefix-match), length);
                     op += length;
                 } else {
                     /* match stretches into both external dictionary and current block */
@@ -2188,12 +2304,8 @@ LZ4_decompress_generic(
         }
 
         /* end of decoding */
-        if (endOnInput) {
-            DEBUGLOG(5, "decoded %i bytes", (int) (((char*)op)-dst));
-           return (int) (((char*)op)-dst);     /* Nb of output bytes decoded */
-       } else {
-           return (int) (((const char*)ip)-src);   /* Nb of input bytes read */
-       }
+        DEBUGLOG(5, "decoded %i bytes", (int) (((char*)op)-dst));
+        return (int) (((char*)op)-dst);     /* Nb of output bytes decoded */
 
         /* Overflow error detected */
     _output_error:
@@ -2208,7 +2320,7 @@ LZ4_FORCE_O2
 int LZ4_decompress_safe(const char* source, char* dest, int compressedSize, int maxDecompressedSize)
 {
     return LZ4_decompress_generic(source, dest, compressedSize, maxDecompressedSize,
-                                  endOnInputSize, decode_full_block, noDict,
+                                  decode_full_block, noDict,
                                   (BYTE*)dest, NULL, 0);
 }
 
@@ -2217,16 +2329,17 @@ int LZ4_decompress_safe_partial(const char* src, char* dst, int compressedSize,
 {
     dstCapacity = MIN(targetOutputSize, dstCapacity);
     return LZ4_decompress_generic(src, dst, compressedSize, dstCapacity,
-                                  endOnInputSize, partial_decode,
+                                  partial_decode,
                                   noDict, (BYTE*)dst, NULL, 0);
 }
 
 LZ4_FORCE_O2
 int LZ4_decompress_fast(const char* source, char* dest, int originalSize)
 {
-    return LZ4_decompress_generic(source, dest, 0, originalSize,
-                                  endOnOutputSize, decode_full_block, withPrefix64k,
-                                  (BYTE*)dest - 64 KB, NULL, 0);
+    DEBUGLOG(5, "LZ4_decompress_fast");
+    return LZ4_decompress_unsafe_generic(
+                (const BYTE*)source, (BYTE*)dest, originalSize,
+                0, NULL, 0);
 }
 
 /*===== Instantiate a few more decoding cases, used more than once. =====*/
@@ -2235,7 +2348,7 @@ LZ4_FORCE_O2 /* Exported, an obsolete API function. */
 int LZ4_decompress_safe_withPrefix64k(const char* source, char* dest, int compressedSize, int maxOutputSize)
 {
     return LZ4_decompress_generic(source, dest, compressedSize, maxOutputSize,
-                                  endOnInputSize, decode_full_block, withPrefix64k,
+                                  decode_full_block, withPrefix64k,
                                   (BYTE*)dest - 64 KB, NULL, 0);
 }
 
@@ -2244,16 +2357,16 @@ static int LZ4_decompress_safe_partial_withPrefix64k(const char* source, char* d
 {
     dstCapacity = MIN(targetOutputSize, dstCapacity);
     return LZ4_decompress_generic(source, dest, compressedSize, dstCapacity,
-                                  endOnInputSize, partial_decode, withPrefix64k,
+                                  partial_decode, withPrefix64k,
                                   (BYTE*)dest - 64 KB, NULL, 0);
 }
 
 /* Another obsolete API function, paired with the previous one. */
 int LZ4_decompress_fast_withPrefix64k(const char* source, char* dest, int originalSize)
 {
-    /* LZ4_decompress_fast doesn't validate match offsets,
-     * and thus serves well with any prefixed dictionary. */
-    return LZ4_decompress_fast(source, dest, originalSize);
+    return LZ4_decompress_unsafe_generic(
+                (const BYTE*)source, (BYTE*)dest, originalSize,
+                64 KB, NULL, 0);
 }
 
 LZ4_FORCE_O2
@@ -2261,7 +2374,7 @@ static int LZ4_decompress_safe_withSmallPrefix(const char* source, char* dest, i
                                                size_t prefixSize)
 {
     return LZ4_decompress_generic(source, dest, compressedSize, maxOutputSize,
-                                  endOnInputSize, decode_full_block, noDict,
+                                  decode_full_block, noDict,
                                   (BYTE*)dest-prefixSize, NULL, 0);
 }
 
@@ -2271,7 +2384,7 @@ static int LZ4_decompress_safe_partial_withSmallPrefix(const char* source, char*
 {
     dstCapacity = MIN(targetOutputSize, dstCapacity);
     return LZ4_decompress_generic(source, dest, compressedSize, dstCapacity,
-                                  endOnInputSize, partial_decode, noDict,
+                                  partial_decode, noDict,
                                   (BYTE*)dest-prefixSize, NULL, 0);
 }
 
@@ -2281,7 +2394,7 @@ int LZ4_decompress_safe_forceExtDict(const char* source, char* dest,
                                      const void* dictStart, size_t dictSize)
 {
     return LZ4_decompress_generic(source, dest, compressedSize, maxOutputSize,
-                                  endOnInputSize, decode_full_block, usingExtDict,
+                                  decode_full_block, usingExtDict,
                                   (BYTE*)dest, (const BYTE*)dictStart, dictSize);
 }
 
@@ -2292,7 +2405,7 @@ int LZ4_decompress_safe_partial_forceExtDict(const char* source, char* dest,
 {
     dstCapacity = MIN(targetOutputSize, dstCapacity);
     return LZ4_decompress_generic(source, dest, compressedSize, dstCapacity,
-                                  endOnInputSize, partial_decode, usingExtDict,
+                                  partial_decode, usingExtDict,
                                   (BYTE*)dest, (const BYTE*)dictStart, dictSize);
 }
 
@@ -2300,9 +2413,9 @@ LZ4_FORCE_O2
 static int LZ4_decompress_fast_extDict(const char* source, char* dest, int originalSize,
                                        const void* dictStart, size_t dictSize)
 {
-    return LZ4_decompress_generic(source, dest, 0, originalSize,
-                                  endOnOutputSize, decode_full_block, usingExtDict,
-                                  (BYTE*)dest, (const BYTE*)dictStart, dictSize);
+    return LZ4_decompress_unsafe_generic(
+                (const BYTE*)source, (BYTE*)dest, originalSize,
+                0, (const BYTE*)dictStart, dictSize);
 }
 
 /* The "double dictionary" mode, for use with e.g. ring buffers: the first part
@@ -2314,16 +2427,7 @@ int LZ4_decompress_safe_doubleDict(const char* source, char* dest, int compresse
                                    size_t prefixSize, const void* dictStart, size_t dictSize)
 {
     return LZ4_decompress_generic(source, dest, compressedSize, maxOutputSize,
-                                  endOnInputSize, decode_full_block, usingExtDict,
-                                  (BYTE*)dest-prefixSize, (const BYTE*)dictStart, dictSize);
-}
-
-LZ4_FORCE_INLINE
-int LZ4_decompress_fast_doubleDict(const char* source, char* dest, int originalSize,
-                                   size_t prefixSize, const void* dictStart, size_t dictSize)
-{
-    return LZ4_decompress_generic(source, dest, 0, originalSize,
-                                  endOnOutputSize, decode_full_block, usingExtDict,
+                                  decode_full_block, usingExtDict,
                                   (BYTE*)dest-prefixSize, (const BYTE*)dictStart, dictSize);
 }
 
@@ -2431,29 +2535,35 @@ int LZ4_decompress_safe_continue (LZ4_streamDecode_t* LZ4_streamDecode, const ch
     return result;
 }
 
-LZ4_FORCE_O2
-int LZ4_decompress_fast_continue (LZ4_streamDecode_t* LZ4_streamDecode, const char* source, char* dest, int originalSize)
+LZ4_FORCE_O2 int
+LZ4_decompress_fast_continue (LZ4_streamDecode_t* LZ4_streamDecode,
+                        const char* source, char* dest, int originalSize)
 {
-    LZ4_streamDecode_t_internal* lz4sd = &LZ4_streamDecode->internal_donotuse;
+    LZ4_streamDecode_t_internal* const lz4sd =
+        (assert(LZ4_streamDecode!=NULL), &LZ4_streamDecode->internal_donotuse);
     int result;
+
+    DEBUGLOG(5, "LZ4_decompress_fast_continue (toDecodeSize=%i)", originalSize);
     assert(originalSize >= 0);
 
     if (lz4sd->prefixSize == 0) {
+        DEBUGLOG(5, "first invocation : no prefix nor extDict");
         assert(lz4sd->extDictSize == 0);
         result = LZ4_decompress_fast(source, dest, originalSize);
         if (result <= 0) return result;
         lz4sd->prefixSize = (size_t)originalSize;
         lz4sd->prefixEnd = (BYTE*)dest + originalSize;
     } else if (lz4sd->prefixEnd == (BYTE*)dest) {
-        if (lz4sd->prefixSize >= 64 KB - 1 || lz4sd->extDictSize == 0)
-            result = LZ4_decompress_fast(source, dest, originalSize);
-        else
-            result = LZ4_decompress_fast_doubleDict(source, dest, originalSize,
-                                                    lz4sd->prefixSize, lz4sd->externalDict, lz4sd->extDictSize);
+        DEBUGLOG(5, "continue using existing prefix");
+        result = LZ4_decompress_unsafe_generic(
+                        (const BYTE*)source, (BYTE*)dest, originalSize,
+                        lz4sd->prefixSize,
+                        lz4sd->externalDict, lz4sd->extDictSize);
         if (result <= 0) return result;
         lz4sd->prefixSize += (size_t)originalSize;
         lz4sd->prefixEnd  += originalSize;
     } else {
+        DEBUGLOG(5, "prefix becomes extDict");
         lz4sd->extDictSize = lz4sd->prefixSize;
         lz4sd->externalDict = lz4sd->prefixEnd - lz4sd->extDictSize;
         result = LZ4_decompress_fast_extDict(source, dest, originalSize,
@@ -2507,7 +2617,9 @@ int LZ4_decompress_safe_partial_usingDict(const char* source, char* dest, int co
 int LZ4_decompress_fast_usingDict(const char* source, char* dest, int originalSize, const char* dictStart, int dictSize)
 {
     if (dictSize==0 || dictStart+dictSize == dest)
-        return LZ4_decompress_fast(source, dest, originalSize);
+        return LZ4_decompress_unsafe_generic(
+                        (const BYTE*)source, (BYTE*)dest, originalSize,
+                        (size_t)dictSize, NULL, 0);
     assert(dictSize >= 0);
     return LZ4_decompress_fast_extDict(source, dest, originalSize, dictStart, (size_t)dictSize);
 }
diff --git a/lib/lz4hc.c b/lib/lz4hc.c
index 8122bd8..4771ef8 100644
--- a/lib/lz4hc.c
+++ b/lib/lz4hc.c
@@ -755,7 +755,7 @@ _last_literals:
         } else {
             *op++ = (BYTE)(lastRunSize << ML_BITS);
         }
-        memcpy(op, anchor, lastRunSize);
+        LZ4_memcpy(op, anchor, lastRunSize);
         op += lastRunSize;
     }
 
@@ -894,7 +894,7 @@ LZ4HC_compress_generic_dictCtx (
         ctx->dictCtx = NULL;
         return LZ4HC_compress_generic_noDictCtx(ctx, src, dst, srcSizePtr, dstCapacity, cLevel, limit);
     } else if (position == 0 && *srcSizePtr > 4 KB) {
-        memcpy(ctx, ctx->dictCtx, sizeof(LZ4HC_CCtx_internal));
+        LZ4_memcpy(ctx, ctx->dictCtx, sizeof(LZ4HC_CCtx_internal));
         LZ4HC_setExternalDict(ctx, (const BYTE *)src);
         ctx->compressionLevel = (short)cLevel;
         return LZ4HC_compress_generic_noDictCtx(ctx, src, dst, srcSizePtr, dstCapacity, cLevel, limit);
@@ -1179,7 +1179,7 @@ int LZ4_saveDictHC (LZ4_streamHC_t* LZ4_streamHCPtr, char* safeBuffer, int dictS
     if (dictSize > prefixSize) dictSize = prefixSize;
     if (safeBuffer == NULL) assert(dictSize == 0);
     if (dictSize > 0)
-        memmove(safeBuffer, streamPtr->end - dictSize, dictSize);
+        LZ4_memmove(safeBuffer, streamPtr->end - dictSize, dictSize);
     {   U32 const endIndex = (U32)(streamPtr->end - streamPtr->prefixStart) + streamPtr->dictLimit;
         streamPtr->end = (const BYTE*)safeBuffer + dictSize;
         streamPtr->prefixStart = streamPtr->end - dictSize;
@@ -1589,7 +1589,7 @@ _last_literals:
          } else {
              *op++ = (BYTE)(lastRunSize << ML_BITS);
          }
-         memcpy(op, anchor, lastRunSize);
+         LZ4_memcpy(op, anchor, lastRunSize);
          op += lastRunSize;
      }
 
diff --git a/programs/bench.c b/programs/bench.c
index 5a56e6f..4d35ef9 100644
--- a/programs/bench.c
+++ b/programs/bench.c
@@ -409,7 +409,7 @@ static int BMK_benchMem(const void* srcBuffer, size_t srcSize,
                 remaining -= thisBlockSize;
     }   }   }
 
-    /* warmimg up memory */
+    /* warming up memory */
     RDG_genBuffer(compressedBuffer, maxCompressedSize, 0.10, 0.50, 1);
 
     /* decode-only mode : copy input to @compressedBuffer */
diff --git a/programs/lz4cli.c b/programs/lz4cli.c
index 51969fd..8c3f9fd 100644
--- a/programs/lz4cli.c
+++ b/programs/lz4cli.c
@@ -186,7 +186,7 @@ static int usage_longhelp(const char* exeName)
     DISPLAY( "\n");
     DISPLAY( "Compression levels : \n");
     DISPLAY( "---------------------\n");
-    DISPLAY( "-0 ... -2  => Fast compression, all identicals\n");
+    DISPLAY( "-0 ... -2  => Fast compression, all identical\n");
     DISPLAY( "-3 ... -%d => High compression; higher number == more compression but slower\n", LZ4HC_CLEVEL_MAX);
     DISPLAY( "\n");
     DISPLAY( "stdin, stdout and the console : \n");
diff --git a/tests/README.md b/tests/README.md
index 6b8302c..65437de 100644
--- a/tests/README.md
+++ b/tests/README.md
@@ -25,7 +25,7 @@ After `sleepTime` (an optional parameter, default 300 seconds) seconds the scrip
 If a new commit is found it is compiled and a speed benchmark for this commit is performed.
 The results of the speed benchmark are compared to the previous results.
 If compression or decompression speed for one of lz4 levels is lower than `lowerLimit` (an optional parameter, default 0.98) the speed benchmark is restarted.
-If second results are also lower than `lowerLimit` the warning e-mail is send to recipients from the list (the `emails` parameter).
+If the second results are also lower than `lowerLimit`, a warning e-mail is sent to the recipients on the list (the `emails` parameter).
 
 Additional remarks:
 - To be sure that speed results are accurate the script should be run on a "stable" target system with no other jobs running in parallel