path: root/Objects/stringlib/codecs.h
Commit message [Author, Age, Files, Lines]
* bpo-36819: Fix crashes in built-in encoders with weird error handlers (GH-28593) [Serhiy Storchaka, 2022-05-02, 1 file, -2/+13]
|   If the error handler returns a position less than or equal to the starting position of the non-encodable characters, most built-in encoders did not properly resize the output buffer. This led to out-of-bounds writes and segfaults.
* bpo-43179: Generalise alignment for optimised string routines (GH-24624) [Jessica Clarke, 2021-03-31, 1 file, -7/+4]
|   * Remove m68k-specific hack from ascii_decode
|
|     On m68k, the alignment of primitives is more relaxed, with 4-byte and 8-byte types only requiring 2-byte alignment, thus using sizeof(size_t) does not work. Instead, use the portable alternative.
|
|     Note that this is a minimal fix that only relaxes the assertion; the condition for when to use the optimised version remains overly strict. Such issues will be fixed tree-wide in the next commit.
|
|     NB: In C11 we could use _Alignof(size_t) instead, but for compatibility we use autoconf.
|
|   * Optimise string routines for architectures with non-natural alignment
|
|     C only requires that sizeof(x) is a multiple of alignof(x), not that the two are equal. Thus anywhere we optimise based on alignment we should be using alignof(x), not sizeof(x).
|
|     This is more annoying than it would be in C11, where we could just use _Alignof(x) (and alignof(x) in C++11), but since we still require only C99 we must plumb the information all the way from autoconf through the various typedefs and defines.
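As a rough illustration of the sizeof-versus-alignof point in the commit message above, the following C11 sketch checks pointer alignment against `_Alignof(size_t)`. The helper names `is_aligned_to` and `can_read_size_t_at` are hypothetical and are not taken from codecs.h; the commit itself states that it obtains the alignment through autoconf rather than `_Alignof`, for C99 compatibility.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from codecs.h): returns 1 when p is aligned to
   `align` bytes. `align` must be a power of two. */
static int is_aligned_to(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Returns 1 when a word-at-a-time loop may load a size_t from p.
   The check uses the alignment requirement, not the size: on ABIs where
   the two differ, testing against sizeof(size_t) is needlessly strict. */
static int can_read_size_t_at(const void *p)
{
    return is_aligned_to(p, _Alignof(size_t));
}
```

On most mainstream ABIs `_Alignof(size_t)` equals `sizeof(size_t)`, so both checks accept the same pointers; they only diverge on targets with relaxed alignment such as the m68k case the commit describes.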
* bpo-38252: Use 8-byte step to detect ASCII sequence in 64bit Windows build (GH-16334) [Ma Lin, 2020-10-18, 1 file, -15/+15]
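The commit subject above only says that ASCII detection uses an 8-byte step. A minimal sketch of that general technique, assuming nothing about the real code in codecs.h, could look like this in C. The names `ASCII_MASK_64` and `ascii_prefix_len` are hypothetical, and `memcpy` is used for the word load so the sketch does not depend on pointer alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One high bit per byte: any set bit means a byte >= 0x80 (non-ASCII). */
#define ASCII_MASK_64 UINT64_C(0x8080808080808080)

/* Hypothetical helper: length of the leading pure-ASCII run in s[0..n). */
static size_t ascii_prefix_len(const unsigned char *s, size_t n)
{
    size_t i = 0;

    /* Fast path: test eight bytes per iteration. */
    while (i + 8 <= n) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);
        if (word & ASCII_MASK_64)
            break;              /* some byte in this word is non-ASCII */
        i += 8;
    }

    /* Slow path: finish byte by byte to find the exact position. */
    while (i < n && s[i] < 0x80)
        i++;
    return i;
}
```

The mask test is independent of byte order, because it only asks whether any byte in the word has its top bit set.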
* bpo-29882: Add _Py_popcount32() function (GH-20518) [Victor Stinner, 2020-06-08, 1 file, -1/+1]
|   * Rename pycore_byteswap.h to pycore_bitutils.h.
|   * Move popcount_digit() to pycore_bitutils.h as _Py_popcount32().
|   * _Py_popcount32() uses GCC and clang builtin function if available.
|   * Add unit tests to _Py_popcount32().
* bpo-40302: UTF-32 encoder SWAB4() macro use a|b rather than a+b (GH-19572) [Victor Stinner, 2020-04-17, 1 file, -1/+1]
|
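To make the a|b versus a+b remark concrete, here is a hedged C sketch of a 4-byte swap written both ways. These functions are hypothetical stand-ins, not the SWAB4() macro from codecs.h. Because the four shifted fields occupy disjoint bits, OR and addition produce the same value; OR simply states the intent (combining bit fields) more directly.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 4-byte swap, combining the fields with OR. */
static uint32_t swab4_or(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
}

/* Same swap with addition. The fields never overlap, so no carry can
   occur and the result is identical to the OR version. */
static uint32_t swab4_add(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) + ((x & 0x0000FF00u) << 8) +
           ((x & 0x00FF0000u) >> 8)  + ((x & 0xFF000000u) >> 24);
}
```

Swapping twice returns the original value, which is a convenient sanity check for either form.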
* bpo-40302: Add pycore_byteswap.h header file (GH-19552) [Victor Stinner, 2020-04-17, 1 file, -16/+20]
|   Add a new internal pycore_byteswap.h header file with the following functions:
|
|   * _Py_bswap16()
|   * _Py_bswap32()
|   * _Py_bswap64()
|
|   Use these functions in _ctypes, sha256 and sha512 modules, and also use in the UTF-32 encoder.
|
|   sha256, sha512 and _ctypes modules are now built with the internal C API.
* bpo-39943: Add the const qualifier to pointers on non-mutable PyUnicode data. (GH-19345) [Serhiy Storchaka, 2020-04-11, 1 file, -1/+1]
* Update some www.unicode.org URLs to use HTTPS. (GH-18912) [Benjamin Peterson, 2020-03-11, 1 file, -1/+1]
|
* bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327) [Inada Naoki, 2020-02-27, 1 file, -18/+17]
|   Avoid using temporary bytes object.
* closes bpo-39605: Fix some casts to not cast away const. (GH-18453) [Andy Lester, 2020-02-12, 1 file, -2/+2]
|   gcc -Wcast-qual turns up a number of instances of casting away constness of pointers. Some of these can be safely modified, by either:
|
|   Adding the const to the type cast, as in:
|
|   - return _PyUnicode_FromUCS1((unsigned char*)s, size);
|   + return _PyUnicode_FromUCS1((const unsigned char*)s, size);
|
|   or, Removing the cast entirely, because it's not necessary (but probably was at one time), as in:
|
|   - PyDTrace_FUNCTION_ENTRY((char *)filename, (char *)funcname, lineno);
|   + PyDTrace_FUNCTION_ENTRY(filename, funcname, lineno);
|
|   These changes will not change code, but they will make it much easier to check for errors in consts
* bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) [Serhiy Storchaka, 2019-06-25, 1 file, -3/+3]
|   * The UTF-8 incremental decoders now fail fast if they encounter a sequence that can't be handled by the error handler.
|   * The UTF-16 incremental decoders with the surrogatepass error handler now decode a lone low surrogate with final=False.
* bpo-36775: _PyCoreConfig only uses wchar_t* (GH-13062) [Victor Stinner, 2019-05-02, 1 file, -1/+1]
|   _PyCoreConfig: Change filesystem_encoding, filesystem_errors, stdio_encoding and stdio_errors fields type from char* to wchar_t*.
|
|   Changes:
|
|   * PyInterpreterState: replace fscodec_initialized (int) with fs_codec structure.
|   * Add get_error_handler_wide() and unicode_encode_utf8() helper functions.
|   * Add error_handler parameter to unicode_encode_locale() and unicode_decode_locale().
|   * Remove _PyCoreConfig_SetString().
|   * Rename _PyCoreConfig_SetWideString() to _PyCoreConfig_SetString().
|   * Rename _PyCoreConfig_SetWideStringFromString() to _PyCoreConfig_DecodeLocale().
* bpo-34523: Support surrogatepass in locale codecs (GH-8995) [Victor Stinner, 2018-08-29, 1 file, -1/+1]
|   Add support for the "surrogatepass" error handler in PyUnicode_DecodeFSDefault() and PyUnicode_EncodeFSDefault() for the UTF-8 encoding.
|
|   Changes:
|
|   * _Py_DecodeUTF8Ex() and _Py_EncodeUTF8Ex() now support the surrogatepass error handler (_Py_ERROR_SURROGATEPASS).
|   * _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx() now use the _Py_error_handler enum instead of "int surrogateescape" to pass the error handler. These functions now return -3 if the error handler is unknown.
|   * Add unit tests on _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx() in test_codecs.
|   * Rename get_error_handler() to _Py_GetErrorHandler() and expose it as a private function.
|   * _freeze_importlib doesn't need config.filesystem_errors="strict" workaround anymore.
* bpo-30923: Silence fall-through warnings included in -Wextra since gcc-7.0. (#3157) [Stefan Krah, 2017-08-21, 1 file, -2/+2]
* Issue #28561: Clean up UTF-8 encoder: remove dead code, update comments, etc. [Serhiy Storchaka, 2016-10-30, 1 file, -10/+4]
|   Patch by Xiang Zhang.
* PEP 7 style for if/else in C [Victor Stinner, 2016-09-02, 1 file, -1/+2]
|   Also add a newline for readability in normalize_encoding().
* Issue #27895: Spelling fixes (Contributed by Ville Skyttä). [Raymond Hettinger, 2016-08-30, 1 file, -3/+3]
|
* Issue #26765: Ensure that bytes- and unicode-specific stringlib files are used with correct type. [Serhiy Storchaka, 2016-05-16, 1 file, -3/+3]
* Optimize error handlers of ASCII and Latin1 encoders when the replacement string is pure ASCII: use _PyBytesWriter_WriteBytes(), don't check individual character. [Victor Stinner, 2015-10-09, 1 file, -11/+7]
|   Cleanup unicode_encode_ucs1():
|
|   * Rename repunicode to rep
|   * Clear rep object on error
|   * Factorize code between bytes and unicode path
* Add _PyBytesWriter_WriteBytes() to factorize the code [Victor Stinner, 2015-10-09, 1 file, -11/+11]
|
* _PyBytesWriter: simplify code to avoid "prealloc" parameters [Victor Stinner, 2015-10-09, 1 file, -8/+12]
|   Subtract preallocated bytes from min_size before calling _PyBytesWriter_Prepare().
* Optimize backslashreplace error handler [Victor Stinner, 2015-10-08, 1 file, -2/+16]
|   Issue #25318: Optimize backslashreplace and xmlcharrefreplace error handlers in UTF-8 encoder. Also optimize the backslashreplace error handler for ASCII and Latin1 encoders.
|
|   Use the new _PyBytesWriter API to optimize these error handlers for the encoders. This avoids creating an exception and calling the slow implementation of the error handler.
* Issue #25318: Add _PyBytesWriter API [Victor Stinner, 2015-10-08, 1 file, -63/+21]
|   Add a new private API to optimize Unicode encoders. It uses a small buffer allocated on the stack and supports overallocation.
|
|   Use _PyBytesWriter API for UCS1 (ASCII and Latin1) and UTF-8 encoders. Enable overallocation for the UTF-8 encoder with error handlers.
|
|   unicode_encode_ucs1(): initialize collend to collstart+1 to not check the current character twice, we already know that it is not ASCII.
* Issue #25267: The UTF-8 encoder is now up to 75 times as fast for error handlers: ``ignore``, ``replace``, ``surrogateescape``, ``surrogatepass``. [Victor Stinner, 2015-10-01, 1 file, -51/+96]
|   Patch co-written with Serhiy Storchaka.
* Fixed typos in comments. [Serhiy Storchaka, 2015-05-18, 1 file, -4/+4]
|\
| * Fixed typos in comments. [Serhiy Storchaka, 2015-05-18, 1 file, -2/+2]
| |
* | Issue #15027: The UTF-32 encoder is now 3x to 7x faster. [Serhiy Storchaka, 2015-05-12, 1 file, -0/+87]
|/
* Reverted changeset b72c5573c5e7 (issue #15027). [Serhiy Storchaka, 2014-01-04, 1 file, -87/+0]
|
* Issue #15027: Rewrite the UTF-32 encoder. It is now 1.6x to 3.5x faster. [Serhiy Storchaka, 2014-01-04, 1 file, -0/+87]
|
* Remove dead code committed in issue #12892. [Serhiy Storchaka, 2013-11-19, 1 file, -104/+0]
|
* Issue #12892: The utf-16* and utf-32* codecs now reject (lone) surrogates. [Serhiy Storchaka, 2013-11-19, 1 file, -16/+182]
|   The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800-U+DFFF) to be encoded. The utf-32* decoders no longer decode byte sequences that correspond to surrogate code points. The surrogatepass error handler now works with the utf-16* and utf-32* codecs.
|
|   Based on patches by Victor Stinner and Kang-Hao (Kenny) Lu.
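The surrogate rule described in the commit message above can be sketched in C as follows. This is an illustrative stand-alone encoder for a single code point, not the code from codecs.h; `is_surrogate` and `utf16_encode_one` are hypothetical names, and the sketch simply returns -1 where a real codec would invoke the configured error handler (for example surrogatepass).

```c
#include <stdint.h>

/* Surrogate code points occupy U+D800..U+DFFF. */
static int is_surrogate(uint32_t ch)
{
    return ch >= 0xD800 && ch <= 0xDFFF;
}

/* Hypothetical helper: encode one code point as UTF-16 code units.
   Returns the number of units written (1 or 2), or -1 when ch is a
   surrogate or lies beyond U+10FFFF and therefore must be rejected. */
static int utf16_encode_one(uint32_t ch, uint16_t out[2])
{
    if (is_surrogate(ch) || ch > 0x10FFFF)
        return -1;
    if (ch < 0x10000) {
        out[0] = (uint16_t)ch;          /* BMP: one code unit */
        return 1;
    }
    ch -= 0x10000;                      /* 20 bits left to split */
    out[0] = (uint16_t)(0xD800 + (ch >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (ch & 0x3FF));   /* low surrogate */
    return 2;
}
```

For instance, U+1F600 is written as the pair 0xD83D 0xDE00, while a request to encode U+D800 on its own is refused.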
* Issue #18722: Remove uses of the "register" keyword in C code. [Antoine Pitrou, 2013-08-13, 1 file, -3/+3]
|
* (Merge 3.3) Issue #8271: Fix compilation on Windows [Victor Stinner, 2012-11-04, 1 file, -1/+1]
|\
| * Issue #8271: Fix compilation on Windows [Victor Stinner, 2012-11-04, 1 file, -1/+1]
| |
* | #8271: merge with 3.3. [Ezio Melotti, 2012-11-04, 1 file, -30/+62]
|\ \
| |/
| * #8271: the utf-8 decoder now outputs the correct number of U+FFFD characters when used with the "replace" error handler on invalid utf-8 sequences. [Ezio Melotti, 2012-11-04, 1 file, -30/+62]
| |   Patch by Serhiy Storchaka, tests by Ezio Melotti.
* | Issue #16166: Add PY_LITTLE_ENDIAN and PY_BIG_ENDIAN macros and unified endianness detection and handling. [Christian Heimes, 2012-10-17, 1 file, -3/+3]
|/
* Issue #15144: Fix possible integer overflow when handling pointers as integer values, by using Py_uintptr_t instead of size_t. [Antoine Pitrou, 2012-09-20, 1 file, -9/+5]
|   Patch by Serhiy Storchaka.
* Use correct types for ASCII_CHAR_MASK integer constants. [Mark Dickinson, 2012-07-07, 1 file, -2/+2]
|
* Issue #14923: Optimize continuation-byte check in UTF-8 decoding. [Mark Dickinson, 2012-06-23, 1 file, -6/+10]
|   Patch by Serhiy Storchaka.
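The commit subject above does not say which form of the continuation-byte check was adopted, so the following C sketch only shows two equivalent ways to write the test and is not taken from codecs.h. A UTF-8 continuation byte has the bit pattern 10xxxxxx, i.e. a value in 0x80..0xBF. All function names here are hypothetical.

```c
#include <stddef.h>

/* Mask form: keep the top two bits and compare with binary 10. */
static int is_continuation_mask(unsigned char ch)
{
    return (ch & 0xC0) == 0x80;
}

/* Range form: the same set of bytes expressed as an interval. */
static int is_continuation_range(unsigned char ch)
{
    return ch >= 0x80 && ch < 0xC0;
}

/* Returns 1 when both forms agree on every possible byte value. */
static int forms_agree(void)
{
    for (int b = 0; b < 256; b++) {
        if (is_continuation_mask((unsigned char)b) !=
            is_continuation_range((unsigned char)b))
            return 0;
    }
    return 1;
}

/* Counts continuation bytes in s[0..n). */
static size_t count_continuation_bytes(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)is_continuation_mask(s[i]);
    return count;
}
```

A two-byte sequence such as C3 A9 contains one continuation byte, and a three-byte sequence such as E2 82 AC contains two.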
* Issue #15026: utf-16 encoding is now significantly faster (up to 10x). [Antoine Pitrou, 2012-06-15, 1 file, -0/+64]
|   Patch by Serhiy Storchaka.
* Issue #14624: UTF-16 decoding is now 3x to 4x faster on various inputs. [Antoine Pitrou, 2012-05-15, 1 file, -1/+148]
|   Patch by Serhiy Storchaka.
* Issue #14738: Speed-up UTF-8 decoding on non-ASCII data. [Antoine Pitrou, 2012-05-10, 1 file, -78/+143]
|   Patch by Serhiy Storchaka.
* Issue #13624: Write a specialized UTF-8 encoder to allow more optimization [Victor Stinner, 2011-12-18, 1 file, -0/+197]
|   The main bottleneck was the PyUnicode_READ() macro.
* Issue #13417: speed up utf-8 decoding by around 2x for the non-fully-ASCII case. [Antoine Pitrou, 2011-11-21, 1 file, -0/+156]
    This almost catches up with pre-PEP 393 performance, when decoding needed only one pass.