author | Victor Stinner <vstinner@python.org> | 2020-11-01 22:07:23 (GMT) |
---|---|---|
committer | GitHub <noreply@github.com> | 2020-11-01 22:07:23 (GMT) |
commit | e662c398d87f136497f8ec672e83657ae3a599e0 (patch) | |
tree | cc9383c30557769a096be580b7f8f1b936565ea9 /Doc/c-api | |
parent | 82458b6cdbae3b849dc11d0d7dc2ab06ef0451c4 (diff) | |

bpo-42236: Use UTF-8 encoding if nl_langinfo(CODESET) fails (GH-23086)

If the nl_langinfo(CODESET) function returns an empty string, Python
now uses UTF-8 as the filesystem encoding.

In May 2010 (commit b744ba1d14c5487576c95d0311e357b707600b47), I
modified Python to log a warning and use UTF-8 as the filesystem
encoding (instead of None) if nl_langinfo(CODESET) returns an empty
string.

In August 2020 (commit 94908bbc1503df830d1d615e7b57744ae1b41079), I
modified Python startup to fail with a fatal error and a specific
error message if nl_langinfo(CODESET) returns an empty string. The
intent was to prevent guessing the encoding and also to investigate
user configurations where this case happens.

In 10 years (2010 to 2020), I saw zero user reports about the error
message related to nl_langinfo(CODESET) returning an empty string.

Today, UTF-8 has become the de facto standard and it's safe to assume
that the user expects UTF-8. For example, nl_langinfo(CODESET) can
return an empty string on macOS if the LC_CTYPE locale is not
supported, and UTF-8 is the default encoding on macOS.

While this change is unlikely to affect anyone in practice, it
should make UTF-8 lovers happy ;-)

Also rewrite the documentation explaining how Python selects the
filesystem encoding and error handler.

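
To see which values a given interpreter ended up with, here is a minimal sketch using only the standard library. The helper name `describe_fs_codec` is made up for illustration, and `locale.nl_langinfo` is not available on every platform (for example Windows), hence the guard.

```python
import codecs
import locale
import sys


def describe_fs_codec():
    """Collect the filesystem codec settings of the running interpreter."""
    info = {
        # Name of the encoding used to convert filenames to and from bytes.
        "encoding": sys.getfilesystemencoding(),
        # Error handler paired with that encoding.
        "errors": sys.getfilesystemencodeerrors(),
        # True if the Python UTF-8 Mode is enabled.
        "utf8_mode": bool(sys.flags.utf8_mode),
    }
    # nl_langinfo() only exists on some platforms, so probe for it.
    if hasattr(locale, "nl_langinfo"):
        # Raw LC_CTYPE codeset name; this is the value that may be empty.
        info["codeset"] = locale.nl_langinfo(locale.CODESET)
    return info


if __name__ == "__main__":
    for key, value in describe_fs_codec().items():
        print(f"{key}: {value!r}")
```

The reported `codeset` reflects the locale currently set in the process, which may differ from what the interpreter saw during startup.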
Diffstat (limited to 'Doc/c-api')
-rw-r--r-- | Doc/c-api/init_config.rst | 52 |
1 files changed, 47 insertions, 5 deletions
diff --git a/Doc/c-api/init_config.rst b/Doc/c-api/init_config.rst
index 37f5b9f..92a6c3a 100644
--- a/Doc/c-api/init_config.rst
+++ b/Doc/c-api/init_config.rst
@@ -253,10 +253,16 @@ PyPreConfig
 
       See :c:member:`PyConfig.isolated`.
 
-   .. c:member:: int legacy_windows_fs_encoding (Windows only)
+   .. c:member:: int legacy_windows_fs_encoding
 
-      If non-zero, disable UTF-8 Mode, set the Python filesystem encoding to
-      ``mbcs``, set the filesystem error handler to ``replace``.
+      If non-zero:
+
+      * Set :c:member:`PyPreConfig.utf8_mode` to ``0``,
+      * Set :c:member:`PyConfig.filesystem_encoding` to ``"mbcs"``,
+      * Set :c:member:`PyConfig.filesystem_errors` to ``"replace"``.
+
+      Initialized the from :envvar:`PYTHONLEGACYWINDOWSFSENCODING` environment
+      variable value.
 
       Only available on Windows. ``#ifdef MS_WINDOWS`` macro can be used for
       Windows specific code.
@@ -499,11 +505,47 @@ PyConfig
 
    .. c:member:: wchar_t* filesystem_encoding
 
-      Filesystem encoding, :func:`sys.getfilesystemencoding`.
+      Filesystem encoding: :func:`sys.getfilesystemencoding`.
+
+      On macOS, Android and VxWorks: use ``"utf-8"`` by default.
+
+      On Windows: use ``"utf-8"`` by default, or ``"mbcs"`` if
+      :c:member:`~PyPreConfig.legacy_windows_fs_encoding` of
+      :c:type:`PyPreConfig` is non-zero.
+
+      Default encoding on other platforms:
+
+      * ``"utf-8"`` if :c:member:`PyPreConfig.utf8_mode` is non-zero.
+      * ``"ascii"`` if Python detects that ``nl_langinfo(CODESET)`` announces
+        the ASCII encoding (or Roman8 encoding on HP-UX), whereas the
+        ``mbstowcs()`` function decodes from a different encoding (usually
+        Latin1).
+      * ``"utf-8"`` if ``nl_langinfo(CODESET)`` returns an empty string.
+      * Otherwise, use the LC_CTYPE locale encoding:
+        ``nl_langinfo(CODESET)`` result.
+
+      At Python statup, the encoding name is normalized to the Python codec
+      name. For example, ``"ANSI_X3.4-1968"`` is replaced with ``"ascii"``.
+
+      See also the :c:member:`~PyConfig.filesystem_errors` member.
 
    .. c:member:: wchar_t* filesystem_errors
 
-      Filesystem encoding errors, :func:`sys.getfilesystemencodeerrors`.
+      Filesystem error handler: :func:`sys.getfilesystemencodeerrors`.
+
+      On Windows: use ``"surrogatepass"`` by default, or ``"replace"`` if
+      :c:member:`~PyPreConfig.legacy_windows_fs_encoding` of
+      :c:type:`PyPreConfig` is non-zero.
+
+      On other platforms: use ``"surrogateescape"`` by default.
+
+      Supported error handlers:
+
+      * ``"strict"``
+      * ``"surrogateescape"``
+      * ``"surrogatepass"`` (only supported with the UTF-8 encoding)
+
+      See also the :c:member:`~PyConfig.filesystem_encoding` member.
 
    .. c:member:: unsigned long hash_seed
    .. c:member:: int use_hash_seed
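
The selection rules documented in this diff can be summarized in a short Python sketch. It illustrates the documented defaults and is not CPython's C implementation. The function names, the `platform` strings, and the `force_ascii` flag (standing in for the check that `nl_langinfo(CODESET)` announces ASCII while `mbstowcs()` decodes from another encoding) are hypothetical. Using `codecs.lookup()` to normalize the name is likewise an assumption made for the sketch.

```python
import codecs


def select_fs_encoding(platform, utf8_mode=False,
                       legacy_windows_fs_encoding=False,
                       codeset="", force_ascii=False):
    """Return the default filesystem encoding per the documented rules.

    All parameters are illustrative stand-ins for PyPreConfig fields and
    for the result of nl_langinfo(CODESET).
    """
    if platform in ("darwin", "android", "vxworks"):
        return "utf-8"
    if platform == "win32":
        return "mbcs" if legacy_windows_fs_encoding else "utf-8"
    # Other platforms: apply the documented rules in order.
    if utf8_mode:
        return "utf-8"
    if force_ascii:
        return "ascii"
    if not codeset:
        # The case this commit changes: an empty CODESET now means UTF-8.
        return "utf-8"
    # Normalize the locale encoding name to the Python codec name.
    return codecs.lookup(codeset).name


def select_fs_errors(platform, legacy_windows_fs_encoding=False):
    """Return the default filesystem error handler per the documented rules."""
    if platform == "win32":
        return "replace" if legacy_windows_fs_encoding else "surrogatepass"
    return "surrogateescape"


if __name__ == "__main__":
    print(select_fs_encoding("linux", codeset=""))
    print(select_fs_encoding("linux", codeset="ANSI_X3.4-1968"))
    print(select_fs_errors("win32", legacy_windows_fs_encoding=True))
```

Keeping the two functions pure (no calls into `locale` or `sys`) makes each documented branch easy to exercise on any platform.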