author: Victor Stinner <vstinner@python.org> 2020-11-01 22:07:23 (GMT)
committer: GitHub <noreply@github.com> 2020-11-01 22:07:23 (GMT)
commit: e662c398d87f136497f8ec672e83657ae3a599e0
tree: cc9383c30557769a096be580b7f8f1b936565ea9 /Doc/c-api
parent: 82458b6cdbae3b849dc11d0d7dc2ab06ef0451c4
bpo-42236: Use UTF-8 encoding if nl_langinfo(CODESET) fails (GH-23086)
If the nl_langinfo(CODESET) function returns an empty string, Python now uses UTF-8 as the filesystem encoding.

In May 2010 (commit b744ba1d14c5487576c95d0311e357b707600b47), I modified Python to log a warning and use UTF-8 as the filesystem encoding (instead of None) if nl_langinfo(CODESET) returns an empty string.

In August 2020 (commit 94908bbc1503df830d1d615e7b57744ae1b41079), I modified Python startup to fail with a fatal error and a specific error message if nl_langinfo(CODESET) returns an empty string. The intent was to prevent guessing the encoding and also to investigate user configurations where this case happens.

In 10 years (2010 to 2020), I saw zero user reports about the error message related to nl_langinfo(CODESET) returning an empty string.

Today, UTF-8 has become the de facto standard and it is safe to assume that the user expects UTF-8. For example, nl_langinfo(CODESET) can return an empty string on macOS if the LC_CTYPE locale is not supported, and UTF-8 is the default encoding on macOS.

While this change is unlikely to affect anyone in practice, it should make UTF-8 lovers happy ;-)

Also rewrite the documentation explaining how Python selects the filesystem encoding and error handler.
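The selection rules documented by this change can be sketched as a small pure-Python function. This is only an illustration of the documented order of checks, not CPython's actual implementation (which lives in C). The function name, its parameters, the platform labels, and the `forced_ascii` flag (standing in for the "locale announces ASCII but `mbstowcs()` decodes from another encoding" check) are all hypothetical.

```python
import codecs


def select_fs_encoding(platform, utf8_mode=False,
                       legacy_windows_fs_encoding=False,
                       codeset="", forced_ascii=False):
    """Return (encoding, errors) following the documented rules.

    Hypothetical helper: every parameter is an assumption made for
    illustration. `codeset` stands for the nl_langinfo(CODESET) result.
    """
    if platform == "win32":
        # Windows: UTF-8 unless the legacy filesystem encoding is requested.
        if legacy_windows_fs_encoding:
            return "mbcs", "replace"
        return "utf-8", "surrogatepass"

    errors = "surrogateescape"  # default on non-Windows platforms

    if platform in ("darwin", "android", "vxworks"):
        return "utf-8", errors
    if utf8_mode:
        return "utf-8", errors
    if forced_ascii:
        # The locale claims ASCII but mbstowcs() decodes something else.
        return "ascii", errors
    if not codeset:
        # New behavior in this commit: empty CODESET falls back to UTF-8.
        return "utf-8", errors
    # Otherwise use the locale encoding, normalized to the Python codec name.
    return codecs.lookup(codeset).name, errors
```

For example, `select_fs_encoding("linux", codeset="")` yields UTF-8 with `surrogateescape`, which is the case this commit changes from a fatal error to a silent fallback.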
Diffstat (limited to 'Doc/c-api')
-rw-r--r-- Doc/c-api/init_config.rst | 52
1 files changed, 47 insertions, 5 deletions
diff --git a/Doc/c-api/init_config.rst b/Doc/c-api/init_config.rst
index 37f5b9f..92a6c3a 100644
--- a/Doc/c-api/init_config.rst
+++ b/Doc/c-api/init_config.rst
@@ -253,10 +253,16 @@ PyPreConfig
See :c:member:`PyConfig.isolated`.
- .. c:member:: int legacy_windows_fs_encoding (Windows only)
+ .. c:member:: int legacy_windows_fs_encoding
- If non-zero, disable UTF-8 Mode, set the Python filesystem encoding to
- ``mbcs``, set the filesystem error handler to ``replace``.
+ If non-zero:
+
+ * Set :c:member:`PyPreConfig.utf8_mode` to ``0``,
+ * Set :c:member:`PyConfig.filesystem_encoding` to ``"mbcs"``,
+ * Set :c:member:`PyConfig.filesystem_errors` to ``"replace"``.
+
+ Initialized from the :envvar:`PYTHONLEGACYWINDOWSFSENCODING` environment
+ variable value.
Only available on Windows. ``#ifdef MS_WINDOWS`` macro can be used for
Windows specific code.
@@ -499,11 +505,47 @@ PyConfig
.. c:member:: wchar_t* filesystem_encoding
- Filesystem encoding, :func:`sys.getfilesystemencoding`.
+ Filesystem encoding: :func:`sys.getfilesystemencoding`.
+
+ On macOS, Android and VxWorks: use ``"utf-8"`` by default.
+
+ On Windows: use ``"utf-8"`` by default, or ``"mbcs"`` if
+ :c:member:`~PyPreConfig.legacy_windows_fs_encoding` of
+ :c:type:`PyPreConfig` is non-zero.
+
+ Default encoding on other platforms:
+
+ * ``"utf-8"`` if :c:member:`PyPreConfig.utf8_mode` is non-zero.
+ * ``"ascii"`` if Python detects that ``nl_langinfo(CODESET)`` announces
+ the ASCII encoding (or Roman8 encoding on HP-UX), whereas the
+ ``mbstowcs()`` function decodes from a different encoding (usually
+ Latin1).
+ * ``"utf-8"`` if ``nl_langinfo(CODESET)`` returns an empty string.
+ * Otherwise, use the LC_CTYPE locale encoding:
+ ``nl_langinfo(CODESET)`` result.
+
+ At Python startup, the encoding name is normalized to the Python codec
+ name. For example, ``"ANSI_X3.4-1968"`` is replaced with ``"ascii"``.
+
+ See also the :c:member:`~PyConfig.filesystem_errors` member.
.. c:member:: wchar_t* filesystem_errors
- Filesystem encoding errors, :func:`sys.getfilesystemencodeerrors`.
+ Filesystem error handler: :func:`sys.getfilesystemencodeerrors`.
+
+ On Windows: use ``"surrogatepass"`` by default, or ``"replace"`` if
+ :c:member:`~PyPreConfig.legacy_windows_fs_encoding` of
+ :c:type:`PyPreConfig` is non-zero.
+
+ On other platforms: use ``"surrogateescape"`` by default.
+
+ Supported error handlers:
+
+ * ``"strict"``
+ * ``"surrogateescape"``
+ * ``"surrogatepass"`` (only supported with the UTF-8 encoding)
+
+ See also the :c:member:`~PyConfig.filesystem_encoding` member.
.. c:member:: unsigned long hash_seed
.. c:member:: int use_hash_seed