author | Illia Volochii <illia.volochii@gmail.com> | 2023-05-17 08:49:20 (GMT) |
---|---|---|
committer | GitHub <noreply@github.com> | 2023-05-17 08:49:20 (GMT) |
commit | 2f630e1ce18ad2e07428296532a68b11dc66ad10 (patch) | |
tree | 9304975238c9ef66124cbb4a43f8b7f006ffd3fe /Doc/library/urllib.parse.rst | |
parent | b58bc8c2a9a316891a5ea1a0487aebfc86c2793a (diff) | |
gh-102153: Start stripping C0 control and space chars in `urlsplit` (#102508)
`urllib.parse.urlsplit` already partially follows the WHATWG spec (see #25595).
This adds more sanitizing to respect the "Remove any leading C0 control or space from input" [rule](https://url.spec.whatwg.org/#url-parsing:~:text=Remove%20any%20leading%20and%20trailing%20C0%20control%20or%20space%20from%20input.) in response to [CVE-2023-24329](https://nvd.nist.gov/vuln/detail/CVE-2023-24329).
---------
Co-authored-by: Gregory P. Smith [Google] <greg@krypto.org>
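The sanitizing described in the commit message can be illustrated with a small sketch. This is not the CPython implementation: `sanitize_url` and the two constants are hypothetical names, and the character sets are assumed from the WHATWG definitions (C0 controls are U+0000 through U+001F). It only shows the intended difference between the two rules: C0 control and space characters are stripped from the start of the input, while tab and newline characters are removed at any position.

```python
# Sketch only: hypothetical helper, not the code added by this commit.
# Assumes the WHATWG definition of "C0 control or space": U+0000..U+001F plus U+0020.
C0_CONTROL_OR_SPACE = "".join(chr(c) for c in range(0x20)) + " "
TAB_OR_NEWLINE = "\t\r\n"


def sanitize_url(url: str) -> str:
    """Strip leading C0 control/space chars, then drop tab/newline chars anywhere."""
    # Leading characters only: trailing ones are left untouched in this sketch.
    url = url.lstrip(C0_CONTROL_OR_SPACE)
    # Tab, carriage return and line feed are removed at any position.
    for ch in TAB_OR_NEWLINE:
        url = url.replace(ch, "")
    return url


print(repr(sanitize_url(" \x00https://exa\nmple.com/ ")))
# → 'https://example.com/ '
```

Note that the trailing space survives, which matches the commit title: only leading characters are stripped.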
Diffstat (limited to 'Doc/library/urllib.parse.rst')
-rw-r--r-- | Doc/library/urllib.parse.rst | 46 |
1 files changed, 44 insertions, 2 deletions
diff --git a/Doc/library/urllib.parse.rst b/Doc/library/urllib.parse.rst
index 96b3965..5a9a53f 100644
--- a/Doc/library/urllib.parse.rst
+++ b/Doc/library/urllib.parse.rst
@@ -159,6 +159,10 @@ or on combining URL components into a URL string.
       ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                   params='', query='', fragment='')
 
+   .. warning::
+
+      :func:`urlparse` does not perform validation. See :ref:`URL parsing
+      security <url-parsing-security>` for details.
 
    .. versionchanged:: 3.2
       Added IPv6 URL parsing capabilities.
@@ -324,8 +328,14 @@ or on combining URL components into a URL string.
    ``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
    decomposed before parsing, no error will be raised.
 
-   Following the `WHATWG spec`_ that updates RFC 3986, ASCII newline
-   ``\n``, ``\r`` and tab ``\t`` characters are stripped from the URL.
+   Following some of the `WHATWG spec`_ that updates RFC 3986, leading C0
+   control and space characters are stripped from the URL. ``\n``,
+   ``\r`` and tab ``\t`` characters are removed from the URL at any position.
+
+   .. warning::
+
+      :func:`urlsplit` does not perform validation. See :ref:`URL parsing
+      security <url-parsing-security>` for details.
 
    .. versionchanged:: 3.6
       Out-of-range port numbers now raise :exc:`ValueError`, instead of
@@ -338,6 +348,9 @@ or on combining URL components into a URL string.
    .. versionchanged:: 3.10
       ASCII newline and tab characters are stripped from the URL.
 
+   .. versionchanged:: 3.12
+      Leading WHATWG C0 control and space characters are stripped from the URL.
+
 .. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
 
 .. function:: urlunsplit(parts)
@@ -414,6 +427,35 @@ or on combining URL components into a URL string.
    or ``scheme://host/path``). If *url* is not a wrapped URL, it is returned
    without changes.
 
+.. _url-parsing-security:
+
+URL parsing security
+--------------------
+
+The :func:`urlsplit` and :func:`urlparse` APIs do not perform **validation** of
+inputs. They may not raise errors on inputs that other applications consider
+invalid. They may also succeed on some inputs that might not be considered
+URLs elsewhere. Their purpose is for practical functionality rather than
+purity.
+
+Instead of raising an exception on unusual input, they may instead return some
+component parts as empty strings. Or components may contain more than perhaps
+they should.
+
+We recommend that users of these APIs where the values may be used anywhere
+with security implications code defensively. Do some verification within your
+code before trusting a returned component part. Does that ``scheme`` make
+sense? Is that a sensible ``path``? Is there anything strange about that
+``hostname``? etc.
+
+What constitutes a URL is not universally well defined. Different applications
+have different needs and desired constraints. For instance the living `WHATWG
+spec`_ describes what user facing web clients such as a web browser require.
+While :rfc:`3986` is more general. These functions incorporate some aspects of
+both, but cannot be claimed compliant with either. The APIs and existing user
+code with expectations on specific behaviors predate both standards leading us
+to be very cautious about making API behavior changes.
+
 .. _parsing-ascii-encoded-bytes:
 
 Parsing ASCII Encoded Bytes
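
The defensive checks recommended in the new "URL parsing security" section might look like the following. This is a minimal sketch under stated assumptions, not part of the commit: `is_trusted_url`, `ALLOWED_SCHEMES` and the host allowlist are hypothetical, and a real application would pick its own policy. Only `urllib.parse.urlsplit` and its `scheme` and `hostname` attributes come from the standard library.

```python
from urllib.parse import urlsplit

# Hypothetical policy: which schemes this application is willing to follow.
ALLOWED_SCHEMES = {"http", "https"}


def is_trusted_url(url, allowed_hosts):
    """Return True only if the parsed scheme and hostname pass an explicit allowlist."""
    try:
        parts = urlsplit(url)
        # Accessing .hostname is kept inside the try block so any ValueError
        # raised while interpreting the netloc is treated as "not trusted".
        hostname = parts.hostname
    except ValueError:
        return False
    # urlsplit does not validate, so an empty or unexpected scheme is possible.
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    # hostname is None when there is no netloc; it is lowercased otherwise.
    if hostname is None or hostname not in allowed_hosts:
        return False
    return True


hosts = {"example.com"}
print(is_trusted_url("https://example.com/path", hosts))
print(is_trusted_url("javascript:alert(1)", hosts))
print(is_trusted_url("https://example.com@evil.test/", hosts))
```

The allowlist approach is a design choice for the sketch: it rejects anything the parser returns that was not explicitly expected, rather than trying to enumerate bad inputs.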