field | value | date |
---|---|---|
author | Ezio Melotti <ezio.melotti@gmail.com> | 2013-11-23 17:52:05 (GMT) |
committer | Ezio Melotti <ezio.melotti@gmail.com> | 2013-11-23 17:52:05 (GMT) |
commit | 95401c5f6b9f07b094924559177c9b30a1c38998 | |
tree | 3029ea3bbffc0c53c64275a2e587bbf696a740cb /Doc/library/html.parser.rst | |
parent | e7f87e12626d6ae3b9ed8cae8904a6afad580ffc | |
#13633: Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references.
Diffstat (limited to 'Doc/library/html.parser.rst')
-rw-r--r-- | Doc/library/html.parser.rst | 35 |
1 files changed, 24 insertions, 11 deletions
diff --git a/Doc/library/html.parser.rst b/Doc/library/html.parser.rst
index 0ea9644..44b7d6e 100644
--- a/Doc/library/html.parser.rst
+++ b/Doc/library/html.parser.rst
@@ -16,14 +16,21 @@
 This module defines a class :class:`HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

-.. class:: HTMLParser(strict=False)
+.. class:: HTMLParser(strict=False, *, convert_charrefs=False)

-   Create a parser instance. If *strict* is ``False`` (the default), the parser
-   will accept and parse invalid markup. If *strict* is ``True`` the parser
-   will raise an :exc:`~html.parser.HTMLParseError` exception instead [#]_ when
-   it's not able to parse the markup.
-   The use of ``strict=True`` is discouraged and the *strict* argument is
-   deprecated.
+   Create a parser instance.
+
+   If *convert_charrefs* is ``True`` (default: ``False``), all character
+   references (except the ones in ``script``/``style`` elements) are
+   automatically converted to the corresponding Unicode characters.
+   The use of ``convert_charrefs=True`` is encouraged and will become
+   the default in Python 3.5.
+
+   If *strict* is ``False`` (the default), the parser will accept and parse
+   invalid markup. If *strict* is ``True`` the parser will raise an
+   :exc:`~html.parser.HTMLParseError` exception instead [#]_ when it's not
+   able to parse the markup. The use of ``strict=True`` is discouraged and
+   the *strict* argument is deprecated.

    An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
    when start tags, end tags, text, comments, and other markup elements are
@@ -34,12 +41,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
    handler for elements which are closed implicitly by closing an outer element.

    .. versionchanged:: 3.2
-      *strict* keyword added.
+      *strict* argument added.

    .. deprecated-removed:: 3.3 3.5
       The *strict* argument and the strict mode have been deprecated.
       The parser is now able to accept and parse invalid markup too.

+   .. versionchanged:: 3.4
+      *convert_charrefs* keyword argument added.
+
 An exception is defined as well:


@@ -181,7 +191,8 @@ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):

    This method is called to process a named character reference of the form
    ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
-   (e.g. ``'gt'``).
+   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
+   ``True``.


 .. method:: HTMLParser.handle_charref(name)
@@ -189,7 +200,8 @@ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
    This method is called to process decimal and hexadecimal numeric character
    references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
    equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
-   in this case the method will receive ``'62'`` or ``'x3E'``.
+   in this case the method will receive ``'62'`` or ``'x3E'``. This method
+   is never called if *convert_charrefs* is ``True``.


 .. method:: HTMLParser.handle_comment(data)
@@ -324,7 +336,8 @@ correct char (note: these 3 references are all equivalent to ``'>'``)::
    Num ent : >

 Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
-:meth:`~HTMLParser.handle_data` might be called more than once::
+:meth:`~HTMLParser.handle_data` might be called more than once
+(unless *convert_charrefs* is set to ``True``)::

    >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    ...     parser.feed(chunk)
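
As a rough illustration of the behaviour this patch documents, the sketch below records which handler methods fire for the same markup with *convert_charrefs* off and on. ``Collector`` and ``collect`` are hypothetical names used only for this example; they are not part of the patch or of ``html.parser``. The example passes ``convert_charrefs`` explicitly, since the documented default is slated to change in 3.5, and it leaves out the deprecated *strict* argument.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper: records handler calls as (event, payload) tuples."""

    def __init__(self, *, convert_charrefs):
        # Pass the flag explicitly so the example does not rely on the default.
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Named references such as &gt; arrive here as 'gt'.
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Numeric references such as &#62; or &#x3E; arrive as '62' or 'x3E'.
        self.events.append(("charref", name))


def collect(markup, *, convert_charrefs):
    """Feed the whole string in one call and return the recorded events."""
    parser = Collector(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    sample = "<p>a &gt; b &#62; c &#x3E; d</p>"
    for flag in (False, True):
        print(flag, collect(sample, convert_charrefs=flag))
```

With the flag off, the references should be reported one by one through ``handle_entityref`` and ``handle_charref``, with the surrounding text split into separate ``handle_data`` calls. With the flag on, those two handlers are not called and the text arrives already converted.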