path: root/Lib/html/parser.py
Commit message [Author, Age, Files, Lines]
* gh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser (GH-140904) [Serhiy Storchaka, 2025-11-19, 1 file, -10/+19]
|
* gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser (GH-137837) [Serhiy Storchaka, 2025-10-31, 1 file, -6/+18]
|   * the "plaintext" element
|   * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
|   * optionally RAWTEXT (if scripting=True) element "noscript"
* gh-135661: Fix parsing unterminated bogus comments in HTMLParser (GH-137873) [Serhiy Storchaka, 2025-08-17, 1 file, -14/+8]
|   Bogus comments that start with "<![CDATA[" should not include the starting "!" in their value.
* gh-135661: Fix CDATA section parsing in HTMLParser (GH-135665) [Serhiy Storchaka, 2025-08-14, 1 file, -2/+26]
|   "] ]>" and "]] >" no longer end the CDATA section.
|   Make CDATA section parsing context-dependent.
|   Add private method HTMLParser._set_support_cdata() to change the context.
|   If called with True, "<![CDATA[" starts a CDATA section which ends with "]]>".
|   If called with False, "<![CDATA[" starts a bogus comment which ends with ">".
* gh-118350: Fix support of elements "textarea" and "title" in HTMLParser (#135310) [Timon Viola, 2025-07-22, 1 file, -5/+15]
|   Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|   Co-authored-by: Łukasz Langa <lukasz@langa.pl>
* gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) [Serhiy Storchaka, 2025-07-21, 1 file, -2/+2]
|   This fixes a regression introduced in GH-135930.
* gh-102555: Fix comment parsing in HTMLParser according to the HTML5 standard (GH-135664) [Serhiy Storchaka, 2025-07-04, 1 file, -1/+17]
|   * "--!>" now ends the comment.
|   * "-- >" no longer ends the comment.
|   * Support abnormally ended empty comments "<!-->" and "<!--->".
|   ---------
|   Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>
|   Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
* gh-135661: Fix parsing start and end tags in HTMLParser according to the HTML5 standard (GH-135930) [Serhiy Storchaka, 2025-07-03, 1 file, -74/+69]
|   * Whitespaces no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
|   * Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized as whitespaces. The only whitespaces are `\t\n\r\f `.
|   * Null character (U+0000) no longer ends the tag name.
|   * Attributes and slashes after the tag name in end tags are now ignored, instead of terminating after the first `>` in quoted attribute value. E.g. `</script/foo=">"/>`.
|   * Multiple slashes and whitespaces between the last attribute and closing `>` are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.
|   * Multiple `=` between attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
|   * Whitespaces between the `=` separator and attribute name or value are no longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.
|   * Fix Sphinx errors.
|   * Apply suggestions from code review
|   Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
|   * Address review comments.
|   * Move to Security.
|   ---------
|   Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
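The attribute rules listed in this commit can be checked against whichever interpreter is at hand. The following is only a sketch: the `StartTagCollector` class and `start_tags` helper are hypothetical names introduced here, not part of `html.parser`. The output for the malformed inputs depends on whether the running Python includes GH-135930 (and the follow-up GH-136908), so no expected output is given for them.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: record (tag, attrs) for every reported start tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def start_tags(markup):
    """Parse markup and return the list of start tags the parser reported."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


# Well-formed input: a quoted value and a valueless attribute.
print(start_tags('<a href="x" disabled>'))

# Inputs taken from the commit message; results vary by Python version.
for markup in ('<a foo==bar>', '<a foo =bar>', '<a foo= bar>', '<a foo=bar/ //>'):
    print(repr(markup), '->', start_tags(markup))
```

Comparing the printed lists across interpreter versions is one way to see which of the listed fixes a given build contains.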
* gh-135462: Fix quadratic complexity in processing special input in HTMLParser (GH-135464) [Serhiy Storchaka, 2025-06-13, 1 file, -11/+30]
|   End-of-file errors are now handled according to the HTML5 specs -- comments and declarations are automatically closed, tags are ignored.
* gh-86155: Fix data loss after unclosed script or style tag in HTMLParser (GH-22658) [Waylan Limberg, 2025-05-10, 1 file, -1/+1]
|   When calling .close() the HTMLParser should flush all remaining content, even when that content is in an unclosed script or style tag.
* gh-77057: Fix handling of invalid markup declarations in HTMLParser (GH-9295) [Ezio Melotti, 2025-05-10, 1 file, -2/+2]
|   Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215) [Sascha Ißbrücker, 2025-05-07, 1 file, -1/+19]
|   According to the HTML5 spec, named character references in attribute values should only be processed if they are not followed by an ASCII alphanumeric, or an equals sign.
|   https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
* gh-95813: Improve HTMLParser from the view of inheritance (#95874) [Dong-hee Na, 2022-08-18, 1 file, -1/+2]
|   * gh-95813: Improve HTMLParser from the view of inheritance
|   * gh-95813: Add unittest
|   * Address code review
* bpo-45421: Remove dead code from html.parser (GH-28847) [Alberto Mardegan, 2021-10-12, 1 file, -7/+0]
|   Support for HTMLParseError was removed back in 2014 with commit 73a4359eb0eb624c588c5d52083ea4944f9787ea; however, this small block was missed.
* Fix typos in the Lib directory (GH-28775) [Christian Clauss, 2021-10-06, 1 file, -1/+1]
|   Fix typos in the Lib directory as identified by codespell.
|   Co-authored-by: Terry Jan Reedy <tjreedy@udel.edu>
* bpo-41748: Handles unquoted attributes with commas (#24072) [Karl Dubost, 2021-02-01, 1 file, -1/+1]
|   * bpo-41748: Adds tests for unquoted attributes with comma
|   * bpo-41748: Handles unquoted attributes with comma
|   * bpo-41748: Addresses review comments
|   * bpo-41748: Addresses review comments
|   * Adds more test cases
|   * Simplifies the regex for handling spaces
|   * bpo-41748: Moves attributes tests under the right class
|   * bpo-41748: Addresses review about duplicate attributes
|   * bpo-41748: Adds NEWS.d entry for this patch
* bpo-37328: remove deprecated HTMLParser.unescape (GH-14186) [Inada Naoki, 2019-08-27, 1 file, -8/+0]
|   It has been deprecated since Python 3.4.
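Code that still relied on the removed method can usually switch to the module-level `html.unescape()` function, which the log shows was added under #2927. A minimal sketch, with an input string made up for illustration:

```python
import html

# Illustrative input: an escaped anchor element.
escaped = "&lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;"

# html.unescape() converts named and numeric character references to text.
plain = html.unescape(escaped)
print(plain)
```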
* bpo-30629: Remove second call of str.lower() in html.parser.parse_endtag. (#2099) [Motoki Naruse, 2017-06-17, 1 file, -1/+1]
|   elem is the result of .lower() 6 lines above the handle_endtag call.
|   Patch by Motoki Naruse
* Revert "Fixed a typo in the HTMLParser.feed docstrings" (#1771) [Serhiy Storchaka, 2017-05-24, 1 file, -1/+1]
|   * Revert "Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a
|   The docstring was correct. I read the patch in opposite direction, as *adding* the "r" prefix.
|   This reverts commit 5ba185039f1bd465d3f82531324fd3fe1ee42f0c.
* Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a rawstring. (#1759) [Jani Šumak, 2017-05-23, 1 file, -1/+1]
|
* #27364: fix "incorrect" uses of escape character in the stdlib. [R David Murray, 2016-09-08, 1 file, -2/+2]
|   And most of the tools.
|   Patch by Emanual Barry, reviewed by me, Serhiy Storchaka, and Martin Panter.
* Issue #27076: Doc, comment and tests spelling fixes [Martin Panter, 2016-05-26, 1 file, -1/+1]
|   Most fixes to Doc/ and Lib/ directories by Ville Skyttä.
* #23144: merge with 3.4. [Ezio Melotti, 2015-09-06, 1 file, -1/+9]
|\
| * #23144: Make sure that HTMLParser.feed() returns all the data, even when convert_charrefs is True. [Ezio Melotti, 2015-09-06, 1 file, -1/+9]
| |
* | #21047: set the default value for the *convert_charrefs* argument of HTMLParser to True. Patch by Berker Peksag. [Ezio Melotti, 2014-08-02, 1 file, -8/+2]
| |
* | #15114: the strict mode and argument of HTMLParser, HTMLParser.error, and the HTMLParseError exception have been removed. [Ezio Melotti, 2014-08-02, 1 file, -94/+12]
|/
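On interpreters that include this removal (assumed here to be Python 3.5 and later), passing `strict` to the constructor fails because the parameter no longer exists, and the exception class is no longer exported by the module. A small sketch of what calling code would observe:

```python
import html.parser
from html.parser import HTMLParser

# The strict parameter is gone, so the constructor rejects it.
try:
    HTMLParser(strict=False)
except TypeError as exc:
    strict_error = exc
else:
    strict_error = None

print("strict rejected:", strict_error is not None)

# The exception class was removed from the module along with strict mode.
print("HTMLParseError present:", hasattr(html.parser, "HTMLParseError"))
```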
* #20288: merge with 3.3. [Ezio Melotti, 2014-02-01, 1 file, -3/+3]
|\
| * #20288: fix handling of invalid numeric charrefs in HTMLParser. [Ezio Melotti, 2014-02-01, 1 file, -3/+3]
| |
* | #13633: Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references. [Ezio Melotti, 2013-11-23, 1 file, -17/+45]
| |
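The effect of `convert_charrefs` is easiest to see by recording which handlers fire for the same input under both settings. This is a sketch only: `EventRecorder` and `text_events` are hypothetical names introduced for illustration, and the sample markup is made up.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: log text-related handler calls in order."""

    def __init__(self, *, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def text_events(markup, convert_charrefs):
    """Parse markup and return the recorded text events."""
    recorder = EventRecorder(convert_charrefs=convert_charrefs)
    recorder.feed(markup)
    recorder.close()  # flush any text the parser is still buffering
    return recorder.events


markup = "<p>1 &lt; 2 &amp; 3 &#62; 2</p>"

# True: references are converted and delivered through handle_data.
print(text_events(markup, convert_charrefs=True))

# False: each reference is reported via handle_entityref / handle_charref.
print(text_events(markup, convert_charrefs=False))
```

With `convert_charrefs=True` the text arrives already decoded, so subclasses that only override `handle_data` see the full text without implementing the two reference handlers.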
* | #19688: add back and deprecate the internal HTMLParser.unescape() method. [Ezio Melotti, 2013-11-22, 1 file, -0/+7]
| |
* | #2927: Added the unescape() function to the html module. [Ezio Melotti, 2013-11-19, 1 file, -33/+5]
| |
* | #19480: merge with 3.3. [Ezio Melotti, 2013-11-07, 1 file, -9/+12]
|\ \
| |/
| * #19480: HTMLParser now accepts all valid start-tag names as defined by the HTML5 standard. [Ezio Melotti, 2013-11-07, 1 file, -9/+12]
| |
* | #15114: The html.parser module now raises a DeprecationWarning when the strict argument of HTMLParser or the HTMLParser.error method is used. [Ezio Melotti, 2013-11-02, 1 file, -4/+10]
| |
* | #17802: merge with 3.3. [Ezio Melotti, 2013-05-01, 1 file, -0/+1]
|\ \
| |/
| * #17802: Fix an UnboundLocalError in html.parser. Initial tests by Thomas Barlow. [Ezio Melotti, 2013-05-01, 1 file, -0/+1]
| |
* | #14679: add an __all__ (that contains only HTMLParser) to html.parser. [Ezio Melotti, 2013-05-01, 1 file, -0/+2]
|/
* #15156: HTMLParser now uses the new "html.entities.html5" dictionary. [Ezio Melotti, 2012-06-24, 1 file, -17/+15]
|
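For orientation, the `html.entities.html5` dictionary maps entity names to their replacement text; most keys carry the trailing semicolon, and some legacy names also appear without it. A short sketch of direct lookups, with entity names chosen only as examples:

```python
import html
from html.entities import html5

# Keys normally include the trailing semicolon.
gt_text = html5["gt;"]
hearts_text = html5["hearts;"]

# A few legacy names are also listed without the semicolon.
amp_has_both_forms = "amp" in html5 and "amp;" in html5

# html.unescape() resolves named references using the same table.
unescaped = html.unescape("&hearts;")

print(repr(gt_text), repr(hearts_text), amp_has_both_forms, repr(unescaped))
```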
* #15114: the strict mode of HTMLParser and the HTMLParseError exception are deprecated now that the parser is able to parse invalid markup. [Ezio Melotti, 2012-06-23, 1 file, -9/+12]
|
* #14538: HTMLParser can now parse correctly start tags that contain a bare /. [Ezio Melotti, 2012-04-19, 1 file, -3/+3]
|
* HTMLParser is now able to handle slashes in the start tag. [Ezio Melotti, 2012-02-21, 1 file, -7/+11]
|
* Fix an index and clean up comments. [Ezio Melotti, 2012-02-13, 1 file, -1/+2]
|
* Improve handling of declarations in HTMLParser. [Ezio Melotti, 2012-02-13, 1 file, -8/+22]
|
* #13993: HTMLParser is now able to handle broken end tags when strict=False. [Ezio Melotti, 2012-02-13, 1 file, -15/+27]
|
* #13960: HTMLParser is now able to handle broken comments when strict=False. [Ezio Melotti, 2012-02-10, 1 file, -1/+24]
|
* #13358: HTMLParser now calls handle_data only once for each CDATA. [Ezio Melotti, 2011-11-18, 1 file, -3/+4]
|
* #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. [Ezio Melotti, 2011-11-14, 1 file, -9/+10]
|
* #670664: Fix HTMLParser to correctly handle the content of ``<script>...</script>`` and ``<style>...</style>``. [Ezio Melotti, 2011-11-01, 1 file, -4/+18]
|
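In practice this means that markup-like text inside a script element is passed through as data instead of being parsed as tags. The sketch below uses a hypothetical `Recorder` class and a made-up script body; the data may arrive in one or several `handle_data` calls depending on the Python version, so the pieces are joined before use.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper: collect start tag names and text chunks."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.data.append(data)


# Illustrative script body containing "<" and tag-like text.
SCRIPT_BODY = 'if (a < b) document.write("<p>hi</p>");'

recorder = Recorder()
recorder.feed("<script>" + SCRIPT_BODY + "</script><p>done</p>")
recorder.close()

# Only the real elements are reported as start tags.
print(recorder.start_tags)

# The script body comes back unchanged, followed by the paragraph text.
print("".join(recorder.data))
```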
* #13273: fix a bug that prevented HTMLParser from properly detecting some tags when strict=False. [Ezio Melotti, 2011-10-28, 1 file, -3/+2]
|
* #12888: Fix a bug in HTMLParser.unescape that prevented it from unescaping more than 128 entities. Patch by Peter Otten. [Ezio Melotti, 2011-09-05, 1 file, -1/+1]