path: root/Lib/html/parser.py
Commit message  [Author, Age, Files, Lines]
* [3.13] gh-118350: Fix support of elements "textarea" and "title" in HTMLParser (GH-135310) (GH-136985)  [Miss Islington (bot), 2025-07-22, 1 file, -5/+15]
|   (cherry picked from commit 4d02f31cdd45d81b95540d9076222b709d4f2335)
|   Co-authored-by: Timon Viola <44016238+timonviola@users.noreply.github.com>
|   Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|   Co-authored-by: Łukasz Langa <lukasz@langa.pl>
* [3.13] gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) (GH-136918)  [Miss Islington (bot), 2025-07-22, 1 file, -2/+2]
|   This fixes a regression introduced in GH-135930.
|   (cherry picked from commit dee650189497735edbc08a54edabb5b06ef1bd09)
|   Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* [3.13] gh-102555: Fix comment parsing in HTMLParser according to the HTML5 standard (GH-135664) (GH-136272)  [Miss Islington (bot), 2025-07-04, 1 file, -1/+17]
|   * "--!>" now ends the comment.
|   * "-- >" no longer ends the comment.
|   * Support abnormally ended empty comments "<-->" and "<--->".
|   ---------
|   (cherry picked from commit 8ac7613dc8b8f82253d7c0e2b6ef6ed703a0a1ee)
|   Co-author: Kerim Kabirov <the.privat33r+gh@pm.me>
|   Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|   Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
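The comment rules listed in this entry can be probed with a small collector. This is a minimal sketch, not part of the commit: `CommentCollector` and `comments_in` are hypothetical helper names, and the sample strings assume that the "abnormally ended empty comments" are the HTML5 forms `<!-->` and `<!--->`. Only the plain comment has a version-independent result; the other outputs depend on whether the interpreter includes this fix, so they are printed without an expected value.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records every comment the parser reports."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def comments_in(markup):
    """Parse markup and return the list of comment bodies that were seen."""
    parser = CommentCollector()
    parser.feed(markup)
    parser.close()
    return parser.comments


# A well-formed comment behaves the same before and after the fix.
print(comments_in("<!-- plain -->"))  # [' plain ']

# Edge cases named in the commit message; results vary by Python version.
for sample in ("<!--a--!>b-->", "<!--a-- >b-->", "<!-->x", "<!--->x"):
    print(repr(sample), comments_in(sample))
```

Running the loop on two interpreter versions side by side is a quick way to see which comment terminators a given release accepts.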
* [3.13] gh-135661: Fix parsing start and end tags in HTMLParser according to the HTML5 standard (GH-135930) (GH-136256)  [Miss Islington (bot), 2025-07-03, 1 file, -74/+69]
|   * Whitespaces no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
|   * Vertical tabulation (`\v`) and non-ASCII whitespaces no longer recognized as whitespaces. The only whitespaces are `\t\n\r\f `.
|   * Null character (U+0000) no longer ends the tag name.
|   * Attributes and slashes after the tag name in end tags are now ignored, instead of terminating after the first `>` in quoted attribute value. E.g. `</script/foo=">"/>`.
|   * Multiple slashes and whitespaces between the last attribute and closing `>` are now ignored in both start and end tags. E.g. `<a foo=bar/ //>`.
|   * Multiple `=` between attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
|   * Whitespaces between the `=` separator and attribute name or value are no longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.
|   ---------
|   (cherry picked from commit 0243f97cbadec8d985e63b1daec5d1cbc850cae3)
|   Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|   Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
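The attribute examples in this entry can be reproduced with a start-tag collector. The sketch below is illustrative only: `StartTagCollector` and `start_tags` are hypothetical names. The first call uses ordinary markup whose result does not depend on this change; the remaining samples are the ones quoted in the commit message, and their output differs between interpreters with and without the fix (and the later whitespace follow-up in gh-135661), so no expected values are shown for them.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def start_tags(markup):
    """Parse markup and return the start tags with their attribute lists."""
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


# Ordinary markup: a quoted value and a valueless attribute.
print(start_tags('<a foo="bar" baz>'))  # [('a', [('foo', 'bar'), ('baz', None)])]

# Samples from the commit message; results depend on the Python version.
for sample in ("<a foo==bar>", "<a foo =bar>", "<a foo= bar>", "<a foo=bar/ //>"):
    print(repr(sample), start_tags(sample))
```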
* [3.13] gh-135462: Fix quadratic complexity in processing special input in HTMLParser (GH-135464) (GH-135482)  [Miss Islington (bot), 2025-06-13, 1 file, -11/+30]
|   End-of-file errors are now handled according to the HTML5 specs -- comments and declarations are automatically closed, tags are ignored.
|   (cherry picked from commit 6eb6c5dbfb528bd07d77b60fd71fd05d81d45c41)
|   Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* [3.13] gh-86155: Fix data loss after unclosed script or style tag in HTMLParser (GH-22658) (GH-133845)  [Miss Islington (bot), 2025-05-10, 1 file, -1/+1]
|   When calling .close() the HTMLParser should flush all remaining content, even when that content is in an unclosed script or style tag.
|   (cherry picked from commit 53383e90e4df7029f792b7aa81aa2e4cff348ed0)
|   Co-authored-by: Waylan Limberg <waylan.limberg@icloud.com>
* [3.13] gh-77057: Fix handling of invalid markup declarations in HTMLParser (GH-9295) (GH-133834)  [Miss Islington (bot), 2025-05-10, 1 file, -2/+2]
|   (cherry picked from commit 76c0b01bc401c3e976011bbc69cec56dbebe0ad5)
|   Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
|   Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* [3.13] gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215) (GH-133586)  [Miss Islington (bot), 2025-05-09, 1 file, -1/+19]
|   According to the HTML5 spec, named character references in attribute values should only be processed if they are not followed by an ASCII alphanumeric, or an equals sign.
|   (cherry picked from commit 77b14a6d58e527f915966446eb0866652a46feb5)
|   https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
|   Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>
* gh-95813: Improve HTMLParser from the view of inheritance (#95874)  [Dong-hee Na, 2022-08-18, 1 file, -1/+2]
|   * gh-95813: Improve HTMLParser from the view of inheritance
|   * gh-95813: Add unittest
|   * Address code review
* bpo-45421: Remove dead code from html.parser (GH-28847)  [Alberto Mardegan, 2021-10-12, 1 file, -7/+0]
|   Support for HtmlParserError was removed back in 2014 with commit 73a4359eb0eb624c588c5d52083ea4944f9787ea, however this small block was missed.
* Fix typos in the Lib directory (GH-28775)  [Christian Clauss, 2021-10-06, 1 file, -1/+1]
|   Fix typos in the Lib directory as identified by codespell.
|   Co-authored-by: Terry Jan Reedy <tjreedy@udel.edu>
* bpo-41748: Handles unquoted attributes with commas (#24072)  [Karl Dubost, 2021-02-01, 1 file, -1/+1]
|   * bpo-41748: Adds tests for unquoted attributes with comma
|   * bpo-41748: Handles unquoted attributes with comma
|   * bpo-41748: Addresses review comments
|   * bpo-41748: Addresses review comments
|   * Adds more test cases
|   * Simplifies the regex for handling spaces
|   * bpo-41748: Moves attributes tests under the right class
|   * bpo-41748: Addresses review about duplicate attributes
|   * bpo-41748: Adds NEWS.d entry for this patch
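To illustrate what "unquoted attributes with commas" means in practice, here is a minimal sketch. `AttrCollector` and `attrs_of` are hypothetical names, and the sample markup is chosen for illustration; the shown result assumes an interpreter that includes this fix.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: keeps the attribute list of each start tag."""

    def __init__(self):
        super().__init__()
        self.attr_lists = []

    def handle_starttag(self, tag, attrs):
        self.attr_lists.append(attrs)


def attrs_of(markup):
    """Return the attributes of the first start tag in markup."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attr_lists[0]


# The comma is part of the unquoted value instead of splitting it.
print(attrs_of("<div class=bar,baz=asd>"))  # [('class', 'bar,baz=asd')]
```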
* bpo-37328: remove deprecated HTMLParser.unescape (GH-14186)  [Inada Naoki, 2019-08-27, 1 file, -8/+0]
|   It is deprecated since Python 3.4.
* bpo-30629: Remove second call of str.lower() in html.parser.parse_endtag. (#2099)  [Motoki Naruse, 2017-06-17, 1 file, -1/+1]
|   elem is the result of .lower() 6 lines above the handle_endtag call.
|   Patch by Motoki Naruse
* Revert "Fixed a typo in the HTMLParser.feed docstrings" (#1771)  [Serhiy Storchaka, 2017-05-24, 1 file, -1/+1]
|   * Revert "Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a
|   The docstring was correct. I read the patch in opposite direction, as *adding* the "r" prefix.
|   This reverts commit 5ba185039f1bd465d3f82531324fd3fe1ee42f0c.
* Fixed a typo in the HTMLParser.feed docstrings. The docstring started with an 'r', like a rawstring. (#1759)  [Jani Šumak, 2017-05-23, 1 file, -1/+1]
|
* #27364: fix "incorrect" uses of escape character in the stdlib.  [R David Murray, 2016-09-08, 1 file, -2/+2]
|   And most of the tools.
|   Patch by Emanual Barry, reviewed by me, Serhiy Storchaka, and Martin Panter.
* Issue #27076: Doc, comment and tests spelling fixes  [Martin Panter, 2016-05-26, 1 file, -1/+1]
|   Most fixes to Doc/ and Lib/ directories by Ville Skyttä.
* #23144: merge with 3.4.  [Ezio Melotti, 2015-09-06, 1 file, -1/+9]
|\
| * #23144: Make sure that HTMLParser.feed() returns all the data, even when convert_charrefs is True.  [Ezio Melotti, 2015-09-06, 1 file, -1/+9]
| |
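The situation this entry addresses can be seen when text ending in a bare `&` is fed in pieces: with `convert_charrefs=True` the parser may hold such text back until it knows whether a character reference is complete, and `close()` must then deliver whatever is still buffered. A minimal sketch, with `TextCollector` and `collect_text` as hypothetical helper names:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: accumulates every chunk passed to handle_data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_text(pieces):
    """Feed the pieces one by one, close the parser, and join the text."""
    parser = TextCollector()
    for piece in pieces:
        parser.feed(piece)
    parser.close()  # flushes text that was held back waiting for more input
    return "".join(parser.chunks)


# "&amp;" is split across two feed() calls; "R&D" ends with an unterminated "&D".
print(collect_text(["AT&", "amp;T and R&D"]))  # AT&T and R&D
```

Joining the chunks is deliberate: the number of `handle_data` calls is an implementation detail, while the concatenated text is what callers usually rely on.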
* | #21047: set the default value for the *convert_charrefs* argument of HTMLParser to True. Patch by Berker Peksag.  [Ezio Melotti, 2014-08-02, 1 file, -8/+2]
| |
* | #15114: the strict mode and argument of HTMLParser, HTMLParser.error, and the HTMLParserError exception have been removed.  [Ezio Melotti, 2014-08-02, 1 file, -94/+12]
|/
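One visible consequence of removing the strict argument is that passing it now fails at construction time. The sketch below checks this; `accepts_strict_argument` is a hypothetical helper name, and the stated result assumes a Python release that includes this removal.

```python
from html.parser import HTMLParser


def accepts_strict_argument():
    """Return True if HTMLParser still takes a 'strict' keyword argument."""
    try:
        HTMLParser(strict=False)
    except TypeError:
        # The keyword is unknown once strict mode has been removed.
        return False
    return True


print(accepts_strict_argument())  # False
```

Code that must run on both sides of the removal can simply stop passing `strict`; the tolerant behaviour is the only mode left.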
* #20288: merge with 3.3.  [Ezio Melotti, 2014-02-01, 1 file, -3/+3]
|\
| * #20288: fix handling of invalid numeric charrefs in HTMLParser.  [Ezio Melotti, 2014-02-01, 1 file, -3/+3]
| |
* | #13633: Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references.  [Ezio Melotti, 2013-11-23, 1 file, -17/+45]
| |
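The effect of the `convert_charrefs` keyword can be compared by logging parser events in both modes. A minimal sketch, where `EventLogger` and `events_for` are hypothetical names: with `True`, references arrive already decoded inside the text; with `False`, they are reported through `handle_entityref` and `handle_charref`.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: logs data, entityref and charref events."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def events_for(markup, convert_charrefs):
    """Parse markup in the requested mode and return the logged events."""
    parser = EventLogger(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


print(events_for("x &gt; &#62;", True))   # [('data', 'x > >')]
print(events_for("x &gt; &#62;", False))  # separate data, entityref and charref events
```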
* | #19688: add back and deprecate the internal HTMLParser.unescape() method.  [Ezio Melotti, 2013-11-22, 1 file, -0/+7]
| |
* | #2927: Added the unescape() function to the html module.  [Ezio Melotti, 2013-11-19, 1 file, -33/+5]
| |
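The `html.unescape()` function introduced here is the public replacement for the internal `HTMLParser.unescape()` method that later entries deprecate and remove. A short usage sketch (the input string is made up for illustration):

```python
import html

# Named and numeric character references are decoded in one call.
escaped = "&lt;b&gt; &amp; &#62;"
print(html.unescape(escaped))  # <b> & >
```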
* | #19480: merge with 3.3.  [Ezio Melotti, 2013-11-07, 1 file, -9/+12]
|\ \
| |/
| * #19480: HTMLParser now accepts all valid start-tag names as defined by the HTML5 standard.  [Ezio Melotti, 2013-11-07, 1 file, -9/+12]
| |
* | #15114: The html.parser module now raises a DeprecationWarning when the strict argument of HTMLParser or the HTMLParser.error method are used.  [Ezio Melotti, 2013-11-02, 1 file, -4/+10]
| |
* | #17802: merge with 3.3.  [Ezio Melotti, 2013-05-01, 1 file, -0/+1]
|\ \
| |/
| * #17802: Fix an UnboundLocalError in html.parser. Initial tests by Thomas Barlow.  [Ezio Melotti, 2013-05-01, 1 file, -0/+1]
| |
* | #14679: add an __all__ (that contains only HTMLParser) to html.parser.  [Ezio Melotti, 2013-05-01, 1 file, -0/+2]
|/
* #15156: HTMLParser now uses the new "html.entities.html5" dictionary.  [Ezio Melotti, 2012-06-24, 1 file, -17/+15]
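The `html.entities.html5` dictionary mentioned in this entry maps reference names (without the leading `&`) to their replacement text. A small lookup sketch, with `resolve` as a hypothetical helper name:

```python
from html.entities import html5


def resolve(name):
    """Look up a named reference (without the leading '&'); None if unknown."""
    return html5.get(name)


print(resolve("amp;"))  # &
print(resolve("amp"))   # & (some legacy names are also listed without ';')
print(resolve("nosuchentity;"))  # None
```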
|
* #15114: the strict mode of HTMLParser and the HTMLParseError exception are deprecated now that the parser is able to parse invalid markup.  [Ezio Melotti, 2012-06-23, 1 file, -9/+12]
|
* #14538: HTMLParser can now parse correctly start tags that contain a bare /.  [Ezio Melotti, 2012-04-19, 1 file, -3/+3]
|
* HTMLParser is now able to handle slashes in the start tag.  [Ezio Melotti, 2012-02-21, 1 file, -7/+11]
|
* Fix an index and clean up comments.  [Ezio Melotti, 2012-02-13, 1 file, -1/+2]
|
* Improve handling of declarations in HTMLParser.  [Ezio Melotti, 2012-02-13, 1 file, -8/+22]
|
* #13993: HTMLParser is now able to handle broken end tags when strict=False.  [Ezio Melotti, 2012-02-13, 1 file, -15/+27]
|
* #13960: HTMLParser is now able to handle broken comments when strict=False.  [Ezio Melotti, 2012-02-10, 1 file, -1/+24]
|
* #13358: HTMLParser now calls handle_data only once for each CDATA.  [Ezio Melotti, 2011-11-18, 1 file, -3/+4]
|
* #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser.  [Ezio Melotti, 2011-11-14, 1 file, -9/+10]
|
* #670664: Fix HTMLParser to correctly handle the content of ``<script>...</script>`` and ``<style>...</style>``.  [Ezio Melotti, 2011-11-01, 1 file, -4/+18]
|
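Handling the content of `<script>` correctly means that markup-like text inside the element is delivered as raw data rather than parsed as tags. A minimal sketch, with `ScriptAwareCollector` and `parse` as hypothetical names and a made-up script body:

```python
from html.parser import HTMLParser


class ScriptAwareCollector(HTMLParser):
    """Hypothetical helper: records start-tag names and all text data."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    """Return (start-tag names, concatenated text) for the given markup."""
    parser = ScriptAwareCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags, "".join(parser.text)


tags, text = parse('<script>if (a < b) { s = "<p>"; }</script>')
print(tags)  # ['script']
print(text)  # if (a < b) { s = "<p>"; }
```

The `<p>` inside the string literal never reaches `handle_starttag`, and the `<` comparison does not break the text apart.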
* #13273: fix a bug that prevented HTMLParser to properly detect some tags when strict=False.  [Ezio Melotti, 2011-10-28, 1 file, -3/+2]
|
* #12888: Fix a bug in HTMLParser.unescape that prevented it to escape more than 128 entities. Patch by Peter Otten.  [Ezio Melotti, 2011-09-05, 1 file, -1/+1]
|
* Merge 3.1  [Éric Araujo, 2011-05-25, 1 file, -1/+1]
|\
| * Fix display of html.parser.HTMLParser.feed docstring  [Éric Araujo, 2011-05-04, 1 file, -1/+1]
| |
| * Merged revisions 87542 via svnmerge from  [Senthil Kumaran, 2010-12-28, 1 file, -7/+10]
| |   svn+ssh://pythondev@svn.python.org/python/branches/py3k
| |   ........
| |   r87542 | senthil.kumaran | 2010-12-28 23:55:16 +0800 (Tue, 28 Dec 2010) | 3 lines
| |   Fix Issue10759 - html.parser.unescape() fails on HTML entities with incorrect syntax
| |   ........
| * Merged revisions 81504 via svnmerge from  [Victor Stinner, 2010-05-24, 1 file, -0/+3]
| |   svn+ssh://pythondev@svn.python.org/python/branches/py3k
| |   ................
| |   r81504 | victor.stinner | 2010-05-24 23:46:25 +0200 (lun., 24 mai 2010) | 13 lines
| |   Recorded merge of revisions 81500-81501 via svnmerge from
| |   svn+ssh://pythondev@svn.python.org/python/trunk
| |   ........
| |   r81500 | victor.stinner | 2010-05-24 23:33:24 +0200 (lun., 24 mai 2010) | 2 lines
| |   Issue #6662: Fix parsing of malformatted charref (&#bad;)
| |   ........
| |   r81501 | victor.stinner | 2010-05-24 23:37:28 +0200 (lun., 24 mai 2010) | 2 lines
| |   Add the author of the last fix (Issue #6662)
| |   ........
| |   ................