author | Ezio Melotti <ezio.melotti@gmail.com> | 2012-02-18 00:03:35 (GMT) |
---|---|---|
committer | Ezio Melotti <ezio.melotti@gmail.com> | 2012-02-18 00:03:35 (GMT) |
commit | c48cfe37d22dfe37bb163fe87dacefdaf786fef6 (patch) | |
tree | b17002f22447ae3df159bc9d33f2d35de866b850 | |
parent | aa2c670ee66988ef14e5b74b1247e1ccfe8c320d (diff) | |
parent | 4279bc7aef89ff668b81e5dd7514cc5ec281a753 (diff) | |
#14020: merge with 3.2.
-rw-r--r-- | Doc/library/html.parser.rst | 272 |
1 files changed, 205 insertions, 67 deletions
diff --git a/Doc/library/html.parser.rst b/Doc/library/html.parser.rst index 7c44bec..f3c36ec 100644 --- a/Doc/library/html.parser.rst +++ b/Doc/library/html.parser.rst @@ -19,14 +19,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. .. class:: HTMLParser(strict=True) Create a parser instance. If *strict* is ``True`` (the default), invalid - html results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If + HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If *strict* is ``False``, the parser uses heuristics to make a best guess at - the intention of any invalid html it encounters, similar to the way most - browsers do. + the intention of any invalid HTML it encounters, similar to the way most + browsers do. Using ``strict=False`` is advised. - An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags - begin and end. The :class:`HTMLParser` class is meant to be overridden by the - user to provide a desired behavior. + An :class:`.HTMLParser` instance is fed HTML data and calls handler methods + when start tags, end tags, text, comments, and other markup elements are + encountered. The user should subclass :class:`.HTMLParser` and override its + methods to implement the desired behavior. This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element. @@ -39,25 +40,61 @@ An exception is defined as well: .. exception:: HTMLParseError Exception raised by the :class:`HTMLParser` class when it encounters an error - while parsing. This exception provides three attributes: :attr:`msg` is a brief - message explaining the error, :attr:`lineno` is the number of the line on which - the broken construct was detected, and :attr:`offset` is the number of - characters into the line at which the construct starts. + while parsing and *strict* is ``True``. 
This exception provides three + attributes: :attr:`msg` is a brief message explaining the error, + :attr:`lineno` is the number of the line on which the broken construct was + detected, and :attr:`offset` is the number of characters into the line at + which the construct starts. -:class:`HTMLParser` instances have the following methods: +Example HTML Parser Application +------------------------------- -.. method:: HTMLParser.reset() +As a basic example, below is a simple HTML parser that uses the +:class:`HTMLParser` class to print out start tags, end tags, and data +as they are encountered:: - Reset the instance. Loses all unprocessed data. This is called implicitly at - instantiation time. + from html.parser import HTMLParser + + class MyHTMLParser(HTMLParser): + def handle_starttag(self, tag, attrs): + print("Encountered a start tag:", tag) + def handle_endtag(self, tag): + print("Encountered an end tag :", tag) + def handle_data(self, data): + print("Encountered some data :", data) + + parser = MyHTMLParser(strict=False) + parser.feed('<html><head><title>Test</title></head>' + '<body><h1>Parse me!</h1></body></html>') + +The output will then be:: + + Encountered a start tag: html + Encountered a start tag: head + Encountered a start tag: title + Encountered some data : Test + Encountered an end tag : title + Encountered an end tag : head + Encountered a start tag: body + Encountered a start tag: h1 + Encountered some data : Parse me! + Encountered an end tag : h1 + Encountered an end tag : body + Encountered an end tag : html + + +:class:`.HTMLParser` Methods +---------------------------- + +:class:`HTMLParser` instances have the following methods: .. method:: HTMLParser.feed(data) Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or - :meth:`close` is called. + :meth:`close` is called. *data* must be :class:`str`. .. 
method:: HTMLParser.close() @@ -68,6 +105,12 @@ An exception is defined as well: the :class:`HTMLParser` base class method :meth:`close`. +.. method:: HTMLParser.reset() + + Reset the instance. Loses all unprocessed data. This is called implicitly at + instantiation time. + + .. method:: HTMLParser.getpos() Return current line number and offset. @@ -81,23 +124,35 @@ An exception is defined as well: attributes can be preserved, etc.). +The following methods are called when data or markup elements are encountered +and they are meant to be overridden in a subclass. The base class +implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`): + + .. method:: HTMLParser.handle_starttag(tag, attrs) - This method is called to handle the start of a tag. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to handle the start of a tag (e.g. ``<div id="main">``). The *tag* argument is the name of the tag converted to lower case. The *attrs* argument is a list of ``(name, value)`` pairs containing the attributes found inside the tag's ``<>`` brackets. The *name* will be translated to lower case, and quotes in the *value* have been removed, and character and entity references - have been replaced. For instance, for the tag ``<A - HREF="http://www.cwi.nl/">``, this method would be called as - ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. + have been replaced. + + For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method + would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. All entity references from :mod:`html.entities` are replaced in the attribute values. +.. method:: HTMLParser.handle_endtag(tag) + + This method is called to handle the end tag of an element (e.g. ``</div>``). + + The *tag* argument is the name of the tag converted to lower case. + + .. 
method:: HTMLParser.handle_startendtag(tag, attrs) Similar to :meth:`handle_starttag`, but called when the parser encounters an @@ -106,57 +161,46 @@ An exception is defined as well: implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`. -.. method:: HTMLParser.handle_endtag(tag) - - This method is called to handle the end tag of an element. It is intended to be - overridden by a derived class; the base class implementation does nothing. The - *tag* argument is the name of the tag converted to lower case. - - .. method:: HTMLParser.handle_data(data) - This method is called to process arbitrary data (e.g. the content of - ``<script>...</script>`` and ``<style>...</style>``). It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to process arbitrary data (e.g. text nodes and the + content of ``<script>...</script>`` and ``<style>...</style>``). -.. method:: HTMLParser.handle_charref(name) +.. method:: HTMLParser.handle_entityref(name) - This method is called to process a character reference of the form ``&#ref;``. - It is intended to be overridden by a derived class; the base class - implementation does nothing. + This method is called to process a named character reference of the form + ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference + (e.g. ``'gt'``). -.. method:: HTMLParser.handle_entityref(name) +.. method:: HTMLParser.handle_charref(name) - This method is called to process a general entity reference of the form - ``&name;`` where *name* is an general entity reference. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called to process decimal and hexadecimal numeric character + references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal + equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``; + in this case the method will receive ``'62'`` or ``'x3E'``. .. 
method:: HTMLParser.handle_comment(data) - This method is called when a comment is encountered. The *comment* argument is - a string containing the text between the ``--`` and ``--`` delimiters, but not - the delimiters themselves. For example, the comment ``<!--text-->`` will cause - this method to be called with the argument ``'text'``. It is intended to be - overridden by a derived class; the base class implementation does nothing. + This method is called when a comment is encountered (e.g. ``<!--comment-->``). + For example, the comment ``<!-- comment -->`` will cause this method to be + called with the argument ``' comment '``. -.. method:: HTMLParser.handle_decl(decl) + The content of Internet Explorer conditional comments (condcoms) will also be + sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``, + this method will receive ``'[if IE 9]>IE-specific content<![endif]'``. - Method called when an SGML ``doctype`` declaration is read by the parser. - The *decl* parameter will be the entire contents of the declaration inside - the ``<!...>`` markup. It is intended to be overridden by a derived class; - the base class implementation does nothing. +.. method:: HTMLParser.handle_decl(decl) -.. method:: HTMLParser.unknown_decl(data) + This method is called to handle an HTML doctype declaration (e.g. + ``<!DOCTYPE html>``). - Method called when an unrecognized SGML declaration is read by the parser. - The *data* parameter will be the entire contents of the declaration inside - the ``<!...>`` markup. It is sometimes useful to be overridden by a - derived class; the base class implementation raises an :exc:`HTMLParseError`. + The *decl* parameter will be the entire contents of the declaration inside + the ``<!...>`` markup (e.g. ``'DOCTYPE html'``). .. method:: HTMLParser.handle_pi(data) @@ -174,29 +218,123 @@ An exception is defined as well: cause the ``'?'`` to be included in *data*. -.. _htmlparser-example: +.. 
method:: HTMLParser.unknown_decl(data) -Example HTML Parser Application -------------------------------- + This method is called when an unrecognized declaration is read by the parser. + + The *data* parameter will be the entire contents of the declaration inside + the ``<![...]>`` markup. It is sometimes useful to be overridden by a + derived class. The base class implementation raises an :exc:`HTMLParseError` + when *strict* is ``True``. -As a basic example, below is a simple HTML parser that uses the -:class:`HTMLParser` class to print out start tags, end tags, and data -as they are encountered:: + +.. _htmlparser-examples: + +Examples +-------- + +The following class implements a parser that will be used to illustrate more +examples:: from html.parser import HTMLParser + from html.entities import name2codepoint class MyHTMLParser(HTMLParser): def handle_starttag(self, tag, attrs): - print("Encountered a start tag:", tag) + print("Start tag:", tag) + for attr in attrs: + print(" attr:", attr) def handle_endtag(self, tag): - print("Encountered an end tag:", tag) + print("End tag :", tag) def handle_data(self, data): - print("Encountered some data:", data) - - parser = MyHTMLParser() - parser.feed('<html><head><title>Test</title></head>' - '<body><h1>Parse me!</h1></body></html>') - + print("Data :", data) + def handle_comment(self, data): + print("Comment :", data) + def handle_entityref(self, name): + c = chr(name2codepoint[name]) + print("Named ent:", c) + def handle_charref(self, name): + if name.startswith('x'): + c = chr(int(name[1:], 16)) + else: + c = chr(int(name)) + print("Num ent :", c) + def handle_decl(self, data): + print("Decl :", data) + + parser = MyHTMLParser(strict=False) + +Parsing a doctype:: + + >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ' + ... 
'"http://www.w3.org/TR/html4/strict.dtd">') + Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" + +Parsing an element with a few attributes and a title:: + + >>> parser.feed('<img src="python-logo.png" alt="The Python logo">') + Start tag: img + attr: ('src', 'python-logo.png') + attr: ('alt', 'The Python logo') + >>> + >>> parser.feed('<h1>Python</h1>') + Start tag: h1 + Data : Python + End tag : h1 + +The content of ``script`` and ``style`` elements is returned as is, without +further parsing:: + + >>> parser.feed('<style type="text/css">#python { color: green }</style>') + Start tag: style + attr: ('type', 'text/css') + Data : #python { color: green } + End tag : style + >>> + >>> parser.feed('<script type="text/javascript">' + ... 'alert("<strong>hello!</strong>");</script>') + Start tag: script + attr: ('type', 'text/javascript') + Data : alert("<strong>hello!</strong>"); + End tag : script + +Parsing comments:: + + >>> parser.feed('<!-- a comment -->' + ... '<!--[if IE 9]>IE-specific content<![endif]-->') + Comment : a comment + Comment : [if IE 9]>IE-specific content<![endif] + +Parsing named and numeric character references and converting them to the +correct char (note: these 3 references are all equivalent to ``'>'``):: + + >>> parser.feed('&gt;&#62;&#x3E;') + Named ent: > + Num ent : > + Num ent : > + +Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but +:meth:`~HTMLParser.handle_data` might be called more than once:: + + >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']: + ... parser.feed(chunk) + ... + Start tag: span + Data : buff + Data : ered + Data : text + End tag : span + +Parsing invalid HTML (e.g. unquoted attributes) also works:: + + >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>') + Start tag: p + Start tag: a + attr: ('class', 'link') + attr: ('href', '#main') + Data : tag soup + End tag : p + End tag : a .. rubric:: Footnotes |