author Ezio Melotti <ezio.melotti@gmail.com> 2012-02-18 00:03:35 (GMT)
committer Ezio Melotti <ezio.melotti@gmail.com> 2012-02-18 00:03:35 (GMT)
commit c48cfe37d22dfe37bb163fe87dacefdaf786fef6
tree b17002f22447ae3df159bc9d33f2d35de866b850
parent aa2c670ee66988ef14e5b74b1247e1ccfe8c320d
parent 4279bc7aef89ff668b81e5dd7514cc5ec281a753
#14020: merge with 3.2.
-rw-r--r-- Doc/library/html.parser.rst 272
1 files changed, 205 insertions, 67 deletions
diff --git a/Doc/library/html.parser.rst b/Doc/library/html.parser.rst
index 7c44bec..f3c36ec 100644
--- a/Doc/library/html.parser.rst
+++ b/Doc/library/html.parser.rst
@@ -19,14 +19,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
.. class:: HTMLParser(strict=True)
Create a parser instance. If *strict* is ``True`` (the default), invalid
- html results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
+ HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
*strict* is ``False``, the parser uses heuristics to make a best guess at
- the intention of any invalid html it encounters, similar to the way most
- browsers do.
+ the intention of any invalid HTML it encounters, similar to the way most
+ browsers do. Using ``strict=False`` is advised.
- An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
- begin and end. The :class:`HTMLParser` class is meant to be overridden by the
- user to provide a desired behavior.
+ An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
+ when start tags, end tags, text, comments, and other markup elements are
+ encountered. The user should subclass :class:`.HTMLParser` and override its
+ methods to implement the desired behavior.
This parser does not check that end tags match start tags or call the end-tag
handler for elements which are closed implicitly by closing an outer element.
@@ -39,25 +40,61 @@ An exception is defined as well:
.. exception:: HTMLParseError
Exception raised by the :class:`HTMLParser` class when it encounters an error
- while parsing. This exception provides three attributes: :attr:`msg` is a brief
- message explaining the error, :attr:`lineno` is the number of the line on which
- the broken construct was detected, and :attr:`offset` is the number of
- characters into the line at which the construct starts.
+ while parsing and *strict* is ``True``. This exception provides three
+ attributes: :attr:`msg` is a brief message explaining the error,
+ :attr:`lineno` is the number of the line on which the broken construct was
+ detected, and :attr:`offset` is the number of characters into the line at
+ which the construct starts.
-:class:`HTMLParser` instances have the following methods:
+Example HTML Parser Application
+-------------------------------
-.. method:: HTMLParser.reset()
+As a basic example, below is a simple HTML parser that uses the
+:class:`HTMLParser` class to print out start tags, end tags, and data
+as they are encountered::
- Reset the instance. Loses all unprocessed data. This is called implicitly at
- instantiation time.
+   from html.parser import HTMLParser
+
+   class MyHTMLParser(HTMLParser):
+       def handle_starttag(self, tag, attrs):
+           print("Encountered a start tag:", tag)
+       def handle_endtag(self, tag):
+           print("Encountered an end tag :", tag)
+       def handle_data(self, data):
+           print("Encountered some data :", data)
+
+   parser = MyHTMLParser(strict=False)
+   parser.feed('<html><head><title>Test</title></head>'
+               '<body><h1>Parse me!</h1></body></html>')
+
+The output will then be::
+
+ Encountered a start tag: html
+ Encountered a start tag: head
+ Encountered a start tag: title
+ Encountered some data : Test
+ Encountered an end tag : title
+ Encountered an end tag : head
+ Encountered a start tag: body
+ Encountered a start tag: h1
+ Encountered some data : Parse me!
+ Encountered an end tag : h1
+ Encountered an end tag : body
+ Encountered an end tag : html
+
+
+:class:`.HTMLParser` Methods
+----------------------------
+
+:class:`HTMLParser` instances have the following methods:
.. method:: HTMLParser.feed(data)
Feed some text to the parser. It is processed insofar as it consists of
complete elements; incomplete data is buffered until more data is fed or
- :meth:`close` is called.
+ :meth:`close` is called. *data* must be :class:`str`.
.. method:: HTMLParser.close()
@@ -68,6 +105,12 @@ An exception is defined as well:
the :class:`HTMLParser` base class method :meth:`close`.
+.. method:: HTMLParser.reset()
+
+ Reset the instance. Loses all unprocessed data. This is called implicitly at
+ instantiation time.
+
+
.. method:: HTMLParser.getpos()
Return current line number and offset.
@@ -81,23 +124,35 @@ An exception is defined as well:
attributes can be preserved, etc.).
+The following methods are called when data or markup elements are encountered
+and they are meant to be overridden in a subclass. The base class
+implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
+
+
.. method:: HTMLParser.handle_starttag(tag, attrs)
- This method is called to handle the start of a tag. It is intended to be
- overridden by a derived class; the base class implementation does nothing.
+ This method is called to handle the start of a tag (e.g. ``<div id="main">``).
The *tag* argument is the name of the tag converted to lower case. The *attrs*
argument is a list of ``(name, value)`` pairs containing the attributes found
inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
and quotes in the *value* have been removed, and character and entity references
- have been replaced. For instance, for the tag ``<A
- HREF="http://www.cwi.nl/">``, this method would be called as
- ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
+ have been replaced.
+
+ For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
+ would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
All entity references from :mod:`html.entities` are replaced in the attribute
values.
+.. method:: HTMLParser.handle_endtag(tag)
+
+ This method is called to handle the end tag of an element (e.g. ``</div>``).
+
+ The *tag* argument is the name of the tag converted to lower case.
+
+
.. method:: HTMLParser.handle_startendtag(tag, attrs)
Similar to :meth:`handle_starttag`, but called when the parser encounters an
@@ -106,57 +161,46 @@ An exception is defined as well:
implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
-.. method:: HTMLParser.handle_endtag(tag)
-
- This method is called to handle the end tag of an element. It is intended to be
- overridden by a derived class; the base class implementation does nothing. The
- *tag* argument is the name of the tag converted to lower case.
-
-
.. method:: HTMLParser.handle_data(data)
- This method is called to process arbitrary data (e.g. the content of
- ``<script>...</script>`` and ``<style>...</style>``). It is intended to be
- overridden by a derived class; the base class implementation does nothing.
+ This method is called to process arbitrary data (e.g. text nodes and the
+ content of ``<script>...</script>`` and ``<style>...</style>``).
-.. method:: HTMLParser.handle_charref(name)
+.. method:: HTMLParser.handle_entityref(name)
- This method is called to process a character reference of the form ``&#ref;``.
- It is intended to be overridden by a derived class; the base class
- implementation does nothing.
+ This method is called to process a named character reference of the form
+ ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
+ (e.g. ``'gt'``).
-.. method:: HTMLParser.handle_entityref(name)
+.. method:: HTMLParser.handle_charref(name)
- This method is called to process a general entity reference of the form
- ``&name;`` where *name* is an general entity reference. It is intended to be
- overridden by a derived class; the base class implementation does nothing.
+ This method is called to process decimal and hexadecimal numeric character
+ references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
+ equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
+ in this case the method will receive ``'62'`` or ``'x3E'``.
.. method:: HTMLParser.handle_comment(data)
- This method is called when a comment is encountered. The *comment* argument is
- a string containing the text between the ``--`` and ``--`` delimiters, but not
- the delimiters themselves. For example, the comment ``<!--text-->`` will cause
- this method to be called with the argument ``'text'``. It is intended to be
- overridden by a derived class; the base class implementation does nothing.
+ This method is called when a comment is encountered (e.g. ``<!--comment-->``).
+ For example, the comment ``<!-- comment -->`` will cause this method to be
+ called with the argument ``' comment '``.
-.. method:: HTMLParser.handle_decl(decl)
+ The content of Internet Explorer conditional comments (condcoms) will also be
+ sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
+ this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
- Method called when an SGML ``doctype`` declaration is read by the parser.
- The *decl* parameter will be the entire contents of the declaration inside
- the ``<!...>`` markup. It is intended to be overridden by a derived class;
- the base class implementation does nothing.
+.. method:: HTMLParser.handle_decl(decl)
-.. method:: HTMLParser.unknown_decl(data)
+ This method is called to handle an HTML doctype declaration (e.g.
+ ``<!DOCTYPE html>``).
- Method called when an unrecognized SGML declaration is read by the parser.
- The *data* parameter will be the entire contents of the declaration inside
- the ``<!...>`` markup. It is sometimes useful to be overridden by a
- derived class; the base class implementation raises an :exc:`HTMLParseError`.
+ The *decl* parameter will be the entire contents of the declaration inside
+ the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
.. method:: HTMLParser.handle_pi(data)
@@ -174,29 +218,123 @@ An exception is defined as well:
cause the ``'?'`` to be included in *data*.
-.. _htmlparser-example:
+.. method:: HTMLParser.unknown_decl(data)
-Example HTML Parser Application
--------------------------------
+ This method is called when an unrecognized declaration is read by the parser.
+
+ The *data* parameter will be the entire contents of the declaration inside
+ the ``<![...]>`` markup. It is sometimes useful to be overridden by a
+ derived class. The base class implementation raises an :exc:`HTMLParseError`
+ when *strict* is ``True``.
-As a basic example, below is a simple HTML parser that uses the
-:class:`HTMLParser` class to print out start tags, end tags, and data
-as they are encountered::
+
+.. _htmlparser-examples:
+
+Examples
+--------
+
+The following class implements a parser that will be used to illustrate more
+examples::
    from html.parser import HTMLParser
+   from html.entities import name2codepoint
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
-           print("Encountered a start tag:", tag)
+           print("Start tag:", tag)
+           for attr in attrs:
+               print(" attr:", attr)
        def handle_endtag(self, tag):
-           print("Encountered an end tag:", tag)
+           print("End tag :", tag)
        def handle_data(self, data):
-           print("Encountered some data:", data)
-
-   parser = MyHTMLParser()
-   parser.feed('<html><head><title>Test</title></head>'
-               '<body><h1>Parse me!</h1></body></html>')
-
+           print("Data :", data)
+       def handle_comment(self, data):
+           print("Comment :", data)
+       def handle_entityref(self, name):
+           c = chr(name2codepoint[name])
+           print("Named ent:", c)
+       def handle_charref(self, name):
+           if name.startswith('x'):
+               c = chr(int(name[1:], 16))
+           else:
+               c = chr(int(name))
+           print("Num ent :", c)
+       def handle_decl(self, data):
+           print("Decl :", data)
+
+   parser = MyHTMLParser(strict=False)
+
+Parsing a doctype::
+
+ >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
+ ... '"http://www.w3.org/TR/html4/strict.dtd">')
+ Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
+
+Parsing an element with a few attributes and a title::
+
+ >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
+ Start tag: img
+ attr: ('src', 'python-logo.png')
+ attr: ('alt', 'The Python logo')
+ >>>
+ >>> parser.feed('<h1>Python</h1>')
+ Start tag: h1
+ Data : Python
+ End tag : h1
+
+The content of ``script`` and ``style`` elements is returned as is, without
+further parsing::
+
+ >>> parser.feed('<style type="text/css">#python { color: green }</style>')
+ Start tag: style
+ attr: ('type', 'text/css')
+ Data : #python { color: green }
+ End tag : style
+ >>>
+ >>> parser.feed('<script type="text/javascript">'
+ ... 'alert("<strong>hello!</strong>");</script>')
+ Start tag: script
+ attr: ('type', 'text/javascript')
+ Data : alert("<strong>hello!</strong>");
+ End tag : script
+
+Parsing comments::
+
+ >>> parser.feed('<!-- a comment -->'
+ ... '<!--[if IE 9]>IE-specific content<![endif]-->')
+ Comment : a comment
+ Comment : [if IE 9]>IE-specific content<![endif]
+
+Parsing named and numeric character references and converting them to the
+correct char (note: these 3 references are all equivalent to ``'>'``)::
+
+ >>> parser.feed('&gt;&#62;&#x3E;')
+ Named ent: >
+ Num ent : >
+ Num ent : >
+
+Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
+:meth:`~HTMLParser.handle_data` might be called more than once::
+
+ >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
+ ... parser.feed(chunk)
+ ...
+ Start tag: span
+ Data : buff
+ Data : ered
+ Data : text
+ End tag : span
+
+Parsing invalid HTML (e.g. unquoted attributes) also works::
+
+ >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
+ Start tag: p
+ Start tag: a
+ attr: ('class', 'link')
+ attr: ('href', '#main')
+ Data : tag soup
+ End tag : p
+ End tag : a
.. rubric:: Footnotes
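
The revised text advises subclassing ``HTMLParser`` and overriding its handler methods. As a minimal sketch of that pattern (not part of this commit), the hypothetical ``LinkCollector`` class below gathers ``href`` values from ``<a>`` start tags instead of printing them. It assumes a current Python release, where the *strict* argument shown in the diff no longer exists, so the parser is created without it. The input reuses the markup fragments from the documentation examples and is fed in chunks to show that a tag split across two ``feed()`` calls is buffered until it is complete.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical example subclass: collect href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lower-cased; attrs is a list of (name, value) pairs
        # whose names are lower-cased and whose values have quotes removed.
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
chunks = [
    '<p><A HRE',                                  # start tag cut mid-attribute
    'F="http://www.cwi.nl/">CWI</a>',             # completes the buffered tag
    '<a class=link href=#main>tag soup</p ></a>',  # unquoted attributes
]
for chunk in chunks:
    collector.feed(chunk)
collector.close()  # flush anything still buffered

print(collector.links)
```

Keeping state on the instance (here ``self.links``) rather than printing inside the handlers makes the parser's result easy to inspect or test after ``close()`` has been called.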