diff options
author | Georg Brandl <georg@python.org> | 2008-05-18 07:53:01 (GMT) |
---|---|---|
committer | Georg Brandl <georg@python.org> | 2008-05-18 07:53:01 (GMT) |
commit | 9087b7f83b2e2fb8eaee899a9b6d6f0646bd3aa1 (patch) | |
tree | 65afd9f3376da8395e950eb84d0f3e25e787a3c8 /Doc/library/html.parser.rst | |
parent | bf93b0470affd720a603805eb00c917a9bc39cc9 (diff) | |
download | cpython-9087b7f83b2e2fb8eaee899a9b6d6f0646bd3aa1.zip cpython-9087b7f83b2e2fb8eaee899a9b6d6f0646bd3aa1.tar.gz cpython-9087b7f83b2e2fb8eaee899a9b6d6f0646bd3aa1.tar.bz2 |
Merged revisions 63438 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk
........
r63438 | georg.brandl | 2008-05-17 23:54:03 +0200 (Sat, 17 May 2008) | 3 lines
Rename html.parser file, and split html.entities from htmllib
to ease removal of the latter in Py3k.
........
Diffstat (limited to 'Doc/library/html.parser.rst')
-rw-r--r-- | Doc/library/html.parser.rst | 179 |
1 files changed, 179 insertions, 0 deletions
diff --git a/Doc/library/html.parser.rst b/Doc/library/html.parser.rst new file mode 100644 index 0000000..fe6748a --- /dev/null +++ b/Doc/library/html.parser.rst @@ -0,0 +1,179 @@ +:mod:`html.parser` --- Simple HTML and XHTML parser +=================================================== + +.. module:: html.parser + :synopsis: A simple parser that can handle HTML and XHTML. + + +.. index:: + single: HTML + single: XHTML + +This module defines a class :class:`HTMLParser` which serves as the basis for +parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. +Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser +in :mod:`sgmllib`. + + +.. class:: HTMLParser() + + The :class:`HTMLParser` class is instantiated without arguments. + + An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags + begin and end. The :class:`HTMLParser` class is meant to be overridden by the + user to provide a desired behavior. + + Unlike the parser in :mod:`htmllib`, this parser does not check that end tags + match start tags or call the end-tag handler for elements which are closed + implicitly by closing an outer element. + +An exception is defined as well: + + +.. exception:: HTMLParseError + + Exception raised by the :class:`HTMLParser` class when it encounters an error + while parsing. This exception provides three attributes: :attr:`msg` is a brief + message explaining the error, :attr:`lineno` is the number of the line on which + the broken construct was detected, and :attr:`offset` is the number of + characters into the line at which the construct starts. + +:class:`HTMLParser` instances have the following methods: + + +.. method:: HTMLParser.reset() + + Reset the instance. Loses all unprocessed data. This is called implicitly at + instantiation time. + + +.. method:: HTMLParser.feed(data) + + Feed some text to the parser. It is processed insofar as it consists of + complete elements; incomplete data is buffered until more data is fed or + :meth:`close` is called. + + +.. method:: HTMLParser.close() + + Force processing of all buffered data as if it were followed by an end-of-file + mark. This method may be redefined by a derived class to define additional + processing at the end of the input, but the redefined version should always call + the :class:`HTMLParser` base class method :meth:`close`. + + +.. method:: HTMLParser.getpos() + + Return current line number and offset. + + +.. method:: HTMLParser.get_starttag_text() + + Return the text of the most recently opened start tag. This should not normally + be needed for structured processing, but may be useful in dealing with HTML "as + deployed" or for re-generating input with minimal changes (whitespace between + attributes can be preserved, etc.). + + +.. method:: HTMLParser.handle_starttag(tag, attrs) + + This method is called to handle the start of a tag. It is intended to be + overridden by a derived class; the base class implementation does nothing. + + The *tag* argument is the name of the tag converted to lower case. The *attrs* + argument is a list of ``(name, value)`` pairs containing the attributes found + inside the tag's ``<>`` brackets. The *name* will be translated to lower case, + and quotes in the *value* have been removed, and character and entity references + have been replaced. For instance, for the tag ``<A + HREF="http://www.cwi.nl/">``, this method would be called as + ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. + + All entity references from :mod:`html.entities` are replaced in the attribute + values. + + +.. method:: HTMLParser.handle_startendtag(tag, attrs) + + Similar to :meth:`handle_starttag`, but called when the parser encounters an + XHTML-style empty tag (``<a .../>``). This method may be overridden by + subclasses which require this particular lexical information; the default + implementation simple calls :meth:`handle_starttag` and :meth:`handle_endtag`. + + +.. method:: HTMLParser.handle_endtag(tag) + + This method is called to handle the end tag of an element. It is intended to be + overridden by a derived class; the base class implementation does nothing. The + *tag* argument is the name of the tag converted to lower case. + + +.. method:: HTMLParser.handle_data(data) + + This method is called to process arbitrary data. It is intended to be + overridden by a derived class; the base class implementation does nothing. + + +.. method:: HTMLParser.handle_charref(name) + + This method is called to process a character reference of the form ``&#ref;``. + It is intended to be overridden by a derived class; the base class + implementation does nothing. + + +.. method:: HTMLParser.handle_entityref(name) + + This method is called to process a general entity reference of the form + ``&name;`` where *name* is an general entity reference. It is intended to be + overridden by a derived class; the base class implementation does nothing. + + +.. method:: HTMLParser.handle_comment(data) + + This method is called when a comment is encountered. The *comment* argument is + a string containing the text between the ``--`` and ``--`` delimiters, but not + the delimiters themselves. For example, the comment ``<!--text-->`` will cause + this method to be called with the argument ``'text'``. It is intended to be + overridden by a derived class; the base class implementation does nothing. + + +.. method:: HTMLParser.handle_decl(decl) + + Method called when an SGML declaration is read by the parser. The *decl* + parameter will be the entire contents of the declaration inside the ``<!``...\ + ``>`` markup. It is intended to be overridden by a derived class; the base + class implementation does nothing. + + +.. method:: HTMLParser.handle_pi(data) + + Method called when a processing instruction is encountered. The *data* + parameter will contain the entire processing instruction. For example, for the + processing instruction ``<?proc color='red'>``, this method would be called as + ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived + class; the base class implementation does nothing. + + .. note:: + + The :class:`HTMLParser` class uses the SGML syntactic rules for processing + instructions. An XHTML processing instruction using the trailing ``'?'`` will + cause the ``'?'`` to be included in *data*. + + +.. _htmlparser-example: + +Example HTML Parser Application +------------------------------- + +As a basic example, below is a very basic HTML parser that uses the +:class:`HTMLParser` class to print out tags as they are encountered:: + + from html.parser import HTMLParser + + class MyHTMLParser(HTMLParser): + + def handle_starttag(self, tag, attrs): + print "Encountered the beginning of a %s tag" % tag + + def handle_endtag(self, tag): + print "Encountered the end of a %s tag" % tag + |