Diffstat (limited to 'Doc/library/htmllib.rst')
-rw-r--r-- | Doc/library/htmllib.rst | 186 |
1 files changed, 186 insertions, 0 deletions
diff --git a/Doc/library/htmllib.rst b/Doc/library/htmllib.rst
new file mode 100644
index 0000000..96a7d08
--- /dev/null
+++ b/Doc/library/htmllib.rst
@@ -0,0 +1,186 @@
+
+:mod:`htmllib` --- A parser for HTML documents
+==============================================
+
+.. module:: htmllib
+   :synopsis: A parser for HTML documents.
+
+
+.. index::
+   single: HTML
+   single: hypertext
+
+.. index::
+   module: sgmllib
+   module: formatter
+   single: SGMLParser (in module sgmllib)
+
+This module defines a class which can serve as a base for parsing text files
+formatted in the HyperText Mark-up Language (HTML).  The class is not directly
+concerned with I/O --- it must be provided with input in string form via a
+method, and makes calls to methods of a "formatter" object in order to produce
+output.  The :class:`HTMLParser` class is designed to be used as a base class
+for other classes in order to add functionality, and allows most of its methods
+to be extended or overridden.  In turn, this class is derived from and extends
+the :class:`SGMLParser` class defined in module :mod:`sgmllib`.  The
+:class:`HTMLParser` implementation supports the HTML 2.0 language as described
+in :rfc:`1866`.  Two implementations of formatter objects are provided in the
+:mod:`formatter` module; refer to the documentation for that module for
+information on the formatter interface.
+
+The following is a summary of the interface defined by
+:class:`sgmllib.SGMLParser`:
+
+* The interface to feed data to an instance is through the :meth:`feed` method,
+  which takes a string argument.  This can be called with as little or as much
+  text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
+  ``p.feed(a+b)``.  When the data contains complete HTML markup constructs, these
+  are processed immediately; incomplete constructs are saved in a buffer.  To
+  force processing of all unprocessed data, call the :meth:`close` method.
+
+  For example, to parse the entire contents of a file, use::
+
+     parser.feed(open('myfile.html').read())
+     parser.close()
+
+* The interface to define semantics for HTML tags is very simple: derive a class
+  and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`.
+  The parser will call these at appropriate moments: :meth:`start_tag` or
+  :meth:`do_tag` is called when an opening tag of the form ``<tag ...>`` is
+  encountered; :meth:`end_tag` is called when a closing tag of the form ``</tag>``
+  is encountered.  If an opening tag requires a corresponding closing tag, like
+  ``<H1>`` ... ``</H1>``, the class should define the :meth:`start_tag` method; if
+  a tag requires no closing tag, like ``<P>``, the class should define the
+  :meth:`do_tag` method.
+
+The module defines a parser class and an exception:
+
+
+.. class:: HTMLParser(formatter)
+
+   This is the basic HTML parser class.  It supports all entity names required by
+   the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1).  It also defines
+   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
+
+
+.. exception:: HTMLParseError
+
+   Exception raised by the :class:`HTMLParser` class when it encounters an error
+   while parsing.
+
+   .. versionadded:: 2.4
+
+
+.. seealso::
+
+   Module :mod:`formatter`
+      Interface definition for transforming an abstract flow of formatting events into
+      specific output events on writer objects.
+
+   Module :mod:`HTMLParser`
+      Alternate HTML parser that offers a slightly lower-level view of the input, but
+      is designed to work with XHTML, and does not implement some of the SGML syntax
+      not used in "HTML as deployed" and which isn't legal for XHTML.
+
+   Module :mod:`htmlentitydefs`
+      Definition of replacement text for XHTML 1.0  entities.
+
+   Module :mod:`sgmllib`
+      Base class for :class:`HTMLParser`.
+
+
+.. _html-parser-objects:
+
+HTMLParser Objects
+------------------
+
+In addition to tag methods, the :class:`HTMLParser` class provides some
+additional methods and instance variables for use within tag methods.
+
+
+.. attribute:: HTMLParser.formatter
+
+   This is the formatter instance associated with the parser.
+
+
+.. attribute:: HTMLParser.nofill
+
+   Boolean flag which should be true when whitespace should not be collapsed, or
+   false when it should be.  In general, this should only be true when character
+   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
+   The default value is false.  This affects the operation of :meth:`handle_data`
+   and :meth:`save_end`.
+
+
+.. method:: HTMLParser.anchor_bgn(href, name, type)
+
+   This method is called at the start of an anchor region.  The arguments
+   correspond to the attributes of the ``<A>`` tag with the same names.  The
+   default implementation maintains a list of hyperlinks (defined by the ``HREF``
+   attribute for ``<A>`` tags) within the document.  The list of hyperlinks is
+   available as the data attribute :attr:`anchorlist`.
+
+
+.. method:: HTMLParser.anchor_end()
+
+   This method is called at the end of an anchor region.  The default
+   implementation adds a textual footnote marker using an index into the list of
+   hyperlinks created by :meth:`anchor_bgn`.
+
+
+.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])
+
+   This method is called to handle images.  The default implementation simply
+   passes the *alt* value to the :meth:`handle_data` method.
+
+
+.. method:: HTMLParser.save_bgn()
+
+   Begins saving character data in a buffer instead of sending it to the formatter
+   object.  Retrieve the stored data via :meth:`save_end`. Use of the
+   :meth:`save_bgn` / :meth:`save_end` pair may not be nested.
+
+
+.. method:: HTMLParser.save_end()
+
+   Ends buffering character data and returns all data saved since the preceding
+   call to :meth:`save_bgn`.  If the :attr:`nofill` flag is false, whitespace is
+   collapsed to single spaces.  A call to this method without a preceding call to
+   :meth:`save_bgn` will raise a :exc:`TypeError` exception.
+
+
+:mod:`htmlentitydefs` --- Definitions of HTML general entities
+==============================================================
+
+.. module:: htmlentitydefs
+   :synopsis: Definitions of HTML general entities.
+.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
+
+
+This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
+and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
+provide the :attr:`entitydefs` member of the :class:`HTMLParser` class.  The
+definition provided here contains all the entities defined by XHTML 1.0  that
+can be handled using simple textual substitution in the Latin-1 character set
+(ISO-8859-1).
+
+
+.. data:: entitydefs
+
+   A dictionary mapping XHTML 1.0 entity definitions to their replacement text in
+   ISO Latin-1.
+
+
+.. data:: name2codepoint
+
+   A dictionary that maps HTML entity names to the Unicode codepoints.
+
+   .. versionadded:: 2.3
+
+
+.. data:: codepoint2name
+
+   A dictionary that maps Unicode codepoints to HTML entity names.
+
+   .. versionadded:: 2.3
+
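
Note on the :mod:`htmlentitydefs` section: the ``name2codepoint`` and ``codepoint2name`` dictionaries documented above could be illustrated roughly as follows. This is a minimal sketch and not part of the patch. It assumes that on Python 3 the same dictionaries are importable from ``html.entities``, and ``decode_entities`` is a hypothetical helper introduced only for this illustration.

```python
import re

# Assumption: Python 2 exposes these names in ``htmlentitydefs``;
# Python 3 exposes the same dictionaries in ``html.entities``.
try:
    from htmlentitydefs import name2codepoint, codepoint2name
except ImportError:
    from html.entities import name2codepoint, codepoint2name

# ``unichr`` exists only on Python 2; on Python 3 ``chr`` does the same job.
try:
    unichr
except NameError:
    unichr = chr


def decode_entities(text):
    """Replace known named entities (``&name;``) with their characters.

    Hypothetical helper for illustration; unknown names are left untouched.
    """
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        return match.group(0)

    return re.sub(r"&(\w+);", replace, text)


# The two dictionaries are inverse lookups of each other for a given entity.
amp_codepoint = name2codepoint["amp"]
amp_name = codepoint2name[amp_codepoint]
decoded = decode_entities("caf&eacute; &amp; bar &bogus;")
```

Leaving unknown entity names untouched is a design choice of this sketch; a stricter caller might prefer to raise an error instead.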