Diffstat (limited to 'Doc/library/htmllib.rst')
-rw-r--r-- | Doc/library/htmllib.rst | 147 |
1 files changed, 0 insertions, 147 deletions
diff --git a/Doc/library/htmllib.rst b/Doc/library/htmllib.rst
deleted file mode 100644
index 5e6554a..0000000
--- a/Doc/library/htmllib.rst
+++ /dev/null
@@ -1,147 +0,0 @@
-
-:mod:`htmllib` --- A parser for HTML documents
-==============================================
-
-.. module:: htmllib
-   :synopsis: A parser for HTML documents.
-
-
-.. index::
-   single: HTML
-   single: hypertext
-
-.. index::
-   module: sgmllib
-   module: formatter
-   single: SGMLParser (in module sgmllib)
-
-This module defines a class which can serve as a base for parsing text files
-formatted in the HyperText Mark-up Language (HTML). The class is not directly
-concerned with I/O --- it must be provided with input in string form via a
-method, and makes calls to methods of a "formatter" object in order to produce
-output. The :class:`HTMLParser` class is designed to be used as a base class
-for other classes in order to add functionality, and allows most of its methods
-to be extended or overridden. In turn, this class is derived from and extends
-the :class:`SGMLParser` class defined in module :mod:`sgmllib`. The
-:class:`HTMLParser` implementation supports the HTML 2.0 language as described
-in :rfc:`1866`. Two implementations of formatter objects are provided in the
-:mod:`formatter` module; refer to the documentation for that module for
-information on the formatter interface.
-
-The following is a summary of the interface defined by
-:class:`sgmllib.SGMLParser`:
-
-* The interface to feed data to an instance is through the :meth:`feed` method,
-  which takes a string argument. This can be called with as little or as much
-  text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
-  ``p.feed(a+b)``. When the data contains complete HTML markup constructs, these
-  are processed immediately; incomplete constructs are saved in a buffer. To
-  force processing of all unprocessed data, call the :meth:`close` method.
-
-  For example, to parse the entire contents of a file, use::
-
-     parser.feed(open('myfile.html').read())
-     parser.close()
-
-* The interface to define semantics for HTML tags is very simple: derive a class
-  and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`.
-  The parser will call these at appropriate moments: :meth:`start_tag` or
-  :meth:`do_tag` is called when an opening tag of the form ``<tag ...>`` is
-  encountered; :meth:`end_tag` is called when a closing tag of the form ``<tag>``
-  is encountered. If an opening tag requires a corresponding closing tag, like
-  ``<H1>`` ... ``</H1>``, the class should define the :meth:`start_tag` method; if
-  a tag requires no closing tag, like ``<P>``, the class should define the
-  :meth:`do_tag` method.
-
-The module defines a parser class and an exception:
-
-
-.. class:: HTMLParser(formatter)
-
-   This is the basic HTML parser class. It supports all entity names required by
-   the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1). It also defines
-   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
-
-
-.. exception:: HTMLParseError
-
-   Exception raised by the :class:`HTMLParser` class when it encounters an error
-   while parsing.
-
-
-.. seealso::
-
-   Module :mod:`formatter`
-      Interface definition for transforming an abstract flow of formatting events into
-      specific output events on writer objects.
-
-   Module :mod:`html.parser`
-      Alternate HTML parser that offers a slightly lower-level view of the input, but
-      is designed to work with XHTML, and does not implement some of the SGML syntax
-      not used in "HTML as deployed" and which isn't legal for XHTML.
-
-   Module :mod:`html.entities`
-      Definition of replacement text for XHTML 1.0 entities.
-
-   Module :mod:`sgmllib`
-      Base class for :class:`HTMLParser`.
-
-
-.. _html-parser-objects:
-
-HTMLParser Objects
-------------------
-
-In addition to tag methods, the :class:`HTMLParser` class provides some
-additional methods and instance variables for use within tag methods.
-
-
-.. attribute:: HTMLParser.formatter
-
-   This is the formatter instance associated with the parser.
-
-
-.. attribute:: HTMLParser.nofill
-
-   Boolean flag which should be true when whitespace should not be collapsed, or
-   false when it should be. In general, this should only be true when character
-   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
-   The default value is false. This affects the operation of :meth:`handle_data`
-   and :meth:`save_end`.
-
-
-.. method:: HTMLParser.anchor_bgn(href, name, type)
-
-   This method is called at the start of an anchor region. The arguments
-   correspond to the attributes of the ``<A>`` tag with the same names. The
-   default implementation maintains a list of hyperlinks (defined by the ``HREF``
-   attribute for ``<A>`` tags) within the document. The list of hyperlinks is
-   available as the data attribute :attr:`anchorlist`.
-
-
-.. method:: HTMLParser.anchor_end()
-
-   This method is called at the end of an anchor region. The default
-   implementation adds a textual footnote marker using an index into the list of
-   hyperlinks created by :meth:`anchor_bgn`.
-
-
-.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])
-
-   This method is called to handle images. The default implementation simply
-   passes the *alt* value to the :meth:`handle_data` method.
-
-
-.. method:: HTMLParser.save_bgn()
-
-   Begins saving character data in a buffer instead of sending it to the formatter
-   object. Retrieve the stored data via :meth:`save_end`. Use of the
-   :meth:`save_bgn` / :meth:`save_end` pair may not be nested.
-
-
-.. method:: HTMLParser.save_end()
-
-   Ends buffering character data and returns all data saved since the preceding
-   call to :meth:`save_bgn`. If the :attr:`nofill` flag is false, whitespace is
-   collapsed to single spaces. A call to this method without a preceding call to
-   :meth:`save_bgn` will raise a :exc:`TypeError` exception.
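
The deleted text describes two ideas that are easier to see in code: feeding input in arbitrary pieces, and hooking tag handling by subclassing. Since this change removes ``htmllib``, the sketch below uses ``html.parser`` instead, which the deleted "seealso" block names as the alternate parser. It is an illustration under stated assumptions, not the removed module's behaviour: ``handle_starttag`` and ``handle_endtag`` are the ``html.parser`` hooks and stand in for the ``start_tag`` / ``end_tag`` convention described in the diff, and ``TagLogger`` and ``parse_in_chunks`` are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical subclass that records the tags and text it is given."""

    def __init__(self):
        super().__init__()
        self.tags = []   # ("start" | "end", tag name) in document order
        self.text = []   # character data, possibly delivered in pieces

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))

    def handle_data(self, data):
        self.text.append(data)


def parse_in_chunks(chunks):
    """Feed each chunk in turn, then close() to flush buffered input."""
    parser = TagLogger()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    # Join the text pieces: data may arrive split across several calls.
    return parser.tags, "".join(parser.text)


if __name__ == "__main__":
    # The second call splits the closing tag across two feed() calls.
    print(parse_in_chunks(["<h1>Title</h1>"]))
    print(parse_in_chunks(["<h1>Title</h", "1>"]))
```

Both calls are expected to report the same tags and text, because the incomplete ``</h`` is held in the parser's buffer until the rest of the tag arrives. The text is joined before comparison because character data may be delivered to ``handle_data`` in more than one piece.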