Diffstat (limited to 'Doc/library/sgmllib.rst')
-rw-r--r-- | Doc/library/sgmllib.rst | 253 |
1 files changed, 0 insertions, 253 deletions
diff --git a/Doc/library/sgmllib.rst b/Doc/library/sgmllib.rst
deleted file mode 100644
index 637aa91..0000000
--- a/Doc/library/sgmllib.rst
+++ /dev/null
@@ -1,253 +0,0 @@
-
-:mod:`sgmllib` --- Simple SGML parser
-=====================================
-
-.. module:: sgmllib
-   :synopsis: Only as much of an SGML parser as needed to parse HTML.
-
-
-.. index:: single: SGML
-
-This module defines a class :class:`SGMLParser` which serves as the basis for
-parsing text files formatted in SGML (Standard Generalized Mark-up Language).
-In fact, it does not provide a full SGML parser --- it only parses SGML insofar
-as it is used by HTML, and the module only exists as a base for the
-:mod:`htmllib` module. Another HTML parser which supports XHTML and offers a
-somewhat different interface is available in the :mod:`HTMLParser` module.
-
-
-.. class:: SGMLParser()
-
-   The :class:`SGMLParser` class is instantiated without arguments. The parser is
-   hardcoded to recognize the following constructs:
-
-   * Opening and closing tags of the form ``<tag attr="value" ...>`` and
-     ``</tag>``, respectively.
-
-   * Numeric character references of the form ``&#name;``.
-
-   * Entity references of the form ``&name;``.
-
-   * SGML comments of the form ``<!--text-->``. Note that spaces, tabs, and
-     newlines are allowed between the trailing ``>`` and the immediately preceding
-     ``--``.
-
-A single exception is defined as well:
-
-
-.. exception:: SGMLParseError
-
-   Exception raised by the :class:`SGMLParser` class when it encounters an error
-   while parsing.
-
-:class:`SGMLParser` instances have the following methods:
-
-
-.. method:: SGMLParser.reset()
-
-   Reset the instance. Loses all unprocessed data. This is called implicitly at
-   instantiation time.
-
-
-.. method:: SGMLParser.setnomoretags()
-
-   Stop processing tags. Treat all following input as literal input (CDATA).
-   (This is only provided so the HTML tag ``<PLAINTEXT>`` can be implemented.)
-
-
-.. method:: SGMLParser.setliteral()
-
-   Enter literal mode (CDATA mode).
-
-
-.. method:: SGMLParser.feed(data)
-
-   Feed some text to the parser. It is processed insofar as it consists of
-   complete elements; incomplete data is buffered until more data is fed or
-   :meth:`close` is called.
-
-
-.. method:: SGMLParser.close()
-
-   Force processing of all buffered data as if it were followed by an end-of-file
-   mark. This method may be redefined by a derived class to define additional
-   processing at the end of the input, but the redefined version should always call
-   :meth:`close`.
-
-
-.. method:: SGMLParser.get_starttag_text()
-
-   Return the text of the most recently opened start tag. This should not normally
-   be needed for structured processing, but may be useful in dealing with HTML "as
-   deployed" or for re-generating input with minimal changes (whitespace between
-   attributes can be preserved, etc.).
-
-
-.. method:: SGMLParser.handle_starttag(tag, method, attributes)
-
-   This method is called to handle start tags for which either a :meth:`start_tag`
-   or :meth:`do_tag` method has been defined. The *tag* argument is the name of
-   the tag converted to lower case, and the *method* argument is the bound method
-   which should be used to support semantic interpretation of the start tag. The
-   *attributes* argument is a list of ``(name, value)`` pairs containing the
-   attributes found inside the tag's ``<>`` brackets.
-
-   The *name* has been translated to lower case. Double quotes and backslashes in
-   the *value* have been interpreted, as well as known character references and
-   known entity references terminated by a semicolon (normally, entity references
-   can be terminated by any non-alphanumerical character, but this would break the
-   very common case of ``<A HREF="url?spam=1&eggs=2">`` when ``eggs`` is a valid
-   entity name).
-
-   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method would
-   be called as ``unknown_starttag('a', [('href', 'http://www.cwi.nl/')])``. The
-   base implementation simply calls *method* with *attributes* as the only
-   argument.
-
-
-.. method:: SGMLParser.handle_endtag(tag, method)
-
-   This method is called to handle endtags for which an :meth:`end_tag` method has
-   been defined. The *tag* argument is the name of the tag converted to lower
-   case, and the *method* argument is the bound method which should be used to
-   support semantic interpretation of the end tag. If no :meth:`end_tag` method is
-   defined for the closing element, this handler is not called. The base
-   implementation simply calls *method*.
-
-
-.. method:: SGMLParser.handle_data(data)
-
-   This method is called to process arbitrary data. It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
-
-
-.. method:: SGMLParser.handle_charref(ref)
-
-   This method is called to process a character reference of the form ``&#ref;``.
-   The base implementation uses :meth:`convert_charref` to convert the reference to
-   a string. If that method returns a string, it is passed to :meth:`handle_data`,
-   otherwise ``unknown_charref(ref)`` is called to handle the error.
-
-
-.. method:: SGMLParser.convert_charref(ref)
-
-   Convert a character reference to a string, or ``None``. *ref* is the reference
-   passed in as a string. In the base implementation, *ref* must be a decimal
-   number in the range 0-255. It converts the code point found using the
-   :meth:`convert_codepoint` method. If *ref* is invalid or out of range, this
-   method returns ``None``. This method is called by the default
-   :meth:`handle_charref` implementation and by the attribute value parser.
-
-
-.. method:: SGMLParser.convert_codepoint(codepoint)
-
-   Convert a codepoint to a :class:`str` value. Encodings can be handled here if
-   appropriate, though the rest of :mod:`sgmllib` is oblivious on this matter.
-
-
-.. method:: SGMLParser.handle_entityref(ref)
-
-   This method is called to process a general entity reference of the form
-   ``&ref;`` where *ref* is an general entity reference. It converts *ref* by
-   passing it to :meth:`convert_entityref`. If a translation is returned, it calls
-   the method :meth:`handle_data` with the translation; otherwise, it calls the
-   method ``unknown_entityref(ref)``. The default :attr:`entitydefs` defines
-   translations for ``&amp;``, ``&apos``, ``&gt;``, ``&lt;``, and ``&quot;``.
-
-
-.. method:: SGMLParser.convert_entityref(ref)
-
-   Convert a named entity reference to a :class:`str` value, or ``None``. The
-   resulting value will not be parsed. *ref* will be only the name of the entity.
-   The default implementation looks for *ref* in the instance (or class) variable
-   :attr:`entitydefs` which should be a mapping from entity names to corresponding
-   translations. If no translation is available for *ref*, this method returns
-   ``None``. This method is called by the default :meth:`handle_entityref`
-   implementation and by the attribute value parser.
-
-
-.. method:: SGMLParser.handle_comment(comment)
-
-   This method is called when a comment is encountered. The *comment* argument is
-   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
-   not the delimiters themselves. For example, the comment ``<!--text-->`` will
-   cause this method to be called with the argument ``'text'``. The default method
-   does nothing.
-
-
-.. method:: SGMLParser.handle_decl(data)
-
-   Method called when an SGML declaration is read by the parser. In practice, the
-   ``DOCTYPE`` declaration is the only thing observed in HTML, but the parser does
-   not discriminate among different (or broken) declarations. Internal subsets in
-   a ``DOCTYPE`` declaration are not supported. The *data* parameter will be the
-   entire contents of the declaration inside the ``<!``...\ ``>`` markup. The
-   default implementation does nothing.
-
-
-.. method:: SGMLParser.report_unbalanced(tag)
-
-   This method is called when an end tag is found which does not correspond to any
-   open element.
-
-
-.. method:: SGMLParser.unknown_starttag(tag, attributes)
-
-   This method is called to process an unknown start tag. It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
-
-
-.. method:: SGMLParser.unknown_endtag(tag)
-
-   This method is called to process an unknown end tag. It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
-
-
-.. method:: SGMLParser.unknown_charref(ref)
-
-   This method is called to process unresolvable numeric character references.
-   Refer to :meth:`handle_charref` to determine what is handled by default. It is
-   intended to be overridden by a derived class; the base class implementation does
-   nothing.
-
-
-.. method:: SGMLParser.unknown_entityref(ref)
-
-   This method is called to process an unknown entity reference. It is intended to
-   be overridden by a derived class; the base class implementation does nothing.
-
-Apart from overriding or extending the methods listed above, derived classes may
-also define methods of the following form to define processing of specific tags.
-Tag names in the input stream are case independent; the *tag* occurring in
-method names must be in lower case:
-
-
-.. method:: SGMLParser.start_tag(attributes)
-   :noindex:
-
-   This method is called to process an opening tag *tag*. It has preference over
-   :meth:`do_tag`. The *attributes* argument has the same meaning as described for
-   :meth:`handle_starttag` above.
-
-
-.. method:: SGMLParser.do_tag(attributes)
-   :noindex:
-
-   This method is called to process an opening tag *tag* for which no
-   :meth:`start_tag` method is defined. The *attributes* argument has the same
-   meaning as described for :meth:`handle_starttag` above.
-
-
-.. method:: SGMLParser.end_tag()
-   :noindex:
-
-   This method is called to process a closing tag *tag*.
-
-Note that the parser maintains a stack of open elements for which no end tag has
-been found yet. Only tags processed by :meth:`start_tag` are pushed on this
-stack. Definition of an :meth:`end_tag` method is optional for these tags. For
-tags processed by :meth:`do_tag` or by :meth:`unknown_tag`, no :meth:`end_tag`
-method must be defined; if defined, it will not be used. If both
-:meth:`start_tag` and :meth:`do_tag` methods exist for a tag, the
-:meth:`start_tag` method takes precedence.
-
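
Note (not part of the diff): the tag-dispatch rules in the last paragraph of the removed document can be illustrated with a small, self-contained sketch. It does not use sgmllib and is not its implementation. The names TagDispatcher, Demo, open_tag, close_tag and log are hypothetical, chosen only for this illustration. End-tag handling is simplified and assumed rather than taken from the module: an end tag with no matching open element is always reported through report_unbalanced.

```python
class TagDispatcher:
    """Hypothetical stand-in that mimics only the start_/do_/end_ lookup rules."""

    def __init__(self):
        self.stack = []  # open elements, pushed only for start_<tag> handlers
        self.log = []    # record of handler calls, for inspection

    def open_tag(self, tag, attributes):
        tag = tag.lower()  # tag names are case independent
        start = getattr(self, "start_" + tag, None)
        if start is not None:
            # start_<tag> takes precedence over do_<tag> and opens an element.
            self.stack.append(tag)
            start(attributes)
            return
        do = getattr(self, "do_" + tag, None)
        if do is not None:
            # do_<tag> handles the tag without pushing it on the stack.
            do(attributes)
            return
        self.unknown_starttag(tag, attributes)

    def close_tag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            # No open element matches, so any end_<tag> method is not used.
            self.report_unbalanced(tag)
            return
        # Discard elements opened after the match, then the match itself.
        while self.stack.pop() != tag:
            pass
        end = getattr(self, "end_" + tag, None)
        if end is not None:  # defining end_<tag> is optional
            end()

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag, attributes))

    def report_unbalanced(self, tag):
        self.log.append(("report_unbalanced", tag))


class Demo(TagDispatcher):
    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def do_a(self, attributes):  # not called: start_a takes precedence
        self.log.append(("do_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def end_br(self):  # not called: <br> is never pushed on the stack
        self.log.append(("end_br",))


demo = Demo()
demo.open_tag("A", [("href", "http://www.cwi.nl/")])
demo.open_tag("br", [])
demo.close_tag("br")
demo.close_tag("a")
demo.open_tag("p", [])
print(demo.log)
```

In this sketch the `A` tag reaches `start_a` rather than `do_a`, the `end_br` method is never invoked because `do_br` did not open an element, and `p` falls through to `unknown_starttag`.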