Diffstat (limited to 'Doc/library/sgmllib.rst')
-rw-r--r-- | Doc/library/sgmllib.rst | 270 |
1 files changed, 270 insertions, 0 deletions
diff --git a/Doc/library/sgmllib.rst b/Doc/library/sgmllib.rst
new file mode 100644
index 0000000..c0ef1a2
--- /dev/null
+++ b/Doc/library/sgmllib.rst
@@ -0,0 +1,270 @@
+
+:mod:`sgmllib` --- Simple SGML parser
+=====================================
+
+.. module:: sgmllib
+   :synopsis: Only as much of an SGML parser as needed to parse HTML.
+
+
+.. index:: single: SGML
+
+This module defines a class :class:`SGMLParser` which serves as the basis for
+parsing text files formatted in SGML (Standard Generalized Mark-up Language).
+In fact, it does not provide a full SGML parser --- it only parses SGML insofar
+as it is used by HTML, and the module only exists as a base for the
+:mod:`htmllib` module. Another HTML parser which supports XHTML and offers a
+somewhat different interface is available in the :mod:`HTMLParser` module.
+
+
+.. class:: SGMLParser()
+
+   The :class:`SGMLParser` class is instantiated without arguments. The parser is
+   hardcoded to recognize the following constructs:
+
+   * Opening and closing tags of the form ``<tag attr="value" ...>`` and
+     ``</tag>``, respectively.
+
+   * Numeric character references of the form ``&#name;``.
+
+   * Entity references of the form ``&name;``.
+
+   * SGML comments of the form ``<!--text-->``. Note that spaces, tabs, and
+     newlines are allowed between the trailing ``>`` and the immediately preceding
+     ``--``.
+
+A single exception is defined as well:
+
+
+.. exception:: SGMLParseError
+
+   Exception raised by the :class:`SGMLParser` class when it encounters an error
+   while parsing.
+
+   .. versionadded:: 2.1
+
+:class:`SGMLParser` instances have the following methods:
+
+
+.. method:: SGMLParser.reset()
+
+   Reset the instance. Loses all unprocessed data. This is called implicitly at
+   instantiation time.
+
+
+.. method:: SGMLParser.setnomoretags()
+
+   Stop processing tags. Treat all following input as literal input (CDATA).
+   (This is only provided so the HTML tag ``<PLAINTEXT>`` can be implemented.)
+
+
+.. method:: SGMLParser.setliteral()
+
+   Enter literal mode (CDATA mode).
+
+
+.. method:: SGMLParser.feed(data)
+
+   Feed some text to the parser. It is processed insofar as it consists of
+   complete elements; incomplete data is buffered until more data is fed or
+   :meth:`close` is called.
+
+
+.. method:: SGMLParser.close()
+
+   Force processing of all buffered data as if it were followed by an end-of-file
+   mark. This method may be redefined by a derived class to define additional
+   processing at the end of the input, but the redefined version should always call
+   the :class:`SGMLParser` base class method :meth:`close`.
+
+
+.. method:: SGMLParser.get_starttag_text()
+
+   Return the text of the most recently opened start tag. This should not normally
+   be needed for structured processing, but may be useful in dealing with HTML "as
+   deployed" or for re-generating input with minimal changes (whitespace between
+   attributes can be preserved, etc.).
+
+
+.. method:: SGMLParser.handle_starttag(tag, method, attributes)
+
+   This method is called to handle start tags for which either a :meth:`start_tag`
+   or :meth:`do_tag` method has been defined. The *tag* argument is the name of
+   the tag converted to lower case, and the *method* argument is the bound method
+   which should be used to support semantic interpretation of the start tag. The
+   *attributes* argument is a list of ``(name, value)`` pairs containing the
+   attributes found inside the tag's ``<>`` brackets.
+
+   The *name* has been translated to lower case. Double quotes and backslashes in
+   the *value* have been interpreted, as well as known character references and
+   known entity references terminated by a semicolon (normally, entity references
+   can be terminated by any non-alphanumerical character, but this would break the
+   very common case of ``<A HREF="url?spam=1&eggs=2">`` when ``eggs`` is a valid
+   entity name).
+
+   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method would
+   be called as ``unknown_starttag('a', [('href', 'http://www.cwi.nl/')])``. The
+   base implementation simply calls *method* with *attributes* as the only
+   argument.
+
+   .. versionadded:: 2.5
+      Handling of entity and character references within attribute values.
+
+
+.. method:: SGMLParser.handle_endtag(tag, method)
+
+   This method is called to handle endtags for which an :meth:`end_tag` method has
+   been defined. The *tag* argument is the name of the tag converted to lower
+   case, and the *method* argument is the bound method which should be used to
+   support semantic interpretation of the end tag. If no :meth:`end_tag` method is
+   defined for the closing element, this handler is not called. The base
+   implementation simply calls *method*.
+
+
+.. method:: SGMLParser.handle_data(data)
+
+   This method is called to process arbitrary data. It is intended to be
+   overridden by a derived class; the base class implementation does nothing.
+
+
+.. method:: SGMLParser.handle_charref(ref)
+
+   This method is called to process a character reference of the form ``&#ref;``.
+   The base implementation uses :meth:`convert_charref` to convert the reference to
+   a string. If that method returns a string, it is passed to :meth:`handle_data`,
+   otherwise ``unknown_charref(ref)`` is called to handle the error.
+
+   .. versionchanged:: 2.5
+      Use :meth:`convert_charref` instead of hard-coding the conversion.
+
+
+.. method:: SGMLParser.convert_charref(ref)
+
+   Convert a character reference to a string, or ``None``. *ref* is the reference
+   passed in as a string. In the base implementation, *ref* must be a decimal
+   number in the range 0-255. It converts the code point found using the
+   :meth:`convert_codepoint` method. If *ref* is invalid or out of range, this
+   method returns ``None``. This method is called by the default
+   :meth:`handle_charref` implementation and by the attribute value parser.
+
+   .. versionadded:: 2.5
+
+
+.. method:: SGMLParser.convert_codepoint(codepoint)
+
+   Convert a codepoint to a :class:`str` value. Encodings can be handled here if
+   appropriate, though the rest of :mod:`sgmllib` is oblivious on this matter.
+
+   .. versionadded:: 2.5
+
+
+.. method:: SGMLParser.handle_entityref(ref)
+
+   This method is called to process a general entity reference of the form
+   ``&ref;`` where *ref* is a general entity reference. It converts *ref* by
+   passing it to :meth:`convert_entityref`. If a translation is returned, it calls
+   the method :meth:`handle_data` with the translation; otherwise, it calls the
+   method ``unknown_entityref(ref)``. The default :attr:`entitydefs` defines
+   translations for ``&amp;``, ``&apos``, ``&gt;``, ``&lt;``, and ``&quot;``.
+
+   .. versionchanged:: 2.5
+      Use :meth:`convert_entityref` instead of hard-coding the conversion.
+
+
+.. method:: SGMLParser.convert_entityref(ref)
+
+   Convert a named entity reference to a :class:`str` value, or ``None``. The
+   resulting value will not be parsed. *ref* will be only the name of the entity.
+   The default implementation looks for *ref* in the instance (or class) variable
+   :attr:`entitydefs` which should be a mapping from entity names to corresponding
+   translations. If no translation is available for *ref*, this method returns
+   ``None``. This method is called by the default :meth:`handle_entityref`
+   implementation and by the attribute value parser.
+
+   .. versionadded:: 2.5
+
+
+.. method:: SGMLParser.handle_comment(comment)
+
+   This method is called when a comment is encountered. The *comment* argument is
+   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
+   not the delimiters themselves. For example, the comment ``<!--text-->`` will
+   cause this method to be called with the argument ``'text'``. The default method
+   does nothing.
+
+
+.. method:: SGMLParser.handle_decl(data)
+
+   Method called when an SGML declaration is read by the parser. In practice, the
+   ``DOCTYPE`` declaration is the only thing observed in HTML, but the parser does
+   not discriminate among different (or broken) declarations. Internal subsets in
+   a ``DOCTYPE`` declaration are not supported. The *data* parameter will be the
+   entire contents of the declaration inside the ``<!``...\ ``>`` markup. The
+   default implementation does nothing.
+
+
+.. method:: SGMLParser.report_unbalanced(tag)
+
+   This method is called when an end tag is found which does not correspond to any
+   open element.
+
+
+.. method:: SGMLParser.unknown_starttag(tag, attributes)
+
+   This method is called to process an unknown start tag. It is intended to be
+   overridden by a derived class; the base class implementation does nothing.
+
+
+.. method:: SGMLParser.unknown_endtag(tag)
+
+   This method is called to process an unknown end tag. It is intended to be
+   overridden by a derived class; the base class implementation does nothing.
+
+
+.. method:: SGMLParser.unknown_charref(ref)
+
+   This method is called to process unresolvable numeric character references.
+   Refer to :meth:`handle_charref` to determine what is handled by default. It is
+   intended to be overridden by a derived class; the base class implementation does
+   nothing.
+
+
+.. method:: SGMLParser.unknown_entityref(ref)
+
+   This method is called to process an unknown entity reference. It is intended to
+   be overridden by a derived class; the base class implementation does nothing.
+
+Apart from overriding or extending the methods listed above, derived classes may
+also define methods of the following form to define processing of specific tags.
+Tag names in the input stream are case independent; the *tag* occurring in
+method names must be in lower case:
+
+
+.. method:: SGMLParser.start_tag(attributes)
+   :noindex:
+
+   This method is called to process an opening tag *tag*. It has preference over
+   :meth:`do_tag`. The *attributes* argument has the same meaning as described for
+   :meth:`handle_starttag` above.
+
+
+.. method:: SGMLParser.do_tag(attributes)
+   :noindex:
+
+   This method is called to process an opening tag *tag* for which no
+   :meth:`start_tag` method is defined. The *attributes* argument has the same
+   meaning as described for :meth:`handle_starttag` above.
+
+
+.. method:: SGMLParser.end_tag()
+   :noindex:
+
+   This method is called to process a closing tag *tag*.
+
+Note that the parser maintains a stack of open elements for which no end tag has
+been found yet. Only tags processed by :meth:`start_tag` are pushed on this
+stack. Definition of an :meth:`end_tag` method is optional for these tags. For
+tags processed by :meth:`do_tag` or by :meth:`unknown_starttag`, no :meth:`end_tag`
+method need be defined; if defined, it will not be used. If both
+:meth:`start_tag` and :meth:`do_tag` methods exist for a tag, the
+:meth:`start_tag` method takes precedence.
+
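Reviewer's illustration (not part of the patch): the closing paragraph of the added file describes how ``start_tag``, ``do_tag`` and ``end_tag`` methods interact with the stack of open elements. The sketch below models that lookup in plain Python so the rules can be tried without the module. It is not taken from the ``sgmllib`` source. The class ``TagDispatcher``, its ``open_tag`` and ``close_tag`` entry points, and the ``log`` list are hypothetical names introduced here, and the end-tag handling is deliberately simplified.

```python
class TagDispatcher:
    """Toy model of the start_<tag> / do_<tag> / end_<tag> lookup.

    Assumption: open_tag() and close_tag() stand in for the parser's
    internal tag handling; they are not part of sgmllib's API.
    """

    def __init__(self):
        self.stack = []  # open elements for which no end tag was seen yet
        self.log = []    # records which handler ran, for inspection

    def open_tag(self, tag, attributes):
        tag = tag.lower()  # tag names are case independent
        start = getattr(self, "start_" + tag, None)
        if start is not None:
            # start_<tag> takes precedence and pushes the element.
            self.stack.append(tag)
            start(attributes)
            return
        do = getattr(self, "do_" + tag, None)
        if do is not None:
            # do_<tag> handles the tag without touching the stack.
            do(attributes)
            return
        self.unknown_starttag(tag, attributes)

    def close_tag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            # Simplification: every unmatched end tag is reported here.
            self.report_unbalanced(tag)
            return
        # Simplification: also drop anything opened after `tag`.
        while self.stack.pop() != tag:
            pass
        end = getattr(self, "end_" + tag, None)
        if end is not None:  # defining end_<tag> is optional
            end()

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag, attributes))

    def report_unbalanced(self, tag):
        self.log.append(("report_unbalanced", tag))


class Demo(TagDispatcher):
    def start_b(self, attributes):
        self.log.append(("start_b", attributes))

    def end_b(self):
        self.log.append(("end_b",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def end_br(self):
        # Never reached: do_br does not push "br" on the stack.
        self.log.append(("end_br",))


demo = Demo()
demo.open_tag("B", [])
demo.open_tag("br", [])
demo.close_tag("br")
demo.close_tag("b")
demo.open_tag("i", [("class", "x")])
```

In this sketch the ``end_br`` method is never invoked, because ``do_br`` leaves the stack untouched, which mirrors the statement that an ``end_tag`` method defined for a ``do_tag`` element is not used. How the real parser classifies such an end tag may differ from the simplified rule shown here.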