diff options
Diffstat (limited to 'Doc/lib/libhtmllib.tex')
-rw-r--r-- | Doc/lib/libhtmllib.tex | 181 |
1 files changed, 0 insertions, 181 deletions
diff --git a/Doc/lib/libhtmllib.tex b/Doc/lib/libhtmllib.tex deleted file mode 100644 index e51dfcb..0000000 --- a/Doc/lib/libhtmllib.tex +++ /dev/null @@ -1,181 +0,0 @@ -\section{\module{htmllib} --- - A parser for HTML documents} - -\declaremodule{standard}{htmllib} -\modulesynopsis{A parser for HTML documents.} - -\index{HTML} -\index{hypertext} - - -This module defines a class which can serve as a base for parsing text -files formatted in the HyperText Mark-up Language (HTML). The class -is not directly concerned with I/O --- it must be provided with input -in string form via a method, and makes calls to methods of a -``formatter'' object in order to produce output. The -\class{HTMLParser} class is designed to be used as a base class for -other classes in order to add functionality, and allows most of its -methods to be extended or overridden. In turn, this class is derived -from and extends the \class{SGMLParser} class defined in module -\refmodule{sgmllib}\refstmodindex{sgmllib}. The \class{HTMLParser} -implementation supports the HTML 2.0 language as described in -\rfc{1866}. Two implementations of formatter objects are provided in -the \refmodule{formatter}\refstmodindex{formatter}\ module; refer to the -documentation for that module for information on the formatter -interface. -\withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}} - -The following is a summary of the interface defined by -\class{sgmllib.SGMLParser}: - -\begin{itemize} - -\item -The interface to feed data to an instance is through the \method{feed()} -method, which takes a string argument. This can be called with as -little or as much text at a time as desired; \samp{p.feed(a); -p.feed(b)} has the same effect as \samp{p.feed(a+b)}. When the data -contains complete HTML markup constructs, these are processed immediately; -incomplete constructs are saved in a buffer. To force processing of all -unprocessed data, call the \method{close()} method. - -For example, to parse the entire contents of a file, use: -\begin{verbatim} -parser.feed(open('myfile.html').read()) -parser.close() -\end{verbatim} - -\item -The interface to define semantics for HTML tags is very simple: derive -a class and define methods called \method{start_\var{tag}()}, -\method{end_\var{tag}()}, or \method{do_\var{tag}()}. The parser will -call these at appropriate moments: \method{start_\var{tag}} or -\method{do_\var{tag}()} is called when an opening tag of the form -\code{<\var{tag} ...>} is encountered; \method{end_\var{tag}()} is called -when a closing tag of the form \code{<\var{tag}>} is encountered. If -an opening tag requires a corresponding closing tag, like \code{<H1>} -... \code{</H1>}, the class should define the \method{start_\var{tag}()} -method; if a tag requires no closing tag, like \code{<P>}, the class -should define the \method{do_\var{tag}()} method. - -\end{itemize} - -The module defines a parser class and an exception: - -\begin{classdesc}{HTMLParser}{formatter} -This is the basic HTML parser class. It supports all entity names -required by the XHTML 1.0 Recommendation (\url{http://www.w3.org/TR/xhtml1}). -It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements. -\end{classdesc} - -\begin{excdesc}{HTMLParseError} -Exception raised by the \class{HTMLParser} class when it encounters an -error while parsing. -\versionadded{2.4} -\end{excdesc} - - -\begin{seealso} - \seemodule{formatter}{Interface definition for transforming an - abstract flow of formatting events into - specific output events on writer objects.} - \seemodule{HTMLParser}{Alternate HTML parser that offers a slightly - lower-level view of the input, but is - designed to work with XHTML, and does not - implement some of the SGML syntax not used in - ``HTML as deployed'' and which isn't legal - for XHTML.} - \seemodule{htmlentitydefs}{Definition of replacement text for XHTML 1.0 - entities.} - \seemodule{sgmllib}{Base class for \class{HTMLParser}.} -\end{seealso} - - -\subsection{HTMLParser Objects \label{html-parser-objects}} - -In addition to tag methods, the \class{HTMLParser} class provides some -additional methods and instance variables for use within tag methods. - -\begin{memberdesc}[HTMLParser]{formatter} -This is the formatter instance associated with the parser. -\end{memberdesc} - -\begin{memberdesc}[HTMLParser]{nofill} -Boolean flag which should be true when whitespace should not be -collapsed, or false when it should be. In general, this should only -be true when character data is to be treated as ``preformatted'' text, -as within a \code{<PRE>} element. The default value is false. This -affects the operation of \method{handle_data()} and \method{save_end()}. -\end{memberdesc} - - -\begin{methoddesc}[HTMLParser]{anchor_bgn}{href, name, type} -This method is called at the start of an anchor region. The arguments -correspond to the attributes of the \code{<A>} tag with the same -names. The default implementation maintains a list of hyperlinks -(defined by the \code{HREF} attribute for \code{<A>} tags) within the -document. The list of hyperlinks is available as the data attribute -\member{anchorlist}. -\end{methoddesc} - -\begin{methoddesc}[HTMLParser]{anchor_end}{} -This method is called at the end of an anchor region. The default -implementation adds a textual footnote marker using an index into the -list of hyperlinks created by \method{anchor_bgn()}. -\end{methoddesc} - -\begin{methoddesc}[HTMLParser]{handle_image}{source, alt\optional{, ismap\optional{, - align\optional{, width\optional{, height}}}}} -This method is called to handle images. The default implementation -simply passes the \var{alt} value to the \method{handle_data()} -method. -\end{methoddesc} - -\begin{methoddesc}[HTMLParser]{save_bgn}{} -Begins saving character data in a buffer instead of sending it to the -formatter object. Retrieve the stored data via \method{save_end()}. -Use of the \method{save_bgn()} / \method{save_end()} pair may not be -nested. -\end{methoddesc} - -\begin{methoddesc}[HTMLParser]{save_end}{} -Ends buffering character data and returns all data saved since the -preceding call to \method{save_bgn()}. If the \member{nofill} flag is -false, whitespace is collapsed to single spaces. A call to this -method without a preceding call to \method{save_bgn()} will raise a -\exception{TypeError} exception. -\end{methoddesc} - - - -\section{\module{htmlentitydefs} --- - Definitions of HTML general entities} - -\declaremodule{standard}{htmlentitydefs} -\modulesynopsis{Definitions of HTML general entities.} -\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org} - -This module defines three dictionaries, \code{name2codepoint}, -\code{codepoint2name}, and \code{entitydefs}. \code{entitydefs} is -used by the \refmodule{htmllib} module to provide the -\member{entitydefs} member of the \class{HTMLParser} class. The -definition provided here contains all the entities defined by XHTML 1.0 -that can be handled using simple textual substitution in the Latin-1 -character set (ISO-8859-1). - - -\begin{datadesc}{entitydefs} - A dictionary mapping XHTML 1.0 entity definitions to their - replacement text in ISO Latin-1. - -\end{datadesc} - -\begin{datadesc}{name2codepoint} - A dictionary that maps HTML entity names to the Unicode codepoints. - \versionadded{2.3} -\end{datadesc} - -\begin{datadesc}{codepoint2name} - A dictionary that maps Unicode codepoints to HTML entity names. - \versionadded{2.3} -\end{datadesc} |