diff options
Diffstat (limited to 'Doc/lib/emailparser.tex')
-rw-r--r-- | Doc/lib/emailparser.tex | 111 |
1 files changed, 74 insertions, 37 deletions
diff --git a/Doc/lib/emailparser.tex b/Doc/lib/emailparser.tex index 40ce853..b5d9900 100644 --- a/Doc/lib/emailparser.tex +++ b/Doc/lib/emailparser.tex @@ -1,20 +1,20 @@ \declaremodule{standard}{email.Parser} \modulesynopsis{Parse flat text email messages to produce a message - object tree.} + object structure.} -Message object trees can be created in one of two ways: they can be +Message object structures can be created in one of two ways: they can be created from whole cloth by instantiating \class{Message} objects and -stringing them together via \method{add_payload()} and +stringing them together via \method{attach()} and \method{set_payload()} calls, or they can be created by parsing a flat text representation of the email message. The \module{email} package provides a standard parser that understands most email document structures, including MIME documents. You can pass the parser a string or a file object, and the parser will return -to you the root \class{Message} instance of the object tree. For +to you the root \class{Message} instance of the object structure. For simple, non-MIME messages the payload of this root object will likely be a string containing the text of the message. For MIME -messages, the root object will return true from its +messages, the root object will return \code{True} from its \method{is_multipart()} method, and the subparts can be accessed via the \method{get_payload()} and \method{walk()} methods. @@ -27,28 +27,46 @@ message object trees any way it finds necessary. The primary parser class is \class{Parser} which parses both the headers and the payload of the message. In the case of \mimetype{multipart} messages, it will recursively parse the body of -the container message. The \module{email.Parser} module also provides -a second class, called \class{HeaderParser} which can be used if -you're only interested in the headers of the message. -\class{HeaderParser} can be much faster in this situations, since it -does not attempt to parse the message body, instead setting the -payload to the raw body as a string. \class{HeaderParser} has the -same API as the \class{Parser} class. +the container message. Two modes of parsing are supported, +\emph{strict} parsing, which will usually reject any non-RFC compliant +message, and \emph{lax} parsing, which attempts to adjust for common +MIME formatting problems. + +The \module{email.Parser} module also provides a second class, called +\class{HeaderParser} which can be used if you're only interested in +the headers of the message. \class{HeaderParser} can be much faster in +these situations, since it does not attempt to parse the message body, +instead setting the payload to the raw body as a string. +\class{HeaderParser} has the same API as the \class{Parser} class. \subsubsection{Parser class API} -\begin{classdesc}{Parser}{\optional{_class}} -The constructor for the \class{Parser} class takes a single optional +\begin{classdesc}{Parser}{\optional{_class\optional{, strict}}} +The constructor for the \class{Parser} class takes an optional argument \var{_class}. This must be a callable factory (such as a function or a class), and it is used whenever a sub-message object needs to be created. It defaults to \class{Message} (see \refmodule{email.Message}). The factory will be called without arguments. + +The optional \var{strict} flag specifies whether strict or lax parsing +should be performed. Normally, when things like MIME terminating +boundaries are missing, or when messages contain other formatting +problems, the \class{Parser} will raise a +\exception{MessageParseError}. However, when lax parsing is enabled, +the \class{Parser} will attempt to workaround such broken formatting +to produce a usable message structure (this doesn't mean +\exception{MessageParseError}s are never raised; some ill-formatted +messages just can't be parsed). The \var{strict} flag defaults to +\code{False} since lax parsing usually provides the most convenient +behavior. + +\versionchanged[The \var{strict} flag was added]{2.2.2} \end{classdesc} The other public \class{Parser} methods are: -\begin{methoddesc}[Parser]{parse}{fp} +\begin{methoddesc}[Parser]{parse}{fp\optional{, headersonly}} Read all the data from the file-like object \var{fp}, parse the resulting text, and return the root message object. \var{fp} must support both the \method{readline()} and the \method{read()} methods @@ -56,32 +74,49 @@ on file-like objects. The text contained in \var{fp} must be formatted as a block of \rfc{2822} style headers and header continuation lines, optionally preceeded by a -\emph{Unix-From} header. The header block is terminated either by the +envelope header. The header block is terminated either by the end of the data or by a blank line. Following the header block is the body of the message (which may contain MIME-encoded subparts). + +Optional \var{headersonly} is a flag specifying whether to stop +parsing after reading the headers or not. The default is \code{False}, +meaning it parses the entire contents of the file. + +\versionchanged[The \var{headersonly} flag was added]{2.2.2} \end{methoddesc} -\begin{methoddesc}[Parser]{parsestr}{text} +\begin{methoddesc}[Parser]{parsestr}{text\optional{, headersonly}} Similar to the \method{parse()} method, except it takes a string object instead of a file-like object. Calling this method on a string is exactly equivalent to wrapping \var{text} in a \class{StringIO} instance first and calling \method{parse()}. + +Optional \var{headersonly} is a flag specifying whether to stop +parsing after reading the headers or not. The default is \code{False}, +meaning it parses the entire contents of the file. + +\versionchanged[The \var{headersonly} flag was added]{2.2.2} \end{methoddesc} -Since creating a message object tree from a string or a file object is -such a common task, two functions are provided as a convenience. They -are available in the top-level \module{email} package namespace. +Since creating a message object structure from a string or a file +object is such a common task, two functions are provided as a +convenience. They are available in the top-level \module{email} +package namespace. -\begin{funcdesc}{message_from_string}{s\optional{, _class}} +\begin{funcdesc}{message_from_string}{s\optional{, _class\optional{, strict}}} Return a message object tree from a string. This is exactly -equivalent to \code{Parser().parsestr(s)}. Optional \var{_class} is -interpreted as with the \class{Parser} class constructor. +equivalent to \code{Parser().parsestr(s)}. Optional \var{_class} and +\var{strict} are interpreted as with the \class{Parser} class constructor. + +\versionchanged[The \var{strict} flag was added]{2.2.2} \end{funcdesc} -\begin{funcdesc}{message_from_file}{fp\optional{, _class}} +\begin{funcdesc}{message_from_file}{fp\optional{, _class\optional{, strict}}} Return a message object tree from an open file object. This is exactly -equivalent to \code{Parser().parse(fp)}. Optional \var{_class} is -interpreted as with the \class{Parser} class constructor. +equivalent to \code{Parser().parse(fp)}. Optional \var{_class} and +\var{strict} are interpreted as with the \class{Parser} class constructor. + +\versionchanged[The \var{strict} flag was added]{2.2.2} \end{funcdesc} Here's an example of how you might use this at an interactive Python @@ -99,15 +134,17 @@ Here are some notes on the parsing semantics: \begin{itemize} \item Most non-\mimetype{multipart} type messages are parsed as a single message object with a string payload. These objects will return - 0 for \method{is_multipart()}. -\item One exception is for \mimetype{message/delivery-status} type - messages. Because the body of such messages consist of - blocks of headers, \class{Parser} will create a non-multipart - object containing non-multipart subobjects for each header - block. -\item Another exception is for \mimetype{message/*} types (more - general than \mimetype{message/delivery-status}). These are - typically \mimetype{message/rfc822} messages, represented as a - non-multipart object containing a singleton payload which is - another non-multipart \class{Message} instance. + \code{False} for \method{is_multipart()}. Their + \method{get_payload()} method will return a string object. +\item All \mimetype{multipart} type messages will be parsed as a + container message object with a list of sub-message objects for + their payload. These messages will return \code{True} for + \method{is_multipart()} and their \method{get_payload()} method + will return a list of \class{Message} instances. +\item Most messages with a content type of \mimetype{message/*} + (e.g. \mimetype{message/deliver-status} and + \mimetype{message/rfc822}) will also be parsed as container + object containing a list payload of length 1. Their + \method{is_multipart()} method will return \code{True}. The + single element in the list payload will be a sub-message object. \end{itemize} |