


\declaremodule{standard}{email.Parser}
\modulesynopsis{Parse flat text email messages to produce a message
	        object structure.}

Message object structures can be created in one of two ways: they can be
created from whole cloth by instantiating \class{Message} objects and
stringing them together via \method{attach()} and
\method{set_payload()} calls, or they can be created by parsing a flat text
representation of the email message.

The \module{email} package provides a standard parser that understands
most email document structures, including MIME documents.  You can
pass the parser a string or a file object, and the parser will return
to you the root \class{Message} instance of the object structure.  For
simple, non-MIME messages the payload of this root object will likely
be a string containing the text of the message.  For
\mimetype{multipart} MIME messages, the root object will return
\code{True} from its \method{is_multipart()} method, and the subparts
can be accessed via the \method{get_payload()} and \method{walk()}
methods.
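
As a minimal sketch of the behavior described above, the following
parses one plain message and one \mimetype{multipart} message from
strings.  The addresses, subject line and boundary string are invented
for illustration, and the sketch assumes an interpreter in which
\function{email.message_from_string()} can be imported as shown.

```python
import email

# A simple, non-MIME message: the payload is the body text itself.
plain_text = "From: alice@example.com\nSubject: Hello\n\nHi Bob.\n"
plain_msg = email.message_from_string(plain_text)

# A two-part MIME message (hypothetical boundary "XXXX"): the payload
# is a list of sub-message objects.
mime_text = (
    "From: alice@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XXXX"\n'
    "\n"
    "--XXXX\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--XXXX\n"
    "Content-Type: text/plain\n"
    "\n"
    "second part\n"
    "--XXXX--\n"
)
mime_msg = email.message_from_string(mime_text)

# Two ways to reach the subparts: the payload list and walk().
subparts = mime_msg.get_payload()
walked_types = [part.get_content_type() for part in mime_msg.walk()]
```

Note that \method{walk()} visits the root object first and then each
subpart, so it yields one more object than the payload list holds.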

Note that the parser can be extended in limited ways, and of course
you can implement your own parser completely from scratch.  There is
no magical connection between the \module{email} package's bundled
parser and the \class{Message} class, so your custom parser can create
message object trees any way it finds necessary.

The primary parser class is \class{Parser}, which parses both the
headers and the payload of the message.  In the case of
\mimetype{multipart} messages, it will recursively parse the body of
the container message.  Two modes of parsing are supported:
\emph{strict} parsing, which will usually reject any message that does
not comply with the RFCs, and \emph{lax} parsing, which attempts to
adjust for common MIME formatting problems.

The \module{email.Parser} module also provides a second class,
\class{HeaderParser}, which can be used if you're only interested in
the headers of the message.  \class{HeaderParser} can be much faster in
these situations, since it does not attempt to parse the message body;
instead, it sets the payload to the raw body as a string.
\class{HeaderParser} has the same API as the \class{Parser} class.
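
The difference between the two classes can be sketched as follows.
This is only an illustration: it assumes a Python 3 interpreter, where
the module is spelled \module{email.parser}, and the message text is
invented.

```python
# Assumption: Python 3, where this module is named email.parser.
from email.parser import HeaderParser, Parser

text = (
    "Subject: Weekly report\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XXXX"\n'
    "\n"
    "--XXXX\n"
    "Content-Type: text/plain\n"
    "\n"
    "report body\n"
    "--XXXX--\n"
)

# Parser descends into the body and builds sub-message objects.
full = Parser().parsestr(text)

# HeaderParser stops after the headers; the body stays a raw string.
headers_only = HeaderParser().parsestr(text)
raw_body = headers_only.get_payload()
```

Both objects expose the same headers; only the payload differs.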

\subsubsection{Parser class API}

\begin{classdesc}{Parser}{\optional{_class\optional{, strict}}}
The constructor for the \class{Parser} class takes an optional
argument \var{_class}.  This must be a callable factory (such as a
function or a class), and it is used whenever a sub-message object
needs to be created.  It defaults to \class{Message} (see
\refmodule{email.Message}).  The factory will be called without
arguments.

The optional \var{strict} flag specifies whether strict or lax parsing
should be performed.  Normally, when things like MIME terminating
boundaries are missing, or when messages contain other formatting
problems, the \class{Parser} will raise a
\exception{MessageParseError}.  However, when lax parsing is enabled,
the \class{Parser} will attempt to work around such broken formatting
to produce a usable message structure (this doesn't mean
\exception{MessageParseError}s are never raised; some ill-formatted
messages just can't be parsed).  The \var{strict} flag defaults to
\code{False} since lax parsing usually provides the most convenient
behavior.

\versionchanged[The \var{strict} flag was added]{2.2.2}
\end{classdesc}
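
A short sketch of the \var{_class} factory follows.  The
\class{TaggedMessage} subclass and its \method{summary()} method are
hypothetical names introduced for this example only.  The sketch
assumes a Python 3 interpreter, where the module is spelled
\module{email.parser} and the \var{strict} flag is not available, so
only \var{_class} is shown.

```python
# Assumption: Python 3, where these modules are named email.message
# and email.parser.
from email.message import Message
from email.parser import Parser


class TaggedMessage(Message):
    """Hypothetical Message subclass, used only for this illustration."""

    def summary(self):
        return f"{self['Subject']} ({self.get_content_type()})"


text = (
    "Subject: Report\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XXXX"\n'
    "\n"
    "--XXXX\n"
    "Content-Type: text/plain\n"
    "\n"
    "attached text\n"
    "--XXXX--\n"
)

# The factory is used for the root object and for every sub-message.
parser = Parser(TaggedMessage)
msg = parser.parsestr(text)
all_tagged = all(isinstance(part, TaggedMessage) for part in msg.walk())
```

Subclassing \class{Message} is a convenient way to write such a
factory, because the subclass inherits a constructor the parser can
already call.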

The other public \class{Parser} methods are:

\begin{methoddesc}[Parser]{parse}{fp\optional{, headersonly}}
Read all the data from the file-like object \var{fp}, parse the
resulting text, and return the root message object.  \var{fp} must
support both the \method{readline()} and the \method{read()} methods
on file-like objects.

The text contained in \var{fp} must be formatted as a block of \rfc{2822}
style headers and header continuation lines, optionally preceded by an
envelope header.  The header block is terminated either by the
end of the data or by a blank line.  Following the header block is the
body of the message (which may contain MIME-encoded subparts).

Optional \var{headersonly} is a flag specifying whether to stop
parsing after reading the headers; it is interpreted as with the
\method{parsestr()} method.

\versionchanged[The \var{headersonly} flag was added]{2.2.2}
\end{methoddesc}
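
The input format described for \method{parse()} can be illustrated
with a small sketch.  It assumes a Python 3 interpreter, where the
module is spelled \module{email.parser}, and it uses
\class{io.StringIO} as a stand-in for an open file.  The message text,
including the envelope header line, is invented.

```python
# Assumption: Python 3, where this module is named email.parser.
import io
from email.parser import Parser

raw = (
    # Optional envelope header ("From " followed by sender and date).
    "From alice@example.com Sat Jan  1 00:00:00 2000\n"
    "From: alice@example.com\n"
    # A header with one continuation line (leading whitespace).
    "Subject: a long subject\n"
    " continued on a second line\n"
    # The blank line ends the header block; the body follows.
    "\n"
    "Body text.\n"
)

fp = io.StringIO(raw)  # any object with read() would do here
msg = Parser().parse(fp)

envelope = msg.get_unixfrom()
subject = msg["Subject"]
body = msg.get_payload()
```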

\begin{methoddesc}[Parser]{parsestr}{text\optional{, headersonly}}
Similar to the \method{parse()} method, except it takes a string
object instead of a file-like object.  Calling this method on a string
is exactly equivalent to wrapping \var{text} in a \class{StringIO}
instance first and calling \method{parse()}.

Optional \var{headersonly} is a flag specifying whether to stop
parsing after reading the headers or not.  The default is \code{False},
meaning it parses the entire contents of the string.

\versionchanged[The \var{headersonly} flag was added]{2.2.2}
\end{methoddesc}
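
The following sketch compares \method{parsestr()} with the equivalent
\method{parse()} call on a wrapped string, and shows the effect of
\var{headersonly}.  It assumes a Python 3 interpreter, where the
module is spelled \module{email.parser} and \class{StringIO} lives in
the \module{io} module.  The message text is invented.

```python
# Assumption: Python 3, where this module is named email.parser and
# StringIO is provided by the io module.
import io
from email.parser import Parser

text = (
    "Subject: Greetings\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XXXX"\n'
    "\n"
    "--XXXX\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello, world.\n"
    "--XXXX--\n"
)

parser = Parser()

# parsestr(text) and parse(StringIO(text)) build the same structure.
from_string = parser.parsestr(text)
from_file = parser.parse(io.StringIO(text))

# With headersonly set, parsing stops after the header block.
headers_only = parser.parsestr(text, headersonly=True)
```

With \var{headersonly} set, the body is not split into subparts, so
the payload remains a single string.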

Since creating a message object structure from a string or a file
object is such a common task, two functions are provided as a
convenience.  They are available in the top-level \module{email}
package namespace.

\begin{funcdesc}{message_from_string}{s\optional{, _class\optional{, strict}}}
Return a message object structure from a string.  This is exactly
equivalent to \code{Parser().parsestr(s)}.  Optional \var{_class} and
\var{strict} are interpreted as with the \class{Parser} class constructor.

\versionchanged[The \var{strict} flag was added]{2.2.2}
\end{funcdesc}

\begin{funcdesc}{message_from_file}{fp\optional{, _class\optional{, strict}}}
Return a message object structure tree from an open file object.  This
is exactly equivalent to \code{Parser().parse(fp)}.  Optional
\var{_class} and \var{strict} are interpreted as with the
\class{Parser} class constructor.

\versionchanged[The \var{strict} flag was added]{2.2.2}
\end{funcdesc}
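
As a sketch of the file-based convenience function, the code below
writes an invented message to a temporary file and reads it back twice,
once through \function{message_from_file()} and once through a
\class{Parser} instance.  It assumes a Python 3 interpreter, where the
module is spelled \module{email.parser}.  The temporary file exists
only to give the example something to open.

```python
# Assumption: Python 3, where the parser module is named email.parser.
import email
import os
import tempfile
from email.parser import Parser

text = "From: alice@example.com\nSubject: Saved message\n\nStored on disk.\n"

# Write the message to a temporary file so there is a file to open.
with tempfile.NamedTemporaryFile("w", suffix=".eml", delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

try:
    with open(path) as fp:
        from_function = email.message_from_file(fp)
    with open(path) as fp:
        from_parser = Parser().parse(fp)
finally:
    os.remove(path)  # clean up the temporary file
```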

Here's an example of how you might use
\function{message_from_string()} at an interactive Python prompt:

\begin{verbatim}
>>> import email
>>> myString = 'Subject: Hello\n\nHi there.\n'
>>> msg = email.message_from_string(myString)
\end{verbatim}

\subsubsection{Additional notes}

Here are some notes on the parsing semantics:

\begin{itemize}
\item Most non-\mimetype{multipart} type messages are parsed as a single
      message object with a string payload.  These objects will return
      \code{False} for \method{is_multipart()}.  Their
      \method{get_payload()} method will return a string object.

\item All \mimetype{multipart} type messages will be parsed as a
      container message object with a list of sub-message objects for
      their payload.  The outer container message will return
      \code{True} for \method{is_multipart()} and its
      \method{get_payload()} method will return the list of
      \class{Message} subparts.

\item Most messages with a content type of \mimetype{message/*}
      (e.g. \mimetype{message/delivery-status} and
      \mimetype{message/rfc822}) will also be parsed as a container
      object containing a list payload of length 1.  Their
      \method{is_multipart()} method will return \code{True}.  The
      single element in the list payload will be a sub-message object.
\end{itemize}
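
The three cases listed above can be checked with a short sketch.  The
message texts and the boundary string are invented, and only
\mimetype{message/rfc822} is shown for the \mimetype{message/*} case.
The sketch assumes an interpreter in which
\function{email.message_from_string()} can be imported as shown.

```python
import email

# Case 1: a non-multipart message has a string payload.
text_msg = email.message_from_string(
    "Content-Type: text/plain\n"
    "\n"
    "just text\n"
)

# Case 2: a multipart message has a list of sub-messages as payload.
multipart_msg = email.message_from_string(
    'Content-Type: multipart/mixed; boundary="XXXX"\n'
    "\n"
    "--XXXX\n"
    "Content-Type: text/plain\n"
    "\n"
    "only part\n"
    "--XXXX--\n"
)

# Case 3: a message/rfc822 wrapper has a list payload of length 1
# whose single element is the enclosed message.
wrapper_msg = email.message_from_string(
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: inner message\n"
    "\n"
    "inner body\n"
)
inner_msg = wrapper_msg.get_payload()[0]
```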