\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}


\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}


This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry, which manages the codec lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder}, \var{stream_reader},
\var{stream_writer})}, whose elements must meet the following requirements:

  \var{encoder} and \var{decoder}: These must be functions or methods
  which have the same interface as the \method{encode()} and
  \method{decode()} methods of \class{Codec} instances (see Codec
  Interface). The functions/methods are expected to work in a
  stateless mode.

  \var{stream_reader} and \var{stream_writer}: These have to be
  factory functions providing the following interface:

	\code{factory(\var{stream}, \var{errors}='strict')}

  The factory functions must return objects providing the interfaces
  defined by the base classes \class{StreamWriter} and
  \class{StreamReader}, respectively. Stream codecs can maintain
  state.

  Possible values for \var{errors} are \code{'strict'} (raise an
  exception in case of an encoding error), \code{'replace'} (replace
  malformed data with a suitable replacement marker, such as
  \character{?}) and \code{'ignore'} (ignore malformed data and
  continue without further notice).

In case a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
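
The following is a minimal sketch of a search function, not a
definitive implementation. The encoding name \code{examplelatin} and
the function name \code{search_function} are invented for this
illustration; the function simply hands back the tuple of the existing
\code{latin-1} codec and returns \code{None} for every other name.

\begin{verbatim}
import codecs

def search_function(encoding):
    # 'examplelatin' is a hypothetical alias invented for this sketch.
    # The registry passes the name in lower case.
    if encoding == 'examplelatin':
        # Reuse the function tuple of an existing codec.
        return codecs.lookup('latin-1')
    # Signal that this search function does not know the encoding.
    return None

codecs.register(search_function)

# The lookup name is lower-cased before the search functions see it.
encoder, decoder, stream_reader, stream_writer = \
    codecs.lookup('ExampleLatin')

# The stateless encoder returns the encoded data and the number of
# input items it consumed.
data, consumed = encoder(u'abc')
\end{verbatim}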

\begin{funcdesc}{lookup}{encoding}
Look up a codec tuple in the Python codec registry and return the
function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no codecs tuple
is found, a \exception{LookupError} is raised. Otherwise, the codecs
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
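
As a brief sketch of the lookup behaviour described above, the example
below unpacks the function tuple for \code{utf-8} and then asks for an
encoding that is assumed not to exist. The name
\code{no-such-encoding-example} is made up for this purpose.

\begin{verbatim}
import codecs

# Unpack the four entries of the codec tuple.
encoder, decoder, stream_reader, stream_writer = codecs.lookup('utf-8')

# Round-trip a short Unicode string through the stateless functions.
data, length = encoder(u'caf\xe9')
text, consumed = decoder(data)

# An unknown encoding is reported with LookupError.
try:
    codecs.lookup('no-such-encoding-example')
    found = True
except LookupError:
    found = False
\end{verbatim}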

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\strong{Note:} The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs.  Output is also codec-dependent and will usually be Unicode as
well.

\var{encoding} specifies the encoding which is to be used for the
file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'} which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function.  It defaults to line buffered.
\end{funcdesc}
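
A minimal usage sketch, assuming a writable temporary directory: the
file name \code{example.txt} is arbitrary, and the built-in
\function{open()} call at the end is only there to show the encoded
bytes that were actually stored.

\begin{verbatim}
import codecs
import os
import tempfile

# Hypothetical file location chosen for this example.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Unicode written to the wrapper is encoded as UTF-8 on disk.
f = codecs.open(path, 'w', encoding='utf-8')
f.write(u'gr\xfc\xdf dich\n')
f.close()

# Reading through the wrapper decodes the data back to Unicode.
f = codecs.open(path, 'r', encoding='utf-8')
text = f.read()
f.close()

# Reading the same file in binary mode shows the encoded form.
raw_file = open(path, 'rb')
raw = raw_file.read()
raw_file.close()
\end{verbatim}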

\begin{funcdesc}{EncodedFile}{file, input\optional{,
                              output\optional{, errors}}}
Return a wrapped version of \var{file} which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
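
The translation can be sketched as follows, assuming an in-memory
\code{io.BytesIO} object stands in for a real file. Data written to the
wrapper is interpreted as UTF-8 (the \var{input} encoding) and stored
in the underlying object as Latin-1 (the \var{output} encoding).

\begin{verbatim}
import codecs
import io

# In-memory stand-in for a real file object.
raw = io.BytesIO()

# input encoding: utf-8, output encoding: latin-1
wrapped = codecs.EncodedFile(raw, 'utf-8', 'latin-1')

# The two-byte UTF-8 sequence for e-acute is recoded to one byte.
wrapped.write(b'caf\xc3\xa9')
result = raw.getvalue()
\end{verbatim}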



...XXX document codec base classes...



The module also provides the following constants, which are useful
for reading and writing platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM32_BE}
\dataline{BOM32_LE}
\dataline{BOM64_BE}
\dataline{BOM64_LE}
These constants define the byte order marks (BOM) used in data
streams to indicate the byte order used in the stream or file.
\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
depending on the platform's native byte order, while the others
represent big endian (\samp{_BE} suffix) and little endian
(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
\end{datadesc}
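
One possible use of these constants is to inspect the first bytes of a
data stream. The helper \code{byte_order()} below is hypothetical and
only checks the \constant{BOM_BE} and \constant{BOM_LE} marks; it is a
sketch, not a complete byte order detector.

\begin{verbatim}
import codecs
import sys

def byte_order(data):
    # Hypothetical helper: report the byte order announced by a
    # leading BOM, or None if the data does not start with one.
    if data.startswith(codecs.BOM_BE):
        return 'big'
    if data.startswith(codecs.BOM_LE):
        return 'little'
    return None

# codecs.BOM is the mark for the platform's native byte order.
native = byte_order(codecs.BOM + b'payload')
unmarked = byte_order(b'payload')
\end{verbatim}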