\section{\module{pickle} --- Python object serialization}

\declaremodule{standard}{pickle}
\modulesynopsis{Convert Python objects to streams of bytes and back.}
% Substantial improvements by Jim Kerr <jbkerr@sr.hp.com>.
% Rewritten by Barry Warsaw <barry@zope.com>

\index{persistence}
\indexii{persistent}{objects}
\indexii{serializing}{objects}
\indexii{marshalling}{objects}
\indexii{flattening}{objects}
\indexii{pickling}{objects}

The \module{pickle} module implements a fundamental, but powerful
algorithm for serializing and de-serializing a Python object
structure.  ``Pickling'' is the process whereby a Python object
hierarchy is converted into a byte stream, and ``unpickling'' is the
inverse operation, whereby a byte stream is converted back into an
object hierarchy.  Pickling (and unpickling) is alternatively known as
``serialization'', ``marshalling,''\footnote{Don't confuse this with
the \refmodule{marshal} module.} or ``flattening'';
however, to avoid confusion, the terms used here are ``pickling'' and
``unpickling''.

This documentation describes both the \module{pickle} module and the 
\refmodule{cPickle} module.

\subsection{Relationship to other Python modules}

The \module{pickle} module has an optimized cousin called the
\module{cPickle} module.  As its name implies, \module{cPickle} is
written in C, so it can be up to 1000 times faster than
\module{pickle}.  However, it does not support subclassing of the
\function{Pickler()} and \function{Unpickler()} classes, because in
\module{cPickle} these are functions, not classes.  Most applications
have no need for this functionality, and can benefit from the improved
performance of \module{cPickle}.  Other than that, the interfaces of
the two modules are nearly identical; the common interface is
described in this manual and differences are pointed out where
necessary.  In the following discussions, we use the term ``pickle''
to collectively describe the \module{pickle} and
\module{cPickle} modules.

The data streams the two modules produce are guaranteed to be
interchangeable.

Python has a more primitive serialization module called
\refmodule{marshal}, but in general
\module{pickle} should always be the preferred way to serialize Python
objects.  \module{marshal} exists primarily to support Python's
\file{.pyc} files.

The \module{pickle} module differs from \refmodule{marshal} in several
significant ways:

\begin{itemize}

\item The \module{pickle} module keeps track of the objects it has
      already serialized, so that later references to the same object
      won't be serialized again.  \module{marshal} doesn't do this.

      This has implications both for recursive objects and object
      sharing.  Recursive objects are objects that contain references
      to themselves.  These are not handled by \module{marshal}, and in
      fact, attempting to marshal recursive objects will crash your Python
      interpreter.  Object sharing happens when there are multiple
      references to the same object in different places in the object
      hierarchy being serialized.  \module{pickle} stores such objects
      only once, and ensures that all other references point to the
      master copy.  Shared objects remain shared, which can be very
      important for mutable objects.

\item \module{marshal} cannot be used to serialize user-defined
      classes and their instances.  \module{pickle} can save and
      restore class instances transparently; however, the class
      definition must be importable and live in the same module as
      when the object was stored.

\item The \module{marshal} serialization format is not guaranteed to
      be portable across Python versions.  Because its primary job in
      life is to support \file{.pyc} files, the Python implementers
      reserve the right to change the serialization format in
      non-backwards compatible ways should the need arise.  The
      \module{pickle} serialization format is guaranteed to be
      backwards compatible across Python releases.

\item The \module{pickle} module doesn't handle code objects, which
      the \module{marshal} module does.  This avoids the possibility
      of smuggling Trojan horses into a program through the
      \module{pickle} module\footnote{This doesn't necessarily imply
      that \module{pickle} is inherently secure.  See
      section~\ref{pickle-sec} for a more detailed discussion on
      \module{pickle} module security.  Besides, it's possible that
      \module{pickle} will eventually support serializing code
      objects.}.

\end{itemize}
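
The first point in the list above can be checked with a short
experiment.  What follows is only a minimal sketch; the names
\code{shared}, \code{container} and \code{loop} are illustrative and
are not part of the module.

\begin{verbatim}
import pickle

# Two references to one mutable list inside the same hierarchy.
shared = [1, 2, 3]
container = {'first': shared, 'second': shared}

restored = pickle.loads(pickle.dumps(container))

# The list was stored once, so both keys still name a single object.
assert restored['first'] is restored['second']
restored['first'].append(4)
assert restored['second'] == [1, 2, 3, 4]

# A recursive object: a list that contains itself.
loop = []
loop.append(loop)
clone = pickle.loads(pickle.dumps(loop))
assert clone[0] is clone
\end{verbatim}

The same experiment should not be attempted with \module{marshal},
for the reasons given in the list.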

Note that serialization is a more primitive notion than persistence;
although
\module{pickle} reads and writes file objects, it does not handle the
issue of naming persistent objects, nor the (even more complicated)
issue of concurrent access to persistent objects.  The \module{pickle}
module can transform a complex object into a byte stream and it can
transform the byte stream into an object with the same internal
structure.  Perhaps the most obvious thing to do with these byte
streams is to write them onto a file, but it is also conceivable to
send them across a network or store them in a database.  The module
\refmodule{shelve} provides a simple interface
to pickle and unpickle objects on DBM-style database files.

\subsection{Data stream format}

The data format used by \module{pickle} is Python-specific.  This has
the advantage that there are no restrictions imposed by external
standards such as XDR\index{XDR}\index{External Data Representation}
(which can't represent pointer sharing); however it means that
non-Python programs may not be able to reconstruct pickled Python
objects.

By default, the \module{pickle} data format uses a printable \ASCII{}
representation.  This is slightly more voluminous than a binary
representation.  The big advantage of using printable \ASCII{} (and of
some other characteristics of \module{pickle}'s representation) is that
for debugging or recovery purposes it is possible for a human to read
the pickled file with a standard text editor.

There are currently 3 different protocols which can be used for pickling.

\begin{itemize}

\item Protocol version 0 is the original ASCII protocol and is backwards
compatible with earlier versions of Python.

\item Protocol version 1 is the old binary format which is also compatible
with earlier versions of Python.

\item Protocol version 2 was introduced in Python 2.3.  It provides
much more efficient pickling of new-style classes.

\end{itemize}

Refer to PEP 307 for more information.

If a \var{protocol} is not specified, protocol 0 is used.
If \var{protocol} is specified as a negative value
or \constant{HIGHEST_PROTOCOL},
the highest protocol version available will be used.

\versionchanged[The \var{bin} parameter is deprecated and only provided
for backwards compatibility.  You should use the \var{protocol}
parameter instead]{2.3}

A binary format, which is slightly more efficient, can be chosen by
specifying a true value for the \var{bin} argument to the
\class{Pickler} constructor or the \function{dump()} and \function{dumps()}
functions.  A \var{protocol} version of 1 or higher implies use of a
binary format.
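
As a rough illustration of protocol selection, the sketch below
pickles the same value with each way of naming a protocol and checks
that it loads back unchanged.  The variable names are illustrative
only, and nothing here depends on how many protocols the running
interpreter actually provides.

\begin{verbatim}
import pickle

data = {'label': 'spam', 'numbers': [1, 2, 3]}

# Every protocol should reproduce an equal object on loading.
for protocol in (0, 1, 2, -1, pickle.HIGHEST_PROTOCOL):
    stream = pickle.dumps(data, protocol)
    assert pickle.loads(stream) == data

# A negative value selects the same protocol as HIGHEST_PROTOCOL.
newest = pickle.dumps(data, pickle.HIGHEST_PROTOCOL)
assert pickle.dumps(data, -1) == newest
\end{verbatim}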

\subsection{Usage}

To serialize an object hierarchy, you first create a pickler, then you
call the pickler's \method{dump()} method.  To de-serialize a data
stream, you first create an unpickler, then you call the unpickler's
\method{load()} method.  The \module{pickle} module provides the
following constant:

\begin{datadesc}{HIGHEST_PROTOCOL}
The highest protocol version available.  This value can be passed
as a \var{protocol} value.
\end{datadesc}

The \module{pickle} module provides the
following functions to make this process more convenient:

\begin{funcdesc}{dump}{object, file\optional{, protocol\optional{, bin}}}
Write a pickled representation of \var{object} to the open file object
\var{file}.  This is equivalent to
\code{Pickler(\var{file}, \var{protocol}, \var{bin}).dump(\var{object})}.

If the \var{protocol} parameter is omitted, protocol 0 is used.
If \var{protocol} is specified as a negative value
or \constant{HIGHEST_PROTOCOL},
the highest protocol version will be used.

\versionchanged[The \var{protocol} parameter was added.
The \var{bin} parameter is deprecated and only provided
for backwards compatibility.  You should use the \var{protocol}
parameter instead]{2.3}

If the optional \var{bin} argument is true, the binary pickle format
is used; otherwise the (less efficient) text pickle format is used
(for backwards compatibility, this is the default).

\var{file} must have a \method{write()} method that accepts a single
string argument.  It can thus be a file object opened for writing, a
\refmodule{StringIO} object, or any other custom
object that meets this interface.
\end{funcdesc}

\begin{funcdesc}{load}{file}
Read a string from the open file object \var{file} and interpret it as
a pickle data stream, reconstructing and returning the original object
hierarchy.  This is equivalent to \code{Unpickler(\var{file}).load()}.

\var{file} must have two methods, a \method{read()} method that takes
an integer argument, and a \method{readline()} method that requires no
arguments.  Both methods should return a string.  Thus \var{file} can
be a file object opened for reading, a
\module{StringIO} object, or any other custom
object that meets this interface.

This function automatically determines whether the data stream was
written in binary mode or not.
\end{funcdesc}

\begin{funcdesc}{dumps}{object\optional{, protocol\optional{, bin}}}
Return the pickled representation of the object as a string, instead
of writing it to a file.

If the \var{protocol} parameter is omitted, protocol 0 is used.
If \var{protocol} is specified as a negative value
or \constant{HIGHEST_PROTOCOL},
the highest protocol version will be used.

\versionchanged[The \var{protocol} parameter was added.
The \var{bin} parameter is deprecated and only provided
for backwards compatibility.  You should use the \var{protocol}
parameter instead]{2.3}

If the optional \var{bin} argument is
true, the binary pickle format is used; otherwise the (less efficient)
text pickle format is used (this is the default).
\end{funcdesc}

\begin{funcdesc}{loads}{string}
Read a pickled object hierarchy from a string.  Characters in the
string past the pickled object's representation are ignored.
\end{funcdesc}
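
The four convenience functions can be exercised together as in the
following sketch.  It uses an in-memory file so that nothing is
written to disk; the alias \code{Buffer} and the fallback import from
\module{io} are assumptions made so the sketch can run on
interpreters that lack \module{cStringIO}, and are not described
elsewhere in this manual.

\begin{verbatim}
import pickle

try:
    from cStringIO import StringIO as Buffer    # in-memory file
except ImportError:
    from io import BytesIO as Buffer            # assumed fallback

record = {'name': 'spam', 'sizes': [1, 2, 3], 'pair': ('a', None)}

# String-based round trip with dumps() and loads().
assert pickle.loads(pickle.dumps(record)) == record

# File-based round trip: two dump() calls matched by two load() calls.
out = Buffer()
pickle.dump(record, out)
pickle.dump(record['sizes'], out)

source = Buffer(out.getvalue())
first = pickle.load(source)
second = pickle.load(source)
assert first == record
assert second == [1, 2, 3]
\end{verbatim}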

The \module{pickle} module also defines three exceptions:

\begin{excdesc}{PickleError}
A common base class for the other exceptions defined below.  This
inherits from \exception{Exception}.
\end{excdesc}

\begin{excdesc}{PicklingError}
This exception is raised when an unpicklable object is passed to
the \method{dump()} method.
\end{excdesc}

\begin{excdesc}{UnpicklingError}
This exception is raised when there is a problem unpickling an object,
such as a security violation.  Note that other exceptions may also be
raised during unpickling, including (but not necessarily limited to)
\exception{AttributeError}, \exception{EOFError},
\exception{ImportError}, and \exception{IndexError}.
\end{excdesc}
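
A small sketch of catching a pickling failure follows.  A
\keyword{lambda} is used because it cannot be found again by name;
the variable names are illustrative.  As noted above, other exception
types may surface in other situations, so production code may want a
broader \keyword{except} clause.

\begin{verbatim}
import pickle

unpicklable = lambda: 0     # not reachable by name in its module

try:
    pickle.dumps(unpicklable)
except pickle.PicklingError:
    failed = True
else:
    failed = False

assert failed
# PicklingError derives from the common base class.
assert issubclass(pickle.PicklingError, pickle.PickleError)
\end{verbatim}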

The \module{pickle} module also exports two callables\footnote{In the
\module{pickle} module these callables are classes, which you could
subclass to customize the behavior.  However, in the \module{cPickle}
module these callables are factory functions and so cannot be
subclassed.  One of the common reasons to subclass is to control what
objects can actually be unpickled.  See section~\ref{pickle-sec} for
more details on security concerns.}, \class{Pickler} and
\class{Unpickler}:

\begin{classdesc}{Pickler}{file\optional{, protocol\optional{, bin}}}
This takes a file-like object to which it will write a pickle data
stream.  

If the \var{protocol} parameter is omitted, protocol 0 is used.
If \var{protocol} is specified as a negative value,
the highest protocol version will be used.

\versionchanged[The \var{bin} parameter is deprecated and only provided
for backwards compatibility.  You should use the \var{protocol}
parameter instead]{2.3}

Optional \var{bin} if true, tells the pickler to use the more
efficient binary pickle format, otherwise the \ASCII{} format is used
(this is the default).

\var{file} must have a \method{write()} method that accepts a single
string argument.  It can thus be an open file object, a
\module{StringIO} object, or any other custom
object that meets this interface.
\end{classdesc}

\class{Pickler} objects define one (or two) public methods:

\begin{methoddesc}[Pickler]{dump}{object}
Write a pickled representation of \var{object} to the open file object
given in the constructor.  Either the binary or \ASCII{} format will
be used, depending on the value of the \var{bin} flag passed to the
constructor.
\end{methoddesc}

\begin{methoddesc}[Pickler]{clear_memo}{}
Clears the pickler's ``memo''.  The memo is the data structure that
remembers which objects the pickler has already seen, so that shared
or recursive objects are pickled by reference and not by value.  This
method is useful when re-using picklers.

\begin{notice}
Prior to Python 2.3, \method{clear_memo()} was only available on the
picklers created by \refmodule{cPickle}.  In the \module{pickle} module,
picklers have an instance variable called \member{memo} which is a
Python dictionary.  So to clear the memo for a \module{pickle} module
pickler, you could do the following:

\begin{verbatim}
mypickler.memo.clear()
\end{verbatim}

Code that does not need to support older versions of Python should
simply use \method{clear_memo()}.
\end{notice}
\end{methoddesc}

It is possible to make multiple calls to the \method{dump()} method of
the same \class{Pickler} instance.  These must then be matched to the
same number of calls to the \method{load()} method of the
corresponding \class{Unpickler} instance.  If the same object is
pickled by multiple \method{dump()} calls, the \method{load()} calls will
all yield references to the same object\footnote{\emph{Warning}: this
is intended for pickling multiple objects without intervening
modifications to the objects or their parts.  If you modify an object
and then pickle it again using the same \class{Pickler} instance, the
object is not pickled again --- a reference to it is pickled and the
\class{Unpickler} will return the old value, not the modified one.
There are two problems here: (1) detecting changes, and (2)
marshalling a minimal set of changes.  Garbage Collection may also
become a problem here.}.
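
The behaviour described above, including the caveat about modified
objects, can be observed with the following sketch.  The alias
\code{Buffer} and its fallback import are assumptions made for
portability, and the variable names are illustrative.

\begin{verbatim}
import pickle

try:
    from cStringIO import StringIO as Buffer    # in-memory file
except ImportError:
    from io import BytesIO as Buffer            # assumed fallback

out = Buffer()
pickler = pickle.Pickler(out)

items = ['a']
pickler.dump(items)
items.append('b')        # modified between two dump() calls
pickler.dump(items)      # only a reference to the memo entry is written
pickler.clear_memo()
pickler.dump(items)      # memo cleared, so the full value is written

unpickler = pickle.Unpickler(Buffer(out.getvalue()))
first = unpickler.load()
second = unpickler.load()
third = unpickler.load()

assert second is first           # both loads yield the same object
assert first == ['a']            # the modification was not recorded
assert third == ['a', 'b']       # recorded after clear_memo()
\end{verbatim}

Calling \method{clear_memo()} trades the sharing guarantee for an
up-to-date copy, so it is best reserved for points where the earlier
objects are no longer needed.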

\class{Unpickler} objects are defined as:

\begin{classdesc}{Unpickler}{file}
This takes a file-like object from which it will read a pickle data
stream.  This class automatically determines whether the data stream
was written in binary mode or not, so it does not need a flag as in
the \class{Pickler} factory.

\var{file} must have two methods, a \method{read()} method that takes
an integer argument, and a \method{readline()} method that requires no
arguments.  Both methods should return a string.  Thus \var{file} can
be a file object opened for reading, a
\module{StringIO} object, or any other custom
object that meets this interface.
\end{classdesc}

\class{Unpickler} objects have one (or two) public methods:

\begin{methoddesc}[Unpickler]{load}{}
Read a pickled object representation from the open file object given
in the constructor, and return the reconstituted object hierarchy
specified therein.
\end{methoddesc}

\begin{methoddesc}[Unpickler]{noload}{}
This is just like \method{load()} except that it doesn't actually
create any objects.  This is useful primarily for finding what's
called ``persistent ids'' that may be referenced in a pickle data
stream.  See section~\ref{pickle-protocol} below for more details.

\strong{Note:} the \method{noload()} method is currently only
available on \class{Unpickler} objects created with the
\module{cPickle} module.  \module{pickle} module \class{Unpickler}s do
not have the \method{noload()} method.
\end{methoddesc}

\subsection{What can be pickled and unpickled?}

The following types can be pickled:

\begin{itemize}

\item \code{None}, \code{True}, and \code{False}

\item integers, long integers, floating point numbers, complex numbers

\item normal and Unicode strings

\item tuples, lists, and dictionaries containing only picklable objects

\item functions defined at the top level of a module

\item built-in functions defined at the top level of a module

\item classes that are defined at the top level of a module

\item instances of such classes whose \member{__dict__} or the result
of calling \method{__getstate__()} is picklable  (see
section~\ref{pickle-protocol} for details)

\end{itemize}

Attempts to pickle unpicklable objects will raise the
\exception{PicklingError} exception; when this happens, an unspecified
number of bytes may have already been written to the underlying file.

Note that functions (built-in and user-defined) are pickled by ``fully
qualified'' name reference, not by value.  This means that only the
function name is pickled, along with the name of the module the function
is defined in.  Neither the function's code nor any of its function
attributes are pickled.  Thus the defining module must be importable
in the unpickling environment, and the module must contain the named
object, otherwise an exception will be raised\footnote{The exception
raised will likely be an \exception{ImportError} or an
\exception{AttributeError} but it could be something else.}.
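
A quick way to see pickling by reference is to check that a function
loaded from a pickle is the very same object as the original.  This is
a minimal sketch; \function{string.capwords()} was chosen only
because it is an ordinary function defined at the top level of a
standard module.

\begin{verbatim}
import pickle
import string

# Only the module and function names travel in the pickle; loading
# imports the module and looks the function up again.
restored = pickle.loads(pickle.dumps(string.capwords))
assert restored is string.capwords

# Built-in functions are handled the same way.
assert pickle.loads(pickle.dumps(len)) is len
\end{verbatim}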

Similarly, classes are pickled by named reference, so the same
restrictions in the unpickling environment apply.  Note that none of
the class's code or data is pickled, so in the following example the
class attribute \code{attr} is not restored in the unpickling
environment:

\begin{verbatim}
import pickle

class Foo:
    attr = 'a class attr'

picklestring = pickle.dumps(Foo)
\end{verbatim}

These restrictions are why picklable functions and classes must be
defined in the top level of a module.

Similarly, when class instances are pickled, their class's code and
data are not pickled along with them.  Only the instance data are
pickled.  This is done on purpose, so you can fix bugs in a class or
add methods to the class and still load objects that were created with
an earlier version of the class.  If you plan to have long-lived
objects that will see many versions of a class, it may be worthwhile
to put a version number in the objects so that suitable conversions
can be made by the class's \method{__setstate__()} method.
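
One possible shape for such a version check is sketched here.  The
class \class{Account}, its attributes, and the version numbers are
hypothetical; an ``old'' instance is simulated by removing the newer
attribute before pickling.

\begin{verbatim}
import pickle

class Account:
    def __init__(self, owner):
        self.version = 2
        self.owner = owner
        self.notes = []             # attribute added in version 2

    def __setstate__(self, state):
        # Upgrade state written by version 1 of the class.
        if state.get('version', 1) < 2:
            state['notes'] = []
            state['version'] = 2
        self.__dict__.update(state)

# Simulate an instance created by the older class definition.
old = Account('guido')
del old.notes
old.version = 1

restored = pickle.loads(pickle.dumps(old))
assert restored.version == 2
assert restored.notes == []
assert restored.owner == 'guido'
\end{verbatim}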

\subsection{The pickle protocol
\label{pickle-protocol}}\setindexsubitem{(pickle protocol)}

This section describes the ``pickling protocol'' that defines the
interface between the pickler/unpickler and the objects that are being
serialized.  This protocol provides a standard way for you to define,
customize, and control how your objects are serialized and
de-serialized.  The description in this section doesn't cover specific
customizations that you can employ to make the unpickling environment
safer from untrusted pickle data streams; see section~\ref{pickle-sec}
for more details.

\subsubsection{Pickling and unpickling normal class
    instances\label{pickle-inst}}

When a pickled class instance is unpickled, its \method{__init__()}
method is normally \emph{not} invoked.  If it is desirable that the
\method{__init__()} method be called on unpickling, a class can define
a method \method{__getinitargs__()}, which should return a
\emph{tuple} containing the arguments to be passed to the class
constructor (i.e. \method{__init__()}).  The
\method{__getinitargs__()} method is called at
pickle time; the tuple it returns is incorporated in the pickle for
the instance.
\withsubitem{(copy protocol)}{\ttindex{__getinitargs__()}}
\withsubitem{(instance constructor)}{\ttindex{__init__()}}

\withsubitem{(copy protocol)}{
  \ttindex{__getstate__()}\ttindex{__setstate__()}}
\withsubitem{(instance attribute)}{
  \ttindex{__dict__}}

Classes can further influence how their instances are pickled; if the
class defines the method \method{__getstate__()}, it is called and the
return state is pickled as the contents for the instance, instead of
the contents of the instance's dictionary.  If there is no
\method{__getstate__()} method, the instance's \member{__dict__} is
pickled.

Upon unpickling, if the class also defines the method
\method{__setstate__()}, it is called with the unpickled
state\footnote{These methods can also be used to implement copying
class instances.}.  If there is no \method{__setstate__()} method, the
pickled state must be a dictionary and its items are assigned to the
new instance's dictionary.  If a class defines both
\method{__getstate__()} and \method{__setstate__()}, the state object
needn't be a dictionary and these methods can do what they
want.\footnote{This protocol is also used by the shallow and deep
copying operations defined in the
\refmodule{copy} module.}

\begin{notice}[warning]
  For new-style classes, if \method{__getstate__()} returns a false
  value, the \method{__setstate__()} method will not be called.
\end{notice}
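
The pair of methods is often used to leave out data that can be
rebuilt after loading.  The following is a minimal sketch under that
assumption; the class \class{Counter} and its \member{cache}
attribute are hypothetical.

\begin{verbatim}
import pickle

class Counter:
    def __init__(self, start):
        self.count = start
        self.cache = {'expensive': 'result'}   # rebuilt on demand

    def __getstate__(self):
        # Pickle a copy of the instance dictionary without the cache.
        state = self.__dict__.copy()
        del state['cache']
        return state

    def __setstate__(self, state):
        # Restore the saved attributes, then start with an empty cache.
        self.__dict__.update(state)
        self.cache = {}

original = Counter(5)
restored = pickle.loads(pickle.dumps(original))

assert restored.count == 5
assert restored.cache == {}
assert original.cache == {'expensive': 'result'}   # left untouched
\end{verbatim}

Copying the dictionary in \method{__getstate__()} matters: deleting
the key from \member{__dict__} itself would also remove the attribute
from the live object.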


\subsubsection{Pickling and unpickling extension types}

When the \class{Pickler} encounters an object of a type it knows
nothing about --- such as an extension type --- it looks in two places
for a hint of how to pickle it.  One alternative is for the object to
implement a \method{__reduce__()} method.  If provided, at pickling
time \method{__reduce__()} will be called with no arguments, and it
must return either a string or a tuple.

If a string is returned, it names a global variable whose contents are
pickled as normal.  When a tuple is returned, it must be of length two
or three, with the following semantics:

\begin{itemize}

\item A callable object, which in the unpickling environment must be
      either a class, a callable registered as a ``safe constructor''
      (see below), or it must have an attribute
      \member{__safe_for_unpickling__} with a true value.  Otherwise,
      an \exception{UnpicklingError} will be raised in the unpickling
      environment.  Note that as usual, the callable itself is pickled
      by name.

\item A tuple of arguments for the callable object, or \code{None}.
\deprecated{2.3}{Use the tuple of arguments instead}

\item Optionally, the object's state, which will be passed to
      the object's \method{__setstate__()} method as described in
      section~\ref{pickle-inst}.  If the object has no
      \method{__setstate__()} method, then, as above, the value must
      be a dictionary and it will be added to the object's
      \member{__dict__}.

\end{itemize}

Upon unpickling, the callable will be called (provided that it meets
the above criteria), passing in the tuple of arguments; it should
return the unpickled object.

If the second item was \code{None}, then instead of calling the
callable directly, its \method{__basicnew__()} method is called
without arguments.  It should also return the unpickled object.

\deprecated{2.3}{Use the tuple of arguments instead}

An alternative to implementing a \method{__reduce__()} method on the
object to be pickled, is to register the callable with the
\refmodule[copyreg]{copy_reg} module.  This module provides a way
for programs to register ``reduction functions'' and constructors for
user-defined types.   Reduction functions have the same semantics and
interface as the \method{__reduce__()} method described above, except
that they are called with a single argument, the object to be pickled.

The registered constructor is deemed a ``safe constructor'' for purposes
of unpickling as described above.
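
Both routes can be compared side by side in the sketch below.  The
classes \class{Interval} and \class{Money} and the function
\function{reduce_money()} are hypothetical, and the fallback import
of \module{copyreg} is an assumption made so the sketch can run where
\module{copy_reg} is not available under that name.  In each case the
callable in the returned tuple is the class itself, which satisfies
the requirement that it be a class.

\begin{verbatim}
import pickle

try:
    import copy_reg
except ImportError:
    import copyreg as copy_reg      # assumed alternative name

class Interval(object):
    """Hypothetical type that implements __reduce__() itself."""
    def __init__(self, low, high):
        self.low = low
        self.high = high

    def __reduce__(self):
        # (callable, argument tuple)
        return (Interval, (self.low, self.high))

class Money(object):
    """Hypothetical type handled by a registered reduction function."""
    def __init__(self, cents):
        self.cents = cents

def reduce_money(obj):
    # Same return convention as __reduce__(), but obj is passed in.
    return (Money, (obj.cents,))

copy_reg.pickle(Money, reduce_money)

span = pickle.loads(pickle.dumps(Interval(1, 5)))
price = pickle.loads(pickle.dumps(Money(250)))
assert (span.low, span.high) == (1, 5)
assert price.cents == 250
\end{verbatim}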

\subsubsection{Pickling and unpickling external objects}

For the benefit of object persistence, the \module{pickle} module
supports the notion of a reference to an object outside the pickled
data stream.  Such objects are referenced by a ``persistent id'',
which is just an arbitrary string of printable \ASCII{} characters.
The resolution of such names is not defined by the \module{pickle}
module; it will delegate this resolution to user defined functions on
the pickler and unpickler\footnote{The actual mechanism for
associating these user defined functions is slightly different for
\module{pickle} and \module{cPickle}.  The description given here
works the same for both implementations.  Users of the \module{pickle}
module could also use subclassing to effect the same results,
overriding the \method{persistent_id()} and \method{persistent_load()}
methods in the derived classes.}.

To define external persistent id resolution, you need to set the
\member{persistent_id} attribute of the pickler object and the
\member{persistent_load} attribute of the unpickler object.

To pickle objects that have an external persistent id, the pickler
must have a custom \function{persistent_id()} method that takes an
object as an argument and returns either \code{None} or the persistent
id for that object.  When \code{None} is returned, the pickler simply
pickles the object as normal.  When a persistent id string is
returned, the pickler will pickle that string, along with a marker
so that the unpickler will recognize the string as a persistent id.

To unpickle external objects, the unpickler must have a custom
\function{persistent_load()} function that takes a persistent id
string and returns the referenced object.

Here's a silly example that \emph{might} shed more light:

\begin{verbatim}
import pickle
from cStringIO import StringIO

src = StringIO()
p = pickle.Pickler(src)

def persistent_id(obj):
    if hasattr(obj, 'x'):
        return 'the value %d' % obj.x
    else:
        return None

p.persistent_id = persistent_id

class Integer:
    def __init__(self, x):
        self.x = x
    def __str__(self):
        return 'My name is integer %d' % self.x

i = Integer(7)
print i
p.dump(i)

datastream = src.getvalue()
print repr(datastream)
dst = StringIO(datastream)

up = pickle.Unpickler(dst)

class FancyInteger(Integer):
    def __str__(self):
        return 'I am the integer %d' % self.x

def persistent_load(persid):
    if persid.startswith('the value '):
        value = int(persid.split()[2])
        return FancyInteger(value)
    else:
        raise pickle.UnpicklingError, 'Invalid persistent id'

up.persistent_load = persistent_load

j = up.load()
print j
\end{verbatim}

In the \module{cPickle} module, the unpickler's
\member{persistent_load} attribute can also be set to a Python
list, in which case, when the unpickler reaches a persistent id, the
persistent id string will simply be appended to this list.  This
functionality exists so that a pickle data stream can be ``sniffed''
for object references without actually instantiating all the objects
in a pickle\footnote{We'll leave you with the image of Guido and Jim
sitting around sniffing pickles in their living rooms.}.  Setting
\member{persistent_load} to a list is usually used in conjunction with
the \method{noload()} method on the Unpickler.

% BAW: Both pickle and cPickle support something called
% inst_persistent_id() which appears to give unknown types a second
% shot at producing a persistent id.  Since Jim Fulton can't remember
% why it was added or what it's for, I'm leaving it undocumented.

\subsection{Security \label{pickle-sec}}

Most of the security issues surrounding the \module{pickle} and
\module{cPickle} modules involve unpickling.  There are no known
security vulnerabilities
related to pickling because you (the programmer) control the objects
that \module{pickle} will interact with, and all it produces is a
string.

However, for unpickling, it is \strong{never} a good idea to unpickle
an untrusted string whose origins are dubious, for example, strings
read from a socket.  This is because unpickling can create unexpected
objects and even potentially run methods of those objects, such as
their class constructor or destructor\footnote{A special note of
caution is worth raising about the \refmodule{Cookie}
module.  By default, the \class{Cookie.Cookie} class is an alias for
the \class{Cookie.SmartCookie} class, which ``helpfully'' attempts to
unpickle any cookie data string it is passed.  This is a huge security
hole because cookie data typically comes from an untrusted source.
You should either explicitly use the \class{Cookie.SimpleCookie} class
--- which doesn't attempt to unpickle its string --- or you should
implement the defensive programming steps described later on in this
section.}.

You can defend against this by customizing your unpickler so that you
can control exactly what gets unpickled and what gets called.
Unfortunately, exactly how you do this is different depending on
whether you're using \module{pickle} or \module{cPickle}.

One common feature that both modules implement is the
\member{__safe_for_unpickling__} attribute.  Before calling a callable
which is not a class, the unpickler will check to make sure that the
callable has either been registered as a safe callable via the
\refmodule[copyreg]{copy_reg} module, or that it has an
attribute \member{__safe_for_unpickling__} with a true value.  This
prevents the unpickling environment from being tricked into doing
evil things like calling \code{os.unlink()} with an arbitrary file name.
See section~\ref{pickle-protocol} for more details.
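
As a rough sketch of the two options just described, a callable can
either carry the marker attribute or be registered with
\refmodule[copyreg]{copy_reg}.  The factory \function{make_point()} is
a hypothetical name, and the use of \function{copy_reg.constructor()}
as the registration call is an assumption that this section does not
spell out.  The fallback import is only there so that the sketch also
loads on Python versions where the module has been renamed.

```python
try:
    import copy_reg                 # module name used in this documentation
except ImportError:
    import copyreg as copy_reg      # assumption: renamed in later versions


def make_point(x, y):
    """Hypothetical factory that a pickle may ask the unpickler to call."""
    return (x, y)


# Option 1: mark the callable itself as safe with a true value.
make_point.__safe_for_unpickling__ = 1

# Option 2: register the callable as a safe constructor
# (assumed registration entry point, see the text above).
copy_reg.constructor(make_point)
```

Either step alone should be enough; both are shown only for comparison.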

For safely unpickling class instances, you need to control exactly
which classes will get created.  Be aware that a class's constructor
could be called (if the pickler found a \method{__getinitargs__()}
method) and the class's destructor (i.e., its \method{__del__()} method)
might get called when the object is garbage collected.  Depending on
the class, it isn't very hard to trick either method into doing bad
things, such as removing a file.  The way to
control the classes that are safe to instantiate differs in
\module{pickle} and \module{cPickle}\footnote{A word of caution: the
mechanisms described here use internal attributes and methods, which
are subject to change in future versions of Python.  We intend to
someday provide a common interface for controlling this behavior,
which will work in either \module{pickle} or \module{cPickle}.}.
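
The destructor case can be illustrated with a minimal sketch.  The
class \class{Noisy} and the list \code{calls} are hypothetical names
used only for this illustration, and the immediate call to
\method{__del__()} assumes the reference counting behavior of the C
implementation of Python.

```python
import pickle

calls = []   # records side effects so that they can be observed


class Noisy(object):
    """Hypothetical class whose destructor has a visible side effect."""

    def __del__(self):
        calls.append("__del__ ran")


data = pickle.dumps(Noisy())   # the temporary instance is destroyed here
del calls[:]                   # forget side effects caused by pickling

obj = pickle.loads(data)       # the unpickler creates a new Noisy instance
del obj                        # dropping the last reference runs __del__()
```

A real destructor could just as easily remove a file, and the code
doing the unpickling never called it explicitly.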

In the \module{pickle} module, you need to derive a subclass from
\class{Unpickler}, overriding the \method{load_global()}
method.  \method{load_global()} should read two lines from the pickle
data stream, where the first line will be the name of the module
containing the class and the second line will be the name of the
instance's class.  It then looks up the class, possibly importing the
module and digging out the attribute, and appends what it finds to
the unpickler's stack.  Later on, this class will be assigned to the
\member{__class__} attribute of an empty class, as a way of magically
creating an instance without calling its class's \method{__init__()}.
Your job (should you choose to accept it) would be to have
\method{load_global()} push onto the unpickler's stack a known safe
version of any class you deem safe to unpickle.  It is up to you to
produce such a class.  Or you could raise an error if you want to
disallow all unpickling of instances.  If this sounds like a hack,
you're right.  UTSL.

Things are a little cleaner with \module{cPickle}, but not by much.
To control what gets unpickled, you can set the unpickler's
\member{find_global} attribute to a function or \code{None}.  If it is
\code{None} then any attempts to unpickle instances will raise an
\exception{UnpicklingError}.  If it is a function,
then it should accept a module name and a class name, and return the
corresponding class object.  It is responsible for looking up the
class, again performing any necessary imports, and it may raise an
error to prevent instances of the class from being unpickled.
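
A function of the kind described above might look like the following
sketch.  The names \function{restricted_find_global()} and
\code{ALLOWED_CLASSES} are hypothetical, and the whitelist entry is
only an example; the function follows the contract given above by
taking a module name and a class name and either returning the class
or raising an error.

```python
import sys
import pickle

# Hypothetical whitelist of (module name, class name) pairs that this
# application is willing to unpickle.
ALLOWED_CLASSES = [("datetime", "date")]


def restricted_find_global(module_name, class_name):
    """Return the named class, but only if it is on the whitelist."""
    if (module_name, class_name) not in ALLOWED_CLASSES:
        raise pickle.UnpicklingError(
            "refusing to unpickle %s.%s" % (module_name, class_name))
    __import__(module_name)             # perform the import if necessary
    module = sys.modules[module_name]   # __import__ returns the top package
    return getattr(module, class_name)
```

With \module{cPickle}, a function like this would be assigned to the
unpickler's \member{find_global} attribute before any data is loaded.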

The moral of the story is that you should be really careful about the
source of the strings your application unpickles.

\subsection{Example \label{pickle-example}}

Here's a simple example of how to modify pickling behavior for a
class.  The \class{TextReader} class opens a text file, and returns
the line number and line contents each time its \method{readline()}
method is called. If a \class{TextReader} instance is pickled, all
attributes \emph{except} the file object member are saved. When the
instance is unpickled, the file is reopened, and reading resumes from
the last location. The \method{__setstate__()} and
\method{__getstate__()} methods are used to implement this behavior.

\begin{verbatim}
class TextReader:
    """Print and number lines in a text file."""
    def __init__(self, file):
        self.file = file
        self.fh = open(file)
        self.lineno = 0

    def readline(self):
        self.lineno = self.lineno + 1
        line = self.fh.readline()
        if not line:
            return None
        if line.endswith("\n"):
            line = line[:-1]
        return "%d: %s" % (self.lineno, line)

    def __getstate__(self):
        odict = self.__dict__.copy() # copy the dict since we change it
        del odict['fh']              # remove filehandle entry
        return odict

    def __setstate__(self,dict):
        fh = open(dict['file'])      # reopen file
        count = dict['lineno']       # read from file...
        while count:                 # until line count is restored
            fh.readline()
            count = count - 1
        self.__dict__.update(dict)   # update attributes
        self.fh = fh                 # save the file object
\end{verbatim}

A sample usage might be something like this:

\begin{verbatim}
>>> import TextReader
>>> obj = TextReader.TextReader("TextReader.py")
>>> obj.readline()
'1: #!/usr/local/bin/python'
>>> # (more invocations of obj.readline() here)
... obj.readline()
'7: class TextReader:'
>>> import pickle
>>> pickle.dump(obj,open('save.p','w'))
\end{verbatim}

If you want to see that \refmodule{pickle} works across Python
processes, start another Python session before continuing.  What
follows can happen from either the same process or a new process.

\begin{verbatim}
>>> import pickle
>>> reader = pickle.load(open('save.p'))
>>> reader.readline()
'8:     """Print and number lines in a text file."""'
\end{verbatim}


\begin{seealso}
  \seemodule[copyreg]{copy_reg}{Pickle interface constructor
                                registration for extension types.}

  \seemodule{shelve}{Indexed databases of objects; uses \module{pickle}.}

  \seemodule{copy}{Shallow and deep object copying.}

  \seemodule{marshal}{High-performance serialization of built-in types.}
\end{seealso}


\section{\module{cPickle} --- A faster \module{pickle}}

\declaremodule{builtin}{cPickle}
\modulesynopsis{Faster version of \refmodule{pickle}, but not subclassable.}
\moduleauthor{Jim Fulton}{jfulton@digicool.com}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

The \module{cPickle} module supports serialization and
de-serialization of Python objects, providing an interface and
functionality nearly identical to the
\refmodule{pickle}\refstmodindex{pickle} module.  There are several
differences, the most important being performance and subclassability.

First, \module{cPickle} can be up to 1000 times faster than
\module{pickle} because the former is implemented in C.  Second, in
the \module{cPickle} module the callables \function{Pickler()} and
\function{Unpickler()} are functions, not classes.  This means that
you cannot use them to derive custom pickling and unpickling
subclasses.  Most applications have no need for this functionality and
should benefit from the greatly improved performance of the
\module{cPickle} module.

The pickle data streams produced by \module{pickle} and
\module{cPickle} use the same format, so it is possible to use
\module{pickle} and \module{cPickle} interchangeably with existing
pickles\footnote{Since the pickle data format is actually a tiny
stack-oriented programming language, and some freedom is taken in the
encodings of certain objects, it is possible that the two modules
produce different data streams for the same input objects.  However it
is guaranteed that they will always be able to read each other's
data streams.}.
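
One way to take advantage of this is sketched below: the faster module
is used when it can be imported, and the data it writes is read back
with \module{pickle}.  The variable names are illustrative, and the
fallback covers the assumption that \module{cPickle} may not be
available in every installation.

```python
try:
    import cPickle            # C implementation described in this section
except ImportError:
    cPickle = None            # assumption: not every installation has it
import pickle

fast = cPickle or pickle      # fall back to the pure Python module

original = {"name": "spam", "values": [1, 2, 3]}
data = fast.dumps(original)       # written by whichever module is available
restored = pickle.loads(data)     # read back with the pickle module
```

Because each module can read the other's data streams, the choice of
module for writing does not constrain the choice for reading.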

There are additional minor differences in API between \module{cPickle}
and \module{pickle}; however, for most applications, they are
interchangeable.  More documentation is provided in the
\module{pickle} module documentation, which
includes a list of the documented differences.