Added errors argument to TarFile class that allows the user to specify an
error handling scheme for character conversion. Additional scheme "utf-8" in
read mode. Unicode input filenames are now supported by design. The values of
the pax_headers dictionary are now limited to unicode objects.

Fixed: The prefix field is no longer used in PAX_FORMAT (in conformance with
POSIX).
Fixed: In read mode use a possible pax header size field.
Fixed: Strip trailing slashes from pax header name values.
Fixed: Give values in user-specified pax_headers precedence when writing.

Added unicode tests. Added pax/regtype4 member to testtar.tar all possible
number fields in a pax header.

Added two chapters to the documentation about the different formats
tarfile.py supports and how unicode issues are handled.

author: Lars Gustäbel <lars@gustaebel.de> 2007-05-27 19:49:30 (GMT)
committer: Lars Gustäbel <lars@gustaebel.de> 2007-05-27 19:49:30 (GMT)
commit: a0fcb9384ead24c412b93a4de903788eb5828dbe (patch)
tree: 6bf71c1d2d2943690bd59f838561520fcaadfdbf /Doc/lib
parent: 0ac601995ccd123696b44b0194c3718f8d364c07 (diff)
1 files changed, 135 insertions, 31 deletions
diff --git a/Doc/lib/libtarfile.tex b/Doc/lib/libtarfile.tex
index 73c35ed..54683a7 100644
--- a/Doc/lib/libtarfile.tex
+++ b/Doc/lib/libtarfile.tex
@@ -133,24 +133,20 @@ Some facts and figures:
     \versionadded{2.6}
 \end{excdesc}
 
+Each of the following constants defines a tar archive format that the
+\module{tarfile} module is able to create. See section \ref{tar-formats} for
+details.
+
 \begin{datadesc}{USTAR_FORMAT}
-    \POSIX{}.1-1988 (ustar) format. It supports filenames up to a length of
-    at best 256 characters and linknames up to 100 characters. The maximum
-    file size is 8 gigabytes. This is an old and limited but widely
-    supported format.
+    \POSIX{}.1-1988 (ustar) format.
 \end{datadesc}
 
 \begin{datadesc}{GNU_FORMAT}
-    GNU tar format. It supports arbitrarily long filenames and linknames and
-    files bigger than 8 gigabytes. It is the defacto standard on GNU/Linux
-    systems.
+    GNU tar format.
 \end{datadesc}
 
 \begin{datadesc}{PAX_FORMAT}
-    \POSIX{}.1-2001 (pax) format. It is the most flexible format with
-    virtually no limits. It supports long filenames and linknames, large files
-    and stores pathnames in a portable way.  However, not all tar
-    implementations today are able to handle pax archives properly.
+    \POSIX{}.1-2001 (pax) format.
 \end{datadesc}
 
 \begin{datadesc}{DEFAULT_FORMAT}
@@ -175,15 +171,15 @@ Some facts and figures:
 
 The \class{TarFile} object provides an interface to a tar archive. A tar
 archive is a sequence of blocks. An archive member (a stored file) is made up
-of a header block followed by data blocks. It is possible, to store a file in a
+of a header block followed by data blocks. It is possible to store a file in a
 tar archive several times. Each archive member is represented by a
 \class{TarInfo} object, see \citetitle{TarInfo Objects} (section
 \ref{tarinfo-objects}) for details.
 
 \begin{classdesc}{TarFile}{name=None, mode='r', fileobj=None,
         format=DEFAULT_FORMAT, tarinfo=TarInfo, dereference=False,
-        ignore_zeros=False, encoding=None, pax_headers=None, debug=0,
-        errorlevel=0}
+        ignore_zeros=False, encoding=None, errors=None, pax_headers=None,
+        debug=0, errorlevel=0}
 
     All following arguments are optional and can be accessed as instance
     attributes as well.
@@ -231,18 +227,14 @@ tar archive several times. Each archive member is represented by a
     If \code{2}, all \emph{non-fatal} errors are raised as \exception{TarError}
     exceptions as well.
 
-    The \var{encoding} argument defines the local character encoding. It
-    defaults to the value from \function{sys.getfilesystemencoding()} or if
-    that is \code{None} to \code{"ascii"}. \var{encoding} is used only in
-    connection with the pax format which stores text data in \emph{UTF-8}. If
-    it is not set correctly, character conversion will fail with a
-    \exception{UnicodeError}.
+    The \var{encoding} and \var{errors} arguments control the way strings are
+    converted to unicode objects and vice versa. The default settings will work
+    for most users. See section \ref{tar-unicode} for in-depth information.
     \versionadded{2.6}
 
-    The \var{pax_headers} argument must be a dictionary whose elements are
-    either unicode objects, numbers or strings that can be decoded to unicode
-    using \var{encoding}. This information will be added to the archive as a
-    pax global header.
+    The \var{pax_headers} argument is an optional dictionary of unicode strings
+    which will be added as a pax global header if \var{format} is
+    \constant{PAX_FORMAT}.
     \versionadded{2.6}
 \end{classdesc}
 
@@ -287,7 +279,7 @@ tar archive several times. Each archive member is represented by a
     Extract all members from the archive to the current working directory
     or directory \var{path}. If optional \var{members} is given, it must be
     a subset of the list returned by \method{getmembers()}.
-    Directory informations like owner, modification time and permissions are
+    Directory information like owner, modification time and permissions are
     set after all members have been extracted. This is done to work around two
     problems: A directory's modification time is reset each time a file is
     created in it. And, if a directory's permissions do not allow writing,
@@ -365,6 +357,11 @@ tar archive several times. Each archive member is represented by a
     \deprecated{2.6}{Use the \member{format} attribute instead.}
 \end{memberdesc}
 
+\begin{memberdesc}{pax_headers}
+    A dictionary containing key-value pairs of pax global headers.
+    \versionadded{2.6}
+\end{memberdesc}
+
 %-----------------
 % TarInfo Objects
 %-----------------
@@ -384,8 +381,8 @@ the file's data itself.
     Create a \class{TarInfo} object.
 \end{classdesc}
 
-\begin{methoddesc}{frombuf}{}
-    Create and return a \class{TarInfo} object from a string buffer.
+\begin{methoddesc}{frombuf}{buf}
+    Create and return a \class{TarInfo} object from string buffer \var{buf}.
     \versionadded[Raises \exception{HeaderError} if the buffer is
     invalid.]{2.6}
 \end{methoddesc}
@@ -396,10 +393,11 @@ the file's data itself.
     \versionadded{2.6}
 \end{methoddesc}
 
-\begin{methoddesc}{tobuf}{\optional{format}}
-    Create a string buffer from a \class{TarInfo} object.  See
-    \class{TarFile}'s \member{format} argument for information.
-    \versionchanged[The \var{format} parameter]{2.6}
+\begin{methoddesc}{tobuf}{\optional{format\optional{, encoding
+        \optional{, errors}}}}
+    Create a string buffer from a \class{TarInfo} object. For information
+    on the arguments see the constructor of the \class{TarFile} class.
+    \versionchanged[The arguments were added]{2.6}
 \end{methoddesc}
 
 A \code{TarInfo} object has the following public data attributes:
@@ -452,6 +450,12 @@ A \code{TarInfo} object has the following public data attributes:
     Group name.
 \end{memberdesc}
 
+\begin{memberdesc}{pax_headers}
+    A dictionary containing key-value pairs of an associated pax
+    extended header.
+    \versionadded{2.6}
+\end{memberdesc}
+
 A \class{TarInfo} object also provides some convenient query methods:
 
 \begin{methoddesc}{isfile}{}
@@ -554,3 +558,103 @@ for tarinfo in tar:
     tar.extract(tarinfo)
 tar.close()
 \end{verbatim}
+
+%------------
+% Tar format
+%------------
+
+\subsection{Supported tar formats \label{tar-formats}}
+
+There are three tar formats that can be created with the \module{tarfile}
+module:
+
+\begin{itemize}
+
+\item
+The \POSIX{}.1-1988 ustar format (\constant{USTAR_FORMAT}). It supports
+filenames up to a length of at best 256 characters and linknames up to 100
+characters. The maximum file size is 8 gigabytes. This is an old and limited
+but widely supported format.
+
+\item
+The GNU tar format (\constant{GNU_FORMAT}). It supports long filenames and
+linknames, files bigger than 8 gigabytes and sparse files. It is the de facto
+standard on GNU/Linux systems. \module{tarfile} fully supports the GNU tar
+extensions for long names, sparse file support is read-only.
+
+\item
+The \POSIX{}.1-2001 pax format (\constant{PAX_FORMAT}). It is the most
+flexible format with virtually no limits. It supports long filenames and
+linknames, large files and stores pathnames in a portable way. However, not
+all tar implementations today are able to handle pax archives properly.
+
+The \emph{pax} format is an extension to the existing \emph{ustar} format. It
+uses extra headers for information that cannot be stored otherwise. There are
+two flavours of pax headers: Extended headers only affect the subsequent file
+header, global headers are valid for the complete archive and affect all
+following files. All the data in a pax header is encoded in \emph{UTF-8} for
+portability reasons.
+
+\end{itemize}
+
+There are some more variants of the tar format which can be read, but not
+created:
+
+\begin{itemize}
+
+\item
+The ancient V7 format. This is the first tar format from \UNIX{} Seventh
+Edition, storing only regular files and directories. Names must not be longer
+than 100 characters, there is no user/group name information. Some archives
+have miscalculated header checksums in case of fields with non-\ASCII{}
+characters.
+
+\item
+The SunOS tar extended format. This format is a variant of the \POSIX{}.1-2001
+pax format, but is not compatible.
+
+\end{itemize}
+
+%----------------
+% Unicode issues
+%----------------
+
+\subsection{Unicode issues \label{tar-unicode}}
+
+The tar format was originally conceived to make backups on tape drives with the
+main focus on preserving file system information. Nowadays tar archives are
+commonly used for file distribution and exchanging archives over networks. One
+problem of the original format (that all other formats are merely variants of)
+is that there is no concept of supporting different character encodings.
+For example, an ordinary tar archive created on a \emph{UTF-8} system cannot be
+read correctly on a \emph{Latin-1} system if it contains non-\ASCII{}
+characters. Names (i.e. filenames, linknames, user/group names) containing
+these characters will appear damaged.  Unfortunately, there is no way to
+autodetect the encoding of an archive.
+
+The pax format was designed to solve this problem. It stores non-\ASCII{} names
+using the universal character encoding \emph{UTF-8}. When a pax archive is
+read, these \emph{UTF-8} names are converted to the encoding of the local
+file system.
+
+The details of unicode conversion are controlled by the \var{encoding} and
+\var{errors} keyword arguments of the \class{TarFile} class.
+
+The default value for \var{encoding} is the local character encoding. It is
+deduced from \function{sys.getfilesystemencoding()} and
+\function{sys.getdefaultencoding()}. In read mode, \var{encoding} is used
+exclusively to convert unicode names from a pax archive to strings in the local
+character encoding. In write mode, the use of \var{encoding} depends on the
+chosen archive format. In case of \constant{PAX_FORMAT}, input names that
+contain non-\ASCII{} characters need to be decoded before being stored as
+\emph{UTF-8} strings. The other formats do not make use of \var{encoding}
+unless unicode objects are used as input names. These are converted to
+8-bit character strings before they are added to the archive.
+
+The \var{errors} argument defines how characters are treated that cannot be
+converted to or from \var{encoding}. Possible values are listed in section
+\ref{codec-base-classes}. In read mode, there is an additional scheme
+\code{'utf-8'} which means that bad characters are replaced by their
+\emph{UTF-8} representation. This is the default scheme. In write mode the
+default value for \var{errors} is \code{'strict'} to ensure that name
+information is not altered unnoticed.
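
The pax_headers behaviour documented in this patch (a global header set through the TarFile constructor, an extended header attached to a single TarInfo) can be tried out with a short round trip. The following is a minimal sketch, not part of the commit. It assumes a current Python 3 interpreter rather than the 2.6 series this patch targets, so it leaves the encoding and errors arguments at their defaults, which may differ between versions. The function names build_pax_archive and read_pax_archive, the member name and the header keys "comment" and "note" are made up for illustration.

```python
import io
import tarfile


def build_pax_archive() -> bytes:
    """Write a one-member pax archive into memory and return its bytes."""
    buf = io.BytesIO()
    # pax_headers given to the constructor are written as a pax global
    # header, because format is PAX_FORMAT.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers={"comment": "example archive"}) as tar:
        data = b"hello\n"
        # Hypothetical member name containing non-ASCII characters.
        info = tarfile.TarInfo(name="dir/gr\u00fc\u00dfe.txt")
        info.size = len(data)
        # pax_headers on a TarInfo are written as a pax extended header
        # that applies to this member only.
        info.pax_headers = {"note": "extended header value"}
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_pax_archive(raw: bytes):
    """Return the global pax headers and (name, pax_headers) per member."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as tar:
        members = [(m.name, dict(m.pax_headers)) for m in tar.getmembers()]
        return dict(tar.pax_headers), members


if __name__ == "__main__":
    global_headers, members = read_pax_archive(build_pax_archive())
    print(global_headers)
    for name, headers in members:
        print(name, headers)
```

Because the member name is not pure ASCII, the pax writer stores it in the extended header as UTF-8, and the reader restores it from there, which is the portability property the "Unicode issues" section describes.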