author: Andrew M. Kuchling <amk@amk.ca>, 2005-08-30 01:25:05 (GMT)
committer: Andrew M. Kuchling <amk@amk.ca>, 2005-08-30 01:25:05 (GMT)
commit: e8f44d683e79c7a9659a4480736d55193da4a7b1
tree: 37e8b05066aa1caf85f6b25d52f1576366e45e8e /Doc
parent: f1b2ba6aa1751c5325e8fb87a28e54a857796bfa
Commit the howto source to the main Python repository, with Fred's approval
Diffstat (limited to 'Doc')
-rw-r--r-- Doc/howto/Makefile 88
-rw-r--r-- Doc/howto/advocacy.tex 405
-rw-r--r-- Doc/howto/curses.tex 485
-rw-r--r-- Doc/howto/doanddont.tex 343
-rw-r--r-- Doc/howto/regex.tex 1466
-rw-r--r-- Doc/howto/rexec.tex 61
-rw-r--r-- Doc/howto/sockets.tex 460
-rw-r--r-- Doc/howto/sorting.tex 267
-rw-r--r-- Doc/howto/unicode.rst 765
9 files changed, 4340 insertions, 0 deletions
diff --git a/Doc/howto/Makefile b/Doc/howto/Makefile
new file mode 100644
index 0000000..19701c6
--- /dev/null
+++ b/Doc/howto/Makefile
@@ -0,0 +1,88 @@
+
+MKHOWTO=../tools/mkhowto
+WEBDIR=.
+RSTARGS = --input-encoding=utf-8
+VPATH=.:dvi:pdf:ps:txt
+
+# List of HOWTOs that aren't to be processed
+
+REMOVE_HOWTO =
+
+# Determine list of files to be built
+
+HOWTO=$(filter-out $(REMOVE_HOWTO),$(wildcard *.tex))
+RST_SOURCES = $(shell echo *.rst)
+DVI =$(patsubst %.tex,%.dvi,$(HOWTO))
+PDF =$(patsubst %.tex,%.pdf,$(HOWTO))
+PS =$(patsubst %.tex,%.ps,$(HOWTO))
+TXT =$(patsubst %.tex,%.txt,$(HOWTO))
+HTML =$(patsubst %.tex,%,$(HOWTO))
+
+# Rules for building various formats
+%.dvi : %.tex
+ $(MKHOWTO) --dvi $<
+ mv $@ dvi
+
+%.pdf : %.tex
+ $(MKHOWTO) --pdf $<
+ mv $@ pdf
+
+%.ps : %.tex
+ $(MKHOWTO) --ps $<
+ mv $@ ps
+
+%.txt : %.tex
+ $(MKHOWTO) --text $<
+ mv $@ txt
+
+% : %.tex
+ $(MKHOWTO) --html --iconserver="." $<
+ tar -zcvf html/$*.tgz $*
+ #zip -r html/$*.zip $*
+
+default:
+ @echo "'all' -- build all files"
+ @echo "'dvi', 'pdf', 'ps', 'txt', 'html' -- build one format"
+
+all: $(HTML)
+
+.PHONY : dvi pdf ps txt html rst
+dvi: $(DVI)
+
+pdf: $(PDF)
+ps: $(PS)
+txt: $(TXT)
+html:$(HTML)
+
+# Rule to build collected tar files
+dist: #all
+ for i in dvi pdf ps txt ; do \
+ cd $$i ; \
+ tar -zcf All.tgz *.$$i ;\
+ cd .. ;\
+ done
+
+# Rule to copy files to the Web tree on AMK's machine
+web: dist
+ cp dvi/* $(WEBDIR)/dvi
+ cp ps/* $(WEBDIR)/ps
+ cp pdf/* $(WEBDIR)/pdf
+ cp txt/* $(WEBDIR)/txt
+ for dir in $(HTML) ; do cp -rp $$dir $(WEBDIR) ; done
+ for ltx in $(HOWTO) ; do cp -p $$ltx $(WEBDIR)/latex ; done
+
+rst: unicode.html
+
+%.html: %.rst
+ rst2html $(RSTARGS) $< >$@
+
+clean:
+ rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how
+ rm -f *.dvi *.ps *.pdf *.bkm
+ rm -f unicode.html
+
+clobber:
+ rm dvi/* ps/* pdf/* txt/* html/*
+
+
+
diff --git a/Doc/howto/advocacy.tex b/Doc/howto/advocacy.tex
new file mode 100644
index 0000000..619242b
--- /dev/null
+++ b/Doc/howto/advocacy.tex
@@ -0,0 +1,405 @@
+
+\documentclass{howto}
+
+\title{Python Advocacy HOWTO}
+
+\release{0.03}
+
+\author{A.M. Kuchling}
+\authoraddress{\email{amk@amk.ca}}
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+\noindent
+It's usually difficult to get your management to accept open source
+software, and Python is no exception to this rule. This document
+discusses reasons to use Python, strategies for winning acceptance,
+facts and arguments you can use, and cases where you \emph{shouldn't}
+try to use Python.
+
+This document is available from the Python HOWTO page at
+\url{http://www.python.org/doc/howto}.
+
+\end{abstract}
+
+\tableofcontents
+
+\section{Reasons to Use Python}
+
+There are several reasons to incorporate a scripting language into
+your development process, and this section will discuss them, and why
+Python has some properties that make it a particularly good choice.
+
+ \subsection{Programmability}
+
+Programs are often organized in a modular fashion. Lower-level
+operations are grouped together, and called by higher-level functions,
+which may in turn be used as basic operations by still further upper
+levels.
+
+For example, the lowest level might define a very low-level
+set of functions for accessing a hash table. The next level might use
+hash tables to store the headers of a mail message, mapping a header
+name like \samp{Date} to a value such as \samp{Tue, 13 May 1997
+20:00:54 -0400}. A yet higher level may operate on message objects,
+without knowing or caring that message headers are stored in a hash
+table, and so forth.
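The layering described above can be sketched in a few lines of Python 3. This is only an illustration: it assumes the standard library's `email` package as the middle layer, and `sent_in_year` is a hypothetical helper invented for this example. The caller looks up headers by name and never needs to know how the message object stores them.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

raw = (
    "Date: Tue, 13 May 1997 20:00:54 -0400\n"
    "Subject: Hello\n"
    "\n"
    "Body text.\n"
)

# Middle level: a message object that maps header names to values.
msg = message_from_string(raw)
date_header = msg["Date"]

# A low-level primitive: convert a date string into a datetime object.
parsed = parsedate_to_datetime(date_header)


def sent_in_year(message, year):
    """Higher level (hypothetical helper): works on message objects only."""
    return parsedate_to_datetime(message["Date"]).year == year


print(date_header)
print(sent_in_year(msg, 1997))
```

Each level uses only the interface of the level beneath it, so the storage behind the headers could change without touching `sent_in_year`.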
+
+Often, the lowest levels do very simple things; they implement a data
+structure such as a binary tree or hash table, or they perform some
+simple computation, such as converting a date string to a number. The
+higher levels then contain logic connecting these primitive
+operations. Using this approach, the primitives can be seen as basic
+building blocks which are then glued together to produce the complete
+product.
+
+Why is this design approach relevant to Python? Because Python is
+well suited to functioning as such a glue language. A common approach
+is to write a Python module that implements the lower level
+operations; for the sake of speed, the implementation might be in C,
+Java, or even Fortran. Once the primitives are available to Python
+programs, the logic underlying higher level operations is written in
+the form of Python code. The high-level logic is then more
+understandable, and easier to modify.
+
+John Ousterhout wrote a paper that explains this idea at greater
+length, entitled ``Scripting: Higher Level Programming for the 21st
+Century''. I recommend that you read this paper; see the references
+for the URL. Ousterhout is the inventor of the Tcl language, and
+therefore argues that Tcl should be used for this purpose; he only
+briefly refers to other languages such as Python, Perl, and
+Lisp/Scheme, but in reality, Ousterhout's argument applies to
+scripting languages in general, since you could equally write
+extensions for any of the languages mentioned above.
+
+ \subsection{Prototyping}
+
+In \emph{The Mythical Man-Month}, Frederick Brooks suggests the
+following rule when planning software projects: ``Plan to throw one
+away; you will anyway.'' Brooks is saying that the first attempt at a
+software design often turns out to be wrong; unless the problem is
+very simple or you're an extremely good designer, you'll find that new
+requirements and features become apparent once development has
+actually started. If these new requirements can't be cleanly
+incorporated into the program's structure, you're presented with two
+unpleasant choices: hammer the new features into the program somehow,
+or scrap everything and write a new version of the program, taking the
+new features into account from the beginning.
+
+Python provides you with a good environment for quickly developing an
+initial prototype. That lets you get the overall program structure
+and logic right, and you can fine-tune small details in the fast
+development cycle that Python provides. Once you're satisfied with
+the GUI interface or program output, you can translate the Python code
+into C++, Fortran, Java, or some other compiled language.
+
+Prototyping means you have to be careful not to use too many Python
+features that are hard to implement in your other language. Using
+\code{eval()}, or regular expressions, or the \module{pickle} module,
+means that you're going to need C or Java libraries for formula
+evaluation, regular expressions, and serialization, for example. But
+it's not hard to avoid such tricky code, and in the end the
+translation usually isn't very difficult. The resulting code can be
+rapidly debugged, because any serious logical errors will have been
+removed from the prototype, leaving only more minor slip-ups in the
+translation to track down.
+
+This strategy builds on the earlier discussion of programmability.
+Using Python as glue to connect lower-level components has obvious
+relevance for constructing prototype systems. In this way Python can
+help you with development, even if end users never come in contact
+with Python code at all. If the performance of the Python version is
+adequate and corporate politics allow it, you may not need to do a
+translation into C or Java, but it can still be faster to develop a
+prototype and then translate it, instead of attempting to produce the
+final version immediately.
+
+One example of this development strategy is Microsoft Merchant Server.
+Version 1.0 was written in pure Python, by a company that subsequently
+was purchased by Microsoft. Version 2.0 began to translate the code
+into \Cpp, shipping with some \Cpp code and some Python code. Version
+3.0 didn't contain any Python at all; all the code had been translated
+into \Cpp. Even though the product doesn't contain a Python
+interpreter, the Python language has still served a useful purpose by
+speeding up development.
+
+This is a very common use for Python. Past conference papers have
+also described this approach for developing high-level numerical
+algorithms; see David M. Beazley and Peter S. Lomdahl's paper
+``Feeding a Large-scale Physics Application to Python'' in the
+references for a good example. If an algorithm's basic operations are
+things like ``Take the inverse of this 4000x4000 matrix'', and are
+implemented in some lower-level language, then Python has almost no
+additional performance cost; the extra time required for Python to
+evaluate an expression like \code{m.invert()} is dwarfed by the cost
+of the actual computation. It's particularly good for applications
+where seemingly endless tweaking is required to get things right. GUI
+interfaces and Web sites are prime examples.
+
+The Python code is also shorter and faster to write (once you're
+familiar with Python), so it's easier to throw it away if you decide
+your approach was wrong; if you'd spent two weeks working on it
+instead of just two hours, you might waste time trying to patch up
+what you've got out of a natural reluctance to admit that those two
+weeks were wasted. Truthfully, those two weeks haven't been wasted,
+since you've learnt something about the problem and the technology
+you're using to solve it, but it's human nature to view this as a
+failure of some sort.
+
+ \subsection{Simplicity and Ease of Understanding}
+
+Python is definitely \emph{not} a toy language that's only usable for
+small tasks. The language features are general and powerful enough to
+enable it to be used for many different purposes. It's useful at the
+small end, for 10- or 20-line scripts, but it also scales up to larger
+systems that contain thousands of lines of code.
+
+However, this expressiveness doesn't come at the cost of an obscure or
+tricky syntax. While Python has some dark corners that can lead to
+obscure code, there are relatively few such corners, and proper design
+can isolate their use to only a few classes or modules. It's
+certainly possible to write confusing code by using too many features
+with too little concern for clarity, but most Python code can look a
+lot like a slightly-formalized version of human-understandable
+pseudocode.
+
+In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following
+definition for ``compact'':
+
+\begin{quotation}
+ Compact \emph{adj.} Of a design, describes the valuable property
+ that it can all be apprehended at once in one's head. This
+ generally means the thing created from the design can be used
+ with greater facility and fewer errors than an equivalent tool
+ that is not compact. Compactness does not imply triviality or
+ lack of power; for example, C is compact and FORTRAN is not,
+ but C is more powerful than FORTRAN. Designs become
+ non-compact through accreting features and cruft that don't
+ merge cleanly into the overall design scheme (thus, some fans
+ of Classic C maintain that ANSI C is no longer compact).
+\end{quotation}
+
+(From \url{http://sagan.earthspace.net/jargon/jargon_18.html\#SEC25})
+
+In this sense of the word, Python is quite compact, because the
+language has just a few ideas, which are used in lots of places. Take
+namespaces, for example. Import a module with \code{import math}, and
+you create a new namespace called \samp{math}. Classes are also
+namespaces that share many of the properties of modules, and have a
+few of their own; for example, you can create instances of a class.
+Instances? They're yet another namespace. Namespaces are currently
+implemented as Python dictionaries, so they have the same methods as
+the standard dictionary data type: .keys() returns all the keys, and
+so forth.
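A minimal sketch of the namespace idea, written for Python 3. The `Point` class is a made-up example; `vars()` is the built-in that returns the mapping behind a module, class, or instance, and each of those mappings supports `.keys()`.

```python
import math


class Point:
    """A small example class; its attributes live in the class namespace."""

    dimensions = 2

    def __init__(self, x, y):
        # These assignments populate the instance's own namespace.
        self.x = x
        self.y = y


p = Point(3, 4)

# Module, class, and instance each expose a dictionary-like namespace.
module_names = sorted(vars(math).keys())
class_names = sorted(vars(Point).keys())
instance_names = sorted(vars(p).keys())

print("sqrt" in module_names)       # True
print("dimensions" in class_names)  # True
print(instance_names)               # ['x', 'y']
```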
+
+This simplicity arises from Python's development history. The
+language syntax derives from different sources; ABC, a relatively
+obscure teaching language, is one primary influence, and Modula-3 is
+another. (For more information about ABC and Modula-3, consult their
+respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and
+\url{http://www.m3.org}.) Other features have come from C, Icon,
+Algol-68, and even Perl. Python hasn't really innovated very much,
+but instead has tried to keep the language small and easy to learn,
+building on ideas that have been tried in other languages and found
+useful.
+
+Simplicity is a virtue that should not be underestimated. It lets you
+learn the language more quickly, and then rapidly write code, code
+that often works the first time you run it.
+
+ \subsection{Java Integration}
+
+If you're working with Java, Jython
+(\url{http://www.jython.org/}) is definitely worth your
+attention. Jython is a re-implementation of Python in Java that
+compiles Python code into Java bytecodes. The resulting environment
+has very tight, almost seamless, integration with Java. It's trivial
+to access Java classes from Python, and you can write Python classes
+that subclass Java classes. Jython can be used for prototyping Java
+applications in much the same way CPython is used, and it can also be
+used for test suites for Java code, or embedded in a Java application
+to add scripting capabilities.
+
+\section{Arguments and Rebuttals}
+
+Let's say that you've decided upon Python as the best choice for your
+application. How can you convince your management, or your fellow
+developers, to use Python? This section lists some common arguments
+against using Python, and provides some possible rebuttals.
+
+\emph{Python is freely available software that doesn't cost anything.
+How good can it be?}
+
+Very good, indeed. These days Linux and Apache, two other pieces of
+open source software, are becoming more respected as alternatives to
+commercial software, but Python hasn't had all the publicity.
+
+Python has been around for several years, with many users and
+developers. Accordingly, the interpreter has been used by many
+people, and has gotten most of the bugs shaken out of it. While bugs
+are still discovered at intervals, they're usually either quite
+obscure (they'd have to be, for no one to have run into them before)
+or they involve interfaces to external libraries. The internals of
+the language itself are quite stable.
+
+Having the source code should be viewed as making the software
+available for peer review; people can examine the code, suggest (and
+implement) improvements, and track down bugs. To find out more about
+the idea of open source code, along with arguments and case studies
+supporting it, go to \url{http://www.opensource.org}.
+
+\emph{Who's going to support it?}
+
+Python has a sizable community of developers, and the number is still
+growing. The Internet community surrounding the language is an active
+one, and is worth being considered another one of Python's advantages.
+Most questions posted to the comp.lang.python newsgroup are quickly
+answered by someone.
+
+Should you need to dig into the source code, you'll find it's clear
+and well-organized, so it's not very difficult to write extensions and
+track down bugs yourself. If you'd prefer to pay for support, there
+are companies and individuals who offer commercial support for Python.
+
+\emph{Who uses Python for serious work?}
+
+Lots of people; one interesting thing about Python is the surprising
+diversity of applications that it's been used for. People are using
+Python to:
+
+\begin{itemize}
+\item Run Web sites
+\item Write GUI interfaces
+\item Control
+number-crunching code on supercomputers
+\item Make a commercial application scriptable by embedding the Python
+interpreter inside it
+\item Process large XML data sets
+\item Build test suites for C or Java code
+\end{itemize}
+
+Whatever your application domain is, there's probably someone who's
+used Python for something similar. Yet, despite being useable for
+such high-end applications, Python's still simple enough to use for
+little jobs.
+
+See \url{http://www.python.org/psa/Users.html} for a list of some of the
+organizations that use Python.
+
+\emph{What are the restrictions on Python's use?}
+
+They're practically nonexistent. Consult the \file{Misc/COPYRIGHT}
+file in the source distribution, or
+\url{http://www.python.org/doc/Copyright.html} for the full language,
+but it boils down to three conditions.
+
+\begin{itemize}
+
+\item You have to leave the copyright notice on the software; if you
+don't include the source code in a product, you have to put the
+copyright notice in the supporting documentation.
+
+\item Don't claim that the institutions that have developed Python
+endorse your product in any way.
+
+\item If something goes wrong, you can't sue for damages. Practically
+all software licences contain this condition.
+
+\end{itemize}
+
+Notice that you don't have to provide source code for anything that
+contains Python or is built with it. Also, the Python interpreter and
+accompanying documentation can be modified and redistributed in any
+way you like, and you don't have to pay anyone any licensing fees at
+all.
+
+\emph{Why should we use an obscure language like Python instead of
+well-known language X?}
+
+I hope this HOWTO, and the documents listed in the final section, will
+help convince you that Python isn't obscure, and has a healthily
+growing user base. One word of advice: always present Python's
+positive advantages, instead of concentrating on language X's
+failings. People want to know why a solution is good, rather than why
+all the other solutions are bad. So instead of attacking a competing
+solution on various grounds, simply show how Python's virtues can
+help.
+
+
+\section{Useful Resources}
+
+\begin{definitions}
+
+\term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}
+
+The first chapter of \emph{Internet Programming with Python} also
+examines some of the reasons for using Python. The book is well worth
+buying, but the publishers have made the first chapter available on
+the Web.
+
+\term{\url{http://home.pacbell.net/ouster/scripting.html}}
+
+John Ousterhout's white paper on scripting is a good argument for the
+utility of scripting languages, though naturally enough, he emphasizes
+Tcl, the language he developed. Most of the arguments would apply to
+any scripting language.
+
+\term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}}
+
+The authors, David M. Beazley and Peter S. Lomdahl,
+describe their use of Python at Los Alamos National Laboratory.
+It's another good example of how Python can help get real work done.
+This quotation from the paper has been echoed by many people:
+
+\begin{quotation}
+ Originally developed as a large monolithic application for
+ massively parallel processing systems, we have used Python to
+ transform our application into a flexible, highly modular, and
+ extremely powerful system for performing simulation, data
+ analysis, and visualization. In addition, we describe how Python
+ has solved a number of important problems related to the
+ development, debugging, deployment, and maintenance of scientific
+ software.
+\end{quotation}
+
+%\term{\url{http://www.pythonjournal.com/volume1/art-interview/}}
+
+%This interview with Andy Feit, discussing Infoseek's use of Python, can be
+%used to show that choosing Python didn't introduce any difficulties
+%into a company's development process, and provided some substantial benefits.
+
+\term{\url{http://www.python.org/psa/Commercial.html}}
+
+Robin Friedrich wrote this document on how to support Python's use in
+commercial projects.
+
+\term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}}
+
+For the 6th Python conference, Greg Stein presented a paper that
+traced Python's adoption and usage at a startup called eShop, and
+later at Microsoft.
+
+\term{\url{http://www.opensource.org}}
+
+Management may be doubtful of the reliability and usefulness of
+software that wasn't written commercially. This site presents
+arguments that show how open source software can have considerable
+advantages over closed-source software.
+
+\term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}}
+
+The Linux Advocacy mini-HOWTO was the inspiration for this document,
+and is also well worth reading for general suggestions on winning
+acceptance for a new technology, such as Linux or Python. In general,
+you won't make much progress by simply attacking existing systems and
+complaining about their inadequacies; this often ends up looking like
+unfocused whining. It's much better to point out some of the many
+areas where Python is an improvement over other systems.
+
+\end{definitions}
+
+\end{document}
+
+
diff --git a/Doc/howto/curses.tex b/Doc/howto/curses.tex
new file mode 100644
index 0000000..a6a0e0a
--- /dev/null
+++ b/Doc/howto/curses.tex
@@ -0,0 +1,485 @@
+\documentclass{howto}
+
+\title{Curses Programming with Python}
+
+\release{2.01}
+
+\author{A.M. Kuchling, Eric S. Raymond}
+\authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}}
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+\noindent
+This document describes how to write text-mode programs with Python 2.x,
+using the \module{curses} extension module to control the display.
+
+This document is available from the Python HOWTO page at
+\url{http://www.python.org/doc/howto}.
+\end{abstract}
+
+\tableofcontents
+
+\section{What is curses?}
+
+The curses library supplies a terminal-independent screen-painting and
+keyboard-handling facility for text-based terminals; such terminals
+include VT100s, the Linux console, and the simulated terminal provided
+by X11 programs such as xterm and rxvt. Display terminals support
+various control codes to perform common operations such as moving the
+cursor, scrolling the screen, and erasing areas. Different terminals
+use widely differing codes, and often have their own minor quirks.
+
+In a world of X displays, one might ask ``why bother''? It's true
+that character-cell display terminals are an obsolete technology, but
+there are niches in which being able to do fancy things with them is
+still valuable. One is on small-footprint or embedded Unixes that
+don't carry an X server. Another is for tools like OS installers
+and kernel configurators that may have to run before X is available.
+
+The curses library hides all the details of different terminals, and
+provides the programmer with an abstraction of a display, containing
+multiple non-overlapping windows. The contents of a window can be
+changed in various ways--adding text, erasing it, changing its
+appearance--and the curses library will automagically figure out what
+control codes need to be sent to the terminal to produce the right
+output.
+
+The curses library was originally written for BSD Unix; the later System V
+versions of Unix from AT\&T added many enhancements and new functions.
+BSD curses is no longer maintained, having been replaced by ncurses,
+which is an open-source implementation of the AT\&T interface. If you're
+using an open-source Unix such as Linux or FreeBSD, your system almost
+certainly uses ncurses. Since most current commercial Unix versions
+are based on System V code, all the functions described here will
+probably be available. The older versions of curses carried by some
+proprietary Unixes may not support everything, though.
+
+No one has made a Windows port of the curses module. On a Windows
+platform, try the Console module written by Fredrik Lundh. The
+Console module provides cursor-addressable text output, plus full
+support for mouse and keyboard input, and is available from
+\url{http://effbot.org/efflib/console}.
+
+\subsection{The Python curses module}
+
+The Python module is a fairly simple wrapper over the C functions
+provided by curses; if you're already familiar with curses programming
+in C, it's really easy to transfer that knowledge to Python. The
+biggest difference is that the Python interface makes things simpler,
+by merging different C functions such as \function{addstr},
+\function{mvaddstr}, \function{mvwaddstr}, into a single
+\method{addstr()} method. You'll see this covered in more detail
+later.
+
+This HOWTO is simply an introduction to writing text-mode programs
+with curses and Python. It doesn't attempt to be a complete guide to
+the curses API; for that, see the Python library guide's section on
+ncurses, and the C manual pages for ncurses. It will, however, give
+you the basic ideas.
+
+\section{Starting and ending a curses application}
+
+Before doing anything, curses must be initialized. This is done by
+calling the \function{initscr()} function, which will determine the
+terminal type, send any required setup codes to the terminal, and
+create various internal data structures. If successful,
+\function{initscr()} returns a window object representing the entire
+screen; this is usually called \code{stdscr}, after the name of the
+corresponding C
+variable.
+
+\begin{verbatim}
+import curses
+stdscr = curses.initscr()
+\end{verbatim}
+
+Usually curses applications turn off automatic echoing of keys to the
+screen, in order to be able to read keys and only display them under
+certain circumstances. This requires calling the \function{noecho()}
+function.
+
+\begin{verbatim}
+curses.noecho()
+\end{verbatim}
+
+Applications will also commonly need to react to keys instantly,
+without requiring the Enter key to be pressed; this is called cbreak
+mode, as opposed to the usual buffered input mode.
+
+\begin{verbatim}
+curses.cbreak()
+\end{verbatim}
+
+Terminals usually return special keys, such as the cursor keys or
+navigation keys such as Page Up and Home, as a multibyte escape
+sequence. While you could write your application to expect such
+sequences and process them accordingly, curses can do it for you,
+returning a special value such as \constant{curses.KEY_LEFT}. To get
+curses to do the job, you'll have to enable keypad mode.
+
+\begin{verbatim}
+stdscr.keypad(1)
+\end{verbatim}
+
+Terminating a curses application is much easier than starting one.
+You'll need to call
+
+\begin{verbatim}
+curses.nocbreak(); stdscr.keypad(0); curses.echo()
+\end{verbatim}
+
+to reverse the curses-friendly terminal settings. Then call the
+\function{endwin()} function to restore the terminal to its original
+operating mode.
+
+\begin{verbatim}
+curses.endwin()
+\end{verbatim}
+
+A common problem when debugging a curses application is to get your
+terminal messed up when the application dies without restoring the
+terminal to its previous state. In Python this commonly happens when
+your code is buggy and raises an uncaught exception. Keys are no
+longer echoed to the screen when you type them, for example, which
+makes using the shell difficult.
+
+In Python you can avoid these complications and make debugging much
+easier by importing the module \module{curses.wrapper}. It supplies a
+function \function{wrapper} that takes a hook argument. It does the
+initializations described above, and also initializes colors if color
+support is present. It then runs your hook, and then finally
+deinitializes appropriately. The hook is called inside a try-except
+clause which catches exceptions, performs curses deinitialization, and
+then passes the exception upwards. Thus, your terminal won't be left
+in a funny state on exception.
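The wrapper behaviour described above can be sketched as follows. This assumes Python 3, where the function is reachable directly as `curses.wrapper`; the names `main` and `run` are illustrative only. The hook receives the full-screen window as its first argument, and its return value is passed back by the wrapper after the terminal has been restored.

```python
import curses
import sys


def main(stdscr):
    # stdscr is the full-screen window the wrapper obtained from initscr().
    height, width = stdscr.getmaxyx()
    stdscr.addstr(0, 0, "Screen is %d rows by %d columns" % (height, width))
    stdscr.refresh()
    return height, width


def run():
    # The wrapper sets up the terminal, calls main(stdscr), and restores
    # the terminal afterwards, even if main raises an exception.
    return curses.wrapper(main)


# Only start curses when attached to a real terminal.
if __name__ == "__main__" and sys.stdout.isatty():
    print(run())
```

Because the cleanup happens inside the wrapper, a bug in `main` produces an ordinary traceback on a usable terminal rather than a shell that no longer echoes keys.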
+
+\section{Windows and Pads}
+
+Windows are the basic abstraction in curses. A window object
+represents a rectangular area of the screen, and supports various
+ methods to display text, erase it, allow the user to input strings,
+and so forth.
+
+The \code{stdscr} object returned by the \function{initscr()} function
+is a window object that covers the entire screen. Many programs may
+need only this single window, but you might wish to divide the screen
+into smaller windows, in order to redraw or clear them separately.
+The \function{newwin()} function creates a new window of a given size,
+returning the new window object.
+
+\begin{verbatim}
+begin_x = 20 ; begin_y = 7
+height = 5 ; width = 40
+win = curses.newwin(height, width, begin_y, begin_x)
+\end{verbatim}
+
+A word about the coordinate system used in curses: coordinates are
+always passed in the order \emph{y,x}, and the top-left corner of a
+window is coordinate (0,0). This breaks a common convention for
+handling coordinates, where the \emph{x} coordinate usually comes
+first. This is an unfortunate difference from most other computer
+applications, but it's been part of curses since it was first written,
+and it's too late to change things now.
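As a small illustration of the y-before-x ordering, here is a hypothetical helper (not part of curses) that computes where a window should start so that it is centered on the screen. It returns the pair in the same (y, x) order that `curses.newwin()` expects for its last two arguments.

```python
def centered_origin(screen_height, screen_width, height, width):
    """Return (begin_y, begin_x) that centers a height-by-width window."""
    begin_y = max(0, (screen_height - height) // 2)
    begin_x = max(0, (screen_width - width) // 2)
    return begin_y, begin_x


# A 5-row, 40-column window on a classic 24x80 terminal.
begin_y, begin_x = centered_origin(24, 80, 5, 40)
print(begin_y, begin_x)  # 9 20

# Inside a running curses program this would become:
#     win = curses.newwin(5, 40, begin_y, begin_x)
```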
+
+When you call a method to display or erase text, the effect doesn't
+immediately show up on the display. This is because curses was
+originally written with slow 300-baud terminal connections in mind;
+with these terminals, minimizing the time required to redraw the
+screen is very important. This lets curses accumulate changes to the
+screen, and display them in the most efficient manner. For example,
+if your program displays some characters in a window, and then clears
+the window, there's no need to send the original characters because
+they'd never be visible.
+
+Accordingly, curses requires that you explicitly tell it to redraw
+windows, using the \function{refresh()} method of window objects. In
+practice, this doesn't really complicate programming with curses much.
+Most programs go into a flurry of activity, and then pause waiting for
+a keypress or some other action on the part of the user. All you have
+to do is to be sure that the screen has been redrawn before pausing to
+wait for user input, by simply calling \code{stdscr.refresh()} or the
+\function{refresh()} method of some other relevant window.
+
+A pad is a special case of a window; it can be larger than the actual
+display screen, and only a portion of it displayed at a time.
+Creating a pad simply requires the pad's height and width, while
+refreshing a pad requires giving the coordinates of the on-screen
+area where a subsection of the pad will be displayed.
+
+\begin{verbatim}
+pad = curses.newpad(100, 100)
+# These loops fill the pad with letters; this is
+# explained in the next section
+for y in range(0, 100):
+    for x in range(0, 100):
+        try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 )
+        except curses.error: pass
+
+# Displays a section of the pad in the middle of the screen
+pad.refresh( 0,0, 5,5, 20,75)
+\end{verbatim}
+
+The \function{refresh()} call displays a section of the pad in the
+rectangle extending from coordinate (5,5) to coordinate (20,75) on the
+screen; the upper left corner of the displayed section is coordinate
+(0,0) on the pad. Beyond that difference, pads are exactly like
+ordinary windows and support the same methods.
+
+If you have multiple windows and pads on screen there is a more
+efficient way to go, which will prevent annoying screen flicker at
+refresh time. Use the \method{noutrefresh()} method
+of each window to update the data structure
+representing the desired state of the screen; then change the physical
+screen to match the desired state in one go with the function
+\function{doupdate()}. The normal \method{refresh()} method calls
+\function{doupdate()} as its last act.
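+
+The two-step update can be sketched as follows. This is an
+illustration rather than part of the original HOWTO: the helper name
+\code{refresh_all} and its \code{flush} parameter are hypothetical, the
+code assumes a Python 3 interpreter, and it handles ordinary windows
+only (a pad's \method{noutrefresh()} takes the same six coordinates as
+its \method{refresh()}).
+
```python
import curses


def refresh_all(windows, flush=curses.doupdate):
    """Stage every window, then repaint the physical screen once.

    `refresh_all` and `flush` are hypothetical names for this sketch;
    noutrefresh() and curses.doupdate() are the real curses calls.
    """
    for win in windows:
        # Update the virtual screen only; nothing reaches the terminal yet.
        win.noutrefresh()
    # A single physical update for all windows, which avoids flicker.
    flush()
```
+
+The \code{flush} parameter exists only so the helper can be exercised
+without a terminal; real code would simply call
+\function{curses.doupdate()}.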
+
+\section{Displaying Text}
+
+{}From a C programmer's point of view, curses may sometimes look like
+a twisty maze of functions, all subtly different. For example,
+\function{addstr()} displays a string at the current cursor location
+in the \code{stdscr} window, while \function{mvaddstr()} moves to a
+given y,x coordinate first before displaying the string.
+\function{waddstr()} is just like \function{addstr()}, but allows
+specifying a window to use, instead of using \code{stdscr} by default.
+\function{mvwaddstr()} follows similarly.
+
+Fortunately the Python interface hides all these details;
+\code{stdscr} is a window object like any other, and methods like
+\function{addstr()} accept multiple argument forms. Usually there are
+four different forms.
+
+\begin{tableii}{|c|l|}{textrm}{Form}{Description}
+\lineii{\var{str} or \var{ch}}{Display the string \var{str} or
+character \var{ch}}
+\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or
+character \var{ch}, using attribute \var{attr}}
+\lineii{\var{y}, \var{x}, \var{str} or \var{ch}}
+{Move to position \var{y,x} within the window, and display \var{str}
+or \var{ch}}
+\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}}
+{Move to position \var{y,x} within the window, and display \var{str}
+or \var{ch}, using attribute \var{attr}}
+\end{tableii}
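+
+The four forms in the table can be sketched as below. The function
+name \code{draw_forms} and the sample strings are illustrative
+assumptions, and the code assumes Python 3; only \method{addstr()} and
+the \constant{A_*} constants come from the curses module itself.
+
```python
import curses


def draw_forms(win):
    """Call addstr() once in each of its four argument forms."""
    win.addstr("plain text at the cursor")               # str
    win.addstr(" bold text", curses.A_BOLD)              # str, attr
    win.addstr(2, 0, "text at row 2, column 0")          # y, x, str
    win.addstr(3, 0, "reverse video", curses.A_REVERSE)  # y, x, str, attr


# Typical use, which needs a real terminal:
#     curses.wrapper(draw_forms)
```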
+
+Attributes allow displaying text in highlighted forms, such as in
+boldface, underline, reverse video, or in color. They'll be explained
+in more detail in the next subsection.
+
+The \function{addstr()} function takes a Python string as the value to
+be displayed, while the \function{addch()} functions take a character,
+which can be either a Python string of length 1, or an integer. If
+it's a string, you're limited to displaying characters between 0 and
+255. SVr4 curses provides constants for extension characters; these
+constants are integers greater than 255. For example,
+\constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is
+the upper left corner of a box (handy for drawing borders).
+
+Windows remember where the cursor was left after the last operation,
+so if you leave out the \var{y,x} coordinates, the string or character
+will be displayed wherever the last operation left off. You can also
+move the cursor with the \function{move(\var{y,x})} method. Because
+some terminals always display a flashing cursor, you may want to
+ensure that the cursor is positioned in some location where it won't
+be distracting; it can be confusing to have the cursor blinking at
+some apparently random location.
+
+If your application doesn't need a blinking cursor at all, you can
+call \function{curs_set(0)} to make it invisible. Equivalently, and
+for compatibility with older curses versions, there's a
+\function{leaveok(\var{bool})} function. When \var{bool} is true, the
+curses library will attempt to suppress the flashing cursor, and you
+won't need to worry about leaving it in odd locations.
+
+\subsection{Attributes and Color}
+
+Characters can be displayed in different ways. Status lines in a
+text-based application are commonly shown in reverse video; a text
+viewer may need to highlight certain words. curses supports this by
+allowing you to specify an attribute for each cell on the screen.
+
+An attribute is an integer, each bit representing a different
+attribute. You can try to display text with multiple attribute bits
+set, but curses doesn't guarantee that all the possible combinations
+are available, or that they're all visually distinct. That depends on
+the ability of the terminal being used, so it's safest to stick to the
+most commonly available attributes, listed here.
+
+\begin{tableii}{|c|l|}{constant}{Attribute}{Description}
+\lineii{A_BLINK}{Blinking text}
+\lineii{A_BOLD}{Extra bright or bold text}
+\lineii{A_DIM}{Half bright text}
+\lineii{A_REVERSE}{Reverse-video text}
+\lineii{A_STANDOUT}{The best highlighting mode available}
+\lineii{A_UNDERLINE}{Underlined text}
+\end{tableii}
+
+So, to display a reverse-video status line on the top line of the
+screen,
+you could code:
+
+\begin{verbatim}
+stdscr.addstr(0, 0, "Current mode: Typing mode",
+ curses.A_REVERSE)
+stdscr.refresh()
+\end{verbatim}
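+
+Because each attribute occupies its own bit, several can be combined
+with bitwise OR and tested with bitwise AND. The following is a small
+sketch assuming Python 3; \code{has_attr} is a hypothetical helper,
+not a curses function, and whether the terminal renders the
+combination is still up to the terminal.
+
```python
import curses

# Each attribute is a distinct bit, so attributes combine with bitwise OR.
status_attr = curses.A_BOLD | curses.A_UNDERLINE


def has_attr(value, attr):
    """Return True if every bit of `attr` is set in `value` (hypothetical helper)."""
    return (value & attr) == attr
```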
+
+The curses library also supports color on those terminals that
+provide it. The most common such terminal is probably the Linux
+console, followed by color xterms.
+
+To use color, you must call the \function{start_color()} function
+soon after calling \function{initscr()}, to initialize the default
+color set (the \function{curses.wrapper.wrapper()} function does this
+automatically). Once that's done, the \function{has_colors()}
+function returns TRUE if the terminal in use can actually display
+color. (Note from AMK: curses uses the American spelling
+'color', instead of the Canadian/British spelling 'colour'. If you're
+like me, you'll have to resign yourself to misspelling it for the sake
+of these functions.)
+
+The curses library maintains a finite number of color pairs,
+containing a foreground (or text) color and a background color. You
+can get the attribute value corresponding to a color pair with the
+\function{color_pair()} function; this can be bitwise-OR'ed with other
+attributes such as \constant{A_REVERSE}, but again, such combinations
+are not guaranteed to work on all terminals.
+
+An example, which displays a line of text using color pair 1:
+
+\begin{verbatim}
+stdscr.addstr( "Pretty text", curses.color_pair(1) )
+stdscr.refresh()
+\end{verbatim}
+
+As I said before, a color pair consists of a foreground and
+background color. \function{start_color()} initializes 8 basic
+colors when it activates color mode. They are: 0:black, 1:red,
+2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white. The curses
+module defines named constants for each of these colors:
+\constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so
+forth.
+
+The \function{init_pair(\var{n, f, b})} function changes the
+definition of color pair \var{n}, to foreground color \var{f} and
+background color \var{b}. Color pair 0 is hard-wired to white on black,
+and cannot be changed.
+
+Let's put all this together. To change color pair 1 to red
+text on a white background, you would call:
+
+\begin{verbatim}
+curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
+\end{verbatim}
+
+When you change a color pair, any text already displayed using that
+color pair will change to the new colors. You can also display new
+text in this color with:
+
+\begin{verbatim}
+stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
+\end{verbatim}
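+
+Putting the color calls together, a complete function might look like
+the sketch below. It is an illustration under stated assumptions: the
+name \code{red_alert} and the bold fallback for monochrome terminals
+are choices made here, not something the curses library prescribes,
+and the code assumes Python 3.
+
```python
import curses


def red_alert(stdscr):
    """Show a warning in red on white when the terminal supports color."""
    attr = curses.A_BOLD              # assumed fallback for monochrome terminals
    if curses.has_colors():
        # Redefine pair 1; pair 0 is fixed and cannot be changed.
        curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
        attr = curses.color_pair(1)
    stdscr.addstr(0, 0, "RED ALERT!", attr)
    stdscr.refresh()
    stdscr.getch()                    # wait for a key before returning


# Typical use, which needs a real terminal. curses.wrapper() calls
# initscr() and tries start_color() before invoking the function:
#     curses.wrapper(red_alert)
```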
+
+Very fancy terminals can change the definitions of the actual colors
+to a given RGB value. This lets you change color 1, which is usually
+red, to purple or blue or any other color you like. Unfortunately,
+the Linux console doesn't support this, so I'm unable to try it out,
+and can't provide any examples. You can check if your terminal can do
+this by calling \function{can_change_color()}, which returns TRUE if
+the capability is there. If you're lucky enough to have such a
+talented terminal, consult your system's man pages for more
+information.
+
+\section{User Input}
+
+The curses library itself offers only very simple input mechanisms.
+Python's support adds a text-input widget that makes up some of the
+lack.
+
+The most common way to get input to a window is to use its
+\method{getch()} method, which pauses and waits for the user to hit
+a key, displaying it if \function{echo()} has been called earlier.
+You can optionally specify a coordinate to which the cursor should be
+moved before pausing.
+
+It's possible to change this behavior with the method
+\method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for
+the window becomes non-blocking and returns ERR (-1) when no input is
+ready. There's also a \function{halfdelay()} function, which can be
+used to (in effect) set a timer on each \method{getch()}; if no input
+becomes available within the delay (measured in tenths of a second) specified as the
+argument to \function{halfdelay()}, curses throws an exception.
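+
+A non-blocking read can be wrapped as in the sketch below.
+\code{poll_key} is a hypothetical helper name; it assumes Python 3 and
+that \code{win.nodelay(1)} has already been called on the window.
+
```python
def poll_key(win):
    """Return the pending key code, or None when no input is waiting.

    Assumes win.nodelay(1) was called earlier, so getch() returns -1
    instead of blocking. `poll_key` is a hypothetical helper.
    """
    c = win.getch()
    return None if c == -1 else c
```
+
+A program can call such a helper once per pass through its main loop
+and carry on with other work whenever it returns \code{None}.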
+
+The \method{getch()} method returns an integer; if it's between 0 and
+255, it represents the ASCII code of the key pressed. Values greater
+than 255 are special keys such as Page Up, Home, or the cursor keys.
+You can compare the value returned to constants such as
+\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or
+\constant{curses.KEY_LEFT}. Usually the main loop of your program
+will look something like this:
+
+\begin{verbatim}
+while 1:
+    c = stdscr.getch()
+    if c == ord('p'): PrintDocument()
+    elif c == ord('q'): break  # Exit the while()
+    elif c == curses.KEY_HOME: x = y = 0
+\end{verbatim}
+
+The \module{curses.ascii} module supplies ASCII class membership
+functions that take either integer or 1-character-string
+arguments; these may be useful in writing more readable tests for
+your command interpreters. It also supplies conversion functions
+that take either integer or 1-character-string arguments and return
+the same type. For example, \function{curses.ascii.ctrl()} returns
+the control character corresponding to its argument.
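+
+For instance, the following sketch (assuming Python 3; the variable
+names are illustrative) shows both kinds of function accepting either
+argument type:
+
```python
import curses.ascii

# Class-membership tests accept an integer or a one-character string.
assert curses.ascii.isdigit("7")
assert curses.ascii.isdigit(ord("7"))
assert not curses.ascii.isdigit("x")

# Conversion functions return the same type they were given.
ctrl_a_str = curses.ascii.ctrl("a")       # a one-character string
ctrl_a_int = curses.ascii.ctrl(ord("a"))  # an integer
```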
+
+There's also a method to retrieve an entire string,
+\method{getstr()}. It isn't used very often, because its
+functionality is quite limited; the only editing keys available are
+the backspace key and the Enter key, which terminates the string. It
+can optionally be limited to a fixed number of characters.
+
+\begin{verbatim}
+curses.echo() # Enable echoing of characters
+
+# Get a 15-character string, with the cursor on the top line
+s = stdscr.getstr(0,0, 15)
+\end{verbatim}
+
+The Python \module{curses.textpad} module supplies something better.
+With it, you can turn a window into a text box that supports an
+Emacs-like set of keybindings. Various methods of the \class{Textbox}
+class support editing with input validation and gathering the edit
+results either with or without trailing spaces. See the library
+documentation on \module{curses.textpad} for the details.
+
+\section{For More Information}
+
+This HOWTO didn't cover some advanced topics, such as screen-scraping
+or capturing mouse events from an xterm instance. But the Python
+library page for the curses modules is now pretty complete. You
+should browse it next.
+
+If you're in doubt about the detailed behavior of any of the ncurses
+entry points, consult the manual pages for your curses implementation,
+whether it's ncurses or a proprietary Unix vendor's. The manual pages
+will document any quirks, and provide complete lists of all the
+functions, attributes, and \constant{ACS_*} characters available to
+you.
+
+Because the curses API is so large, some functions aren't supported in
+the Python interface, not because they're difficult to implement, but
+because no one has needed them yet. Feel free to add them and then
+submit a patch. Also, we don't yet have support for the menus or
+panels libraries associated with ncurses; feel free to add that.
+
+If you write an interesting little program, feel free to contribute it
+as another demo. We can always use more of them!
+
+The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html}
+
+\end{document}
diff --git a/Doc/howto/doanddont.tex b/Doc/howto/doanddont.tex
new file mode 100644
index 0000000..adbde66
--- /dev/null
+++ b/Doc/howto/doanddont.tex
@@ -0,0 +1,343 @@
+\documentclass{howto}
+
+\title{Idioms and Anti-Idioms in Python}
+
+\release{0.00}
+
+\author{Moshe Zadka}
+\authoraddress{howto@zadka.site.co.il}
+
+\begin{document}
+\maketitle
+
+This document is placed in the public domain.
+
+\begin{abstract}
+\noindent
+This document can be considered a companion to the tutorial. It
+shows how to use Python, and even more importantly, how {\em not}
+to use Python.
+\end{abstract}
+
+\tableofcontents
+
+\section{Language Constructs You Should Not Use}
+
+While Python has relatively few gotchas compared to other languages, it
+still has some constructs which are only useful in corner cases, or are
+plain dangerous.
+
+\subsection{from module import *}
+
+\subsubsection{Inside Function Definitions}
+
+\code{from module import *} is {\em invalid} inside function definitions.
+While many versions of Python do not check for the invalidity, it does not
+make it more valid, no more than having a smart lawyer makes a man innocent.
+Do not use it like that ever. Even in versions where it was accepted, it made
+the function execution slower, because the compiler could not be certain
+which names are local and which are global. In Python 2.1 this construct
+causes warnings, and sometimes even errors.
+
+\subsubsection{At Module Level}
+
+While it is valid to use \code{from module import *} at module level it
+is usually a bad idea. For one, this loses an important property Python
+otherwise has --- you can know where each toplevel name is defined by
+a simple "search" function in your favourite editor. You also open yourself
+to trouble in the future, if some module grows additional functions or
+classes.
+
+One of the most awful questions asked on the newsgroup is why this code:
+
+\begin{verbatim}
+f = open("www")
+f.read()
+\end{verbatim}
+
+does not work. Of course, it works just fine (assuming you have a file
+called "www".) But it does not work if somewhere in the module, the
+statement \code{from os import *} is present. The \module{os} module
+has a function called \function{open()} which returns an integer. While
+it is very useful, shadowing builtins is one of its least useful properties.
+
+Remember, you can never know for sure what names a module exports, so either
+take what you need --- \code{from module import name1, name2}, or keep them in
+the module and access on a per-need basis ---
+\code{import module;print module.name}.
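+
+The difference between the two \function{open()} functions can be
+seen in the sketch below. It assumes Python 3, and the temporary file
+it creates is purely illustrative.
+
```python
import os
import tempfile

# Create a small file to open (the temporary path is illustrative).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello\n")
    path = tmp.name

f = open(path)                   # builtin open(): returns a file object
builtin_result = f.read()
f.close()

fd = os.open(path, os.O_RDONLY)  # os.open(): returns a bare integer descriptor
fd_is_int = isinstance(fd, int)  # an int has no read() method
os.close(fd)
os.remove(path)
```
+
+If the name \code{open} had been silently rebound to
+\function{os.open()}, the \code{f.read()} call would fail because
+integers have no such method.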
+
+\subsubsection{When It Is Just Fine}
+
+There are situations in which \code{from module import *} is just fine:
+
+\begin{itemize}
+
+\item The interactive prompt. For example, \code{from math import *} makes
+ Python an amazing scientific calculator.
+
+\item When extending a module in C with a module in Python.
+
+\item When the module advertises itself as \code{from import *} safe.
+
+\end{itemize}
+
+\subsection{Unadorned \keyword{exec}, \function{execfile} and friends}
+
+The word ``unadorned'' refers to the use without an explicit dictionary,
+in which case those constructs evaluate code in the {\em current} environment.
+This is dangerous for the same reasons \code{from import *} is dangerous ---
+it might step over variables you are counting on and mess up things for
+the rest of your code. Simply do not do that.
+
+Bad examples:
+
+\begin{verbatim}
+>>> for name in sys.argv[1:]:
+>>>     exec "%s=1" % name
+>>> def func(s, **kw):
+>>>     for var, val in kw.items():
+>>>         exec "s.%s=val" % var  # invalid!
+>>> execfile("handler.py")
+>>> handle()
+\end{verbatim}
+\end{verbatim}
+
+Good examples:
+
+\begin{verbatim}
+>>> d = {}
+>>> for name in sys.argv[1:]:
+>>>     d[name] = 1
+>>> def func(s, **kw):
+>>>     for var, val in kw.items():
+>>>         setattr(s, var, val)
+>>> d={}
+>>> execfile("handler.py", d, d)
+>>> handle = d['handle']
+>>> handle()
+\end{verbatim}
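+
+The examples above use the Python 2 \keyword{exec} statement. As a
+hedged sketch of the same idea for Python 3, where \function{exec()}
+is an ordinary function, the code below runs a source string in an
+explicit dictionary. The \code{source} string stands in for the
+contents of a handler file and is an assumption of this example.
+
```python
# Python 3 sketch: exec() is a function and takes the namespace explicitly.
# `source` is an illustrative stand-in for the contents of a handler file.
source = "def handle():\n    return 'handled'\n"

namespace = {}
exec(source, namespace)       # names land in `namespace`, not in our globals
handle = namespace["handle"]  # pull out only the name we actually want
result = handle()
```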
+
+\subsection{from module import name1, name2}
+
+This is a ``don't'' which is much weaker than the previous ``don't''s
+but is still something you should not do if you don't have good reasons
+to do that. The reason it is usually a bad idea is because you suddenly
+have an object which lives in two separate namespaces. When the binding
+in one namespace changes, the binding in the other will not, so there
+will be a discrepancy between them. This happens when, for example,
+one module is reloaded, or changes the definition of a function at runtime.
+
+Bad example:
+
+\begin{verbatim}
+# foo.py
+a = 1
+
+# bar.py
+from foo import a
+if something():
+    a = 2 # danger: foo.a != a
+\end{verbatim}
+
+Good example:
+
+\begin{verbatim}
+# foo.py
+a = 1
+
+# bar.py
+import foo
+if something():
+    foo.a = 2
+\end{verbatim}
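+
+The discrepancy can be reproduced in a single file. The sketch below
+assumes Python 3 and builds a stand-in module object with
+\code{types.ModuleType} in place of a real \file{foo.py}; that
+substitution is an assumption made only so the example is
+self-contained.
+
```python
import types

# Stand-in for foo.py; a real program would simply `import foo`.
foo = types.ModuleType("foo")
foo.a = 1

a = foo.a      # what `from foo import a` does: a second, separate binding
a = 2          # rebinds only the local name
stale = foo.a  # the module still holds the old value

foo.a = 2      # the qualified form updates the one shared binding
```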
+
+\subsection{except:}
+
+Python has the \code{except:} clause, which catches all exceptions.
+Since {\em every} error in Python raises an exception, this makes many
+programming errors look like runtime problems, and hinders
+the debugging process.
+
+The following code shows a great example:
+
+\begin{verbatim}
+try:
+    foo = opne("file") # misspelled "open"
+except:
+    sys.exit("could not open file!")
+\end{verbatim}
+
+The second line triggers a \exception{NameError} which is caught by the
+except clause. The program will exit, and you will have no idea that
+this has nothing to do with the readability of \code{"file"}.
+
+The example above is better written
+
+\begin{verbatim}
+try:
+    foo = opne("file") # will be changed to "open" as soon as we run it
+except IOError:
+    sys.exit("could not open file")
+\end{verbatim}
+
+There are some situations in which the \code{except:} clause is useful:
+for example, in a framework when running callbacks, it is good not to
+let any callback disturb the framework.
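+
+A framework-style callback runner might look like the following
+sketch. The helper \code{run_callbacks} is hypothetical, and the code
+assumes Python 3. Note that it deliberately swaps the bare
+\code{except:} for \code{except Exception:}, so that
+\exception{KeyboardInterrupt} and \exception{SystemExit} still
+propagate, and it logs each failure so the error is not silently lost.
+
```python
import logging


def run_callbacks(callbacks):
    """Run each callback, logging failures instead of propagating them.

    `run_callbacks` is a hypothetical helper. It catches Exception rather
    than using a bare except:, so KeyboardInterrupt and SystemExit pass
    through untouched.
    """
    failures = 0
    for callback in callbacks:
        try:
            callback()
        except Exception:
            # Record the traceback so the programming error stays visible.
            logging.exception("callback %r failed", callback)
            failures += 1
    return failures
```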
+
+\section{Exceptions}
+
+Exceptions are a useful feature of Python. You should learn to raise
+them whenever something unexpected occurs, and catch them only where
+you can do something about them.
+
+The following is a very popular anti-idiom
+
+\begin{verbatim}
+def get_status(file):
+    if not os.path.exists(file):
+        print "file not found"
+        sys.exit(1)
+    return open(file).readline()
+\end{verbatim}
+
+Consider the case where the file gets deleted between the time the call to
+\function{os.path.exists} is made and the time \function{open} is called.
+That means the last line will throw an \exception{IOError}. The same would
+happen if \var{file} exists but has no read permission. Since testing this
+on a normal machine on existing and non-existing files makes it seem bugless,
+that means in testing the results will seem fine, and the code will get
+shipped. Then an unhandled \exception{IOError} escapes to the user, who
+has to watch the ugly traceback.
+
+Here is a better way to do it.
+
+\begin{verbatim}
+def get_status(file):
+    try:
+        return open(file).readline()
+    except (IOError, OSError):
+        print "file not found"
+        sys.exit(1)
+\end{verbatim}
+
+In this version, {\em either} the file gets opened and the line is read
+(so it works even on flaky NFS or SMB connections), or the message
+is printed and the application aborted.
+
+Still, \function{get_status} makes too many assumptions --- that it
+will only be used in a short running script, and not, say, in a long
+running server. Sure, the caller could do something like
+
+\begin{verbatim}
+try:
+    status = get_status(log)
+except SystemExit:
+    status = None
+\end{verbatim}
+
+So, try to make as few \code{except} clauses in your code as possible --- those will
+usually be a catch-all in the \function{main}, or inside calls which
+should always succeed.
+
+So, the best version is probably
+
+\begin{verbatim}
+def get_status(file):
+    return open(file).readline()
+\end{verbatim}
+
+The caller can deal with the exception if it wants (for example, if it
+tries several files in a loop), or just let the exception filter upwards
+to {\em its} caller.
+
+The last version is not very good either --- due to implementation details,
+the file would not be closed when an exception is raised until the handler
+finishes, and perhaps not at all in non-C implementations (e.g., Jython).
+
+\begin{verbatim}
+def get_status(file):
+    fp = open(file)
+    try:
+        return fp.readline()
+    finally:
+        fp.close()
+\end{verbatim}
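+
+In later Python versions the same guarantee can be written with the
+\keyword{with} statement, which did not exist when this document was
+written. The sketch below assumes Python 3; the parameter is renamed
+to \code{path} purely for illustration.
+
```python
def get_status(path):
    """Return the first line of the file, closing it even on error."""
    with open(path) as fp:  # the file is closed however the block exits
        return fp.readline()
```
+
+As with the \keyword{try}...\keyword{finally} version, any error from
+\function{open()} still propagates to the caller.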
+
+\section{Using the Batteries}
+
+Every so often, people seem to be writing stuff in the Python library
+again, usually poorly. While the occasional module has a poor interface,
+it is usually much better to use the rich standard library and data
+types that come with Python than inventing your own.
+
+A useful module very few people know about is \module{os.path}. It
+always has the correct path arithmetic for your operating system, and
+will usually be much better than whatever you come up with yourself.
+
+Compare:
+
+\begin{verbatim}
+# ugh!
+return dir+"/"+file
+# better
+return os.path.join(dir, file)
+\end{verbatim}
+
+More useful functions in \module{os.path}: \function{basename},
+\function{dirname} and \function{splitext}.
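+
+A short sketch of these functions, assuming Python 3; the path
+components are invented for the example:
+
```python
import os.path

# Build a path with the separator of the current operating system.
path = os.path.join("logs", "2005", "report.txt")  # illustrative components

directory = os.path.dirname(path)             # everything before the last separator
filename = os.path.basename(path)             # the final component
stem, extension = os.path.splitext(filename)  # split off the extension
```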
+
+There are also many useful builtin functions people seem not to be
+aware of for some reason: \function{min()} and \function{max()} can
+find the minimum/maximum of any sequence with comparable semantics,
+for example, yet many people write their own max/min. Another highly
+useful function is \function{reduce()}. Classical use of \function{reduce()}
+is something like
+
+\begin{verbatim}
+import sys, operator
+nums = map(float, sys.argv[1:])
+print reduce(operator.add, nums)/len(nums)
+\end{verbatim}
+
+This cute little script prints the average of all numbers given on the
+command line. The \function{reduce()} adds up all the numbers, and
+the rest is just some pre- and postprocessing.
+
+On a related note, \function{float()}, \function{int()} and
+\function{long()} all accept arguments of type string, and so are
+suited to parsing --- assuming you are ready to deal with the
+\exception{ValueError} they raise.
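+
+As a sketch of parsing with these functions (assuming Python 3; the
+helper names \code{parse_numbers} and \code{average} are hypothetical,
+and skipping bad input is just one possible policy):
+
```python
def parse_numbers(words):
    """Convert strings to floats, skipping any that are not numbers."""
    numbers = []
    for word in words:
        try:
            numbers.append(float(word))
        except ValueError:
            # float() rejects text such as "abc"; skip it rather than crash.
            continue
    return numbers


def average(numbers):
    """Mean of a non-empty list; sum() does the job of reduce(operator.add, ...)."""
    return sum(numbers) / len(numbers)
```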
+
+\section{Using Backslash to Continue Statements}
+
+Since Python treats a newline as a statement terminator,
+and since statements are often more than is comfortable to put
+in one line, many people do:
+
+\begin{verbatim}
+if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
+        calculate_number(10, 20) != forbulate(500, 360):
+    pass
+\end{verbatim}
+
+You should realize that this is dangerous: a stray space after the
+\code{\\} would make this line wrong, and stray spaces are notoriously
+hard to see in editors. In this case, at least it would be a syntax
+error, but if the code was:
+
+\begin{verbatim}
+value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
+ + calculate_number(10, 20)*forbulate(500, 360)
+\end{verbatim}
+
+then it would just be subtly wrong.
+
+It is usually much better to use the implicit continuation inside
+parentheses. This version is bulletproof:
+
+\begin{verbatim}
+value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
+ + calculate_number(10, 20)*forbulate(500, 360))
+\end{verbatim}
+
+\end{document}
diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex
new file mode 100644
index 0000000..5a65064
--- /dev/null
+++ b/Doc/howto/regex.tex
@@ -0,0 +1,1466 @@
+\documentclass{howto}
+
+% TODO:
+% Document lookbehind assertions
+% Better way of displaying a RE, a string, and what it matches
+% Mention optional argument to match.groups()
+% Unicode (at least a reference)
+
+\title{Regular Expression HOWTO}
+
+\release{0.05}
+
+\author{A.M. Kuchling}
+\authoraddress{\email{amk@amk.ca}}
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+\noindent
+This document is an introductory tutorial to using regular expressions
+in Python with the \module{re} module. It provides a gentler
+introduction than the corresponding section in the Library Reference.
+
+This document is available from
+\url{http://www.amk.ca/python/howto}.
+
+\end{abstract}
+
+\tableofcontents
+
+\section{Introduction}
+
+The \module{re} module was added in Python 1.5, and provides
+Perl-style regular expression patterns. Earlier versions of Python
+came with the \module{regex} module, which provides Emacs-style
+patterns. Emacs-style patterns are slightly less readable and
+don't provide as many features, so there's not much reason to use
+the \module{regex} module when writing new code, though you might
+encounter old code that uses it.
+
+Regular expressions (or REs) are essentially a tiny, highly
+specialized programming language embedded inside Python and made
+available through the \module{re} module. Using this little language,
+you specify the rules for the set of possible strings that you want to
+match; this set might contain English sentences, or e-mail addresses,
+or TeX commands, or anything you like. You can then ask questions
+such as ``Does this string match the pattern?'', or ``Is there a match
+for the pattern anywhere in this string?''. You can also use REs to
+modify a string or to split it apart in various ways.
+
+Regular expression patterns are compiled into a series of bytecodes
+which are then executed by a matching engine written in C. For
+advanced use, it may be necessary to pay careful attention to how the
+engine will execute a given RE, and write the RE in a certain way in
+order to produce bytecode that runs faster. Optimization isn't
+covered in this document, because it requires that you have a good
+understanding of the matching engine's internals.
+
+The regular expression language is relatively small and restricted, so
+not all possible string processing tasks can be done using regular
+expressions. There are also tasks that \emph{can} be done with
+regular expressions, but the expressions turn out to be very
+complicated. In these cases, you may be better off writing Python
+code to do the processing; while Python code will be slower than an
+elaborate regular expression, it will also probably be more understandable.
+
+\section{Simple Patterns}
+
+We'll start by learning about the simplest possible regular
+expressions. Since regular expressions are used to operate on
+strings, we'll begin with the most common task: matching characters.
+
+For a detailed explanation of the computer science underlying regular
+expressions (deterministic and non-deterministic finite automata), you
+can refer to almost any textbook on writing compilers.
+
+\subsection{Matching Characters}
+
+Most letters and characters will simply match themselves. For
+example, the regular expression \regexp{test} will match the string
+\samp{test} exactly. (You can enable a case-insensitive mode that
+would let this RE match \samp{Test} or \samp{TEST} as well; more
+about this later.)
+
+There are exceptions to this rule; some characters are
+special, and don't match themselves. Instead, they signal that some
+out-of-the-ordinary thing should be matched, or they affect other
+portions of the RE by repeating them. Much of this document is
+devoted to discussing various metacharacters and what they do.
+
+Here's a complete list of the metacharacters; their meanings will be
+discussed in the rest of this HOWTO.
+
+\begin{verbatim}
+. ^ $ * + ? { [ ] \ | ( )
+\end{verbatim}
+% $
+
+The first metacharacters we'll look at are \samp{[} and \samp{]}.
+They're used for specifying a character class, which is a set of
+characters that you wish to match. Characters can be listed
+individually, or a range of characters can be indicated by giving two
+characters and separating them by a \character{-}. For example,
+\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
+\samp{c}; this is the same as
+\regexp{[a-c]}, which uses a range to express the same set of
+characters. If you wanted to match only lowercase letters, your
+RE would be \regexp{[a-z]}.
+
+Metacharacters are not active inside classes. For example,
+\regexp{[akm\$]} will match any of the characters \character{a},
+\character{k}, \character{m}, or \character{\$}; \character{\$} is
+usually a metacharacter, but inside a character class it's stripped of
+its special nature.
+
+You can match the characters not within a range by \dfn{complementing}
+the set. This is indicated by including a \character{\^} as the first
+character of the class; \character{\^} elsewhere will simply match the
+\character{\^} character. For example, \verb|[^5]| will match any
+character except \character{5}.
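+
+These class rules can be tried out with \function{re.findall()}. The
+sketch below assumes Python 3, and the sample strings are invented
+for illustration.
+
```python
import re

# A class matches any one of the listed characters; a range is shorthand.
listed = re.findall(r"[abc]", "aXbYcZ")
ranged = re.findall(r"[a-c]", "aXbYcZ")

# Inside a class, $ is an ordinary character.
dollars = re.findall(r"[akm$]", "a $5 mark")

# A leading ^ complements the class: anything except 5.
not_five = re.findall(r"[^5]", "1525")
```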
+
+Perhaps the most important metacharacter is the backslash, \samp{\e}.
+As in Python string literals, the backslash can be followed by various
+characters to signal various special sequences. It's also used to escape
+all the metacharacters so you can still match them in patterns; for
+example, if you need to match a \samp{[} or
+\samp{\e}, you can precede them with a backslash to remove their
+special meaning: \regexp{\e[} or \regexp{\e\e}.
+
+Some of the special sequences beginning with \character{\e} represent
+predefined sets of characters that are often useful, such as the set
+of digits, the set of letters, or the set of anything that isn't
+whitespace. The following predefined special sequences are available:
+
+\begin{itemize}
+\item[\code{\e d}]Matches any decimal digit; this is
+equivalent to the class \regexp{[0-9]}.
+
+\item[\code{\e D}]Matches any non-digit character; this is
+equivalent to the class \verb|[^0-9]|.
+
+\item[\code{\e s}]Matches any whitespace character; this is
+equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
+
+\item[\code{\e S}]Matches any non-whitespace character; this is
+equivalent to the class \verb|[^ \t\n\r\f\v]|.
+
+\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
+\regexp{[a-zA-Z0-9_]}.
+
+\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
+\verb|[^a-zA-Z0-9_]|.
+\end{itemize}
+
+These sequences can be included inside a character class. For
+example, \regexp{[\e s,.]} is a character class that will match any
+whitespace character, or \character{,} or \character{.}.
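+
+A quick sketch of the predefined sequences, assuming Python 3 and an
+invented sample string. Note that in Python 3 these sequences also
+match non-ASCII digits, letters, and whitespace in \code{str} patterns
+unless the \code{re.ASCII} flag is given, so the equivalences listed
+above describe the ASCII case.
+
```python
import re

text = "a1, b_2."  # illustrative sample

digits = re.findall(r"\d", text)            # decimal digits
word_chars = re.findall(r"\w", text)        # alphanumerics and underscore
separators = re.findall(r"[\s,.]", text)    # \s used inside a character class
```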
+
+The final metacharacter in this section is \regexp{.}. It matches
+anything except a newline character, and there's an alternate mode
+(\code{re.DOTALL}) where it will match even a newline. \character{.}
+is often used where you want to match ``any character''.
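+
+The effect of \code{re.DOTALL} can be checked with a two-line sketch
+(assuming Python 3; the variable names are illustrative):
+
```python
import re

no_match = re.match(r"a.b", "a\nb")                 # . refuses the newline
dotall_match = re.match(r"a.b", "a\nb", re.DOTALL)  # with DOTALL it matches
```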
+
+\subsection{Repeating Things}
+
+Being able to match varying sets of characters is the first thing
+regular expressions can do that isn't already possible with the
+methods available on strings. However, if that was the only
+additional capability of regexes, they wouldn't be much of an advance.
+Another capability is that you can specify that portions of the RE
+must be repeated a certain number of times.
+
+The first metacharacter for repeating things that we'll look at is
+\regexp{*}. \regexp{*} doesn't match the literal character \samp{*};
+instead, it specifies that the previous character can be matched zero
+or more times, instead of exactly once.
+
+For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
+characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
+characters), and so forth. The RE engine has various internal
+limitations stemming from the size of C's \code{int} type, that will
+prevent it from matching over 2 billion \samp{a} characters; you
+probably don't have enough memory to construct a string that large, so
+you shouldn't run into that limit.
+
+Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
+the matching engine will try to repeat it as many times as possible.
+If later portions of the pattern don't match, the matching engine will
+then back up and try again with fewer repetitions.
+
+A step-by-step example will make this more obvious. Let's consider
+the expression \regexp{a[bcd]*b}. This matches the letter
+\character{a}, zero or more letters from the class \code{[bcd]}, and
+finally ends with a \character{b}. Now imagine matching this RE
+against the string \samp{abcbd}.
+
+\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
+\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
+\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
+it can, which is to the end of the string.}
+\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
+current position is at the end of the string, so it fails.}
+\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
+one less character.}
+\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
+current position is at the last character, which is a \character{d}.}
+\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
+only matching \samp{bc}.}
+\lineiii{7}{\code{abcb}}{Try \regexp{b} again. This time
+the character at the current position is \character{b}, so it succeeds.}
+\end{tableiii}
+
+The end of the RE has now been reached, and it has matched
+\samp{abcb}. This demonstrates how the matching engine goes as far as
+it can at first, and if no match is found it will then progressively
+back up and retry the rest of the RE again and again. It will back up
+until it has tried zero matches for \regexp{[bcd]*}, and if that
+subsequently fails, the engine will conclude that the string doesn't
+match the RE at all.
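+
+The walkthrough above can be checked with a minimal sketch. It is
+written in Python 3 syntax, which is an assumption, since the sessions
+in this HOWTO use Python 2; the variable names are purely illustrative.
+
```python
import re

# Greedy matching: [bcd]* first consumes "bcbd", then gives characters
# back one at a time until the final "b" in the pattern can match.
m = re.match(r'a[bcd]*b', 'abcbd')
greedy_result = m.group()

# When no amount of backing up helps, the overall match fails and
# match() returns None.
no_match = re.match(r'a[bcd]*b', 'acd')

print(greedy_result)
print(no_match)
```
+
+The first result is the same \samp{abcb} reached in the last step of the
+table; the second shows the engine giving up after trying zero
+repetitions.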
+
+Another repeating metacharacter is \regexp{+}, which matches one or
+more times. Pay careful attention to the difference between
+\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
+times, so whatever's being repeated may not be present at all, while
+\regexp{+} requires at least \emph{one} occurrence. To use a similar
+example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
+\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
+
+There are two more repeating qualifiers. The question mark character,
+\regexp{?}, matches either once or zero times; you can think of it as
+marking something as being optional. For example, \regexp{home-?brew}
+matches either \samp{homebrew} or \samp{home-brew}.
+
+The most complicated repeating qualifier is
+\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
+integers. This qualifier means there must be at least \var{m}
+repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b}
+will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match
+\samp{ab}, which has no slashes, or \samp{a////b}, which has four.
+
+You can omit either \var{m} or \var{n}; in that case, a reasonable
+value is assumed for the missing value. Omitting \var{m} is
+interpreted as a lower limit of 0, while omitting \var{n} results in an
+upper bound of infinity --- actually, the 2 billion limit mentioned
+earlier, but that might as well be infinity.
+
+Readers of a reductionist bent may notice that the three other qualifiers
+can all be expressed using this notation. \regexp{\{0,\}} is the same
+as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
+\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use
+\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
+they're shorter and easier to read.
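+
+As a rough check of these equivalences, the sketch below (Python 3
+syntax, an assumption; the sample strings and variable names are
+illustrative) compares each short qualifier with its brace form and
+also tries the \regexp{a/\{1,3\}b} example.
+
```python
import re

samples = ['ct', 'cat', 'caaat']

# Each pair should accept exactly the same sample strings.
pairs = [(r'ca*t', r'ca{0,}t'),
         (r'ca+t', r'ca{1,}t'),
         (r'ca?t', r'ca{0,1}t')]

results = {}
for short, braces in pairs:
    short_hits = [s for s in samples if re.match(short, s)]
    brace_hits = [s for s in samples if re.match(braces, s)]
    results[short] = (short_hits, brace_hits)

# Between one and three slashes are allowed, no fewer and no more.
slash = re.compile(r'a/{1,3}b')
slash_hits = [s for s in ['ab', 'a/b', 'a//b', 'a///b', 'a////b']
              if slash.match(s)]

print(results)
print(slash_hits)
```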
+
+\section{Using Regular Expressions}
+
+Now that we've looked at some simple regular expressions, how do we
+actually use them in Python? The \module{re} module provides an
+interface to the regular expression engine, allowing you to compile
+REs into objects and then perform matches with them.
+
+\subsection{Compiling Regular Expressions}
+
+Regular expressions are compiled into \class{RegexObject} instances,
+which have methods for various operations such as searching for
+pattern matches or performing string substitutions.
+
+\begin{verbatim}
+>>> import re
+>>> p = re.compile('ab*')
+>>> print p
+<re.RegexObject instance at 80b4150>
+\end{verbatim}
+
+\function{re.compile()} also accepts an optional \var{flags}
+argument, used to enable various special features and syntax
+variations. We'll go over the available settings later, but for now a
+single example will do:
+
+\begin{verbatim}
+>>> p = re.compile('ab*', re.IGNORECASE)
+\end{verbatim}
+
+The RE is passed to \function{re.compile()} as a string. REs are
+handled as strings because regular expressions aren't part of the core
+Python language, and no special syntax was created for expressing
+them. (There are applications that don't need REs at all, so there's
+no need to bloat the language specification by including them.)
+Instead, the \module{re} module is simply a C extension module
+included with Python, just like the \module{socket} or \module{zlib}
+module.
+
+Putting REs in strings keeps the Python language simpler, but has one
+disadvantage which is the topic of the next section.
+
+\subsection{The Backslash Plague}
+
+As stated earlier, regular expressions use the backslash
+character (\character{\e}) to indicate special forms or to allow
+special characters to be used without invoking their special meaning.
+This conflicts with Python's usage of the same character for the same
+purpose in string literals.
+
+Let's say you want to write a RE that matches the string
+\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure
+out what to write in the program code, start with the desired string
+to be matched. Next, you must escape any backslashes and other
+metacharacters by preceding them with a backslash, resulting in the
+string \samp{\e\e section}. The string that must be passed
+to \function{re.compile()} is therefore \verb|\\section|. However, to
+express this as a Python string literal, both backslashes must be
+escaped \emph{again}.
+
+\begin{tableii}{c|l}{code}{Characters}{Stage}
+ \lineii{\e section}{Text string to be matched}
+ \lineii{\e\e section}{Escaped backslash for \function{re.compile}}
+ \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
+\end{tableii}
+
+In short, to match a literal backslash, one has to write
+\code{'\e\e\e\e'} as the RE string, because the regular expression
+must be \samp{\e\e}, and each backslash must be expressed as
+\samp{\e\e} inside a regular Python string literal. In REs that
+feature backslashes repeatedly, this leads to lots of repeated
+backslashes and makes the resulting strings difficult to understand.
+
+The solution is to use Python's raw string notation for regular
+expressions; backslashes are not handled in any special way in
+a string literal prefixed with \character{r}, so \code{r"\e n"} is a
+two-character string containing \character{\e} and \character{n},
+while \code{"\e n"} is a one-character string containing a newline.
+Frequently regular expressions will be expressed in Python
+code using this raw string notation.
+
+\begin{tableii}{c|c}{code}{Regular String}{Raw string}
+ \lineii{"ab*"}{\code{r"ab*"}}
+ \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
+ \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
+\end{tableii}
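+
+The table can be verified with a short sketch (Python 3 syntax, an
+assumption; the sample text and variable names are hypothetical). It
+shows that the ordinary and raw literals produce the same RE string.
+
```python
import re

# The same regular expression written as two different literals.
regular = "\\\\section"   # ordinary literal: four backslashes in the source
raw = r"\\section"        # raw literal: two backslashes in the source

same = (regular == raw)   # both are the 9-character string \\section

# The RE matches a single literal backslash followed by "section".
m = re.search(raw, r"see \section{Intro}")
matched = m.group()

# Raw strings leave backslashes alone; ordinary strings interpret them.
raw_len = len(r"\n")      # backslash and "n"
plain_len = len("\n")     # a single newline character

print(same, matched, raw_len, plain_len)
```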
+
+\subsection{Performing Matches}
+
+Once you have an object representing a compiled regular expression,
+what do you do with it? \class{RegexObject} instances have several
+methods and attributes. Only the most significant ones will be
+covered here; consult \ulink{the Library
+Reference}{http://www.python.org/doc/lib/module-re.html} for a
+complete listing.
+
+\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
+ \lineii{match()}{Determine if the RE matches at the beginning of
+ the string.}
+ \lineii{search()}{Scan through a string, looking for any location
+ where this RE matches.}
+ \lineii{findall()}{Find all substrings where the RE matches,
+and return them as a list.}
+ \lineii{finditer()}{Find all substrings where the RE matches,
+and return them as an iterator.}
+\end{tableii}
+
+\method{match()} and \method{search()} return \code{None} if no match
+can be found. If they're successful, a \code{MatchObject} instance is
+returned, containing information about the match: where it starts and
+ends, the substring it matched, and more.
+
+You can learn about this by interactively experimenting with the
+\module{re} module. If you have Tkinter available, you may also want
+to look at \file{Tools/scripts/redemo.py}, a demonstration program
+included with the Python distribution. It allows you to enter REs and
+strings, and displays whether the RE matches or fails.
+\file{redemo.py} can be quite useful when trying to debug a
+complicated RE. Phil Schwartz's
+\ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive
+tool for developing and testing RE patterns. This HOWTO will use the
+standard Python interpreter for its examples.
+
+First, run the Python interpreter, import the \module{re} module, and
+compile a RE:
+
+\begin{verbatim}
+Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
+>>> import re
+>>> p = re.compile('[a-z]+')
+>>> p
+<_sre.SRE_Pattern object at 80c3c28>
+\end{verbatim}
+
+Now, you can try matching various strings against the RE
+\regexp{[a-z]+}. An empty string shouldn't match at all, since
+\regexp{+} means ``one or more repetitions''. \method{match()} should
+return \code{None} in this case, which will cause the interpreter to
+print no output. You can explicitly print the result of
+\method{match()} to make this clear.
+
+\begin{verbatim}
+>>> p.match("")
+>>> print p.match("")
+None
+\end{verbatim}
+
+Now, let's try it on a string that it should match, such as
+\samp{tempo}. In this case, \method{match()} will return a
+\class{MatchObject}, so you should store the result in a variable for
+later use.
+
+\begin{verbatim}
+>>> m = p.match('tempo')
+>>> print m
+<_sre.SRE_Match object at 80c4f68>
+\end{verbatim}
+
+Now you can query the \class{MatchObject} for information about the
+matching string. \class{MatchObject} instances also have several
+methods and attributes; the most important ones are:
+
+\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
+ \lineii{group()}{Return the string matched by the RE}
+ \lineii{start()}{Return the starting position of the match}
+ \lineii{end()}{Return the ending position of the match}
+ \lineii{span()}{Return a tuple containing the (start, end) positions
+ of the match}
+\end{tableii}
+
+Trying these methods will soon clarify their meaning:
+
+\begin{verbatim}
+>>> m.group()
+'tempo'
+>>> m.start(), m.end()
+(0, 5)
+>>> m.span()
+(0, 5)
+\end{verbatim}
+
+\method{group()} returns the substring that was matched by the
+RE. \method{start()} and \method{end()} return the starting and
+ending index of the match. \method{span()} returns both start and end
+indexes in a single tuple. Since the \method{match} method only
+checks if the RE matches at the start of a string,
+\method{start()} will always be zero. However, the \method{search}
+method of \class{RegexObject} instances scans through the string, so
+the match may not start at zero in that case.
+
+\begin{verbatim}
+>>> print p.match('::: message')
+None
+>>> m = p.search('::: message') ; print m
+<re.MatchObject instance at 80c9650>
+>>> m.group()
+'message'
+>>> m.span()
+(4, 11)
+\end{verbatim}
+
+In actual programs, the most common style is to store the
+\class{MatchObject} in a variable, and then check if it was
+\code{None}. This usually looks like:
+
+\begin{verbatim}
+p = re.compile( ... )
+m = p.match( 'string goes here' )
+if m:
+ print 'Match found: ', m.group()
+else:
+ print 'No match'
+\end{verbatim}
+
+Two \class{RegexObject} methods return all of the matches for a pattern.
+\method{findall()} returns a list of matching strings:
+
+\begin{verbatim}
+>>> p = re.compile(r'\d+')
+>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
+['12', '11', '10']
+\end{verbatim}
+
+\method{findall()} has to create the entire list before it can be
+returned as the result. As of Python 2.2, the \method{finditer()} method
+is also available, returning a sequence of \class{MatchObject} instances
+as an iterator.
+
+\begin{verbatim}
+>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
+>>> iterator
+<callable-iterator object at 0x401833ac>
+>>> for match in iterator:
+... print match.span()
+...
+(0, 2)
+(22, 24)
+(29, 31)
+\end{verbatim}
+
+
+\subsection{Module-Level Functions}
+
+You don't have to produce a \class{RegexObject} and call its methods;
+the \module{re} module also provides top-level functions called
+\function{match()}, \function{search()}, \function{sub()}, and so
+forth. These functions take the same arguments as the corresponding
+\class{RegexObject} method, with the RE string added as the first
+argument, and still return either \code{None} or a \class{MatchObject}
+instance.
+
+\begin{verbatim}
+>>> print re.match(r'From\s+', 'Fromage amk')
+None
+>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
+<re.MatchObject instance at 80c5978>
+\end{verbatim}
+
+Under the hood, these functions simply produce a \class{RegexObject}
+for you and call the appropriate method on it. They also store the
+compiled object in a cache, so future calls using the same
+RE are faster.
+
+Should you use these module-level functions, or should you get the
+\class{RegexObject} and call its methods yourself? That choice
+depends on how frequently the RE will be used, and on your personal
+coding style. If a RE is being used at only one point in the code,
+then the module functions are probably more convenient. If a program
+contains a lot of regular expressions, or re-uses the same ones in
+several locations, then it might be worthwhile to collect all the
+definitions in one place, in a section of code that compiles all the
+REs ahead of time. To take an example from the standard library,
+here's an extract from \file{xmllib.py}:
+
+\begin{verbatim}
+ref = re.compile( ... )
+entityref = re.compile( ... )
+charref = re.compile( ... )
+starttagopen = re.compile( ... )
+\end{verbatim}
+
+I generally prefer to work with the compiled object, even for
+one-time uses, but few people will be as much of a purist about this
+as I am.
+
+\subsection{Compilation Flags}
+
+Compilation flags let you modify some aspects of how regular
+expressions work. Flags are available in the \module{re} module under
+two names, a long name such as \constant{IGNORECASE}, and a short,
+one-letter form such as \constant{I}. (If you're familiar with Perl's
+pattern modifiers, the one-letter forms use the same letters; the
+short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
+Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
+re.M} sets both the \constant{I} and \constant{M} flags, for example.
+
+Here's a table of the available flags, followed by
+a more detailed explanation of each one.
+
+\begin{tableii}{c|l}{}{Flag}{Meaning}
+ \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
+ character, including newlines}
+ \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
+ \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
+ \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
+ affecting \regexp{\^} and \regexp{\$}}
+ \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
+ which can be organized more cleanly and understandably.}
+\end{tableii}
+
+\begin{datadesc}{I}
+\dataline{IGNORECASE}
+Perform case-insensitive matching; character class and literal strings
+will match
+letters by ignoring case. For example, \regexp{[A-Z]} will match
+lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
+\samp{spam}, or \samp{spAM}.
+This lowercasing doesn't take the current locale into account; it will
+if you also set the \constant{LOCALE} flag.
+\end{datadesc}
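+
+A minimal sketch of \constant{IGNORECASE} in action (Python 3 syntax,
+an assumption; the word list is made up for illustration):
+
```python
import re

p = re.compile(r'spam', re.IGNORECASE)

# Every capitalisation of "spam" matches; an unrelated word does not.
words = ['Spam', 'spam', 'spAM', 'ham']
hits = [w for w in words if p.match(w)]

# With the flag set, an uppercase-only class also matches lowercase letters.
upper_class = re.match(r'[A-Z]+', 'abc', re.I).group()

print(hits)
print(upper_class)
```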
+
+\begin{datadesc}{L}
+\dataline{LOCALE}
+Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
+and \regexp{\e B} dependent on the current locale.
+
+Locales are a feature of the C library intended to help in writing
+programs that take account of language differences. For example, if
+you're processing French text, you'd want to be able to write
+\regexp{\e w+} to match words, but \regexp{\e w} only matches the
+character class \regexp{[A-Za-z0-9_]}; it won't match \character{\'e} or
+\character{\c c}. If your system is configured properly and a French
+locale is selected, certain C functions will tell the program that
+\character{\'e} should also be considered a letter. Setting the
+\constant{LOCALE} flag when compiling a regular expression will cause the
+resulting compiled object to use these C functions for \regexp{\e w};
+this is slower, but also enables \regexp{\e w+} to match French words as
+you'd expect.
+\end{datadesc}
+
+\begin{datadesc}{M}
+\dataline{MULTILINE}
+(\regexp{\^} and \regexp{\$} haven't been explained yet;
+they'll be introduced in section~\ref{more-metacharacters}.)
+
+Usually \regexp{\^} matches only at the beginning of the string, and
+\regexp{\$} matches only at the end of the string and immediately before the
+newline (if any) at the end of the string. When this flag is
+specified, \regexp{\^} matches at the beginning of the string and at
+the beginning of each line within the string, immediately following
+each newline. Similarly, the \regexp{\$} metacharacter matches both at
+the end of the string and at the end of each line (immediately
+preceding each newline).
+
+\end{datadesc}
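+
+The effect of \constant{MULTILINE} can be sketched as follows (Python 3
+syntax, an assumption; the three-line sample text is hypothetical):
+
```python
import re

text = 'first line\nsecond line\nthird line'

# Without MULTILINE, ^ matches only at the very start of the string.
default_hits = re.findall(r'^\w+', text)

# With MULTILINE, ^ also matches immediately after each newline.
multiline_hits = re.findall(r'^\w+', text, re.MULTILINE)

# Likewise, $ then matches before each newline as well as at the end.
line_ends = re.findall(r'\w+$', text, re.M)

print(default_hits)
print(multiline_hits)
print(line_ends)
```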
+
+\begin{datadesc}{S}
+\dataline{DOTALL}
+Makes the \character{.} special character match any character at all,
+including a newline; without this flag, \character{.} will match
+anything \emph{except} a newline.
+\end{datadesc}
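+
+A small sketch of the difference \constant{DOTALL} makes (Python 3
+syntax, an assumption; the two-line sample string is illustrative):
+
```python
import re

text = 'one\ntwo'

# By default "." stops at the newline, so ".*" covers only the first line.
default_match = re.match(r'.*', text).group()

# With DOTALL, "." matches the newline too, so ".*" covers the whole string.
dotall_match = re.match(r'.*', text, re.DOTALL).group()

print(repr(default_match))
print(repr(dotall_match))
```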
+
+\begin{datadesc}{X}
+\dataline{VERBOSE} This flag allows you to write regular expressions
+that are more readable by granting you more flexibility in how you can
+format them. When this flag has been specified, whitespace within the
+RE string is ignored, except when the whitespace is in a character
+class or preceded by an unescaped backslash; this lets you organize
+and indent the RE more clearly. It also enables you to put comments
+within a RE that will be ignored by the engine; comments are marked by
+a \character{\#} that's neither in a character class nor preceded by an
+unescaped backslash.
+
+For example, here's a RE that uses \constant{re.VERBOSE}; see how
+much easier it is to read?
+
+\begin{verbatim}
+charref = re.compile(r"""
+ &[#] # Start of a numeric entity reference
+ (
+ [0-9]+[^0-9] # Decimal form
+ | 0[0-7]+[^0-7] # Octal form
+ | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
+ )
+""", re.VERBOSE)
+\end{verbatim}
+
+Without the verbose setting, the RE would look like this:
+\begin{verbatim}
+charref = re.compile("&#([0-9]+[^0-9]"
+ "|0[0-7]+[^0-7]"
+ "|x[0-9a-fA-F]+[^0-9a-fA-F])")
+\end{verbatim}
+
+In the above example, Python's automatic concatenation of string
+literals has been used to break up the RE into smaller pieces, but
+it's still more difficult to understand than the version using
+\constant{re.VERBOSE}.
+
+\end{datadesc}
+
+\section{More Pattern Power}
+
+So far we've only covered a part of the features of regular
+expressions. In this section, we'll cover some new metacharacters,
+and how to use groups to retrieve portions of the text that was matched.
+
+\subsection{More Metacharacters\label{more-metacharacters}}
+
+There are some metacharacters that we haven't covered yet. Most of
+them will be covered in this section.
+
+Some of the remaining metacharacters to be discussed are
+\dfn{zero-width assertions}. They don't cause the engine to advance
+through the string; instead, they consume no characters at all,
+and simply succeed or fail. For example, \regexp{\e b} is an
+assertion that the current position is located at a word boundary; the
+position isn't changed by the \regexp{\e b} at all. This means that
+zero-width assertions should never be repeated, because if they match
+once at a given location, they can obviously be matched an infinite
+number of times.
+
+\begin{list}{}{}
+
+\item[\regexp{|}]
+Alternation, or the ``or'' operator.
+If A and B are regular expressions,
+\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
+\regexp{|} has very low precedence in order to make it work reasonably when
+you're alternating multi-character strings.
+\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
+\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
+
+To match a literal \character{|},
+use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
+
+\item[\regexp{\^}] Matches at the beginning of lines. Unless the
+\constant{MULTILINE} flag has been set, this will only match at the
+beginning of the string. In \constant{MULTILINE} mode, this also
+matches immediately after each newline within the string.
+
+For example, if you wish to match the word \samp{From} only at the
+beginning of a line, the RE to use is \verb|^From|.
+
+\begin{verbatim}
+>>> print re.search('^From', 'From Here to Eternity')
+<re.MatchObject instance at 80c1520>
+>>> print re.search('^From', 'Reciting From Memory')
+None
+\end{verbatim}
+
+%To match a literal \character{\^}, use \regexp{\e\^} or enclose it
+%inside a character class, as in \regexp{[{\e}\^]}.
+
+\item[\regexp{\$}] Matches at the end of a line, which is defined as
+either the end of the string, or any location followed by a newline
+character.
+
+\begin{verbatim}
+>>> print re.search('}$', '{block}')
+<re.MatchObject instance at 80adfa8>
+>>> print re.search('}$', '{block} ')
+None
+>>> print re.search('}$', '{block}\n')
+<re.MatchObject instance at 80adfa8>
+\end{verbatim}
+% $
+
+To match a literal \character{\$}, use \regexp{\e\$} or enclose it
+inside a character class, as in \regexp{[\$]}.
+
+\item[\regexp{\e A}] Matches only at the start of the string. When
+not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
+effectively the same. In \constant{MULTILINE} mode, however, they're
+different; \regexp{\e A} still matches only at the beginning of the
+string, but \regexp{\^} may match at any location inside the string
+that follows a newline character.
+
+\item[\regexp{\e Z}] Matches only at the end of the string.
+
+\item[\regexp{\e b}] Word boundary.
+This is a zero-width assertion that matches only at the
+beginning or end of a word. A word is defined as a sequence of
+alphanumeric characters, so the end of a word is indicated by
+whitespace or a non-alphanumeric character.
+
+The following example matches \samp{class} only when it's a complete
+word; it won't match when it's contained inside another word.
+
+\begin{verbatim}
+>>> p = re.compile(r'\bclass\b')
+>>> print p.search('no class at all')
+<re.MatchObject instance at 80c8f28>
+>>> print p.search('the declassified algorithm')
+None
+>>> print p.search('one subclass is')
+None
+\end{verbatim}
+
+There are two subtleties you should remember when using this special
+sequence. First, this is the worst collision between Python's string
+literals and regular expression sequences. In Python's string
+literals, \samp{\e b} is the backspace character, ASCII value 8. If
+you're not using raw strings, then Python will convert the \samp{\e b} to
+a backspace, and your RE won't match as you expect it to. The
+following example looks the same as our previous RE, but omits
+the \character{r} in front of the RE string.
+
+\begin{verbatim}
+>>> p = re.compile('\bclass\b')
+>>> print p.search('no class at all')
+None
+>>> print p.search('\b' + 'class' + '\b')
+<re.MatchObject instance at 80c3ee0>
+\end{verbatim}
+
+Second, inside a character class, where there's no use for this
+assertion, \regexp{\e b} represents the backspace character, for
+compatibility with Python's string literals.
+
+\item[\regexp{\e B}] Another zero-width assertion, this is the
+opposite of \regexp{\e b}, only matching when the current
+position is not at a word boundary.
+
+\end{list}
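+
+The items above that have no example of their own can be sketched as
+follows (Python 3 syntax, an assumption; the sample strings and
+variable names are illustrative). The sketch contrasts \regexp{\e A}
+with \regexp{\^}, \regexp{\e Z} with \regexp{\$}, and shows
+\regexp{\e B}.
+
```python
import re

text = 'From here\nFrom there'

# In MULTILINE mode ^ matches at the start of every line ...
caret_hits = re.findall(r'^From', text, re.MULTILINE)
# ... but \A still matches only at the very start of the string.
start_hits = re.findall(r'\AFrom', text, re.MULTILINE)

# $ tolerates a trailing newline; \Z insists on the true end of the string.
dollar_ok = re.search(r'}$', '{block}\n') is not None
big_z_ok = re.search(r'}\Z', '{block}\n') is not None

# \B is the opposite of \b: "class" must sit inside a longer word.
inside_word = re.search(r'\Bclass\B', 'the declassified algorithm') is not None
whole_word = re.search(r'\Bclass\B', 'no class at all') is not None

print(caret_hits, start_hits)
print(dollar_ok, big_z_ok)
print(inside_word, whole_word)
```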
+
+\subsection{Grouping}
+
+Frequently you need to obtain more information than just whether the
+RE matched or not. Regular expressions are often used to dissect
+strings by writing a RE divided into several subgroups which
+match different components of interest. For example, an RFC-822
+header line is divided into a header name and a value, separated by a
+\character{:}. This can be handled by writing a regular expression
+which matches an entire header line, and has one group which matches the
+header name, and another group which matches the header's value.
+
+Groups are marked by the \character{(}, \character{)} metacharacters.
+\character{(} and \character{)} have much the same meaning as they do
+in mathematical expressions; they group together the expressions
+contained inside them. For example, you can repeat the contents of a
+group with a repeating qualifier, such as \regexp{*}, \regexp{+},
+\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
+\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
+
+\begin{verbatim}
+>>> p = re.compile('(ab)*')
+>>> print p.match('ababababab').span()
+(0, 10)
+\end{verbatim}
+
+Groups indicated with \character{(}, \character{)} also capture the
+starting and ending index of the text that they match; this can be
+retrieved by passing an argument to \method{group()},
+\method{start()}, \method{end()}, and \method{span()}. Groups are
+numbered starting with 0. Group 0 is always present; it's the whole
+RE, so \class{MatchObject} methods all have group 0 as their default
+argument. Later we'll see how to express groups that don't capture
+the span of text that they match.
+
+\begin{verbatim}
+>>> p = re.compile('(a)b')
+>>> m = p.match('ab')
+>>> m.group()
+'ab'
+>>> m.group(0)
+'ab'
+\end{verbatim}
+
+Subgroups are numbered from left to right, from 1 upward. Groups can
+be nested; to determine the number, just count the opening parenthesis
+characters, going from left to right.
+
+\begin{verbatim}
+>>> p = re.compile('(a(b)c)d')
+>>> m = p.match('abcd')
+>>> m.group(0)
+'abcd'
+>>> m.group(1)
+'abc'
+>>> m.group(2)
+'b'
+\end{verbatim}
+
+\method{group()} can be passed multiple group numbers at a time, in
+which case it will return a tuple containing the corresponding values
+for those groups.
+
+\begin{verbatim}
+>>> m.group(2,1,2)
+('b', 'abc', 'b')
+\end{verbatim}
+
+The \method{groups()} method returns a tuple containing the strings
+for all the subgroups, from 1 up to however many there are.
+
+\begin{verbatim}
+>>> m.groups()
+('abc', 'b')
+\end{verbatim}
+
+Backreferences in a pattern allow you to specify that the contents of
+an earlier capturing group must also be found at the current location
+in the string. For example, \regexp{\e 1} will succeed if the exact
+contents of group 1 can be found at the current position, and fails
+otherwise. Remember that Python's string literals also use a
+backslash followed by numbers to allow including arbitrary characters
+in a string, so be sure to use a raw string when incorporating
+backreferences in a RE.
+
+For example, the following RE detects doubled words in a string.
+
+\begin{verbatim}
+>>> p = re.compile(r'(\b\w+)\s+\1')
+>>> p.search('Paris in the the spring').group()
+'the the'
+\end{verbatim}
+
+Backreferences like this aren't often useful for just searching
+through a string --- there are few text formats which repeat data in
+this way --- but you'll soon find out that they're \emph{very} useful
+when performing string substitutions.
+
+\subsection{Non-capturing and Named Groups}
+
+Elaborate REs may use many groups, both to capture substrings of
+interest, and to group and structure the RE itself. In complex REs,
+it becomes difficult to keep track of the group numbers. There are
+two features which help with this problem. Both of them use a common
+syntax for regular expression extensions, so we'll look at that first.
+
+Perl 5 added several additional features to standard regular
+expressions, and the Python \module{re} module supports most of them.
+It would have been difficult to choose new single-keystroke
+metacharacters or new special sequences beginning with \samp{\e} to
+represent the new features without making Perl's regular expressions
+confusingly different from standard REs. If you chose \samp{\&} as a
+new metacharacter, for example, old expressions would be assuming that
+\samp{\&} was a regular character and wouldn't have escaped it by
+writing \regexp{\e \&} or \regexp{[\&]}.
+
+The solution chosen by the Perl developers was to use \regexp{(?...)}
+as the extension syntax. \samp{?} immediately after a parenthesis was
+a syntax error because the \samp{?} would have nothing to repeat, so
+this didn't introduce any compatibility problems. The characters
+immediately after the \samp{?} indicate what extension is being used,
+so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
+\regexp{(?:foo)} is something else (a non-capturing group containing
+the subexpression \regexp{foo}).
+
+Python adds an extension syntax to Perl's extension syntax. If the
+first character after the question mark is a \samp{P}, you know that
+it's an extension that's specific to Python. Currently there are two
+such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
+and \regexp{(?P=\var{name})} is a backreference to a named group. If
+future versions of Perl 5 add similar features using a different
+syntax, the \module{re} module will be changed to support the new
+syntax, while preserving the Python-specific syntax for
+compatibility's sake.
+
+Now that we've looked at the general extension syntax, we can return
+to the features that simplify working with groups in complex REs.
+Since groups are numbered from left to right and a complex expression
+may use many groups, it can become difficult to keep track of the
+correct numbering, and modifying such a complex RE is annoying.
+Insert a new group near the beginning, and you change the numbers of
+everything that follows it.
+
+First, sometimes you'll want to use a group to collect a part of a
+regular expression, but aren't interested in retrieving the group's
+contents. You can make this fact explicit by using a non-capturing
+group: \regexp{(?:...)}, where you can put any other regular
+expression inside the parentheses.
+
+\begin{verbatim}
+>>> m = re.match("([abc])+", "abc")
+>>> m.groups()
+('c',)
+>>> m = re.match("(?:[abc])+", "abc")
+>>> m.groups()
+()
+\end{verbatim}
+
+Except for the fact that you can't retrieve the contents of what the
+group matched, a non-capturing group behaves exactly the same as a
+capturing group; you can put anything inside it, repeat it with a
+repetition metacharacter such as \samp{*}, and nest it within other
+groups (capturing or non-capturing). \regexp{(?:...)} is particularly
+useful when modifying an existing pattern, since you can add new groups
+without changing how all the other groups are numbered. It should be
+mentioned that there's no performance difference in searching between
+capturing and non-capturing groups; neither form is any faster than
+the other.
+
+The second, and more significant, feature is named groups; instead of
+referring to them by numbers, groups can be referenced by a name.
+
+The syntax for a named group is one of the Python-specific extensions:
+\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
+the group. Except for associating a name with a group, named groups
+also behave identically to capturing groups. The \class{MatchObject}
+methods that deal with capturing groups all accept either integers, to
+refer to groups by number, or a string containing the group name.
+Named groups are still given numbers, so you can retrieve information
+about a group in two ways:
+
+\begin{verbatim}
+>>> p = re.compile(r'(?P<word>\b\w+\b)')
+>>> m = p.search( '(((( Lots of punctuation )))' )
+>>> m.group('word')
+'Lots'
+>>> m.group(1)
+'Lots'
+\end{verbatim}
+
+Named groups are handy because they let you use easily-remembered
+names, instead of having to remember numbers. Here's an example RE
+from the \module{imaplib} module:
+
+\begin{verbatim}
+InternalDate = re.compile(r'INTERNALDATE "'
+ r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
+ r'(?P<year>[0-9][0-9][0-9][0-9])'
+ r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
+ r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
+ r'"')
+\end{verbatim}
+
+It's obviously much easier to retrieve \code{m.group('zonem')},
+instead of having to remember to retrieve group 9.
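+
+To see this pattern at work, here is a minimal sketch (Python 3 syntax,
+an assumption). The sample \code{INTERNALDATE} string is hypothetical,
+made up only to fit the pattern shown above.
+
```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical input, invented for this illustration.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

by_name = m.group('zonem')    # readable: asks for the zone minutes
by_number = m.group(9)        # same group, but the 9 must be counted out
month = m.group('mon')

print(by_name, by_number, month)
```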
+
+Since the syntax for backreferences, in an expression like
+\regexp{(...)\e 1}, refers to the number of the group, there's
+naturally a variant that uses the group name instead of the number.
+This is also a Python extension: \regexp{(?P=\var{name})} indicates
+that the contents of the group called \var{name} should again be found
+at the current point. The regular expression for finding doubled
+words, \regexp{(\e b\e w+)\e s+\e 1}, can also be written as
+\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
+
+\begin{verbatim}
+>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
+>>> p.search('Paris in the the spring').group()
+'the the'
+\end{verbatim}
+
+\subsection{Lookahead Assertions}
+
+Another zero-width assertion is the lookahead assertion. Lookahead
+assertions are available in both positive and negative form, and
+look like this:
+
+\begin{itemize}
+\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds
+if the contained regular expression, represented here by \code{...},
+successfully matches at the current location, and fails otherwise.
+But, once the contained expression has been tried, the matching engine
+doesn't advance at all; the rest of the pattern is tried right where
+the assertion started.
+
+\item[\regexp{(?!...)}] Negative lookahead assertion. This is the
+opposite of the positive assertion; it succeeds if the contained expression
+\emph{doesn't} match at the current position in the string.
+\end{itemize}
+
+An example will help make this concrete by demonstrating a case
+where a lookahead is useful. Consider a simple pattern to match a
+filename and split it apart into a base name and an extension,
+separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news}
+is the base name, and \samp{rc} is the filename's extension.
+
+The pattern to match this is quite simple:
+
+\regexp{.*[.].*\$}
+
+Notice that the \samp{.} needs to be treated specially because it's a
+metacharacter; I've put it inside a character class. Also notice the
+trailing \regexp{\$}; this is added to ensure that all the rest of the
+string must be included in the extension. This regular expression
+matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
+\samp{printers.conf}.
+
+Now, consider complicating the problem a bit; what if you want to
+match filenames where the extension is not \samp{bat}?
+Some incorrect attempts:
+
+\verb|.*[.][^b].*$|
+% $
+
+The first attempt above tries to exclude \samp{bat} by requiring that
+the first character of the extension is not a \samp{b}. This is
+wrong, because the pattern also doesn't match \samp{foo.bar}.
+
+% Messes up the HTML without the curly braces around \^
+\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
+
+The expression gets messier when you try to patch up the first
+solution by requiring one of the following cases to match: the first
+character of the extension isn't \samp{b}; the second character isn't
+\samp{a}; or the third character isn't \samp{t}. This accepts
+\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
+three-letter extension and won't accept a filename with a two-letter
+extension such as \samp{sendmail.cf}. We'll complicate the pattern
+again in an effort to fix it.
+
+\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
+
+In the third attempt, the second and third letters are both made
+optional in order to allow matching extensions shorter than three
+characters, such as \samp{sendmail.cf}.
+
+The pattern's getting really complicated now, which makes it hard to
+read and understand. Worse, if the problem changes and you want to
+exclude both \samp{bat} and \samp{exe} as extensions, the pattern
+would get even more complicated and confusing.
+
+A negative lookahead cuts through all this:
+
+\regexp{.*[.](?!bat\$).*\$}
+% $
+
+The negative lookahead means: if the expression \regexp{bat\$} doesn't
+match at this point, try the rest of the pattern; if \regexp{bat\$} does
+match, the whole pattern will fail. The trailing \regexp{\$} is required to
+ensure that something like \samp{sample.batch}, where the extension
+only starts with \samp{bat}, will be allowed.
+
+Excluding another filename extension is now easy; simply add it as an
+alternative inside the assertion. The following pattern excludes
+filenames that end in either \samp{bat} or \samp{exe}:
+
+\regexp{.*[.](?!bat\$|exe\$).*\$}
+% $
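+
+The two lookahead patterns can be checked with a short sketch. The
+file names are hypothetical, chosen to cover the cases discussed
+above, and the variable names are only for illustration.
+

```python
import re

not_bat = re.compile(r'.*[.](?!bat$).*$')
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

# Hypothetical file names, one for each case discussed above.
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'sample.batch', 'setup.exe']

# Keep only the names that each pattern accepts.
accepted_1 = [n for n in names if not_bat.match(n)]
accepted_2 = [n for n in names if not_bat_or_exe.match(n)]
```

+
+The first pattern rejects only \samp{autoexec.bat}; the second also
+rejects \samp{setup.exe}, while \samp{sample.batch} passes both.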
+
+
+\section{Modifying Strings}
+
+Up to this point, we've simply performed searches against a static
+string. Regular expressions are also commonly used to modify a string
+in various ways, using the following \class{RegexObject} methods:
+
+\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
+ \lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
+ \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
+ \lineii{subn()}{Does the same thing as \method{sub()},
+ but returns the new string and the number of replacements}
+\end{tableii}
+
+
+\subsection{Splitting Strings}
+
+The \method{split()} method of a \class{RegexObject} splits a string
+apart wherever the RE matches, returning a list of the pieces.
+It's similar to the \method{split()} method of strings but
+provides much more
+generality in the delimiters that you can split by;
+the string method only supports splitting by whitespace or by
+a fixed string. As you'd expect, there's a module-level
+\function{re.split()} function, too.
+
+\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
+ Split \var{string} by the matches of the regular expression. If
+ capturing parentheses are used in the RE, then their contents will
+ also be returned as part of the resulting list. If \var{maxsplit}
+ is nonzero, at most \var{maxsplit} splits are performed.
+\end{methoddesc}
+
+You can limit the number of splits made, by passing a value for
+\var{maxsplit}. When \var{maxsplit} is nonzero, at most
+\var{maxsplit} splits will be made, and the remainder of the string is
+returned as the final element of the list. In the following example,
+the delimiter is any sequence of non-alphanumeric characters.
+
+\begin{verbatim}
+>>> p = re.compile(r'\W+')
+>>> p.split('This is a test, short and sweet, of split().')
+['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
+>>> p.split('This is a test, short and sweet, of split().', 3)
+['This', 'is', 'a', 'test, short and sweet, of split().']
+\end{verbatim}
+
+Sometimes you're not only interested in what the text between
+delimiters is, but also need to know what the delimiter was. If
+capturing parentheses are used in the RE, then their values are also
+returned as part of the list. Compare the following calls:
+
+\begin{verbatim}
+>>> p = re.compile(r'\W+')
+>>> p2 = re.compile(r'(\W+)')
+>>> p.split('This... is a test.')
+['This', 'is', 'a', 'test', '']
+>>> p2.split('This... is a test.')
+['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
+\end{verbatim}
+
+The module-level function \function{re.split()} adds the RE to be
+used as the first argument, but is otherwise the same.
+
+\begin{verbatim}
+>>> re.split(r'[\W]+', 'Words, words, words.')
+['Words', 'words', 'words', '']
+>>> re.split(r'([\W]+)', 'Words, words, words.')
+['Words', ', ', 'words', ', ', 'words', '.', '']
+>>> re.split(r'[\W]+', 'Words, words, words.', 1)
+['Words', 'words, words.']
+\end{verbatim}
+
+\subsection{Search and Replace}
+
+Another common task is to find all the matches for a pattern, and
+replace them with a different string. The \method{sub()} method takes
+a replacement value, which can be either a string or a function, and
+the string to be processed.
+
+\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
+Returns the string obtained by replacing the leftmost non-overlapping
+occurrences of the RE in \var{string} by the replacement
+\var{replacement}. If the pattern isn't found, \var{string} is returned
+unchanged.
+
+The optional argument \var{count} is the maximum number of pattern
+occurrences to be replaced; \var{count} must be a non-negative
+integer. The default value of 0 means to replace all occurrences.
+\end{methoddesc}
+
+Here's a simple example of using the \method{sub()} method. It
+replaces colour names with the word \samp{colour}:
+
+\begin{verbatim}
+>>> p = re.compile( '(blue|white|red)')
+>>> p.sub( 'colour', 'blue socks and red shoes')
+'colour socks and colour shoes'
+>>> p.sub( 'colour', 'blue socks and red shoes', count=1)
+'colour socks and red shoes'
+\end{verbatim}
+
+The \method{subn()} method does the same work, but returns a 2-tuple
+containing the new string value and the number of replacements
+that were performed:
+
+\begin{verbatim}
+>>> p = re.compile( '(blue|white|red)')
+>>> p.subn( 'colour', 'blue socks and red shoes')
+('colour socks and colour shoes', 2)
+>>> p.subn( 'colour', 'no colours at all')
+('no colours at all', 0)
+\end{verbatim}
+
+Empty matches are replaced only when they're not
+adjacent to a previous match.
+
+\begin{verbatim}
+>>> p = re.compile('x*')
+>>> p.sub('-', 'abxd')
+'-a-b-d-'
+\end{verbatim}
+
+If \var{replacement} is a string, any backslash escapes in it are
+processed. That is, \samp{\e n} is converted to a single newline
+character, \samp{\e r} is converted to a carriage return, and so forth.
+Unknown escapes such as \samp{\e j} are left alone. Backreferences,
+such as \samp{\e 6}, are replaced with the substring matched by the
+corresponding group in the RE. This lets you incorporate
+portions of the original text in the resulting
+replacement string.
+
+This example matches the word \samp{section} followed by a string
+enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
+\samp{subsection}:
+
+\begin{verbatim}
+>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
+>>> p.sub(r'subsection{\1}','section{First} section{second}')
+'subsection{First} subsection{second}'
+\end{verbatim}
+
+There's also a syntax for referring to named groups as defined by the
+\regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the
+substring matched by the group named \samp{name}, and
+\samp{\e g<\var{number}>}
+uses the corresponding group number.
+\samp{\e g<2>} is therefore equivalent to \samp{\e 2},
+but isn't ambiguous in a
+replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be
+interpreted as a reference to group 20, not a reference to group 2
+followed by the literal character \character{0}.) The following
+substitutions are all equivalent, but use all three variations of the
+replacement string.
+
+\begin{verbatim}
+>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
+>>> p.sub(r'subsection{\1}','section{First}')
+'subsection{First}'
+>>> p.sub(r'subsection{\g<1>}','section{First}')
+'subsection{First}'
+>>> p.sub(r'subsection{\g<name>}','section{First}')
+'subsection{First}'
+\end{verbatim}
+
+\var{replacement} can also be a function, which gives you even more
+control. If \var{replacement} is a function, the function is
+called for every non-overlapping occurrence of \var{pattern}. On each
+call, the function is
+passed a \class{MatchObject} argument for the match
+and can use this information to compute the desired replacement string and return it.
+
+In the following example, the replacement function translates
+decimals into hexadecimal:
+
+\begin{verbatim}
+>>> def hexrepl( match ):
+...     "Return the hex string for a decimal number"
+...     value = int( match.group() )
+...     return hex(value)
+...
+>>> p = re.compile(r'\d+')
+>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
+'Call 0xffd2 for printing, 0xc000 for user code.'
+\end{verbatim}
+
+When using the module-level \function{re.sub()} function, the pattern
+is passed as the first argument. The pattern may be a string or a
+\class{RegexObject}; if you need to specify regular expression flags,
+you must either use a \class{RegexObject} as the first parameter, or use
+embedded modifiers in the pattern, e.g. \code{sub("(?i)b+", "x", "bbbb
+BBBB")} returns \code{'x x'}.
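+
+A small sketch of the two options just described, using the same
+sample text; the variable names are made up for illustration.
+

```python
import re

text = 'bbbb BBBB'

# Option 1: carry the flag in a compiled pattern object and pass
# that object as the first argument.
compiled = re.compile('b+', re.IGNORECASE)
via_object = re.sub(compiled, 'x', text)

# Option 2: embed the modifier at the start of the pattern string.
via_modifier = re.sub('(?i)b+', 'x', text)
```

+
+Both calls produce the same result, \code{'x x'}.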
+
+\section{Common Problems}
+
+Regular expressions are a powerful tool for some applications, but in
+some ways their behaviour isn't intuitive and at times they don't
+behave the way you may expect them to. This section will point out
+some of the most common pitfalls.
+
+\subsection{Use String Methods}
+
+Sometimes using the \module{re} module is a mistake. If you're
+matching a fixed string, or a single character class, and you're not
+using any \module{re} features such as the \constant{IGNORECASE} flag,
+then the full power of regular expressions may not be required.
+Strings have several methods for performing operations with fixed
+strings and they're usually much faster, because the implementation is
+a single small C loop that's been optimized for the purpose, instead
+of the large, more generalized regular expression engine.
+
+One example might be replacing a single fixed string with another
+one; for example, you might replace \samp{word}
+with \samp{deed}. \code{re.sub()} seems like the function to use for
+this, but consider the \method{replace()} method. Note that
+\method{replace()} will also replace \samp{word} inside
+words, turning \samp{swordfish} into \samp{sdeedfish}, but the
+na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing
+the substitution on parts of words, the pattern would have to be
+\regexp{\e bword\e b}, in order to require that \samp{word} have a
+word boundary on either side. This takes the job beyond
+\method{replace}'s abilities.)
+
+Another common task is deleting every occurrence of a single character
+from a string or replacing it with another single character. You
+might do this with something like \code{re.sub('\e n', ' ', S)}, but
+\method{translate()} is capable of doing both tasks
+and will be faster than any regular expression operation can be.
+
+In short, before turning to the \module{re} module, consider whether
+your problem can be solved with a faster and simpler string method.
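+
+The comparison can be sketched as follows. The sample strings are
+invented for illustration. \method{translate()} is left out of the
+sketch because its calling convention differs between Python
+versions; \method{replace()} is used for the single-character case
+instead.
+

```python
import re

text = 'word swordfish'

# The string method and the naive RE both rewrite the inside of
# 'swordfish' as well as the standalone word.
plain = text.replace('word', 'deed')
naive_re = re.sub('word', 'deed', text)

# Only the RE with word boundaries leaves 'swordfish' alone.
bounded = re.sub(r'\bword\b', 'deed', text)

# Replacing one fixed character needs no regular expression at all.
lines = 'first\nsecond\nthird'
joined = lines.replace('\n', ' ')
```
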
+
+\subsection{match() versus search()}
+
+The \function{match()} function only checks if the RE matches at
+the beginning of the string while \function{search()} will scan
+forward through the string for a match.
+It's important to keep this distinction in mind. Remember,
+\function{match()} will only report a successful match that
+starts at position 0; if the match wouldn't start at zero,
+\function{match()} will \emph{not} report it.
+
+\begin{verbatim}
+>>> print re.match('super', 'superstition').span()
+(0, 5)
+>>> print re.match('super', 'insuperable')
+None
+\end{verbatim}
+
+On the other hand, \function{search()} will scan forward through the
+string, reporting the first match it finds.
+
+\begin{verbatim}
+>>> print re.search('super', 'superstition').span()
+(0, 5)
+>>> print re.search('super', 'insuperable').span()
+(2, 7)
+\end{verbatim}
+
+Sometimes you'll be tempted to keep using \function{re.match()}, and
+just add \regexp{.*} to the front of your RE. Resist this temptation
+and use \function{re.search()} instead. The regular expression
+compiler does some analysis of REs in order to speed up the process of
+looking for a match. One such analysis figures out what the first
+character of a match must be; for example, a pattern starting with
+\regexp{Crow} must match starting with a \character{C}. The analysis
+lets the engine quickly scan through the string looking for the
+starting character, only trying the full match if a \character{C} is found.
+
+Adding \regexp{.*} defeats this optimization, requiring scanning to
+the end of the string and then backtracking to find a match for the
+rest of the RE. Use \function{re.search()} instead.
+
+\subsection{Greedy versus Non-Greedy}
+
+When repeating a regular expression, as in \regexp{a*}, the resulting
+action is to consume as much of the string as possible. This
+fact often bites you when you're trying to match a pair of
+balanced delimiters, such as the angle brackets surrounding an HTML
+tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't
+work because of the greedy nature of \regexp{.*}.
+
+\begin{verbatim}
+>>> s = '<html><head><title>Title</title>'
+>>> len(s)
+32
+>>> print re.match('<.*>', s).span()
+(0, 32)
+>>> print re.match('<.*>', s).group()
+<html><head><title>Title</title>
+\end{verbatim}
+
+The RE matches the \character{<} in \samp{<html>}, and the
+\regexp{.*} consumes the rest of the string. There's still more left
+in the RE, though, and the \regexp{>} can't match at the end of
+the string, so the regular expression engine has to backtrack
+character by character until it finds a match for the \regexp{>}.
+The final match extends from the \character{<} in \samp{<html>}
+to the \character{>} in \samp{</title>}, which isn't what you want.
+
+In this case, the solution is to use the non-greedy qualifiers
+\regexp{*?}, \regexp{+?}, \regexp{??}, or
+\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
+possible. In the above example, the \character{>} is tried
+immediately after the first \character{<} matches, and when it fails,
+the engine advances a character at a time, retrying the \character{>}
+at every step. This produces just the right result:
+
+\begin{verbatim}
+>>> print re.match('<.*?>', s).group()
+<html>
+\end{verbatim}
+
+(Note that parsing HTML or XML with regular expressions is painful.
+Quick-and-dirty patterns will handle common cases, but HTML and XML
+have special cases that will break the obvious regular expression; by
+the time you've written a regular expression that handles all of the
+possible cases, the patterns will be \emph{very} complicated. Use an
+HTML or XML parser module for such tasks.)
+
+\subsection{Not Using re.VERBOSE}
+
+By now you've probably noticed that regular expressions are a very
+compact notation, but they're not terribly readable. REs of
+moderate complexity can become lengthy collections of backslashes,
+parentheses, and metacharacters, making them difficult to read and
+understand.
+
+For such REs, specifying the \code{re.VERBOSE} flag when
+compiling the regular expression can be helpful, because it allows
+you to format the regular expression more clearly.
+
+The \code{re.VERBOSE} flag has several effects. Whitespace in the
+regular expression that \emph{isn't} inside a character class is
+ignored. This means that an expression such as \regexp{dog | cat} is
+equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
+will still match the characters \character{a}, \character{b}, or a
+space. In addition, you can also put comments inside a RE; comments
+extend from a \samp{\#} character to the next newline. When used with
+triple-quoted strings, this enables REs to be formatted more neatly:
+
+\begin{verbatim}
+pat = re.compile(r"""
+ \s* # Skip leading whitespace
+ (?P<header>[^:]+) # Header name
+ \s* : # Whitespace, and a colon
+ (?P<value>.*?) # The header's value -- *? used to
+ # lose the following trailing whitespace
+ \s*$ # Trailing whitespace to end-of-line
+""", re.VERBOSE)
+\end{verbatim}
+% $
+
+This is far more readable than:
+
+\begin{verbatim}
+pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
+\end{verbatim}
+% $
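+
+As a quick check that the two spellings really are the same pattern,
+here is a sketch that runs both against one header line. The header
+line is a made-up example.
+

```python
import re

verbose = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value, matched non-greedily
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# Hypothetical input line with trailing whitespace.
line = 'From: author@example.com   '
mv = verbose.match(line)
mc = compact.match(line)

same = (mv.groupdict() == mc.groupdict())
header = mv.group('header')
# The value group begins right after the colon, so it still carries
# the leading space; strip() removes it.
value = mv.group('value').strip()
```
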
+
+\section{Feedback}
+
+Regular expressions are a complicated topic. Did this document help
+you understand them? Were there parts that were unclear, or problems
+you encountered that weren't covered here? If so, please send
+suggestions for improvements to the author.
+
+The most complete book on regular expressions is almost certainly
+Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
+by O'Reilly. Unfortunately, it exclusively concentrates on Perl and
+Java's flavours of regular expressions, and doesn't contain any Python
+material at all, so it won't be useful as a reference for programming
+in Python. (The first edition covered Python's now-obsolete
+\module{regex} module, which won't help you much.) Consider checking
+it out from your library.
+
+\end{document}
+
diff --git a/Doc/howto/rexec.tex b/Doc/howto/rexec.tex
new file mode 100644
index 0000000..44a0b30
--- /dev/null
+++ b/Doc/howto/rexec.tex
@@ -0,0 +1,61 @@
+\documentclass{howto}
+
+\title{Restricted Execution HOWTO}
+
+\release{2.1}
+
+\author{A.M. Kuchling}
+\authoraddress{\email{amk@amk.ca}}
+
+\begin{document}
+
+\maketitle
+
+\begin{abstract}
+\noindent
+
+Python 2.2.2 and earlier provided a \module{rexec} module for running
+untrusted code. However, it's never been exhaustively audited for
+security and it hasn't been updated to take into account recent
+changes to Python such as new-style classes. Therefore, the
+\module{rexec} module should not be trusted. To discourage use of
+\module{rexec}, this HOWTO has been withdrawn.
+
+The \module{rexec} and \module{Bastion} modules have been disabled in
+the Python CVS tree, both on the trunk (which will eventually become
+Python 2.3alpha2 and later 2.3final) and on the release22-maint branch
+(which will become Python 2.2.3, if someone ever volunteers to issue
+2.2.3).
+
+For discussion of the problems with \module{rexec}, see the python-dev
+threads starting at the following URLs:
+\url{http://mail.python.org/pipermail/python-dev/2002-December/031160.html},
+and
+\url{http://mail.python.org/pipermail/python-dev/2003-January/031848.html}.
+
+\end{abstract}
+
+
+\section{Version History}
+
+Feb. 26, 1998: First version. Suggestions are welcome.
+
+Mar. 16, 1998: Made some revisions suggested by Jeff Rush. Some minor
+changes and clarifications, and a sizable section on exceptions added.
+
+Sep. 12, 1998: Minor revisions and added the reference to the Janus
+project.
+
+Oct. 4, 2000: Checked with Python 2.0. Minor rewrites and fixes made.
+Version number increased to 2.0.
+
+Dec. 17, 2002: Withdrawn.
+
+Jan. 8, 2003: Mention that \module{rexec} will be disabled in Python 2.3,
+and added links to relevant python-dev threads.
+
+\end{document}
+
+
+
+
diff --git a/Doc/howto/sockets.tex b/Doc/howto/sockets.tex
new file mode 100644
index 0000000..4da92a8
--- /dev/null
+++ b/Doc/howto/sockets.tex
@@ -0,0 +1,460 @@
+\documentclass{howto}
+
+\title{Socket Programming HOWTO}
+
+\release{0.00}
+
+\author{Gordon McMillan}
+\authoraddress{\email{gmcm@hypernet.com}}
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+\noindent
+Sockets are used nearly everywhere, but are one of the most severely
+misunderstood technologies around. This is a 10,000 foot overview of
+sockets. It's not really a tutorial - you'll still have work to do in
+getting things operational. It doesn't cover the fine points (and there
+are a lot of them), but I hope it will give you enough background to
+begin using them decently.
+
+This document is available from the Python HOWTO page at
+\url{http://www.python.org/doc/howto}.
+
+\end{abstract}
+
+\tableofcontents
+
+\section{Sockets}
+
+Sockets are used nearly everywhere, but are one of the most severely
+misunderstood technologies around. This is a 10,000 foot overview of
+sockets. It's not really a tutorial - you'll still have work to do in
+getting things working. It doesn't cover the fine points (and there
+are a lot of them), but I hope it will give you enough background to
+begin using them decently.
+
+I'm only going to talk about INET sockets, but they account for at
+least 99\% of the sockets in use. And I'll only talk about STREAM
+sockets - unless you really know what you're doing (in which case this
+HOWTO isn't for you!), you'll get better behavior and performance from
+a STREAM socket than anything else. I will try to clear up the mystery
+of what a socket is, and give some hints on how to work with
+blocking and non-blocking sockets. But I'll start by talking about
+blocking sockets. You'll need to know how they work before dealing
+with non-blocking sockets.
+
+Part of the trouble with understanding these things is that "socket"
+can mean a number of subtly different things, depending on context. So
+first, let's make a distinction between a "client" socket - an
+endpoint of a conversation, and a "server" socket, which is more like
+a switchboard operator. The client application (your browser, for
+example) uses "client" sockets exclusively; the web server it's
+talking to uses both "server" sockets and "client" sockets.
+
+
+\subsection{History}
+
+Of the various forms of IPC (\emph{Inter Process Communication}),
+sockets are by far the most popular. On any given platform, there are
+likely to be other forms of IPC that are faster, but for
+cross-platform communication, sockets are about the only game in town.
+
+They were invented in Berkeley as part of the BSD flavor of Unix. They
+spread like wildfire with the Internet. With good reason --- the
+combination of sockets with INET makes talking to arbitrary machines
+around the world unbelievably easy (at least compared to other
+schemes).
+
+\section{Creating a Socket}
+
+Roughly speaking, when you clicked on the link that brought you to
+this page, your browser did something like the following:
+
+\begin{verbatim}
+ #create an INET, STREAMing socket
+ s = socket.socket(
+ socket.AF_INET, socket.SOCK_STREAM)
+ #now connect to the web server on port 80
+ # - the normal http port
+ s.connect(("www.mcmillan-inc.com", 80))
+\end{verbatim}
+
+When the \code{connect} completes, the socket \code{s} can
+now be used to send in a request for the text of this page. The same
+socket will read the reply, and then be destroyed. That's right -
+destroyed. Client sockets are normally only used for one exchange (or
+a small set of sequential exchanges).
+
+What happens in the web server is a bit more complex. First, the web
+server creates a "server socket".
+
+\begin{verbatim}
+ #create an INET, STREAMing socket
+ serversocket = socket.socket(
+ socket.AF_INET, socket.SOCK_STREAM)
+ #bind the socket to a public host,
+ # and a well-known port
+ serversocket.bind((socket.gethostname(), 80))
+ #become a server socket
+ serversocket.listen(5)
+\end{verbatim}
+
+A couple things to notice: we used \code{socket.gethostname()}
+so that the socket would be visible to the outside world. If we had
+used \code{s.bind(('localhost', 80))} or \code{s.bind(('127.0.0.1',
+80))} we would still have a "server" socket, but one that was only
+visible within the same machine. \code{s.bind(('', 80))} specifies
+that the socket is reachable by any address the machine happens to
+have.
+
+A second thing to note: low number ports are usually reserved for
+"well known" services (HTTP, SNMP etc). If you're playing around, use
+a nice high number (4 digits).
+
+Finally, the argument to \code{listen} tells the socket library that
+we want it to queue up as many as 5 connect requests (the normal max)
+before refusing outside connections. If the rest of the code is
+written properly, that should be plenty.
+
+OK, now we have a "server" socket, listening on port 80. Now we enter
+the mainloop of the web server:
+
+\begin{verbatim}
+    while 1:
+        #accept connections from outside
+        (clientsocket, address) = serversocket.accept()
+        #now do something with the clientsocket
+        #in this case, we'll pretend this is a threaded server
+        ct = client_thread(clientsocket)
+        ct.run()
+\end{verbatim}
+
+There are actually 3 general ways in which this loop could work -
+dispatching a thread to handle \code{clientsocket}, creating a new
+process to handle \code{clientsocket}, or restructuring this app
+to use non-blocking sockets, and multiplexing between our "server" socket
+and any active \code{clientsocket}s using
+\code{select}. More about that later. The important thing to
+understand now is this: this is \emph{all} a "server" socket
+does. It doesn't send any data. It doesn't receive any data. It just
+produces "client" sockets. Each \code{clientsocket} is created
+in response to some \emph{other} "client" socket doing a
+\code{connect()} to the host and port we're bound to. As soon as
+we've created that \code{clientsocket}, we go back to listening
+for more connections. The two "clients" are free to chat it up - they
+are using some dynamically allocated port which will be recycled when
+the conversation ends.
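+
+The thread-per-connection variant might look roughly like the sketch
+below. It is only an illustration: the helper names
+\code{handle\_client} and \code{serve} are made up, the server binds
+to the loopback address on a port picked by the operating system
+rather than port 80, and the code is written for a current Python
+interpreter.
+

```python
import socket
import threading

def handle_client(clientsocket):
    """Echo one chunk back to the peer, then hang up (illustrative only)."""
    try:
        # One recv is enough for the tiny message used here; real code
        # has to loop, because recv may return only part of a message.
        data = clientsocket.recv(1024)
        if data:
            clientsocket.sendall(data)
    finally:
        clientsocket.close()

def serve(serversocket, max_clients):
    """Accept max_clients connections, handing each to its own thread."""
    workers = []
    for _ in range(max_clients):
        (clientsocket, address) = serversocket.accept()
        t = threading.Thread(target=handle_client, args=(clientsocket,))
        t.start()
        workers.append(t)
    for t in workers:
        t.join()

# Loopback address and port 0 (let the OS choose a free port) are
# assumptions that keep the sketch runnable without privileges.
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(('127.0.0.1', 0))
serversocket.listen(5)
port = serversocket.getsockname()[1]

server = threading.Thread(target=serve, args=(serversocket, 1))
server.start()

# A "client" socket connects, sends a request, and reads the reply
# until the other side closes the connection.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
client.sendall(b'hello')
chunks = []
while True:
    chunk = client.recv(1024)
    if not chunk:
        break
    chunks.append(chunk)
reply = b''.join(chunks)
client.close()

server.join()
serversocket.close()
```

+
+Note that the "server" socket itself never sends or receives data; it
+only hands out the per-connection sockets that do.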
+
+\subsection{IPC} If you need fast IPC between two processes
+on one machine, you should look into whatever form of shared memory
+the platform offers. A simple protocol based around shared memory and
+locks or semaphores is by far the fastest technique.
+
+If you do decide to use sockets, bind the "server" socket to
+\code{'localhost'}. On most platforms, this will take a shortcut
+around a couple of layers of network code and be quite a bit faster.
+
+
+\section{Using a Socket}
+
+The first thing to note is that the web browser's "client" socket and
+the web server's "client" socket are identical beasts. That is, this
+is a "peer to peer" conversation. Or to put it another way, \emph{as the
+designer, you will have to decide what the rules of etiquette are for
+a conversation}. Normally, the \code{connect}ing socket
+starts the conversation, by sending in a request, or perhaps a
+signon. But that's a design decision - it's not a rule of sockets.
+
+Now there are two sets of verbs to use for communication. You can use
+\code{send} and \code{recv}, or you can transform your
+client socket into a file-like beast and use \code{read} and
+\code{write}. The latter is the way Java presents its
+sockets. I'm not going to talk about it here, except to warn you that
+you need to use \code{flush} on sockets. These are buffered
+"files", and a common mistake is to \code{write} something, and
+then \code{read} for a reply. Without a \code{flush} in
+there, you may wait forever for the reply, because the request may
+still be in your output buffer.
+
+Now we come to the major stumbling block of sockets - \code{send}
+and \code{recv} operate on the network buffers. They do not
+necessarily handle all the bytes you hand them (or expect from them),
+because their major focus is handling the network buffers. In general,
+they return when the associated network buffers have been filled
+(\code{send}) or emptied (\code{recv}). They then tell you
+how many bytes they handled. It is \emph{your} responsibility to call
+them again until your message has been completely dealt with.
+
+When a \code{recv} returns 0 bytes, it means the other side has
+closed (or is in the process of closing) the connection. You will not
+receive any more data on this connection. Ever. You may be able to
+send data successfully; I'll talk about that some on the next page.
+
+A protocol like HTTP uses a socket for only one transfer. The client
+sends a request, then reads a reply. That's it. The socket is
+discarded. This means that a client can detect the end of the reply by
+receiving 0 bytes.
+
+But if you plan to reuse your socket for further transfers, you need
+to realize that \emph{there is no "EOT" (End of Transfer) on a
+socket.} I repeat: if a socket \code{send} or
+\code{recv} returns after handling 0 bytes, the connection has
+been broken. If the connection has \emph{not} been broken, you may
+wait on a \code{recv} forever, because the socket will
+\emph{not} tell you that there's nothing more to read (for now). Now
+if you think about that a bit, you'll come to realize a fundamental
+truth of sockets: \emph{messages must either be fixed length} (yuck),
+\emph{or be delimited} (shrug), \emph{or indicate how long they are}
+(much better), \emph{or end by shutting down the connection}. The
+choice is entirely yours (but some ways are righter than others).
+
+Assuming you don't want to end the connection, the simplest solution
+is a fixed length message:
+
+\begin{verbatim}
+    # assumes "import socket" and a module-level MSGLEN
+    # (the fixed message length both sides have agreed on)
+    class mysocket:
+        '''demonstration class only
+          - coded for clarity, not efficiency'''
+        def __init__(self, sock=None):
+            if sock is None:
+                self.sock = socket.socket(
+                    socket.AF_INET, socket.SOCK_STREAM)
+            else:
+                self.sock = sock
+        def connect(self, host, port):
+            self.sock.connect((host, port))
+        def mysend(self, msg):
+            totalsent = 0
+            while totalsent < MSGLEN:
+                sent = self.sock.send(msg[totalsent:])
+                if sent == 0:
+                    raise RuntimeError("socket connection broken")
+                totalsent = totalsent + sent
+        def myreceive(self):
+            msg = ''
+            while len(msg) < MSGLEN:
+                chunk = self.sock.recv(MSGLEN-len(msg))
+                if chunk == '':
+                    raise RuntimeError("socket connection broken")
+                msg = msg + chunk
+            return msg
+\end{verbatim}
+
+The sending code here is usable for almost any messaging scheme - in
+Python you send strings, and you can use \code{len()} to
+determine a string's length (even if it has embedded \code{\e 0}
+characters). It's mostly the receiving code that gets more
+complex. (And in C, it's not much worse, except you can't use
+\code{strlen} if the message has embedded \code{\e 0}s.)
+
+The easiest enhancement is to make the first character of the message
+an indicator of message type, and have the type determine the
+length. Now you have two \code{recv}s - the first to get (at
+least) that first character so you can look up the length, and the
+second in a loop to get the rest. If you decide to go the delimited
+route, you'll be receiving in some arbitrary chunk size (4096 or 8192
+is frequently a good match for network buffer sizes) and scanning
+what you've received for a delimiter.
+
+One complication to be aware of: if your conversational protocol
+allows multiple messages to be sent back to back (without some kind of
+reply), and you pass \code{recv} an arbitrary chunk size, you
+may end up reading the start of a following message. You'll need to
+put that aside and hold onto it, until it's needed.
+
+Prefixing the message with its length (say, as 5 numeric characters)
+gets more complex, because (believe it or not), you may not get all 5
+characters in one \code{recv}. In playing around, you'll get
+away with it; but in high network loads, your code will very quickly
+break unless you use two \code{recv} loops - the first to
+determine the length, the second to get the data part of the
+message. Nasty. This is also when you'll discover that
+\code{send} does not always manage to get rid of everything in
+one pass. And despite having read this, you will eventually get bit by
+it!
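+
+As a hint rather than a full solution, here is one rough shape the
+length-prefixed scheme could take. The helper names
+(\code{recv_exactly}, \code{send_msg}, \code{recv_msg}) and the
+5-digit header are illustrative assumptions, not part of any library,
+and the sketch assumes Python 2.6 or later (or 3.x) so that byte
+strings can be written as \code{b'...'}. The \code{socketpair} call
+is only there to give the demonstration a connected pair of sockets.
+

```python
import socket

HEADER_LEN = 5  # assumption: length sent as 5 ASCII digits, zero-padded


def recv_exactly(sock, count):
    """Loop on recv() until exactly `count` bytes have arrived."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # 0 bytes means the connection is broken
            raise RuntimeError("socket connection broken")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_msg(sock, payload):
    """Send a 5-digit length header followed by the payload bytes."""
    if len(payload) >= 10 ** HEADER_LEN:
        raise ValueError("payload too long for the header")
    header = ("%0*d" % (HEADER_LEN, len(payload))).encode("ascii")
    sock.sendall(header + payload)  # sendall() loops over send() for us


def recv_msg(sock):
    """First loop reads the header, second loop reads the body."""
    length = int(recv_exactly(sock, HEADER_LEN))
    return recv_exactly(sock, length)


# Demonstration on a locally connected pair of sockets.
left, right = socket.socketpair()
send_msg(left, b"hello")
send_msg(left, b"world")  # back-to-back messages stay separate
print(recv_msg(right))
print(recv_msg(right))
left.close()
right.close()
```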
+
+In the interests of space, building your character (and preserving my
+competitive position), these enhancements are left as an exercise for
+the reader. Let's move on to cleaning up.
+
+\subsection{Binary Data}
+
+It is perfectly possible to send binary data over a socket. The major
+problem is that not all machines use the same formats for binary
+data. For example, a Motorola chip will represent a 16 bit integer
+with the value 1 as the two hex bytes 00 01. Intel and DEC, however,
+are byte-reversed - that same 1 is 01 00. Socket libraries have calls
+for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs,
+htons} where "n" means \emph{network} and "h" means \emph{host},
+"s" means \emph{short} and "l" means \emph{long}. Where network order
+is host order, these do nothing, but where the machine is
+byte-reversed, these swap the bytes around appropriately.
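+
+A small sketch of what those conversions look like from Python. The
+\code{struct} module is not mentioned above; using it here is my own
+suggestion for packing integers in an explicit byte order, alongside
+the \code{htons}/\code{ntohs} calls from the \code{socket} module.
+

```python
import socket
import struct

value = 1

# "!" selects network (big-endian) order and "<" little-endian order;
# "H" is a 16-bit unsigned integer.
network_bytes = struct.pack("!H", value)  # the two bytes 00 01
little_bytes = struct.pack("<H", value)   # the two bytes 01 00

# htons() and ntohs() convert between host and network order and undo
# each other, whatever the byte order of the machine running this.
round_trip = socket.ntohs(socket.htons(value))

# A 32-bit value packed in network order and unpacked again, as the
# receiving side would do.
(decoded,) = struct.unpack("!L", struct.pack("!L", 0x01020304))

print(repr(network_bytes))
print(repr(little_bytes))
print(round_trip == value and decoded == 0x01020304)
```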
+
+In these days of 32 bit machines, the ASCII representation of binary
+data is frequently smaller than the binary representation. That's
+because a surprising amount of the time, all those longs have the
+value 0, or maybe 1. The string "0" would be two bytes, while binary
+is four. Of course, this doesn't fit well with fixed-length
+messages. Decisions, decisions.
+
+\section{Disconnecting}
+
+Strictly speaking, you're supposed to use \code{shutdown} on a
+socket before you \code{close} it. The \code{shutdown} is
+an advisory to the socket at the other end. Depending on the argument
+you pass it, it can mean "I'm not going to send anymore, but I'll
+still listen", or "I'm not listening, good riddance!". Most socket
+libraries, however, are so used to programmers neglecting to use this
+piece of etiquette that normally a \code{close} is the same as
+\code{shutdown(); close()}. So in most situations, an explicit
+\code{shutdown} is not needed.
+
+One way to use \code{shutdown} effectively is in an HTTP-like
+exchange. The client sends a request and then does a
+\code{shutdown(1)}. This tells the server "This client is done
+sending, but can still receive." The server can detect "EOF" by a
+receive of 0 bytes. It can assume it has the complete request. The
+server sends a reply. If the \code{send} completes successfully
+then, indeed, the client was still receiving.
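+
+The exchange can be sketched on a single machine as follows. The
+\code{socketpair} call merely stands in for a real client/server
+connection, and \code{read_until_eof} is a helper name made up for
+this illustration; \code{socket.SHUT_WR} is the symbolic name for the
+value 1 used above.
+

```python
import socket


def read_until_eof(sock):
    """Collect data until recv() returns 0 bytes (the peer is done sending)."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


client, server = socket.socketpair()  # stand-in for a real connection

# Client: send the request, then announce "done sending, still listening".
client.sendall(b"GET /spam")
client.shutdown(socket.SHUT_WR)

# Server: a receive of 0 bytes marks the end of the request.
request = read_until_eof(server)
server.sendall(b"reply to " + request)
server.close()

# Client: the receiving half is still open, so the reply arrives.
reply = read_until_eof(client)
client.close()
print(reply)
```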
+
+Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done.
+
+
+\subsection{When Sockets Die}
+
+Probably the worst thing about using blocking sockets is what happens
+when the other side comes down hard (without doing a
+\code{close}). Your socket is likely to hang. \code{SOCK_STREAM} is a
+reliable protocol, and it will wait a long, long time before giving up
+on a connection. If you're using threads, the entire thread is
+essentially dead. There's not much you can do about it. As long as you
+aren't doing something dumb, like holding a lock while doing a
+blocking read, the thread isn't really consuming much in the way of
+resources. Do \emph{not} try to kill the thread - part of the reason
+that threads are more efficient than processes is that they avoid the
+overhead associated with the automatic recycling of resources. In
+other words, if you do manage to kill the thread, your whole process
+is likely to be screwed up.
+
+\section{Non-blocking Sockets}
+
+If you've understood the preceding, you already know most of what you
+need to know about the mechanics of using sockets. You'll still use
+the same calls, in much the same ways. It's just that, if you do it
+right, your app will be almost inside-out.
+
+In Python, you use \code{socket.setblocking(0)} to make it
+non-blocking. In C, it's more complex (for one thing, you'll need to
+choose between the POSIX flavor \code{O_NONBLOCK} and the almost
+indistinguishable older flavor \code{O_NDELAY}, which is
+completely different from \code{TCP_NODELAY}), but it's the
+exact same idea. You do this after creating the socket, but before
+using it. (Actually, if you're nuts, you can switch back and forth.)
+
+The major mechanical difference is that \code{send},
+\code{recv}, \code{connect} and \code{accept} can
+return without having done anything. You have (of course) a number of
+choices. You can check return codes and error codes and generally drive
+yourself crazy. If you don't believe me, try it sometime. Your app
+will grow large, buggy and suck CPU. So let's skip the brain-dead
+solutions and do it right.
+
+Use \code{select}.
+
+In C, coding \code{select} is fairly complex. In Python, it's a
+piece of cake, but it's close enough to the C version that if you
+understand \code{select} in Python, you'll have little trouble
+with it in C.
+
+\begin{verbatim}
+    ready_to_read, ready_to_write, in_error = \
+                   select.select(
+                      potential_readers,
+                      potential_writers,
+                      potential_errs,
+                      timeout)
+\end{verbatim}
+
+You pass \code{select} three lists: the first contains all
+sockets that you might want to try reading; the second all the sockets
+you might want to try writing to, and the last (normally left empty)
+those that you want to check for errors. You should note that a
+socket can go into more than one list. The \code{select} call is
+blocking, but you can give it a timeout. This is generally a sensible
+thing to do - give it a nice long timeout (say a minute) unless you
+have good reason to do otherwise.
+
+In return, you will get three lists. They have the sockets that are
+actually readable, writable and in error. Each of these lists is a
+subset (possibly empty) of the corresponding list you passed in. A
+socket that you put in more than one input list can come back in more
+than one output list (for example, as both readable and writable).
+
+If a socket is in the output readable list, you can be
+as-close-to-certain-as-we-ever-get-in-this-business that a
+\code{recv} on that socket will return \emph{something}. Same
+idea for the writable list. You'll be able to send
+\emph{something}. Maybe not all you want to, but \emph{something} is
+better than nothing. (Actually, any reasonably healthy socket will
+return as writable - it just means outbound network buffer space is
+available.)
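+
+Here is a minimal sketch of that behaviour, again using
+\code{socketpair} purely as a convenient way to get two connected
+sockets in one process. The names \code{left} and \code{right} are
+made up for the example.
+

```python
import select
import socket

left, right = socket.socketpair()

# Nothing has been sent yet, so `right` has nothing to read, while `left`
# has free outbound buffer space. A timeout of 0 makes select() poll.
readable, writable, in_error = select.select([right], [left], [], 0)
before = (right in readable, left in writable)

left.sendall(b"ping")

# Now `right` should come back as readable, so recv() will not block.
readable, writable, in_error = select.select([right], [], [], 1.0)
after = right in readable
data = b""
if after:
    data = right.recv(4096)

left.close()
right.close()
print(before, after, data)
```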
+
+If you have a "server" socket, put it in the potential_readers
+list. If it comes out in the readable list, your \code{accept}
+will (almost certainly) work. If you have created a new socket to
+\code{connect} to someone else, put it in the potential_writers
+list. If it shows up in the writable list, you have a decent chance
+that it has connected.
+
+One very nasty problem with \code{select}: if somewhere in those
+input lists of sockets is one which has died a nasty death, the
+\code{select} will fail. You then need to loop through every
+single damn socket in all those lists and do a
+\code{select([sock],[],[],0)} until you find the bad one. That
+timeout of 0 means it won't take long, but it's ugly.
+
+Actually, \code{select} can be handy even with blocking sockets.
+It's one way of determining whether you will block - the socket
+returns as readable when there's something in the buffers. However,
+this still doesn't help with the problem of determining whether the
+other end is done, or just busy with something else.
+
+\textbf{Portability alert}: On Unix, \code{select} works with both
+sockets and files. Don't try this on Windows. On Windows,
+\code{select} works with sockets only. Also note that in C, many
+of the more advanced socket options are done differently on
+Windows. In fact, on Windows I usually use threads (which work very,
+very well) with my sockets. Face it, if you want any kind of
+performance, your code will look very different on Windows than on
+Unix. (I haven't the foggiest how you do this stuff on a Mac.)
+
+\subsection{Performance}
+
+There's no question that the fastest sockets code uses non-blocking
+sockets and select to multiplex them. You can put together something
+that will saturate a LAN connection without putting any strain on the
+CPU. The trouble is that an app written this way can't do much of
+anything else - it needs to be ready to shuffle bytes around at all
+times.
+
+Assuming that your app is actually supposed to do something more than
+that, threading is the optimal solution (and using non-blocking
+sockets will be faster than using blocking sockets). Unfortunately,
+threading support in Unixes varies both in API and quality. So the
+normal Unix solution is to fork a subprocess to deal with each
+connection. The overhead for this is significant (and don't do this on
+Windows - the overhead of process creation is enormous there). It also
+means that unless each subprocess is completely independent, you'll
+need to use another form of IPC, say a pipe, or shared memory and
+semaphores, to communicate between the parent and child processes.
+
+Finally, remember that even though blocking sockets are somewhat
+slower than non-blocking, in many cases they are the "right"
+solution. After all, if your app is driven by the data it receives
+over a socket, there's not much sense in complicating the logic just
+so your app can wait on \code{select} instead of
+\code{recv}.
+
+\end{document}
diff --git a/Doc/howto/sorting.tex b/Doc/howto/sorting.tex
new file mode 100644
index 0000000..a849c66
--- /dev/null
+++ b/Doc/howto/sorting.tex
@@ -0,0 +1,267 @@
+\documentclass{howto}
+
+\title{Sorting Mini-HOWTO}
+
+% Increment the release number whenever significant changes are made.
+% The author and/or editor can define 'significant' however they like.
+\release{0.01}
+
+\author{Andrew Dalke}
+\authoraddress{\email{dalke@bioreason.com}}
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+\noindent
+This document is a little tutorial
+showing a half dozen ways to sort a list with the built-in
+\method{sort()} method.
+
+This document is available from the Python HOWTO page at
+\url{http://www.python.org/doc/howto}.
+\end{abstract}
+
+\tableofcontents
+
+Python lists have a built-in \method{sort()} method. There are many
+ways to use it to sort a list and there doesn't appear to be a single,
+central place in the various manuals describing them, so I'll do so
+here.
+
+\section{Sorting basic data types}
+
+A simple ascending sort is easy; just call the \method{sort()} method of a list.
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort()
+>>> print a
+[1, 2, 3, 4, 5]
+\end{verbatim}
+
+Sort takes an optional function which can be called for doing the
+comparisons. The default sort routine is equivalent to
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort(cmp)
+>>> print a
+[1, 2, 3, 4, 5]
+\end{verbatim}
+
+where \function{cmp} is the built-in function which compares two objects, \code{x} and
+\code{y}, and returns -1, 0 or 1 depending on whether $x<y$, $x==y$, or $x>y$. During
+the course of the sort the relationships must stay the same for the
+final list to make sense.
+
+If you want, you can define your own function for the comparison. For
+integers (and numbers in general) we can do:
+
+\begin{verbatim}
+>>> def numeric_compare(x, y):
+...     return x-y
+...
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort(numeric_compare)
+>>> print a
+[1, 2, 3, 4, 5]
+\end{verbatim}
+
+By the way, this function won't work if the result of the subtraction
+is out of range, as in \code{sys.maxint - (-1)}.
+
+Or, if you don't want to define a new named function you can create an
+anonymous one using \keyword{lambda}, as in:
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort(lambda x, y: x-y)
+>>> print a
+[1, 2, 3, 4, 5]
+\end{verbatim}
+
+If you want the numbers sorted in reverse you can do
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> def reverse_numeric(x, y):
+...     return y-x
+...
+>>> a.sort(reverse_numeric)
+>>> print a
+[5, 4, 3, 2, 1]
+\end{verbatim}
+
+(a more general implementation could return \code{cmp(y,x)} or \code{-cmp(x,y)}).
+
+However, it's faster if Python doesn't have to call a function for
+every comparison, so if you want a reverse-sorted list of basic data
+types, do the forward sort first, then use the \method{reverse()} method.
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort()
+>>> a.reverse()
+>>> print a
+[5, 4, 3, 2, 1]
+\end{verbatim}
+
+Here's a case-insensitive string comparison using a \keyword{lambda} function:
+
+\begin{verbatim}
+>>> import string
+>>> a = string.split("This is a test string from Andrew.")
+>>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
+>>> print a
+['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
+\end{verbatim}
+
+This goes through the overhead of converting a word to lower case
+every time it must be compared. At times it may be faster to compute
+these once and use those values, and the following example shows how.
+
+\begin{verbatim}
+>>> words = string.split("This is a test string from Andrew.")
+>>> offsets = []
+>>> for i in range(len(words)):
+...     offsets.append( (string.lower(words[i]), i) )
+...
+>>> offsets.sort()
+>>> new_words = []
+>>> for dontcare, i in offsets:
+...     new_words.append(words[i])
+...
+>>> print new_words
+['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
+\end{verbatim}
+
+The \code{offsets} list is filled with tuples, each holding a lower-case string
+and its position in the \code{words} list. It is then sorted. Python's
+sort method sorts tuples by comparing terms; given \code{x} and \code{y}, compare
+\code{x[0]} to \code{y[0]}, then \code{x[1]} to \code{y[1]}, etc. until there is a difference.
+
+The result is that the \code{offsets} list is ordered by its first
+term, and the second term can be used to figure out where the original
+data was stored. (The \code{for} loop assigns \code{dontcare} and
+\code{i} to the two fields of each term in the list, but we only need the
+index value.)
+
+Another way to implement this is to store the original data as the
+second term in the \code{offsets} list, as in:
+
+\begin{verbatim}
+>>> words = string.split("This is a test string from Andrew.")
+>>> offsets = []
+>>> for word in words:
+...     offsets.append( (string.lower(word), word) )
+...
+>>> offsets.sort()
+>>> new_words = []
+>>> for word in offsets:
+...     new_words.append(word[1])
+...
+>>> print new_words
+['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
+\end{verbatim}
+
+This isn't always appropriate because the second terms in the list
+(the word, in this example) will be compared when the first terms are
+the same. If this happens many times, then there will be the unneeded
+performance hit of comparing the two objects. This can be a large
+cost if most terms are the same and the objects define their own
+\method{__cmp__} method, but there will still be some overhead to determine if
+\method{__cmp__} is defined.
+
+Still, for large lists, or for lists where the comparison information
+is expensive to calculate, the last two examples are likely to be the
+fastest way to sort a list. This approach will not work on weakly sorted data,
+like complex numbers, but if you don't know what that means, you
+probably don't need to worry about it.
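+
+For what it's worth, Python 2.4 and later can do this bookkeeping for
+you: \method{sort()} and the built-in \function{sorted()} accept a
+\code{key} argument, a function that is called once per element to
+produce the value to compare. The sketch below is an alternative to
+the hand-written intermediate list, not a description of how the
+examples above were meant to be written.
+

```python
words = "This is a test string from Andrew.".split()

# key is called once per element; the lower-cased copies are compared,
# but the original words are what end up in the result.
by_key = sorted(words, key=str.lower)

# The in-place method takes the same argument.
in_place = list(words)
in_place.sort(key=str.lower)

print(by_key)
```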
+
+\section{Comparing classes}
+
+The comparison for two basic data types, like ints to ints or strings to
+strings, is built into Python and makes sense. There is a default way
+to compare class instances, but the default manner isn't usually very
+useful. You can define your own comparison with the \method{__cmp__} method,
+as in:
+
+\begin{verbatim}
+>>> class Spam:
+...     def __init__(self, spam, eggs):
+...         self.spam = spam
+...         self.eggs = eggs
+...     def __cmp__(self, other):
+...         return cmp(self.spam+self.eggs, other.spam+other.eggs)
+...     def __str__(self):
+...         return str(self.spam + self.eggs)
+...
+>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
+>>> a.sort()
+>>> for spam in a:
+...     print str(spam)
+...
+5
+10
+12
+\end{verbatim}
+
+Sometimes you may want to sort by a specific attribute of a class. If
+appropriate you should just define the \method{__cmp__} method to compare
+those values, but you cannot do this if you want to compare between
+different attributes at different times. Instead, you'll need to go
+back to passing a comparison function to sort, as in:
+
+\begin{verbatim}
+>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
+>>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
+>>> for spam in a:
+...     print spam.eggs, str(spam)
+...
+3 12
+4 5
+6 10
+\end{verbatim}
+
+If you want to compare two arbitrary attributes (and aren't overly
+concerned about performance) you can even define your own comparison
+function object. This uses the ability of a class instance to emulate
+a function by defining the \method{__call__} method, as in:
+
+\begin{verbatim}
+>>> class CmpAttr:
+...     def __init__(self, attr):
+...         self.attr = attr
+...     def __call__(self, x, y):
+...         return cmp(getattr(x, self.attr), getattr(y, self.attr))
+...
+>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
+>>> a.sort(CmpAttr("spam")) # sort by the "spam" attribute
+>>> for spam in a:
+...     print spam.spam, spam.eggs, str(spam)
+...
+1 4 5
+4 6 10
+9 3 12
+
+>>> a.sort(CmpAttr("eggs")) # re-sort by the "eggs" attribute
+>>> for spam in a:
+...     print spam.spam, spam.eggs, str(spam)
+...
+9 3 12
+1 4 5
+4 6 10
+\end{verbatim}
+
+Of course, if you want a faster sort you can extract the attributes
+into an intermediate list and sort that list.
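+
+One way that intermediate list might look is sketched below. The
+helper name \code{sort_by_attr} is made up for this illustration, and
+the \code{Spam} class is a cut-down copy of the one above. Storing
+the original position as the second field is my own addition: it
+breaks ties, so the objects themselves never have to be compared.
+

```python
class Spam:
    def __init__(self, spam, eggs):
        self.spam = spam
        self.eggs = eggs


def sort_by_attr(items, attr):
    """Return a new list of items ordered by the named attribute."""
    # Decorate: (attribute value, original position, object). The position
    # breaks ties, so two objects are never compared with each other.
    decorated = [(getattr(obj, attr), i, obj) for i, obj in enumerate(items)]
    decorated.sort()
    # Undecorate: keep only the objects, now in sorted order.
    return [entry[2] for entry in decorated]


a = [Spam(1, 4), Spam(9, 3), Spam(4, 6)]
by_eggs = sort_by_attr(a, "eggs")
print([(s.spam, s.eggs) for s in by_eggs])
```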
+
+
+So, there you have it; about a half-dozen different ways to define how
+to sort a list:
+\begin{itemize}
+ \item sort using the default method
+ \item sort using a comparison function
+ \item reverse sort not using a comparison function
+ \item sort on an intermediate list (two forms)
+ \item sort using a class-defined \method{__cmp__} method
+ \item sort using a sort function object
+\end{itemize}
+
+\end{document}
+% LocalWords: maxint
diff --git a/Doc/howto/unicode.rst b/Doc/howto/unicode.rst
new file mode 100644
index 0000000..7ad61c1
--- /dev/null
+++ b/Doc/howto/unicode.rst
@@ -0,0 +1,765 @@
+Unicode HOWTO
+================
+
+**Version 1.02**
+
+This HOWTO discusses Python's support for Unicode, and explains various
+problems that people commonly encounter when trying to work with Unicode.
+
+Introduction to Unicode
+------------------------------
+
+History of Character Codes
+''''''''''''''''''''''''''''''
+
+In 1968, the American Standard Code for Information Interchange,
+better known by its acronym ASCII, was standardized. ASCII defined
+numeric codes for various characters, with the numeric values running from 0 to
+127. For example, the lowercase letter 'a' is assigned 97 as its code
+value.
+
+ASCII was an American-developed standard, so it only defined
+unaccented characters. There was an 'e', but no 'é' or 'Í'. This
+meant that languages which required accented characters couldn't be
+faithfully represented in ASCII. (Actually the missing accents matter
+for English, too, which contains words such as 'naïve' and 'café', and some
+publications have house styles which require spellings such as
+'coöperate'.)
+
+For a while people just wrote programs that didn't display accents. I
+remember looking at Apple ][ BASIC programs, published in French-language
+publications in the mid-1980s, that had lines like these::
+
+ PRINT "FICHER EST COMPLETE."
+ PRINT "CARACTERE NON ACCEPTE."
+
+Those messages should contain accents, and they just look wrong to
+someone who can read French.
+
+In the 1980s, almost all personal computers were 8-bit, meaning that
+bytes could hold values ranging from 0 to 255. ASCII codes only went
+up to 127, so some machines assigned values between 128 and 255 to
+accented characters. Different machines had different codes, however,
+which led to problems exchanging files. Eventually various commonly
+used sets of values for the 128-255 range emerged. Some were true
+standards, defined by the International Organization for Standardization (ISO), and
+some were **de facto** conventions that were invented by one company
+or another and managed to catch on.
+
+256 characters aren't very many. For example, you can't fit
+both the accented characters used in Western Europe and the Cyrillic
+alphabet used for Russian into the 128-255 range because there are more than
+128 such characters.
+
+You could write files using different codes (all your Russian
+files in a coding system called KOI8, all your French files in
+a different coding system called Latin1), but what if you wanted
+to write a French document that quotes some Russian text? In the
+1980s people began to want to solve this problem, and the Unicode
+standardization effort began.
+
+Unicode started out using 16-bit characters instead of 8-bit characters. 16
+bits means you have 2^16 = 65,536 distinct values available, making it
+possible to represent many different characters from many different
+alphabets; an initial goal was to have Unicode contain the alphabets for
+every single human language. It turns out that even 16 bits isn't enough to
+meet that goal, and the modern Unicode specification uses a wider range of
+codes, 0-1,114,111 (0x10ffff in base-16).
+
+There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
+originally separate efforts, but the specifications were merged with
+the 1.1 revision of Unicode.
+
+(This discussion of Unicode's history is highly simplified. I don't
+think the average Python programmer needs to worry about the
+historical details; consult the Unicode consortium site listed in the
+References for more information.)
+
+
+Definitions
+''''''''''''''''''''''''
+
+A **character** is the smallest possible component of a text. 'A',
+'B', 'C', etc., are all different characters. So are 'È' and
+'Í'. Characters are abstractions, and vary depending on the
+language or context you're talking about. For example, the symbol for
+ohms (Ω) is usually drawn much like the capital letter
+omega (Ω) in the Greek alphabet (they may even be the same in
+some fonts), but these are two different characters that have
+different meanings.
+
+The Unicode standard describes how characters are represented by
+**code points**. A code point is an integer value, usually denoted in
+base 16. In the standard, a code point is written using the notation
+U+12ca to mean the character with value 0x12ca (4810 decimal). The
+Unicode standard contains a lot of tables listing characters and their
+corresponding code points::
+
+ 0061 'a'; LATIN SMALL LETTER A
+ 0062 'b'; LATIN SMALL LETTER B
+ 0063 'c'; LATIN SMALL LETTER C
+ ...
+ 007B '{'; LEFT CURLY BRACKET
+
+Strictly, these definitions imply that it's meaningless to say 'this is
+character U+12ca'. U+12ca is a code point, which represents some particular
+character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
+In informal contexts, this distinction between code points and characters will
+sometimes be forgotten.
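+
+The relationship between characters, code points and names can be
+poked at from Python. This is only a small sketch: it relies on the
+standard ``unicodedata`` module, which has not been introduced at this
+point in the HOWTO, and it uses ``u'...'`` literals so that it should
+behave the same on Python 2 and on Python 3.3 or later.
+

```python
import unicodedata

# U+12CA written in hexadecimal is the integer 4810.
code_point = int("12ca", 16)

# ord() maps a one-character string to its code point, and
# unicodedata.name() looks up the character name from the standard's tables.
table = []
for ch in (u"a", u"{", u"\u12ca"):
    table.append((ord(ch), unicodedata.name(ch)))

for number, name in table:
    print("U+%04X %s" % (number, name))
```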
+
+A character is represented on a screen or on paper by a set of graphical
+elements that's called a **glyph**. The glyph for an uppercase A, for
+example, is two diagonal strokes and a horizontal stroke, though the exact
+details will depend on the font being used. Most Python code doesn't need
+to worry about glyphs; figuring out the correct glyph to display is
+generally the job of a GUI toolkit or a terminal's font renderer.
+
+
+Encodings
+'''''''''
+
+To summarize the previous section:
+a Unicode string is a sequence of code points, which are
+numbers from 0 to 0x10ffff. This sequence needs to be represented as
+a sequence of bytes (meaning, values from 0-255) in memory. The rules for
+translating a Unicode string into a sequence of bytes are called an
+**encoding**.
+
+The first encoding you might think of is an array of 32-bit integers.
+In this representation, the string "Python" would look like this::
+
+       P           y           t           h           o           n
+    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
+       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
+
+This representation is straightforward but using
+it presents a number of problems.
+
+1. It's not portable; different processors order the bytes
+ differently.
+
+2. It's very wasteful of space. In most texts, the majority of the code
+ points are less than 127, or less than 255, so a lot of space is occupied
+ by zero bytes. The above string takes 24 bytes compared to the 6
+ bytes needed for an ASCII representation. Increased RAM usage doesn't
+ matter too much (desktop computers have megabytes of RAM, and strings
+ aren't usually that large), but expanding our usage of disk and
+ network bandwidth by a factor of 4 is intolerable.
+
+3. It's not compatible with existing C functions such as ``strlen()``,
+ so a new family of wide string functions would need to be used.
+
+4. Many Internet standards are defined in terms of textual data, and
+ can't handle content with embedded zero bytes.
+
+Generally people don't use this encoding, choosing other encodings
+that are more efficient and convenient.
+
+Encodings don't have to handle every possible Unicode character, and
+most encodings don't. For example, Python's default encoding is the
+'ascii' encoding. The rules for converting a Unicode string into the
+ASCII encoding are simple; for each code point:
+
+1. If the code point is <128, each byte is the same as the value of the
+ code point.
+
+2. If the code point is 128 or greater, the Unicode string can't
+ be represented in this encoding. (Python raises a
+ ``UnicodeEncodeError`` exception in this case.)
+
+Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode
+code points 0-255 are identical to the Latin-1 values, so converting
+to this encoding simply requires converting code points to byte
+values; if a code point larger than 255 is encountered, the string
+can't be encoded into Latin-1.
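+
+Both sets of rules can be tried out with the ``encode()`` method of
+Unicode strings. The sample text below is my own choice, and the
+sketch assumes Python 2.6 or later (or 3.3 and later), where both
+``u'...'`` and ``b'...'`` literals are accepted.
+

```python
text = u"caf\xe9"  # U+00E9 is LATIN SMALL LETTER E WITH ACUTE

# Latin-1: every code point from 0 to 255 becomes the byte of the same value.
latin1_bytes = text.encode("latin-1")

# ASCII: code point 0xE9 is 128 or greater, so the string can't be encoded.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Latin-1 in turn gives up on code points larger than 255.
try:
    u"\u12ca".encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(repr(latin1_bytes))
print(ascii_ok, latin1_ok)
```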
+
+Encodings don't have to be simple one-to-one mappings like Latin-1.
+Consider IBM's EBCDIC, which was used on IBM mainframes. Letter
+values weren't in one block: 'a' through 'i' had values from 129 to
+137, but 'j' through 'r' were 145 through 153. If you wanted to use
+EBCDIC as an encoding, you'd probably use some sort of lookup table to
+perform the conversion, but this is largely an internal detail.
+
+UTF-8 is one of the most commonly used encodings. UTF stands for
+"Unicode Transformation Format", and the '8' means that 8-bit numbers
+are used in the encoding. (There's also a UTF-16 encoding, but it's
+less frequently used than UTF-8.) UTF-8 uses the following rules:
+
+1. If the code point is <128, it's represented by the corresponding byte value.
+2. If the code point is between 128 and 0x7ff, it's turned into two byte values
+ between 128 and 255.
+3. Code points >0x7ff are turned into three- or four-byte sequences, where
+ each byte of the sequence is between 128 and 255.
+
+UTF-8 has several convenient properties:
+
+1. It can handle any Unicode code point.
+2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes.
+3. A string of ASCII text is also valid UTF-8 text.
+4. UTF-8 is fairly compact; values less than 128 occupy only a single byte, and most other commonly used code points are turned into two or three bytes.
+5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
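+
+A few of the rules and properties above can be checked directly. The
+four sample code points are arbitrary picks, one for each encoded
+length, and the sketch again assumes Python 2.6 or later (or 3.3 and
+later).
+

```python
samples = [u"a", u"\xe9", u"\u12ca", u"\U00010000"]

# Number of bytes each code point occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Pure ASCII text encodes to the very same bytes in ASCII and in UTF-8.
same_as_ascii = u"Python".encode("utf-8") == u"Python".encode("ascii")

# No zero bytes appear unless the text itself contains U+0000.
encoded = u"".join(samples).encode("utf-8")
has_zero_byte = b"\x00" in encoded

print(lengths, same_as_ascii, has_zero_byte)
```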
+
+
+
+References
+''''''''''''''
+
+The Unicode Consortium site at <http://www.unicode.org> has character
+charts, a glossary, and PDF versions of the Unicode specification. Be
+prepared for some difficult reading.
+<http://www.unicode.org/history/> is a chronology of the origin and
+development of Unicode.
+
+To help understand the standard, Jukka Korpela has written an
+introductory guide to reading the Unicode character tables,
+available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
+
+Roman Czyborra wrote another explanation of Unicode's basic principles;
+it's at <http://czyborra.com/unicode/characters.html>.
+Czyborra has written a number of other Unicode-related documents,
+available from <http://www.czyborra.com>.
+
+Two other good introductory articles were written by Joel Spolsky
+<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
+Orendorff <http://www.jorendorff.com/articles/unicode/>. If this
+introduction didn't make things clear to you, you should try reading
+one of these alternate articles before continuing.
+
+Wikipedia entries are often helpful; see the entries for "character
+encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
+<http://en.wikipedia.org/wiki/UTF-8>, for example.
+
+
+Python's Unicode Support
+------------------------
+
+Now that you've learned the rudiments of Unicode, we can look at
+Python's Unicode features.
+
+
+The Unicode Type
+'''''''''''''''''''
+
+Unicode strings are expressed as instances of the ``unicode`` type,
+one of Python's repertoire of built-in types. It derives from an
+abstract type called ``basestring``, which is also an ancestor of the
+``str`` type; you can therefore check if a value is a string type with
+``isinstance(value, basestring)``. Under the hood, Python represents
+Unicode strings as either 16- or 32-bit integers, depending on how the
+Python interpreter was compiled.
+
+The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
+All of its arguments should be 8-bit strings. The first argument is converted
+to Unicode using the specified encoding; if you leave off the ``encoding`` argument,
+the ASCII encoding is used for the conversion, so characters greater than 127 will
+be treated as errors::
+
+ >>> unicode('abcdef')
+ u'abcdef'
+ >>> s = unicode('abcdef')
+ >>> type(s)
+ <type 'unicode'>
+ >>> unicode('abcdef' + chr(255))
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in ?
+ UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
+ ordinal not in range(128)
+
+The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument
+are 'strict' (raise a ``UnicodeDecodeError`` exception),
+'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
+or 'ignore' (just leave the character out of the Unicode result).
+The following examples show the differences::
+
+ >>> unicode('\x80abc', errors='strict')
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in ?
+ UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
+ ordinal not in range(128)
+ >>> unicode('\x80abc', errors='replace')
+ u'\ufffdabc'
+ >>> unicode('\x80abc', errors='ignore')
+ u'abc'
+
+Encodings are specified as strings containing the encoding's name.
+Python 2.4 comes with roughly 100 different encodings; see the Python
+Library Reference at
+<http://docs.python.org/lib/standard-encodings.html> for a list. Some
+encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
+and '8859' are all synonyms for the same encoding.
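+
+The alias claim can be checked with a short sketch: encode one
+character under each of the three names and compare the bytes. The
+variable names are only illustrative, and the snippet assumes an
+interpreter recent enough to accept both ``u''`` and ``b''`` literals.
+

```python
# Encode the same character under each alias named above; if the names
# really are synonyms, the resulting bytes must be identical.
aliases = ['latin-1', 'iso_8859_1', '8859']
e_acute = u'\u00e9'  # LATIN SMALL LETTER E WITH ACUTE

encoded = [e_acute.encode(name) for name in aliases]
all_same = (encoded[0] == encoded[1] == encoded[2])
```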
+
+One-character Unicode strings can also be created with the
+``unichr()`` built-in function, which takes an integer and returns a
+Unicode string of length 1 that contains the corresponding code point.
+The reverse operation is the built-in ``ord()`` function that takes a
+one-character Unicode string and returns the code point value::
+
+ >>> unichr(40960)
+ u'\ua000'
+ >>> ord(u'\ua000')
+ 40960
+
+Instances of the ``unicode`` type have many of the same methods as
+the 8-bit string type for operations such as searching and formatting::
+
+ >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
+ >>> s.count('e')
+ 5
+ >>> s.find('feather')
+ 9
+ >>> s.find('bird')
+ -1
+ >>> s.replace('feather', 'sand')
+ u'Was ever sand so lightly blown to and fro as this multitude?'
+ >>> s.upper()
+ u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
+
+Note that the arguments to these methods can be Unicode strings or 8-bit strings.
+8-bit strings will be converted to Unicode before carrying out the operation;
+Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::
+
+ >>> s.find('Was\x9f')
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in ?
+ UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
+ >>> s.find(u'Was\x9f')
+ -1
+
+Much Python code that operates on strings will therefore work with
+Unicode strings without requiring any changes to the code. (Input and
+output code needs more updating for Unicode; more on this later.)
+
+Another important method is ``.encode([encoding], [errors='strict'])``,
+which returns an 8-bit string version of the
+Unicode string, encoded in the requested encoding. The ``errors``
+parameter is the same as the parameter of the ``unicode()``
+constructor, with one additional possibility; as well as 'strict',
+'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
+uses XML's character references. The following example shows the
+different results::
+
+ >>> u = unichr(40960) + u'abcd' + unichr(1972)
+ >>> u.encode('utf-8')
+ '\xea\x80\x80abcd\xde\xb4'
+ >>> u.encode('ascii')
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in ?
+ UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
+ >>> u.encode('ascii', 'ignore')
+ 'abcd'
+ >>> u.encode('ascii', 'replace')
+ '?abcd?'
+ >>> u.encode('ascii', 'xmlcharrefreplace')
+ '&#40960;abcd&#1972;'
+
+Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
+that interprets the string using the given encoding::
+
+ >>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
+ >>> utf8_version = u.encode('utf-8') # Encode as UTF-8
+ >>> type(utf8_version), utf8_version
+ (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
+ >>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
+ >>> u == u2 # The two strings match
+ True
+
+The low-level routines for registering and accessing the available
+encodings are found in the ``codecs`` module. However, the encoding
+and decoding functions returned by this module are usually more
+low-level than is comfortable, so I'm not going to describe the
+``codecs`` module here. If you need to implement a completely new
+encoding, you'll need to learn about the ``codecs`` module interfaces,
+but implementing encodings is a specialized task that also won't be
+covered here. Consult the Python documentation to learn more about
+this module.
+
+The most commonly used part of the ``codecs`` module is the
+``codecs.open()`` function which will be discussed in the section
+on input and output.
+
+
+Unicode Literals in Python Source Code
+''''''''''''''''''''''''''''''''''''''''''
+
+In Python source code, Unicode literals are written as strings
+prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific
+code points can be written using the ``\u`` escape sequence, which is
+followed by four hex digits giving the code point. The ``\U`` escape
+sequence is similar, but expects 8 hex digits, not 4.
+
+Unicode literals can also use the same escape sequences as 8-bit
+strings, including ``\x``, but ``\x`` only takes two hex digits so it
+can't express an arbitrary code point. Octal escapes can go up to
+U+01ff, which is octal 777.
+
+::
+
+ >>> s = u"a\xac\u1234\u20ac\U00008000"
+            ^^^^ two-digit hex escape
+                ^^^^^^ four-digit Unicode escape
+                            ^^^^^^^^^^ eight-digit Unicode escape
+ >>> for c in s: print ord(c),
+ ...
+ 97 172 4660 8364 32768
+
+Using escape sequences for code points greater than 127 is fine in
+small doses, but becomes an annoyance if you're using many accented
+characters, as you would in a program with messages in French or some
+other accent-using language. You can also assemble strings using the
+``unichr()`` built-in function, but this is even more tedious.
+
+Ideally, you'd want to be able to write literals in your language's
+natural encoding. You could then edit Python source code with your
+favorite editor which would display the accented characters naturally,
+and have the right characters used at runtime.
+
+Python supports writing Unicode literals in any encoding, but you have
+to declare the encoding being used. This is done by including a
+special comment as either the first or second line of the source
+file::
+
+ #!/usr/bin/env python
+ # -*- coding: latin-1 -*-
+
+ u = u'abcdé'
+ print ord(u[-1])
+
+The syntax is inspired by Emacs's notation for specifying variables local to a file.
+Emacs supports many different variables, but Python only supports 'coding'.
+The ``-*-`` symbols indicate that the comment is special; within them,
+you must supply the name ``coding`` and the name of your chosen encoding,
+separated by ``':'``.
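+
+As a rough illustration of how such a declaration might be recognised,
+the sketch below scans the first two lines for the pattern suggested in
+PEP 263 (which, as an aside, also tolerates ``=`` as the separator).
+The helper name ``declared_encoding`` and its ``default`` argument are
+hypothetical; this is not the interpreter's actual implementation.
+

```python
import re

# Pattern suggested by PEP 263 for recognising the declaration.
CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')

def declared_encoding(source_lines, default='ascii'):
    """Return the encoding named in the first two lines, or `default`."""
    for line in source_lines[:2]:
        # Only comment lines may carry the declaration.
        if line.lstrip().startswith('#'):
            match = CODING_RE.search(line)
            if match:
                return match.group(1)
    return default

example = ['#!/usr/bin/env python',
           '# -*- coding: latin-1 -*-',
           "u = u'abcd'"]
found = declared_encoding(example)
```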
+
+If you don't include such a comment, the default encoding used will be
+ASCII. Versions of Python before 2.4 were Euro-centric and assumed
+Latin-1 as a default encoding for string literals; in Python 2.4,
+characters greater than 127 still work but result in a warning. For
+example, the following program has no encoding declaration::
+
+ #!/usr/bin/env python
+ u = u'abcdé'
+ print ord(u[-1])
+
+When you run it with Python 2.4, it will output the following warning::
+
+ amk:~$ python p263.py
+ sys:1: DeprecationWarning: Non-ASCII character '\xe9'
+ in file p263.py on line 2, but no encoding declared;
+ see http://www.python.org/peps/pep-0263.html for details
+
+
+Unicode Properties
+'''''''''''''''''''
+
+The Unicode specification includes a database of information about
+code points. For each code point that's defined, the information
+includes the character's name, its category, and the numeric value if
+applicable (Unicode has characters representing the Roman numerals and
+fractions such as one-third and four-fifths). There are also
+properties related to the code point's use in bidirectional text and
+other display-related properties.
+
+The following program displays some information about several
+characters, and prints the numeric value of one particular character::
+
+ import unicodedata
+
+ u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
+
+ for i, c in enumerate(u):
+     print i, '%04x' % ord(c), unicodedata.category(c),
+     print unicodedata.name(c)
+
+ # Get numeric value of second character
+ print unicodedata.numeric(u[1])
+
+When run, this prints::
+
+ 0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
+ 1 0bf2 No TAMIL NUMBER ONE THOUSAND
+ 2 0f84 Mn TIBETAN MARK HALANTA
+ 3 1770 Lo TAGBANWA LETTER SA
+ 4 33af So SQUARE RAD OVER S SQUARED
+ 1000.0
+
+The category codes are abbreviations describing the nature of the
+character. These are grouped into categories such as "Letter",
+"Number", "Punctuation", or "Symbol", which in turn are broken up into
+subcategories. To take the codes from the above output, ``'Ll'``
+means "Letter, lowercase", ``'No'`` means "Number, other", ``'Mn'`` is
+"Mark, nonspacing", and ``'So'`` is "Symbol, other". See
+<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
+for a list of category codes.
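+
+Since the first letter of each code names the major class, a program
+can group characters without knowing every subcategory. The following
+is a minimal sketch; the ``MAJOR_CLASSES`` table and the ``major_class``
+helper are hypothetical and cover only the classes mentioned above,
+lumping everything else together as 'Other'.
+

```python
import unicodedata

# Hypothetical lookup table covering only the major classes named above.
MAJOR_CLASSES = {'L': 'Letter', 'M': 'Mark', 'N': 'Number',
                 'P': 'Punctuation', 'S': 'Symbol'}

def major_class(char):
    """Map a character to its major class via the first category letter."""
    code = unicodedata.category(char)
    return MAJOR_CLASSES.get(code[0], 'Other')

# The same five code points used in the program above.
sample = u'\u00e9\u0bf2\u0f84\u1770\u33af'
classes = [major_class(c) for c in sample]
```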
+
+References
+''''''''''''''
+
+The Unicode and 8-bit string types are described in the Python library
+reference at <http://docs.python.org/lib/typesseq.html>.
+
+The documentation for the ``unicodedata`` module is at
+<http://docs.python.org/lib/module-unicodedata.html>.
+
+The documentation for the ``codecs`` module is at
+<http://docs.python.org/lib/module-codecs.html>.
+
+Marc-André Lemburg gave a presentation at EuroPython 2002
+titled "Python and Unicode". A PDF version of his slides
+is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
+and is an excellent overview of the design of Python's Unicode features.
+
+
+Reading and Writing Unicode Data
+----------------------------------------
+
+Once you've written some code that works with Unicode data, the next
+problem is input/output. How do you get Unicode strings into your
+program, and how do you convert Unicode into a form suitable for
+storage or transmission?
+
+It's possible that you may not need to do anything depending on your
+input sources and output destinations; you should check whether the
+libraries used in your application support Unicode natively. XML
+parsers often return Unicode data, for example. Many relational
+databases also support Unicode-valued columns and can return Unicode
+values from an SQL query.
+
+Unicode data is usually converted to a particular encoding before it
+gets written to disk or sent over a socket. It's possible to do all
+the work yourself: open a file, read an 8-bit string from it, and
+convert the string with ``unicode(str, encoding)``. However, the
+manual approach is not recommended.
+
+One problem is the multi-byte nature of encodings; one Unicode
+character can be represented by several bytes. If you want to read
+the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
+error-handling code to catch the case where only part of the bytes
+encoding a single Unicode character are read at the end of a chunk.
+One solution would be to read the entire file into memory and then
+perform the decoding, but that prevents you from working with files
+that are extremely large; if you need to read a 2 GB file, you need 2 GB
+of RAM. (More, really, since for at least a moment you'd need to have
+both the encoded string and its Unicode version in memory.)
+
+The solution would be to use the low-level decoding interface to catch
+the case of partial coding sequences. The work of implementing this
+has already been done for you: the ``codecs`` module includes a
+version of the ``open()`` function that returns a file-like object
+that assumes the file's contents are in a specified encoding; its
+``.read()`` method returns Unicode strings and its ``.write()`` method
+accepts them.
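+
+The partial-sequence problem can be reproduced in a few lines. The
+sketch below splits a UTF-8 string in the middle of a three-byte
+character, shows that decoding the first chunk on its own fails, and
+then decodes the chunks with an incremental decoder that carries the
+leftover bytes forward. It assumes an interpreter that provides
+``codecs.getincrementaldecoder()`` (newer than Python 2.4); the helper
+name ``decode_chunks`` is hypothetical.
+

```python
import codecs

def decode_chunks(chunks, encoding='utf-8'):
    """Decode a sequence of byte chunks that may split a character."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next call.
        pieces.append(decoder.decode(chunk))
    # final=True makes the decoder complain about any dangling bytes.
    pieces.append(decoder.decode(b'', final=True))
    return u''.join(pieces)

encoded = u'\u4500 blah'.encode('utf-8')   # U+4500 takes three bytes
chunks = [encoded[:2], encoded[2:]]        # split inside that character

try:
    chunks[0].decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

text = decode_chunks(chunks)
```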
+
+The function's parameters are
+``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``.
+``mode`` can be ``'r'``, ``'w'``, or ``'a'``, just like the corresponding
+parameter to the regular built-in ``open()`` function; add a ``'+'`` to
+update the file. ``buffering`` is similarly parallel to the standard
+function's parameter. ``encoding`` is a string giving
+the encoding to use; if it's left as ``None``, a regular Python file
+object that accepts 8-bit strings is returned. Otherwise, a wrapper
+object is returned, and data written to or read from the wrapper
+object will be converted as needed. ``errors`` specifies the action
+for encoding errors and can be one of the usual values of 'strict',
+'ignore', and 'replace'.
+
+Reading Unicode from a file is therefore simple::
+
+ import codecs
+ f = codecs.open('unicode.rst', encoding='utf-8')
+ for line in f:
+     print repr(line)
+
+It's also possible to open files in update mode,
+allowing both reading and writing::
+
+ f = codecs.open('test', encoding='utf-8', mode='w+')
+ f.write(u'\u4500 blah blah blah\n')
+ f.seek(0)
+ print repr(f.readline()[:1])
+ f.close()
+
+Unicode character U+FEFF is used as a byte-order mark (BOM),
+and is often written as the first character of a file in order
+to assist with autodetection of the file's byte ordering.
+Some encodings, such as UTF-16, expect a BOM to be present at
+the start of a file; when such an encoding is used,
+the BOM will be automatically written as the first character
+and will be silently dropped when the file is read. There are
+variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
+for little-endian and big-endian encodings, that specify
+one particular byte ordering and don't
+skip the BOM.
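+
+The difference between the BOM-aware codec and the fixed-order
+variants can be seen directly on encoded strings. This is only a small
+sketch; the variable names are illustrative, and it assumes an
+interpreter that accepts ``b''`` literals.
+

```python
import codecs

text = u'abc'

with_bom = text.encode('utf-16')      # native byte order, BOM written first
little = text.encode('utf-16-le')     # fixed byte order, no BOM written

starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The generic codec consumes the BOM when decoding...
round_trip = with_bom.decode('utf-16')
# ...while the fixed-order codec leaves it in the result as U+FEFF.
kept = (codecs.BOM_UTF16_LE + little).decode('utf-16-le')
```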
+
+
+Unicode Filenames
+'''''''''''''''''''''''''
+
+Most of the operating systems in common use today support filenames
+that contain arbitrary Unicode characters. Usually this is
+implemented by converting the Unicode string into some encoding that
+varies depending on the system. For example, Mac OS X uses UTF-8 while
+Windows uses a configurable encoding; on Windows, Python uses the name
+"mbcs" to refer to whatever the currently configured encoding is. On
+Unix systems, there will only be a filesystem encoding if you've set
+the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
+the default encoding is ASCII.
+
+The ``sys.getfilesystemencoding()`` function returns the encoding to
+use on your current system, in case you want to do the encoding
+manually, but there's not much reason to bother. When opening a file
+for reading or writing, you can usually just provide the Unicode
+string as the filename, and it will be automatically converted to the
+right encoding for you::
+
+ filename = u'filename\u4500abc'
+ f = open(filename, 'w')
+ f.write('blah\n')
+ f.close()
+
+Functions in the ``os`` module such as ``os.stat()`` will also accept
+Unicode filenames.
+
+``os.listdir()``, which returns filenames, raises an issue: should it
+return the Unicode version of filenames, or should it return 8-bit
+strings containing the encoded versions? ``os.listdir()`` will do
+both, depending on whether you provided the directory path as an 8-bit
+string or a Unicode string. If you pass a Unicode string as the path,
+filenames will be decoded using the filesystem's encoding and a list
+of Unicode strings will be returned, while passing an 8-bit path will
+return the 8-bit versions of the filenames. For example, assuming the
+default filesystem encoding is UTF-8, running the following program::
+
+ fn = u'filename\u4500abc'
+ f = open(fn, 'w')
+ f.close()
+
+ import os
+ print os.listdir('.')
+ print os.listdir(u'.')
+
+will produce the following output::
+
+ amk:~$ python t.py
+ ['.svn', 'filename\xe4\x94\x80abc', ...]
+ [u'.svn', u'filename\u4500abc', ...]
+
+The first list contains UTF-8-encoded filenames, and the second list
+contains the Unicode versions.
+
+
+
+Tips for Writing Unicode-aware Programs
+''''''''''''''''''''''''''''''''''''''''''''
+
+This section provides some suggestions on writing software that
+deals with Unicode.
+
+The most important tip is:
+
+ Software should only work with Unicode strings internally,
+ converting to a particular encoding on output.
+
+If you attempt to write processing functions that accept both
+Unicode and 8-bit strings, you will find your program vulnerable to
+bugs wherever you combine the two different kinds of strings. Python's
+default encoding is ASCII, so whenever an 8-bit string containing a
+byte with a value >127 is combined with a Unicode string, you'll get a
+``UnicodeDecodeError`` because that byte can't be handled by the ASCII
+encoding.
+
+It's easy to miss such problems if you only test your software
+with data that doesn't contain any
+accents; everything will seem to work, but there's actually a bug in your
+program waiting for the first user who attempts to use characters >127.
+A second tip, therefore, is:
+
+ Include characters >127 and, even better, characters >255 in your
+ test data.
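+
+Both tips can be combined in one small sketch: decode once where data
+enters the program, keep the processing function Unicode-only, encode
+once on the way out, and exercise the whole path with a character >127
+and another >255. The helpers ``to_unicode`` and ``shout`` are
+hypothetical, UTF-8 is an assumed choice of encoding, and the snippet
+assumes an interpreter that has the ``bytes`` name and ``b''``
+literals.
+

```python
def to_unicode(value, encoding='utf-8'):
    """Decode 8-bit input once, at the edge of the program."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def shout(text):
    """Internal processing: sees Unicode strings only."""
    return text.upper() + u'!'

# Test data containing a character >127 (e-acute) and one >255 (U+4500).
raw_inputs = [b'caf\xc3\xa9', u'\u4500abc']

# Encode only when the result leaves the program.
results = [shout(to_unicode(raw)).encode('utf-8') for raw in raw_inputs]
```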
+
+When using data coming from a web browser or some other untrusted source,
+a common technique is to check for illegal characters in a string
+before using the string in a generated command line or storing it in a
+database. If you're doing this, be careful to check
+the string once it's in the form that will be used or stored; it's
+possible for encodings to be used to disguise characters. This is especially
+true if the input data also specifies the encoding;
+many encodings leave the commonly checked-for characters alone,
+but Python includes some encodings such as ``'base64'``
+that modify every single character.
+
+For example, let's say you have a content management system that takes a
+Unicode filename, and you want to disallow paths with a '/' character.
+You might write this code::
+
+ def read_file(filename, encoding):
+     if '/' in filename:
+         raise ValueError("'/' not allowed in filenames")
+     unicode_name = filename.decode(encoding)
+     f = open(unicode_name, 'r')
+     # ... return contents of file ...
+
+However, if an attacker could specify the ``'base64'`` encoding,
+they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
+encoded form of the string ``'/etc/passwd'``, to read a
+system file. The above code looks for ``'/'`` characters
+in the encoded form and misses the dangerous character
+in the resulting decoded form.
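+
+One possible repair, sketched under stated assumptions, is to decode
+first and only then look for the forbidden character. The whitelist of
+acceptable encodings is an extra precaution that is not part of the
+original example; both ``ALLOWED_ENCODINGS`` and ``checked_filename``
+are hypothetical names, and the snippet assumes an interpreter that
+accepts ``b''`` literals.
+

```python
ALLOWED_ENCODINGS = ('ascii', 'latin-1', 'utf-8')  # hypothetical whitelist

def checked_filename(raw_name, encoding):
    """Decode an 8-bit filename, then validate the decoded form."""
    if encoding.lower() not in ALLOWED_ENCODINGS:
        raise ValueError("unsupported filename encoding: %r" % (encoding,))
    unicode_name = raw_name.decode(encoding)
    # The check now runs on the form that will actually be used.
    if u'/' in unicode_name:
        raise ValueError("'/' not allowed in filenames")
    return unicode_name
```

+
+With this ordering, the base-64 string from the example is rejected
+because of its encoding, and a plainly encoded ``'/etc/passwd'`` is
+rejected by the check on the decoded name.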
+
+References
+''''''''''''''
+
+The PDF slides for Marc-André Lemburg's presentation "Writing
+Unicode-aware Applications in Python" are available at
+<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
+and discuss questions of character encodings as well as how to
+internationalize and localize an application.
+
+
+Revision History and Acknowledgements
+------------------------------------------
+
+Thanks to the following people who have noted errors or offered
+suggestions on this article: Nicholas Bastin,
+Marius Gedminas, Kent Johnson, Ken Krugler,
+Marc-André Lemburg, Martin von Löwis.
+
+Version 1.0: posted August 5 2005.
+
+Version 1.01: posted August 7 2005. Corrects factual and markup
+errors; adds several links.
+
+Version 1.02: posted August 16 2005. Corrects factual errors.
+
+
+.. comment Additional topic: building Python w/ UCS2 or UCS4 support
+.. comment Describe obscure -U switch somewhere?
+
+.. comment
+ Original outline:
+
+ - [ ] Unicode introduction
+ - [ ] ASCII
+ - [ ] Terms
+ - [ ] Character
+ - [ ] Code point
+ - [ ] Encodings
+ - [ ] Common encodings: ASCII, Latin-1, UTF-8
+ - [ ] Unicode Python type
+ - [ ] Writing unicode literals
+ - [ ] Obscurity: -U switch
+ - [ ] Built-ins
+ - [ ] unichr()
+ - [ ] ord()
+ - [ ] unicode() constructor
+ - [ ] Unicode type
+ - [ ] encode(), decode() methods
+ - [ ] Unicodedata module for character properties
+ - [ ] I/O
+ - [ ] Reading/writing Unicode data into files
+ - [ ] Byte-order marks
+ - [ ] Unicode filenames
+ - [ ] Writing Unicode programs
+ - [ ] Do everything in Unicode
+ - [ ] Declaring source code encodings (PEP 263)
+ - [ ] Other issues
+ - [ ] Building Python (UCS2, UCS4)