author: Georg Brandl <georg@python.org> 2007-08-15 14:26:55 (GMT)
committer: Georg Brandl <georg@python.org> 2007-08-15 14:26:55 (GMT)
commit: f56181ff53ba00b7bed3997a4dccd9a1b6217b57
tree: 1200947a7ffc78c2719831e4c7fd900a8ab01368 /Doc/howto
parent: af62d9abfb78067a54c769302005f952ed999f6a
Delete the LaTeX doc tree.
Diffstat (limited to 'Doc/howto')
-rw-r--r-- Doc/howto/Makefile       |   84
-rw-r--r-- Doc/howto/TODO           |   13
-rw-r--r-- Doc/howto/advocacy.tex   |  411
-rw-r--r-- Doc/howto/curses.tex     |  486
-rw-r--r-- Doc/howto/doanddont.tex  |  344
-rw-r--r-- Doc/howto/functional.rst | 1472
-rw-r--r-- Doc/howto/regex.tex      | 1477
-rw-r--r-- Doc/howto/sockets.tex    |  465
-rw-r--r-- Doc/howto/unicode.rst    |  766
-rw-r--r-- Doc/howto/urllib2.rst    |  603
10 files changed, 0 insertions, 6121 deletions
diff --git a/Doc/howto/Makefile b/Doc/howto/Makefile
deleted file mode 100644
index 18110a2..0000000
--- a/Doc/howto/Makefile
+++ /dev/null
@@ -1,84 +0,0 @@
-# Makefile for the HOWTO directory
-# LaTeX HOWTOs can be turned into HTML, PDF, PS, DVI or plain text output.
-# reST HOWTOs can only be turned into HTML.
-
-# Variables to change
-
-# Paper size for non-HTML formats (letter or a4)
-PAPER=letter
-
-# Arguments to rst2html.py, and location of the script
-RSTARGS = --input-encoding=utf-8
-RST2HTML = rst2html.py
-
-# List of HOWTOs that aren't to be processed. This should contain the
-# base name of the HOWTO without any extension (e.g. 'advocacy',
-# 'unicode').
-REMOVE_HOWTOS =
-
-MKHOWTO=../tools/mkhowto
-WEBDIR=.
-PAPERDIR=../paper-$(PAPER)
-HTMLDIR=../html
-
-# Determine list of files to be built
-TEX_SOURCES = $(wildcard *.tex)
-RST_SOURCES = $(wildcard *.rst)
-TEX_NAMES = $(filter-out $(REMOVE_HOWTOS),$(patsubst %.tex,%,$(TEX_SOURCES)))
-
-PAPER_PATHS=$(addprefix $(PAPERDIR)/,$(TEX_NAMES))
-DVI =$(addsuffix .dvi,$(PAPER_PATHS))
-PDF =$(addsuffix .pdf,$(PAPER_PATHS))
-PS =$(addsuffix .ps,$(PAPER_PATHS))
-
-ALL_HOWTO_NAMES = $(TEX_NAMES) $(patsubst %.rst,%,$(RST_SOURCES))
-HOWTO_NAMES = $(filter-out $(REMOVE_HOWTOS),$(ALL_HOWTO_NAMES))
-HTML = $(addprefix $(HTMLDIR)/,$(HOWTO_NAMES))
-
-# Rules for building various formats
-
-# reST to HTML
-$(HTMLDIR)/%: %.rst
- if [ ! -d $@ ] ; then mkdir $@ ; fi
- $(RST2HTML) $(RSTARGS) $< >$@/index.html
-
-# LaTeX to various output formats
-$(PAPERDIR)/%.dvi : %.tex
- $(MKHOWTO) --dvi $<
- mv $*.dvi $@
-
-$(PAPERDIR)/%.pdf : %.tex
- $(MKHOWTO) --pdf $<
- mv $*.pdf $@
-
-$(PAPERDIR)/%.ps : %.tex
- $(MKHOWTO) --ps $<
- mv $*.ps $@
-
-$(HTMLDIR)/% : %.tex
- $(MKHOWTO) --html --iconserver="." --dir $@ $<
-
-# Rule that isn't actually used -- we no longer support the 'txt' target.
-$(PAPERDIR)/%.txt : %.tex
- $(MKHOWTO) --text $<
- mv $@ txt
-
-default:
- @echo "'all' -- build all files"
- @echo "'dvi', 'pdf', 'ps', 'html' -- build one format"
-
-all: dvi pdf ps html
-
-.PHONY : dvi pdf ps html
-dvi: $(DVI)
-pdf: $(PDF)
-ps: $(PS)
-html: $(HTML)
-
-clean:
- rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how *.bkm
- rm -f *.dvi *.pdf *.ps
-
-clobber:
- rm -rf $(HTML)
- rm -rf $(DVI) $(PDF) $(PS)
diff --git a/Doc/howto/TODO b/Doc/howto/TODO
deleted file mode 100644
index c229828..0000000
--- a/Doc/howto/TODO
+++ /dev/null
@@ -1,13 +0,0 @@
-
-Short-term tasks:
- Quick revision pass to make HOWTOs match the current state of Python
-doanddont regex sockets
-
-Medium-term tasks:
- Revisit the regex howto.
- * Add exercises with answers for each section
- * More examples?
-
-Long-term tasks:
- Integrate with other Python docs?
-
diff --git a/Doc/howto/advocacy.tex b/Doc/howto/advocacy.tex
deleted file mode 100644
index 9074b3f..0000000
--- a/Doc/howto/advocacy.tex
+++ /dev/null
@@ -1,411 +0,0 @@
-
-\documentclass{howto}
-
-\title{Python Advocacy HOWTO}
-
-\release{0.03}
-
-\author{A.M. Kuchling}
-\authoraddress{\email{amk@amk.ca}}
-
-\begin{document}
-\maketitle
-
-\begin{abstract}
-\noindent
-It's usually difficult to get your management to accept open source
-software, and Python is no exception to this rule. This document
-discusses reasons to use Python, strategies for winning acceptance,
-facts and arguments you can use, and cases where you \emph{shouldn't}
-try to use Python.
-
-This document is available from the Python HOWTO page at
-\url{http://www.python.org/doc/howto}.
-
-\end{abstract}
-
-\tableofcontents
-
-\section{Reasons to Use Python}
-
-There are several reasons to incorporate a scripting language into
-your development process, and this section will discuss them, and why
-Python has some properties that make it a particularly good choice.
-
- \subsection{Programmability}
-
-Programs are often organized in a modular fashion. Lower-level
-operations are grouped together, and called by higher-level functions,
-which may in turn be used as basic operations by still further upper
-levels.
-
-For example, the lowest level might define a very low-level
-set of functions for accessing a hash table. The next level might use
-hash tables to store the headers of a mail message, mapping a header
-name like \samp{Date} to a value such as \samp{Tue, 13 May 1997
-20:00:54 -0400}. A yet higher level may operate on message objects,
-without knowing or caring that message headers are stored in a hash
-table, and so forth.
-
-Often, the lowest levels do very simple things; they implement a data
-structure such as a binary tree or hash table, or they perform some
-simple computation, such as converting a date string to a number. The
-higher levels then contain logic connecting these primitive
-operations. Using the approach, the primitives can be seen as basic
-building blocks which are then glued together to produce the complete
-product.
-
-Why is this design approach relevant to Python? Because Python is
-well suited to functioning as such a glue language. A common approach
-is to write a Python module that implements the lower level
-operations; for the sake of speed, the implementation might be in C,
-Java, or even Fortran. Once the primitives are available to Python
-programs, the logic underlying higher level operations is written in
-the form of Python code. The high-level logic is then more
-understandable, and easier to modify.
-
-John Ousterhout wrote a paper that explains this idea at greater
-length, entitled ``Scripting: Higher Level Programming for the 21st
-Century''. I recommend that you read this paper; see the references
-for the URL. Ousterhout is the inventor of the Tcl language, and
-therefore argues that Tcl should be used for this purpose; he only
-briefly refers to other languages such as Python, Perl, and
-Lisp/Scheme, but in reality, Ousterhout's argument applies to
-scripting languages in general, since you could equally write
-extensions for any of the languages mentioned above.
-
- \subsection{Prototyping}
-
-In \emph{The Mythical Man-Month}, Frederick Brooks suggests the
-following rule when planning software projects: ``Plan to throw one
-away; you will anyway.'' Brooks is saying that the first attempt at a
-software design often turns out to be wrong; unless the problem is
-very simple or you're an extremely good designer, you'll find that new
-requirements and features become apparent once development has
-actually started. If these new requirements can't be cleanly
-incorporated into the program's structure, you're presented with two
-unpleasant choices: hammer the new features into the program somehow,
-or scrap everything and write a new version of the program, taking the
-new features into account from the beginning.
-
-Python provides you with a good environment for quickly developing an
-initial prototype. That lets you get the overall program structure
-and logic right, and you can fine-tune small details in the fast
-development cycle that Python provides. Once you're satisfied with
-the GUI interface or program output, you can translate the Python code
-into C++, Fortran, Java, or some other compiled language.
-
-Prototyping means you have to be careful not to use too many Python
-features that are hard to implement in your other language. Using
-\code{eval()}, or regular expressions, or the \module{pickle} module,
-means that you're going to need C or Java libraries for formula
-evaluation, regular expressions, and serialization, for example. But
-it's not hard to avoid such tricky code, and in the end the
-translation usually isn't very difficult. The resulting code can be
-rapidly debugged, because any serious logical errors will have been
-removed from the prototype, leaving only more minor slip-ups in the
-translation to track down.
-
-This strategy builds on the earlier discussion of programmability.
-Using Python as glue to connect lower-level components has obvious
-relevance for constructing prototype systems. In this way Python can
-help you with development, even if end users never come in contact
-with Python code at all. If the performance of the Python version is
-adequate and corporate politics allow it, you may not need to do a
-translation into C or Java, but it can still be faster to develop a
-prototype and then translate it, instead of attempting to produce the
-final version immediately.
-
-One example of this development strategy is Microsoft Merchant Server.
-Version 1.0 was written in pure Python, by a company that subsequently
-was purchased by Microsoft. Version 2.0 began to translate the code
-into \Cpp, shipping with some \Cpp code and some Python code. Version
-3.0 didn't contain any Python at all; all the code had been translated
-into \Cpp. Even though the product doesn't contain a Python
-interpreter, the Python language has still served a useful purpose by
-speeding up development.
-
-This is a very common use for Python. Past conference papers have
-also described this approach for developing high-level numerical
-algorithms; see David M. Beazley and Peter S. Lomdahl's paper
-``Feeding a Large-scale Physics Application to Python'' in the
-references for a good example. If an algorithm's basic operations are
-things like ``Take the inverse of this 4000x4000 matrix'', and are
-implemented in some lower-level language, then Python has almost no
-additional performance cost; the extra time required for Python to
-evaluate an expression like \code{m.invert()} is dwarfed by the cost
-of the actual computation. It's particularly good for applications
-where seemingly endless tweaking is required to get things right. GUI
-interfaces and Web sites are prime examples.
-
-The Python code is also shorter and faster to write (once you're
-familiar with Python), so it's easier to throw it away if you decide
-your approach was wrong; if you'd spent two weeks working on it
-instead of just two hours, you might waste time trying to patch up
-what you've got out of a natural reluctance to admit that those two
-weeks were wasted. Truthfully, those two weeks haven't been wasted,
-since you've learnt something about the problem and the technology
-you're using to solve it, but it's human nature to view this as a
-failure of some sort.
-
- \subsection{Simplicity and Ease of Understanding}
-
-Python is definitely \emph{not} a toy language that's only usable for
-small tasks. The language features are general and powerful enough to
-enable it to be used for many different purposes. It's useful at the
-small end, for 10- or 20-line scripts, but it also scales up to larger
-systems that contain thousands of lines of code.
-
-However, this expressiveness doesn't come at the cost of an obscure or
-tricky syntax. While Python has some dark corners that can lead to
-obscure code, there are relatively few such corners, and proper design
-can isolate their use to only a few classes or modules. It's
-certainly possible to write confusing code by using too many features
-with too little concern for clarity, but most Python code can look a
-lot like a slightly-formalized version of human-understandable
-pseudocode.
-
-In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following
-definition for ``compact'':
-
-\begin{quotation}
- Compact \emph{adj.} Of a design, describes the valuable property
- that it can all be apprehended at once in one's head. This
- generally means the thing created from the design can be used
- with greater facility and fewer errors than an equivalent tool
- that is not compact. Compactness does not imply triviality or
- lack of power; for example, C is compact and FORTRAN is not,
- but C is more powerful than FORTRAN. Designs become
- non-compact through accreting features and cruft that don't
- merge cleanly into the overall design scheme (thus, some fans
- of Classic C maintain that ANSI C is no longer compact).
-\end{quotation}
-
-(From \url{http://www.catb.org/~esr/jargon/html/C/compact.html})
-
-In this sense of the word, Python is quite compact, because the
-language has just a few ideas, which are used in lots of places. Take
-namespaces, for example. Import a module with \code{import math}, and
-you create a new namespace called \samp{math}. Classes are also
-namespaces that share many of the properties of modules, and have a
-few of their own; for example, you can create instances of a class.
-Instances? They're yet another namespace. Namespaces are currently
-implemented as Python dictionaries, so they have the same methods as
-the standard dictionary data type: .keys() returns all the keys, and
-so forth.
-
-This simplicity arises from Python's development history. The
-language syntax derives from different sources; ABC, a relatively
-obscure teaching language, is one primary influence, and Modula-3 is
-another. (For more information about ABC and Modula-3, consult their
-respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and
-\url{http://www.m3.org}.) Other features have come from C, Icon,
-Algol-68, and even Perl. Python hasn't really innovated very much,
-but instead has tried to keep the language small and easy to learn,
-building on ideas that have been tried in other languages and found
-useful.
-
-Simplicity is a virtue that should not be underestimated. It lets you
-learn the language more quickly, and then rapidly write code, code
-that often works the first time you run it.
-
- \subsection{Java Integration}
-
-If you're working with Java, Jython
-(\url{http://www.jython.org/}) is definitely worth your
-attention. Jython is a re-implementation of Python in Java that
-compiles Python code into Java bytecodes. The resulting environment
-has very tight, almost seamless, integration with Java. It's trivial
-to access Java classes from Python, and you can write Python classes
-that subclass Java classes. Jython can be used for prototyping Java
-applications in much the same way CPython is used, and it can also be
-used for test suites for Java code, or embedded in a Java application
-to add scripting capabilities.
-
-\section{Arguments and Rebuttals}
-
-Let's say that you've decided upon Python as the best choice for your
-application. How can you convince your management, or your fellow
-developers, to use Python? This section lists some common arguments
-against using Python, and provides some possible rebuttals.
-
-\emph{Python is freely available software that doesn't cost anything.
-How good can it be?}
-
-Very good, indeed. These days Linux and Apache, two other pieces of
-open source software, are becoming more respected as alternatives to
-commercial software, but Python hasn't had all the publicity.
-
-Python has been around for several years, with many users and
-developers. Accordingly, the interpreter has been used by many
-people, and has gotten most of the bugs shaken out of it. While bugs
-are still discovered at intervals, they're usually either quite
-obscure (they'd have to be, for no one to have run into them before)
-or they involve interfaces to external libraries. The internals of
-the language itself are quite stable.
-
-Having the source code should be viewed as making the software
-available for peer review; people can examine the code, suggest (and
-implement) improvements, and track down bugs. To find out more about
-the idea of open source code, along with arguments and case studies
-supporting it, go to \url{http://www.opensource.org}.
-
-\emph{Who's going to support it?}
-
-Python has a sizable community of developers, and the number is still
-growing. The Internet community surrounding the language is an active
-one, and is worth being considered another one of Python's advantages.
-Most questions posted to the comp.lang.python newsgroup are quickly
-answered by someone.
-
-Should you need to dig into the source code, you'll find it's clear
-and well-organized, so it's not very difficult to write extensions and
-track down bugs yourself. If you'd prefer to pay for support, there
-are companies and individuals who offer commercial support for Python.
-
-\emph{Who uses Python for serious work?}
-
-Lots of people; one interesting thing about Python is the surprising
-diversity of applications that it's been used for. People are using
-Python to:
-
-\begin{itemize}
-\item Run Web sites
-\item Write GUI interfaces
-\item Control
-number-crunching code on supercomputers
-\item Make a commercial application scriptable by embedding the Python
-interpreter inside it
-\item Process large XML data sets
-\item Build test suites for C or Java code
-\end{itemize}
-
-Whatever your application domain is, there's probably someone who's
-used Python for something similar. Yet, despite being useable for
-such high-end applications, Python's still simple enough to use for
-little jobs.
-
-See \url{http://wiki.python.org/moin/OrganizationsUsingPython} for a list of some of the
-organizations that use Python.
-
-\emph{What are the restrictions on Python's use?}
-
-They're practically nonexistent. Consult the \file{Misc/COPYRIGHT}
-file in the source distribution, or
-\url{http://www.python.org/doc/Copyright.html} for the full language,
-but it boils down to three conditions.
-
-\begin{itemize}
-
-\item You have to leave the copyright notice on the software; if you
-don't include the source code in a product, you have to put the
-copyright notice in the supporting documentation.
-
-\item Don't claim that the institutions that have developed Python
-endorse your product in any way.
-
-\item If something goes wrong, you can't sue for damages. Practically
-all software licences contain this condition.
-
-\end{itemize}
-
-Notice that you don't have to provide source code for anything that
-contains Python or is built with it. Also, the Python interpreter and
-accompanying documentation can be modified and redistributed in any
-way you like, and you don't have to pay anyone any licensing fees at
-all.
-
-\emph{Why should we use an obscure language like Python instead of
-well-known language X?}
-
-I hope this HOWTO, and the documents listed in the final section, will
-help convince you that Python isn't obscure, and has a healthily
-growing user base. One word of advice: always present Python's
-positive advantages, instead of concentrating on language X's
-failings. People want to know why a solution is good, rather than why
-all the other solutions are bad. So instead of attacking a competing
-solution on various grounds, simply show how Python's virtues can
-help.
-
-
-\section{Useful Resources}
-
-\begin{definitions}
-
-
-\term{\url{http://www.pythonology.com/success}}
-
-The Python Success Stories are a collection of stories from successful
-users of Python, with the emphasis on business and corporate users.
-
-%\term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}
-
-%The first chapter of \emph{Internet Programming with Python} also
-%examines some of the reasons for using Python. The book is well worth
-%buying, but the publishers have made the first chapter available on
-%the Web.
-
-\term{\url{http://home.pacbell.net/ouster/scripting.html}}
-
-John Ousterhout's white paper on scripting is a good argument for the
-utility of scripting languages, though naturally enough, he emphasizes
-Tcl, the language he developed. Most of the arguments would apply to
-any scripting language.
-
-\term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}}
-
-The authors, David M. Beazley and Peter S. Lomdahl,
-describe their use of Python at Los Alamos National Laboratory.
-It's another good example of how Python can help get real work done.
-This quotation from the paper has been echoed by many people:
-
-\begin{quotation}
- Originally developed as a large monolithic application for
- massively parallel processing systems, we have used Python to
- transform our application into a flexible, highly modular, and
- extremely powerful system for performing simulation, data
- analysis, and visualization. In addition, we describe how Python
- has solved a number of important problems related to the
- development, debugging, deployment, and maintenance of scientific
- software.
-\end{quotation}
-
-\term{\url{http://pythonjournal.cognizor.com/pyj1/Everitt-Feit_interview98-V1.html}}
-
-This interview with Andy Feit, discussing Infoseek's use of Python, can be
-used to show that choosing Python didn't introduce any difficulties
-into a company's development process, and provided some substantial benefits.
-
-%\term{\url{http://www.python.org/psa/Commercial.html}}
-
-%Robin Friedrich wrote this document on how to support Python's use in
-%commercial projects.
-
-\term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}}
-
-For the 6th Python conference, Greg Stein presented a paper that
-traced Python's adoption and usage at a startup called eShop, and
-later at Microsoft.
-
-\term{\url{http://www.opensource.org}}
-
-Management may be doubtful of the reliability and usefulness of
-software that wasn't written commercially. This site presents
-arguments that show how open source software can have considerable
-advantages over closed-source software.
-
-\term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}}
-
-The Linux Advocacy mini-HOWTO was the inspiration for this document,
-and is also well worth reading for general suggestions on winning
-acceptance for a new technology, such as Linux or Python. In general,
-you won't make much progress by simply attacking existing systems and
-complaining about their inadequacies; this often ends up looking like
-unfocused whining. It's much better to point out some of the many
-areas where Python is an improvement over other systems.
-
-\end{definitions}
-
-\end{document}
-
-
diff --git a/Doc/howto/curses.tex b/Doc/howto/curses.tex
deleted file mode 100644
index 3e4cada..0000000
--- a/Doc/howto/curses.tex
+++ /dev/null
@@ -1,486 +0,0 @@
-\documentclass{howto}
-
-\title{Curses Programming with Python}
-
-\release{2.02}
-
-\author{A.M. Kuchling, Eric S. Raymond}
-\authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}}
-
-\begin{document}
-\maketitle
-
-\begin{abstract}
-\noindent
-This document describes how to write text-mode programs with Python 2.x,
-using the \module{curses} extension module to control the display.
-
-This document is available from the Python HOWTO page at
-\url{http://www.python.org/doc/howto}.
-\end{abstract}
-
-\tableofcontents
-
-\section{What is curses?}
-
-The curses library supplies a terminal-independent screen-painting and
-keyboard-handling facility for text-based terminals; such terminals
-include VT100s, the Linux console, and the simulated terminal provided
-by X11 programs such as xterm and rxvt. Display terminals support
-various control codes to perform common operations such as moving the
-cursor, scrolling the screen, and erasing areas. Different terminals
-use widely differing codes, and often have their own minor quirks.
-
-In a world of X displays, one might ask ``why bother''? It's true
-that character-cell display terminals are an obsolete technology, but
-there are niches in which being able to do fancy things with them is
-still valuable. One is on small-footprint or embedded Unixes that
-don't carry an X server. Another is for tools like OS installers
-and kernel configurators that may have to run before X is available.
-
-The curses library hides all the details of different terminals, and
-provides the programmer with an abstraction of a display, containing
-multiple non-overlapping windows. The contents of a window can be
-changed in various ways--adding text, erasing it, changing its
-appearance--and the curses library will automagically figure out what
-control codes need to be sent to the terminal to produce the right
-output.
-
-The curses library was originally written for BSD Unix; the later System V
-versions of Unix from AT\&T added many enhancements and new functions.
-BSD curses is no longer maintained, having been replaced by ncurses,
-which is an open-source implementation of the AT\&T interface. If you're
-using an open-source Unix such as Linux or FreeBSD, your system almost
-certainly uses ncurses. Since most current commercial Unix versions
-are based on System V code, all the functions described here will
-probably be available. The older versions of curses carried by some
-proprietary Unixes may not support everything, though.
-
-No one has made a Windows port of the curses module. On a Windows
-platform, try the Console module written by Fredrik Lundh. The
-Console module provides cursor-addressable text output, plus full
-support for mouse and keyboard input, and is available from
-\url{http://effbot.org/efflib/console}.
-
-\subsection{The Python curses module}
-
-The Python module is a fairly simple wrapper over the C functions
-provided by curses; if you're already familiar with curses programming
-in C, it's really easy to transfer that knowledge to Python. The
-biggest difference is that the Python interface makes things simpler,
-by merging different C functions such as \function{addstr},
-\function{mvaddstr}, \function{mvwaddstr}, into a single
-\method{addstr()} method. You'll see this covered in more detail
-later.
-
-This HOWTO is simply an introduction to writing text-mode programs
-with curses and Python. It doesn't attempt to be a complete guide to
-the curses API; for that, see the Python library guide's section on
-ncurses, and the C manual pages for ncurses. It will, however, give
-you the basic ideas.
-
-\section{Starting and ending a curses application}
-
-Before doing anything, curses must be initialized. This is done by
-calling the \function{initscr()} function, which will determine the
-terminal type, send any required setup codes to the terminal, and
-create various internal data structures. If successful,
-\function{initscr()} returns a window object representing the entire
-screen; this is usually called \code{stdscr}, after the name of the
-corresponding C
-variable.
-
-\begin{verbatim}
-import curses
-stdscr = curses.initscr()
-\end{verbatim}
-
-Usually curses applications turn off automatic echoing of keys to the
-screen, in order to be able to read keys and only display them under
-certain circumstances. This requires calling the \function{noecho()}
-function.
-
-\begin{verbatim}
-curses.noecho()
-\end{verbatim}
-
-Applications will also commonly need to react to keys instantly,
-without requiring the Enter key to be pressed; this is called cbreak
-mode, as opposed to the usual buffered input mode.
-
-\begin{verbatim}
-curses.cbreak()
-\end{verbatim}
-
-Terminals usually return special keys, such as the cursor keys or
-navigation keys such as Page Up and Home, as a multibyte escape
-sequence. While you could write your application to expect such
-sequences and process them accordingly, curses can do it for you,
-returning a special value such as \constant{curses.KEY_LEFT}. To get
-curses to do the job, you'll have to enable keypad mode.
-
-\begin{verbatim}
-stdscr.keypad(1)
-\end{verbatim}
-
-Terminating a curses application is much easier than starting one.
-You'll need to call
-
-\begin{verbatim}
-curses.nocbreak(); stdscr.keypad(0); curses.echo()
-\end{verbatim}
-
-to reverse the curses-friendly terminal settings. Then call the
-\function{endwin()} function to restore the terminal to its original
-operating mode.
-
-\begin{verbatim}
-curses.endwin()
-\end{verbatim}
-
-A common problem when debugging a curses application is to get your
-terminal messed up when the application dies without restoring the
-terminal to its previous state. In Python this commonly happens when
-your code is buggy and raises an uncaught exception. Keys are no
-longer echoed to the screen when you type them, for example, which
-makes using the shell difficult.
-
-In Python you can avoid these complications and make debugging much
-easier by importing the module \module{curses.wrapper}. It supplies a
-\function{wrapper()} function that takes a callable. It does the
-initializations described above, and also initializes colors if color
-support is present. It then runs your provided callable and finally
-deinitializes appropriately. The callable is called inside a try...except
-clause which catches exceptions, performs curses deinitialization, and
-then passes the exception upwards. Thus, your terminal won't be left
-in a funny state on exception.
-
-\section{Windows and Pads}
-
-Windows are the basic abstraction in curses. A window object
-represents a rectangular area of the screen, and supports various
-methods to display text, erase it, allow the user to input strings,
-and so forth.
-
-The \code{stdscr} object returned by the \function{initscr()} function
-is a window object that covers the entire screen. Many programs may
-need only this single window, but you might wish to divide the screen
-into smaller windows, in order to redraw or clear them separately.
-The \function{newwin()} function creates a new window of a given size,
-returning the new window object.
-
-\begin{verbatim}
-begin_x = 20 ; begin_y = 7
-height = 5 ; width = 40
-win = curses.newwin(height, width, begin_y, begin_x)
-\end{verbatim}
-
-A word about the coordinate system used in curses: coordinates are
-always passed in the order \emph{y,x}, and the top-left corner of a
-window is coordinate (0,0). This breaks a common convention for
-handling coordinates, where the \emph{x} coordinate usually comes
-first. This is an unfortunate difference from most other computer
-applications, but it's been part of curses since it was first written,
-and it's too late to change things now.
-
-When you call a method to display or erase text, the effect doesn't
-immediately show up on the display. This is because curses was
-originally written with slow 300-baud terminal connections in mind;
-with these terminals, minimizing the time required to redraw the
-screen is very important. This lets curses accumulate changes to the
-screen, and display them in the most efficient manner. For example,
-if your program displays some characters in a window, and then clears
-the window, there's no need to send the original characters because
-they'd never be visible.
-
-Accordingly, curses requires that you explicitly tell it to redraw
-windows, using the \function{refresh()} method of window objects. In
-practice, this doesn't really complicate programming with curses much.
-Most programs go into a flurry of activity, and then pause waiting for
-a keypress or some other action on the part of the user. All you have
-to do is to be sure that the screen has been redrawn before pausing to
-wait for user input, by simply calling \code{stdscr.refresh()} or the
-\function{refresh()} method of some other relevant window.
-
-A pad is a special case of a window; it can be larger than the actual
-display screen, and only a portion of it displayed at a time.
-Creating a pad simply requires the pad's height and width, while
-refreshing a pad requires giving the coordinates of the on-screen
-area where a subsection of the pad will be displayed.
-
-\begin{verbatim}
-pad = curses.newpad(100, 100)
-# These loops fill the pad with letters; this is
-# explained in the next section
-for y in range(0, 100):
-    for x in range(0, 100):
-        try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 )
-        except curses.error: pass
-
-# Displays a section of the pad in the middle of the screen
-pad.refresh( 0,0, 5,5, 20,75)
-\end{verbatim}
-
-The \function{refresh()} call displays a section of the pad in the
-rectangle extending from coordinate (5,5) to coordinate (20,75) on the
-screen; the upper left corner of the displayed section is coordinate
-(0,0) on the pad. Beyond that difference, pads are exactly like
-ordinary windows and support the same methods.
-
-If you have multiple windows and pads on screen there is a more
-efficient way to go, which will prevent annoying screen flicker at
-refresh time. Use the \method{noutrefresh()} method
-of each window to update the data structure
-representing the desired state of the screen; then change the physical
-screen to match the desired state in one go with the function
-\function{doupdate()}. The normal \method{refresh()} method calls
-\function{doupdate()} as its last act.
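-
-As a rough sketch of this pattern, two windows can be updated with a
-single physical repaint. The window sizes, positions, and the
-\function{main()} name are illustrative assumptions, the terminal is
-assumed to be at least 80 columns by 24 lines, and
-\code{curses.wrapper(main)} is assumed to be the available spelling of
-the wrapper function.
-
-\begin{verbatim}
-import curses
-
-def main(stdscr):
-    # Two small windows; sizes and positions are arbitrary.
-    left = curses.newwin(5, 20, 2, 2)
-    right = curses.newwin(5, 20, 2, 30)
-    left.addstr(1, 1, "left window")
-    right.addstr(1, 1, "right window")
-
-    # Record the desired contents of each window...
-    stdscr.noutrefresh()
-    left.noutrefresh()
-    right.noutrefresh()
-    # ...then repaint the physical screen once.
-    curses.doupdate()
-    stdscr.getch()      # wait for a keypress before exiting
-
-curses.wrapper(main)
-\end{verbatim}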
-
-\section{Displaying Text}
-
-{}From a C programmer's point of view, curses may sometimes look like
-a twisty maze of functions, all subtly different. For example,
-\function{addstr()} displays a string at the current cursor location
-in the \code{stdscr} window, while \function{mvaddstr()} moves to a
-given y,x coordinate first before displaying the string.
-\function{waddstr()} is just like \function{addstr()}, but allows
-specifying a window to use, instead of using \code{stdscr} by default.
-\function{mvwaddstr()} follows similarly.
-
-Fortunately the Python interface hides all these details;
-\code{stdscr} is a window object like any other, and methods like
-\function{addstr()} accept multiple argument forms. Usually there are
-four different forms.
-
-\begin{tableii}{|c|l|}{textrm}{Form}{Description}
-\lineii{\var{str} or \var{ch}}{Display the string \var{str} or
-character \var{ch} at the current position}
-\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or
-character \var{ch}, using attribute \var{attr} at the current position}
-\lineii{\var{y}, \var{x}, \var{str} or \var{ch}}
-{Move to position \var{y,x} within the window, and display \var{str}
-or \var{ch}}
-\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}}
-{Move to position \var{y,x} within the window, and display \var{str}
-or \var{ch}, using attribute \var{attr}}
-\end{tableii}
-
-Attributes allow displaying text in highlighted forms, such as in
-boldface, underline, reverse video, or in color. They'll be explained
-in more detail in the next subsection.
-
-The \function{addstr()} function takes a Python string as the value to
-be displayed, while the \function{addch()} functions take a character,
-which can be either a Python string of length 1 or an integer. If
-it's a string, you're limited to displaying characters between 0 and
-255. SVr4 curses provides constants for extension characters; these
-constants are integers greater than 255. For example,
-\constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is
-the upper left corner of a box (handy for drawing borders).
-
-Windows remember where the cursor was left after the last operation,
-so if you leave out the \var{y,x} coordinates, the string or character
-will be displayed wherever the last operation left off. You can also
-move the cursor with the \function{move(\var{y,x})} method. Because
-some terminals always display a flashing cursor, you may want to
-ensure that the cursor is positioned in some location where it won't
-be distracting; it can be confusing to have the cursor blinking at
-some apparently random location.
-
-If your application doesn't need a blinking cursor at all, you can
-call \function{curs_set(0)} to make it invisible. Equivalently, and
-for compatibility with older curses versions, there's a
-\function{leaveok(\var{bool})} function. When \var{bool} is true, the
-curses library will attempt to suppress the flashing cursor, and you
-won't need to worry about leaving it in odd locations.
-
-\subsection{Attributes and Color}
-
-Characters can be displayed in different ways. Status lines in a
-text-based application are commonly shown in reverse video; a text
-viewer may need to highlight certain words. curses supports this by
-allowing you to specify an attribute for each cell on the screen.
-
-An attribute is an integer, each bit representing a different
-attribute. You can try to display text with multiple attribute bits
-set, but curses doesn't guarantee that all the possible combinations
-are available, or that they're all visually distinct. That depends on
-the ability of the terminal being used, so it's safest to stick to the
-most commonly available attributes, listed here.
-
-\begin{tableii}{|c|l|}{constant}{Attribute}{Description}
-\lineii{A_BLINK}{Blinking text}
-\lineii{A_BOLD}{Extra bright or bold text}
-\lineii{A_DIM}{Half bright text}
-\lineii{A_REVERSE}{Reverse-video text}
-\lineii{A_STANDOUT}{The best highlighting mode available}
-\lineii{A_UNDERLINE}{Underlined text}
-\end{tableii}
-
-So, to display a reverse-video status line on the top line of the
-screen, you could code:
-
-\begin{verbatim}
-stdscr.addstr(0, 0, "Current mode: Typing mode",
- curses.A_REVERSE)
-stdscr.refresh()
-\end{verbatim}
-
-The curses library also supports color on those terminals that
-provide it. The most common such terminal is probably the Linux
-console, followed by color xterms.
-
-To use color, you must call the \function{start_color()} function soon
-after calling \function{initscr()}, to initialize the default color
-set (the \function{curses.wrapper.wrapper()} function does this
-automatically). Once that's done, the \function{has_colors()}
-function returns TRUE if the terminal in use can actually display
-color. (Note: curses uses the American spelling 'color', instead of
-the Canadian/British spelling 'colour'. If you're used to the British
-spelling, you'll have to resign yourself to misspelling it for the
-sake of these functions.)
-
-The curses library maintains a finite number of color pairs,
-containing a foreground (or text) color and a background color. You
-can get the attribute value corresponding to a color pair with the
-\function{color_pair()} function; this can be bitwise-OR'ed with other
-attributes such as \constant{A_REVERSE}, but again, such combinations
-are not guaranteed to work on all terminals.
-
-An example, which displays a line of text using color pair 1:
-
-\begin{verbatim}
-stdscr.addstr( "Pretty text", curses.color_pair(1) )
-stdscr.refresh()
-\end{verbatim}
-
-As I said before, a color pair consists of a foreground and
-background color. \function{start_color()} initializes 8 basic
-colors when it activates color mode. They are: 0:black, 1:red,
-2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white. The curses
-module defines named constants for each of these colors:
-\constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so
-forth.
-
-The \function{init_pair(\var{n, f, b})} function changes the
-definition of color pair \var{n}, to foreground color \var{f} and
-background color \var{b}. Color pair 0 is hard-wired to white on black,
-and cannot be changed.
-
-Let's put all this together. To change color pair 1 to red
-text on a white background, you would call:
-
-\begin{verbatim}
-curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
-\end{verbatim}
-
-When you change a color pair, any text already displayed using that
-color pair will change to the new colors. You can also display new
-text in this color with:
-
-\begin{verbatim}
-stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
-\end{verbatim}
-
-Very fancy terminals can change the definitions of the actual colors
-to a given RGB value. This lets you change color 1, which is usually
-red, to purple or blue or any other color you like. Unfortunately,
-the Linux console doesn't support this, so I'm unable to try it out,
-and can't provide any examples. You can check if your terminal can do
-this by calling \function{can_change_color()}, which returns TRUE if
-the capability is there. If you're lucky enough to have such a
-talented terminal, consult your system's man pages for more
-information.
-
-\section{User Input}
-
-The curses library itself offers only very simple input mechanisms.
-Python's support adds a text-input widget that makes up for some of
-the lack.
-
-The most common way to get input to a window is to use its
-\method{getch()} method. \method{getch()} pauses and waits for the
-user to hit a key, displaying it if \function{echo()} has been called
-earlier. You can optionally specify a coordinate to which the cursor
-should be moved before pausing.
-
-It's possible to change this behavior with the method
-\method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for
-the window becomes non-blocking and returns \code{curses.ERR} (a value
-of -1) when no input is ready. There's also a \function{halfdelay()}
-function, which can be used to (in effect) set a timer on each
-\method{getch()}; if no input becomes available within the delay
-given as the argument to \function{halfdelay()} (measured in tenths
-of a second), curses raises an exception.
-
-The \method{getch()} method returns an integer; if it's between 0 and
-255, it represents the ASCII code of the key pressed. Values greater
-than 255 are special keys such as Page Up, Home, or the cursor keys.
-You can compare the value returned to constants such as
-\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or
-\constant{curses.KEY_LEFT}. Usually the main loop of your program
-will look something like this:
-
-\begin{verbatim}
-while 1:
-    c = stdscr.getch()
-    if c == ord('p'): PrintDocument()
-    elif c == ord('q'): break # Exit the while()
-    elif c == curses.KEY_HOME: x = y = 0
-\end{verbatim}
-
-The \module{curses.ascii} module supplies ASCII class membership
-functions that take either integer or 1-character-string
-arguments; these may be useful in writing more readable tests for
-your command interpreters. It also supplies conversion functions
-that take either integer or 1-character-string arguments and return
-the same type. For example, \function{curses.ascii.ctrl()} returns
-the control character corresponding to its argument.
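-
-A brief sketch of how these helpers might be used follows. The
-\function{describe_key()} function is a made-up name for illustration;
-only \module{curses.ascii} functions are called, so no terminal is
-needed to try it.
-
-\begin{verbatim}
-import curses.ascii
-
-def describe_key(c):
-    # c may be an integer key code or a 1-character string
-    if curses.ascii.isdigit(c):
-        return "digit"
-    elif curses.ascii.isctrl(c):
-        return "control character"
-    elif curses.ascii.isprint(c):
-        return "printable character"
-    return "something else"
-
-# ctrl() returns the same type it was given
-assert curses.ascii.ctrl('a') == '\x01'
-assert curses.ascii.ctrl(ord('a')) == 1
-assert describe_key(ord('7')) == "digit"
-assert describe_key(curses.ascii.ctrl('g')) == "control character"
-\end{verbatim}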
-
-There's also a method to retrieve an entire string,
-\method{getstr()}. It isn't used very often, because its
-functionality is quite limited; the only editing keys available are
-the backspace key and the Enter key, which terminates the string. It
-can optionally be limited to a fixed number of characters.
-
-\begin{verbatim}
-curses.echo() # Enable echoing of characters
-
-# Get a 15-character string, with the cursor on the top line
-s = stdscr.getstr(0,0, 15)
-\end{verbatim}
-
-The Python \module{curses.textpad} module supplies something better.
-With it, you can turn a window into a text box that supports an
-Emacs-like set of keybindings. Various methods of the \class{Textbox}
-class support editing with input validation and gathering the edit
-results either with or without trailing spaces. See the library
-documentation on \module{curses.textpad} for the details.
-
-\section{For More Information}
-
-This HOWTO didn't cover some advanced topics, such as screen-scraping
-or capturing mouse events from an xterm instance. But the Python
-library page for the curses modules is now pretty complete. You
-should browse it next.
-
-If you're in doubt about the detailed behavior of any of the ncurses
-entry points, consult the manual pages for your curses implementation,
-whether it's ncurses or a proprietary Unix vendor's. The manual pages
-will document any quirks, and provide complete lists of all the
-functions, attributes, and \constant{ACS_*} characters available to
-you.
-
-Because the curses API is so large, some functions aren't supported in
-the Python interface, not because they're difficult to implement, but
-because no one has needed them yet. Feel free to add them and then
-submit a patch. Also, we don't yet have support for the menus or
-panels libraries associated with ncurses; feel free to add that.
-
-If you write an interesting little program, feel free to contribute it
-as another demo. We can always use more of them!
-
-The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html}
-
-\end{document}
diff --git a/Doc/howto/doanddont.tex b/Doc/howto/doanddont.tex
deleted file mode 100644
index 28ef7c3..0000000
--- a/Doc/howto/doanddont.tex
+++ /dev/null
@@ -1,344 +0,0 @@
-\documentclass{howto}
-
-\title{Idioms and Anti-Idioms in Python}
-
-\release{0.00}
-
-\author{Moshe Zadka}
-\authoraddress{howto@zadka.site.co.il}
-
-\begin{document}
-\maketitle
-
-This document is placed in the public domain.
-
-\begin{abstract}
-\noindent
-This document can be considered a companion to the tutorial. It
-shows how to use Python, and even more importantly, how {\em not}
-to use Python.
-\end{abstract}
-
-\tableofcontents
-
-\section{Language Constructs You Should Not Use}
-
-While Python has relatively few gotchas compared to other languages, it
-still has some constructs which are only useful in corner cases, or are
-plain dangerous.
-
-\subsection{from module import *}
-
-\subsubsection{Inside Function Definitions}
-
-\code{from module import *} is {\em invalid} inside function definitions.
-While many versions of Python do not check for the invalidity, that does not
-make it more valid, no more than having a smart lawyer makes a man innocent.
-Do not use it like that ever. Even in versions where it was accepted, it made
-the function execution slower, because the compiler could not be certain
-which names are local and which are global. In Python 2.1 this construct
-causes warnings, and sometimes even errors.
-
-\subsubsection{At Module Level}
-
-While it is valid to use \code{from module import *} at module level it
-is usually a bad idea. For one, this loses an important property Python
-otherwise has --- you can know where each toplevel name is defined by
-a simple "search" function in your favourite editor. You also open yourself
-to trouble in the future, if some module grows additional functions or
-classes.
-
-One of the most awful questions asked on the newsgroup is why this code:
-
-\begin{verbatim}
-f = open("www")
-f.read()
-\end{verbatim}
-
-does not work. Of course, it works just fine (assuming you have a file
-called "www".) But it does not work if somewhere in the module, the
-statement \code{from os import *} is present. The \module{os} module
-has a function called \function{open()} which returns an integer. While
-it is very useful, shadowing builtins is one of its least useful properties.
-
-Remember, you can never know for sure what names a module exports, so either
-take what you need --- \code{from module import name1, name2}, or keep them in
-the module and access on a per-need basis ---
-\code{import module;print module.name}.
-
-\subsubsection{When It Is Just Fine}
-
-There are situations in which \code{from module import *} is just fine:
-
-\begin{itemize}
-
-\item The interactive prompt. For example, \code{from math import *} makes
- Python an amazing scientific calculator.
-
-\item When extending a module in C with a module in Python.
-
-\item When the module advertises itself as \code{from import *} safe.
-
-\end{itemize}
-
-\subsection{Unadorned \keyword{exec}, \function{execfile} and friends}
-
-The word ``unadorned'' refers to the use without an explicit dictionary,
-in which case those constructs evaluate code in the {\em current} environment.
-This is dangerous for the same reasons \code{from import *} is dangerous ---
-it might step over variables you are counting on and mess up things for
-the rest of your code. Simply do not do that.
-
-Bad examples:
-
-\begin{verbatim}
->>> for name in sys.argv[1:]:
->>>     exec "%s=1" % name
->>> def func(s, **kw):
->>>     for var, val in kw.items():
->>>         exec "s.%s=val" % var # invalid!
->>> execfile("handler.py")
->>> handle()
-\end{verbatim}
-
-Good examples:
-
-\begin{verbatim}
->>> d = {}
->>> for name in sys.argv[1:]:
->>>     d[name] = 1
->>> def func(s, **kw):
->>>     for var, val in kw.items():
->>>         setattr(s, var, val)
->>> d={}
->>> execfile("handle.py", d, d)
->>> handle = d['handle']
->>> handle()
-\end{verbatim}
-
-\subsection{from module import name1, name2}
-
-This is a ``don't'' which is much weaker than the previous ``don't''s
-but is still something you should not do if you don't have good reasons
-to do that. The reason it is usually a bad idea is that you suddenly
-have an object which lives in two separate namespaces. When the binding
-in one namespace changes, the binding in the other will not, so there
-will be a discrepancy between them. This happens when, for example,
-one module is reloaded, or changes the definition of a function at runtime.
-
-Bad example:
-
-\begin{verbatim}
-# foo.py
-a = 1
-
-# bar.py
-from foo import a
-if something():
-    a = 2 # danger: foo.a != a
-\end{verbatim}
-
-Good example:
-
-\begin{verbatim}
-# foo.py
-a = 1
-
-# bar.py
-import foo
-if something():
-    foo.a = 2
-\end{verbatim}
-
-\subsection{except:}
-
-Python has the \code{except:} clause, which catches all exceptions.
-Since {\em every} error in Python raises an exception, this makes many
-programming errors look like runtime problems, and hinders
-the debugging process.
-
-The following code shows a great example:
-
-\begin{verbatim}
-try:
-    foo = opne("file") # misspelled "open"
-except:
-    sys.exit("could not open file!")
-\end{verbatim}
-
-The second line triggers a \exception{NameError} which is caught by the
-except clause. The program will exit, and you will have no idea that
-this has nothing to do with the readability of \code{"file"}.
-
-The example above is better written
-
-\begin{verbatim}
-try:
-    foo = opne("file") # will be changed to "open" as soon as we run it
-except IOError:
-    sys.exit("could not open file")
-\end{verbatim}
-
-There are some situations in which the \code{except:} clause is useful:
-for example, in a framework when running callbacks, it is good not to
-let any callback disturb the framework.
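-
-A minimal sketch of that situation follows, with
-\function{run_callbacks()} as a hypothetical framework function. Note
-that it reports the traceback instead of silently discarding it.
-
-\begin{verbatim}
-import traceback
-
-def run_callbacks(callbacks):
-    for callback in callbacks:
-        try:
-            callback()
-        except:
-            # A misbehaving callback must not stop the others,
-            # but the error should still be visible somewhere.
-            traceback.print_exc()
-\end{verbatim}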
-
-\section{Exceptions}
-
-Exceptions are a useful feature of Python. You should learn to raise
-them whenever something unexpected occurs, and catch them only where
-you can do something about them.
-
-The following is a very popular anti-idiom
-
-\begin{verbatim}
-def get_status(file):
-    if not os.path.exists(file):
-        print "file not found"
-        sys.exit(1)
-    return open(file).readline()
-\end{verbatim}
-
-Consider the case where the file gets deleted between the time the call to
-\function{os.path.exists} is made and the time \function{open} is called.
-That means the last line will throw an \exception{IOError}. The same would
-happen if \var{file} exists but has no read permission. Because testing this
-on a normal machine with existing and non-existing files makes it seem bugless,
-the test results will look fine, and the code will get
-shipped. Then an unhandled \exception{IOError} escapes to the user, who
-has to watch the ugly traceback.
-
-Here is a better way to do it.
-
-\begin{verbatim}
-def get_status(file):
-    try:
-        return open(file).readline()
-    except (IOError, OSError):
-        print "file not found"
-        sys.exit(1)
-\end{verbatim}
-
-In this version, {\em either} the file gets opened and the line is read
-(so it works even on flaky NFS or SMB connections), or the message
-is printed and the application aborted.
-
-Still, \function{get_status} makes too many assumptions --- that it
-will only be used in a short running script, and not, say, in a long
-running server. Sure, the caller could do something like
-
-\begin{verbatim}
-try:
-    status = get_status(log)
-except SystemExit:
-    status = None
-\end{verbatim}
-
-So, try to use as few \code{except} clauses in your code as possible --- those will
-usually be a catch-all in the \function{main}, or inside calls which
-should always succeed.
-
-So, the best version is probably
-
-\begin{verbatim}
-def get_status(file):
-    return open(file).readline()
-\end{verbatim}
-
-The caller can deal with the exception if it wants (for example, if it
-tries several files in a loop), or just let the exception filter upwards
-to {\em its} caller.
-
-The last version is not very good either --- due to implementation details,
-the file would not be closed when an exception is raised until the handler
-finishes, and perhaps not at all in non-C implementations (e.g., Jython).
-It is safer to close the file explicitly:
-
-\begin{verbatim}
-def get_status(file):
-    fp = open(file)
-    try:
-        return fp.readline()
-    finally:
-        fp.close()
-\end{verbatim}
-
-\section{Using the Batteries}
-
-Every so often, people seem to rewrite functionality that is already in
-the Python library, usually poorly. While the occasional module has a poor interface,
-it is usually much better to use the rich standard library and data
-types that come with Python than to invent your own.
-
-A useful module very few people know about is \module{os.path}. It
-always has the correct path arithmetic for your operating system, and
-will usually be much better than whatever you come up with yourself.
-
-Compare:
-
-\begin{verbatim}
-# ugh!
-return dir+"/"+file
-# better
-return os.path.join(dir, file)
-\end{verbatim}
-
-More useful functions in \module{os.path}: \function{basename},
-\function{dirname} and \function{splitext}.
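-
-For instance, these functions might be used as follows. The path is
-only an illustrative value, and the comments show the results one
-would expect for it.
-
-\begin{verbatim}
-import os.path
-
-path = os.path.join("howto", "curses.tex")
-name = os.path.basename(path)        # 'curses.tex'
-directory = os.path.dirname(path)    # 'howto'
-root, ext = os.path.splitext(name)   # ('curses', '.tex')
-\end{verbatim}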
-
-There are also many useful builtin functions people seem not to be
-aware of for some reason: \function{min()} and \function{max()} can
-find the minimum/maximum of any sequence with comparable semantics,
-for example, yet many people write their own
-\function{max()}/\function{min()}. Another highly useful function is
-\function{reduce()}. A classical use of \function{reduce()}
-is something like
-
-\begin{verbatim}
-import sys, operator
-nums = map(float, sys.argv[1:])
-print reduce(operator.add, nums)/len(nums)
-\end{verbatim}
-
-This cute little script prints the average of all numbers given on the
-command line. The \function{reduce()} adds up all the numbers, and
-the rest is just some pre- and postprocessing.
-
-On the same note, \function{float()}, \function{int()} and
-\function{long()} all accept arguments of type string, and so are
-suited to parsing --- assuming you are ready to deal with the
-\exception{ValueError} they raise.
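-
-A small sketch of such parsing follows, using a hypothetical helper
-named \function{parse_numbers()} that skips arguments which are not
-numbers. Whether to skip or report bad input is a design choice that
-depends on the program.
-
-\begin{verbatim}
-def parse_numbers(args):
-    nums = []
-    for arg in args:
-        try:
-            nums.append(float(arg))
-        except ValueError:
-            pass    # not a number; ignore it
-    return nums
-
-assert parse_numbers(["1.5", "spam", "2"]) == [1.5, 2.0]
-\end{verbatim}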
-
-\section{Using Backslash to Continue Statements}
-
-Since Python treats a newline as a statement terminator,
-and since statements are often longer than is comfortable to put
-on one line, many people do:
-
-\begin{verbatim}
-if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
-        calculate_number(10, 20) != forbulate(500, 360):
-    pass
-\end{verbatim}
-
-You should realize that this is dangerous: a stray space after the
-\code{\\} would make this line wrong, and stray spaces are notoriously
-hard to see in editors. In this case, at least it would be a syntax
-error, but if the code was:
-
-\begin{verbatim}
-value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
- + calculate_number(10, 20)*forbulate(500, 360)
-\end{verbatim}
-
-then it would just be subtly wrong.
-
-It is usually much better to use the implicit continuation inside
-parentheses. This version is bulletproof:
-
-\begin{verbatim}
-value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
- + calculate_number(10, 20)*forbulate(500, 360))
-\end{verbatim}
-
-\end{document}
diff --git a/Doc/howto/functional.rst b/Doc/howto/functional.rst
deleted file mode 100644
index bfe67d1..0000000
--- a/Doc/howto/functional.rst
+++ /dev/null
@@ -1,1472 +0,0 @@
-Functional Programming HOWTO
-================================
-
-**Version 0.30**
-
-(This is a first draft. Please send comments/error
-reports/suggestions to amk@amk.ca. This URL is probably not going to
-be the final location of the document, so be careful about linking to
-it -- you may want to add a disclaimer.)
-
-In this document, we'll take a tour of Python's features suitable for
-implementing programs in a functional style. After an introduction to
-the concepts of functional programming, we'll look at language
-features such as iterators and generators and relevant library modules
-such as ``itertools`` and ``functools``.
-
-
-.. contents::
-
-Introduction
-----------------------
-
-This section explains the basic concept of functional programming; if
-you're just interested in learning about Python language features,
-skip to the next section.
-
-Programming languages support decomposing problems in several different
-ways:
-
-* Most programming languages are **procedural**:
-  programs are lists of instructions that tell the computer what to
-  do with the program's input.
-  C, Pascal, and even Unix shells are procedural languages.
-
-* In **declarative** languages, you write a specification that describes
-  the problem to be solved, and the language implementation figures out
-  how to perform the computation efficiently. SQL is the declarative
-  language you're most likely to be familiar with; a SQL query describes
-  the data set you want to retrieve, and the SQL engine decides whether to
-  scan tables or use indexes, which subclauses should be performed first,
-  etc.
-
-* **Object-oriented** programs manipulate collections of objects.
-  Objects have internal state and support methods that query or modify
-  this internal state in some way. Smalltalk and Java are
-  object-oriented languages. C++ and Python are languages that
-  support object-oriented programming, but don't force the use
-  of object-oriented features.
-
-* **Functional** programming decomposes a problem into a set of functions.
-  Ideally, functions only take inputs and produce outputs, and don't have any
-  internal state that affects the output produced for a given input.
-  Well-known functional languages include the ML family (Standard ML,
-  OCaml, and other variants) and Haskell.
-
-The designers of some computer languages choose to emphasize one
-particular approach to programming. This often makes it difficult to
-write programs that use a different approach. Other languages are
-multi-paradigm languages that support several different approaches. Lisp,
-C++, and Python are multi-paradigm; you can write programs or
-libraries that are largely procedural, object-oriented, or functional
-in all of these languages. In a large program, different sections
-might be written using different approaches; the GUI might be object-oriented
-while the processing logic is procedural or functional, for example.
-
-In a functional program, input flows through a set of functions. Each
-function operates on its input and produces some output. Functional
-style frowns upon functions with side effects that modify internal
-state or make other changes that aren't visible in the function's
-return value. Functions that have no side effects at all are
-called **purely functional**.
-Avoiding side effects means not using data structures
-that get updated as a program runs; every function's output
-must only depend on its input.
-
-Some languages are very strict about purity and don't even have
-assignment statements such as ``a=3`` or ``c = a + b``, but it's
-difficult to avoid all side effects. Printing to the screen and
-writing to a disk file are side effects, for example. In
-Python, a ``print`` statement and a ``time.sleep(1)`` call both return no
-useful value; they're only called for their side effects of sending
-some text to the screen or pausing execution for a second.
-
-Python programs written in functional style usually won't go to the
-extreme of avoiding all I/O or all assignments; instead, they'll
-provide a functional-appearing interface but will use non-functional
-features internally. For example, the implementation of a function
-will still use assignments to local variables, but won't modify global
-variables or have other side effects.
-
-Functional programming can be considered the opposite of
-object-oriented programming. Objects are little capsules containing
-some internal state along with a collection of method calls that let
-you modify this state, and programs consist of making the right set of
-state changes. Functional programming wants to avoid state changes as
-much as possible and works with data flowing between functions. In
-Python you might combine the two approaches by writing functions that
-take and return instances representing objects in your application
-(e-mail messages, transactions, etc.).
-
-Functional design may seem like an odd constraint to work under. Why
-should you avoid objects and side effects? There are theoretical and
-practical advantages to the functional style:
-
-* Formal provability.
-* Modularity.
-* Composability.
-* Ease of debugging and testing.
-
-Formal provability
-''''''''''''''''''''''
-
-A theoretical benefit is that it's easier to construct a mathematical proof
-that a functional program is correct.
-
-For a long time researchers have been interested in finding ways to
-mathematically prove programs correct. This is different from testing
-a program on numerous inputs and concluding that its output is usually
-correct, or reading a program's source code and concluding that the
-code looks right; the goal is instead a rigorous proof that a program
-produces the right result for all possible inputs.
-
-The technique used to prove programs correct is to write down
-**invariants**, properties of the input data and of the program's
-variables that are always true. For each line of code, you then show
-that if invariants X and Y are true **before** the line is executed,
-the slightly different invariants X' and Y' are true **after**
-the line is executed. This continues until you reach the end of the
-program, at which point the invariants should match the desired
-conditions on the program's output.
-
-Functional programming's avoidance of assignments arose because
-assignments are difficult to handle with this technique;
-assignments can break invariants that were true before the assignment
-without producing any new invariants that can be propagated onward.
-
-Unfortunately, proving programs correct is largely impractical and not
-relevant to Python software. Even trivial programs require proofs that
-are several pages long; the proof of correctness for a moderately
-complicated program would be enormous, and few or none of the programs
-you use daily (the Python interpreter, your XML parser, your web
-browser) could be proven correct. Even if you wrote down or generated
-a proof, there would then be the question of verifying the proof;
-maybe there's an error in it, and you wrongly believe you've proved
-the program correct.
-
-Modularity
-''''''''''''''''''''''
-
-A more practical benefit of functional programming is that it forces
-you to break apart your problem into small pieces. Programs are more
-modular as a result. It's easier to specify and write a small
-function that does one thing than a large function that performs a
-complicated transformation. Small functions are also easier to read
-and to check for errors.
-
-
-Ease of debugging and testing
-''''''''''''''''''''''''''''''''''
-
-Testing and debugging a functional-style program is easier.
-
-Debugging is simplified because functions are generally small and
-clearly specified. When a program doesn't work, each function is an
-interface point where you can check that the data are correct. You
-can look at the intermediate inputs and outputs to quickly isolate the
-function that's responsible for a bug.
-
-Testing is easier because each function is a potential subject for a
-unit test. Functions don't depend on system state that needs to be
-replicated before running a test; instead you only have to synthesize
-the right input and then check that the output matches expectations.
-
-
-
-Composability
-''''''''''''''''''''''
-
-As you work on a functional-style program, you'll write a number of
-functions with varying inputs and outputs. Some of these functions
-will be unavoidably specialized to a particular application, but
-others will be useful in a wide variety of programs. For example, a
-function that takes a directory path and returns all the XML files in
-the directory, or a function that takes a filename and returns its
-contents, can be applied to many different situations.
-
-Over time you'll form a personal library of utilities. Often you'll
-assemble new programs by arranging existing functions in a new
-configuration and writing a few functions specialized for the current
-task.
-
-
-
-Iterators
------------------------
-
-I'll start by looking at a Python language feature that's an important
-foundation for writing functional-style programs: iterators.
-
-An iterator is an object representing a stream of data; this object
-returns the data one element at a time. A Python iterator must
-support a method called ``next()`` that takes no arguments and always
-returns the next element of the stream. If there are no more elements
-in the stream, ``next()`` must raise the ``StopIteration`` exception.
-Iterators don't have to be finite, though; it's perfectly reasonable
-to write an iterator that produces an infinite stream of data.
-
-The built-in ``iter()`` function takes an arbitrary object and tries
-to return an iterator that will return the object's contents or
-elements, raising ``TypeError`` if the object doesn't support
-iteration. Several of Python's built-in data types support iteration,
-the most common being lists and dictionaries. An object is called
-an **iterable** object if you can get an iterator for it.
-
-You can experiment with the iteration interface manually::
-
- >>> L = [1,2,3]
- >>> it = iter(L)
- >>> print it
- <iterator object at 0x8116870>
- >>> it.next()
- 1
- >>> it.next()
- 2
- >>> it.next()
- 3
- >>> it.next()
- Traceback (most recent call last):
- File "<stdin>", line 1, in ?
- StopIteration
- >>>
-
-Python expects iterable objects in several different contexts, the
-most important being the ``for`` statement. In the statement ``for X in Y``,
-Y must be an iterator or some object for which ``iter()`` can create
-an iterator. These two statements are equivalent::
-
- for i in iter(obj):
-     print i
-
- for i in obj:
-     print i
-
-Iterators can be materialized as lists or tuples by using the
-``list()`` or ``tuple()`` constructor functions::
-
- >>> L = [1,2,3]
- >>> iterator = iter(L)
- >>> t = tuple(iterator)
- >>> t
- (1, 2, 3)
-
-Sequence unpacking also supports iterators: if you know an iterator
-will return N elements, you can unpack them into an N-tuple::
-
- >>> L = [1,2,3]
- >>> iterator = iter(L)
- >>> a,b,c = iterator
- >>> a,b,c
- (1, 2, 3)
-
-Built-in functions such as ``max()`` and ``min()`` can take a single
-iterator argument and will return the largest or smallest element.
-The ``"in"`` and ``"not in"`` operators also support iterators: ``X in
-iterator`` is true if X is found in the stream returned by the
-iterator. You'll run into obvious problems if the iterator is
-infinite; ``max()``, ``min()``, and ``"not in"`` will never return, and
-if the element X never appears in the stream, the ``"in"`` operator
-won't return either.
-
-Note that you can only go forward in an iterator; there's no way to
-get the previous element, reset the iterator, or make a copy of it.
-Iterator objects can optionally provide these additional capabilities,
-but the iterator protocol only specifies the ``next()`` method.
-Functions may therefore consume all of the iterator's output, and if
-you need to do something different with the same stream, you'll have
-to create a new iterator.
-
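-A minimal sketch of the forward-only behaviour described above. It
-uses the built-in ``next()`` function found in later Python releases
-in place of the ``it.next()`` method, so that it also runs on newer
-interpreters; the variable names are illustrative only.
-
```python
numbers = [3, 1, 2]
it = iter(numbers)

# max() consumes the whole stream while looking for the largest element.
largest = max(it)

# The same iterator is now exhausted; materializing it yields nothing.
leftover = list(it)

# To process the data again, ask the original list for a fresh iterator.
fresh = iter(numbers)
first = next(fresh)

print(largest)   # 3
print(leftover)  # []
print(first)     # 3
```
-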
-
-
-Data Types That Support Iterators
-'''''''''''''''''''''''''''''''''''
-
-We've already seen how lists and tuples support iterators. In fact,
-any Python sequence type, such as strings, will automatically support
-creation of an iterator.
-
-Calling ``iter()`` on a dictionary returns an iterator that will loop
-over the dictionary's keys::
-
- >>> m = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
- ... 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
- >>> for key in m:
- ...     print key, m[key]
- Mar 3
- Feb 2
- Aug 8
- Sep 9
- May 5
- Jun 6
- Jul 7
- Jan 1
- Apr 4
- Nov 11
- Dec 12
- Oct 10
-
-Note that the order is arbitrary from the programmer's point of view,
-because it's based on the hash ordering of the objects in the dictionary.
-
-Applying ``iter()`` to a dictionary always loops over the keys, but
-dictionaries have methods that return other iterators. If you want to
-iterate over keys, values, or key/value pairs, you can explicitly call
-the ``iterkeys()``, ``itervalues()``, or ``iteritems()`` methods to
-get an appropriate iterator.
-
-The ``dict()`` constructor can accept an iterator that returns a
-finite stream of ``(key, value)`` tuples::
-
- >>> L = [('Italy', 'Rome'), ('France', 'Paris'), ('US', 'Washington DC')]
- >>> dict(iter(L))
- {'Italy': 'Rome', 'US': 'Washington DC', 'France': 'Paris'}
-
-Files also support iteration by calling the ``readline()``
-method until there are no more lines in the file. This means you can
-read each line of a file like this::
-
- for line in file:
-     # do something for each line
-     ...
-
-Sets can take their contents from an iterable and let you iterate over
-the set's elements::
-
- S = set((2, 3, 5, 7, 11, 13))
- for i in S:
-     print i
-
-
-
-Generator expressions and list comprehensions
-----------------------------------------------------
-
-Two common operations on an iterator's output are 1) performing some
-operation for every element, 2) selecting a subset of elements that
-meet some condition. For example, given a list of strings, you might
-want to strip off trailing whitespace from each line or extract all
-the strings containing a given substring.
-
-List comprehensions and generator expressions (short form: "listcomps"
-and "genexps") are a concise notation for such operations, borrowed
-from the functional programming language Haskell
-(http://www.haskell.org). You can strip leading and trailing whitespace
-from a stream of strings with the following code::
-
- line_list = [' line 1\n', 'line 2 \n', ...]
-
- # Generator expression -- returns iterator
- stripped_iter = (line.strip() for line in line_list)
-
- # List comprehension -- returns list
- stripped_list = [line.strip() for line in line_list]
-
-You can select only certain elements by adding an ``"if"`` condition::
-
- stripped_list = [line.strip() for line in line_list
-                  if line != ""]
-
-With a list comprehension, you get back a Python list;
-``stripped_list`` is a list containing the resulting lines, not an
-iterator. Generator expressions return an iterator that computes the
-values as necessary, not needing to materialize all the values at
-once. This means that list comprehensions aren't useful if you're
-working with iterators that return an infinite stream or a very large
-amount of data. Generator expressions are preferable in these
-situations.
-
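-As a small illustration of that difference, the sketch below applies
-a generator expression to an infinite stream. It relies on
-``itertools.count()`` and ``itertools.islice()``, both covered later
-in this document; the names ``naturals`` and ``squares`` are made up
-for the example.
-
```python
import itertools

# An infinite stream of integers: 0, 1, 2, ...
naturals = itertools.count()

# Lazily square each value; nothing is computed until a value is requested.
squares = (n * n for n in naturals)

# Take only the first five results instead of materializing the stream.
first_five = list(itertools.islice(squares, 5))
print(first_five)  # [0, 1, 4, 9, 16]
```
-
-The corresponding list comprehension, ``[n * n for n in
-itertools.count()]``, would try to build the whole list and never
-finish.
-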
-Generator expressions are surrounded by parentheses ("()") and list
-comprehensions are surrounded by square brackets ("[]"). Generator
-expressions have the form::
-
- ( expression for expr1 in sequence1
-              if condition1
-              for expr2 in sequence2
-              if condition2
-              for expr3 in sequence3 ...
-              if condition3
-              for exprN in sequenceN
-              if conditionN )
-
-Again, for a list comprehension only the outside brackets are
-different (square brackets instead of parentheses).
-
-The elements of the generated output will be the successive values of
-``expression``. The ``if`` clauses are all optional; if present,
-``expression`` is only evaluated and added to the result when
-``condition`` is true.
-
-Generator expressions always have to be written inside parentheses,
-but the parentheses signalling a function call also count. If you
-want to create an iterator that will be immediately passed to a
-function you can write::
-
- obj_total = sum(obj.count for obj in list_all_objects())
-
-The ``for...in`` clauses contain the sequences to be iterated over.
-The sequences do not have to be the same length, because they are
-iterated over from left to right, **not** in parallel. For each
-element in ``sequence1``, ``sequence2`` is looped over from the
-beginning. ``sequence3`` is then looped over for each
-resulting pair of elements from ``sequence1`` and ``sequence2``.
-
-To put it another way, a list comprehension or generator expression is
-equivalent to the following Python code::
-
- for expr1 in sequence1:
-     if not (condition1):
-         continue   # Skip this element
-     for expr2 in sequence2:
-         if not (condition2):
-             continue    # Skip this element
-         ...
-         for exprN in sequenceN:
-             if not (conditionN):
-                 continue   # Skip this element
-
-             # Output the value of
-             # the expression.
-
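-To make the equivalence concrete, here is a sketch that runs a list
-comprehension with two conditions next to its hand-expanded loop
-form. The sequences and the conditions are arbitrary choices for
-illustration.
-
```python
seq1 = 'abc'
seq2 = (1, 2, 3)

# List comprehension with one condition attached to each for clause.
pairs = [(x, y) for x in seq1 if x != 'b'
                for y in seq2 if y != 2]

# The same computation spelled out as nested loops.
expanded = []
for x in seq1:
    if not (x != 'b'):
        continue        # Skip this element
    for y in seq2:
        if not (y != 2):
            continue    # Skip this element
        expanded.append((x, y))

print(pairs)              # [('a', 1), ('a', 3), ('c', 1), ('c', 3)]
print(pairs == expanded)  # True
```
-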
-This means that when there are multiple ``for...in`` clauses but no
-``if`` clauses, the length of the resulting output will be equal to
-the product of the lengths of all the sequences. If you have two
-sequences of length 3, the output list is 9 elements long::
-
- >>> seq1 = 'abc'
- >>> seq2 = (1,2,3)
- >>> [ (x,y) for x in seq1 for y in seq2]
- [('a', 1), ('a', 2), ('a', 3),
- ('b', 1), ('b', 2), ('b', 3),
- ('c', 1), ('c', 2), ('c', 3)]
-
-To avoid introducing an ambiguity into Python's grammar, if
-``expression`` is creating a tuple, it must be surrounded with
-parentheses. The first list comprehension below is a syntax error,
-while the second one is correct::
-
- # Syntax error
- [ x,y for x in seq1 for y in seq2]
- # Correct
- [ (x,y) for x in seq1 for y in seq2]
-
-
-Generators
------------------------
-
-Generators are a special class of functions that simplify the task of
-writing iterators. Regular functions compute a value and return it,
-but generators return an iterator that returns a stream of values.
-
-You're doubtless familiar with how regular function calls work in
-Python or C. When you call a function, it gets a private namespace
-where its local variables are created. When the function reaches a
-``return`` statement, the local variables are destroyed and the
-value is returned to the caller. A later call to the same function
-creates a new private namespace and a fresh set of local
-variables. But, what if the local variables weren't thrown away on
-exiting a function? What if you could later resume the function where
-it left off? This is what generators provide; they can be thought of
-as resumable functions.
-
-Here's the simplest example of a generator function::
-
- def generate_ints(N):
-     for i in range(N):
-         yield i
-
-Any function containing a ``yield`` keyword is a generator function;
-this is detected by Python's bytecode compiler which compiles the
-function specially as a result.
-
-When you call a generator function, it doesn't return a single value;
-instead it returns a generator object that supports the iterator
-protocol. On executing the ``yield`` expression, the generator
-outputs the value of ``i``, similar to a ``return``
-statement. The big difference between ``yield`` and a
-``return`` statement is that on reaching a ``yield`` the
-generator's state of execution is suspended and local variables are
-preserved. On the next call to the generator's ``.next()`` method,
-the function will resume executing.
-
-Here's a sample usage of the ``generate_ints()`` generator::
-
- >>> gen = generate_ints(3)
- >>> gen
- <generator object at 0x8117f90>
- >>> gen.next()
- 0
- >>> gen.next()
- 1
- >>> gen.next()
- 2
- >>> gen.next()
- Traceback (most recent call last):
- File "<stdin>", line 1, in ?
- File "<stdin>", line 2, in generate_ints
- StopIteration
-
-You could equally write ``for i in generate_ints(5)``, or
-``a,b,c = generate_ints(3)``.
-
-Inside a generator function, the ``return`` statement can only be used
-without a value, and signals the end of the procession of values;
-after executing a ``return`` the generator cannot return any further
-values. ``return`` with a value, such as ``return 5``, is a syntax
-error inside a generator function. The end of the generator's results
-can also be indicated by raising ``StopIteration`` manually, or by
-just letting the flow of execution fall off the bottom of the
-function.
-
-You could achieve the effect of generators manually by writing your
-own class and storing all the local variables of the generator as
-instance variables. For example, returning a list of integers could
-be done by setting ``self.count`` to 0, and having the
-``next()`` method increment ``self.count`` and return it.
-However, for a moderately complicated generator, writing a
-corresponding class can be much messier.
-
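-The comparison can be sketched as follows. ``CountUp`` and
-``count_up`` are hypothetical names invented for this example. The
-class defines ``__next__()``, the spelling used by later Python
-releases, and aliases it to the ``next()`` method described in this
-document so the same code runs under either convention.
-
```python
class CountUp(object):
    """Hand-written iterator returning 1, 2, ..., limit."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0      # state a generator would keep in a local variable

    def __iter__(self):
        return self

    def __next__(self):
        if self.count >= self.limit:
            raise StopIteration
        self.count += 1
        return self.count

    next = __next__         # method name used by the protocol described here


def count_up(limit):
    """Generator producing the same stream with far less bookkeeping."""
    count = 0
    while count < limit:
        count += 1
        yield count


print(list(CountUp(3)))   # [1, 2, 3]
print(list(count_up(3)))  # [1, 2, 3]
```
-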
-The test suite included with Python's library, ``test_generators.py``,
-contains a number of more interesting examples. Here's one generator
-that implements an in-order traversal of a tree using generators
-recursively.
-
-::
-
- # A recursive generator that generates Tree leaves in in-order.
- def inorder(t):
-     if t:
-         for x in inorder(t.left):
-             yield x
-
-         yield t.label
-
-         for x in inorder(t.right):
-             yield x
-
-Two other examples in ``test_generators.py`` produce
-solutions for the N-Queens problem (placing N queens on an NxN
-chess board so that no queen threatens another) and the Knight's Tour
-(finding a route that takes a knight to every square of an NxN chessboard
-without visiting any square twice).
-
-
-
-Passing values into a generator
-''''''''''''''''''''''''''''''''''''''''''''''
-
-In Python 2.4 and earlier, generators only produced output. Once a
-generator's code was invoked to create an iterator, there was no way to
-pass any new information into the function when its execution was
-resumed. You could hack together this ability by making the
-generator look at a global variable or by passing in some mutable object
-that callers then modify, but these approaches are messy.
-
-In Python 2.5 there's a simple way to pass values into a generator.
-``yield`` became an expression, returning a value that can be assigned
-to a variable or otherwise operated on::
-
- val = (yield i)
-
-I recommend that you **always** put parentheses around a ``yield``
-expression when you're doing something with the returned value, as in
-the above example. The parentheses aren't always necessary, but it's
-easier to always add them instead of having to remember when they're
-needed.
-
-(PEP 342 explains the exact rules, which are that a
-``yield``-expression must always be parenthesized except when it
-occurs at the top-level expression on the right-hand side of an
-assignment. This means you can write ``val = yield i`` but have to
-use parentheses when there's an operation, as in ``val = (yield i)
-+ 12``.)
-
-Values are sent into a generator by calling its
-``send(value)`` method. This method resumes the
-generator's code and the ``yield`` expression returns the specified
-value. If the regular ``next()`` method is called, the
-``yield`` returns ``None``.
-
-Here's a simple counter that increments by 1 and allows changing the
-value of the internal counter.
-
-::
-
- def counter(maximum):
-     i = 0
-     while i < maximum:
-         val = (yield i)
-         # If value provided, change counter
-         if val is not None:
-             i = val
-         else:
-             i += 1
-
-And here's an example of changing the counter::
-
- >>> it = counter(10)
- >>> print it.next()
- 0
- >>> print it.next()
- 1
- >>> print it.send(8)
- 8
- >>> print it.next()
- 9
- >>> print it.next()
- Traceback (most recent call last):
- File "t.py", line 15, in ?
-     print it.next()
- StopIteration
-
-Because ``yield`` will often be returning ``None``, you
-should always check for this case. Don't just use its value in
-expressions unless you're sure that the ``send()`` method
-will be the only method used to resume your generator function.
-
-In addition to ``send()``, there are two other new methods on
-generators:
-
-* ``throw(type, value=None, traceback=None)`` is used to raise an exception inside the
- generator; the exception is raised by the ``yield`` expression
- where the generator's execution is paused.
-
-* ``close()`` raises a ``GeneratorExit``
- exception inside the generator to terminate the iteration.
- On receiving this
- exception, the generator's code must either raise
- ``GeneratorExit`` or ``StopIteration``; catching the
- exception and doing anything else is illegal and will trigger
- a ``RuntimeError``. ``close()`` will also be called by
- Python's garbage collector when the generator is garbage-collected.
-
- If you need to run cleanup code when a ``GeneratorExit`` occurs,
- I suggest using a ``try: ... finally:`` suite instead of
- catching ``GeneratorExit``.
-
-The cumulative effect of these changes is to turn generators from
-one-way producers of information into both producers and consumers.
-
-Generators also become **coroutines**, a more generalized form of
-subroutines. Subroutines are entered at one point and exited at
-another point (the top of the function, and a ``return``
-statement), but coroutines can be entered, exited, and resumed at
-many different points (the ``yield`` statements).
-
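-A sketch of a generator used as a consumer, combining ``send()`` and
-``close()`` with the ``try: ... finally:`` cleanup suggested above.
-The ``running_average`` function and its ``log`` parameter are
-invented for illustration, and the built-in ``next()`` from later
-Python releases stands in for the ``.next()`` method.
-
```python
def running_average(log):
    """Coroutine sketch: receives numbers via send(), yields the mean so far."""
    total = 0.0
    count = 0
    average = None
    try:
        while True:
            value = (yield average)
            # A plain next() call resumes the generator with None; ignore it.
            if value is not None:
                total += value
                count += 1
                average = total / count
    finally:
        # Runs when close() raises GeneratorExit at the paused yield.
        log.append('closed')


events = []
averager = running_average(events)
next(averager)              # advance to the first yield
first = averager.send(10)
second = averager.send(20)
averager.close()

print(first)   # 10.0
print(second)  # 15.0
print(events)  # ['closed']
```
-
-The initial ``next()`` call is needed because a value can only be
-sent to a generator that is paused at a ``yield`` expression.
-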
-
-Built-in functions
-----------------------------------------------
-
-Let's look in more detail at built-in functions often used with iterators.
-
-Two of Python's built-in functions, ``map()`` and ``filter()``, are
-somewhat obsolete; they duplicate the features of list comprehensions
-but return actual lists instead of iterators.
-
-``map(f, iterA, iterB, ...)`` returns a list containing ``f(iterA[0],
-iterB[0]), f(iterA[1], iterB[1]), f(iterA[2], iterB[2]), ...``.
-
-::
-
- def upper(s):
-     return s.upper()
- map(upper, ['sentence', 'fragment']) =>
- ['SENTENCE', 'FRAGMENT']
-
- [upper(s) for s in ['sentence', 'fragment']] =>
- ['SENTENCE', 'FRAGMENT']
-
-As shown above, you can achieve the same effect with a list
-comprehension. The ``itertools.imap()`` function does the same thing
-but can handle infinite iterators; it'll be discussed later, in the section on
-the ``itertools`` module.
-
-``filter(predicate, iter)`` returns a list
-that contains all the sequence elements that meet a certain condition,
-and is similarly duplicated by list comprehensions.
-A **predicate** is a function that returns the truth value of
-some condition; for use with ``filter()``, the predicate must take a
-single value.
-
-::
-
- def is_even(x):
-     return (x % 2) == 0
-
- filter(is_even, range(10)) =>
- [0, 2, 4, 6, 8]
-
-This can also be written as a list comprehension::
-
- >>> [x for x in range(10) if is_even(x)]
- [0, 2, 4, 6, 8]
-
-``filter()`` also has a counterpart in the ``itertools`` module,
-``itertools.ifilter()``, that returns an iterator and
-can therefore handle infinite sequences just as ``itertools.imap()`` can.
-
-``reduce(func, iter, [initial_value])`` doesn't have a counterpart in
-the ``itertools`` module because it cumulatively performs an operation
-on all the iterable's elements and therefore can't be applied to
-infinite iterables. ``func`` must be a function that takes two elements
-and returns a single value. ``reduce()`` takes the first two elements
-A and B returned by the iterator and calculates ``func(A, B)``. It
-then requests the third element, C, calculates ``func(func(A, B),
-C)``, combines this result with the fourth element returned, and
-continues until the iterable is exhausted. If the iterable returns no
-values at all, a ``TypeError`` exception is raised. If the initial
-value is supplied, it's used as a starting point and
-``func(initial_value, A)`` is the first calculation.
-
-::
-
- import operator
- reduce(operator.concat, ['A', 'BB', 'C']) =>
- 'ABBC'
- reduce(operator.concat, []) =>
- TypeError: reduce() of empty sequence with no initial value
- reduce(operator.mul, [1,2,3], 1) =>
- 6
- reduce(operator.mul, [], 1) =>
- 1
-
-If you use ``operator.add`` with ``reduce()``, you'll add up all the
-elements of the iterable. This case is so common that there's a special
-built-in called ``sum()`` to compute it::
-
- reduce(operator.add, [1,2,3,4], 0) =>
- 10
- sum([1,2,3,4]) =>
- 10
- sum([]) =>
- 0
-
-For many uses of ``reduce()``, though, it can be clearer to just write
-the obvious ``for`` loop::
-
- # Instead of:
- product = reduce(operator.mul, [1,2,3], 1)
-
- # You can write:
- product = 1
- for i in [1,2,3]:
-     product *= i
-
-
-``enumerate(iter)`` counts off the elements in the iterable, returning
-2-tuples containing the count and each element.
-
-::
-
- enumerate(['subject', 'verb', 'object']) =>
- (0, 'subject'), (1, 'verb'), (2, 'object')
-
-``enumerate()`` is often used when looping through a list
-and recording the indexes at which certain conditions are met::
-
- f = open('data.txt', 'r')
- for i, line in enumerate(f):
-     if line.strip() == '':
-         print 'Blank line at line #%i' % i
-
-``sorted(iterable, [cmp=None], [key=None], [reverse=False])``
-collects all the elements of the iterable into a list, sorts
-the list, and returns the sorted result. The ``cmp``, ``key``,
-and ``reverse`` arguments are passed through to the
-constructed list's ``.sort()`` method.
-
-::
-
- import random
- # Generate 8 distinct random numbers in the range [0, 10000)
- rand_list = random.sample(range(10000), 8)
- rand_list =>
- [769, 7953, 9828, 6431, 8442, 9878, 6213, 2207]
- sorted(rand_list) =>
- [769, 2207, 6213, 6431, 7953, 8442, 9828, 9878]
- sorted(rand_list, reverse=True) =>
- [9878, 9828, 8442, 7953, 6431, 6213, 2207, 769]
-
-(For a more detailed discussion of sorting, see the Sorting mini-HOWTO
-in the Python wiki at http://wiki.python.org/moin/HowTo/Sorting.)
-
-The ``any(iter)`` and ``all(iter)`` built-ins look at
-the truth values of an iterable's contents. ``any()`` returns
-True if any element in the iterable is a true value, and ``all()``
-returns True if all of the elements are true values::
-
- any([0,1,0]) =>
- True
- any([0,0,0]) =>
- False
- any([1,1,1]) =>
- True
- all([0,1,0]) =>
- False
- all([0,0,0]) =>
- False
- all([1,1,1]) =>
- True
-
-
-Small functions and the lambda expression
-----------------------------------------------
-
-When writing functional-style programs, you'll often need little
-functions that act as predicates or that combine elements in some way.
-
-If there's a Python built-in or a module function that's suitable, you
-don't need to define a new function at all::
-
- stripped_lines = [line.strip() for line in lines]
- existing_files = filter(os.path.exists, file_list)
-
-If the function you need doesn't exist, you need to write it. One way
-to write small functions is to use a ``lambda`` expression. ``lambda``
-takes a number of parameters and an expression combining these parameters,
-and creates a small function that returns the value of the expression::
-
- lowercase = lambda x: x.lower()
-
- print_assign = lambda name, value: name + '=' + str(value)
-
- adder = lambda x, y: x+y
-
-An alternative is to just use the ``def`` statement and define a
-function in the usual way::
-
- def lowercase(x):
-     return x.lower()
-
- def print_assign(name, value):
-     return name + '=' + str(value)
-
- def adder(x,y):
-     return x + y
-
-Which alternative is preferable? That's a style question; my usual
-course is to avoid using ``lambda``.
-
-One reason for my preference is that ``lambda`` is quite limited in
-the functions it can define. The result has to be computable as a
-single expression, which means you can't have multiway
-``if... elif... else`` comparisons or ``try... except`` statements.
-If you try to do too much in a ``lambda`` expression, you'll end up
-with an overly complicated expression that's hard to read. Quick,
-what's the following code doing?
-
-::
-
- total = reduce(lambda a, b: (0, a[1] + b[1]), items)[1]
-
-You can figure it out, but it takes time to disentangle the expression
-to figure out what's going on. Using a short nested
-``def`` statement makes things a little bit better::
-
- def combine(a, b):
-     return 0, a[1] + b[1]
-
- total = reduce(combine, items)[1]
-
-But it would be best of all if I had simply used a ``for`` loop::
-
- total = 0
- for a, b in items:
-     total += b
-
-Or the ``sum()`` built-in and a generator expression::
-
- total = sum(b for a,b in items)
-
-Many uses of ``reduce()`` are clearer when written as ``for`` loops.
-
-Fredrik Lundh once suggested the following set of rules for refactoring
-uses of ``lambda``:
-
-1) Write a lambda function.
-2) Write a comment explaining what the heck that lambda does.
-3) Study the comment for a while, and think of a name that captures
- the essence of the comment.
-4) Convert the lambda to a def statement, using that name.
-5) Remove the comment.
-
-I really like these rules, but you're free to disagree that this
-lambda-free style is better.
-
-
-The itertools module
------------------------
-
-The ``itertools`` module contains a number of commonly-used iterators
-as well as functions for combining several iterators. This section
-will introduce the module's contents by showing small examples.
-
-The module's functions fall into a few broad classes:
-
-* Functions that create a new iterator based on an existing iterator.
-* Functions for treating an iterator's elements as function arguments.
-* Functions for selecting portions of an iterator's output.
-* A function for grouping an iterator's output.
-
-Creating new iterators
-''''''''''''''''''''''
-
-``itertools.count(n)`` returns an infinite stream of
-integers, increasing by 1 each time. You can optionally supply the
-starting number, which defaults to 0::
-
- itertools.count() =>
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
- itertools.count(10) =>
- 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...
-
-``itertools.cycle(iter)`` saves a copy of the contents of a provided
-iterable and returns a new iterator that returns its elements from
-first to last. The new iterator will repeat these elements infinitely.
-
-::
-
- itertools.cycle([1,2,3,4,5]) =>
- 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...
-
-``itertools.repeat(elem, [n])`` returns the provided element ``n``
-times, or returns the element endlessly if ``n`` is not provided.
-
-::
-
- itertools.repeat('abc') =>
- abc, abc, abc, abc, abc, abc, abc, abc, abc, abc, ...
- itertools.repeat('abc', 5) =>
- abc, abc, abc, abc, abc
-
-``itertools.chain(iterA, iterB, ...)`` takes an arbitrary number of
-iterables as input, and returns all the elements of the first
-iterable, then all the elements of the second, and so on, until all of
-the iterables have been exhausted.
-
-::
-
- itertools.chain(['a', 'b', 'c'], (1, 2, 3)) =>
- a, b, c, 1, 2, 3
-
-``itertools.izip(iterA, iterB, ...)`` takes one element from each iterable
-and returns them in a tuple::
-
- itertools.izip(['a', 'b', 'c'], (1, 2, 3)) =>
- ('a', 1), ('b', 2), ('c', 3)
-
-It's similar to the built-in ``zip()`` function, but doesn't
-construct an in-memory list and exhaust all the input iterators before
-returning; instead tuples are constructed and returned only if they're
-requested. (The technical term for this behaviour is
-`lazy evaluation <http://en.wikipedia.org/wiki/Lazy_evaluation>`__.)
-
-This iterator is intended to be used with iterables that are all of
-the same length. If the iterables are of different lengths, the
-resulting stream will be the same length as the shortest iterable.
-
-::
-
- itertools.izip(['a', 'b'], (1, 2, 3)) =>
- ('a', 1), ('b', 2)
-
-You should avoid doing this, though, because an element may be taken
-from the longer iterators and discarded. This means you can't go on
-to use the iterators further because you risk skipping a discarded
-element.
-
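-The sketch below shows the discarded element in practice. It falls
-back to the built-in ``zip()`` when ``itertools.izip`` is not
-available, on the assumption that ``zip()`` is lazy on such
-interpreters; ``longer`` and ``shorter`` are illustrative names.
-
```python
try:
    from itertools import izip   # name used in this document
except ImportError:
    izip = zip                   # assumption: built-in zip() is lazy here

longer = iter([1, 2, 3])
shorter = iter(['a', 'b'])

pairs = list(izip(longer, shorter))

# izip() fetched 3 from `longer` before discovering that `shorter` was
# exhausted, so that element has been consumed and thrown away.
remaining = list(longer)

print(pairs)      # [(1, 'a'), (2, 'b')]
print(remaining)  # []
```
-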
-``itertools.islice(iter, [start], stop, [step])`` returns a stream
-that's a slice of the iterator. With a single ``stop`` argument,
-it will return the first ``stop``
-elements. If you supply a starting index, you'll get ``stop-start``
-elements, and if you supply a value for ``step``, elements will be
-skipped accordingly. Unlike Python's string and list slicing, you
-can't use negative values for ``start``, ``stop``, or ``step``.
-
-::
-
- itertools.islice(range(10), 8) =>
- 0, 1, 2, 3, 4, 5, 6, 7
- itertools.islice(range(10), 2, 8) =>
- 2, 3, 4, 5, 6, 7
- itertools.islice(range(10), 2, 8, 2) =>
- 2, 4, 6
-
-``itertools.tee(iter, [n])`` replicates an iterator; it returns ``n``
-independent iterators that will all return the contents of the source
-iterator. If you don't supply a value for ``n``, the default is 2.
-Replicating iterators requires saving some of the contents of the source
-iterator, so this can consume significant memory if the iterator is large
-and one of the new iterators is consumed more than the others.
-
-::
-
- itertools.tee( itertools.count() ) =>
- iterA, iterB
-
- where iterA ->
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
-
- and iterB ->
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
-
-
-Calling functions on elements
-'''''''''''''''''''''''''''''
-
-Two functions are used for calling other functions on the contents of an
-iterable.
-
-``itertools.imap(f, iterA, iterB, ...)`` returns
-a stream containing ``f(iterA[0], iterB[0]), f(iterA[1], iterB[1]),
-f(iterA[2], iterB[2]), ...``::
-
- itertools.imap(operator.add, [5, 6, 5], [1, 2, 3]) =>
- 6, 8, 8
-
-The ``operator`` module contains a set of functions
-corresponding to Python's operators. Some examples are
-``operator.add(a, b)`` (adds two values),
-``operator.ne(a, b)`` (same as ``a!=b``),
-and
-``operator.attrgetter('id')`` (returns a callable that
-fetches the ``"id"`` attribute).
-
-``itertools.starmap(func, iter)`` assumes that the iterable will
-return a stream of tuples, and calls ``func()`` using these tuples as the
-arguments::
-
- itertools.starmap(os.path.join,
-                   [('/usr', 'bin', 'java'), ('/bin', 'python'),
-                    ('/usr', 'bin', 'perl'),('/usr', 'bin', 'ruby')])
- =>
- /usr/bin/java, /bin/python, /usr/bin/perl, /usr/bin/ruby
-
-
-Selecting elements
-''''''''''''''''''
-
-Another group of functions chooses a subset of an iterator's elements
-based on a predicate.
-
-``itertools.ifilter(predicate, iter)`` returns all the elements for
-which the predicate returns true::
-
- def is_even(x):
-     return (x % 2) == 0
-
- itertools.ifilter(is_even, itertools.count()) =>
- 0, 2, 4, 6, 8, 10, 12, 14, ...
-
-``itertools.ifilterfalse(predicate, iter)`` is the opposite,
-returning all elements for which the predicate returns false::
-
- itertools.ifilterfalse(is_even, itertools.count()) =>
- 1, 3, 5, 7, 9, 11, 13, 15, ...
-
-``itertools.takewhile(predicate, iter)`` returns elements for as long
-as the predicate returns true. Once the predicate returns false,
-the iterator will signal the end of its results.
-
-::
-
- def less_than_10(x):
-     return (x < 10)
-
- itertools.takewhile(less_than_10, itertools.count()) =>
- 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
-
- itertools.takewhile(is_even, itertools.count()) =>
- 0
-
-``itertools.dropwhile(predicate, iter)`` discards elements while the
-predicate returns true, and then returns the rest of the iterable's
-results.
-
-::
-
- itertools.dropwhile(less_than_10, itertools.count()) =>
- 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...
-
- itertools.dropwhile(is_even, itertools.count()) =>
- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
-
-
-Grouping elements
-'''''''''''''''''
-
-The last function I'll discuss, ``itertools.groupby(iter,
-key_func=None)``, is the most complicated. ``key_func(elem)`` is a
-function that can compute a key value for each element returned by the
-iterable. If you don't supply a key function, the key is simply each
-element itself.
-
-``groupby()`` collects all the consecutive elements from the
-underlying iterable that have the same key value, and returns a stream
-of 2-tuples containing a key value and an iterator for the elements
-with that key.
-
-::
-
- city_list = [('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL'),
- ('Anchorage', 'AK'), ('Nome', 'AK'),
- ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ'),
- ...
- ]
-
- def get_state ((city, state)):
-     return state
-
- itertools.groupby(city_list, get_state) =>
- ('AL', iterator-1),
- ('AK', iterator-2),
- ('AZ', iterator-3), ...
-
- where
- iterator-1 =>
- ('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL')
- iterator-2 =>
- ('Anchorage', 'AK'), ('Nome', 'AK')
- iterator-3 =>
- ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ')
-
-``groupby()`` assumes that the underlying iterable's contents will
-already be sorted based on the key. Note that the returned iterators
-also use the underlying iterable, so you have to consume the results
-of iterator-1 before requesting iterator-2 and its corresponding key.
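
Because each group iterator shares the underlying iterable, a common pattern is to copy every group into a list before moving on to the next key. The sketch below assumes Python 3, where tuple parameters no longer exist, so ``get_state`` unpacks its argument inside the function body; the shortened ``city_list`` and the ``cities_by_state`` name are illustrative only.

```python
import itertools

city_list = [('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL'),
             ('Anchorage', 'AK'), ('Nome', 'AK'),
             ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ')]

def get_state(city_state):
    # Python 3 has no tuple parameters, so unpack inside the function.
    city, state = city_state
    return state

# Materialize each group before advancing to the next key, because the
# group iterators share the underlying iterable with groupby() itself.
cities_by_state = {}
for state, group in itertools.groupby(city_list, get_state):
    cities_by_state[state] = [city for city, _ in group]
```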
-
-
-The functools module
-----------------------------------------------
-
-The ``functools`` module in Python 2.5 contains some higher-order
-functions. A **higher-order function** takes one or more functions as
-input and returns a new function. The most useful tool in this module
-is the ``partial()`` function.
-
-For programs written in a functional style, you'll sometimes want to
-construct variants of existing functions that have some of the
-parameters filled in. Consider a Python function ``f(a, b, c)``; you
-may wish to create a new function ``g(b, c)`` that's equivalent to
-``f(1, b, c)``; you're filling in a value for one of ``f()``'s parameters.
-This is called "partial function application".
-
-The constructor for ``partial`` takes the arguments ``(function, arg1,
-arg2, ..., kwarg1=value1, kwarg2=value2)``. The resulting object is
-callable, so you can just call it to invoke ``function`` with the
-filled-in arguments.
-
-Here's a small but realistic example::
-
-    import functools
-
-    def log (message, subsystem):
-        "Write the contents of 'message' to the specified subsystem."
-        print '%s: %s' % (subsystem, message)
-        ...
-
-    server_log = functools.partial(log, subsystem='server')
-    server_log('Unable to open socket')
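
The ``f(a, b, c)`` case described earlier can be sketched the same way. The function body here is a placeholder chosen for illustration, and the code assumes Python 3.

```python
import functools

def f(a, b, c):
    return (a, b, c)

# Fill in the first positional parameter; g(b, c) now behaves like f(1, b, c).
g = functools.partial(f, 1)
result = g(2, 3)

# Keyword arguments can be filled in the same way.
h = functools.partial(f, c=30)
result_kw = h(10, 20)
```

Positional arguments given to ``partial()`` are always filled in from the left, so a parameter in the middle of the signature has to be supplied by keyword.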
-
-
-The operator module
--------------------
-
-The ``operator`` module was mentioned earlier. It contains a set of
-functions corresponding to Python's operators. These functions
-are often useful in functional-style code because they save you
-from writing trivial functions that perform a single operation.
-
-Some of the functions in this module are:
-
-* Math operations: ``add()``, ``sub()``, ``mul()``, ``div()``, ``floordiv()``,
- ``abs()``, ...
-* Logical operations: ``not_()``, ``truth()``.
-* Bitwise operations: ``and_()``, ``or_()``, ``invert()``.
-* Comparisons: ``eq()``, ``ne()``, ``lt()``, ``le()``, ``gt()``, and ``ge()``.
-* Object identity: ``is_()``, ``is_not()``.
-
-Consult `the operator module's documentation <http://docs.python.org/lib/module-operator.html>`__ for a complete
-list.
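
A short sketch of how these functions stand in for trivial lambdas. It assumes Python 3, where ``reduce()`` lives in ``functools`` and ``map()`` returns an iterator; only functions that exist in both Python 2 and 3 are used.

```python
import functools
import operator

# operator.add replaces a hand-written lambda x, y: x + y.
total = functools.reduce(operator.add, [1, 2, 3, 4], 0)

# operator.mul applied pairwise across two sequences.
products = list(map(operator.mul, [1, 2, 3], [4, 5, 6]))

# Comparison and identity helpers return ordinary booleans.
is_less = operator.lt(2, 5)
same_object = operator.is_(None, None)
```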
-
-
-
-The functional module
----------------------
-
-Collin Winter's `functional module <http://oakwinter.com/code/functional/>`__
-provides a number of more
-advanced tools for functional programming. It also reimplements
-several Python built-ins, trying to make them more intuitive to those
-used to functional programming in other languages.
-
-This section contains an introduction to some of the most important
-functions in ``functional``; full documentation can be found at `the
-project's website <http://oakwinter.com/code/functional/documentation/>`__.
-
-``compose(outer, inner, unpack=False)``
-
-The ``compose()`` function implements function composition.
-In other words, it returns a wrapper around the ``outer`` and ``inner`` callables, such
-that the return value from ``inner`` is fed directly to ``outer``. That is,
-
-::
-
-    >>> def add(a, b):
-    ...     return a + b
-    ...
-    >>> def double(a):
-    ...     return 2 * a
-    ...
-    >>> compose(double, add)(5, 6)
-    22
-
-is equivalent to
-
-::
-
- >>> double(add(5, 6))
- 22
-
-The ``unpack`` keyword is provided to work around the fact that Python functions are not always
-`fully curried <http://en.wikipedia.org/wiki/Currying>`__.
-By default, it is expected that the ``inner`` function will return a single object and that the ``outer``
-function will take a single argument. Setting the ``unpack`` argument causes ``compose`` to expect a
-tuple from ``inner`` which will be expanded before being passed to ``outer``. Put simply,
-
-::
-
- compose(f, g)(5, 6)
-
-is equivalent to::
-
- f(g(5, 6))
-
-while
-
-::
-
- compose(f, g, unpack=True)(5, 6)
-
-is equivalent to::
-
- f(*g(5, 6))
-
-Even though ``compose()`` only accepts two functions, it's trivial to
-build up a version that will compose any number of functions. We'll
-use ``reduce()``, ``compose()`` and ``partial()`` (the last of which
-is provided by both ``functional`` and ``functools``).
-
-::
-
- from functional import compose, partial
-
- multi_compose = partial(reduce, compose)
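
For readers without the ``functional`` package installed, the behaviour described above can be approximated with a small stand-in. The ``compose`` below is a hypothetical re-implementation written for this example, not Collin Winter's code, and the ``swap`` helper is likewise invented; the sketch assumes Python 3, where ``reduce()`` is imported from ``functools``.

```python
import functools

def compose(outer, inner, unpack=False):
    # Hypothetical stand-in for functional.compose, for illustration only.
    def composed(*args, **kwargs):
        result = inner(*args, **kwargs)
        if unpack:
            return outer(*result)
        return outer(result)
    return composed

def add(a, b):
    return a + b

def double(a):
    return 2 * a

def swap(a, b):
    return (b, a)

# Same as double(add(5, 6)).
doubled_sum = compose(double, add)(5, 6)

# With unpack=True the tuple from swap() is expanded into two arguments.
difference = compose(lambda a, b: a - b, swap, unpack=True)(5, 6)

# reduce folds compose over a list, so the leftmost function is outermost.
multi_compose = functools.partial(functools.reduce, compose)
pipeline = multi_compose([double, double, add])
```

Here ``pipeline(5, 6)`` evaluates ``double(double(add(5, 6)))``.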
-
-
-We can also use ``map()``, ``compose()`` and ``partial()`` to craft a
-version of ``"".join(...)`` that converts its arguments to string::
-
- from functional import compose, partial
-
- join = compose("".join, partial(map, str))
-
-
-``flip(func)``
-
-``flip()`` wraps the callable in ``func`` and
-causes it to receive its non-keyword arguments in reverse order.
-
-::
-
-    >>> def triple(a, b, c):
-    ...     return (a, b, c)
-    ...
-    >>> triple(5, 6, 7)
-    (5, 6, 7)
-    >>>
-    >>> flipped_triple = flip(triple)
-    >>> flipped_triple(5, 6, 7)
-    (7, 6, 5)
-
-``foldl(func, start, iterable)``
-
-``foldl()`` takes a binary function, a starting value (usually some kind of 'zero'), and an iterable.
-The function is applied to the starting value and the first element of the list, then the result of
-that and the second element of the list, then the result of that and the third element of the list,
-and so on.
-
-This means that a call such as::
-
- foldl(f, 0, [1, 2, 3])
-
-is equivalent to::
-
- f(f(f(0, 1), 2), 3)
-
-
-``foldl()`` is roughly equivalent to the following recursive function::
-
-    def foldl(func, start, seq):
-        if len(seq) == 0:
-            return start
-
-        return foldl(func, func(start, seq[0]), seq[1:])
-
-Speaking of equivalence, the above ``foldl`` call can be expressed in terms of the built-in ``reduce`` like
-so::
-
- reduce(f, [1, 2, 3], 0)
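
The recursive definition copies the sequence on every call and is limited by Python's recursion depth, so an iterative version is a reasonable alternative. The sketch below is an illustrative stand-in, not the ``functional`` module's implementation; it assumes Python 3 and uses a string-building ``f`` purely to make the nesting order visible.

```python
import functools

def foldl(func, start, iterable):
    # Illustrative stand-in for functional.foldl, not the library's code.
    result = start
    for elem in iterable:
        result = func(result, elem)
    return result

def f(acc, x):
    # Record the call structure so the nesting is visible.
    return "f(%s, %s)" % (acc, x)

nested = foldl(f, 0, [1, 2, 3])
via_reduce = functools.reduce(f, [1, 2, 3], 0)
```

Both ``nested`` and ``via_reduce`` hold the string ``f(f(f(0, 1), 2), 3)``, matching the expansion shown above.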
-
-
-We can use ``foldl()``, ``operator.concat()`` and ``partial()`` to
-write a cleaner, more aesthetically-pleasing version of Python's
-``"".join(...)`` idiom::
-
- from functional import foldl, partial
- from operator import concat
-
- join = partial(foldl, concat, "")
-
-
-Revision History and Acknowledgements
-------------------------------------------------
-
-The author would like to thank the following people for offering
-suggestions, corrections and assistance with various drafts of this
-article: Ian Bicking, Nick Coghlan, Nick Efford, Raymond Hettinger,
-Jim Jewett, Mike Krell, Leandro Lameiro, Jussi Salmela,
-Collin Winter, Blake Winton.
-
-Version 0.1: posted June 30 2006.
-
-Version 0.11: posted July 1 2006. Typo fixes.
-
-Version 0.2: posted July 10 2006. Merged genexp and listcomp
-sections into one. Typo fixes.
-
-Version 0.21: Added more references suggested on the tutor mailing list.
-
-Version 0.30: Adds a section on the ``functional`` module written by
-Collin Winter; adds short section on the operator module; a few other
-edits.
-
-
-References
---------------------
-
-General
-'''''''''''''''
-
-**Structure and Interpretation of Computer Programs**, by
-Harold Abelson and Gerald Jay Sussman with Julie Sussman.
-Full text at http://mitpress.mit.edu/sicp/.
-In this classic textbook of computer science, chapters 2 and 3 discuss the
-use of sequences and streams to organize the data flow inside a
-program. The book uses Scheme for its examples, but many of the
-design approaches described in these chapters are applicable to
-functional-style Python code.
-
-http://www.defmacro.org/ramblings/fp.html: A general
-introduction to functional programming that uses Java examples
-and has a lengthy historical introduction.
-
-http://en.wikipedia.org/wiki/Functional_programming:
-General Wikipedia entry describing functional programming.
-
-http://en.wikipedia.org/wiki/Coroutine:
-Entry for coroutines.
-
-http://en.wikipedia.org/wiki/Currying:
-Entry for the concept of currying.
-
-Python-specific
-'''''''''''''''''''''''''''
-
-http://gnosis.cx/TPiP/:
-The first chapter of David Mertz's book :title-reference:`Text Processing in Python`
-discusses functional programming for text processing, in the section titled
-"Utilizing Higher-Order Functions in Text Processing".
-
-Mertz also wrote a 3-part series of articles on functional programming
-for IBM's DeveloperWorks site; see
-`part 1 <http://www-128.ibm.com/developerworks/library/l-prog.html>`__,
-`part 2 <http://www-128.ibm.com/developerworks/library/l-prog2.html>`__, and
-`part 3 <http://www-128.ibm.com/developerworks/linux/library/l-prog3.html>`__.
-
-
-Python documentation
-'''''''''''''''''''''''''''
-
-http://docs.python.org/lib/module-itertools.html:
-Documentation for the ``itertools`` module.
-
-http://docs.python.org/lib/module-operator.html:
-Documentation for the ``operator`` module.
-
-http://www.python.org/dev/peps/pep-0289/:
-PEP 289: "Generator Expressions"
-
-http://www.python.org/dev/peps/pep-0342/:
-PEP 342: "Coroutines via Enhanced Generators" describes the new generator
-features in Python 2.5.
-
-.. comment
-
- Topics to place
- -----------------------------
-
- XXX os.walk()
-
- XXX Need a large example.
-
- But will an example add much? I'll post a first draft and see
- what the comments say.
-
-.. comment
-
- Original outline:
- Introduction
- Idea of FP
- Programs built out of functions
- Functions are strictly input-output, no internal state
- Opposed to OO programming, where objects have state
-
- Why FP?
- Formal provability
- Assignment is difficult to reason about
- Not very relevant to Python
- Modularity
- Small functions that do one thing
- Debuggability:
- Easy to test due to lack of state
- Easy to verify output from intermediate steps
- Composability
- You assemble a toolbox of functions that can be mixed
-
- Tackling a problem
- Need a significant example
-
- Iterators
- Generators
- The itertools module
- List comprehensions
- Small functions and the lambda statement
- Built-in functions
- map
- filter
- reduce
-
-.. comment
-
- Handy little function for printing part of an iterator -- used
- while writing this document.
-
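-   import itertools
-   import sys
-
-   def print_iter(it):
-       # islice objects cannot be sliced or indexed, so build a list first.
-       items = list(itertools.islice(it, 10))
-       for elem in items[:-1]:
-           sys.stdout.write(str(elem))
-           sys.stdout.write(', ')
-       if items:
-           print items[-1]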
-
-
diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex
deleted file mode 100644
index 62b6daf..0000000
--- a/Doc/howto/regex.tex
+++ /dev/null
@@ -1,1477 +0,0 @@
-\documentclass{howto}
-
-% TODO:
-% Document lookbehind assertions
-% Better way of displaying a RE, a string, and what it matches
-% Mention optional argument to match.groups()
-% Unicode (at least a reference)
-
-\title{Regular Expression HOWTO}
-
-\release{0.05}
-
-\author{A.M. Kuchling}
-\authoraddress{\email{amk@amk.ca}}
-
-\begin{document}
-\maketitle
-
-\begin{abstract}
-\noindent
-This document is an introductory tutorial to using regular expressions
-in Python with the \module{re} module. It provides a gentler
-introduction than the corresponding section in the Library Reference.
-
-This document is available from
-\url{http://www.amk.ca/python/howto}.
-
-\end{abstract}
-
-\tableofcontents
-
-\section{Introduction}
-
-The \module{re} module was added in Python 1.5, and provides
-Perl-style regular expression patterns. Earlier versions of Python
-came with the \module{regex} module, which provided Emacs-style
-patterns. The \module{regex} module was removed completely in Python 2.5.
-
-Regular expressions (called REs, or regexes, or regex patterns) are
-essentially a tiny, highly specialized programming language embedded
-inside Python and made available through the \module{re} module.
-Using this little language, you specify the rules for the set of
-possible strings that you want to match; this set might contain
-English sentences, or e-mail addresses, or TeX commands, or anything
-you like. You can then ask questions such as ``Does this string match
-the pattern?'', or ``Is there a match for the pattern anywhere in this
-string?''. You can also use REs to modify a string or to split it
-apart in various ways.
-
-Regular expression patterns are compiled into a series of bytecodes
-which are then executed by a matching engine written in C. For
-advanced use, it may be necessary to pay careful attention to how the
-engine will execute a given RE, and write the RE in a certain way in
-order to produce bytecode that runs faster. Optimization isn't
-covered in this document, because it requires that you have a good
-understanding of the matching engine's internals.
-
-The regular expression language is relatively small and restricted, so
-not all possible string processing tasks can be done using regular
-expressions. There are also tasks that \emph{can} be done with
-regular expressions, but the expressions turn out to be very
-complicated. In these cases, you may be better off writing Python
-code to do the processing; while Python code will be slower than an
-elaborate regular expression, it will also probably be more understandable.
-
-\section{Simple Patterns}
-
-We'll start by learning about the simplest possible regular
-expressions. Since regular expressions are used to operate on
-strings, we'll begin with the most common task: matching characters.
-
-For a detailed explanation of the computer science underlying regular
-expressions (deterministic and non-deterministic finite automata), you
-can refer to almost any textbook on writing compilers.
-
-\subsection{Matching Characters}
-
-Most letters and characters will simply match themselves. For
-example, the regular expression \regexp{test} will match the string
-\samp{test} exactly. (You can enable a case-insensitive mode that
-would let this RE match \samp{Test} or \samp{TEST} as well; more
-about this later.)
-
-There are exceptions to this rule; some characters are special
-\dfn{metacharacters}, and don't match themselves. Instead, they
-signal that some out-of-the-ordinary thing should be matched, or they
-affect other portions of the RE by repeating them or changing their
-meaning. Much of this document is devoted to discussing various
-metacharacters and what they do.
-
-Here's a complete list of the metacharacters; their meanings will be
-discussed in the rest of this HOWTO.
-
-\begin{verbatim}
-. ^ $ * + ? { [ ] \ | ( )
-\end{verbatim}
-% $
-
-The first metacharacters we'll look at are \samp{[} and \samp{]}.
-They're used for specifying a character class, which is a set of
-characters that you wish to match. Characters can be listed
-individually, or a range of characters can be indicated by giving two
-characters and separating them by a \character{-}. For example,
-\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
-\samp{c}; this is the same as
-\regexp{[a-c]}, which uses a range to express the same set of
-characters. If you wanted to match only lowercase letters, your
-RE would be \regexp{[a-z]}.
-
-Metacharacters are not active inside classes. For example,
-\regexp{[akm\$]} will match any of the characters \character{a},
-\character{k}, \character{m}, or \character{\$}; \character{\$} is
-usually a metacharacter, but inside a character class it's stripped of
-its special nature.
-
-You can match the characters not listed within the class by
-\dfn{complementing} the set. This is indicated by including a
-\character{\^} as the first character of the class; a \character{\^}
-anywhere else in the class will simply match the
-\character{\^} character. For example, \verb|[^5]| will match any
-character except \character{5}.
-
-Perhaps the most important metacharacter is the backslash, \samp{\e}.
-As in Python string literals, the backslash can be followed by various
-characters to signal various special sequences. It's also used to escape
-all the metacharacters so you can still match them in patterns; for
-example, if you need to match a \samp{[} or
-\samp{\e}, you can precede them with a backslash to remove their
-special meaning: \regexp{\e[} or \regexp{\e\e}.
-
-Some of the special sequences beginning with \character{\e} represent
-predefined sets of characters that are often useful, such as the set
-of digits, the set of letters, or the set of anything that isn't
-whitespace. The following predefined special sequences are available:
-
-\begin{itemize}
-\item[\code{\e d}]Matches any decimal digit; this is
-equivalent to the class \regexp{[0-9]}.
-
-\item[\code{\e D}]Matches any non-digit character; this is
-equivalent to the class \verb|[^0-9]|.
-
-\item[\code{\e s}]Matches any whitespace character; this is
-equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
-
-\item[\code{\e S}]Matches any non-whitespace character; this is
-equivalent to the class \verb|[^ \t\n\r\f\v]|.
-
-\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
-\regexp{[a-zA-Z0-9_]}.
-
-\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
-\verb|[^a-zA-Z0-9_]|.
-\end{itemize}
-
-These sequences can be included inside a character class. For
-example, \regexp{[\e s,.]} is a character class that will match any
-whitespace character, or \character{,} or \character{.}.
-
-The final metacharacter in this section is \regexp{.}. It matches
-anything except a newline character, and there's an alternate mode
-(\code{re.DOTALL}) where it will match even a newline. \character{.}
-is often used where you want to match ``any character''.
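
The character-class rules in this section can be checked interactively. The sketch below assumes Python 3 and uses the module-level re.findall() function, which is introduced later in this HOWTO; the sample strings are invented for illustration.

```python
import re

# [abc] and [a-c] describe the same set of characters.
same_set = re.findall(r'[abc]', 'aXbYcZ') == re.findall(r'[a-c]', 'aXbYcZ')

# $ loses its special meaning inside a class.
dollar_hits = re.findall(r'[akm$]', 'make $5')

# A leading ^ complements the class: everything except '5'.
not_five = re.findall(r'[^5]', '1525')

# \d is shorthand for [0-9]; \s can be combined with other characters.
digits = re.findall(r'\d', 'a1b22')
separators = re.findall(r'[\s,.]', 'a, b.c')

# . skips newlines unless re.DOTALL is given.
dot_default = re.findall(r'.', 'a\nb')
dot_all = re.findall(r'.', 'a\nb', re.DOTALL)
```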
-
-\subsection{Repeating Things}
-
-Being able to match varying sets of characters is the first thing
-regular expressions can do that isn't already possible with the
-methods available on strings. However, if that was the only
-additional capability of regexes, they wouldn't be much of an advance.
-Another capability is that you can specify that portions of the RE
-must be repeated a certain number of times.
-
-The first metacharacter for repeating things that we'll look at is
-\regexp{*}. \regexp{*} doesn't match the literal character \samp{*};
-instead, it specifies that the previous character can be matched zero
-or more times, instead of exactly once.
-
-For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
-characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
-characters), and so forth. The RE engine has various internal
-limitations stemming from the size of C's \code{int} type that will
-prevent it from matching over 2 billion \samp{a} characters; you
-probably don't have enough memory to construct a string that large, so
-you shouldn't run into that limit.
-
-Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
-the matching engine will try to repeat it as many times as possible.
-If later portions of the pattern don't match, the matching engine will
-then back up and try again with fewer repetitions.
-
-A step-by-step example will make this more obvious. Let's consider
-the expression \regexp{a[bcd]*b}. This matches the letter
-\character{a}, zero or more letters from the class \code{[bcd]}, and
-finally ends with a \character{b}. Now imagine matching this RE
-against the string \samp{abcbd}.
-
-\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
-\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
-\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
-it can, which is to the end of the string.}
-\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
-current position is at the end of the string, so it fails.}
-\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
-one less character.}
-\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
-current position is at the last character, which is a \character{d}.}
-\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
-only matching \samp{bc}.}
-\lineiii{7}{\code{abcb}}{Try \regexp{b} again. This time
-the character at the current position is \character{b}, so it succeeds.}
-\end{tableiii}
-
-The end of the RE has now been reached, and it has matched
-\samp{abcb}. This demonstrates how the matching engine goes as far as
-it can at first, and if no match is found it will then progressively
-back up and retry the rest of the RE again and again. It will back up
-until it has tried zero matches for \regexp{[bcd]*}, and if that
-subsequently fails, the engine will conclude that the string doesn't
-match the RE at all.
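
The outcome of the walkthrough can be confirmed directly, although the intermediate backtracking steps are not observable from Python. This sketch assumes Python 3 and uses the compile() and match() calls covered later in this HOWTO; the second test string is invented for illustration.

```python
import re

p = re.compile(r'a[bcd]*b')

# The engine backs up from 'abcbd' until the final b can match.
m = p.match('abcbd')
matched = m.group()
span = m.span()

# With no 'b' left to back up to, the match fails outright.
no_match = p.match('acd')
```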
-
-Another repeating metacharacter is \regexp{+}, which matches one or
-more times. Pay careful attention to the difference between
-\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
-times, so whatever's being repeated may not be present at all, while
-\regexp{+} requires at least \emph{one} occurrence. To use a similar
-example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
-\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
-
-There are two more repeating qualifiers. The question mark character,
-\regexp{?}, matches either once or zero times; you can think of it as
-marking something as being optional. For example, \regexp{home-?brew}
-matches either \samp{homebrew} or \samp{home-brew}.
-
-The most complicated repeating qualifier is
-\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
-integers. This qualifier means there must be at least \var{m}
-repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b}
-will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match
-\samp{ab}, which has no slashes, or \samp{a////b}, which has four.
-
-You can omit either \var{m} or \var{n}; in that case, a reasonable
-value is assumed for the missing value. Omitting \var{m} is
-interpreted as a lower limit of 0, while omitting \var{n} results in
-an upper bound of infinity --- actually, the upper bound is the
-2-billion limit mentioned earlier, but that might as well be infinity.
-
-Readers of a reductionist bent may notice that the three other qualifiers
-can all be expressed using this notation. \regexp{\{0,\}} is the same
-as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
-\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use
-\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
-they're shorter and easier to read.
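
The four qualifiers can be compared side by side. The sketch below assumes Python 3.4 or later, where re.fullmatch() is available; full_matches is a hypothetical helper defined only for this example, and the candidate strings extend the ones used in the text.

```python
import re

def full_matches(pattern, candidates):
    # Hypothetical helper: keep the candidates the pattern matches in full.
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = full_matches(r'ca*t', ['ct', 'cat', 'caaat'])
plus = full_matches(r'ca+t', ['ct', 'cat', 'caaat'])
optional = full_matches(r'home-?brew', ['homebrew', 'home-brew', 'home--brew'])
bounded = full_matches(r'a/{1,3}b', ['ab', 'a/b', 'a//b', 'a///b', 'a////b'])
```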
-
-\section{Using Regular Expressions}
-
-Now that we've looked at some simple regular expressions, how do we
-actually use them in Python? The \module{re} module provides an
-interface to the regular expression engine, allowing you to compile
-REs into objects and then perform matches with them.
-
-\subsection{Compiling Regular Expressions}
-
-Regular expressions are compiled into \class{RegexObject} instances,
-which have methods for various operations such as searching for
-pattern matches or performing string substitutions.
-
-\begin{verbatim}
->>> import re
->>> p = re.compile('ab*')
->>> print p
-<re.RegexObject instance at 80b4150>
-\end{verbatim}
-
-\function{re.compile()} also accepts an optional \var{flags}
-argument, used to enable various special features and syntax
-variations. We'll go over the available settings later, but for now a
-single example will do:
-
-\begin{verbatim}
->>> p = re.compile('ab*', re.IGNORECASE)
-\end{verbatim}
-
-The RE is passed to \function{re.compile()} as a string. REs are
-handled as strings because regular expressions aren't part of the core
-Python language, and no special syntax was created for expressing
-them. (There are applications that don't need REs at all, so there's
-no need to bloat the language specification by including them.)
-Instead, the \module{re} module is simply a C extension module
-included with Python, just like the \module{socket} or \module{zlib}
-modules.
-
-Putting REs in strings keeps the Python language simpler, but has one
-disadvantage which is the topic of the next section.
-
-\subsection{The Backslash Plague}
-
-As stated earlier, regular expressions use the backslash
-character (\character{\e}) to indicate special forms or to allow
-special characters to be used without invoking their special meaning.
-This conflicts with Python's usage of the same character for the same
-purpose in string literals.
-
-Let's say you want to write a RE that matches the string
-\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure
-out what to write in the program code, start with the desired string
-to be matched. Next, you must escape any backslashes and other
-metacharacters by preceding them with a backslash, resulting in the
-string \samp{\e\e section}. The string passed
-to \function{re.compile()} must therefore be \verb|\\section|. However, to
-express this as a Python string literal, both backslashes must be
-escaped \emph{again}.
-
-\begin{tableii}{c|l}{code}{Characters}{Stage}
- \lineii{\e section}{Text string to be matched}
- \lineii{\e\e section}{Escaped backslash for \function{re.compile}}
- \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
-\end{tableii}
-
-In short, to match a literal backslash, one has to write
-\code{'\e\e\e\e'} as the RE string, because the regular expression
-must be \samp{\e\e}, and each backslash must be expressed as
-\samp{\e\e} inside a regular Python string literal. In REs that
-feature backslashes repeatedly, this leads to lots of repeated
-backslashes and makes the resulting strings difficult to understand.
-
-The solution is to use Python's raw string notation for regular
-expressions; backslashes are not handled in any special way in
-a string literal prefixed with \character{r}, so \code{r"\e n"} is a
-two-character string containing \character{\e} and \character{n},
-while \code{"\e n"} is a one-character string containing a newline.
-Regular expressions will often be written in Python
-code using this raw string notation.
-
-\begin{tableii}{c|c}{code}{Regular string}{Raw string}
- \lineii{"ab*"}{\code{r"ab*"}}
- \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
- \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
-\end{tableii}
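
The equivalence between the two spellings is easy to verify. This sketch assumes Python 3; the sample text being searched is invented for illustration.

```python
import re

# The escaped literal and the raw literal spell the same pattern text.
same_pattern = ("\\\\section" == r"\\section")

raw_len = len(r"\n")      # two characters: a backslash and the letter n
cooked_len = len("\n")    # one character: a newline

# The raw-string pattern matches a literal backslash followed by 'section'.
m = re.search(r"\\section", r"see \section{Intro}")
found = m.group()
```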
-
-\subsection{Performing Matches}
-
-Once you have an object representing a compiled regular expression,
-what do you do with it? \class{RegexObject} instances have several
-methods and attributes. Only the most significant ones will be
-covered here; consult \ulink{the Library
-Reference}{http://www.python.org/doc/lib/module-re.html} for a
-complete listing.
-
-\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
- \lineii{match()}{Determine if the RE matches at the beginning of
- the string.}
- \lineii{search()}{Scan through a string, looking for any location
- where this RE matches.}
- \lineii{findall()}{Find all substrings where the RE matches,
-and return them as a list.}
- \lineii{finditer()}{Find all substrings where the RE matches,
-and return them as an iterator.}
-\end{tableii}
-
-\method{match()} and \method{search()} return \code{None} if no match
-can be found. If they're successful, a \code{MatchObject} instance is
-returned, containing information about the match: where it starts and
-ends, the substring it matched, and more.
-
-You can learn about this by interactively experimenting with the
-\module{re} module. If you have Tkinter available, you may also want
-to look at \file{Tools/scripts/redemo.py}, a demonstration program
-included with the Python distribution. It allows you to enter REs and
-strings, and displays whether the RE matches or fails.
-\file{redemo.py} can be quite useful when trying to debug a
-complicated RE. Phil Schwartz's
-\ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive
-tool for developing and testing RE patterns.
-
-This HOWTO uses the standard Python interpreter for its examples.
-First, run the Python interpreter, import the \module{re} module, and
-compile a RE:
-
-\begin{verbatim}
-Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
->>> import re
->>> p = re.compile('[a-z]+')
->>> p
-<_sre.SRE_Pattern object at 80c3c28>
-\end{verbatim}
-
-Now, you can try matching various strings against the RE
-\regexp{[a-z]+}. An empty string shouldn't match at all, since
-\regexp{+} means 'one or more repetitions'. \method{match()} should
-return \code{None} in this case, which will cause the interpreter to
-print no output. You can explicitly print the result of
-\method{match()} to make this clear.
-
-\begin{verbatim}
->>> p.match("")
->>> print p.match("")
-None
-\end{verbatim}
-
-Now, let's try it on a string that it should match, such as
-\samp{tempo}. In this case, \method{match()} will return a
-\class{MatchObject}, so you should store the result in a variable for
-later use.
-
-\begin{verbatim}
->>> m = p.match('tempo')
->>> print m
-<_sre.SRE_Match object at 80c4f68>
-\end{verbatim}
-
-Now you can query the \class{MatchObject} for information about the
-matching string. \class{MatchObject} instances also have several
-methods and attributes; the most important ones are:
-
-\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
- \lineii{group()}{Return the string matched by the RE}
- \lineii{start()}{Return the starting position of the match}
- \lineii{end()}{Return the ending position of the match}
- \lineii{span()}{Return a tuple containing the (start, end) positions
- of the match}
-\end{tableii}
-
-Trying these methods will soon clarify their meaning:
-
-\begin{verbatim}
->>> m.group()
-'tempo'
->>> m.start(), m.end()
-(0, 5)
->>> m.span()
-(0, 5)
-\end{verbatim}
-
-\method{group()} returns the substring that was matched by the
-RE. \method{start()} and \method{end()} return the starting and
-ending index of the match. \method{span()} returns both start and end
-indexes in a single tuple. Since the \method{match} method only
-checks if the RE matches at the start of a string,
-\method{start()} will always be zero. However, the \method{search}
-method of \class{RegexObject} instances scans through the string, so
-the match may not start at zero in that case.
-
-\begin{verbatim}
->>> print p.match('::: message')
-None
->>> m = p.search('::: message') ; print m
-<re.MatchObject instance at 80c9650>
->>> m.group()
-'message'
->>> m.span()
-(4, 11)
-\end{verbatim}
-
-In actual programs, the most common style is to store the
-\class{MatchObject} in a variable, and then check if it was
-\code{None}. This usually looks like:
-
-\begin{verbatim}
-p = re.compile( ... )
-m = p.match( 'string goes here' )
-if m:
-    print 'Match found: ', m.group()
-else:
-    print 'No match'
-\end{verbatim}
-
-Two \class{RegexObject} methods return all of the matches for a pattern.
-\method{findall()} returns a list of matching strings:
-
-\begin{verbatim}
->>> p = re.compile(r'\d+')
->>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
-['12', '11', '10']
-\end{verbatim}
-
-\method{findall()} has to create the entire list before it can be
-returned as the result. The \method{finditer()} method returns a
-sequence of \class{MatchObject} instances as an
-iterator.\footnote{Introduced in Python 2.2.2.}
-
-\begin{verbatim}
->>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
->>> iterator
-<callable-iterator object at 0x401833ac>
->>> for match in iterator:
-...     print match.span()
-...
-(0, 2)
-(22, 24)
-(29, 31)
-\end{verbatim}
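
The two methods can be compared on the full, unabbreviated sentence. This is a sketch assuming Python 3; the variable names are chosen for this example.

```python
import re

p = re.compile(r'\d+')
text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# findall() builds the whole list of matching strings at once.
numbers = p.findall(text)

# finditer() yields match objects lazily; each one knows its own span.
spans = [m.span() for m in p.finditer(text)]
```

When only the first few matches are needed from a long string, iterating over finditer() avoids building the complete list.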
-
-
-\subsection{Module-Level Functions}
-
-You don't have to create a \class{RegexObject} and call its methods;
-the \module{re} module also provides top-level functions called
-\function{match()}, \function{search()}, \function{findall()},
-\function{sub()}, and so forth. These functions take the same
-arguments as the corresponding \class{RegexObject} method, with the RE
-string added as the first argument, and still return either
-\code{None} or a \class{MatchObject} instance.
-
-\begin{verbatim}
->>> print re.match(r'From\s+', 'Fromage amk')
-None
->>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
-<re.MatchObject instance at 80c5978>
-\end{verbatim}
-
-Under the hood, these functions simply produce a \class{RegexObject}
-for you and call the appropriate method on it. They also store the
-compiled object in a cache, so future calls using the same
-RE are faster.
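-
-As a minimal sketch of that equivalence, the snippet below calls the
-module-level \function{re.match()} and then the \method{match()} method
-of a compiled object with the same RE. The variable names are only
-illustrative, and the function form of \code{print} is used so that the
-snippet is not tied to the print statement.
-
```python
import re

pattern = r'From\s+'
line = 'From amk Thu May 14 19:12:10 1998'

# Module-level function: the RE string is passed as the first argument.
m1 = re.match(pattern, line)

# Same operation, but compiling first and calling the method ourselves.
compiled = re.compile(pattern)
m2 = compiled.match(line)

# Both calls match the same span of the input string.
print(m1.span() == m2.span())
```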
-
-Should you use these module-level functions, or should you get the
-\class{RegexObject} and call its methods yourself? That choice
-depends on how frequently the RE will be used, and on your personal
-coding style. If the RE is being used at only one point in the code,
-then the module functions are probably more convenient. If a program
-contains a lot of regular expressions, or re-uses the same ones in
-several locations, then it might be worthwhile to collect all the
-definitions in one place, in a section of code that compiles all the
-REs ahead of time. To take an example from the standard library,
-here's an extract from \file{xmllib.py}:
-
-\begin{verbatim}
-ref = re.compile( ... )
-entityref = re.compile( ... )
-charref = re.compile( ... )
-starttagopen = re.compile( ... )
-\end{verbatim}
-
-I generally prefer to work with the compiled object, even for
-one-time uses, but few people will be as much of a purist about this
-as I am.
-
-\subsection{Compilation Flags}
-
-Compilation flags let you modify some aspects of how regular
-expressions work. Flags are available in the \module{re} module under
-two names, a long name such as \constant{IGNORECASE} and a short,
-one-letter form such as \constant{I}. (If you're familiar with Perl's
-pattern modifiers, the one-letter forms use the same letters; the
-short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
-Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
-re.M} sets both the \constant{I} and \constant{M} flags, for example.
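-
-A small sketch of combining flags, assuming a made-up two-line string:
-the pattern below only matches when both \constant{I} and \constant{M}
-are set, because the text it looks for is in a different case and
-starts on the second line.
-
```python
import re

text = 'first line\nSECOND line'

# re.I | re.M is the same as re.IGNORECASE | re.MULTILINE.
both = re.compile(r'^second', re.I | re.M)
m = both.search(text)
print(m.group())

# With only one of the two flags, the search finds nothing.
only_i = re.compile(r'^second', re.I).search(text)
only_m = re.compile(r'^second', re.M).search(text)
print(only_i is None and only_m is None)
```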
-
-Here's a table of the available flags, followed by
-a more detailed explanation of each one.
-
-\begin{tableii}{c|l}{}{Flag}{Meaning}
- \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
- character, including newlines}
- \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
- \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
- \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
- affecting \regexp{\^} and \regexp{\$}}
- \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
- which can be organized more cleanly and understandably.}
-\end{tableii}
-
-\begin{datadesc}{I}
-\dataline{IGNORECASE}
-Perform case-insensitive matching; character classes and literal strings
-will match
-letters regardless of case. For example, \regexp{[A-Z]} will match
-lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
-\samp{spam}, or \samp{spAM}.
-This lowercasing doesn't take the current locale into account; it will
-if you also set the \constant{LOCALE} flag.
-\end{datadesc}
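-
-A short sketch of \constant{IGNORECASE} in action; the list of test
-words is an assumption chosen to mirror the examples in the text.
-
```python
import re

p = re.compile(r'spam', re.IGNORECASE)

# Every spelling of 'spam' is accepted; 'ham' is not.
words = ['Spam', 'spam', 'spAM', 'ham']
accepted = [w for w in words if p.match(w)]
print(accepted)

# A character class is affected in the same way.
letters = re.match(r'[A-Z]+', 'abc', re.IGNORECASE).group()
print(letters)
```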
-
-\begin{datadesc}{L}
-\dataline{LOCALE}
-Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
-and \regexp{\e B} dependent on the current locale.
-
-Locales are a feature of the C library intended to help in writing
-programs that take account of language differences. For example, if
-you're processing French text, you'd want to be able to write
-\regexp{\e w+} to match words, but \regexp{\e w} only matches the
-character class \regexp{[A-Za-z0-9_]}; it won't match \character{\'e} or
-\character{\c c}. If your system is configured properly and a French
-locale is selected, certain C functions will tell the program that
-\character{\'e} should also be considered a letter. Setting the
-\constant{LOCALE} flag when compiling a regular expression will cause the
-resulting compiled object to use these C functions for \regexp{\e w};
-this is slower, but also enables \regexp{\e w+} to match French words as
-you'd expect.
-\end{datadesc}
-
-\begin{datadesc}{M}
-\dataline{MULTILINE}
-(\regexp{\^} and \regexp{\$} haven't been explained yet;
-they'll be introduced in section~\ref{more-metacharacters}.)
-
-Usually \regexp{\^} matches only at the beginning of the string, and
-\regexp{\$} matches only at the end of the string and immediately before the
-newline (if any) at the end of the string. When this flag is
-specified, \regexp{\^} matches at the beginning of the string and at
-the beginning of each line within the string, immediately following
-each newline. Similarly, the \regexp{\$} metacharacter matches both at
-the end of the string and at the end of each line (immediately
-preceding each newline).
-
-\end{datadesc}
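-
-The following sketch compares the two modes on a made-up three-line
-string. It uses the module-level \function{re.findall()} with a flags
-argument, which is assumed to be available in the Python version in use.
-
```python
import re

text = 'one\ntwo\nthree'

# Without MULTILINE, ^ matches only at the start of the string.
starts_default = re.findall(r'^\w+', text)

# With MULTILINE, ^ also matches right after each newline ...
starts_multi = re.findall(r'^\w+', text, re.MULTILINE)

# ... and $ also matches right before each newline.
ends_multi = re.findall(r'\w+$', text, re.MULTILINE)

print(starts_default)
print(starts_multi)
print(ends_multi)
```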
-
-\begin{datadesc}{S}
-\dataline{DOTALL}
-Makes the \character{.} special character match any character at all,
-including a newline; without this flag, \character{.} will match
-anything \emph{except} a newline.
-\end{datadesc}
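-
-As a brief sketch, with a made-up two-line string, the same RE stops at
-the newline by default and runs to the end of the string under
-\constant{DOTALL}:
-
```python
import re

text = 'line one\nline two'

# By default '.' stops at the newline.
default_match = re.match(r'.*', text).group()

# With DOTALL, '.' matches the newline too, so '.*' takes the whole string.
dotall_match = re.match(r'.*', text, re.DOTALL).group()

print(repr(default_match))
print(repr(dotall_match))
```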
-
-\begin{datadesc}{X}
-\dataline{VERBOSE} This flag allows you to write regular expressions
-that are more readable by granting you more flexibility in how you can
-format them. When this flag has been specified, whitespace within the
-RE string is ignored, except when the whitespace is in a character
-class or preceded by an unescaped backslash; this lets you organize
-and indent the RE more clearly. This flag also lets you put comments
-within a RE that will be ignored by the engine; comments are marked by
-a \character{\#} that's neither in a character class nor preceded by an
-unescaped backslash.
-
-For example, here's a RE that uses \constant{re.VERBOSE}; see how
-much easier it is to read?
-
-\begin{verbatim}
-charref = re.compile(r"""
- &[#] # Start of a numeric entity reference
- (
- 0[0-7]+ # Octal form
- | [0-9]+ # Decimal form
- | x[0-9a-fA-F]+ # Hexadecimal form
- )
- ; # Trailing semicolon
-""", re.VERBOSE)
-\end{verbatim}
-
-Without the verbose setting, the RE would look like this:
-\begin{verbatim}
-charref = re.compile("&#(0[0-7]+"
- "|[0-9]+"
- "|x[0-9a-fA-F]+);")
-\end{verbatim}
-
-In the above example, Python's automatic concatenation of string
-literals has been used to break up the RE into smaller pieces, but
-it's still more difficult to understand than the version using
-\constant{re.VERBOSE}.
-
-\end{datadesc}
-
-\section{More Pattern Power}
-
-So far we've only covered a part of the features of regular
-expressions. In this section, we'll cover some new metacharacters,
-and how to use groups to retrieve portions of the text that was matched.
-
-\subsection{More Metacharacters\label{more-metacharacters}}
-
-There are some metacharacters that we haven't covered yet. Most of
-them will be covered in this section.
-
-Some of the remaining metacharacters to be discussed are
-\dfn{zero-width assertions}. They don't cause the engine to advance
-through the string; instead, they consume no characters at all,
-and simply succeed or fail. For example, \regexp{\e b} is an
-assertion that the current position is located at a word boundary; the
-position isn't changed by the \regexp{\e b} at all. This means that
-zero-width assertions should never be repeated, because if they match
-once at a given location, they can obviously be matched an infinite
-number of times.
-
-\begin{list}{}{}
-
-\item[\regexp{|}]
-Alternation, or the ``or'' operator.
-If A and B are regular expressions,
-\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
-\regexp{|} has very low precedence in order to make it work reasonably when
-you're alternating multi-character strings.
-\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
-\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
-
-To match a literal \character{|},
-use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
-
-\item[\regexp{\^}] Matches at the beginning of lines. Unless the
-\constant{MULTILINE} flag has been set, this will only match at the
-beginning of the string. In \constant{MULTILINE} mode, this also
-matches immediately after each newline within the string.
-
-For example, if you wish to match the word \samp{From} only at the
-beginning of a line, the RE to use is \verb|^From|.
-
-\begin{verbatim}
->>> print re.search('^From', 'From Here to Eternity')
-<re.MatchObject instance at 80c1520>
->>> print re.search('^From', 'Reciting From Memory')
-None
-\end{verbatim}
-
-%To match a literal \character{\^}, use \regexp{\e\^} or enclose it
-%inside a character class, as in \regexp{[{\e}\^]}.
-
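-\item[\regexp{\$}] Matches at the end of a line. Unless the
-\constant{MULTILINE} flag has been set, this is either the end of the
-string or the position just before a newline that ends the string; in
-\constant{MULTILINE} mode, it is also any location followed by a
-newline character.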
-
-\begin{verbatim}
->>> print re.search('}$', '{block}')
-<re.MatchObject instance at 80adfa8>
->>> print re.search('}$', '{block} ')
-None
->>> print re.search('}$', '{block}\n')
-<re.MatchObject instance at 80adfa8>
-\end{verbatim}
-% $
-
-To match a literal \character{\$}, use \regexp{\e\$} or enclose it
-inside a character class, as in \regexp{[\$]}.
-
-\item[\regexp{\e A}] Matches only at the start of the string. When
-not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
-effectively the same. In \constant{MULTILINE} mode, they're
-different: \regexp{\e A} still matches only at the beginning of the
-string, but \regexp{\^} may match at any location inside the string
-that follows a newline character.
-
-\item[\regexp{\e Z}] Matches only at the end of the string.
-
-\item[\regexp{\e b}] Word boundary.
-This is a zero-width assertion that matches only at the
-beginning or end of a word. A word is defined as a sequence of
-alphanumeric characters, so the end of a word is indicated by
-whitespace or a non-alphanumeric character.
-
-The following example matches \samp{class} only when it's a complete
-word; it won't match when it's contained inside another word.
-
-\begin{verbatim}
->>> p = re.compile(r'\bclass\b')
->>> print p.search('no class at all')
-<re.MatchObject instance at 80c8f28>
->>> print p.search('the declassified algorithm')
-None
->>> print p.search('one subclass is')
-None
-\end{verbatim}
-
-There are two subtleties you should remember when using this special
-sequence. First, this is the worst collision between Python's string
-literals and regular expression sequences. In Python's string
-literals, \samp{\e b} is the backspace character, ASCII value 8. If
-you're not using raw strings, then Python will convert the \samp{\e b} to
-a backspace, and your RE won't match as you expect it to. The
-following example looks the same as our previous RE, but omits
-the \character{r} in front of the RE string.
-
-\begin{verbatim}
->>> p = re.compile('\bclass\b')
->>> print p.search('no class at all')
-None
->>> print p.search('\b' + 'class' + '\b')
-<re.MatchObject instance at 80c3ee0>
-\end{verbatim}
-
-Second, inside a character class, where there's no use for this
-assertion, \regexp{\e b} represents the backspace character, for
-compatibility with Python's string literals.
-
-\item[\regexp{\e B}] Another zero-width assertion, this is the
-opposite of \regexp{\e b}, only matching when the current
-position is not at a word boundary.
-
-\end{list}
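-
-The items above that have no example of their own can be sketched as
-follows. The test strings are assumptions made up for illustration; the
-snippet covers the precedence of \regexp{|} and the behaviour of
-\regexp{\e A}, \regexp{\e Z}, and \regexp{\e B}.
-
```python
import re

# '|' has low precedence: the alternatives are the whole words,
# so this is not the same RE as Cro(w|S)ervo.
alt = re.compile(r'Crow|Servo')
grouped = re.compile(r'Cro(w|S)ervo')
alt_result = alt.match('CroServo')          # no match
grouped_result = grouped.match('CroServo')  # match

# \A ignores MULTILINE, while ^ matches after every newline.
text = 'first\nsecond'
caret_hits = re.findall(r'^\w+', text, re.MULTILINE)
slash_a_hits = re.findall(r'\A\w+', text, re.MULTILINE)

# \Z matches only at the very end; $ also matches before a final newline.
z_result = re.search(r'}\Z', '{block}\n')
dollar_result = re.search(r'}$', '{block}\n')

# \B is the opposite of \b: 'class' must be inside a longer word.
inside = re.search(r'\Bclass\B', 'the declassified algorithm')
alone = re.search(r'\Bclass\B', 'no class at all')

print(caret_hits)
print(slash_a_hits)
```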
-
-\subsection{Grouping}
-
-Frequently you need to obtain more information than just whether the
-RE matched or not. Regular expressions are often used to dissect
-strings by writing a RE divided into several subgroups which
-match different components of interest. For example, an RFC-822
-header line is divided into a header name and a value, separated by a
-\character{:}, like this:
-
-\begin{verbatim}
-From: author@example.com
-User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
-MIME-Version: 1.0
-To: editor@example.com
-\end{verbatim}
-
-This can be handled by writing a regular expression
-which matches an entire header line, and has one group which matches the
-header name, and another group which matches the header's value.
-
-Groups are marked by the \character{(}, \character{)} metacharacters.
-\character{(} and \character{)} have much the same meaning as they do
-in mathematical expressions; they group together the expressions
-contained inside them, and you can repeat the contents of a
-group with a repeating qualifier, such as \regexp{*}, \regexp{+},
-\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
-\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
-
-\begin{verbatim}
->>> p = re.compile('(ab)*')
->>> print p.match('ababababab').span()
-(0, 10)
-\end{verbatim}
-
-Groups indicated with \character{(}, \character{)} also capture the
-starting and ending index of the text that they match; this can be
-retrieved by passing an argument to \method{group()},
-\method{start()}, \method{end()}, and \method{span()}. Groups are
-numbered starting with 0. Group 0 is always present; it's the whole
-RE, so \class{MatchObject} methods all have group 0 as their default
-argument. Later we'll see how to express groups that don't capture
-the span of text that they match.
-
-\begin{verbatim}
->>> p = re.compile('(a)b')
->>> m = p.match('ab')
->>> m.group()
-'ab'
->>> m.group(0)
-'ab'
-\end{verbatim}
-
-Subgroups are numbered from left to right, from 1 upward. Groups can
-be nested; to determine the number, just count the opening parenthesis
-characters, going from left to right.
-
-\begin{verbatim}
->>> p = re.compile('(a(b)c)d')
->>> m = p.match('abcd')
->>> m.group(0)
-'abcd'
->>> m.group(1)
-'abc'
->>> m.group(2)
-'b'
-\end{verbatim}
-
-\method{group()} can be passed multiple group numbers at a time, in
-which case it will return a tuple containing the corresponding values
-for those groups.
-
-\begin{verbatim}
->>> m.group(2,1,2)
-('b', 'abc', 'b')
-\end{verbatim}
-
-The \method{groups()} method returns a tuple containing the strings
-for all the subgroups, from 1 up to however many there are.
-
-\begin{verbatim}
->>> m.groups()
-('abc', 'b')
-\end{verbatim}
-
-Backreferences in a pattern allow you to specify that the contents of
-an earlier capturing group must also be found at the current location
-in the string. For example, \regexp{\e 1} will succeed if the exact
-contents of group 1 can be found at the current position, and fails
-otherwise. Remember that Python's string literals also use a
-backslash followed by numbers to allow including arbitrary characters
-in a string, so be sure to use a raw string when incorporating
-backreferences in a RE.
-
-For example, the following RE detects doubled words in a string.
-
-\begin{verbatim}
->>> p = re.compile(r'(\b\w+)\s+\1')
->>> p.search('Paris in the the spring').group()
-'the the'
-\end{verbatim}
-
-Backreferences like this aren't often useful for just searching
-through a string --- there are few text formats which repeat data in
-this way --- but you'll soon find out that they're \emph{very} useful
-when performing string substitutions.
-
-\subsection{Non-capturing and Named Groups}
-
-Elaborate REs may use many groups, both to capture substrings of
-interest, and to group and structure the RE itself. In complex REs,
-it becomes difficult to keep track of the group numbers. There are
-two features which help with this problem. Both of them use a common
-syntax for regular expression extensions, so we'll look at that first.
-
-Perl 5 added several additional features to standard regular
-expressions, and the Python \module{re} module supports most of them.
-It would have been difficult to choose new
-single-keystroke metacharacters or new special sequences beginning
-with \samp{\e} to represent the new features without making Perl's
-regular expressions confusingly different from standard REs. If you
-chose \samp{\&} as a new metacharacter, for example, old expressions
-would be assuming that
-\samp{\&} was a regular character and wouldn't have escaped it by
-writing \regexp{\e \&} or \regexp{[\&]}.
-
-The solution chosen by the Perl developers was to use \regexp{(?...)}
-as the extension syntax. \samp{?} immediately after a parenthesis was
-a syntax error because the \samp{?} would have nothing to repeat, so
-this didn't introduce any compatibility problems. The characters
-immediately after the \samp{?} indicate what extension is being used,
-so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
-\regexp{(?:foo)} is something else (a non-capturing group containing
-the subexpression \regexp{foo}).
-
-Python adds an extension syntax to Perl's extension syntax. If the
-first character after the question mark is a \samp{P}, you know that
-it's an extension that's specific to Python. Currently there are two
-such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
-and \regexp{(?P=\var{name})} is a backreference to a named group. If
-future versions of Perl 5 add similar features using a different
-syntax, the \module{re} module will be changed to support the new
-syntax, while preserving the Python-specific syntax for
-compatibility's sake.
-
-Now that we've looked at the general extension syntax, we can return
-to the features that simplify working with groups in complex REs.
-Since groups are numbered from left to right and a complex expression
-may use many groups, it can become difficult to keep track of the
-correct numbering. Modifying such a complex RE is annoying, too:
-insert a new group near the beginning and you change the numbers of
-everything that follows it.
-
-Sometimes you'll want to use a group to collect a part of a regular
-expression, but aren't interested in retrieving the group's contents.
-You can make this fact explicit by using a non-capturing group:
-\regexp{(?:...)}, where you can replace the \regexp{...}
-with any other regular expression.
-
-\begin{verbatim}
->>> m = re.match("([abc])+", "abc")
->>> m.groups()
-('c',)
->>> m = re.match("(?:[abc])+", "abc")
->>> m.groups()
-()
-\end{verbatim}
-
-Except for the fact that you can't retrieve the contents of what the
-group matched, a non-capturing group behaves exactly the same as a
-capturing group; you can put anything inside it, repeat it with a
-repetition metacharacter such as \samp{*}, and nest it within other
-groups (capturing or non-capturing). \regexp{(?:...)} is particularly
-useful when modifying an existing pattern, since you can add new groups
-without changing how all the other groups are numbered. It should be
-mentioned that there's no performance difference in searching between
-capturing and non-capturing groups; neither form is any faster than
-the other.
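-
-To illustrate the point about numbering, here is a sketch in which an
-existing two-group pattern is extended with an optional prefix. The
-patterns and the \samp{X-} prefix are hypothetical; what matters is that
-the non-capturing version leaves groups 1 and 2 where they were.
-
```python
import re

# Original pattern: group 1 is the name, group 2 is the value.
original = re.compile(r'(\w+): (\w+)')

# Extended with a non-capturing group: numbering is unchanged.
non_capturing = re.compile(r'(?:X-)?(\w+): (\w+)')

# Extended with a capturing group: every later group shifts by one.
capturing = re.compile(r'(X-)?(\w+): (\w+)')

g_original = original.match('Priority: high').groups()
g_non_capturing = non_capturing.match('X-Priority: high').groups()
g_capturing = capturing.match('X-Priority: high').groups()

print(g_original)
print(g_non_capturing)
print(g_capturing)
```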
-
-A more significant feature is named groups: instead of
-referring to them by numbers, groups can be referenced by a name.
-
-The syntax for a named group is one of the Python-specific extensions:
-\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
-the group. Named groups also behave exactly like capturing groups,
-and additionally associate a name with a group. The
-\class{MatchObject} methods that deal with capturing groups all accept
-either integers that refer to the group by number or strings that
-contain the desired group's name. Named groups are still given
-numbers, so you can retrieve information about a group in two ways:
-
-\begin{verbatim}
->>> p = re.compile(r'(?P<word>\b\w+\b)')
->>> m = p.search( '(((( Lots of punctuation )))' )
->>> m.group('word')
-'Lots'
->>> m.group(1)
-'Lots'
-\end{verbatim}
-
-Named groups are handy because they let you use easily-remembered
-names, instead of having to remember numbers. Here's an example RE
-from the \module{imaplib} module:
-
-\begin{verbatim}
-InternalDate = re.compile(r'INTERNALDATE "'
- r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
- r'(?P<year>[0-9][0-9][0-9][0-9])'
- r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
- r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
- r'"')
-\end{verbatim}
-
-It's obviously much easier to retrieve \code{m.group('zonem')},
-instead of having to remember to retrieve group 9.
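-
-As a sketch, the RE above can be applied to a sample string. The input
-below is made up for illustration and is not taken from
-\module{imaplib}; the pattern itself is copied from the listing.
-
```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical input, shaped to fit the pattern.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

# The name and the number refer to the same group.
print(m.group('zonem') == m.group(9))
print(m.group('day', 'mon', 'year'))
```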
-
-The syntax for backreferences in an expression such as
-\regexp{(...)\e 1} refers to the number of the group. There's
-naturally a variant that uses the group name instead of the number.
-This is another Python extension: \regexp{(?P=\var{name})} indicates
-that the contents of the group called \var{name} should again be matched
-at the current point. The regular expression for finding doubled
-words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
-\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
-
-\begin{verbatim}
->>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
->>> p.search('Paris in the the spring').group()
-'the the'
-\end{verbatim}
-
-\subsection{Lookahead Assertions}
-
-Another zero-width assertion is the lookahead assertion. Lookahead
-assertions are available in both positive and negative form, and
-look like this:
-
-\begin{itemize}
-\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds
-if the contained regular expression, represented here by \code{...},
-successfully matches at the current location, and fails otherwise.
-But, once the contained expression has been tried, the matching engine
-doesn't advance at all; the rest of the pattern is tried right where
-the assertion started.
-
-\item[\regexp{(?!...)}] Negative lookahead assertion. This is the
-opposite of the positive assertion; it succeeds if the contained expression
-\emph{doesn't} match at the current position in the string.
-\end{itemize}
-
-To make this concrete, let's look at a case where a lookahead is
-useful. Consider a simple pattern to match a filename and split it
-apart into a base name and an extension, separated by a \samp{.}. For
-example, in \samp{news.rc}, \samp{news} is the base name, and
-\samp{rc} is the filename's extension.
-
-The pattern to match this is quite simple:
-
-\regexp{.*[.].*\$}
-
-Notice that the \samp{.} needs to be treated specially because it's a
-metacharacter; I've put it inside a character class. Also notice the
-trailing \regexp{\$}; this is added to ensure that all the rest of the
-string must be included in the extension. This regular expression
-matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
-\samp{printers.conf}.
-
-Now, consider complicating the problem a bit; what if you want to
-match filenames where the extension is not \samp{bat}?
-Some incorrect attempts:
-
-\verb|.*[.][^b].*$|
-% $
-
-The first attempt above tries to exclude \samp{bat} by requiring that
-the first character of the extension is not a \samp{b}. This is
-wrong, because the pattern also doesn't match \samp{foo.bar}.
-
-% Messes up the HTML without the curly braces around \^
-\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
-
-The expression gets messier when you try to patch up the first
-solution by requiring one of the following cases to match: the first
-character of the extension isn't \samp{b}; the second character isn't
-\samp{a}; or the third character isn't \samp{t}. This accepts
-\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
-three-letter extension and won't accept a filename with a two-letter
-extension such as \samp{sendmail.cf}. We'll complicate the pattern
-again in an effort to fix it.
-
-\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
-
-In the third attempt, the second and third letters are both made
-optional in order to allow matching extensions shorter than three
-characters, such as \samp{sendmail.cf}.
-
-The pattern's getting really complicated now, which makes it hard to
-read and understand. Worse, if the problem changes and you want to
-exclude both \samp{bat} and \samp{exe} as extensions, the pattern
-would get even more complicated and confusing.
-
-A negative lookahead cuts through all this confusion:
-
-\regexp{.*[.](?!bat\$).*\$}
-% $
-
-The negative lookahead means: if the expression \regexp{bat\$} doesn't match at
-this point, try the rest of the pattern; if \regexp{bat\$} does match,
-the whole pattern will fail. The trailing \regexp{\$} is required to
-ensure that something like \samp{sample.batch}, where the extension
-only starts with \samp{bat}, will be allowed.
-
-Excluding another filename extension is now easy; simply add it as an
-alternative inside the assertion. The following pattern excludes
-filenames that end in either \samp{bat} or \samp{exe}:
-
-\regexp{.*[.](?!bat\$|exe\$).*\$}
-% $
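-
-A quick sketch that tries both lookahead patterns on a handful of
-filenames. The list of names is an assumption, and each name contains
-only a single \samp{.}.
-
```python
import re

no_bat = re.compile(r'.*[.](?!bat$).*$')
no_bat_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'sample.batch', 'setup.exe']

# Keep the names each pattern accepts.
accepted_no_bat = [n for n in names if no_bat.match(n)]
accepted_no_bat_exe = [n for n in names if no_bat_exe.match(n)]

print(accepted_no_bat)
print(accepted_no_bat_exe)
```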
-
-
-\section{Modifying Strings}
-
-Up to this point, we've simply performed searches against a static
-string. Regular expressions are also commonly used to modify strings
-in various ways, using the following \class{RegexObject} methods:
-
-\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
- \lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
- \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
- \lineii{subn()}{Does the same thing as \method{sub()},
- but returns the new string and the number of replacements}
-\end{tableii}
-
-
-\subsection{Splitting Strings}
-
-The \method{split()} method of a \class{RegexObject} splits a string
-apart wherever the RE matches, returning a list of the pieces.
-It's similar to the \method{split()} method of strings but provides
-much more generality in the delimiters that you can split by; the
-string method only supports splitting by whitespace or by
-a fixed string. As you'd expect, there's a module-level
-\function{re.split()} function, too.
-
-\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
- Split \var{string} by the matches of the regular expression. If
- capturing parentheses are used in the RE, then their contents will
- also be returned as part of the resulting list. If \var{maxsplit}
- is nonzero, at most \var{maxsplit} splits are performed.
-\end{methoddesc}
-
-You can limit the number of splits made, by passing a value for
-\var{maxsplit}. When \var{maxsplit} is nonzero, at most
-\var{maxsplit} splits will be made, and the remainder of the string is
-returned as the final element of the list. In the following example,
-the delimiter is any sequence of non-alphanumeric characters.
-
-\begin{verbatim}
->>> p = re.compile(r'\W+')
->>> p.split('This is a test, short and sweet, of split().')
-['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
->>> p.split('This is a test, short and sweet, of split().', 3)
-['This', 'is', 'a', 'test, short and sweet, of split().']
-\end{verbatim}
-
-Sometimes you're not only interested in what the text between
-delimiters is, but also need to know what the delimiter was. If
-capturing parentheses are used in the RE, then their values are also
-returned as part of the list. Compare the following calls:
-
-\begin{verbatim}
->>> p = re.compile(r'\W+')
->>> p2 = re.compile(r'(\W+)')
->>> p.split('This... is a test.')
-['This', 'is', 'a', 'test', '']
->>> p2.split('This... is a test.')
-['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
-\end{verbatim}
-
-The module-level function \function{re.split()} adds the RE to be
-used as the first argument, but is otherwise the same.
-
-\begin{verbatim}
->>> re.split(r'[\W]+', 'Words, words, words.')
-['Words', 'words', 'words', '']
->>> re.split(r'([\W]+)', 'Words, words, words.')
-['Words', ', ', 'words', ', ', 'words', '.', '']
->>> re.split(r'[\W]+', 'Words, words, words.', 1)
-['Words', 'words, words.']
-\end{verbatim}
-
-\subsection{Search and Replace}
-
-Another common task is to find all the matches for a pattern, and
-replace them with a different string. The \method{sub()} method takes
-a replacement value, which can be either a string or a function, and
-the string to be processed.
-
-\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
-Returns the string obtained by replacing the leftmost non-overlapping
-occurrences of the RE in \var{string} by the replacement
-\var{replacement}. If the pattern isn't found, \var{string} is returned
-unchanged.
-
-The optional argument \var{count} is the maximum number of pattern
-occurrences to be replaced; \var{count} must be a non-negative
-integer. The default value of 0 means to replace all occurrences.
-\end{methoddesc}
-
-Here's a simple example of using the \method{sub()} method. It
-replaces colour names with the word \samp{colour}:
-
-\begin{verbatim}
->>> p = re.compile( '(blue|white|red)')
->>> p.sub( 'colour', 'blue socks and red shoes')
-'colour socks and colour shoes'
->>> p.sub( 'colour', 'blue socks and red shoes', count=1)
-'colour socks and red shoes'
-\end{verbatim}
-
-The \method{subn()} method does the same work, but returns a 2-tuple
-containing the new string value and the number of replacements
-that were performed:
-
-\begin{verbatim}
->>> p = re.compile( '(blue|white|red)')
->>> p.subn( 'colour', 'blue socks and red shoes')
-('colour socks and colour shoes', 2)
->>> p.subn( 'colour', 'no colours at all')
-('no colours at all', 0)
-\end{verbatim}
-
-Empty matches are replaced only when they're not
-adjacent to a previous match.
-
-\begin{verbatim}
->>> p = re.compile('x*')
->>> p.sub('-', 'abxd')
-'-a-b-d-'
-\end{verbatim}
-
-If \var{replacement} is a string, any backslash escapes in it are
-processed. That is, \samp{\e n} is converted to a single newline
-character, \samp{\e r} is converted to a carriage return, and so forth.
-Unknown escapes such as \samp{\e j} are left alone. Backreferences,
-such as \samp{\e 6}, are replaced with the substring matched by the
-corresponding group in the RE. This lets you incorporate
-portions of the original text in the resulting
-replacement string.
-
-This example matches the word \samp{section} followed by a string
-enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
-\samp{subsection}:
-
-\begin{verbatim}
->>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
->>> p.sub(r'subsection{\1}','section{First} section{second}')
-'subsection{First} subsection{second}'
-\end{verbatim}
-
-There's also a syntax for referring to named groups as defined by the
-\regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the
-substring matched by the group named \samp{name}, and
-\samp{\e g<\var{number}>}
-uses the corresponding group number.
-\samp{\e g<2>} is therefore equivalent to \samp{\e 2},
-but isn't ambiguous in a
-replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be
-interpreted as a reference to group 20, not a reference to group 2
-followed by the literal character \character{0}.) The following
-substitutions are all equivalent, but use all three variations of the
-replacement string.
-
-\begin{verbatim}
->>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
->>> p.sub(r'subsection{\1}','section{First}')
-'subsection{First}'
->>> p.sub(r'subsection{\g<1>}','section{First}')
-'subsection{First}'
->>> p.sub(r'subsection{\g<name>}','section{First}')
-'subsection{First}'
-\end{verbatim}
-
-\var{replacement} can also be a function, which gives you even more
-control. If \var{replacement} is a function, the function is
-called for every non-overlapping occurrence of \var{pattern}. On each
-call, the function is
-passed a \class{MatchObject} argument for the match
-and can use this information to compute the desired replacement string and return it.
-
-In the following example, the replacement function translates
-decimals into hexadecimal:
-
-\begin{verbatim}
->>> def hexrepl( match ):
-... "Return the hex string for a decimal number"
-... value = int( match.group() )
-... return hex(value)
-...
->>> p = re.compile(r'\d+')
->>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
-'Call 0xffd2 for printing, 0xc000 for user code.'
-\end{verbatim}
-
-When using the module-level \function{re.sub()} function, the pattern
-is passed as the first argument. The pattern may be a string or a
-\class{RegexObject}; if you need to specify regular expression flags,
-you must either use a \class{RegexObject} as the first parameter, or use
-embedded modifiers in the pattern, e.g. \code{sub("(?i)b+", "x", "bbbb
-BBBB")} returns \code{'x x'}.
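-
-Both options can be sketched side by side; the variable names are
-illustrative only.
-
```python
import re

text = 'bbbb BBBB'

# Option 1: an embedded modifier inside the pattern string.
with_modifier = re.sub(r'(?i)b+', 'x', text)

# Option 2: a compiled object that already carries the flag,
# passed as the first parameter of the module-level function.
p = re.compile(r'b+', re.IGNORECASE)
with_compiled = re.sub(p, 'x', text)

print(with_modifier)
print(with_compiled)
```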
-
-\section{Common Problems}
-
-Regular expressions are a powerful tool for some applications, but in
-some ways their behaviour isn't intuitive and at times they don't
-behave the way you may expect them to. This section will point out
-some of the most common pitfalls.
-
-\subsection{Use String Methods}
-
-Sometimes using the \module{re} module is a mistake. If you're
-matching a fixed string, or a single character class, and you're not
-using any \module{re} features such as the \constant{IGNORECASE} flag,
-then the full power of regular expressions may not be required.
-Strings have several methods for performing operations with fixed
-strings and they're usually much faster, because the implementation is
-a single small C loop that's been optimized for the purpose, instead
-of the large, more generalized regular expression engine.
-
-One example might be replacing a single fixed string with another
-one; for example, you might replace \samp{word}
-with \samp{deed}. \code{re.sub()} seems like the function to use for
-this, but consider the \method{replace()} method. Note that
-\function{replace()} will also replace \samp{word} inside
-words, turning \samp{swordfish} into \samp{sdeedfish}, but the
-na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing
-the substitution on parts of words, the pattern would have to be
-\regexp{\e bword\e b}, in order to require that \samp{word} have a
-word boundary on either side. This takes the job beyond
-\method{replace}'s abilities.)
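-
-A short interactive sketch of the difference, using a made-up input
-string and assuming \module{re} has already been imported:
-
-\begin{verbatim}
->>> 'swordfish and word'.replace('word', 'deed')
-'sdeedfish and deed'
->>> re.sub(r'\bword\b', 'deed', 'swordfish and word')
-'swordfish and deed'
-\end{verbatim}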
-
-Another common task is deleting every occurrence of a single character
-from a string or replacing it with another single character. You
-might do this with something like \code{re.sub('\e n', ' ', S)}, but
-\method{translate()} is capable of doing both tasks
-and will be faster than any regular expression operation can be.
-
-In short, before turning to the \module{re} module, consider whether
-your problem can be solved with a faster and simpler string method.
-
-\subsection{match() versus search()}
-
-The \function{match()} function only checks if the RE matches at
-the beginning of the string while \function{search()} will scan
-forward through the string for a match.
-It's important to keep this distinction in mind. Remember,
-\function{match()} will only report a successful match that
-starts at position 0; if the match wouldn't start at zero,
-\function{match()} will \emph{not} report it.
-
-\begin{verbatim}
->>> print re.match('super', 'superstition').span()
-(0, 5)
->>> print re.match('super', 'insuperable')
-None
-\end{verbatim}
-
-On the other hand, \function{search()} will scan forward through the
-string, reporting the first match it finds.
-
-\begin{verbatim}
->>> print re.search('super', 'superstition').span()
-(0, 5)
->>> print re.search('super', 'insuperable').span()
-(2, 7)
-\end{verbatim}
-
-Sometimes you'll be tempted to keep using \function{re.match()}, and
-just add \regexp{.*} to the front of your RE. Resist this temptation
-and use \function{re.search()} instead. The regular expression
-compiler does some analysis of REs in order to speed up the process of
-looking for a match. One such analysis figures out what the first
-character of a match must be; for example, a pattern starting with
-\regexp{Crow} must match starting with a \character{C}. The analysis
-lets the engine quickly scan through the string looking for the
-starting character, only trying the full match if a \character{C} is found.
-
-Adding \regexp{.*} defeats this optimization, requiring scanning to
-the end of the string and then backtracking to find a match for the
-rest of the RE. Use \function{re.search()} instead.
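-
-As a rough illustration (the sample sentence is invented for this
-sketch), both calls below find the word, but only \function{search()}
-reports where it actually starts:
-
-\begin{verbatim}
->>> re.match('.*Crow', 'A Crow flew by').span()
-(0, 6)
->>> re.search('Crow', 'A Crow flew by').span()
-(2, 6)
-\end{verbatim}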
-
-\subsection{Greedy versus Non-Greedy}
-
-When repeating a regular expression, as in \regexp{a*}, the resulting
-action is to consume as much of the string as possible. This
-fact often bites you when you're trying to match a pair of
-balanced delimiters, such as the angle brackets surrounding an HTML
-tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't
-work because of the greedy nature of \regexp{.*}.
-
-\begin{verbatim}
->>> s = '<html><head><title>Title</title>'
->>> len(s)
-32
->>> print re.match('<.*>', s).span()
-(0, 32)
->>> print re.match('<.*>', s).group()
-<html><head><title>Title</title>
-\end{verbatim}
-
-The RE matches the \character{<} in \samp{<html>}, and the
-\regexp{.*} consumes the rest of the string. There's still more left
-in the RE, though, and the \regexp{>} can't match at the end of
-the string, so the regular expression engine has to backtrack
-character by character until it finds a match for the \regexp{>}.
-The final match extends from the \character{<} in \samp{<html>}
-to the \character{>} in \samp{</title>}, which isn't what you want.
-
-In this case, the solution is to use the non-greedy qualifiers
-\regexp{*?}, \regexp{+?}, \regexp{??}, or
-\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
-possible. In the above example, the \character{>} is tried
-immediately after the first \character{<} matches, and when it fails,
-the engine advances a character at a time, retrying the \character{>}
-at every step. This produces just the right result:
-
-\begin{verbatim}
->>> print re.match('<.*?>', s).group()
-<html>
-\end{verbatim}
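-
-Continuing with the same string \code{s}, a quick sketch with
-\function{findall()} shows the non-greedy pattern picking out each
-tag in turn:
-
-\begin{verbatim}
->>> re.findall('<.*?>', s)
-['<html>', '<head>', '<title>', '</title>']
-\end{verbatim}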
-
-(Note that parsing HTML or XML with regular expressions is painful.
-Quick-and-dirty patterns will handle common cases, but HTML and XML
-have special cases that will break the obvious regular expression; by
-the time you've written a regular expression that handles all of the
-possible cases, the patterns will be \emph{very} complicated. Use an
-HTML or XML parser module for such tasks.)
-
-\subsection{Not Using re.VERBOSE}
-
-By now you've probably noticed that regular expressions are a very
-compact notation, but they're not terribly readable. REs of
-moderate complexity can become lengthy collections of backslashes,
-parentheses, and metacharacters, making them difficult to read and
-understand.
-
-For such REs, specifying the \code{re.VERBOSE} flag when
-compiling the regular expression can be helpful, because it allows
-you to format the regular expression more clearly.
-
-The \code{re.VERBOSE} flag has several effects. Whitespace in the
-regular expression that \emph{isn't} inside a character class is
-ignored. This means that an expression such as \regexp{dog | cat} is
-equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
-will still match the characters \character{a}, \character{b}, or a
-space. In addition, you can also put comments inside a RE; comments
-extend from a \samp{\#} character to the next newline. When used with
-triple-quoted strings, this enables REs to be formatted more neatly:
-
-\begin{verbatim}
-pat = re.compile(r"""
- \s*                 # Skip leading whitespace
- (?P<header>[^:]+)   # Header name
- \s* :               # Whitespace, and a colon
- (?P<value>.*?)      # The header's value -- *? used to
-                     # lose the following trailing whitespace
- \s*$                # Trailing whitespace to end-of-line
-""", re.VERBOSE)
-\end{verbatim}
-% $
-
-This is far more readable than:
-
-\begin{verbatim}
-pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
-\end{verbatim}
-% $
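-
-Either form matches the same strings. As a small sketch of how it
-behaves (the header line is a made-up example), note that the value
-keeps its leading space, because nothing in the pattern skips
-whitespace after the colon:
-
-\begin{verbatim}
->>> m = pat.match('From: author@example.com  ')
->>> m.group('header'), m.group('value')
-('From', ' author@example.com')
-\end{verbatim}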
-
-\section{Feedback}
-
-Regular expressions are a complicated topic. Did this document help
-you understand them? Were there parts that were unclear, or problems
-you encountered that weren't covered here? If so, please send
-suggestions for improvements to the author.
-
-The most complete book on regular expressions is almost certainly
-Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
-by O'Reilly. Unfortunately, it exclusively concentrates on Perl and
-Java's flavours of regular expressions, and doesn't contain any Python
-material at all, so it won't be useful as a reference for programming
-in Python. (The first edition covered Python's now-removed
-\module{regex} module, which won't help you much.) Consider checking
-it out from your library.
-
-\end{document}
-
diff --git a/Doc/howto/sockets.tex b/Doc/howto/sockets.tex
deleted file mode 100644
index 0cecbb9..0000000
--- a/Doc/howto/sockets.tex
+++ /dev/null
@@ -1,465 +0,0 @@
-\documentclass{howto}
-
-\title{Socket Programming HOWTO}
-
-\release{0.00}
-
-\author{Gordon McMillan}
-\authoraddress{\email{gmcm@hypernet.com}}
-
-\begin{document}
-\maketitle
-
-\begin{abstract}
-\noindent
-Sockets are used nearly everywhere, but are one of the most severely
-misunderstood technologies around. This is a 10,000 foot overview of
-sockets. It's not really a tutorial - you'll still have work to do in
-getting things operational. It doesn't cover the fine points (and there
-are a lot of them), but I hope it will give you enough background to
-begin using them decently.
-
-This document is available from the Python HOWTO page at
-\url{http://www.python.org/doc/howto}.
-
-\end{abstract}
-
-\tableofcontents
-
-\section{Sockets}
-
-Sockets are used nearly everywhere, but are one of the most severely
-misunderstood technologies around. This is a 10,000 foot overview of
-sockets. It's not really a tutorial - you'll still have work to do in
-getting things working. It doesn't cover the fine points (and there
-are a lot of them), but I hope it will give you enough background to
-begin using them decently.
-
-I'm only going to talk about INET sockets, but they account for at
-least 99\% of the sockets in use. And I'll only talk about STREAM
-sockets - unless you really know what you're doing (in which case this
-HOWTO isn't for you!), you'll get better behavior and performance from
-a STREAM socket than anything else. I will try to clear up the mystery
-of what a socket is, as well as some hints on how to work with
-blocking and non-blocking sockets. But I'll start by talking about
-blocking sockets. You'll need to know how they work before dealing
-with non-blocking sockets.
-
-Part of the trouble with understanding these things is that "socket"
-can mean a number of subtly different things, depending on context. So
-first, let's make a distinction between a "client" socket - an
-endpoint of a conversation, and a "server" socket, which is more like
-a switchboard operator. The client application (your browser, for
-example) uses "client" sockets exclusively; the web server it's
-talking to uses both "server" sockets and "client" sockets.
-
-
-\subsection{History}
-
-Of the various forms of IPC (\emph{Inter Process Communication}),
-sockets are by far the most popular. On any given platform, there are
-likely to be other forms of IPC that are faster, but for
-cross-platform communication, sockets are about the only game in town.
-
-They were invented in Berkeley as part of the BSD flavor of Unix. They
-spread like wildfire with the Internet. With good reason --- the
-combination of sockets with INET makes talking to arbitrary machines
-around the world unbelievably easy (at least compared to other
-schemes).
-
-\section{Creating a Socket}
-
-Roughly speaking, when you clicked on the link that brought you to
-this page, your browser did something like the following:
-
-\begin{verbatim}
-    #create an INET, STREAMing socket
-    s = socket.socket(
-        socket.AF_INET, socket.SOCK_STREAM)
-    #now connect to the web server on port 80
-    # - the normal http port
-    s.connect(("www.mcmillan-inc.com", 80))
-\end{verbatim}
-
-When the \code{connect} completes, the socket \code{s} can
-now be used to send in a request for the text of this page. The same
-socket will read the reply, and then be destroyed. That's right -
-destroyed. Client sockets are normally only used for one exchange (or
-a small set of sequential exchanges).
-
-What happens in the web server is a bit more complex. First, the web
-server creates a "server socket".
-
-\begin{verbatim}
-    #create an INET, STREAMing socket
-    serversocket = socket.socket(
-        socket.AF_INET, socket.SOCK_STREAM)
-    #bind the socket to a public host,
-    # and a well-known port
-    serversocket.bind((socket.gethostname(), 80))
-    #become a server socket
-    serversocket.listen(5)
-\end{verbatim}
-
-A couple of things to notice: we used \code{socket.gethostname()}
-so that the socket would be visible to the outside world. If we had
-used \code{s.bind(('localhost', 80))} or \code{s.bind(('127.0.0.1',
-80))} we would still have a "server" socket, but one that was only
-visible within the same machine. \code{s.bind(('', 80))} means the
-socket is reachable on any address the machine happens to have.
-
-A second thing to note: low number ports are usually reserved for
-"well known" services (HTTP, SNMP etc). If you're playing around, use
-a nice high number (4 digits).
-
-Finally, the argument to \code{listen} tells the socket library that
-we want it to queue up as many as 5 connect requests (the normal max)
-before refusing outside connections. If the rest of the code is
-written properly, that should be plenty.
-
-OK, now we have a "server" socket, listening on port 80. Now we enter
-the mainloop of the web server:
-
-\begin{verbatim}
-    while 1:
-        #accept connections from outside
-        (clientsocket, address) = serversocket.accept()
-        #now do something with the clientsocket
-        #in this case, we'll pretend this is a threaded server
-        ct = client_thread(clientsocket)
-        ct.run()
-\end{verbatim}
-
-There are actually 3 general ways in which this loop could work -
-dispatching a thread to handle \code{clientsocket}, creating a new
-process to handle \code{clientsocket}, or restructuring this app
-to use non-blocking sockets and multiplex between our "server" socket
-and any active \code{clientsocket}s using
-\code{select}. More about that later. The important thing to
-understand now is this: this is \emph{all} a "server" socket
-does. It doesn't send any data. It doesn't receive any data. It just
-produces "client" sockets. Each \code{clientsocket} is created
-in response to some \emph{other} "client" socket doing a
-\code{connect()} to the host and port we're bound to. As soon as
-we've created that \code{clientsocket}, we go back to listening
-for more connections. The two "clients" are free to chat it up - they
-are using some dynamically allocated port which will be recycled when
-the conversation ends.
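-
-A minimal sketch of the first option, dispatching a thread per
-connection, might look like the following. The names
-\code{handle_client} and \code{serve} are invented for this example,
-and the handler simply echoes one chunk back; a real server would
-speak whatever protocol you have designed.
-
-\begin{verbatim}
-    import threading
-
-    def handle_client(clientsocket):
-        # hypothetical handler: echo one chunk, then hang up
-        data = clientsocket.recv(4096)
-        if data:
-            clientsocket.sendall(data)
-        clientsocket.close()
-
-    def serve(serversocket):
-        while 1:
-            (clientsocket, address) = serversocket.accept()
-            # start() returns at once, so we go straight back to accept()
-            t = threading.Thread(target=handle_client,
-                                 args=(clientsocket,))
-            t.start()
-\end{verbatim}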
-
-\subsection{IPC} If you need fast IPC between two processes
-on one machine, you should look into whatever form of shared memory
-the platform offers. A simple protocol based around shared memory and
-locks or semaphores is by far the fastest technique.
-
-If you do decide to use sockets, bind the "server" socket to
-\code{'localhost'}. On most platforms, this will take a shortcut
-around a couple of layers of network code and be quite a bit faster.
-
-
-\section{Using a Socket}
-
-The first thing to note is that the web browser's "client" socket and
-the web server's "client" socket are identical beasts. That is, this
-is a "peer to peer" conversation. Or to put it another way, \emph{as the
-designer, you will have to decide what the rules of etiquette are for
-a conversation}. Normally, the \code{connect}ing socket
-starts the conversation, by sending in a request, or perhaps a
-signon. But that's a design decision - it's not a rule of sockets.
-
-Now there are two sets of verbs to use for communication. You can use
-\code{send} and \code{recv}, or you can transform your
-client socket into a file-like beast and use \code{read} and
-\code{write}. The latter is the way Java presents its
-sockets. I'm not going to talk about it here, except to warn you that
-you need to use \code{flush} on sockets. These are buffered
-"files", and a common mistake is to \code{write} something, and
-then \code{read} for a reply. Without a \code{flush} in
-there, you may wait forever for the reply, because the request may
-still be in your output buffer.
-
-Now we come to the major stumbling block of sockets - \code{send}
-and \code{recv} operate on the network buffers. They do not
-necessarily handle all the bytes you hand them (or expect from them),
-because their major focus is handling the network buffers. In general,
-they return when the associated network buffers have been filled
-(\code{send}) or emptied (\code{recv}). They then tell you
-how many bytes they handled. It is \emph{your} responsibility to call
-them again until your message has been completely dealt with.
-
-When a \code{recv} returns 0 bytes, it means the other side has
-closed (or is in the process of closing) the connection. You will not
-receive any more data on this connection. Ever. You may be able to
-send data successfully; I'll talk more about this later.
-
-A protocol like HTTP uses a socket for only one transfer. The client
-sends a request, then reads a reply. That's it. The socket is
-discarded. This means that a client can detect the end of the reply by
-receiving 0 bytes.
-
-But if you plan to reuse your socket for further transfers, you need
-to realize that \emph{there is no "EOT" (End of Transfer) on a
-socket.} I repeat: if a socket \code{send} or
-\code{recv} returns after handling 0 bytes, the connection has
-been broken. If the connection has \emph{not} been broken, you may
-wait on a \code{recv} forever, because the socket will
-\emph{not} tell you that there's nothing more to read (for now). Now
-if you think about that a bit, you'll come to realize a fundamental
-truth of sockets: \emph{messages must either be fixed length} (yuck),
-\emph{or be delimited} (shrug), \emph{or indicate how long they are}
-(much better), \emph{or end by shutting down the connection}. The
-choice is entirely yours (but some ways are righter than others).
-
-Assuming you don't want to end the connection, the simplest solution
-is a fixed length message:
-
-\begin{verbatim}
-class mysocket:
-    '''demonstration class only
-      - coded for clarity, not efficiency
-    '''
-
-    def __init__(self, sock=None):
-        if sock is None:
-            self.sock = socket.socket(
-                socket.AF_INET, socket.SOCK_STREAM)
-        else:
-            self.sock = sock
-
-    def connect(self, host, port):
-        self.sock.connect((host, port))
-
-    def mysend(self, msg):
-        totalsent = 0
-        while totalsent < MSGLEN:
-            sent = self.sock.send(msg[totalsent:])
-            if sent == 0:
-                raise RuntimeError(
-                    "socket connection broken")
-            totalsent = totalsent + sent
-
-    def myreceive(self):
-        msg = ''
-        while len(msg) < MSGLEN:
-            chunk = self.sock.recv(MSGLEN-len(msg))
-            if chunk == '':
-                raise RuntimeError(
-                    "socket connection broken")
-            msg = msg + chunk
-        return msg
-\end{verbatim}
-
-The sending code here is usable for almost any messaging scheme - in
-Python you send strings, and you can use \code{len()} to
-determine a string's length (even if it has embedded \code{\e 0}
-characters). It's mostly the receiving code that gets more
-complex. (And in C, it's not much worse, except you can't use
-\code{strlen} if the message has embedded \code{\e 0}s.)
-
-The easiest enhancement is to make the first character of the message
-an indicator of message type, and have the type determine the
-length. Now you have two \code{recv}s - the first to get (at
-least) that first character so you can look up the length, and the
-second in a loop to get the rest. If you decide to go the delimited
-route, you'll be receiving in some arbitrary chunk size (4096 or 8192
-is frequently a good match for network buffer sizes), and scanning
-what you've received for a delimiter.
-
-One complication to be aware of: if your conversational protocol
-allows multiple messages to be sent back to back (without some kind of
-reply), and you pass \code{recv} an arbitrary chunk size, you
-may end up reading the start of a following message. You'll need to
-put that aside and hold onto it, until it's needed.
-
-Prefixing the message with its length (say, as 5 numeric characters)
-gets more complex, because (believe it or not), you may not get all 5
-characters in one \code{recv}. In playing around, you'll get
-away with it; but in high network loads, your code will very quickly
-break unless you use two \code{recv} loops - the first to
-determine the length, the second to get the data part of the
-message. Nasty. This is also when you'll discover that
-\code{send} does not always manage to get rid of everything in
-one pass. And despite having read this, you will eventually get bit by
-it!
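-
-As a starting point only, here is one way the two receive loops might
-be sketched, assuming the 5-character decimal length prefix described
-above. The helper names \code{recv_exactly} and \code{recv_message}
-are made up for this example.
-
-\begin{verbatim}
-    def recv_exactly(sock, n):
-        # keep calling recv until exactly n bytes have arrived
-        data = sock.recv(n)
-        while len(data) < n:
-            chunk = sock.recv(n - len(data))
-            if not chunk:
-                raise RuntimeError("socket connection broken")
-            data = data + chunk
-        return data
-
-    def recv_message(sock):
-        # first loop: the 5-character length; second loop: the body
-        length = int(recv_exactly(sock, 5))
-        return recv_exactly(sock, length)
-\end{verbatim}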
-
-In the interests of space, building your character (and preserving my
-competitive position), these enhancements are left as an exercise for
-the reader. Let's move on to cleaning up.
-
-\subsection{Binary Data}
-
-It is perfectly possible to send binary data over a socket. The major
-problem is that not all machines use the same formats for binary
-data. For example, a Motorola chip will represent a 16 bit integer
-with the value 1 as the two hex bytes 00 01. Intel and DEC, however,
-are byte-reversed - that same 1 is 01 00. Socket libraries have calls
-for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs,
-htons} where "n" means \emph{network} and "h" means \emph{host},
-"s" means \emph{short} and "l" means \emph{long}. Where network order
-is host order, these do nothing, but where the machine is
-byte-reversed, these swap the bytes around appropriately.
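-
-A brief sketch of these conversions from Python; the \module{struct}
-module's \code{'!'} prefix (network byte order) is an alternative to
-calling the \code{hton}/\code{ntoh} functions by hand:
-
-\begin{verbatim}
->>> import socket, struct
->>> socket.ntohs(socket.htons(1))
-1
->>> struct.unpack('!H', struct.pack('!H', 1))
-(1,)
->>> len(struct.pack('!HL', 1, 1))
-6
-\end{verbatim}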
-
-In these days of 32 bit machines, the ASCII representation of binary
-data is frequently smaller than the binary representation. That's
-because a surprising amount of the time, all those longs have the
-value 0, or maybe 1. The string "0" would be two bytes, while binary
-is four. Of course, this doesn't fit well with fixed-length
-messages. Decisions, decisions.
-
-\section{Disconnecting}
-
-Strictly speaking, you're supposed to use \code{shutdown} on a
-socket before you \code{close} it. The \code{shutdown} is
-an advisory to the socket at the other end. Depending on the argument
-you pass it, it can mean "I'm not going to send anymore, but I'll
-still listen", or "I'm not listening, good riddance!". Most socket
-libraries, however, are so used to programmers neglecting to use this
-piece of etiquette that normally a \code{close} is the same as
-\code{shutdown(); close()}. So in most situations, an explicit
-\code{shutdown} is not needed.
-
-One way to use \code{shutdown} effectively is in an HTTP-like
-exchange. The client sends a request and then does a
-\code{shutdown(1)}. This tells the server "This client is done
-sending, but can still receive." The server can detect "EOF" by a
-receive of 0 bytes. It can assume it has the complete request. The
-server sends a reply. If the \code{send} completes successfully
-then, indeed, the client was still receiving.
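-
-The client side of such an exchange might be sketched as follows. The
-function name \code{request_reply} is invented here, and
-\code{socket.SHUT_WR} is the named constant for the half-close that
-\code{shutdown(1)} performs.
-
-\begin{verbatim}
-    import socket
-
-    def request_reply(sock, request):
-        sock.sendall(request)
-        # done sending, but still able to receive
-        sock.shutdown(socket.SHUT_WR)
-        chunks = []
-        while 1:
-            chunk = sock.recv(4096)
-            if not chunk:
-                break       # 0 bytes: the server has closed
-            chunks.append(chunk)
-        sock.close()
-        return chunks
-\end{verbatim}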
-
-Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done.
-
-
-\subsection{When Sockets Die}
-
-Probably the worst thing about using blocking sockets is what happens
-when the other side comes down hard (without doing a
-\code{close}). Your socket is likely to hang. \code{SOCK_STREAM} is a
-reliable protocol, and it will wait a long, long time before giving up
-on a connection. If you're using threads, the entire thread is
-essentially dead. There's not much you can do about it. As long as you
-aren't doing something dumb, like holding a lock while doing a
-blocking read, the thread isn't really consuming much in the way of
-resources. Do \emph{not} try to kill the thread - part of the reason
-that threads are more efficient than processes is that they avoid the
-overhead associated with the automatic recycling of resources. In
-other words, if you do manage to kill the thread, your whole process
-is likely to be screwed up.
-
-\section{Non-blocking Sockets}
-
-If you've understood the preceding, you already know most of what you
-need to know about the mechanics of using sockets. You'll still use
-the same calls, in much the same ways. It's just that, if you do it
-right, your app will be almost inside-out.
-
-In Python, you use \code{socket.setblocking(0)} to make it
-non-blocking. In C, it's more complex (for one thing, you'll need to
-choose between the POSIX flavor \code{O_NONBLOCK} and the almost
-indistinguishable older flavor \code{O_NDELAY}, which is
-completely different from \code{TCP_NODELAY}), but it's the
-exact same idea. You do this after creating the socket, but before
-using it. (Actually, if you're nuts, you can switch back and forth.)
-
-The major mechanical difference is that \code{send},
-\code{recv}, \code{connect} and \code{accept} can
-return without having done anything. You have (of course) a number of
-choices. You can check return codes and error codes and generally drive
-yourself crazy. If you don't believe me, try it sometime. Your app
-will grow large, buggy and suck CPU. So let's skip the brain-dead
-solutions and do it right.
-
-Use \code{select}.
-
-In C, coding \code{select} is fairly complex. In Python, it's a
-piece of cake, but it's close enough to the C version that if you
-understand \code{select} in Python, you'll have little trouble
-with it in C.
-
-\begin{verbatim}
-    ready_to_read, ready_to_write, in_error = select.select(
-        potential_readers,
-        potential_writers,
-        potential_errs,
-        timeout)
-\end{verbatim}
-
-You pass \code{select} three lists: the first contains all
-sockets that you might want to try reading; the second all the sockets
-you might want to try writing to, and the last (normally left empty)
-those that you want to check for errors. You should note that a
-socket can go into more than one list. The \code{select} call is
-blocking, but you can give it a timeout. This is generally a sensible
-thing to do - give it a nice long timeout (say a minute) unless you
-have good reason to do otherwise.
-
-In return, you will get three lists. They have the sockets that are
-actually readable, writable and in error. Each of these lists is a
-subset (possibly empty) of the corresponding list you passed in. A
-socket that you put in more than one input list can come back in more
-than one output list, for example when it is both readable and writable.
-
-If a socket is in the output readable list, you can be
-as-close-to-certain-as-we-ever-get-in-this-business that a
-\code{recv} on that socket will return \emph{something}. Same
-idea for the writable list. You'll be able to send
-\emph{something}. Maybe not all you want to, but \emph{something} is
-better than nothing. (Actually, any reasonably healthy socket will
-return as writable - it just means outbound network buffer space is
-available.)
-
-If you have a "server" socket, put it in the \code{potential_readers}
-list. If it comes out in the readable list, your \code{accept}
-will (almost certainly) work. If you have created a new socket to
-\code{connect} to someone else, put it in the \code{potential_writers}
-list. If it shows up in the writable list, you have a decent chance
-that it has connected.
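-
-Putting these pieces together, one pass of a \code{select}-driven
-server could be sketched like this. The function \code{poll_once} and
-the \code{handle} callback are hypothetical names for this example, and
-\code{clients} is assumed to be a plain list of connected sockets.
-
-\begin{verbatim}
-    import select
-
-    def poll_once(serversocket, clients, handle, timeout=60):
-        readable, writable, in_error = select.select(
-            [serversocket] + clients, [], [], timeout)
-        for sock in readable:
-            if sock is serversocket:
-                # a readable "server" socket means accept() should work
-                (clientsocket, address) = serversocket.accept()
-                clientsocket.setblocking(0)
-                clients.append(clientsocket)
-            else:
-                data = sock.recv(4096)
-                if data:
-                    handle(sock, data)
-                else:
-                    # 0 bytes: the other side has closed
-                    clients.remove(sock)
-                    sock.close()
-\end{verbatim}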
-
-One very nasty problem with \code{select}: if somewhere in those
-input lists of sockets is one which has died a nasty death, the
-\code{select} will fail. You then need to loop through every
-single damn socket in all those lists and do a
-\code{select([sock],[],[],0)} until you find the bad one. That
-timeout of 0 means it won't take long, but it's ugly.
-
-Actually, \code{select} can be handy even with blocking sockets.
-It's one way of determining whether you will block - the socket
-returns as readable when there's something in the buffers. However,
-this still doesn't help with the problem of determining whether the
-other end is done, or just busy with something else.
-
-\textbf{Portability alert}: On Unix, \code{select} works both with
-sockets and files. Don't try this on Windows. On Windows,
-\code{select} works with sockets only. Also note that in C, many
-of the more advanced socket options are done differently on
-Windows. In fact, on Windows I usually use threads (which work very,
-very well) with my sockets. Face it, if you want any kind of
-performance, your code will look very different on Windows than on
-Unix. (I haven't the foggiest how you do this stuff on a Mac.)
-
-\subsection{Performance}
-
-There's no question that the fastest sockets code uses non-blocking
-sockets and select to multiplex them. You can put together something
-that will saturate a LAN connection without putting any strain on the
-CPU. The trouble is that an app written this way can't do much of
-anything else - it needs to be ready to shuffle bytes around at all
-times.
-
-Assuming that your app is actually supposed to do something more than
-that, threading is the optimal solution (and using non-blocking
-sockets will be faster than using blocking sockets). Unfortunately,
-threading support in Unixes varies both in API and quality. So the
-normal Unix solution is to fork a subprocess to deal with each
-connection. The overhead for this is significant (and don't do this on
-Windows - the overhead of process creation is enormous there). It also
-means that unless each subprocess is completely independent, you'll
-need to use another form of IPC, say a pipe, or shared memory and
-semaphores, to communicate between the parent and child processes.
-
-Finally, remember that even though blocking sockets are somewhat
-slower than non-blocking, in many cases they are the "right"
-solution. After all, if your app is driven by the data it receives
-over a socket, there's not much sense in complicating the logic just
-so your app can wait on \code{select} instead of
-\code{recv}.
-
-\end{document}
diff --git a/Doc/howto/unicode.rst b/Doc/howto/unicode.rst
deleted file mode 100644
index dffe2cb..0000000
--- a/Doc/howto/unicode.rst
+++ /dev/null
@@ -1,766 +0,0 @@
-Unicode HOWTO
-================
-
-**Version 1.02**
-
-This HOWTO discusses Python's support for Unicode, and explains various
-problems that people commonly encounter when trying to work with Unicode.
-
-Introduction to Unicode
-------------------------------
-
-History of Character Codes
-''''''''''''''''''''''''''''''
-
-In 1968, the American Standard Code for Information Interchange,
-better known by its acronym ASCII, was standardized. ASCII defined
-numeric codes for various characters, with the numeric values running from 0 to
-127. For example, the lowercase letter 'a' is assigned 97 as its code
-value.
-
-ASCII was an American-developed standard, so it only defined
-unaccented characters. There was an 'e', but no 'é' or 'Í'. This
-meant that languages which required accented characters couldn't be
-faithfully represented in ASCII. (Actually the missing accents matter
-for English, too, which contains words such as 'naïve' and 'café', and some
-publications have house styles which require spellings such as
-'coöperate'.)
-
-For a while people just wrote programs that didn't display accents. I
-remember looking at Apple ][ BASIC programs, published in French-language
-publications in the mid-1980s, that had lines like these::
-
-   PRINT "FICHIER EST COMPLETE."
-   PRINT "CARACTERE NON ACCEPTE."
-
-Those messages should contain accents, and they just look wrong to
-someone who can read French.
-
-In the 1980s, almost all personal computers were 8-bit, meaning that
-bytes could hold values ranging from 0 to 255. ASCII codes only went
-up to 127, so some machines assigned values between 128 and 255 to
-accented characters. Different machines had different codes, however,
-which led to problems exchanging files. Eventually various commonly
-used sets of values for the 128-255 range emerged. Some were true
-standards, defined by the International Organization for Standardization, and
-some were **de facto** conventions that were invented by one company
-or another and managed to catch on.
-
-255 characters aren't very many. For example, you can't fit
-both the accented characters used in Western Europe and the Cyrillic
-alphabet used for Russian into the 128-255 range because there are more than
-128 such characters.
-
-You could write files using different codes (all your Russian
-files in a coding system called KOI8, all your French files in
-a different coding system called Latin1), but what if you wanted
-to write a French document that quotes some Russian text? In the
-1980s people began to want to solve this problem, and the Unicode
-standardization effort began.
-
-Unicode started out using 16-bit characters instead of 8-bit characters. 16
-bits means you have 2^16 = 65,536 distinct values available, making it
-possible to represent many different characters from many different
-alphabets; an initial goal was to have Unicode contain the alphabets for
-every single human language. It turns out that even 16 bits isn't enough to
-meet that goal, and the modern Unicode specification uses a wider range of
-codes, 0-1,114,111 (0x10ffff in base-16).
-
-There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
-originally separate efforts, but the specifications were merged with
-the 1.1 revision of Unicode.
-
-(This discussion of Unicode's history is highly simplified. I don't
-think the average Python programmer needs to worry about the
-historical details; consult the Unicode consortium site listed in the
-References for more information.)
-
-
-Definitions
-''''''''''''''''''''''''
-
-A **character** is the smallest possible component of a text. 'A',
-'B', 'C', etc., are all different characters. So are 'È' and
-'Í'. Characters are abstractions, and vary depending on the
-language or context you're talking about. For example, the symbol for
-ohms (Ω) is usually drawn much like the capital letter
-omega (Ω) in the Greek alphabet (they may even be the same in
-some fonts), but these are two different characters that have
-different meanings.
-
-The Unicode standard describes how characters are represented by
-**code points**. A code point is an integer value, usually denoted in
-base 16. In the standard, a code point is written using the notation
-U+12ca to mean the character with value 0x12ca (4810 decimal). The
-Unicode standard contains a lot of tables listing characters and their
-corresponding code points::
-
- 0061 'a'; LATIN SMALL LETTER A
- 0062 'b'; LATIN SMALL LETTER B
- 0063 'c'; LATIN SMALL LETTER C
- ...
- 007B '{'; LEFT CURLY BRACKET
-
-Strictly, these definitions imply that it's meaningless to say 'this is
-character U+12ca'. U+12ca is a code point, which represents some particular
-character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
-In informal contexts, this distinction between code points and characters will
-sometimes be forgotten.
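As a small illustration of the distinction between a code point and the character it represents, the sketch below looks up U+12ca with the standard ``unicodedata`` module. It is written so that it should run on Python 3 as well as Python 2.6 or later, which is an assumption on my part (the surrounding text targets Python 2.4); the variable names are only illustrative.

```python
import unicodedata

ch = u'\u12ca'                  # the character stored at code point U+12ca
code_point = ord(ch)            # integer value of the code point
name = unicodedata.name(ch)     # official character name from the Unicode database

print('U+%04x is %d in decimal and names %s' % (code_point, code_point, name))
```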
-
-A character is represented on a screen or on paper by a set of graphical
-elements that's called a **glyph**. The glyph for an uppercase A, for
-example, is two diagonal strokes and a horizontal stroke, though the exact
-details will depend on the font being used. Most Python code doesn't need
-to worry about glyphs; figuring out the correct glyph to display is
-generally the job of a GUI toolkit or a terminal's font renderer.
-
-
-Encodings
-'''''''''
-
-To summarize the previous section:
-a Unicode string is a sequence of code points, which are
-numbers from 0 to 0x10ffff. This sequence needs to be represented as
-a sequence of bytes (meaning, values from 0-255) in memory. The rules for
-translating a Unicode string into a sequence of bytes are called an
-**encoding**.
-
-The first encoding you might think of is an array of 32-bit integers.
-In this representation, the string "Python" would look like this::
-
- P y t h o n
- 0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
- 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
-
-This representation is straightforward but using
-it presents a number of problems.
-
-1. It's not portable; different processors order the bytes
- differently.
-
-2. It's very wasteful of space. In most texts, the majority of the code
- points are less than 128, or less than 256, so a lot of space is occupied
- by zero bytes. The above string takes 24 bytes compared to the 6
- bytes needed for an ASCII representation. Increased RAM usage doesn't
- matter too much (desktop computers have megabytes of RAM, and strings
- aren't usually that large), but expanding our usage of disk and
- network bandwidth by a factor of 4 is intolerable.
-
-3. It's not compatible with existing C functions such as ``strlen()``,
- so a new family of wide string functions would need to be used.
-
-4. Many Internet standards are defined in terms of textual data, and
- can't handle content with embedded zero bytes.
-
-Generally people don't use this encoding, instead choosing other encodings
-that are more efficient and convenient.
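The first two problems in the list above can be seen directly with Python's UTF-32 codecs. This is only a sketch: it assumes an interpreter that ships the ``utf-32-le`` and ``utf-32-be`` codecs (Python 2.6 or later, or Python 3), and the variable names are made up for the example.

```python
text = u'Python'

utf32_le = text.encode('utf-32-le')   # 32-bit units, least significant byte first
utf32_be = text.encode('utf-32-be')   # same units, most significant byte first
ascii_bytes = text.encode('ascii')

# Count how many of the encoded bytes are zero padding.
zero_bytes = sum(1 for b in bytearray(utf32_le) if b == 0)

print('UTF-32: %d bytes, ASCII: %d bytes, zero bytes: %d'
      % (len(utf32_le), len(ascii_bytes), zero_bytes))
print('Byte orders agree: %s' % (utf32_le == utf32_be))
```

The two byte orders produce different byte strings for the same text, which is the portability problem, and three quarters of the UTF-32 bytes are zeros, which is the space problem.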
-
-Encodings don't have to handle every possible Unicode character, and
-most encodings don't. For example, Python's default encoding is the
-'ascii' encoding. The rules for converting a Unicode string into the
-ASCII encoding are simple; for each code point:
-
-1. If the code point is <128, each byte is the same as the value of the
- code point.
-
-2. If the code point is 128 or greater, the Unicode string can't
- be represented in this encoding. (Python raises a
- ``UnicodeEncodeError`` exception in this case.)
-
-Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode
-code points 0-255 are identical to the Latin-1 values, so converting
-to this encoding simply requires converting code points to byte
-values; if a code point larger than 255 is encountered, the string
-can't be encoded into Latin-1.
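The ASCII and Latin-1 rules can be checked with a short sketch. It uses only the ``encode()`` method and is written to run on Python 3 as well as Python 2.6 or later (an assumption; the text itself describes Python 2.4). The sample string is arbitrary.

```python
text = u'caf\xe9'                     # U+00E9 is above 127 but below 256

# Latin-1: each code point becomes the byte with the same value.
latin1_bytes = text.encode('latin-1')
byte_values = list(bytearray(latin1_bytes))
code_points = [ord(ch) for ch in text]

# ASCII: any code point of 128 or greater cannot be encoded.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(byte_values == code_points, ascii_ok)
```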
-
-Encodings don't have to be simple one-to-one mappings like Latin-1.
-Consider IBM's EBCDIC, which was used on IBM mainframes. Letter
-values weren't in one block: 'a' through 'i' had values from 129 to
-137, but 'j' through 'r' were 145 through 153. If you wanted to use
-EBCDIC as an encoding, you'd probably use some sort of lookup table to
-perform the conversion, but this is largely an internal detail.
-
-UTF-8 is one of the most commonly used encodings. UTF stands for
-"Unicode Transformation Format", and the '8' means that 8-bit numbers
-are used in the encoding. (There's also a UTF-16 encoding, but it's
-less frequently used than UTF-8.) UTF-8 uses the following rules:
-
-1. If the code point is <128, it's represented by the corresponding byte value.
-2. If the code point is between 128 and 0x7ff, it's turned into two byte values
- between 128 and 255.
-3. Code points >0x7ff are turned into three- or four-byte sequences, where
- each byte of the sequence is between 128 and 255.
-
-UTF-8 has several convenient properties:
-
-1. It can handle any Unicode code point.
-2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes.
-3. A string of ASCII text is also valid UTF-8 text.
-4. UTF-8 is fairly compact; code points below 0x800, which cover most alphabetic scripts, are turned into at most two bytes, and values less than 128 occupy only a single byte.
-5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
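The encoding rules and a few of the properties above can be sketched as follows. The four sample characters are my own choice, one from each length class, and the code is written to run on Python 3 as well as Python 2.6 or later (an assumption, since the text targets Python 2.4).

```python
# One sample per UTF-8 length class: U+0041, U+00E9, U+20AC, U+10348.
samples = [u'A', u'\xe9', u'\u20ac', u'\U00010348']
lengths = [len(ch.encode('utf-8')) for ch in samples]

# None of these characters produces an embedded zero byte.
encoded = u''.join(samples).encode('utf-8')
has_zero_byte = 0 in bytearray(encoded)

# Pure ASCII text encodes to the same bytes under ASCII and UTF-8.
ascii_same = u'Python'.encode('ascii') == u'Python'.encode('utf-8')

print(lengths, has_zero_byte, ascii_same)
```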
-
-
-
-References
-''''''''''''''
-
-The Unicode Consortium site at <http://www.unicode.org> has character
-charts, a glossary, and PDF versions of the Unicode specification. Be
-prepared for some difficult reading.
-<http://www.unicode.org/history/> is a chronology of the origin and
-development of Unicode.
-
-To help understand the standard, Jukka Korpela has written an
-introductory guide to reading the Unicode character tables,
-available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
-
-Roman Czyborra wrote another explanation of Unicode's basic principles;
-it's at <http://czyborra.com/unicode/characters.html>.
-Czyborra has written a number of other Unicode-related documents,
-available from <http://www.czyborra.com>.
-
-Two other good introductory articles were written by Joel Spolsky
-<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
-Orendorff <http://www.jorendorff.com/articles/unicode/>. If this
-introduction didn't make things clear to you, you should try reading
-one of these alternate articles before continuing.
-
-Wikipedia entries are often helpful; see the entries for "character
-encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
-<http://en.wikipedia.org/wiki/UTF-8>, for example.
-
-
-Python's Unicode Support
-------------------------
-
-Now that you've learned the rudiments of Unicode, we can look at
-Python's Unicode features.
-
-
-The Unicode Type
-'''''''''''''''''''
-
-Unicode strings are expressed as instances of the ``unicode`` type,
-one of Python's repertoire of built-in types. It derives from an
-abstract type called ``basestring``, which is also an ancestor of the
-``str`` type; you can therefore check if a value is a string type with
-``isinstance(value, basestring)``. Under the hood, Python represents
-Unicode strings as either 16- or 32-bit integers, depending on how the
-Python interpreter was compiled.
-
-The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
-All of its arguments should be 8-bit strings. The first argument is converted
-to Unicode using the specified encoding; if you leave off the ``encoding`` argument,
-the ASCII encoding is used for the conversion, so characters greater than 127 will
-be treated as errors::
-
- >>> unicode('abcdef')
- u'abcdef'
- >>> s = unicode('abcdef')
- >>> type(s)
- <type 'unicode'>
- >>> unicode('abcdef' + chr(255))
- Traceback (most recent call last):
- File "<stdin>", line 1, in ?
- UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
- ordinal not in range(128)
-
-The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument
-are 'strict' (raise a ``UnicodeDecodeError`` exception),
-'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
-or 'ignore' (just leave the character out of the Unicode result).
-The following examples show the differences::
-
- >>> unicode('\x80abc', errors='strict')
- Traceback (most recent call last):
- File "<stdin>", line 1, in ?
- UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
- ordinal not in range(128)
- >>> unicode('\x80abc', errors='replace')
- u'\ufffdabc'
- >>> unicode('\x80abc', errors='ignore')
- u'abc'
-
-Encodings are specified as strings containing the encoding's name.
-Python 2.4 comes with roughly 100 different encodings; see the Python
-Library Reference at
-<http://docs.python.org/lib/standard-encodings.html> for a list. Some
-encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
-and '8859' are all synonyms for the same encoding.
-
-One-character Unicode strings can also be created with the
-``unichr()`` built-in function, which takes integers and returns a
-Unicode string of length 1 that contains the corresponding code point.
-The reverse operation is the built-in ``ord()`` function that takes a
-one-character Unicode string and returns the code point value::
-
- >>> unichr(40960)
- u'\ua000'
- >>> ord(u'\ua000')
- 40960
-
-Instances of the ``unicode`` type have many of the same methods as
-the 8-bit string type for operations such as searching and formatting::
-
- >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
- >>> s.count('e')
- 5
- >>> s.find('feather')
- 9
- >>> s.find('bird')
- -1
- >>> s.replace('feather', 'sand')
- u'Was ever sand so lightly blown to and fro as this multitude?'
- >>> s.upper()
- u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
-
-Note that the arguments to these methods can be Unicode strings or 8-bit strings.
-8-bit strings will be converted to Unicode before carrying out the operation;
-Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::
-
- >>> s.find('Was\x9f')
- Traceback (most recent call last):
- File "<stdin>", line 1, in ?
- UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
- >>> s.find(u'Was\x9f')
- -1
-
-Much Python code that operates on strings will therefore work with
-Unicode strings without requiring any changes to the code. (Input and
-output code needs more updating for Unicode; more on this later.)
-
-Another important method is ``.encode([encoding], [errors='strict'])``,
-which returns an 8-bit string version of the
-Unicode string, encoded in the requested encoding. The ``errors``
-parameter is the same as the parameter of the ``unicode()``
-constructor, with one additional possibility; as well as 'strict',
-'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
-uses XML's character references. The following example shows the
-different results::
-
- >>> u = unichr(40960) + u'abcd' + unichr(1972)
- >>> u.encode('utf-8')
- '\xea\x80\x80abcd\xde\xb4'
- >>> u.encode('ascii')
- Traceback (most recent call last):
- File "<stdin>", line 1, in ?
- UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
- >>> u.encode('ascii', 'ignore')
- 'abcd'
- >>> u.encode('ascii', 'replace')
- '?abcd?'
- >>> u.encode('ascii', 'xmlcharrefreplace')
- '&#40960;abcd&#1972;'
-
-Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
-that interprets the string using the given encoding::
-
- >>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
- >>> utf8_version = u.encode('utf-8') # Encode as UTF-8
- >>> type(utf8_version), utf8_version
- (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
- >>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
- >>> u == u2 # The two strings match
- True
-
-The low-level routines for registering and accessing the available
-encodings are found in the ``codecs`` module. However, the encoding
-and decoding functions returned by this module are usually more
-low-level than is comfortable, so I'm not going to describe the
-``codecs`` module here. If you need to implement a completely new
-encoding, you'll need to learn about the ``codecs`` module interfaces,
-but implementing encodings is a specialized task that also won't be
-covered here. Consult the Python documentation to learn more about
-this module.
-
-The most commonly used part of the ``codecs`` module is the
-``codecs.open()`` function which will be discussed in the section
-on input and output.
-
-
-Unicode Literals in Python Source Code
-''''''''''''''''''''''''''''''''''''''''''
-
-In Python source code, Unicode literals are written as strings
-prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific
-code points can be written using the ``\u`` escape sequence, which is
-followed by four hex digits giving the code point. The ``\U`` escape
-sequence is similar, but expects 8 hex digits, not 4.
-
-Unicode literals can also use the same escape sequences as 8-bit
-strings, including ``\x``, but ``\x`` only takes two hex digits so it
-can't express an arbitrary code point. Octal escapes can go up to
-U+01ff, which is octal 777.
-
-::
-
- >>> s = u"a\xac\u1234\u20ac\U00008000"
- ^^^^ two-digit hex escape
- ^^^^^^ four-digit Unicode escape
- ^^^^^^^^^^ eight-digit Unicode escape
- >>> for c in s: print ord(c),
- ...
- 97 172 4660 8364 32768
-
-Using escape sequences for code points greater than 127 is fine in
-small doses, but becomes an annoyance if you're using many accented
-characters, as you would in a program with messages in French or some
-other accent-using language. You can also assemble strings using the
-``unichr()`` built-in function, but this is even more tedious.
-
-Ideally, you'd want to be able to write literals in your language's
-natural encoding. You could then edit Python source code with your
-favorite editor which would display the accented characters naturally,
-and have the right characters used at runtime.
-
-Python supports writing Unicode literals in any encoding, but you have
-to declare the encoding being used. This is done by including a
-special comment as either the first or second line of the source
-file::
-
- #!/usr/bin/env python
- # -*- coding: latin-1 -*-
-
- u = u'abcdé'
- print ord(u[-1])
-
-The syntax is inspired by Emacs's notation for specifying variables local to a file.
-Emacs supports many different variables, but Python only supports 'coding'.
-The ``-*-`` symbols indicate that the comment is special; within them,
-you must supply the name ``coding`` and the name of your chosen encoding,
-separated by ``':'``.
-
-If you don't include such a comment, the default encoding used will be
-ASCII. Versions of Python before 2.4 were Euro-centric and assumed
-Latin-1 as a default encoding for string literals; in Python 2.4,
-characters greater than 127 still work but result in a warning. For
-example, the following program has no encoding declaration::
-
- #!/usr/bin/env python
- u = u'abcdé'
- print ord(u[-1])
-
-When you run it with Python 2.4, it will output the following warning::
-
- amk:~$ python p263.py
- sys:1: DeprecationWarning: Non-ASCII character '\xe9'
- in file p263.py on line 2, but no encoding declared;
- see http://www.python.org/peps/pep-0263.html for details
-
-
-Unicode Properties
-'''''''''''''''''''
-
-The Unicode specification includes a database of information about
-code points. For each code point that's defined, the information
-includes the character's name, its category, the numeric value if
-applicable (Unicode has characters representing the Roman numerals and
-fractions such as one-third and four-fifths). There are also
-properties related to the code point's use in bidirectional text and
-other display-related properties.
-
-The following program displays some information about several
-characters, and prints the numeric value of one particular character::
-
-    import unicodedata
-
-    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
-
-    for i, c in enumerate(u):
-        print i, '%04x' % ord(c), unicodedata.category(c),
-        print unicodedata.name(c)
-
-    # Get numeric value of second character
-    print unicodedata.numeric(u[1])
-
-When run, this prints::
-
- 0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
- 1 0bf2 No TAMIL NUMBER ONE THOUSAND
- 2 0f84 Mn TIBETAN MARK HALANTA
- 3 1770 Lo TAGBANWA LETTER SA
- 4 33af So SQUARE RAD OVER S SQUARED
- 1000.0
-
-The category codes are abbreviations describing the nature of the
-character. These are grouped into categories such as "Letter",
-"Number", "Punctuation", or "Symbol", which in turn are broken up into
-subcategories. To take the codes from the above output, ``'Ll'``
-means 'Letter, lowercase', ``'No'`` means "Number, other", ``'Mn'`` is
-"Mark, nonspacing", and ``'So'`` is "Symbol, other". See
-<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
-for a list of category codes.
-
-References
-''''''''''''''
-
-The Unicode and 8-bit string types are described in the Python library
-reference at <http://docs.python.org/lib/typesseq.html>.
-
-The documentation for the ``unicodedata`` module is at
-<http://docs.python.org/lib/module-unicodedata.html>.
-
-The documentation for the ``codecs`` module is at
-<http://docs.python.org/lib/module-codecs.html>.
-
-Marc-André Lemburg gave a presentation at EuroPython 2002
-titled "Python and Unicode". A PDF version of his slides
-is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
-and is an excellent overview of the design of Python's Unicode features.
-
-
-Reading and Writing Unicode Data
-----------------------------------------
-
-Once you've written some code that works with Unicode data, the next
-problem is input/output. How do you get Unicode strings into your
-program, and how do you convert Unicode into a form suitable for
-storage or transmission?
-
-It's possible that you may not need to do anything depending on your
-input sources and output destinations; you should check whether the
-libraries used in your application support Unicode natively. XML
-parsers often return Unicode data, for example. Many relational
-databases also support Unicode-valued columns and can return Unicode
-values from an SQL query.
-
-Unicode data is usually converted to a particular encoding before it
-gets written to disk or sent over a socket. It's possible to do all
-the work yourself: open a file, read an 8-bit string from it, and
-convert the string with ``unicode(str, encoding)``. However, the
-manual approach is not recommended.
-
-One problem is the multi-byte nature of encodings; one Unicode
-character can be represented by several bytes. If you want to read
-the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
-error-handling code to catch the case where only part of the bytes
-encoding a single Unicode character are read at the end of a chunk.
-One solution would be to read the entire file into memory and then
-perform the decoding, but that prevents you from working with files
-that are extremely large; if you need to read a 2 GB file, you need 2 GB
-of RAM. (More, really, since for at least a moment you'd need to have
-both the encoded string and its Unicode version in memory.)
-
-The solution would be to use the low-level decoding interface to catch
-the case of partial coding sequences. The work of implementing this
-has already been done for you: the ``codecs`` module includes a
-version of the ``open()`` function that returns a file-like object
-that assumes the file's contents are in a specified encoding and
-accepts Unicode parameters for methods such as ``.read()`` and
-``.write()``.
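To make the partial-sequence problem concrete, here is a sketch of the low-level approach using ``codecs.getincrementaldecoder()``, which exists in Python 2.5 and later and in Python 3. The chunk boundaries are chosen by hand to split two multi-byte characters; in practice ``codecs.open()`` does this bookkeeping for you.

```python
import codecs

data = u'caf\xe9 \u20ac'.encode('utf-8')       # 9 bytes in total
# Hand-picked chunks that cut both multi-byte characters in half.
chunks = [data[:4], data[4:7], data[7:]]

decoder = codecs.getincrementaldecoder('utf-8')()
# The decoder buffers an incomplete sequence until the rest arrives.
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))  # flush; raises if bytes are left over

text = u''.join(pieces)
print(repr(pieces))
```

Decoding each chunk separately with ``chunk.decode('utf-8')`` would instead raise ``UnicodeDecodeError`` on the first chunk, because it ends in the middle of a two-byte sequence.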
-
-The function's parameters are
-``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``. ``mode`` can be
-``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to the
-regular built-in ``open()`` function; add a ``'+'`` to
-update the file. ``buffering`` is similarly
-parallel to the standard function's parameter.
-``encoding`` is a string giving
-the encoding to use; if it's left as ``None``, a regular Python file
-object that accepts 8-bit strings is returned. Otherwise, a wrapper
-object is returned, and data written to or read from the wrapper
-object will be converted as needed. ``errors`` specifies the action
-for encoding errors and can be one of the usual values of 'strict',
-'ignore', and 'replace'.
-
-Reading Unicode from a file is therefore simple::
-
-    import codecs
-    f = codecs.open('unicode.rst', encoding='utf-8')
-    for line in f:
-        print repr(line)
-
-It's also possible to open files in update mode,
-allowing both reading and writing::
-
- f = codecs.open('test', encoding='utf-8', mode='w+')
- f.write(u'\u4500 blah blah blah\n')
- f.seek(0)
- print repr(f.readline()[:1])
- f.close()
-
-Unicode character U+FEFF is used as a byte-order mark (BOM),
-and is often written as the first character of a file in order
-to assist with autodetection of the file's byte ordering.
-Some encodings, such as UTF-16, expect a BOM to be present at
-the start of a file; when such an encoding is used,
-the BOM will be automatically written as the first character
-and will be silently dropped when the file is read. There are
-variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
-for little-endian and big-endian encodings, that specify
-one particular byte ordering and don't
-skip the BOM.
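The BOM behaviour can be observed without touching a file by encoding and decoding in memory. This sketch relies only on the ``utf-16`` and ``utf-16-le`` codecs and the BOM constants in the ``codecs`` module; it is written to run on Python 3 as well as Python 2.6 or later, which is an assumption on my part.

```python
import codecs

text = u'hi'
with_bom = text.encode('utf-16')         # native byte order, BOM written first
without_bom = text.encode('utf-16-le')   # fixed byte order, no BOM written

# The BOM is U+FEFF encoded in whichever byte order the platform uses.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with 'utf-16' consumes the BOM ...
round_trip = with_bom.decode('utf-16')
# ... while 'utf-16-le' leaves it in the result as the character U+FEFF.
kept = (codecs.BOM_UTF16_LE + without_bom).decode('utf-16-le')

print(len(with_bom), len(without_bom), repr(round_trip), repr(kept))
```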
-
-
-Unicode filenames
-'''''''''''''''''''''''''
-
-Most of the operating systems in common use today support filenames
-that contain arbitrary Unicode characters. Usually this is
-implemented by converting the Unicode string into some encoding that
-varies depending on the system. For example, Mac OS X uses UTF-8 while
-Windows uses a configurable encoding; on Windows, Python uses the name
-"mbcs" to refer to whatever the currently configured encoding is. On
-Unix systems, there will only be a filesystem encoding if you've set
-the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
-the default encoding is ASCII.
-
-The ``sys.getfilesystemencoding()`` function returns the encoding to
-use on your current system, in case you want to do the encoding
-manually, but there's not much reason to bother. When opening a file
-for reading or writing, you can usually just provide the Unicode
-string as the filename, and it will be automatically converted to the
-right encoding for you::
-
- filename = u'filename\u4500abc'
- f = open(filename, 'w')
- f.write('blah\n')
- f.close()
-
-Functions in the ``os`` module such as ``os.stat()`` will also accept
-Unicode filenames.
-
-``os.listdir()``, which returns filenames, raises an issue: should it
-return the Unicode version of filenames, or should it return 8-bit
-strings containing the encoded versions? ``os.listdir()`` will do
-both, depending on whether you provided the directory path as an 8-bit
-string or a Unicode string. If you pass a Unicode string as the path,
-filenames will be decoded using the filesystem's encoding and a list
-of Unicode strings will be returned, while passing an 8-bit path will
-return the 8-bit versions of the filenames. For example, assuming the
-default filesystem encoding is UTF-8, running the following program::
-
- fn = u'filename\u4500abc'
- f = open(fn, 'w')
- f.close()
-
- import os
- print os.listdir('.')
- print os.listdir(u'.')
-
-will produce the following output::
-
- amk:~$ python t.py
- ['.svn', 'filename\xe4\x94\x80abc', ...]
- [u'.svn', u'filename\u4500abc', ...]
-
-The first list contains UTF-8-encoded filenames, and the second list
-contains the Unicode versions.
-
-
-
-Tips for Writing Unicode-aware Programs
-''''''''''''''''''''''''''''''''''''''''''''
-
-This section provides some suggestions on writing software that
-deals with Unicode.
-
-The most important tip is:
-
- Software should only work with Unicode strings internally,
- converting to a particular encoding on output.
-
-If you attempt to write processing functions that accept both
-Unicode and 8-bit strings, you will find your program vulnerable to
-bugs wherever you combine the two different kinds of strings. Python's
-default encoding is ASCII, so whenever an 8-bit string containing a byte value >127
-is combined with a Unicode string, you'll get a ``UnicodeDecodeError``
-because that byte can't be handled by the ASCII encoding.
-
-It's easy to miss such problems if you only test your software
-with data that doesn't contain any
-accents; everything will seem to work, but there's actually a bug in your
-program waiting for the first user who attempts to use characters >127.
-A second tip, therefore, is:
-
- Include characters >127 and, even better, characters >255 in your
- test data.
-
-When using data coming from a web browser or some other untrusted source,
-a common technique is to check for illegal characters in a string
-before using the string in a generated command line or storing it in a
-database. If you're doing this, be careful to check
-the string once it's in the form that will be used or stored; it's
-possible for encodings to be used to disguise characters. This is especially
-true if the input data also specifies the encoding;
-many encodings leave the commonly checked-for characters alone,
-but Python includes some encodings such as ``'base64'``
-that modify every single character.
-
-For example, let's say you have a content management system that takes a
-Unicode filename, and you want to disallow paths with a '/' character.
-You might write this code::
-
-    def read_file(filename, encoding):
-        if '/' in filename:
-            raise ValueError("'/' not allowed in filenames")
-        unicode_name = filename.decode(encoding)
-        f = open(unicode_name, 'r')
-        # ... return contents of file ...
-
-However, if an attacker could specify the ``'base64'`` encoding,
-they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
-encoded form of the string ``'/etc/passwd'``, to read a
-system file. The above code looks for ``'/'`` characters
-in the encoded form and misses the dangerous character
-in the resulting decoded form.
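One way to apply the advice is to decode first and validate the decoded form. The helper below is a hypothetical sketch, not part of any library: ``decode_filename`` is a name invented here, and it uses ``codecs.decode()`` so that byte-to-byte codecs such as ``'base64'`` behave the same way on Python 2 and Python 3 (an assumption about the target interpreters). Accepting an arbitrary caller-supplied encoding is still questionable; restricting it to a known list would be safer.

```python
import codecs

def decode_filename(raw_name, encoding):
    """Decode raw_name, then check the decoded form (hypothetical helper)."""
    decoded = codecs.decode(raw_name, encoding)
    if not isinstance(decoded, type(u'')):
        # Codecs such as 'base64' return bytes; treat the result as ASCII.
        decoded = decoded.decode('ascii')
    # Validate the string in the form that will actually be used.
    if u'/' in decoded:
        raise ValueError("'/' not allowed in filenames")
    return decoded

safe = decode_filename(b'report.txt', 'utf-8')

try:
    decode_filename(b'L2V0Yy9wYXNzd2Q=', 'base64')
    rejected = False
except ValueError:
    rejected = True

print(safe, rejected)
```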
-
-References
-''''''''''''''
-
-The PDF slides for Marc-André Lemburg's presentation "Writing
-Unicode-aware Applications in Python" are available at
-<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
-and discuss questions of character encodings as well as how to
-internationalize and localize an application.
-
-
-Revision History and Acknowledgements
-------------------------------------------
-
-Thanks to the following people who have noted errors or offered
-suggestions on this article: Nicholas Bastin,
-Marius Gedminas, Kent Johnson, Ken Krugler,
-Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
-
-Version 1.0: posted August 5 2005.
-
-Version 1.01: posted August 7 2005. Corrects factual and markup
-errors; adds several links.
-
-Version 1.02: posted August 16 2005. Corrects factual errors.
-
-
-.. comment Additional topic: building Python w/ UCS2 or UCS4 support
-.. comment Describe obscure -U switch somewhere?
-.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
-
-.. comment
- Original outline:
-
- - [ ] Unicode introduction
- - [ ] ASCII
- - [ ] Terms
- - [ ] Character
- - [ ] Code point
- - [ ] Encodings
- - [ ] Common encodings: ASCII, Latin-1, UTF-8
- - [ ] Unicode Python type
- - [ ] Writing unicode literals
- - [ ] Obscurity: -U switch
- - [ ] Built-ins
- - [ ] unichr()
- - [ ] ord()
- - [ ] unicode() constructor
- - [ ] Unicode type
- - [ ] encode(), decode() methods
- - [ ] Unicodedata module for character properties
- - [ ] I/O
- - [ ] Reading/writing Unicode data into files
- - [ ] Byte-order marks
- - [ ] Unicode filenames
- - [ ] Writing Unicode programs
- - [ ] Do everything in Unicode
- - [ ] Declaring source code encodings (PEP 263)
- - [ ] Other issues
- - [ ] Building Python (UCS2, UCS4)
diff --git a/Doc/howto/urllib2.rst b/Doc/howto/urllib2.rst
deleted file mode 100644
index ce10e52..0000000
--- a/Doc/howto/urllib2.rst
+++ /dev/null
@@ -1,603 +0,0 @@
-==============================================
- HOWTO Fetch Internet Resources Using urllib2
-==============================================
-----------------------------
- Fetching URLs With Python
-----------------------------
-
-
-.. note::
-
- There is a French translation of an earlier revision of this
- HOWTO, available at `urllib2 - Le Manuel manquant
- <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.
-
-.. contents:: urllib2 Tutorial
-
-
-Introduction
-============
-
-.. sidebar:: Related Articles
-
- You may also find the following article on fetching web
- resources with Python useful:
-
- * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_
-
- A tutorial on *Basic Authentication*, with examples in Python.
-
- This HOWTO is written by `Michael Foord
- <http://www.voidspace.org.uk/python/index.shtml>`_.
-
-**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
-(Uniform Resource Locators). It offers a very simple interface, in the form of
-the *urlopen* function. This is capable of fetching URLs using a variety
-of different protocols. It also offers a slightly more complex
-interface for handling common situations - like basic authentication,
-cookies, proxies and so on. These are provided by objects called
-handlers and openers.
-
-urllib2 supports fetching URLs for many "URL schemes" (identified by the string
-before the ":" in the URL - for example "ftp" is the URL scheme of
-"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
-This tutorial focuses on the most common case, HTTP.
-
-For straightforward situations *urlopen* is very easy to use. But as
-soon as you encounter errors or non-trivial cases when opening HTTP
-URLs, you will need some understanding of the HyperText Transfer
-Protocol. The most comprehensive and authoritative reference to HTTP
-is :RFC:`2616`. This is a technical document and not intended to be
-easy to read. This HOWTO aims to illustrate using *urllib2*, with
-enough detail about HTTP to help you through. It is not intended to
-replace the `urllib2 docs <http://docs.python.org/lib/module-urllib2.html>`_ ,
-but is supplementary to them.
-
-
-Fetching URLs
-=============
-
-The simplest way to use urllib2 is as follows : ::
-
- import urllib2
- response = urllib2.urlopen('http://python.org/')
- html = response.read()
-
-Many uses of urllib2 will be that simple (note that instead of an
-'http:' URL we could have used a URL starting with 'ftp:', 'file:',
-etc.). However, it's the purpose of this tutorial to explain the more
-complicated cases, concentrating on HTTP.
-
-HTTP is based on requests and responses - the client makes requests
-and servers send responses. urllib2 mirrors this with a ``Request``
-object which represents the HTTP request you are making. In its
-simplest form you create a Request object that specifies the URL you
-want to fetch. Calling ``urlopen`` with this Request object returns a
-response object for the URL requested. This response is a file-like
-object, which means you can, for example, call ``.read()`` on the
-response::
-
- import urllib2
-
- req = urllib2.Request('http://www.voidspace.org.uk')
- response = urllib2.urlopen(req)
- the_page = response.read()
-
-Note that urllib2 makes use of the same Request interface to handle
-all URL schemes. For example, you can make an FTP request like so: ::
-
- req = urllib2.Request('ftp://example.com/')
-
-In the case of HTTP, there are two extra things that Request objects
-allow you to do: First, you can pass data to be sent to the server.
-Second, you can pass extra information ("metadata") *about* the data
-or about the request itself, to the server - this information is sent
-as HTTP "headers". Let's look at each of these in turn.
-
-Data
-----
-
-Sometimes you want to send data to a URL (often the URL will refer to
-a CGI (Common Gateway Interface) script [#]_ or other web
-application). With HTTP, this is often done using what's known as a
-**POST** request. This is often what your browser does when you submit
-an HTML form that you filled in on the web. Not all POSTs have to come
-from forms: you can use a POST to transmit arbitrary data to your own
-application. In the common case of HTML forms, the data needs to be
-encoded in a standard way, and then passed to the Request object as
-the ``data`` argument. The encoding is done using a function from the
-``urllib`` library, *not* from ``urllib2``. ::
-
- import urllib
- import urllib2
-
- url = 'http://www.someserver.com/cgi-bin/register.cgi'
- values = {'name' : 'Michael Foord',
-           'location' : 'Northampton',
-           'language' : 'Python' }
-
- data = urllib.urlencode(values)
- req = urllib2.Request(url, data)
- response = urllib2.urlopen(req)
- the_page = response.read()
-
-Note that other encodings are sometimes required (e.g. for file upload
-from HTML forms - see
-`HTML Specification, Form Submission <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_
-for more details).
-
-If you do not pass the ``data`` argument, urllib2 uses a **GET**
-request. One way in which GET and POST requests differ is that POST
-requests often have "side-effects": they change the state of the
-system in some way (for example by placing an order with the website
-for a hundredweight of tinned spam to be delivered to your door).
-Though the HTTP standard makes it clear that POSTs are intended to
-*always* cause side-effects, and GET requests *never* to cause
-side-effects, nothing prevents a GET request from having side-effects,
-nor a POST request from having no side-effects. Data can also be
-passed in an HTTP GET request by encoding it in the URL itself.
-
-This is done as follows::
-
- >>> import urllib2
- >>> import urllib
- >>> data = {}
- >>> data['name'] = 'Somebody Here'
- >>> data['location'] = 'Northampton'
- >>> data['language'] = 'Python'
- >>> url_values = urllib.urlencode(data)
- >>> print url_values
- name=Somebody+Here&language=Python&location=Northampton
- >>> url = 'http://www.example.com/example.cgi'
- >>> full_url = url + '?' + url_values
- >>> data = urllib2.urlopen(full_url)
-
-Notice that the full URL is created by adding a ``?`` to the URL, followed by
-the encoded values.
-
-Headers
--------
-
-We'll discuss here one particular HTTP header, to illustrate how to
-add headers to your HTTP request.
-
-Some websites [#]_ dislike being browsed by programs, or send
-different versions to different browsers [#]_ . By default urllib2
-identifies itself as ``Python-urllib/x.y`` (where ``x`` and ``y`` are
-the major and minor version numbers of the Python release,
-e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
-not work. The way a browser identifies itself is through the
-``User-Agent`` header [#]_. When you create a Request object you can
-pass a dictionary of headers in. The following example makes the same
-request as above, but identifies itself as a version of Internet
-Explorer [#]_. ::
-
- import urllib
- import urllib2
-
- url = 'http://www.someserver.com/cgi-bin/register.cgi'
- user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
- values = {'name' : 'Michael Foord',
-           'location' : 'Northampton',
-           'language' : 'Python' }
- headers = { 'User-Agent' : user_agent }
-
- data = urllib.urlencode(values)
- req = urllib2.Request(url, data, headers)
- response = urllib2.urlopen(req)
- the_page = response.read()
-
-The response also has two useful methods. See the section on `info and
-geturl`_ which comes after we have a look at what happens when things
-go wrong.
-
-
-Handling Exceptions
-===================
-
-*urlopen* raises ``URLError`` when it cannot handle a response (though
-as usual with Python APIs, builtin exceptions such as ValueError,
-TypeError etc. may also be raised).
-
-``HTTPError`` is the subclass of ``URLError`` raised in the specific
-case of HTTP URLs.
-
-URLError
---------
-
-Often, URLError is raised because there is no network connection (no
-route to the specified server), or the specified server doesn't exist.
-In this case, the exception raised will have a 'reason' attribute,
-which is a tuple containing an error code and a text error message.
-
-e.g. ::
-
- >>> req = urllib2.Request('http://www.pretend_server.org')
- >>> try: urllib2.urlopen(req)
- ... except urllib2.URLError, e:
- ...     print e.reason
- ...
- (4, 'getaddrinfo failed')
-
-
-HTTPError
----------
-
-Every HTTP response from the server contains a numeric "status
-code". Sometimes the status code indicates that the server is unable
-to fulfil the request. The default handlers will handle some of these
-responses for you (for example, if the response is a "redirection"
-that requests the client fetch the document from a different URL,
-urllib2 will handle that for you). For those it can't handle, urlopen
-will raise an ``HTTPError``. Typical errors include '404' (page not
-found), '403' (request forbidden), and '401' (authentication
-required).
-
-See section 10 of RFC 2616 for a reference on all the HTTP error
-codes.
-
-The ``HTTPError`` instance raised will have an integer 'code'
-attribute, which corresponds to the error sent by the server.
-
-Error Codes
-~~~~~~~~~~~
-
-Because the default handlers handle redirects (codes in the 300
-range), and codes in the 100-299 range indicate success, you will
-usually only see error codes in the 400-599 range.
-
-``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful
-dictionary of response codes that shows all the response codes used
-by RFC 2616. The dictionary is reproduced here for convenience ::
-
- # Table mapping response codes to messages; entries have the
- # form {code: (shortmessage, longmessage)}.
- responses = {
- 100: ('Continue', 'Request received, please continue'),
- 101: ('Switching Protocols',
- 'Switching to new protocol; obey Upgrade header'),
-
- 200: ('OK', 'Request fulfilled, document follows'),
- 201: ('Created', 'Document created, URL follows'),
- 202: ('Accepted',
- 'Request accepted, processing continues off-line'),
- 203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
- 204: ('No Content', 'Request fulfilled, nothing follows'),
- 205: ('Reset Content', 'Clear input form for further input.'),
- 206: ('Partial Content', 'Partial content follows.'),
-
- 300: ('Multiple Choices',
- 'Object has several resources -- see URI list'),
- 301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
- 302: ('Found', 'Object moved temporarily -- see URI list'),
- 303: ('See Other', 'Object moved -- see Method and URL list'),
- 304: ('Not Modified',
- 'Document has not changed since given time'),
- 305: ('Use Proxy',
- 'You must use proxy specified in Location to access this '
- 'resource.'),
- 307: ('Temporary Redirect',
- 'Object moved temporarily -- see URI list'),
-
- 400: ('Bad Request',
- 'Bad request syntax or unsupported method'),
- 401: ('Unauthorized',
- 'No permission -- see authorization schemes'),
- 402: ('Payment Required',
- 'No payment -- see charging schemes'),
- 403: ('Forbidden',
- 'Request forbidden -- authorization will not help'),
- 404: ('Not Found', 'Nothing matches the given URI'),
- 405: ('Method Not Allowed',
- 'Specified method is invalid for this server.'),
- 406: ('Not Acceptable', 'URI not available in preferred format.'),
- 407: ('Proxy Authentication Required', 'You must authenticate with '
- 'this proxy before proceeding.'),
- 408: ('Request Timeout', 'Request timed out; try again later.'),
- 409: ('Conflict', 'Request conflict.'),
- 410: ('Gone',
- 'URI no longer exists and has been permanently removed.'),
- 411: ('Length Required', 'Client must specify Content-Length.'),
- 412: ('Precondition Failed', 'Precondition in headers is false.'),
- 413: ('Request Entity Too Large', 'Entity is too large.'),
- 414: ('Request-URI Too Long', 'URI is too long.'),
- 415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
- 416: ('Requested Range Not Satisfiable',
- 'Cannot satisfy request range.'),
- 417: ('Expectation Failed',
- 'Expect condition could not be satisfied.'),
-
- 500: ('Internal Server Error', 'Server got itself in trouble'),
- 501: ('Not Implemented',
- 'Server does not support this operation'),
- 502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
- 503: ('Service Unavailable',
- 'The server cannot process the request due to a high load'),
- 504: ('Gateway Timeout',
- 'The gateway server did not receive a timely response'),
- 505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
- }
-
-When an error is raised the server responds by returning an HTTP error
-code *and* an error page. You can use the ``HTTPError`` instance as a
-response object for the page returned. This means that as well as the
-``code`` attribute, it also has ``read``, ``geturl``, and ``info`` methods. ::
-
- >>> req = urllib2.Request('http://www.python.org/fish.html')
- >>> try:
- ...     urllib2.urlopen(req)
- ... except urllib2.HTTPError, e:
- ...     print e.code
- ...     print e.read()
- ...
- 404
- <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
- <?xml-stylesheet href="./css/ht2html.css"
- type="text/css"?>
- <html><head><title>Error 404: File Not Found</title>
- ...... etc...
-
-Wrapping it Up
---------------
-
-So if you want to be prepared for ``HTTPError`` *or* ``URLError``
-there are two basic approaches. I prefer the second approach.
-
-Number 1
-~~~~~~~~
-
-::
-
-
- from urllib2 import Request, urlopen, URLError, HTTPError
- req = Request(someurl)
- try:
-     response = urlopen(req)
- except HTTPError, e:
-     print 'The server couldn\'t fulfill the request.'
-     print 'Error code: ', e.code
- except URLError, e:
-     print 'We failed to reach a server.'
-     print 'Reason: ', e.reason
- else:
-     # everything is fine
-     pass
-
-.. note::
-
- The ``except HTTPError`` *must* come first, otherwise ``except URLError``
- will *also* catch an ``HTTPError``.
-
-Number 2
-~~~~~~~~
-
-::
-
- from urllib2 import Request, urlopen, URLError
- req = Request(someurl)
- try:
-     response = urlopen(req)
- except URLError, e:
-     if hasattr(e, 'reason'):
-         print 'We failed to reach a server.'
-         print 'Reason: ', e.reason
-     elif hasattr(e, 'code'):
-         print 'The server couldn\'t fulfill the request.'
-         print 'Error code: ', e.code
- else:
-     # everything is fine
-     pass
-
-info and geturl
-===============
-
-The response returned by urlopen (or the ``HTTPError`` instance) has
-two useful methods, ``info`` and ``geturl``.
-
-**geturl** - this returns the real URL of the page fetched. This is
-useful because ``urlopen`` (or the opener object used) may have
-followed a redirect. The URL of the page fetched may not be the same
-as the URL requested.
-
-**info** - this returns a dictionary-like object that describes the
-page fetched, particularly the headers sent by the server. It is
-currently an ``httplib.HTTPMessage`` instance.
-
-Typical headers include 'Content-length', 'Content-type', and so
-on. See the
-`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
-for a useful listing of HTTP headers with brief explanations of their meaning
-and use.
-
-
-Openers and Handlers
-====================
-
-When you fetch a URL you use an opener (an instance of the perhaps
-confusingly-named ``urllib2.OpenerDirector``). Normally we have been using
-the default opener - via ``urlopen`` - but you can create custom
-openers. Openers use handlers. All the "heavy lifting" is done by the
-handlers. Each handler knows how to open URLs for a particular URL
-scheme (http, ftp, etc.), or how to handle an aspect of URL opening,
-for example HTTP redirections or HTTP cookies.
-
-You will want to create openers if you want to fetch URLs with
-specific handlers installed, for example to get an opener that handles
-cookies, or to get an opener that does not handle redirections.
-
-To create an opener, instantiate an ``OpenerDirector``, and then call
-``.add_handler(some_handler_instance)`` repeatedly.
-
-Alternatively, you can use ``build_opener``, which is a convenience
-function for creating opener objects with a single function call.
-``build_opener`` adds several handlers by default, but provides a
-quick way to add more and/or override the default handlers.
-
-Other sorts of handlers you might want can handle proxies,
-authentication, and other common but slightly specialised
-situations.
-
-``install_opener`` can be used to make an ``opener`` object the
-(global) default opener. This means that calls to ``urlopen`` will use
-the opener you have installed.
-
-Opener objects have an ``open`` method, which can be called directly
-to fetch URLs in the same way as the ``urlopen`` function: there's no
-need to call ``install_opener``, except as a convenience.
-
-
-Basic Authentication
-====================
-
-To illustrate creating and installing a handler we will use the
-``HTTPBasicAuthHandler``. For a more detailed discussion of this
-subject - including an explanation of how Basic Authentication works -
-see the `Basic Authentication Tutorial <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.
-
-When authentication is required, the server sends a header (as well as
-the 401 error code) requesting authentication. This specifies the
-authentication scheme and a 'realm'. The header looks like:
-``Www-authenticate: SCHEME realm="REALM"``.
-
-e.g. ::
-
- Www-authenticate: Basic realm="cPanel Users"
-
-
-The client should then retry the request with the appropriate name and
-password for the realm included as a header in the request. This is
-'basic authentication'. In order to simplify this process we can
-create an instance of ``HTTPBasicAuthHandler`` and an opener to use
-this handler.
-
-The ``HTTPBasicAuthHandler`` uses an object called a password manager
-to handle the mapping of URLs and realms to passwords and
-usernames. If you know what the realm is (from the authentication
-header sent by the server), then you can use a
-``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In
-that case, it is convenient to use
-``HTTPPasswordMgrWithDefaultRealm``. This allows you to specify a
-default username and password for a URL. This will be supplied unless
-you provide an alternative combination for a specific realm. We
-indicate this by providing ``None`` as the realm argument to
-the ``add_password`` method.
-
-The top-level URL is the first URL that requires authentication. URLs
-"deeper" than the URL you pass to .add_password() will also match. ::
-
- # create a password manager
- password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
-
- # Add the username and password.
- # If we knew the realm, we could use it instead of ``None``.
- top_level_url = "http://example.com/foo/"
- password_mgr.add_password(None, top_level_url, username, password)
-
- handler = urllib2.HTTPBasicAuthHandler(password_mgr)
-
- # create "opener" (OpenerDirector instance)
- opener = urllib2.build_opener(handler)
-
- # use the opener to fetch a URL
- opener.open(a_url)
-
- # Install the opener.
- # Now all calls to urllib2.urlopen use our opener.
- urllib2.install_opener(opener)
-
-.. note::
-
- In the above example we only supplied our ``HTTPBasicAuthHandler``
- to ``build_opener``. By default openers have the handlers for
- normal situations - ``ProxyHandler``, ``UnknownHandler``,
- ``HTTPHandler``, ``HTTPDefaultErrorHandler``,
- ``HTTPRedirectHandler``, ``FTPHandler``, ``FileHandler``,
- ``HTTPErrorProcessor``.
-
-top_level_url is in fact *either* a full URL (including the 'http:'
-scheme component and the hostname and optionally the port number)
-e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
-optionally including the port number) e.g. "example.com" or
-"example.com:8080" (the latter example includes a port number). The
-authority, if present, must NOT contain the "userinfo" component - for
-example "joe:password@example.com" is not correct.
-
-
-Proxies
-=======
-
-**urllib2** will auto-detect your proxy settings and use those. This
-is through the ``ProxyHandler`` which is part of the normal handler
-chain. Normally that's a good thing, but there are occasions when it
-may not be helpful [#]_. One way to disable it is to set up our own
-``ProxyHandler``, with no proxies defined. This is done using similar
-steps to setting up a `Basic Authentication`_ handler::
-
- >>> proxy_support = urllib2.ProxyHandler({})
- >>> opener = urllib2.build_opener(proxy_support)
- >>> urllib2.install_opener(opener)
-
-.. note::
-
- Currently ``urllib2`` *does not* support fetching of ``https``
- locations through a proxy. However, this can be enabled by extending
- urllib2 as shown in the recipe [#]_.
-
-
-Sockets and Layers
-==================
-
-The Python support for fetching resources from the web is
-layered. urllib2 uses the httplib library, which in turn uses the
-socket library.
-
-As of Python 2.3 you can specify how long a socket should wait for a
-response before timing out. This can be useful in applications which
-have to fetch web pages. By default the socket module has *no timeout*
-and can hang. Currently, the socket timeout is not exposed at the
-httplib or urllib2 levels. However, you can set the default timeout
-globally for all sockets using::
-
- import socket
- import urllib2
-
- # timeout in seconds
- timeout = 10
- socket.setdefaulttimeout(timeout)
-
- # this call to urllib2.urlopen now uses the default timeout
- # we have set in the socket module
- req = urllib2.Request('http://www.voidspace.org.uk')
- response = urllib2.urlopen(req)
-
-
--------
-
-
-Footnotes
-=========
-
-This document was reviewed and revised by John Lee.
-
-.. [#] For an introduction to the CGI protocol see
- `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
-.. [#] Like Google for example. The *proper* way to use Google from a program
- is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
- `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
- for some examples of using the Google API.
-.. [#] Browser sniffing is a very bad practise for website design - building
- sites using web standards is much more sensible. Unfortunately a lot of
- sites still send different versions to different browsers.
-.. [#] For details of more HTTP request headers, see
- `Quick Reference to HTTP Headers`_.
-.. [#] The user agent for MSIE 6 is
- *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
-.. [#] In my case I have to use a proxy to access the internet at work. If you
- attempt to fetch *localhost* URLs through this proxy it blocks them. IE
- is set to use the proxy, which urllib2 picks up on. In order to test
- scripts with a localhost server, I have to prevent urllib2 from using
- the proxy.
-.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
- <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.
-