author | Georg Brandl <georg@python.org> | 2007-08-15 14:27:07 (GMT)
---|---|---
committer | Georg Brandl <georg@python.org> | 2007-08-15 14:27:07 (GMT)
commit | 739c01d47b9118d04e5722333f0e6b4d0c8bdd9e (patch) |
tree | f82b450d291927fc1758b96d981aa0610947b529 /Doc/howto |
parent | 2d1649094402ef393ea2b128ba2c08c3937e6b93 (diff) |
download | cpython-739c01d47b9118d04e5722333f0e6b4d0c8bdd9e.zip, cpython-739c01d47b9118d04e5722333f0e6b4d0c8bdd9e.tar.gz, cpython-739c01d47b9118d04e5722333f0e6b4d0c8bdd9e.tar.bz2 |
Delete the LaTeX doc tree.
Diffstat (limited to 'Doc/howto')
-rw-r--r-- | Doc/howto/Makefile | 84
-rw-r--r-- | Doc/howto/TODO | 13
-rw-r--r-- | Doc/howto/advocacy.tex | 411
-rw-r--r-- | Doc/howto/curses.tex | 486
-rw-r--r-- | Doc/howto/doanddont.tex | 332
-rw-r--r-- | Doc/howto/functional.rst | 1474
-rw-r--r-- | Doc/howto/regex.tex | 1476
-rw-r--r-- | Doc/howto/sockets.tex | 465
-rw-r--r-- | Doc/howto/unicode.rst | 766
-rw-r--r-- | Doc/howto/urllib2.rst | 603
10 files changed, 0 insertions, 6110 deletions
diff --git a/Doc/howto/Makefile b/Doc/howto/Makefile deleted file mode 100644 index 18110a2..0000000 --- a/Doc/howto/Makefile +++ /dev/null @@ -1,84 +0,0 @@ -# Makefile for the HOWTO directory -# LaTeX HOWTOs can be turned into HTML, PDF, PS, DVI or plain text output. -# reST HOWTOs can only be turned into HTML. - -# Variables to change - -# Paper size for non-HTML formats (letter or a4) -PAPER=letter - -# Arguments to rst2html.py, and location of the script -RSTARGS = --input-encoding=utf-8 -RST2HTML = rst2html.py - -# List of HOWTOs that aren't to be processed. This should contain the -# base name of the HOWTO without any extension (e.g. 'advocacy', -# 'unicode'). -REMOVE_HOWTOS = - -MKHOWTO=../tools/mkhowto -WEBDIR=. -PAPERDIR=../paper-$(PAPER) -HTMLDIR=../html - -# Determine list of files to be built -TEX_SOURCES = $(wildcard *.tex) -RST_SOURCES = $(wildcard *.rst) -TEX_NAMES = $(filter-out $(REMOVE_HOWTOS),$(patsubst %.tex,%,$(TEX_SOURCES))) - -PAPER_PATHS=$(addprefix $(PAPERDIR)/,$(TEX_NAMES)) -DVI =$(addsuffix .dvi,$(PAPER_PATHS)) -PDF =$(addsuffix .pdf,$(PAPER_PATHS)) -PS =$(addsuffix .ps,$(PAPER_PATHS)) - -ALL_HOWTO_NAMES = $(TEX_NAMES) $(patsubst %.rst,%,$(RST_SOURCES)) -HOWTO_NAMES = $(filter-out $(REMOVE_HOWTOS),$(ALL_HOWTO_NAMES)) -HTML = $(addprefix $(HTMLDIR)/,$(HOWTO_NAMES)) - -# Rules for building various formats - -# reST to HTML -$(HTMLDIR)/%: %.rst - if [ ! -d $@ ] ; then mkdir $@ ; fi - $(RST2HTML) $(RSTARGS) $< >$@/index.html - -# LaTeX to various output formats -$(PAPERDIR)/%.dvi : %.tex - $(MKHOWTO) --dvi $< - mv $*.dvi $@ - -$(PAPERDIR)/%.pdf : %.tex - $(MKHOWTO) --pdf $< - mv $*.pdf $@ - -$(PAPERDIR)/%.ps : %.tex - $(MKHOWTO) --ps $< - mv $*.ps $@ - -$(HTMLDIR)/% : %.tex - $(MKHOWTO) --html --iconserver="." --dir $@ $< - -# Rule that isn't actually used -- we no longer support the 'txt' target. -$(PAPERDIR)/%.txt : %.tex - $(MKHOWTO) --text $< - mv $@ txt - -default: - @echo "'all' -- build all files" - @echo "'dvi', 'pdf', 'ps', 'html' -- build one format" - -all: dvi pdf ps html - -.PHONY : dvi pdf ps html -dvi: $(DVI) -pdf: $(PDF) -ps: $(PS) -html: $(HTML) - -clean: - rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how *.bkm - rm -f *.dvi *.pdf *.ps - -clobber: - rm -rf $(HTML) - rm -rf $(DVI) $(PDF) $(PS) diff --git a/Doc/howto/TODO b/Doc/howto/TODO deleted file mode 100644 index c229828..0000000 --- a/Doc/howto/TODO +++ /dev/null @@ -1,13 +0,0 @@ - -Short-term tasks: - Quick revision pass to make HOWTOs match the current state of Python -doanddont regex sockets - -Medium-term tasks: - Revisit the regex howto. - * Add exercises with answers for each section - * More examples? - -Long-term tasks: - Integrate with other Python docs? - diff --git a/Doc/howto/advocacy.tex b/Doc/howto/advocacy.tex deleted file mode 100644 index 9074b3f..0000000 --- a/Doc/howto/advocacy.tex +++ /dev/null @@ -1,411 +0,0 @@ - -\documentclass{howto} - -\title{Python Advocacy HOWTO} - -\release{0.03} - -\author{A.M. Kuchling} -\authoraddress{\email{amk@amk.ca}} - -\begin{document} -\maketitle - -\begin{abstract} -\noindent -It's usually difficult to get your management to accept open source -software, and Python is no exception to this rule. This document -discusses reasons to use Python, strategies for winning acceptance, -facts and arguments you can use, and cases where you \emph{shouldn't} -try to use Python. - -This document is available from the Python HOWTO page at -\url{http://www.python.org/doc/howto}. 
- -\end{abstract} - -\tableofcontents - -\section{Reasons to Use Python} - -There are several reasons to incorporate a scripting language into -your development process, and this section will discuss them, and why -Python has some properties that make it a particularly good choice. - - \subsection{Programmability} - -Programs are often organized in a modular fashion. Lower-level -operations are grouped together, and called by higher-level functions, -which may in turn be used as basic operations by still further upper -levels. - -For example, the lowest level might define a very low-level -set of functions for accessing a hash table. The next level might use -hash tables to store the headers of a mail message, mapping a header -name like \samp{Date} to a value such as \samp{Tue, 13 May 1997 -20:00:54 -0400}. A yet higher level may operate on message objects, -without knowing or caring that message headers are stored in a hash -table, and so forth. - -Often, the lowest levels do very simple things; they implement a data -structure such as a binary tree or hash table, or they perform some -simple computation, such as converting a date string to a number. The -higher levels then contain logic connecting these primitive -operations. Using the approach, the primitives can be seen as basic -building blocks which are then glued together to produce the complete -product. - -Why is this design approach relevant to Python? Because Python is -well suited to functioning as such a glue language. A common approach -is to write a Python module that implements the lower level -operations; for the sake of speed, the implementation might be in C, -Java, or even Fortran. Once the primitives are available to Python -programs, the logic underlying higher level operations is written in -the form of Python code. The high-level logic is then more -understandable, and easier to modify. - -John Ousterhout wrote a paper that explains this idea at greater -length, entitled ``Scripting: Higher Level Programming for the 21st -Century''. I recommend that you read this paper; see the references -for the URL. Ousterhout is the inventor of the Tcl language, and -therefore argues that Tcl should be used for this purpose; he only -briefly refers to other languages such as Python, Perl, and -Lisp/Scheme, but in reality, Ousterhout's argument applies to -scripting languages in general, since you could equally write -extensions for any of the languages mentioned above. - - \subsection{Prototyping} - -In \emph{The Mythical Man-Month}, Fredrick Brooks suggests the -following rule when planning software projects: ``Plan to throw one -away; you will anyway.'' Brooks is saying that the first attempt at a -software design often turns out to be wrong; unless the problem is -very simple or you're an extremely good designer, you'll find that new -requirements and features become apparent once development has -actually started. If these new requirements can't be cleanly -incorporated into the program's structure, you're presented with two -unpleasant choices: hammer the new features into the program somehow, -or scrap everything and write a new version of the program, taking the -new features into account from the beginning. - -Python provides you with a good environment for quickly developing an -initial prototype. That lets you get the overall program structure -and logic right, and you can fine-tune small details in the fast -development cycle that Python provides. 
Once you're satisfied with -the GUI interface or program output, you can translate the Python code -into C++, Fortran, Java, or some other compiled language. - -Prototyping means you have to be careful not to use too many Python -features that are hard to implement in your other language. Using -\code{eval()}, or regular expressions, or the \module{pickle} module, -means that you're going to need C or Java libraries for formula -evaluation, regular expressions, and serialization, for example. But -it's not hard to avoid such tricky code, and in the end the -translation usually isn't very difficult. The resulting code can be -rapidly debugged, because any serious logical errors will have been -removed from the prototype, leaving only more minor slip-ups in the -translation to track down. - -This strategy builds on the earlier discussion of programmability. -Using Python as glue to connect lower-level components has obvious -relevance for constructing prototype systems. In this way Python can -help you with development, even if end users never come in contact -with Python code at all. If the performance of the Python version is -adequate and corporate politics allow it, you may not need to do a -translation into C or Java, but it can still be faster to develop a -prototype and then translate it, instead of attempting to produce the -final version immediately. - -One example of this development strategy is Microsoft Merchant Server. -Version 1.0 was written in pure Python, by a company that subsequently -was purchased by Microsoft. Version 2.0 began to translate the code -into \Cpp, shipping with some \Cpp code and some Python code. Version -3.0 didn't contain any Python at all; all the code had been translated -into \Cpp. Even though the product doesn't contain a Python -interpreter, the Python language has still served a useful purpose by -speeding up development. - -This is a very common use for Python. Past conference papers have -also described this approach for developing high-level numerical -algorithms; see David M. Beazley and Peter S. Lomdahl's paper -``Feeding a Large-scale Physics Application to Python'' in the -references for a good example. If an algorithm's basic operations are -things like "Take the inverse of this 4000x4000 matrix", and are -implemented in some lower-level language, then Python has almost no -additional performance cost; the extra time required for Python to -evaluate an expression like \code{m.invert()} is dwarfed by the cost -of the actual computation. It's particularly good for applications -where seemingly endless tweaking is required to get things right. GUI -interfaces and Web sites are prime examples. - -The Python code is also shorter and faster to write (once you're -familiar with Python), so it's easier to throw it away if you decide -your approach was wrong; if you'd spent two weeks working on it -instead of just two hours, you might waste time trying to patch up -what you've got out of a natural reluctance to admit that those two -weeks were wasted. Truthfully, those two weeks haven't been wasted, -since you've learnt something about the problem and the technology -you're using to solve it, but it's human nature to view this as a -failure of some sort. - - \subsection{Simplicity and Ease of Understanding} - -Python is definitely \emph{not} a toy language that's only usable for -small tasks. The language features are general and powerful enough to -enable it to be used for many different purposes. 
It's useful at the -small end, for 10- or 20-line scripts, but it also scales up to larger -systems that contain thousands of lines of code. - -However, this expressiveness doesn't come at the cost of an obscure or -tricky syntax. While Python has some dark corners that can lead to -obscure code, there are relatively few such corners, and proper design -can isolate their use to only a few classes or modules. It's -certainly possible to write confusing code by using too many features -with too little concern for clarity, but most Python code can look a -lot like a slightly-formalized version of human-understandable -pseudocode. - -In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following -definition for "compact": - -\begin{quotation} - Compact \emph{adj.} Of a design, describes the valuable property - that it can all be apprehended at once in one's head. This - generally means the thing created from the design can be used - with greater facility and fewer errors than an equivalent tool - that is not compact. Compactness does not imply triviality or - lack of power; for example, C is compact and FORTRAN is not, - but C is more powerful than FORTRAN. Designs become - non-compact through accreting features and cruft that don't - merge cleanly into the overall design scheme (thus, some fans - of Classic C maintain that ANSI C is no longer compact). -\end{quotation} - -(From \url{http://www.catb.org/~esr/jargon/html/C/compact.html}) - -In this sense of the word, Python is quite compact, because the -language has just a few ideas, which are used in lots of places. Take -namespaces, for example. Import a module with \code{import math}, and -you create a new namespace called \samp{math}. Classes are also -namespaces that share many of the properties of modules, and have a -few of their own; for example, you can create instances of a class. -Instances? They're yet another namespace. Namespaces are currently -implemented as Python dictionaries, so they have the same methods as -the standard dictionary data type: .keys() returns all the keys, and -so forth. - -This simplicity arises from Python's development history. The -language syntax derives from different sources; ABC, a relatively -obscure teaching language, is one primary influence, and Modula-3 is -another. (For more information about ABC and Modula-3, consult their -respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and -\url{http://www.m3.org}.) Other features have come from C, Icon, -Algol-68, and even Perl. Python hasn't really innovated very much, -but instead has tried to keep the language small and easy to learn, -building on ideas that have been tried in other languages and found -useful. - -Simplicity is a virtue that should not be underestimated. It lets you -learn the language more quickly, and then rapidly write code, code -that often works the first time you run it. - - \subsection{Java Integration} - -If you're working with Java, Jython -(\url{http://www.jython.org/}) is definitely worth your -attention. Jython is a re-implementation of Python in Java that -compiles Python code into Java bytecodes. The resulting environment -has very tight, almost seamless, integration with Java. It's trivial -to access Java classes from Python, and you can write Python classes -that subclass Java classes. Jython can be used for prototyping Java -applications in much the same way CPython is used, and it can also be -used for test suites for Java code, or embedded in a Java application -to add scripting capabilities. 
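A minimal sketch of what this integration looks like in practice (the
\class{LoggingList} class below is purely illustrative, and a Jython~2.x
interpreter is assumed):

\begin{verbatim}
from java.util import ArrayList    # Java classes import like Python modules

class LoggingList(ArrayList):      # a Python class subclassing a Java class
    def add(self, item):
        print "adding", item
        return ArrayList.add(self, item)

items = LoggingList()
items.add("hello")                 # calls the Python override
print items.size()                 # inherited Java methods work as usual
\end{verbatim}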
- -\section{Arguments and Rebuttals} - -Let's say that you've decided upon Python as the best choice for your -application. How can you convince your management, or your fellow -developers, to use Python? This section lists some common arguments -against using Python, and provides some possible rebuttals. - -\emph{Python is freely available software that doesn't cost anything. -How good can it be?} - -Very good, indeed. These days Linux and Apache, two other pieces of -open source software, are becoming more respected as alternatives to -commercial software, but Python hasn't had all the publicity. - -Python has been around for several years, with many users and -developers. Accordingly, the interpreter has been used by many -people, and has gotten most of the bugs shaken out of it. While bugs -are still discovered at intervals, they're usually either quite -obscure (they'd have to be, for no one to have run into them before) -or they involve interfaces to external libraries. The internals of -the language itself are quite stable. - -Having the source code should be viewed as making the software -available for peer review; people can examine the code, suggest (and -implement) improvements, and track down bugs. To find out more about -the idea of open source code, along with arguments and case studies -supporting it, go to \url{http://www.opensource.org}. - -\emph{Who's going to support it?} - -Python has a sizable community of developers, and the number is still -growing. The Internet community surrounding the language is an active -one, and is worth being considered another one of Python's advantages. -Most questions posted to the comp.lang.python newsgroup are quickly -answered by someone. - -Should you need to dig into the source code, you'll find it's clear -and well-organized, so it's not very difficult to write extensions and -track down bugs yourself. If you'd prefer to pay for support, there -are companies and individuals who offer commercial support for Python. - -\emph{Who uses Python for serious work?} - -Lots of people; one interesting thing about Python is the surprising -diversity of applications that it's been used for. People are using -Python to: - -\begin{itemize} -\item Run Web sites -\item Write GUI interfaces -\item Control -number-crunching code on supercomputers -\item Make a commercial application scriptable by embedding the Python -interpreter inside it -\item Process large XML data sets -\item Build test suites for C or Java code -\end{itemize} - -Whatever your application domain is, there's probably someone who's -used Python for something similar. Yet, despite being useable for -such high-end applications, Python's still simple enough to use for -little jobs. - -See \url{http://wiki.python.org/moin/OrganizationsUsingPython} for a list of some of the -organizations that use Python. - -\emph{What are the restrictions on Python's use?} - -They're practically nonexistent. Consult the \file{Misc/COPYRIGHT} -file in the source distribution, or -\url{http://www.python.org/doc/Copyright.html} for the full language, -but it boils down to three conditions. - -\begin{itemize} - -\item You have to leave the copyright notice on the software; if you -don't include the source code in a product, you have to put the -copyright notice in the supporting documentation. - -\item Don't claim that the institutions that have developed Python -endorse your product in any way. - -\item If something goes wrong, you can't sue for damages. 
Practically -all software licences contain this condition. - -\end{itemize} - -Notice that you don't have to provide source code for anything that -contains Python or is built with it. Also, the Python interpreter and -accompanying documentation can be modified and redistributed in any -way you like, and you don't have to pay anyone any licensing fees at -all. - -\emph{Why should we use an obscure language like Python instead of -well-known language X?} - -I hope this HOWTO, and the documents listed in the final section, will -help convince you that Python isn't obscure, and has a healthily -growing user base. One word of advice: always present Python's -positive advantages, instead of concentrating on language X's -failings. People want to know why a solution is good, rather than why -all the other solutions are bad. So instead of attacking a competing -solution on various grounds, simply show how Python's virtues can -help. - - -\section{Useful Resources} - -\begin{definitions} - - -\term{\url{http://www.pythonology.com/success}} - -The Python Success Stories are a collection of stories from successful -users of Python, with the emphasis on business and corporate users. - -%\term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}} - -%The first chapter of \emph{Internet Programming with Python} also -%examines some of the reasons for using Python. The book is well worth -%buying, but the publishers have made the first chapter available on -%the Web. - -\term{\url{http://home.pacbell.net/ouster/scripting.html}} - -John Ousterhout's white paper on scripting is a good argument for the -utility of scripting languages, though naturally enough, he emphasizes -Tcl, the language he developed. Most of the arguments would apply to -any scripting language. - -\term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}} - -The authors, David M. Beazley and Peter S. Lomdahl, -describe their use of Python at Los Alamos National Laboratory. -It's another good example of how Python can help get real work done. -This quotation from the paper has been echoed by many people: - -\begin{quotation} - Originally developed as a large monolithic application for - massively parallel processing systems, we have used Python to - transform our application into a flexible, highly modular, and - extremely powerful system for performing simulation, data - analysis, and visualization. In addition, we describe how Python - has solved a number of important problems related to the - development, debugging, deployment, and maintenance of scientific - software. -\end{quotation} - -\term{\url{http://pythonjournal.cognizor.com/pyj1/Everitt-Feit_interview98-V1.html}} - -This interview with Andy Feit, discussing Infoseek's use of Python, can be -used to show that choosing Python didn't introduce any difficulties -into a company's development process, and provided some substantial benefits. - -%\term{\url{http://www.python.org/psa/Commercial.html}} - -%Robin Friedrich wrote this document on how to support Python's use in -%commercial projects. - -\term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}} - -For the 6th Python conference, Greg Stein presented a paper that -traced Python's adoption and usage at a startup called eShop, and -later at Microsoft. - -\term{\url{http://www.opensource.org}} - -Management may be doubtful of the reliability and usefulness of -software that wasn't written commercially. 
This site presents -arguments that show how open source software can have considerable -advantages over closed-source software. - -\term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}} - -The Linux Advocacy mini-HOWTO was the inspiration for this document, -and is also well worth reading for general suggestions on winning -acceptance for a new technology, such as Linux or Python. In general, -you won't make much progress by simply attacking existing systems and -complaining about their inadequacies; this often ends up looking like -unfocused whining. It's much better to point out some of the many -areas where Python is an improvement over other systems. - -\end{definitions} - -\end{document} - - diff --git a/Doc/howto/curses.tex b/Doc/howto/curses.tex deleted file mode 100644 index 3e4cada..0000000 --- a/Doc/howto/curses.tex +++ /dev/null @@ -1,486 +0,0 @@ -\documentclass{howto} - -\title{Curses Programming with Python} - -\release{2.02} - -\author{A.M. Kuchling, Eric S. Raymond} -\authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}} - -\begin{document} -\maketitle - -\begin{abstract} -\noindent -This document describes how to write text-mode programs with Python 2.x, -using the \module{curses} extension module to control the display. - -This document is available from the Python HOWTO page at -\url{http://www.python.org/doc/howto}. -\end{abstract} - -\tableofcontents - -\section{What is curses?} - -The curses library supplies a terminal-independent screen-painting and -keyboard-handling facility for text-based terminals; such terminals -include VT100s, the Linux console, and the simulated terminal provided -by X11 programs such as xterm and rxvt. Display terminals support -various control codes to perform common operations such as moving the -cursor, scrolling the screen, and erasing areas. Different terminals -use widely differing codes, and often have their own minor quirks. - -In a world of X displays, one might ask ``why bother''? It's true -that character-cell display terminals are an obsolete technology, but -there are niches in which being able to do fancy things with them are -still valuable. One is on small-footprint or embedded Unixes that -don't carry an X server. Another is for tools like OS installers -and kernel configurators that may have to run before X is available. - -The curses library hides all the details of different terminals, and -provides the programmer with an abstraction of a display, containing -multiple non-overlapping windows. The contents of a window can be -changed in various ways--adding text, erasing it, changing its -appearance--and the curses library will automagically figure out what -control codes need to be sent to the terminal to produce the right -output. - -The curses library was originally written for BSD Unix; the later System V -versions of Unix from AT\&T added many enhancements and new functions. -BSD curses is no longer maintained, having been replaced by ncurses, -which is an open-source implementation of the AT\&T interface. If you're -using an open-source Unix such as Linux or FreeBSD, your system almost -certainly uses ncurses. Since most current commercial Unix versions -are based on System V code, all the functions described here will -probably be available. The older versions of curses carried by some -proprietary Unixes may not support everything, though. - -No one has made a Windows port of the curses module. On a Windows -platform, try the Console module written by Fredrik Lundh. 
The -Console module provides cursor-addressable text output, plus full -support for mouse and keyboard input, and is available from -\url{http://effbot.org/efflib/console}. - -\subsection{The Python curses module} - -Thy Python module is a fairly simple wrapper over the C functions -provided by curses; if you're already familiar with curses programming -in C, it's really easy to transfer that knowledge to Python. The -biggest difference is that the Python interface makes things simpler, -by merging different C functions such as \function{addstr}, -\function{mvaddstr}, \function{mvwaddstr}, into a single -\method{addstr()} method. You'll see this covered in more detail -later. - -This HOWTO is simply an introduction to writing text-mode programs -with curses and Python. It doesn't attempt to be a complete guide to -the curses API; for that, see the Python library guide's section on -ncurses, and the C manual pages for ncurses. It will, however, give -you the basic ideas. - -\section{Starting and ending a curses application} - -Before doing anything, curses must be initialized. This is done by -calling the \function{initscr()} function, which will determine the -terminal type, send any required setup codes to the terminal, and -create various internal data structures. If successful, -\function{initscr()} returns a window object representing the entire -screen; this is usually called \code{stdscr}, after the name of the -corresponding C -variable. - -\begin{verbatim} -import curses -stdscr = curses.initscr() -\end{verbatim} - -Usually curses applications turn off automatic echoing of keys to the -screen, in order to be able to read keys and only display them under -certain circumstances. This requires calling the \function{noecho()} -function. - -\begin{verbatim} -curses.noecho() -\end{verbatim} - -Applications will also commonly need to react to keys instantly, -without requiring the Enter key to be pressed; this is called cbreak -mode, as opposed to the usual buffered input mode. - -\begin{verbatim} -curses.cbreak() -\end{verbatim} - -Terminals usually return special keys, such as the cursor keys or -navigation keys such as Page Up and Home, as a multibyte escape -sequence. While you could write your application to expect such -sequences and process them accordingly, curses can do it for you, -returning a special value such as \constant{curses.KEY_LEFT}. To get -curses to do the job, you'll have to enable keypad mode. - -\begin{verbatim} -stdscr.keypad(1) -\end{verbatim} - -Terminating a curses application is much easier than starting one. -You'll need to call - -\begin{verbatim} -curses.nocbreak(); stdscr.keypad(0); curses.echo() -\end{verbatim} - -to reverse the curses-friendly terminal settings. Then call the -\function{endwin()} function to restore the terminal to its original -operating mode. - -\begin{verbatim} -curses.endwin() -\end{verbatim} - -A common problem when debugging a curses application is to get your -terminal messed up when the application dies without restoring the -terminal to its previous state. In Python this commonly happens when -your code is buggy and raises an uncaught exception. Keys are no -longer be echoed to the screen when you type them, for example, which -makes using the shell difficult. - -In Python you can avoid these complications and make debugging much -easier by importing the module \module{curses.wrapper}. It supplies a -\function{wrapper()} function that takes a callable. 
It does the -initializations described above, and also initializes colors if color -support is present. It then runs your provided callable and finally -deinitializes appropriately. The callable is called inside a try-catch -clause which catches exceptions, performs curses deinitialization, and -then passes the exception upwards. Thus, your terminal won't be left -in a funny state on exception. - -\section{Windows and Pads} - -Windows are the basic abstraction in curses. A window object -represents a rectangular area of the screen, and supports various -methods to display text, erase it, allow the user to input strings, -and so forth. - -The \code{stdscr} object returned by the \function{initscr()} function -is a window object that covers the entire screen. Many programs may -need only this single window, but you might wish to divide the screen -into smaller windows, in order to redraw or clear them separately. -The \function{newwin()} function creates a new window of a given size, -returning the new window object. - -\begin{verbatim} -begin_x = 20 ; begin_y = 7 -height = 5 ; width = 40 -win = curses.newwin(height, width, begin_y, begin_x) -\end{verbatim} - -A word about the coordinate system used in curses: coordinates are -always passed in the order \emph{y,x}, and the top-left corner of a -window is coordinate (0,0). This breaks a common convention for -handling coordinates, where the \emph{x} coordinate usually comes -first. This is an unfortunate difference from most other computer -applications, but it's been part of curses since it was first written, -and it's too late to change things now. - -When you call a method to display or erase text, the effect doesn't -immediately show up on the display. This is because curses was -originally written with slow 300-baud terminal connections in mind; -with these terminals, minimizing the time required to redraw the -screen is very important. This lets curses accumulate changes to the -screen, and display them in the most efficient manner. For example, -if your program displays some characters in a window, and then clears -the window, there's no need to send the original characters because -they'd never be visible. - -Accordingly, curses requires that you explicitly tell it to redraw -windows, using the \function{refresh()} method of window objects. In -practice, this doesn't really complicate programming with curses much. -Most programs go into a flurry of activity, and then pause waiting for -a keypress or some other action on the part of the user. All you have -to do is to be sure that the screen has been redrawn before pausing to -wait for user input, by simply calling \code{stdscr.refresh()} or the -\function{refresh()} method of some other relevant window. - -A pad is a special case of a window; it can be larger than the actual -display screen, and only a portion of it displayed at a time. -Creating a pad simply requires the pad's height and width, while -refreshing a pad requires giving the coordinates of the on-screen -area where a subsection of the pad will be displayed. 
- -\begin{verbatim} -pad = curses.newpad(100, 100) -# These loops fill the pad with letters; this is -# explained in the next section -for y in range(0, 100): - for x in range(0, 100): - try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 ) - except curses.error: pass - -# Displays a section of the pad in the middle of the screen -pad.refresh( 0,0, 5,5, 20,75) -\end{verbatim} - -The \function{refresh()} call displays a section of the pad in the -rectangle extending from coordinate (5,5) to coordinate (20,75) on the -screen; the upper left corner of the displayed section is coordinate -(0,0) on the pad. Beyond that difference, pads are exactly like -ordinary windows and support the same methods. - -If you have multiple windows and pads on screen there is a more -efficient way to go, which will prevent annoying screen flicker at -refresh time. Use the \method{noutrefresh()} method -of each window to update the data structure -representing the desired state of the screen; then change the physical -screen to match the desired state in one go with the function -\function{doupdate()}. The normal \method{refresh()} method calls -\function{doupdate()} as its last act. - -\section{Displaying Text} - -{}From a C programmer's point of view, curses may sometimes look like -a twisty maze of functions, all subtly different. For example, -\function{addstr()} displays a string at the current cursor location -in the \code{stdscr} window, while \function{mvaddstr()} moves to a -given y,x coordinate first before displaying the string. -\function{waddstr()} is just like \function{addstr()}, but allows -specifying a window to use, instead of using \code{stdscr} by default. -\function{mvwaddstr()} follows similarly. - -Fortunately the Python interface hides all these details; -\code{stdscr} is a window object like any other, and methods like -\function{addstr()} accept multiple argument forms. Usually there are -four different forms. - -\begin{tableii}{|c|l|}{textrm}{Form}{Description} -\lineii{\var{str} or \var{ch}}{Display the string \var{str} or -character \var{ch} at the current position} -\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or -character \var{ch}, using attribute \var{attr} at the current position} -\lineii{\var{y}, \var{x}, \var{str} or \var{ch}} -{Move to position \var{y,x} within the window, and display \var{str} -or \var{ch}} -\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}} -{Move to position \var{y,x} within the window, and display \var{str} -or \var{ch}, using attribute \var{attr}} -\end{tableii} - -Attributes allow displaying text in highlighted forms, such as in -boldface, underline, reverse code, or in color. They'll be explained -in more detail in the next subsection. - -The \function{addstr()} function takes a Python string as the value to -be displayed, while the \function{addch()} functions take a character, -which can be either a Python string of length 1 or an integer. If -it's a string, you're limited to displaying characters between 0 and -255. SVr4 curses provides constants for extension characters; these -constants are integers greater than 255. For example, -\constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is -the upper left corner of a box (handy for drawing borders). - -Windows remember where the cursor was left after the last operation, -so if you leave out the \var{y,x} coordinates, the string or character -will be displayed wherever the last operation left off. 
You can also -move the cursor with the \function{move(\var{y,x})} method. Because -some terminals always display a flashing cursor, you may want to -ensure that the cursor is positioned in some location where it won't -be distracting; it can be confusing to have the cursor blinking at -some apparently random location. - -If your application doesn't need a blinking cursor at all, you can -call \function{curs_set(0)} to make it invisible. Equivalently, and -for compatibility with older curses versions, there's a -\function{leaveok(\var{bool})} function. When \var{bool} is true, the -curses library will attempt to suppress the flashing cursor, and you -won't need to worry about leaving it in odd locations. - -\subsection{Attributes and Color} - -Characters can be displayed in different ways. Status lines in a -text-based application are commonly shown in reverse video; a text -viewer may need to highlight certain words. curses supports this by -allowing you to specify an attribute for each cell on the screen. - -An attribute is a integer, each bit representing a different -attribute. You can try to display text with multiple attribute bits -set, but curses doesn't guarantee that all the possible combinations -are available, or that they're all visually distinct. That depends on -the ability of the terminal being used, so it's safest to stick to the -most commonly available attributes, listed here. - -\begin{tableii}{|c|l|}{constant}{Attribute}{Description} -\lineii{A_BLINK}{Blinking text} -\lineii{A_BOLD}{Extra bright or bold text} -\lineii{A_DIM}{Half bright text} -\lineii{A_REVERSE}{Reverse-video text} -\lineii{A_STANDOUT}{The best highlighting mode available} -\lineii{A_UNDERLINE}{Underlined text} -\end{tableii} - -So, to display a reverse-video status line on the top line of the -screen, -you could code: - -\begin{verbatim} -stdscr.addstr(0, 0, "Current mode: Typing mode", - curses.A_REVERSE) -stdscr.refresh() -\end{verbatim} - -The curses library also supports color on those terminals that -provide it, The most common such terminal is probably the Linux -console, followed by color xterms. - -To use color, you must call the \function{start_color()} function soon -after calling \function{initscr()}, to initialize the default color -set (the \function{curses.wrapper.wrapper()} function does this -automatically). Once that's done, the \function{has_colors()} -function returns TRUE if the terminal in use can actually display -color. (Note: curses uses the American spelling 'color', instead of -the Canadian/British spelling 'colour'. If you're used to the British -spelling, you'll have to resign yourself to misspelling it for the -sake of these functions.) - -The curses library maintains a finite number of color pairs, -containing a foreground (or text) color and a background color. You -can get the attribute value corresponding to a color pair with the -\function{color_pair()} function; this can be bitwise-OR'ed with other -attributes such as \constant{A_REVERSE}, but again, such combinations -are not guaranteed to work on all terminals. - -An example, which displays a line of text using color pair 1: - -\begin{verbatim} -stdscr.addstr( "Pretty text", curses.color_pair(1) ) -stdscr.refresh() -\end{verbatim} - -As I said before, a color pair consists of a foreground and -background color. \function{start_color()} initializes 8 basic -colors when it activates color mode. They are: 0:black, 1:red, -2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white. 
The curses -module defines named constants for each of these colors: -\constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so -forth. - -The \function{init_pair(\var{n, f, b})} function changes the -definition of color pair \var{n}, to foreground color {f} and -background color {b}. Color pair 0 is hard-wired to white on black, -and cannot be changed. - -Let's put all this together. To change color 1 to red -text on a white background, you would call: - -\begin{verbatim} -curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE) -\end{verbatim} - -When you change a color pair, any text already displayed using that -color pair will change to the new colors. You can also display new -text in this color with: - -\begin{verbatim} -stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) ) -\end{verbatim} - -Very fancy terminals can change the definitions of the actual colors -to a given RGB value. This lets you change color 1, which is usually -red, to purple or blue or any other color you like. Unfortunately, -the Linux console doesn't support this, so I'm unable to try it out, -and can't provide any examples. You can check if your terminal can do -this by calling \function{can_change_color()}, which returns TRUE if -the capability is there. If you're lucky enough to have such a -talented terminal, consult your system's man pages for more -information. - -\section{User Input} - -The curses library itself offers only very simple input mechanisms. -Python's support adds a text-input widget that makes up some of the -lack. - -The most common way to get input to a window is to use its -\method{getch()} method. \method{getch()} pauses and waits for the -user to hit a key, displaying it if \function{echo()} has been called -earlier. You can optionally specify a coordinate to which the cursor -should be moved before pausing. - -It's possible to change this behavior with the method -\method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for -the window becomes non-blocking and returns \code{curses.ERR} (a value -of -1) when no input is ready. There's also a \function{halfdelay()} -function, which can be used to (in effect) set a timer on each -\method{getch()}; if no input becomes available within the number of -milliseconds specified as the argument to \function{halfdelay()}, -curses raises an exception. - -The \method{getch()} method returns an integer; if it's between 0 and -255, it represents the ASCII code of the key pressed. Values greater -than 255 are special keys such as Page Up, Home, or the cursor keys. -You can compare the value returned to constants such as -\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or -\constant{curses.KEY_LEFT}. Usually the main loop of your program -will look something like this: - -\begin{verbatim} -while 1: - c = stdscr.getch() - if c == ord('p'): PrintDocument() - elif c == ord('q'): break # Exit the while() - elif c == curses.KEY_HOME: x = y = 0 -\end{verbatim} - -The \module{curses.ascii} module supplies ASCII class membership -functions that take either integer or 1-character-string -arguments; these may be useful in writing more readable tests for -your command interpreters. It also supplies conversion functions -that take either integer or 1-character-string arguments and return -the same type. For example, \function{curses.ascii.ctrl()} returns -the control character corresponding to its argument. - -There's also a method to retrieve an entire string, -\constant{getstr()}. 
It isn't used very often, because its -functionality is quite limited; the only editing keys available are -the backspace key and the Enter key, which terminates the string. It -can optionally be limited to a fixed number of characters. - -\begin{verbatim} -curses.echo() # Enable echoing of characters - -# Get a 15-character string, with the cursor on the top line -s = stdscr.getstr(0,0, 15) -\end{verbatim} - -The Python \module{curses.textpad} module supplies something better. -With it, you can turn a window into a text box that supports an -Emacs-like set of keybindings. Various methods of \class{Textbox} -class support editing with input validation and gathering the edit -results either with or without trailing spaces. See the library -documentation on \module{curses.textpad} for the details. - -\section{For More Information} - -This HOWTO didn't cover some advanced topics, such as screen-scraping -or capturing mouse events from an xterm instance. But the Python -library page for the curses modules is now pretty complete. You -should browse it next. - -If you're in doubt about the detailed behavior of any of the ncurses -entry points, consult the manual pages for your curses implementation, -whether it's ncurses or a proprietary Unix vendor's. The manual pages -will document any quirks, and provide complete lists of all the -functions, attributes, and \constant{ACS_*} characters available to -you. - -Because the curses API is so large, some functions aren't supported in -the Python interface, not because they're difficult to implement, but -because no one has needed them yet. Feel free to add them and then -submit a patch. Also, we don't yet have support for the menus or -panels libraries associated with ncurses; feel free to add that. - -If you write an interesting little program, feel free to contribute it -as another demo. We can always use more of them! - -The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html} - -\end{document} diff --git a/Doc/howto/doanddont.tex b/Doc/howto/doanddont.tex deleted file mode 100644 index b54f069..0000000 --- a/Doc/howto/doanddont.tex +++ /dev/null @@ -1,332 +0,0 @@ -\documentclass{howto} - -\title{Idioms and Anti-Idioms in Python} - -\release{0.00} - -\author{Moshe Zadka} -\authoraddress{howto@zadka.site.co.il} - -\begin{document} -\maketitle - -This document is placed in the public doman. - -\begin{abstract} -\noindent -This document can be considered a companion to the tutorial. It -shows how to use Python, and even more importantly, how {\em not} -to use Python. -\end{abstract} - -\tableofcontents - -\section{Language Constructs You Should Not Use} - -While Python has relatively few gotchas compared to other languages, it -still has some constructs which are only useful in corner cases, or are -plain dangerous. - -\subsection{from module import *} - -\subsubsection{Inside Function Definitions} - -\code{from module import *} is {\em invalid} inside function definitions. -While many versions of Python do not check for the invalidity, it does not -make it more valid, no more then having a smart lawyer makes a man innocent. -Do not use it like that ever. Even in versions where it was accepted, it made -the function execution slower, because the compiler could not be certain -which names are local and which are global. In Python 2.1 this construct -causes warnings, and sometimes even errors. - -\subsubsection{At Module Level} - -While it is valid to use \code{from module import *} at module level it -is usually a bad idea. 
For one, this loses an important property Python -otherwise has --- you can know where each toplevel name is defined by -a simple "search" function in your favourite editor. You also open yourself -to trouble in the future, if some module grows additional functions or -classes. - -One of the most awful question asked on the newsgroup is why this code: - -\begin{verbatim} -f = open("www") -f.read() -\end{verbatim} - -does not work. Of course, it works just fine (assuming you have a file -called "www".) But it does not work if somewhere in the module, the -statement \code{from os import *} is present. The \module{os} module -has a function called \function{open()} which returns an integer. While -it is very useful, shadowing builtins is one of its least useful properties. - -Remember, you can never know for sure what names a module exports, so either -take what you need --- \code{from module import name1, name2}, or keep them in -the module and access on a per-need basis --- -\code{import module;print module.name}. - -\subsubsection{When It Is Just Fine} - -There are situations in which \code{from module import *} is just fine: - -\begin{itemize} - -\item The interactive prompt. For example, \code{from math import *} makes - Python an amazing scientific calculator. - -\item When extending a module in C with a module in Python. - -\item When the module advertises itself as \code{from import *} safe. - -\end{itemize} - -\subsection{Unadorned \function{exec} and friends} - -The word ``unadorned'' refers to the use without an explicit dictionary, -in which case those constructs evaluate code in the {\em current} environment. -This is dangerous for the same reasons \code{from import *} is dangerous --- -it might step over variables you are counting on and mess up things for -the rest of your code. Simply do not do that. - -Bad examples: - -\begin{verbatim} ->>> for name in sys.argv[1:]: ->>> exec("%s=1" % name) ->>> def func(s, **kw): ->>> for var, val in kw.items(): ->>> exec("s.%s=val" % var) # invalid! ->>> exec(open("handler.py").read()) ->>> handle() -\end{verbatim} - -Good examples: - -\begin{verbatim} ->>> d = {} ->>> for name in sys.argv[1:]: ->>> d[name] = 1 ->>> def func(s, **kw): ->>> for var, val in kw.items(): ->>> setattr(s, var, val) ->>> d={} ->>> exec(open("handler.py").read(), d, d) ->>> handle = d['handle'] ->>> handle() -\end{verbatim} - -\subsection{from module import name1, name2} - -This is a ``don't'' which is much weaker then the previous ``don't''s -but is still something you should not do if you don't have good reasons -to do that. The reason it is usually bad idea is because you suddenly -have an object which lives in two seperate namespaces. When the binding -in one namespace changes, the binding in the other will not, so there -will be a discrepancy between them. This happens when, for example, -one module is reloaded, or changes the definition of a function at runtime. - -Bad example: - -\begin{verbatim} -# foo.py -a = 1 - -# bar.py -from foo import a -if something(): - a = 2 # danger: foo.a != a -\end{verbatim} - -Good example: - -\begin{verbatim} -# foo.py -a = 1 - -# bar.py -import foo -if something(): - foo.a = 2 -\end{verbatim} - -\subsection{except:} - -Python has the \code{except:} clause, which catches all exceptions. -Since {\em every} error in Python raises an exception, this makes many -programming errors look like runtime problems, and hinders -the debugging process. 
- -The following code shows a great example: - -\begin{verbatim} -try: - foo = opne("file") # misspelled "open" -except: - sys.exit("could not open file!") -\end{verbatim} - -The second line triggers a \exception{NameError} which is caught by the -except clause. The program will exit, and you will have no idea that -this has nothing to do with the readability of \code{"file"}. - -The example above is better written - -\begin{verbatim} -try: - foo = opne("file") # will be changed to "open" as soon as we run it -except IOError: - sys.exit("could not open file") -\end{verbatim} - -There are some situations in which the \code{except:} clause is useful: -for example, in a framework when running callbacks, it is good not to -let any callback disturb the framework. - -\section{Exceptions} - -Exceptions are a useful feature of Python. You should learn to raise -them whenever something unexpected occurs, and catch them only where -you can do something about them. - -The following is a very popular anti-idiom - -\begin{verbatim} -def get_status(file): - if not os.path.exists(file): - print "file not found" - sys.exit(1) - return open(file).readline() -\end{verbatim} - -Consider the case the file gets deleted between the time the call to -\function{os.path.exists} is made and the time \function{open} is called. -That means the last line will throw an \exception{IOError}. The same would -happen if \var{file} exists but has no read permission. Since testing this -on a normal machine on existing and non-existing files make it seem bugless, -that means in testing the results will seem fine, and the code will get -shipped. Then an unhandled \exception{IOError} escapes to the user, who -has to watch the ugly traceback. - -Here is a better way to do it. - -\begin{verbatim} -def get_status(file): - try: - return open(file).readline() - except (IOError, OSError): - print "file not found" - sys.exit(1) -\end{verbatim} - -In this version, *either* the file gets opened and the line is read -(so it works even on flaky NFS or SMB connections), or the message -is printed and the application aborted. - -Still, \function{get_status} makes too many assumptions --- that it -will only be used in a short running script, and not, say, in a long -running server. Sure, the caller could do something like - -\begin{verbatim} -try: - status = get_status(log) -except SystemExit: - status = None -\end{verbatim} - -So, try to make as few \code{except} clauses in your code --- those will -usually be a catch-all in the \function{main}, or inside calls which -should always succeed. - -So, the best version is probably - -\begin{verbatim} -def get_status(file): - return open(file).readline() -\end{verbatim} - -The caller can deal with the exception if it wants (for example, if it -tries several files in a loop), or just let the exception filter upwards -to {\em its} caller. - -The last version is not very good either --- due to implementation details, -the file would not be closed when an exception is raised until the handler -finishes, and perhaps not at all in non-C implementations (e.g., Jython). - -\begin{verbatim} -def get_status(file): - fp = open(file) - try: - return fp.readline() - finally: - fp.close() -\end{verbatim} - -\section{Using the Batteries} - -Every so often, people seem to be writing stuff in the Python library -again, usually poorly. While the occasional module has a poor interface, -it is usually much better to use the rich standard library and data -types that come with Python then inventing your own. 
- -A useful module very few people know about is \module{os.path}. It -always has the correct path arithmetic for your operating system, and -will usually be much better then whatever you come up with yourself. - -Compare: - -\begin{verbatim} -# ugh! -return dir+"/"+file -# better -return os.path.join(dir, file) -\end{verbatim} - -More useful functions in \module{os.path}: \function{basename}, -\function{dirname} and \function{splitext}. - -There are also many useful builtin functions people seem not to be -aware of for some reason: \function{min()} and \function{max()} can -find the minimum/maximum of any sequence with comparable semantics, -for example, yet many people write their own -\function{max()}/\function{min()}. - -On the same note, note that \function{float()}, \function{int()} and -\function{long()} all accept arguments of type string, and so are -suited to parsing --- assuming you are ready to deal with the -\exception{ValueError} they raise. - -\section{Using Backslash to Continue Statements} - -Since Python treats a newline as a statement terminator, -and since statements are often more then is comfortable to put -in one line, many people do: - -\begin{verbatim} -if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \ - calculate_number(10, 20) != forbulate(500, 360): - pass -\end{verbatim} - -You should realize that this is dangerous: a stray space after the -\code{\\} would make this line wrong, and stray spaces are notoriously -hard to see in editors. In this case, at least it would be a syntax -error, but if the code was: - -\begin{verbatim} -value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \ - + calculate_number(10, 20)*forbulate(500, 360) -\end{verbatim} - -then it would just be subtly wrong. - -It is usually much better to use the implicit continuation inside parenthesis: - -This version is bulletproof: - -\begin{verbatim} -value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9] - + calculate_number(10, 20)*forbulate(500, 360)) -\end{verbatim} - -\end{document} diff --git a/Doc/howto/functional.rst b/Doc/howto/functional.rst deleted file mode 100644 index 5a55339..0000000 --- a/Doc/howto/functional.rst +++ /dev/null @@ -1,1474 +0,0 @@ -Functional Programming HOWTO -================================ - -**Version 0.30** - -(This is a first draft. Please send comments/error -reports/suggestions to amk@amk.ca. This URL is probably not going to -be the final location of the document, so be careful about linking to -it -- you may want to add a disclaimer.) - -In this document, we'll take a tour of Python's features suitable for -implementing programs in a functional style. After an introduction to -the concepts of functional programming, we'll look at language -features such as iterators and generators and relevant library modules -such as ``itertools`` and ``functools``. - - -.. contents:: - -Introduction ----------------------- - -This section explains the basic concept of functional programming; if -you're just interested in learning about Python language features, -skip to the next section. - -Programming languages support decomposing problems in several different -ways: - -* Most programming languages are **procedural**: - programs are lists of instructions that tell the computer what to - do with the program's input. - C, Pascal, and even Unix shells are procedural languages. - -* In **declarative** languages, you write a specification that describes - the problem to be solved, and the language implementation figures out - how to perform the computation efficiently. 
SQL is the declarative - language you're most likely to be familiar with; a SQL query describes - the data set you want to retrieve, and the SQL engine decides whether to - scan tables or use indexes, which subclauses should be performed first, - etc. - -* **Object-oriented** programs manipulate collections of objects. - Objects have internal state and support methods that query or modify - this internal state in some way. Smalltalk and Java are - object-oriented languages. C++ and Python are languages that - support object-oriented programming, but don't force the use - of object-oriented features. - -* **Functional** programming decomposes a problem into a set of functions. - Ideally, functions only take inputs and produce outputs, and don't have any - internal state that affects the output produced for a given input. - Well-known functional languages include the ML family (Standard ML, - OCaml, and other variants) and Haskell. - -The designers of some computer languages have chosen one approach to -programming that's emphasized. This often makes it difficult to -write programs that use a different approach. Other languages are -multi-paradigm languages that support several different approaches. Lisp, -C++, and Python are multi-paradigm; you can write programs or -libraries that are largely procedural, object-oriented, or functional -in all of these languages. In a large program, different sections -might be written using different approaches; the GUI might be object-oriented -while the processing logic is procedural or functional, for example. - -In a functional program, input flows through a set of functions. Each -function operates on its input and produces some output. Functional -style frowns upon functions with side effects that modify internal -state or make other changes that aren't visible in the function's -return value. Functions that have no side effects at all are -called **purely functional**. -Avoiding side effects means not using data structures -that get updated as a program runs; every function's output -must only depend on its input. - -Some languages are very strict about purity and don't even have -assignment statements such as ``a=3`` or ``c = a + b``, but it's -difficult to avoid all side effects. Printing to the screen or -writing to a disk file are side effects, for example. For example, in -Python a ``print`` statement or a ``time.sleep(1)`` both return no -useful value; they're only called for their side effects of sending -some text to the screen or pausing execution for a second. - -Python programs written in functional style usually won't go to the -extreme of avoiding all I/O or all assignments; instead, they'll -provide a functional-appearing interface but will use non-functional -features internally. For example, the implementation of a function -will still use assignments to local variables, but won't modify global -variables or have other side effects. - -Functional programming can be considered the opposite of -object-oriented programming. Objects are little capsules containing -some internal state along with a collection of method calls that let -you modify this state, and programs consist of making the right set of -state changes. Functional programming wants to avoid state changes as -much as possible and works with data flowing between functions. In -Python you might combine the two approaches by writing functions that -take and return instances representing objects in your application -(e-mail messages, transactions, etc.). 
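As a minimal sketch of that combined style (the ``Message`` class below is
hypothetical, not taken from any library), a function can accept an instance
and return a new one rather than mutating anything::

    class Message:
        def __init__(self, subject, body):
            self.subject = subject
            self.body = body

    def with_reply_prefix(msg):
        # Pure function: the result depends only on the input,
        # and the original message is left untouched.
        return Message('Re: ' + msg.subject, msg.body)

    reply = with_reply_prefix(Message('Status', 'All tests pass.'))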
- -Functional design may seem like an odd constraint to work under. Why -should you avoid objects and side effects? There are theoretical and -practical advantages to the functional style: - -* Formal provability. -* Modularity. -* Composability. -* Ease of debugging and testing. - -Formal provability -'''''''''''''''''''''' - -A theoretical benefit is that it's easier to construct a mathematical proof -that a functional program is correct. - -For a long time researchers have been interested in finding ways to -mathematically prove programs correct. This is different from testing -a program on numerous inputs and concluding that its output is usually -correct, or reading a program's source code and concluding that the -code looks right; the goal is instead a rigorous proof that a program -produces the right result for all possible inputs. - -The technique used to prove programs correct is to write down -**invariants**, properties of the input data and of the program's -variables that are always true. For each line of code, you then show -that if invariants X and Y are true **before** the line is executed, -the slightly different invariants X' and Y' are true **after** -the line is executed. This continues until you reach the end of the -program, at which point the invariants should match the desired -conditions on the program's output. - -Functional programming's avoidance of assignments arose because -assignments are difficult to handle with this technique; -assignments can break invariants that were true before the assignment -without producing any new invariants that can be propagated onward. - -Unfortunately, proving programs correct is largely impractical and not -relevant to Python software. Even trivial programs require proofs that -are several pages long; the proof of correctness for a moderately -complicated program would be enormous, and few or none of the programs -you use daily (the Python interpreter, your XML parser, your web -browser) could be proven correct. Even if you wrote down or generated -a proof, there would then be the question of verifying the proof; -maybe there's an error in it, and you wrongly believe you've proved -the program correct. - -Modularity -'''''''''''''''''''''' - -A more practical benefit of functional programming is that it forces -you to break apart your problem into small pieces. Programs are more -modular as a result. It's easier to specify and write a small -function that does one thing than a large function that performs a -complicated transformation. Small functions are also easier to read -and to check for errors. - - -Ease of debugging and testing -'''''''''''''''''''''''''''''''''' - -Testing and debugging a functional-style program is easier. - -Debugging is simplified because functions are generally small and -clearly specified. When a program doesn't work, each function is an -interface point where you can check that the data are correct. You -can look at the intermediate inputs and outputs to quickly isolate the -function that's responsible for a bug. - -Testing is easier because each function is a potential subject for a -unit test. Functions don't depend on system state that needs to be -replicated before running a test; instead you only have to synthesize -the right input and then check that the output matches expectations. - - - -Composability -'''''''''''''''''''''' - -As you work on a functional-style program, you'll write a number of -functions with varying inputs and outputs. 
Some of these functions -will be unavoidably specialized to a particular application, but -others will be useful in a wide variety of programs. For example, a -function that takes a directory path and returns all the XML files in -the directory, or a function that takes a filename and returns its -contents, can be applied to many different situations. - -Over time you'll form a personal library of utilities. Often you'll -assemble new programs by arranging existing functions in a new -configuration and writing a few functions specialized for the current -task. - - - -Iterators ------------------------ - -I'll start by looking at a Python language feature that's an important -foundation for writing functional-style programs: iterators. - -An iterator is an object representing a stream of data; this object -returns the data one element at a time. A Python iterator must -support a method called ``__next__()`` that takes no arguments and always -returns the next element of the stream. If there are no more elements -in the stream, ``__next__()`` must raise the ``StopIteration`` exception. -Iterators don't have to be finite, though; it's perfectly reasonable -to write an iterator that produces an infinite stream of data. -The built-in ``next()`` function is normally used to call the iterator's -``__next__()`` method. - -The built-in ``iter()`` function takes an arbitrary object and tries -to return an iterator that will return the object's contents or -elements, raising ``TypeError`` if the object doesn't support -iteration. Several of Python's built-in data types support iteration, -the most common being lists and dictionaries. An object is called -an **iterable** object if you can get an iterator for it. - -You can experiment with the iteration interface manually:: - - >>> L = [1,2,3] - >>> it = iter(L) - >>> print it - <iterator object at 0x8116870> - >>> next(it) - 1 - >>> next(it) - 2 - >>> next(it) - 3 - >>> next(it) - Traceback (most recent call last): - File "<stdin>", line 1, in ? - StopIteration - >>> - -Python expects iterable objects in several different contexts, the -most important being the ``for`` statement. In the statement ``for X in Y``, -Y must be an iterator or some object for which ``iter()`` can create -an iterator. These two statements are equivalent:: - - for i in iter(obj): - print i - - for i in obj: - print i - -Iterators can be materialized as lists or tuples by using the -``list()`` or ``tuple()`` constructor functions:: - - >>> L = [1,2,3] - >>> iterator = iter(L) - >>> t = tuple(iterator) - >>> t - (1, 2, 3) - -Sequence unpacking also supports iterators: if you know an iterator -will return N elements, you can unpack them into an N-tuple:: - - >>> L = [1,2,3] - >>> iterator = iter(L) - >>> a,b,c = iterator - >>> a,b,c - (1, 2, 3) - -Built-in functions such as ``max()`` and ``min()`` can take a single -iterator argument and will return the largest or smallest element. -The ``"in"`` and ``"not in"`` operators also support iterators: ``X in -iterator`` is true if X is found in the stream returned by the -iterator. You'll run into obvious problems if the iterator is -infinite; ``max()``, ``min()``, and ``"not in"`` will never return, and -if the element X never appears in the stream, the ``"in"`` operator -won't return either. - -Note that you can only go forward in an iterator; there's no way to -get the previous element, reset the iterator, or make a copy of it. 
-Iterator objects can optionally provide these additional capabilities, -but the iterator protocol only specifies the ``__next__()`` method. -Functions may therefore consume all of the iterator's output, and if -you need to do something different with the same stream, you'll have -to create a new iterator. - - - -Data Types That Support Iterators -''''''''''''''''''''''''''''''''''' - -We've already seen how lists and tuples support iterators. In fact, -any Python sequence type, such as strings, will automatically support -creation of an iterator. - -Calling ``iter()`` on a dictionary returns an iterator that will loop -over the dictionary's keys:: - - >>> m = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, - ... 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12} - >>> for key in m: - ... print key, m[key] - Mar 3 - Feb 2 - Aug 8 - Sep 9 - May 5 - Jun 6 - Jul 7 - Jan 1 - Apr 4 - Nov 11 - Dec 12 - Oct 10 - -Note that the order is essentially random, because it's based on the -hash ordering of the objects in the dictionary. - -Applying ``iter()`` to a dictionary always loops over the keys, but -dictionaries have methods that return other iterators. If you want to -iterate over keys, values, or key/value pairs, you can explicitly call -the ``iterkeys()``, ``itervalues()``, or ``iteritems()`` methods to -get an appropriate iterator. - -The ``dict()`` constructor can accept an iterator that returns a -finite stream of ``(key, value)`` tuples:: - - >>> L = [('Italy', 'Rome'), ('France', 'Paris'), ('US', 'Washington DC')] - >>> dict(iter(L)) - {'Italy': 'Rome', 'US': 'Washington DC', 'France': 'Paris'} - -Files also support iteration by calling the ``readline()`` -method until there are no more lines in the file. This means you can -read each line of a file like this:: - - for line in file: - # do something for each line - ... - -Sets can take their contents from an iterable and let you iterate over -the set's elements:: - - S = set((2, 3, 5, 7, 11, 13)) - for i in S: - print i - - - -Generator expressions and list comprehensions ----------------------------------------------------- - -Two common operations on an iterator's output are 1) performing some -operation for every element, 2) selecting a subset of elements that -meet some condition. For example, given a list of strings, you might -want to strip off trailing whitespace from each line or extract all -the strings containing a given substring. - -List comprehensions and generator expressions (short form: "listcomps" -and "genexps") are a concise notation for such operations, borrowed -from the functional programming language Haskell -(http://www.haskell.org). You can strip all the whitespace from a -stream of strings with the following code:: - - line_list = [' line 1\n', 'line 2 \n', ...] - - # Generator expression -- returns iterator - stripped_iter = (line.strip() for line in line_list) - - # List comprehension -- returns list - stripped_list = [line.strip() for line in line_list] - -You can select only certain elements by adding an ``"if"`` condition:: - - stripped_list = [line.strip() for line in line_list - if line != ""] - -With a list comprehension, you get back a Python list; -``stripped_list`` is a list containing the resulting lines, not an -iterator. Generator expressions return an iterator that computes the -values as necessary, not needing to materialize all the values at -once. 
This means that list comprehensions aren't useful if you're -working with iterators that return an infinite stream or a very large -amount of data. Generator expressions are preferable in these -situations. - -Generator expressions are surrounded by parentheses ("()") and list -comprehensions are surrounded by square brackets ("[]"). Generator -expressions have the form:: - - ( expression for expr in sequence1 - if condition1 - for expr2 in sequence2 - if condition2 - for expr3 in sequence3 ... - if condition3 - for exprN in sequenceN - if conditionN ) - -Again, for a list comprehension only the outside brackets are -different (square brackets instead of parentheses). - -The elements of the generated output will be the successive values of -``expression``. The ``if`` clauses are all optional; if present, -``expression`` is only evaluated and added to the result when -``condition`` is true. - -Generator expressions always have to be written inside parentheses, -but the parentheses signalling a function call also count. If you -want to create an iterator that will be immediately passed to a -function you can write:: - - obj_total = sum(obj.count for obj in list_all_objects()) - -The ``for...in`` clauses contain the sequences to be iterated over. -The sequences do not have to be the same length, because they are -iterated over from left to right, **not** in parallel. For each -element in ``sequence1``, ``sequence2`` is looped over from the -beginning. ``sequence3`` is then looped over for each -resulting pair of elements from ``sequence1`` and ``sequence2``. - -To put it another way, a list comprehension or generator expression is -equivalent to the following Python code:: - - for expr1 in sequence1: - if not (condition1): - continue # Skip this element - for expr2 in sequence2: - if not (condition2): - continue # Skip this element - ... - for exprN in sequenceN: - if not (conditionN): - continue # Skip this element - - # Output the value of - # the expression. - -This means that when there are multiple ``for...in`` clauses but no -``if`` clauses, the length of the resulting output will be equal to -the product of the lengths of all the sequences. If you have two -lists of length 3, the output list is 9 elements long:: - - seq1 = 'abc' - seq2 = (1,2,3) - >>> [ (x,y) for x in seq1 for y in seq2] - [('a', 1), ('a', 2), ('a', 3), - ('b', 1), ('b', 2), ('b', 3), - ('c', 1), ('c', 2), ('c', 3)] - -To avoid introducing an ambiguity into Python's grammar, if -``expression`` is creating a tuple, it must be surrounded with -parentheses. The first list comprehension below is a syntax error, -while the second one is correct:: - - # Syntax error - [ x,y for x in seq1 for y in seq2] - # Correct - [ (x,y) for x in seq1 for y in seq2] - - -Generators ------------------------ - -Generators are a special class of functions that simplify the task of -writing iterators. Regular functions compute a value and return it, -but generators return an iterator that returns a stream of values. - -You're doubtless familiar with how regular function calls work in -Python or C. When you call a function, it gets a private namespace -where its local variables are created. When the function reaches a -``return`` statement, the local variables are destroyed and the -value is returned to the caller. A later call to the same function -creates a new private namespace and a fresh set of local -variables. But, what if the local variables weren't thrown away on -exiting a function? 
What if you could later resume the function where -it left off? This is what generators provide; they can be thought of -as resumable functions. - -Here's the simplest example of a generator function:: - - def generate_ints(N): - for i in range(N): - yield i - -Any function containing a ``yield`` keyword is a generator function; -this is detected by Python's bytecode compiler which compiles the -function specially as a result. - -When you call a generator function, it doesn't return a single value; -instead it returns a generator object that supports the iterator -protocol. On executing the ``yield`` expression, the generator -outputs the value of ``i``, similar to a ``return`` -statement. The big difference between ``yield`` and a -``return`` statement is that on reaching a ``yield`` the -generator's state of execution is suspended and local variables are -preserved. On the next call ``next(generator)``, -the function will resume executing. - -Here's a sample usage of the ``generate_ints()`` generator:: - - >>> gen = generate_ints(3) - >>> gen - <generator object at 0x8117f90> - >>> next(gen) - 0 - >>> next(gen) - 1 - >>> next(gen) - 2 - >>> next(gen) - Traceback (most recent call last): - File "stdin", line 1, in ? - File "stdin", line 2, in generate_ints - StopIteration - -You could equally write ``for i in generate_ints(5)``, or -``a,b,c = generate_ints(3)``. - -Inside a generator function, the ``return`` statement can only be used -without a value, and signals the end of the procession of values; -after executing a ``return`` the generator cannot return any further -values. ``return`` with a value, such as ``return 5``, is a syntax -error inside a generator function. The end of the generator's results -can also be indicated by raising ``StopIteration`` manually, or by -just letting the flow of execution fall off the bottom of the -function. - -You could achieve the effect of generators manually by writing your -own class and storing all the local variables of the generator as -instance variables. For example, returning a list of integers could -be done by setting ``self.count`` to 0, and having the -``__next__()`` method increment ``self.count`` and return it. -However, for a moderately complicated generator, writing a -corresponding class can be much messier. - -The test suite included with Python's library, ``test_generators.py``, -contains a number of more interesting examples. Here's one generator -that implements an in-order traversal of a tree using generators -recursively. - -:: - - # A recursive generator that generates Tree leaves in in-order. - def inorder(t): - if t: - for x in inorder(t.left): - yield x - - yield t.label - - for x in inorder(t.right): - yield x - -Two other examples in ``test_generators.py`` produce -solutions for the N-Queens problem (placing N queens on an NxN -chess board so that no queen threatens another) and the Knight's Tour -(finding a route that takes a knight to every square of an NxN chessboard -without visiting any square twice). - - - -Passing values into a generator -'''''''''''''''''''''''''''''''''''''''''''''' - -In Python 2.4 and earlier, generators only produced output. Once a -generator's code was invoked to create an iterator, there was no way to -pass any new information into the function when its execution is -resumed. You could hack together this ability by making the -generator look at a global variable or by passing in some mutable object -that callers then modify, but these approaches are messy. 
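
To see why these workarounds are considered messy, here is a minimal
sketch of the shared-mutable-object approach (the names
``counter_via_box`` and ``box`` are invented for illustration); compare
it with the much cleaner ``send()``-based counter shown below::

    def counter_via_box(maximum, box):
        i = 0
        while i < maximum:
            yield i
            if box:
                # The caller left a replacement value in the shared list.
                i = box.pop()
            else:
                i += 1

    box = []
    it = counter_via_box(10, box)
    next(it)        # 0
    next(it)        # 1
    box.append(8)   # communicate with the generator...
    next(it)        # ...which picks up 8 when it resumes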
- -In Python 2.5 there's a simple way to pass values into a generator. -``yield`` became an expression, returning a value that can be assigned -to a variable or otherwise operated on:: - - val = (yield i) - -I recommend that you **always** put parentheses around a ``yield`` -expression when you're doing something with the returned value, as in -the above example. The parentheses aren't always necessary, but it's -easier to always add them instead of having to remember when they're -needed. - -(PEP 342 explains the exact rules, which are that a -``yield``-expression must always be parenthesized except when it -occurs at the top-level expression on the right-hand side of an -assignment. This means you can write ``val = yield i`` but have to -use parentheses when there's an operation, as in ``val = (yield i) -+ 12``.) - -Values are sent into a generator by calling its -``send(value)`` method. This method resumes the -generator's code and the ``yield`` expression returns the specified -value. If the regular ``__next__()`` method is called, the -``yield`` returns ``None``. - -Here's a simple counter that increments by 1 and allows changing the -value of the internal counter. - -:: - - def counter (maximum): - i = 0 - while i < maximum: - val = (yield i) - # If value provided, change counter - if val is not None: - i = val - else: - i += 1 - -And here's an example of changing the counter: - - >>> it = counter(10) - >>> print next(it) - 0 - >>> print next(it) - 1 - >>> print it.send(8) - 8 - >>> print next(it) - 9 - >>> print next(it) - Traceback (most recent call last): - File ``t.py'', line 15, in ? - print next(it) - StopIteration - -Because ``yield`` will often be returning ``None``, you -should always check for this case. Don't just use its value in -expressions unless you're sure that the ``send()`` method -will be the only method used resume your generator function. - -In addition to ``send()``, there are two other new methods on -generators: - -* ``throw(type, value=None, traceback=None)`` is used to raise an exception inside the - generator; the exception is raised by the ``yield`` expression - where the generator's execution is paused. - -* ``close()`` raises a ``GeneratorExit`` - exception inside the generator to terminate the iteration. - On receiving this - exception, the generator's code must either raise - ``GeneratorExit`` or ``StopIteration``; catching the - exception and doing anything else is illegal and will trigger - a ``RuntimeError``. ``close()`` will also be called by - Python's garbage collector when the generator is garbage-collected. - - If you need to run cleanup code when a ``GeneratorExit`` occurs, - I suggest using a ``try: ... finally:`` suite instead of - catching ``GeneratorExit``. - -The cumulative effect of these changes is to turn generators from -one-way producers of information into both producers and consumers. - -Generators also become **coroutines**, a more generalized form of -subroutines. Subroutines are entered at one point and exited at -another point (the top of the function, and a ``return`` -statement), but coroutines can be entered, exited, and resumed at -many different points (the ``yield`` statements). - - -Built-in functions ----------------------------------------------- - -Let's look in more detail at built-in functions often used with iterators. - -Two Python's built-in functions, ``map()`` and ``filter()``, are -somewhat obsolete; they duplicate the features of list comprehensions -but return actual lists instead of iterators. 
- -``map(f, iterA, iterB, ...)`` returns a list containing ``f(iterA[0], -iterB[0]), f(iterA[1], iterB[1]), f(iterA[2], iterB[2]), ...``. - -:: - - def upper(s): - return s.upper() - map(upper, ['sentence', 'fragment']) => - ['SENTENCE', 'FRAGMENT'] - - [upper(s) for s in ['sentence', 'fragment']] => - ['SENTENCE', 'FRAGMENT'] - -As shown above, you can achieve the same effect with a list -comprehension. The ``itertools.imap()`` function does the same thing -but can handle infinite iterators; it'll be discussed later, in the section on -the ``itertools`` module. - -``filter(predicate, iter)`` returns a list -that contains all the sequence elements that meet a certain condition, -and is similarly duplicated by list comprehensions. -A **predicate** is a function that returns the truth value of -some condition; for use with ``filter()``, the predicate must take a -single value. - -:: - - def is_even(x): - return (x % 2) == 0 - - filter(is_even, range(10)) => - [0, 2, 4, 6, 8] - -This can also be written as a list comprehension:: - - >>> [x for x in range(10) if is_even(x)] - [0, 2, 4, 6, 8] - -``filter()`` also has a counterpart in the ``itertools`` module, -``itertools.ifilter()``, that returns an iterator and -can therefore handle infinite sequences just as ``itertools.imap()`` can. - -``reduce(func, iter, [initial_value])`` doesn't have a counterpart in -the ``itertools`` module because it cumulatively performs an operation -on all the iterable's elements and therefore can't be applied to -infinite iterables. ``func`` must be a function that takes two elements -and returns a single value. ``reduce()`` takes the first two elements -A and B returned by the iterator and calculates ``func(A, B)``. It -then requests the third element, C, calculates ``func(func(A, B), -C)``, combines this result with the fourth element returned, and -continues until the iterable is exhausted. If the iterable returns no -values at all, a ``TypeError`` exception is raised. If the initial -value is supplied, it's used as a starting point and -``func(initial_value, A)`` is the first calculation. - -:: - - import operator - reduce(operator.concat, ['A', 'BB', 'C']) => - 'ABBC' - reduce(operator.concat, []) => - TypeError: reduce() of empty sequence with no initial value - reduce(operator.mul, [1,2,3], 1) => - 6 - reduce(operator.mul, [], 1) => - 1 - -If you use ``operator.add`` with ``reduce()``, you'll add up all the -elements of the iterable. This case is so common that there's a special -built-in called ``sum()`` to compute it:: - - reduce(operator.add, [1,2,3,4], 0) => - 10 - sum([1,2,3,4]) => - 10 - sum([]) => - 0 - -For many uses of ``reduce()``, though, it can be clearer to just write -the obvious ``for`` loop:: - - # Instead of: - product = reduce(operator.mul, [1,2,3], 1) - - # You can write: - product = 1 - for i in [1,2,3]: - product *= i - - -``enumerate(iter)`` counts off the elements in the iterable, returning -2-tuples containing the count and each element. - -:: - - enumerate(['subject', 'verb', 'object']) => - (0, 'subject'), (1, 'verb'), (2, 'object') - -``enumerate()`` is often used when looping through a list -and recording the indexes at which certain conditions are met:: - - f = open('data.txt', 'r') - for i, line in enumerate(f): - if line.strip() == '': - print 'Blank line at line #%i' % i - -``sorted(iterable, [cmp=None], [key=None], [reverse=False)`` -collects all the elements of the iterable into a list, sorts -the list, and returns the sorted result. 
The ``cmp``, ``key``, -and ``reverse`` arguments are passed through to the -constructed list's ``.sort()`` method. - -:: - - import random - # Generate 8 random numbers between [0, 10000) - rand_list = random.sample(range(10000), 8) - rand_list => - [769, 7953, 9828, 6431, 8442, 9878, 6213, 2207] - sorted(rand_list) => - [769, 2207, 6213, 6431, 7953, 8442, 9828, 9878] - sorted(rand_list, reverse=True) => - [9878, 9828, 8442, 7953, 6431, 6213, 2207, 769] - -(For a more detailed discussion of sorting, see the Sorting mini-HOWTO -in the Python wiki at http://wiki.python.org/moin/HowTo/Sorting.) - -The ``any(iter)`` and ``all(iter)`` built-ins look at -the truth values of an iterable's contents. ``any()`` returns -True if any element in the iterable is a true value, and ``all()`` -returns True if all of the elements are true values:: - - any([0,1,0]) => - True - any([0,0,0]) => - False - any([1,1,1]) => - True - all([0,1,0]) => - False - all([0,0,0]) => - False - all([1,1,1]) => - True - - -Small functions and the lambda statement ----------------------------------------------- - -When writing functional-style programs, you'll often need little -functions that act as predicates or that combine elements in some way. - -If there's a Python built-in or a module function that's suitable, you -don't need to define a new function at all:: - - stripped_lines = [line.strip() for line in lines] - existing_files = filter(os.path.exists, file_list) - -If the function you need doesn't exist, you need to write it. One way -to write small functions is to use the ``lambda`` statement. ``lambda`` -takes a number of parameters and an expression combining these parameters, -and creates a small function that returns the value of the expression:: - - lowercase = lambda x: x.lower() - - print_assign = lambda name, value: name + '=' + str(value) - - adder = lambda x, y: x+y - -An alternative is to just use the ``def`` statement and define a -function in the usual way:: - - def lowercase(x): - return x.lower() - - def print_assign(name, value): - return name + '=' + str(value) - - def adder(x,y): - return x + y - -Which alternative is preferable? That's a style question; my usual -course is to avoid using ``lambda``. - -One reason for my preference is that ``lambda`` is quite limited in -the functions it can define. The result has to be computable as a -single expression, which means you can't have multiway -``if... elif... else`` comparisons or ``try... except`` statements. -If you try to do too much in a ``lambda`` statement, you'll end up -with an overly complicated expression that's hard to read. Quick, -what's the following code doing? - -:: - - total = reduce(lambda a, b: (0, a[1] + b[1]), items)[1] - -You can figure it out, but it takes time to disentangle the expression -to figure out what's going on. Using a short nested -``def`` statements makes things a little bit better:: - - def combine (a, b): - return 0, a[1] + b[1] - - total = reduce(combine, items)[1] - -But it would be best of all if I had simply used a ``for`` loop:: - - total = 0 - for a, b in items: - total += b - -Or the ``sum()`` built-in and a generator expression:: - - total = sum(b for a,b in items) - -Many uses of ``reduce()`` are clearer when written as ``for`` loops. - -Fredrik Lundh once suggested the following set of rules for refactoring -uses of ``lambda``: - -1) Write a lambda function. -2) Write a comment explaining what the heck that lambda does. 
-3) Study the comment for a while, and think of a name that captures - the essence of the comment. -4) Convert the lambda to a def statement, using that name. -5) Remove the comment. - -I really like these rules, but you're free to disagree that this -lambda-free style is better. - - -The itertools module ------------------------ - -The ``itertools`` module contains a number of commonly-used iterators -as well as functions for combining several iterators. This section -will introduce the module's contents by showing small examples. - -The module's functions fall into a few broad classes: - -* Functions that create a new iterator based on an existing iterator. -* Functions for treating an iterator's elements as function arguments. -* Functions for selecting portions of an iterator's output. -* A function for grouping an iterator's output. - -Creating new iterators -'''''''''''''''''''''' - -``itertools.count(n)`` returns an infinite stream of -integers, increasing by 1 each time. You can optionally supply the -starting number, which defaults to 0:: - - itertools.count() => - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... - itertools.count(10) => - 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ... - -``itertools.cycle(iter)`` saves a copy of the contents of a provided -iterable and returns a new iterator that returns its elements from -first to last. The new iterator will repeat these elements infinitely. - -:: - - itertools.cycle([1,2,3,4,5]) => - 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ... - -``itertools.repeat(elem, [n])`` returns the provided element ``n`` -times, or returns the element endlessly if ``n`` is not provided. - -:: - - itertools.repeat('abc') => - abc, abc, abc, abc, abc, abc, abc, abc, abc, abc, ... - itertools.repeat('abc', 5) => - abc, abc, abc, abc, abc - -``itertools.chain(iterA, iterB, ...)`` takes an arbitrary number of -iterables as input, and returns all the elements of the first -iterator, then all the elements of the second, and so on, until all of -the iterables have been exhausted. - -:: - - itertools.chain(['a', 'b', 'c'], (1, 2, 3)) => - a, b, c, 1, 2, 3 - -``itertools.izip(iterA, iterB, ...)`` takes one element from each iterable -and returns them in a tuple:: - - itertools.izip(['a', 'b', 'c'], (1, 2, 3)) => - ('a', 1), ('b', 2), ('c', 3) - -It's similiar to the built-in ``zip()`` function, but doesn't -construct an in-memory list and exhaust all the input iterators before -returning; instead tuples are constructed and returned only if they're -requested. (The technical term for this behaviour is -`lazy evaluation <http://en.wikipedia.org/wiki/Lazy_evaluation>`__.) - -This iterator is intended to be used with iterables that are all of -the same length. If the iterables are of different lengths, the -resulting stream will be the same length as the shortest iterable. - -:: - - itertools.izip(['a', 'b'], (1, 2, 3)) => - ('a', 1), ('b', 2) - -You should avoid doing this, though, because an element may be taken -from the longer iterators and discarded. This means you can't go on -to use the iterators further because you risk skipping a discarded -element. - -``itertools.islice(iter, [start], stop, [step])`` returns a stream -that's a slice of the iterator. With a single ``stop`` argument, -it will return the first ``stop`` -elements. If you supply a starting index, you'll get ``stop-start`` -elements, and if you supply a value for ``step``, elements will be -skipped accordingly. Unlike Python's string and list slicing, you -can't use negative values for ``start``, ``stop``, or ``step``. 
- -:: - - itertools.islice(range(10), 8) => - 0, 1, 2, 3, 4, 5, 6, 7 - itertools.islice(range(10), 2, 8) => - 2, 3, 4, 5, 6, 7 - itertools.islice(range(10), 2, 8, 2) => - 2, 4, 6 - -``itertools.tee(iter, [n])`` replicates an iterator; it returns ``n`` -independent iterators that will all return the contents of the source -iterator. If you don't supply a value for ``n``, the default is 2. -Replicating iterators requires saving some of the contents of the source -iterator, so this can consume significant memory if the iterator is large -and one of the new iterators is consumed more than the others. - -:: - - itertools.tee( itertools.count() ) => - iterA, iterB - - where iterA -> - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... - - and iterB -> - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... - - -Calling functions on elements -''''''''''''''''''''''''''''' - -Two functions are used for calling other functions on the contents of an -iterable. - -``itertools.imap(f, iterA, iterB, ...)`` returns -a stream containing ``f(iterA[0], iterB[0]), f(iterA[1], iterB[1]), -f(iterA[2], iterB[2]), ...``:: - - itertools.imap(operator.add, [5, 6, 5], [1, 2, 3]) => - 6, 8, 8 - -The ``operator`` module contains a set of functions -corresponding to Python's operators. Some examples are -``operator.add(a, b)`` (adds two values), -``operator.ne(a, b)`` (same as ``a!=b``), -and -``operator.attrgetter('id')`` (returns a callable that -fetches the ``"id"`` attribute). - -``itertools.starmap(func, iter)`` assumes that the iterable will -return a stream of tuples, and calls ``f()`` using these tuples as the -arguments:: - - itertools.starmap(os.path.join, - [('/usr', 'bin', 'java'), ('/bin', 'python'), - ('/usr', 'bin', 'perl'),('/usr', 'bin', 'ruby')]) - => - /usr/bin/java, /bin/python, /usr/bin/perl, /usr/bin/ruby - - -Selecting elements -'''''''''''''''''' - -Another group of functions chooses a subset of an iterator's elements -based on a predicate. - -``itertools.ifilter(predicate, iter)`` returns all the elements for -which the predicate returns true:: - - def is_even(x): - return (x % 2) == 0 - - itertools.ifilter(is_even, itertools.count()) => - 0, 2, 4, 6, 8, 10, 12, 14, ... - -``itertools.ifilterfalse(predicate, iter)`` is the opposite, -returning all elements for which the predicate returns false:: - - itertools.ifilterfalse(is_even, itertools.count()) => - 1, 3, 5, 7, 9, 11, 13, 15, ... - -``itertools.takewhile(predicate, iter)`` returns elements for as long -as the predicate returns true. Once the predicate returns false, -the iterator will signal the end of its results. - -:: - - def less_than_10(x): - return (x < 10) - - itertools.takewhile(less_than_10, itertools.count()) => - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 - - itertools.takewhile(is_even, itertools.count()) => - 0 - -``itertools.dropwhile(predicate, iter)`` discards elements while the -predicate returns true, and then returns the rest of the iterable's -results. - -:: - - itertools.dropwhile(less_than_10, itertools.count()) => - 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ... - - itertools.dropwhile(is_even, itertools.count()) => - 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... - - -Grouping elements -''''''''''''''''' - -The last function I'll discuss, ``itertools.groupby(iter, -key_func=None)``, is the most complicated. ``key_func(elem)`` is a -function that can compute a key value for each element returned by the -iterable. If you don't supply a key function, the key is simply each -element itself. 
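
In practice ``key_func`` is frequently nothing more than
``operator.itemgetter()`` or a one-line helper.  A small sketch (the
name ``get_field`` is purely illustrative)::

    import operator

    # itemgetter(1) builds a callable that picks out element 1 of
    # whatever it's given -- handy when the elements are tuples.
    get_field = operator.itemgetter(1)
    get_field(('Decatur', 'AL'))    # -> 'AL'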
- -``groupby()`` collects all the consecutive elements from the -underlying iterable that have the same key value, and returns a stream -of 2-tuples containing a key value and an iterator for the elements -with that key. - -:: - - city_list = [('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL'), - ('Anchorage', 'AK'), ('Nome', 'AK'), - ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ'), - ... - ] - - def get_state ((city, state)): - return state - - itertools.groupby(city_list, get_state) => - ('AL', iterator-1), - ('AK', iterator-2), - ('AZ', iterator-3), ... - - where - iterator-1 => - ('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL') - iterator-2 => - ('Anchorage', 'AK'), ('Nome', 'AK') - iterator-3 => - ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ') - -``groupby()`` assumes that the underlying iterable's contents will -already be sorted based on the key. Note that the returned iterators -also use the underlying iterable, so you have to consume the results -of iterator-1 before requesting iterator-2 and its corresponding key. - - -The functools module ----------------------------------------------- - -The ``functools`` module in Python 2.5 contains some higher-order -functions. A **higher-order function** takes one or more functions as -input and returns a new function. The most useful tool in this module -is the ``partial()`` function. - -For programs written in a functional style, you'll sometimes want to -construct variants of existing functions that have some of the -parameters filled in. Consider a Python function ``f(a, b, c)``; you -may wish to create a new function ``g(b, c)`` that's equivalent to -``f(1, b, c)``; you're filling in a value for one of ``f()``'s parameters. -This is called "partial function application". - -The constructor for ``partial`` takes the arguments ``(function, arg1, -arg2, ... kwarg1=value1, kwarg2=value2)``. The resulting object is -callable, so you can just call it to invoke ``function`` with the -filled-in arguments. - -Here's a small but realistic example:: - - import functools - - def log (message, subsystem): - "Write the contents of 'message' to the specified subsystem." - print '%s: %s' % (subsystem, message) - ... - - server_log = functools.partial(log, subsystem='server') - server_log('Unable to open socket') - - -The operator module -------------------- - -The ``operator`` module was mentioned earlier. It contains a set of -functions corresponding to Python's operators. These functions -are often useful in functional-style code because they save you -from writing trivial functions that perform a single operation. - -Some of the functions in this module are: - -* Math operations: ``add()``, ``sub()``, ``mul()``, ``div()``, ``floordiv()``, - ``abs()``, ... -* Logical operations: ``not_()``, ``truth()``. -* Bitwise operations: ``and_()``, ``or_()``, ``invert()``. -* Comparisons: ``eq()``, ``ne()``, ``lt()``, ``le()``, ``gt()``, and ``ge()``. -* Object identity: ``is_()``, ``is_not()``. - -Consult `the operator module's documentation <http://docs.python.org/lib/module-operator.html>`__ for a complete -list. - - - -The functional module ---------------------- - -Collin Winter's `functional module <http://oakwinter.com/code/functional/>`__ -provides a number of more -advanced tools for functional programming. It also reimplements -several Python built-ins, trying to make them more intuitive to those -used to functional programming in other languages. 
- -This section contains an introduction to some of the most important -functions in ``functional``; full documentation can be found at `the -project's website <http://oakwinter.com/code/functional/documentation/>`__. - -``compose(outer, inner, unpack=False)`` - -The ``compose()`` function implements function composition. -In other words, it returns a wrapper around the ``outer`` and ``inner`` callables, such -that the return value from ``inner`` is fed directly to ``outer``. That is, - -:: - - >>> def add(a, b): - ... return a + b - ... - >>> def double(a): - ... return 2 * a - ... - >>> compose(double, add)(5, 6) - 22 - -is equivalent to - -:: - - >>> double(add(5, 6)) - 22 - -The ``unpack`` keyword is provided to work around the fact that Python functions are not always -`fully curried <http://en.wikipedia.org/wiki/Currying>`__. -By default, it is expected that the ``inner`` function will return a single object and that the ``outer`` -function will take a single argument. Setting the ``unpack`` argument causes ``compose`` to expect a -tuple from ``inner`` which will be expanded before being passed to ``outer``. Put simply, - -:: - - compose(f, g)(5, 6) - -is equivalent to:: - - f(g(5, 6)) - -while - -:: - - compose(f, g, unpack=True)(5, 6) - -is equivalent to:: - - f(*g(5, 6)) - -Even though ``compose()`` only accepts two functions, it's trivial to -build up a version that will compose any number of functions. We'll -use ``reduce()``, ``compose()`` and ``partial()`` (the last of which -is provided by both ``functional`` and ``functools``). - -:: - - from functional import compose, partial - - multi_compose = partial(reduce, compose) - - -We can also use ``map()``, ``compose()`` and ``partial()`` to craft a -version of ``"".join(...)`` that converts its arguments to string:: - - from functional import compose, partial - - join = compose("".join, partial(map, str)) - - -``flip(func)`` - -``flip()`` wraps the callable in ``func`` and -causes it to receive its non-keyword arguments in reverse order. - -:: - - >>> def triple(a, b, c): - ... return (a, b, c) - ... - >>> triple(5, 6, 7) - (5, 6, 7) - >>> - >>> flipped_triple = flip(triple) - >>> flipped_triple(5, 6, 7) - (7, 6, 5) - -``foldl(func, start, iterable)`` - -``foldl()`` takes a binary function, a starting value (usually some kind of 'zero'), and an iterable. -The function is applied to the starting value and the first element of the list, then the result of -that and the second element of the list, then the result of that and the third element of the list, -and so on. 
- -This means that a call such as:: - - foldl(f, 0, [1, 2, 3]) - -is equivalent to:: - - f(f(f(0, 1), 2), 3) - - -``foldl()`` is roughly equivalent to the following recursive function:: - - def foldl(func, start, seq): - if len(seq) == 0: - return start - - return foldl(func, func(start, seq[0]), seq[1:]) - -Speaking of equivalence, the above ``foldl`` call can be expressed in terms of the built-in ``reduce`` like -so:: - - reduce(f, [1, 2, 3], 0) - - -We can use ``foldl()``, ``operator.concat()`` and ``partial()`` to -write a cleaner, more aesthetically-pleasing version of Python's -``"".join(...)`` idiom:: - - from functional import foldl, partial - from operator import concat - - join = partial(foldl, concat, "") - - -Revision History and Acknowledgements ------------------------------------------------- - -The author would like to thank the following people for offering -suggestions, corrections and assistance with various drafts of this -article: Ian Bicking, Nick Coghlan, Nick Efford, Raymond Hettinger, -Jim Jewett, Mike Krell, Leandro Lameiro, Jussi Salmela, -Collin Winter, Blake Winton. - -Version 0.1: posted June 30 2006. - -Version 0.11: posted July 1 2006. Typo fixes. - -Version 0.2: posted July 10 2006. Merged genexp and listcomp -sections into one. Typo fixes. - -Version 0.21: Added more references suggested on the tutor mailing list. - -Version 0.30: Adds a section on the ``functional`` module written by -Collin Winter; adds short section on the operator module; a few other -edits. - - -References --------------------- - -General -''''''''''''''' - -**Structure and Interpretation of Computer Programs**, by -Harold Abelson and Gerald Jay Sussman with Julie Sussman. -Full text at http://mitpress.mit.edu/sicp/. -In this classic textbook of computer science, chapters 2 and 3 discuss the -use of sequences and streams to organize the data flow inside a -program. The book uses Scheme for its examples, but many of the -design approaches described in these chapters are applicable to -functional-style Python code. - -http://www.defmacro.org/ramblings/fp.html: A general -introduction to functional programming that uses Java examples -and has a lengthy historical introduction. - -http://en.wikipedia.org/wiki/Functional_programming: -General Wikipedia entry describing functional programming. - -http://en.wikipedia.org/wiki/Coroutine: -Entry for coroutines. - -http://en.wikipedia.org/wiki/Currying: -Entry for the concept of currying. - -Python-specific -''''''''''''''''''''''''''' - -http://gnosis.cx/TPiP/: -The first chapter of David Mertz's book :title-reference:`Text Processing in Python` -discusses functional programming for text processing, in the section titled -"Utilizing Higher-Order Functions in Text Processing". - -Mertz also wrote a 3-part series of articles on functional programming -for IBM's DeveloperWorks site; see -`part 1 <http://www-128.ibm.com/developerworks/library/l-prog.html>`__, -`part 2 <http://www-128.ibm.com/developerworks/library/l-prog2.html>`__, and -`part 3 <http://www-128.ibm.com/developerworks/linux/library/l-prog3.html>`__, - - -Python documentation -''''''''''''''''''''''''''' - -http://docs.python.org/lib/module-itertools.html: -Documentation for the ``itertools`` module. - -http://docs.python.org/lib/module-operator.html: -Documentation for the ``operator`` module. 
- -http://www.python.org/dev/peps/pep-0289/: -PEP 289: "Generator Expressions" - -http://www.python.org/dev/peps/pep-0342/ -PEP 342: "Coroutines via Enhanced Generators" describes the new generator -features in Python 2.5. - -.. comment - - Topics to place - ----------------------------- - - XXX os.walk() - - XXX Need a large example. - - But will an example add much? I'll post a first draft and see - what the comments say. - -.. comment - - Original outline: - Introduction - Idea of FP - Programs built out of functions - Functions are strictly input-output, no internal state - Opposed to OO programming, where objects have state - - Why FP? - Formal provability - Assignment is difficult to reason about - Not very relevant to Python - Modularity - Small functions that do one thing - Debuggability: - Easy to test due to lack of state - Easy to verify output from intermediate steps - Composability - You assemble a toolbox of functions that can be mixed - - Tackling a problem - Need a significant example - - Iterators - Generators - The itertools module - List comprehensions - Small functions and the lambda statement - Built-in functions - map - filter - reduce - -.. comment - - Handy little function for printing part of an iterator -- used - while writing this document. - - import itertools - def print_iter(it): - slice = itertools.islice(it, 10) - for elem in slice[:-1]: - sys.stdout.write(str(elem)) - sys.stdout.write(', ') - print elem[-1] - - diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex deleted file mode 100644 index d911be6..0000000 --- a/Doc/howto/regex.tex +++ /dev/null @@ -1,1476 +0,0 @@ -\documentclass{howto} - -% TODO: -% Document lookbehind assertions -% Better way of displaying a RE, a string, and what it matches -% Mention optional argument to match.groups() -% Unicode (at least a reference) - -\title{Regular Expression HOWTO} - -\release{0.05} - -\author{A.M. Kuchling} -\authoraddress{\email{amk@amk.ca}} - -\begin{document} -\maketitle - -\begin{abstract} -\noindent -This document is an introductory tutorial to using regular expressions -in Python with the \module{re} module. It provides a gentler -introduction than the corresponding section in the Library Reference. - -This document is available from -\url{http://www.amk.ca/python/howto}. - -\end{abstract} - -\tableofcontents - -\section{Introduction} - -The \module{re} module was added in Python 1.5, and provides -Perl-style regular expression patterns. Earlier versions of Python -came with the \module{regex} module, which provided Emacs-style -patterns. The \module{regex} module was removed completely in Python 2.5. - -Regular expressions (called REs, or regexes, or regex patterns) are -essentially a tiny, highly specialized programming language embedded -inside Python and made available through the \module{re} module. -Using this little language, you specify the rules for the set of -possible strings that you want to match; this set might contain -English sentences, or e-mail addresses, or TeX commands, or anything -you like. You can then ask questions such as ``Does this string match -the pattern?'', or ``Is there a match for the pattern anywhere in this -string?''. You can also use REs to modify a string or to split it -apart in various ways. - -Regular expression patterns are compiled into a series of bytecodes -which are then executed by a matching engine written in C. 
For -advanced use, it may be necessary to pay careful attention to how the -engine will execute a given RE, and write the RE in a certain way in -order to produce bytecode that runs faster. Optimization isn't -covered in this document, because it requires that you have a good -understanding of the matching engine's internals. - -The regular expression language is relatively small and restricted, so -not all possible string processing tasks can be done using regular -expressions. There are also tasks that \emph{can} be done with -regular expressions, but the expressions turn out to be very -complicated. In these cases, you may be better off writing Python -code to do the processing; while Python code will be slower than an -elaborate regular expression, it will also probably be more understandable. - -\section{Simple Patterns} - -We'll start by learning about the simplest possible regular -expressions. Since regular expressions are used to operate on -strings, we'll begin with the most common task: matching characters. - -For a detailed explanation of the computer science underlying regular -expressions (deterministic and non-deterministic finite automata), you -can refer to almost any textbook on writing compilers. - -\subsection{Matching Characters} - -Most letters and characters will simply match themselves. For -example, the regular expression \regexp{test} will match the string -\samp{test} exactly. (You can enable a case-insensitive mode that -would let this RE match \samp{Test} or \samp{TEST} as well; more -about this later.) - -There are exceptions to this rule; some characters are special -\dfn{metacharacters}, and don't match themselves. Instead, they -signal that some out-of-the-ordinary thing should be matched, or they -affect other portions of the RE by repeating them or changing their -meaning. Much of this document is devoted to discussing various -metacharacters and what they do. - -Here's a complete list of the metacharacters; their meanings will be -discussed in the rest of this HOWTO. - -\begin{verbatim} -. ^ $ * + ? { [ ] \ | ( ) -\end{verbatim} -% $ - -The first metacharacters we'll look at are \samp{[} and \samp{]}. -They're used for specifying a character class, which is a set of -characters that you wish to match. Characters can be listed -individually, or a range of characters can be indicated by giving two -characters and separating them by a \character{-}. For example, -\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or -\samp{c}; this is the same as -\regexp{[a-c]}, which uses a range to express the same set of -characters. If you wanted to match only lowercase letters, your -RE would be \regexp{[a-z]}. - -Metacharacters are not active inside classes. For example, -\regexp{[akm\$]} will match any of the characters \character{a}, -\character{k}, \character{m}, or \character{\$}; \character{\$} is -usually a metacharacter, but inside a character class it's stripped of -its special nature. - -You can match the characters not listed within the class by -\dfn{complementing} the set. This is indicated by including a -\character{\^} as the first character of the class; \character{\^} -outside a character class will simply match the -\character{\^} character. For example, \verb|[^5]| will match any -character except \character{5}. - -Perhaps the most important metacharacter is the backslash, \samp{\e}. -As in Python string literals, the backslash can be followed by various -characters to signal various special sequences. 
It's also used to escape -all the metacharacters so you can still match them in patterns; for -example, if you need to match a \samp{[} or -\samp{\e}, you can precede them with a backslash to remove their -special meaning: \regexp{\e[} or \regexp{\e\e}. - -Some of the special sequences beginning with \character{\e} represent -predefined sets of characters that are often useful, such as the set -of digits, the set of letters, or the set of anything that isn't -whitespace. The following predefined special sequences are available: - -\begin{itemize} -\item[\code{\e d}]Matches any decimal digit; this is -equivalent to the class \regexp{[0-9]}. - -\item[\code{\e D}]Matches any non-digit character; this is -equivalent to the class \verb|[^0-9]|. - -\item[\code{\e s}]Matches any whitespace character; this is -equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}. - -\item[\code{\e S}]Matches any non-whitespace character; this is -equivalent to the class \verb|[^ \t\n\r\f\v]|. - -\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class -\regexp{[a-zA-Z0-9_]}. - -\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class -\verb|[^a-zA-Z0-9_]|. -\end{itemize} - -These sequences can be included inside a character class. For -example, \regexp{[\e s,.]} is a character class that will match any -whitespace character, or \character{,} or \character{.}. - -The final metacharacter in this section is \regexp{.}. It matches -anything except a newline character, and there's an alternate mode -(\code{re.DOTALL}) where it will match even a newline. \character{.} -is often used where you want to match ``any character''. - -\subsection{Repeating Things} - -Being able to match varying sets of characters is the first thing -regular expressions can do that isn't already possible with the -methods available on strings. However, if that was the only -additional capability of regexes, they wouldn't be much of an advance. -Another capability is that you can specify that portions of the RE -must be repeated a certain number of times. - -The first metacharacter for repeating things that we'll look at is -\regexp{*}. \regexp{*} doesn't match the literal character \samp{*}; -instead, it specifies that the previous character can be matched zero -or more times, instead of exactly once. - -For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a} -characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a} -characters), and so forth. The RE engine has various internal -limitations stemming from the size of C's \code{int} type that will -prevent it from matching over 2 billion \samp{a} characters; you -probably don't have enough memory to construct a string that large, so -you shouldn't run into that limit. - -Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE, -the matching engine will try to repeat it as many times as possible. -If later portions of the pattern don't match, the matching engine will -then back up and try again with few repetitions. - -A step-by-step example will make this more obvious. Let's consider -the expression \regexp{a[bcd]*b}. This matches the letter -\character{a}, zero or more letters from the class \code{[bcd]}, and -finally ends with a \character{b}. Now imagine matching this RE -against the string \samp{abcbd}. 
-
-\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
-\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
-\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
-it can, which is to the end of the string.}
-\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
-current position is at the end of the string, so it fails.}
-\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
-one less character.}
-\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
-current position is at the last character, which is a \character{d}.}
-\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
-only matching \samp{bc}.}
-\lineiii{7}{\code{abcb}}{Try \regexp{b} again.  This time the character at
-the current position is \character{b}, so it succeeds.}
-\end{tableiii}
-
-The end of the RE has now been reached, and it has matched
-\samp{abcb}.  This demonstrates how the matching engine goes as far as
-it can at first, and if no match is found it will then progressively
-back up and retry the rest of the RE again and again.  It will back up
-until it has tried zero matches for \regexp{[bcd]*}, and if that
-subsequently fails, the engine will conclude that the string doesn't
-match the RE at all.
-
-Another repeating metacharacter is \regexp{+}, which matches one or
-more times.  Pay careful attention to the difference between
-\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
-times, so whatever's being repeated may not be present at all, while
-\regexp{+} requires at least \emph{one} occurrence.  To use a similar
-example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
-\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
-
-There are two more repeating qualifiers.  The question mark character,
-\regexp{?}, matches either once or zero times; you can think of it as
-marking something as being optional.  For example, \regexp{home-?brew}
-matches either \samp{homebrew} or \samp{home-brew}.
-
-The most complicated repeated qualifier is
-\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
-integers.  This qualifier means there must be at least \var{m}
-repetitions, and at most \var{n}.  For example, \regexp{a/\{1,3\}b}
-will match \samp{a/b}, \samp{a//b}, and \samp{a///b}.  It won't match
-\samp{ab}, which has no slashes, or \samp{a////b}, which has four.
-
-You can omit either \var{m} or \var{n}; in that case, a reasonable
-value is assumed for the missing value.  Omitting \var{m} is
-interpreted as a lower limit of 0, while omitting \var{n} results in
-an upper bound of infinity --- actually, the upper bound is the
-2-billion limit mentioned earlier, but that might as well be infinity.
-
-Readers of a reductionist bent may notice that the three other qualifiers
-can all be expressed using this notation.  \regexp{\{0,\}} is the same
-as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
-\regexp{\{0,1\}} is the same as \regexp{?}.  It's better to use
-\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
-they're shorter and easier to read.
-
-\section{Using Regular Expressions}
-
-Now that we've looked at some simple regular expressions, how do we
-actually use them in Python?  The \module{re} module provides an
-interface to the regular expression engine, allowing you to compile
-REs into objects and then perform matches with them.
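
As a quick taste of that workflow (each step is covered in detail in
the subsections that follow), here is a hypothetical interactive
session using the \regexp{ca+t} pattern from the previous section; it
is only a sketch, but it shows the compile-then-match pattern:

\begin{verbatim}
>>> import re
>>> p = re.compile('ca+t')      # compile the RE into a pattern object
>>> p.match('ct') is None       # no 'a' at all, so no match
True
>>> p.match('caaat') is None    # one or more 'a' characters: a match
False
\end{verbatim}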
- -\subsection{Compiling Regular Expressions} - -Regular expressions are compiled into \class{RegexObject} instances, -which have methods for various operations such as searching for -pattern matches or performing string substitutions. - -\begin{verbatim} ->>> import re ->>> p = re.compile('ab*') ->>> print p -<re.RegexObject instance at 80b4150> -\end{verbatim} - -\function{re.compile()} also accepts an optional \var{flags} -argument, used to enable various special features and syntax -variations. We'll go over the available settings later, but for now a -single example will do: - -\begin{verbatim} ->>> p = re.compile('ab*', re.IGNORECASE) -\end{verbatim} - -The RE is passed to \function{re.compile()} as a string. REs are -handled as strings because regular expressions aren't part of the core -Python language, and no special syntax was created for expressing -them. (There are applications that don't need REs at all, so there's -no need to bloat the language specification by including them.) -Instead, the \module{re} module is simply a C extension module -included with Python, just like the \module{socket} or \module{zlib} -modules. - -Putting REs in strings keeps the Python language simpler, but has one -disadvantage which is the topic of the next section. - -\subsection{The Backslash Plague} - -As stated earlier, regular expressions use the backslash -character (\character{\e}) to indicate special forms or to allow -special characters to be used without invoking their special meaning. -This conflicts with Python's usage of the same character for the same -purpose in string literals. - -Let's say you want to write a RE that matches the string -\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure -out what to write in the program code, start with the desired string -to be matched. Next, you must escape any backslashes and other -metacharacters by preceding them with a backslash, resulting in the -string \samp{\e\e section}. The resulting string that must be passed -to \function{re.compile()} must be \verb|\\section|. However, to -express this as a Python string literal, both backslashes must be -escaped \emph{again}. - -\begin{tableii}{c|l}{code}{Characters}{Stage} - \lineii{\e section}{Text string to be matched} - \lineii{\e\e section}{Escaped backslash for \function{re.compile}} - \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal} -\end{tableii} - -In short, to match a literal backslash, one has to write -\code{'\e\e\e\e'} as the RE string, because the regular expression -must be \samp{\e\e}, and each backslash must be expressed as -\samp{\e\e} inside a regular Python string literal. In REs that -feature backslashes repeatedly, this leads to lots of repeated -backslashes and makes the resulting strings difficult to understand. - -The solution is to use Python's raw string notation for regular -expressions; backslashes are not handled in any special way in -a string literal prefixed with \character{r}, so \code{r"\e n"} is a -two-character string containing \character{\e} and \character{n}, -while \code{"\e n"} is a one-character string containing a newline. -Regular expressions will often be written in Python -code using this raw string notation. 
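-
-A quick interpreter check makes the equivalence concrete (this is
-ordinary Python string behaviour, not anything specific to
-\module{re}; the table below lists several such pairs):
-
-\begin{verbatim}
->>> r"\\section" == "\\\\section"
-True
->>> len(r"\n"), len("\n")
-(2, 1)
-\end{verbatim}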
- -\begin{tableii}{c|c}{code}{Regular String}{Raw string} - \lineii{"ab*"}{\code{r"ab*"}} - \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}} - \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}} -\end{tableii} - -\subsection{Performing Matches} - -Once you have an object representing a compiled regular expression, -what do you do with it? \class{RegexObject} instances have several -methods and attributes. Only the most significant ones will be -covered here; consult \ulink{the Library -Reference}{http://www.python.org/doc/lib/module-re.html} for a -complete listing. - -\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} - \lineii{match()}{Determine if the RE matches at the beginning of - the string.} - \lineii{search()}{Scan through a string, looking for any location - where this RE matches.} - \lineii{findall()}{Find all substrings where the RE matches, -and returns them as a list.} - \lineii{finditer()}{Find all substrings where the RE matches, -and returns them as an iterator.} -\end{tableii} - -\method{match()} and \method{search()} return \code{None} if no match -can be found. If they're successful, a \code{MatchObject} instance is -returned, containing information about the match: where it starts and -ends, the substring it matched, and more. - -You can learn about this by interactively experimenting with the -\module{re} module. If you have Tkinter available, you may also want -to look at \file{Tools/scripts/redemo.py}, a demonstration program -included with the Python distribution. It allows you to enter REs and -strings, and displays whether the RE matches or fails. -\file{redemo.py} can be quite useful when trying to debug a -complicated RE. Phil Schwartz's -\ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive -tool for developing and testing RE patterns. - -This HOWTO uses the standard Python interpreter for its examples. -First, run the Python interpreter, import the \module{re} module, and -compile a RE: - -\begin{verbatim} -Python 2.2.2 (#1, Feb 10 2003, 12:57:01) ->>> import re ->>> p = re.compile('[a-z]+') ->>> p -<_sre.SRE_Pattern object at 80c3c28> -\end{verbatim} - -Now, you can try matching various strings against the RE -\regexp{[a-z]+}. An empty string shouldn't match at all, since -\regexp{+} means 'one or more repetitions'. \method{match()} should -return \code{None} in this case, which will cause the interpreter to -print no output. You can explicitly print the result of -\method{match()} to make this clear. - -\begin{verbatim} ->>> p.match("") ->>> print p.match("") -None -\end{verbatim} - -Now, let's try it on a string that it should match, such as -\samp{tempo}. In this case, \method{match()} will return a -\class{MatchObject}, so you should store the result in a variable for -later use. - -\begin{verbatim} ->>> m = p.match('tempo') ->>> print m -<_sre.SRE_Match object at 80c4f68> -\end{verbatim} - -Now you can query the \class{MatchObject} for information about the -matching string. 
\class{MatchObject} instances also have several -methods and attributes; the most important ones are: - -\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} - \lineii{group()}{Return the string matched by the RE} - \lineii{start()}{Return the starting position of the match} - \lineii{end()}{Return the ending position of the match} - \lineii{span()}{Return a tuple containing the (start, end) positions - of the match} -\end{tableii} - -Trying these methods will soon clarify their meaning: - -\begin{verbatim} ->>> m.group() -'tempo' ->>> m.start(), m.end() -(0, 5) ->>> m.span() -(0, 5) -\end{verbatim} - -\method{group()} returns the substring that was matched by the -RE. \method{start()} and \method{end()} return the starting and -ending index of the match. \method{span()} returns both start and end -indexes in a single tuple. Since the \method{match} method only -checks if the RE matches at the start of a string, -\method{start()} will always be zero. However, the \method{search} -method of \class{RegexObject} instances scans through the string, so -the match may not start at zero in that case. - -\begin{verbatim} ->>> print p.match('::: message') -None ->>> m = p.search('::: message') ; print m -<re.MatchObject instance at 80c9650> ->>> m.group() -'message' ->>> m.span() -(4, 11) -\end{verbatim} - -In actual programs, the most common style is to store the -\class{MatchObject} in a variable, and then check if it was -\code{None}. This usually looks like: - -\begin{verbatim} -p = re.compile( ... ) -m = p.match( 'string goes here' ) -if m: - print 'Match found: ', m.group() -else: - print 'No match' -\end{verbatim} - -Two \class{RegexObject} methods return all of the matches for a pattern. -\method{findall()} returns a list of matching strings: - -\begin{verbatim} ->>> p = re.compile('\d+') ->>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') -['12', '11', '10'] -\end{verbatim} - -\method{findall()} has to create the entire list before it can be -returned as the result. The \method{finditer()} method returns a -sequence of \class{MatchObject} instances as an -iterator.\footnote{Introduced in Python 2.2.2.} - -\begin{verbatim} ->>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') ->>> iterator -<callable-iterator object at 0x401833ac> ->>> for match in iterator: -... print match.span() -... -(0, 2) -(22, 24) -(29, 31) -\end{verbatim} - - -\subsection{Module-Level Functions} - -You don't have to create a \class{RegexObject} and call its methods; -the \module{re} module also provides top-level functions called -\function{match()}, \function{search()}, \function{findall()}, -\function{sub()}, and so forth. These functions take the same -arguments as the corresponding \class{RegexObject} method, with the RE -string added as the first argument, and still return either -\code{None} or a \class{MatchObject} instance. - -\begin{verbatim} ->>> print re.match(r'From\s+', 'Fromage amk') -None ->>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') -<re.MatchObject instance at 80c5978> -\end{verbatim} - -Under the hood, these functions simply produce a \class{RegexObject} -for you and call the appropriate method on it. They also store the -compiled object in a cache, so future calls using the same -RE are faster. - -Should you use these module-level functions, or should you get the -\class{RegexObject} and call its methods yourself? That choice -depends on how frequently the RE will be used, and on your personal -coding style. 
If the RE is being used at only one point in the code, -then the module functions are probably more convenient. If a program -contains a lot of regular expressions, or re-uses the same ones in -several locations, then it might be worthwhile to collect all the -definitions in one place, in a section of code that compiles all the -REs ahead of time. To take an example from the standard library: - -\begin{verbatim} -ref = re.compile( ... ) -entityref = re.compile( ... ) -charref = re.compile( ... ) -starttagopen = re.compile( ... ) -\end{verbatim} - -I generally prefer to work with the compiled object, even for -one-time uses, but few people will be as much of a purist about this -as I am. - -\subsection{Compilation Flags} - -Compilation flags let you modify some aspects of how regular -expressions work. Flags are available in the \module{re} module under -two names, a long name such as \constant{IGNORECASE} and a short, -one-letter form such as \constant{I}. (If you're familiar with Perl's -pattern modifiers, the one-letter forms use the same letters; the -short form of \constant{re.VERBOSE} is \constant{re.X}, for example.) -Multiple flags can be specified by bitwise OR-ing them; \code{re.I | -re.M} sets both the \constant{I} and \constant{M} flags, for example. - -Here's a table of the available flags, followed by -a more detailed explanation of each one. - -\begin{tableii}{c|l}{}{Flag}{Meaning} - \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any - character, including newlines} - \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches} - \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match} - \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching, - affecting \regexp{\^} and \regexp{\$}} - \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs, - which can be organized more cleanly and understandably.} -\end{tableii} - -\begin{datadesc}{I} -\dataline{IGNORECASE} -Perform case-insensitive matching; character class and literal strings -will match -letters by ignoring case. For example, \regexp{[A-Z]} will match -lowercase letters, too, and \regexp{Spam} will match \samp{Spam}, -\samp{spam}, or \samp{spAM}. -This lowercasing doesn't take the current locale into account; it will -if you also set the \constant{LOCALE} flag. -\end{datadesc} - -\begin{datadesc}{L} -\dataline{LOCALE} -Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, -and \regexp{\e B}, dependent on the current locale. - -Locales are a feature of the C library intended to help in writing -programs that take account of language differences. For example, if -you're processing French text, you'd want to be able to write -\regexp{\e w+} to match words, but \regexp{\e w} only matches the -character class \regexp{[A-Za-z]}; it won't match \character{\'e} or -\character{\c c}. If your system is configured properly and a French -locale is selected, certain C functions will tell the program that -\character{\'e} should also be considered a letter. Setting the -\constant{LOCALE} flag when compiling a regular expression will cause the -resulting compiled object to use these C functions for \regexp{\e w}; -this is slower, but also enables \regexp{\e w+} to match French words as -you'd expect. -\end{datadesc} - -\begin{datadesc}{M} -\dataline{MULTILINE} -(\regexp{\^} and \regexp{\$} haven't been explained yet; -they'll be introduced in section~\ref{more-metacharacters}.) 
- -Usually \regexp{\^} matches only at the beginning of the string, and -\regexp{\$} matches only at the end of the string and immediately before the -newline (if any) at the end of the string. When this flag is -specified, \regexp{\^} matches at the beginning of the string and at -the beginning of each line within the string, immediately following -each newline. Similarly, the \regexp{\$} metacharacter matches either at -the end of the string and at the end of each line (immediately -preceding each newline). - -\end{datadesc} - -\begin{datadesc}{S} -\dataline{DOTALL} -Makes the \character{.} special character match any character at all, -including a newline; without this flag, \character{.} will match -anything \emph{except} a newline. -\end{datadesc} - -\begin{datadesc}{X} -\dataline{VERBOSE} This flag allows you to write regular expressions -that are more readable by granting you more flexibility in how you can -format them. When this flag has been specified, whitespace within the -RE string is ignored, except when the whitespace is in a character -class or preceded by an unescaped backslash; this lets you organize -and indent the RE more clearly. This flag also lets you put comments -within a RE that will be ignored by the engine; comments are marked by -a \character{\#} that's neither in a character class or preceded by an -unescaped backslash. - -For example, here's a RE that uses \constant{re.VERBOSE}; see how -much easier it is to read? - -\begin{verbatim} -charref = re.compile(r""" - &[#] # Start of a numeric entity reference - ( - 0[0-7]+ # Octal form - | [0-9]+ # Decimal form - | x[0-9a-fA-F]+ # Hexadecimal form - ) - ; # Trailing semicolon -""", re.VERBOSE) -\end{verbatim} - -Without the verbose setting, the RE would look like this: -\begin{verbatim} -charref = re.compile("&#(0[0-7]+" - "|[0-9]+" - "|x[0-9a-fA-F]+);") -\end{verbatim} - -In the above example, Python's automatic concatenation of string -literals has been used to break up the RE into smaller pieces, but -it's still more difficult to understand than the version using -\constant{re.VERBOSE}. - -\end{datadesc} - -\section{More Pattern Power} - -So far we've only covered a part of the features of regular -expressions. In this section, we'll cover some new metacharacters, -and how to use groups to retrieve portions of the text that was matched. - -\subsection{More Metacharacters\label{more-metacharacters}} - -There are some metacharacters that we haven't covered yet. Most of -them will be covered in this section. - -Some of the remaining metacharacters to be discussed are -\dfn{zero-width assertions}. They don't cause the engine to advance -through the string; instead, they consume no characters at all, -and simply succeed or fail. For example, \regexp{\e b} is an -assertion that the current position is located at a word boundary; the -position isn't changed by the \regexp{\e b} at all. This means that -zero-width assertions should never be repeated, because if they match -once at a given location, they can obviously be matched an infinite -number of times. - -\begin{list}{}{} - -\item[\regexp{|}] -Alternation, or the ``or'' operator. -If A and B are regular expressions, -\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}. -\regexp{|} has very low precedence in order to make it work reasonably when -you're alternating multi-character strings. -\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not -\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}. 
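-
-Here's a quick illustration (assuming \module{re} has already been
-imported, as in the earlier examples):
-
-\begin{verbatim}
->>> print re.search('Crow|Servo', 'Tom Servo').group()
-Servo
->>> print re.search('Crow|Servo', 'Crow T. Robot').group()
-Crow
-\end{verbatim}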
- -To match a literal \character{|}, -use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. - -\item[\regexp{\^}] Matches at the beginning of lines. Unless the -\constant{MULTILINE} flag has been set, this will only match at the -beginning of the string. In \constant{MULTILINE} mode, this also -matches immediately after each newline within the string. - -For example, if you wish to match the word \samp{From} only at the -beginning of a line, the RE to use is \verb|^From|. - -\begin{verbatim} ->>> print re.search('^From', 'From Here to Eternity') -<re.MatchObject instance at 80c1520> ->>> print re.search('^From', 'Reciting From Memory') -None -\end{verbatim} - -%To match a literal \character{\^}, use \regexp{\e\^} or enclose it -%inside a character class, as in \regexp{[{\e}\^]}. - -\item[\regexp{\$}] Matches at the end of a line, which is defined as -either the end of the string, or any location followed by a newline -character. - -\begin{verbatim} ->>> print re.search('}$', '{block}') -<re.MatchObject instance at 80adfa8> ->>> print re.search('}$', '{block} ') -None ->>> print re.search('}$', '{block}\n') -<re.MatchObject instance at 80adfa8> -\end{verbatim} -% $ - -To match a literal \character{\$}, use \regexp{\e\$} or enclose it -inside a character class, as in \regexp{[\$]}. - -\item[\regexp{\e A}] Matches only at the start of the string. When -not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are -effectively the same. In \constant{MULTILINE} mode, they're -different: \regexp{\e A} still matches only at the beginning of the -string, but \regexp{\^} may match at any location inside the string -that follows a newline character. - -\item[\regexp{\e Z}] Matches only at the end of the string. - -\item[\regexp{\e b}] Word boundary. -This is a zero-width assertion that matches only at the -beginning or end of a word. A word is defined as a sequence of -alphanumeric characters, so the end of a word is indicated by -whitespace or a non-alphanumeric character. - -The following example matches \samp{class} only when it's a complete -word; it won't match when it's contained inside another word. - -\begin{verbatim} ->>> p = re.compile(r'\bclass\b') ->>> print p.search('no class at all') -<re.MatchObject instance at 80c8f28> ->>> print p.search('the declassified algorithm') -None ->>> print p.search('one subclass is') -None -\end{verbatim} - -There are two subtleties you should remember when using this special -sequence. First, this is the worst collision between Python's string -literals and regular expression sequences. In Python's string -literals, \samp{\e b} is the backspace character, ASCII value 8. If -you're not using raw strings, then Python will convert the \samp{\e b} to -a backspace, and your RE won't match as you expect it to. The -following example looks the same as our previous RE, but omits -the \character{r} in front of the RE string. - -\begin{verbatim} ->>> p = re.compile('\bclass\b') ->>> print p.search('no class at all') -None ->>> print p.search('\b' + 'class' + '\b') -<re.MatchObject instance at 80c3ee0> -\end{verbatim} - -Second, inside a character class, where there's no use for this -assertion, \regexp{\e b} represents the backspace character, for -compatibility with Python's string literals. - -\item[\regexp{\e B}] Another zero-width assertion, this is the -opposite of \regexp{\e b}, only matching when the current -position is not at a word boundary. 
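-
-A short example (note the raw string, needed for the same reason as
-with \regexp{\e b}):
-
-\begin{verbatim}
->>> p = re.compile(r'py\B')
->>> print p.search('python').group()
-py
->>> print p.search('py.')
-None
->>> print p.search('py')
-None
-\end{verbatim}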
- -\end{list} - -\subsection{Grouping} - -Frequently you need to obtain more information than just whether the -RE matched or not. Regular expressions are often used to dissect -strings by writing a RE divided into several subgroups which -match different components of interest. For example, an RFC-822 -header line is divided into a header name and a value, separated by a -\character{:}, like this: - -\begin{verbatim} -From: author@example.com -User-Agent: Thunderbird 1.5.0.9 (X11/20061227) -MIME-Version: 1.0 -To: editor@example.com -\end{verbatim} - -This can be handled by writing a regular expression -which matches an entire header line, and has one group which matches the -header name, and another group which matches the header's value. - -Groups are marked by the \character{(}, \character{)} metacharacters. -\character{(} and \character{)} have much the same meaning as they do -in mathematical expressions; they group together the expressions -contained inside them, and you can repeat the contents of a -group with a repeating qualifier, such as \regexp{*}, \regexp{+}, -\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example, -\regexp{(ab)*} will match zero or more repetitions of \samp{ab}. - -\begin{verbatim} ->>> p = re.compile('(ab)*') ->>> print p.match('ababababab').span() -(0, 10) -\end{verbatim} - -Groups indicated with \character{(}, \character{)} also capture the -starting and ending index of the text that they match; this can be -retrieved by passing an argument to \method{group()}, -\method{start()}, \method{end()}, and \method{span()}. Groups are -numbered starting with 0. Group 0 is always present; it's the whole -RE, so \class{MatchObject} methods all have group 0 as their default -argument. Later we'll see how to express groups that don't capture -the span of text that they match. - -\begin{verbatim} ->>> p = re.compile('(a)b') ->>> m = p.match('ab') ->>> m.group() -'ab' ->>> m.group(0) -'ab' -\end{verbatim} - -Subgroups are numbered from left to right, from 1 upward. Groups can -be nested; to determine the number, just count the opening parenthesis -characters, going from left to right. - -\begin{verbatim} ->>> p = re.compile('(a(b)c)d') ->>> m = p.match('abcd') ->>> m.group(0) -'abcd' ->>> m.group(1) -'abc' ->>> m.group(2) -'b' -\end{verbatim} - -\method{group()} can be passed multiple group numbers at a time, in -which case it will return a tuple containing the corresponding values -for those groups. - -\begin{verbatim} ->>> m.group(2,1,2) -('b', 'abc', 'b') -\end{verbatim} - -The \method{groups()} method returns a tuple containing the strings -for all the subgroups, from 1 up to however many there are. - -\begin{verbatim} ->>> m.groups() -('abc', 'b') -\end{verbatim} - -Backreferences in a pattern allow you to specify that the contents of -an earlier capturing group must also be found at the current location -in the string. For example, \regexp{\e 1} will succeed if the exact -contents of group 1 can be found at the current position, and fails -otherwise. Remember that Python's string literals also use a -backslash followed by numbers to allow including arbitrary characters -in a string, so be sure to use a raw string when incorporating -backreferences in a RE. - -For example, the following RE detects doubled words in a string. 
- -\begin{verbatim} ->>> p = re.compile(r'(\b\w+)\s+\1') ->>> p.search('Paris in the the spring').group() -'the the' -\end{verbatim} - -Backreferences like this aren't often useful for just searching -through a string --- there are few text formats which repeat data in -this way --- but you'll soon find out that they're \emph{very} useful -when performing string substitutions. - -\subsection{Non-capturing and Named Groups} - -Elaborate REs may use many groups, both to capture substrings of -interest, and to group and structure the RE itself. In complex REs, -it becomes difficult to keep track of the group numbers. There are -two features which help with this problem. Both of them use a common -syntax for regular expression extensions, so we'll look at that first. - -Perl 5 added several additional features to standard regular -expressions, and the Python \module{re} module supports most of them. -It would have been difficult to choose new -single-keystroke metacharacters or new special sequences beginning -with \samp{\e} to represent the new features without making Perl's -regular expressions confusingly different from standard REs. If you -chose \samp{\&} as a new metacharacter, for example, old expressions -would be assuming that -\samp{\&} was a regular character and wouldn't have escaped it by -writing \regexp{\e \&} or \regexp{[\&]}. - -The solution chosen by the Perl developers was to use \regexp{(?...)} -as the extension syntax. \samp{?} immediately after a parenthesis was -a syntax error because the \samp{?} would have nothing to repeat, so -this didn't introduce any compatibility problems. The characters -immediately after the \samp{?} indicate what extension is being used, -so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and -\regexp{(?:foo)} is something else (a non-capturing group containing -the subexpression \regexp{foo}). - -Python adds an extension syntax to Perl's extension syntax. If the -first character after the question mark is a \samp{P}, you know that -it's an extension that's specific to Python. Currently there are two -such extensions: \regexp{(?P<\var{name}>...)} defines a named group, -and \regexp{(?P=\var{name})} is a backreference to a named group. If -future versions of Perl 5 add similar features using a different -syntax, the \module{re} module will be changed to support the new -syntax, while preserving the Python-specific syntax for -compatibility's sake. - -Now that we've looked at the general extension syntax, we can return -to the features that simplify working with groups in complex REs. -Since groups are numbered from left to right and a complex expression -may use many groups, it can become difficult to keep track of the -correct numbering. Modifying such a complex RE is annoying, too: -insert a new group near the beginning and you change the numbers of -everything that follows it. - -Sometimes you'll want to use a group to collect a part of a regular -expression, but aren't interested in retrieving the group's contents. -You can make this fact explicit by using a non-capturing group: -\regexp{(?:...)}, where you can replace the \regexp{...} -with any other regular expression. 
- -\begin{verbatim} ->>> m = re.match("([abc])+", "abc") ->>> m.groups() -('c',) ->>> m = re.match("(?:[abc])+", "abc") ->>> m.groups() -() -\end{verbatim} - -Except for the fact that you can't retrieve the contents of what the -group matched, a non-capturing group behaves exactly the same as a -capturing group; you can put anything inside it, repeat it with a -repetition metacharacter such as \samp{*}, and nest it within other -groups (capturing or non-capturing). \regexp{(?:...)} is particularly -useful when modifying an existing pattern, since you can add new groups -without changing how all the other groups are numbered. It should be -mentioned that there's no performance difference in searching between -capturing and non-capturing groups; neither form is any faster than -the other. - -A more significant feature is named groups: instead of -referring to them by numbers, groups can be referenced by a name. - -The syntax for a named group is one of the Python-specific extensions: -\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of -the group. Named groups also behave exactly like capturing groups, -and additionally associate a name with a group. The -\class{MatchObject} methods that deal with capturing groups all accept -either integers that refer to the group by number or strings that -contain the desired group's name. Named groups are still given -numbers, so you can retrieve information about a group in two ways: - -\begin{verbatim} ->>> p = re.compile(r'(?P<word>\b\w+\b)') ->>> m = p.search( '(((( Lots of punctuation )))' ) ->>> m.group('word') -'Lots' ->>> m.group(1) -'Lots' -\end{verbatim} - -Named groups are handy because they let you use easily-remembered -names, instead of having to remember numbers. Here's an example RE -from the \module{imaplib} module: - -\begin{verbatim} -InternalDate = re.compile(r'INTERNALDATE "' - r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' - r'(?P<year>[0-9][0-9][0-9][0-9])' - r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' - r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' - r'"') -\end{verbatim} - -It's obviously much easier to retrieve \code{m.group('zonem')}, -instead of having to remember to retrieve group 9. - -The syntax for backreferences in an expression such as -\regexp{(...)\e 1} refers to the number of the group. There's -naturally a variant that uses the group name instead of the number. -This is another Python extension: \regexp{(?P=\var{name})} indicates -that the contents of the group called \var{name} should again be matched -at the current point. The regular expression for finding doubled -words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as -\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}: - -\begin{verbatim} ->>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') ->>> p.search('Paris in the the spring').group() -'the the' -\end{verbatim} - -\subsection{Lookahead Assertions} - -Another zero-width assertion is the lookahead assertion. Lookahead -assertions are available in both positive and negative form, and -look like this: - -\begin{itemize} -\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds -if the contained regular expression, represented here by \code{...}, -successfully matches at the current location, and fails otherwise. -But, once the contained expression has been tried, the matching engine -doesn't advance at all; the rest of the pattern is tried right where -the assertion started. - -\item[\regexp{(?!...)}] Negative lookahead assertion. 
This is the -opposite of the positive assertion; it succeeds if the contained expression -\emph{doesn't} match at the current position in the string. -\end{itemize} - -To make this concrete, let's look at a case where a lookahead is -useful. Consider a simple pattern to match a filename and split it -apart into a base name and an extension, separated by a \samp{.}. For -example, in \samp{news.rc}, \samp{news} is the base name, and -\samp{rc} is the filename's extension. - -The pattern to match this is quite simple: - -\regexp{.*[.].*\$} - -Notice that the \samp{.} needs to be treated specially because it's a -metacharacter; I've put it inside a character class. Also notice the -trailing \regexp{\$}; this is added to ensure that all the rest of the -string must be included in the extension. This regular expression -matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and -\samp{printers.conf}. - -Now, consider complicating the problem a bit; what if you want to -match filenames where the extension is not \samp{bat}? -Some incorrect attempts: - -\verb|.*[.][^b].*$| -% $ - -The first attempt above tries to exclude \samp{bat} by requiring that -the first character of the extension is not a \samp{b}. This is -wrong, because the pattern also doesn't match \samp{foo.bar}. - -% Messes up the HTML without the curly braces around \^ -\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$} - -The expression gets messier when you try to patch up the first -solution by requiring one of the following cases to match: the first -character of the extension isn't \samp{b}; the second character isn't -\samp{a}; or the third character isn't \samp{t}. This accepts -\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a -three-letter extension and won't accept a filename with a two-letter -extension such as \samp{sendmail.cf}. We'll complicate the pattern -again in an effort to fix it. - -\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$} - -In the third attempt, the second and third letters are all made -optional in order to allow matching extensions shorter than three -characters, such as \samp{sendmail.cf}. - -The pattern's getting really complicated now, which makes it hard to -read and understand. Worse, if the problem changes and you want to -exclude both \samp{bat} and \samp{exe} as extensions, the pattern -would get even more complicated and confusing. - -A negative lookahead cuts through all this confusion: - -\regexp{.*[.](?!bat\$).*\$} -% $ - -The negative lookahead means: if the expression \regexp{bat} doesn't match at -this point, try the rest of the pattern; if \regexp{bat\$} does match, -the whole pattern will fail. The trailing \regexp{\$} is required to -ensure that something like \samp{sample.batch}, where the extension -only starts with \samp{bat}, will be allowed. - -Excluding another filename extension is now easy; simply add it as an -alternative inside the assertion. The following pattern excludes -filenames that end in either \samp{bat} or \samp{exe}: - -\regexp{.*[.](?!bat\$|exe\$).*\$} -% $ - - -\section{Modifying Strings} - -Up to this point, we've simply performed searches against a static -string. 
Regular expressions are also commonly used to modify strings -in various ways, using the following \class{RegexObject} methods: - -\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} - \lineii{split()}{Split the string into a list, splitting it wherever the RE matches} - \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string} - \lineii{subn()}{Does the same thing as \method{sub()}, - but returns the new string and the number of replacements} -\end{tableii} - - -\subsection{Splitting Strings} - -The \method{split()} method of a \class{RegexObject} splits a string -apart wherever the RE matches, returning a list of the pieces. -It's similar to the \method{split()} method of strings but -provides much more -generality in the delimiters that you can split by; -\method{split()} only supports splitting by whitespace or by -a fixed string. As you'd expect, there's a module-level -\function{re.split()} function, too. - -\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}} - Split \var{string} by the matches of the regular expression. If - capturing parentheses are used in the RE, then their contents will - also be returned as part of the resulting list. If \var{maxsplit} - is nonzero, at most \var{maxsplit} splits are performed. -\end{methoddesc} - -You can limit the number of splits made, by passing a value for -\var{maxsplit}. When \var{maxsplit} is nonzero, at most -\var{maxsplit} splits will be made, and the remainder of the string is -returned as the final element of the list. In the following example, -the delimiter is any sequence of non-alphanumeric characters. - -\begin{verbatim} ->>> p = re.compile(r'\W+') ->>> p.split('This is a test, short and sweet, of split().') -['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] ->>> p.split('This is a test, short and sweet, of split().', 3) -['This', 'is', 'a', 'test, short and sweet, of split().'] -\end{verbatim} - -Sometimes you're not only interested in what the text between -delimiters is, but also need to know what the delimiter was. If -capturing parentheses are used in the RE, then their values are also -returned as part of the list. Compare the following calls: - -\begin{verbatim} ->>> p = re.compile(r'\W+') ->>> p2 = re.compile(r'(\W+)') ->>> p.split('This... is a test.') -['This', 'is', 'a', 'test', ''] ->>> p2.split('This... is a test.') -['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] -\end{verbatim} - -The module-level function \function{re.split()} adds the RE to be -used as the first argument, but is otherwise the same. - -\begin{verbatim} ->>> re.split('[\W]+', 'Words, words, words.') -['Words', 'words', 'words', ''] ->>> re.split('([\W]+)', 'Words, words, words.') -['Words', ', ', 'words', ', ', 'words', '.', ''] ->>> re.split('[\W]+', 'Words, words, words.', 1) -['Words', 'words, words.'] -\end{verbatim} - -\subsection{Search and Replace} - -Another common task is to find all the matches for a pattern, and -replace them with a different string. The \method{sub()} method takes -a replacement value, which can be either a string or a function, and -the string to be processed. - -\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}} -Returns the string obtained by replacing the leftmost non-overlapping -occurrences of the RE in \var{string} by the replacement -\var{replacement}. If the pattern isn't found, \var{string} is returned -unchanged. 
- -The optional argument \var{count} is the maximum number of pattern -occurrences to be replaced; \var{count} must be a non-negative -integer. The default value of 0 means to replace all occurrences. -\end{methoddesc} - -Here's a simple example of using the \method{sub()} method. It -replaces colour names with the word \samp{colour}: - -\begin{verbatim} ->>> p = re.compile( '(blue|white|red)') ->>> p.sub( 'colour', 'blue socks and red shoes') -'colour socks and colour shoes' ->>> p.sub( 'colour', 'blue socks and red shoes', count=1) -'colour socks and red shoes' -\end{verbatim} - -The \method{subn()} method does the same work, but returns a 2-tuple -containing the new string value and the number of replacements -that were performed: - -\begin{verbatim} ->>> p = re.compile( '(blue|white|red)') ->>> p.subn( 'colour', 'blue socks and red shoes') -('colour socks and colour shoes', 2) ->>> p.subn( 'colour', 'no colours at all') -('no colours at all', 0) -\end{verbatim} - -Empty matches are replaced only when they're not -adjacent to a previous match. - -\begin{verbatim} ->>> p = re.compile('x*') ->>> p.sub('-', 'abxd') -'-a-b-d-' -\end{verbatim} - -If \var{replacement} is a string, any backslash escapes in it are -processed. That is, \samp{\e n} is converted to a single newline -character, \samp{\e r} is converted to a carriage return, and so forth. -Unknown escapes such as \samp{\e j} are left alone. Backreferences, -such as \samp{\e 6}, are replaced with the substring matched by the -corresponding group in the RE. This lets you incorporate -portions of the original text in the resulting -replacement string. - -This example matches the word \samp{section} followed by a string -enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to -\samp{subsection}: - -\begin{verbatim} ->>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) ->>> p.sub(r'subsection{\1}','section{First} section{second}') -'subsection{First} subsection{second}' -\end{verbatim} - -There's also a syntax for referring to named groups as defined by the -\regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the -substring matched by the group named \samp{name}, and -\samp{\e g<\var{number}>} -uses the corresponding group number. -\samp{\e g<2>} is therefore equivalent to \samp{\e 2}, -but isn't ambiguous in a -replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be -interpreted as a reference to group 20, not a reference to group 2 -followed by the literal character \character{0}.) The following -substitutions are all equivalent, but use all three variations of the -replacement string. - -\begin{verbatim} ->>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) ->>> p.sub(r'subsection{\1}','section{First}') -'subsection{First}' ->>> p.sub(r'subsection{\g<1>}','section{First}') -'subsection{First}' ->>> p.sub(r'subsection{\g<name>}','section{First}') -'subsection{First}' -\end{verbatim} - -\var{replacement} can also be a function, which gives you even more -control. If \var{replacement} is a function, the function is -called for every non-overlapping occurrence of \var{pattern}. On each -call, the function is -passed a \class{MatchObject} argument for the match -and can use this information to compute the desired replacement string and return it. - -In the following example, the replacement function translates -decimals into hexadecimal: - -\begin{verbatim} ->>> def hexrepl( match ): -... "Return the hex string for a decimal number" -... value = int( match.group() ) -... return hex(value) -... 
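->>> # hexrepl() can be exercised on its own with any MatchObject:
->>> hexrepl(re.match(r'\d+', '65490'))
-'0xffd2'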
->>> p = re.compile(r'\d+') ->>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') -'Call 0xffd2 for printing, 0xc000 for user code.' -\end{verbatim} - -When using the module-level \function{re.sub()} function, the pattern -is passed as the first argument. The pattern may be a string or a -\class{RegexObject}; if you need to specify regular expression flags, -you must either use a \class{RegexObject} as the first parameter, or use -embedded modifiers in the pattern, e.g. \code{sub("(?i)b+", "x", "bbbb -BBBB")} returns \code{'x x'}. - -\section{Common Problems} - -Regular expressions are a powerful tool for some applications, but in -some ways their behaviour isn't intuitive and at times they don't -behave the way you may expect them to. This section will point out -some of the most common pitfalls. - -\subsection{Use String Methods} - -Sometimes using the \module{re} module is a mistake. If you're -matching a fixed string, or a single character class, and you're not -using any \module{re} features such as the \constant{IGNORECASE} flag, -then the full power of regular expressions may not be required. -Strings have several methods for performing operations with fixed -strings and they're usually much faster, because the implementation is -a single small C loop that's been optimized for the purpose, instead -of the large, more generalized regular expression engine. - -One example might be replacing a single fixed string with another -one; for example, you might replace \samp{word} -with \samp{deed}. \code{re.sub()} seems like the function to use for -this, but consider the \method{replace()} method. Note that -\function{replace()} will also replace \samp{word} inside -words, turning \samp{swordfish} into \samp{sdeedfish}, but the -na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing -the substitution on parts of words, the pattern would have to be -\regexp{\e bword\e b}, in order to require that \samp{word} have a -word boundary on either side. This takes the job beyond -\method{replace}'s abilities.) - -Another common task is deleting every occurrence of a single character -from a string or replacing it with another single character. You -might do this with something like \code{re.sub('\e n', ' ', S)}, but -\method{translate()} is capable of doing both tasks -and will be faster than any regular expression operation can be. - -In short, before turning to the \module{re} module, consider whether -your problem can be solved with a faster and simpler string method. - -\subsection{match() versus search()} - -The \function{match()} function only checks if the RE matches at -the beginning of the string while \function{search()} will scan -forward through the string for a match. -It's important to keep this distinction in mind. Remember, -\function{match()} will only report a successful match which -will start at 0; if the match wouldn't start at zero, -\function{match()} will \emph{not} report it. - -\begin{verbatim} ->>> print re.match('super', 'superstition').span() -(0, 5) ->>> print re.match('super', 'insuperable') -None -\end{verbatim} - -On the other hand, \function{search()} will scan forward through the -string, reporting the first match it finds. - -\begin{verbatim} ->>> print re.search('super', 'superstition').span() -(0, 5) ->>> print re.search('super', 'insuperable').span() -(2, 7) -\end{verbatim} - -Sometimes you'll be tempted to keep using \function{re.match()}, and -just add \regexp{.*} to the front of your RE. 
Resist this temptation -and use \function{re.search()} instead. The regular expression -compiler does some analysis of REs in order to speed up the process of -looking for a match. One such analysis figures out what the first -character of a match must be; for example, a pattern starting with -\regexp{Crow} must match starting with a \character{C}. The analysis -lets the engine quickly scan through the string looking for the -starting character, only trying the full match if a \character{C} is found. - -Adding \regexp{.*} defeats this optimization, requiring scanning to -the end of the string and then backtracking to find a match for the -rest of the RE. Use \function{re.search()} instead. - -\subsection{Greedy versus Non-Greedy} - -When repeating a regular expression, as in \regexp{a*}, the resulting -action is to consume as much of the pattern as possible. This -fact often bites you when you're trying to match a pair of -balanced delimiters, such as the angle brackets surrounding an HTML -tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't -work because of the greedy nature of \regexp{.*}. - -\begin{verbatim} ->>> s = '<html><head><title>Title</title>' ->>> len(s) -32 ->>> print re.match('<.*>', s).span() -(0, 32) ->>> print re.match('<.*>', s).group() -<html><head><title>Title</title> -\end{verbatim} - -The RE matches the \character{<} in \samp{<html>}, and the -\regexp{.*} consumes the rest of the string. There's still more left -in the RE, though, and the \regexp{>} can't match at the end of -the string, so the regular expression engine has to backtrack -character by character until it finds a match for the \regexp{>}. -The final match extends from the \character{<} in \samp{<html>} -to the \character{>} in \samp{</title>}, which isn't what you want. - -In this case, the solution is to use the non-greedy qualifiers -\regexp{*?}, \regexp{+?}, \regexp{??}, or -\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as -possible. In the above example, the \character{>} is tried -immediately after the first \character{<} matches, and when it fails, -the engine advances a character at a time, retrying the \character{>} -at every step. This produces just the right result: - -\begin{verbatim} ->>> print re.match('<.*?>', s).group() -<html> -\end{verbatim} - -(Note that parsing HTML or XML with regular expressions is painful. -Quick-and-dirty patterns will handle common cases, but HTML and XML -have special cases that will break the obvious regular expression; by -the time you've written a regular expression that handles all of the -possible cases, the patterns will be \emph{very} complicated. Use an -HTML or XML parser module for such tasks.) - -\subsection{Not Using re.VERBOSE} - -By now you've probably noticed that regular expressions are a very -compact notation, but they're not terribly readable. REs of -moderate complexity can become lengthy collections of backslashes, -parentheses, and metacharacters, making them difficult to read and -understand. - -For such REs, specifying the \code{re.VERBOSE} flag when -compiling the regular expression can be helpful, because it allows -you to format the regular expression more clearly. - -The \code{re.VERBOSE} flag has several effects. Whitespace in the -regular expression that \emph{isn't} inside a character class is -ignored. 
This means that an expression such as \regexp{dog | cat} is -equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]} -will still match the characters \character{a}, \character{b}, or a -space. In addition, you can also put comments inside a RE; comments -extend from a \samp{\#} character to the next newline. When used with -triple-quoted strings, this enables REs to be formatted more neatly: - -\begin{verbatim} -pat = re.compile(r""" - \s* # Skip leading whitespace - (?P<header>[^:]+) # Header name - \s* : # Whitespace, and a colon - (?P<value>.*?) # The header's value -- *? used to - # lose the following trailing whitespace - \s*$ # Trailing whitespace to end-of-line -""", re.VERBOSE) -\end{verbatim} -% $ - -This is far more readable than: - -\begin{verbatim} -pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") -\end{verbatim} -% $ - -\section{Feedback} - -Regular expressions are a complicated topic. Did this document help -you understand them? Were there parts that were unclear, or Problems -you encountered that weren't covered here? If so, please send -suggestions for improvements to the author. - -The most complete book on regular expressions is almost certainly -Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published -by O'Reilly. Unfortunately, it exclusively concentrates on Perl and -Java's flavours of regular expressions, and doesn't contain any Python -material at all, so it won't be useful as a reference for programming -in Python. (The first edition covered Python's now-removed -\module{regex} module, which won't help you much.) Consider checking -it out from your library. - -\end{document} - diff --git a/Doc/howto/sockets.tex b/Doc/howto/sockets.tex deleted file mode 100644 index 0cecbb9..0000000 --- a/Doc/howto/sockets.tex +++ /dev/null @@ -1,465 +0,0 @@ -\documentclass{howto} - -\title{Socket Programming HOWTO} - -\release{0.00} - -\author{Gordon McMillan} -\authoraddress{\email{gmcm@hypernet.com}} - -\begin{document} -\maketitle - -\begin{abstract} -\noindent -Sockets are used nearly everywhere, but are one of the most severely -misunderstood technologies around. This is a 10,000 foot overview of -sockets. It's not really a tutorial - you'll still have work to do in -getting things operational. It doesn't cover the fine points (and there -are a lot of them), but I hope it will give you enough background to -begin using them decently. - -This document is available from the Python HOWTO page at -\url{http://www.python.org/doc/howto}. - -\end{abstract} - -\tableofcontents - -\section{Sockets} - -Sockets are used nearly everywhere, but are one of the most severely -misunderstood technologies around. This is a 10,000 foot overview of -sockets. It's not really a tutorial - you'll still have work to do in -getting things working. It doesn't cover the fine points (and there -are a lot of them), but I hope it will give you enough background to -begin using them decently. - -I'm only going to talk about INET sockets, but they account for at -least 99\% of the sockets in use. And I'll only talk about STREAM -sockets - unless you really know what you're doing (in which case this -HOWTO isn't for you!), you'll get better behavior and performance from -a STREAM socket than anything else. I will try to clear up the mystery -of what a socket is, as well as some hints on how to work with -blocking and non-blocking sockets. But I'll start by talking about -blocking sockets. You'll need to know how they work before dealing -with non-blocking sockets. 
- -Part of the trouble with understanding these things is that "socket" -can mean a number of subtly different things, depending on context. So -first, let's make a distinction between a "client" socket - an -endpoint of a conversation, and a "server" socket, which is more like -a switchboard operator. The client application (your browser, for -example) uses "client" sockets exclusively; the web server it's -talking to uses both "server" sockets and "client" sockets. - - -\subsection{History} - -Of the various forms of IPC (\emph{Inter Process Communication}), -sockets are by far the most popular. On any given platform, there are -likely to be other forms of IPC that are faster, but for -cross-platform communication, sockets are about the only game in town. - -They were invented in Berkeley as part of the BSD flavor of Unix. They -spread like wildfire with the Internet. With good reason --- the -combination of sockets with INET makes talking to arbitrary machines -around the world unbelievably easy (at least compared to other -schemes). - -\section{Creating a Socket} - -Roughly speaking, when you clicked on the link that brought you to -this page, your browser did something like the following: - -\begin{verbatim} - #create an INET, STREAMing socket - s = socket.socket( - socket.AF_INET, socket.SOCK_STREAM) - #now connect to the web server on port 80 - # - the normal http port - s.connect(("www.mcmillan-inc.com", 80)) -\end{verbatim} - -When the \code{connect} completes, the socket \code{s} can -now be used to send in a request for the text of this page. The same -socket will read the reply, and then be destroyed. That's right - -destroyed. Client sockets are normally only used for one exchange (or -a small set of sequential exchanges). - -What happens in the web server is a bit more complex. First, the web -server creates a "server socket". - -\begin{verbatim} - #create an INET, STREAMing socket - serversocket = socket.socket( - socket.AF_INET, socket.SOCK_STREAM) - #bind the socket to a public host, - # and a well-known port - serversocket.bind((socket.gethostname(), 80)) - #become a server socket - serversocket.listen(5) -\end{verbatim} - -A couple things to notice: we used \code{socket.gethostname()} -so that the socket would be visible to the outside world. If we had -used \code{s.bind(('', 80))} or \code{s.bind(('localhost', -80))} or \code{s.bind(('127.0.0.1', 80))} we would still -have a "server" socket, but one that was only visible within the same -machine. - -A second thing to note: low number ports are usually reserved for -"well known" services (HTTP, SNMP etc). If you're playing around, use -a nice high number (4 digits). - -Finally, the argument to \code{listen} tells the socket library that -we want it to queue up as many as 5 connect requests (the normal max) -before refusing outside connections. If the rest of the code is -written properly, that should be plenty. - -OK, now we have a "server" socket, listening on port 80. 
Now we enter
-the mainloop of the web server:
-
-\begin{verbatim}
-    while 1:
-        #accept connections from outside
-        (clientsocket, address) = serversocket.accept()
-        #now do something with the clientsocket
-        #in this case, we'll pretend this is a threaded server
-        ct = client_thread(clientsocket)
-        ct.run()
-\end{verbatim}
-
-There are actually three general ways in which this loop could work:
-dispatching a thread to handle \code{clientsocket}, creating a new
-process to handle \code{clientsocket}, or restructuring this app
-to use non-blocking sockets and multiplexing between our "server" socket
-and any active \code{clientsocket}s using
-\code{select}. More about that later. The important thing to
-understand now is this: this is \emph{all} a "server" socket
-does. It doesn't send any data. It doesn't receive any data. It just
-produces "client" sockets. Each \code{clientsocket} is created
-in response to some \emph{other} "client" socket doing a
-\code{connect()} to the host and port we're bound to. As soon as
-we've created that \code{clientsocket}, we go back to listening
-for more connections. The two "clients" are free to chat it up - they
-are using some dynamically allocated port which will be recycled when
-the conversation ends.
-
-\subsection{IPC} If you need fast IPC between two processes
-on one machine, you should look into whatever form of shared memory
-the platform offers. A simple protocol based around shared memory and
-locks or semaphores is by far the fastest technique.
-
-If you do decide to use sockets, bind the "server" socket to
-\code{'localhost'}. On most platforms, this will take a shortcut
-around a couple of layers of network code and be quite a bit faster.
-
-
-\section{Using a Socket}
-
-The first thing to note is that the web browser's "client" socket and
-the web server's "client" socket are identical beasts. That is, this
-is a "peer to peer" conversation. Or to put it another way, \emph{as the
-designer, you will have to decide what the rules of etiquette are for
-a conversation}. Normally, the \code{connect}ing socket
-starts the conversation by sending in a request, or perhaps a
-signon. But that's a design decision - it's not a rule of sockets.
-
-Now there are two sets of verbs to use for communication. You can use
-\code{send} and \code{recv}, or you can transform your
-client socket into a file-like beast and use \code{read} and
-\code{write}. The latter is the way Java presents its
-sockets. I'm not going to talk about it here, except to warn you that
-you need to use \code{flush} on sockets. These are buffered
-"files", and a common mistake is to \code{write} something, and
-then \code{read} for a reply. Without a \code{flush} in
-there, you may wait forever for the reply, because the request may
-still be in your output buffer.
-
-Now we come to the major stumbling block of sockets - \code{send}
-and \code{recv} operate on the network buffers. They do not
-necessarily handle all the bytes you hand them (or expect from them),
-because their major focus is handling the network buffers. In general,
-they return when the associated network buffers have been filled
-(\code{send}) or emptied (\code{recv}). They then tell you
-how many bytes they handled. It is \emph{your} responsibility to call
-them again until your message has been completely dealt with.
-
-When a \code{recv} returns 0 bytes, it means the other side has
-closed (or is in the process of closing) the connection. You will not
-receive any more data on this connection. Ever.
You may be able to -send data successfully; I'll talk about that some on the next page. - -A protocol like HTTP uses a socket for only one transfer. The client -sends a request, the reads a reply. That's it. The socket is -discarded. This means that a client can detect the end of the reply by -receiving 0 bytes. - -But if you plan to reuse your socket for further transfers, you need -to realize that \emph{there is no "EOT" (End of Transfer) on a -socket.} I repeat: if a socket \code{send} or -\code{recv} returns after handling 0 bytes, the connection has -been broken. If the connection has \emph{not} been broken, you may -wait on a \code{recv} forever, because the socket will -\emph{not} tell you that there's nothing more to read (for now). Now -if you think about that a bit, you'll come to realize a fundamental -truth of sockets: \emph{messages must either be fixed length} (yuck), -\emph{or be delimited} (shrug), \emph{or indicate how long they are} -(much better), \emph{or end by shutting down the connection}. The -choice is entirely yours, (but some ways are righter than others). - -Assuming you don't want to end the connection, the simplest solution -is a fixed length message: - -\begin{verbatim} -class mysocket: - '''demonstration class only - - coded for clarity, not efficiency - ''' - - def __init__(self, sock=None): - if sock is None: - self.sock = socket.socket( - socket.AF_INET, socket.SOCK_STREAM) - else: - self.sock = sock - - def connect(self, host, port): - self.sock.connect((host, port)) - - def mysend(self, msg): - totalsent = 0 - while totalsent < MSGLEN: - sent = self.sock.send(msg[totalsent:]) - if sent == 0: - raise RuntimeError, \\ - "socket connection broken" - totalsent = totalsent + sent - - def myreceive(self): - msg = '' - while len(msg) < MSGLEN: - chunk = self.sock.recv(MSGLEN-len(msg)) - if chunk == '': - raise RuntimeError, \\ - "socket connection broken" - msg = msg + chunk - return msg -\end{verbatim} - -The sending code here is usable for almost any messaging scheme - in -Python you send strings, and you can use \code{len()} to -determine its length (even if it has embedded \code{\e 0} -characters). It's mostly the receiving code that gets more -complex. (And in C, it's not much worse, except you can't use -\code{strlen} if the message has embedded \code{\e 0}s.) - -The easiest enhancement is to make the first character of the message -an indicator of message type, and have the type determine the -length. Now you have two \code{recv}s - the first to get (at -least) that first character so you can look up the length, and the -second in a loop to get the rest. If you decide to go the delimited -route, you'll be receiving in some arbitrary chunk size, (4096 or 8192 -is frequently a good match for network buffer sizes), and scanning -what you've received for a delimiter. - -One complication to be aware of: if your conversational protocol -allows multiple messages to be sent back to back (without some kind of -reply), and you pass \code{recv} an arbitrary chunk size, you -may end up reading the start of a following message. You'll need to -put that aside and hold onto it, until it's needed. - -Prefixing the message with it's length (say, as 5 numeric characters) -gets more complex, because (believe it or not), you may not get all 5 -characters in one \code{recv}. 
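Purely as a sketch of the shape this takes - the 5-character decimal
prefix and the helper names \code{recv_all} and \code{recv_prefixed}
are inventions of this illustration, not part of the class above - the
two receive loops might look like:

\begin{verbatim}
    def recv_all(sock, count):
        # loop until exactly 'count' bytes have been received
        data = ''
        while len(data) < count:
            chunk = sock.recv(count - len(data))
            if chunk == '':
                raise RuntimeError("socket connection broken")
            data = data + chunk
        return data

    def recv_prefixed(sock):
        # first loop: read the fixed-size length prefix
        length = int(recv_all(sock, 5))
        # second loop: read the message body itself
        return recv_all(sock, length)
\end{verbatim}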
In playing around, you'll get -away with it; but in high network loads, your code will very quickly -break unless you use two \code{recv} loops - the first to -determine the length, the second to get the data part of the -message. Nasty. This is also when you'll discover that -\code{send} does not always manage to get rid of everything in -one pass. And despite having read this, you will eventually get bit by -it! - -In the interests of space, building your character, (and preserving my -competitive position), these enhancements are left as an exercise for -the reader. Lets move on to cleaning up. - -\subsection{Binary Data} - -It is perfectly possible to send binary data over a socket. The major -problem is that not all machines use the same formats for binary -data. For example, a Motorola chip will represent a 16 bit integer -with the value 1 as the two hex bytes 00 01. Intel and DEC, however, -are byte-reversed - that same 1 is 01 00. Socket libraries have calls -for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs, -htons} where "n" means \emph{network} and "h" means \emph{host}, -"s" means \emph{short} and "l" means \emph{long}. Where network order -is host order, these do nothing, but where the machine is -byte-reversed, these swap the bytes around appropriately. - -In these days of 32 bit machines, the ascii representation of binary -data is frequently smaller than the binary representation. That's -because a surprising amount of the time, all those longs have the -value 0, or maybe 1. The string "0" would be two bytes, while binary -is four. Of course, this doesn't fit well with fixed-length -messages. Decisions, decisions. - -\section{Disconnecting} - -Strictly speaking, you're supposed to use \code{shutdown} on a -socket before you \code{close} it. The \code{shutdown} is -an advisory to the socket at the other end. Depending on the argument -you pass it, it can mean "I'm not going to send anymore, but I'll -still listen", or "I'm not listening, good riddance!". Most socket -libraries, however, are so used to programmers neglecting to use this -piece of etiquette that normally a \code{close} is the same as -\code{shutdown(); close()}. So in most situations, an explicit -\code{shutdown} is not needed. - -One way to use \code{shutdown} effectively is in an HTTP-like -exchange. The client sends a request and then does a -\code{shutdown(1)}. This tells the server "This client is done -sending, but can still receive." The server can detect "EOF" by a -receive of 0 bytes. It can assume it has the complete request. The -server sends a reply. If the \code{send} completes successfully -then, indeed, the client was still receiving. - -Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done. - - -\subsection{When Sockets Die} - -Probably the worst thing about using blocking sockets is what happens -when the other side comes down hard (without doing a -\code{close}). Your socket is likely to hang. SOCKSTREAM is a -reliable protocol, and it will wait a long, long time before giving up -on a connection. If you're using threads, the entire thread is -essentially dead. There's not much you can do about it. 
As long as you
aren't doing something dumb, like holding a lock while doing a
blocking read, the thread isn't really consuming much in the way of
resources. Do \emph{not} try to kill the thread - part of the reason
that threads are more efficient than processes is that they avoid the
overhead associated with the automatic recycling of resources. In
other words, if you do manage to kill the thread, your whole process
is likely to be screwed up.

\section{Non-blocking Sockets}

If you've understood the preceding, you already know most of what you
need to know about the mechanics of using sockets. You'll still use
the same calls, in much the same ways. It's just that, if you do it
right, your app will be almost inside-out.

In Python, you use \code{socket.setblocking(0)} to make it
non-blocking. In C, it's more complex (for one thing, you'll need to
choose between the BSD flavor \code{O_NONBLOCK} and the almost
indistinguishable Posix flavor \code{O_NDELAY}, which is
completely different from \code{TCP_NODELAY}), but it's the
exact same idea. You do this after creating the socket, but before
using it. (Actually, if you're nuts, you can switch back and forth.)

The major mechanical difference is that \code{send},
\code{recv}, \code{connect} and \code{accept} can
return without having done anything. You have (of course) a number of
choices. You can check return codes and error codes and generally drive
yourself crazy. If you don't believe me, try it sometime. Your app
will grow large, buggy and suck CPU. So let's skip the brain-dead
solutions and do it right.

Use \code{select}.

In C, coding \code{select} is fairly complex. In Python, it's a
piece of cake, but it's close enough to the C version that if you
understand \code{select} in Python, you'll have little trouble
with it in C.

\begin{verbatim}
    ready_to_read, ready_to_write, in_error = \
        select.select(
            potential_readers,
            potential_writers,
            potential_errs,
            timeout)
\end{verbatim}

You pass \code{select} three lists: the first contains all
sockets that you might want to try reading; the second all the sockets
you might want to try writing to, and the last (normally left empty)
those that you want to check for errors. You should note that a
socket can go into more than one list. The \code{select} call is
blocking, but you can give it a timeout. This is generally a sensible
thing to do - give it a nice long timeout (say a minute) unless you
have good reason to do otherwise.

In return, you will get three lists. They contain the sockets that are
actually readable, writable and in error. Each of these lists is a
subset (possibly empty) of the corresponding list you passed in. And
if you put a socket in more than one input list, it will only be (at
most) in one output list.

If a socket is in the output readable list, you can be
as-close-to-certain-as-we-ever-get-in-this-business that a
\code{recv} on that socket will return \emph{something}. Same
idea for the writable list. You'll be able to send
\emph{something}. Maybe not all you want to, but \emph{something} is
better than nothing. (Actually, any reasonably healthy socket will
return as writable - it just means outbound network buffer space is
available.)

If you have a "server" socket, put it in the potential_readers
list. If it comes out in the readable list, your \code{accept}
will (almost certainly) work. If you have created a new socket to
\code{connect} to someone else, put it in the potential_writers
list.
If it shows up in the writable list, you have a decent chance -that it has connected. - -One very nasty problem with \code{select}: if somewhere in those -input lists of sockets is one which has died a nasty death, the -\code{select} will fail. You then need to loop through every -single damn socket in all those lists and do a -\code{select([sock],[],[],0)} until you find the bad one. That -timeout of 0 means it won't take long, but it's ugly. - -Actually, \code{select} can be handy even with blocking sockets. -It's one way of determining whether you will block - the socket -returns as readable when there's something in the buffers. However, -this still doesn't help with the problem of determining whether the -other end is done, or just busy with something else. - -\textbf{Portability alert}: On Unix, \code{select} works both with -the sockets and files. Don't try this on Windows. On Windows, -\code{select} works with sockets only. Also note that in C, many -of the more advanced socket options are done differently on -Windows. In fact, on Windows I usually use threads (which work very, -very well) with my sockets. Face it, if you want any kind of -performance, your code will look very different on Windows than on -Unix. (I haven't the foggiest how you do this stuff on a Mac.) - -\subsection{Performance} - -There's no question that the fastest sockets code uses non-blocking -sockets and select to multiplex them. You can put together something -that will saturate a LAN connection without putting any strain on the -CPU. The trouble is that an app written this way can't do much of -anything else - it needs to be ready to shuffle bytes around at all -times. - -Assuming that your app is actually supposed to do something more than -that, threading is the optimal solution, (and using non-blocking -sockets will be faster than using blocking sockets). Unfortunately, -threading support in Unixes varies both in API and quality. So the -normal Unix solution is to fork a subprocess to deal with each -connection. The overhead for this is significant (and don't do this on -Windows - the overhead of process creation is enormous there). It also -means that unless each subprocess is completely independent, you'll -need to use another form of IPC, say a pipe, or shared memory and -semaphores, to communicate between the parent and child processes. - -Finally, remember that even though blocking sockets are somewhat -slower than non-blocking, in many cases they are the "right" -solution. After all, if your app is driven by the data it receives -over a socket, there's not much sense in complicating the logic just -so your app can wait on \code{select} instead of -\code{recv}. - -\end{document} diff --git a/Doc/howto/unicode.rst b/Doc/howto/unicode.rst deleted file mode 100644 index dffe2cb..0000000 --- a/Doc/howto/unicode.rst +++ /dev/null @@ -1,766 +0,0 @@ -Unicode HOWTO -================ - -**Version 1.02** - -This HOWTO discusses Python's support for Unicode, and explains various -problems that people commonly encounter when trying to work with Unicode. - -Introduction to Unicode ------------------------------- - -History of Character Codes -'''''''''''''''''''''''''''''' - -In 1968, the American Standard Code for Information Interchange, -better known by its acronym ASCII, was standardized. ASCII defined -numeric codes for various characters, with the numeric values running from 0 to -127. For example, the lowercase letter 'a' is assigned 97 as its code -value. 
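As a quick illustration (the interpreter session below is just a
sketch), Python's built-in ``ord()`` and ``chr()`` functions show this
mapping directly::

    >>> ord('a')
    97
    >>> chr(97)
    'a'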
- -ASCII was an American-developed standard, so it only defined -unaccented characters. There was an 'e', but no 'é' or 'Í'. This -meant that languages which required accented characters couldn't be -faithfully represented in ASCII. (Actually the missing accents matter -for English, too, which contains words such as 'naïve' and 'café', and some -publications have house styles which require spellings such as -'coöperate'.) - -For a while people just wrote programs that didn't display accents. I -remember looking at Apple ][ BASIC programs, published in French-language -publications in the mid-1980s, that had lines like these:: - - PRINT "FICHER EST COMPLETE." - PRINT "CARACTERE NON ACCEPTE." - -Those messages should contain accents, and they just look wrong to -someone who can read French. - -In the 1980s, almost all personal computers were 8-bit, meaning that -bytes could hold values ranging from 0 to 255. ASCII codes only went -up to 127, so some machines assigned values between 128 and 255 to -accented characters. Different machines had different codes, however, -which led to problems exchanging files. Eventually various commonly -used sets of values for the 128-255 range emerged. Some were true -standards, defined by the International Standards Organization, and -some were **de facto** conventions that were invented by one company -or another and managed to catch on. - -255 characters aren't very many. For example, you can't fit -both the accented characters used in Western Europe and the Cyrillic -alphabet used for Russian into the 128-255 range because there are more than -127 such characters. - -You could write files using different codes (all your Russian -files in a coding system called KOI8, all your French files in -a different coding system called Latin1), but what if you wanted -to write a French document that quotes some Russian text? In the -1980s people began to want to solve this problem, and the Unicode -standardization effort began. - -Unicode started out using 16-bit characters instead of 8-bit characters. 16 -bits means you have 2^16 = 65,536 distinct values available, making it -possible to represent many different characters from many different -alphabets; an initial goal was to have Unicode contain the alphabets for -every single human language. It turns out that even 16 bits isn't enough to -meet that goal, and the modern Unicode specification uses a wider range of -codes, 0-1,114,111 (0x10ffff in base-16). - -There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were -originally separate efforts, but the specifications were merged with -the 1.1 revision of Unicode. - -(This discussion of Unicode's history is highly simplified. I don't -think the average Python programmer needs to worry about the -historical details; consult the Unicode consortium site listed in the -References for more information.) - - -Definitions -'''''''''''''''''''''''' - -A **character** is the smallest possible component of a text. 'A', -'B', 'C', etc., are all different characters. So are 'È' and -'Í'. Characters are abstractions, and vary depending on the -language or context you're talking about. For example, the symbol for -ohms (Ω) is usually drawn much like the capital letter -omega (Ω) in the Greek alphabet (they may even be the same in -some fonts), but these are two different characters that have -different meanings. - -The Unicode standard describes how characters are represented by -**code points**. A code point is an integer value, usually denoted in -base 16. 
In the standard, a code point is written using the notation -U+12ca to mean the character with value 0x12ca (4810 decimal). The -Unicode standard contains a lot of tables listing characters and their -corresponding code points:: - - 0061 'a'; LATIN SMALL LETTER A - 0062 'b'; LATIN SMALL LETTER B - 0063 'c'; LATIN SMALL LETTER C - ... - 007B '{'; LEFT CURLY BRACKET - -Strictly, these definitions imply that it's meaningless to say 'this is -character U+12ca'. U+12ca is a code point, which represents some particular -character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. -In informal contexts, this distinction between code points and characters will -sometimes be forgotten. - -A character is represented on a screen or on paper by a set of graphical -elements that's called a **glyph**. The glyph for an uppercase A, for -example, is two diagonal strokes and a horizontal stroke, though the exact -details will depend on the font being used. Most Python code doesn't need -to worry about glyphs; figuring out the correct glyph to display is -generally the job of a GUI toolkit or a terminal's font renderer. - - -Encodings -''''''''' - -To summarize the previous section: -a Unicode string is a sequence of code points, which are -numbers from 0 to 0x10ffff. This sequence needs to be represented as -a set of bytes (meaning, values from 0-255) in memory. The rules for -translating a Unicode string into a sequence of bytes are called an -**encoding**. - -The first encoding you might think of is an array of 32-bit integers. -In this representation, the string "Python" would look like this:: - - P y t h o n - 0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00 - 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 - -This representation is straightforward but using -it presents a number of problems. - -1. It's not portable; different processors order the bytes - differently. - -2. It's very wasteful of space. In most texts, the majority of the code - points are less than 127, or less than 255, so a lot of space is occupied - by zero bytes. The above string takes 24 bytes compared to the 6 - bytes needed for an ASCII representation. Increased RAM usage doesn't - matter too much (desktop computers have megabytes of RAM, and strings - aren't usually that large), but expanding our usage of disk and - network bandwidth by a factor of 4 is intolerable. - -3. It's not compatible with existing C functions such as ``strlen()``, - so a new family of wide string functions would need to be used. - -4. Many Internet standards are defined in terms of textual data, and - can't handle content with embedded zero bytes. - -Generally people don't use this encoding, instead choosing other encodings -that are more efficient and convenient. - -Encodings don't have to handle every possible Unicode character, and -most encodings don't. For example, Python's default encoding is the -'ascii' encoding. The rules for converting a Unicode string into the -ASCII encoding are simple; for each code point: - -1. If the code point is <128, each byte is the same as the value of the - code point. - -2. If the code point is 128 or greater, the Unicode string can't - be represented in this encoding. (Python raises a - ``UnicodeEncodeError`` exception in this case.) - -Latin-1, also known as ISO-8859-1, is a similar encoding. 
Unicode -code points 0-255 are identical to the Latin-1 values, so converting -to this encoding simply requires converting code points to byte -values; if a code point larger than 255 is encountered, the string -can't be encoded into Latin-1. - -Encodings don't have to be simple one-to-one mappings like Latin-1. -Consider IBM's EBCDIC, which was used on IBM mainframes. Letter -values weren't in one block: 'a' through 'i' had values from 129 to -137, but 'j' through 'r' were 145 through 153. If you wanted to use -EBCDIC as an encoding, you'd probably use some sort of lookup table to -perform the conversion, but this is largely an internal detail. - -UTF-8 is one of the most commonly used encodings. UTF stands for -"Unicode Transformation Format", and the '8' means that 8-bit numbers -are used in the encoding. (There's also a UTF-16 encoding, but it's -less frequently used than UTF-8.) UTF-8 uses the following rules: - -1. If the code point is <128, it's represented by the corresponding byte value. -2. If the code point is between 128 and 0x7ff, it's turned into two byte values - between 128 and 255. -3. Code points >0x7ff are turned into three- or four-byte sequences, where - each byte of the sequence is between 128 and 255. - -UTF-8 has several convenient properties: - -1. It can handle any Unicode code point. -2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes. -3. A string of ASCII text is also valid UTF-8 text. -4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte. -5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8. - - - -References -'''''''''''''' - -The Unicode Consortium site at <http://www.unicode.org> has character -charts, a glossary, and PDF versions of the Unicode specification. Be -prepared for some difficult reading. -<http://www.unicode.org/history/> is a chronology of the origin and -development of Unicode. - -To help understand the standard, Jukka Korpela has written an -introductory guide to reading the Unicode character tables, -available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>. - -Roman Czyborra wrote another explanation of Unicode's basic principles; -it's at <http://czyborra.com/unicode/characters.html>. -Czyborra has written a number of other Unicode-related documentation, -available from <http://www.cyzborra.com>. - -Two other good introductory articles were written by Joel Spolsky -<http://www.joelonsoftware.com/articles/Unicode.html> and Jason -Orendorff <http://www.jorendorff.com/articles/unicode/>. If this -introduction didn't make things clear to you, you should try reading -one of these alternate articles before continuing. - -Wikipedia entries are often helpful; see the entries for "character -encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8 -<http://en.wikipedia.org/wiki/UTF-8>, for example. - - -Python's Unicode Support ------------------------- - -Now that you've learned the rudiments of Unicode, we can look at -Python's Unicode features. - - -The Unicode Type -''''''''''''''''''' - -Unicode strings are expressed as instances of the ``unicode`` type, -one of Python's repertoire of built-in types. 
It derives from an -abstract type called ``basestring``, which is also an ancestor of the -``str`` type; you can therefore check if a value is a string type with -``isinstance(value, basestring)``. Under the hood, Python represents -Unicode strings as either 16- or 32-bit integers, depending on how the -Python interpreter was compiled. - -The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``. -All of its arguments should be 8-bit strings. The first argument is converted -to Unicode using the specified encoding; if you leave off the ``encoding`` argument, -the ASCII encoding is used for the conversion, so characters greater than 127 will -be treated as errors:: - - >>> unicode('abcdef') - u'abcdef' - >>> s = unicode('abcdef') - >>> type(s) - <type 'unicode'> - >>> unicode('abcdef' + chr(255)) - Traceback (most recent call last): - File "<stdin>", line 1, in ? - UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: - ordinal not in range(128) - -The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument -are 'strict' (raise a ``UnicodeDecodeError`` exception), -'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'), -or 'ignore' (just leave the character out of the Unicode result). -The following examples show the differences:: - - >>> unicode('\x80abc', errors='strict') - Traceback (most recent call last): - File "<stdin>", line 1, in ? - UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: - ordinal not in range(128) - >>> unicode('\x80abc', errors='replace') - u'\ufffdabc' - >>> unicode('\x80abc', errors='ignore') - u'abc' - -Encodings are specified as strings containing the encoding's name. -Python 2.4 comes with roughly 100 different encodings; see the Python -Library Reference at -<http://docs.python.org/lib/standard-encodings.html> for a list. Some -encodings have multiple names; for example, 'latin-1', 'iso_8859_1' -and '8859' are all synonyms for the same encoding. - -One-character Unicode strings can also be created with the -``unichr()`` built-in function, which takes integers and returns a -Unicode string of length 1 that contains the corresponding code point. -The reverse operation is the built-in `ord()` function that takes a -one-character Unicode string and returns the code point value:: - - >>> unichr(40960) - u'\ua000' - >>> ord(u'\ua000') - 40960 - -Instances of the ``unicode`` type have many of the same methods as -the 8-bit string type for operations such as searching and formatting:: - - >>> s = u'Was ever feather so lightly blown to and fro as this multitude?' - >>> s.count('e') - 5 - >>> s.find('feather') - 9 - >>> s.find('bird') - -1 - >>> s.replace('feather', 'sand') - u'Was ever sand so lightly blown to and fro as this multitude?' - >>> s.upper() - u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?' - -Note that the arguments to these methods can be Unicode strings or 8-bit strings. -8-bit strings will be converted to Unicode before carrying out the operation; -Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception:: - - >>> s.find('Was\x9f') - Traceback (most recent call last): - File "<stdin>", line 1, in ? 
- UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128) - >>> s.find(u'Was\x9f') - -1 - -Much Python code that operates on strings will therefore work with -Unicode strings without requiring any changes to the code. (Input and -output code needs more updating for Unicode; more on this later.) - -Another important method is ``.encode([encoding], [errors='strict'])``, -which returns an 8-bit string version of the -Unicode string, encoded in the requested encoding. The ``errors`` -parameter is the same as the parameter of the ``unicode()`` -constructor, with one additional possibility; as well as 'strict', -'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which -uses XML's character references. The following example shows the -different results:: - - >>> u = unichr(40960) + u'abcd' + unichr(1972) - >>> u.encode('utf-8') - '\xea\x80\x80abcd\xde\xb4' - >>> u.encode('ascii') - Traceback (most recent call last): - File "<stdin>", line 1, in ? - UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128) - >>> u.encode('ascii', 'ignore') - 'abcd' - >>> u.encode('ascii', 'replace') - '?abcd?' - >>> u.encode('ascii', 'xmlcharrefreplace') - 'ꀀabcd޴' - -Python's 8-bit strings have a ``.decode([encoding], [errors])`` method -that interprets the string using the given encoding:: - - >>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string - >>> utf8_version = u.encode('utf-8') # Encode as UTF-8 - >>> type(utf8_version), utf8_version - (<type 'str'>, '\xea\x80\x80abcd\xde\xb4') - >>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8 - >>> u == u2 # The two strings match - True - -The low-level routines for registering and accessing the available -encodings are found in the ``codecs`` module. However, the encoding -and decoding functions returned by this module are usually more -low-level than is comfortable, so I'm not going to describe the -``codecs`` module here. If you need to implement a completely new -encoding, you'll need to learn about the ``codecs`` module interfaces, -but implementing encodings is a specialized task that also won't be -covered here. Consult the Python documentation to learn more about -this module. - -The most commonly used part of the ``codecs`` module is the -``codecs.open()`` function which will be discussed in the section -on input and output. - - -Unicode Literals in Python Source Code -'''''''''''''''''''''''''''''''''''''''''' - -In Python source code, Unicode literals are written as strings -prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific -code points can be written using the ``\u`` escape sequence, which is -followed by four hex digits giving the code point. The ``\U`` escape -sequence is similar, but expects 8 hex digits, not 4. - -Unicode literals can also use the same escape sequences as 8-bit -strings, including ``\x``, but ``\x`` only takes two hex digits so it -can't express an arbitrary code point. Octal escapes can go up to -U+01ff, which is octal 777. - -:: - - >>> s = u"a\xac\u1234\u20ac\U00008000" - ^^^^ two-digit hex escape - ^^^^^^ four-digit Unicode escape - ^^^^^^^^^^ eight-digit Unicode escape - >>> for c in s: print ord(c), - ... - 97 172 4660 8364 32768 - -Using escape sequences for code points greater than 127 is fine in -small doses, but becomes an annoyance if you're using many accented -characters, as you would in a program with messages in French or some -other accent-using language. 
You can also assemble strings using the -``unichr()`` built-in function, but this is even more tedious. - -Ideally, you'd want to be able to write literals in your language's -natural encoding. You could then edit Python source code with your -favorite editor which would display the accented characters naturally, -and have the right characters used at runtime. - -Python supports writing Unicode literals in any encoding, but you have -to declare the encoding being used. This is done by including a -special comment as either the first or second line of the source -file:: - - #!/usr/bin/env python - # -*- coding: latin-1 -*- - - u = u'abcdé' - print ord(u[-1]) - -The syntax is inspired by Emacs's notation for specifying variables local to a file. -Emacs supports many different variables, but Python only supports 'coding'. -The ``-*-`` symbols indicate that the comment is special; within them, -you must supply the name ``coding`` and the name of your chosen encoding, -separated by ``':'``. - -If you don't include such a comment, the default encoding used will be -ASCII. Versions of Python before 2.4 were Euro-centric and assumed -Latin-1 as a default encoding for string literals; in Python 2.4, -characters greater than 127 still work but result in a warning. For -example, the following program has no encoding declaration:: - - #!/usr/bin/env python - u = u'abcdé' - print ord(u[-1]) - -When you run it with Python 2.4, it will output the following warning:: - - amk:~$ python p263.py - sys:1: DeprecationWarning: Non-ASCII character '\xe9' - in file p263.py on line 2, but no encoding declared; - see http://www.python.org/peps/pep-0263.html for details - - -Unicode Properties -''''''''''''''''''' - -The Unicode specification includes a database of information about -code points. For each code point that's defined, the information -includes the character's name, its category, the numeric value if -applicable (Unicode has characters representing the Roman numerals and -fractions such as one-third and four-fifths). There are also -properties related to the code point's use in bidirectional text and -other display-related properties. - -The following program displays some information about several -characters, and prints the numeric value of one particular character:: - - import unicodedata - - u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231) - - for i, c in enumerate(u): - print i, '%04x' % ord(c), unicodedata.category(c), - print unicodedata.name(c) - - # Get numeric value of second character - print unicodedata.numeric(u[1]) - -When run, this prints:: - - 0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE - 1 0bf2 No TAMIL NUMBER ONE THOUSAND - 2 0f84 Mn TIBETAN MARK HALANTA - 3 1770 Lo TAGBANWA LETTER SA - 4 33af So SQUARE RAD OVER S SQUARED - 1000.0 - -The category codes are abbreviations describing the nature of the -character. These are grouped into categories such as "Letter", -"Number", "Punctuation", or "Symbol", which in turn are broken up into -subcategories. To take the codes from the above output, ``'Ll'`` -means 'Letter, lowercase', ``'No'`` means "Number, other", ``'Mn'`` is -"Mark, nonspacing", and ``'So'`` is "Symbol, other". See -<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> -for a list of category codes. - -References -'''''''''''''' - -The Unicode and 8-bit string types are described in the Python library -reference at <http://docs.python.org/lib/typesseq.html>. 
- -The documentation for the ``unicodedata`` module is at -<http://docs.python.org/lib/module-unicodedata.html>. - -The documentation for the ``codecs`` module is at -<http://docs.python.org/lib/module-codecs.html>. - -Marc-André Lemburg gave a presentation at EuroPython 2002 -titled "Python and Unicode". A PDF version of his slides -is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>, -and is an excellent overview of the design of Python's Unicode features. - - -Reading and Writing Unicode Data ----------------------------------------- - -Once you've written some code that works with Unicode data, the next -problem is input/output. How do you get Unicode strings into your -program, and how do you convert Unicode into a form suitable for -storage or transmission? - -It's possible that you may not need to do anything depending on your -input sources and output destinations; you should check whether the -libraries used in your application support Unicode natively. XML -parsers often return Unicode data, for example. Many relational -databases also support Unicode-valued columns and can return Unicode -values from an SQL query. - -Unicode data is usually converted to a particular encoding before it -gets written to disk or sent over a socket. It's possible to do all -the work yourself: open a file, read an 8-bit string from it, and -convert the string with ``unicode(str, encoding)``. However, the -manual approach is not recommended. - -One problem is the multi-byte nature of encodings; one Unicode -character can be represented by several bytes. If you want to read -the file in arbitrary-sized chunks (say, 1K or 4K), you need to write -error-handling code to catch the case where only part of the bytes -encoding a single Unicode character are read at the end of a chunk. -One solution would be to read the entire file into memory and then -perform the decoding, but that prevents you from working with files -that are extremely large; if you need to read a 2Gb file, you need 2Gb -of RAM. (More, really, since for at least a moment you'd need to have -both the encoded string and its Unicode version in memory.) - -The solution would be to use the low-level decoding interface to catch -the case of partial coding sequences. The work of implementing this -has already been done for you: the ``codecs`` module includes a -version of the ``open()`` function that returns a file-like object -that assumes the file's contents are in a specified encoding and -accepts Unicode parameters for methods such as ``.read()`` and -``.write()``. - -The function's parameters are -``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``. ``mode`` can be -``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to the -regular built-in ``open()`` function; add a ``'+'`` to -update the file. ``buffering`` is similarly -parallel to the standard function's parameter. -``encoding`` is a string giving -the encoding to use; if it's left as ``None``, a regular Python file -object that accepts 8-bit strings is returned. Otherwise, a wrapper -object is returned, and data written to or read from the wrapper -object will be converted as needed. ``errors`` specifies the action -for encoding errors and can be one of the usual values of 'strict', -'ignore', and 'replace'. 
- -Reading Unicode from a file is therefore simple:: - - import codecs - f = codecs.open('unicode.rst', encoding='utf-8') - for line in f: - print repr(line) - -It's also possible to open files in update mode, -allowing both reading and writing:: - - f = codecs.open('test', encoding='utf-8', mode='w+') - f.write(u'\u4500 blah blah blah\n') - f.seek(0) - print repr(f.readline()[:1]) - f.close() - -Unicode character U+FEFF is used as a byte-order mark (BOM), -and is often written as the first character of a file in order -to assist with autodetection of the file's byte ordering. -Some encodings, such as UTF-16, expect a BOM to be present at -the start of a file; when such an encoding is used, -the BOM will be automatically written as the first character -and will be silently dropped when the file is read. There are -variants of these encodings, such as 'utf-16-le' and 'utf-16-be' -for little-endian and big-endian encodings, that specify -one particular byte ordering and don't -skip the BOM. - - -Unicode filenames -''''''''''''''''''''''''' - -Most of the operating systems in common use today support filenames -that contain arbitrary Unicode characters. Usually this is -implemented by converting the Unicode string into some encoding that -varies depending on the system. For example, MacOS X uses UTF-8 while -Windows uses a configurable encoding; on Windows, Python uses the name -"mbcs" to refer to whatever the currently configured encoding is. On -Unix systems, there will only be a filesystem encoding if you've set -the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't, -the default encoding is ASCII. - -The ``sys.getfilesystemencoding()`` function returns the encoding to -use on your current system, in case you want to do the encoding -manually, but there's not much reason to bother. When opening a file -for reading or writing, you can usually just provide the Unicode -string as the filename, and it will be automatically converted to the -right encoding for you:: - - filename = u'filename\u4500abc' - f = open(filename, 'w') - f.write('blah\n') - f.close() - -Functions in the ``os`` module such as ``os.stat()`` will also accept -Unicode filenames. - -``os.listdir()``, which returns filenames, raises an issue: should it -return the Unicode version of filenames, or should it return 8-bit -strings containing the encoded versions? ``os.listdir()`` will do -both, depending on whether you provided the directory path as an 8-bit -string or a Unicode string. If you pass a Unicode string as the path, -filenames will be decoded using the filesystem's encoding and a list -of Unicode strings will be returned, while passing an 8-bit path will -return the 8-bit versions of the filenames. For example, assuming the -default filesystem encoding is UTF-8, running the following program:: - - fn = u'filename\u4500abc' - f = open(fn, 'w') - f.close() - - import os - print os.listdir('.') - print os.listdir(u'.') - -will produce the following output:: - - amk:~$ python t.py - ['.svn', 'filename\xe4\x94\x80abc', ...] - [u'.svn', u'filename\u4500abc', ...] - -The first list contains UTF-8-encoded filenames, and the second list -contains the Unicode versions. - - - -Tips for Writing Unicode-aware Programs -'''''''''''''''''''''''''''''''''''''''''''' - -This section provides some suggestions on writing software that -deals with Unicode. - -The most important tip is: - - Software should only work with Unicode strings internally, - converting to a particular encoding on output. 
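A minimal sketch of this pattern, assuming a hypothetical ``process()``
function and UTF-8 as the external encoding (both are choices of this
illustration, not requirements)::

    def handle_data(raw_bytes):
        text = raw_bytes.decode('utf-8')    # decode as soon as data arrives
        result = process(text)              # work purely with unicode objects
        return result.encode('utf-8')       # encode only when writing out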
- -If you attempt to write processing functions that accept both -Unicode and 8-bit strings, you will find your program vulnerable to -bugs wherever you combine the two different kinds of strings. Python's -default encoding is ASCII, so whenever a character with an ASCII value >127 -is in the input data, you'll get a ``UnicodeDecodeError`` -because that character can't be handled by the ASCII encoding. - -It's easy to miss such problems if you only test your software -with data that doesn't contain any -accents; everything will seem to work, but there's actually a bug in your -program waiting for the first user who attempts to use characters >127. -A second tip, therefore, is: - - Include characters >127 and, even better, characters >255 in your - test data. - -When using data coming from a web browser or some other untrusted source, -a common technique is to check for illegal characters in a string -before using the string in a generated command line or storing it in a -database. If you're doing this, be careful to check -the string once it's in the form that will be used or stored; it's -possible for encodings to be used to disguise characters. This is especially -true if the input data also specifies the encoding; -many encodings leave the commonly checked-for characters alone, -but Python includes some encodings such as ``'base64'`` -that modify every single character. - -For example, let's say you have a content management system that takes a -Unicode filename, and you want to disallow paths with a '/' character. -You might write this code:: - - def read_file (filename, encoding): - if '/' in filename: - raise ValueError("'/' not allowed in filenames") - unicode_name = filename.decode(encoding) - f = open(unicode_name, 'r') - # ... return contents of file ... - -However, if an attacker could specify the ``'base64'`` encoding, -they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64 -encoded form of the string ``'/etc/passwd'``, to read a -system file. The above code looks for ``'/'`` characters -in the encoded form and misses the dangerous character -in the resulting decoded form. - -References -'''''''''''''' - -The PDF slides for Marc-André Lemburg's presentation "Writing -Unicode-aware Applications in Python" are available at -<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf> -and discuss questions of character encodings as well as how to -internationalize and localize an application. - - -Revision History and Acknowledgements ------------------------------------------- - -Thanks to the following people who have noted errors or offered -suggestions on this article: Nicholas Bastin, -Marius Gedminas, Kent Johnson, Ken Krugler, -Marc-André Lemburg, Martin von Löwis, Chad Whitacre. - -Version 1.0: posted August 5 2005. - -Version 1.01: posted August 7 2005. Corrects factual and markup -errors; adds several links. - -Version 1.02: posted August 16 2005. Corrects factual errors. - - -.. comment Additional topic: building Python w/ UCS2 or UCS4 support -.. comment Describe obscure -U switch somewhere? -.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter - -.. 
comment - Original outline: - - - [ ] Unicode introduction - - [ ] ASCII - - [ ] Terms - - [ ] Character - - [ ] Code point - - [ ] Encodings - - [ ] Common encodings: ASCII, Latin-1, UTF-8 - - [ ] Unicode Python type - - [ ] Writing unicode literals - - [ ] Obscurity: -U switch - - [ ] Built-ins - - [ ] unichr() - - [ ] ord() - - [ ] unicode() constructor - - [ ] Unicode type - - [ ] encode(), decode() methods - - [ ] Unicodedata module for character properties - - [ ] I/O - - [ ] Reading/writing Unicode data into files - - [ ] Byte-order marks - - [ ] Unicode filenames - - [ ] Writing Unicode programs - - [ ] Do everything in Unicode - - [ ] Declaring source code encodings (PEP 263) - - [ ] Other issues - - [ ] Building Python (UCS2, UCS4) diff --git a/Doc/howto/urllib2.rst b/Doc/howto/urllib2.rst deleted file mode 100644 index f8f4a2b..0000000 --- a/Doc/howto/urllib2.rst +++ /dev/null @@ -1,603 +0,0 @@ -============================================== - HOWTO Fetch Internet Resources Using urllib2 -============================================== ----------------------------- - Fetching URLs With Python ----------------------------- - - -.. note:: - - There is an French translation of an earlier revision of this - HOWTO, available at `urllib2 - Le Manuel manquant - <http://www.voidspace/python/articles/urllib2_francais.shtml>`_. - -.. contents:: urllib2 Tutorial - - -Introduction -============ - -.. sidebar:: Related Articles - - You may also find useful the following article on fetching web - resources with Python : - - * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_ - - A tutorial on *Basic Authentication*, with examples in Python. - - This HOWTO is written by `Michael Foord - <http://www.voidspace.org.uk/python/index.shtml>`_. - -**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs -(Uniform Resource Locators). It offers a very simple interface, in the form of -the *urlopen* function. This is capable of fetching URLs using a variety -of different protocols. It also offers a slightly more complex -interface for handling common situations - like basic authentication, -cookies, proxies and so on. These are provided by objects called -handlers and openers. - -urllib2 supports fetching URLs for many "URL schemes" (identified by the string -before the ":" in URL - for example "ftp" is the URL scheme of -"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). -This tutorial focuses on the most common case, HTTP. - -For straightforward situations *urlopen* is very easy to use. But as -soon as you encounter errors or non-trivial cases when opening HTTP -URLs, you will need some understanding of the HyperText Transfer -Protocol. The most comprehensive and authoritative reference to HTTP -is :RFC:`2616`. This is a technical document and not intended to be -easy to read. This HOWTO aims to illustrate using *urllib2*, with -enough detail about HTTP to help you through. It is not intended to -replace the `urllib2 docs <http://docs.python.org/lib/module-urllib2.html>`_ , -but is supplementary to them. - - -Fetching URLs -============= - -The simplest way to use urllib2 is as follows : :: - - import urllib2 - response = urllib2.urlopen('http://python.org/') - html = response.read() - -Many uses of urllib2 will be that simple (note that instead of an -'http:' URL we could have used an URL starting with 'ftp:', 'file:', -etc.). 
However, it's the purpose of this tutorial to explain the more -complicated cases, concentrating on HTTP. - -HTTP is based on requests and responses - the client makes requests -and servers send responses. urllib2 mirrors this with a ``Request`` -object which represents the HTTP request you are making. In its -simplest form you create a Request object that specifies the URL you -want to fetch. Calling ``urlopen`` with this Request object returns a -response object for the URL requested. This response is a file-like -object, which means you can for example call .read() on the response : -:: - - import urllib2 - - req = urllib2.Request('http://www.voidspace.org.uk') - response = urllib2.urlopen(req) - the_page = response.read() - -Note that urllib2 makes use of the same Request interface to handle -all URL schemes. For example, you can make an FTP request like so: :: - - req = urllib2.Request('ftp://example.com/') - -In the case of HTTP, there are two extra things that Request objects -allow you to do: First, you can pass data to be sent to the server. -Second, you can pass extra information ("metadata") *about* the data -or the about request itself, to the server - this information is sent -as HTTP "headers". Let's look at each of these in turn. - -Data ----- - -Sometimes you want to send data to a URL (often the URL will refer to -a CGI (Common Gateway Interface) script [#]_ or other web -application). With HTTP, this is often done using what's known as a -**POST** request. This is often what your browser does when you submit -a HTML form that you filled in on the web. Not all POSTs have to come -from forms: you can use a POST to transmit arbitrary data to your own -application. In the common case of HTML forms, the data needs to be -encoded in a standard way, and then passed to the Request object as -the ``data`` argument. The encoding is done using a function from the -``urllib`` library *not* from ``urllib2``. :: - - import urllib - import urllib2 - - url = 'http://www.someserver.com/cgi-bin/register.cgi' - values = {'name' : 'Michael Foord', - 'location' : 'Northampton', - 'language' : 'Python' } - - data = urllib.urlencode(values) - req = urllib2.Request(url, data) - response = urllib2.urlopen(req) - the_page = response.read() - -Note that other encodings are sometimes required (e.g. for file upload -from HTML forms - see -`HTML Specification, Form Submission <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ -for more details). - -If you do not pass the ``data`` argument, urllib2 uses a **GET** -request. One way in which GET and POST requests differ is that POST -requests often have "side-effects": they change the state of the -system in some way (for example by placing an order with the website -for a hundredweight of tinned spam to be delivered to your door). -Though the HTTP standard makes it clear that POSTs are intended to -*always* cause side-effects, and GET requests *never* to cause -side-effects, nothing prevents a GET request from having side-effects, -nor a POST requests from having no side-effects. Data can also be -passed in an HTTP GET request by encoding it in the URL itself. 
This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to
add headers to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send
different versions to different browsers [#]_. By default urllib2
identifies itself as ``Python-urllib/x.y`` (where ``x`` and ``y`` are
the major and minor version numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and
geturl`_ which comes after we have a look at what happens when things
go wrong.


Handling Exceptions
===================

*urlopen* raises ``URLError`` when it cannot handle a response (though
as usual with Python APIs, built-in exceptions such as ValueError,
TypeError etc. may also be raised).

``HTTPError`` is the subclass of ``URLError`` raised in the specific
case of HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no
route to the specified server), or the specified server doesn't exist.
In this case, the exception raised will have a 'reason' attribute,
which is a tuple containing an error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except URLError as e:
    ...     print e.reason
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status
code". Sometimes the status code indicates that the server is unable
to fulfil the request. The default handlers will handle some of these
responses for you (for example, if the response is a "redirection"
that requests the client fetch the document from a different URL,
urllib2 will handle that for you). For those it can't handle, urlopen
will raise an ``HTTPError``. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication
required).

See section 10 of RFC 2616 for a reference on all the HTTP error
codes.

The ``HTTPError`` instance raised will have an integer 'code'
attribute, which corresponds to the error sent by the server.
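A minimal sketch - the URL and the (arbitrary) decision to treat 404
specially are assumptions of this illustration - might look like
this::

    from urllib2 import Request, urlopen, HTTPError

    req = Request('http://www.example.com/no-such-page.html')
    try:
        response = urlopen(req)
    except HTTPError as e:
        if e.code == 404:
            print 'That document does not exist on the server.'
        else:
            # some other HTTP problem; let it propagate
            raise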
- -Error Codes -~~~~~~~~~~~ - -Because the default handlers handle redirects (codes in the 300 -range), and codes in the 100-299 range indicate success, you will -usually only see error codes in the 400-599 range. - -``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful -dictionary of response codes in that shows all the response codes used -by RFC 2616. The dictionary is reproduced here for convenience :: - - # Table mapping response codes to messages; entries have the - # form {code: (shortmessage, longmessage)}. - responses = { - 100: ('Continue', 'Request received, please continue'), - 101: ('Switching Protocols', - 'Switching to new protocol; obey Upgrade header'), - - 200: ('OK', 'Request fulfilled, document follows'), - 201: ('Created', 'Document created, URL follows'), - 202: ('Accepted', - 'Request accepted, processing continues off-line'), - 203: ('Non-Authoritative Information', 'Request fulfilled from cache'), - 204: ('No Content', 'Request fulfilled, nothing follows'), - 205: ('Reset Content', 'Clear input form for further input.'), - 206: ('Partial Content', 'Partial content follows.'), - - 300: ('Multiple Choices', - 'Object has several resources -- see URI list'), - 301: ('Moved Permanently', 'Object moved permanently -- see URI list'), - 302: ('Found', 'Object moved temporarily -- see URI list'), - 303: ('See Other', 'Object moved -- see Method and URL list'), - 304: ('Not Modified', - 'Document has not changed since given time'), - 305: ('Use Proxy', - 'You must use proxy specified in Location to access this ' - 'resource.'), - 307: ('Temporary Redirect', - 'Object moved temporarily -- see URI list'), - - 400: ('Bad Request', - 'Bad request syntax or unsupported method'), - 401: ('Unauthorized', - 'No permission -- see authorization schemes'), - 402: ('Payment Required', - 'No payment -- see charging schemes'), - 403: ('Forbidden', - 'Request forbidden -- authorization will not help'), - 404: ('Not Found', 'Nothing matches the given URI'), - 405: ('Method Not Allowed', - 'Specified method is invalid for this server.'), - 406: ('Not Acceptable', 'URI not available in preferred format.'), - 407: ('Proxy Authentication Required', 'You must authenticate with ' - 'this proxy before proceeding.'), - 408: ('Request Timeout', 'Request timed out; try again later.'), - 409: ('Conflict', 'Request conflict.'), - 410: ('Gone', - 'URI no longer exists and has been permanently removed.'), - 411: ('Length Required', 'Client must specify Content-Length.'), - 412: ('Precondition Failed', 'Precondition in headers is false.'), - 413: ('Request Entity Too Large', 'Entity is too large.'), - 414: ('Request-URI Too Long', 'URI is too long.'), - 415: ('Unsupported Media Type', 'Entity body in unsupported format.'), - 416: ('Requested Range Not Satisfiable', - 'Cannot satisfy request range.'), - 417: ('Expectation Failed', - 'Expect condition could not be satisfied.'), - - 500: ('Internal Server Error', 'Server got itself in trouble'), - 501: ('Not Implemented', - 'Server does not support this operation'), - 502: ('Bad Gateway', 'Invalid responses from another server/proxy.'), - 503: ('Service Unavailable', - 'The server cannot process the request due to a high load'), - 504: ('Gateway Timeout', - 'The gateway server did not receive a timely response'), - 505: ('HTTP Version Not Supported', 'Cannot fulfill request.'), - } - -When an error is raised the server responds by returning an HTTP error -code *and* an error page. 
You can use the ``HTTPError`` instance as a -response on the page returned. This means that as well as the code -attribute, it also has read, geturl, and info, methods. :: - - >>> req = urllib2.Request('http://www.python.org/fish.html') - >>> try: - >>> urllib2.urlopen(req) - >>> except URLError as e: - >>> print e.code - >>> print e.read() - >>> - 404 - <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> - <?xml-stylesheet href="./css/ht2html.css" - type="text/css"?> - <html><head><title>Error 404: File Not Found</title> - ...... etc... - -Wrapping it Up --------------- - -So if you want to be prepared for ``HTTPError`` *or* ``URLError`` -there are two basic approaches. I prefer the second approach. - -Number 1 -~~~~~~~~ - -:: - - - from urllib2 import Request, urlopen, URLError, HTTPError - req = Request(someurl) - try: - response = urlopen(req) - except HTTPError as e: - print 'The server couldn\'t fulfill the request.' - print 'Error code: ', e.code - except URLError as e: - print 'We failed to reach a server.' - print 'Reason: ', e.reason - else: - # everything is fine - - -.. note:: - - The ``except HTTPError`` *must* come first, otherwise ``except URLError`` - will *also* catch an ``HTTPError``. - -Number 2 -~~~~~~~~ - -:: - - from urllib2 import Request, urlopen, URLError - req = Request(someurl) - try: - response = urlopen(req) - except URLError as e: - if hasattr(e, 'reason'): - print 'We failed to reach a server.' - print 'Reason: ', e.reason - elif hasattr(e, 'code'): - print 'The server couldn\'t fulfill the request.' - print 'Error code: ', e.code - else: - # everything is fine - - -info and geturl -=============== - -The response returned by urlopen (or the ``HTTPError`` instance) has -two useful methods ``info`` and ``geturl``. - -**geturl** - this returns the real URL of the page fetched. This is -useful because ``urlopen`` (or the opener object used) may have -followed a redirect. The URL of the page fetched may not be the same -as the URL requested. - -**info** - this returns a dictionary-like object that describes the -page fetched, particularly the headers sent by the server. It is -currently an ``httplib.HTTPMessage`` instance. - -Typical headers include 'Content-length', 'Content-type', and so -on. See the -`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_ -for a useful listing of HTTP headers with brief explanations of their meaning -and use. - - -Openers and Handlers -==================== - -When you fetch a URL you use an opener (an instance of the perhaps -confusingly-named ``urllib2.OpenerDirector``). Normally we have been using -the default opener - via ``urlopen`` - but you can create custom -openers. Openers use handlers. All the "heavy lifting" is done by the -handlers. Each handler knows how to open URLs for a particular URL -scheme (http, ftp, etc.), or how to handle an aspect of URL opening, -for example HTTP redirections or HTTP cookies. - -You will want to create openers if you want to fetch URLs with -specific handlers installed, for example to get an opener that handles -cookies, or to get an opener that does not handle redirections. - -To create an opener, instantiate an OpenerDirector, and then call -.add_handler(some_handler_instance) repeatedly. - -Alternatively, you can use ``build_opener``, which is a convenience -function for creating opener objects with a single function call. 
-``build_opener`` adds several handlers by default, but provides a -quick way to add more and/or override the default handlers. - -Other sorts of handlers you might want to can handle proxies, -authentication, and other common but slightly specialised -situations. - -``install_opener`` can be used to make an ``opener`` object the -(global) default opener. This means that calls to ``urlopen`` will use -the opener you have installed. - -Opener objects have an ``open`` method, which can be called directly -to fetch urls in the same way as the ``urlopen`` function: there's no -need to call ``install_opener``, except as a convenience. - - -Basic Authentication -==================== - -To illustrate creating and installing a handler we will use the -``HTTPBasicAuthHandler``. For a more detailed discussion of this -subject - including an explanation of how Basic Authentication works - -see the `Basic Authentication Tutorial <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_. - -When authentication is required, the server sends a header (as well as -the 401 error code) requesting authentication. This specifies the -authentication scheme and a 'realm'. The header looks like : -``Www-authenticate: SCHEME realm="REALM"``. - -e.g. :: - - Www-authenticate: Basic realm="cPanel Users" - - -The client should then retry the request with the appropriate name and -password for the realm included as a header in the request. This is -'basic authentication'. In order to simplify this process we can -create an instance of ``HTTPBasicAuthHandler`` and an opener to use -this handler. - -The ``HTTPBasicAuthHandler`` uses an object called a password manager -to handle the mapping of URLs and realms to passwords and -usernames. If you know what the realm is (from the authentication -header sent by the server), then you can use a -``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In -that case, it is convenient to use -``HTTPPasswordMgrWithDefaultRealm``. This allows you to specify a -default username and password for a URL. This will be supplied in the -absence of you providing an alternative combination for a specific -realm. We indicate this by providing ``None`` as the realm argument to -the ``add_password`` method. - -The top-level URL is the first URL that requires authentication. URLs -"deeper" than the URL you pass to .add_password() will also match. :: - - # create a password manager - password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm() - - # Add the username and password. - # If we knew the realm, we could use it instead of ``None``. - top_level_url = "http://example.com/foo/" - password_mgr.add_password(None, top_level_url, username, password) - - handler = urllib2.HTTPBasicAuthHandler(password_mgr) - - # create "opener" (OpenerDirector instance) - opener = urllib2.build_opener(handler) - - # use the opener to fetch a URL - opener.open(a_url) - - # Install the opener. - # Now all calls to urllib2.urlopen use our opener. - urllib2.install_opener(opener) - -.. note:: - - In the above example we only supplied our ``HHTPBasicAuthHandler`` - to ``build_opener``. By default openers have the handlers for - normal situations - ``ProxyHandler``, ``UnknownHandler``, - ``HTTPHandler``, ``HTTPDefaultErrorHandler``, - ``HTTPRedirectHandler``, ``FTPHandler``, ``FileHandler``, - ``HTTPErrorProcessor``. - -top_level_url is in fact *either* a full URL (including the 'http:' -scheme component and the hostname and optionally the port number) -e.g. 
"http://example.com/" *or* an "authority" (i.e. the hostname, -optionally including the port number) e.g. "example.com" or -"example.com:8080" (the latter example includes a port number). The -authority, if present, must NOT contain the "userinfo" component - for -example "joe@password:example.com" is not correct. - - -Proxies -======= - -**urllib2** will auto-detect your proxy settings and use those. This -is through the ``ProxyHandler`` which is part of the normal handler -chain. Normally that's a good thing, but there are occasions when it -may not be helpful [#]_. One way to do this is to setup our own -``ProxyHandler``, with no proxies defined. This is done using similar -steps to setting up a `Basic Authentication`_ handler : :: - - >>> proxy_support = urllib2.ProxyHandler({}) - >>> opener = urllib2.build_opener(proxy_support) - >>> urllib2.install_opener(opener) - -.. note:: - - Currently ``urllib2`` *does not* support fetching of ``https`` - locations through a proxy. However, this can be enabled by extending - urllib2 as shown in the recipe [#]_. - - -Sockets and Layers -================== - -The Python support for fetching resources from the web is -layered. urllib2 uses the httplib library, which in turn uses the -socket library. - -As of Python 2.3 you can specify how long a socket should wait for a -response before timing out. This can be useful in applications which -have to fetch web pages. By default the socket module has *no timeout* -and can hang. Currently, the socket timeout is not exposed at the -httplib or urllib2 levels. However, you can set the default timeout -globally for all sockets using : :: - - import socket - import urllib2 - - # timeout in seconds - timeout = 10 - socket.setdefaulttimeout(timeout) - - # this call to urllib2.urlopen now uses the default timeout - # we have set in the socket module - req = urllib2.Request('http://www.voidspace.org.uk') - response = urllib2.urlopen(req) - - -------- - - -Footnotes -========= - -This document was reviewed and revised by John Lee. - -.. [#] For an introduction to the CGI protocol see - `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_. -.. [#] Like Google for example. The *proper* way to use google from a program - is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See - `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_ - for some examples of using the Google API. -.. [#] Browser sniffing is a very bad practise for website design - building - sites using web standards is much more sensible. Unfortunately a lot of - sites still send different versions to different browsers. -.. [#] The user agent for MSIE 6 is - *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'* -.. [#] For details of more HTTP request headers, see - `Quick Reference to HTTP Headers`_. -.. [#] In my case I have to use a proxy to access the internet at work. If you - attempt to fetch *localhost* URLs through this proxy it blocks them. IE - is set to use the proxy, which urllib2 picks up on. In order to test - scripts with a localhost server, I have to prevent urllib2 from using - the proxy. -.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe - <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_. - |