author | Andrew M. Kuchling <amk@amk.ca> | 2005-08-30 01:25:05 (GMT) |
---|---|---|
committer | Andrew M. Kuchling <amk@amk.ca> | 2005-08-30 01:25:05 (GMT) |
commit | e8f44d683e79c7a9659a4480736d55193da4a7b1 (patch) | |
tree | 37e8b05066aa1caf85f6b25d52f1576366e45e8e | |
parent | f1b2ba6aa1751c5325e8fb87a28e54a857796bfa (diff) | |
Commit the howto source to the main Python repository, with Fred's approval
-rw-r--r-- | Doc/howto/Makefile | 88 | ||||
-rw-r--r-- | Doc/howto/advocacy.tex | 405 | ||||
-rw-r--r-- | Doc/howto/curses.tex | 485 | ||||
-rw-r--r-- | Doc/howto/doanddont.tex | 343 | ||||
-rw-r--r-- | Doc/howto/regex.tex | 1466 | ||||
-rw-r--r-- | Doc/howto/rexec.tex | 61 | ||||
-rw-r--r-- | Doc/howto/sockets.tex | 460 | ||||
-rw-r--r-- | Doc/howto/sorting.tex | 267 | ||||
-rw-r--r-- | Doc/howto/unicode.rst | 765 |
9 files changed, 4340 insertions, 0 deletions
diff --git a/Doc/howto/Makefile b/Doc/howto/Makefile new file mode 100644 index 0000000..19701c6 --- /dev/null +++ b/Doc/howto/Makefile @@ -0,0 +1,88 @@ + +MKHOWTO=../tools/mkhowto +WEBDIR=. +RSTARGS = --input-encoding=utf-8 +VPATH=.:dvi:pdf:ps:txt + +# List of HOWTOs that aren't to be processed + +REMOVE_HOWTO = + +# Determine list of files to be built + +HOWTO=$(filter-out $(REMOVE_HOWTO),$(wildcard *.tex)) +RST_SOURCES = $(shell echo *.rst) +DVI =$(patsubst %.tex,%.dvi,$(HOWTO)) +PDF =$(patsubst %.tex,%.pdf,$(HOWTO)) +PS =$(patsubst %.tex,%.ps,$(HOWTO)) +TXT =$(patsubst %.tex,%.txt,$(HOWTO)) +HTML =$(patsubst %.tex,%,$(HOWTO)) + +# Rules for building various formats +%.dvi : %.tex + $(MKHOWTO) --dvi $< + mv $@ dvi + +%.pdf : %.tex + $(MKHOWTO) --pdf $< + mv $@ pdf + +%.ps : %.tex + $(MKHOWTO) --ps $< + mv $@ ps + +%.txt : %.tex + $(MKHOWTO) --text $< + mv $@ txt + +% : %.tex + $(MKHOWTO) --html --iconserver="." $< + tar -zcvf html/$*.tgz $* + #zip -r html/$*.zip $* + +default: + @echo "'all' -- build all files" + @echo "'dvi', 'pdf', 'ps', 'txt', 'html' -- build one format" + +all: $(HTML) + +.PHONY : dvi pdf ps txt html rst +dvi: $(DVI) + +pdf: $(PDF) +ps: $(PS) +txt: $(TXT) +html:$(HTML) + +# Rule to build collected tar files +dist: #all + for i in dvi pdf ps txt ; do \ + cd $$i ; \ + tar -zcf All.tgz *.$$i ;\ + cd .. 
;\ + done + +# Rule to copy files to the Web tree on AMK's machine +web: dist + cp dvi/* $(WEBDIR)/dvi + cp ps/* $(WEBDIR)/ps + cp pdf/* $(WEBDIR)/pdf + cp txt/* $(WEBDIR)/txt + for dir in $(HTML) ; do cp -rp $$dir $(WEBDIR) ; done + for ltx in $(HOWTO) ; do cp -p $$ltx $(WEBDIR)/latex ; done + +rst: unicode.html + +%.html: %.rst + rst2html $(RSTARGS) $< >$@ + +clean: + rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how + rm -f *.dvi *.ps *.pdf *.bkm + rm -f unicode.html + +clobber: + rm dvi/* ps/* pdf/* txt/* html/* + + + diff --git a/Doc/howto/advocacy.tex b/Doc/howto/advocacy.tex new file mode 100644 index 0000000..619242b --- /dev/null +++ b/Doc/howto/advocacy.tex @@ -0,0 +1,405 @@ + +\documentclass{howto} + +\title{Python Advocacy HOWTO} + +\release{0.03} + +\author{A.M. Kuchling} +\authoraddress{\email{amk@amk.ca}} + +\begin{document} +\maketitle + +\begin{abstract} +\noindent +It's usually difficult to get your management to accept open source +software, and Python is no exception to this rule. This document +discusses reasons to use Python, strategies for winning acceptance, +facts and arguments you can use, and cases where you \emph{shouldn't} +try to use Python. + +This document is available from the Python HOWTO page at +\url{http://www.python.org/doc/howto}. + +\end{abstract} + +\tableofcontents + +\section{Reasons to Use Python} + +There are several reasons to incorporate a scripting language into +your development process, and this section will discuss them, and why +Python has some properties that make it a particularly good choice. + + \subsection{Programmability} + +Programs are often organized in a modular fashion. Lower-level +operations are grouped together, and called by higher-level functions, +which may in turn be used as basic operations by still further upper +levels. + +For example, the lowest level might define a very low-level +set of functions for accessing a hash table. 
The next level might use +hash tables to store the headers of a mail message, mapping a header +name like \samp{Date} to a value such as \samp{Tue, 13 May 1997 +20:00:54 -0400}. A yet higher level may operate on message objects, +without knowing or caring that message headers are stored in a hash +table, and so forth. + +Often, the lowest levels do very simple things; they implement a data +structure such as a binary tree or hash table, or they perform some +simple computation, such as converting a date string to a number. The +higher levels then contain logic connecting these primitive +operations. Using the approach, the primitives can be seen as basic +building blocks which are then glued together to produce the complete +product. + +Why is this design approach relevant to Python? Because Python is +well suited to functioning as such a glue language. A common approach +is to write a Python module that implements the lower level +operations; for the sake of speed, the implementation might be in C, +Java, or even Fortran. Once the primitives are available to Python +programs, the logic underlying higher level operations is written in +the form of Python code. The high-level logic is then more +understandable, and easier to modify. + +John Ousterhout wrote a paper that explains this idea at greater +length, entitled ``Scripting: Higher Level Programming for the 21st +Century''. I recommend that you read this paper; see the references +for the URL. Ousterhout is the inventor of the Tcl language, and +therefore argues that Tcl should be used for this purpose; he only +briefly refers to other languages such as Python, Perl, and +Lisp/Scheme, but in reality, Ousterhout's argument applies to +scripting languages in general, since you could equally write +extensions for any of the languages mentioned above. 
+ + \subsection{Prototyping} + +In \emph{The Mythical Man-Month}, Frederick Brooks suggests the +following rule when planning software projects: ``Plan to throw one +away; you will anyway.'' Brooks is saying that the first attempt at a +software design often turns out to be wrong; unless the problem is +very simple or you're an extremely good designer, you'll find that new +requirements and features become apparent once development has +actually started. If these new requirements can't be cleanly +incorporated into the program's structure, you're presented with two +unpleasant choices: hammer the new features into the program somehow, +or scrap everything and write a new version of the program, taking the +new features into account from the beginning. + +Python provides you with a good environment for quickly developing an +initial prototype. That lets you get the overall program structure +and logic right, and you can fine-tune small details in the fast +development cycle that Python provides. Once you're satisfied with +the GUI interface or program output, you can translate the Python code +into C++, Fortran, Java, or some other compiled language. + +Prototyping means you have to be careful not to use too many Python +features that are hard to implement in your other language. Using +\code{eval()}, or regular expressions, or the \module{pickle} module, +means that you're going to need C or Java libraries for formula +evaluation, regular expressions, and serialization, for example. But +it's not hard to avoid such tricky code, and in the end the +translation usually isn't very difficult. The resulting code can be +rapidly debugged, because any serious logical errors will have been +removed from the prototype, leaving only more minor slip-ups in the +translation to track down. + +This strategy builds on the earlier discussion of programmability. +Using Python as glue to connect lower-level components has obvious +relevance for constructing prototype systems. 
In this way Python can +help you with development, even if end users never come in contact +with Python code at all. If the performance of the Python version is +adequate and corporate politics allow it, you may not need to do a +translation into C or Java, but it can still be faster to develop a +prototype and then translate it, instead of attempting to produce the +final version immediately. + +One example of this development strategy is Microsoft Merchant Server. +Version 1.0 was written in pure Python, by a company that subsequently +was purchased by Microsoft. Version 2.0 began to translate the code +into \Cpp, shipping with some \Cpp code and some Python code. Version +3.0 didn't contain any Python at all; all the code had been translated +into \Cpp. Even though the product doesn't contain a Python +interpreter, the Python language has still served a useful purpose by +speeding up development. + +This is a very common use for Python. Past conference papers have +also described this approach for developing high-level numerical +algorithms; see David M. Beazley and Peter S. Lomdahl's paper +``Feeding a Large-scale Physics Application to Python'' in the +references for a good example. If an algorithm's basic operations are +things like "Take the inverse of this 4000x4000 matrix", and are +implemented in some lower-level language, then Python has almost no +additional performance cost; the extra time required for Python to +evaluate an expression like \code{m.invert()} is dwarfed by the cost +of the actual computation. It's particularly good for applications +where seemingly endless tweaking is required to get things right. GUI +interfaces and Web sites are prime examples. 
+ +The Python code is also shorter and faster to write (once you're +familiar with Python), so it's easier to throw it away if you decide +your approach was wrong; if you'd spent two weeks working on it +instead of just two hours, you might waste time trying to patch up +what you've got out of a natural reluctance to admit that those two +weeks were wasted. Truthfully, those two weeks haven't been wasted, +since you've learnt something about the problem and the technology +you're using to solve it, but it's human nature to view this as a +failure of some sort. + + \subsection{Simplicity and Ease of Understanding} + +Python is definitely \emph{not} a toy language that's only usable for +small tasks. The language features are general and powerful enough to +enable it to be used for many different purposes. It's useful at the +small end, for 10- or 20-line scripts, but it also scales up to larger +systems that contain thousands of lines of code. + +However, this expressiveness doesn't come at the cost of an obscure or +tricky syntax. While Python has some dark corners that can lead to +obscure code, there are relatively few such corners, and proper design +can isolate their use to only a few classes or modules. It's +certainly possible to write confusing code by using too many features +with too little concern for clarity, but most Python code can look a +lot like a slightly-formalized version of human-understandable +pseudocode. + +In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following +definition for "compact": + +\begin{quotation} + Compact \emph{adj.} Of a design, describes the valuable property + that it can all be apprehended at once in one's head. This + generally means the thing created from the design can be used + with greater facility and fewer errors than an equivalent tool + that is not compact. Compactness does not imply triviality or + lack of power; for example, C is compact and FORTRAN is not, + but C is more powerful than FORTRAN. 
Designs become + non-compact through accreting features and cruft that don't + merge cleanly into the overall design scheme (thus, some fans + of Classic C maintain that ANSI C is no longer compact). +\end{quotation} + +(From \url{http://sagan.earthspace.net/jargon/jargon_18.html\#SEC25}) + +In this sense of the word, Python is quite compact, because the +language has just a few ideas, which are used in lots of places. Take +namespaces, for example. Import a module with \code{import math}, and +you create a new namespace called \samp{math}. Classes are also +namespaces that share many of the properties of modules, and have a +few of their own; for example, you can create instances of a class. +Instances? They're yet another namespace. Namespaces are currently +implemented as Python dictionaries, so they have the same methods as +the standard dictionary data type: .keys() returns all the keys, and +so forth. + +This simplicity arises from Python's development history. The +language syntax derives from different sources; ABC, a relatively +obscure teaching language, is one primary influence, and Modula-3 is +another. (For more information about ABC and Modula-3, consult their +respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and +\url{http://www.m3.org}.) Other features have come from C, Icon, +Algol-68, and even Perl. Python hasn't really innovated very much, +but instead has tried to keep the language small and easy to learn, +building on ideas that have been tried in other languages and found +useful. + +Simplicity is a virtue that should not be underestimated. It lets you +learn the language more quickly, and then rapidly write code, code +that often works the first time you run it. + + \subsection{Java Integration} + +If you're working with Java, Jython +(\url{http://www.jython.org/}) is definitely worth your +attention. Jython is a re-implementation of Python in Java that +compiles Python code into Java bytecodes. 
The resulting environment +has very tight, almost seamless, integration with Java. It's trivial +to access Java classes from Python, and you can write Python classes +that subclass Java classes. Jython can be used for prototyping Java +applications in much the same way CPython is used, and it can also be +used for test suites for Java code, or embedded in a Java application +to add scripting capabilities. + +\section{Arguments and Rebuttals} + +Let's say that you've decided upon Python as the best choice for your +application. How can you convince your management, or your fellow +developers, to use Python? This section lists some common arguments +against using Python, and provides some possible rebuttals. + +\emph{Python is freely available software that doesn't cost anything. +How good can it be?} + +Very good, indeed. These days Linux and Apache, two other pieces of +open source software, are becoming more respected as alternatives to +commercial software, but Python hasn't had all the publicity. + +Python has been around for several years, with many users and +developers. Accordingly, the interpreter has been used by many +people, and has gotten most of the bugs shaken out of it. While bugs +are still discovered at intervals, they're usually either quite +obscure (they'd have to be, for no one to have run into them before) +or they involve interfaces to external libraries. The internals of +the language itself are quite stable. + +Having the source code should be viewed as making the software +available for peer review; people can examine the code, suggest (and +implement) improvements, and track down bugs. To find out more about +the idea of open source code, along with arguments and case studies +supporting it, go to \url{http://www.opensource.org}. + +\emph{Who's going to support it?} + +Python has a sizable community of developers, and the number is still +growing. 
The Internet community surrounding the language is an active +one, and is worth being considered another one of Python's advantages. +Most questions posted to the comp.lang.python newsgroup are quickly +answered by someone. + +Should you need to dig into the source code, you'll find it's clear +and well-organized, so it's not very difficult to write extensions and +track down bugs yourself. If you'd prefer to pay for support, there +are companies and individuals who offer commercial support for Python. + +\emph{Who uses Python for serious work?} + +Lots of people; one interesting thing about Python is the surprising +diversity of applications that it's been used for. People are using +Python to: + +\begin{itemize} +\item Run Web sites +\item Write GUI interfaces +\item Control +number-crunching code on supercomputers +\item Make a commercial application scriptable by embedding the Python +interpreter inside it +\item Process large XML data sets +\item Build test suites for C or Java code +\end{itemize} + +Whatever your application domain is, there's probably someone who's +used Python for something similar. Yet, despite being useable for +such high-end applications, Python's still simple enough to use for +little jobs. + +See \url{http://www.python.org/psa/Users.html} for a list of some of the +organizations that use Python. + +\emph{What are the restrictions on Python's use?} + +They're practically nonexistent. Consult the \file{Misc/COPYRIGHT} +file in the source distribution, or +\url{http://www.python.org/doc/Copyright.html} for the full language, +but it boils down to three conditions. + +\begin{itemize} + +\item You have to leave the copyright notice on the software; if you +don't include the source code in a product, you have to put the +copyright notice in the supporting documentation. + +\item Don't claim that the institutions that have developed Python +endorse your product in any way. + +\item If something goes wrong, you can't sue for damages. 
Practically +all software licences contain this condition. + +\end{itemize} + +Notice that you don't have to provide source code for anything that +contains Python or is built with it. Also, the Python interpreter and +accompanying documentation can be modified and redistributed in any +way you like, and you don't have to pay anyone any licensing fees at +all. + +\emph{Why should we use an obscure language like Python instead of +well-known language X?} + +I hope this HOWTO, and the documents listed in the final section, will +help convince you that Python isn't obscure, and has a healthily +growing user base. One word of advice: always present Python's +positive advantages, instead of concentrating on language X's +failings. People want to know why a solution is good, rather than why +all the other solutions are bad. So instead of attacking a competing +solution on various grounds, simply show how Python's virtues can +help. + + +\section{Useful Resources} + +\begin{definitions} + +\term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}} + +The first chapter of \emph{Internet Programming with Python} also +examines some of the reasons for using Python. The book is well worth +buying, but the publishers have made the first chapter available on +the Web. + +\term{\url{http://home.pacbell.net/ouster/scripting.html}} + +John Ousterhout's white paper on scripting is a good argument for the +utility of scripting languages, though naturally enough, he emphasizes +Tcl, the language he developed. Most of the arguments would apply to +any scripting language. + +\term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}} + +The authors, David M. Beazley and Peter S. Lomdahl, +describe their use of Python at Los Alamos National Laboratory. +It's another good example of how Python can help get real work done. 
+This quotation from the paper has been echoed by many people: + +\begin{quotation} + Originally developed as a large monolithic application for + massively parallel processing systems, we have used Python to + transform our application into a flexible, highly modular, and + extremely powerful system for performing simulation, data + analysis, and visualization. In addition, we describe how Python + has solved a number of important problems related to the + development, debugging, deployment, and maintenance of scientific + software. +\end{quotation} + +%\term{\url{http://www.pythonjournal.com/volume1/art-interview/}} + +%This interview with Andy Feit, discussing Infoseek's use of Python, can be +%used to show that choosing Python didn't introduce any difficulties +%into a company's development process, and provided some substantial benefits. + +\term{\url{http://www.python.org/psa/Commercial.html}} + +Robin Friedrich wrote this document on how to support Python's use in +commercial projects. + +\term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}} + +For the 6th Python conference, Greg Stein presented a paper that +traced Python's adoption and usage at a startup called eShop, and +later at Microsoft. + +\term{\url{http://www.opensource.org}} + +Management may be doubtful of the reliability and usefulness of +software that wasn't written commercially. This site presents +arguments that show how open source software can have considerable +advantages over closed-source software. + +\term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}} + +The Linux Advocacy mini-HOWTO was the inspiration for this document, +and is also well worth reading for general suggestions on winning +acceptance for a new technology, such as Linux or Python. In general, +you won't make much progress by simply attacking existing systems and +complaining about their inadequacies; this often ends up looking like +unfocused whining. 
It's much better to point out some of the many +areas where Python is an improvement over other systems. + +\end{definitions} + +\end{document} + + diff --git a/Doc/howto/curses.tex b/Doc/howto/curses.tex new file mode 100644 index 0000000..a6a0e0a --- /dev/null +++ b/Doc/howto/curses.tex @@ -0,0 +1,485 @@ +\documentclass{howto} + +\title{Curses Programming with Python} + +\release{2.01} + +\author{A.M. Kuchling, Eric S. Raymond} +\authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}} + +\begin{document} +\maketitle + +\begin{abstract} +\noindent +This document describes how to write text-mode programs with Python 2.x, +using the \module{curses} extension module to control the display. + +This document is available from the Python HOWTO page at +\url{http://www.python.org/doc/howto}. +\end{abstract} + +\tableofcontents + +\section{What is curses?} + +The curses library supplies a terminal-independent screen-painting and +keyboard-handling facility for text-based terminals; such terminals +include VT100s, the Linux console, and the simulated terminal provided +by X11 programs such as xterm and rxvt. Display terminals support +various control codes to perform common operations such as moving the +cursor, scrolling the screen, and erasing areas. Different terminals +use widely differing codes, and often have their own minor quirks. + +In a world of X displays, one might ask ``why bother''? It's true +that character-cell display terminals are an obsolete technology, but +there are niches in which being able to do fancy things with them is +still valuable. One is on small-footprint or embedded Unixes that +don't carry an X server. Another is for tools like OS installers +and kernel configurators that may have to run before X is available. + +The curses library hides all the details of different terminals, and +provides the programmer with an abstraction of a display, containing +multiple non-overlapping windows. 
The contents of a window can be +changed in various ways--adding text, erasing it, changing its +appearance--and the curses library will automagically figure out what +control codes need to be sent to the terminal to produce the right +output. + +The curses library was originally written for BSD Unix; the later System V +versions of Unix from AT\&T added many enhancements and new functions. +BSD curses is no longer maintained, having been replaced by ncurses, +which is an open-source implementation of the AT\&T interface. If you're +using an open-source Unix such as Linux or FreeBSD, your system almost +certainly uses ncurses. Since most current commercial Unix versions +are based on System V code, all the functions described here will +probably be available. The older versions of curses carried by some +proprietary Unixes may not support everything, though. + +No one has made a Windows port of the curses module. On a Windows +platform, try the Console module written by Fredrik Lundh. The +Console module provides cursor-addressable text output, plus full +support for mouse and keyboard input, and is available from +\url{http://effbot.org/efflib/console}. + +\subsection{The Python curses module} + +The Python module is a fairly simple wrapper over the C functions +provided by curses; if you're already familiar with curses programming +in C, it's really easy to transfer that knowledge to Python. The +biggest difference is that the Python interface makes things simpler, +by merging different C functions such as \function{addstr}, +\function{mvaddstr}, \function{mvwaddstr}, into a single +\method{addstr()} method. You'll see this covered in more detail +later. + +This HOWTO is simply an introduction to writing text-mode programs +with curses and Python. It doesn't attempt to be a complete guide to +the curses API; for that, see the Python library guide's section on +ncurses, and the C manual pages for ncurses. It will, however, give +you the basic ideas. 
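As a rough illustration of how the single \method{addstr()} method stands in for the family of C functions mentioned in this subsection, here is a minimal sketch. The function name `draw_greeting`, the strings, and the coordinates are made up for this example, and the commented-out launch line assumes Python 3, where `curses.wrapper` is a function rather than the `curses.wrapper` module described in this HOWTO.

```python
import curses


def draw_greeting(win):
    """Call addstr() in three of its argument forms on a curses window."""
    # String only: draw at the current cursor position
    # (the role played by addstr/waddstr in C).
    win.addstr("Hello")
    # y, x, string: move first, then draw (mvaddstr/mvwaddstr in C).
    # Note that y comes before x.
    win.addstr(2, 4, "Hello again")
    # y, x, string, attribute: same as above, drawn in bold.
    win.addstr(4, 4, "Hello in bold", curses.A_BOLD)
    win.refresh()  # push the accumulated changes to the terminal
    win.getch()    # wait for a key so the output stays visible


# To try it in a real terminal (assumption: Python 3):
#     curses.wrapper(draw_greeting)
```

Because `draw_greeting` only relies on the window object passed to it, the same function should work unchanged on `stdscr` or on a smaller window.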
+ +\section{Starting and ending a curses application} + +Before doing anything, curses must be initialized. This is done by +calling the \function{initscr()} function, which will determine the +terminal type, send any required setup codes to the terminal, and +create various internal data structures. If successful, +\function{initscr()} returns a window object representing the entire +screen; this is usually called \code{stdscr}, after the name of the +corresponding C +variable. + +\begin{verbatim} +import curses +stdscr = curses.initscr() +\end{verbatim} + +Usually curses applications turn off automatic echoing of keys to the +screen, in order to be able to read keys and only display them under +certain circumstances. This requires calling the \function{noecho()} +function. + +\begin{verbatim} +curses.noecho() +\end{verbatim} + +Applications will also commonly need to react to keys instantly, +without requiring the Enter key to be pressed; this is called cbreak +mode, as opposed to the usual buffered input mode. + +\begin{verbatim} +curses.cbreak() +\end{verbatim} + +Terminals usually return special keys, such as the cursor keys or +navigation keys such as Page Up and Home, as a multibyte escape +sequence. While you could write your application to expect such +sequences and process them accordingly, curses can do it for you, +returning a special value such as \constant{curses.KEY_LEFT}. To get +curses to do the job, you'll have to enable keypad mode. + +\begin{verbatim} +stdscr.keypad(1) +\end{verbatim} + +Terminating a curses application is much easier than starting one. +You'll need to call + +\begin{verbatim} +curses.nocbreak(); stdscr.keypad(0); curses.echo() +\end{verbatim} + +to reverse the curses-friendly terminal settings. Then call the +\function{endwin()} function to restore the terminal to its original +operating mode. 
+ +\begin{verbatim} +curses.endwin() +\end{verbatim} + +A common problem when debugging a curses application is to get your +terminal messed up when the application dies without restoring the +terminal to its previous state. In Python this commonly happens when +your code is buggy and raises an uncaught exception. Keys are no +longer echoed to the screen when you type them, for example, which +makes using the shell difficult. + +In Python you can avoid these complications and make debugging much +easier by importing the module \module{curses.wrapper}. It supplies a +function \function{wrapper} that takes a hook argument. It does the +initializations described above, and also initializes colors if color +support is present. It then runs your hook, and then finally +deinitializes appropriately. The hook is called inside a try-except +clause which catches exceptions, performs curses deinitialization, and +then passes the exception upwards. Thus, your terminal won't be left +in a funny state on exception. + +\section{Windows and Pads} + +Windows are the basic abstraction in curses. A window object +represents a rectangular area of the screen, and supports various + methods to display text, erase it, allow the user to input strings, +and so forth. + +The \code{stdscr} object returned by the \function{initscr()} function +is a window object that covers the entire screen. Many programs may +need only this single window, but you might wish to divide the screen +into smaller windows, in order to redraw or clear them separately. +The \function{newwin()} function creates a new window of a given size, +returning the new window object. + +\begin{verbatim} +begin_x = 20 ; begin_y = 7 +height = 5 ; width = 40 +win = curses.newwin(height, width, begin_y, begin_x) +\end{verbatim} + +A word about the coordinate system used in curses: coordinates are +always passed in the order \emph{y,x}, and the top-left corner of a +window is coordinate (0,0). 
This breaks a common convention for +handling coordinates, where the \emph{x} coordinate usually comes +first. This is an unfortunate difference from most other computer +applications, but it's been part of curses since it was first written, +and it's too late to change things now. + +When you call a method to display or erase text, the effect doesn't +immediately show up on the display. This is because curses was +originally written with slow 300-baud terminal connections in mind; +with these terminals, minimizing the time required to redraw the +screen is very important. This lets curses accumulate changes to the +screen, and display them in the most efficient manner. For example, +if your program displays some characters in a window, and then clears +the window, there's no need to send the original characters because +they'd never be visible. + +Accordingly, curses requires that you explicitly tell it to redraw +windows, using the \function{refresh()} method of window objects. In +practice, this doesn't really complicate programming with curses much. +Most programs go into a flurry of activity, and then pause waiting for +a keypress or some other action on the part of the user. All you have +to do is to be sure that the screen has been redrawn before pausing to +wait for user input, by simply calling \code{stdscr.refresh()} or the +\function{refresh()} method of some other relevant window. + +A pad is a special case of a window; it can be larger than the actual +display screen, and only a portion of it displayed at a time. +Creating a pad simply requires the pad's height and width, while +refreshing a pad requires giving the coordinates of the on-screen +area where a subsection of the pad will be displayed. 
+ +\begin{verbatim} +pad = curses.newpad(100, 100) +# These loops fill the pad with letters; this is +# explained in the next section +for y in range(0, 100): + for x in range(0, 100): + try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 ) + except curses.error: pass + +# Displays a section of the pad in the middle of the screen +pad.refresh( 0,0, 5,5, 20,75) +\end{verbatim} + +The \function{refresh()} call displays a section of the pad in the +rectangle extending from coordinate (5,5) to coordinate (20,75) on the +screen; the upper left corner of the displayed section is coordinate +(0,0) on the pad. Beyond that difference, pads are exactly like +ordinary windows and support the same methods. + +If you have multiple windows and pads on screen there is a more +efficient way to go, which will prevent annoying screen flicker at +refresh time. Use the +\method{noutrefresh()} method of each window to update the data structure +representing the desired state of the screen; then change the physical +screen to match the desired state in one go with the function +\function{doupdate()}. The normal \method{refresh()} method calls +\function{doupdate()} as its last act. + +\section{Displaying Text} + +{}From a C programmer's point of view, curses may sometimes look like +a twisty maze of functions, all subtly different. For example, +\function{addstr()} displays a string at the current cursor location +in the \code{stdscr} window, while \function{mvaddstr()} moves to a +given y,x coordinate first before displaying the string. +\function{waddstr()} is just like \function{addstr()}, but allows +specifying a window to use, instead of using \code{stdscr} by default. +\function{mvwaddstr()} follows similarly. + +Fortunately the Python interface hides all these details; +\code{stdscr} is a window object like any other, and methods like +\function{addstr()} accept multiple argument forms. Usually there are +four different forms. 
+ +\begin{tableii}{|c|l|}{textrm}{Form}{Description} +\lineii{\var{str} or \var{ch}}{Display the string \var{str} or +character \var{ch}} +\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or +character \var{ch}, using attribute \var{attr}} +\lineii{\var{y}, \var{x}, \var{str} or \var{ch}} +{Move to position \var{y,x} within the window, and display \var{str} +or \var{ch}} +\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}} +{Move to position \var{y,x} within the window, and display \var{str} +or \var{ch}, using attribute \var{attr}} +\end{tableii} + +Attributes allow displaying text in highlighted forms, such as in +boldface, underline, reverse video, or in color. They'll be explained +in more detail in the next subsection. + +The \function{addstr()} function takes a Python string as the value to +be displayed, while the \function{addch()} functions take a character, +which can be either a Python string of length 1, or an integer. If +it's a string, you're limited to displaying characters between 0 and +255. SVr4 curses provides constants for extension characters; these +constants are integers greater than 255. For example, +\constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is +the upper left corner of a box (handy for drawing borders). + +Windows remember where the cursor was left after the last operation, +so if you leave out the \var{y,x} coordinates, the string or character +will be displayed wherever the last operation left off. You can also +move the cursor with the \function{move(\var{y,x})} method. Because +some terminals always display a flashing cursor, you may want to +ensure that the cursor is positioned in some location where it won't +be distracting; it can be confusing to have the cursor blinking at +some apparently random location. + +If your application doesn't need a blinking cursor at all, you can +call \function{curs_set(0)} to make it invisible. 
Equivalently, and +for compatibility with older curses versions, there's a +\function{leaveok(\var{bool})} function. When \var{bool} is true, the +curses library will attempt to suppress the flashing cursor, and you +won't need to worry about leaving it in odd locations. + +\subsection{Attributes and Color} + +Characters can be displayed in different ways. Status lines in a +text-based application are commonly shown in reverse video; a text +viewer may need to highlight certain words. curses supports this by +allowing you to specify an attribute for each cell on the screen. + +An attribute is an integer, each bit representing a different +attribute. You can try to display text with multiple attribute bits +set, but curses doesn't guarantee that all the possible combinations +are available, or that they're all visually distinct. That depends on +the ability of the terminal being used, so it's safest to stick to the +most commonly available attributes, listed here. + +\begin{tableii}{|c|l|}{constant}{Attribute}{Description} +\lineii{A_BLINK}{Blinking text} +\lineii{A_BOLD}{Extra bright or bold text} +\lineii{A_DIM}{Half bright text} +\lineii{A_REVERSE}{Reverse-video text} +\lineii{A_STANDOUT}{The best highlighting mode available} +\lineii{A_UNDERLINE}{Underlined text} +\end{tableii} + +So, to display a reverse-video status line on the top line of the +screen, +you could code: + +\begin{verbatim} +stdscr.addstr(0, 0, "Current mode: Typing mode", + curses.A_REVERSE) +stdscr.refresh() +\end{verbatim} + +The curses library also supports color on those terminals that +provide it. The most common such terminal is probably the Linux +console, followed by color xterms. + +To use color, you must call the \function{start_color()} function +soon after calling \function{initscr()}, to initialize the default +color set (the \function{curses.wrapper.wrapper()} function does this +automatically). 
Once that's done, the \function{has_colors()} +function returns TRUE if the terminal in use can actually display +color. (Note from AMK: curses uses the American spelling +'color', instead of the Canadian/British spelling 'colour'. If you're +like me, you'll have to resign yourself to misspelling it for the sake +of these functions.) + +The curses library maintains a finite number of color pairs, +containing a foreground (or text) color and a background color. You +can get the attribute value corresponding to a color pair with the +\function{color_pair()} function; this can be bitwise-OR'ed with other +attributes such as \constant{A_REVERSE}, but again, such combinations +are not guaranteed to work on all terminals. + +An example, which displays a line of text using color pair 1: + +\begin{verbatim} +stdscr.addstr( "Pretty text", curses.color_pair(1) ) +stdscr.refresh() +\end{verbatim} + +As I said before, a color pair consists of a foreground and +background color. \function{start_color()} initializes 8 basic +colors when it activates color mode. They are: 0:black, 1:red, +2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white. The curses +module defines named constants for each of these colors: +\constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so +forth. + +The \function{init_pair(\var{n, f, b})} function changes the +definition of color pair \var{n}, to foreground color \var{f} and +background color \var{b}. Color pair 0 is hard-wired to white on black, +and cannot be changed. + +Let's put all this together. To change color pair 1 to red +text on a white background, you would call: + +\begin{verbatim} +curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE) +\end{verbatim} + +When you change a color pair, any text already displayed using that +color pair will change to the new colors. 
You can also display new +text in this color with: + +\begin{verbatim} +stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) ) +\end{verbatim} + +Very fancy terminals can change the definitions of the actual colors +to a given RGB value. This lets you change color 1, which is usually +red, to purple or blue or any other color you like. Unfortunately, +the Linux console doesn't support this, so I'm unable to try it out, +and can't provide any examples. You can check if your terminal can do +this by calling \function{can_change_color()}, which returns TRUE if +the capability is there. If you're lucky enough to have such a +talented terminal, consult your system's man pages for more +information. + +\section{User Input} + +The curses library itself offers only very simple input mechanisms. +Python's support adds a text-input widget that makes up for some of the +lack. + +The most common way to get input to a window is to use its +\method{getch()} method, which pauses and waits for the user to hit +a key, displaying it if \function{echo()} has been called earlier. +You can optionally specify a coordinate to which the cursor should be +moved before pausing. + +It's possible to change this behavior with the method +\method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for +the window becomes non-blocking and returns ERR (-1) when no input is +ready. There's also a \function{halfdelay()} function, which can be +used to (in effect) set a timer on each \method{getch()}; if no input +becomes available within the number of tenths of a second specified as the +argument to \function{halfdelay()}, curses throws an exception. + +The \method{getch()} method returns an integer; if it's between 0 and +255, it represents the ASCII code of the key pressed. Values greater +than 255 are special keys such as Page Up, Home, or the cursor keys. +You can compare the value returned to constants such as +\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or +\constant{curses.KEY_LEFT}. 
Usually the main loop of your program +will look something like this: + +\begin{verbatim} +while 1: + c = stdscr.getch() + if c == ord('p'): PrintDocument() + elif c == ord('q'): break # Exit the while() + elif c == curses.KEY_HOME: x = y = 0 +\end{verbatim} + +The \module{curses.ascii} module supplies ASCII class membership +functions that take either integer or 1-character-string +arguments; these may be useful in writing more readable tests for +your command interpreters. It also supplies conversion functions +that take either integer or 1-character-string arguments and return +the same type. For example, \function{curses.ascii.ctrl()} returns +the control character corresponding to its argument. + +There's also a method to retrieve an entire string, +\method{getstr()}. It isn't used very often, because its +functionality is quite limited; the only editing keys available are +the backspace key and the Enter key, which terminates the string. It +can optionally be limited to a fixed number of characters. + +\begin{verbatim} +curses.echo() # Enable echoing of characters + +# Get a 15-character string, with the cursor on the top line +s = stdscr.getstr(0,0, 15) +\end{verbatim} + +The Python \module{curses.textpad} module supplies something better. +With it, you can turn a window into a text box that supports an +Emacs-like set of keybindings. Various methods of the \class{Textbox} +class support editing with input validation and gathering the edit +results either with or without trailing spaces. See the library +documentation on \module{curses.textpad} for the details. + +\section{For More Information} + +This HOWTO didn't cover some advanced topics, such as screen-scraping +or capturing mouse events from an xterm instance. But the Python +library page for the curses modules is now pretty complete. You +should browse it next. 
+ +If you're in doubt about the detailed behavior of any of the ncurses +entry points, consult the manual pages for your curses implementation, +whether it's ncurses or a proprietary Unix vendor's. The manual pages +will document any quirks, and provide complete lists of all the +functions, attributes, and \constant{ACS_*} characters available to +you. + +Because the curses API is so large, some functions aren't supported in +the Python interface, not because they're difficult to implement, but +because no one has needed them yet. Feel free to add them and then +submit a patch. Also, we don't yet have support for the menus or +panels libraries associated with ncurses; feel free to add that. + +If you write an interesting little program, feel free to contribute it +as another demo. We can always use more of them! + +The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html} + +\end{document} diff --git a/Doc/howto/doanddont.tex b/Doc/howto/doanddont.tex new file mode 100644 index 0000000..adbde66 --- /dev/null +++ b/Doc/howto/doanddont.tex @@ -0,0 +1,343 @@ +\documentclass{howto} + +\title{Idioms and Anti-Idioms in Python} + +\release{0.00} + +\author{Moshe Zadka} +\authoraddress{howto@zadka.site.co.il} + +\begin{document} +\maketitle + +This document is placed in the public domain. + +\begin{abstract} +\noindent +This document can be considered a companion to the tutorial. It +shows how to use Python, and even more importantly, how {\em not} +to use Python. +\end{abstract} + +\tableofcontents + +\section{Language Constructs You Should Not Use} + +While Python has relatively few gotchas compared to other languages, it +still has some constructs which are only useful in corner cases, or are +plain dangerous. + +\subsection{from module import *} + +\subsubsection{Inside Function Definitions} + +\code{from module import *} is {\em invalid} inside function definitions. 
+While many versions of Python do not check for the invalidity, it does not +make it more valid, no more than having a smart lawyer makes a man innocent. +Do not use it like that ever. Even in versions where it was accepted, it made +the function execution slower, because the compiler could not be certain +which names are local and which are global. In Python 2.1 this construct +causes warnings, and sometimes even errors. + +\subsubsection{At Module Level} + +While it is valid to use \code{from module import *} at module level it +is usually a bad idea. For one, this loses an important property Python +otherwise has --- you can know where each toplevel name is defined by +a simple "search" function in your favourite editor. You also open yourself +to trouble in the future, if some module grows additional functions or +classes. + +One of the most awful questions asked on the newsgroup is why this code: + +\begin{verbatim} +f = open("www") +f.read() +\end{verbatim} + +does not work. Of course, it works just fine (assuming you have a file +called "www".) But it does not work if somewhere in the module, the +statement \code{from os import *} is present. The \module{os} module +has a function called \function{open()} which returns an integer. While +it is very useful, shadowing builtins is one of its least useful properties. + +Remember, you can never know for sure what names a module exports, so either +take what you need --- \code{from module import name1, name2}, or keep them in +the module and access on a per-need basis --- +\code{import module;print module.name}. + +\subsubsection{When It Is Just Fine} + +There are situations in which \code{from module import *} is just fine: + +\begin{itemize} + +\item The interactive prompt. For example, \code{from math import *} makes + Python an amazing scientific calculator. + +\item When extending a module in C with a module in Python. + +\item When the module advertises itself as \code{from import *} safe. 
+ +\end{itemize} + +\subsection{Unadorned \keyword{exec}, \function{execfile} and friends} + +The word ``unadorned'' refers to the use without an explicit dictionary, +in which case those constructs evaluate code in the {\em current} environment. +This is dangerous for the same reasons \code{from import *} is dangerous --- +it might step over variables you are counting on and mess up things for +the rest of your code. Simply do not do that. + +Bad examples: + +\begin{verbatim} +>>> for name in sys.argv[1:]: +>>> exec "%s=1" % name +>>> def func(s, **kw): +>>> for var, val in kw.items(): +>>> exec "s.%s=val" % var # invalid! +>>> execfile("handler.py") +>>> handle() +\end{verbatim} + +Good examples: + +\begin{verbatim} +>>> d = {} +>>> for name in sys.argv[1:]: +>>> d[name] = 1 +>>> def func(s, **kw): +>>> for var, val in kw.items(): +>>> setattr(s, var, val) +>>> d={} +>>> execfile("handle.py", d, d) +>>> handle = d['handle'] +>>> handle() +\end{verbatim} + +\subsection{from module import name1, name2} + +This is a ``don't'' which is much weaker than the previous ``don't''s +but is still something you should not do if you don't have good reasons +to do that. The reason it is usually a bad idea is that you suddenly +have an object which lives in two separate namespaces. When the binding +in one namespace changes, the binding in the other will not, so there +will be a discrepancy between them. This happens when, for example, +one module is reloaded, or changes the definition of a function at runtime. + +Bad example: + +\begin{verbatim} +# foo.py +a = 1 + +# bar.py +from foo import a +if something(): + a = 2 # danger: foo.a != a +\end{verbatim} + +Good example: + +\begin{verbatim} +# foo.py +a = 1 + +# bar.py +import foo +if something(): + foo.a = 2 +\end{verbatim} + +\subsection{except:} + +Python has the \code{except:} clause, which catches all exceptions. 
+Since {\em every} error in Python raises an exception, this makes many +programming errors look like runtime problems, and hinders +the debugging process. + +The following code shows a great example: + +\begin{verbatim} +try: + foo = opne("file") # misspelled "open" +except: + sys.exit("could not open file!") +\end{verbatim} + +The second line triggers a \exception{NameError} which is caught by the +except clause. The program will exit, and you will have no idea that +this has nothing to do with the readability of \code{"file"}. + +The example above is better written + +\begin{verbatim} +try: + foo = opne("file") # will be changed to "open" as soon as we run it +except IOError: + sys.exit("could not open file") +\end{verbatim} + +There are some situations in which the \code{except:} clause is useful: +for example, in a framework when running callbacks, it is good not to +let any callback disturb the framework. + +\section{Exceptions} + +Exceptions are a useful feature of Python. You should learn to raise +them whenever something unexpected occurs, and catch them only where +you can do something about them. + +The following is a very popular anti-idiom + +\begin{verbatim} +def get_status(file): + if not os.path.exists(file): + print "file not found" + sys.exit(1) + return open(file).readline() +\end{verbatim} + +Consider the case where the file gets deleted between the time the call to +\function{os.path.exists} is made and the time \function{open} is called. +That means the last line will throw an \exception{IOError}. The same would +happen if \var{file} exists but has no read permission. Since testing this +on a normal machine, on existing and non-existing files, makes it seem bug-free, +the results in testing will look fine, and the code will get +shipped. Then an unhandled \exception{IOError} escapes to the user, who +has to watch the ugly traceback. + +Here is a better way to do it. 
+ +\begin{verbatim} +def get_status(file): + try: + return open(file).readline() + except (IOError, OSError): + print "file not found" + sys.exit(1) +\end{verbatim} + +In this version, \emph{either} the file gets opened and the line is read +(so it works even on flaky NFS or SMB connections), or the message +is printed and the application aborted. + +Still, \function{get_status} makes too many assumptions --- that it +will only be used in a short running script, and not, say, in a long +running server. Sure, the caller could do something like + +\begin{verbatim} +try: + status = get_status(log) +except SystemExit: + status = None +\end{verbatim} + +So, try to have as few \code{except} clauses in your code as possible --- those will +usually be a catch-all in the \function{main}, or inside calls which +should always succeed. + +So, the best version is probably + +\begin{verbatim} +def get_status(file): + return open(file).readline() +\end{verbatim} + +The caller can deal with the exception if it wants (for example, if it +tries several files in a loop), or just let the exception filter upwards +to {\em its} caller. + +The last version is not very good either --- due to implementation details, +the file would not be closed when an exception is raised until the handler +finishes, and perhaps not at all in non-C implementations (e.g., Jython). + +\begin{verbatim} +def get_status(file): + fp = open(file) + try: + return fp.readline() + finally: + fp.close() +\end{verbatim} + +\section{Using the Batteries} + +Every so often, people seem to be writing stuff in the Python library +again, usually poorly. While the occasional module has a poor interface, +it is usually much better to use the rich standard library and data +types that come with Python than inventing your own. + +A useful module very few people know about is \module{os.path}. It +always has the correct path arithmetic for your operating system, and +will usually be much better than whatever you come up with yourself. 
+ +Compare: + +\begin{verbatim} +# ugh! +return dir+"/"+file +# better +return os.path.join(dir, file) +\end{verbatim} + +More useful functions in \module{os.path}: \function{basename}, +\function{dirname} and \function{splitext}. + +There are also many useful builtin functions people seem not to be +aware of for some reason: \function{min()} and \function{max()} can +find the minimum/maximum of any sequence with comparable semantics, +for example, yet many people write their own max/min. Another highly +useful function is \function{reduce()}. Classical use of \function{reduce()} +is something like + +\begin{verbatim} +import sys, operator +nums = map(float, sys.argv[1:]) +print reduce(operator.add, nums)/len(nums) +\end{verbatim} + +This cute little script prints the average of all numbers given on the +command line. The \function{reduce()} adds up all the numbers, and +the rest is just some pre- and postprocessing. + +On the same note, note that \function{float()}, \function{int()} and +\function{long()} all accept arguments of type string, and so are +suited to parsing --- assuming you are ready to deal with the +\exception{ValueError} they raise. + +\section{Using Backslash to Continue Statements} + +Since Python treats a newline as a statement terminator, +and since statements are often longer than is comfortable to put +on one line, many people do: + +\begin{verbatim} +if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \ + calculate_number(10, 20) != forbulate(500, 360): + pass +\end{verbatim} + +You should realize that this is dangerous: a stray space after the +\code{\\} would make this line wrong, and stray spaces are notoriously +hard to see in editors. In this case, at least it would be a syntax +error, but if the code was: + +\begin{verbatim} +value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \ + + calculate_number(10, 20)*forbulate(500, 360) +\end{verbatim} + +then it would just be subtly wrong. 
+ +It is usually much better to use the implicit continuation inside parentheses: + +This version is bulletproof: + +\begin{verbatim} +value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9] + + calculate_number(10, 20)*forbulate(500, 360)) +\end{verbatim} + +\end{document} diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex new file mode 100644 index 0000000..5a65064 --- /dev/null +++ b/Doc/howto/regex.tex @@ -0,0 +1,1466 @@ +\documentclass{howto} + +% TODO: +% Document lookbehind assertions +% Better way of displaying a RE, a string, and what it matches +% Mention optional argument to match.groups() +% Unicode (at least a reference) + +\title{Regular Expression HOWTO} + +\release{0.05} + +\author{A.M. Kuchling} +\authoraddress{\email{amk@amk.ca}} + +\begin{document} +\maketitle + +\begin{abstract} +\noindent +This document is an introductory tutorial to using regular expressions +in Python with the \module{re} module. It provides a gentler +introduction than the corresponding section in the Library Reference. + +This document is available from +\url{http://www.amk.ca/python/howto}. + +\end{abstract} + +\tableofcontents + +\section{Introduction} + +The \module{re} module was added in Python 1.5, and provides +Perl-style regular expression patterns. Earlier versions of Python +came with the \module{regex} module, which provides Emacs-style +patterns. Emacs-style patterns are slightly less readable and +don't provide as many features, so there's not much reason to use +the \module{regex} module when writing new code, though you might +encounter old code that uses it. + +Regular expressions (or REs) are essentially a tiny, highly +specialized programming language embedded inside Python and made +available through the \module{re} module. Using this little language, +you specify the rules for the set of possible strings that you want to +match; this set might contain English sentences, or e-mail addresses, +or TeX commands, or anything you like. 
You can then ask questions +such as ``Does this string match the pattern?'', or ``Is there a match +for the pattern anywhere in this string?''. You can also use REs to +modify a string or to split it apart in various ways. + +Regular expression patterns are compiled into a series of bytecodes +which are then executed by a matching engine written in C. For +advanced use, it may be necessary to pay careful attention to how the +engine will execute a given RE, and write the RE in a certain way in +order to produce bytecode that runs faster. Optimization isn't +covered in this document, because it requires that you have a good +understanding of the matching engine's internals. + +The regular expression language is relatively small and restricted, so +not all possible string processing tasks can be done using regular +expressions. There are also tasks that \emph{can} be done with +regular expressions, but the expressions turn out to be very +complicated. In these cases, you may be better off writing Python +code to do the processing; while Python code will be slower than an +elaborate regular expression, it will also probably be more understandable. + +\section{Simple Patterns} + +We'll start by learning about the simplest possible regular +expressions. Since regular expressions are used to operate on +strings, we'll begin with the most common task: matching characters. + +For a detailed explanation of the computer science underlying regular +expressions (deterministic and non-deterministic finite automata), you +can refer to almost any textbook on writing compilers. + +\subsection{Matching Characters} + +Most letters and characters will simply match themselves. For +example, the regular expression \regexp{test} will match the string +\samp{test} exactly. (You can enable a case-insensitive mode that +would let this RE match \samp{Test} or \samp{TEST} as well; more +about this later.) 
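The literal-matching rule can be checked with a short sketch. It is written in Python 3 syntax, which is an assumption on our part (this HOWTO's own examples predate it), and it relies only on \function{re.match()} and the \code{re.IGNORECASE} flag, both of which this document covers properly later on.

```python
import re

# Ordinary characters match themselves, so 'test' matches 'test'.
exact = re.match('test', 'test')

# Matching is case-sensitive by default, so 'Test' is rejected.
wrong_case = re.match('test', 'Test')

# With the IGNORECASE flag the same RE also accepts 'TEST'.
relaxed = re.match('test', 'TEST', re.IGNORECASE)

print(exact is not None, wrong_case is None, relaxed is not None)
```

A failed match returns \code{None} rather than raising an exception, which is why the sketch tests the results with \code{is None}.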
+ +There are exceptions to this rule; some characters are +special, and don't match themselves. Instead, they signal that some +out-of-the-ordinary thing should be matched, or they affect other +portions of the RE by repeating them. Much of this document is +devoted to discussing various metacharacters and what they do. + +Here's a complete list of the metacharacters; their meanings will be +discussed in the rest of this HOWTO. + +\begin{verbatim} +. ^ $ * + ? { [ ] \ | ( ) +\end{verbatim} +% $ + +The first metacharacters we'll look at are \samp{[} and \samp{]}. +They're used for specifying a character class, which is a set of +characters that you wish to match. Characters can be listed +individually, or a range of characters can be indicated by giving two +characters and separating them by a \character{-}. For example, +\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or +\samp{c}; this is the same as +\regexp{[a-c]}, which uses a range to express the same set of +characters. If you wanted to match only lowercase letters, your +RE would be \regexp{[a-z]}. + +Metacharacters are not active inside classes. For example, +\regexp{[akm\$]} will match any of the characters \character{a}, +\character{k}, \character{m}, or \character{\$}; \character{\$} is +usually a metacharacter, but inside a character class it's stripped of +its special nature. + +You can match the characters not within a range by \dfn{complementing} +the set. This is indicated by including a \character{\^} as the first +character of the class; \character{\^} elsewhere will simply match the +\character{\^} character. For example, \verb|[^5]| will match any +character except \character{5}. + +Perhaps the most important metacharacter is the backslash, \samp{\e}. +As in Python string literals, the backslash can be followed by various +characters to signal various special sequences. 
It's also used to escape +all the metacharacters so you can still match them in patterns; for +example, if you need to match a \samp{[} or +\samp{\e}, you can precede them with a backslash to remove their +special meaning: \regexp{\e[} or \regexp{\e\e}. + +Some of the special sequences beginning with \character{\e} represent +predefined sets of characters that are often useful, such as the set +of digits, the set of letters, or the set of anything that isn't +whitespace. The following predefined special sequences are available: + +\begin{itemize} +\item[\code{\e d}]Matches any decimal digit; this is +equivalent to the class \regexp{[0-9]}. + +\item[\code{\e D}]Matches any non-digit character; this is +equivalent to the class \verb|[^0-9]|. + +\item[\code{\e s}]Matches any whitespace character; this is +equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}. + +\item[\code{\e S}]Matches any non-whitespace character; this is +equivalent to the class \verb|[^ \t\n\r\f\v]|. + +\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class +\regexp{[a-zA-Z0-9_]}. + +\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class +\verb|[^a-zA-Z0-9_]|. +\end{itemize} + +These sequences can be included inside a character class. For +example, \regexp{[\e s,.]} is a character class that will match any +whitespace character, or \character{,} or \character{.}. + +The final metacharacter in this section is \regexp{.}. It matches +anything except a newline character, and there's an alternate mode +(\code{re.DOTALL}) where it will match even a newline. \character{.} +is often used where you want to match ``any character''. + +\subsection{Repeating Things} + +Being able to match varying sets of characters is the first thing +regular expressions can do that isn't already possible with the +methods available on strings. However, if that was the only +additional capability of regexes, they wouldn't be much of an advance. 
+Another capability is that you can specify that portions of the RE +must be repeated a certain number of times. + +The first metacharacter for repeating things that we'll look at is +\regexp{*}. \regexp{*} doesn't match the literal character \samp{*}; +instead, it specifies that the previous character can be matched zero +or more times, instead of exactly once. + +For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a} +characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a} +characters), and so forth. The RE engine has various internal +limitations stemming from the size of C's \code{int} type, that will +prevent it from matching over 2 billion \samp{a} characters; you +probably don't have enough memory to construct a string that large, so +you shouldn't run into that limit. + +Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE, +the matching engine will try to repeat it as many times as possible. +If later portions of the pattern don't match, the matching engine will +then back up and try again with fewer repetitions. + +A step-by-step example will make this more obvious. Let's consider +the expression \regexp{a[bcd]*b}. This matches the letter +\character{a}, zero or more letters from the class \code{[bcd]}, and +finally ends with a \character{b}. Now imagine matching this RE +against the string \samp{abcbd}. 
+ +\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation} +\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.} +\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as +it can, which is to the end of the string.} +\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the +current position is at the end of the string, so it fails.} +\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches +one less character.} +\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the +current position is at the last character, which is a \character{d}.} +\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is +only matching \samp{bc}.} +\lineiii{7}{\code{abcb}}{Try \regexp{b} again. This time +the character at the current position is \character{b}, so it succeeds.} +\end{tableiii} + +The end of the RE has now been reached, and it has matched +\samp{abcb}. This demonstrates how the matching engine goes as far as +it can at first, and if no match is found it will then progressively +back up and retry the rest of the RE again and again. It will back up +until it has tried zero matches for \regexp{[bcd]*}, and if that +subsequently fails, the engine will conclude that the string doesn't +match the RE at all. + +Another repeating metacharacter is \regexp{+}, which matches one or +more times. Pay careful attention to the difference between +\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more +times, so whatever's being repeated may not be present at all, while +\regexp{+} requires at least \emph{one} occurrence. To use a similar +example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}), +\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}. + +There are two more repeating qualifiers. The question mark character, +\regexp{?}, matches either once or zero times; you can think of it as +marking something as being optional. For example, \regexp{home-?brew} +matches either \samp{homebrew} or \samp{home-brew}. 
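The three qualifiers described so far can be tried out with a short sketch. It uses Python 3 syntax, which is an assumption (this HOWTO's own examples predate it), and only \function{re.match()}, which is introduced properly later in this document; the variable names are purely illustrative.

```python
import re

# Greedy repetition: [bcd]* first grabs 'bcbd', then backs up until
# the final 'b' of the RE can match, so the overall match is 'abcb'.
greedy = re.match('a[bcd]*b', 'abcbd').group()

# '*' allows zero repetitions, while '+' demands at least one.
star = [re.match('ca*t', s) is not None for s in ('ct', 'cat', 'caaat')]
plus = [re.match('ca+t', s) is not None for s in ('ct', 'cat', 'caaat')]

# '?' marks the hyphen as optional, but permits at most one of them.
optional = [re.match('home-?brew', s) is not None
            for s in ('homebrew', 'home-brew', 'home--brew')]

print(greedy, star, plus, optional)
```

The only string on which \regexp{ca*t} and \regexp{ca+t} disagree is \samp{ct}, which is exactly the zero-repetition case.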
+ +The most complicated repeated qualifier is +\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal +integers. This qualifier means there must be at least \var{m} +repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b} +will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match +\samp{ab}, which has no slashes, or \samp{a////b}, which has four. + +You can omit either \var{m} or \var{n}; in that case, a reasonable +value is assumed for the missing value. Omitting \var{m} is +interpreted as a lower limit of 0, while omitting \var{n} results in an +upper bound of infinity --- actually, the 2 billion limit mentioned +earlier, but that might as well be infinity. + +Readers of a reductionist bent may notice that the three other qualifiers +can all be expressed using this notation. \regexp{\{0,\}} is the same +as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and +\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use +\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because +they're shorter and easier to read. + +\section{Using Regular Expressions} + +Now that we've looked at some simple regular expressions, how do we +actually use them in Python? The \module{re} module provides an +interface to the regular expression engine, allowing you to compile +REs into objects and then perform matches with them. + +\subsection{Compiling Regular Expressions} + +Regular expressions are compiled into \class{RegexObject} instances, +which have methods for various operations such as searching for +pattern matches or performing string substitutions. + +\begin{verbatim} +>>> import re +>>> p = re.compile('ab*') +>>> print p +<re.RegexObject instance at 80b4150> +\end{verbatim} + +\function{re.compile()} also accepts an optional \var{flags} +argument, used to enable various special features and syntax +variations. 
We'll go over the available settings later, but for now a +single example will do: + +\begin{verbatim} +>>> p = re.compile('ab*', re.IGNORECASE) +\end{verbatim} + +The RE is passed to \function{re.compile()} as a string. REs are +handled as strings because regular expressions aren't part of the core +Python language, and no special syntax was created for expressing +them. (There are applications that don't need REs at all, so there's +no need to bloat the language specification by including them.) +Instead, the \module{re} module is simply a C extension module +included with Python, just like the \module{socket} or \module{zlib} +module. + +Putting REs in strings keeps the Python language simpler, but has one +disadvantage which is the topic of the next section. + +\subsection{The Backslash Plague} + +As stated earlier, regular expressions use the backslash +character (\character{\e}) to indicate special forms or to allow +special characters to be used without invoking their special meaning. +This conflicts with Python's usage of the same character for the same +purpose in string literals. + +Let's say you want to write a RE that matches the string +\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure +out what to write in the program code, start with the desired string +to be matched. Next, you must escape any backslashes and other +metacharacters by preceding them with a backslash, resulting in the +string \samp{\e\e section}. The resulting string that must be passed +to \function{re.compile()} must be \verb|\\section|. However, to +express this as a Python string literal, both backslashes must be +escaped \emph{again}. 
+ +\begin{tableii}{c|l}{code}{Characters}{Stage} + \lineii{\e section}{Text string to be matched} + \lineii{\e\e section}{Escaped backslash for \function{re.compile}} + \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal} +\end{tableii} + +In short, to match a literal backslash, one has to write +\code{'\e\e\e\e'} as the RE string, because the regular expression +must be \samp{\e\e}, and each backslash must be expressed as +\samp{\e\e} inside a regular Python string literal. In REs that +feature backslashes repeatedly, this leads to lots of repeated +backslashes and makes the resulting strings difficult to understand. + +The solution is to use Python's raw string notation for regular +expressions; backslashes are not handled in any special way in +a string literal prefixed with \character{r}, so \code{r"\e n"} is a +two-character string containing \character{\e} and \character{n}, +while \code{"\e n"} is a one-character string containing a newline. +Frequently regular expressions will be expressed in Python +code using this raw string notation. + +\begin{tableii}{c|c}{code}{Regular String}{Raw string} + \lineii{"ab*"}{\code{r"ab*"}} + \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}} + \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}} +\end{tableii} + +\subsection{Performing Matches} + +Once you have an object representing a compiled regular expression, +what do you do with it? \class{RegexObject} instances have several +methods and attributes. Only the most significant ones will be +covered here; consult \ulink{the Library +Reference}{http://www.python.org/doc/lib/module-re.html} for a +complete listing. 
+ +\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} + \lineii{match()}{Determine if the RE matches at the beginning of + the string.} + \lineii{search()}{Scan through a string, looking for any location + where this RE matches.} + \lineii{findall()}{Find all substrings where the RE matches, +and return them as a list.} + \lineii{finditer()}{Find all substrings where the RE matches, +and return them as an iterator.} +\end{tableii} + +\method{match()} and \method{search()} return \code{None} if no match +can be found. If they're successful, a \code{MatchObject} instance is +returned, containing information about the match: where it starts and +ends, the substring it matched, and more. + +You can learn about this by interactively experimenting with the +\module{re} module. If you have Tkinter available, you may also want +to look at \file{Tools/scripts/redemo.py}, a demonstration program +included with the Python distribution. It allows you to enter REs and +strings, and displays whether the RE matches or fails. +\file{redemo.py} can be quite useful when trying to debug a +complicated RE. Phil Schwartz's +\ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive +tool for developing and testing RE patterns. This HOWTO will use the +standard Python interpreter for its examples. + +First, run the Python interpreter, import the \module{re} module, and +compile a RE: + +\begin{verbatim} +Python 2.2.2 (#1, Feb 10 2003, 12:57:01) +>>> import re +>>> p = re.compile('[a-z]+') +>>> p +<_sre.SRE_Pattern object at 80c3c28> +\end{verbatim} + +Now, you can try matching various strings against the RE +\regexp{[a-z]+}. An empty string shouldn't match at all, since +\regexp{+} means 'one or more repetitions'. \method{match()} should +return \code{None} in this case, which will cause the interpreter to +print no output. You can explicitly print the result of +\method{match()} to make this clear. 
+ +\begin{verbatim} +>>> p.match("") +>>> print p.match("") +None +\end{verbatim} + +Now, let's try it on a string that it should match, such as +\samp{tempo}. In this case, \method{match()} will return a +\class{MatchObject}, so you should store the result in a variable for +later use. + +\begin{verbatim} +>>> m = p.match( 'tempo') +>>> print m +<_sre.SRE_Match object at 80c4f68> +\end{verbatim} + +Now you can query the \class{MatchObject} for information about the +matching string. \class{MatchObject} instances also have several +methods and attributes; the most important ones are: + +\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} + \lineii{group()}{Return the string matched by the RE} + \lineii{start()}{Return the starting position of the match} + \lineii{end()}{Return the ending position of the match} + \lineii{span()}{Return a tuple containing the (start, end) positions + of the match} +\end{tableii} + +Trying these methods will soon clarify their meaning: + +\begin{verbatim} +>>> m.group() +'tempo' +>>> m.start(), m.end() +(0, 5) +>>> m.span() +(0, 5) +\end{verbatim} + +\method{group()} returns the substring that was matched by the +RE. \method{start()} and \method{end()} return the starting and +ending index of the match. \method{span()} returns both start and end +indexes in a single tuple. Since the \method{match} method only +checks if the RE matches at the start of a string, +\method{start()} will always be zero. However, the \method{search} +method of \class{RegexObject} instances scans through the string, so +the match may not start at zero in that case. + +\begin{verbatim} +>>> print p.match('::: message') +None +>>> m = p.search('::: message') ; print m +<re.MatchObject instance at 80c9650> +>>> m.group() +'message' +>>> m.span() +(4, 11) +\end{verbatim} + +In actual programs, the most common style is to store the +\class{MatchObject} in a variable, and then check if it was +\code{None}. 
This usually looks like: + +\begin{verbatim} +p = re.compile( ... ) +m = p.match( 'string goes here' ) +if m: + print 'Match found: ', m.group() +else: + print 'No match' +\end{verbatim} + +Two \class{RegexObject} methods return all of the matches for a pattern. +\method{findall()} returns a list of matching strings: + +\begin{verbatim} +>>> p = re.compile('\d+') +>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') +['12', '11', '10'] +\end{verbatim} + +\method{findall()} has to create the entire list before it can be +returned as the result. In Python 2.2, the \method{finditer()} method +is also available, returning a sequence of \class{MatchObject} instances +as an iterator. + +\begin{verbatim} +>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') +>>> iterator +<callable-iterator object at 0x401833ac> +>>> for match in iterator: +... print match.span() +... +(0, 2) +(22, 24) +(29, 31) +\end{verbatim} + + +\subsection{Module-Level Functions} + +You don't have to produce a \class{RegexObject} and call its methods; +the \module{re} module also provides top-level functions called +\function{match()}, \function{search()}, \function{sub()}, and so +forth. These functions take the same arguments as the corresponding +\class{RegexObject} method, with the RE string added as the first +argument, and still return either \code{None} or a \class{MatchObject} +instance. + +\begin{verbatim} +>>> print re.match(r'From\s+', 'Fromage amk') +None +>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') +<re.MatchObject instance at 80c5978> +\end{verbatim} + +Under the hood, these functions simply produce a \class{RegexObject} +for you and call the appropriate method on it. They also store the +compiled object in a cache, so future calls using the same +RE are faster. + +Should you use these module-level functions, or should you get the +\class{RegexObject} and call its methods yourself? 
That choice +depends on how frequently the RE will be used, and on your personal +coding style. If a RE is being used at only one point in the code, +then the module functions are probably more convenient. If a program +contains a lot of regular expressions, or re-uses the same ones in +several locations, then it might be worthwhile to collect all the +definitions in one place, in a section of code that compiles all the +REs ahead of time. To take an example from the standard library, +here's an extract from \file{xmllib.py}: + +\begin{verbatim} +ref = re.compile( ... ) +entityref = re.compile( ... ) +charref = re.compile( ... ) +starttagopen = re.compile( ... ) +\end{verbatim} + +I generally prefer to work with the compiled object, even for +one-time uses, but few people will be as much of a purist about this +as I am. + +\subsection{Compilation Flags} + +Compilation flags let you modify some aspects of how regular +expressions work. Flags are available in the \module{re} module under +two names, a long name such as \constant{IGNORECASE}, and a short, +one-letter form such as \constant{I}. (If you're familiar with Perl's +pattern modifiers, the one-letter forms use the same letters; the +short form of \constant{re.VERBOSE} is \constant{re.X}, for example.) +Multiple flags can be specified by bitwise OR-ing them; \code{re.I | +re.M} sets both the \constant{I} and \constant{M} flags, for example. + +Here's a table of the available flags, followed by +a more detailed explanation of each one. 
+ +\begin{tableii}{c|l}{}{Flag}{Meaning} + \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any + character, including newlines} + \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches} + \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match} + \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching, + affecting \regexp{\^} and \regexp{\$}} + \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs, + which can be organized more cleanly and understandably.} +\end{tableii} + +\begin{datadesc}{I} +\dataline{IGNORECASE} +Perform case-insensitive matching; character class and literal strings +will match +letters by ignoring case. For example, \regexp{[A-Z]} will match +lowercase letters, too, and \regexp{Spam} will match \samp{Spam}, +\samp{spam}, or \samp{spAM}. +This lowercasing doesn't take the current locale into account; it will +if you also set the \constant{LOCALE} flag. +\end{datadesc} + +\begin{datadesc}{L} +\dataline{LOCALE} +Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, +and \regexp{\e B}, dependent on the current locale. + +Locales are a feature of the C library intended to help in writing +programs that take account of language differences. For example, if +you're processing French text, you'd want to be able to write +\regexp{\e w+} to match words, but \regexp{\e w} only matches the +character class \regexp{[A-Za-z]}; it won't match \character{\'e} or +\character{\c c}. If your system is configured properly and a French +locale is selected, certain C functions will tell the program that +\character{\'e} should also be considered a letter. Setting the +\constant{LOCALE} flag when compiling a regular expression will cause the +resulting compiled object to use these C functions for \regexp{\e w}; +this is slower, but also enables \regexp{\e w+} to match French words as +you'd expect. 
+\end{datadesc} + +\begin{datadesc}{M} +\dataline{MULTILINE} +(\regexp{\^} and \regexp{\$} haven't been explained yet; +they'll be introduced in section~\ref{more-metacharacters}.) + +Usually \regexp{\^} matches only at the beginning of the string, and +\regexp{\$} matches only at the end of the string and immediately before the +newline (if any) at the end of the string. When this flag is +specified, \regexp{\^} matches at the beginning of the string and at +the beginning of each line within the string, immediately following +each newline. Similarly, the \regexp{\$} metacharacter matches either at +the end of the string or at the end of each line (immediately +preceding each newline). + +\end{datadesc} + +\begin{datadesc}{S} +\dataline{DOTALL} +Makes the \character{.} special character match any character at all, +including a newline; without this flag, \character{.} will match +anything \emph{except} a newline. +\end{datadesc} + +\begin{datadesc}{X} +\dataline{VERBOSE} This flag allows you to write regular expressions +that are more readable by granting you more flexibility in how you can +format them. When this flag has been specified, whitespace within the +RE string is ignored, except when the whitespace is in a character +class or preceded by an unescaped backslash; this lets you organize +and indent the RE more clearly. It also enables you to put comments +within a RE that will be ignored by the engine; comments are marked by +a \character{\#} that's neither in a character class nor preceded by an +unescaped backslash. + +For example, here's a RE that uses \constant{re.VERBOSE}; see how +much easier it is to read? 
+ +\begin{verbatim} +charref = re.compile(r""" + &[#] # Start of a numeric entity reference + ( + [0-9]+[^0-9] # Decimal form + | 0[0-7]+[^0-7] # Octal form + | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form + ) +""", re.VERBOSE) +\end{verbatim} + +Without the verbose setting, the RE would look like this: +\begin{verbatim} +charref = re.compile("&#([0-9]+[^0-9]" + "|0[0-7]+[^0-7]" + "|x[0-9a-fA-F]+[^0-9a-fA-F])") +\end{verbatim} + +In the above example, Python's automatic concatenation of string +literals has been used to break up the RE into smaller pieces, but +it's still more difficult to understand than the version using +\constant{re.VERBOSE}. + +\end{datadesc} + +\section{More Pattern Power} + +So far we've only covered a part of the features of regular +expressions. In this section, we'll cover some new metacharacters, +and how to use groups to retrieve portions of the text that was matched. + +\subsection{More Metacharacters\label{more-metacharacters}} + +There are some metacharacters that we haven't covered yet. Most of +them will be covered in this section. + +Some of the remaining metacharacters to be discussed are +\dfn{zero-width assertions}. They don't cause the engine to advance +through the string; instead, they consume no characters at all, +and simply succeed or fail. For example, \regexp{\e b} is an +assertion that the current position is located at a word boundary; the +position isn't changed by the \regexp{\e b} at all. This means that +zero-width assertions should never be repeated, because if they match +once at a given location, they can obviously be matched an infinite +number of times. + +\begin{list}{}{} + +\item[\regexp{|}] +Alternation, or the ``or'' operator. +If A and B are regular expressions, +\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}. +\regexp{|} has very low precedence in order to make it work reasonably when +you're alternating multi-character strings. 
+\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not +\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}. + +To match a literal \character{|}, +use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. + +\item[\regexp{\^}] Matches at the beginning of lines. Unless the +\constant{MULTILINE} flag has been set, this will only match at the +beginning of the string. In \constant{MULTILINE} mode, this also +matches immediately after each newline within the string. + +For example, if you wish to match the word \samp{From} only at the +beginning of a line, the RE to use is \verb|^From|. + +\begin{verbatim} +>>> print re.search('^From', 'From Here to Eternity') +<re.MatchObject instance at 80c1520> +>>> print re.search('^From', 'Reciting From Memory') +None +\end{verbatim} + +%To match a literal \character{\^}, use \regexp{\e\^} or enclose it +%inside a character class, as in \regexp{[{\e}\^]}. + +\item[\regexp{\$}] Matches at the end of a line, which is defined as +either the end of the string, or any location followed by a newline +character. + +\begin{verbatim} +>>> print re.search('}$', '{block}') +<re.MatchObject instance at 80adfa8> +>>> print re.search('}$', '{block} ') +None +>>> print re.search('}$', '{block}\n') +<re.MatchObject instance at 80adfa8> +\end{verbatim} +% $ + +To match a literal \character{\$}, use \regexp{\e\$} or enclose it +inside a character class, as in \regexp{[\$]}. + +\item[\regexp{\e A}] Matches only at the start of the string. When +not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are +effectively the same. In \constant{MULTILINE} mode, however, they're +different; \regexp{\e A} still matches only at the beginning of the +string, but \regexp{\^} may match at any location inside the string +that follows a newline character. + +\item[\regexp{\e Z}]Matches only at the end of the string. + +\item[\regexp{\e b}] Word boundary. 
+This is a zero-width assertion that matches only at the +beginning or end of a word. A word is defined as a sequence of +alphanumeric characters, so the end of a word is indicated by +whitespace or a non-alphanumeric character. + +The following example matches \samp{class} only when it's a complete +word; it won't match when it's contained inside another word. + +\begin{verbatim} +>>> p = re.compile(r'\bclass\b') +>>> print p.search('no class at all') +<re.MatchObject instance at 80c8f28> +>>> print p.search('the declassified algorithm') +None +>>> print p.search('one subclass is') +None +\end{verbatim} + +There are two subtleties you should remember when using this special +sequence. First, this is the worst collision between Python's string +literals and regular expression sequences. In Python's string +literals, \samp{\e b} is the backspace character, ASCII value 8. If +you're not using raw strings, then Python will convert the \samp{\e b} to +a backspace, and your RE won't match as you expect it to. The +following example looks the same as our previous RE, but omits +the \character{r} in front of the RE string. + +\begin{verbatim} +>>> p = re.compile('\bclass\b') +>>> print p.search('no class at all') +None +>>> print p.search('\b' + 'class' + '\b') +<re.MatchObject instance at 80c3ee0> +\end{verbatim} + +Second, inside a character class, where there's no use for this +assertion, \regexp{\e b} represents the backspace character, for +compatibility with Python's string literals. + +\item[\regexp{\e B}] Another zero-width assertion, this is the +opposite of \regexp{\e b}, only matching when the current +position is not at a word boundary. + +\end{list} + +\subsection{Grouping} + +Frequently you need to obtain more information than just whether the +RE matched or not. Regular expressions are often used to dissect +strings by writing a RE divided into several subgroups which +match different components of interest. 
For example, an RFC-822 +header line is divided into a header name and a value, separated by a +\character{:}. This can be handled by writing a regular expression +which matches an entire header line, and has one group which matches the +header name, and another group which matches the header's value. + +Groups are marked by the \character{(}, \character{)} metacharacters. +\character{(} and \character{)} have much the same meaning as they do +in mathematical expressions; they group together the expressions +contained inside them. For example, you can repeat the contents of a +group with a repeating qualifier, such as \regexp{*}, \regexp{+}, +\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example, +\regexp{(ab)*} will match zero or more repetitions of \samp{ab}. + +\begin{verbatim} +>>> p = re.compile('(ab)*') +>>> print p.match('ababababab').span() +(0, 10) +\end{verbatim} + +Groups indicated with \character{(}, \character{)} also capture the +starting and ending index of the text that they match; this can be +retrieved by passing an argument to \method{group()}, +\method{start()}, \method{end()}, and \method{span()}. Groups are +numbered starting with 0. Group 0 is always present; it's the whole +RE, so \class{MatchObject} methods all have group 0 as their default +argument. Later we'll see how to express groups that don't capture +the span of text that they match. + +\begin{verbatim} +>>> p = re.compile('(a)b') +>>> m = p.match('ab') +>>> m.group() +'ab' +>>> m.group(0) +'ab' +\end{verbatim} + +Subgroups are numbered from left to right, from 1 upward. Groups can +be nested; to determine the number, just count the opening parenthesis +characters, going from left to right. 
+ +\begin{verbatim} +>>> p = re.compile('(a(b)c)d') +>>> m = p.match('abcd') +>>> m.group(0) +'abcd' +>>> m.group(1) +'abc' +>>> m.group(2) +'b' +\end{verbatim} + +\method{group()} can be passed multiple group numbers at a time, in +which case it will return a tuple containing the corresponding values +for those groups. + +\begin{verbatim} +>>> m.group(2,1,2) +('b', 'abc', 'b') +\end{verbatim} + +The \method{groups()} method returns a tuple containing the strings +for all the subgroups, from 1 up to however many there are. + +\begin{verbatim} +>>> m.groups() +('abc', 'b') +\end{verbatim} + +Backreferences in a pattern allow you to specify that the contents of +an earlier capturing group must also be found at the current location +in the string. For example, \regexp{\e 1} will succeed if the exact +contents of group 1 can be found at the current position, and fails +otherwise. Remember that Python's string literals also use a +backslash followed by numbers to allow including arbitrary characters +in a string, so be sure to use a raw string when incorporating +backreferences in a RE. + +For example, the following RE detects doubled words in a string. + +\begin{verbatim} +>>> p = re.compile(r'(\b\w+)\s+\1') +>>> p.search('Paris in the the spring').group() +'the the' +\end{verbatim} + +Backreferences like this aren't often useful for just searching +through a string --- there are few text formats which repeat data in +this way --- but you'll soon find out that they're \emph{very} useful +when performing string substitutions. + +\subsection{Non-capturing and Named Groups} + +Elaborate REs may use many groups, both to capture substrings of +interest, and to group and structure the RE itself. In complex REs, +it becomes difficult to keep track of the group numbers. There are +two features which help with this problem. Both of them use a common +syntax for regular expression extensions, so we'll look at that first. 
+ +Perl 5 added several additional features to standard regular +expressions, and the Python \module{re} module supports most of them. +It would have been difficult to choose new single-keystroke +metacharacters or new special sequences beginning with \samp{\e} to +represent the new features without making Perl's regular expressions +confusingly different from standard REs. If you chose \samp{\&} as a +new metacharacter, for example, old expressions would be assuming that +\samp{\&} was a regular character and wouldn't have escaped it by +writing \regexp{\e \&} or \regexp{[\&]}. + +The solution chosen by the Perl developers was to use \regexp{(?...)} +as the extension syntax. \samp{?} immediately after a parenthesis was +a syntax error because the \samp{?} would have nothing to repeat, so +this didn't introduce any compatibility problems. The characters +immediately after the \samp{?} indicate what extension is being used, +so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and +\regexp{(?:foo)} is something else (a non-capturing group containing +the subexpression \regexp{foo}). + +Python adds an extension syntax to Perl's extension syntax. If the +first character after the question mark is a \samp{P}, you know that +it's an extension that's specific to Python. Currently there are two +such extensions: \regexp{(?P<\var{name}>...)} defines a named group, +and \regexp{(?P=\var{name})} is a backreference to a named group. If +future versions of Perl 5 add similar features using a different +syntax, the \module{re} module will be changed to support the new +syntax, while preserving the Python-specific syntax for +compatibility's sake. + +Now that we've looked at the general extension syntax, we can return +to the features that simplify working with groups in complex REs. 
+Since groups are numbered from left to right and a complex expression +may use many groups, it can become difficult to keep track of the +correct numbering, and modifying such a complex RE is annoying. +Insert a new group near the beginning, and you change the numbers of +everything that follows it. + +First, sometimes you'll want to use a group to collect a part of a +regular expression, but aren't interested in retrieving the group's +contents. You can make this fact explicit by using a non-capturing +group: \regexp{(?:...)}, where you can put any other regular +expression inside the parentheses. + +\begin{verbatim} +>>> m = re.match("([abc])+", "abc") +>>> m.groups() +('c',) +>>> m = re.match("(?:[abc])+", "abc") +>>> m.groups() +() +\end{verbatim} + +Except for the fact that you can't retrieve the contents of what the +group matched, a non-capturing group behaves exactly the same as a +capturing group; you can put anything inside it, repeat it with a +repetition metacharacter such as \samp{*}, and nest it within other +groups (capturing or non-capturing). \regexp{(?:...)} is particularly +useful when modifying an existing group, since you can add new groups +without changing how all the other groups are numbered. It should be +mentioned that there's no performance difference in searching between +capturing and non-capturing groups; neither form is any faster than +the other. + +The second, and more significant, feature is named groups; instead of +referring to them by numbers, groups can be referenced by a name. + +The syntax for a named group is one of the Python-specific extensions: +\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of +the group. Except for associating a name with a group, named groups +also behave identically to capturing groups. The \class{MatchObject} +methods that deal with capturing groups all accept either integers, to +refer to groups by number, or a string containing the group name. 
+Named groups are still given numbers, so you can retrieve information +about a group in two ways: + +\begin{verbatim} +>>> p = re.compile(r'(?P<word>\b\w+\b)') +>>> m = p.search( '(((( Lots of punctuation )))' ) +>>> m.group('word') +'Lots' +>>> m.group(1) +'Lots' +\end{verbatim} + +Named groups are handy because they let you use easily-remembered +names, instead of having to remember numbers. Here's an example RE +from the \module{imaplib} module: + +\begin{verbatim} +InternalDate = re.compile(r'INTERNALDATE "' + r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' + r'(?P<year>[0-9][0-9][0-9][0-9])' + r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' + r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' + r'"') +\end{verbatim} + +It's obviously much easier to retrieve \code{m.group('zonem')}, +instead of having to remember to retrieve group 9. + +Since the syntax for backreferences, in an expression like +\regexp{(...)\e 1}, refers to the number of the group there's +naturally a variant that uses the group name instead of the number. +This is also a Python extension: \regexp{(?P=\var{name})} indicates +that the contents of the group called \var{name} should again be found +at the current point. The regular expression for finding doubled +words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as +\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}: + +\begin{verbatim} +>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') +>>> p.search('Paris in the the spring').group() +'the the' +\end{verbatim} + +\subsection{Lookahead Assertions} + +Another zero-width assertion is the lookahead assertion. Lookahead +assertions are available in both positive and negative form, and +look like this: + +\begin{itemize} +\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds +if the contained regular expression, represented here by \code{...}, +successfully matches at the current location, and fails otherwise. 
+But, once the contained expression has been tried, the matching engine +doesn't advance at all; the rest of the pattern is tried right where +the assertion started. + +\item[\regexp{(?!...)}] Negative lookahead assertion. This is the +opposite of the positive assertion; it succeeds if the contained expression +\emph{doesn't} match at the current position in the string. +\end{itemize} + +An example will help make this concrete by demonstrating a case +where a lookahead is useful. Consider a simple pattern to match a +filename and split it apart into a base name and an extension, +separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news} +is the base name, and \samp{rc} is the filename's extension. + +The pattern to match this is quite simple: + +\regexp{.*[.].*\$} + +Notice that the \samp{.} needs to be treated specially because it's a +metacharacter; I've put it inside a character class. Also notice the +trailing \regexp{\$}; this is added to ensure that all the rest of the +string must be included in the extension. This regular expression +matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and +\samp{printers.conf}. + +Now, consider complicating the problem a bit; what if you want to +match filenames where the extension is not \samp{bat}? +Some incorrect attempts: + +\verb|.*[.][^b].*$| +% $ + +The first attempt above tries to exclude \samp{bat} by requiring that +the first character of the extension is not a \samp{b}. This is +wrong, because the pattern also doesn't match \samp{foo.bar}. + +% Messes up the HTML without the curly braces around \^ +\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$} + +The expression gets messier when you try to patch up the first +solution by requiring one of the following cases to match: the first +character of the extension isn't \samp{b}; the second character isn't +\samp{a}; or the third character isn't \samp{t}. 
This accepts +\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a +three-letter extension and won't accept a filename with a two-letter +extension such as \samp{sendmail.cf}. We'll complicate the pattern +again in an effort to fix it. + +\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$} + +In the third attempt, the second and third letters are both made +optional in order to allow matching extensions shorter than three +characters, such as \samp{sendmail.cf}. + +The pattern's getting really complicated now, which makes it hard to +read and understand. Worse, if the problem changes and you want to +exclude both \samp{bat} and \samp{exe} as extensions, the pattern +would get even more complicated and confusing. + +A negative lookahead cuts through all this: + +\regexp{.*[.](?!bat\$).*\$} +% $ + +The lookahead means: if the expression \regexp{bat\$} doesn't match at +this point, try the rest of the pattern; if \regexp{bat\$} does match, +the whole pattern will fail. The trailing \regexp{\$} is required to +ensure that something like \samp{sample.batch}, where the extension +only starts with \samp{bat}, will be allowed. + +Excluding another filename extension is now easy; simply add it as an +alternative inside the assertion. The following pattern excludes +filenames that end in either \samp{bat} or \samp{exe}: + +\regexp{.*[.](?!bat\$|exe\$).*\$} +% $ + + +\section{Modifying Strings} + +Up to this point, we've simply performed searches against a static +string. 
Regular expressions are also commonly used to modify a string +in various ways, using the following \class{RegexObject} methods: + +\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} + \lineii{split()}{Split the string into a list, splitting it wherever the RE matches} + \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string} + \lineii{subn()}{Does the same thing as \method{sub()}, + but returns the new string and the number of replacements} +\end{tableii} + + +\subsection{Splitting Strings} + +The \method{split()} method of a \class{RegexObject} splits a string +apart wherever the RE matches, returning a list of the pieces. +It's similar to the \method{split()} method of strings but +provides much more +generality in the delimiters that you can split by; +\method{split()} only supports splitting by whitespace or by +a fixed string. As you'd expect, there's a module-level +\function{re.split()} function, too. + +\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}} + Split \var{string} by the matches of the regular expression. If + capturing parentheses are used in the RE, then their contents will + also be returned as part of the resulting list. If \var{maxsplit} + is nonzero, at most \var{maxsplit} splits are performed. +\end{methoddesc} + +You can limit the number of splits made, by passing a value for +\var{maxsplit}. When \var{maxsplit} is nonzero, at most +\var{maxsplit} splits will be made, and the remainder of the string is +returned as the final element of the list. In the following example, +the delimiter is any sequence of non-alphanumeric characters. 
+ +\begin{verbatim} +>>> p = re.compile(r'\W+') +>>> p.split('This is a test, short and sweet, of split().') +['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] +>>> p.split('This is a test, short and sweet, of split().', 3) +['This', 'is', 'a', 'test, short and sweet, of split().'] +\end{verbatim} + +Sometimes you're not only interested in what the text between +delimiters is, but also need to know what the delimiter was. If +capturing parentheses are used in the RE, then their values are also +returned as part of the list. Compare the following calls: + +\begin{verbatim} +>>> p = re.compile(r'\W+') +>>> p2 = re.compile(r'(\W+)') +>>> p.split('This... is a test.') +['This', 'is', 'a', 'test', ''] +>>> p2.split('This... is a test.') +['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] +\end{verbatim} + +The module-level function \function{re.split()} adds the RE to be +used as the first argument, but is otherwise the same. + +\begin{verbatim} +>>> re.split('[\W]+', 'Words, words, words.') +['Words', 'words', 'words', ''] +>>> re.split('([\W]+)', 'Words, words, words.') +['Words', ', ', 'words', ', ', 'words', '.', ''] +>>> re.split('[\W]+', 'Words, words, words.', 1) +['Words', 'words, words.'] +\end{verbatim} + +\subsection{Search and Replace} + +Another common task is to find all the matches for a pattern, and +replace them with a different string. The \method{sub()} method takes +a replacement value, which can be either a string or a function, and +the string to be processed. + +\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}} +Returns the string obtained by replacing the leftmost non-overlapping +occurrences of the RE in \var{string} by the replacement +\var{replacement}. If the pattern isn't found, \var{string} is returned +unchanged. + +The optional argument \var{count} is the maximum number of pattern +occurrences to be replaced; \var{count} must be a non-negative +integer. 
The default value of 0 means to replace all occurrences. +\end{methoddesc} + +Here's a simple example of using the \method{sub()} method. It +replaces colour names with the word \samp{colour}: + +\begin{verbatim} +>>> p = re.compile( '(blue|white|red)') +>>> p.sub( 'colour', 'blue socks and red shoes') +'colour socks and colour shoes' +>>> p.sub( 'colour', 'blue socks and red shoes', count=1) +'colour socks and red shoes' +\end{verbatim} + +The \method{subn()} method does the same work, but returns a 2-tuple +containing the new string value and the number of replacements +that were performed: + +\begin{verbatim} +>>> p = re.compile( '(blue|white|red)') +>>> p.subn( 'colour', 'blue socks and red shoes') +('colour socks and colour shoes', 2) +>>> p.subn( 'colour', 'no colours at all') +('no colours at all', 0) +\end{verbatim} + +Empty matches are replaced only when they're not +adjacent to a previous match. + +\begin{verbatim} +>>> p = re.compile('x*') +>>> p.sub('-', 'abxd') +'-a-b-d-' +\end{verbatim} + +If \var{replacement} is a string, any backslash escapes in it are +processed. That is, \samp{\e n} is converted to a single newline +character, \samp{\e r} is converted to a carriage return, and so forth. +Unknown escapes such as \samp{\e j} are left alone. Backreferences, +such as \samp{\e 6}, are replaced with the substring matched by the +corresponding group in the RE. This lets you incorporate +portions of the original text in the resulting +replacement string. + +This example matches the word \samp{section} followed by a string +enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to +\samp{subsection}: + +\begin{verbatim} +>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) +>>> p.sub(r'subsection{\1}','section{First} section{second}') +'subsection{First} subsection{second}' +\end{verbatim} + +There's also a syntax for referring to named groups as defined by the +\regexp{(?P<name>...)} syntax. 
\samp{\e g<name>} will use the +substring matched by the group named \samp{name}, and +\samp{\e g<\var{number}>} +uses the corresponding group number. +\samp{\e g<2>} is therefore equivalent to \samp{\e 2}, +but isn't ambiguous in a +replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be +interpreted as a reference to group 20, not a reference to group 2 +followed by the literal character \character{0}.) The following +substitutions are all equivalent, but use all three variations of the +replacement string. + +\begin{verbatim} +>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) +>>> p.sub(r'subsection{\1}','section{First}') +'subsection{First}' +>>> p.sub(r'subsection{\g<1>}','section{First}') +'subsection{First}' +>>> p.sub(r'subsection{\g<name>}','section{First}') +'subsection{First}' +\end{verbatim} + +\var{replacement} can also be a function, which gives you even more +control. If \var{replacement} is a function, the function is +called for every non-overlapping occurrence of \var{pattern}. On each +call, the function is +passed a \class{MatchObject} argument for the match +and can use this information to compute the desired replacement string and return it. + +In the following example, the replacement function translates +decimals into hexadecimal: + +\begin{verbatim} +>>> def hexrepl( match ): +... "Return the hex string for a decimal number" +... value = int( match.group() ) +... return hex(value) +... +>>> p = re.compile(r'\d+') +>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') +'Call 0xffd2 for printing, 0xc000 for user code.' +\end{verbatim} + +When using the module-level \function{re.sub()} function, the pattern +is passed as the first argument. The pattern may be a string or a +\class{RegexObject}; if you need to specify regular expression flags, +you must either use a \class{RegexObject} as the first parameter, or use +embedded modifiers in the pattern, e.g. 
\code{sub("(?i)b+", "x", "bbbb +BBBB")} returns \code{'x x'}. + +\section{Common Problems} + +Regular expressions are a powerful tool for some applications, but in +some ways their behaviour isn't intuitive and at times they don't +behave the way you may expect them to. This section will point out +some of the most common pitfalls. + +\subsection{Use String Methods} + +Sometimes using the \module{re} module is a mistake. If you're +matching a fixed string, or a single character class, and you're not +using any \module{re} features such as the \constant{IGNORECASE} flag, +then the full power of regular expressions may not be required. +Strings have several methods for performing operations with fixed +strings and they're usually much faster, because the implementation is +a single small C loop that's been optimized for the purpose, instead +of the large, more generalized regular expression engine. + +One example might be replacing a single fixed string with another +one; for example, you might replace \samp{word} +with \samp{deed}. \code{re.sub()} seems like the function to use for +this, but consider the \method{replace()} method. Note that +\method{replace()} will also replace \samp{word} inside +words, turning \samp{swordfish} into \samp{sdeedfish}, but the +na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing +the substitution on parts of words, the pattern would have to be +\regexp{\e bword\e b}, in order to require that \samp{word} have a +word boundary on either side. This takes the job beyond +\method{replace()}'s abilities.) + +Another common task is deleting every occurrence of a single character +from a string or replacing it with another single character. You +might do this with something like \code{re.sub('\e n', ' ', S)}, but +\method{translate()} is capable of doing both tasks +and will be faster than any regular expression operation can be. 
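As a rough illustration of the string methods discussed above, here is a minimal sketch. It assumes a current Python 3 interpreter, where translation tables are built with \code{str.maketrans()}; the sample text and variable names are made up for the example.

```python
import re

text = "swordfish and word\nplay"

# Fixed-string replacement: no regular expression needed.
# Note that it also rewrites 'word' inside 'swordfish'.
replaced = text.replace("word", "deed")

# Replacing only whole words is beyond replace(); this needs the re module.
whole_words = re.sub(r"\bword\b", "deed", text)

# Single-character substitution with translate():
# map each newline to a space.
newline_to_space = text.translate(str.maketrans("\n", " "))

# Single-character deletion with translate():
# the third argument to maketrans() lists characters to drop.
without_newlines = text.translate(str.maketrans("", "", "\n"))

print(repr(replaced))
print(repr(whole_words))
print(repr(newline_to_space))
print(repr(without_newlines))
```

Only the whole-word case actually requires a regular expression; the other three stay within plain string methods.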
+ +In short, before turning to the \module{re} module, consider whether +your problem can be solved with a faster and simpler string method. + +\subsection{match() versus search()} + +The \function{match()} function only checks if the RE matches at +the beginning of the string while \function{search()} will scan +forward through the string for a match. +It's important to keep this distinction in mind. Remember, +\function{match()} will only report a successful match which +will start at 0; if the match wouldn't start at zero, +\function{match()} will \emph{not} report it. + +\begin{verbatim} +>>> print re.match('super', 'superstition').span() +(0, 5) +>>> print re.match('super', 'insuperable') +None +\end{verbatim} + +On the other hand, \function{search()} will scan forward through the +string, reporting the first match it finds. + +\begin{verbatim} +>>> print re.search('super', 'superstition').span() +(0, 5) +>>> print re.search('super', 'insuperable').span() +(2, 7) +\end{verbatim} + +Sometimes you'll be tempted to keep using \function{re.match()}, and +just add \regexp{.*} to the front of your RE. Resist this temptation +and use \function{re.search()} instead. The regular expression +compiler does some analysis of REs in order to speed up the process of +looking for a match. One such analysis figures out what the first +character of a match must be; for example, a pattern starting with +\regexp{Crow} must match starting with a \character{C}. The analysis +lets the engine quickly scan through the string looking for the +starting character, only trying the full match if a \character{C} is found. + +Adding \regexp{.*} defeats this optimization, requiring scanning to +the end of the string and then backtracking to find a match for the +rest of the RE. Use \function{re.search()} instead. + +\subsection{Greedy versus Non-Greedy} + +When repeating a regular expression, as in \regexp{a*}, the resulting +action is to consume as much of the pattern as possible. 
This +fact often bites you when you're trying to match a pair of +balanced delimiters, such as the angle brackets surrounding an HTML +tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't +work because of the greedy nature of \regexp{.*}. + +\begin{verbatim} +>>> s = '<html><head><title>Title</title>' +>>> len(s) +32 +>>> print re.match('<.*>', s).span() +(0, 32) +>>> print re.match('<.*>', s).group() +<html><head><title>Title</title> +\end{verbatim} + +The RE matches the \character{<} in \samp{<html>}, and the +\regexp{.*} consumes the rest of the string. There's still more left +in the RE, though, and the \regexp{>} can't match at the end of +the string, so the regular expression engine has to backtrack +character by character until it finds a match for the \regexp{>}. +The final match extends from the \character{<} in \samp{<html>} +to the \character{>} in \samp{</title>}, which isn't what you want. + +In this case, the solution is to use the non-greedy qualifiers +\regexp{*?}, \regexp{+?}, \regexp{??}, or +\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as +possible. In the above example, the \character{>} is tried +immediately after the first \character{<} matches, and when it fails, +the engine advances a character at a time, retrying the \character{>} +at every step. This produces just the right result: + +\begin{verbatim} +>>> print re.match('<.*?>', s).group() +<html> +\end{verbatim} + +(Note that parsing HTML or XML with regular expressions is painful. +Quick-and-dirty patterns will handle common cases, but HTML and XML +have special cases that will break the obvious regular expression; by +the time you've written a regular expression that handles all of the +possible cases, the patterns will be \emph{very} complicated. Use an +HTML or XML parser module for such tasks.) 
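A short sketch contrasting the two qualifiers with a real parser follows. It assumes Python 3, where the standard library's HTML parser lives in \module{html.parser}; the \code{TagCollector} class is a hypothetical helper written for this example, not something the library provides.

```python
import re
from html.parser import HTMLParser

s = '<html><head><title>Title</title>'

# Greedy: .* runs to the end of the string, then backtracks to the last '>'.
greedy = re.findall('<.*>', s)

# Non-greedy: .*? stops at the first '>' after each '<'.
non_greedy = re.findall('<.*?>', s)


class TagCollector(HTMLParser):
    """Hypothetical helper: record the name of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(s)
collector.close()

print(greedy)
print(non_greedy)
print(collector.tags)
```

The non-greedy pattern is fine for quick experiments like this one, while the parser version is the one that keeps working as the markup gets less tidy.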
+ +\subsection{Not Using re.VERBOSE} + +By now you've probably noticed that regular expressions are a very +compact notation, but they're not terribly readable. REs of +moderate complexity can become lengthy collections of backslashes, +parentheses, and metacharacters, making them difficult to read and +understand. + +For such REs, specifying the \code{re.VERBOSE} flag when +compiling the regular expression can be helpful, because it allows +you to format the regular expression more clearly. + +The \code{re.VERBOSE} flag has several effects. Whitespace in the +regular expression that \emph{isn't} inside a character class is +ignored. This means that an expression such as \regexp{dog | cat} is +equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]} +will still match the characters \character{a}, \character{b}, or a +space. In addition, you can put comments inside an RE; comments +extend from a \samp{\#} character to the next newline. When used with +triple-quoted strings, this enables REs to be formatted more neatly: + +\begin{verbatim} +pat = re.compile(r""" + \s* # Skip leading whitespace + (?P<header>[^:]+) # Header name + \s* : # Whitespace, and a colon + (?P<value>.*?) # The header's value -- *? used to + # lose the following trailing whitespace + \s*$ # Trailing whitespace to end-of-line +""", re.VERBOSE) +\end{verbatim} +% $ + +This is far more readable than: + +\begin{verbatim} +pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") +\end{verbatim} +% $ + +\section{Feedback} + +Regular expressions are a complicated topic. Did this document help +you understand them? Were there parts that were unclear, or problems +you encountered that weren't covered here? If so, please send +suggestions for improvements to the author. + +The most complete book on regular expressions is almost certainly +Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published +by O'Reilly. 
Unfortunately, it exclusively concentrates on Perl and +Java's flavours of regular expressions, and doesn't contain any Python +material at all, so it won't be useful as a reference for programming +in Python. (The first edition covered Python's now-obsolete +\module{regex} module, which won't help you much.) Consider checking +it out from your library. + +\end{document} + diff --git a/Doc/howto/rexec.tex b/Doc/howto/rexec.tex new file mode 100644 index 0000000..44a0b30 --- /dev/null +++ b/Doc/howto/rexec.tex @@ -0,0 +1,61 @@ +\documentclass{howto} + +\title{Restricted Execution HOWTO} + +\release{2.1} + +\author{A.M. Kuchling} +\authoraddress{\email{amk@amk.ca}} + +\begin{document} + +\maketitle + +\begin{abstract} +\noindent + +Python 2.2.2 and earlier provided a \module{rexec} module for running +untrusted code. However, it's never been exhaustively audited for +security and it hasn't been updated to take into account recent +changes to Python such as new-style classes. Therefore, the +\module{rexec} module should not be trusted. To discourage use of +\module{rexec}, this HOWTO has been withdrawn. + +The \module{rexec} and \module{Bastion} modules have been disabled in +the Python CVS tree, both on the trunk (which will eventually become +Python 2.3alpha2 and later 2.3final) and on the release22-maint branch +(which will become Python 2.2.3, if someone ever volunteers to issue +2.2.3). + +For discussion of the problems with \module{rexec}, see the python-dev +threads starting at the following URLs: +\url{http://mail.python.org/pipermail/python-dev/2002-December/031160.html}, +and +\url{http://mail.python.org/pipermail/python-dev/2003-January/031848.html}. + +\end{abstract} + + +\section{Version History} + +Sep. 12, 1998: Minor revisions and added the reference to the Janus +project. + +Feb. 26, 1998: First version. Suggestions are welcome. + +Mar. 16, 1998: Made some revisions suggested by Jeff Rush. 
Some minor +changes and clarifications, and a sizable section on exceptions added. + +Oct. 4, 2000: Checked with Python 2.0. Minor rewrites and fixes made. +Version number increased to 2.0. + +Dec. 17, 2002: Withdrawn. + +Jan. 8, 2003: Mention that \module{rexec} will be disabled in Python 2.3, +and added links to relevant python-dev threads. + +\end{document} + + + + diff --git a/Doc/howto/sockets.tex b/Doc/howto/sockets.tex new file mode 100644 index 0000000..4da92a8 --- /dev/null +++ b/Doc/howto/sockets.tex @@ -0,0 +1,460 @@ +\documentclass{howto} + +\title{Socket Programming HOWTO} + +\release{0.00} + +\author{Gordon McMillan} +\authoraddress{\email{gmcm@hypernet.com}} + +\begin{document} +\maketitle + +\begin{abstract} +\noindent +Sockets are used nearly everywhere, but are one of the most severely +misunderstood technologies around. This is a 10,000 foot overview of +sockets. It's not really a tutorial - you'll still have work to do in +getting things operational. It doesn't cover the fine points (and there +are a lot of them), but I hope it will give you enough background to +begin using them decently. + +This document is available from the Python HOWTO page at +\url{http://www.python.org/doc/howto}. + +\end{abstract} + +\tableofcontents + +\section{Sockets} + +Sockets are used nearly everywhere, but are one of the most severely +misunderstood technologies around. This is a 10,000 foot overview of +sockets. It's not really a tutorial - you'll still have work to do in +getting things working. It doesn't cover the fine points (and there +are a lot of them), but I hope it will give you enough background to +begin using them decently. + +I'm only going to talk about INET sockets, but they account for at +least 99\% of the sockets in use. And I'll only talk about STREAM +sockets - unless you really know what you're doing (in which case this +HOWTO isn't for you!), you'll get better behavior and performance from +a STREAM socket than anything else. 
I will try to clear up the mystery +of what a socket is, as well as give some hints on how to work with +blocking and non-blocking sockets. But I'll start by talking about +blocking sockets. You'll need to know how they work before dealing +with non-blocking sockets. + +Part of the trouble with understanding these things is that "socket" +can mean a number of subtly different things, depending on context. So +first, let's make a distinction between a "client" socket - an +endpoint of a conversation, and a "server" socket, which is more like +a switchboard operator. The client application (your browser, for +example) uses "client" sockets exclusively; the web server it's +talking to uses both "server" sockets and "client" sockets. + + +\subsection{History} + +Of the various forms of IPC (\emph{Inter Process Communication}), +sockets are by far the most popular. On any given platform, there are +likely to be other forms of IPC that are faster, but for +cross-platform communication, sockets are about the only game in town. + +They were invented in Berkeley as part of the BSD flavor of Unix. They +spread like wildfire with the Internet. With good reason --- the +combination of sockets with INET makes talking to arbitrary machines +around the world unbelievably easy (at least compared to other +schemes). + +\section{Creating a Socket} + +Roughly speaking, when you clicked on the link that brought you to +this page, your browser did something like the following: + +\begin{verbatim} + #create an INET, STREAMing socket + s = socket.socket( + socket.AF_INET, socket.SOCK_STREAM) + #now connect to the web server on port 80 + # - the normal http port + s.connect(("www.mcmillan-inc.com", 80)) +\end{verbatim} + +When the \code{connect} completes, the socket \code{s} can +now be used to send in a request for the text of this page. The same +socket will read the reply, and then be destroyed. That's right - +destroyed. 
Client sockets are normally only used for one exchange (or +a small set of sequential exchanges). + +What happens in the web server is a bit more complex. First, the web +server creates a "server socket". + +\begin{verbatim} + #create an INET, STREAMing socket + serversocket = socket.socket( + socket.AF_INET, socket.SOCK_STREAM) + #bind the socket to a public host, + # and a well-known port + serversocket.bind((socket.gethostname(), 80)) + #become a server socket + serversocket.listen(5) +\end{verbatim} + +A couple of things to notice: we used \code{socket.gethostname()} +so that the socket would be visible to the outside world. If we had +used \code{s.bind(('localhost', 80))} or +\code{s.bind(('127.0.0.1', 80))} we would still +have a "server" socket, but one that was only visible within the same +machine. \code{s.bind(('', 80))} specifies that the socket is reachable +by any address the machine happens to have. + +A second thing to note: low number ports are usually reserved for +"well known" services (HTTP, SNMP etc). If you're playing around, use +a nice high number (4 digits). + +Finally, the argument to \code{listen} tells the socket library that +we want it to queue up as many as 5 connect requests (the normal max) +before refusing outside connections. If the rest of the code is +written properly, that should be plenty. + +OK, now we have a "server" socket, listening on port 80. Now we enter +the mainloop of the web server: + +\begin{verbatim} + while 1: + #accept connections from outside + (clientsocket, address) = serversocket.accept() + #now do something with the clientsocket + #in this case, we'll pretend this is a threaded server + ct = client_thread(clientsocket) + ct.run() +\end{verbatim} + +There are actually 3 general ways in which this loop could work - +dispatching a thread to handle \code{clientsocket}, creating a new +process to handle \code{clientsocket}, or restructuring this app +to use non-blocking sockets, and multiplex between our "server" socket +and any active \code{clientsocket}s using +\code{select}. More about that later. 
The important thing to +understand now is this: this is \emph{all} a "server" socket +does. It doesn't send any data. It doesn't receive any data. It just +produces "client" sockets. Each \code{clientsocket} is created +in response to some \emph{other} "client" socket doing a +\code{connect()} to the host and port we're bound to. As soon as +we've created that \code{clientsocket}, we go back to listening +for more connections. The two "clients" are free to chat it up - they +are using some dynamically allocated port which will be recycled when +the conversation ends. + +\subsection{IPC} If you need fast IPC between two processes +on one machine, you should look into whatever form of shared memory +the platform offers. A simple protocol based around shared memory and +locks or semaphores is by far the fastest technique. + +If you do decide to use sockets, bind the "server" socket to +\code{'localhost'}. On most platforms, this will take a shortcut +around a couple of layers of network code and be quite a bit faster. + + +\section{Using a Socket} + +The first thing to note, is that the web browser's "client" socket and +the web server's "client" socket are identical beasts. That is, this +is a "peer to peer" conversation. Or to put it another way, \emph{as the +designer, you will have to decide what the rules of etiquette are for +a conversation}. Normally, the \code{connect}ing socket +starts the conversation, by sending in a request, or perhaps a +signon. But that's a design decision - it's not a rule of sockets. + +Now there are two sets of verbs to use for communication. You can use +\code{send} and \code{recv}, or you can transform your +client socket into a file-like beast and use \code{read} and +\code{write}. The latter is the way Java presents their +sockets. I'm not going to talk about it here, except to warn you that +you need to use \code{flush} on sockets. 
These are buffered +"files", and a common mistake is to \code{write} something, and +then \code{read} for a reply. Without a \code{flush} in +there, you may wait forever for the reply, because the request may +still be in your output buffer. + +Now we come to the major stumbling block of sockets - \code{send} +and \code{recv} operate on the network buffers. They do not +necessarily handle all the bytes you hand them (or expect from them), +because their major focus is handling the network buffers. In general, +they return when the associated network buffers have been filled +(\code{send}) or emptied (\code{recv}). They then tell you +how many bytes they handled. It is \emph{your} responsibility to call +them again until your message has been completely dealt with. + +When a \code{recv} returns 0 bytes, it means the other side has +closed (or is in the process of closing) the connection. You will not +receive any more data on this connection. Ever. You may be able to +send data successfully; I'll talk more about this later. + +A protocol like HTTP uses a socket for only one transfer. The client +sends a request, then reads a reply. That's it. The socket is +discarded. This means that a client can detect the end of the reply by +receiving 0 bytes. + +But if you plan to reuse your socket for further transfers, you need +to realize that \emph{there is no "EOT" (End of Transfer) on a +socket.} I repeat: if a socket \code{send} or +\code{recv} returns after handling 0 bytes, the connection has +been broken. If the connection has \emph{not} been broken, you may +wait on a \code{recv} forever, because the socket will +\emph{not} tell you that there's nothing more to read (for now). Now +if you think about that a bit, you'll come to realize a fundamental +truth of sockets: \emph{messages must either be fixed length} (yuck), +\emph{or be delimited} (shrug), \emph{or indicate how long they are} +(much better), \emph{or end by shutting down the connection}. 
The +choice is entirely yours (but some ways are righter than others). + +Assuming you don't want to end the connection, the simplest solution +is a fixed length message: + +\begin{verbatim} + class mysocket: + '''demonstration class only + - coded for clarity, not efficiency''' + def __init__(self, sock=None): + if sock is None: + self.sock = socket.socket( + socket.AF_INET, socket.SOCK_STREAM) + else: + self.sock = sock + def connect(self, host, port): + self.sock.connect((host, port)) + def mysend(self, msg): + totalsent = 0 + while totalsent < MSGLEN: + sent = self.sock.send(msg[totalsent:]) + if sent == 0: + raise RuntimeError( + "socket connection broken") + totalsent = totalsent + sent + def myreceive(self): + msg = '' + while len(msg) < MSGLEN: + chunk = self.sock.recv(MSGLEN-len(msg)) + if chunk == '': + raise RuntimeError( + "socket connection broken") + msg = msg + chunk + return msg +\end{verbatim} + +The sending code here is usable for almost any messaging scheme - in +Python you send strings, and you can use \code{len()} to +determine a string's length (even if it has embedded \code{\e 0} +characters). It's mostly the receiving code that gets more +complex. (And in C, it's not much worse, except you can't use +\code{strlen} if the message has embedded \code{\e 0}s.) + +The easiest enhancement is to make the first character of the message +an indicator of message type, and have the type determine the +length. Now you have two \code{recv}s - the first to get (at +least) that first character so you can look up the length, and the +second in a loop to get the rest. If you decide to go the delimited +route, you'll be receiving in some arbitrary chunk size, (4096 or 8192 +is frequently a good match for network buffer sizes), and scanning +what you've received for a delimiter. 
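The delimited route described above might be sketched as follows. This is only an illustration under stated assumptions: it is written for Python 3 (where sockets carry bytes rather than strings), the newline delimiter and the \code{DelimitedReader} name are hypothetical choices, and \code{socket.socketpair()} stands in for a real network connection so the example can run on one machine.

```python
import socket

DELIMITER = b"\n"  # hypothetical choice; any byte sequence absent from messages works


class DelimitedReader:
    """Sketch only: split an incoming byte stream into delimited messages."""

    def __init__(self, sock, chunk_size=4096):
        self.sock = sock
        self.chunk_size = chunk_size
        self.buffer = b""  # bytes received but not yet handed out

    def read_message(self):
        # Keep receiving until the buffer holds at least one full message.
        while DELIMITER not in self.buffer:
            chunk = self.sock.recv(self.chunk_size)
            if chunk == b"":
                raise RuntimeError("socket connection broken")
            self.buffer += chunk
        # Hand out one message; whatever follows the delimiter stays buffered.
        message, _, self.buffer = self.buffer.partition(DELIMITER)
        return message


# A connected pair of sockets standing in for a client and a server.
sender, receiver = socket.socketpair()
sender.sendall(b"first\nsecond\n")

reader = DelimitedReader(receiver)
first = reader.read_message()
second = reader.read_message()
print(first, second)

sender.close()
receiver.close()
```

Because a single \code{recv} may return more than one message, the reader keeps the surplus bytes in its buffer instead of throwing them away.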
+ +One complication to be aware of: if your conversational protocol +allows multiple messages to be sent back to back (without some kind of +reply), and you pass \code{recv} an arbitrary chunk size, you +may end up reading the start of a following message. You'll need to +put that aside and hold onto it, until it's needed. + +Prefixing the message with its length (say, as 5 numeric characters) +gets more complex, because (believe it or not), you may not get all 5 +characters in one \code{recv}. In playing around, you'll get +away with it; but under high network loads, your code will very quickly +break unless you use two \code{recv} loops - the first to +determine the length, the second to get the data part of the +message. Nasty. This is also when you'll discover that +\code{send} does not always manage to get rid of everything in +one pass. And despite having read this, you will eventually get bitten by +it! + +In the interests of space, building your character, (and preserving my +competitive position), these enhancements are left as an exercise for +the reader. Let's move on to cleaning up. + +\subsection{Binary Data} + +It is perfectly possible to send binary data over a socket. The major +problem is that not all machines use the same formats for binary +data. For example, a Motorola chip will represent a 16 bit integer +with the value 1 as the two hex bytes 00 01. Intel and DEC, however, +are byte-reversed - that same 1 is 01 00. Socket libraries have calls +for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs, +htons} where "n" means \emph{network} and "h" means \emph{host}, +"s" means \emph{short} and "l" means \emph{long}. Where network order +is host order, these do nothing, but where the machine is +byte-reversed, these swap the bytes around appropriately. + +In these days of 32 bit machines, the ASCII representation of binary +data is frequently smaller than the binary representation. 
That's
+because a surprising amount of the time, all those longs have the
+value 0, or maybe 1. The string "0" would be two bytes, while binary
+is four. Of course, this doesn't fit well with fixed-length
+messages. Decisions, decisions.
+
+\section{Disconnecting}
+
+Strictly speaking, you're supposed to use \code{shutdown} on a
+socket before you \code{close} it. The \code{shutdown} is
+an advisory to the socket at the other end. Depending on the argument
+you pass it, it can mean "I'm not going to send anymore, but I'll
+still listen", or "I'm not listening, good riddance!". Most socket
+libraries, however, are so used to programmers neglecting to use this
+piece of etiquette that normally a \code{close} is the same as
+\code{shutdown(); close()}. So in most situations, an explicit
+\code{shutdown} is not needed.
+
+One way to use \code{shutdown} effectively is in an HTTP-like
+exchange. The client sends a request and then does a
+\code{shutdown(1)}. This tells the server "This client is done
+sending, but can still receive." The server can detect "EOF" by a
+receive of 0 bytes. It can assume it has the complete request. The
+server sends a reply. If the \code{send} completes successfully
+then, indeed, the client was still receiving.
+
+Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done.
+
+
+\subsection{When Sockets Die}
+
+Probably the worst thing about using blocking sockets is what happens
+when the other side comes down hard (without doing a
+\code{close}). Your socket is likely to hang. TCP is a
+reliable protocol, and it will wait a long, long time before giving up
+on a connection.
If you're using threads, the entire thread is
+essentially dead. There's not much you can do about it. As long as you
+aren't doing something dumb, like holding a lock while doing a
+blocking read, the thread isn't really consuming much in the way of
+resources. Do \emph{not} try to kill the thread - part of the reason
+that threads are more efficient than processes is that they avoid the
+overhead associated with the automatic recycling of resources. In
+other words, if you do manage to kill the thread, your whole process
+is likely to be screwed up.
+
+\section{Non-blocking Sockets}
+
+If you've understood the preceding, you already know most of what you
+need to know about the mechanics of using sockets. You'll still use
+the same calls, in much the same ways. It's just that, if you do it
+right, your app will be almost inside-out.
+
+In Python, you use \code{socket.setblocking(0)} to make it
+non-blocking. In C, it's more complex, (for one thing, you'll need to
+choose between the Posix flavor \code{O_NONBLOCK} and the older, almost
+indistinguishable flavor \code{O_NDELAY}, which is
+completely different from \code{TCP_NODELAY}), but it's the
+exact same idea. You do this after creating the socket, but before
+using it. (Actually, if you're nuts, you can switch back and forth.)
+
+The major mechanical difference is that \code{send},
+\code{recv}, \code{connect} and \code{accept} can
+return without having done anything. You have (of course) a number of
+choices. You can check return codes and error codes and generally drive
+yourself crazy. If you don't believe me, try it sometime. Your app
+will grow large, buggy and suck CPU. So let's skip the brain-dead
+solutions and do it right.
+
+Use \code{select}.
+
+In C, coding \code{select} is fairly complex. In Python, it's a
+piece of cake, but it's close enough to the C version that if you
+understand \code{select} in Python, you'll have little trouble
+with it in C.
+
+\begin{verbatim}
+    ready_to_read, ready_to_write, in_error = \
+                   select.select(
+                      potential_readers,
+                      potential_writers,
+                      potential_errs,
+                      timeout)
+\end{verbatim}
+
+You pass \code{select} three lists: the first contains all
+sockets that you might want to try reading; the second all the sockets
+you might want to try writing to, and the last (normally left empty)
+those that you want to check for errors. You should note that a
+socket can go into more than one list. The \code{select} call is
+blocking, but you can give it a timeout. This is generally a sensible
+thing to do - give it a nice long timeout (say a minute) unless you
+have good reason to do otherwise.
+
+In return, you will get three lists. They have the sockets that are
+actually readable, writable and in error. Each of these lists is a
+subset (possibly empty) of the corresponding list you passed in. If
+you put a socket in more than one input list, it can show up in more
+than one output list (a socket may be both readable and writable).
+
+If a socket is in the output readable list, you can be
+as-close-to-certain-as-we-ever-get-in-this-business that a
+\code{recv} on that socket will return \emph{something}. Same
+idea for the writable list. You'll be able to send
+\emph{something}. Maybe not all you want to, but \emph{something} is
+better than nothing. (Actually, any reasonably healthy socket will
+return as writable - it just means outbound network buffer space is
+available.)
+
+If you have a "server" socket, put it in the \code{potential_readers}
+list. If it comes out in the readable list, your \code{accept}
+will (almost certainly) work. If you have created a new socket to
+\code{connect} to someone else, put it in the \code{potential_writers}
+list. If it shows up in the writable list, you have a decent chance
+that it has connected.
+
+One very nasty problem with \code{select}: if somewhere in those
+input lists of sockets is one which has died a nasty death, the
+\code{select} will fail.
You then need to loop through every +single damn socket in all those lists and do a +\code{select([sock],[],[],0)} until you find the bad one. That +timeout of 0 means it won't take long, but it's ugly. + +Actually, \code{select} can be handy even with blocking sockets. +It's one way of determining whether you will block - the socket +returns as readable when there's something in the buffers. However, +this still doesn't help with the problem of determining whether the +other end is done, or just busy with something else. + +\textbf{Portability alert}: On Unix, \code{select} works both with +the sockets and files. Don't try this on Windows. On Windows, +\code{select} works with sockets only. Also note that in C, many +of the more advanced socket options are done differently on +Windows. In fact, on Windows I usually use threads (which work very, +very well) with my sockets. Face it, if you want any kind of +performance, your code will look very different on Windows than on +Unix. (I haven't the foggiest how you do this stuff on a Mac.) + +\subsection{Performance} + +There's no question that the fastest sockets code uses non-blocking +sockets and select to multiplex them. You can put together something +that will saturate a LAN connection without putting any strain on the +CPU. The trouble is that an app written this way can't do much of +anything else - it needs to be ready to shuffle bytes around at all +times. + +Assuming that your app is actually supposed to do something more than +that, threading is the optimal solution, (and using non-blocking +sockets will be faster than using blocking sockets). Unfortunately, +threading support in Unixes varies both in API and quality. So the +normal Unix solution is to fork a subprocess to deal with each +connection. The overhead for this is significant (and don't do this on +Windows - the overhead of process creation is enormous there). 
It also +means that unless each subprocess is completely independent, you'll +need to use another form of IPC, say a pipe, or shared memory and +semaphores, to communicate between the parent and child processes. + +Finally, remember that even though blocking sockets are somewhat +slower than non-blocking, in many cases they are the "right" +solution. After all, if your app is driven by the data it receives +over a socket, there's not much sense in complicating the logic just +so your app can wait on \code{select} instead of +\code{recv}. + +\end{document} diff --git a/Doc/howto/sorting.tex b/Doc/howto/sorting.tex new file mode 100644 index 0000000..a849c66 --- /dev/null +++ b/Doc/howto/sorting.tex @@ -0,0 +1,267 @@ +\documentclass{howto} + +\title{Sorting Mini-HOWTO} + +% Increment the release number whenever significant changes are made. +% The author and/or editor can define 'significant' however they like. +\release{0.01} + +\author{Andrew Dalke} +\authoraddress{\email{dalke@bioreason.com}} + +\begin{document} +\maketitle + +\begin{abstract} +\noindent +This document is a little tutorial +showing a half dozen ways to sort a list with the built-in +\method{sort()} method. + +This document is available from the Python HOWTO page at +\url{http://www.python.org/doc/howto}. +\end{abstract} + +\tableofcontents + +Python lists have a built-in \method{sort()} method. There are many +ways to use it to sort a list and there doesn't appear to be a single, +central place in the various manuals describing them, so I'll do so +here. + +\section{Sorting basic data types} + +A simple ascending sort is easy; just call the \method{sort()} method of a list. + +\begin{verbatim} +>>> a = [5, 2, 3, 1, 4] +>>> a.sort() +>>> print a +[1, 2, 3, 4, 5] +\end{verbatim} + +Sort takes an optional function which can be called for doing the +comparisons. 
The default sort routine is equivalent to
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort(cmp)
+>>> print a
+[1, 2, 3, 4, 5]
+\end{verbatim}
+
+where \function{cmp} is the built-in function which compares two objects, \code{x} and
+\code{y}, and returns -1, 0 or 1 depending on whether $x<y$, $x==y$, or $x>y$. During
+the course of the sort the relationships must stay the same for the
+final list to make sense.
+
+If you want, you can define your own function for the comparison. For
+integers (and numbers in general) we can do:
+
+\begin{verbatim}
+>>> def numeric_compare(x, y):
+...     return x-y
+...
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort(numeric_compare)
+>>> print a
+[1, 2, 3, 4, 5]
+\end{verbatim}
+
+By the way, this function won't work if the result of the subtraction
+is out of range, as in \code{sys.maxint - (-1)}.
+
+Or, if you don't want to define a new named function you can create an
+anonymous one using \keyword{lambda}, as in:
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort(lambda x, y: x-y)
+>>> print a
+[1, 2, 3, 4, 5]
+\end{verbatim}
+
+If you want the numbers sorted in reverse you can do
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> def reverse_numeric(x, y):
+...     return y-x
+...
+>>> a.sort(reverse_numeric)
+>>> print a
+[5, 4, 3, 2, 1]
+\end{verbatim}
+
+(a more general implementation could return \code{cmp(y,x)} or \code{-cmp(x,y)}).
+
+However, it's faster if Python doesn't have to call a function for
+every comparison, so if you want a reverse-sorted list of basic data
+types, do the forward sort first, then use the \method{reverse()} method.
+
+\begin{verbatim}
+>>> a = [5, 2, 3, 1, 4]
+>>> a.sort()
+>>> a.reverse()
+>>> print a
+[5, 4, 3, 2, 1]
+\end{verbatim}
+
+Here's a case-insensitive string comparison using a \keyword{lambda} function:
+
+\begin{verbatim}
+>>> import string
+>>> a = string.split("This is a test string from Andrew.")
+>>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
+>>> print a
+['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
+\end{verbatim}
+
+This goes through the overhead of converting a word to lower case
+every time it must be compared. At times it may be faster to compute
+these once and use those values, and the following example shows how.
+
+\begin{verbatim}
+>>> words = string.split("This is a test string from Andrew.")
+>>> offsets = []
+>>> for i in range(len(words)):
+...     offsets.append( (string.lower(words[i]), i) )
+...
+>>> offsets.sort()
+>>> new_words = []
+>>> for dontcare, i in offsets:
+...     new_words.append(words[i])
+...
+>>> print new_words
+\end{verbatim}
+
+Each entry in the \code{offsets} list is a tuple of the lower-case string
+and its position in the \code{words} list. The list is then sorted. Python's
+sort method sorts tuples by comparing terms; given \code{x} and \code{y}, compare
+\code{x[0]} to \code{y[0]}, then \code{x[1]} to \code{y[1]}, etc. until there is a difference.
+
+The result is that the \code{offsets} list is ordered by its first
+term, and the second term can be used to figure out where the original
+data was stored. (The \code{for} loop assigns \code{dontcare} and
+\code{i} to the two fields of each term in the list, but we only need the
+index value.)
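The same compute-once idea can be packaged as a small function. This is only a sketch: the name \code{sort_case_insensitive} is made up for illustration, and it uses string methods, list comprehensions and \code{enumerate()} instead of the \module{string} module calls above, so it assumes a Python version that provides them.

```python
def sort_case_insensitive(words):
    # Pair each lower-cased word with its position in `words`.
    offsets = [(word.lower(), i) for i, word in enumerate(words)]
    # Tuples compare on the lower-cased word first, then on the index.
    offsets.sort()
    # Use the stored index to recover the original spelling.
    return [words[i] for _, i in offsets]

words = "This is a test string from Andrew.".split()
print(sort_case_insensitive(words))
```

Each word is lower-cased exactly once, however many comparisons the sort ends up making.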
+
+Another way to implement this is to store the original data as the
+second term in the \code{offsets} list, as in:
+
+\begin{verbatim}
+>>> words = string.split("This is a test string from Andrew.")
+>>> offsets = []
+>>> for word in words:
+...     offsets.append( (string.lower(word), word) )
+...
+>>> offsets.sort()
+>>> new_words = []
+>>> for word in offsets:
+...     new_words.append(word[1])
+...
+>>> print new_words
+\end{verbatim}
+
+This isn't always appropriate because the second terms in the list
+(the word, in this example) will be compared when the first terms are
+the same. If this happens many times, then there will be the unneeded
+performance hit of comparing the two objects. This can be a large
+cost if most terms are the same and the objects define their own
+\method{__cmp__} method, but there will still be some overhead to determine if
+\method{__cmp__} is defined.
+
+Still, for large lists, or for lists where the comparison information
+is expensive to calculate, the last two examples are likely to be the
+fastest way to sort a list. It will not work on weakly sorted data,
+like complex numbers, but if you don't know what that means, you
+probably don't need to worry about it.
+
+\section{Comparing classes}
+
+The comparison for two basic data types, like ints to ints or string to
+string, is built into Python and makes sense. There is a default way
+to compare class instances, but the default manner isn't usually very
+useful.
You can define your own comparison with the \method{__cmp__} method,
+as in:
+
+\begin{verbatim}
+>>> class Spam:
+...     def __init__(self, spam, eggs):
+...         self.spam = spam
+...         self.eggs = eggs
+...     def __cmp__(self, other):
+...         return cmp(self.spam+self.eggs, other.spam+other.eggs)
+...     def __str__(self):
+...         return str(self.spam + self.eggs)
+...
+>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
+>>> a.sort()
+>>> for spam in a:
+...     print str(spam)
+...
+5
+10
+12
+\end{verbatim}
+
+Sometimes you may want to sort by a specific attribute of a class. If
+appropriate you should just define the \method{__cmp__} method to compare
+those values, but you cannot do this if you want to compare between
+different attributes at different times. Instead, you'll need to go
+back to passing a comparison function to sort, as in:
+
+\begin{verbatim}
+>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
+>>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
+>>> for spam in a:
+...     print spam.eggs, str(spam)
+...
+3 12
+4 5
+6 10
+\end{verbatim}
+
+If you want to compare two arbitrary attributes (and aren't overly
+concerned about performance) you can even define your own comparison
+function object. This uses the ability of a class instance to emulate
+a function by defining the \method{__call__} method, as in:
+
+\begin{verbatim}
+>>> class CmpAttr:
+...     def __init__(self, attr):
+...         self.attr = attr
+...     def __call__(self, x, y):
+...         return cmp(getattr(x, self.attr), getattr(y, self.attr))
+...
+>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
+>>> a.sort(CmpAttr("spam")) # sort by the "spam" attribute
+>>> for spam in a:
+...     print spam.spam, spam.eggs, str(spam)
+...
+1 4 5
+4 6 10
+9 3 12
+
+>>> a.sort(CmpAttr("eggs")) # re-sort by the "eggs" attribute
+>>> for spam in a:
+...     print spam.spam, spam.eggs, str(spam)
+...
+9 3 12
+1 4 5
+4 6 10
+\end{verbatim}
+
+Of course, if you want a faster sort you can extract the attributes
+into an intermediate list and sort that list.
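A minimal sketch of that intermediate-list approach might look like the following. The helper name \code{sort_by_attr} is hypothetical, the \class{Spam} class is repeated (without its comparison method) only so the example is self-contained, and the code assumes a Python with list comprehensions and \code{enumerate()}.

```python
class Spam:
    def __init__(self, spam, eggs):
        self.spam = spam
        self.eggs = eggs

def sort_by_attr(objects, attr):
    # Extract the attribute once per object, paired with the object's index.
    # The index breaks ties, so the objects themselves are never compared.
    keyed = [(getattr(obj, attr), i) for i, obj in enumerate(objects)]
    keyed.sort()
    # Rebuild the list of objects in sorted order.
    return [objects[i] for _, i in keyed]

a = [Spam(1, 4), Spam(9, 3), Spam(4, 6)]
by_eggs = sort_by_attr(a, "eggs")
```

No comparison function is called during the sort itself; the price is the extra temporary list.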
+ + +So, there you have it; about a half-dozen different ways to define how +to sort a list: +\begin{itemize} + \item sort using the default method + \item sort using a comparison function + \item reverse sort not using a comparison function + \item sort on an intermediate list (two forms) + \item sort using class defined __cmp__ method + \item sort using a sort function object +\end{itemize} + +\end{document} +% LocalWords: maxint diff --git a/Doc/howto/unicode.rst b/Doc/howto/unicode.rst new file mode 100644 index 0000000..7ad61c1 --- /dev/null +++ b/Doc/howto/unicode.rst @@ -0,0 +1,765 @@ +Unicode HOWTO +================ + +**Version 1.02** + +This HOWTO discusses Python's support for Unicode, and explains various +problems that people commonly encounter when trying to work with Unicode. + +Introduction to Unicode +------------------------------ + +History of Character Codes +'''''''''''''''''''''''''''''' + +In 1968, the American Standard Code for Information Interchange, +better known by its acronym ASCII, was standardized. ASCII defined +numeric codes for various characters, with the numeric values running from 0 to +127. For example, the lowercase letter 'a' is assigned 97 as its code +value. + +ASCII was an American-developed standard, so it only defined +unaccented characters. There was an 'e', but no 'é' or 'Í'. This +meant that languages which required accented characters couldn't be +faithfully represented in ASCII. (Actually the missing accents matter +for English, too, which contains words such as 'naïve' and 'café', and some +publications have house styles which require spellings such as +'coöperate'.) + +For a while people just wrote programs that didn't display accents. I +remember looking at Apple ][ BASIC programs, published in French-language +publications in the mid-1980s, that had lines like these:: + + PRINT "FICHER EST COMPLETE." + PRINT "CARACTERE NON ACCEPTE." 
+
+Those messages should contain accents, and they just look wrong to
+someone who can read French.
+
+In the 1980s, almost all personal computers were 8-bit, meaning that
+bytes could hold values ranging from 0 to 255. ASCII codes only went
+up to 127, so some machines assigned values between 128 and 255 to
+accented characters. Different machines had different codes, however,
+which led to problems exchanging files. Eventually various commonly
+used sets of values for the 128-255 range emerged. Some were true
+standards, defined by the International Standards Organization, and
+some were **de facto** conventions that were invented by one company
+or another and managed to catch on.
+
+255 characters aren't very many. For example, you can't fit
+both the accented characters used in Western Europe and the Cyrillic
+alphabet used for Russian into the 128-255 range because there are more than
+128 such characters.
+
+You could write files using different codes (all your Russian
+files in a coding system called KOI8, all your French files in
+a different coding system called Latin1), but what if you wanted
+to write a French document that quotes some Russian text? In the
+1980s people began to want to solve this problem, and the Unicode
+standardization effort began.
+
+Unicode started out using 16-bit characters instead of 8-bit characters. 16
+bits means you have 2^16 = 65,536 distinct values available, making it
+possible to represent many different characters from many different
+alphabets; an initial goal was to have Unicode contain the alphabets for
+every single human language. It turns out that even 16 bits isn't enough to
+meet that goal, and the modern Unicode specification uses a wider range of
+codes, 0-1,114,111 (0x10ffff in base-16).
+
+There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
+originally separate efforts, but the specifications were merged with
+the 1.1 revision of Unicode.
+ +(This discussion of Unicode's history is highly simplified. I don't +think the average Python programmer needs to worry about the +historical details; consult the Unicode consortium site listed in the +References for more information.) + + +Definitions +'''''''''''''''''''''''' + +A **character** is the smallest possible component of a text. 'A', +'B', 'C', etc., are all different characters. So are 'È' and +'Í'. Characters are abstractions, and vary depending on the +language or context you're talking about. For example, the symbol for +ohms (Ω) is usually drawn much like the capital letter +omega (Ω) in the Greek alphabet (they may even be the same in +some fonts), but these are two different characters that have +different meanings. + +The Unicode standard describes how characters are represented by +**code points**. A code point is an integer value, usually denoted in +base 16. In the standard, a code point is written using the notation +U+12ca to mean the character with value 0x12ca (4810 decimal). The +Unicode standard contains a lot of tables listing characters and their +corresponding code points:: + + 0061 'a'; LATIN SMALL LETTER A + 0062 'b'; LATIN SMALL LETTER B + 0063 'c'; LATIN SMALL LETTER C + ... + 007B '{'; LEFT CURLY BRACKET + +Strictly, these definitions imply that it's meaningless to say 'this is +character U+12ca'. U+12ca is a code point, which represents some particular +character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. +In informal contexts, this distinction between code points and characters will +sometimes be forgotten. + +A character is represented on a screen or on paper by a set of graphical +elements that's called a **glyph**. The glyph for an uppercase A, for +example, is two diagonal strokes and a horizontal stroke, though the exact +details will depend on the font being used. 
Most Python code doesn't need +to worry about glyphs; figuring out the correct glyph to display is +generally the job of a GUI toolkit or a terminal's font renderer. + + +Encodings +''''''''' + +To summarize the previous section: +a Unicode string is a sequence of code points, which are +numbers from 0 to 0x10ffff. This sequence needs to be represented as +a set of bytes (meaning, values from 0-255) in memory. The rules for +translating a Unicode string into a sequence of bytes are called an +**encoding**. + +The first encoding you might think of is an array of 32-bit integers. +In this representation, the string "Python" would look like this:: + + P y t h o n + 0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00 + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 + +This representation is straightforward but using +it presents a number of problems. + +1. It's not portable; different processors order the bytes + differently. + +2. It's very wasteful of space. In most texts, the majority of the code + points are less than 127, or less than 255, so a lot of space is occupied + by zero bytes. The above string takes 24 bytes compared to the 6 + bytes needed for an ASCII representation. Increased RAM usage doesn't + matter too much (desktop computers have megabytes of RAM, and strings + aren't usually that large), but expanding our usage of disk and + network bandwidth by a factor of 4 is intolerable. + +3. It's not compatible with existing C functions such as ``strlen()``, + so a new family of wide string functions would need to be used. + +4. Many Internet standards are defined in terms of textual data, and + can't handle content with embedded zero bytes. + +Generally people don't use this encoding, choosing other encodings +that are more efficient and convenient. + +Encodings don't have to handle every possible Unicode character, and +most encodings don't. For example, Python's default encoding is the +'ascii' encoding. 
The rules for converting a Unicode string into the
+ASCII encoding are simple; for each code point:
+
+1. If the code point is <128, each byte is the same as the value of the
+   code point.
+
+2. If the code point is 128 or greater, the Unicode string can't
+   be represented in this encoding. (Python raises a
+   ``UnicodeEncodeError`` exception in this case.)
+
+Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode
+code points 0-255 are identical to the Latin-1 values, so converting
+to this encoding simply requires converting code points to byte
+values; if a code point larger than 255 is encountered, the string
+can't be encoded into Latin-1.
+
+Encodings don't have to be simple one-to-one mappings like Latin-1.
+Consider IBM's EBCDIC, which was used on IBM mainframes. Letter
+values weren't in one block: 'a' through 'i' had values from 129 to
+137, but 'j' through 'r' were 145 through 153. If you wanted to use
+EBCDIC as an encoding, you'd probably use some sort of lookup table to
+perform the conversion, but this is largely an internal detail.
+
+UTF-8 is one of the most commonly used encodings. UTF stands for
+"Unicode Transformation Format", and the '8' means that 8-bit numbers
+are used in the encoding. (There's also a UTF-16 encoding, but it's
+less frequently used than UTF-8.) UTF-8 uses the following rules:
+
+1. If the code point is <128, it's represented by the corresponding byte value.
+2. If the code point is between 128 and 0x7ff, it's turned into two byte values
+   between 128 and 255.
+3. Code points >0x7ff are turned into three- or four-byte sequences, where
+   each byte of the sequence is between 128 and 255.
+
+UTF-8 has several convenient properties:
+
+1. It can handle any Unicode code point.
+2. A Unicode string is turned into a string of bytes containing no embedded zero bytes.
This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes.
+3. A string of ASCII text is also valid UTF-8 text.
+4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
+5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
+
+
+
+References
+''''''''''''''
+
+The Unicode Consortium site at <http://www.unicode.org> has character
+charts, a glossary, and PDF versions of the Unicode specification. Be
+prepared for some difficult reading.
+<http://www.unicode.org/history/> is a chronology of the origin and
+development of Unicode.
+
+To help understand the standard, Jukka Korpela has written an
+introductory guide to reading the Unicode character tables,
+available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
+
+Roman Czyborra wrote another explanation of Unicode's basic principles;
+it's at <http://czyborra.com/unicode/characters.html>.
+Czyborra has written a number of other Unicode-related documents,
+available from <http://www.czyborra.com>.
+
+Two other good introductory articles were written by Joel Spolsky
+<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
+Orendorff <http://www.jorendorff.com/articles/unicode/>. If this
+introduction didn't make things clear to you, you should try reading
+one of these alternate articles before continuing.
+
+Wikipedia entries are often helpful; see the entries for "character
+encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
+<http://en.wikipedia.org/wiki/UTF-8>, for example.
+
+
+Python's Unicode Support
+------------------------
+
+Now that you've learned the rudiments of Unicode, we can look at
+Python's Unicode features.
+
+
+The Unicode Type
+'''''''''''''''''''
+
+Unicode strings are expressed as instances of the ``unicode`` type,
+one of Python's repertoire of built-in types. It derives from an
+abstract type called ``basestring``, which is also an ancestor of the
+``str`` type; you can therefore check if a value is a string type with
+``isinstance(value, basestring)``. Under the hood, Python represents
+Unicode strings as either 16- or 32-bit integers, depending on how the
+Python interpreter was compiled.
+
+The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
+All of its arguments should be 8-bit strings. The first argument is converted
+to Unicode using the specified encoding; if you leave off the ``encoding`` argument,
+the ASCII encoding is used for the conversion, so characters greater than 127 will
+be treated as errors::
+
+    >>> unicode('abcdef')
+    u'abcdef'
+    >>> s = unicode('abcdef')
+    >>> type(s)
+    <type 'unicode'>
+    >>> unicode('abcdef' + chr(255))
+    Traceback (most recent call last):
+      File "<stdin>", line 1, in ?
+    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
+                        ordinal not in range(128)
+
+The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument
+are 'strict' (raise a ``UnicodeDecodeError`` exception),
+'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
+or 'ignore' (just leave the character out of the Unicode result).
+The following examples show the differences::
+
+    >>> unicode('\x80abc', errors='strict')
+    Traceback (most recent call last):
+      File "<stdin>", line 1, in ?
+    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
+                        ordinal not in range(128)
+    >>> unicode('\x80abc', errors='replace')
+    u'\ufffdabc'
+    >>> unicode('\x80abc', errors='ignore')
+    u'abc'
+
+Encodings are specified as strings containing the encoding's name.
+Python 2.4 comes with roughly 100 different encodings; see the Python
+Library Reference at
+<http://docs.python.org/lib/standard-encodings.html> for a list. Some
+encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
+and '8859' are all synonyms for the same encoding.
+
+One-character Unicode strings can also be created with the
+``unichr()`` built-in function, which takes integers and returns a
+Unicode string of length 1 that contains the corresponding code point.
+The reverse operation is the built-in ``ord()`` function that takes a
+one-character Unicode string and returns the code point value::
+
+    >>> unichr(40960)
+    u'\ua000'
+    >>> ord(u'\ua000')
+    40960
+
+Instances of the ``unicode`` type have many of the same methods as
+the 8-bit string type for operations such as searching and formatting::
+
+    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
+    >>> s.count('e')
+    5
+    >>> s.find('feather')
+    9
+    >>> s.find('bird')
+    -1
+    >>> s.replace('feather', 'sand')
+    u'Was ever sand so lightly blown to and fro as this multitude?'
+    >>> s.upper()
+    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
+
+Note that the arguments to these methods can be Unicode strings or 8-bit strings.
+8-bit strings will be converted to Unicode before carrying out the operation;
+Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::
+
+    >>> s.find('Was\x9f')
+    Traceback (most recent call last):
+      File "<stdin>", line 1, in ?
+    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
+    >>> s.find(u'Was\x9f')
+    -1
+
+Much Python code that operates on strings will therefore work with
+Unicode strings without requiring any changes to the code. (Input and
+output code needs more updating for Unicode; more on this later.)
+
+Another important method is ``.encode([encoding], [errors='strict'])``,
+which returns an 8-bit string version of the
+Unicode string, encoded in the requested encoding. The ``errors``
+parameter is the same as the parameter of the ``unicode()``
+constructor, with one additional possibility; as well as 'strict',
+'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
+uses XML's character references. The following example shows the
+different results::
+
+    >>> u = unichr(40960) + u'abcd' + unichr(1972)
+    >>> u.encode('utf-8')
+    '\xea\x80\x80abcd\xde\xb4'
+    >>> u.encode('ascii')
+    Traceback (most recent call last):
+      File "<stdin>", line 1, in ?
+    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
+    >>> u.encode('ascii', 'ignore')
+    'abcd'
+    >>> u.encode('ascii', 'replace')
+    '?abcd?'
+    >>> u.encode('ascii', 'xmlcharrefreplace')
+    '&#40960;abcd&#1972;'
+
+Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
+that interprets the string using the given encoding::
+
+    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
+    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
+    >>> type(utf8_version), utf8_version
+    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
+    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
+    >>> u == u2                                      # The two strings match
+    True
+
+The low-level routines for registering and accessing the available
+encodings are found in the ``codecs`` module. However, the encoding
+and decoding functions returned by this module are usually more
+low-level than is comfortable, so I'm not going to describe the
+``codecs`` module here. If you need to implement a completely new
+encoding, you'll need to learn about the ``codecs`` module interfaces,
+but implementing encodings is a specialized task that also won't be
+covered here. Consult the Python documentation to learn more about
+this module.
+
+The most commonly used part of the ``codecs`` module is the
+``codecs.open()`` function which will be discussed in the section
+on input and output.
+
+
+Unicode Literals in Python Source Code
+''''''''''''''''''''''''''''''''''''''''''
+
+In Python source code, Unicode literals are written as strings
+prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific
+code points can be written using the ``\u`` escape sequence, which is
+followed by four hex digits giving the code point. The ``\U`` escape
+sequence is similar, but expects 8 hex digits, not 4.
+
+Unicode literals can also use the same escape sequences as 8-bit
+strings, including ``\x``, but ``\x`` only takes two hex digits so it
+can't express an arbitrary code point. Octal escapes can go up to
+U+01ff, which is octal 777.
+
+::
+
+    >>> s = u"a\xac\u1234\u20ac\U00008000"
+               ^^^^ two-digit hex escape
+                   ^^^^^^ four-digit Unicode escape
+                               ^^^^^^^^^^ eight-digit Unicode escape
+    >>> for c in s:  print ord(c),
+    ...
+    97 172 4660 8364 32768
+
+Using escape sequences for code points greater than 127 is fine in
+small doses, but becomes an annoyance if you're using many accented
+characters, as you would in a program with messages in French or some
+other accent-using language. You can also assemble strings using the
+``unichr()`` built-in function, but this is even more tedious.
+
+Ideally, you'd want to be able to write literals in your language's
+natural encoding. You could then edit Python source code with your
+favorite editor which would display the accented characters naturally,
+and have the right characters used at runtime.
+
+Python supports writing Unicode literals in any encoding, but you have
+to declare the encoding being used.
This is done by including a
+special comment as either the first or second line of the source
+file::
+
+    #!/usr/bin/env python
+    # -*- coding: latin-1 -*-
+
+    u = u'abcdé'
+    print ord(u[-1])
+
+The syntax is inspired by Emacs's notation for specifying variables local to a file.
+Emacs supports many different variables, but Python only supports 'coding'.
+The ``-*-`` symbols indicate that the comment is special; within them,
+you must supply the name ``coding`` and the name of your chosen encoding,
+separated by ``':'``.
+
+If you don't include such a comment, the default encoding used will be
+ASCII. Versions of Python before 2.4 were Euro-centric and assumed
+Latin-1 as a default encoding for string literals; in Python 2.4,
+characters greater than 127 still work but result in a warning. For
+example, the following program has no encoding declaration::
+
+    #!/usr/bin/env python
+    u = u'abcdé'
+    print ord(u[-1])
+
+When you run it with Python 2.4, it will output the following warning::
+
+    amk:~$ python p263.py
+    sys:1: DeprecationWarning: Non-ASCII character '\xe9'
+         in file p263.py on line 2, but no encoding declared;
+         see http://www.python.org/peps/pep-0263.html for details
+
+
+Unicode Properties
+'''''''''''''''''''
+
+The Unicode specification includes a database of information about
+code points. For each code point that's defined, the information
+includes the character's name, its category, and the numeric value if
+applicable (Unicode has characters representing the Roman numerals and
+fractions such as one-third and four-fifths). There are also
+properties related to the code point's use in bidirectional text and
+other display-related properties.
+
+The following program displays some information about several
+characters, and prints the numeric value of one particular character::
+
+    import unicodedata
+
+    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
+
+    for i, c in enumerate(u):
+        print i, '%04x' % ord(c), unicodedata.category(c),
+        print unicodedata.name(c)
+
+    # Get numeric value of second character
+    print unicodedata.numeric(u[1])
+
+When run, this prints::
+
+    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
+    1 0bf2 No TAMIL NUMBER ONE THOUSAND
+    2 0f84 Mn TIBETAN MARK HALANTA
+    3 1770 Lo TAGBANWA LETTER SA
+    4 33af So SQUARE RAD OVER S SQUARED
+    1000.0
+
+The category codes are abbreviations describing the nature of the
+character. These are grouped into categories such as "Letter",
+"Number", "Punctuation", or "Symbol", which in turn are broken up into
+subcategories. To take the codes from the above output, ``'Ll'``
+means "Letter, lowercase", ``'No'`` means "Number, other", ``'Mn'`` is
+"Mark, nonspacing", ``'Lo'`` is "Letter, other", and ``'So'`` is
+"Symbol, other". See
+<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
+for a list of category codes.
+
+References
+''''''''''''''
+
+The Unicode and 8-bit string types are described in the Python library
+reference at <http://docs.python.org/lib/typesseq.html>.
+
+The documentation for the ``unicodedata`` module is at
+<http://docs.python.org/lib/module-unicodedata.html>.
+
+The documentation for the ``codecs`` module is at
+<http://docs.python.org/lib/module-codecs.html>.
+
+Marc-André Lemburg gave a presentation at EuroPython 2002
+titled "Python and Unicode". A PDF version of his slides
+is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
+and is an excellent overview of the design of Python's Unicode features.
+
+
+Reading and Writing Unicode Data
+----------------------------------------
+
+Once you've written some code that works with Unicode data, the next
+problem is input/output.
How do you get Unicode strings into your
+program, and how do you convert Unicode into a form suitable for
+storage or transmission?
+
+It's possible that you may not need to do anything depending on your
+input sources and output destinations; you should check whether the
+libraries used in your application support Unicode natively. XML
+parsers often return Unicode data, for example. Many relational
+databases also support Unicode-valued columns and can return Unicode
+values from an SQL query.
+
+Unicode data is usually converted to a particular encoding before it
+gets written to disk or sent over a socket. It's possible to do all
+the work yourself: open a file, read an 8-bit string from it, and
+convert the string with ``unicode(str, encoding)``. However, the
+manual approach is not recommended.
+
+One problem is the multi-byte nature of encodings; one Unicode
+character can be represented by several bytes. If you want to read
+the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
+error-handling code to catch the case where only part of the bytes
+encoding a single Unicode character are read at the end of a chunk.
+One solution would be to read the entire file into memory and then
+perform the decoding, but that prevents you from working with files
+that are extremely large; if you need to read a 2 GB file, you need 2 GB
+of RAM. (More, really, since for at least a moment you'd need to have
+both the encoded string and its Unicode version in memory.)
+
+The solution would be to use the low-level decoding interface to catch
+the case of partial coding sequences. The work of implementing this
+has already been done for you: the ``codecs`` module includes a
+version of the ``open()`` function that returns a file-like object
+that assumes the file's contents are in a specified encoding and
+accepts Unicode parameters for methods such as ``.read()`` and
+``.write()``.
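The chunk-boundary problem described above can be reproduced in a few lines. The following is a sketch under stated assumptions rather than code from this document's era: it relies on ``codecs.getincrementaldecoder()``, which appeared in Python 2.5 (newer than the 2.4 release discussed here), and the sample text, the 4-byte chunk size, and the variable names are all invented for illustration. It should also run unchanged on Python 3.

```python
import codecs

text = u'caf\xe9 \u20ac5'            # the accented e takes 2 bytes in UTF-8, the euro sign 3
data = text.encode('utf-8')          # 10 bytes in total

# Pretend the data arrives in fixed-size pieces, as from repeated read(4) calls.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

# Decoding a chunk on its own fails: the first chunk ends in the middle
# of the two-byte sequence for the accented e.
try:
    chunks[0].decode('utf-8')
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True

# An incremental decoder holds back incomplete sequences until the
# remaining bytes arrive in a later chunk.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
for index, chunk in enumerate(chunks):
    is_last = (index == len(chunks) - 1)
    pieces.append(decoder.decode(chunk, is_last))

print(naive_failed)
print(u''.join(pieces) == text)
```

The file-like object returned by the ``codecs`` version of ``open()`` spares you from writing this kind of loop yourself.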
+
+The function's parameters are
+``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``. ``mode`` can be
+``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to the
+regular built-in ``open()`` function; add a ``'+'`` to
+update the file. ``buffering`` is similarly
+parallel to the standard function's parameter.
+``encoding`` is a string giving
+the encoding to use; if it's left as ``None``, a regular Python file
+object that accepts 8-bit strings is returned. Otherwise, a wrapper
+object is returned, and data written to or read from the wrapper
+object will be converted as needed. ``errors`` specifies the action
+for encoding errors and can be one of the usual values of 'strict',
+'ignore', and 'replace'.
+
+Reading Unicode from a file is therefore simple::
+
+    import codecs
+    f = codecs.open('unicode.rst', encoding='utf-8')
+    for line in f:
+        print repr(line)
+
+It's also possible to open files in update mode,
+allowing both reading and writing::
+
+    f = codecs.open('test', encoding='utf-8', mode='w+')
+    f.write(u'\u4500 blah blah blah\n')
+    f.seek(0)
+    print repr(f.readline()[:1])
+    f.close()
+
+Unicode character U+FEFF is used as a byte-order mark (BOM),
+and is often written as the first character of a file in order
+to assist with autodetection of the file's byte ordering.
+Some encodings, such as UTF-16, expect a BOM to be present at
+the start of a file; when such an encoding is used,
+the BOM will be automatically written as the first character
+and will be silently dropped when the file is read. There are
+variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
+for little-endian and big-endian encodings, that specify
+one particular byte ordering and don't
+skip the BOM.
+
+
+Unicode filenames
+'''''''''''''''''''''''''
+
+Most of the operating systems in common use today support filenames
+that contain arbitrary Unicode characters.
Usually this is
+implemented by converting the Unicode string into some encoding that
+varies depending on the system. For example, MacOS X uses UTF-8 while
+Windows uses a configurable encoding; on Windows, Python uses the name
+"mbcs" to refer to whatever the currently configured encoding is. On
+Unix systems, there will only be a filesystem encoding if you've set
+the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
+the default encoding is ASCII.
+
+The ``sys.getfilesystemencoding()`` function returns the encoding to
+use on your current system, in case you want to do the encoding
+manually, but there's not much reason to bother. When opening a file
+for reading or writing, you can usually just provide the Unicode
+string as the filename, and it will be automatically converted to the
+right encoding for you::
+
+    filename = u'filename\u4500abc'
+    f = open(filename, 'w')
+    f.write('blah\n')
+    f.close()
+
+Functions in the ``os`` module such as ``os.stat()`` will also accept
+Unicode filenames.
+
+``os.listdir()``, which returns filenames, raises an issue: should it
+return the Unicode version of filenames, or should it return 8-bit
+strings containing the encoded versions? ``os.listdir()`` will do
+both, depending on whether you provided the directory path as an 8-bit
+string or a Unicode string. If you pass a Unicode string as the path,
+filenames will be decoded using the filesystem's encoding and a list
+of Unicode strings will be returned, while passing an 8-bit path will
+return the 8-bit versions of the filenames. For example, assuming the
+default filesystem encoding is UTF-8, running the following program::
+
+    fn = u'filename\u4500abc'
+    f = open(fn, 'w')
+    f.close()
+
+    import os
+    print os.listdir('.')
+    print os.listdir(u'.')
+
+will produce the following output::
+
+    amk:~$ python t.py
+    ['.svn', 'filename\xe4\x94\x80abc', ...]
+    [u'.svn', u'filename\u4500abc', ...]
+
+The first list contains UTF-8-encoded filenames, and the second list
+contains the Unicode versions.
+
+
+
+Tips for Writing Unicode-aware Programs
+''''''''''''''''''''''''''''''''''''''''''''
+
+This section provides some suggestions on writing software that
+deals with Unicode.
+
+The most important tip is:
+
+    Software should only work with Unicode strings internally,
+    converting to a particular encoding on output.
+
+If you attempt to write processing functions that accept both
+Unicode and 8-bit strings, you will find your program vulnerable to
+bugs wherever you combine the two different kinds of strings. Python's
+default encoding is ASCII, so whenever an 8-bit string containing a
+byte with a value >127 is combined with a Unicode string, you'll get
+a ``UnicodeDecodeError`` because that byte can't be handled by the
+ASCII encoding.
+
+It's easy to miss such problems if you only test your software
+with data that doesn't contain any
+accents; everything will seem to work, but there's actually a bug in your
+program waiting for the first user who attempts to use characters >127.
+A second tip, therefore, is:
+
+    Include characters >127 and, even better, characters >255 in your
+    test data.
+
+When using data coming from a web browser or some other untrusted source,
+a common technique is to check for illegal characters in a string
+before using the string in a generated command line or storing it in a
+database. If you're doing this, be careful to check
+the string once it's in the form that will be used or stored; it's
+possible for encodings to be used to disguise characters. This is especially
+true if the input data also specifies the encoding;
+many encodings leave the commonly checked-for characters alone,
+but Python includes some encodings such as ``'base64'``
+that modify every single character.
+
+For example, let's say you have a content management system that takes a
+Unicode filename, and you want to disallow paths with a '/' character.
+You might write this code::
+
+    def read_file(filename, encoding):
+        if '/' in filename:
+            raise ValueError("'/' not allowed in filenames")
+        unicode_name = filename.decode(encoding)
+        f = open(unicode_name, 'r')
+        # ... return contents of file ...
+
+However, if an attacker could specify the ``'base64'`` encoding,
+they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
+encoded form of the string ``'/etc/passwd'``, to read a
+system file. The above code looks for ``'/'`` characters
+in the encoded form and misses the dangerous character
+in the resulting decoded form.
+
+References
+''''''''''''''
+
+The PDF slides for Marc-André Lemburg's presentation "Writing
+Unicode-aware Applications in Python" are available at
+<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
+and discuss questions of character encodings as well as how to
+internationalize and localize an application.
+
+
+Revision History and Acknowledgements
+------------------------------------------
+
+Thanks to the following people who have noted errors or offered
+suggestions on this article: Nicholas Bastin,
+Marius Gedminas, Kent Johnson, Ken Krugler,
+Marc-André Lemburg, Martin von Löwis.
+
+Version 1.0: posted August 5 2005.
+
+Version 1.01: posted August 7 2005. Corrects factual and markup
+errors; adds several links.
+
+Version 1.02: posted August 16 2005. Corrects factual errors.
+
+
+.. comment Additional topic: building Python w/ UCS2 or UCS4 support
+.. comment Describe obscure -U switch somewhere?
+
+.. 
comment
+   Original outline:
+
+   - [ ] Unicode introduction
+       - [ ] ASCII
+       - [ ] Terms
+           - [ ] Character
+           - [ ] Code point
+       - [ ] Encodings
+           - [ ] Common encodings: ASCII, Latin-1, UTF-8
+   - [ ] Unicode Python type
+       - [ ] Writing unicode literals
+           - [ ] Obscurity: -U switch
+       - [ ] Built-ins
+           - [ ] unichr()
+           - [ ] ord()
+           - [ ] unicode() constructor
+       - [ ] Unicode type
+           - [ ] encode(), decode() methods
+   - [ ] Unicodedata module for character properties
+   - [ ] I/O
+       - [ ] Reading/writing Unicode data into files
+           - [ ] Byte-order marks
+       - [ ] Unicode filenames
+   - [ ] Writing Unicode programs
+       - [ ] Do everything in Unicode
+       - [ ] Declaring source code encodings (PEP 263)
+   - [ ] Other issues
+       - [ ] Building Python (UCS2, UCS4)