author | Georg Brandl <georg@python.org> | 2007-08-15 14:28:22 (GMT) |
---|---|---|
committer | Georg Brandl <georg@python.org> | 2007-08-15 14:28:22 (GMT) |
commit | 116aa62bf54a39697e25f21d6cf6799f7faa1349 (patch) | |
tree | 8db5729518ed4ca88e26f1e26cc8695151ca3eb3 /Doc/howto | |
parent | 739c01d47b9118d04e5722333f0e6b4d0c8bdd9e (diff) | |
Move the 3k reST doc tree in place.
Diffstat (limited to 'Doc/howto')
-rw-r--r-- | Doc/howto/advocacy.rst | 356 |
-rw-r--r-- | Doc/howto/curses.rst | 434 |
-rw-r--r-- | Doc/howto/doanddont.rst | 308 |
-rw-r--r-- | Doc/howto/functional.rst | 1400 |
-rw-r--r-- | Doc/howto/index.rst | 25 |
-rw-r--r-- | Doc/howto/pythonmac.rst | 202 |
-rw-r--r-- | Doc/howto/regex.rst | 1377 |
-rw-r--r-- | Doc/howto/sockets.rst | 421 |
-rw-r--r-- | Doc/howto/unicode.rst | 732 |
-rw-r--r-- | Doc/howto/urllib2.rst | 578 |
10 files changed, 5833 insertions, 0 deletions
diff --git a/Doc/howto/advocacy.rst b/Doc/howto/advocacy.rst new file mode 100644 index 0000000..1f1754a --- /dev/null +++ b/Doc/howto/advocacy.rst @@ -0,0 +1,356 @@ +************************* + Python Advocacy HOWTO +************************* + +:Author: A.M. Kuchling +:Release: 0.03 + + +.. topic:: Abstract + + It's usually difficult to get your management to accept open source software, + and Python is no exception to this rule. This document discusses reasons to use + Python, strategies for winning acceptance, facts and arguments you can use, and + cases where you *shouldn't* try to use Python. + + +Reasons to Use Python +===================== + +There are several reasons to incorporate a scripting language into your +development process, and this section will discuss them, and why Python has some +properties that make it a particularly good choice. + + +Programmability +--------------- + +Programs are often organized in a modular fashion. Lower-level operations are +grouped together, and called by higher-level functions, which may in turn be +used as basic operations by still further upper levels. + +For example, the lowest level might define a very low-level set of functions for +accessing a hash table. The next level might use hash tables to store the +headers of a mail message, mapping a header name like ``Date`` to a value such +as ``Tue, 13 May 1997 20:00:54 -0400``. A yet higher level may operate on +message objects, without knowing or caring that message headers are stored in a +hash table, and so forth. + +Often, the lowest levels do very simple things; they implement a data structure +such as a binary tree or hash table, or they perform some simple computation, +such as converting a date string to a number. The higher levels then contain +logic connecting these primitive operations. Using the approach, the primitives +can be seen as basic building blocks which are then glued together to produce +the complete product. 
+ +Why is this design approach relevant to Python? Because Python is well suited +to functioning as such a glue language. A common approach is to write a Python +module that implements the lower level operations; for the sake of speed, the +implementation might be in C, Java, or even Fortran. Once the primitives are +available to Python programs, the logic underlying higher level operations is +written in the form of Python code. The high-level logic is then more +understandable, and easier to modify. + +John Ousterhout wrote a paper that explains this idea at greater length, +entitled "Scripting: Higher Level Programming for the 21st Century". I +recommend that you read this paper; see the references for the URL. Ousterhout +is the inventor of the Tcl language, and therefore argues that Tcl should be +used for this purpose; he only briefly refers to other languages such as Python, +Perl, and Lisp/Scheme, but in reality, Ousterhout's argument applies to +scripting languages in general, since you could equally write extensions for any +of the languages mentioned above. + + +Prototyping +----------- + +In *The Mythical Man-Month*, Fredrick Brooks suggests the following rule when +planning software projects: "Plan to throw one away; you will anyway." Brooks +is saying that the first attempt at a software design often turns out to be +wrong; unless the problem is very simple or you're an extremely good designer, +you'll find that new requirements and features become apparent once development +has actually started. If these new requirements can't be cleanly incorporated +into the program's structure, you're presented with two unpleasant choices: +hammer the new features into the program somehow, or scrap everything and write +a new version of the program, taking the new features into account from the +beginning. + +Python provides you with a good environment for quickly developing an initial +prototype. 
That lets you get the overall program structure and logic right, and +you can fine-tune small details in the fast development cycle that Python +provides. Once you're satisfied with the GUI interface or program output, you +can translate the Python code into C++, Fortran, Java, or some other compiled +language. + +Prototyping means you have to be careful not to use too many Python features +that are hard to implement in your other language. Using ``eval()``, or regular +expressions, or the :mod:`pickle` module, means that you're going to need C or +Java libraries for formula evaluation, regular expressions, and serialization, +for example. But it's not hard to avoid such tricky code, and in the end the +translation usually isn't very difficult. The resulting code can be rapidly +debugged, because any serious logical errors will have been removed from the +prototype, leaving only more minor slip-ups in the translation to track down. + +This strategy builds on the earlier discussion of programmability. Using Python +as glue to connect lower-level components has obvious relevance for constructing +prototype systems. In this way Python can help you with development, even if +end users never come in contact with Python code at all. If the performance of +the Python version is adequate and corporate politics allow it, you may not need +to do a translation into C or Java, but it can still be faster to develop a +prototype and then translate it, instead of attempting to produce the final +version immediately. + +One example of this development strategy is Microsoft Merchant Server. Version +1.0 was written in pure Python, by a company that subsequently was purchased by +Microsoft. Version 2.0 began to translate the code into C++, shipping with some +C++code and some Python code. Version 3.0 didn't contain any Python at all; all +the code had been translated into C++. 
Even though the product doesn't contain +a Python interpreter, the Python language has still served a useful purpose by +speeding up development. + +This is a very common use for Python. Past conference papers have also +described this approach for developing high-level numerical algorithms; see +David M. Beazley and Peter S. Lomdahl's paper "Feeding a Large-scale Physics +Application to Python" in the references for a good example. If an algorithm's +basic operations are things like "Take the inverse of this 4000x4000 matrix", +and are implemented in some lower-level language, then Python has almost no +additional performance cost; the extra time required for Python to evaluate an +expression like ``m.invert()`` is dwarfed by the cost of the actual computation. +It's particularly good for applications where seemingly endless tweaking is +required to get things right. GUI interfaces and Web sites are prime examples. + +The Python code is also shorter and faster to write (once you're familiar with +Python), so it's easier to throw it away if you decide your approach was wrong; +if you'd spent two weeks working on it instead of just two hours, you might +waste time trying to patch up what you've got out of a natural reluctance to +admit that those two weeks were wasted. Truthfully, those two weeks haven't +been wasted, since you've learnt something about the problem and the technology +you're using to solve it, but it's human nature to view this as a failure of +some sort. + + +Simplicity and Ease of Understanding +------------------------------------ + +Python is definitely *not* a toy language that's only usable for small tasks. +The language features are general and powerful enough to enable it to be used +for many different purposes. It's useful at the small end, for 10- or 20-line +scripts, but it also scales up to larger systems that contain thousands of lines +of code. + +However, this expressiveness doesn't come at the cost of an obscure or tricky +syntax. 
While Python has some dark corners that can lead to obscure code, there +are relatively few such corners, and proper design can isolate their use to only +a few classes or modules. It's certainly possible to write confusing code by +using too many features with too little concern for clarity, but most Python +code can look a lot like a slightly-formalized version of human-understandable +pseudocode. + +In *The New Hacker's Dictionary*, Eric S. Raymond gives the following definition +for "compact": + +.. epigraph:: + + Compact *adj.* Of a design, describes the valuable property that it can all be + apprehended at once in one's head. This generally means the thing created from + the design can be used with greater facility and fewer errors than an equivalent + tool that is not compact. Compactness does not imply triviality or lack of + power; for example, C is compact and FORTRAN is not, but C is more powerful than + FORTRAN. Designs become non-compact through accreting features and cruft that + don't merge cleanly into the overall design scheme (thus, some fans of Classic C + maintain that ANSI C is no longer compact). + + (From http://www.catb.org/ esr/jargon/html/C/compact.html) + +In this sense of the word, Python is quite compact, because the language has +just a few ideas, which are used in lots of places. Take namespaces, for +example. Import a module with ``import math``, and you create a new namespace +called ``math``. Classes are also namespaces that share many of the properties +of modules, and have a few of their own; for example, you can create instances +of a class. Instances? They're yet another namespace. Namespaces are currently +implemented as Python dictionaries, so they have the same methods as the +standard dictionary data type: .keys() returns all the keys, and so forth. + +This simplicity arises from Python's development history. 
The language syntax +derives from different sources; ABC, a relatively obscure teaching language, is +one primary influence, and Modula-3 is another. (For more information about ABC +and Modula-3, consult their respective Web sites at http://www.cwi.nl/ +steven/abc/ and http://www.m3.org.) Other features have come from C, Icon, +Algol-68, and even Perl. Python hasn't really innovated very much, but instead +has tried to keep the language small and easy to learn, building on ideas that +have been tried in other languages and found useful. + +Simplicity is a virtue that should not be underestimated. It lets you learn the +language more quickly, and then rapidly write code, code that often works the +first time you run it. + + +Java Integration +---------------- + +If you're working with Java, Jython (http://www.jython.org/) is definitely worth +your attention. Jython is a re-implementation of Python in Java that compiles +Python code into Java bytecodes. The resulting environment has very tight, +almost seamless, integration with Java. It's trivial to access Java classes +from Python, and you can write Python classes that subclass Java classes. +Jython can be used for prototyping Java applications in much the same way +CPython is used, and it can also be used for test suites for Java code, or +embedded in a Java application to add scripting capabilities. + + +Arguments and Rebuttals +======================= + +Let's say that you've decided upon Python as the best choice for your +application. How can you convince your management, or your fellow developers, +to use Python? This section lists some common arguments against using Python, +and provides some possible rebuttals. + +**Python is freely available software that doesn't cost anything. How good can +it be?** + +Very good, indeed. These days Linux and Apache, two other pieces of open source +software, are becoming more respected as alternatives to commercial software, +but Python hasn't had all the publicity. 
+ +Python has been around for several years, with many users and developers. +Accordingly, the interpreter has been used by many people, and has gotten most +of the bugs shaken out of it. While bugs are still discovered at intervals, +they're usually either quite obscure (they'd have to be, for no one to have run +into them before) or they involve interfaces to external libraries. The +internals of the language itself are quite stable. + +Having the source code should be viewed as making the software available for +peer review; people can examine the code, suggest (and implement) improvements, +and track down bugs. To find out more about the idea of open source code, along +with arguments and case studies supporting it, go to http://www.opensource.org. + +**Who's going to support it?** + +Python has a sizable community of developers, and the number is still growing. +The Internet community surrounding the language is an active one, and is worth +being considered another one of Python's advantages. Most questions posted to +the comp.lang.python newsgroup are quickly answered by someone. + +Should you need to dig into the source code, you'll find it's clear and +well-organized, so it's not very difficult to write extensions and track down +bugs yourself. If you'd prefer to pay for support, there are companies and +individuals who offer commercial support for Python. + +**Who uses Python for serious work?** + +Lots of people; one interesting thing about Python is the surprising diversity +of applications that it's been used for. People are using Python to: + +* Run Web sites + +* Write GUI interfaces + +* Control number-crunching code on supercomputers + +* Make a commercial application scriptable by embedding the Python interpreter + inside it + +* Process large XML data sets + +* Build test suites for C or Java code + +Whatever your application domain is, there's probably someone who's used Python +for something similar. 
Yet, despite being useable for such high-end +applications, Python's still simple enough to use for little jobs. + +See http://wiki.python.org/moin/OrganizationsUsingPython for a list of some of +the organizations that use Python. + +**What are the restrictions on Python's use?** + +They're practically nonexistent. Consult the :file:`Misc/COPYRIGHT` file in the +source distribution, or http://www.python.org/doc/Copyright.html for the full +language, but it boils down to three conditions. + +* You have to leave the copyright notice on the software; if you don't include + the source code in a product, you have to put the copyright notice in the + supporting documentation. + +* Don't claim that the institutions that have developed Python endorse your + product in any way. + +* If something goes wrong, you can't sue for damages. Practically all software + licences contain this condition. + +Notice that you don't have to provide source code for anything that contains +Python or is built with it. Also, the Python interpreter and accompanying +documentation can be modified and redistributed in any way you like, and you +don't have to pay anyone any licensing fees at all. + +**Why should we use an obscure language like Python instead of well-known +language X?** + +I hope this HOWTO, and the documents listed in the final section, will help +convince you that Python isn't obscure, and has a healthily growing user base. +One word of advice: always present Python's positive advantages, instead of +concentrating on language X's failings. People want to know why a solution is +good, rather than why all the other solutions are bad. So instead of attacking +a competing solution on various grounds, simply show how Python's virtues can +help. + + +Useful Resources +================ + +http://www.pythonology.com/success + The Python Success Stories are a collection of stories from successful users of + Python, with the emphasis on business and corporate users. + +.. 
% \term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}} +.. % The first chapter of \emph{Internet Programming with Python} also +.. % examines some of the reasons for using Python. The book is well worth +.. % buying, but the publishers have made the first chapter available on +.. % the Web. + +http://home.pacbell.net/ouster/scripting.html + John Ousterhout's white paper on scripting is a good argument for the utility of + scripting languages, though naturally enough, he emphasizes Tcl, the language he + developed. Most of the arguments would apply to any scripting language. + +http://www.python.org/workshops/1997-10/proceedings/beazley.html + The authors, David M. Beazley and Peter S. Lomdahl, describe their use of + Python at Los Alamos National Laboratory. It's another good example of how + Python can help get real work done. This quotation from the paper has been + echoed by many people: + + .. epigraph:: + + Originally developed as a large monolithic application for massively parallel + processing systems, we have used Python to transform our application into a + flexible, highly modular, and extremely powerful system for performing + simulation, data analysis, and visualization. In addition, we describe how + Python has solved a number of important problems related to the development, + debugging, deployment, and maintenance of scientific software. + +http://pythonjournal.cognizor.com/pyj1/Everitt-Feit_interview98-V1.html + This interview with Andy Feit, discussing Infoseek's use of Python, can be used + to show that choosing Python didn't introduce any difficulties into a company's + development process, and provided some substantial benefits. + +.. % \term{\url{http://www.python.org/psa/Commercial.html}} +.. % Robin Friedrich wrote this document on how to support Python's use in +.. % commercial projects. 
+ +http://www.python.org/workshops/1997-10/proceedings/stein.ps + For the 6th Python conference, Greg Stein presented a paper that traced Python's + adoption and usage at a startup called eShop, and later at Microsoft. + +http://www.opensource.org + Management may be doubtful of the reliability and usefulness of software that + wasn't written commercially. This site presents arguments that show how open + source software can have considerable advantages over closed-source software. + +http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html + The Linux Advocacy mini-HOWTO was the inspiration for this document, and is also + well worth reading for general suggestions on winning acceptance for a new + technology, such as Linux or Python. In general, you won't make much progress + by simply attacking existing systems and complaining about their inadequacies; + this often ends up looking like unfocused whining. It's much better to point + out some of the many areas where Python is an improvement over other systems. + diff --git a/Doc/howto/curses.rst b/Doc/howto/curses.rst new file mode 100644 index 0000000..e16d07a --- /dev/null +++ b/Doc/howto/curses.rst @@ -0,0 +1,434 @@ +********************************** + Curses Programming with Python +********************************** + +:Author: A.M. Kuchling, Eric S. Raymond +:Release: 2.02 + + +.. topic:: Abstract + + This document describes how to write text-mode programs with Python 2.x, using + the :mod:`curses` extension module to control the display. + + +What is curses? +=============== + +The curses library supplies a terminal-independent screen-painting and +keyboard-handling facility for text-based terminals; such terminals include +VT100s, the Linux console, and the simulated terminal provided by X11 programs +such as xterm and rxvt. Display terminals support various control codes to +perform common operations such as moving the cursor, scrolling the screen, and +erasing areas. 
Different terminals use widely differing codes, and often have +their own minor quirks. + +In a world of X displays, one might ask "why bother"? It's true that +character-cell display terminals are an obsolete technology, but there are +niches in which being able to do fancy things with them are still valuable. One +is on small-footprint or embedded Unixes that don't carry an X server. Another +is for tools like OS installers and kernel configurators that may have to run +before X is available. + +The curses library hides all the details of different terminals, and provides +the programmer with an abstraction of a display, containing multiple +non-overlapping windows. The contents of a window can be changed in various +ways-- adding text, erasing it, changing its appearance--and the curses library +will automagically figure out what control codes need to be sent to the terminal +to produce the right output. + +The curses library was originally written for BSD Unix; the later System V +versions of Unix from AT&T added many enhancements and new functions. BSD curses +is no longer maintained, having been replaced by ncurses, which is an +open-source implementation of the AT&T interface. If you're using an +open-source Unix such as Linux or FreeBSD, your system almost certainly uses +ncurses. Since most current commercial Unix versions are based on System V +code, all the functions described here will probably be available. The older +versions of curses carried by some proprietary Unixes may not support +everything, though. + +No one has made a Windows port of the curses module. On a Windows platform, try +the Console module written by Fredrik Lundh. The Console module provides +cursor-addressable text output, plus full support for mouse and keyboard input, +and is available from http://effbot.org/efflib/console. 
+ + +The Python curses module +------------------------ + +The Python module is a fairly simple wrapper over the C functions provided by +curses; if you're already familiar with curses programming in C, it's really +easy to transfer that knowledge to Python. The biggest difference is that the +Python interface makes things simpler, by merging different C functions such as +:func:`addstr`, :func:`mvaddstr`, :func:`mvwaddstr`, into a single +:meth:`addstr` method. You'll see this covered in more detail later. + +This HOWTO is simply an introduction to writing text-mode programs with curses +and Python. It doesn't attempt to be a complete guide to the curses API; for +that, see the Python library guide's section on ncurses, and the C manual pages +for ncurses. It will, however, give you the basic ideas. + + +Starting and ending a curses application +======================================== + +Before doing anything, curses must be initialized. This is done by calling the +:func:`initscr` function, which will determine the terminal type, send any +required setup codes to the terminal, and create various internal data +structures. If successful, :func:`initscr` returns a window object representing +the entire screen; this is usually called ``stdscr``, after the name of the +corresponding C variable. :: + + import curses + stdscr = curses.initscr() + +Usually curses applications turn off automatic echoing of keys to the screen, in +order to be able to read keys and only display them under certain circumstances. +This requires calling the :func:`noecho` function. :: + + curses.noecho() + +Applications will also commonly need to react to keys instantly, without +requiring the Enter key to be pressed; this is called cbreak mode, as opposed to +the usual buffered input mode. :: + + curses.cbreak() + +Terminals usually return special keys, such as the cursor keys or navigation +keys such as Page Up and Home, as a multibyte escape sequence. 
While you could +write your application to expect such sequences and process them accordingly, +curses can do it for you, returning a special value such as +:const:`curses.KEY_LEFT`. To get curses to do the job, you'll have to enable +keypad mode. :: + + stdscr.keypad(1) + +Terminating a curses application is much easier than starting one. You'll need +to call :: + + curses.nocbreak(); stdscr.keypad(0); curses.echo() + +to reverse the curses-friendly terminal settings. Then call the :func:`endwin` +function to restore the terminal to its original operating mode. :: + + curses.endwin() + +A common problem when debugging a curses application is to get your terminal +messed up when the application dies without restoring the terminal to its +previous state. In Python this commonly happens when your code is buggy and +raises an uncaught exception. Keys are no longer echoed to the screen when +you type them, for example, which makes using the shell difficult. + +In Python you can avoid these complications and make debugging much easier by +importing the module :mod:`curses.wrapper`. It supplies a :func:`wrapper` +function that takes a callable. It does the initializations described above, +and also initializes colors if color support is present. It then runs your +provided callable and finally deinitializes appropriately. The callable is +called inside a try...except clause which catches exceptions, performs curses +deinitialization, and then passes the exception upwards. Thus, your terminal +won't be left in a funny state on exception. + + +Windows and Pads +================ + +Windows are the basic abstraction in curses. A window object represents a +rectangular area of the screen, and supports various methods to display text, +erase it, allow the user to input strings, and so forth. + +The ``stdscr`` object returned by the :func:`initscr` function is a window +object that covers the entire screen. 
Many programs may need only this single +window, but you might wish to divide the screen into smaller windows, in order +to redraw or clear them separately. The :func:`newwin` function creates a new +window of a given size, returning the new window object. :: + + begin_x = 20 ; begin_y = 7 + height = 5 ; width = 40 + win = curses.newwin(height, width, begin_y, begin_x) + +A word about the coordinate system used in curses: coordinates are always passed +in the order *y,x*, and the top-left corner of a window is coordinate (0,0). +This breaks a common convention for handling coordinates, where the *x* +coordinate usually comes first. This is an unfortunate difference from most +other computer applications, but it's been part of curses since it was first +written, and it's too late to change things now. + +When you call a method to display or erase text, the effect doesn't immediately +show up on the display. This is because curses was originally written with slow +300-baud terminal connections in mind; with these terminals, minimizing the time +required to redraw the screen is very important. This lets curses accumulate +changes to the screen, and display them in the most efficient manner. For +example, if your program displays some characters in a window, and then clears +the window, there's no need to send the original characters because they'd never +be visible. + +Accordingly, curses requires that you explicitly tell it to redraw windows, +using the :func:`refresh` method of window objects. In practice, this doesn't +really complicate programming with curses much. Most programs go into a flurry +of activity, and then pause waiting for a keypress or some other action on the +part of the user. All you have to do is to be sure that the screen has been +redrawn before pausing to wait for user input, by simply calling +``stdscr.refresh()`` or the :func:`refresh` method of some other relevant +window. 
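The window, coordinate-order, and refresh points above can be sketched as follows. This is an illustrative sketch and not part of the committed HOWTO: it assumes a current Python 3 interpreter on a Unix-like system, where ``curses.wrapper`` can be called directly, and a terminal of at least 5 rows by 40 columns. The names ``centered_origin`` and ``main`` are hypothetical.

```python
import curses


def centered_origin(screen_h, screen_w, height, width):
    """Return (begin_y, begin_x) that centers a height x width window.

    Hypothetical helper. Note the curses convention: y comes first, then x.
    """
    begin_y = max((screen_h - height) // 2, 0)
    begin_x = max((screen_w - width) // 2, 0)
    return begin_y, begin_x


def main(stdscr):
    # Assumption: the terminal is at least 5 rows by 40 columns.
    height, width = 5, 40
    screen_h, screen_w = stdscr.getmaxyx()  # returned as (rows, columns)
    begin_y, begin_x = centered_origin(screen_h, screen_w, height, width)
    win = curses.newwin(height, width, begin_y, begin_x)

    win.addstr(1, 2, "Press any key to exit")
    # Nothing shows up on the terminal until a refresh is requested.
    stdscr.refresh()
    win.refresh()
    win.getch()  # pause for user input only after the redraw
```

To try it from a real terminal, call ``curses.wrapper(main)``; the wrapper handles initialization and restores the terminal even if ``main`` raises an exception.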
+ +A pad is a special case of a window; it can be larger than the actual display +screen, and only a portion of it displayed at a time. Creating a pad simply +requires the pad's height and width, while refreshing a pad requires giving the +coordinates of the on-screen area where a subsection of the pad will be +displayed. :: + + pad = curses.newpad(100, 100) + # These loops fill the pad with letters; this is + # explained in the next section + for y in range(0, 100): + for x in range(0, 100): + try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 ) + except curses.error: pass + + # Displays a section of the pad in the middle of the screen + pad.refresh( 0,0, 5,5, 20,75) + +The :func:`refresh` call displays a section of the pad in the rectangle +extending from coordinate (5,5) to coordinate (20,75) on the screen; the upper +left corner of the displayed section is coordinate (0,0) on the pad. Beyond +that difference, pads are exactly like ordinary windows and support the same +methods. + +If you have multiple windows and pads on screen there is a more efficient way to +go, which will prevent annoying screen flicker at refresh time. Use the +:meth:`noutrefresh` method of each window to update the data structure +representing the desired state of the screen; then change the physical screen to +match the desired state in one go with the function :func:`doupdate`. The +normal :meth:`refresh` method calls :func:`doupdate` as its last act. + + +Displaying Text +=============== + +From a C programmer's point of view, curses may sometimes look like a twisty +maze of functions, all subtly different. For example, :func:`addstr` displays a +string at the current cursor location in the ``stdscr`` window, while +:func:`mvaddstr` moves to a given y,x coordinate first before displaying the +string. :func:`waddstr` is just like :func:`addstr`, but allows specifying a +window to use, instead of using ``stdscr`` by default. :func:`mvwaddstr` follows +similarly. 
+ +Fortunately the Python interface hides all these details; ``stdscr`` is a window +object like any other, and methods like :func:`addstr` accept multiple argument +forms. Usually there are four different forms. + ++---------------------------------+-----------------------------------------------+ +| Form | Description | ++=================================+===============================================+ +| *str* or *ch* | Display the string *str* or character *ch* at | +| | the current position | ++---------------------------------+-----------------------------------------------+ +| *str* or *ch*, *attr* | Display the string *str* or character *ch*, | +| | using attribute *attr* at the current | +| | position | ++---------------------------------+-----------------------------------------------+ +| *y*, *x*, *str* or *ch* | Move to position *y,x* within the window, and | +| | display *str* or *ch* | ++---------------------------------+-----------------------------------------------+ +| *y*, *x*, *str* or *ch*, *attr* | Move to position *y,x* within the window, and | +| | display *str* or *ch*, using attribute *attr* | ++---------------------------------+-----------------------------------------------+ + +Attributes allow displaying text in highlighted forms, such as in boldface, +underline, reverse code, or in color. They'll be explained in more detail in +the next subsection. + +The :func:`addstr` function takes a Python string as the value to be displayed, +while the :func:`addch` functions take a character, which can be either a Python +string of length 1 or an integer. If it's a string, you're limited to +displaying characters between 0 and 255. SVr4 curses provides constants for +extension characters; these constants are integers greater than 255. For +example, :const:`ACS_PLMINUS` is a +/- symbol, and :const:`ACS_ULCORNER` is the +upper left corner of a box (handy for drawing borders). 
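The argument forms in the table, and the two kinds of value that ``addch`` accepts, might be exercised as in the sketch below. It is an illustration rather than part of the committed HOWTO: ``draw_samples`` is a hypothetical name, the window is assumed to be at least 2 rows by 14 columns, and the ``ACS_*`` constants are only defined once ``initscr()`` has been called (for example through ``curses.wrapper``).

```python
import curses


def draw_samples(win):
    """Show addch/addstr argument forms on an already created window.

    Hypothetical helper; must run after curses has been initialized,
    because the ACS_* constants do not exist before initscr().
    """
    # y, x, ch form with an extension character (an integer above 255).
    win.addch(0, 0, curses.ACS_ULCORNER)
    # y, x, ch form with an ordinary character given as an integer.
    win.addch(0, 1, ord("a"))
    # y, x, ch form with a one-character string.
    win.addch(0, 2, "b")
    # y, x, str form.
    win.addstr(1, 0, "plus/minus: ")
    # ch form without coordinates: drawn at the current cursor position.
    win.addch(curses.ACS_PLMINUS)
    win.refresh()
```

Inside a function passed to ``curses.wrapper``, this could be called as ``draw_samples(stdscr)``.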
+ +Windows remember where the cursor was left after the last operation, so if you +leave out the *y,x* coordinates, the string or character will be displayed +wherever the last operation left off. You can also move the cursor with the +``move(y,x)`` method. Because some terminals always display a flashing cursor, +you may want to ensure that the cursor is positioned in some location where it +won't be distracting; it can be confusing to have the cursor blinking at some +apparently random location. + +If your application doesn't need a blinking cursor at all, you can call +``curs_set(0)`` to make it invisible. Equivalently, and for compatibility with +older curses versions, there's a ``leaveok(bool)`` function. When *bool* is +true, the curses library will attempt to suppress the flashing cursor, and you +won't need to worry about leaving it in odd locations. + + +Attributes and Color +-------------------- + +Characters can be displayed in different ways. Status lines in a text-based +application are commonly shown in reverse video; a text viewer may need to +highlight certain words. curses supports this by allowing you to specify an +attribute for each cell on the screen. + +An attribute is an integer, each bit representing a different attribute. You can +try to display text with multiple attribute bits set, but curses doesn't +guarantee that all the possible combinations are available, or that they're all +visually distinct. That depends on the ability of the terminal being used, so +it's safest to stick to the most commonly available attributes, listed here. 
+ ++----------------------+--------------------------------------+ +| Attribute | Description | ++======================+======================================+ +| :const:`A_BLINK` | Blinking text | ++----------------------+--------------------------------------+ +| :const:`A_BOLD` | Extra bright or bold text | ++----------------------+--------------------------------------+ +| :const:`A_DIM` | Half bright text | ++----------------------+--------------------------------------+ +| :const:`A_REVERSE` | Reverse-video text | ++----------------------+--------------------------------------+ +| :const:`A_STANDOUT` | The best highlighting mode available | ++----------------------+--------------------------------------+ +| :const:`A_UNDERLINE` | Underlined text | ++----------------------+--------------------------------------+ + +So, to display a reverse-video status line on the top line of the screen, you +could code:: + + stdscr.addstr(0, 0, "Current mode: Typing mode", + curses.A_REVERSE) + stdscr.refresh() + +The curses library also supports color on those terminals that provide it. The +most common such terminal is probably the Linux console, followed by color +xterms. + +To use color, you must call the :func:`start_color` function soon after calling +:func:`initscr`, to initialize the default color set (the +:func:`curses.wrapper.wrapper` function does this automatically). Once that's +done, the :func:`has_colors` function returns TRUE if the terminal in use can +actually display color. (Note: curses uses the American spelling 'color', +instead of the Canadian/British spelling 'colour'. If you're used to the +British spelling, you'll have to resign yourself to misspelling it for the sake +of these functions.) + +The curses library maintains a finite number of color pairs, containing a +foreground (or text) color and a background color. 
You can get the attribute +value corresponding to a color pair with the :func:`color_pair` function; this +can be bitwise-OR'ed with other attributes such as :const:`A_REVERSE`, but +again, such combinations are not guaranteed to work on all terminals. + +An example, which displays a line of text using color pair 1:: + + stdscr.addstr( "Pretty text", curses.color_pair(1) ) + stdscr.refresh() + +As I said before, a color pair consists of a foreground and background color. +:func:`start_color` initializes 8 basic colors when it activates color mode. +They are: 0:black, 1:red, 2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and +7:white. The curses module defines named constants for each of these colors: +:const:`curses.COLOR_BLACK`, :const:`curses.COLOR_RED`, and so forth. + +The ``init_pair(n, f, b)`` function changes the definition of color pair *n* to +foreground color *f* and background color *b*. Color pair 0 is hard-wired to white +on black, and cannot be changed. + +Let's put all this together. To change color pair 1 to red text on a white +background, you would call:: + + curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE) + +When you change a color pair, any text already displayed using that color pair +will change to the new colors. You can also display new text in this color +with:: + + stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) ) + +Very fancy terminals can change the definitions of the actual colors to a given +RGB value. This lets you change color 1, which is usually red, to purple or +blue or any other color you like. Unfortunately, the Linux console doesn't +support this, so I'm unable to try it out, and can't provide any examples. You +can check if your terminal can do this by calling :func:`can_change_color`, +which returns TRUE if the capability is there. If you're lucky enough to have +such a talented terminal, consult your system's man pages for more information. 
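The color-pair steps described above can be collected into one small function. This is only a sketch, assuming a current Python 3 in which ``curses.wrapper`` lives directly in the ``curses`` package and calls ``start_color`` for you; ``BASIC_COLORS`` and ``show_alert`` are illustrative names, not part of the curses module.

```python
import curses

# The eight basic colors, looked up from the module's named constants.
BASIC_COLORS = {
    name: getattr(curses, "COLOR_" + name.upper())
    for name in ("black", "red", "green", "yellow",
                 "blue", "magenta", "cyan", "white")
}


def show_alert(stdscr):
    """Draw highlighted text; meant to be run through curses.wrapper()."""
    if curses.has_colors():
        # Redefine color pair 1 as red text on a white background.
        curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
        attr = curses.color_pair(1)
    else:
        # Fall back to an attribute that monochrome terminals support.
        attr = curses.A_REVERSE
    stdscr.addstr(0, 0, "RED ALERT!", attr)
    stdscr.refresh()
    stdscr.getch()  # wait for a key so the text stays visible
```

To try it on a real terminal, call ``curses.wrapper(show_alert)``. The function is deliberately not invoked in the block itself, since it needs an interactive terminal.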
+ + +User Input +========== + +The curses library itself offers only very simple input mechanisms. Python's +support adds a text-input widget that makes up for some of the lack. + +The most common way to get input to a window is to use its :meth:`getch` method. +:meth:`getch` pauses and waits for the user to hit a key, displaying it if +:func:`echo` has been called earlier. You can optionally specify a coordinate +to which the cursor should be moved before pausing. + +It's possible to change this behavior with the method :meth:`nodelay`. After +``nodelay(1)``, :meth:`getch` for the window becomes non-blocking and returns +``curses.ERR`` (a value of -1) when no input is ready. There's also a +:func:`halfdelay` function, which can be used to (in effect) set a timer on each +:meth:`getch`; if no input becomes available within the delay specified as the +argument to :func:`halfdelay` (measured in tenths of a second), curses raises an exception. + +The :meth:`getch` method returns an integer; if it's between 0 and 255, it +represents the ASCII code of the key pressed. Values greater than 255 are +special keys such as Page Up, Home, or the cursor keys. You can compare the +value returned to constants such as :const:`curses.KEY_PPAGE`, +:const:`curses.KEY_HOME`, or :const:`curses.KEY_LEFT`. Usually the main loop of +your program will look something like this:: + + while 1: + c = stdscr.getch() + if c == ord('p'): PrintDocument() + elif c == ord('q'): break # Exit the while() + elif c == curses.KEY_HOME: x = y = 0 + +The :mod:`curses.ascii` module supplies ASCII class membership functions that +take either integer or 1-character-string arguments; these may be useful in +writing more readable tests for your command interpreters. It also supplies +conversion functions that take either integer or 1-character-string arguments +and return the same type. For example, :func:`curses.ascii.ctrl` returns the +control character corresponding to its argument. 
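As a rough illustration of how the :mod:`curses.ascii` helpers can make key tests more readable, the sketch below classifies an integer key code of the kind :meth:`getch` returns. It is written for a current Python 3; ``describe_key`` is a hypothetical helper name, and the labels it returns are arbitrary.

```python
import curses
import curses.ascii


def describe_key(c):
    """Return a short label for an integer key code such as getch() returns."""
    if c == curses.ERR:
        return "no input"            # what a non-blocking getch() reports
    if c > 255:
        return "special key"         # e.g. curses.KEY_HOME, curses.KEY_LEFT
    if curses.ascii.isctrl(c):
        return "control character"
    if curses.ascii.isdigit(c):
        return "digit"
    if curses.ascii.isprint(c):
        return "printable"
    return "other"


# Conversion functions return the same type they were given.
ctrl_from_str = curses.ascii.ctrl("a")        # string in, string out
ctrl_from_int = curses.ascii.ctrl(ord("a"))   # integer in, integer out
```

Because the helper works on plain integers, it can be exercised without initializing the screen, which keeps the key-handling logic separate from the drawing code.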
+ +There's also a method to retrieve an entire string, :meth:`getstr`. It isn't +used very often, because its functionality is quite limited; the only editing +keys available are the backspace key and the Enter key, which terminates the +string. It can optionally be limited to a fixed number of characters. :: + + curses.echo() # Enable echoing of characters + + # Get a 15-character string, with the cursor on the top line + s = stdscr.getstr(0,0, 15) + +The Python :mod:`curses.textpad` module supplies something better. With it, you +can turn a window into a text box that supports an Emacs-like set of +keybindings. Various methods of the :class:`Textbox` class support editing with +input validation and gathering the edit results either with or without trailing +spaces. See the library documentation on :mod:`curses.textpad` for the +details. + + +For More Information +==================== + +This HOWTO didn't cover some advanced topics, such as screen-scraping or +capturing mouse events from an xterm instance. But the Python library page for +the curses modules is now pretty complete. You should browse it next. + +If you're in doubt about the detailed behavior of any of the ncurses entry +points, consult the manual pages for your curses implementation, whether it's +ncurses or a proprietary Unix vendor's. The manual pages will document any +quirks, and provide complete lists of all the functions, attributes, and +:const:`ACS_\*` characters available to you. + +Because the curses API is so large, some functions aren't supported in the +Python interface, not because they're difficult to implement, but because no one +has needed them yet. Feel free to add them and then submit a patch. Also, we +don't yet have support for the menus or panels libraries associated with +ncurses; feel free to add that. + +If you write an interesting little program, feel free to contribute it as +another demo. We can always use more of them! 
+ +The ncurses FAQ: http://dickey.his.com/ncurses/ncurses.faq.html + diff --git a/Doc/howto/doanddont.rst b/Doc/howto/doanddont.rst new file mode 100644 index 0000000..a322c53 --- /dev/null +++ b/Doc/howto/doanddont.rst @@ -0,0 +1,308 @@ +************************************ + Idioms and Anti-Idioms in Python +************************************ + +:Author: Moshe Zadka + +This document is placed in the public domain. + + +.. topic:: Abstract + + This document can be considered a companion to the tutorial. It shows how to use + Python, and even more importantly, how *not* to use Python. + + +Language Constructs You Should Not Use +====================================== + +While Python has relatively few gotchas compared to other languages, it still +has some constructs which are only useful in corner cases, or are plain +dangerous. + + +from module import \* +--------------------- + + +Inside Function Definitions +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +``from module import *`` is *invalid* inside function definitions. While many +versions of Python do not check for the invalidity, it does not make it more +valid, any more than having a smart lawyer makes a man innocent. Do not use it +like that ever. Even in versions where it was accepted, it made the function +execution slower, because the compiler could not be certain which names are +local and which are global. In Python 2.1 this construct causes warnings, and +sometimes even errors. + + +At Module Level +^^^^^^^^^^^^^^^ + +While it is valid to use ``from module import *`` at module level it is usually +a bad idea. For one, this loses an important property Python otherwise has --- +you can know where each toplevel name is defined by a simple "search" function +in your favourite editor. You also open yourself to trouble in the future, if +some module grows additional functions or classes. + +One of the most awful questions asked on the newsgroup is why this code:: + + f = open("www") + f.read() + +does not work. 
Of course, it works just fine (assuming you have a file called +"www".) But it does not work if somewhere in the module, the statement ``from os +import *`` is present. The :mod:`os` module has a function called :func:`open` +which returns an integer. While it is very useful, shadowing builtins is one of +its least useful properties. + +Remember, you can never know for sure what names a module exports, so either +take what you need --- ``from module import name1, name2``, or keep them in the +module and access on a per-need basis --- ``import module;print module.name``. + + +When It Is Just Fine +^^^^^^^^^^^^^^^^^^^^ + +There are situations in which ``from module import *`` is just fine: + +* The interactive prompt. For example, ``from math import *`` makes Python an + amazing scientific calculator. + +* When extending a module in C with a module in Python. + +* When the module advertises itself as ``from import *`` safe. + + +Unadorned :keyword:`exec` and friends +------------------------------------- + +The word "unadorned" refers to the use without an explicit dictionary, in which +case those constructs evaluate code in the *current* environment. This is +dangerous for the same reasons ``from import *`` is dangerous --- it might step +over variables you are counting on and mess up things for the rest of your code. +Simply do not do that. + +Bad examples:: + + >>> for name in sys.argv[1:]: + >>> exec "%s=1" % name + >>> def func(s, **kw): + >>> for var, val in kw.items(): + >>> exec "s.%s=val" % var # invalid! 
+ >>> exec(open("handler.py").read()) + >>> handle() + +Good examples:: + + >>> d = {} + >>> for name in sys.argv[1:]: + >>> d[name] = 1 + >>> def func(s, **kw): + >>> for var, val in kw.items(): + >>> setattr(s, var, val) + >>> d={} + >>> exec(open("handler.py").read(), d, d) + >>> handle = d['handle'] + >>> handle() + + +from module import name1, name2 +------------------------------- + +This is a "don't" which is much weaker than the previous "don't"s but is still +something you should not do if you don't have good reasons to do that. The +reason it is usually a bad idea is that you suddenly have an object which lives +in two separate namespaces. When the binding in one namespace changes, the +binding in the other will not, so there will be a discrepancy between them. This +happens when, for example, one module is reloaded, or changes the definition of +a function at runtime. + +Bad example:: + + # foo.py + a = 1 + + # bar.py + from foo import a + if something(): + a = 2 # danger: foo.a != a + +Good example:: + + # foo.py + a = 1 + + # bar.py + import foo + if something(): + foo.a = 2 + + +except: +------- + +Python has the ``except:`` clause, which catches all exceptions. Since *every* +error in Python raises an exception, this makes many programming errors look +like runtime problems, and hinders the debugging process. + +The following code shows a great example:: + + try: + foo = opne("file") # misspelled "open" + except: + sys.exit("could not open file!") + +The second line triggers a :exc:`NameError` which is caught by the except +clause. The program will exit, and you will have no idea that this has nothing +to do with the readability of ``"file"``. 
+ +The example above is better written :: + + try: + foo = opne("file") # will be changed to "open" as soon as we run it + except IOError: + sys.exit("could not open file") + +There are some situations in which the ``except:`` clause is useful: for +example, in a framework when running callbacks, it is good not to let any +callback disturb the framework. + + +Exceptions +========== + +Exceptions are a useful feature of Python. You should learn to raise them +whenever something unexpected occurs, and catch them only where you can do +something about them. + +The following is a very popular anti-idiom :: + + def get_status(file): + if not os.path.exists(file): + print "file not found" + sys.exit(1) + return open(file).readline() + +Consider the case where the file gets deleted between the time the call to +:func:`os.path.exists` is made and the time :func:`open` is called. That means +the last line will throw an :exc:`IOError`. The same would happen if *file* +exists but has no read permission. Since testing this on a normal machine with +existing and non-existing files makes it seem bug-free, the test +results will look fine, and the code will get shipped. Then an unhandled +:exc:`IOError` escapes to the user, who has to look at the ugly traceback. + +Here is a better way to do it. :: + + def get_status(file): + try: + return open(file).readline() + except (IOError, OSError): + print "file not found" + sys.exit(1) + +In this version, *either* the file gets opened and the line is read (so it +works even on flaky NFS or SMB connections), or the message is printed and the +application aborted. + +Still, :func:`get_status` makes too many assumptions --- that it will only be +used in a short running script, and not, say, in a long running server. 
Sure, +the caller could do something like :: + + try: + status = get_status(log) + except SystemExit: + status = None + +So, try to use as few ``except`` clauses as possible in your code --- those will usually be +a catch-all in the :func:`main`, or inside calls which should always succeed. + +So, the best version is probably :: + + def get_status(file): + return open(file).readline() + +The caller can deal with the exception if it wants (for example, if it tries +several files in a loop), or just let the exception filter upwards to *its* +caller. + +The last version is not very good either --- due to implementation details, the +file would not be closed when an exception is raised until the handler finishes, +and perhaps not at all in non-C implementations (e.g., Jython). :: + + def get_status(file): + fp = open(file) + try: + return fp.readline() + finally: + fp.close() + + +Using the Batteries +=================== + +Every so often, people seem to be writing stuff in the Python library again, +usually poorly. While the occasional module has a poor interface, it is usually +much better to use the rich standard library and data types that come with +Python than inventing your own. + +A useful module very few people know about is :mod:`os.path`. It always has the +correct path arithmetic for your operating system, and will usually be much +better than whatever you come up with yourself. + +Compare:: + + # ugh! + return dir+"/"+file + # better + return os.path.join(dir, file) + +More useful functions in :mod:`os.path`: :func:`basename`, :func:`dirname` and +:func:`splitext`. + +There are also many useful builtin functions people seem not to be aware of for +some reason: :func:`min` and :func:`max` can find the minimum/maximum of any +sequence with comparable semantics, for example, yet many people write their own +:func:`max`/:func:`min`. Another highly useful function is :func:`reduce`. 
A +classical use of :func:`reduce` is something like :: + + import sys, operator + nums = map(float, sys.argv[1:]) + print reduce(operator.add, nums)/len(nums) + +This cute little script prints the average of all numbers given on the command +line. The :func:`reduce` adds up all the numbers, and the rest is just some +pre- and postprocessing. + +On the same note, :func:`float`, :func:`int` and :func:`long` all +accept arguments of type string, and so are suited to parsing --- assuming you +are ready to deal with the :exc:`ValueError` they raise. + + +Using Backslash to Continue Statements +====================================== + +Since Python treats a newline as a statement terminator, and since statements +are often longer than is comfortable to put in one line, many people do:: + + if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \ + calculate_number(10, 20) != forbulate(500, 360): + pass + +You should realize that this is dangerous: a stray space after the ``\`` would +make this line wrong, and stray spaces are notoriously hard to see in editors. +In this case, at least it would be a syntax error, but if the code was:: + + value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \ + + calculate_number(10, 20)*forbulate(500, 360) + +then it would just be subtly wrong. + +It is usually much better to use the implicit continuation inside parentheses. + +This version is bulletproof:: + + value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9] + + calculate_number(10, 20)*forbulate(500, 360)) + diff --git a/Doc/howto/functional.rst b/Doc/howto/functional.rst new file mode 100644 index 0000000..bc12793 --- /dev/null +++ b/Doc/howto/functional.rst @@ -0,0 +1,1400 @@ +******************************** + Functional Programming HOWTO +******************************** + +:Author: \A. M. Kuchling +:Release: 0.30 + +(This is a first draft. Please send comments/error reports/suggestions to +amk@amk.ca. 
This URL is probably not going to be the final location of the +document, so be careful about linking to it -- you may want to add a +disclaimer.) + +In this document, we'll take a tour of Python's features suitable for +implementing programs in a functional style. After an introduction to the +concepts of functional programming, we'll look at language features such as +iterators and generators and relevant library modules such as :mod:`itertools` +and :mod:`functools`. + + +Introduction +============ + +This section explains the basic concept of functional programming; if you're +just interested in learning about Python language features, skip to the next +section. + +Programming languages support decomposing problems in several different ways: + +* Most programming languages are **procedural**: programs are lists of + instructions that tell the computer what to do with the program's input. C, + Pascal, and even Unix shells are procedural languages. + +* In **declarative** languages, you write a specification that describes the + problem to be solved, and the language implementation figures out how to + perform the computation efficiently. SQL is the declarative language you're + most likely to be familiar with; a SQL query describes the data set you want + to retrieve, and the SQL engine decides whether to scan tables or use indexes, + which subclauses should be performed first, etc. + +* **Object-oriented** programs manipulate collections of objects. Objects have + internal state and support methods that query or modify this internal state in + some way. Smalltalk and Java are object-oriented languages. C++ and Python + are languages that support object-oriented programming, but don't force the + use of object-oriented features. + +* **Functional** programming decomposes a problem into a set of functions. + Ideally, functions only take inputs and produce outputs, and don't have any + internal state that affects the output produced for a given input. 
Well-known + functional languages include the ML family (Standard ML, OCaml, and other + variants) and Haskell. + +The designers of some computer languages choose to emphasize one particular +approach to programming. This often makes it difficult to write programs that use a +different approach. Other languages are multi-paradigm languages that support +several different approaches. Lisp, C++, and Python are multi-paradigm; you can +write programs or libraries that are largely procedural, object-oriented, or +functional in all of these languages. In a large program, different sections +might be written using different approaches; the GUI might be object-oriented +while the processing logic is procedural or functional, for example. + +In a functional program, input flows through a set of functions. Each function +operates on its input and produces some output. Functional style frowns upon +functions with side effects that modify internal state or make other changes +that aren't visible in the function's return value. Functions that have no side +effects at all are called **purely functional**. Avoiding side effects means +not using data structures that get updated as a program runs; every function's +output must only depend on its input. + +Some languages are very strict about purity and don't even have assignment +statements such as ``a=3`` or ``c = a + b``, but it's difficult to avoid all +side effects. Printing to the screen or writing to a disk file are side +effects, for example. In Python, a ``print`` statement or a +``time.sleep(1)`` call returns no useful value; both are called only for their +side effects of sending some text to the screen or pausing execution for a +second. + +Python programs written in functional style usually won't go to the extreme of +avoiding all I/O or all assignments; instead, they'll provide a +functional-appearing interface but will use non-functional features internally. 
+For example, the implementation of a function will still use assignments to +local variables, but won't modify global variables or have other side effects. + +Functional programming can be considered the opposite of object-oriented +programming. Objects are little capsules containing some internal state along +with a collection of method calls that let you modify this state, and programs +consist of making the right set of state changes. Functional programming wants +to avoid state changes as much as possible and works with data flowing between +functions. In Python you might combine the two approaches by writing functions +that take and return instances representing objects in your application (e-mail +messages, transactions, etc.). + +Functional design may seem like an odd constraint to work under. Why should you +avoid objects and side effects? There are theoretical and practical advantages +to the functional style: + +* Formal provability. +* Modularity. +* Composability. +* Ease of debugging and testing. + +Formal provability +------------------ + +A theoretical benefit is that it's easier to construct a mathematical proof that +a functional program is correct. + +For a long time researchers have been interested in finding ways to +mathematically prove programs correct. This is different from testing a program +on numerous inputs and concluding that its output is usually correct, or reading +a program's source code and concluding that the code looks right; the goal is +instead a rigorous proof that a program produces the right result for all +possible inputs. + +The technique used to prove programs correct is to write down **invariants**, +properties of the input data and of the program's variables that are always +true. For each line of code, you then show that if invariants X and Y are true +**before** the line is executed, the slightly different invariants X' and Y' are +true **after** the line is executed. 
This continues until you reach the end of +the program, at which point the invariants should match the desired conditions +on the program's output. + +Functional programming's avoidance of assignments arose because assignments are +difficult to handle with this technique; assignments can break invariants that +were true before the assignment without producing any new invariants that can be +propagated onward. + +Unfortunately, proving programs correct is largely impractical and not relevant +to Python software. Even trivial programs require proofs that are several pages +long; the proof of correctness for a moderately complicated program would be +enormous, and few or none of the programs you use daily (the Python interpreter, +your XML parser, your web browser) could be proven correct. Even if you wrote +down or generated a proof, there would then be the question of verifying the +proof; maybe there's an error in it, and you wrongly believe you've proved the +program correct. + +Modularity +---------- + +A more practical benefit of functional programming is that it forces you to +break apart your problem into small pieces. Programs are more modular as a +result. It's easier to specify and write a small function that does one thing +than a large function that performs a complicated transformation. Small +functions are also easier to read and to check for errors. + + +Ease of debugging and testing +----------------------------- + +Testing and debugging a functional-style program is easier. + +Debugging is simplified because functions are generally small and clearly +specified. When a program doesn't work, each function is an interface point +where you can check that the data are correct. You can look at the intermediate +inputs and outputs to quickly isolate the function that's responsible for a bug. + +Testing is easier because each function is a potential subject for a unit test. 
+Functions don't depend on system state that needs to be replicated before +running a test; instead you only have to synthesize the right input and then +check that the output matches expectations. + + + +Composability +------------- + +As you work on a functional-style program, you'll write a number of functions +with varying inputs and outputs. Some of these functions will be unavoidably +specialized to a particular application, but others will be useful in a wide +variety of programs. For example, a function that takes a directory path and +returns all the XML files in the directory, or a function that takes a filename +and returns its contents, can be applied to many different situations. + +Over time you'll form a personal library of utilities. Often you'll assemble +new programs by arranging existing functions in a new configuration and writing +a few functions specialized for the current task. + + + +Iterators +========= + +I'll start by looking at a Python language feature that's an important +foundation for writing functional-style programs: iterators. + +An iterator is an object representing a stream of data; this object returns the +data one element at a time. A Python iterator must support a method called +``next()`` that takes no arguments and always returns the next element of the +stream. If there are no more elements in the stream, ``next()`` must raise the +``StopIteration`` exception. Iterators don't have to be finite, though; it's +perfectly reasonable to write an iterator that produces an infinite stream of +data. + +The built-in :func:`iter` function takes an arbitrary object and tries to return +an iterator that will return the object's contents or elements, raising +:exc:`TypeError` if the object doesn't support iteration. Several of Python's +built-in data types support iteration, the most common being lists and +dictionaries. An object is called an **iterable** object if you can get an +iterator for it. 
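The protocol described above can be sketched with a small hand-written iterator. This sketch assumes a current Python 3, where the method is spelled ``__next__`` and is normally called through the built-in ``next()`` function; ``Evens`` and ``is_iterable`` are illustrative names only.

```python
import itertools


class Evens:
    """An infinite iterator producing 0, 2, 4, ..."""

    def __init__(self):
        self.value = 0

    def __iter__(self):
        # An iterator is its own iterator, so iter(Evens()) works.
        return self

    def __next__(self):
        current = self.value
        self.value += 2
        return current  # never raises StopIteration: the stream is infinite


def is_iterable(obj):
    """Return True if iter() can build an iterator for obj."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


# islice() takes a finite prefix of the infinite stream.
first_three = list(itertools.islice(Evens(), 3))
```

A finite iterator would differ only in raising ``StopIteration`` from ``__next__`` once its data ran out.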
+ +You can experiment with the iteration interface manually:: + + >>> L = [1,2,3] + >>> it = iter(L) + >>> print it + <iterator object at 0x8116870> + >>> it.next() + 1 + >>> it.next() + 2 + >>> it.next() + 3 + >>> it.next() + Traceback (most recent call last): + File "<stdin>", line 1, in ? + StopIteration + >>> + +Python expects iterable objects in several different contexts, the most +important being the ``for`` statement. In the statement ``for X in Y``, Y must +be an iterator or some object for which ``iter()`` can create an iterator. +These two statements are equivalent:: + + for i in iter(obj): + print i + + for i in obj: + print i + +Iterators can be materialized as lists or tuples by using the :func:`list` or +:func:`tuple` constructor functions:: + + >>> L = [1,2,3] + >>> iterator = iter(L) + >>> t = tuple(iterator) + >>> t + (1, 2, 3) + +Sequence unpacking also supports iterators: if you know an iterator will return +N elements, you can unpack them into an N-tuple:: + + >>> L = [1,2,3] + >>> iterator = iter(L) + >>> a,b,c = iterator + >>> a,b,c + (1, 2, 3) + +Built-in functions such as :func:`max` and :func:`min` can take a single +iterator argument and will return the largest or smallest element. The ``"in"`` +and ``"not in"`` operators also support iterators: ``X in iterator`` is true if +X is found in the stream returned by the iterator. You'll run into obvious +problems if the iterator is infinite; ``max()``, ``min()``, and ``"not in"`` +will never return, and if the element X never appears in the stream, the +``"in"`` operator won't return either. + +Note that you can only go forward in an iterator; there's no way to get the +previous element, reset the iterator, or make a copy of it. Iterator objects +can optionally provide these additional capabilities, but the iterator protocol +only specifies the ``next()`` method. 
Functions may therefore consume all of +the iterator's output, and if you need to do something different with the same +stream, you'll have to create a new iterator. + + + +Data Types That Support Iterators +--------------------------------- + +We've already seen how lists and tuples support iterators. In fact, any Python +sequence type, such as strings, will automatically support creation of an +iterator. + +Calling :func:`iter` on a dictionary returns an iterator that will loop over the +dictionary's keys:: + + >>> m = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, + ... 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12} + >>> for key in m: + ... print key, m[key] + Mar 3 + Feb 2 + Aug 8 + Sep 9 + May 5 + Jun 6 + Jul 7 + Jan 1 + Apr 4 + Nov 11 + Dec 12 + Oct 10 + +Note that the order is essentially random, because it's based on the hash +ordering of the objects in the dictionary. + +Applying ``iter()`` to a dictionary always loops over the keys, but dictionaries +have methods that return other iterators. If you want to iterate over keys, +values, or key/value pairs, you can explicitly call the ``iterkeys()``, +``itervalues()``, or ``iteritems()`` methods to get an appropriate iterator. + +The :func:`dict` constructor can accept an iterator that returns a finite stream +of ``(key, value)`` tuples:: + + >>> L = [('Italy', 'Rome'), ('France', 'Paris'), ('US', 'Washington DC')] + >>> dict(iter(L)) + {'Italy': 'Rome', 'US': 'Washington DC', 'France': 'Paris'} + +Files also support iteration by calling the ``readline()`` method until there +are no more lines in the file. This means you can read each line of a file like +this:: + + for line in file: + # do something for each line + ... 
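As a rough illustration of the dictionary and file iteration described above, the following sketch assumes a current Python 3, where the dictionary methods are named ``keys()``, ``values()`` and ``items()`` rather than ``iterkeys()``, ``itervalues()`` and ``iteritems()``. An in-memory ``io.StringIO`` object stands in for a real file, and all variable names are illustrative.

```python
import io

# dict() accepts an iterator of (key, value) tuples.
capitals = dict(iter([("Italy", "Rome"), ("France", "Paris")]))

# Iterating over the dictionary itself yields its keys.
countries = sorted(capitals)

# values() and items() iterate over the values and the (key, value) pairs.
cities = sorted(capitals.values())
pairs = sorted(capitals.items())

# A file-like object yields one line per iteration, newline included.
fake_file = io.StringIO("first line\nsecond line\n")
lines = [line.rstrip("\n") for line in fake_file]
```

The results are sorted here only so that they do not depend on the dictionary's internal ordering.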
+ +Sets can take their contents from an iterable and let you iterate over the set's +elements:: + + S = set((2, 3, 5, 7, 11, 13)) + for i in S: + print i + + + +Generator expressions and list comprehensions +============================================= + +Two common operations on an iterator's output are 1) performing some operation +for every element, 2) selecting a subset of elements that meet some condition. +For example, given a list of strings, you might want to strip off trailing +whitespace from each line or extract all the strings containing a given +substring. + +List comprehensions and generator expressions (short form: "listcomps" and +"genexps") are a concise notation for such operations, borrowed from the +functional programming language Haskell (http://www.haskell.org). You can strip +all the whitespace from a stream of strings with the following code:: + + line_list = [' line 1\n', 'line 2 \n', ...] + + # Generator expression -- returns iterator + stripped_iter = (line.strip() for line in line_list) + + # List comprehension -- returns list + stripped_list = [line.strip() for line in line_list] + +You can select only certain elements by adding an ``"if"`` condition:: + + stripped_list = [line.strip() for line in line_list + if line != ""] + +With a list comprehension, you get back a Python list; ``stripped_list`` is a +list containing the resulting lines, not an iterator. Generator expressions +return an iterator that computes the values as necessary, not needing to +materialize all the values at once. This means that list comprehensions aren't +useful if you're working with iterators that return an infinite stream or a very +large amount of data. Generator expressions are preferable in these situations. + +Generator expressions are surrounded by parentheses ("()") and list +comprehensions are surrounded by square brackets ("[]"). 
Generator expressions +have the form:: + + ( expression for expr in sequence1 + if condition1 + for expr2 in sequence2 + if condition2 + for expr3 in sequence3 ... + if condition3 + for exprN in sequenceN + if conditionN ) + +Again, for a list comprehension only the outside brackets are different (square +brackets instead of parentheses). + +The elements of the generated output will be the successive values of +``expression``. The ``if`` clauses are all optional; if present, ``expression`` +is only evaluated and added to the result when ``condition`` is true. + +Generator expressions always have to be written inside parentheses, but the +parentheses signalling a function call also count. If you want to create an +iterator that will be immediately passed to a function you can write:: + + obj_total = sum(obj.count for obj in list_all_objects()) + +The ``for...in`` clauses contain the sequences to be iterated over. The +sequences do not have to be the same length, because they are iterated over from +left to right, **not** in parallel. For each element in ``sequence1``, +``sequence2`` is looped over from the beginning. ``sequence3`` is then looped +over for each resulting pair of elements from ``sequence1`` and ``sequence2``. + +To put it another way, a list comprehension or generator expression is +equivalent to the following Python code:: + + for expr1 in sequence1: + if not (condition1): + continue # Skip this element + for expr2 in sequence2: + if not (condition2): + continue # Skip this element + ... + for exprN in sequenceN: + if not (conditionN): + continue # Skip this element + + # Output the value of + # the expression. + +This means that when there are multiple ``for...in`` clauses but no ``if`` +clauses, the length of the resulting output will be equal to the product of the +lengths of all the sequences. 
If you have two lists of length 3, the output +list is 9 elements long:: + + seq1 = 'abc' + seq2 = (1,2,3) + >>> [ (x,y) for x in seq1 for y in seq2] + [('a', 1), ('a', 2), ('a', 3), + ('b', 1), ('b', 2), ('b', 3), + ('c', 1), ('c', 2), ('c', 3)] + +To avoid introducing an ambiguity into Python's grammar, if ``expression`` is +creating a tuple, it must be surrounded with parentheses. The first list +comprehension below is a syntax error, while the second one is correct:: + + # Syntax error + [ x,y for x in seq1 for y in seq2] + # Correct + [ (x,y) for x in seq1 for y in seq2] + + +Generators +========== + +Generators are a special class of functions that simplify the task of writing +iterators. Regular functions compute a value and return it, but generators +return an iterator that returns a stream of values. + +You're doubtless familiar with how regular function calls work in Python or C. +When you call a function, it gets a private namespace where its local variables +are created. When the function reaches a ``return`` statement, the local +variables are destroyed and the value is returned to the caller. A later call +to the same function creates a new private namespace and a fresh set of local +variables. But, what if the local variables weren't thrown away on exiting a +function? What if you could later resume the function where it left off? This +is what generators provide; they can be thought of as resumable functions. + +Here's the simplest example of a generator function:: + + def generate_ints(N): + for i in range(N): + yield i + +Any function containing a ``yield`` keyword is a generator function; this is +detected by Python's bytecode compiler which compiles the function specially as +a result. + +When you call a generator function, it doesn't return a single value; instead it +returns a generator object that supports the iterator protocol. 
On executing +the ``yield`` expression, the generator outputs the value of ``i``, similar to a +``return`` statement. The big difference between ``yield`` and a ``return`` +statement is that on reaching a ``yield`` the generator's state of execution is +suspended and local variables are preserved. On the next call to the +generator's ``.next()`` method, the function will resume executing. + +Here's a sample usage of the ``generate_ints()`` generator:: + + >>> gen = generate_ints(3) + >>> gen + <generator object at 0x8117f90> + >>> gen.next() + 0 + >>> gen.next() + 1 + >>> gen.next() + 2 + >>> gen.next() + Traceback (most recent call last): + File "stdin", line 1, in ? + File "stdin", line 2, in generate_ints + StopIteration + +You could equally write ``for i in generate_ints(5)``, or ``a,b,c = +generate_ints(3)``. + +Inside a generator function, the ``return`` statement can only be used without a +value, and signals the end of the procession of values; after executing a +``return`` the generator cannot return any further values. ``return`` with a +value, such as ``return 5``, is a syntax error inside a generator function. The +end of the generator's results can also be indicated by raising +``StopIteration`` manually, or by just letting the flow of execution fall off +the bottom of the function. + +You could achieve the effect of generators manually by writing your own class +and storing all the local variables of the generator as instance variables. For +example, returning a list of integers could be done by setting ``self.count`` to +0, and having the ``next()`` method increment ``self.count`` and return it. +However, for a moderately complicated generator, writing a corresponding class +can be much messier. + +The test suite included with Python's library, ``test_generators.py``, contains +a number of more interesting examples. Here's one generator that implements an +in-order traversal of a tree using generators recursively. 
+ +:: + + # A recursive generator that generates Tree leaves in in-order. + def inorder(t): + if t: + for x in inorder(t.left): + yield x + + yield t.label + + for x in inorder(t.right): + yield x + +Two other examples in ``test_generators.py`` produce solutions for the N-Queens +problem (placing N queens on an NxN chess board so that no queen threatens +another) and the Knight's Tour (finding a route that takes a knight to every +square of an NxN chessboard without visiting any square twice). + + + +Passing values into a generator +------------------------------- + +In Python 2.4 and earlier, generators only produced output. Once a generator's +code was invoked to create an iterator, there was no way to pass any new +information into the function when its execution is resumed. You could hack +together this ability by making the generator look at a global variable or by +passing in some mutable object that callers then modify, but these approaches +are messy. + +In Python 2.5 there's a simple way to pass values into a generator. +:keyword:`yield` became an expression, returning a value that can be assigned to +a variable or otherwise operated on:: + + val = (yield i) + +I recommend that you **always** put parentheses around a ``yield`` expression +when you're doing something with the returned value, as in the above example. +The parentheses aren't always necessary, but it's easier to always add them +instead of having to remember when they're needed. + +(PEP 342 explains the exact rules, which are that a ``yield``-expression must +always be parenthesized except when it occurs at the top-level expression on the +right-hand side of an assignment. This means you can write ``val = yield i`` +but have to use parentheses when there's an operation, as in ``val = (yield i) ++ 12``.) + +Values are sent into a generator by calling its ``send(value)`` method. This +method resumes the generator's code and the ``yield`` expression returns the +specified value. 
If the regular ``next()`` method is called, the ``yield``
+returns ``None``.
+
+Here's a simple counter that increments by 1 and allows changing the value of
+the internal counter.
+
+::
+
+    def counter (maximum):
+        i = 0
+        while i < maximum:
+            val = (yield i)
+            # If value provided, change counter
+            if val is not None:
+                i = val
+            else:
+                i += 1
+
+And here's an example of changing the counter::
+
+    >>> it = counter(10)
+    >>> print it.next()
+    0
+    >>> print it.next()
+    1
+    >>> print it.send(8)
+    8
+    >>> print it.next()
+    9
+    >>> print it.next()
+    Traceback (most recent call last):
+      File "t.py", line 15, in ?
+        print it.next()
+    StopIteration
+
+Because ``yield`` will often be returning ``None``, you should always check for
+this case. Don't just use its value in expressions unless you're sure that the
+``send()`` method will be the only method used to resume your generator function.
+
+In addition to ``send()``, there are two other new methods on generators:
+
+* ``throw(type, value=None, traceback=None)`` is used to raise an exception
+  inside the generator; the exception is raised by the ``yield`` expression
+  where the generator's execution is paused.
+
+* ``close()`` raises a :exc:`GeneratorExit` exception inside the generator to
+  terminate the iteration. On receiving this exception, the generator's code
+  must either raise :exc:`GeneratorExit` or :exc:`StopIteration`; catching the
+  exception and doing anything else is illegal and will trigger a
+  :exc:`RuntimeError`. ``close()`` will also be called by Python's garbage
+  collector when the generator is garbage-collected.
+
+  If you need to run cleanup code when a :exc:`GeneratorExit` occurs, I suggest
+  using a ``try: ... finally:`` suite instead of catching :exc:`GeneratorExit`.
+
+The cumulative effect of these changes is to turn generators from one-way
+producers of information into both producers and consumers.
+
+Generators also become **coroutines**, a more generalized form of subroutines.
+Subroutines are entered at one point and exited at another point (the top of the
+function, and a ``return`` statement), but coroutines can be entered, exited,
+and resumed at many different points (the ``yield`` statements).
+
+
+Built-in functions
+==================
+
+Let's look in more detail at built-in functions often used with iterators.
+
+Two of Python's built-in functions, :func:`map` and :func:`filter`, are somewhat
+obsolete; they duplicate the features of list comprehensions but return actual
+lists instead of iterators.
+
+``map(f, iterA, iterB, ...)`` returns a list containing ``f(iterA[0], iterB[0]),
+f(iterA[1], iterB[1]), f(iterA[2], iterB[2]), ...``.
+
+::
+
+    def upper(s):
+        return s.upper()
+    map(upper, ['sentence', 'fragment']) =>
+      ['SENTENCE', 'FRAGMENT']
+
+    [upper(s) for s in ['sentence', 'fragment']] =>
+      ['SENTENCE', 'FRAGMENT']
+
+As shown above, you can achieve the same effect with a list comprehension. The
+:func:`itertools.imap` function does the same thing but can handle infinite
+iterators; it'll be discussed later, in the section on the :mod:`itertools` module.
+
+``filter(predicate, iter)`` returns a list that contains all the sequence
+elements that meet a certain condition, and is similarly duplicated by list
+comprehensions. A **predicate** is a function that returns the truth value of
+some condition; for use with :func:`filter`, the predicate must take a single
+value.
+
+::
+
+    def is_even(x):
+        return (x % 2) == 0
+
+    filter(is_even, range(10)) =>
+      [0, 2, 4, 6, 8]
+
+This can also be written as a list comprehension::
+
+    >>> [x for x in range(10) if is_even(x)]
+    [0, 2, 4, 6, 8]
+
+:func:`filter` also has a counterpart in the :mod:`itertools` module,
+:func:`itertools.ifilter`, that returns an iterator and can therefore handle
+infinite sequences just as :func:`itertools.imap` can.
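To make the comparison concrete, here is a small sketch that runs ``map()`` and ``filter()`` next to their list-comprehension equivalents. The results are wrapped in ``list()`` as a precaution, so the snippet behaves the same whether the interpreter's ``map()`` and ``filter()`` return lists or iterators; the variable names are invented for the example.

```python
import operator

def upper(s):
    return s.upper()

def is_even(x):
    return (x % 2) == 0

words = ['sentence', 'fragment']

# list() gives a list whether map()/filter() return lists or iterators.
mapped = list(map(upper, words))
mapped_listcomp = [upper(s) for s in words]

filtered = list(filter(is_even, range(10)))
filtered_listcomp = [x for x in range(10) if is_even(x)]

# With several iterables, map() calls f(iterA[0], iterB[0]), and so on.
summed = list(map(operator.add, [5, 6, 5], [1, 2, 3]))
```

Both spellings produce the same values; which one reads better is largely a matter of taste.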
+ +``reduce(func, iter, [initial_value])`` doesn't have a counterpart in the +:mod:`itertools` module because it cumulatively performs an operation on all the +iterable's elements and therefore can't be applied to infinite iterables. +``func`` must be a function that takes two elements and returns a single value. +:func:`reduce` takes the first two elements A and B returned by the iterator and +calculates ``func(A, B)``. It then requests the third element, C, calculates +``func(func(A, B), C)``, combines this result with the fourth element returned, +and continues until the iterable is exhausted. If the iterable returns no +values at all, a :exc:`TypeError` exception is raised. If the initial value is +supplied, it's used as a starting point and ``func(initial_value, A)`` is the +first calculation. + +:: + + import operator + reduce(operator.concat, ['A', 'BB', 'C']) => + 'ABBC' + reduce(operator.concat, []) => + TypeError: reduce() of empty sequence with no initial value + reduce(operator.mul, [1,2,3], 1) => + 6 + reduce(operator.mul, [], 1) => + 1 + +If you use :func:`operator.add` with :func:`reduce`, you'll add up all the +elements of the iterable. This case is so common that there's a special +built-in called :func:`sum` to compute it:: + + reduce(operator.add, [1,2,3,4], 0) => + 10 + sum([1,2,3,4]) => + 10 + sum([]) => + 0 + +For many uses of :func:`reduce`, though, it can be clearer to just write the +obvious :keyword:`for` loop:: + + # Instead of: + product = reduce(operator.mul, [1,2,3], 1) + + # You can write: + product = 1 + for i in [1,2,3]: + product *= i + + +``enumerate(iter)`` counts off the elements in the iterable, returning 2-tuples +containing the count and each element. 
+
+::
+
+    enumerate(['subject', 'verb', 'object']) =>
+      (0, 'subject'), (1, 'verb'), (2, 'object')
+
+:func:`enumerate` is often used when looping through a list and recording the
+indexes at which certain conditions are met::
+
+    f = open('data.txt', 'r')
+    for i, line in enumerate(f):
+        if line.strip() == '':
+            print 'Blank line at line #%i' % i
+
+``sorted(iterable, [cmp=None], [key=None], [reverse=False])`` collects all the
+elements of the iterable into a list, sorts the list, and returns the sorted
+result. The ``cmp``, ``key``, and ``reverse`` arguments are passed through to
+the constructed list's ``.sort()`` method.
+
+::
+
+    import random
+    # Generate 8 random numbers in the range [0, 10000)
+    rand_list = random.sample(range(10000), 8)
+    rand_list =>
+      [769, 7953, 9828, 6431, 8442, 9878, 6213, 2207]
+    sorted(rand_list) =>
+      [769, 2207, 6213, 6431, 7953, 8442, 9828, 9878]
+    sorted(rand_list, reverse=True) =>
+      [9878, 9828, 8442, 7953, 6431, 6213, 2207, 769]
+
+(For a more detailed discussion of sorting, see the Sorting mini-HOWTO in the
+Python wiki at http://wiki.python.org/moin/HowTo/Sorting.)
+
+The ``any(iter)`` and ``all(iter)`` built-ins look at the truth values of an
+iterable's contents. :func:`any` returns True if any element in the iterable is
+a true value, and :func:`all` returns True if all of the elements are true
+values::
+
+    any([0,1,0]) =>
+      True
+    any([0,0,0]) =>
+      False
+    any([1,1,1]) =>
+      True
+    all([0,1,0]) =>
+      False
+    all([0,0,0]) =>
+      False
+    all([1,1,1]) =>
+      True
+
+
+Small functions and the lambda expression
+=========================================
+
+When writing functional-style programs, you'll often need little functions that
+act as predicates or that combine elements in some way.
+
+If there's a Python built-in or a module function that's suitable, you don't
+need to define a new function at all::
+
+    stripped_lines = [line.strip() for line in lines]
+    existing_files = filter(os.path.exists, file_list)
+
+If the function you need doesn't exist, you need to write it. One way to write
+small functions is to use a ``lambda`` expression. ``lambda`` takes a number
+of parameters and an expression combining these parameters, and creates a small
+function that returns the value of the expression::
+
+    lowercase = lambda x: x.lower()
+
+    print_assign = lambda name, value: name + '=' + str(value)
+
+    adder = lambda x, y: x+y
+
+An alternative is to just use the ``def`` statement and define a function in the
+usual way::
+
+    def lowercase(x):
+        return x.lower()
+
+    def print_assign(name, value):
+        return name + '=' + str(value)
+
+    def adder(x,y):
+        return x + y
+
+Which alternative is preferable? That's a style question; my usual course is to
+avoid using ``lambda``.
+
+One reason for my preference is that ``lambda`` is quite limited in the
+functions it can define. The result has to be computable as a single
+expression, which means you can't have multiway ``if... elif... else``
+comparisons or ``try... except`` statements. If you try to do too much in a
+``lambda``, you'll end up with an overly complicated expression that's
+hard to read. Quick, what's the following code doing?
+
+::
+
+    total = reduce(lambda a, b: (0, a[1] + b[1]), items)[1]
+
+You can figure it out, but it takes time to disentangle the expression to figure
+out what's going on.
Using a short nested ``def`` statement makes things a
+little bit better::
+
+    def combine (a, b):
+        return 0, a[1] + b[1]
+
+    total = reduce(combine, items)[1]
+
+But it would be best of all if I had simply used a ``for`` loop::
+
+    total = 0
+    for a, b in items:
+        total += b
+
+Or the :func:`sum` built-in and a generator expression::
+
+    total = sum(b for a,b in items)
+
+Many uses of :func:`reduce` are clearer when written as ``for`` loops.
+
+Fredrik Lundh once suggested the following set of rules for refactoring uses of
+``lambda``:
+
+1) Write a lambda function.
+2) Write a comment explaining what the heck that lambda does.
+3) Study the comment for a while, and think of a name that captures the essence
+   of the comment.
+4) Convert the lambda to a def statement, using that name.
+5) Remove the comment.
+
+I really like these rules, but you're free to disagree that this lambda-free
+style is better.
+
+
+The itertools module
+====================
+
+The :mod:`itertools` module contains a number of commonly-used iterators as well
+as functions for combining several iterators. This section will introduce the
+module's contents by showing small examples.
+
+The module's functions fall into a few broad classes:
+
+* Functions that create a new iterator based on an existing iterator.
+* Functions for treating an iterator's elements as function arguments.
+* Functions for selecting portions of an iterator's output.
+* A function for grouping an iterator's output.
+
+Creating new iterators
+----------------------
+
+``itertools.count(n)`` returns an infinite stream of integers, increasing by 1
+each time. You can optionally supply the starting number, which defaults to 0::
+
+    itertools.count() =>
+      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
+    itertools.count(10) =>
+      10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...
+
+``itertools.cycle(iter)`` saves a copy of the contents of a provided iterable
+and returns a new iterator that returns its elements from first to last.
The
+new iterator will repeat these elements infinitely.
+
+::
+
+    itertools.cycle([1,2,3,4,5]) =>
+      1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...
+
+``itertools.repeat(elem, [n])`` returns the provided element ``n`` times, or
+returns the element endlessly if ``n`` is not provided.
+
+::
+
+    itertools.repeat('abc') =>
+      abc, abc, abc, abc, abc, abc, abc, abc, abc, abc, ...
+    itertools.repeat('abc', 5) =>
+      abc, abc, abc, abc, abc
+
+``itertools.chain(iterA, iterB, ...)`` takes an arbitrary number of iterables as
+input, and returns all the elements of the first iterator, then all the elements
+of the second, and so on, until all of the iterables have been exhausted.
+
+::
+
+    itertools.chain(['a', 'b', 'c'], (1, 2, 3)) =>
+      a, b, c, 1, 2, 3
+
+``itertools.izip(iterA, iterB, ...)`` takes one element from each iterable and
+returns them in a tuple::
+
+    itertools.izip(['a', 'b', 'c'], (1, 2, 3)) =>
+      ('a', 1), ('b', 2), ('c', 3)
+
+It's similar to the built-in :func:`zip` function, but doesn't construct an
+in-memory list and exhaust all the input iterators before returning; instead
+tuples are constructed and returned only when they're requested. (The technical
+term for this behaviour is `lazy evaluation
+<http://en.wikipedia.org/wiki/Lazy_evaluation>`__.)
+
+This iterator is intended to be used with iterables that are all of the same
+length. If the iterables are of different lengths, the resulting stream will be
+the same length as the shortest iterable.
+
+::
+
+    itertools.izip(['a', 'b'], (1, 2, 3)) =>
+      ('a', 1), ('b', 2)
+
+You should avoid doing this, though, because an element may be taken from the
+longer iterators and discarded. This means you can't go on to use the iterators
+further because you risk skipping a discarded element.
+
+``itertools.islice(iter, [start], stop, [step])`` returns a stream that's a
+slice of the iterator. With a single ``stop`` argument, it will return the
+first ``stop`` elements.
If you supply a starting index, you'll get +``stop-start`` elements, and if you supply a value for ``step``, elements will +be skipped accordingly. Unlike Python's string and list slicing, you can't use +negative values for ``start``, ``stop``, or ``step``. + +:: + + itertools.islice(range(10), 8) => + 0, 1, 2, 3, 4, 5, 6, 7 + itertools.islice(range(10), 2, 8) => + 2, 3, 4, 5, 6, 7 + itertools.islice(range(10), 2, 8, 2) => + 2, 4, 6 + +``itertools.tee(iter, [n])`` replicates an iterator; it returns ``n`` +independent iterators that will all return the contents of the source iterator. +If you don't supply a value for ``n``, the default is 2. Replicating iterators +requires saving some of the contents of the source iterator, so this can consume +significant memory if the iterator is large and one of the new iterators is +consumed more than the others. + +:: + + itertools.tee( itertools.count() ) => + iterA, iterB + + where iterA -> + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... + + and iterB -> + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... + + +Calling functions on elements +----------------------------- + +Two functions are used for calling other functions on the contents of an +iterable. + +``itertools.imap(f, iterA, iterB, ...)`` returns a stream containing +``f(iterA[0], iterB[0]), f(iterA[1], iterB[1]), f(iterA[2], iterB[2]), ...``:: + + itertools.imap(operator.add, [5, 6, 5], [1, 2, 3]) => + 6, 8, 8 + +The ``operator`` module contains a set of functions corresponding to Python's +operators. Some examples are ``operator.add(a, b)`` (adds two values), +``operator.ne(a, b)`` (same as ``a!=b``), and ``operator.attrgetter('id')`` +(returns a callable that fetches the ``"id"`` attribute). 
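The three ``operator`` functions named above can be tried out with a short sketch like the one below. The ``Record`` class is hypothetical and exists only so that ``attrgetter('id')`` has an attribute to fetch; the other names are likewise made up for the illustration.

```python
import operator

class Record(object):
    """Hypothetical class, defined only for this illustration."""
    def __init__(self, id):
        self.id = id

total = operator.add(2, 3)        # same as 2 + 3
different = operator.ne(2, 3)     # same as 2 != 3

# attrgetter('id') returns a callable that fetches the "id" attribute.
get_id = operator.attrgetter('id')
records = [Record(7), Record(3), Record(5)]
ids = [get_id(r) for r in records]

# The same callable works as a sort key.
sorted_ids = [r.id for r in sorted(records, key=get_id)]
```

Using ``attrgetter`` as a key function saves writing a one-line helper whose only job is to return ``r.id``.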
+ +``itertools.starmap(func, iter)`` assumes that the iterable will return a stream +of tuples, and calls ``f()`` using these tuples as the arguments:: + + itertools.starmap(os.path.join, + [('/usr', 'bin', 'java'), ('/bin', 'python'), + ('/usr', 'bin', 'perl'),('/usr', 'bin', 'ruby')]) + => + /usr/bin/java, /bin/python, /usr/bin/perl, /usr/bin/ruby + + +Selecting elements +------------------ + +Another group of functions chooses a subset of an iterator's elements based on a +predicate. + +``itertools.ifilter(predicate, iter)`` returns all the elements for which the +predicate returns true:: + + def is_even(x): + return (x % 2) == 0 + + itertools.ifilter(is_even, itertools.count()) => + 0, 2, 4, 6, 8, 10, 12, 14, ... + +``itertools.ifilterfalse(predicate, iter)`` is the opposite, returning all +elements for which the predicate returns false:: + + itertools.ifilterfalse(is_even, itertools.count()) => + 1, 3, 5, 7, 9, 11, 13, 15, ... + +``itertools.takewhile(predicate, iter)`` returns elements for as long as the +predicate returns true. Once the predicate returns false, the iterator will +signal the end of its results. + +:: + + def less_than_10(x): + return (x < 10) + + itertools.takewhile(less_than_10, itertools.count()) => + 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 + + itertools.takewhile(is_even, itertools.count()) => + 0 + +``itertools.dropwhile(predicate, iter)`` discards elements while the predicate +returns true, and then returns the rest of the iterable's results. + +:: + + itertools.dropwhile(less_than_10, itertools.count()) => + 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ... + + itertools.dropwhile(is_even, itertools.count()) => + 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... + + +Grouping elements +----------------- + +The last function I'll discuss, ``itertools.groupby(iter, key_func=None)``, is +the most complicated. ``key_func(elem)`` is a function that can compute a key +value for each element returned by the iterable. 
If you don't supply a key +function, the key is simply each element itself. + +``groupby()`` collects all the consecutive elements from the underlying iterable +that have the same key value, and returns a stream of 2-tuples containing a key +value and an iterator for the elements with that key. + +:: + + city_list = [('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL'), + ('Anchorage', 'AK'), ('Nome', 'AK'), + ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ'), + ... + ] + + def get_state ((city, state)): + return state + + itertools.groupby(city_list, get_state) => + ('AL', iterator-1), + ('AK', iterator-2), + ('AZ', iterator-3), ... + + where + iterator-1 => + ('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL') + iterator-2 => + ('Anchorage', 'AK'), ('Nome', 'AK') + iterator-3 => + ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ') + +``groupby()`` assumes that the underlying iterable's contents will already be +sorted based on the key. Note that the returned iterators also use the +underlying iterable, so you have to consume the results of iterator-1 before +requesting iterator-2 and its corresponding key. + + +The functools module +==================== + +The :mod:`functools` module in Python 2.5 contains some higher-order functions. +A **higher-order function** takes one or more functions as input and returns a +new function. The most useful tool in this module is the +:func:`functools.partial` function. + +For programs written in a functional style, you'll sometimes want to construct +variants of existing functions that have some of the parameters filled in. +Consider a Python function ``f(a, b, c)``; you may wish to create a new function +``g(b, c)`` that's equivalent to ``f(1, b, c)``; you're filling in a value for +one of ``f()``'s parameters. This is called "partial function application". + +The constructor for ``partial`` takes the arguments ``(function, arg1, arg2, +... kwarg1=value1, kwarg2=value2)``. 
The resulting object is callable, so you +can just call it to invoke ``function`` with the filled-in arguments. + +Here's a small but realistic example:: + + import functools + + def log (message, subsystem): + "Write the contents of 'message' to the specified subsystem." + print '%s: %s' % (subsystem, message) + ... + + server_log = functools.partial(log, subsystem='server') + server_log('Unable to open socket') + + +The operator module +------------------- + +The :mod:`operator` module was mentioned earlier. It contains a set of +functions corresponding to Python's operators. These functions are often useful +in functional-style code because they save you from writing trivial functions +that perform a single operation. + +Some of the functions in this module are: + +* Math operations: ``add()``, ``sub()``, ``mul()``, ``div()``, ``floordiv()``, + ``abs()``, ... +* Logical operations: ``not_()``, ``truth()``. +* Bitwise operations: ``and_()``, ``or_()``, ``invert()``. +* Comparisons: ``eq()``, ``ne()``, ``lt()``, ``le()``, ``gt()``, and ``ge()``. +* Object identity: ``is_()``, ``is_not()``. + +Consult the operator module's documentation for a complete list. + + + +The functional module +--------------------- + +Collin Winter's `functional module <http://oakwinter.com/code/functional/>`__ +provides a number of more advanced tools for functional programming. It also +reimplements several Python built-ins, trying to make them more intuitive to +those used to functional programming in other languages. + +This section contains an introduction to some of the most important functions in +``functional``; full documentation can be found at `the project's website +<http://oakwinter.com/code/functional/documentation/>`__. + +``compose(outer, inner, unpack=False)`` + +The ``compose()`` function implements function composition. 
In other words, it +returns a wrapper around the ``outer`` and ``inner`` callables, such that the +return value from ``inner`` is fed directly to ``outer``. That is, + +:: + + >>> def add(a, b): + ... return a + b + ... + >>> def double(a): + ... return 2 * a + ... + >>> compose(double, add)(5, 6) + 22 + +is equivalent to + +:: + + >>> double(add(5, 6)) + 22 + +The ``unpack`` keyword is provided to work around the fact that Python functions +are not always `fully curried <http://en.wikipedia.org/wiki/Currying>`__. By +default, it is expected that the ``inner`` function will return a single object +and that the ``outer`` function will take a single argument. Setting the +``unpack`` argument causes ``compose`` to expect a tuple from ``inner`` which +will be expanded before being passed to ``outer``. Put simply, + +:: + + compose(f, g)(5, 6) + +is equivalent to:: + + f(g(5, 6)) + +while + +:: + + compose(f, g, unpack=True)(5, 6) + +is equivalent to:: + + f(*g(5, 6)) + +Even though ``compose()`` only accepts two functions, it's trivial to build up a +version that will compose any number of functions. We'll use ``reduce()``, +``compose()`` and ``partial()`` (the last of which is provided by both +``functional`` and ``functools``). + +:: + + from functional import compose, partial + + multi_compose = partial(reduce, compose) + + +We can also use ``map()``, ``compose()`` and ``partial()`` to craft a version of +``"".join(...)`` that converts its arguments to string:: + + from functional import compose, partial + + join = compose("".join, partial(map, str)) + + +``flip(func)`` + +``flip()`` wraps the callable in ``func`` and causes it to receive its +non-keyword arguments in reverse order. + +:: + + >>> def triple(a, b, c): + ... return (a, b, c) + ... 
+ >>> triple(5, 6, 7) + (5, 6, 7) + >>> + >>> flipped_triple = flip(triple) + >>> flipped_triple(5, 6, 7) + (7, 6, 5) + +``foldl(func, start, iterable)`` + +``foldl()`` takes a binary function, a starting value (usually some kind of +'zero'), and an iterable. The function is applied to the starting value and the +first element of the list, then the result of that and the second element of the +list, then the result of that and the third element of the list, and so on. + +This means that a call such as:: + + foldl(f, 0, [1, 2, 3]) + +is equivalent to:: + + f(f(f(0, 1), 2), 3) + + +``foldl()`` is roughly equivalent to the following recursive function:: + + def foldl(func, start, seq): + if len(seq) == 0: + return start + + return foldl(func, func(start, seq[0]), seq[1:]) + +Speaking of equivalence, the above ``foldl`` call can be expressed in terms of +the built-in ``reduce`` like so:: + + reduce(f, [1, 2, 3], 0) + + +We can use ``foldl()``, ``operator.concat()`` and ``partial()`` to write a +cleaner, more aesthetically-pleasing version of Python's ``"".join(...)`` +idiom:: + + from functional import foldl, partial + from operator import concat + + join = partial(foldl, concat, "") + + +Revision History and Acknowledgements +===================================== + +The author would like to thank the following people for offering suggestions, +corrections and assistance with various drafts of this article: Ian Bicking, +Nick Coghlan, Nick Efford, Raymond Hettinger, Jim Jewett, Mike Krell, Leandro +Lameiro, Jussi Salmela, Collin Winter, Blake Winton. + +Version 0.1: posted June 30 2006. + +Version 0.11: posted July 1 2006. Typo fixes. + +Version 0.2: posted July 10 2006. Merged genexp and listcomp sections into one. +Typo fixes. + +Version 0.21: Added more references suggested on the tutor mailing list. + +Version 0.30: Adds a section on the ``functional`` module written by Collin +Winter; adds short section on the operator module; a few other edits. 
+ + +References +========== + +General +------- + +**Structure and Interpretation of Computer Programs**, by Harold Abelson and +Gerald Jay Sussman with Julie Sussman. Full text at +http://mitpress.mit.edu/sicp/. In this classic textbook of computer science, +chapters 2 and 3 discuss the use of sequences and streams to organize the data +flow inside a program. The book uses Scheme for its examples, but many of the +design approaches described in these chapters are applicable to functional-style +Python code. + +http://www.defmacro.org/ramblings/fp.html: A general introduction to functional +programming that uses Java examples and has a lengthy historical introduction. + +http://en.wikipedia.org/wiki/Functional_programming: General Wikipedia entry +describing functional programming. + +http://en.wikipedia.org/wiki/Coroutine: Entry for coroutines. + +http://en.wikipedia.org/wiki/Currying: Entry for the concept of currying. + +Python-specific +--------------- + +http://gnosis.cx/TPiP/: The first chapter of David Mertz's book +:title-reference:`Text Processing in Python` discusses functional programming +for text processing, in the section titled "Utilizing Higher-Order Functions in +Text Processing". + +Mertz also wrote a 3-part series of articles on functional programming +for IBM's DeveloperWorks site; see +`part 1 <http://www-128.ibm.com/developerworks/library/l-prog.html>`__, +`part 2 <http://www-128.ibm.com/developerworks/library/l-prog2.html>`__, and +`part 3 <http://www-128.ibm.com/developerworks/linux/library/l-prog3.html>`__. + + +Python documentation +-------------------- + +Documentation for the :mod:`itertools` module. + +Documentation for the :mod:`operator` module. + +:pep:`289`: "Generator Expressions" + +:pep:`342`: "Coroutines via Enhanced Generators" describes the new generator +features in Python 2.5. + +.. comment + + Topics to place + ----------------------------- + + XXX os.walk() + + XXX Need a large example. + + But will an example add much?
I'll post a first draft and see + what the comments say. + +.. comment + + Original outline: + Introduction + Idea of FP + Programs built out of functions + Functions are strictly input-output, no internal state + Opposed to OO programming, where objects have state + + Why FP? + Formal provability + Assignment is difficult to reason about + Not very relevant to Python + Modularity + Small functions that do one thing + Debuggability: + Easy to test due to lack of state + Easy to verify output from intermediate steps + Composability + You assemble a toolbox of functions that can be mixed + + Tackling a problem + Need a significant example + + Iterators + Generators + The itertools module + List comprehensions + Small functions and the lambda statement + Built-in functions + map + filter + reduce + +.. comment + + Handy little function for printing part of an iterator -- used + while writing this document. + + import itertools, sys + def print_iter(it): + elems = list(itertools.islice(it, 10)) + for elem in elems[:-1]: + sys.stdout.write(str(elem)) + sys.stdout.write(', ') + print elems[-1] + + diff --git a/Doc/howto/index.rst b/Doc/howto/index.rst new file mode 100644 index 0000000..e668856 --- /dev/null +++ b/Doc/howto/index.rst @@ -0,0 +1,25 @@ +*************** + Python HOWTOs +*************** + +Python HOWTOs are documents that cover a single, specific topic, +and attempt to cover it fairly completely. Modelled on the Linux +Documentation Project's HOWTO collection, this collection is an +effort to foster documentation that's more detailed than the +Python Library Reference. + +Currently, the HOWTOs are: + +.. toctree:: + :maxdepth: 1 + + advocacy.rst + pythonmac.rst + curses.rst + doanddont.rst + functional.rst + regex.rst + sockets.rst + unicode.rst + urllib2.rst + diff --git a/Doc/howto/pythonmac.rst b/Doc/howto/pythonmac.rst new file mode 100644 index 0000000..7811f37 --- /dev/null +++ b/Doc/howto/pythonmac.rst @@ -0,0 +1,202 @@ + +.. 
_using-on-mac: + +*************************** +Using Python on a Macintosh +*************************** + +:Author: Bob Savage <bobsavage@mac.com> + + +Python on a Macintosh running Mac OS X is in principle very similar to Python on +any other Unix platform, but there are a number of additional features such as +the IDE and the Package Manager that are worth pointing out. + +The Mac-specific modules are documented in :ref:`mac-specific-services`. + +Python on Mac OS 9 or earlier can be quite different from Python on Unix or +Windows, but is beyond the scope of this manual, as that platform is no longer +supported, starting with Python 2.4. See http://www.cwi.nl/~jack/macpython for +installers for the latest 2.3 release for Mac OS 9 and related documentation. + + +.. _getting-osx: + +Getting and Installing MacPython +================================ + +Mac OS X 10.4 comes with Python 2.3 pre-installed by Apple. However, you are +encouraged to install the most recent version of Python from the Python website +(http://www.python.org). A "universal binary" build of Python 2.5, which runs +natively on the Mac's new Intel and legacy PPC CPU's, is available there. + +What you get after installing is a number of things: + +* A :file:`MacPython 2.5` folder in your :file:`Applications` folder. In here + you find IDLE, the development environment that is a standard part of official + Python distributions; PythonLauncher, which handles double-clicking Python + scripts from the Finder; and the "Build Applet" tool, which allows you to + package Python scripts as standalone applications on your system. + +* A framework :file:`/Library/Frameworks/Python.framework`, which includes the + Python executable and libraries. The installer adds this location to your shell + path. To uninstall MacPython, you can simply remove these three things. A + symlink to the Python executable is placed in /usr/local/bin/. 
+ +The Apple-provided build of Python is installed in +:file:`/System/Library/Frameworks/Python.framework` and :file:`/usr/bin/python`, +respectively. You should never modify or delete these, as they are +Apple-controlled and are used by Apple- or third-party software. + +IDLE includes a help menu that allows you to access Python documentation. If you +are completely new to Python you should start reading the tutorial introduction +in that document. + +If you are familiar with Python on other Unix platforms you should read the +section on running Python scripts from the Unix shell. + + +How to run a Python script +-------------------------- + +Your best way to get started with Python on Mac OS X is through the IDLE +integrated development environment, see section :ref:`ide` and use the Help menu +when the IDE is running. + +If you want to run Python scripts from the Terminal window command line or from +the Finder you first need an editor to create your script. Mac OS X comes with a +number of standard Unix command line editors, :program:`vim` and +:program:`emacs` among them. If you want a more Mac-like editor, +:program:`BBEdit` or :program:`TextWrangler` from Bare Bones Software (see +http://www.barebones.com/products/bbedit/index.shtml) are good choices, as is +:program:`TextMate` (see http://macromates.com/). Other editors include +:program:`Gvim` (http://macvim.org) and :program:`Aquamacs` +(http://aquamacs.org). + +To run your script from the Terminal window you must make sure that +:file:`/usr/local/bin` is in your shell search path. + +To run your script from the Finder you have two options: + +* Drag it to :program:`PythonLauncher` + +* Select :program:`PythonLauncher` as the default application to open your + script (or any .py script) through the finder Info window and double-click it. + :program:`PythonLauncher` has various preferences to control how your script is + launched. 
Option-dragging allows you to change these for one invocation, or use + its Preferences menu to change things globally. + + +.. _osx-gui-scripts: + +Running scripts with a GUI +-------------------------- + +With older versions of Python, there is one Mac OS X quirk that you need to be +aware of: programs that talk to the Aqua window manager (in other words, +anything that has a GUI) need to be run in a special way. Use :program:`pythonw` +instead of :program:`python` to start such scripts. + +With Python 2.5, you can use either :program:`python` or :program:`pythonw`. + + +Configuration +------------- + +Python on OS X honors all standard Unix environment variables such as +:envvar:`PYTHONPATH`, but setting these variables for programs started from the +Finder is non-standard as the Finder does not read your :file:`.profile` or +:file:`.cshrc` at startup. You need to create a file +:file:`~/.MacOSX/environment.plist`. See Apple's Technical Document QA1067 for details. + +For more information on installing Python packages in MacPython, see section +:ref:`mac-package-manager`. + + +.. _ide: + +The IDE +======= + +MacPython ships with the standard IDLE development environment. A good +introduction to using IDLE can be found at http://hkn.eecs.berkeley.edu/ +dyoo/python/idle_intro/index.html. + + +.. _mac-package-manager: + +Installing Additional Python Packages +===================================== + +There are several methods to install additional Python packages: + +* http://pythonmac.org/packages/ contains selected compiled packages for Python + 2.5, 2.4, and 2.3. + +* Packages can be installed via the standard Python distutils mode (``python + setup.py install``). + +* Many packages can also be installed via the :program:`setuptools` extension. + + +GUI Programming on the Mac +========================== + +There are several options for building GUI applications on the Mac with Python.
+ +*PyObjC* is a Python binding to Apple's Objective-C/Cocoa framework, which is +the foundation of most modern Mac development. Information on PyObjC is +available from http://pyobjc.sourceforge.net. + +The standard Python GUI toolkit is :mod:`Tkinter`, based on the cross-platform +Tk toolkit (http://www.tcl.tk). An Aqua-native version of Tk is bundled with OS +X by Apple, and the latest version can be downloaded and installed from +http://www.activestate.com; it can also be built from source. + +*wxPython* is another popular cross-platform GUI toolkit that runs natively on +Mac OS X. Packages and documentation are available from http://www.wxpython.org. + +*PyQt* is another popular cross-platform GUI toolkit that runs natively on Mac +OS X. More information can be found at +http://www.riverbankcomputing.co.uk/pyqt/. + + +Distributing Python Applications on the Mac +=========================================== + +The "Build Applet" tool that is placed in the MacPython 2.5 folder is fine for +packaging small Python scripts on your own machine to run as a standard Mac +application. This tool, however, is not robust enough to distribute Python +applications to other users. + +The standard tool for deploying standalone Python applications on the Mac is +:program:`py2app`. More information on installing and using py2app can be found +at http://undefined.org/python/#py2app. + + +Application Scripting +===================== + +Python can also be used to script other Mac applications via Apple's Open +Scripting Architecture (OSA); see http://appscript.sourceforge.net. Appscript is +a high-level, user-friendly Apple event bridge that allows you to control +scriptable Mac OS X applications using ordinary Python scripts. Appscript makes +Python a serious alternative to Apple's own *AppleScript* language for +automating your Mac. 
A related package, *PyOSA*, is an OSA language component +for the Python scripting language, allowing Python code to be executed by any +OSA-enabled application (Script Editor, Mail, iTunes, etc.). PyOSA makes Python +a full peer to AppleScript. + + +Other Resources +=============== + +The MacPython mailing list is an excellent support resource for Python users and +developers on the Mac: + +http://www.python.org/community/sigs/current/pythonmac-sig/ + +Another useful resource is the MacPython wiki: + +http://wiki.python.org/moin/MacPython + diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst new file mode 100644 index 0000000..b200764 --- /dev/null +++ b/Doc/howto/regex.rst @@ -0,0 +1,1377 @@ +**************************** + Regular Expression HOWTO +**************************** + +:Author: A.M. Kuchling +:Release: 0.05 + +.. % TODO: +.. % Document lookbehind assertions +.. % Better way of displaying a RE, a string, and what it matches +.. % Mention optional argument to match.groups() +.. % Unicode (at least a reference) + + +.. topic:: Abstract + + This document is an introductory tutorial to using regular expressions in Python + with the :mod:`re` module. It provides a gentler introduction than the + corresponding section in the Library Reference. + + +Introduction +============ + +The :mod:`re` module was added in Python 1.5, and provides Perl-style regular +expression patterns. Earlier versions of Python came with the :mod:`regex` +module, which provided Emacs-style patterns. The :mod:`regex` module was +removed completely in Python 2.5. + +Regular expressions (called REs, or regexes, or regex patterns) are essentially +a tiny, highly specialized programming language embedded inside Python and made +available through the :mod:`re` module. Using this little language, you specify +the rules for the set of possible strings that you want to match; this set might +contain English sentences, or e-mail addresses, or TeX commands, or anything you +like. 
You can then ask questions such as "Does this string match the pattern?", +or "Is there a match for the pattern anywhere in this string?". You can also +use REs to modify a string or to split it apart in various ways. + +Regular expression patterns are compiled into a series of bytecodes which are +then executed by a matching engine written in C. For advanced use, it may be +necessary to pay careful attention to how the engine will execute a given RE, +and write the RE in a certain way in order to produce bytecode that runs faster. +Optimization isn't covered in this document, because it requires that you have a +good understanding of the matching engine's internals. + +The regular expression language is relatively small and restricted, so not all +possible string processing tasks can be done using regular expressions. There +are also tasks that *can* be done with regular expressions, but the expressions +turn out to be very complicated. In these cases, you may be better off writing +Python code to do the processing; while Python code will be slower than an +elaborate regular expression, it will also probably be more understandable. + + +Simple Patterns +=============== + +We'll start by learning about the simplest possible regular expressions. Since +regular expressions are used to operate on strings, we'll begin with the most +common task: matching characters. + +For a detailed explanation of the computer science underlying regular +expressions (deterministic and non-deterministic finite automata), you can refer +to almost any textbook on writing compilers. + + +Matching Characters +------------------- + +Most letters and characters will simply match themselves. For example, the +regular expression ``test`` will match the string ``test`` exactly. (You can +enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST`` +as well; more about this later.) 
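
As a small illustration of literal matching, the sketch below compiles the RE ``test`` twice, once as-is and once with a case-insensitive flag. It leans on :func:`re.compile` and the ``match()`` method, both of which are only introduced later in this HOWTO, and it assumes that ``re.IGNORECASE`` is the case-insensitive mode alluded to above. The variable names are made up for the example.

```python
import re

# Ordinary characters match themselves, so this pattern only
# matches the exact text "test".
exact = re.compile("test")

# Assumption: re.IGNORECASE is the case-insensitive mode mentioned
# in the text; compilation flags are covered later in the HOWTO.
relaxed = re.compile("test", re.IGNORECASE)

for candidate in ("test", "Test", "TEST"):
    # match() returns None when the pattern does not match.
    exact_hit = exact.match(candidate) is not None
    relaxed_hit = relaxed.match(candidate) is not None
    print("%-4s exact=%s relaxed=%s" % (candidate, exact_hit, relaxed_hit))
```

Only ``test`` should satisfy the plain pattern, while all three spellings should satisfy the case-insensitive one.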
+ +There are exceptions to this rule; some characters are special +:dfn:`metacharacters`, and don't match themselves. Instead, they signal that +some out-of-the-ordinary thing should be matched, or they affect other portions +of the RE by repeating them or changing their meaning. Much of this document is +devoted to discussing various metacharacters and what they do. + +Here's a complete list of the metacharacters; their meanings will be discussed +in the rest of this HOWTO. :: + + . ^ $ * + ? { [ ] \ | ( ) + +The first metacharacters we'll look at are ``[`` and ``]``. They're used for +specifying a character class, which is a set of characters that you wish to +match. Characters can be listed individually, or a range of characters can be +indicated by giving two characters and separating them by a ``'-'``. For +example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this +is the same as ``[a-c]``, which uses a range to express the same set of +characters. If you wanted to match only lowercase letters, your RE would be +``[a-z]``. + +.. % $ + +Metacharacters are not active inside classes. For example, ``[akm$]`` will +match any of the characters ``'a'``, ``'k'``, ``'m'``, or ``'$'``; ``'$'`` is +usually a metacharacter, but inside a character class it's stripped of its +special nature. + +You can match the characters not listed within the class by :dfn:`complementing` +the set. This is indicated by including a ``'^'`` as the first character of the +class; ``'^'`` outside a character class will simply match the ``'^'`` +character. For example, ``[^5]`` will match any character except ``'5'``. + +Perhaps the most important metacharacter is the backslash, ``\``. As in Python +string literals, the backslash can be followed by various characters to signal +various special sequences. 
It's also used to escape all the metacharacters so +you can still match them in patterns; for example, if you need to match a ``[`` +or ``\``, you can precede them with a backslash to remove their special +meaning: ``\[`` or ``\\``. + +Some of the special sequences beginning with ``'\'`` represent predefined sets +of characters that are often useful, such as the set of digits, the set of +letters, or the set of anything that isn't whitespace. The following predefined +special sequences are available: + +``\d`` + Matches any decimal digit; this is equivalent to the class ``[0-9]``. + +``\D`` + Matches any non-digit character; this is equivalent to the class ``[^0-9]``. + +``\s`` + Matches any whitespace character; this is equivalent to the class ``[ + \t\n\r\f\v]``. + +``\S`` + Matches any non-whitespace character; this is equivalent to the class ``[^ + \t\n\r\f\v]``. + +``\w`` + Matches any alphanumeric character; this is equivalent to the class + ``[a-zA-Z0-9_]``. + +``\W`` + Matches any non-alphanumeric character; this is equivalent to the class + ``[^a-zA-Z0-9_]``. + +These sequences can be included inside a character class. For example, +``[\s,.]`` is a character class that will match any whitespace character, or +``','`` or ``'.'``. + +The final metacharacter in this section is ``.``. It matches anything except a +newline character, and there's an alternate mode (``re.DOTALL``) where it will +match even a newline. ``'.'`` is often used where you want to match "any +character". + + +Repeating Things +---------------- + +Being able to match varying sets of characters is the first thing regular +expressions can do that isn't already possible with the methods available on +strings. However, if that was the only additional capability of regexes, they +wouldn't be much of an advance. Another capability is that you can specify that +portions of the RE must be repeated a certain number of times. 
+ +The first metacharacter for repeating things that we'll look at is ``*``. ``*`` +doesn't match the literal character ``*``; instead, it specifies that the +previous character can be matched zero or more times, instead of exactly once. + +For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``), +``caaat`` (3 ``a`` characters), and so forth. The RE engine has various +internal limitations stemming from the size of C's ``int`` type that will +prevent it from matching over 2 billion ``a`` characters; you probably don't +have enough memory to construct a string that large, so you shouldn't run into +that limit. + +Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching +engine will try to repeat it as many times as possible. If later portions of the +pattern don't match, the matching engine will then back up and try again with +fewer repetitions. + +A step-by-step example will make this more obvious. Let's consider the +expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters +from the class ``[bcd]``, and finally ends with a ``'b'``. Now imagine matching +this RE against the string ``abcbd``. + ++------+-----------+---------------------------------+ +| Step | Matched | Explanation | ++======+===========+=================================+ +| 1 | ``a`` | The ``a`` in the RE matches. | ++------+-----------+---------------------------------+ +| 2 | ``abcbd`` | The engine matches ``[bcd]*``, | +| | | going as far as it can, which | +| | | is to the end of the string. | ++------+-----------+---------------------------------+ +| 3 | *Failure* | The engine tries to match | +| | | ``b``, but the current position | +| | | is at the end of the string, so | +| | | it fails. | ++------+-----------+---------------------------------+ +| 4 | ``abcb`` | Back up, so that ``[bcd]*`` | +| | | matches one less character.
| ++------+-----------+---------------------------------+ +| 5 | *Failure* | Try ``b`` again, but the | +| | | current position is at the last | +| | | character, which is a ``'d'``. | ++------+-----------+---------------------------------+ +| 6 | ``abc`` | Back up again, so that | +| | | ``[bcd]*`` is only matching | +| | | ``bc``. | ++------+-----------+---------------------------------+ +| 7 | ``abcb`` | Try ``b`` again. This time | +| | | the character at the | +| | | current position is ``'b'``, so | +| | | it succeeds. | ++------+-----------+---------------------------------+ + +The end of the RE has now been reached, and it has matched ``abcb``. This +demonstrates how the matching engine goes as far as it can at first, and if no +match is found it will then progressively back up and retry the rest of the RE +again and again. It will back up until it has tried zero matches for +``[bcd]*``, and if that subsequently fails, the engine will conclude that the +string doesn't match the RE at all. + +Another repeating metacharacter is ``+``, which matches one or more times. Pay +careful attention to the difference between ``*`` and ``+``; ``*`` matches +*zero* or more times, so whatever's being repeated may not be present at all, +while ``+`` requires at least *one* occurrence. To use a similar example, +``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match +``ct``. + +There are two more repeating qualifiers. The question mark character, ``?``, +matches either once or zero times; you can think of it as marking something as +being optional. For example, ``home-?brew`` matches either ``homebrew`` or +``home-brew``. + +The most complicated repeated qualifier is ``{m,n}``, where *m* and *n* are +decimal integers. This qualifier means there must be at least *m* repetitions, +and at most *n*. For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and +``a///b``. It won't match ``ab``, which has no slashes, or ``a////b``, which +has four.
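
The repetition qualifiers discussed so far can be checked with a short script. This is only a sketch: ``matched()`` is a hypothetical helper written for this example, not part of the :mod:`re` module, and it relies on :func:`re.match`, which is described later in this HOWTO.

```python
import re

def matched(pattern, text):
    """Hypothetical helper: return the text matched at the start of
    `text`, or None if the pattern does not match there."""
    m = re.match(pattern, text)
    return m.group() if m else None

# '*' allows zero or more repetitions; '+' requires at least one.
star = [matched(r"ca*t", s) for s in ("ct", "cat", "caaat")]
plus = [matched(r"ca+t", s) for s in ("ct", "cat", "caaat")]

# '?' marks the hyphen as optional.
optional = [matched(r"home-?brew", s) for s in ("homebrew", "home-brew")]

# '{1,3}' accepts between one and three slashes.
counted = [matched(r"a/{1,3}b", s) for s in ("ab", "a/b", "a///b", "a////b")]

# Greedy matching with backtracking, as in the step-by-step table.
greedy = matched(r"a[bcd]*b", "abcbd")

print(star)
print(plus)
print(optional)
print(counted)
print(greedy)
```

The last result should be ``abcb``, the same answer the step-by-step table arrives at by backing up from the end of the string.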
+ +You can omit either *m* or *n*; in that case, a reasonable value is assumed for +the missing value. Omitting *m* is interpreted as a lower limit of 0, while +omitting *n* results in an upper bound of infinity --- actually, the upper bound +is the 2-billion limit mentioned earlier, but that might as well be infinity. + +Readers of a reductionist bent may notice that the three other qualifiers can +all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}`` +is equivalent to ``+``, and ``{0,1}`` is the same as ``?``. It's better to use +``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier +to read. + + +Using Regular Expressions +========================= + +Now that we've looked at some simple regular expressions, how do we actually use +them in Python? The :mod:`re` module provides an interface to the regular +expression engine, allowing you to compile REs into objects and then perform +matches with them. + + +Compiling Regular Expressions +----------------------------- + +Regular expressions are compiled into :class:`RegexObject` instances, which have +methods for various operations such as searching for pattern matches or +performing string substitutions. :: + + >>> import re + >>> p = re.compile('ab*') + >>> print p + <re.RegexObject instance at 80b4150> + +:func:`re.compile` also accepts an optional *flags* argument, used to enable +various special features and syntax variations. We'll go over the available +settings later, but for now a single example will do:: + + >>> p = re.compile('ab*', re.IGNORECASE) + +The RE is passed to :func:`re.compile` as a string. REs are handled as strings +because regular expressions aren't part of the core Python language, and no +special syntax was created for expressing them. (There are applications that +don't need REs at all, so there's no need to bloat the language specification by +including them.) 
Instead, the :mod:`re` module is simply a C extension module +included with Python, just like the :mod:`socket` or :mod:`zlib` modules. + +Putting REs in strings keeps the Python language simpler, but has one +disadvantage which is the topic of the next section. + + +The Backslash Plague +-------------------- + +As stated earlier, regular expressions use the backslash character (``'\'``) to +indicate special forms or to allow special characters to be used without +invoking their special meaning. This conflicts with Python's usage of the same +character for the same purpose in string literals. + +Let's say you want to write a RE that matches the string ``\section``, which +might be found in a LaTeX file. To figure out what to write in the program +code, start with the desired string to be matched. Next, you must escape any +backslashes and other metacharacters by preceding them with a backslash, +resulting in the string ``\\section``. The resulting string that must be passed +to :func:`re.compile` must be ``\\section``. However, to express this as a +Python string literal, both backslashes must be escaped *again*. + ++-------------------+------------------------------------------+ +| Characters | Stage | ++===================+==========================================+ +| ``\section`` | Text string to be matched | ++-------------------+------------------------------------------+ +| ``\\section`` | Escaped backslash for :func:`re.compile` | ++-------------------+------------------------------------------+ +| ``"\\\\section"`` | Escaped backslashes for a string literal | ++-------------------+------------------------------------------+ + +In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE +string, because the regular expression must be ``\\``, and each backslash must +be expressed as ``\\`` inside a regular Python string literal. 
In REs that +feature backslashes repeatedly, this leads to lots of repeated backslashes and +makes the resulting strings difficult to understand. + +The solution is to use Python's raw string notation for regular expressions; +backslashes are not handled in any special way in a string literal prefixed with +``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``, +while ``"\n"`` is a one-character string containing a newline. Regular +expressions will often be written in Python code using this raw string notation. + ++-------------------+------------------+ +| Regular String | Raw string | ++===================+==================+ +| ``"ab*"`` | ``r"ab*"`` | ++-------------------+------------------+ +| ``"\\\\section"`` | ``r"\\section"`` | ++-------------------+------------------+ +| ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"`` | ++-------------------+------------------+ + + +Performing Matches +------------------ + +Once you have an object representing a compiled regular expression, what do you +do with it? :class:`RegexObject` instances have several methods and attributes. +Only the most significant ones will be covered here; consult `the Library +Reference <http://www.python.org/doc/lib/module-re.html>`_ for a complete +listing. + ++------------------+-----------------------------------------------+ +| Method/Attribute | Purpose | ++==================+===============================================+ +| ``match()`` | Determine if the RE matches at the beginning | +| | of the string. | ++------------------+-----------------------------------------------+ +| ``search()`` | Scan through a string, looking for any | +| | location where this RE matches. | ++------------------+-----------------------------------------------+ +| ``findall()`` | Find all substrings where the RE matches, and | +| | returns them as a list. 
| ++------------------+-----------------------------------------------+ +| ``finditer()`` | Find all substrings where the RE matches, and | +| | returns them as an iterator. | ++------------------+-----------------------------------------------+ + +:meth:`match` and :meth:`search` return ``None`` if no match can be found. If +they're successful, a ``MatchObject`` instance is returned, containing +information about the match: where it starts and ends, the substring it matched, +and more. + +You can learn about this by interactively experimenting with the :mod:`re` +module. If you have Tkinter available, you may also want to look at +:file:`Tools/scripts/redemo.py`, a demonstration program included with the +Python distribution. It allows you to enter REs and strings, and displays +whether the RE matches or fails. :file:`redemo.py` can be quite useful when +trying to debug a complicated RE. Phil Schwartz's `Kodos +<http://www.phil-schwartz.com/kodos.spy>`_ is also an interactive tool for +developing and testing RE patterns. + +This HOWTO uses the standard Python interpreter for its examples. First, run the +Python interpreter, import the :mod:`re` module, and compile a RE:: + + Python 2.2.2 (#1, Feb 10 2003, 12:57:01) + >>> import re + >>> p = re.compile('[a-z]+') + >>> p + <_sre.SRE_Pattern object at 80c3c28> + +Now, you can try matching various strings against the RE ``[a-z]+``. An empty +string shouldn't match at all, since ``+`` means 'one or more repetitions'. +:meth:`match` should return ``None`` in this case, which will cause the +interpreter to print no output. You can explicitly print the result of +:meth:`match` to make this clear. :: + + >>> p.match("") + >>> print p.match("") + None + +Now, let's try it on a string that it should match, such as ``tempo``. In this +case, :meth:`match` will return a :class:`MatchObject`, so you should store the +result in a variable for later use. 
:: + + >>> m = p.match('tempo') + >>> print m + <_sre.SRE_Match object at 80c4f68> + +Now you can query the :class:`MatchObject` for information about the matching +string. :class:`MatchObject` instances also have several methods and +attributes; the most important ones are: + ++------------------+--------------------------------------------+ +| Method/Attribute | Purpose | ++==================+============================================+ +| ``group()`` | Return the string matched by the RE | ++------------------+--------------------------------------------+ +| ``start()`` | Return the starting position of the match | ++------------------+--------------------------------------------+ +| ``end()`` | Return the ending position of the match | ++------------------+--------------------------------------------+ +| ``span()`` | Return a tuple containing the (start, end) | +| | positions of the match | ++------------------+--------------------------------------------+ + +Trying these methods will soon clarify their meaning:: + + >>> m.group() + 'tempo' + >>> m.start(), m.end() + (0, 5) + >>> m.span() + (0, 5) + +:meth:`group` returns the substring that was matched by the RE. :meth:`start` +and :meth:`end` return the starting and ending index of the match. :meth:`span` +returns both start and end indexes in a single tuple. Since the :meth:`match` +method only checks if the RE matches at the start of a string, :meth:`start` +will always be zero. However, the :meth:`search` method of :class:`RegexObject` +instances scans through the string, so the match may not start at zero in that +case. :: + + >>> print p.match('::: message') + None + >>> m = p.search('::: message') ; print m + <re.MatchObject instance at 80c9650> + >>> m.group() + 'message' + >>> m.span() + (4, 11) + +In actual programs, the most common style is to store the :class:`MatchObject` +in a variable, and then check if it was ``None``. This usually looks like:: + + p = re.compile( ... 
) + m = p.match( 'string goes here' ) + if m: + print 'Match found: ', m.group() + else: + print 'No match' + +Two :class:`RegexObject` methods return all of the matches for a pattern. +:meth:`findall` returns a list of matching strings:: + + >>> p = re.compile('\d+') + >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') + ['12', '11', '10'] + +:meth:`findall` has to create the entire list before it can be returned as the +result. The :meth:`finditer` method returns a sequence of :class:`MatchObject` +instances as an iterator. [#]_ :: + + >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') + >>> iterator + <callable-iterator object at 0x401833ac> + >>> for match in iterator: + ... print match.span() + ... + (0, 2) + (22, 24) + (29, 31) + + +Module-Level Functions +---------------------- + +You don't have to create a :class:`RegexObject` and call its methods; the +:mod:`re` module also provides top-level functions called :func:`match`, +:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions +take the same arguments as the corresponding :class:`RegexObject` method, with +the RE string added as the first argument, and still return either ``None`` or a +:class:`MatchObject` instance. :: + + >>> print re.match(r'From\s+', 'Fromage amk') + None + >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') + <re.MatchObject instance at 80c5978> + +Under the hood, these functions simply produce a :class:`RegexObject` for you +and call the appropriate method on it. They also store the compiled object in a +cache, so future calls using the same RE are faster. + +Should you use these module-level functions, or should you get the +:class:`RegexObject` and call its methods yourself? That choice depends on how +frequently the RE will be used, and on your personal coding style. If the RE is +being used at only one point in the code, then the module functions are probably +more convenient. 
If a program contains a lot of regular expressions, or re-uses +the same ones in several locations, then it might be worthwhile to collect all +the definitions in one place, in a section of code that compiles all the REs +ahead of time. To take an example from the standard library, here's an extract +from :file:`xmllib.py`:: + + ref = re.compile( ... ) + entityref = re.compile( ... ) + charref = re.compile( ... ) + starttagopen = re.compile( ... ) + +I generally prefer to work with the compiled object, even for one-time uses, but +few people will be as much of a purist about this as I am. + + +Compilation Flags +----------------- + +Compilation flags let you modify some aspects of how regular expressions work. +Flags are available in the :mod:`re` module under two names, a long name such as +:const:`IGNORECASE` and a short, one-letter form such as :const:`I`. (If you're +familiar with Perl's pattern modifiers, the one-letter forms use the same +letters; the short form of :const:`re.VERBOSE` is :const:`re.X`, for example.) +Multiple flags can be specified by bitwise OR-ing them; ``re.I | re.M`` sets +both the :const:`I` and :const:`M` flags, for example. + +Here's a table of the available flags, followed by a more detailed explanation +of each one. 
+ ++---------------------------------+--------------------------------------------+ +| Flag | Meaning | ++=================================+============================================+ +| :const:`DOTALL`, :const:`S` | Make ``.`` match any character, including | +| | newlines | ++---------------------------------+--------------------------------------------+ +| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches | ++---------------------------------+--------------------------------------------+ +| :const:`LOCALE`, :const:`L` | Do a locale-aware match | ++---------------------------------+--------------------------------------------+ +| :const:`MULTILINE`, :const:`M` | Multi-line matching, affecting ``^`` and | +| | ``$`` | ++---------------------------------+--------------------------------------------+ +| :const:`VERBOSE`, :const:`X` | Enable verbose REs, which can be organized | +| | more cleanly and understandably. | ++---------------------------------+--------------------------------------------+ + + +.. data:: I + IGNORECASE + :noindex: + + Perform case-insensitive matching; character class and literal strings will + match letters by ignoring case. For example, ``[A-Z]`` will match lowercase + letters, too, and ``Spam`` will match ``Spam``, ``spam``, or ``spAM``. This + lowercasing doesn't take the current locale into account; it will if you also + set the :const:`LOCALE` flag. + + +.. data:: L + LOCALE + :noindex: + + Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale. + + Locales are a feature of the C library intended to help in writing programs that + take account of language differences. For example, if you're processing French + text, you'd want to be able to write ``\w+`` to match words, but ``\w`` only + matches the character class ``[A-Za-z]``; it won't match ``'é'`` or ``'ç'``. 
If + your system is configured properly and a French locale is selected, certain C + functions will tell the program that ``'é'`` should also be considered a letter. + Setting the :const:`LOCALE` flag when compiling a regular expression will cause + the resulting compiled object to use these C functions for ``\w``; this is + slower, but also enables ``\w+`` to match French words as you'd expect. + + +.. data:: M + MULTILINE + :noindex: + + (``^`` and ``$`` haven't been explained yet; they'll be introduced in section + :ref:`more-metacharacters`.) + + Usually ``^`` matches only at the beginning of the string, and ``$`` matches + only at the end of the string and immediately before the newline (if any) at the + end of the string. When this flag is specified, ``^`` matches at the beginning + of the string and at the beginning of each line within the string, immediately + following each newline. Similarly, the ``$`` metacharacter matches both at + the end of the string and at the end of each line (immediately preceding each + newline). + + +.. data:: S + DOTALL + :noindex: + + Makes the ``'.'`` special character match any character at all, including a + newline; without this flag, ``'.'`` will match anything *except* a newline. + + +.. data:: X + VERBOSE + :noindex: + + This flag allows you to write regular expressions that are more readable by + granting you more flexibility in how you can format them. When this flag has + been specified, whitespace within the RE string is ignored, except when the + whitespace is in a character class or preceded by an unescaped backslash; this + lets you organize and indent the RE more clearly. This flag also lets you put + comments within a RE that will be ignored by the engine; comments are marked by + a ``'#'`` that's neither in a character class nor preceded by an unescaped + backslash. + + For example, here's a RE that uses :const:`re.VERBOSE`; see how much easier it + is to read? 
:: + + charref = re.compile(r""" + &[#] # Start of a numeric entity reference + ( + 0[0-7]+ # Octal form + | [0-9]+ # Decimal form + | x[0-9a-fA-F]+ # Hexadecimal form + ) + ; # Trailing semicolon + """, re.VERBOSE) + + Without the verbose setting, the RE would look like this:: + + charref = re.compile("&#(0[0-7]+" + "|[0-9]+" + "|x[0-9a-fA-F]+);") + + In the above example, Python's automatic concatenation of string literals has + been used to break up the RE into smaller pieces, but it's still more difficult + to understand than the version using :const:`re.VERBOSE`. + + +More Pattern Power +================== + +So far we've only covered a part of the features of regular expressions. In +this section, we'll cover some new metacharacters, and how to use groups to +retrieve portions of the text that was matched. + + +.. _more-metacharacters: + +More Metacharacters +------------------- + +There are some metacharacters that we haven't covered yet. Most of them will be +covered in this section. + +Some of the remaining metacharacters to be discussed are :dfn:`zero-width +assertions`. They don't cause the engine to advance through the string; +instead, they consume no characters at all, and simply succeed or fail. For +example, ``\b`` is an assertion that the current position is located at a word +boundary; the position isn't changed by the ``\b`` at all. This means that +zero-width assertions should never be repeated, because if they match once at a +given location, they can obviously be matched an infinite number of times. + +``|`` + Alternation, or the "or" operator. If A and B are regular expressions, + ``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very + low precedence in order to make it work reasonably when you're alternating + multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``, + not ``Cro``, a ``'w'`` or an ``'S'``, and ``ervo``. 
+ + To match a literal ``'|'``, use ``\|``, or enclose it inside a character class, + as in ``[|]``. + +``^`` + Matches at the beginning of lines. Unless the :const:`MULTILINE` flag has been + set, this will only match at the beginning of the string. In :const:`MULTILINE` + mode, this also matches immediately after each newline within the string. + + For example, if you wish to match the word ``From`` only at the beginning of a + line, the RE to use is ``^From``. :: + + >>> print re.search('^From', 'From Here to Eternity') + <re.MatchObject instance at 80c1520> + >>> print re.search('^From', 'Reciting From Memory') + None + + .. % To match a literal \character{\^}, use \regexp{\e\^} or enclose it + .. % inside a character class, as in \regexp{[{\e}\^]}. + +``$`` + Matches at the end of a line, which is defined as either the end of the string, + or any location followed by a newline character. :: + + >>> print re.search('}$', '{block}') + <re.MatchObject instance at 80adfa8> + >>> print re.search('}$', '{block} ') + None + >>> print re.search('}$', '{block}\n') + <re.MatchObject instance at 80adfa8> + + To match a literal ``'$'``, use ``\$`` or enclose it inside a character class, + as in ``[$]``. + + .. % $ + +``\A`` + Matches only at the start of the string. When not in :const:`MULTILINE` mode, + ``\A`` and ``^`` are effectively the same. In :const:`MULTILINE` mode, they're + different: ``\A`` still matches only at the beginning of the string, but ``^`` + may match at any location inside the string that follows a newline character. + +``\Z`` + Matches only at the end of the string. + +``\b`` + Word boundary. This is a zero-width assertion that matches only at the + beginning or end of a word. A word is defined as a sequence of alphanumeric + characters, so the end of a word is indicated by whitespace or a + non-alphanumeric character. 
+ + The following example matches ``class`` only when it's a complete word; it won't + match when it's contained inside another word. :: + + >>> p = re.compile(r'\bclass\b') + >>> print p.search('no class at all') + <re.MatchObject instance at 80c8f28> + >>> print p.search('the declassified algorithm') + None + >>> print p.search('one subclass is') + None + + There are two subtleties you should remember when using this special sequence. + First, this is the worst collision between Python's string literals and regular + expression sequences. In Python's string literals, ``\b`` is the backspace + character, ASCII value 8. If you're not using raw strings, then Python will + convert the ``\b`` to a backspace, and your RE won't match as you expect it to. + The following example looks the same as our previous RE, but omits the ``'r'`` + in front of the RE string. :: + + >>> p = re.compile('\bclass\b') + >>> print p.search('no class at all') + None + >>> print p.search('\b' + 'class' + '\b') + <re.MatchObject instance at 80c3ee0> + + Second, inside a character class, where there's no use for this assertion, + ``\b`` represents the backspace character, for compatibility with Python's + string literals. + +``\B`` + Another zero-width assertion, this is the opposite of ``\b``, only matching when + the current position is not at a word boundary. + + +Grouping +-------- + +Frequently you need to obtain more information than just whether the RE matched +or not. Regular expressions are often used to dissect strings by writing a RE +divided into several subgroups which match different components of interest. 
+For example, an RFC-822 header line is divided into a header name and a value, +separated by a ``':'``, like this:: + + From: author@example.com + User-Agent: Thunderbird 1.5.0.9 (X11/20061227) + MIME-Version: 1.0 + To: editor@example.com + +This can be handled by writing a regular expression which matches an entire +header line, and has one group which matches the header name, and another group +which matches the header's value. + +Groups are marked by the ``'('``, ``')'`` metacharacters. ``'('`` and ``')'`` +have much the same meaning as they do in mathematical expressions; they group +together the expressions contained inside them, and you can repeat the contents +of a group with a repeating qualifier, such as ``*``, ``+``, ``?``, or +``{m,n}``. For example, ``(ab)*`` will match zero or more repetitions of +``ab``. :: + + >>> p = re.compile('(ab)*') + >>> print p.match('ababababab').span() + (0, 10) + +Groups indicated with ``'('``, ``')'`` also capture the starting and ending +index of the text that they match; this can be retrieved by passing an argument +to :meth:`group`, :meth:`start`, :meth:`end`, and :meth:`span`. Groups are +numbered starting with 0. Group 0 is always present; it's the whole RE, so +:class:`MatchObject` methods all have group 0 as their default argument. Later +we'll see how to express groups that don't capture the span of text that they +match. :: + + >>> p = re.compile('(a)b') + >>> m = p.match('ab') + >>> m.group() + 'ab' + >>> m.group(0) + 'ab' + +Subgroups are numbered from left to right, from 1 upward. Groups can be nested; +to determine the number, just count the opening parenthesis characters, going +from left to right. :: + + >>> p = re.compile('(a(b)c)d') + >>> m = p.match('abcd') + >>> m.group(0) + 'abcd' + >>> m.group(1) + 'abc' + >>> m.group(2) + 'b' + +:meth:`group` can be passed multiple group numbers at a time, in which case it +will return a tuple containing the corresponding values for those groups. 
:: + + >>> m.group(2,1,2) + ('b', 'abc', 'b') + +The :meth:`groups` method returns a tuple containing the strings for all the +subgroups, from 1 up to however many there are. :: + + >>> m.groups() + ('abc', 'b') + +Backreferences in a pattern allow you to specify that the contents of an earlier +capturing group must also be found at the current location in the string. For +example, ``\1`` will succeed if the exact contents of group 1 can be found at +the current position, and fails otherwise. Remember that Python's string +literals also use a backslash followed by numbers to allow including arbitrary +characters in a string, so be sure to use a raw string when incorporating +backreferences in a RE. + +For example, the following RE detects doubled words in a string. :: + + >>> p = re.compile(r'(\b\w+)\s+\1') + >>> p.search('Paris in the the spring').group() + 'the the' + +Backreferences like this aren't often useful for just searching through a string +--- there are few text formats which repeat data in this way --- but you'll soon +find out that they're *very* useful when performing string substitutions. + + +Non-capturing and Named Groups +------------------------------ + +Elaborate REs may use many groups, both to capture substrings of interest, and +to group and structure the RE itself. In complex REs, it becomes difficult to +keep track of the group numbers. There are two features which help with this +problem. Both of them use a common syntax for regular expression extensions, so +we'll look at that first. + +Perl 5 added several additional features to standard regular expressions, and +the Python :mod:`re` module supports most of them. It would have been +difficult to choose new single-keystroke metacharacters or new special sequences +beginning with ``\`` to represent the new features without making Perl's regular +expressions confusingly different from standard REs. 
If you chose ``&`` as a +new metacharacter, for example, old expressions would be assuming that ``&`` was +a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``. + +The solution chosen by the Perl developers was to use ``(?...)`` as the +extension syntax. ``?`` immediately after a parenthesis was a syntax error +because the ``?`` would have nothing to repeat, so this didn't introduce any +compatibility problems. The characters immediately after the ``?`` indicate +what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead +assertion) and ``(?:foo)`` is something else (a non-capturing group containing +the subexpression ``foo``). + +Python adds an extension syntax to Perl's extension syntax. If the first +character after the question mark is a ``P``, you know that it's an extension +that's specific to Python. Currently there are two such extensions: +``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to +a named group. If future versions of Perl 5 add similar features using a +different syntax, the :mod:`re` module will be changed to support the new +syntax, while preserving the Python-specific syntax for compatibility's sake. + +Now that we've looked at the general extension syntax, we can return to the +features that simplify working with groups in complex REs. Since groups are +numbered from left to right and a complex expression may use many groups, it can +become difficult to keep track of the correct numbering. Modifying such a +complex RE is annoying, too: insert a new group near the beginning and you +change the numbers of everything that follows it. + +Sometimes you'll want to use a group to collect a part of a regular expression, +but aren't interested in retrieving the group's contents. You can make this fact +explicit by using a non-capturing group: ``(?:...)``, where you can replace the +``...`` with any other regular expression. 
:: + + >>> m = re.match("([abc])+", "abc") + >>> m.groups() + ('c',) + >>> m = re.match("(?:[abc])+", "abc") + >>> m.groups() + () + +Except for the fact that you can't retrieve the contents of what the group +matched, a non-capturing group behaves exactly the same as a capturing group; +you can put anything inside it, repeat it with a repetition metacharacter such +as ``*``, and nest it within other groups (capturing or non-capturing). +``(?:...)`` is particularly useful when modifying an existing pattern, since you +can add new groups without changing how all the other groups are numbered. It +should be mentioned that there's no performance difference in searching between +capturing and non-capturing groups; neither form is any faster than the other. + +A more significant feature is named groups: instead of referring to them by +numbers, groups can be referenced by a name. + +The syntax for a named group is one of the Python-specific extensions: +``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups +also behave exactly like capturing groups, and additionally associate a name +with a group. The :class:`MatchObject` methods that deal with capturing groups +all accept either integers that refer to the group by number or strings that +contain the desired group's name. Named groups are still given numbers, so you +can retrieve information about a group in two ways:: + + >>> p = re.compile(r'(?P<word>\b\w+\b)') + >>> m = p.search( '(((( Lots of punctuation )))' ) + >>> m.group('word') + 'Lots' + >>> m.group(1) + 'Lots' + +Named groups are handy because they let you use easily-remembered names, instead +of having to remember numbers. 
Here's an example RE from the :mod:`imaplib` +module:: + + InternalDate = re.compile(r'INTERNALDATE "' + r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' + r'(?P<year>[0-9][0-9][0-9][0-9])' + r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' + r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' + r'"') + +It's obviously much easier to retrieve ``m.group('zonem')``, instead of having +to remember to retrieve group 9. + +The syntax for backreferences in an expression such as ``(...)\1`` refers to the +number of the group. There's naturally a variant that uses the group name +instead of the number. This is another Python extension: ``(?P=name)`` indicates +that the contents of the group called *name* should again be matched at the +current point. The regular expression for finding doubled words, +``(\b\w+)\s+\1`` can also be written as ``(?P<word>\b\w+)\s+(?P=word)``:: + + >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') + >>> p.search('Paris in the the spring').group() + 'the the' + + +Lookahead Assertions +-------------------- + +Another zero-width assertion is the lookahead assertion. Lookahead assertions +are available in both positive and negative form, and look like this: + +``(?=...)`` + Positive lookahead assertion. This succeeds if the contained regular + expression, represented here by ``...``, successfully matches at the current + location, and fails otherwise. But, once the contained expression has been + tried, the matching engine doesn't advance at all; the rest of the pattern is + tried right where the assertion started. + +``(?!...)`` + Negative lookahead assertion. This is the opposite of the positive assertion; + it succeeds if the contained expression *doesn't* match at the current position + in the string. + +To make this concrete, let's look at a case where a lookahead is useful. +Consider a simple pattern to match a filename and split it apart into a base +name and an extension, separated by a ``.``. 
For example, in ``news.rc``, +``news`` is the base name, and ``rc`` is the filename's extension. + +The pattern to match this is quite simple: + +``.*[.].*$`` + +Notice that the ``.`` needs to be treated specially because it's a +metacharacter; I've put it inside a character class. Also notice the trailing +``$``; this is added to ensure that all the rest of the string must be included +in the extension. This regular expression matches ``foo.bar``, +``autoexec.bat``, ``sendmail.cf``, and ``printers.conf``. + +Now, consider complicating the problem a bit; what if you want to match +filenames where the extension is not ``bat``? Some incorrect attempts: + +``.*[.][^b].*$`` + +The first attempt above tries to exclude ``bat`` by requiring +that the first character of the extension is not a ``b``. This is wrong, +because the pattern also doesn't match ``foo.bar``. + +``.*[.]([^b]..|.[^a].|..[^t])$`` + +The expression gets messier when you try to patch up the first solution by +requiring one of the following cases to match: the first character of the +extension isn't ``b``; the second character isn't ``a``; or the third character +isn't ``t``. This accepts ``foo.bar`` and rejects ``autoexec.bat``, but it +requires a three-letter extension and won't accept a filename with a two-letter +extension such as ``sendmail.cf``. We'll complicate the pattern again in an +effort to fix it. + +``.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$`` + +In the third attempt, the second and third letters are both made optional in +order to allow matching extensions shorter than three characters, such as +``sendmail.cf``. + +The pattern's getting really complicated now, which makes it hard to read and +understand. Worse, if the problem changes and you want to exclude both ``bat`` +and ``exe`` as extensions, the pattern would get even more complicated and +confusing. 
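The three hand-written attempts above can be checked against the sample filenames used in this section. The following is only a minimal sketch and is not part of the original HOWTO: the ``attempts`` dictionary, its labels, and the ``accepted`` helper are names invented for illustration, and the code is written to run on a current Python 3 interpreter rather than the Python 2 shown in the transcripts.

```python
import re

# Sample filenames mentioned in the text.
filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'printers.conf']

# The three patterns, labelled in the order the text introduces them.
# The labels are illustrative only.
attempts = {
    'first': r'.*[.][^b].*$',
    'second': r'.*[.]([^b]..|.[^a].|..[^t])$',
    'third': r'.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$',
}

def accepted(pattern):
    """Return the sample filenames that `pattern` matches."""
    return [name for name in filenames if re.match(pattern, name)]

for label in ('first', 'second', 'third'):
    print(label, accepted(attempts[label]))
```

Run against these four names, the first pattern accepts only ``sendmail.cf`` and ``printers.conf``, the second accepts only ``foo.bar``, and the third accepts ``foo.bar`` and ``sendmail.cf``. The third pattern still rejects ``printers.conf``, because it allows at most three characters in the extension.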
+ +A negative lookahead cuts through all this confusion: + +``.*[.](?!bat$).*$`` + +The negative lookahead means: if the expression ``bat$`` +doesn't match at this point, try the rest of the pattern; if ``bat$`` does +match, the whole pattern will fail. The trailing ``$`` is required to ensure +that something like ``sample.batch``, where the extension only starts with +``bat``, will be allowed. + +Excluding another filename extension is now easy; simply add it as an +alternative inside the assertion. The following pattern excludes filenames that +end in either ``bat`` or ``exe``: + +``.*[.](?!bat$|exe$).*$`` + + +Modifying Strings +================= + +Up to this point, we've simply performed searches against a static string. +Regular expressions are also commonly used to modify strings in various ways, +using the following :class:`RegexObject` methods: + ++------------------+-----------------------------------------------+ +| Method/Attribute | Purpose | ++==================+===============================================+ +| ``split()`` | Split the string into a list, splitting it | +| | wherever the RE matches | ++------------------+-----------------------------------------------+ +| ``sub()`` | Find all substrings where the RE matches, and | +| | replace them with a different string | ++------------------+-----------------------------------------------+ +| ``subn()`` | Does the same thing as :meth:`sub`, but | +| | returns the new string and the number of | +| | replacements | ++------------------+-----------------------------------------------+ + + +Splitting Strings +----------------- + +The :meth:`split` method of a :class:`RegexObject` splits a string apart +wherever the RE matches, returning a list of the pieces. It's similar to the +:meth:`split` method of strings but provides much more generality in the +delimiters that you can split by; the string :meth:`split` method only supports splitting by +whitespace or by a fixed string. 
As you'd expect, there's a module-level +:func:`re.split` function, too. + + +.. method:: .split(string [, maxsplit=0]) + :noindex: + + Split *string* by the matches of the regular expression. If capturing + parentheses are used in the RE, then their contents will also be returned as + part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit* splits + are performed. + +You can limit the number of splits made, by passing a value for *maxsplit*. +When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the +remainder of the string is returned as the final element of the list. In the +following example, the delimiter is any sequence of non-alphanumeric characters. +:: + + >>> p = re.compile(r'\W+') + >>> p.split('This is a test, short and sweet, of split().') + ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] + >>> p.split('This is a test, short and sweet, of split().', 3) + ['This', 'is', 'a', 'test, short and sweet, of split().'] + +Sometimes you're not only interested in what the text between delimiters is, but +also need to know what the delimiter was. If capturing parentheses are used in +the RE, then their values are also returned as part of the list. Compare the +following calls:: + + >>> p = re.compile(r'\W+') + >>> p2 = re.compile(r'(\W+)') + >>> p.split('This... is a test.') + ['This', 'is', 'a', 'test', ''] + >>> p2.split('This... is a test.') + ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] + +The module-level function :func:`re.split` adds the RE to be used as the first +argument, but is otherwise the same. 
:: + + >>> re.split('[\W]+', 'Words, words, words.') + ['Words', 'words', 'words', ''] + >>> re.split('([\W]+)', 'Words, words, words.') + ['Words', ', ', 'words', ', ', 'words', '.', ''] + >>> re.split('[\W]+', 'Words, words, words.', 1) + ['Words', 'words, words.'] + + +Search and Replace +------------------ + +Another common task is to find all the matches for a pattern, and replace them +with a different string. The :meth:`sub` method takes a replacement value, +which can be either a string or a function, and the string to be processed. + + +.. method:: .sub(replacement, string[, count=0]) + :noindex: + + Returns the string obtained by replacing the leftmost non-overlapping + occurrences of the RE in *string* by the replacement *replacement*. If the + pattern isn't found, *string* is returned unchanged. + + The optional argument *count* is the maximum number of pattern occurrences to be + replaced; *count* must be a non-negative integer. The default value of 0 means + to replace all occurrences. + +Here's a simple example of using the :meth:`sub` method. It replaces colour +names with the word ``colour``:: + + >>> p = re.compile( '(blue|white|red)') + >>> p.sub( 'colour', 'blue socks and red shoes') + 'colour socks and colour shoes' + >>> p.sub( 'colour', 'blue socks and red shoes', count=1) + 'colour socks and red shoes' + +The :meth:`subn` method does the same work, but returns a 2-tuple containing the +new string value and the number of replacements that were performed:: + + >>> p = re.compile( '(blue|white|red)') + >>> p.subn( 'colour', 'blue socks and red shoes') + ('colour socks and colour shoes', 2) + >>> p.subn( 'colour', 'no colours at all') + ('no colours at all', 0) + +Empty matches are replaced only when they're not adjacent to a previous match. +:: + + >>> p = re.compile('x*') + >>> p.sub('-', 'abxd') + '-a-b-d-' + +If *replacement* is a string, any backslash escapes in it are processed. 
That +is, ``\n`` is converted to a single newline character, ``\r`` is converted to a +carriage return, and so forth. Unknown escapes such as ``\j`` are left alone. +Backreferences, such as ``\6``, are replaced with the substring matched by the +corresponding group in the RE. This lets you incorporate portions of the +original text in the resulting replacement string. + +This example matches the word ``section`` followed by a string enclosed in +``{``, ``}``, and changes ``section`` to ``subsection``:: + + >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) + >>> p.sub(r'subsection{\1}','section{First} section{second}') + 'subsection{First} subsection{second}' + +There's also a syntax for referring to named groups as defined by the +``(?P<name>...)`` syntax. ``\g<name>`` will use the substring matched by the +group named ``name``, and ``\g<number>`` uses the corresponding group number. +``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous in a +replacement string such as ``\g<2>0``. (``\20`` would be interpreted as a +reference to group 20, not a reference to group 2 followed by the literal +character ``'0'``.) The following substitutions are all equivalent, but use all +three variations of the replacement string. :: + + >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) + >>> p.sub(r'subsection{\1}','section{First}') + 'subsection{First}' + >>> p.sub(r'subsection{\g<1>}','section{First}') + 'subsection{First}' + >>> p.sub(r'subsection{\g<name>}','section{First}') + 'subsection{First}' + +*replacement* can also be a function, which gives you even more control. If +*replacement* is a function, the function is called for every non-overlapping +occurrence of *pattern*. On each call, the function is passed a +:class:`MatchObject` argument for the match and can use this information to +compute the desired replacement string and return it. 
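As a minimal sketch of this calling convention (not taken from the original HOWTO; the ``shout`` helper and the sample sentence are invented for illustration), the replacement function receives each match object in turn and returns the text to substitute for it:

```python
import re

def shout(match):
    """Return the matched word in upper case (illustrative helper)."""
    return match.group().upper()

# Match whole words that start with a lowercase 'p'.
p = re.compile(r'\bp\w*')
result = p.sub(shout, 'peter piper picked a peck')
print(result)
```

Here ``shout`` is called once for each of the four words beginning with ``p``, and the word ``a`` is left untouched because the pattern never matches it.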
+ +In the following example, the replacement function translates decimals into +hexadecimal:: + + >>> def hexrepl( match ): + ... "Return the hex string for a decimal number" + ... value = int( match.group() ) + ... return hex(value) + ... + >>> p = re.compile(r'\d+') + >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') + 'Call 0xffd2 for printing, 0xc000 for user code.' + +When using the module-level :func:`re.sub` function, the pattern is passed as +the first argument. The pattern may be a string or a :class:`RegexObject`; if +you need to specify regular expression flags, you must either use a +:class:`RegexObject` as the first parameter, or use embedded modifiers in the +pattern, e.g. ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``. + + +Common Problems +=============== + +Regular expressions are a powerful tool for some applications, but in some ways +their behaviour isn't intuitive and at times they don't behave the way you may +expect them to. This section will point out some of the most common pitfalls. + + +Use String Methods +------------------ + +Sometimes using the :mod:`re` module is a mistake. If you're matching a fixed +string, or a single character class, and you're not using any :mod:`re` features +such as the :const:`IGNORECASE` flag, then the full power of regular expressions +may not be required. Strings have several methods for performing operations with +fixed strings and they're usually much faster, because the implementation is a +single small C loop that's been optimized for the purpose, instead of the large, +more generalized regular expression engine. + +One example might be replacing a single fixed string with another one; for +example, you might replace ``word`` with ``deed``. ``re.sub()`` seems like the +function to use for this, but consider the :meth:`replace` method. 
Note that +:func:`replace` will also replace ``word`` inside words, turning ``swordfish`` +into ``sdeedfish``, but the naive RE ``word`` would have done that, too. (To +avoid performing the substitution on parts of words, the pattern would have to +be ``\bword\b``, in order to require that ``word`` have a word boundary on +either side. This takes the job beyond :meth:`replace`'s abilities.) + +Another common task is deleting every occurrence of a single character from a +string or replacing it with another single character. You might do this with +something like ``re.sub('\n', ' ', S)``, but :meth:`translate` is capable of +doing both tasks and will be faster than any regular expression operation can +be. + +In short, before turning to the :mod:`re` module, consider whether your problem +can be solved with a faster and simpler string method. + + +match() versus search() +----------------------- + +The :func:`match` function only checks if the RE matches at the beginning of the +string while :func:`search` will scan forward through the string for a match. +It's important to keep this distinction in mind. Remember, :func:`match` will +only report a successful match which will start at 0; if the match wouldn't +start at zero, :func:`match` will *not* report it. :: + + >>> print re.match('super', 'superstition').span() + (0, 5) + >>> print re.match('super', 'insuperable') + None + +On the other hand, :func:`search` will scan forward through the string, +reporting the first match it finds. :: + + >>> print re.search('super', 'superstition').span() + (0, 5) + >>> print re.search('super', 'insuperable').span() + (2, 7) + +Sometimes you'll be tempted to keep using :func:`re.match`, and just add ``.*`` +to the front of your RE. Resist this temptation and use :func:`re.search` +instead. The regular expression compiler does some analysis of REs in order to +speed up the process of looking for a match. 
One such analysis figures out what +the first character of a match must be; for example, a pattern starting with +``Crow`` must match starting with a ``'C'``. The analysis lets the engine +quickly scan through the string looking for the starting character, only trying +the full match if a ``'C'`` is found. + +Adding ``.*`` defeats this optimization, requiring scanning to the end of the +string and then backtracking to find a match for the rest of the RE. Use +:func:`re.search` instead. + + +Greedy versus Non-Greedy +------------------------ + +When repeating a regular expression, as in ``a*``, the resulting action is to +consume as much of the pattern as possible. This fact often bites you when +you're trying to match a pair of balanced delimiters, such as the angle brackets +surrounding an HTML tag. The naive pattern for matching a single HTML tag +doesn't work because of the greedy nature of ``.*``. :: + + >>> s = '<html><head><title>Title</title>' + >>> len(s) + 32 + >>> print re.match('<.*>', s).span() + (0, 32) + >>> print re.match('<.*>', s).group() + <html><head><title>Title</title> + +The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of +the string. There's still more left in the RE, though, and the ``>`` can't +match at the end of the string, so the regular expression engine has to +backtrack character by character until it finds a match for the ``>``. The +final match extends from the ``'<'`` in ``<html>`` to the ``'>'`` in +``</title>``, which isn't what you want. + +In this case, the solution is to use the non-greedy qualifiers ``*?``, ``+?``, +``??``, or ``{m,n}?``, which match as *little* text as possible. In the above +example, the ``'>'`` is tried immediately after the first ``'<'`` matches, and +when it fails, the engine advances a character at a time, retrying the ``'>'`` +at every step. 
This produces just the right result:: + + >>> print re.match('<.*?>', s).group() + <html> + +(Note that parsing HTML or XML with regular expressions is painful. +Quick-and-dirty patterns will handle common cases, but HTML and XML have special +cases that will break the obvious regular expression; by the time you've written +a regular expression that handles all of the possible cases, the patterns will +be *very* complicated. Use an HTML or XML parser module for such tasks.) + + +Not Using re.VERBOSE +-------------------- + +By now you've probably noticed that regular expressions are a very compact +notation, but they're not terribly readable. REs of moderate complexity can +become lengthy collections of backslashes, parentheses, and metacharacters, +making them difficult to read and understand. + +For such REs, specifying the ``re.VERBOSE`` flag when compiling the regular +expression can be helpful, because it allows you to format the regular +expression more clearly. + +The ``re.VERBOSE`` flag has several effects. Whitespace in the regular +expression that *isn't* inside a character class is ignored. This means that an +expression such as ``dog | cat`` is equivalent to the less readable ``dog|cat``, +but ``[a b]`` will still match the characters ``'a'``, ``'b'``, or a space. In +addition, you can also put comments inside a RE; comments extend from a ``#`` +character to the next newline. When used with triple-quoted strings, this +enables REs to be formatted more neatly:: + + pat = re.compile(r""" + \s* # Skip leading whitespace + (?P<header>[^:]+) # Header name + \s* : # Whitespace, and a colon + (?P<value>.*?) # The header's value -- *? used to + # lose the following trailing whitespace + \s*$ # Trailing whitespace to end-of-line + """, re.VERBOSE) + +This is far more readable than: + +.. % $ + +:: + + pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") + +.. % $ + + +Feedback +======== + +Regular expressions are a complicated topic. 
Did this document help you +understand them? Were there parts that were unclear, or problems you +encountered that weren't covered here? If so, please send suggestions for +improvements to the author. + +The most complete book on regular expressions is almost certainly Jeffrey +Friedl's Mastering Regular Expressions, published by O'Reilly. Unfortunately, +it exclusively concentrates on Perl and Java's flavours of regular expressions, +and doesn't contain any Python material at all, so it won't be useful as a +reference for programming in Python. (The first edition covered Python's +now-removed :mod:`regex` module, which won't help you much.) Consider checking +it out from your library. + + +.. rubric:: Footnotes + +.. [#] Introduced in Python 2.2.2. + diff --git a/Doc/howto/sockets.rst b/Doc/howto/sockets.rst new file mode 100644 index 0000000..dc05d32 --- /dev/null +++ b/Doc/howto/sockets.rst @@ -0,0 +1,421 @@ +**************************** + Socket Programming HOWTO +**************************** + +:Author: Gordon McMillan + + +.. topic:: Abstract + + Sockets are used nearly everywhere, but are one of the most severely + misunderstood technologies around. This is a 10,000 foot overview of sockets. + It's not really a tutorial - you'll still have work to do in getting things + operational. It doesn't cover the fine points (and there are a lot of them), but + I hope it will give you enough background to begin using them decently. + + +Sockets +======= + +Sockets are used nearly everywhere, but are one of the most severely +misunderstood technologies around. This is a 10,000 foot overview of sockets. +It's not really a tutorial - you'll still have work to do in getting things +working. It doesn't cover the fine points (and there are a lot of them), but I +hope it will give you enough background to begin using them decently. + +I'm only going to talk about INET sockets, but they account for at least 99% of +the sockets in use.
And I'll only talk about STREAM sockets - unless you really +know what you're doing (in which case this HOWTO isn't for you!), you'll get +better behavior and performance from a STREAM socket than anything else. I will +try to clear up the mystery of what a socket is, as well as some hints on how to +work with blocking and non-blocking sockets. But I'll start by talking about +blocking sockets. You'll need to know how they work before dealing with +non-blocking sockets. + +Part of the trouble with understanding these things is that "socket" can mean a +number of subtly different things, depending on context. So first, let's make a +distinction between a "client" socket - an endpoint of a conversation, and a +"server" socket, which is more like a switchboard operator. The client +application (your browser, for example) uses "client" sockets exclusively; the +web server it's talking to uses both "server" sockets and "client" sockets. + + +History +------- + +Of the various forms of IPC (*Inter Process Communication*), sockets are by far +the most popular. On any given platform, there are likely to be other forms of +IPC that are faster, but for cross-platform communication, sockets are about the +only game in town. + +They were invented in Berkeley as part of the BSD flavor of Unix. They spread +like wildfire with the Internet. With good reason --- the combination of sockets +with INET makes talking to arbitrary machines around the world unbelievably easy +(at least compared to other schemes). 
+ + +Creating a Socket +================= + +Roughly speaking, when you clicked on the link that brought you to this page, +your browser did something like the following:: + + #create an INET, STREAMing socket + s = socket.socket( + socket.AF_INET, socket.SOCK_STREAM) + #now connect to the web server on port 80 + # - the normal http port + s.connect(("www.mcmillan-inc.com", 80)) + +When the ``connect`` completes, the socket ``s`` can now be used to send in a +request for the text of this page. The same socket will read the reply, and then +be destroyed. That's right - destroyed. Client sockets are normally only used +for one exchange (or a small set of sequential exchanges). + +What happens in the web server is a bit more complex. First, the web server +creates a "server socket". :: + + #create an INET, STREAMing socket + serversocket = socket.socket( + socket.AF_INET, socket.SOCK_STREAM) + #bind the socket to a public host, + # and a well-known port + serversocket.bind((socket.gethostname(), 80)) + #become a server socket + serversocket.listen(5) + +A couple things to notice: we used ``socket.gethostname()`` so that the socket +would be visible to the outside world. If we had used ``s.bind(('', 80))`` or +``s.bind(('localhost', 80))`` or ``s.bind(('127.0.0.1', 80))`` we would still +have a "server" socket, but one that was only visible within the same machine. + +A second thing to note: low number ports are usually reserved for "well known" +services (HTTP, SNMP etc). If you're playing around, use a nice high number (4 +digits). + +Finally, the argument to ``listen`` tells the socket library that we want it to +queue up as many as 5 connect requests (the normal max) before refusing outside +connections. If the rest of the code is written properly, that should be plenty. + +OK, now we have a "server" socket, listening on port 80. 
Now we enter the +mainloop of the web server:: + + while 1: + #accept connections from outside + (clientsocket, address) = serversocket.accept() + #now do something with the clientsocket + #in this case, we'll pretend this is a threaded server + ct = client_thread(clientsocket) + ct.run() + +There are actually 3 general ways in which this loop could work - dispatching a +thread to handle ``clientsocket``, creating a new process to handle +``clientsocket``, or restructuring this app to use non-blocking sockets, and +multiplexing between our "server" socket and any active ``clientsocket``\ s using +``select``. More about that later. The important thing to understand now is +this: this is *all* a "server" socket does. It doesn't send any data. It doesn't +receive any data. It just produces "client" sockets. Each ``clientsocket`` is +created in response to some *other* "client" socket doing a ``connect()`` to the +host and port we're bound to. As soon as we've created that ``clientsocket``, we +go back to listening for more connections. The two "clients" are free to chat it +up - they are using some dynamically allocated port which will be recycled when +the conversation ends. + + +IPC +--- + +If you need fast IPC between two processes on one machine, you should look into +whatever form of shared memory the platform offers. A simple protocol based +around shared memory and locks or semaphores is by far the fastest technique. + +If you do decide to use sockets, bind the "server" socket to ``'localhost'``. On +most platforms, this will take a shortcut around a couple of layers of network +code and be quite a bit faster. + + +Using a Socket +============== + +The first thing to note, is that the web browser's "client" socket and the web +server's "client" socket are identical beasts. That is, this is a "peer to peer" +conversation. Or to put it another way, *as the designer, you will have to +decide what the rules of etiquette are for a conversation*.
Normally, the +``connect``\ ing socket starts the conversation, by sending in a request, or +perhaps a signon. But that's a design decision - it's not a rule of sockets. + +Now there are two sets of verbs to use for communication. You can use ``send`` +and ``recv``, or you can transform your client socket into a file-like beast and +use ``read`` and ``write``. The latter is the way Java presents their sockets. +I'm not going to talk about it here, except to warn you that you need to use +``flush`` on sockets. These are buffered "files", and a common mistake is to +``write`` something, and then ``read`` for a reply. Without a ``flush`` in +there, you may wait forever for the reply, because the request may still be in +your output buffer. + +Now we come to the major stumbling block of sockets - ``send`` and ``recv`` operate +on the network buffers. They do not necessarily handle all the bytes you hand +them (or expect from them), because their major focus is handling the network +buffers. In general, they return when the associated network buffers have been +filled (``send``) or emptied (``recv``). They then tell you how many bytes they +handled. It is *your* responsibility to call them again until your message has +been completely dealt with. + +When a ``recv`` returns 0 bytes, it means the other side has closed (or is in +the process of closing) the connection. You will not receive any more data on +this connection. Ever. You may be able to send data successfully; I'll talk +about that some on the next page. + +A protocol like HTTP uses a socket for only one transfer. The client sends a +request, then reads a reply. That's it. The socket is discarded. This means that +a client can detect the end of the reply by receiving 0 bytes.
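The read-until-close pattern described above can be sketched roughly as follows. This is a minimal illustration written for Python 3 (an assumption; there ``recv`` returns ``bytes``, so an empty ``b""`` is the "0 bytes" signal). The names ``recv_until_closed`` and ``demo`` are hypothetical, and ``socket.socketpair()`` merely stands in for a real client/server connection.

```python
import socket


def recv_until_closed(sock, bufsize=4096):
    """Collect data from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if chunk == b"":          # 0 bytes: the other side has closed
            break
        chunks.append(chunk)
    return b"".join(chunks)


def demo():
    # Two already-connected sockets play the roles of server and client.
    server_side, client_side = socket.socketpair()
    server_side.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello")
    server_side.close()           # closing is what marks the end of the reply
    reply = recv_until_closed(client_side)
    client_side.close()
    return reply
```

Note that the loop only terminates because the sender closes its end; on a connection that stays open, the final ``recv`` would simply block.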
+ +But if you plan to reuse your socket for further transfers, you need to realize +that *there is no "EOT" (End of Transfer) on a socket.* I repeat: if a socket +``send`` or ``recv`` returns after handling 0 bytes, the connection has been +broken. If the connection has *not* been broken, you may wait on a ``recv`` +forever, because the socket will *not* tell you that there's nothing more to +read (for now). Now if you think about that a bit, you'll come to realize a +fundamental truth of sockets: *messages must either be fixed length* (yuck), *or +be delimited* (shrug), *or indicate how long they are* (much better), *or end by +shutting down the connection*. The choice is entirely yours, (but some ways are +righter than others). + +Assuming you don't want to end the connection, the simplest solution is a fixed +length message:: + + class mysocket: + '''demonstration class only + - coded for clarity, not efficiency + ''' + + def __init__(self, sock=None): + if sock is None: + self.sock = socket.socket( + socket.AF_INET, socket.SOCK_STREAM) + else: + self.sock = sock + + def connect(self, host, port): + self.sock.connect((host, port)) + + def mysend(self, msg): + totalsent = 0 + while totalsent < MSGLEN: + sent = self.sock.send(msg[totalsent:]) + if sent == 0: + raise RuntimeError, \ + "socket connection broken" + totalsent = totalsent + sent + + def myreceive(self): + msg = '' + while len(msg) < MSGLEN: + chunk = self.sock.recv(MSGLEN-len(msg)) + if chunk == '': + raise RuntimeError, \ + "socket connection broken" + msg = msg + chunk + return msg + +The sending code here is usable for almost any messaging scheme - in Python you +send strings, and you can use ``len()`` to determine its length (even if it has +embedded ``\0`` characters). It's mostly the receiving code that gets more +complex. (And in C, it's not much worse, except you can't use ``strlen`` if the +message has embedded ``\0``\ s.) 
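As an illustration of the "indicate how long they are" option mentioned above, here is a minimal Python 3 sketch (an assumption; the class above is written for Python 2). The 4-byte network-order length header and the names ``HEADER``, ``recv_exactly``, ``send_msg``, ``recv_msg`` and ``demo`` are all hypothetical choices for this example, not part of the HOWTO.

```python
import socket
import struct

# Assumed wire format: 4-byte unsigned length in network byte order, then payload.
HEADER = struct.Struct("!I")


def recv_exactly(sock, n):
    """Call recv repeatedly until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if chunk == b"":
            raise RuntimeError("socket connection broken")
        buf.extend(chunk)
    return bytes(buf)


def send_msg(sock, payload):
    """Send the length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock):
    """First loop reads the header, second loop reads the body."""
    (length,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return recv_exactly(sock, length)


def demo():
    a, b = socket.socketpair()    # stand-in for a real connection
    try:
        send_msg(a, b"first")
        send_msg(a, b"")          # zero-length messages are fine
        send_msg(a, b"second\0with NUL")
        return [recv_msg(b), recv_msg(b), recv_msg(b)]
    finally:
        a.close()
        b.close()
```

Here ``sendall`` does internally what the ``mysend`` loop does by hand, and because each read asks for an exact byte count, back-to-back messages are never over-read.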
+ +The easiest enhancement is to make the first character of the message an +indicator of message type, and have the type determine the length. Now you have +two ``recv``\ s - the first to get (at least) that first character so you can +look up the length, and the second in a loop to get the rest. If you decide to +go the delimited route, you'll be receiving in some arbitrary chunk size, (4096 +or 8192 is frequently a good match for network buffer sizes), and scanning what +you've received for a delimiter. + +One complication to be aware of: if your conversational protocol allows multiple +messages to be sent back to back (without some kind of reply), and you pass +``recv`` an arbitrary chunk size, you may end up reading the start of a +following message. You'll need to put that aside and hold onto it, until it's +needed. + +Prefixing the message with its length (say, as 5 numeric characters) gets more +complex, because (believe it or not), you may not get all 5 characters in one +``recv``. In playing around, you'll get away with it; but in high network loads, +your code will very quickly break unless you use two ``recv`` loops - the first +to determine the length, the second to get the data part of the message. Nasty. +This is also when you'll discover that ``send`` does not always manage to get +rid of everything in one pass. And despite having read this, you will eventually +get bit by it! + +In the interests of space, building your character, (and preserving my +competitive position), these enhancements are left as an exercise for the +reader. Let's move on to cleaning up. + + +Binary Data +----------- + +It is perfectly possible to send binary data over a socket. The major problem is +that not all machines use the same formats for binary data. For example, a +Motorola chip will represent a 16 bit integer with the value 1 as the two hex +bytes 00 01. Intel and DEC, however, are byte-reversed - that same 1 is 01 00.
+Socket libraries have calls for converting 16 and 32 bit integers - ``ntohl, +htonl, ntohs, htons`` where "n" means *network* and "h" means *host*, "s" means +*short* and "l" means *long*. Where network order is host order, these do +nothing, but where the machine is byte-reversed, these swap the bytes around +appropriately. + +In these days of 32 bit machines, the ascii representation of binary data is +frequently smaller than the binary representation. That's because a surprising +amount of the time, all those longs have the value 0, or maybe 1. The string "0" +would be two bytes, while binary is four. Of course, this doesn't fit well with +fixed-length messages. Decisions, decisions. + + +Disconnecting +============= + +Strictly speaking, you're supposed to use ``shutdown`` on a socket before you +``close`` it. The ``shutdown`` is an advisory to the socket at the other end. +Depending on the argument you pass it, it can mean "I'm not going to send +anymore, but I'll still listen", or "I'm not listening, good riddance!". Most +socket libraries, however, are so used to programmers neglecting to use this +piece of etiquette that normally a ``close`` is the same as ``shutdown(); +close()``. So in most situations, an explicit ``shutdown`` is not needed. + +One way to use ``shutdown`` effectively is in an HTTP-like exchange. The client +sends a request and then does a ``shutdown(1)``. This tells the server "This +client is done sending, but can still receive." The server can detect "EOF" by +a receive of 0 bytes. It can assume it has the complete request. The server +sends a reply. If the ``send`` completes successfully then, indeed, the client +was still receiving. + +Python takes the automatic shutdown a step further, and says that when a socket +is garbage collected, it will automatically do a ``close`` if it's needed. But +relying on this is a very bad habit. 
If your socket just disappears without +doing a ``close``, the socket at the other end may hang indefinitely, thinking +you're just being slow. *Please* ``close`` your sockets when you're done. + + +When Sockets Die +---------------- + +Probably the worst thing about using blocking sockets is what happens when the +other side comes down hard (without doing a ``close``). Your socket is likely to +hang. SOCK_STREAM is a reliable protocol, and it will wait a long, long time +before giving up on a connection. If you're using threads, the entire thread is +essentially dead. There's not much you can do about it. As long as you aren't +doing something dumb, like holding a lock while doing a blocking read, the +thread isn't really consuming much in the way of resources. Do *not* try to kill +the thread - part of the reason that threads are more efficient than processes +is that they avoid the overhead associated with the automatic recycling of +resources. In other words, if you do manage to kill the thread, your whole +process is likely to be screwed up. + + +Non-blocking Sockets +==================== + +If you've understood the preceding, you already know most of what you need to +know about the mechanics of using sockets. You'll still use the same calls, in +much the same ways. It's just that, if you do it right, your app will be almost +inside-out. + +In Python, you use ``socket.setblocking(0)`` to make it non-blocking. In C, it's +more complex, (for one thing, you'll need to choose between the BSD flavor +``O_NONBLOCK`` and the almost indistinguishable Posix flavor ``O_NDELAY``, which +is completely different from ``TCP_NODELAY``), but it's the exact same idea. You +do this after creating the socket, but before using it. (Actually, if you're +nuts, you can switch back and forth.) + +The major mechanical difference is that ``send``, ``recv``, ``connect`` and +``accept`` can return without having done anything. You have (of course) a +number of choices.
You can check return code and error codes and generally drive +yourself crazy. If you don't believe me, try it sometime. Your app will grow +large, buggy and suck CPU. So let's skip the brain-dead solutions and do it +right. + +Use ``select``. + +In C, coding ``select`` is fairly complex. In Python, it's a piece of cake, but +it's close enough to the C version that if you understand ``select`` in Python, +you'll have little trouble with it in C. :: + + ready_to_read, ready_to_write, in_error = \ + select.select( + potential_readers, + potential_writers, + potential_errs, + timeout) + +You pass ``select`` three lists: the first contains all sockets that you might +want to try reading; the second all the sockets you might want to try writing +to, and the last (normally left empty) those that you want to check for errors. +You should note that a socket can go into more than one list. The ``select`` +call is blocking, but you can give it a timeout. This is generally a sensible +thing to do - give it a nice long timeout (say a minute) unless you have good +reason to do otherwise. + +In return, you will get three lists. They have the sockets that are actually +readable, writable and in error. Each of these lists is a subset (possibly +empty) of the corresponding list you passed in. And if you put a socket in more +than one input list, it will only be (at most) in one output list. + +If a socket is in the output readable list, you can be +as-close-to-certain-as-we-ever-get-in-this-business that a ``recv`` on that +socket will return *something*. Same idea for the writable list. You'll be able +to send *something*. Maybe not all you want to, but *something* is better than +nothing. (Actually, any reasonably healthy socket will return as writable - it +just means outbound network buffer space is available.) + +If you have a "server" socket, put it in the potential_readers list. If it comes +out in the readable list, your ``accept`` will (almost certainly) work.
If you +have created a new socket to ``connect`` to someone else, put it in the +potential_writers list. If it shows up in the writable list, you have a decent +chance that it has connected. + +One very nasty problem with ``select``: if somewhere in those input lists of +sockets is one which has died a nasty death, the ``select`` will fail. You then +need to loop through every single damn socket in all those lists and do a +``select([sock],[],[],0)`` until you find the bad one. That timeout of 0 means +it won't take long, but it's ugly. + +Actually, ``select`` can be handy even with blocking sockets. It's one way of +determining whether you will block - the socket returns as readable when there's +something in the buffers. However, this still doesn't help with the problem of +determining whether the other end is done, or just busy with something else. + +**Portability alert**: On Unix, ``select`` works both with the sockets and +files. Don't try this on Windows. On Windows, ``select`` works with sockets +only. Also note that in C, many of the more advanced socket options are done +differently on Windows. In fact, on Windows I usually use threads (which work +very, very well) with my sockets. Face it, if you want any kind of performance, +your code will look very different on Windows than on Unix. (I haven't the +foggiest how you do this stuff on a Mac.) + + +Performance +----------- + +There's no question that the fastest sockets code uses non-blocking sockets and +select to multiplex them. You can put together something that will saturate a +LAN connection without putting any strain on the CPU. The trouble is that an app +written this way can't do much of anything else - it needs to be ready to +shuffle bytes around at all times. + +Assuming that your app is actually supposed to do something more than that, +threading is the optimal solution, (and using non-blocking sockets will be +faster than using blocking sockets).
Unfortunately, threading support in Unixes +varies both in API and quality. So the normal Unix solution is to fork a +subprocess to deal with each connection. The overhead for this is significant +(and don't do this on Windows - the overhead of process creation is enormous +there). It also means that unless each subprocess is completely independent, +you'll need to use another form of IPC, say a pipe, or shared memory and +semaphores, to communicate between the parent and child processes. + +Finally, remember that even though blocking sockets are somewhat slower than +non-blocking, in many cases they are the "right" solution. After all, if your +app is driven by the data it receives over a socket, there's not much sense in +complicating the logic just so your app can wait on ``select`` instead of +``recv``. + diff --git a/Doc/howto/unicode.rst b/Doc/howto/unicode.rst new file mode 100644 index 0000000..16bd5a8 --- /dev/null +++ b/Doc/howto/unicode.rst @@ -0,0 +1,732 @@ +***************** + Unicode HOWTO +***************** + +:Release: 1.02 + +This HOWTO discusses Python's support for Unicode, and explains various problems +that people commonly encounter when trying to work with Unicode. + +Introduction to Unicode +======================= + +History of Character Codes +-------------------------- + +In 1968, the American Standard Code for Information Interchange, better known by +its acronym ASCII, was standardized. ASCII defined numeric codes for various +characters, with the numeric values running from 0 to +127. For example, the lowercase letter 'a' is assigned 97 as its code +value. + +ASCII was an American-developed standard, so it only defined unaccented +characters. There was an 'e', but no 'é' or 'Í'. This meant that languages +which required accented characters couldn't be faithfully represented in ASCII. 
+(Actually the missing accents matter for English, too, which contains words such +as 'naïve' and 'café', and some publications have house styles which require +spellings such as 'coöperate'.) + +For a while people just wrote programs that didn't display accents. I remember +looking at Apple ][ BASIC programs, published in French-language publications in +the mid-1980s, that had lines like these:: + + PRINT "FICHER EST COMPLETE." + PRINT "CARACTERE NON ACCEPTE." + +Those messages should contain accents, and they just look wrong to someone who +can read French. + +In the 1980s, almost all personal computers were 8-bit, meaning that bytes could +hold values ranging from 0 to 255. ASCII codes only went up to 127, so some +machines assigned values between 128 and 255 to accented characters. Different +machines had different codes, however, which led to problems exchanging files. +Eventually various commonly used sets of values for the 128-255 range emerged. +Some were true standards, defined by the International Standards Organization, +and some were **de facto** conventions that were invented by one company or +another and managed to catch on. + +255 characters aren't very many. For example, you can't fit both the accented +characters used in Western Europe and the Cyrillic alphabet used for Russian +into the 128-255 range because there are more than 127 such characters. + +You could write files using different codes (all your Russian files in a coding +system called KOI8, all your French files in a different coding system called +Latin1), but what if you wanted to write a French document that quotes some +Russian text? In the 1980s people began to want to solve this problem, and the +Unicode standardization effort began. + +Unicode started out using 16-bit characters instead of 8-bit characters. 
16 +bits means you have 2^16 = 65,536 distinct values available, making it possible +to represent many different characters from many different alphabets; an initial +goal was to have Unicode contain the alphabets for every single human language. +It turns out that even 16 bits isn't enough to meet that goal, and the modern +Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in +base-16). + +There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were +originally separate efforts, but the specifications were merged with the 1.1 +revision of Unicode. + +(This discussion of Unicode's history is highly simplified. I don't think the +average Python programmer needs to worry about the historical details; consult +the Unicode consortium site listed in the References for more information.) + + +Definitions +----------- + +A **character** is the smallest possible component of a text. 'A', 'B', 'C', +etc., are all different characters. So are 'È' and 'Í'. Characters are +abstractions, and vary depending on the language or context you're talking +about. For example, the symbol for ohms (Ω) is usually drawn much like the +capital letter omega (Ω) in the Greek alphabet (they may even be the same in +some fonts), but these are two different characters that have different +meanings. + +The Unicode standard describes how characters are represented by **code +points**. A code point is an integer value, usually denoted in base 16. In the +standard, a code point is written using the notation U+12ca to mean the +character with value 0x12ca (4810 decimal). The Unicode standard contains a lot +of tables listing characters and their corresponding code points:: + + 0061 'a'; LATIN SMALL LETTER A + 0062 'b'; LATIN SMALL LETTER B + 0063 'c'; LATIN SMALL LETTER C + ... + 007B '{'; LEFT CURLY BRACKET + +Strictly, these definitions imply that it's meaningless to say 'this is +character U+12ca'. 
U+12ca is a code point, which represents some particular +character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In +informal contexts, this distinction between code points and characters will +sometimes be forgotten. + +A character is represented on a screen or on paper by a set of graphical +elements that's called a **glyph**. The glyph for an uppercase A, for example, +is two diagonal strokes and a horizontal stroke, though the exact details will +depend on the font being used. Most Python code doesn't need to worry about +glyphs; figuring out the correct glyph to display is generally the job of a GUI +toolkit or a terminal's font renderer. + + +Encodings +--------- + +To summarize the previous section: a Unicode string is a sequence of code +points, which are numbers from 0 to 0x10ffff. This sequence needs to be +represented as a set of bytes (meaning, values from 0-255) in memory. The rules +for translating a Unicode string into a sequence of bytes are called an +**encoding**. + +The first encoding you might think of is an array of 32-bit integers. In this +representation, the string "Python" would look like this:: + + P y t h o n + 0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00 + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 + +This representation is straightforward but using it presents a number of +problems. + +1. It's not portable; different processors order the bytes differently. + +2. It's very wasteful of space. In most texts, the majority of the code points + are less than 127, or less than 255, so a lot of space is occupied by zero + bytes. The above string takes 24 bytes compared to the 6 bytes needed for an + ASCII representation. Increased RAM usage doesn't matter too much (desktop + computers have megabytes of RAM, and strings aren't usually that large), but + expanding our usage of disk and network bandwidth by a factor of 4 is + intolerable. + +3. 
It's not compatible with existing C functions such as ``strlen()``, so a new + family of wide string functions would need to be used. + +4. Many Internet standards are defined in terms of textual data, and can't + handle content with embedded zero bytes. + +Generally people don't use this encoding, instead choosing other encodings that +are more efficient and convenient. + +Encodings don't have to handle every possible Unicode character, and most +encodings don't. For example, Python's default encoding is the 'ascii' +encoding. The rules for converting a Unicode string into the ASCII encoding are +simple; for each code point: + +1. If the code point is < 128, each byte is the same as the value of the code + point. + +2. If the code point is 128 or greater, the Unicode string can't be represented + in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this + case.) + +Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points +0-255 are identical to the Latin-1 values, so converting to this encoding simply +requires converting code points to byte values; if a code point larger than 255 +is encountered, the string can't be encoded into Latin-1. + +Encodings don't have to be simple one-to-one mappings like Latin-1. Consider +IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one +block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145 +through 153. If you wanted to use EBCDIC as an encoding, you'd probably use +some sort of lookup table to perform the conversion, but this is largely an +internal detail. + +UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode +Transformation Format", and the '8' means that 8-bit numbers are used in the +encoding. (There's also a UTF-16 encoding, but it's less frequently used than +UTF-8.) UTF-8 uses the following rules: + +1. If the code point is <128, it's represented by the corresponding byte value. +2. 
If the code point is between 128 and 0x7ff, it's turned into two byte values
+   between 128 and 255.
+3. Code points >0x7ff are turned into three- or four-byte sequences, where each
+   byte of the sequence is between 128 and 255.
+
+UTF-8 has several convenient properties:
+
+1. It can handle any Unicode code point.
+2. A Unicode string is turned into a string of bytes containing no embedded zero
+   bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
+   processed by C functions such as ``strcpy()`` and sent through protocols that
+   can't handle zero bytes.
+3. A string of ASCII text is also valid UTF-8 text.
+4. UTF-8 is fairly compact; the majority of commonly used characters are encoded
+   in one or two bytes, and values less than 128 occupy only a single byte.
+5. If bytes are corrupted or lost, it's possible to determine the start of the
+   next UTF-8-encoded code point and resynchronize. It's also unlikely that
+   random 8-bit data will look like valid UTF-8.
+
+
+
+References
+----------
+
+The Unicode Consortium site at <http://www.unicode.org> has character charts, a
+glossary, and PDF versions of the Unicode specification. Be prepared for some
+difficult reading. <http://www.unicode.org/history/> is a chronology of the
+origin and development of Unicode.
+
+To help understand the standard, Jukka Korpela has written an introductory guide
+to reading the Unicode character tables, available at
+<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
+
+Roman Czyborra wrote another explanation of Unicode's basic principles; it's at
+<http://czyborra.com/unicode/characters.html>. Czyborra has written a number of
+other Unicode-related documents, available from <http://www.cyzborra.com>.
+
+Two other good introductory articles were written by Joel Spolsky
+<http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff
+<http://www.jorendorff.com/articles/unicode/>.
If this introduction didn't make +things clear to you, you should try reading one of these alternate articles +before continuing. + +Wikipedia entries are often helpful; see the entries for "character encoding" +<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8 +<http://en.wikipedia.org/wiki/UTF-8>, for example. + + +Python's Unicode Support +======================== + +Now that you've learned the rudiments of Unicode, we can look at Python's +Unicode features. + + +The Unicode Type +---------------- + +Unicode strings are expressed as instances of the :class:`unicode` type, one of +Python's repertoire of built-in types. It derives from an abstract type called +:class:`basestring`, which is also an ancestor of the :class:`str` type; you can +therefore check if a value is a string type with ``isinstance(value, +basestring)``. Under the hood, Python represents Unicode strings as either 16- +or 32-bit integers, depending on how the Python interpreter was compiled. + +The :func:`unicode` constructor has the signature ``unicode(string[, encoding, +errors])``. All of its arguments should be 8-bit strings. The first argument +is converted to Unicode using the specified encoding; if you leave off the +``encoding`` argument, the ASCII encoding is used for the conversion, so +characters greater than 127 will be treated as errors:: + + >>> unicode('abcdef') + u'abcdef' + >>> s = unicode('abcdef') + >>> type(s) + <type 'unicode'> + >>> unicode('abcdef' + chr(255)) + Traceback (most recent call last): + File "<stdin>", line 1, in ? + UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: + ordinal not in range(128) + +The ``errors`` argument specifies the response when the input string can't be +converted according to the encoding's rules. Legal values for this argument are +'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD, +'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the +Unicode result). 
The following examples show the differences:: + + >>> unicode('\x80abc', errors='strict') + Traceback (most recent call last): + File "<stdin>", line 1, in ? + UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: + ordinal not in range(128) + >>> unicode('\x80abc', errors='replace') + u'\ufffdabc' + >>> unicode('\x80abc', errors='ignore') + u'abc' + +Encodings are specified as strings containing the encoding's name. Python 2.4 +comes with roughly 100 different encodings; see the Python Library Reference at +<http://docs.python.org/lib/standard-encodings.html> for a list. Some encodings +have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all +synonyms for the same encoding. + +One-character Unicode strings can also be created with the :func:`unichr` +built-in function, which takes integers and returns a Unicode string of length 1 +that contains the corresponding code point. The reverse operation is the +built-in :func:`ord` function that takes a one-character Unicode string and +returns the code point value:: + + >>> unichr(40960) + u'\ua000' + >>> ord(u'\ua000') + 40960 + +Instances of the :class:`unicode` type have many of the same methods as the +8-bit string type for operations such as searching and formatting:: + + >>> s = u'Was ever feather so lightly blown to and fro as this multitude?' + >>> s.count('e') + 5 + >>> s.find('feather') + 9 + >>> s.find('bird') + -1 + >>> s.replace('feather', 'sand') + u'Was ever sand so lightly blown to and fro as this multitude?' + >>> s.upper() + u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?' + +Note that the arguments to these methods can be Unicode strings or 8-bit +strings. 8-bit strings will be converted to Unicode before carrying out the +operation; Python's default ASCII encoding will be used, so characters greater +than 127 will cause an exception:: + + >>> s.find('Was\x9f') + Traceback (most recent call last): + File "<stdin>", line 1, in ? 
+ UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128) + >>> s.find(u'Was\x9f') + -1 + +Much Python code that operates on strings will therefore work with Unicode +strings without requiring any changes to the code. (Input and output code needs +more updating for Unicode; more on this later.) + +Another important method is ``.encode([encoding], [errors='strict'])``, which +returns an 8-bit string version of the Unicode string, encoded in the requested +encoding. The ``errors`` parameter is the same as the parameter of the +``unicode()`` constructor, with one additional possibility; as well as 'strict', +'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's +character references. The following example shows the different results:: + + >>> u = unichr(40960) + u'abcd' + unichr(1972) + >>> u.encode('utf-8') + '\xea\x80\x80abcd\xde\xb4' + >>> u.encode('ascii') + Traceback (most recent call last): + File "<stdin>", line 1, in ? + UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128) + >>> u.encode('ascii', 'ignore') + 'abcd' + >>> u.encode('ascii', 'replace') + '?abcd?' + >>> u.encode('ascii', 'xmlcharrefreplace') + 'ꀀabcd޴' + +Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that +interprets the string using the given encoding:: + + >>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string + >>> utf8_version = u.encode('utf-8') # Encode as UTF-8 + >>> type(utf8_version), utf8_version + (<type 'str'>, '\xea\x80\x80abcd\xde\xb4') + >>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8 + >>> u == u2 # The two strings match + True + +The low-level routines for registering and accessing the available encodings are +found in the :mod:`codecs` module. 
However, the encoding and decoding functions +returned by this module are usually more low-level than is comfortable, so I'm +not going to describe the :mod:`codecs` module here. If you need to implement a +completely new encoding, you'll need to learn about the :mod:`codecs` module +interfaces, but implementing encodings is a specialized task that also won't be +covered here. Consult the Python documentation to learn more about this module. + +The most commonly used part of the :mod:`codecs` module is the +:func:`codecs.open` function which will be discussed in the section on input and +output. + + +Unicode Literals in Python Source Code +-------------------------------------- + +In Python source code, Unicode literals are written as strings prefixed with the +'u' or 'U' character: ``u'abcdefghijk'``. Specific code points can be written +using the ``\u`` escape sequence, which is followed by four hex digits giving +the code point. The ``\U`` escape sequence is similar, but expects 8 hex +digits, not 4. + +Unicode literals can also use the same escape sequences as 8-bit strings, +including ``\x``, but ``\x`` only takes two hex digits so it can't express an +arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777. + +:: + + >>> s = u"a\xac\u1234\u20ac\U00008000" + ^^^^ two-digit hex escape + ^^^^^^ four-digit Unicode escape + ^^^^^^^^^^ eight-digit Unicode escape + >>> for c in s: print ord(c), + ... + 97 172 4660 8364 32768 + +Using escape sequences for code points greater than 127 is fine in small doses, +but becomes an annoyance if you're using many accented characters, as you would +in a program with messages in French or some other accent-using language. You +can also assemble strings using the :func:`unichr` built-in function, but this is +even more tedious. + +Ideally, you'd want to be able to write literals in your language's natural +encoding. 
You could then edit Python source code with your favorite editor +which would display the accented characters naturally, and have the right +characters used at runtime. + +Python supports writing Unicode literals in any encoding, but you have to +declare the encoding being used. This is done by including a special comment as +either the first or second line of the source file:: + + #!/usr/bin/env python + # -*- coding: latin-1 -*- + + u = u'abcdé' + print ord(u[-1]) + +The syntax is inspired by Emacs's notation for specifying variables local to a +file. Emacs supports many different variables, but Python only supports +'coding'. The ``-*-`` symbols indicate that the comment is special; within +them, you must supply the name ``coding`` and the name of your chosen encoding, +separated by ``':'``. + +If you don't include such a comment, the default encoding used will be ASCII. +Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default +encoding for string literals; in Python 2.4, characters greater than 127 still +work but result in a warning. For example, the following program has no +encoding declaration:: + + #!/usr/bin/env python + u = u'abcdé' + print ord(u[-1]) + +When you run it with Python 2.4, it will output the following warning:: + + amk:~$ python p263.py + sys:1: DeprecationWarning: Non-ASCII character '\xe9' + in file p263.py on line 2, but no encoding declared; + see http://www.python.org/peps/pep-0263.html for details + + +Unicode Properties +------------------ + +The Unicode specification includes a database of information about code points. +For each code point that's defined, the information includes the character's +name, its category, the numeric value if applicable (Unicode has characters +representing the Roman numerals and fractions such as one-third and +four-fifths). There are also properties related to the code point's use in +bidirectional text and other display-related properties. 
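As a small illustration of these properties, the sketch below looks up the name, category, numeric value, and bidirectional class of a few code points with the standard ``unicodedata`` module. The helper name ``describe`` and the sample characters are choices made for this example only; the code avoids ``unichr()`` so that it should run unchanged on Python 2 and Python 3.

```python
import unicodedata

def describe(ch):
    """Return (name, category, numeric value or None, bidirectional class)."""
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.numeric(ch, None),   # None when there is no numeric value
            unicodedata.bidirectional(ch))

# Illustrative sample: a fraction, a Roman numeral, a Latin and a Hebrew letter.
for ch in (u'\u2153', u'\u2167', u'a', u'\u05d0'):
    print('U+%04X %r' % (ord(ch), describe(ch)))
```

The bidirectional class is a short code such as ``'L'`` (left-to-right) or ``'R'`` (right-to-left); display code uses it to decide the direction in which a run of text is laid out.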
+ +The following program displays some information about several characters, and +prints the numeric value of one particular character:: + + import unicodedata + + u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231) + + for i, c in enumerate(u): + print i, '%04x' % ord(c), unicodedata.category(c), + print unicodedata.name(c) + + # Get numeric value of second character + print unicodedata.numeric(u[1]) + +When run, this prints:: + + 0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE + 1 0bf2 No TAMIL NUMBER ONE THOUSAND + 2 0f84 Mn TIBETAN MARK HALANTA + 3 1770 Lo TAGBANWA LETTER SA + 4 33af So SQUARE RAD OVER S SQUARED + 1000.0 + +The category codes are abbreviations describing the nature of the character. +These are grouped into categories such as "Letter", "Number", "Punctuation", or +"Symbol", which in turn are broken up into subcategories. To take the codes +from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means +"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol, +other". See +<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> for a +list of category codes. + +References +---------- + +The Unicode and 8-bit string types are described in the Python library reference +at :ref:`typesseq`. + +The documentation for the :mod:`unicodedata` module. + +The documentation for the :mod:`codecs` module. + +Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and +Unicode". A PDF version of his slides is available at +<http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>, and is an +excellent overview of the design of Python's Unicode features. + + +Reading and Writing Unicode Data +================================ + +Once you've written some code that works with Unicode data, the next problem is +input/output. How do you get Unicode strings into your program, and how do you +convert Unicode into a form suitable for storage or transmission? 
+ +It's possible that you may not need to do anything depending on your input +sources and output destinations; you should check whether the libraries used in +your application support Unicode natively. XML parsers often return Unicode +data, for example. Many relational databases also support Unicode-valued +columns and can return Unicode values from an SQL query. + +Unicode data is usually converted to a particular encoding before it gets +written to disk or sent over a socket. It's possible to do all the work +yourself: open a file, read an 8-bit string from it, and convert the string with +``unicode(str, encoding)``. However, the manual approach is not recommended. + +One problem is the multi-byte nature of encodings; one Unicode character can be +represented by several bytes. If you want to read the file in arbitrary-sized +chunks (say, 1K or 4K), you need to write error-handling code to catch the case +where only part of the bytes encoding a single Unicode character are read at the +end of a chunk. One solution would be to read the entire file into memory and +then perform the decoding, but that prevents you from working with files that +are extremely large; if you need to read a 2Gb file, you need 2Gb of RAM. +(More, really, since for at least a moment you'd need to have both the encoded +string and its Unicode version in memory.) + +The solution would be to use the low-level decoding interface to catch the case +of partial coding sequences. The work of implementing this has already been +done for you: the :mod:`codecs` module includes a version of the :func:`open` +function that returns a file-like object that assumes the file's contents are in +a specified encoding and accepts Unicode parameters for methods such as +``.read()`` and ``.write()``. + +The function's parameters are ``open(filename, mode='rb', encoding=None, +errors='strict', buffering=1)``. 
``mode`` can be ``'r'``, ``'w'``, or ``'a'``, +just like the corresponding parameter to the regular built-in ``open()`` +function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel +to the standard function's parameter. ``encoding`` is a string giving the +encoding to use; if it's left as ``None``, a regular Python file object that +accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and +data written to or read from the wrapper object will be converted as needed. +``errors`` specifies the action for encoding errors and can be one of the usual +values of 'strict', 'ignore', and 'replace'. + +Reading Unicode from a file is therefore simple:: + + import codecs + f = codecs.open('unicode.rst', encoding='utf-8') + for line in f: + print repr(line) + +It's also possible to open files in update mode, allowing both reading and +writing:: + + f = codecs.open('test', encoding='utf-8', mode='w+') + f.write(u'\u4500 blah blah blah\n') + f.seek(0) + print repr(f.readline()[:1]) + f.close() + +Unicode character U+FEFF is used as a byte-order mark (BOM), and is often +written as the first character of a file in order to assist with autodetection +of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be +present at the start of a file; when such an encoding is used, the BOM will be +automatically written as the first character and will be silently dropped when +the file is read. There are variants of these encodings, such as 'utf-16-le' +and 'utf-16-be' for little-endian and big-endian encodings, that specify one +particular byte ordering and don't skip the BOM. + + +Unicode filenames +----------------- + +Most of the operating systems in common use today support filenames that contain +arbitrary Unicode characters. Usually this is implemented by converting the +Unicode string into some encoding that varies depending on the system. 
For +example, MacOS X uses UTF-8 while Windows uses a configurable encoding; on +Windows, Python uses the name "mbcs" to refer to whatever the currently +configured encoding is. On Unix systems, there will only be a filesystem +encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if +you haven't, the default encoding is ASCII. + +The :func:`sys.getfilesystemencoding` function returns the encoding to use on +your current system, in case you want to do the encoding manually, but there's +not much reason to bother. When opening a file for reading or writing, you can +usually just provide the Unicode string as the filename, and it will be +automatically converted to the right encoding for you:: + + filename = u'filename\u4500abc' + f = open(filename, 'w') + f.write('blah\n') + f.close() + +Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode +filenames. + +:func:`os.listdir`, which returns filenames, raises an issue: should it return +the Unicode version of filenames, or should it return 8-bit strings containing +the encoded versions? :func:`os.listdir` will do both, depending on whether you +provided the directory path as an 8-bit string or a Unicode string. If you pass +a Unicode string as the path, filenames will be decoded using the filesystem's +encoding and a list of Unicode strings will be returned, while passing an 8-bit +path will return the 8-bit versions of the filenames. For example, assuming the +default filesystem encoding is UTF-8, running the following program:: + + fn = u'filename\u4500abc' + f = open(fn, 'w') + f.close() + + import os + print os.listdir('.') + print os.listdir(u'.') + +will produce the following output:: + + amk:~$ python t.py + ['.svn', 'filename\xe4\x94\x80abc', ...] + [u'.svn', u'filename\u4500abc', ...] + +The first list contains UTF-8-encoded filenames, and the second list contains +the Unicode versions. 
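The relationship between the two lists can be checked directly. The following is a minimal sketch, assuming a temporary directory whose path is representable in the filesystem encoding; the helper name ``list_both_ways`` and the file name ``example.txt`` are made up for the illustration. It lists the same directory once with an 8-bit path and once with a Unicode path, and is written to run on both Python 2 and Python 3.

```python
import os
import shutil
import sys
import tempfile

def list_both_ways(directory):
    """Return (8-bit names, Unicode names) for a directory given as Unicode."""
    fs_encoding = sys.getfilesystemencoding()
    byte_names = sorted(os.listdir(directory.encode(fs_encoding)))
    text_names = sorted(os.listdir(directory))
    return byte_names, text_names

tmp = tempfile.mkdtemp()
if isinstance(tmp, bytes):            # Python 2 returns an 8-bit string here
    tmp = tmp.decode(sys.getfilesystemencoding())
try:
    # Create one empty file so that there is something to list.
    open(os.path.join(tmp, u'example.txt'), 'w').close()
    byte_names, text_names = list_both_ways(tmp)
    print(repr(byte_names))
    print(repr(text_names))
finally:
    shutil.rmtree(tmp)
```

Decoding each 8-bit name with the filesystem encoding should give back the corresponding Unicode name, which is the conversion ``os.listdir`` performs when it is handed a Unicode path.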
+
+
+
+Tips for Writing Unicode-aware Programs
+---------------------------------------
+
+This section provides some suggestions on writing software that deals with
+Unicode.
+
+The most important tip is:
+
+    Software should only work with Unicode strings internally, converting to a
+    particular encoding on output.
+
+If you attempt to write processing functions that accept both Unicode and 8-bit
+strings, you will find your program vulnerable to bugs wherever you combine the
+two different kinds of strings. Python's default encoding is ASCII, so whenever
+a byte with a value > 127 is in the input data, you'll get a
+:exc:`UnicodeDecodeError` because that byte can't be handled by the ASCII
+encoding.
+
+It's easy to miss such problems if you only test your software with data that
+doesn't contain any accents; everything will seem to work, but there's actually
+a bug in your program waiting for the first user who attempts to use characters
+> 127. A second tip, therefore, is:
+
+    Include characters > 127 and, even better, characters > 255 in your test
+    data.
+
+When using data coming from a web browser or some other untrusted source, a
+common technique is to check for illegal characters in a string before using the
+string in a generated command line or storing it in a database. If you're doing
+this, be careful to check the string once it's in the form that will be used or
+stored; it's possible for encodings to be used to disguise characters. This is
+especially true if the input data also specifies the encoding; many encodings
+leave the commonly checked-for characters alone, but Python includes some
+encodings such as ``'base64'`` that modify every single character.
+
+For example, let's say you have a content management system that takes a Unicode
+filename, and you want to disallow paths with a '/' character.
You might write +this code:: + + def read_file (filename, encoding): + if '/' in filename: + raise ValueError("'/' not allowed in filenames") + unicode_name = filename.decode(encoding) + f = open(unicode_name, 'r') + # ... return contents of file ... + +However, if an attacker could specify the ``'base64'`` encoding, they could pass +``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string +``'/etc/passwd'``, to read a system file. The above code looks for ``'/'`` +characters in the encoded form and misses the dangerous character in the +resulting decoded form. + +References +---------- + +The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware +Applications in Python" are available at +<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf> +and discuss questions of character encodings as well as how to internationalize +and localize an application. + + +Revision History and Acknowledgements +===================================== + +Thanks to the following people who have noted errors or offered suggestions on +this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler, +Marc-André Lemburg, Martin von Löwis, Chad Whitacre. + +Version 1.0: posted August 5 2005. + +Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds +several links. + +Version 1.02: posted August 16 2005. Corrects factual errors. + + +.. comment Additional topic: building Python w/ UCS2 or UCS4 support +.. comment Describe obscure -U switch somewhere? +.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter + +.. 
comment
+   Original outline:
+
+   - [ ] Unicode introduction
+       - [ ] ASCII
+       - [ ] Terms
+           - [ ] Character
+           - [ ] Code point
+       - [ ] Encodings
+           - [ ] Common encodings: ASCII, Latin-1, UTF-8
+   - [ ] Unicode Python type
+       - [ ] Writing unicode literals
+           - [ ] Obscurity: -U switch
+       - [ ] Built-ins
+           - [ ] unichr()
+           - [ ] ord()
+           - [ ] unicode() constructor
+       - [ ] Unicode type
+           - [ ] encode(), decode() methods
+   - [ ] Unicodedata module for character properties
+   - [ ] I/O
+       - [ ] Reading/writing Unicode data into files
+           - [ ] Byte-order marks
+       - [ ] Unicode filenames
+   - [ ] Writing Unicode programs
+       - [ ] Do everything in Unicode
+       - [ ] Declaring source code encodings (PEP 263)
+   - [ ] Other issues
+       - [ ] Building Python (UCS2, UCS4)
diff --git a/Doc/howto/urllib2.rst b/Doc/howto/urllib2.rst
new file mode 100644
index 0000000..dc20b02
--- /dev/null
+++ b/Doc/howto/urllib2.rst
@@ -0,0 +1,578 @@
+************************************************
+  HOWTO Fetch Internet Resources Using urllib2
+************************************************
+
+:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
+
+.. note::
+
+    There is a French translation of an earlier revision of this
+    HOWTO, available at `urllib2 - Le Manuel manquant
+    <http://www.voidspace/python/articles/urllib2_francais.shtml>`_.
+
+
+
+Introduction
+============
+
+.. sidebar:: Related Articles
+
+    You may also find the following article on fetching web resources with
+    Python useful:
+
+    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_
+
+        A tutorial on *Basic Authentication*, with examples in Python.
+
+**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
+(Uniform Resource Locators). It offers a very simple interface, in the form of
+the *urlopen* function. This is capable of fetching URLs using a variety of
+different protocols.
It also offers a slightly more complex interface for
+handling common situations - like basic authentication, cookies, proxies and so
+on. These are provided by objects called handlers and openers.
+
+urllib2 supports fetching URLs for many "URL schemes" (identified by the string
+before the ":" in the URL - for example "ftp" is the URL scheme of
+"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
+This tutorial focuses on the most common case, HTTP.
+
+For straightforward situations *urlopen* is very easy to use. But as soon as you
+encounter errors or non-trivial cases when opening HTTP URLs, you will need some
+understanding of the HyperText Transfer Protocol. The most comprehensive and
+authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
+not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
+with enough detail about HTTP to help you through. It is not intended to replace
+the :mod:`urllib2` docs, but is supplementary to them.
+
+
+Fetching URLs
+=============
+
+The simplest way to use urllib2 is as follows::
+
+    import urllib2
+    response = urllib2.urlopen('http://python.org/')
+    html = response.read()
+
+Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
+could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
+purpose of this tutorial to explain the more complicated cases, concentrating on
+HTTP.
+
+HTTP is based on requests and responses - the client makes requests and servers
+send responses. urllib2 mirrors this with a ``Request`` object which represents
+the HTTP request you are making. In its simplest form you create a Request
+object that specifies the URL you want to fetch. Calling ``urlopen`` with this
+Request object returns a response object for the URL requested.
This response is
+a file-like object, which means you can for example call ``.read()`` on the
+response::
+
+    import urllib2
+
+    req = urllib2.Request('http://www.voidspace.org.uk')
+    response = urllib2.urlopen(req)
+    the_page = response.read()
+
+Note that urllib2 makes use of the same Request interface to handle all URL
+schemes. For example, you can make an FTP request like so::
+
+    req = urllib2.Request('ftp://example.com/')
+
+In the case of HTTP, there are two extra things that Request objects allow you
+to do: First, you can pass data to be sent to the server. Second, you can pass
+extra information ("metadata") *about* the data or about the request itself, to
+the server - this information is sent as HTTP "headers". Let's look at each of
+these in turn.
+
+Data
+----
+
+Sometimes you want to send data to a URL (often the URL will refer to a CGI
+(Common Gateway Interface) script [#]_ or other web application). With HTTP,
+this is often done using what's known as a **POST** request. This is often what
+your browser does when you submit an HTML form that you filled in on the web. Not
+all POSTs have to come from forms: you can use a POST to transmit arbitrary data
+to your own application. In the common case of HTML forms, the data needs to be
+encoded in a standard way, and then passed to the Request object as the ``data``
+argument. The encoding is done using a function from the ``urllib`` library,
+*not* from ``urllib2``. ::
+
+    import urllib
+    import urllib2
+
+    url = 'http://www.someserver.com/cgi-bin/register.cgi'
+    values = {'name' : 'Michael Foord',
+              'location' : 'Northampton',
+              'language' : 'Python' }
+
+    data = urllib.urlencode(values)
+    req = urllib2.Request(url, data)
+    response = urllib2.urlopen(req)
+    the_page = response.read()
+
+Note that other encodings are sometimes required (e.g. for file upload from HTML
+forms - see `HTML Specification, Form Submission
+<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
+details).
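To see what the standard form encoding produces without contacting a server, the sketch below calls ``urlencode`` on its own. The field values are illustrative, and the ``try``/``except`` import is an addition so that the same snippet also runs on Python 3, where the function lives in ``urllib.parse``; a list of pairs is used instead of a dictionary so that the order of the fields is fixed.

```python
try:
    from urllib import urlencode          # Python 2, as used in this HOWTO
except ImportError:
    from urllib.parse import urlencode    # the same function on Python 3

# A list of (name, value) pairs keeps the field order predictable.
fields = [('name', 'Michael Foord'),
          ('location', 'Northampton'),
          ('language', 'Python')]

encoded = urlencode(fields)
print(encoded)    # name=Michael+Foord&location=Northampton&language=Python

# Characters that have a special meaning in a query string are escaped.
escaped = urlencode([('q', 'spam & eggs')])
print(escaped)    # q=spam+%26+eggs
```

The resulting string is what gets passed as the ``data`` argument of a POST request, or appended after a ``?`` for a GET request.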
+ +If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One +way in which GET and POST requests differ is that POST requests often have +"side-effects": they change the state of the system in some way (for example by +placing an order with the website for a hundredweight of tinned spam to be +delivered to your door). Though the HTTP standard makes it clear that POSTs are +intended to *always* cause side-effects, and GET requests *never* to cause +side-effects, nothing prevents a GET request from having side-effects, nor a +POST request from having no side-effects. Data can also be passed in an HTTP +GET request by encoding it in the URL itself. + +This is done as follows:: + + >>> import urllib2 + >>> import urllib + >>> data = {} + >>> data['name'] = 'Somebody Here' + >>> data['location'] = 'Northampton' + >>> data['language'] = 'Python' + >>> url_values = urllib.urlencode(data) + >>> print url_values + name=Somebody+Here&language=Python&location=Northampton + >>> url = 'http://www.example.com/example.cgi' + >>> full_url = url + '?' + url_values + >>> data = urllib2.urlopen(full_url) + +Notice that the full URL is created by adding a ``?`` to the URL, followed by +the encoded values. + +Headers +------- + +We'll discuss here one particular HTTP header, to illustrate how to add headers +to your HTTP request. + +Some websites [#]_ dislike being browsed by programs, or send different versions +to different browsers [#]_. By default urllib2 identifies itself as +``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version +numbers of the Python release, +e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain +not work. The way a browser identifies itself is through the +``User-Agent`` header [#]_. When you create a Request object you can +pass a dictionary of headers in. The following example makes the same +request as above, but identifies itself as a version of Internet +Explorer [#]_. 
:: + + import urllib + import urllib2 + + url = 'http://www.someserver.com/cgi-bin/register.cgi' + user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' + values = {'name' : 'Michael Foord', + 'location' : 'Northampton', + 'language' : 'Python' } + headers = { 'User-Agent' : user_agent } + + data = urllib.urlencode(values) + req = urllib2.Request(url, data, headers) + response = urllib2.urlopen(req) + the_page = response.read() + +The response also has two useful methods. See the section on `info and geturl`_ +which comes after we have a look at what happens when things go wrong. + + +Handling Exceptions +=================== + +*urlopen* raises ``URLError`` when it cannot handle a response (though as usual +with Python APIs, built-in exceptions such as ValueError, TypeError etc. may also +be raised). + +``HTTPError`` is the subclass of ``URLError`` raised in the specific case of +HTTP URLs. + +URLError +-------- + +Often, URLError is raised because there is no network connection (no route to +the specified server), or the specified server doesn't exist. In this case, the +exception raised will have a 'reason' attribute, which is a tuple containing an +error code and a text error message. + +e.g. :: + + >>> req = urllib2.Request('http://www.pretend_server.org') + >>> try: + ... urllib2.urlopen(req) + ... except urllib2.URLError, e: + ... print e.reason + ... + (4, 'getaddrinfo failed') + + +HTTPError +--------- + +Every HTTP response from the server contains a numeric "status code". Sometimes +the status code indicates that the server is unable to fulfil the request. The +default handlers will handle some of these responses for you (for example, if +the response is a "redirection" that requests the client fetch the document from +a different URL, urllib2 will handle that for you). For those it can't handle, +urlopen will raise an ``HTTPError``. Typical errors include '404' (page not +found), '403' (request forbidden), and '401' (authentication required). 
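The relationship between the two exception classes can be checked without any network access. The following is only a sketch under stated assumptions: ``describe`` is a hypothetical helper, the exceptions are constructed by hand instead of being raised by ``urlopen``, and the fallback import is an assumption for interpreters where these classes live in ``urllib.error``.

```python
try:
    from urllib2 import URLError, HTTPError       # Python 2, as used in this HOWTO
except ImportError:
    from urllib.error import URLError, HTTPError  # assumed fallback for newer interpreters


def describe(exc):
    """Hypothetical helper: summarise an exception of the kind urlopen raises."""
    # Test for HTTPError first, because it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return 'HTTP status %d' % exc.code
    if isinstance(exc, URLError):
        return 'no response: %s' % (exc.reason,)
    raise TypeError('not a URLError: %r' % (exc,))


# Built by hand so that the example needs no server.
not_found = HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
unreachable = URLError('getaddrinfo failed')

summary_http = describe(not_found)
summary_url = describe(unreachable)
is_subclass = issubclass(HTTPError, URLError)

print(summary_http)
print(summary_url)
```

Checking the more specific class first is the same ordering concern that applies to ``except`` clauses.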
+ +See section 10 of RFC 2616 for a reference on all the HTTP error codes. + +The ``HTTPError`` instance raised will have an integer 'code' attribute, which +corresponds to the error sent by the server. + +Error Codes +~~~~~~~~~~~ + +Because the default handlers handle redirects (codes in the 300 range), and +codes in the 100-299 range indicate success, you will usually only see error +codes in the 400-599 range. + +``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary +that shows all the response codes used by RFC 2616. The +dictionary is reproduced here for convenience :: + + # Table mapping response codes to messages; entries have the + # form {code: (shortmessage, longmessage)}. + responses = { + 100: ('Continue', 'Request received, please continue'), + 101: ('Switching Protocols', + 'Switching to new protocol; obey Upgrade header'), + + 200: ('OK', 'Request fulfilled, document follows'), + 201: ('Created', 'Document created, URL follows'), + 202: ('Accepted', + 'Request accepted, processing continues off-line'), + 203: ('Non-Authoritative Information', 'Request fulfilled from cache'), + 204: ('No Content', 'Request fulfilled, nothing follows'), + 205: ('Reset Content', 'Clear input form for further input.'), + 206: ('Partial Content', 'Partial content follows.'), + + 300: ('Multiple Choices', + 'Object has several resources -- see URI list'), + 301: ('Moved Permanently', 'Object moved permanently -- see URI list'), + 302: ('Found', 'Object moved temporarily -- see URI list'), + 303: ('See Other', 'Object moved -- see Method and URL list'), + 304: ('Not Modified', + 'Document has not changed since given time'), + 305: ('Use Proxy', + 'You must use proxy specified in Location to access this ' + 'resource.'), + 307: ('Temporary Redirect', + 'Object moved temporarily -- see URI list'), + + 400: ('Bad Request', + 'Bad request syntax or unsupported method'), + 401: ('Unauthorized', + 'No permission -- see authorization 
schemes'), + 402: ('Payment Required', + 'No payment -- see charging schemes'), + 403: ('Forbidden', + 'Request forbidden -- authorization will not help'), + 404: ('Not Found', 'Nothing matches the given URI'), + 405: ('Method Not Allowed', + 'Specified method is invalid for this server.'), + 406: ('Not Acceptable', 'URI not available in preferred format.'), + 407: ('Proxy Authentication Required', 'You must authenticate with ' + 'this proxy before proceeding.'), + 408: ('Request Timeout', 'Request timed out; try again later.'), + 409: ('Conflict', 'Request conflict.'), + 410: ('Gone', + 'URI no longer exists and has been permanently removed.'), + 411: ('Length Required', 'Client must specify Content-Length.'), + 412: ('Precondition Failed', 'Precondition in headers is false.'), + 413: ('Request Entity Too Large', 'Entity is too large.'), + 414: ('Request-URI Too Long', 'URI is too long.'), + 415: ('Unsupported Media Type', 'Entity body in unsupported format.'), + 416: ('Requested Range Not Satisfiable', + 'Cannot satisfy request range.'), + 417: ('Expectation Failed', + 'Expect condition could not be satisfied.'), + + 500: ('Internal Server Error', 'Server got itself in trouble'), + 501: ('Not Implemented', + 'Server does not support this operation'), + 502: ('Bad Gateway', 'Invalid responses from another server/proxy.'), + 503: ('Service Unavailable', + 'The server cannot process the request due to a high load'), + 504: ('Gateway Timeout', + 'The gateway server did not receive a timely response'), + 505: ('HTTP Version Not Supported', 'Cannot fulfill request.'), + } + +When an error is raised the server responds by returning an HTTP error code +*and* an error page. You can use the ``HTTPError`` instance as a response on the +page returned. This means that as well as the code attribute, it also has read, +geturl, and info, methods. 
:: + + >>> req = urllib2.Request('http://www.python.org/fish.html') + >>> try: + ... urllib2.urlopen(req) + ... except urllib2.HTTPError, e: + ... print e.code + ... print e.read() + ... + 404 + <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" + "http://www.w3.org/TR/html4/loose.dtd"> + <?xml-stylesheet href="./css/ht2html.css" + type="text/css"?> + <html><head><title>Error 404: File Not Found</title> + ...... etc... + +Wrapping it Up +-------------- + +So if you want to be prepared for ``HTTPError`` *or* ``URLError`` there are two +basic approaches. I prefer the second approach. + +Number 1 +~~~~~~~~ + +:: + + + from urllib2 import Request, urlopen, URLError, HTTPError + req = Request(someurl) + try: + response = urlopen(req) + except HTTPError, e: + print 'The server couldn\'t fulfill the request.' + print 'Error code: ', e.code + except URLError, e: + print 'We failed to reach a server.' + print 'Reason: ', e.reason + else: + # everything is fine + pass + + +.. note:: + + The ``except HTTPError`` *must* come first, otherwise ``except URLError`` + will *also* catch an ``HTTPError``. + +Number 2 +~~~~~~~~ + +:: + + from urllib2 import Request, urlopen, URLError + req = Request(someurl) + try: + response = urlopen(req) + except URLError, e: + if hasattr(e, 'reason'): + print 'We failed to reach a server.' + print 'Reason: ', e.reason + elif hasattr(e, 'code'): + print 'The server couldn\'t fulfill the request.' + print 'Error code: ', e.code + else: + # everything is fine + pass + + +info and geturl +=============== + +The response returned by urlopen (or the ``HTTPError`` instance) has two useful +methods, ``info`` and ``geturl``. + +**geturl** - this returns the real URL of the page fetched. This is useful +because ``urlopen`` (or the opener object used) may have followed a +redirect. The URL of the page fetched may not be the same as the URL requested. 
+ +**info** - this returns a dictionary-like object that describes the page +fetched, particularly the headers sent by the server. It is currently an +``httplib.HTTPMessage`` instance. + +Typical headers include 'Content-length', 'Content-type', and so on. See the +`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_ +for a useful listing of HTTP headers with brief explanations of their meaning +and use. + + +Openers and Handlers +==================== + +When you fetch a URL you use an opener (an instance of the perhaps +confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using +the default opener - via ``urlopen`` - but you can create custom +openers. Openers use handlers. All the "heavy lifting" is done by the +handlers. Each handler knows how to open URLs for a particular URL scheme (http, +ftp, etc.), or how to handle an aspect of URL opening, for example HTTP +redirections or HTTP cookies. + +You will want to create openers if you want to fetch URLs with specific handlers +installed, for example to get an opener that handles cookies, or to get an +opener that does not handle redirections. + +To create an opener, instantiate an ``OpenerDirector``, and then call +``.add_handler(some_handler_instance)`` repeatedly. + +Alternatively, you can use ``build_opener``, which is a convenience function for +creating opener objects with a single function call. ``build_opener`` adds +several handlers by default, but provides a quick way to add more and/or +override the default handlers. + +Other sorts of handlers you might want can handle proxies, authentication, +and other common but slightly specialised situations. + +``install_opener`` can be used to make an ``opener`` object the (global) default +opener. This means that calls to ``urlopen`` will use the opener you have +installed. 
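The two ways of creating an opener described above can be compared side by side. This is a minimal sketch, not a recommended recipe: it uses ``HTTPCookieProcessor`` as the example handler, it reads the opener's ``handlers`` list purely for illustration (treat that attribute as an implementation detail I am assuming here), and the fallback import is an assumption for interpreters where these names live in ``urllib.request``. Nothing is fetched and ``install_opener`` is deliberately not called, so the global default opener is left alone.

```python
try:
    import urllib2 as url_lib          # Python 2, as used in this HOWTO
except ImportError:
    import urllib.request as url_lib   # assumed fallback for newer interpreters

# build_opener adds its default handlers plus any instances passed in.
cookie_handler = url_lib.HTTPCookieProcessor()
opener = url_lib.build_opener(cookie_handler)

# The manual route: a bare OpenerDirector only gets what we add to it.
bare_opener = url_lib.OpenerDirector()
bare_opener.add_handler(url_lib.HTTPCookieProcessor())

# Collect the class names of the installed handlers for comparison.
handler_names = [type(h).__name__ for h in opener.handlers]
bare_names = [type(h).__name__ for h in bare_opener.handlers]

print(handler_names)
print(bare_names)
```

The bare opener holds only the cookie handler, so it could not actually open an HTTP URL until a handler such as ``HTTPHandler`` is added as well; that gap is what ``build_opener`` fills in.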
+ +Opener objects have an ``open`` method, which can be called directly to fetch +URLs in the same way as the ``urlopen`` function: there's no need to call +``install_opener``, except as a convenience. + + +Basic Authentication +==================== + +To illustrate creating and installing a handler we will use the +``HTTPBasicAuthHandler``. For a more detailed discussion of this subject, +including an explanation of how Basic Authentication works, see the `Basic +Authentication Tutorial +<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_. + +When authentication is required, the server sends a header (as well as the 401 +error code) requesting authentication. This specifies the authentication scheme +and a 'realm'. The header looks like: ``Www-authenticate: SCHEME +realm="REALM"``. + +e.g. :: + + Www-authenticate: Basic realm="cPanel Users" + + +The client should then retry the request with the appropriate name and password +for the realm included as a header in the request. This is 'basic +authentication'. In order to simplify this process we can create an instance of +``HTTPBasicAuthHandler`` and an opener to use this handler. + +The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle +the mapping of URLs and realms to passwords and usernames. If you know what the +realm is (from the authentication header sent by the server), then you can use an +``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that +case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows +you to specify a default username and password for a URL. This will be supplied +unless you provide an alternative combination for a specific +realm. We indicate this by providing ``None`` as the realm argument to the +``add_password`` method. + +The top-level URL is the first URL that requires authentication. URLs "deeper" +than the URL you pass to ``.add_password()`` will also match. 
:: + + # create a password manager + password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm() + + # Add the username and password. + # If we knew the realm, we could use it instead of ``None``. + top_level_url = "http://example.com/foo/" + password_mgr.add_password(None, top_level_url, username, password) + + handler = urllib2.HTTPBasicAuthHandler(password_mgr) + + # create "opener" (OpenerDirector instance) + opener = urllib2.build_opener(handler) + + # use the opener to fetch a URL + opener.open(a_url) + + # Install the opener. + # Now all calls to urllib2.urlopen use our opener. + urllib2.install_opener(opener) + +.. note:: + + In the above example we only supplied our ``HTTPBasicAuthHandler`` to + ``build_opener``. By default openers have the handlers for normal situations + -- ``ProxyHandler``, ``UnknownHandler``, ``HTTPHandler``, + ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``, + ``FileHandler``, ``HTTPErrorProcessor``. + +``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme +component and the hostname and optionally the port number) +e.g. "http://example.com/" *or* an "authority" (i.e. the hostname, +optionally including the port number) e.g. "example.com" or "example.com:8080" +(the latter example includes a port number). The authority, if present, must +NOT contain the "userinfo" component - for example "joe:password@example.com" is +not correct. + + +Proxies +======= + +**urllib2** will auto-detect your proxy settings and use those. This is through +the ``ProxyHandler`` which is part of the normal handler chain. Normally that's +a good thing, but there are occasions when it may not be helpful [#]_. One way +to avoid the proxy is to set up our own ``ProxyHandler``, with no proxies defined. This +is done using similar steps to setting up a `Basic Authentication`_ handler:: + + >>> proxy_support = urllib2.ProxyHandler({}) + >>> opener = urllib2.build_opener(proxy_support) + >>> urllib2.install_opener(opener) + +.. 
note:: + + Currently ``urllib2`` *does not* support fetching of ``https`` locations + through a proxy. However, this can be enabled by extending urllib2 as + shown in the recipe [#]_. + + +Sockets and Layers +================== + +The Python support for fetching resources from the web is layered. urllib2 uses +the httplib library, which in turn uses the socket library. + +As of Python 2.3 you can specify how long a socket should wait for a response +before timing out. This can be useful in applications which have to fetch web +pages. By default the socket module has *no timeout* and can hang. Currently, +the socket timeout is not exposed at the httplib or urllib2 levels. However, +you can set the default timeout globally for all sockets using :: + + import socket + import urllib2 + + # timeout in seconds + timeout = 10 + socket.setdefaulttimeout(timeout) + + # this call to urllib2.urlopen now uses the default timeout + # we have set in the socket module + req = urllib2.Request('http://www.voidspace.org.uk') + response = urllib2.urlopen(req) + + +------- + + +Footnotes +========= + +This document was reviewed and revised by John Lee. + +.. [#] For an introduction to the CGI protocol see + `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_. +.. [#] Like Google for example. The *proper* way to use google from a program + is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See + `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_ + for some examples of using the Google API. +.. [#] Browser sniffing is a very bad practise for website design - building + sites using web standards is much more sensible. Unfortunately a lot of + sites still send different versions to different browsers. +.. [#] The user agent for MSIE 6 is + *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'* +.. 
[#] For details of more HTTP request headers, see + `Quick Reference to HTTP Headers`_. +.. [#] In my case I have to use a proxy to access the internet at work. If you + attempt to fetch *localhost* URLs through this proxy it blocks them. IE + is set to use the proxy, which urllib2 picks up on. In order to test + scripts with a localhost server, I have to prevent urllib2 from using + the proxy. +.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe + <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_. +