From 062ea2e70bd320a5b4b3cd1907babf19c17c6622 Mon Sep 17 00:00:00 2001
From: Fred Drake
Date: Fri, 6 Oct 2000 19:59:22 +0000
Subject: Made a number of revisions suggested by Fredrik Lundh.

Revised the first paragraph so it doesn't sound like it was written
when 7-bit strings were assumed; note that Unicode strings can be used.
---
 Doc/lib/libre.tex | 45 +++++++++++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/Doc/lib/libre.tex b/Doc/lib/libre.tex
index 0c9df2a..37b4ee8 100644
--- a/Doc/lib/libre.tex
+++ b/Doc/lib/libre.tex
@@ -1,21 +1,21 @@
 \section{\module{re} ---
- Perl-style regular expression operations.}
+ Regular expression operations}
 \declaremodule{standard}{re}
 \moduleauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
+\moduleauthor{Fredrik Lundh}{effbot@telia.com}
 \sectionauthor{Andrew M. Kuchling}{amk1@bigfoot.com}


-\modulesynopsis{Perl-style regular expression search and match
-operations.}
+\modulesynopsis{Regular expression search and match operations with a
+ Perl-style expression syntax.}


 This module provides regular expression matching operations similar to
-those found in Perl. It's 8-bit clean: the strings being processed
-may contain both null bytes and characters whose high bit is set. Regular
-expression pattern strings may not contain null bytes, but can specify
-the null byte using the \code{\e\var{number}} notation.
-Characters with the high bit set may be included. The \module{re}
-module is always available.
+those found in Perl. Regular expression pattern strings may not
+contain null bytes, but can specify the null byte using the
+\code{\e\var{number}} notation. Both patterns and strings to be
+searched can be Unicode strings as well as 8-bit strings. The
+\module{re} module is always available.

 Regular expressions use the backslash character (\character{\e}) to
 indicate special forms or to allow special characters to be used
@@ -34,6 +34,15 @@ while \code{"\e n"} is a one-character string containing a newline.
 Usually patterns will be expressed in Python code using this raw
 string notation.

+\strong{Implementation note:}
+The \module{re}\refstmodindex{pre} module has two distinct
+implementations: \module{sre} is the default implementation and
+includes Unicode support, but may run into stack limitations for some
+patterns. Though this will be fixed for a future release of Python,
+the older implementation (without Unicode support) is still available
+as the \module{pre}\refstmodindex{pre} module.
+
+
 \subsection{Regular Expression Syntax \label{re-syntax}}

 A regular expression (or RE) specifies a set of strings that matches
@@ -155,9 +164,16 @@ simply match the \character{\^} character. For example,
 \regexp{[{\^}5]} will match any character except \character{5}.

 \item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
-creates a regular expression that will match either A or B. This can
-be used inside groups (see below) as well. To match a literal \character{|},
-use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
+creates a regular expression that will match either A or B. An
+arbitrary number of REs can be separated by the \character{|} in this
+way. This can be used inside groups (see below) as well. REs
+separated by \character{|} are tried from left to right, and the first
+one that allows the complete pattern to match is considered the
+accepted branch. This means that if \code{A} matches, \code{B} will
+never be tested, even if it would produce a longer overall match. In
+other words, the \character{|} operator is never greedy. To match a
+literal \character{|}, use \regexp{\e|}, or enclose it inside a
+character class, as in \regexp{[|]}.

 \item[\code{(...)}] Matches whatever regular expression is inside the
 parentheses, and indicates the start and end of a group; the contents
@@ -184,6 +200,11 @@ for the entire regular expression. This is useful if you wish to
 include the flags as part of the regular expression, instead of
 passing a \var{flag} argument to the \function{compile()} function.

+Note that the \regexp{(?x)} flag changes how the expression is parsed.
+It should be used first in the expression string, or after one or more
+whitespace characters. If there are non-whitespace characters before
+the flag, the results are undefined.
+
 \item[\code{(?:...)}] A non-grouping version of regular parentheses.
 Matches whatever regular expression is inside the parentheses, but
 the substring matched by the
--
cgit v0.12
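
The two behaviours this patch documents, left-to-right alternation and the placement of the \regexp{(?x)} flag, can be checked with a short Python sketch. The patterns and input strings below are illustrative assumptions, not taken from the patch, and the sketch is written for a current Python 3 \module{re} rather than the 2000-era implementation.

```python
import re

# Branches are tried left to right; the first one that lets the whole
# pattern succeed wins, even if a later branch would match more text.
first = re.match(r"a|ab", "ab").group(0)

# Reordering the branches changes which text is matched.
second = re.match(r"ab|a", "ab").group(0)

# If the first branch prevents the complete pattern from matching,
# the engine backtracks and tries the next branch.
third = re.match(r"(?:a|ab)c", "abc").group(0)

# (?x) used first in the expression string: whitespace and comments
# inside the pattern are then ignored by the parser.
fourth = re.match(r"""(?x)
    \d+   # integer part
    \.    # decimal point
    \d*   # optional fraction
""", "3.14").group(0)

print(first, second, third, fourth)
```

Putting the longer alternative first (\code{ab|a}) is the usual way to get the longer match when one branch is a prefix of another.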