From b8df156ab59a8b7853fcecb64294e22d37ea5b0a Mon Sep 17 00:00:00 2001
From: Georg Brandl
Date: Wed, 5 Dec 2007 18:30:48 +0000
Subject: Add examples to re docs. Written for GHOP by Dan Finnie.

---
 Doc/ACKS.txt       |   1 +
 Doc/library/re.rst | 302 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 286 insertions(+), 17 deletions(-)

diff --git a/Doc/ACKS.txt b/Doc/ACKS.txt
index de2fd51..da098d0 100644
--- a/Doc/ACKS.txt
+++ b/Doc/ACKS.txt
@@ -48,6 +48,7 @@ docs@python.org), and we'll be glad to correct the problem.
 * Carey Evans
 * Martijn Faassen
 * Carl Feynman
+* Dan Finnie
 * Hernán Martínez Foffani
 * Stefan Franke
 * Jim Fulton
diff --git a/Doc/library/re.rst b/Doc/library/re.rst
index 1caaaf2..fbc9267 100644
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -31,6 +31,11 @@ prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
 newline. Usually patterns will be expressed in Python code using this raw
 string notation.
 
+It is important to note that most regular expression operations are available as
+module-level functions and :class:`RegexObject` methods. The functions are
+shortcuts that don't require you to compile a regex object first, but miss some
+fine-tuning parameters.
+
 .. seealso::
 
    Mastering Regular Expressions
@@ -408,11 +413,9 @@ argument regardless of whether a newline precedes it.
 
 ::
 
-   re.compile("a").match("ba", 1)  # succeeds
-   re.compile("^a").search("ba", 1)  # fails; 'a' not at start
-   re.compile("^a").search("\na", 1)  # fails; 'a' not at start
-   re.compile("^a", re.M).search("\na", 1)  # succeeds
-   re.compile("^a", re.M).search("ba", 1)  # fails; no preceding \n
+   >>> re.match("c", "abcdef")  # No match
+   >>> re.search("c", "abcdef")
+   <_sre.SRE_Match object at 0x827e9c0>  # Match
 
 
 .. _contents-of-module-re:
@@ -504,7 +507,13 @@ form.
    character class or preceded by an unescaped backslash, all characters from the
    leftmost such ``'#'`` through the end of the line are ignored.
 
-   .. % XXX should add an example here
+   That means that the two following regular expression objects that match a
+   decimal number are functionally equal::
+
+      a = re.compile(r"""\d +  # the integral part
+                         \.    # the decimal point
+                         \d *  # some fractional digits""", re.X)
+      b = re.compile(r"\d+\.\d*")
 
 
 .. function:: search(pattern, string[, flags])
@@ -525,7 +534,8 @@ form.
 
    .. note::
 
-      If you want to locate a match anywhere in *string*, use :meth:`search` instead.
+      If you want to locate a match anywhere in *string*, use :meth:`search`
+      instead.
 
 
 .. function:: split(pattern, string[, maxsplit=0])
@@ -663,7 +673,8 @@ attributes:
 
    .. note::
 
-      If you want to locate a match anywhere in *string*, use :meth:`search` instead.
+      If you want to locate a match anywhere in *string*, use :meth:`search`
+      instead.
 
    The optional second parameter *pos* gives an index in the string where the
    search is to start; it defaults to ``0``. This is not completely equivalent to
@@ -676,7 +687,12 @@ attributes:
    from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
    than *pos*, no match will be found, otherwise, if *rx* is a compiled regular
    expression object, ``rx.match(string, 0, 50)`` is equivalent to
-   ``rx.match(string[:50], 0)``.
+   ``rx.match(string[:50], 0)``. ::
+
+      >>> pattern = re.compile("o")
+      >>> pattern.match("dog")  # No match as "o" is not at the start of "dog."
+      >>> pattern.match("dog", 1)  # Match as "o" is the 2nd character of "dog".
+      <_sre.SRE_Match object at 0x827eb10>
 
 
 .. method:: RegexObject.search(string[, pos[, endpos]])
@@ -764,7 +780,17 @@ support the following methods and attributes:
    pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
    part of the pattern that did not match, the corresponding result is ``None``.
    If a group is contained in a part of the pattern that matched multiple times,
-   the last match is returned.
+   the last match is returned. ::
+
+      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
+      >>> m.group(0)
+      'Isaac Newton'  # The entire match
+      >>> m.group(1)
+      'Isaac'  # The first parenthesized subgroup.
+      >>> m.group(2)
+      'Newton'  # The second parenthesized subgroup.
+      >>> m.group(1, 2)
+      ('Isaac', 'Newton')  # Multiple arguments give us a tuple.
 
    If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
    arguments may also be strings identifying groups by their group name. If a
@@ -773,10 +799,23 @@ support the following methods and attributes:
 
    A moderately complicated example::
 
-      m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
+      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
+      >>> m.group('first_name')
+      'Malcom'
+      >>> m.group('last_name')
+      'Reynolds'
+
+   Named groups can also be referred to by their index::
+
+      >>> m.group(1)
+      'Malcom'
+      >>> m.group(2)
+      'Reynolds'
 
-   After performing this match, ``m.group(1)`` is ``'3'``, as is
-   ``m.group('int')``, and ``m.group(2)`` is ``'14'``.
+   If a group matches multiple times, only the last match is accessible::
+      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
+      >>> m.group(1)  # Returns only the last match.
+      'c3'
 
 
 .. method:: MatchObject.groups([default])
@@ -788,12 +827,32 @@ support the following methods and attributes:
    string would be returned instead. In later versions (from 1.5.1 on), a
    singleton tuple is returned in such cases.)
 
+   For example::
+
+      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
+      >>> m.groups()
+      ('24', '1632')
+
+   If we make the decimal place and everything after it optional, not all groups
+   might participate in the match. These groups will default to ``None`` unless
+   the *default* argument is given::
+
+      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
+      >>> m.groups()
+      ('24', None)  # Second group defaults to None.
+      >>> m.groups('0')
+      ('24', '0')  # Now, the second group defaults to '0'.
+
 
 .. method:: MatchObject.groupdict([default])
 
    Return a dictionary containing all the *named* subgroups of the match, keyed by
    the subgroup name. The *default* argument is used for groups that did not
-   participate in the match; it defaults to ``None``.
+   participate in the match; it defaults to ``None``. For example::
+
+      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
+      >>> m.groupdict()
+      {'first_name': 'Malcom', 'last_name': 'Reynolds'}
 
 
 .. method:: MatchObject.start([group])
@@ -812,12 +871,19 @@ support the following methods and attributes:
    ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
    2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
 
+   An example that will remove *remove_this* from email addresses::
+
+      >>> email = "tony@tiremove_thisger.net"
+      >>> m = re.search("remove_this", email)
+      >>> email[:m.start()] + email[m.end():]
+      'tony@tiger.net'
+
 
 .. method:: MatchObject.span([group])
 
    For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
    m.end(group))``. Note that if *group* did not contribute to the match, this is
-   ``(-1, -1)``. Again, *group* defaults to zero.
+   ``(-1, -1)``. *group* defaults to zero, the entire match.
 
 
 .. attribute:: MatchObject.pos
@@ -863,7 +929,62 @@ support the following methods and attributes:
 Examples
 --------
 
-**Simulating scanf()**
+
+Checking For a Pair
+^^^^^^^^^^^^^^^^^^^
+
+In this example, we'll use the following helper function to display match
+objects a little more gracefully::
+
+   def displaymatch(match):
+       if match is None:
+           return None
+       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
+
+Suppose you are writing a poker program where a player's hand is represented as
+a 5-character string with each character representing a card, "a" for ace, "k"
+for king, "q" for queen, "j" for jack, "0" for 10, and "1" through "9"
+representing the card with that value.
+
+To see if a given string is a valid hand, one could do the following::
+
+   >>> valid = re.compile(r"[0-9akqj]{5}$")
+   >>> displaymatch(valid.match("ak05q"))  # Valid.
+   <Match: 'ak05q', groups=()>
+   >>> displaymatch(valid.match("ak05e"))  # Invalid.
+   >>> displaymatch(valid.match("ak0"))  # Invalid.
+   >>> displaymatch(valid.match("727ak"))  # Valid.
+   <Match: '727ak', groups=()>
+
+That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
+To match this with a regular expression, one could use backreferences as such::
+
+   >>> pair = re.compile(r".*(.).*\1")
+   >>> displaymatch(pair.match("717ak"))  # Pair of 7s.
+   <Match: '717', groups=('7',)>
+   >>> displaymatch(pair.match("718ak"))  # No pairs.
+   >>> displaymatch(pair.match("354aa"))  # Pair of aces.
+   <Match: '354aa', groups=('a',)>
+
+To find out what card the pair consists of, one could use the :func:`group`
+method of :class:`MatchObject` in the following manner::
+
+   >>> pair.match("717ak").group(1)
+   '7'
+
+   # Error because re.match() returns None, which doesn't have a group() method:
+   >>> pair.match("718ak").group(1)
+   Traceback (most recent call last):
+     File "<pyshell#23>", line 1, in <module>
+       re.match(r".*(.).*\1", "718ak").group(1)
+   AttributeError: 'NoneType' object has no attribute 'group'
+
+   >>> pair.match("354aa").group(1)
+   'a'
+
+
+Simulating scanf()
+^^^^^^^^^^^^^^^^^^
 
 .. index:: single: scanf()
 
@@ -907,7 +1028,9 @@ The equivalent regular expression would be ::
 
    (\S+) - (\d+) errors, (\d+) warnings
 
-**Avoiding recursion**
+
+Avoiding recursion
+^^^^^^^^^^^^^^^^^^
 
 If you create regular expressions that require the engine to perform a lot of
 recursion, you may encounter a :exc:`RuntimeError` exception with the message
@@ -929,3 +1052,148 @@ avoid recursion. Thus, the above regular expression can avoid recursion by
 being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
 regular expressions will run faster than their recursive equivalents.
 
+
+search() vs. match()
+^^^^^^^^^^^^^^^^^^^^
+
+In a nutshell, :func:`match` only attempts to match a pattern at the beginning
+of a string where :func:`search` will match a pattern anywhere in a string.
+For example::
+
+   >>> re.match("o", "dog")  # No match as "o" is not the first letter of "dog".
+   >>> re.search("o", "dog")  # Match as search() looks everywhere in the string.
+   <_sre.SRE_Match object at 0x827e9f8>
+
+.. note::
+
+   The following applies only to regular expression objects like those created
+   with ``re.compile("pattern")``, not the primitives
+   ``re.match(pattern, string)`` or ``re.search(pattern, string)``.
+
+:func:`match` has an optional second parameter that gives an index in the string
+where the search is to start::
+
+   >>> pattern = re.compile("o")
+   >>> pattern.match("dog")  # No match as "o" is not at the start of "dog."
+   # Equivalent to the above expression as 0 is the default starting index:
+   >>> pattern.match("dog", 0)
+   # Match as "o" is the 2nd character of "dog" (index 0 is the first):
+   >>> pattern.match("dog", 1)
+   <_sre.SRE_Match object at 0x827eb10>
+   >>> pattern.match("dog", 2)  # No match as "o" is not the 3rd character of "dog."
+
+
+Making a Phonebook
+^^^^^^^^^^^^^^^^^^
+
+:func:`split` splits a string into a list delimited by the passed pattern. The
+method is invaluable for converting textual data into data structures that can be
+easily read and modified by Python as demonstrated in the following example that
+creates a phonebook.
+
+First, get the input using triple-quoted string syntax::
+
+   >>> input = """Ross McFluff 834.345.1254 155 Elm Street
+   Ronald Heathmore 892.345.3428 436 Finley Avenue
+   Frank Burger 925.541.7625 662 South Dogwood Way
+   Heather Albrecht 548.326.4584 919 Park Place"""
+
+Then, convert the string into a list with each line having its own entry::
+
+   >>> entries = re.split("\n", input)
+   >>> entries
+   ['Ross McFluff 834.345.1254 155 Elm Street',
+   'Ronald Heathmore 892.345.3428 436 Finley Avenue',
+   'Frank Burger 925.541.7625 662 South Dogwood Way',
+   'Heather Albrecht 548.326.4584 919 Park Place']
+
+Finally, split each entry into a list with first name, last name, telephone
+number, and address. We use the ``maxsplit`` parameter of :func:`split`
+because the address has spaces, our splitting pattern, in it::
+
+   >>> [re.split(" ", entry, 3) for entry in entries]
+   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
+   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
+   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
+   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
+
+With a ``maxsplit`` of ``4``, we could separate the house number from the street
+name::
+
+   >>> [re.split(" ", entry, 4) for entry in entries]
+   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
+   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
+   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
+   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
+
+
+Text Munging
+^^^^^^^^^^^^
+
+:func:`sub` replaces every occurrence of a pattern with a string or the
+result of a function. This example demonstrates using :func:`sub` with
+a function to "munge" text, or randomize the order of all the characters
+in each word of a sentence except for the first and last characters::
+
+   >>> def repl(m):
+   ...   inner_word = list(m.group(2))
+   ...   random.shuffle(inner_word)
+   ...   return m.group(1) + "".join(inner_word) + m.group(3)
+   >>> text = "Professor Abdolmalek, please report your absences promptly."
+   >>> re.sub("(\w)(\w+)(\w)", repl, text)
+   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
+   >>> re.sub("(\w)(\w+)(\w)", repl, text)
+   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
+
+
+Finding all Adverbs
+^^^^^^^^^^^^^^^^^^^
+
+:func:`findall` matches *all* occurrences of a pattern, not just the first
+one as :func:`search` does. For example, if one was a writer and wanted to
+find all of the adverbs in some text, he or she might use :func:`findall` in
+the following manner::
+
+   >>> text = "He was carefully disguised but captured quickly by police."
+   >>> re.findall(r"\w+ly", text)
+   ['carefully', 'quickly']
+
+
+Finding all Adverbs and their Positions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If one wants more information about all matches of a pattern than the matched
+text, :func:`finditer` is useful as it provides instances of
+:class:`MatchObject` instead of strings. Continuing with the previous example,
+if one was a writer who wanted to find all of the adverbs *and their positions*
+in some text, he or she would use :func:`finditer` in the following manner::
+
+   >>> text = "He was carefully disguised but captured quickly by police."
+   >>> for m in re.finditer(r"\w+ly", text):
+   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
+   07-16: carefully
+   40-47: quickly
+
+
+Raw String Notation
+^^^^^^^^^^^^^^^^^^^
+
+Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
+every backslash (``'\'``) in a regular expression would have to be prefixed with
+another one to escape it. For example, the two following lines of code are
+functionally identical::
+
+   >>> re.match(r"\W(.)\1\W", " ff ")
+   <_sre.SRE_Match object at 0x8262760>
+   >>> re.match("\\W(.)\\1\\W", " ff ")
+   <_sre.SRE_Match object at 0x82627a0>
+
+When one wants to match a literal backslash, it must be escaped in the regular
+expression. With raw string notation, this means ``r"\\"``. Without raw string
+notation, one must use ``"\\\\"``, making the following lines of code
+functionally identical::
+
+   >>> re.match(r"\\", r"\\")
+   <_sre.SRE_Match object at 0x827eb48>
+   >>> re.match("\\\\", r"\\")
+   <_sre.SRE_Match object at 0x827ec60>
-- 
cgit v0.12
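
The patch describes how the module-level functions differ from the compiled-pattern methods, which accept an optional starting index *pos*. The sketch below is not part of the commit. It is a minimal check of that behaviour, written for Python 3 (the patch itself targets the Python 2 documentation), and the variable names are illustrative assumptions only.

```python
import re

pattern = re.compile("o")

# Module-level functions take the pattern as their first argument and
# have no pos/endpos parameters.
module_match = re.match("o", "dog")      # None: "o" is not at index 0
module_search = re.search("o", "dog")    # match object for the "o" at index 1

# Compiled-pattern methods accept an optional starting index (pos).
method_default = pattern.match("dog")    # None, same as passing pos=0
method_pos1 = pattern.match("dog", 1)    # matches: "o" is at index 1
method_pos2 = pattern.match("dog", 2)    # None: index 2 holds "g"

# '^' still anchors at the real start of the string, so passing pos is not
# completely equivalent to slicing the string first.
caret = re.compile("^o")
caret_pos = caret.search("dog", 1)       # None
caret_slice = caret.search("dog"[1:])    # matches at index 0 of "og"

print(module_search.span(), method_pos1.span(), caret_slice.span())
```

Comparing `span()` values rather than printed match objects keeps the check independent of the interpreter-specific object representation shown in the patch.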