From a2c39eef55596583301ced9356ec03e35a6783f0 Mon Sep 17 00:00:00 2001
From: dkf <donal.k.fellows@manchester.ac.uk>
Date: Thu, 15 Nov 2007 12:02:54 +0000
Subject: Readability improvements to doc/re_syntax.n

---
 ChangeLog       |   3 +
 doc/re_syntax.n | 204 ++++++++++++++++++++++++++++++++++----------------------
 2 files changed, 126 insertions(+), 81 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 6847d8c..62aabfb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,8 @@
 2007-11-15  Donal K. Fellows  <donal.k.fellows@man.ac.uk>
 
+	* doc/re_syntax.n: Add subsection headings and quote examples to make
+	this easier to read. It's still a very difficult manual page!
+
 	* unix/tcl.m4 (SC_CONFIG_CFLAGS): Allow people to turn off the -rpath
 	option to their linker if they so desire. This is a configuration only
 	recommended for (some) vendors. Relates to [Patch 1231022].
diff --git a/doc/re_syntax.n b/doc/re_syntax.n
index eebda51..f1eabbc 100644
--- a/doc/re_syntax.n
+++ b/doc/re_syntax.n
@@ -5,7 +5,7 @@
 '\" See the file "license.terms" for information on usage and redistribution
 '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
 '\" 
-'\" RCS: @(#) $Id: re_syntax.n,v 1.15 2007/11/14 11:13:32 dkf Exp $
+'\" RCS: @(#) $Id: re_syntax.n,v 1.16 2007/11/15 12:02:56 dkf Exp $
 '\"
 .so man.macros
 .TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
@@ -50,7 +50,7 @@ A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
 concatenated.
 It matches a match for the first, followed by a match for the second, etc;
 an empty branch matches the empty string.
-.PP
+.SS QUANTIFIERS
 A quantified atom is an \fIatom\fR possibly followed
 by a single \fIquantifier\fR.
 Without a quantifier, it matches a single match for the atom.
@@ -93,7 +93,7 @@ of matches (see \fBMATCHING\fR)
 The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs.  The
 numbers \fIm\fR and \fIn\fR are unsigned decimal integers with
 permissible values from 0 to 255 inclusive.
-.PP
+.SS ATOMS
 An atom is one of:
 .RS 2
 .IP \fB(\fIre\fB)\fR 6
@@ -128,7 +128,7 @@ when followed by a digit, it is the beginning of a \fIbound\fR (see above)
 where \fIx\fR is a single character with no other significance,
 matches that character.
 .RE
-.PP
+.SS CONSTRAINTS
 A \fIconstraint\fR matches an empty string when specific conditions
 are met.  A constraint may not be followed by a quantifier.  The
 simple constraints are as follows; some more constraints are described
@@ -163,7 +163,7 @@ An RE may not end with
 A \fIbracket expression\fR is a list of characters enclosed in
 .QW \fB[\|]\fR .
 It normally matches any single character from the list
-(but see below).  If the list begins with
+(but see below). If the list begins with
 .QW \fB^\fR ,
 it matches any single character (but see below) \fInot\fR from the
 rest of the list.
@@ -171,22 +171,25 @@ rest of the list.
 If two characters in the list are separated by
 .QW \fB\-\fR ,
 this is shorthand for the full \fIrange\fR of characters between those two
-(inclusive) in the collating sequence, e.g. \fB[0\-9]\fR in Unicode
-matches any conventional decimal digit.  Two ranges may not share an
-endpoint, so e.g. \fBa\-c\-e\fR is illegal.  Ranges in Tcl always use the
+(inclusive) in the collating sequence, e.g.
+.QW \fB[0\-9]\fR
+in Unicode matches any conventional decimal digit. Two ranges may not share an
+endpoint, so e.g.
+.QW \fBa\-c\-e\fR
+is illegal. Ranges in Tcl always use the
 Unicode collating sequence, but other programs may use other collating
 sequences and this can be a source of incompatability between programs.
 .PP
 To include a literal \fB]\fR or \fB\-\fR in the list, the simplest
 method is to enclose it in \fB[.\fR and \fB.]\fR to make it a
-collating element (see below).  Alternatively, make it the first
+collating element (see below). Alternatively, make it the first
 character (following a possible
 .QW \fB^\fR ),
 or (AREs only) precede it with
 .QW \fB\e\fR .
 Alternatively, for
 .QW \fB\-\fR ,
-make it the last character, or the second endpoint of a range.  To use
+make it the last character, or the second endpoint of a range. To use
 a literal \fB\-\fR as the first endpoint of a range, make it a
 collating element or (AREs only) precede it with
 .QW \fB\e\fR .
@@ -194,48 +197,7 @@ With the exception of
 these, some combinations using \fB[\fR (see next paragraphs), and
 escapes, all other special characters lose their special significance
 within a bracket expression.
-.PP
-Within a bracket expression, a collating element (a character, a
-multi-character sequence that collates as if it were a single
-character, or a collating-sequence name for either) enclosed in
-\fB[.\fR and \fB.]\fR stands for the sequence of characters of that
-collating element.  The sequence is a single element of the bracket
-expression's list.  A bracket expression in a locale that has
-multi-character collating elements can thus match more than one
-character.  So (insidiously), a bracket expression that starts with
-\fB^\fR can match multi-character collating elements even if none of
-them appear in the bracket expression!  (\fINote:\fR Tcl has
-no multi-character collating elements.  This information is only for
-illustration.)
-.PP
-For example, assume the collating sequence includes a \fBch\fR
-multi-character collating element.  Then the RE \fB[[.ch.]]*c\fR (zero
-or more \fBch\fRs followed by \fBc\fR) matches the first five
-characters of
-.QW \fBchchcc\fR .
-Also, the RE \fB[^c]b\fR matches all of
-.QW \fBchb\fR
-(because \fB[^c]\fR matches the multi-character \fBch\fR).
-.PP
-Within a bracket expression, a collating element enclosed in \fB[=\fR
-and \fB=]\fR is an equivalence class, standing for the sequences of
-characters of all collating elements equivalent to that one, including
-itself.  (If there are no other equivalent collating elements, the
-treatment is as if the enclosing delimiters were
-.QW \fB[.\fR \&
-and
-.QW \fB.]\fR .)
-For example, if \fBo\fR and \fB\N'244'\fR are the members of an
-equivalence class, then
-.QW \fB[[=o=]]\fR ,
-.QW \fB[[=\N'244'=]]\fR ,
-and
-.QW \fB[o\N'244']\fR \&
-are all synonymous.  An equivalence class may
-not be an endpoint of a range.  (\fINote:\fR Tcl implements only the
-Unicode locale.  It does not define any equivalence classes. The
-examples above are just illustrations.)
-.PP
+.SS "CHARACTER CLASSES"
 Within a bracket expression, the name of a \fIcharacter class\fR
 enclosed in \fB[:\fR and \fB:]\fR stands for the list of all
 characters (not all collating elements!)  belonging to that class.
@@ -265,30 +227,94 @@ A character with a visible representation (includes both alnum and punct).
 .IP \fBcntrl\fR 8
 A control character.
 .PP
-A locale may provide others.  (Note that the current Tcl
-implementation has only one locale: the Unicode locale.)  A character
-class may not be used as an endpoint of a range.
+A locale may provide others. A character class may not be used as an endpoint
+of a range.
+.RS
 .PP
+(\fINote:\fR the current Tcl implementation has only one locale, the Unicode
+locale, which supports exactly the above classes.)
+.RE
+.SS "BRACKETED CONSTRAINTS"
 There are two special cases of bracket expressions: the bracket
-expressions \fB[[:<:]]\fR and \fB[[:>:]]\fR are constraints, matching
-empty strings at the beginning and end of a word respectively.
-'\" note, discussion of escapes below references this definition of word
-A word is defined as a sequence of word characters that is neither
-preceded nor followed by word characters.  A word character is an
-\fIalnum\fR character or an underscore (\fB_\fR).  These special
-bracket expressions are deprecated; users of AREs should use
+expressions
+.QW \fB[[:<:]]\fR
+and
+.QW \fB[[:>:]]\fR
+are constraints, matching empty strings at the beginning and end of a word
+respectively.
+.\" note, discussion of escapes below references this definition of word
+A word is defined as a sequence of word characters that is neither preceded
+nor followed by word characters. A word character is an \fIalnum\fR character
+or an underscore
+.PQ \fB_\fR "" .
+These special bracket expressions are deprecated; users of AREs should use
 constraint escapes instead (see below).
+.SS "COLLATING ELEMENTS"
+Within a bracket expression, a collating element (a character, a
+multi-character sequence that collates as if it were a single
+character, or a collating-sequence name for either) enclosed in
+\fB[.\fR and \fB.]\fR stands for the sequence of characters of that
+collating element. The sequence is a single element of the bracket
+expression's list. A bracket expression in a locale that has
+multi-character collating elements can thus match more than one
+character. So (insidiously), a bracket expression that starts with
+\fB^\fR can match multi-character collating elements even if none of
+them appear in the bracket expression!
+.RS
+.PP
+(\fINote:\fR Tcl has no multi-character collating elements. This information
+is only for illustration.)
+.RE
+.PP
+For example, assume the collating sequence includes a \fBch\fR multi-character
+collating element. Then the RE
+.QW \fB[[.ch.]]*c\fR
+(zero or more
+.QW \fBch\fR s
+followed by
+.QW \fBc\fR )
+matches the first five characters of
+.QW \fBchchcc\fR .
+Also, the RE
+.QW \fB[^c]b\fR
+matches all of
+.QW \fBchb\fR
+(because
+.QW \fB[^c]\fR
+matches the multi-character
+.QW \fBch\fR ).
+.SS "EQUIVALENCE CLASSES"
+Within a bracket expression, a collating element enclosed in \fB[=\fR
+and \fB=]\fR is an equivalence class, standing for the sequences of
+characters of all collating elements equivalent to that one, including
+itself. (If there are no other equivalent collating elements, the
+treatment is as if the enclosing delimiters were
+.QW \fB[.\fR \&
+and
+.QW \fB.]\fR .)
+For example, if \fBo\fR and \fB\N'244'\fR are the members of an
+equivalence class, then
+.QW \fB[[=o=]]\fR ,
+.QW \fB[[=\N'244'=]]\fR ,
+and
+.QW \fB[o\N'244']\fR \&
+are all synonymous. An equivalence class may not be an endpoint of a range.
+.RS
+.PP
+(\fINote:\fR Tcl implements only the Unicode locale. It does not define any
+equivalence classes. The examples above are just illustrations.)
+.RE
 .SH ESCAPES
 Escapes (AREs only), which begin with a \fB\e\fR followed by an
 alphanumeric character, come in several varieties: character entry,
-class shorthands, constraint escapes, and back references.  A \fB\e\fR
+class shorthands, constraint escapes, and back references. A \fB\e\fR
 followed by an alphanumeric character but not constituting a valid
-escape is illegal in AREs.  In EREs, there are no escapes: outside a
+escape is illegal in AREs. In EREs, there are no escapes: outside a
 bracket expression, a \fB\e\fR followed by an alphanumeric character
 merely stands for that character as an ordinary character, and inside
-a bracket expression, \fB\e\fR is an ordinary character.  (The latter
+a bracket expression, \fB\e\fR is an ordinary character. (The latter
 is the one actual incompatibility between EREs and AREs.)
-.PP
+.SS "CHARACTER-ENTRY ESCAPES"
 Character-entry escapes (AREs only) exist to make it easier to specify
 non-printing and otherwise inconvenient characters in REs:
 .RS 2
@@ -380,13 +406,13 @@ Octal digits are
 .PP
 The character-entry escapes are always taken as ordinary characters.
 For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does
-not terminate a bracket expression.  Beware, however, that some
+not terminate a bracket expression. Beware, however, that some
 applications (e.g., C compilers and the Tcl interpreter if the regular
 expression is not quoted with braces) interpret such sequences
 themselves before the regular-expression package gets to see them,
 which may require doubling (quadrupling, etc.) the
 .QW \fB\e\fR .
-.PP
+.SS "CLASS-SHORTHAND ESCAPES"
 Class-shorthand escapes (AREs only) provide shorthands for certain
 commonly-used character classes:
 .RS 2
@@ -426,10 +452,16 @@ lose their outer brackets, and
 .QW \fB\eS\fR ,
 and
 .QW \fB\eW\fR \&
-are illegal.  (So, for example, \fB[a-c\ed]\fR is
-equivalent to \fB[a-c[:digit:]]\fR.  Also, \fB[a-c\eD]\fR, which is
-equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)
-.PP
+are illegal. (So, for example,
+.QW \fB[a-c\ed]\fR
+is equivalent to
+.QW \fB[a-c[:digit:]]\fR .
+Also,
+.QW \fB[a-c\eD]\fR ,
+which is equivalent to
+.QW \fB[a-c^[:digit:]]\fR ,
+is illegal.)
+.SS "CONSTRAINT ESCAPES"
 A constraint escape (AREs only) is a constraint, matching the empty
 string if specific conditions are met, written as an escape:
 .RS 2
@@ -474,13 +506,20 @@ closing capturing parentheses seen so far) a \fIback reference\fR, see
 below
 .RE
 .PP
-A word is defined as in the specification of \fB[[:<:]]\fR and
-\fB[[:>:]]\fR above.  Constraint escapes are illegal within bracket
-expressions.
-.PP
+A word is defined as in the specification of
+.QW \fB[[:<:]]\fR
+and
+.QW \fB[[:>:]]\fR
+above. Constraint escapes are illegal within bracket expressions.
+.SS "BACK REFERENCES"
 A back reference (AREs only) matches the same string matched by the
 parenthesized subexpression specified by the number, so that (e.g.)
-\fB([bc])\e1\fR matches \fBbb\fR or \fBcc\fR but not
+.QW \fB([bc])\e1\fR
+matches
+.QW \fBbb\fR
+or
+.QW \fBcc\fR
+but not
 .QW \fBbc\fR .
 The subexpression must entirely precede the back reference in the RE.
 Subexpressions are numbered in the order of their leading parentheses.
@@ -488,9 +527,9 @@ Non-capturing parentheses do not define subexpressions.
 .PP
 There is an inherent historical ambiguity between octal
 character-entry escapes and back references, which is resolved by
-heuristics, as hinted at above.  A leading zero always indicates an
-octal escape.  A single non-zero digit, not followed by another digit,
-is always taken as a back reference.  A multi-digit sequence not
+heuristics, as hinted at above. A leading zero always indicates an
+octal escape. A single non-zero digit, not followed by another digit,
+is always taken as a back reference. A multi-digit sequence not
 starting with a zero is taken as a back reference if it comes after a
 suitable subexpression (i.e. the number is in the legal range for a
 back reference), and otherwise is taken as octal.
@@ -762,7 +801,10 @@ it appears at the beginning of the RE or the beginning of a
 parenthesized subexpression (after a possible leading
 .QW \fB^\fR ).
 Finally, single-digit back references are available, and \fB\e<\fR and
-\fB\e>\fR are synonyms for \fB[[:<:]]\fR and \fB[[:>:]]\fR
+\fB\e>\fR are synonyms for
+.QW \fB[[:<:]]\fR
+and
+.QW \fB[[:>:]]\fR
 respectively; no other escapes are available.
 .SH "SEE ALSO"
 RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
-- 
cgit v0.12