author | Georg Brandl <georg@python.org> | 2007-08-31 08:07:45 (GMT) |
---|---|---|
committer | Georg Brandl <georg@python.org> | 2007-08-31 08:07:45 (GMT) |
commit | 57e3b68c220ef2a6387419cef69ff1d1c7f283cf (patch) | |
tree | a0bb8df6896cc13872bfbb776563b54e00dcb063 /Doc/reference/lexical_analysis.rst | |
parent | 3dc33d18452de871cff98914dda81ff00b4d00f6 (diff) | |
Update the first two parts of the reference manual for Py3k,
mainly concerning PEPs 3131 and 3120.
Diffstat (limited to 'Doc/reference/lexical_analysis.rst')
-rw-r--r-- | Doc/reference/lexical_analysis.rst | 387 |
1 files changed, 166 insertions, 221 deletions
diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index 35e92cf..856137d 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -5,38 +5,16 @@ Lexical analysis **************** -.. index:: - single: lexical analysis - single: parser - single: token +.. index:: lexical analysis, parser, token A Python program is read by a *parser*. Input to the parser is a stream of *tokens*, generated by the *lexical analyzer*. This chapter describes how the lexical analyzer breaks a file into tokens. -Python uses the 7-bit ASCII character set for program text. - -.. versionadded:: 2.3 - An encoding declaration can be used to indicate that string literals and - comments use an encoding different from ASCII. - -For compatibility with older versions, Python only warns if it finds 8-bit -characters; those warnings should be corrected by either declaring an explicit -encoding, or using escape sequences if those bytes are binary data, instead of -characters. - -The run-time character set depends on the I/O devices connected to the program -but is generally a superset of ASCII. - -**Future compatibility note:** It may be tempting to assume that the character -set for 8-bit characters is ISO Latin-1 (an ASCII superset that covers most -western languages that use the Latin alphabet), but it is possible that in the -future Unicode text editors will become common. These generally use the UTF-8 -encoding, which is also an ASCII superset, but with very different use for the -characters with ordinals 128-255. While there is no consensus on this subject -yet, it is unwise to assume either Latin-1 or UTF-8, even though the current -implementation appears to favor Latin-1. This applies both to the source -character set and the run-time character set. +Python reads program text as Unicode code points; the encoding of a source file +can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120` +for details. 
If the source file cannot be decoded, a :exc:`SyntaxError` is +raised. .. _line-structure: @@ -44,21 +22,17 @@ character set and the run-time character set. Line structure ============== -.. index:: single: line structure +.. index:: line structure A Python program is divided into a number of *logical lines*. -.. _logical: +.. _logical-lines: Logical lines ------------- -.. index:: - single: logical line - single: physical line - single: line joining - single: NEWLINE token +.. index:: logical line, physical line, line joining, NEWLINE token The end of a logical line is represented by the token NEWLINE. Statements cannot cross logical line boundaries except where NEWLINE is allowed by the @@ -67,7 +41,7 @@ constructed from one or more *physical lines* by following the explicit or implicit *line joining* rules. -.. _physical: +.. _physical-lines: Physical lines -------------- @@ -89,9 +63,7 @@ representing ASCII LF, is the line terminator). Comments -------- -.. index:: - single: comment - single: hash character +.. index:: comment, hash character A comment starts with a hash character (``#``) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end @@ -104,9 +76,7 @@ are ignored by the syntax; they are not tokens. Encoding declarations --------------------- -.. index:: - single: source character set - single: encodings +.. index:: source character set, encodings If a comment in the first or second line of the Python script matches the regular expression ``coding[=:]\s*([-\w.]+)``, this comment is processed as an @@ -119,19 +89,19 @@ which is recognized also by GNU Emacs, and :: # vim:fileencoding=<encoding-name> -which is recognized by Bram Moolenaar's VIM. In addition, if the first bytes of -the file are the UTF-8 byte-order mark (``'\xef\xbb\xbf'``), the declared file -encoding is UTF-8 (this is supported, among others, by Microsoft's -:program:`notepad`). +which is recognized by Bram Moolenaar's VIM. 
+ +If no encoding declaration is found, the default encoding is UTF-8. In +addition, if the first bytes of the file are the UTF-8 byte-order mark +(``b'\xef\xbb\xbf'``), the declared file encoding is UTF-8 (this is supported, +among others, by Microsoft's :program:`notepad`). If an encoding is declared, the encoding name must be recognized by Python. The -encoding is used for all lexical analysis, in particular to find the end of a -string, and to interpret the contents of Unicode literals. String literals are -converted to Unicode for syntactical analysis, then converted back to their -original encoding before interpretation starts. The encoding declaration must -appear on a line of its own. +encoding is used for all lexical analysis, including string literals, comments +and identifiers. The encoding declaration must appear on a line of its own. -.. % XXX there should be a list of supported encodings. +A list of standard encodings can be found in the section +:ref:`standard-encodings`. .. _explicit-joining: @@ -139,21 +109,13 @@ appear on a line of its own. Explicit line joining --------------------- -.. index:: - single: physical line - single: line joining - single: line continuation - single: backslash character +.. index:: physical line, line joining, line continuation, backslash character Two or more physical lines may be joined into logical lines using backslash characters (``\``), as follows: when a physical line ends in a backslash that is not part of a string literal or comment, it is joined with the following forming a single logical line, deleting the backslash and the following end-of-line -character. For example: - -.. % - -:: +character. For example:: if 1900 < year < 2100 and 1 <= month <= 12 \ and 1 <= day <= 31 and 0 <= hour < 24 \ @@ -197,9 +159,9 @@ Blank lines A logical line that contains only spaces, tabs, formfeeds and possibly a comment, is ignored (i.e., no NEWLINE token is generated). 
During interactive input of statements, handling of a blank line may differ depending on the -implementation of the read-eval-print loop. In the standard implementation, an -entirely blank logical line (i.e. one containing not even whitespace or a -comment) terminates a multi-line statement. +implementation of the read-eval-print loop. In the standard interactive +interpreter, an entirely blank logical line (i.e. one containing not even +whitespace or a comment) terminates a multi-line statement. .. _indentation: @@ -207,14 +169,7 @@ comment) terminates a multi-line statement. Indentation ----------- -.. index:: - single: indentation - single: whitespace - single: leading whitespace - single: space - single: tab - single: grouping - single: statement grouping +.. index:: indentation, leading whitespace, space, tab, grouping, statement grouping Leading whitespace (spaces and tabs) at the beginning of a logical line is used to compute the indentation level of the line, which in turn is used to determine @@ -238,9 +193,7 @@ for the indentation calculations above. Formfeed characters occurring elsewhere in the leading whitespace have an undefined effect (for instance, they may reset the space count to zero). -.. index:: - single: INDENT token - single: DEDENT token +.. index:: INDENT token, DEDENT token The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack, as follows. @@ -315,22 +268,48 @@ possible string that forms a legal token, when read from left to right. Identifiers and keywords ======================== -.. index:: - single: identifier - single: name +.. index:: identifier, name Identifiers (also referred to as *names*) are described by the following lexical definitions: -.. 
productionlist:: - identifier: (`letter`|"_") (`letter` | `digit` | "_")* - letter: `lowercase` | `uppercase` - lowercase: "a"..."z" - uppercase: "A"..."Z" - digit: "0"..."9" +The syntax of identifiers in Python is based on the Unicode standard annex +UAX-31, with elaboration and changes as defined below. + +Within the ASCII range (U+0001..U+007F), the valid characters for identifiers +are the same as in Python 2.5; Python 3.0 introduces additional +characters from outside the ASCII range (see :pep:`3131`). For other +characters, the classification uses the version of the Unicode Character +Database as included in the :mod:`unicodedata` module. Identifiers are unlimited in length. Case is significant. +.. productionlist:: + identifier: `id_start` `id_continue`* + id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, + the underscore, and characters with the Other_ID_Start property> + id_continue: <all characters in `id_start`, plus characters in the categories + Mn, Mc, Nd, Pc and others with the Other_ID_Continue property> + +The Unicode category codes mentioned above stand for: + +* *Lu* - uppercase letters +* *Ll* - lowercase letters +* *Lt* - titlecase letters +* *Lm* - modifier letters +* *Lo* - other letters +* *Nl* - letter numbers +* *Mn* - nonspacing marks +* *Mc* - spacing combining marks +* *Nd* - decimal numbers +* *Pc* - connector punctuations + +All identifiers are converted into the normal form NFC while parsing; comparison +of identifiers is based on NFC. + +A non-normative HTML file listing all valid identifier characters for Unicode +4.1 can be found at +http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html. .. _keywords: @@ -345,25 +324,13 @@ The following identifiers are used as reserved words, or *keywords* of the language, and cannot be used as ordinary identifiers. 
They must be spelled exactly as written here:: - and def for is raise - as del from lambda return - assert elif global not try - break else if or while - class except import pass with - continue finally in print yield - -.. versionchanged:: 2.4 - :const:`None` became a constant and is now recognized by the compiler as a name - for the built-in object :const:`None`. Although it is not a keyword, you cannot - assign a different object to it. - -.. versionchanged:: 2.5 - Both :keyword:`as` and :keyword:`with` are only recognized when the - ``with_statement`` future feature has been enabled. It will always be enabled in - Python 2.6. See section :ref:`with` for details. Note that using :keyword:`as` - and :keyword:`with` as identifiers will always issue a warning, even when the - ``with_statement`` future directive is not in effect. - + False class finally is return + None continue for lambda try + True def from nonlocal while + and del global not with + as elif if or yield + assert else import pass + break except in raise .. _id-classes: @@ -405,71 +372,71 @@ characters: Literals ======== -.. index:: - single: literal - single: constant +.. index:: literal, constant Literals are notations for constant values of some built-in types. .. _strings: -String literals ---------------- +String and Bytes literals +------------------------- -.. index:: single: string literal +.. index:: string literal, bytes literal, ASCII String literals are described by the following lexical definitions: -.. index:: single: ASCII@ASCII - .. 
productionlist:: stringliteral: [`stringprefix`](`shortstring` | `longstring`) - stringprefix: "r" | "u" | "ur" | "R" | "U" | "UR" | "Ur" | "uR" + stringprefix: "r" | "R" shortstring: "'" `shortstringitem`* "'" | '"' `shortstringitem`* '"' - longstring: ""'" `longstringitem`* ""'" - : | '"""' `longstringitem`* '"""' - shortstringitem: `shortstringchar` | `escapeseq` - longstringitem: `longstringchar` | `escapeseq` + longstring: "'''" `longstringitem`* "'''" | '"""' `longstringitem`* '"""' + shortstringitem: `shortstringchar` | `stringescapeseq` + longstringitem: `longstringchar` | `stringescapeseq` shortstringchar: <any source character except "\" or newline or the quote> longstringchar: <any source character except "\"> - escapeseq: "\" <any ASCII character> + stringescapeseq: "\" <any source character> + +.. productionlist:: + bytesliteral: `bytesprefix`(`shortbytes` | `longbytes`) + bytesprefix: "b" | "B" + shortbytes: "'" `shortbytesitem`* "'" | '"' `shortbytesitem`* '"' + longbytes: "'''" `longbytesitem`* "'''" | '"""' `longbytesitem`* '"""' + shortbytesitem: `shortbyteschar` | `bytesescapeseq` + longbytesitem: `longbyteschar` | `bytesescapeseq` + shortbyteschar: <any ASCII character except "\" or newline or the quote> + longbyteschar: <any ASCII character except "\"> + bytesescapeseq: "\" <any ASCII character> One syntactic restriction not indicated by these productions is that whitespace -is not allowed between the :token:`stringprefix` and the rest of the string -literal. The source character set is defined by the encoding declaration; it is -ASCII if no encoding declaration is given in the source file; see section -:ref:`encodings`. +is not allowed between the :token:`stringprefix` or :token:`bytesprefix` and the +rest of the literal. The source character set is defined by the encoding +declaration; it is UTF-8 if no encoding declaration is given in the source file; +see section :ref:`encodings`. -.. 
index:: - single: triple-quoted string - single: Unicode Consortium - single: string; Unicode - single: raw string +.. index:: triple-quoted string, Unicode Consortium, raw string -In plain English: String literals can be enclosed in matching single quotes +In plain English: Both types of literals can be enclosed in matching single quotes (``'``) or double quotes (``"``). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as *triple-quoted strings*). The backslash (``\``) character is used to escape characters that otherwise have a special meaning, such as newline, backslash -itself, or the quote character. String literals may optionally be prefixed with -a letter ``'r'`` or ``'R'``; such strings are called :dfn:`raw strings` and use -different rules for interpreting backslash escape sequences. A prefix of -``'u'`` or ``'U'`` makes the string a Unicode string. Unicode strings use the -Unicode character set as defined by the Unicode Consortium and ISO 10646. Some -additional escape sequences, described below, are available in Unicode strings. -The two prefix characters may be combined; in this case, ``'u'`` must appear -before ``'r'``. +itself, or the quote character. + +String literals may optionally be prefixed with a letter ``'r'`` or ``'R'``; +such strings are called :dfn:`raw strings` and use different rules for +interpreting backslash escape sequences. + +Bytes literals are always prefixed with ``'b'`` or ``'B'``; they produce an +instance of the :class:`bytes` type instead of the :class:`str` type. They +may only contain ASCII characters; bytes with a numeric value of 128 or greater +must be expressed with escapes. In triple-quoted strings, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the string. (A "quote" is the character used to open the string, i.e. either ``'`` or ``"``.) -.. 
index:: - single: physical line - single: escape sequence - single: Standard C - single: C +.. index:: physical line, escape sequence, Standard C, C Unless an ``'r'`` or ``'R'`` prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The @@ -478,7 +445,7 @@ recognized escape sequences are: +-----------------+---------------------------------+-------+ | Escape Sequence | Meaning | Notes | +=================+=================================+=======+ -| ``\newline`` | Ignored | | +| ``\newline`` | Backslash and newline ignored | | +-----------------+---------------------------------+-------+ | ``\\`` | Backslash (``\``) | | +-----------------+---------------------------------+-------+ @@ -494,83 +461,83 @@ recognized escape sequences are: +-----------------+---------------------------------+-------+ | ``\n`` | ASCII Linefeed (LF) | | +-----------------+---------------------------------+-------+ -| ``\N{name}`` | Character named *name* in the | | -| | Unicode database (Unicode only) | | -+-----------------+---------------------------------+-------+ | ``\r`` | ASCII Carriage Return (CR) | | +-----------------+---------------------------------+-------+ | ``\t`` | ASCII Horizontal Tab (TAB) | | +-----------------+---------------------------------+-------+ -| ``\uxxxx`` | Character with 16-bit hex value | \(1) | -| | *xxxx* (Unicode only) | | -+-----------------+---------------------------------+-------+ -| ``\Uxxxxxxxx`` | Character with 32-bit hex value | \(2) | -| | *xxxxxxxx* (Unicode only) | | -+-----------------+---------------------------------+-------+ | ``\v`` | ASCII Vertical Tab (VT) | | +-----------------+---------------------------------+-------+ -| ``\ooo`` | Character with octal value | (3,5) | +| ``\ooo`` | Character with octal value | (1,3) | | | *ooo* | | +-----------------+---------------------------------+-------+ -| ``\xhh`` | Character with hex value *hh* | (4,5) | +| ``\xhh`` | Character 
with hex value *hh* | (2,3) | +-----------------+---------------------------------+-------+ -.. index:: single: ASCII@ASCII +Escape sequences only recognized in string literals are: + ++-----------------+---------------------------------+-------+ +| Escape Sequence | Meaning | Notes | ++=================+=================================+=======+ +| ``\N{name}`` | Character named *name* in the | | +| | Unicode database | | ++-----------------+---------------------------------+-------+ +| ``\uxxxx`` | Character with 16-bit hex value | \(4) | +| | *xxxx* | | ++-----------------+---------------------------------+-------+ +| ``\Uxxxxxxxx`` | Character with 32-bit hex value | \(5) | +| | *xxxxxxxx* | | ++-----------------+---------------------------------+-------+ Notes: (1) - Individual code units which form parts of a surrogate pair can be encoded using - this escape sequence. + As in Standard C, up to three octal digits are accepted. (2) - Any Unicode character can be encoded this way, but characters outside the Basic - Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is - compiled to use 16-bit code units (the default). Individual code units which - form parts of a surrogate pair can be encoded using this escape sequence. + Unlike in Standard C, at most two hex digits are accepted. (3) - As in Standard C, up to three octal digits are accepted. + In a bytes literal, hexadecimal and octal escapes denote the byte with the + given value. In a string literal, these escapes denote a Unicode character + with the given value. (4) - Unlike in Standard C, at most two hex digits are accepted. + Individual code units which form parts of a surrogate pair can be encoded using + this escape sequence. (5) - In a string literal, hexadecimal and octal escapes denote the byte with the - given value; it is not necessary that the byte encodes a character in the source - character set. 
In a Unicode literal, these escapes denote a Unicode character - with the given value. + Any Unicode character can be encoded this way, but characters outside the Basic + Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is + compiled to use 16-bit code units (the default). Individual code units which + form parts of a surrogate pair can be encoded using this escape sequence. + -.. index:: single: unrecognized escape sequence +.. index:: unrecognized escape sequence Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., *the backslash is left in the string*. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.) It is also important to note that the -escape sequences marked as "(Unicode only)" in the table above fall into the -category of unrecognized escapes for non-Unicode string literals. - -When an ``'r'`` or ``'R'`` prefix is present, a character following a backslash -is included in the string without change, and *all backslashes are left in the -string*. For example, the string literal ``r"\n"`` consists of two characters: -a backslash and a lowercase ``'n'``. String quotes can be escaped with a -backslash, but the backslash remains in the string; for example, ``r"\""`` is a -valid string literal consisting of two characters: a backslash and a double -quote; ``r"\"`` is not a valid string literal (even a raw string cannot end in -an odd number of backslashes). Specifically, *a raw string cannot end in a -single backslash* (since the backslash would escape the following quote -character). Note also that a single backslash followed by a newline is -interpreted as those two characters as part of the string, *not* as a line -continuation. 
- -When an ``'r'`` or ``'R'`` prefix is used in conjunction with a ``'u'`` or -``'U'`` prefix, then the ``\uXXXX`` and ``\UXXXXXXXX`` escape sequences are -processed while *all other backslashes are left in the string*. For example, -the string literal ``ur"\u0062\n"`` consists of three Unicode characters: 'LATIN -SMALL LETTER B', 'REVERSE SOLIDUS', and 'LATIN SMALL LETTER N'. Backslashes can -be escaped with a preceding backslash; however, both remain in the string. As a -result, ``\uXXXX`` escape sequences are only recognized when there are an odd -number of backslashes. +escape sequences only recognized in string literals fall into the category of +unrecognized escapes for bytes literals. + +When an ``'r'`` or ``'R'`` prefix is used in a string literal, then the +``\uXXXX`` and ``\UXXXXXXXX`` escape sequences are processed while *all other +backslashes are left in the string*. For example, the string literal +``r"\u0062\n"`` consists of three Unicode characters: 'LATIN SMALL LETTER B', +'REVERSE SOLIDUS', and 'LATIN SMALL LETTER N'. Backslashes can be escaped with a +preceding backslash; however, both remain in the string. As a result, +``\uXXXX`` escape sequences are only recognized when there is an odd number of +backslashes. + +Even in a raw string, string quotes can be escaped with a backslash, but the +backslash remains in the string; for example, ``r"\""`` is a valid string +literal consisting of two characters: a backslash and a double quote; ``r"\"`` +is not a valid string literal (even a raw string cannot end in an odd number of +backslashes). Specifically, *a raw string cannot end in a single backslash* +(since the backslash would escape the following quote character). Note also +that a single backslash followed by a newline is interpreted as those two +characters as part of the string, *not* as a line continuation. .. _string-catenation: @@ -600,19 +567,9 @@ styles for each component (even mixing raw strings and triple quoted strings). 
Numeric literals ---------------- -.. index:: - single: number - single: numeric literal - single: integer literal - single: plain integer literal - single: long integer literal - single: floating point literal - single: hexadecimal literal - single: octal literal - single: binary literal - single: decimal literal - single: imaginary literal - single: complex; literal +.. index:: number, numeric literal, integer literal, plain integer literal + long integer literal, floating point literal, hexadecimal literal + octal literal, binary literal, decimal literal, imaginary literal, complex literal There are four types of numeric literals: plain integers, long integers, floating point numbers, and imaginary numbers. There are no complex literals @@ -633,18 +590,17 @@ Integer literals are described by the following lexical definitions: .. productionlist:: integer: `decimalinteger` | `octinteger` | `hexinteger` decimalinteger: `nonzerodigit` `digit`* | "0"+ + nonzerodigit: "1"..."9" + digit: "0"..."9" octinteger: "0" ("o" | "O") `octdigit`+ hexinteger: "0" ("x" | "X") `hexdigit`+ bininteger: "0" ("b" | "B") `bindigit`+ - nonzerodigit: "1"..."9" octdigit: "0"..."7" hexdigit: `digit` | "a"..."f" | "A"..."F" - bindigit: "0"..."1" + bindigit: "0" | "1" -Plain integer literals that are above the largest representable plain integer -(e.g., 2147483647 when using 32-bit arithmetic) are accepted as if they were -long integers instead. [#]_ There is no limit for long integer literals apart -from what can be stored in available memory. +There is no limit for the length of integer literals apart from what can be +stored in available memory. Note that leading zeros in a non-zero decimal number are not allowed. This is for disambiguation with C-style octal literals, which Python used before version @@ -732,7 +688,7 @@ The following tokens serve as delimiters in the grammar:: &= |= ^= >>= <<= **= The period can also occur in floating-point and imaginary literals. 
A sequence -of three periods has a special meaning as an ellipsis in slices. The second half +of three periods has a special meaning as an ellipsis literal. The second half of the list, the augmented assignment operators, serve lexically as delimiters, but also perform an operation. @@ -741,18 +697,7 @@ tokens or are otherwise significant to the lexical analyzer:: ' " # \ -.. index:: single: ASCII@ASCII - The following printing ASCII characters are not used in Python. Their occurrence outside string literals and comments is an unconditional error:: $ ? - -.. rubric:: Footnotes - -.. [#] In versions of Python prior to 2.4, octal and hexadecimal literals in the range - just above the largest representable plain integer but below the largest - unsigned 32-bit number (on a machine using 32-bit arithmetic), 4294967296, were - taken as the negative plain integer obtained by subtracting 4294967296 from - their unsigned value. - |
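The encoding-declaration rule documented in this patch (a comment on the first or second line matching ``coding[=:]\s*([-\w.]+)``, with UTF-8 as the default) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tokenizer's actual implementation: the helper name ``declared_encoding`` is hypothetical, the UTF-8 byte-order mark is not handled, and a ``#`` inside a string literal is naively treated as the start of a comment.

```python
import re

# Pattern quoted in the documentation text of this patch.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_encoding(lines, default="utf-8"):
    """Return the encoding named in a comment on the first two lines.

    Hypothetical helper for illustration only. It ignores the UTF-8
    byte-order mark and does not check whether '#' sits inside a string.
    """
    for line in lines[:2]:
        # Only the comment part of the line (after '#') is examined.
        _, hash_mark, comment = line.partition("#")
        if hash_mark:
            match = CODING_RE.search(comment)
            if match:
                return match.group(1)
    # No declaration found: fall back to the documented default.
    return default


if __name__ == "__main__":
    emacs_style = ["#!/usr/bin/env python", "# -*- coding: latin-1 -*-"]
    vim_style = ["# vim:fileencoding=utf-16"]
    print(declared_encoding(emacs_style))
    print(declared_encoding(vim_style))
    print(declared_encoding(["x = 1"]))
```

Both the Emacs-style and the Vim-style comments shown in the patch match the same pattern, because ``fileencoding=`` contains the substring ``coding=``.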