author | Guido van Rossum <guido@python.org> | 1991-11-25 17:26:57 (GMT) |
---|---|---|
committer | Guido van Rossum <guido@python.org> | 1991-11-25 17:26:57 (GMT) |
commit | 4fc43bc377a0e9d0642af32d83459f5c71d8e733 (patch) | |
tree | 0ac5e2a3d2dbe9c1018967f29f921df6fe9edd0a /Doc/ref/ref.tex | |
parent | 01ebbb80ab9a4cdbc8acaa646b2f7a1b234215fc (diff) | |
First round of corrections (lexer only).
Diffstat (limited to 'Doc/ref/ref.tex')
-rw-r--r-- | Doc/ref/ref.tex | 269 |
1 files changed, 139 insertions, 130 deletions
diff --git a/Doc/ref/ref.tex b/Doc/ref/ref.tex index 6af7535..a2eb381 100644 --- a/Doc/ref/ref.tex +++ b/Doc/ref/ref.tex @@ -42,9 +42,8 @@ and MS-DOS. This reference manual describes the syntax and ``core semantics'' of the language. It is terse, but exact and complete. The semantics of non-essential built-in object types and of the built-in functions and -modules are described in the {\em Library Reference} document. For an -informal introduction to the language, see the {\em Tutorial} -document. +modules are described in the {\em Python Library Reference}. For an +informal introduction to the language, see the {\em Python Tutorial}. \end{abstract} @@ -63,132 +62,119 @@ It is not intended as a tutorial. \chapter{Lexical analysis} -A Python program is read by a {\em parser}. -Input to the parser is a stream of {\em tokens}, generated -by the {\em lexical analyzer}. +A Python program is read by a {\em parser}. Input to the parser is a +stream of {\em tokens}, generated by the {\em lexical analyzer}. This +chapter describes how the lexical analyzer breaks a file into tokens. \section{Line structure} -A Python program is divided in a number of logical lines. -Statements may not straddle logical line boundaries except where -explicitly allowed by the syntax. -To this purpose, the end of a logical line -is represented by the token NEWLINE. +A Python program is divided in a number of logical lines. Statements +do not straddle logical line boundaries except where explicitly +indicated by the syntax (i.e., for compound statements). To this +purpose, the end of a logical line is represented by the token +NEWLINE. \subsection{Comments} -A comment starts with a hash character (\verb/#/) and ends at the end -of the physical line. Comments are ignored by the syntax. -A hash character in a string literal does not start a comment. +A comment starts with a hash character (\verb\#\) that is not part of +a string literal, and ends at the end of the physical line. 
Comments +are ignored by the syntax. \subsection{Line joining} -Physical lines may be joined into logical lines using backslash -characters (\verb/\/), as follows. -If a physical line ends in a backslash that is not part of a string -literal or comment, it is joined with -the following forming a single logical line, deleting the backslash -and the following end-of-line character. More than two physical -lines may be joined together in this way. +Two or more physical lines may be joined into logical lines using +backslash characters (\verb/\/), as follows: When physical line ends +in a backslash that is not part of a string literal or comment, it is +joined with the following forming a single logical line, deleting the +backslash and the following end-of-line character. \subsection{Blank lines} -A physical line that is not the continuation of the previous line -and contains only spaces, tabs and possibly a comment, is ignored -(i.e., no NEWLINE token is generated), -except that during interactive input of statements, an empty -physical line terminates a multi-line statement. +A logical line that contains only spaces, tabs, and possibly a +comment, is ignored (i.e., no NEWLINE token is generated), except that +during interactive input of statements, an entirely blank logical line +terminates a multi-line statement. \subsection{Indentation} -Spaces and tabs at the beginning of a line are used to compute +Spaces and tabs at the beginning of a logical line are used to compute the indentation level of the line, which in turn is used to determine the grouping of statements. -First, each tab is replaced by one to eight spaces such that the column number -of the next character is a multiple of eight (counting from zero). -The column number of the first non-space character then defines the -line's indentation. -Indentation cannot be split over multiple physical lines using -backslashes. 
+First, each tab is replaced by one to eight spaces such that the total +number of spaces up to that point is a multiple of eight. The total +number of spaces preceding the first non-blank character then +determines the line's indentation. Indentation cannot be split over +multiple physical lines using backslashes. The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack, as follows. Before the first line of the file is read, a single zero is pushed on -the stack; this will never be popped off again. The numbers pushed -on the stack will always be strictly increasing from bottom to top. -At the beginning of each logical line, the line's indentation level -is compared to the top of the stack. -If it is equal, nothing happens. -If it larger, it is pushed on the stack, and one INDENT token is generated. -If it is smaller, it {\em must} be one of the numbers occurring on the -stack; all numbers on the stack that are larger are popped off, -and for each number popped off a DEDENT token is generated. -At the end of the file, a DEDENT token is generated for each number -remaining on the stack that is larger than zero. +the stack; this will never be popped off again. The numbers pushed on +the stack will always be strictly increasing from bottom to top. At +the beginning of each logical line, the line's indentation level is +compared to the top of the stack. If it is equal, nothing happens. +If it larger, it is pushed on the stack, and one INDENT token is +generated. If it is smaller, it {\em must} be one of the numbers +occurring on the stack; all numbers on the stack that are larger are +popped off, and for each number popped off a DEDENT token is +generated. At the end of the file, a DEDENT token is generated for +each number remaining on the stack that is larger than zero. 
\section{Other tokens} Besides NEWLINE, INDENT and DEDENT, the following categories of tokens exist: identifiers, keywords, literals, operators, and delimiters. -Spaces and tabs are not tokens, but serve to delimit tokens. -Where ambiguity exists, a token comprises the longest possible -string that forms a legal token, when reading from left to right. +Spaces and tabs are not tokens, but serve to delimit tokens. Where +ambiguity exists, a token comprises the longest possible string that +forms a legal token, when read from left to right. Tokens are described using an extended regular expression notation. This is similar to the extended BNF notation used later, except that -the notation <...> is used to give an informal description of a character, -and that spaces and tabs are not to be ignored. +the notation \verb\<...>\ is used to give an informal description of a +character, and that spaces and tabs are not to be ignored. \section{Identifiers} Identifiers are described by the following regular expressions: \begin{verbatim} -identifier: (letter|'_') (letter|digit|'_')* +identifier: (letter|"_") (letter|digit|"_")* letter: lowercase | uppercase -lowercase: 'a'|'b'|...|'z' -uppercase: 'A'|'B'|...|'Z' -digit: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' +lowercase: "a"|"b"|...|"z" +uppercase: "A"|"B"|...|"Z" +digit: "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" \end{verbatim} -Identifiers are unlimited in length. -Upper and lower case letters are different. +Identifiers are unlimited in length. Case is significant. \section{Keywords} -The following tokens are used as reserved words, -or keywords of the language, -and may not be used as ordinary identifiers. 
-They must be spelled exactly as written here: - -{\tt - and - break - class - continue - def - del - elif - else - except - finally - for - from - if - import - in - is - not - or - pass - print - raise - return - try - while -} +The following identifiers are used as reserved words, or {\em +keywords} of the language, and may not be used as ordinary +identifiers. They must be spelled exactly as written here: + +\begin{verbatim} +and del for is raise +break elif from not return +class else if or try +continue except import pass while +def finally in print +\end{verbatim} + +% import string +% l = [] +% try: +% while 1: +% l = l + string.split(raw_input()) +% except EOFError: +% pass +% l.sort() +% for i in range((len(l)+4)/5): +% for j in range(i, len(l), 5): +% print string.ljust(l[j], 10), +% print \section{Literals} @@ -197,24 +183,47 @@ They must be spelled exactly as written here: String literals are described by the following regular expressions: \begin{verbatim} -stringliteral: '\'' stringitem* '\'' +stringliteral: "'" stringitem* "'" stringitem: stringchar | escapeseq -stringchar: <any character except newline or '\\' or '\''> -escapeseq: '\\' <any character except newline> -\end{verbatim} - -String literals cannot span physical line boundaries. -Escape sequences in strings are actually interpreted according to almost the -same rules as used by Standard C -(XXX which should be made explicit here), -except that \verb/\E/ is equivalent to \verb/\033/, -\verb/\"/ is not recognized, -newline characters cannot be escaped, and -{\em all unrecognized escape sequences are left in the string unchanged}. -(The latter rule is useful when debugging: if an escape sequence is -mistyped, the resulting output is more easily recognized as broken. -It also helps somewhat for string literals used as regular expressions -or otherwise passed to other modules that do their own escape handling.) 
+stringchar: <any character except newline or "\" or "'"> +escapeseq: "'" <any character except newline> +\end{verbatim} + +String literals cannot span physical line boundaries. Escape +sequences in strings are actually interpreted according to rules +simular to those used by Standard C. The recognized escape sequences +are: + +\begin{center} +\begin{tabular}{|l|l|} +\hline +\verb/\\/ & Backslash (\verb/\/) \\ +\verb/\'/ & Single quote (\verb/'/) \\ +\verb/\a/ & ASCII Bell (BEL) \\ +\verb/\b/ & ASCII Backspace (BS) \\ +\verb/\E/ & ASCII Escape (ESC) \\ +\verb/\f/ & ASCII Formfeed (FF) \\ +\verb/\n/ & ASCII Linefeed (LF) \\ +\verb/\r/ & ASCII Carriage Return (CR) \\ +\verb/\t/ & ASCII Horizontal Tab (TAB) \\ +\verb/\v/ & ASCII Vertical Tab (VT) \\ +\verb/\/{\em ooo} & ASCII character with octal value {\em ooo} \\ +\verb/\x/{em xx...} & ASCII character with hex value {\em xx} \\ +\hline +\end{tabular} +\end{center} + +For compatibility with in Standard C, up to three octal digits are +accepted, but an unlimited number of hex digits is taken to be part of +the hex escape (and then the lower 8 bits of the resulting hex number +are used...). + +All unrecognized escape sequences are left in the string {\em +unchanged}, i.e., the backslash is left in the string. (This rule is +useful when debugging: if an escape sequence is mistyped, the +resulting output is more easily recognized as broken. It also helps +somewhat for string literals used as regular expressions or otherwise +passed to other modules that do their own escape handling.) \subsection{Numeric literals} @@ -224,24 +233,24 @@ and floating point numbers. 
Integers and long integers are described by the following regular expressions: \begin{verbatim} -longinteger: integer ('l'|'L') +longinteger: integer ("l"|"L") integer: decimalinteger | octinteger | hexinteger -decimalinteger: nonzerodigit digit* | '0' -octinteger: '0' octdigit+ -hexinteger: '0' ('x'|'X') hexdigit+ +decimalinteger: nonzerodigit digit* | "0" +octinteger: "0" octdigit+ +hexinteger: "0" ("x"|"X") hexdigit+ -nonzerodigit: '1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' -octdigit: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7' -hexdigit: digit|'a'|'b'|'c'|'d'|'e'|'f'|'A'|'B'|'C'|'D'|'E'|'F' +nonzerodigit: "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" +octdigit: "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7" +hexdigit: digit|"a"|"b"|"c"|"d"|"e"|"f"|"A"|"B"|"C"|"D"|"E"|"F" \end{verbatim} Floating point numbers are described by the following regular expressions: \begin{verbatim} -floatnumber: [intpart] fraction [exponent] | intpart ['.'] exponent +floatnumber: [intpart] fraction [exponent] | intpart ["."] exponent intpart: digit+ -fraction: '.' digit+ -exponent: ('e'|'E') ['+'|'-'] digit+ +fraction: "." digit+ +exponent: ("e"|"E") ["+"|"-"] digit+ \end{verbatim} \section{Operators} @@ -292,15 +301,15 @@ conditions. Conditions are a superset of expressions, and a condition may be used where an expression is required by enclosing it in parentheses. The only place where an unparenthesized condition is not allowed is on the right-hand side of the assignment operator, -because this operator is the same token (\verb/'='/) as used for +because this operator is the same token (\verb\=\) as used for compasisons. The comma plays a somewhat special role in Python's syntax. It is an operator with a lower precedence than all others, but occasionally serves other purposes as well (e.g., it has special semantics in print statements). When a comma is accepted by the -syntax, one of the syntactic categories \verb/expression_list/ -or \verb/condition_list/ is always used. 
+syntax, one of the syntactic categories \verb\expression_list\ +or \verb\condition_list\ is always used. When (one alternative of) a syntax rule has the form @@ -308,8 +317,8 @@ When (one alternative of) a syntax rule has the form name: othername \end{verbatim} -and no semantics are given, the semantics of this form of \verb/name/ -are the same as for \verb/othername/. +and no semantics are given, the semantics of this form of \verb\name\ +are the same as for \verb\othername\. \section{Arithmetic conversions} @@ -414,11 +423,11 @@ key value prevails. A string conversion evaluates the contained condition list and converts the resulting object into a string according to rules specific to its type. -If the object is a string, a number, \verb/None/, or a tuple, list or +If the object is a string, a number, \verb\None\, or a tuple, list or dictionary containing only objects whose type is in this list, the resulting string is a valid Python expression which can be passed to the -built-in function \verb/eval()/ to yield an expression with the +built-in function \verb\eval()\ to yield an expression with the same value (or an approximation, if floating point numbers are involved). @@ -459,11 +468,11 @@ Their syntax is: factor: primary | '-' factor | '+' factor | '~' factor \end{verbatim} -The unary \verb/'-'/ operator yields the negative of its numeric argument. +The unary \verb\-\ operator yields the negative of its numeric argument. -The unary \verb/'+'/ operator yields its numeric argument unchanged. +The unary \verb\+\ operator yields its numeric argument unchanged. -The unary \verb/'~'/ operator yields the bit-wise negation of its +The unary \verb\~\ operator yields the bit-wise negation of its integral numerical argument. 
In all three cases, if the argument does not have the proper type, @@ -477,7 +486,7 @@ Terms represent the most tightly binding binary operators: term: factor | term '*' factor | term '/' factor | term '%' factor \end{verbatim} -The \verb/'*'/ operator yields the product of its arguments. +The \verb\*\ operator yields the product of its arguments. The arguments must either both be numbers, or one argument must be a (short) integer and the other must be a string. In the former case, the numbers are converted to a common type @@ -572,7 +581,7 @@ it is optional in all other cases (a single expression without a trailing comma doesn't create a tuple, but rather yields the value of that expression). -To create an empty tuple, use an empty pair of parentheses: \verb/()/. +To create an empty tuple, use an empty pair of parentheses: \verb\()\. \section{Comparisons} @@ -597,8 +606,8 @@ Note that $e_0 op_1 e_1 op_2 e_2$ does not imply any kind of comparison between $e_0$ and $e_2$, e.g., $x < y > z$ is perfectly legal. For the benefit of C programmers, -the comparison operators \verb/=/ and \verb/==/ are equivalent, -and so are \verb/<>/ and \verb/!=/. +the comparison operators \verb\=\ and \verb\==\ are equivalent, +and so are \verb\<>\ and \verb\!=\. Use of the C variants is discouraged. The operators {\tt '<', '>', '=', '>=', '<='}, and {\tt '<>'} compare @@ -610,7 +619,7 @@ the value \verb\None\ compares smaller than the values of any other type. (This unusual definition of comparison is done to simplify the definition of -operations like sorting and the \verb/in/ and \verb/not in/ operators.) +operations like sorting and the \verb\in\ and \verb\not in\ operators.) Comparison of objects of the same type depends on the type: @@ -869,12 +878,12 @@ A space is written before each object is (converted and) written, unless the output system believes it is positioned at the beginning of a line. 
This is the case: (1) when no characters have been written to standard output; or (2) when the last character written to -standard output is \verb/'\n'/; +standard output is \verb/\n/; or (3) when the last I/O operation on standard output was not a \verb\print\ statement. Finally, -a \verb/'\n'/ character is written at the end, +a \verb/\n/ character is written at the end, unless the \verb\print\ statement ends with a comma. This is the only action if the statement contains just the keyword \verb\print\. |
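
The indentation rules revised in this patch (tab expansion to multiples of eight, then a stack of indentation levels that drives INDENT and DEDENT tokens) can be sketched in a few lines of modern Python. This is only an illustration of the prose in the patch, not the interpreter's actual tokenizer: the names `indentation` and `indent_tokens` are hypothetical, the input is assumed to be a list of already-joined logical lines, and raising `SyntaxError` for an indentation level that is not on the stack is an assumption about how the "must" in the text would be enforced.

```python
def indentation(line):
    """Count leading columns, expanding each tab to the next multiple of eight."""
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            width = (width // 8 + 1) * 8
        else:
            break
    return width


def indent_tokens(logical_lines):
    """Yield INDENT, DEDENT and NEWLINE markers for a list of logical lines."""
    stack = [0]  # the initial zero is never popped
    for line in logical_lines:
        body = line.strip()
        if not body or body.startswith("#"):
            continue  # blank or comment-only line: no NEWLINE token
        level = indentation(line)
        if level > stack[-1]:
            # Deeper than the current level: push it and emit one INDENT.
            stack.append(level)
            yield "INDENT"
        elif level < stack[-1]:
            # Shallower: the level must already be on the stack (assumed error type).
            if level not in stack:
                raise SyntaxError("inconsistent dedent")
            while stack[-1] > level:
                stack.pop()
                yield "DEDENT"
        yield "NEWLINE"
    # End of file: one DEDENT for every level still above zero.
    while stack[-1] > 0:
        stack.pop()
        yield "DEDENT"
```

For example, `list(indent_tokens(["if x:", "\ty = 1", "z = 2"]))` produces a NEWLINE for the first line, an INDENT and NEWLINE for the second, and a DEDENT followed by NEWLINE for the third. Because the stack is strictly increasing from bottom to top, the check `level not in stack` is enough to detect a dedent to a level that was never opened.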