diff options
Diffstat (limited to 'Doc/ref/ref2.tex')
-rw-r--r-- | Doc/ref/ref2.tex | 372 |
1 files changed, 0 insertions, 372 deletions
diff --git a/Doc/ref/ref2.tex b/Doc/ref/ref2.tex deleted file mode 100644 index b0939988..0000000 --- a/Doc/ref/ref2.tex +++ /dev/null @@ -1,372 +0,0 @@ -\chapter{Lexical analysis} - -A Python program is read by a {\em parser}. Input to the parser is a -stream of {\em tokens}, generated by the {\em lexical analyzer}. This -chapter describes how the lexical analyzer breaks a file into tokens. -\index{lexical analysis} -\index{parser} -\index{token} - -\section{Line structure} - -A Python program is divided in a number of logical lines. The end of -a logical line is represented by the token NEWLINE. Statements cannot -cross logical line boundaries except where NEWLINE is allowed by the -syntax (e.g. between statements in compound statements). -\index{line structure} -\index{logical line} -\index{NEWLINE token} - -\subsection{Comments} - -A comment starts with a hash character (\verb@#@) that is not part of -a string literal, and ends at the end of the physical line. A comment -always signifies the end of the logical line. Comments are ignored by -the syntax. -\index{comment} -\index{logical line} -\index{physical line} -\index{hash character} - -\subsection{Explicit line joining} - -Two or more physical lines may be joined into logical lines using -backslash characters (\verb/\/), as follows: when a physical line ends -in a backslash that is not part of a string literal or comment, it is -joined with the following forming a single logical line, deleting the -backslash and the following end-of-line character. For example: -\index{physical line} -\index{line joining} -\index{line continuation} -\index{backslash character} -% -\begin{verbatim} -if 1900 < year < 2100 and 1 <= month <= 12 \ - and 1 <= day <= 31 and 0 <= hour < 24 \ - and 0 <= minute < 60 and 0 <= second < 60: # Looks like a valid date - return 1 -\end{verbatim} - -A line ending in a backslash cannot carry a comment; a backslash does -not continue a comment (but it does continue a string literal, see -below). - -\subsection{Implicit line joining} - -Expressions in parentheses, square brackets or curly braces can be -split over more than one physical line without using backslashes. -For example: - -\begin{verbatim} -month_names = ['Januari', 'Februari', 'Maart', # These are the - 'April', 'Mei', 'Juni', # Dutch names - 'Juli', 'Augustus', 'September', # for the months - 'Oktober', 'November', 'December'] # of the year -\end{verbatim} - -Implicitly continued lines can carry comments. The indentation of the -continuation lines is not important. Blank continuation lines are -allowed. - -\subsection{Blank lines} - -A logical line that contains only spaces, tabs, and possibly a -comment, is ignored (i.e., no NEWLINE token is generated), except that -during interactive input of statements, an entirely blank logical line -terminates a multi-line statement. -\index{blank line} - -\subsection{Indentation} - -Leading whitespace (spaces and tabs) at the beginning of a logical -line is used to compute the indentation level of the line, which in -turn is used to determine the grouping of statements. -\index{indentation} -\index{whitespace} -\index{leading whitespace} -\index{space} -\index{tab} -\index{grouping} -\index{statement grouping} - -First, tabs are replaced (from left to right) by one to eight spaces -such that the total number of characters up to there is a multiple of -eight (this is intended to be the same rule as used by {\UNIX}). The -total number of spaces preceding the first non-blank character then -determines the line's indentation. Indentation cannot be split over -multiple physical lines using backslashes. - -The indentation levels of consecutive lines are used to generate -INDENT and DEDENT tokens, using a stack, as follows. -\index{INDENT token} -\index{DEDENT token} - -Before the first line of the file is read, a single zero is pushed on -the stack; this will never be popped off again. The numbers pushed on -the stack will always be strictly increasing from bottom to top. At -the beginning of each logical line, the line's indentation level is -compared to the top of the stack. If it is equal, nothing happens. -If it is larger, it is pushed on the stack, and one INDENT token is -generated. If it is smaller, it {\em must} be one of the numbers -occurring on the stack; all numbers on the stack that are larger are -popped off, and for each number popped off a DEDENT token is -generated. At the end of the file, a DEDENT token is generated for -each number remaining on the stack that is larger than zero. - -Here is an example of a correctly (though confusingly) indented piece -of Python code: - -\begin{verbatim} -def perm(l): - # Compute the list of all permutations of l - - if len(l) <= 1: - return [l] - r = [] - for i in range(len(l)): - s = l[:i] + l[i+1:] - p = perm(s) - for x in p: - r.append(l[i:i+1] + x) - return r -\end{verbatim} - -The following example shows various indentation errors: - -\begin{verbatim} - def perm(l): # error: first line indented - for i in range(len(l)): # error: not indented - s = l[:i] + l[i+1:] - p = perm(l[:i] + l[i+1:]) # error: unexpected indent - for x in p: - r.append(l[i:i+1] + x) - return r # error: inconsistent dedent -\end{verbatim} - -(Actually, the first three errors are detected by the parser; only the -last error is found by the lexical analyzer --- the indentation of -\verb@return r@ does not match a level popped off the stack.) - -\section{Other tokens} - -Besides NEWLINE, INDENT and DEDENT, the following categories of tokens -exist: identifiers, keywords, literals, operators, and delimiters. -Spaces and tabs are not tokens, but serve to delimit tokens. Where -ambiguity exists, a token comprises the longest possible string that -forms a legal token, when read from left to right. - -\section{Identifiers} - -Identifiers (also referred to as names) are described by the following -lexical definitions: -\index{identifier} -\index{name} - -\begin{verbatim} -identifier: (letter|"_") (letter|digit|"_")* -letter: lowercase | uppercase -lowercase: "a"..."z" -uppercase: "A"..."Z" -digit: "0"..."9" -\end{verbatim} - -Identifiers are unlimited in length. Case is significant. - -\subsection{Keywords} - -The following identifiers are used as reserved words, or {\em -keywords} of the language, and cannot be used as ordinary -identifiers. They must be spelled exactly as written here: -\index{keyword} -\index{reserved word} - -\begin{verbatim} -and elif global not try -break else if or while -class except import pass -continue finally in print -def for is raise -del from lambda return -\end{verbatim} - -% When adding keywords, pipe it through keywords.py for reformatting - -\section{Literals} \label{literals} - -Literals are notations for constant values of some built-in types. -\index{literal} -\index{constant} - -\subsection{String literals} - -String literals are described by the following lexical definitions: -\index{string literal} - -\begin{verbatim} -stringliteral: shortstring | longstring -shortstring: "'" shortstringitem* "'" | '"' shortstringitem* '"' -longstring: "'''" longstringitem* "'''" | '"""' longstringitem* '"""' -shortstringitem: shortstringchar | escapeseq -longstringitem: longstringchar | escapeseq -shortstringchar: <any ASCII character except "\" or newline or the quote> -longstringchar: <any ASCII character except "\"> -escapeseq: "\" <any ASCII character> -\end{verbatim} -\index{ASCII} - -In ``long strings'' (strings surrounded by sets of three quotes), -unescaped newlines and quotes are allowed (and are retained), except -that three unescaped quotes in a row terminate the string. (A -``quote'' is the character used to open the string, i.e. either -\verb/'/ or \verb/"/.) - -Escape sequences in strings are interpreted according to rules similar -to those used by Standard C. The recognized escape sequences are: -\index{physical line} -\index{escape sequence} -\index{Standard C} -\index{C} - -\begin{center} -\begin{tabular}{|l|l|} -\hline -\verb/\/{\em newline} & Ignored \\ -\verb/\\/ & Backslash (\verb/\/) \\ -\verb/\'/ & Single quote (\verb/'/) \\ -\verb/\"/ & Double quote (\verb/"/) \\ -\verb/\a/ & \ASCII{} Bell (BEL) \\ -\verb/\b/ & \ASCII{} Backspace (BS) \\ -%\verb/\E/ & \ASCII{} Escape (ESC) \\ -\verb/\f/ & \ASCII{} Formfeed (FF) \\ -\verb/\n/ & \ASCII{} Linefeed (LF) \\ -\verb/\r/ & \ASCII{} Carriage Return (CR) \\ -\verb/\t/ & \ASCII{} Horizontal Tab (TAB) \\ -\verb/\v/ & \ASCII{} Vertical Tab (VT) \\ -\verb/\/{\em ooo} & \ASCII{} character with octal value {\em ooo} \\ -\verb/\x/{\em xx...} & \ASCII{} character with hex value {\em xx...} \\ -\hline -\end{tabular} -\end{center} -\index{ASCII} - -In strict compatibility with Standard C, up to three octal digits are -accepted, but an unlimited number of hex digits is taken to be part of -the hex escape (and then the lower 8 bits of the resulting hex number -are used in all current implementations...). - -All unrecognized escape sequences are left in the string unchanged, -i.e., {\em the backslash is left in the string.} (This behavior is -useful when debugging: if an escape sequence is mistyped, the -resulting output is more easily recognized as broken. It also helps a -great deal for string literals used as regular expressions or -otherwise passed to other modules that do their own escape handling.) -\index{unrecognized escape sequence} - -\subsection{Numeric literals} - -There are three types of numeric literals: plain integers, long -integers, and floating point numbers. -\index{number} -\index{numeric literal} -\index{integer literal} -\index{plain integer literal} -\index{long integer literal} -\index{floating point literal} -\index{hexadecimal literal} -\index{octal literal} -\index{decimal literal} - -Integer and long integer literals are described by the following -lexical definitions: - -\begin{verbatim} -longinteger: integer ("l"|"L") -integer: decimalinteger | octinteger | hexinteger -decimalinteger: nonzerodigit digit* | "0" -octinteger: "0" octdigit+ -hexinteger: "0" ("x"|"X") hexdigit+ - -nonzerodigit: "1"..."9" -octdigit: "0"..."7" -hexdigit: digit|"a"..."f"|"A"..."F" -\end{verbatim} - -Although both lower case `l' and upper case `L' are allowed as suffix -for long integers, it is strongly recommended to always use `L', since -the letter `l' looks too much like the digit `1'. - -Plain integer decimal literals must be at most 2147483647 (i.e., the -largest positive integer, using 32-bit arithmetic). Plain octal and -hexadecimal literals may be as large as 4294967295, but values larger -than 2147483647 are converted to a negative value by subtracting -4294967296. There is no limit for long integer literals apart from -what can be stored in available memory. - -Some examples of plain and long integer literals: - -\begin{verbatim} -7 2147483647 0177 0x80000000 -3L 79228162514264337593543950336L 0377L 0x100000000L -\end{verbatim} - -Floating point literals are described by the following lexical -definitions: - -\begin{verbatim} -floatnumber: pointfloat | exponentfloat -pointfloat: [intpart] fraction | intpart "." -exponentfloat: (intpart | pointfloat) exponent -intpart: digit+ -fraction: "." digit+ -exponent: ("e"|"E") ["+"|"-"] digit+ -\end{verbatim} - -The allowed range of floating point literals is -implementation-dependent. - -Some examples of floating point literals: - -\begin{verbatim} -3.14 10. .001 1e100 3.14e-10 -\end{verbatim} - -Note that numeric literals do not include a sign; a phrase like -\verb@-1@ is actually an expression composed of the operator -\verb@-@ and the literal \verb@1@. - -\section{Operators} - -The following tokens are operators: -\index{operators} - -\begin{verbatim} -+ - * / % -<< >> & | ^ ~ -< == > <= <> != >= -\end{verbatim} - -The comparison operators \verb@<>@ and \verb@!=@ are alternate -spellings of the same operator. - -\section{Delimiters} - -The following tokens serve as delimiters or otherwise have a special -meaning: -\index{delimiters} - -\begin{verbatim} -( ) [ ] { } -, : . " ` ' -= ; -\end{verbatim} - -The following printing \ASCII{} characters are not used in Python. Their -occurrence outside string literals and comments is an unconditional -error: -\index{ASCII} - -\begin{verbatim} -@ $ ? -\end{verbatim} - -They may be used by future versions of the language though! |