author | Fred Drake <fdrake@acm.org> | 1998-05-06 19:52:49 (GMT) |
---|---|---|
committer | Fred Drake <fdrake@acm.org> | 1998-05-06 19:52:49 (GMT) |
commit | f666917ab76a483447d5da33ebacf57ab385cb10 (patch) | |
tree | 78564f66276e06aa34a085df4fc1d599654c680f /Doc/ref/ref2.tex | |
parent | a6bb39622c6b9e485c9bd4845393ed0c28c52f81 (diff) | |
The Python Reference Manual.
Diffstat (limited to 'Doc/ref/ref2.tex')
-rw-r--r-- | Doc/ref/ref2.tex | 372 |
1 files changed, 372 insertions, 0 deletions
diff --git a/Doc/ref/ref2.tex b/Doc/ref/ref2.tex new file mode 100644 index 0000000..b0939988 --- /dev/null +++ b/Doc/ref/ref2.tex @@ -0,0 +1,372 @@ +\chapter{Lexical analysis} + +A Python program is read by a {\em parser}. Input to the parser is a +stream of {\em tokens}, generated by the {\em lexical analyzer}. This +chapter describes how the lexical analyzer breaks a file into tokens. +\index{lexical analysis} +\index{parser} +\index{token} + +\section{Line structure} + +A Python program is divided in a number of logical lines. The end of +a logical line is represented by the token NEWLINE. Statements cannot +cross logical line boundaries except where NEWLINE is allowed by the +syntax (e.g. between statements in compound statements). +\index{line structure} +\index{logical line} +\index{NEWLINE token} + +\subsection{Comments} + +A comment starts with a hash character (\verb@#@) that is not part of +a string literal, and ends at the end of the physical line. A comment +always signifies the end of the logical line. Comments are ignored by +the syntax. +\index{comment} +\index{logical line} +\index{physical line} +\index{hash character} + +\subsection{Explicit line joining} + +Two or more physical lines may be joined into logical lines using +backslash characters (\verb/\/), as follows: when a physical line ends +in a backslash that is not part of a string literal or comment, it is +joined with the following forming a single logical line, deleting the +backslash and the following end-of-line character. For example: +\index{physical line} +\index{line joining} +\index{line continuation} +\index{backslash character} +% +\begin{verbatim} +if 1900 < year < 2100 and 1 <= month <= 12 \ + and 1 <= day <= 31 and 0 <= hour < 24 \ + and 0 <= minute < 60 and 0 <= second < 60: # Looks like a valid date + return 1 +\end{verbatim} + +A line ending in a backslash cannot carry a comment; a backslash does +not continue a comment (but it does continue a string literal, see +below). 
+ +\subsection{Implicit line joining} + +Expressions in parentheses, square brackets or curly braces can be +split over more than one physical line without using backslashes. +For example: + +\begin{verbatim} +month_names = ['Januari', 'Februari', 'Maart', # These are the + 'April', 'Mei', 'Juni', # Dutch names + 'Juli', 'Augustus', 'September', # for the months + 'Oktober', 'November', 'December'] # of the year +\end{verbatim} + +Implicitly continued lines can carry comments. The indentation of the +continuation lines is not important. Blank continuation lines are +allowed. + +\subsection{Blank lines} + +A logical line that contains only spaces, tabs, and possibly a +comment, is ignored (i.e., no NEWLINE token is generated), except that +during interactive input of statements, an entirely blank logical line +terminates a multi-line statement. +\index{blank line} + +\subsection{Indentation} + +Leading whitespace (spaces and tabs) at the beginning of a logical +line is used to compute the indentation level of the line, which in +turn is used to determine the grouping of statements. +\index{indentation} +\index{whitespace} +\index{leading whitespace} +\index{space} +\index{tab} +\index{grouping} +\index{statement grouping} + +First, tabs are replaced (from left to right) by one to eight spaces +such that the total number of characters up to there is a multiple of +eight (this is intended to be the same rule as used by {\UNIX}). The +total number of spaces preceding the first non-blank character then +determines the line's indentation. Indentation cannot be split over +multiple physical lines using backslashes. + +The indentation levels of consecutive lines are used to generate +INDENT and DEDENT tokens, using a stack, as follows. +\index{INDENT token} +\index{DEDENT token} + +Before the first line of the file is read, a single zero is pushed on +the stack; this will never be popped off again. 
The numbers pushed on +the stack will always be strictly increasing from bottom to top. At +the beginning of each logical line, the line's indentation level is +compared to the top of the stack. If it is equal, nothing happens. +If it is larger, it is pushed on the stack, and one INDENT token is +generated. If it is smaller, it {\em must} be one of the numbers +occurring on the stack; all numbers on the stack that are larger are +popped off, and for each number popped off a DEDENT token is +generated. At the end of the file, a DEDENT token is generated for +each number remaining on the stack that is larger than zero. + +Here is an example of a correctly (though confusingly) indented piece +of Python code: + +\begin{verbatim} +def perm(l): + # Compute the list of all permutations of l + + if len(l) <= 1: + return [l] + r = [] + for i in range(len(l)): + s = l[:i] + l[i+1:] + p = perm(s) + for x in p: + r.append(l[i:i+1] + x) + return r +\end{verbatim} + +The following example shows various indentation errors: + +\begin{verbatim} + def perm(l): # error: first line indented + for i in range(len(l)): # error: not indented + s = l[:i] + l[i+1:] + p = perm(l[:i] + l[i+1:]) # error: unexpected indent + for x in p: + r.append(l[i:i+1] + x) + return r # error: inconsistent dedent +\end{verbatim} + +(Actually, the first three errors are detected by the parser; only the +last error is found by the lexical analyzer --- the indentation of +\verb@return r@ does not match a level popped off the stack.) + +\section{Other tokens} + +Besides NEWLINE, INDENT and DEDENT, the following categories of tokens +exist: identifiers, keywords, literals, operators, and delimiters. +Spaces and tabs are not tokens, but serve to delimit tokens. Where +ambiguity exists, a token comprises the longest possible string that +forms a legal token, when read from left to right. 
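The tab rule and the stack algorithm from the Indentation subsection of the diff can be sketched in a few lines of modern Python. This is only an illustration under stated assumptions: the function names `expand_tabs` and `indent_tokens` are hypothetical, the input is a list of already computed indentation levels (one per logical line), and the code is not taken from the actual tokenizer.

```python
def expand_tabs(leading):
    """Return the column reached by a run of leading whitespace.

    Each tab advances to the next multiple of eight, as the text describes.
    (Hypothetical helper, not part of any real tokenizer API.)
    """
    col = 0
    for ch in leading:
        if ch == "\t":
            col = (col // 8 + 1) * 8
        elif ch == " ":
            col += 1
        else:
            break  # first non-blank character ends the indentation
    return col


def indent_tokens(levels):
    """Return the INDENT/DEDENT tokens for a list of indentation levels."""
    stack = [0]  # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            # A smaller level must already be on the stack.
            if level not in stack:
                raise IndentationError("inconsistent dedent")
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
    # End of file: one DEDENT for each remaining level above zero.
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

For instance, the levels `[0, 4, 4, 8, 0]` produce two INDENT tokens followed by two DEDENT tokens, while `[0, 8, 4]` raises an error because 4 was never pushed on the stack.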
+ +\section{Identifiers} + +Identifiers (also referred to as names) are described by the following +lexical definitions: +\index{identifier} +\index{name} + +\begin{verbatim} +identifier: (letter|"_") (letter|digit|"_")* +letter: lowercase | uppercase +lowercase: "a"..."z" +uppercase: "A"..."Z" +digit: "0"..."9" +\end{verbatim} + +Identifiers are unlimited in length. Case is significant. + +\subsection{Keywords} + +The following identifiers are used as reserved words, or {\em +keywords} of the language, and cannot be used as ordinary +identifiers. They must be spelled exactly as written here: +\index{keyword} +\index{reserved word} + +\begin{verbatim} +and elif global not try +break else if or while +class except import pass +continue finally in print +def for is raise +del from lambda return +\end{verbatim} + +% When adding keywords, pipe it through keywords.py for reformatting + +\section{Literals} \label{literals} + +Literals are notations for constant values of some built-in types. +\index{literal} +\index{constant} + +\subsection{String literals} + +String literals are described by the following lexical definitions: +\index{string literal} + +\begin{verbatim} +stringliteral: shortstring | longstring +shortstring: "'" shortstringitem* "'" | '"' shortstringitem* '"' +longstring: "'''" longstringitem* "'''" | '"""' longstringitem* '"""' +shortstringitem: shortstringchar | escapeseq +longstringitem: longstringchar | escapeseq +shortstringchar: <any ASCII character except "\" or newline or the quote> +longstringchar: <any ASCII character except "\"> +escapeseq: "\" <any ASCII character> +\end{verbatim} +\index{ASCII} + +In ``long strings'' (strings surrounded by sets of three quotes), +unescaped newlines and quotes are allowed (and are retained), except +that three unescaped quotes in a row terminate the string. (A +``quote'' is the character used to open the string, i.e. either +\verb/'/ or \verb/"/.) 
+ +Escape sequences in strings are interpreted according to rules similar +to those used by Standard C. The recognized escape sequences are: +\index{physical line} +\index{escape sequence} +\index{Standard C} +\index{C} + +\begin{center} +\begin{tabular}{|l|l|} +\hline +\verb/\/{\em newline} & Ignored \\ +\verb/\\/ & Backslash (\verb/\/) \\ +\verb/\'/ & Single quote (\verb/'/) \\ +\verb/\"/ & Double quote (\verb/"/) \\ +\verb/\a/ & \ASCII{} Bell (BEL) \\ +\verb/\b/ & \ASCII{} Backspace (BS) \\ +%\verb/\E/ & \ASCII{} Escape (ESC) \\ +\verb/\f/ & \ASCII{} Formfeed (FF) \\ +\verb/\n/ & \ASCII{} Linefeed (LF) \\ +\verb/\r/ & \ASCII{} Carriage Return (CR) \\ +\verb/\t/ & \ASCII{} Horizontal Tab (TAB) \\ +\verb/\v/ & \ASCII{} Vertical Tab (VT) \\ +\verb/\/{\em ooo} & \ASCII{} character with octal value {\em ooo} \\ +\verb/\x/{\em xx...} & \ASCII{} character with hex value {\em xx...} \\ +\hline +\end{tabular} +\end{center} +\index{ASCII} + +In strict compatibility with Standard C, up to three octal digits are +accepted, but an unlimited number of hex digits is taken to be part of +the hex escape (and then the lower 8 bits of the resulting hex number +are used in all current implementations...). + +All unrecognized escape sequences are left in the string unchanged, +i.e., {\em the backslash is left in the string.} (This behavior is +useful when debugging: if an escape sequence is mistyped, the +resulting output is more easily recognized as broken. It also helps a +great deal for string literals used as regular expressions or +otherwise passed to other modules that do their own escape handling.) +\index{unrecognized escape sequence} + +\subsection{Numeric literals} + +There are three types of numeric literals: plain integers, long +integers, and floating point numbers. 
+\index{number} +\index{numeric literal} +\index{integer literal} +\index{plain integer literal} +\index{long integer literal} +\index{floating point literal} +\index{hexadecimal literal} +\index{octal literal} +\index{decimal literal} + +Integer and long integer literals are described by the following +lexical definitions: + +\begin{verbatim} +longinteger: integer ("l"|"L") +integer: decimalinteger | octinteger | hexinteger +decimalinteger: nonzerodigit digit* | "0" +octinteger: "0" octdigit+ +hexinteger: "0" ("x"|"X") hexdigit+ + +nonzerodigit: "1"..."9" +octdigit: "0"..."7" +hexdigit: digit|"a"..."f"|"A"..."F" +\end{verbatim} + +Although both lower case `l' and upper case `L' are allowed as suffix +for long integers, it is strongly recommended to always use `L', since +the letter `l' looks too much like the digit `1'. + +Plain integer decimal literals must be at most 2147483647 (i.e., the +largest positive integer, using 32-bit arithmetic). Plain octal and +hexadecimal literals may be as large as 4294967295, but values larger +than 2147483647 are converted to a negative value by subtracting +4294967296. There is no limit for long integer literals apart from +what can be stored in available memory. + +Some examples of plain and long integer literals: + +\begin{verbatim} +7 2147483647 0177 0x80000000 +3L 79228162514264337593543950336L 0377L 0x100000000L +\end{verbatim} + +Floating point literals are described by the following lexical +definitions: + +\begin{verbatim} +floatnumber: pointfloat | exponentfloat +pointfloat: [intpart] fraction | intpart "." +exponentfloat: (intpart | pointfloat) exponent +intpart: digit+ +fraction: "." digit+ +exponent: ("e"|"E") ["+"|"-"] digit+ +\end{verbatim} + +The allowed range of floating point literals is +implementation-dependent. + +Some examples of floating point literals: + +\begin{verbatim} +3.14 10. 
.001 1e100 3.14e-10 +\end{verbatim} + +Note that numeric literals do not include a sign; a phrase like +\verb@-1@ is actually an expression composed of the operator +\verb@-@ and the literal \verb@1@. + +\section{Operators} + +The following tokens are operators: +\index{operators} + +\begin{verbatim} ++ - * / % +<< >> & | ^ ~ +< == > <= <> != >= +\end{verbatim} + +The comparison operators \verb@<>@ and \verb@!=@ are alternate +spellings of the same operator. + +\section{Delimiters} + +The following tokens serve as delimiters or otherwise have a special +meaning: +\index{delimiters} + +\begin{verbatim} +( ) [ ] { } +, : . " ` ' += ; +\end{verbatim} + +The following printing \ASCII{} characters are not used in Python. Their +occurrence outside string literals and comments is an unconditional +error: +\index{ASCII} + +\begin{verbatim} +@ $ ? +\end{verbatim} + +They may be used by future versions of the language though! |
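The 32-bit rule for plain integer literals stated in the Numeric literals subsection of the diff can be made concrete with a small sketch. It is written in modern Python, whose own integers are unbounded, so the wraparound is computed explicitly. The function name `plain_int_value` is hypothetical, and the sketch assumes the literal has already been recognized as a plain (non-long) integer.

```python
def plain_int_value(literal):
    """Value of a plain integer literal under the 32-bit rule in the text.

    Hypothetical helper: `literal` is assumed to match the `integer`
    grammar rule (decimal, octal with a leading 0, or hex with 0x/0X).
    """
    text = literal.lower()
    if text.startswith("0x"):
        value = int(text[2:], 16)
    elif text.startswith("0") and len(text) > 1:
        value = int(text[1:], 8)
    else:
        # Decimal literals must be at most 2147483647.
        value = int(text, 10)
        if value > 2147483647:
            raise OverflowError("plain decimal literal too large")
        return value
    # Octal and hex literals may be as large as 4294967295 ...
    if value > 4294967295:
        raise OverflowError("plain octal/hex literal too large")
    # ... but values above 2147483647 become negative.
    if value > 2147483647:
        value -= 4294967296
    return value
```

Under this rule `0x80000000` from the examples evaluates to -2147483648 and `0177` to 127.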