author | Guido van Rossum <guido@python.org> | 1992-08-14 09:11:01 (GMT) |
---|---|---|
committer | Guido van Rossum <guido@python.org> | 1992-08-14 09:11:01 (GMT) |
commit | 46f3e00407d614e0a1003379197c75e1b835e629 (patch) | |
tree | 21ac1f88f3c19e58bd8ff26e5998e44af1e4381f /Doc/ref2.tex | |
parent | 39789030bd9c9d21e2e9b6c8ca2e1214ba8f4b52 (diff) | |
Initial revision
Diffstat (limited to 'Doc/ref2.tex')
-rw-r--r-- | Doc/ref2.tex | 349 |
1 files changed, 349 insertions, 0 deletions
diff --git a/Doc/ref2.tex b/Doc/ref2.tex new file mode 100644 index 0000000..250bd2e --- /dev/null +++ b/Doc/ref2.tex @@ -0,0 +1,349 @@ +\chapter{Lexical analysis} + +A Python program is read by a {\em parser}. Input to the parser is a +stream of {\em tokens}, generated by the {\em lexical analyzer}. This +chapter describes how the lexical analyzer breaks a file into tokens. +\index{lexical analysis} +\index{parser} +\index{token} + +\section{Line structure} + +A Python program is divided into a number of logical lines. The end of +a logical line is represented by the token NEWLINE. Statements cannot +cross logical line boundaries except where NEWLINE is allowed by the +syntax (e.g. between statements in compound statements). +\index{line structure} +\index{logical line} +\index{NEWLINE token} + +\subsection{Comments} + +A comment starts with a hash character (\verb\#\) that is not part of +a string literal, and ends at the end of the physical line. A comment +always signifies the end of the logical line. Comments are ignored by +the syntax. +\index{comment} +\index{logical line} +\index{physical line} +\index{hash character} + +\subsection{Line joining} + +Two or more physical lines may be joined into logical lines using +backslash characters (\verb/\/), as follows: when a physical line ends +in a backslash that is not part of a string literal or comment, it is +joined with the following line, forming a single logical line, deleting the +backslash and the following end-of-line character. 
For example: +\index{physical line} +\index{line joining} +\index{backslash character} +% +\begin{verbatim} +month_names = ['Januari', 'Februari', 'Maart', \ + 'April', 'Mei', 'Juni', \ + 'Juli', 'Augustus', 'September', \ + 'Oktober', 'November', 'December'] +\end{verbatim} + +\subsection{Blank lines} + +A logical line that contains only spaces, tabs, and possibly a +comment, is ignored (i.e., no NEWLINE token is generated), except that +during interactive input of statements, an entirely blank logical line +terminates a multi-line statement. +\index{blank line} + +\subsection{Indentation} + +Leading whitespace (spaces and tabs) at the beginning of a logical +line is used to compute the indentation level of the line, which in +turn is used to determine the grouping of statements. +\index{indentation} +\index{whitespace} +\index{leading whitespace} +\index{space} +\index{tab} +\index{grouping} +\index{statement grouping} + +First, tabs are replaced (from left to right) by one to eight spaces +such that the total number of characters up to there is a multiple of +eight (this is intended to be the same rule as used by {\UNIX}). The +total number of spaces preceding the first non-blank character then +determines the line's indentation. Indentation cannot be split over +multiple physical lines using backslashes. + +The indentation levels of consecutive lines are used to generate +INDENT and DEDENT tokens, using a stack, as follows. +\index{INDENT token} +\index{DEDENT token} + +Before the first line of the file is read, a single zero is pushed on +the stack; this will never be popped off again. The numbers pushed on +the stack will always be strictly increasing from bottom to top. At +the beginning of each logical line, the line's indentation level is +compared to the top of the stack. If it is equal, nothing happens. +If it is larger, it is pushed on the stack, and one INDENT token is +generated. 
If it is smaller, it {\em must} be one of the numbers +occurring on the stack; all numbers on the stack that are larger are +popped off, and for each number popped off a DEDENT token is +generated. At the end of the file, a DEDENT token is generated for +each number remaining on the stack that is larger than zero. + +Here is an example of a correctly (though confusingly) indented piece +of Python code: + +\begin{verbatim} +def perm(l): + # Compute the list of all permutations of l + + if len(l) <= 1: + return [l] + r = [] + for i in range(len(l)): + s = l[:i] + l[i+1:] + p = perm(s) + for x in p: + r.append(l[i:i+1] + x) + return r +\end{verbatim} + +The following example shows various indentation errors: + +\begin{verbatim} + def perm(l): # error: first line indented + for i in range(len(l)): # error: not indented + s = l[:i] + l[i+1:] + p = perm(l[:i] + l[i+1:]) # error: unexpected indent + for x in p: + r.append(l[i:i+1] + x) + return r # error: inconsistent dedent +\end{verbatim} + +(Actually, the first three errors are detected by the parser; only the +last error is found by the lexical analyzer --- the indentation of +\verb\return r\ does not match a level popped off the stack.) + +\section{Other tokens} + +Besides NEWLINE, INDENT and DEDENT, the following categories of tokens +exist: identifiers, keywords, literals, operators, and delimiters. +Spaces and tabs are not tokens, but serve to delimit tokens. Where +ambiguity exists, a token comprises the longest possible string that +forms a legal token, when read from left to right. + +\section{Identifiers} + +Identifiers (also referred to as names) are described by the following +lexical definitions: +\index{identifier} +\index{name} + +\begin{verbatim} +identifier: (letter|"_") (letter|digit|"_")* +letter: lowercase | uppercase +lowercase: "a"..."z" +uppercase: "A"..."Z" +digit: "0"..."9" +\end{verbatim} + +Identifiers are unlimited in length. Case is significant. 
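The identifier grammar above maps directly onto a regular expression. The following is a minimal sketch in present-day Python, not part of the committed file; the names `IDENTIFIER` and `is_identifier` are hypothetical and chosen only for illustration. It follows the ASCII-only rule given in the grammar and does not exclude keywords.

```python
import re

# identifier: (letter|"_") (letter|digit|"_")*
# ASCII letters and digits only, as in the lexical definition above.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text):
    """Return True if the whole of text matches the identifier grammar."""
    return IDENTIFIER.fullmatch(text) is not None
```

For instance, `is_identifier("_spam")` holds while `is_identifier("9spam")` does not, because a digit may not start an identifier. Since case is significant, `Spam` and `spam` both match but name different identifiers.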
+ +\subsection{Keywords} + +The following identifiers are used as reserved words, or {\em +keywords} of the language, and cannot be used as ordinary +identifiers. They must be spelled exactly as written here: +\index{keyword} +\index{reserved word} + +\begin{verbatim} +and del for in print +break elif from is raise +class else global not return +continue except if or try +def finally import pass while +\end{verbatim} + +% # This Python program sorts and formats the above table +% import string +% l = [] +% try: +% while 1: +% l = l + string.split(raw_input()) +% except EOFError: +% pass +% l.sort() +% for i in range((len(l)+4)/5): +% for j in range(i, len(l), 5): +% print string.ljust(l[j], 10), +% print + +\section{Literals} \label{literals} + +Literals are notations for constant values of some built-in types. +\index{literal} +\index{constant} + +\subsection{String literals} + +String literals are described by the following lexical definitions: +\index{string literal} + +\begin{verbatim} +stringliteral: "'" stringitem* "'" +stringitem: stringchar | escapeseq +stringchar: <any ASCII character except newline or "\" or "'"> +escapeseq: "\" <any ASCII character except newline> +\end{verbatim} +\index{ASCII} + +String literals cannot span physical line boundaries. Escape +sequences in strings are actually interpreted according to rules +similar to those used by Standard C. 
The recognized escape sequences +are: +\index{physical line} +\index{escape sequence} +\index{Standard C} +\index{C} + +\begin{center} +\begin{tabular}{|l|l|} +\hline +\verb/\\/ & Backslash (\verb/\/) \\ +\verb/\'/ & Single quote (\verb/'/) \\ +\verb/\a/ & ASCII Bell (BEL) \\ +\verb/\b/ & ASCII Backspace (BS) \\ +%\verb/\E/ & ASCII Escape (ESC) \\ +\verb/\f/ & ASCII Formfeed (FF) \\ +\verb/\n/ & ASCII Linefeed (LF) \\ +\verb/\r/ & ASCII Carriage Return (CR) \\ +\verb/\t/ & ASCII Horizontal Tab (TAB) \\ +\verb/\v/ & ASCII Vertical Tab (VT) \\ +\verb/\/{\em ooo} & ASCII character with octal value {\em ooo} \\ +\verb/\x/{\em xx...} & ASCII character with hex value {\em xx...} \\ +\hline +\end{tabular} +\end{center} +\index{ASCII} + +In strict compatibility with Standard C, up to three octal digits are +accepted, but an unlimited number of hex digits is taken to be part of +the hex escape (and then the lower 8 bits of the resulting hex number +are used in all current implementations...). + +All unrecognized escape sequences are left in the string unchanged, +i.e., {\em the backslash is left in the string.} (This behavior is +useful when debugging: if an escape sequence is mistyped, the +resulting output is more easily recognized as broken. It also helps a +great deal for string literals used as regular expressions or +otherwise passed to other modules that do their own escape handling.) +\index{unrecognized escape sequence} + +\subsection{Numeric literals} + +There are three types of numeric literals: plain integers, long +integers, and floating point numbers. 
+\index{number} +\index{numeric literal} +\index{integer literal} +\index{plain integer literal} +\index{long integer literal} +\index{floating point literal} +\index{hexadecimal literal} +\index{octal literal} +\index{decimal literal} + +Integer and long integer literals are described by the following +lexical definitions: + +\begin{verbatim} +longinteger: integer ("l"|"L") +integer: decimalinteger | octinteger | hexinteger +decimalinteger: nonzerodigit digit* | "0" +octinteger: "0" octdigit+ +hexinteger: "0" ("x"|"X") hexdigit+ + +nonzerodigit: "1"..."9" +octdigit: "0"..."7" +hexdigit: digit|"a"..."f"|"A"..."F" +\end{verbatim} + +Although both lower case `l' and upper case `L' are allowed as suffix +for long integers, it is strongly recommended to always use `L', since +the letter `l' looks too much like the digit `1'. + +Plain integer decimal literals must be at most $2^{31} - 1$ (i.e., the +largest positive integer, assuming 32-bit arithmetic). Plain octal and +hexadecimal literals may be as large as $2^{32} - 1$, but values +larger than $2^{31} - 1$ are converted to a negative value by +subtracting $2^{32}$. There is no limit for long integer literals. + +Some examples of plain and long integer literals: + +\begin{verbatim} +7 2147483647 0177 0x80000000 +3L 79228162514264337593543950336L 0377L 0x100000000L +\end{verbatim} + +Floating point literals are described by the following lexical +definitions: + +\begin{verbatim} +floatnumber: pointfloat | exponentfloat +pointfloat: [intpart] fraction | intpart "." +exponentfloat: (intpart | pointfloat) exponent +intpart: digit+ +fraction: "." digit+ +exponent: ("e"|"E") ["+"|"-"] digit+ +\end{verbatim} + +The allowed range of floating point literals is +implementation-dependent. + +Some examples of floating point literals: + +\begin{verbatim} +3.14 10. 
.001 1e100 3.14e-10 +\end{verbatim} + +Note that numeric literals do not include a sign; a phrase like +\verb\-1\ is actually an expression composed of the operator +\verb\-\ and the literal \verb\1\. + +\section{Operators} + +The following tokens are operators: +\index{operators} + +\begin{verbatim} ++ - * / % +<< >> & | ^ ~ +< == > <= <> != >= +\end{verbatim} + +The comparison operators \verb\<>\ and \verb\!=\ are alternate +spellings of the same operator. + +\section{Delimiters} + +The following tokens serve as delimiters or otherwise have a special +meaning: +\index{delimiters} + +\begin{verbatim} +( ) [ ] { } +; , : . ` = +\end{verbatim} + +The following printing ASCII characters are not used in Python. Their +occurrence outside string literals and comments is an unconditional +error: +\index{ASCII} + +\begin{verbatim} +@ $ " ? +\end{verbatim} + +They may be used by future versions of the language though! |
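The indentation rules in the Indentation subsection of this file (tab expansion to multiples of eight, then a stack of strictly increasing levels) can be sketched roughly as follows. This is an illustrative reading of the prose in present-day Python, not the tokenizer's actual code; the function names `indentation` and `indent_tokens` are hypothetical, and the sketch assumes blank lines and line joining have already been handled.

```python
def indentation(line):
    """Return the indentation level of one logical line.

    Tabs are expanded to the next multiple of eight columns, then the
    leading spaces are counted.
    """
    expanded = line.expandtabs(8)
    return len(expanded) - len(expanded.lstrip(" "))


def indent_tokens(levels):
    """Yield "INDENT"/"DEDENT" for a sequence of indentation levels.

    Assumption: levels holds one precomputed level per logical line.
    """
    stack = [0]  # the initial zero is never popped
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            yield "INDENT"
        elif level < stack[-1]:
            # A smaller level must already occur on the stack.
            if level not in stack:
                raise ValueError("inconsistent dedent: %d" % level)
            while stack[-1] > level:
                stack.pop()
                yield "DEDENT"
    # End of file: one DEDENT per remaining level larger than zero.
    while stack[-1] > 0:
        stack.pop()
        yield "DEDENT"
```

Feeding the levels `[0, 4, 8, 4, 0]` yields two INDENT tokens followed by two DEDENT tokens. A level such as 2 after `[0, 4]` raises an error, which corresponds to the "inconsistent dedent" case in the error example.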