summaryrefslogtreecommitdiffstats
path: root/Tools/compiler/doc/compiler.tex
blob: 824ff089433c5260bde48af9761fe1629d302666 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
% Complete documentation on the extended LaTeX markup used for Python
% documentation is available in ``Documenting Python'', which is part
% of the standard documentation for Python.  It may be found online
% at:
%
%     http://www.python.org/doc/current/doc/doc.html

\documentclass{howto}

\title{Python compiler package}

\author{Jeremy Hylton}

% Please at least include a long-lived email address;
% the rest is at your discretion.
\authoraddress{
        PythonLabs \\
        Zope Corporation \\
        Email: \email{jeremy@zope.com}
}

\date{August 15, 2001}           % update before release!
                                % Use an explicit date so that reformatting
                                % doesn't cause a new date to be used.  Setting
                                % the date to \today can be used during draft
                                % stages to make it easier to handle versions.

\release{2.2}                   % release version; this is used to define the
                                % \version macro

\makeindex                      % tell \index to actually write the .idx file
\makemodindex                   % If this contains a lot of module sections.


\begin{document}

\maketitle

\begin{abstract}

\noindent
The Python compiler package is a tool for analyzing Python source code
and generating Python bytecode.  The compiler contains libraries to
generate an abstract syntax tree from Python source code and to
generate Python bytecode from the tree.

\end{abstract}

\tableofcontents


\section{Introduction\label{Introduction}}

XXX Need basic intro

XXX what are the major advantages...  the abstract syntax is much
closer to the python source...


\section{The basic interface}

\declaremodule{}{compiler}

The top-level of the package defines four functions.

\begin{funcdesc}{parse}{buf}
Returns an abstract syntax tree for the Python source code in \var{buf}.
The function raises SyntaxError if there is an error in the source
code.  The return value is a \class{compiler.ast.Module} instance that
contains the tree.  
\end{funcdesc}

\begin{funcdesc}{parseFile}{path}
Return an abstract syntax tree for the Python source code in the file
specified by \var{path}.  It is equivalent to
\code{parse(open(\var{path}).read())}.
\end{funcdesc}

\begin{funcdesc}{walk}{ast, visitor\optional{, verbose}}
Do a pre-order walk over the abstract syntax tree \var{ast}.  Call the
appropriate method on the \var{visitor} instance for each node
encountered.
\end{funcdesc}

\begin{funcdesc}{compile}{path}
Compile the file \var{path} and generate the corresponding \file{.pyc}
file.
\end{funcdesc}

The \module{compiler} package contains the following modules:
\refmodule[compiler.ast]{ast}, \module{consts}, \module{future},
\module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols},
\module{transformer}, and \refmodule[compiler.visitor]{visitor}.


\section{Limitations}

There are some problems with the error checking of the compiler
package.  The interpreter detects syntax errors in two distinct
phases.  One set of errors is detected by the interpreter's parser,
the other set by the compiler.  The compiler package relies on the
interpreter's parser, so it get the first phases of error checking for
free.  It implements the second phase itself, and that implement is
incomplete.  For example, the compiler package does not raise an error
if a name appears more than once in an argument list: 
\code{def f(x, x): ...}


\section{Python Abstract Syntax}

The \module{compiler.ast} module defines an abstract syntax for
Python.  In the abstract syntax tree, each node represents a syntactic
construct.  The root of the tree is \class{Module} object.

The abstract syntax offers a higher level interface to parsed Python
source code.  The \ulink{\module{parser}}
{http://www.python.org/doc/current/lib/module-parser.html}
module and the compiler written in C for the Python interpreter use a
concrete syntax tree.  The concrete syntax is tied closely to the
grammar description used for the Python parser.  Instead of a single
node for a construct, there are often several levels of nested nodes
that are introduced by Python's precedence rules.

The abstract syntax tree is created by the
\module{compiler.transformer} module.  The transformer relies on the
builtin Python parser to generate a concrete syntax tree.  It
generates an abstract syntax tree from the concrete tree.  

The \module{transformer} module was created by Greg
Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an
experimental Python-to-C compiler.  The current version contains a
number of modifications and improvements, but the basic form of the
abstract syntax and of the transformer are due to Stein and Tutt.


\section{AST Nodes}

\declaremodule{}{compiler.ast}

The \module{compiler.ast} module is generated from a text file that
describes each node type and its elements.  Each node type is
represented as a class that inherits from the abstract base class
\class{compiler.ast.Node} and defines a set of named attributes for
child nodes.

\begin{classdesc}{Node}{}
  
  The \class{Node} instances are created automatically by the parser
  generator.  The recommended interface for specific \class{Node}
  instances is to use the public attributes to access child nodes.  A
  public attribute may be bound to a single node or to a sequence of
  nodes, depending on the \class{Node} type.  For example, the
  \member{bases} attribute of the \class{Class} node, is bound to a
  list of base class nodes, and the \member{doc} attribute is bound to
  a single node.
  
  Each \class{Node} instance has a \member{lineno} attribute which may
  be \code{None}.  XXX Not sure what the rules are for which nodes
  will have a useful lineno.
\end{classdesc}

All \class{Node} objects offer the following methods:

\begin{methoddesc}{getChildren}{}
  Returns a flattened list of the child nodes and objects in the
  order they occur.  Specifically, the order of the nodes is the
  order in which they appear in the Python grammar.  Not all of the
  children are \class{Node} instances.  The names of functions and
  classes, for example, are plain strings.
\end{methoddesc}

\begin{methoddesc}{getChildNodes}{}
  Returns a flattened list of the child nodes in the order they
  occur.  This method is like \method{getChildren()}, except that it
  only returns those children that are \class{Node} instances.
\end{methoddesc}

Two examples illustrate the general structure of \class{Node}
classes.  The \keyword{while} statement is defined by the following
grammar production: 

\begin{verbatim}
while_stmt:     "while" expression ":" suite
               ["else" ":" suite]
\end{verbatim}

The \class{While} node has three attributes: \member{test},
\member{body}, and \member{else_}.  (If the natural name for an
attribute is also a Python reserved word, it can't be used as an
attribute name.  An underscore is appended to the word to make it a
legal identifier, hence \member{else_} instead of \keyword{else}.)

The \keyword{if} statement is more complicated because it can include
several tests.  

\begin{verbatim}
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
\end{verbatim}

The \class{If} node only defines two attributes: \member{tests} and
\member{else_}.  The \member{tests} attribute is a sequence of test
expression, consequent body pairs.  There is one pair for each
\keyword{if}/\keyword{elif} clause.  The first element of the pair is
the test expression.  The second elements is a \class{Stmt} node that
contains the code to execute if the test is true.

The \method{getChildren()} method of \class{If} returns a flat list of
child nodes.  If there are three \keyword{if}/\keyword{elif} clauses
and no \keyword{else} clause, then \method{getChildren()} will return
a list of six elements: the first test expression, the first
\class{Stmt}, the second text expression, etc.

The following table lists each of the \class{Node} subclasses defined
in \module{compiler.ast} and each of the public attributes available
on their instances.  The values of most of the attributes are
themselves \class{Node} instances or sequences of instances.  When the
value is something other than an instance, the type is noted in the
comment.  The attributes are listed in the order in which they are
returned by \method{getChildren()} and \method{getChildNodes()}.

\input{asttable}


\section{Assignment nodes}

There is a collection of nodes used to represent assignments.  Each
assignment statement in the source code becomes a single
\class{Assign} node in the AST.  The \member{nodes} attribute is a
list that contains a node for each assignment target.  This is
necessary because assignment can be chained, e.g. \code{a = b = 2}.
Each \class{Node} in the list will be one of the following classes: 
\class{AssAttr}, \class{AssList}, \class{AssName}, or
\class{AssTuple}. 

XXX Explain what the AssXXX nodes are for.  Mention \code{a.b.c = 2}
as an example.  Explain what the flags are for.


\section{Using Visitors to Walk ASTs}

\declaremodule{}{compiler.visitor}

The visitor pattern is ...  The \refmodule{compiler} package uses a
variant on the visitor pattern that takes advantage of Python's
introspection features to elminiate the need for much of the visitor's
infrastructure.

The classes being visited do not need to be programmed to accept
visitors.  The visitor need only define visit methods for classes it
is specifically interested in; a default visit method can handle the
rest. 

XXX The magic \method{visit()} method for visitors.

\begin{funcdesc}{walk}{tree, visitor\optional{, verbose}}
\end{funcdesc}

\begin{classdesc}{ASTVisitor}{}

The \class{ASTVisitor} is responsible for walking over the tree in the
correct order.  A walk begins with a call to \method{preorder()}.  For
each node, it checks the \var{visitor} argument to \method{preorder()}
for a method named `visitNodeType,' where NodeType is the name of the
node's class, e.g. for a \class{While} node a \method{visitWhile()}
would be called .  If the method exists, it is called with the node as
its first argument.

The visitor method for a particular node type can control how child
nodes are visited during the walk.  The \class{ASTVisitor} modifies
the visitor argument by adding a visit method to the visitor; this
method can be used to visit a particular child node.  If no visitor is
found for a particular node type, the \method{default()} method is
called. 
\end{classdesc}

\class{ASTVisitor} objects have the following methods:

XXX describe extra arguments

\begin{methoddesc}{default}{node\optional{, \moreargs}}
\end{methoddesc}

\begin{methoddesc}{dispatch}{node\optional{, \moreargs}}
\end{methoddesc}

\begin{methoddesc}{preorder}{tree, visitor}
\end{methoddesc}


\section{Bytecode Generation}

The code generator is a visitor that emits bytecodes.  Each visit method
can call the \method{emit()} method to emit a new bytecode.  The basic
code generator is specialized for modules, classes, and functions.  An
assembler converts that emitted instructions to the low-level bytecode
format.  It handles things like generator of constant lists of code
objects and calculation of jump offsets.


\input{compiler.ind}                    % Index

\end{document}