diff options
-rw-r--r-- | Python/compile.txt | 507 |
1 files changed, 507 insertions, 0 deletions
diff --git a/Python/compile.txt b/Python/compile.txt new file mode 100644 index 0000000..12f2128 --- /dev/null +++ b/Python/compile.txt @@ -0,0 +1,507 @@ +Developer Notes for Python Compiler +=================================== + +Table of Contents +----------------- + +- Scope + Defines the limits of the change +- Parse Trees + Describes the local (Python) concept +- Abstract Syntax Trees (AST) + Describes the AST technology used +- Parse Tree to AST + Defines the transform approach +- Control Flow Graphs + Defines the creation of "basic blocks" +- AST to CFG to Bytecode + Tracks the flow from AST to bytecode +- Code Objects + Pointer to making bytecode "executable" +- Modified Files + Files added/modified/removed from CPython compiler +- ToDo + Work yet remaining (before complete) +- References + Academic and technical references to technology used. + + +Scope +----- + +Historically (through 2.4), compilation from source code to bytecode +involved two steps: + +1. Parse the source code into a parse tree (Parser/pgen.c) +2. Emit bytecode based on the parse tree (Python/compile.c) + +Historically, this is not how a standard compiler works. The usual +steps for compilation are: + +1. Parse source code into a parse tree (Parser/pgen.c) +2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c) +3. Transform AST into a Control Flow Graph (Python/newcompile.c) +4. Emit bytecode based on the Control Flow Graph (Python/newcompile.c) + +Starting with Python 2.5, the above steps are now used. This change +was done to simplify compilation by breaking it into three steps. +The purpose of this document is to outline how the lattter three steps +of the process works. + +This document does not touch on how parsing works beyond what is needed +to explain what is needed for compilation. It is also not exhaustive +in terms of the how the entire system works. You will most likely need +to read some source to have an exact understanding of all details. + + +Parse Trees +----------- + +Python's parser is an LL(1) parser mostly based off of the +implementation laid out in the Dragon Book [Aho86]_. + +The grammar file for Python can be found in Grammar/Grammar with the +numeric value of grammar rules are stored in Include/graminit.h. The +numeric values for types of tokens (literal tokens, such as ``:``, +numbers, etc.) are kept in Include/token.h). The parse tree made up of +``node *`` structs (as defined in Include/node.h). + +Querying data from the node structs can be done with the following +macros (which are all defined in Include/token.h): + +- ``CHILD(node *, int)`` + Returns the nth child of the node using zero-offset indexing +- ``RCHILD(node *, int)`` + Returns the nth child of the node from the right side; use + negative numbers! +- ``NCH(node *)`` + Number of children the node has +- ``STR(node *)`` + String representation of the node; e.g., will return ``:`` for a + COLON token +- ``TYPE(node *)`` + The type of node as specified in ``Include/graminit.h`` +- ``REQ(node *, TYPE)`` + Assert that the node is the type that is expected +- ``LINENO(node *)`` + retrieve the line number of the source code that led to the + creation of the parse rule; defined in Python/ast.c + +To tie all of this example, consider the rule for 'while':: + + while_stmt: 'while' test ':' suite ['else' ':' suite] + +The node representing this will have ``TYPE(node) == while_stmt`` and +the number of children can be 4 or 7 depending on if there is an 'else' +statement. To access what should be the first ':' and require it be an +actual ':' token, `(REQ(CHILD(node, 2), COLON)``. + + +Abstract Syntax Trees (AST) +--------------------------- + +The abstract syntax tree (AST) is a high-level representation of the +program structure without the necessity of containing the source code; +it can be thought of a abstract representation of the source code. The +specification of the AST nodes is specified using the Zephyr Abstract +Syntax Definition Language (ASDL) [Wang97]_. + +The definition of the AST nodes for Python is found in the file +Parser/Python.asdl . + +Each AST node (representing statements, expressions, and several +specialized types, like list comprehensions and exception handlers) is +defined by the ASDL. Most definitions in the AST correspond to a +particular source construct, such as an 'if' statement or an attribute +lookup. The definition is independent of its realization in any +particular programming language. + +The following fragment of the Python ASDL construct demonstrates the +approach and syntax:: + + module Python + { + stmt = FunctionDef(identifier name, arguments args, stmt* body, + expr* decorators) + | Return(expr? value) | Yield(expr value) + attributes (int lineno) + } + +The preceding example describes three different kinds of statements; +function definitions, return statements, and yield statements. All +three kinds are considered of type stmt as shown by '|' separating the +various kinds. They all take arguments of various kinds and amounts. + +Modifiers on the argument type specify the number of values needed; '?' +means it is optional, '*' means 0 or more, no modifier means only one +value for the argument and it is required. FunctionDef, for instance, +takes an identifier for the name, 'arguments' for args, zero or more +stmt arguments for 'body', and zero or more expr arguments for +'decorators'. + +Do notice that something like 'arguments', which is a node type, is +represented as a single AST node and not as a sequence of nodes as with +stmt as one might expect. + +All three kinds also have an 'attributes' argument; this is shown by the +fact that 'attributes' lacks a '|' before it. + +The statement definitions above generate the following C structure type:: + + typedef struct _stmt *stmt_ty; + + struct _stmt { + enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind; + union { + struct { + identifier name; + arguments_ty args; + asdl_seq *body; + } FunctionDef; + + struct { + expr_ty value; + } Return; + + struct { + expr_ty value; + } Yield; + } v; + int lineno; + } + +Also generated are a series of constructor functions that allocate (in +this case) a stmt_ty struct with the appropriate initialization. The +'kind' field specifies which component of the union is initialized. The +FunctionDef() constructor function sets 'kind' to FunctionDef_kind and +initializes the 'name', 'args', 'body', and 'attributes' fields. + +*** NOTE: if you make a change here that can affect the output of bytecode that +is already in existence, make sure to delete your old .py(c|o) files! Running +``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick. + + +Parse Tree to AST +----------------- + +The AST is generated from the parse tree in (see Python/ast.c) using the +function:: + + mod_ty PyAST_FromNode(const node *n); + +The function begins a tree walk of the parse tree, creating various AST +nodes as it goes along. It does this by allocating all new nodes it +needs, calling the proper AST node creation functions for any required +supporting functions, and connecting them as needed. + +Do realize that there is no automated nor symbolic connection between +the grammar specification and the nodes in the parse tree. No help is +directly provided by the parse tree as in yacc. + +For instance, one must keep track of +which node in the parse tree one is working with (e.g., if you are +working with an 'if' statement you need to watch out for the ':' token +to find the end of the conditional). No help is directly provided by +the parse tree as in yacc. + +The functions called to generate AST nodes from the parse tree all have +the name ast_for_xx where xx is what the grammar rule that the function +handles (alias_for_import_name is the exception to this). These in turn +call the constructor functions as defined by the ASDL grammar and +contained in Python/Python-ast.c (which was generated by +Parser/asdl_c.py) to create the nodes of the AST. This all leads to a +sequence of AST nodes stored in asdl_seq structs. + + +Function and macros for creating and using ``asdl_seq *`` types as found +in Python/asdl.c and Include/asdl.h: + +- ``asdl_seq_new(int)`` + Allocate memory for an asdl_seq for length 'size' +- ``asdl_seq_free(asdl_seq *)`` + Free asdl_seq struct +- ``asdl_seq_GET(asdl_seq *seq, int pos)`` + Get item held at 'pos' +- ``asdl_seq_SET(asdl_seq *seq, int pos, void *val)`` + Set 'pos' in 'seq' to 'val' +- ``asdl_seq_APPEND(asdl_seq *seq, void *val)`` + Set the end of 'seq' to 'val' +- ``asdl_seq_LEN(asdl_seq *)`` + Return the length of 'seq' + +If you are working with statements, you must also worry about keeping +track of what line number generated the statement. Currently the line +number is passed as the last parameter to each stmt_ty function. + + +Control Flow Graphs +------------------- + +A control flow graph (often referenced by its acronym, CFG) is a +directed graph that models the flow of a program using basic blocks that +contain the intermediate representation (abbreviated "IR", and in this +case is Python bytecode) within the blocks. Basic blocks themselves are +a block of IR that has a single entry point but possibly multiple exit +points. The single entry point is the key to basic blocks; it all has +to do with jumps. An entry point is the target of something that +changes control flow (such as a function call or a jump) while exit +points are instructions that would change the flow of the program (such +as jumps and 'return' statements). What this means is that a basic +block is a chunk of code that starts at the entry point and runs to an +exit point or the end of the block. + +As an example, consider an 'if' statement with an 'else' block. The +guard on the 'if' is a basic block which is pointed to by the basic +block containing the code leading to the 'if' statement. The 'if' +statement block contains jumps (which are exit points) to the true body +of the 'if' and the 'else' body (which may be NULL), each of which are +their own basic blocks. Both of those blocks in turn point to the +basic block representing the code following the entire 'if' statement. + +CFGs are usually one step away from final code output. Code is directly +generated from the basic blocks (with jump targets adjusted based on the +output order) by doing a post-order depth-first search on the CFG +following the edges. + + +AST to CFG to Bytecode +---------------------- + +With the AST created, the next step is to create the CFG. The first step +is to convert the AST to Python bytecode without having jump targets +resolved to specific offsets (this is calculated when the CFG goes to +final bytecode). Essentially, this transforms the AST into Python +bytecode with control flow represented by the edges of the CFG. + +Conversion is done in two passes. The first creates the namespace +(variables can be classified as local, free/cell for closures, or +global). With that done, the second pass essentially flattens the CFG +into a list and calculates jump offsets for final output of bytecode. + +The conversion process is initiated by a call to the function in +Python/newcompile.c:: + + PyCodeObject * PyAST_Compile(mod_ty, const char *, PyCompilerFlags); + +This function does both the conversion of the AST to a CFG and +outputting final bytecode from the CFG. The AST to CFG step is handled +mostly by the two functions called by PyAST_Compile():: + + struct symtable * PySymtable_Build(mod_ty, const char *, + PyFutureFeatures); + PyCodeObject * compiler_mod(struct compiler *, mod_ty); + +The former is in Python/symtable.c while the latter is in +Python/newcompile.c . + +PySymtable_Build() begins by entering the starting code block for the +AST (passed-in) and then calling the proper symtable_visit_xx function +(with xx being the AST node type). Next, the AST tree is walked with +the various code blocks that delineate the reach of a local variable +as blocks are entered and exited:: + + static int symtable_enter_block(struct symtable *, identifier, + block_ty, void *, int); + static int symtable_exit_block(struct symtable *, void *); + +Once the symbol table is created, it is time for CFG creation, whose +code is in Python/newcompile.c . This is handled by several functions +that break the task down by various AST node types. The functions are +all named compiler_visit_xx where xx is the name of the node type (such +as stmt, expr, etc.). Each function receives a ``struct compiler *`` +and xx_ty where xx is the AST node type. Typically these functions +consist of a large 'switch' statement, branching based on the kind of +node type passed to it. Simple things are handled inline in the +'switch' statement with more complex transformations farmed out to other +functions named compiler_xx with xx being a descriptive name of what is +being handled. + +When transforming an arbitrary AST node, use the VISIT macro:: + + VISIT(struct compiler *, <node type>, <AST node>); + +The appropriate compiler_visit_xx function is called, based on the value +passed in for <node type> (so ``VISIT(c, expr, node)`` calls +``compiler_visit_expr(c, node)``). The VISIT_SEQ macro is very similar, + but is called on AST node sequences (those values that were created as +arguments to a node that used the '*' modifier). There is also +VISIT_SLICE just for handling slices:: + + VISIT_SLICE(struct compiler *, slice_ty, expr_context_ty); + +Emission of bytecode is handled by the following macros: + +- ``ADDOP(struct compiler *c, int op)`` + add 'op' as an opcode +- ``ADDOP_I(struct compiler *c, int op, int oparg)`` + add 'op' with an 'oparg' argument +- ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)`` + add 'op' with the proper argument based on the position of obj in + 'type', but with no handling of mangled names; used for when you + need to do named lookups of objects such as globals, consts, or + parameters where name mangling is not possible and the scope of the + name is known +- ``ADDOP_NAME(struct compiler *, int, PyObject *, PyObject *)`` + just like ADDOP_O, but name mangling is also handled; used for + attribute loading or importing based on name +- ``ADDOP_JABS(struct compiling *c, int op, basicblock b)`` + create an absolute jump to the basic block 'b' +- ``ADDOP_JREL(struct compiling *c, int op, basicblock b)`` + create a relative jump to the basic block 'b' + +Several helper functions that will emit bytecode and are named +compiler_xx() where xx is what the function helps with (list, boolop + etc.). A rather useful one is:: + + static int compiler_nameop(struct compiler *, identifier, + expr_context_ty); + +This function looks up the scope of a variable and, based on the +expression context, emits the proper opcode to load, store, or delete +the variable. + +As for handling the line number on which a statement is defined, is +handled by compiler_visit_stmt() and thus is not a worry. + +In addition to emitting bytecode based on the AST node, handling the +creation of basic blocks must be done. Below are the macros and +functions used for managing basic blocks: + +- ``NEW_BLOCK(struct compiler *)`` + create block and set it as current +- ``NEXT_BLOCK(struct compiler *)`` + basically NEW_BLOCK() plus jump from current block +- ``compiler_new_block(struct compiler *)`` + create a block but don't use it (used for generating jumps) + +Once the CFG is created, it must be flattened and then final emission of +bytecode occurs. Flattening is handled using a post-order depth-first +search. Once flattened, jump offsets are backpatched based on the +flattening and then a PyCodeObject file is created. All of this is +handled by calling:: + + PyCodeObject * assemble(struct compiler *, int); + +*** NOTE: if you make a change here that can affect the output of bytecode that +is already in existence, make sure to delete your old .py(c|o) files! Running +``find . -name '*.py[co]' -exec rm -f {} ';'`` should do the trick. + + +Code Objects +------------ + +In the end, one ends up with a PyCodeObject which is defined in +Include/code.h . And with that you now have executable Python bytecode! + + +Modified Files +-------------- + ++ Parser/ + + - Python.asdl + ASDL syntax file + + - asdl.py + "An implementation of the Zephyr Abstract Syntax Definition + Language." Uses SPARK_ to parse the ASDL files. + + - asdl_c.py + "Generate C code from an ASDL description." Generates + ../Python/Python-ast.c and ../Include/Python-ast.h . + + - spark.py + SPARK_ parser generator + ++ Python/ + + - Python-ast.c + Creates C structs corresponding to the ASDL types. Also + contains code for marshaling AST nodes (core ASDL types have + marshaling code in asdl.c). "File automatically generated by + ../Parser/asdl_c.py". + + - asdl.c + Contains code to handle the ASDL sequence type. Also has code + to handle marshalling the core ASDL types, such as number and + identifier. used by Python-ast.c for marshaling AST nodes. + + - ast.c + Converts Python's parse tree into the abstract syntax tree. + + - compile.txt + This file. + + - newcompile.c + New version of compile.c that handles the emitting of bytecode. + + - symtable.c + Generates symbol table from AST. + + ++ Include/ + + - Python-ast.h + Contains the actual definitions of the C structs as generated by + ../Python/Python-ast.c . + "Automatically generated by ../Parser/asdl_c.py". + + - asdl.h + Header for the corresponding ../Python/ast.c . + + - ast.h + Declares PyAST_FromNode() external (from ../Python/ast.c). + + - code.h + Header file for ../Objects/codeobject.c; contains definition of + PyCodeObject. + + - symtable.h + Header for ../Python/symtable.c . struct symtable and + PySTEntryObject are defined here. + ++ Objects/ + + - codeobject.c + Contains PyCodeObject-related code (originally in + ../Python/compile.c). + + +ToDo +---- +*** NOTE: all bugs and patches should be filed on SF under the group + "AST" for easy searching. It also does not hurt to put + "[AST]" at the beginning of the subject line of the tracker + item. + ++ Stdlib support + - AST->Python access? + - rewrite compiler package to mirror AST structure? ++ Documentation + - flesh out this doc + * byte stream output + * explanation of how the symbol table pass works + * code object (PyCodeObject) ++ Universal + - make sure entire test suite passes + - fix memory leaks + - make sure return types are properly checked for errors + - no gcc warnings + +References +---------- + +.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. + `Compilers: Principles, Techniques, and Tools`, + http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108 + +.. [Wang97] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris + S. Serra. `The Zephyr Abstract Syntax Description Language.`_ + In Proceedings of the Conference on Domain-Specific Languages, pp. + 213--227, 1997. + +.. _The Zephyr Abstract Syntax Description Language.: + http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97.html + +.. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/ + |