-rw-r--r-- | Doc/library/parser.rst | 335 |
1 files changed, 2 insertions, 333 deletions
diff --git a/Doc/library/parser.rst b/Doc/library/parser.rst
index efac1a5..ad336d3 100644
--- a/Doc/library/parser.rst
+++ b/Doc/library/parser.rst
@@ -317,22 +317,8 @@ ST objects have the following methods:
    Same as ``st2tuple(st, line_info, col_info)``.
 
 
-.. _st-examples:
-
-Examples
---------
-
-.. index:: builtin: compile
-
-The parser modules allows operations to be performed on the parse tree of Python
-source code before the :term:`bytecode` is generated, and provides for inspection of the
-parse tree for information gathering purposes. Two examples are presented. The
-simple example demonstrates emulation of the :func:`compile` built-in function
-and the complex example shows the use of a parse tree for information discovery.
-
-
-Emulation of :func:`compile`
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Example: Emulation of :func:`compile`
+-------------------------------------
 
 While many useful operations may take place between parsing and bytecode
 generation, the simplest operation is to do nothing. For this purpose, using
@@ -366,320 +352,3 @@ readily available functions::
    def load_expression(source_string):
        st = parser.expr(source_string)
        return st, st.compile()
-
-
-Information Discovery
-^^^^^^^^^^^^^^^^^^^^^
-
-.. index::
-   single: string; documentation
-   single: docstrings
-
-Some applications benefit from direct access to the parse tree. The remainder
-of this section demonstrates how the parse tree provides access to module
-documentation defined in docstrings without requiring that the code being
-examined be loaded into a running interpreter via :keyword:`import`. This can
-be very useful for performing analyses of untrusted code.
-
-Generally, the example will demonstrate how the parse tree may be traversed to
-distill interesting information. Two functions and a set of classes are
-developed which provide programmatic access to high level function and class
-definitions provided by a module.
The classes extract information from the
-parse tree and provide access to the information at a useful semantic level, one
-function provides a simple low-level pattern matching capability, and the other
-function defines a high-level interface to the classes by handling file
-operations on behalf of the caller. All source files mentioned here which are
-not part of the Python installation are located in the :file:`Demo/parser/`
-directory of the distribution.
-
-The dynamic nature of Python allows the programmer a great deal of flexibility,
-but most modules need only a limited measure of this when defining classes,
-functions, and methods. In this example, the only definitions that will be
-considered are those which are defined in the top level of their context, e.g.,
-a function defined by a :keyword:`def` statement at column zero of a module, but
-not a function defined within a branch of an :keyword:`if` ... :keyword:`else`
-construct, though there are some good reasons for doing so in some situations.
-Nesting of definitions will be handled by the code developed in the example.
-
-To construct the upper-level extraction methods, we need to know what the parse
-tree structure looks like and how much of it we actually need to be concerned
-about. Python uses a moderately deep parse tree so there are a large number of
-intermediate nodes. It is important to read and understand the formal grammar
-used by Python. This is specified in the file :file:`Grammar/Grammar` in the
-distribution. Consider the simplest case of interest when searching for
-docstrings: a module consisting of a docstring and nothing else. (See file
-:file:`docstring.py`.) ::
-
-   """Some documentation.
-   """
-
-Using the interpreter to take a look at the parse tree, we find a bewildering
-mass of numbers and parentheses, with the documentation buried deep in nested
-tuples.
::
-
-   >>> import parser
-   >>> import pprint
-   >>> st = parser.suite(open('docstring.py').read())
-   >>> tup = st.totuple()
-   >>> pprint.pprint(tup)
-   (257,
-    (264,
-     (265,
-      (266,
-       (267,
-        (307,
-         (287,
-          (288,
-           (289,
-            (290,
-             (292,
-              (293,
-               (294,
-                (295,
-                 (296,
-                  (297,
-                   (298,
-                    (299,
-                     (300, (3, '"""Some documentation.\n"""'))))))))))))))))),
-      (4, ''))),
-    (4, ''),
-    (0, ''))
-
-The numbers at the first element of each node in the tree are the node types;
-they map directly to terminal and non-terminal symbols in the grammar.
-Unfortunately, they are represented as integers in the internal representation,
-and the Python structures generated do not change that. However, the
-:mod:`symbol` and :mod:`token` modules provide symbolic names for the node types
-and dictionaries which map from the integers to the symbolic names for the node
-types.
-
-In the output presented above, the outermost tuple contains four elements: the
-integer ``257`` and three additional tuples. Node type ``257`` has the symbolic
-name :const:`file_input`. Each of these inner tuples contains an integer as the
-first element; these integers, ``264``, ``4``, and ``0``, represent the node
-types :const:`stmt`, :const:`NEWLINE`, and :const:`ENDMARKER`, respectively.
-Note that these values may change depending on the version of Python you are
-using; consult :file:`symbol.py` and :file:`token.py` for details of the
-mapping. It should be fairly clear that the outermost node is related primarily
-to the input source rather than the contents of the file, and may be disregarded
-for the moment. The :const:`stmt` node is much more interesting. In
-particular, all docstrings are found in subtrees which are formed exactly as
-this node is formed, with the only difference being the string itself.
The
-association between the docstring in a similar tree and the defined entity
-(class, function, or module) which it describes is given by the position of the
-docstring subtree within the tree defining the described structure.
-
-By replacing the actual docstring with something to signify a variable component
-of the tree, we allow a simple pattern matching approach to check any given
-subtree for equivalence to the general pattern for docstrings. Since the
-example demonstrates information extraction, we can safely require that the tree
-be in tuple form rather than list form, allowing a simple variable
-representation to be ``['variable_name']``. A simple recursive function can
-implement the pattern matching, returning a Boolean and a dictionary of variable
-name to value mappings. (See file :file:`example.py`.) ::
-
-   def match(pattern, data, vars=None):
-       if vars is None:
-           vars = {}
-       if isinstance(pattern, list):
-           vars[pattern[0]] = data
-           return True, vars
-       if not instance(pattern, tuple):
-           return (pattern == data), vars
-       if len(data) != len(pattern):
-           return False, vars
-       for pattern, data in zip(pattern, data):
-           same, vars = match(pattern, data, vars)
-           if not same:
-               break
-       return same, vars
-
-Using this simple representation for syntactic variables and the symbolic node
-types, the pattern for the candidate docstring subtrees becomes fairly readable.
-(See file :file:`example.py`.)
::
-
-   import symbol
-   import token
-
-   DOCSTRING_STMT_PATTERN = (
-       symbol.stmt,
-       (symbol.simple_stmt,
-        (symbol.small_stmt,
-         (symbol.expr_stmt,
-          (symbol.testlist,
-           (symbol.test,
-            (symbol.and_test,
-             (symbol.not_test,
-              (symbol.comparison,
-               (symbol.expr,
-                (symbol.xor_expr,
-                 (symbol.and_expr,
-                  (symbol.shift_expr,
-                   (symbol.arith_expr,
-                    (symbol.term,
-                     (symbol.factor,
-                      (symbol.power,
-                       (symbol.atom,
-                        (token.STRING, ['docstring'])
-                        )))))))))))))))),
-        (token.NEWLINE, '')
-        ))
-
-Using the :func:`match` function with this pattern, extracting the module
-docstring from the parse tree created previously is easy::
-
-   >>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
-   >>> found
-   True
-   >>> vars
-   {'docstring': '"""Some documentation.\n"""'}
-
-Once specific data can be extracted from a location where it is expected, the
-question of where information can be expected needs to be answered. When
-dealing with docstrings, the answer is fairly simple: the docstring is the first
-:const:`stmt` node in a code block (:const:`file_input` or :const:`suite` node
-types). A module consists of a single :const:`file_input` node, and class and
-function definitions each contain exactly one :const:`suite` node. Classes and
-functions are readily identified as subtrees of code block nodes which start
-with ``(stmt, (compound_stmt, (classdef, ...`` or ``(stmt, (compound_stmt,
-(funcdef, ...``. Note that these subtrees cannot be matched by :func:`match`
-since it does not support multiple sibling nodes to match without regard to
-number. A more elaborate matching function could be used to overcome this
-limitation, but this is sufficient for the example.
-
-Given the ability to determine whether a statement might be a docstring and
-extract the actual string from the statement, some work needs to be performed to
-walk the parse tree for an entire module and extract information about the names
-defined in each context of the module and associate any docstrings with the
-names. The code to perform this work is not complicated, but bears some
-explanation.
-
-The public interface to the classes is straightforward and should probably be
-somewhat more flexible. Each "major" block of the module is described by an
-object providing several methods for inquiry and a constructor which accepts at
-least the subtree of the complete parse tree which it represents. The
-:class:`ModuleInfo` constructor accepts an optional *name* parameter since it
-cannot otherwise determine the name of the module.
-
-The public classes include :class:`ClassInfo`, :class:`FunctionInfo`, and
-:class:`ModuleInfo`. All objects provide the methods :meth:`get_name`,
-:meth:`get_docstring`, :meth:`get_class_names`, and :meth:`get_class_info`. The
-:class:`ClassInfo` objects support :meth:`get_method_names` and
-:meth:`get_method_info` while the other classes provide
-:meth:`get_function_names` and :meth:`get_function_info`.
-
-Within each of the forms of code block that the public classes represent, most
-of the required information is in the same form and is accessed in the same way,
-with classes having the distinction that functions defined at the top level are
-referred to as "methods." Since the difference in nomenclature reflects a real
-semantic distinction from functions defined outside of a class, the
-implementation needs to maintain the distinction. Hence, most of the
-functionality of the public classes can be implemented in a common base class,
-:class:`SuiteInfoBase`, with the accessors for function and method information
-provided elsewhere.
Note that there is only one class which represents function
-and method information; this parallels the use of the :keyword:`def` statement
-to define both types of elements.
-
-Most of the accessor functions are declared in :class:`SuiteInfoBase` and do not
-need to be overridden by subclasses. More importantly, the extraction of most
-information from a parse tree is handled through a method called by the
-:class:`SuiteInfoBase` constructor. The example code for most of the classes is
-clear when read alongside the formal grammar, but the method which recursively
-creates new information objects requires further examination. Here is the
-relevant part of the :class:`SuiteInfoBase` definition from :file:`example.py`::
-
-   class SuiteInfoBase:
-       _docstring = ''
-       _name = ''
-
-       def __init__(self, tree = None):
-           self._class_info = {}
-           self._function_info = {}
-           if tree:
-               self._extract_info(tree)
-
-       def _extract_info(self, tree):
-           # extract docstring
-           if len(tree) == 2:
-               found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1])
-           else:
-               found, vars = match(DOCSTRING_STMT_PATTERN, tree[3])
-           if found:
-               self._docstring = eval(vars['docstring'])
-           # discover inner definitions
-           for node in tree[1:]:
-               found, vars = match(COMPOUND_STMT_PATTERN, node)
-               if found:
-                   cstmt = vars['compound']
-                   if cstmt[0] == symbol.funcdef:
-                       name = cstmt[2][1]
-                       self._function_info[name] = FunctionInfo(cstmt)
-                   elif cstmt[0] == symbol.classdef:
-                       name = cstmt[2][1]
-                       self._class_info[name] = ClassInfo(cstmt)
-
-After initializing some internal state, the constructor calls the
-:meth:`_extract_info` method. This method performs the bulk of the information
-extraction which takes place in the entire example. The extraction has two
-distinct phases: the location of the docstring for the parse tree passed in, and
-the discovery of additional definitions within the code block represented by the
-parse tree.
-
-The initial :keyword:`if` test determines whether the nested suite is of the
-"short form" or the "long form." The short form is used when the code block is
-on the same line as the definition of the code block, as in ::
-
-   def square(x): "Square an argument."; return x ** 2
-
-while the long form uses an indented block and allows nested definitions::
-
-   def make_power(exp):
-       "Make a function that raises an argument to the exponent `exp`."
-       def raiser(x, y=exp):
-           return x ** y
-       return raiser
-
-When the short form is used, the code block may contain a docstring as the
-first, and possibly only, :const:`small_stmt` element. The extraction of such a
-docstring is slightly different and requires only a portion of the complete
-pattern used in the more common case. As implemented, the docstring will only
-be found if there is only one :const:`small_stmt` node in the
-:const:`simple_stmt` node. Since most functions and methods which use the short
-form do not provide a docstring, this may be considered sufficient. The
-extraction of the docstring proceeds using the :func:`match` function as
-described above, and the value of the docstring is stored as an attribute of the
-:class:`SuiteInfoBase` object.
-
-After docstring extraction, a simple definition discovery algorithm operates on
-the :const:`stmt` nodes of the :const:`suite` node. The special case of the
-short form is not tested; since there are no :const:`stmt` nodes in the short
-form, the algorithm will silently skip the single :const:`simple_stmt` node and
-correctly not discover any nested definitions.
-
-Each statement in the code block is categorized as a class definition, function
-or method definition, or something else. For the definition statements, the
-name of the element defined is extracted and a representation object appropriate
-to the definition is created with the defining subtree passed as an argument to
-the constructor.
The representation objects are stored in instance variables
-and may be retrieved by name using the appropriate accessor methods.
-
-The public classes provide any accessors required which are more specific than
-those provided by the :class:`SuiteInfoBase` class, but the real extraction
-algorithm remains common to all forms of code blocks. A high-level function can
-be used to extract the complete set of information from a source file. (See
-file :file:`example.py`.) ::
-
-   def get_docs(fileName):
-       import os
-       import parser
-
-       source = open(fileName).read()
-       basename = os.path.basename(os.path.splitext(fileName)[0])
-       st = parser.suite(source)
-       return ModuleInfo(st.totuple(), basename)
-
-This provides an easy-to-use interface to the documentation of a module. If
-information is required which is not extracted by the code of this example, the
-code may be extended at clearly defined points to provide additional
-capabilities.
-
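
The `match` helper shown in the removed text can be exercised without the `parser` module. The following is a minimal standalone sketch and is not part of the patch. It assumes the second type check was meant to be `isinstance`, adds a guard for non-tuple data that the removed code does not have, and runs against a hypothetical miniature tree whose node numbers are made up rather than taken from the real grammar.

```python
# Standalone sketch of the pattern matcher from the removed text.
# Assumptions (not in the original): isinstance() is used for both type
# checks, non-tuple data is rejected explicitly, and the node numbers
# below are hypothetical stand-ins for real grammar symbols.

def match(pattern, data, vars=None):
    """Return (matched, vars); a one-element list in pattern is a variable."""
    if vars is None:
        vars = {}
    if isinstance(pattern, list):
        # A list such as ['docstring'] captures whatever sits at this position.
        vars[pattern[0]] = data
        return True, vars
    if not isinstance(pattern, tuple):
        # Leaf value: a node type integer or a token string.
        return (pattern == data), vars
    if not isinstance(data, tuple) or len(data) != len(pattern):
        return False, vars
    same = True
    for sub_pattern, sub_data in zip(pattern, data):
        same, vars = match(sub_pattern, sub_data, vars)
        if not same:
            break
    return same, vars


# Hypothetical miniature tree: nested (node_type, child) tuples ending in a
# string token, mirroring the shape of the docstring subtree in the text.
STRING = 3
PATTERN = (1, (2, (STRING, ['docstring'])))
tree = (1, (2, (STRING, '"""Some documentation."""')))

found, captured = match(PATTERN, tree)
print(found, captured)
```

As in the removed text, this matcher compares trees position by position, so it cannot match a variable number of sibling nodes.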