diff options
Diffstat (limited to 'Doc/library/tokenize.rst')
-rw-r--r-- | Doc/library/tokenize.rst | 172 |
1 files changed, 98 insertions, 74 deletions
diff --git a/Doc/library/tokenize.rst b/Doc/library/tokenize.rst index 9a17b14..bbe73d0 100644 --- a/Doc/library/tokenize.rst +++ b/Doc/library/tokenize.rst @@ -9,50 +9,34 @@ The :mod:`tokenize` module provides a lexical scanner for Python source code, -implemented in Python. The scanner in this module returns comments as tokens as -well, making it useful for implementing "pretty-printers," including colorizers -for on-screen displays. +implemented in Python. The scanner in this module returns comments as tokens +as well, making it useful for implementing "pretty-printers," including +colorizers for on-screen displays. The primary entry point is a :term:`generator`: -.. function:: generate_tokens(readline) +.. function:: tokenize(readline) - The :func:`generate_tokens` generator requires one argument, *readline*, which + The :func:`tokenize` generator requires one argument, *readline*, which must be a callable object which provides the same interface as the :meth:`readline` method of built-in file objects (see section - :ref:`bltin-file-objects`). Each call to the function should return one line of - input as a string. + :ref:`bltin-file-objects`). Each call to the function should return one + line of input as bytes. - The generator produces 5-tuples with these members: the token type; the token - string; a 2-tuple ``(srow, scol)`` of ints specifying the row and column where - the token begins in the source; a 2-tuple ``(erow, ecol)`` of ints specifying - the row and column where the token ends in the source; and the line on which the - token was found. The line passed is the *logical* line; continuation lines are - included. - - -An older entry point is retained for backward compatibility: - -.. function:: tokenize(readline[, tokeneater]) - - The :func:`tokenize` function accepts two parameters: one representing the input - stream, and one providing an output mechanism for :func:`tokenize`. - - The first parameter, *readline*, must be a callable object which provides the - same interface as the :meth:`readline` method of built-in file objects (see - section :ref:`bltin-file-objects`). Each call to the function should return one - line of input as a string. Alternately, *readline* may be a callable object that - signals completion by raising :exc:`StopIteration`. - - The second parameter, *tokeneater*, must also be a callable object. It is - called once for each token, with five arguments, corresponding to the tuples - generated by :func:`generate_tokens`. + The generator produces 5-tuples with these members: the token type; the + token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and + column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of + ints specifying the row and column where the token ends in the source; and + the line on which the token was found. The line passed is the *logical* + line; continuation lines are included. + + tokenize determines the source encoding of the file by looking for a utf-8 + bom or encoding cookie, according to :pep:`263`. All constants from the :mod:`token` module are also exported from -:mod:`tokenize`, as are two additional token type values that might be passed to -the *tokeneater* function by :func:`tokenize`: +:mod:`tokenize`, as are three additional token type values: .. data:: COMMENT @@ -62,55 +46,95 @@ the *tokeneater* function by :func:`tokenize`: .. data:: NL Token value used to indicate a non-terminating newline. The NEWLINE token - indicates the end of a logical line of Python code; NL tokens are generated when - a logical line of code is continued over multiple physical lines. + indicates the end of a logical line of Python code; NL tokens are generated + when a logical line of code is continued over multiple physical lines. -Another function is provided to reverse the tokenization process. This is useful -for creating tools that tokenize a script, modify the token stream, and write -back the modified script. +.. data:: ENCODING -.. function:: untokenize(iterable) + Token value that indicates the encoding used to decode the source bytes + into text. The first token returned by :func:`tokenize` will always be an + ENCODING token. - Converts tokens back into Python source code. The *iterable* must return - sequences with at least two elements, the token type and the token string. Any - additional sequence elements are ignored. - The reconstructed script is returned as a single string. The result is - guaranteed to tokenize back to match the input so that the conversion is - lossless and round-trips are assured. The guarantee applies only to the token - type and token string as the spacing between tokens (column positions) may - change. +Another function is provided to reverse the tokenization process. This is +useful for creating tools that tokenize a script, modify the token stream, and +write back the modified script. +.. function:: untokenize(iterable) + + Converts tokens back into Python source code. The *iterable* must return + sequences with at least two elements, the token type and the token string. + Any additional sequence elements are ignored. + + The reconstructed script is returned as a single string. The result is + guaranteed to tokenize back to match the input so that the conversion is + lossless and round-trips are assured. The guarantee applies only to the + token type and token string as the spacing between tokens (column + positions) may change. + + It returns bytes, encoded using the ENCODING token, which is the first + token sequence output by :func:`tokenize`. + + +:func:`tokenize` needs to detect the encoding of source files it tokenizes. The +function it uses to do this is available: + +.. function:: detect_encoding(readline) + + The :func:`detect_encoding` function is used to detect the encoding that + should be used to decode a Python source file. It requires one argment, + readline, in the same way as the :func:`tokenize` generator. + + It will call readline a maximum of twice, and return the encoding used + (as a string) and a list of any lines (not decoded from bytes) it has read + in. + + It detects the encoding from the presence of a utf-8 bom or an encoding + cookie as specified in pep-0263. If both a bom and a cookie are present, + but disagree, a SyntaxError will be raised. + + If no encoding is specified, then the default of 'utf-8' will be returned. + + Example of a script re-writer that transforms float literals into Decimal objects:: - def decistmt(s): - """Substitute Decimals for floats in a string of statements. - - >>> from decimal import Decimal - >>> s = 'print(+21.3e-5*-.1234/81.7)' - >>> decistmt(s) - "print(+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))" - - >>> exec(s) - -3.21716034272e-007 - >>> exec(decistmt(s)) - -3.217160342717258261933904529E-7 - - """ - result = [] - g = generate_tokens(StringIO(s).readline) # tokenize the string - for toknum, tokval, _, _, _ in g: - if toknum == NUMBER and '.' in tokval: # replace NUMBER tokens - result.extend([ - (NAME, 'Decimal'), - (OP, '('), - (STRING, repr(tokval)), - (OP, ')') - ]) - else: - result.append((toknum, tokval)) - return untokenize(result) + def decistmt(s): + """Substitute Decimals for floats in a string of statements. + + >>> from decimal import Decimal + >>> s = 'print(+21.3e-5*-.1234/81.7)' + >>> decistmt(s) + "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))" + + The format of the exponent is inherited from the platform C library. + Known cases are "e-007" (Windows) and "e-07" (not Windows). Since + we're only showing 12 digits, and the 13th isn't close to 5, the + rest of the output should be platform-independent. + + >>> exec(s) #doctest: +ELLIPSIS + -3.21716034272e-0...7 + + Output from calculations with Decimal should be identical across all + platforms. + + >>> exec(decistmt(s)) + -3.217160342717258261933904529E-7 + """ + result = [] + g = tokenize(BytesIO(s.encode('utf-8')).readline) # tokenize the string + for toknum, tokval, _, _, _ in g: + if toknum == NUMBER and '.' in tokval: # replace NUMBER tokens + result.extend([ + (NAME, 'Decimal'), + (OP, '('), + (STRING, repr(tokval)), + (OP, ')') + ]) + else: + result.append((toknum, tokval)) + return untokenize(result).decode('utf-8') + |