path: root/Parser/tokenizer.h
Commit message | Author | Date | Files | Lines
* [3.12] gh-106989: Remove tok report warnings (GH-106993) (#107013) | Miss Islington (bot) | 2023-07-22 | 1 | -1/+0
    Co-authored-by: Menelaos Kotoglou <contact@menelaoskotoglou.com>
* [3.12] gh-105718: Fix buffer allocation in tokenizer with readline (GH-105728) (#105729) | Miss Islington (bot) | 2023-06-13 | 1 | -1/+1
    Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
* [3.12] gh-105259: Ensure we don't show newline characters for trailing NEWLINE tokens (GH-105364) (#105367) | Miss Islington (bot) | 2023-06-06 | 1 | -0/+1
* [3.12] gh-105069: Add a readline-like callable to the tokenizer to consume input iteratively (GH-105070) (#105119) | Miss Islington (bot) | 2023-05-31 | 1 | -0/+2
    gh-105069: Add a readline-like callable to the tokenizer to consume input iteratively (GH-105070)
    (cherry picked from commit 9216e69a87d16d871625721ed5a8aa302511f367)
    Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* [3.12] gh-105017: Include CRLF lines in strings and column numbers (GH-105030) (#105041) | Miss Islington (bot) | 2023-05-28 | 1 | -2/+2
    gh-105017: Include CRLF lines in strings and column numbers (GH-105030)
    (cherry picked from commit 96fff35325e519cc76ffacf22e57e4c393d4446f)
    Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
    Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
* [3.12] Fix typo in the tokenizer (GH-104950) (#104953) | Miss Islington (bot) | 2023-05-26 | 1 | -1/+1
    (cherry picked from commit 705e387dd81b971cb1ee5727da54adfb565f61d0)
    Co-authored-by: Stepfen Shawn <m18824909883@163.com>
* gh-102856: Python tokenizer implementation for PEP 701 (#104323) | Marta Gómez Macías | 2023-05-21 | 1 | -0/+4
    This commit replaces the Python implementation of the tokenize module with an implementation that reuses the real C tokenizer via a private extension module. The tokenize module now implements a compatibility layer that transforms tokens from the C tokenizer into Python tokenize tokens for backward compatibility.
    As the C tokenizer does not emit some tokens that the Python tokenizer provides (such as comments and non-semantic newlines), a new special mode has been added to the C tokenizer that is currently only used via the extension module that exposes it to the Python layer. This new mode forces the C tokenizer to emit these extra tokens and add the appropriate metadata needed to match the old Python implementation.
    Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
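    The backward-compatible behaviour described above is visible from pure Python; a minimal illustrative sketch using only the stdlib tokenize module (no CPython internals assumed):

        import io
        import tokenize

        # COMMENT and NL are among the "extra" tokens the compatibility layer
        # asks the C tokenizer to emit in its special mode.
        src = "# a comment\nx = 1\n"
        for tok in tokenize.generate_tokens(io.StringIO(src).readline):
            print(tokenize.tok_name[tok.type], repr(tok.string))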
* gh-104658: Fix location of unclosed quote error for multiline f-strings (#104660) | Pablo Galindo Salgado | 2023-05-20 | 1 | -0/+1
* gh-104016: Fixed off by 1 error in f string tokenizer (#104047) | jx124 | 2023-05-01 | 1 | -3/+4
    Co-authored-by: sunmy2019 <59365878+sunmy2019@users.noreply.github.com>
    Co-authored-by: Ken Jin <kenjin@python.org>
    Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
* gh-103656: Transfer f-string buffers to parser to avoid use-after-free (GH-103896) | Lysandros Nikolaou | 2023-04-27 | 1 | -0/+2
    Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
* GH-103718: Correctly cache and restore f-string buffers when needed (GH-103719) | Lysandros Nikolaou | 2023-04-23 | 1 | -0/+3
* gh-102856: Clean some of the PEP 701 tokenizer implementation (#103634) | Pablo Galindo Salgado | 2023-04-19 | 1 | -3/+2
* gh-102856: Initial implementation of PEP 701 (#102855) | Pablo Galindo Salgado | 2023-04-19 | 1 | -0/+29
    Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
    Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
    Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
    Co-authored-by: sunmy2019 <59365878+sunmy2019@users.noreply.github.com>
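    For context, PEP 701 lifts the old f-string restrictions at the tokenizer level; a small illustration of what becomes legal (requires Python 3.12+, examples adapted from the PEP rather than from this commit):

        # Reusing the outer quote character inside the replacement field
        songs = ["Take me back to Eden", "Alkaline", "Ascensionism"]
        print(f"This is the playlist: {", ".join(songs)}")

        # Arbitrary nesting of f-strings with the same quotes
        print(f"{f"{f"{1 + 1}"}"}")

        # Backslashes are now allowed inside the expression part
        names = ["alice", "bob"]
        print(f"{'\n'.join(names)}")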
* gh-99891: Fix infinite recursion in the tokenizer when showing warnings (GH-99893) | Pablo Galindo Salgado | 2022-11-30 | 1 | -0/+1
    Automerge-Triggered-By: GH:pablogsal
* gh-97997: Add col_offset field to tokenizer and use that for AST nodes (#98000) | Lysandros Nikolaou | 2022-10-07 | 1 | -0/+2
* gh-97973: Return all necessary information from the tokenizer (GH-97984) | Lysandros Nikolaou | 2022-10-06 | 1 | -1/+7
    Right now, the tokenizer only returns the type and two pointers to the start and end of the token. This PR modifies the tokenizer to return the type and set all of the necessary information, so that the parser does not have to do this.
* gh-93103: Parser uses PyConfig.parser_debug instead of Py_DebugFlag (#93106) | Victor Stinner | 2022-05-24 | 1 | -0/+3
    * Replace deprecated Py_DebugFlag with PyConfig.parser_debug in the parser.
    * Add Parser.debug member.
    * Add tok_state.debug member.
    * Py_FrozenMain(): Replace Py_VerboseFlag with PyConfig.verbose.
* gh-92651: Remove the Include/token.h header file (#92652) | Victor Stinner | 2022-05-11 | 1 | -1/+1
    Remove the token.h header file. There was never any public tokenizer C API. The token.h header file was only designed to be used by Python internals.
    Move Include/token.h to Include/internal/pycore_token.h. Including this header file now requires that the Py_BUILD_CORE macro is defined. It no longer checks for the Py_LIMITED_API macro.
    Rename functions:
    * PyToken_OneChar() => _PyToken_OneChar()
    * PyToken_TwoChars() => _PyToken_TwoChars()
    * PyToken_ThreeChars() => _PyToken_ThreeChars()
* Ensure the str member of the tokenizer is always initialised (GH-29681) | Pablo Galindo Salgado | 2021-11-21 | 1 | -1/+1
* bpo-45434: Mark the PyTokenizer C API as private (GH-28924) | Victor Stinner | 2021-10-13 | 1 | -5/+5
    Rename the PyTokenizer functions to mark them as private:
    * PyTokenizer_FindEncodingFilename() => _PyTokenizer_FindEncodingFilename()
    * PyTokenizer_FromString() => _PyTokenizer_FromString()
    * PyTokenizer_FromFile() => _PyTokenizer_FromFile()
    * PyTokenizer_FromUTF8() => _PyTokenizer_FromUTF8()
    * PyTokenizer_Free() => _PyTokenizer_Free()
    * PyTokenizer_Get() => _PyTokenizer_Get()
    Remove the unused PyTokenizer_FindEncoding() function. import.c: remove the unused #include "errcode.h".
* Fix typos in the Objects directory (GH-28766) | Christian Clauss | 2021-10-06 | 1 | -1/+1
* bpo-44854: Remove trailing whitespaces (GH-27689) | Serhiy Storchaka | 2021-08-09 | 1 | -1/+1
* bpo-44201: Avoid side effects of "invalid_*" rules in the REPL (GH-26298) | Pablo Galindo | 2021-05-22 | 1 | -0/+10
    When the parser does a second pass to check for errors, these rules can have some small side-effects as they may advance the parser more than the point reached in the first pass. This can cause the tokenizer to ask for extra tokens in interactive mode, causing the tokenizer to show the prompt instead of failing instantly.
    To avoid this, add a new mode to the tokenizer that is activated in the second pass and deactivates asking for new tokens when the interactive line is finished. As the parsing should have reached the last line in the first pass, the second pass should not need to ask for more tokens.
* bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050) | Pablo Galindo | 2021-03-28 | 1 | -3/+2
* bpo-43410: Fix crash in the parser when producing syntax errors when reading from stdin (GH-24763) | Pablo Galindo | 2021-03-14 | 1 | -1/+3
* bpo-42864: Improve error messages regarding unclosed parentheses (GH-24161) | Pablo Galindo | 2021-01-19 | 1 | -0/+1
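    A quick illustrative check of the improved message ("<example>" is just a placeholder filename; the wording shown is the Python 3.10+ message and may differ in other versions):

        try:
            compile("foo(1, 2\n", "<example>", "exec")
        except SyntaxError as exc:
            print(exc.msg)  # e.g. "'(' was never closed"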
* bpo-42827: Fix crash on SyntaxError in multiline expressions (GH-24140) | Lysandros Nikolaou | 2021-01-14 | 1 | -0/+1
    When trying to extract the error line for the error message there are two distinct cases:
    1. The input comes from a file, which means that we can extract the error line by using `PyErr_ProgramTextObject`, which we already do.
    2. The input does not come from a file, at which point we need to get the source code from the tokenizer:
       * If the tokenizer's current line number is the same as the line of the error, we get the line from `tok->buf` and we're ready.
       * Else, we can extract the error line from the source code in the following two ways:
         * If the input comes from a string, we have all the input in `tok->str` and we can extract the error line from it.
         * If the input comes from stdin, i.e. the interactive prompt, we do not have access to the previous line. That's why a new field `tok->stdin_content` is added, which holds the whole input for the current (multiline) statement or expression. We can then extract the error line from `tok->stdin_content` like we do in the string case above.
    Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
* closes bpo-39721: Fix constness of members of tok_state struct. (GH-18600) | Andy Lester | 2020-02-28 | 1 | -5/+5
    The function PyTokenizer_FromUTF8 from Parser/tokenizer.c had a comment: /* XXX: constify members. */ This patch addresses that.
    In the tok_state struct:
    * end and start were non-const but could be made const
    * str and input were const but should have been non-const
    Changes to support this include:
    * decode_str() now returns a char * since it is allocated.
    * PyTokenizer_FromString() and PyTokenizer_FromUTF8() each create a new char * for an allocated string instead of reusing the input const char *.
    * PyTokenizer_Get() and tok_get() now take const char ** arguments.
    * Various local vars are const or non-const accordingly.
    I was able to remove five casts that cast away constness.
* bpo-36623: Clean parser headers and include files (GH-12253) | Pablo Galindo | 2019-04-13 | 1 | -0/+2
    After the removal of pgen, multiple headers and function prototypes that lack an implementation or are unused are still lying around.
* bpo-35975: Support parsing earlier minor versions of Python 3 (GH-12086) | Guido van Rossum | 2019-03-07 | 1 | -0/+7
    This adds a `feature_version` flag to `ast.parse()` (documented) and `compile()` (hidden) that allows tweaking the parser to support older versions of the grammar. In particular, if `feature_version` is 5 or 6, the hacks for the `async` and `await` keywords from PEP 492 are reinstated. (For 7 or higher, these are unconditionally treated as keywords, but they are still special tokens rather than `NAME` tokens that the parser driver recognizes.)
    https://bugs.python.org/issue35975
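    A minimal, hedged sketch of the best-effort `feature_version` knob on `ast.parse()`; it assumes the parser's version checks reject 3.8-only syntax (such as the walrus operator) when an older grammar is requested:

        import ast

        ast.parse("(x := 1)")  # fine with the current grammar

        try:
            ast.parse("(x := 1)", feature_version=(3, 7))
        except SyntaxError as exc:
            print("rejected under (3, 7):", exc.msg)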
* bpo-35808: Retire pgen and use pgen2 to generate the parser (GH-11814) | Pablo Galindo | 2019-03-01 | 1 | -8/+0
    Pgen is the oldest piece of technology in the CPython repository; building it requires various #if[n]def PGEN hacks in other parts of the code and it also depends more and more on CPython internals. This commit removes the old pgen C code and replaces it with a new version implemented in pure Python. This is a modified and adapted version of lib2to3/pgen2 that can generate grammar files compatible with the current parser.
    This commit also eliminates all the #ifdef and code branches related to pgen, simplifying the code and making it more maintainable. The regen-grammar step now uses $(PYTHON_FOR_REGEN), which can be any version of the interpreter, so the new pgen code maintains compatibility with older versions of the interpreter (this also allows regenerating the grammar with the current CI solution that uses Python 3.5). The new pgen Python module also makes use of the Grammar/Tokens file that holds the token specification, so it is always kept in sync and avoids having to maintain duplicate token definitions.
* bpo-35766: Merge typed_ast back into CPython (GH-11645) | Guido van Rossum | 2019-01-31 | 1 | -0/+2
* bpo-16806: Fix `lineno` and `col_offset` for multi-line string tokens (GH-10021) | Anthony Sottile | 2019-01-13 | 1 | -0/+5
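    The fix is observable through the ast module: the reported location of a multi-line string now points at where the literal starts (an illustrative check on current Python versions):

        import ast

        node = ast.parse('x = """first\nsecond"""').body[0].value
        print(node.lineno, node.col_offset)  # 1 4 -- start of the string literal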
* bpo-33306: Improve SyntaxError messages for unbalanced parentheses. (GH-6516) | Serhiy Storchaka | 2018-12-17 | 1 | -1/+4
* tokenizer: Remove unused tabs options (#4422) | Victor Stinner | 2017-11-17 | 1 | -3/+0
    Remove the following fields from the tok_state structure, which are now unused:
    * altwarning: "Issue warning if alternate tabs don't match"
    * alterror: "Issue error if alternate tabs don't match"
    * alttabsize: "Alternate tab spacing"
    Replace the alttabsize variable with the ALTTABSIZE define.
* bpo-30406: Make async and await proper keywords (#1669) | Jelle Zijlstra | 2017-10-06 | 1 | -7/+0
    Per PEP 492, 'async' and 'await' should become proper keywords in 3.7.
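    The user-visible effect since 3.7, as a quick illustration ("<example>" is a placeholder filename):

        try:
            compile("async = 10", "<example>", "exec")
        except SyntaxError as exc:
            print("SyntaxError:", exc.msg)  # 'async' is a reserved word from 3.7 onwards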
* Remove obsolete declaration in tokenizer.h (#962) | Jim Fasarakis-Hilliard | 2017-04-03 | 1 | -2/+0
* Issue #24619: Simplify async/await tokenization. | Yury Selivanov | 2015-07-23 | 1 | -15/+6
    This commit simplifies async/await tokenization in tokenizer.c, tokenize.py & lib2to3/tokenize.py. The previous solution was to keep a stack of async-def & def blocks, whereas the new approach is just to remember the position of the outermost async-def block.
    This change won't bring any parsing performance improvements, but it makes the code much easier to read and validate.
* Issue #24619: New approach for tokenizing async/await. | Yury Selivanov | 2015-07-22 | 1 | -6/+15
    This commit fixes how one-line async-defs and defs are tracked by the tokenizer. It allows to correctly parse invalid code such as:
    >>> async def f():
    ...     def g(): pass
    ...     async = 10
    and valid code such as:
    >>> async def f():
    ...     async def g(): pass
    ...     await z
    As a consequence, it is now possible to have one-line 'async def foo(): await ..' functions:
    >>> async def foo(): return await bar()
* PEP 0492 -- Coroutines with async and await syntax. Issue #24017. | Yury Selivanov | 2015-05-12 | 1 | -0/+7
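    The syntax introduced by PEP 492, for reference (a self-contained illustrative snippet; asyncio.run() arrived later, in 3.7, and is used here only to drive the coroutine):

        import asyncio

        async def fetch_answer():
            # 'await' suspends the coroutine; sleep(0) simply yields control once
            await asyncio.sleep(0)
            return 42

        print(asyncio.run(fetch_answer()))  # 42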
* Issue #1772673: The type of `char*` arguments now changed to `const char*`. | Serhiy Storchaka | 2013-10-19 | 1 | -3/+3
* Issue #9319: Include the filename in "Non-UTF8 code ..." syntax error. | Victor Stinner | 2011-04-04 | 1 | -1/+0
* Issue #10785: Store the filename as Unicode in the Python parser. | Victor Stinner | 2011-04-04 | 1 | -1/+7
* #10222: fix for overzealous AIX compiler. | Georg Brandl | 2010-10-29 | 1 | -1/+1
* Issue #9713, #10114: Parser functions (e.g. PyParser_ASTFromFile) expect filenames encoded to the filesystem encoding with the surrogateescape error handler (to support undecodable bytes), instead of UTF-8 in strict mode. | Victor Stinner | 2010-10-16 | 1 | -1/+1
* Issue #10095: fp_setreadl() doesn't reopen the file; it reuses the file descriptor instead. | Victor Stinner | 2010-10-14 | 1 | -1/+1
* Recorded merge of revisions 81029 via svnmerge from svn+ssh://pythondev@svn.python.org/python/trunk | Antoine Pitrou | 2010-05-09 | 1 | -45/+45
    r81029 | antoine.pitrou | 2010-05-09 16:46:46 +0200 (Sun, 09 May 2010) | 3 lines
    Untabify C files. Will watch buildbots.
* Merged revisions 76230 via svnmerge from svn+ssh://pythondev@svn.python.org/python/trunk | Benjamin Peterson | 2009-11-13 | 1 | -2/+3
    r76230 | benjamin.peterson | 2009-11-12 17:39:44 -0600 (Thu, 12 Nov 2009) | 2 lines
    fix several compile() issues by translating newlines in the tokenizer
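    Roughly what this enables, as an illustrative check: compile() accepts source that uses \r\n line endings because the tokenizer translates newlines first ("<example>" is a placeholder filename):

        source = "a = 1\r\nb = a + 1\r\n"
        namespace = {}
        exec(compile(source, "<example>", "exec"), namespace)
        print(namespace["a"], namespace["b"])  # 1 2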
* ignore the coding cookie in compile(), exec(), and eval() if the source is a string #4626 | Benjamin Peterson | 2009-03-02 | 1 | -0/+1
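    In practice this means a (possibly mismatching) coding cookie in an already-decoded str is simply ignored rather than used to re-decode the text; an illustrative sketch:

        # The cookie claims latin-1, but the source is already a str,
        # so compile() ignores the declaration instead of re-decoding.
        source = "# -*- coding: latin-1 -*-\ntext = 'café'\n"
        namespace = {}
        exec(compile(source, "<example>", "exec"), namespace)
        print(namespace["text"])  # café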
* Latin-1 source code was not being properly decoded when passed through compile(). | Brett Cannon | 2008-10-17 | 1 | -2/+2
    This was due to left-over special-casing before UTF-8 became the default source encoding.
    Closes issue #3574. Thanks to Victor Stinner for help with the patch.