path: root/Parser/tokenizer.c
Commit message | Author | Age | Files | Lines
* bpo-46820: Fix a SyntaxError in a numeric literal followed by "not in" (GH-31479) (GH-31493) | Miss Islington (bot) | 2022-02-22 | 1 | -0/+3
  Fix parsing a numeric literal immediately (without spaces) followed by "not in" keywords, like in "1not in x". Now the parser only emits a warning, not a syntax error.
  (cherry picked from commit 090e5c4b946b28f50fce445916c5d3ec45c8f45f)
  Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* [3.10] bpo-46521: Fix codeop to use a new partial-input mode of the parser (GH-31010). (GH-31213) | Pablo Galindo Salgado | 2022-02-08 | 1 | -10/+16
  (cherry picked from commit 69e10976b2e7682c6d57f4272932ebc19f8e8859)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* bpo-14916: use specified tokenizer fd for file input (GH-31006) | Miss Islington (bot) | 2022-02-03 | 1 | -1/+1
  @pablogsal, sorry I failed to rebase to main, so I recreated https://github.com/python/cpython/pull/22190#issuecomment-1024633392
  > PyRun_InteractiveOne\*() functions allow explicitly setting the fd instead of stdin, but stdin was hardcoded in the readline call.
  > This patch does not fix the target file for the prompt, unlike the original bpo one: the prompt fd is unrelated to the tokenizer source, which could be read-only. It is more of a bugfix regarding the docs: the current documentation says "prompt the user", so one would expect the prompt to go to stdout, not a file, for both PyRun_InteractiveOne\*() and PyRun_InteractiveLoop\*().
  Automerge-Triggered-By: GH:pablogsal
  (cherry picked from commit 89b13042fcfc95bae21a49806a205ef62f1cdd73)
  Co-authored-by: Paul m. p. P <mail.peny@free.fr>
* [3.10] bpo-46091: Correctly calculate indentation levels for whitespace lines with continuation characters (GH-30130). (GH-30898) | Pablo Galindo Salgado | 2022-01-25 | 1 | -13/+33
  (cherry picked from commit a0efc0c1960e2c49e0092694d98395555270914c)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* bpo-46054: Fix parsing error when parsing non-utf8 characters in source files (GH-30068) (GH-30069) | Miss Islington (bot) | 2021-12-12 | 1 | -8/+5
  (cherry picked from commit 4325a766f5f603ef6dfb8c4d5798e5e73cb5efd5)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* [3.10] Ensure the str member of the tokenizer is always initialised (GH-29681). (GH-29683) | Pablo Galindo Salgado | 2021-11-21 | 1 | -1/+1
  (cherry picked from commit 4f006a789a35f5d1a7ef142bd1304ce167392457)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* bpo-45738: Fix computation of error location for invalid continuation characters in the parser (GH-29550) | Miss Islington (bot) | 2021-11-14 | 1 | -1/+0
  (cherry picked from commit 25835c518aa7446f3680b62c1fb43827e0f190d9)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* bpo-45562: Ensure all tokenizer debug messages are printed to stderr (GH-29270) | Miss Islington (bot) | 2021-10-29 | 1 | -1/+1
  (cherry picked from commit cdc7a5827754bec83970bb052d410d55f85b3fff)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* bpo-45562: Print tokenizer debug messages to stderr (GH-29250) (GH-29252) | Miss Islington (bot) | 2021-10-27 | 1 | -4/+4
  (cherry picked from commit 10bbd41ba8c88bc102df108a4e0444abc7c5ea43)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* bpo-45574: fix warning about `print_escape` being unused (GH-29172) (#29176) | Miss Islington (bot) | 2021-10-23 | 1 | -0/+2
  It used to be like this: (screenshot: https://user-images.githubusercontent.com/4660275/138516608-fef6ec01-a96a-40f4-81ef-52265b0f536b.png)
  A quick `grep` shows that it is used in just one place, under `Py_DEBUG`: https://github.com/python/cpython/blame/f6e8b80d20159596cf641305bad3a833bedd2f4f/Parser/tokenizer.c#L1047-L1051 (screenshot: https://user-images.githubusercontent.com/4660275/138516684-ea503136-1e92-48a5-95bb-419e190d5866.png)
  I am not sure, but it also looks like a private thing, so it should not affect other users.
  Automerge-Triggered-By: GH:pablogsal
  (cherry picked from commit 4bc5473a42c5eae0928430930b897209492e849d)
  Co-authored-by: Nikita Sobolev <mail@sobolevn.me>
* bpo-45562: Only show debug output from the parser in debug builds (GH-29140) (#29149) | Miss Islington (bot) | 2021-10-22 | 1 | -0/+2
  (cherry picked from commit 86dfb55d2e091cf633dbd7aabcd49d96fb1f9d81)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* Update URLs in comments and metadata to use HTTPS (GH-27458) (GH-27478) | Miss Islington (bot) | 2021-07-30 | 1 | -1/+1
  (cherry picked from commit be42c06bb01206209430f3ac08b72643dc7cad1c)
  Co-authored-by: Noah Kantrowitz <noah@coderanger.net>
* bpo-44317: Improve tokenizer errors with more informative locations (GH-26555) (GH-27079) | Miss Islington (bot) | 2021-07-10 | 1 | -18/+54
  (cherry picked from commit f24777c2b329974b69d2a3bf5cfc37e0fcace36c)
  Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
* bpo-44396: Update multi-line-start location when reallocating tokenizer buffers (GH-26676) (GH-26695) | Miss Islington (bot) | 2021-06-12 | 1 | -0/+5
  Automerge-Triggered-By: GH:pablogsal
  (cherry picked from commit a342cc5891dbd8a08d40e9444f2e2c9e93258721)
* bpo-43833: Emit warnings for numeric literals followed by keyword (GH-25466) | Miss Islington (bot) | 2021-06-08 | 1 | -0/+128
  Emit a deprecation warning if the numeric literal is immediately followed by one of the keywords: and, else, for, if, in, is, or. Raise a syntax error with a more informative message if it is immediately followed by another keyword or identifier.
  Automerge-Triggered-By: GH:pablogsal
  (cherry picked from commit 2ea6d890281c415e0a2f00e63526e592da8ce3d9)
  Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
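The behaviour described in this entry can be observed from Python with a small probe. This is only a sketch: `probe` is a hypothetical helper name, and the result for a literal glued to a keyword depends on the interpreter version, so the loop prints whatever the running interpreter reports rather than assuming an outcome.

```python
import warnings


def probe(source):
    """Compile `source` as an expression; report the outcome and any warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # surface warnings that are hidden by default
        try:
            compile(source, "<example>", "eval")
        except SyntaxError as exc:
            return ("error", exc.msg)
    return ("ok", [w.category.__name__ for w in caught])


# The first case is ordinary syntax and the last is always rejected;
# the two in the middle are the version-dependent ones.
for src in ("1 if x else 2", "1if x else 2", "1not in x", "1abc"):
    print(repr(src), "->", probe(src))
```

Writing the space after the literal sidesteps the question entirely and behaves the same on every version.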
* bpo-44201: Avoid side effects of "invalid_*" rules in the REPL (GH-26298) (GH-26313) | Miss Islington (bot) | 2021-05-22 | 1 | -0/+9
  When the parser does a second pass to check for errors, these rules can have some small side-effects as they may advance the parser more than the point reached in the first pass. This can cause the tokenizer to ask for extra tokens in interactive mode, causing the tokenizer to show the prompt instead of failing instantly.
  To avoid this, add a new mode to the tokenizer that is activated in the second pass and deactivates asking for new tokens when the interactive line is finished. As the parsing should have reached the last line in the first pass, the second pass should not need to ask for more tokens.
  (cherry picked from commit bd7476dae337e905e7b1bbf33ddb96cc270fdc84)
  Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
* Fix tokenizer error when raw decoding null bytes (GH-25080) | Pablo Galindo | 2021-03-29 | 1 | -1/+4
* bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050) | Pablo Galindo | 2021-03-28 | 1 | -354/+332
* bpo-43410: Fix crash in the parser when producing syntax errors when reading from stdin (GH-24763) | Pablo Galindo | 2021-03-14 | 1 | -26/+52
* bpo-40176: Improve error messages for unclosed string literals (GH-19346) | Batuhan Taskaya | 2021-01-20 | 1 | -10/+16
  Automerge-Triggered-By: GH:isidentical
* bpo-42864: Fix compiler warning in the tokenizer with the new paren stack for column numbers (GH-24266) | Pablo Galindo | 2021-01-20 | 1 | -1/+1
* bpo-42864: Improve error messages regarding unclosed parentheses (GH-24161) | Pablo Galindo | 2021-01-19 | 1 | -1/+4
* bpo-42827: Fix crash on SyntaxError in multiline expressions (GH-24140) | Lysandros Nikolaou | 2021-01-14 | 1 | -0/+21
  When trying to extract the error line for the error message there are two distinct cases:
  1. The input comes from a file, which means that we can extract the error line by using `PyErr_ProgramTextObject`, which we already do.
  2. The input does not come from a file, at which point we need to get the source code from the tokenizer:
     * If the tokenizer's current line number is the same as the line of the error, we get the line from `tok->buf` and we're ready.
     * Else, we can extract the error line from the source code in the following two ways:
       * If the input comes from a string, we have all the input in `tok->str` and we can extract the error line from it.
       * If the input comes from stdin, i.e. the interactive prompt, we do not have access to the previous line. That's why a new field `tok->stdin_content` is added which holds the whole input for the current (multiline) statement or expression. We can then extract the error line from `tok->stdin_content` like we do in the string case above.
  Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
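The string case above amounts to picking a line out of the full source by its line number. A rough Python analogue follows, not the C implementation: `error_line` is a hypothetical helper, and the sample source is made up for illustration.

```python
def error_line(source, lineno):
    """Return the 1-based line `lineno` of `source`, or None if unavailable."""
    if lineno is None:
        return None
    lines = source.splitlines()
    if 1 <= lineno <= len(lines):
        return lines[lineno - 1]
    return None


# A multi-line expression whose last line contains an invalid character.
source = "total = (1 +\n         2 +\n         3 $ 4)\n"
try:
    compile(source, "<string>", "exec")
except SyntaxError as exc:
    # exc.lineno is the line the parser blames; exc.text is what CPython extracted.
    print(exc.lineno, repr(error_line(source, exc.lineno)), repr(exc.text))
```

The stdin case cannot work this way without the accumulated input, which is the gap `tok->stdin_content` fills according to the commit message.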
* bpo-42519: Replace PyMem_MALLOC() with PyMem_Malloc() (GH-23586) | Victor Stinner | 2020-12-01 | 1 | -30/+30
  No longer use deprecated aliases to functions:
  * Replace PyMem_MALLOC() with PyMem_Malloc()
  * Replace PyMem_REALLOC() with PyMem_Realloc()
  * Replace PyMem_FREE() with PyMem_Free()
  * Replace PyMem_Del() with PyMem_Free()
  * Replace PyMem_DEL() with PyMem_Free()
  Also modify the PyMem_DEL() macro to use PyMem_Free() directly.
* bpo-36020: Remove snprintf macro in pyerrors.h (GH-20889) | Victor Stinner | 2020-06-15 | 1 | -1/+1
  On Windows, #include "pyerrors.h" no longer defines "snprintf" and "vsnprintf" macros.
  PyOS_snprintf() and PyOS_vsnprintf() should be used to get portable behavior.
  Replace snprintf() calls with PyOS_snprintf() and replace vsnprintf() calls with PyOS_vsnprintf().
* bpo-40847: Consider a line with only a LINECONT a blank line (GH-20769) | Lysandros Nikolaou | 2020-06-10 | 1 | -1/+2
  A line with only a line continuation character should be considered a blank line at tokenizer level so that only a single NEWLINE token gets emitted. The old parser was working around the issue, but the new parser threw a `SyntaxError` for valid input. For example, an empty line following a line continuation character was interpreted as a `SyntaxError`.
  Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
* Fix peg_generator compiler warnings under MSVC (GH-20405) | Ammar Askar | 2020-05-26 | 1 | -4/+0
* bpo-40593: Improve syntax errors for invalid characters in source code. (GH-20033) | Serhiy Storchaka | 2020-05-12 | 1 | -9/+37
* bpo-40246: Revert reporting of invalid string prefixes (GH-19888) | Lysandros Nikolaou | 2020-05-04 | 1 | -4/+0
  Due to backwards compatibility concerns, commit 41d5b94af44e34ac05d4cd57460ed104ccf96628 has to be reverted: with it, keywords immediately followed by a string without whitespace between them (like in `bg="#d00" if clear else"#fca"`) would fail to parse.
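For illustration, the construct this revert keeps working is a keyword written directly against a string literal. A small sketch using the expression quoted in the commit message (the function wrapper `background` is hypothetical):

```python
def background(clear):
    # No whitespace between `else` and the string literal; this is valid syntax.
    return "#d00" if clear else"#fca"


print(background(True))
print(background(False))
```

Code in the wild relies on this spelling, which is why reporting `else"` as an invalid string prefix was a compatibility problem.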
* bpo-40335: Correctly handle multi-line strings in tokenize error scenarios (GH-19619) | Pablo Galindo | 2020-04-21 | 1 | -3/+4
  Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
* bpo-40246: Report a better error message for invalid string prefixes (GH-19476) | Lysandros Nikolaou | 2020-04-12 | 1 | -0/+4
* bpo-39882: Add _Py_FatalErrorFormat() function (GH-19157) | Victor Stinner | 2020-03-25 | 1 | -1/+1
* bpo-39882: Py_FatalError() logs the function name (GH-18819) | Victor Stinner | 2020-03-06 | 1 | -3/+5
  The Py_FatalError() function is replaced with a macro which automatically logs the name of the current function, unless the Py_LIMITED_API macro is defined.
  Changes:
  * Add _Py_FatalErrorFunc() function.
  * Remove the function name from the message of Py_FatalError() calls which included the function name.
  * Update tests.
* closes bpo-39721: Fix constness of members of tok_state struct. (GH-18600) | Andy Lester | 2020-02-28 | 1 | -20/+30
  The function PyTokenizer_FromUTF8 from Parser/tokenizer.c had a comment:
      /* XXX: constify members. */
  This patch addresses that.
  In the tok_state struct:
  * end and start were non-const but could be made const
  * str and input were const but should have been non-const
  Changes to support this include:
  * decode_str() now returns a char * since it is allocated.
  * PyTokenizer_FromString() and PyTokenizer_FromUTF8() each creates a new char * for an allocated string instead of reusing the input const char *.
  * PyTokenizer_Get() and tok_get() now take const char ** arguments.
  * Various local vars are const or non-const accordingly.
  I was able to remove five casts that cast away constness.
* bpo-39219: Fix SyntaxError attributes in the tokenizer. (GH-17828) | Serhiy Storchaka | 2020-02-12 | 1 | -4/+32
  * Always set the text attribute.
  * Correct the offset attribute for non-ascii sources.
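The attributes in question are the public fields of `SyntaxError`. A short sketch for inspecting them on a non-ASCII source line; `syntax_error_info` is a hypothetical helper, and the exact message and offset values vary between interpreter versions, so none are assumed here.

```python
def syntax_error_info(source):
    """Return the location attributes of the SyntaxError raised by `source`, if any."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        return {
            "msg": exc.msg,
            "lineno": exc.lineno,
            "offset": exc.offset,  # 1-based column, counted in characters
            "text": exc.text,      # the offending source line
        }
    return None


# `$` is not a valid Python token, so this line cannot compile.
info = syntax_error_info("größe = 1 $ 2\n")
print(info)
```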
* bpo-39500: Document PyUnicode_IsIdentifier() function (GH-18397) | Victor Stinner | 2020-02-11 | 1 | -1/+2
  PyUnicode_IsIdentifier() does not call Py_FatalError() anymore if the string is not ready.
* bpo-39209: Manage correctly multi-line tokens in interactive mode (GH-17860) | Pablo Galindo | 2020-01-06 | 1 | -0/+2
* bpo-38673: dont switch to ps2 if the line starts with comment or whitespace (GH-17421) | Batuhan Taşkaya | 2019-12-09 | 1 | -0/+6
  https://bugs.python.org/issue38673
* Indent code inside if block. (GH-15284) | Hansraj Das | 2019-08-15 | 1 | -1/+1
  Without indentation, it seems like the strcpy line is parallel to the `if` condition.
* Fix `SyntaxError` indicator printing too many spaces for multi-line strings (GH-14433) | Anthony Sottile | 2019-07-29 | 1 | -0/+2
* bpo-36878: Only allow text after `# type: ignore` if first character ASCII (GH-13504) | Michael J. Sullivan | 2019-05-22 | 1 | -2/+3
  This disallows things like `# type: ignoreé`, which seems wrong.
  Also switch to using Py_ISALNUM for the alnum check, for consistency with other code (and maybe correctness re: locale issues?).
  https://bugs.python.org/issue36878
* bpo-36878: Track extra text added to 'type: ignore' in the AST (GH-13479) | Michael J. Sullivan | 2019-05-22 | 1 | -2/+6
  GH-13238 made extra text after a # type: ignore accepted by the parser. This finishes the job and actually plumbs the extra text through the parser and makes it available in the AST.
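From Python, the extra text surfaces on the `TypeIgnore` nodes that `ast.parse()` collects when `type_comments=True` is passed. A minimal sketch with a made-up source string:

```python
import ast

source = (
    "a = 1  # type: ignore\n"
    "b = 2  # type: ignore[E1000]\n"
    "c = 3  # an ordinary comment\n"
)

# type_ignores is only populated when type comments are requested.
tree = ast.parse(source, type_comments=True)
ignores = [(ti.lineno, ti.tag) for ti in tree.type_ignores]
print(ignores)  # [(1, ''), (2, '[E1000]')]
```

The `tag` field carries whatever followed `ignore`, which is what lets tools distinguish a blanket ignore from one scoped to specific error codes.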
* bpo-2180: Treat line continuation at EOF as a `SyntaxError` (GH-13401) | Anthony Sottile | 2019-05-18 | 1 | -1/+10
  This makes the parser consistent with the tokenize module (already the case in `pypy`).

  sample
  ------

  ```python
  x = 5\
  ```

  before
  ------

  ```console
  $ python3 t.py
  $ python3 -mtokenize t.py
  t.py:2:0: error: EOF in multi-line statement
  ```

  after
  -----

  ```console
  $ ./python t.py
    File "t.py", line 3
      x = 5\
           ^
  SyntaxError: unexpected EOF while parsing
  $ ./python -m tokenize t.py
  t.py:2:0: error: EOF in multi-line statement
  ```

  https://bugs.python.org/issue2180
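The same check can be made without a file by handing the sample to `compile()`. A small sketch; `compiles` is a hypothetical helper name.

```python
def compiles(source):
    """Return True if `source` compiles as a module, False on SyntaxError."""
    try:
        compile(source, "t.py", "exec")
    except SyntaxError:
        return False
    return True


print(compiles("x = 5\n"))   # a complete statement
print(compiles("x = 5\\"))   # a backslash continuation with nothing after it
```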
* bpo-36878: Allow extra text after `# type: ignore` comments (GH-13238) | Michael J. Sullivan | 2019-05-11 | 1 | -8/+5
  In the parser, when using the type_comments=True option, recognize a TYPE_IGNORE as anything containing `# type: ignore` followed by a non-alphanumeric character. This is to allow ignores such as `# type: ignore[E1000]`.
* bpo-36623: Clean parser headers and include files (GH-12253) | Pablo Galindo | 2019-04-13 | 1 | -1/+0
  After the removal of pgen, multiple headers and function prototypes that lack implementation or are unused are still lying around.
* bpo-36459: Fix a possible double PyMem_FREE() due to tokenizer.c's tok_nextc() (12601) | Zackery Spytz | 2019-03-28 | 1 | -1/+0
  Remove the PyMem_FREE() call added in cb90c89. The buffer will be freed when PyTokenizer_Free() is called on the tokenizer state.
* bpo-36367: Free buffer if realloc fails in tokenize.c (GH-12442) | Pablo Galindo | 2019-03-19 | 1 | -2/+8
* bpo-35975: Support parsing earlier minor versions of Python 3 (GH-12086) | Guido van Rossum | 2019-03-07 | 1 | -0/+79
  This adds a `feature_version` flag to `ast.parse()` (documented) and `compile()` (hidden) that allows tweaking the parser to support older versions of the grammar. In particular if `feature_version` is 5 or 6, the hacks for the `async` and `await` keyword from PEP 492 are reinstated. (For 7 or higher, these are unconditionally treated as keywords, but they are still special tokens rather than `NAME` tokens that the parser driver recognizes.)
  https://bugs.python.org/issue35975
* bpo-35808: Retire pgen and use pgen2 to generate the parser (GH-11814) | Pablo Galindo | 2019-03-01 | 1 | -56/+0
  Pgen is the oldest piece of technology in the CPython repository; building it requires various #if[n]def PGEN hacks in other parts of the code, and it also depends more and more on CPython internals. This commit removes the old pgen C code and replaces it with a new version implemented in pure Python. This is a modified and adapted version of lib2to3/pgen2 that can generate grammar files compatible with the current parser.
  This commit also eliminates all the #ifdef and code branches related to pgen, simplifying the code and making it more maintainable. The regen-grammar step now uses $(PYTHON_FOR_REGEN) that can be any version of the interpreter, so the new pgen code maintains compatibility with older versions of the interpreter (this also allows regenerating the grammar with the current CI solution that uses Python3.5). The new pgen Python module also makes use of the Grammar/Tokens file that holds the token specification, so it is always kept in sync and avoids having to maintain duplicate token definitions.
* bpo-35766: Merge typed_ast back into CPython (GH-11645) | Guido van Rossum | 2019-01-31 | 1 | -1/+56