author: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>, 2024-11-22 19:27:41 (GMT)
committer: GitHub <noreply@github.com>, 2024-11-22 19:27:41 (GMT)
commit: 4b12a6ff4ac6659156a01ae224249c7e145ff865
tree: 55adddc1c7e222e117d8ca1c03c53336d7ba2494 /InternalDocs
parent: ca3ea9ad05c3d876a58463595e5b4228fda06936
gh-119786: add code object doc, inline locations.md into it (#126832)
Diffstat (limited to 'InternalDocs')
-rw-r--r-- InternalDocs/README.md | 4
-rw-r--r-- InternalDocs/code_objects.md | 140
-rw-r--r-- InternalDocs/compiler.md | 8
-rw-r--r-- InternalDocs/interpreter.md | 2
-rw-r--r-- InternalDocs/locations.md | 69
5 files changed, 142 insertions, 81 deletions
diff --git a/InternalDocs/README.md b/InternalDocs/README.md
index 2ef6e65..dbc858b 100644
--- a/InternalDocs/README.md
+++ b/InternalDocs/README.md
@@ -24,9 +24,7 @@ Compiling Python Source Code
Runtime Objects
---
-- [Code Objects (coming soon)](code_objects.md)
-
-- [The Source Code Locations Table](locations.md)
+- [Code Objects](code_objects.md)
- [Generators (coming soon)](generators.md)
diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md
index 284a8b7..bee4a9d 100644
--- a/InternalDocs/code_objects.md
+++ b/InternalDocs/code_objects.md
@@ -1,5 +1,139 @@
-Code objects
-============
+# Code objects
-Coming soon.
+A `CodeObject` is a builtin Python type that represents a compiled executable,
+such as a compiled function or class.
+It contains a sequence of bytecode instructions along with its associated
+metadata: data which is necessary to execute the bytecode instructions (such
+as the values of the constants they access) or context information such as
+the source code location, which is useful for debuggers and other tools.
+
+Since 3.11, the final field of the `PyCodeObject` C struct is an array
+of indeterminate length containing the bytecode, `code->co_code_adaptive`.
+(In older versions the bytecode was stored in a separate
+[`bytes`](https://docs.python.org/dev/library/stdtypes.html#bytes)
+object, `code->co_code`; this was changed to save an allocation and to
+allow it to be mutated.)
+
+Code objects are typically produced by the bytecode [compiler](compiler.md),
+although they are often written to disk by one process and read back in by another.
+The disk version of a code object is serialized using the
+[marshal](https://docs.python.org/dev/library/marshal.html) protocol.
+
+Code objects are nominally immutable.
+Some fields (including `co_code_adaptive` and fields for runtime
+information such as `_co_monitoring`) are mutable, but mutable fields are
+not included when code objects are hashed or compared.
+
+## Source code locations
+
+Whenever an exception occurs, the interpreter adds a traceback entry to
+the exception for the current frame, as well as each frame on the stack that
+it unwinds.
+The `tb_lineno` field of a traceback entry is (lazily) set to the line
+number of the instruction that was executing in the frame at the time of
+the exception.
+This field is computed from the locations table, `co_linetable`, by the function
+[`PyCode_Addr2Line`](https://docs.python.org/dev/c-api/code.html#c.PyCode_Addr2Line).
+Despite its name, `co_linetable` includes more than line numbers; it represents
+a 4-number source location for every instruction, indicating the precise line
+and column at which it begins and ends. This is a significant amount of data,
+so a compact format is very important.
+
+Note that traceback objects don't store all this information -- they store the start line
+number, for backward compatibility, and the "last instruction" value.
+The rest can be computed from the last instruction (`tb_lasti`) with the help of the
+locations table. For Python code, there is a convenience method
+[`codeobject.co_positions()`](https://docs.python.org/dev/reference/datamodel.html#codeobject.co_positions)
+which returns an iterator of `({line}, {endline}, {column}, {endcolumn})` tuples,
+one per instruction.
+There is also `co_lines()` which returns an iterator of `({start}, {end}, {line})` tuples,
+where `{start}` and `{end}` are bytecode offsets.
+The latter is described by [`PEP 626`](https://peps.python.org/pep-0626/); it is more
+compact, but doesn't return end line numbers or column offsets.
+From C code, you need to call
+[`PyCode_Addr2Location`](https://docs.python.org/dev/c-api/code.html#c.PyCode_Addr2Location).
+
+As the locations table is only consulted when displaying a traceback and when
+tracing (to pass the line number to the tracing function), lookup is not
+performance critical.
+In order to reduce the overhead during tracing, the mapping from instruction offset to
+line number is cached in the `_co_linearray` field.
+
+### Format of the locations table
+
+The `co_linetable` bytes object of code objects contains a compact
+representation of the source code positions of instructions, which are
+returned by the `co_positions()` iterator.
+
+> [!NOTE]
+> `co_linetable` is not to be confused with `co_lnotab`.
+> For backwards compatibility, `co_lnotab` exposes the format
+> as it existed in Python 3.10 and lower: this older format
+> stores only the start line for each instruction.
+> It is lazily created from `co_linetable` when accessed.
+> See [`Objects/lnotab_notes.txt`](../Objects/lnotab_notes.txt) for more details.
+
+`co_linetable` consists of a sequence of location entries.
+Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with the most significant bit unset.
+
+Each entry contains the following information:
+* The number of code units covered by this entry (length)
+* The start line
+* The end line
+* The start column
+* The end column
+
+The first byte has the following format:
+
+Bit 7 | Bits 3-6 | Bits 0-2
+ ---- | ---- | ----
+ 1 | Code | Length (in code units) - 1
+
+The codes are enumerated in the `_PyCodeLocationInfoKind` enum.
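The first-byte layout in the table above can be unpacked with a few bit operations. This is a sketch only; `split_first_byte` is a hypothetical helper name, not a CPython function.

```python
def split_first_byte(byte):
    """Split the first byte of a location entry into (code, length).

    Bit 7 must be set, bits 3-6 hold the code, and bits 0-2 hold
    the length in code units minus one.
    """
    if not byte & 0x80:
        raise ValueError("first byte of an entry must have bit 7 set")
    code = (byte >> 3) & 0x0F
    length = (byte & 0x07) + 1
    return code, length
```

For instance, the byte `0xD2` (binary `1 1010 010`) splits into code 10 and a length of 3 code units.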
+
+## Variable-length integer encodings
+
+Integers in the locations table are often encoded using one of the following variable-length integer encodings.
+
+### Unsigned integers (`varint`)
+
+Unsigned integers are encoded in 6-bit chunks, least significant first.
+Each chunk but the last has bit 6 set.
+For example:
+
+* 63 is encoded as `0x3f`
+* 200 is encoded as `0x48`, `0x03`
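The unsigned encoding can be sketched as the following pair of functions. The names `encode_varint` and `decode_varint` are hypothetical and written only to mirror the description above; they are not the helpers CPython uses.

```python
def encode_varint(value):
    """Encode a non-negative int in 6-bit chunks, least significant first."""
    if value < 0:
        raise ValueError("varint encodes unsigned integers only")
    out = bytearray()
    while value >= 64:
        out.append(0x40 | (value & 0x3F))  # bit 6 set: more chunks follow
        value >>= 6
    out.append(value)  # last chunk: bit 6 clear
    return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, next_pos) for the varint that starts at data[pos]."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x3F) << shift
        shift += 6
        if not byte & 0x40:  # bit 6 clear marks the last chunk
            return value, pos
```

Because every chunk keeps bit 7 clear, these bytes cannot be mistaken for the first byte of the next location entry.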
+
+### Signed integers (`svarint`)
+
+Signed integers are encoded by converting them to unsigned integers, using the following function:
+```Python
+def convert(s):
+    if s < 0:
+        return ((-s)<<1) | 1
+    else:
+        return (s<<1)
+```
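To show how a reader of the table might undo this conversion, here is a sketch of the mapping together with an inverse. `to_unsigned` restates `convert()` above; `to_signed` is a hypothetical name for the inverse, which the text does not spell out.

```python
def to_unsigned(s):
    # Same mapping as convert(): magnitude shifted left, sign kept in bit 0.
    if s < 0:
        return ((-s) << 1) | 1
    return s << 1


def to_signed(u):
    # Assumed inverse: bit 0 is the sign, the remaining bits the magnitude.
    magnitude = u >> 1
    return -magnitude if u & 1 else magnitude
```

The unsigned result is then written with the `varint` encoding, so small deltas of either sign fit in a single byte.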
+
+*Location entries*
+
+The meanings of the codes and of the bytes that follow them are as follows:
+
+Code | Meaning | Start line | End line | Start column | End column
+ ---- | ---- | ---- | ---- | ---- | ----
+ 0-9 | Short form | Δ 0 | Δ 0 | See below | See below
+ 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte
+ 13 | No column info | Δ svarint | Δ 0 | None | None
+ 14 | Long form | Δ svarint | Δ varint | varint | varint
+ 15 | No location | None | None | None | None
+
+The Δ means the value is encoded as a delta from another value:
+* Start line: Delta from the previous start line, or `co_firstlineno` for the first entry.
+* End line: Delta from the start line
+
+*The short forms*
+
+Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down).
+* Start column: `(code*8) + ((second_byte>>4)&7)`
+* End column: `start_column + (second_byte&15)`
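The two short-form formulas above translate directly into code. In this sketch, `decode_short_form` is a hypothetical helper that takes the code (already extracted from the first byte) and the second byte of the entry.

```python
def decode_short_form(code, second_byte):
    """Return (start_column, end_column) for a short-form entry (codes 0-9)."""
    if not 0 <= code <= 9:
        raise ValueError("short forms use codes 0-9")
    # Bits 4-6 of the second byte refine the start column within its group of 8.
    start_column = (code * 8) + ((second_byte >> 4) & 7)
    # Bits 0-3 hold the width of the span.
    end_column = start_column + (second_byte & 15)
    return start_column, end_column
```

With code 1 and second byte `0x23`, the start column is `8 + 2 = 10` and the end column is `10 + 3 = 13`. The start and end lines are unchanged from the previous entry, as the table indicates with Δ 0.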
diff --git a/InternalDocs/compiler.md b/InternalDocs/compiler.md
index 37964bd..ed4cfb2 100644
--- a/InternalDocs/compiler.md
+++ b/InternalDocs/compiler.md
@@ -443,14 +443,12 @@ reference to the source code (filename, etc). All of this is implemented by
Code objects
============
-The result of `PyAST_CompileObject()` is a `PyCodeObject` which is defined in
+The result of `_PyAST_Compile()` is a `PyCodeObject` which is defined in
[Include/cpython/code.h](../Include/cpython/code.h).
And with that you now have executable Python bytecode!
-The code objects (byte code) are executed in [Python/ceval.c](../Python/ceval.c).
-This file will also need a new case statement for the new opcode in the big switch
-statement in `_PyEval_EvalFrameDefault()`.
-
+The code objects (byte code) are executed in `_PyEval_EvalFrameDefault()`
+in [Python/ceval.c](../Python/ceval.c).
Important files
===============
diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md
index dcfddc9..4c10cbb 100644
--- a/InternalDocs/interpreter.md
+++ b/InternalDocs/interpreter.md
@@ -16,7 +16,7 @@ from the instruction definitions in [Python/bytecodes.c](../Python/bytecodes.c)
which are written in [a DSL](../Tools/cases_generator/interpreter_definition.md)
developed for this purpose.
-Recall that the [Python Compiler](compiler.md) produces a [`CodeObject`](code_object.md),
+Recall that the [Python Compiler](compiler.md) produces a [`CodeObject`](code_objects.md),
which contains the bytecode instructions along with static data that is required to execute them,
such as the consts list, variable names,
[exception table](exception_handling.md#format-of-the-exception-table), and so on.
diff --git a/InternalDocs/locations.md b/InternalDocs/locations.md
deleted file mode 100644
index 91a7824..0000000
--- a/InternalDocs/locations.md
+++ /dev/null
@@ -1,69 +0,0 @@
-# Locations table
-
-The `co_linetable` bytes object of code objects contains a compact
-representation of the source code positions of instructions, which are
-returned by the `co_positions()` iterator.
-
-`co_linetable` consists of a sequence of location entries.
-Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with most significant bit unset.
-
-Each entry contains the following information:
-* The number of code units covered by this entry (length)
-* The start line
-* The end line
-* The start column
-* The end column
-
-The first byte has the following format:
-
-Bit 7 | Bits 3-6 | Bits 0-2
- ---- | ---- | ----
- 1 | Code | Length (in code units) - 1
-
-The codes are enumerated in the `_PyCodeLocationInfoKind` enum.
-
-## Variable length integer encodings
-
-Integers are often encoded using a variable length integer encoding
-
-### Unsigned integers (varint)
-
-Unsigned integers are encoded in 6 bit chunks, least significant first.
-Each chunk but the last has bit 6 set.
-For example:
-
-* 63 is encoded as `0x3f`
-* 200 is encoded as `0x48`, `0x03`
-
-### Signed integers (svarint)
-
-Signed integers are encoded by converting them to unsigned integers, using the following function:
-```Python
-def convert(s):
-    if s < 0:
-        return ((-s)<<1) | 1
-    else:
-        return (s<<1)
-```
-
-## Location entries
-
-The meaning of the codes and the following bytes are as follows:
-
-Code | Meaning | Start line | End line | Start column | End column
- ---- | ---- | ---- | ---- | ---- | ----
- 0-9 | Short form | Δ 0 | Δ 0 | See below | See below
- 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte
- 13 | No column info | Δ svarint | Δ 0 | None | None
- 14 | Long form | Δ svarint | Δ varint | varint | varint
- 15 | No location | None | None | None | None
-
-The Δ means the value is encoded as a delta from another value:
-* Start line: Delta from the previous start line, or `co_firstlineno` for the first entry.
-* End line: Delta from the start line
-
-### The short forms
-
-Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down).
-* Start column: `(code*8) + ((second_byte>>4)&7)`
-* End column: `start_column + (second_byte&15)`