summaryrefslogtreecommitdiffstats
path: root/InternalDocs
diff options
context:
space:
mode:
Diffstat (limited to 'InternalDocs')
-rw-r--r--InternalDocs/README.md4
-rw-r--r--InternalDocs/adaptive.md146
-rw-r--r--InternalDocs/code_objects.md5
-rw-r--r--InternalDocs/compiler.md10
-rw-r--r--InternalDocs/interpreter.md210
-rw-r--r--InternalDocs/jit.md134
6 files changed, 322 insertions, 187 deletions
diff --git a/InternalDocs/README.md b/InternalDocs/README.md
index 8cdd06d..794b4f3 100644
--- a/InternalDocs/README.md
+++ b/InternalDocs/README.md
@@ -34,9 +34,9 @@ Runtime Objects
Program Execution
---
-- [The Interpreter](interpreter.md)
+- [The Bytecode Interpreter](interpreter.md)
-- [Adaptive Instruction Families](adaptive.md)
+- [The JIT](jit.md)
- [Garbage Collector Design](garbage_collector.md)
diff --git a/InternalDocs/adaptive.md b/InternalDocs/adaptive.md
deleted file mode 100644
index 7cfa8e5..0000000
--- a/InternalDocs/adaptive.md
+++ /dev/null
@@ -1,146 +0,0 @@
-# Adding or extending a family of adaptive instructions.
-
-## Families of instructions
-
-The core part of [PEP 659](https://peps.python.org/pep-0659/)
-(specializing adaptive interpreter) is the families of
-instructions that perform the adaptive specialization.
-
-A family of instructions has the following fundamental properties:
-
-* It corresponds to a single instruction in the code
- generated by the bytecode compiler.
-* It has a single adaptive instruction that records an execution count and,
- at regular intervals, attempts to specialize itself. If not specializing,
- it executes the base implementation.
-* It has at least one specialized form of the instruction that is tailored
- for a particular value or set of values at runtime.
-* All members of the family must have the same number of inline cache entries,
- to ensure correct execution.
- Individual family members do not need to use all of the entries,
- but must skip over any unused entries when executing.
-
-The current implementation also requires the following,
-although these are not fundamental and may change:
-
-* All families use one or more inline cache entries,
- the first entry is always the counter.
-* All instruction names should start with the name of the adaptive
- instruction.
-* Specialized forms should have names describing their specialization.
-
-## Example family
-
-The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
-already has an adaptive family that serves as a relatively simple example.
-
-The `LOAD_GLOBAL` instruction performs adaptive specialization,
-calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
-
-There are two specialized instructions in the family, `LOAD_GLOBAL_MODULE`
-which is specialized for global variables in the module, and
-`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables.
-
-## Performance analysis
-
-The benefit of a specialization can be assessed with the following formula:
-`Tbase/Tadaptive`.
-
-Where `Tbase` is the mean time to execute the base instruction,
-and `Tadaptive` is the mean time to execute the specialized and adaptive forms.
-
-`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
-
-`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
-the number of times that instruction is executed.
-`Tmiss` is the time to process a miss, including de-optimzation
-and the time to execute the base instruction.
-
-The ideal situation is where misses are rare and the specialized
-forms are much faster than the base instruction.
-`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
-In which case we have `Tadaptive ≈ sum(Ti*Ni)`.
-Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
-`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
-we would expect the specialization of `LOAD_GLOBAL` to be profitable.
-
-## Design considerations
-
-While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
-`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
-low for all specialized instructions and `Nmiss` as low as possible.
-
-Keeping `Nmiss` low means that there should be specializations for almost
-all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
-keeping `Ti` low which means minimizing branches and dependent memory
-accesses (pointer chasing). These two objectives may be in conflict,
-requiring judgement and experimentation to design the family of instructions.
-
-The size of the inline cache should as small as possible,
-without impairing performance, to reduce the number of
-`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
-
-### Gathering data
-
-Before choosing how to specialize an instruction, it is important to gather
-some data. What are the patterns of usage of the base instruction?
-Data can best be gathered by instrumenting the interpreter. Since a
-specialization function and adaptive instruction are going to be required,
-instrumentation can most easily be added in the specialization function.
-
-### Choice of specializations
-
-The performance of the specializing adaptive interpreter relies on the
-quality of specialization and keeping the overhead of specialization low.
-
-Specialized instructions must be fast. In order to be fast,
-specialized instructions should be tailored for a particular
-set of values that allows them to:
-
-1. Verify that incoming value is part of that set with low overhead.
-2. Perform the operation quickly.
-
-This requires that the set of values is chosen such that membership can be
-tested quickly and that membership is sufficient to allow the operation to
-performed quickly.
-
-For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
-dictionaries that have a keys with the expected version.
-
-This can be tested quickly:
-
-* `globals->keys->dk_version == expected_version`
-
-and the operation can be performed quickly:
-
-* `value = entries[cache->index].me_value;`.
-
-Because it is impossible to measure the performance of an instruction without
-also measuring unrelated factors, the assessment of the quality of a
-specialization will require some judgement.
-
-As a general rule, specialized instructions should be much faster than the
-base instruction.
-
-### Implementation of specialized instructions
-
-In general, specialized instructions should be implemented in two parts:
-
-1. A sequence of guards, each of the form
- `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
-2. The operation, which should ideally have no branches and
- a minimum number of dependent memory accesses.
-
-In practice, the parts may overlap, as data required for guards
-can be re-used in the operation.
-
-If there are branches in the operation, then consider further specialization
-to eliminate the branches.
-
-### Maintaining stats
-
-Finally, take care that stats are gather correctly.
-After the last `DEOPT_IF` has passed, a hit should be recorded with
-`STAT_INC(BASE_INSTRUCTION, hit)`.
-After an optimization has been deferred in the adaptive instruction,
-that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md
index d4e28c6..a91a7043 100644
--- a/InternalDocs/code_objects.md
+++ b/InternalDocs/code_objects.md
@@ -18,6 +18,11 @@ Code objects are typically produced by the bytecode [compiler](compiler.md),
although they are often written to disk by one process and read back in by another.
The disk version of a code object is serialized using the
[marshal](https://docs.python.org/dev/library/marshal.html) protocol.
+When a `CodeObject` is created, the function `_PyCode_Quicken()` from
+[`Python/specialize.c`](../Python/specialize.c) is called to initialize
+the caches of all adaptive instructions. This is required because the
+on-disk format is a sequence of bytes, and some of the caches need to be
+initialized with 16-bit values.
Code objects are nominally immutable.
Some fields (including `co_code_adaptive` and fields for runtime
diff --git a/InternalDocs/compiler.md b/InternalDocs/compiler.md
index 9e99f34..c257bfd 100644
--- a/InternalDocs/compiler.md
+++ b/InternalDocs/compiler.md
@@ -595,16 +595,6 @@ Objects
* [Exception Handling](exception_handling.md): Describes the exception table
-Specializing Adaptive Interpreter
-=================================
-
-Adding a specializing, adaptive interpreter to CPython will bring significant
-performance improvements. These documents provide more information:
-
-* [PEP 659: Specializing Adaptive Interpreter](https://peps.python.org/pep-0659/).
-* [Adding or extending a family of adaptive instructions](adaptive.md)
-
-
References
==========
diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md
index ab149e4..fa4a54f 100644
--- a/InternalDocs/interpreter.md
+++ b/InternalDocs/interpreter.md
@@ -1,8 +1,4 @@
-The bytecode interpreter
-========================
-
-Overview
---------
+# The bytecode interpreter
This document describes the workings and implementation of the bytecode
interpreter, the part of python that executes compiled Python code. Its
@@ -47,8 +43,7 @@ simply calls [`_PyEval_EvalFrameDefault()`] to execute the frame. However, as pe
`_PyEval_EvalFrameDefault()`.
-Instruction decoding
---------------------
+## Instruction decoding
The first task of the interpreter is to decode the bytecode instructions.
Bytecode is stored as an array of 16-bit code units (`_Py_CODEUNIT`).
@@ -110,8 +105,7 @@ snippet decode a complete instruction:
For various reasons we'll get to later (mostly efficiency, given that `EXTENDED_ARG`
is rare) the actual code is different.
-Jumps
-=====
+## Jumps
Note that when the `switch` statement is reached, `next_instr` (the "instruction offset")
already points to the next instruction.
@@ -120,25 +114,26 @@ Thus, jump instructions can be implemented by manipulating `next_instr`:
- A jump forward (`JUMP_FORWARD`) sets `next_instr += oparg`.
- A jump backward sets `next_instr -= oparg`.
-Inline cache entries
-====================
+## Inline cache entries
Some (specialized or specializable) instructions have an associated "inline cache".
The inline cache consists of one or more two-byte entries included in the bytecode
array as additional words following the `opcode`/`oparg` pair.
The size of the inline cache for a particular instruction is fixed by its `opcode`.
Moreover, the inline cache size for all instructions in a
-[family of specialized/specializable instructions](adaptive.md)
+[family of specialized/specializable instructions](#Specialization)
(for example, `LOAD_ATTR`, `LOAD_ATTR_SLOT`, `LOAD_ATTR_MODULE`) must all be
the same. Cache entries are reserved by the compiler and initialized with zeros.
Although they are represented by code units, cache entries do not conform to the
`opcode` / `oparg` format.
-If an instruction has an inline cache, the layout of its cache is described by
-a `struct` definition in (`pycore_code.h`)[../Include/internal/pycore_code.h].
-This allows us to access the cache by casting `next_instr` to a pointer to this `struct`.
-The size of such a `struct` must be independent of the machine architecture, word size
-and alignment requirements. For a 32-bit field, the `struct` should use `_Py_CODEUNIT field[2]`.
+If an instruction has an inline cache, the layout of its cache is described in
+the instruction's definition in [`Python/bytecodes.c`](../Python/bytecodes.c).
+The structs defined in [`pycore_code.h`](../Include/internal/pycore_code.h)
+allow us to access the cache by casting `next_instr` to a pointer to the relevant
+`struct`. The size of such a `struct` must be independent of the machine
+architecture, word size and alignment requirements. For a 32-bit field, the
+`struct` should use `_Py_CODEUNIT field[2]`.
The instruction implementation is responsible for advancing `next_instr` past the inline cache.
For example, if an instruction's inline cache is four bytes (that is, two code units) in size,
@@ -153,8 +148,7 @@ Serializing non-zero cache entries would present a problem because the serializa
More information about the use of inline caches can be found in
[PEP 659](https://peps.python.org/pep-0659/#ancillary-data).
-The evaluation stack
---------------------
+## The evaluation stack
Most instructions read or write some data in the form of object references (`PyObject *`).
The CPython bytecode interpreter is a stack machine, meaning that its instructions operate
@@ -193,16 +187,14 @@ For example, the following sequence is illegal, because it keeps pushing items o
> Do not confuse the evaluation stack with the call stack, which is used to implement calling
> and returning from functions.
-Error handling
---------------
+## Error handling
When the implementation of an opcode raises an exception, it jumps to the
`exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
The exception is then handled as described in the
[`exception handling documentation`](exception_handling.md#handling-exceptions).
-Python-to-Python calls
-----------------------
+## Python-to-Python calls
The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
the interpreter calls some C function that calls back into the interpreter.
@@ -227,8 +219,7 @@ returns from `_PyEval_EvalFrameDefault()` altogether, to a C caller.
A similar check is performed when an unhandled exception occurs.
-The call stack
---------------
+## The call stack
Up through 3.10, the call stack was implemented as a singly-linked list of
[frame objects](frames.md). This was expensive because each call would require a
@@ -262,8 +253,7 @@ See also the [generators](generators.md) section.
<!--
-All sorts of variables
-----------------------
+## All sorts of variables
The bytecode compiler determines the scope in which each variable name is defined,
and generates instructions accordingly. For example, loading a local variable
@@ -297,8 +287,7 @@ Other topics
-->
-Introducing a new bytecode instruction
---------------------------------------
+## Introducing a new bytecode instruction
It is occasionally necessary to add a new opcode in order to implement
a new feature or change the way that existing features are compiled.
@@ -355,6 +344,169 @@ new bytecode properly. Run `make regen-importlib` for updating the
bytecode of frozen importlib files. You have to run `make` again after this
to recompile the generated C files.
+## Specialization
+
+Bytecode specialization, which was introduced in
+[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by
+rewriting instructions based on runtime information. This is done by replacing
+a generic instruction with a faster version that works for the case that this
+program encounters. Each specializable instruction is responsible for rewriting
+itself, using its [inline caches](#inline-cache-entries) for
+bookkeeping.
+
+When an adaptive instruction executes, it may attempt to specialize itself,
+depending on the argument and the contents of its cache. This is done
+by calling one of the `_Py_Specialize_XXX` functions in
+[`Python/specialize.c`](../Python/specialize.c).
+
+
+The specialized instructions are responsible for checking that the special-case
+assumptions still apply, and de-optimizing back to the generic version if not.
+
+## Families of instructions
+
+A *family* of instructions consists of an adaptive instruction along with the
+specialized instructions that it can be replaced by.
+It has the following fundamental properties:
+
+* It corresponds to a single instruction in the code
+ generated by the bytecode compiler.
+* It has a single adaptive instruction that records an execution count and,
+ at regular intervals, attempts to specialize itself. If not specializing,
+ it executes the base implementation.
+* It has at least one specialized form of the instruction that is tailored
+ for a particular value or set of values at runtime.
+* All members of the family must have the same number of inline cache entries,
+ to ensure correct execution.
+ Individual family members do not need to use all of the entries,
+ but must skip over any unused entries when executing.
+
+The current implementation also requires the following,
+although these are not fundamental and may change:
+
+* All families use one or more inline cache entries,
+ the first entry is always the counter.
+* All instruction names should start with the name of the adaptive
+ instruction.
+* Specialized forms should have names describing their specialization.
+
+## Example family
+
+The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
+already has an adaptive family that serves as a relatively simple example.
+
+The `LOAD_GLOBAL` instruction performs adaptive specialization,
+calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
+
+There are two specialized instructions in the family, `LOAD_GLOBAL_MODULE`
+which is specialized for global variables in the module, and
+`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables.
+
+## Performance analysis
+
+The benefit of a specialization can be assessed with the following formula:
+`Tbase/Tadaptive`.
+
+Where `Tbase` is the mean time to execute the base instruction,
+and `Tadaptive` is the mean time to execute the specialized and adaptive forms.
+
+`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
+
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
+the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimzation
+and the time to execute the base instruction.
+
+The ideal situation is where misses are rare and the specialized
+forms are much faster than the base instruction.
+`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
+In which case we have `Tadaptive ≈ sum(Ti*Ni)`.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
+`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
+we would expect the specialization of `LOAD_GLOBAL` to be profitable.
+
+## Design considerations
+
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
+`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
+low for all specialized instructions and `Nmiss` as low as possible.
+
+Keeping `Nmiss` low means that there should be specializations for almost
+all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
+keeping `Ti` low which means minimizing branches and dependent memory
+accesses (pointer chasing). These two objectives may be in conflict,
+requiring judgement and experimentation to design the family of instructions.
+
+The size of the inline cache should as small as possible,
+without impairing performance, to reduce the number of
+`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
+
+### Gathering data
+
+Before choosing how to specialize an instruction, it is important to gather
+some data. What are the patterns of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter. Since a
+specialization function and adaptive instruction are going to be required,
+instrumentation can most easily be added in the specialization function.
+
+### Choice of specializations
+
+The performance of the specializing adaptive interpreter relies on the
+quality of specialization and keeping the overhead of specialization low.
+
+Specialized instructions must be fast. In order to be fast,
+specialized instructions should be tailored for a particular
+set of values that allows them to:
+
+1. Verify that incoming value is part of that set with low overhead.
+2. Perform the operation quickly.
+
+This requires that the set of values is chosen such that membership can be
+tested quickly and that membership is sufficient to allow the operation to
+performed quickly.
+
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
+dictionaries that have a keys with the expected version.
+
+This can be tested quickly:
+
+* `globals->keys->dk_version == expected_version`
+
+and the operation can be performed quickly:
+
+* `value = entries[cache->index].me_value;`.
+
+Because it is impossible to measure the performance of an instruction without
+also measuring unrelated factors, the assessment of the quality of a
+specialization will require some judgement.
+
+As a general rule, specialized instructions should be much faster than the
+base instruction.
+
+### Implementation of specialized instructions
+
+In general, specialized instructions should be implemented in two parts:
+
+1. A sequence of guards, each of the form
+ `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
+2. The operation, which should ideally have no branches and
+ a minimum number of dependent memory accesses.
+
+In practice, the parts may overlap, as data required for guards
+can be re-used in the operation.
+
+If there are branches in the operation, then consider further specialization
+to eliminate the branches.
+
+### Maintaining stats
+
+Finally, take care that stats are gathered correctly.
+After the last `DEOPT_IF` has passed, a hit should be recorded with
+`STAT_INC(BASE_INSTRUCTION, hit)`.
+After an optimization has been deferred in the adaptive instruction,
+that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
+
+
Additional resources
--------------------
diff --git a/InternalDocs/jit.md b/InternalDocs/jit.md
new file mode 100644
index 0000000..1e9f385
--- /dev/null
+++ b/InternalDocs/jit.md
@@ -0,0 +1,134 @@
+# The JIT
+
+The [adaptive interpreter](interpreter.md) consists of a main loop that
+executes the bytecode instructions generated by the
+[bytecode compiler](compiler.md) and their
+[specializations](interpreter.md#Specialization). Runtime optimization in
+this interpreter can only be done for one instruction at a time. The JIT
+is based on a mechanism to replace an entire sequence of bytecode instructions,
+and this enables optimizations that span multiple instructions.
+
+Historically, the adaptive interpreter was referred to as `tier 1` and
+the JIT as `tier 2`. You will see remnants of this in the code.
+
+## The Optimizer and Executors
+
+The program begins running on the adaptive interpreter, until a `JUMP_BACKWARD`
+instruction determines that it is "hot" because the counter in its
+[inline cache](interpreter.md#inline-cache-entries) indicates that it
+executed more than some threshold number of times (see
+[`backoff_counter_triggers`](../Include/internal/pycore_backoff.h)).
+It then calls the function `_PyOptimizer_Optimize()` in
+[`Python/optimizer.c`](../Python/optimizer.c), passing it the current
+[frame](frames.md) and instruction pointer. `_PyOptimizer_Optimize()`
+constructs an object of type
+[`_PyExecutorObject`](Include/internal/pycore_optimizer.h) which implements
+an optimized version of the instruction trace beginning at this jump.
+
+The optimizer determines where the trace ends, and the executor is set up
+to either return to the adaptive interpreter and resume execution, or
+transfer control to another executor (see `_PyExitData` in
+Include/internal/pycore_optimizer.h).
+
+The executor is stored on the [`code object`](code_objects.md) of the frame,
+in the `co_executors` field which is an array of executors. The start
+instruction of the trace (the `JUMP_BACKWARD`) is replaced by an
+`ENTER_EXECUTOR` instruction whose `oparg` is equal to the index of the
+executor in `co_executors`.
+
+## The micro-op optimizer
+
+The optimizer that `_PyOptimizer_Optimize()` runs is configurable via the
+`_Py_SetTier2Optimizer()` function (this is used in test via
+`_testinternalcapi.set_optimizer()`.)
+
+The micro-op (abbreviated `uop` to approximate `μop`) optimizer is defined in
+[`Python/optimizer.c`](../Python/optimizer.c) as the type `_PyUOpOptimizer_Type`.
+It translates an instruction trace into a sequence of micro-ops by replacing
+each bytecode by an equivalent sequence of micro-ops (see
+`_PyOpcode_macro_expansion` in
+[pycore_opcode_metadata.h](../Include/internal/pycore_opcode_metadata.h)
+which is generated from [`Python/bytecodes.c`](../Python/bytecodes.c)).
+The micro-op sequence is then optimized by
+`_Py_uop_analyze_and_optimize` in
+[`Python/optimizer_analysis.c`](../Python/optimizer_analysis.c)
+and an instance of `_PyUOpExecutor_Type` is created to contain it.
+
+## The JIT interpreter
+
+After a `JUMP_BACKWARD` instruction invokes the uop optimizer to create a uop
+executor, it transfers control to this executor via the `GOTO_TIER_TWO` macro.
+
+CPython implements two executors. Here we describe the JIT interpreter,
+which is the simpler of them and is therefore useful for debugging and analyzing
+the uops generation and optimization stages. To run it, we configure the
+JIT to run on its interpreter (i.e., python is configured with
+[`--enable-experimental-jit=interpreter`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)).
+
+When invoked, the executor jumps to the `tier2_dispatch:` label in
+[`Python/ceval.c`](../Python/ceval.c), where there is a loop that
+executes the micro-ops. The body of this loop is a switch statement over
+the uops IDs, resembling the one used in the adaptive interpreter.
+
+The swtich implementing the uops is in [`Python/executor_cases.c.h`](../Python/executor_cases.c.h),
+which is generated by the build script
+[`Tools/cases_generator/tier2_generator.py`](../Tools/cases_generator/tier2_generator.py)
+from the bytecode definitions in
+[`Python/bytecodes.c`](../Python/bytecodes.c).
+
+When an `_EXIT_TRACE` or `_DEOPT` uop is reached, the uop interpreter exits
+and execution returns to the adaptive interpreter.
+
+## Invalidating Executors
+
+In addition to being stored on the code object, each executor is also
+inserted into a list of all executors, which is stored in the interpreter
+state's `executor_list_head` field. This list is used when it is necessary
+to invalidate executors because values they used in their construction may
+have changed.
+
+## The JIT
+
+When the full jit is enabled (python was configured with
+[`--enable-experimental-jit`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit),
+the uop executor's `jit_code` field is populated with a pointer to a compiled
+C function that implements the executor logic. This function's signature is
+defined by `jit_func` in [`pycore_jit.h`](Include/internal/pycore_jit.h).
+When the executor is invoked by `ENTER_EXECUTOR`, instead of jumping to
+the uop interpreter at `tier2_dispatch`, the executor runs the function
+that `jit_code` points to. This function returns the instruction pointer
+of the next Tier 1 instruction that needs to execute.
+
+The generation of the jitted functions uses the copy-and-patch technique
+which is described in
+[Haoran Xu's article](https://sillycross.github.io/2023/05/12/2023-05-12/).
+At its core are statically generated `stencils` for the implementation
+of the micro ops, which are completed with runtime information while
+the jitted code is constructed for an executor by
+[`_PyJIT_Compile`](../Python/jit.c).
+
+The stencils are generated at build time under the Makefile target `regen-jit`
+by the scripts in [`/Tools/jit`](/Tools/jit). This script reads
+[`Python/executor_cases.c.h`](../Python/executor_cases.c.h) (which is
+generated from [`Python/bytecodes.c`](../Python/bytecodes.c)). For
+each opcode, it constructs a `.c` file that contains a function for
+implementing this opcode, with some runtime information injected.
+This is done by replacing `CASE` by the bytecode definition in the
+template file [`Tools/jit/template.c`](../Tools/jit/template.c).
+
+Each of the `.c` files is compiled by LLVM, to produce an object file
+that contains a function that executes the opcode. These compiled
+functions are used to generate the file
+[`jit_stencils.h`](../jit_stencils.h), which contains the functions
+that the JIT can use to emit code for each of the bytecodes.
+
+For Python maintainers this means that changes to the bytecodes and
+their implementations do not require changes related to the stencils,
+because everything is automatically generated from
+[`Python/bytecodes.c`](../Python/bytecodes.c) at build time.
+
+See Also:
+
+* [Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode](https://arxiv.org/abs/2011.13127)
+
+* [PyCon 2024: Building a JIT compiler for CPython](https://www.youtube.com/watch?v=kMO3Ju0QCDo)