gh-119786: move adaptive interpreter doc from Python to InternalsDoc (#120137)

author: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> 2024-06-06 10:56:58 (GMT)
committer: GitHub <noreply@github.com> 2024-06-06 10:56:58 (GMT)
commit: eeb8f67f837facb37f092a8b743f4d249515e82f (patch)
tree: 3312f7435dcc39c8efd091ef472b90e8ef0e27ac /InternalDocs
parent: f878d46e5614f08a9302fcb6fc611ef49e9acf2f (diff)
download: cpython-eeb8f67f837facb37f092a8b743f4d249515e82f.zip
cpython-eeb8f67f837facb37f092a8b743f4d249515e82f.tar.gz
cpython-eeb8f67f837facb37f092a8b743f4d249515e82f.tar.bz2
2 files changed, 144 insertions, 0 deletions
diff --git a/InternalDocs/README.md b/InternalDocs/README.md
index e69e27d..a2502fb 100644
--- a/InternalDocs/README.md
+++ b/InternalDocs/README.md
@@ -14,3 +14,4 @@ it is not, please report that through the
 
 [Exception Handling](exception_handling.md)
 
+[Adaptive Instruction Families](adaptive.md)
diff --git a/InternalDocs/adaptive.md b/InternalDocs/adaptive.md
new file mode 100644
index 0000000..0924573
--- /dev/null
+++ b/InternalDocs/adaptive.md
@@ -0,0 +1,143 @@
+# Adding or extending a family of adaptive instructions.
+
+## Families of instructions
+
+The core part of [PEP 659](https://peps.python.org/pep-0659/)
+(specializing adaptive interpreter) is the families of
+instructions that perform the adaptive specialization.
+
+A family of instructions has the following fundamental properties:
+
+* It corresponds to a single instruction in the code
+  generated by the bytecode compiler.
+* It has a single adaptive instruction that records an execution count and,
+  at regular intervals, attempts to specialize itself. If not specializing,
+  it executes the base implementation.
+* It has at least one specialized form of the instruction that is tailored
+  for a particular value or set of values at runtime.
+* All members of the family must have the same number of inline cache entries,
+  to ensure correct execution.
+  Individual family members do not need to use all of the entries,
+  but must skip over any unused entries when executing.
+
+The current implementation also requires the following,
+although these are not fundamental and may change:
+
+* All families use one or more inline cache entries,
+  the first entry is always the counter.
+* All instruction names should start with the name of the adaptive
+  instruction.
+* Specialized forms should have names describing their specialization.
+
+## Example family
+
+The `LOAD_GLOBAL` instruction (in
+[Python/bytecodes.c](https://github.com/python/cpython/blob/main/Python/bytecodes.c))
+already has an adaptive family that serves as a relatively simple example.
+
+The `LOAD_GLOBAL` instruction performs adaptive specialization,
+calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
+
+There are two specialized instructions in the family, `LOAD_GLOBAL_MODULE`
+which is specialized for global variables in the module, and
+`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables.
+
+## Performance analysis
+
+The benefit of a specialization can be assessed with the following formula:
+`Tbase/Tadaptive`.
+
+Where `Tbase` is the mean time to execute the base instruction,
+and `Tadaptive` is the mean time to execute the specialized and adaptive forms.
+
+`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
+
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
+the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimzation
+and the time to execute the base instruction.
+
+The ideal situation is where misses are rare and the specialized
+forms are much faster than the base instruction.
+`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
+In which case we have `Tadaptive ≈ sum(Ti*Ni)`.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
+`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
+we would expect the specialization of `LOAD_GLOBAL` to be profitable.
+
+## Design considerations
+
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
+`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
+low for all specialized instructions and `Nmiss` as low as possible.
+
+Keeping `Nmiss` low means that there should be specializations for almost
+all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
+keeping `Ti` low which means minimizing branches and dependent memory
+accesses (pointer chasing). These two objectives may be in conflict,
+requiring judgement and experimentation to design the family of instructions.
+
+The size of the inline cache should as small as possible,
+without impairing performance, to reduce the number of
+`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
+
+### Gathering data
+
+Before choosing how to specialize an instruction, it is important to gather
+some data. What are the patterns of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter. Since a
+specialization function and adaptive instruction are going to be required,
+instrumentation can most easily be added in the specialization function.
+
+### Choice of specializations
+
+The performance of the specializing adaptive interpreter relies on the
+quality of specialization and keeping the overhead of specialization low.
+
+Specialized instructions must be fast. In order to be fast,
+specialized instructions should be tailored for a particular
+set of values that allows them to:
+1. Verify that incoming value is part of that set with low overhead.
+2. Perform the operation quickly.
+
+This requires that the set of values is chosen such that membership can be
+tested quickly and that membership is sufficient to allow the operation to
+performed quickly.
+
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
+dictionaries that have a keys with the expected version.
+
+This can be tested quickly:
+* `globals->keys->dk_version == expected_version`
+
+and the operation can be performed quickly:
+* `value = entries[cache->index].me_value;`.
+
+Because it is impossible to measure the performance of an instruction without
+also measuring unrelated factors, the assessment of the quality of a
+specialization will require some judgement.
+
+As a general rule, specialized instructions should be much faster than the
+base instruction.
+
+### Implementation of specialized instructions
+
+In general, specialized instructions should be implemented in two parts:
+1. A sequence of guards, each of the form
+  `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
+2. The operation, which should ideally have no branches and
+  a minimum number of dependent memory accesses.
+
+In practice, the parts may overlap, as data required for guards
+can be re-used in the operation.
+
+If there are branches in the operation, then consider further specialization
+to eliminate the branches.
+
+### Maintaining stats
+
+Finally, take care that stats are gather correctly.
+After the last `DEOPT_IF` has passed, a hit should be recorded with
+`STAT_INC(BASE_INSTRUCTION, hit)`.
+After an optimization has been deferred in the adaptive instruction,
+that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
author	Irit Katriel <1055913+iritkatriel@users.noreply.github.com>	2024-06-06 10:56:58 (GMT)
committer	GitHub <noreply@github.com>	2024-06-06 10:56:58 (GMT)
commit	eeb8f67f837facb37f092a8b743f4d249515e82f (patch)
tree	3312f7435dcc39c8efd091ef472b90e8ef0e27ac /InternalDocs
parent	f878d46e5614f08a9302fcb6fc611ef49e9acf2f (diff)
download	cpython-eeb8f67f837facb37f092a8b743f4d249515e82f.zip cpython-eeb8f67f837facb37f092a8b743f4d249515e82f.tar.gz cpython-eeb8f67f837facb37f092a8b743f4d249515e82f.tar.bz2