Description
Feature or enhancement
Proposal:
When Python is compiled with `--enable-optimizations`, which turns on PGO (profile-guided optimization), the build will run a subset of the unit tests as the "task" to generate profile information. Included in the profile is information such as counts of the taken side of CPU conditional branch instructions. To get the best optimization, your PGO task should match the branch-taken behavior of your real workloads.
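To make the branch-taken point concrete, here is a purely conceptual sketch (in CPython's case PGO optimizes the interpreter's C code; the example below is written in Python only for readability):

```python
# Conceptual sketch: PGO records how often each side of a conditional branch
# is taken while running the profile task, and the compiler then lays out the
# machine code so the commonly taken side becomes the cheap fall-through path.
def clamp(x, lo=0, hi=100):
    if x < lo:       # if the profile task almost never takes this branch,
        return lo    # the optimized layout assumes real workloads won't either
    if x > hi:
        return hi
    return x
```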
Using the unit tests has the advantage that we have good code coverage in terms of executing most branches and code paths. It also has the advantage of being available without any external dependencies. It has the disadvantage that the code executed during unit tests is likely quite atypical of what's executed by real applications. Running `./python -X perf -m test --pgo` under the "perf" tool, I see the following results:
Children | Self | Symbol |
---|---|---|
97.39% | 21.16% | _PyEval_EvalFrameDefault |
34.37% | 2.39% | deduce_unreachable |
24.82% | 1.50% | _PyGC_Collect |
21.11% | 1.32% | gc_collect_region |
20.95% | 0.00% | gc_collect |
19.50% | 0.00% | py::gc_collect:/home/nas/src/cpython/Lib/test/support/__init__.py |
14.98% | 0.54% | PyObject_Vectorcall |
10.95% | 1.48% | dict_traverse |
7.57% | 0.00% | _PyPegen_run_parser_from_string |
7.46% | 0.00% | _PyPegen_run_parser |
7.46% | 0.00% | _PyPegen_parse |
7.33% | 0.24% | py::_make_iterencode.<locals>._iterencode_dict:/home/nas/src/cpython/Lib/json/encoder.py |
7.18% | 7.09% | visit_reachable |
6.43% | 0.02% | expression_rule |
6.13% | 0.00% | PyRun_StringFlags |
6.06% | 1.90% | _PyEval_Vector |
5.99% | 5.91% | visit_decref |
5.80% | 0.00% | builtin_eval |
5.66% | 0.04% | disjunction_rule |
This profile reveals a number of problems. First, a large fraction of time is spent in the cyclic GC. That's because the unit test framework calls `test.support.gc_collect()` before each test case. That function triggers three full GC collections. Other tests also call the GC explicitly. This is not behavior typical of a real program.
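For reference, a simplified sketch of what `test.support.gc_collect()` amounts to (the real helper in `Lib/test/support/__init__.py` has some extra logic, but the cost is the same: multiple full collections per call):

```python
import gc

def gc_collect():
    # Simplified sketch of the test.support helper: run a full collection
    # several times so objects freed by finalizers in one pass can themselves
    # be collected in the next. Called before each test case, this dominates
    # the profile with cyclic-GC work.
    gc.collect()
    gc.collect()
    gc.collect()
```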
Also taking a lot of time are functions related to parsing and compiling Python code. Notice the `builtin_eval()` function, for example. I suspect that's mostly a result of using "doctest". Again, this would not be typical of real programs.
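As a small illustration of why doctest-heavy tests push time into the parser and compiler: every `>>>` example is compiled from its source string when the doctests run (the module and function below are made up for the example):

```python
import doctest

def add(a, b):
    """
    >>> add(2, 3)
    5
    """
    return a + b

# Each ">>>" example is parsed, compiled, and executed from its source string
# at run time, so running many doctests spends a lot of time in the
# parser/compiler paths (e.g. _PyPegen_run_parser_from_string) rather than in
# the code paths a typical application exercises.
doctest.testmod()
```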
I think we should replace the PGO task with a program that more closely represents the behavior of real Python programs. There are at least two potential advantages: it could make the compiled Python binary faster for real programs, and it could make our benchmark results less noisy, since the compiler would be doing a better and more consistent job of generating optimal code.
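As a rough sketch of the direction (the workload mix below is hypothetical, not a concrete proposal), the PGO task could be a small driver that exercises common application patterns instead of the unit tests:

```python
# Hypothetical sketch of a more application-like PGO task: exercise common
# patterns (attribute access, dict/list manipulation, string formatting,
# function calls, JSON round-trips) in a loop, without the test framework's
# explicit GC calls or doctest-driven parsing.
import json

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def moved(self, dx, dy):
        return Point(self.x + dx, self.y + dy)

def workload(n=100_000):
    points = [Point(i, i * 2) for i in range(100)]
    total = 0
    for i in range(n):
        p = points[i % len(points)].moved(1, -1)
        total += p.x + p.y
        record = {"x": p.x, "y": p.y, "label": f"p{i}"}
        total += len(json.loads(json.dumps(record)))
    return total

if __name__ == "__main__":
    workload()
```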
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response