Where exactly does Python 3.11 get its ~25% Speedup? Part 1: General Optimizations
28 October, 2022
Python 3.11 was released a few days ago and, as usual, it comes with more than a few new features that make it ever more interesting, from exception groups and fine-grained error locations in tracebacks to TOML parsing support in the standard library, and of course the much-awaited speedup delivered as part of the faster CPython project.
CPython 3.11 is, on average, 25% faster than CPython 3.10 according to benchmarks with pyperformance. The changes are part of what Guido van Rossum dubbed the "Shannon Plan". For the 3.11 release, the plan focused on optimizations in two main areas: startup and runtime. The release also includes additional optimizations that are not part of the faster CPython project.
In this first part I will go through the details of the general optimizations (i.e. improvements that are not part of the faster CPython project) that made it into the stable 3.11.0 release.
Table of Contents
- Optimized printf-style % formatting on some format codes
- Faster int division for Python's bignums
- Faster sum of single digit PyLongs
- Faster list.append by streamlining list resize
- Smaller dictionary size when keys are all unicode
- Faster file transfer for large files when using asyncio.DatagramProtocol
- In math lib: Faster comb(n, k) and perm(n, k=None)
- In statistics lib: Faster mean(data), variance(data, xbar=None) and stdev(data, xbar=None)
- Constant time normalization of pure-ASCII strings in unicodedata.normalize()
Optimized printf-style % formatting on some format codes
Using formatted string literals is the fastest way to format strings.
A simple benchmark in Python 3.10:
$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 187 ns +- 8 ns
But using f-strings seems to be ~ 42% faster:
$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- 'f"{k!s} = {v!r}"'
.....................
Mean +- std dev: 131 ns +- 9 ns
The optimization works by converting simple C-style formatting into f-strings. In 3.11.0, only the %s, %r and %a format codes are converted, but there's an ongoing pull request to also support %d, %i, %u, %o, %x, %X, %f, %e, %g, %F, %E and %G.
For example, here's the same benchmark in Python 3.11:
$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 100 ns +- 5 ns
That's about 87% faster! Of course, other runtime optimizations in 3.11 are also at play in that number, not just the %-formatting change.
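To see why the rewrite is safe, here's a quick check of my own (not code from CPython) that the %-formatting expression and its f-string counterpart produce the same result for the converted codes:

k, v = "foo", "bar"
assert "%s = %r" % (k, v) == f"{k!s} = {v!r}"   # both give "foo = 'bar'"
assert "%a" % (v,) == f"{v!a}"                  # %a maps to the !a (ascii) conversion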
Faster int division for Python's bignums
In Python 3.10:
python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 1.18 us +- 0.02 us
In Python 3.11:
python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 995 ns +- 15 ns
That's about 18% faster.
The optimization evolved from an observation by Mark Dickinson that the compiler always generated a 128-bit-by-64-bit divide instruction despite CPython working with 30-bit digit values.
Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the longobject.c source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and the __udivti3 or __udivti4 intrinsic on x86.
- Mark Dickinson (full context)
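To make the quote concrete, here is a toy Python sketch (my illustration, not CPython's actual C code in longobject.c) of schoolbook division of a 30-bit-digit bignum by a single digit. Each loop iteration is exactly the 64-bit-by-32-bit division Mark Dickinson describes:

# Toy sketch: divide a bignum stored as 30-bit digits (least significant
# first) by a single 30-bit digit, the building block of long division.
PyLong_SHIFT = 30
BASE = 1 << PyLong_SHIFT

def to_digits(x):
    # Split a non-negative int into 30-bit digits, least significant first.
    digits = []
    while True:
        digits.append(x & (BASE - 1))
        x >>= PyLong_SHIFT
        if not x:
            return digits

def divrem_by_digit(digits, divisor):
    # Schoolbook division of a digit array by one 30-bit digit.
    quotient = [0] * len(digits)
    rem = 0
    for i in reversed(range(len(digits))):       # most significant digit first
        acc = (rem << PyLong_SHIFT) | digits[i]  # at most ~60 bits: fits in 64
        quotient[i], rem = divmod(acc, divisor)  # the 64-by-32-bit building block
    return quotient, rem

x = 10 ** 1000
q_digits, r = divrem_by_digit(to_digits(x), 10)
q = sum(d << (PyLong_SHIFT * i) for i, d in enumerate(q_digits))
assert (q, r) == divmod(x, 10)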
Faster sum of single digit PyLongs
This issue started from the realization that sum is much faster in Python 2.7 than in Python 3. Unfortunately, that still seems to be the case in 3.11.0 under certain conditions.
Python 2.7:
$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 37.4 us +- 1.1 us
Python 3.10:
$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 52.7 us +- 1.3 us
Python 3.11:
$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 39.0 us +- 1.0 us
The difference between Python 3.10 and 3.11 is that calling sum on single digit PyLongs got faster by inlining the unpacking of single digit PyLongs in the fast-addition branch of the sum function. Doing it this way avoids the call to PyLong_AsLongAndOverflow when unpacking.
What's interesting to note is that Python 3.11 is still significantly slower than Python 2.7 at summing integers in some cases. We'll hopefully see additional improvements with a more efficient implementation of integers in Python.
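For context, "single digit" refers to CPython's internal representation: ints are stored as arrays of 30-bit digits on typical 64-bit builds, so only values that fit in one such digit take the new inlined path, while larger values fall back to PyLong_AsLongAndOverflow as before. A small snippet (mine, purely to illustrate the terminology):

import sys

print(sys.int_info.bits_per_digit)    # 30 on typical 64-bit builds
single_digit = [2**29] * 10_000       # each element fits in one 30-bit digit
multi_digit  = [2**31] * 10_000       # each element needs two digits
print(sum(single_digit), sum(multi_digit))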
Faster list.append by streamlining list resize
There's a noticeable speedup with list.append (about 54% faster) in Python 3.11.
List append - Python 3.10:
$ python -m pyperf timeit -s \
'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 605 us +- 20 us
List append - Python 3.11:
$ python -m pyperf timeit -s \
'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 392 us +- 14 us
There's also some minor improvement with simple list comprehensions:
Python 3.10:
$ python -m pyperf timeit -s \
'' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 553 us +- 19 us
Python 3.11:
$ python -m pyperf timeit -s \
'' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 516 us +- 16 us
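The "resize" in question is CPython's over-allocation strategy: append only reallocates when the list runs out of spare capacity, so most appends are cheap. This little snippet of mine illustrates the general mechanism (not the 3.11 change itself); the exact growth pattern is an implementation detail that differs between versions:

import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:                 # a resize happened on this append
        print(f"len={len(lst):>2}  allocated={size} bytes")
        last_size = size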
Smaller dictionary size when keys are all Unicode
This optimization makes Python more cache efficient when using dictionaries whose keys are all Unicode strings. It reduces the amount of RAM used because the hash no longer needs to be stored alongside each entry: Unicode objects already carry their own cached hash.
For example on a 64bit platform, in Python 3.10:
>>> sys.getsizeof(dict(foo="bar", bar="foo"))
232
While in Python 3.11:
>>> sys.getsizeof(dict(foo="bar", bar="foo"))
184
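The savings only apply when every key is a string; a single non-string key forces the old layout that keeps a per-entry hash. A small comparison of my own (exact sizes vary by build):

import sys

all_str_keys = {"foo": "bar", "bar": "foo"}
mixed_keys   = {1: "bar", "bar": "foo"}     # an int key forces the old layout
print(sys.getsizeof(all_str_keys))          # 184 on a 64-bit CPython 3.11 build
print(sys.getsizeof(mixed_keys))            # larger: per-entry hashes are kept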
Faster file transfer for large files when using asyncio.DatagramProtocol
asyncio.DatagramProtocol provides a base class for implementing datagram (UDP) protocols. With this optimization, transferring large files (e.g. ~60 MiB) using asyncio UDP is over 100 times faster than in Python 3.10. The fix is simple: the size of the write buffer is now calculated once and stored in a property instead of being recomputed over and over, which is what makes asyncio.DatagramProtocol orders of magnitude faster for large transfers.
The author of the PR, msoxzw, provided a test script demonstrating the improvement.
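That script isn't reproduced here, but for orientation, here is a minimal asyncio.DatagramProtocol client sketch of my own. It assumes a UDP echo server is listening on 127.0.0.1:9999 (a hypothetical address); a real transfer like the ~60 MiB test would push many datagrams through the transport's write buffer:

import asyncio

class EchoClientProtocol(asyncio.DatagramProtocol):
    def __init__(self, payload, on_done):
        self.payload = payload
        self.on_done = on_done

    def connection_made(self, transport):
        transport.sendto(self.payload)    # queued in the transport's write buffer

    def datagram_received(self, data, addr):
        self.on_done.set_result(data)

async def main():
    loop = asyncio.get_running_loop()
    on_done = loop.create_future()
    transport, _ = await loop.create_datagram_endpoint(
        lambda: EchoClientProtocol(b"x" * 1024, on_done),
        remote_addr=("127.0.0.1", 9999),  # assumes a UDP echo server is listening
    )
    try:
        data = await asyncio.wait_for(on_done, timeout=1.0)
        print(f"received {len(data)} bytes back")
    finally:
        transport.close()

asyncio.run(main())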
In math lib: Faster comb(n, k) and perm(n, k=None)
Python 3.8 added the comb(n, k) and perm(n, k=None) functions to the math module. Both count the number of ways to choose k items from n items without repetition; comb counts unordered selections while perm counts ordered ones. The optimization is achieved with multiple smaller improvements, like using a divide-and-conquer algorithm to benefit from Karatsuba multiplication of large numbers, and doing the calculations for comb in the C unsigned long long type instead of Python integers when possible (*). Another improvement targets smaller values (for 0 <= k <= n <= 67):
For 0 <= k <= n <= 67, comb(n, k) always fits into a uint64_t. We compute it as comb_odd_part << shift where 2 ** shift is the largest power of two dividing comb(n, k) and comb_odd_part is comb(n, k) >> shift. comb_odd_part can be calculated efficiently via arithmetic modulo 2 ** 64, using three lookups and two uint64_t multiplications, while the necessary shift can be computed via Kummer's theorem: it's the number of carries when adding k to n - k in binary, which in turn is the number of set bits of n ^ k ^ (n - k). *
One more improvement is that the previous popcount-based code for computing the largest power of two dividing math.comb(n, k) (for small n) got replaced with a more direct method based on counting trailing zeros of the factorials involved (*).
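The Kummer's-theorem identity in the quote is easy to verify from plain Python (this is just a sanity check of mine, not the C implementation):

import math

# The power of two dividing comb(n, k) equals popcount(n ^ k ^ (n - k)),
# i.e. the number of carries when adding k and n - k in binary.
for n in range(68):
    for k in range(n + 1):
        c = math.comb(n, k)
        shift = bin(n ^ k ^ (n - k)).count("1")
        comb_odd_part = c >> shift
        assert comb_odd_part % 2 == 1 and comb_odd_part << shift == c
print("identity holds for all 0 <= k <= n <= 67")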
Quick pyperf example, in Python 3.10:
$ python -m pyperf timeit -s \
'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 3.72 us +- 0.07 us
# ---
$ python -m pyperf timeit -s \
'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 11.9 ms +- 0.1 ms
Python 3.11:
$ python -m pyperf timeit -s \
'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 476 ns +- 20 ns
# ---
$ python -m pyperf timeit -s \
'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 2.28 ms +- 0.10 ms
In statistics lib: Faster mean(data), variance(data, xbar=None) and stdev(data, xbar=None)
The optimization improves the mean, variance and stdev functions in the statistics module. If the input is an iterator, it is now consumed in a single pass rather than being converted to a list first; the single-pass algorithm is about twice as fast as the previous two-pass code. *
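As a rough illustration of the one-pass idea (and emphatically not the statistics module's actual implementation, which accumulates exact fractions to avoid rounding error), here is a toy single-pass sample variance of mine that works directly on a generator:

import statistics

def one_pass_variance(iterable):
    # Accumulate count, sum and sum of squares in a single pass.
    n = total = total_sq = 0
    for x in iterable:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    # Sum-of-squares identity; can lose precision with floats, which is why
    # the real implementation prefers exact arithmetic.
    return (total_sq - n * mean * mean) / (n - 1)

data = [x * 0.1 for x in range(10)]
print(one_pass_variance(iter(data)), statistics.variance(data))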
Python 3.10:
# Mean
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 255 us +- 11 us
# Variance
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 77.0 us +- 2.9 us
# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 78.0 us +- 2.2 us
Python 3.11:
# Mean
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 193 us +- 7 us
# Variance
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 56.1 us +- 2.3 us
# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 59.4 us +- 2.6 us
Constant time normalization of pure-ASCII strings in unicodedata.normalize()
This optimization speeds up unicodedata.normalize() by returning early from the Unicode quick-check algorithm when the provided input is a pure-ASCII string. The check is done using PyUnicode_IS_ASCII.
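A rough Python-level analogue of the fast path (the real check happens in C via PyUnicode_IS_ASCII; str.isascii() reads the same flag): pure-ASCII text is already in every normal form, so it can be returned as-is.

import unicodedata

def normalize_with_ascii_fast_path(form, s):
    if s.isascii():                      # O(1) flag check on CPython
        return s                         # ASCII is already NFC/NFD/NFKC/NFKD
    return unicodedata.normalize(form, s)

assert normalize_with_ascii_fast_path("NFC", "python") == "python"
assert normalize_with_ascii_fast_path("NFC", "cafe\u0301") == "café"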
Python 3.10:
$ python -m pyperf timeit -s \
'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 83.3 ns +- 4.3 ns
Python 3.11:
$ python -m pyperf timeit -s \
'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 34.2 ns +- 1.2 ns
Final notes:
- I wrote this post for my own edification about the recent improvements in Python 3.11. If I got something wrong, please let me know by email or on Twitter if you prefer.
- Comments on HackerNews
- In Part 2, I'll go over the optimizations provided by the faster CPython project.