Where exactly does Python 3.11 get its ~25% Speedup?
October 28, 2022

Python 3.11 was released a few days ago and, as usual, it comes with more than a few new features that make it ever more interesting, from exception groups and fine-grained error locations in tracebacks to TOML parsing support in the standard library, and of course the much-awaited speedup delivered as part of the faster CPython project. CPython 3.11 is 25% faster than CPython 3.10 on average, according to benchmarks with pyperformance.

The changes are part of what Guido called the "Shannon Plan". For the 3.11 release, the plan focused on optimizations in two main areas: startup and runtime. The release also includes additional optimizations that are not part of the faster CPython project. Below I go through the details of those general optimizations (i.e. the non-faster-CPython-project improvements) that made it into the stable 3.11.0 release.
Optimized printf-style % formatting on some format codes

Using formatted string literals (f-strings) is the fastest way to format strings. A simple benchmark in Python 3.10:

$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 187 ns +- 8 ns

But using f-strings seems to be ~42% faster:

$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- 'f"{k!s} = {v!r}"'
.....................
Mean +- std dev: 131 ns +- 9 ns

This optimization works by converting simple C-style formatting into f-strings. In 3.11.0, only the %s, %r and %a format codes are converted, but there is an ongoing pull request to also support %d, %i, %u, %o, %x, %X, %f, %e, %g, %F, %E and %G. Here is the same %-formatting benchmark in Python 3.11:

$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 100 ns +- 5 ns

That's about 87% faster! Of course, other 3.11 optimizations are also at play here, like the faster interpreter startup time.
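To see why the conversion is safe, note that the converted codes map directly onto f-string conversion specifiers. This quick equivalence check is my own illustration, not the CPython patch:

```python
# %s, %r and %a correspond to the !s, !r and !a f-string conversions,
# so rewriting one form into the other preserves the output exactly.
k, v = "foo", "bar"
assert "%s = %r" % (k, v) == f"{k!s} = {v!r}"
assert "%a" % ("café",) == f"{'café'!a}"  # ascii() repr on both sides
print(f"{k!s} = {v!r}")  # foo = 'bar'
```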
Faster int division for Python's bignums

In Python 3.10:

python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 1.18 us +- 0.02 us

In Python 3.11:

python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 995 ns +- 15 ns

That's about 18% faster. The optimization grew out of an observation by Mark Dickinson that the compiler always generated a 128:64 divide instruction even though CPython only works with 30-bit digit values:

Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the longobject.c source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and the __udivti3 or __udivti4 intrinsic on x86.

- Mark Dickinson (full context)
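The 30-bit digits mentioned in the quote can be inspected from Python itself. This little probe is my own illustration of how a number like 10**1000 is stored:

```python
import sys

# CPython stores arbitrary-precision ints as arrays of "digits";
# on typical 64-bit builds each digit holds 30 bits.
bits = sys.int_info.bits_per_digit
x = 10**1000
print(bits)              # usually 30
print(x.bit_length())    # 3322 bits in total...
print(-(-x.bit_length() // bits))  # ...spread over this many digits (111 when bits == 30)
```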
Faster sum of single digit PyLongs

This issue started with the realization that sum is much faster in Python 2.7 than in Python 3. Unfortunately, that still seems to be the case in 3.11.0 under certain conditions.

Python 2.7:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 37.4 us +- 1.1 us

Python 3.10:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 52.7 us +- 1.3 us

Python 3.11:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 39.0 us +- 1.0 us

The difference between Python 3.10 and 3.11 is that the performance of calling sum on single digit PyLongs has improved by inlining the unpacking of single digit PyLongs in the fast-addition branch of the sum function. Doing it this way avoids the call to PyLong_AsLongAndOverflow when unpacking. Interestingly, Python 3.11 is still significantly slower than Python 2.7 at summing integers in some cases. We'll hopefully see further gains from a more efficient implementation of integers in Python.
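For context, a "single digit" PyLong is any int that fits in one internal digit (|x| < 2**30 on typical builds). This unofficial sketch of mine contrasts values inside and outside the fast path:

```python
import sys
import timeit

# Ints below 2**bits_per_digit fit in one internal digit and can hit
# the inlined fast-addition branch; larger ints need multiple digits.
limit = 2 ** sys.int_info.bits_per_digit
small = [limit - 1] * 10_000   # single-digit values
large = [limit] * 10_000       # two-digit values
print(timeit.timeit(lambda: sum(small), number=100))
print(timeit.timeit(lambda: sum(large), number=100))
```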
Faster list.append by streamlining list resize

There's a noticeable speedup in list.append (about 54% faster) with Python 3.11.

List append, Python 3.10:

$ python -m pyperf timeit -s \
'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 605 us +- 20 us

List append, Python 3.11:

$ python -m pyperf timeit -s \
'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 392 us +- 14 us

There's also some minor improvement for simple list comprehensions.

Python 3.10:

$ python -m pyperf timeit -s \
'' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 553 us +- 19 us

Python 3.11:

$ python -m pyperf timeit -s \
'' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 516 us +- 16 us
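list.append is fast partly because resizes are rare: the list over-allocates as it grows. This small probe of mine (not from the patch) makes the resize steps visible:

```python
import sys

# Track when the allocated size of a list actually changes as we append:
# most appends reuse existing capacity, so the C-level resize runs
# only occasionally.
x = []
last = sys.getsizeof(x)
steps = []
for i in range(32):
    x.append(i)
    size = sys.getsizeof(x)
    if size != last:
        steps.append((len(x), size))
        last = size
print(steps)  # the exact growth pattern is an implementation detail
```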
Smaller dictionary size when keys are all Unicode

This optimization makes Python more cache efficient when using dictionaries whose keys are all Unicode strings. The amount of RAM used is reduced because the per-entry cached hash is dropped for those keys; Unicode objects already store their own hash. For example, on a 64-bit platform, in Python 3.10:

>>> sys.getsizeof(dict(foo="bar", bar="foo"))
232

While in Python 3.11:

>>> sys.getsizeof(dict(foo="bar", bar="foo"))
184
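The effect can be observed with sys.getsizeof: in this quick check of my own, adding a single non-str key pushes the dict back to the general layout that stores a hash per entry (on 3.10 the two sizes come out equal):

```python
import sys

# All-str-key dicts use the compact layout on 3.11+ (no per-entry hash);
# a single non-str key forces the general layout, which is larger.
all_unicode = {"foo": "bar", "bar": "foo"}
mixed = {"foo": "bar", 1: "foo"}
print(sys.getsizeof(all_unicode))  # 184 on a 64-bit Python 3.11
print(sys.getsizeof(mixed))        # larger on 3.11, e.g. 232
```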
Faster file transfer for large files when using asyncio.DatagramProtocol

asyncio.DatagramProtocol provides a base class for implementing datagram (UDP) protocols. With this optimization, transferring large files (e.g. ~60 MiB) over asyncio UDP is over 100 times faster than in Python 3.10. This is done by calculating the size of the buffer once and storing it in a property instead of recomputing it on every send, which makes asyncio.DatagramProtocol orders of magnitude faster for large transfers. msoxzw provided a test script with the PR.
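For readers who haven't used the API, here is a minimal asyncio.DatagramProtocol echo round-trip of my own (the class and variable names are mine, and the server binds an ephemeral localhost port):

```python
import asyncio

class EchoServer(asyncio.DatagramProtocol):
    def connection_made(self, transport):
        self.transport = transport

    def datagram_received(self, data, addr):
        self.transport.sendto(data, addr)  # echo the datagram back

class EchoClient(asyncio.DatagramProtocol):
    def __init__(self, message, done):
        self.message, self.done = message, done

    def connection_made(self, transport):
        transport.sendto(self.message)     # remote_addr was set at creation

    def datagram_received(self, data, addr):
        self.done.set_result(data)

async def main():
    loop = asyncio.get_running_loop()
    server, _ = await loop.create_datagram_endpoint(
        EchoServer, local_addr=("127.0.0.1", 0))
    host, port = server.get_extra_info("sockname")
    done = loop.create_future()
    client, _ = await loop.create_datagram_endpoint(
        lambda: EchoClient(b"hello", done), remote_addr=(host, port))
    data = await asyncio.wait_for(done, 5)
    client.close()
    server.close()
    return data

result = asyncio.run(main())
print(result)  # b'hello'
```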
In the math lib: Faster comb(n, k) and perm(n, k=None)

This optimization speeds up the comb(n, k) and perm(n, k=None) functions in the math library. Both count the number of ways to choose k items from n items without repetition: comb ignores order, while perm takes it into account. The speedup comes from multiple smaller improvements, like using a divide-and-conquer algorithm to benefit from Karatsuba multiplication of large numbers, and doing the calculations for comb in the C unsigned long long type instead of Python integers when possible. Another improvement targets small inputs (0 <= k <= n <= 67):

For 0 <= k <= n <= 67, comb(n, k) always fits into a uint64_t. We compute it as comb_odd_part << shift where 2 ** shift is the largest power of two dividing comb(n, k) and comb_odd_part is comb(n, k) >> shift. comb_odd_part can be calculated efficiently via arithmetic modulo 2 ** 64, using three lookups and two uint64_t multiplications, while the necessary shift can be computed via Kummer's theorem: it's the number of carries when adding k to n - k in binary, which in turn is the number of set bits of n ^ k ^ (n - k).

One more improvement is that the previous popcount-based code for computing the largest power of two dividing math.comb(n, k) (for small n) got replaced with a more direct method based on counting trailing zeros of the factorials involved.

A quick pyperf example, in Python 3.10:

$ python -m pyperf timeit -s \
'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 3.72 us +- 0.07 us

# ---

$ python -m pyperf timeit -s \
'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 11.9 ms +- 0.1 ms

Python 3.11:

$ python -m pyperf timeit -s \
'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 476 ns +- 20 ns

# ---

$ python -m pyperf timeit -s \
'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 2.28 ms +- 0.10 ms
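The shift formula from the quote is easy to check from Python. This verification sketch is mine, not CPython's code:

```python
import math

# Kummer's theorem for p = 2: the exponent of 2 in comb(n, k) equals the
# number of carries when adding k and n - k in binary, which is the
# number of set bits of n ^ k ^ (n - k).
def v2(x):
    """Exponent of the largest power of two dividing x (x > 0)."""
    return (x & -x).bit_length() - 1

for n in range(1, 68):
    for k in range(n + 1):
        shift = bin(n ^ k ^ (n - k)).count("1")
        assert v2(math.comb(n, k)) == shift
print("shift formula verified for all 0 <= k <= n <= 67")
```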
In the statistics lib: Faster mean(data), variance(data, xbar=None) and stdev(data, xbar=None)

This optimization speeds up the mean, variance, and stdev functions in the statistics module. If the input is an iterator, it is now consumed in a single pass rather than first being converted to a list. The single-pass algorithm is about twice as fast as the previous two-pass code.

Python 3.10:

# Mean
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 255 us +- 11 us

# Variance
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 77.0 us +- 2.9 us

# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 78.0 us +- 2.2 us

Python 3.11:

# Mean
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 193 us +- 7 us

# Variance
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 56.1 us +- 2.3 us

# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 59.4 us +- 2.6 us
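Since the functions now make a single pass, a one-shot iterator can be fed in directly. The example below is my own; pstdev is used alongside mean just to get round numbers:

```python
import statistics

# Generators and other one-shot iterators are consumed in a single pass,
# so there is no need to materialize them into a list first.
nums = [2, 4, 4, 4, 5, 5, 7, 9]
assert statistics.mean(iter(nums)) == 5
assert statistics.pstdev(iter(nums)) == 2.0
assert statistics.variance(x * 1.0 for x in nums) > 0
```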
Constant-time normalization of pure-ASCII strings in unicodedata.normalize()

This optimization speeds up unicodedata.normalize() by returning early from the Unicode quick-check algorithm when the provided input is a pure-ASCII string. The check is done using PyUnicode_IS_ASCII.

Python 3.10:

$ python -m pyperf timeit -s \
'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 83.3 ns +- 4.3 ns

Python 3.11:

$ python -m pyperf timeit -s \
'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 34.2 ns +- 1.2 ns
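The early return is valid because a pure-ASCII string is already normalized under every normalization form; this quick demonstration is mine:

```python
import unicodedata

# ASCII-only input is already in NFC/NFD/NFKC/NFKD, so normalize()
# can return it unchanged without running the full algorithm.
s = "python"
assert all(unicodedata.normalize(form, s) == s
           for form in ("NFC", "NFD", "NFKC", "NFKD"))

# Non-ASCII text still goes through normalization:
decomposed = "e\u0301"  # 'e' followed by a combining acute accent
assert unicodedata.normalize("NFC", decomposed) == "\u00e9"  # 'é'
print("ok")
```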