# Where exactly does Python 3.11 get its ~25% Speedup? Part 1: General Optimizations

28 October, 2022

Python 3.11 was released a few days ago and, as usual, it comes with more than a few new features that make it ever more interesting: from exception groups, fine-grained error locations in tracebacks, and TOML parsing support in the standard library, to the much-awaited speedup delivered as part of the faster CPython project.

CPython 3.11 is on average 25% faster than CPython 3.10, according to benchmarks with pyperformance. The changes are part of what Guido called the "Shannon Plan". For the 3.11 release, the plan focused on optimizations in two main areas: startup and runtime. The release also includes additional optimizations that are not part of the faster CPython project.

In this first part, I will go through the details of the general optimizations (i.e. those outside the faster CPython project) that made it into the stable 3.11.0 release.

## Table of Contents

- Optimized printf-style % formatting on some format codes
- Faster int division for Python's bignums
- Faster sum of single digit PyLongs
- Faster list.append by streamlining list resize
- Smaller dictionary size when keys are all unicode
- Faster file transfer for large files when using asyncio.DatagramProtocol
- In math lib: Faster comb(n, k) and perm(n, k=None)
- In statistics lib: Faster mean(data), variance(data, xbar=None) and stdev(data, xbar=None)
- Constant time normalization of pure-ASCII strings in unicodedata.normalize()

## Optimized printf-style % formatting on some format codes

Using formatted string literals is the fastest way to format strings.

A simple benchmark in __Python 3.10__:

```
$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 187 ns +- 8 ns
```

But using f-strings seems to be ~ 42% faster:

```
$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- 'f"{k!s} = {v!r}"'
.....................
Mean +- std dev: 131 ns +- 9 ns
```

The optimization works by converting simple C-style formatting into f-strings. In 3.11.0, only the `%s`, `%r`, and `%a` format codes are converted, but there's an ongoing pull request to also support `%d`, `%i`, `%u`, `%o`, `%x`, `%X`, `%f`, `%e`, `%g`, `%F`, `%E`, and `%G`.

For example, here's the same benchmark in __Python 3.11__:

```
$ python -m pyperf timeit -s \
'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 100 ns +- 5 ns
```

That's about 87% faster! Of course additional optimizations in 3.11 are at play here, like the faster interpreter startup time.
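The rewrite the compiler performs is observable from Python: the two spellings produce identical strings, so converting one into the other is safe.

```python
# The %s/%r/%a cases of C-style formatting map directly onto f-string
# conversions (!s, !r, !a), which is what makes the rewrite possible.
k, v = "foo", "bar"
old_style = "%s = %r" % (k, v)
new_style = f"{k!s} = {v!r}"
print(old_style)  # foo = 'bar'
```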

## Faster int division for Python's bignums

In __Python 3.10__:

```
python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 1.18 us +- 0.02 us
```

In __Python 3.11__:

```
python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 995 ns +- 15 ns
```

That's about 18% faster.

The optimization evolved from an observation by Mark Dickinson that the compiler always generated a 128-bit-by-64-bit divide instruction despite CPython working with 30-bit digit values.

> Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the `longobject.c` source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and the `__udivti3` or `__udivti4` intrinsic on x86.

- Mark Dickinson (full context)
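To see why that double-digit-by-single-digit division is the building block, here's a pure-Python sketch (our illustration, not CPython's `longobject.c` code) of schoolbook division of a base-2**30 bignum by one digit; each loop iteration performs exactly the 64-bit-by-32-bit division Dickinson describes:

```python
BASE = 1 << 30  # CPython's bignum digit base on 64-bit builds

def to_digits(x):
    # Split a non-negative int into little-endian base-2**30 digits.
    digits = []
    while x:
        x, r = divmod(x, BASE)
        digits.append(r)
    return digits or [0]

def divrem_by_digit(digits, d):
    # Schoolbook division by a single digit: each step divides a
    # (remainder, digit) accumulator of at most 60 bits by a 30-bit
    # value -- the 64-bit-by-32-bit division the quote is about.
    quotient, rem = [0] * len(digits), 0
    for i in reversed(range(len(digits))):
        acc = rem * BASE + digits[i]
        quotient[i], rem = divmod(acc, d)
    return quotient, rem
```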

## Faster `sum` of single digit PyLongs

This issue started with the realization that `sum` is much faster in Python 2.7 than in Python 3. Unfortunately, that still seems to be the case in 3.11.0 under certain conditions.

__Python 2.7:__

```
$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 37.4 us +- 1.1 us
```

__Python 3.10:__

```
$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 52.7 us +- 1.3 us
```

__Python 3.11:__

```
$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 39.0 us +- 1.0 us
```

The difference between Python 3.10 and 3.11 is that the performance of calling `sum` on single digit PyLongs has improved by inlining the unpacking of single digit PyLongs in the fast-addition branch of the `sum` function. Inlining it this way avoids the call to `PyLong_AsLongAndOverflow` when unpacking.
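What counts as a "single digit" here is visible from Python itself:

```python
import sys

# CPython stores big ints as arrays of base-2**30 "digits" on 64-bit
# builds (base 2**15 on some 32-bit builds).
print(sys.int_info.bits_per_digit)

# A "single digit" PyLong is any int whose absolute value fits in one
# digit, so sums over lists of small ints hit the inlined fast path.
one_digit = (1 << sys.int_info.bits_per_digit) - 1  # largest single digit
two_digits = one_digit + 1                          # needs a second digit
```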

What's interesting to note is that Python 3.11 is still significantly slower than Python 2.7 summing integers in some cases. We'll hopefully see additional improvements with a more efficient implementation of integers in Python.

## Faster `list.append` by streamlining list resize

There's a noticeable speedup with `list.append` (about 54% faster) in Python 3.11.

__List append - Python 3.10:__

```
$ python -m pyperf timeit -s \
'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 605 us +- 20 us
```

__List append - Python 3.11:__

```
$ python -m pyperf timeit -s \
'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 392 us +- 14 us
```

There's also some minor improvement with simple list comprehensions:

__Python 3.10:__

```
$ python -m pyperf timeit -s \
'' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 553 us +- 19 us
```

__Python 3.11:__

```
$ python -m pyperf timeit -s \
'' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 516 us +- 16 us
```
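The resize being streamlined is the over-allocation step that makes repeated appends amortized O(1). The capacity jumps can be observed with `sys.getsizeof` (the exact growth pattern is a CPython implementation detail and may differ between versions):

```python
import sys

# Track when list.append actually triggers a resize: the size in bytes
# only changes at the occasional over-allocating grow, not every append.
lst, jumps, last = [], [], sys.getsizeof([])
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        jumps.append((len(lst), size))
        last = size
print(jumps)  # far fewer entries than 64 appends
```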

## Smaller dictionary size when keys are all Unicode

This optimization makes Python more cache efficient when using dictionaries whose keys are all Unicode: the cached hash is dropped from each entry, since Unicode objects already cache their own hash, which reduces the amount of RAM the dictionary uses.

For example on a 64bit platform, in __Python 3.10__:

```
>>> sys.getsizeof(dict(foo="bar", bar="foo"))
232
```

While in __Python 3.11__:

```
>>> sys.getsizeof(dict(foo="bar", bar="foo"))
184
```
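The effect only applies while every key is a string; a single non-string key forces the generic layout that stores a hash per entry. A quick comparison (exact byte counts vary by version and platform):

```python
import sys

# All-string keys: 3.11 can drop the cached hash from each entry
# because str objects already cache their own hash.
all_str = {"foo": "bar", "bar": "foo"}
# A non-string key forces the generic table layout that stores hashes.
mixed = {"foo": "bar", 1: "foo"}
print(sys.getsizeof(all_str), sys.getsizeof(mixed))
```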

## Faster file transfer for large files when using `asyncio.DatagramProtocol`

`asyncio.DatagramProtocol` provides a base class for implementing datagram (UDP) protocols. With this optimization, transferring large files (e.g. ~60 MiB) using asyncio UDP is over 100 times faster than in Python 3.10. This is done by calculating the size of the buffer once and storing it in a property instead of recomputing it for every send, which makes `asyncio.DatagramProtocol` orders of magnitude faster when transferring large files over UDP.

The author of the PR, `msoxzw`, provided a test script exercising this path.
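For context, here's a minimal sketch of the API involved (our illustration, not the PR author's script): a UDP echo server built on `asyncio.DatagramProtocol`. The buffer-size caching matters because `sendto` fires for every datagram of a large transfer.

```python
import asyncio

class EchoProtocol(asyncio.DatagramProtocol):
    def connection_made(self, transport):
        self.transport = transport

    def datagram_received(self, data, addr):
        # Echo each datagram straight back to the sender; a large file
        # transfer drives many thousands of these round trips.
        self.transport.sendto(data, addr)

async def serve(host="127.0.0.1", port=9999):
    loop = asyncio.get_running_loop()
    transport, _ = await loop.create_datagram_endpoint(
        EchoProtocol, local_addr=(host, port))
    try:
        await asyncio.sleep(3600)  # keep serving
    finally:
        transport.close()
```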

## In `math` lib: Faster `comb(n, k)` and `perm(n, k=None)`

Python 3.8 added the `comb(n, k)` and `perm(n, k=None)` functions to the `math` library. Both are used to get the number of ways to choose `k` items from `n` items without repetition; `comb` returns it without order while `perm` returns it with order. The optimization is achieved with multiple smaller improvements, like using a divide-and-conquer algorithm to benefit from Karatsuba multiplication of large numbers, and doing the calculations for `comb` in the C `unsigned long long` type instead of Python integers when possible (*). Another improvement targets smaller `k` values (`0 <= k <= n <= 67`):

> For `0 <= k <= n <= 67`, `comb(n, k)` always fits into a `uint64_t`. We compute it as `comb_odd_part << shift` where `2 ** shift` is the largest power of two dividing `comb(n, k)` and `comb_odd_part` is `comb(n, k) >> shift`. `comb_odd_part` can be calculated efficiently via arithmetic modulo `2 ** 64`, using three lookups and two `uint64_t` multiplications, while the necessary shift can be computed via Kummer's theorem: it's the number of carries when adding `k` to `n - k` in binary, which in turn is the number of set bits of `n ^ k ^ (n - k)`. *

One more improvement is that the previous popcount-based code for computing the largest power of two dividing `math.comb(n, k)` (for small `n`) was replaced with a more direct method based on counting trailing zeros of the factorials involved (*).

Quick pyperf example, in __Python 3.10__:

```
$ python -m pyperf timeit -s \
'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 3.72 us +- 0.07 us
# ---
$ python -m pyperf timeit -s \
'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 11.9 ms +- 0.1 ms
```

__Python 3.11:__

```
$ python -m pyperf timeit -s \
'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 476 ns +- 20 ns
# ---
$ python -m pyperf timeit -s \
'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 2.28 ms +- 0.10 ms
```

## In `statistics` lib: Faster `mean(data)`, `variance(data, xbar=None)` and `stdev(data, xbar=None)`

The optimization improves the `mean`, `variance`, and `stdev` functions in the `statistics` module. If the input is an iterator, it is consumed in a single pass rather than being converted to a list first. The single-pass algorithm is twice as fast as the previous two-pass code. *

__Python 3.10:__

```
# Mean
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 255 us +- 11 us
# Variance
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 77.0 us +- 2.9 us
# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 78.0 us +- 2.2 us
```

__Python 3.11:__

```
# Mean
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 193 us +- 7 us
# Variance
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 56.1 us +- 2.3 us
# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 59.4 us +- 2.6 us
```

## Constant time normalization of pure-ASCII strings in `unicodedata.normalize()`

This optimization speeds up `unicodedata.normalize()` by returning early from the Unicode quick-check algorithm when the input is a pure-ASCII string, since ASCII text is already normalized in every form. The check is done using `PyUnicode_IS_ASCII`.
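The fast path is sound because every normal form leaves pure-ASCII text unchanged:

```python
import unicodedata

s = "python"
assert s.isascii()  # pure ASCII, so the quick check can answer at once
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.is_normalized(form, s)
    assert unicodedata.normalize(form, s) == s
```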

__Python 3.10:__

```
$ python -m pyperf timeit -s \
'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 83.3 ns +- 4.3 ns
```

__Python 3.11:__

```
$ python -m pyperf timeit -s \
'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 34.2 ns +- 1.2 ns
```

Final notes:

- I wrote this post for my own edification about the recent improvements in Python 3.11. If I got something wrong, please let me know by email or on Twitter if you prefer.
- Comments on HackerNews
- In Part 2, I'll go over the optimizations provided by the faster CPython project.