
gh-146393: Optimize float division operations by mutating uniquely-referenced operands in place (JIT only)#146397

Open
eendebakpt wants to merge 6 commits into python:main from eendebakpt:jit_float_truediv

Conversation

Contributor

@eendebakpt eendebakpt commented Mar 24, 2026

We optimize float divisions for the case where one of the operands is a unique reference. This is similar to #146307, but with an added guard against division by zero.

  • We do not add opcodes in tier 1.
  • For tier 2 we specialize both the case where one of the operands is a unique reference and the case where there are none. The no-unique-reference case, _BINARY_TRUEDIV_FLOAT (also used when uniqueness information is missing), brings no performance improvement by itself, but it propagates types better. Since this opcode has guards, the type is propagated even when the input comes from locals.
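The dispatch the bullets describe can be sketched as a small Python model (a hypothetical helper for illustration; the real logic lives in the tier-2 optimizer in C, and the priority of LHS over RHS is an assumption, though the uop names are taken from this PR):

```python
def select_truediv_uop(lhs_is_unique_float: bool, rhs_is_unique_float: bool) -> str:
    """Model of which uop the optimizer would emit for a float / float."""
    if lhs_is_unique_float:
        # Mutate the left operand in place; a guard protects against zero divisor.
        return "_BINARY_OP_TRUEDIV_FLOAT_INPLACE"
    if rhs_is_unique_float:
        # Mutate the right operand in place instead.
        return "_BINARY_OP_TRUEDIV_FLOAT_INPLACE_RIGHT"
    # No unique reference: plain specialization, kept for its type propagation.
    return "_BINARY_TRUEDIV_FLOAT"
```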

Micro-benchmarks (min of 3 runs, 2M iterations)

| Pattern | main (ns/iter) | branch (ns/iter) | Speedup | Notes |
|---|---|---|---|---|
| `(a+b) * c` | 10.8 | 10.9 | -- | baseline (multiply, already optimized) |
| `(a+b) + (c+d)` | 18.0 | 18.1 | -- | baseline (add, already optimized) |
| `a / b` | 20.6 | 10.8 | 1.9x | speculative guards + truediv specialization |
| `(a+b) / c` | 26.4 | 11.0 | 2.4x | inplace LHS, guard inserted for c |
| `(2.0+x) / y` | 25.1 | 10.9 | 2.3x | inplace LHS, guard inserted for y |
| `c / (a+b)` | 26.0 | 11.2 | 2.3x | inplace RHS, guard inserted for c |
| `(a/b) / (c/d)` | 41.3 | 19.1 | 2.2x | speculative guards enable inplace chain |
| `(a/b) + (c/d)` | 29.1 | 19.0 | 1.5x | speculative guards enable inplace add |

All patterns are measured as `total += <expr>` in a tight loop.

Benchmark script
"""Benchmark for float true division tier 2 specialization.

Usage:
    ./python bench_truediv.py
"""
import timeit

N = 2_000_000
INNER = 1000


def bench(label, fn):
    iters = N // INNER
    times = [timeit.timeit(fn, number=iters) for _ in range(3)]
    t = min(times)
    print(f"  {label}: {t/N*1e9:.1f} ns/iter")


def f_chain_mul(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += (a + b) * c
    return t


def f_div(n, a, b):
    t = 0.0
    for i in range(n):
        t += a / b
    return t


def f_chain_div(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += (a + b) / c
    return t


def f_2px_div_y(n, x, y):
    t = 0.0
    for i in range(n):
        t += (2.0 + x) / y
    return t


def f_div_rhs(n, a, b, c):
    t = 0.0
    for i in range(n):
        t += c / (a + b)
    return t


def f_ab_div_cd(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a / b) / (c / d)
    return t


def f_ab_add_cd(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a / b) + (c / d)
    return t


def f_add_chain(n, a, b, c, d):
    t = 0.0
    for i in range(n):
        t += (a + b) + (c + d)
    return t


# Warmup
f_chain_mul(10000, 2.0, 3.0, 4.0)
f_div(10000, 10.0, 3.0)
f_chain_div(10000, 2.0, 3.0, 4.0)
f_2px_div_y(10000, 3.0, 4.0)
f_div_rhs(10000, 2.0, 3.0, 4.0)
f_ab_div_cd(10000, 10.0, 3.0, 4.0, 5.0)
f_ab_add_cd(10000, 10.0, 3.0, 4.0, 5.0)
f_add_chain(10000, 1.0, 2.0, 3.0, 4.0)

print("Float truediv benchmark (min of 3 runs):")
bench("(a+b) * c              (baseline) ", lambda: f_chain_mul(INNER, 2.0, 3.0, 4.0))
bench("(a+b) + (c+d)          (baseline) ", lambda: f_add_chain(INNER, 1.0, 2.0, 3.0, 4.0))
bench("a / b                  (spec div) ", lambda: f_div(INNER, 10.0, 3.0))
bench("(a+b) / c              (inplace L)", lambda: f_chain_div(INNER, 2.0, 3.0, 4.0))
bench("(2.0+x) / y            (inplace L)", lambda: f_2px_div_y(INNER, 3.0, 4.0))
bench("c / (a+b)              (inplace R)", lambda: f_div_rhs(INNER, 2.0, 3.0, 4.0))
bench("(a/b) / (c/d)          (spec div) ", lambda: f_ab_div_cd(INNER, 10.0, 3.0, 4.0, 5.0))
bench("(a/b) + (c/d)          (spec div) ", lambda: f_ab_add_cd(INNER, 10.0, 3.0, 4.0, 5.0))

Analysis

The inplace truediv kicks in when at least one operand is a uniquely-referenced float (e.g. the result of a prior add or multiply). The optimizer emits _BINARY_OP_TRUEDIV_FLOAT_INPLACE or _BINARY_OP_TRUEDIV_FLOAT_INPLACE_RIGHT, saving one PyFloat_FromDouble allocation and deallocation per iteration.

The optimization works well for several cases. For some (e.g. `(a/b) + (c/d)`) the performance gain comes not from an inplace division but from better type propagation, which allows the `+` to be specialized inplace. The `a / b` pattern is also faster because of better type propagation and the `+=` in the test script.

In typical code, intermediate results are often stored in local variables. For these cases it is important to pick up (speculative) type information as soon as possible.
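A minimal sketch of that local-variable pattern (a hypothetical function for illustration; the guard insertion and type propagation happen inside the JIT and are not observable from Python code):

```python
def f(n, a, b, c):
    t = 0.0
    for _ in range(n):
        x = a + b      # intermediate result stored in a local variable
        t += x / c     # a float guard on x lets the division specialize
    return t

# Plain Python semantics are unchanged; only the tier-2 trace differs.
result = f(4, 2.0, 3.0, 4.0)   # 4 iterations of (2.0 + 3.0) / 4.0
```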

…izer

Add inplace float true division ops that the tier 2 optimizer emits
when at least one operand is a known float:

- _BINARY_OP_TRUEDIV_FLOAT_INPLACE (unique LHS)
- _BINARY_OP_TRUEDIV_FLOAT_INPLACE_RIGHT (unique RHS)

The optimizer inserts _GUARD_TOS_FLOAT / _GUARD_NOS_FLOAT for
operands not yet known to be float, enabling specialization in
expressions like `(a + b) / c`.

Also marks the result of all NB_TRUE_DIVIDE operations as unique
float in the abstract interpreter, enabling downstream inplace ops
even for generic `a / b` (the `+=` can reuse the division result).

Speeds up chain division patterns by ~2.3x and simple `total += a/b`
by ~1.5x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eendebakpt and others added 5 commits March 25, 2026 00:01
Operations that always return a new float (true division, float**int,
int**negative_int, mixed int/float arithmetic) now mark their result
as PyJitRef_MakeUnique. This enables downstream operations to mutate
the result in place instead of allocating a new float.

Int results are NOT marked unique because small ints are cached/immortal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
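The int/float distinction above is observable from plain Python (a CPython implementation detail, not a language guarantee): repeated small-int arithmetic yields the same cached object, while float arithmetic yields a fresh object each time.

```python
n = 255
# Small ints are cached (and immortal) in CPython, so both sums
# evaluate to the very same object; mutating it in place would be unsafe.
assert (n + 1) is (n + 1)

h = 0.5
# Float results are fresh allocations, so a uniquely referenced
# result is a safe candidate for in-place reuse.
assert (h + 1.0) is not (h + 1.0)
```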
Only set the result of NB_TRUE_DIVIDE to float when both operands
are known int/float. Types like Fraction and Decimal override
__truediv__ and return non-float results. The unconditional type
propagation caused _POP_TOP_FLOAT to be emitted for Fraction results,
crashing with an assertion failure.

Fixes the segfault in test_math.testRemainder and
test_random.test_binomialvariate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
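The failure mode this commit describes is easy to reproduce at the type level: NB_TRUE_DIVIDE on a Fraction does not return a float, so unconditionally tagging the result as float was wrong.

```python
from fractions import Fraction

r = Fraction(1, 3) / Fraction(1, 6)
# __truediv__ is overridden: the result is a Fraction, not a float,
# so the abstract interpreter must not assume a float result here.
assert type(r) is Fraction
assert r == 2
```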
@eendebakpt eendebakpt marked this pull request as ready for review March 25, 2026 12:02