Remove the modulo operations in spsc#652
Conversation
|
I wish we had some benchmarking infra in place. This would help me make sure this actually solves the problem and would have helped detect it in the first place. |
|
Clippy lint fixes are already included in #644 |
|
@823984418 does this fix your performance issues? |
These modulo operations used to be well optimized when N was a power of 2 However, the consumer and producer now use view-types that make N runtime dependant, preventing the compiler from optimizing these modulo operations even when N is always a power of 2. This patch leverages the fact that `head` and `tail` are always kept lower than N to replace the modulo operations with a simple if, which gets optimized pretty well by the compiler and no branch is left. Closes rust-embedded#650
src/spsc.rs
Outdated
| let head = self.rb.head.load(Ordering::Relaxed); | ||
|
|
||
| let i = (head + self.index) % self.rb.n(); | ||
| let i = head + self.index; |
There was a problem hiding this comment.
Could this (theoretically) overflow? If n is larger than half of the usize::MAX and the queue already wrapped around so head is close to the end of the underlying array, and index is large?
(Very unlikely in practice, and I only had this idea because I wondered why QueueInner::len uses wrapping_add/wrapping_sub).
There was a problem hiding this comment.
Good catch.
This would be a bug even with the original implementation if n is not a divisor of usize::MAX.
I don't think this is the same as for len. in len we're doing wrapping operations because we "know" it's going to go negative (thus wrap) since current_head > current_tail. We could in theory remove the wrapping operation by changing the order (but then it's sensible to the same bug, where it could in theory overflow if N is too close to usize::MAX).
I'll take a look at fixing this.
There was a problem hiding this comment.
We could in theory remove the wrapping operation by changing the order (but then it's sensible to the same bug, where it could in theory overflow if N is too close to usize::MAX)
Yes, sorry, my comment about len was a bit short. Of course, as it's written currently, the wrapping_sub/add calls are required. I was thinking about changing that by changing the order. But then:
a) as you noticed, it would have the same potential overflow issue
b) wrapping operations might be more efficient, because they don't require any overflow checks (if they are enabled, e.g. in debug mode or by setting overflow-checks = true in [profile.release])
So I decided to not suggest removing the wrapping operations.
Thanks Jannic for the find: rust-embedded#652 (comment) If N == usize::MAX, there is the possibility of a panic in len() If N >= usize::MAX, then in the iterator code, self.index + self.head could overflow The operations are now slightly more complex and a bit slower, but thanks to compiler optimization they don't introduce branches, only conditional instructions (cmov, csel, it). All added tests fail without the fix
Thanks Jannic for the find: rust-embedded#652 (comment) If N == usize::MAX, there is the possibility of a panic in len() If N >= usize::MAX, then in the iterator code, self.index + self.head could overflow The operations are now slightly more complex and a bit slower, but thanks to compiler optimization they don't introduce branches, only conditional instructions (cmov, csel, it). All added tests fail without the fix
Thanks Jannic for the find: rust-embedded#652 (comment) If N == usize::MAX, there is the possibility of a panic in len() If N >= usize::MAX, then in the iterator code, self.index + self.head could overflow The operations are now slightly more complex and a bit slower, but thanks to compiler optimization they don't introduce branches, only conditional instructions (cmov, csel, it). All added tests fail without the fix
These modulo operations used to be well optimized when N was a power of 2 However, the consumer and producer now use view-types that make N runtime dependant, preventing the compiler from optimizing these modulo operations even when N is always a power of 2.
This patch leverages the fact that
headandtailare always kept lower than N to replace the modulo operations with a simple if, which gets optimized pretty well by the compiler and no branch is left.Closes #650