Fix LWT routing: preserve Paxos leader order in TokenAwarePolicy #782
Draft
mykaul wants to merge 1 commit into scylladb:master
TokenAwarePolicy.make_query_plan() was re-sorting replicas by distance (LOCAL_RACK > LOCAL > REMOTE) via yield_in_order(), which could demote the Paxos leader when using RackAwareRoundRobinPolicy if the leader happened to be in a different rack than the client. This causes an extra network hop for every LWT operation, increasing latency.

For the tablet code path, replicas were derived from the child policy's round-robin order, completely losing the natural token-ring order.

Fix: For LWT queries, yield replicas in their natural order (token-ring for non-tablet, tablet.replicas order for tablet), skipping only hosts that are down or IGNORED. Non-replica fallback hosts still use distance-based ordering. Non-LWT queries are completely unchanged.

Fixes: scylladb#780, scylladb#781
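The ordering difference described above can be illustrated with a small, self-contained sketch. It does not use the driver's classes: `Distance`, `Host`, `distance_ordered()`, `lwt_ordered()` and `query_plan()` are hypothetical stand-ins, and the distance bucketing is simplified to a stable sort (the real child policies also rotate hosts round-robin). It only shows why bucketing can demote the first natural replica and how keeping natural order avoids that.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Iterator, List


class Distance(IntEnum):
    # Hypothetical stand-in for the driver's host distances.
    LOCAL_RACK = 0
    LOCAL = 1
    REMOTE = 2
    IGNORED = 3


@dataclass(frozen=True)
class Host:
    # Hypothetical minimal host record, not the driver's Host class.
    name: str
    distance: Distance
    is_up: bool = True


def _usable(hosts: Iterable[Host]) -> List[Host]:
    # Drop hosts that are down or IGNORED, keeping the input order.
    return [h for h in hosts if h.is_up and h.distance != Distance.IGNORED]


def distance_ordered(hosts: Iterable[Host]) -> List[Host]:
    # Simplified distance bucketing: LOCAL_RACK first, then LOCAL, then REMOTE.
    # sorted() is stable, so order within a bucket is preserved.
    return sorted(_usable(hosts), key=lambda h: h.distance)


def lwt_ordered(replicas: Iterable[Host]) -> List[Host]:
    # Natural order is kept, so the first live replica (the Paxos leader) stays first.
    return _usable(replicas)


def query_plan(replicas: List[Host], all_hosts: List[Host], is_lwt: bool) -> Iterator[Host]:
    yield from (lwt_ordered(replicas) if is_lwt else distance_ordered(replicas))
    # Non-replica fallback hosts use distance-based ordering in both cases.
    seen = set(replicas)
    yield from distance_ordered(h for h in all_hosts if h not in seen)


# Natural replica order: the leader sits in a different rack than the client.
leader = Host("leader", Distance.LOCAL)
same_rack = Host("same_rack", Distance.LOCAL_RACK)
down = Host("down", Distance.REMOTE, is_up=False)
other = Host("other", Distance.LOCAL_RACK)
far = Host("far", Distance.IGNORED)

replicas = [leader, same_rack, down]
all_hosts = [other, leader, same_rack, down, far]

lwt_plan = [h.name for h in query_plan(replicas, all_hosts, is_lwt=True)]
regular_plan = [h.name for h in query_plan(replicas, all_hosts, is_lwt=False)]
```

In this sketch the regular plan puts `same_rack` ahead of `leader`, while the LWT plan keeps `leader` first. Both plans skip the down and IGNORED hosts and end with the non-replica fallback host.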
This was referenced Apr 1, 2026
Summary
LWT (Lightweight Transaction) queries rely on Paxos consensus, where the first natural replica in the token ring acts as the Paxos leader. Routing LWT queries directly to the Paxos leader avoids an extra network hop and reduces Paxos round-trips from 4 to 3, significantly improving latency.
TokenAwarePolicy.make_query_plan() currently passes all replicas through yield_in_order(), which re-sorts them by distance (LOCAL_RACK → LOCAL → REMOTE). This is correct for regular queries, but breaks Paxos leader routing for LWT queries: the leader may be demoted if it's in a different rack than the client (with RackAwareRoundRobinPolicy).

Additionally, the tablet code path constructs replicas from the child policy's round-robin order (child.make_query_plan()), which completely loses the natural token-ring order for LWT queries regardless of child policy.

This is modeled after gocql's pickLWTReplicas(), which yields replicas in natural order without distance reordering for LWT queries.

Changes
When query.is_lwt() returns True:
- Tablet path: use tablet.replicas in natural order using get_host_by_host_id() instead of filtering through the child policy's round-robin output
- Non-tablet path: use the natural token-ring order (get_replicas()), skipping only down/IGNORED hosts; do NOT pass through yield_in_order() distance bucketing

Related Issues
Tests
Added LWTTokenAwareRoutingTest class with 11 new tests covering:
- RackAwareRoundRobinPolicy preserves leader even when in different rack
- IGNORED hosts (remote DC)
- DCAwareRoundRobinPolicy preserves behavior
- RackAwareRoundRobinPolicy preserves rack-aware ordering

All 93 tests in tests/unit/test_policies.py pass.

Note
This fix is against master. A follow-up will apply the same logical fix on top of PR #651 (query plan optimization), which has the same bugs in its refactored code structure.