
Fix LWT routing: preserve Paxos leader order in TokenAwarePolicy #782

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:fix/lwt-paxos-leader-routing


mykaul commented Apr 1, 2026

Summary

LWT (Lightweight Transaction) queries rely on Paxos consensus, where the first natural replica in the token ring acts as the Paxos leader. Routing LWT queries directly to the Paxos leader avoids an extra network hop and reduces Paxos round-trips from 4 to 3, significantly improving latency.

TokenAwarePolicy.make_query_plan() currently passes all replicas through yield_in_order(), which re-sorts them by distance (LOCAL_RACK > LOCAL > REMOTE). This is correct for regular queries, but breaks Paxos leader routing for LWT queries: the leader may be demoted if it's in a different rack than the client (with RackAwareRoundRobinPolicy).
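
To make the demotion concrete, here is a minimal, self-contained sketch of distance bucketing. It does not use the driver's real classes: `Distance`, `bucket_by_distance`, and the host names are hypothetical stand-ins that only mimic the re-sorting described above.

```python
from enum import IntEnum


class Distance(IntEnum):
    # Hypothetical stand-in for the driver's host distances; lower is closer.
    LOCAL_RACK = 0
    LOCAL = 1
    REMOTE = 2


def bucket_by_distance(replicas, distance_of):
    """Yield replicas grouped LOCAL_RACK, then LOCAL, then REMOTE.

    Order inside each bucket is preserved; order across buckets is not.
    """
    for wanted in (Distance.LOCAL_RACK, Distance.LOCAL, Distance.REMOTE):
        for host in replicas:
            if distance_of[host] == wanted:
                yield host


# Natural token-ring order: "a" is the first replica, i.e. the Paxos leader.
ring_order = ["a", "b", "c"]
# Assumed topology: the client shares a rack with "b" only.
distance_of = {
    "a": Distance.LOCAL,
    "b": Distance.LOCAL_RACK,
    "c": Distance.LOCAL,
}

bucketed = list(bucket_by_distance(ring_order, distance_of))
print(bucketed)  # ['b', 'a', 'c']
```

In this toy topology the leader `a` drops to second place, so an LWT sent to the first host in the plan lands on a non-leader replica.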

Additionally, the tablet code path constructs replicas from the child policy's round-robin order (child.make_query_plan()), which completely loses the natural token-ring order for LWT queries regardless of child policy.
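
The difference between the two tablet strategies can be sketched as follows. This is a simplified model, not the driver's code: replicas are reduced to plain host IDs, `hosts_by_id` stands in for a lookup such as get_host_by_host_id(), and both function names are hypothetical.

```python
def replicas_via_child_plan(child_plan, tablet_replica_ids):
    """Simplified pre-fix behavior: keep child-plan hosts that are replicas.

    The result inherits the child policy's order, not the tablet's.
    """
    wanted = set(tablet_replica_ids)
    return [host_id for host_id in child_plan if host_id in wanted]


def replicas_in_tablet_order(tablet_replica_ids, hosts_by_id):
    """Simplified LWT behavior: resolve each replica in tablet order."""
    resolved = []
    for host_id in tablet_replica_ids:
        host = hosts_by_id.get(host_id)  # unknown hosts are skipped
        if host is not None:
            resolved.append(host)
    return resolved


tablet_replica_ids = ["h1", "h2", "h3"]  # h1 is assumed to be the Paxos leader
child_plan = ["h3", "h4", "h1", "h2"]    # one possible round-robin rotation
hosts_by_id = {h: h for h in ["h1", "h2", "h3", "h4"]}

print(replicas_via_child_plan(child_plan, tablet_replica_ids))    # ['h3', 'h1', 'h2']
print(replicas_in_tablet_order(tablet_replica_ids, hosts_by_id))  # ['h1', 'h2', 'h3']
```

Filtering the child plan yields whichever replica the rotation happens to reach first, while resolving by host ID keeps `h1` at the front regardless of the child policy's state.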

This is modeled after gocql's pickLWTReplicas() which yields replicas in natural order without distance reordering for LWT queries.

Changes

When query.is_lwt() returns True:

  1. Tablet path: Resolve replicas from tablet.replicas in natural order using get_host_by_host_id() instead of filtering through the child policy's round-robin output
  2. Non-tablet path: Yield replicas in their natural token-ring order (from get_replicas()), skipping only down/IGNORED hosts — do NOT pass through yield_in_order() distance bucketing
  3. Non-replica fallback hosts still use distance-based ordering (unchanged)
  4. Non-LWT queries: Behavior is completely unchanged
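
The LWT rules above can be sketched in a few lines. This is an illustrative model under stated assumptions, not the patch itself: `Distance`, `lwt_query_plan`, the `is_up` map, and the host names are all hypothetical, and the fallback ordering is approximated with a stable sort by distance.

```python
from enum import IntEnum


class Distance(IntEnum):
    # Hypothetical stand-in for the driver's host distances.
    LOCAL_RACK = 0
    LOCAL = 1
    REMOTE = 2
    IGNORED = 3


def lwt_query_plan(replicas, all_hosts, distance_of, is_up):
    """Yield an LWT plan: replicas in natural order, then other hosts."""
    replica_set = set(replicas)

    # Replicas keep their natural order; only down or IGNORED hosts are skipped.
    for host in replicas:
        if is_up[host] and distance_of[host] != Distance.IGNORED:
            yield host

    # Non-replica fallback hosts are still ordered by distance.
    fallback = [
        h for h in all_hosts
        if h not in replica_set
        and is_up[h]
        and distance_of[h] != Distance.IGNORED
    ]
    # sorted() is stable, so hosts at equal distance keep their input order.
    yield from sorted(fallback, key=lambda h: distance_of[h])


replicas = ["a", "b", "c"]  # "a" is assumed to be the Paxos leader
all_hosts = ["d", "a", "e", "b", "c", "f"]
distance_of = {
    "a": Distance.LOCAL, "b": Distance.LOCAL_RACK, "c": Distance.LOCAL_RACK,
    "d": Distance.LOCAL, "e": Distance.LOCAL_RACK, "f": Distance.IGNORED,
}
is_up = {"a": True, "b": False, "c": True, "d": True, "e": True, "f": True}

plan = list(lwt_query_plan(replicas, all_hosts, distance_of, is_up))
print(plan)  # ['a', 'c', 'e', 'd']
```

The leader `a` stays first even though `c` is in the client's rack, the down replica `b` is skipped, and the non-replicas `e` and `d` follow in distance order with the IGNORED host `f` left out.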

Related Issues

Tests

Added LWTTokenAwareRoutingTest class with 11 new tests covering:

  • Basic LWT preserves ring order (Paxos leader first)
  • Non-LWT still uses distance-based ordering
  • LWT with RackAwareRoundRobinPolicy preserves leader even when in different rack
  • LWT skips down hosts
  • LWT skips IGNORED hosts (remote DC)
  • LWT with tablet routing preserves natural order
  • Non-LWT tablet routing preserves round-robin (child policy) ordering
  • LWT with shuffle disabled still preserves order
  • Non-LWT with shuffle enabled randomizes replicas
  • Non-LWT with DCAwareRoundRobinPolicy preserves behavior
  • Non-LWT with RackAwareRoundRobinPolicy preserves rack-aware ordering

All 93 tests in tests/unit/test_policies.py pass.

Note

This fix is against master. A follow-up will apply the same logical fix on top of PR #651 (query plan optimization), which has the same bugs in its refactored code structure.

TokenAwarePolicy.make_query_plan() was re-sorting replicas by distance
(LOCAL_RACK > LOCAL > REMOTE) via yield_in_order(), which could demote
the Paxos leader when using RackAwareRoundRobinPolicy if the leader
happened to be in a different rack than the client. This causes an
extra network hop for every LWT operation, increasing latency.

For the tablet code path, replicas were derived from the child policy's
round-robin order, completely losing the natural token-ring order.

Fix: For LWT queries, yield replicas in their natural order (token-ring
for non-tablet, tablet.replicas order for tablet), skipping only hosts
that are down or IGNORED. Non-replica fallback hosts still use distance-
based ordering. Non-LWT queries are completely unchanged.

Fixes: scylladb#780, scylladb#781


Development

Successfully merging this pull request may close these issues.

  • LWT routing: Tablet path loses natural token-ring order (Paxos leader not prioritized)
  • LWT routing: RackAwareRoundRobinPolicy demotes Paxos leader
