Skip to content

cachedb_redis: add Redis cluster hash tag support to redisHash#3815

Merged
liviuchircu merged 1 commit intoOpenSIPS:masterfrom
NormB:feature/redis-hashtag-support
Apr 22, 2026
Merged

cachedb_redis: add Redis cluster hash tag support to redisHash#3815
liviuchircu merged 1 commit intoOpenSIPS:masterfrom
NormB:feature/redis-hashtag-support

Conversation

@NormB
Copy link
Copy Markdown
Member

@NormB NormB commented Feb 10, 2026

Fix redisHash() to extract hash tags per the Redis cluster spec and use the spec-mandated CRC16(key) mod 16384 bitmask.

Summary

Fix redisHash() in cachedb_redis to support Redis cluster hash tags ({...}) and use the spec-mandated slot modulus.

Details

The redisHash() function in modules/cachedb_redis/cachedb_redis_utils.c has two correctness issues when compared to the https://redis.io/docs/latest/operate/oss_and_stack/reference/cluster-spec/:

  1. Missing hash tag {...} extraction — Redis cluster keys containing {substring} should hash only the substring inside the first {...} pair, enabling co-location of related keys on the same slot. The current code always hashes the full key. This means keys like {user:1000}.session and {user:1000}.profile, which are designed to land on the same slot, get scattered across different slots.

  2. Uses & con->slots_assigned instead of & 16383 — slots_assigned is set to the max end_slot found across nodes (which happens to be 16383 in a fully-allocated cluster), but this is fragile and semantically wrong. The Redis spec mandates CRC16(key) mod 16384 (i.e., & 16383).

Standalone test results confirming the bug and fix:
CRC16 check: crc16("123456789") = 0x31C3 (expected 0x31C3) PASS

KEY OLD NEW STATUS
foo 12182 12182 PASS (regression safe)
bar 5061 5061 PASS (regression safe)
hello 866 866 PASS (regression safe)
{user:1000}.session 10291 1649 PASS (old=BROKEN, new=FIXED)
{user:1000}.profile 11999 1649 INFO (slot changed)
{user:1000}.followers 4151 1649 PASS (old=BROKEN, new=FIXED)
foo{} 5542 5542 PASS (regression safe)
{}key 14961 14961 PASS (regression safe)
foo{{bar}} 4285 4015 INFO (slot changed)
foo{bar}{zap} 4770 5061 INFO (slot changed)

Results: 10 passed, 0 failed

Solution

The redisHash() function now implements hash tag extraction matching the https://redis.io/docs/latest/operate/oss_and_stack/reference/cluster-spec/ from the Redis cluster spec:

  • Find the first { in the key
  • Find the first } after it
  • If both exist and the substring between them is non-empty, hash only that substring
  • Otherwise, hash the full key
  • Always mask with & 16383 per spec

Additionally adds a NULL/empty key guard with LM_ERR and debug logging (LM_DBG) when a hash tag is detected. The function signature is unchanged — no caller modifications needed.

Compatibility

  • Plain keys (no {...}) produce identical slot values — no behavioral change for existing deployments not using hash tags.

  • Keys containing {...} hash tags will now route to different (correct) slots. Any deployment relying on the old (broken) slot assignment for hash-tagged keys would see keys move to new slots on restart. This is the correct behavior per the Redis spec and matches what redis-cli CLUSTER KEYSLOT returns.

  • The slots_assigned field in redis_con is retained — it is still populated during cluster discovery but no longer used for the bitmask. Removing it would be a separate cleanup.

Closing issues

Fix redisHash() to extract hash tags per the Redis cluster spec
and use the spec-mandated CRC16(key) mod 16384 bitmask.
@dondetir
Copy link
Copy Markdown
Contributor

Thanks for flagging this @NormB You're right — the hash slot calculation fix overlaps with #3815. Looking at your PR, it's actually more thorough on that front since it also adds hash tag {…} extraction, which mine doesn't cover.

The main contribution of this PR is the ASK redirect handling — when a Redis cluster is mid-resharding, source nodes return ASK instead of MOVED, and this requires sending an ASKING command to the target node before retrying. This completes the cluster redirect support started in #3639. That part has no overlap with #3815.

Happy to rebase this PR to remove the hash slot fix and keep only the ASK redirect handling, so it can sit cleanly on top of #3815. Would that work, or would you prefer I fold both into one PR?

@NormB
Copy link
Copy Markdown
Member Author

NormB commented Mar 30, 2026

Thanks @dondetir — rebasing to keep only the ASK redirect handling sounds like the right approach. Your parse_redirect_reply() refactor is clean and I've verified it passes a set of integration tests I have for this area, including live slot migrations across a 3-node cluster.

I have two follow-up PRs that build on top of both #3815 and your ASK work:

Both depend on #3815 and #3852.

@dondetir
Copy link
Copy Markdown
Contributor

dondetir commented Mar 31, 2026

Thanks @NormB — really appreciate you taking the time to validate the ASK handling against your integration suite, especially with live slot migrations on a 3-node cluster. That gives me a lot of confidence in how our changes interact.

I saw #3855 and #3856 — impressive work. The dynamic topology refresh and MI observability commands will be a big improvement for production cluster deployments.

I'd be happy to do a thorough review of both PRs if that would be useful — I've already been reading through the diffs and have some notes. Just let me know.

Happy to coordinate on merge ordering to keep the dependency chain clean.

@NormB
Copy link
Copy Markdown
Member Author

NormB commented Mar 31, 2026

I'd be happy to do a thorough review of both PRs if that would be useful — I've already been reading through the diffs and have some notes. Just let me know.

@dondetir Thanks for the offer to perform a review the the two cachedb_redis PRs as it would be greatly appreciated. @liviuchircu knows the module very well, but I think that he and the rest of the team are heads down working on the upcoming 4+ release.

Copy link
Copy Markdown
Member

@liviuchircu liviuchircu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @NormB @dondetir -- merging into 4.0 ASAP. Two big changes here, both which LGTM. First, the "mod 16384" instead of "con->slots_assigned" seems mandatory according to Redis Cluster specs so that is almost a fix, actually. Secondly, the new search for hash tags "{ }" feature and hashing is nice-to-have, and is also backwards-compatible. Great work here!

@liviuchircu liviuchircu merged commit f1fc0aa into OpenSIPS:master Apr 22, 2026
84 of 86 checks passed
liviuchircu pushed a commit that referenced this pull request Apr 22, 2026
Replace the static cluster topology (built once at startup, never
refreshed) with runtime discovery and automatic refresh:

Topology discovery and refresh:
- Probe CLUSTER SHARDS (Redis 7+) with fallback to CLUSTER SLOTS
  (Redis 3+) for backward compatibility
- O(1) slot_table[16384] lookup replaces per-query linked-list scan
- Automatic topology refresh on MOVED redirect, connection failure,
  or query targeting an unmapped slot (rate-limited to 1/sec)
- Dynamic node creation when MOVED points to an unknown endpoint
- Stale node pruning during refresh with safe connection cleanup
- Cap redirect loop at 5 max redirects to prevent worker hang on
  pathological cluster state

Cluster observability via MI commands:
- redis_cluster_info: full topology dump including per-node connection
  status, slot assignments, query/error/moved/ask counters, and
  last activity timestamp
- redis_cluster_refresh: trigger manual topology refresh (bypasses
  rate limit)
- redis_ping_nodes: per-node PING with microsecond latency reporting
- All MI commands support optional group filter parameter

Statistics:
- redis_queries, redis_queries_failed, redis_moved, redis_ask,
  redis_topology_refreshes (module-level stat counters)
- Per-node query, error, moved, ask counters in redis_cluster_info

Hash slot correctness:
- Hash tag {…} extraction per Redis Cluster specification
- CRC16 modulo 16384 replaces bitwise AND with slots_assigned

ASK redirect handling:
- Detect ASK responses alongside existing MOVED handling
- Send ASKING command to target node before retrying original query
- Do not update slot map (ASK is a temporary mid-migration redirect)
- Refactor parse_moved_reply into parse_redirect_reply with prefix
  parameter; inline wrappers for backward compatibility

Connection reliability:
- TCP keepalive via redis_keepalive parameter (default 10s)
- Stack allocation for redis_moved structs (eliminates OOM paths)
- NULL guards on malformed CLUSTER SHARDS/SLOTS reply elements
- Integer overflow protection in slot and port parsing
- NULL guards in MI command handlers for group_name/initial_url

Documentation:
- New section: Redis Cluster Support (topology discovery, automatic
  refresh, MOVED/ASK handling, hash tags)
- MI command reference: redis_cluster_info, redis_cluster_refresh,
  redis_ping_nodes
- Authentication URL format documentation (classic, ACL, no-auth)
- New parameter: redis_keepalive

Test suite (186 tests):
- C unit tests: hash slot calculation (37), MI counter helpers (41)
- Integration: topology startup (12), ASK redirect (16), topology
  refresh (13), MI commands (50), edge cases (16)
- Trap EXIT handlers for safe cluster state restoration
- python3 preflight checks for JSON-dependent tests

Depends on: #3815 (hash tag + modulo fix), #3852 (ASK redirect)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants