feat: optional ClickHouse dual-sink for analytics #58

@pthmas

Summary

Add optional ClickHouse support as a parallel write sink alongside PostgreSQL. PG continues to handle point lookups (get block, get tx by hash); ClickHouse handles analytical queries (aggregations, time-range scans, charts, top-N rankings).

Motivation

Atlas's analytics endpoints (daily tx charts, gas price history, token holder rankings) all hit PostgreSQL today. As chain history grows into hundreds of millions of rows, these queries will degrade — ClickHouse is purpose-built for this workload and can be orders of magnitude faster for full-column scans and aggregations.

Design

Dual-Sink Writer

  1. Add a ClickHouseSink that writes blocks, txs, logs, and receipts using the clickhouse crate (RowBinary format + LZ4 compression)
  2. Wrap PG + CH sinks in a SinkSet abstraction that fans out writes via tokio::try_join! (parallel writes, both must succeed)
  3. CH is optional — enabled via ENABLE_CLICKHOUSE=true + CLICKHOUSE_URL
  4. CH sink should chunk large batches (e.g., 10K rows per INSERT) to bound memory
  5. Use ReplacingMergeTree engine so re-inserts after crashes are safe (duplicates merged away)
  6. CH sink gets its own retry logic (3 attempts, exponential backoff)
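
The chunking (item 4) and retry (item 6) behavior could look roughly like the following sketch. Names like `retry_with_backoff` and `insert_chunked` are illustrative, not a final API, and the real sink would be async (fanning out to both sinks with `tokio::try_join!`); this synchronous version only shows the control flow.

```rust
use std::thread::sleep;
use std::time::Duration;

const MAX_ATTEMPTS: u32 = 3;
const CHUNK_SIZE: usize = 10_000; // would come from CH_INSERT_CHUNK_SIZE

/// Retry a fallible operation up to MAX_ATTEMPTS times with exponential backoff.
fn retry_with_backoff<T, E>(mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut delay = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= MAX_ATTEMPTS => return Err(e),
            Err(_) => {
                sleep(delay);
                delay *= 2; // 100ms, 200ms, ...
                attempt += 1;
            }
        }
    }
}

/// Insert rows in bounded chunks so one huge batch never becomes one huge INSERT;
/// each chunk gets its own retry budget.
fn insert_chunked<R>(
    rows: &[R],
    mut insert: impl FnMut(&[R]) -> Result<(), String>,
) -> Result<(), String> {
    for chunk in rows.chunks(CHUNK_SIZE) {
        retry_with_backoff(|| insert(chunk))?;
    }
    Ok(())
}
```

Because ReplacingMergeTree deduplicates on the sort key (item 5), a chunk that succeeded before a crash can safely be re-inserted on restart.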

ClickHouse Schema

  • Mirror the core PG tables: blocks, transactions, event_logs, receipts
  • Partition by month: toYYYYMM(timestamp) for efficient time-range pruning
  • Bloom filter indexes on event_logs(topic0, address) for fast event filtering
  • Order keys optimized for analytical access patterns (e.g., logs ordered by (address, topic0, block_number))
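
Putting the points above together, the `event_logs` DDL might look like this sketch (column names and types are assumptions, not the final schema):

```sql
CREATE TABLE event_logs
(
    block_number UInt64,
    log_index    UInt32,
    address      FixedString(20),
    topic0       FixedString(32),
    data         String,
    timestamp    DateTime,
    INDEX idx_topic0  topic0  TYPE bloom_filter GRANULARITY 4,
    INDEX idx_address address TYPE bloom_filter GRANULARITY 4
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (address, topic0, block_number, log_index);
```

Including `log_index` in the sort key keeps one row per log while still letting ReplacingMergeTree collapse exact duplicates from re-inserts.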

Query Routing

  • Analytics API endpoints (charts, top holders, aggregations) route to ClickHouse when available
  • Point lookups (get block by number, get tx by hash) stay on PostgreSQL
  • Could expose an ?engine=clickhouse query param for explicit routing
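
The routing decision is small enough to sketch; `QueryKind`, `Engine`, and `route` are hypothetical names, with `forced` modeling the optional `?engine=` override:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Engine { Postgres, ClickHouse }

#[derive(Clone, Copy)]
enum QueryKind { PointLookup, Analytics }

/// Analytics goes to ClickHouse when it is enabled and reachable; point
/// lookups (and everything else whenever CH is unavailable) stay on PG.
fn route(kind: QueryKind, ch_available: bool, forced: Option<Engine>) -> Engine {
    if let Some(engine) = forced {
        // Honor an explicit override only if that engine can actually serve.
        if engine == Engine::Postgres || ch_available {
            return engine;
        }
    }
    match kind {
        QueryKind::Analytics if ch_available => Engine::ClickHouse,
        _ => Engine::Postgres,
    }
}
```

Falling back to PG when CH is down means enabling the dual-sink never makes previously working endpoints fail.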

PG → CH Backfill

  • One-time migration tool that reads historical data from PG and bulk-loads into CH
  • Allows adding ClickHouse to an existing Atlas deployment without re-indexing from the chain
  • Should run as a background task with configurable concurrency and throttling
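
One way the backfill could bound its work: split history into fixed-size block windows, hand windows to a worker pool, and sleep between windows for throttling. The `windows` helper below is hypothetical and only shows the partitioning step:

```rust
/// Split the inclusive block range [from, to] into windows of at most `size`
/// blocks each, returned as inclusive (start, end) pairs. Each window would be
/// read from PG and bulk-inserted into CH by one worker.
fn windows(from: u64, to: u64, size: u64) -> Vec<(u64, u64)> {
    assert!(size > 0 && from <= to);
    let mut out = Vec::new();
    let mut start = from;
    while start <= to {
        let end = (start + size - 1).min(to);
        out.push((start, end));
        start = end + 1;
    }
    out
}
```

Because ReplacingMergeTree absorbs duplicates, a crashed backfill can simply restart from its last checkpointed window without cleanup.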

Configuration

Var                   Default  Description
ENABLE_CLICKHOUSE     false    Enable ClickHouse dual-sink
CLICKHOUSE_URL        (none)   ClickHouse HTTP endpoint (required if enabled)
CLICKHOUSE_DATABASE   atlas    Database name
CLICKHOUSE_USER       default  Auth user
CLICKHOUSE_PASSWORD   (none)   Auth password
CH_INSERT_CHUNK_SIZE  10000    Rows per INSERT batch
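
A sketch of how these variables could be validated at startup; `ChConfig` and `ch_config` are hypothetical names, and parsing from a map rather than `std::env` directly keeps the sketch testable:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct ChConfig {
    url: String,
    database: String,
    user: String,
    password: String,
    chunk_size: usize,
}

/// Returns Ok(None) when ClickHouse is disabled, and an error when it is
/// enabled but CLICKHOUSE_URL is missing or a value fails to parse.
fn ch_config(env: &HashMap<String, String>) -> Result<Option<ChConfig>, String> {
    if env.get("ENABLE_CLICKHOUSE").map(String::as_str) != Some("true") {
        return Ok(None);
    }
    let url = env
        .get("CLICKHOUSE_URL")
        .cloned()
        .ok_or_else(|| "CLICKHOUSE_URL is required when ENABLE_CLICKHOUSE=true".to_string())?;
    let chunk_size = env
        .get("CH_INSERT_CHUNK_SIZE")
        .map(|s| s.parse::<usize>().map_err(|_| "invalid CH_INSERT_CHUNK_SIZE".to_string()))
        .transpose()?
        .unwrap_or(10_000);
    Ok(Some(ChConfig {
        url,
        database: env.get("CLICKHOUSE_DATABASE").cloned().unwrap_or_else(|| "atlas".into()),
        user: env.get("CLICKHOUSE_USER").cloned().unwrap_or_else(|| "default".into()),
        password: env.get("CLICKHOUSE_PASSWORD").cloned().unwrap_or_default(),
        chunk_size,
    }))
}
```

Failing fast on a missing URL keeps misconfiguration visible at boot instead of surfacing as silent write errors later.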

Docker Compose

  • Add optional ClickHouse service (clickhouse/clickhouse-server)
  • Only started when the corresponding compose profile is active, so deployments with ENABLE_CLICKHOUSE=false are unaffected
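
A possible compose fragment using profiles (service name, profile name, and volume are assumptions):

```yaml
services:
  clickhouse:
    image: clickhouse/clickhouse-server
    profiles: ["clickhouse"]   # started only with --profile clickhouse (or COMPOSE_PROFILES=clickhouse)
    ports:
      - "8123:8123"            # HTTP interface that CLICKHOUSE_URL points at
    volumes:
      - clickhouse-data:/var/lib/clickhouse
volumes:
  clickhouse-data:
```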

References

  • tidx implementation: sync/sink.rs (SinkSet fan-out), sync/ch_sink.rs (ClickHouse writer with chunked inserts + retry), clickhouse.rs (read engine with failover)
  • tidx uses ReplacingMergeTree + monthly partitioning + bloom filter indexes on logs
