Description
Currently the Metadata transform exposes EventTime and Delay for CDC sources.
For downstream deduplication (e.g. CDC → S3 Parquet → StarRocks/Athena), users need
a monotonically increasing, per-row unique key to identify the latest state of a record.
EventTime is insufficient for this purpose because:
- Multiple CDC events for the same primary key can share the same millisecond timestamp
when a MySQL statement affects multiple rows or two statements execute within 1ms
c_event_time - c_delay (a common pattern) collapses ordering further through
arithmetic coincidence — different delay values can produce identical results
Example from production:
{"c_operation_type": "UPDATE_AFTER", "c_event_time": 1774342441039, "c_delay": 39, "c_db_change_time": 1774342441000}
{"c_operation_type": "UPDATE_AFTER", "c_event_time": 1774342441040, "c_delay": 40, "c_db_change_time": 1774342441000}
Both rows produce the same c_db_change_time despite being distinct CDC events.
The correct solution — is to expose the binlog filename + position + row index, which form a strict total order guaranteed by MySQL's binary log protocol. These are already present in the Debezium SourceRecord source struct (file, pos, row) but are never surfaced to the SeaTunnel metadata layer.
Usage Scenario :
Any CDC pipeline that writes to a data lake (S3 Parquet, Iceberg) and needs to deduplicate or find the latest state of a record in downstream queries:
Metadata {
metadata_fields {
BinlogFile = c_binlog_file // e.g. "mysql-bin.000123"
BinlogPos = c_binlog_pos // e.g. 4857392 (BIGINT)
BinlogRow = c_binlog_row // e.g. 0 (INT, row index within event)
}
}
Downstream dedup key in StarRocks/Athena:
CONCAT(
COALESCE(c_binlog_file, ''), ':',
LPAD(CAST(COALESCE(c_binlog_pos, 0) AS STRING), 20, '0'), ':',
LPAD(CAST(COALESCE(c_binlog_row, 0) AS STRING), 10, '0')
) AS c_change_seq
Description
Currently the
Metadatatransform exposesEventTimeandDelayfor CDC sources.For downstream deduplication (e.g. CDC → S3 Parquet → StarRocks/Athena), users need
a monotonically increasing, per-row unique key to identify the latest state of a record.
EventTimeis insufficient for this purpose because:when a MySQL statement affects multiple rows or two statements execute within 1ms
c_event_time - c_delay(a common pattern) collapses ordering further througharithmetic coincidence — different delay values can produce identical results
Example from production:
{"c_operation_type": "UPDATE_AFTER", "c_event_time": 1774342441039, "c_delay": 39, "c_db_change_time": 1774342441000}
{"c_operation_type": "UPDATE_AFTER", "c_event_time": 1774342441040, "c_delay": 40, "c_db_change_time": 1774342441000}
Both rows produce the same c_db_change_time despite being distinct CDC events.
The correct solution — is to expose the binlog filename + position + row index, which form a strict total order guaranteed by MySQL's binary log protocol. These are already present in the Debezium SourceRecord source struct (file, pos, row) but are never surfaced to the SeaTunnel metadata layer.
Usage Scenario :
Any CDC pipeline that writes to a data lake (S3 Parquet, Iceberg) and needs to deduplicate or find the latest state of a record in downstream queries:
Metadata {
metadata_fields {
BinlogFile = c_binlog_file // e.g. "mysql-bin.000123"
BinlogPos = c_binlog_pos // e.g. 4857392 (BIGINT)
BinlogRow = c_binlog_row // e.g. 0 (INT, row index within event)
}
}
Downstream dedup key in StarRocks/Athena:
CONCAT(
COALESCE(c_binlog_file, ''), ':',
LPAD(CAST(COALESCE(c_binlog_pos, 0) AS STRING), 20, '0'), ':',
LPAD(CAST(COALESCE(c_binlog_row, 0) AS STRING), 10, '0')
) AS c_change_seq