Skip to content

fix(transaction): detect duplicate file paths within a single FastAppend batch#2509

Open
SreeramGarlapati wants to merge 1 commit into
apache:mainfrom
SreeramGarlapati:main
Open

fix(transaction): detect duplicate file paths within a single FastAppend batch#2509
SreeramGarlapati wants to merge 1 commit into
apache:mainfrom
SreeramGarlapati:main

Conversation

@SreeramGarlapati
Copy link
Copy Markdown
Contributor

@SreeramGarlapati SreeramGarlapati commented May 26, 2026

Which issue does this PR close?

What changes are included in this PR?

validate_duplicate_files was meant to refuse any append that would point at a file the table already references. It's been doing exactly half of that. The other half — refusing the same file appearing twice inside one batch — has been broken since the function was written, because the very first thing it does is collect the batch into a HashSet. The duplicates are gone before the check looks for them.

Concretely: hand FastAppendAction two DataFiles with identical file_paths, call commit() with the default check_duplicate=true, and you get Ok. The resulting manifest holds two entries pointing at the same Parquet file, the snapshot summary's added_files_count is wrong, and every scan double-counts the rows. Nothing in the commit path notices.

The fix is small. A new free function check_no_duplicate_paths_in_batch in transaction::snapshot walks the batch and returns DataInvalid on the first colliding file_path. FastAppendAction::commit calls it at the top, under the same check_duplicate gate, so the opt-out via with_check_duplicate(false) still disables both halves of the check together. SnapshotProducer::validate_duplicate_files is now purely the cross-snapshot check; it builds its own local new_files set from added_data_files and is otherwise unchanged.

This is a sibling of #1394, which fixed the cross-snapshot half of the same function. Same shape of bug, opposite side of the comparison.

Are these changes tested?

Yes — two unit tests in crates/iceberg/src/transaction/append.rs.

The first throws three identical paths into a single batch and asserts the commit fails with DataInvalid carrying the offending path in the message.

The second asserts that with_check_duplicate(false) still accepts batch duplicates — same opt-out behaviour the cross-snapshot check already documents.

I also verified the repro by reverting just the commit() change and rerunning the first test against the unfixed code — it fails exactly where the panic-on-Ok branch lives, which is the bug demonstrated as a failing unit test.

@SreeramGarlapati
Copy link
Copy Markdown
Contributor Author

Hi @blackmwk — appreciate a quick review on this when you have a moment. Short, well-scoped fix (one function plus two tests); context in the linked #2507.

Copy link
Copy Markdown
Contributor

@xanderbailey xanderbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, thanks for the contribution, left a couple of comments / questions

}
if !duplicates_in_batch.is_empty() {
let mut paths: Vec<&str> = duplicates_in_batch.into_iter().collect();
paths.sort_unstable();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious to know if we need the sort?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moot after blackmwk's fail-fast suggestion (below) - we now report just the first dup, no list to sort.

Comment on lines +167 to +170
// First, detect duplicate file paths within the batch itself.
// Collecting straight into a `HashSet` would silently deduplicate
// and pass corruption through to the manifest, so we walk the list
// explicitly and surface every distinct duplicated path.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is maybe a little verbose?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya - shouldn't have been there. dropped it entirely; the function name + err message convey it.

Copy link
Copy Markdown
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SreeramGarlapati for this pr!

Comment on lines +171 to +172
let mut new_files: HashSet<&str> = HashSet::with_capacity(self.added_data_files.len());
let mut duplicates_in_batch: HashSet<&str> = HashSet::new();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is over complicated, we should throw error immediately when new_files.insert return false.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I prefer to move this check in the begigning of FastAppendAction's commit method, rather than putting it here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed - simplified to fail-fast on first insert collision. no extra set, no sort.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done - extracted as a free fn check_no_duplicate_paths_in_batch in transaction::snapshot, called at the top of FastAppendAction::commit under the same check_duplicate gate. SnapshotProducer::validate_duplicate_files is now purely the cross-snapshot check.

@SreeramGarlapati SreeramGarlapati force-pushed the main branch 3 times, most recently from 6048b31 to c6f1688 Compare May 30, 2026 02:12
…end batch

`SnapshotProducer::validate_duplicate_files` collected `added_data_files`
straight into a `HashSet<&str>` before checking against existing manifests.
That collect step silently dedupes the batch, so two `DataFile` entries
sharing the same `file_path` in one `add_data_files(...)` call were written
into the manifest unchecked and committed without error - producing a
snapshot whose `added_files_count` and read-side row count both
double-count the offending file.

Add `check_no_duplicate_paths_in_batch` as a free function in
`transaction::snapshot`, run it at the top of `FastAppendAction::commit`
under the same `check_duplicate` gate, and fail fast on the first
collision with `ErrorKind::DataInvalid` naming the offending path. The
cross-snapshot half of `validate_duplicate_files` is unchanged; it now
builds its own local `new_files` set from `added_data_files` (no
behavioural change).

Two unit tests:
  - three identical paths in one batch are rejected with `DataInvalid`
    and the offending path in the message;
  - `with_check_duplicate(false)` opt-out still accepts batch duplicates,
    matching the opt-out semantics already documented for the
    cross-snapshot check.

Closes apache#2507.
@SreeramGarlapati
Copy link
Copy Markdown
Contributor Author

thanks for the reviews @xanderbailey and @blackmwk - addressed all ur comments.
pl. re-review now.

}
}

pub(super) fn check_no_duplicate_paths_in_batch(files: &[DataFile]) -> Result<()> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why putting it in SnapshotProducer? I think we should just put it in FastAppendAction.

#[async_trait]
impl TransactionAction for FastAppendAction {
async fn commit(self: Arc<Self>, table: &Table) -> Result<ActionCommit> {
if self.check_duplicate {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always add this check. The check_duplicate flag is used to save checking against existing data files in snapshot since it may introduce extra io. This is not the case for the newly added data files, and if we don't do the check, it's a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(transaction): FastAppendAction silently accepts duplicate file paths within a single batch

3 participants