fix: respect iceberg-field-name Avro property on read path #2539
Closed
SreeramGarlapati wants to merge 2 commits into
…end batch
`SnapshotProducer::validate_duplicate_files` collected `added_data_files`
straight into a `HashSet<&str>` before checking against existing manifests.
That collect step silently dedupes the batch, so two `DataFile` entries
sharing the same `file_path` in one `add_data_files(...)` call were written
into the manifest unchecked and committed without error - producing a
snapshot whose `added_files_count` and read-side row count both
double-count the offending file.
Add `check_no_duplicate_paths_in_batch` as a free function in
`transaction::snapshot`, run it at the top of `FastAppendAction::commit`
under the same `check_duplicate` gate, and fail fast on the first
collision with `ErrorKind::DataInvalid` naming the offending path. The
cross-snapshot half of `validate_duplicate_files` is unchanged; it now
builds its own local `new_files` set from `added_data_files` (no
behavioural change).
Two unit tests:
- three identical paths in one batch are rejected with `DataInvalid`
and the offending path in the message;
- `with_check_duplicate(false)` opt-out still accepts batch duplicates,
matching the opt-out semantics already documented for the
cross-snapshot check.
Closes apache#2507.
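For reviewers, the in-batch check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `first_duplicate_path` and `check_no_duplicate_paths` are hypothetical names, paths are plain `&str` values rather than `DataFile` entries, and a `String` error stands in for `ErrorKind::DataInvalid`.

```rust
use std::collections::HashSet;

/// Returns the first path that occurs more than once in `paths`, if any.
/// (Hypothetical helper; the PR works on `DataFile` entries instead.)
fn first_duplicate_path<'a>(paths: impl IntoIterator<Item = &'a str>) -> Option<&'a str> {
    let mut seen = HashSet::new();
    for path in paths {
        // `HashSet::insert` returns false when the value was already present,
        // so the first false marks the first collision.
        if !seen.insert(path) {
            return Some(path);
        }
    }
    None
}

/// Fails fast on the first collision and names the offending path.
/// A `String` error is an assumption standing in for the crate's error type.
fn check_no_duplicate_paths(paths: &[&str]) -> Result<(), String> {
    match first_duplicate_path(paths.iter().copied()) {
        Some(dup) => Err(format!("duplicate file path in batch: {dup}")),
        None => Ok(()),
    }
}
```

Checking the `insert` return value, rather than collecting into a set first, is what keeps a repeated path from being silently deduplicated before the check runs.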
When Java writes Iceberg tables to Avro, it sanitizes field names that don't conform to Avro's naming rules (`[A-Za-z_][A-Za-z0-9_]*`) and stores the original Iceberg field name in the "iceberg-field-name" custom property. iceberg-rust was ignoring this property and using the sanitized Avro name directly, causing field name mismatches when reading Java-written Avro files.
The fix checks for the iceberg-field-name property on each Avro record field and uses it as the Iceberg field name when present, falling back to the Avro field name otherwise.
Closes apache#2536
Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>
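The lookup-with-fallback described in this commit message might look roughly like the sketch below. It is illustrative only: `AvroField` is a hypothetical stand-in for the Avro record field type, its attributes are modelled as plain strings, and `iceberg_field_name` is an invented helper name. Only the `ICEBERG_FIELD_NAME_PROP` constant and its value come from this PR.

```rust
use std::collections::BTreeMap;

const ICEBERG_FIELD_NAME_PROP: &str = "iceberg-field-name";

/// Hypothetical stand-in for an Avro record field; the real type comes
/// from the Avro library and stores attribute values differently.
struct AvroField {
    name: String,
    custom_attributes: BTreeMap<String, String>,
}

/// Prefer the original Iceberg name stored in the custom property and
/// fall back to the (possibly sanitized) Avro field name otherwise.
fn iceberg_field_name(field: &AvroField) -> &str {
    field
        .custom_attributes
        .get(ICEBERG_FIELD_NAME_PROP)
        .map(String::as_str)
        .unwrap_or(field.name.as_str())
}
```

Fields without the property resolve to their Avro name unchanged, so files that never needed sanitization read exactly as before.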
Summary
When Java writes Iceberg tables to Avro format, field names that don't conform to Avro's naming rules (`[A-Za-z_][A-Za-z0-9_]*`) are sanitized: leading digits get a `_` prefix, and special characters become `_x<hex>`. The original Iceberg field name is preserved in the `iceberg-field-name` custom property on the Avro record field.
iceberg-rust was ignoring this property entirely, using the sanitized Avro field name as the Iceberg field name. This caused schema mismatches when reading Java-written Avro data files (manifests, manifest lists, etc.) that contain fields with non-conforming names.
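To make the sanitization rule concrete, here is a rough sketch of the write-side transformation as the summary describes it. It is not part of this PR (which only touches the read path), `sanitize` is a hypothetical name, and the uppercase hex encoding is an assumption rather than something stated above.

```rust
/// Sketch of the sanitization described in the summary: a leading digit
/// gets a `_` prefix, any other non-conforming character becomes `_x<hex>`.
/// Uppercase hex is an assumption, not confirmed by this PR.
fn sanitize(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for (i, c) in name.chars().enumerate() {
        // Avro names must match [A-Za-z_][A-Za-z0-9_]*
        let valid = if i == 0 {
            c.is_ascii_alphabetic() || c == '_'
        } else {
            c.is_ascii_alphanumeric() || c == '_'
        };
        if valid {
            out.push(c);
        } else if c.is_ascii_digit() {
            // Only reachable for the first character: digits are valid elsewhere.
            out.push('_');
            out.push(c);
        } else {
            // Encode the character's code point in hex.
            out.push_str(&format!("_x{:X}", c as u32));
        }
    }
    out
}
```

Under these assumptions `123column` becomes `_123column`, which is why the reader needs the `iceberg-field-name` property to recover the original name.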
Changes
- `ICEBERG_FIELD_NAME_PROP` constant (`"iceberg-field-name"`)
- `AvroSchemaToSchema::record`: check `custom_attributes` for the property before falling back to `avro_field.name`
- Test: a sanitized field (`_123column` with `iceberg-field-name: "123column"`) resolves to the original name
Relationship to #2535
This is the read-side counterpart. #2535 adds write-side sanitization (producing the `iceberg-field-name` property). This PR is independently useful because it allows iceberg-rust to correctly read Avro files already written by Java with sanitized names.
Closes #2536
Test plan
- `iceberg-field-name` property resolves to original name