
fix: respect iceberg-field-name Avro property on read path #2539

Closed
SreeramGarlapati wants to merge 2 commits into apache:main from SreeramGarlapati:fix/avro-read-iceberg-field-name-property

Conversation

@SreeramGarlapati
Contributor

Summary

When Java writes Iceberg tables to Avro format, field names that don't conform to Avro's naming rules ([A-Za-z_][A-Za-z0-9_]*) are sanitized: a leading digit gets a _ prefix, and special characters become _x<hex>. The original Iceberg field name is preserved in the iceberg-field-name custom property on the Avro record field.
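
For illustration, the sanitization rule described above could be sketched roughly as follows. This is not the Java implementation or anything in this PR: `sanitize_avro_name` is a hypothetical helper, and the uppercase-hex encoding of the code point is an assumption.

```rust
/// Hypothetical helper (not part of iceberg-rust): rewrite `name` so that it
/// matches Avro's naming rule `[A-Za-z_][A-Za-z0-9_]*`.
fn sanitize_avro_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for (i, c) in name.chars().enumerate() {
        // The first character may not be a digit; later characters may.
        let valid = if i == 0 {
            c.is_ascii_alphabetic() || c == '_'
        } else {
            c.is_ascii_alphanumeric() || c == '_'
        };
        if valid {
            out.push(c);
        } else if c.is_ascii_digit() {
            // Only reachable for a leading digit: prefix it with '_'.
            out.push('_');
            out.push(c);
        } else {
            // Assumption: "_x" followed by the code point in uppercase hex.
            out.push_str(&format!("_x{:X}", c as u32));
        }
    }
    out
}
```

Under these assumptions, `123column` becomes `_123column`, which is the sanitized name used in this PR's test.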

iceberg-rust was ignoring this property entirely, using the sanitized Avro field name as the Iceberg field name. This caused schema mismatches when reading Java-written Avro data files (manifests, manifest lists, etc.) that contain fields with non-conforming names.

Changes

  • Added ICEBERG_FIELD_NAME_PROP constant ("iceberg-field-name")
  • In AvroSchemaToSchema::record, check custom_attributes for the property before falling back to avro_field.name
  • Added test verifying a sanitized field (_123column with iceberg-field-name: "123column") resolves to the original name
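
A minimal sketch of the lookup-with-fallback described in the list above. `AvroField` here is a simplified stand-in for the Avro record field type (its attribute values are plain strings, which is an assumption), and `iceberg_field_name` is a hypothetical helper, not the actual code in `AvroSchemaToSchema::record`.

```rust
use std::collections::BTreeMap;

const ICEBERG_FIELD_NAME_PROP: &str = "iceberg-field-name";

/// Simplified stand-in for an Avro record field (assumption: string-valued
/// custom attributes; the real field type is richer).
struct AvroField {
    name: String,
    custom_attributes: BTreeMap<String, String>,
}

/// Hypothetical helper: prefer the original Iceberg name stored in the
/// custom property, and fall back to the (possibly sanitized) Avro name.
fn iceberg_field_name(field: &AvroField) -> &str {
    field
        .custom_attributes
        .get(ICEBERG_FIELD_NAME_PROP)
        .map(String::as_str)
        .unwrap_or(field.name.as_str())
}
```

A field named `_123column` carrying `iceberg-field-name: "123column"` resolves to `123column`; a field without the property keeps its Avro name.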

Relationship to #2535

This is the read-side counterpart. #2535 adds write-side sanitization (producing the iceberg-field-name property). This PR is independently useful because it allows iceberg-rust to correctly read Avro files already written by Java with sanitized names.

Closes #2536

Test plan

  • New test: Avro schema with iceberg-field-name property resolves to original name
  • New test: fields without the property still use Avro field name (backwards compat)
  • All 1384 existing tests pass
  • Workspace compiles cleanly

SreeramGarlapati and others added 2 commits May 29, 2026 19:17
…end batch

`SnapshotProducer::validate_duplicate_files` collected `added_data_files`
straight into a `HashSet<&str>` before checking against existing manifests.
That collect step silently dedupes the batch, so two `DataFile` entries
sharing the same `file_path` in one `add_data_files(...)` call were written
into the manifest unchecked and committed without error - producing a
snapshot whose `added_files_count` and read-side row count both
double-count the offending file.

Add `check_no_duplicate_paths_in_batch` as a free function in
`transaction::snapshot`, run it at the top of `FastAppendAction::commit`
under the same `check_duplicate` gate, and fail fast on the first
collision with `ErrorKind::DataInvalid` naming the offending path. The
cross-snapshot half of `validate_duplicate_files` is unchanged; it now
builds its own local `new_files` set from `added_data_files` (no
behavioural change).

Two unit tests:
  - three identical paths in one batch are rejected with `DataInvalid`
    and the offending path in the message;
  - `with_check_duplicate(false)` opt-out still accepts batch duplicates,
    matching the opt-out semantics already documented for the
    cross-snapshot check.
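
A rough sketch of the in-batch check described above, assuming paths are
passed as plain `&str` values and the error is a `String` standing in for
`ErrorKind::DataInvalid`; the signature is hypothetical and may differ from
the one in the commit:

```rust
use std::collections::HashSet;

/// Hypothetical signature: fail on the first path that appears more than
/// once in the batch. A plain `String` stands in for the crate's error type.
fn check_no_duplicate_paths_in_batch<'a>(
    paths: impl IntoIterator<Item = &'a str>,
) -> Result<(), String> {
    let mut seen = HashSet::new();
    for path in paths {
        // `HashSet::insert` returns false when the value was already present,
        // so duplicates are detected instead of being silently collapsed.
        if !seen.insert(path) {
            return Err(format!(
                "DataInvalid: duplicate file path in batch: {}",
                path
            ));
        }
    }
    Ok(())
}
```

Unlike collecting straight into a `HashSet`, this inspects the return value
of each insert, so the first collision is reported.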

Closes apache#2507.

When Java writes Iceberg tables to Avro, it sanitizes field names that
don't conform to Avro's naming rules ([A-Za-z_][A-Za-z0-9_]*) and stores
the original Iceberg field name in the "iceberg-field-name" custom
property. iceberg-rust was ignoring this property and using the sanitized
Avro name directly, causing field name mismatches when reading
Java-written Avro files.

The fix checks for the iceberg-field-name property on each Avro record
field and uses it as the Iceberg field name when present, falling back
to the Avro field name otherwise.

Closes apache#2536

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>
@SreeramGarlapati
Contributor Author

Superseded by #2540, which contains this read-side change plus the matching Avro field-name sanitization on the write path. The two halves are useless without each other for cross-engine interop, so I'm consolidating them into a single PR.

Closing here; please review #2540 instead.


Development

Successfully merging this pull request may close these issues.

avro schema reader ignores iceberg-field-name property — returns sanitized names from java manifests
