fix: sanitize Avro field names on write, respect iceberg-field-name on read #2540
Open
SreeramGarlapati wants to merge 2 commits into
…ld-name

Iceberg field names can be arbitrary (leading digits, dots, spaces, etc.) but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java handles this by sanitizing invalid names on write and storing the original in an "iceberg-field-name" custom property, then checking that property on read.

iceberg-rust was writing unsanitized names directly, which causes Avro validation failures (or produces files unreadable by strict Avro parsers) when field names don't conform to Avro's naming rules.

This adds:
- sanitize_avro_name(): matches Java's AvroSchemaUtil.sanitize() logic (prefix _ for leading digits, _x<HEX> for special chars)
- Write path: sanitizes the field name and stores the original in iceberg-field-name when sanitization was needed
- Read path: checks iceberg-field-name property first, falls back to the Avro field name

Closes apache#2535

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>
…tion

Adds test cases for:
- Non-ASCII BMP characters (U+00E9, U+4E2D)
- Supplementary characters (surrogate pair handling, matching Java's UTF-16)
- Empty string edge case
- Read path: iceberg-field-name property resolution from Java-written schemas
- Verify iceberg-field-name property is set on dotted field names

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>
fe8c3ff to 12ba3b5
Summary
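Iceberg field names can be anything (`123column`, `field.with.dots`, etc.) but Avro requires names to match `[A-Za-z_][A-Za-z0-9_]*`. Java handles this with a sanitize-on-write + restore-on-read protocol using the `iceberg-field-name` custom Avro property. iceberg-rust was doing neither: it wrote invalid names directly and ignored the property on read.

This causes two interop failures:
- On write, unsanitized names fail Avro validation or produce files that strict Avro parsers cannot read.
- On read, the `iceberg-field-name` property is ignored, so a schema written by Java comes back with the sanitized Avro names instead of the original Iceberg names.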
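The naming rule above can be checked without a regex crate. The following is a minimal sketch using only the standard library; it is an illustration of the pattern `[A-Za-z_][A-Za-z0-9_]*`, not the code from this PR, and it assumes an empty name counts as invalid.

```rust
/// Returns true when `name` matches `[A-Za-z_][A-Za-z0-9_]*`.
/// Illustrative stand-in for the PR's `is_valid_avro_name()`; the real
/// implementation may differ.
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        // First character: ASCII letter or underscore only.
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        // Assumption: empty names and any other leading character are invalid.
        _ => return false,
    }
    // Remaining characters: ASCII letters, digits, or underscore.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
```

With this check, `123column` and `field.with.dots` are both rejected, while a name such as `valid_name` passes through unchanged.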
Changes
Write path (Iceberg→Avro conversion in `SchemaToAvroSchema::field`):
- `is_valid_avro_name()`: checks `[A-Za-z_][A-Za-z0-9_]*`
- `sanitize_avro_name()`: matches Java's `AvroSchemaUtil.sanitize()`:
  - Leading digits are prefixed with `_` (e.g., `123col` → `_123col`)
  - Special chars become `_x<HEX>` (e.g., `field.name` → `field_x2Ename`)
  - Matches Java's `charAt()` behavior for supplementary chars
- Stores the original name in the `iceberg-field-name` property when sanitization was needed

Read path (Avro→Iceberg conversion in `AvroSchemaToSchema::record`):
- Checks the `iceberg-field-name` custom attribute first, falls back to the Avro field name

Java reference
- `AvroSchemaUtil.sanitize()`
- `TypeToSchema.struct()`: stores `ICEBERG_FIELD_NAME_PROP`
- `AvroSchemaUtil.ICEBERG_FIELD_NAME_PROP`

Closes #2535
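The write and read paths described above can be sketched end to end with the standard library. This is a hedged illustration, not the PR's code: `AvroField`, `to_avro_field`, and `iceberg_name` are hypothetical names invented here, the field is reduced to a name plus a string property map, and the sketch assumes every character outside the ASCII pattern is escaped one UTF-16 code unit at a time. How the real Java and Rust implementations classify non-ASCII letters is not shown in this description.

```rust
use std::collections::BTreeMap;

const ICEBERG_FIELD_NAME_PROP: &str = "iceberg-field-name";

/// Hypothetical, simplified Avro record field: a name plus custom properties.
#[derive(Debug, Clone, PartialEq)]
struct AvroField {
    name: String,
    props: BTreeMap<String, String>,
}

/// True when `name` matches `[A-Za-z_][A-Za-z0-9_]*`.
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Sketch of the sanitization rule: a leading digit gets a `_` prefix and
/// any other disallowed UTF-16 code unit becomes `_x<HEX>` (uppercase hex).
fn sanitize_avro_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for (i, unit) in name.encode_utf16().enumerate() {
        // Only ASCII code units can be valid Avro name characters here.
        let ascii = if unit < 0x80 { Some(unit as u8 as char) } else { None };
        match ascii {
            Some(c) if c.is_ascii_alphabetic() || c == '_' => out.push(c),
            Some(c) if c.is_ascii_digit() && i > 0 => out.push(c),
            Some(c) if c.is_ascii_digit() => {
                out.push('_');
                out.push(c);
            }
            _ => out.push_str(&format!("_x{:X}", unit)),
        }
    }
    out
}

/// Write path: sanitize only when needed and remember the original name.
fn to_avro_field(iceberg_name: &str) -> AvroField {
    let mut props = BTreeMap::new();
    let name = if is_valid_avro_name(iceberg_name) {
        iceberg_name.to_string()
    } else {
        props.insert(ICEBERG_FIELD_NAME_PROP.to_string(), iceberg_name.to_string());
        sanitize_avro_name(iceberg_name)
    };
    AvroField { name, props }
}

/// Read path: prefer the stored original name, else use the Avro name.
fn iceberg_name(field: &AvroField) -> &str {
    field
        .props
        .get(ICEBERG_FIELD_NAME_PROP)
        .map(String::as_str)
        .unwrap_or(field.name.as_str())
}
```

In this sketch, `to_avro_field("field.name")` yields the Avro name `field_x2Ename` with the original stored under `iceberg-field-name`, and `iceberg_name` returns `field.name` again. A name that is already valid gets no extra property.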
Test plan
- `test_is_valid_avro_name`: validates detection of invalid names
- `test_sanitize_avro_name`: ASCII edge cases match Java behavior
- `test_sanitize_avro_name_unicode`: BMP and supplementary char handling (surrogate pairs)
- `test_sanitization_round_trip`: Iceberg→Avro→Iceberg preserves original names
- `test_avro_to_iceberg_uses_iceberg_field_name_property`: reads Java-written schemas correctly
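The surrogate-pair case is easiest to see in isolation. The sketch below shows what a `_x<HEX>` escape looks like when it is produced per UTF-16 code unit, which is how Java's `charAt()` sees a string. `escape_utf16_units` is a hypothetical helper written for this illustration, and U+1F600 is an example chosen here, not one taken from the PR's tests. Whether a given character is escaped at all depends on the validity check; this only shows the shape of the escape when it is.

```rust
/// Escapes one character as `_x<HEX>` per UTF-16 code unit.
/// A supplementary character (above U+FFFF) encodes to a surrogate pair,
/// so it produces two escapes instead of one.
fn escape_utf16_units(c: char) -> String {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf)
        .iter()
        .map(|unit| format!("_x{:X}", unit))
        .collect()
}
```

For the BMP characters U+00E9 and U+4E2D this gives `_xE9` and `_x4E2D`. For U+1F600 it gives `_xD83D_xDE00`, one escape per surrogate.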