Enable per column compression #3459

@mengna-lin

Describe the enhancement requested

Summary

  • Add support for configuring compression codec and compression level on a per-column basis when writing Parquet files, rather than applying a single codec and level uniformly across all columns.

Proposed API
Programmatic (ParquetWriter / ParquetProperties):
ParquetWriter.builder(...)
    .withCompressionCodec(CompressionCodecName.SNAPPY)          // global default
    .withCompressionCodec("col_a", CompressionCodecName.ZSTD)   // per-column override
    .withCompressionLevel("col_a", 9)
    .build();

MapReduce (ParquetOutputFormat / Hadoop Configuration):
parquet.compression=SNAPPY
parquet.compression#col_a=ZSTD
parquet.compression.level#col_a=9
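
To illustrate how the `#`-suffixed keys above could be read back, here is a minimal sketch in plain Java. It is not parquet-java code: the class `ColumnKeySketch` and the method `overridesFor` are hypothetical names, and a plain `Map<String, String>` stands in for the Hadoop `Configuration`. It only assumes the proposed key shape `<base>#<column>`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of parquet-java: collects per-column
// overrides from keys of the proposed form "<base>#<column>".
final class ColumnKeySketch {
    static final String CODEC_KEY = "parquet.compression";
    static final String LEVEL_KEY = "parquet.compression.level";

    /** Returns column -> value for every key equal to base + "#" + column. */
    static Map<String, String> overridesFor(String base, Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>();
        String prefix = base + "#";
        for (Map.Entry<String, String> e : conf.entrySet()) {
            String key = e.getKey();
            // Skip the global key itself and keys with an empty column name.
            if (key.startsWith(prefix) && key.length() > prefix.length()) {
                out.put(key.substring(prefix.length()), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A plain map stands in for the Hadoop Configuration here.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("parquet.compression", "SNAPPY");
        conf.put("parquet.compression#col_a", "ZSTD");
        conf.put("parquet.compression.level#col_a", "9");

        System.out.println(overridesFor(CODEC_KEY, conf)); // {col_a=ZSTD}
        System.out.println(overridesFor(LEVEL_KEY, conf)); // {col_a=9}
    }
}
```

Because the prefix includes the `#`, the codec lookup does not pick up `parquet.compression.level#col_a`, so the two key families stay separate.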

Behavior

  • Columns without an override inherit the global codec and level.
  • A compression level set for a column whose codec doesn't support levels (e.g. SNAPPY) is ignored, and a warning is logged.
  • A compression level set for a column without a per-column codec override is applied to the global default codec for that column, and a warning is logged.
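
The rules above could be sketched roughly as follows. This is an illustration only, not the proposed implementation: `CompressionResolutionSketch`, its `Codec` enum (a stand-in for `CompressionCodecName`), the `supportsLevel` flag, and the `warnings` list (a stand-in for a logger) are all hypothetical. The case of a per-column codec override with no per-column level is not covered by the issue; the sketch assumes the codec's own default level is used there.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types exist in parquet-java.
final class CompressionResolutionSketch {
    // Stand-in for CompressionCodecName. Which codecs accept a level is assumed.
    enum Codec {
        SNAPPY(false), GZIP(true), ZSTD(true);
        final boolean supportsLevel;
        Codec(boolean supportsLevel) { this.supportsLevel = supportsLevel; }
    }

    // Effective settings for one column; a null level means "codec default".
    static final class Resolved {
        final Codec codec;
        final Integer level;
        Resolved(Codec codec, Integer level) { this.codec = codec; this.level = level; }
        @Override public String toString() { return codec + (level == null ? "" : "@" + level); }
    }

    final Codec defaultCodec;
    final Integer defaultLevel; // may be null
    final Map<String, Codec> codecOverrides = new HashMap<>();
    final Map<String, Integer> levelOverrides = new HashMap<>();
    final List<String> warnings = new ArrayList<>(); // stands in for a logger

    CompressionResolutionSketch(Codec defaultCodec, Integer defaultLevel) {
        this.defaultCodec = defaultCodec;
        this.defaultLevel = defaultLevel;
    }

    Resolved resolve(String column) {
        Codec override = codecOverrides.get(column);
        Codec codec = (override != null) ? override : defaultCodec;
        Integer level = levelOverrides.get(column);

        if (level == null) {
            // No per-column level: inherit the global level only when the column
            // also uses the global codec (assumption, see the note above).
            Integer inherited = (override == null && codec.supportsLevel) ? defaultLevel : null;
            return new Resolved(codec, inherited);
        }
        if (!codec.supportsLevel) {
            // Level set for a codec without levels: ignore it and warn.
            warnings.add("Ignoring level " + level + " for column " + column
                    + ": " + codec + " does not support levels");
            return new Resolved(codec, null);
        }
        if (override == null) {
            // Level set without a codec override: apply it to the default codec and warn.
            warnings.add("Applying level " + level + " for column " + column
                    + " to default codec " + codec);
        }
        return new Resolved(codec, level);
    }

    public static void main(String[] args) {
        CompressionResolutionSketch s = new CompressionResolutionSketch(Codec.SNAPPY, null);
        s.codecOverrides.put("col_a", Codec.ZSTD);
        s.levelOverrides.put("col_a", 9);
        s.levelOverrides.put("col_c", 3); // level without a codec override

        System.out.println(s.resolve("col_a")); // ZSTD@9
        System.out.println(s.resolve("col_b")); // SNAPPY
        System.out.println(s.resolve("col_c")); // SNAPPY (level ignored)
        System.out.println(s.warnings.size());  // 1
    }
}
```

When the default codec is one that accepts levels (ZSTD in this sketch), a level-only override is kept and a warning is recorded instead of the level being dropped.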
