Column index with null_pages=true and null_counts=0 causes silent data loss during filter pushdown #3457

@clee704

Describe the bug

When a Parquet file contains a ColumnIndex where null_pages[i] is true and null_counts[i] is 0 for the same page, parquet-java's column index filtering silently drops that page from query results. No error or warning is produced.

Per the Parquet format specification, null_pages[i]=true means "a page contains only null values" and null_counts[i] is "the number of null values" for each page. These two fields directly contradict each other: a page cannot contain only null values while also having zero null values.

Why it causes data loss

The column index filtering in ColumnIndexBase has two code paths for evaluating predicates, and both exclude pages with this contradiction:

Non-null predicates (e.g., WHERE col = 50): The BoundaryOrder comparators iterate over pageIndexes — an internal array that maps min/max array positions to page numbers, omitting pages where null_pages[i] is true. Pages marked as null have no entry in this array and are never evaluated by the comparators, so their rows are excluded from results.

Null predicates (e.g., WHERE col IS NULL): ColumnIndexBase.visit(Eq) checks nullCounts[pageIndex] > 0 (line ~299 on master), which returns false when null_counts[i] is 0. The page is excluded.

A page with this contradiction is invisible to all predicates. Only unfiltered reads (no WHERE clause) return correct results.
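To make the two code paths concrete, here is a minimal, self-contained model of the behavior described above. It is a simplified stand-in, not parquet-java's actual classes: the class name ColumnIndexModel and the methods pagesForEq and pagesForIsNull are hypothetical, and int min/max values are assumed for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the filtering logic described above; NOT parquet-java code.
class ColumnIndexModel {
    final long[] nullCounts;  // one entry per page
    final int[] mins;         // one entry per page NOT marked as a null page
    final int[] maxs;
    final int[] pageIndexes;  // position in mins/maxs -> page number

    ColumnIndexModel(boolean[] nullPages, long[] nullCounts, int[] mins, int[] maxs) {
        this.nullCounts = nullCounts;
        this.mins = mins;
        this.maxs = maxs;
        int nonNull = 0;
        for (boolean isNull : nullPages) {
            if (!isNull) nonNull++;
        }
        // Pages marked as null pages get no entry here.
        pageIndexes = new int[nonNull];
        int k = 0;
        for (int page = 0; page < nullPages.length; page++) {
            if (!nullPages[page]) pageIndexes[k++] = page;
        }
    }

    // Models a non-null predicate (col = value): only pages in pageIndexes are evaluated.
    List<Integer> pagesForEq(int value) {
        List<Integer> pages = new ArrayList<>();
        for (int k = 0; k < pageIndexes.length; k++) {
            if (mins[k] <= value && value <= maxs[k]) pages.add(pageIndexes[k]);
        }
        return pages;
    }

    // Models a null predicate (col IS NULL): only pages with nullCounts > 0 match.
    List<Integer> pagesForIsNull() {
        List<Integer> pages = new ArrayList<>();
        for (int page = 0; page < nullCounts.length; page++) {
            if (nullCounts[page] > 0) pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        // Page 1 is contradictory: null_pages=true, null_counts=0.
        ColumnIndexModel ci = new ColumnIndexModel(
            new boolean[] {false, true}, new long[] {0, 0}, new int[] {0}, new int[] {10});
        System.out.println(ci.pagesForEq(50));    // prints []
        System.out.println(ci.pagesForIsNull());  // prints []
    }
}
```

In this model, page 1 is never returned by either method, whatever values it actually holds.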

Proposed fix

Add validation in ColumnIndexBuilder.build(PrimitiveType) to detect the contradiction and return null, following the existing pattern where this method already returns null for other kinds of invalid metadata:

if (nullPages.isEmpty()) {
  return null;
}
ColumnIndexBase<?> columnIndex = createColumnIndex(type);
if (columnIndex == null) {
  // Might happen if the specialized builder discovers invalid min/max values
  return null;
}

The null propagates through the read path (fromParquetColumnIndex → readColumnIndex → getColumnIndex → ColumnIndexFilter.applyPredicate), causing the filter to fall back to reading all pages for the affected column. Row-group-level statistics filtering and other columns are unaffected. Performance overhead should be negligible: build() already loops over pages multiple times, and this check is a single boolean read per page, short-circuited by &&.
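The check itself could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual patch: NullPagesValidation, hasContradiction, and pagesToRead are hypothetical names, plain arrays replace the builder's internal lists, and a missing null_counts list (the field is optional in the format) is treated as "nothing to validate".

```java
// Hypothetical helper illustrating the consistency check; not the actual parquet-java patch.
class NullPagesValidation {

    // True if any page claims to be all-null while reporting zero nulls.
    static boolean hasContradiction(boolean[] nullPages, long[] nullCounts) {
        if (nullCounts == null || nullCounts.length != nullPages.length) {
            return false; // assumption: without usable null counts there is nothing to compare
        }
        for (int i = 0; i < nullPages.length; i++) {
            // Short-circuit: nullCounts is only read for pages flagged as null pages.
            if (nullPages[i] && nullCounts[i] == 0) {
                return true;
            }
        }
        return false;
    }

    // Models the fallback: an unusable index means every page must be read.
    static int[] pagesToRead(boolean[] nullPages, long[] nullCounts, int[] filteredPages) {
        if (hasContradiction(nullPages, nullCounts)) {
            int[] all = new int[nullPages.length];
            for (int i = 0; i < all.length; i++) {
                all[i] = i;
            }
            return all;
        }
        return filteredPages;
    }
}
```

With the metadata from the reproduction case (null_pages = false, true, true and null_counts = 0, 0, 0), hasContradiction returns true and pagesToRead returns all three pages instead of the filtered subset.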

Existing precedent

ColumnIndexBuilder.build(PrimitiveType) already returns null when the column index has invalid min/max values (createColumnIndex returns null). The proposed fix adds one more validation of the same kind — checking a different field (null_pages vs null_counts) for internal consistency, with the same outcome (return null).

More broadly, parquet-java has defensive handling for invalid writer metadata in other areas:

  • CorruptStatistics (PARQUET-251): Ignores statistics from writers known to produce invalid binary column statistics.
  • CorruptDeltaByteArrays (PARQUET-246): Forces sequential reads for files with broken delta byte array encoding.

How to reproduce

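import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.parquet.internal.column.columnindex.BoundaryOrder;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.ColumnIndexBuilder;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Types;

PrimitiveType type = Types.required(INT32).named("col");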

// Pages 1-2 have null_pages=true but null_counts=0 — contradictory
ColumnIndex ci = ColumnIndexBuilder.build(
    type,
    BoundaryOrder.ASCENDING,
    List.of(false, true, true),
    List.of(0L, 0L, 0L),
    List.of(ByteBuffer.allocate(4), ByteBuffer.allocate(0), ByteBuffer.allocate(0)),
    List.of(ByteBuffer.allocate(4), ByteBuffer.allocate(0), ByteBuffer.allocate(0)));

// ci is non-null — the contradiction is not detected.
// Pages 1-2 are silently excluded from all column-index-based filtering.

Additional fix

The static ColumnIndexBuilder.build() method (the read-path entry point) does not null-check the return value of build(PrimitiveType) before dereferencing it, which could cause an NPE. The fix includes a null-guard for this.
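The shape of such a null-guard can be sketched as follows. Everything here is a hypothetical stand-in: the Index class, buildValidated, and the boundaryOrder field only model "a value that may be null is dereferenced right after it is built", and are not taken from the parquet-java source.

```java
import java.util.Objects;

// Hypothetical stand-ins; this only illustrates the guard, not the real method bodies.
class NullGuardSketch {
    static final class Index {
        String boundaryOrder;
    }

    // Models build(PrimitiveType): returns null when the metadata is rejected.
    static Index buildValidated(boolean metadataValid) {
        return metadataValid ? new Index() : null;
    }

    // Models the static entry point: check for null before touching the result.
    static Index build(boolean metadataValid, String boundaryOrder) {
        Index index = buildValidated(metadataValid);
        if (index == null) {
            return null; // propagate "no usable column index" instead of throwing an NPE
        }
        index.boundaryOrder = Objects.requireNonNull(boundaryOrder);
        return index;
    }
}
```

Without the guard, the assignment to index.boundaryOrder would throw a NullPointerException whenever validation rejects the metadata.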
