Column index with null_pages=true and null_counts=0 causes silent data loss during filter pushdown


### Describe the bug

When a Parquet file contains a `ColumnIndex` where `null_pages[i]` is `true` and `null_counts[i]` is `0` for the same page, parquet-java's column index filtering silently drops that page from query results. No error or warning is produced.

Per the [Parquet format specification](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift), `null_pages[i]=true` means "a page contains only null values" and `null_counts[i]` is "the number of null values" for each page. These two fields directly contradict each other: a page cannot contain only null values while also having zero null values.

### Why it causes data loss

The column index filtering in `ColumnIndexBase` has two code paths for evaluating predicates, and both exclude pages with this contradiction:

**Non-null predicates** (e.g., `WHERE col = 50`): The `BoundaryOrder` comparators iterate over `pageIndexes` — an internal array that maps min/max array positions to page numbers, omitting pages where `null_pages[i]` is `true`. Pages marked as null have no entry in this array and are never evaluated by the comparators, so their rows are excluded from results.

**Null predicates** (e.g., `WHERE col IS NULL`): `ColumnIndexBase.visit(Eq)` checks `nullCounts[pageIndex] > 0` (line ~299 on master), which returns `false` when `null_counts[i]` is `0`. The page is excluded.

A page with this contradiction is invisible to all predicates. Only unfiltered reads (no `WHERE` clause) return correct results.
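To make the interaction of the two paths concrete, here is a minimal, self-contained model of the behavior described above. It does not use parquet-java's classes; the class and method names (`NullPageVisibilitySketch`, `pagesSeenByValuePredicates`, `pagesSeenByIsNull`, `invisiblePages`) are hypothetical, and the logic is a simplified assumption of what `pageIndexes` and the `nullCounts[pageIndex] > 0` check do.

```java
import java.util.ArrayList;
import java.util.List;

public class NullPageVisibilitySketch {

  // Models the non-null path: only pages with null_pages[i] == false get a
  // min/max slot, so only those pages are compared against a predicate value.
  static List<Integer> pagesSeenByValuePredicates(boolean[] nullPages) {
    List<Integer> seen = new ArrayList<>();
    for (int i = 0; i < nullPages.length; i++) {
      if (!nullPages[i]) {
        seen.add(i);
      }
    }
    return seen;
  }

  // Models the IS NULL path: a page is a candidate only if its null count is positive.
  static List<Integer> pagesSeenByIsNull(long[] nullCounts) {
    List<Integer> seen = new ArrayList<>();
    for (int i = 0; i < nullCounts.length; i++) {
      if (nullCounts[i] > 0) {
        seen.add(i);
      }
    }
    return seen;
  }

  // Pages that neither path can ever select (assumes both arrays have equal length).
  static List<Integer> invisiblePages(boolean[] nullPages, long[] nullCounts) {
    List<Integer> byValue = pagesSeenByValuePredicates(nullPages);
    List<Integer> byNull = pagesSeenByIsNull(nullCounts);
    List<Integer> invisible = new ArrayList<>();
    for (int i = 0; i < nullPages.length; i++) {
      if (!byValue.contains(i) && !byNull.contains(i)) {
        invisible.add(i);
      }
    }
    return invisible;
  }

  public static void main(String[] args) {
    boolean[] nullPages = {false, true, true};
    long[] nullCounts = {0L, 0L, 0L};
    // Pages 1 and 2 are reachable by neither path.
    System.out.println(invisiblePages(nullPages, nullCounts)); // prints [1, 2]
  }
}
```

With consistent metadata (for example `null_counts = {0, 4, 7}`), the same model reports no invisible pages, because the `IS NULL` path picks up pages 1 and 2.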

### Proposed fix

Add a validation step in `ColumnIndexBuilder.build(PrimitiveType)` that detects the contradiction and returns `null`. This follows the existing pattern in that method, which already returns `null` for other kinds of invalid metadata through these early-return guards:

```java
if (nullPages.isEmpty()) {
  return null;
}
ColumnIndexBase<?> columnIndex = createColumnIndex(type);
if (columnIndex == null) {
  // Might happen if the specialized builder discovers invalid min/max values
  return null;
}
```
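The consistency check itself could look roughly like the sketch below. It is written against plain arrays so that it runs on its own; the class name `NullPageConsistencyCheck` and the method `isContradictory` are hypothetical, and the real builder holds these values in its own internal fields, so the actual patch may differ in shape. The sketch also assumes that a missing `null_counts` list (it is optional in the format) is not treated as a contradiction.

```java
public class NullPageConsistencyCheck {

  // Returns true if any page is flagged as all-null while reporting zero nulls.
  // nullCounts may be null or empty because null_counts is optional in the format.
  static boolean isContradictory(boolean[] nullPages, long[] nullCounts) {
    if (nullCounts == null || nullCounts.length == 0) {
      return false; // nothing to cross-check against
    }
    int pages = Math.min(nullPages.length, nullCounts.length);
    for (int i = 0; i < pages; i++) {
      // Short-circuits: nullCounts is read only for pages flagged as null pages.
      if (nullPages[i] && nullCounts[i] == 0) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Contradictory: pages 1 and 2 claim to be all-null with zero nulls.
    System.out.println(isContradictory(
        new boolean[] {false, true, true}, new long[] {0L, 0L, 0L}));
    // Consistent: the null pages report positive null counts.
    System.out.println(isContradictory(
        new boolean[] {false, true, true}, new long[] {0L, 4L, 7L}));
  }
}
```

In the builder, a `true` result would translate into `return null`, placed alongside the existing guards.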

The `null` propagates through the read path (`fromParquetColumnIndex` → `readColumnIndex` → `getColumnIndex` → `ColumnIndexFilter.applyPredicate`), causing the filter to fall back to reading all pages for the affected column. Row-group-level statistics filtering and other columns are unaffected. Performance overhead should be negligible — `build()` already loops over pages multiple times, and this check is a single boolean read per page, short-circuited by `&&`.

### Existing precedent

`ColumnIndexBuilder.build(PrimitiveType)` already returns `null` when the column index has invalid min/max values (`createColumnIndex` returns null). The proposed fix adds one more validation of the same kind — checking a different field (`null_pages` vs `null_counts`) for internal consistency, with the same outcome (`return null`).

More broadly, parquet-java has defensive handling for invalid writer metadata in other areas:

- **`CorruptStatistics`** ([PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251)): Ignores statistics from writers known to produce invalid binary column statistics.
- **`CorruptDeltaByteArrays`** ([PARQUET-246](https://issues.apache.org/jira/browse/PARQUET-246)): Forces sequential reads for files with broken delta byte array encoding.

### How to reproduce

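```java
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.parquet.internal.column.columnindex.BoundaryOrder;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.ColumnIndexBuilder;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Types;

PrimitiveType type = Types.required(INT32).named("col");

// Pages 1-2 (zero-based) have null_pages=true but null_counts=0, which is contradictory.
ColumnIndex ci = ColumnIndexBuilder.build(
    type,
    BoundaryOrder.ASCENDING,
    List.of(false, true, true),   // null_pages
    List.of(0L, 0L, 0L),          // null_counts
    List.of(ByteBuffer.allocate(4), ByteBuffer.allocate(0), ByteBuffer.allocate(0)),  // min values
    List.of(ByteBuffer.allocate(4), ByteBuffer.allocate(0), ByteBuffer.allocate(0))); // max values

// ci is non-null: the contradiction is not detected.
// Pages 1-2 are silently excluded from all column-index-based filtering.
```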

### Additional fix

The static `ColumnIndexBuilder.build()` method (the read-path entry point) does not null-check the return value of `build(PrimitiveType)` before dereferencing it, which could cause an NPE. The fix includes a null-guard for this.


