GH-3451. Add a JMH benchmark for variants #3452
steveloughran wants to merge 9 commits into apache:master
Conversation
Initial impl.
Still thinking of what else can be done here... suggestions welcome. Probably a real write to the local FS and a read back in.
I'll add a "deep" option too, for consistency with the Iceberg PR.
```java
private static int count() {
  int c = counter++;
  if (c >= 512) {
    c = 0;
```
Only resets the local copy; `counter` keeps growing?
Good point, will fix.
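A minimal sketch of one possible fix, wrapping the shared field itself rather than the local copy (the class and field names are assumptions based on the quoted hunk, not the PR's actual code):

```java
// Sketch of the fix discussed above: wrap the shared field itself instead of
// the local copy, so values stay in 0..511 and the field never grows unbounded.
// RowCounter and counter are illustrative names assumed from the quoted hunk.
public class RowCounter {
    private static int counter = 0;

    static int count() {
        int c = counter;
        counter = (counter + 1) % 512;
        return c;
    }
}
```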
* deser to recurse down
* include uuid and bigdecimal
* reset counter on benchmark setup
Iterations of the class code and the number of rows are the same, for easy comparison of overheads.
Using the same structure as the Iceberg tests do.
There's now a new benchmark which writes a file using the same simple schema as I'm using in the Iceberg PR apache/iceberg#15629, and tries to do a projection on it. Reviewed by Copilot.

Setup: 1M rows, 4-field nested variant (idstr, varid, varcategory, col4), querying varcategory only. SingleShotTime, 15 iterations, @Fork(0).

Raw results: speedup/penalty vs the readAllRecords baseline.
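The harness setup described above (SingleShotTime, 15 iterations, no fork) maps onto JMH annotations roughly as follows; the class, state, and method names are illustrative placeholders, not the PR's actual benchmark code:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Illustrative JMH harness configuration matching the setup described above.
// Benchmark bodies are placeholders; the real benchmark lives in the PR.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Measurement(iterations = 15)
@Fork(0)
public class VariantReadBench {

    @Setup(Level.Trial)
    public void setup() {
        // reset any counters and write the 1M-row test file here
    }

    @Benchmark
    public void readVarcategoryOnly() {
        // read the file with the varcategory-only projection
    }
}
```

Note that @Fork(0) runs the benchmark in the harness JVM, which is convenient for debugging but lets JIT state leak between benchmarks; forking is the safer default for published numbers.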
Recommendation: always detect the file layout in ReadSupport.init() and apply the lean projection only when the file was written with a shredded schema. For unshredded files, use the full file schema or no projection.

If you have a query with a pushdown predicate that wants to look inside a variant, creating a MessageType schema referring to the shredded values is counterproductive unless you know that the variant is shredded. That can be determined by looking at the schema and using `containsField("typed_value")` to see if the target variant has any nested typed values.

```java
@Override
public ReadContext init(InitContext context) {
  MessageType fileSchema = context.getFileSchema();
  GroupType nested = fileSchema.getType("nested").asGroupType();
  if (nested.containsField("typed_value")) {
    return new ReadContext(VARCATEGORY_PROJECTION);
  }
  // Unshredded file: a projection designed for typed columns provides no benefit
  // and causes schema mismatch overhead; fall back to the full file schema.
  return new ReadContext(fileSchema);
}
```
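For reference, the distinction that the `containsField("typed_value")` check relies on looks roughly like this in the file schema. This fragment is an assumption about how a shredded variant column is laid out, sketched for illustration rather than taken from the PR:

```
message schema {
  required group nested {          <- the variant column
    required binary metadata;
    optional binary value;
    optional group typed_value {   <- present only in shredded files
      ...
    }
  }
}
```

An unshredded file carries only `metadata` and `value`, so the `typed_value` group is absent and the check falls through to the full file schema.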
Build failures are all because the Java 11 javadoc is fussier than the versions on either side of it.
Rationale for this change
There's no benchmark for variant IO, so there's no knowledge of any problems which exist now, nor any way to detect regressions.
What changes are included in this PR?
VariantBenchmark

Are these changes tested?
Manually; the initial PR doesn't fork the JVM for each option.
There are 100 iterations per benchmark because some of the unshredded/small-object operations are so fast that clock granularity becomes an issue.
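The granularity problem mentioned above can be sketched in plain Java: a single fast operation can complete inside one timer tick, so timing a batch and dividing is the only way to get a meaningful per-op figure. This is a standalone illustration, not code from the PR:

```java
// Sketch: why very fast operations need many iterations per measurement.
// One call can finish within a single clock tick, so we time a batch
// of calls and divide by the batch size to estimate per-op cost.
public class GranularityDemo {

    static long timeBatch(Runnable op, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            op.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int[] sink = new int[1];            // sink to keep the op from being elided
        long batched = timeBatch(() -> sink[0]++, 100);
        System.out.println("approx ns per op: " + (double) batched / 100);
    }
}
```

JMH solves the same problem more rigorously with its own batching and blackholes; this sketch only shows the underlying motivation.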
Are there any user-facing changes?
No
Closes #3451