Upserts Bitmap Optimizations while taking Snapshot - Segment Commit#18020
Upserts Bitmap Optimizations while taking Snapshot - Segment Commit#18020Jackie-Jiang merged 10 commits intoapache:masterfrom
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18020 +/- ##
============================================
+ Coverage 63.29% 63.34% +0.05%
- Complexity 1525 1543 +18
============================================
Files 3194 3200 +6
Lines 193645 194240 +595
Branches 29787 29932 +145
============================================
+ Hits 122559 123047 +488
- Misses 61466 61536 +70
- Partials 9620 9657 +37
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…clearing it whenever necessary
| } | ||
|
|
||
| public void init(IndexSegment segment, int docId) { | ||
| clear(); |
There was a problem hiding this comment.
Why have we removed this? are we going to rely on manually clearing this map ourselves ?
There was a problem hiding this comment.
init is called whenever we call the previous row, and then merge with that row. Since we are anyway clearing this map manually after the merge, thought this might be duplicate and unnecessary to clear() again.
There was a problem hiding this comment.
i feel other folks who have custom pinot forks may still be relying on init to clear. let's just keep it for now
There was a problem hiding this comment.
yeah, added it back
| MutableRoaringBitmap docIdsSnapshot = docIds.getMutableRoaringBitmap(); | ||
| try (DataOutputStream dataOutputStream = new DataOutputStream(new FileOutputStream(tmpFile))) { | ||
| docIdsSnapshot.serialize(dataOutputStream); | ||
| byte[] bytes = docIds.toBytes(); |
There was a problem hiding this comment.
Can you benchmark the performance of serialization vs clone? Will this cause longer pause?
Also, will this save heap memory? Should we consider serializing using off-heap memory?
There was a problem hiding this comment.
Yeah I haven't seen any longer pause from the small benchmark I have taken . Older code was completely cloning the bitmap. Atleast with that we reduce the number of lock contentions as well as bytes which are eligible for gc immediately. With off-heap memory:
It might not be worth it here:
Direct buffers are slower to allocate. ByteBuffer.allocateDirect() requires a system call (mmap) and has higher per-allocation overhead. For a ~40KB buffer created per segment, this would add overhead dominate any benefit. For maybe larger bitmaps it might make sense but it doesn't explain the purpose. We might just use it to save the heap memory and bring down heap overhead or long gc pauses. I don't think in our scenario we had a long gc pause. @KKcorps correct me if I am wrong.
xiangfu0
left a comment
There was a problem hiding this comment.
Auto-approved: low-risk change, all CI checks passed. Reviewed by automated bot.
There was a problem hiding this comment.
Pull request overview
Optimizes upsert snapshot persistence and segment lifecycle bookkeeping to reduce heap/GC pressure during frequent commits, and fixes retention of previous-segment references in partial-upsert flows.
Changes:
- Add byte-serialization snapshotting to
ThreadSafeMutableRoaringBitmapand switch snapshot persistence to use serialized bytes instead of cloning bitmaps. - Eagerly remove replaced/dropped segments from
_updatedSegmentsSinceLastSnapshottracking. - Clear the reusable
LazyRowafter partial-upsert merges to release old segment references sooner.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/ThreadSafeMutableRoaringBitmap.java |
Adds synchronized toBytes() and getCardinality() helpers to snapshot the live bitmap without cloning. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/immutable/ImmutableSegmentImpl.java |
Persists docIds snapshots by writing serialized bitmap bytes instead of serializing a cloned bitmap. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/upsert/BasePartitionUpsertMetadataManager.java |
Eagerly cleans _updatedSegmentsSinceLastSnapshot on segment replace/drop; adjusts snapshot-taking logic and cardinality accounting. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/upsert/ConcurrentMapPartitionUpsertMetadataManager.java |
Clears reusable LazyRow after partial-upsert merge to avoid retaining old segment state. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/upsert/ConcurrentMapPartitionUpsertMetadataManagerForConsistentDeletes.java |
Same reusable-LazyRow cleanup as the non-consistent-deletes manager. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/readers/LazyRow.java |
Adjusts LazyRow lifecycle by no longer clearing on init() and by nulling _segment on clear(). |
_updatedSegmentsSinceLastSnapshot): Pinot tracks which segments need their state saved to disk. When a segment gets replaced (e.g. during a reload), the old segment is removed from the active list but not from the tracking set. It stays there until the next periodic save which only happens every 3-20 hours.getMutableRoaringBitmap(), which deep-copies the entire object graph. Under high commit frequency this causes significant heap spikes and GC pressure. Replace the clone with a synchronized in-place serialization (toBytes()) that produces the snapshot bytes directly from the live bitmap, eliminating the intermediate object allocation._reusePreviousRow. Instead, a reusable internal object (_reusePreviousRow) keeps holding a pointer to the old segment's data indefinitely. It's better to cleanup immediately after usage rather than doing it when the next previous row need to be used. This leak is specific for Partial Upsert tables.Ran a small benchmark with each segment having ~85K valid doc IDs (100K docs, ~41 KB serialized bitmap):
OLD flow -- dominated by clone internals:
NEW flow :