fix attributes dropped when decoding OTAP -> OTLP proto bytes when IDs are out of order (#2421)
Conversation
Codecov Report

❌ Patch coverage. Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2421      +/-   ##
==========================================
- Coverage   88.02%   88.01%   -0.01%
==========================================
  Files         582      582
  Lines      204775   204888     +113
==========================================
+ Hits       180247   180342      +95
- Misses      24002    24020      +18
  Partials      526      526
```
```rust
[Some(ids), None] => {
let ids = ids.as_any().downcast_ref::<UInt16Array>().ok_or_else(|| {
if let Some(ids) = record_batch.column_by_name(consts::ID) {
sort_columns[sort_columns_idx] =
```
nit: `sort_columns_idx += 1;` in case it is used in the future?
That would be a clippy error because the value we assign never gets read. If a new ID column gets added to the list of sort columns in the future, we can update `sort_columns_idx` then. However, I doubt we'll add more columns to this multi-column sort.
```rust
    })?);
}

match sort_columns {
```
Worth mentioning, I think, that we do something pretty similar here, but for the two-column case:

otel-arrow/rust/otap-dataflow/crates/pdata/src/otap/transform/util.rs
Lines 136 to 137 in 0dde9c1

I don't think you have to do it now, but I wonder if we can create one "sort by ids" function that always does the best thing based on the inputs.
After some discussion w/ @JakeDern, we've determined this only fixes the issue when there is an invariant such that all the values in the ID column in rows having some scope.id "X" are less than all the values in the ID column in rows having some scope.id "Y", where X < Y. Stated more simply: as the scope/resource IDs increase, we expect the ID column to increase as well. We currently maintain this invariant by pure coincidence, by virtue of how we encode OTAP data, and by the fact that as we transform the data we have no good reason to start mixing up the values in the ID columns. That said, nothing in the OTAP spec says our ID columns must be ordered in such a way. Indeed, it's possible to hand-craft a truly pathological record batch that still doesn't decode properly:

```rust
// Build an otap batch where:
//
// Row 0: resource_id=0, scope_id=5 (high scope_id in first resource)
// Row 1: resource_id=1, scope_id=2 (low scope_id in second resource)
// Row 2: resource_id=1, scope_id=3 (another low scope_id in second resource)
//
// After sorting by (resource_id, scope_id), visit order is rows [0, 1, 2]
// with scope_ids visited: [5, 2, 3] - non-monotonic at the 5 -> 2 boundary.
let str_type = 1u8; // AttributeValueType::Str
#[rustfmt::skip]
let mut otap_batch: OtapArrowRecords = logs!(
    (Logs,
        ("id", UInt16, vec![0u16, 1, 2]),
        ("resource.id", UInt16, vec![0u16, 1, 1]),
        ("scope.id", UInt16, vec![5u16, 2, 3]),
        ("scope.name", Utf8, vec!["scope-five", "scope-two", "scope-three"])),
    (ScopeAttrs,
        ("parent_id", UInt16, vec![2u16, 3, 5]),
        ("type", UInt8, vec![str_type, str_type, str_type]),
        ("key", Utf8, vec!["sk", "sk", "sk"]),
        ("str", Utf8, vec!["scope2-attr", "scope3-attr", "scope5-attr"])),
).into();
```

I'm tempted to handle this additional case as a followup (I don't believe the solution will conflict), since this PR already solves an issue that can be easily encountered in the OPL/Transform processor. However, I throw myself upon the mercy of my reviewers, and if anyone feels this all must be fixed at once I will kindly do so.
Truly pathological record batches are something of an interest of mine 😉 - I think we should file another bug with this test case for follow-up and merge this PR, as it's strictly an improvement IMO.
Change Summary
Fixes issue decoding OTAP to OTLP proto bytes when the ID column of the root record batch is not in sorted order.
When we do this decoding, we initialize cursors for the order in which to visit the rows of each record batch. When we visit some row in the root record batch, we then try to find its attributes by advancing the attributes cursor until the parent_id column matches the ID of the current row in the root record batch.
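The cursor scheme above can be sketched in plain Rust. This is a minimal illustration only: `attach_attrs` and its shape are hypothetical, with slices standing in for the actual Arrow arrays and record batches. It shows why attributes get dropped when the root IDs are visited out of order:

```rust
/// Hypothetical sketch of the decode-time matching: visit root rows,
/// advancing a cursor into an attributes batch sorted by parent_id.
fn attach_attrs(root_ids: &[u16], attr_parent_ids: &[u16], attr_vals: &[&str]) -> Vec<Vec<String>> {
    let mut out: Vec<Vec<String>> = vec![Vec::new(); root_ids.len()];
    let mut cursor = 0; // cursor into the attributes batch; never rewinds
    for (row, &id) in root_ids.iter().enumerate() {
        // skip attributes whose parent_id is behind the current root id
        while cursor < attr_parent_ids.len() && attr_parent_ids[cursor] < id {
            cursor += 1;
        }
        // collect attributes matching the current root id
        while cursor < attr_parent_ids.len() && attr_parent_ids[cursor] == id {
            out[row].push(attr_vals[cursor].to_string());
            cursor += 1;
        }
    }
    out
}

fn main() {
    // Root rows visited in sorted id order: every attribute is found.
    let sorted = attach_attrs(&[0, 1, 2], &[0, 1, 2], &["a", "b", "c"]);
    assert_eq!(sorted[1], vec!["b"]);
    // Root rows visited out of order: after visiting id=2 the cursor has
    // already passed parent_id=1, so the row with id=1 loses its attribute.
    let unsorted = attach_attrs(&[0, 2, 1], &[0, 1, 2], &["a", "b", "c"]);
    assert!(unsorted[2].is_empty());
}
```

Because the cursor only moves forward, the matching is essentially a merge join: it is correct only if both sides are visited in the same key order, which is exactly the assumption this PR repairs.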
The whole scheme is predicated on the assumption that we'll visit the `id`/`parent_id` columns in the same order. However, we initialize the cursor for the attributes record batch in sorted `parent_id` order, while the cursor for the root record batch was only initialized in sorted order of resource/scope ID.

The fix is to add the ID column to the multi-column sort used to initialize the cursor for the root record batch.
This PR also changes the multi-column sort used for the root record batch cursor init to avoid RowConverter, instead simply packing the IDs into a Vec of some unsigned int (with width depending on how many ID columns we're packing) and sorting that. This slightly improves performance (even after adding the additional column to the sort), and I added a benchmark to measure it. Bench: main 337µs, after: 305µs.
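As a rough illustration of the packed-key idea (the layout and `sort_order` function below are assumptions for the sketch, not the actual implementation), three u16 ID columns can be packed into one u64 per row, most-significant column first, so a single integer sort orders rows by (resource.id, scope.id, id):

```rust
/// Hypothetical sketch: pack (resource.id, scope.id, id) into one u64 key
/// per row and sort, returning the row visit order.
fn sort_order(resource: &[u16], scope: &[u16], id: &[u16]) -> Vec<usize> {
    let mut keys: Vec<(u64, usize)> = (0..id.len())
        .map(|row| {
            // resource.id in the high bits, id in the low bits
            let key = ((resource[row] as u64) << 32)
                | ((scope[row] as u64) << 16)
                | (id[row] as u64);
            (key, row)
        })
        .collect();
    keys.sort_unstable(); // one integer comparison per element
    keys.into_iter().map(|(_, row)| row).collect()
}

fn main() {
    // Rows as (resource, scope, id): (0,5,1), (1,2,0), (0,5,0)
    let resource = [0u16, 1, 0];
    let scope = [5u16, 2, 5];
    let id = [1u16, 0, 0];
    // Sorted by (resource, scope, id): row 2, then row 0, then row 1
    assert_eq!(sort_order(&resource, &scope, &id), vec![2, 0, 1]);
}
```

Compared to a RowConverter-style encoding, each comparison here is a single u64 compare with no per-row variable-length encoding, which is consistent with the small speedup the benchmark shows.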
Note: one side effect of this change (which in my opinion is OK) is that all the rows with a null ID (e.g. rows with no attributes) will appear first in the decoded OTLP message. This is because Arrow typically uses 0 as a placeholder in integer arrays for null rows, and for best performance the sort we are doing does not consult the null buffer.
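A tiny sketch of that side effect (plain Rust, illustrative only): if the values buffer holds 0 in null slots and the sort ignores the validity bitmap, the null rows sort first:

```rust
fn main() {
    // Arrow convention: the values buffer holds 0 in null slots;
    // validity is tracked in a separate null buffer.
    let ids: [u16; 3] = [2, 0, 1]; // row 1 is logically null (0 is a placeholder)
    let is_valid = [true, false, true];
    let mut order: Vec<usize> = (0..ids.len()).collect();
    // value-only sort: never consults `is_valid`
    order.sort_unstable_by_key(|&row| ids[row]);
    assert_eq!(order, vec![1, 2, 0]); // the null row comes first
    assert!(!is_valid[order[0]]);
}
```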
What issue does this PR close?
How are these changes tested?
Unit tests
Are there any user-facing changes?
No