Fix duplicate attribute keys in transform_attributes#2423
Fix duplicate attribute keys in transform_attributes#2423gyanranjanpanda wants to merge 8 commits intoopen-telemetry:mainfrom
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2423 +/- ##
==========================================
- Coverage 88.23% 88.21% -0.02%
==========================================
Files 639 639
Lines 242568 243089 +521
==========================================
+ Hits 214018 214451 +433
- Misses 28026 28114 +88
Partials 524 524
🚀 New features to boost your workflow:
|
a210873 to
361e6bd
Compare
|
@albertlockett and @ThomsonTan waiting for your feed backl |
albertlockett
left a comment
There was a problem hiding this comment.
Hey @gyanranjanpanda . I appreciate you taking the time to look at this, but I don't think we can accept this PR as is.
Unfortunately, the benchmarks we have for this code on main are currently broken. But when I apply the fix from #2426 and run the benchmark we see that this change introduces significant performance regression:
transform_attributes_dict_keys/single_replace_no_deletes/keys=32,rows=128,rows_per_key=4
time: [5.1300 µs 5.1348 µs 5.1394 µs]
change: [+1027.4% +1031.5% +1035.2%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/single_replace_single_delete/keys=32,rows=128,rows_per_key=4
time: [5.5027 µs 5.5091 µs 5.5155 µs]
change: [+495.01% +497.37% +499.48%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/no_replace_single_delete/keys=32,rows=128,rows_per_key=4
time: [5.3440 µs 5.3584 µs 5.3746 µs]
change: [+577.41% +580.27% +583.40%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/single_replace_no_deletes/keys=32,rows=1536,rows_per_key=48
time: [34.015 µs 34.050 µs 34.086 µs]
change: [+4000.2% +4016.4% +4031.3%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/single_replace_single_delete/keys=32,rows=1536,rows_per_key=48
time: [34.390 µs 34.472 µs 34.562 µs]
change: [+1421.9% +1433.5% +1443.9%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/no_replace_single_delete/keys=32,rows=1536,rows_per_key=48
time: [34.302 µs 34.340 µs 34.379 µs]
change: [+1562.1% +1568.0% +1573.6%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/single_replace_no_deletes/keys=32,rows=8192,rows_per_key=256
time: [171.62 µs 171.78 µs 171.96 µs]
change: [+6262.2% +6290.6% +6316.2%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/single_replace_single_delete/keys=32,rows=8192,rows_per_key=256
time: [171.79 µs 171.92 µs 172.06 µs]
change: [+1771.2% +1835.7% +1893.0%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/no_replace_single_delete/keys=32,rows=8192,rows_per_key=256
time: [171.20 µs 171.35 µs 171.49 µs]
change: [+1962.8% +1981.5% +1998.1%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/single_replace_no_deletes/keys=128,rows=128,rows_per_key=1
time: [4.9566 µs 4.9693 µs 4.9819 µs]
change: [+587.52% +592.02% +597.47%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/single_replace_single_delete/keys=128,rows=128,rows_per_key=1
time: [5.6185 µs 5.6284 µs 5.6377 µs]
change: [+292.54% +294.19% +296.01%] (p = 0.00 < 0.05)
Performance has regressed.
transform_attributes_dict_keys/no_replace_single_delete/keys=128,rows=128,rows_per_key=1
time: [5.2733 µs 5.2831 µs 5.2938 µs]
change: [+385.50% +387.73% +389.92%] (p = 0.00 < 0.05)
Performance has regressed.
While I expect to see some performance regression because we're doing extra work, I feel that such a serious regression in performance warrants some additional investigation into if/how we can do this in a more efficient way.
Please see my comment here which prescribes an approach that I believe will be more performant than what is currently in this PR: #1650 (comment)
361e6bd to
06392eb
Compare
|
thanks for your wonderful guidance i will make sure i could match your expectation |
|
Hey @gyanranjanpanda I wanted to give you a heads up that I am going to be working on #2014 and there may be some significant changes to the transform_attributes code. I will be touching code in transform_keys as well as transform_attributes_impl. I wanted to give you a heads up in case you want to hold off advancing your work until you can better understand the conflicts |
|
Thanks for the heads up! I’ll keep an eye on your changes to #2014 and try to align my work accordingly. If possible, could you share which parts might be most affected so I can avoid overlap? or should i wait after u finished your work i should continue this work |
It's probably easiest to hold off until I finish to avoid conflicts, but I'll leave it up to you. I think I should have the changes I need to make for #2014 done by early next week, if not sooner. For now, I'll show you the in-progress changes: I was imagining that for #1650 you'd need to make changes to |
|
@gyanranjanpanda the changes I mentioned that could cause conflicts have now been merged (see #2442) |
|
i will fix this code as soon as possible while looking your merged pr |
2d813af to
67e366e
Compare
|
@albertlockett Thanks for the detailed benchmark feedback! I have completely reworked the approach based on your guidance. What changed:
Benchmark results (no regression):
The plan-based approach avoids the expensive |
…metry#1650) When renaming attribute key 'x' to 'y', any existing row with key 'y' sharing a parent_id with a row having key 'x' would produce a duplicate. This commit fixes that by: - Adding find_rename_collisions_to_delete_ranges() which uses IdBitmap to efficiently detect these collisions in O(N) time - Generating KeyTransformRange::Delete entries that are merged into the existing transform pipeline in transform_keys() and transform_dictionary_keys() - Fixing an early-return in transform_dictionary_keys() that skipped row-level collision deletes when dictionary values had no deletions - Adding read_parent_ids_as_u32() helper for parent_id column access - Adding test_rename_removes_duplicate_keys integration test Only runs collision detection when parent_ids are plain-encoded (not transport-optimized) to avoid incorrect results from quasi-delta encoded values. Closes open-telemetry#1650
67e366e to
71f2ee6
Compare
albertlockett
left a comment
There was a problem hiding this comment.
Looks like some good progress, but still some things happening that are not as well optimized as they could be
…anges Addresses @albertlockett's review feedback: - Extract sorted_merge_into_vec helper to DRY up sorted merge pattern - Extend merge_transform_ranges to accept collision_delete_ranges as a third parameter, performing a single-pass 3-way merge - Remove duplicate sorted-merge code from transform_keys and transform_dictionary_keys - Preserve zero-copy Cow::Borrowed fast path when no collision deletes are present
- Add missing FieldExt trait import in upsert_tests module so the simultaneous rename+delete collision tests compile - Add parent_id column to 4 pre-existing tests that broke after enforcing parent_id as required per OTAP spec: - test_transform_attrs_keys_dict_encoded - test_transform_attrs_u16_keys - test_with_stats_utf8_rename_and_delete - test_with_stats_dict_rename_and_delete All 291 transform tests and 22 attributes_processor tests pass.
| let old_key_mask = eq(key_col, &StringArray::new_scalar(old_key)).map_err(|e| { | ||
| Error::UnexpectedRecordBatchState { | ||
| reason: format!("eq kernel failed for old_key: {e}"), | ||
| } | ||
| })?; |
There was a problem hiding this comment.
There is still quite a performance regression from what is on main. For example:
transform_attributes_native_keys/block_replace_no_delete/rows=1536
time: [5.9484 µs 5.9712 µs 5.9930 µs]
change: [+241.53% +256.29% +269.79%] (p = 0.00 < 0.05)
Performance has regressed.
When I profile this, I see we're spending a lot of time in the eq compute kernel:

I think we need to optimize how we check for the presence of the existing keys.
We actually have a highly optimized kernel for checking if the keys match some given value, which I think is what we should use here:
otel-arrow/rust/otap-dataflow/crates/pdata/src/otap/transform.rs
Lines 2377 to 2459 in 9c54c8e
One caveat about this is that, it only works on the offset/values buffer from the arrow string arrays. That means if the keys column happens to be dictionary encoded, we can only run this on the dictionary values.
There was a problem hiding this comment.
Also, because we're checking for the existence of the old keys both in this method, and in plan_key_replacements:
https://github.com/open-telemetry/otel-arrow/blob/main/rust/otap-dataflow/crates/pdata/src/otap/transform.rs#L2282
It'd be nice if we can avoid checking that twice, but we'd need to dramatically refactor how this function is called in order to do that (which we may want to do).
If that refactoring is not possible, we could consider the fact that it may be somewhat rare that someone may have existing keys that would become duplicates via renaming. Given this fact, imo it might be better logic here to first check for the existing of the new key, and if no rows are found, we exist early (instead of how we're currently doing it, where we look for old keys first).
| for (start, end) in BitSliceIterator::new(old_key_mask.values().inner(), 0, num_rows) { | ||
| for i in start..end { | ||
| let pid: u64 = parent_ids.value(i).into(); | ||
| source_parents.insert(pid as u32); | ||
| } | ||
| } | ||
|
|
||
| if source_parents.is_empty() { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Careful about the unnecessary work here - we don't actually need to load all the IDs into the ID bitmap before returning early. We may be able to use the result of having checked for the existence of the key to determine if we can continue early on this iteration of the loop.
| let mask = eq(key_col, &scalar).map_err(|e| Error::UnexpectedRecordBatchState { | ||
| reason: format!("eq kernel failed on attribute keys: {e}"), | ||
| })?; | ||
| if mask.true_count() > 0 { | ||
| return Ok(true); | ||
| } |
There was a problem hiding this comment.
Similar to the comment I've made on the code below, the eq compute kernel is kind of expensive, and we might be able to use the more optimized
otel-arrow/rust/otap-dataflow/crates/pdata/src/otap/transform.rs
Lines 2377 to 2384 in 9c54c8e
That said, I also feel that the first step of find_rename_collisions_to_delete_ranges maybe should be be to call this for the new keys (again, see my comments on code below), so given that fact - we might want to be careful about doing duplicate work.
albertlockett
left a comment
There was a problem hiding this comment.
hey @gyanranjanpanda - sorry it took me some time to review the last round of changes. The code looks like it's in much better shape, thanks for all your work!
The performance is still not quite where we need it to be. I left some suggestions about how things could maybe be improved - specifically around how we're using the eq kernel.
I also noticed that these changes break the existing benchmarks, which would make this perf regression hard for you to measure locally. I've pushed a fix to my branch here: 32a55e5 (which you may actually want to cherry-pick).
FWIW, instructions for profiling have also beend added here https://github.com/open-telemetry/otel-arrow/blob/main/rust/otap-dataflow/PROFILING.md . It's possible to use these same commands while running the benchmarks.
Addresses mentor review feedback on PR open-telemetry#2423: 1. Replace expensive arrow eq() compute kernel in find_rename_collisions_to_delete_ranges with direct offset/values buffer comparison (matching the optimized kernel pattern used by find_matching_key_ranges). This eliminates the kernel dispatch overhead that was causing 6000%+ latency regressions in benchmarks. 2. Reorder collision logic to check new_key (target) first. Since collisions are rare (the renamed target key rarely already exists), this provides an early exit in the common case before any IdBitmap work is done. 3. Defer IdBitmap population until after confirming both old_key and new_key exist, avoiding unnecessary bitmap allocations and clears. 4. Rewrite rename_has_target_key_in_column to use the same optimized raw buffer scan instead of the eq kernel. 5. Add parent_id column to generate_native_keys_attr_batch in benchmarks (cherry-pick from mentor's commit 32a55e5) to fix benchmark failures with the collision detection code that now requires parent_id. Also adds extract_dict_string_values and key_bytes_exist_in_buffer helper functions that handle both native StringArray and dictionary- encoded key columns.
Addresses mentor review feedback on PR open-telemetry#2423: 1. Replace expensive arrow eq() compute kernel in find_rename_collisions_to_delete_ranges with direct offset/values buffer comparison (matching the optimized kernel pattern used by find_matching_key_ranges). This eliminates the kernel dispatch overhead that was causing 6000%+ latency regressions in benchmarks. 2. Reorder collision logic to check new_key (target) first. Since collisions are rare (the renamed target key rarely already exists), this provides an early exit in the common case before any IdBitmap work is done. 3. Defer IdBitmap population until after confirming both old_key and new_key exist, avoiding unnecessary bitmap allocations and clears. 4. Rewrite rename_has_target_key_in_column to use the same optimized raw buffer scan instead of the eq kernel. 5. Add parent_id column to generate_native_keys_attr_batch in benchmarks (cherry-pick from mentor's commit 32a55e5) to fix benchmark failures with the collision detection code that now requires parent_id. Also adds extract_dict_string_values and key_bytes_exist_in_buffer helper functions that handle both native StringArray and dictionary- encoded key columns.
47b7ec4 to
0762cc5
Compare
Addresses mentor review feedback on PR open-telemetry#2423: 1. Replace expensive arrow eq() compute kernel in find_rename_collisions_to_delete_ranges with direct offset/values buffer comparison (matching the optimized kernel pattern used by find_matching_key_ranges). This eliminates the kernel dispatch overhead that was causing 6000%+ latency regressions in benchmarks. 2. Reorder collision logic to check new_key (target) first. Since collisions are rare (the renamed target key rarely already exists), this provides an early exit in the common case before any IdBitmap work is done. 3. Defer IdBitmap population until after confirming both old_key and new_key exist, avoiding unnecessary bitmap allocations and clears. 4. Rewrite rename_has_target_key_in_column to use the same optimized raw buffer scan instead of the eq kernel. 5. Add parent_id column to generate_native_keys_attr_batch in benchmarks (cherry-pick from mentor's commit 32a55e5) to fix benchmark failures with the collision detection code that now requires parent_id. Also adds extract_dict_string_values and key_bytes_exist_in_buffer helper functions that handle both native StringArray and dictionary- encoded key columns.
0762cc5 to
1fb1c23
Compare
|
@albertlockett could u rview this now |
albertlockett
left a comment
There was a problem hiding this comment.
Thanks for the latest round of changes @gyanranjanpanda !
Still some performance issues with this code that I feel we should address
| let dict_keys: Vec<usize> = match key_col.data_type() { | ||
| DataType::Dictionary(k, _) => match k.as_ref() { | ||
| DataType::UInt8 => key_col | ||
| .as_any() | ||
| .downcast_ref::<DictionaryArray<UInt8Type>>() | ||
| .expect("checked type") | ||
| .keys() | ||
| .values() | ||
| .iter() | ||
| .map(|v| *v as usize) | ||
| .collect(), | ||
| DataType::UInt16 => key_col | ||
| .as_any() | ||
| .downcast_ref::<DictionaryArray<UInt16Type>>() | ||
| .expect("checked type") | ||
| .keys() | ||
| .values() | ||
| .iter() | ||
| .map(|v| *v as usize) | ||
| .collect(), | ||
| _ => unreachable!("unsupported dict key type"), | ||
| }, | ||
| _ => unreachable!("checked dictionary type"), |
There was a problem hiding this comment.
We're eagerly collecting the dictionary keys into a Vec<usize>, and later on we're actually just iterating over the vec. Doing this collection seems wasteful.
I noticed we still have a performance regression in one of the existing benchmarks:
transform_attributes_dict_keys/single_replace_no_deletes/keys=128,rows=8192,rows_per_key=64
time: [1.4321 µs 1.4362 µs 1.4401 µs]
change: [+172.60% +174.04% +175.64%] (p = 0.00 < 0.05)
Performance has regressed.
There was a problem hiding this comment.
I actually think we can avoid materializing the vec, and just take the array. See my comment on the code below.
There was a problem hiding this comment.
Also, we have variants of crate::error::Error that can be used for invalid dictionary types instead of using unreachable! here. I wonder if we should either use those, or comment on why the code is actually unreachable
| for dict_val_idx in range.start()..range.end() { | ||
| for (row, dk) in dict_keys.iter().enumerate() { | ||
| if *dk == dict_val_idx { | ||
| let pid: u64 = parent_ids.value(row).into(); | ||
| source_parents.insert(pid as u32); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
For each value in the range, we iterate the entire dictionary keys array and check if the index from the range is equal to the key? If this range has a size of greater than one, this is not a very efficient way to do this check.
I actually think if you just took the dictionary keys as an arrow array (e.g. avoid materializing the Vec, as mentioned above). Then here it would be faster to do something like:
let row_mask = if range.len() == 1 {
eq(dict_keys, UInt16Array::new_scalar(range.start() as u16))?
} else {
let geq_start = gt_eq(dict_keys, UInt16Array::new_scalar(range.start() as u16))?;
let lt_end = lt(dict_keys, UInt16Array::new_scalar(range.end() as u16))?;
and(&geq_start, <_end)
}
let row_mask_buffer = row_mask.values();
for (start, end) in BitSliceIter::new(row_mask_buffer.inner(), row_mask_buffer.offset(), row_mask.len()) {
for i in start..end {
let pid: u64 = parent_ids.value(i).into();
source_parents.insert(pid as u32);
}
}see: https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/index.html
see: https://docs.rs/arrow-buffer/latest/arrow_buffer/bit_iterator/struct.BitSliceIterator.html
| /// dictionary-encoded attribute keys. Verifies that collision removal and real | ||
| /// deletes interact correctly through the dictionary key transform path. | ||
| #[test] | ||
| fn test_rename_collision_with_real_delete_dict() { |
There was a problem hiding this comment.
could you add an additional test where the dict key is for the key column is u16 as well?

Fix Duplicate Attribute Keys in
transform_attributesChanges Made
This PR resolves issue #1650 by ensuring that dictionary keys are deduplicated when transformations such as
renameare applied, as required by the OpenTelemetry specification ("Exported maps MUST contain only unique keys by default").To accomplish this while maintaining strict performance requirements, we replaced the previous
RowConverterdeduplication strategy with a new high-performance, proactive pre-filter:filter_rename_collisionsintotransform_attributes_implinsideotap-dataflow/crates/pdata/src/otap/transform.rs.parent_ids and target keys. It uses theIdBitmaptype to find any existing target keys whoseparent_idmaps back to an old key that will be renamed.arrow::compute::filter_record_batchbefore the actual transform happens.Testing
AttributesProcessorunit tests (test_rename_removes_duplicate_keys) to explicitly verify that renaming an attribute resulting in a collision automatically discards duplicate keys.AttributesTransformPipelineStageinquery-enginetests with a parallel case ensuring OPL/KQL query pipelines (project-rename) properly drop duplicates when resolving duplicates.otap_df_pdatatransform.rstests to properly expect deduplicated keys using this plan-based method.cargo test --workspace --all-features.Validation Results
All tests pass. OTel semantic rules surrounding unique mapped keys map cleanly through down/upstream processors. The
IdBitmapintersection approach completely resolves the multi-thousand percentRowConverterperformance regressions, dropping collision resolution overhead to essentially zero through efficient bitmap operations.