@theomagellan
Contributor

Description

useExpandedValue would resolve ExpandedValues after checking whether a collection's element type was "stringy." For collections of structs, isStringy always returns false, so every ExpandedValue was sanitized to its parsed value, which broke decoding of string fields whenever the parsed value was not a string.

Now, useExpandedValue checks whether it is dealing with a collection of structs; if so, it skips sanitization and relies on mapstructure's per-field decoding.
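
To make the failure mode concrete, here is a minimal sketch using only the standard library. It is not the collector's code: `expandedValue`, `pair`, and `resolveEarly` are hypothetical stand-ins, and the "stringy" check is reduced to a plain `reflect.String` comparison. It only illustrates why a decision based on the collection's element type picks the parsed value for a slice of structs.

```go
package main

import (
	"fmt"
	"reflect"
)

// expandedValue is a hypothetical stand-in for confmap's ExpandedValue,
// e.g. the result of expanding an environment variable reference.
type expandedValue struct {
	Value    any    // parsed form, e.g. int(123)
	Original string // raw text, e.g. "123"
}

// pair is a hypothetical struct element, similar in spirit to configopaque.Pair.
type pair struct {
	Name  string
	Value string
}

// resolveEarly models the old decision (assumption: simplified): it looks only
// at the destination collection's element type and keeps the raw string only
// when that element type is itself a string.
func resolveEarly(ev expandedValue, dest any) any {
	t := reflect.TypeOf(dest)
	if t != nil && t.Kind() == reflect.Slice && t.Elem().Kind() == reflect.String {
		return ev.Original
	}
	// A struct element is never "stringy", so the parsed value wins here,
	// even if the field that eventually receives it is a string.
	return ev.Value
}

func main() {
	ev := expandedValue{Value: 123, Original: "123"}
	fmt.Printf("%T\n", resolveEarly(ev, []string{})) // string
	fmt.Printf("%T\n", resolveEarly(ev, []pair{}))   // int
}
```

In the second call the `int` would later have to be assigned to `pair.Value`, a string field, which is where a decoding error of the kind described above would surface.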

Testing

Added tests relying on configopaque.MapList, whose underlying type is the struct collection []Pair.

theomagellan and others added 2 commits January 8, 2026 18:07
… unmarshalling a confmap

  `useExpandedValue` would resolve ExpandedValues after checking if a
  collection's type was "stringy." For collections of structs, `isStringy`
  always returns false, causing all ExpandedValues to be sanitized to their
  parsed values and breaking decoding of string fields if the parsed
  value was not `string`.

  Now, `useExpandedValue` checks if it's dealing with a collection of
  structs; in which case, skip sanitization and rely on mapstructure's
  per-field decoding.
@theomagellan theomagellan marked this pull request as ready for review January 12, 2026 16:21
@theomagellan theomagellan requested review from a team, evan-bradley and mx-psi as code owners January 12, 2026 16:21
@theomagellan
Contributor Author

As discussed with @mx-psi, I'm kindly pinging @jade-guiton-dd to this PR!

@codecov

codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.79%. Comparing base (97f66a9) to head (80a019a).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| confmap/internal/decoder.go | 0.00% | 5 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14413      +/-   ##
==========================================
- Coverage   91.83%   91.79%   -0.04%     
==========================================
  Files         677      677              
  Lines       42679    42680       +1     
==========================================
- Hits        39195    39180      -15     
- Misses       2427     2439      +12     
- Partials     1057     1061       +4     

@jade-guiton-dd
Contributor

jade-guiton-dd commented Jan 14, 2026

Hello 👋 Sorry, even after reading the existing code and your PR, I still don't really understand what the problem was. Could you expand on:

causing all ExpandedValues to be sanitized to their parsed values and breaking decoding of string fields if the parsed value was not string.

Reading the code, it seems to me that an array of structs should already be kept as-is by the sanitization operation. Did I miss something?

@theomagellan
Contributor Author

Reading the code, it seems to me that an array of structs should already be kept as-is by the sanitization operation. Did I miss something?

The issue used to happen in cases like what this test does.

When decoding an ExpandedValue inside a struct field in a collection, the code would decode the ExpandedValue as its Value field (of any type), because a struct type was deemed not stringy.
This broke cases like the one above: later in the unmarshalling process, we would try to set the already-decoded value on a struct field that would have been treated as stringy had the decoding been delayed.

Commenting out the fix and running the added tests fails with the error:

FAIL: TestMapListWithExpandedValueIntValue (0.00s)
[...]
'headers[0].value' expected type 'configopaque.String', got unconvertible type 'int'

This happens because the ExpandedValue was decoded as its Value: useExpandedValue called isStringyStructure on the element type of the collection (the collection is []Pair, so the element type is Pair), which is not a stringy type.

Sorry if I wasn't clear enough before, I hope my explanations help!
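
The idea of delaying the decision until the target field is known can be sketched as follows. This is a simplified model built on `reflect`, not mapstructure itself: `decodeStruct`, `opaqueString`, `pair`, and `expandedValue` are hypothetical names, and field lookup by lowercased name is an assumption made for brevity.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// expandedValue is a hypothetical stand-in for confmap's ExpandedValue.
type expandedValue struct {
	Value    any    // parsed form, e.g. int(123)
	Original string // raw text, e.g. "123"
}

// opaqueString and pair are hypothetical, loosely mirroring configopaque types.
type opaqueString string

type pair struct {
	Name  string
	Value opaqueString
}

// decodeStruct fills the struct pointed to by dst from src. For each field it
// picks the representation of an expandedValue based on that field's kind.
func decodeStruct(dst any, src map[string]any) error {
	ptr := reflect.ValueOf(dst)
	if ptr.Kind() != reflect.Pointer || ptr.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("dst must be a pointer to a struct, got %T", dst)
	}
	v := ptr.Elem()
	for i := 0; i < v.NumField(); i++ {
		key := strings.ToLower(v.Type().Field(i).Name)
		raw, ok := src[key]
		if !ok {
			continue
		}
		field := v.Field(i)
		if ev, isExpanded := raw.(expandedValue); isExpanded {
			if field.Kind() == reflect.String {
				raw = ev.Original // string field: keep the raw text
			} else {
				raw = ev.Value // any other field: use the parsed value
			}
		}
		rv := reflect.ValueOf(raw)
		if !rv.IsValid() || rv.Kind() != field.Kind() || !rv.Type().ConvertibleTo(field.Type()) {
			return fmt.Errorf("'%s' expected type '%s', got unconvertible type '%T'", key, field.Type(), raw)
		}
		field.Set(rv.Convert(field.Type()))
	}
	return nil
}

func main() {
	var p pair
	src := map[string]any{"name": "token", "value": expandedValue{Value: 123, Original: "123"}}
	if err := decodeStruct(&p, src); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", p)

	// If the value was already resolved to an int before reaching the field,
	// the same decoder has no way to recover the original text.
	var q pair
	fmt.Println(decodeStruct(&q, map[string]any{"value": 123}))
}
```

The second call models the pre-fix situation: once the parsed `int` has replaced the ExpandedValue, the string field can no longer be populated.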

@jade-guiton-dd
Contributor

jade-guiton-dd commented Jan 14, 2026

Ah, it seems there was a misunderstanding on my part (isStringyStructure checks the type of the destination, whereas sanitizeExpanded switches over the type of the source). The PR makes sense to me now.
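
The distinction between destination-driven and source-driven logic can be illustrated with a small sketch. The `sanitize` function below is a hypothetical approximation of a walk that switches over the type of the source data; it is not copied from the collector, and `expandedValue` is again a stand-in type.

```go
package main

import "fmt"

// expandedValue is a hypothetical stand-in for confmap's ExpandedValue.
type expandedValue struct {
	Value    any
	Original string
}

// sanitize walks the source data and replaces every expandedValue it finds.
// The choice between Original and Value is made once by the caller, so it
// applies uniformly to everything nested inside src.
func sanitize(src any, useOriginal bool) any {
	switch v := src.(type) {
	case map[string]any:
		out := make(map[string]any, len(v))
		for k, e := range v {
			out[k] = sanitize(e, useOriginal)
		}
		return out
	case []any:
		out := make([]any, len(v))
		for i, e := range v {
			out[i] = sanitize(e, useOriginal)
		}
		return out
	case expandedValue:
		if useOriginal {
			return v.Original
		}
		return v.Value
	default:
		return src
	}
}

func main() {
	// Source data for a collection of structs: a slice of maps.
	headers := []any{
		map[string]any{"name": "token", "value": expandedValue{Value: 123, Original: "123"}},
	}
	out := sanitize(headers, false).([]any)
	fmt.Printf("%T\n", out[0].(map[string]any)["value"]) // int
}
```

Because the walk follows the source shape, a slice of maps is traversed all the way down, and the nested ExpandedValue is replaced even though the destination check only looked at the struct element type.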

One thing I thought about is that, in the same way your PR delegates to mapstructure the task of inspecting the struct so we can more accurately tell when we're assigning ExpandedValues to strings, have you tried completely deleting the switch to.Kind() { blocks, so that arrays/slices/maps all get mapped by mapstructure instead of the sanitize functions, and we just rely on the initial test to identify values to keep as strings? It's possible there's an edge case I'm not thinking of where that wouldn't work, but I feel like that would be worth a try to simplify the logic instead of making it more complex.

@theomagellan
Contributor Author

I tested it and it also solves the issue and doesn't break any other tests.
My understanding of the code was that these checks were here for performance improvements, as for cases like []string we can avoid subsequent hook calls since we already know that the collection is only composed of stringy elements.

I do agree that removing the switch altogether would make the code simpler and would also resolve any edge cases like the one I brought up. What do you think?

@jade-guiton-dd
Contributor

Regardless of Pablo's intent 2 years ago, I suspect it's not actually an optimization: I think mapstructure will walk the sanitized slice/map returned by the hook and call it again on each contained item regardless. So I don't think removing the switch should cause a performance regression; I'd say it's an unconditional improvement.

However, I'm not 100% confident that this won't break someone somewhere (mapstructure's behavior and the interactions between our hooks can be a bit arcane in my experience), so I think it may be prudent to introduce a beta feature gate and run the latter part of the function if the gate has been disabled. That way we have a quick solution if someone encounters breakage; and if we don't receive any complaints, we can stabilize the gate and fully remove the code in a few releases.
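
A rough sketch of the gated approach, under stated assumptions: `deferToFieldDecoding` is a hypothetical boolean standing in for a beta feature gate (beta gates are enabled by default), and `resolve` and `sanitize` are simplified stand-ins rather than the collector's functions. The point is only the shape of the control flow: the initial string test stays, and the legacy collection handling runs only when the gate is disabled.

```go
package main

import (
	"fmt"
	"reflect"
	"sync/atomic"
)

// expandedValue and pair are hypothetical stand-ins, as is everything below.
type expandedValue struct {
	Value    any
	Original string
}

type pair struct {
	Name  string
	Value string
}

// deferToFieldDecoding models a beta feature gate: on unless explicitly disabled.
var deferToFieldDecoding atomic.Bool

func init() { deferToFieldDecoding.Store(true) }

// sanitize replaces nested expandedValues, using Original or Value uniformly.
func sanitize(src any, useOriginal bool) any {
	switch v := src.(type) {
	case map[string]any:
		out := make(map[string]any, len(v))
		for k, e := range v {
			out[k] = sanitize(e, useOriginal)
		}
		return out
	case []any:
		out := make([]any, len(v))
		for i, e := range v {
			out[i] = sanitize(e, useOriginal)
		}
		return out
	case expandedValue:
		if useOriginal {
			return v.Original
		}
		return v.Value
	default:
		return src
	}
}

// resolve decides what to hand to the next decoding step for destination dest.
func resolve(dest, data any) any {
	t := reflect.TypeOf(dest)
	// Initial test: a leaf expandedValue is resolved against its own destination.
	if ev, ok := data.(expandedValue); ok {
		if t != nil && t.Kind() == reflect.String {
			return ev.Original
		}
		return ev.Value
	}
	if deferToFieldDecoding.Load() {
		// Gate enabled: leave collections untouched for per-element decoding.
		return data
	}
	// Gate disabled: legacy early sanitization, previous behavior included.
	useOriginal := t != nil && t.Kind() == reflect.Slice && t.Elem().Kind() == reflect.String
	return sanitize(data, useOriginal)
}

func main() {
	data := []any{map[string]any{"value": expandedValue{Value: 123, Original: "123"}}}
	for _, enabled := range []bool{true, false} {
		deferToFieldDecoding.Store(enabled)
		out := resolve([]pair{}, data).([]any)
		fmt.Printf("gate=%v nested type=%T\n", enabled, out[0].(map[string]any)["value"])
	}
}
```

With the gate enabled, the nested value is still an `expandedValue` when it reaches the element decoder; with it disabled, the value has already become an `int`, which reproduces the original bug by design.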

featuregate.StageBeta,
featuregate.WithRegisterFromVersion("v0.144.0"),
featuregate.WithRegisterDescription("Disables early sanitization of ExpandedValue during config unmarshalling, allowing mapstructure to handle type conversion at the field level. Fixes decoding errors when environment variable values are parsed as non-string types (e.g., numbers, booleans) but need to be assigned to string fields."),
featuregate.WithRegisterReferenceURL("https://github.com/open-telemetry/opentelemetry-collector/pull/14413#issuecomment-3754949484"),
Contributor Author

All other featuregates reference issues directly instead of PRs.
Should I create an issue or is this good enough? Sorry.

Contributor

I think referencing a PR should be fine.

Comment on lines 102 to 105
// converted based on its target field type.
if elemType.Kind() == reflect.Struct {
	return data, nil
}
Contributor Author

@theomagellan theomagellan Jan 15, 2026

I still kept the original fix inside the old behavior since it's still required to fix the problem I encountered.
I'm happy to remove it if you think it's unnecessary.

Contributor

Hm, I doubt the original fix would cause issues, but out of caution, I'd rather have disabling the feature gate return to the previous behavior, bug included. If someone's setup is broken by the new fix, I think there's a reasonable chance it would be broken by the original fix as well.

@mx-psi
Copy link
Member

mx-psi commented Jan 15, 2026

Regardless of Pablo's intent 2 years ago, I suspect it's not actually an optimization: I think mapstructure will walk the sanitized slice/map returned by the hook and call it again on each contained item regardless. So I don't think removing the switch should cause a performance regression; I'd say it's an unconditional improvement.

To be clear, although I don't remember exactly what I was thinking at that point, I don't think my intent here was performance-related; it was probably just an oversight or something behavior-related.

However, I'm not 100% confident that this won't break someone somewhere (mapstructure's behavior and the interactions between our hooks can be a bit arcane in my experience), so I think it may be prudent to introduce a beta feature gate and run the latter part of the function if the gate has been disabled. That way we have a quick solution if someone encounters breakage; and if we don't receive any complaints, we can stabilize the gate and fully remove the code in a few releases.

I agree with this, mapstructure is always arcane!

  - revert to old behavior when feature gate disabled, bug included
  - test unmarshalling with both behaviors and expect error when feature
    gate disabled
Contributor

@jade-guiton-dd jade-guiton-dd left a comment

Looks good, just need to run make gotidy and add a change log entry (make chlog-new).

Codecov is claiming the old code in the if is completely untested, so I would also recommend confirming with a debugger whether the old code is being reached in the latter half of TestMapListWithExpandedValueIntValue. It could just be an issue with Codecov however.

@theomagellan theomagellan force-pushed the fix-expandedvalue-sanitization-on-struct-collection branch from 0e6ef84 to 3f49486 Compare January 16, 2026 12:28
}

// TestStringyStructureWithExpandedValue tests the isStringyStructure path in useExpandValue
func TestStringyStructureWithExpandedValue(t *testing.T) {
Contributor Author

With this new test, we should now cover all cases in both the new and old behavior.

@theomagellan theomagellan force-pushed the fix-expandedvalue-sanitization-on-struct-collection branch from 3f49486 to 68015e6 Compare January 16, 2026 12:50
  - renamed featuregate
  - added test with a stringy collection
  - added changelog entry
Co-authored-by: Jade Guiton <jade.guiton@datadoghq.com>
@theomagellan theomagellan force-pushed the fix-expandedvalue-sanitization-on-struct-collection branch from 68015e6 to 97efc61 Compare January 16, 2026 13:09
  The e2e test module was not recording coverage for the parent
  `confmap/internal` package that it tests. This caused test coverage
  to appear as 0% even though the tests exercise the code.

  This adds the parent package to COVER_PKGS so coverage of
  `confmap/internal` is properly tracked for e2e tests.
include ../../../Makefile.Common

# Override COVER_PKGS to include the parent package that this e2e module tests
COVER_PKGS := go.opentelemetry.io/collector/confmap/internal,$(COVER_PKGS)
Contributor Author

This seems to be needed since I am covering code in confmap/internal from the e2e module.
This feels out of scope of the PR to me and I would appreciate any suggestions on how we could handle this coverage issue!

Member

I think we can do this in a separate PR if you are up to contributing it. In any case it's not a requirement for merging this PR (codecov is not a required check)

Contributor Author

Thank you. I will revert this commit and open a new PR.
In the meantime, do you have any idea why the changelog validation check is unhappy? I don't really understand the issue here.

Contributor

@jade-guiton-dd jade-guiton-dd Jan 16, 2026

I think that's an issue from other changelogs on main, so no worries.

Member

@mx-psi mx-psi left a comment

Almost there!

theomagellan and others added 2 commits January 19, 2026 14:11
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
@mx-psi
Member

mx-psi commented Jan 20, 2026

Can you solve the merge conflict? Once you have dealt with that I can merge it

@mx-psi mx-psi enabled auto-merge January 20, 2026 15:09
@mx-psi mx-psi added this pull request to the merge queue Jan 20, 2026
Merged via the queue into open-telemetry:main with commit 246e428 Jan 20, 2026
60 of 61 checks passed
@otelbot
Contributor

otelbot bot commented Jan 20, 2026

Thank you for your contribution @theomagellan! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey.

github-merge-queue bot pushed a commit that referenced this pull request Jan 21, 2026
#### Description
The e2e test module was not recording coverage for the parent
`confmap/internal` package that it tests. This caused test coverage to
appear as 0% even though the tests exercise the code.

This adds the parent package to `COVER_PKGS` so coverage of
`confmap/internal` is properly tracked for e2e tests.
When looking at overall coverage of `confmap/internal/decoder.go`, it
goes from 91.4% to 96.3% of statements.
#### Link to tracking issue

Issue was found in PR
#14413 (review).

#### Testing
Ran `make gotest-with-cover` without the fix, then checked the coverage
percentage with:
```bash
go tool covdata percent -i=./coverage/unit -pkg=go.opentelemetry.io/collector/confmap/internal
```
Running these commands again with the fix added shows the improvement.

The e2e tests can also be run individually to verify they're tracking
parent package coverage:
```bash
make -C confmap/internal/e2e test-with-cover
# [...] coverage: 45.3% of statements in go.opentelemetry.io/collector/confmap/internal, go.opentelemetry.io/collector/confmap/internal/e2e, )
```

3 participants