Skip to content

Add first test for dedup#166

Merged
himdel merged 37 commits intoansible:develfrom
Ladas:deduplication_tests
Jul 25, 2025
Merged

Add first test for dedup#166
himdel merged 37 commits intoansible:develfrom
Ladas:deduplication_tests

Conversation

@Ladas
Copy link
Copy Markdown
Contributor

@Ladas Ladas commented Jul 11, 2025

Issue: AAP-49237 & AAP-48948 & AAP-49231

Add a test demonstrating the dedup functionality

@Ladas Ladas force-pushed the deduplication_tests branch 3 times, most recently from fe1b705 to 9bad36d Compare July 13, 2025 15:22
@Ladas Ladas marked this pull request as draft July 13, 2025 22:13
@Ladas Ladas force-pushed the deduplication_tests branch 3 times, most recently from 03602b8 to ec1408a Compare July 14, 2025 09:01
@Ladas Ladas marked this pull request as ready for review July 14, 2025 09:55
@Ladas Ladas force-pushed the deduplication_tests branch 7 times, most recently from 7480119 to 7850a3f Compare July 17, 2025 08:37
Copy link
Copy Markdown
Contributor

@himdel himdel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See review comments, but overall LGTM :)

Comment thread metrics_utility/automation_controller_billing/dedup/ccsp.py Outdated
@sonarqubecloud
Copy link
Copy Markdown

Please retry analysis of this Pull-Request directly on SonarQube Cloud

Copy link
Copy Markdown
Member

@AlexSCorey AlexSCorey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code feels very dense to me. I've asked some questions mostly for my own clarity, though adding some details about how to test this would be helpful.


billing_data['serial'] = billing_data.apply(compute_serial, axis=1)

experimental_dedup = self.extra_params.get('deduplicator') == 'ccsp-experimental'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit weary of the experimental phrasing in this pr. If we are adding this to our code base it doesn't feel experimental any more, though I may be unaware of what part is experimental. Is the functionality experimental?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think we should rename to something else in another PR, something like "ccsp-advanced" or mabe dropping the ccsp entirely...

'host_names_before_dedup': 'combine_set',
}

def dedup(self, dataframe, hostname_mapping=None, scope_dataframe=None):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this function doesn't really dedupe anything. Should we call it something like facts_enrichment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is overridden dedup method that then calls the parent with

return super().dedup(dataframe, hostname_mapping)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but, it might be more memory effective to enhance the jobhost summary dataframe with facts later (after the number of rows is reduced)

"""Check if experimental deduplication is enabled."""
return self.extra_params.get('deduplicator') == 'ccsp-experimental'

def calculate_dedup_count(self, series):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the number of hosts that were deduplicated?

Copy link
Copy Markdown
Contributor Author

@Ladas Ladas Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is number of host before the new dedup, there are both count and list of hostnames columns

so it shows how many hosts were grouped together with the new dedup


or better, if you see this new dedup count column being bigger than 1, that means this new deduplication kicked in and multiple hosts were deduped

@Ladas Ladas force-pushed the deduplication_tests branch from 2dfd7be to 15c03ff Compare July 24, 2025 14:20
@Ladas Ladas force-pushed the deduplication_tests branch from 15c03ff to 07daa4c Compare July 24, 2025 14:36
Comment on lines +2266 to +2271
3.3. secure-host-01.company.com and secure-host-01-readonly.internal (privileged vs unprivileged):
- Same host accessed with different credentials
- Admin job has product_serial, user job doesn't
- Same machine_id in both cases 4f7a8b9c2d3e5f6a7b8c9d0e1f2a3b4c
- Result: Correctly merged based on machine_id
- Dedup: Working correctly - same machine_id causes deduplication
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong - if the user job doesn't have product serial, and uses different hostnames, it won't get merged.
We don't merge on machine_id alone when product_serial is missing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, yes, it's just a description error (they are correctly in false negatives category)

Comment on lines +2288 to +2294
4.2. nat-host-01.external and nat-host-02.external:
- Different hosts behind same NAT gateway
- NAT gateway's machine_id and serial exposed to both
- Same public IP address (203.0.113.10) with different ports (2201 and 2202)
- Result: Still merged (port not used in deduplication logic)
- Dedup: Serial based on product_serial/machine_id only - port ignored
- Note: This demonstrates a limitation where NAT gateway hosts are incorrectly merged
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suspicious - I'm seeing 2 different hostnames

why would the gateway's machine_id and serial propagate to both hosts?

(or is it because those are host_name, but ansible_host is 203.0.113.10 for both?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will fix the description

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw. I've added all original data into canonical facts, so these cases are visible:

'Canonical Facts': '{"ansible_host": ["203.0.113.10"], "ansible_machine_id": '
            '["639d3a53a94028d35a3f5f244793dad2"], "ansible_port": [2201, 2202], '
            '"ansible_product_serial": ["CN7792194B0NAT"], "host_name": '
            '["nat-host-01.external", "nat-host-02.external"]}',

shows there was one ansible_host and 2 host_names

@sonarqubecloud
Copy link
Copy Markdown

@himdel himdel merged commit afc880a into ansible:devel Jul 25, 2025
2 checks passed
MilanPospisil pushed a commit to MilanPospisil/metrics-utility that referenced this pull request Jan 6, 2026
* Add integration test testing new deduplication

* Fix dedup in invetory_scope

* Fix current test failing on new column thatshould show only with dedup

* Refactoring of handling host names before dedup

* Refactor host_names before dedup

* Organize helper functions better

* Add instructions on how to run tests in CI mode

Current approach is hacky, but it doesn;t require any code change.
We'll need to unify the CI run and the local run, or create
another docker compose.

* Test properly all sheets as dict comparissions

* Simplify host_names_before_decup

* Fix host before dedup and mock of report date

* Add more test cases for deduplication

* Extend the existing collector test to include new data for deduplication

* Expand on test cases

* Fix some failing tests

* Fix invalid collector

* More cleanup of the tests and regenerated data

* Fix failures of data_statuses assert

* Fix just a minimal version of gather, making sure it doesn;t throw an exception

* Add working gather test

* Add new validation, testing each deduplication usecase in detail

* Outline the test cases more clearly and add proper numbering

* Add more facts into collection

Will help to debug deduplication issues

* Fix ordering for asserts

* Update metrics_utility/automation_controller_billing/dataframe_engine/dataframe_jobhost_summary_usage.py

Co-authored-by: Martin Hradil <mhradil@redhat.com>

* Update metrics_utility/automation_controller_billing/dataframe_engine/dataframe_jobhost_summary_usage.py

Co-authored-by: Martin Hradil <mhradil@redhat.com>

* Update metrics_utility/automation_controller_billing/dedup/ccsp.py

Co-authored-by: Martin Hradil <mhradil@redhat.com>

* Change product_serial and machine_id to be realistic

* Verify and fix first set of cases

* Expand the data to be able to test all cases

* Fix the NAT example

* Fix case with 2 regions

* Fix test case 3.3

* Extract facts fro serial safely

if the fact key was missing, it was exploding

* Tweak the unit test after adding dedup of also main_host

* Remove test case 1.7, it's duplicate of 1.2

* Fix test order and cleanup desctiprion of case 5.1

* Fix wrong cases description

---------

Co-authored-by: Martin Hradil <mhradil@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants