Add first test for dedup#166
Conversation
fe1b705 to
9bad36d
Compare
03602b8 to
ec1408a
Compare
7480119 to
7850a3f
Compare
himdel
left a comment
There was a problem hiding this comment.
See review comments, but overall LGTM :)
|
Please retry analysis of this Pull-Request directly on SonarQube Cloud |
AlexSCorey
left a comment
There was a problem hiding this comment.
This code feels very dense to me. I've asked some questions mostly for my own clarity, though adding some details about how to test this would be helpful.
|
|
||
| billing_data['serial'] = billing_data.apply(compute_serial, axis=1) | ||
|
|
||
| experimental_dedup = self.extra_params.get('deduplicator') == 'ccsp-experimental' |
There was a problem hiding this comment.
I'm a bit weary of the experimental phrasing in this pr. If we are adding this to our code base it doesn't feel experimental any more, though I may be unaware of what part is experimental. Is the functionality experimental?
There was a problem hiding this comment.
yeah, I think we should rename to something else in another PR, something like "ccsp-advanced" or mabe dropping the ccsp entirely...
| 'host_names_before_dedup': 'combine_set', | ||
| } | ||
|
|
||
| def dedup(self, dataframe, hostname_mapping=None, scope_dataframe=None): |
There was a problem hiding this comment.
Looks like this function doesn't really dedupe anything. Should we call it something like facts_enrichment
There was a problem hiding this comment.
this is overridden dedup method that then calls the parent with
return super().dedup(dataframe, hostname_mapping)
There was a problem hiding this comment.
but, it might be more memory effective to enhance the jobhost summary dataframe with facts later (after the number of rows is reduced)
| """Check if experimental deduplication is enabled.""" | ||
| return self.extra_params.get('deduplicator') == 'ccsp-experimental' | ||
|
|
||
| def calculate_dedup_count(self, series): |
There was a problem hiding this comment.
Is this the number of hosts that were deduplicated?
There was a problem hiding this comment.
this is number of host before the new dedup, there are both count and list of hostnames columns
so it shows how many hosts were grouped together with the new dedup
or better, if you see this new dedup count column being bigger than 1, that means this new deduplication kicked in and multiple hosts were deduped
2dfd7be to
15c03ff
Compare
Current approach is hacky, but it doesn;t require any code change. We'll need to unify the CI run and the local run, or create another docker compose.
Will help to debug deduplication issues
…/dataframe_jobhost_summary_usage.py Co-authored-by: Martin Hradil <mhradil@redhat.com>
…/dataframe_jobhost_summary_usage.py Co-authored-by: Martin Hradil <mhradil@redhat.com>
Co-authored-by: Martin Hradil <mhradil@redhat.com>
15c03ff to
07daa4c
Compare
if the fact key was missing, it was exploding
| 3.3. secure-host-01.company.com and secure-host-01-readonly.internal (privileged vs unprivileged): | ||
| - Same host accessed with different credentials | ||
| - Admin job has product_serial, user job doesn't | ||
| - Same machine_id in both cases 4f7a8b9c2d3e5f6a7b8c9d0e1f2a3b4c | ||
| - Result: Correctly merged based on machine_id | ||
| - Dedup: Working correctly - same machine_id causes deduplication |
There was a problem hiding this comment.
this is wrong - if the user job doesn't have product serial, and uses different hostnames, it won't get merged.
We don't merge on machine_id alone when product_serial is missing.
There was a problem hiding this comment.
ah, yes, it's just a description error (they are correctly in false negatives category)
| 4.2. nat-host-01.external and nat-host-02.external: | ||
| - Different hosts behind same NAT gateway | ||
| - NAT gateway's machine_id and serial exposed to both | ||
| - Same public IP address (203.0.113.10) with different ports (2201 and 2202) | ||
| - Result: Still merged (port not used in deduplication logic) | ||
| - Dedup: Serial based on product_serial/machine_id only - port ignored | ||
| - Note: This demonstrates a limitation where NAT gateway hosts are incorrectly merged |
There was a problem hiding this comment.
suspicious - I'm seeing 2 different hostnames
why would the gateway's machine_id and serial propagate to both hosts?
(or is it because those are host_name, but ansible_host is 203.0.113.10 for both?)
There was a problem hiding this comment.
yes, will fix the description
There was a problem hiding this comment.
btw. I've added all original data into canonical facts, so these cases are visible:
'Canonical Facts': '{"ansible_host": ["203.0.113.10"], "ansible_machine_id": '
'["639d3a53a94028d35a3f5f244793dad2"], "ansible_port": [2201, 2202], '
'"ansible_product_serial": ["CN7792194B0NAT"], "host_name": '
'["nat-host-01.external", "nat-host-02.external"]}',
shows there was one ansible_host and 2 host_names
|
* Add integration test testing new deduplication * Fix dedup in invetory_scope * Fix current test failing on new column thatshould show only with dedup * Refactoring of handling host names before dedup * Refactor host_names before dedup * Organize helper functions better * Add instructions on how to run tests in CI mode Current approach is hacky, but it doesn;t require any code change. We'll need to unify the CI run and the local run, or create another docker compose. * Test properly all sheets as dict comparissions * Simplify host_names_before_decup * Fix host before dedup and mock of report date * Add more test cases for deduplication * Extend the existing collector test to include new data for deduplication * Expand on test cases * Fix some failing tests * Fix invalid collector * More cleanup of the tests and regenerated data * Fix failures of data_statuses assert * Fix just a minimal version of gather, making sure it doesn;t throw an exception * Add working gather test * Add new validation, testing each deduplication usecase in detail * Outline the test cases more clearly and add proper numbering * Add more facts into collection Will help to debug deduplication issues * Fix ordering for asserts * Update metrics_utility/automation_controller_billing/dataframe_engine/dataframe_jobhost_summary_usage.py Co-authored-by: Martin Hradil <mhradil@redhat.com> * Update metrics_utility/automation_controller_billing/dataframe_engine/dataframe_jobhost_summary_usage.py Co-authored-by: Martin Hradil <mhradil@redhat.com> * Update metrics_utility/automation_controller_billing/dedup/ccsp.py Co-authored-by: Martin Hradil <mhradil@redhat.com> * Change product_serial and machine_id to be realistic * Verify and fix first set of cases * Expand the data to be able to test all cases * Fix the NAT example * Fix case with 2 regions * Fix test case 3.3 * Extract facts fro serial safely if the fact key was missing, it was exploding * Tweak the unit test after adding dedup of also main_host * Remove test case 1.7, it's duplicate of 1.2 * Fix test order and cleanup desctiprion of case 5.1 * Fix wrong cases description --------- Co-authored-by: Martin Hradil <mhradil@redhat.com>



Issue: AAP-49237 & AAP-48948 & AAP-49231
Add a test demonstrating the dedup functionality