Each collector type has an associated anonymized rollup class that processes the collected data. Collectors are located in `metrics_utility/library/collectors/controller/`, and their corresponding rollup classes are in `metrics_utility/anonymized_rollups/`.
### Collector Types
Collectors fall into two categories:
- **Since-until collectors (time-series)**: Require `since` and `until` parameters and collect data for a specific time range. They run hourly by default to collect incremental data, but the collection interval is configurable.
- **Snapshot collectors**: Take no time parameters and collect a point-in-time snapshot of the current state. They run once per day by default; the schedule is likewise configurable.
- **Description**: Collects unified job data including job status, duration, execution environment, inventory, organization, Ansible version, installed collections, and job template information. Filters jobs by their `finished` timestamp within the time range.
- **Description**: Collects job host summary data including per-job, per-host task execution statistics (ok, failed, skipped, unreachable, etc.). Uses partition-optimized queries for better performance.
- **Description**: Collects job event data including module usage, collection usage, role usage, and event statistics. This is the largest collector and also uses partition-optimized queries.
- **Description**: Collects database table metadata, including row counts and table sizes for various system tables. This is used to estimate how many rows a customer has and how much disk space those tables occupy.
- **Description**: Collects controller version information showing which versions of the controller are running.
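
To make the two shapes concrete, here is a minimal sketch of what each category of collector might look like. The class names, table names, and query columns below are illustrative assumptions, not the actual `metrics_utility` API:

```python
from datetime import datetime

import pandas as pd


class UnifiedJobsCollector:
    """Since-until (time-series) collector: requires an explicit time range."""

    def __init__(self, db):
        self.db = db  # DB connection, e.g. a SQLAlchemy engine (assumed)

    def collect(self, since: datetime, until: datetime) -> pd.DataFrame:
        # Filter jobs by their `finished` timestamp within the range.
        query = (
            "SELECT id, status, elapsed, organization_id FROM main_unifiedjob "
            "WHERE finished >= %(since)s AND finished < %(until)s"
        )
        return pd.read_sql(query, self.db, params={"since": since, "until": until})


class ControllerVersionCollector:
    """Snapshot collector: no time parameters, captures the current state."""

    def __init__(self, db):
        self.db = db

    def collect(self) -> pd.DataFrame:
        return pd.read_sql("SELECT version FROM main_instance", self.db)
```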
## 2. Rollup Flow
The anonymized rollup process follows a multi-stage flow:
### Hourly Collection
1. **Collection**: Each time-series collector runs hourly, collecting data for a specific hour (e.g., 10:00-11:00). This is important: without hourly increments, computing a whole day's data from the raw rows in one pass would be prohibitively expensive.

The data is then processed in batches (see Prepare and Merge below). Each batch computes what is essentially an hourly aggregate, which is much smaller than the raw data: a JSON-like structure of summaries such as total counts and total durations.

These summaries are updated with each batch: the results of two hourly aggregates are aggregated together. This is what we call a rollup; rollups are essentially hierarchical aggregates. The result is then aggregated with the next hour, and so on, until the whole day is covered.

The daily rollup is sent to the analytics team, which further aggregates our daily rollups into monthly and yearly rollups; that step is not part of the metrics utility.
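
For intuition, the hierarchy looks like this with plain dictionaries (illustrative keys and numbers only):

```python
# Two hourly aggregates -- already far smaller than the raw rows they summarize.
hour_10 = {"total_jobs": 120, "total_duration": 5400.0}
hour_11 = {"total_jobs": 95, "total_duration": 4100.0}

# Rolling them up just aggregates the aggregates...
partial_day = {
    "total_jobs": hour_10["total_jobs"] + hour_11["total_jobs"],
    "total_duration": hour_10["total_duration"] + hour_11["total_duration"],
}

# ...and each further hour is merged into the same accumulating structure
# until the whole day is covered.
print(partial_day)  # {'total_jobs': 215, 'total_duration': 9500.0}
```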
2. **Prepare**: The raw dataframe from the collector is passed to the rollup's `prepare()` method, which:
- Filters and preprocesses the data (e.g., filtering out unfinished jobs)
- Performs initial aggregations
- Returns a serializable dictionary or list (not a dataframe)
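
For illustration, a `prepare()` might look roughly like this; the dataframe columns and output keys are assumptions, not the real rollup schema:

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> dict:
    # Keep only jobs that actually finished.
    finished = df[df["status"].isin(["successful", "failed", "canceled"])]

    # Initial aggregation: plain Python types only, so the result is
    # JSON-serializable.
    return {
        "total_jobs": int(len(finished)),
        "total_duration": float(finished["elapsed"].sum()),
        "jobs_by_status": {
            status: int(count)
            for status, count in finished["status"].value_counts().items()
        },
    }
```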
3. **Merge**: The result from `prepare()` is merged with the partial daily rollup using the `merge()` method:
- The partial daily rollup is initially empty (`None`) for the first hour
- Each subsequent hour's prepared data is merged into the accumulating daily rollup
- Both the partial rollup and the prepared data are serializable (JSON-compatible) structures
- The merge operation combines these structures appropriately (e.g., concatenating lists, summing counts)
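
A matching `merge()` could then combine two such structures (same assumed keys as in the `prepare()` sketch above):

```python
def merge(partial: dict | None, prepared: dict) -> dict:
    # First hour of the day: nothing accumulated yet.
    if partial is None:
        return prepared

    merged = {
        "total_jobs": partial["total_jobs"] + prepared["total_jobs"],
        "total_duration": partial["total_duration"] + prepared["total_duration"],
        "jobs_by_status": dict(partial["jobs_by_status"]),
    }
    # Sum counts key by key instead of overwriting them.
    for status, count in prepared["jobs_by_status"].items():
        merged["jobs_by_status"][status] = (
            merged["jobs_by_status"].get(status, 0) + count
        )
    return merged
```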
### Daily Base Processing
4. **Base**: After all hours for the day have been processed, the complete daily rollup is passed to the `base()` method, which:
- Performs final aggregations and statistics computation if needed
- Is usually quite short
- Returns a dictionary with a `json` key containing the final rollup data
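
In this picture, `base()` often amounts to little more than the following (a sketch, with the same assumed keys as above):

```python
def base(daily_rollup: dict) -> dict:
    # Final derived statistic computed once per day.
    total = daily_rollup["total_jobs"]
    daily_rollup["average_duration"] = (
        daily_rollup["total_duration"] / total if total else None
    )
    return {"json": daily_rollup}
```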
### Final Merging
5. **Combination**: All rollup results from `base()` are combined in `anonymized_rollups.py`:
- Each rollup's `json` output is collected
- All rollups are merged together using the `anonymize_rollups()` function
- The combined data is flattened into a single structure
- Sensitive data is anonymized (see section 3)
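
The general shape of that combination step, as a sketch (this stands in for, and greatly simplifies, the real `anonymize_rollups()` flow):

```python
def combine_rollups(rollup_results: dict) -> dict:
    # rollup_results maps a rollup name to the output of its base() method,
    # e.g. {"jobs": {"json": {...}}, "events": {"json": {...}}}.
    combined = {}
    for name, result in rollup_results.items():
        combined[name] = result["json"]  # collect each rollup's `json` output
    # The combined structure is then flattened and anonymized (section 3).
    return combined
```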
## 3. Anonymization
After all rollups are merged, the data goes through anonymization:
1. **String Filtering**: Any string value that is not a built-in Python type or part of a public collection (defined in `collections.json`) is either:
- Set to `"Unknown"` (for module names, collection names, and role names with `collection_source == 'Unknown'`)
- Filtered out entirely during collection (e.g., filtered by the `manage` DB column or other filters)

2. **Sanitization**: NaN and infinity values are replaced with `None` to ensure valid JSON output.

The anonymization ensures that no sensitive customer data (like custom module names, collection names, or job template names) is exposed in the final output.
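
A simplified sketch of both steps; the `PUBLIC_COLLECTIONS` set stands in for the contents of `collections.json`, and the function names are illustrative:

```python
import math

# Stand-in for the public collection list loaded from collections.json.
PUBLIC_COLLECTIONS = {"ansible.builtin", "community.general"}


def anonymize_name(name: str, collection_source: str) -> str:
    # Names whose collection is unknown or not public are replaced wholesale.
    if collection_source == "Unknown" or collection_source not in PUBLIC_COLLECTIONS:
        return "Unknown"
    return name


def sanitize(value):
    # NaN and +/-infinity are not valid JSON; replace them with None.
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return None
    return value
```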
## 4. Message Splitting
The final anonymized rollup JSON is split into multiple messages for transmission to Segment.com:
1. **Top-level Key Splitting**: Each top-level key in the JSON dictionary becomes a separate message chunk. For example:
- `statistics` → one chunk
- `module_stats` → one or more chunks (if it's a list)
- `jobs_by_job_type` → one or more chunks (if it's a list)
2. **Array Splitting**: If a top-level key contains an array (list), that array is split into multiple chunks when it exceeds the maximum message size:
- Maximum size: 24 KB (with space reserved for additional metadata)
- Each chunk contains as many array items as fit within the size limit
- Items are never split across chunks
3. **Size Calculation**: The size of each chunk is calculated as the JSON-encoded byte size of the data.
4. **Dictionary Handling**: If a top-level key contains a dictionary (not a list), it is sent as a single chunk and cannot be split. Dictionaries must therefore be smaller than the maximum message size.

The splitting logic is implemented in `metrics_utility/library/storage/segment.py` in the `_split_into_chunks()` method.
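
A simplified sketch of the array-splitting part of that logic (not the real `_split_into_chunks()` implementation; the constant and function name here are illustrative):

```python
import json

MAX_CHUNK_BYTES = 24 * 1024  # 24 KB, with headroom reserved for metadata


def split_array(key: str, items: list) -> list[dict]:
    """Split one top-level array into chunks that each fit the size limit.

    Items are never split: an item either fits into the current chunk
    or starts a new one.
    """
    chunks: list[dict] = []
    current: list = []
    for item in items:
        candidate = {key: current + [item]}
        # Size is measured as the JSON-encoded byte length of the chunk.
        if current and len(json.dumps(candidate).encode("utf-8")) > MAX_CHUNK_BYTES:
            chunks.append({key: current})
            current = []
        current.append(item)
    if current:
        chunks.append({key: current})
    return chunks
```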
## 5. Testing
To test the anonymized rollup system, use the `run_no_events.py` script: