[DOCs] Add missing docstrings across codebase #380
Conversation
📝 Walkthrough

This pull request adds comprehensive documentation across the codebase and introduces configurable ID-range batching and time-window interval functionality for collectors. Key functional changes include a new

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
metrics_utility/automation_controller_billing/collectors.py (1)
73-91: ⚠️ Potential issue | 🟠 Major

Guard against non-positive `METRICS_UTILITY_GATHER_INTERVAL_HOURS` to avoid an infinite loop.

`get_gather_interval_hours()` only validates that the env var parses to an int — it does not reject 0 or negative values. If a user sets `METRICS_UTILITY_GATHER_INTERVAL_HOURS=0`, then `interval` is `timedelta(0)`, so `end = min(start + interval, day_end, until) = start` whenever `start < day_end` and `start < until`, and the next iteration sets `start = end`. The `while start < until:` loop will never terminate, hanging the gather command. Either validate the value at parse time or guard here.
🛡️ Suggested guard at parse time (in `metrics_utility/base/utils.py`)
```python
def get_gather_interval_hours():
    try:
        value = int(os.getenv('METRICS_UTILITY_GATHER_INTERVAL_HOURS', '24'))
    except (ValueError, TypeError):
        logger.error('METRICS_UTILITY_GATHER_INTERVAL_HOURS cannot be converted to an integer')
        raise
    if value <= 0:
        raise ValueError(f'METRICS_UTILITY_GATHER_INTERVAL_HOURS must be > 0, got {value}')
    return value
```
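For the "or guard here" alternative mentioned above, here is a minimal, self-contained sketch of a gather loop with a local guard. It is simplified (the real loop in collectors.py also caps each window at day boundaries) and the names are assumptions taken from the comment, not the repository's actual code:

```python
from datetime import datetime, timedelta, timezone

def iter_gather_windows(start, until, interval_hours):
    """Yield (window_start, window_end) pairs, refusing a non-positive interval."""
    interval = timedelta(hours=interval_hours)
    if interval.total_seconds() <= 0:
        # With interval == timedelta(0), end would equal start on every pass and
        # the while-loop below would never terminate.
        raise ValueError(f'Gather interval must be > 0 hours, got {interval_hours}')
    while start < until:
        end = min(start + interval, until)
        yield start, end
        start = end

windows = list(iter_gather_windows(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
    interval_hours=6,
))
assert len(windows) == 4  # four 6-hour windows; interval_hours=0 would raise instead
```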
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@metrics_utility/automation_controller_billing/collectors.py` around lines 73 - 91, The loop in collectors.py can hang when get_gather_interval_hours() returns 0 or negative, because interval = timedelta(hours=...) could be zero; guard against this by validating the parsed value either where it’s read (get_gather_interval_hours in metrics_utility/base/utils.py) or immediately before using it in collectors.py: ensure the returned integer is > 0 and raise/log a clear error (including METRICS_UTILITY_GATHER_INTERVAL_HOURS and the bad value) if not, or if you prefer local guarding, check interval.total_seconds() > 0 before the while start < until loop and raise ValueError if it isn’t; reference symbols: get_gather_interval_hours, METRICS_UTILITY_GATHER_INTERVAL_HOURS, interval, start, end in collectors.py.

metrics_utility/automation_controller_billing/report_saver/report_saver_s3.py (1)
37-53: ⚠️ Potential issue | 🟠 Major

Avoid logging success when upload fails.
Line 52 logs a success message even after the exception path at lines 49-50, which can hide failed report delivery and degrade operational reliability.
Proposed fix
```diff
 def save(self, report_spreadsheet):
@@
-    with tempfile.TemporaryDirectory(prefix='report_saver_billing_data_') as temp_dir:
+    upload_succeeded = False
+    with tempfile.TemporaryDirectory(prefix='report_saver_billing_data_') as temp_dir:
         try:
             local_report_path = os.path.join(temp_dir, 'report')
             report_spreadsheet.save(local_report_path)
             self.s3_handler.upload_file(local_report_path, self.report_spreadsheet_destination_path)
+            upload_succeeded = True
         except Exception as e:
             logger.exception(f'{self.LOG_PREFIX} ERROR: Saving report to S3 into path {self.report_spreadsheet_destination_path} failed with {e}')
-    logger.info(f'Report sent into S3 bucket into path: {self.report_spreadsheet_destination_path}')
+    if upload_succeeded:
+        logger.info(f'Report sent into S3 bucket into path: {self.report_spreadsheet_destination_path}')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@metrics_utility/automation_controller_billing/report_saver/report_saver_s3.py` around lines 37 - 53, The success log is executed unconditionally after the try/except, so failures are still reported as successes; modify the method in report_saver_s3.py so the logger.info that mentions successful upload (logger.info(... self.report_spreadsheet_destination_path)) is only called when no exception occurred — e.g., move that logger.info into the try block immediately after self.s3_handler.upload_file(...) (or use a try/except/else and put the info in the else), leaving logger.exception(...) in the except branch using self.LOG_PREFIX and the caught exception; keep references to report_spreadsheet.save(...), self.s3_handler.upload_file(...) and self.report_spreadsheet_destination_path to locate the change.
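A sketch of the try/except/else shape the prompt mentions, written as a standalone function with the S3 handler and destination path passed in (the real code is a method on the report saver class; names are taken from the diff above):

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)

def save_report(report_spreadsheet, s3_handler, destination_path, log_prefix='[ReportSaverS3]'):
    """Upload a saved report, logging success only when the upload actually happened."""
    with tempfile.TemporaryDirectory(prefix='report_saver_billing_data_') as temp_dir:
        try:
            local_report_path = os.path.join(temp_dir, 'report')
            report_spreadsheet.save(local_report_path)
            s3_handler.upload_file(local_report_path, destination_path)
        except Exception as e:
            logger.exception(f'{log_prefix} ERROR: Saving report to S3 into path {destination_path} failed with {e}')
        else:
            # Runs only when the try block completed without raising, so a failed
            # upload can never be reported as a success.
            logger.info(f'Report sent into S3 bucket into path: {destination_path}')
```

Both the flag-based diff above and this `else` variant behave the same; the `else` form just avoids the extra boolean.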
🧹 Nitpick comments (1)
metrics_utility/library/dataframes/base_dataframe.py (1)
71-76: Add a `Returns` section to `to_parquet` for parity with other I/O helpers.

Lines 71-76 contain the only helper docstring missing explicit return behavior, which makes this API surface slightly less clear than `to_csv`/`to_json`.

Suggested docstring tweak
```diff
 def to_parquet(df, f=None):
     """Serialise *df* to Parquet format, writing to *f*.

     Args:
         df: pandas DataFrame to serialise.
         f: Optional file path or file-like object.
+
+    Returns:
+        Serialized parquet payload when *f* is None; otherwise None (writes to *f*).
     """
     return df.to_parquet(f)
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@metrics_utility/library/dataframes/base_dataframe.py` around lines 71 - 76, The to_parquet docstring is missing a Returns section; update the docstring for the function to_parquet to include a clear "Returns" entry stating that the function writes the DataFrame to the given file/path or file-like object and returns None (i.e., no meaningful return value), and mention that any bytes are written to the provided file-like object rather than returned; keep wording consistent with the Returns sections used by to_csv/to_json for parity and clarity.
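The two suggestions above describe `to_parquet`'s return value slightly differently; with recent pandas releases (and a Parquet engine such as pyarrow installed), the underlying behavior is easy to check directly. A small sanity check, not anything from the repository:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Writing to a path or file-like object: data goes to the target, None is returned.
buf = io.BytesIO()
assert df.to_parquet(buf) is None
assert len(buf.getvalue()) > 0

# No target given: recent pandas returns the serialized Parquet payload as bytes.
payload = df.to_parquet()
assert isinstance(payload, bytes)
```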
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@metrics_utility/library/collectors/controller/config_django.py`:
- Line 12: The function config_django currently uses
mutable/default-instantiated arguments (billing_provider_params={} and
output=DictOutput()), which can retain state across calls; change the signature
to use None defaults (e.g., billing_provider_params=None, output=None) and
inside the config_django function initialize billing_provider_params = {} if
None and output = DictOutput() if None so each call gets fresh instances; update
any internal references to those parameters accordingly (look for usages of
config_django, billing_provider_params, and output in the same file to ensure
behavior is preserved).
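A minimal sketch of the None-default pattern described above; `DictOutput` is stubbed here because only the call-time initialization matters, so none of this is the repository's actual code:

```python
class DictOutput:
    """Hypothetical stand-in for the real DictOutput collaborator."""
    def __init__(self):
        self.data = {}

def config_django(billing_provider_params=None, output=None):
    # A `{}` or `DictOutput()` default in the signature is created once at import
    # time and shared by every caller; None defaults give each call fresh objects.
    if billing_provider_params is None:
        billing_provider_params = {}
    if output is None:
        output = DictOutput()
    return billing_provider_params, output

params_a, out_a = config_django()
params_b, out_b = config_django()
assert params_a is not params_b and out_a is not out_b  # no shared state across calls
```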
In `@metrics_utility/library/collectors/controller/job_host_summary_service.py`:
- Around line 74-92: The batched path (when batch_size is set) currently orders
rows by mu.finished per batch causing pd.concat to produce a final result that
is not globally ordered; change the ordering in the batched query to use mjs.id
ASC to match the keyset partitioning and preserve order across batches—update
the min_max_query and the build_query/ORDER BY clause used by output.batch_sql
(references: batch_size, min_max_query, build_query, output.batch_sql, and the
mjs alias) so each batch is emitted in mjs.id ASC order consistent with
non-batched behavior.
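To make the ordering point concrete, a hypothetical sketch of id-range batching where every batch is ordered by the same column the ranges are cut on, so the concatenated batches stay globally ordered; the table name and `mjs` alias are taken from the prompt, and the repository's real `build_query`/`batch_sql` plumbing is not shown:

```python
def build_batched_queries(min_id, max_id, batch_size, where_clause='1=1'):
    """Yield one SELECT per id range, each ordered by the batching column."""
    lower = min_id
    while lower <= max_id:
        upper = min(lower + batch_size - 1, max_id)
        # Ordering by mjs.id ASC (the partitioning column) means batch N's last id
        # is always below batch N+1's first id, so pd.concat of the results is
        # already globally sorted.
        yield (
            "SELECT mjs.* FROM main_jobhostsummary mjs "
            f"WHERE {where_clause} AND mjs.id BETWEEN {lower} AND {upper} "
            "ORDER BY mjs.id ASC"
        )
        lower = upper + 1

for sql in build_batched_queries(1, 2500, batch_size=1000):
    print(sql)  # three ranges: 1-1000, 1001-2000, 2001-2500
```

(Values are interpolated directly only because this is an illustration; real code should use parameterized queries.)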
In
`@metrics_utility/library/collectors/controller/main_indirectmanagednodeaudit.py`:
- Around line 49-62: ORDERING mismatch: the current ORDER BY
main_indirectmanagednodeaudit.created ASC only sorts within each id-range batch
so concatenated results won't be globally chronological; change the query
ordering used by build_query and the final output.sql call to ORDER BY
main_indirectmanagednodeaudit.id ASC (i.e., order by the same column used for
batching) so batch_sql's concatenated CSV remains globally ordered, and apply
the same change used in job_host_summary_service to both the batch_sql
build_query lambda and the non-batched build_query.
In `@metrics_utility/library/collectors/controller/main_jobevent.py`:
- Around line 86-107: The min_max_query in main_jobevent.py uses the alias jhs
for main_jobhostsummary but reuses the previously-built where predicate that
references main_jobhostsummary.modified, causing a missing FROM-clause error;
fix by making the predicates consistent: either change min_max_query to
reference the unaliased table (remove the jhs alias) or update the where
predicate to use the alias (e.g., jhs.modified) before passing it into
output.batch_sql; locate the variables and functions min_max_query, where,
build_query, get_batch_size and the call to output.batch_sql and ensure the same
table reference (aliased or not) is used in both the min_max_query and the built
batch query.
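A small illustration of the alias-consistency point: if the shared predicate already references the `jhs` alias, both the min/max query and the per-batch query can reuse it without the missing-FROM-entry error (illustrative strings only, not the repository's actual queries):

```python
# Shared predicate written against the same alias every query below uses.
where = "jhs.modified >= %(since)s AND jhs.modified < %(until)s"

min_max_query = (
    "SELECT MIN(jhs.id), MAX(jhs.id) "
    "FROM main_jobhostsummary jhs "
    f"WHERE {where}"
)

batch_query = (
    "SELECT jhs.* FROM main_jobhostsummary jhs "
    f"WHERE {where} AND jhs.id BETWEEN %(lower)s AND %(upper)s "
    "ORDER BY jhs.id ASC"
)
```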
In `@metrics_utility/library/collectors/util.py`:
- Around line 11-21: The get_batch_size function should coerce non-positive
parsed values to 0 to avoid infinite loops; update get_batch_size in util.py so
after parsing the environment variable
(os.getenv('METRICS_UTILITY_GATHER_BATCH_SIZE', '0')) it returns 0 for any value
<= 0 (instead of returning negative numbers), ensuring callers like
DataframeOutput.batch_sql and _batch_copy_table_files that rely on "if
batch_size:" do not receive negative batch sizes—parse to int, check <= 0 and
return 0, otherwise return the positive int.
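A minimal sketch of the coercion described above (how the real `util.py` handles unparseable values is not shown here):

```python
import os

def get_batch_size():
    """Read METRICS_UTILITY_GATHER_BATCH_SIZE, treating anything <= 0 as 'batching disabled'."""
    value = int(os.getenv('METRICS_UTILITY_GATHER_BATCH_SIZE', '0'))
    # Returning 0 instead of a negative number keeps `if batch_size:` callers such as
    # DataframeOutput.batch_sql on the non-batched path and avoids backwards range math.
    return value if value > 0 else 0
```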
In `@metrics_utility/library/storage/crc.py`:
- Around line 50-57: The docstring for Base.put claims "Exactly one of
*filename*, *fileobj*, or *dict* must be provided" but the implementation of
Base.put currently accepts multiple of these and will perform multiple uploads;
update either the implementation or the docstring to match: either enforce
exclusivity by validating that only one of filename, fileobj, or dict is
non-None (raise ValueError) inside Base.put, or change the docstring wording to
state that any combination is allowed and describe the actual behavior (e.g.,
"One or more of filename, fileobj, or dict may be provided; each will be
uploaded"), referencing the Base.put function and the parameters filename,
fileobj, dict so reviewers can locate the code.
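If exclusivity were enforced, the validation could look roughly like this standalone sketch; the real `Base.put` takes a parameter literally named `dict`, renamed here only to avoid shadowing the builtin:

```python
def validate_single_source(filename=None, fileobj=None, dict_payload=None):
    """Require exactly one upload source, mirroring what the docstring promises."""
    provided = [name for name, value in (
        ('filename', filename),
        ('fileobj', fileobj),
        ('dict', dict_payload),
    ) if value is not None]
    if len(provided) != 1:
        raise ValueError(f'Exactly one of filename, fileobj, or dict must be provided; got {provided or "nothing"}')
    return provided[0]

assert validate_single_source(filename='report.tar.gz') == 'filename'
# validate_single_source(filename='x', dict_payload={})  # would raise ValueError
```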
---
Outside diff comments:
In `@metrics_utility/automation_controller_billing/collectors.py`:
- Around line 73-91: The loop in collectors.py can hang when
get_gather_interval_hours() returns 0 or negative, because interval =
timedelta(hours=...) could be zero; guard against this by validating the parsed
value either where it’s read (get_gather_interval_hours in
metrics_utility/base/utils.py) or immediately before using it in collectors.py:
ensure the returned integer is > 0 and raise/log a clear error (including
METRICS_UTILITY_GATHER_INTERVAL_HOURS and the bad value) if not, or if you
prefer local guarding, check interval.total_seconds() > 0 before the while start
< until loop and raise ValueError if it isn’t; reference symbols:
get_gather_interval_hours, METRICS_UTILITY_GATHER_INTERVAL_HOURS, interval,
start, end in collectors.py.
In
`@metrics_utility/automation_controller_billing/report_saver/report_saver_s3.py`:
- Around line 37-53: The success log is executed unconditionally after the
try/except, so failures are still reported as successes; modify the method in
report_saver_s3.py so the logger.info that mentions successful upload
(logger.info(... self.report_spreadsheet_destination_path)) is only called when
no exception occurred — e.g., move that logger.info into the try block
immediately after self.s3_handler.upload_file(...) (or use a try/except/else and
put the info in the else), leaving logger.exception(...) in the except branch
using self.LOG_PREFIX and the caught exception; keep references to
report_spreadsheet.save(...), self.s3_handler.upload_file(...) and
self.report_spreadsheet_destination_path to locate the change.
---
Nitpick comments:
In `@metrics_utility/library/dataframes/base_dataframe.py`:
- Around line 71-76: The to_parquet docstring is missing a Returns section;
update the docstring for the function to_parquet to include a clear "Returns"
entry stating that the function writes the DataFrame to the given file/path or
file-like object and returns None (i.e., no meaningful return value), and
mention that any bytes are written to the provided file-like object rather than
returned; keep wording consistent with the Returns sections used by
to_csv/to_json for parity and clarity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ff7b878e-57a7-4e6c-8497-74694fcaa094
📒 Files selected for processing (88)
docs/environment.md
metrics_utility/anonymized_rollups/__init__.py
metrics_utility/anonymized_rollups/anonymized_rollups.py
metrics_utility/anonymized_rollups/base_anonymized_rollup.py
metrics_utility/anonymized_rollups/controller_version_anonymized_rollup.py
metrics_utility/anonymized_rollups/credentials_anonymized_rollup.py
metrics_utility/anonymized_rollups/events_modules_anonymized_rollup.py
metrics_utility/anonymized_rollups/execution_environments_anonymized_rollup.py
metrics_utility/anonymized_rollups/jobhostsummary_anonymized_rollup.py
metrics_utility/anonymized_rollups/jobs_anonymized_rollup.py
metrics_utility/anonymized_rollups/table_metadata_anonymized_rollup.py
metrics_utility/automation_controller_billing/base/s3_handler.py
metrics_utility/automation_controller_billing/collector.py
metrics_utility/automation_controller_billing/collectors.py
metrics_utility/automation_controller_billing/dataframe_engine/base.py
metrics_utility/automation_controller_billing/dataframe_engine/dataframe_collection_status.py
metrics_utility/automation_controller_billing/dataframe_engine/dataframe_content_usage.py
metrics_utility/automation_controller_billing/dataframe_engine/dataframe_inventory_scope.py
metrics_utility/automation_controller_billing/dataframe_engine/dataframe_jobhost_summary_usage.py
metrics_utility/automation_controller_billing/dataframe_engine/db_dataframe_host_metric.py
metrics_utility/automation_controller_billing/dataframe_engine/factory.py
metrics_utility/automation_controller_billing/dedup/ccsp.py
metrics_utility/automation_controller_billing/dedup/factory.py
metrics_utility/automation_controller_billing/dedup/renewal_guidance.py
metrics_utility/automation_controller_billing/extract/base.py
metrics_utility/automation_controller_billing/extract/extractor_controller_db.py
metrics_utility/automation_controller_billing/extract/extractor_directory.py
metrics_utility/automation_controller_billing/extract/extractor_s3.py
metrics_utility/automation_controller_billing/extract/factory.py
metrics_utility/automation_controller_billing/helpers.py
metrics_utility/automation_controller_billing/package/factory.py
metrics_utility/automation_controller_billing/package/package_crc.py
metrics_utility/automation_controller_billing/package/package_directory.py
metrics_utility/automation_controller_billing/package/package_s3.py
metrics_utility/automation_controller_billing/report/base.py
metrics_utility/automation_controller_billing/report/factory.py
metrics_utility/automation_controller_billing/report/report_ccsp.py
metrics_utility/automation_controller_billing/report/report_ccsp_v2.py
metrics_utility/automation_controller_billing/report/report_renewal_guidance.py
metrics_utility/automation_controller_billing/report_saver/factory.py
metrics_utility/automation_controller_billing/report_saver/report_saver_directory.py
metrics_utility/automation_controller_billing/report_saver/report_saver_s3.py
metrics_utility/base/collection_data_status.py
metrics_utility/base/collection_manifest.py
metrics_utility/base/utils.py
metrics_utility/exceptions.py
metrics_utility/library/collectors/controller/config.py
metrics_utility/library/collectors/controller/config_django.py
metrics_utility/library/collectors/controller/credentials_service.py
metrics_utility/library/collectors/controller/execution_environments.py
metrics_utility/library/collectors/controller/job_host_summary.py
metrics_utility/library/collectors/controller/job_host_summary_service.py
metrics_utility/library/collectors/controller/main_host.py
metrics_utility/library/collectors/controller/main_indirectmanagednodeaudit.py
metrics_utility/library/collectors/controller/main_jobevent.py
metrics_utility/library/collectors/controller/unified_jobs.py
metrics_utility/library/collectors/others/total_workers_vcpu.py
metrics_utility/library/collectors/util.py
metrics_utility/library/dataframes/base_dataframe.py
metrics_utility/library/dataframes/base_traditional.py
metrics_utility/library/dataframes/data_collection_status.py
metrics_utility/library/dataframes/host_metric.py
metrics_utility/library/dataframes/job_host_summary.py
metrics_utility/library/dataframes/main_host.py
metrics_utility/library/dataframes/main_jobevent.py
metrics_utility/library/debug.py
metrics_utility/library/extractors.py
metrics_utility/library/instants.py
metrics_utility/library/lock.py
metrics_utility/library/package.py
metrics_utility/library/reports.py
metrics_utility/library/storage/crc.py
metrics_utility/library/storage/directory.py
metrics_utility/library/storage/s3.py
metrics_utility/library/storage/segment.py
metrics_utility/library/storage/util.py
metrics_utility/library/utils.py
metrics_utility/logger.py
metrics_utility/management/commands/build_report.py
metrics_utility/management/commands/gather_automation_controller_billing_data.py
metrics_utility/management/validation.py
metrics_utility/management_utility.py
metrics_utility/metric_utils.py
metrics_utility/test/base/classes/analytics_collector.py
metrics_utility/test/base/classes/package.py
metrics_utility/test/base/functional/helpers.py
metrics_utility/test/conftest.py
metrics_utility/test/util.py
Added Google-style docstrings to all public modules, classes, methods, and non-trivial functions that previously had none. Covers 86 source files spanning anonymized_rollups, automation_controller_billing, base, library, management, test, and tools packages.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
68d233b to 93c2847
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 93c2847. Configure here.
…-update
These constants were pulled in from the batch processing commit during conflict resolution and do not belong in the docstring-only branch.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
himdel
left a comment
This is probably a bad idea,
Humans will interpret the docstrings as authoritative and believe them over code,
AIs will have that much more context to deal with,
and information duplicated in multiple places will slowly go out of sync,
so this is likely to be another maintenance nightmare for the future.
But, if it's a requirement, approved :).




Summary
Test plan
🤖 Generated with Claude Code
Note
Medium Risk
Mostly documentation-only, but it also tightens `date_where` input validation (type + timezone-awareness) and adjusts collection/role regex matching to avoid potential ReDoS, which could surface new runtime errors or change parsing behavior.

Overview
Adds extensive module/class/function docstrings across the anonymized_rollups, billing (automation_controller_billing), library, management commands, and test utilities to improve API clarity.

Non-doc changes include stricter validation in `library/collectors/util.py:date_where` (raises on non-`datetime` or timezone-naive bounds), a safer collection-name regex in `library/dataframes/main_jobevent.py` to prevent catastrophic backtracking, and minor public-API polish like explicit `__all__` exports in `anonymized_rollups/__init__.py`.

Reviewed by Cursor Bugbot for commit e6eb6ed. Bugbot is set up for automated code reviews on this repo. Configure here.
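For context, the kind of timezone-awareness check Bugbot describes for `date_where` usually looks like this minimal, hypothetical helper (not the repository's actual implementation):

```python
from datetime import datetime, timezone

def require_aware_datetime(value, name):
    """Reject anything that is not a timezone-aware datetime."""
    if not isinstance(value, datetime):
        raise TypeError(f'{name} must be a datetime, got {type(value).__name__}')
    if value.tzinfo is None or value.tzinfo.utcoffset(value) is None:
        raise ValueError(f'{name} must be timezone-aware')
    return value

require_aware_datetime(datetime.now(timezone.utc), 'since')   # passes
# require_aware_datetime(datetime(2024, 1, 1), 'since')       # raises ValueError (naive)
```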