Skip to content

Added nerc_invoicing repo with functions for querying invoicing Iceberg table#267

Open
jimmysway wants to merge 2 commits into
CCI-MOC:mainfrom
jimmysway:feature/266-add-invoicing-library
Open

Added nerc_invoicing repo with functions for querying invoicing Iceberg table#267
jimmysway wants to merge 2 commits into
CCI-MOC:mainfrom
jimmysway:feature/266-add-invoicing-library

Conversation

@jimmysway

@jimmysway jimmysway commented Feb 17, 2026

Copy link
Copy Markdown
Contributor

Added nerc_invoicing repo for querying invoicing Iceberg table

Added functionality to get lifetime costs grouped by project

Added pyproject.toml for future publishing

Closes #266

Comment thread nerc_invoicing/costs.py Outdated
BALANCE_COL = "Balance"


def get_invoice_dataframe() -> pd.DataFrame:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should cache this using functools.cache

Comment thread nerc_invoicing/costs.py Outdated
Comment thread nerc_invoicing/costs.py Outdated
Comment thread process_report/data_tools/config.py Outdated
Comment thread nerc_invoicing/costs.py Outdated
Comment thread pyproject.toml
@jimmysway jimmysway force-pushed the feature/266-add-invoicing-library branch 4 times, most recently from 8b3abd3 to 73b5ca0 Compare February 17, 2026 20:18
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
Comment thread process_report/data_tools/costs.py Outdated
@jimmysway jimmysway requested a review from knikolla February 17, 2026 22:03

@knikolla knikolla left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous comments still unresolved.

@jimmysway jimmysway force-pushed the feature/266-add-invoicing-library branch from 73b5ca0 to 42f08b4 Compare February 17, 2026 22:25

@knikolla knikolla left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments about using the config system and caching still applicable.

@jimmysway

Copy link
Copy Markdown
Contributor Author

Comments about using the config system and caching still applicable.

Sorry I missed those I am addressing those now. I think caching the get_invoice_dataframe() function might be premature. The function returns a dataframe potentially for a specific filter if we are only querying this dataframe once there is little benefit from caching right now.

I will instead apply it to the _get_table() function

@jimmysway jimmysway force-pushed the feature/266-add-invoicing-library branch from 42f08b4 to a2fbd6f Compare February 18, 2026 19:27
Comment thread process_report/data_tools/costs.py Outdated
Comment thread .github/ISSUE_TEMPLATE/custom.md
Comment thread process_report/data_tools/config.py Outdated
Comment thread process_report/data_tools/__init__.py Outdated
Comment thread process_report/data_tools/costs.py Outdated
@knikolla

Copy link
Copy Markdown
Contributor

Comments about using the config system and caching still applicable.

Sorry I missed those I am addressing those now. I think caching the get_invoice_dataframe() function might be premature. The function returns a dataframe potentially for a specific filter if we are only querying this dataframe once there is little benefit from caching right now.

I will instead apply it to the _get_table() function

Why do you think _get_table() is a better place for implementing caching?

@jimmysway

jimmysway commented Feb 18, 2026

Copy link
Copy Markdown
Contributor Author

Comments about using the config system and caching still applicable.

Sorry I missed those I am addressing those now. I think caching the get_invoice_dataframe() function might be premature. The function returns a dataframe potentially for a specific filter if we are only querying this dataframe once there is little benefit from caching right now.
I will instead apply it to the _get_table() function

Why do you think _get_table() is a better place for implementing caching?

I applied caching to _get_table() instead because it returns a reusable StaticTable handle (metadata + query interface), not full row data. Caching _get_table() mainly reduces repeated table-initialization overhead on every call.

If we were instead to cache the dataframe itself with get_invoice_dataframe(). We would instead be caching filter-specific dataframes which we don't reuse much right now.

I guess the question is what is the purpose of this library? Are we going to use as a thin wrapper for downstream processes to run predictable queries once a month, or are we going to be using this as a primary way to interface with Iceberg where it will be used as other processes or data tools to interface and run analytics on the data itself.

Currently, I think the library is going to be used as a thin wrapper. Since this module is just a thin wrapper, it should do the simple safe optimization (_get_table) and leave heavier caching to whatever service or job is using it.

@jimmysway jimmysway force-pushed the feature/266-add-invoicing-library branch from a2fbd6f to 6239016 Compare February 18, 2026 22:42
@jimmysway jimmysway closed this Feb 18, 2026
@jimmysway jimmysway force-pushed the feature/266-add-invoicing-library branch from 6239016 to 4920ec7 Compare February 18, 2026 22:52
@jimmysway jimmysway reopened this Feb 18, 2026
return grouped_df


def aggregate_by(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've split the previous select_and_group into two functions aggregate_by and group_and_sum. I think the old function was doing too much and was misleading. This way group_and_sum alone is useful whenever you already have a DataFrame and just need deterministic aggregation.

@QuanMPhm QuanMPhm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments I have for now. Will take a closer look and run some tests

Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
Comment thread process_report/data_tools/config.py Outdated
Comment thread process_report/tests/unit/data_tools/test_data_tools.py Outdated
@jimmysway jimmysway force-pushed the feature/266-add-invoicing-library branch from 31f1732 to 7ef0bf9 Compare February 24, 2026 15:24
@jimmysway jimmysway requested review from QuanMPhm and knikolla and removed request for QuanMPhm February 24, 2026 15:24
@jimmysway jimmysway force-pushed the feature/266-add-invoicing-library branch from dee51ba to 89d21b0 Compare February 24, 2026 15:36

@QuanMPhm QuanMPhm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the good work. Just a few more comments mostly regarding style.

Comment thread process_report/data_tools/config.py Outdated
Comment thread process_report/data_tools/costs.py Outdated
Comment thread process_report/data_tools/costs.py Outdated
group_by: tuple[str, ...],
*,
agg_col: str,
agg_name: str = "total",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knikolla Is there a strong need for the output column to have a custom name? This function already only outputs the grouped and output columns, so without agg_name, there would not be any ambiguity.

a_summed = a.groupby([INVOICE_DATE_FIELD, PROJECT_FIELD], dropna=False, as_index=False)["Balance"].agg("sum")
print(a_summed.head())

  Invoice Month Project - Allocation  Balance
0       2024-03                False     6.43
1       2024-03                 True   403.30
2       2024-03                  NaN     3.30
3       2024-04                False  1000.00
4       2024-04                 True   510.00

I just don't want unneeded features for now.

Comment thread process_report/data_tools/costs.py Outdated
Comment thread process_report/data_tools/costs.py
Comment on lines +42 to +45
expression = costs._row_filter(**filters)
assert isinstance(expression, costs.And)
assert isinstance(expression.left, costs.EqualTo)
assert isinstance(expression.right, costs.EqualTo)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a stricter check, like an equality:

Suggested change
expression = costs._row_filter(**filters)
assert isinstance(expression, costs.And)
assert isinstance(expression.left, costs.EqualTo)
assert isinstance(expression.right, costs.EqualTo)
expression = costs._row_filter(**filters)
assert isinstance(expression, costs.And)
assert expression.left == costs.EqualTo(PID, "vllm-test")
...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test checks that the right type of expression was built, which I think is sufficient. Checking exact values is not really useful imho because if _row_filter garbles the values, group_and_sum and aggregate_by tests would catch it anyway since they operate end-to-end with real data.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test checks that the right type of expression was built, which I think is sufficient. Checking exact values is not really useful

I disagree. This unit test is the only one letting us understand the expected behavior of _row_filter. We should know exactly the expected output should be, and in this is case it is not difficult to do so.

because if _row_filter garbles the values, group_and_sum and aggregate_by tests would catch it anyway since they operate end-to-end with real data.

get_invoice_dataframe, the only function that calls _row_filter, is heavily mocked in your test cases. None of your test cases deliberately test the filtering feature because we lack an integration test. Therefore, I'll argue none of the tests you presented adequately test _row_filter

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +48 to +58
def test_aggregate_by_rounds_and_forwards_filters(
monkeypatch: pytest.MonkeyPatch, sample_invoice_dataframe: pd.DataFrame
):
captured: dict[str, object] = {}

def _fake_loader(cols=None, **filters):
captured["cols"] = cols
captured["filters"] = filters
return sample_invoice_dataframe

monkeypatch.setattr(costs, "get_invoice_dataframe", _fake_loader)

@QuanMPhm QuanMPhm Feb 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered unittest.mock, or pytest-mock to mock get_invoice_dataframe? Mocking will provide you with helper functions to capture how get_invoice_dataframe was called, removing the need for _fake_loader. There's a few examples of this mocking in this repo

This suggest would also help remove some mocking code you have further in your test cases, such as for class _FakeIcebergTable

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the _FakeIcebergTable. I believe having code that is understandable is better.

_FakeIcebergTable explicitly documents the .scan().select().to_pandas() chain the code depends on. Which is very useful to understand the context which is something that a MagicMock will abstract away.

@QuanMPhm QuanMPhm Mar 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the test case is to see if get_invoice_dataframe emits a warning if the filtered dataframe is empty. It is explicitly not concerned with the internals of the iceberg table, which is why this is mocked out in the first place. A developer would review a test case to understand the expected behavior, not the internals of how that is done.

Looking at it a bit more, I would argue asserting that the output equals the input is moot here, since you setup the mock to return the same dataframe. If we wanted to test the pyiceberg code, we would need an integration test, which can be a part of this PR or a seperate one. @knikolla Thoughts?

Comment thread process_report/tests/unit/data_tools/test_data_tools.py Outdated
@jimmysway jimmysway force-pushed the feature/266-add-invoicing-library branch 2 times, most recently from 89d21b0 to 97a1923 Compare March 2, 2026 16:14
@jimmysway jimmysway requested a review from QuanMPhm March 4, 2026 14:19
@QuanMPhm

Copy link
Copy Markdown
Contributor

@jimmysway Some CI test cases are broken

Added functionality to get lifetime costs grouped by project

Added pyproject.toml for future publishing

Added testing for new functionality
…o alert based on the "Cost" column

I have made the changes to the code as necessary, switching from summing "Balance" to "Cost" instead

Added equality check for row_filter test
@QuanMPhm QuanMPhm force-pushed the feature/266-add-invoicing-library branch from c96e05a to ff60637 Compare April 9, 2026 18:01
@QuanMPhm

QuanMPhm commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

@knikolla @naved001 I made some small amendments to this PR. I have one question here

@QuanMPhm QuanMPhm requested a review from naved001 April 9, 2026 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Query lifetime cost of a specific (Allocation Name, Cluster Name)

3 participants