Track initialized and used columns in each processor by QuanMPhm · Pull Request #285 · CCI-MOC/invoicing

QuanMPhm · 2026-04-02T19:10:07Z

Closes #284. During 2026-03 invoicing, a bug was found where the columns initialized by the New-PI credit processor (i.e PI Balance column), was being accessed by the PI-SU processor before it was initialized, causing an KeyError.

More details in the commit message

[1] #279

knikolla · 2026-04-02T19:17:05Z

@QuanMPhm Does this mean that the InitColumnsProcessor needs to know about ALL the processors?

knikolla · 2026-04-02T19:21:03Z

+
+
+@dataclass
+class PISUCreditProcessor(discount_processor.DiscountProcessor):


Why is this called PISUCreditProcessor?

knikolla · 2026-04-02T19:21:58Z

+]
+
+
+class BaseTesCase(TestCase):


knikolla · 2026-04-02T19:24:22Z

@QuanMPhm Does this mean that the InitColumnsProcessor needs to know about ALL the processors?

I guess this is inevitable since we want to also upload the resulting DataFrame into Iceberg, which means we need to centralize the listing of columns, their initialization and migration.

I don't think a processor is the best way to implement this initialization. Please think about a solution that also extends the Iceberg table when a new column is introduced.

naved001 · 2026-04-03T15:29:09Z

I don't think a processor is the best way to implement this initialization.

@knikolla Why not? I was just trying to think where else could this happen, and I guess when the CSVs are merged into the pandas dataframe before the processors run. But I want to understand your rationale behind it not being a processor.

knikolla · 2026-04-03T17:38:00Z

I don't think a processor is the best way to implement this initialization.

@knikolla Why not? I was just trying to think where else could this happen, and I guess when the CSVs are merged into the pandas dataframe before the processors run. But I want to understand your rationale behind it not being a processor.

@naved001 some quick thoughts.

The Processors are an attempt at organizing business logic into respective modules. This Processor wouldn't really contain any business logic and just perform common initialization. This strongly feels more like a base class function than a specific processor.
Currently, each Processor already may introduce new columns, and these columns are initialized by the Processor that introduces them. This has the added benefit of providing ownership over that column and additional context and documentation about how the column is initialized and used for by the surrounding code.
The introduction of new columns creates ordering concerns but this ordering concern isn't fixed by the introduction of a column initializer. The business logic is what places the constraint on the ordering of Processors, not the initialization of the columns. This bug merely highlighted that as part of the PR that introduced the PISUCreditProcessor the initialization of that specific column should have either been moved from the NewPICreditProcessor to the PISUCreditProcessor (it now being the first one to use the column) or the ordering of the 2 processors changed (if it wouldn't make a difference to the business logic).
We are already storing the entire list of columns in the base Invoice class. We can introduce more metadata to that list (for example by making each of the columns a dataclass containing Column Name, Type, Default Value or Initializer whatever you think is beneficial and then extend the Processors to have a initializes_columns: List[Column] and operates_on_columns: List[Column].
This allows us to document in code and write unit tests that check for the list of Processors and fail the unit tests when the ordering is such that a Processor operates_on a column that wasn't initialized by a previous Processor or doesn't have a default initializer.
This also allows us to include within the Column definition migration logic for the Iceberg table when a column is introduced.

naved001 · 2026-04-06T14:49:51Z

@knikolla Thanks for the explanation! It makes a lot of sense to not have a processor just do the init. Just a couple of thoughts/questions.

We are already storing the entire list of columns in the base Invoice class. We can introduce more metadata to that list (for example by making each of the columns a dataclass containing Column Name, Type, Default Value or Initializer whatever you think is beneficial and then extend the Processors to have a initializes_columns: List[Column] and operates_on_columns: List[Column].

Wouldn't the processors only initialize these columns when they are called? I don't understand how this would have solved the current issue.

This allows us to document in code and write unit tests that check for the list of Processors and fail the unit tests when the ordering is such that a Processor operates_on a column that wasn't initialized by a previous Processor or doesn't have a default initializer.

Even if we had this implemented, how would a unit test have captured the current problem which was due to the ordering of the processors? Because if we wrote the unit tests for checking if the columns exist that would be just on the fake data we are passing to that specific processor. I think the end to end tests should have been updated to capture this because we actually run all processors in a sequence and that would have failed.

knikolla · 2026-04-06T16:09:28Z

@knikolla Thanks for the explanation! It makes a lot of sense to not have a processor just do the init. Just a couple of thoughts/questions.

We are already storing the entire list of columns in the base Invoice class. We can introduce more metadata to that list (for example by making each of the columns a dataclass containing Column Name, Type, Default Value or Initializer whatever you think is beneficial and then extend the Processors to have a initializes_columns: List[Column] and operates_on_columns: List[Column].

Wouldn't the processors only initialize these columns when they are called? I don't understand how this would have solved the current issue.

This by itself wouldn't have solved the current issue. It would have provided a better place to centralize column data, rather than storing the column names in the base Invoice and the default initialization values in the InitProcessor.

One way it would have solved it is if a Processor operates_on a column that hasn't been initialized yet, the Processor would automatically set the default value at that point of the execution.

This allows us to document in code and write unit tests that check for the list of Processors and fail the unit tests when the ordering is such that a Processor operates_on a column that wasn't initialized by a previous Processor or doesn't have a default initializer.

Even if we had this implemented, how would a unit test have captured the current problem which was due to the ordering of the processors? Because if we wrote the unit tests for checking if the columns exist that would be just on the fake data we are passing to that specific processor. I think the end to end tests should have been updated to capture this because we actually run all processors in a sequence and that would have failed.

It would have exposed the problem because the PISUCreditProcessor would have had the column in the operates_on list that it defines, however none of the preceding processors would have had the column in their initializes_column list. The list of the ordering of the processors is stored in code and it would be easy to have a unit test validate that this case doesn't happen.

If we decide to centralize Column information into a dataclass and have the dataclass provide a default value, we could also have the Processor at this point in the execution set that default value to the column.

naved001 · 2026-04-06T17:22:19Z

@knikolla I am convinced that having columns be dataclasses with associated metadata and defaults is a reasonable path forward.

The list of the ordering of the processors is stored in code and it would be easy to have a unit test validate that this case doesn't happen.

As far as I can tell, this is the only place we have a list of all the processors

https://github.com/CCI-MOC/invoicing/blob/main/process_report/process_report.py#L89-L99

I guess we just need to make sure that the tests use the same list.

knikolla · 2026-04-06T18:18:25Z

@knikolla I am convinced that having columns be dataclasses with associated metadata and defaults is a reasonable path forward.

The list of the ordering of the processors is stored in code and it would be easy to have a unit test validate that this case doesn't happen.

As far as I can tell, this is the only place we have a list of all the processors

https://github.com/CCI-MOC/invoicing/blob/main/process_report/process_report.py#L89-L99

I guess we just need to make sure that the tests use the same list.

Yes, that can be stored into a global variable and the unit test can access the variable.

The end to end test would also probably throw an error since IIRC it does call all the processors.

QuanMPhm · 2026-04-08T03:11:33Z

@naved001 @knikolla I will put a hold on this and review Jimmy's PR to add the Iceberg functionality first, since both issues are critical and will likely influence each other. I'll see what the iceberg implementation may look like before continuing on this.

QuanMPhm · 2026-04-20T18:07:13Z

I will continue work on this PR, now that I've done my iceberg review and draft PR(#292)

QuanMPhm · 2026-04-21T15:40:37Z

@knikolla @naved001 I have overhauled this PR to implement @knikolla's suggestions. I encapsulated columns into a new class, and added new Invoice class attributes, initializes_columns and operates_on_columns. More details in the commit message.

I see that this PR is enormous, and will try to split out the type handling if the PR's too much to review before April invoicing.

I have a question below for consensus on the typing of columns.

A follow-up from this PR would be to cleanup test cases to remove redundant type casting.

naved001 · 2026-04-29T16:31:27Z

@QuanMPhm Maybe I missed some conversation, but why did you add a dedicated initColumnProcessor? This comment by @knikolla explained the reasons why we don't want that: #285 (comment)

QuanMPhm · 2026-04-30T15:12:21Z

@QuanMPhm Maybe I missed some conversation, but why did you add a dedicated initColumnProcessor? This comment by @knikolla explained the reasons why we don't want that: #285 (comment)

@naved001 While rewriting the PR, I decided to make a processor at the start that would check that the input invoices have all the necessary initial columns. This first processor doesn't initialize any columns by itself. It just checks that prerequisite columns exist. I guess a better name would be ValidateInputColumnsProcessor. I just stuck with the old name at the time of writing the PR.

QuanMPhm · 2026-05-19T13:26:52Z

@larsks Since this PR is moreso restructuring the code rather than adding billing logic, I'd appreciate your feedback

larsks

Since this PR is moreso restructuring the code rather than adding billing logic, I'd appreciate your feedback

@QuanMPhm There's nothing here that jumps out at me in particular. I like the standardization of using the InvoiceColumn dataclass to contain column metadata. It looks like you've addressed most of the comments from Kristi and Naved, and the mechanism for handling column initialization seems like an improvement.

naved001

Mostly looks fine to me but a few questions in line. I need to take a deep look at the tests as well. Thanks!

naved001 · 2026-05-26T17:21:32Z

+
+
+@dataclass
+class InitColumnsProcessor(processor.Processor):


Don't you think the name is misleading? It doesn't actually initialize the columns it only casts them. This is a cause of confusion imo

renamed to ValidateInputColumnsProcessor

naved001 · 2026-05-26T17:45:30Z

@@ -0,0 +1,26 @@
+from process_report.process_report import PROCESSING_ORDER
+from process_report.tests.base import BaseTesCase


I guess this typo "BaseTesCase" is everywhere because your IDE saved the day. Could be solved in a different PR if it wasn't introduced in this one.

knikolla

Great work! Only one comment otherwise direction and everything looks good.

naved001 · 2026-05-28T14:45:09Z

+        invoice.GROUP_INSTITUTION_COLUMN,
+        invoice.GROUP_MANAGED_COLUMN,
+        invoice.GROUP_BALANCE_COLUMN,
+        invoice.GROUP_BALANCE_USED_COLUMN,


I realized that a processors will always operate on columns that it initalizes i.e. initializes_columns is a subset of operates_on_columns. You could do this to cut down on the repetition. Just a suggestion.

operates_on_columns = ( *initalizes_columns, invoice.INVOICE_EMAIL_COLUMN, ... ... )

Didn't even realize I could do that

That is called "list unpacking". You often see this when passing a list of arguments to a function that expects multiple arguments. E.g., if I have:

def example(greeting: str, noun: str): print(f"{greeting.title()}, {noun}!")

Then I can do this:

>>> example("hello", "world") Hello, world!

But if I had the arguments in a list, I could do this:

>>> args=["hello", "world"] >>> example(*args)

Using the * operator like that isn't unique to argument lists, which is why you can do things like @naved001 suggested:

>>> list1 = ["cyan", "magenta"] >>> list2 = [*list1, "yellow", "black"] >>> list2 ['cyan', 'magenta', 'yellow', 'black']

Although you can also accomplish the same thing with list addition:

>>> list1 = ["cyan", "magenta"] >>> list2 = list1 + ["yellow", "black"] >>> list2 ['cyan', 'magenta', 'yellow', 'black']

There is also dictionary unpacking through the ** operator. You've probably
seen this in the context of passing keyword arguments to a function from a
dictionary. Given:

def example(name:str|None = None, company:str|None = None): print(f"{name.title()} works at {company.title()}")

We can call it like this:

>>> args={'name': 'Lars', 'company': 'red hat'} >>> example(**args) Lars works at Red Hat

This can also be used for extending dictionaries:

>>> dict1 = {'name': 'lars', 'company': 'redhat'} >>> dict2 = {**dict1, 'cats': 3} >>> dict2 {'name': 'lars', 'company': 'redhat', 'cats': 3}

Or even merging dictionaries:

>>> dict1 = {'name': 'lars', 'company': 'redhat'} >>> dict2 = {'town': 'belmont', 'state': 'ma'} >>> dict3 = {**dict1, **dict2} >>> dict3 {'name': 'lars', 'company': 'redhat', 'town': 'belmont', 'state': 'ma'}

Although since Python 3.9, the more obvious way of doing this would be using the | union operator:

>>> dict1 = {'name': 'lars', 'company': 'redhat'} >>> dict2 = {'town': 'belmont', 'state': 'ma'} >>> dict3 = dict1 | dict2

Thank you for the very thorough tutorial! I saw it used in function calls, not in a list definition before.

QuanMPhm · 2026-06-01T14:15:46Z

@knikolla @naved001 I have responded to all comments.

knikolla · 2026-06-01T15:01:51Z

            dtype={
-                invoice.COST_FIELD: pandas.ArrowDtype(pyarrow.decimal128(21, 2)),
-                invoice.RATE_FIELD: str,
+                invoice.INVOICE_DATE_FIELD: invoice.STRING_FIELD_TYPE,


You have already defined the columns into a InvoiceColumn class containing both name and dtype. Therefore I strongly suggest you use that pairing here too, otherwise you are maintaining 2 copies of what type a column is and those are prone to go out of sync with time.

such as

dtype={ invoice.RATE_COLUMN.name: invoice.RATE_COLUMN.dtype, ... }

knikolla

Given that a warning will be output if a column's dtype is different from the type defined in the InvoiceColumn type, I'm fine with my requested change in the previous review being done in a follow-up PR.

naved001

Just one more thing, but I am approving the PR so merge it once that's addressed.

naved001 · 2026-06-08T20:55:25Z

+        invoice.CLUSTER_NAME_COLUMN,
+        invoice.IS_COURSE_COLUMN,
+        invoice.INSTITUTION_COLUMN,
+    ]


these are lists here and not tuples like everywhere else.

knikolla · 2026-06-09T13:04:53Z

@QuanMPhm before merging, please create issues to track the 2 outstanding requested changes.

During 2026-03 invoicing, a bug was found where the columns initialized by the New-PI credit processor (i.e `PI Balance` column), was being accessed by the PI-SU processor before it was initialized, causing an KeyError. To fix this, the codebase has been refactored to allow each processor to explicitly document which columns they initialize and use, defined in two new properties, `initializes_columns` and `operates_on_columns`. A helper function `_init_columns()` is added to initalize columns Unit test `tests/unit/processors/test_processor_list.py` is added to check each processor only uses columns that itself or previous processors initialized, and no column is initialized more than once Additionally, each column will now be encapsulated as a `InvoiceColumn` instance. `InvoiceColumn` contains the name, datatype, and default values for each column This will also enable stricter and clearer type enforcement for data entering and leaving the pipeline A new processor `ValidateInputColumnsProcessor` is added to check the input dataframe to the processing pipeline has prerequisite columns, and to cast to appropriate types The e2e test data has been updated to surface the bug that was found. It did not failed during the PR that introduced the bug [1] because the test data didn't have the right conditions to trigger the PI-SU processor Refactored unit tests to accomodate the new processor by adding a new base test class. [1] CCI-MOC#279

QuanMPhm · 2026-06-11T18:28:29Z

@knikolla @naved001 I've implemented both of your feedback to spare everyone's time

QuanMPhm requested review from knikolla and naved001 April 2, 2026 19:10

knikolla reviewed Apr 2, 2026

View reviewed changes

knikolla requested changes Apr 2, 2026

View reviewed changes

QuanMPhm mentioned this pull request Apr 20, 2026

Enable exporting monthly invoice to Iceberg #292

Open

QuanMPhm force-pushed the refactor/init_cols branch from e68a019 to 81db774 Compare April 21, 2026 15:35

QuanMPhm requested a review from knikolla April 21, 2026 15:36

QuanMPhm force-pushed the refactor/init_cols branch from 81db774 to 4ee5cb2 Compare April 21, 2026 15:40

QuanMPhm commented Apr 21, 2026

View reviewed changes

Comment thread process_report/invoices/invoice.py

QuanMPhm changed the title ~~Handle init columns in a separate processor~~ Track initialized and used columns in each processor Apr 21, 2026

QuanMPhm mentioned this pull request Apr 21, 2026

Upload 2026-03 invoicing data (and any other data) to iceberg table manually #288

Open

1 task

QuanMPhm mentioned this pull request May 4, 2026

Refactor testing and validation checks to prepare invoicing for complete automation #185

Open

QuanMPhm requested a review from larsks May 19, 2026 13:25

larsks reviewed May 21, 2026

View reviewed changes

naved001 reviewed May 26, 2026

View reviewed changes

QuanMPhm force-pushed the refactor/init_cols branch from 4ee5cb2 to fee4c23 Compare May 27, 2026 22:45

QuanMPhm requested a review from naved001 May 27, 2026 22:45

knikolla requested changes May 28, 2026

View reviewed changes

Comment thread process_report/invoices/invoice.py

naved001 reviewed May 28, 2026

View reviewed changes

QuanMPhm force-pushed the refactor/init_cols branch 2 times, most recently from 0b4cfe0 to 8698a77 Compare June 1, 2026 13:44

QuanMPhm requested review from knikolla and naved001 June 1, 2026 14:15

knikolla requested changes Jun 1, 2026

View reviewed changes

knikolla approved these changes Jun 1, 2026

View reviewed changes

This was referenced Jun 4, 2026

Refactor type handling in invoices #201

Closed

Refactoring how we're handling typing #191

Closed

naved001 approved these changes Jun 8, 2026

View reviewed changes

QuanMPhm force-pushed the refactor/init_cols branch from 8698a77 to 30f34a6 Compare June 11, 2026 18:28

QuanMPhm merged commit 5c0f578 into CCI-MOC:main Jun 11, 2026
6 checks passed



		@dataclass
		class PISUCreditProcessor(discount_processor.DiscountProcessor):

		@@ -0,0 +1,26 @@
		from process_report.process_report import PROCESSING_ORDER
		from process_report.tests.base import BaseTesCase

Uh oh!

Conversation

QuanMPhm commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

knikolla commented Apr 2, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knikolla commented Apr 2, 2026

Uh oh!

naved001 commented Apr 3, 2026

Uh oh!

knikolla commented Apr 3, 2026

Uh oh!

naved001 commented Apr 6, 2026

Uh oh!

knikolla commented Apr 6, 2026

Uh oh!

naved001 commented Apr 6, 2026

Uh oh!

knikolla commented Apr 6, 2026

Uh oh!

QuanMPhm commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

QuanMPhm commented Apr 20, 2026

Uh oh!

QuanMPhm commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

naved001 commented Apr 29, 2026

Uh oh!

QuanMPhm commented Apr 30, 2026

Uh oh!

QuanMPhm commented May 19, 2026

Uh oh!

larsks left a comment

Choose a reason for hiding this comment

Uh oh!

naved001 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knikolla left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

QuanMPhm commented Jun 1, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knikolla left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

naved001 left a comment

Choose a reason for hiding this comment

QuanMPhm commented Apr 2, 2026 •

edited

Loading

QuanMPhm commented Apr 8, 2026 •

edited

Loading

QuanMPhm commented Apr 21, 2026 •

edited

Loading

knikolla left a comment •

edited

Loading