Skip to content

Track initialized and used columns in each processor#285

Merged
QuanMPhm merged 1 commit into
CCI-MOC:mainfrom
QuanMPhm:refactor/init_cols
Jun 11, 2026
Merged

Track initialized and used columns in each processor#285
QuanMPhm merged 1 commit into
CCI-MOC:mainfrom
QuanMPhm:refactor/init_cols

Conversation

@QuanMPhm

@QuanMPhm QuanMPhm commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Closes #284. During 2026-03 invoicing, a bug was found where the columns initialized by the New-PI credit processor (i.e PI Balance column), was being accessed by the PI-SU processor before it was initialized, causing an KeyError.

More details in the commit message

[1] #279

@QuanMPhm QuanMPhm requested review from knikolla and naved001 April 2, 2026 19:10
@knikolla

knikolla commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

@QuanMPhm Does this mean that the InitColumnsProcessor needs to know about ALL the processors?



@dataclass
class PISUCreditProcessor(discount_processor.DiscountProcessor):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this called PISUCreditProcessor?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad

Comment thread process_report/tests/base.py Outdated
]


class BaseTesCase(TestCase):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

@knikolla

knikolla commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

@QuanMPhm Does this mean that the InitColumnsProcessor needs to know about ALL the processors?

I guess this is inevitable since we want to also upload the resulting DataFrame into Iceberg, which means we need to centralize the listing of columns, their initialization and migration.

I don't think a processor is the best way to implement this initialization. Please think about a solution that also extends the Iceberg table when a new column is introduced.

@naved001

naved001 commented Apr 3, 2026

Copy link
Copy Markdown
Collaborator

I don't think a processor is the best way to implement this initialization.

@knikolla Why not? I was just trying to think where else could this happen, and I guess when the CSVs are merged into the pandas dataframe before the processors run. But I want to understand your rationale behind it not being a processor.

@knikolla

knikolla commented Apr 3, 2026

Copy link
Copy Markdown
Contributor

I don't think a processor is the best way to implement this initialization.

@knikolla Why not? I was just trying to think where else could this happen, and I guess when the CSVs are merged into the pandas dataframe before the processors run. But I want to understand your rationale behind it not being a processor.

@naved001 some quick thoughts.

  • The Processors are an attempt at organizing business logic into respective modules. This Processor wouldn't really contain any business logic and just perform common initialization. This strongly feels more like a base class function than a specific processor.
  • Currently, each Processor already may introduce new columns, and these columns are initialized by the Processor that introduces them. This has the added benefit of providing ownership over that column and additional context and documentation about how the column is initialized and used for by the surrounding code.
  • The introduction of new columns creates ordering concerns but this ordering concern isn't fixed by the introduction of a column initializer. The business logic is what places the constraint on the ordering of Processors, not the initialization of the columns. This bug merely highlighted that as part of the PR that introduced the PISUCreditProcessor the initialization of that specific column should have either been moved from the NewPICreditProcessor to the PISUCreditProcessor (it now being the first one to use the column) or the ordering of the 2 processors changed (if it wouldn't make a difference to the business logic).
  • We are already storing the entire list of columns in the base Invoice class. We can introduce more metadata to that list (for example by making each of the columns a dataclass containing Column Name, Type, Default Value or Initializer whatever you think is beneficial and then extend the Processors to have a initializes_columns: List[Column] and operates_on_columns: List[Column].
  • This allows us to document in code and write unit tests that check for the list of Processors and fail the unit tests when the ordering is such that a Processor operates_on a column that wasn't initialized by a previous Processor or doesn't have a default initializer.
  • This also allows us to include within the Column definition migration logic for the Iceberg table when a column is introduced.

@naved001

naved001 commented Apr 6, 2026

Copy link
Copy Markdown
Collaborator

@knikolla Thanks for the explanation! It makes a lot of sense to not have a processor just do the init. Just a couple of thoughts/questions.

We are already storing the entire list of columns in the base Invoice class. We can introduce more metadata to that list (for example by making each of the columns a dataclass containing Column Name, Type, Default Value or Initializer whatever you think is beneficial and then extend the Processors to have a initializes_columns: List[Column] and operates_on_columns: List[Column].

Wouldn't the processors only initialize these columns when they are called? I don't understand how this would have solved the current issue.

This allows us to document in code and write unit tests that check for the list of Processors and fail the unit tests when the ordering is such that a Processor operates_on a column that wasn't initialized by a previous Processor or doesn't have a default initializer.

Even if we had this implemented, how would a unit test have captured the current problem which was due to the ordering of the processors? Because if we wrote the unit tests for checking if the columns exist that would be just on the fake data we are passing to that specific processor. I think the end to end tests should have been updated to capture this because we actually run all processors in a sequence and that would have failed.

@knikolla

knikolla commented Apr 6, 2026

Copy link
Copy Markdown
Contributor

@knikolla Thanks for the explanation! It makes a lot of sense to not have a processor just do the init. Just a couple of thoughts/questions.

We are already storing the entire list of columns in the base Invoice class. We can introduce more metadata to that list (for example by making each of the columns a dataclass containing Column Name, Type, Default Value or Initializer whatever you think is beneficial and then extend the Processors to have a initializes_columns: List[Column] and operates_on_columns: List[Column].

Wouldn't the processors only initialize these columns when they are called? I don't understand how this would have solved the current issue.

This by itself wouldn't have solved the current issue. It would have provided a better place to centralize column data, rather than storing the column names in the base Invoice and the default initialization values in the InitProcessor.

One way it would have solved it is if a Processor operates_on a column that hasn't been initialized yet, the Processor would automatically set the default value at that point of the execution.

This allows us to document in code and write unit tests that check for the list of Processors and fail the unit tests when the ordering is such that a Processor operates_on a column that wasn't initialized by a previous Processor or doesn't have a default initializer.

Even if we had this implemented, how would a unit test have captured the current problem which was due to the ordering of the processors? Because if we wrote the unit tests for checking if the columns exist that would be just on the fake data we are passing to that specific processor. I think the end to end tests should have been updated to capture this because we actually run all processors in a sequence and that would have failed.

It would have exposed the problem because the PISUCreditProcessor would have had the column in the operates_on list that it defines, however none of the preceding processors would have had the column in their initializes_column list. The list of the ordering of the processors is stored in code and it would be easy to have a unit test validate that this case doesn't happen.

If we decide to centralize Column information into a dataclass and have the dataclass provide a default value, we could also have the Processor at this point in the execution set that default value to the column.

@naved001

naved001 commented Apr 6, 2026

Copy link
Copy Markdown
Collaborator

@knikolla I am convinced that having columns be dataclasses with associated metadata and defaults is a reasonable path forward.

The list of the ordering of the processors is stored in code and it would be easy to have a unit test validate that this case doesn't happen.

As far as I can tell, this is the only place we have a list of all the processors

https://github.com/CCI-MOC/invoicing/blob/main/process_report/process_report.py#L89-L99

I guess we just need to make sure that the tests use the same list.

@knikolla

knikolla commented Apr 6, 2026

Copy link
Copy Markdown
Contributor

@knikolla I am convinced that having columns be dataclasses with associated metadata and defaults is a reasonable path forward.

The list of the ordering of the processors is stored in code and it would be easy to have a unit test validate that this case doesn't happen.

As far as I can tell, this is the only place we have a list of all the processors

https://github.com/CCI-MOC/invoicing/blob/main/process_report/process_report.py#L89-L99

I guess we just need to make sure that the tests use the same list.

Yes, that can be stored into a global variable and the unit test can access the variable.

The end to end test would also probably throw an error since IIRC it does call all the processors.

@QuanMPhm

QuanMPhm commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

@naved001 @knikolla I will put a hold on this and review Jimmy's PR to add the Iceberg functionality first, since both issues are critical and will likely influence each other. I'll see what the iceberg implementation may look like before continuing on this.

@QuanMPhm

Copy link
Copy Markdown
Contributor Author

I will continue work on this PR, now that I've done my iceberg review and draft PR(#292)

@QuanMPhm QuanMPhm force-pushed the refactor/init_cols branch from e68a019 to 81db774 Compare April 21, 2026 15:35
@QuanMPhm QuanMPhm requested a review from knikolla April 21, 2026 15:36
@QuanMPhm QuanMPhm force-pushed the refactor/init_cols branch from 81db774 to 4ee5cb2 Compare April 21, 2026 15:40
@QuanMPhm

QuanMPhm commented Apr 21, 2026

Copy link
Copy Markdown
Contributor Author

@knikolla @naved001 I have overhauled this PR to implement @knikolla's suggestions. I encapsulated columns into a new class, and added new Invoice class attributes, initializes_columns and operates_on_columns. More details in the commit message.

I see that this PR is enormous, and will try to split out the type handling if the PR's too much to review before April invoicing.

I have a question below for consensus on the typing of columns.

A follow-up from this PR would be to cleanup test cases to remove redundant type casting.

Comment thread process_report/invoices/invoice.py
@QuanMPhm QuanMPhm changed the title Handle init columns in a separate processor Track initialized and used columns in each processor Apr 21, 2026
@naved001

Copy link
Copy Markdown
Collaborator

@QuanMPhm Maybe I missed some conversation, but why did you add a dedicated initColumnProcessor? This comment by @knikolla explained the reasons why we don't want that: #285 (comment)

@QuanMPhm

Copy link
Copy Markdown
Contributor Author

@QuanMPhm Maybe I missed some conversation, but why did you add a dedicated initColumnProcessor? This comment by @knikolla explained the reasons why we don't want that: #285 (comment)

@naved001 While rewriting the PR, I decided to make a processor at the start that would check that the input invoices have all the necessary initial columns. This first processor doesn't initialize any columns by itself. It just checks that prerequisite columns exist. I guess a better name would be ValidateInputColumnsProcessor. I just stuck with the old name at the time of writing the PR.

@QuanMPhm

Copy link
Copy Markdown
Contributor Author

@larsks Since this PR is moreso restructuring the code rather than adding billing logic, I'd appreciate your feedback

@larsks larsks left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR is moreso restructuring the code rather than adding billing logic, I'd appreciate your feedback

@QuanMPhm There's nothing here that jumps out at me in particular. I like the standardization of using the InvoiceColumn dataclass to contain column metadata. It looks like you've addressed most of the comments from Kristi and Naved, and the mechanism for handling column initialization seems like an improvement.

@naved001 naved001 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks fine to me but a few questions in line. I need to take a deep look at the tests as well. Thanks!

Comment thread process_report/invoices/invoice.py
Comment thread process_report/invoices/invoice.py
Comment thread process_report/processors/validate_input_column_processor.py


@dataclass
class InitColumnsProcessor(processor.Processor):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you think the name is misleading? It doesn't actually initialize the columns it only casts them. This is a cause of confusion imo

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to ValidateInputColumnsProcessor

@@ -0,0 +1,26 @@
from process_report.process_report import PROCESSING_ORDER
from process_report.tests.base import BaseTesCase

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this typo "BaseTesCase" is everywhere because your IDE saved the day. Could be solved in a different PR if it wasn't introduced in this one.

@QuanMPhm QuanMPhm force-pushed the refactor/init_cols branch from 4ee5cb2 to fee4c23 Compare May 27, 2026 22:45
@QuanMPhm QuanMPhm requested a review from naved001 May 27, 2026 22:45

@knikolla knikolla left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Only one comment otherwise direction and everything looks good.

Comment thread process_report/invoices/invoice.py
invoice.GROUP_INSTITUTION_COLUMN,
invoice.GROUP_MANAGED_COLUMN,
invoice.GROUP_BALANCE_COLUMN,
invoice.GROUP_BALANCE_USED_COLUMN,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that a processors will always operate on columns that it initalizes i.e. initializes_columns is a subset of operates_on_columns. You could do this to cut down on the repetition. Just a suggestion.

operates_on_columns = (
    *initalizes_columns,
    invoice.INVOICE_EMAIL_COLUMN,
	...
	...
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't even realize I could do that

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is called "list unpacking". You often see this when passing a list of arguments to a function that expects multiple arguments. E.g., if I have:

def example(greeting: str, noun: str):
  print(f"{greeting.title()}, {noun}!")

Then I can do this:

>>> example("hello", "world")
Hello, world!

But if I had the arguments in a list, I could do this:

>>> args=["hello", "world"]
>>> example(*args)

Using the * operator like that isn't unique to argument lists, which is why you can do things like @naved001 suggested:

>>> list1 = ["cyan", "magenta"]
>>> list2 = [*list1, "yellow", "black"]
>>> list2
['cyan', 'magenta', 'yellow', 'black']

Although you can also accomplish the same thing with list addition:

>>> list1 = ["cyan", "magenta"]
>>> list2 = list1 + ["yellow", "black"]
>>> list2
['cyan', 'magenta', 'yellow', 'black']

There is also dictionary unpacking through the ** operator. You've probably
seen this in the context of passing keyword arguments to a function from a
dictionary. Given:

def example(name:str|None = None, company:str|None = None):
  print(f"{name.title()} works at {company.title()}")

We can call it like this:

>>> args={'name': 'Lars', 'company': 'red hat'}
>>> example(**args)
Lars works at Red Hat

This can also be used for extending dictionaries:

>>> dict1 = {'name': 'lars', 'company': 'redhat'}
>>> dict2 = {**dict1, 'cats': 3}
>>> dict2
{'name': 'lars', 'company': 'redhat', 'cats': 3}

Or even merging dictionaries:

>>> dict1 = {'name': 'lars', 'company': 'redhat'}
>>> dict2 = {'town': 'belmont', 'state': 'ma'}
>>> dict3 = {**dict1, **dict2}
>>> dict3
{'name': 'lars', 'company': 'redhat', 'town': 'belmont', 'state': 'ma'}

Although since Python 3.9, the more obvious way of doing this would be using the | union operator:

>>> dict1 = {'name': 'lars', 'company': 'redhat'}
>>> dict2 = {'town': 'belmont', 'state': 'ma'}
>>> dict3 = dict1 | dict2

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the very thorough tutorial! I saw it used in function calls, not in a list definition before.

Comment thread process_report/tests/e2e/test_data/test_pi.yaml
@QuanMPhm QuanMPhm force-pushed the refactor/init_cols branch 2 times, most recently from 0b4cfe0 to 8698a77 Compare June 1, 2026 13:44
@QuanMPhm QuanMPhm requested review from knikolla and naved001 June 1, 2026 14:15
@QuanMPhm

QuanMPhm commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

@knikolla @naved001 I have responded to all comments.

Comment thread process_report/process_report.py Outdated
dtype={
invoice.COST_FIELD: pandas.ArrowDtype(pyarrow.decimal128(21, 2)),
invoice.RATE_FIELD: str,
invoice.INVOICE_DATE_FIELD: invoice.STRING_FIELD_TYPE,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have already defined the columns into a InvoiceColumn class containing both name and dtype. Therefore I strongly suggest you use that pairing here too, otherwise you are maintaining 2 copies of what type a column is and those are prone to go out of sync with time.

such as

dtype={
    invoice.RATE_COLUMN.name: invoice.RATE_COLUMN.dtype,
...
}

@knikolla knikolla left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that a warning will be output if a column's dtype is different from the type defined in the InvoiceColumn type, I'm fine with my requested change in the previous review being done in a follow-up PR.

@naved001 naved001 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more thing, but I am approving the PR so merge it once that's addressed.

invoice.CLUSTER_NAME_COLUMN,
invoice.IS_COURSE_COLUMN,
invoice.INSTITUTION_COLUMN,
]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are lists here and not tuples like everywhere else.

@knikolla

knikolla commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@QuanMPhm before merging, please create issues to track the 2 outstanding requested changes.

During 2026-03 invoicing, a bug was found where the columns initialized
by the New-PI credit processor (i.e `PI Balance` column), was
being accessed by the PI-SU processor before it was
initialized, causing an KeyError.

To fix this, the codebase has been refactored to allow each processor to
explicitly document which columns they initialize and use, defined in two
new properties, `initializes_columns` and `operates_on_columns`. A helper
function `_init_columns()` is added to initalize columns

Unit test `tests/unit/processors/test_processor_list.py` is added to check each processor
only uses columns that itself or previous processors initialized, and no
column is initialized more than once

Additionally, each column will now be encapsulated as a `InvoiceColumn` instance.
`InvoiceColumn` contains the name, datatype, and default values for each column
This will also enable stricter and clearer type enforcement for data entering
and leaving the pipeline

A new processor `ValidateInputColumnsProcessor` is added to check the input
dataframe to the processing pipeline has prerequisite columns, and to cast
to appropriate types

The e2e test data has been updated to surface the bug that was found.
It did not failed during the PR that introduced the bug [1] because
the test data didn't have the right conditions to trigger the PI-SU processor

Refactored unit tests to accomodate the new
processor by adding a new base test class.

[1] CCI-MOC#279
@QuanMPhm

Copy link
Copy Markdown
Contributor Author

@knikolla @naved001 I've implemented both of your feedback to spare everyone's time

@QuanMPhm QuanMPhm force-pushed the refactor/init_cols branch from 8698a77 to 30f34a6 Compare June 11, 2026 18:28
@QuanMPhm QuanMPhm merged commit 5c0f578 into CCI-MOC:main Jun 11, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Resolve bug with PI-SU processor

4 participants