Track initialized and used columns in each processor #285
@QuanMPhm Does this mean that the InitColumnsProcessor needs to know about ALL the processors? |
| @dataclass | ||
| class PISUCreditProcessor(discount_processor.DiscountProcessor): |
Why is this called PISUCreditProcessor?
| ] | ||
| class BaseTesCase(TestCase): |
I guess this is inevitable since we want to also upload the resulting DataFrame into Iceberg, which means we need to centralize the listing of columns, their initialization and migration. I don't think a processor is the best way to implement this initialization. Please think about a solution that also extends the Iceberg table when a new column is introduced. |
@knikolla Why not? I was just trying to think where else could this happen, and I guess when the CSVs are merged into the pandas dataframe before the processors run. But I want to understand your rationale behind it not being a processor. |
@naved001 some quick thoughts.
|
|
@knikolla Thanks for the explanation! It makes a lot of sense to not have a processor just do the init. Just a couple of thoughts/questions.
Wouldn't the processors only initialize these columns when they are called? I don't understand how this would have solved the current issue.
Even if we had this implemented, how would a unit test have captured the current problem which was due to the ordering of the processors? Because if we wrote the unit tests for checking if the columns exist that would be just on the fake data we are passing to that specific processor. I think the end to end tests should have been updated to capture this because we actually run all processors in a sequence and that would have failed. |
This by itself wouldn't have solved the current issue. It would have provided a better place to centralize column data, rather than storing the column names in the base Invoice and the default initialization values in the InitProcessor. One way it would have solved it is if a Processor
It would have exposed the problem because the If we decide to centralize Column information into a dataclass and have the dataclass provide a default value, we could also have the Processor at this point in the execution set that default value to the column. |
|
@knikolla I am convinced that having columns be dataclasses with associated metadata and defaults is a reasonable path forward.
As far as I can tell, this is the only place we have a list of all the processors https://github.com/CCI-MOC/invoicing/blob/main/process_report/process_report.py#L89-L99 I guess we just need to make sure that the tests use the same list. |
Yes, that can be stored into a global variable and the unit test can access the variable. The end to end test would also probably throw an error since IIRC it does call all the processors. |
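A minimal sketch of the ordering check discussed above, assuming each processor exposes `initializes_columns` and `operates_on_columns` tuples and that the ordered list of processors is available to the test. `FakeProcessor`, `check_processing_order`, and the column names are hypothetical and not taken from the repository; columns are plain strings here for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeProcessor:
    """Hypothetical stand-in for a real processor class."""

    name: str
    initializes_columns: tuple[str, ...] = ()
    operates_on_columns: tuple[str, ...] = ()


def check_processing_order(processors, input_columns=()):
    """Return a list of problems found when walking the processors in order."""
    available = set(input_columns)
    problems = []
    for proc in processors:
        # A column may only be initialized once across the whole pipeline.
        for col in proc.initializes_columns:
            if col in available:
                problems.append(f"{proc.name} re-initializes {col!r}")
        available.update(proc.initializes_columns)
        # Every column a processor touches must exist by the time it runs.
        for col in proc.operates_on_columns:
            if col not in available:
                problems.append(f"{proc.name} uses {col!r} before it is initialized")
    return problems


new_pi = FakeProcessor("NewPICredit", ("PI Balance",), ("PI Balance", "Cost"))
pi_su = FakeProcessor("PISUCredit", (), ("PI Balance",))

good_order = check_processing_order([new_pi, pi_su], input_columns=("Cost",))
bad_order = check_processing_order([pi_su, new_pi], input_columns=("Cost",))
```

With the processors in the right order no problems are reported; swapping them reports that the PI-SU stand-in uses `PI Balance` before it is initialized, which is the kind of failure a unit test over the shared processor list could catch.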
|
@naved001 @knikolla I will put a hold on this and review Jimmy's PR to add the Iceberg functionality first, since both issues are critical and will likely influence each other. I'll see what the iceberg implementation may look like before continuing on this. |
|
I will continue work on this PR, now that I've done my iceberg review and draft PR(#292) |
e68a019 to
81db774
81db774 to
4ee5cb2
|
@knikolla @naved001 I have overhauled this PR to implement @knikolla's suggestions. I encapsulated columns into a new class, and added new I see that this PR is enormous, and will try to split out the type handling if the PR's too much to review before April invoicing. I have a question below for consensus on the typing of columns. A follow-up from this PR would be to cleanup test cases to remove redundant type casting. |
|
@QuanMPhm Maybe I missed some conversation, but why did you add a dedicated initColumnProcessor? This comment by @knikolla explained the reasons why we don't want that: #285 (comment) |
@naved001 While rewriting the PR, I decided to make a processor at the start that would check that the input invoices have all the necessary initial columns. This first processor doesn't initialize any columns by itself. It just checks that prerequisite columns exist. I guess a better name would be |
|
@larsks Since this PR is more about restructuring the code than adding billing logic, I'd appreciate your feedback
larsks
left a comment
Since this PR is more about restructuring the code than adding billing logic, I'd appreciate your feedback
@QuanMPhm There's nothing here that jumps out at me in particular. I like the standardization of using the InvoiceColumn dataclass to contain column metadata. It looks like you've addressed most of the comments from Kristi and Naved, and the mechanism for handling column initialization seems like an improvement.
naved001
left a comment
Mostly looks fine to me but a few questions in line. I need to take a deep look at the tests as well. Thanks!
|
|
||
|
|
||
| @dataclass | ||
| class InitColumnsProcessor(processor.Processor): |
Don't you think the name is misleading? It doesn't actually initialize the columns; it only casts them. This is a cause of confusion, imo.
renamed to ValidateInputColumnsProcessor
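As a rough sketch of what a prerequisite-column check like this might look like, assuming the processor is handed the column names of the input DataFrame (for example `df.columns`). The `missing_columns` and `validate_input_columns` helpers and the column names are hypothetical, not the PR's implementation, and the type casting step is left out.

```python
def missing_columns(present, required):
    """Return required column names absent from `present`, in required order."""
    present_set = set(present)
    return [col for col in required if col not in present_set]


def validate_input_columns(present, required):
    """Raise early, with a clear message, instead of a KeyError mid-pipeline."""
    missing = missing_columns(present, required)
    if missing:
        raise ValueError(f"Input invoice is missing required columns: {missing}")


# Illustrative column names only.
REQUIRED = ("Invoice Month", "Project - Allocation", "Cost")

# Passes silently: extra columns in the input are allowed.
validate_input_columns(
    ["Invoice Month", "Project - Allocation", "Cost", "Extra"], REQUIRED
)
```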
| @@ -0,0 +1,26 @@ | |||
| from process_report.process_report import PROCESSING_ORDER | |||
| from process_report.tests.base import BaseTesCase | |||
I guess this typo "BaseTesCase" is everywhere because your IDE saved the day. Could be solved in a different PR if it wasn't introduced in this one.
4ee5cb2 to
fee4c23
knikolla
left a comment
Great work! Only one comment; otherwise the direction and everything else look good.
| invoice.GROUP_INSTITUTION_COLUMN, | ||
| invoice.GROUP_MANAGED_COLUMN, | ||
| invoice.GROUP_BALANCE_COLUMN, | ||
| invoice.GROUP_BALANCE_USED_COLUMN, |
I realized that a processor will always operate on the columns that it initializes, i.e. initializes_columns is a subset of operates_on_columns. You could do this to cut down on the repetition. Just a suggestion.
operates_on_columns = (
    *initializes_columns,
    invoice.INVOICE_EMAIL_COLUMN,
    ...
    ...
)
Didn't even realize I could do that
That is called "list unpacking". You often see this when passing a list of arguments to a function that expects multiple arguments. E.g., if I have:

def example(greeting: str, noun: str):
    print(f"{greeting.title()}, {noun}!")

Then I can do this:

>>> example("hello", "world")
Hello, world!

But if I had the arguments in a list, I could do this:

>>> args=["hello", "world"]
>>> example(*args)

Using the * operator like that isn't unique to argument lists, which is why you can do things like @naved001 suggested:

>>> list1 = ["cyan", "magenta"]
>>> list2 = [*list1, "yellow", "black"]
>>> list2
['cyan', 'magenta', 'yellow', 'black']

Although you can also accomplish the same thing with list addition:

>>> list1 = ["cyan", "magenta"]
>>> list2 = list1 + ["yellow", "black"]
>>> list2
['cyan', 'magenta', 'yellow', 'black']

There is also dictionary unpacking through the ** operator. You've probably seen this in the context of passing keyword arguments to a function from a dictionary. Given:

def example(name:str|None = None, company:str|None = None):
    print(f"{name.title()} works at {company.title()}")

We can call it like this:

>>> args={'name': 'Lars', 'company': 'red hat'}
>>> example(**args)
Lars works at Red Hat

This can also be used for extending dictionaries:

>>> dict1 = {'name': 'lars', 'company': 'redhat'}
>>> dict2 = {**dict1, 'cats': 3}
>>> dict2
{'name': 'lars', 'company': 'redhat', 'cats': 3}

Or even merging dictionaries:

>>> dict1 = {'name': 'lars', 'company': 'redhat'}
>>> dict2 = {'town': 'belmont', 'state': 'ma'}
>>> dict3 = {**dict1, **dict2}
>>> dict3
{'name': 'lars', 'company': 'redhat', 'town': 'belmont', 'state': 'ma'}

Although since Python 3.9, the more obvious way of doing this would be using the | union operator:

>>> dict1 = {'name': 'lars', 'company': 'redhat'}
>>> dict2 = {'town': 'belmont', 'state': 'ma'}
>>> dict3 = dict1 | dict2
Thank you for the very thorough tutorial! I had seen it used in function calls before, but not in a list definition.
0b4cfe0 to
8698a77
| dtype={ | ||
| invoice.COST_FIELD: pandas.ArrowDtype(pyarrow.decimal128(21, 2)), | ||
| invoice.RATE_FIELD: str, | ||
| invoice.INVOICE_DATE_FIELD: invoice.STRING_FIELD_TYPE, |
You have already defined the columns in an InvoiceColumn class containing both name and dtype. Therefore I strongly suggest you use that pairing here too; otherwise you are maintaining two copies of each column's type, and those are prone to go out of sync over time.
Such as:
dtype={
invoice.RATE_COLUMN.name: invoice.RATE_COLUMN.dtype,
...
}
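One way to avoid writing the pairs out by hand is to build the mapping from the column definitions themselves. This is only a sketch: the `InvoiceColumn` below is a simplified stand-in whose `default` field name is an assumption, the column names and `INPUT_COLUMNS` tuple are hypothetical, and plain Python types stand in for the pandas/pyarrow dtypes used in the PR.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class InvoiceColumn:
    """Simplified stand-in; the real class in the PR may differ."""

    name: str
    dtype: Any
    default: Any = None  # hypothetical field name


# Illustrative column definitions only.
RATE_COLUMN = InvoiceColumn("Rate", str)
IS_COURSE_COLUMN = InvoiceColumn("Is Course", bool, default=False)

INPUT_COLUMNS = (RATE_COLUMN, IS_COURSE_COLUMN)

# Build the name-to-dtype mapping from the single source of truth.
dtype = {col.name: col.dtype for col in INPUT_COLUMNS}
```

The resulting dictionary could then be passed wherever the existing `dtype={...}` argument is used, so a column's type is declared in exactly one place.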
naved001
left a comment
Just one more thing, but I am approving the PR so merge it once that's addressed.
| invoice.CLUSTER_NAME_COLUMN, | ||
| invoice.IS_COURSE_COLUMN, | ||
| invoice.INSTITUTION_COLUMN, | ||
| ] |
these are lists here and not tuples like everywhere else.
|
@QuanMPhm before merging, please create issues to track the 2 outstanding requested changes. |
During 2026-03 invoicing, a bug was found where a column initialized by the New-PI credit processor (i.e. the `PI Balance` column) was being accessed by the PI-SU processor before it was initialized, causing a KeyError.

To fix this, the codebase has been refactored so that each processor explicitly documents which columns it initializes and uses, defined in two new properties, `initializes_columns` and `operates_on_columns`. A helper function `_init_columns()` is added to initialize columns.

Unit test `tests/unit/processors/test_processor_list.py` is added to check that each processor only uses columns that it or previous processors initialized, and that no column is initialized more than once.

Additionally, each column is now encapsulated as an `InvoiceColumn` instance. `InvoiceColumn` contains the name, datatype, and default value for each column. This will also enable stricter and clearer type enforcement for data entering and leaving the pipeline.

A new processor, `ValidateInputColumnsProcessor`, is added to check that the input dataframe to the processing pipeline has the prerequisite columns, and to cast them to the appropriate types.

The e2e test data has been updated to surface the bug that was found. It did not fail during the PR that introduced the bug [1] because the test data didn't have the right conditions to trigger the PI-SU processor.

Refactored unit tests to accommodate the new processor by adding a new base test class.

[1] CCI-MOC#279
8698a77 to
30f34a6
Closes #284. During 2026-03 invoicing, a bug was found where a column initialized by the New-PI credit processor (i.e. the `PI Balance` column) was being accessed by the PI-SU processor before it was initialized, causing a KeyError.
More details in the commit message.
[1] #279