Resolved pandas deprecation warnings during pipeline run#198
Conversation
4c9624a to
0816bbe
Compare
There was a problem hiding this comment.
@jimmysway This is a good first draft! Before I give further feedback, I should mention that while reviewing this, I realized I am planning a bigger PR that would make your PR redundant. You should close this PR.
Nevertheless, I've made a few comments below on things you should take note of when submitting future PRs. Besides those comments, I'll say that this PR should have only been 1 commit.
When submitting PRs, each commit should be a self-contained set of changes. If a PR contains more than one commit, that either means the solution did require multiple distinct changes, or that the issue's scope is too big and should be split up into smaller PRs. All of your commits have been to solve the same issue, and they were sets of changes that depended on each to work. You could also add more detail in your commit message. I would recommend that you see how commit messages are structured in this repo.
There was a problem hiding this comment.
The purpose of this PR is to resolve FutureWarnings when exporting PI PDFs. While the test cases may have FutureWarnings problems too, these are out of scope for this issue. It's usually best that you divide your changes into small, modular parts, with clearly defined scope, to make it easier for reviewers to review your changes.
If you want to resolve the FutureWarnings in the test cases, feel free to make another PR addressing it
There was a problem hiding this comment.
You have not revert the changes in this file yet
There was a problem hiding this comment.
The additions to the README are not related to the issue being addressed. I would suggest that you make a new PR to update the README.
|
@jimmysway It does not seem my refactoring PR will fully resolve the |
QuanMPhm
left a comment
There was a problem hiding this comment.
Here more my more in-depth comments
| new_row[col] = val | ||
|
|
||
| # Convert all columns to object type before concatenation to avoid dtype warnings | ||
| pi_projects = pi_projects.astype("object") |
There was a problem hiding this comment.
You should not typecast the dataframe that will be exported into object type. Doing so means there will be no type warning or enforcement at all, which we still want. It would also remove the 2 decimal precision that we want on our money columns, so the exported PDFs would have inconsistent precision.
I would suggest instead that you cast the "totals row" to be the same type as pi_projects instead:
| pi_projects = pi_projects.astype("object") | |
| new_row = pandas.DataFrame(data={col: [None] for col in pi_projects.columns}) | |
| new_row = new_row.astype(pi_projects.dtypes) | |
| pi_projects = pandas.concat([pi_projects, new_row], ignore_index=True) | |
| pi_projects.loc[pi_projects.index[-1], invoice.INVOICE_DATE_FIELD] = "Total" | |
| pi_projects.loc[pi_projects.index[-1], sum_columns_list] = column_sums | |
| ... |
There was a problem hiding this comment.
I would also suggest a more informative name than new_row. totals_row might be better?
| for col in pi_projects.columns: | ||
| # First ensure all columns are object type | ||
| if pi_projects[col].dtype.name.startswith(("float", "int")): | ||
| pi_projects[col] = pi_projects[col].astype("object") |
There was a problem hiding this comment.
You should cast to Pandas' built-in StringDType instead. It is stricter than object and has other compatibilities with Pandas.
It should also be fine to cast the entire dataframe to StringDtype. The entire dataframe will be rendered as html text anyways:
| pi_projects[col] = pi_projects[col].astype("object") | |
| pi_projects = pi_projects.astype(invoice.STRING_FIELD_TYPE) | |
| pi_projects.fillna("", inplace=True) |
| None # Adds a new row to end of dataframe initialized with None | ||
|
|
||
| # Create a new row with proper dtypes | ||
| new_row = {col: None for col in pi_projects.columns} |
There was a problem hiding this comment.
Within this repo and many of our other ones, our team generally only writes comments when the rationale for a piece of code is not immediately obvious. This is usually when a piece of code is quite complex, structured weirdly, or its purpose not clear. In which case, the comment would explain WHY the code is there.
If the code is self-explanatory, in the case of this line, it's not helpful to have a comment explaining what the code does. I would suggest you remove this comment.
0cd41e6 to
5b5c094
Compare
5b5c094 to
33c2665
Compare
QuanMPhm
left a comment
There was a problem hiding this comment.
Aside from the comments below, could you modify the unit test for this invoice so that if a FutureWarning warning is raised, the test fails? I tested your changes on my local computer and still saw the warnings
| pi_projects.loc[len(pi_projects)] = None | ||
|
|
||
| # Set Invoice Month and totals - add Invoice Month column if it doesn't exist | ||
| if invoice.INVOICE_DATE_FIELD not in pi_projects.columns: |
There was a problem hiding this comment.
All of the invoices passed to our billing pipeline should have an Invoice Month column. All of our Processor and Invoice subclass assumes certain columns are already present, and don't check if certain columns exists or not. Is there a specific reason why you added this check?
There was a problem hiding this comment.
You have not revert the changes in this file yet
|
|
||
| pi_projects.loc[pi_projects.index[-1], invoice.INVOICE_DATE_FIELD] = "Total" | ||
| pi_projects.loc[pi_projects.index[-1], sum_columns_list] = column_sums | ||
| for col, val in zip(sum_columns_list, column_sums): |
There was a problem hiding this comment.
Is there a reason why you decided to use zip in this statement?
| pi_instituition = pi_dataframe[invoice.INSTITUTION_FIELD].iat[0] | ||
|
|
||
| # Convert to StringDtype for HTML rendering | ||
| pi_dataframe = pi_dataframe.astype(pandas.StringDtype()) |
There was a problem hiding this comment.
This statement should be placed in _get_pi_dataframe().
There was a problem hiding this comment.
You have not moved this yet, and I suggested that you move it before the pi_projects = pi_projects.fillna(""). I believe that would resolve the warning emitted from that statement. Feel free to respond if you were following a different approach
There was a problem hiding this comment.
Oh yeah, I did this and one of the unit tests failed because it was manually generating the expected types and so I put this here instead to avoid changing the unit tests but I can move it and then change the unit tests
0741c9c to
ae8ff45
Compare
QuanMPhm
left a comment
There was a problem hiding this comment.
Aside from my comments below, could you implement the change I requested from slack?
You can add a new test that creates an invoice, just calls process() , then assert if the warning was raised or not. No need to test the correctness of the output
| pi_instituition = pi_dataframe[invoice.INSTITUTION_FIELD].iat[0] | ||
|
|
||
| # Convert to StringDtype for HTML rendering | ||
| pi_dataframe = pi_dataframe.astype(pandas.StringDtype()) |
There was a problem hiding this comment.
You have not moved this yet, and I suggested that you move it before the pi_projects = pi_projects.fillna(""). I believe that would resolve the warning emitted from that statement. Feel free to respond if you were following a different approach
| invoice.at[project_i, pi_balance_field] -= applied_discount | ||
|
|
||
| # Convert applied_discount to the same dtype as the balance columns | ||
| pi_balance_dtype = invoice[pi_balance_field].dtype |
There was a problem hiding this comment.
When running the pipeline with the current upstream version, I did not see a warning emitted from this file. Is there a reason why you modified this file as well?
There was a problem hiding this comment.
When running pytest with the upstream version I had an error with this:
process_report/tests/unit/processors/test_new_pi_credit_processor.py::TestNewPICreditProcessor::test_two_new_pi
/mnt/MOC/invoicing/process_report/processors/discount_processor.py:57: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '300.00' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
invoice.at[project_i, balance_field] -= applied_discount
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
There was a problem hiding this comment.
I would still say that addressing the warnings from the unit tests is outside the scope of this issue. You should revert this change.
Also, your changes here made I think a bit. It doesn't seem fitting to have DiscountProcessor be concerned with type handling. From a code organization perspective, if we have type handling issues, that suggests either we need to update our input data, or add a validator to make sure our input data contains the types that we expect. Neither DiscountProcessor or ColdfrontFetchProcessor should care about what types the input dataframe has. They should only focus on their designated tasks.
I'm writing this comment more so as a note to myself. This suggests we should consider some refactoring, part of which I'm doing with this PR
|
|
||
| def _apply_allocation_data(self, allocation_data): | ||
| # Convert columns to string dtype to handle string values properly | ||
| if invoice.PI_FIELD in self.data.columns: |
There was a problem hiding this comment.
As mentioned before, all of our Processor and Invoice subclasses assume that the input dataframe already contain certain columns. We decided a while ago that the input dataframe should already have the needed columns. No need to check if the columns exists here.
You could make the case that if we expect input dataframes to conform to certain formats (i.e specific dtypes, columns, etc), we should make validators for them. But that's a refactoring problem beyond the scope of this issue.
There was a problem hiding this comment.
This hasn't been addressed
| dtype={ | ||
| COST_FIELD: pandas.ArrowDtype(pyarrow.decimal128(12, 2)), | ||
| RATE_FIELD: str, | ||
| PI_FIELD: "string", # Use pandas string dtype for proper string handling |
There was a problem hiding this comment.
If you're already specifying the type for the columns here, would that make your changes in process_report/processors/coldfront_fetch_processor.py redundant?
There was a problem hiding this comment.
Checking the docs, it seems string is an alias for pandas.StringDtype. However, you are using pandas.StringDtype in your test case, so I would suggest that you be consistent and just go with one option.
My suggesting would be to use pandas.StringDtype since it's more clear (it's might not be obvious to a developer that string is an alias for pandas.StringDtype)
There was a problem hiding this comment.
Note to self: Also, since the cause of the FutureWarning here is because the input invoice has an empty PI field, this seems to be a problem with the invoice itself, and that we should remove these empty columns.
There was a problem hiding this comment.
Is there a reason why you are also modifying the test cases here? If it is to also resolve warnings emitted by this unit test, I would want it to be put in a separate PR instead.
There was a problem hiding this comment.
Yeah this what we were talking about earlier when I mentioned that you said to avoid object types in general. I changed the tests because the tests expected objects instead, and you mentioned in Slack.
You should not use the object type in pi_specific_invoice.py . You can edit the test case to resolve your type problem, then I'll take a look
We don't want to use object because it removes some of the formatting we want when exporting to PDF
There was a problem hiding this comment.
Some more comments. And just to re-iterate some of my thoughts from Slack:
- Some of the
FutureWarningcomes from code that's manipulating dataframes to do weird things (PIInvoice), in which case it makes sense for that code to handle typing and resolve it there. The more I think about it, it may have been a mistake to be doing these weird manipulations with the PI-specific dataframes. This may be something to consider re-writing in the future. - Other
FutureWarninghappens when input data has different datatypes (i.e, certain numeric columns are ints, float, or Decimal). In which case the task of handling typing should be done by a validator of some sorts. Type handling logic definitely should not be bundled into an Invoice or Processor - For now, I would only like your PR to focus on resolving the warnings when running the pipeline end-to-end. I believe it's best to keep the PR's scope limited since it's your first PR. There's still other feedback I want to move on to and I don't want to drag the PR out too long.
| invoice.at[project_i, pi_balance_field] -= applied_discount | ||
|
|
||
| # Convert applied_discount to the same dtype as the balance columns | ||
| pi_balance_dtype = invoice[pi_balance_field].dtype |
There was a problem hiding this comment.
I would still say that addressing the warnings from the unit tests is outside the scope of this issue. You should revert this change.
Also, your changes here made I think a bit. It doesn't seem fitting to have DiscountProcessor be concerned with type handling. From a code organization perspective, if we have type handling issues, that suggests either we need to update our input data, or add a validator to make sure our input data contains the types that we expect. Neither DiscountProcessor or ColdfrontFetchProcessor should care about what types the input dataframe has. They should only focus on their designated tasks.
I'm writing this comment more so as a note to myself. This suggests we should consider some refactoring, part of which I'm doing with this PR
| dtype={ | ||
| COST_FIELD: pandas.ArrowDtype(pyarrow.decimal128(12, 2)), | ||
| RATE_FIELD: str, | ||
| PI_FIELD: "string", # Use pandas string dtype for proper string handling |
There was a problem hiding this comment.
Checking the docs, it seems string is an alias for pandas.StringDtype. However, you are using pandas.StringDtype in your test case, so I would suggest that you be consistent and just go with one option.
My suggesting would be to use pandas.StringDtype since it's more clear (it's might not be obvious to a developer that string is an alias for pandas.StringDtype)
|
|
||
| def _apply_allocation_data(self, allocation_data): | ||
| # Convert columns to string dtype to handle string values properly | ||
| if invoice.PI_FIELD in self.data.columns: |
There was a problem hiding this comment.
This hasn't been addressed
| dtype={ | ||
| COST_FIELD: pandas.ArrowDtype(pyarrow.decimal128(12, 2)), | ||
| RATE_FIELD: str, | ||
| PI_FIELD: "string", # Use pandas string dtype for proper string handling |
There was a problem hiding this comment.
Note to self: Also, since the cause of the FutureWarning here is because the input invoice has an empty PI field, this seems to be a problem with the invoice itself, and that we should remove these empty columns.
3593668 to
b3bce3c
Compare
| pi_projects.loc[pi_projects.index[-1], sum_columns_list] = column_sums | ||
|
|
||
| # Add totals row by copying the first row and modifying values | ||
| if not pi_projects.empty: |
There was a problem hiding this comment.
When would pi_projects be empty?
| group_name=[None, "G1", None, None], | ||
| ) | ||
|
|
||
| # Expected result for PI1 - build using the new approach |
There was a problem hiding this comment.
This comment is not useful to understanding the code.
| if pandas.isna(data) or data == "": | ||
| return "$" if data == "" else data |
There was a problem hiding this comment.
Why is this function being changed?
| answer_invoice_pi1 = answer_invoice_pi1.astype(pandas.StringDtype()) | ||
| answer_invoice_pi1.fillna("", inplace=True) | ||
|
|
||
| # Expected result for PI2 - build using the new approach |
| answer_invoice_pi2.loc[ | ||
| answer_invoice_pi2.index[-1], ["Invoice Month", "Balance"] | ||
| ] = ["Total", 700] | ||
| # Create totals row by copying first row and modifying |
There was a problem hiding this comment.
You do not need to repeat this comment here
| # Capture all warnings | ||
| with warnings.catch_warnings(record=True) as w: | ||
| warnings.simplefilter("always") | ||
| pi_inv.process() | ||
|
|
||
| # Assert no warnings were raised |
There was a problem hiding this comment.
These 2 comments are not necessary. As mentioned before, we usually only wrote comments to explain the WHY, unless the code is really complicated and the intention not clear. If the code is not clear in the first place, the problem might be the code, not a lack of comments :)
| if pandas.isna(data) or data == "": | ||
| return "$" if data == "" else data |
| def test_process_no_warnings(self): | ||
| """Test that no warnings are raised during invoice processing""" | ||
| test_invoice = self._get_test_invoice( | ||
| ["PI1", "PI1", "PI2", "PI2"], | ||
| ["BU", "BU", "HU", "HU"], | ||
| [100, 200, 300, 400], | ||
| group_name=[None, "G1", None, None], | ||
| ) | ||
|
|
||
| pi_inv = test_utils.new_pi_specific_invoice(data=test_invoice) | ||
|
|
||
| with warnings.catch_warnings(record=True) as w: | ||
| warnings.simplefilter("always") | ||
| pi_inv.process() | ||
|
|
||
| self.assertEqual( | ||
| len(w), | ||
| 0, | ||
| f"Unexpected warnings raised: {[str(warning.message) for warning in w]}", | ||
| ) |
There was a problem hiding this comment.
Testing the test case locally, it doesn't seem that warnings.catch_warnings can actually catch the warnings. It seems that unittests' built-in self.assertNoLogs work though. Also, I suggest you run process() as well, since that's where most of the code under review is.
| def test_process_no_warnings(self): | |
| """Test that no warnings are raised during invoice processing""" | |
| test_invoice = self._get_test_invoice( | |
| ["PI1", "PI1", "PI2", "PI2"], | |
| ["BU", "BU", "HU", "HU"], | |
| [100, 200, 300, 400], | |
| group_name=[None, "G1", None, None], | |
| ) | |
| pi_inv = test_utils.new_pi_specific_invoice(data=test_invoice) | |
| with warnings.catch_warnings(record=True) as w: | |
| warnings.simplefilter("always") | |
| pi_inv.process() | |
| self.assertEqual( | |
| len(w), | |
| 0, | |
| f"Unexpected warnings raised: {[str(warning.message) for warning in w]}", | |
| ) | |
| def test_process_no_warnings(self): | |
| """Test that no warnings are raised during invoice processing""" | |
| test_invoice = self._get_test_invoice( | |
| ["PI1", "PI1", "PI2", "PI2"], | |
| ["BU", "BU", "HU", "HU"], | |
| [100, 200, 300, 400], | |
| group_name=[None, "G1", None, None], | |
| ) | |
| pi_inv = test_utils.new_pi_specific_invoice(data=test_invoice) | |
| with warnings.catch_warnings(record=True) as w: | |
| warnings.simplefilter("always") | |
| pi_inv.process() | |
| self.assertEqual( | |
| len(w), | |
| 0, | |
| f"Unexpected warnings raised: {[str(warning.message) for warning in w]}", | |
| ) | |
| @mock.patch("process_report.invoices.invoice.Invoice._filter_columns") | |
| @mock.patch("os.path.exists") | |
| @mock.patch("subprocess.run") | |
| def test_process_no_warnings(self, mock_subprocess_run, mock_path_exists, mock_filter_cols): | |
| """Test that no warnings are raised during invoice processing""" | |
| invoice_month = "2024-10" | |
| test_invoice = self._get_test_invoice( | |
| ["PI1", "PI1", "PI2", "PI2"], | |
| ["BU", "BU", "HU", "HU"], | |
| [100, 200, 300, 400], | |
| group_name=[None, "G1", None, None], | |
| ) | |
| with tempfile.TemporaryDirectory() as test_dir: | |
| pi_inv = test_utils.new_pi_specific_invoice( | |
| test_dir, invoice_month, data=test_invoice | |
| ) | |
| with self.assertNoLogs("process_report.invoices.pi_specific_invoice", level="WARNING") as log: | |
| pi_inv.process() | |
| pi_inv.export() |
There was a problem hiding this comment.
Looks good. The last thing now is to merge all the commits in this PR into 1 commit. Make sure the message contains a succinct title, and some details in the body about the changes introduced, so that people in the future (including you) can follow it to understand the commit.
Thanks for your patience in the process. This code section will likely get refactored in the future :P
2b17d26 to
1d89101
Compare
8921801 to
1d89101
Compare
|
@QuanMPhm Ever since the merge into main on June 10th there is now a bunch of commits when I try to rebase all my commits, I'm not exactly sure how I am supposed to squash this in a safe way. I've tried a bunch of things but I'm unsure what I'm supposed to do.
|
|
Checking the git logs for your PR, it seems like you are trying to merge I would suggest running |
07a93d0 to
1c5a1c8
Compare
QuanMPhm
left a comment
There was a problem hiding this comment.
As mentioned elsewhere, please remove the issue number from the title. The title should be a brief description of the PR instead.
|
@jimmysway I meant remove the issue number from the commits title. Sorry :p |
1c5a1c8 to
334d9f0
Compare
|
@jimmysway Your PR is adding a duplicate of |
334d9f0 to
c3e80ad
Compare
removed |
QuanMPhm
left a comment
There was a problem hiding this comment.
Just one small comment, otherwise looks good
| [100, 200, 300, 400], | ||
| group_name=[None, "G1", None, None], | ||
| ) | ||
|
|
c3e80ad to
87cf61d
Compare
QuanMPhm
left a comment
There was a problem hiding this comment.
There is a merge conflict, and your commit message has a issue number in it
2fffc44 to
4062ae7
Compare
…late compatibility Create totals row by copying and clearing first row to retain DataFrame formatting Convert DataFrame to StringDtype and fill NA for template compatibility Update unit tests for new totals row logic and add test to ensure no warnings
4062ae7 to
2d83a40
Compare
Addresses #180 reflect real usage cases. Addressed the fillna issues by converting columns into objects
Precommit hook checked.