Skip to content

Consolidate configurations into a config.py file#214

Merged
QuanMPhm merged 1 commit into
CCI-MOC:mainfrom
QuanMPhm:213/config_consolidate
Oct 21, 2025
Merged

Consolidate configurations into a config.py file#214
QuanMPhm merged 1 commit into
CCI-MOC:mainfrom
QuanMPhm:213/config_consolidate

Conversation

@QuanMPhm

@QuanMPhm QuanMPhm commented Jun 27, 2025

Copy link
Copy Markdown
Contributor

Closes #213 and closes #91. I'll submit this as a draft for now, since its quite voluminous already, and I want people's opinions before moving forward.

@knikolla @larsks @naved001 In particular, right now the unit tests are failing because config.py tries to fetch from s3 while unauthenticated. I was considering creating a base test class that mocks the s3 functions during setUp(), or to add a layer of abstraction to config.py so that other files can only access the configurations through specific getter functions. Do you think there's a better way?

@larsks

larsks commented Jul 10, 2025

Copy link
Copy Markdown
Member

I think mocking or dependency injection are both good ways of resolving this sort of situation. I'm driving at the moment and haven't had a chance to look at this code yet; I'll try to take a look this afternoon and see if I have any more specific recommendations.

@larsks

larsks commented Jul 10, 2025

Copy link
Copy Markdown
Member

One of the difficulties here is that you have modules that run code on import, which can really complicate testing (because it makes it much more difficult to mock out dependencies or otherwise change the logic of the code). Specifically, in config.py you run load_from_url on import:

rates_info = load_from_url()

As well as a variety of s3 related functions:

CSV_INVOICE_FILEPATH_LIST = fetch_s3_invoices(INVOICE_MONTH)
PREPAY_DEBITS_FILEPATH = util.fetch_s3(PREPAY_DEBITS_S3_FILEPATH)
OLD_PI_FILEPATH = util.fetch_s3(PI_S3_FILEPATH)
ALIAS_FILEPATH = util.fetch_s3(ALIAS_S3_FILEPATH)

If you can refactor your code so that these variables are initialized at runtime rather than at import, you'll find things easier to test.

@larsks

larsks commented Jul 10, 2025

Copy link
Copy Markdown
Member

It's not part of this PR, but this code in util.py is problematic:

@functools.lru_cache
def get_invoice_bucket():
    try:
        s3_resource = boto3.resource(
            service_name="s3",
            endpoint_url=os.environ.get(
                "S3_ENDPOINT", "https://s3.us-east-005.backblazeb2.com"
            ),
            aws_access_key_id=os.environ["S3_KEY_ID"],
            aws_secret_access_key=os.environ["S3_APP_KEY"],
        )
    except KeyError:
        logger.error(
            "Error: Please set the environment variables S3_KEY_ID and S3_APP_KEY"
        )
    return s3_resource.Bucket(os.environ.get("S3_BUCKET_NAME", "nerc-invoicing"))

If the attempt to reference the S3_* environment variables raises a KeyError, you should probably exit rather than simply logging an error. Currently, you log a a message and then proceed to execute...

    return s3_resource.Bucket(os.environ.get("S3_BUCKET_NAME", "nerc-invoicing"))

...which fails because s3_resource has not been defined:

UnboundLocalError: cannot access local variable 's3_resource' where it is not associated with a value

@QuanMPhm QuanMPhm force-pushed the 213/config_consolidate branch from d6b61ab to 8287e3f Compare July 10, 2025 20:51
@QuanMPhm QuanMPhm marked this pull request as ready for review July 10, 2025 20:52
Comment thread process_report/config.py Outdated
Comment on lines +74 to +79
### Miscellaneous config values TODO: Should these have their own getter functions?
NEW_PI_CREDIT_AMOUNT = rates_info.get_value_at("New PI Credit", INVOICE_MONTH, Decimal)
LIMIT_NEW_PI_CREDIT_TO_PARTNERS = rates_info.get_value_at(
"Limit New PI Credit to MGHPCC Partners", INVOICE_MONTH, bool
)
BU_SUBSIDY_AMOUNT = rates_info.get_value_at("BU Subsidy", INVOICE_MONTH, Decimal)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larsks @knikolla My only question is should these values that are fetched from nerc-rates get their own getter functions? Having this code ran on import means local testing can't be done offline. Aside from that, the code that use these values can have them injected at runtime, so dependency injection during unit tests works fine.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current version of the code those are fetched from nerc-rates only if a value hasn't been provided by the CLI.

I don't exactly know what you mean by getter function, can you please describe how you would implement it?

@QuanMPhm QuanMPhm Jul 11, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By getter functions, I'm referring to functions like:

@functools.lru_cache
def get_old_pi_filepath() -> str:
    return OLD_PI_FILEPATH or util.fetch_s3(PI_S3_FILEPATH)

...in contrast to what my original draft proposed:

OLD_PI_FILEPATH = util.fetch_s3(PI_S3_FILEPATH)

The problem with the original draft was that config.py would try to fetch from S3 on import, making testing and dependency injection a lot more difficult. Making a "getter" function meant the fetch S3 code can be ran on-demand (or at runtime), allowing dependency injection.

With the getter function, processors/invoices can access configurations like this:

old_pi_filepath: str = field(default_factory=config.get_old_pi_filepath)

And since it's just an argument in their __init__, during unit tests, I can inject some other value as I wish.

To address your original question, a getter function for the nerc-rates values would be:

NEW_PI_CREDIT_AMOUNT = None

@functools.lru_cache
def get_new_pi_credit_amount() -> Decimal:
    return NEW_PI_CREDIT_AMOUNT or rates_info.get_value_at("New PI Credit", INVOICE_MONTH, Decimal)

@QuanMPhm

Copy link
Copy Markdown
Contributor Author

@larsks @knikolla I've decided to add getter functions for configurations that require fetching from S3. This PR is now ready for review

@knikolla

Copy link
Copy Markdown
Contributor

@larsks @knikolla I've decided to add getter functions for configurations that require fetching from S3. This PR is now ready for review

@QuanMPhm you are trying to treat everything as being the same type of "configuration", however you have different types of things which may warrant being treated with different mechanisms

  • strings that define paths in s3
  • numbers that defines values that are either fetched from the CLI or nerc-rates
  • lists
  • old pis
  • aliases

Some can be generalized, like the first two. While others may require a specialized solution by implementing functions in either the Invoice base class or somewhere else.

@QuanMPhm QuanMPhm force-pushed the 213/config_consolidate branch from 8287e3f to 32c8a3a Compare July 14, 2025 18:37
@QuanMPhm

Copy link
Copy Markdown
Contributor Author

@larsks @knikolla I have decided to use Pydantic to validate configurations.

Comment thread process_report/config.py Outdated

@functools.lru_cache
def get_prepaid_credits_df() -> pandas.DataFrame:
pandas.read_csv(PREPAY_CREDITS_FILEPATH)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not return anything. All of the tests are passing, which suggests that this function isn't tested. In fact, a test coverage report shows that most of the functions in this file are not tested.

@QuanMPhm QuanMPhm force-pushed the 213/config_consolidate branch from 32c8a3a to 4bf4abf Compare July 25, 2025 18:24
@QuanMPhm

Copy link
Copy Markdown
Contributor Author

@larsks Apologies. I forgot to actually push up my latest changes that use Pydantic.

This function does not return anything. All of the tests are passing, which suggests that this function isn't tested. In fact, a test coverage report shows that most of the functions in this file are not tested.

I will submit unit tests for this soon :P

Comment thread process_report/config.py Outdated
Comment on lines +43 to +50
# Input file paths
NONBILLABLE_PIS_FILEPATH: pydantic.FilePath = "pi.txt"
NONBILLABLE_PROJECTS_FILEPATH: pydantic.FilePath = "projects.txt"
NONBILLABLE_TIMED_PROECTS_FILEPATH: pydantic.FilePath = "timed_projects.txt"

PREPAY_PROJECTS_FILEPATH: pydantic.FilePath = "prepaid_projects.csv"
PREPAY_CREDITS_FILEPATH: pydantic.FilePath = "prepaid_credits.csv"
PREPAY_CONTACTS_FILEPATH: pydantic.FilePath = "prepaid_contacts.csv"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are saying that these variables are all of type pydantic.FilePath, but then you're assigning them a string value. These two types are not equivalent, and something expecting to get a FilePath object will break if it gets a str object instead:

>>> from process_report import config
>>> cfg = config.Config()
>>> cfg.NONBILLABLE_PIS_FILEPATH.name
Traceback (most recent call last):
  File "<python-input-10>", line 1, in <module>
    cfg.NONBILLABLE_PIS_FILEPATH.name
AttributeError: 'str' object has no attribute 'name'

Either just make them strings (which is exactly what you're doing in lines 30-32 for S3 file paths), or properly initialize them:

NONBILLABLE_PIS_FILEPATH: pydantic.FilePath = pydantic.FilePath("pi.txt")

(Same issue crops up elsewhere in this file.)

Comment thread process_report/config.py Outdated
@staticmethod
def fetch_s3_invoices(invoice_month):
"""Fetches usage invoices from S3 given invoice month"""
s3_invoice_list = list()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typicall to initialize an empty list you just write:

s3_invoice_list = []

Comment thread process_report/config.py Outdated
Comment on lines +83 to +86
def get_csv_invoice_filepaths(self) -> list[pydantic.FilePath]:
if not self.CSV_INVOICE_FILEPATH_LIST:
self.CSV_INVOICE_FILEPATH_LIST = self.fetch_s3_invoices(self.INVOICE_MONTH)
return self.CSV_INVOICE_FILEPATH_LIST

@larsks larsks Jul 25, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to avoid having accessor methods for some attributes and not for others, you could make these properties instead:

class Config(pydantic.BaseModel):
    _csv_invoice_filepath_list: list[str] = []

    @property
    def CSV_INVOICE_FILEPATH_LIST(self) -> list[str]:
        if not self._csv_invoice_filepath_list:
            self._csv_invoice_filepath_list = self.fetch_s3_invoices()
        return self._csv_invoice_filepath_list

And now your code can treat these values just like the static ones:

print(cfg.CSV_INVOICE_FILEPATH_LIST)
print(cfg.NONBILLABLE_PIS_FILEPATH)

(Or use functools.cached_property, which would remove the need for private backing fields.)

Comment thread process_report/config.py Outdated


# Custom configurations goes here
CONFIG_DICT = {}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is CONFIG_DICT for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was that CONFIG_DICT can be populated with custom configuration, i.e:

CONFIG_DICT = {
    "UPLOAD_TO_S3": False,  # Set to False for local testing, no files uploaded,
    "NONBILLABLE_INVOICE_NAME": "custom_nonbillable",
}
config = Config.model_validate(CONFIG_DICT)

Comment thread process_report/config.py Outdated

def get_old_pi_filepath(self) -> pydantic.FilePath:
if not self.OLD_PI_FILEPATH:
self.OLD_PI_FILEPATH = util.fetch_s3(self.PI_S3_FILEPATH)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have better error handling around s3 access (e.g., in the case of incorrect auth, connection timeouts, etc)?

@QuanMPhm QuanMPhm Jul 29, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That problem never occurred to me to address. Is there some reference code I can look at that does this s3 error handling?

@dataclass
class ValidatePIAliasProcessor(processor.Processor):
alias_map: dict
alias_map: dict = field(default_factory=config.get_alias_map)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, we expect to see type arguments, like dict[str, str], rather than an untyped dict.

Comment thread process_report/config.py Outdated
return dataframe[mask]["Project"].to_list()

def get_alias_map(self) -> dict:
alias_dict = dict()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer:

alias_dict = {}

Comment thread process_report/config.py Outdated
)
return dataframe[mask]["Project"].to_list()

def get_alias_map(self) -> dict:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, we expect to see type arguments, like dict[str, str], rather than an untyped dict.

@larsks

larsks commented Jul 25, 2025

Copy link
Copy Markdown
Member

It's not part of this PR, but note that the environment variables for holding S3 (or other AWS-service) credentials are typically AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, rather than S3_KEY_ID and S3_APP_KEY, and for specifying an S3 endpoint AWS_ENDPOINT_URL_S3.

@QuanMPhm

Copy link
Copy Markdown
Contributor Author

Note-to-self. Check @larsks's feedback from this repo: https://github.com/larsks/invoicing-example/tree/main

@QuanMPhm

QuanMPhm commented Sep 9, 2025

Copy link
Copy Markdown
Contributor Author

Note-to-self. A new revision is made on branch 213/config_final. The old revision is on 213/config_consolidate_pydantic

@QuanMPhm QuanMPhm force-pushed the 213/config_consolidate branch 2 times, most recently from 32c8a3a to ace6639 Compare September 9, 2025 20:40
@QuanMPhm QuanMPhm requested a review from larsks September 9, 2025 20:42
@QuanMPhm

QuanMPhm commented Sep 9, 2025

Copy link
Copy Markdown
Contributor Author

@larsks @knikolla I have made a revision of my PR. The main change has been the separation of settings and loading of information.

@knikolla knikolla left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big improvement over how it was before, great work!

Some minor comments.

Comment thread process_report/settings.py Outdated
Comment thread process_report/settings.py Outdated
Comment thread process_report/settings.py Outdated
Comment thread process_report/invoices/billable_invoice.py
Comment thread process_report/invoices/billable_invoice.py
All invoicing configurations, such as output file names, input file
paths, etc, have been grouped into a Pydantic module in `settings.py`
file. There is no longer any CLI arguments, and all
`Processor`/`Invoice` subclasses will now get their unique
configurations from `settings.py`, instead of being passed through
`__init__()`. Test cases, especially the e2e, have been changed accordingly.

Loading of configuration files moved to `loader.py`. This separation
between settings and loading logic should make the code more maintainable.

With how we previously handled invoicing configurations,
`process_report.py` had an unyieldingly long list of configurations,
which also forced us to initialize each processor/invoice individually,
creating much clutter over time. This change is made to reduce clutter by
grouping all configurations into one file.

Invoice names are no longer configurable. They are now considered
immutable business logic

The default invoice month will now always be the previous month,
regardless of which date of the month invoicing is ran
@QuanMPhm

Copy link
Copy Markdown
Contributor Author

@larsks @knikolla Great thanks for your approval of this ponderous PR

@QuanMPhm QuanMPhm merged commit fec354d into CCI-MOC:main Oct 21, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consolidate configuration for invoicing Remove default arguments for path of generated invoices

3 participants