Link ComponentJobUtilization and Invoice#4718
Conversation
| Q(job_utilizations__invoice__isnull=True) | ||
| | Q(evaluation_utilizations__invoice__isnull=True) | ||
| | Q(job_warm_pool_utilizations__invoice__isnull=True) | ||
| ).distinct() |
There was a problem hiding this comment.
This is going to be a pretty expensive query. Unsure how bad it really is, but maybe replacing it with 3 simple indexed scans on individual tables would be better? Something like this:
challenge_ids = set()
for utilization_model in CHALLENGE_UTILIZATION_MODELS:
ids = (
utilization_model.objects
.filter(invoice__isnull=True)
values_list("challenge_id", flat=True)
.distinct()
)
challenge_ids.update(ids)
challenges = Challenge.objects.filter(pk__in=challenge_ids)
for challenge in challenges:There was a problem hiding this comment.
To make this more efficient, you could pre-compute the challenge to invoice mapping:
invoice_by_challenge = {
inv.challenge_id: inv
for inv in Invoice.objects.order_by("challenge_id", "expires_on", "created")
}That saves you from repeating the same invoice lookup 300+ times.
There was a problem hiding this comment.
I'm also still wondering if looping over utilizations would be better. It would drastically reduce the number of DB queries to execute, but the downside would be that it would require batching /to work, which is less easy to read/follow.
There was a problem hiding this comment.
But actually, if you precompute the challenge-to-invoice mapping, you don't need your expensive initial query. You can just loop over all challenges regardless. So I think that is the most efficient way of going about this after all.
for challenge in Challenge.objects.all():
invoice = invoice_by_challenge.get(challenge.pk)
if not invoice:
missing_invoice_challenges.append(challenge.short_name)
continue
for model in CHALLENGE_UTILIZATION_MODELS:
model.objects.filter(
challenge=challenge,
invoice__isnull=True,
).update(invoice=invoice)There was a problem hiding this comment.
Updated, and I think resolved.
There was a problem hiding this comment.
I assume you have tested the command, but of course those tests will never be a true representation of the production DB, so it's hard to test the efficiency of the command.
Indeed. Tbh, I did not bother adding 100k worth of utilizations as the local dev instance does not match the production in terms of performance. Rather I focused on variants of having invoices and missing invoices.
I think we can reduce the subsequent call number of round trips to the DB if we get a cheap set of challenge_ids that have empty invoice FKs in the 3 utilization tables. The 1248 round trips on every call seems like a lot to me.
There was a problem hiding this comment.
I think we can reduce the subsequent call number of round trips to the DB if we get a cheap set of challenge_ids that have empty invoice FKs in the 3 utilization tables. The 1248 round trips on every call seems like a lot to me.
I don't think we can do a 'quick' scan to gather all challenge_ids. The first time around it will be near full table scan =/ per utilization table.
There was a problem hiding this comment.
I don't think we can do a 'quick' scan to gather all challenge_ids. The first time around it will be near full table scan =/ per utilization table.
Exactly, I don't think that is possible.
I think 1248 DB round trips is not too bad, as long as each call is manageable. I believe it is ok given that the filter can use indexes, but I don't know for sure.
It might be nice to print out how many rows the update call touched. Then we know that it did something or (on subsequent calls of the command) that it didn't.
There was a problem hiding this comment.
I discussed the migration approach synchronously with James. Looping over utilizations is the better approach here since the filter on invoice__isnull=True is very cheap and allows us to skip already processed rows on subsequent runs. When looping over challenges, we filter utilizations by invoice and challenge. Even though both fields are indexed, they are not indexed together (there is no composite index for these two fields), and so the filter is much less efficient. That combined with the fact that we'd redo these queries unnecessarily on subsequent runs, makes this approach less efficient.
|
I still have some comments about the management call, but otherwise this looks good to me now. Would still be good if @koopmant could take a look as well. |
Sorry, I thought I would be able to look at this sooner, but kept getting stuck in the weeds in my own pitch. I can look at this next week. |
koopmant
left a comment
There was a problem hiding this comment.
It would have been better if this had been split up into separate PR's. The three headings in the description nicely describe three possible PR's. There is a lot going on here and it was a bit much for me.
Some thoughts:
Do we have to assign the invoice on creation? What if we set the invoice when we set the duration on the utilization; if there is no available budget at that time, we set it at the last invoice. (I don't assume we can quickly update the budget for an invoice at that time?) It might still be a small improvement over assigning early?
If we do assign it on creation, I think adding it as an argument for create() would be nicer, see this next comment.
the intent here is to create a Job and mark it as a failure if no invoice can be found
Maybe this is discussed on the pitch, but I don't fully understand why this is necessary? I.e., why not let job creation fail and update the evaluation with the error message? I like the implementation of getting the active invoice or an error if there are no funds. By adding invoice it to the Job create call, we'd prevent being able to create jobs when there are no funds. Now you have to check afterwards and not forget it... Once a job is created we will not check the funds again; will we?
Regarding management commands in general. When do we use a command and when a data migration? Data migrations can be run multiple times as well, can't they?
| ) | ||
| jobs.append( | ||
| job | ||
| ) # Keep track of created jobs, even if utilization setup has failed |
There was a problem hiding this comment.
| ) # Keep track of created jobs, even if utilization setup has failed | |
| ) |
The comment might be confusing. I started rewriting it until I realized this comment is not really necessary at all, the code explains itself.
There was a problem hiding this comment.
This code expects that the returned jobs are the ones that have been scheduled, so the utilisation set up must not fail.
| status=Job.FAILURE, | ||
| error_message="Job cannot be executed. The challenge has insufficient budget to run this job.", | ||
| ) | ||
| break |
There was a problem hiding this comment.
With James' comment here, this will need to be changed. The invoice will need to be passed along as a kwarg to this function.
Setting it after the fact would potentially allow for more unbudgeted submissions (e.g. on deadline day, parallel submissions), right? |
A data migration would only run once. They are not meant to be run multiple times. |
| def update_compute_cost_euro_millicents( | ||
| *, | ||
| obj, | ||
| obj_field, |
There was a problem hiding this comment.
Why is this new kwarg necessary?
There was a problem hiding this comment.
I see now. The construction indicates a design problem. Please rename compute_costs_utilized_euros_millicents to compute_cost_euro_millicents on invoices.
This is for several reasons:
- This is the name suggested by the method being used here:
update_compute_cost_euro_millicents - Internal consistency: attributes that represent the same thing should be called the same thing
- Historical consistency: we already have an existing name for this property
- Simplified implementation: reduced complexity in passing and setting these attributes, no need for extra kwargs or
setattr - Reduced cognitive load: developers do not need to thing about what objects they are dealing with when accessing this attribute
There was a problem hiding this comment.
The name was chosen deliberately to contrast it with the existing Invoice.compute_costs_euros. Having these side by side is easy to confuse:
Invoice.compute_costs_eurosInvoice.compute_cost_euro_millicents
Challenge.compute_cost_euro_millicents will eventually be removed as it can be replaced by a simple query on the Invoice table.
Rather than renaming the Invoice.compute_costs_utilized_euros_millicents, would it make more sense to rename the Phase.compute_cost_euro_millicents to align with it (i.e. Phase.compute_costs_utilized_euros_millicents)?
There was a problem hiding this comment.
That cannot be done without downtime and/or a data migration, so rename the new one.
Not really, because just linking the utilization does not change the budget. Only once the utilization has a value for what's been utilized (i.e. when we set the duration) and only after the invoice has been updated will the budget change. |
Maybe a feature branch then? The coupling could have gone in a different (tiny) PR, but the rest would need to go in in one go.
I agree with that note. We could add an increment of the overall utilization at the time of setting the duration to the utilized invoice. That would not absolve you from doing a periodic consolidation as that is far simpler than covering all corner cases where the overall sum should be adjusted. With the change that everything involving a single evaluation will be booked on the same invoice, it's really a lot easier to set it at the time of creation. |
|
Quick to-do here:
|
Part of pitch:
This changeset introduces a link between
ComponentJobUtilizationand a challenge'sInvoice, enabling compute costs to be tracked and summed on a per-invoice basis.Invoice Linking
Active invoices are resolved via
Challenge.active_invoice, which raises an error if the challenge has insufficient funds.EvaluationUtilizationis linked at evaluation creation time (create_evaluation), which is triggered by:In all three cases, the user receives an appropriate error message if the challenge has insufficient funds.
JobUtilizationis linked when creating jobs for an evaluation. A fresh active invoice is fetched per job — if none is found, theJobis markedFAILUREwith a descriptive error message.WarmJobUtilizationis linked afterJobcompletion.Compute Cost Caching
update_challenge_compute_costsnow sums linked utilizations on a per-invoice basis. The existingChallenge-levelcompute_cost_euro_millicentsfield is kept intact for now — removing it and its related paths will be addressed in a follow-up PR.Management Command:
link_challenge_utilization_to_invoiceSince existing challenges have no invoice-utilization linkage, a management command is introduced to backfill this (rather than a one-off migration), with the longer-term goal of enforcing a constraint that
phase,challenge, andinvoiceare always set together on utilizations.The command reports cleanly which challenges lack invoices.