-
-
Notifications
You must be signed in to change notification settings - Fork 68
Support large args #1385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Support large args #1385
Conversation
9b9b346
to
e03034c
Compare
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd be reassured with some tests. Doesn't have to be 100% coverage but what can be tested should. Please favor unit tests to integration tests where possible.
I'm not sure the Django admin is tested, so it's fine to leave that out (and you made screenshots)
I've left a few comments below :)
Nice work though ! Please be welcome in the joyful assembly of Procrastinate contributors. Feel free to share feedback on the contribution process (now or when it's merged).
EDIT: Ah, you just switched to draft :) Well, you have a few ideas for the rest then :)
@@ -329,3 +329,17 @@ def queues_display(queues: Iterable[str] | None) -> str: | |||
return f"queues {', '.join(queues)}" | |||
else: | |||
return "all queues" | |||
|
|||
|
|||
def ellipsize_middle( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this function, and it seems like the perfect candidate for a parametrize
set of unit tests :)
In particular, I like that this implementation works well around 100 chars (if we look at what happens char by char, there's no moment where we add 1 char and we end up with ellipsis and with the whole string being less than 100 chars). That said, I felt the need to test it to be 100% sure of that, so it would be awesome if the tests were included :D
I specifically liked a lot that all the important configuration is passed as parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I know, I'm also annoyed by the counterproductive ellipses in some implementations, hehe. Will add tests.
pretty_json = pretty_json[:2000] + "..." | ||
arg_rows = [ | ||
format_html( | ||
"<tr><td>{key}</td><td>{value}</td></tr>", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go this route, we should at least wrap keys and values in <code>
, but it that much better than json ?
json is easily copyable.
yaml could be a good middle ground, maybe ? It's readable (but it needs one more dep :( )
format_html( | ||
"<tr><td>{key}</td><td>{value}</td></tr>", | ||
key=key, | ||
value=ellipsize_middle(value), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should json.dump()
the value though. Given we know it's json, I guess the only difference will be boolean capitalization and null/None. But here in the django admin, it's not about being Python anymore, and json types are more "neutral".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also, even if we don't json.dump()
, your IDE should probably complain: value has no reason to be a string, it might be any json type. I guess PyRight should complain in the CI, if it doesn't I'll be somewhat disappointed at another Microsoft product :D (said the guy on GitHub, using VSCode )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right, this was an omission, I didn't even notice I don't json.dumps
it.
I guess the problem is Django, instance.args
is not correctly typed and appears as Unknown
, so then type checking assumed that I knew what I was doing, which was a mistake :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(No idea how it even worked before, as shown in your screenshot...)
key=key, | ||
value=ellipsize_middle(value), | ||
) | ||
for key, value in instance.args.items() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about calls with thousands of short args ? Shouldn't we limit that too ? Maybe more than 30 args don't make a lot of sense to display ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me kind of contrived that you would define a function with thousands of parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well that could be:
@app.task
def x(**kwargs):
...
Then it's only up to how big the dict is.
f"{key}={ellipsize_middle(repr(value))}" | ||
for key, value in self.task_kwargs.items() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you left out the part where we allow people to override this. It might be ok in this PR, but I'd like to make sure we're fine with this before a release. Doesn't have to be something gigantic, but this may be ruining someone's workflow with no hopes of improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I just saw your comment in the PR body ! I'll try to think about it. It's ok to leave it out of the PR for now if it's complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's easier if it can be configured in the tasks, then ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually made me think... Perhaps a better solution to all of this would be to have a way to simply exclude an argument from display (call_string
, log context, and Django admin), for whatever reason (security, size).
This seems much simpler than all of this.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, it seemed much simpler but:
- We can define on
Task.unlogged_args
, but you can't get fromJob
->Task
easily - Seems wrong to persist fixed argument names into
Job
/database - Also seems fairly "dangerous": you forget to filter in one place (call string, log context, admin, shell) and you're printing what you should not be.
- Perhaps splitting
task_kwargs
intotask_kwargs
+task_unlogged_kwargs
but that seems like an overkill too.
@ewjoachim I'm tried to add some test that checks that log context does not grow infinitely with args size and am noticing some things:
|
At the time this was written, recursive types weren't supported. But it's not the case anymore so feel free to fix the types (preferably in another PR, I suspect the CI will be complaining here and there).
You mean
That's something that comes up a lot: I'm a maintainer of procrastinate (and the developer of a large part of it), but it's been ages since I last was a user of Procrastinate in any real life setting, so the points users like you bring up are usually kind of my best bet to understand and be aware of pain points. All in all, I would say the point you bring up seems very reasonable, and we should probably be mindful about the size of the logs. That said, I think it would make sense that the complete args of each task is logged at least once when it's deferred and once when it's executed (probably upon completion or at the start of the task). That said, I'm ok if that's only in DEBUG logs. Maybe something like: if we detect that the |
Yikes 😅
At some point arbitrary types need to get converted into JSON for storage in the database. I guess it's unclear to me if
|
This somewhat perfectly captures my opinion of 2025 so far :) Screen.Recording.2025-05-10.at.13.34.14.movJoke apart, let me know if you need any help. |
@ewjoachim Yes, now you understand the problem this PR is solving :) Tbh I have kind of given up on both approaches:
I will most likely go with an extra table to hold the large data and only reference it’s PK in the task, then delete the row at the end of the job. |
I understand you have to fix pressing issues, and I respect that. That said, I still think it makes sense, either for you or someone else, in due time, to continue working on this so that at some point, it stops being an issue for other people ? I'm not saying you have to, but first and foremost thank you for trying, and then: maybe someone will want to pick this work up where it was left :) (or maybe you'll want to, though no pressure) |
Ok so: I've started working on this to get a better hang on the remaining issues:
What smells bad for me is that if we json dumps all args so that we can truncate them to length, then arguments that were json encoded into a string (so mostly strings, but possibly other things too depending on the I could check if after Also, but that's kinda logical with the choices so far: if we truncate objects, then in the logs, all objects will become strings, so instead of logging All in all, there's at least a decision to make:
I vote for B, so I welcome smartness from everyone around :D |
I don't like D) because of the inconsistencies. So, to be more concrete, what would I prefer:
Not sure if this all makes sense, as you are much deeper into it (never used or touched the custom encoder stuff). |
I'm not sure this will be helpful if you have a list with 10000 short strings :( |
🤣 Ignoring the details of the shortening and the presentation of the admin page for a second, what would make most sense to me is to convert the kwargs to JSON early, so that The conceptual unclarity and the amount of heart surgery was too much for me to undertake, but if you decide to do this change, I'm happy to pick up the trimming itself again. |
Closes #1313
This PR:
Before:
After:
@ewjoachim This is slightly different than your original suggestion in #1313. I noticed that
Job
does not currently referenceApp
and it seemed ugly to make it aware just for logging customization. Do we really need this to be customizable? Happy to still do it if you want.Successful PR Checklist:
call_string
.