fix: low latency client streaming waiting for data to close connection after client closes #6669
Botelho31 wants to merge 11 commits into caddyserver:master
Conversation
Very interesting! Thanks for the contribution. I'm reviewing it now. So, uh, forgive me for asking, but does this work as intended? I don't think I would have come up with this, and if it works, that's great. :) It would take me some time to set up a test harness for this (or perhaps we should write an automated test). Curious if you have seen this working as you developed it. 👍
I actually have already created some automated tests (as I was not too sure myself whether this would work 100%), but didn't add them because I'm not sure where they should go in the file structure. Should I just create a
@Botelho31 Typically, you'll put the Go code that runs the test in a file of the same name but ending with
Tests for this could go in
I added the unit tests, but I'm not too sure how to create an integration test for this behaviour; I'm going to take a shot at it.
So I have written the integration test and have seen it fail, but I'm finding that the timing required for Caddy to not send the EOF to the backend is very precise. Without the new variable, it already triggers the context-failure path every time.
Thank you for this! I've given this another look-over, and now I wonder whether we should close the connection proactively after uploading the request, OR whether we should wait for the backend to respond / close the connection like we're doing here. In either case the client has already disconnected, so that doesn't matter, but I'm not sure whether the intention is for the proxy to also disconnect before a response. I think this is worth a shot either way. Do you have a moment to test with the original client reproducer presented in #4922? Maybe @kuhnchris could offer some insight.
I actually implemented my integration testing based on his. I will give the ffmpeg script a try to see if my issue is due to Caddy not being fast enough. About proactively closing: I guess we could close on the next error catch, but isn't there any important processing that happens afterwards? Or is it all client-related logic? In that case we should definitely close it.
According to the official documentation, Done should return a reference to the same channel. Does this have any impact? Some requests might keep running until Caddy is reloaded, because they hold onto the channel from before the request finished, while subsequent calls return a channel that comes from the client. I know HTTP/3 does this.
I just read this part of the documentation. Actually, since we are only doing this on a per-roundtrip basis (after the roundtrip we independently return to the old context.Done), I don't think it will affect other requests. As an update, I have tested this using the Python ffmpeg reproducer from the original client and again, I can
Hmm, well thank you for digging into it. I am about to tag 2.9 beta 3 today, and I am trying to decide whether to include this or wait until it's a little more ready / understood. I'm thinking of giving this PR more time until we have a grasp as to why the tests aren't failing in the first place, and so we can be sure this really works. What do you think?
It's better to find a way to reproduce the error every time before merging this PR, I think. I'm sorry for the overall slow pace this is taking. I'm going to give it another try later today.
No worries. We appreciate you contributing it! It's totally okay if it doesn't go in today, I just didn't want to disappoint. Looking forward to figuring things out and seeing it merged :)
Context is per request, so naturally changing the context of a request won't affect other requests. What do you mean by this? As I said, I am curious whether this will impact response handling, since it's clearly different from the stdlib definition. Some parts of the code might hold on to the old reference, which leads to context cancellation not firing for those code paths. I'm not asking that we stick to stdlib behaviour; if the test checks out, write a comment about it for future reference.
// the roundtrip. We delay the context.Done() channel until
// the round trip is done, or until the parent context is
// done, whichever comes first.
type delayClientDoneContext struct {
I think instead of introducing a new type, we can just use context.WithCancel with h.ctx, and context.AfterFunc with the request context to cancel the newly introduced context, to achieve the same effect. Not only does this save many lines of code, it also avoids the inconsistency with the stdlib of Done returning a different channel. What do you think?
ctx, cancel := context.WithCancel(h.ctx)
defer cancel()
stop := context.AfterFunc(req.Context(), cancel)
defer stop()
I have tested this in the following ways, trying to catch an error: using the old POC from the original issue. It seems to me that SOMETHING has changed in the Go stdlib RoundTrip in the last two years that has fixed this on their end. I could not in any way, shape, or form recreate the old error. Maybe this whole pull request isn't needed anymore, and for now I give up on trying to reproduce this, if someone wants to give it a shot. Reading the official package GitHub, I believe the commit that could have solved this is this one
@Botelho31 so is the conclusion that this PR isn't needed after all? Or do you just mean one specific branch/usecase is one you're not able to replicate? Sorry I'm not super caught up here, just wanting to get a summary 😅
I cannot replicate the original error #4922 (the one fixed by flush_interval -1) in today's version of Caddy/Go in order to test whether this pull request fixes it. Maybe it was fixed in the stdlib, or maybe it's just a skill issue on my part (I'm quite new to Go); either way I won't be pursuing this any further. You can close this pull request if you like, and if nobody can recreate the issue, maybe we should even remove the original dirty fix.
Ah wow -- well, thank you so much for working on this @Botelho31! We really appreciate your contribution, regardless of whether we merge it or not. You've been a pleasure to work with! We super appreciate the deep-dive into narrowing things down as well. I wonder if the recent implementation of "full duplex" has anything to do with this? @francislavoie what do you think?
No, the full-duplex thing is opt-in only (off by default)
Okay. So, it sounds like with the current version of Caddy (and/or Go), the issue reported by #4922 may no longer be a problem and we don't need our patch from back then. In that case, we can even try reverting the original patch and seeing if anyone experiences the original behavior anymore. Sounds like they won't. That's great -- we might not need a separate setting with a complex implementation, or a buggy patch at all! @Botelho31 Thank you again so much for working on this. I will respectfully close this PR, but we will refer to it in the future if it comes up again. What I'll probably do then is revert the original patch and credit you in my commit :)
…eam mode i.e. Revert commit f5dce84 Two years ago, the patch in #4952 was a seemingly necessary way to fix an issue (sort of an edge case), but it broke other more common use cases (see #6666). Now, as of #6669, it seems like the original issue can no longer be replicated, so we are reverting that patch, because it was incorrect anyway. If it turns out the original issue returns, a more proper patch may be in #6669 (even if used as a baseline for a future fix). A potential future fix could be an opt-in setting.
My suggestion on how to properly fix #6666