In #460 (comment), @hannesm commented:

> Since it seems both you @bikallem and @talex5 are working on TLS and EIO, maybe it'd make sense if both of you come to a common understanding what kind of semantics you'd like to have. [...] I already wonder whether I merged the tls-eio PR way too early.
I agree we're sending too many (half-baked) PRs. Let's first decide what behaviour we want and then write the code afterwards. But I think we really need @hannesm's input on this.
As I see it, there are four issues being discussed:

1. tls-eio may perform concurrent writes on a flow, and Eio does not specify the behaviour in this case.
2. Trying to use a TLS flow after it has failed produces confusing errors.
3. Trying to use a TLS flow after cancelling an operation is not supported and produces confusing errors.
4. Half-shutdown does not work, but the documentation doesn't mention this and the resulting behaviour is confusing.

As far as I can see, (1), (2) and (3) also affect tls-lwt ((4) doesn't, because the documentation doesn't claim to support it in the first place).
## Concurrent writes
Reading from a TLS flow may cause a write. This means that performing a read and a write of the TLS flow at the same time (which should be allowed) may cause overlapping writes on the underlying flow (which isn't).
I'm not sure why we haven't seen this problem in tls-lwt, which seems to do the same thing (`Lwt_cs.write_full` doesn't take a lock, even though it may perform multiple writes on the FD).
This problem doesn't show up in the fuzz tests because the mock socket handles concurrent writes in order. But I tried modifying it to fail in the case of concurrent writes and it still didn't trigger. Do concurrent writes only happen on errors, perhaps?
In #458 (comment), @hannesm says:

> the other side-effecting pieces do not need a mutex due to careful design
I propose that Eio should explicitly document that the behaviour of concurrent writes is unspecified (see ocaml-multicore/eio#387). It's possible that there's nothing to fix here, but I suspect we need some kind of lock when reporting errors at least. I can make a PR to do that, but I'd like to know whether this is really a problem and how to trigger it in the fuzz tests if so.
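If a lock does turn out to be needed, the core of the fix would be to funnel every write to the underlying flow through a single mutex. A minimal stdlib-only sketch (all names here are hypothetical, not tls-eio's actual API):

```ocaml
(* Sketch: serialise all writes to the underlying flow with one mutex, so a
   write triggered by a read cannot interleave with an application write.
   [raw_write] stands in for the real flow write. *)
let write_lock = Mutex.create ()

let raw_write (buf : string) = print_string buf

let guarded_write buf =
  Mutex.lock write_lock;
  Fun.protect
    ~finally:(fun () -> Mutex.unlock write_lock)
    (fun () -> raw_write buf)
```

In the real code this would use Eio's mutex rather than the stdlib one, and would presumably also need to cover error reporting, which is where I suspect the races are.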
## Errors using a flow after failure
If an exception occurs while reading or writing the underlying flow, tls-eio moves into the Error state and all future operations fail with the same error.
This can be confusing. For example, if a read failed with `Out_of_memory` then all future reads will also claim to be out of memory. The stack trace only indicates where the error was rethrown, not the original cause.
This also affects tls-lwt, but people don't expect to get reasonable backtraces from Lwt anyway.
I propose that:
- We record the backtrace when storing an exception (if backtraces are on) and include it when reraising.
- When reraising, we wrap the error as e.g. `Tls_previously_failed _` to indicate that the error hasn't just happened.
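A stdlib-only sketch of both points (`Tls_previously_failed` is the proposed wrapper; the `state` type and function names are illustrative, not tls-eio's actual definitions):

```ocaml
(* Sketch: capture the backtrace at the point of failure, and re-raise the
   stored error wrapped so callers can see it is historical. *)
exception Tls_previously_failed of exn

type state =
  | Active
  | Failed of exn * Printexc.raw_backtrace

let record_failure st exn =
  (* Call this from the exception handler, while the backtrace is current. *)
  st := Failed (exn, Printexc.get_raw_backtrace ());
  raise exn

let check_state st =
  match !st with
  | Active -> ()
  | Failed (exn, bt) ->
      (* Re-raise wrapped, with the backtrace of the original failure. *)
      Printexc.raise_with_backtrace (Tls_previously_failed exn) bt
```

`Printexc.raise_with_backtrace` means the reported trace points at the original failure site rather than the re-raise site, which is the main complaint above.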
In #458, @bikallem instead proposes not storing errors at all. I think that could be confusing, however, since you will likely get bogus errors about protocol violations. In any case, I think the tls library as a whole should decide whether it wants to do this and tls-eio should do the same as tls-lwt.
## Errors using a flow after cancellation
If a read or write operation is cancelled, this raises a Cancelled exception which is stored as noted above. This means that a TLS flow cannot be used again after being cancelled. The reported error is confusing because it looks like it has been cancelled again and doesn't mention the location of the original cancellation. This is a bigger problem than for other exceptions because the original cancellation exception will typically not have been shown to the user.
This should also affect Lwt, which also raises an exception to indicate cancellation.
There seem to be three reasonable options here:

1. Using a flow after cancelling is never allowed (the current behaviour). With the improvements to re-raising errors above, this should be fairly clear.
2. Reusing a cancelled flow is best-effort. If we cancel during a read then we ignore the error and the flow can continue to be used, but if you are unlucky and cancel during a write then reusing the flow will raise an exception, as before. This is the behaviour of #459 (tls-eio: allow cancelling reads).
3. Make it safe to cancel and resume at any point. This would require changes to Eio to report partial writes, which we have debated doing.
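For comparison, the core of option 2 could be sketched like this (stdlib only; `Cancelled` stands in for Eio's cancellation exception, and the record type is illustrative):

```ocaml
exception Cancelled

type flow = { mutable failed : exn option }

(* Option 2 sketch: a cancellation during a read is re-raised but NOT stored,
   so the flow stays usable; any other failure poisons the flow as before. *)
let guarded_read flow do_read =
  match do_read () with
  | data -> data
  | exception Cancelled -> raise Cancelled        (* flow still usable *)
  | exception e -> flow.failed <- Some e; raise e
```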
Option 1 seems simplest here. I'm not convinced there's much use for cancelling a flow operation if you might want it later, and we've managed with this behaviour in tls-lwt so far without trouble.
## Half-close / shutdown

tls-eio claims to implement the `Eio.Flow.two_way` signature, which allows performing a half-close. However, tls always closes both directions (and tls-eio behaves confusingly if you then try to read, simply returning end-of-file).
#460 tried to fix this by shutting down the underlying flow, but that won't work (both because TLS needs to write in order to read, and because the underlying flow is supposed to remain usable after a TLS shutdown).
It should be possible to support half-close properly (tracked in #452). @hannesm wrote:

> Reading up on the RFCs and the mailing list, I guess we can actually have a unified close_notify semantics for all protocol versions. Still we need to track that close_notify state in the TLS engine to reject any "send_application_data" etc.
So the main question is what tls-eio should do until then. Some options are:
1. Simply document in `tls_eio.mli` that shutdown closes both directions for now, and note that we intend to improve this in future.
2. As (1), but also raise an exception if you try to shut down the sending side only.
3. As (1), but raise a helpful exception on future reads instead of end-of-file.
4. Stop claiming to support generic POSIX-style shutdown and provide a separate `close_tls` operation, as tls-lwt does.
Part of the problem is that some users may be satisfied with the current behaviour (i.e. they don't plan to read after shutting down the sending side), so (2) or (4) will annoy them unnecessarily. And changing the API in (4) and then changing it back once it's working seems annoying too. So I think I prefer either (1) or (3).
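For reference, option (3) amounts to little more than this (stdlib-only sketch; all names hypothetical):

```ocaml
exception Tls_shut_down
(* Raised on reads after [shutdown], instead of silently returning
   end-of-file. *)

type t = { mutable shut_down : bool }

let shutdown t =
  (* Still closes both directions for now, as in option (1). *)
  t.shut_down <- true

let read t do_read =
  if t.shut_down then raise Tls_shut_down
  else do_read ()
```

That smallness is part of why (3) appeals: it improves the error without committing us to an API we'd have to walk back once real half-close support lands.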