
[CH-4] Support for large distributed networks #455

[Imported from JIRA. Reported by Tim Watson (Administrator, @hyperthunk) as CH-4 on 2012-12-18 16:04:53]
Issue taken from: https://cloud-haskell.atlassian.net/browse/CH-4

This was first raised as a question on issue #99. This is a common problem in distributed systems.

Distributed Erlang systems require a fully connected network of nodes, which places considerable strain on system resources. As a professional Erlang programmer, I've seen anecdotal reports of distributed Erlang systems with thousands of nodes, though these apparently start to come unstuck beyond 1000 or so nodes.

In issue #99 @robstewart57 mentioned operating system limits on open TCP connections, and I'd like to discuss that in a bit more detail.

> As I discovered [http://www.haskell.org/pipermail/haskell-cafe/2012-February/099212.html], there are operating system limits on the number of open TCP connections --- by default 1024. If we are thinking at scales of 1,000's of nodes, then connecting every endpoint to every other endpoint is problematic.

As the respondent in that thread points out, the 1024 file descriptor limit is an artefact of the unix select system call, which can handle only that many descriptors at a time. Modern non-blocking I/O system calls such as poll and epoll on linux, kqueue on BSD variants and AIO capabilities on other operating systems can support a much higher number of open file descriptors in practice.
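To make that limit concrete, here is a minimal sketch (assuming the unix package's System.Posix.Resource API, not anything in Cloud Haskell itself) that raises the per-process soft limit on open file descriptors up to the hard limit before a transport starts holding many TCP connections:

```haskell
import System.Posix.Resource
  ( Resource (ResourceOpenFiles)
  , ResourceLimits (..)
  , getResourceLimit
  , setResourceLimit
  )

-- Raise the soft limit on open file descriptors to the hard limit.
-- Only a privileged process may raise the hard limit itself.
raiseFdLimit :: IO ResourceLimits
raiseFdLimit = do
  limits <- getResourceLimit ResourceOpenFiles
  setResourceLimit ResourceOpenFiles limits { softLimit = hardLimit limits }
  getResourceLimit ResourceOpenFiles
```

Even with the limit raised, each open connection still costs kernel memory and buffers, which is why the rest of this issue is about managing connections rather than just raising limits.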

There are still limits, however. Forcing all nodes to be fully interconnected carries a significant overhead, especially with a connection-oriented transport layer protocol such as TCP.

There is a case to say that this concern is not one for the programmer to deal with.

I think that's both true and false at the same time. It depends, as the saying goes. Let's address the problem area a little, then we can come back to areas of responsibility.

Handling Connections Efficiently

A solution would be to have the transport on each node manage the connections on each of its endpoints. Heuristics might include killing heavyweight connections to remote endpoints that haven't been used within a time limit, killing old connections when new connection requests are made and a connection limit (e.g. ulimit) isn't far from being breached, etc.
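A rough sketch of the idle-reaping heuristic (all the names and types here are hypothetical, not part of network-transport): track the last time each heavyweight connection was used and periodically close those idle past a configured limit.

```haskell
import qualified Data.Map.Strict as Map
import Data.Time.Clock (NominalDiffTime, UTCTime, diffUTCTime)

type ConnId = Int

-- A heavyweight connection paired with the time it was last used.
data Tracked conn = Tracked { connection :: conn, lastUsed :: UTCTime }

-- Split the table into connections to close (idle too long) and those to keep.
reapIdle :: NominalDiffTime
         -> UTCTime
         -> Map.Map ConnId (Tracked conn)
         -> ([conn], Map.Map ConnId (Tracked conn))
reapIdle maxIdle now conns = (map connection (Map.elems stale), live)
  where
    (stale, live) = Map.partition tooOld conns
    tooOld t = diffUTCTime now (lastUsed t) > maxIdle
```

A connection manager would run this periodically with the current time, tightening the idle threshold as the table approaches the descriptor limit.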

The classic way to deal with this is to use a kind of 'heartbeat' sensor that periodically sends a ping to the other end to ensure it's alive, and closes the connection if it doesn't receive a response in a timely fashion. The definition of 'timely' is up for debate and, of course, this might benefit from being configurable.

Erlang certainly does support this notion, and the time limit for the heartbeat that determines whether or not a connection should be considered dead and torn down is set via the kernel application's net_ticktime parameter (e.g. passing -kernel net_ticktime 60 to the virtual machine).
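A minimal Cloud Haskell sketch of the same idea (the Ping/Pong types and the timeout value are illustrative; the peer is assumed to answer each Ping with a Pong):

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Control.Distributed.Process
import Data.Binary (Binary)
import GHC.Generics (Generic)

data Ping = Ping ProcessId deriving (Generic)
data Pong = Pong           deriving (Generic)
instance Binary Ping
instance Binary Pong

-- Ping a peer and report whether it replied within the (configurable) window.
-- A connection manager would tear the connection down on a 'False' result.
heartbeat :: ProcessId -> Int -> Process Bool
heartbeat peer timeoutMicros = do
  self <- getSelfPid
  send peer (Ping self)
  reply <- expectTimeout timeoutMicros :: Process (Maybe Pong)
  return (maybe False (const True) reply)
```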

These, however, are concerns of handling connections efficiently and carefully in a fully connected system. They do not actually address the scalability concerns that you've raised, because even if the system is fully operational - i.e., all the nodes stay online and are able to remain connected to one another - there is still a limit to how many nodes you can interconnect before you begin to exhaust system limits on each node.

Personally I do think that connection management should be configurable and exposed via the management APIs, but I do not think it should be magic. We should simply pick some sensible defaults.

Another issue to be aware of, with all this, is that Cloud Haskell deliberately makes the programmer handle reconnect explicitly. This is a good thing because, unlike Erlang, it forces the programmer to be aware that message ordering and/or delivery guarantees might not hold after a reconnection has occurred.
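For reference, the explicit-reconnect idiom looks roughly like this (a sketch; the monitor setup and the string payload are illustrative, though reconnect itself is the real API call in Control.Distributed.Process):

```haskell
import Control.Distributed.Process

-- After losing a peer, the programmer must call 'reconnect' explicitly,
-- acknowledging that ordering/delivery guarantees for earlier messages are gone.
resumeTalkingTo :: ProcessId -> Process ()
resumeTalkingTo peer = do
  _ <- monitor peer
  -- Block until the monitor reports the peer (or its connection) as lost.
  ProcessMonitorNotification _ _ _ <- expect
  reconnect peer
  send peer "hello again"
```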

There are, BTW, some open issues to look at this in detail:

Scalable Distributed Computing

Ha ha - the title says it all! This is a massive area of distributed systems research and we'd do well to absorb as much of that as makes sense before pushing forward with any particular implementation.

One fairly common solution to the problem of needing fully connected networks is to create independent federated clusters. In this architecture, you have several clusters of nodes, which are fully inter-connected (like a normal CH system). These clusters are then connected to one another via a quorum of members, which act as intermediaries.
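A hand-wavy sketch of that shape (every name here is hypothetical; nothing like this exists in distributed-process today): each cluster designates a small set of gateway processes, and only the gateways hold connections into the other cluster, relaying traffic on behalf of everyone else.

```haskell
import Control.Monad (forever)
import Control.Distributed.Process

-- Runs inside cluster A; 'remotePeer' is a quorum intermediary in cluster B.
-- Local processes send to this gateway, so only the gateway's node needs a
-- connection across the cluster boundary.
gateway :: ProcessId -> Process ()
gateway remotePeer = forever $ do
  msg <- receiveWait [ matchAny return ]
  forward msg remotePeer
```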

> I'm not sure whether this is a parameter set when creating transports or endpoints, but such connection management should probably not be a concern of the programmer... (?)

Connection management is very much the concern of the programmer, but we should not make it a barrier to entry. You should

  • be able to build a scalable, high performance distributed system on CH using the (sensible) defaults
  • be able to configure connection management (such as heartbeat timeouts) if you wish
  • be able to manage the network layer at runtime, once your application is in production
  • be able to choose the right strategy for your own needs, where this is appropriate [*]

On that last point, the example I gave of federated clusters is a good one. We should not be doing something like that automatically IMO, but if that kind of capability does exist then we should make it easy and transparent for the application developer to take advantage of it. From the system architect's point of view, however, this is almost certainly something that should be turned on or off explicitly.
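To make "sensible defaults, but configurable" concrete, the knobs could look something like the record below (a sketch only; none of these fields exist in the current transport or node APIs):

```haskell
-- Connection-management knobs with defaults; the architect overrides
-- individual fields rather than relying on magic behaviour.
data ConnectionPolicy = ConnectionPolicy
  { heartbeatInterval :: Int   -- microseconds between pings
  , heartbeatTimeout  :: Int   -- microseconds to wait for a reply
  , maxIdleSeconds    :: Int   -- reap connections unused for this long
  , maxConnections    :: Int   -- soft cap, kept well below the descriptor limit
  , federationEnabled :: Bool  -- off by default; turned on explicitly
  }

defaultConnectionPolicy :: ConnectionPolicy
defaultConnectionPolicy = ConnectionPolicy
  { heartbeatInterval = 5000000   -- 5s
  , heartbeatTimeout  = 15000000  -- 15s
  , maxIdleSeconds    = 300
  , maxConnections    = 512
  , federationEnabled = False
  }
```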

Finally, I do need to point out that this is way off our radar right now. That doesn't mean it's unimportant, but the attention it receives will increase over time - I see the massive scalability support as low(ish) priority, versus the connection management, which is of medium importance - less so than, for example, being able to send messages efficiently to local processes, but perhaps more so than providing some gorgeous web UI for cluster/grid administration.
