
A blog conversation about Datomic, reproduced from http://nosql.mypopescu.com/post/19310504456/thoughts-about-datomic

From Sergio: I waited for the Datomic announcement with great excitement, and I’d now like to share some thoughts, hoping they will be food for more comments or blog posts.

Datomic certainly provides interesting features, most notably:

- Clojure-style data immutability, separating entity values in time.
- A declarative query language with powerful aggregation capabilities.

But unfortunately, my list of concerns is way longer, maybe because some lower-level aspects weren’t addressed in the whitepaper, or maybe because my expectations were really too high. Let’s try to briefly enumerate the most relevant ones:

Datomic provides powerful aggregation/processing capabilities, but violates one of the most important rules in distributed systems: collocating processing with data, as data must be moved from storage to peers’ working set in order to be aggregated/processed. In my experience, this is a huge penalty when dealing with even medium-sized datasets, and just answering that “we expect it to work for most common use cases” isn’t enough.

My comment: The answer to similar comments has pointed to the local caches, but I think this is still a very valid observation.
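
For concreteness, here is a minimal sketch of what peer-side aggregation looks like with the Datomic peer API; the connection URI and the :sale/* attributes are hypothetical.

```clojure
(require '[datomic.api :as d])

;; Peer-side aggregation: the query runs inside the peer process,
;; which pulls the index segments it needs from storage into its
;; local cache on demand. URI and :sale/* attributes are hypothetical.
(def conn (d/connect "datomic:dev://localhost:4334/shop"))

(d/q '[:find ?product (sum ?amount)
       :with ?sale
       :where [?sale :sale/product ?product]
              [?sale :sale/amount  ?amount]]
     (d/db conn))
;; => e.g. [["widget" 1200] ["gadget" 340]]
```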

In my experience, in-process caching of working sets usually ends up compromising overall application reliability: the application spends much of its time managing the working-set cache, faulting or flushing objects or GC’ing them, rather than doing its own work.

Transactors are both a single point of bottleneck and a single point of failure: you may not care about the former (I would, by the way), but you have to care about the latter.

My comment: The Datomic paper contains an interesting formulation of the transactor’s role with respect to reads and writes:

When reads are separated from writes, writes are never held up by queries. In the Datomic architecture, the transactor is dedicated to transactions, and need not service reads at all!

In an ACID system, though, both reads and writes are transactions.

You say you avoid sharding, but with the transactor being a single point of bottleneck, the time will come when you have too much data for a single-transactor system, and then you’ll have to, guess what, shard, and Datomic apparently has no support for this.

There’s no mention of how Datomic deals with network partitions.

I think that’s enough. I’ll be happy to read any feedback about my points.

From Rich: There seems to be some idealized target in Sergio's comments, and I'd love to know what that is so we can clearly identify the tradeoffs it makes. The ones Datomic makes are clear and by design, but it is a very different system, and as such I expect similar misconceptions and misgivings, so I'll try to address them here.

Datomic provides powerful aggregation/processing capabilities, but violates one of the most important rules in distributed systems: collocating processing with data, as data must be moved from storage to peers’ working set in order to be aggregated/processed.

There is in some sense no such thing as collocating processing with data. Data is located on disks/SSDs, and they don't have any processing capability. It may have been the case that, in the past, those disks were housed in machines with accessible processors, and there was leverage in utilizing them. But those days are numbered. First, any such processing power is inherently limited and not horizontally scalable. Second, it requires location-sensitive processing distribution or scatter-gather to all machines in an effort to reach relevant data, often not in real-time (e.g. map-reduce).

Consider the future, in which network-addressable SSDs (a la DynamoDB) provide equally speedy access to data to all comers. Why ever in that model would one make certain machines the exclusive processors of certain data segments? Any segmentation one would have done physically at the bottom one could certainly do virtually at the top. The key difference is that many varied, dynamic, and horizontally scalable sets can be created, instead of a single set of disk hosts and a single partitioning.

It would certainly be an anti-pattern for AWS to offer compute on the DynamoDB storage hosts, as that would interfere with their ability to offer deterministic latency. So, storage-as-a-service would seem to be in inherent conflict with traditional notions of 'collocating processing with data', and that 'rule' can't be considered essential.

In my experience, in-process caching of working sets usually ends up compromising overall application reliability: the application spends much of its time managing the working-set cache, faulting or flushing objects or GC’ing them, rather than doing its own work.

That has a lot to do with a) what is being cached (e.g. mutable objects vs immutable index segments), and b) working set coherence. Again, if one could have sharded under, physically, one could as easily 'shard over' virtually (e.g. consistent hashing of user IDs in a routing layer) in order to get coherent working sets. There is no reason why one tier should be any better at this than another. It is arguably just a matter of perspective whether or not one considers Datomic's empowered peers the new implementation of collocating processing with data.
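
The “shard over” idea can be sketched concretely: a minimal consistent-hash routing layer (not part of Datomic; peer names and replica counts are illustrative) that sends requests for the same user to the same peer, keeping that peer’s working set coherent.

```clojure
;; A minimal sketch of "sharding over" in a routing layer: hash user
;; IDs onto a ring so requests for the same user always reach the same
;; peer, keeping that peer's working-set cache coherent.
(def peers ["peer-a" "peer-b" "peer-c"])

;; Place each peer at several points on a ring keyed by hash value.
(def ring
  (into (sorted-map)
        (for [p peers, replica (range 3)]
          [(hash (str p "#" replica)) p])))

(defn peer-for
  "Return the peer whose ring point is the first at or above the
   user's hash, wrapping around to the start of the ring if needed."
  [user-id]
  (let [h (hash user-id)]
    (val (or (first (subseq ring >= h))
             (first ring)))))

;; (peer-for "user-42") ;=> e.g. "peer-b", stable for a given ring
```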

Transactors are both a single point of bottleneck and a single point of failure: you may not care about the former (I would, by the way), but you have to care about the latter.

Transactors are subject to the same hot-standby availability recipes that have been used successfully for ordinary servers for decades. Datomic will support multiple transactors in an auto-scaling group, standing by for failover. Furthermore, it is much simpler to configure, as there need be no complex server-to-server streaming.

In an ACID system, though, both reads and writes are transactions.

The need for transactions in order to get consistent reads is a (rather severe) limitation of an update-in-place system. Datomic is fully ACID, but not update-in-place. One key advantage of immutability is that consistent reads no longer require transactional coordination.
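
For example, with the Datomic peer API a single immutable database value can serve any number of consistent reads, including reads as of a past instant; the connection and the :account/* attributes below are assumed for illustration.

```clojure
(require '[datomic.api :as d])

;; A minimal sketch, assuming an existing connection `conn` and
;; hypothetical :account/* attributes. `d/db` returns an immutable
;; database value; every query against it sees the same point in time,
;; with no read transaction, locks, or coordination with the transactor.
(let [db (d/db conn)]
  {:now   (d/q '[:find ?name ?balance
                 :where [?a :account/name    ?name]
                        [?a :account/balance ?balance]]
               db)
   ;; The same value can be rewound: query the database as of an
   ;; earlier instant without blocking current writers.
   :jan-1 (d/q '[:find ?name ?balance
                 :where [?a :account/name    ?name]
                        [?a :account/balance ?balance]]
               (d/as-of db #inst "2015-01-01"))})
```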

You say you avoid sharding, but with the transactor being a single point of bottleneck, the time will come when you have too much data for a single-transactor system, and then you’ll have to, guess what, shard, and Datomic apparently has no support for this.

Many, many businesses do not now, and will never, saturate a single machine due to write volume. The point of Datomic is that those businesses are giving up a lot when adopting non-transactional, non-consistent systems, when all they need is read, query or storage scaling.

In any case, sharding means giving up having one database and creating two or more, and you can certainly do that with Datomic (each with their own transactor). Furthermore, you can, if needed, conveniently query and join those two databases, due to the independently-located query power.
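
A minimal sketch of such a cross-database query, assuming two hypothetical databases (URIs and attribute names invented for illustration), each with its own transactor; Datalog accepts multiple database inputs, and each data pattern can name its source.

```clojure
(require '[datomic.api :as d])

;; Querying across two separately-transacted databases (one per
;; "shard"). The join happens in the peer running the query.
;; URIs and attribute names are hypothetical.
(let [users-db  (d/db (d/connect "datomic:dev://shard-a:4334/users"))
      orders-db (d/db (d/connect "datomic:dev://shard-b:4334/orders"))]
  (d/q '[:find ?email ?amount
         :in $users $orders
         :where [$users  ?u :user/email   ?email]
                [$orders ?o :order/email  ?email]
                [$orders ?o :order/amount ?amount]]
       users-db orders-db))
```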

There’s no mention of how Datomic deals with network partitions.

On the write side, Datomic has the same characteristics as any traditional database (e.g. PostgreSQL), favoring consistency. On the read side, Datomic has the same characteristics as DynamoDB and other redundant distributed solutions.

The bottom line is that Datomic is a hybrid system. On the write side it is somewhat traditional, emphasizing consistency and transactions, and making traditional tradeoffs regarding scaling and availability. It is for people whose requirements include transactions and consistency. It combines that with quite painless, elastic and configuration-light scaling of query, reads and storage, a time-aware, flexible data model and much more.

I hope this clarifies some things, and thanks for covering Datomic!
