Datomic Rationale
Introduction
Datomic is a distributed database designed to enable scalable, flexible and intelligent applications, running on next-generation cloud architectures.
It does this by:
Datomic has:
Thus, Datomic avoids the compromises and losses of many NoSQL solutions. In addition, it offers flexibility and power over the traditional model in supporting:
Datomic avoids manual caching and replication, complex configuration, sharding (automatic or manual), logging, locking, latching and disk management of traditional servers.
The Database, Deconstructed
The vast majority of databases today reflect designs from decades ago, when memory and disks were very small and very expensive. Now they are a million times more capacious and cheaper, and many of the presumptions underlying database design should be revisited in that light. Most significantly, databases of the past were defined in terms of updating places, in an effort to conserve disk space and memory. Place-oriented programming needs to be abandoned if we are to move away from efficiency-in-the-small to capability in the large.
It is interesting that we use the terms 'memory' and 'records' when talking about update-in-place databases, as records of the past (predating computers) were not in fact erased as new records were made. Nor do we erase our old mental memories in order to form new ones. It is likely we will look back at the last few decades of the 20th century as an unfortunate time when the economics of computers kept us from doing the right thing. The time to change that is now.
Databases have traditionally been called upon to deliver the following services, among others:
Update-in-place, even when implemented using append-only techniques, drives these services to be co-located. A traditional DBMS server is in charge of all of these, and frequently becomes a brittle, difficult-to-scale component of an application architecture. Some distributed architectures use sharding or other techniques to divide up the data and allow independent islands of this stack of services, but do not significantly break the services themselves apart. Breaking these services apart is exactly what is needed to enable more flexible architectures and new capabilities.
Why do we want to break things apart? There are several general benefits of independence:
And specifically, in the case of Datomic, we seek to dislodge query capabilities from the servers and relocate them into applications, as that is the only way to get scalable and elastic intelligence for our applications.
The Datomic Architecture
Datomic divides the traditional model into 3 independent roles:
Peers
The peer component is a library that gets embedded in the applications.
Transactors
Storage services
Many benefits fall out of this decomposition:
Separating reads and writes
When reads are separated from writes, writes are never held up by queries. In the Datomic architecture, the transactor is dedicated to transactions, and need not service reads at all!
Integrated data distribution
Each peer and transactor manages its own local cache of data segments, in memory. This cache self-tunes to the working set of that application. All caching is fully integrated into the system, not added on top of the system manually by the users.
Every peer gets its own brain (query engine and cache)
Traditional client-server databases engender a strong sense of us vs. them, here vs. there amongst application programmers. The data isn't local, you have to communicate with the server via strings, the declarative logic runs on the server but is unavailable to application code etc. This leads to the proverbial impedance mismatch. The problem, though, isn't that databases are insufficiently object-oriented, rather, that applications are insufficiently declarative. Moving a proper, index-supported, declarative query engine into applications will enable them to work with data at a higher level than ever before, and at application-memory speeds.
Elasticity
Application servers are frequently scaled up and down as demand fluctuates, but traditional databases, even with read-replication configured, have difficulty scaling query capability similarly. Putting query engines in peers makes query capability as elastic as the applications themselves. In addition, putting query engines into the applications themselves means they never wait on each other's queries.
Ready for the cloud
All components are designed to run on commodity servers, with expectations that they, and their attached storage, are ephemeral.
The speed of memory
While the (disk-backed) storage service constitutes the data of record, the rest of the system operates primarily in memory. Since memory prices are falling faster than business information is growing, it is only a matter of time before most businesses' data will fit in memory.
The Datomic Data Model
Immutable Data
Datomic is built upon the model of data consisting of immutable values. How can data be immutable? Don't facts change? They don't, in fact, when you incorporate time in the data. For instance, when Obama became president, it didn't mean that Bush was never president. As long as who is president isn't stored in a single (logical) place, there's no reason a database system couldn't retain both facts simultaneously. While many queries might be interested in the 'current' facts, others might be interested in, e.g., what the product catalog looked like last month compared to this month. Incorporating time in data allows the past to be retained (or not), and supports point-in-time queries. Many real-world systems have to retain all changes, and struggle mightily to efficiently provide the 'latest' view in a traditional database. This all happens automatically in Datomic. Datomic is a database of facts, not places.
Atomic Data - the Datom
Once you are storing facts, it becomes imperative to choose an appropriate granularity for facts. If you want to record the fact that Sally likes pizza, how best to do so? Most databases require you to update either the Sally record or document, or the set of foods liked by Sally, or the set of likers of pizza. These kinds of representational issues complicate and rigidify applications using relational and document models. This can be avoided by recording facts as independent atoms of information. Datomic calls such atomic facts 'datoms'. A datom consists of an entity, attribute, value and transaction (time). In this way, any of those sets can be discovered via query, without embedding them into a structural storage model that must be known by applications.
Minimal Schema
All databases have a schema, whether it is written down or not. Rigidity arises in a system to the extent the schema pervades the storage representation or application access patterns, making changes to your tables or documents difficult.
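As a rough illustration of why a flat set of datoms avoids this kind of rigidity, the sketch below models datoms as plain tuples in an append-only log. It is a toy model, not Datomic's API: the names `Datom`, `LOG`, `as_of`, `foods_liked_by` and `likers_of` are hypothetical, and the `added` flag used to mark a retraction is an assumption made for this example.

```python
from typing import Any, NamedTuple


class Datom(NamedTuple):
    """One atomic fact: entity, attribute, value and transaction (time)."""
    entity: str
    attribute: str
    value: Any
    tx: int             # transaction at which the fact was recorded
    added: bool = True  # assumption: False marks a retraction


# An append-only log of facts; nothing is ever updated in place.
LOG = [
    Datom("sally", "likes", "pizza", tx=1),
    Datom("bob", "likes", "pizza", tx=2),
    Datom("sally", "likes", "sushi", tx=3),
    Datom("sally", "likes", "pizza", tx=4, added=False),  # Sally no longer likes pizza
]


def as_of(log, tx):
    """Return the set of (entity, attribute, value) facts that hold as of `tx`."""
    facts = set()
    for d in sorted(log, key=lambda d: d.tx):
        if d.tx > tx:
            break
        fact = (d.entity, d.attribute, d.value)
        if d.added:
            facts.add(fact)
        else:
            facts.discard(fact)
    return facts


def foods_liked_by(facts, person):
    """The 'set of foods liked by a person', derived by query rather than stored."""
    return {v for (e, a, v) in facts if e == person and a == "likes"}


def likers_of(facts, food):
    """The 'set of likers of a food', derived from the very same facts."""
    return {e for (e, a, v) in facts if a == "likes" and v == food}


past = as_of(LOG, 3)  # point-in-time view
now = as_of(LOG, 4)   # latest view
print(foods_liked_by(past, "sally"), foods_liked_by(now, "sally"), likers_of(now, "pizza"))
```

Neither "foods liked by Sally" nor "likers of pizza" is baked into the storage layout. Both are derived from the same facts, and the earlier view stays available because the retraction is recorded as a new entry instead of overwriting an old one.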
The schema required to encode datoms is extremely minimal, consisting primarily of the attribute definitions, which specify name, type, cardinality etc. Applications written to this model are free of the structural rigidity of relational and document models. Pivoting your design is trivial. Hierarchy and sets are easy to represent. General purpose data manipulation logic is easy to write.
The Database
A database is just a set of datoms, indexed in various ways. These indexes contain all of the data, not pointers to data (i.e. they are covering indexes). The storage service and caches are just a distribution network for the data segments of these indexes, all of which are immutable, and thus effectively and coherently cached.
The Datomic Programming Model
Connecting
A peer embedded in an application connects to a storage service and transactor. It will pull index/data segments from the storage service as needed, and cache them locally, and will get updates from the transactor. Queries and data operations work against a dynamically merged view of the world (the stable index + recent changes). After running for a while, the working set of the application will be cached locally, causing the majority of queries to incur no network activity at all.
Database programming with ... data!
Traditional databases have data manipulation and query languages based upon strings. While strings are widely supported, they are not very programmable. Database application programs are riddled with nuisance string concatenation, substitution etc. Datomic is designed to be a programmable database, and as such uses widely supported data structures like lists and maps for transactions and queries. These data structures are far easier to construct and compose programmatically than are strings. Query results are returned as similar data. Datomic can be programmed in a consistent manner in any language that can manipulate lists and maps.
Transactions
Transactions fundamentally assert or retract sets of facts (you can think of a retraction as a new fact - Sally no longer likes pizza). The database can be extended with data functions that expand into other data functions, or eventually bottom out as assertions and retractions. A set of assertions/retractions/functions, represented as data structures, is sent to the transactor as a transaction, and either succeeds or fails all together, as one would expect.
Queries
Datomic supports Datalog as a query language. Datalog is a deductive query system combining a database of facts (the Datomic db) with a set of rules for deriving new facts from existing facts and other rules. Queries take the form of a partial specification of a rule or datom, finding all completions that satisfy the specification.
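The idea of transactions and queries expressed as plain data can be sketched with a toy in-memory model. This is not Datomic's API or its Datalog dialect. The functions `transact`, `match` and `query`, the `"add"`/`"retract"` operation names and the `?`-prefixed variable convention are all assumptions made for this example, and rules are left out entirely.

```python
def transact(facts, tx_data):
    """Apply ("add" | "retract", entity, attribute, value) operations all together.

    Returns a new frozenset of facts. If any operation is invalid, raises
    ValueError and the input set is left untouched (all-or-nothing).
    """
    new = set(facts)
    for op, e, a, v in tx_data:
        if op == "add":
            new.add((e, a, v))
        elif op == "retract":
            new.discard((e, a, v))
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return frozenset(new)


def is_var(term):
    """Strings starting with '?' are treated as logic variables."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, bindings):
    """Unify one (e, a, v) pattern with one fact; return new bindings or None."""
    new = dict(bindings)
    for term, value in zip(pattern, fact):
        if is_var(term):
            if term in new and new[term] != value:
                return None  # variable already bound to a different value
            new[term] = value
        elif term != value:
            return None      # constant does not match
    return new


def query(find, where, facts):
    """Find all completions of the partial specification given by `where`."""
    results = [{}]
    for pattern in where:
        extended = []
        for bindings in results:
            for fact in facts:
                b = match(pattern, fact, bindings)
                if b is not None:
                    extended.append(b)
        results = extended
    return {tuple(b[var] for var in find) for b in results}


db0 = frozenset()
db1 = transact(db0, [
    ("add", "sally", "likes", "pizza"),
    ("add", "bob", "likes", "pizza"),
    ("add", "bob", "likes", "sushi"),
    ("add", "sally", "age", 30),
    ("add", "bob", "age", 25),
])

# The query is ordinary data (lists and tuples), so it can be built programmatically.
where = [("?p", "likes", "pizza"), ("?p", "age", "?age")]
print(sorted(query(["?p", "?age"], where, db1)))

db2 = transact(db1, [("retract", "sally", "likes", "pizza")])
print(sorted(query(["?p"], [("?p", "likes", "pizza")], db2)))
```

Because each call to `transact` returns a new value and leaves the old one alone, `db1` can still be queried after `db2` is created. Sharing a variable such as `?p` across clauses is what expresses a join.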
This query capability is combined with a powerful hierarchical selection facility, so you can recover tree-like data without joins or complex re-assembly. Datalog with negation is of equivalent power to relational algebra with recursion. Datalog is a great fit for application queries due to:
Datomic's Datalog can query both database and non-database sources, and more than one source together, allowing transparent extension of declarative programming to application data. Datomic's Datalog can be extended with user-supplied predicates and functions written in your programming language. Because the query engine runs locally, these functions can be arbitrarily complex and run much more safely than having to install them on some server.
Consistency
Applications always work on a completely consistent snapshot of the database. The same snapshot can be used for an arbitrary time, for multiple queries etc, in full confidence that all the results will correlate, regardless of what has transpired in the interim. All this without impeding other peers, or even other threads using the database in the same process.
Time
Given a value of the database, one can obtain another value of the database as-of, or since, a point in time in the past, or both (creating a windowed view of activity). These database values can be queried ordinarily, without having to make special queries parameterized by time.
Summary
Datomic provides a truly distributed database system, separating transaction processing, storage, caching and query capabilities. This maximizes scalability and provides the redundancy necessary for the cloud, while retaining performance. Datomic moves powerful data manipulation capabilities, and the data itself, into applications, coupling them with a sound and flexible data model. This provides the basis for the next generation of elastic, intelligent applications, free from the bounds of traditional databases.
