A very nice and interesting post from Michael Stonebraker explains how errors dictate the trade-offs of the CAP theorem (Consistency, Availability and Partition-tolerance): since the theorem proves that one can't get all three at the same time, a system must give up one of them under error conditions, and NoSQL systems tend to relax the consistency model in favor of availability under partitions.
Most NoSQL systems adopt eventual consistency (it depends; in some systems it is configurable, and here is yet another nice article on the variations of eventual consistency from Werner Vogels, CTO of Amazon), especially when data is distributed across a cluster of systems. Stonebraker suggests CA (Consistency and Availability, the typical SQL system) rather than AP (Availability and Partition-tolerance, the typical NoSQL system), and makes the case by walking through the different error scenarios.
For example (just for fun), let's consider the same error conditions one by one and see how each can be handled in distributed cloud computing (SQL or NoSQL)…
Application errors; upon error, the system needs to roll back to its previous consistent state…
In a transactional system, rollback is easy if the unit of work is inside a transaction (automatically by the DBMS or manually by the application). In a key/value store, it is as easy as calling delete(key). A few systems use append-only mode (where there is no concept of updates or deletes; a few SaaS companies in the valley also run MySQL/InnoDB/MyISAM in this append-only mode, where the system never sees deletes or updates because everything runs on INSERT and SELECT only); there, re-appending the older value with a fresh timestamp or serial number can revert the latest change (depending on how the latest value is read), as in the sketch below.
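Here is a minimal sketch of that append-only idea, assuming a toy in-memory log where reads resolve the latest value by the highest serial number; AppendOnlyStore and its methods are hypothetical names, not any particular product's API:

```python
import itertools

class AppendOnlyStore:
    """Toy append-only key/value log: no updates or deletes, ever."""

    def __init__(self):
        self._log = []                     # immutable history of (serial, key, value)
        self._serial = itertools.count(1)  # monotonically increasing serial number

    def put(self, key, value):
        self._log.append((next(self._serial), key, value))

    def get(self, key):
        # Latest write wins: pick the entry with the highest serial number.
        entries = [(s, v) for s, k, v in self._log if k == key]
        return max(entries)[1] if entries else None

    def revert(self, key):
        # "Rollback" without deleting: re-append the previous value
        # under a newer serial so reads resolve to it again.
        entries = sorted((s, v) for s, k, v in self._log if k == key)
        if len(entries) >= 2:
            self.put(key, entries[-2][1])

store = AppendOnlyStore()
store.put("balance", 100)
store.put("balance", 250)    # bad application write
store.revert("balance")      # appends 100 again; nothing is deleted
print(store.get("balance"))  # -> 100
```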
Repeatable system crash on a bad query request, bad data, or the like…
It is really hard to escape from this failure, as the other systems in the cluster or the replicas can also fail under the same condition, unless the replicated system runs a different software version (a rare case) or provides some other degree of fault tolerance; the sketch below shows why plain failover does not help.
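A toy sketch of that point, assuming every replica runs identical code and therefore hits the same deterministic bug (all names hypothetical):

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def execute(self, query):
        # The same deterministic bug exists on every node.
        if "POISON" in query:
            self.alive = False
            raise RuntimeError(f"{self.name} crashed")
        return f"{self.name}: ok"

def run_with_failover(replicas, query):
    for r in replicas:
        if not r.alive:
            continue
        try:
            return r.execute(query)
        except RuntimeError as err:
            print(f"failing over after: {err}")
    raise SystemError("all replicas down; failover cannot mask a repeatable crash")

cluster = [Replica("node-a"), Replica("node-b"), Replica("node-c")]
print(run_with_failover(cluster, "SELECT 1"))    # fine: node-a answers

try:
    run_with_failover(cluster, "SELECT POISON")  # takes down every node in turn
except SystemError as err:
    print(err)
```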
Unrepeatable system crash (system failure, hardware/power issue, data center down, network issue…)
If it is a system failure, then it can be handled by failing over to another system or replica in the LAN or WAN. NoSQL offers a much better solution for fail-over, with online substitution of nodes into the cluster. The only issue: if data is persisted on the master node and the system implements a read-your-own-writes or monotonic-read consistency model (the same value upon subsequent requests), and the node fails before the change propagates to the replicas on other nodes, then the consistency model breaks, unless the model is designed such that more than one node shares the same write.
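Quorum replication is one common way to ensure more than one node shares the same write: with N replicas, requiring W write acknowledgments and R read responses such that W + R > N guarantees every read set overlaps the latest write set, so a fail-over still sees the value. A minimal sketch with a toy in-memory cluster, not any real product's API:

```python
N, W, R = 3, 2, 2                   # W + R > N, so read and write quorums overlap

replicas = [{} for _ in range(N)]   # each node maps key -> (version, value)
version = 0

def quorum_write(key, value):
    global version
    version += 1
    # A real system takes the first W *responders*; for simplicity
    # we just write to the first W replicas.
    for node in replicas[:W]:
        node[key] = (version, value)

def quorum_read(key):
    # Ask R replicas and keep the answer with the highest version.
    answers = [node.get(key, (0, None)) for node in replicas[:R]]
    return max(answers)[1]

quorum_write("profile", "v1")
quorum_write("profile", "v2")
replicas[0].clear()            # simulate the master node dying
print(quorum_read("profile"))  # -> "v2": the surviving quorum still has it
```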
Partition failure in LAN or WAN
If a bunch of partitions/nodes fail within the LAN or WAN due to network or power failure etc., then it is hard to fail over unless there is a replica or identical copy within the LAN or WAN. But given the network and/or power redundancy in a data center or cloud environment, this failure is very unlikely.
Fail-over due to slowness or poor response
When one of the nodes starts performing slowly due to higher load or heavier data processing, instead of failing over to another node (the typical NoSQL approach), Stonebraker recommends building a system that can absorb the load spikes; but that is really hard given growing data needs and poor capacity planning as things change over time. So one option is to retire the node by failing over to a newer one (not always possible if you already have the latest and greatest specs), or to add more nodes to take the load spikes (distributed), as in the sketch below.
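Consistent hashing is one widely used way to add nodes online for exactly this: only the keys on the new node's arc of the ring move, not the whole data set. A purely illustrative sketch (single point per node, no virtual nodes, not any specific system's ring):

```python
import bisect
import hashlib

def _h(s):
    # Map a string onto the hash ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, key):
        # Walk clockwise to the first node at or past the key's hash,
        # wrapping around at the end of the ring.
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect(hashes, _h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"user:{i}" for i in range(10_000)]
before = Ring(["node-a", "node-b", "node-c"])
after = Ring(["node-a", "node-b", "node-c", "node-d"])  # absorb the spike

moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # only node-d's share, not 100%
```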
But in the case of SQL with a shared-nothing or shared-common architecture, it is not always possible to distribute the load to more than one system. For example, client X lives in system A; when A exceeds its load, the only option is to split client X across systems A and B, but client X can't be split, due to data integrity or the lack of a common layer that can interact with both nodes and perform the join/merge operations. The sketch below shows what such a layer would have to do.
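To make that concrete, here is a hedged sketch of the missing common layer: scatter the query to each shard and merge the partial results in the application, which is the work a single-node SQL engine does for free (hypothetical schema and data, in-memory SQLite for illustration):

```python
import sqlite3

def make_shard(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (client TEXT, amount INTEGER)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return db

# Client X's rows ended up split across systems A and B.
shard_a = make_shard([("X", 100), ("X", 250)])
shard_b = make_shard([("X", 75)])

def total_for_client(client):
    # Scatter: run the partial aggregate on every shard.
    partials = [
        db.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE client = ?",
            (client,),
        ).fetchone()[0]
        for db in (shard_a, shard_b)
    ]
    # Gather: the merge/join step the application must now own.
    return sum(partials)

print(total_for_client("X"))  # -> 425, merged outside the database
```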
Both have their advantages and disadvantages; but if NoSQL adopted a strict consistency model like SQL's, it would be hard to scale in the distributed environment that is more or less demanded by many big web applications, where scalability comes only through distribution and consistency has to be sacrificed to some extent to get the blend of performance + scalability + high availability.