Distributed Transactions

ACID, Commit protocols, and BASE

Paul Krzyzanowski

November 2012 (major updates from October 2007)

Introduction

We've looked at a number of low level techniques that can be used for managing synchronization in a distributed environment: algorithms for mutual exclusion and critical section management. In addition (and we'll look at these later), we can have algorithms for deadlock resolution and crash recovery. Much as remote procedure calls allowed us to concentrate on the functionality of a program and express it in a more natural way than sends and receives, we crave a higher level of abstraction in dealing with issues of synchronization. This brings us to the topic of atomic transactions (also known colloquially simply as transactions).

In transacting business, all parties involved may have to go through a number of steps in negotiating a contract but the end result of the transaction won't be committed until both parties sign on the dotted line. If even one of the parties reconsiders and aborts, the contract will be forgotten and life goes on as before.

Consider, for example, the purchase of a house. You express your interest in purchasing a house by making an offer (and possibly putting some money down with a trusted party). At that point, you have not bought the house, but you have entered the transaction of purchasing a house. You may have things to do (such as getting a mortgage and inspection) and the seller may have things to do (such as fixing up certain flaws). If something goes wrong (you can't get a mortgage, the seller won't fix the heating system, you find the house is sitting on a fault line, the seller won't remove the black velvet wallpaper, ...), then the transaction is cancelled (aborted) and both parties go back to life as before: you look for another house and the seller remains in the house, possibly still trying to sell it. If, however, the transaction is not aborted and both parties sign the contract on the closing day, it is made permanent. The deed is signed over and you own the house. If the seller changes her mind at this point, she'll have to try to buy back the house. If you change your mind, you'll have to sell the house.

The concept of a transaction in the realm of computing is quite similar. One process announces that it's beginning a transaction with one or more processes. Certain actions take place. When all processes commit, the results are permanent. Until they do so, any process may abort (if something fails, for example). In that case, the state of computing reverts to the state before the transaction began: all side effects are gone. A transaction has an all or nothing property.

The origins of transactions in computing date back to the days of batch jobs scheduled to processes tapes. A days worth of "transactions" would be logged on a tape. At the end of the day, a merge job would be run with the original database tape and the transactions tape as inputs, producing a new tape with all the transactions applied. If anything went wrong, the original database tape was unharmed. If the merge succeeded, then the original tapes could be reused.

Transaction model

A process that wishes to use transactions must be aware of certain primitives associated with them. These primitives are:

begin transaction - mark the start
end transaction - mark the end; try to commit
abort transaction - kill transaction, restore old values
read data from object(file), write data to object(file).

In addition, ordinary statements, procedure calls, etc. are allowed in a transaction.

To get a flavor for transactions, consider booking a flight from Newark, New Jersey to Ridgecrest, California. The destination requires us to land at Inyokern airport, and non-stop flights are not available:

transaction begin
1. reserve a seat for Newark to Denver (EWK→DEN)
2. reserve a seat for Denver to Los Angeles (DEN→LAX)
3. reserve a seat for Los Angeles to Inyokern (LAX→IYK)
transaction end

Suppose there are no seats available on the LAX→IYK leg of the journey. In this case, the transaction is aborted, reservations for (1) and (2) are undone, and the system reverts to the state before the reservation was made.

Properties of transactions

The properties of transactions are summarized with the acronym ACID, which stands for Atomic, Consistent, Isolated, and Durable.

Atomic

either an entire transaction happens completely or not at all. If the transaction does happen, it happens as a single indivisible action. Other processes cannot see intermediate results. For example, suppose we have a file that is 100 bytes long and a transaction begins appending to it. If other processes read the file, they only see the 100 bytes. At the end of the transaction, the file instantly grows to its new size.

Consistent

If the system has certain invariants, they must hold after the transaction (although they may be broken within the transaction). For example, in some banking application, the invariant may be that the amount of money before a transaction must equal the amount of money after the transaction. Within the transaction, this invariant may be violated but this is not visible outside the transaction.

Isolated (or serializable)

If two or more transactions are running at the same time, to each of them and to others, the final result looks as though all transactions ran sequentially in some order.

An order of running transactions is called a schedule. Orders may be interleaved. If no interleaving is done and the transactions are run in some sequential order, they are serialized.

Consider the following three (small) transactions:

begin
x=0
x=x+1
end

begin
x=0
x=x+2
end

begin
x=0
x=x+3
end

Some possible schedules are (with time flowing from left to right):

schedule	execution order						final x	legal?
schedule 1	x=0	x=x+1	x=0	x=x+2	x=0	x=x+3	3	yes
schedule 1	x=0	x=0	x=x+1	x=x+2	x=0	x=x+3	3	yes
schedule 1	x=0	x=0	x=x+1	x=0	x=x+2	x=x+3	5	NO

Durable

Once a transaction commits, the results are made permanent. No failure after a commit can undo results or cause them to get lost. [Conversely, the results are not permanent until a transaction commits.]

Nested transactions

Transactions may themselves contain subtransactions (nested transactions). A top-level transaction may fork off children that run in parallel with each other. Any or all of these may execute subtransactions.

The problem with this is that the subtransactions may commit but, later in time, the parent may abort. Now we find ourselves having to undo the committed transactions. The level of nesting (and hence the level of undoing) may be arbitrarily deep. For this to work, conceptually, each subtransaction must be given a private copy of every object it may manipulate. On commit, the private copy displaces its parent's universe (which may be a private copy of that parent's parent).

Implementation

We cannot just allow a transaction to update the objects (files, DB records, et cetera) that it uses. The transactions won't be atomic (i.e., appear indivisible) or consistent in that case. If other transactions read and act on the data, we also violate the isolated property. Finally, we need to ensure that we can undo changes if the transaction aborts. One way of supporting object modification is by providing a private workspace. When a process starts a transaction, it's given a private workspace containing all the objects to which it has access. On a commit, the private workspace becomes the real workspace. Clearly this is an expensive proposition. It requires us to copy everything that the transaction may modify (every file, for example). However, it is not as bleak as it looks. A number of optimizations can make this a feasible solution.

Suppose that a process (transaction) reads a file but doesn't modify it. In that case it doesn't need a copy. The private workspace can be empty except that it contains a pointer back to the parent's workspace. How about writing a file? On an open, don't copy the file to the private workspace but just copy the index (information of where the file's data is stored; a UNIX inode, for example). The file is then read in the usual way. When a block is modified, a local copy is made and the address for the copied block is inserted into the index. New blocks (appends) work this way too. Privately allocated blocks are called shadow blocks.

If this transaction was to abort, the private blocks go back on the free list and the private space is cleaned up. Should the transaction commit, the private indices are moved into the parent's workspace (atomically). Any parent blocks that would be overwritten are freed.

Another, and more popular, mechanism for ensuring that transactions can be undone (and possibly redone) is the use of a write-ahead log, also known as an intentions list. With this system, objects are modified in place (proper locking should be observed to control when other processes can access these objects). Before any data is changed, a record is written to the write-ahead log in stable storage. The record identifies the transaction (with an ID number), the block or page modified, and the old and new values. This log allows us to undo the effects of a transaction should an abort be necessary.

If the transaction succeeds (i.e., commits), a commit record is written to the log. If the transaction aborts, the log is used to back up to the original state (this is called a rollback. The write-ahead log can also be played forward for crash recovery (this becomes useful in the two-phase commit protocol, which is discussed next). A term associated with the write-ahead log was stable storage. This is intended to be a data repository that can survive system crashes. After a datum is written to stable storage, it is retrievable even if the system crashes immediately after the write. A disk is suitable for stable storage, but it is important that any writes are immediately flushed to the disk and not linger in the memory (unstable) buffer cache.

The two-phase commit protocol

(Gray, 1978)

In a distributed system, a transaction may involve multiple processes on multiple machines. Even in this environment, we still need to preserve the properties of transactions and achieve an atomic commit (either all processes involved in the transaction commit or else all of them will abort the transaction - it will be unacceptable to have some commit and some abort). A protocol that achieves this atomic commit is the two-phase commit protocol.

In implementing this protocol, we assume that one process will function as the coordinator and the rest as cohorts (the coordinator may be the one that initiated the transaction, but that's not necessary). We further assume that there is stable storage and a write-ahead log at each site. Furthermore, we assume that no machine involved crashes forever.

The protocol works as follows (the coordinator is ready to commit and needs to ensure that everyone else will do so as well):

phase coordinator cohort
1
request write prepare to commit message to the log work on transaction; when done, wait for a prepare message
send prepare to commit message
wait for reply receive message. When transaction is ready to commit, write agree to commit (or abort) to log.
send "agree" or "abort" reply
2
commit write commit message to the log. wait for commit message
send commit (or abort) message receive commit (or abort) message
wait for all cohorts to respond if a commit was received, write "commit" to the log, release all locks & resources, update databases.
if an abort was received, undo all changes.
send done message.
clean up all state. Done.

What the two phase commit protocol does is this:

In phase 1, the coordinator sends a request to commit to all the cohorts and waits for a reply from all of them. The reply is either an agreement or an abort. Note that nobody has committed at this point. After the coordinator receives a reply from all cohorts, it knows that all transaction-relevant computation is finished so nothing more will happen to abort the transaction. The transaction can now be committed or, in the case that at lease one of the parties could not complete its transaction, aborted. The second phase is to wait for all cohorts to commit (or abort). If aborting, an abort message is sent to everyone. The coordinator waits until every cohort responds with an acknowledgement. If committing, a cohort receives a commit message, commits locally, and sends an acknowledgment back. All message deliveries are reliable (retransmits after time-out).

No formal proof will be given here of the correctness of the two-phase protocol. Inspecting for correctness, it is readily apparent that if one cohort completes the transaction, all cohorts will complete if eventually. If a cohort is completing a transaction, it is because it received a commit message, which means that we're in the commit phase and all cohorts have agreed. This information is in permanent memory in case of a crash (that's why information is written to the log before a message is sent. If any system crashes, it can replay its log to find its latest state (so it will know if it was ready to commit, for example). When the coordinator is completing, it is ensured that every cohort completes before the coordinator's data is erased(update).

Three-Phase Commit

A problem with the two-phase commit protocol is that there is no time limit for the protocol to complete. A sub-transaction may be delayed indefinitely or the process (or machine) may die and it might be a long time before it restarts. If the coordinator dies, there is no easy way for a standby coordinator to find out the state of the protocol and continue the commit. From a practical point of view, this is not good.

The three-phase commit protocol is a variation of the two-phase commit protocol that places an upper bound on the time that a transaction may take to commit or abort. It also introduces an extra phase where cohorts are told what the consensus was so that any of them that received this information before a coordinator died could inform a standy coordinator whether there was a unanimous decision to commit or abort.

The setup is the same as with the two-phase commit protocol. A coordinator process is in charge of soliciting votes from multiple cohorts that are responsible for the various sub-transactions of the top-level transaction. Here are the steps:

phase coordinator cohort
1
request write prepare to commit message to the log work on transaction; when done, wait for a prepare message
send prepare to commit message receive message. When transaction is ready to commit, write agree to commit (or abort) to log.
if timeout on waiting for a prepare message, then abort
wait for reply from all cohorts.

2
commit authorized if all replies have been received and all replies are "agree" messages, then write prepare-to-commit message to the log.
else if all replies are not received before a timeout or at least a single abort message is received then write an abort message to the log. wait for a prepare-to-commit or abort message
send a prepare-to-commit or abort message to all cohorts. If the cohort receives a prepare-to-commit message, it sends back an acknowledgement and waits. The commit does not yet take place.
If the cohort receives an abort message or times out waiting for a message from the coordinator, then it aborts the transaction. This means that it releases all locks & resources, and reverts the state of the data it modified.
if a prepare-to-commit was sent, then wait for all cohorts to respond.
otherwise, we're done.
3
commit finalized write commit message to the log. wait for a commit message
send commit message receive a commit. release all locks & resources, make database changes permanent.
if a timeout on waiting for a commit message, then commit anyway.
send a commit completed message.

Receive commit completed messages from all cohorts. Give up waiting after a certain time.

clean up all state. Done.

If the coordinator crashes during this protocol, another one can step in and query the cohorts for the commit decision. If every cohort received the prepare-to-commit message then the coordinator can commit. If only some cohorts received the message, the coordinator now knows that the unanimous decision was to commit and can re-issue the request. If no cohort received the message, the coordinator can restart to protocol or, if necessary, restart the transaction.

Paxos Commit

What's wrong with the two-phase commit protocol?

The problem with the two-phase commit protocol is that it requires all systems to be available in order to complete. A single fault can make the two-phase commit protocol block. Two-phase commit is not fault tolerant because it uses a single coordinator whose failure can cause the protocol to block.

What about three-phase commit?

Three-phase commit tries to solve this with timeouts but no implementations have been put forth with a truly complete algorithm with a correctness proof. If the three-phase commit protocol implements voting for a coordinator, a key problem with the algorithm is that it is undefined what happens when a resource manager (the cohort, responsible for a sub-transaction) receives messages from two different processes; both claiming to be the current transaction manager (coordinator).

Can we get ACID guarantees that we want and still survive F faults?

Fault-tolerant consensus algorithms such as Paxos are designed to reach agreement and do not block whenever any majority of the processes are working. Let's use Paxos to create a fault-tolerant commit protocol that uses multiple coordinators. A majority of functioning coordinators will allow the commit to occur.

The participants in the algorithm are:

N resource managers (RMs). Each resource manager is associated with a single sub-transaction. For the transaction to be committed, each participating resource manager must be willing to commit it.
2F+1 acceptors, where F is the number of failures that we can tolerate. If F+1 acceptors see that all resource managers are prepared then then transaction can be committed. All instances of Paxos can share the same set of acceptors.
a Leader. The leader coordinates the commit algorithm. All instances of Paxos share the same leader. Unlike the two-phase commit, it is not a single point of failure.
One instance of the Paxos consensus algorithm is executed for each resource manager. Each instance provides a fault-tolerant way to agree on the commit or abort proposed by each resource manager: each resource manager is responsible for a sub-transaction.

Here's how we run the algorithm:

A client requests a commit by sending a commit request to a transaction manager. The Paxos Commit algorithm uses a separate instance of the Paxos consensus algorithm to obtain agreement on the decision each RM makes of whether to prepare (commit) or abort. We can represent this decision by unique values that represent Prepared and Aborted, respectively. The transaction will be committed if and only if each resource manager's instance chooses Prepared. Otherwise, the transaction is aborted.
The transaction manager sends a PREPARE message to each resource manager.
Each resource manager then sends a proposal to its own consensus algorithm (running on multiple servers). Each resource manager is the first proposer in its own instance of Paxos.
Each instance of the consensus algorithm sends the results back to the transaction manager.
The transaction manager is stateless and just gets consensus outcomes. It will issue a COMMIT or ABORT message to each resource manager based on whether it received any ABORT messages.

As long as the majority of acceptors are working, the transaction manager can always learn what was chosen. If it fails to hear from all the resource managers then it can make the decision to abort. Paxos maintains consistency, never allowing two different values to be chosen, even if multiple processes think they are the leader.

Paxos provides a fault-tolerant commit algorithm based on replication. With two-phase commit, you rely on the coordinator to not fail or to recover after a failure. With Paxos Commit, the two-phase commit's transaction manager's stable storage is replaced by the acceptor's stable storage. The transaction manager itself is replaced with a set of possible leaders. With two-phase commit, the transaction manager is solely responsible for deciding whether to abort. With Paxos Commit, a leader will make an abort decision only for a resource manager that cannot decide for itself (e.g., it is not functioning). This will ensure that the protocol will not block due to a failed resource manager.

Brewer's CAP Theorem

Eric Brewer proposed a conjecture that states that if you want consistency, availability, and partition tolerance, you have to settle for two out of three for any shared data system. This assertion as since been proven and Brewer's proposal is known as Brewer's CAP theorem, where CAP stands for Consistency, Availability, and Partitions. Partition tolerance means that all the systems will continue to work unless there is a total network failure. The inaccessibility of a few notes will not impact the system. Let's examine each of the aspects of CAP.

Consistency

Consistency in this discussion means that everyone sees the same view of the data if it is replicated in a distributed system. This can be enforced by forcing the algorithms to wait until all participating nodes acknowledge their actions (e.g., two phase commit). Guaranteeing this impacts availability. Alternatively, if we want to offer availability, we need to ensure that all live nodes can get updated and we have to give up on partition tolerance.

Availability

Availability refers to the system being highly available. Since commodity-built individual systems are not highly available, we achieve availability through redundancy, which means replication. If one system is down, a request can be fulfilled by another. In an environment with multiple systems connected on a network we have to be concerned about network partitioning. If we have partition tolerance, then we lose consistency: some systems are disconnected from the network segment where updates are being issued. Conversely, to keep consistency, we have to ensure that the network remains fully connected so that all live nodes can get updates. This means giving up on partition tolerance.

Partition Tolerance

Partition tolerance means that the system performs correctly even if the network gets segmented. This can be enforced by using a non-distributed system (in which case partitioning is meaningless) or by forcing the algorithms to wait until network partitioning no longer exists (e.g., two phase commit). Guaranteeing this impacts availability. Alternatively, the system can continue running, but partitioned nodes will not participate in the computation (e.g., commits, updates) and will hence have different values of data, impacting consistency.

Giving up on consistency allows us to use optimistic concurrency control techniques as well as leases instead of locks. Examples of this are web caches and the Domain Name System (DNS).

BASE: Giving up on ACID

Availability and partition tolerance are not part of the ACID guarantees of a transaction, so we may be willing to give those up to preserve database integrity. However, that may not be the best choice in all environments since it limits a system's ability to scale and be highly available. In fact, in a lot of environments, availability and partition tolerance are more important than consistency (so what if you get stale data?).

In order to guarantee ACID behavior in transactions, objects (e.g., parts of the database) have to be locked so that everyone will see consistent data, which involves other entities having to wait until that data is consistent and unlocked. Locking works well on a small scale but is difficult to do efficiently on a huge scale. Instead, it is attractive to consider using cached data. The risk is that we violate the "C" and "I" in ACID (Consistent & Isolated): two separate transactions might see different views of the same data. An example might me that you just purchased the last copy of a book on Amazon.com but I still see one copy remaining.

An alternative to the strict requirements of ACID is BASE, which stands for Basic Availability, Soft-state, Eventual consistency. Instead of requiring consistency after every transaction, it is enough for a database to eventually be in a consistent state. In these environments, accessing stale data is acceptable. This leniency makes it easy to cache copies of data throughout multiple nodes, never have to lock access to all those copies for any extensive time (e.g., a transaction operating on data will not lock all copies of that data), and update that data asynchronously (eventually). With a BASE model, extremely high scalability is obtainable through caching (replication), no central point of congestion, and no need for excessive messages to coordinate activity and access to data.

Some vocabulary

abort: transaction will not complete (commit). All changes are undone to the state before the transaction started.
commit: action which indicates that the transaction has successfully completed. All changes to the database, files, and objects are made permanent.
commit protocol: a fault-tolerant algorithm which ensures that all sides in a distributed system either commit or abort a transaction unanimously.
log: a record of system activity recorded in sufficient detail so that a previous state of a process can be restored.
redo: given a log record, redo the action specified in the log.
stable storage: permanent storage to which we can do atomic writes.
transaction: an atomic action which is some computation that read and/or changes the state of one or more data objects and appears to take place indivisibly.
write-ahead log protocol: a method in which operations done on objects may be undone after restarting a system.

References

Eric A. Brewer, Towards Robust Distributed Systems, PODC Keynote, 2004.
Seth Gilbert and Nancy Lynch, ACID versus BASE for database transaction , The Endeavour blog, July 6, 2009.
Julian Browne, Brewer's CAP Theorem: The kool aid Amazon and Ebay have been drinking , January 11, 2009.: Great discussion on CAP and scalability.
There is no free lunch with distributed data white paper, Hewlett Packard, 2005.: HP's explanation of the CAP Theorem and its impact on database systems: easy reading.
Three-phase commit protocol. Wikipedia article.