pk.org: CS 417/Lecture Notes

Distributed Databases

Study Guide

Paul Krzyzanowski – 2026-04-02

NoSQL and Data Model Categories

NoSQL systems are databases designed to scale horizontally across many machines. They often relax some of the assumptions of traditional relational databases, such as fixed schemas or full transactional support across all data, in order to improve scalability, availability, or flexibility. The term does not describe one single design. It covers several different data models.

The main categories are the following:

  1. Key-value stores store data as a mapping from a key to a value. The system treats the value as opaque. This makes the model simple and easy to distribute, but limits queries beyond direct key lookup. Amazon Dynamo and distributed hash tables (DHTs) are the main examples to remember here.

  2. Column-family stores, also called wide-column stores, organize data into rows and columns, but allow different rows to have different columns. They are designed for large-scale storage and high write throughput. Bigtable, HBase, and Cassandra are the main examples.

  3. Document stores store structured documents, usually in a JSON-like form. They keep more structure than key-value stores and often support richer queries over document fields.

  4. Graph databases store data as nodes and edges and are optimized for traversing relationships among entities.

In this material, the main focus is on column-family stores and on Spanner, which brings strong transactional guarantees back into a distributed setting.

Bigtable

Bigtable is a distributed column-family store that organizes data as a sparse, sorted map indexed by row key, column, and time. It was designed for large-scale structured data and for applications that need high throughput across many machines.

Data Model

Bigtable’s data model is easiest to understand by breaking it into its three indexing dimensions.

  1. Row keys are arbitrary byte strings stored in sorted lexicographic order. That sorted order is important because it makes range scans efficient. Rows are also the unit of atomicity, so a read or write to a single row is atomic.

  2. Column families group related columns and must be declared when the table is created. Within a family, column qualifiers can be created dynamically by any client. Column names therefore have the form family:qualifier.

  3. Timestamps allow multiple versions of a value to be stored in a single cell. Version retention is configurable, so the system can keep several recent versions or discard older ones automatically.

Bigtable is also sparse. Empty cells are never stored. A row can have no columns at all in a given family, and only the cells that actually contain data occupy storage.
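The three-dimensional map can be sketched in a few lines. This is an illustrative in-memory model, not Bigtable's API; the class and method names (`SparseTable`, `put`, `get_latest`, `scan`) are invented for the example.

```python
class SparseTable:
    """Toy sketch of Bigtable's sparse, sorted (row, column, timestamp) map."""

    def __init__(self):
        # row_key -> {"family:qualifier" -> {timestamp -> value}}
        # Only cells that hold data exist; empty cells occupy no storage.
        self.rows = {}

    def put(self, row_key, column, timestamp, value):
        self.rows.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

    def get_latest(self, row_key, column):
        # A cell can hold multiple timestamped versions; return the newest.
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_key, end_key):
        # Row keys are kept in sorted order, which makes range scans efficient.
        for key in sorted(self.rows):
            if start_key <= key < end_key:
                yield key, self.rows[key]


t = SparseTable()
t.put("com.example/a", "anchor:home", 10, "v1")
t.put("com.example/a", "anchor:home", 20, "v2")
print(t.get_latest("com.example/a", "anchor:home"))  # -> v2
```

The dictionary-of-dictionaries shape makes the sparseness visible: a row with no columns in a family simply has no entries for it.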

Partitioning and Structure

Bigtable scales by partitioning the table by row-key range.

A tablet is a contiguous range of rows. Each tablet is served by one tablet server, and tables are split into multiple tablets as they grow. Since rows are stored in sorted row-key order, each tablet is a slice of that sorted key space.

A master server tracks which tablet server owns which tablet and handles reassignment when failures occur. The important point is that data does not flow through the master. Clients use metadata to locate the right tablet server and then talk to that server directly.

Storage and Writes

Bigtable stores recent writes in memory and older data on disk.

The write path works as follows:

  1. A write is recorded in a log for recovery.

  2. The write is placed in a memtable, which is an in-memory sorted buffer.

  3. When the memtable fills, it is flushed to disk as an SSTable.

An SSTable is an immutable, sorted file of key-value pairs stored in GFS. Since SSTables are immutable, the system periodically compacts them to merge files and discard obsolete data. Reads may therefore need to combine information from the memtable and multiple SSTables.
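The log / memtable / SSTable write path can be sketched as follows. This is a minimal model of the idea, not Bigtable's implementation; the class name `MiniLSM` and the tiny flush threshold are invented for illustration, and "SSTables" here are just immutable in-memory dictionaries rather than files in GFS.

```python
class MiniLSM:
    """Toy sketch of the Bigtable write path: log, memtable, SSTable flush."""

    def __init__(self, memtable_limit=2):
        self.log = []          # write-ahead log, used for recovery
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted "files", newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))   # 1. record the write for recovery
        self.memtable[key] = value      # 2. buffer it in the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                # 3. spill a full memtable to disk

    def flush(self):
        # SSTables are immutable and sorted; once written, never modified.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Reads merge the memtable with SSTables, newest data first.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None


db = MiniLSM()
db.write("a", 1)
db.write("b", 2)   # memtable is full here, so it flushes to an SSTable
db.write("a", 3)
print(db.read("a"))  # -> 3: the memtable's newer value wins
```

The `read` method shows why reads may need to consult both the memtable and several SSTables, and why compaction (merging SSTables and discarding obsolete versions) is worthwhile.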

Key Concepts in Bigtable

The main ideas to remember are:

  1. Data is a sparse, sorted map indexed by row key, column (family:qualifier), and timestamp, and a single row is the unit of atomic access.

  2. Tables are partitioned by row-key range into tablets; clients use metadata to find the right tablet server, so data never flows through the master.

  3. Writes go to a log and a memtable; memtables are flushed to immutable SSTables, which are periodically compacted.

  4. Bigtable scales very well, but it does not provide general transactional support across rows or operations that span multiple tables, such as joins.

Cassandra

Cassandra is a distributed column-family store designed for high availability, decentralized control, and horizontal scalability. It combines ideas from Bigtable, especially the wide-column model and SSTable-style storage, with ideas from Dynamo, especially decentralized partitioning and replication.

Architecture and the Hash Ring

Cassandra uses a peer-to-peer architecture. There is no master node, and all nodes are equal. Any node can accept a client request and coordinate the work needed to satisfy it.

Cassandra distributes data using a distributed hash table organized as a hash ring. Each node owns one or more tokens, which correspond to positions on that ring. A node is responsible for the range of hash values between its token and the previous token on the ring.

When Cassandra stores a row, it hashes the row’s partition key. The row is then routed to the node whose token is the first one that follows that hash value on the ring.

This design has an important scaling property. When a node is added or removed, only the data in the affected adjacent ring arc has to move. The rest of the data stays where it is. That is one of the main reasons DHT-style partitioning is attractive.
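Ring routing can be sketched with a sorted list of tokens and a binary search. This is an illustrative model, not Cassandra's implementation; the hash function, token values, and node names are all made up for the example.

```python
import bisect
import hashlib


def key_hash(key, ring_size=2**32):
    """Hash a partition key to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % ring_size


class Ring:
    """Toy sketch of DHT-style routing on a hash ring."""

    def __init__(self, tokens_to_nodes):
        # Sorted (token, node) positions on the ring.
        self.tokens = sorted(tokens_to_nodes.items())

    def node_for(self, key):
        h = key_hash(key)
        positions = [t for t, _ in self.tokens]
        # First token at or after the hash; wrap past the largest token.
        i = bisect.bisect_left(positions, h)
        return self.tokens[i % len(self.tokens)][1]


ring = Ring({100: "A", 2**31: "B", 3 * 2**30: "C"})
print(ring.node_for("restaurant-42"))
```

The scaling property falls out of this structure: adding a node inserts one token, and only keys whose hashes fall in that node's new arc need to move.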

Data Model

Cassandra separates distribution across the cluster from ordering within one partition.

The key ideas are these:

  1. The partition key determines where the data lives. Rows with the same partition key are placed in the same partition and stored on the same set of replica nodes.

  2. The clustering columns determine how rows are ordered within that partition.

  3. Columns store the actual data values within each row.

That separation is central to understanding Cassandra.

An example is a database of McDonald’s restaurants. If the partition key is restaurant_id, each restaurant’s row hashes independently and is likely placed on a different set of nodes across the cluster. That is good for load balancing, but not very useful if queries often want all restaurants in one country.

If the partition key is country_code, then all restaurants in the same country live in the same partition and therefore on the same set of machines. If the clustering columns are state and county, then the rows within that country partition are sorted first by state and then by county. This makes it efficient to iterate through all restaurants in one country, one state, or one county.

So the division of roles is:

  1. The partition key controls where data is placed across the cluster.

  2. The clustering columns control how rows are ordered within a partition.

  3. The remaining columns hold the actual data values.

Replication and Consistency

Cassandra replicates data across multiple nodes for fault tolerance. Each partition is stored on several nodes according to a chosen replication factor. This allows the system to continue operating even when some nodes fail.

Cassandra provides tunable consistency. The application can choose how many replicas must respond to a read or write. Waiting for more replicas gives stronger consistency, while waiting for fewer improves availability and often reduces latency.

The basic trade-off is:

  1. Requiring responses from more replicas gives stronger consistency, at the cost of higher latency and lower availability.

  2. Requiring responses from fewer replicas improves availability and latency, at the risk of reading stale data.
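The quorum condition behind this trade-off can be stated in one line. With replication factor N, read quorum R, and write quorum W, a read is guaranteed to overlap the most recent write when R + W > N. The function name and values below are illustrative.

```python
def overlapping_quorums(n, r, w):
    """True when every read quorum intersects every write quorum."""
    return r + w > n


# With 3 replicas, quorum reads and writes (2 of 3) always overlap:
print(overlapping_quorums(3, 2, 2))  # -> True
# Reading and writing just 1 replica is fast but may return stale data:
print(overlapping_quorums(3, 1, 1))  # -> False
```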

Storage Model

Cassandra’s storage engine resembles Bigtable’s.

Writes are first recorded durably and placed in memory. Later, they are flushed to disk as immutable files. Reads may have to combine recent in-memory data with older on-disk files. Compaction merges those files over time.

The important connection is that Cassandra resembles Dynamo in how it distributes and replicates data across the cluster, but resembles Bigtable in how each node stores and retrieves data locally.

What to Remember About Cassandra

The main ideas to remember are:

  1. Cassandra is peer-to-peer: there is no master, and any node can coordinate a client request.

  2. Data is partitioned with a DHT-style hash ring, so adding or removing a node moves only the data in the adjacent ring arc.

  3. The partition key determines placement across the cluster; clustering columns determine order within a partition.

  4. Consistency is tunable per operation by choosing how many replicas must respond.

  5. Distribution and replication follow Dynamo, while per-node storage follows Bigtable’s memtable/SSTable design.

Spanner (NewSQL)

Spanner is a globally distributed relational database that combines horizontal scalability with strong transactional guarantees. It keeps the key-range partitioning ideas of systems like Bigtable, but adds distributed transactions, multiversion data, and globally meaningful timestamps.

A compact mental model is this: Spanner combines Bigtable-style partitioning, distributed transaction mechanisms, and carefully managed time.

What Spanner Combines

Spanner is best understood as a combination of several mechanisms, each solving a different problem.

The main components are:

  1. Key-range partitioning into splits

  2. Paxos replication within each split

  3. Two-phase commit (2PC) for transactions that span multiple splits

  4. Strict two-phase locking (2PL) for read-write transactions

  5. Wound-wait for deadlock prevention

  6. Multiversion concurrency control (MVCC) for versioned data

  7. TrueTime for bounded clock uncertainty and globally meaningful timestamps

Partitioning gives scale. Paxos gives fault tolerance and a consistent order of updates within each split. Two-phase commit coordinates work across splits. Strict 2PL and wound-wait control concurrency. MVCC supports snapshot reads. TrueTime and commit wait make timestamp order line up with real time.

Physical and Logical Structure

Spanner has both a physical and a logical structure.

At the physical level, data is stored on spanservers organized into zones. A zone is a large administrative and failure domain, roughly at the datacenter level. Replicating data across zones allows the system to survive larger failures than a single machine crash.

At the logical level, a Spanner deployment is called a universe. A universe contains databases, databases contain tables, and tables are divided into splits.

A split is a contiguous range of keys. This is directly analogous to a Bigtable tablet. Rows are stored in sorted key order, and a split is one slice of that sorted key space.

Replication and Paxos Groups

Each split is replicated across multiple servers. The set of servers maintaining replicas of one split is called a Paxos group. These replicas use Paxos to agree on the order of updates.

The core structural idea is worth stating clearly:

As long as a majority of replicas in a Paxos group are available, that split can continue to make progress.

Transactions Across Splits

If a transaction touches data in only one split, that split’s Paxos group can handle the transaction locally.

If a transaction touches data in more than one split, Spanner uses two-phase commit. This is the same protocol discussed earlier: a coordinator drives a prepare phase and then a commit phase across all participating groups.

This leads to an important distinction:

  1. A transaction confined to one split is handled entirely by that split’s Paxos group.

  2. A transaction spanning multiple splits requires two-phase commit across the Paxos groups of all participating splits.

Read-Write Transactions, Locks, and Wound-Wait

For read-write transactions, Spanner uses strict two-phase locking.

A read acquires a shared lock, and a write acquires an exclusive lock. Locks are held until commit. Holding locks until commit prevents other transactions from seeing partial results and gives serializable behavior.

Shared and exclusive locks affect concurrency in different ways. Multiple readers can hold shared locks at the same time, but an exclusive lock prevents other reads and writes on that data item.

Locking creates the possibility of deadlock, so Spanner uses wound-wait.

In wound-wait:

  1. If an older transaction needs a lock held by a younger transaction, the younger transaction is aborted and retried.

  2. If a younger transaction wants a lock held by an older transaction, it waits.

That policy prevents deadlock while giving older transactions priority.
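The wound-wait rule can be written as a small decision function. Transactions are ordered by start timestamp, with a smaller timestamp meaning older and therefore higher priority. The function and return values are invented for illustration; this is not Spanner's API.

```python
def wound_wait(requester_ts, holder_ts):
    """Decide what happens when `requester` wants a lock held by `holder`."""
    if requester_ts < holder_ts:
        # An older transaction wants the lock: "wound" (abort) the
        # younger holder, which will be retried later.
        return "abort holder"
    else:
        # A younger transaction wants the lock: it simply waits.
        return "requester waits"


print(wound_wait(5, 9))  # older requester -> "abort holder"
print(wound_wait(9, 5))  # younger requester -> "requester waits"
```

Deadlock is impossible because waiting only ever happens in one direction: younger transactions wait for older ones, so no cycle of waiters can form.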

Snapshot Reads and MVCC

Spanner stores multiple committed versions of data, each tagged with a timestamp. That is the role of MVCC.

A snapshot read returns the state of the database at a specific time \(t\). More precisely, for each item, the system returns the most recent committed version whose timestamp is less than or equal to \(t\).

That gives a consistent cut through time.

Snapshot reads are especially useful for large searches, scans, and reports. Because they read committed older versions, they usually do not need to lock the current data. Writers can continue updating the newest versions while the read-only operation sees a stable snapshot.

This is one of the main practical advantages of MVCC. Large read-only workloads can proceed without blocking current writes.
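A snapshot read over versioned data can be sketched as follows. The version store and the bank-style item names are invented for illustration; the rule being demonstrated is the one above: return the newest committed version with timestamp at most \(t\).

```python
# item -> list of (commit_timestamp, value), i.e. MVCC version history.
versions = {
    "balance:checking": [(10, 500), (30, 400)],  # transfer committed at t=30
    "balance:savings":  [(10, 100), (30, 200)],
}


def snapshot_read(item, t):
    """Return the newest committed version of `item` with timestamp <= t."""
    best = None
    for ts, value in versions[item]:
        if ts <= t and (best is None or ts > best[0]):
            best = (ts, value)
    return None if best is None else best[1]


# A snapshot at t=20 sees the consistent pre-transfer state:
print(snapshot_read("balance:checking", 20))  # -> 500
print(snapshot_read("balance:savings", 20))   # -> 100
# A snapshot at t=30 sees the consistent post-transfer state:
print(snapshot_read("balance:checking", 30))  # -> 400
```

Note that no locks are involved: the reader only touches committed older versions, so writers can keep appending newer versions concurrently.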

Snapshot reads also motivate one of Spanner’s most important requirements. Since a snapshot may span many splits and datacenters, commit timestamps must be meaningful across the whole system. Otherwise, “the state of the database at time \(t\)” would not define one coherent global state.

External Consistency

Spanner provides external consistency, also called strict serializability.

The guarantee is this: if transaction \(T_1\) commits before transaction \(T_2\) begins in real time, then \(T_1\)’s commit timestamp must be less than \(T_2\)’s commit timestamp.

This is the transaction-level version of linearizability. Linearizability is usually stated for single operations on shared objects: if one operation finishes before another begins, the first must appear before the second. External consistency applies that same real-time ordering idea to transactions.

This guarantee is stronger than ordinary serializability. Serializability only requires that the result be equivalent to some serial order. That order does not have to match wall-clock time. External consistency requires both a serial order and agreement with real time.

An example is a bank transfer. Suppose one transaction transfers money from checking to savings and returns “committed” to the client. If another transaction starts afterward and reads the balances, external consistency requires that the second transaction see the transfer. A merely serializable system could still choose some serial order, but that order might not match real time. External consistency rules that out.

TrueTime

Spanner cannot enforce external consistency using naive local clock readings, because clocks in distributed systems are never perfectly synchronized.

Spanner’s solution is TrueTime, which represents time as an interval rather than a single exact value:

\[TT.now() = [earliest, latest]\]

The real current time is guaranteed to lie somewhere inside that interval.

The two bounds provided by the TrueTime API are:

  1. earliest: the true current time is guaranteed to be no earlier than this value.

  2. latest: the true current time is guaranteed to be no later than this value.

The width of the interval reflects clock uncertainty. Google keeps that uncertainty small by synchronizing clocks using GPS receivers at each data center, with atomic clocks as backup references.

The key idea is that Spanner does not need perfect synchronization. It only needs the uncertainty to be bounded.
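The interval idea can be sketched with the system clock plus an assumed uncertainty bound. The `tt_now` function and the 4 ms bound are illustrative assumptions, not the TrueTime implementation; the real bound comes from GPS and atomic-clock synchronization.

```python
import time

UNCERTAINTY = 0.004  # assumed clock-uncertainty bound, in seconds


def tt_now():
    """Return (earliest, latest): an interval containing the true time."""
    t = time.time()
    return (t - UNCERTAINTY, t + UNCERTAINTY)


earliest, latest = tt_now()
print(latest - earliest)  # interval width = twice the uncertainty bound
```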

Commit Wait

TrueTime by itself is not enough. Spanner also uses commit wait.

When a read-write transaction is ready to commit, Spanner follows this pattern:

  1. It chooses a commit timestamp \(t = TT.now().latest\).

  2. It waits until \(TT.now().earliest > t\).

  3. It then makes the transaction visible.

That waiting step ensures that the chosen timestamp is definitely in the past relative to real time before the commit becomes visible.

The logic is straightforward. If transaction \(T_1\) becomes visible only after time \(t\) is definitely in the past, then any later transaction \(T_2\) that begins afterward must receive a later timestamp. That is how Spanner enforces external consistency.
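The commit-wait pattern can be sketched on top of a TrueTime-style interval clock. Both `tt_now` and the 4 ms uncertainty bound are illustrative assumptions, not Spanner's API; the loop implements the three steps above.

```python
import time

UNCERTAINTY = 0.004  # assumed clock-uncertainty bound, in seconds


def tt_now():
    """Return (earliest, latest): an interval containing the true time."""
    t = time.time()
    return (t - UNCERTAINTY, t + UNCERTAINTY)


def commit_with_wait():
    commit_ts = tt_now()[1]           # 1. choose t = TT.now().latest
    while tt_now()[0] <= commit_ts:   # 2. wait until TT.now().earliest > t
        time.sleep(0.001)
    return commit_ts                  # 3. now safe to make the commit visible


ts = commit_with_wait()
assert tt_now()[0] > ts  # the chosen timestamp is definitely in the past
```

With this clock the wait lasts roughly twice the uncertainty bound, which is why keeping the uncertainty to a few milliseconds matters.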

Commit Wait and Replication

Commit wait sounds expensive, but Spanner overlaps it with replication.

The sequence is:

  1. Acquire locks and perform the transaction’s work.

  2. Choose a commit timestamp.

  3. In parallel:

     a. Run Paxos so the commit is replicated durably.

     b. Perform commit wait so real time advances past the chosen timestamp.

  4. Once both are complete, commit and release locks.

This overlap hides much of the waiting cost. In Google’s environment, thanks to accurate time sources at each datacenter, the uncertainty window is typically only a few milliseconds, and the average commit wait time is on the order of 4 ms.

What to Remember About Spanner

The main ideas to remember are:

  1. Data is partitioned by key range into splits, and each split is replicated by a Paxos group.

  2. Multi-split transactions use two-phase commit; read-write transactions use strict two-phase locking with wound-wait for deadlock prevention.

  3. MVCC keeps multiple committed versions, enabling lock-free snapshot reads at a chosen timestamp.

  4. TrueTime bounds clock uncertainty, and commit wait ensures commit timestamps agree with real time, which is what gives external consistency.

Key Points

Across these systems, the main design questions are about partitioning, replication, concurrency control, and time.

The high-level distinction among the systems is:

  1. Bigtable: a master-coordinated, key-range-partitioned column-family store with row-level atomicity but no general transactions.

  2. Cassandra: a decentralized, hash-partitioned column-family store with tunable consistency and no master.

  3. Spanner: a key-range-partitioned relational database with full distributed transactions and external consistency.

You should also be comfortable with these ideas:

  1. Key-range partitioning keeps nearby keys together and supports efficient range scans.

  2. Hash partitioning spreads load evenly but does not preserve global key order.

  3. Paxos provides replication and ordered updates within one partition.

  4. 2PC coordinates transactions across multiple partitions.

  5. Strict 2PL and wound-wait manage read-write concurrency.

  6. MVCC enables snapshot reads by keeping multiple committed versions.

  7. TrueTime and commit wait allow Spanner to align transaction order with real time.

You Do Not Need to Study

Some details in the notes are useful for understanding the systems, but are too detailed to matter for the exam.

The following details are not required for upcoming exams:

