Translated by Claude from the Chinese original.

Some Reflections

Most applications are built by layering data models on top of one another.

The tension between surplus and scarcity of computing resources will always exist: “resource-saving technologies only lead to increased resource usage” (the Jevons paradox).

Many problems discussed in the book follow a pattern: under constraint P (the real-world problem), find the lowest-cost C (consistency level) that still achieves the best A (availability/outcome). One example in the book is multi-core CPUs. While multi-core CPUs can be viewed as distributed systems, they are not constrained by inter-machine network latency, so network partitions are effectively a non-issue. Even so, to maximize throughput, multi-core designs may still trade some consistency for higher availability, accepting occasional redundant or incorrect computation to squeeze out more performance.

Are Normal Forms Useful?

Normal forms exist to standardize how we use data. But as hardware improves and application scenarios expand, breaking those rules can sometimes bring bigger gains. Otherwise NoSQL would not have emerged: store a large JSON document, skip strict relational modeling, and iteration speed can increase dramatically.

So should we still follow normal forms? Actually, normal forms are just a yardstick that tells us what to do in which scenario and what happens if we don’t, not rules we must rigidly follow.

What Are Concurrency Problems?

A single machine with a single thread can always avoid various concurrency problems, but this approach is too slow. So we develop multi-threading, multi-processing, multi-machine, and distributed systems. Along with these come concurrency problems—problems that fit the following condition:

Under identical conditions, if the results of two tasks executed serially differ from those executed in parallel, a race condition has occurred, causing a concurrency problem.

Single-Machine Concurrency Problems

Let’s first discuss concurrency problems that arise even with a single machine running multiple threads/processes.

Discussion model:

  1. There are exactly 2 transactions involved in the problem scenario.
  2. For real-world problems involving more than 2 transactions, we reduce them to the 2-transaction case: for example, by grouping multiple non-conflicting transactions into one, or by splitting the problem into separate causes, each involving 2 concurrent transactions.

Scenario 1—Transaction 1 is reading, Transaction 2 is writing, leading to:

  • Dirty reads
  • Non-repeatable reads / Read skew

Scenario 2—Both transactions are writing, leading to:

  • Lost updates
  • Dirty writes
  • Write skew

Dirty Reads

Prerequisite: One transaction reads partial modification results of another uncommitted transaction.

Without read-committed isolation, dirty reads occur in these scenarios:

  • One transaction updates multiple objects while another transaction sees only some of the updated objects, not all.
  • One transaction aborts, and during rollback another transaction sees partially un-rolled-back objects.

Examples:

Example 1: Alice has two bank accounts—Account 1 has 100 yuan and Account 2 has 0. Account 1 transfers 100 yuan to Account 2. The transaction deducts 100 from Account 1 and adds 100 to Account 2. If Alice reads her total balance after the deduction from Account 1 but before the addition to Account 2, the result is 0. (Note: non-repeatable reads can also cause the same situation—we’ll discuss this in the non-repeatable reads section.)

Example 2: Building on Example 1, Account 2 is a Class II account with a single-transfer limit of 50 yuan. If the transfer exceeds the limit, the transaction rolls back. After deducting 100 from Account 1, another query reads Account 1’s balance as 0 yuan. Then the transfer fails and rolls back. Due to transaction atomicity, Account 1’s balance never actually became 0—yet at a certain moment, another transaction saw it as 0.
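
To make the anomaly concrete, here is a minimal, hypothetical sketch in Python (not a real database): a shared dict stands in for storage with no isolation at all, and a paused generator plays the role of the uncommitted transfer transaction.

```python
# Toy store with no isolation: every write is immediately visible to readers.
db = {"account_1": 100, "account_2": 0}

def transfer():
    # Transaction 2: move 100 yuan from Account 1 to Account 2 in two steps.
    db["account_1"] -= 100
    yield                      # paused mid-transaction: nothing is committed yet
    db["account_2"] += 100

tx2 = transfer()
next(tx2)                      # the deduction has happened, the credit has not

# Transaction 1: Alice reads her total balance and sees the uncommitted state.
print(db["account_1"] + db["account_2"])   # 0: a dirty read

# Let Transaction 2 finish; the total is 100 again, as atomicity promises.
for _ in tx2:
    pass
print(db["account_1"] + db["account_2"])   # 100
```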

Dirty Writes

Prerequisite: One transaction overwrites data that another, still-uncommitted transaction has already modified.

Without read-committed isolation, dirty writes occur in this scenario:

  • Transaction 1 updates multiple objects while Transaction 2 modifies some of those objects (through updates, creates, deletes, or their rollbacks). Then Transaction 1 updates the remaining unmodified objects.

Example:

On a trading website, a purchase updates both the item’s recipient and the payer. Alice and Bob simultaneously buy the same item. Alice’s transaction sets the recipient to Alice. Then Bob’s transaction changes the recipient to Bob and declares Bob as the payer. Then Alice’s transaction updates the payer to Alice. Result: Bob receives the item but Alice pays.

Non-Repeatable Reads / Read Skew

Prerequisite: Transaction 2 starts after Transaction 1 begins and commits before Transaction 1 ends, and Transaction 1 reads Transaction 2’s committed modifications partway through its own execution.

Most business scenarios don’t want non-repeatable reads. These scenarios especially cannot tolerate them:

  • Backups
  • Long-running analytical queries and integrity checks

Non-repeatable reads are also called “read skew” because: ideally, a transaction should read all data in an instant. When non-repeatable reads occur, the transaction’s reads are spread across the timeline rather than happening at a single point—hence the read is “skewed.”

Example:

Continuing from dirty read Example 1: Alice starts a balance-reading Transaction 1 before initiating transfer Transaction 2. Transaction 1 first reads Account 2’s 0 yuan (this conforms to read-committed). Then Transaction 2 completes—Account 1 becomes 0, Account 2 becomes 100. Finally Transaction 1 reads Account 1’s 0 yuan (also conforming to read-committed). Total: 0 yuan.

Lost Updates

Two transactions simultaneously execute read-modify-write sequences, where one overwrites the other’s write without incorporating the other’s latest value, ultimately causing some modified data to be lost.

Examples:

  • Incrementing counters
  • Modifying part of a complex object (e.g., multiple users editing a large JSON)
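
The counter case is easy to reproduce. Below is a minimal sketch using two Python threads doing an unprotected read-modify-write on a shared counter; the exact final value varies from run to run, but it is usually less than the expected total.

```python
import threading

counter = 0

def increment_many(n):
    global counter
    for _ in range(n):
        value = counter        # read
        value += 1             # modify in the application
        counter = value        # write back, possibly overwriting a concurrent update

threads = [threading.Thread(target=increment_many, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                 # usually less than 200000: some updates were lost
```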

Write Skew

A generalization of lost updates, differing in its greater dependency on application-layer logic. It follows this pattern:

  1. Read: Input some matching conditions and query.
  2. Decide: Based on query results, application-layer code decides the next action.
  3. Write: If the application decides to proceed, it initiates a database write.

Generally, if 2 transactions updating different objects cause errors (usually semantic errors at the application layer), that’s write skew. If they update the same object, it’s likely dirty writes or lost updates.

Why “write skew”: Referencing the meaning of read skew—ideally, a transaction’s write operations should happen in an instant. But when write skew occurs, a transaction typically reads stale data, the data gets modified, and the transaction unknowingly writes invalid data. The transaction’s reads and writes are spread across the timeline—hence the write is “skewed.”

Example: A user has an expense ledger with a balance constraint. Two transactions each insert expense items that individually don’t exceed the balance. But since neither notices the other, the combined expenses push the balance negative (application-layer logic violation).
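
A minimal sketch of that ledger, with invented names and no real database: both “transactions” run the read-decide-write pattern against the same stale view, each check passes, and together they break the balance constraint.

```python
balance = 100                  # invariant: total spending must not exceed this
expenses = []

def spend(amount):
    snapshot = balance - sum(expenses)          # 1. read
    if snapshot >= amount:                      # 2. decide (application-layer check)
        return lambda: expenses.append(amount)  # 3. write, deferred until "commit"
    return None

commit_a = spend(80)           # sees 100 available, decides to proceed
commit_b = spend(80)           # also sees 100 available: neither write is visible yet
for commit in (commit_a, commit_b):
    if commit:
        commit()

print(balance - sum(expenses))   # -60: the constraint is violated by the combination
```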

Here we discuss one solution with limited feasibility:

Materializing Conflicts

Write skew often occurs because query results contain no objects, so there’s nothing to lock. Solution: pre-create lockable objects.

For the balance example above, we create a total-balance table with a new field for pending changes, so that concurrent transactions now contend on the same row and only one change can affect the balance at a time. This transforms the problem into a lost-update or dirty-write problem.

The main issue with materializing conflicts is excessive database storage. For instance, should we materialize all room-and-time combinations for the next 6 months in a meeting room booking system?

This method is generally not used unless absolutely necessary.

Phantom Reads

A write in one transaction changing the query results of another transaction is called a phantom read.

Phantom reads are a highly general concept. I believe most concurrency problems discussed so far could be classified as phantom reads.

Preventing Lost Updates

We must prevent lost updates to avoid many other phantom-read-like anomalies. As we’ll see, the key to distributed transactions is achieving consensus, and consensus is largely about preventing lost updates.

Lost update scenarios are simpler than other concurrency problems, and the solutions are easier to understand. So we discuss lost update prevention first.

Atomic Write Operations

If the DB supports atomic write operations, use them whenever possible.

The DB can implement atomic writes through exclusive locks or by executing all atomic operations on a single thread.
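
A minimal sketch using Python’s built-in sqlite3 module (the counters table is invented for illustration): the read-modify-write happens inside a single UPDATE statement, so the database applies it atomically instead of trusting the application to do so.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('page_views', 0)")

# Atomic version: the increment is evaluated inside the database, in one statement,
# instead of reading the value out, adding 1 in the application, and writing it back.
conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'page_views'")
conn.commit()

print(conn.execute("SELECT value FROM counters WHERE name = 'page_views'").fetchone())
```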

Explicit Locking

The application explicitly locks relevant objects. DDIA uses the FOR UPDATE keyword to represent application-layer requests to lock SELECT results.
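
A hedged sketch of what this looks like from the application, assuming a PostgreSQL database reached through psycopg2; the accounts table, its columns, and the connection string are invented for illustration.

```python
import psycopg2

conn = psycopg2.connect("dbname=shop")   # hypothetical connection string
with conn:                               # commits on success, rolls back on error
    with conn.cursor() as cur:
        # FOR UPDATE asks the database to lock the selected row until commit,
        # so a concurrent read-modify-write on the same row has to wait.
        cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (1,))
        (balance,) = cur.fetchone()
        cur.execute("UPDATE accounts SET balance = %s WHERE id = %s", (balance - 100, 1))
# the lock is released when the transaction commits at the end of the with-block
```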

This approach might seem to solve write skew too, but write skew also includes cases like “checking whether rows matching a search condition exist (expected result: empty).” In that case, there is nothing concrete to lock. We discuss write-skew solutions in other sections.


Above we discussed two lock-based solutions. Now let’s discuss lock-free solutions.

Automatically Detecting Lost Updates

Allow updates to execute concurrently. If the transaction manager detects a lost-update risk, it aborts the current transaction and retries using a safe read-modify-write path.

The database can use snapshot-level isolation for detection. A rough intuition is that an object can only have one uncommitted version at a given moment, though finer-grained detection methods likely exist.

Atomic Compare-and-Set

Only allow updates when the data hasn’t changed since the last read. If it has changed, fall back to another read-modify-write approach or retry.

Implementation: CAS (compare-and-swap)—modern CPUs support this instruction. Or explicitly add conditions like WHERE content = 'old content' during execution. The danger is that if WHERE executes against a snapshot, the content value may not be the latest.
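
A minimal sketch of the conditional-update form, again with sqlite3 and an invented pages table: the UPDATE only matches if the content is still what we read earlier, and a row count of 0 tells the application to re-read and retry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO pages VALUES (1, 'old content')")

def compare_and_set(conn, page_id, expected, new):
    cur = conn.execute(
        "UPDATE pages SET content = ? WHERE id = ? AND content = ?",
        (new, page_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows updated: someone else changed the content first

if not compare_and_set(conn, 1, "old content", "new content"):
    pass                       # fall back: re-read the latest content and retry
```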

Conflict Resolution and Replication

Details are discussed in the distributed concurrency section. The general idea: the application layer has logic, or data structures have logic, to handle conflicting writes. Or, design operations to be order-independent, so update conflicts naturally don’t occur.

Isolation Levels

Read Committed

The most basic transaction isolation level: a transaction’s intermediate, uncommitted state is neither visible to other transactions nor overwritten by them.

Solves:

  • Dirty reads
  • Dirty writes

Why “read committed”? My interpretation is: reads should observe only committed data.

Implementation methods:

Row-Level Read Locks

Read locks certainly achieve read committed, but with drawbacks:

  • Poor performance
  • Potential deadlocks

Old/New Value Snapshots

For each object pending update, maintain two versions: the old value and the new value the lock-holding transaction will set. Before the transaction commits, all other reads return the old value. Only after the write transaction commits does the system switch to the new value.
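
A minimal in-memory sketch of the two-version idea (class and method names are invented): readers always get the committed value, and commit is a single switch from old to new.

```python
class TwoVersionObject:
    def __init__(self, value):
        self.committed = value   # what every other transaction reads
        self.pending = None      # set only by the transaction holding the write lock

    def read(self):
        return self.committed    # dirty reads are impossible: pending stays hidden

    def write(self, value):      # called by the lock-holding writer
        self.pending = value

    def commit(self):            # switch to the new value in one step
        self.committed, self.pending = self.pending, None

    def abort(self):             # rolling back just discards the pending value
        self.pending = None
```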

Multi-Version Concurrency Control / MVCC

Discussed in the next section.

Snapshot Isolation

Each transaction reads from a consistent snapshot of the database: the data it sees is fixed as of one point in time and does not change for the duration of the transaction.

Prevents:

  • Non-repeatable reads / Read skew

Multi-Version Concurrency Control / MVCC

The database maintains multiple versions of each object, adding created_by and deleted_by fields to every row version to record which transactions created and deleted it. A periodic garbage-collection task cleans up versions no longer visible to any transaction.

A transaction cannot see:

  • Changes made by transactions still running when this transaction started
  • Changes made by any aborted transactions
  • Changes made by transactions that start after this transaction

All other writes (those not excluded by the three rules above) are visible to this transaction.

Conversely, a transaction can see:

  • Objects created or updated by already-committed transactions before this transaction started
  • Objects whose deletion was made by a transaction that had not yet committed when this transaction started (the deletion is invisible, so the object still appears)
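
These rules fit in a few lines. The sketch below assumes monotonically increasing transaction ids and row versions tagged with created_by / deleted_by; the names are invented for illustration.

```python
from collections import namedtuple

Snapshot = namedtuple("Snapshot", ["txid", "in_progress", "aborted"])
Version = namedtuple("Version", ["created_by", "deleted_by", "value"])

def committed_for(snapshot, txid):
    # A change counts only if its transaction had already committed
    # when the reading transaction started.
    return (txid is not None
            and txid < snapshot.txid                 # started earlier
            and txid not in snapshot.in_progress     # not still running at snapshot time
            and txid not in snapshot.aborted)        # not rolled back

def visible(snapshot, version):
    created = committed_for(snapshot, version.created_by)
    deleted = committed_for(snapshot, version.deleted_by)
    return created and not deleted                   # created, and not visibly deleted

# Transaction 7 started while transaction 5 was still running,
# so it cannot see the version that transaction 5 created.
snap = Snapshot(txid=7, in_progress={5}, aborted=set())
print(visible(snap, Version(created_by=3, deleted_by=None, value="a")))   # True
print(visible(snap, Version(created_by=5, deleted_by=None, value="b")))   # False
```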

MVCC indexing has roughly two implementation approaches:

  • Index points to all versions of an object: PostgreSQL uses this method, placing all versions on a single memory page for performance optimization.
  • Persistent data structures: Typically a persistent B-tree. Different transactions create their own database entry nodes when writing. When reading, use the node corresponding to the latest committed transaction as the entry point.

Serializable Isolation

Even if transactions may execute in parallel, the final result is the same as if they executed one at a time (serially).

Generally considered the strongest isolation level.

But achieving serializability in practice is very difficult. Here are three implementation approaches.

Actual Serial Execution

With continuous hardware advances, database researchers have recognized that single-threaded transaction execution is feasible and efficient.

These conditions help achieve serializable isolation using memory and a single CPU:

  • Transactions must be short and efficient—otherwise a slow transaction affects all others (since it’s single-threaded).
  • Limited to scenarios where the active dataset fits entirely in memory.
  • Write throughput must be low enough for a single CPU core to handle; otherwise partitioning is needed, ideally without cross-partition transactions.
  • Cross-partition transactions can be supported but must be a very small proportion.

In practice, we can encapsulate transactions in stored procedures. Typical OLTP operations are short, and as long as user I/O is excluded from the transaction path, single-threaded execution can be very efficient. The business server packages logic as data and sends it to the database server, which executes it directly in memory. Historically, stored procedures were criticized because each database had its own language. A modern approach is to use general-purpose languages where possible (for example, Redis uses Lua).
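
As a concrete, hedged illustration of the stored-procedure style: Redis runs Lua scripts one at a time on its command thread, so the whole read-decide-write below executes serially. The sketch assumes a local Redis server, the redis-py client, and invented key names.

```python
import redis

r = redis.Redis()

# Transfer `amount` from one account to another only if the balance suffices;
# the whole check-and-transfer executes as one serial unit inside Redis.
TRANSFER = """
local balance = tonumber(redis.call('GET', KEYS[1]) or '0')
local amount = tonumber(ARGV[1])
if balance < amount then return 0 end
redis.call('DECRBY', KEYS[1], amount)
redis.call('INCRBY', KEYS[2], amount)
return 1
"""

transfer = r.register_script(TRANSFER)
r.set("account:1", 100)
r.set("account:2", 0)
ok = transfer(keys=["account:1", "account:2"], args=[100])   # 1 on success, 0 otherwise
```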

Two-Phase Locking

For nearly 30 years, this was the only widely used algorithm for achieving serializability.

  1. Use locking to achieve serializable isolation: transactions take shared locks before reading objects and exclusive locks before modifying them, excluding all other transactions from reading or writing the modified objects (the two phases: locks are acquired throughout execution and released only when the transaction ends). The database system automatically detects deadlocks between transactions and forcibly terminates one of them to break the deadlock.
  2. For the “nothing to lock” case discussed in the write-skew section (execute only when query results are empty), apply predicate locks. Predicate locks apply to all rows matching certain search conditions (similar to locking a WHERE predicate, so overlapping conditions from concurrent transactions are disallowed). However, predicate locks are hard to implement and inefficient.
  3. In practice, index-range locks often replace predicate locks by widening the protected scope. By locking one or more index ranges of queried objects, those ranges become exclusively locked. In the worst case, a single transaction may lock the whole table.

Serializable Snapshot Isolation

An optimistic-control algorithm proposed on top of MVCC in 2008, with limited real-world adoption. The DDIA author believes it will become a future standard in databases for these reasons:

  • Pessimistic control blocks or aborts too many transactions, and the cost of waiting and retrying is too high.
  • Hardware still has much room for improvement. In the future we’ll encounter fewer concurrency problems, so we should adopt optimistic control strategies.

Building on MVCC snapshot isolation, we apply optimistic control with these principles:

  • Detect stale reads: check whether the transaction has acted on a snapshot read that a concurrent write, since committed, has made stale.
  • Detect writes that affect prior reads: check whether this transaction’s writes change data that other in-progress transactions have already read.

If such a conflict is detected at commit time, the transaction is rolled back and retried.

Serializable Snapshot Isolation (SSI) relies on SSI locks, similar to index-range locks, but SSI locks only record—they don’t block. After a transaction commits, SSI locks notify other related transactions and are discarded.

The implementation tradeoff is lock granularity: too coarse may misjudge conflicts and expand a transaction’s impact; too fine may cause excessive metadata overhead.

Distributed System Challenges

  • Network latency
  • Clock synchronization
  • Process pausing/crashing

Why go distributed?

  • Scalability
  • Fault tolerance
  • Low latency
  • If you can avoid opening Pandora’s box, keeping everything on one machine is worth trying

Properties:

  • Safety: Properties that must never be violated—once violated, the system design has failed.
  • Liveness: Availability the system guarantees under certain preconditions. If preconditions fail, restoring them returns the system to normal.

The following discussions assume we’ve already solved some problems through transactions.

Byzantine Faults

Not worth considering.

  • Too expensive.
  • Environmental issues like radiation can cause Byzantine faults, but the probability on Earth’s surface is extremely low. Of course, machines operating in space must consider this.
  • Software bugs could in principle cause Byzantine behavior, but all machines run the same code, so a bug affects every node at once. Byzantine fault tolerance only helps if each machine’s software is independently developed so that only a few nodes misbehave, which is clearly unrealistic.
  • Network intrusions could cause machine errors, but once an intruder can compromise one machine, there’s no reason to believe they can’t compromise all machines. Authentication, encryption, and firewalls are better approaches to network intrusion.

Consensus

Consensus is one of the most important abstractions in distributed systems: all nodes agree on a proposal. Based on this, many distributed-system challenges can be addressed. The solutions below can achieve consensus. In that sense, consensus is a bit like Turing completeness: once you implement one strong consensus mechanism, you can derive many others from it.

Linearizability

Basic idea: Make a system appear as if there’s only one data copy, and all operations are atomic. As we’ll see, because linearizability has a simple definition, we can use whether other solutions can achieve linearizability to verify whether they’re consensus algorithms.

Notes:

  • Linearizability’s most intuitive requirement: once any read returns the latest value, all subsequent reads must return the latest value, even if the write that produced it has not yet completed.
  • Atomic operations. This property can be expressed as CAS (compare-and-set), similar to preventing lost updates in single-machine transactions.
  • Building on the above, we don’t consider the effects of external network latency when observing the system—i.e., we don’t need to consider phantom reads. Once linearizability is achieved, we can naturally build distributed transactions to handle phantom reads. But when thinking about the problem model, be clear: transactions address data inconsistency between tables (a business-layer concern); distributed system challenges address data synchronization inconsistency between replicas (an infrastructure-layer concern).
  • Note the distinction between serializability and linearizability: the former concerns concurrent transaction results matching serial execution; the latter concerns data replicas appearing as a single copy—the strongest linearizability prevents the outside world from perceiving parallelism.
  • Actual serial execution and two-phase locking first achieve linearizability by restricting parallelism and concurrency, thereby achieving serializability. Serializable snapshot isolation, however, uses different snapshots for optimistic concurrency control; snapshot states and their changes can proceed in parallel, so linearizability is not guaranteed. Multiple SSI-based transactions may read different values, but conflicting ones cannot both commit. The first two approaches try to prevent this situation up front.

Quorum can achieve linearizability, provided there’s no uncertain network latency.

As long as there’s an unreliable network, there’s a risk of violating linearizability. This is the CAP theorem: network partitions will happen, and when they do, you can keep only one of availability and consistency.

Ordering Guarantees

Causal Consistency

Linearizability’s constraints are too strict. Can we achieve consensus with weaker requirements? We think of causal consistency: as long as causally related events occur in order and other events happen in parallel, the difficulty should be less than linearizability (which is equivalent to no parallelism).

If a system obeys the order prescribed by causal relationships, we call it causally consistent. If neither of two operations happened before the other, they’re concurrent; otherwise they’re causal and can be ordered.

How do we causally order operations? One approach is Lamport timestamps: assign each operation/client a sequence number composed of an incrementing counter plus a node ID. Sequence numbers are ordered by counter first, then node ID on ties. Every request carries a timestamp. Whenever a node or client observes a larger sequence number, it advances its own counter so the next request uses an even larger value. This allows ordering of operations.
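
A minimal sketch of such a clock (class and method names are invented): timestamps are (counter, node_id) pairs compared lexicographically, and observing any timestamp pushes the local counter forward.

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def next_timestamp(self):
        # Called when this node stamps one of its own operations.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        # Called whenever a request or response carries someone else's timestamp:
        # jump forward so our next timestamp is strictly larger.
        counter, _ = timestamp
        self.counter = max(self.counter, counter)

a, b = LamportClock(node_id=1), LamportClock(node_id=2)
t1 = a.next_timestamp()    # (1, 1)
b.observe(t1)              # b sees t1, e.g. attached to an incoming request
t2 = b.next_timestamp()    # (2, 2): causally after t1, and indeed t2 > t1
print(t1 < t2)             # True
```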

As long as there’s no uncertain network latency, ordering is guaranteed. With network latency, implementing CAS with Lamport timestamps requires each node to first confirm with every other node that no concurrent CAS request with a smaller sequence number exists, so any delayed or unreachable node directly stalls the system. (Lamport timestamps’ operating environment requires further assumptions that we won’t discuss.)

Total Order Broadcast

Lamport timestamps and other causal-consistency approaches can fail here because they behave like synchronous coordination models (decision waits for information from all nodes). If we switch to an asynchronous model, CAS can be implemented.

Regardless of implementation environment, CAS can be achieved with these two conditions:

  • Reliable delivery: No message loss. If a message is sent to one node, it must be sent to all nodes.
  • Strict ordering: Messages are always delivered to each node in the same order.

With this, CAS only needs to be implemented on one node, and all other nodes must follow that CAS operation because operations are reliably delivered and strictly ordered.
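
A minimal sketch of the idea, with the totally ordered log simply assumed to exist (which is of course the hard part): every replica applies the same CAS attempts in the same order, so every replica agrees on which attempt wins.

```python
log = []                      # stands in for the total order broadcast channel

def broadcast(message):
    log.append(message)       # assumed: reliable delivery, same order everywhere

class Replica:
    def __init__(self):
        self.value = None
        self.applied = 0

    def apply_log(self):
        # All replicas process the same prefix of the log, so all of them make
        # the same decision about which CAS attempts succeed.
        while self.applied < len(log):
            expected, new = log[self.applied]
            if self.value == expected:
                self.value = new
            self.applied += 1

r1, r2 = Replica(), Replica()
broadcast((None, "a"))        # CAS(None -> "a"): succeeds on every replica
broadcast((None, "b"))        # CAS(None -> "b"): fails everywhere, "a" already won
r1.apply_log(); r2.apply_log()
print(r1.value, r2.value)     # a a
```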

So does implementing total order broadcast achieve consensus? We can verify by checking whether total order broadcast can achieve linearizability. It turns out that asynchronous models can’t handle the reading problem: system-internal network latency (not external—we said we don’t need to consider external network latency) may cause the outside world to read a new value followed by an old value.

Therefore:

  • We say total order broadcast satisfies write linearizability (which is essentially serializability) but not read linearizability.
  • But is that really the case? Actually, if we treat reads (or reads requiring strict accuracy) as operations and add them to the operation queue, read linearizability is satisfied. ZooKeeper and etcd have similar implementations.

The key takeaway from the two paragraphs above is:

Fully asynchronous consensus is impossible to achieve. Synchronous consensus is possible but has performance issues.

So how do we implement total order broadcast? (Answer: using linearizability…)

Implementing Total Order Broadcast

Finally, we discuss achieving consensus through total order broadcast.

We first discuss a feasible approximate solution, then extend it to total order broadcast.

Two-Phase Commit (2PC)

Introduce a new component: the coordinator. The coordinator and nodes implement two-phase commit as follows:

  1. Send a prepare request to all nodes, asking if they can commit.
  2. If all nodes return “yes,” the coordinator issues a commit request and nodes commit. If any node returns “no,” the coordinator tells all nodes to abort.
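
A minimal in-process sketch of the two steps above (class names invented): real participants must durably record their “prepared” promise and the coordinator must durably record its decision, which is glossed over here with a plain list.

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "working"

    def prepare(self):
        # Phase 1: vote, and promise to commit later if asked.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(coordinator_log, participants):
    votes = [p.prepare() for p in participants]        # phase 1: prepare requests
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append(decision)                   # the decision point
    for p in participants:                             # phase 2: carry the decision out
        p.commit() if decision == "commit" else p.abort()
    return decision

log = []
print(two_phase_commit(log, [Participant(), Participant()]))                     # commit
print(two_phase_commit(log, [Participant(), Participant(can_commit=False)]))     # abort
```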

This can clearly produce agreement in the happy path. The problem, as always, is network latency and failures. So we enhance the coordinator’s fault tolerance:

Fault-Tolerant Consensus

The book introduces several fault-tolerant consensus algorithms: VSR, Paxos, Raft, and Zab. I’m most familiar with Raft (MIT 6.824), so I’ll use Raft to explain how to extend 2PC for better fault tolerance.

The coordinator can also be the leader node. Whenever a problem occurs, if we can elect a new leader from available nodes to serve as coordinator, consensus remains valid. Electing a new node is itself a consensus problem—this consensus only needs to be acknowledged by more than half the nodes.

Whenever a node detects leader failure, it can stand for election as the new leader. As long as it gains votes from more than half the nodes, it becomes the leader. This leader then acts as coordinator, ordering operations for follower nodes: any operation acknowledged by more than half the nodes is committed and applied.

If more than half the nodes go down, the system can no longer make progress. The benefit of requiring a majority: any two majorities overlap in at least one node, so there is always a node that participated in every previous decision, ensuring split brain doesn’t occur and consensus is maintained.

Membership and Coordination Services

Now we know how to achieve consensus. We also see that not all operations need consensus; only certain critical operations do. Therefore, we can use packaged systems like ZooKeeper (Zab) and etcd (Raft) to establish consensus over small, memory-resident datasets and build highly reliable coordination services on top of them.