Serializable Distributed Transactions over Sharded Scenario

31 May 2020 - Guanzhou Hu

Sharding is a common distributed system design for scaling out and achieving better performance. Distributed transactions (concurrency control + atomic commit) are used to coordinate sharded nodes. Implementing serializable distributed transactions is essential for such a system to behave correctly.

Sharding & Distributed Transactions

Consider a classic key-value store scenario. Sharding is the practice of partitioning data (key-value pairs) into multiple parts and putting different parts on different nodes. Unlike replication, which is for fault tolerance, sharding is for performance & scalability - more nodes bring larger capacity and better load sharing.
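As a concrete (and illustrative) sketch of sharding, a simple scheme hashes each key to one of a fixed number of shard nodes; the names `shard_of` and `NUM_SHARDS` are mine, not from any particular system:

```python
# Minimal sketch of hash partitioning: each key is mapped to one of
# NUM_SHARDS nodes. Names here are illustrative, not from a real system.
import hashlib

NUM_SHARDS = 4

def shard_of(key: str) -> int:
    """Return the index of the shard node responsible for `key`."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS

# Keys touched by one transaction may well land on different shards:
print(shard_of("x"), shard_of("y"))
```

Real systems use more careful schemes (e.g., consistent hashing or range partitioning) so that adding or removing nodes does not reshuffle every key, but the idea is the same.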

A transaction is a sequence of read/write operations (records) carried out by a client to finish some task, e.g.:

# Transfer $1 from Y's bank account to X's.
read x;
read y;
set x = x + 1;
set y = y - 1;

When data is sharded (which is often the case in real-world scenarios), x and y are probably on different nodes of the database system. A transaction becomes a distributed transaction when the keys involved are distributed across different sharded nodes.

[Figure: a distributed transaction spanning keys on different sharded nodes]

The “ACID” Principle & Serializability

An ideal database should satisfy the “ACID” principle [1].

  1. Atomic: either the whole transaction is done or the whole transaction aborts, NO partial commit;
  2. Consistent: does not violate application-specific rules (e.g., bank balance cannot go below zero);
  3. Isolated \(\equiv\) Serializable: when there are multiple concurrent transactions, they do not interfere with each other;
    • Formally, each transaction should not read partial results of other transactions
    • Equivalently, this implies serializability as defined below
  4. Durable: once written, value should be persistently stored.

A history of multiple concurrent (distributed) transactions is serializable if there exists some serial order of those transactions that, when executed one-by-one as if on a single machine, yields the same result. This is essentially the same requirement as the Isolation requirement in “ACID”.
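The definition of serializability above can be turned into a brute-force toy check: an observed final state is serializable if it equals the final state of some serial order of the transactions. This is illustrative only (real systems never enumerate permutations); all names are mine:

```python
# Toy serializability check: an interleaved execution is serializable
# iff its final state matches the result of SOME serial order.
from itertools import permutations

def run(txns, state):
    """Execute transactions one-by-one (serially) on a dict state."""
    for txn in txns:
        for op in txn:
            op(state)
    return state

# Two transactions on a tiny key-value store.
t1 = [lambda s: s.__setitem__("x", s["x"] + 1),   # x += 1
      lambda s: s.__setitem__("y", s["y"] - 1)]   # y -= 1
t2 = [lambda s: s.__setitem__("x", s["x"] * 2)]   # x *= 2

def is_serializable(final_state, txns, init):
    return any(run(list(order), dict(init)) == final_state
               for order in permutations(txns))

init = {"x": 1, "y": 1}
# Suppose some interleaved run ended with this state:
observed = {"x": 4, "y": 0}   # matches the serial order t1 then t2
print(is_serializable(observed, [t1, t2], init))  # True
```

Note that t1-then-t2 gives x = (1+1)*2 = 4 while t2-then-t1 gives x = 1*2+1 = 3; a final state matching either order counts as serializable.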

To provide ACID distributed transactions, a distributed system must answer the following two questions at the same time:

  1. Concurrency control: how to prevent data race and coordinate among concurrent transactions?
  2. Atomic commit: how to ensure all-or-none commit (Atomic requirement in “ACID”)?

1. Pessimistic/Optimistic Concurrency Control

Concurrency control typically takes one of two forms - pessimistic or optimistic. Pessimistic concurrency control (e.g., two-phase locking) acquires locks before accessing records, so conflicting transactions wait on each other. Optimistic concurrency control (OCC) executes without locks and validates at commit time, aborting and retrying on conflict; it wins when conflicts are rare.
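A minimal sketch of the pessimistic flavor, using two-phase locking (2PL): acquire a lock on every key before touching it, and release all locks only once the transaction finishes. Class and function names (`LockTable`, `transfer`) are illustrative:

```python
# Minimal sketch of pessimistic concurrency control via two-phase
# locking (2PL). Names here are illustrative.
import threading

class LockTable:
    def __init__(self):
        self.locks = {}               # key -> threading.Lock
        self.guard = threading.Lock()

    def lock_for(self, key):
        with self.guard:
            return self.locks.setdefault(key, threading.Lock())

store = {"x": 1, "y": 1}
table = LockTable()

def transfer(src, dst, amount):
    # Growing phase: acquire locks in a fixed (sorted) order so that two
    # concurrent transfers cannot deadlock on each other.
    held = [table.lock_for(k) for k in sorted([src, dst])]
    for lk in held:
        lk.acquire()
    try:
        store[src] -= amount
        store[dst] += amount
    finally:
        # Shrinking phase: release everything only after the work is done.
        for lk in held:
            lk.release()

transfer("y", "x", 1)
print(store)   # {'x': 2, 'y': 0}
```

Strict 2PL as used in real databases holds the locks until commit/abort is durably decided; the structure is the same.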

Examples of optimistic concurrency control include Microsoft's FaRM system [2], which explores OCC over one-sided RDMA reads and writes.
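The optimistic flavor can be sketched with per-record version numbers: execute lock-free while recording the versions read, then validate at commit time. This is an illustration of the general OCC pattern, not FaRM's actual protocol:

```python
# Minimal sketch of optimistic concurrency control (OCC): execute
# without locks, then validate read versions at commit time and abort
# if any record changed underneath us. Illustrative, not FaRM's design.
class Record:
    def __init__(self, value):
        self.value, self.version = value, 0

db = {"x": Record(1), "y": Record(1)}

def occ_commit(read_set, write_set):
    # Validation phase: every record we read must still be at the
    # version we observed; otherwise a concurrent commit conflicted.
    for key, seen_version in read_set.items():
        if db[key].version != seen_version:
            return False                     # conflict -> abort (retry later)
    # Write phase: install new values, bumping versions.
    for key, value in write_set.items():
        db[key].value = value
        db[key].version += 1
    return True

# Execute phase (lock-free): read x and y, compute the transfer.
read_set = {"x": db["x"].version, "y": db["y"].version}
write_set = {"x": db["x"].value + 1, "y": db["y"].value - 1}
print(occ_commit(read_set, write_set))   # True on the first attempt
```

A real implementation must also make validation + write installation atomic (e.g., by briefly locking the write set), which this sketch glosses over.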

2. Atomic Commit with Two-Phase Commit (2PC)

Atomic commit is typically implemented using two-phase commit (2PC) [3]. This is so common that I will simply point to the Wikipedia page on two-phase commit instead of rephrasing its definition here.
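For concreteness, here is a minimal sketch of the coordinator side of 2PC: in phase 1 every participant votes, and the coordinator commits in phase 2 only if all votes are yes. All class and method names are illustrative:

```python
# Minimal sketch of two-phase commit (2PC), coordinator side.
# Real participants would persist PREPARE/COMMIT records to a log
# before replying; that durability step is omitted here.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "init"

    def prepare(self):            # phase 1: vote yes/no
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):             # phase 2a: coordinator decided commit
        self.state = "committed"

    def abort(self):              # phase 2b: coordinator decided abort
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1 (voting): every participant must vote yes.
    if all(p.prepare() for p in participants):
        for p in participants:    # Phase 2 (completion): commit everywhere
            p.commit()
        return True
    for p in participants:        # any "no" vote aborts the transaction
        p.abort()
    return False

shards = [Participant("shard-x"), Participant("shard-y")]
print(two_phase_commit(shards))          # True
print([p.state for p in shards])         # ['committed', 'committed']
```

The well-known weakness shows up between the phases: a prepared participant must block, holding its locks, until it hears the coordinator's decision - which is exactly why the replication idea discussed later helps.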

Several points are worth noting about 2PC:

To improve the performance and throughput of such a system, we often want to avoid 2PC when it is not necessary. One way to do this is to distinguish between read-only (RO) transactions and read-write (RW) transactions. RO transactions can bypass 2PC by using snapshot isolation: keep a multi-version DB with multiple timestamped versions of the value for each key, and assign a timestamp to each transaction. A transaction then reads, for each key, the newest version whose timestamp is no greater than its own.

[Figure: snapshot isolation over a multi-version store]
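The snapshot read described above can be sketched with a multi-version store that keeps sorted (timestamp, value) pairs per key; class and method names are mine:

```python
# Minimal sketch of a multi-version store supporting snapshot reads:
# a read at timestamp T returns the newest version with ts <= T.
import bisect

class MultiVersionStore:
    def __init__(self):
        self.versions = {}        # key -> sorted list of (ts, value)

    def write(self, key, ts, value):
        bisect.insort(self.versions.setdefault(key, []), (ts, value))

    def read(self, key, ts):
        """Newest version of `key` with timestamp <= ts, or None."""
        vs = self.versions.get(key, [])
        timestamps = [t for t, _ in vs]
        i = bisect.bisect_right(timestamps, ts)
        return vs[i - 1][1] if i > 0 else None

mvs = MultiVersionStore()
mvs.write("x", 10, "a")
mvs.write("x", 20, "b")
print(mvs.read("x", 15))   # 'a'  -- a RO txn at ts=15 ignores the ts=20 write
print(mvs.read("x", 25))   # 'b'
```

Because an RO transaction reads a consistent snapshot frozen at its timestamp, it needs no locks and no 2PC - which is precisely why synchronizing timestamps across nodes becomes the hard part.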

How to keep timestamps synchronized across all nodes now becomes a significant problem. Examples of snapshot isolation implementations include Google's Spanner DB [4], which introduces a novel TrueTime API.

Another way to enhance 2PC is to replicate each participant over multiple replicas that together form one logical participant. Each logical participant is then very unlikely to fail, so 2PC is very unlikely to block. Google's Spanner does this with Paxos replication.

References

Please comment below anything you wanna say! 😉