Below are the files composing the checkable model (organized in VSCode extension style):
This spec is different from traditional, general descriptions of Paxos/MultiPaxos in several aspects; for example, commit notifications are modeled as explicit CommitNotice messages, processed by a HandleCommitNotice action. Dropping the HandleCommitNotice action (which is the least significant) and having one fewer request reduces model check time down to < 10 secs.
This spec has been accepted into the official TLA+ Examples repo! 1
Here are some links I found particularly useful when developing this spec by myself: 2 3 4 5
Below are the files composing an extended version of the spec along with model inputs:
The extended spec includes the following extra features/variants of MultiPaxos that are essential and useful in practice:
This post assumes a recent Linux kernel and is tested on v6.5.7.
I have a distributed system codebase consisting of the following processes, each of which should conceptually run on a separate physical machine: a set of server processes, a set of client processes, and a single manager process.
Without loss of generality, let's ignore the client nodes and only talk about the server nodes plus the manager. I would like to test a wide range of network performance parameters on the connections between servers. Doing that across real physical machines would be prohibitively resource-demanding, as it requires a bunch of powerful machines all connected to each other through physical links at least as strong as the "best parameters" to be tested against. Processes in my codebase are not computation- or memory-demanding, though, so it is a good idea to run them all on a single host and emulate a network environment among them.
There aren't really any canonical tutorials online demonstrating how to do this. After a bit of searching & digging, I found the Linux kernel's built-in network emulation features to be quite promising.
tc netem on Loopback
The first tool to introduce here is the netem queueing discipline (qdisc) 1 provided by the tc traffic control command. Each network interface in Linux can have an associated software queueing discipline that sits atop the device driver queue. netem is one such qdisc; it provides functionality for emulating various network properties, including delay, jitter distribution, rate, loss, corruption, duplication, and reordering.
For example, we can put a netem qdisc on the loopback interface that injects a 100ms delay with Pareto-distributed jitter of 10ms, and limits the rate to 1Gbps:
~$ sudo tc qdisc add dev lo root netem delay 100ms 10ms distribution pareto rate 1gibit
~$ ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=192 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=226 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=193 ms
...
(Notice that the reported RTTs hover around twice the injected 100ms delay: both the echo request and the echo reply go through the loopback egress queue.)
It feels natural to just put a netem qdisc on the loopback interface, let all processes bind to different ports on localhost, and let them talk to each other through loopback. This seemed to work pretty well until I hit a significant caveat: only one root netem qdisc is allowed on each interface, and it applies to all loopback traffic uniformly. This means we cannot emulate different parameters for different links among different pairs of processes. What we need are separate network interfaces for the processes.
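For reference, the loopback qdisc can be inspected and, once no longer needed, removed with the standard tc subcommands:
~$ sudo tc qdisc show dev lo
~$ sudo tc qdisc del dev lo root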
dummy Interfaces
The dummy kernel module supports creating dummy network interfaces that route packets back to the host itself. However, a netem qdisc on a dummy interface doesn't really work 2 3.
~$ sudo ip link add dummy0 type dummy
~$ sudo ip addr add 192.168.77.0/24 dev dummy0
~$ sudo ip link set dummy0 up
~$ sudo tc qdisc add dev dummy0 root netem delay 10ms
~$ sudo tc qdisc show
...
qdisc netem 80f8: dev dummy0 root refcnt 2 limit 1000 delay 10ms
Though the qdisc is indeed listed, pinging the associated address still shows lightning-fast RTTs:
~$ ping 192.168.77.0
PING 192.168.77.0 (192.168.77.0) 56(84) bytes of data.
64 bytes from 192.168.77.0: icmp_seq=1 ttl=64 time=0.019 ms
64 bytes from 192.168.77.0: icmp_seq=2 ttl=64 time=0.024 ms
64 bytes from 192.168.77.0: icmp_seq=3 ttl=64 time=0.024 ms
...
This is because the dummy interface is just a "wrapper"; it is still backed by the loopback interface behind the scenes. We can verify this fact using:
~$ sudo ip route get 192.168.77.0
local 192.168.77.0 dev lo src 192.168.77.0 uid 0
cache <local>
Notice that the route is reported to be backed by dev lo. In fact, if a netem qdisc is still applied to loopback as in the previous section, you will see that delay when pinging dummy0. Dummy interfaces are not what we are looking for here.
veths
What we are actually looking for are network namespaces and veth-type interfaces 4 5. In Linux, a network namespace is an isolated network stack that processes can attach to. By default, all devices live in the default, nameless namespace. One can create named namespaces using ip netns commands and assign running processes to them (or launch new processes directly inside them through ip netns exec). Namespaces are a perfect tool for our task here.
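As a quick sanity check, a freshly created namespace contains nothing but its own (down) loopback device; a throwaway demo (output abbreviated):
~$ sudo ip netns add nstest
~$ sudo ip netns exec nstest ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN ...
~$ sudo ip netns delete nstest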
To get connectivity between namespaces without bringing in physical devices, one can create veth (virtual Ethernet) interfaces. By design, veth interfaces come in pairs: you must create the two ends of a veth pair at the same time. This post 6 gives a nice demonstration of creating two namespaces and connecting them with a veth pair.
However, this is not enough for us, because we want more than two isolated devices, each able to talk to every other. To achieve this, we make use of a bridge device. We create one veth pair per namespace, put one end into the namespace while keeping the other end outside, and then bridge those outside ends together. All namespaces can then find a route to each other through the bridge; the bridge can also talk to each of the namespaces. Since we have a manager process, it is quite natural to let the manager use the bridge device and let each server process reside in its own namespace, using the veth device placed into it.
Let's walk through this step-by-step for a three-server setting.
Create namespaces and assign them proper IDs:
~$ sudo ip netns add ns0
~$ sudo ip netns set ns0 0
~$ sudo ip netns add ns1
~$ sudo ip netns set ns1 1
~$ sudo ip netns add ns2
~$ sudo ip netns set ns2 2
Create a bridge device brgm and assign address 10.0.1.0 to it:
~$ sudo ip link add brgm type bridge
~$ sudo ip addr add "10.0.1.0/16" dev brgm
~$ sudo ip link set brgm up
Create a veth pair (vethsX-vethsXm) for each server, put the vethsX end into its corresponding namespace, and assign address 10.0.0.X to it:
~$ sudo ip link add veths0 type veth peer name veths0m
~$ sudo ip link set veths0 netns ns0
~$ sudo ip netns exec ns0 ip addr add "10.0.0.0/16" dev veths0
~$ sudo ip netns exec ns0 ip link set veths0 up
# repeat for servers 1 and 2
Put the vethsXm end under the bridge device:
~$ sudo ip link set veths0m up
~$ sudo ip link set veths0m master brgm
# repeat for servers 1 and 2
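To double-check that all vethsXm ends are indeed attached under the bridge, list the bridge's slave devices (a standard iproute2 query; output omitted):
~$ sudo ip link show master brgm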
This gives us a topology where each server's veths device sits in its own namespace, while all the peer vethsXm ends are bridged together under brgm in the default namespace, where the manager lives:

                manager
                   |
            brgm (10.0.1.0)
           /       |        \
     veths0m    veths1m    veths2m
        |          |          |
     veths0     veths1     veths2
     (ns0,      (ns1,      (ns2,
    10.0.0.0)  10.0.0.1)  10.0.0.2)
Let's do a bit of delay injection with netem to verify that this topology truly gives us what we want. Say we add 10ms delay to veths1 and 20ms delay to veths2:
~$ sudo ip netns exec ns1 tc qdisc add dev veths1 root netem delay 10ms
~$ sudo ip netns exec ns2 tc qdisc add dev veths2 root netem delay 20ms
Pinging the manager from server 1:
~$ sudo ip netns exec ns1 ping 10.0.1.0
PING 10.0.1.0 (10.0.1.0) 56(84) bytes of data.
64 bytes from 10.0.1.0: icmp_seq=1 ttl=64 time=10.1 ms
64 bytes from 10.0.1.0: icmp_seq=2 ttl=64 time=10.1 ms
64 bytes from 10.0.1.0: icmp_seq=3 ttl=64 time=10.1 ms
...
Pinging server 2 from server 1:
~$ sudo ip netns exec ns1 ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=30.1 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=30.1 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=30.1 ms
...
All good! Notice that the 30.1ms RTT between servers 1 and 2 is the sum of the 10ms egress delay on veths1 (for the request) and the 20ms egress delay on veths2 (for the reply), while pinging the manager from server 1 only pays veths1's 10ms delay in the request direction.
There are also obvious ways to extend this topology for even more flexibility; for example, each server could own multiple devices in its namespace, each with different connectivity and performance parameters. Let me describe one example extension below.
ifbs
It is important to note that most of netem's emulation functionality applies only to the egress side of an interface, meaning that all injected delay happens on the sender side of every packet. In some cases, you might want custom performance emulation on the ingress side of the servers' interfaces as well. To do so, we can utilize the special IFB (Intermediate Functional Block) devices 7.
First, load the kernel module that implements IFB devices:
~$ sudo modprobe ifb
By default, two devices ifb0 and ifb1 are added automatically. You can add more by doing:
~$ sudo ip link add ifb2 type ifb
We then move one IFB device into each server's namespace and redirect all traffic arriving at the veth interface to go through the ifb device's egress queue first. This is done by adding a special ingress qdisc to the veth (which can coexist with the egress netem qdisc we added earlier) and installing a filter rule that simply "moves" all ingress packets to the ifb interface's egress queue. The ifb device automatically hands each packet back after it has gone through ifb's egress queue.
~$ sudo ip link set ifb0 netns ns0
~$ sudo ip netns exec ns0 tc qdisc add dev veths0 ingress
~$ sudo ip netns exec ns0 tc filter add dev veths0 parent ffff: protocol all u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb0
~$ sudo ip netns exec ns0 ip link set ifb0 up
We can then put a netem qdisc on the ifb interface, which effectively emulates the specified performance on the ingress side of the veth. For example:
~$ sudo ip netns exec ns0 tc qdisc add dev ifb0 root netem delay 5ms rate 1gibit
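To confirm that both the ingress redirection on veths0 and the egress netem on ifb0 are in place, list the qdiscs inside the namespace:
~$ sudo ip netns exec ns0 tc qdisc show dev veths0
~$ sudo ip netns exec ns0 tc qdisc show dev ifb0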
To achieve our goal of emulating network links among distributed processes on a single host, beyond the limitation of a single loopback interface, we take the following steps:
1. Create one network namespace per process using ip netns.
2. Create veth interface pairs, probably one pair for each process, using ip link ... type veth.
3. Put one end of each veth pair into the corresponding namespace, then keep the other ends and create a bridge that stitches them together.
4. Use tc qdisc ... netem to apply the netem queueing discipline with desired parameters on the veth devices for each process.
Below is a script for setting up the above-described topology for a given number of server processes:
#!/bin/bash
NUM_SERVERS=$1
if [ -z "$NUM_SERVERS" ]; then
    echo "Usage: $0 <num_servers>"
    exit 1
fi

echo
echo "Deleting existing namespaces & veths..."
sudo ip -all netns delete
sudo ip link delete brgm 2>/dev/null
for v in $(ip a | grep veth | cut -d' ' -f 2 | rev | cut -c2- | rev | cut -d '@' -f 1)
do
    sudo ip link delete "$v"
done

echo
echo "Adding namespaces for servers..."
for (( s = 0; s < $NUM_SERVERS; s++ ))
do
    sudo ip netns add ns$s
    sudo ip netns set ns$s $s
done

echo
echo "Loading ifb module & creating ifb devices..."
sudo rmmod ifb 2>/dev/null
sudo modprobe ifb    # by default, adds ifb0 & ifb1 automatically
for (( s = 2; s < $NUM_SERVERS; s++ ))
do
    sudo ip link add ifb$s type ifb
done

echo
echo "Creating bridge device for manager..."
sudo ip link add brgm type bridge
sudo ip addr add "10.0.1.0/16" dev brgm
sudo ip link set brgm up

echo
echo "Creating & assigning veths for servers..."
for (( s = 0; s < $NUM_SERVERS; s++ ))
do
    sudo ip link add veths$s type veth peer name veths${s}m
    sudo ip link set veths${s}m up
    sudo ip link set veths${s}m master brgm
    sudo ip link set veths$s netns ns$s
    sudo ip netns exec ns$s ip addr add "10.0.0.$s/16" dev veths$s
    sudo ip netns exec ns$s ip link set veths$s up
done

echo
echo "Redirecting veth ingress to ifb..."
for (( s = 0; s < $NUM_SERVERS; s++ ))
do
    sudo ip link set ifb$s netns ns$s
    sudo ip netns exec ns$s tc qdisc add dev veths$s ingress
    sudo ip netns exec ns$s tc filter add dev veths$s parent ffff: protocol all u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb$s
    sudo ip netns exec ns$s ip link set ifb$s up
done

echo
echo "Listing devices in default namespace:"
sudo ip link show

echo
echo "Listing all named namespaces:"
sudo ip netns list

for (( s = 0; s < $NUM_SERVERS; s++ ))
do
    echo
    echo "Listing devices in namespace ns$s:"
    sudo ip netns exec ns$s ip link show
done
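Assuming the script is saved as setup_netns.sh (a name I am choosing here) and you have a server binary of your own, the three-server topology from the walkthrough then comes up with:
~$ chmod +x setup_netns.sh
~$ ./setup_netns.sh 3
~$ sudo ip netns exec ns0 /path/to/server-binary ...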
https://superuser.com/questions/764986/howto-setup-a-veth-virtual-network ↩
https://medium.com/@mishu667/creating-two-network-namespaces-and-connect-them-with-virtual-ethernet-veth-devices-565f83af4c37#:~:text=Network%20namespaces%20provide%20a%20powerful,control%20network%20connectivity%20between%20them. ↩
Consider a database with only one small table (i.e. relation), shared by multiple clients. The clients could issue concurrent transactions that read some tuples (i.e. records) of the table or update them with new values. To protect the database from data races, it is pretty natural to apply a traditional reader-writer lock on the table.
In database terminology, we denote acquiring a reader lock on the table as locking it in shared (S) mode, and acquiring a writer lock on the table as locking it in exclusive (X) mode. Multiple clients can hold S locks on the same table at the same time for reads. At most one client can hold an X lock on the table (with no S locks held by anyone else).
We call two locking attempts compatible if their lock modes are allowed to be held at the same time on the same object. S mode is compatible with itself. S and X are not compatible with each other. X is of course not compatible with itself either.
Back to our problem scenario: since the database has only one table with a small number of tuples, a reasonable solution is to put a lock on that table. Read requests must acquire the lock in S mode and can proceed only after the acquisition succeeds. Write requests must acquire it in X mode. This is basically how a reader-writer lock works in classic systems. So far, so good.
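In POSIX terms, this table lock is exactly a pthread_rwlock_t; a minimal sketch of the two request paths (the handler names are mine):

#include <pthread.h>

pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

void handle_read(void) {
    pthread_rwlock_rdlock(&table_lock);   /* lock table in S mode */
    /* ... read some tuples ... */
    pthread_rwlock_unlock(&table_lock);
}

void handle_write(void) {
    pthread_rwlock_wrlock(&table_lock);   /* lock table in X mode */
    /* ... update some tuples ... */
    pthread_rwlock_unlock(&table_lock);
}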
Problem: what if the database is no longer of toy scale, but is composed of hundreds of tables, each having millions of records? Real-world databases easily reach this scale. Traditional locking with uniform granularity poses a dilemma in choosing the granularity of locks:
Huge DB lock: we could choose to lock on coarse granularity, e.g., the entire database. However, it unacceptably hurts concurrency; a client transaction updating only one tuple in one table would block all other clients that try to read disjoint sets of tuples in the database.
One lock per tuple: alternatively, we could choose to put locks only at the finest granularity, in this case, tuples. A client transaction only locks the tuples it touches, in the desired mode. This way, concurrency is preserved. The problem is that it forces large transactions to acquire too many locks; e.g., a transaction that scans all tuples of a table has to acquire potentially millions of locks. This can easily lead to prohibitive performance overhead.
Neither choice is ideal for overall performance. The solution to this problem is to introduce hierarchical locking on different levels of database resources.
A database is naturally structured as a tree (or, more generally, a DAG) of resources. Consider, for example, a database with 3 tables, each holding 100 tuples: the database is the root, the tables are its children, and the tuples are the leaves. Tuples could be further decomposed into fields (i.e., attributes or columns); we consider tuples the finest granularity in this post.
The core idea of hierarchical locking 1 2 is to allow putting locks on any node of this tree (which may sit at different granularity levels), instead of only at one uniform granularity.
In the first step towards hierarchical locking, we introduce implicit locking: locking an internal node in S mode implicitly locks all its descendant nodes in S mode; X mode behaves similarly.
- A transaction touching only a few tuples acquires S or X locks on the individual tuples.
- A transaction scanning a whole table acquires a single S or X lock on the table – this implicitly grants S or X permissions on the children nodes of the table, in this case the tuples in it, to the client.
Implicit locking reduces the number of locks dramatically in cases of bulk operations, which nicely solves the performance problem of fine-grained locking. However, this mechanism itself is not enough, because it introduces correctness problems.
Problem: what about conflicting transactions that end up holding conflicting lock modes at different levels? Say transaction B holds an X lock on tuple R99 in table 0 and is going to update it. Transaction A then comes and acquires a single S lock on table 0 to read all of its tuples. This situation should not be allowed, yet nothing in the scheme so far prevents it. There are more incorrect scenarios besides this example.
To solve the correctness problem, we need to let internal nodes remember the locking state of their children. We introduce two intention lock modes: intention shared (IS) mode and intention exclusive (IX) mode.
To lock a node in X mode, the client must traverse the tree from the root and lock all ancestor nodes along the path in IX mode before locking the target node in X. Similarly, to lock a node in S mode, the client must traverse the tree from the root and lock all ancestors in IS mode before locking the target node in S. By doing this, internal nodes now carry the necessary information about the locking state of their descendant subtrees.
- IS and S modes are compatible: it is allowed to acquire an S lock on a node already locked in IS mode – the two clients will probably share reading permissions on some children.
- IS and X modes are not compatible: children of a node being updated by someone cannot be read by anyone else.
- IX and S modes are not compatible: if a node and all its children are being read by someone, no write permission in this subtree may be granted to anyone else.
- IX and X modes are obviously not compatible.
- IS mode is compatible with itself: multiple clients could be reading children of this node.
- IX mode is compatible with itself: multiple clients could be updating disjoint sets of children. Conflicts, if any, will be resolved at lower levels of the subtree.
- IX and IS modes are compatible: multiple clients could be reading and updating disjoint sets of children. Possible conflicts are again resolved at lower levels.
By always traversing the tree from the root and locking ancestor nodes in intention modes (and releasing them in the reverse order when done), the correctness problem described in the previous section is now solved. Transaction B must have locked table 0 in IX before it locks R99 in X, which prevents transaction A from locking the entire table in S. If A and B touch different tables, however, they can proceed concurrently.
Problem: consider a workload that scans a big table while attempting to update only a few tuples in it. With the current version of hierarchical locking, it must either hold a big X lock on the table, or hold many S locks on the tuples it reads. Can we further optimize performance for this situation?
SIX Mode as an Optimization
We introduce a combined mode of S and IX to optimize for the aforementioned situation. The shared and intention exclusive (SIX) mode grants the client read permission on all children, while additionally allowing it to acquire X locks on some child nodes. This way, the client can hold a single SIX lock on the table plus a few X locks on the tuples it is trying to modify.
- SIX and IS modes are compatible: the two clients can have disjoint sets of children nodes locked in X and S modes, respectively. Conflicts, if any, will be resolved at those lower levels.
- SIX is not compatible with any mode other than IS, including itself. The reasoning behind this is left as an exercise for the reader.
The original paper 1 presents a nice summary of compatibility between modes. Note that NL simply stands for null lock (i.e., not locked).
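To make this concrete, here is a minimal C sketch of that compatibility matrix and the root-to-target acquisition order. This is my own illustration (the Node type and lock_one() primitive are hypothetical), not code from the paper:

#include <stdbool.h>
#include <stddef.h>

typedef enum { NL, IS, IX, S, SIX, X, NUM_MODES } LockMode;

/* compat[held][requested]: may the two modes coexist on one node? */
static const bool compat[NUM_MODES][NUM_MODES] = {
    /*            NL     IS     IX     S      SIX    X     */
    /* NL  */ { true,  true,  true,  true,  true,  true  },
    /* IS  */ { true,  true,  true,  true,  true,  false },
    /* IX  */ { true,  true,  true,  false, false, false },
    /* S   */ { true,  true,  false, true,  false, false },
    /* SIX */ { true,  true,  false, false, false, false },
    /* X   */ { true,  false, false, false, false, false },
};

typedef struct Node {
    struct Node *parent;   /* NULL at the root of the resource tree */
    /* ... per-node lock state ... */
} Node;

/* Assumed primitive: block until `mode` is compatible (per the matrix
   above) with all modes currently held on `n` by other transactions,
   then record the grant. */
void lock_one(Node *n, LockMode mode);

static LockMode intention_of(LockMode m) {
    return (m == S || m == IS) ? IS : IX;   /* SIX and X imply IX */
}

/* Lock `target` in `mode`, first locking ancestors top-down (root
   first) in the corresponding intention mode. */
void hierarchical_lock(Node *target, LockMode mode) {
    if (target->parent != NULL)
        hierarchical_lock(target->parent, intention_of(mode));
    lock_one(target, mode);
}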
Concurrency control in database systems involves many more interesting issues besides hierarchical locking.
Some of these things have been covered in my past blog posts. Other techniques and their modern implications may be covered in my future blog posts.
The earliest forms of AI, namely small-scale statistical algorithms, didn't attract much attention from computer architects and system builders. They were treated as just another type of normal application workload. System researchers had other issues to deal with, such as the I/O bottleneck, which appeared to be more urgent problems at the time.
Around the year 2010, system researchers started to pay attention to something slightly closer to AI – what we later call "Big Data" applications – thanks to the emergence of Hadoop MapReduce 1 and Spark 2 3. A typical example of such a Big Data application is an iterative graph processing algorithm, such as PageRank. These workloads demand notably more compute power as well as higher storage performance, pushing datacenters to go truly large-scale and become vastly distributed. Combined with technical advances in other areas, including OS virtualization, high-speed networking, and advanced architectures, they led to the success of large-scale datacenters and cloud computing (beyond traditional HPC).
Then came machine learning (ML), more specifically, deep learning (DL) models. There is no need for me to emphasize how much attention these data-hungry workloads have attracted across computer science in recent years. Their requirements for tremendous amounts of data storage, massive parallel computation, and heavy communication have made them one of the most important and challenging classes of workloads. People have done many things to build better systems for ML, and nothing seems to be stopping this trend so far.
With big models (with billions of parameters) gaining popularity, AI continues to be one of the main driving forces behind the advancement of computing infrastructure. Many top conferences in, e.g., the systems area now have one or two sessions dedicated to ML systems (see 4 for an example). There is even a specialized conference for this topic, MLSys 5, which started in 2018.
The interaction between AI and systems can also go the other way around: deploying AI algorithms to help design and implement smarter computer systems infrastructure, in short, AI for Sys. A natural question to ask at this point is: what are the problems in computer systems that AI techniques could really solve better than experienced developers? This is a tough question and many systems researchers are still trying to find a reasonable answer.
One such opportunity, in my opinion, is to use AI algorithms to improve or replace heuristics. Systems builders have long been putting heuristics here and there in all kinds of systems.
For example, cache eviction algorithms in data store systems rely heavily on heuristics about the incoming workload to decide which entry to evict when the cache is full. Many production systems still choose a simple heuristic such as LRU (least-recently used) that might not fit the actual workload well and is not resistant to large scans. If you are interested, here is a post 6 I wrote earlier about cache modes and eviction algorithms.
Another example of heuristics is magic configuration numbers. A hash table implementation needs to decide how many buckets to create initially and how many more to add when resizing. A database system needs to decide how much memory to allocate for the block cache, etc. Magic numbers are everywhere, and they are typically chosen by an experienced system designer with few assumptions about the actual workload the system is going to serve.
AI techniques, especially data-driven ML models, seem to be a good fit for replacing such heuristics. Given that a workload has its own statistical characteristics, we may assume that it is drawn from some probability distribution and is thus learnable by a smart enough ML model. Indeed, there are quite a few recent research papers addressing this opportunity.
However, ML models are not free plug-and-play replacements for these decision-making heuristics. The real workload might not actually follow a learnable pattern, and even if we assume it does, the pattern may change dynamically and rapidly. Furthermore, ML training and inference are themselves storage- and compute-heavy.
By integrating ML algorithms into systems, our ultimate goal is to come up with smarter policies that make better decisions and yield better performance. However, deploying the ML models themselves introduces significant performance overhead, which consists of two parts: training on existing data to learn a policy, and running inference through the policy to get decisions.
Coarsely, we can categorize "ML for Sys" techniques into two classes: offline techniques, which train and query models ahead of time to produce static decisions (such as tuned configuration values), and online techniques, which train and/or query models on the critical path of the running system.
Nonetheless, the performance benefit of deploying an ML model in a computer system must outweigh its deployment cost for the model to be actually useful. This is why most research work on this topic so far is still limited to lightweight ML models. Bourbon, for example, incorporates only a simple segmented linear regression model and no neural networks (NNs). Some offline configuration tuning tools that produce static magic numbers may use larger NN models.
I hope that other ways of integrating AI techniques into computer systems can be discovered in the near future to help us build smarter systems and spawn more interesting ideas.
Crash consistency is a general concept that applies to any storage system maintaining data on persistent storage media.
We say a piece of persistent data is in a consistent state if it is in a correct form representing the logical data structure it stores. For example, if a group of bytes is meant to store a B-tree, then it is in a consistent state iff the root block is in the correct position, all non-null node pointers point to valid child nodes (no dangling pointers), etc. Note that the "data structure" does not have to be a canonical data structure such as a B-tree – it can be any custom user specification.
We say a storage system provides crash consistency if the data on persistent media it manages always transitions from one consistent state to another. Equivalently, no matter when a crash happens during the steps of an update, data on persistent media is always left in a consistent state and can thus be recovered correctly upon restart.
Consistency and durability are two orthogonal guarantees:
It is possible for a storage system to be consistent yet not durable: it acknowledges requests once they reach the DRAM cache, but always flushes them to persistent media in a consistent way – acked requests might be lost after a crash, but data on persistent media is always consistent and can thus be recovered (to a possibly outdated version).
It is also possible to be durable yet not consistent: reflecting all updates to persistent media immediately, but not managing their ordering carefully – acked requests are guaranteed to have persisted completely, but in-progress requests might leave the system in a corrupted state after a crash.
This post focuses on the consistency aspect, although most file systems provide both guarantees. Providing consistency is often a must. In certain cases where the application allows version rollbacks, weaker durability might be allowed.
The difference between crash consistency and other “consistency” terminologies should also be made clear:
- In distributed systems, consistency often means the strength of guarantee of reaching global consensus on the ordering of actions;
- Sometimes, the word “consistent” might also be used as a synonym to “uniform”, such as in consistent hashing.
In the setting of a file system, there are three categories of persistent data that must be managed: file data blocks, per-file metadata (e.g., inodes), and global FS metadata (e.g., allocation bitmaps and the superblock).
Depending on which of these categories are guaranteed crash consistent, an FS can provide two different levels of crash consistency: metadata consistency, where only the FS metadata is guaranteed to be consistent, and data consistency, where file data is additionally guaranteed to be consistent with the metadata describing it.
Metadata consistency is often enough, since applications often have their own error detection & correction mechanisms for file data. As long as the FS image is always consistent, file content does not matter too much. Some FS designs also provide data consistency inherently.
Before diving into the three FS consistency techniques in detail, I'd like to mention two underlying hardware primitives that must be available to FS developers: the ability to persist a small (e.g., single-sector) write atomically, and the ability to order writes to persistent media (e.g., through cache flushes / write barriers). These two primitives are so essential that any file system design must rely on them; otherwise it is impossible to provide any consistency guarantee.
The formalization below comes from the Optimistic Crash Consistency paper 1: \(A \rightarrow B\) means \(A\) must be persisted before \(B\), \(A \vert B\) means the two may be persisted in any order, and an overline denotes a group of writes that must be made atomic.
This section formally summarizes the three classic FS consistency techniques – journaling, shadow paging, and log-structuring – and analyzes their pros & cons.
A journaling FS allocates a dedicated region of persistent storage as a journal (sometimes referred to as a log, though the name might get confused with log-structuring). The journal is an append-only “log” of transactions, where each transaction corresponds to a user update request. The idea behind journaling is that, for any user request, its transaction entry must be persisted and committed before the actual update. Journaling is a specific form of the write-ahead logging (WAL) technique. The action of “committing a transaction entry” must be atomic.
Journaling can be done in two different flavors – metadata journaling and data journaling – described below.
Handling a user request involves the following actions: append the transaction's entries to the journal, persist an atomic commit record sealing the transaction, checkpoint the actual updates to their in-place locations, and finally free the journal space.
A journaling FS has the flexibility to choose between providing only metadata consistency and providing stronger data consistency. In metadata journaling mode, only metadata changes are logged in the journal. This mode introduces minimal overhead. Formally, the algorithm is:
\[D \vert J_M \rightarrow \overline{J_E} \rightarrow M\]
In data journaling mode, data changes are logged in the journal as well, resulting in a write-twice penalty. Formally, the algorithm is:
\[J_D \vert J_M \rightarrow \overline{J_E} \rightarrow D \vert M\]
Many famous Linux file systems are journaling file systems, with Ext3/4 2 being perfect examples (Ext2, which predates journaling, is not one). By default, Ext4 is mounted in data=ordered mode, i.e., only doing metadata journaling. When mounted with the data=journal option, Ext4 does data journaling. XFS 3 also uses journaling. Also see the Optimistic Crash Consistency paper 1 for a thorough discussion of possible optimizations to the algorithm.
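As a user-space analogy of the journaling algorithm above, here is a minimal C sketch (my own illustration with a made-up record format, not how Ext4 is implemented), where fsync() plays the role of the ordering barrier and a single-sector commit record provides atomicity:

#include <sys/types.h>
#include <unistd.h>

struct txn_record {
    unsigned long seq;
    char payload[256];
};

/* Apply one update with write-ahead journaling:
   J -> J_E (commit) -> M, each arrow enforced by an fsync(). */
int journaled_update(int journal_fd, int data_fd,
                     const struct txn_record *rec, off_t data_off) {
    /* 1. Append the transaction entry to the journal. */
    if (write(journal_fd, rec, sizeof(*rec)) != (ssize_t)sizeof(*rec))
        return -1;
    if (fsync(journal_fd) != 0)
        return -1;

    /* 2. Persist a small (single-sector, hence atomic) commit record. */
    char commit[512] = "COMMIT";
    if (write(journal_fd, commit, sizeof(commit)) != (ssize_t)sizeof(commit))
        return -1;
    if (fsync(journal_fd) != 0)
        return -1;

    /* 3. Checkpoint the actual in-place update; a crash before this
       completes is repaired by replaying the committed journal entry. */
    if (pwrite(data_fd, rec->payload, sizeof(rec->payload), data_off)
            != (ssize_t)sizeof(rec->payload))
        return -1;
    return fsync(data_fd);
}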
Shadow paging (or shadowing) is a specific form of the copy-on-write (CoW) technique. The idea behind shadow paging is to first write all updates to newly-allocated empty blocks (copying over any partial blocks if necessary), and then publish the new blocks into the file atomically.
Handling a user request involves the following actions: allocate fresh blocks, copy over any untouched parts of partially-updated blocks, write the new data into the fresh blocks, and finally publish them by atomically switching the metadata pointers.
Formally, the algorithm is:
\[B \rightarrow W_C \vert W_D \rightarrow \overline{M}\]
Shadow paging has obvious advantages and disadvantages compared to journaling. \(\uparrow\) Shadow paging provides data consistency without introducing the write-twice penalty. \(\downarrow\) Shadow paging works well only if most updates are bulky, block-sized, and block-aligned; small, in-place updates introduce significant allocation and copying overhead. In a tree-structured FS, shadow paging might also result in cascading CoW up to the root of the tree (where an atomic pointer switch can be done).
BtrFS 4 and WAFL 5 are two typical examples of CoW FS. To reduce the CoW overhead on small updates, WAFL aggregates and batches incoming writes into a single CoW. BPFS 6 is a CoW FS optimized for non-volatile memory.
Introduced in the classic LFS paper 7, a log-structured file system organizes the entire FS as an append-only log. All updates are just atomic appends to the log (involving both new data blocks and new metadata inode blocks). Atomicity of appends is ensured by atomically updating the log tail offset. The FS maintains an in-DRAM inode map recording the address of the latest version of each file's inode. This in-DRAM inode map can be safely lost in a crash – the persistent log is the ground truth, and the FS image can be rebuilt by reading through the log and figuring out the latest version of each block.
Handling a user request involves the following actions: append the new data blocks to the log, atomically advance the log tail, append the new metadata (inode) blocks, atomically advance the tail again, and finally update the in-DRAM inode map.
Formally, the algorithm is:
\[A_D \rightarrow \overline{L_D} \rightarrow A_M \rightarrow \overline{L_M} \rightarrow I\]
Log-structuring has its own pros and cons. \(\uparrow\) All device writes happen sequentially, yielding good write performance, and a log-structured FS inherently provides data crash consistency. \(\downarrow\) The log grows indefinitely, so there must be a garbage collection mechanism that discards outdated blocks and compacts the log. Also, though writes become sequential, reads of a single file get scattered around the log.
It is possible to combine log-structuring with journaling/shadow paging. For example, NOVA 8 combines metadata journaling with log-structured file data blocks to optimize for non-volatile memory.
Originally, I was using the GlobalProtect client directly on my host PC or on my laptop. My lab machine labmachine.cs.wisc.edu sits behind the departmental VPN, so the connection scheme was simply: host/laptop → GlobalProtect VPN → labmachine.cs.wisc.edu.
Since GlobalProtect clients force all outbound traffic to go through the VPN once connected, I could not let only one terminal's SSH session use the VPN while leaving all other connections native. One workaround would be to install a virtual machine on the host, start the GlobalProtect client inside the VM, and SSH from there, but that requires careful configuration of guest networking and seems an unnecessarily heavyweight solution.
If you are lucky enough to have one or two spare Raspberry Pi boards at home, you can follow the steps listed below to set them up as an SSH proxy server. SSH connections are very lightweight, so even an RPi Zero can do the work nicely.
Let's first assume that the RPi is within the same local network as the host machine (where I want split tunneling). In this case, one RPi should be sufficient. The next section will talk about adding an extra RPi and setting up Dynamic DNS (DDNS) to allow accessing the proxy server from anywhere on the Internet.
With one RPi, the network connection scheme becomes: host PC → (home LAN) → RPi on GlobalProtect VPN → labmachine.cs.wisc.edu.
Setup steps:
1. Log into the home router's admin page (e.g., 192.168.0.1 for my TP-Link Archer). Identify the RPi's hardware MAC address.
2. Reserve a static local IP address for the RPi in the router's DHCP settings (e.g., 192.168.0.131).
3. SSH into the RPi, e.g., ssh piuser@192.168.0.131. Set up password-less SSH if desired.
4. Install the GlobalProtect CLI client on the RPi.
Start the GlobalProtect client on the RPi:
(on-rpi) globalprotect connect --portal compsci.vpn.wisc.edu
After the above steps, I can connect to my lab machine from my host PC using the handy ProxyJump feature of SSH:
ssh -J piuser@192.168.0.131 labuser@labmachine.cs.wisc.edu
It is strongly recommended to set up alias targets in .ssh/config to save future typing, e.g.:
Host josepi4
    Hostname 192.168.0.131
    User piuser
    Port 22
    IdentityFile ~/.ssh/id_rsa
    ServerAliveInterval 30

Host labmachine
    Hostname labmachine.cs.wisc.edu
    User labuser
    Port 22
    IdentityFile ~/.ssh/id_rsa
    ServerAliveInterval 30

Host labmachine-jl
    Hostname labmachine.cs.wisc.edu
    User labuser
    Port 22
    IdentityFile ~/.ssh/id_rsa
    ServerAliveInterval 30
    ProxyJump josepi4
Then, to SSH to the RPi from local network at home:
ssh josepi4
To SSH to the lab machine behind VPN, either will work:
ssh -J josepi4 labmachine
# or simpler:
ssh labmachine-jl
Notice that the GlobalProtect client on the RPi might time out and disconnect after a few minutes of inactivity. A simple keep-alive script running indefinitely on the RPi can keep GlobalProtect connected.
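A minimal sketch of such a keep-alive loop, assuming the client exposes a show --status subcommand (check your client version's CLI; the exact status string may differ):

#!/bin/bash
# Reconnect GlobalProtect whenever the status check stops reporting
# "Connected". Runs forever; launch it in the background on the RPi.
PORTAL=compsci.vpn.wisc.edu
while true; do
    if ! globalprotect show --status 2>/dev/null | grep -q "Connected"; then
        globalprotect connect --portal "$PORTAL"
    fi
    sleep 60
done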
So far, the RPi proxy server is available to any machine connected to my home router's local network. However, I also want access to the proxy server from anywhere on the Internet when I'm not at home.
It is time to introduce two more techniques into the workflow: port forwarding (the "virtual server" feature on my router) and Dynamic DNS.
Due to the strict traffic hijacking of GlobalProtect, the virtual server feature does not work when the previous RPi is on the GlobalProtect VPN. Hence, unfortunately, an additional RPi board needs to be involved. (RPi boards are cheap enough, anyway.)
The final network connection scheme becomes: laptop anywhere → DDNS hostname → router port forwarding → RPi #2 → (home LAN) → RPi #1 on GlobalProtect VPN → labmachine.cs.wisc.edu.
Setup steps:
1. Reserve a static local IP address for the second RPi (e.g., 192.168.0.130), similarly to before.
2. SSH into it, e.g., ssh piuser@192.168.0.130. Set up password-less SSH if desired.
3. On the router, add a virtual server (port forwarding) rule mapping an external port (e.g., 22122) to internal port 192.168.0.130:22. It is recommended to choose a non-default external port to avoid exposing port 22 on the public Internet.
4. Register a DDNS hostname for the router's public IP with a DDNS provider (e.g., josedns.ddns.net), and activate it.
After the above steps, I can connect to the second RPi from anywhere on the public Internet through:
ssh -p 22122 piuser@josedns.ddns.net
To access the first RPi:
ssh -J piuser@josedns.ddns.net:22122 piuser@192.168.0.131
Notice that SSH proxy jumps can be chained, so to access the lab machine behind VPN:
ssh -J piuser@josedns.ddns.net:22122,piuser@192.168.0.131 labuser@labmachine.cs.wisc.edu
Add a few more SSH config entries to save typing, e.g.:
Host josepi0
    Hostname 192.168.0.130
    User piuser
    Port 22
    IdentityFile ~/.ssh/id_rsa
    ServerAliveInterval 30

Host josepi0-jp
    Hostname josedns.ddns.net
    User piuser
    Port 22122
    IdentityFile ~/.ssh/id_rsa
    ServerAliveInterval 30

Host josepi4-jp
    Hostname 192.168.0.131
    User piuser
    Port 22
    IdentityFile ~/.ssh/id_rsa
    ServerAliveInterval 30
    ProxyJump josepi0-jp

Host labmachine-jp
    Hostname labmachine.cs.wisc.edu
    User labuser
    Port 22
    IdentityFile ~/.ssh/id_rsa
    ServerAliveInterval 30
    ProxyJump josepi4-jp
Then, to connect to the second RPi when away from home:
ssh josepi0-jp
To access the first RPi:
ssh josepi4-jp
To access the lab machine:
ssh labmachine-jp
Hooray!
Let's assume that the shared resource is a data structure in DRAM and that single-cacheline reads/writes to DRAM are atomic. Multiple running entities (say, threads) run concurrently on multiple cores and share access to that data structure. The code for one access from one thread to the data structure is a critical section: a sequence of memory reads/writes and computations that must not be interrupted in the middle by other concurrent accesses. We want mutual exclusion: at any time, at most one thread is executing its critical section, and while one is, other threads attempting to enter their critical sections must wait. We do not want race conditions, which may corrupt the data structure and yield incorrect results.
Based on this reasonable setup, it is possible to develop purely software-based algorithms. See this section of Wikipedia 1 for examples. Though very valuable in the theoretical aspect, these solutions are too sophisticated and quite inefficient to be deployed as locking primitives in an operating system under heavy load.
Modern operating systems, instead, rely on hardware atomic instructions: ISA-supported instructions that do more than a single memory read/write, yet are guaranteed by the hardware architecture to be atomic and unbreakable. The operating system implements (mutex) locks upon these instructions (in a threading library, for example). Threads then protect their critical sections in this way to get mutual exclusion:
lock.acquire();
... // critical section
lock.release();
Here are three classic examples of hardware atomic instructions.
The most basic hardware atomic instruction is test-and-set (TAS). It writes a 1 to a memory location and returns the old value at that location, atomically.
TEST_AND_SET(addr) -> old_val
// old_val = *addr;
// *addr = 1;
// return old_val;
Using this instruction, it is simple to build a basic spinlock that grants mutual exclusion (but not fairness or performance, of course).
volatile int flag = 0;

void acquire() {
    while (TEST_AND_SET(&flag) == 1) {}
}

void release() {
    flag = 0;
}
Notice that this is a spinlock, which we will explain in a later section. Also, modern architectures have private levels of cache for each core. When threads compete for the lock, there will be a great amount of cache invalidation traffic, as they are all hammering TEST_AND_SET on the same flag in the while loop.
Compare-and-swap (CAS) compares the value at a memory location with a given value, and if they are the same, writes a new value into it. It returns the old value at the location, atomically.
COMPARE_AND_SWAP(addr, val, new_val) -> old_val
// old_val = *addr;
// if (old_val == val)
// *addr = new_val;
// return old_val;
There are variants of CAS such as compare-and-set or atomic exchange, but the idea is the same. It is also simple to build a basic spinlock out of CAS.
volatile int flag = 0;

void acquire() {
    while (COMPARE_AND_SWAP(&flag, 0, 1) == 1) {}
}

void release() {
    flag = 0;
}
Load-linked (LL) & store-conditional (SC) are a pair of atomic instructions used together. LL is just like a normal memory load, except that it tags the loaded location. SC tries to store a value to the location and succeeds only if the location has not been written to since the paired LL, otherwise returning failure.
LOAD_LINKED(addr) -> val
// return *addr;
STORE_CONDITIONAL(addr, val) -> success?
// if (no store to addr since the paired LL) {
//     *addr = val;
//     return 1; // success
// } else
//     return 0; // failed
Building a spinlock out of LL/SC:
volatile int flag = 0;

void acquire() {
    while (1) {
        while (LOAD_LINKED(&flag) == 1) {}
        if (STORE_CONDITIONAL(&flag, 1) == 1)
            return;
    }
}

void release() {
    flag = 0;
}
Fetch-and-add (FAA) is a less common atomic instruction that can be implemented on top of CAS or supported natively by the architecture. It atomically increments an integer counter and returns the old value.
FETCH_AND_ADD(addr) -> old_val
// old_val = *addr;
// *addr += 1;
// return old_val;
Before we list a few lock implementations, I'd like to compare spinning locks against blocking locks.
A spinning lock (or spinlock, a non-blocking lock) is a lock implementation where lock waiters spin in a loop, repeatedly checking some condition. The examples given above are basic spinlocks. Spinlocks are typically used for low-level critical sections that are short and small but invoked very frequently, e.g., in device drivers.
A blocking lock is a lock implementation where a waiter yields its core to the scheduler when the lock is currently taken. A waiter thread adds itself to the lock's wait queue and blocks its own execution (called parking), letting some other ready thread run on the core until it gets woken up (typically by the previous lock holder) and scheduled back. Blocking locks are designed for higher-level, longer critical sections; their pros and cons are exactly the opposite of a spinlock's.
It is possible to have smarter hybrid locks that combine spinning and blocking, often referred to as two-phase locks (not to be confused with two-phase locking in databases). A POSIX mutex can first spin for a bounded length of time; if the wait gets too long, it parks through the scheduler. The Linux lock built on the futex syscall 2 is a good example of such a two-phase design.
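A rough sketch of this two-phase idea on Linux, borrowing the classic three-state futex mutex design (0 = free, 1 = held, 2 = held with possible waiters); this is a simplified illustration, not glibc's actual implementation:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int flag;  /* 0 = free, 1 = held, 2 = held + waiters */

static void futex_wait(atomic_int *addr, int val) {
    /* Sleep only if *addr still equals val (checked by the kernel). */
    syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

void acquire(void) {
    /* Phase 1: spin for a bounded number of attempts. */
    for (int i = 0; i < 1000; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&flag, &expected, 1))
            return;
    }
    /* Phase 2: mark the lock contended and park until woken. */
    while (atomic_exchange(&flag, 2) != 0)
        futex_wait(&flag, 2);
}

void release(void) {
    if (atomic_exchange(&flag, 0) == 2)  /* were there waiters? */
        futex_wake(&flag);
}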
Here are a few interesting examples of lock implementations that appeared in the history of operating systems research. The list goes in the order from lower-level spinlocks to higher-level scheduling-based locks with more considerations.
To reduce the cache invalidation problem of the simple TAS spinlock, we can use a test-and-test-and-set (TTAS) spinlock 3.
void acquire() {
    do {
        while (flag == 1) {}
    } while (TEST_AND_SET(&flag) == 1);
}

void release() {
    flag = 0;
}
The point is that, in the inner while loop, the value of flag is cached in each core's private cache, so every waiter spins on a local cached copy and, most of the time, generates no coherence traffic. Whenever the value of flag changes to 0 (the lock appears released), cache invalidation traffic invalidates the cached copies, terminating the inner while loop. Only then does a waiter fall back to the outer TEST_AND_SET attempt.
A ticket lock is a spinlock that uses the notion of "tickets" to improve arrival-order fairness.
volatile int ticket = 0;
volatile int turn = 0;

void acquire() {
    int myturn = FETCH_AND_ADD(&ticket);
    while (turn != myturn) {}
}

void release() {
    turn++;
}
The downside is still the same cache invalidation problem as in basic spinlocks, since all waiters spin on the shared turn variable.
A comparison table across Linux low-level spinlock implementations, including LL/SC and ABQL locks, can be found in this Wikipedia section 4.
The MCS lock uses a linked-list structure to further optimize the caching behavior beyond TTAS. MCS is based on atomic swap: it queues the waiters into a linked list and lets each waiter spin on its own node's is_locked variable. A good demonstration of how this algorithm works can be found here 5.
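To make the queueing idea concrete, here is a minimal MCS sketch using C11 atomics; this is my own illustration (node allocation and memory-ordering details omitted), not the original paper's code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each thread supplies its own qnode when acquiring and releasing. */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

void mcs_acquire(mcs_lock *lk, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* Atomically swap ourselves in as the new tail of the queue. */
    mcs_node *prev = atomic_exchange(&lk->tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);       /* link behind predecessor */
        while (atomic_load(&me->locked)) {}  /* spin on OUR OWN flag */
    }
}

void mcs_release(mcs_lock *lk, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No known successor: try to swing the tail back to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&lk->tail, &expected, NULL))
            return;
        /* A successor is linking itself in; wait for it to appear. */
        while ((succ = atomic_load(&me->next)) == NULL) {}
    }
    atomic_store(&succ->locked, false);      /* hand the lock over */
}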
MCS-TP (time-published) is an enhancement to MCS that adds timestamps so that a thread can park after spinning for some time, echoing the two-phase design mentioned for the POSIX locks.
Lozi et al. proposed a lock delegation design that aims to improve the scalability of locks in this ATC'12 paper 6. Remote core locking (RCL) recognizes the fact that, at any time, only one thread executes the critical section, so why not let a dedicated "server" thread do it. For a critical section that is invoked frequently, RCL dedicates a server thread just to executing that critical section's logic. Other threads use atomic cacheline operations to put themselves into a fixed mailbox-like queue, and the server thread loops over the queue, serving them in order. This prevents the lock data structure from being cache-invalidated and bounced across different cores over time.
Figure 1 of the paper.
The downside is that it is harder to pass data/results into and out of the critical section, the server core is always occupied, and it can only serve a pre-chosen set of critical section logics.
Kashyap et al. proposed an interesting enhancement to blocking locks called shuffling in this SOSP'19 paper 7. The shuffle locks are NUMA-aware: they exploit the fact that, on modern non-uniform memory access architectures, cores in the same NUMA socket (or in closer sockets) have faster access to local memory than to memory on a different, "remote" socket. Hence, it is a nice idea to reorder the wait queue of a lock dynamically depending on which NUMA socket each waiter resides on.
Periodically, the first waiter in the queue is designated the shuffler, which traverses the remaining wait queue and reorders it, grouping waiters on the same NUMA socket together. There is then a higher chance that, once the lock is released, the next holder is on the same NUMA socket as the previous holder, so transferring the lock data structures is faster and there is less cache invalidation traffic.
Figure 5 of the paper.
However, fairness is not guaranteed in this scheme, as a lock waiter could be pushed back in the queue repeatedly; this remains a big open problem for shuffle locks.
An operating system kernel is the composition and collaboration of a set of core pieces: process management & CPU scheduling, virtual memory management, file systems & storage, the networking stack, device drivers, and inter-process communication (IPC) mechanisms.
A complete operating system typically also includes programs running as privileged processes, with the support of the kernel, to provide a reasonable user interface (an init system, shells, display servers, system daemons, and so on).
Then, user application programs run as user-level processes with the support of the above-listed functionalities. Different OS kernel models distinguish themselves by where they put each of these functionalities, how they implement them, and how they hook them together.
As the name suggests, a monolithic kernel encapsulates everything into a whole. A monolithic kernel often takes a layered structure, where each layer is based on the correctness of the lower layer and adds in a level of abstraction to be used by the upper layer. All application programs run as processes on top of the kernel.
Make no mistake: the kernel itself is NOT a running entity. It is just a big codebase of registered handlers for interrupt events: software interrupts issued by application processes (syscalls), or hardware interrupts from devices (e.g., timer interrupts that trigger scheduling decisions). You can think of it as a static code stack that a process gains access to when it switches to privileged mode. Only the processes are running entities that take up CPU and memory resources to do work. Upon an interrupt, they switch to privileged mode to execute kernel logic, such as accessing a shared resource through a syscall or yielding to the scheduler on a timer interrupt. There are exceptions, of course: for example, during boot, or special kernel threads doing background work that do not belong to any specific process.
The monolithic kernel is the most classic kernel model, originating from very early OS prototypes such as THE 1 and UNIX 2. Most mainstream OS platforms are based on a monolithic kernel: BSD, Linux 3, OS X, and Windows. A monolithic kernel is very compact and highly efficient; meanwhile, it is hard to develop (low code velocity) and less flexible.
A microkernel, in contrast, implements only the core functionalities such as process isolation, CPU scheduling, virtual memory, and basic IPC mechanisms. Each of the subsystems runs as a dynamic process, often called a service, which is just like a user process listening on IPCs but may have higher privilege and scheduling priority. These services often have dedicated, direct access to their corresponding hardware resources.
Say a user application wishes to fetch the next incoming network packet. Instead of making a syscall down to the kernel network stack, it makes an IPC through the kernel to the networking service. The service process then reacts to the IPC request. Think of it as a client-server communication model.
Examples of microkernels include MINIX 4 and the L3/L4 microkernel family 5. A microkernel makes it easier to develop/debug kernel subsystems, as developers are almost just writing user programs. Code is also more modularized and less entangled. However, microkernel performance is very sensitive to the efficiency of IPCs, making microkernels generally less performant than monolithic kernels.
Sometimes we only want to move a subset of the subsystems up as processes and keep everything else in the kernel. For example, if we are targeting applications with special storage requirements and want to easily develop a custom scalable file system, we can have the file system running as a process while keeping the device drivers and the networking stack in the kernel.
Examples of such semi-microkernels include FUSE 6 for user-space file systems and Google Snap 7 for user-space networking. Our group also has an upcoming paper on this topic. A semi-microkernel is a compromise between a monolithic kernel and a microkernel.
The exokernel takes a more aggressive approach by moving not only the subsystems but also most of the core functionalities up into each application, essentially linking each application program against a custom library OS. The kernel itself becomes minimal. These library OSes share hardware resources through a more primitive, coarser-grained interface than traditional syscalls.
The exokernel was introduced by this paper 8, and it also inspired the idea of virtual machine hypervisors. Exokernel calls share much similarity with hypercalls in type-1 virtual machines, which we will talk about in the next section. An exokernel allows developing highly-optimized kernel implementations customized for each application, but makes it harder to coordinate and schedule across multiple application processes.
With the evolution of storage and networking devices, the overhead of kernel software stack is becoming much more significant. Sometimes, we want a lighter-weight storage/network subsystem tuned for an application and granted direct access to a modern device, so that it bypasses the centralized kernel stack for better latency.
Unlike the microkernel case, the "moved-up" part does not run as a separate process (which would still be a centralized component able to perform scheduling and performance isolation); instead, it is written as a library linked into application processes (which means multiple processes invoke the subsystem logic independently, with less performance isolation).
Much research effort has been put into this field in recent years. Examples include Arrakis 9, Strata 10, SplitFS 11, Twizzler 12, and many more. I think of direct-access libraries as a compromise between monolithic kernel and exokernel, though some of the research prototypes have not yet considered the sharing and scheduling problems among processes - they just grant the library full control over the device on the datapath.
Hardware devices are getting smarter and come equipped with "small computers" on board: SSDs have FTL controllers running inside with their own RAM, and other devices such as network cards follow the same trend. A disaggregated kernel takes advantage of the computing power on each device and distributes the kernel component for a device onto the device itself. A smart memory chip may run the memory management component, and a smart disk may run a full storage stack + driver. This moves the kernel closer to the hardware (in contrast to closer to the applications, as with microkernels and exokernels).
Kernel disaggregation brings flexibility, elasticity, and fault independence. A DRAM failure in other kernel models means the entire machine goes down, while a failed memory component in a disaggregated kernel does not affect the correct functioning of the other components. The downside is that it essentially turns an OS kernel into a heterogeneous distributed system, making it much harder to develop, keep consistent, and make performant.
The best example of disaggregated kernel is LegoOS 13.
There are also efforts in exploring writing an OS kernel in higher-level languages with runtime garbage collection. Biscuit 14 does it in Go.
In the cloud computing era, it becomes increasingly interesting and useful to be able to run multiple OS environments on one physical machine. The virtual machine technology uses a hypervisor (or virtual machine monitor, VMM) to simulate/emulate hardware resources and to coordinate across multiple guest OSes. Virtual machine solutions are often categorized as follows.
The hypervisor runs directly on and has full control over the hardware, and the device drivers are also implemented inside the hypervisor. Guest OS does not need to be modified, as long as it hooks with the emulated device interfaces.
This approach is the most straightforward but requires a strong hypervisor that provides complete device driver emulation. This model originates from the work of Disco 15 and examples include VMware ESX/ESXi 16.
The hypervisor runs directly on and has full control over the hardware, but device driver implementations are provided by a special domain-0 (Dom0) OS. Other guest OSes are called domain-U (DomU). Guest kernel device requests are redirected to the Dom0 kernel.
Typically, the DomU kernels may need a few modifications to fit this model. This characteristic is called para-virtualization: it is OK to apply small modifications to the guest kernels, and they need not work out-of-the-box as if there were no virtualization.
Sometimes, there are even special, minimal OS kernels written just to be used as these DomU kernels in type-1b VMs. They are called “unikernels”.
Examples of type-1b hypervisors include Xen 17.
The hypervisor is just a software package/kernel module extension running on a host OS, which emulates hardware platforms for running guest OSes.
Examples include VMware Workstation 18, Virtual Box 19, and QEMU (on Linux, possibly with KVM support) 20 21.
Pure software emulation delivers poor performance, as it adds an expensive layer of abstraction. Modern hypervisors, whichever type they belong to, often take advantage of dedicated hardware virtualization support to be more efficient when the guest ISA matches the host machine (e.g., running x86-64 VMs on an x86-64 platform).
For type-2 use cases, some host OSes like Linux also provide kernel-based virtual machine (KVM) support, which allows emulators like QEMU to run guest code more efficiently and "natively".
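For instance, a typical KVM-accelerated QEMU invocation looks something like the following (guest.img is a hypothetical disk image; flags may vary across QEMU versions):
~$ qemu-system-x86_64 \
       -enable-kvm \
       -cpu host \
       -smp 4 -m 4G \
       -drive file=guest.img,format=qcow2 \
       -nic user,model=virtio-net-pci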
THE: https://www.cs.utexas.edu/users/dahlin/Classes/GradOS/papers/p341-dijkstra.pdf ↩
MINIX: http://www.minix3.org/ ↩
L3/L4 Family: https://en.wikipedia.org/wiki/L4_microkernel_family ↩
FUSE: https://en.wikipedia.org/wiki/Filesystem_in_Userspace ↩
Exokernel: https://cs.nyu.edu/~mwalfish/classes/14fa/ref/engler95exokernel.pdf ↩
Arrakis: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter ↩
Strata: https://www.cs.utexas.edu/users/witchel/pubs/kwon17sosp-strata.pdf ↩
SplitFS: https://www.cs.utexas.edu/~vijay/papers/sosp19-splitfs.pdf ↩
Twizzler: https://www.usenix.org/system/files/atc20-bittman.pdf ↩
LegoOS: https://www.usenix.org/conference/osdi18/presentation/shan ↩
Biscuit: https://www.usenix.org/conference/osdi18/presentation/cutler ↩
Disco: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=1473D91F21DBDF43FEF78259A24F0F2D?doi=10.1.1.103.714&rep=rep1&type=pdf ↩
VMware ESXi: https://www.vmware.com/products/esxi-and-esx.html ↩
Xen: https://xenproject.org/ ↩
VMware Workstation: https://www.vmware.com/products/workstation-pro.html ↩
Virtual Box: https://www.virtualbox.org/ ↩
QEMU: https://www.qemu.org/ ↩
KVM: https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine ↩