<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.josehu.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.josehu.com/" rel="alternate" type="text/html" /><updated>2026-03-15T04:22:46+00:00</updated><id>https://www.josehu.com/feed.xml</id><title type="html">Guanzhou Hu</title><subtitle>Feeling comfortably numb</subtitle><entry><title type="html">An Effective Algorithm for On-line Linearizability Checking</title><link href="https://www.josehu.com/technical/2024/07/28/on-line-linearizability-checker.html" rel="alternate" type="text/html" title="An Effective Algorithm for On-line Linearizability Checking" /><published>2024-07-28T16:16:03+00:00</published><updated>2024-07-28T16:16:03+00:00</updated><id>https://www.josehu.com/technical/2024/07/28/on-line-linearizability-checker</id><content type="html" xml:base="https://www.josehu.com/technical/2024/07/28/on-line-linearizability-checker.html"><![CDATA[<p>This post describes a simple yet effective algorithm of an on-line linearizability checker for concurrent Put/Get operations from a known number of nodes. The core idea is to maintain a set of still-possible states (i.e., <em>possibilities</em>) given the operation results observed. If this set ever becomes empty after feeding an operation result in, then linearizability has been violated. Check out <a href="https://github.com/josehu07/linearize">this repo</a> for a Rust crate implementation of this algorithm.</p>

<h2 id="linearizability">Linearizability</h2>

<p>With multiple <em>nodes</em> issuing and completing concurrent <em>operations</em> on a single object, <em>linearizability</em> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> is defined as the conjunction of the following two conditions:</p>

<ul>
  <li>there must exist an equivalent global <em>sequential</em> order of all operations, where each operation observes the results of all preceding operations, and</li>
  <li>the global order must obey the <em>real-time</em> property: if an operation starts later than another one finishes (based on their timestamps), it must be placed after that one in the global order.</li>
</ul>
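<p>These two conditions can be checked directly for a <em>given</em> candidate order. As a minimal illustration (a hypothetical Python sketch, not the linked crate’s API), the following function validates a proposed global sequential order of single-register operations against both conditions:</p>

```python
# Hypothetical sketch (not the linked crate's API): validate a proposed
# global sequential order of (ts_req, ts_ack, kind, value) operations on
# a single register that starts at None.
def valid_linearization(order):
    value = None
    for i, (ts_req, ts_ack, kind, val) in enumerate(order):
        # Real-time property: if some op finished before this one started,
        # it must not be placed later in the global order.
        if any(o[1] < ts_req for o in order[i + 1:]):
            return False
        if kind == "Get" and val != value:
            return False  # sequential property: a Get observes the latest Put
        if kind == "Put":
            value = val
    return True

# Put(7) spanning timestamps 1..4 followed by Get(7) spanning 3..6 is valid:
print(valid_linearization([(1, 4, "Put", 7), (3, 6, "Get", 7)]))  # True
```

<p>The hard part, of course, is that no single candidate order is given up front; the checker must track all still-viable orders on-line, which is exactly what the possibility set does.</p>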

<p>The core idea behind this algorithm is to maintain a set of still-possible states (hereafter called <em>possibilities</em>) given the operation results observed. If this set ever becomes empty after feeding an operation result in, then linearizability has been violated.</p>

<h2 id="definitions">Definitions</h2>

<p>Each possibility is a “snapshot” of the object’s value after successfully applying a sequence of operations. More precisely, a possibility tracks the following three things:</p>

<ul>
  <li><strong>lineage history</strong>: the sequence of operations that have been applied; think of this as the determined prefix of a possible global sequential order</li>
  <li><strong>current value</strong>: the current value of the object, obtained by starting from an initial nil value and applying the determined sequence</li>
  <li><strong>per-node queues</strong>: per-node queues of operation results coming from each node which have not been applied yet</li>
</ul>

<p>where each operation result, denoted <code class="language-plaintext highlighter-rouge">&lt;ts_req&gt;Type(in/out)&lt;ts_ack&gt;</code>, contains the following information besides its source node ID:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">ts_req</code>: starting timestamp</li>
  <li><code class="language-plaintext highlighter-rouge">Type(in/out)</code>: Put input/Get output</li>
  <li><code class="language-plaintext highlighter-rouge">ts_ack</code>: finish timestamp</li>
</ul>

<p>Let’s assume all timestamps are unique and that each node feeds its operations in order (i.e., the <code class="language-plaintext highlighter-rouge">ts_req</code> of a node’s next operation is always &gt; the <code class="language-plaintext highlighter-rouge">ts_ack</code> of its previous one).</p>
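<p>Concretely, an operation result might be encoded as a small record; the field names here are illustrative and not taken from the linked crate:</p>

```python
from dataclasses import dataclass

# Illustrative record for one operation result (field names assumed,
# not the linked crate's API).
@dataclass(frozen=True)
class Op:
    ts_req: int  # starting timestamp
    ts_ack: int  # finish timestamp
    kind: str    # "Put" or "Get"
    value: int   # Put input or Get output

    def __repr__(self):
        # Render in the <ts_req>Type(in/out)<ts_ack> notation used below.
        return f"<{self.ts_req}>{self.kind}({self.value})<{self.ts_ack}>"

print(Op(1, 4, "Put", 7))  # prints <1>Put(7)<4>
```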

<p>Here is an example of a valid possibility, assuming a known number of 2 nodes <code class="language-plaintext highlighter-rouge">n0</code> and <code class="language-plaintext highlighter-rouge">n1</code>:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(7)&lt;4&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;3&gt;Get(7)&lt;6&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">7</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;10&gt;Get(8)&lt;11&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;13&gt;Put(9)&lt;17&gt;</code> <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛</td>
    </tr>
  </tbody>
</table>

<h2 id="the-algorithm">The Algorithm</h2>

<p>The checker starts from an initial set that contains only one initial possibility.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"> </td>
      <td style="text-align: center">nil</td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛</td>
    </tr>
  </tbody>
</table>

<p>Nodes feed completed operations to the checker. For each operation fed, the checker pushes it to the back of the corresponding node’s queue of every current possibility. Say node <code class="language-plaintext highlighter-rouge">n0</code> feeds a <code class="language-plaintext highlighter-rouge">Put(55)</code> that started on timestamp 1 and finished on 5:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"> </td>
      <td style="text-align: center">nil</td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code> <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛</td>
    </tr>
  </tbody>
</table>

<p>The checker tries to <strong>step</strong> each current possibility by consuming it, producing zero or more new possibilities, and adding them to the set. A possibility can be stepped iff it has at least one pending operation from every node. Here, there is only one possibility in the set and it cannot be stepped (as we don’t yet know what the next op from <code class="language-plaintext highlighter-rouge">n1</code> would look like), so nothing happens.</p>
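<p>Assuming operation results are encoded as plain <code class="language-plaintext highlighter-rouge">(ts_req, ts_ack, kind, value)</code> tuples, the step rule can be sketched as follows (illustrative code, not the crate’s actual implementation):</p>

```python
# Sketch of stepping one possibility (illustrative, not the crate's code).
# `value` is the current value; `queues` is a tuple of per-node tuples of
# pending (ts_req, ts_ack, kind, value) operation results.
def step(value, queues):
    # Steppable iff every node has a pending op; otherwise an op we have
    # not seen yet might have to be ordered first.
    if any(len(q) == 0 for q in queues):
        return
    heads = [q[0] for q in queues]
    for i, (ts_req, ts_ack, kind, val) in enumerate(heads):
        # Real-time check: this op cannot come next if another pending
        # head op finished before it started.
        if any(h[1] < ts_req for j, h in enumerate(heads) if j != i):
            continue
        if kind == "Get" and val != value:
            continue  # a Get must observe the current value
        rest = list(queues)
        rest[i] = rest[i][1:]
        yield (val if kind == "Put" else value), tuple(rest)

# Once both Put(55) and Put(66) are pending (as in the running example),
# stepping yields two possibilities:
poss = (None, (((1, 5, "Put", 55),), ((3, 6, "Put", 66),)))
print(sorted(v for v, _ in step(*poss)))  # [55, 66]
```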

<p>Say <code class="language-plaintext highlighter-rouge">n1</code> then feeds a <code class="language-plaintext highlighter-rouge">Put(66)</code>:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"> </td>
      <td style="text-align: center">nil</td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code> <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code></td>
    </tr>
  </tbody>
</table>

<p>Now this possibility has at least one pending operation from every node, meaning it can be stepped. Stepping picks candidate operations from the heads of the per-node queues and tries to apply each to the current value; a successful apply produces a new possibility, while a Get with a mismatching output produces none. In this case, either head is a valid candidate because their timestamp spans overlap and both are Puts. After stepping, the possibility is consumed and two new valid possibilities are produced, so the set now looks like:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">55</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">66</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code> <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛</td>
    </tr>
  </tbody>
</table>

<p>Stepping is attempted repeatedly until all possibilities in the new set cannot be stepped.</p>
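<p>This exhaustive stepping is a standard worklist fixpoint. A minimal sketch, with hypothetical <code class="language-plaintext highlighter-rouge">steppable</code> and <code class="language-plaintext highlighter-rouge">step</code> helpers passed in as parameters:</p>

```python
# Worklist fixpoint for exhaustive stepping (hypothetical helper names).
# A steppable possibility is consumed and replaced by its successors
# (possibly none, in which case it is pruned); a blocked one is kept
# until more operations arrive.
def settle(possibilities, steppable, step):
    done, work = set(), list(possibilities)
    while work:
        poss = work.pop()
        if steppable(poss):
            work.extend(step(poss))
        else:
            done.add(poss)
    # An empty result from a non-empty input means a violation.
    return done

# Toy sanity check: count down from 3 until blocked at 0.
print(settle({3}, lambda n: n > 0, lambda n: [n - 1]))  # {0}
```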

<p>Say <code class="language-plaintext highlighter-rouge">n1</code> then feeds a <code class="language-plaintext highlighter-rouge">Get(77)</code> that started late:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">55</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;10&gt;Get(77)&lt;12&gt;</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">66</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code> <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;10&gt;Get(77)&lt;12&gt;</code></td>
    </tr>
  </tbody>
</table>

<p>While this may look like a linearizability violation at first glance, we can’t say for sure, because <code class="language-plaintext highlighter-rouge">n0</code> could have made a <code class="language-plaintext highlighter-rouge">Put(77)</code> sometime in the middle. Anyway, feeding this Get makes the second possibility steppable; but this time, only the <code class="language-plaintext highlighter-rouge">Put(55)</code> can be a valid next operation. The <code class="language-plaintext highlighter-rouge">Get(77)</code> cannot be chosen as a candidate for two reasons: (1) it started strictly after the finish of <code class="language-plaintext highlighter-rouge">Put(55)</code>, so the real-time property forces <code class="language-plaintext highlighter-rouge">Put(55)</code> first; and (2) even if it had overlapped with the Put, its output would not match the current value <code class="language-plaintext highlighter-rouge">66</code>. The new set after stepping:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">55</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;10&gt;Get(77)&lt;12&gt;</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">55</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;10&gt;Get(77)&lt;12&gt;</code></td>
    </tr>
  </tbody>
</table>

<p>Say <code class="language-plaintext highlighter-rouge">n0</code> then feeds a <code class="language-plaintext highlighter-rouge">Put(77)</code> which indeed happened in the middle:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">55</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;7&gt;Put(77)&lt;9&gt;</code> <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;10&gt;Get(77)&lt;12&gt;</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">55</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;7&gt;Put(77)&lt;9&gt;</code> <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;10&gt;Get(77)&lt;12&gt;</code></td>
    </tr>
  </tbody>
</table>

<p>After stepping all current possibilities exhaustively, the set reduces to one possibility, and linearizability still holds.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">lineage history</th>
      <th style="text-align: center">current</th>
      <th style="text-align: left">per-node queues</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">&lt;3&gt;Put(66)&lt;6&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;1&gt;Put(55)&lt;5&gt;</code> ~ <code class="language-plaintext highlighter-rouge">&lt;7&gt;Put(77)&lt;9&gt;</code></td>
      <td style="text-align: center"><code class="language-plaintext highlighter-rouge">77</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">n0</code> ➛ <br /> <code class="language-plaintext highlighter-rouge">n1</code> ➛ <code class="language-plaintext highlighter-rouge">&lt;10&gt;Get(77)&lt;12&gt;</code></td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p>Note that operations <code class="language-plaintext highlighter-rouge">Put(66)</code> and <code class="language-plaintext highlighter-rouge">Put(55)</code> are swappable in the lineage history, but we consider both orders the same possibility, as the difference does not affect any of the checker’s decisions from this point on.</p>
</blockquote>

<p>Consider, alternatively, that <code class="language-plaintext highlighter-rouge">n0</code> instead feeds an arbitrary operation that started at timestamp 13, rather than a <code class="language-plaintext highlighter-rouge">Put(77)</code> that started before 12. You should find no valid possibilities left after exhaustive stepping, meaning a linearizability violation is detected: <code class="language-plaintext highlighter-rouge">n1</code>’s Get that finished at timestamp 12 cannot observe a value of <code class="language-plaintext highlighter-rouge">77</code>. I will leave this as an exercise for readers =)</p>
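<p>For readers who prefer to verify mechanically, here is a compact, self-contained sketch of the full checker (illustrative Python, not the linked Rust crate) that replays the walkthrough’s trace with each of the two alternative final operations:</p>

```python
# Compact end-to-end sketch of the checker (illustrative, not the crate).
# Ops are (node, ts_req, ts_ack, kind, value), fed in per-node order.
def check(n_nodes, ops):
    poss_set = {(None, tuple(() for _ in range(n_nodes)))}
    for node, *op in ops:
        # Push the new op onto the matching queue of every possibility.
        fed = set()
        for value, queues in poss_set:
            qs = list(queues)
            qs[node] = qs[node] + (tuple(op),)
            fed.add((value, tuple(qs)))
        # Exhaustively step until no possibility can be stepped further.
        done, work = set(), list(fed)
        while work:
            value, queues = work.pop()
            if any(len(q) == 0 for q in queues):
                done.add((value, queues))  # blocked: await more ops
                continue
            heads = [q[0] for q in queues]
            for i, (ts_req, ts_ack, kind, val) in enumerate(heads):
                if any(h[1] < ts_req for j, h in enumerate(heads) if j != i):
                    continue  # real-time order forces another op first
                if kind == "Get" and val != value:
                    continue  # Get output mismatches current value
                qs = list(queues)
                qs[i] = qs[i][1:]
                work.append((val if kind == "Put" else value, tuple(qs)))
        poss_set = done
        if not poss_set:
            return False  # linearizability violated
    return True

trace = [(0, 1, 5, "Put", 55), (1, 3, 6, "Put", 66), (1, 10, 12, "Get", 77)]
print(check(2, trace + [(0, 7, 9, "Put", 77)]))    # True: Put(77) in the middle
print(check(2, trace + [(0, 13, 15, "Put", 77)]))  # False: started too late
```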

<h2 id="implementation">Implementation</h2>

<p>An implementation of this algorithm, along with examples, can be found at <a href="https://github.com/josehu07/linearize">this GitHub repo</a> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://dl.acm.org/doi/10.1145/78969.78972">https://dl.acm.org/doi/10.1145/78969.78972</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://www.josehu.com/technical/2020/05/23/consistency-models.html">https://www.josehu.com/technical/2020/05/23/consistency-models.html</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://github.com/josehu07/linearize">https://github.com/josehu07/linearize</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Guanzhou Hu</name></author><category term="Technical" /><summary type="html"><![CDATA[This post describes a simple yet effective algorithm of an on-line linearizability checker for concurrent Put/Get operations from a known number of nodes. The core idea is to maintain a set of still-possible states (i.e., possibilities) given the operation results observed. If this set ever becomes empty after feeding an operation result in, then linearizability has been violated. Check out this repo for a Rust crate implementation of this algorithm.]]></summary></entry><entry><title type="html">Practical SMR-style TLA+ Specification of the MultiPaxos Protocol</title><link href="https://www.josehu.com/technical/2024/02/19/practical-MultiPaxos-TLA-spec.html" rel="alternate" type="text/html" title="Practical SMR-style TLA+ Specification of the MultiPaxos Protocol" /><published>2024-02-19T12:11:20+00:00</published><updated>2024-02-19T12:11:20+00:00</updated><id>https://www.josehu.com/technical/2024/02/19/practical-MultiPaxos-TLA-spec</id><content type="html" xml:base="https://www.josehu.com/technical/2024/02/19/practical-MultiPaxos-TLA-spec.html"><![CDATA[<p>The attached files present a practical TLA+ specification of MultiPaxos that very closely models how a real state machine replication (SMR) system would implement this protocol. I did not find anything similar on the web, so I’d like to share it with anyone interested.</p>

<h2 id="files-of-this-tla-spec">Files of This TLA+ Spec</h2>

<p>Below are the files composing the checkable model (organized in VSCode extension style):</p>

<ul>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_style/MultiPaxos.tla">MultiPaxos.tla</a> (main protocol spec written in PlusCal and with translation attached)</li>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_style/MultiPaxos_MC.tla">MultiPaxos_MC.tla</a> (entrance of running model checking; contains the checked constraints)</li>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_style/MultiPaxos_MC.cfg">MultiPaxos_MC.cfg</a> (recommended model inputs and configurations, which should give 100% coverage of all interesting cases)</li>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_style/MultiPaxos_MC_small.cfg">MultiPaxos_MC_small.cfg</a> (smaller input with one fewer write and no <code class="language-plaintext highlighter-rouge">CommitNotice</code> messages)</li>
</ul>

<h2 id="whats-good-about-this-spec">What’s Good About This Spec</h2>

<p>This spec is different from traditional, general descriptions of Paxos/MultiPaxos in the following aspects:</p>

<ul>
  <li>It models MultiPaxos in a practical SMR system style that’s much closer to real implementations than its traditional, abstract specs (e.g., <a href="https://github.com/tlaplus/Examples/tree/master/specifications/Paxos">this</a>)
    <ul>
      <li>All servers explicitly replicate a log of instances, each holding a command</li>
      <li>Numbers of client write/read commands are made model inputs</li>
      <li>An explicit <em>termination</em> condition is defined, so semi-liveness can be checked via the absence of deadlocks</li>
      <li>Safety constraint is defined as a clean client-viewed <em>linearizability</em> property upon termination</li>
      <li>Replica node failure is injected to validate the protocol’s fault-tolerance level</li>
      <li>See the detailed comments in the source files…</li>
    </ul>
  </li>
  <li>Careful optimizations are applied to the spec to reduce the state space without loss of generality
    <ul>
      <li>Model checking with recommended inputs completes in &lt; 22 min on a 40-core server machine</li>
      <li>Commenting out the <code class="language-plaintext highlighter-rouge">HandleCommitNotice</code> action (which is the least significant) and having one fewer request reduces check time down to &lt; 10 secs</li>
    </ul>
  </li>
  <li>It is easy to extend this spec and add even more practical features
    <ul>
      <li>Leader lease and local read</li>
      <li>Asymmetric write/read quorum sizes</li>
      <li>…</li>
    </ul>
  </li>
</ul>

<p>This spec has been accepted into the official <a href="https://github.com/tlaplus/Examples">TLA+ Examples repo</a>! <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>Here are some links I found particularly useful when developing this spec by myself: <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>

<h2 id="update-extended-spec-with-extra-features">Update: Extended Spec with Extra Features</h2>

<p>Below are the files composing an extended version of the spec along with model inputs:</p>

<ul>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_addon/MultiPaxos.tla">MultiPaxos.tla</a> (extended main protocol spec written in PlusCal and with translation attached)</li>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_addon/MultiPaxos_MC.tla">MultiPaxos_MC.tla</a> (entrance of running model checking; contains the checked constraints)</li>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_addon/MultiPaxos_MC.cfg">MultiPaxos_MC.cfg</a> (recommended model inputs and configurations, which should give 100% coverage of all interesting cases with default features)</li>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_addon/MultiPaxos_MC_small.cfg">MultiPaxos_MC_small.cfg</a> (smallest input for sanity check)</li>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_addon/MultiPaxos_MC_rwqrm.cfg">MultiPaxos_MC_rwqrm.cfg</a> (input demonstrating asymmetric write/read quorum sizes)</li>
  <li><a href="/assets/file/tla-specs/multipaxos_smr_addon/MultiPaxos_MC_lease.cfg">MultiPaxos_MC_lease.cfg</a> (input demonstrating stable leader leases and local read)</li>
</ul>

<h2 id="whats-new-in-the-extended-spec">What’s New in the Extended Spec</h2>

<p>The extended spec includes the following extra features/variants of MultiPaxos that are essential and useful in practice:</p>

<ul>
  <li>Only keep writes in the log (while reads squeeze in between writes)</li>
  <li>Asymmetric write/read quorum sizes</li>
  <li>Stable leader lease and local read at leader</li>
</ul>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://github.com/tlaplus/Examples">https://github.com/tlaplus/Examples</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://lamport.azurewebsites.net/tla/tutorial/home.html">https://lamport.azurewebsites.net/tla/tutorial/home.html</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://learntla.com/index.html">https://learntla.com/index.html</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://lamport.azurewebsites.net/tla/summary-standalone.pdf">https://lamport.azurewebsites.net/tla/summary-standalone.pdf</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://tla.msr-inria.inria.fr/tlatoolbox/doc/model/distributed-mode.html">https://tla.msr-inria.inria.fr/tlatoolbox/doc/model/distributed-mode.html</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Guanzhou Hu</name></author><category term="Technical" /><summary type="html"><![CDATA[The attached files present a practical TLA+ specification of MultiPaxos that very closely models how a real state machine replication (SMR) system would implement this protocol. I did not find anything similar on the web, so I’d like to share it with anyone interested.]]></summary></entry><entry><title type="html">Emulating a Distributed Network on a Single Linux Host</title><link href="https://www.josehu.com/technical/2023/10/28/emulating-network-env.html" rel="alternate" type="text/html" title="Emulating a Distributed Network on a Single Linux Host" /><published>2023-10-28T18:20:07+00:00</published><updated>2023-10-28T18:20:07+00:00</updated><id>https://www.josehu.com/technical/2023/10/28/emulating-network-env</id><content type="html" xml:base="https://www.josehu.com/technical/2023/10/28/emulating-network-env.html"><![CDATA[<p>Recently, I need to benchmark a lightweight distributed system codebase on a single host for my current research project. I want to have control over the network performance parameters (including delay, jitter distribution, rate, loss, etc.) and test a wide range of parameter values; meanwhile, I want to avoid pure software-based simulation. Thus, I opt in for using kernel-supported network emulation. In this post, I document what I tried and what finally worked.</p>

<p>This post assumes a recent Linux kernel version and was tested on v6.5.7.</p>

<h2 id="problem-setting">Problem Setting</h2>

<p>I have a distributed system codebase consisting of the following processes, each of which should conceptually run on a separate physical machine:</p>

<ul>
  <li>\(S\) server nodes,</li>
  <li>\(C\) client nodes,</li>
  <li>and one manager node.</li>
</ul>

<p>W.L.O.G., let’s ignore the client nodes and only talk about the server nodes plus the manager. I would like to test a wide range of different network performance parameters on the connections between servers. Doing that across real physical machines would be prohibitively resource-demanding, as it requires a fleet of powerful machines all connected to each other through physical links at least as strong as the “best” parameters you will test against. Processes in my codebase are not computation- or memory-demanding, though, so it is a good idea to run them on a single host and emulate a network environment among them.</p>

<p>There aren’t really any canonical tutorials online demonstrating how to do this. After a bit of searching &amp; digging, I found the Linux kernel’s built-in network emulation features to be quite promising.</p>

<h2 id="first-try-tc-netem-on-loopback">First Try: <code class="language-plaintext highlighter-rouge">tc</code> <code class="language-plaintext highlighter-rouge">netem</code> on LoopBack</h2>

<p>The first tool to introduce here is the <a href="https://man7.org/linux/man-pages/man8/tc-netem.8.html"><code class="language-plaintext highlighter-rouge">netem</code> <em>queueing discipline</em> (qdisc)</a> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> provided by the <code class="language-plaintext highlighter-rouge">tc</code> traffic control command. Each network interface in Linux can have an associated software queueing discipline that sits atop the device driver queue. <code class="language-plaintext highlighter-rouge">netem</code> is one such qdisc; it emulates various network properties, including delay, jitter distribution, rate, loss, corruption, duplication, and reordering.</p>

<p>For example, we can put a <code class="language-plaintext highlighter-rouge">netem</code> qdisc on the loopback interface that injects a 100ms delay with a Pareto-distributed jitter around 10ms and limits the rate as 1Gbps:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo tc qdisc add dev lo root netem delay 100ms 10ms distribution pareto rate 1gibit
~$ ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=192 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=226 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=193 ms
...
</code></pre></div></div>

<p>It feels natural to just put a <code class="language-plaintext highlighter-rouge">netem</code> qdisc on the <em>loopback</em> interface, let each process bind to a different port on <code class="language-plaintext highlighter-rouge">localhost</code>, and let them all talk to each other through loopback. It seemed to work pretty well until I found two significant caveats:</p>

<ol>
  <li>Since all processes transfer packets through the same loopback interface, packets with different source-destination pairs all congest in the same queue, creating unwanted interference and often overflowing it.</li>
  <li>Only one <code class="language-plaintext highlighter-rouge">netem</code> qdisc is allowed on each interface. This means we cannot emulate different parameters for different links among different pairs of processes.</li>
</ol>
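
<p>The second caveat is easy to reproduce: with the <code class="language-plaintext highlighter-rouge">netem</code> qdisc from above still installed as the root qdisc of <code class="language-plaintext highlighter-rouge">lo</code>, attempting to add another one fails (the exact error message may vary across <code class="language-plaintext highlighter-rouge">tc</code> versions):</p>

```shell
# Fails because lo already has a root qdisc installed; recent versions
# of tc report something like: "Error: Exclusivity flag on, cannot modify."
sudo tc qdisc add dev lo root netem delay 50ms
```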

<p>What we need are separate network interfaces for the processes.</p>

<h2 id="didnt-work-dummy-interfaces">Didn’t Work: <code class="language-plaintext highlighter-rouge">dummy</code> Interfaces</h2>

<p>The <a href="https://tldp.org/LDP/nag/node72.html"><code class="language-plaintext highlighter-rouge">dummy</code> kernel module</a> supports creating dummy network interfaces that route packets to the host itself. However, creating a <code class="language-plaintext highlighter-rouge">netem</code> qdisc on a dummy interface doesn’t really work <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">3</a></sup>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo ip link add dummy0 type dummy
~$ sudo ip addr add 192.168.77.0/24 dev dummy0
~$ sudo ip link set dummy0 up
~$ sudo tc qdisc add dev dummy0 root netem delay 10ms
~$ sudo tc qdisc show
...
qdisc netem 80f8: dev dummy0 root refcnt 2 limit 1000 delay 10ms
</code></pre></div></div>

<p>Though the qdisc is indeed listed, pinging the associated address still shows negligible delay:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ ping 192.168.77.0
PING 192.168.77.0 (192.168.77.0) 56(84) bytes of data.
64 bytes from 192.168.77.0: icmp_seq=1 ttl=64 time=0.019 ms
64 bytes from 192.168.77.0: icmp_seq=2 ttl=64 time=0.024 ms
64 bytes from 192.168.77.0: icmp_seq=3 ttl=64 time=0.024 ms
...
</code></pre></div></div>

<p>This is because the dummy interface is just a “wrapper”; it is still backed by the loopback interface behind the scenes. We can verify this fact using:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo ip route get 192.168.77.0
local 192.168.77.0 dev lo src 192.168.77.0 uid 0
    cache &lt;local&gt;
</code></pre></div></div>

<p>Notice that the route is reported as backed by <code class="language-plaintext highlighter-rouge">dev lo</code>. In fact, if a <code class="language-plaintext highlighter-rouge">netem</code> qdisc is still applied to loopback as in the previous section, you will see that delay when pinging <code class="language-plaintext highlighter-rouge">dummy0</code>. Dummy interfaces are not what we are looking for here.</p>
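
<p>Before moving on, it is a good idea to undo these experiments (assuming the qdisc and dummy device created above):</p>

```shell
# Remove the netem qdisc from loopback, restoring its default qdisc
sudo tc qdisc del dev lo root
# Delete the dummy interface altogether
sudo ip link delete dummy0
```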

<h2 id="solution-network-namespaces--veths">Solution: Network Namespaces &amp; <code class="language-plaintext highlighter-rouge">veth</code>s</h2>

<p>What we are actually looking for are <em>network namespaces</em> and <code class="language-plaintext highlighter-rouge">veth</code>-type interfaces <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">4</a></sup> <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">5</a></sup>. In Linux, a network namespace is an isolated network stack that processes can attach to. By default, all devices live in the default, unnamed namespace. One can create named namespaces using <code class="language-plaintext highlighter-rouge">ip netns</code> commands and assign them to running processes (or launch new processes directly inside them through <code class="language-plaintext highlighter-rouge">ip netns exec</code>). Namespaces are a perfect tool for our task here.</p>
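
<p>As a minimal illustration of the isolation (names here are arbitrary), a freshly created namespace contains nothing but its own loopback device, initially down:</p>

```shell
# Create a scratch namespace and list the devices visible inside it;
# only "lo" (state DOWN) should appear
sudo ip netns add nstest
sudo ip netns exec nstest ip link show
# Clean up
sudo ip netns delete nstest
```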

<p>To give connectivity between namespaces without bringing in physical devices, one can create <code class="language-plaintext highlighter-rouge">veth</code> (virtual Ethernet) interfaces. By design, <code class="language-plaintext highlighter-rouge">veth</code> interfaces come in pairs: the two ends of a pair are created together. <a href="https://medium.com/@mishu667/creating-two-network-namespaces-and-connect-them-with-virtual-ethernet-veth-devices-565f83af4c37#:~:text=Network%20namespaces%20provide%20a%20powerful,control%20network%20connectivity%20between%20them.">This post</a> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">6</a></sup> gives a nice demonstration of creating two namespaces and connecting them with a pair of <code class="language-plaintext highlighter-rouge">veth</code>s.</p>
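
<p>For instance, creating one end with a named peer creates both ends at once, and deleting either end removes the whole pair (device names here are arbitrary):</p>

```shell
# Create a veth pair; vethA and vethB are the two connected ends
sudo ip link add vethA type veth peer name vethB
# Each end lists the other as its peer, shown as e.g. "vethA@vethB"
ip link show type veth
# Deleting one end also deletes its peer
sudo ip link delete vethA
```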

<p>However, this is not enough for us, because we want more than 2 isolated devices, each able to talk to everyone else. To achieve this, we make use of a <em>bridge</em> device. We create one pair of <code class="language-plaintext highlighter-rouge">veth</code>s per namespace, move one end into the namespace while keeping the other end in the default namespace, and then bridge those kept ends together. All namespaces can then find a route to each other through the bridge; the bridge can also talk to each of the namespaces. Since we have a manager process, it is quite natural to let the manager use the bridge device and let each server process reside in its own namespace, using the <code class="language-plaintext highlighter-rouge">veth</code> device placed into it.</p>

<p>Let’s walk through this step-by-step for a 3-server setting.</p>

<ol>
  <li>
    <p>Create namespaces and assign them proper IDs:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ~$ sudo ip netns add ns0
 ~$ sudo ip netns set ns0 0
 ~$ sudo ip netns add ns1
 ~$ sudo ip netns set ns1 1
 ~$ sudo ip netns add ns2
 ~$ sudo ip netns set ns2 2
</code></pre></div>    </div>
  </li>
  <li>
    <p>Create a bridge device <code class="language-plaintext highlighter-rouge">brgm</code> and assign address <code class="language-plaintext highlighter-rouge">10.0.1.0</code> to it:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ~$ sudo ip link add brgm type bridge
 ~$ sudo ip addr add "10.0.1.0/16" dev brgm
 ~$ sudo ip link set brgm up
</code></pre></div>    </div>
  </li>
  <li>
    <p>Create <code class="language-plaintext highlighter-rouge">veth</code> pairs (<code class="language-plaintext highlighter-rouge">vethsX</code>-<code class="language-plaintext highlighter-rouge">vethsXm</code>) for servers, put the <code class="language-plaintext highlighter-rouge">vethsX</code> end into its corresponding namespace, and assign address <code class="language-plaintext highlighter-rouge">10.0.0.X</code> to it:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ~$ sudo ip link add veths0 type veth peer name veths0m
 ~$ sudo ip link set veths0 netns ns0
 ~$ sudo ip netns exec ns0 ip addr add "10.0.0.0/16" dev veths0
 ~$ sudo ip netns exec ns0 ip link set veths0 up
 # repeat for servers 1 and 2
</code></pre></div>    </div>
  </li>
  <li>
    <p>Put the <code class="language-plaintext highlighter-rouge">vethsXm</code> end under the bridge device:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ~$ sudo ip link set veths0m up
 ~$ sudo ip link set veths0m master brgm
 # repeat for servers 1 and 2
</code></pre></div>    </div>
  </li>
</ol>
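
<p>With the four steps done, a few quick sanity checks (assuming the device names above) confirm the wiring:</p>

```shell
# The bridge should list veths0m, veths1m, veths2m as attached ports
sudo ip link show master brgm
# Each namespace should see its veth end with the assigned address
sudo ip netns exec ns0 ip addr show dev veths0
# Basic connectivity from a server namespace to the manager's bridge address
sudo ip netns exec ns0 ping -c 1 10.0.1.0
```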

<p>This gives us a topology that looks like the following figure:</p>

<p><img src="/assets/img/net-emulation-topology.png" alt="Net Emulation Topology" /></p>

<p>Let’s do a bit of delay injection with <code class="language-plaintext highlighter-rouge">netem</code> to verify that this topology truly gives us what we want. Say we add 10ms delay to <code class="language-plaintext highlighter-rouge">veths1</code> and 20ms delay to <code class="language-plaintext highlighter-rouge">veths2</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo ip netns exec ns1 tc qdisc add dev veths1 root netem delay 10ms
~$ sudo ip netns exec ns2 tc qdisc add dev veths2 root netem delay 20ms
</code></pre></div></div>

<p>Pinging the manager from server 1:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo ip netns exec ns1 ping 10.0.1.0
PING 10.0.1.0 (10.0.1.0) 56(84) bytes of data.
64 bytes from 10.0.1.0: icmp_seq=1 ttl=64 time=10.1 ms
64 bytes from 10.0.1.0: icmp_seq=2 ttl=64 time=10.1 ms
64 bytes from 10.0.1.0: icmp_seq=3 ttl=64 time=10.1 ms
...
</code></pre></div></div>

<p>Pinging server 2 from server 1:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo ip netns exec ns1 ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=30.1 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=30.1 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=30.1 ms
...
</code></pre></div></div>

<p>All good!</p>

<p>There are also obvious ways to extend this topology for even more flexibility; for example, each server could own multiple devices in its namespace with different connectivity and different performance parameters. Let me describe one example extension below.</p>

<h2 id="extension-symmetric-ingress-emulation-w-ifbs">Extension: Symmetric Ingress Emulation w/ <code class="language-plaintext highlighter-rouge">ifb</code>s</h2>

<p>It is important to note that most of <code class="language-plaintext highlighter-rouge">netem</code>’s emulation functionality applies only to the <em>egress</em> side of the interface. This means all the injected delay happens on the sender side for every packet. In some cases, you might want to put custom performance emulation on the ingress side of the servers’ interfaces as well. To do so, we could utilize the special IFB (Intermediate Functional Block) devices <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>.</p>

<p>First, load the kernel module that implements IFB devices:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo modprobe ifb
</code></pre></div></div>

<p>By default, two devices <code class="language-plaintext highlighter-rouge">ifb0</code> and <code class="language-plaintext highlighter-rouge">ifb1</code> are added automatically. You can add more by doing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo ip link add ifb2 type ifb
</code></pre></div></div>

<p>We then bring one IFB device into each server’s namespace and redirect all the incoming traffic on the <code class="language-plaintext highlighter-rouge">veth</code> interface to go through the <code class="language-plaintext highlighter-rouge">ifb</code> device’s egress queue first. This is done by adding a special <code class="language-plaintext highlighter-rouge">ingress</code> qdisc to the <code class="language-plaintext highlighter-rouge">veth</code> (which can coexist with the egress <code class="language-plaintext highlighter-rouge">netem</code> qdisc we added earlier) and placing a filter rule that simply “moves” all ingress packets to the <code class="language-plaintext highlighter-rouge">ifb</code> interface’s egress queue. The <code class="language-plaintext highlighter-rouge">ifb</code> device automatically moves packets back after they have gone through its egress queue.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo ip link set ifb0 netns ns0
~$ sudo ip netns exec ns0 tc qdisc add dev veths0 ingress
~$ sudo ip netns exec ns0 tc filter add dev veths0 parent ffff: protocol all u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb0
~$ sudo ip netns exec ns0 ip link set ifb0 up
</code></pre></div></div>

<p>We can then put a <code class="language-plaintext highlighter-rouge">netem</code> qdisc on the <code class="language-plaintext highlighter-rouge">ifb</code> interface, which effectively emulates specified performance on the ingress of the <code class="language-plaintext highlighter-rouge">veth</code>. For example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ sudo ip netns exec ns0 tc qdisc add dev ifb0 root netem delay 5ms rate 1gibit
</code></pre></div></div>
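
<p>We can verify the effect from the manager side: a ping from the default namespace to server 0 now traverses <code class="language-plaintext highlighter-rouge">ifb0</code>’s queue on its way in, so it should report roughly the extra 5ms (assuming no other qdisc adds delay on this path):</p>

```shell
# Ping server 0 from the manager; expect ~5ms round-trip time
ping -c 3 10.0.0.0
```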

<h2 id="summary">Summary</h2>

<p>To emulate network links among distributed processes on a single host, beyond the limitations of a single loopback interface, we can take the following steps:</p>

<ul>
  <li>Create separate network namespaces, probably one for each process, using <code class="language-plaintext highlighter-rouge">ip netns</code>.</li>
  <li>Create <code class="language-plaintext highlighter-rouge">veth</code> interface pairs, probably one pair for each process, using <code class="language-plaintext highlighter-rouge">ip link ... type veth</code>.</li>
  <li>Put one end of each <code class="language-plaintext highlighter-rouge">veth</code> pair into the corresponding namespace, then keep the other ends and create a bridge that stitches them together.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">tc qdisc ... netem</code> to apply the <code class="language-plaintext highlighter-rouge">netem</code> queueing discipline with desired parameters on the <code class="language-plaintext highlighter-rouge">veth</code> devices for each process.</li>
  <li>Run the processes with their corresponding network namespace attached.</li>
</ul>

<p>Below is a script that sets up the topology described above for a given number of server processes:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#! /bin/bash</span>

<span class="nv">NUM_SERVERS</span><span class="o">=</span><span class="nv">$1</span>


<span class="nb">echo
echo</span> <span class="s2">"Deleting existing namespaces &amp; veths..."</span>
<span class="nb">sudo </span>ip <span class="nt">-all</span> netns delete
<span class="nb">sudo </span>ip <span class="nb">link </span>delete brgm
<span class="k">for </span>v <span class="k">in</span> <span class="si">$(</span>ip a | <span class="nb">grep </span>veth | <span class="nb">cut</span> <span class="nt">-d</span><span class="s1">' '</span> <span class="nt">-f</span> 2 | rev | <span class="nb">cut</span> <span class="nt">-c2-</span> | rev | <span class="nb">cut</span> <span class="nt">-d</span> <span class="s1">'@'</span> <span class="nt">-f</span> 1<span class="si">)</span>      
<span class="k">do
    </span><span class="nb">sudo </span>ip <span class="nb">link </span>delete <span class="nv">$v</span>
<span class="k">done


</span><span class="nb">echo
echo</span> <span class="s2">"Adding namespaces for servers..."</span>
<span class="k">for</span> <span class="o">((</span> s <span class="o">=</span> 0<span class="p">;</span> s &lt; <span class="nv">$NUM_SERVERS</span><span class="p">;</span> s++ <span class="o">))</span>
<span class="k">do
    </span><span class="nb">sudo </span>ip netns add ns<span class="nv">$s</span>
    <span class="nb">sudo </span>ip netns <span class="nb">set </span>ns<span class="nv">$s</span> <span class="nv">$s</span>
<span class="k">done


</span><span class="nb">echo
echo</span> <span class="s2">"Loading ifb module &amp; creating ifb devices..."</span>
<span class="nb">sudo </span>rmmod ifb
<span class="nb">sudo </span>modprobe ifb  <span class="c"># by default, add ifb0 &amp; ifb1 automatically</span>
<span class="k">for</span> <span class="o">((</span> s <span class="o">=</span> 2<span class="p">;</span> s &lt; <span class="nv">$NUM_SERVERS</span><span class="p">;</span> s++ <span class="o">))</span>
<span class="k">do
    </span><span class="nb">sudo </span>ip <span class="nb">link </span>add ifb<span class="nv">$s</span> <span class="nb">type </span>ifb
<span class="k">done


</span><span class="nb">echo
echo</span> <span class="s2">"Creating bridge device for manager..."</span>
<span class="nb">sudo </span>ip <span class="nb">link </span>add brgm <span class="nb">type </span>bridge
<span class="nb">sudo </span>ip addr add <span class="s2">"10.0.1.0/16"</span> dev brgm
<span class="nb">sudo </span>ip <span class="nb">link set </span>brgm up


<span class="nb">echo
echo</span> <span class="s2">"Creating &amp; assigning veths for servers..."</span>
<span class="k">for</span> <span class="o">((</span> s <span class="o">=</span> 0<span class="p">;</span> s &lt; <span class="nv">$NUM_SERVERS</span><span class="p">;</span> s++ <span class="o">))</span>
<span class="k">do
    </span><span class="nb">sudo </span>ip <span class="nb">link </span>add veths<span class="nv">$s</span> <span class="nb">type </span>veth peer name veths<span class="k">${</span><span class="nv">s</span><span class="k">}</span>m
    <span class="nb">sudo </span>ip <span class="nb">link set </span>veths<span class="k">${</span><span class="nv">s</span><span class="k">}</span>m up
    <span class="nb">sudo </span>ip <span class="nb">link set </span>veths<span class="k">${</span><span class="nv">s</span><span class="k">}</span>m master brgm
    <span class="nb">sudo </span>ip <span class="nb">link set </span>veths<span class="nv">$s</span> netns ns<span class="nv">$s</span>
    <span class="nb">sudo </span>ip netns <span class="nb">exec </span>ns<span class="nv">$s</span> ip addr add <span class="s2">"10.0.0.</span><span class="nv">$s</span><span class="s2">/16"</span> dev veths<span class="nv">$s</span>
    <span class="nb">sudo </span>ip netns <span class="nb">exec </span>ns<span class="nv">$s</span> ip <span class="nb">link set </span>veths<span class="nv">$s</span> up
<span class="k">done


</span><span class="nb">echo
echo</span> <span class="s2">"Redirecting veth ingress to ifb..."</span>
<span class="k">for</span> <span class="o">((</span> s <span class="o">=</span> 0<span class="p">;</span> s &lt; <span class="nv">$NUM_SERVERS</span><span class="p">;</span> s++ <span class="o">))</span>
<span class="k">do
    </span><span class="nb">sudo </span>ip <span class="nb">link set </span>ifb<span class="nv">$s</span> netns ns<span class="nv">$s</span>
    <span class="nb">sudo </span>ip netns <span class="nb">exec </span>ns<span class="nv">$s</span> tc qdisc add dev veths<span class="nv">$s</span> ingress
    <span class="nb">sudo </span>ip netns <span class="nb">exec </span>ns<span class="nv">$s</span> tc filter add dev veths<span class="nv">$s</span> parent ffff: protocol all u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb<span class="nv">$s</span>
    <span class="nb">sudo </span>ip netns <span class="nb">exec </span>ns<span class="nv">$s</span> ip <span class="nb">link set </span>ifb<span class="nv">$s</span> up
<span class="k">done


</span><span class="nb">echo
echo</span> <span class="s2">"Listing devices in default namespace:"</span>
<span class="nb">sudo </span>ip <span class="nb">link </span>show


<span class="nb">echo
echo</span> <span class="s2">"Listing all named namespaces:"</span>
<span class="nb">sudo </span>ip netns list


<span class="k">for</span> <span class="o">((</span> s <span class="o">=</span> 0<span class="p">;</span> s &lt; <span class="nv">$NUM_SERVERS</span><span class="p">;</span> s++ <span class="o">))</span>
<span class="k">do
    </span><span class="nb">echo
    echo</span> <span class="s2">"Listing devices in namespace ns</span><span class="nv">$s</span><span class="s2">:"</span>
    <span class="nb">sudo </span>ip netns <span class="nb">exec </span>ns<span class="nv">$s</span> ip <span class="nb">link </span>show
<span class="k">done</span>
</code></pre></div></div>
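
<p>Assuming the script is saved as, say, <code class="language-plaintext highlighter-rouge">netns_setup.sh</code> (a name chosen here just for illustration), a typical session sets up the topology and then launches each process inside its namespace. The <code class="language-plaintext highlighter-rouge">./server</code> and <code class="language-plaintext highlighter-rouge">./manager</code> binaries and their flags below are placeholders for your own programs:</p>

```shell
# Set up namespaces, veths, the bridge, and ifbs for 3 servers
./netns_setup.sh 3

# Launch each server process inside its own namespace
sudo ip netns exec ns0 ./server --addr 10.0.0.0:50000 &
sudo ip netns exec ns1 ./server --addr 10.0.0.1:50000 &
sudo ip netns exec ns2 ./server --addr 10.0.0.2:50000 &

# The manager runs in the default namespace and
# reaches the servers through the bridge
./manager --addr 10.0.1.0:50000
```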

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://man7.org/linux/man-pages/man8/tc-netem.8.html">https://man7.org/linux/man-pages/man8/tc-netem.8.html</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://tldp.org/LDP/nag/node72.html">https://tldp.org/LDP/nag/node72.html</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://man7.org/linux/man-pages/man8/ip-link.8.html">https://man7.org/linux/man-pages/man8/ip-link.8.html</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://superuser.com/questions/764986/howto-setup-a-veth-virtual-network">https://superuser.com/questions/764986/howto-setup-a-veth-virtual-network</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://man7.org/linux/man-pages/man8/ip-netns.8.html">https://man7.org/linux/man-pages/man8/ip-netns.8.html</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://medium.com/@mishu667/creating-two-network-namespaces-and-connect-them-with-virtual-ethernet-veth-devices-565f83af4c37#:~:text=Network%20namespaces%20provide%20a%20powerful,control%20network%20connectivity%20between%20them.">https://medium.com/@mishu667/creating-two-network-namespaces-and-connect-them-with-virtual-ethernet-veth-devices-565f83af4c37#:~:text=Network%20namespaces%20provide%20a%20powerful,control%20network%20connectivity%20between%20them.</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="http://linux-ip.net/gl/tc-filters/tc-filters-node3.html">http://linux-ip.net/gl/tc-filters/tc-filters-node3.html</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Guanzhou Hu</name></author><category term="Technical" /><summary type="html"><![CDATA[Recently, I need to benchmark a lightweight distributed system codebase on a single host for my current research project. I want to have control over the network performance parameters (including delay, jitter distribution, rate, loss, etc.) and test a wide range of parameter values; meanwhile, I want to avoid pure software-based simulation. Thus, I opt in for using kernel-supported network emulation. In this post, I document what I tried and what finally worked.]]></summary></entry><entry><title type="html">Revisiting My Distributed Replication Consistency Models Post</title><link href="https://www.josehu.com/technical/2023/04/05/consistency-models-revisited.html" rel="alternate" type="text/html" title="Revisiting My Distributed Replication Consistency Models Post" /><published>2023-04-05T13:47:05+00:00</published><updated>2023-04-05T13:47:05+00:00</updated><id>https://www.josehu.com/technical/2023/04/05/consistency-models-revisited</id><content type="html" xml:base="https://www.josehu.com/technical/2023/04/05/consistency-models-revisited.html"><![CDATA[<p>Previously, I made a <a href="https://www.josehu.com/technical/2020/05/23/consistency-models.html">blog post</a> about common consistency models in distributed state machine replication (SMR). As I am recently picking up my scattered knowledge about distributed replication systems, I found some inaccuracy and ambiguity in that old post. This short post lists some patches and complementary material I revisited on this convoluted topic.</p>

<h2 id="revisited-material">Revisited Material</h2>

<ul>
  <li><strong><a href="https://decentralizedthoughts.github.io/start-here/">Decentralized Thoughts blog series</a></strong>: a really good blog series on consensus problems. Although most of the advanced blog posts there focus on decentralized Byzantine-fault-tolerant systems, the beginner posts offer a great summary of the problem and the foundational models we are studying.</li>
  <li><strong><a href="https://arxiv.org/abs/2409.01576">My practical technical report</a></strong>: a “somewhat formal” summary of non-transactional consistency models and availability models for distributed replication. This report corrects some inaccuracies/errors in my previous blog post and extends it to a broader scope.</li>
</ul>]]></content><author><name>Guanzhou Hu</name></author><category term="Technical" /><summary type="html"><![CDATA[Previously, I made a blog post about common consistency models in distributed state machine replication (SMR). As I am recently picking up my scattered knowledge about distributed replication systems, I found some inaccuracy and ambiguity in that old post. This short post lists some patches and complementary material I revisited on this convoluted topic.]]></summary></entry><entry><title type="html">Understanding Hierarchical Locking in Database Systems</title><link href="https://www.josehu.com/technical/2022/10/06/dbms-hierarchical-locking.html" rel="alternate" type="text/html" title="Understanding Hierarchical Locking in Database Systems" /><published>2022-10-06T22:07:17+00:00</published><updated>2022-10-06T22:07:17+00:00</updated><id>https://www.josehu.com/technical/2022/10/06/dbms-hierarchical-locking</id><content type="html" xml:base="https://www.josehu.com/technical/2022/10/06/dbms-hierarchical-locking.html"><![CDATA[<p>Described in <a href="https://dl.acm.org/doi/10.1145/1282480.1282513">this classic paper</a> by Jim Gray et al., <em>hierarchical locking</em> has been a well-studied idea in database management systems (DBMS). Despite its long history, I found the theoretical notion of lock modes less intuitive and hard to understand upon first encounter. This post tries to distill the core motivations of hierarchical locking, break its design down into three pieces, and describe them progressively, to hopefully clarify this beautiful idea.</p>

<h2 id="traditional-locking">Traditional Locking</h2>

<p>Consider a database with only one small table (i.e. relation), shared by multiple clients. The clients could issue concurrent transactions that read some tuples (i.e. records) of the table or update them with new values. To protect the database from data races, it is pretty natural to apply a traditional <em>reader-writer</em> lock on the table.</p>

<p style="text-align:center;">
    <img src="/assets/img/db-locking-traditional.png" width="180px" alt="Traditional Locking" />
</p>

<p>In database terminology, we denote acquiring a reader lock on the table as locking it in <em>shared</em> (<code class="language-plaintext highlighter-rouge">S</code>) mode, while acquiring a writer lock on the table as locking it in <em>exclusive</em> (<code class="language-plaintext highlighter-rouge">X</code>) mode. Multiple clients could hold <code class="language-plaintext highlighter-rouge">S</code> locks on the same table at the same time for reads. At most one client could hold an <code class="language-plaintext highlighter-rouge">X</code> lock on the table (and no <code class="language-plaintext highlighter-rouge">S</code> locks may be held by anyone else).</p>

<p>We call two locking attempts <em>compatible</em> if their lock modes are allowed to be held at the same time on the same thing. <code class="language-plaintext highlighter-rouge">S</code> mode is compatible with itself. <code class="language-plaintext highlighter-rouge">S</code> and <code class="language-plaintext highlighter-rouge">X</code> are not compatible with each other. <code class="language-plaintext highlighter-rouge">X</code> is of course also not compatible with itself.</p>

<p>Back to our problem scenario, since the database has only one table with a small number of tuples, a reasonable solution is to put a lock on that table. Read requests must attempt to acquire the lock in <code class="language-plaintext highlighter-rouge">S</code> mode and can proceed only after the acquisition succeeds. Write requests must attempt to acquire it in <code class="language-plaintext highlighter-rouge">X</code> mode. This is basically how a reader-writer lock works in classic systems. So far, so good.</p>

<p><u>Problem</u>: what if the database is not in toy scale any more, but is composed of hundreds of tables, each having millions of records? Real-world databases can easily reach this scale. The traditional locking mechanism with uniform granularity puts a dilemma on choosing the <strong>granularity of locks</strong>:</p>

<ul>
  <li>
    <p>Huge DB lock: we could choose to lock on coarse granularity, e.g., the entire database. However, it unacceptably hurts <em>concurrency</em>; a client transaction updating only one tuple in one table would block all other clients that try to read disjoint sets of tuples in the database.</p>
  </li>
  <li>
    <p>One lock per tuple: alternatively, we could choose to put locks only at the finest granularity, in this case, tuples. A client transaction only locks the tuples it would touch in the desired mode. This way, concurrency is preserved. The problem is that it forces large transactions to touch too many locks; e.g., a transaction that scans all tuples of a table will have to acquire potentially millions of locks. This can easily lead to prohibitive performance overhead.</p>
  </li>
</ul>

<p>Neither choice is ideal for overall performance. The solution to this problem is to introduce <em>hierarchical locking</em> on different levels of database resources.</p>

<h2 id="hierarchical-locking">Hierarchical Locking</h2>

<p>A database is naturally structured as a tree (or more generally, a DAG) of <em>resources</em>. For example, the following figure represents a database with 3 tables, each having 100 tuples. Tuples could further be decomposed into fields (i.e. attributes or columns); we consider tuples as the finest granularity in this post.</p>

<p style="text-align:center;">
    <img src="/assets/img/db-locking-tree-hierarchy.png" width="420px" alt="Tree hierarchy" />
</p>

<p>The core idea of <em>hierarchical locking</em> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> is to allow putting locks on <em>nodes</em> of the tree (which may be at different granularity levels), instead of only at a uniform granularity.</p>

<h3 id="version-1-introduce-implicit-locking">Version #1: Introduce Implicit Locking</h3>

<p>In the first step towards hierarchical locking, we introduce <em>implicit locking</em>: locking an internal node in <code class="language-plaintext highlighter-rouge">S</code> mode <em>implicitly</em> locks all its descendant nodes with <code class="language-plaintext highlighter-rouge">S</code> mode; <code class="language-plaintext highlighter-rouge">X</code> mode behaves similarly.</p>

<ul>
  <li>If a client wants to read or update only a few tuples, it better acquire <code class="language-plaintext highlighter-rouge">S</code> or <code class="language-plaintext highlighter-rouge">X</code> locks on the individual tuples.</li>
  <li>If a client wants to scan or update most of the tuples of a table, it better acquire a single <code class="language-plaintext highlighter-rouge">S</code> or <code class="language-plaintext highlighter-rouge">X</code> lock on the table – this implicitly grants <code class="language-plaintext highlighter-rouge">S</code> or <code class="language-plaintext highlighter-rouge">X</code> permissions on child nodes of the table, in this case the tuples in it, to the client.</li>
  <li>Compatibility between modes follows the same rules as in traditional locking.</li>
</ul>

<p style="text-align:center;">
    <img src="/assets/img/db-locking-implicit-locking.png" width="340px" alt="Implicit Locking" />
</p>

<p>Implicit locking reduces the number of locks dramatically in cases of bulk operations, which nicely solves the performance problem of fine-grained locking. However, this mechanism itself is not enough, because it introduces correctness problems.</p>

<p style="text-align:center;">
    <img src="/assets/img/db-locking-conflict-error.png" width="340px" alt="Conflict Error" />
</p>

<p><u>Problem</u>: what about conflicting transactions that end up holding conflicting lock modes at different levels? Suppose transaction B holds an <code class="language-plaintext highlighter-rouge">X</code> lock on tuple <code class="language-plaintext highlighter-rouge">R99</code> in table 0 and is going to update it. Transaction A then comes and acquires a single <code class="language-plaintext highlighter-rouge">S</code> lock on table 0 to read all of its tuples. This situation should not be allowed, and there are more incorrect scenarios besides this example.</p>

<h3 id="version-2-introduce-intention-modes">Version #2: Introduce Intention Modes</h3>

<p>To solve the correctness problem, we need to let internal nodes remember the locking states of their children. We introduce two <em>intention</em> lock modes: <em>intention shared</em> (<code class="language-plaintext highlighter-rouge">IS</code>) mode and <em>intention exclusive</em> (<code class="language-plaintext highlighter-rouge">IX</code>) mode.</p>

<p>To lock a node in <code class="language-plaintext highlighter-rouge">X</code> mode, the client must <em>traverse the tree from the root</em> and lock all ancestor nodes along the path in <code class="language-plaintext highlighter-rouge">IX</code> mode, before locking the target node in <code class="language-plaintext highlighter-rouge">X</code>. Similarly, to lock a node in <code class="language-plaintext highlighter-rouge">S</code> mode, the client must traverse the tree from the root and lock all ancestor nodes in <code class="language-plaintext highlighter-rouge">IS</code> mode, before locking the target node in <code class="language-plaintext highlighter-rouge">S</code>. By doing this, internal nodes now carry the necessary information about the locking state of the descendant nodes in their subtrees.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">IS</code> and <code class="language-plaintext highlighter-rouge">S</code> modes are compatible: it is allowed to acquire an <code class="language-plaintext highlighter-rouge">S</code> lock on a node already locked in <code class="language-plaintext highlighter-rouge">IS</code> mode – the two clients will likely share read permissions on some children.</li>
  <li><code class="language-plaintext highlighter-rouge">IS</code> and <code class="language-plaintext highlighter-rouge">X</code> modes are not compatible: children of a node being updated by someone cannot be read by anyone else.</li>
  <li><code class="language-plaintext highlighter-rouge">IX</code> and <code class="language-plaintext highlighter-rouge">S</code> modes are not compatible: if a node and all its children are being read by someone, it is not allowed to grant any write permissions in this subtree to anyone else.</li>
  <li><code class="language-plaintext highlighter-rouge">IX</code> and <code class="language-plaintext highlighter-rouge">X</code> modes are obviously not compatible.</li>
  <li><code class="language-plaintext highlighter-rouge">IS</code> mode is compatible with itself: multiple clients could be reading children of this node.</li>
  <li><code class="language-plaintext highlighter-rouge">IX</code> mode is compatible with itself: multiple clients could be updating disjoint sets of children. Conflicts, if any, will be <em>resolved at lower levels of the subtree</em>.</li>
  <li><code class="language-plaintext highlighter-rouge">IX</code> and <code class="language-plaintext highlighter-rouge">IS</code> modes are compatible: multiple clients could be reading and updating disjoint sets of children. Possible conflicts are again resolved at lower levels.</li>
</ul>

<p>By always traversing the tree from the root and locking ancestor nodes in intention modes (and releasing them in the reverse order when done), the correctness problem described in the previous section is solved. Transaction B must already have locked table 0 in <code class="language-plaintext highlighter-rouge">IX</code> before it locks tuple <code class="language-plaintext highlighter-rouge">R99</code> in <code class="language-plaintext highlighter-rouge">X</code>, which prevents transaction A from locking the entire table in <code class="language-plaintext highlighter-rouge">S</code>. If A and B touch different tables, however, they can proceed concurrently.</p>
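The traversal protocol above is small enough to sketch in code. The following toy illustration (the names and data structures are my own, not from the paper) replays the problematic scenario: transaction B's intention locks on the ancestors of `R99` block transaction A's table-wide `S` lock.

```python
# Compatibility between the four modes of version #2 (requested vs. held).
COMPATIBLE = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.holders = {}  # transaction id -> mode held on this node

    def try_lock(self, txn, mode):
        # Grant only if the requested mode is compatible with every mode
        # held by *other* transactions on this node.
        if all(COMPATIBLE[(held, mode)]
               for t, held in self.holders.items() if t != txn):
            self.holders[txn] = mode
            return True
        return False

def lock(node, txn, mode):
    # Traverse from the root, taking intention locks on all ancestors
    # before locking the target node itself. (A real lock manager would
    # also release the intention locks if the final acquisition fails.)
    intent = "IX" if mode == "X" else "IS"
    path, n = [], node.parent
    while n is not None:
        path.append(n)
        n = n.parent
    for ancestor in reversed(path):  # root first
        if not ancestor.try_lock(txn, intent):
            return False
    return node.try_lock(txn, mode)

# The scenario from the text: B locks tuple R99 in X (taking IX on the
# database root and on table 0 along the way); A's S lock on table 0
# must then be denied, because IX and S conflict.
db = Node("database")
table0 = Node("table0", db)
r99 = Node("R99", table0)

assert lock(r99, "B", "X") is True
assert lock(table0, "A", "S") is False
```

Note how the conflict is caught at the table level without inspecting any tuple: that is exactly the information the intention modes propagate upward.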

<p style="text-align:center;">
    <img src="/assets/img/db-locking-hierarchical-locking.png" width="380px" alt="Hierarchical Locking" />
</p>

<p><u>Problem</u>: consider a workload that scans a big table while only attempting to update a few tuples in it. With the current version of hierarchical locking, it must either hold a big <code class="language-plaintext highlighter-rouge">X</code> lock on the table, or hold many <code class="language-plaintext highlighter-rouge">S</code> locks on tuples it reads. Can we further optimize performance for this situation?</p>

<h3 id="version-3-introduce-six-mode-as-an-optimization">Version #3: Introduce <code class="language-plaintext highlighter-rouge">SIX</code> Mode as an Optimization</h3>

<p>We introduce a combined mode of <code class="language-plaintext highlighter-rouge">S</code> and <code class="language-plaintext highlighter-rouge">IX</code> to optimize for the aforementioned situation. The <em>shared and intention exclusive</em> (<code class="language-plaintext highlighter-rouge">SIX</code>) mode grants the client read permission on all children, while allowing it to further acquire <code class="language-plaintext highlighter-rouge">X</code> locks on some child nodes. This way, the client can hold a single <code class="language-plaintext highlighter-rouge">SIX</code> lock on the table plus a few <code class="language-plaintext highlighter-rouge">X</code> locks on tuples it is trying to modify.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">SIX</code> and <code class="language-plaintext highlighter-rouge">IS</code> modes are compatible: two clients can have disjoint sets of children nodes locked in <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">S</code> modes, respectively. Conflicts, if any, will be resolved at those lower levels.</li>
  <li><code class="language-plaintext highlighter-rouge">SIX</code> is not compatible with any mode other than <code class="language-plaintext highlighter-rouge">IS</code>, including itself. Reasoning behind this is left as an exercise for the reader.</li>
</ul>

<p style="text-align:center;">
    <img src="/assets/img/db-locking-six-mode.png" width="240px" alt="SIX Mode" />
</p>

<p>The original paper <sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> presents a nice summary of compatibility between modes. Note that <code class="language-plaintext highlighter-rouge">NL</code> simply stands for null lock (i.e. not locked).</p>

<p style="text-align:center;">
    <img src="/assets/img/db-locking-compatibility-table.png" width="400px" alt="Compatibility Table" />
</p>
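The table can also be sanity-checked programmatically. Below is a small sketch (the encoding is mine; `NL` is omitted since it is compatible with every mode) that verifies the matrix is symmetric, and confirms the exercise from the previous section: since `SIX` combines `S` and `IX`, its compatibility row is exactly the intersection of the `S` and `IX` rows.

```python
# COMPAT[a] = set of modes that may be held concurrently with mode a,
# per the compatibility table from the Granularity of Locks paper.
COMPAT = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}

# The matrix is symmetric: a is compatible with b iff b is compatible with a.
for a in COMPAT:
    for b in COMPAT:
        assert (b in COMPAT[a]) == (a in COMPAT[b])

# SIX = S + IX, so SIX's row is the intersection of the S and IX rows:
# a mode conflicts with SIX if it conflicts with either component.
assert COMPAT["SIX"] == COMPAT["S"] & COMPAT["IX"]
```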

<h2 id="related-issues">Related Issues</h2>

<p>Concurrency control in database systems involves many more interesting issues beyond hierarchical locking. To name a few examples:</p>

<ul>
  <li><em>Semantic locking</em> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>: we can have more lock purposes beyond plain reads and writes. For example, <em>increment</em> operations can have their own semantics and be compatible with other concurrent increments. This allows us to manage locks with more compatibility modes.</li>
  <li><em>Deadlock</em> solutions <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>: <em>deadlock detection</em> by maintaining a dependency “wait-for” graph, or <em>deadlock prevention</em> (No-Wait, Wait-Die, Wound-Wait), etc.</li>
  <li><em>Two-phase locking</em> (2PL) <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>: within each transaction, locks must all be acquired during a growing phase and then released during a shrinking phase – once any lock is released, the transaction must not acquire new locks. This conservative protocol maintains <em>serializability</em> among transactions (though basic 2PL by itself can still deadlock).</li>
  <li><em>Optimistic concurrency control</em> (OCC) <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>, consistency and durability, <em>two-phase commit</em> (2PC) <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>, …</li>
</ul>

<p>Some of these things have been covered in my past blog posts. Other techniques and their modern implications may be covered in my future blog posts.</p>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://dl.acm.org/doi/10.1145/1282480.1282513">https://dl.acm.org/doi/10.1145/1282480.1282513</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Multiple_granularity_locking">https://en.wikipedia.org/wiki/Multiple_granularity_locking</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://dl.acm.org/doi/10.1145/191081.191144">https://dl.acm.org/doi/10.1145/191081.191144</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Deadlock">https://en.wikipedia.org/wiki/Deadlock</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Two-phase_locking">https://en.wikipedia.org/wiki/Two-phase_locking</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Optimistic_concurrency_control">https://en.wikipedia.org/wiki/Optimistic_concurrency_control</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Two-phase_commit_protocol">https://en.wikipedia.org/wiki/Two-phase_commit_protocol</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Guanzhou Hu</name></author><category term="Technical" /><summary type="html"><![CDATA[Described in this classic paper by Jim Gray et. al, hierarchical locking has been a well-studied idea in database management systems (DBMS). Despite its long history, I found the theoretical notion of lock modes less intuitive and hard to understand upon first encounter. This post tries to distill the core motivations of hierarchical locking, break its design down into three pieces, and describe them progressively, to hopefully clarify this beautiful idea.]]></summary></entry><entry><title type="html">Systems for AI and AI for Systems: Some Chitter-Chatter</title><link href="https://www.josehu.com/technical/2022/05/21/sys-for-ai-and-ai-for-sys.html" rel="alternate" type="text/html" title="Systems for AI and AI for Systems: Some Chitter-Chatter" /><published>2022-05-21T18:29:30+00:00</published><updated>2022-05-21T18:29:30+00:00</updated><id>https://www.josehu.com/technical/2022/05/21/sys-for-ai-and-ai-for-sys</id><content type="html" xml:base="https://www.josehu.com/technical/2022/05/21/sys-for-ai-and-ai-for-sys.html"><![CDATA[<p>This is a short post where I note down some of my insignificant thoughts about the interaction between AI and systems. With the rapid evolution of AI technologies, especially in the field of <em>machine learning</em> (ML), there is now a rising interest in studying the intersection between AI and <em>computer systems</em> design. The combination of the two can further be categorized into two directions: building systems for AI applications (Sys for AI) and using AI to empower smarter systems (AI for Sys).</p>

<h2 id="systems-for-ai">Systems for AI</h2>

<p>The very early forms of AI, namely small-scale statistical algorithms, didn’t attract much attention from computer architects and system builders. They were treated as yet another type of ordinary application workload. System researchers had other issues to deal with, such as the I/O bottleneck, which appeared to be more urgent problems at the time.</p>

<p>Around the year 2010, system researchers started to pay attention to something slightly closer to AI – what we later call “Big Data” applications – thanks to the emergence of Hadoop MapReduce<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and Spark<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. A typical example of such a Big Data application is an iterative graph-processing algorithm such as PageRank. These workloads demand notably more compute power as well as higher storage performance, pushing datacenters to go really large-scale and become vastly distributed. Combined with technical advances in other areas, including OS virtualization, high-speed networking, and advanced architectures, they led to the success of large-scale datacenters and cloud computing (beyond traditional HPC).</p>

<p>Then came machine learning (ML) – more specifically, <em>deep learning</em> (DL) models. There’s no need for me to emphasize how much attention these data-hungry workloads have attracted across computer science in recent years. Their requirements for tremendous amounts of data storage, massive parallel computation, and heavy communication have made them some of the most important and challenging workloads. People have done many things to build better systems for ML, and nothing seems to be stopping this trend so far:</p>

<ul>
  <li>Hardware: GP-GPU, specialized tensor computation hardware such as TPU, …</li>
  <li>Programming: auto differentiation (Autograd), just-in-time compilation (JAX), …</li>
  <li>Computation: highly-optimized libraries, various systems for scalable and high-throughput training, scheduling, …</li>
  <li>Communication: collective communication for distributed training, parameter servers, high-speed interconnect (NVLink), …</li>
  <li>Profiling: performance monitoring, …</li>
  <li>Serving: low-latency inference, performance predictability, …</li>
  <li>Storage: data I/O optimizations, model checkpointing, …</li>
</ul>

<p>With big models (billions of parameters) gaining popularity, AI continues to be one of the main driving forces of the advancement of computing infrastructure. Many top conferences in, e.g., the systems area now have one or two sessions dedicated to ML systems (see <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> for an example). There’s even a conference specialized in this topic, MLSys<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>, which started in 2018.</p>

<h2 id="ai-for-systems">AI for Systems</h2>

<p>The interaction between AI and systems can also go the other way around: deploying AI algorithms to help design and implement smarter computer systems infrastructure, in short, AI for Sys. A natural question to ask at this point is: what are the problems in computer systems that AI techniques could really solve better than experienced developers? This is a tough question and many systems researchers are still trying to find a reasonable answer.</p>

<h3 id="heuristics-might-be-a-good-entry-point">Heuristics Might Be A Good Entry Point</h3>

<p>One such opportunity, in my opinion, is to use AI algorithms to improve or replace <strong>heuristics</strong>. System builders have long been putting heuristics here and there in all kinds of systems.</p>

<p>For example, cache eviction algorithms in data store systems rely heavily on heuristics about the incoming workload to decide which entry to evict when the cache is full. Many production systems still choose a simple heuristic such as LRU (least-recently used) that might not fit the actual workload well and is not resistant to large scans. If you are interested, here is <a href="https://josehu.com/technical/2020/08/07/cache-eviction-algorithms.html">a post</a> <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> I wrote earlier about cache modes and eviction algorithms.</p>

<p>Another example of heuristics is magic configuration numbers. A hash table implementation needs to decide how many buckets to create initially and how many more to add upon resizing. A database system needs to decide how much memory to allocate for the block cache, etc. Magic numbers are everywhere, and they are typically just chosen by an experienced system designer with very few assumptions about the actual workload the system is going to serve.</p>

<p>AI techniques, especially data-driven ML models, seem to be a good fit to replace such heuristics. Given that a workload has its own statistical characteristics, we may assume that it is drawn from some probability distribution and is thus learnable by a smart enough ML model. Indeed, there are quite a few recent research papers addressing this opportunity. Just to name a few off the top of my head:</p>

<ul>
  <li>Bourbon<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>: applying <em>learned indexes</em> in LevelDB to speed up the searching of keys</li>
  <li>Stacked Filters<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>: applying <em>learned filters</em> in database queries for more efficient filtering</li>
  <li>Entropy-Learned Hashing<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>: discovering patterns in incoming keys to reduce the cost of hashing</li>
  <li>Learning on distributed <em>traces</em> for making decisions in datacenter storage systems<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup></li>
  <li>LlamaTune<sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>: example of DBMS configuration <em>knobs tuning</em> on given workloads</li>
</ul>

<p>However, ML models are not a free plug-and-play replacement for these decision-making heuristics. The real workload might not actually follow a predictable pattern, and even if we assume it always does, the pattern may change dynamically and rapidly. Furthermore, ML training and inference are themselves storage- and compute-heavy.</p>

<h3 id="the-performance-obstacle">The Performance Obstacle</h3>

<p>By integrating ML algorithms into systems, our ultimate goal is to let them come up with smarter <em>policies</em> that make better <em>decisions</em> to yield better <em>performance</em>. However, deploying ML models itself introduces significant performance overhead. The overhead consists of two parts: <em>training</em> on existing data to learn a policy, and doing <em>inference</em> through the policy to obtain decisions.</p>

<p>Coarsely, we can categorize “ML for Sys” techniques into two classes:</p>

<ul>
  <li><strong>Online</strong>: gather workload data at run-time, constantly train on the gathered data to update the policy, and use the most up-to-date policy to make decisions.
    <ul>
      <li>\(\uparrow\) This strategy is rather robust against workload shifts.</li>
      <li>\(\downarrow\) Gathering data and training most useful ML models at run-time is very expensive and time-consuming.</li>
    </ul>
  </li>
  <li><strong>Offline</strong>: train on offline data (which are probably profiled from previous runs ahead-of-time) to get a determined policy and then deploy that policy.
    <ul>
      <li>\(\uparrow\) This strategy removes the cost of training from the critical path.</li>
      <li>\(\downarrow\) It cannot react to dynamic changes in workload pattern.</li>
      <li>\(\downarrow\) Evaluating a policy may still involve inference costs, which might not be cheap depending on the type of the model.</li>
    </ul>
  </li>
</ul>

<p>Nonetheless, for an ML model deployed in a computer system to be actually useful, its performance benefit must outweigh its deployment cost. This is why most research work on this topic so far is still limited to light-weight ML models. Bourbon, for example, incorporates only a simple segmented linear regression model and no neural networks (NN). Some offline configuration-tuning tools that produce static magic numbers may use larger NN models.</p>
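To make the "light-weight model" point concrete, here is a toy illustration of the learned-index idea (my own sketch, not Bourbon's actual code): fit a least-squares line mapping keys to their positions in a sorted array, then correct the prediction with a bounded local search. Training and inference here cost almost nothing, which is exactly why such models can pay for themselves inside a system.

```python
def fit_linear(keys):
    """Ordinary least squares over (key, index) pairs of a sorted array."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2  # mean of indices 0..n-1
    cov = sum((k - mean_x) * (i - mean_y) for i, k in enumerate(keys))
    var = sum((k - mean_x) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def lookup(keys, model, key, max_err):
    """Predict the position with the model, then search nearby."""
    slope, intercept = model
    guess = round(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    for i in range(lo, hi):  # bounded local search around the prediction
        if keys[i] == key:
            return i
    return None

keys = [3, 10, 17, 24, 31, 45, 52, 59]  # sorted, roughly linear keys
model = fit_linear(keys)
assert lookup(keys, model, 24, max_err=2) == 3
```

A real learned index (as in Bourbon) splits the key space into segments and fits one such model per segment, bounding the maximum prediction error per segment.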

<p>I hope that other ways of integrating AI techniques into computer systems can be discovered in the near future to help us build smarter systems and spawn more interesting ideas.</p>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://dl.acm.org/doi/10.1145/1327452.1327492">https://dl.acm.org/doi/10.1145/1327452.1327492</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://dl.acm.org/doi/10.1145/2934664">https://dl.acm.org/doi/10.1145/2934664</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia">https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://www.usenix.org/conference/osdi22/technical-sessions">https://www.usenix.org/conference/osdi22/technical-sessions</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://mlsys.org/">https://mlsys.org/</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://josehu.com/technical/2020/08/07/cache-eviction-algorithms.html">https://josehu.com/technical/2020/08/07/cache-eviction-algorithms.html</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://www.usenix.org/conference/osdi20/presentation/dai">https://www.usenix.org/conference/osdi20/presentation/dai</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p><a href="https://dl.acm.org/doi/10.14778/3436905.3436919">https://dl.acm.org/doi/10.14778/3436905.3436919</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p><a href="https://bhentsch.github.io/doc/EntropyLearnedHashing.pdf">https://bhentsch.github.io/doc/EntropyLearnedHashing.pdf</a> <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p><a href="https://mlsys.org/virtual/2021/oral/1627">https://mlsys.org/virtual/2021/oral/1627</a> <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:11" role="doc-endnote">
      <p><a href="https://arxiv.org/abs/2203.05128">https://arxiv.org/abs/2203.05128</a> <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Guanzhou Hu</name></author><category term="Technical" /><summary type="html"><![CDATA[This is a short post where I note down some of my insignificant thoughts about the interaction between AI and systems. With the rapid evolution of AI technologies, especially in the field of machine learning (ML), there is now a rising interest in studying the intersection between AI and computer systems design. The combination of the two can further be categorized into two directions: building systems for AI applications (Sys for AI) and using AI to empower smarter systems (AI for Sys).]]></summary></entry><entry><title type="html">Formal Description of File System Crash Consistency Techniques</title><link href="https://www.josehu.com/technical/2021/12/26/filesystem-crash-consistency.html" rel="alternate" type="text/html" title="Formal Description of File System Crash Consistency Techniques" /><published>2021-12-26T15:28:41+00:00</published><updated>2021-12-26T15:28:41+00:00</updated><id>https://www.josehu.com/technical/2021/12/26/filesystem-crash-consistency</id><content type="html" xml:base="https://www.josehu.com/technical/2021/12/26/filesystem-crash-consistency.html"><![CDATA[<p><em>Crash consistency</em> is one of the most essential guarantees that a storage system needs to make to ensure correctness. In a file system (FS) setting, consistency techniques must be carefully designed, integrated with the layout of blocks, and deployed in the procedure of updates. This post summarizes the three classic FS consistency techniques: <em>journaling</em>, <em>shadow paging</em> (CoW), and <em>log-structuring</em>, in a formal way and analyzes their pros &amp; cons.</p>

<h2 id="concept-of-crash-consistency">Concept of Crash Consistency</h2>

<p>Crash consistency is a general concept that applies to any storage system maintaining data on persistent storage media.</p>

<h3 id="general-crash-consistency">General Crash Consistency</h3>

<p>We say a piece of persistent data is in a <strong>consistent state</strong> if it is in a correct form representing the logical data structure it stores. For example, if a group of bytes is meant to store a B-tree, then it is in a consistent state iff the root block is in the correct position and all non-null node pointers point to correct child nodes (no dangling pointers), etc. Note that the “data structure” does not have to be a canonical data structure such as a B-tree – it can be any custom user specification.</p>

<p>We say a storage system provides <strong>crash consistency</strong> if data on the persistent media it manages always transitions from one consistent state to another consistent state. Equivalently, no matter when a crash happens during the steps of an update, data on persistent media is always left in a consistent state and can thus be <em>recovered</em> correctly upon restart.</p>

<h3 id="disambiguation">Disambiguation</h3>

<p>Consistency and <strong>durability</strong> are two orthogonal guarantees:</p>

<ul>
  <li>Having durability means that all requests that have been <em>acknowledged</em> to the user must have been made persistent;</li>
  <li>Having consistency means that when applying any request, data on persistent media is always in a consistent state.</li>
</ul>

<p>It is possible for a storage system to be consistent yet not durable: acking requests once they reach the DRAM cache, but always flushing them to persistent media in a consistent way – acked requests might be lost after a crash, but data on persistent media is always consistent and can thus be recovered (to a possibly outdated version).</p>

<p>It is also possible to be durable yet not consistent: reflecting any updates to persistent media immediately, but not managing ordering carefully – acked requests must have been persisted completely, but in-progress requests might leave the system in a <em>corrupted</em> state after a crash.</p>

<p>This post focuses on the consistency aspect, although most file systems provide both guarantees. Providing consistency is often a must. In certain cases where the application allows version rollbacks, weaker durability might be allowed.</p>

<blockquote>
  <p>The difference between crash consistency and other “consistency” terminologies should also be made clear:</p>

  <ul>
    <li>In distributed systems, consistency often means the strength of guarantee of reaching global consensus on the ordering of actions;</li>
    <li>Sometimes, the word “consistent” might also be used as a synonym to “uniform”, such as in consistent hashing.</li>
  </ul>
</blockquote>

<h3 id="fs-crash-consistency">FS Crash Consistency</h3>

<p>In the setting of a file system, there are three categories of persistent data that must be managed:</p>

<ol>
  <li><em>FS metadata</em>: FS-wide meta information, e.g., superblock fields, inode bitmap, data block bitmap, …</li>
  <li><em>File metadata</em>: metadata information of a file stored in its inode, e.g., file size, data block index mapping table, …</li>
  <li><em>File data</em>: actual user data of a file.</li>
</ol>

<p>Depending on which of the three categories of data are guaranteed crash consistent, an FS could provide two different levels of crash consistency:</p>

<ul>
  <li><em>Metadata consistency</em>: FS metadata and all file metadata are guaranteed crash consistent, while file data might be not. The FS is always able to identify all files and figure out which data blocks belong to which file correctly, yet the actual content of those data blocks could be corrupted across a crash.</li>
  <li><em>Data consistency</em>: in addition to metadata, the content of data blocks are guaranteed crash consistent. User update requests are applied to file data in a consistent way as well.</li>
</ul>

<p>Metadata consistency is often enough, since applications typically have their own error detection &amp; correction mechanisms for file data. As long as the FS image is always consistent, file content does not matter too much. Some FS designs also provide data consistency inherently.</p>

<h2 id="required-architecture-primitives">Required Architecture Primitives</h2>

<p>Before diving into the three FS consistency techniques in detail, I’d like to talk about two underlying hardware architecture primitives that must be available to FS developers. These two primitives are so essential that any file system design must rely on them, otherwise it is impossible to provide any consistency guarantee.</p>

<ul>
  <li><strong>Atomicity</strong>: there must be a way to write to persistent data <em>atomically</em> (complete-or-none), at least at <em>some granularity</em>. For example, to maintain a B-tree data structure consistently, there must be a way to at least write out a pointer value atomically.
    <ul>
      <li>On block drives, at least updating a sector is atomic;</li>
      <li>On non-volatile memory DIMMs on x86, at least flushing a cacheline is atomic.</li>
      <li>Formally, if an action \(A\) is atomic, we denote as \(\overline{A}\).</li>
    </ul>
  </li>
  <li><strong>Ordering</strong>: there must be a way to <em>enforce an ordering</em> between certain actions. For example, to append to a file consistently (assuming data write &amp; file size update are two separate steps in the FS), there must be a way to enforce that the update to file size happens strictly after the new data blocks have been prepared.
    <ul>
      <li>On block drives, device controllers at least raise signals about completions, which the FS software waits on;</li>
      <li>On non-volatile memory DIMMs on x86, <em>memory fences</em> set up barriers between updates;</li>
      <li>Formally, if action \(B\) is ordered after action \(A\), we denote it as \(A \rightarrow B\); if actions \(C\) and \(D\) do not require an ordering barrier in between, we denote this as \(C \vert D\).</li>
    </ul>
  </li>
</ul>

<p>This formalization comes from the Optimistic Crash Consistency paper <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
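<p>To make the notation concrete, here is a toy Python model (an illustration only, not real FS code) of the file-append example above: the append consists of a data-block write \(D\) and a file-size update \(M\), and a crash in between shows why \(D \rightarrow M\) must be enforced:</p>

```python
def crashed_append(disk, data, write_order, crash_after):
    """Simulate a file append that crashes after `crash_after` persisted
    writes. The append has two writes: the new data block ("D") and the
    file-size update ("M"); `write_order` is the order in which the
    device actually persists them."""
    writes = {"D": ("block", data), "M": ("size", disk["size"] + 1)}
    for step in write_order[:crash_after]:
        key, val = writes[step]
        disk[key] = val
    return disk

# With D -> M enforced, a crash in between leaves the old size intact,
# so the file never covers an unwritten block: still consistent.
assert crashed_append({"size": 0, "block": None}, "new", ["D", "M"], 1) \
    == {"size": 0, "block": "new"}

# Without a barrier the device may persist M first; a crash then leaves
# a size that covers a block that was never written: inconsistent.
assert crashed_append({"size": 0, "block": None}, "new", ["M", "D"], 1) \
    == {"size": 1, "block": None}
```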

<h2 id="three-fs-consistency-techniques">Three FS Consistency Techniques</h2>

<p>This section formally summarizes the three classic FS consistency techniques: <em>journaling</em>, <em>shadow paging</em>, and <em>log-structuring</em>, and analyzes their pros &amp; cons.</p>

<h3 id="1-journaling-wal">1) Journaling (WAL)</h3>

<p>A <strong>journaling</strong> FS allocates a dedicated region of persistent storage as a <strong>journal</strong> (sometimes referred to as a log, though that term is easily confused with log-structuring). The journal is an append-only “log” of <em>transactions</em>, where each transaction corresponds to a user update request. The idea behind journaling is that, for any user request, its transaction entry must be persisted and committed before the actual in-place update. Journaling is a specific form of the <strong>write-ahead logging</strong> (WAL) technique. The action of “committing a transaction entry” must be atomic.</p>

<p>Journaling could be done in two different flavors:</p>

<ul>
  <li><em>Redo journaling</em>: transactions record new data to be applied. During recovery, the FS replays the journal forward, re-applies all committed entries, and discards all uncommitted entries;</li>
  <li><em>Undo journaling</em>: transactions record backups of the old data. During recovery, the FS uses all uncommitted entries to roll back any partially applied updates, and ignores all committed entries.</li>
</ul>

<p>Handling a user request involves the following actions:</p>

<ul>
  <li>\(J_M\): write out metadata changes to journal</li>
  <li>\(J_D\): write out data changes to journal</li>
  <li>\(J_E\): write out “transaction end” to journal, indicating a commit</li>
  <li>\(M\): actual in-place update of metadata</li>
  <li>\(D\): actual in-place update of data</li>
</ul>

<p>A journaling FS has the flexibility to choose between providing only metadata consistency and providing stronger data consistency. In <em>metadata journaling</em> mode, only metadata changes are logged in the journal. This mode introduces minimal overhead. Formally, the algorithm is:</p>

\[D \vert J_M \rightarrow \overline{J_E} \rightarrow M\]

<p>In <em>data journaling</em> mode, data changes are logged in the journal as well, resulting in a <em>write-twice penalty</em>. Formally, the algorithm is:</p>

\[J_D \vert J_M \rightarrow \overline{J_E} \rightarrow D \vert M\]

<p>Many famous Linux file systems are journaling file systems, with Ext3/4 <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> being perfect examples. By default, Ext4 is mounted in <code class="language-plaintext highlighter-rouge">data=ordered</code> mode, i.e., doing only metadata journaling. When mounted with the <code class="language-plaintext highlighter-rouge">data=journal</code> option, Ext4 does data journaling. XFS <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> also uses journaling. See also the Optimistic Crash Consistency paper <sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> for a thorough discussion of possible optimizations to this algorithm.</p>
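<p>As a concrete illustration, here is a toy Python sketch of redo journaling (hypothetical names; dict keys stand in for on-disk locations): each transaction is logged first, committed by one flag write standing in for the atomic \(\overline{J_E}\), and recovery replays only committed entries:</p>

```python
class RedoJournalFS:
    """Toy redo-journaling store (a sketch, not a real FS): `journal`
    holds transactions and `store` is the in-place FS image."""
    def __init__(self):
        self.journal = []   # list of {"updates": ..., "committed": ...}
        self.store = {}

    def update(self, updates, crash_before_commit=False,
               crash_before_apply=False):
        txn = {"updates": dict(updates), "committed": False}
        self.journal.append(txn)       # J_M (and J_D): log the changes
        if crash_before_commit:
            return
        txn["committed"] = True        # atomic J_E: commit the transaction
        if crash_before_apply:
            return
        self.store.update(updates)     # M (and D): in-place application

    def recover(self):
        for txn in self.journal:       # replay the journal forward
            if txn["committed"]:
                self.store.update(txn["updates"])  # re-apply committed
            # uncommitted entries are simply discarded
        self.journal.clear()

fs = RedoJournalFS()
fs.update({"a": 1})                               # completed normally
fs.update({"b": 2}, crash_before_apply=True)      # committed, never applied
fs.update({"c": 3}, crash_before_commit=True)     # never committed
fs.recover()
assert fs.store == {"a": 1, "b": 2}   # b redone on recovery, c discarded
```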

<h3 id="2-shadow-paging-cow">2) Shadow Paging (CoW)</h3>

<p><strong>Shadow paging</strong> (or shadowing) is a specific form of the <strong>copy-on-write</strong> (CoW) technique. The idea behind shadow paging is to first write all updates to newly-allocated empty blocks (copying over any partial blocks if necessary), and then <em>publish</em> the new blocks into the file atomically.</p>

<p>Handling a user request involves the following actions:</p>

<ul>
  <li>\(B\): allocation of empty blocks</li>
  <li>\(W_C\): copy any partial blocks touched by the update from current file data into the new blocks</li>
  <li>\(W_D\): write out new data into the new blocks</li>
  <li>\(M\): publish the new blocks into metadata (typically a pointer switch in the inode’s index table)</li>
</ul>

<p>Formally, the algorithm is:</p>

\[B \rightarrow W_C \vert W_D \rightarrow \overline{M}\]

<p>Shadow paging has obvious advantages and disadvantages compared to journaling. \(\uparrow\) It provides data consistency without the write-twice penalty. \(\downarrow\) It works well only if most updates are bulky, block-sized, and block-aligned; small, in-place updates introduce significant allocation and copying overhead. In a tree-structured FS, shadow paging might also result in cascading CoW up to the root of the tree (where an atomic pointer switch can be done).</p>

<p>BtrFS <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> and WAFL <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> are two typical examples of CoW FS. To reduce the CoW overhead on small updates, WAFL aggregates and batches incoming writes into a single CoW. BPFS <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> is a CoW FS optimized for non-volatile memory.</p>

<h3 id="3-log-structuring">3) Log-Structuring</h3>

<p>Introduced in the classic LFS paper <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>, a <strong>log-structured</strong> file system organizes the entire FS itself as an append-only log. All updates are just atomic appends to the log (involving both new data blocks and new metadata inode). Atomicity of appends is ensured by doing atomic updates to the <strong>log tail</strong> offset. The FS maintains an in-DRAM <em>inode map</em> recording the address of the latest version of each file’s inode. This in-DRAM inode map can be safely lost after a crash – the persistent log is the ground-truth and the FS image can be rebuilt from reading through the log and figuring out the latest version of each block.</p>

<p>Handling a user request involves the following actions:</p>

<ul>
  <li>\(A_D\): append new data blocks to log tail</li>
  <li>\(L_D\): update of log tail to right after the newly appended data blocks</li>
  <li>\(A_M\): append new inode metadata to log tail, which contains updated pointers to the previously appended data blocks</li>
  <li>\(L_M\): update of log tail to right after the newly appended inode</li>
  <li>\(I\): update the DRAM inode map image with the address in log of the new inode</li>
</ul>

<p>Formally, the algorithm is:</p>

\[A_D \rightarrow \overline{L_D} \rightarrow A_M \rightarrow \overline{L_M} \rightarrow I\]

<p>Log-structuring has its own pros and cons. \(\uparrow\) All device requests happen in a sequential manner, yielding good performance. Log-structured FS inherently provides data crash consistency. \(\downarrow\) The log could grow indefinitely, so there must be a <em>garbage collection</em> mechanism to discard outdated blocks and compact the log. Also, though writes become sequential, reads of a single file get scattered around the log.</p>
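<p>A toy Python sketch of the log-structured idea (hypothetical names; a list stands in for the on-device log) shows how a torn append past the persisted tail is simply ignored, and how the in-DRAM inode map is rebuilt on recovery:</p>

```python
class ToyLFS:
    """Toy log-structured store (a sketch): the log is a list of
    (key, value) records; only records before `tail` count as persisted,
    and `tail` bumps stand in for atomic log-tail updates."""
    def __init__(self):
        self.log = []
        self.tail = 0    # persisted log tail
        self.imap = {}   # in-DRAM inode map: key -> log address of latest

    def put(self, key, value, crash_before_tail_bump=False):
        self.log.append((key, value))   # A_D / A_M: append to the log
        if crash_before_tail_bump:
            return                      # torn append: lies past the tail
        self.tail = len(self.log)       # atomic L_D / L_M: bump the tail
        self.imap[key] = self.tail - 1  # I: update the in-DRAM map

    def recover(self):
        self.imap = {}                  # the DRAM map is lost in a crash...
        for addr in range(self.tail):   # ...and rebuilt by scanning the log
            key, _value = self.log[addr]
            self.imap[key] = addr       # later records win

    def get(self, key):
        return self.log[self.imap[key]][1]

lfs = ToyLFS()
lfs.put("file", "v1")
lfs.put("file", "v2")
lfs.put("file", "v3", crash_before_tail_bump=True)  # torn append
lfs.recover()
assert lfs.get("file") == "v2"   # the torn record past the tail is ignored
```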

<p>It is possible to combine log-structuring with journaling/shadow paging. For example, NOVA <sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> combines metadata journaling with log-structured file data blocks to optimize for non-volatile memory.</p>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://research.cs.wisc.edu/adsl/Publications/optfs-sosp13.pdf">https://research.cs.wisc.edu/adsl/Publications/optfs-sosp13.pdf</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://ext4.wiki.kernel.org/index.php/Main_Page">https://ext4.wiki.kernel.org/index.php/Main_Page</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/XFS">https://en.wikipedia.org/wiki/XFS</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://btrfs.wiki.kernel.org/index.php/Main_Page">https://btrfs.wiki.kernel.org/index.php/Main_Page</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Write_Anywhere_File_Layout">https://en.wikipedia.org/wiki/Write_Anywhere_File_Layout</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://www.sigops.org/s/conferences/sosp/2009/papers/condit-sosp09.pdf">https://www.sigops.org/s/conferences/sosp/2009/papers/condit-sosp09.pdf</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf">https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p><a href="https://www.usenix.org/conference/fast16/technical-sessions/presentation/xu">https://www.usenix.org/conference/fast16/technical-sessions/presentation/xu</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Guanzhou Hu</name></author><category term="Technical" /><summary type="html"><![CDATA[Crash consistency is one of the most essential guarantees that a storage system needs to make to ensure correctness. In a file system (FS) setting, consistency techniques must be carefully designed, integrated with the layout of blocks, and deployed in the procedure of updates. This post summarizes the three classic FS consistency techniques: journaling, shadow paging (CoW), and log-structuring, in a formal way and analyzes their pros &amp; cons.]]></summary></entry><entry><title type="html">Raspberry Pi As Campus GlobalProtect VPN Proxy Server</title><link href="https://www.josehu.com/memo/2021/12/22/rpi-globalprotect-proxy.html" rel="alternate" type="text/html" title="Raspberry Pi As Campus GlobalProtect VPN Proxy Server" /><published>2021-12-22T16:16:32+00:00</published><updated>2021-12-22T16:16:32+00:00</updated><id>https://www.josehu.com/memo/2021/12/22/rpi-globalprotect-proxy</id><content type="html" xml:base="https://www.josehu.com/memo/2021/12/22/rpi-globalprotect-proxy.html"><![CDATA[<p>Wisc campus VPN and our CS departmental VPN both use GlobalProtect. On the user side, GlobalProtect clients cannot configure VPN split tunneling, meaning that once connected, all outbound traffic from my host machine goes through the VPN. I have a daily need to access my lab machine sitting behind the departmental VPN, yet I would like all other traffic (e.g., searching Google) to bypass the VPN. I came up with a solution of using one or two Raspberry Pi chips as an always-on SSH proxy server.</p>

<h2 id="original-vpn-connection-scheme">Original VPN Connection Scheme</h2>

<p>Originally, I was using the GlobalProtect client directly on my host PC or on my laptop. My lab machine <code class="language-plaintext highlighter-rouge">labmachine.cs.wisc.edu</code> sits behind the departmental VPN. The VPN connection scheme looked like:</p>

<p><img src="/assets/img/globalprotect-vpn-proxy-0.png" alt="GlobalProtectVPNProxy0" /></p>

<p>Since GlobalProtect clients force all outbound traffic to go through the VPN once connected, I could not let only one terminal SSH session to use VPN while leaving all other connections native. One workaround would be to install a virtual machine on the host, start GlobalProtect client in the virtual machine, and do SSH from there, but that requires careful configuration of guest networking and also seems to be an unnecessarily heavy-weight solution.</p>

<h2 id="raspberry-pi-as-proxy-server">Raspberry Pi As Proxy Server</h2>

<p>If you are lucky enough to have one or two spare Raspberry Pi chips at home, you can follow the steps below to set them up as an SSH proxy server. SSH connections are very light-weight, so even RPi Zero chips can do the job nicely.</p>

<p>Let’s first assume that the RPi chip is <strong>within the same local network</strong> with the host machine (where I want split tunneling). In this case, one RPi chip should be sufficient. The next section will talk about adding an extra RPi chip and setting up Dynamic DNS (DDNS) to allow accessing the proxy server from anywhere on the Internet.</p>

<p>With one RPi chip, the network connection scheme looks like:</p>

<p><img src="/assets/img/globalprotect-vpn-proxy-1.png" alt="GlobalProtectVPNProxy1" /></p>

<p>Setup steps:</p>

<ol>
  <li>Install Raspbian OS on RPi. Connect RPi to home router and test network connection.</li>
  <li>Start OpenSSH server on RPi.</li>
  <li>Open router configuration console (<code class="language-plaintext highlighter-rouge">192.168.0.1</code> for my TP-Link Archer). Identify the RPi’s hardware MAC address.</li>
  <li>Most home routers do DHCP for their LAN. To give the RPi a permanent LAN IP, find the “Address Reservation” or equivalent setting in the router console, and add an entry mapping the RPi’s MAC address to a fixed LAN IP of your choice (e.g., <code class="language-plaintext highlighter-rouge">192.168.0.131</code>).</li>
  <li>SSH connect to the RPi: <code class="language-plaintext highlighter-rouge">ssh piuser@192.168.0.131</code>. Set up password-less SSH if desired.</li>
  <li>Install GlobalProtect command-line Linux client on RPi: <a href="https://kb.wisc.edu/page.php?id=105971">link for WiscVPN</a>.</li>
</ol>

<p>Start GlobalProtect client on RPi:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>on-rpi<span class="o">)</span> globalprotect connect <span class="nt">--portal</span> compsci.vpn.wisc.edu
</code></pre></div></div>

<p>After the above steps, I can connect to my lab machine from my host PC using the nice <em>Proxy Jump</em> feature of SSH:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-J</span> piuser@192.168.0.131 labuser@labmachine.cs.wisc.edu
</code></pre></div></div>

<p>It is strongly recommended to set up alias targets in <code class="language-plaintext highlighter-rouge">.ssh/config</code> to save future typing, e.g.:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Host josepi4
  Hostname 192.168.0.131
  User piuser
  Port 22
  IdentityFile ~/.ssh/id_rsa
  ServerAliveInterval 30

Host labmachine
  Hostname labmachine.cs.wisc.edu
  User labuser
  Port 22
  IdentityFile ~/.ssh/id_rsa
  ServerAliveInterval 30

Host labmachine-jl
  Hostname labmachine.cs.wisc.edu
  User labuser
  Port 22
  IdentityFile ~/.ssh/id_rsa
  ServerAliveInterval 30
  ProxyJump josepi4
</code></pre></div></div>

<p>Then, to SSH to the RPi from local network at home:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh josepi4
</code></pre></div></div>

<p>To SSH to the lab machine behind VPN, either will work:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-J</span> josepi4 labmachine
<span class="c"># or simpler:</span>
ssh labmachine-jl
</code></pre></div></div>

<p>Notice that the GlobalProtect client on the RPi might time out and disconnect after a few minutes of inactivity. A simple keep-alive script running indefinitely on the RPi could keep GlobalProtect connected.</p>
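<p>A minimal sketch of such a keep-alive loop, in Python, could look like the following. The polling logic is real; the GlobalProtect commands in the trailing comment are assumptions about the CLI (only <code class="language-plaintext highlighter-rouge">globalprotect connect --portal ...</code> is taken from above) and should be verified against its help output. The <code class="language-plaintext highlighter-rouge">max_checks</code> parameter exists only to make the sketch testable:</p>

```python
import time

def keep_alive(is_connected, connect, interval_s=60, max_checks=None):
    """Poll the VPN every `interval_s` seconds and reconnect whenever it
    has dropped. `is_connected` and `connect` are injected callables so
    the GlobalProtect-specific commands stay pluggable; `max_checks`
    bounds the loop for testing (None means run forever)."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if not is_connected():
            connect()
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval_s)

# On the RPi this might be wired up roughly as follows (an untested
# assumption about the GlobalProtect CLI; verify the exact commands):
#   import subprocess
#   keep_alive(
#       lambda: b"Connected" in subprocess.run(
#           ["globalprotect", "show", "--status"],
#           capture_output=True).stdout,
#       lambda: subprocess.run(
#           ["globalprotect", "connect",
#            "--portal", "compsci.vpn.wisc.edu"]))
```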

<h2 id="using-proxy-server-when-not-at-home">Using Proxy Server When Not At Home</h2>

<p>So far, the RPi proxy server is available to any machine connected to my home router’s local network. However, I still want access to the proxy server <strong>from anywhere on the Internet</strong> when I’m not at home.</p>

<p>It is time to introduce two more techniques into the workflow:</p>

<ul>
  <li><em>Port Forwarding</em> on the router. This feature is named <em>Virtual Server</em> on my TP-Link Archer. It makes the router recognize inbound traffic from the Internet to a specific port number and relay that traffic to a specific LAN IP.</li>
  <li><em>Dynamic DNS</em> (DDNS) service. This lets me obtain a fixed public domain name and map it to my router’s public IP address. It is necessary because my Internet service provider allocates dynamic public addresses to my router, meaning the public IP address may change as often as every 14 days. DDNS-aware routers can collaborate with DDNS providers to auto-update the IP address that the domain name maps to.</li>
</ul>

<p>Because GlobalProtect strictly hijacks all outbound traffic, the virtual server feature does not work while the first RPi is on the GlobalProtect VPN. Hence, unfortunately, an additional RPi needs to be involved. (RPi chips are cheap enough, anyway.)</p>

<p>The final network connection scheme looks like:</p>

<p><img src="/assets/img/globalprotect-vpn-proxy-2.png" alt="GlobalProtectVPNProxy2" /></p>

<p>Setup steps:</p>

<ol>
  <li>Set up the second RPi and start OpenSSH server, similarly.</li>
  <li>Open router configuration console and reserve a fixed LAN IP for the second RPi (e.g., <code class="language-plaintext highlighter-rouge">192.168.0.130</code>), similarly.</li>
  <li>SSH connect to the RPi: <code class="language-plaintext highlighter-rouge">ssh piuser@192.168.0.130</code>. Set up password-less SSH if desired.</li>
  <li>Go to the router console and locate the “Virtual Server” or equivalent setting. Register port forwarding from some external port (e.g., <code class="language-plaintext highlighter-rouge">22122</code>) to the internal address <code class="language-plaintext highlighter-rouge">192.168.0.130:22</code>. It is recommended to choose a non-default external port to avoid exposing port <code class="language-plaintext highlighter-rouge">22</code> on the public Internet.</li>
  <li>Check which DDNS providers your router supports. <a href="https://www.noip.com/">No-IP</a> is a great choice – it gives you one free domain name per account. Go to the provider, register an available domain name (e.g., <code class="language-plaintext highlighter-rouge">josedns.ddns.net</code>), and activate it.</li>
  <li>Go to router console and locate the “Dynamic DNS” or equivalent setting. Enter DDNS provider account and password and enable public IP auto-update feature.</li>
</ol>

<p>After the above steps, I can connect to the second RPi from anywhere on the public Internet through:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-p</span> 22122 piuser@josedns.ddns.net
</code></pre></div></div>

<p>To access the first RPi:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-J</span> piuser@josedns.ddns.net:22122 piuser@192.168.0.131
</code></pre></div></div>

<p>Notice that SSH proxy jumps can be chained, so to access the lab machine behind VPN:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-J</span> piuser@josedns.ddns.net:22122,piuser@192.168.0.131 labuser@labmachine.cs.wisc.edu
</code></pre></div></div>

<p>Add a few more SSH config entries to save typing, e.g.:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Host josepi0
  Hostname 192.168.0.130
  User piuser
  Port 22
  IdentityFile ~/.ssh/id_rsa
  ServerAliveInterval 30

Host josepi0-jp
  Hostname josedns.ddns.net
  User piuser
  Port 22122
  IdentityFile ~/.ssh/id_rsa
  ServerAliveInterval 30

Host josepi4-jp
  Hostname 192.168.0.131
  User piuser
  Port 22
  IdentityFile ~/.ssh/id_rsa
  ServerAliveInterval 30
  ProxyJump josepi0-jp

Host labmachine-jp
  Hostname labmachine.cs.wisc.edu
  User labuser
  Port 22
  IdentityFile ~/.ssh/id_rsa
  ServerAliveInterval 30
  ProxyJump josepi4-jp
</code></pre></div></div>

<p>Then, to connect to the second RPi when away from home:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh josepi0-jp
</code></pre></div></div>

<p>To access the first RPi:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh josepi4-jp
</code></pre></div></div>

<p>To access the lab machine:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh labmachine-jp
</code></pre></div></div>

<p>Hooray!</p>]]></content><author><name>Guanzhou Hu</name></author><category term="Memo" /><summary type="html"><![CDATA[Wisc campus VPN and our CS departmental VPN both use GlobalProtect. On the user side, GlobalProtect clients cannot configure VPN split tunneling, meaning that once connected, all outbound traffic from my host machine goes through the VPN. I have a daily need to access my lab machine sitting behind the departmental VPN, yet I would like all other traffic (e.g., searching Google) to bypass the VPN. I came up with a solution of using one or two Raspberry Pi chips as an always-on SSH proxy server.]]></summary></entry><entry><title type="html">System Building Rules &amp;amp; Tips from the OSTEP Book</title><link href="https://www.josehu.com/memo/2021/06/15/ostep-book-tips-list.html" rel="alternate" type="text/html" title="System Building Rules &amp;amp; Tips from the OSTEP Book" /><published>2021-06-15T15:03:17+00:00</published><updated>2021-06-15T15:03:17+00:00</updated><id>https://www.josehu.com/memo/2021/06/15/ostep-book-tips-list</id><content type="html" xml:base="https://www.josehu.com/memo/2021/06/15/ostep-book-tips-list.html"><![CDATA[<p>This short post is a summary list of all the system building tips/rules/laws boxes in <a href="https://pages.cs.wisc.edu/~remzi/OSTEP/">the OSTEP book</a> (also see my <a href="https://www.josehu.com/notes.html">reading note</a>). Without proper context, these tips make little sense, so I included the chapter numbers as well for easier back-tracing.</p>

<h2 id="list-of-system-building-tips">List of System Building Tips</h2>

<ul>
  <li>Use <em>time-sharing</em> and <em>space-sharing</em> (Chapter 4)</li>
  <li>Separate <em>policy</em> and <em>mechanism</em> (Chapter 4)</li>
  <li>“<em>Get it right</em>. Neither abstraction nor simplicity is a substitute for getting it right.” (Lampson’s law, Chapter 5)</li>
  <li><em>RTFM</em>: read the manual pages (Chapter 5)</li>
  <li>Use <em>protected control transfer</em> (Chapter 6)</li>
  <li>Be wary of <em>user inputs</em> in secure systems (Chapter 6)</li>
  <li>Deal with application <em>misbehavior</em> (Chapter 6)</li>
  <li>Use <em>interrupts</em> to regain control (Chapter 6)</li>
  <li><em>Reboot</em> is useful because it reverts the system to a known and likely correct state (Chapter 6)</li>
  <li><em>Shortest-job-first</em> is a general scheduling principle (Chapter 7)</li>
  <li><em>Amortization</em> can reduce costs (Chapter 7)</li>
  <li><em>Overlapping</em> enables higher utilization (Chapter 7)</li>
  <li>Learn from <em>history</em> to make better decisions (Chapter 8)</li>
  <li><em>Scheduling</em> also needs to be secure from attacks (Chapter 8)</li>
  <li>“<em>Avoid voo-doo constants</em>.” (Ousterhout’s Law, Chapter 8)</li>
  <li>Use <em>advice</em> where possible (Chapter 8)</li>
  <li>Use <em>randomness</em> when appropriate (Chapter 9)</li>
  <li>Use <em>efficient data structures</em> (Chapter 9)</li>
  <li>Remember the principle of <em>isolation</em> (Chapter 13)</li>
  <li>When in doubt, <em>try it out</em> (Chapter 14)</li>
  <li>It <em>compiled/ran</em> != it is <em>correct</em> (Chapter 14)</li>
  <li><em>Interposition</em> is powerful (Chapter 15)</li>
  <li>Require <em>hardware support</em> if that’s better (Chapter 15)</li>
  <li>If 1000 solutions exist, no great one does; in this case, try to <em>avoid the problem</em> altogether (Chapter 16)</li>
  <li>Great engineers are <em>really great</em> (Chapter 17)</li>
  <li>Use <em>caching</em> when possible (Chapter 19)</li>
  <li>Use <em>hybrid</em> solution when appropriate (Chapter 20)</li>
  <li>Do work in the <em>background</em> (Chapter 21)</li>
  <li>Comparing against <em>theoretical optimal</em> is useful (Chapter 22)</li>
  <li>Be aware of the <em>curse of generality</em> (Chapter 23)</li>
  <li>Be <em>lazy</em> in certain cases (Chapter 23)</li>
  <li>Consider <em>incrementalism</em> (Chapter 23)</li>
  <li>Know and use available <em>tools</em> (Chapter 26)</li>
  <li>Use <em>atomic</em> operations (Chapter 26)</li>
  <li>Think in the way of a <em>malicious scheduler</em> when talking about concurrency bugs (Chapter 28)</li>
  <li>“<em>Less code</em> is often better code.” (Lauer’s Law, Chapter 28)</li>
  <li>More concurrency <em>isn’t necessarily faster</em> (Chapter 29)</li>
  <li>Be wary of <em>control flow</em> changes when using locks (Chapter 29)</li>
  <li>“<em>Avoid premature optimization</em>.” (Knuth’s law, Chapter 29)</li>
  <li>Always <em>hold</em> the lock while signaling (Chapter 30)</li>
  <li>Use <em>while</em>, not if, in multi-threaded program (Chapter 30)</li>
  <li>“<em>Simple and dumb</em> can be better.” (Hill’s law, Chapter 31)</li>
  <li>Be careful with <em>generalization</em> (Chapter 31)</li>
  <li>“<em>Don’t always do it perfectly</em>.” (Tom West’s law, Chapter 32)</li>
  <li>Don’t <em>block</em> in event-based servers. (Chapter 33)</li>
  <li><em>Interrupts</em> not always better than <em>polling</em>. (Chapter 36)</li>
  <li>Be aware of disk <em>sequentiality</em> (Chapter 37)</li>
  <li>“<em>It always depends</em>.” (Livny’s law, Chapter 37)</li>
  <li><em>Transparency</em> enables easier deployment (Chapter 38)</li>
  <li>Think carefully about <em>naming</em> (Chapter 39)</li>
  <li>Be wary of <em>powerful commands</em> (Chapter 39)</li>
  <li><em>TOCTTOU</em>: time-of-check to time-of-use (Chapter 39)</li>
  <li>Consider <em>extent</em>-based approaches (Chapter 40)</li>
  <li>Reads don’t access <em>allocation</em> structures (Chapter 40)</li>
  <li>Understand <em>static</em> vs. <em>dynamic</em> partitioning (Chapter 40)</li>
  <li>Understand the <em>durability/performance tradeoff</em> (Chapter 40)</li>
  <li>Make the system <em>usable</em> (Chapter 41)</li>
  <li><em>Details</em> matter (Chapter 43)</li>
  <li>Use a level of <em>indirection</em> when necessary (Chapter 43)</li>
  <li>Turn <em>flaws</em> into <em>features</em> (Chapter 43)</li>
  <li>Be careful with <em>terminology</em> (Chapter 44)</li>
  <li>The importance of <em>backwards compatibility</em> (Chapter 44)</li>
  <li>Sometimes the <em>implementation</em> shapes the <em>interface</em> (Chapter 44)</li>
  <li><em>TNSTAAFL</em>: there is no free lunch (Chapter 45)</li>
  <li><em>Communication</em> is inherently unreliable (Chapter 48)</li>
  <li>Use <em>checksums</em> for integrity (Chapter 48)</li>
  <li>Be careful setting the <em>timeout</em> value (Chapter 48)</li>
  <li><em>Idempotency</em> is powerful (Chapter 49)</li>
  <li>“<em>Perfection</em> is the enemy of the good. Even in a beautiful system, there are corner cases.” (Voltaire’s law, Chapter 49)</li>
  <li><em>Innovation</em> breeds innovation (Chapter 49)</li>
  <li>“<em>Measure</em>, then build.” (Patterson’s law, Chapter 50)</li>
  <li><em>Crash consistency</em> is not a panacea (Chapter 50)</li>
  <li>Understand the importance of <em>workload</em> (Chapter 50)</li>
  <li>Be careful of the <em>weakest link</em> (Chapter 53)</li>
  <li>Avoid storing <em>secrets</em> (Chapter 54)</li>
  <li><em>Privilege escalation</em> is considered dangerous (Chapter 55)</li>
  <li>Don’t develop your <em>own ciphers</em> (Chapter 56)</li>
  <li>Infer <em>implicit</em> information if necessary (Appendix B)</li>
</ul>]]></content><author><name>Guanzhou Hu</name></author><category term="Memo" /><summary type="html"><![CDATA[This short post is a summary list of all the system building tips/rules/laws boxes in the OSTEP book (also see my reading note). Without proper context, these tips make little sense, so I included the chapter numbers as well for easier back-tracing.]]></summary></entry><entry><title type="html">Multicore Locking Design &amp;amp; A Partial List of Lock Implementations</title><link href="https://www.josehu.com/technical/2021/05/31/locking-techniques.html" rel="alternate" type="text/html" title="Multicore Locking Design &amp;amp; A Partial List of Lock Implementations" /><published>2021-05-31T12:09:27+00:00</published><updated>2021-05-31T12:09:27+00:00</updated><id>https://www.josehu.com/technical/2021/05/31/locking-techniques</id><content type="html" xml:base="https://www.josehu.com/technical/2021/05/31/locking-techniques.html"><![CDATA[<p>Concurrency plays a significant role in modern multi-core operating systems. We want a locking mechanism that is <em>efficient</em> (low latency), <em>scalable</em> (increasing the number of threads does not degrade performance too badly), and <em>fair</em> (considers the order of acquirement and does not make any one thread wait too long). This post summarizes a bit on hardware atomic instructions which modern locks are built upon, a comparison between spinning and blocking locks, and a partial list of representative lock implementations.</p>

<h2 id="mutual-exclusion--locking">Mutual Exclusion &amp; Locking</h2>

<p>Let’s assume that the shared resource is a data structure on DRAM and that single-cacheline reads/writes to DRAM are atomic. Multiple running entities (say threads) run concurrently on multiple cores and share access to that data structure. The code for one access from one thread to the data structure is a <em>critical section</em> - a sequence of memory reads/writes and computations that must not be interrupted in the middle by other concurrent attempts of access. We want <strong>mutual exclusion</strong>: at any time, there will be at most one thread executing its critical section, and if some thread is doing that, other threads attempting to enter their critical section must wait. We do not want <em>race conditions</em> which may corrupt the data structure and yield incorrect results.</p>

<p>Based on this reasonable setup, it is possible to develop purely <em>software</em>-based algorithms. See <a href="https://en.wikipedia.org/wiki/Mutual_exclusion#Software_solutions">this section of Wikipedia</a> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> for examples. Though very valuable in the theoretical aspect, these solutions are too sophisticated and quite inefficient to be deployed as locking primitives in an operating system under heavy load.</p>

<p>Modern operating systems, instead, rely on <em>hardware atomic instructions</em> – ISA supported instructions that are more than just single memory reads/writes, but are guaranteed by the hardware architecture to be atomic and unbreakable. The operating system implements (mutex) <em>locks</em> upon these instructions (in a threading library, for example). Threads have their critical sections protected in this way to get mutual exclusion:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lock</span><span class="p">.</span><span class="n">acquire</span><span class="p">();</span>
<span class="p">...</span> <span class="c1">// critical section</span>
<span class="n">lock</span><span class="p">.</span><span class="n">release</span><span class="p">();</span>
</code></pre></div></div>

<h2 id="hardware-atomic-instructions">Hardware Atomic Instructions</h2>

<p>Here are three classic examples of hardware atomic instructions.</p>

<h3 id="test-and-set-tas">Test-and-Set (TAS)</h3>

<p>The most basic hardware atomic instruction would be test-and-set (TAS). It writes a <code class="language-plaintext highlighter-rouge">1</code> to a memory location and returns the old boolean value on that location, atomically.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TEST_AND_SET</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">old_val</span>
<span class="c1">// old_val = *addr;</span>
<span class="c1">// *addr = 1;</span>
<span class="c1">// return old_val;</span>
</code></pre></div></div>

<p>Using this instruction, it is simple to build a basic spinlock that grants mutual exclusion (but not fairness and performance, of course).</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">acquire</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">TEST_AND_SET</span><span class="p">(</span><span class="o">&amp;</span><span class="n">flag</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="n">release</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">flag</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Notice that this is a spinlock, which we will explain in the next section. Also, modern architectures have private levels of cache for each core. When threads are competing for the lock, there will be a great amount of <em>cache invalidation</em> traffic as they are all doing <code class="language-plaintext highlighter-rouge">TEST_AND_SET</code> to the same bit in the while loop.</p>

<h3 id="compare-and-swap-cas-exchange">Compare-and-Swap (CAS, Exchange)</h3>

<p>Compare-and-swap (CAS, or Exchange) compares the value on a memory location with a given value, and if they are the same, writes a new value into it. It returns the old value at that location, from which the caller can tell whether the swap took place.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">COMPARE_AND_SWAP</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">val</span><span class="p">,</span> <span class="n">new_val</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">old_val</span>
<span class="c1">// old_val = *addr;</span>
<span class="c1">// if (old_val == val)</span>
<span class="c1">//   *addr = new_val;</span>
<span class="c1">// return old_val;</span>
</code></pre></div></div>

<p>There are some variants of CAS, such as compare-and-set (which returns a boolean success flag instead) or exchange, but the idea is the same. It is also simple to build a basic spinlock out of CAS.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">acquire</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">COMPARE_AND_SWAP</span><span class="p">(</span><span class="o">&amp;</span><span class="n">flag</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="n">release</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">flag</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="load-linked-ll--store-conditional-sc">Load-Linked (LL) &amp; Store-Conditional (SC)</h3>

<p>Load-linked (LL) &amp; store-conditional (SC) are a pair of atomic instructions used together. LL is just like a normal memory load. SC tries to store a value to the location and succeeds only if the location has not been written to since the paired LL; otherwise, it returns failure.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LOAD_LINKED</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">val</span>
<span class="c1">// return *addr;</span>
</code></pre></div></div>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">STORE_CONDITIONAL</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">success</span><span class="o">?</span>
<span class="c1">// if (no store to addr since the paired LL) {</span>
<span class="c1">//   *addr = val;</span>
<span class="c1">//   return 1;  // success</span>
<span class="c1">// } else</span>
<span class="c1">//   return 0;  // failed</span>
</code></pre></div></div>

<p>Building a spinlock out of LL/SC:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">acquire</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="n">LOAD_LINKED</span><span class="p">(</span><span class="o">&amp;</span><span class="n">flag</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">STORE_CONDITIONAL</span><span class="p">(</span><span class="o">&amp;</span><span class="n">flag</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span>
            <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="n">release</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">flag</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="fetch-and-add-faa">Fetch-and-Add (FAA)</h3>

<p>Fetch-and-add (FAA) is a less common atomic instruction that could be implemented upon CAS or natively supported by the architecture. It atomically increments an integer counter and returns its old value.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">FETCH_AND_ADD</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">old_val</span>
<span class="c1">// old_val = *addr;</span>
<span class="c1">// *addr += 1;</span>
<span class="c1">// return old_val;</span>
</code></pre></div></div>
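<p>On mainstream compilers, these primitives are available without inline assembly through C11’s <code class="language-plaintext highlighter-rouge">stdatomic.h</code>. Below is a minimal sketch of the mapping; the lowercase function names are mine, chosen to mirror the pseudocode above:</p>

```c
#include <stdatomic.h>

static atomic_int flag = 0;
static atomic_int counter = 0;

// TEST_AND_SET: atomic_exchange writes 1 and returns the old value.
int test_and_set(atomic_int *addr) { return atomic_exchange(addr, 1); }

// COMPARE_AND_SWAP: atomic_compare_exchange_strong swaps only if *addr
// equals 'expected'; either way, 'expected' ends up holding the old value.
int compare_and_swap(atomic_int *addr, int expected, int desired) {
    atomic_compare_exchange_strong(addr, &expected, desired);
    return expected;  // old value, matching the pseudocode above
}

// FETCH_AND_ADD: atomic_fetch_add returns the value before the increment.
int fetch_and_add(atomic_int *addr) { return atomic_fetch_add(addr, 1); }
```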

<h2 id="spinning-lock-vs-blocking-lock">Spinning Lock vs. Blocking Lock</h2>

<p>Before we list a few lock implementations, I’d like to give a comparison between spinning locks and blocking locks.</p>

<p>A <em>spinning</em> lock (or <em>spinlock</em>, <em>non-blocking</em> lock) is a lock implementation where lock waiters spin in a loop, repeatedly checking for some condition. The examples given above are basic spinlocks. Spinlocks are typically used for low-level critical sections that are short, small, but invoked very frequently, e.g., in device drivers.</p>

<ul>
  <li>\(\uparrow\) Advantage: low latency for lock acquirement as there is no scheduling stuff kicking in – value changes reflect almost immediately;</li>
  <li>\(\downarrow\) Disadvantage:
    <ul>
      <li>Spinning occupies the whole CPU core; if the waiting time is long, it wastes CPU cycles that could have been used to schedule another free thread in to do useful work;</li>
      <li>Spinning also generates heavy cache invalidation traffic if not handled properly, as mentioned in the TAS section;</li>
      <li>Spinning locks make sense only if the scheduler is <em>preemptive</em>, otherwise there is no way to interrupt and break out of an infinite loop spin.</li>
    </ul>
  </li>
</ul>

<p>A <em>blocking</em> lock is a lock implementation where a lock waiter yields the core to the scheduler when the lock is currently taken. A lock waiter thread adds itself to the lock’s wait queue and blocks the execution of itself (called <em>parking</em>) to let some other free thread run on the core, until it gets woken up (typically by the previous lock holder) and scheduled back. It is designed for higher-level critical sections. The pros and cons are exactly the opposite of a spinlock.</p>

<ul>
  <li>\(\uparrow\) Advantage: not occupying full core during the waiting period, good for long critical sections;</li>
  <li>\(\downarrow\) Disadvantage: switching back and forth from/to the scheduler and doing scheduling stuff takes significant time, so if the critical sections are fast and invoked frequently, better just do spinning.</li>
</ul>

<p>It is possible to have smarter <em>hybrid</em> locks that combine spinning and blocking, now often referred to as <em>two-phase</em> locks. POSIX mutex locks have the option to first spin for a designated length of time; if the wait grows too long, the waiter switches to the scheduler and parks. The Linux lock based on its <a href="https://en.wikipedia.org/wiki/Futex"><em>futex</em> syscall support</a> <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">2</a></sup> is a good example of such a two-phase lock.</p>
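<p>A two-phase acquire can be sketched as spin-then-park. The sketch below stands in a condition variable for the futex syscall, and the spin bound of 1000 iterations is an arbitrary illustration, not a tuned value:</p>

```c
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    atomic_int locked;       // 0 = free, 1 = taken
    pthread_mutex_t mu;      // protects the parking path
    pthread_cond_t cv;       // waiters park here
} twophase_lock_t;

void twophase_acquire(twophase_lock_t *l) {
    // Phase 1: spin for a bounded number of iterations.
    for (int i = 0; i < 1000; i++)
        if (atomic_exchange(&l->locked, 1) == 0)
            return;
    // Phase 2: park on the condition variable until woken, then re-check.
    pthread_mutex_lock(&l->mu);
    while (atomic_exchange(&l->locked, 1) == 1)
        pthread_cond_wait(&l->cv, &l->mu);
    pthread_mutex_unlock(&l->mu);
}

void twophase_release(twophase_lock_t *l) {
    pthread_mutex_lock(&l->mu);    // serialize with parking waiters
    atomic_store(&l->locked, 0);
    pthread_cond_signal(&l->cv);   // wake one parked waiter, if any
    pthread_mutex_unlock(&l->mu);
}
```

Note that the release path takes the internal mutex so that a waiter cannot miss the wakeup between its flag check and its call to <code class="language-plaintext highlighter-rouge">pthread_cond_wait</code>.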

<h2 id="representative-lock-implementations">Representative Lock Implementations</h2>

<p>Here are a few interesting lock implementations that have appeared in the history of operating systems research, ordered from lower-level spinlocks to higher-level scheduling-based locks with more considerations.</p>

<h3 id="test-test-and-set-ttas">Test-Test-and-Set (TTAS)</h3>

<p>To ease the problem of cache invalidation in the simple TAS spinlock example, we could use a test-test-and-set (TTAS) protocol spinlock <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup>.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">acquire</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="n">flag</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{}</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">TEST_AND_SET</span><span class="p">(</span><span class="o">&amp;</span><span class="n">flag</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="n">release</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">flag</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The point is that, in the inner while loop, each waiter spins on a copy of <code class="language-plaintext highlighter-rouge">flag</code> held in its core’s private cache, so most of the time there is no invalidation traffic. Whenever the value of <code class="language-plaintext highlighter-rouge">flag</code> changes to <code class="language-plaintext highlighter-rouge">0</code> (the lock appears released), cache coherence invalidates the cached copy, terminating the inner loop. Only then does the waiter fall back to the outer <code class="language-plaintext highlighter-rouge">TEST_AND_SET</code> check.</p>

<h3 id="ticket-lock">Ticket Lock</h3>

<p>The ticket lock is a spinlock that uses the notion of “tickets” to improve arrival-order fairness.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">volatile</span> <span class="kt">int</span> <span class="n">ticket</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">volatile</span> <span class="kt">int</span> <span class="n">turn</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="kt">void</span> <span class="nf">acquire</span><span class="p">()</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">myturn</span> <span class="o">=</span> <span class="n">FETCH_AND_ADD</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ticket</span><span class="p">);</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">turn</span> <span class="o">!=</span> <span class="n">myturn</span><span class="p">)</span> <span class="p">{}</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="n">release</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">turn</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The downside is still the same cache throttling problem as in basic spinlocks.</p>

<blockquote>
  <p>A comparison table across Linux low-level spinlock implementations, including LL/SC and ABQL locks, can be found in <a href="https://en.wikipedia.org/wiki/Ticket_lock#Comparison_of_locks">this Wikipedia section</a> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">4</a></sup>.</p>
</blockquote>

<h3 id="mellor-crummey-scott-mcs">Mellor-Crummey Scott (MCS)</h3>

<p>The MCS lock uses a linked-list structure to optimize for the cache problem beyond TTAS. It is based on atomic swap: it queues the waiters into a linked list and lets each waiter spin on its own node’s <code class="language-plaintext highlighter-rouge">is_locked</code> variable. A good demonstration of how this algorithm works can be found <a href="https://lwn.net/Articles/590243/">here</a> <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">5</a></sup>.</p>
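<p>The queueing idea can be sketched with C11 atomics: each thread brings its own queue node, an atomic exchange on the tail enqueues it, and the thread then spins only on its private flag. This is a sketch of the classic algorithm, not the Linux kernel implementation:</p>

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_int is_locked;
} mcs_node_t;

typedef struct { mcs_node_t *_Atomic tail; } mcs_lock_t;

void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
    me->next = NULL;
    me->is_locked = 1;
    // Atomically append our node to the queue tail.
    mcs_node_t *prev = atomic_exchange(&lock->tail, me);
    if (prev == NULL)
        return;                              // queue was empty: lock is ours
    prev->next = me;                         // link behind the previous waiter
    while (atomic_load(&me->is_locked)) {}   // spin on our OWN node's flag
}

void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        // No known successor: try to swing the tail back to empty.
        mcs_node_t *expect = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expect, NULL))
            return;
        // A waiter is mid-enqueue; wait for it to link itself behind us.
        while ((succ = atomic_load(&me->next)) == NULL) {}
    }
    atomic_store(&succ->is_locked, 0);       // hand the lock to the successor
}
```

Because each waiter spins on its own cache line, a release invalidates only the successor’s line rather than every waiter’s.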

<p>MCS-TP (<em>time-published</em>) is an enhancement to MCS that uses timestamps to let a thread park after spinning for some time, similar to the two-phase POSIX locks mentioned above.</p>

<h3 id="remote-core-locking-rcl">Remote Core Locking (RCL)</h3>

<p>Lozi et al. proposed a <em>lock delegation</em> design that aims to improve the scalability of locks in <a href="https://www.usenix.org/conference/atc12/technical-sessions/presentation/lozi">this ATC’12 paper</a> <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">6</a></sup>. Remote core locking recognizes the fact that, at any time, there will only be one thread executing the critical section, so why not let a dedicated “server” thread do that. For a critical section that is invoked frequently, RCL allocates a thread just for executing that critical section’s logic. Other threads use atomic cacheline operations to put themselves into a fixed mailbox-like queue, and the server thread loops over the queue, serving them in order. This prevents the lock data structure from being cache-invalidated and transferred to different cores at different times.</p>

<p><img src="/assets/img/remote-core-locking-example.png" alt="RCL" /></p>

<p>Figure 1 of the paper.</p>

<p>The downside is that it is harder to pass data/results out of the critical section. The server core is always occupied, and it can only serve a chosen set of critical sections.</p>
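<p>Ignoring blocking and batching details, the mailbox mechanism can be illustrated with a toy per-client request array. All names here are illustrative, not from the paper, and a real client would of course need the server looping concurrently:</p>

```c
#include <stdatomic.h>
#include <stddef.h>

#define NCLIENTS 4

// One request slot per client: the client installs a function pointer,
// the server runs it and clears the slot to signal completion.
typedef void (*cs_fn)(void *);
typedef struct {
    _Atomic(cs_fn) fn;
    void *arg;
} rcl_slot_t;

static rcl_slot_t mailbox[NCLIENTS];  // zero-initialized: all slots empty

// Client side: post the critical section and spin until it has been served.
void rcl_execute(int client, cs_fn fn, void *arg) {
    mailbox[client].arg = arg;
    atomic_store(&mailbox[client].fn, fn);
    while (atomic_load(&mailbox[client].fn) != NULL) {}
}

// Server side: one pass over the mailbox, serving pending requests in order.
void rcl_server_pass(void) {
    for (int i = 0; i < NCLIENTS; i++) {
        cs_fn fn = atomic_load(&mailbox[i].fn);
        if (fn != NULL) {
            fn(mailbox[i].arg);                  // run the critical section
            atomic_store(&mailbox[i].fn, NULL);  // mark the slot done
        }
    }
}

// Example critical section: increment a shared counter.
static void incr(void *p) { (*(int *)p)++; }
```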

<h3 id="shuffle-lock-shfl">Shuffle Lock (SHFL)</h3>

<p>Kashyap et al. proposed an interesting enhancement called <em>shuffling</em> to blocking locks in <a href="https://taesoo.kim/pubs/2019/kashyap:shfllock.pdf">this SOSP’19 paper</a> <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">7</a></sup>. Shuffle locks are <em>NUMA-aware</em>: they take into consideration that on modern non-uniform memory access architectures, cores in the same NUMA socket (or in closer sockets) have faster access to local memory than to memory on a different (“remote”) socket. Hence, it is a nice idea to dynamically reorder the wait queue of a lock depending on which NUMA socket each waiter resides on.</p>

<p>Periodically, the lock assigns the first waiter in the queue to be the shuffler, which traverses the remaining wait queue and reorders it, grouping waiters on the same NUMA socket together. There is then a higher chance that, once the lock is released, the next holder scheduled will be on the same NUMA socket as the previous holder, so the lock data structures transfer faster and there is less cache invalidation traffic.</p>

<p><img src="/assets/img/shuffle-lock-example.png" alt="ShuffleLock" /></p>

<p>Figure 5 of the paper.</p>
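<p>Concurrency aside, the reordering step itself is just a stable grouping of the wait queue by socket ID. Its effect on queue order can be illustrated as follows (socket IDs and the fixed queue bound are made up for the example):</p>

```c
// Stable-group an array of waiter socket IDs so that waiters sharing the
// shuffler's socket move to the front, preserving relative order otherwise.
// waiters[0] is the shuffler; n is assumed to be at most 64 for this toy.
void shuffle_by_socket(int *waiters, int n) {
    int sock = waiters[0];  // the shuffler's NUMA socket
    int tmp[64], k = 0;
    for (int i = 0; i < n; i++)      // first pass: same-socket waiters
        if (waiters[i] == sock) tmp[k++] = waiters[i];
    for (int i = 0; i < n; i++)      // second pass: everyone else, in order
        if (waiters[i] != sock) tmp[k++] = waiters[i];
    for (int i = 0; i < n; i++) waiters[i] = tmp[i];
}
```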

<p>However, fairness is not guaranteed in this case, as a lock waiter could be pushed back in the queue repeatedly; this starvation risk remains a big problem to be solved in shuffle locks.</p>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Mutual_exclusion">https://en.wikipedia.org/wiki/Mutual_exclusion</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p>Futex: <a href="https://en.wikipedia.org/wiki/Futex">https://en.wikipedia.org/wiki/Futex</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Test_and_test-and-set">https://en.wikipedia.org/wiki/Test_and_test-and-set</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Ticket_lock#Comparison_of_locks">https://en.wikipedia.org/wiki/Ticket_lock#Comparison_of_locks</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>MCS: <a href="https://lwn.net/Articles/590243/">https://lwn.net/Articles/590243/</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>RCL: <a href="https://www.usenix.org/conference/atc12/technical-sessions/presentation/lozi">https://www.usenix.org/conference/atc12/technical-sessions/presentation/lozi</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>Shuffling: <a href="https://taesoo.kim/pubs/2019/kashyap:shfllock.pdf">https://taesoo.kim/pubs/2019/kashyap:shfllock.pdf</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Guanzhou Hu</name></author><category term="Technical" /><summary type="html"><![CDATA[Concurrency plays a significant role in modern multi-core operating systems. We want a locking mechanism that is efficient (low latency), scalable (increasing the number of threads does not degrade performance too badly), and fair (considers the order of acquirement and does not make any one thread wait too long). This post summarizes a bit on hardware atomic instructions which modern locks are built upon, a comparison between spinning and blocking locks, and a partial list of representative lock implementations.]]></summary></entry></feed>