Sometimes, I write down what I learned, what I thought, what surprised me, and what I wanted to remember.
The attached files present a practical TLA+ specification of MultiPaxos that very closely models how a real state machine replication (SMR) system would implement this protocol. I did not find anything similar on the web, so I’d like to share it with anyone interested.
Recently, I needed to benchmark a lightweight distributed system codebase on a single host for my current research project. I wanted control over the network performance parameters (including delay, jitter distribution, rate, loss, etc.) and the ability to test a wide range of parameter values; meanwhile, I wanted to avoid pure software-based simulation. Thus, I opted for kernel-supported network emulation. In this post, I document what I tried and what finally worked.
Previously, I made a blog post about common consistency models in distributed state machine replication (SMR). As I have recently been piecing together my scattered knowledge about distributed replication systems, I found some inaccuracies and ambiguities in that old post. This short post lists some patches and complementary material from my revisit of this convoluted topic.
Described in this classic paper by Jim Gray et al., hierarchical locking has been a well-studied idea in database management systems (DBMS). Despite its long history, I found the theoretical notion of lock modes unintuitive and hard to understand upon first encounter. This post tries to distill the core motivations of hierarchical locking, break its design down into three pieces, and describe them progressively, to hopefully clarify this beautiful idea.
This is a short post where I note down some of my insignificant thoughts about the interaction between AI and systems. With the rapid evolution of AI technologies, especially in the field of machine learning (ML), there is now a rising interest in studying the intersection between AI and computer systems design. The combination of the two can further be categorized into two directions: building systems for AI applications (Sys for AI) and using AI to empower smarter systems (AI for Sys).
Crash consistency is one of the most essential guarantees that a storage system needs to make to ensure correctness. In a file system (FS) setting, consistency techniques must be carefully designed, integrated with the layout of blocks, and deployed in the procedure of updates. This post summarizes the three classic FS consistency techniques: journaling, shadow paging (CoW), and log-structuring, in a formal way and analyzes their pros & cons.
Concurrency plays a significant role in modern multi-core operating systems. We want a locking mechanism that is efficient (low latency), scalable (increasing the number of threads does not degrade performance too badly), and fair (respects the order of acquisition and does not make any one thread wait too long). This post gives a brief summary of the hardware atomic instructions that modern locks are built upon, a comparison between spinning and blocking locks, and a partial list of representative lock implementations.
This post summarizes the different types of operating system kernel structures (kernel models) and virtual machine structures. Apart from the best-known monolithic kernel model, OS kernels may also take the form of microkernel, semi-microkernel, exokernel, kernel bypassing library for certain subsystems, or disaggregated kernel. Virtualization of OS environment as a whole (i.e., virtual machines) has become popular with the rapid trend towards cloud computing. Virtual machines can be categorized as type-1a vs. type-1b vs. type-2.
When doing systems research, we sometimes need to modify or add new code to the Linux kernel. This post lists a successful workflow for building and installing a custom Linux kernel in an Ubuntu 18.04/20.04 environment (deb), along with steps to debug the Linux kernel by running it on the QEMU emulator and attaching GDB.
As the miniaturization and cell density of traditional 2D NAND SSDs reach a manufacturing bottleneck, 3D NAND SSDs have come onto the market. They push block capacity a little further, but suffer from more severe write amplification and are more expensive, and are thus not a perfect solution. Intel 3D XPoint (officially branded as Optane), a hybrid design sitting between DRAM and NAND flash storage, adds a new possibility to the storage hierarchy.
Caching is an essential technique used broadly in computer system hierarchies. This post briefly summarizes existing cache mode configurations and cache eviction algorithms. This serves as a shallow review of cache systems before I go deeper into this field.
The name Blockchain has been a hot word in the past few years. Despite the controversy behind some of its applications such as virtual currency, blockchain itself is actually an appealing proposal towards decentralized trust over the Internet. It is worth looking into when studying modern distributed systems, especially as a good example of the design and implementation of decentralized systems.
In file system & database design, write buffering (write grouping or coalescing) is a commonly used technique to avoid in-place updates and expose only sequential writes to disks. The Log-Structured Merge Tree (LSM tree) is a modern practical solution that sacrifices a little read performance to enable efficient write buffering. Journaling (write-ahead logging) is another file system term that is sometimes confused with write buffering. In short, write buffering is for write performance and journaling is for crash recovery: they are different, but can be combined.
One of the most dangerous kinds of security attacks is the side-channel attack, since it falls outside the designed threat model. Meltdown & Spectre, the most recent side-channel vulnerabilities found on modern microprocessors, are a good demonstration of the sneakiness and danger of side-channel attacks. These attacks combine CPU speculative execution with a cache timing side channel.
Sharding is a common distributed system design to scale out and achieve better performance. Distributed transactions (concurrency control + atomic commits) are used to coordinate sharded nodes. It is important to implement serializable distributed transactions for such a system to act correctly.
NOTE: this post is outdated and contains some of my early misunderstandings, so please read skeptically. A new post series on understandable categorization and in-depth analysis of consistency models is coming out soon, which will serve as the theoretical foundation of my ongoing research.
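In distributed systems, basic consensus algorithms aim to let all nodes reach strong consistency (linearizability) on a sequence of client proposals, even when nodes may crash or restart and network messages between nodes may be reordered, lost, or duplicated. This makes it possible to implement replicated state machines and achieve effective fault tolerance.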
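When learning about storage systems, one inevitably runs into many hardware-level terms and abbreviations, covering hardware devices, interfaces, and transfer and control protocols. While taking part in a supercomputing competition, it occurred to me that I should organize these into a post as a summary. Originally written in March and revised again in October, as follows. The image URLs still point to the original CNBlogs site and have not been migrated; I will replace them with newer images whenever the links break.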
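As an emerging systems programming language, Rust draws on the strengths and weaknesses of existing systems programming languages, targets both safety and speed, and pioneers a novel programming language paradigm in which almost all safety checks are done at compile time. One could say that Rust carries the ideals of its creator Graydon Hoare and of the many developers in the new Internet era who have grown increasingly concerned with safety. It has also attracted the attention of cutting-edge systems researchers such as those at MIT (see PDOS PhD student Jon Gjengset).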
lldb instead of GNU gdb on macOS >= 10.14 Mojave directly (app verification scheme on newer macOS gets really complicated). If you really wanna make it, the following procedure is what finally worked for me.
Below is a summary of the steps that finally worked, after hitting various pitfalls while installing and using the GDB debugger on Mojave.
The Wisc campus VPN and our CS departmental VPN both use GlobalProtect. On the user side, GlobalProtect clients cannot configure VPN split tunneling, meaning that once connected, all outbound traffic from my host machine goes through the VPN. I have a daily need to access my lab machine sitting behind the departmental VPN, yet I would like all other traffic (e.g., searching Google) to bypass the VPN. I came up with a solution that uses one or two Raspberry Pi boards as an always-on SSH proxy server.
This short post is a summary list of all the system building tips/rules/laws boxes in the OSTEP book (also see my reading note). Without proper context, these tips make little sense, so I included the chapter numbers as well for easier back-tracing.
Fragments of knowledge I have encountered while studying, mostly outside my major fields. Since most of them are not in my main areas, I have not studied or organized them systematically; treat this post as a memo and casual jottings.
This post summarizes my personal development environment and tool configuration on macOS >= 10.14 and includes a brief memo on setting up a WSL 2 based development environment on Windows >= 10, for future reference.
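I have more or less settled down in the US. I may end up staying in this small apartment for 5 years, so I bit the bullet and built a mid-to-high-end (by 2020 standards) gaming + workstation PC to serve as my little home for 5 years of study. Here I record my first hands-on PC build in as much detail as I can.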
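This three-part series collects some stray thoughts from the 2020 COVID-19 pandemic, written while quarantined in a small rented room in Boston. Both the world and my own life are going through an important period of transition, so I wrote these reflections in the hope of looking back someday and seeing progress. This is part #3.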
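This three-part series collects some stray thoughts from the 2020 COVID-19 pandemic, written while quarantined in a small rented room in Boston. Both the world and my own life are going through an important period of transition, so I wrote these reflections in the hope of looking back someday and seeing progress. This is part #2.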
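This three-part series collects some stray thoughts from the 2020 COVID-19 pandemic, written while quarantined in a small rented room in Boston. Both the world and my own life are going through an important period of transition, so I wrote these reflections in the hope of looking back someday and seeing progress. This is part #1.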