Operating Systems

Author: Guanzhou (Jose) Hu 胡冠洲 @ UW-Madison

This note is a reading note of the book: Operating Systems: Three Easy Pieces (OSTEP) v1.01 by Prof. Remzi Arpaci-Dusseau and Prof. Andrea Arpaci-Dusseau. Figures included in this note are from this book unless otherwise stated.

  • Introduction to OS
  • Virtualizing the CPU: Processes & Scheduling
    • Abstraction of Process
      • Machine State
      • Process Status
      • Process Control Block (PCB)
      • Process APIs
    • Time-Sharing of the CPU
      • Privilege Modes
      • Hardware Interrupts
      • Context Switch & Scheduler
    • CPU Scheduling Policies
      • Basic Policies
      • Multi-Level Feedback Queue (MLFQ)
      • Lottery Scheduling
      • Completely Fair Scheduler (CFS)
  • Virtualizing the Memory: Memory Management
    • Abstraction of Address Space
      • Address Space Layout
      • Memory APIs
    • Address Mapping & Translation
      • Base & Bound
      • Hardware/OS Responsibilities
      • Segmentation
      • Concept of Paging
      • Linear Page Tables
    • Free-Space Management Policies
      • Splitting & Coalescing
      • Basic Policies
      • Segregated Lists
      • Buddy Allocation
    • Advanced Paging
      • Translation Lookaside Buffer (TLB)
      • Multi-Level Page Tables
      • Concept of Swapping
      • Page Replacement Policies
    • Caching Everywhere
      • Memory Hierarchy
      • Examples of Caching
  • Concurrency: Multi-Tasking & Synchronization
    • Abstraction of Thread
      • Multi-Threaded Address Space
      • Thread Control Block (TCB)
      • Thread APIs
    • Synchronization
      • Race Conditions
      • Atomicity & Mutex Locks
      • Ordering & Condition Variables
      • Semaphores
    • Implementing Locks
      • Controlling Interrupts
      • Hardware Atomic Instructions
      • Spinning vs. Blocking
    • Advanced Concurrency
      • Lock-Optimized Data Structures
      • Concurrency Bugs
      • Event-Based Model
      • Multi-CPU Scheduling
  • Persistence: Storage Devices & File Systems
    • General Input/Output (I/O)
      • Polling vs. Interrupts
      • Direct Memory Access (DMA)
      • OS Software Stack
    • Storage Devices
      • Hard Disk Drives (HDD)
      • HDD Scheduling Policies
      • RAID Arrays
      • Solid-State Drives (SSD)
    • Abstraction of Files
      • Files & Directories
      • File System (FS) APIs
    • File System Implementation
      • VSFS Data Structures
      • VSFS Access Paths
      • UNIX Fast FS (FFS)
    • Crash Consistency
      • Crash Scenarios
      • File System Checker (FSCK)
      • Journaling (Write-Ahead Logging)
    • More on Storage Systems
      • Log-Structured FS (LFS)
      • Access Control
      • Disk Data Integrity
  • Advanced/Related Topics

Introduction to OS

An Operating System (OS) is a body of software sitting in between software applications and a Von Neumann computer architecture. An OS connects applications to physical hardware. It makes abstractions of the underlying hardware and provides an easy-to-use interface for running portable software on physical hardware.

Abstraction is a great idea in both computer architecture and operating systems. It hides implementation complexity about the underlying layer and exposes a unified model of how to use the underlying layer to the upper layer. Check out the first section of this note.

In the layers of abstractions, we call the operations supported by the lower-layer as mechanisms, and the algorithms/decisions made by the higher-layer on how to use the mechanisms to achieve a goal as policies (or disciplines).

An OS provides support in at least these three general aspects:

  1. Virtualization: taking a possibly limited physical resource (processor, memory, storage, ...) and transforming it into a more general, portable, and easy-to-use virtual interface for user applications to use
  2. Concurrency: acting as a resource manager which supports multiple user applications (and multiple tasks inside one application) to run concurrently and coordinates among running entities, ensuring correct, fair, and efficient sharing of resources
  3. Persistence: data can be easily lost on volatile devices such as DRAM. An OS allows users to talk to external devices - including persistent storage drives - through Input/Output (I/O)

A modern operating system also pursues some other goals apart from the above stated. They are:

Generally, the most essential part of an OS is the kernel, which is the collection of code that implements the above-mentioned core functionalities. The kernel is typically a piece of static code (hence not a dynamic running entity). A monolithic OS consists of the kernel and some upper-level system applications that provide a more friendly UI. The OS field also covers part of the hardware architecture and external devices.

Virtualizing the CPU: Processes & Scheduling

One of the most important abstractions an OS provides is the process: a running instance of a program. We typically want to run many processes at the same time (e.g., a desktop environment, several browser windows, a music player, etc.), more than the number of available CPU cores. The OS must be able to virtualize a physical resource and let multiple processes share the limited resource. This section focuses on sharing the CPU, which is the most fundamental resource required to kick off any process.

Abstraction of Process

A process is simply an instance (the dynamic instance, doing actual work) of a piece of program (the static code + data, residing on persistent storage) running in the system. There can be multiple processes executing the same piece of program code. Each process represents a somewhat isolated entity.

Machine State

Running a process instance of a program requires the OS to remember the machine state of the process, which typically consists of the following information:

Process Status

A process can be in one of the following states at any given time:

Process Control Block (PCB)

The OS must have some data structures to hold the information of each process. We call the metadata structure of a process the process control block (PCB, or process descriptor). This structure must include the machine state of the process, the status of the process, and any other necessary information related to the process. For example, the xv6 OS has the following PCB struct:
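Below is an abridged sketch of xv6's struct proc (from its proc.h; field order and names may differ slightly across xv6 versions):

```c
// Abridged sketch of xv6's per-process state (PCB).
enum procstate { UNUSED, EMBRYO, SLEEPING, RUNNABLE, RUNNING, ZOMBIE };

struct proc {
  uint sz;                     // size of process memory (bytes)
  pde_t *pgdir;                // page table
  char *kstack;                // bottom of the kernel stack for this process
  enum procstate state;        // process status
  int pid;                     // process ID
  struct proc *parent;         // parent process
  struct trapframe *tf;        // trap frame for the current trap/syscall
  struct context *context;     // saved kernel registers; swtch() here to run
  void *chan;                  // if non-zero, sleeping on this channel
  int killed;                  // if non-zero, has been killed
  struct file *ofile[NOFILE];  // open files
  struct inode *cwd;           // current working directory
  char name[16];               // process name (for debugging)
};
```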

The collection of PCBs is the process list (or task list), the first essential data structure we meet in an OS. It can be implemented as just a fixed-sized array of PCB slots as in xv6, but can also take other forms such as a linked list or a hash table.

Process APIs

These interfaces are available on any modern OS to enable processes:

Application programming interface (API) means the set of interfaces a lower-layer system/library provides to upper-layer programs. The APIs that an OS provides to user programs are called system calls (syscalls). Everything a user program wants to do that might require system-level privileges, such as creating processes, accessing shared resources, and communicating with other processes, is typically done by invoking system calls.

In UNIX systems, process creation is done through a pair of fork() + exec() syscalls.
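As a minimal sketch (standard POSIX calls; error handling abridged), a shell-like parent forks a child, the child exec()s wc on a hypothetical file, and the parent waits:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();                        // create a near-identical child
    if (pid < 0) {                             // fork failed
        perror("fork");
        exit(1);
    } else if (pid == 0) {                     // child: replace its image
        char *argv[] = {"wc", "note.md", NULL};
        execvp(argv[0], argv);                 // only returns on error
        perror("execvp");
        exit(1);
    } else {                                   // parent: wait for the child
        int status;
        waitpid(pid, &status, 0);
        printf("child %d exited with %d\n", (int)pid, WEXITSTATUS(status));
    }
    return 0;
}
```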

Copying over the entire state of a process when forking it could be very expensive. A general technique called copy-on-write (COW) can be applied: upon forking, the system does not immediately duplicate most of the state (more specifically, most pages in the address space); a page is copied over only when one of the processes attempts to modify it.

See Chapter 5, Figure 5.3 of the book for a fork+exec+wait example, Section 5.4 for an introduction to UNIX shell terminology, and Section 5.5 for control signal handling.

Time-Sharing of the CPU

To virtualize the CPU and coordinate multiple processes, the virtualization must be both performant (no excessive overhead) and controlled (the OS controls which one runs at which time; bad processes cannot simply run forever and take over the machine). The OS balances these two goals with the following two techniques:

The OS needs to solve several problems to enable the combination of these two techniques.

Privilege Modes

If we just let a process run all possible instructions directly, it will have dominant control over the machine. Hence, the processor lists a set of sensitive instructions as privileged instructions, which can only be run in high privilege mode. A user process normally runs in user mode, with restricted permissions. When it invokes a syscall (mentioned in the section above), it switches to kernel mode to execute the registered syscall handler containing privileged instructions. After the handler has done its work, it returns the process back to user mode.

Examples of privileged instructions in x86 (which can only be executed in "ring-0" mode, as opposed to the least-privileged "ring-3" mode) include:

The switching between modes is enabled through a mechanism called a trap. The special trap instruction simultaneously jumps to a pre-registered location in the kernel (identified by a trap number) and raises the privilege level to kernel mode. This also includes changing the stack pointer to the kernel stack reserved for this process and saving the process's user registers onto that kernel stack.

To let the hardware know where the corresponding handler code for a given trap number is, the OS registers the address of a trap table (a mapping from trap numbers to trap handler addresses) into a special hardware register. This trapping mechanism provides protection since a user process cannot perform arbitrary system-level actions, but rather must request a particular action via a number.

Trapping into kernel from software through a syscall is sometimes called a software interrupt.

We often use trap to name an active syscall made by the program. We use exception (or fault) to name a passive fall-through into the kernel when the program misbehaves (e.g., tries to directly call a privileged instruction) or encounters an error (e.g., runs out of memory, OOM).

Hardware Interrupts

Another big issue is that the OS must be able to regain control in the middle of the user-level execution of a process and achieve a switch between processes. We need help from a special hardware component - the timer.

A process can trap/fault into the kernel; so can a hardware device. The behavior of a hardware device sending a signal to the processor to pause its current user-code execution and run a specified handler is called an interrupt. The handler is often called an interrupt service routine (ISR). Similar to traps, there is an interrupt table mapping interrupting device numbers to interrupt handler addresses (sometimes the software trap table is just embedded in the interrupt table).

One particularly important type of interrupt is the timer interrupt, issued by the timer every configured interval (e.g., 10ms). The ISR for the timer interrupt is where the OS regains control and decides which process to schedule for the next time slot.

Examples of other hardware interrupts include:

Interrupts need to be disabled (meaning incoming interrupts are ignored or deferred) during the handling of a trap/interrupt, so that every interrupt is handled to its completion. Also see double fault and triple fault if interested.

Context Switch & Scheduler

We have defined the context of a process as the set of important register values. Switching from process A to process B is a context switch procedure (a sketch of the saved context follows the steps below):

  1. Save the current register values (the context of A) into A's PCB
  2. Restore the saved register values (the context) of B from B's PCB
  3. Jump to where B left off - right after the switch, B is most likely executing on its kernel stack, because the last time B was scheduled out, it must have been inside the scheduler routine
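For concreteness, here is a sketch of what the saved per-process context might look like, modeled after xv6 on x86 (the actual switch routine, swtch, is a few lines of assembly):

```c
// Saved kernel registers of a process, in the style of xv6 (x86).
struct context {
  uint edi, esi, ebx, ebp;  // callee-saved registers
  uint eip;                 // saved return address: where to resume execution
};

// void swtch(struct context **old, struct context *new);
// Implemented in assembly: push the callee-saved registers onto the current
// kernel stack, store the stack pointer into *old, load new into the stack
// pointer, pop the saved registers, and return into the new context.
```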

The timer interrupt ISR typically calls the scheduler: a routine that chooses which process to run for the next time slot, possibly using its knowledge of what was scheduled in the past, based on a scheduling policy. It then performs a context switch to the target process. The entire picture looks like:

Note the difference between user registers (those at user-level execution of the program) and kernel registers (those actually stored in the context). At a timer interrupt, the user registers are already stored into the process's kernel stack and the registers have been switched to the set for kernel execution (PC pointing to the handler code, SP pointing to kernel stack, etc.). Hence, the context saved for a process is actually the set of its kernel registers, and the user registers are recovered by the "return-from-trap" operation - this is when the process returns back to user mode.

Context switch is not free and there is observable overhead (saving/restoring registers, making caches/TLBs/branch predictors cold, etc.). We should not ignore this overhead when developing scheduling policies.

CPU Scheduling Policies

With the trap/interrupt mechanism and the context switch mechanism in hand, it is now time to develop scheduling policies for the scheduler to decide which process runs next. Determining the workload - here, the processes running in the system and their characteristics - is critical to building policies. A policy can be more fine-tuned if we know more about the workload.

Basic Policies

Let's first (unrealistically) assume that our workloads are:

We look at these metrics to compare across policies:

Some basic, classic policies are:

Multi-Level Feedback Queue (MLFQ)

The biggest problem with the above basic policies is that they assume job durations are known beforehand (a priori knowledge). However, this is hardly true in any real system. Good scheduling policies tend to learn something from jobs' past behavior (history) and make decisions accordingly. One of the best-known approaches is the Multi-Level Feedback Queue (MLFQ). It tries to trade off between optimizing turnaround time and minimizing response time. In other words, it mixes interactive or I/O-intensive jobs well with long-running CPU-intensive jobs.

MLFQ has a number of distinct queues, each assigned a different priority level. A job is in a single queue at any given time.

The key lies in how MLFQ dynamically sets priorities for jobs. MLFQ first assumes that a job might be short-running & interactive, hence giving it the highest priority. Then, every time it runs to the end of a time slice without blocking itself on e.g. I/O, we reduce its priority level by 1, moving it to the queue one level down.

There are still a few problems. There might be starvation: if there are quite a few interactive jobs, a long-running job might have no chance to run. A program could game the scheduler by issuing an I/O at the very end of every time slice. The behavior of a job might also change over time. Hence, we add several rules:

Smarter MLFQ implementations tune these parameters instead of setting them to default constants: adjusting the number of priority levels (#queues), giving lower levels longer time-slice quanta, using math formulas (e.g., decaying) to calculate quanta and allotment, etc. OS implementations such as FreeBSD, Solaris, and Windows NT use some form of MLFQ as the base scheduler.

Lottery Scheduling

We also examine a different type of scheduler known as the fair-share scheduler. Instead of minimizing turnaround time, it tries to guarantee that each job obtains a certain percentage of CPU time - a proportional share. Lottery scheduling is a good example of such a scheduler; see the paper.

Lottery scheduling assigns each job a number of tickets. At each scheduling decision point, it uses randomness to decide who wins the next slot. Say A has 75 tickets, B has 25, and C has 100, so the total number of tickets is 200. The system picks a random number num between 0 and 199 and walks a running sum: if num < 75, A wins; else if 75 <= num < 100, B wins; otherwise C wins.
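A decision-point sketch of this walk (the array contents are the example ticket counts above):

```c
#include <stdlib.h>

// Pick the winning job index for the next time slice by lottery.
int lottery_pick(const int tickets[], int njobs, int total_tickets) {
    int winner = rand() % total_tickets;   // random winning ticket in [0, total)
    int counter = 0;
    for (int i = 0; i < njobs; i++) {
        counter += tickets[i];             // walk the running sum
        if (counter > winner)
            return i;                      // job i holds the winning ticket
    }
    return njobs - 1;                      // unreachable if total is correct
}

// Example: int tickets[] = {75, 25, 100};  lottery_pick(tickets, 3, 200);
```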

There are a few extra ticket mechanisms that might be useful:

Stride scheduling avoids randomness: each process's stride is a large constant divided by its number of tickets. Whenever a process finishes a slot, increment its pass value by its stride, and pick the process with the smallest pass value to run next. The strength of lottery scheduling (using randomness) is that it does not need any global state.
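A sketch of the stride decision loop (names here are illustrative):

```c
#define STRIDE_CONST 10000          // the large constant

struct sjob { int tickets, stride, pass; };  // stride = STRIDE_CONST / tickets

// Pick the job with the smallest pass value, then advance its pass.
int stride_pick(struct sjob jobs[], int njobs) {
    int min = 0;
    for (int i = 1; i < njobs; i++)
        if (jobs[i].pass < jobs[min].pass)
            min = i;
    jobs[min].pass += jobs[min].stride;
    return min;
}
```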

Completely Fair Scheduler (CFS)

Linux adopts a scalable, weighted-RR, fair-share scheduler named the Completely Fair Scheduler (CFS), see here. It tries to be efficient and scalable by spending very little time making scheduling decisions. CFS accounts virtual runtime (vruntime) for every job.

CFS uses a red-black tree to keep track of all ready processes for efficiency. When a process goes to sleep, it is removed from the tree. CFS also has many advanced features including heuristics and scheduling across groups and multiple CPUs.

Virtualizing the Memory: Memory Management

Apart from virtualizing the CPU, an OS must also be able to enable sharing of another important resource across processes - the memory. Processes require memory space to hold all of their runtime data, and it would be way too slow if we save/restore the entire memory content upon every context switch, especially when we are time-sharing the CPU (frequent switches). Hence, the OS must be able to do space-sharing of the memory: put all processes' in-memory data in the physical memory, and somehow let a process know how to address a byte it wants. Upon a context switch, we just switch the registers so that they point to correct addresses for the switched process. The OS must provide protection, efficiency in addressing, and a way to handle insufficient space if the physical memory is running out.

Abstraction of Address Space

To enable the space-sharing of memory, we again turn to virtualization. We give each process an illusion of the ability to address across its own memory space, which could be huge and sparse. This abstraction is called the address space of the process - it is the running program's view of memory (virtual memory, VM) and is completely decoupled from what the DRAM chip actually provides (physical memory). A program uses virtual addresses to locate any byte in its address space.

Address Space Layout

An OS must give a clear definition of the address space layout it expects. Conventionally, an address space contains the following segments and is structured in the following way:

In reality, the size of the virtual address space typically depends on the number of bits the hardware platform uses for addressing. If the hardware operates on 32-bit addresses, we could use the entire range from 0x00000000 to 0xffffffff. The virtual address space could be huge and very sparse (mostly empty) - this is not a problem. It is just the process's view of memory; we are not putting this space directly into physical memory at address 0. The OS will have some way to map the occupied bytes in the address spaces of all processes onto physical memory, which we will talk about in the next section.

This layout is just a convention for single-threaded processes. There could be other arrangements and things could be more complicated when a process becomes multi-threaded, which we will talk about in the concurrency section.

Memory APIs

These are the interfaces in C programming on UNIX systems for user programs to allocate memory. Other platforms might have different interfaces but they share the same spirit.

With the malloc() + free() interface, here is a list of common memory errors in C programming:

They often lead to the infamous segmentation fault (SEGFAULT), which happens when accessing a memory address the process shouldn't touch. Beware that "code compiles & runs with no segfault" does not mean it runs semantically correctly - you could be accessing incorrect (but valid) memory addresses, which is still a serious bug.

Address Mapping & Translation

Every address a user program sees is virtual. The OS makes the virtual memory system transparent to users and implicitly translates any address a process asks for into a physical memory address. This is the procedure of (hardware-based) address translation, and it implies the address mapping scheme the OS adopts. Here we list three address mapping and translation schemes. For efficiency, the OS makes use of hardware support (from a few registers to full page table support) instead of simulating everything in software.

The idea of interposition is really powerful and essential in systems. When virtualizing the CPU, the hardware timer sends timer interrupts to interpose the execution of a process to let the OS re-gain control periodically; this enables transparent time-sharing of the CPU. When virtualizing the memory, the hardware MMU interposes on each memory access to perform address translation; this enables transparent space-sharing of the memory.

We will discuss three address translation schemes, along with some complementary knowledge:

  1. Base & Bound
  2. Segmentation
  3. Paging

Base & Bound

Let's first assume that address spaces are small and compact, so each can fit in physical memory contiguously. With this assumption, the most intuitive way of address mapping is to simply relocate the address space to start at some offset in physical memory. The base & bound method (BB, or dynamic relocation, or hardware-based relocation) makes use of two hardware registers:

Upon a context switch to a process, the two registers get loaded with the base & bound values for this process. This step requires OS intervention to decide where to relocate (i.e., what the base is). During execution, the address translation for every memory access is done directly by the hardware. This part of the hardware (doing address translations) is often called the memory management unit (MMU).

The MMU algorithm for base & bound is obviously simple:
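Roughly (a sketch; raise_exception and PROTECTION_FAULT are stand-ins for whatever fault-raising mechanism the hardware uses):

```c
// Base & bound translation, done by the MMU on every memory access (sketch).
uint64_t translate_bb(uint64_t vaddr, uint64_t base, uint64_t bound) {
    if (vaddr >= bound)
        raise_exception(PROTECTION_FAULT);  // out of this process's space
    return base + vaddr;                    // relocate into physical memory
}
```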

Without hardware support, purely software-based relocation is not so great. We could let a loader scan through the machine code of the executable file and add a base value to every address it sees at process-loading time. However, this approach lacks protection and is not flexible.

Hardware/OS Responsibilities

A brief summary of hardware support we have talked about so far:

And the OS must take these responsibilities for virtual memory to work:

Segmentation

The base & bound approach has an obvious drawback: it requires that the physical memory has a big, contiguous chunk of empty space to hold the address space of a process. However, our address space could be huge and sparse, so it might not fit in as a whole. And all the unused bytes are still mapped, wasting a lot of memory space. Segmentation is a generalized form of base & bound which eases this problem.

A segment is a contiguous portion of the address space. We could break the code, stack, & heap into three segments, then for each of them apply base & bound independently. This way, only used memory is allocated space in physical memory.

One problem thus arises: given a virtual address, how does the hardware know which segment it refers to (so that it can do the base & bound check)? There are two approaches:

Some extra support from hardware could be added:

Concept of Paging

Chopping things into variable-sized segments can make free-space management very challenging. Most modern OSes take a more general approach than explicit segmentation, called paging: chopping the address space into fixed-sized, relatively small pieces of size, say, 4KB. Each such piece is called a page (or virtual page). Physical memory is also chopped into pieces of the same size. To allocate space for a page, the OS allocates a physical frame (or physical page) to hold the page.

Paging offers these advantages over previous approaches:

The OS must maintain a per-process data structure to record the current mapping from pages to their allocated physical frames. This data structure is called the page table. A page table is essentially a mapping table from virtual page number (VPN) to physical frame number (PFN, or PPN). Assume page size is a power of 2:
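Then a virtual address splits into a VPN (high bits) and an offset within the page (low bits); a sketch assuming 4KB ($2^{12}$-byte) pages:

```c
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)            // 4KB pages
#define VPN(va)    ((va) >> PAGE_SHIFT)           // which virtual page
#define OFFSET(va) ((va) & (PAGE_SIZE - 1))       // byte offset within the page

// Physical address = (PFN << PAGE_SHIFT) | OFFSET(va),
// where PFN is looked up from the page table entry indexed by VPN(va).
```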

The size of DRAM on the machine determines the number of physical frames available, which in turn determines how many bits we need for the PFN. Just like in segmentation, multiple processes could share the same physical frame (possibly at different virtual page addresses), one good example being sharing a common code page.

Linear Page Tables

For efficiency reasons, dynamic data structures like linked lists or hash tables are generally not considered for implementing a page table. Let's first consider the simplest form, a linear (flat) page table: a one-level array of page table entries (PTEs). The array is indexed by VPN, and each PTE holds the PFN for that VPN plus a few necessary control bits:

A page table in this form can be incredibly large (e.g., a 4MB page table per 32-bit process address space) and cannot fit inside the hardware MMU. Thus, the page tables themselves are stored in memory, specifically somewhere in kernel-reserved memory. The hardware MMU just keeps a page-table base register (PTBR) recording a pointer to the start of the current process's page table and, upon a translation, goes to that location in memory. The MMU algorithm goes:
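A sketch of this lookup (load_phys, PTBR, and the PTE bit layout are illustrative assumptions):

```c
// Linear page table lookup performed by a hardware-managed MMU (sketch).
uint64_t translate_paging(uint64_t vaddr) {
    uint64_t vpn = VPN(vaddr);
    uint64_t pte = load_phys(PTBR + vpn * sizeof(uint64_t));  // extra memory read
    if (!(pte & PTE_VALID))
        raise_exception(SEGFAULT);                 // unmapped page
    uint64_t pfn = PTE_PFN(pte);                   // extract the frame number
    return (pfn << PAGE_SHIFT) | OFFSET(vaddr);
}
```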

We will talk about useful techniques for saving space and speeding up translations in the advanced paging section.

Free-Space Management Policies

In the above sections, we haven't really answered this question: how does the OS find proper free space when it needs to enlarge/allocate something, and how can free space be managed efficiently? We use the name fragmentation to describe the situation where space is wasted in an address mapping scheme.

Different address translation schemes have different behaviors with regard to the two types of fragmentation. Fragmentation is hard to eliminate completely, but we can try to do better.

Note that the problem of free-space management is generic. It applies to how a malloc() library allocates new bytes on heap, how an OS allocates physical memory for new segments, etc. Here we list a few classic free-space management policies.

Splitting & Coalescing

Suppose we keep a free list data structure as a linked list of free spaces. Two mechanisms play a significant role in all allocator policies:

Basic Policies

These are the most basic, intuitive policies for memory allocation:

There is hardly any perfect free space allocation policy in general, because we could always construct a worst-case input of requests (adversarial input) to mess it up. We can only compare their pros and cons.

Segregated Lists

If a particular application frequently makes requests of one (or a few) particular sizes, keep a separate list just to manage objects (i.e., free chunks of a fixed size) of that size. All other requests are forwarded to a general memory allocator, say one running one of the above policies. The slab allocator (see here) used in the Solaris and Linux kernels is one such example.

Objects in the cache can also be pre-initialized so that the time for initialization and destruction at run time is saved.

Buddy Allocation

The binary buddy allocator (see here) is designed around splitting & coalescing in power-of-2 sizes, allowing some internal fragmentation but making the coalescing procedure much more efficient.

We are particularly interested in binary buddy allocation because it fits the binary address representation on most architectures so well. The address of a block differs from the address of its buddy block in just one bit (which bit depends on its level), so we can have highly efficient implementations using bit-wise operations. Of course, there could be higher-order buddy allocation.
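The bit trick, as a one-liner sketch:

```c
#include <stdint.h>

// The buddy of a block of size 2^order at address addr differs from it only
// in bit `order` of the address (relative to the start of the managed region).
uintptr_t buddy_of(uintptr_t addr, unsigned order) {
    return addr ^ ((uintptr_t)1 << order);   // flip the single differing bit
}
```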

For scalability on multiprocessors, concurrent data structures are used in modern memory allocators to speed up concurrent accesses to the free list and to reduce synchronization steps. We will touch on these in the concurrency section.

Advanced Paging

We have talked about basic paging with linear page tables, but it is not very efficient. It requires an extra memory read for every memory access (notice that fetching an instruction is also a memory access), which roughly doubles the latency. The page tables also take up a lot of space, yet most pages in an address space are probably unused. We now talk about techniques to optimize paging.

Translation Lookaside Buffer (TLB)

A translation lookaside buffer (TLB) is a cache for address translation - it is part of the hardware MMU for storing several popular PTE entries with fast access speed (e.g., in SRAM instead of DRAM). Like all caches, it assumes that there is locality in page accesses (code instructions tend to be sequentially executed, etc.), so there tend to be popular pages at a given time.

Since memory accesses are so common, page lookups happen very frequently; thus we would like to minimize the chance of TLB misses as much as possible (i.e., maximize the hit rate). Also, one question remains: who should handle TLB misses (the hardware or the OS)?

Multi-Level Page Tables

Now we focus on making the page tables smaller. The main observation is that most of the pages in a sparse address space are unmapped, yet they still take up a PTE slot. To ease this issue, we use multi-level page tables: breaking the page table into multiple levels of indirection. Suppose we use two levels:

As you can see, if the pages in an entire sub-table are unmapped, that sub-table (and its children) won't be allocated at all, saving a lot of space. The virtual address split and the MMU algorithm go as follows (looking up a multi-level page table is often called walking the page table):
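A sketch of a two-level walk, assuming a 32-bit virtual address split into a 10-bit page directory index, a 10-bit page table index, and a 12-bit offset (PDBR, load_phys, and the PTE layout are illustrative assumptions):

```c
// Two-level page table walk (sketch).
uint32_t walk_two_level(uint32_t vaddr) {
    uint32_t pd_index = (vaddr >> 22) & 0x3ff;       // top 10 bits
    uint32_t pt_index = (vaddr >> 12) & 0x3ff;       // middle 10 bits
    uint32_t offset   = vaddr & 0xfff;               // bottom 12 bits

    uint32_t pde = load_phys(PDBR + pd_index * 4);   // page directory entry
    if (!(pde & PTE_VALID))
        raise_exception(SEGFAULT);                   // whole sub-table unmapped
    uint32_t pte = load_phys((pde & ~0xfffu) + pt_index * 4);
    if (!(pte & PTE_VALID))
        raise_exception(SEGFAULT);
    return (pte & ~0xfffu) | offset;                 // PFN bits | offset
}
```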

Two factors here require careful tradeoff between page table size and translation latency performance (time-space tradeoff):

A more extreme space-saving technique is to have an inverted page table: a single global mapping from each physical frame to the virtual address space & page currently mapped onto it. Hash tables could be built on top of it to speed up address translations.

Concept of Swapping

Now let's consider the situation where the physical memory space is running low and we cannot find a free physical frame to put a new page. This is when swapping is needed: writing some pages to persistent storage to free up frames. Swapping is at its essence a caching problem.

The OS takes out a certain region of a block storage device's space (say an HDD) as the swap space. Let's assume that the block device has a block size equal to the page size of our VM system. A PTE with the valid bit set but the present bit unset means the page is mapped but currently swapped out to disk. With swapping enabled, the name page fault sometimes narrowly refers to accessing a valid but swapped-out page, requiring the OS to handle the fault and bring the page back. An example snapshot of system state:

The mechanism of swapping is transparent to user processes, just like almost everything in our VM system. Be aware that, no matter how smart your page replacement policy is, if your workload demands way more memory than the actual physical memory size, paging in/out will happen and they are slow. In this case, the best solution is to buy more memory.

Swapping only when memory is completely full could be dangerous and could trigger thrashing. The OS can have a background process named the swap daemon or page daemon that monitors the free ratio of physical memory. If there are fewer than low watermark (LW) frames free, the daemon starts swapping out pages until there are high watermark (HW) frames free.

Linux also has an out-of-memory (OOM) killer that kills processes when memory is nearly (but not exactly) full. These are examples of admission control and they help prevent thrashing.

Page Replacement Policies

Accessing a block on a disk is significantly slower than accessing the memory, and is sensitive to sequentiality, thus the page replacement policies must be smart enough to avoid disk reads/writes as much as possible. These policies more generally fit in any caching systems and there are so many of them. Read this wikipedia page and this post if interested.

Given an average miss rate of $P_{miss}$, we can calculate the average memory access time (AMAT) for a page as:

$AMAT = T_M + P_{miss} \times T_D$

where $T_M$ is the latency of reading from memory and $T_D$ is the latency of reading from disk. Even a tiny increase in $P_{miss}$ blows up the AMAT, because $T_D$ is generally orders of magnitude larger than $T_M$.
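For instance, assuming $T_M = 100\,ns$ and $T_D = 10\,ms$, a miss rate of just $0.1\%$ already gives $AMAT = 100\,ns + 0.001 \times 10\,ms \approx 10.1\,\mu s$ - roughly 100x slower than a pure memory access.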

So far we assumed demand paging (or lazy paging), meaning that we only bring a page to memory the first time it is accessed. Some systems deploy prefetching of nearby pages ahead of time when a page is accessed.

The system also does clustering (or grouping) of write-backs so that a sequence of adjacent pages is evicted in one sequential request to disk, yielding better throughput, as we will learn in the persistence section.

Caching Everywhere

We have touched on the important idea of caching several times, a concept that should have been well explained in computer architecture courses. Caching is critical to the performance of modern computer systems.

Memory Hierarchy

In general, caching brings benefits on any architecture that forms a memory hierarchy: different levels of storage ranging from faster, smaller, volatile storage to slower, larger, persistent storage. The typical memory hierarchy in modern machines has at least the following layers:

Caching is based on the observation that data accesses tend to have spatial locality & temporal locality:

Hence it would be beneficial if we maintain certain hot (popular) pieces of data in a faster, smaller, upper layer while keeping the majority of data in a slower, larger, lower layer of storage.

Examples of Caching

Quite a few concepts covered in this note are just instances of caching:

For caching to work well, a smart (and suitable) policy is a must to maximize hit rate (= minimize miss rate). Caching has been a long line of research and there are tons of caching policies with different characteristics. Please read this post if interested.

See Chapter 23 of the book for a case study on two real-world operating systems' virtual memory management: DEC VAX/VMS and the Linux VM.

Concurrency: Multi-Tasking & Synchronization

This section explores another important aspect of computer systems: concurrency. Concurrency generally means "multiple things going on at the same time". We could also call it multi-tasking. With the support for virtualization, we have already seen how the OS runs multiple processes on a machine with shared CPU cores and memory space, which is one form of concurrency. However, processes generally represent separate entities and the OS tries its best to ensure isolation between processes. Are there alternatives?

Abstraction of Thread

We introduce a new abstraction of a running entity called a thread. A single process could have multiple threads sharing the same address space (hence the same code, data, etc.). Threads differ from each other mainly in:

We call the use of multiple threads multi-threading. A program spawning threads is a multi-threaded program. A default process with just the main thread can be viewed as a single-threaded process. Multi-threading brings at least two benefits:

Multi-Threaded Address Space

A multi-threaded process's address space looks like:

The OS and the threading library could have different ways to layout the thread-local stacks.

Thread Control Block (TCB)

Since threads have their own PC and other registers, switching between threads on a CPU core is also a context switch. The difference from process context switches is that thread context switches are more lightweight: just registers, no page table pointer switches, no TLB flushes, no open file state changing, etc.

Apart from the register context, each thread could have different scheduling properties, e.g., priority, time-slice length, etc. A thread control block (TCB) is the data structure keeping these states of a thread. The TCBs of a process are typically part of the PCB of the process. For example, we could modify our xv6 PCB structure to something like:
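One hypothetical way to sketch this split (names like struct thread and NTHREAD are made up for illustration; stock xv6 is single-threaded per process):

```c
// Hypothetical multi-threaded variant of the xv6 PCB: per-thread execution
// state lives in a TCB; process-wide state stays in the PCB.
struct thread {                   // TCB
  enum procstate state;           // thread status
  int tid;                        // thread ID
  char *kstack;                   // per-thread kernel stack
  struct trapframe *tf;           // per-thread trap frame
  struct context *context;        // per-thread saved registers
  void *chan;                     // sleeping channel, if any
};

struct proc {                     // PCB
  uint sz;                        // size of process memory (bytes)
  pde_t *pgdir;                   // page table, shared by all threads
  int pid;                        // process ID
  struct proc *parent;            // parent process
  int killed;
  struct file *ofile[NOFILE];     // open files, shared
  struct inode *cwd;              // current directory, shared
  char name[16];
  struct thread threads[NTHREAD]; // one TCB per thread
};
```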

Thread APIs

The POSIX specification defines a set of pthread APIs:

These interfaces are either raw system calls or wrappers provided by a pthread library (which should be explicitly linked against) if the OS provides slightly different threading syscalls at the low level.
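As a quick illustration, a minimal sketch using the standard pthread calls (compile with -lpthread):

```c
#include <stdio.h>
#include <pthread.h>

static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);     // enter the critical section
        counter++;
        pthread_mutex_unlock(&lock);   // leave the critical section
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);   // spawn two threads
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);                    // wait for both to finish
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);         // 2000000, thanks to the lock
    return 0;
}
```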

The portable operating system interface (POSIX) is a set of specifications that define the interfaces an operating system should provide to applications. There can be many different OS designs and implementations, but as long as they comply with a (sub)set of the POSIX interfaces, applications can be developed on one POSIX platform and ported to run on another with ease. Please see the linked wikipedia page for what interfaces the POSIX specification includes and which operating systems are POSIX-compliant.

Synchronization

Threads are powerful because they share the same address space, but such sharing leads to a critical problem: how do we prevent race conditions when they operate on shared data, and how do we express "waiting-on-something" logic? In general, the term synchronization refers to mechanisms + policies to ensure

  1. Atomicity of accesses
  2. Ordering of execution

It applies to any case where multiple running entities share something, but it's exceptionally critical to OSes because the problem of sharing happens so often.

Race Conditions

If we do not apply any restrictions on the scheduling of threads, the result of executing a multi-threaded program is likely to be indeterminate. Each thread executes its own stream of instructions, yet which one executes its next instruction at any moment can be arbitrary. Consider an example where two threads execute the same "plus-one" routine on a shared variable with original value 50:

You might expect a correct program to always produce 52 as the result. However, the following sequence could happen and the result could be 51 which is incorrect:
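A plausible bad interleaving, sketched at the instruction level (the single C statement counter = counter + 1 compiles to roughly a load, an add, and a store):

```c
//   Thread 1                    Thread 2
//   --------                    --------
//   load  counter  -> 50
//   add   1        -> 51
//        <-- timer interrupt, context switch -->
//                               load  counter  -> 50
//                               add   1        -> 51
//                               store counter  (counter = 51)
//        <-- context switch back -->
//   store counter  (counter = 51)   // Thread 2's update is lost: 51, not 52
```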

This is what we call a race condition (or, more specifically, a data race): unprotected timing of execution. The result of such a program is indeterminate: sometimes it produces the correct result, but sometimes the results differ and are likely wrong. We call such a piece of code a critical section: a code routine that accesses a shared resource/data and, if unprotected, might yield incorrect results.

Atomicity & Mutex Locks

What we want is that whenever a thread starts executing a critical section, it executes until the completion of the critical section without anyone else interrupting in the middle to execute any conflicting critical section (executing unrelated code would be fine). In other words, we want the entire critical section to be atomic (we assume every single instruction is already guaranteed atomic).

At a higher level, one big, atomic action that groups a sequence of small actions is logically called a transaction (a name widely used in database systems, though the problem setup is quite different).

Atomicity is guaranteed if we have mutual exclusion (mutex) among concurrent executions of critical sections. We introduce a powerful synchronization primitive called a lock to enforce mutual exclusion. A lock is a data structure that supports at least the following two operations:

We will talk about how the OS implements locks in a dedicated section below.

Ordering & Condition Variables

Apart from atomicity of critical sections, we also desire the ability to enforce certain ordering of executions so that a routine cannot proceed until some condition is made true by some other thread. Though constantly spinning on a variable might work, it is very inefficient. We introduce a second type of synchronization primitive called a condition variable. A condition variable is a data structure that supports at least the following two operations:

A condition variable is often coupled with a mutex lock to protect the shared resource to which the condition is related:
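A minimal sketch of this coupling, using the classic "parent waits for child" (join) idiom with pthread primitives:

```c
#include <pthread.h>

static int done = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

void thr_exit(void) {               // called by the child when it is done
    pthread_mutex_lock(&m);
    done = 1;                       // change the shared state...
    pthread_cond_signal(&c);        // ...then signal the waiter
    pthread_mutex_unlock(&m);
}

void thr_join(void) {               // called by the parent to wait
    pthread_mutex_lock(&m);
    while (done == 0)               // re-check: the signal is only a hint
        pthread_cond_wait(&c, &m);  // atomically releases m and sleeps
    pthread_mutex_unlock(&m);
}
```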

Another great example of demonstrating the usage of condition variables is the famous bounded buffer producer/consumer problem introduced by Dijkstra, where we have a fixed-size buffer array, producer(s) trying to write to empty slots of the buffer, and consumer(s) trying to grab things from the buffer. Use cases of bounded buffer include a web server request dispatching queue, or piping the output of one command to the input of another. A correct implementation looks like:
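A sketch following the book's structure (one mutex plus two condition variables, with while-loop re-checks):

```c
#include <pthread.h>

#define MAX 16
static int buffer[MAX];
static int fill_ptr = 0, use_ptr = 0, count = 0;

static pthread_mutex_t m     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  empty = PTHREAD_COND_INITIALIZER;  // signaled when a slot frees up
static pthread_cond_t  fill  = PTHREAD_COND_INITIALIZER;  // signaled when data is put in

static void put(int value) {
    buffer[fill_ptr] = value;
    fill_ptr = (fill_ptr + 1) % MAX;
    count++;
}

static int get(void) {
    int tmp = buffer[use_ptr];
    use_ptr = (use_ptr + 1) % MAX;
    count--;
    return tmp;
}

void producer(int item) {
    pthread_mutex_lock(&m);
    while (count == MAX)                 // buffer full: wait for a vacancy
        pthread_cond_wait(&empty, &m);
    put(item);
    pthread_cond_signal(&fill);          // wake up one consumer
    pthread_mutex_unlock(&m);
}

int consumer(void) {
    pthread_mutex_lock(&m);
    while (count == 0)                   // buffer empty: wait for data
        pthread_cond_wait(&fill, &m);
    int item = get();
    pthread_cond_signal(&empty);         // wake up one producer
    pthread_mutex_unlock(&m);
    return item;
}
```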

Using a while loop (instead of a single if) around the condition check is related to the semantics of the condition variable.

  • Mesa semantic: first explored in the Mesa system and later deployed by virtually every system, this semantic says a condition variable signal is just a hint that the state of the condition has changed, but it is not guaranteed that the state remains true when the woken up thread gets scheduled and runs
  • Hoare semantic: described by Hoare, this stronger semantic requires a condition variable signal to immediately wake up and schedule the woken up thread to run, which is less practical in building actual systems

Semaphores

The third type of synchronization primitive we will investigate is a semaphore. Dijkstra and colleagues invented semaphores as a more general form of synchronization primitive which can be used to implement the semantics of both locks and condition variables, unifying the two conceptually. A semaphore is an integer value that we can manipulate with two routines:

We can use semaphores to implement mutex locks and condition variables. A binary semaphore is equivalent to a mutex lock:
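A sketch with POSIX semaphores (sem_init's middle argument 0 means the semaphore is shared between threads of this process):

```c
#include <semaphore.h>

sem_t m;

void lock_init(void) { sem_init(&m, 0, 1); }  // initial value 1: lock is free

void do_critical_work(void) {
    sem_wait(&m);    // "lock": decrement; blocks if the value would drop below 0
    /* ... critical section ... */
    sem_post(&m);    // "unlock": increment; wakes one waiter if any
}
```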

And semaphores could be used to enforce ordering as well, like what condition variables do:
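A sketch of the ordering use: the parent waits for the child thread, with the semaphore initialized to 0:

```c
#include <pthread.h>
#include <semaphore.h>

sem_t s;

void *child(void *arg) {
    /* ... child's work ... */
    sem_post(&s);                 // signal: "I am done"
    return NULL;
}

void parent_waits(void) {
    pthread_t c;
    sem_init(&s, 0, 0);           // initial value 0: nothing to hand out yet
    pthread_create(&c, NULL, child, NULL);
    sem_wait(&s);                 // blocks until the child posts
    pthread_join(c, NULL);
}
```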

Generally, it is the initial value of a semaphore that determines how it behaves. To set the initial value, Kivolowitz's general rule is to set it to the number of resources you are able to give away immediately after initialization: for a mutex lock, 1; for waiting on a child thread being done, 0; for buffer vacant slots, the size of the buffer; etc. However, generality is not always good and easy - given that we already have good primitives like locks and condition variables, it is mostly unnecessary to use semaphores in practice.

Implementing Locks

We introduced the lock interface but haven't talked about how locks are implemented. We would like a lock implementation that provides the following four properties:

Certainly, a simple spin-wait on a shared variable is neither correct nor performant (this is not saying spinning locks are bad, as we will discuss in a section below):
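The broken attempt looks roughly like this (a sketch):

```c
// A broken "lock" built from an ordinary flag variable.
typedef struct { int flag; } lock_t;

void init(lock_t *l) { l->flag = 0; }

void lock(lock_t *l) {
    while (l->flag == 1)   // TEST the flag...
        ;                  // ...spin-wait, burning CPU
    l->flag = 1;           // ...then SET it - but another thread can sneak in
                           // between the test and the set, so both "hold" the lock
}

void unlock(lock_t *l) { l->flag = 0; }
```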

This implementation does not guarantee mutual exclusion, for example:

Controlling Interrupts

Assume we only have a single-processor machine and we only want one lock variable, a naive approach would be to just disable interrupts for critical sections:
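Roughly (a sketch; the interrupt on/off primitives are privileged and hypothetical here):

```c
void lock(void)   { disable_interrupts(); }  // nothing can preempt us now
void unlock(void) { enable_interrupts();  }  // re-enable preemption
```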

Though simple, this approach has significant drawbacks:

Hardware Atomic Instructions

Building locks out of pure load/store instructions is possible (e.g., see Peterson's algorithm and Lamport's bakery algorithm), yet with a little bit of extra hardware support, things get much easier and much more efficient. Here we ask the hardware to provide certain atomic instructions: instructions that do more than a single load/store and are guaranteed to be atomic by the hardware. Classic examples include:
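For example, a test-and-set spinlock can be sketched with the GCC atomic builtins (a real implementation would also consider memory ordering and back-off):

```c
// Test-and-set spinlock sketch.
typedef struct { volatile int flag; } spinlock_t;

void spin_init(spinlock_t *l) { l->flag = 0; }

void spin_lock(spinlock_t *l) {
    // Atomically set flag to 1 and return the old value; if the old value
    // was 1, someone else holds the lock, so keep spinning.
    while (__sync_lock_test_and_set(&l->flag, 1) == 1)
        ;
}

void spin_unlock(spinlock_t *l) { __sync_lock_release(&l->flag); }
```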

In modern multicore systems, the scalability of locks becomes critical. There are many more advanced lock design & implementations trying to avoid cache contention and trying to have NUMA-awareness. Please see this post if interested.

Spinning vs. Blocking

A spinning lock (or spinlock, non-blocking lock) is a lock implementation where lock waiters spin in a loop checking for some condition. The examples given above are basic spinlocks. Spinlocks are typically used for low-level critical sections that are short and small but invoked very frequently, e.g., in device drivers.

A blocking lock is a lock implementation where a lock waiter yields the core to the scheduler when the lock is currently taken. A lock waiter thread adds itself to the lock’s wait queue and blocks the execution of itself (called parking, or yielding, or descheduling) to let some other free thread run on the core, until it gets woken up (typically by the previous lock holder) and scheduled back. It is designed for higher-level critical sections. The pros and cons are exactly the opposite of a spinlock.

For example, our previous TAS-based lock could be modified to do blocking and queueing instead:
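A sketch in the style of the book's Solaris example (TestAndSet, queue_t, queue_add, queue_remove, queue_empty, gettid, park, and unpark are assumed primitives):

```c
typedef struct {
    int flag;       // is the lock held?
    int guard;      // spinlock protecting flag and the queue
    queue_t q;      // queue of waiting thread IDs
} lock_t;

void lock(lock_t *m) {
    while (TestAndSet(&m->guard, 1) == 1)
        ;                               // spin briefly to acquire the guard
    if (m->flag == 0) {
        m->flag = 1;                    // got the lock
        m->guard = 0;
    } else {
        queue_add(&m->q, gettid());     // join the wait queue
        m->guard = 0;
        park();                         // block until unpark()ed
    }
}

void unlock(lock_t *m) {
    while (TestAndSet(&m->guard, 1) == 1)
        ;
    if (queue_empty(&m->q))
        m->flag = 0;                    // nobody waiting: release the lock
    else
        unpark(queue_remove(&m->q));    // pass the lock directly to the next waiter
    m->guard = 0;
}
```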

The above code has one subtle problem called the wakeup/waiting race: what if a context switch happens right before the acquirer's park(), and the switched-to thread happens to release the lock? The releaser will try to unpark the not-yet-parked acquirer, so the acquirer could end up parking forever. One solution is to add a setpark() call before releasing the guard, and let park() return immediately if an unpark() was made after setpark().

There are also hybrid approaches mixing spinlocks with blocking locks: first spin for a while in case the lock is about to be released, then park otherwise. This is referred to as a two-phase lock. For example, Linux locks built on its futex syscall support take one such approach.

Advanced Concurrency

We have yet more to discuss around concurrency and synchronization.

Lock-Optimized Data Structures

Adding locks to protect a data structure makes it thread-safe: safe to be used by concurrent running entities (not only for multi-threaded applications, but also for OS kernel data structures). Simply acquiring & releasing a coarse-grained lock around every operation on an object could lead to performance and scalability issues.

In general, the more fine-grained the locks (smaller critical sections with less logic inside), the better the scalability. Here we list some examples of optimized data structures that reduce lock contention:

Many more concurrent data structure designs exist since this has been a long line of research. There is also a category of data structures named lock-free (or non-blocking) data structures which don't use locks at all, with the help of clever memory operations & ordering. See read-copy-update (RCU) for an example.

Concurrency Bugs

The biggest source of bugs in concurrent code is the deadlock problem: all entities end up waiting for one another and hence make no progress - the system halts. Examples of deadlocks are everywhere:

There has also been a long line of research on deadlocks. Theoretically, four conditions need to hold for a deadlock to occur:

  1. Mutual exclusion: threads claim exclusive control of resources that they have acquired
  2. Hold-and-wait: threads hold resources allocated to them while waiting for additional resources
  3. No preemption: resources cannot be forcibly removed from holders
  4. Circular wait: there exists a circular chain of threads s.t. each holds some resources being requested by the next thread in chain

Accordingly, to tackle the deadlocks problem, we could take the following approaches, trying to break some of the conditions (see Chapter 32 of the book for detailed examples):

  1. Prevention (in advance)

    • Avoid circular wait: sort the locks in a partial order, and whenever the code wants to acquire locks, adhere to this order
    • Avoid hold-and-wait: acquire all locks at once, atomically, through e.g. always acquiring a meta-lock
    • Allow preemption: use a trylock() interface on lock_B and revert lock_A if B is not available at the time
    • Avoid mutual exclusion: use lock-free data structures with the help of powerful hardware atomic instructions
  2. Avoidance (at run-time) via scheduling: don't schedule threads that could deadlock with each other at the same time

  3. Recovery (after detection): allow deadlocks to occur but reserve the ability to revert system state (rolling back or just rebooting)

Apart from deadlocks, most of the other bugs are due to forgetting to apply synchronization, according to Lu et al.'s study:

Event-Based Model

Many modern applications, in particular GUI-based applications and web servers, explore a different type of concurrency based on events instead of threads. To be more specific:

See the select() and epoll() syscalls for examples. Modern applications typically use multi-threading with event-based polling on each thread in some form to maximize performance and programmability.

There are some other popular terminologies used in concurrent programming that I'd like to clarify:

Multi-CPU Scheduling

With the knowledge about the memory hierarchy and concurrency, let's now go back to scheduling and add back a missing piece of the "virtualizing the CPU" section: scheduling on multiprocessor machines (multiple CPUs, or multiple cores packed in one CPU). Multi-CPU scheduling is challenging due to these reasons:

Here are two general multiprocessor scheduling policies:

Linux has adopted multiprocessor schedulers of both flavors over time: the O(1) scheduler and CFS are MQMS-style, while BFS is SQMS-style.

Persistence: Storage Devices & File Systems

Beyond run-time sharing of volatile states, a complete operating system must also be able to persist data across power-offs. The OS code itself should at least be stored on persistent media and be loaded into memory to run at boot time, not to mention any storage and connectivity requirements from applications. Hence, we move to the third aspect of an OS: persistence through communicating with and managing input/output (I/O) devices.

General Input/Output (I/O)

For any computer program to be interesting, it must be taking in some input and be producing some output. For a computer system as a whole, inputs come from input devices and outputs go to output devices. Hardware components connect with the central CPU through various types of bus (interconnect):

Inside a typical I/O device is a controller (a small CPU with small memory) that runs firmware code to implement its functionality, to control how the device media works, and to provide an interface to the rest of the system. The interface adheres to a certain protocol (e.g., AHCI for SATA disks, NVMe for PCIe disks, etc.).

Polling vs. Interrupts

In a simplified protocol, the CPU communicates with a device through reading & writing its hardware registers. Instructions that manipulate external device registers are called I/O instructions and are privileged.
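A sketch of such a polled protocol (the register names and the read_reg/write_reg helpers are stand-ins for port-mapped or memory-mapped register access):

```c
void issue_write(uint32_t data) {
    while (read_reg(STATUS_REG) & BUSY)    // 1. spin until the device is idle
        ;
    write_reg(DATA_REG, data);             // 2. put data into the data register
    write_reg(COMMAND_REG, CMD_WRITE);     // 3. issue the command; device starts working
    while (read_reg(STATUS_REG) & BUSY)    // 4. spin until the request completes
        ;                                  //    (an interrupt would let the CPU do other work instead)
}
```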

Direct Memory Access (DMA)

So far we have described programmed I/O (PIO): I/O requests in which the CPU itself moves every piece of data to or from the device. If our requests are data-intensive, this could over-burden the CPU with the trivial task of data transfer.

Instead of letting the CPU transfer data word-by-word, we introduce the direct memory access (DMA) engine: a special hardware component that can orchestrate transfers between devices and main memory without much CPU intervention.

Also, instead of using explicit I/O instructions (e.g., in and out on x86), modern systems also support memory-mapped I/O: mapping device registers to memory locations so that loading/storing these addresses are routed to the device instead of to main memory. Both approaches are in use today.

"Memory-mapped I/O" sometimes also refers to the Linux syscall mmap(), which maps a storage space (or file) into memory and uses that memory as a cache (the OS is responsible for flushing dirty updates to disk). That is file-system-level terminology and is totally separate from the DMA and device-register mapping described above.

OS Software Stack

Ideally, we want to hide all the nasty details of I/O protocols from users and provide a cleaner interface. The OS does this through multiple layers of abstractions, forming a software stack. (We refer to them as the storage stack, the network stack, etc.) At the lowest level are device drivers that actually implement one side of the device protocols to talk to specific models of devices.

Since device drivers are mostly boilerplate code and are needed for any device you might plug into your system, they have become a huge percentage of kernel code (over 70% of Linux source lines are device drivers). In the next few sections, we will first talk about some common storage devices, their structures and characteristics, and device-driver/on-device-controller level optimizations. We will cover the file system layer in later sections.

Storage Devices

Disk-like storage devices are often viewed as block devices by the system because reads/writes are done at the granularity of blocks, of size e.g. 4KB. Let's first take a look at two common types of block devices (based on different media): HDDs & SSDs. Understanding the internals of a device is critical to building an efficient, performant, and correct software stack for it.

Hard Disk Drives (HDD)

Hard disk drives (HDDs) have been the main form of persistent data storage for decades, and many classic file system technologies are built around them. An HDD is characterized by its unique mechanical structure:

Now focusing on a single surface, the top view looks like:

The latency of completing a sector request on an HDD is composed of:

  1. Seek time: the disk head locating the correct track (this may be further divided into acceleration time + coasting time + deceleration time + settling time); the average seek takes roughly 1/3 of the full seek time (see Chapter 37.4 of the book for the math)
  2. Rotate time: waiting for the beginning of the sector to rotate under the head; on average, half the time of a full rotation (see the worked example right after this list)
  3. Transfer time: where I/O actually takes place as the sector passes under the head
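As a rough worked example: a 7,200 RPM drive takes $60 / 7200 \approx 8.3\,ms$ per full rotation, so the average rotational delay is about $4.2\,ms$; adding an average seek of a few milliseconds and a sub-millisecond transfer, a single random sector I/O easily costs several milliseconds, while a sequential transfer amortizes the seek and rotation over many sectors.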

Modern disks deploy some fancier technologies:

As we can see, sequentiality is so crucial to HDD performance: its mechanical structure determines its sequentiality nature that reading/writing to consecutive addresses performs much better than randomly seeking around.

HDD Scheduling Policies

Due to the high cost of I/O, the disk controller (historically, the OS) runs a disk scheduler to decide how to order pending HDD requests to minimize seeking. The major difference from CPU scheduling is that disk requests have known locations and sizes, so we have a pretty good estimate of each job's completion time. Here is a list of classic disk scheduling policies:

In modern systems, the OS and the disk collaborate on scheduling: the OS uses various storage stack techniques to try to issue sequential requests and pick the best few requests to send to the disk, and the disk controller uses internal knowledge to arrange the batch of requests, possibly doing I/O merging.

There is also a tradeoff in how long to wait before issuing a batch of requests to the disk: work-conserving approaches issue a request as soon as one arrives at the block layer, while non-work-conserving approaches wait for some time in case "better" requests arrive.

RAID Arrays

Making a single disk fast, large, and reliable at the same time is hard (laws of physics). Patterson, Gibson, and Katz first introduced the idea of a redundant array of inexpensive disks (RAID) to build a collectively large, fast, and reliable disk out of an array of disks. A RAID system provides a disk interface to the file system layer above and translates logical block reads/writes into sophisticated physical requests to the underlying disks. RAID thus brings the following three advantages transparently:

The classic RAID levels are listed below. For the performance analysis, we assume a single disk has capacity $B$, single-request latency $T$, and steady-state throughput $S$ for sequential workloads and $R$ for random workloads ($S \gg R$).

A symbolic analysis of these RAID levels gives the following results (see Chapter 38 of the book for the assumptions and calculations):

Solid-State Drives (SSD)

Generally, storage devices with no mechanical structures or moving parts are referred to as solid-state storage devices (SSDs). Unlike random-access memory (RAM, whether SRAM or DRAM), SSDs retain data across power loss. Unlike HDDs, SSDs have no moving parts and have largely different performance characteristics. We focus on the most well-known solid-state storage technology called flash (more specifically, NAND-based flash). The acronym SSD sometimes narrowly refers to solid-state drives exposed as large-capacity block devices, though ROMs (for storing firmware) and USB flash drives are all solid-state storage.

Logically, the internals of a NAND-flash SSD look like:

 

The limitation of one-way charging in flash media means that updating bits is very tricky. Flash media allows three basic operations, each with its own granularity:

  1. Read a page: by pre-charging & leaking a word-line, client can read any page
  2. Program a page: at the page granularity, flash only allows charging cells from 1 to 0; assuming a free page of all 1's, client can program the page and change some bits to 0
  3. Erase a whole block: changing bits from 0 to 1 requires leaking an entire block back to 1's; hence, to write arbitrary content to a page, either the page was previously free, or it is required to first erase the entire block it is in, then program the pages

Flash storage does not need to worry about physical head crashes and has better theoretical latency than magnetic disks. However, things become more complicated when turning flash media into a block device. An SSD must keep a mapping from logical addresses (the addresses exposed to the device driver) to actual physical pages. This is done by the flash translation layer (FTL) software running on the in-device controller.

The FTL algorithm must take care of at least the following: avoiding in-place overwrites (typically by writing in a log-structured fashion and maintaining the logical-to-physical map), garbage-collecting blocks whose pages have become stale, keeping the mapping table reasonably small, and wear leveling, i.e., spreading writes evenly across blocks, since each block survives only a limited number of program/erase cycles.

For NAND flash storage devices, a new command (in addition to read, write, and cache_flush) is introduced: trim. Trim tells the FTL that certain logical addresses have been deleted, so the device no longer has to track information about them, enabling optimizations in the FTL algorithm.
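To make the erase-before-program constraint and the mapping concrete, here is a toy page-mapped, log-structured FTL write path; all names, sizes, and the `flash_program()` stub are made up for illustration, and garbage collection, block erasure, and wear leveling are deliberately left out.

```c
#include <stdint.h>

#define PAGES_PER_BLOCK 64
#define NUM_BLOCKS      1024
#define PAGE_SIZE       4096
#define INVALID         UINT32_MAX

/* Toy FTL state: logical page -> physical page, plus the log head. */
static uint32_t map[NUM_BLOCKS * PAGES_PER_BLOCK];
static uint32_t write_ptr;   /* next free (erased) physical page */

/* Stub standing in for the device's program command: may only target an
 * erased page; erasing happens at whole-block granularity (not shown). */
static void flash_program(uint32_t phys_page, const uint8_t *data)
{
    (void)phys_page; (void)data;
}

/* Write a logical page: never overwrite in place; append at the log head
 * and remap. The old physical page becomes garbage for later collection. */
void ftl_write(uint32_t logical_page, const uint8_t data[PAGE_SIZE])
{
    flash_program(write_ptr, data);
    map[logical_page] = write_ptr;
    write_ptr++;   /* a real FTL also erases blocks and levels wear */
}

/* Trim: forget the mapping so the stale physical page can be reclaimed. */
void ftl_trim(uint32_t logical_page)
{
    map[logical_page] = INVALID;
}
```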

Compared with HDDs, SSD performance is generally better, especially for random reads/writes, but SSDs cost more per unit of capacity.

The file system implementation section of this note focuses on HDD-based file systems. Some ideas also apply to SSDs and other media, but we leave the software stack technologies for flash SSDs, byte-addressable non-volatile memory (NVM) devices, and network devices as advanced topics.

Abstraction of Files

Now that we've seen how storage devices work, it's time to think about what abstractions and interfaces an OS should provide to user programs for managing persistent data. Unlike memory, which is volatile and loses its contents on a power-off or power accident, persistent storage devices keep data intact across power loss. Thus, the abstractions for persistent data look quite different.

Files & Directories

We begin with files. A file is simply a linear array of blocks (or pages) of bytes. A file has a filename chosen by the user to be memorable, yet at the low level it has a unique, flat name (often in the form of an inode number, or i-number); the system tracks files through these low-level names.
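For instance, a file's low-level name can be observed through the stat() syscall; a minimal example (error handling kept to a bare minimum):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;
    if (argc < 2 || stat(argv[1], &st) != 0)
        return 1;
    /* st_ino is the file's low-level name: its inode number. */
    printf("%s -> inode %llu\n", argv[1], (unsigned long long)st.st_ino);
    return 0;
}
```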

Files reside in a directory tree (directory hierarchy): a directory is itself a "file" whose contents list the files and sub-directories inside it. What a directory actually stores is a list of (user-readable entry name, low-level inode number) tuples. For example, the path /foo/bar.txt names a file bar.txt inside the directory foo, which in turn lives inside the root directory /.

Essentially, the files & directories abstraction gives us a powerful & standardized way of naming things. This way of naming is so useful that UNIX systems push it to the extreme: almost everything accessible by the system is named through the file system interface, so everything looks like a "file" even if it does not represent data on persistent storage (for example, /dev/xxx names a raw device, /proc/xxx names the run-time information of a process, etc.).

File System (FS) APIs

A file system (FS) is a layer that provides a set of POSIX APIs to user programs and transforms POSIX API calls into efficient block-layer requests. In some operating systems, this is further separated into two layers: a virtual file system (VFS) layer defining a common interface that every file system implements, and the concrete file system implementations (ext4, XFS, ...) beneath it.

A disk (e.g., /dev/sda) can be partitioned into multiple partitions (physical volumes, e.g., /dev/sda1, /dev/sda2, ...). Often, we view each single partition as a separate device. Each partition can be formatted as a particular file system (meaning laying out the file system data structures onto that partition), and can get mounted to a particular directory location.

A logical volume (in Linux, managed by the logical volume manager, LVM) is a logical device that can span across multiple physical volumes, managed specially.

The POSIX syscall APIs for file & directory operations include at least: open(), read(), write(), lseek(), close(), and fsync() for file I/O; stat()/fstat() for reading a file's metadata (its inode contents); rename(), link(), unlink(), and symlink() for manipulating names; mkdir() and rmdir() for directories, with directory contents read via opendir()/readdir()/closedir() (library wrappers over the underlying syscall); and mount()/umount() for attaching file systems.

Listed above are FS-related syscalls. Often, we instead use a higher-level wrapper or a shell program that builds on these syscalls and provides a more convenient interface, e.g., fopen() in the C standard library, which returns a FILE * pointer, and shell commands such as cp, mv, rm, touch, mkdir, mount, etc.
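A small example of using the raw syscall interface directly: write a file, force it to the device with fsync(), and then atomically rename it into place (a common pattern for safe file updates); the paths are arbitrary.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *msg = "hello, fs\n";

    /* Create (or truncate) a file and write into it. */
    int fd = open("/tmp/example.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    write(fd, msg, strlen(msg));

    /* write() may only reach OS buffers; fsync() forces it to the device. */
    fsync(fd);
    close(fd);

    /* rename() replaces the target atomically. */
    rename("/tmp/example.tmp", "/tmp/example.txt");
    return 0;
}
```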

File System Implementation

To find out what's going on inside the file system layer, we start from a Very Simple File System design (we call VSFS) and move towards real FS implementations. We will discuss consistency issues along the way. There is great flexibility in how one can build a file system, hence many excellent implementations exist. We only touch on the basics.

We think about two aspects of a file system:

  1. The on-disk data structures layout it keeps to organize all the files' data and metadata
  2. The access methods it uses to process POSIX calls such as open(), read(), and write(), given the data structures

VSFS Data Structures

Our VSFS operates on very simple data structures. We divide the disk space into blocks of the commonly-used 4KB size (we only support this one fixed block size). Say we have a very small disk of 64 blocks; we assign different roles to these blocks:

  • Block 0: the superblock, describing the file system itself (type, number of inodes and data blocks, where the other regions start, ...), read first when mounting
  • Block 1: the inode bitmap, tracking which inodes are free
  • Block 2: the data bitmap, tracking which data blocks are free
  • Blocks 3-7: the inode table, an array holding all the inode structures
  • Blocks 8-63: the data region, holding file and directory contents

Inside an inode structure, probably the most important part is the array of pointers to the file's data blocks, alongside metadata such as the file's type, size, link count, permissions, and timestamps.
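A simplified on-disk inode might look like the following sketch; the field names and the number of direct pointers are illustrative, not VSFS's exact layout.

```c
#include <stdint.h>

#define NUM_DIRECT 12   /* illustrative number of direct block pointers */

/* A simplified on-disk inode (illustrative layout, not the exact VSFS one). */
struct vsfs_inode {
    uint16_t type;                 /* regular file, directory, ... */
    uint16_t nlinks;               /* number of hard links */
    uint32_t size;                 /* file size in bytes */
    uint32_t atime, mtime, ctime;  /* access / modify / change timestamps */
    uint32_t direct[NUM_DIRECT];   /* block numbers of the file's data blocks */
    uint32_t indirect;             /* block full of further pointers, for larger files */
};
```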

To allocate & track free space, the two bitmaps are used. This is essentially a free-space management problem, as we saw with memory allocation, but this time in the FS context. Many optimizations to the free-space tracking data structures are possible, e.g., searching for a contiguous run of free blocks when allocating, so that a file's data stays sequential on disk.
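The basic bitmap operations are straightforward; a minimal first-fit sketch (my helper names):

```c
#include <stdint.h>

/* Find and claim the first free (0) bit in a bitmap of nbits bits.
 * Returns the bit index, or -1 if everything is allocated. */
int bitmap_alloc(uint8_t *bitmap, int nbits)
{
    for (int i = 0; i < nbits; i++) {
        if (!(bitmap[i / 8] & (1u << (i % 8)))) {
            bitmap[i / 8] |= (1u << (i % 8));   /* mark as allocated */
            return i;
        }
    }
    return -1;
}

/* Mark bit i as free again. */
void bitmap_free(uint8_t *bitmap, int i)
{
    bitmap[i / 8] &= ~(1u << (i % 8));
}
```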

We also need to think about directories. Directories are just a special type of file. For a directory, its contents (what is stored in its data blocks) are a collection of (entry name, inode number) tuples; each such tuple is called a directory entry (dentry). Again, the collection can take the form of any data structure: a fixed-slot array, a linked list, a B-tree, etc.
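With the simplest fixed-slot array layout, a directory entry and a name lookup could look like this sketch (names and sizes are made up):

```c
#include <stdint.h>
#include <string.h>

#define MAX_NAME 28

/* One fixed-size directory entry: an (entry name, inode number) tuple. */
struct vsfs_dirent {
    char     name[MAX_NAME];   /* user-readable entry name */
    uint32_t inum;             /* low-level name: inode number */
};

/* Scan a directory's entries for `name`; return its inode number,
 * or 0 (treated as "not found" here) if it is absent. */
uint32_t dir_lookup(const struct vsfs_dirent *entries, int nentries,
                    const char *name)
{
    for (int i = 0; i < nentries; i++)
        if (strcmp(entries[i].name, name) == 0)
            return entries[i].inum;
    return 0;
}
```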

VSFS Access Paths

Having a clear picture of the data structures, it would not be too hard to figure out what happens upon an access to a file, step by step. Assume the initial state is that the file system just gets mounted, so only the superblock is in memory.

Reading /foo/bar (where the only write is to bar's inode, to update its last-access timestamp): the FS first traverses the path, reading the root inode, the root directory's data, foo's inode, foo's data, and finally bar's inode; then, for each block of bar being read, it reads bar's inode, reads the data block, and writes bar's inode back.

Creating and then writing out (extending) /foo/bar (where the write to foo's inode updates its timestamp): creation traverses the path as before, then reads and writes the inode bitmap (to allocate bar's inode), writes bar's new inode, and reads & writes foo's data and inode (to insert the new directory entry); each subsequent write to bar additionally reads and writes the data bitmap (to allocate a data block), writes the data block itself, and updates bar's inode.

Things will get much more complicated when caching is involved.
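The traversal part of these access paths is just a loop over path components; a rough sketch in code, where `dir_lookup_in_inode()` is a hypothetical helper (stubbed out here) that would read the directory's inode from disk and scan its data blocks for the given name:

```c
#include <stdint.h>
#include <string.h>

#define ROOT_INUM 2   /* the root inode number is well known (2 in many FSes) */

/* Stub: a real FS would read dir_inum's inode, walk its data blocks, and
 * return the inode number stored in the matching directory entry. */
static uint32_t dir_lookup_in_inode(uint32_t dir_inum, const char *name)
{
    (void)dir_inum; (void)name;
    return 0;
}

/* Resolve a path like "/foo/bar" to an inode number (0 = not found).
 * Every iteration costs at least two disk reads if nothing is cached:
 * the directory's inode and (some of) its data blocks. */
uint32_t path_lookup(const char *path)
{
    char buf[256];
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    uint32_t inum = ROOT_INUM;               /* start at the root inode */
    for (char *comp = strtok(buf, "/"); comp != NULL && inum != 0;
         comp = strtok(NULL, "/"))
        inum = dir_lookup_in_inode(inum, comp);
    return inum;
}
```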

UNIX Fast FS (FFS)

Our VSFS (essentially the original design of the old UNIX file system) captures the basic ideas in building a file system, but it has serious performance problems: it treats the disk as if it were random-access memory, placing inodes far away from the data blocks they point to; free space becomes fragmented over time, so a file's blocks end up scattered and accesses pay heavy positioning costs; and the small block size makes transfers inefficient.

A group at Berkeley decided to build a better HDD-based file system, which they called the Fast File System (FFS, see the paper). FFS is built to be disk-aware: it knows it runs on disks, assumes certain internal knowledge about the underlying disks, and is optimized for them. Keeping the same APIs and on-disk abstractions, FFS changes the layout: the disk is divided into cylinder groups, each with its own copy of the superblock, bitmaps, inode table, and data region; a file's inode and data are placed in the same group, files in the same directory are kept near each other, and large files are spread across groups so no single group fills up; larger blocks (with sub-block "fragments" for small files) reduce per-transfer overhead.

Crash Consistency

A major challenge faced by a file system is crash consistency: since a file write operation is often composed of multiple steps, how to ensure that persisted data stays in a correct and consistent state if a power loss or system crash happens in the middle?

Crash Scenarios

Writes are often buffered, so the actual requests to disk are typically triggered by an fsync(), formally a write barrier. Let us consider the most interesting case, appending a new block of data to a file; the requests include:

  • Db: the new data block itself
  • I: the file's inode, updated to point to the new block and to record the new size
  • B: the data bitmap, updated to mark the new block as allocated

Now we enumerate all the possible crash scenarios. What if a crash happens when...

  1. Db is done, but I and B are not: as if the write didn't happen; not a problem
  2. I is done, but Db and B are not: inconsistency from the FS perspective - the inode points at garbage data (whatever was left in the new data block, or worse, another file's data if that block is later chosen by another allocation); needs to be resolved
  3. B is done, but Db and I are not: inconsistency from the FS perspective - space leak on the marked block; needs to be resolved
  4. I and B are done, but Db is not: nothing seems wrong from the FS perspective, but the file ends up containing garbage data (whatever was left in the new data block)
  5. I and Db are done, but B is not: inconsistency from the FS perspective - the data block might later be chosen by another allocation; needs to be resolved
  6. B and Db are done, but I is not: inconsistency from the FS perspective - space leak again; needs to be resolved

So, among the six cases: case 1 is not a problem; case 4 is a problem that may be expensive to resolve, as we will see later; and cases 2, 3, 5, and 6 are all FS metadata inconsistency cases (only part of the metadata changed). Ideally, we would like the FS to transition from one consistent state to another atomically, but atomicity is very hard to provide here: the update spans multiple blocks, while the disk can only write a single sector atomically.

File System Checker (FSCK)

Early file systems took a simple approach to resolving inconsistency: at the reboot after a crash, run a full-pass metadata check over the entire file system. The checker is named the file system checker (fsck). Its basic steps include sanity-checking the superblock, rebuilding the bitmaps by scanning all inodes (trusting inodes over bitmaps), checking each inode's state and verifying link counts by walking the directory tree, detecting duplicate or out-of-range block pointers, and checking the integrity of directory contents.

A complete fsck pass can only resolve metadata inconsistencies (cases 2, 3, 5, and 6), and it is incredibly slow (it may take hours on a large disk), so it is no longer the preferred approach.

Journaling (Write-Ahead Logging)

Modern file systems, such as Linux ext3/4, do journaling (write-ahead logging), which is a run-time approach: to update something, we first write a log entry into an append-only log (i.e., the journal; the FS must reserve some space for it) saying "we are about to do these operations", before actually applying them in place. After a transaction has been committed to the journal, we checkpoint the file system by applying the pending transaction(s). Here, the ordering between journal writes and in-place updates matters.

Each journal entry starts with a transaction-begin block (TxB) and ends with a transaction-end block (TxE). TxB contains a transaction identifier (TID) and some information about the transaction, e.g., the final in-place addresses of the updated blocks (such as I and Db). TxE is simply a closing mark of the journal commit. If a crash happens, then at the next reboot, all committed log entries after the last checkpoint are replayed to recover state, and incomplete entries are discarded.
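A rough sketch of this commit-then-checkpoint protocol in code; `block_write()` and `barrier()` are stand-ins (stubbed below) for buffered block writes and a cache-flush / write barrier, and the block contents are simplified.

```c
/* Sketch of the basic journal protocol: journal write, commit, checkpoint.
 * block_write() and barrier() are illustrative stubs, not a real block layer. */

struct tx {
    int    tid;          /* transaction identifier, recorded in TxB */
    int    nblocks;      /* number of logged blocks */
    long  *final_addr;   /* their final in-place (home) addresses */
    void **data;         /* their new contents */
};

static void block_write(long addr, const void *data) { (void)addr; (void)data; }
static void barrier(void) { }

void journal_commit_and_checkpoint(struct tx *tx, long journal_head)
{
    /* 1. Journal write: TxB plus the logged blocks. */
    block_write(journal_head++, tx);                 /* TxB: tid + final addresses */
    for (int i = 0; i < tx->nblocks; i++)
        block_write(journal_head++, tx->data[i]);
    barrier();                                       /* these must be durable first */

    /* 2. Journal commit: only once TxE is on disk is the tx committed. */
    block_write(journal_head++, &tx->tid);           /* TxE */
    barrier();

    /* 3. Checkpoint: apply the updates to their final in-place locations. */
    for (int i = 0; i < tx->nblocks; i++)
        block_write(tx->final_addr[i], tx->data[i]);
    barrier();                                       /* then the journal space can be freed */
}
```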

Journaling can be done in different modes: data journaling (both data and metadata go through the journal - safest, but every data block is written twice), ordered / metadata journaling (only metadata is journaled, but data blocks are forced to disk before the corresponding metadata commits), and writeback mode (only metadata is journaled, with no ordering guarantee on data, so a file may contain garbage after a crash).

There are other approaches to the crash consistency problem:

  • Copy-on-write file systems like LFS combine write buffering with journaling by always writing to an append-only log and never overwriting things in place - the log is the file system
  • Backpointer-based consistency: let data blocks maintain a back pointer pointing back to the inode, and always check if that inode has a forward pointer to the data block
  • Optimistic crash consistency: issue as many requests to disk as possible, and use certain techniques like checksums to detect inconsistencies and abort conflicting transactions (this shares much similarity with optimistic concurrency control in distributed transactions)

More on Storage Systems

Storage systems form a design space with great flexibility, and we cannot cover every aspect in this note; here we add a few more points. Modern storage devices, more FS design schemes, distributed storage systems, and data security are left as advanced topics.

Log-Structured FS (LFS)

Ousterhout & Rosenblum proposed a log-structured way of building storage systems (see the paper). The motivations are: main memory keeps growing, so caches absorb more reads and disk traffic becomes increasingly dominated by writes; there is a large gap between sequential and random I/O performance; existing file systems perform poorly on common workloads full of small writes; and file systems of the time were not RAID-aware.

A log-structured file system (LFS) maximizes the idea of write buffering by making the entire file system a sequential log: all updates (data blocks, inodes, and so on) are buffered in memory into a large segment, and each full segment is written to the end of the on-disk log in one long sequential transfer; nothing is overwritten in place. Since inodes now move around in the log, LFS keeps an inode map (imap) from inode numbers to the latest on-disk location of each inode; imap pieces are themselves appended to the log, and a fixed checkpoint region points at the current imap pieces.

Reading a block through LFS incurs one extra lookup in the inode map, after which the number of I/Os is the same as in FFS; LFS caches the whole inode map in memory anyway. Everything seems fine except for one critical problem that arises in any log-structured system: garbage collection. Whenever a newer version of a block is appended, the older version becomes garbage yet still occupies space in the log, so the log would soon fill up with useless old blocks; LFS therefore cleans the log segment by segment, copying only the still-live blocks of old segments into new ones and freeing the old segments.
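In code, the extra level of indirection is just a lookup in an in-memory table; a tiny sketch with made-up names and sizes:

```c
#include <stdint.h>

#define MAX_INODES (1 << 16)

/* In-memory inode map: inode number -> disk address of the latest
 * version of that inode (illustrative only). */
static uint64_t imap[MAX_INODES];

/* The one extra lookup compared to a fixed inode table. */
uint64_t inode_disk_addr(uint32_t inum)
{
    return imap[inum];
}

/* When a new version of an inode is appended at the log head, remap it;
 * the old copy in the log becomes garbage for the cleaner. */
void lfs_remap_inode(uint32_t inum, uint64_t new_log_addr)
{
    imap[inum] = new_log_addr;
}
```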

When a crash happens, everything written before the last checkpoint-region update is guaranteed to be consistent, and LFS rolls forward from the last checkpoint toward the known end of the log to recover as much state as possible; updates still cached in memory are lost.

The log-structured FS design is sometimes also called copy-on-write FS, and is similar to shadow paging in databases. Similar copy-on-write approaches are found in WAFL, ZFS, Btrfs, and flash-based SSDs.

Access Control

Modern operating systems are multi-user, so we must have a way to describe permissions on resources, allow operations from permitted users, and deny operations from users without sufficient permission. Take UNIX as an example: since every resource is exported through the file system interface, the basic mechanism for access control is each file's permission bits (or file mode). In the output of ls -l, a field such as -rwxrw-r-- is the one we are interested in: the first character is the file type (- for a regular file, d for a directory, l for a symbolic link), and the remaining nine characters form three rwx triplets giving the read/write/execute permissions of the file's owner (user), the owning group, and everyone else (other), respectively.

Changing a file's permission bits is done through the chmod command (or the syscall of the same name), which takes a sequence of three octal digits (e.g., 640), each from 0 to 7. The three digits represent the aforementioned user, group, and other permissions of the file, interpreting each digit as three binary bits (so rwx = 111 = 7, rw- = 110 = 6, etc.).
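The same octal encoding is used directly by the chmod() syscall; a minimal example (the path is arbitrary):

```c
#include <sys/stat.h>

int main(void)
{
    /* 0640: rw- for the owner, r-- for the group, --- for others. */
    return chmod("/tmp/example.txt", 0640);
}
```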

Beyond permission bits, some file systems, especially distributed file systems, use an explicit access control list (ACL) which is a list of all users with their detailed permissions.

In any system that involves access control, we must be wary of the time-of-check-to-time-of-use (TOCTTOU) problem: the resource is used some time after its permission was checked, so if the mapping from name to object can be changed inside that gap (e.g., by re-linking the file to /etc/passwd), there is a security vulnerability.
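A classic instance is checking with access() by path and then open()ing the same path; a safer pattern opens first and then checks the already-opened object via its descriptor. A sketch of both (error handling trimmed; the W_OK check is just one example policy):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Vulnerable: between access() (check) and open() (use), the path could be
 * re-linked to point at a file we should not be allowed to touch. */
int open_checked_racy(const char *path)
{
    if (access(path, W_OK) != 0)
        return -1;
    return open(path, O_WRONLY);
}

/* Safer: open first (refusing symlinks), then inspect the object that was
 * actually opened through its file descriptor. */
int open_checked(const char *path)
{
    int fd = open(path, O_WRONLY | O_NOFOLLOW);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }
    return fd;
}
```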

Disk Data Integrity

In the RAID section, we noted that disks are not perfect and may fail. The general area is referred to as data integrity (or data protection): given the unreliable nature of the hardware media, how do we ensure that the data we read back is exactly what we believe was stored?

RAID assumes a relatively simple fail-stop model, but there are worse failure modes: latent sector errors (LSEs), where individual sectors become unreadable or damaged and the disk reports an error when they are accessed, and silent block corruption, where a block's contents become wrong (due to firmware bugs, misdirected writes, lost writes, etc.) without the disk noticing at all.

LSEs are easier to deal with: the disk reports them, so modern RAID solutions use their redundancy to reconstruct the affected disk or sectors when full-disk faults or LSEs happen; schemes like RDP (row-diagonal parity) keep the equivalent of two parity disks instead of one, so an LSE hit during a single-disk rebuild can still be repaired.

Detecting corruption, however, requires upper-level software involvement, e.g., by the file system or the application. The typical mechanism is the checksum: take a chunk of data, compute some function over it, and store the resulting short summary of the chunk alongside it. When reading that chunk back, the same function is applied to the data actually read and the result is compared with the stored checksum to see if they match. Example checksum functions, in roughly increasing order of strength, include XOR over fixed-size words, simple addition (2's-complement sum), the Fletcher checksum, and cyclic redundancy checks (CRCs).
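Two of the simpler ones are easy to sketch; the function names below are mine.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR checksum over a chunk: extremely cheap, but misses many multi-bit
 * errors (e.g., two flips in the same bit position cancel out). */
uint8_t xor_checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= data[i];
    return sum;
}

/* Fletcher-16: two running sums give much better error detection than
 * plain XOR or addition at a similar cost. */
uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t s1 = 0, s2 = 0;
    for (size_t i = 0; i < len; i++) {
        s1 = (uint16_t)((s1 + data[i]) % 255);
        s2 = (uint16_t)((s2 + s1) % 255);
    }
    return (uint16_t)((s2 << 8) | s1);
}
```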

Checksums are also used to verify data transferred over a network. Beyond checksums, other techniques defend against specific failure cases: storing a physical identifier (e.g., the disk and block number) alongside the checksum catches misdirected writes (data written to the wrong location); read-after-write verification, or keeping a copy of the checksum in a separate location, catches lost writes (writes the device acknowledged but never persisted); and periodic disk scrubbing reads and verifies all data in the background to limit the accumulation of silent errors.

In data deduplication systems, cryptographically secure hash functions are used to compute fingerprints of chunks of data, which serve both as unique identifiers and as duplicate detectors.

Advanced/Related Topics

I've made a brief OS history tree in XMind which is available here.

Certain chapters of the OSTEP book are not included in this reading note, please read if interested:

I have a (relatively broad) set of course notes (similar to this reading note) on topics in computer systems, beyond the basic knowledge of single-node Operating Systems covered in this note:

I have written several blog posts on topics around computer systems:

With regard to the practical aspect of implementing an actual operating system, there are good open-source projects: