Below are notes from *Architectural and Operating System Support for Virtual Memory*.
Page table updates and lookups are just loads and stores. But TLBs are generally not kept coherent with the rest of the memory system: stores to page tables do not automatically propagate to the TLBs, nor are stale TLB entries automatically invalidated. So software (the program or the OS) must add explicit synchronization whenever the state of the VM subsystem is updated.
In this chapter:
- why TLBs and (often) instruction caches lack coherence
- the synchronization requirements this places on software, for both single-threaded and multi-threaded code
- how even coherent caches can produce unexpected behavior under weak memory consistency models
1. Non-Coherent Caches and TLBs
Whether to make a cache or TLB coherent is a trade-off among performance, power, area, and programmability (ease of use).
Definition of coherence:
- there is a globally agreed total order of the stores to each memory location
- the single-writer/multiple-readers (SWMR) condition holds
- each load returns the value written by the latest store to the same memory location
- "latest" can only be defined through the memory consistency model
- each store propagates to every other observer in a finite amount of time
- so processors always make forward progress when executing multithreaded programs
Data caches hold data that is read and written frequently (sometimes by more than one core), so software-managed coherence would be too large a burden for programmers; dedicated hardware manages coherence instead.
TLBs and instruction caches hold data that is mostly read-only and changes relatively infrequently. Moreover, instruction fetches and page table walks often travel down different microarchitectural pathways than normal loads. So spending area on coherence hardware brings no clear benefit for instruction caches and TLBs.
- For TLBs, entries are not tagged with the physical address of the page table entry they were read from, so they cannot be snooped successfully. (A set-associative TLB is indexed by virtual address for lookup, so even if entries were tagged with that physical address, every TLB entry would have to be searched on each snoop. Alternatively, one could add a fully associative Page Table Entry CAM (PCAM) that tracks the mapping between TLB entries and page table entries, as UNITD does.)
- For instruction caches, even if the cache were made coherent, already-fetched instructions in flight in the pipeline would still not be tracked by the coherence protocol.
2. TLB Shootdowns
Failing to flush stale TLB entries affects both correctness and security: e.g., process A could read or write data in a page that has been reassigned to process B.
The process by which stale TLB entries are invalidated is known as a TLB shootdown. Shootdown mechanisms are defined by the architecture. On some architectures (e.g., x86), each core can invalidate only its own TLB; there is no mechanism to invalidate the TLBs of other cores. So the core modifying the page table must send a signal telling the other cores to invalidate their TLBs themselves.
2.1 Invalidation Granularity
If a single page table entry is modified, invalidating just the corresponding TLB entry is appropriate.
On a context switch, if the TLB has no ASIDs (or the OS does not use ASIDs), invalidating all TLB entries is appropriate.
Architectures therefore provide several granularities of TLB invalidation.
- ARM
    - the tlbi instruction, qualified with ALL to flush the entire TLB, or with VA to flush only entries matching a given virtual address and ASID
    - on older ARM, the TLB is flushed by writing a particular value to the system control coprocessor (CP15) c8 register
- x86
    - the whole TLB is flushed by writing to the CR3 register; the invlpg instruction invalidates a single entry
Between the two extremes, the right choice of granularity remains an open area of research and development; often the answer is a heuristic or threshold found through profiling.
E.g., the Linux code for x86 sets the threshold at 33 pages.

Architectures may also provide other TLB invalidation variants to speed up shootdowns, e.g., invalidating only non-global TLB entries.
2.2 Inter-Processor Interrupts
On some architectures, a core can directly invalidate only its own TLB.
- ARM: has TLB invalidation instructions that take effect on all cores
- IBM Power: tlbiel invalidates the local TLB; tlbie invalidates all cores' TLBs
- x86: needs an inter-processor interrupt (IPI) to invalidate other cores' TLBs
    - an IPI is a class of interrupt sent directly from one core to any subset of the cores
    - the IPI approach ensures the message is received and processed even if the receiving cores are not currently checking for invalidation requests; synchronization through shared memory would require the receiving cores to check explicitly
    - on receiving the IPI, a core traps to the OS, decodes the IPI, and performs the TLB shootdown


2.3 Optimizing TLB Shootdowns
- No shootdown is needed after the accessed or dirty bit is set: the OS reads those bits from the page table in memory, not from the TLB.
- A shootdown is needed after the accessed bit is cleared: otherwise a new access through a stale TLB entry will not re-set the bit, and the OS cannot track the recency of the page.
- A shootdown is needed for correctness after the dirty bit is cleared (when the page is flushed to backing storage): otherwise a later write through a stale entry would not re-mark the page dirty, and the modification could be lost.
- No shootdown is needed after a permission upgrade: the core takes a minor page fault on the stale TLB entry, invalidates it, and walks the table to pick up the new entry.
In IPI-based systems, the set of cores receiving shootdown requests can be filtered. E.g., when a context switch requires flushing all non-global TLB entries, send the shootdown only to cores running in the same context. DiDi proposed a hardware TLB directory that tracks the presence of page table entries in each TLB, enabling accurate filtering.
2.4 Other Details
Because of synonyms, if a TLB shootdown is specified by physical address (e.g., a page frame being swapped out to backing storage), the OS must perform a reverse-map lookup and shoot down every virtual address mapping to that page frame.
A TLB shootdown must also invalidate page table walk caches when those caches are invisible to software. (IBM Power provides an explicit page walk cache invalidation option.)
3. Self-Modifying Code
Scenarios that update instruction memory:
- dynamic linkers often use a lazily updated procedure linkage table (PLT), which connects code to functions in dynamically linked libraries; each update to the PLT is a write to instruction memory
- just-in-time (JIT) compilers generate large amounts of code dynamically and store it to memory so it can be executed
- runtime optimization and low-level debugging or profiling mechanisms also patch code
Because of the W^X security policy, self-modifying code needs OS assistance: while the code is being updated the page must be writable, and when the code is executed the page must be executable.
Support for self-modifying code varies from architecture to architecture:
- x86: handled automatically by hardware, requiring only a jump into the modified code (rather than falling through into the next, just-modified instruction)
    - synchronization instructions are sometimes still needed, e.g.:
        - when the self-modifying code runs in a multithreaded context and the remote threads must be informed
        - when the modification is made through a virtual address synonym
- ARM, Power: require the user to insert explicit fences to ensure the modification is propagated
- POWER:
- dcbst; sync; icbi; isync
- which writes any changes back to main storage, waits for the writes to propagate, invalidates the instruction cache, and waits for the invalidation to complete
- ARM:
- DC CVAU; DSB ISH; IC IVAU; DSB ISH; ISB
- which is analogous to the Power sequence, but requires an additional DSB ISH between the instruction cache invalidation (IC IVAU) and the final ISB
4. Memory Consistency Model
A memory consistency model (or simply a memory model, for short) provides a set of rules that defines what values may be legally returned by loads to memory.
x86 uses the TSO memory model, which allows a store to be reordered after a younger load (the effect of the store buffer).
Power and ARM permit almost all memory accesses to be reordered by default.

Message passing (MP) is the canonical litmus test: the producer writes data and then sets a flag; the consumer reads the flag and then reads the data.
If stores can be reordered, the producer may update the flag before updating the data.
If loads can be reordered, the consumer may read the data before it reads the flag.

4.1 Why Memory Models are Hard
Each core has a different perspective on the order in which memory events take place. This means that there is no single universal notion of “latest” which can be used to determine which value each load should return.
- a younger load can forward a value from an older store on the same core (via the store buffer), while loads on other cores still see the old value, as if their loads happened before the store
- so "latest" must specify the core from whose perspective it is judged
4.2 Memory Models and the Virtual Memory Subsystem
Do the memory model's rules apply to virtual memory, to physical memory, to both, or to neither?
- In *Specifying and Dynamically Verifying Address Translation-Aware Memory Consistency*, the authors concluded that virtual address memory consistency and physical address memory consistency are fundamentally different: virtual address consistency must deal with problems such as synonyms, mapping changes, and status changes that do not apply to physical memory alone.
How do "special" loads such as page table walks and instruction fetches differ in behavior from "normal" loads?
- a memory fence that enforces ordering among "normal" loads may not enforce ordering between a "normal" load and a page table walk
- ARM:
- DMB: orders all preceding loads and stores with all subsequent loads and stores, but it does NOT order instruction fetches or page table walks
- DSB: orders all preceding loads and stores with all subsequent loads and stores, and it DOES order instruction fetches and page table walks
- this is why DSB is needed in the DC CVAU; DSB ISH; IC IVAU; DSB ISH; ISB sequence for self-modifying code
- and why DSB is needed in the DSB ISHST; TLBI VAE1IS; DSB ISH; ISB sequence for a TLB shootdown after modifying the page table
Bug in Linux:
- thread 0 updates a page table entry (a: a store), then checks which cores are using the same context to decide where to send shootdowns (b: a load)
- thread 1 switches into the same context as thread 0 (c: a store), then walks the page table (d: a load)
- with no fence between (a) and (b), x86 TSO allows the older store (a) to be reordered after the younger load (b), so the observed order can be (b)(c)(d)(a)
- thread 1 therefore reads the old page table entry yet receives no shootdown

| Consistency Model | Paper |
| --- | --- |
| Consistency model for non-volatile memory | |
| Consistency model for disk filesystems | |
| Transistency model: a superset of consistency that captures all translation-aware sets of ordering rules | |
5. Summary
Page table updates and self-modifying code are generally off the critical path for performance, so their coherence is not implemented in hardware but left to software. Programmers must observe architecture-specific, or even microarchitecture-specific, requirements to ensure correctness.