Below are notes from *Architectural and Operating System Support for Virtual Memory*.
Page table updates and lookups are just loads and stores. But TLBs are generally not kept coherent with the rest of the memory system: stores to page tables do not automatically propagate to the TLBs, nor are stale TLB entries automatically invalidated. So software (the program or the OS) must add explicit synchronization whenever the state of the VM subsystem is updated.
In this chapter:
- why TLBs and (often) instruction caches lack coherence
- the synchronization requirements this places on software, for both single-threaded and multi-threaded code
- how even coherent caches can produce unexpected behavior under weak memory consistency models
1. Non-Coherent Caches and TLBs
Whether to make a cache or TLB coherent is a trade-off among performance, power, area, and programmability (ease of use).
Definition of coherence:
- there is a globally agreed total order of the stores to each memory location
- the single-writer/multiple-readers (SWMR) condition holds
- each load returns the value written by the latest store to the same memory location
- "latest" can only be defined through the memory consistency model
- each store propagates to every other observer in a finite amount of time
- so processors always make forward progress when executing multithreaded programs
Data caches hold data that is read and written frequently (sometimes by more than one core), so software-managed coherence would be too large a burden for programmers; dedicated hardware manages coherence instead.
TLBs and instruction caches hold data that is mostly read-only and changes relatively infrequently. Moreover, instruction fetches and page table walks often travel down different microarchitectural pathways than normal loads. So spending area on coherence hardware brings no clear benefit for instruction caches and TLBs.
- For TLBs, entries are not tagged with the physical address of the page table entry they were read from, so they cannot be snooped successfully. (A set-associative TLB is indexed by virtual address for lookup, so even if entries were tagged with that physical address, every TLB entry would have to be searched on each snoop. Alternatively, one could add a fully associative Page Table Entry CAM (PCAM) that tracks the mapping between TLB entries and page table entries, as UNITD does.)
- For instruction caches, even if the cache were made coherent, already-fetched instructions in flight in the pipeline would still not be tracked by the coherence protocol.
2. TLB Shootdowns
Failing to flush stale TLB entries affects both correctness and security: e.g., process A could read or write data in a page that has been reassigned to process B.
The process by which stale TLB entries are invalidated is known as a TLB shootdown. Shootdown mechanisms are defined by the architecture. On some architectures (e.g., x86), each core can invalidate only its own TLB; there is no mechanism to invalidate the TLBs of other cores. So the core modifying the page table must send a signal telling the other cores to invalidate their TLBs themselves.
2.1 Invalidation Granularity
If a single page table entry is modified, invalidating just the corresponding TLB entry is appropriate.
On a context switch, if the TLB has no ASIDs (or the OS does not use ASIDs), invalidating all TLB entries is appropriate.
Architectures therefore provide several granularities of TLB invalidation.
- ARM
    - the tlbi instruction, qualified with ALL to flush the entire TLB, or with VA to flush only entries matching a given virtual address and ASID
    - on older ARM, the TLB is flushed by writing a particular value to the system control coprocessor (CP15) c8 register
- x86
    - the whole TLB is flushed by writing to the CR3 register; the invlpg instruction invalidates a single entry
Between the two extremes, the right choice of granularity remains an open area of research and development; often the answer is a heuristic or threshold found through profiling.
E.g., the Linux code for x86 sets the threshold at 33 pages.

Architectures may also provide other TLB invalidation variants to speed up shootdowns, e.g., invalidating only non-global TLB entries.
2.2 Inter-Processor Interrupts
On some architectures, a core can directly invalidate only its own TLB.
- ARM: has TLB invalidation instructions that take effect on all cores
- IBM Power: tlbiel invalidates the local TLB; tlbie invalidates all cores' TLBs
- x86: needs an inter-processor interrupt (IPI) to invalidate other cores' TLBs
    - an IPI is a class of interrupt sent directly from one core to any subset of the cores
    - the IPI approach ensures the message is received and processed even if the receiving cores are not currently checking for invalidation requests; synchronization through shared memory would require the receiving cores to check explicitly
    - on receiving the IPI, a core traps to the OS, decodes the IPI, and performs the TLB shootdown


2.3 Optimizing TLB Shootdowns
- No shootdown is needed after the accessed or dirty bit is set: the OS reads those bits from the page table in memory, not from the TLB.
- A shootdown is needed after the accessed bit is cleared: otherwise a new access through a stale TLB entry will not re-set the bit, and the OS cannot track the recency of the page.
- A shootdown is needed for correctness after the dirty bit is cleared (when the page is flushed to backing storage): otherwise a later write through a stale entry would not re-mark the page dirty, and the modification could be lost.
- No shootdown is needed after a permission upgrade: the core takes a minor page fault on the stale TLB entry, invalidates it, and walks the table to pick up the new entry.
In IPI-based systems, the set of cores receiving shootdown requests can be filtered. E.g., when a context switch requires flushing all non-global TLB entries, send the shootdown only to cores running in the same context. DiDi proposed a hardware TLB directory that tracks the presence of page table entries in each TLB, enabling accurate filtering.
2.4 Other Details
Because of synonyms, if a TLB shootdown is specified by physical address (e.g., a page frame being swapped out to backing storage), the OS must perform a reverse-map lookup and shoot down every virtual address mapping to that page frame.
A TLB shootdown must also invalidate page table walk caches when those caches are invisible to software. (IBM Power provides an explicit page walk cache invalidation option.)
3. Self-Modifying Code
Scenarios that update instruction memory:
- dynamic linkers often use a lazily updated procedure linkage table (PLT), which connects code to functions in dynamically linked libraries; each update to the PLT is a write to instruction memory
- just-in-time (JIT) compilers generate large amounts of code dynamically and store it to memory so it can be executed
- runtime optimization and low-level debugging or profiling mechanisms also patch code
Because of the W^X security policy, self-modifying code needs OS assistance: while the code is being updated the page must be writable, and when the code is executed the page must be executable.
Support for self-modifying code varies from architecture to architecture:
- x86: handled automatically by hardware, requiring only a jump into the modified code (rather than falling through into the next, just-modified instruction)
    - synchronization instructions are sometimes still needed, e.g.:
        - when the self-modifying code runs in a multithreaded context and the remote threads must be informed
        - when the modification is made through a virtual address synonym
- ARM, Power: require the user to insert explicit fences to ensure the modification is propagated
- POWER:
- dcbst; sync; icbi; isync
- which writes any changes back to main storage, waits for the writes to propagate, invalidates the instruction cache, and waits for the invalidation to complete
- ARM:
- DC CVAU; DSB ISH; IC IVAU; DSB ISH; ISB
- which is analogous to the Power sequence, but requires an additional DSB ISH between the instruction cache invalidation (IC IVAU) and the final ISB
4. Memory Consistency Model
A memory consistency model (or simply a memory model, for short) provides a set of rules that defines what values may be legally returned by loads to memory.
x86 uses the TSO memory model, which allows a store to be reordered after a younger load (the effect of the store buffer).
Power and ARM permit almost all memory accesses to be reordered by default.

Message passing (MP) is the canonical litmus test: the producer writes data and then sets a flag; the consumer reads the flag and then reads the data.
If stores can be reordered, the producer may update the flag before updating the data.
If loads can be reordered, the consumer may read the data before it reads the flag.

4.1 Why Memory Models are Hard
Each core has a different perspective on the order in which memory events take place. This means that there is no single universal notion of “latest” which can be used to determine which value each load should return.
- a younger load can forward a value from an older store on the same core (via the store buffer), while loads on other cores still see the old value, as if their loads happened before the store
- so "latest" must specify the core from whose perspective it is judged
4.2 Memory Models and the Virtual Memory Subsystem
Do the memory model's rules apply to virtual memory, to physical memory, to both, or to neither?
- In *Specifying and Dynamically Verifying Address Translation-Aware Memory Consistency*, the authors concluded that virtual address memory consistency and physical address memory consistency are fundamentally different: virtual address consistency must deal with problems such as synonyms, mapping changes, and status changes that do not apply to physical memory alone.
How do "special" loads such as page table walks and instruction fetches differ in behavior from "normal" loads?
- a memory fence that enforces ordering among "normal" loads may not enforce ordering between a "normal" load and a page table walk
- ARM:
- DMB: orders all preceding loads and stores with all subsequent loads and stores, but it does NOT order instruction fetches or page table walks
- DSB: orders all preceding loads and stores with all subsequent loads and stores, and it DOES order instruction fetches and page table walks
- this is why DSB is needed in the DC CVAU; DSB ISH; IC IVAU; DSB ISH; ISB sequence for self-modifying code
- and why DSB is needed in the DSB ISHST; TLBI VAE1IS; DSB ISH; ISB sequence for a TLB shootdown after modifying the page table
Bug in Linux:
- thread 0 updates a page table entry (a: a store), then checks which cores are using the same context to decide where to send shootdowns (b: a load)
- thread 1 switches into the same context as thread 0 (c: a store), then walks the page table (d: a load)
- with no fence between (a) and (b), x86 TSO allows the older store (a) to be reordered after the younger load (b), so the observed order can be (b)(c)(d)(a)
- thread 1 therefore reads the old page table entry yet receives no shootdown

| Consistency Model | Paper |
| --- | --- |
| Consistency model for non-volatile memory | |
| Consistency model for disk filesystems | |
| Transistency model: a superset of consistency that captures all translation-aware sets of ordering rules | |
5. Summary
Page table updates and self-modifying code are generally off the critical path for performance, so their coherence is not implemented in hardware but left to software. Programmers must observe architecture-specific, or even microarchitecture-specific, requirements to ensure correctness.