Below are notes from Architectural and Operating System Support for Virtual Memory.
Advanced VM software design requires detailed discussion of core operating system design and implementation issues, beyond the scope of this synthesis lecture.
Briefly, two general streams of recent work on purely software topics are:
- OS support for transparent superpages
- “transparent”: programmer is relieved of the burden of identifying when superpages are useful
- OS support for efficient TLB coherence
- patch TLB coherence code into the microcode of x86 cores, thereby avoiding many of the performance overheads of TLB shootdowns
- better detecting which cores cache specific page table entries in their TLB
For hardware-software co-design, all three studies below essentially aim to boost TLB reach by adding intelligence to the OS VM stack, and modestly augmenting the TLB to take advantage of these changes.
1. Recency-based TLB Preloading
The first method is analogous to stride prefetching techniques already used for caches. The hardware detects repeated strides in the access pattern, calculates the virtual page numbers of future accesses, and prefetches their translations into the TLB.
The second method goes beyond prefetching approaches for uniprocessors and discusses prefetching techniques for multi-core systems. Multiple threads of parallel programs on different cores often share data structures and hence the same translations. This motivates hardware TLB prefetching schemes in which cores study one another's TLB access patterns to prefetch translations.
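A minimal sketch of this stride-detection idea (class and method names are illustrative, not from the text): on each access we compute the stride between consecutive virtual page numbers, and once the same non-zero stride repeats, we predict the next page and prefetch its translation.

```python
class StrideTLBPrefetcher:
    """Toy stride prefetcher over virtual page numbers (VPNs)."""

    def __init__(self):
        self.last_vpn = None
        self.last_stride = None

    def on_access(self, vpn):
        """Return a VPN to prefetch, or None if no stable stride yet."""
        prefetch = None
        if self.last_vpn is not None:
            stride = vpn - self.last_vpn
            # A repeated non-zero stride suggests a regular access pattern.
            if stride == self.last_stride and stride != 0:
                prefetch = vpn + stride
            self.last_stride = stride
        self.last_vpn = vpn
        return prefetch

p = StrideTLBPrefetcher()
p.on_access(10)          # first access: nothing to predict
p.on_access(12)          # stride 2 observed once
hint = p.on_access(14)   # stride 2 repeats -> prefetch VPN 16
```

A real implementation would track several concurrent streams (e.g., per PC or per region) rather than a single global stride.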
The third method is recency-based TLB preloading.
Basic idea:
Recency: LRU age.
In the figure below, the LRU ages from youngest to oldest are D, A, C, B, P, L. At t0, C is accessed; then at t1/t2/t3, pages with similar recency (C's neighbors in the recency stack) are accessed.


Implementation:
The critical question is how the recency stack should be implemented.
If the TLB is fully associative, an LRU-managed TLB naturally forms a recency stack.
For a set-associative TLB, if the associativity is relatively high, a recency stack can be approximated.
To maintain the recency stack in the page table, each translation entry houses two additional fields. These fields hold pointers to the translations at recency depth +1 and -1, i.e., a forward and a backward pointer, respectively. The pointers are updated by the page table walker.
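The preloading policy can be sketched as follows (a simplified model, assuming the stack is kept as an ordered list rather than the actual forward/backward pointer fields): on a miss for page P, the walker also preloads P's neighbors at recency depth +1 and -1.

```python
class RecencyStack:
    """Toy recency stack; index 0 is the youngest (most recently used) page."""

    def __init__(self, pages):
        self.order = list(pages)

    def neighbors(self, page):
        """Pages at recency depth -1 and +1 relative to `page` (preload set)."""
        i = self.order.index(page)
        out = []
        if i > 0:
            out.append(self.order[i - 1])
        if i + 1 < len(self.order):
            out.append(self.order[i + 1])
        return out

    def touch(self, page):
        """Accessing `page` makes it the youngest entry."""
        self.order.remove(page)
        self.order.insert(0, page)

# Example from the text: LRU ages youngest -> oldest are D, A, C, B, P, L.
stack = RecencyStack(["D", "A", "C", "B", "P", "L"])
preload = stack.neighbors("C")  # on a miss for C, preload A and B
stack.touch("C")                # C becomes the youngest entry
```

In the actual scheme the neighbor lookup is O(1) via the per-entry forward/backward pointers; the list here is only for clarity.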

2. Non-contiguous Superpages
Recent studies show that permanent DRAM faults are becoming more common, and they cause pages of physical memory to be retired from use by the OS. For example, a 2 MB superpage requires 512 aligned, contiguous 4 KB physical frames. If even a single 4 KB frame in this contiguous range has to be retired, the superpage cannot be realized.
Gap-tolerant sequential mapping (GTSM):
- With GTSM, a superpage in the virtual address space is divided into smaller fixed-sized virtual address blocks, which are sequentially mapped to building-blocks in the physical address space.
- In the figure below, the superpage cannot be set up because three pages have faults (the pages in black marked "Retired Page"). With GTSM, three building blocks, each larger than the smallest page size but smaller than the superpage size, are used instead.
- If no page is retired, GTSM is identical to traditional superpage mapping.
- GTSM can map to non-contiguous physical pages as long as all the building blocks reside in a portion of physical memory limited to twice the size of the superpage.

Page table entry change for GTSM:
- Consider an x86-64 page table: the L2-level page table entry, which records information about 2 MB superpages, is modified. Each L2 entry is extended from 8 bytes to 16 bytes (a single GTSM entry is formed from two adjacent L2 page table entries).
- One of the extra 8 bytes is used as a bitmap indicating which building blocks in the memory slice constitute the GTSM page.
- At most half of the bitmap bits can be set.
- The 2 MB superpage thus resides within a 4 MB slice of physical address space.
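A simplified model of the resulting translation (block size and function names are illustrative assumptions, not taken from the text): the bitmap marks which physical building blocks of the 4 MB slice are in use, and the i-th virtual building block of the superpage maps to the i-th set bit.

```python
SUPERPAGE = 2 * 1024 * 1024   # 2 MB virtual superpage
SLICE = 2 * SUPERPAGE         # 4 MB physical slice (twice the superpage)
BLOCK = 64 * 1024             # assumed building-block size: 64 KB
N_BLOCKS = SLICE // BLOCK     # 64 blocks -> a 64-bit bitmap

def gtsm_translate(vaddr_offset, slice_base, bitmap):
    """Translate an offset within the superpage to a physical address.

    Bit i of `bitmap` is set iff physical block i of the slice is used;
    at most half the bits may be set (2 MB mapped out of a 4 MB slice).
    """
    virt_block = vaddr_offset // BLOCK  # which virtual building block
    seen = 0
    # Walk the bitmap to find the virt_block-th set bit.
    for phys_block in range(N_BLOCKS):
        if bitmap >> phys_block & 1:
            if seen == virt_block:
                return slice_base + phys_block * BLOCK + vaddr_offset % BLOCK
            seen += 1
    raise ValueError("offset beyond mapped blocks")

# With no retired frames, the low half of the bitmap is set and GTSM
# degenerates to an ordinary contiguous superpage mapping:
contig = (1 << 32) - 1
assert gtsm_translate(0x1234, 0, contig) == 0x1234
```

If physical block 1 is retired, clearing its bit simply shifts later virtual blocks onto later physical blocks, which is the gap tolerance GTSM provides. In hardware this select-the-i-th-set-bit step is simple combinational logic rather than a loop.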


3. Direct Segments
What aspects of VM are actually used by big-memory workloads today?
- for the vast majority of the program’s address space, the workloads do not require swapping, fragmentation mitigation, or fine-grained protection afforded by current VM implementations
- Swapping
- many of these applications are performance-critical and cannot afford to wait for disk access
- for example, Google observes that sub-second latency increases can reduce user traffic by as much as 20%. Hence, companies like Google, Facebook, Microsoft, and Twitter keep user-facing data like search indices in main memory.
- enterprise databases and in-memory object caches exploit buffer pool memory to minimize disk access
- Memory Allocation and Fragmentation
- Another aspect of big-memory workloads is that they tend to allocate most of their memory during startup and manage that memory internally.
- So, most big-memory workloads see little variation in allocated memory after the workload begins execution.
- Per-page permission
- many scenarios where vast swathes of pages share the same permission attributes
- there are also situations where fine-grained memory protection becomes necessary:
- memory regions used for inter-process communication use page-grain protection to share data/code between processes.
- Code regions are protected by per-page protection to avoid overwrite.
- Copy-on-write uses page grain protection for efficient implementation of the fork() system call to lazily allocate memory when a page is modified.
Modern systems must continue supporting all the features of VM that are classically important (i.e., swapping, fine-grained protection, support for dynamic allocation and defragmentation), but there should also be parallel hardware/software support, or "fast paths," for the types of big-memory workloads the authors consider.
A direct segment maps a large range of contiguous virtual addresses to contiguous physical addresses using small, fixed hardware: base, limit, and offset registers for each core. The entire segment shares the same access permissions.
- If a virtual address is between the base and limit, it is translated to a physical address with the corresponding offset within the direct segment.
- no overhead from TLB miss
- If a virtual address is not in the segment, translation falls back to conventional paging-based VM.
3.1 Hardware Support
In hardware, in addition to the standard TLB and page table walker, three registers are added per core. These registers are maintained by the OS. If LIMIT equals BASE, the direct segment is disabled.
- BASE
- starting address of the contiguous virtual address range of the direct segment
- LIMIT
- end address of the contiguous virtual address range of the direct segment
- OFFSET
- start address of the direct segment’s backing contiguous physical memory minus the value in the BASE register
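The translation rule above can be sketched in a few lines (the register names come from the text; the fallback walk is a stand-in for the normal TLB/page-table path):

```python
def translate(vaddr, base, limit, offset, page_table_walk):
    """Direct-segment translation: hit -> add OFFSET, miss -> normal paging."""
    # LIMIT == BASE disables the direct segment entirely (no address hits).
    if base <= vaddr < limit:
        return vaddr + offset          # no TLB lookup or page walk needed
    return page_table_walk(vaddr)      # conventional paged translation

# OFFSET is the segment's physical start minus BASE, so a VA at BASE maps
# exactly to the physical start of the segment:
phys_start, base, limit = 0x8000_0000, 0x1000_0000, 0x3000_0000
offset = phys_start - base
assert translate(base, base, limit, offset, lambda v: None) == phys_start
```

The in-range check and add are a single comparator and adder per core, which is why a direct-segment hit costs essentially nothing compared to a TLB miss.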

3.2 Software Support
OS separates each application’s address space into two portions
- primary region where the direct segment can be allocated
- (1) the ability to provision a range of VM addresses as a primary region
- during the creation of a process, the OS reserves a contiguous range of virtual addresses for the primary region
- (2) enable memory requests from an application to be mapped to it
- a process can explicitly request memory allocation in the primary region via a flag to the memory allocator
- or, by default, non-file-backed memory with read-write permissions is placed in the primary region
- remainder of the address space is the normal pageable and swappable address space we use conventionally
4. Other Hardware-software Approaches
Direct segments can be applied at the host level, the guest level, or both, almost entirely eliminating the TLB overheads of virtualization for many big-memory workloads.
Range translation: replaces a single direct segment with the ability to create multiple smaller ranges, thereby improving the performance of even more workloads than the original direct segments work.
The authors encourage readers to investigate the use of concepts like direct segments and range translations in emerging accelerator architectures, for which novel address translation hardware is a first-class design goal.
5. Summary
The co-design approaches above build on software-directed prefetching, or on a better understanding of which aspects of VM are actually needed by modern workloads.
It will be interesting to explore how emerging workloads targeted at domains like deep learning, virtual/augmented reality, and so on continue to stress the VM system in new and interesting ways.