Traditional graphics applications interact with several memory spaces, such as texture, constant, and render surfaces. While access to these memory spaces is available in GPGPU programming APIs like CUDA, in this chapter we focus on the memory spaces employed in GPGPU programming and, in particular, the microarchitecture support required to implement them.
CPUs typically include two separate memory spaces: The register file and memory.
Modern GPUs logically subdivide memory further into local and global memory spaces.
  • Local memory space is private per thread and typically used for register spilling
  • Global memory is used for data structures that are shared among multiple threads
  • In addition, modern GPUs typically implement a programmer-managed scratchpad memory that is shared among threads that execute together in a cooperative thread array.
    • For higher performance and lower energy (see the tiling sketch after this list):
      • In many applications the programmer knows which data needs to be accessed at a given step in a computation. By loading all of this data into shared memory at once, the long-latency off-chip memory accesses can be overlapped with computation, and repeated long-latency accesses to memory are avoided while performing computation on this data.
      • More importantly, the number of bytes that can be transferred between the GPU and off-chip memory in a given amount of time (DRAM bandwidth) is small relative to the number of instructions that can be executed in that same amount of time.
      • Moreover, the energy consumed to transfer data between off-chip memory and the GPU is orders of magnitude higher than the energy consumed accessing data from on-chip memory.
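As a concrete (and deliberately minimal) illustration of staging data in shared memory, the CUDA sketch below loads a tile of a 1-D array into the scratchpad once and then lets neighboring threads reuse it; the kernel name, tile size, and 3-point stencil are illustrative choices, not something prescribed by the text.

```cuda
#define TILE 256  // launch with blockDim.x == TILE, e.g. blur1d<<<(n + TILE - 1) / TILE, TILE>>>(...)

__global__ void blur1d(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];               // programmer-managed scratchpad (+2 for halo)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                     // +1 leaves room for the left halo cell

    tile[lid] = (gid < n) ? in[gid] : 0.0f;        // one coalesced off-chip load per element
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;                  // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;        // right halo
    __syncthreads();                               // whole tile resident before any reuse

    if (gid < n)                                   // all three reads now hit shared memory
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```
Without the scratchpad, each output would issue three global loads; with it, each input element crosses the off-chip interface roughly once.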
We divide our discussion of the memory system into two parts reflecting the division of memory into portions that reside within the GPU cores and within memory partitions that connect to off-chip DRAM chips.

4.1 First-Level Memory Structures

This section describes the first-level cache structures found on GPUs with a focus on
  • unified L1 data cache
  • scratch pad “shared memory”
  • L1 texture cache
    • this provides some insight and intuition into how GPUs differ from CPUs
  • how these caches and memories interact with the core pipeline
    • an interesting aspect of the first-level memory structures in GPUs is how they interact with the core pipeline when hazards are encountered

4.1.1 Scratchpad Memory and L1 Data Cache

In the CUDA programming model, “shared memory” refers to a relatively small memory space that is expected to have low latency but which is accessible to all threads within a given CTA (Cooperative Thread Array). (What is CTA: https://docs.nvidia.com/cuda/parallel-thread-execution/#cooperative-thread-arrays)
  • In other architectures, such a memory space is sometimes referred to as a scratchpad memory.
  • The latency to access this memory space is typically comparable to register file access latency.
  • In OpenCL this memory space is referred to as “local memory.”
 
From a programmer's perspective, key aspects to consider when using shared memory are:
  • limited capacity
  • potential for bank conflicts (illustrated in the sketch after this list)
    • shared memory is implemented as a banked static random access memory (SRAM)
    • a bank conflict arises when more than one thread accesses the same bank on a given cycle and those threads wish to access distinct locations in that bank
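To make the bank-conflict definition concrete, the CUDA sketch below contrasts a column-wise access pattern that serializes a warp against a single bank with the usual padding trick that spreads the same logical accesses across banks. The 32-bank, 4-byte-word mapping is the common NVIDIA arrangement but is treated here as an assumption.

```cuda
__global__ void bank_demo(float* out) {
    // 32 banks assumed, successive 32-bit words in successive banks.
    __shared__ float plain[32][32];    // column access -> 32-way conflict
    __shared__ float padded[32][33];   // extra column shifts each row by one bank

    int t = threadIdx.x;               // assume blockDim.x == 32 (one warp)
    for (int r = 0; r < 32; ++r)
        plain[r][t] = padded[r][t] = (float)(r * 32 + t);
    __syncthreads();

    // plain[t][0]: word index t*32, i.e., bank 0 for every thread in the
    // warp but 32 distinct locations -> the access is serialized (replayed).
    float a = plain[t][0];

    // padded[t][0]: word index t*33, i.e., bank t for thread t -> one pass.
    float b = padded[t][0];

    out[t] = a + b;
}
```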
 
The L1 data cache maintains a subset of the global memory address space in the cache. In some architectures the L1 cache contains only locations that are not modified by kernels, which helps avoid complications due to the lack of cache coherence on GPUs.
 
From a programmer's perspective, a key consideration when accessing global memory is the relationship between the memory locations accessed by different threads within a given warp.
  • coalesced:
    • if all threads in a warp access locations that fall within a single L1 data cache block and that block misses, then only a single request needs to be sent to the lower-level caches
  • uncoalesced: accesses by the threads of a warp that touch many distinct cache blocks, generating multiple memory requests
  • Programmers try to avoid both bank conflicts and uncoalesced accesses (contrasted in the sketch after this list).
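The CUDA fragment below contrasts the two global-access patterns; the stride parameter is illustrative.

```cuda
__global__ void coalesce_demo(const float* in, float* out, int n, int stride) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: consecutive threads of a warp read consecutive 4-byte words,
    // so the warp's 32 loads fall into one (or two) 128-byte cache blocks.
    float a = (t < n) ? in[t] : 0.0f;

    // Uncoalesced: with a large stride each thread touches a different cache
    // block, so a single warp can generate up to 32 separate memory requests.
    int j = t * stride;
    float b = (j < n) ? in[j] : 0.0f;

    if (t < n) out[t] = a + b;
}
```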
 
Figure 4.1 illustrates a GPU cache organization like that described by Minkin et al. The design pictured implements a unified shared memory and L1 data cache, a feature introduced in NVIDIA’s Fermi architecture that is also present in the Kepler architecture.
At the center of the diagram is an SRAM data array (5) which can be configured partly for direct mapped access for shared memory and partly as a set associative cache.
The design supports a non-stalling interface with the instruction pipeline by using a replay mechanism when handling bank conflicts and L1 data cache misses.
[Figure 4.1: unified L1 data cache and shared memory organization]

Shared Memory Access Operations

For a shared memory access, the arbiter determines whether the requested addresses within the warp will cause bank conflicts.
  • If the requests will cause bank conflicts, the arbiter splits the request into two parts (a simplified model of this split is sketched after this list).
    • The first part includes addresses for a subset of threads in the warp which do not have bank conflicts.
      • This part of the original request is accepted by the arbiter for further processing by the cache.
        • The accepted portion of a shared memory request bypasses tag lookup inside the tag unit (3) as shared memory is direct mapped.
          • address crossbar (4) distributes addresses to the individual banks within the data array
          • the data is returned to the appropriate thread’s lane for storage in the register file via the data crossbar (6)
        • When accepting a shared memory load request the arbiter schedules a writeback event to the register file inside the instruction pipeline as the latency of the direct mapped memory lookup is constant in the absence of bank conflicts.
          • only lanes corresponding to active threads in the warp write a value to the register file
    • The second part contains those addresses that cause bank conflicts with addresses in the first part (if the second part still contains bank conflicts, it is split again in the same way when it is replayed).
      • This part of the original request is returned to the instruction pipeline and must be executed again (replay).
      • While area can be saved by replaying the memory access instruction from the instruction buffer, this consumes energy in accessing the large register file.
        • A better alternative for energy efficiency may be to provide limited buffering for replaying memory access instructions in the load/store unit and to avoid scheduling memory access operations from the instruction buffer when free space in this buffer begins to run out.
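The splitting step can be pictured with a small host-side model. The C++ sketch below (compilable as part of a CUDA program) accepts at most one distinct location per bank per pass, treats identical addresses as a broadcast, and defers everything else for replay; the 32-bank, 4-byte-word mapping and all names are assumptions for illustration, not the actual arbiter logic.

```cuda
#include <cstdint>
#include <vector>

struct LaneAccess { int lane; uint32_t addr; };

// Returns the subset of a warp's shared memory accesses serviced this pass;
// the conflicting remainder is appended to 'replay' for a later pass.
std::vector<LaneAccess> splitByBank(const std::vector<LaneAccess>& warp,
                                    std::vector<LaneAccess>& replay) {
    const int NUM_BANKS = 32;                  // assumed bank count
    bool     busy[NUM_BANKS] = {false};
    uint32_t servedAddr[NUM_BANKS] = {0};
    std::vector<LaneAccess> accepted;

    for (const LaneAccess& a : warp) {
        int bank = (a.addr / 4) % NUM_BANKS;   // assumed 4-byte word interleaving
        if (!busy[bank]) {                     // first use of this bank: accept
            busy[bank] = true;
            servedAddr[bank] = a.addr;
            accepted.push_back(a);
        } else if (servedAddr[bank] == a.addr) {
            accepted.push_back(a);             // same location: broadcast, no conflict
        } else {
            replay.push_back(a);               // distinct location, same bank: replay
        }
    }
    return accepted;
}
```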

Cache Read Operations

A load to the global memory space:
  • Since only a subset of the global memory space is cached in the L1, the tag unit needs to check whether the requested data is present in the cache or not.
  • While the data array is highly banked to enable flexible access to shared memory by individual warps, access to global memory is restricted to a single cache block per cycle.
    • This restriction helps to reduce tag storage overhead relative to the amount of cached data and is also a consequence of the standard interface to standard DRAM chips.
  • The L1 cache block size is 128 bytes in Fermi and Kepler and is further divided into four 32-byte sectors in Maxwell and Pascal.
    • The 32-byte sector size corresponds to the minimum size of data that can be read from a recent graphics DRAM chip in a single access (e.g., GDDR5). Each 128-byte cache block is composed of 32-bit entries at the same row in each of the 32 banks.
  • Data and control flow in L1
    • The load/store unit (1) computes memory addresses and applies the coalescing rules to break a warp’s memory access into individual coalesced accesses which are then fed into the arbiter (2).
    • The arbiter may reject a request if enough resources are not available.
      • For example, if all ways in the cache set that the access maps to are busy or there are no free entries in the pending request table (7).
    • Assuming enough resources are available to handle a miss, the arbiter requests that the instruction pipeline schedule a writeback to the register file a fixed number of cycles in the future, corresponding to a cache hit.
    • In parallel, the arbiter also requests that the tag unit (3) check whether the access in fact leads to a cache hit or a miss.
      • In the event of a cache hit, the appropriate row of the data array (5) is accessed in all banks and the data is returned (6) to the register file in the instruction pipeline. As in the case of shared memory accesses, only register lanes corresponding to active threads are updated.
      • When accessing the tag unit, if it is determined that a request triggers a cache miss, the arbiter informs the load/store unit that it must replay the request, and in parallel it sends the request information to the pending request table (PRT) (7).
      • The L1 data cache shown in Figure 4.1 is virtually indexed and virtually tagged.
        • CPUs use VIPT to avoid the overheads of flushing the L1 data cache on context switches.
        • While GPUs effectively perform a context switch every cycle that a warp issues, the warps are part of the same application.
      • After an entry is allocated in the PRT a memory request is forwarded to the memory management unit (MMU) (8) for virtual to physical address translation and from there over a crossbar interconnect to the appropriate memory partition unit.
        • the memory partition units contain a bank of L2 cache along with a memory access scheduler
        • Along with information about which physical memory address to access and how many bytes to read, the memory request contains a “subid” that can be used to look up the entry in the PRT containing information about the request when the memory request returns to the core.
      • Once a memory request response for the load is returned to the core, it is passed by the MMU to the fill unit (9). The fill unit in turn uses the subid field in the memory request to look up information about the request in the PRT. This includes information that can be passed by the fill unit to the load/store unit via the arbiter (2) to reschedule the load, which is then guaranteed to hit in the cache because the line is locked in the cache after it has been placed into the data array (5). (A speculative model of the PRT is sketched after this list.)
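The role of the PRT and its subid can be summarized with a speculative software model; the field names, entry contents, and capacity below are assumptions made for illustration and are not taken from the patents the text describes.

```cuda
#include <cstdint>

struct PRTEntry {
    bool     valid      = false;
    uint64_t blockAddr  = 0;    // cache block the outstanding miss is fetching
    uint32_t warpId     = 0;    // enough state to replay the load when the fill arrives
    uint32_t activeMask = 0;
    uint16_t destReg    = 0;
};

struct PendingRequestTable {
    static const int SIZE = 64;        // assumed capacity
    PRTEntry entries[SIZE];

    // Allocate an entry on a miss; the returned index travels with the
    // memory request as its "subid".  Returns -1 when full, in which case
    // the arbiter must reject the access.
    int allocate(const PRTEntry& e) {
        for (int i = 0; i < SIZE; ++i)
            if (!entries[i].valid) { entries[i] = e; entries[i].valid = true; return i; }
        return -1;
    }

    // The fill unit uses the subid carried by the response to recover the request.
    PRTEntry& lookup(int subid)  { return entries[subid]; }
    void      release(int subid) { entries[subid].valid = false; }
};
```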

Cache Write Operations

The L1 data cache in Figure 4.1 can support both write through and write back policies. The specific memory space written to determines whether the write is treated as write through or write back.
  • write through with no write allocate
    • Accesses to global memory in many GPGPU applications can be expected to have very poor temporal locality, as kernels are commonly written in such a way that threads write out data to a large array right before exiting.
  • write back with write allocate policy
    • local memory writes for spilling registers to the stack may show good temporal locality with subsequent loads.
 
The data to be written, whether to shared memory or global memory, is first placed in the write data buffer (WDB) (10). For uncoalesced accesses, or when some threads are masked off, only a portion of a cache block is written. If the block is present in the cache, the data can be written to the data array via the data crossbar (6). If the block is not present in the cache, it must first be read from the L2 cache or DRAM. Coalesced writes that completely fill a cache block may bypass the cache, provided they invalidate the tags of any stale data in the cache. (The space-dependent policy selection described above is sketched below.)
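The policy choice by memory space can be written down as a tiny sketch; the enum and function names are illustrative, not an actual hardware interface.

```cuda
enum class Space  { Global, Local };
enum class Policy { WriteThroughNoAllocate, WriteBackAllocate };

// Global stores are typically streamed output with poor temporal locality:
// write through, no allocate.  Local stores (register spills, stack data)
// are likely to be read back soon: write back, allocate on a miss.
Policy policyFor(Space s) {
    return (s == Space::Global) ? Policy::WriteThroughNoAllocate
                                : Policy::WriteBackAllocate;
}
```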
Note that the cache organization described in Figure 4.1 does not support cache coherence. To avoid coherence issues, NVIDIA GPUs starting with Kepler only permitted local memory data (register spills and stack data) or read-only global memory data to be placed in the L1 data cache.
  • Ref: https://forums.developer.nvidia.com/t/gpu-coherence-problem/84153

4.1.2 L1 Texture Cache

Recent GPU architectures from NVIDIA combine the L1 data cache and texture cache to save area.
 
The standalone texture cache discussion is largely based on the following paper, which examined how industrial texture cache designs tolerate long off-chip latencies for cache misses.
  • Homan Igehy, Matthew Eldridge, and Kekoa Proudfoot. Prefetching in a texture cache architecture. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, pages 133–ff, 1998.
  • In 3D graphics it is desirable to make scenes look as realistic as possible. In texture mapping an image, called a texture, is applied to a surface in a 3D model to make the surface look more realistic.
  • To implement texture mapping the rendering pipeline first determines the coordinates of one or more samples within the texture image. These samples are called texels.
  • The coordinates are then used to find the address of the memory locations containing the texels (a simplified address computation is sketched after this list). As adjacent pixels on the screen map to adjacent texels, and as it is common to average the values of nearby texels, there is significant locality in texture memory accesses that can be exploited by caches.
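A simplified texel-address computation for a point-sampled, row-major 2-D texture is sketched below; real GPUs typically tile or swizzle texture layouts to improve 2-D locality, so the linear layout here is only an assumption for illustration.

```cuda
__device__ float fetch_texel(const float* tex, int width, int height,
                             float u, float v) {
    // Convert normalized coordinates in [0,1) to integer texel coordinates.
    int x = min((int)(u * width),  width  - 1);
    int y = min((int)(v * height), height - 1);
    // Adjacent screen pixels produce adjacent (x, y) and hence nearby
    // addresses -- the locality that the texture cache exploits.
    return tex[y * width + x];
}
```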
 
Figure 4.2 illustrates the microarchitecture of an L1 texture cache as described by Igehy et al. [1998].
  • In contrast to the L1 data cache described in Section 4.1.1, the tag array (2) and data array (5) are separated by a FIFO buffer (3) .
    • The motivation for this FIFO is to hide the latency of miss requests that may need to be serviced from DRAM. In essence, the texture cache is designed assuming that cache misses will be frequent and that the working set size is relatively small. To keep the size of the tag and data arrays small, the tag array essentially runs ahead of the data array: the contents of the tag array reflect what the data array will contain in the future, after an amount of time roughly equal to the round-trip time of a miss request to memory and back.
    • While throughput is improved relative to a regular CPU design with limited capacity and miss-handling resources, both cache hits and misses experience roughly the same latency.
Texture cache handling for hit and miss:
In detail, the texture cache illustrated in Figure 4.2 operates as follows. The load/store unit (1) sends the computed addresses for texels to perform a lookup in the tag array (2) . If the access hits, a pointer to the location of the data in the data array is placed in an entry at the tail of the fragment FIFO (3) along with any other information required to complete the texture operation. When the entry reaches the head of the fragment FIFO a controller (4) uses the pointer to lookup the texel data from the data array (5) and return it to the texture filter unit (6) .
  • While not shown in detail, for operations such as bilinear and trilinear filtering (mipmap filtering) there are actually four or eight texel lookups per fragment (i.e., screen pixel). The texture filter unit combines the texels to produce a single color value which is returned to the instruction pipeline via the register file.
In the event of a cache miss during tag lookup, the tag array sends a memory request via the miss request FIFO (8). The miss request FIFO sends requests to the lower levels of the memory system (9). DRAM bandwidth utilization in GPU memory systems can be improved by the use of memory access scheduling techniques [Eckert, 2008, 2015] that may service memory requests out of order to reduce row-switch penalties. To ensure the contents of the data array (5) reflect the time-delayed state of the tag array (2), data must be returned from the memory system in order. This is accomplished using a reorder buffer (10). A rough software model of this tag-ahead-of-data organization is sketched below.
[Figure 4.2: L1 texture cache microarchitecture]
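The tag-ahead-of-data idea can be captured in a rough host-side model; every structure name and size below is an assumption for illustration, not the actual design.

```cuda
#include <cstdint>
#include <queue>
#include <vector>

struct FragmentEntry {
    int  line;       // data-array line that will hold the texel
    bool wasMiss;    // if set, the line is filled by an in-order memory response
};

struct TexCacheModel {
    std::vector<uint64_t> tags;              // tag array: the data array's "future" contents
    std::queue<FragmentEntry> fragmentFifo;  // delays data-array access by ~one miss round trip
    std::queue<uint64_t> missFifo;           // requests forwarded toward DRAM, serviced in order

    explicit TexCacheModel(int lines) : tags(lines, ~0ull) {}

    // Tag lookup happens immediately; the data lookup is deferred until the
    // entry drains out of the fragment FIFO, by which time the in-order
    // response (via the reorder buffer) has filled the data-array line.
    void access(uint64_t blockAddr) {
        int line = (int)(blockAddr % tags.size());  // direct-mapped for brevity
        bool hit = (tags[line] == blockAddr);
        if (!hit) {
            tags[line] = blockAddr;                 // tag array runs ahead of the data array
            missFifo.push(blockAddr);
        }
        fragmentFifo.push({line, !hit});
    }
};
```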

4.1.3 Unified Texture and Data Cache

In recent GPU architectures from NVIDIA and AMD caching of data and texture values is performed using a unified L1 cache structure.
  • only data values that can be guaranteed to be read-only are cached in the L1
In AMD’s GCN GPU architecture all vector memory operations are processed through the texture cache.

4.2 On-Chip Interconnection Network

To provide the large amount of memory bandwidth required to supply the SIMT cores, high-performance GPUs connect to multiple DRAM chips in parallel via memory partition units (described in Section 4.3). Memory traffic is distributed across the memory partition units using address interleaving. An NVIDIA patent describes address interleaving schemes for balancing traffic among up to 6 memory partitions at granularities of 256 bytes or 1,024 bytes; a toy version of such a mapping is sketched below.
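The simple modulo mapping below is an illustration only and is not the (more elaborate) scheme described in the patent.

```cuda
#include <cstdint>

// Map a physical address to a memory partition by interleaving at a fixed
// granularity (e.g., 256 or 1,024 bytes) across the available partitions.
int partitionFor(uint64_t physAddr, int numPartitions, int granularityBytes) {
    return (int)((physAddr / granularityBytes) % numPartitions);
}
// e.g., partitionFor(addr, 6, 256) spreads consecutive 256-byte blocks
// round-robin over six partitions.
```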
The SIMT cores connect to the memory partition units via an on-chip interconnection network. The on-chip interconnection networks described in recent patents for NVIDIA are crossbars. GPUs from AMD have sometimes been described as using ring networks.

4.3 Memory Partition Unit

Below, we describe the microarchitecture of a memory partition unit corresponding to several recent NVIDIA patents.
  • patents were filed about a year prior to the release of NVIDIA’s Fermi GPU architecture
  • each memory partition unit contains a portion of the second-level (L2) cache along with one or more memory access schedulers, also called a “frame buffer” or FB, and a raster operation (ROP) unit
    • L2 contains both graphics and compute data
    • FB reorders memory read and write operations to reduce overheads of accessing DRAM
    • ROP is primarily used in graphics operations such as alpha blending and supports compression of graphics surfaces.
      • also supports atomic operations
[Figure: memory partition unit organization]

4.3.1 L2 Cache

The L2 cache design includes several optimizations to improve overall throughput per unit area for the GPU.
  • L2 cache portion inside each memory partition is composed of two slices
    • each slice contains separate tag and data arrays and processes incoming requests in order
    • to match the DRAM atom size of 32 bytes in GDDR5, each cache line inside the slice has four 32-byte sectors. Cache lines are allocated for use either by store instructions or load instructions.
  • To optimize throughput in the common case of coalesced writes that completely overwrite each sector, no data is first read from memory on a write miss.
    • How uncoalesced writes, which do not completely cover a sector, are handled is not described in the patents we examined, but two solutions are storing byte-level valid bits and bypassing the L2 entirely.
  • To reduce the area of the memory access scheduler, data that is being written to memory is buffered in cache lines in the L2 while the writes are awaiting scheduling.

4.3.2 Atomic Operations

Atomic operations can be used for implementing synchronization across threads running in different thread blocks (see the CUDA example after this list).
  • The ROP unit includes functional units for executing atomic and reduction operations.
  • A sequence of atomic operations accessing the same memory location can be pipelined, as the ROP unit includes a local ROP cache.
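A minimal CUDA example of this use case: threads in many different blocks update a single global counter with an atomic read-modify-write, which is serviced on the memory side rather than in any per-core structure.

```cuda
__global__ void count_positive(const float* data, int n, unsigned int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f) {
        // Correct even though the contributing threads belong to different
        // thread blocks; the host should zero *counter before the launch.
        atomicAdd(counter, 1u);
    }
}
```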

4.3.3 Memory Access Scheduler

To store large amounts of data GPUs employ special dynamic random access memory (DRAM) such as GDDR5.
  • DRAM precharge and activate operations introduce delays during which no data can be read or written to the DRAM array. To mitigate these overheads DRAMs contain multiple banks, each with their own row buffer. However, even with multiple DRAM banks it is often not possible to completely hide the latency of switching between rows when accessing data.
  • Memory access schedulers reorder DRAM memory access requests so as to reduce the number of times data must be moved between the row buffers and the DRAM cells.
  • To enable access to DRAM, each memory partition in the GPU may contain multiple memory access schedulers connecting the portion of L2 cache it contains to off-chip DRAM.
  • Each memory access scheduler contains separate logic for sorting read requests and write requests (“dirty data notifications”) sent from the L2 cache. To group together reads to the same row in a DRAM bank, two separate tables are employed.
    • The first, called the read request sorter, is a set-associative structure that is accessed by memory address and maps all read requests to the same row in a given bank to a single pointer. The pointer is then used to look up a list of individual read requests in a second table called the read request store. (A speculative software model of this pair of tables is sketched below.)
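The structure names, the address split, and the hash in the sketch below are assumptions made for illustration, not details taken from the text.

```cuda
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ReadRequest { uint64_t addr; int subid; };

struct RowKey {
    uint32_t bank, row;
    bool operator==(const RowKey& o) const { return bank == o.bank && row == o.row; }
};
struct RowKeyHash {
    size_t operator()(const RowKey& k) const { return ((size_t)k.bank << 20) ^ k.row; }
};

// "Read request sorter": all reads to the same (bank, row) share one bucket;
// the bucket plays the role of the pointer into the "read request store".
using ReadRequestSorter =
    std::unordered_map<RowKey, std::vector<ReadRequest>, RowKeyHash>;

void enqueueRead(ReadRequestSorter& sorter, ReadRequest r,
                 int numBanks, int rowBytes) {
    RowKey key{(uint32_t)((r.addr / rowBytes) % numBanks),             // assumed bank bits
               (uint32_t)(r.addr / ((uint64_t)rowBytes * numBanks))};  // assumed row bits
    sorter[key].push_back(r);   // requests to one open row can drain back-to-back
}
```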