Below are notes from *General-Purpose Graphics Processor Architectures*.
This book explores the hardware design of graphics processing units (GPUs).
- initially, GPUs were designed for real-time rendering, with a focus on video games.
- while graphics acceleration continues to be their primary purpose, GPUs increasingly support non-graphics computing ⇒ the book focuses on the performance and energy efficiency of non-graphics applications
- one prominent example today is machine learning systems.
1.1 THE LANDSCAPE OF COMPUTATION ACCELERATORS
Historically, roughly half of performance gains were due to reductions in transistor size.
- the rest came from improvements in hardware architecture, compiler technology, and algorithms
- since about 2005, Dennard Scaling has ended ⇒ improving performance now requires finding more energy-efficient hardware architectures
- one key consequence is that clock frequencies now improve much more slowly as devices become smaller
Moving to vector hardware, such as that found in GPUs, yields about a 10x gain in efficiency by eliminating overheads of instruction processing.
- hardware specialization can minimize data movement by introducing complex operations that perform multiple arithmetic operations while avoiding accesses to large memory arrays such as register files (e.g., a fused multiply-add performs a multiply and an add without writing the intermediate product back to the register file)
A key challenge is balancing the efficiency gains of specialized hardware against the flexibility required to support a wide range of programs.
- modern GPUs support a Turing Complete programming model.
- By Turing Complete, we mean that any computation can be run given enough time and memory.
- GPU manufacturers have refined the GPU architecture and programming model to increase flexibility while simultaneously improving energy efficiency.
1.2 GPU HARDWARE BASICS
Will GPUs eventually replace CPUs entirely?
This seems unlikely. In present systems GPUs are not stand-alone computing devices. Rather, they are combined with a CPU, either
- on a single chip, or
- by inserting an add-in card containing only a GPU into a system containing a CPU
The CPU is responsible for initiating computation on the GPU and transferring data to and from the GPU.
- the beginning and end of the computation typically require access to input/output (I/O) devices
- while there are ongoing efforts to develop application programming interfaces (APIs) providing I/O services directly on the GPU, so far these all assume the existence of a nearby CPU
- the software used to access I/O devices and otherwise provide operating system services appears to lack features, such as massive parallelism, that would make it suitable to run on the GPU.

Figure 1.1
- (a) discrete GPU, where the bus may be PCIe
    - e.g., NVIDIA’s Volta GPU
    - separate DRAM memory spaces for the CPU (often called system memory) and the GPU (often called device memory)
        - DDR for the CPU, optimized for low latency
        - GDDR for the GPU, optimized for high throughput
- (b) integrated CPU and GPU
    - e.g., AMD’s Bristol Ridge APU or a mobile GPU
    - a single DRAM memory space using the same technology; often found in low-power mobile devices, which use LPDDR, optimized for low power
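
A program can check which of these configurations it is running on via the CUDA runtime’s device-property query; here is a minimal sketch (the output formatting is mine, and the available fields vary somewhat across CUDA versions):

```cuda
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.integrated distinguishes GPUs that share physical memory with
        // the CPU (Figure 1.1b) from discrete boards with their own DRAM (Figure 1.1a).
        printf("%s: %s, %.1f GiB of device memory\n",
               prop.name,
               prop.integrated ? "integrated" : "discrete",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```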
How do the CPU and GPU work together on a computation?
A GPU computing application starts running on the CPU. Typically, the CPU portion of the application will allocate and initialize some data structures.
- On older discrete GPUs from both NVIDIA and AMD, the CPU portion of the GPU computing application typically allocates space for data structures in both CPU and GPU memory. For these GPUs, the CPU portion of the application must orchestrate the movement of data from CPU memory to GPU memory.
- More recent discrete GPUs (e.g., NVIDIA’s Pascal architecture) have software and hardware support to automatically transfer data from CPU memory to GPU memory. This can be achieved by leveraging virtual memory support, both on the CPU and GPU.
    - NVIDIA calls this “unified memory”
- On systems in which the CPU and GPU are integrated onto the same chip and share the same memory, no programmer controlled copying from CPU memory to GPU memory is necessary.
- However, because CPUs and GPUs use caches and some of these caches may be private, there can be a cache-coherence problem.
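
As a concrete illustration of the unified-memory style of allocation, here is a minimal CUDA sketch; the `scale` kernel and the problem size are hypothetical, not from the book:

```cuda
#include <cstdio>

// Hypothetical kernel: scales each element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    // A single allocation visible to both CPU and GPU; the runtime and
    // hardware migrate pages on demand, so no explicit copies are needed.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; i++) data[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  // wait for the GPU before the CPU reads the data
    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```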
At some point, the CPU must initiate computation on the GPU.
- In current systems this is done with the help of a driver running on the CPU.
- the CPU portion of the GPU computing application specifies
    - which code should run on the GPU
        - this code is commonly referred to as a kernel
    - how many threads should run
    - where these threads should look for input data
- the driver
    - conveys the kernel to run, the number of threads, and the data locations to the GPU hardware
    - translates this information and places it in memory accessible by the GPU, at a location where the GPU is configured to look for it
    - signals the GPU that there is new computation to run
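
These three pieces of information map directly onto a CUDA kernel launch. Below is a sketch of the classic explicit-copy flow; the `add` kernel, sizes, and launch configuration are illustrative choices, not the book’s:

```cuda
#include <cstdio>
#include <cstdlib>

// Illustrative kernel ("which code should run on the GPU").
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory and copy inputs over
    // ("where these threads should look for input data").
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch configuration ("how many threads should run"):
    // one thread per element, in blocks of 256.
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // also synchronizes
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The driver and runtime handle the rest, packaging the kernel, thread counts, and pointers into a launch command that the GPU hardware is configured to pick up.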
GPU core and memory system organization
A modern GPU is composed of many cores, as shown in Figure 1.2. NVIDIA calls these cores streaming multiprocessors and AMD calls them compute units. Each GPU core executes a single-instruction, multiple-thread (SIMT) program corresponding to the kernel.
- Each core on a GPU can typically run on the order of a thousand threads.
- The threads executing on a single core can (see the sketch after this list)
    - communicate through a scratchpad memory
    - synchronize using fast barrier operations
- Each core also typically contains first-level instruction and data caches
    - these act as bandwidth filters to reduce the amount of traffic sent to lower levels of the memory system
- the large number of threads running on a core are used to hide the latency to access memory when data is not found in the first-level caches.
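
As a sketch of what scratchpad communication and barrier synchronization look like from the programmer’s side, here is a textbook block-level reduction in CUDA (a standard illustration, not an example from the book); it assumes the kernel is launched with 256 threads per block:

```cuda
// Threads in one block (resident on one GPU core) exchange partial sums
// through on-chip shared memory (the scratchpad) and synchronize with
// __syncthreads(), a fast hardware barrier.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[256];                // scratchpad, private to this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // all scratchpad writes now visible

    // Tree reduction: each step halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                      // barrier before the next step reads
    }
    if (tid == 0) out[blockIdx.x] = buf[0];   // one partial sum per block
}
```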

GPUs must balance high computational throughput with high memory bandwidth.
- this requires parallelism in the memory system
- in GPUs this parallelism is provided by multiple memory channels
- often, each memory channel has associated with it a portion of the last-level cache in a memory partition
- the GPU cores and memory partitions are connected via an on-chip interconnection network such as a crossbar.
- alternative organizations are possible, e.g., the Intel Xeon Phi distributes the last-level cache among the cores
What is the relation between performance and the number of threads?
GPUs can obtain improved performance per unit area vs. superscalar out-of-order CPUs on highly parallel workloads by dedicating a larger fraction of their die area to arithmetic logic units and correspondingly less area to control logic.
- To develop intuition into the tradeoffs between CPU and GPU architectures, Guz et al. developed an insightful analytical model showing how performance varies with number of threads.
- To keep their model simple, they assume a simple cache model in which threads do not share data, and infinite off-chip memory bandwidth.
- Figure 1.3, which reproduces a figure from their paper, illustrates an interesting trade-off they found with their model.
- When a large cache is shared among a small number of threads (as is the case in multicore CPUs), performance increases with the number of threads.
- However, if the number of threads increases to the point that the cache cannot hold the entire working set, performance decreases.
- As the number of threads increases further, performance increases again because multithreading is able to hide the long off-chip latency. GPU architectures are represented by the right-hand side of this figure: GPUs are designed to tolerate frequent cache misses by employing multithreading.
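
To make the valley concrete, here is a back-of-envelope model in the same spirit (my own simplification, not Guz et al.’s exact equations). Let $n$ be the number of threads, $C$ the cache size, $W$ the working set per thread, $L$ the miss latency in cycles, $m$ the fraction of instructions that access memory, and $N_{PE}$ the number of execution units:

$$
P_{hit}(n) = \min\!\left(1, \frac{C}{nW}\right), \qquad
\mathrm{Perf}(n) \propto \min\!\left(\frac{n}{1 + m\,\bigl(1 - P_{hit}(n)\bigr)\,L},\; N_{PE}\right)
$$

While $nW \le C$ every access hits and performance grows linearly with $n$; just past $nW = C$ the miss term appears and, whenever $mL > 1$, performance drops in absolute terms (the valley); as $n$ grows large the thread count again covers the stall term and performance saturates at $N_{PE}$.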

Power consumption on GPU
With the end of Dennard Scaling, increasing energy efficiency has become a primary driver of innovation in computer architecture research. A key observation is that accessing large memory structures can consume as much energy as computation, or more.
- When proposing novel GPU architecture designs it is important to take energy consumption into account. To aid with this, recent GPGPU architecture simulators such as GPGPU-Sim incorporate energy models.

1.3 A BRIEF HISTORY OF GPUS
This section briefly describes the history of graphics processing units.
Computer graphics emerged in the 1960s with projects such as Ivan Sutherland’s Sketchpad.
- off-line rendering for film animation developed in parallel with real-time rendering for use in video games
- Early video cards started with the IBM Monochrome Display Adapter (MDA) in 1981, which only supported text
- Later video cards introduced 2D and then 3D acceleration
- 3D accelerators also targeted computer-aided design
NVIDIA introduced programmability to the GPU in the form of vertex shaders and pixel shaders in the GeForce 3 introduced in 2001.
- Researchers quickly learned how to implement linear algebra using these early GPUs by mapping matrix data into textures and applying shaders
- academic work on mapping general-purpose computing onto GPUs, such that the programmer did not need to know graphics, soon followed
- These efforts inspired GPU manufacturers to directly support general-purpose computing in addition to graphics.
- The first commercial product was the NVIDIA GeForce 8 Series.
- The GeForce 8 Series introduced several innovations, including the ability to write to arbitrary memory addresses from a shader and scratchpad memory to limit off-chip bandwidth, both of which had been lacking in earlier GPUs.
- The next innovation was enabling caching of read-write data with NVIDIA’s Fermi architecture.
- Subsequent refinements include AMD’s Fusion architecture, which integrated CPU and GPU on the same die, and dynamic parallelism, which enables launching threads from the GPU itself.
- Most recently, NVIDIA’s Volta introduces features such as Tensor Cores that are targeted specifically at machine learning acceleration.
1.4 BOOK OUTLINE
In Chapter 2, a brief summary of the programming model, code development process, and compilation flow.
In Chapter 3, the architecture of individual GPU cores that support execution of thousands of threads.
- incrementally build up an increasingly detailed understanding of the trade-offs involved in supporting high throughput and a flexible programming model
In Chapter 4, the memory system including both the first-level caches found within the GPU cores, and the internal organization of the memory partitions.
- it is important to understand the memory system of GPUs because computations that run on GPUs are often limited by off-chip memory bandwidth
In Chapter 5, overview of additional research on GPU computing architectures that does not neatly fit into Chapter 3 or 4.