Below are notes from *General-Purpose Graphics Processor Architectures*.
This book explores the hardware design of graphics processing units (GPUs).
- initially, GPUs were designed for real-time rendering, with a focus on video games.
- while graphics acceleration continues to be their primary purpose, GPUs increasingly support non-graphics computing ⇒ the book focuses on the performance and energy efficiency of non-graphics applications
- one prominent example today is machine learning systems.
1.1 THE LANDSCAPE OF COMPUTATION ACCELERATORS
Historically, roughly half of performance gains were due to reductions in transistor size.
- the rest came from improvements in hardware architecture, compiler technology, and algorithms
- since about 2005, Dennard Scaling has ended ⇒ improving performance now requires finding more energy-efficient hardware architectures
- one key consequence is that clock frequencies now improve much more slowly as devices become smaller
Moving to vector hardware, such as that found in GPUs, yields about a 10x gain in efficiency by eliminating overheads of instruction processing.
- hardware specialization can minimize data movement by introducing complex operations that perform multiple arithmetic operations while avoiding accesses to large memory arrays such as register files (e.g., a fused multiply-add performs a multiply and an add without writing the intermediate product back to the register file)
A key challenge is balancing the efficiency gains of specialized hardware against the flexibility required to support a wide range of programs.
- modern GPUs support a Turing Complete programming model.
- By Turing Complete, we mean that any computation can be run given enough time and memory.
- GPU manufacturers have refined the GPU architecture and programming model to increase flexibility while simultaneously improving energy efficiency.
1.2 GPU HARDWARE BASICS
Will GPUs eventually replace CPUs entirely?
This seems unlikely. In present systems GPUs are not stand-alone computing devices. Rather, they are combined with a CPU, either
- on a single chip, or
- by inserting an add-in card containing only a GPU into a system containing a CPU
The CPU is responsible for initiating computation on the GPU and transferring data to and from the GPU.
- the beginning and end of the computation typically require access to input/output (I/O) devices
- while there are ongoing efforts to develop application programming interfaces (APIs) providing I/O services directly on the GPU, so far these all assume the existence of a nearby CPU
- the software used to access I/O devices and otherwise provide operating system services appears to lack features, such as massive parallelism, that would make it suitable to run on the GPU.

Figure 1.1
- (a) discrete GPU, where the bus may be PCIe
    - e.g., NVIDIA’s Volta GPU
    - separate DRAM memory spaces for the CPU (often called system memory) and the GPU (often called device memory)
        - DDR for the CPU, optimized for low latency
        - GDDR for the GPU, optimized for high throughput
- (b) integrated CPU and GPU
    - e.g., AMD’s Bristol Ridge APU or a mobile GPU
    - a single DRAM memory space using the same technology; often found in low-power mobile devices, which use LPDDR, optimized for low power
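
A program can check which of these configurations it is running on via the CUDA runtime’s device-property query; here is a minimal sketch (the output formatting is mine, and the available fields vary somewhat across CUDA versions):

```cuda
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.integrated distinguishes GPUs that share physical memory with
        // the CPU (Figure 1.1b) from discrete boards with their own DRAM (Figure 1.1a).
        printf("%s: %s, %.1f GiB of device memory\n",
               prop.name,
               prop.integrated ? "integrated" : "discrete",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```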
How do the CPU and GPU work together on a computation?
A GPU computing application starts running on the CPU. Typically, the CPU portion of the application will allocate and initialize some data structures.
- On older discrete GPUs from both NVIDIA and AMD, the CPU portion of the GPU computing application typically allocates space for data structures in both CPU and GPU memory. For these GPUs, the CPU portion of the application must orchestrate the movement of data from CPU memory to GPU memory.
- More recent discrete GPUs (e.g., NVIDIA’s Pascal architecture) have software and hardware support to automatically transfer data from CPU memory to GPU memory. This can be achieved by leveraging virtual memory support, both on the CPU and GPU.
    - NVIDIA calls this “unified memory”
- On systems in which the CPU and GPU are integrated onto the same chip and share the same memory, no programmer controlled copying from CPU memory to GPU memory is necessary.
- However, because CPUs and GPUs use caches and some of these caches may be private, there can be a cache-coherence problem.
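
As a concrete illustration of the unified-memory style of allocation, here is a minimal CUDA sketch; the `scale` kernel and the problem size are hypothetical, not from the book:

```cuda
#include <cstdio>

// Hypothetical kernel: scales each element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    // A single allocation visible to both CPU and GPU; the runtime and
    // hardware migrate pages on demand, so no explicit copies are needed.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; i++) data[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  // wait for the GPU before the CPU reads the data
    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```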
At some point, the CPU must initiate computation on the GPU.
- In current systems this is done with the help of a driver running on the CPU.
- the CPU portion of the GPU computing application specifies
    - which code should run on the GPU
        - this code is commonly referred to as a kernel
    - how many threads should run
    - where these threads should look for input data
- the driver
    - conveys the kernel to run, the number of threads, and the data locations to the GPU hardware
    - translates this information and places it in memory accessible by the GPU, at a location where the GPU is configured to look for it
    - signals the GPU that there is new computation to run
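
These three pieces of information map directly onto a CUDA kernel launch. Below is a sketch of the classic explicit-copy flow; the `add` kernel, sizes, and launch configuration are illustrative choices, not the book’s:

```cuda
#include <cstdio>
#include <cstdlib>

// Illustrative kernel ("which code should run on the GPU").
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory and copy inputs over
    // ("where these threads should look for input data").
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch configuration ("how many threads should run"):
    // one thread per element, in blocks of 256.
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // also synchronizes
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The driver and runtime handle the rest, packaging the kernel, thread counts, and pointers into a launch command that the GPU hardware is configured to pick up.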
GPU core and memory system organization
A modern GPU is composed of many cores, as shown in Figure 1.2. NVIDIA calls these cores streaming multiprocessors and AMD calls them compute units. Each GPU core executes a single-instruction, multiple-thread (SIMT) program corresponding to the kernel.
- Each core on a GPU can typically run on the order of a thousand threads.
- The threads executing on a single core can (see the sketch after this list)
    - communicate through a scratchpad memory
    - synchronize using fast barrier operations
- Each core also typically contains first-level instruction and data caches
    - these act as bandwidth filters to reduce the amount of traffic sent to lower levels of the memory system
- the large number of threads running on a core are used to hide the latency to access memory when data is not found in the first-level caches.
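
As a sketch of what scratchpad communication and barrier synchronization look like from the programmer’s side, here is a textbook block-level reduction in CUDA (a standard illustration, not an example from the book); it assumes the kernel is launched with 256 threads per block:

```cuda
// Threads in one block (resident on one GPU core) exchange partial sums
// through on-chip shared memory (the scratchpad) and synchronize with
// __syncthreads(), a fast hardware barrier.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[256];                // scratchpad, private to this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // all scratchpad writes now visible

    // Tree reduction: each step halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                      // barrier before the next step reads
    }
    if (tid == 0) out[blockIdx.x] = buf[0];   // one partial sum per block
}
```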

GPUs must balance high computational throughput with high memory bandwidth.
- this requires parallelism in the memory system
- in GPUs this parallelism is provided by multiple memory channels
- often, each memory channel has associated with it a portion of the last-level cache in a memory partition
- the GPU cores and memory partitions are connected via an on-chip interconnection network such as a crossbar.
- alternative organizations are possible, e.g., the Intel Xeon Phi distributes the last-level cache among the cores
What is the relation between performance and the number of threads?
GPUs can obtain improved performance per unit area vs. superscalar out-of-order CPUs on highly parallel workloads by dedicating a larger fraction of their die area to arithmetic logic units and correspondingly less area to control logic.
- To develop intuition into the tradeoffs between CPU and GPU architectures, Guz et al. developed an insightful analytical model showing how performance varies with number of threads.
- To keep their model simple, they assume a simple cache model in which threads do not share data, and infinite off-chip memory bandwidth.
- Figure 1.3, which reproduces a figure from their paper, illustrates an interesting trade-off they found with their model.
- When a large cache is shared among a small number of threads (as is the case in multicore CPUs), performance increases with the number of threads.
- However, if the number of threads increases to the point that the cache cannot hold the entire working set, performance decreases.
- As the number of threads increases further, performance increases again because multithreading is able to hide the long off-chip latency. GPU architectures are represented by the right-hand side of this figure: GPUs are designed to tolerate frequent cache misses by employing multithreading.
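
To make the valley concrete, here is a back-of-envelope model in the same spirit (my own simplification, not Guz et al.’s exact equations). Let $n$ be the number of threads, $C$ the cache size, $W$ the working set per thread, $L$ the miss latency in cycles, $m$ the fraction of instructions that access memory, and $N_{PE}$ the number of execution units:

$$
P_{hit}(n) = \min\!\left(1, \frac{C}{nW}\right), \qquad
\mathrm{Perf}(n) \propto \min\!\left(\frac{n}{1 + m\,\bigl(1 - P_{hit}(n)\bigr)\,L},\; N_{PE}\right)
$$

While $nW \le C$ every access hits and performance grows linearly with $n$; just past $nW = C$ the miss term appears and, whenever $mL > 1$, performance drops in absolute terms (the valley); as $n$ grows large the thread count again covers the stall term and performance saturates at $N_{PE}$.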

Power consumption on GPU
With the end of Dennard Scaling, increasing energy efficiency has become a primary driver of innovation in computer architecture research. A key observation is that accessing large memory structures can consume as much energy as computation, or more.
- When proposing novel GPU architecture designs it is important to take energy consumption into account. To aid with this, recent GPGPU architecture simulators such as GPGPU-Sim incorporate energy models.

1.3 A BRIEF HISTORY OF GPUS
This section briefly describes the history of graphics processing units.
Computer graphics emerged in the 1960s with projects such as Ivan Sutherland’s Sketchpad.
- off-line rendering for film animation developed in parallel with real-time rendering for use in video games
- Early video cards started with the IBM Monochrome Display Adapter (MDA) in 1981, which only supported text
- Later video cards introduced 2D and then 3D acceleration
- 3D accelerators also targeted computer-aided design
NVIDIA introduced programmability to the GPU in the form of vertex shaders and pixel shaders in the GeForce 3 introduced in 2001.
- Researchers quickly learned how to implement linear algebra using these early GPUs by mapping matrix data into textures and applying shaders
- academic work on mapping general-purpose computing onto GPUs, such that the programmer did not need to know graphics, soon followed
- These efforts inspired GPU manufacturers to directly support general-purpose computing in addition to graphics.
- The first commercial product was the NVIDIA GeForce 8 Series.
- The GeForce 8 Series introduced several innovations, including the ability to write to arbitrary memory addresses from a shader and scratchpad memory to limit off-chip bandwidth, both of which had been lacking in earlier GPUs.
- The next innovation was enabling caching of read-write data with NVIDIA’s Fermi architecture.
- Subsequent refinements include AMD’s Fusion architecture, which integrated CPU and GPU on the same die, and dynamic parallelism, which enables launching threads from the GPU itself.
- Most recently, NVIDIA’s Volta introduces features such as Tensor Cores that are targeted specifically at machine learning acceleration.
1.4 BOOK OUTLINE
In Chapter 2, a brief summary of the programming model, code development process, and compilation flow.
In Chapter 3, the architecture of individual GPU cores that support execution of thousands of threads.
- incrementally build up an increasingly detailed understanding of the trade-offs involved in supporting high throughput and a flexible programming model
In Chapter 4, the memory system including both the first-level caches found within the GPU cores, and the internal organization of the memory partitions.
- it is important to understand the memory system of GPUs because computations that run on GPUs are often limited by off-chip memory bandwidth
In Chapter 5, overview of additional research on GPU computing architectures that does not neatly fit into Chapter 3 or 4.