Below are notes from General-Purpose Graphics Processor Architectures.
The goal of this chapter: How GPUs are programmed for non-graphics computing.
Why do we need to know the programming model for GPGPUs?
- if you want to explore making changes to the hardware/software interface of GPUs as part of your research, you need to create benchmarks by modifying the source code of existing GPU computing applications
- for example, introducing transactional memory (TM) on GPUs required this because current GPUs do not support TM
What is the SIMT (single-instruction, multiple-thread) model?
- SIMD hardware + MIMD programming model
- Modern GPUs employ wide SIMD hardware to exploit the data-level parallelism in GPU applications.
- GPU computing APIs, such as CUDA and OpenCL, feature a MIMD-like programming model to launch a large array of scalar threads onto the GPU
- Each of these scalar threads can follow its unique execution path and may access arbitrary memory locations
- the GPU hardware executes groups of scalar threads, called warps (or wavefronts in AMD terminology), in lockstep on SIMD hardware to exploit their regularities and spatial localities
2.1 Execution Model
A computational kernel is composed of (typically) thousands of threads. Each thread executes the same program, but may follow a different control flow through that program depending upon the outcome of the computation.
- grid → thread block (cooperative thread array (CTA)) → warp → threads
SAXPY example
SAXPY: single-precision scalar value A times vector value X plus vector value Y
- part of BLAS (Basic Linear Algebra Subprograms)
- useful for implementing higher level matrix operations such as Gaussian elimination
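The book's Figures 2.1 and 2.2 (the CPU-only C code and the CUDA version) are not reproduced in these notes. Below is a minimal sketch of the CUDA version, roughly following Figure 2.2: the `saxpy` kernel name, the `h_`/`d_` pointer prefixes, and the 256-thread block size follow the discussion below, but exact line numbers in the book's figure may differ.

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index for this thread
    if (i < n)                                      // guard: grid may contain more threads than elements
        y[i] = a * x[i] + y[i];                     // one iteration of the original CPU loop
}

int main() {
    int n = 1 << 20;
    float a = 2.0f;
    float *h_x = new float[n], *h_y = new float[n]; // h_ prefix: CPU (host) memory
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;                               // d_ prefix: GPU (device) memory
    cudaMalloc(&d_x, n * sizeof(float));            // ask the GPU driver to allocate GPU memory
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);  // copy inputs to the GPU
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;                  // enough 256-thread blocks to cover n elements
    saxpy<<<nblocks, 256>>>(n, a, d_x, d_y);        // kernel configuration syntax: <<<grid, block>>>

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```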


Threads that execute on the GPU are part of a compute kernel specified by a function
- the CUDA keyword `__global__` on line 1 indicates the kernel function `saxpy` will run on the GPU
- parallelized the “for” loop
- each iteration of the “for” loop on line 4 in the original CPU-only C code from Figure 2.1 is translated into an individual thread running the code on lines 3–5 in Figure 2.2
A compute kernel typically consists of thousands of threads, each of which starts by running the same function.
- kernel calling
- CPU starts computation on the GPU on line 17 using CUDA’s kernel configuration syntax
- the kernel configuration syntax looks much like a function call in C with some additional information specifying the number of threads contained between triple angle brackets (<<<>>>)
- line 17 indicates the compute kernel should launch a single grid consisting of nblocks thread blocks where each thread block contains 256 threads
- the threads that make up a compute kernel are organized into a hierarchy composed of a grid of thread blocks consisting of warps
- NVIDIA warps consist of 32 threads while AMD wavefronts consist of 64 threads
- warps are grouped into a larger unit called a cooperative thread array (CTA) or thread block by NVIDIA
- memory allocation and copying
- we use the prefix `h_` in naming pointer variables for memory allocated in CPU memory and `d_` for pointers to memory allocated in GPU memory
- On line 13 the CPU calls the CUDA library function `cudaMalloc`. This function invokes the GPU driver and asks it to allocate memory on the GPU for use by the program. The call to `cudaMalloc` sets `d_x` to point to a region of GPU memory containing enough space to hold n 32-bit floating-point values.
- On line 15 the CPU calls the CUDA library function `cudaMemcpy`. This function invokes the GPU driver and asks it to copy the contents of the array in CPU memory pointed to by `h_x` to the array in GPU memory pointed to by `d_x`.
- On line 18, after all threads in the grid launched on line 17 have completed, the compute kernel returns control to the CPU. The CPU then invokes the GPU driver to copy the array pointed to by `d_y` from GPU memory back to CPU memory.
- execution of threads
- each thread on the GPU can look up its own identity within the grid of blocks of threads
- employs grid, block and thread identifiers
- In CUDA, grids and thread blocks have x, y, and z dimensions.
- Each thread block has x, y, and z coordinates within a grid.
- Each thread has x, y, and z coordinates within a thread block.
- The extents of these coordinates are set by the kernel configuration syntax (line 17).
- In our example, the y and z dimensions are not specified and so all threads have zero values for their y and z thread block and thread coordinates (a multi-dimensional launch sketch follows this list).
- On line 3 the value of `threadIdx.x` identifies the x coordinate of the thread within its thread block, and `blockIdx.x` indicates the x coordinate of the thread block within its grid.
- The value `blockDim.x` indicates the maximum number of threads in the x-dimension. In our example, `blockDim.x` would evaluate to 256 since this is the value specified on line 17.
- The expression `blockIdx.x*blockDim.x + threadIdx.x` is used to compute an offset `i` for use when accessing the arrays x and y. As we will see, using index `i` we have assigned each thread a unique element of x and y.
- the compiler and hardware enable the programmer to remain oblivious to the lock-step nature of thread execution in a warp
- they create the appearance that each thread within a warp executes independently
- Line 5 in Figure 2.2 performs one iteration of the original loop in Figure 2.1.
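A minimal sketch of how the y and z dimensions can be specified when needed; the `dim3` type and the built-in identifiers are standard CUDA, but the 2D kernel below is illustrative and not from the book:

```cuda
__global__ void kernel2d(float *out, int width, int height) {
    // each thread derives its global (x, y) coordinates from block and thread identifiers
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = (float)(x + y);   // arbitrary per-element work
}

void launch2d(float *d_out, int width, int height) {
    // a 2D grid of 2D thread blocks; the unspecified z extents default to 1,
    // just as the y/z extents default to 1 (and coordinates to 0) in the SAXPY launch
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    kernel2d<<<grid, block>>>(d_out, width, height);
}
```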
Communication of threads
- threads in the same CTA can communicate via scratchpad memory
- Threads within a CTA can communicate with each other efficiently via a per-compute-core scratchpad memory.
- This scratchpad is called shared memory by NVIDIA. Each streaming multiprocessor (SM) contains a single shared memory.
- The space in the shared memory is divided up among all CTAs running on that SM.
- AMD’s Graphics Core Next (GCN) architecture includes a similar scratchpad memory that AMD calls the local data store (LDS).
- Unlike GPUs from NVIDIA, AMD’s GCN GPUs also include a global data store (GDS) scratchpad memory shared by all cores on the GPU.
- These scratchpad memories are small, ranging from 16–64 KB per SM, and exposed to programmers as different memory spaces.
- Programmers allocate memory into scratchpad memory using special keywords in their source code (e.g., `__shared__` in CUDA; see the sketch after this list). The scratchpad memory acts as a software-controlled cache.
- While GPUs also contain hardware managed caches, accessing data through such caches can lead to frequent cache misses. Applications benefit from using scratchpad memory when the programmer can identify data that is reused frequently and in a predictable manner.
- Scratchpad memories are used in graphics applications to pass results between different graphics shaders.
- For example, LDS is used for passing of parameter values between vertex and pixel shaders in GCN.
- threads in different CTAs can communicate via the global address space
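A minimal sketch of intra-CTA communication through `__shared__` memory; the reduction kernel and its names are illustrative (not from the book), and `__syncthreads()` is the standard CUDA barrier that makes threads' writes to the scratchpad visible to each other:

```cuda
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float buf[256];                 // scratchpad (shared memory): one per SM,
                                               // partitioned among the CTAs running on that SM;
                                               // assumes a 256-thread block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;         // each thread stages one element
    __syncthreads();                           // barrier: all writes visible within the CTA

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];     // threads exchange data via the scratchpad
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = buf[0];        // communication across CTAs goes through
                                               // global memory instead
}
```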
Load imbalance across threads in data-intensive irregular applications
- to help with such cases, NVIDIA introduced CUDA Dynamic Parallelism (CDP), which allows kernels running on the GPU to launch additional work from device code (see the sketch below)
- Dynamic Warp Formation (DWF) has also been proposed
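A minimal sketch of a device-side (child) kernel launch with CDP. The kernel names and the work-expansion pattern are illustrative, not from the book; building such code requires relocatable device code (e.g., compiling with `-rdc=true` and linking `cudadevrt`) on a GPU that supports dynamic parallelism.

```cuda
__global__ void childKernel(float *out, int base, int extra) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < extra)
        out[base + j] = 1.0f;                        // illustrative per-item work
}

__global__ void parentKernel(float *out, const int *workCounts, const int *workOffsets, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int extra = workCounts[i];                   // data-dependent amount of work per thread
        if (extra > 0) {
            // CDP: launch a child grid from device code, sized to the work actually found,
            // instead of having one heavily loaded thread loop over it serially
            childKernel<<<(extra + 255) / 256, 256>>>(out, workOffsets[i], extra);
        }
    }
}
```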
2.2 GPU Instruction Set Architecture
This section covers the translation of compute kernels from a high-level language such as CUDA or OpenCL down to the assembly level executed by the GPU hardware, and the form of current GPU instruction sets.
2.2.1 NVIDIA GPU Instruction Set Architectures
The OpenGL Shading Language (OGSL) and Microsoft’s High-Level Shading Language (HLSL) were common when early GPUs first became programmable.
NVIDIA introduced CUDA in early 2007 and followed a similar path, introducing their own high-level virtual instruction set architecture for GPU computing called the Parallel Thread Execution ISA, or PTX.
Before running PTX code on the GPU it is necessary to compile PTX down to the actual instruction set architecture supported by the hardware.
- SASS: “Streaming ASSembler”
- NVIDIA does not fully document SASS
- developers who wished to understand performance at a low level created tools to disassemble SASS
- the first, named “decuda”, was developed by Wladimir Jasper van der Laan in 2007 for the GeForce 8 Series
- NVIDIA eventually introduced a tool, called `cuobjdump`, and started to partially document SASS
- NVIDIA’s SASS documentation [NVIDIA Corporation, c] currently (April 2018) provides only a list of the assembly opcode names but no details on operand formats or SASS instruction semantics
- https://docs.nvidia.com/cuda/parallel-thread-execution/
- PTX ISA v8.5 and v6.1 on the `add` instruction
- assembler tools
- Hou Yunqing. Assembler for NVIDIA FERMI. https://github.com/hyqneuron/asfermi
- Scott Gray. Assembler for NVIDIA Maxwell architecture. https://github.com/NervanaSystems/maxas


- from PTX to SASS, the translation is done by either
- the GPU driver, or
- a stand-alone program called `ptxas` provided in NVIDIA’s CUDA Toolkit (see the example commands below)
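A rough sketch of that flow using the CUDA Toolkit command-line tools; the exact flags and the `sm_70` target architecture are only examples:

```sh
nvcc -ptx saxpy.cu -o saxpy.ptx              # emit the virtual ISA (PTX) for the kernel
ptxas -arch=sm_70 saxpy.ptx -o saxpy.cubin   # assemble PTX down to SASS for a specific GPU
nvcc -arch=sm_70 saxpy.cu -o saxpy           # or build normally; the driver can also JIT PTX at load time
cuobjdump -sass saxpy                        # disassemble the embedded SASS (as in Figures 2.4/2.5)
```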



Figures 2.4 and 2.5 show SASS extracted with `cuobjdump`; Figure 2.4 is for Fermi and Figure 2.5 is for Pascal.
- first column: address of the instruction
- second column: assembly
- third column: assembly encoding
Comparison between virtual ISA and hardware ISA (PTX and SASS)
- similarities
- RISC style (both use loads and stores to access memory)
- using predication
- differences
- PTX has an infinite set of registers (each definition typically uses a new register, much like static single assignment), while SASS uses a limited set of registers
- PTX kernel parameters are allocated in their own separate “parameter” address space, while SASS kernel parameters are passed via banked constant memory, which can be accessed by non-load/store instructions
Comparison between Fermi SASS and Pascal SASS
- Figure 2.5 contains some lines for which there are no disassembled instructions (e.g., at address 0x0000 on Line 3).
- These are special “control instructions” introduced in the NVIDIA Kepler architecture to eliminate the need for explicit dependency checking using a scoreboard.
- as noted by Lai and Seznec, these control instructions appear to be similar to the explicit-dependence lookahead on the Tera Computer System
- Gray describes extensive details of the control instruction encoding that they were able to infer for NVIDIA’s Maxwell architecture.
- According to Gray
- there is one control instruction for every three regular instructions in Maxwell. This appears to also be the case for NVIDIA’s Pascal architecture as illustrated in Figure 2.5.
- the 64-bit control instructions on Maxwell contain three groups of 21 bits, each group encoding the following information for one of the next three regular instructions:
- a stall count
- a yield hint flag
- write, read, and wait dependency barriers
- the use of register reuse flags on regular instructions which can also be seen in Figure 2.5 (e.g., R0.reuse used for the first source operand in the Integer Short Multiply Add instruction, XMAD, on Line 8)
- these flags indicate that an “operand reuse cache” was added to NVIDIA GPUs starting with Maxwell
- the operand reuse cache appears to enable register values to be read multiple times for each main register file access, resulting in reduced energy consumption and/or improved performance
2.2.2 AMD Graphics Core Next Instruction Set Architecture
AMD released a complete hardware-level ISA specification when they introduced their Southern Islands architecture (the 1st generation of the Graphics Core Next (GCN) architecture).
AMD’s compilation flow also includes a virtual instruction set architecture, called HSAIL, as part of the Heterogeneous System Architecture (HSA).
A key difference between AMD’s GCN architecture and NVIDIA GPUs (including NVIDIA’s most recent Volta architecture) is that GCN uses separate scalar and vector instructions.
- Figure 2.6 shows high-level OpenCL code and Figure 2.7 the corresponding machine instructions for Southern Islands
- In Figure 2.7, vector instructions are prefixed with `v_` and scalar instructions with `s_`.
- In the AMD GCN architecture, each compute unit (e.g., SIMT core) contains
- a scalar unit coupled with four vector units
- vector instructions execute on the vector units and compute different 32-bit values for each individual thread in a wavefront
- scalar instructions execute on the scalar unit and compute a single 32-bit value shared by all threads in a wavefront
- in Figure 2.7 the scalar instructions are related to control flow handling
- in particular, `exec` is a special register used to predicate execution of individual vector lanes for SIMT execution
- potential benefit of the scalar unit
- frequently certain portions of a computation in a SIMT program will compute the same result independent of thread ID
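For instance, a hedged CUDA-level illustration (not from the book): in the snippet below, the address computation derived from `blockIdx.x` is identical for every thread in a warp/wavefront, so on GCN the compiler can map it to scalar instructions and scalar registers, while the per-thread part must use vector instructions.

```cuda
__global__ void scaleRow(const float *in, float *out, int rowLen, float s) {
    // uniform across the warp/wavefront: depends only on blockIdx/blockDim,
    // a candidate for the scalar unit on GCN
    int rowBase = blockIdx.x * rowLen;

    // varies per thread: must be computed by vector instructions
    int col = threadIdx.x;
    if (col < rowLen)
        out[rowBase + col] = s * in[rowBase + col];
}
```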

AMD’s GCN hardware instruction set manual provides many interesting insights into AMD GPU hardware.
- to enable data dependency resolution for long latency operations, AMD’s GCN architecture includes `S_WAITCNT` instructions
- for each wavefront there are three counters to indicate the number of outstanding operations of a given type
- vector memory count
- local/global data store count
- register export count
- the compiler or programmer inserts `S_WAITCNT` instructions to have the wavefront wait until the number of outstanding operations decreases below a specified threshold
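A hedged sketch of what this looks like in GCN-style assembly; the instruction operands are simplified and this is not taken from Figure 2.7:

```
flat_load_dword  v2, v[0:1]        ; vector memory load: outstanding vector memory count goes up
flat_load_dword  v3, v[4:5]        ; a second outstanding load
s_waitcnt        vmcnt(0)          ; wait until the vector memory count drops to 0
v_add_f32        v2, v2, v3        ; now safe to consume both loaded values
```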