Below are notes from General-Purpose Graphics Processor Architectures.
The goal of this chapter: How GPUs are programmed for non-graphics computing.
Why do we need to know the programming model for GPGPUs?
- if you want to explore making changes to the hardware/software interface of GPUs as part of your research, you need to create benchmarks by modifying the source code of existing GPU computing applications
- for example, introducing transactional memory (TM) on GPUs required this because current GPUs do not support TM
What is the SIMT (single-instruction, multiple-thread) model?
- SIMD hardware + MIMD programming model
- Modern GPUs employ wide SIMD hardware to exploit the data-level parallelism in GPU applications.
- GPU computing APIs, such as CUDA and OpenCL, feature a MIMD-like programming model to launch a large array of scalar threads onto the GPU
- Each of these scalar threads can follow its unique execution path and may access arbitrary memory locations
- the GPU hardware executes groups of scalar threads, called warps (or wavefronts in AMD terminology), in lockstep on SIMD hardware to exploit their regularities and spatial localities
2.1 Execution Model
A computational kernel is composed of (typically) thousands of threads. Each thread executes the same program, but may follow a different control flow through that program depending upon the outcome of the computation.
- grid → thread block (cooperative thread array (CTA)) → warp → threads
SAXPY example
SAXPY: single-precision scalar value A times vector value X plus vector value Y
- part of BLAS (Basic Linear Algebra Subprograms)
- useful for implementing higher level matrix operations such as Gaussian elimination
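The book's Figures 2.1 and 2.2 (the CPU-only C code and the CUDA version) are not reproduced in these notes. Below is a minimal sketch of the CUDA version, roughly following Figure 2.2: the `saxpy` kernel name, the `h_`/`d_` pointer prefixes, and the 256-thread block size follow the discussion below, but exact line numbers in the book's figure may differ.

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index for this thread
    if (i < n)                                      // guard: grid may contain more threads than elements
        y[i] = a * x[i] + y[i];                     // one iteration of the original CPU loop
}

int main() {
    int n = 1 << 20;
    float a = 2.0f;
    float *h_x = new float[n], *h_y = new float[n]; // h_ prefix: CPU (host) memory
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;                               // d_ prefix: GPU (device) memory
    cudaMalloc(&d_x, n * sizeof(float));            // ask the GPU driver to allocate GPU memory
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);  // copy inputs to the GPU
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;                  // enough 256-thread blocks to cover n elements
    saxpy<<<nblocks, 256>>>(n, a, d_x, d_y);        // kernel configuration syntax: <<<grid, block>>>

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```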


Threads that execute on the GPU are part of a compute kernel specified by a function
- the CUDA keyword `__global__` on line 1 indicates the kernel function `saxpy` will run on the GPU
- parallelized the “for” loop
- each iteration of the “for” loop on line 4 in the original CPU-only C code from Figure 2.1 is translated into an individual thread running the code on lines 3–5 in Figure 2.2
A compute kernel typically consists of thousands of threads, each of which starts by running the same function.
- kernel calling
- CPU starts computation on the GPU on line 17 using CUDA’s kernel configuration syntax
- the kernel configuration syntax looks much like a function call in C with some additional information specifying the number of threads contained between triple angle brackets (<<<>>>)
- line 17 indicates the compute kernel should launch a single grid consisting of nblocks thread blocks where each thread block contains 256 threads
- the threads that make up a compute kernel are organized into a hierarchy composed of a grid of thread blocks consisting of warps
- NVIDIA warps consist of 32 threads while AMD wavefronts consist of 64 threads
- warps are grouped into a larger unit called a cooperative thread array (CTA) or thread block by NVIDIA
- memory allocation and copying
- we use the prefix `h_` in naming pointer variables for memory allocated in CPU memory and `d_` for pointers to memory allocated in GPU memory
- On line 13 the CPU calls the CUDA library function `cudaMalloc`. This function invokes the GPU driver and asks it to allocate memory on the GPU for use by the program. The call to `cudaMalloc` sets `d_x` to point to a region of GPU memory containing enough space to hold n 32-bit floating-point values.
- On line 15 the CPU calls the CUDA library function `cudaMemcpy`. This function invokes the GPU driver and asks it to copy the contents of the array in CPU memory pointed to by `h_x` to the array in GPU memory pointed to by `d_x`.
- On line 18, after all threads in the grid launched on line 17 have completed, the compute kernel returns control to the CPU. The CPU then invokes the GPU driver to copy the array pointed to by `d_y` from GPU memory back to CPU memory.
- execution of threads
- each thread on the GPU can look up its own identity within the grid of blocks of threads
- employs grid, block and thread identifiers
- In CUDA, grids and thread blocks have x, y, and z dimensions.
- Each thread block has x, y, and z coordinates within a grid.
- Each thread has x, y, and z coordinates within a thread block.
- The extents of these coordinates are set by the kernel configuration syntax (line 17).
- In our example, the y and z dimensions are not specified and so all threads have zero values for their y and z thread block and thread coordinates (a multi-dimensional launch sketch follows this list).
- On line 3 the value of `threadIdx.x` identifies the x coordinate of the thread within its thread block, and `blockIdx.x` indicates the x coordinate of the thread block within its grid.
- The value `blockDim.x` indicates the maximum number of threads in the x-dimension. In our example, `blockDim.x` would evaluate to 256 since this is the value specified on line 17.
- The expression `blockIdx.x*blockDim.x + threadIdx.x` is used to compute an offset `i` for use when accessing the arrays x and y. As we will see, using index `i` we have assigned each thread a unique element of x and y.
- the compiler and hardware enable the programmer to remain oblivious to the lock-step nature of thread execution in a warp
- they create the appearance that each thread within a warp executes independently
- Line 5 in Figure 2.2 performs one iteration of the original loop in Figure 2.1.
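A minimal sketch of how the y and z dimensions can be specified when needed; the `dim3` type and the built-in identifiers are standard CUDA, but the 2D kernel below is illustrative and not from the book:

```cuda
__global__ void kernel2d(float *out, int width, int height) {
    // each thread derives its global (x, y) coordinates from block and thread identifiers
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = (float)(x + y);   // arbitrary per-element work
}

void launch2d(float *d_out, int width, int height) {
    // a 2D grid of 2D thread blocks; the unspecified z extents default to 1,
    // just as the y/z extents default to 1 (and coordinates to 0) in the SAXPY launch
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    kernel2d<<<grid, block>>>(d_out, width, height);
}
```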
Communication of threads
- threads in the same CTA can communicate via scratchpad memory
- Threads within a CTA can communicate with each other efficiently via a per-compute-core scratchpad memory.
- This scratchpad is called shared memory by NVIDIA. Each streaming multiprocessor (SM) contains a single shared memory.
- The space in the shared memory is divided up among all CTAs running on that SM.
- AMD’s Graphics Core Next (GCN) architecture includes a similar scratchpad memory that AMD calls the local data store (LDS).
- Unlike GPUs from NVIDIA, AMD’s GCN GPUs also include a global data store (GDS) scratchpad memory shared by all cores on the GPU.
- These scratchpad memories are small, ranging from 16–64 KB per SM, and exposed to programmers as different memory spaces.
- Programmers allocate memory into scratchpad memory using special keywords in their source code (e.g., `__shared__` in CUDA; see the sketch after this list). The scratchpad memory acts as a software-controlled cache.
- While GPUs also contain hardware managed caches, accessing data through such caches can lead to frequent cache misses. Applications benefit from using scratchpad memory when the programmer can identify data that is reused frequently and in a predictable manner.
- Scratchpad memories are used in graphics applications to pass results between different graphics shaders.
- For example, LDS is used for passing of parameter values between vertex and pixel shaders in GCN.
- threads in different CTAs can communicate via the global address space
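A minimal sketch of intra-CTA communication through `__shared__` memory; the reduction kernel and its names are illustrative (not from the book), and `__syncthreads()` is the standard CUDA barrier that makes threads' writes to the scratchpad visible to each other:

```cuda
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float buf[256];                 // scratchpad (shared memory): one per SM,
                                               // partitioned among the CTAs running on that SM;
                                               // assumes a 256-thread block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;         // each thread stages one element
    __syncthreads();                           // barrier: all writes visible within the CTA

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];     // threads exchange data via the scratchpad
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = buf[0];        // communication across CTAs goes through
                                               // global memory instead
}
```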
Load imbalance across threads in data-intensive irregular applications
- to help with such cases, NVIDIA introduced CUDA Dynamic Parallelism (CDP), which allows kernels running on the GPU to launch additional work from device code (see the sketch below)
- Dynamic Warp Formation (DWF) has also been proposed
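A minimal sketch of a device-side (child) kernel launch with CDP. The kernel names and the work-expansion pattern are illustrative, not from the book; building such code requires relocatable device code (e.g., compiling with `-rdc=true` and linking `cudadevrt`) on a GPU that supports dynamic parallelism.

```cuda
__global__ void childKernel(float *out, int base, int extra) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < extra)
        out[base + j] = 1.0f;                        // illustrative per-item work
}

__global__ void parentKernel(float *out, const int *workCounts, const int *workOffsets, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int extra = workCounts[i];                   // data-dependent amount of work per thread
        if (extra > 0) {
            // CDP: launch a child grid from device code, sized to the work actually found,
            // instead of having one heavily loaded thread loop over it serially
            childKernel<<<(extra + 255) / 256, 256>>>(out, workOffsets[i], extra);
        }
    }
}
```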
2.2 GPU Instruction Set Architecture
This section covers the translation of compute kernels from a high-level language such as CUDA or OpenCL down to the assembly level executed by the GPU hardware, and the form of current GPU instruction sets.
2.2.1 NVIDIA GPU Instruction Set Architectures
The OpenGL Shading Language (OGSL) and Microsoft’s High-Level Shading Language (HLSL) were common when early GPUs first became programmable.
NVIDIA introduced CUDA in early 2007 and followed a similar path, introducing their own high-level virtual instruction set architecture for GPU computing called the Parallel Thread Execution ISA, or PTX.
Before running PTX code on the GPU it is necessary to compile PTX down to the actual instruction set architecture supported by the hardware.
- SASS: “Streaming ASSembler”
- NVIDIA does not fully document SASS
- developers who wished to understand performance at a low level created tools to disassemble SASS
- the first, named “decuda”, was developed by Wladimir Jasper van der Laan in 2007 for the GeForce 8 Series
- NVIDIA eventually introduced a tool, called `cuobjdump`, and started to partially document SASS
- NVIDIA’s SASS documentation [NVIDIA Corporation, c] currently (April 2018) provides only a list of the assembly opcode names but no details on operand formats or SASS instruction semantics
- https://docs.nvidia.com/cuda/parallel-thread-execution/
- PTX ISA v8.5 and v6.1 on the `add` instruction
- assembler tools
- Hou Yunqing. Assembler for NVIDIA FERMI. https://github.com/hyqneuron/asfermi
- Scott Gray. Assembler for NVIDIA Maxwell architecture. https://github.com/NervanaSystems/maxas


- from PTX to SASS, the translation is done by either
- the GPU driver, or
- a stand-alone program called `ptxas` provided in NVIDIA’s CUDA Toolkit (see the example commands below)
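A rough sketch of that flow using the CUDA Toolkit command-line tools; the exact flags and the `sm_70` target architecture are only examples:

```sh
nvcc -ptx saxpy.cu -o saxpy.ptx              # emit the virtual ISA (PTX) for the kernel
ptxas -arch=sm_70 saxpy.ptx -o saxpy.cubin   # assemble PTX down to SASS for a specific GPU
nvcc -arch=sm_70 saxpy.cu -o saxpy           # or build normally; the driver can also JIT PTX at load time
cuobjdump -sass saxpy                        # disassemble the embedded SASS (as in Figures 2.4/2.5)
```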



Figures 2.4 and 2.5 show SASS extracted with `cuobjdump`; Figure 2.4 is for Fermi and Figure 2.5 is for Pascal.
- first column: address of the instruction
- second column: assembly
- third column: assembly encoding
Comparison between virtual ISA and hardware ISA (PTX and SASS)
- similarities
- RISC style (both use loads and stores to access memory)
- using predication
- differences
- PTX has an infinite set of registers (each definition typically uses a new register, much like static single assignment), while SASS uses a limited set of registers
- PTX kernel parameters are allocated in their own separate “parameter” address space, while SASS kernel parameters are passed via banked constant memory, which can be accessed by non-load/store instructions
Comparison between Fermi SASS and Pascal SASS
- Figure 2.5 contains some lines for which there are no disassembled instructions (e.g., at address 0x0000 on Line 3).
- These are special “control instructions” introduced in the NVIDIA Kepler architecture to eliminate the need for explicit dependency checking using a scoreboard.
- as noted by Lai and Seznec, these control instructions appear to be similar to the explicit-dependence lookahead on the Tera Computer System
- Gray describes extensive details of the control instruction encoding that they were able to infer for NVIDIA’s Maxwell architecture.
- According to Gray
- there is one control instruction for every three regular instructions in Maxwell. This appears to also be the case for NVIDIA’s Pascal architecture as illustrated in Figure 2.5.
- the 64-bit control instructions on Maxwell contain three groups of 21 bits, each group encoding the following information for one of the next three regular instructions:
- a stall count
- a yield hint flag
- write, read, and wait dependency barriers
- the use of register reuse flags on regular instructions which can also be seen in Figure 2.5 (e.g., R0.reuse used for the first source operand in the Integer Short Multiply Add instruction, XMAD, on Line 8)
- these flags indicate that an “operand reuse cache” was added to NVIDIA GPUs starting with Maxwell
- the operand reuse cache appears to enable register values to be read multiple times for each main register file access, resulting in reduced energy consumption and/or improved performance
2.2.2 AMD Graphics Core Next Instruction Set Architecture
AMD released a complete hardware-level ISA specification when they introduced their Southern Islands architecture (the 1st generation of the Graphics Core Next (GCN) architecture).
AMD’s compilation flow also includes a virtual instruction set architecture, called HSAIL, as part of the Heterogeneous System Architecture (HSA).
A key difference between AMD’s GCN architecture and NVIDIA GPUs (including NVIDIA’s most recent Volta architecture) is that GCN uses separate scalar and vector instructions.
- Figure 2.6 shows high-level OpenCL code and Figure 2.7 the corresponding machine instructions for Southern Islands
- In Figure 2.7, vector instructions are prefixed with `v_` and scalar instructions with `s_`.
- In the AMD GCN architecture, each compute unit (e.g., SIMT core) contains
- a scalar unit coupled with four vector units
- vector instructions execute on the vector units and compute different 32-bit values for each individual thread in a wavefront
- scalar instructions execute on the scalar unit and compute a single 32-bit value shared by all threads in a wavefront
- in Figure 2.7 the scalar instructions are related to control flow handling
- in particular, `exec` is a special register used to predicate execution of individual vector lanes for SIMT execution
- potential benefit of the scalar unit
- frequently certain portions of a computation in a SIMT program will compute the same result independent of thread ID
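For instance, a hedged CUDA-level illustration (not from the book): in the snippet below, the address computation derived from `blockIdx.x` is identical for every thread in a warp/wavefront, so on GCN the compiler can map it to scalar instructions and scalar registers, while the per-thread part must use vector instructions.

```cuda
__global__ void scaleRow(const float *in, float *out, int rowLen, float s) {
    // uniform across the warp/wavefront: depends only on blockIdx/blockDim,
    // a candidate for the scalar unit on GCN
    int rowBase = blockIdx.x * rowLen;

    // varies per thread: must be computed by vector instructions
    int col = threadIdx.x;
    if (col < rowLen)
        out[rowBase + col] = s * in[rowBase + col];
}
```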

AMD’s GCN hardware instruction set manual provides many interesting insights into AMD GPU hardware.
- to enable data dependency resolution for long latency operations, AMD’s GCN architecture includes `S_WAITCNT` instructions
- for each wavefront there are three counters to indicate the number of outstanding operations of a given type
- vector memory count
- local/global data store count
- register export count
- the compiler or programmer inserts `S_WAITCNT` instructions to have the wavefront wait until the number of outstanding operations decreases below a specified threshold
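A hedged sketch of what this looks like in GCN-style assembly; the instruction operands are simplified and this is not taken from Figure 2.7:

```
flat_load_dword  v2, v[0:1]        ; vector memory load: outstanding vector memory count goes up
flat_load_dword  v3, v[4:5]        ; a second outstanding load
s_waitcnt        vmcnt(0)          ; wait until the vector memory count drops to 0
v_add_f32        v2, v2, v3        ; now safe to consume both loaded values
```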