Below are notes from General-Purpose Graphics Processor Architectures.
Architecture and microarchitecture of modern GPUs
- examining the SIMT cores that implement computation in this chapter
- looking at the memory system in the next chapter
In their traditional graphics-rendering role, GPUs access data sets, such as detailed texture maps, that are far too large to be fully cached on-chip.
- so it is necessary to employ an architecture that can sustain large off-chip bandwidths for performance
- so today’s GPUs execute tens of thousands of threads concurrently
- the amount of on-chip memory storage per thread is small, but caches can still be effective in reducing a sizable number of off-chip memory accesses
- in graphics workloads, there is significant spatial locality between adjacent pixel operations that can be captured by on-chip caches
- ❓no temporal locality❓

Figure 3.1’s pipeline (a SIMT core) consists of three scheduling “loops”
- instruction fetch loop
- blocks including Fetch, I-Cache, Decode, and I-Buffer
- ❓first decode, then send to I-Buffer❓
- instruction issue loop
- blocks including I-Buffer, Scoreboard, Issue and SIMT Stack
- register access loop
- blocks including Operand Collector, ALU and Memory
3.1 One-Loop Approximation
To increase efficiency, threads are organized into groups called “warps” by NVIDIA and “wavefronts” by AMD.
- each cycle, a warp is selected for scheduling
- the unit of scheduling is a warp
- the warp’s program counter is used to access an instruction memory to find the next instruction to execute for the warp
- after the instruction is fetched, it is decoded and its source operand registers are read
- the SIMT execution mask values are determined in parallel with source operand reading
- execution proceeds in a SIMT manner
- each thread executes on the function unit associated with a lane
- function units
- NVIDIA GPUs contain a special function unit (SFU), load/store unit, floating-point function unit, integer function unit, and, as of Volta, a Tensor Core
- All function units nominally contain as many lanes as there are threads within a warp.
- in several GPUs, a single warp or wavefront is executed over several clock cycles by clocking the function units at a higher frequency
- this can achieve higher performance per unit area at the expense of increased energy consumption
- SIMT execution presents the programmer with the abstraction that individual threads execute completely independently
- this abstraction is achieved by a combination of traditional predication along with a stack of predicate masks (the SIMT stack)
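For example, from the programmer’s point of view the following (hypothetical) CUDA kernel simply has threads taking different paths; under SIMT execution the hardware instead runs both paths for the whole warp, masking off the lanes that did not take each path:

```cuda
__global__ void divergent_kernel(int *out) {
    int tid = threadIdx.x;
    // All lanes of a warp fetch this branch together; the hardware computes
    // a per-lane predicate and uses the execution mask / SIMT stack to run
    // each side with only its own lanes enabled, rather than letting the
    // threads run truly independently.
    if (tid % 2 == 0) {
        out[tid] = 2 * tid;   // executed with only the even lanes active
    } else {
        out[tid] = tid + 1;   // executed with only the odd lanes active
    }
}
```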
3.1.1 SIMT EXECUTION MASKING
The SIMT stack helps efficiently handle two key issues that occur when all threads can
execute independently.
- The first is nested control flow. In nested control flow one branch is control dependent upon another.
- The second issue is skipping computation entirely while all threads in a warp avoid a control flow path.
- For complex control flow this can represent a significant savings. Traditionally, CPUs supporting predication have handled nested control flow by using multiple predicate registers, and support for across-lane predicate tests has been proposed in the literature.
The examples below describe a slightly simplified version, introduced in academic work, that assumes the hardware is responsible for managing the SIMT stack
- judging from the patents and the ISA, the real SIMT stack is at least partly managed by special instructions dedicated to this purpose



Code block and thread mask
- “A/1111” inside the top node of the CFG (Control Flow Graph), initially all four threads in the warp are executing the code in Basic Block A
- which corresponds to the code on lines 2 through 6 in Figure 3.2 and lines 1 through 6 in Figure 3.3.
- the branch instruction is also included because all threads need to execute the branch at line 6
- These four threads follow different (divergent) control flow after executing the branch on line 6 in Figure 3.3, which corresponds to the “if” statement on line 6 in Figure 3.2.
- Specifically, as indicated by the label “B/1110” in Figure 3.4a the first three threads fall through to Basic Block B. These three threads branch to line 7 in Figure 3.3 (line 7 in Figure 3.2).
- As indicated by the label “F/0001” in Figure 3.4a, after executing the branch the fourth thread jumps to Basic Block F, which corresponds to line 14 in Figure 3.3 (line 14 in Figure 3.2).
- Similarly, when the three threads executing in Basic Block B reach the branch on line 9 in Figure 3.3, the first thread diverges to Basic Block C while the second and third threads diverge to Basic Block D.
- Then, all three threads reach Basic Block E and execute together as indicated by the label “E/1110” in Figure 3.4a. At Basic Block G all four threads execute together.
The approach used in current GPUs is to serialize execution of threads following different paths
within a given warp.
- Figure 3.4b
To achieve this serialization of divergent code paths, one approach is to use a stack like that illustrated in Figure 3.4c–e.
- each entry on the SIMT stack contains three fields
- a re-convergence program counter (RPC)
- the address of the next instruction to execute (NextPC)
- an active mask
Figure 3.4c illustrates the state of the stack immediately after the warp has executed the
branch on line 6 in Figure 3.3.
- Since three threads branch to Basic Block B and one thread branches to Basic Block F, two new entries have been added to the top of the stack (TOS).
- The next instruction that the warp executes is determined using the Next PC value in the top of stack (TOS) entry.
- In Figure 3.4c, this Next PC value is B, which represents the address for the first instruction in Basic Block B. The corresponding Active Mask entry, “1110”, indicates only the first three threads in the warp should execute this instruction.
- The first three threads in the warp continue executing instructions from Basic Block B until they reach the branch on line 9 in Figure 3.3.
- After executing this branch they diverge, as noted earlier. This branch divergence causes three changes to the stack. First, the Next PC entry of the TOS entry prior to executing the branch, labeled (i) in Figure 3.4d, is modified to the reconvergence point of the branch, which is the address of the first instruction in Basic Block E. Then, two entries, labeled (ii) and (iii) in Figure 3.4d, are added, one for each of the paths followed by threads in the warp after executing the branch.
- A reconvergence point is a location in the program where threads that diverge can be forced to continue executing in lock-step.
- The nearest reconvergence point is generally preferred.
- The earliest point in a given program execution where it can be guaranteed at compile time that threads which diverge can again execute in lock-step is the immediate post-dominator of the branch❓ that caused the branch divergence.
- At runtime it is sometimes possible to reconverge at an earlier point in the program.
What order should be used to add the entries to the stack following a divergent branch?
- To reduce the maximum depth of the reconvergence stack to be logarithmic in the number of threads in a warp it is best to put the entry with the most active threads on the stack first and then the entry with fewer active threads.❓
- In part (d) of Figure 3.4 we follow this order while in part (c) we used the opposite order.
Summary:
- When the first branch divergence happens, the SIMT stack ends up holding three entries
- the bottom entry has no reconvergence PC; its NextPC is the reconvergence PC of the two divergent paths, and its Active Mask is all 1’s
- the top two entries have the same reconvergence PC, which equals the NextPC of the bottom entry; their NextPC is the first PC of each divergent path, and their Active Mask is set according to the threads taking that path
- the RPC is the first instruction after the divergent region (the reconvergence point); the NextPC is the first instruction of the divergent path
- the Active Mask indicates the threads that will execute that path
- When the top entry diverges again
- the NextPC of the original top entry is changed to the reconvergence PC of the two newly allocated entries
- ⇒ when the two new divergent paths finish, execution continues at that reconvergence PC
- two new SIMT stack entries are allocated
- When the NextPC reaches the reconvergence PC, the top entry of the SIMT stack is popped
- instead of continuing at the RPC (the RPC should execute only after the branch reconverges), execution continues at the NextPC of the new top entry
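A minimal host-side sketch of the stack behavior summarized above (compilable as CUDA host code); the struct layout, field widths, and function names are illustrative assumptions, not the actual hardware design:

```cuda
#include <cstdint>
#include <vector>

// One SIMT stack entry for a single warp; mirrors the three fields above.
struct SimtEntry {
    uint32_t rpc;      // reconvergence PC (immediate post-dominator)
    uint32_t next_pc;  // next instruction for the threads in this entry
    uint32_t mask;     // active mask, one bit per thread in the warp
};

struct SimtStack {
    // An initial entry covering the whole warp is assumed to have been pushed.
    std::vector<SimtEntry> entries;

    // Divergent branch: rewrite the current top's NextPC to the reconvergence
    // point, then push one entry per taken path. (The entry with more active
    // threads could be pushed first to bound the stack depth, as noted above.)
    void diverge(uint32_t rpc, uint32_t taken_pc, uint32_t taken_mask,
                 uint32_t fallthru_pc, uint32_t fallthru_mask) {
        entries.back().next_pc = rpc;
        entries.push_back({rpc, fallthru_pc, fallthru_mask});
        entries.push_back({rpc, taken_pc, taken_mask});
    }

    // Each cycle the warp executes at entries.back().next_pc under
    // entries.back().mask. When the NextPC reaches the entry's RPC, the entry
    // is popped and execution continues with the entry below it.
    void advance(uint32_t new_pc) {
        entries.back().next_pc = new_pc;
        if (entries.back().next_pc == entries.back().rpc)
            entries.pop_back();
    }
};
```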
3.1.2 SIMT DEADLOCK AND STACKLESS SIMT ARCHITECTURES
The stack-based implementation of SIMT can lead to a deadlock condition called “SIMT deadlock”.
- NVIDIA calls their new thread divergence management approach Independent Thread Scheduling.
- which is similar to academic proposals
What is SIMT deadlock?

Figure 3.5 program in the left:
- Line A initializes the shared variable, mutex, to zero to indicate that a lock is free.
- Line B, each thread in a warp executes the atomicCAS operation, which performs a compare-and-swap operation on the memory location containing mutex.
- Logically, the compare-and-swap first reads the contents of mutex, then it compares it to the second input, 0. If the current value of mutex is 0, then the compare-and-swap operation updates the value of mutex to the third input, 1. The value returned by atomicCAS is the original value of mutex.
- Importantly, the compare-and-swap performs the above sequence of logical operations atomically for each thread. Thus, multiple accesses by atomicCAS to any single location, made by different threads within the same warp, are serialized. Only one thread will see the value of mutex as 0, and the remaining threads will see the value as 1.
- ❓ Line B should be `while (atomicCAS(mutex, 0, 1));` : if the value read is 1, the thread should stay in the while loop until it acquires the lock ❓
- Specifically, one thread will want to exit the loop while the remaining threads will want to stay in the loop. The thread that exits the loop will have reached the reconvergence point, and thus will no longer be active on the SIMT stack and will be unable to execute the atomicExch operation to release the lock on line C (see the CUDA sketch below).
- the thread that acquires the lock has its NextPC equal to the RPC, so it is removed from the active mask at the top of the SIMT stack and never executes the atomicExch operation
- The threads that remain in the loop will be active at the top of the SIMT-stack and will spin indefinitely.
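A minimal CUDA reconstruction of the spin-lock pattern being discussed (the exact code in Figure 3.5 may differ; the kernel and variable names here are assumptions):

```cuda
__global__ void critical_section(int *mutex, int *data) {
    // Line A (performed by the host or earlier code): *mutex == 0 means free.
    // Line B: spin until this thread swaps 0 -> 1. Exactly one thread per
    // warp wins; the others keep reading 1 and stay in the loop.
    while (atomicCAS(mutex, 0, 1) != 0) {
    }
    // ... critical section ...
    *data += 1;
    // Line C: release the lock. With a reconvergence stack, the winning
    // thread is parked at the reconvergence point waiting for its warp
    // mates, so this line is never reached: SIMT deadlock.
    atomicExch(mutex, 0);
}
```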
What is the stackless branch reconvergence mechanism?
The key idea is to replace the stack with per warp convergence barriers.

These fields are stored in registers and used by the hardware warp scheduler.
- Barrier Participation Mask: tracks which threads within a given warp participate in a given convergence barrier (❓the Active Mask in the SIMT stack❓). There may be more than one barrier participation mask for a given warp. In the common case, threads tracked by a given barrier participation mask will wait for each other to reach a common point in the program following a divergent branch and thereby reconverge together (❓the Reconvergence PC in the SIMT stack❓).
- Barrier State: tracks which threads have arrived at a given convergence barrier.
- Thread State: tracks, for each thread in the warp, whether the thread is ready to execute, blocked at a convergence barrier (and if so, which one), or has yielded. The yielded state may be used to enable other threads in the warp to make forward progress past the convergence barrier in a situation that would otherwise lead to SIMT deadlock.
- Thread rPC: tracks, for each thread that is not active, the address of the next instruction to execute (❓).
- Thread Active: a bit that indicates whether the corresponding thread in the warp is active.


Assuming a warp contains 32 threads, the barrier participation mask is 32-bits wide. If a bit is set, that means the corresponding thread in the warp participates in this convergence barrier. Threads diverge when they execute a branch instruction such as those at the end of basic blocks 510 and 512 in Figure 3.8. These branches correspond to the two “if” statements in Figure 3.7.
To initialize the convergence barrier participation mask a special “ADD” instruction is employed. All threads that are active when the warp executes this ADD instruction have their bit set in the convergence barrier indicated by the ADD instruction. After executing a branch some threads may diverge, meaning the address of the next instruction (i.e., PC) to execute will differ. When this happens the scheduler will select a subset of threads with a common PC and update the Thread Active field to enable execution for these threads of the warp.
- such a subset of threads is referred to as a “warp split”
With a convergence barrier implementation the scheduler is free to switch between groups of diverged threads. This enables forward progress between threads in a warp when some threads
have acquired a lock while others have not.
A “WAIT” instruction is used to stop a warp split when it reaches a convergence barrier.
As described in NVIDIA’s patent application, the WAIT instruction includes an operand to
indicate the identity of the convergence barrier. The effect of the WAIT instruction is to add the
threads in the warp split to the Barrier State register (the threads have arrived at a given convergence barrier) for the barrier and change the threads’ state to blocked. Once all threads in the barrier participation mask have executed the corresponding WAIT instruction the thread scheduler can switch all the threads from the original warp split to active and SIMD efficiency is maintained.
- The example in Figure 3.8 has two convergence barriers, B1 and B2 with WAIT instructions in basic blocks 516 and 518.
- To enable switching between warp splits NVIDIA describes using a YIELD instruction along with other details such as support for indirect branches that we omit in this discussion.
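A rough host-side sketch of the per-warp state described above, assuming 32-thread warps and an arbitrary number of barrier IDs; the field names, widths, and methods are guesses for illustration, not the patent’s exact encoding:

```cuda
#include <cstdint>

// Sketch of per-warp convergence-barrier bookkeeping. A real GPU holds this
// in registers consulted by the warp scheduler; this is only an illustration.
struct ConvergenceBarrierState {
    static const int kWarpSize    = 32;
    static const int kNumBarriers = 16;            // assumed count

    uint32_t participation[kNumBarriers] = {};     // Barrier Participation Masks
    uint32_t arrived[kNumBarriers]       = {};     // Barrier State (who hit WAIT)

    enum ThreadState : uint8_t { READY, BLOCKED, YIELDED };
    ThreadState state[kWarpSize] = {};             // Thread State
    uint32_t    rpc[kWarpSize]   = {};             // Thread rPC (inactive threads)
    uint32_t    active           = 0xffffffffu;    // Thread Active bits

    // "ADD" instruction: all currently active threads join barrier b.
    void add_participants(int b) { participation[b] |= active; }

    // "WAIT" instruction for barrier b, executed by the threads in split_mask:
    // record arrival and block them; once every participant has arrived,
    // release them all so the warp reconverges with full SIMD efficiency.
    void wait(int b, uint32_t split_mask) {
        arrived[b] |= split_mask;
        for (int t = 0; t < kWarpSize; ++t)
            if (split_mask & (1u << t)) state[t] = BLOCKED;
        if (arrived[b] == participation[b]) {
            for (int t = 0; t < kWarpSize; ++t)
                if (participation[b] & (1u << t)) state[t] = READY;
            active |= participation[b];
            arrived[b] = 0;
        }
    }
};
```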


Summary:
- SIMT deadlock in the atomicCAS example arises because the lock-release instruction lies after the reconvergence point. Only one thread in the warp acquires the lock, and the other threads spin waiting for the lock to be released. Since the thread that holds the lock can never execute the release instruction past the reconvergence point, deadlock results.
- To avoid SIMT deadlock, the lock-release instruction must still get a chance to execute. With convergence barriers, the scheduler can switch between warp splits (and blocked threads can yield), so the split that holds the lock can execute the instructions after the reconvergence point and release it.
- The text in this chapter is not very clear; the references below need to be read
- SIMT deadlock
- Ahmed ElTantawy and Tor M. Aamodt. MIMD synchronization on SIMT architectures. In Proc. of the ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 1–14, 2016. DOI: 10.1109/micro.2016.7783714
- Stackless structure
- Multi-Path IPDOM: Ahmed ElTantawy, Jessica Wenjie Ma, Mike O’Connor, and Tor M. Aamodt. A scalable multi-path microarchitecture for efficient GPU control flow. In Proc. of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2014. DOI: 10.1109/hpca.2014.6835936.
- Warp Barrier (thread block compaction): Wilson W. L. Fung and Tor M. Aamodt. Thread block compaction for efficient SIMT control flow. In Proc. of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 25–36, 2011.
- Nvidia Patent: Gregory Frederick Diamos, Richard Craig Johnson, Vinod Grover, Olivier Giroux, Jack H. Choquette, Michael Alan Fetterman, Ajay S. Tirumala, Peter Nelson, and Ronny Meir Krashinsky. Execution of divergent threads using a convergence barrier, July 13, 2015.
- NVIDIA Corp. Inside volta: The world’s most advanced data center GPU. https://developer.nvidia.com/blog/inside-volta/, May 2017.
3.1.3 WARP SCHEDULING
Each core in a GPU hosts many warps. A very interesting question is in which order these warps should be scheduled.
- assume that each warp issues only a single instruction when it is scheduled and furthermore that the warp is not eligible to issue another instruction until the first instruction completes execution.
- If the memory system were “ideal” and responded to memory requests within some fixed latency it would, in theory, be possible to design the core to support enough warps to hide this latency using fine-grained multithreading.
- we can reduce the area of the chip for a given throughput by scheduling warps in “round robin” order.
- In round robin the warps are given some fixed ordering, for example ordered by increasing thread identifiers, and warps are selected by the scheduler in this order.
- However, there is an important trade off: to enable a different warp to issue an instruction each cycle it is necessary that each thread have its own registers (this avoids the need to copy and restore register state between registers and memory). Thus, increasing the number of warps per core increases the fraction of chip area devoted to register file storage relative to the fraction dedicated to execution units. For a fixed chip area increasing warps per core will decrease the total number of cores per chip.
- What impact does scheduling play when considering the memory system of the GPU?
- This has been the topic of considerable research in the past few years.
- locality properties can either favor or discourage round-robin scheduling: when different threads share data at a similar point in their execution, such as when accessing texture maps in graphics pixel shaders, it is beneficial for threads to make equal progress as this can increase the number of memory references which “hit” in on-chip caches, which is encouraged by round-robin scheduling
- Similarly, accessing DRAM is more efficient when nearby locations in the address space are accessed nearby in time and this is also encouraged by round-robin scheduling
- On the other hand, when threads mostly access disjoint data, as tends to occur with more complex data structures, it can be beneficial for a given thread to be scheduled repeatedly so as to maximize locality
Summary:
- if data locality is across different warps, round-robin scheduling of the warps is preferred
- each warp must have its own registers; as the number of warps per core increases, the register-file area increases, so the number of cores per chip decreases
- if data locality is within a single warp, repeatedly scheduling the same warp is preferred
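As a concrete illustration of the baseline policy, here is a minimal sketch of round-robin warp selection; the is_eligible() check is a hypothetical stand-in for whatever the scheduler consults (scoreboard state, instruction buffer occupancy, and so on):

```cuda
#include <functional>

// Pick the next warp to issue in round-robin order, starting just after the
// warp that issued last time. Returns -1 if no warp is eligible this cycle.
int round_robin_pick(int num_warps, int last_issued,
                     const std::function<bool(int)> &is_eligible) {
    for (int i = 1; i <= num_warps; ++i) {
        int w = (last_issued + i) % num_warps;
        if (is_eligible(w))
            return w;
    }
    return -1;
}
```

A locality-aware policy that favors a single warp would only change the order in which candidates are considered, not this basic structure.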
3.2 Two-Loop Approximation
To help reduce the number of warps that each core must support to hide long execution latencies, it is helpful to be able to issue a subsequent instruction from a warp while earlier instructions
have not yet completed.
- Specifically, a one-loop microarchitecture does not know whether the next instruction to issue for a warp has a dependency upon an earlier instruction that has not yet completed execution. To provide such dependency information it is necessary to first fetch the instruction from memory so as to determine what data and/or structural hazards exist.
- For this purpose, GPUs implement an instruction buffer where instructions are placed after cache access. A separate scheduler is used to decide which of several instructions in the instruction buffer should be issued next to the rest of the pipeline.
- After a cache hit or a fill from a cache miss, the instruction information is placed into the instruction buffer. The organization of the instruction buffer can take many forms. One particularly straightforward approach is to have storage for one or more instructions per warp.
- There are two traditional approaches to detecting dependencies between instructions found in traditional CPU architectures: a scoreboard and reservation stations.
- Reservation stations are used for eliminating name dependencies and introduce the need for associative logic that is expensive in terms of area and energy.
- Scoreboards can be designed to support either in-order execution or out-of-order execution. Scoreboards supporting out-of-order execution, like that used in the CDC 6600, are also fairly complex. On the other hand, the scoreboard for a single threaded in-order CPU is very simple: each register is represented in the scoreboard with a single bit that is set whenever an instruction issues that will write to that register. Any instruction that wants to read or write to a register that has its corresponding bit set in the scoreboard is stalled until the bit is cleared by the instruction writing to the register.
- This prevents both read-after-write and write-after-write hazards.
- write-after-read hazards are prevented by in-order execution: operands are read in program order, before any later instruction can write them
- GPUs implement in-order scoreboards. However, as discussed next, there are challenges to using an in-order scoreboard when supporting multiple warps.
- The first concern with the simple in-order scoreboard design described above is the very large number of registers contained in modern GPUs. With up to 128 registers per warp and up to 64 warps per core a total of 8192 bits per core is required to implement the scoreboard.
- Another concern with the simple in-order scoreboard design described above is that an instruction that encounters a dependency must repeatedly lookup its operands in the scoreboard until the prior instruction it depends upon writes its results to the register file.
- If all such instructions must probe the scoreboard additional read ports are required. Recent GPUs support up to 64 warps per core and with up to 4 operands allowing all warps to probe the scoreboard every cycle would require 256 read ports.
- an alternative design (sketched in code after the example below) is:
- for each warp, there are a few entries (3 or 4); each entry holds the identifier of a register that will be written by an instruction that has been issued but has not yet completed execution
- the scoreboard is accessed when an instruction is placed into the instruction buffer and when an instruction writes its results into the register file
- when an instruction is placed into the instruction buffer, the scoreboard entries for the corresponding warp are compared against that instruction’s source and destination registers. This results in a short bit vector, with one bit for each entry in the scoreboard for that warp (e.g., 3 or 4 bits). A bit is set if the corresponding entry in the scoreboard matches any of the operands of the instruction. This bit vector is then stored alongside the instruction in the instruction buffer.
- an instruction is not eligible to be considered by the instruction scheduler until all of its bits are cleared, which can be determined by feeding each bit of the vector into a NOR gate. Dependency bits in the instruction buffer are cleared as instructions write their results to the register file.
- if all entries are used up for a given warp, then either fetch stalls for all warps or the instruction is discarded and must be fetched again.
- when an instruction that has executed is ready to write to the register file, it clears the entry that was allocated to it in the scoreboard and also clears the corresponding dependency bit for any instructions from the same warp that are stored in the instruction buffer.
- A possible example
- instr_0 writes reg1 and instr_1 writes reg3; both have been issued to the execution units
- scoreboard entries for this warp
- entry0: reg1
- entry1: reg3
- instr_2 reads reg3, so its dependency bit vector is [1, 0]
- bit[1] corresponds to scoreboard entry1, bit[0] corresponds to scoreboard entry0
- instr_3 reads reg1 and reg3, so its bit vector is [1, 1]
- when instr_1 writes reg3 back, entry1 is cleared along with bit[1] of instr_2 and instr_3, so instr_2 can be issued; once instr_0 also writes reg1 back, entry0 and instr_3’s bit[0] are cleared and instr_3 can be issued as well
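A host-side sketch of this per-warp scoreboard with dependency bit vectors, assuming 3 entries per warp; the names and interface are illustrative:

```cuda
#include <cstdint>

// Per-warp scoreboard: a few entries holding destination-register identifiers
// of in-flight instructions, plus a short dependency bit vector carried with
// each instruction in the instruction buffer.
struct WarpScoreboard {
    static const int kEntries = 3;
    int  reg[kEntries]  = {};   // register id written by an in-flight instruction
    bool used[kEntries] = {};

    // On issue: reserve an entry for the destination register. Returns false
    // if all entries are in use (fetch must stall or the instruction is refetched).
    bool reserve(int dest_reg) {
        for (int e = 0; e < kEntries; ++e)
            if (!used[e]) { used[e] = true; reg[e] = dest_reg; return true; }
        return false;
    }

    // When the instruction is placed in the instruction buffer: build its
    // dependency bit vector by comparing its operands against the entries.
    uint32_t dependency_bits(const int *operands, int num_operands) const {
        uint32_t bits = 0;
        for (int e = 0; e < kEntries; ++e)
            if (used[e])
                for (int i = 0; i < num_operands; ++i)
                    if (operands[i] == reg[e]) bits |= 1u << e;
        return bits;   // eligible to issue only when all bits are cleared
    }

    // On writeback: free the entry and return the bit that must be cleared in
    // the dependency vectors of waiting instructions from this warp.
    uint32_t release(int dest_reg) {
        for (int e = 0; e < kEntries; ++e)
            if (used[e] && reg[e] == dest_reg) { used[e] = false; return 1u << e; }
        return 0;
    }
};
```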
Summary
In the two-loop architecture,
- the first loop selects a warp that has space in the instruction buffer, looks up its program counter and performs an instruction cache access to obtain the next instruction
- the second loop selects an instruction in the instruction buffer that has no outstanding dependencies and issues it to the execution units.
3.3 Three-Loop Approximation
To hide long memory latencies, it is necessary to support many warps per core and to switch between them cycle by cycle
- so the core must have a large register file that contains separate physical registers for every warp that is executing
- E.g. 256 KB on recent GPU architectures from NVIDIA (e.g., Kepler, Maxwell, and Pascal architectures)
- Now, the area of an SRAM memory is proportional to the number of ports. A naive implementation of a register file requires one port per operand per instruction issued per cycle.
- One way to reduce the area of the register file is to simulate the large number of ports using multiple banks of single-ported memories.
- An operand collector is used to achieve this in a more transparent way (banking is not exposed to the ISA). The operand collector effectively forms a third scheduling loop, as described below.
The naïve example of banked register file
- four banks, each bank is single port
- warp0, r0 is stored in Bank 0, and marked as “w0:r0”
- In the timing diagram
- T1
- w3:i1 can read all of its source registers without a bank conflict
- T2
- w0:i2 can read only one source operand because two of its operands conflict in the same bank
- T3
- w0:i2 reads another source operand, and w3:i1 writes to its destination register
- T4
- w1:i2 reads one source operand
- T5
- w1:i2 cannot read its other source operand because w0:i2 is writing to its destination register in that bank
- T6
- w1:i2 reads its other source operand
- So, it takes six cycles for these three instructions to finish reading their source registers


3.3.1 OPERAND COLLECTOR


Operand collector
- staging registers have been replaced with collector units
- there are multiple collector units so that multiple instructions can overlap reading of source operands
- which can help improve throughput in the presence of bank conflicts between the source operands of different instructions
- each collector unit contains buffering space for all source operands required to execute an instruction
- given the larger number of source operands buffered for multiple instructions, the arbiter is more likely to achieve increased bank-level parallelism, allowing multiple register file banks to be accessed in parallel
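A rough sketch of the bookkeeping inside one collector unit, assuming up to three source operands per instruction; the interface is invented for illustration, not taken from any real design:

```cuda
// One operand collector unit: it buffers an instruction and marks each source
// operand ready as the arbiter grants its register-bank read.
struct CollectorUnit {
    static const int kMaxSrc = 3;
    bool busy = false;          // holds an instruction waiting for operands
    int  warp_id = -1;
    int  src_reg[kMaxSrc] = {}; // source register identifiers
    bool ready[kMaxSrc]   = {}; // operand value has been read from its bank
    int  num_src = 0;

    void allocate(int warp, const int *regs, int n) {
        busy = true; warp_id = warp; num_src = n;
        for (int i = 0; i < n; ++i) { src_reg[i] = regs[i]; ready[i] = false; }
    }

    // Called by the arbiter when the bank read for operand i completes.
    void operand_arrived(int i) { ready[i] = true; }

    // The instruction can be dispatched to the execution units only once all
    // of its source operands have been collected.
    bool all_ready() const {
        for (int i = 0; i < num_src; ++i)
            if (!ready[i]) return false;
        return busy;
    }
};
```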
Swizzled banked register file
- allocate equivalent registers from different warps in different banks
- register r0 for warp 0 is allocated to bank 0, but register r0 for warp 1 is allocated to bank 1
- This does not address bank conflicts between register operands of a single instruction
- But it does help in reducing bank conflicts between instructions from different warps
- In particular, whenever warps are making relatively even progress


For naïve register file:

| Issue cycle | Warp | Instruction (operand → bank) |
| --- | --- | --- |
| T0 | w1 | i1: add r1(1), r2(2), r5(1) |
| T1 | w2 | i1: add r1(1), r2(2), r5(1) |
| T2 | w3 | i1: add r1(1), r2(2), r5(1) |
| T3 | w0 | i2: mad r4(0), r3(3), r7(3), r1(1) |

| | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bank 0 | | | | | | | | w0:r4 |
| Bank 1 | w1:r5 | w2:r5 | w1:r1 | w2:r1 | w3:r5 | w0:r1 | w3:r1 | |
| Bank 2 | w1:r2 | w2:r2 | w3:r2 | | | | | |
| Bank 3 | | | | | w0:r7 | w0:r3 | | |
For swizzled register file (another scheduling scheme, two fewer cycles than the original in the book❓):

| | T1 | T2 | T3 | T4 | T5 | T6 |
| --- | --- | --- | --- | --- | --- | --- |
| Bank 0 | | w2:r2 | w3:r5 | | w3:r1 | w0:r4 |
| Bank 1 | | | w3:r2 | w0:r1 | | |
| Bank 2 | w1:r5 | | w1:r1 | | | |
| Bank 3 | w1:r2 | w2:r5 | w0:r7 | w2:r1 | w0:r3 | |
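The two layouts differ only in the register-to-bank mapping. The following mapping, assuming four banks, is consistent with the bank assignments shown in the tables above:

```cuda
// Naive layout: a register always lives in the same bank regardless of warp,
// so equivalent registers of different warps collide with each other.
int naive_bank(int /*warp*/, int reg) { return reg % 4; }

// Swizzled layout: shift the mapping by the warp id, so r0 of warp 0 sits in
// bank 0 while r0 of warp 1 sits in bank 1. This matches the w:r placements
// in the swizzled table above.
int swizzled_bank(int warp, int reg)  { return (reg + warp) % 4; }
```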
WAR Hazard
- the operand collector does not impose any order on when instructions read their operands
- if a younger and an older instruction have a WAR hazard, the older instruction may be stuck collecting a source operand due to bank conflicts while the younger instruction (which writes that older instruction’s source register) gets its operands ready earlier and writes back first
- three proposals
- release-on-commit warpboard: allow at most one instruction per warp to be executing ⇒ impacts performance negatively, by up to a factor of two in some cases
- release-on-read warpboard: allow only one instruction at a time per warp to be collecting operands in the operand collector ⇒ at most about a 10% negative performance impact
- bloomboard: use a small Bloom filter to track outstanding register reads ⇒ impacts performance by less than a few percent
- NVIDIA
- may use a “read dependency barrier” managed by special “control instructions”
Summary
- The operand collector can read source registers for multiple instructions per cycle
- WAR hazards need to be handled in the operand collector
- A swizzled register file can reduce bank conflicts if different warps are making relatively even progress
3.3.2 INSTRUCTION REPLAY: HANDLING STRUCTURAL HAZARDS
Sources of structural hazard
- the register read stage may run out of operand collector units
- a single memory instruction executed by a warp may need to be broken down into multiple separate operations. Each of these separate operations may fully utilize a portion of the pipeline on a given cycle.
For a GPU, stalling when a structural hazard happens has drawbacks:
- First, given the large size of the register file along with the many pipeline stages required to support a full graphics pipeline, distributing a stall signal may impact the critical path. Pipelining the stall-cycle distribution leads to the need to introduce additional buffering, increasing area.
- Second, stalling an instruction from one warp may cause instructions from other warps to stall behind it. If those instructions do not require the resource required by the instruction that caused the stall, throughput may suffer.
Not stall, but replay
- Instruction replay is found in some CPU designs, where it is used as a recovery mechanism when speculatively scheduling an instruction that depends upon an earlier instruction with variable latency.
- e.g., speculatively waking up instructions that depend upon a load so as to improve single-threaded performance
- In contrast, GPUs avoid speculation as it tends to waste energy and reduce throughput. Instead, instruction replay is used in GPUs to avoid clogging the pipeline and the circuit area and/or timing overheads resulting from stalling.
- To implement instruction replay a GPU can hold instructions in the instruction buffer either until it is known that they have completed or all individual portions of the instruction have executed.
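A rough sketch of that replay policy: the instruction keeps its instruction-buffer slot and its unfinished portions are reissued until everything has completed. The portion-tracking interface here is an assumption for illustration only:

```cuda
#include <cstdint>

// A buffered instruction split into several portions (e.g., one per cache
// line touched by a memory instruction). Instead of stalling the pipeline on
// a structural hazard, unfinished portions are simply reissued later.
struct BufferedInstr {
    uint32_t pending_portions;  // bitmask of portions not yet completed

    // Attempt execution; try_portion() reports which of the pending portions
    // succeeded this cycle (some may fail because of structural hazards).
    // Returns true once everything is done and the I-buffer slot can be freed.
    bool issue(uint32_t (*try_portion)(uint32_t pending)) {
        pending_portions &= ~try_portion(pending_portions);
        return pending_portions == 0;
    }
};
```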