CS448 Final Exam Topics
The exam is open book and open notes. You may use a computer, but no online activity is allowed, and you may not compile or run CUDA programs.
Multiprocessing
- Taxonomy of parallel architectures
- SISD, SIMD, MISD, MIMD
- Centralized shared memory (CSM) vs. distributed shared memory (DSM)
- Typical interconnect with DSM (e.g. mesh, bus, hypercube)
- Concept of overhead in communications, splitting/merging work done by individual processors
- Cache Coherence
- What the problem is
- Snooping-based solution
- Understand the write-invalidate cache-coherence protocol for a write-back cache (diagram from lecture/homework)
- Directory-based solution
- How same states could be implemented via messages and a directory
- Home, local, remote data access
- Synchronization
- Criteria for critical sections
- Mutual exclusion, progress, bounded waiting
- Problem with alternation
- Peterson's Algorithm for synchronization
- Uninterruptible hardware instructions for synchronization
- Exchange
- Test and set
- Fetch and increment
- Spin lock
- Barrier synchronization
- Problems and how to scale up to larger number of processors
CUDA
- General concepts of the host, compute device, kernel
- Differences from a CPU, where one is preferable to the other
- Architecture of the Tesla or G80
- Streaming Processor, Streaming Multiprocessor, Building Block, GPU, Global memory, thread manager
- Programming Model
- Blocks and Threads, grid of blocks/threads (dim3)
- Limits on number of blocks (65535 per grid dimension) and threads (512 per block)
- Kernel
- __device__, __host__, __global__
- Using threadIdx.x and blockIdx.x (or in 2 dimensions)
- Allocation, copying back and forth from the host to GPU
- Be prepared to read/write code in C
- Mapping blocks/threads to physical SMs
- Maximizing parallelism for the hardware, number of supported threads
- 8 blocks per SM, 768 threads per SM, blocks executed as 32-thread warps
- Mapping the kernel block/thread dimensions to match the hardware
- Warp thread scheduling
- Programming
- Mapping 2D arrays to 1D array and vice versa
- Examples we did: matrix multiply, Julia fractal, ray tracer
- Threads operating on multiple pieces of input data, e.g. necessary with extremely long vectors
- Assigning range of values
- Using a stride length equal to the total number of threads
- Shared memory
- Using __syncthreads
- Limitation that shared memory is shared only within a block
- Reduction algorithm
- Constant memory
- Atomics
- atomicAdd, atomicInc
- atomics on shared memory variables
- Constraints/using multiple GPUs
Student Presentations
- Any general concepts from the student presentations may also be on the exam.
Questions will not be of fine detail, but general ideas.