CS448 Final Exam Topics
The exam is open book and open notes. You may use a computer, but no online activity is allowed, and you may not compile or run CUDA programs.
Multiprocessing
- Taxonomy of parallel architectures
- SISD, SIMD, MISD, MIMD
- Centralized shared memory (CSM) vs. distributed shared memory (DSM)
- Typical interconnect with DSM (e.g. mesh, bus, hypercube)
- Concept of overhead in communications, splitting/merging work done by individual processors
- Cache Coherence
- What the problem is
- Snooping-based solution
- Understand the write-invalidate cache-coherence protocol for a write-back cache (diagram from lecture/homework)
- Directory-based solution
- How same states could be implemented via messages and a directory
- Home, local, remote data access
- Synchronization
- Criteria for critical sections
- Mutual exclusion, progress, bounded waiting
- Problem with alternation
- Peterson's Algorithm for synchronization
- Uninterruptible hardware instructions for synchronization
- Exchange
- Test and set
- Fetch and increment
- Spin lock
- Barrier synchronization
- Problems and how to scale up to larger number of processors
CUDA
- General concepts of the host, compute device, kernel
- Differences from a CPU, where one is preferable to the other
- Architecture of the Tesla or G80
- Streaming Processor, Streaming Multiprocessor, Building Block, GPU, Global memory, thread manager
- Programming Model
- Blocks and Threads, grid of blocks/threads (dim3)
- Limits on number of blocks (65535 per grid dimension) and threads (512 per block)
- Kernel
- __device__, __host__, __global__
- Using threadIdx.x and blockIdx.x (or in 2 dimensions)
- Allocation, copying back and forth from the host to GPU
- Be prepared to read/write code in C
- Mapping blocks/threads to physical SMs
- Maximizing parallelism for the hardware, number of supported threads
- 8 blocks per SM, 768 threads per SM, blocks executed as 32-thread warps
- Mapping the kernel block/thread dimensions to match the hardware
- Warp thread scheduling
- Programming
- Mapping 2D arrays to 1D array and vice versa
- Examples we did: matrix multiply, Julia fractal, ray tracer
- Threads operating on multiple pieces of input data, e.g. necessary with extremely long vectors
- Assigning range of values
- Using a stride length equal to the total number of threads
- Shared memory
- Using __syncthreads
- Limitation that shared memory is shared only within a block
- Reduction algorithm
- Constant memory
- Atomics
- atomicAdd, atomicInc
- atomics on shared memory variables
- Constraints/using multiple GPUs
Student Presentations
- Any general concepts from the student presentations may also be on the exam.
Questions will not be of fine detail, but general ideas.