5/7/2023

In 2019, AMD moved off their long-serving GCN architecture in favor of RDNA. RDNA 2 takes that foundation and scales it up while adding raytracing support and a few other enhancements. We'll cover the first generation of RDNA some other time. We already covered compute aspects of the RDNA 2 architecture in a couple of other articles, in order to make comparisons with other architectures. So, I figured we could do something fun and look at some games from RDNA 2's perspective.

Architecture

As its name implies, RDNA 2 builds on top of the RDNA 1 architecture. AMD made a number of changes to improve efficiency and keep hardware capabilities up to date, but the basic WGP architecture remains in place. Each WGP, or workgroup processor, features four SIMDs. Each SIMD has 32-wide execution units for the most common operations, a 128 KB vector register file, and can track up to 16 wavefronts. RDNA 2 also gets a few extra instructions for dot product operations to help accelerate machine learning. For example, V_DOT2_F32_F16 multiplies pairs of FP16 values, adds them, and adds an FP32 accumulator. It doesn't go as far as tensor cores on Nvidia, where instructions like HMMA directly deal with 8×8 matrices. But those instructions let RDNA 2 do matrix multiplication with fewer instructions than if it had to use plain fused multiply-add instructions.

[Figure: Basic sketch of the RDNA 2 architecture's WGP, and Nvidia Ampere's SM]

GPUs don't do out-of-order execution the way high-performance CPUs do. Instead, they keep a lot of threads in flight, and switch between threads to keep the execution units occupied and hide latency. On RDNA 2, a SIMD basically has 16-way SMT, versus 20-way on RDNA 1. That might sound like a regression, but tracking more wavefronts (analogous to CPU threads) is probably expensive. Thread or wavefront selection logic has to solve a very similar problem to CPU schedulers: every cycle, every entry must be checked to see if it's ready for execution. AMD therefore reduced the number of wavefronts RDNA 2 can track from 20 to 16, probably to cut the number of checks per cycle and hit higher clock speeds at lower power. RDNA 2 clocks much higher than its predecessor on the same process node, so AMD did a good job there.

Even though both architectures use basic building blocks (SMs or WGPs) that can do 128 FP32 operations per cycle, an RDNA 2 WGP can keep 64 wavefronts in flight, while an Ampere SM can only keep 48 warps in flight. RDNA 2 also has more vector register file capacity, meaning the compiler can keep more data in registers without reducing occupancy. That gives an RDNA 2 WGP a better chance of hiding latency, by keeping more work in flight. Combine that with better caching, and each RDNA 2 WGP should be able to punch harder than an Ampere SM.

A WGP's four SIMDs are organized into groups of two, which AMD calls compute units (CUs). A CU has its own memory pipeline and a 16 KB L0 vector cache. At the CU level, AMD augmented the memory pipeline to add hardware raytracing acceleration.
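To make the V_DOT2_F32_F16 behavior concrete, here is a minimal Python model of the operation. The function name and the `(lo, hi)` pair representation are illustrative choices, not the real encoding; only the multiply-two-FP16-pairs, sum, and add-an-FP32-accumulator behavior comes from the description above, and the real instruction's rounding details are not modeled.

```python
import struct

def fp16(x):
    # Round a Python float to the nearest FP16 value, modeling FP16 storage.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def v_dot2_f32_f16(a, b, c):
    """Sketch of V_DOT2_F32_F16 semantics: multiply two pairs of FP16
    values, sum the two products, and add an FP32 accumulator.

    a, b: (lo, hi) pairs of FP16 values; c: FP32 accumulator.
    """
    return fp16(a[0]) * fp16(b[0]) + fp16(a[1]) * fp16(b[1]) + c
```

Because one instruction performs two multiplies and the accumulation, an FP16 dot product of length N takes roughly N/2 such instructions instead of N plain fused multiply-adds, which is the instruction-count saving described above for matrix multiplication.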
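The per-cycle wavefront-selection cost argument can be sketched in a few lines. This is a hypothetical model, not AMD's actual selection logic: each cycle, the scheduler scans every tracked slot for a ready wavefront, so the checking work per cycle scales with the slot count (16 on RDNA 2 versus 20 on RDNA 1).

```python
def pick_wavefront(slots):
    """Per-cycle wavefront selection sketch: scan all tracked slots and
    return the index of the first ready wavefront, or None on a stall.

    Real hardware applies age/priority rules; 'first ready' is a stand-in.
    Each slot is None (empty) or a dict with a 'ready' flag.
    """
    for i, wavefront in enumerate(slots):
        # Every slot is examined every cycle -- this loop is the per-cycle
        # checking cost that shrinks when the slot count drops from 20 to 16.
        if wavefront is not None and wavefront.get("ready"):
            return i
    return None  # no wavefront ready: the SIMD idles this cycle
```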
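The claim that a bigger vector register file preserves occupancy can be illustrated with a back-of-the-envelope calculation. This sketch assumes wave32 wavefronts, 4-byte lanes, and per-register allocation with no granularity rules (AMD's real allocator is more involved), using the 128 KB per-SIMD register file and 16-wavefront limit quoted above.

```python
def waves_in_flight(vgprs_per_wave, rf_bytes=128 * 1024,
                    lanes=32, bytes_per_lane=4, max_waves=16):
    # One architectural vector register spans all lanes of a wave32
    # wavefront: 32 lanes x 4 bytes = 128 bytes per register per wave.
    bytes_per_reg = lanes * bytes_per_lane
    total_regs = rf_bytes // bytes_per_reg  # 1024 registers per SIMD
    # Occupancy is capped by whichever runs out first: the register
    # budget or the SIMD's wavefront-tracking slots.
    return min(max_waves, total_regs // vgprs_per_wave)
```

With this model, a shader using 64 registers per wavefront still reaches the full 16 wavefronts (1024 / 64), while one using 128 registers drops to 8 in flight — the kind of occupancy loss that a larger register file pushes further out.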