Tag Archives: hbm

Nvidia unveils Pascal specifics: up to 16GB of VRAM, 1TB/s of bandwidth

Nvidia unveiled bits and pieces of its Pascal architecture back in March, and the company has now shared some additional details at its GTC Japan technology conference. Like AMD’s Fury X, Pascal will move away from GDDR5. It will adopt the next-generation HBM2 memory standard, be built on a 16nm FinFET process at TSMC, and offer up to 16GB of memory. AMD and Nvidia are both expected to adopt HBM2 in 2016, but this will be Nvidia’s first product to use stacked memory, while AMD has prior experience with first-generation HBM thanks to the Fury lineup.

HBM vs. HBM2

HBM and HBM2 are based on the same core technology, but HBM2 doubles the effective speed per pin and introduces some new low-level features, as shown below. Memory density is also expected to improve, from 2Gb per DRAM die (8Gb per four-die stack) to 8Gb per DRAM die (32Gb per four-die stack).

sk_hynix_hbm_dram_2

Nvidia’s quoted 16GB of memory assumes four stacks, each made of four 8Gb dies on top of each other. That’s the same basic configuration that Fury X used, though the higher-density DRAM means the hypothetical top-end Pascal will have four times as much memory as the Fury X. We would be surprised, however, if Nvidia pushes that 16GB configuration below its top-end consumer card. In our examination of 4GB VRAM limits earlier this year, we found that the vast majority of games do not stress a 4GB VRAM buffer. Of the handful of titles that do use more than 4GB, none were found to exceed the 6GB limit on the GTX 980 Ti while maintaining anything approaching a playable frame rate. Consumers simply don’t have much to worry about on this front.
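The capacity figures above follow from simple multiplication. Here is a minimal sketch of that arithmetic in Python, using only the stack counts and die densities quoted in the article; the function and variable names are our own.

```python
# Capacity arithmetic for stacked HBM, using the figures quoted in the article.
GBIT_PER_GBYTE = 8  # 8 gigabits per gigabyte

def total_capacity_gb(stacks: int, dies_per_stack: int, gbit_per_die: int) -> float:
    """Return total memory in gigabytes for a given stack layout."""
    return stacks * dies_per_stack * gbit_per_die / GBIT_PER_GBYTE

# Fury X: four stacks of four 2Gb dies. Top-end Pascal: four stacks of four 8Gb dies.
fury_x = total_capacity_gb(stacks=4, dies_per_stack=4, gbit_per_die=2)
pascal = total_capacity_gb(stacks=4, dies_per_stack=4, gbit_per_die=8)

print(f"Fury X (HBM, 2Gb dies):  {fury_x:.0f}GB")   # 4GB
print(f"Pascal (HBM2, 8Gb dies): {pascal:.0f}GB")   # 16GB
print(f"Ratio: {pascal / fury_x:.0f}x")             # 4x
```

Only the per-die density changes between the two cases, which is why the layout can stay the same while capacity quadruples.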

The other tidbit coming out of GTC Japan is that Nvidia will target 1TB/s of total bandwidth. That’s 2x what Fury X offers, a meteoric increase in a short time. Both AMD and Nvidia are claiming that HBM2 and 14/16nm process technology will give them a 2x performance-per-watt improvement.
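The 1TB/s target can be sanity-checked with simple arithmetic. The sketch below assumes the published HBM interface parameters, which the article does not spell out: a 1024-bit bus per stack, 1Gbps per pin for first-generation HBM, and 2Gbps per pin for HBM2 (the per-pin doubling mentioned earlier). The four-stack layout for Pascal is inferred from the 16GB configuration, and the function name is illustrative.

```python
# Peak-bandwidth arithmetic for HBM and HBM2.
# Assumed (not stated in the article): 1024-bit interface per stack,
# 1 Gbps/pin for HBM and 2 Gbps/pin for HBM2.
BITS_PER_BYTE = 8

def peak_bandwidth_gbs(stacks: int, bus_width_bits: int, gbps_per_pin: float) -> float:
    """Return theoretical peak bandwidth in GB/s across all stacks."""
    return stacks * bus_width_bits * gbps_per_pin / BITS_PER_BYTE

hbm1 = peak_bandwidth_gbs(stacks=4, bus_width_bits=1024, gbps_per_pin=1.0)
hbm2 = peak_bandwidth_gbs(stacks=4, bus_width_bits=1024, gbps_per_pin=2.0)

print(f"Four HBM stacks:  {hbm1:.0f}GB/s")   # 512GB/s, matching Fury X
print(f"Four HBM2 stacks: {hbm2:.0f}GB/s")   # 1024GB/s, roughly 1TB/s
```

These are theoretical peaks; sustained bandwidth in real workloads would be lower.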

Historically, AMD has led Nvidia when it comes to adopting new memory technologies. AMD was the only company to adopt GDDR4 and the first manufacturer to use GDDR5: the Radeon HD 4870 debuted with GDDR5 in June 2008, while Nvidia didn’t push the new standard on high-end cards until Fermi in 2010. AMD has argued that its expertise with HBM made implementing HBM2 easier, and some sites have reported rumors that the company has preferential access to Hynix’s HBM2 supply. Given that Hynix isn’t the only company building HBM2, however, this may or may not translate into any kind of advantage.

HBM2 production roadmap

With Teams Red and Green both moving to HBM2 next year, and both apparently aiming for the same bandwidth and memory capacity, I suspect that the performance crown next year won’t be decided by the memory subsystem. Games inevitably evolve to take advantage of next-gen hardware, but the 1TB/s capability that Nvidia is talking up won’t be a widespread feature, especially if both companies stick to GDDR5 for entry and midrange products. One of the facets of HBM/HBM2 is that its advantages are more pronounced the more RAM you’re putting on a card and the larger the GPU is. We can bet that AMD and Nvidia will introduce ultra-high-end and high-end cards that use HBM2, but midrange cards in the 2-4GB range could stick with GDDR5 for another product cycle.

The big questions will be which company's architecture exploits its bandwidth more effectively, and whether AMD can finally deliver a new core architecture that leaps past the incremental improvements that GCN 1.1 and 1.2 offered over the original GCN 1.0 architecture, which is now nearly four years old. Rumors abound on what kind of architecture that will be, but I’m inclined to think it’ll be an evolution of GCN rather than a wholesale replacement. Both AMD and Nvidia have moved towards evolutionary advances rather than radical architecture swaps, and there’s enough low-hanging fruit in GCN that AMD could substantially improve performance without reinventing the wheel.

Neither AMD nor Nvidia has announced a launch date, but we anticipate seeing hardware from both in late Q1 / early Q2 of 2016.

Deep dive: Hynix’s High Bandwidth Memory

We’ve discussed the capabilities and performance of HBM (High Bandwidth Memory) multiple times over the past six months, but a new report sheds light on the physical architecture and construction of HBM. This new memory technology is viewed as the future of GPU memory. Nvidia will debut its own Pascal architecture in 2016 with HBM2, while AMD launched its own HBM-equipped GPUs, the Radeon Fury X and Radeon Fury, earlier this summer.

The full report by Tech Insights is paywalled, but the company shared a number of slides and details with EETimes. The HBM assembly that AMD and Hynix jointly designed is genuinely new compared to other products on the market. Samsung has used TSVs (through-silicon vias) for wiring DRAM together before, but no one has ever built a wide I/O design like this in a commercial product.

Interposer and DRAM

The image above shows the substrate, the interposer layer (manufactured by UMC on a 65nm process), and the stacked DRAM. The TSVs aren’t visible in this shot, but can be seen in the image below. The report also details the process Hynix used to manufacture the TSVs. One thing the authors note is that while they expected to see “scallops” in the images (scallops are ridges formed in the sidewall during the etching process), Hynix apparently did an excellent job avoiding the problem. Hynix, the authors conclude, “has got a great etch recipe.”

TSVs and DRAM die

The arrangement of the dies on the stack suggests that the first three DRAM dies were diced (cut from the wafer) as a group, while the top DRAM chip was cut separately, tested, and then attached to the stack. The entire four-die stack would then have been attached to the logic die. The advantage of this kind of configuration is that it offers Hynix ample opportunity to confirm that it’s building good die before attaching them in the final product.

TSVs

One piece of evidence in favor of this extensive test cycle is the sheer number of TSVs built into each DRAM. Tech Insights reports that there are nearly 2100 TSV pads on each DRAM die (one cross-section sample is shown below). In addition to being used for data, I/O, power, and redundancy, a significant percentage are apparently used to test the TSVs themselves. This fine-grained error control allows Hynix to determine exactly which TSVs aren’t meeting expectations and substitute one of the redundant TSVs where needed.
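To make the idea of substituting redundant TSVs concrete, here is a purely illustrative sketch. The report excerpt does not describe Hynix's actual repair scheme, so everything below is an assumption: the function `repair_tsvs`, the dictionary of test results, and the flat pool of spares are all hypothetical, and a real design would likely restrict which spare can stand in for which signal.

```python
# Hypothetical model of TSV repair: map each failed TSV to a spare.
# This is NOT Hynix's method; it only illustrates the general idea.

def repair_tsvs(test_results: dict[int, bool], spares: list[int]) -> dict[int, int]:
    """Return a mapping of failed TSV id -> spare TSV id.

    test_results maps a TSV id to True (passed) or False (failed).
    spares is a list of spare TSV ids, assumed to have tested good.
    Raises ValueError if there are more failures than spares.
    """
    failed = [tsv for tsv, passed in sorted(test_results.items()) if not passed]
    if len(failed) > len(spares):
        raise ValueError("not enough redundant TSVs; the stack cannot be repaired")
    # Pair failures with spares in order; unused spares are simply left over.
    return dict(zip(failed, spares))

# Example: TSVs 1 and 3 fail their test, and three spares are available.
results = {0: True, 1: False, 2: True, 3: False}
remap = repair_tsvs(results, spares=[100, 101, 102])
print(remap)  # {1: 100, 3: 101}
```

The point of the sketch is that a stack with a few bad links can still ship, as long as failures can be located individually and enough spares exist.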

Why the fine details matter

Ever since AMD announced it would launch HBM, there have been rumors that HBM was either exceedingly expensive, yielding badly, or both. The Tech Insights excerpt doesn’t directly address either of these claims, but it does offer up some indirect evidence. Hynix has built a testing system that allows it to test for bad die at every level. It can test the stack of three ICs, it can test the top-level DRAM before mounting it, and it can test the TSVs after mounting. If a bad link is found, Hynix can switch to a redundant TSV rather than tossing out the entire die stack.

The value of being able to test the product at multiple stages can’t be overstated. Some of you may remember Rambus and its ill-fated attempt to conquer the DRAM market in the late 1990s and early 2000s. Rambus DIMMs were extremely expensive when they launched, and there were some conspiratorial whispers alleging either that Intel and Rambus were artificially inflating the price, or that the DRAM manufacturers were deliberately trying to cripple the product.

While the entire RDRAM situation was heavily political, one contact we spoke to at a memory company that was fully on board with the RDRAM shift told us that no, there were real problems that crippled RDRAM yields. One of the most fundamental was that there was no way to test whether an individual RDRAM chip was good or not before mounting it in a series to make a RIMM module. If the module didn’t test perfectly, it had to be disassembled and its chips swapped out, piece by piece, until a faulty IC was found. Since it was possible to have more than one faulty IC at a time, this step had to be performed using a “known good” set of chips until each RIMM was “known good.” Combined with the low yields that are typical for any ramping memory, this inability to test individual components contributed substantially to RDRAM’s high prices when it first launched.
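A rough calculation shows why untested components hurt so much. The sketch below assumes each chip is independently good with some probability and that a module works only if every chip on it is good. The 95% chip yield and the chip counts are hypothetical numbers chosen for illustration, not figures from our contact or from Rambus.

```python
# Compound yield: a module built from untested chips works only if all are good.
# The chip yield and chip counts below are hypothetical illustration values.

def module_yield(chip_yield: float, chips_per_module: int) -> float:
    """Probability that every chip on a module is good, assuming independence."""
    return chip_yield ** chips_per_module

for chips in (8, 16):
    first_pass = module_yield(chip_yield=0.95, chips_per_module=chips)
    print(f"{chips} untested chips at 95% yield: "
          f"{first_pass:.1%} of modules pass on the first try")
```

With these assumed numbers, about a third of 8-chip modules and more than half of 16-chip modules would need rework. Being able to test each part before assembly, as Hynix can with its DRAM dies, removes most of that penalty.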

By all accounts, Hynix hasn’t just rolled out a new solution by the skin of its teeth; it has built a scalable design that bodes well for the future of the memory standard. The interposer is built on a dirt-cheap 65nm process, and we already know HBM2 is ramping.