
Asynchronous compute, AMD, Nvidia, and DX12: What we know so far

Ever since DirectX 12 was announced, AMD and Nvidia have jockeyed for position regarding which of them would offer better support for the new API and its various features. One capability that AMD has talked up extensively is GCN’s support for asynchronous compute. Asynchronous compute allows all GPUs based on AMD’s GCN architecture to perform graphics and compute workloads simultaneously. Last week, an Oxide Games employee reported that contrary to general belief, Nvidia hardware couldn’t perform asynchronous computing and that the performance impact of attempting to do so was disastrous on the company’s hardware.

This announcement kicked off a flurry of research into what Nvidia hardware did and did not support, as well as anecdotal claims that people would (or already did) return their GTX 980 Ti’s based on Ashes of the Singularity performance. We’ve spent the last few days in conversation with various sources working on the problem, including Mahigan and CrazyElf at Overclock.net, as well as parsing through various data sets and performance reports. Nvidia has not responded to our request for clarification as of yet, but here’s the situation as we currently understand it.

Nvidia, AMD, and asynchronous compute

When AMD and Nvidia talk about supporting asynchronous compute, they aren’t talking about the same hardware capability. The Asynchronous Compute Engines (ACEs) in AMD’s GPUs (between two and eight, depending on which card you own) are capable of executing new workloads at latencies as low as a single cycle. A high-end AMD card has eight ACEs and each ACE has eight queues. Maxwell, in contrast, has two pipelines, one of which is a high-priority graphics pipeline. The other has a queue depth of 31, but Nvidia can’t switch contexts anywhere near as quickly as AMD can.


According to a talk given at GDC 2015, there are restrictions on Nvidia’s preemption capabilities. Additional text below the slide explains that “the GPU can only switch contexts at draw call boundaries” and “On future GPUs, we’re working to enable finer-grained preemption, but that’s still a long way off.” To explore the various capabilities of Maxwell and GCN, users at Beyond3D and Overclock.net have used an asynchronous compute test that evaluates the capability on both AMD and Nvidia hardware. The benchmark has been revised multiple times over the week, so early results aren’t comparable to the data we’ve seen in later runs.

Note that this is a test of asynchronous compute latency, not performance. It doesn’t measure overall throughput, only how long the work takes to execute, and it is designed to demonstrate whether asynchronous compute is occurring or not. Because this is a latency test, lower numbers (closer to the yellow “1” line) mean the results are closer to ideal.
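One way to read results from a test like this is to compare the time for graphics and compute submitted together against the time each takes alone. The sketch below assumes that interpretation. The function names and timings are hypothetical and are not taken from the Beyond3D benchmark, and the charts discussed here may be normalized differently.

```python
def normalized_latency(graphics_ms, compute_ms, combined_ms):
    """Combined time divided by the ideal, fully overlapped time.

    Assumption: with perfect overlap, running both workloads together
    takes only as long as the slower of the two, so 1.0 is ideal.
    """
    ideal_ms = max(graphics_ms, compute_ms)
    return combined_ms / ideal_ms


def looks_concurrent(graphics_ms, compute_ms, combined_ms):
    """True if the combined time is nearer the overlapped ideal than the serial sum."""
    ideal_ms = max(graphics_ms, compute_ms)
    serial_ms = graphics_ms + compute_ms
    return abs(combined_ms - ideal_ms) < abs(combined_ms - serial_ms)


# Hypothetical timings in milliseconds, for illustration only.
print(normalized_latency(20.0, 10.0, 21.0), looks_concurrent(20.0, 10.0, 21.0))
print(normalized_latency(20.0, 10.0, 30.0), looks_concurrent(20.0, 10.0, 30.0))
```

A combined time close to the slower workload's time suggests the two ran side by side. A combined time close to the sum suggests they ran one after the other.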

Radeon R9 290

Here’s the R9 290’s performance. The yellow line is perfection — that’s what we’d get if the GPU switched and executed instantaneously. The y-axis of the graph shows normalized performance to 1x, which is where we’d expect perfect asynchronous latency to be. The red line is what we are most interested in. It shows GCN performing nearly ideally in the majority of cases, holding performance steady even as thread counts rise. Now, compare this to Nvidia’s GTX 980 Ti.


Attempting to execute graphics and compute concurrently on the GTX 980 Ti causes dips and spikes in performance and little in the way of gains. Right now, there are only a few thread counts where Nvidia matches ideal performance (latency, in this case) and many cases where it doesn’t. Further investigation has indicated that Nvidia’s asynch pipeline appears to lean on the CPU for some of its initial steps, whereas AMD’s GCN handles the job in hardware.

Right now, the best available evidence suggests that when AMD and Nvidia talk about asynchronous compute, they are talking about two very different capabilities. “Asynchronous compute,” in fact, isn’t necessarily the best name for what’s happening here. The question is whether or not Nvidia GPUs can run graphics and compute workloads concurrently. AMD can, courtesy of its ACE units.

It’s been suggested that AMD’s approach is more like Hyper-Threading, which allows the GPU to work on disparate compute and graphics workloads simultaneously without a loss of performance, whereas Nvidia may be leaning on the CPU for some of its initial setup steps and attempting to schedule simultaneous compute + graphics workloads for ideal execution. Obviously that process isn’t working well yet. Since our initial article, Oxide has stated the following:

“We actually just chatted with Nvidia about Async Compute, indeed the driver hasn’t fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute.”

Here’s what that likely means, given Nvidia’s own presentations at GDC and the various test benchmarks that have been assembled over the past week. Maxwell does not have a GCN-style configuration of asynchronous compute engines and it cannot switch between graphics and compute workloads as quickly as GCN. According to Beyond3D user Ext3h:

“There were claims originally, that Nvidia GPUs wouldn’t even be able to execute async compute shaders in an async fashion at all, this myth was quickly debunked. What become clear, however, is that Nvidia GPUs preferred a much lighter load than AMD cards. At small loads, Nvidia GPUs would run circles around AMD cards. At high load, well, quite the opposite, up to the point where Nvidia GPUs took such a long time to process the workload that they triggered safeguards in Windows. Which caused Windows to pull the trigger and kill the driver, assuming that it got stuck.

“Final result (for now): AMD GPUs are capable of handling a much higher load. About 10x times what Nvidia GPUs can handle. But they also need also about 4x the pressure applied before they get to play out there capabilities.”

Ext3h goes on to say that preemption in Nvidia’s case is only used when switching between graphics contexts (1x graphics + 31 compute mode) and “pure compute context,” but claims that this functionality is “utterly broken” on Nvidia cards at present. He also states that while Maxwell 2 (GTX 900 family) is capable of parallel execution, “The hardware doesn’t profit from it much though, since it has only little ‘gaps’ in the shader utilization either way. So in the end, it’s still just sequential execution for most workload, even though if you did manage to stall the pipeline in some way by constructing an unfortunate workload, you could still profit from it.”

Nvidia, meanwhile, has represented to Oxide that it can implement asynchronous compute and that this capability was not fully enabled in drivers. Like Oxide, we’re going to wait and see how the situation develops. The analysis thread at Beyond3D makes it very clear that this is an incredibly complex question, and much of what Nvidia and Maxwell may or may not be doing is unclear.

Earlier, we mentioned that AMD’s approach to asynchronous computing superficially resembled Hyper-Threading. There’s another way in which that analogy may prove accurate: When Hyper-Threading debuted, many AMD fans asked why Team Red hadn’t copied the feature to boost performance on K7 and K8. AMD’s response at the time was that the K7 and K8 processors had much shorter pipelines and very different architectures, and were intrinsically less likely to benefit from Hyper-Threading as a result. The P4, in contrast, had a long pipeline and a relatively high stall rate. If one thread stalled, HT allowed another thread to continue executing, which boosted the chip’s overall performance.

GCN-style asynchronous computing is unlikely to boost Maxwell performance, in other words, because Maxwell isn’t really designed for these kinds of workloads. Whether Nvidia can work around that limitation (or implement something even faster) remains to be seen.
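The "gaps" argument can be made concrete with a toy model. This is a simplification for illustration, not a description of how either vendor's scheduler works: it assumes a fixed fraction of shader time sits idle during the graphics pass and that compute work can fill exactly that idle time. The function names and numbers are hypothetical.

```python
def concurrent_time(graphics_ms, compute_ms, idle_fraction):
    """Time to finish both workloads when compute can fill idle shader time.

    Toy assumption: during the graphics pass a fixed fraction of shader
    time is idle, and compute work soaks up that idle time first. Any
    compute work left over runs after the graphics pass finishes.
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    absorbed_ms = idle_fraction * graphics_ms
    leftover_ms = max(0.0, compute_ms - absorbed_ms)
    return graphics_ms + leftover_ms


def speedup(graphics_ms, compute_ms, idle_fraction):
    """Serial time divided by concurrent time under the toy model."""
    serial_ms = graphics_ms + compute_ms
    return serial_ms / concurrent_time(graphics_ms, compute_ms, idle_fraction)


# Hypothetical: 10 ms of graphics, 2 ms of compute.
print(speedup(10.0, 2.0, 0.05))  # few gaps to fill
print(speedup(10.0, 2.0, 0.30))  # many gaps to fill
```

Under these assumptions, a pipeline that is already well utilized gains little from overlapping compute with graphics, while one with frequent stalls gains considerably more.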

What does this mean for gamers and DX12?

There’s been a significant amount of confusion over what this difference in asynchronous compute means for gamers and DirectX 12 support. Despite what some sites have implied, DirectX 12 does not require any specific implementation of asynchronous compute. That aside, it currently seems that AMD’s ACEs could give the company a leg up in future DX12 performance. Whether Nvidia can perform a different type of optimization and gain similar benefits for itself is still unknown. Regarding the usefulness of asynchronous computing (AMD’s definition) itself, Kollock notes:

“First, though we are the first D3D12 title, I wouldn’t hold us up as the prime example of this feature. There are probably better demonstrations of it. This is a pretty complex topic and to fully understand it will require significant understanding of the particular GPU in question that only an IHV can provide. I certainly wouldn’t hold Ashes up as the premier example of this feature.”

Given that AMD hardware powers both the Xbox and PS4 (and possibly the upcoming Nintendo NX), it’s absolutely reasonable to think that AMD’s version of asynchronous compute could be important to the future of the DX12 standard. Talk of returning already-purchased NV cards in favor of AMD hardware, however, is rather extreme. Game developers optimize for both architectures and we expect that most will take the route that Oxide did with Ashes — if they can’t get acceptable performance from using asynchronous compute on Nvidia hardware, they simply won’t use it. Game developers are not going to throw Nvidia gamers under a bus and simply stop supporting Maxwell or Kepler GPUs.

Right now, the smart thing to do is wait and see how this plays out. I stand by Ashes of the Singularity as a solid early look at DX12 performance, but it’s one game, on early drivers, in a just-released OS. Its developers readily acknowledge that it should not be treated as the be-all, end-all of DX12 performance, and I agree with them. If you’re this concerned about how DX12 will evolve, wait another 6-12 months for more games, as well as AMD and Nvidia’s next-generation cards on 14/16nm before making a major purchase.

If AMD cards have an advantage in both hardware and upcoming title collaboration, as a recent post from AMD’s Robert Hallock stated, then we’ll find that out in the not-too-distant future. If Nvidia is able to introduce a type of asynchronous computing for its own hardware and largely match AMD’s advantage, we’ll see evidence of that, too. Either way, leaping to conclusions about which company will “win” the DX12 era is extremely premature. Those looking for additional details on the differences in asynchronous compute between AMD and Nvidia may find this post from Mahigan useful as well. If you’re fundamentally confused about what we’re talking about, this B3D post sums up the problem with a very useful analogy.


AMD Radeon R9 Fury review: Splitting Nvidia’s GTX 980 and 980 Ti in performance

At E3 last month, AMD announced that it would launch multiple GPUs under its new Fury brand. First up was the Fury X, a $649 card meant to compete with the GTX 980 Ti and sporting its own custom water cooler. Today, the company is launching its follow-up to the Fury X, the $549 Radeon R9 Fury. This new card uses the same base Fiji GPU as the Fury X, but with fewer cores (3584 as opposed to 4096). The modest reduction in total compute units is matched by a slight cut to texture mapping units (down to 224 from 256), but the total number of ROPs stayed the same, at 64. The Radeon Fury’s clock speed has been cut slightly, to 1GHz (down from the Radeon Fury X’s 1050MHz), but the GPU packs the same 500MHz, 4096-bit HBM interface, 275W maximum board power, and dual 8-pin PCIe connectors.
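The figures quoted above imply the following theoretical peaks. This is back-of-the-envelope arithmetic, not measured data. It assumes each shader retires one fused multiply-add (two floating-point operations) per clock and that HBM transfers data twice per clock, neither of which is stated in this review. The function names are illustrative.

```python
def peak_fp32_tflops(shaders, clock_mhz, flops_per_clock=2):
    """Theoretical single-precision peak in TFLOPS.

    Assumption: flops_per_clock=2 counts one fused multiply-add per
    shader per clock as two floating-point operations.
    """
    return shaders * flops_per_clock * clock_mhz / 1e6


def memory_bandwidth_gbs(bus_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical memory bandwidth in GB/s.

    Assumption: transfers_per_clock=2 models double-data-rate signalling.
    """
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock / 1000


fury_x = peak_fp32_tflops(4096, 1050)
fury = peak_fp32_tflops(3584, 1000)
print(fury_x, fury, fury / fury_x)
print(memory_bandwidth_gbs(4096, 500))
```

On paper, then, the Fury keeps roughly five-sixths of the Fury X's shader throughput while retaining the same memory bandwidth.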

One of the factors that sets the new Radeon R9 Fury apart from the Fury X is the size of the card. While neither the Sapphire Tri-X nor the Asus Strix R9 Fury is that much bigger than other high-end air-cooled GPUs, both are far larger than AMD’s diminutive Radeon Fury X. Granted, that GPU used a water cooler while the Strix (the card we have in-house) is air-cooled, but it’s not just the cooler that’s large: Asus mounted the Fury on a standard-length high-end PCB as well.

The resulting card is the Asus Strix R9 Fury DirectCU III OC, but don’t let the OC get your hopes up. AMD’s reference card is clocked at 1GHz standard, while the Strix clocks in at a maximum of 1020MHz out of the box. That 2% OC isn’t going to push the envelope, and like the Fury X, Fury isn’t expected to have much overclocking headroom. One thing to like about the R9 Fury Strix, particularly if you have older monitors, is that the GPU supports a wide range of ports. Unlike the Sapphire version of the card, which offers 3x DisplayPort and 1x HDMI, the Strix packs 3x DisplayPort, 1x DVI-D, and 1x HDMI.

According to Asus, the GPU cooler is designed to maintain a maximum temperature of 85C. That’s not nearly as low as AMD’s 50C target for Fury X, but for an air-cooled card, 85C is quite good. It’s particularly impressive given that AMD’s last high-end air-cooled cards, the R9 290 and R9 290X, often ran right up to their 95C thresholds. Asus is bringing the Strix R9 to market at $579, marginally higher than the $549 AMD is targeting for the R9 Fury in general. The heatsink and attached GPU are huge compared to previous cards, at 11.75 inches long and with significant cooler overhang.

We asked Asus why the Strix was so large, given that AMD was able to build both Fury X and the upcoming Fury Nano into much smaller cards, but the company declined to comment in detail, saying only “[W]e went with what works best for the chipset and cooling options.” We’ll have to wait and see if other vendors introduce Fury cards in smaller form factors, or if that capability is reserved for the upcoming Fury Nano.

One thing we can say about the R9 Strix — the card may be long and the fans + heatsink are large, but this card delivers excellent performance for very little noise.

Fury’s positioning, tiny review window

To say that this review is coming in hot would be an understatement. We received our Asus test card on Wednesday at 5 PM for an 8 AM Friday launch. Given my other responsibilities for ET, the time I had to spend with this GPU was further compressed. It’s not clear why Asus sampled on such short notice; manufacturers typically give much longer lead times when testing new hardware. Add in some significant problems with testbed configuration (a series of unfortunate events so mind-boggling, I’m considering writing a post about them), and the end result was a badly compressed launch cycle.

Fortunately, Fury’s positioning is relatively straightforward. AMD is bringing the card in at $549, or roughly $50 more than the GeForce GTX 980. At that price point, the GPU needs to hit about 10% faster than its Team Green counterpart. AMD’s Fury X reliably delivered this kind of performance delta, but was priced to compete against the GTX 980 Ti, not the GTX 980. Fury is going after ostensibly easier prey.

Unfortunately, AMD’s rocket launch means that the 4GB HBM RAM comparisons I’ve wanted to do and wide-scale power consumption comparison are both on-hold for now. But let’s see what we can see from a quick run around the block, shall we?

All of our tests were run on a Haswell-E system with an Asus X99-Deluxe motherboard, 16GB of DDR4-2667, and Windows 8.1 64-bit with all patches and updates installed. The latest AMD Catalyst 15.7 drivers and Nvidia GeForce 353.30 drivers were used. Our power consumption figures are going to be somewhat higher in this review than in some previous stories, because the 1200W PSU we used for testing was a standard 80 Plus unit, and not the 1275W 80 Plus Platinum unit that we’ve typically tested with.

BioShock Infinite:

BioShock Infinite was tested using that game’s Ultra settings with the Alternative Depth of Field option using the built-in benchmark option at both 1080p and 4K.

BioShock Infinite - Radeon Fury

Fury pulls ahead of the GTX 980 nicely, nearly tying things up with the Radeon R9 Fury X and GTX 980 Ti. Playable 4K is no problem for any of the high-end cards in this sample.

Company of Heroes 2:

Company of Heroes 2 is an RTS game that’s known for putting a hefty load on GPUs, particularly at the highest detail settings. Unlike most of the other games we tested, COH 2 doesn’t support multiple GPUs. We tested the game with all settings set to “High,” with V-Sync disabled.

Company of Heroes 2 - Radeon Fury

Company of Heroes 2 is a mixed bag for AMD. At 1080p and a more playable framerate, we see the Fury lagging the GTX 980. At 4K, however, it’s the AMD cards that pull ahead. The margin between the Asus Strix R9 Fury and the R9 Fury X from AMD is rather small, and the Fury X ekes out a win over the stock-clocked GTX 980 Ti.

Metro Last Light:

We tested Metro Last Light in Very High Quality with 16x anisotropic filtering and normal tessellation, in both 1080p and 4K. While it’s a few years old at this point, Metro Last Light is still a punishing game at maximum detail.

Metro Last Light - Radeon Fury

The Asus Strix sweeps the GTX 980 in both tests, nearly tying the GTX 980 Ti. The overclocked version of the 980 Ti from EVGA (covered here) still edged Fury X, but the Fury offers nearly the same level of performance.

Total War: Rome 2

Total War: Rome II is the sequel to the earlier Total War: Rome title. It’s fairly demanding on modern cards, particularly at the highest detail levels. We tested at maximum detail levels in both 1080p and 4K.

Total War: Rome 2 Radeon Fury

In the game’s built-in benchmark, the Fury essentially ties the GTX 980 at 1080p but surpasses it in 4K, with the Fury X holding out a narrow edge above the Fury. Performance here is close across the board.

Shadow of Mordor:

Shadow of Mordor is a third-person open-world game that takes place in between The Hobbit and the Lord of the Rings. Think of it as Far Cry: Ranger Edition (or possibly Grand Theft Ringwraith) and you’re on the right track. We tested the game at maximum detail with FXAA (the only AA option available).


In Shadow of Mordor, AMD’s Fury X doesn’t quite match the stock GTX 980 Ti in 1080p, but it ekes out a win by just under 10% in 4K mode. Against the GTX 980, the Asus Strix R9 Fury is roughly 10% faster in 1080p, and a full 26% faster in 4K mode.

Dragon Age: Inquisition

Dragon Age: Inquisition is one of the greatest role playing games of all time, with a gorgeous Frostbite 3-based engine. While it supports Mantle, we’ve actually stuck with Direct3D in this title, as the D3D implementation has proven to be superior in previous testing.

While DAI does include an in-game benchmark, we’ve used a manual test run instead. The in-game test often runs more quickly than the actual title, and is a relatively simple test compared with how the game handles combat. Our test session focuses on the final evacuation of the town of Haven, and the multiple encounters that the Inquisitor faces as the party struggles to reach the chantry doors. We tested the game at maximum detail with 4x MSAA.


Again, we see the Asus Strix extending a lead over the GTX 980, leading the Nvidia card by roughly 9% in 1080p mode and as much as 23.5% at 4K. The GTX 980 Ti is the fastest card overall, but AMD’s solutions continue to show superior 4K scaling compared to Nvidia — the R9 Fury X matches the GTX 980 Ti in 4K even though it’s surpassed at 1080p.

Noise and power consumption:

AMD’s initial run of Fury X coolers was remarkably quiet under load, even if some of the first batch had a pitch profile we found less than pleasing. The Asus Strix R9 Fury isn’t quite as silent as the Fury X (that’s what you give up for using air as opposed to water), but, in a rare win for AMD, the Asus Strix R9 was logged as quieter than competing GeForce cards by both Tech Report and Anandtech. (I don’t have access to sound equipment capable of picking up decibel levels low enough to be used for this kind of testing.)

That’s a noted turn-around for AMD, considering that Hawaii’s debut cards were infamous for their noise. Third-party designs vastly improved on the initial cards, but Fury doesn’t just compete against Nvidia on this front — it leads Team Green solidly. (TR and Anandtech differ slightly on this point; AT reports the Fury as being the quietest card, while TR logs a GTX 970 in that position). Either way, it’s a big leap forward for AMD. GPU temperatures are also excellent, with the Strix R9 typically topping out in the mid-70s Celsius.

AMD has caught some flak for building what supposedly amounted to “Fat Tonga” as opposed to an all-new GPU, but the Strix R9’s thermals prove that Sunnyvale didn’t just hook its existing GPU up to a helium tank. The Asus R9 390X uses the same cooler as the Strix R9 Fury, but TechReport shows the R9 390X running both hotter and louder for lower overall performance.

We performed our own power consumption tests at idle and using Metro Last Light at 4K to stress test all GPUs. Power consumption was measured in the third run-through, to ensure that the cards heated up.


The Strix’s overall power consumption is about 10% better than the R9 Fury X’s, which is in line with what we’d expect given overall performance. There’s still a significant gap between the GTX 980 and the R9 Strix, though it’s unlikely to make much difference in your power bill unless you game 24/7 or live in a state where electricity is extremely expensive.


Our watts-per-frame metric divides each card’s power consumption in Metro Last Light by its frame rate in that test. Here, we see that the Asus Strix R9 Fury maintains the same improved power consumption ratio as the Fury X, even if it can’t quite match Nvidia’s figures.
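A watts-per-frame figure of this kind can be computed as measured power draw divided by average frame rate. The sketch below assumes that definition. The card labels and the numbers are placeholders for illustration, not the measurements from this review.

```python
def watts_per_frame(power_watts, frames_per_second):
    """Power draw divided by average frame rate; lower is more efficient."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return power_watts / frames_per_second


# Placeholder values only: (power draw in watts, average frames per second).
samples = {
    "card_a": (300.0, 50.0),
    "card_b": (360.0, 48.0),
}

for name, (watts, fps) in samples.items():
    print(name, watts_per_frame(watts, fps))
```

Because the metric is a ratio, a card that draws more power can still score better if its frame rate rises by a larger proportion.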


How well the R9 Fury stacks up against the GTX 980 is going to depend on what your needs are. The Asus R9 Fury Strix doesn’t quite sweep the GTX 980, but it ties or beats it in virtually every benchmark. The Asus Strix is an excellent GPU with superior cooling, even if we were a bit surprised by the giant PCB and extremely short launch window (we only received our card on Wednesday for an NDA lift on Friday).

If the R9 Fury and the GTX 980 were both $500 cards, we’d say the R9 Fury was absolutely the better solution, particularly if you’re gaming above 1080p. AMD has set the price at $549, however, with the Asus Strix coming in around $579. That’s not bad, per se, but keep in mind that this GPU shows its best legs at 4K. That’s problematic, because 4K + maximum detail is still too demanding for a single GPU in most current titles. That means gamers who want to play in 4K without sacrificing visual fidelity are going to be better served by multiple GPUs, and Team Green’s multi-GPU support has long been superior to Team Red’s.

There’s an argument to be made for either card and buyers should be well-served by either solution. If you’re already in Team Red’s camp to start with, the R9 Fury is a great deal. It’s 6-10% slower than the Fury X but only costs about 85% as much, which makes it the more efficient card between the two in terms of dollars per frame. We understand why AMD chose to delay its launch slightly — the Fury steals some of the Fury X’s thunder. Combined, the Fury and Fury X put AMD on much better competitive footing against Nvidia. They aren’t the blowout wins that AMD captured in 2013, when the R9 290 and R9 290X took out the GTX 780 and GTX Titan, but they’re good cards with far better thermals than AMD’s previous top-end single card GPU launches.
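The dollars-per-frame comparison follows from the list prices and the performance range quoted above. The sketch below works it through. The function name is illustrative, and the relative performance values are simply the two ends of the 6-10% range.

```python
def relative_cost_per_frame(price, reference_price, relative_performance):
    """Cost per frame relative to a reference card (1.0 means equal)."""
    if relative_performance <= 0:
        raise ValueError("relative_performance must be positive")
    return (price / reference_price) / relative_performance


FURY_PRICE = 549.0    # AMD's target price for the R9 Fury
FURY_X_PRICE = 649.0  # Fury X launch price

print(FURY_PRICE / FURY_X_PRICE)  # price ratio, about 85%
for relative_performance in (0.94, 0.90):  # 6% and 10% slower than Fury X
    print(relative_cost_per_frame(FURY_PRICE, FURY_X_PRICE, relative_performance))
```

At either end of that range the ratio stays below 1.0, which is what makes the Fury the cheaper card per frame.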

AMD isn’t done with the Fury architecture just yet — Fury Nano and an unnamed dual-GPU Fury are both scheduled for later this year.
