Original Link: https://www.anandtech.com/show/5448/the-bulldozer-scheduling-patch-tested



The basic building block of Bulldozer is the dual-core module, pictured below. AMD wanted better performance than simple SMT (à la Hyper-Threading) would allow, but without resorting to the full duplication of resources found in a traditional dual-core CPU. The result is a duplication of integer execution resources and L1 caches, but a sharing of the front end and FPU. AMD still refers to this module as being dual-core, although it's a departure from the more traditional definition of the word. In the early days of multi-core x86 processors, dual-core designs were simply two single-core processors stuck on the same package. Today we still see simple duplication of identical cores in a single processor, but moving forward it's likely that we'll see more heterogeneous multi-core designs. AMD's Bulldozer architecture may be unusual, but it challenges the conventional definition of a core in a way that we're probably going to face one way or another in the not too distant future.


A four-module, eight-core Bulldozer

The bigger issue with Bulldozer isn't one of core semantics, but rather how threads get scheduled on those cores. Ideally, threads with shared data sets would get scheduled on the same module, while threads that share no data would be scheduled on separate modules. The former allows more efficient use of a module's L2 cache, while the latter guarantees each thread has access to all of a module's resources when there's no tangible benefit to sharing.

This ideal scenario isn't how threads are scheduled on Bulldozer today. Instead of intelligent core/module scheduling based on the memory addresses touched by a thread, Windows 7 currently just schedules threads on Bulldozer in order. Starting from core 0 and going up to core 7 in an eight-core FX-8150, Windows 7 will schedule two threads on the first module, then move to the next module, etc... If the threads happen to be working on the same data, then Windows 7's scheduling approach makes sense. If the threads scheduled are working on different data sets however, Windows 7's current treatment of Bulldozer is suboptimal.
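To make the module layout concrete, here's a minimal sketch (in C, using the Win32 SetThreadAffinityMask call) of how a developer could approximate the ideal behavior by hand today: pin cooperating threads to one module and give independent threads a module to themselves. The core-to-module mapping (logical processors 0-1 on module 0, 2-3 on module 1, and so on) is an assumption based on the enumeration order described above, and the helper names are purely illustrative:

```c
#include <windows.h>
#include <stdio.h>

/* Assumed mapping, per the scheduling order described above: logical
 * processors 0-1 sit on module 0, 2-3 on module 1, and so on. */
#define CORES_PER_MODULE 2

/* Build an affinity mask covering both cores of a given module. */
static DWORD_PTR module_mask(int module)
{
    return (DWORD_PTR)0x3 << (module * CORES_PER_MODULE);
}

/* Pin a thread to one module: useful when two threads share data and
 * benefit from the module's shared L2 cache. */
static void pin_to_module(HANDLE thread, int module)
{
    if (SetThreadAffinityMask(thread, module_mask(module)) == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
}

/* Pin a thread to a single core so an independent thread gets a whole
 * module (front end, FPU, L2) to itself. */
static void pin_to_core(HANDLE thread, int core)
{
    if (SetThreadAffinityMask(thread, (DWORD_PTR)1 << core) == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
}

int main(void)
{
    /* Example: keep the current thread on module 0's first core. */
    pin_to_core(GetCurrentThread(), 0);
    return 0;
}
```

Of course, hand-tuned affinity only works when the application knows its own data-sharing behavior; for everything else, the burden falls on the OS scheduler, which is what the hotfixes below address.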

AMD and Microsoft have been working on a patch to Windows 7 that improves scheduling behavior on Bulldozer. The result is two hotfixes that should both be installed on Bulldozer systems. Both hotfixes require Windows 7 SP1; they will refuse to install on a pre-SP1 installation.

The first update simply tells Windows 7 to schedule threads on empty modules first, then on cores in already occupied modules. The second hotfix increases Windows 7's core parking latency if there are threads that need scheduling. There's a performance penalty you pay to sleep/wake a module, so if there are threads waiting to be scheduled, they'll have a better chance of landing on an unused module after this update.
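To illustrate what the first hotfix changes, here's a toy sketch of an "empty modules first" core-selection policy. This is not the actual Windows scheduler code, just a simplified model of the policy using the same two-cores-per-module assumption as before:

```c
/* Toy illustration of the "empty modules first" policy the first
 * hotfix introduces -- not the actual Windows scheduler code. Logical
 * cores 2m and 2m+1 are assumed to form module m. */
#define NUM_CORES 8

/* busy: bitmask of logical cores that already have a running thread.
 * Returns the preferred core for a new thread, or -1 if all are busy. */
static int pick_core(unsigned busy)
{
    /* First pass: prefer a core on a completely idle module, so the new
     * thread gets the module's front end, FPU and L2 to itself. */
    for (int m = 0; m < NUM_CORES / 2; m++) {
        unsigned mask = 0x3u << (2 * m);
        if ((busy & mask) == 0)
            return 2 * m;
    }
    /* Second pass: fall back to any idle core on a half-busy module. */
    for (int c = 0; c < NUM_CORES; c++) {
        if (!(busy & (1u << c)))
            return c;
    }
    return -1; /* every core is occupied */
}
```

Note that this policy looks only at which cores are free; it knows nothing about whether the threads involved share data, which is exactly the limitation discussed next.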

Note that neither hotfix enables truly optimal scheduling on Bulldozer. Rather than being thread aware and scheduling dependent threads on the same module and independent threads across separate modules, the updates simply move to a better default course of scheduling on empty modules first. This should improve performance in most cases, but there's a chance that some workloads will see a performance reduction. AMD tells me that it's still working with OS vendors (read: Microsoft) to better optimize for Bulldozer. If I had to guess, I'd say we may see the next big step forward with Windows 8.

AMD was pretty honest when it described the performance gains FX owners can expect to see from this update. In its own blog post on the topic, AMD tells users to expect a 1 - 2% gain on average across most applications. With no big promises made, I wasn't expecting the Bulldozer vs. Sandy Bridge standings to change post-update, but I wanted to run some tests just to be sure.

The Test

Motherboard: ASUS P8Z68-V Pro (Intel Z68), ASUS Crosshair V Formula (AMD 990FX)
Hard Disk: Intel X25-M SSD (80GB), Crucial RealSSD C300
Memory: 2 x 4GB G.Skill Ripjaws X DDR3-1600 9-9-9-20
Video Card: ATI Radeon HD 5870 (Windows 7)
Video Drivers: AMD Catalyst 11.10 Beta (Windows 7)
Desktop Resolution: 1920 x 1200
OS: Windows 7 x64 SP1 w/ BD Hotfixes


Single & Heavily Threaded Workloads Need Not Apply

Keeping in mind what these two hotfixes actually do, the only hope for performance gains lies in workloads that are neither single threaded nor heavily threaded. To confirm that there are no gains at either end of the spectrum, we first turn to Cinebench, a 3D rendering test that lets us configure how many threads are in use:

Cinebench 11.5 - Single Threaded

Cinebench 11.5 - Multi-Threaded

With one thread or 8 threads active, the FX-8150's performance is unchanged by the new hotfixes. I also ran TrueCrypt's encryption/decryption benchmark, another heavily threaded test that runs on all cores/modules:

AES-128 Performance - TrueCrypt 7.1 Benchmark

Once again, there's no change in performance. Although you can argue that CPU performance is most important when utilization is at its highest, most desktops will find themselves in between full utilization of a single core and all cores. To test those cases, we need to look elsewhere.



Mixed Workloads: Mild Gains

The one thing all of the following benchmarks have in common is that they feature more varied CPU utilization. Alongside periods of heavy single-core and all-core utilization, there are also stretches where these benchmarks use more than one core but fewer than all of them.

SYSMark has always been a fairly lightly threaded test. While there are definite gains going from two to four cores, it's hardly a heavily threaded benchmark. The performance impact of the hotfixes is negligible, however, both in the overall result and across the individual benchmark suites:

SYSMark 2012 - Overall

Our Visual Studio 2008 compile test is heavily threaded for the most part; however, the beginning of the build process uses only a fraction of the available cores. The hotfixes show a reasonable impact on performance here (~5%):

Build Chromium Project - Visual Studio 2008

The first pass of our x264 transcode benchmark doesn't use all available cores but it is more than just single threaded:

x264 HD Benchmark - 1st pass - v3.03

Performance goes up, but only by ~2% here. As expected, the second pass, which consumes all cores in the system, remains unchanged:

x264 HD Benchmark - 2nd pass - v3.03

Games are another place we can look for performance improvements, as it's rare to see consistent many-core utilization while playing a modern title on a high-end CPU. Metro 2033 is fairly GPU bound and thus we don't see much of an improvement, although for whatever reason the 51.5 fps ceiling at 19x12 is broken by the hotfixes.

Metro 2033 Frontline Benchmark - 1024 x 768 - DX11 High Quality

Metro 2033 Frontline Benchmark - 1920 x 1200 - DX11 High Quality

DiRT 3 shows a 5% performance gain from the hotfixes. The improvement isn't enough to really change the standings here, but it's an example of a larger performance gain.

DiRT 3 - Aspen Benchmark - 1024 x 768 Low Quality

DiRT 3 - Aspen Benchmark - 1920 x 1200 High Quality

Crysis Warhead mirrors the roughly 5% gain we saw in DiRT 3:

Crysis Warhead Assault Benchmark - 1680 x 1050 Mainstream DX10 64-bit

Civilization V's CPU-bound no-render results show no gains, which is to be expected. Looking at the average frame rate during the actual simulation, however, we see a 4.9% increase in performance.

Civilization V - 1680 x 1050 - DX11 High Quality

Civilization V - 1680 x 1050 - DX11 High Quality



Final Words

AMD didn't overpromise as far as the benefits of these new scheduling/core parking hotfixes for Windows 7 are concerned. Single digit percentage gains can be expected in most mixed workloads, although there's a chance you'll see low double digit gains if the conditions are right. It's important to note that the hotfixes for Windows 7 aren't ideal either. They simply force threads to be scheduled on empty modules first rather than on idle cores in occupied modules. To properly utilize Bulldozer's architecture we'd need a scheduler that not only considers which cores/modules are available, but also biases its decisions based on the data shared between threads.

If Bulldozer were the last architecture to present this type of scheduling challenge, I'd say it's unlikely we'd see things get better. Luckily for AMD, I don't believe homogeneous multi-core architectures will be all we get moving forward. Schedulers will get better at understanding the underlying hardware, just as they have in the past. We may see better utilization of Bulldozer cores/modules in Windows 8, but as always, don't use the promise of what may come as a basis for any present day purchasing decisions.
