Name: AMD Zen Microarchitecture Part 2: Extracting Instruction-Level Parallelism
Author: Dr. Ian Cutress

AMD Zen Microarchitecture Part 2: Extracting Instruction-Level Parallelism

by Ian Cutress on 8/23/2016 8:45 PM EST

106 Comments

CrazyElf - Tuesday, August 23, 2016 - link
If they can really get a 40% improvement over Excavator, and I mean 40% on average, not on a few select benchmarks, then AMD has a serious chance of being a compelling option once again.

I'm hoping to see more improvements on Floating Point, which was comically bad in Bulldozer.

A big part of the problem is that we don't know how well Zen will clock or the power consumption. Still, this should be a major leap in performance overall. We'll have to wait for the launch day benchmarks to see the true story.

Another big concern is the platform. CPU performance is only part of the story. We need a good platform that can rival the Z170 and Intel HEDT platforms for this to be compelling on the desktop. For mobile, there will have to be good dual-channel Zen APUs (Carrizo, as AnandTech noted, was heavily gimped by poor-quality OEM designs obsessed with cost cutting).
jabber - Wednesday, August 24, 2016 - link
Yeah I don't think OEMs and others are that worried about supporting AMD. AMD have withered away so much, making AMD CPU capable gear must have become a very minor part of say ASUS/Gigabyte/MSI etc. revenue stream. Making AMD based graphics cards is okay but motherboards? Not so much.
teuast - Wednesday, August 24, 2016 - link
I wouldn't speak so soon. Just this year MSI and Gigabyte (at least) have introduced new AM3+ boards with USB 3.1 and PCIe 3.0. Why, I'm not sure, but if they're doing that for something as old and deprecated as the FX chips, it would defy logic for Zen to come out and for them to only release a few token efforts.

I will say, if the CPUs are good but you're right about OEMs not being concerned with support, then the first OEM to say "hey, why don't we make some actually good AM4 boards?" is going to make an absolute killing.
h4rm0ny - Thursday, August 25, 2016 - link
Are you sure about the PCI-E v3 on AM3+ motherboards? I can find recent releases that have USB3.1 and M.2, but none that support PCI-Ev3. Can you link me or provide a model number? I didn't think 3rd generation PCI-E was possible on the Bulldozer line.
SKD007 - Thursday, August 25, 2016 - link
SABERTOOTH 990FX/GEN3 R2.0
SKD007 - Thursday, August 25, 2016 - link
https://www.asus.com/Motherboards/SABERTOOTH_990FX...
Outlander_04 - Thursday, August 25, 2016 - link
A little misleading. The graphics pci-e controller is built in to an FX processor, so adding a pci-e 3 standard slot to a motherboard will make no difference to actual bandwidth.
Not an issue though, since x16 pci-e 2 has the same bandwidth as x8 pci-e 3, and Intel boards with SLI/CrossFire ability running at x8/x8 do not choke any current graphics card.
h4rm0ny - Thursday, August 25, 2016 - link
What about PCI-E SSDs? Can I get full bandwidth on those? I agree about the graphics cards but that's not so important to me. If I can get full PCI-Ev3 x4 performance for an SSD then I'll probably buy this as a hold-over until Zen. Thanks for the link!
fanofanand - Friday, August 26, 2016 - link
Pci-e 3.0 x4 should be the same as 2.0 x 8. So long as you have a vacant x8 it should theoretically work the same.
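The lane arithmetic in the last few comments can be checked with a quick back-of-the-envelope calculation. The sketch below uses the published per-lane signalling rates and line encodings (5 GT/s with 8b/10b for PCIe 2.0, 8 GT/s with 128b/130b for PCIe 3.0) and ignores packet and protocol overhead, so the figures are upper bounds, not measured SSD or GPU throughput. The table and function names are illustrative only.

```python
# Per-lane raw rate (transfers/s) and encoding efficiency for each generation.
# PCIe 2.0: 5 GT/s with 8b/10b encoding; PCIe 3.0: 8 GT/s with 128b/130b.
GENERATIONS = {
    "2.0": (5.0e9, 8 / 10),
    "3.0": (8.0e9, 128 / 130),
}


def link_bandwidth_gbs(gen: str, lanes: int) -> float:
    """Approximate one-direction bandwidth in GB/s, protocol overhead ignored."""
    rate, efficiency = GENERATIONS[gen]
    bytes_per_second_per_lane = rate * efficiency / 8
    return bytes_per_second_per_lane * lanes / 1e9


# The link widths discussed in the thread.
for gen, lanes in [("2.0", 16), ("3.0", 8), ("2.0", 8), ("3.0", 4)]:
    print(f"PCIe {gen} x{lanes}: {link_bandwidth_gbs(gen, lanes):.2f} GB/s")
```

On these numbers a 3.0 link comes out very slightly below a 2.0 link of twice the width (about 7.9 GB/s against 8.0 GB/s for x8 versus x16), so "the same" is a fair approximation for the link itself.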
extide - Wednesday, September 7, 2016 - link
I think they use a PLX chip and turn the 32 2.0 lanes from the FX chip into 16 3.0 lanes.
Bulat Ziganshin - Wednesday, August 24, 2016 - link
I think it's obvious from the number of ALUs that the 40% improvement is for scalar single-thread code that greatly benefits from access to all 4 integer ALUs. Of course, it will get the same benefit for any code running up to 8 threads (for 8-core Zen). But anyway it should be slower than Kaby Lake since Intel spent much more time optimizing their CPUs.

For m/t execution, improvements will be much smaller, 10-20%, i think. Plus, an 8-core CPU will probably run at a lower frequency than 4-core Bulldozers or 4-core Kaby Lake. AFAIK, even Intel 8-core CPUs run at 3.2 GHz only, and that's after many years of power optimization. We also know that *selected* Zen CPUs run at 3.2 GHz in benchmarks. So, i expect either < 3 GHz frequency, or a 200 W power budget.
atomsymbol - Wednesday, August 24, 2016 - link
"For m/t execution, improvements will be much smaller, 10-20%, i think."

There are bottlenecks in Bulldozer-family when a module is running two threads. An improvement of 40% for m/t Zen execution in respect to Bulldozer m/t execution is possible. It is a question of what the baseline of measurement is.
Bulat Ziganshin - Wednesday, August 24, 2016 - link
M/t execution in Bulldozer already can use all 4 INT alus, so i think that 40% IPC improvement is impossible. In other words, if s/t IPC improved by 40% by moving from 2 alu to 4 alu arrangement, m/t performance that keeps the same 4 alu arrangement, hardly can be improved by more than 20%
looncraz - Wednesday, August 24, 2016 - link
IPC is NOT MT, it is ST only.

IPC is per-core, per-thread, per-clock, instruction retire rate... which generally equates to performance per clock per core per thread.
Bulat Ziganshin - Thursday, August 25, 2016 - link
you can measure instructions per cycle for a thread, 3 threads, a core, a cpu or anything else. what's the problem??

My point is that s/t speed on Zen is improved much more than m/t speed, compared to the last of the Bulldozer family. So, they advertised the improvement in s/t speed, that is 40%. And the m/t improvement is much less since it's still the same 4 alus (although many other parts become wider).
atomsymbol - Thursday, August 25, 2016 - link
AMD presentation was comparing Zen to Broadwell in a m/t workload with all CPU cores utilized.

From

http://www.cpu-world.com/Compare/528/AMD_A10-Serie...

you can compute that the Blender-specific speedup of Zen over a previous AMD design is about 100/38.8=2.57
Bulat Ziganshin - Thursday, August 25, 2016 - link
Can you compute IPC improvement, that we are discussing here?
looncraz - Thursday, August 25, 2016 - link
Except you're absolutely wrong, the performance increases will be much higher for MT than ST.

Bulldozer was hindered by the module design, so you had poor MT scaling - not an issue with Zen. On top of that, Zen has SMT, which should add another 20% or so more MT performance for the same number of cores.

A 40% ST improvement for Zen could easily mean a 100% performance improvement for MT.
Bulat Ziganshin - Saturday, August 27, 2016 - link
It's not "on top of that". Zen is a pretty simple Bulldozer modification that finally allows all 4 scalar ALUs in the module to be used by a single thread. That's why scalar s/t performance should be 40% faster. OTOH, two threads in the module still share those 4 scalar ALUs as before, so m/t performance cannot improve much. On top of that, the module was renamed to core. So, there are 2x more cores now and of course m/t performance of the entire CPU will be 2x higher.
Nenad - Thursday, September 8, 2016 - link
It is possible that AMD already count SMT (hyperthreading) into those 40%.

Their slide which states "40% IPC Performance Uplift" also lists all things that AMD used to achieve those 40%...and first among those listed things is "Two threads per core". So if AMD already counted that into their 40% IPC uplift, then 'real' IPC improvement (for single thread) would be much lower.
extide - Monday, August 29, 2016 - link
No, dude, it's not the same 4 ALUs, it's 4 ALUs per core. 2 threads a core, so 2 ALUs/thread up to 16 threads, or 4 ALUs/thread up to 8 threads, but I would think it would be hard for a single thread to use 4 ALUs, so having 2 threads per 4 ALUs seems fine, plus all the INT execution resources.
Outlander_04 - Thursday, August 25, 2016 - link
40% improvement is not over Bulldozer but over Excavator which is already 20% or more ahead of Bulldozer
looncraz - Wednesday, August 24, 2016 - link
Highly scalar code or vector code will exceed 40% easily. The core execution resources relative to that execution is 75% to 100% greater. That will only translate to 50~60% performance improvement for said code, but a larger impact than the overall 40% improvement.

The cache system, schedulers, issue width, AGUs, L/S, and other factors come more into play in the more common code paths, which reduces the maximum potential benefit derived from the additional execution resources.

However, multi-threaded performance should be HIGHER, not lower. Excavator had relatively poor MT scaling, Zen will be worlds better. Add SMT to the mix - and AMD's solution looks nearly exactly as I anticipated - and you have another 20% or so better SMT scaling.

It is easily conceivable, given what we now know, that AMD has met Haswell's average IPC outside of wider AVX workloads, and exceeded it in certain areas with heavy mixed compute (floating and integer concurrently). It is also now conceivable that AMD's first SMT implementation will be better than Intel's Sandy Bridge era Hyper-Threading. I didn't expect that at all, but the core flexibility is far ahead of Intel's flexibility - and that is largely what determines SMT performance in Zen's design.

Finally, <3Ghz @ 200W is way worse than the currently known figures for their 8C parts. They have 3.2Ghz boost clocks and just 95W TDP. It is expected that the clocks will increase, particularly for the quad core, 65W, parts.

You may not realize this, but these numbers put AMD slightly ahead of Intel in perf/W on 14nm.
niva - Wednesday, August 24, 2016 - link
So are you telling me my Phenom 2 black edition rig might be getting a worthy upgrade?

I'm with you, but I don't trust these benchmarks, wait until the retail CPU samples are out then we can decide.
looncraz - Wednesday, August 24, 2016 - link
I'm saying you'll be able to match that level of performance with a Dual core Zen CPU w/ SMT... if AMD were actually to make one (doubtful).

I do expect AMD to release triple core CPUs again, though, but possibly not right away.
Myrandex - Thursday, August 25, 2016 - link
Yay finally I've been holding onto my Phenom II as well and this might be it! :)
Bulat Ziganshin - Thursday, August 25, 2016 - link
For vector code - they added 4'th ALU, it's almost nothing (Skylake added 4th scalar ALU and got laughable +3% IPC).

For scalar code - they advertise +40% IPC. I'm pretty sure that they advertise the best part of performance, not the average one. It's ADVERTISEMENT, after all.

Now, it's easy to analyze Zen as Carrizo+. M/t performance shouldn't change much since it's still 4-wide core (which was called module in Carrizo). S/t performance should improve much more since it changed from 2 alu to 4 alu. Overall, the core looks like Skylake, but it's not enough to put in a lot of resources - they need to be carefully placed. Intel has gone a long way optimizing their CPUs, and AMD have to repeat that. If you think that AMD can make a Skylake-speed CPU in 2 years, then ask yourself - why Intel hasn't done the same in 2008 or so? Why IBM, having WIDER cpu, still slower than Intel in s/t tests?

All we know is that AMD was able to SELECT a single CPU that was able to run at 3 GHz using a cooler that looks like the one they ship with 95W cpus. Just ask yourself - why they not tried to run their cpu at the same 3.2 GHz which is stock freq. for Intel CPU? And yes, it's way more efficient than Intel CPUs are, making me highly suspicious.

In one of the pictures here AMD claims that Zen has the same power usage as Carrizo, which is a 28nm CPU. AFAIR Carrizo with 2 modules at ~3GHz uses 35-65 W. Multiply it by 4, please.

> It is also now conceivable that AMD's first SMT implementation will be better than Intel's Sandy Bridge era Hyper-Threading.

Why?? Intel's first SMT implementation in Pentium 4 made a few percent improvement (over s/t), the second one in Nehalem gave me +20% on deflate, Sandy was +40%, and Haswell is +50%. Why do you think that the FIRST AMD attempt at SMT will be better than Pentium 4?

Overall, i think that m/t performance of Zen is more predictable - it's Carrizo with some improvements, but still 4-wide, so i expect the usual 10-20% generation-to-generation improvement.

For s/t, it's less predictable, but i'm sure that it's impossible to beat Intel in a single step, and AMD already advertised +40%, which i'm sure is about s/t performance.
looncraz - Thursday, August 25, 2016 - link
"For vector code - they added 4'th ALU, it's almost nothing (Skylake added 4th scalar ALU and got laughable +3% IPC)."

Well, that was the average program performance increase, but the vector code itself sped up more than that.

Also, Zen's ability to leverage its resources should be better than Intel's, but its scheduler setup is really unique, so we need more details on how it will handle holes in a scheduler when its neighbor is full. Having six 14-deep schedulers is a significant part of the design that is almost completely overlooked, IMHO.

"Now, it's easy to analyze Zen as Carrizo+. M/t performance shouldn't change much since it's still 4-wide core"

Only if you are comparing a full module to a single Zen core... There were many bottlenecks in the modules that prevented full performance for multi-threading - Zen does not have that. On top, Zen has SMT, so it will have even better MT performance per core.

"Why IBM, having WIDER cpu, still slower than Intel in s/t tests?"

The width is, as you say, only a part of the equation. It's all about being able to exploit that extra width. Intel does so decently well, but has restrictions as a result of their unified scheduler. A heavy FPU load reduces integer performance, for example, due to shared ports of the scheduler. The impact of this is not easily quantifiable - it would require some very specialized testing. Zen will not have this issue thanks to dedicated schedulers.

Intel uses their unified scheduler to be able to provide results more quickly to dependent instructions. Zen, from appearances, allows each scheduler to make fetch and load requests directly, thereby nullifying what used to be an Intel advantage - and maybe even turning it into a hindrance.

"Just ask yourself - why they not tried to run their cpu at the same 3.2 GHz which is stock freq. for Intel CPU?"

Because you don't push engineering sample CPUs, and 3Ghz is the de facto industry standard speed for IPC comparison testing. Just look around, you'll find 3Ghz is the most commonly chosen frequency when doing IPC comparisons on modern CPUs. Pushing both to 3.2Ghz would not have changed anything, but a Zen engineering sample chip is worth thousands more than that Intel CPU at this time, and is not easily replaceable. If you have to run 500 more tests with it, and hand it over to other departments or teams, you probably aren't being allowed to overclock it any.
deltaFx2 - Friday, August 26, 2016 - link
The answer to the IBM question is easy. 1) IBM designed the Power8 with SMT-2 as the sweet spot. Like bulldozer, or Alpha EV6, they have execution clusters. In 2T, each cluster runs a thread, in 1T, the thread is split across these clusters, with a penalty for moving between them. Hence their 1T->2T uplift is a lot higher than intel's 1T->2T (worse baseline). (2) You're comparing different ISAs. x86 is a lot more CISC'y than POWER. x86 supports load+compute, compute+store, load+compute+store, and this is dispatched as a single uop. The same "work" in a more RISC'y machine needs 2 or 3 uops. For the same reason, an ARM core that hopes to achieve the same performance as x86 will need to dispatch more ops, or fuse more ops before dispatch.
Spunjji - Saturday, August 27, 2016 - link
The CPU they tested with is an early engineering sample. Simple answer. You write a lot to make yourself sound smart but you're exercising either clear bias or ignorance here.
tipoo - Wednesday, August 31, 2016 - link
Bulldozer's engineering samples were 2.5GHz and that shipped stupid high clocked. Zen ESs being 3GHz doesn't worry me.
Cooe - Thursday, May 6, 2021 - link
Holy CRAP did history ever make you look like an absolute freaking idiot! xD
extide - Monday, August 29, 2016 - link
Well, they have already shown an 8-core Zen running at full load at 3Ghz with their regular OEM heatsink/fans, and those are rated at 125W TDP max, so we do already know that's possible.
defter - Wednesday, August 24, 2016 - link
It's 40% IPC improvement, not 40% overall improvement. If you improve IPC by 40% and achieve 85% of the clock speed, the total improvement will be only 20%.

Since AMD hasn't talked about clock speed we can assume that it will be lower than Bulldozer.
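The arithmetic in that comment is just performance modelled as IPC times clock frequency. A minimal sketch, assuming that simple model holds (it ignores memory latency, turbo behaviour and anything else that does not scale with core clock); the function names are made up for illustration:

```python
def relative_performance(ipc_gain: float, clock_ratio: float) -> float:
    """Performance relative to the baseline, modelling perf as IPC x frequency."""
    return (1.0 + ipc_gain) * clock_ratio


def break_even_clock_ratio(ipc_gain: float) -> float:
    """Clock ratio at which the IPC gain is exactly cancelled out."""
    return 1.0 / (1.0 + ipc_gain)


# The case from the comment: +40% IPC at 85% of the old clock speed.
net = relative_performance(0.40, 0.85)
print(f"net change: {net - 1:+.0%}")

# How far the clock could drop before a +40% IPC part is no faster at all.
print(f"break-even clock: {break_even_clock_ratio(0.40):.0%} of the old clock")
```

On this model a 40% IPC gain only turns into a net loss if the clock falls below roughly 71% of the old one, which is why the shipping frequency matters so much to the argument.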
euskalzabe - Wednesday, August 24, 2016 - link
Let me fix that for you: "Since AMD hasn't talked about clock speed we can assume..." absolutely nothing and can only wait until the final product is released.
retrospooty - Wednesday, August 24, 2016 - link
Actually he is right and probably understating it. If AMD says it will have 40% IPC improvement, it is probably not true, or true only in a few select benchmarks. If AMD left out the clockspeed it is almost definitely going to be lower. AMD has zero credibility with pre-release performance claims. Nothing AMD says can be taken at its word until retail units (not engineering samples) are independently tested.
Azix - Wednesday, August 24, 2016 - link
why the flying fork would the clock speed be lower? I hope you don't mean lower than they have shown, that would make no sense.

Bulldozer engineering samples were maybe 2.5Ghz or 3Ghz. Additionally, talking about actual clock speeds would be to give away sku information. How they plan to structure the product line etc.
Outlander_04 - Thursday, August 25, 2016 - link
Both Intel Broadwell-E and Zen were at 3 GHz for the comparison.
Broadwell-E maxes out at 3.6 GHz, but most models are at 3.2 GHz.
Don't let your prejudices cause you to jump to conclusions.
Zen could easily be released running at higher clock rates.
silverblue - Wednesday, August 24, 2016 - link
Imagine for a second that Zen was clocked like the FX-8320E, that is a 3.2GHz base with 4.0GHz boost. Would a 40 to 50% average IPC boost make Zen competitive?

For all we know, Zen could be conservatively clocked, paving the way for Zen+ with moderate tweaks and increased clocks; a bit like Piledriver vs. Bulldozer, as opposed to Phenom II vs. Phenom.
looncraz - Wednesday, August 24, 2016 - link
Zen will clock very close to 4Ghz out of the box - AMD kept most of the speed-demon elements of Bulldozer, such as the long pipelines. They also used dedicated, simple, schedulers - which is where frequency limits are frequently found... and they also put the L3 cache on a different clock bus, meaning it might operate at a different frequency from the cores... again.

The engineering samples are always clocked low, so if they are running at 3Ghz for a demo, then they will be able to achieve at least 3.4~3.6Ghz, with 4Ghz boost clocks on eight-core CPUs. Quad core units will obviously go higher, still. That is why half the cores still has 70% of the power draw - it's operating higher up the frequency curve. 3.8Ghz base, 4.2Ghz boost for the top quad core SKU seems very likely given what is known.
tipoo - Wednesday, August 31, 2016 - link
Meanwhile Intel worked on shortening pipelines... Curious to see how this will go, hope for AMD's sake it's competitive.
masouth - Friday, September 2, 2016 - link
I hope it works out for AMD as well but reading about long pipelines and higher freqs always reminds me of the P4 days

/shudder
junky77 - Wednesday, August 24, 2016 - link
The problem is now having Intel/AMD provide fast enough CPUs to feed the new GPUs that don't seem to slow down..
gamerk2 - Wednesday, August 24, 2016 - link
Pretty much anything from an i7 920 onward can keep GPUs fed these days. For gaming purposes, CPUs haven't been the bottleneck for over a decade. That's why you don't see significant improvement from generation to generation, since our favorite CPU tests happen to be with GPU sensitive benchmarks.
Death666Angel - Thursday, August 25, 2016 - link
The story is much more complicated than you are making it seem:
https://www.youtube.com/watch?v=frNjT5R5XI4
tipoo - Wednesday, August 31, 2016 - link
A Skylake i3 presents better frametimes than old i7s like the 920 or 2500K
rhysiam - Wednesday, August 24, 2016 - link
40% over Excavator probably still puts it well behind even Haswell on IPC. If I'm looking at it right, Bench on this site has 4 single threaded tests (3 Cinebench versions and 3D Particle...). I crunched some numbers and found that if you add 40% to Excavator @ 4Ghz (X4 860 turbo), it still loses to Skylake @ 3.9Ghz (turbo) by between 32% & 39% across the four benchmarks. Haswell @ 3.9Ghz (turbo) would still be faster by 24% to 33%.

If it really is 40% minimum, AND they can sustain decent clock speeds, then that's at least enough to be in the ballpark, but it's still well short of Intel in those few benchmarks at least. TBH I don't know how representative those benchmarks are of overall single-threaded performance.

It could well be a case of AMD offering significantly poorer lightly threaded performance, but a genuine 8 core CPU at an affordable (i.e. not $1000) price.
gamerk2 - Wednesday, August 24, 2016 - link
I expect the following:

~40% average IPC gain in FP workloads
~30% average IPC gain in INT workloads
~20% clock speed reduction.

Average performance increase: ~15-20%, or Ivy Bridge i7 level performance.
Michael Bay - Wednesday, August 24, 2016 - link
Well, nothing stops them from their own brand of tick-tock, especially considering largely stagnant intel IPC.
looncraz - Wednesday, August 24, 2016 - link
40% over Excavator is almost exactly Haswell overall, particularly once you shape the performance to match what is known about Zen.

http://excavator.looncraz.net/
atlantico - Friday, August 26, 2016 - link
Wow looncraz!! Really cool effort you made :)
Spunjji - Saturday, August 27, 2016 - link
Your numbers are different to everyone else's. Given that you don't cite any of your sources I believe everyone else.
Krysto - Wednesday, August 24, 2016 - link
I would hope they try to double the cores of Intel for notebooks.

Dual-core Zen without SMT will DESTROY Intel's Atom-based Celerons and Pentiums at the low-end. There will be absolutely ZERO reason to get a Celeron or Pentium notebook once Zen appears on the market at that price range.

But at the Core i3 and Core i5 levels, I was hoping AMD would price a quad-core Zen with no SMT against dual-core Core i3 and Core i5, and a quad-core Zen with SMT against Intel's quad-core (no HT) Core i5, and finally 8-core with and without SMT variants against Intel's quad-core Core i7 chips (with HT).

If they can basically double the cores compared to what Intel has to offer at around the same price level, and maybe with only slightly worse single-thread performance and slightly worse power consumption, AMD's chips should be a NO-BRAINER. The value would be incredible, and it would push the market towards having powerful quad-core chips by default for most PCs. Intel is going to HATE that, because it would seriously cut into their profits. So AMD could use that strategy to both offer great value products and hurt Intel significantly.
looncraz - Wednesday, August 24, 2016 - link
AMD is not seeking the low end, they are trying to redefine AMD as the top-tier CPU company they once were. They are aiming for the top and the bulk of the market.

Zen+'s 15% IPC improvement over Zen might just give them the performance crown, but I'm sure Intel has taken note and planned accordingly.
zaza - Wednesday, August 24, 2016 - link
but the AMD CCX module is a quad core module. i am not sure if it is easy for AMD to just remove two.
looncraz - Wednesday, August 24, 2016 - link
Very easy, you just fuse off the defective core, that's the beauty of independent cores. The core complex just shares a common data bus and third level cache. Disabling a core in the complex will simply have it not ask for data on the common data bus. The L3 cache may or may not be cut down (probably will be).
H2323 - Wednesday, August 24, 2016 - link
"While Zen is initially a high-performance x86 core at heart, it is designed to scale all the way from notebooks to supercomputers, or from where the Cat cores (such as Jaguar and Puma) were all the way up to the old Opterons and beyond, all with at least +40% IPC."

https://www.youtube.com/watch?v=eUSJfGehKDQ

In the video it's more than 40% across all of internal testing.
Vigilant007 - Saturday, August 27, 2016 - link
I don't know if AMD will ever have a major win in the PC industry again. Realistically they'll end up focusing on building custom x86 for consoles, and server chips. I can also see them exploiting their ability to do x86 to design custom chips for Apple.

AMD could end up being a fantastic acquisition target as well.
Tuna-Fish - Tuesday, August 23, 2016 - link
From page 3:

> and L2 with 512 entries and support for 4K and 256K pages only.

Surely you meant 4k and 2MB pages only?
deltaFx2 - Tuesday, August 23, 2016 - link
Ian, an error here: "It also states that the L3 is mostly inclusive of the L2 cache, which stems from the L3 cache as a victim cache for L2 data." A victim L3 is by definition an exclusive cache (as you note elsewhere). Also I don't understand why you have the impression that a victim cache is less efficient than an inclusive cache. As you note, an inclusive cache has to keep duplicate copies of data in L2 and L3 whereas an exclusive cache stores exactly 1 copy (either L2 or L3 but never both). In an exclusive cache hierarchy, a cache block is inserted into the L2, and when evicted, is put into the L3. In an inclusive cache hierarchy, a cache block is inserted both into the L2 and L3. Doesn't the exclusive hierarchy make better use of space? Incidentally, AMD has done exclusive caches since K8 at least. This isn't new.
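The capacity argument can be made concrete with a toy model. The sketch below is a deliberately simplified, single-core, fully associative LRU simulation, not a description of Zen's or Intel's actual cache controllers. It follows the insertion rules described above: in the exclusive case the L3 is filled only by L2 victims, in the inclusive case a line is inserted into both levels and an L3 eviction back-invalidates the L2 copy. The sizes, the access pattern and all names are assumptions picked to make the effect visible.

```python
from collections import OrderedDict


class LRU:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def __contains__(self, addr):
        return addr in self.lines

    def touch(self, addr):
        self.lines.move_to_end(addr)

    def remove(self, addr):
        self.lines.pop(addr, None)

    def insert(self, addr):
        """Insert addr as most recently used; return the evicted line or None."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return None
        victim = None
        if len(self.lines) >= self.capacity:
            victim, _ = self.lines.popitem(last=False)
        self.lines[addr] = True
        return victim


def memory_fetches(trace, l2_size, l3_size, exclusive):
    """Count accesses that miss both levels and have to go to memory."""
    l2, l3 = LRU(l2_size), LRU(l3_size)
    fetches = 0
    for addr in trace:
        if addr in l2:
            l2.touch(addr)
            continue
        if addr in l3:
            if exclusive:
                l3.remove(addr)      # the single copy moves up into L2
            else:
                l3.touch(addr)
        else:
            fetches += 1
            if not exclusive:
                evicted = l3.insert(addr)
                if evicted is not None:
                    l2.remove(evicted)   # strict inclusion: back-invalidate
        victim = l2.insert(addr)
        if exclusive and victim is not None:
            l3.insert(victim)        # victim cache: L3 is filled by L2 evictions
    return fetches


# Hypothetical workload: loop over 10 lines, with a 4-line L2 and an 8-line L3.
trace = list(range(10)) * 100
print("inclusive:", memory_fetches(trace, 4, 8, exclusive=False))
print("exclusive:", memory_fetches(trace, 4, 8, exclusive=True))
```

With these toy sizes the exclusive pair holds all 10 lines (4 + 8 = 12 unique slots), so only the 10 cold misses go to memory, while the inclusive pair is limited to the 8 unique lines of the L3 and misses on every one of the 1000 accesses. A cyclic pattern under LRU is a worst case that exaggerates the gap; it illustrates the space argument only and says nothing about snoop latency.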
bcronce - Tuesday, August 23, 2016 - link
Exclusive L3 cache makes better use of space, but requires snooping other core's L2 caches for data. If the L3 cache has all of the data all of the L2 cache has, then you only need to check one place.

This is important when you're trying to synchronize threads since locks are shared memory locations that each core is attempting to read and update. Common types of thread safe data-structures can take some pretty big performance scaling hits. Of course you can work around this in your data-structure.

One research paper that I read showed exclusive caches having twice the latency of inclusive when snooping was required. If your data-structure has a scaling that works well up to 16 cores on Intel's inclusive cache, it may cap out around 8 cores on AMD's exclusive, thanks to Amdahl's law.

Cache snooping gets slower as more cores are added. Gotta check them all.
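The scaling claim can be illustrated with the usual Amdahl's law formula. This is a toy model under stated assumptions, not a reproduction of the paper mentioned above: the synchronization cost is treated as a fixed serial portion of the run, the rest is assumed to parallelize perfectly, and the 0.05 figure is a hypothetical value chosen only to show the shape of the curve.

```python
def speedup(cores: int, serial: float, parallel: float = 1.0) -> float:
    """Amdahl-style speedup: only the parallel part shrinks with core count."""
    return (serial + parallel) / (serial + parallel / cores)


def ceiling(serial: float, parallel: float = 1.0) -> float:
    """Speedup limit as the core count goes to infinity."""
    return (serial + parallel) / serial


base_serial = 0.05  # hypothetical cost of the lock handoffs, in arbitrary units

print("cores  base  2x sync latency")
for cores in (2, 4, 8, 16, 32):
    print(f"{cores:5d}  {speedup(cores, base_serial):4.1f}  "
          f"{speedup(cores, 2 * base_serial):4.1f}")

print("limit:", round(ceiling(base_serial), 1), "vs", round(ceiling(2 * base_serial), 1))
```

Doubling the serialized cost roughly halves the ceiling on speedup in this model, which is the intuition behind a data structure that scales to 16 cores in one design levelling off nearer 8 in the other.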
deltaFx2 - Tuesday, August 23, 2016 - link
@bcronce: Except that Intel doesn't do strictly inclusive caches either. Intel's caches are neither-inclusive-nor-exclusive (afaik), in which data is inserted into both L2 and L3, but evicted independently. So you have to check L2 and L3 independently, same as the exclusive cache. Strictly-inclusive caches have many bad properties, a few of which come to mind immediately: (1) False evictions of lines: If a block constantly hits in L2, the LRU in L3 is not updated. If the block then becomes the oldest in L3 and is evicted, it must be evicted in L2 as well, resulting in a miss all the way to memory. (2) Associativity of the L3 cache must be at least the sum of the associativity of the L2 caches hanging off it, otherwise it will constrain the associativity of the L2 caches. Hence neither-inclusive-nor-exclusive, or strictly exclusive.

Exclusive caches are harder to build, true, because you have to manage exclusivity. That doesn't explain Ian's comment about them being less efficient.
68k - Wednesday, August 24, 2016 - link
The Intel manual states that

"The shared L3 cache is writeback and inclusive, such that a cache line that exists in either L1 data cache, L1 instruction cache, unified L2 cache also exists in L3."

That is, the L3-cache is strictly inclusive with anything stored in the core local L2/L1-caches. So it is enough to check L3 to see whether the cache-line is in use by any other core sharing the L3.
bcronce - Wednesday, August 24, 2016 - link
@68k
Thanks for looking it up. I only remembered Intel talking about this years ago when they made the design decision in order to minimize latency. Certain operations are extremely latency sensitive, like thread synchronizations.

The strange thing is AMD is pushing for so many cores, but then chooses a cache design that makes sharing data more expensive. What they did gain is exclusive caches tend to have more bandwidth and are great for independent threads with little sharing. It's a trade off. Nothing is free, pros and cons everywhere.
deltaFx2 - Wednesday, August 24, 2016 - link
@68k,@bcronce: I guess I haven't looked up Intel's latest and greatest cache organization :) I do recall though that Neither-incl-nor-exclusive was their scheme for quite a while, probably until Sandy Bridge. Perhaps that explains why their L2 cache went from 8-way to 4-way in SkyLake; the extra associativity cannot be effectively utilized with strict inclusion as you keep adding more cores (a single set in L3 maps to a unique set in L2. If you have 16 way L3, only 16 lines that map to that set in L3 can reside in the L2s. Obviously, multiple L3 sets map to the same L2 set, so this is somewhat mitigated, but it is a glass-jaw).

The nice thing about Intel's organization is that it's a monolithic L3 with variable latency to slices, as opposed to AMD's distributed L3. That probably is what adds the latency (if it does) on cache-to-cache transfers, not the inclusive-vs-exclusive, or the inclusive cache acting as a probe filter. You could just as easily add a separate probe filter to avoid unnecessary coherence lookups. Would you point me to that paper you quoted earlier? I have a hard time believing that the problem is the exclusive cache itself, and not the organization of the cache. Anyway, I don't know enough about AMD's design to comment, so I'll leave it at that. Thanks!
intangir - Wednesday, August 24, 2016 - link
As far as I know, since Nehalem Intel's L3 caches have been fully inclusive of L1+L2, but the L1 and L2 caches are neither inclusive nor exclusive with respect to each other.
Ryan Smith - Tuesday, August 23, 2016 - link
Right you are. That's a typo on our end, and in the deep dive section on cache you can see why it's exclusive. As for the first page, I've corrected the typo.
looncraz - Wednesday, August 24, 2016 - link
Zen's L3 is "mostly exclusive." This changes things up a bit - it isn't a pure victim cache and will probably contain data used between multiple cores. The first access will be slower as the data is snooped from another core's L2, but then that data will be mirrored in the L3. The coherent data fabric which links multiple core complexes adds a whole new level of complexity for sharing data between cores, but I suspect a mechanism exists to synchronize global data between the L3 caches, so global data will have a copy in each L3 and actions on global data will incur a latency penalty, but nothing compared to snooping L2s across multiple core complexes.
NikosD - Wednesday, August 24, 2016 - link
It seems that AMD did its job right this time.

Most of the CPU features are in between Broadwell and Skylake architectures and this is extremely important and fast, with the exception of AVX/AVX2 instructions that are executed in 128bit chunks instead of 256bit.

Of course we have to wait and see latencies and throughput of the rest of arithmetic instructions, but all these are just details.

I think with Zen we will all owe a lot to AMD like the older days of 64bit CPUs and OS.

This time the revolution will be the affordable true 8 core/ 16 thread CPU with no GPU inside for the first time in desktop.

The key point here is price, in order to be affordable. Not like High-End Desktop systems of Intel.

That move will force Intel to accept the fact that we, as customers, want 8 cores in our CPUs, just as we wanted 64-bit CPUs and OSes back when Intel offered 64-bit only with Itanium.

All in all, AMD could possibly hold in its hands a true winner, from laptops to servers that brings us memories of AMD Athlon and Opteron CPUs.

Well done AMD!
Michael Bay - Wednesday, August 24, 2016 - link
Do we, though? General purpose software like word processors and such is literally indistinguishable on 2 and 4 cores, and a lot of things on content creation side are already accelerated by GPU.
There are games of course, but the CPU stopped being a bottleneck there a long time ago.
Krysto - Wednesday, August 24, 2016 - link
I think PCs in general run better on four cores than on two, even if most apps themselves can't take advantage of them, although I think in the next 5 years most new games will take advantage of 8 threads. But otherwise, it's just good for multitasking.
tarqsharq - Wednesday, August 24, 2016 - link
I had an argument with one fellow on the internet regarding i7 being plenty for whatever I was doing in terms of core count. But streaming a show on one monitor while playing Overwatch was hitting 70%+ CPU usage, with all logical cores being 60-70% utilized consistently, with spikes up to 90%+.

That was on my i7-4770K to be specific, running 1080P on a 144hz monitor for Overwatch, and Crunchyroll for 1080P anime stream on the second monitor.

So some games combined with slight multitasking are already taxing the 4C/8T environment.
galta - Wednesday, August 24, 2016 - link
And how much multitasking are we really using? If I had to guess, I would say not much, on average.
You might have some folks here and there using it, but regular users need something between two and four cores, just as you said.
You have the OS, the software you're using, be it a game or not, plus everything that's running behind the scenes, including Windows inefficiencies, and that's it. But for some weird guy that spends his day on 7zip, more than 4 cores brings no extra power.
This is the reason why, no matter how excited we might get with 10 cores (I would love one, even if for bragging rights only), our i5s are enough for what we do.
Maybe in 5 years from now games will be multithreaded, but I'm not holding my breath: something similar was said 5 years ago, and here we are.
At the end of the day, we still need improvement in per core performance.
looncraz - Wednesday, August 24, 2016 - link
Browsers are becoming better and better at using more cores... and we're all running tens of processes in the background, some of which fire interrupts on a CPU. More cores allows for more going on at the same time without interruptions. You can actually feel this moving to an eight-core FX-8350 from a quad core i5... those eight cores provide a somewhat smoother multi-tasking environment, despite each core being slower and the overall performance being lower.

Humans are simply sensitive to changes in timing - more cores and more threads reduces the variability in timing, which improves perceived performance.
galta - Thursday, August 25, 2016 - link
Hum....
I don't know many people who share your opinion about FX-8350 vs i5.
Anyway, we have been multitasking for a while, at least to some extent: OS, Word, anti-virus, browser. The question is: for this light multitasking, are we better off with several cores with poor performance/core, or with fewer cores but with great performance/core?
Reviews and actual people generally prefer the latter.
As for browsers, great news that they are improving, but download/upload speed is by far the most important factor in user experience.
Alexvrb - Sunday, August 28, 2016 - link
Download speed is fine for web browsing if you've got something faster than DSL. How much data exactly do you think you're consuming while browsing the web? Outside of streaming videos you won't use up a ton of bandwidth.
Cooe - Thursday, May 6, 2021 - link
I know this is ANCIENT, but how the hell did you not realize that multi-core optimization was so bad only because nobody could afford CPUs with more than 4 cores pre-Zen??? Modern games run freaking TERRIBLE now on 4c/4t i5s.
Notmyusualid - Wednesday, August 24, 2016 - link
No, nope, nej, and nein.

I see (FEEL) tangible improvements in my computing ever since I dropped 2 cores for 4.

And it looks like others below agree....
galta - Thursday, August 25, 2016 - link
I believe you do, for the sweet spot is now around 4 cores, as I said before.
The question is: do you believe that your experience will improve significantly if you move to 6 or 8 cores?
Probably not, unless you spend your day zipping files or rendering images.
Alexvrb - Sunday, August 28, 2016 - link
They said the same thing about quad cores, and dual cores before that. AMD has to get on top of the curve, not behind it. They'll offer quad cores for more mainstream systems, and 8 for performance rigs. More for servers, and potentially less for low-power and/or low-cost.
eldakka - Wednesday, August 24, 2016 - link
The first page link, AMD Server CPUs and Motherboard Analysis, is wrong, it actually links to the ARM v8-A article.
atlantico - Friday, August 26, 2016 - link
Yes, it's also wrong here: http://www.anandtech.com/show/10585/unpacking-amds...

Sigh.
TristanSDX - Wednesday, August 24, 2016 - link
Zen does not support transactional memory, a big disadvantage compared to Intel.
Senti - Wednesday, August 24, 2016 - link
And how much does it matter? TSX is a great thing, no doubt there. But the adoption? Can you name any real software that uses it and gets a significant benefit from it?

I blame Intel's stupid marketing for cutting TSX from too many versions and killing the adoption.
coder111 - Wednesday, August 24, 2016 - link
As far as I know, Azul JVMs do support transactional memory. So if you have a Java app, you can use it.

Other than that, yes, I haven't seen TSX used much...
68k - Wednesday, August 24, 2016 - link
Isn't the version of glibc in recent Linux-distributions using the lock elision feature of TSX?

https://lwn.net/Articles/534758/
https://01.org/blogs/tlcounts/2014/lock-elision-gl...

If so, then essentially every single Linux program does make use of TSX when present.
looncraz - Wednesday, August 24, 2016 - link
One of the most important features of TSX is checkpointing. Zen supports checkpoints in its execution pipeline. Otherwise, I've not seen anything that said Zen did or did not support TSX, not that the tech is widely used at this time.

From there, you just need tagging and a few other features to add support. It's something that could be included in Zen+ if Zen does not have it.
silverblue - Wednesday, August 24, 2016 - link
It looks like Zen was developed to accelerate the vast majority of software, and rely on core count for everything else. It might explain the lack of focus on AVX.

If cache stats were any indication of performance, it would appear that Zen was destined to compete with Broadwell, but not quite match the Lake CPUs; Zen+ would perhaps close the gap albeit a bit late. Bulldozer was hamstrung by half-speed writes and horrific L3 latency - would it be remiss to assume that they've at least fixed those two issues?

I'm not sure anybody can truly predict performance however, even with a Blender demonstration, and certainly not to work out prospective Cinebench or SuperPi performance. You could have a monster of an architecture, but if the software isn't optimised for it, it's not going to be representative of its true performance.
wumpus - Wednesday, August 24, 2016 - link
I'd still want the TSX instructions before even thinking about the server market. I guess they surrendered that before the overall architecture was finished. Although considering how badly it has worked for Intel (essentially turned off after errata were noted in the first generation), maybe it wasn't worth the risk.
Alexvrb - Sunday, August 28, 2016 - link
Yeah they need to take their time. A faulty implementation would do more harm than good at this point.
Xajel - Wednesday, August 24, 2016 - link
I have a feeling that the socket has more potential; there's a huge jump in pin count that might hide something. I suspect AMD has a specific HEDT version with a higher TDP (like 130-140) that might ship later in 2017 after the first wave, or maybe even triple channel that works only on higher-end HEDT motherboards while still being backward compatible with regular dual-channel motherboards...
none12345 - Wednesday, August 24, 2016 - link
Nice article, thanks, and timely too.

Can't wait for real Zen benchmarks.

I so badly want this to be another Athlon 64 X2 moment. But I don't think we will get that, and we don't need that. Consider that the Athlon 64 gave us multicore, and it stomped the Pentium 4 as well.

I'll be completely happy with a Phenom II moment. That chip was not quite as fast as Intel, but gave you more, almost-as-fast cores for your money, as well as unlocking cores at a much lower price point, which gave you superior overclocking for your money.

I will not at this point consider buying another quad core. Quad core is insufficient for my typical workload. I do not use one heavily multithreaded piece of software; I constantly use multiple pieces of moderately threaded software that currently mostly max out my processor.

In my opinion the industry should have stopped selling dual cores a year ago. It should be quad core at the low end, and 6 or 8 cores should be the mainstream. For desktop, that is; I can still see some mobile things being dual core.

Because I will NOT consider another quad core at this point, my only option today is Intel's enthusiast platform, which is far too expensive relative to the performance increase. So they are out.

And this is why I'm hoping that Zen does not disappoint. If they can give me 6 or 8 cores that are within 10% per core, for similar costs to the i5 or i7 line, then I'm a definite buy. If they give me 6 or 8 cores that are priced like Intel's enthusiast platform, well then I guess I'm not upgrading until someone can offer me more cores for a reasonable price.

If Intel would offer more cores in the mainstream, then I'd absolutely consider a new chip from them, i.e. if i3 was 4 core, i5 was 6 core, and i7 was 8 core.
Vlad_Da_Great - Wednesday, August 24, 2016 - link
i7-4790K will wipe the floor with the ZEN mop. Roy Jones Jr (Intel) vs Montell Griffin (AMD) part II. https://www.youtube.com/watch?v=VZ_4FrhHHJE That is it! I can't believe AnandTech is biting on their marketing fluff.
H2323 - Wednesday, August 24, 2016 - link
"Nevertheless, power was the main concern rather than pure performance or function, which have been typical AMD targets in the past."

This is contradictory to what AMD has had to say. Power was not a greater focus than performance, just not true.
takeshi7 - Wednesday, August 24, 2016 - link
Wow, I haven't seen victim caches being used in a CPU since the old VIA C3. I hope the advantage of not having to duplicate data between the L2 and L3 caches pays off for AMD.
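
For anyone unfamiliar with the term, here is a minimal Python sketch of a strictly exclusive victim cache: the L3 is filled only by lines evicted from the L2, and a line that hits in the L3 moves back up instead of being duplicated. The sizes, the LRU policy and the class and function names are assumptions for illustration; elsewhere in this thread Zen's L3 is described as "mostly exclusive", so treat this as a simplification, not a model of the real design.

```python
from collections import OrderedDict


class LRU:
    """Tiny fully associative LRU cache holding line addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def __contains__(self, addr):
        return addr in self.lines

    def touch(self, addr):
        self.lines.move_to_end(addr)

    def remove(self, addr):
        del self.lines[addr]

    def insert(self, addr):
        """Insert addr as most recently used; return the evicted line or None."""
        victim = None
        if len(self.lines) >= self.capacity:
            victim, _ = self.lines.popitem(last=False)
        self.lines[addr] = True
        return victim


def access(l2, l3, addr):
    """One access through an L2 backed by an exclusive victim L3."""
    if addr in l2:
        l2.touch(addr)
        return "L2 hit"
    if addr in l3:
        l3.remove(addr)  # exclusive: the line moves up, it is not copied
        outcome = "L3 hit"
    else:
        outcome = "miss"
    victim = l2.insert(addr)
    if victim is not None:
        l3.insert(victim)  # the L3 is filled only by L2 victims
    return outcome


l2, l3 = LRU(2), LRU(4)
first_pass = [access(l2, l3, a) for a in range(6)]
second_pass = [access(l2, l3, a) for a in range(6)]
print(first_pass)
print(second_pass)
```

With a 2-line L2 and a 4-line L3, a working set of 6 lines fits entirely: no line is ever held in both levels, so the usable capacity is the sum of the two.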
H2323 - Wednesday, August 24, 2016 - link
and bulldozer in 2011
Oxford Guy - Saturday, August 27, 2016 - link
The EDRAM L4 in Broadwell C is supposed to be a victim cache.
intangir - Wednesday, August 24, 2016 - link
Great article. By the way, Ian, you're missing a syllable from "Microarchitecture" in the title.
name99 - Wednesday, August 24, 2016 - link
"The first, CLZERO, is aimed to clear a cache line and is more aimed at the data center and HPC crowds"

Not exactly. The point of an instruction like CLZERO is that the usual way cache lines are filled uses twice as much bandwidth as necessary.
When I write the first datum to a cacheline, the first thing that needs to be done is to load the cacheline and then overwrite the datum I wanted to write. This is obvious. BUT suppose I'm writing enough data that I write over the entire cache line? Then pulling it in was a waste of bandwidth.
THAT is the point of an instruction like CLZERO, to "ready the cache line for being overwritten" without wasting time loading it. Of course for many purposes filling with zeros is what one wants, but there are other times when one is simply engaged in bulk writing and it again makes sense.
PPC for example had a similar instruction, DCBZ, as does ARM, DC ZVA.

I'd expect this instruction to be used, at the absolute minimum, by the OS wherever it needs to zero and copy pages, by the standard library's data copy routines, and by the compiler whenever it writes "large" (i.e. cache line or larger) data structures.
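
The bandwidth argument above can be put into rough numbers with a small Python sketch. It assumes a write-allocate cache, a 64-byte line, a buffer that is overwritten in full, and that every dirty line is eventually written back; the function name and the 4 KB page are just illustrative choices.

```python
LINE_BYTES = 64  # assumed cache line size


def memory_traffic(bytes_written, zero_line_first):
    """Bytes moved to and from memory when a buffer is overwritten in full.

    Without a line-zeroing instruction, each line is first read in
    (write-allocate) and later written back. With one, the read is skipped.
    """
    lines = -(-bytes_written // LINE_BYTES)  # ceiling division
    fill_reads = 0 if zero_line_first else lines * LINE_BYTES
    writebacks = lines * LINE_BYTES
    return fill_reads + writebacks


page = 4096
normal = memory_traffic(page, zero_line_first=False)
zeroed = memory_traffic(page, zero_line_first=True)
print(normal, zeroed)
```

Under these assumptions, overwriting a 4 KB page moves 8192 bytes the usual way and 4096 bytes when the fill is skipped, which is the "twice as much bandwidth as necessary" point.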

"PTE (Page Table Entry) Coalescing is the ability to combine small 4K page tables into 32K page tables, and is a software transparent implementation. This is useful for reducing the number of entries in the TLBs and the queues, but requires certain criteria of the data to be used within the branch predictor to be met."

I think you are misunderstanding what this is about. My GUESS (only a guess) is that it refers to the following.
Academic work was done a few years ago that showed that the way Linux (and probably most other OSs) allocated and deallocated pages meant that, for the most part, contiguous virtual pages remain as contiguous physical pages over reasonably long stretches (say 8 to 16 pages). A consequence of this is that a TLB entry could contain not just the single physical address it refers to but also a length field or something equivalent, saying that this TLB entry holds for this page and, say, the next 5 pages. This would work IF
- the pages all have the same settings and permissions (usually the case)
- the pages are contiguous in physical memory (as I said, usually the case)

The consequence of this is that for fairly minor modifications to the TLB, one manages to double or more the coverage of one's TLB, and that's certainly nothing to be sneered at.
It's possible that an OS that tries to maintain page contiguity could do even better --- the papers I read referred to unmodified Linux.
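
As a rough illustration of the idea, here is a Python sketch that merges runs of pages into single entries when the two conditions above hold. It is only a guess at the mechanism, like the rest of this comment: the page-table layout, the function names, and the `max_run=8` cap (picked to mirror the 4K-to-32K figure in the quoted passage) are assumptions, and real hardware would likely add alignment rules that are not modeled here.

```python
def coalesce(page_table, max_run=8):
    """Merge contiguous pages into (vpn_start, pfn_start, length, perms) entries.

    page_table maps a virtual page number to (physical frame number, perms).
    Pages merge only if they are contiguous both virtually and physically
    and share the same permissions.
    """
    entries = []
    for vpn in sorted(page_table):
        pfn, perms = page_table[vpn]
        if entries:
            v0, p0, n, pr = entries[-1]
            if vpn == v0 + n and pfn == p0 + n and perms == pr and n < max_run:
                entries[-1] = (v0, p0, n + 1, pr)
                continue
        entries.append((vpn, pfn, 1, perms))
    return entries


def translate(entries, vpn):
    """Look up a virtual page number; return the frame number or None."""
    for v0, p0, n, _ in entries:
        if v0 <= vpn < v0 + n:
            return p0 + (vpn - v0)
    return None


page_table = {
    0: (100, "rw"), 1: (101, "rw"), 2: (102, "rw"), 3: (103, "rw"),
    4: (200, "rw"),                   # physical contiguity breaks here
    5: (201, "r"), 6: (202, "r"),     # permissions change here
}
entries = coalesce(page_table)
print(entries)
```

In this example seven pages need only three entries, so the same number of TLB slots covers more than twice as much memory.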

I've no idea what that branch predictor info refers to; but perhaps this is more of the usual x86 BS where you have to deal with some insane corner condition involving self-modifying code. The basic point, however, is obvious --- you get a nice increase in TLB coverage without having to change software, and without the pain of jumping to a larger page size.
I'm really glad to see AMD implement this because I thought it was a nice idea when I read it, and it's basically useful for everyone ---also IBM, also Intel, also ARM --- as long as the OS you're running is not insane. For someone like Apple, where they can fully control the OS, it's especially appealing. (And hell, for all we know they're actually first before AMD, they just never told anyone?)
name99 - Wednesday, August 24, 2016 - link
here we are, this is the paper I was referring to:
http://www.cs.rutgers.edu/~abhib/binhpham-micro12....
Tucker Smith - Thursday, August 25, 2016 - link
I hear much regarding the potential of Zen in comparison to Intel's HEDT procs, but, given AMD's touting of Zen's scalability, can we glean insight into how it will compete in the $100 range against the i3? People have been clamoring for an unlocked 2c/4t. The excitement over the potential to OC via BCLK on the Skylake was huge, the disappointment when Intel reneged on it even larger.

The Kaveri-based Athlon x4 860k and the Carrizo Athlon, the 845, were fine chips under $100, but the limited cache and platform options kinda turned me off. A small Zen proc with one of the new, nicer cooling solutions they're offering on a modern mobo sounds incredibly compelling.

I hear much regarding 8c/16t chips, a lot about potential APUs, but what about that broad middle ground?
iranterres - Thursday, August 25, 2016 - link
Tucker Smith, you made an excellent point. But I think they will launch zen based stuff to compete all across the board
fanofanand - Thursday, August 25, 2016 - link
Zen is the architecture, not necessarily the name of the processor family. They have mentioned the scalability up and down the chain, indicating that they will indeed populate their entire processor line with the Zen architecture. It's impossible to know how well they will scale until they are in independent testers' hands, but I would imagine they have learned quite a bit from their Jaguar cores and should be able to put together a compelling offering in the sub-$100 range.
Outlander_04 - Thursday, August 25, 2016 - link
AMD sell APUs with disabled graphics cores already, as well as a range of 2 module APUs with minimal graphics.
That is the ground you are talking about surely?
alpha754293 - Tuesday, August 30, 2016 - link
It WOULD be interesting to see how they perform in floating point intensive benchmarks compared to their Intel counterparts, given the architectural differences between the two companies' approaches.
tipoo - Wednesday, August 31, 2016 - link
Last table - >2MB/cire
