"Although I've worked on x86 obviously for 28 years, it's just an ISA, and you can build a low-power design or a high-performance design out of any ISA. I mean, ISA does matter, but it's not the main component - you can change the ISA if you need some special instructions to do stuff, but really the microarchitecture is in a lot of ways independent of the ISA. There are some interesting quirks in the different ISAs, but at the end of the day, it's really about microarchitecture. But really I focused on the Zen side of it all."
Something else to consider is that he's still an AMD employee and therefore can't say anything that would cast serious doubt on the current or future of their x86 offerings. From this perspective, the way he seems to catch himself and try to qualify his statements makes a lot of sense.
Does it? Isn't the M1 Max ridiculously expensive to produce, though?
I don't know anything about x86, but I do know that people were saying that ARM would be competitive with laptops/desktops, and now it is competitive, EVEN WITH x86 EMULATION. So it sure seems like AMD, Intel, Nvidia, etc. have been leaving a ton of performance on the table. Maybe because of the integrated RAM, or the shorter pipelines or whatever - I don't know the details, but suffice it to say, I won't be buying any new AMD/Intel/Nvidia chips until they have the same speed increases and the same performance per watt that the M1 series does.
Apple has (temporarily) set the new standard for CPUs. Never in a million years did I think I'd be saying that. I've never owned an Apple product.
I wouldn't mind paying $400-500 for an APU similar to the Xbox5/PS5, that would be a good start.
"I won't be buying any new AMD/Intel/Nvidia chips until they have the same speed increases and the same performance per watt that the M1 series does."
What do you do with your PC that you need such processing power? Are the applications you use even available on Mac OS? As you said yourself, you have never owned an Apple product.
"I wouldn't mind paying $400-500 for an APU similar to the Xbox5/PS5, that would be a good start."
I don't think you realise that those two APUs consume around 200-250 watts of power when gaming and 50 watts when idling. They aren't that efficient in general, let alone competitive with M1.
Sorry, for some reason I thought you were referring to M1 Max and Pro, as "those two APUs". You make a good point that if efficiency is @flyingpants265's top priority, those console APUs aren't great options. Maybe under full load, they'd compare favorably to a gaming PC of the same vintage, but not otherwise and certainly not against the M1 family.
Anyway, the link to the M1 power measurements provides an interesting basis for comparison.
Remember, Apple doesn't have to sell its chips to anyone. They can bury the chip costs in the price of the whole product, and aside from having large margins to begin with, they will still make a profit. They also reuse a lot of the chips and R&D across products: the A14 went into iPhones and the iPad Air, the A15 into the iPhone and iPad mini, and so on. So they take advantage of the scale of their own platforms.
>I don't think you realise that those two APUs consume around 200-250 watts of power
You know what I think? You are pulling data out of your ass. XSX draws 160-210W of power from the wall. How does the APU alone consume 40 watts higher than total power? GDDR6/NVMe/fans/peripherals/power brick are now generating power?
The APU is consuming 95-135W of power within desktop TDP limit.
I would imagine, at this point, that Apple's new chips, due to their size, are costing about what any other chip of that size might cost. But with the integrated RAM packages on the substrate, the package will cost much more. Of course, the question is just how much that RAM would cost as the usual sticks, plus the sockets, and the extra size of the motherboard to accommodate them. Then we have the same question about the GPU.
The greater efficiency of that leads to a smaller power supply and smaller cooling hardware.
So overall, it could cost less when considered as a system, rather than just as a chip. I’d really love to see the numbers.
Apple has a huge advantage in that they use a better and more expensive process, so it's not like Intel and AMD are leaving performance on the table - they would be 15-30% better if they were also on N5.
> Apple has a huge advantage in that they use a better and more expensive process
I think their A13 used the same TSMC process as Zen 3. Compare those and Apple still wins on IPC and perf/W.
> it's not like Intel and AMD are leaving performance on the table,
> they would be 15-30% better if they were also on N5.
That's an enormous range, and 20% is the highest generational improvement Intel has achieved since Core 2. AMD achieved a bigger generational improvement with Zen 1, but that was basically their "Core 2" moment.
Anyway, no. N5 (or equivalent) will probably not net them even 15%, and that's after accounting for all the uArch improvements enabled by that much larger transistor budget.
N5 vs N7 offers up to 15% (mobile) to 25% (HPC version) improved performance, or 30% improved power efficiency. So 15% is the minimum generational improvement that TSMC itself is marketing for that node, and N5 is a major node. So I'm not sure how you can think that, even with uArch improvements, AMD and Intel would not have better performance on a node similar to Apple's. Zen 3 alone had 15% better performance (average) than Zen 2 at the same power consumption, and that is on the same node as Zen 2. Apple pays heavily to be on the best node possible every year.
You can't just say a new process node will deliver 15% more performance, irrespective of the microarchitecture. There are some assumptions built into that number, which might not apply to these x86 cores. We'll just have to wait and see.
> Zen 3 alone had 15% better performance (average) than Zen 2 at the same
> power consumption and that is on the same node as Zen 2.
First, it wasn't *exactly* the same node. It's a derivative node that AMD also used for the Ryzen 3000 XT-series. Second, Zen 2 was still at lower IPC than Skylake. Some of the gains in Zen 3 were from AMD still playing catch-up.
Sorry, I realized I was thinking about your claim in terms of IPC. I don't believe N5 will increase IPC by 15%, but that's a very plausible target for the combination of IPC and frequency.
This is what should be drilled into people's heads. The ISA is just an interface between software and hardware. How you _implement_ it determines how fast a processor performs.
I have heard that some aspects of x86 are problematic (variable instruction length, for example). Do you think there is a way to switch to fixed instruction lengths and let a translation layer like Rosetta handle the outlier instructions?
Implementation matters, but some ISAs may be easier or more efficient than others to implement. Claims are that ARM is more efficient than x86. Are multi-threaded x86 cores an admission that the front end is a bottleneck? Is there something better than ARM? Do the IP costs of ARM cut into potential profits?
Good points are raised, but most of the sources in that article are pretty old. The most authoritative and comprehensive-sounding is dated 2013. And that study was surely based on chips even older! So, maybe it's about 10 years out-of-date, and we're using it mostly to gain insight into what's coming in the NEXT 5 years or so?
Also, let's not forget that Jim left AMD in like 2016? So, I consider his knowledge on the subject a little bit dated and limited. I mean, you need look no further than the disparity between x86 and Apple's A14 decoder widths to see that x86 is clearly facing issues that Apple isn't.
Next, the talk of "decode power" in ARM doesn't indicate whether power was an issue simply because its slice of the pie was growing out of proportion to its benefit (which is largely determined by what's downstream), or if it was actually growing in a very non-linear fashion. The former can be resolved by simply using a larger pie pan (i.e. newer process node, or maybe just a bigger silicon & power budget), but the latter can become a deal-breaker for ever going past a certain point.
Ultimately, time will tell. If the x86 cores can't catch Apple, I think that'll be the surest sign.
When David Patterson did the Berkeley RISC project, instruction set design mattered a lot. The transistor counts for microprocessors of that era were in the tens of thousands. (The Intel 8088 had 29,000 transistors and the Motorola 68000 had 68,000 transistors.) The Berkeley RISC I and RISC II processors used 44,500 and 39,000 transistors, respectively.
Fast forward to 2012, and you have a quad core Ivy Bridge processor with 1,400,000,000 transistors. That's a huge increase in transistor budget. That means that the instruction set no longer matters very much. The transistors required to decode the instruction stream no longer dominate the cost of the design, even after you account for the fact that there are four cores, meaning that there are four copies of the instruction decoding logic.
You point out that the sources the article uses to document this are pretty old. But transistor budgets keep getting larger. Apple's eight core M1 has 16,000,000,000 transistors. So the choice of instruction set should be even less important today. You mention the difference in decoder widths between the A14 and the current x86 implementations. Mike Clark explains that he wants a balanced design, so to increase the decoder in Zen, you have to beef up everything so that the design is still balanced. He suggests that will happen in future Zen designs. Apple currently has the largest transistor budget, so it's not surprising they currently have the widest decoder.
> The transistors required to decode the instruction stream
> no longer dominate the cost of the design
You're presuming a serial decoder. For an interesting thought experiment, try imagining how you'd build a >= 4-wide decoder for a variable-length instruction set. I think it's safe to say that the area, complexity, and power of the decoder increases as a nonlinear function of its width. And note that the example of 4-wide is per-core. So, however many cores you have, that entire front end would be replicated in all of them.
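To make that serial dependency concrete, here's a toy software model in C++ (the 2-bit length field is an invented simplification, and real hardware does this with logic, not loops). The point is that you can't know where instruction N+1 starts until you've at least length-decoded instruction N, so a wide decoder ends up speculating at every byte offset and then selecting the real starts:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy ISA for illustration only: the low 2 bits of the first byte encode the
// instruction length (1-4 bytes). Real x86 is far messier (prefixes, ModRM,
// SIB, immediates), which only makes the problem below worse.
static int insn_length(uint8_t first_byte) { return (first_byte & 0x3) + 1; }

// Serial decode: trivially correct, but each step depends on the previous
// instruction's length, so it can't be widened as-is.
std::vector<std::size_t> decode_serial(const std::vector<uint8_t>& bytes) {
    std::vector<std::size_t> starts;
    for (std::size_t pos = 0; pos < bytes.size(); pos += insn_length(bytes[pos]))
        starts.push_back(pos);
    return starts;
}

// How wide hardware copes: speculatively compute a length at EVERY byte offset
// of the fetch window (the part whose cost grows with width), then a selection
// step picks out the offsets that are real instruction starts.
std::vector<std::size_t> decode_wide(const std::vector<uint8_t>& bytes) {
    std::vector<int> len_at(bytes.size());
    for (std::size_t i = 0; i < bytes.size(); ++i)   // all of these run in parallel in hardware
        len_at[i] = insn_length(bytes[i]);
    std::vector<std::size_t> starts;                 // the pick/mux network, still a chained decision
    for (std::size_t pos = 0; pos < bytes.size(); pos += len_at[pos])
        starts.push_back(pos);
    return starts;
}

int main() {
    std::vector<uint8_t> code = {0x03, 0x10, 0x20, 0x30, 0x41, 0x00, 0x02, 0x00, 0x00};
    for (std::size_t s : decode_wide(code))
        std::printf("instruction starts at byte %zu\n", s);
}
```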
The other pitfall faced in comparing transistor counts of ancient CPUs with modern ones is that the older chips took many cycles to execute each instruction. So, you're also failing to account for the increase in complexity from simply increasing the serial throughput of the decoders, apart from how many output ports they have.
Another factor you're missing is that the opcode space is very unbalanced, because it was never planned to encompass the vast number of instructions that currently comprise x86. The current opcode space is many times larger than 8088, which IIRC only had a couple hundred instructions.
> Apple currently has the largest transistor budget,
> so it's not surprising they currently have the widest decoder.
Don't confuse the number of transistors they're spending on CPU cores with what they spend on the entire die, since the latter includes cache, GPU, ISP, etc. You should compare only the CPU core transistor counts. I'm sure they didn't publish that, but people have done area analysis and you can probably find some pretty good estimates.
Now, if we look at the backend of their cores, I think they're not wider than AMD/Intel by the same ratio as their frontend. Part of that could be down to the fact that you need more ARM instructions to do the same work, since x86 can use memory operands (even with scaled register offsets), but even accounting for all of that, it looks like Intel and AMD have had disproportionately smaller frontends than what we see in the M1.
I like to think of an array versus a linked list. No matter how much one optimises the latter, there's still that same basic difference: an element can't be accessed randomly, or in O(1). Even if we were to build some table of indexes, it means more work elsewhere. (And we might as well have started off with an array to begin with.)
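A minimal C++ illustration of that analogy (nothing x86-specific about it, just the access-cost difference):

```cpp
#include <cstdio>
#include <iterator>
#include <list>
#include <vector>

int main() {
    std::vector<int> arr(1000, 1);
    std::list<int>   lst(1000, 1);

    // Array: element 500 is at base + 500 * sizeof(int), one address
    // computation -> O(1), and neighbours share cache lines.
    int a = arr[500];

    // Linked list: there's no way to know where node 500 lives without
    // chasing 500 pointers -> O(n), and every hop is a potential cache miss
    // that the hardware prefetchers can't predict.
    auto it = lst.begin();
    std::advance(it, 500);
    int b = *it;

    std::printf("%d %d\n", a, b);
}
```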
In x86, one way to remedy this could be adding length information to the start of each instruction; but that would likely mean breaking the ISA, and so it might be better just to go to a new, fixed-length one. Or take it one step further: a new ISA that encodes the instructions in an out-of-order fashion, meaning dependencies are calculated outside of the CPU, hopefully at compile time. Try to move everything that doesn't have to be done at runtime outside of runtime.
> meaning dependencies are calculated outside of the CPU, hopefully at compile time
I liked this aspect of IA64's EPIC scheme. Not sure how well it actually worked, in practice. People have raised concerns about the additional instruction bandwidth needed to support it.
On the subject of instruction bandwidth, I keep thinking about how much texture compression helped GPUs. I know it's different (lossy vs. lossless), but it still shows that a simple compression scheme could be a net win. Then again, ARM tried something like that with THUMB, and ended up walking away from it.
There are worthwhile ideas in EPIC, especially the notion of moving dependency checking into the compiler; but owing to Itanium's stigma, I think designers will, sadly, steer away from it. It would be nice if they could separate out the good from the bad and absorb some of the ideas, but I fear no one will want to take that chance. Also, breaking compatibility is a big concern, and the only feasible option at present is going to ARM, if they've got to switch, purely because of solid support in Windows and MS's toolsets.
I think there's a lot to be said for this approach. Simplifying the CPU's decoder should free up some area and save energy. You can even have the OS cache the decoded program from one run to the next. It can also store profiling information, so the decoded program can be further optimized, either as it's run or on subsequent runs.
These are things GPUs already do! I think it's what Nvidia's Denver cores do, and someone mentioned maybe Transmeta CPUs did this, as well? It would need OS support, of course.
I think the main challenge is to achieve good density in the decoded stream. There's a definite tradeoff between how much work you want to save the CPU's frontend vs. how much memory bandwidth you want to burn on fetching code.
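To put rough numbers on that tradeoff - the bytes-per-instruction figures below are illustrative assumptions, not measurements:

```cpp
#include <cstdio>

int main() {
    // Illustrative assumptions: x86 code averages somewhere around 3-4 bytes
    // per instruction, while a fully pre-decoded, fixed-width format could
    // easily need 8 bytes or more per instruction.
    const double x86_bytes_per_insn     = 3.5;
    const double decoded_bytes_per_insn = 8.0;

    const double blowup = decoded_bytes_per_insn / x86_bytes_per_insn;
    std::printf("Pre-decoded stream is ~%.1fx larger,\n", blowup);
    std::printf("so ~%.1fx the fetch bandwidth for the same instruction rate,\n", blowup);
    std::printf("and ~%.1fx fewer instructions fit in the same I-cache.\n", blowup);
}
```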
That's what DEC did in FX!32 (https://en.wikipedia.org/wiki/FX!32) and that's what Apple does in Rosetta. It was the whole point behind Transmeta's Crusoe architecture: it was made to be independent of ISA, so an abstraction layer would have to be created for each ISA you wanted to run on it.
While variable instruction length is a problem, I also believe that developers of x86 compilers are largely aware of this and have tried to stick to a consistent subset of the many ways one can encode the same instruction. And I'm also fairly certain modern x86 designs have some cache set aside for already-decoded instructions (a micro-op cache).
There's also an article I found going over the most used x86 instructions and how they're used (https://www.strchr.com/x86_machine_code_statistics...). It's likely this or something similar is used as a reference as to where to optimize.
Why would you want an essentially binary translation layer when the decoder is doing exactly that - decoding variable length to fixed length? For AMD, Macro Ops or Complex Ops are already fixed length, and are then further broken down into Micro Ops.
@dotjaz, there would be a few benefits. First, anything that makes the instruction stream more uniform would help them widen up the decoder, which is currently a bottleneck in the uArch.
Second, remember that the decoder is doing essentially redundant work every time it encounters the instructions at the same point in an executable. Doing the work up-front, like when the code is first read in from storage, is essentially like moving code out of an inner loop. It could enable the pipeline to be shorter, simpler, burn less power, and take a smaller penalty on branch mispredictions.
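As a loose software analogy for that "hoist it out of the inner loop" point (this isn't how any particular CPU works; DecodedOp and decode_slow are invented stand-ins for a fixed-width internal format and the expensive variable-length decode):

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Invented, simplified pre-decoded form -- fixed width, trivially issueable.
struct DecodedOp { uint8_t opcode; uint8_t len; int32_t imm; };

// Stand-in for the expensive, serial, variable-length x86 decode.
static DecodedOp decode_slow(const uint8_t* bytes) {
    return DecodedOp{bytes[0], static_cast<uint8_t>((bytes[0] & 0x3) + 1), 0};
}

// Without any caching, the front end re-runs decode_slow() every time the same
// address is fetched -- i.e. on every iteration of a hot loop. With a cache
// keyed by instruction address, the expensive work runs once per address:
static DecodedOp fetch(uint64_t ip, const uint8_t* code,
                       std::unordered_map<uint64_t, DecodedOp>& uop_cache) {
    auto hit = uop_cache.find(ip);
    if (hit != uop_cache.end()) return hit->second;  // common case inside a loop
    DecodedOp op = decode_slow(code + ip);           // slow path, once per address
    uop_cache.emplace(ip, op);
    return op;
}

int main() {
    const uint8_t code[] = {0x01, 0x00, 0x02, 0x00, 0x00};  // two fake instructions
    std::unordered_map<uint64_t, DecodedOp> uop_cache;
    for (int iteration = 0; iteration < 1000; ++iteration)  // the "inner loop"
        for (uint64_t ip = 0; ip < sizeof(code); ip += fetch(ip, code, uop_cache).len)
            ;  // decode_slow() only ever runs twice, on the first iteration
    std::printf("cached %zu decoded ops\n", uop_cache.size());
}
```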
Lastly, any savings of silicon area can be re-invested elsewhere, either in the form of beefing up other parts of the uArch, in adding more cores, or simply in reducing the cost and therefore hopefully the price to customers. If they take the silicon savings to the bank, then at least less area can enable some combination of lower power and higher clocks.
Another potential benefit would be the ability to do some types of code transformations that are too expensive to do in hardware. Perhaps this could be guided by dumping branch-prediction state, since compilers are traditionally very bad at predicting the control flow of code.
I remember back in the late 90's, DEC Alpha had x86 emulation support that would do some JIT translation and would supposedly optimize the code as you continued to run the program. That's the first time I heard of such an idea. I do wonder whether today's Javascript JIT compilers do any sort of profile-driven optimization.
There are many pioneering ideas that happened at companies like DEC and IBM, decades before they became mainstream. I once talked with a processor architect who said he felt like they were blazing new ground at the place he worked in the 90's, only to later find out that IBM was doing some of the same stuff as far back as the 60's and 70's.
It's not efficient to predecode:
1) if the decode stages create larger instructions of fixed width;
2) the decoded forms are very hardware-specific;
3) there are already dynamic components for address resolution involving load/store;
4) it gets even more dynamic when referencing results of previous instructions that are only available at runtime;
5) it creates more security risks;
6) there's extra power involved to store this 2nd copy in RAM and then read it back into the CPU, on top of the 1st read & precompile.
> It's not efficient to predecode:
> 1) if the decode stages create larger instructions of fixed width;
Maybe there's a cheap compression scheme that can be handled more efficiently than decoding x86-64 instructions. You could even put the decompression in the datapath that instruction cache misses must traverse, so that it's all decompressed by the time it gets into a cacheline.
> 2) are very hardware specific;
Why? The decoder can still live in hardware/microcode of the CPU. The OS can treat it as a black box, much as it does for GPU shader compilers.
> 3) there are already dynamic components for address resolution involving load/store;
And yet, somehow micro-ops already deal with this.
> 4) more dynamic when referencing results of previous instructions only available at runtime;
Self-modifying code will need to be handled via a slow-path that does the decoding inline (as today). This can still be a purely hardware path, but you wouldn't need to devote quite as much silicon to it as we currently do.
> 5) creates more security risks;
Such as? Code is sitting in memory, either way.
> 6) there's extra power involved to store this 2nd copy in RAM and then read it
> back into the CPU, on top of the 1st read & precompile.
What circumstances would give rise to that? If the intermediate format is faster to decode than the x86 instruction stream, that suggests it would also require less power to do so. There's no way the x86 ISA is so close to optimal that a real win can't be had by doing some amount of preprocessing; otherwise, I'm left wondering whether the CPU designers are really trying hard enough.
Certainly, doing the decoding in the CPU and stashing it somehow, would make it all easier. The other approach of moving the decoder into the OS would be, as they say, disastrous. Breaking abstraction and tying itself to a CPU's hidden format; and that would mean different formats for different vendors too. Well, vendors could provide this layer in the form of a driver; but already it seems like a great deal of work, and not very elegant.
> Certainly, doing the decoding in the CPU and stashing it somehow, would make it all easier.
Yeah, I think the decoder should still live in the CPU. By default, decoding would happen inline. However, the OS could run the CPU's decoder on a block of code and translate it. Then, switch the process over to running the intermediate format from the new image.
Again, this isn't really blazing new ground. Transmeta and Nvidia's Denver have done this before. That means there could already be some level of OS support for it!
I see what you're saying. The CPU exposes its decoding system through an API, and the OS handles it similarly to managed programs.
Crusoe's idea was interesting. No doubt, implementing emulation in an OS is pretty hard work. (It took some time before MS, for example, added x64 to Windows ARM.) I wonder if Crusoe's idea of a generalised layer would be easier. Given enough time, as more translation paths are added, we might be able to run any program on any ISA, even for paths that aren't directly coded. Just go to intermediate format, and from there to destination ISA. And native programs running at full speed. Then again, I suppose that's what Java, .NET, and co. already do, and they're very convincing.
ISA is ultimately a limiting factor. It's not the only factor, but it influences the shape of the perf/W or perf/area curve and where the points of diminishing returns are.
It's one thing to say "ISA doesn't matter" and another thing to prove it. Why is there no x86 core that has the same efficiency as Apple's cores? Build it and I'll believe it.
BTW: Apple's efficiency comes from the relatively low frequency that their cores run at, and it's not just TSMC's N5 process. Intel, AMD and Apple have, give or take, the same performance, but while Apple's cores run at 3.2 GHz, Intel's and AMD's need about 5 GHz to deliver the same performance. If you built Intel or AMD cores on the same N5 and ran them at 5 GHz, they would draw much more power than at 3.2 GHz.
And this relationship between clock speed and power isn't new. Ask the folks at Intel who designed the P4. So again: why is there no x86 core that only needs to run at 3.2 GHz for the same performance? Strange, isn't it?
> while Apple's cores run at 3.2 GHz, Intel's and AMD's need about 5 GHz to deliver the same performance.
How do the cores compare in terms of area? Because Intel and AMD have both designed cores that they sell into the server market. So, if Apple is delivering less performance per transistor, maybe that's a reason Intel and AMD didn't increase the width of their uArch at the same rate as Apple.
I should've added that the reason area is relevant for the server market is due to the high core-count needed for servers. So, if Apple can't scale up to the same core counts, due to the area needed by their approach, that could be a key reason why Intel and AMD haven't gone as wide.
Also, cost increases as a nonlinear function of area. So, perf/$ could be worse than their current solutions, on the nodes for which they were designed.
- AFAIK most PCs sold today are laptops, and Intel lost Apple's business because Apple was not satisfied with the efficiency of Intel's (mobile) CPUs. Intel should have enough resources to develop different cores for different applications. The loss of Apple as a customer not only lost them money, it was also bad PR.
- x86 CPUs have to clock down for thermal reasons in all-core loads anyway, so they could be designed to run at lower clock speeds from the beginning - pretty much what is needed in a laptop. BTW: the new E-cores in ADL are much smaller and seem to be very efficient both in area and thermals. Maybe it's a starting point for a new P-core. The "Core" architecture was derived from the Pentium M, which was "only" a mobile chip at the beginning but evolved into much more. Funny enough, Apple's stuff originated in mobile as well, so it seems that it makes more sense to focus on efficiency first and then scale performance. Both Intel and AMD seem to have done it the other way round in recent years.
- Looking at a lot of die shots, I have the impression that caches and uncore are larger than the cores themselves, so I'm not sure area is the culprit. And chiplets mitigate that anyway for high-core-count CPUs.
It will be interesting to see what AMD can achieve with Zen 4. I don't expect it to reach Apple's IPC, but it should be closer than today.
> I've always felt that Gracemont will lead to their main line of descent.
It depends on what people want and are willing to pay for. There are currently Atom server CPUs with up to 16 cores. If the version with Gracemont is popular (maybe with even more cores), then they'll probably expand their server CPU offerings with E-cores.
> Intel should have enough resources to develop different cores for different applications.
They do. The Atom-line was for low-cost and low-power, while their other cores were for desktops, mid+ range laptops, and most servers.
What they haven't done is a completely separate core for servers. If they did that, then maybe they could make the desktop/laptop cores even bigger, although that would make them more expensive so maybe not.
> x86 CPUs have to clock down for thermal reasons in all core loads anyway
You're thinking of desktops & laptops, but not servers. Servers' clocks are limited by power, not thermals.
> the new E-cores in ADL are much smaller and seem to be very efficient both in
> area and thermals. Maybe it's a starting point for a new P-core.
They've recently done complete redesigns of their P-cores. I forget which one was the last redesign, but it might be the very P-core in Alder Lake. They don't need Gracemont as a "starting point". If Golden Cove's efficiency is bad, it's because they *decided* not to care about efficiency, not because of legacy.
> it seems that it makes more sense to focus on efficiency first and then scale performance.
> Both Intel and AMD seem to have done it the other way round in recent years.
Apple didn't *only* focus on efficiency, in case you haven't been following their CPUs. They have consistently had far better performance than their competitors.
And neither AMD nor Intel can afford to come out with a lower-performing CPU that's simply more power-efficient. I mean, not in the main markets where they play. Obviously, Intel sells cost/efficiency-oriented CPUs for Chromebooks, and it sells specialty server CPUs based on the same "Mont" series cores.
> Looking at a lot of die shots, I have the impression that caches and uncore are
> larger than the cores themselves
That's not very helpful. You need to find someone who has done the analysis and made credible estimates of the number of transistors in the cores.
Fantastic interview and reading! As always, it's a joy to learn more about Zen. Thanks, Ian. And hats off to Mike and everyone at AMD for the brilliant job they did and are still doing. Well done. Keep up the good work!
What a great interview! Loved the questions and the answers! Clark hits all the right notes, from describing the massive teamwork building an excellent CPU requires, to his excitement and genuine enthusiasm for what he sees coming down the pike. I loved this:
"If we don't do it, someone else will."...! AMD is in a good place with engineers like this, imo. If you don't believe in yourself, your company and your products--well, that doesn't lead to the best possible outcome, let's say. I also liked what he said about "machine learning" (and "AI", although it wasn't specifically mentioned), that all of it boils down to efficient, high-performance computing. I've heard Papermaster say the same thing. Buzzwords carry little weight with CPU engineers (thankfully)..;)
I think this is certainly one of the best--if not the best--Q&A interview I've seen from AT. Good job, Ian. And of course, Kudos, Mike!
Great to hear the story of engineering a CPU. But there were no questions about hybrid big.LITTLE designs, which are a hot topic now. It would be great to bring in a lead architect from Intel to get insights into how long and how hard it is to design these cores.
@Ian, wow! You've been getting lots of interviews recently. Any particular reason that the various companies are willing to let you talk with their tech experts now of all times?
Alignment of development/release cycles, maybe? Plus, if one company is making their Sr. leaders available, other companies have incentive to make sure they get their own message out at the same time.
Superb interview! I've seen a lot of comments here mention the difficulties of decoding the variable length instructions of x86, but as we've heard from Jim and now Mike, there are ways to work around or mitigate this. I haven't seen anyone here ask if the fairly strict memory ordering model of x86/x64 is a bigger problem for usefully decoding in parallel? Maybe it's a dumb question, but it looks like ARM's soft memory ordering allows for much greater parallelism with memory loads/stores - in particular the ability to recognise streaming loads and thus not have to put barriers on writes. Apologies if I have the terminology wrong - I'm a n00b in this area. I could imagine there being difficulty decoding lots of memory operations in parallel if you can't tell whether they are subject to aliasing, alignment issues, or barriers. These things should be decoupled but I'm not sure to what extent they can be fully decoupled when you also have guarantees on ordering.
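To make the contrast concrete, here's the classic message-passing litmus test as a portable C++ sketch; the interesting part is what the two marked operations compile down to on each ISA (noted in the comments):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Message-passing litmus test.
//   x86-64:  both marked operations are plain MOVs -- the strong (TSO) model
//            already guarantees this ordering, so the core has to enforce it
//            for every store and load, whether the program needs it or not.
//   AArch64: they become STLR/LDAR (or DMB barriers) -- the weak model only
//            pays for ordering where the program explicitly asks for it.
std::atomic<int> data{0};
std::atomic<int> flag{0};

int main() {
    std::thread producer([] {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_release);             // "publish"
    });
    std::thread consumer([] {
        while (flag.load(std::memory_order_acquire) == 0) {}  // "observe"
        std::printf("data = %d (guaranteed to be 42)\n",
                    data.load(std::memory_order_relaxed));
    });
    producer.join();
    consumer.join();
}
```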
In particular, I think the comments by deltaFx2 are worth reading. They sound as if this person has some detailed knowledge of x86 CPU design, and even suggest having done some comparative analysis against the ARM64 ISA.
I see that was almost exactly 1 year ago. Coincidence?
According to Google, that poster only commented on Ryzen, EPYC, and a few ARM server CPU articles (Ampere, Applied Micro X-Gene 3, and Qualcomm Centriq). Interesting.
I doubt that's a complete list, since I didn't get a hit on the comments in that Papermaster interview. However, as I'm looking through the search hits, it seems none of them are on the dedicated comments pages. Maybe Google doesn't index those. The hits all seem to be on one of the handful of comments visible at the bottom of the article pages.
I also just searched the forums, and deltaFx2 doesn't seem to have a forum account. Apparently, neither do I... ?
Yeah it's a quality commenting system (as you say below). I have no idea who deltaFx2 might be. The best discussion of these things tends to be over at RWT (which also has a rubbish forum layout, but it's better than this). I miss AcesHardware so bad at times like this.
So I went and read delta's comments. They're a bit of a knob, aren't they? "Deeper buffers and just speculate around it" was their opinion. I doubt that's entirely true - look at Spectre/Meltdown. Memory speculation can't just be done on everything.
deltaFx2> Are you sure you know what you're talking about, because it doesn't look like it.
Well, I'm no CPU designer. I can imagine if a noob were challenging me on a subject where I have some expertise, I might find it a little tiresome, but it still didn't need to be quite that blunt.
In that same post, they point out:
deltaFx2> ARM has some interesting memory ordering constraints too,
deltaFx2> like loads dependent on other loads cannot issue out of order.
deltaFx2> Turns out that these constraints are solved using speculation
deltaFx2> because most of the time it doesn't matter. x86 needs a larger
deltaFx2> tracking window but again, it's in the noise for area and power.
I think you're right that it's a rather dismissive answer. Still, I'm in no position to argue.
I also think you can't just reduce x86 memory ordering to a simple tracking problem, since transactions need to be committed in an order that probably isn't optimal. The order can't be changed, since it will break some multithreaded code.
This has been around for a couple of decades at least.
"I also think you can't just reduce x86 memory ordering to a simple tracking problem, since transactions need to be committed in an order that probably isn't optimal. The order can't be changed, since it will break some multithreaded code."
Remember, the rule is Total Store Ordering (TSO). Stores are always committed in program order regardless of ISA. Loads that snoop those stores may go out of order. As long as they don't observe stores out of order relative to a strict in-order machine, it's all ok. A CPU with weak memory ordering rules will allow loads to go out of order by design and ISA spec. A CPU with strong memory ordering rules allows loads to go out of order in the microarchitecture but enforces a check to ensure it satisfies the ISA spec. If whatever security loophole applies to a strongly ordered core, it also applies to the weakly ordered core. I don't see a difference. But if someone believes otherwise, write a proof of concept and have it published. Burden of proof is on you. Allowing loads to go out of order can cause leaks of information (Spectre V4) inside the same thread. It is ISA independent, and affects every CPU manufacturer.
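For anyone who wants to see the one reordering TSO does permit, here's the classic store-buffering litmus test as a C++ sketch (the "impossible" 0,0 outcome is allowed, and observable if you run the race many times):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Store-buffering litmus test: each thread's later load may complete before
// its earlier store has drained from the store buffer, so r1 == r2 == 0 is a
// legal outcome on x86 as well as on ARM. Only seq_cst operations (which emit
// a full fence on x86) forbid it.
std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

int main() {
    std::thread t1([] { x.store(1, std::memory_order_relaxed);
                        r1 = y.load(std::memory_order_relaxed); });
    std::thread t2([] { y.store(1, std::memory_order_relaxed);
                        r2 = x.load(std::memory_order_relaxed); });
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d (0,0 is permitted on both x86 and ARM)\n", r1, r2);
}
```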
Am I the only one seeing a big bottleneck in memory latency? It doesn't seem to be getting better generation to generation, and the people you talk to don't seem to do it justice.
Or maybe I'm wrong and it's not that important? Somebody prove me wrong, then.
Welcome! This commenting system is like something straight out of the 1990's, but perhaps it has something to do with the reason we seem to have a bit fewer than the average number of trolls? Anyway, the lack of an editing capability encourages one to post with a bit more care.
Memory latency for DRAM DIMMs seems to remain relatively fixed, in terms of the absolute number of nanoseconds. How much impact it has is very much application-dependent.
One form of mitigation employed by CPUs is increasingly intelligent prefetching of data before the code actually needs it. Of course, that only helps with fairly regular access patterns. There are also instructions which enable software to explicitly prefetch data, but even those don't help when you're doing something like walking a linked list or a tree.
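As a sketch of what those explicit prefetch instructions look like from software - this uses GCC/Clang's __builtin_prefetch, and the 16-element lookahead is just a guessed starting point that would need tuning:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Regular, strided access: software can hint the cache hierarchy ahead of time.
// GCC/Clang lower __builtin_prefetch to the ISA's prefetch hint (PREFETCHT0 on
// x86, PRFM on AArch64).
static double sum_array(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/3);
        sum += a[i];
    }
    return sum;
}

// Pointer chasing: nothing useful to prefetch, because the address of the next
// node isn't known until the load of the current node completes.
struct Node { double value; Node* next; };
static double sum_list(const Node* p) {
    double sum = 0.0;
    for (; p != nullptr; p = p->next) sum += p->value;
    return sum;
}

int main() {
    std::vector<double> a(1 << 20, 1.0);
    std::printf("array sum = %.0f\n", sum_array(a.data(), a.size()));

    std::vector<Node> nodes(1 << 20, Node{1.0, nullptr});
    for (std::size_t i = 0; i + 1 < nodes.size(); ++i) nodes[i].next = &nodes[i + 1];
    std::printf("list sum  = %.0f\n", sum_list(&nodes[0]));
}
```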
Another mitigation comes in the form of ever-burgeoning amounts of cache (see AMD's V-Cache [1]).
To help hide stalls on cache misses, CPUs are incorporating ever larger out-of-order windows. SMT enables one or more other threads to utilize more of the CPU's execution units, when a thread sharing the same core stalls. GPUs employ this technique to an extreme degree.
Then, there's in-package memory, such as the LPDDR5 memory stacks in Apple's latest additions to the M1 family [2]. On this point, Intel has announced a variant of their upcoming Sapphire Rapids server CPU that will feature in-package HBM-type DRAM [3]. I think details are still thin on how it'll be seen and supported by software (e.g. as an L4 cache or as separately addressable storage).
To round out the solution space, Samsung is embedding some (presumably simple) computation in its HBM memory, for deep learning applications [4].
'This commenting system is like something straight out of the 1990's, but perhaps it has something to do with the reason we seem to have a bit fewer than the average number of trolls?'
It absolutely reduces trolling because it doesn't create the bloodsport that is voting and post hiding.
That is shunning, one of humanity's oldest methods of blocking innovative ideas. New ideas are resisted by most people, so the community will side against the person having their posts smothered. It will mistakenly view this as a righteous solution to the problem of the annoying unpleasantness of ideas that run counter to preconceived notions.
It is no wonder that these censorship systems keep infecting websites, with those pushing them actually making statements (as a Disqus employee did) that the way to improve a community is to increase the amount of censorship.
I've long wondered if the biggest problem with voting features, in commenting systems, isn't that it costs the voter nothing. Like, maybe if you had to actually spend some of the credits you received from up-votes of your own posts to vote someone else up/down, people would be a bit more thoughtful about it.
However, I think what's come to light about the sorts of posts that are frequently shared pours cold water on that notion. When someone shares a post, they're spending (or gaining) social capital, of a sort. And based on the kinds of emotive posts that tend to go viral, I think we can say that even voting systems that are costly to the voter are still likely to result in undesirable behaviors. Perhaps even more so than free votes, since posts that inform but don't inflame are then less likely to receive upvotes.
mode_13h, I think there's an emotional effect at work in liking, voting up and down, etc. Dopamine is likely being released, which makes them seek that action again and again. At a physical, Diablo-like level. Then, go a bit higher, and it's the seeking out of those things which reaffirm their emotional opinions. Throw in some controversial, dividing issue, and the tribal spirit will come out in all its monstrous glory.
I'd also say that people like/approve what they know is "accepted" by current opinion, even if their view, internally, was different. For my part, I feel the like button, of which voting up and down is a specialisation, should be banished from society.
> For my part, I feel the like button, ... should be banished from society.
On another technical forum I visit, they have up-votes but no down-votes. I find the up-votes are useful when someone makes a specific factual claim, because it helps you know how much consensus there is around those claims. So, I think it's not all bad.
Oxford Guy, in this "progressive" world of today, censorship is on the rise, paradoxically. The idea is controlling what people think and speak so that they fall in line with the prevailing orthodoxy. That's why we need to preserve opinions that we don't even agree with. As the physicists would say, no preferred frames of reference.
I think you're referring to "cancel culture"? I think we shouldn't confuse that with government censorship. I don't like either, but I'd take cancel culture waaay before I'd accept government censorship. And sadly, with the rise of authoritarianism, I think *both* are on the rise.
In general I meant it, but you're right, cancel culture does fall under this. All I can say is, the despotic principle is always at work, even when the intentions start off good. We may begin as sincere, plant-based activists, but end up forcing the entire world to eat the way we eat, because we're right.
Thank you. And I was actually referring to the Ryzen family having a memory access penalty, which I think comes from the Infinity Fabric, a suboptimal memory controller, or something of that fashion.
Back when the first two Ryzen generations came out (Zen 1 and Zen+), I was hoping Zen 2 would help latency, as I was seeing it as a big bottleneck compared to the competition. Now we have chiplets and another layer of Infinity Fabric to connect everything, and only larger L3 sizes to hide the increased latencies. But since L3 isn't always helpful, this... makes me worried. I'm still hoping for improvements in this regard, but where everyone is going doesn't seem to line up with it.
Well, looking at the original Zen 3 performance review, it seems that the DRAM access latency is only about 10% worse than Comet Lake. However, between the size of L3 and some of the mitigations I mentioned, such as prefetchers, software rarely sees that penalty.
Sadly, I don't see the same cache-and-memory latency measurements for the 5000G-series. However, the benchmarks of 5600X vs. 5600G show the former wins almost across the board. So, I decided not to worry about the extra I/O die and will probably get a 5600X.
Yes, the 8-core CCX and adjusted prefetchers improved things quite a bit, and the large L3 works even better now; but sub-45ns memory latencies are quite tasty and only achievable on Intel platforms, it seems (I have no reliable way of testing them myself, so I have to trust things like AIDA64 to measure it). And yeah, prices aside, I'd crave the 5600X too. It seems to be the best all-around, except maybe the new Intel stuff coming soon (we'll see).
Latency-wise, I like some of the ARM approaches, like starting a request to memory in parallel with the cache probe (and cancelling it on a hit) - at least that's how I understood it. This is from, like, the A76 presentation? Maybe A77, I can't remember. Anyway, if that's not a thing on desktop, I believe it should be. And Ryzens, with their large caches and laggy memory accesses, should benefit considerably.
> sub45ns memory latencies are quite tasty and only achievable on Intel platforms it seems
That's not what the link I posted above shows. There's a plot of latency vs. size, in the section "Cache and Memory Latency". Underneath the plot, you can click buttons that switch the view to the test results of other CPUs. One of the options is "Core i9 10900K". When you click it, it shows "Full Random" latency peaking at 70.814 ns.
If you believe that's incorrect, please share your source. I suspect whatever test you're looking at didn't reach sizes or access patterns that went entirely beyond the effects of the cache hierarchy.
I don't have the same testing package, so I can only measure (and share) my own overclocking results with the tools I have. And yes, I believe AIDA64's patterns are somewhat more predictable, but it favors Intel nonetheless (which was backed by gaming and web performance before Zen 3, which seems to have finally caught up).
Also, yes, I know it's overclocking, and comparing against stock system results isn't really fair, but it's tuned system performance that matters to its user, right? And I perform some basic optimisations on any system I get access to (even if it's only a slightly reduced voltage all around and adjusted memory timings), so I'm interested in potential just as much as stock performance.
While memory transfer rates have increased a lot in the last three or four generations, the actual wait time for random access (the total nanoseconds until the first bytes arrive) did not. The solution to this was to increase cache size and cache efficiency.
Also, the processor knows it needs certain data (a memory address) early (in the instruction decoding cycle), but it actually needs the data late (in the execution cycle). Processors keeping approximately the same frequency as ten years ago means that the time between "I know I will need memory X" and "I need memory X _now_" is about the same. Here is a useful chart: https://www.anandtech.com/show/16143/insights-into...
Basically, DDR4-3200 has about half the latency of SDR-100 (worst case: double the transfer speed for a single byte) but roughly 30 times the total transfer rate (best case: sequential reads). As for importance... it's important for some and not so important for others. You can drive memory to faster speeds, but you increase total power, and that extra power might come out of the processor's power budget, which might be acceptable or might not.
Also, there's a question of cost, too - higher speeds are not free - in terms of memory cost ($), mainboard cost ($), power use (W), heat generated (degrees Celsius/Fahrenheit), ...
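For reference, the rough arithmetic behind the SDR-100 vs. DDR4-3200 comparison above, using typical timings (SDR-100 CL2, DDR4-3200 CL16) as assumptions:

```cpp
#include <cstdio>

int main() {
    // Assumed, typical parts: SDR-100 at CL2 vs DDR4-3200 at CL16 (both 64-bit channels).
    // CAS latency in ns = cycles / memory clock; DDR's I/O clock is half the transfer rate.
    const double sdr_clock_mhz  = 100.0,  sdr_cl  = 2.0;
    const double ddr4_mt_s      = 3200.0, ddr4_cl = 16.0;

    double sdr_cas_ns  = sdr_cl  / sdr_clock_mhz * 1000.0;      // 20 ns
    double ddr4_cas_ns = ddr4_cl / (ddr4_mt_s / 2.0) * 1000.0;  // 10 ns

    double sdr_bw_gbs  = sdr_clock_mhz * 8.0 / 1000.0;          // 0.8 GB/s
    double ddr4_bw_gbs = ddr4_mt_s     * 8.0 / 1000.0;          // 25.6 GB/s

    std::printf("CAS latency: SDR-100 %.0f ns vs DDR4-3200 %.0f ns (%.1fx better)\n",
                sdr_cas_ns, ddr4_cas_ns, sdr_cas_ns / ddr4_cas_ns);
    std::printf("Peak bandwidth: %.1f GB/s vs %.1f GB/s (%.0fx better)\n",
                sdr_bw_gbs, ddr4_bw_gbs, ddr4_bw_gbs / sdr_bw_gbs);
}
```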
I hope I didn't confuse anyone with the typo around Cache and AMD's V-Cache. Also, the part where I went straight from OoO windows to talking about SMT (different topic).
If I'd had more time, I should've cited OoO window sizes for A15, Zen3, and Willow Cove (Tiger Lake), but people can look those up. It seems like they're getting big enough to cover a significant amount of the time needed for a full L3 cache miss, though that assumes there's much work the code can even do without the data.
I think SMT is a more elegant solution to this problem. It's too bad about side-channel attacks, because 4-way SMT should really help boost the perf/area of server CPUs, while still allowing for strong single-thread performance. SMT also helps with other problems, like decoding bottlenecks, branch mis-predictions, and general lack of ILP (instruction-level parallelism) in some code.
Whoever writes/transcribes these interviews should be replaced. Is it speech-to-text software or "Dr" Ian himself? So many errors it's fucking infuriating.
That's awfully disproportionate. There were indeed a couple points where I could tell what was probably said and it wasn't accurately reflected by the transcription, but the distinction didn't much matter in the context of the surrounding text. I certainly didn't get the sense that any quality issues with the transcription compromised the interview to any significant degree.
I think we probably need to accept that this is what advertising-funded publishing can offer our audience, in this day and age. Take it or leave it. I know I'd sure rather have the interview as-written than not at all!
So no one bothers reading the transcript of an interview at least once to fix typos because "ads don't pay enough" and it doesn't bother people like you? Talk about justifying mediocrity with the energy of a supermassive black hole. And your underlying assumption that the effort to read the article once is greater than the will of the people behind it can only mean those people both hate their jobs and are bad at them.
> So no one bothers reading the transcript of an interview at least once to fix typos
> because "ads don't pay enough"
I take it back. I am not in a position to say what their limiting factor is. Could be budget, could be deadlines... I don't honestly know.
What I do know is that you're being too harsh, at least in your tone. It's fair to point out the errors. But, the tone you struck will not help you find a sympathetic ear. So, please try to have some decorum.
I think the one takeaway I got from this article was how long it takes, from conception to delivery, to design a new CPU: roughly 5 years, at least following Zen's history as an example.
In that context, I'm curious about Intel's upcoming Alder Lake. Back in 2015, Apple came out with their A9 processor that finally reached IPC parity with Intel's current best, Broadwell. I remember several articles written comparing Apple's seemingly disparate chip to that of Intel's, and actually discovering that the workloads, and use cases, accomplished on the two CPUs were not so different after all (iPad vs Core M tablets). That was roughly five years ago, and now we're seeing Intel coming out with a hybrid design in Alder Lake, similar to what Apple/ARM have been offering the past several years.
Now, I'm not fully convinced Intel has nailed this design, it's far too soon for that conclusion. However, I am rather of the strong opinion Apple has gotten it VERY right in their development the past 9 years, ever since their own custom silicon debuted with the A6 in 2012.
So that being the case, one of the glaring Qs posed by Ian to Mike Clark in this article was his opinion on the hybrid design versus Zen's approach of just scaling the core up and down to meet TDP demands. I was not impressed with Clark's answer, which I felt didn't amount to much of one at all. He kind of just gave vague answers, saying that it's complicated and that they designed for that up front. To me, that sounds eerily similar to Intel's view on the matter way back in the late 2000s, when they were asked about the big.LITTLE concept. Intel thought back then that they could just dynamically adjust frequency and voltage to produce similar results. Now we're seeing them backpedal and embrace the hybrid design.
So is AMD stuck in an antiquated mindset that's been clearly proven by the industry as a whole to be wrong? Or do they have their own hybrid design in the works? Maybe they'll resurrect K12 and integrate it into a future Ryzen design? Either way, thanks Ian for asking that Q!
I think AMD was in the position that they couldn't afford to design different cores for different markets. So, they had no option but to scale up and down.
They're not new to the concept, as you probably know. They had the Bobcat, Jaguar, and Puma cores, which had a distinct uArch from Bulldozer, Piledriver, Steamroller, and Excavator. We'll just have to see what they do, going forward. He can't tip their hand before anything has been announced.
"... as you probably know. They had the bobcat, jaguar, puma CPUs, which had distinct uArch from bulldozer, steamroller, piledriver, excavator."
I actually did not know this! I knew the names of all these cores, but I never knew the Bobcat line was different than the Bulldozer line, in the same vein as Atom vs Core like Intel! In my mind, I kind of just thought they were all the same module-based pseudo SMT architecture that AMD got panned for pre-Ryzen! Thanks for educating me!
Now here's an interesting thought: do you think Alder Lake was a play to save Intel & Apple's marriage? It could be just a reaction to Qualcomm & other ARM vendors pushing into the laptop market, but it'd be more interesting if Intel was trying to ply Apple with offerings to string out their marriage just a little longer.
On a related note, I've wondered if the Iris Pro GPUs with eDRAM were meant to try and lure any console vendors. Like for the Xbox One / PS4 generation. And then Intel just decided to bring it to market anyhow. Maybe even scaled back from what they originally designed.
Apple: "It's not you, dear. It's me. I just can't take it any more and am filing for divorce."
Intel: "It's got to do with that ARM fellow, isn't it? He always seems to be around. Your silence speaks a mouthful. Well, what about Ann and Timmy? What's this going to do to them?"
Apple: "They'll be staying with Aunt Rosetta as we go through this period."
Intel: "Aunty Rosetta! Darling, I'm a changed man. There, there, doesn't it feel good to be in my arms? I've cut down my watts. My nanometres have been upgraded to 10. Why, I'm even working on big/little. I call it Gracemont, the mindfulness within me."
I flatter myself commenting in this lofty ~debate, but AFAICT, Bulldozer seems widely reviled in AMD folklore, yet the more I learn of it, the more it seems to contain the genesis of the truly inspired architecture that became Zen & Infinity Fabric.
As I say, I am a mere newb, but there was a lot that looked familiar in a 2014 Kaveri APU to me.
Is it just me or another example of how wrong conventional wisdom can be?
mode_13h - Tuesday, October 26, 2021 - link
Congrats to AMD and everyone involved with Zen! Here's to 5 more years of x86 dominance!Thanks, Ian. It's great to get more of Zen's backstory. Also, special thanks for asking about K12!
mode_13h - Tuesday, October 26, 2021 - link
One thing that confused me: the "Enzo" comments. I got around to searching, and I gather he meant ensō - the hand-painted circle in the Ryzen logo.sheh - Friday, October 29, 2021 - link
Thanks. My searching failed on that one. :)dickeywang - Tuesday, October 26, 2021 - link
Hey, new Anandtech article about AMD. Another great opportunity to get a new photo of Cutress!sirmo - Tuesday, October 26, 2021 - link
What a great dude, and an awesome interview Ian! Thank you!blanarahul - Tuesday, October 26, 2021 - link
Although I've worked on x86 obviously for 28 years, it's just an ISA, and you can build a low-power design or a high-performance out any ISA. I mean, ISA does matter, but it's not the main component - you can change the ISA if you need some special instructions to do stuff, but really the microarchitecture is in a lot of ways independent of the ISA. There are some interesting quirks in the different ISAs, but at the end of the day, it's really about microarchitecture. But really I focused on the Zen side of it all."blanarahul - Tuesday, October 26, 2021 - link
This sentence makes Apple's achievements with M1 Max even more impressive.mode_13h - Tuesday, October 26, 2021 - link
That's one way to look at it.Something else to consider is that he's still an AMD employee and therefore can't say anything that would cast serious doubt on their current or future of their x86 offerings. From this perspective, the way he seems to catch himself and try to qualify his statements makes a lot of sense.
flyingpants265 - Tuesday, October 26, 2021 - link
Does it? Isn't the M1 Max ridiculously expensive to produce, though?I don't know anything about x86, but I do know that people were saying that ARM would be competitive with laptops/desktops, and now it is competitive, EVEN WITH x86 EMULATION. So it sure seems like AMD, Intel, Nvidia, etc. have been leaving a ton of performance on the table. Maybe because of the integrated RAM, or the shorter pipelines or whatever - I don't know the details, but suffice it to say, I won't be buying any new AMD/Intel/Nvidia chips until they have the same speed increases and the same performance per watt that the M1 series does.
Apple has (temporarily) set the new standard for CPUs. Never in a million years did I think I'd be saying that. I've never owned an Apple product.
I wouldn't mind paying $400-500 for an APU similar to the Xbox5/PS5, that would be a good start.
blanarahul - Wednesday, October 27, 2021 - link
"I won't be buying any new AMD/Intel/Nvidia chips until they have the same speed increases and the same performance per watt that the M1 series does."What do you do with your PC that you need such processing power? Are the applications you use even available on Mac OS? As you said yourself, you have never owned an Apple product.
"I wouldn't mind paying $400-500 for an APU similar to the Xbox5/PS5, that would be a good start."
I don't think you realise that those two APUs consume around 200-250 watts of power when gaming and 50 watts when idling. They aren't that efficient in general, let alone competitive with M1.
mode_13h - Wednesday, October 27, 2021 - link
> I don't think you realise that those two APUs consume around 200-250 watts of power> when gaming and 50 watts when idling.
lol wut? Where the heck did you read that? Were you aware that they're *laptop* chips?
Regardless, you're in for quite a surprise, when you see this:
https://www.anandtech.com/show/17024/apple-m1-max-...
mode_13h - Wednesday, October 27, 2021 - link
Sorry, for some reason I thought you were referring to M1 Max and Pro, as "those two APUs". You make a good point that if efficiency is @flyingpants265's top priority, those console APUs aren't great options. Maybe under full load, they'd compare favorably to a gaming PC of the same vintage, but not otherwise and certainly not against the M1 family.Anyway, the link to the M1 power measurements provides an interesting basis for comparison.
Techinterested1 - Friday, October 29, 2021 - link
Remember apple doesn’t have to sell its chips to anyone. They can bury the chip costs into the price of the whole product and aside from having large margins to begin with they will still make a profit. They also reuse a lot of the chips and R&D across product. A14 was iPhones and iPad Air. A15 was iPhone and iPad mini and so on. So they take advantage of the scale of their own platforms.dotjaz - Saturday, October 30, 2021 - link
>I don't think you realise that those two APUs consume around 200-250 watts of powerYou know what I think? You are pulling data out of your ass. XSX draws 160-210W of power from the wall. How does the APU alone consume 40 watts higher than total power? GDDR6/NVMe/fans/peripherals/power brick are now generating power?
The APU is consuming 95-135W of power within desktop TDP limit.
Annnonymmous - Monday, December 6, 2021 - link
M1 is on a more advanced node. It is impressive, but not as much so when you consider that it's on 5nm while everyone else is on 7nm. /shrugAnnnonymmous - Monday, December 6, 2021 - link
When Zen moves to 5nm, and Intel drops a comparative part, M1 will be old hat.melgross - Wednesday, October 27, 2021 - link
I would imagine, at this point, that Apple’s new chips, due to their size, are costing about what any other chip of that size might cost. But with the integrated RAM packages on the substrate, the package will cost much more. Of course, the question is just how much that RAM would cost as the usual sticks, plus the sockets, and the extra size of the Moro to accommodate them. Then we have the same question about the GPU.The greater efficiency of that, leads to a smaller power supply, and smaller cooling hardware.
So overall, it could cost less when considered as a system, rather than just as a chip. I’d really love to see the numbers.
Zoolook - Saturday, October 30, 2021 - link
Apple has a huge advantage in that they use a better and more expensive process, so it's not like Intel and AMD is leaving performance on the table, they would be 15-30% better if they were also on N5.mode_13h - Saturday, October 30, 2021 - link
> Apple has a huge advantage in that they use a better and more expensive process
I think their A13 used the same TSMC process as Zen 3. Compare those and Apple still wins on IPC and perf/W.
> it's not like Intel and AMD is leaving performance on the table,
> they would be 15-30% better if they were also on N5.
That's an enormous range, and 20% is the highest generational improvement Intel has achieved since Core 2. AMD achieved a bigger generational improvement with Zen 1, but that was basically their "Core 2" moment.
Anyway, no. N5 (or equivalent) will probably not net them even 15%, and that's after accounting for all the uArch improvements enabled by that much larger transistor budget.
caqde - Sunday, November 7, 2021 - link
N5 vs N7 has a gain of up to 15% (Mobile) to 25% (HPC version) improved performance, or 30% improved power efficiency. So netting them 15% is what is touted by TSMC; getting 15% is the minimum generational improvement that TSMC is marketing for that node. N5 is a major node, so I'm not sure how you think that, even with uArch improvements, AMD and Intel would not have better performance on a similar node to Apple. Zen 3 alone had 15% better performance (average) than Zen 2 at the same power consumption, and that is on the same node as Zen 2. Apple pays heavily to be on the best node possible every year.
TSMC Nodes
https://fuse.wikichip.org/news/2567/tsmc-talks-7nm...
mode_13h - Monday, November 8, 2021 - link
You can't just say a new process node will deliver 15% more performance, irrespective of the microarchitecture. There are some assumptions built into that number, which might not apply to these x86 cores. We'll just have to wait and see.
> Zen 3 alone had 15% better performance (average) than Zen 2 at the same
> power consumption and that is on the same node as Zen 2.
First, it wasn't *exactly* the same node. It's a derivative node that AMD used for the Ryzen 3000 XT-series. Second, Zen 2 was still at lower IPC than Skylake. Some of the gains in Zen 3 were from AMD still playing catch-up.
mode_13h - Monday, November 8, 2021 - link
Sorry, I realized I was thinking about your claim in terms of IPC. I don't believe N5 will increase IPC by 15%, but that's a very plausible target for the combination of IPC and frequency.
Tams80 - Saturday, October 30, 2021 - link
It doesn't. Apple used more silicon and a more advanced fabrication node. That's where most of the gains have come from.
xenol - Tuesday, October 26, 2021 - link
This is what should be drilled into people's heads. The ISA is just an interface between software and hardware. How you _implement_ it determines how fast a processor performs.
blanarahul - Tuesday, October 26, 2021 - link
I have heard that some aspects of x86 are problematic (variable instruction length, for example). Do you think there is a way to switch to fixed instruction length and let a compiler like Rosetta handle the outlier instructions?
CrystalCowboy - Tuesday, October 26, 2021 - link
Implementation matters, but some ISAs may be easier or more efficient than others to implement. Claims are that ARM is more efficient than x86. Are multi-threaded x86 cores an admission that the front end is a bottleneck? Is there something better than ARM? Do the IP costs of ARM cut into potential profits?
Kamen Rider Blade - Tuesday, October 26, 2021 - link
Chips & Cheese has a great article on this exact topic.
ARM or x86? ISA Doesn’t Matter
https://chipsandcheese.com/2021/07/13/arm-or-x86-i...
mode_13h - Tuesday, October 26, 2021 - link
Good points are raised, but most of the sources in that article are pretty old. The most authoritative and comprehensive-sounding is dated 2013. And that study was surely based on chips even older! So, maybe it's about 10 years out-of-date, and we're using it mostly to gain insight into what's coming in the NEXT 5 years or so?
Also, let's not forget that Jim left AMD in like 2016? So, I consider his knowledge on the subject a little bit dated and limited. I mean, you need look no further than the disparity between x86 and Apple's A14 decoder widths to see that x86 is clearly facing issues that Apple isn't.
Next, the talk of "decode power" in ARM doesn't indicate whether power was an issue simply because its slice of the pie was growing out of proportion to its benefit (which is largely determined by what's downstream), or if it was actually growing in a very non-linear fashion. The former can be resolved by simply using a larger pie pan (i.e. newer process node, or maybe just a bigger silicon & power budget), but the latter can become a deal-breaker for ever going past a certain point.
Ultimately, time will tell. If the x86 cores can't catch Apple, I think that'll be the surest sign.
KennethAlmquist - Wednesday, October 27, 2021 - link
When David Patterson did the Berkeley RISC project, instruction set design mattered a lot. The transistor counts for microprocessors of that era were in the tens of thousands. (The Intel 8088 had 29,000 transistors and the Motorola 68000 had 68,000 transistors.) The Berkeley RISC I and RISC II processors used 44,500 and 39,000 transistors, respectively.
Fast forward to 2012, and you have a quad core Ivy Bridge processor with 1,400,000,000 transistors. That's a huge increase in transistor budget. That means that the instruction set no longer matters very much. The transistors required to decode the instruction stream no longer dominate the cost of the design, even after you account for the fact that there are four cores, meaning that there are four copies of the instruction decoding logic.
You point out that the sources the article uses to document this are pretty old. But transistor budgets keep getting larger. Apple's eight core M1 has 16,000,000,000 transistors. So the choice of instruction set should be even less important today. You mention the difference in decoder widths between the A14 and the current x86 implementations. Mike Clark explains that he wants a balanced design, so to increase the decoder in Zen, you have to beef up everything so that the design is still balanced. He suggests that will happen in future Zen designs. Apple currently has the largest transistor budget, so it's not surprising they currently have the widest decoder.
mode_13h - Wednesday, October 27, 2021 - link
> The transistors required to decode the instruction stream
> no longer dominate the cost of the design
You're presuming a serial decoder. For an interesting thought experiment, try imagining how you'd build a >= 4-wide decoder for a variable-length instruction set. I think it's safe to say that the area, complexity, and power of the decoder increase as a nonlinear function of its width. And note that the example of 4-wide is per-core. So, however many cores you have, that entire front end would be replicated in all of them.
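To make that concrete, here's a toy sketch in plain C (nothing like real hardware, and the insn_length() rule is completely made up) of why finding instruction boundaries is inherently serial for a variable-length encoding, whereas a fixed-length ISA lets you point N decoders at N known offsets right away:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical variable-length encoding: the length of each instruction
     * is only known after inspecting its first byte(s), loosely like x86
     * prefixes and opcodes. The rule below is invented purely for illustration. */
    static size_t insn_length(const uint8_t *code)
    {
        if (code[0] == 0x0F) return 3;   /* "two-byte opcode" plus operand */
        if (code[0] >= 0xC0) return 2;   /* opcode plus immediate          */
        return 1;                        /* single-byte instruction        */
    }

    /* Serial boundary finding: the start of instruction i+1 depends on the
     * decoded length of instruction i, so a naive N-wide decoder can't know
     * where to point its 2nd..Nth decoders without guessing at every possible
     * byte offset and throwing most of that speculative work away. */
    static size_t find_boundaries(const uint8_t *code, size_t nbytes,
                                  size_t *starts, size_t max_insns)
    {
        size_t pc = 0, n = 0;
        while (pc < nbytes && n < max_insns) {
            starts[n++] = pc;
            pc += insn_length(&code[pc]);   /* the serial dependency lives here */
        }
        return n;
    }

    int main(void)
    {
        const uint8_t code[] = { 0x01, 0xC0, 0x7F, 0x0F, 0x10, 0x20, 0x02 };
        size_t starts[8];
        size_t n = find_boundaries(code, sizeof code, starts, 8);
        (void)n; /* with a fixed 4-byte encoding, starts[] would simply be 0, 4, 8, ... */
        return 0;
    }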
The other pitfall faced in comparing transistor counts of ancient CPUs with modern ones is that the older chips took many cycles to execute each instruction. So, you're also failing to account for the increase in complexity from simply increasing the serial throughput of the decoders, apart from how many output ports they have.
Another factor you're missing is that the opcode space is very unbalanced, because it was never planned to encompass the vast number of instructions that currently comprise x86. The current opcode space is many times larger than 8088, which IIRC only had a couple hundred instructions.
> Apple currently has the largest transistor budget,
> so it's not surprising they currently have the widest decoder.
Don't confuse the amount of transistors they're spending on CPU cores with what they spend on the entire die, since that includes cache, GPU, ISP, etc. You should compare only the CPU core transistor counts. I'm sure they didn't publish that, but people have done area analysis and you can probably find some pretty good estimates.
Now, if we look at the backend of their cores, I think they're not wider than AMD/Intel by the same ratio as their frontend. Part of that could be down to the fact that you need more ARM instructions to do the same work, since x86 can use memory operands (even with scaled register offsets), but even accounting for all of that, it looks like Intel and AMD have had disproportionately smaller frontends than what we see in the M1.
GeoffreyA - Thursday, October 28, 2021 - link
I like to think of an array and linked list. No matter how much one optimises the latter, it's still that same basic difference, that an element can't be accessed randomly or in O(1). Even if we were to build some table of indexes, it means more work elsewhere. (And we might as well have started off with an array to begin with.)
In x86, one way to remedy this could be adding length information to the start of each instruction; but that would likely mean breaking the ISA, and so it might be better just going to a new, fixed-length one. Or take it one step further, a new ISA that is able to encode the instructions in an out-of-order fashion, meaning dependencies are calculated outside of the CPU, hopefully at compile time. Try and remove everything that doesn't have to be done at runtime, outside of runtime.
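To put the analogy into toy code (purely illustrative): indexing an array is O(1), while reaching the n-th node of a list means walking every link before it, much as each x86 instruction's start is only known once the previous one has been decoded:

    #include <stdio.h>
    #include <stdlib.h>

    struct node { int value; struct node *next; };

    /* O(1): the address of element i is computed directly from the base. */
    static int array_get(const int *a, size_t i) { return a[i]; }

    /* O(n): each element's location is only discoverable by following the
     * previous link - like finding the next variable-length instruction. */
    static int list_get(const struct node *head, size_t i)
    {
        while (i-- > 0)
            head = head->next;
        return head->value;
    }

    int main(void)
    {
        int a[4] = { 10, 20, 30, 40 };
        struct node n3 = { 40, NULL }, n2 = { 30, &n3 }, n1 = { 20, &n2 }, n0 = { 10, &n1 };
        printf("%d %d\n", array_get(a, 3), list_get(&n0, 3)); /* both print 40 */
        return 0;
    }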
mode_13h - Thursday, October 28, 2021 - link
> meaning dependencies are calculated outside of the CPU, hopefully at compile time
I liked this aspect of IA64's EPIC scheme. Not sure how well it actually worked, in practice. People have raised concerns about the additional instruction bandwidth needed to support it.
On the subject of instruction bandwidth, I keep thinking about how much texture compression helped GPUs. I know it's different (lossy vs. lossless), but it still shows that a simple compression scheme could be a net win. Then again, ARM tried something like that with THUMB, and ended up walking away from it.
GeoffreyA - Friday, October 29, 2021 - link
There are worthwhile ideas in EPIC, especially the notion of moving dependency checking into the compiler; but owing to Itanium's stigma, I think designers will, sadly, steer away from it. If only they could separate out the good from the bad and absorb some of those ideas. But I fear no one will want to take that chance. Also, breaking compatibility is a big concern, and the only feasible option at present is going to ARM, if they've got to switch, purely because of solid support in Windows and MS's toolsets.
mode_13h - Tuesday, October 26, 2021 - link
I think there's a lot to be said for this approach. Simplifying the CPU's decoder should free up some area and save energy. You can even have the OS cache the decoded program from one run to the next. It can also store profiling information, so the decoded program can be further optimized, either as it's run or on subsequent runs.
These are things GPUs already do! I think it's what Nvidia's Denver cores do, and someone mentioned maybe Transmeta CPUs did this, as well? It would need OS support, of course.
I think the main challenge is to achieve good density in the decoded stream. There's a definite tradeoff between how much work you want to save the CPU's frontend vs. how much memory bandwidth you want to burn on fetching code.
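A crude sketch of the kind of lookup that implies (hypothetical names and structures, not any real OS or CPU interface): keep a cache keyed by the guest code block's address, translate on a miss, and reuse the result from then on:

    #include <stdint.h>
    #include <stdio.h>

    #define TCACHE_SLOTS 1024            /* hypothetical, direct-mapped */

    struct tcache_entry {
        uint64_t guest_pc;               /* start of the original code block   */
        void    *host_code;              /* pointer to the pre-decoded version */
    };

    static struct tcache_entry tcache[TCACHE_SLOTS];

    /* Stand-in for the expensive work: running the decoder over a block and
     * stashing the result somewhere reusable (here it just returns a marker). */
    static void *translate_block(uint64_t guest_pc)
    {
        printf("translating block at 0x%llx\n", (unsigned long long)guest_pc);
        return (void *)(uintptr_t)(guest_pc | 1);
    }

    /* Fast path: if this block was decoded before, skip the decode entirely. */
    static void *lookup_or_translate(uint64_t guest_pc)
    {
        struct tcache_entry *e = &tcache[(guest_pc >> 2) % TCACHE_SLOTS];
        if (e->guest_pc != guest_pc || e->host_code == NULL) {
            e->guest_pc  = guest_pc;
            e->host_code = translate_block(guest_pc);   /* slow path, done once */
        }
        return e->host_code;
    }

    int main(void)
    {
        lookup_or_translate(0x401000);   /* translates */
        lookup_or_translate(0x401000);   /* cache hit: no second translation */
        return 0;
    }

The density/bandwidth tradeoff is then mostly about how bulky the "host_code" side of that table is allowed to get.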
Zoolook - Saturday, October 30, 2021 - link
That's what DEC did in FX!32 (https://en.wikipedia.org/wiki/FX!32) and that's what Apple does in Rosetta. It was the whole point behind Transmeta's Crusoe architecture: they were made to be independent of ISA, so an abstraction layer would have to be created for each ISA you wanted to run on it.
mode_13h - Saturday, October 30, 2021 - link
Yeah, I'd heard about DEC's x86 software emulator.
Also, it's a lot like Rosetta. It's modern examples like that which I think lend a lot of credibility to the idea.
xenol - Tuesday, October 26, 2021 - link
While variable instruction length is a problem, I also believe that developers of x86 compilers are largely aware of this and have tried to make sure to avoid using different ways one can encode the same instruction. And I'm also certain in modern x86 designs there's some cache set aside for decoding.
There's also an article I found going over the most used x86 instructions and how they're used (https://www.strchr.com/x86_machine_code_statistics... It's likely this or something similar is used as a reference as to where to optimize.
dotjaz - Tuesday, October 26, 2021 - link
Why would you want an essentially binary translation layer when the decoder is doing exactly that - decoding variable length to fixed length? For AMD, Macro or Complex Ops are already fixed length, then further broken down to Micro Ops.
mode_13h - Tuesday, October 26, 2021 - link
@dotjaz, there would be a few benefits. First, anything that makes the instruction stream more uniform would help them widen up the decoder, which is currently a bottleneck in the uArch.
Second, remember that the decoder is doing essentially redundant work every time it encounters the instructions at the same point in an executable. Doing the work up-front, like when the code is first read in from storage, is essentially like moving code out of an inner loop. It could enable the pipeline to be shorter, simpler, burn less power, and take a smaller penalty on branch mispredictions.
Lastly, any savings of silicon area can be re-invested elsewhere, either in the form of beefing up other parts of the uArch, in adding more cores, or simply in reducing the cost and therefore hopefully the price to customers. If they take the silicon savings to the bank, then at least less area can enable some combination of lower power and higher clocks.
mode_13h - Tuesday, October 26, 2021 - link
Another potential benefit would be the ability to do some types of code transformations that are too expensive to do in hardware. Perhaps this could be guided by dumping branch-prediction state, since compilers are traditionally very bad at predicting the control flow of code.
I remember back in the late 90's, DEC Alpha had x86 emulation support that would do some JIT translation and would supposedly optimize the code as you continued to run the program. That's the first time I heard of such an idea. I do wonder whether today's Javascript JIT compilers do any sort of profile-driven optimization.
Zoolook - Saturday, October 30, 2021 - link
It was DEC's pioneering work in this area with FX!32 that is given too little credit now. A brief paper describing how it worked can be found here: https://www.usenix.org/legacy/publications/library...
mode_13h - Saturday, October 30, 2021 - link
Thanks for the link!
There are many pioneering ideas that happened at companies like DEC and IBM, decades before they became mainstream. I once talked with a processor architect who said he felt like they were blazing new ground at the place he worked in the 90's, only to later find out that IBM was doing some of the same stuff as far back as the 60's and 70's.
tygrus - Wednesday, October 27, 2021 - link
It's not efficient to predecode:
1) if the decode stages create larger instructions of fixed width;
2) are very hardware specific;
3) there's already dynamic components for address resolution involving load/store;
4) more dynamic when referencing results of previous instructions only available at runtime;
5) creates more security risks;
6) there's extra power involved to store this 2nd copy in RAM then read it back into the CPU, on top of the 1st read & precompile.
tygrus - Wednesday, October 27, 2021 - link
I'm talking about the recompiling or transcoding into a different ISA.
mode_13h - Wednesday, October 27, 2021 - link
> It's not efficient to predecode:
> 1) if the decode stages create larger instructions of fixed width;
Maybe there's a cheap compression scheme that can be handled more efficiently than decoding x86-64 instructions. You could even put the decompression in the datapath that instruction cache misses must traverse, so that it's all decompressed by the time it gets into a cacheline.
> 2) are very hardware specific;
Why? The decoder can still live in hardware/microcode of the CPU. The OS can treat it as a black box, much as it does for GPU shader compilers.
> 3) there's already dynamic components for address resolution involving load/store;
And yet, somehow micro-ops already deal with this.
> 4) more dynamic when referencing results of previous instructions only available at runtime;
Self-modifying code will need to be handled via a slow-path that does the decoding inline (as today). This can still be a purely hardware path, but you wouldn't need to devote quite as much silicon to it as we currently do.
> 5) creates more security risks;
Such as? Code is sitting in memory, either way.
> 6) there's extra power involved to store this 2nd copy in RAM
> then read it back into the CPU, on top of the 1st read & precompile.
What circumstances would give rise to that? If the intermediate format is faster to decode than the x86 instruction stream, that suggests it would also require less power to do so. There's no way that the x86 ISA is so close to optimal that a real win can't be had by doing some amount of preprocessing; otherwise, I'm left wondering whether the CPU designers are really trying hard enough.
GeoffreyA - Thursday, October 28, 2021 - link
Certainly, doing the decoding in the CPU and stashing it somehow would make it all easier. The other approach of moving the decoder into the OS would be, as they say, disastrous. Breaking abstraction and tying itself to a CPU's hidden format; and that would mean different formats for different vendors too. Well, vendors could provide this layer in the form of a driver; but already it seems like a great deal of work, and not very elegant.
mode_13h - Thursday, October 28, 2021 - link
> Certainly, doing the decoding in the CPU and stashing it somehow would make it all easier.
Yeah, I think the decoder should still live in the CPU. By default, decoding would happen inline. However, the OS could run the CPU's decoder on a block of code and translate it. Then, switch the process over to running the intermediate format from the new image.
Again, this isn't really blazing new ground. Transmeta and Nvidia's Denver have done this before. That means there could already be some level of OS support for it!
GeoffreyA - Friday, October 29, 2021 - link
I see what you're saying. The CPU exposes its decoding system through an API, and the OS handles it similarly to managed programs.
Crusoe's idea was interesting. No doubt, implementing emulation in an OS is pretty hard work. (It took some time before MS, for example, added x64 to Windows on ARM.) I wonder if Crusoe's idea of a generalised layer would be easier. Given enough time, as more translation paths are added, we might be able to run any program on any ISA, even for paths that aren't directly coded. Just go to intermediate format, and from there to destination ISA. And native programs running at full speed. Then again, I suppose that's what Java, .NET, and co. already do, and they're very convincing.
GeoffreyA - Friday, October 29, 2021 - link
* they aren't very convincing
mode_13h - Tuesday, October 26, 2021 - link
ISA is ultimately a limiting factor. It's not the only factor, but it influences the shape of the perf/W or perf/area curve and where the points of diminishing returns are.
Bambel - Thursday, October 28, 2021 - link
It's one thing to say "ISA doesn't matter" and it's another thing to prove it. Why is there no x86 core that has the same efficiency as Apple's cores? Build it and I'll believe it.
BTW: Apple's efficiency comes from the relatively low frequency that their cores run at, and it's not just TSMC's N5 process. Intel, AMD and Apple have, give or take, the same performance, but while Apple's run at 3.2 GHz, Intel's and AMD's need about 5 GHz to deliver the same performance. If you build Intel or AMD cores on the same N5 and run them at 5 GHz, they will draw much more power than at 3.2 GHz.
And this relationship between clock speed and power isn't new. Ask the folks at Intel who designed the P4. So again: why is there no x86 core that only needs to run at 3.2 GHz for the same performance? Strange, isn't it?
mode_13h - Thursday, October 28, 2021 - link
> while Apple's run at 3.2 GHz, Intel's and AMD's need about 5 GHz to deliver the same performance.
How do the cores compare in terms of area? Because Intel and AMD have both designed cores that they sell into the server market. So, if Apple is delivering less performance per transistor, maybe that's a reason Intel and AMD didn't increase the width of their uArch at the same rate as Apple.
mode_13h - Thursday, October 28, 2021 - link
I should've added that the reason area is relevant for the server market is due to the high core-count needed for servers. So, if Apple can't scale up to the same core counts, due to the area needed by their approach, that could be a key reason why Intel and AMD haven't gone as wide.
Also, cost increases as a nonlinear function of area. So, perf/$ could be worse than their current solutions, on the nodes for which they were designed.
Bambel - Friday, October 29, 2021 - link
A few thoughts...
- AFAIK most PCs sold today are laptops, and Intel lost Apple's business because Apple was not satisfied with the efficiency of current (mobile) CPUs. Intel should have enough resources to develop different cores for different applications. The loss of Apple as a customer not only lost them money, it was also bad PR.
- x86 CPUs have to clock down for thermal reasons in all-core loads anyway, so they could design them to run at lower clock speeds from the beginning. So pretty much what is needed in a laptop. BTW: the new E-cores in ADL are much smaller and seem to be very efficient both in area and thermals. Maybe it's a starting point for a new P-core. The "Core" architecture was derived from the Pentium M, which was "only" a mobile chip at the beginning but evolved into much more. Funnily enough, Apple's stuff originated in mobile as well, so it seems that it makes more sense to focus on efficiency first and then scale performance. Both Intel and AMD seem to have done it the other way round in recent years.
- Looking at a lot of die shots, I have the impression that caches and uncore are larger than the cores themselves, so I'm not sure area is the culprit. And chiplets mitigate that anyway for high core count CPUs.
Will be interesting to see what AMD can achieve with Zen 4. I don't expect it to reach Apple's IPC, but it should be closer than today.
GeoffreyA - Friday, October 29, 2021 - link
I've always felt that Gracemont will lead to their main line of descent. They're slowly working the design up.
mode_13h - Saturday, October 30, 2021 - link
> I've always felt that Gracemont will lead to their main line of descent.
It depends on what people want and are willing to pay for. There are currently Atom server CPUs with up to 16 cores. If the version with Gracemont is popular (maybe with even more cores), then they'll probably expand their server CPU offerings with E-cores.
mode_13h - Saturday, October 30, 2021 - link
> Intel should have enough resources to develop different cores for different applications.
They do. The Atom line was for low-cost and low-power, while their other cores were for desktops, mid+ range laptops, and most servers.
What they haven't done is a completely separate core for servers. If they did that, then maybe they could make the desktop/laptop cores even bigger, although that would make them more expensive so maybe not.
> x86 CPUs have to clock down for thermal reasons in all core loads anyway
You're thinking of desktops & laptops, but not servers. Servers' clocks are limited by power, not thermals.
> the new E-cores in ADL are much smaller and seem to be very efficient both in
> area and thermals. Maybe it's a starting point for a new P-core.
They've recently done complete redesigns of their P-cores. I forget which one was the last redesign, but it might be the very P-core in Alder Lake. They don't need Gracemont as a "starting point". If Golden Cove's efficiency is bad, it's because they *decided* not to care about efficiency, not because of legacy.
> it seems that it makes more sense to focus on efficiency first and then scale performance.
> Both intel and AMD seem to have done it the other way round in recent years.
Apple didn't *only* focus on efficiency, in case you haven't been following their CPUs. They have consistently had far better performance than their competitors.
And neither AMD nor Intel can afford to come out with a lower-performing CPU that's simply more power-efficient. I mean, not in the main markets where they play. Obviously, Intel sells cost/efficiency-oriented CPUs for Chromebooks, and it sells specialty server CPUs based on the same "Mont" series cores.
> Looking at a lot of die-shots i have the impression that caches and uncore are
> larger than the cores themselves
That's not very helpful. You need to find someone who has done the analysis and made credible estimates of the number of transistors in the cores.
FreckledTrout - Tuesday, October 26, 2021 - link
What a well thought out interview. Thanks Ian.
patel21 - Tuesday, October 26, 2021 - link
Would have loved some questions about their ARM endeavors.
Ian Cutress - Tuesday, October 26, 2021 - link
As shown by the K12 question, that's not what Mike's working on.
mode_13h - Tuesday, October 26, 2021 - link
I'm sure AMD wouldn't say much more than that, anyhow. Presumably, even Mike knows at least a little more than he was willing to say.
Igor_Kavinski - Tuesday, October 26, 2021 - link
We want moar stories like the cold A0 silicon, please!
CrystalCowboy - Tuesday, October 26, 2021 - link
It would be great to hear all about Zen 7, but one must be wary of the Osborne Effect.
Lord of the Bored - Tuesday, October 26, 2021 - link
I know what you mean, but that name makes me think of Spiderman first.
So in my mind, talking about upcoming technology turns you into a supervillain.
GeoffreyA - Tuesday, October 26, 2021 - link
Fantastic interview and reading! As always, it's a joy to learn more about Zen. Thanks, Ian. And hats off to Mike and everyone at AMD for the brilliant job they did and are still doing. Well done. Keep up the good work!
Oxford Guy - Tuesday, October 26, 2021 - link
All these advances in tech and the photo has awful red eye.
WaltC - Tuesday, October 26, 2021 - link
What a great interview! Loved the questions and the answers! Clark hits all the right notes, from describing the massive teamwork building an excellent CPU requires, to his excitement and genuine enthusiasm for what he sees coming down the pike. I loved this:
"If we don't do it, someone else will."...! AMD is in a good place with engineers like this, imo. If you don't believe in yourself, your company and your products--well, that doesn't lead to the best possible outcome, let's say. I also liked what he said about "machine learning" (and "AI", although it wasn't specifically mentioned), that all of it boils down to efficient, high-performance computing. I've heard Papermaster say the same thing. Buzzwords carry little weight with CPU engineers (thankfully)..;)
I think this is certainly one of the best--if not the best--Q&A interview I've seen from AT. Good job, Ian. And of course, Kudos, Mike!
Makste - Wednesday, October 27, 2021 - link
"If we don't do it, someone else will."This statement resonated with all my ideals.
mode_13h - Wednesday, October 27, 2021 - link
Taken out of context, that sentiment can legitimize some very bad behavior.
TristanSDX - Tuesday, October 26, 2021 - link
Great to hear the story of engineering a CPU. But no question about hybrid (big-little), a hot topic now.
It would be great to bring in some lead architect from Intel, to get insights into how long and how hard it is to design these cores.
CrystalCowboy - Wednesday, October 27, 2021 - link
AMD has been reported saying that they are not going big-little, but are pursuing other paths to efficiency.
ballsystemlord - Tuesday, October 26, 2021 - link
@Ian, wow! You've been getting lots of interviews recently. Any particular reason that the various companies are willing to let you talk with their tech experts now of all times?
easp - Wednesday, October 27, 2021 - link
Alignment of development/release cycles, maybe?
Plus, if one company is making their Sr. leaders available, other companies have incentive to make sure they get their own message out at the same time.
LightningNZ - Tuesday, October 26, 2021 - link
Superb interview! I've seen a lot of comments here mention the difficulties of decoding the variable length instructions of x86, but as we've heard from Jim and now Mike, there are ways to work around or mitigate this. I haven't seen anyone here ask if the fairly strict memory ordering model of x86/x64 is a bigger problem for usefully decoding in parallel? Maybe it's a dumb question, but it looks like ARM's soft memory ordering allows for much greater parallelism with memory loads/stores - in particular the ability to recognise streaming loads and thus not have to put barriers on writes. Apologies if I have the terminology wrong - I'm a n00b in this area. I could imagine there being difficulty decoding lots of memory operations in parallel if you can't tell whether they are subject to aliasing, alignment issues, or barriers. These things should be decoupled but I'm not sure to what extent they can be fully decoupled when you also have guarantees on ordering.
mode_13h - Tuesday, October 26, 2021 - link
I definitely tried to pursue the matter in the comments of AMD CTO Mark Papermaster's interview:
https://www.anandtech.com/comments/16176/amd-zen-3...
In particular, I think the comments by deltaFx2 are worth reading. They sound as if this person has some detailed knowledge of x86 CPU design, and even suggest having done some comparative analysis against the ARM64 ISA.
I see that was almost exactly 1 year ago. Coincidence?
mode_13h - Tuesday, October 26, 2021 - link
According to Google, that poster only commented on Ryzen, EPYC, and a few ARM server CPU articles (Ampere, Applied Micro X-Gene 3, and Qualcomm Centriq). Interesting.
I doubt that's a complete list, since I didn't get a hit on the comments in that Papermaster interview. However, as I'm looking through the search hits, it seems none of them are on the dedicated comments pages. Maybe Google doesn't index those. The hits all seem to be on one of the handful of comments visible at the bottom of the article pages.
I also just searched the forums, and deltaFx2 doesn't seem to have a forum account. Apparently, neither do I... ?
LightningNZ - Wednesday, October 27, 2021 - link
Yeah it's a quality commenting system (as you say below). I have no idea who deltaFx2 might be. The best discussion of these things tends to be over at RWT (which also has a rubbish forum layout, but it's better than this). I miss AcesHardware so bad at times like this.
LightningNZ - Wednesday, October 27, 2021 - link
So I went and read delta's comments. They're a bit of a knob, aren't they? "Deeper buffers and just speculate around it" was their opinion. I doubt that's entirely true - look at Spectre/Meltdown. Memory speculation can't just be done on everything.
mode_13h - Wednesday, October 27, 2021 - link
Heh, yeah. This one stung a bit:
deltaFx2> Are you sure you know what you're talking about, because it doesn't look like it.
Well, I'm no CPU designer. I can imagine if a noob were challenging me on a subject where I have some expertise, I might find it a little tiresome, but it still didn't need to be quite that blunt.
In that same post, they point out:
deltaFx2> ARM has some interesting memory ordering constraints too,
deltaFx2> like loads dependent on other loads cannot issue out of order.
deltaFx2> Turns out that these constraints are solved using speculation
deltaFx2> because most of the time it doesn't matter. x86 needs a larger
deltaFx2> tracking window but again, it's in the noise for area and power.
I think you're right that it's a rather dismissive answer. Still, I'm in no position to argue.
I also think you can't just reduce x86 memory ordering to a simple tracking problem, since transactions need to be committed in an order that probably isn't optimal. The order can't be changed, since it will break some multithreaded code.
deltaFx2 - Sunday, October 31, 2021 - link
There's a ton of info on Google if you care to search. Here's the first hit Google came up with: https://stackoverflow.com/questions/55563077/why-f...
This has been around for a couple of decades at least.
"I also think you can't just reduce x86 memory ordering to a simple tracking problem, since transactions need to be committed in an order that probably isn't optimal. The order can't be changed, since it will break some multithreaded code."
Remember, the rule is Total Store Ordering (TSO). Stores are always committed in program order regardless of ISA. Loads that snoop those stores may go out of order. As long as they don't observe stores out of order relative to a strict in-order machine, it's all ok. A CPU with weak memory ordering rules will allow loads to go out of order by design and ISA spec. A CPU with strong memory ordering rules allows loads to go out of order in the microarchitecture but enforces a check to ensure it satisfies the ISA spec. Whatever security loophole applies to a strongly ordered core also applies to a weakly ordered core. I don't see a difference. But if someone believes otherwise, write a proof of concept and have it published. Burden of proof is on you. Allowing loads to go out of order can cause leaks of information (Spectre V4) inside the same thread. It is ISA independent, and affects every CPU manufacturer.
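If it helps, the classic store-buffering litmus test is the cleanest way to see the one reordering TSO does permit: a load completing before the thread's own earlier store (to a different address) becomes globally visible. A rough pthreads/C11-atomics sketch, purely illustrative (build with something like cc -O2 -pthread sb.c):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Store-buffering litmus test. Under sequential consistency, r1 == 0 and
     * r2 == 0 can never both be true at the end. TSO (x86) does allow it,
     * because each thread's load may complete before its own earlier store
     * drains from the store buffer. Weaker models (ARM) allow this and more.
     * Relaxed atomics are used so the compiler adds no extra ordering. */
    static atomic_int x, y;
    static int r1, r2;

    static void *t1(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    static void *t2(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("r1=%d r2=%d\n", r1, r2);  /* "r1=0 r2=0" is a legal outcome on x86 */
        return 0;
    }

Swap the relaxed accesses for memory_order_seq_cst and the 0/0 outcome is forbidden; enforcing (or speculating around) that kind of ordering is roughly the extra tracking a strongly ordered core takes on.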
mode_13h - Sunday, October 31, 2021 - link
Hi there! Thanks for the additional details.
To be honest, my only experience with weak memory ordering is on GPUs. I wasn't clear on the specifics of ARM vs. x86.
Makste - Wednesday, October 27, 2021 - link
This is the most interesting exchange I've ever read from a lead architect. It is even more interesting than the one Jim Keller gave.
tiwariacademy - Wednesday, October 27, 2021 - link
hello
Amber Shade - Wednesday, October 27, 2021 - link
Am I the only one seeing a big bottleneck in memory latency? It doesn't seem to be getting better generation to generation, and it doesn't seem people you talk to do it justice.
Or maybe I'm wrong and it's not that important? Somebody prove me wrong, then.
Also, hello, community. Registered to post this.
mode_13h - Wednesday, October 27, 2021 - link
Welcome! This commenting system is like something straight out of the 1990's, but perhaps it has something to do with the reason we seem to have a bit fewer than the average number of trolls? Anyway, the lack of an editing capability encourages one to post with a bit more care.
Memory latency for DRAM DIMMs seems to remain relatively fixed, in terms of the absolute number of nanoseconds. How much impact it has is very much application-dependent.
One form of mitigation employed by CPUs is increasingly-intelligent prefetching of data before code actually needs it. Of course, that only helps with fairly regular access patterns. There are also instructions which enable software to explicitly prefetch data, as well, but even these don't help when you're doing something like walking a linked-list or a tree.
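As a rough illustration (GCC/Clang's __builtin_prefetch, with a made-up prefetch distance, so treat it as a sketch rather than tuning advice): software prefetching is easy to apply to a regular array walk, but a pointer chase gives it nothing to grab onto:

    #include <stdio.h>
    #include <stddef.h>

    /* Regular access pattern: we know a[i + D] will be needed soon, so ask for
     * it early. The distance D is an invented tuning knob, not a recommendation. */
    long sum_array(const long *a, size_t n)
    {
        enum { D = 16 };
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + D < n)
                __builtin_prefetch(&a[i + D], 0 /* read */, 1 /* low locality */);
            sum += a[i];
        }
        return sum;
    }

    /* Irregular pattern: the next node's address is only known once the current
     * node has arrived from memory, so there's nothing useful to prefetch ahead
     * of time. This is the linked-list/tree case mentioned above. */
    struct node { long value; struct node *next; };

    long sum_list(const struct node *p)
    {
        long sum = 0;
        for (; p != NULL; p = p->next)
            sum += p->value;
        return sum;
    }

    int main(void)
    {
        long a[1000];
        for (size_t i = 0; i < 1000; i++) a[i] = (long)i;
        struct node n1 = { 2, NULL }, n0 = { 1, &n1 };
        printf("%ld %ld\n", sum_array(a, 1000), sum_list(&n0));
        return 0;
    }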
Another mitigation comes in the form of ever-burgeoning amounts of cache (see AMD's V-Cache [1]).
To help hide stalls on cache misses, CPUs are incorporating ever larger out-of-order windows. SMT enables one or more other threads to utilize more of the CPU's execution units, when a thread sharing the same core stalls. GPUs employ this technique to an extreme degree.
Then, there's in-package memory, such as LPDDR5 memory stacks in Apple's latest additions to the M1 family [2]. On this point, Intel has announced a variant of their upcoming Sapphire Rapids server CPU that will feature in-package HBM-type DRAM [3]. I think details are still thin on how it'll be seen and supported by software (e.g. as a L4 cache or as separately addressable storage).
To round out the solution space, Samsung is embedding some (presumably simple) computation in its HBM memory, for deep learning applications.[4]
Links:
1. https://www.anandtech.com/show/16725/amd-demonstra...
2. https://www.anandtech.com/show/17024/apple-m1-max-...
3. https://www.anandtech.com/show/16921/intel-sapphir...
4. https://www.samsung.com/semiconductor/solutions/te...
Oxford Guy - Friday, October 29, 2021 - link
'This commenting system is like something straight out of the 1990's, but perhaps it has something to do with the reason we seem to have a bit fewer than the average number of trolls?'
It absolutely reduces trolling because it doesn't create the bloodsport that is voting and post hiding.
That is shunning, one of humanity's oldest methods of blocking innovative ideas. New ideas are resisted by most people, so the community will side against the person having their posts smothered. It will mistakenly view this as a righteous solution to the problem of the annoying unpleasantness of ideas that run counter to preconceived notions.
It is no wonder that these censorship systems keep infecting websites, with those pushing them actually making statements (as a Disqus employee did) that the way to improve a community is to increase the amount of censorship.
mode_13h - Friday, October 29, 2021 - link
I've long wondered if the biggest problem with voting features, in commenting systems, isn't that it costs the voter nothing. Like, maybe if you had to actually spend some of the credits you received from up-votes of your own posts to vote someone else up/down, people would be a bit more thoughtful about it.
However, I think what's come to light about the sorts of posts that are frequently shared pours cold water on that notion. When someone shares a post, they're spending (or gaining) social capital, of a sort. And based on the kinds of emotive posts that tend to go viral, I think we can say that even voting systems that are costly to the voter are still likely to result in undesirable behaviors. Perhaps even more so than free votes, since posts that inform but don't inflame are then less likely to receive upvotes.
GeoffreyA - Friday, October 29, 2021 - link
mode_13h, I think there's an emotional effect at work in liking, voting up and down, etc. Dopamine is likely being released, which makes them seek that action again and again. At a physical, Diablo-like level. Then, go a bit higher, and it's the seeking out of those things which reaffirm their emotional opinions. Throw in some controversial, dividing issue, and the tribal spirit will come out in all its monstrous glory.
I'd also say that people like/approve what they know is "accepted" by current opinion, even if their view, internally, was different. For my part, I feel the like button, of which voting up and down is a specialisation, should be banished from society.
mode_13h - Saturday, October 30, 2021 - link
> For my part, I feel the like button, ... should be banished from society.
On another technical forum I visit, they have up-votes but no down-votes. I find the up-votes are useful when someone makes a specific factual claim, because it helps you know how much consensus there is around those claims. So, I think it's not all bad.
GeoffreyA - Friday, October 29, 2021 - link
Oxford Guy, in this "progressive" world of today, censorship is on the rise, paradoxically. The idea is controlling what people think and speak so that they fall in line with the prevailing orthodoxy. That's why we need to preserve opinions that we don't even agree with. As the physicists would say, no preferred frames of reference.
mode_13h - Saturday, October 30, 2021 - link
I think you're referring to "cancel culture"? I think we shouldn't confuse that with government censorship. I don't like either, but I'd take cancel culture waaay before I'd accept government censorship. And sadly, with the rise of authoritarianism, I think *both* are on the rise.
GeoffreyA - Sunday, October 31, 2021 - link
In general I meant it, but you're right, cancel culture does fall under this. All I can say is, the despotic principle is always at work, even when the intentions start off good. We may begin as sincere, plant-based activists, but end up forcing the entire world to eat the way we eat, because we're right.
Amber Shade - Friday, October 29, 2021 - link
Thank you. And I was actually referring to the Ryzen family having a memory access penalty, which I think comes from the Infinity Fabric, a suboptimal memory controller or something of that sort.
Back when the first two Ryzen generations came out (Zen 1 and Zen+), I was hoping Zen 2 would help latency, as I was seeing it as a big bottleneck compared to the competition. Now we have chiplets and another layer of Infinity Fabric to connect everything, and only L3 sizes to hide increased latencies. But since L3 isn't always helpful, this... makes me worried. I'm still hoping for improvements in this regard, but where everyone is going doesn't seem to line up with it.
mode_13h - Saturday, October 30, 2021 - link
Well, looking at the original Zen 3 performance review, it seems that the DRAM access latency is only about 10% worse than Comet Lake. However, between the size of L3 and some of the mitigations I mentioned, such as prefetchers, software rarely sees that penalty.
https://www.anandtech.com/show/16214/amd-zen-3-ryz...
Sadly, I don't see the same cache-and-memory latency measurements for the 5000G-series. However, the benchmarks of 5600X vs. 5600G show the former wins almost across the board. So, I decided not to worry about the extra I/O die and will probably get a 5600X.
Amber Shade - Monday, November 1, 2021 - link
Yes, the 8-core CCX and adjusted prefetchers improved things quite a bit, and the large L3 works even better now; but sub-45ns memory latencies are quite tasty and only achievable on Intel platforms, it seems (I have no reliable way of testing them myself, so I have to trust things like AIDA64 to measure it). And yeah, prices aside, I'd crave a 5600X too. Seems to be the best all around, except maybe the new Intel stuff coming soon (we'll see).
Latency-wise, I like some of the ARM ways, like starting a request to memory in parallel with cache probing (and cancelling it on a hit), at least that's how I understood it... this is from, like, the A76 presentation? Maybe A77, can't remember. Anyway, if that's not a thing on desktop, I believe it should be. And Ryzens, with their large caches and laggy memory accesses, should benefit considerably.
mode_13h - Monday, November 1, 2021 - link
> sub45ns memory latencies are quite tasty and only achievable on Intel platforms it seemsThat's not what the link I posed above shows. There's a plot of latency vs. size, in the section "Cache and Memory Latency". Underneath the plot, you can click buttons that switch the view to the test results of other CPUs. One of the options is "Core i9 10900K". When you click it, it shows "Full Random" latency peaking at 70.814 ns.
If you believe that's incorrect, please share your source. I suspect whatever test you're looking at didn't reach sizes or access patterns that went entirely beyond the effects of the cache hierarchy.
Amber Shade - Tuesday, November 2, 2021 - link
I don't have the same testing package, so I can only measure (and share) my own overclocking results with the tools I have. And yes, I believe AIDA64's patterns are somewhat more predictable, but it favors Intel nonetheless (which was backed by gaming and web performance before Zen 3, which seems to finally catch up).
Also, yes, I know it's overclocking, and comparisons with stock system results are incorrect, but it's tuned system performance that matters to its user, right? And I perform some basic optimisations on any system I get access to (even if it's only a slightly reduced voltage all around and adjusted memory timings), so I'm interested in potential just as much as stock performance.
Calin - Wednesday, October 27, 2021 - link
While memory transfer rates increased a lot in the last three-four generations, the actual wait time for random access (the total time until the first bytes arrive) did not.
The solution to this was to increase cache size and cache efficiency.
Also, the processor knows it needs certain data (memory address) early (in the instruction decoding cycle) but it actually needs the data late (into the execution cycle). Processors keeping approximately the same frequency as ten years ago means that the time between "I know I will need memory X" and "I need memory X _now_" is about the same.
Here is a useful chart:
https://www.anandtech.com/show/16143/insights-into...
Basically, DDR4-3200 has about half the latency of SDR-100 (worst case, double the transfer speed for a single byte) but 30 times the total transfer rate (best case, sequential reads).
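Back-of-the-envelope arithmetic behind that comparison, assuming a single 64-bit (8-byte) channel in both cases (a sketch, not a benchmark):

    #include <stdio.h>

    int main(void)
    {
        /* Peak transfer rate of one 64-bit channel:
         *   SDR-100:    100 MT/s * 8 B  =   800 MB/s
         *   DDR4-3200: 3200 MT/s * 8 B = 25600 MB/s
         * i.e. roughly the "30 times" figure above, while first-word latency
         * has only improved by a small factor over the same two decades. */
        const double sdr100_mbs    =  100.0 * 8.0;
        const double ddr4_3200_mbs = 3200.0 * 8.0;
        printf("SDR-100:   %6.0f MB/s\n", sdr100_mbs);
        printf("DDR4-3200: %6.0f MB/s\n", ddr4_3200_mbs);
        printf("ratio:     %.0fx\n", ddr4_3200_mbs / sdr100_mbs);
        return 0;
    }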
As for importance... it's important for some and not so important for others. You can drive memory to faster speeds but you increase total power, and that total power might come from processor total power, which might be good or might not.
There's also a question of cost - higher speeds are not free - in terms of memory cost ($), mainboard cost ($), power use (W), heat generated (Celsius/Fahrenheit degrees), ...
Makste - Wednesday, October 27, 2021 - link
Thank you to both of you, mode_13h and Calin.
mode_13h - Thursday, October 28, 2021 - link
Thanks for the feedback!
I hope I didn't confuse anyone with the typo around Cache and AMD's V-Cache. Also, the part where I went straight from OoO windows to talking about SMT (different topic).
If I'd had more time, I should've cited OoO window sizes for the A15, Zen 3, and Willow Cove (Tiger Lake), but people can look those up. It seems like they're getting big enough to cover a significant amount of the time needed for a full L3 cache miss, though that assumes there's much work the code can even do without the data.
I think SMT is a more elegant solution to this problem. It's too bad about side-channel attacks, because 4-way SMT should really help boost the perf/area of server CPUs, while still allowing for strong single-thread performance. SMT also helps with other problems, like decoding bottlenecks, branch mis-predictions, and general lack of ILP (instruction-level parallelism) in some code.
nvmd - Wednesday, October 27, 2021 - link
Whoever writes/transcribes these interviews should be replaced. Is it speech-to-text software or "Dr" Ian himself? So many errors it's fucking infuriating.
mode_13h - Wednesday, October 27, 2021 - link
That's awfully disproportionate. There were indeed a couple points where I could tell what was probably said and it wasn't accurately reflected by the transcription, but the distinction didn't much matter in the context of the surrounding text. I certainly didn't get the sense that any quality issues with the transcription compromised the interview to any significant degree.
I think we probably need to accept that this is what advertising-funded publishing can offer our audience, in this day and age. Take it or leave it. I know I'd sure rather have the interview as-written than not at all!
nvmd - Thursday, October 28, 2021 - link
So no one bothers reading the transcript of an interview at least once to fix typos because "ads don't pay enough" and it doesn't bother people like you? Talk about justifying mediocrity with the energy of a supermassive black hole.
And your underlying assumption that the effort to read the article once is greater than the will of the people behind it can only mean the people both hate and are bad at their jobs.
mode_13h - Thursday, October 28, 2021 - link
> So no one bothers reading the transcript of an interview at least once to fix typos
> because "ads don't pay enough"
I take it back. I am not in a position to say what their limiting factor is. Could be budget, could be deadlines... I don't honestly know.
What I do know is that you're being too harsh, at least in your tone. It's fair to point out the errors. But, the tone you struck will not help you find a sympathetic ear. So, please try to have some decorum.
Foeketijn - Wednesday, October 27, 2021 - link
What a nice read! (and how different from the other reviews from the blue camp that read like a legal document).
Farfolomew - Thursday, October 28, 2021 - link
I think the one takeaway I got from this article was how long it takes, from conception to delivery, to design a new CPU: roughly ~5 years, at least following Zen's history as an example.
In that context, I'm curious about Intel's upcoming Alder Lake. Back in 2015, Apple came out with their A9 processor that finally reached IPC parity with Intel's current best, Broadwell. I remember several articles written comparing Apple's seemingly disparate chip to that of Intel's, and actually discovering that the workloads, and use cases, accomplished on the two CPUs were not so different after all (iPad vs Core M tablets). That was roughly five years ago, and now we're seeing Intel coming out with a hybrid design in Alder Lake, similar to what Apple/ARM have been offering the past several years.
Now, I'm not fully convinced Intel has nailed this design, it's far too soon for that conclusion. However, I am rather of the strong opinion Apple has gotten it VERY right in their development the past 9 years, ever since their own custom silicon debuted with the A6 in 2012.
So that being the case, one of the glaring Qs posed by Ian to Mike Clark in this article was his opinion on the hybrid design and the approach by Zen to just scale the core up and down to meet TDP demand. I was not impressed with Clark's answer, which I felt amounted to not much of one at all. He kind of just gave vagaries in that it's complicated and that they designed for that up front. To me that sounds eerily similar to Intel's view on the matter way back in the late 2000s when they were asked about the big.LITTLE concept. Intel thought back then that they could just dynamically adjust frequency and voltage to produce similar results. Now we're seeing them backpedal and embrace the hybrid design.
So is AMD stuck in an antiquated mindset that's been clearly proven by the industry as a whole to be wrong? Or do they have their own hybrid design in the works? Maybe they'll resurrect K12 and integrate it into a future Ryzen design? Either way, thanks Ian for asking that Q!
mode_13h - Thursday, October 28, 2021 - link
I think AMD was in the position that they couldn't afford to design different cores for different markets. So, they had no option but to scale up and down.
They're not new to the concept, as you probably know. They had the bobcat, jaguar, puma CPUs, which had distinct uArch from bulldozer, steamroller, piledriver, excavator. We'll just have to see what they do, going forward. He can't tip their hand before anything has been announced.
mode_13h - Thursday, October 28, 2021 - link
If AMD isn't already on this track, I'd imagine Gracemont certainly caught their attention.
Farfolomew - Friday, October 29, 2021 - link
"... as you probably know. They had the bobcat, jaguar, puma CPUs, which had distinct uArch from bulldozer, steamroller, piledriver, excavator."I actually did not know this! I knew the names of all these cores, but I never knew the Bobcat line was different than the Bulldozer line, in the same vein as Atom vs Core like Intel! In my mind, I kind of just thought they were all the same module-based pseudo SMT architecture that AMD got panned for pre-Ryzen! Thanks for educating me!
mode_13h - Thursday, October 28, 2021 - link
Now here's an interesting thought: do you think Alder Lake was a play to save Intel & Apple's marriage? It could be just a reaction to Qualcomm & other ARM vendors pushing into the laptop market, but it'd be more interesting if Intel was trying to ply Apple with offerings to string out their marriage just a little longer.
On a related note, I've wondered if the Iris Pro GPUs with eDRAM were meant to try and lure any console vendors. Like for the XBox 360 / PS4 generation. And then Intel just decided to bring it to market, anyhow. Maybe even scaled back from what they originally designed.
GeoffreyA - Friday, October 29, 2021 - link
Apple: "It's not you, dear. It's me. I just can't take it any more and am filing for divorce."Intel: "It's got to do with that ARM fellow, isn't it? He always seems to be around. Your silence speaks a mouthful. Well, what about Ann and Timmy? What's this going to do to them?"
Apple: "They'll be staying with Aunt Rosetta as we go through this period."
Intel: "Aunty Rosetta! Darling, I'm a changed man. There, there, doesn't it feel good to be in my arms? I've cut down my watts. My nanometres have been upgraded to 10. Why, I'm even working on big/little. I call it Gracemont, the mindfulness within me."
Farfolomew - Friday, October 29, 2021 - link
I can't wait for the movie *popcorn*.
mode_13h - Saturday, October 30, 2021 - link
Heh, nice!
Working Rosetta in there was inspired!
msroadkill612 - Friday, November 19, 2021 - link
I flatter myself commenting in this lofty ~debate, but afaict, Bulldozer seems widely reviled in AMD folklore, yet the more I learn of it, the more it seems to contain the genesis of the truly inspired architecture that became Zen & Infinity Fabric.
As I say, I am a mere newb, but there was a lot that looked familiar in a 2014 Kaveri APU to me.
Is it just me or another example of how wrong conventional wisdom can be?