It's a shame they overpriced this generation. AMD fans are more frugal and prefer a less expensive option. I wouldn't recommend these chips at their launch price.
@Qasar No, AMD's Ryzen 3000 series is going to provide more value than anything Intel has right now, and you can expect the value to get even better after the 5000 series launches.
They are still selling Ryzen 1000! There is basically zero risk of Ryzen 3000 no longer being available in the coming few years, let alone before the release of a 5600.
You tell me where they sell Ryzen 1000 chips like the 1600 AE for less than $100, and not at some jacked-up price. They disappeared off the shelves and AMD won't make more of them. I was lucky enough to get a Ryzen 2600X cheap at Micro Center earlier this year, and then the 2000 series disappeared off the shelves too. The Ryzen 3100 and 3300X? Nowhere to be found.
And what happens when Ryzen 3000 is no longer available? Guess what, Intel will be the budget option. Get used to it. Intel looks to be the budget option for the next while, not AMD.
With a new performance product you announce the high-end parts first; the mid-range and value items feed in later. There's going to be a supply of used 3000-series chips and probably budget chips with integrated graphics. But you have to factor in motherboard costs too, and for desktop gaming enthusiasts the Intel competition are not budget parts. A good budget part will be on 7-10 nm, designed for mass production with a lower TDP.
@Qasar: 1) AMD release lower-priced 5000 series parts, and 2) The prices on the whole range gradually move down as supply increases and demand is met.
You're right in one way, though - Intel will have to present themselves as the budget option to compete. That will require some price realignment on their part, though. Right now they're barely competing with the 3000 series on value. 😬
Spunjji: yep, but if those outperform anything by Intel, Intel will have to lower its prices, and may have to go even lower than what the AMD equivalents are selling for. I still doubt AMD is the budget option any more; I think for the time being, Intel will be.
@Qasar - They already outperform plenty of things by Intel. Intel don't really have to lower their prices, though, because demand for their products remains high regardless of their competitiveness. The joys of a duopoly.
We'll see, of course - you could be right - I just don't think you are. Intel like their margins too much.
I think Intel would, at least a little. After all, why spend $500 on Intel CPU X when for approximately the same $500 the AMD equivalent gets you better performance across the board? And before anyone says "but the Intel board is cheaper": the ROG Strix E Gaming, in X570 and Z490 versions for example, has a 30 buck difference at one store I go to, with the Z490 being less. 30 bucks for the board isn't much.
I read this article earlier, but I didn't read the comments because of crap like this. The 5xxx Ryzen series actually offers much better performance per dollar and per watt than both the previous gen and the competition. I am not an AMD employee or investor; I'm just a guy who is passionate about tech without an axe to grind. If anything, AMD has made its chips worth more, not less.
That's real nonsense!! Anyone wanting the fastest gaming part has always paid too much: the market leader's premium. The Ryzen 9 3900XT is slower than the silly-expensive, power-hungry Intel part with fewer cores. The R9 5900X/5950X is faster than the previous champ, which has been Intel since Core Duo.
MSRP doesn't mean a lot. I checked prices this week and lower-end SKUs can cost more, the bargains being chips like the Ryzen 5 5500X at €145, about 75% of the cost of a 5600. The Ryzen 3 3000X actually sells at €223, more than an R5 5600X.
But perf per CPU cost isn't the real factor, because +10% system performance can be had for less than +10% system cost, even if the +19% CPU performance came from a part costing +25%.
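A rough worked example with assumed numbers (a hypothetical $1,500 build with a $250 CPU; only the percentages come from the comment above) to illustrate the point:

```python
# Hypothetical build: $250 CPU + $1,250 for everything else (assumed numbers).
cpu_old = 250.0
cpu_new = cpu_old * 1.25        # the CPU itself costs +25%
rest    = 1250.0                # GPU, board, RAM, PSU, case, etc.

system_cost_increase = (cpu_new + rest) / (cpu_old + rest) - 1
system_perf_increase = 0.10     # what a +19% CPU uplift might translate to in games (assumed)

print(f"+{system_perf_increase:.0%} system perf for +{system_cost_increase:.1%} system cost")
# -> +10% system perf for +4.2% system cost
```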
Yeah, that's what it is effectively, begging over a couple of bucks. I don't mind AMD getting those 50 extra bucks. They did a good job and they earned it. Anything else means being a cheap bastard. Take a look at how much the 64-core Zen 2 Threadripper Pro costs, mister "50 bucks". Then you'll see how cheap these CPUs really are.
I am not sure why you think it's overpriced, Wreckage. Anyone currently sitting on a higher-end 3000 series either won't buy these CPUs or will pay the same premiums early adopters pay on Intel processors.
It's just $50 more for every model; I wouldn't call this overpriced. Given that Zen has had very aggressive prices so far and that Zen 3 will have no competition at all at launch, this seems like fair pricing. It's funny, people had no problem with $305 for the 4-core 7700K in 2017, and now they think $299 for 6 cores is overpriced. Before Zen, Intel's top Broadwell-E 6950X was over $1700. So, compared to what Intel offered before Zen, current prices are still fine.
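Putting those numbers side by side (a trivial sketch; the 5600X label and the 6950X's 10-core count are my additions, not from the comment):

```python
# Price per core, using the prices quoted above.
chips = {
    "4-core i7-7700K (2017)":         (305, 4),
    "6-core Zen 3 (5600X, 2020)":     (299, 6),
    "10-core i7-6950X (Broadwell-E)": (1700, 10),
}
for name, (price, cores) in chips.items():
    print(f"{name}: ${price} -> ${price / cores:.0f} per core")
```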
The bitching makes more sense if you assume that the most vocal posters are working from the logic that whatever Intel do is okay and whatever AMD do is bad.
I've seen a few people making genuine complaints, but for the most part it's just the standard bad-faith critiques from the usual suspects.
The R5 5600X, with its 65 W TDP and bundled cooler, will be available at $250 soon enough. Setting the launch MSRPs low would just cause a debacle like Nvidia's, with true market prices being higher.
They are starting at MSRP... they have a lot of Zen 2 inventory to move. After that, maybe they'll release cheaper models and/or sell below MSRP. But as of right now the only reason Zen 2 is a better value is the discounts! Early adopter fee.
That's bllx!! I often bought AMD because they were technically the best, from dual-CPU 32-bit SMP, Thunderbird and Opteron through to consumer 64-bit parts. And in graphics cards, every few years they had compelling smaller-die, lower-wattage options. Intel milked consumers, did shady deals and tried to force through crap like the Itanic and the P4. So yes, AMD often did offer better value, but AMD actually shaped the x86 64-bit market.
Amazing how the language of shills has changed from "AMD are second-class products" to "AMD consumers want second-class products". Have you considered that maybe they're aiming to *expand their market share* and that having premium offerings at an appropriate price is part of that process? If not, why not?
I don't know. I suspect the IPC boost is smaller than claimed, especially in integer; likely they have a major performance increase in one particular segment and less stellar gains in the others. This "19%" number looks like marketing; they claimed something in the 10-15% range for ages, and that can't suddenly change in a month or two. I suspect they include the frequency boost in their calculations. Best to wait for a real review to know the truth. Moreover, funny enough, they do not mention Milan; from the revenue point of view that is the real launch... but they are late. Focusing on games only is not a good strategy; someone has a very bad boy on arrival.
" I have suspect the IPC boost is smaller than claimed expecially in integer, " of course you would Gondalf, so far, amd as been pretty accurate with what it says about prev Zen cpus, i doubt they would start to fudge the numbers now. "This "19%" number looks like marketing, " when was the last time intel claimed ANY performance increases gen over gen that were above 10% ? i sure cant recall any. " Focusing on games only is not a good strategy, someone have a very bad boy on arrival. " um isnt this EXACTLY what intel and its fans have been touting for the last while ?? " sure amd are better in multi core and power usage, but intel is still the fastest for games " isnt that what was being said ??
There are figures around, and indeed media ops improve a little more than integer ops, but the story is 6-core parts performing at Zen 2 8-core level. Gaming performance was the focus because AMD already led in every other segment, but single-thread perf dominated gaming benchmarks, so AMD claiming that crown was NEWS. Everything else improving across the board is not news, and MSRP is meaningless as prices on SKUs fluctuate according to supply and demand.
AMD has plenty of mobile APUs now and will have more next year.
Nvidia/ARM will take years for regulators to approve and Apple switched from Intel to its own ARM chips, which was widely expected years before it was announced.
I was hoping for more detail on how this 7nm version is actually different from the previous process. Some of that 19% must be due to manufacturing improvements, right?
They've talked about this - and AnandTech has covered it - the gains come from a variety of sources, as Dr. Papermaster says in the interview. It's the same process with the same libraries, but with better tolerances now that it's more mature.
The uplift comes from a variety of things, but better cache design and an overall focus on latency seem to be among the bigger drivers.
Yep, the manufacturing part is very minor here! Most of it comes from architectural improvements. Most likely the clock speed increase of about 100 MHz is mainly down to manufacturing getting better.
Interesting how Papermaster basically dodged the (artfully put) question on future roadmaps. We knew about Zen what, two years before it launched? And Zen 3 and 4 were on the slides from at least the Zen+ launch, but now we've just got one more generation and then "here be dragons."
Ditto with AMD's graphics lineup - we'd been hearing about Navi and Navi 2 from when Vega launched (I think?), and now we're almost there and don't know what their plans are after that.
I don't believe mainstream desktops will get PCIe 5.0 in the foreseeable future. It's going to be for servers and maybe workstations.
The most cost-effective way to improve I/O bandwidth (if desktops even needed it) would be to backport features from PCIe 6.0, like PAM4 and FEC, to create something like PCIe 4.1. And even that isn't likely to happen any time soon.
After Navi 2, I would expect Navi 3 and Navi 4; after the 5000 series, I feel confident there will be a 6000 series. It is easy to talk roadmap. Where the rubber hits the road is when you promise performance and then meet it. Look for up to 200 cores, and not CUDA cores. I am excited for the new announcements.
It is important to note that Vermeer is the last consumer chip that is officially on the roadmap. Zen 4 on 5 nm was mentioned for server applications only. That is not to say that AMD does not have plans, but rather that those plans have not been made public.
I still expect to see Zen 3+ next year for desktops, and Zen 4 only for servers, with Zen 4 coming to desktops in 2022. The other possibility is that AMD doesn't release a new CPU at all next year for normal consumer markets, but Intel is not standing still, so that seems unlikely...
That's sort of been my guess. They want AM5 to be another long-lived socket, but that probably means DDR5 modules will have to be widely available at desktop-friendly prices at launch, which may be happening later than we'd have thought a couple of years ago.
So Zen 4 and AM5 will come together, but AMD might release a Zen 3+ in the interim - possibly with an improved I/O die (7nm? As Ian notes, idle power consumption on Zen 2's IO was pretty high), and depending on timing, maybe even with 5nm for the compute chiplets.
As a company, AMD has been pretty willing to spend the money to do a process shrink on existing designs - though we've mostly seen it on the graphics side (eg Radeon VII, using Vega, or even going from 14 to 12nm for the RX 590, which really was kind of a niche product).
Nice interview, thank you. Though a bit heavy on the marketing side, it was very interesting and makes one excited to read the microarchitecture article. Some lines of Mark's I enjoyed were: "Frequency is a tide that raises all boats," and "A great idea overused can become a bad idea."
After looking at the diagrams, thinking about it, and what Mark said, I'm guessing this: besides the usual widening of structures (retire queue, register files, and schedulers), AMD has likely added another decoder, up from 4-way, and increased dispatch width from the decoders (while keeping the later 6-wide integer and 4-wide FP), increased the size of the micro-op cache (+25/50%), and increased load-store bandwidth to 64 bytes/cycle for the different links, along with another AGU. And the retire width has perhaps gone up to 10 (from 8).
> There are some differences between one ISA and another, but that’s not fundamental
That's good PR spin, but not intellectually honest. x86 is more work to decode, much more heavily-dependent on register-renaming (due to smaller register file), has more implicit memory-ordering constraints than ARM, and other things.
Like it or not, ISA matters. AMD just doesn't want to acknowledge this, until they announce another ARM CPU or maybe RISC V (increasingly likely, given ARM's new master).
ISA might matter, but I really don't think AMD has to be too worried right now. ARM might be designing N1 cores, but Microsoft has only really bothered partnering with Qualcomm. Qualcomm doesn't seem to care enough to build a large enough die to compete at the desktop level, and until every developer has an ARM desktop at their cubicle, we won't see ARM threaten AMD's server and consumer desktop/laptop products.
Arm is already threatening AMD and Intel in the server space. For example, Graviton is now 10% of AWS and might be 15% by the end of the year.
The success of Graviton also shows it's perfectly feasible to use a cloud server without having a dedicated developer box. A fast server outperforms your desktop many times, particularly when building large software projects.
@Wilco1 : Please provide a reference to this claim re. AWS ARM share.
It's perfectly feasible to use a cloud server without a dev box, granted. But what that means is that most development will happen on the dev box (x86) and then be ported to ARM. Of course, virtual desktops change that argument, but I'm not yet aware of a virtual desktop environment running on ARM (not that it can't be done, I'm just not aware of AWS or anyone else offering one).
Arm development has been done on x86 boxes for decades - cross-compilation, ISA simulation and platform emulation is a thing. Having native developer boxes is nice but not as essential as some people are claiming. Arm developer boxes have existed for years, you can buy 32-core versions online: https://store.avantek.co.uk/arm-desktops.html
Thank you for the link. So the takeaway is that AMD and ARM are both taking market share from Intel, with AMD taking more at present. It's unclear whether the pie is growing (probably), with ARM and/or AMD taking a large chunk of the incremental gains, or whether they're replacing Intel (I suspect the former). Anyway, I'll read it in more detail later.
Re. cross development, sure it's a thing. On client and embedded. I've never heard anyone doing it for server. Sounds crazy to me.
People don't use expensive loud servers as their desktop - for server development you would login to a remote server already. Big software projects build so much faster on a proper server.
> It's perfectly feasible to use a cloud server without a dev box, granted. But what it means is that most development will happen on the dev box (x86) and ported to ARM.
That's a weird argument. A lot of development is already cloud-based, meaning it no longer matters what you have on your desk or in your lap.
Everyone already has ARM in their pocket, and a lot of kids and techies have ARM on their desktops in the form of a Raspberry Pi. ARM is making inroads into the laptop market and cloud, as well - key markets for AMD! While a company AMD's size could survive comfortably off just gaming PCs and consoles, investors would leave in droves, if AMD ceased to be a credible player in the cloud market.
Your point about Qualcomm is also strange, since they killed off their server core design group, and their mobile core design efforts were effectively extinguished even before that. The fact that MS used Qualcomm in laptops and Hololens says nothing about what they're doing towards cloud.
Like it or not, we've heard the same tired arguments from RISC vs CISC proponents for decades without a shred of proof that one side is superior. If you're a ballet dancer "do a pirouette" is more efficient than "get up on one toe, spin around your own axis and contract your extremities to increase the speed of your rotation" and there's no reason to think computers are any different. Like when you create instructions to do "one round of AES encryption" like in AES-NI, obviously it's less flexible and requires more dedicated logic but it's a lot more efficient than doing it with basic math. Just because you can do everything with simple instructions doesn't make it a good idea.
It's weird that you map x86 vs ARM to a CISC vs RISC argument. I look at it as a modern, planned city vs an old medieval town that slowly evolved into the modern era. There's no question which is more efficient, even if the latter has more lore, character, and is more picturesque.
ARM is winning on efficiency, dominating mobile and now cloud. That's the point Ian made, and it's the right one. Once you can get x86-levels of performance with ARM-levels of efficiency, the choice becomes obvious and x86 is quickly relegated to only markets with lots of legacy software.
Your example of crypto acceleration is odd, as mobile chips have long had crypto accelerators. A better example would be AI, which ARM has actively been adding ISA extensions to address.
BTW, since open source support for ARM is already top notch, I mean specifically PROPRIETARY legacy software - much of which could comfortably run in its own VM instance, in the cloud. Once you do that, you untether your primary platform from what's needed by an ever-shrinking number of legacy apps.
"Once you can get x86-levels of performance with ARM-levels of efficiency" Except you don't, do you? See Ampere. They claim to match Rome spec rate perf @210W, which is pretty much same TDP as rome. They need to operate the arm cores at the limit (3GHZ+) which is terrible for efficiency as those cores are designed for a slightly lower frequency. They have more efficient SKUs at lower frequencies and I have seen the claim that AWS runs at 2.5GHz, 110W. Efficient? Absolutely, Performant? Not so much (Yes, we've all seen the skylake SMT-thread vCPU vs graviton full-core CPU. Apples to oranges comparisons because AWS can offer SMT-off instances and they chose not to.)
So ARM winning on efficiency is a questionable claim if you also care about performance (say, SPEC rate score per CPU). But there's no question that Arm's designs are efficient when operated at their optimal voltage and frequency (2.5-2.8 GHz).
Bottom line: you can be efficient or performant, not both.
The 3+ GHz Altras give up efficiency to get more throughput, and that's a choice, since a halo part that beats Rome will get more interest. However, cloud companies are unlikely to use it; stepping down 10-20% in frequency increases efficiency considerably without losing much performance, and reduces cooling costs.
Graviton 2 proves you can be both efficient and performant at the same time. You only lose out if you push too much on one aspect.
Of course those goals are at odds, but you seem to presume that the latest x86 designs are at least on the same perf/efficiency curve as leading ARM implementations, in which case I'm guessing you missed this:
Also, ARM is not standing still. Previously, they've been focused on energy- and area- efficiency. It looks like they're finally beginning to stretch their legs, lending credence to some of Nuvia's claims:
The writing is on the wall, regardless of whether you choose to see it or AMD to acknowledge it. But I trust AMD knows what's up, and is surely cooking something in a back room - perhaps a successor to Jim Keller's fabled K12.
Oh, for sure. Check out that first link (the one about Nuvia).
But it doesn't necessarily spell doom for Intel or AMD. They are as well-positioned as anyone to pivot and start making ARM or RISC-V CPUs. For them, it's just a question of getting the timing right, as they don't want to undermine their cash cows while demand is still strong and uncertainty still surrounds any potential successors.
That said, Intel clearly has the most to lose from any shift away from x86.
Graviton 2 actually beats x86 cloud instances on performance and cost while using a fraction of the power. Basically x86 is totally uncompetitive and as a result Arm will take over the cloud in the next few years.
I've only seen AWS comparisons with Intel, hyperthreads vs. Graviton full cores. (1) The x86 top dog is Rome, not Skylake. (2) SMT naturally reduces per-thread performance, so it's apples to oranges. AWS chooses to run with SMT; it can be turned off, but they chose not to. (3) Pricing is AWS's choice to drive adoption (would you pay the same for having to do extra work porting to a new ISA? Devs have to be paid and it's not zero work. In fact, it's not zero work going from Intel to AMD, due to uarch differences needing different perf tuning. AWS/Azure also price AMD instances lower for the same reason. Porting to a new uarch and a new ISA is even more work, and the price reflects this reality.)
Single-threaded performance of Graviton 2 is similar to latest Intel and AMD CPUs, the low-clocked Graviton 2 is within 6% of a 3.2GHz Xeon Platinum on SPECINT. You could turn off SMT but you'd lose 30% throughput. How does that make x86 look more competitive? Adding more cores isn't an option either since x86 cores are huge. 128-core single-die Arm server CPUs are no problem however.
"Graviton 2 is within 6% of a 3.2GHz Xeon Platinum on SPECINT. " I'm assuming you mean Spec int rate. yeah sure, but Rome is ~2x Xeon platinum. Turning of SMT is a much easier tradeoff.
Chiplets allow AMD, at least, to add cores without worrying about yields. Intel might also go that route with Sapphire Rapids. The link you shared is a marketing fluff slide with no X or Y axis labels. I was surprised at the claim that ARM could have higher-IPC cores and more of them and still see an increase in throughput (assuming it's SPECint rate; it's well known that 8-channel DDR4 bandwidth gets maxed out in many cases with 64 cores). Sure enough, DDR5 and HBM2e show up for Zeus: https://images.anandtech.com/doci/16073/Neoverse-c... In other words, they're comparing a system from last year to a system from next year. DDR5 is not really available in production volume until next year. Yeah, better memory subsystems improve both 1T and socket performance! Who knew? The real competition is then Sapphire Rapids from Intel and Genoa from AMD, both probably next year. And yes, in both cases expect a bump in core count because DDR5 allows it. If history is any judge, server DDR5 will remain expensive until Intel launches their server.
Meanwhile, if Vermeer is launching in a few weeks, you may be sure that hyperscalers have had early revisions for months already, and it's likely that they will be receiving shipments of Milan later this year, if not already. "Bleak", "uncompetitive", etc. are not just gross exaggerations; they come across as fanboyish.
> Chiplets allow at least AMD to add cores without worrying about yields.
Even ARM's balanced cores (the N-series) are tiny, by comparison. You can stuff a heck of a lot of those on a single die and still end up with something smaller than mass market GPUs.
Of course, as the market grows and chiplet technologies continue to mature, it's only natural to expect some ARM CPUs to utilize this technology, as well.
No doubt x86 will dominate the market for at least a couple more years. It can still win some battles, but it won't win the war.
GPUs and CPUs yield differently. GPUs have lots of redundancy and are therefore tolerant of defects. CPUs can't handle faults in logic; at best, SRAM defects can be mitigated using redundancy, and core defects by turning cores off and selling the chip as a lower-core-count part. BTW, 7 nm has yet to see volume production on large GPUs; Navi might be the first. Xilinx has done large FPGAs, but those are even more redundant.
I don't care much for wars and battles and cults. I seem to have a low tolerance for fanboyism and for taking marketing slides as the final word. Arm in the cloud provides options to hyperscalers, and for sure it will crash Intel's fat datacenter margins. I doubt Arm "winning the war" is good for anyone except Arm/Nvidia if it means a situation similar to the last decade with Intel. Time will tell. CPUs have not been this competitive in a while.
At 331 mm^2, Vega 20 launched on 7 nm nearly 2 years ago!
And let's not get into dismissive talk of fanboyism and cults, because that cuts both ways. I've tried to make my case on facts and reason. I have no vested interest in ARM or anything else, but I would like to see the chapter of x86's dominance in computing come to a close, because I think it's really starting to hold us back.
You cannot make a credible claim that the current dominance of x86-64 is due to any technical merits of the ISA, itself. It's only winning due to inertia and the massive resources Intel and AMD have been funneling into tweaking the hell out of it. The same resources would be better spent on an ISA with more potential and less legacy.
OK, let me rephrase: a high-volume large GPU. Those Vegas were sold into datacenters or HPC, IIRC, and AMD is no Nvidia.
I made no claim regarding the superiority of the x86 ISA. I happen to agree with Papermaster that a high-performance CPU is largely ISA-agnostic. The area difference you see in ARM vs x86 today is mostly down to x86's need for speed, i.e. the desktop market. x86 succeeded because it could push the same design into all markets, and until very recently the same silicon too. This is true for AMD today: Vermeer's CCD is the same as Milan's. Ultimately, economy of scale made the difference; Intel's process dominance also played a huge role. The PC market is still massive, and as long as it exists as x86, the ISA will sustain itself if not thrive. It's just plain economics, but money makes the world go around and all that.
I'm mostly with deltaFx2 on this - the truth is that vendors basing products on ARM architectures will have to show a dramatic advantage in two or more areas (cost / power / performance) over x86 to stand a chance of toppling x86 anywhere outside of the datacentre. There's too much legacy and gaming software in the PC market to make the change palatable for a good while yet.
I never said anything about desktop computing, but I think the ARM-based laptop segment will be interesting to watch. Now that Nvidia owns ARM, we could start to see them push some real GPU horsepower into that segment. If those start to gain traction with some gamers, then it's not inconceivable to see inroads created into ARM-based desktop gaming.
Within 5 months of their release as the MI50/MI60, AMD started selling them for $700 as Radeon VII. Those had 60 of 64 CUs enabled and 16 GB of HBM2 - exactly the same specs as MI50. So, it sure sounds like yields were decent. Availability of those cards remained strong, until they were discontinued (after the launch of Navi).
I also didn't say you made such a claim - just that it's not there to be made. However, based on what you did just say, it seems you need to spend more time examining this article.
ISA has consequences. Some of the same factors that helped ARM trounce x86 in mobile are the same ones that give it more potential in performance-optimized computing. You can't solve all problems with scale and by throwing money at them. A bad design ultimately has less potential than a good one, and the moribund ISA is the one constant of all x86-64 CPUs.
I did see Nuvia's claim. They haven't normalized for the memory subsystem, so those results are being... erm... economical with the truth. (I can see *why* they did it; they want Series B funding.) Mobile parts support very specific memories and sizes, soldered right onto the motherboard or maybe even PoP. The PC market usually supports much larger memories, slotted, etc. (Apple solders, but I don't believe this is standard; corporate notebooks want the interchangeability). These sorts of things have a fixed cost even when not being used. Phones don't have to support 128 GB of memory; laptops do. Another difference is big.LITTLE in mobile, which helps idle power. big.LITTLE is a bad idea in servers, because if your server is idling, you're doing it wrong. There's other fun stuff that x86 systems have to support (PCIe, USB in various flavors, DisplayPort, etc.) that isn't in mobile. This is an intentional choice.
A counterexample to this is Ampere Altra. In their own testing (not a third-party test or SPEC submission), they claim to equal Rome 64c/128t using 80 cores at roughly the same TDP. I.e., when you throw in large memory controllers, 128 PCIe lanes, and all the paraphernalia that a server needs, the difference is in the noise at the same performance level.
x86 CPUs and ARM CPUs look nearly identical outside the decode unit and maybe parts of the LSU. Heck, modern ARM CPUs even have a uop cache. The main tax for x86 is variable-length decode, which is a serial process and hard to scale. The two tricks used for that are the uop cache (Sandy Bridge onwards) or two parallel decode pipes (Tremont). Beyond that, it's exactly the same in the core. Any difference in area is down to using larger library cells for higher current drive, duplicated logic to meet frequency goals, etc. The x86 tax is (1) high-frequency design and (2) supporting a wide variety of standard interfaces. #2 is also why x86 is successful. You can design your device and plug it into *any* x86 system from server to laptop using standard interfaces like PCIe. The x86 ISA may be closed, but the x86 ecosystem is wide open (before that you had vertically integrated servers like Sun/HP/DEC/IBM with their own software and hardware). Even ARM can't claim such interoperability in mobile. They're adopting x86 standards in servers.
Wow. I won't impugn your motives, but you sure seem invested in your skepticism.
> I can see *why* they did it, they want series B funding.
They're not some kids, fresh out of uni, looking to draw a regular paycheck. If they weren't convinced they had something, I really believe they wouldn't be trying to build it. What they clearly want is to get acquired, and that's going to require a high burden of evidence.
> These sort of things have a fixed cost even when not being used. > ... > There's other fun stuff that x86 systems have to support (PCIe, USB in various flavors, display port, etc) that are not in mobile.
The power figures quoted are labelled "Idle-Normalized", by which they must mean that idle power is subtracted off.
> Another difference is big.Little
Except they specify they're talking about single-core performance and explicitly label exactly which Apple cores they're measuring.
The most glaring point you're missing is the Y-axis. Take a good look at that. For better legibility, right-click those images and save or view them in another tab - they're higher-res than shown in that article.
> Counterexample to this is Ampere Altra.
That's Neoverse N1-based, which clearly hasn't achieved the same level of refinement as Apple's cores, and is also balancing against area-efficiency. According to ARM's announcements, the V1 and N2 should finally turn the tables. In the meantime, we can admire what Apple has already achieved. And you needn't be content with one Geekbench benchmark - you can also find various specbench scores for their recent SoCs, elsewhere on this site.
> x86 CPUs and ARM cpus look nearly identical outside the decode unit and maybe parts of the LSU.
Since I don't like to repeat myself, I'll just point out that I already made my case against x86 in my first post in this thread. It's tempting to think of CPUs as just different decoders slapped on the same guts, but you need more than such a superficial understanding to build a top-performing CPU.
> supporting a wide variety of standard interfaces.
A few ARM-based PCs (beyond Raspberry Pi) do exist and also use PCIe, USB, SATA, etc.
> They're adopting x86 standards in servers.
USB and PCIe aren't x86 standards. I'm less sure about SATA, but whatever. This point is apropos of nothing.
I don't like repeating myself either, same pinch! Different memory subsystems, different uncore, different design targets lead to different results. I have seen the Nuvia presentation in detail, and they sure know what they're doing, and I don't think it's an accident. What they present is likely true data but misleading.
The Qualcomm design discussed in Nuvia's data is the same as the one used in N1, but with smaller caches, so unless you're saying ARM's uncore is a raging fireball, I don't understand your argument. Let's say the Qualcomm core is 1.8 W for 900 points. For the same performance, the graph claims AMD/Intel burn 6 W. So the Qualcomm core consumes a third of the power of the AMD core for the same perf, and somehow, when put on a server SoC, all that supposed efficiency comes to naught? Or is the more likely explanation that Nuvia is comparing an apple to an orange?
Are you referring to this "x86 is more work to decode, much more heavily-dependent on register-renaming (due to smaller register file), has more implicit memory-ordering constraints than ARM, and other things." ?
* x86 is more work to decode in the absence of uop caches. That's the whole point of uop caches, to not have to do that work. Turns out ARM is complex enough that it too needs uop caches.
* "much more heavily-dependent on register-renaming (due to smaller register file)" That's just a mish-mash of terminology you don't seem to understand. ALL OoO machines are dependent on register renaming. ARM has to rename 31 architected registers, x86 has to rename 16. That's not the register file, that's architected registers. Skylake/Zen2 have 180E integer register file to hold renamed physical registers (rob is 224/256 can't remember). I can't find what A77's PRF size is, but the ROB is 128 entries, so its going to be ~100ish entries. PRF size minus architectural register size is usually what's available for OoO execution. So which design has a bigger instruction window? And what's ARM's advantage here? If it's 6-wide issue, it has to rename 6 destinations and 6*2 sources. You have to build that hardware whether it's ARM or x86 or MIPS or SPARC or Alpha or Power or IBM Z.
* "has more implicit memory-ordering constraints than ARM, and other things" yes it does. ARM has some interesting memory ordering constraints too, like loads dependent on other loads cannot issue out of order. Turns out that these constraints are solved using speculation because most of the time it doesn't matter. x86 needs a larger tracking window but again, it's in the noise for area and power.
Are you sure you know what you're talking about? Because it doesn't look like it.
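To make that back-of-envelope arithmetic explicit (using the rough, possibly misremembered figures quoted above, not official specs):

```python
# Rough estimate of the OoO "instruction window" implied by the figures above:
# physical registers left for in-flight results once one physical register is
# pinned per architected register. Numbers are the comment's approximations.
def ooo_window(prf_entries: int, architected_regs: int) -> int:
    return prf_entries - architected_regs

skylake_zen2 = ooo_window(prf_entries=180, architected_regs=16)   # ~164
cortex_a77   = ooo_window(prf_entries=100, architected_regs=31)   # ~69 (PRF size is a guess)
print(skylake_zen2, cortex_a77)
```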
> That's the whole point of uop caches, to not have to do that work.
uOP caches are a hack. They work well for loops, but not highly-branchy code. Here's an idea: why not use main memory for a uOP cache? In fact, maybe involve the OS, so it can persist the decoded program and save the CPU from having to repeat the decoding process. Kinda GPU-like. You could even compress it, to save on memory bandwidth.
> Turns out ARM is complex enough that it too needs uop caches.
I wonder how much of that has to do with supporting both ARMv7-A and ARMv8-A, as all of their in-house 64-bit cores do. However, at least a couple custom cores don't bother with ARMv7-A compatibility.
> That's just a mish-mash of terminology you don't seem to understand.
*sigh* I meant ISA registers, obviously. And it's not 16 - ESP is used for stack, frame pointers usually occupy EBP, and some instructions are hard-wired to use a couple of the others, from what I dimly recall. Anyway, the number of software-visible registers restricts the types and extent of optimizations a compiler can do -- at best, forcing the CPU to repeatedly replicate them through speculative execution and out-of-order execution -- at worst, causing register spilling, which generates extra loads and stores.
> x86 needs a larger tracking window but again, it's in the noise for area and power.
You don't think it generates more stalls, as well?
Something else that comes to mind as a bit silly is AVX. Perhaps you'll insist it's trivial, but all the upper-register zeroing in scalar and 128-bit operations seems like wasted effort.
> Are you sure you know what you're talking about, because it doesn't look like it.
I'm not a CPU designer, but you'd do well to remember that the performance of a CPU is the combination of hardware *and* software. The compiler (or JIT engine) is part of that system, and the ISA is a bottleneck between them.
I'm only a layman in all this, but if ARM is aiming for high performance, I think its cores are going to get increasingly complex. On a side note, Apple didn't speak about the power efficiency of the A14 on the recent keynote, as far as I'm aware.
On the x86 length-decoding bottleneck, I was reading that there used to be another interesting trick (used by the K8/10/Bulldozer and Pentium MMX) marking instruction boundaries in the cache. Not sure if that was particularly effective or not, but I suppose the micro-op cache turned out to be a better tradeoff on the whole (removing the fetch and pre/decoding steps altogether when successful, saving power and raising speed).
Who knows, perhaps AMD and Intel could re-introduce the marking of instruction boundaries in cache, if it were feasible and made good sense, further toning down x86's length-decoding troubles, while keeping the micro-op cache as well.
> if ARM is aiming for high performance, I think its cores are going to get increasingly complex.
Yes. A couple years ago, ARM announced the N-series of cores that would target server and infrastructure applications. Recently, they announced the V-series of cores that would be even larger and further performance-optimized.
> perhaps AMD and Intel could re-introduce the marking of instruction boundaries in cache, if it were feasible and made good sense
If it still makes sense, I would assume they're still doing it. Do you know any differently?
That said, I don't really get why CPUs wouldn't basically expose their uOPS and decoder, so the OS can just decode a program once, instead of continually decoding the same instructions many times during a single run. Sure, it'd require a significant amount of work for the CPU designers and OS, but seems like it'd be a significant win on both the performance and efficiency fronts. And perhaps you could even get rid of the in-core uOP cache, with corresponding simplifications in core design and area-efficiency. Again, I think GPUs have already largely proven the concept, though I'm not necessarily proposing to ditch a hardware decoder, entirely.
An intriguing idea, definitely, but probably too difficult to implement and a maintenance nightmare.
The OS would have to maintain multiple decoders/compilers for different CPU brands, each of which likely has its own internal micro-op format, along with whatever quirks involved. There might be a delay when the program is first decoded. And likely, more bugs would be stirred into the soup.
Exposing the internal format also means going against abstraction and tying things to an implementation, and the CPU designer wouldn't be able to change much of that "interface" in the future. I think an abstract ISA, like x86 or ARM, would be a better choice, but that's just my view.
> If it still makes sense, I would assume they're still doing it. Do you know any differently?
Perhaps it was no good, or a micro-op cache was a better tradeoff. According to Agner Fog, Zen abandoned it and the P6 (and later) never used it. Pentium MMX, yes. Not sure about Pentium classic, 486, etc. Marking instruction boundaries in the cache alleviated a critical bottleneck of x86 decoding: working out the instruction length, which varies.
> The OS would have to maintain multiple decoders/compilers for different CPU brands,
Again, GPUs already do this. They each ship their own shader compilers as part of their driver. However, CPUs needn't follow that model, exactly. CPUs would still need some hardware decoding capability, in order to boot and support legacy OSes, if nothing else. But the CPU can continue to do the primary ISA -> uOP translations with hw-accelerated engines - just in a way that's potentially decoupled from the rest of the execution pipeline.
> Exposing the internal format also means going against abstraction and tying things to an implementation,
As with GPUs, the actual uOP format could continue to be opaque and unknown to the OS. I'm talking about this as distinct from actually changing the ISA.
I see what you're saying now. Something like this: At the OS level, things are still x86, for example. Then, going one step down, we've got the driver which takes all that and translates it into, say, AMD format micro-ops, using the hw-accelerated units on the CPU, caching the decoded results permanently (so we only compile once), and sending them to the pipeline to be executed.
Quite an interesting idea. Roughly equivalent to what CPUs already do, but with the critical difference of stashing the results permanently, which should bring about massive gains in efficiency. Come to think of it, it's a bit like a persistent micro-op cache.
It would come down to this then. Will the performance of this method be a drastic improvement over the current model (which, owing to op cache, might be able to come close)? Will the better performance or lower power consumption justify the extra development work involved? Or is it just better to keep things as-is by paying that extra power?
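As a purely illustrative sketch of that "persistent micro-op cache" idea (nothing here corresponds to a real CPU or OS interface; decode_to_uops() is a hypothetical stand-in for whatever hardware-accelerated translation would be exposed):

```python
import hashlib

persistent_cache = {}  # in a real system this would live on disk or in main memory

def decode_to_uops(code_bytes: bytes) -> list:
    # Hypothetical stand-in: pretend each byte decodes to one "uop".
    return [f"uop_{b:02x}" for b in code_bytes]

def fetch_decoded(code_page: bytes) -> list:
    key = hashlib.sha256(code_page).hexdigest()
    if key not in persistent_cache:           # decode only on first sight
        persistent_cache[key] = decode_to_uops(code_page)
    return persistent_cache[key]

page = bytes.fromhex("90489090")              # arbitrary stand-in bytes
fetch_decoded(page)                           # first run pays the decode cost
fetch_decoded(page)                           # later runs reuse the stored uops
print(len(persistent_cache))                  # 1
```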
Yes, it seems @deltaFx2 was on point with both the Transmeta and Denver citations. However, I hadn't considered self-modifying code, though I thought that had somewhat recently fallen out of favor, due to security concerns and counter-measures against things like buffer-overrun exploits.
I guess we're not talking about functions that literally modify themselves, but rather JIT-type cases where you have something like a JavaScript engine acting as an in-memory compiler. That strikes me as a slightly more manageable problem, especially if the CPU is capable of supporting a giant uOP cache in main memory.
Read about Denver and Crusoe this morning and was surprised to see this sort of thing had been done before. Faintly, I do recall reading about the Crusoe years ago but forgot all about it.
"uOP caches are a hack. They work well for loops, but not highly-branchy code." That's not true at all. The uop cache is a decoded instruction cache. It basically translates variable length x86 instructions to fixed length micro-instructions or micro-ops, so that next time around, you need to do less work to decode it so you can supply more instructions/cycle with a much shorter pipeline. The hard part of x86 is not the number of instructions (even arm has thousands) but the fact that it has variable length, prefixes, etc. So you can think of it as a copy of the instruction cache if you like. Caches naturally cache repetitive accesses better than random access. If you mean uopcache is bad for large instruction cache footprint workloads, yes, but that's true of the instruction cache as well. Uopcaches cut down the branch mispredict penalty which is one reason ARM uses it, I'd guess. ARM, despite what you might have heard, is not a 'simple' ISA, so it also benefits from the slightly shorter pipeline out of the opcache.
>> Here's an idea: why not use main memory for a uOP cache? That was Transmeta / Nvidia's Denver. Both had binary translators that rewrote x86 to VLIW (Transmeta) or ARM to VLIW ARM (Denver). The software+VLIW approach has its drawbacks (predication to achieve the large basic blocks necessary for VLIW to extract parallelism). However, it's certainly possible to rewrite code in uop format and have the CPU suck it up. I've been told doing this in a static binary is hard, but it might be possible to do the Denver thing on an x86 CPU and cut out all the fancy compiler tricks. Not infeasible, but the question always arises whether it's just fewer points of failure if you do it in hardware (self-modifying code, for example - it's a thing for VMs, JITs, etc. That's why ARM now has a coherent I-cache for server CPUs).
>> *sigh* I meant ISA registers, obviously.<snip> Ok, so x86 32-bit had a problem with 8 arch regs with ax/bx/cx/dx reserved for specific uses. x86-64 largely does away with that although the older forms still exist. A problem with few registers is that you have to spill and fill a lot but it's not as big a deal in x86-64 and newer x86 cpus do memory renaming, which cuts out that latency if it occurs. The argument for many architected registers is strong for in-order cores. For OoO, it's debatable.
>> You don't think it generates more stalls, as well? No. It generates more pipeline flushes, but they're so rare that it's not worth worrying about. The extra work x86 cores have to do is to wait to make sure that out-of-order loads don't see shared data being written to by another core in the wrong order (TSO). So they're kept around for long enough to make sure this is not the case. It's just bookkeeping. Most of the time, you're not accessing shared data and when you are, another core isn't accessing it most of the time so you toss it away. When it happens, the CPU goes oops, lets flush and try again. ARM claimed to do barrier elision to fix the problem created by weak memory models (DSBs everywhere) and it may be that they are doing the same thing on a smaller scale. I could be wrong though, I haven't seen details.
>> Something else that comes to mind as a bit silly is AVX. Perhaps you'll insist it's trivial, but all the upper-register zeroing in scalar and 128-bit operations seems like wasted effort.
Ah, but there's a reason for that. ARM used to not do that in 32-bit NEON and quickly realized it's a bad idea; x86 learned from 16-bit that it was a bad idea. Preserving the upper bits when you are only using the lower bits of a register means that you have to merge the result from the previous computation that wrote the full 256-bit register (say) with your 64-bit result. It creates false dependencies that hurt performance. 32-bit NEON had a similar thing where a register could be accessed as quad, double hi, double lo, or single 0/1/2/3. It sucks, and in one implementation of ARM (A57?) they stalled dispatch when this happens. AArch64 zeros out the upper bits just like x86-64.
>> I'm not a CPU designer, but you'd do well to remember that the performance of a CPU is the combination of hardware *and* software. Agreed. I'm saying there's nothing inherently lacking in x86 as an ISA. It's not 'pretty' but neither is C/C++. Python is pretty until you start debugging someone else's code. Arguably neither is linux if you ask the microkernel crowd. But it works well.
First, I'd like to thank you for your time and sharing your insights. Also, I have much respect for how long Intel and AMD have kept x86 CPUs dominant. Further, I acknowledge that you might indeed be right about everything you claim.
> The uop cache is a decoded instruction cache.
I understand what they are and why they exist. It's probably an overstatement to call them a hack, but my point is that (like pretty much all caches) rather than truly solving a problem in every case, they are an optimization of most cases, at the expense of (hopefully) a few. Even with branch-prediction and prefetching, you still have additional area and power overhead, so there's no free lunch.
> ARM, despite what you might have heard, is not a 'simple' ISA
As I'm sure you know, ARM isn't only one thing. ARMv8-A does away with much of the legacy features from ARMv7-A. Not much is yet known about ARMv9. Of course, ARMv8-A is probably going to be the baseline that all ARM-based servers and notebooks will always have to support.
> That was transmeta/nvidia denver. Both had binary translators that rewrote x86 to VLIW
So, the problem with those examples is that you have all the downsides of VLIW clouding the picture. Nvidia's use of VLIW is the most puzzling, since it really only excels on DSP-type workloads that are much better suited to the GPUs they integrated in the same SoC!
Interestingly, I guess what I was talking about is somewhere in between that and Itanium, which had a hardware x86 decoder. Of course, we know all too well about Itanic's tragic fate, but I was bemused to realize that I'd partially and unintentionally retread that path. And I still wonder if EPIC didn't have some untapped potential, like if they'd added OoO (which is actually possible, with it not being true VLIW). Late in its life, the product line really suffered from the lack of any vector instructions.
> ARM used to not do that in Neon 32-bit and quickly realized it's a bad idea.
SSE also seems to leave the upper elements unchanged, for scalar operations. My concern is that zeroing 224 bits (in case of AVX) or 480 bits (for AVX-512) will impart a slight but measurable cost on the energy-efficiency of scalar operations.
Finally, I should say that this has been one of the best and most enlightening exchanges I've had on this or probably any internet forum. So, I thank you for your patience with my thoughts & opinions, as well as with the 1990's-era commenting system.
Eh, this isn't about whether x86 designers are good or not. My point was that there isn't anything inherent in the ARM ISA that makes it power- or area-efficient; it's implementation choices by ARM that achieve that. Not sure why it bothers me, but this canard about ISAs making a significant difference to power has to die, and it just doesn't.
>> ARMv8-A does away with much of the legacy features from ARMv7-A. AArch64 is not simple. The RISC-V folks called ARM CISCy in some paper (can't find it; you'll have to take my word for it - they said something to the effect of "look how CISCy ARM can be"). RISC has morphed to mean whatever people want it to mean, but originally it was about eschewing microcode. ARM has microcode. No, not a microcode ROM; that's just one way to implement microcode. Microcode is a translation layer between the ISA and the hardware. The RISC philosophy was to expose the microarchitecture in the ISA to software so as to have no microcode (and simple instructions that execute in a single cycle, to achieve highly efficient pipelining and one instruction per cycle; cutting out microcode was a way of achieving that).
ARM has common instructions that write 3 GPRs. I don't think x86 has any 3-GPR-writing instructions (even 2 are rare and special - 128-bit multiply and division, IIRC). See LDP with auto-increment, where the CPU loads 2 GPRs from memory and increments the base pointer. Common loads (LDR) support auto-increment, so they write two destinations.
ARM has instructions that can be considered load-execute operations (i.e. operations with memory sources/destinations). This was a huge no-no in RISC. Consider LD4 (single structure) which reads a vector from memory and updates a particular vector lane of 4 different registers (say you wanted to update the 3rd element in 4 packed byte vector registers). The most likely implementation is going to be microcoded with operations that load, read-modify-write a register. See: https://developer.arm.com/docs/ddi0596/f/simd-and-...
There's other weirdness if you look closely... adds with shifts, loads with scaled index, flag registers (oh my!) etc. just like x86 and in some instances more capable than x86. Perfectly fine but the RISC folks get an aneurysm thinking about it. OoO machines I think benefit from having more information sent per instruction rather than less.
What AArch64 did was get rid of variable length (Thumb/ARM32), super-long microcoded sequences (check out LDMIA/STMIA/PUSH/POP), predicated instructions, shadow registers, and some other stuff I can't remember. They did not make it RISC-y; it very much retained the information density. Now, unlike x86, ARM can't add prefixes (yet), so you have only 4 bytes that blow up into thousands of instructions. So while it doesn't have the variable-length problem of x86, it does have the 'more work to decode' problem, and hence its pipeline is longer than, say, MIPS. Hence the uop cache for power savings (those 3-dest/2-dest instructions are going to get cracked, so why do it over and over?).
>> Nvidia's use of VLIW is the most puzzling. Nvidia hired the same Transmeta people. If you have a hammer, every problem looks like a nail. Also, some people are like the Monty Python Black Knight: "'tis only a flesh wound". In fairness, they wanted to do x86, and that needs a license from Intel (not happening), but binary translation would work. I have no idea why VLIW was chosen again. I had heard that around the same time Intel was looking into x86-to-x86 binary translation to help Atom. Probably went nowhere.
>> we know all too well about Itanic's tragic fate. Tragic is not the adjective I'd choose. Itanium not only sank but dragged a bunch of other viable ISAs down with it (PA-RISC, Alpha, possibly even SPARC, as Sun considered switching to Itanium, IIRC). Itanium should've been torpedoed before it left the harbour. It belongs to the same school of thinking (hopefully dead, but who knows) that all problems in hardware can be solved by exporting them to software. RISC was that (branch delay slots, anyone?), VLIW was that, Itanium was that (see the ALAT). If only compilers would do our bidding. And maybe they do in 80% of cases, but it's the 20% that gets you. Itanium has 128 architectural registers plus some predicate registers. Going out-of-order would be impossible: too many to rename, needing an enormous RAT, PRF, etc., while x86 would match it with fewer resources. They went too far down the compiler route to be able to back off.
You're right about SSE, I forgot. Nice discussing this with you too; one often gets zealots on forums like this, so it's a welcome change. Hope it helped.
> this canard about ISAs making a significant difference to power has to die and it just doesn't.
You make compelling points, but somehow it's not enough for me.
An interesting experiment would be to rig a compiler to use a reduced set of GP registers and look at the impact it has on the benchmarks of a couple leading ARM core designs. That should be trivial, for someone who knows the right parts of LLVM or GCC.
I don't know of an easy way to isolate the rest. Maybe a benchmark designed to stress-test the memory-ordering guarantees of x86 could at least put an upper bound on its performance impact. But, the rest of the points would seem to require detailed metrics on area, power-dissipation, critical path, etc. that only the CPU designers probably have access to.
> Aarch64 is not simple. The RISC-V folks called ARM Ciscy in some paper
Thanks for the very enlightening details. I don't have much to say about this subject, and it seems to me that many discussions about ISAs and uArchs veer off into unproductive debates about orthodoxies and semantics.
Again, I appreciate your specific examples, and many of us probably learned a few things, there. I definitely see the relevance to my earlier speculation about decoding cost.
> OoO machines I think benefit from having more information sent per instruction rather than less.
If information-density is the issue, is it not solvable by a simple compression format that can be decoded during i-cache fills? Perhaps it would be smaller and more energy-efficient than adding complexity to the decoder, and not add much latency in comparison with an i-cache miss.
> Itanium not only sank but dragged a bunch of other viable ISAs
We can agree that the outcome was tragic. At the time, I very much drank the EPIC Kool-Aid, but I was also programming VLIW DSPs and very impressed with the performance. One explanation I heard for its failure is that Intel's legal team had patented so much around it that an unlicensed competing implementation was impossible, and big customers & ISVs were therefore wary of vendor lock-in.
> It belongs to the same school of thinking (hopefully dead but who knows) that all problems in hardware can be solved by exporting it to software.
As for the school of thought being dead, this is worth a look (with a number of the more interesting details hiding in the comments thread):
I'd imagine Google would somehow be involved in the next iteration of software-heavy ISAs.
> Itanium has 128 architectural registers plus some predicated registers. Going out-of-order would be impossible.
At the ISA level, my understanding is that EPIC allows for OoO and speculative execution - all the compiler does is make the data-dependencies explicit, leaving the hardware to do the scheduling (which is required for binary backwards-compatibility). Also, I'm not clear why they'd require renaming for smaller levels of OoO - it seems to me more an issue in cases of extensive reordering, or for speculative execution. Perhaps the compiler would need to encode an additional set of dependencies on availability of the destination registers?
> You're right about SSE.
Something about the way AVX shoehorns scalar arithmetic into those enormous vector registers just feels inefficient.
Rome will struggle, I expect, at ~100W TDP due to the MCM design (inefficient). However, from a TCO standpoint, high performance at (reasonably) higher power generally wins because of consolidation effects (fewer racks and whatnot for the same throughput), unless you are power-constrained. Anyway, I'll leave it at that.
>If information-density is the issue, is it not solvable by a simple compression format that can be decoded during i-cache fills?
Let's say you have a jump to target 0xFEED for the first time. How would you find the target instruction if it were compressed? You'd need some large table to tell you where to find it and someone would have to be responsible for maintaining it (like the OS, because otherwise it's a security issue). And for large I-cache footprint workloads, this could happen often enough that it would make things worse.
The ideal ISA would be one that studies the frequencies of various instructions and Huffman-encodes them down for I-cache density. ISAs are never designed that way, though.
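As a back-of-the-envelope illustration (the frequency table is invented, and this scores only the encoding, not a real ISA), the standard Huffman identity shows how far a frequency-aware encoding could shrink things versus a fixed-width one:

```cpp
// Toy calculation: given a hypothetical opcode-frequency profile, compute the
// average Huffman code length and compare it against a fixed 32-bit encoding.
// Uses the identity that total weighted code length equals the sum of the
// weights of the internal nodes created while merging.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

int main() {
    // Invented frequencies for eight "instruction classes" (they sum to 1.0).
    std::vector<double> freq = {0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.05};

    std::priority_queue<double, std::vector<double>, std::greater<double>> pq(
        freq.begin(), freq.end());
    double avg_bits = 0.0;
    while (pq.size() > 1) {
        double a = pq.top(); pq.pop();
        double b = pq.top(); pq.pop();
        avg_bits += a + b;   // every merge adds one bit to each symbol beneath it
        pq.push(a + b);
    }
    std::printf("average code length: %.2f bits (vs. 32-bit fixed)\n", avg_bits);
    return 0;
}
```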
The fundamental problem with compiler-based solutions to OoO is that they cannot deal with unpredictable latencies, cache latency being the most common case. OoO machines deal with them fine. Nvidia's Denver was particularly strange in that regard, as they should have known why Transmeta didn't work out, yet went with the same solution without addressing that problem (static scheduling can't solve it; oracle prefetching could, but that doesn't exist yet).
VISC: Pay attention to the operating frequency in addition to IPC. If you run your machine at 200MHz, for example, you can get spectacular IPC because memory latency is (almost) constant and your main memory is only (say) 20 cycles away instead of 200 cycles away. The article says their prototype was 500MHz. Intel acquired them for next to nothing (~$200M?), so it wasn't like they had something extraordinary. Likely an acquihire. Can't say much about Elbrus as I can't tell what they're doing or how well it performs. If I had to bet, I'd bet against it amounting to much. Too much history pointing in the opposite direction.
>> At the ISA level, my understanding is that EPIC allows for OoO and speculative execution
Oh yeah, you can probably do OoO on VLIW ISAs too. I'm saying it has too many architected registers. You can solve it by having a backing store for the architectural registers and copying things into the main PRF when needed for OoO execution and all, but it's not efficient and will be trounced by an x86 or Arm design. EPIC only made sense if it reduced the number of transistors spent on speculative execution and gave that to caches and other things outside the core. Otherwise one might as well stick to an ISA with a large install base (x86). As a concept, EPIC was worth a shot (you never know until you try), but HP/Intel should've known well in advance that it wouldn't pan out and killed it. Intel wanted to get in on big iron and thought Itanium was the ticket, plus it didn't have to compete with AMD and Cyrix and whoever else was around then in x86.
> How would you find the target instruction if it were compressed?
I'm not familiar with the current state of the art, but it does seem to me that you'd need some sort of double-indirection. I'd probably compress each I-Cache line into a packet, and you have some index you can use to locate that, for a given offset.
You could do some optimizations, though. Like, what about having the index store the first line, uncompressed, and then actually encode the location of the next line? That would avoid the latency hit from double-indirection, only adding the overhead of one memory offset, which would be amortized over fetches of subsequent lines. Interleaving offsets in with the code (or at least at all of the branch targets) would add a little bloat and slightly complicate indexing, but I think not much.
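To make the indexing cost concrete, here's a purely hypothetical sketch of the lookup a fetch unit (or whoever owns the table) would have to do with per-line compression - every name and size in it is invented for illustration:

```cpp
// Hypothetical sketch of the double-indirection being discussed: with per-line
// compression, a branch to an arbitrary address must first be translated through
// a line index before the compressed bytes can be fetched.
#include <cstdint>
#include <utility>
#include <vector>

struct CompressedImage {
    static constexpr std::uint64_t kLineBytes = 64;  // uncompressed i-cache line size
    std::uint64_t image_base = 0;                    // first address covered by the index
    std::vector<std::uint32_t> line_offset;          // line_offset[i] = start of compressed line i
    std::vector<std::uint8_t>  storage;              // compressed code bytes

    // Branch target -> (offset of its compressed line, byte offset within the
    // decompressed line). The extra table walk is the indirection; someone
    // trusted (OS or loader) would have to build and protect this table.
    std::pair<std::uint32_t, std::uint32_t> locate(std::uint64_t target) const {
        const std::uint64_t line = (target - image_base) / kLineBytes;
        const auto within = static_cast<std::uint32_t>((target - image_base) % kLineBytes);
        return {line_offset[line], within};
    }
};
```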
> The ideal ISA would be one that studies the frequencies of various instructions and huffman-encodes them down for Icache density.
I know, but if you're only compressing the opcodes, that still won't give you optimal compression.
> The fundamental problem with compiler based solutions to OoO are that they cannot deal with unpredictable latencies.
Yes, we're agreed that some runtime OoO is needed (unless you have a huge amount of SMT, like GPUs). I never meant to suggest otherwise - just that compilers (or their optimizers and instruction schedulers) could play a bigger role.
> Can't say much about Elbrus as I can't tell what they're doing or how well it performs.
If you're interested, check out the comments thread in that article. Some interesting tidbits, in there. Plus, about as much (or maybe even a little less) politics as one would expect.
Thanks, again, for the discussion. Very enlightening for myself and doubtlessly a few others.
I should add that, as manufacturing process technology runs out of steam, I see it as an inevitability that the industry will turn towards more software-heavy approaches to wring more speed and efficiency out of CPUs. It's mainly a question of exactly what shape it takes, who is involved, and when a level of success is achieved that forces everyone else to follow.
"USB and PCIe aren't x86 standards. " They are Intel IP, standardized. Because unfortunately that's what's needed for wide adoption and intel has a strong not-invented-here culture. There's committees and all that for this but intel is the big dog there. Just like qualcomm is for wireless IP standards. PCIe came about because AMD was pushing non-coherent Hypertransport. So intel decided to nip that in the bud with PCI/PCIe. A long time ago, AMD pushed coherent HT as well, which was adopted by some lke Cray (CXL/CCIX, 20 yrs ago) but AMD's shoddy execution after Barcelona killed it as well. CXL came about because there's no way intel would do CCIX (modified AMBA)
Graviton 2 is close to the fastest EPYC 7742 on both single-threaded performance and throughput despite running at a much lower frequency. Turning off SMT means losing your main advantage. Without SMT, Rome would barely match Graviton 2. Now how will it look compared to 80 cores in Ampere Altra? Again, how does that make x86 look more competitive?
Milan maxes out at 64 cores (you need to redesign the IO die to increase cores, so chiplets are no magic solution) and will still use the basic 7nm process, so doesn't improve much over Rome - rumours say 10-15%.
The facts say 2021 will be the year of the 80-128 core Arm servers. If you believe AMD/Intel can somehow keep up with 64 cores (or less) then that would be fanboyism... DDR5 will help the generation after that and enable even higher core counts (Arm talks about 192 cores using DDR5), but that's going to be 2022.
What good is 128 cores if DDR4 bandwidth is insufficient? Niche use cases where data fits in caches?
Ampere's 80 cores "beat" Rome in their marketing slides. Where's the official SPEC submission? Third-party tests? Nope. Can it be produced in volume? If it were, we'd see general availability. A Silicon Valley startup looking for an exit is prone to gross exaggeration.
Rumors are just that. Rumors were that Zen 3 had a 50% FP IPC uplift. Wait and see.
Would you say the same about Rome? It has 128 threads too - everybody is hitting the same bandwidth limits, but it simply means scaling is non-linear (rather than non-existent).
Right now it looks like all the Altra chips are being gobbled up by their initial customers - Oracle announced Altra availability early next year. So benchmarks should come out soon.
You said it yourself: if the rumour is true that the 64-core Milan gains only 10-15% over Rome, then that only makes sense if DDR bandwidth is constrained. Otherwise how is it that IPC is up 19% (measured at 4GHz, so maybe higher at typical server frequencies) but the socket-level gain is less? And you said it yourself too: SMT buys only 20-30% uplift, whereas 2x the cores should scale nearly linearly in an unconstrained system. If 128 SMT threads are starved, would 128 full cores be fed?
As for doom, IIRC nobody claimed AMD or Intel will go bust. However you can't deny there is a shift happening across the industry towards Arm. And it's happening fast as the Graviton share shows.
@Wilco1 - It indicates - vaguely - where ARM predict their designs to be next year. Pointing to one of the haziest marketing slides I've ever seen significantly weakens your argument.
I didn't say anybody claimed about going bust, either. I simply don't believe the shift is happening "across the industry", and I don't believe x86 is at some massive disadvantage outside of mobile and datacentres, where power-per-core is king once a certain level of performance is reached.
The weaker memory ordering of ARM is a disadvantage for ARM. This means that porting multi-threaded applications from Intel/AMD is hard in general. Of course, ARM (and ARM implementors like Nuvia) can fix this by strengthening the memory ordering in their implementations. Will they do it?
Not to put too fine a point on it, but that's basically misinformed nonsense. There's absolutely no problem running code written according to standard and well-specified threading APIs.
The problems only occur when programmers try to outsmart the OS by doing things like userspace spinlocks -- a practice best summed up by none other than Linus Torvalds:
"Do not use spinlocks in user space, unless you actually know what you're doing. And be aware that the likelihood that you know what you are doing is basically nil."
Userspace locking has all kinds of pitfalls and is usually only a win in very narrow and unreliable circumstances. There are very many reasons not to do it, with portability problems being only one.
Also, as most modern ISAs are considerably more weakly-ordered than x86 (including POWER and RISC-V), code depending on x86's strong memory ordering is fundamentally non-portable. Various ordering guarantees of several ISAs are conveniently summarized, here:
All code that does not communicate or synchronize memory accesses through API calls may be affected, in particular code that employs lockless or waitless techniques (and much of that code probably is). Yes, such code would not be portable to ARM and Power as they are, which means that ARM and Power will not be able to serve as replacements for Intel and AMD in general.
If you want to replace a competitor, you have to give at least all the guarantees that the competition gives, not fewer; otherwise your product will be perceived as incompatible. Your "blame the programmer for not following standards" strategy won't help if the program runs fine on the incumbent system. Yes, giving more guarantees has its costs, but the benefit is worth it: your chips are compatible with existing code. OpenPOWER has understood this for byte ordering and switched to little-endian; I have not followed it enough to know whether they have done the same for memory ordering.
Vast amounts of cell phone code running on ARM does not mean anything for servers.
In order to work correctly, you have to use some API (even if you define a custom one, as e.g. Linux does). Some primitives are NOPs on some ISAs and are simply required to stop the compiler from reordering memory accesses. If you don't use the API or don't use it correctly, your code is simply broken, and will fail even on x86.
I've seen various cases where memory ordering bugs (from not using the APIs) caused easily reproducible crashes on Arm while the code appeared to work on x86. You just needed many more threads and a much longer wait to get the crash on x86. So a stronger memory ordering does not fix memory ordering bugs, it just makes them a bit harder to trigger.
Personally I much prefer an immediate crash so the bug can be fixed quickly.
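For what it's worth, a minimal sketch of what "using the API" looks like with C++11 atomics (my illustration): release/acquire message passing that is portable across memory models. Swap the atomic for a plain bool and it becomes a data race that will often still appear to work on x86's stronger ordering and fail intermittently on Arm or POWER - exactly the pattern described above.

```cpp
// Portable message passing: the release store publishes 'payload', the acquire
// load guarantees the consumer sees it. Works on x86, Arm, POWER, RISC-V alike.
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // publish
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    assert(payload == 42);  // guaranteed visible after the acquire load
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```

On x86 the release store compiles to an ordinary store (the ordering comes for free), while on Arm it becomes a store-release instruction or a barrier - which is the cost asymmetry being argued about here.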
If a primitive that is a nop on AMD64 is not used in the program, but would be required on architectures with weaker ordering, the program works as intended on AMD64, but not on the weakly ordered architectures. And I expect that there are a significant number of such programs, enough to make people avoid architectures with weaker ordering. Of course, if the offers for the weaker ordered architectures are substantially cheaper, that might be enough to make some customers organize their computing to use these offers for the programs that appear to work on them (and where the consequences of running a program that does not work as intended are recognizable or harmless enough), and run only the rest on AMD64. But that would mean that weak ordering lowers the value of the architecture substantially.
We all would like it if bugs would show up right away, but with memory ordering bugs, that often does not happen. Having more possibilities for making bugs tends to lead to some bug showing up earlier; that does not mean that you eliminate all bugs earlier (probably to the contrary).
Check the Wikipedia link - there are formal guarantees in x86 that do not exist in most modern ISAs. That means you can write robust, lock-free code that will ALWAYS work on x86 and break everywhere else.
However, before you actually attempt to do such a thing, you should really spend some time reading Linus' posts in that RealWorldTech forum I linked, above. He outlines a number of performance and efficiency pitfalls that make it inadvisable even on x86. Instead of simply avoiding locks, a better approach to minimizing lock contention and overhead is to minimize thread interaction.
For more on memory ordering, this seems pretty-well written. I've only skimmed it and don't necessarily endorse the author's opinions, but it lays out the issues pretty well:
> code that does not communicate or synchronize memory accesses through API calls may be affected, in particular code that employs lockless or waitless techniques
It basically boils down to spinlocks, and enough has been said about that, if you'd care to read it.
> If you want to replace a competitor, you have to give at least all the guarantees that the competition gives, not fewer
It sounds like you're in denial. This turns out not to be the deal-breaker issue you're casting it as.
> Vast amounts of cell phone code running on ARM does not mean anything for servers.
That distinction largely disappeared about a decade ago. Moreover, since you've apparently been living under a rock, it will surprise you to know that Amazon is on their second generation of ARM-based cloud servers and enterprise linux distros have supported ARM for more than two years, already.
We're no longer talking about the future - this is happening!
Wait-free code avoids all locks, including spin-locks. Lock-free code may wait, but unlike Torvalds' user-mode spinlocks, at least one thread always makes progress. No, these don't boil down to spinlocks at all.
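A rough illustration of that distinction with C++ atomics (my sketch, nobody's production code): in the CAS loop a thread can be forced to retry, but only because another thread's CAS succeeded, so the system as a whole always makes progress (lock-free); fetch_add additionally bounds each thread's own steps (wait-free for this operation).

```cpp
#include <atomic>

std::atomic<unsigned> counter{0};

// Lock-free: may retry, but a failed CAS means some other thread made progress.
void lock_free_increment() {
    unsigned old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_relaxed)) {
        // 'old' now holds the freshly observed value; just retry.
    }
}

// Wait-free (for this operation): completes in a bounded number of steps.
void wait_free_increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    lock_free_increment();
    wait_free_increment();
    return static_cast<int>(counter.load());
}
```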
In 1990 many (including me) believed that IA-32 would be replaced by RISCs eventually. IA-32 survived and morphed into AMD64 (and also survived IA-64), and most RISCs of 1990 are now dead. These RISCs had more standing in server space than ARM does now. So we will see how it goes.
However, one thing to note wrt the present discussion is that SPARC offered stronger (TSO) and weaker memory models (e.g., RMO). If weak memory orders were as unproblematic for software and as beneficial for hardware and performance as you suggest, they should have dropped TSO and only offered RMO. The reverse has happened.
> Wait-free code avoids all locks, including spin-locks. Lock-free code may wait, but ... at least one thread always makes progress.
Most who think they need to write wait-free code are probably wrong (see: premature optimization; optimizing the wrong problem). But, if you really can't avoid significant lock-contention by any other means, then (like I already said) you can do it portably, if you just use standard and well-specified APIs.
> In 1990 many (including me) believed that IA-32 would be replaced by RISCs eventually. IA-32 survived and morphed into AMD64 (and also survived IA-64), and most RISCs of 1990 are now dead.
I remember that, and I didn't know what they were smoking. Maybe it was my youthful inexperience, but I could hardly imagine a world in which PCs weren't x86-based. And PCs were taking over everything!
> These RISCs had more standing in server space than ARM does now.
So did mainframes!
> one thing to note wrt the present discussion is that SPARC
Eh, SPARC was basically the weakest of the lot. MIPS and POWER have way more going on than SPARC. About the only one deader than SPARC is Alpha, and that was killed off precisely because of how *good* it was! I guess PA-RISC is almost as dead, also for largely anti-competitive reasons. Anyway, I guess we can take it as a cautionary tale that changing to a stronger memory order is *not* going to save your bacon!
BTW, why would ARM or Nuvia do any differently, when vast amounts of code is already running perfectly on ARM? Do you think strong memory ordering is without costs?
Ian, it would've been great if you'd gotten in a question about the Ryzen 3 3300X! Why'd they even offer it, only to withdraw it so quickly? Was it just a PR move?
Just look how good the 7nm is! AMD does not get too many chips that are so bad they have to be used in 3300 chips! Most are good for the 3600 and most likely even more for the 3700! Very few chips have only 4 to 5 working CPU cores! So they can not sell something they don't have!
There is nothing preventing them from taking chips from higher bins, unless supply of those bins is tight. There's a long history of both Intel and AMD selling more functional chips into lower bins, which some would take advantage of by overclocking or even unlocking disabled cores (not sure if that's possible on any Ryzens).
I've not heard of core unlocking on Ryzen, sadly, as if it were a thing I'd have the right kind of CPU to play with it.
Given that supplies of the 3100 are still available, I'd hazard that it's partly having too many functional 6+ core chips that they can sell without difficulty, not having enough defective chips that have 4 working cores entirely on one CCX, and competition with their own discounted 14/12nm products making the required binning less profitable.
Actually I think the most important note is:
> MP: We do - we have math kernel libraries that optimize around Zen 3. That will be all part of the roll-out as the year continues.
Intel's major advantage is its libraries: Intel MKL, Intel IPP, Intel Compiler. They are widely used in the industry and create a major advantage for Intel. Once AMD has something similar (preferably open source) that gets adopted, we'll see AMD's advantage grow even bigger in the real world.
Good questions but many of the answers were pure PR fluff. Unsurprisingly as they plan on releasing more info in the future, after release. So the timing of this shows it to be a PR stunt. I'd have respected AMD if they'd declined to take the interview until they could remove the gag from Puppetmaster.
What were you expecting? There were plenty of hints, we'll get more later, at least Ian asked the good questions in the first place. No need to be personal about it.
One thing I miss about the 1950s is that a technical interview wasn’t filled with the vapid ‘eternal sugar rush’ hype that everything involving corporate money does these days.
I, for instance, can’t stomach listening to top tennis players like Federer speak. It’s all “amazing” and “unbelievable” all that time. This is one area in which culture is continuing to deteriorate.
Agreed that language has gone downhill. As for "amazing," it's become one of those words over-used as a shorthand for the good x 10 idea behind it (I am as guilty as anybody). "Stunning" for home improvement and "decadent" for cooking also come to mind.
The amount of bullshit no-answer in this interview is mind-blowing. Might as well read a press release, it's as devoid of information. What have we learned beyond "Yeah, of course We are the best" + _nothing_ ??
If they had held this in November after the launch he'd have had more freedom to answer questions. Although based on his fluency in PR speak I'm not sure exactly how much more interesting it would be.
This article proves technical superiority of AMD. This Papermaster really will only tell us truths and without contempt. I appreciate hearing about my favorite company and their plans. They never fail in their plans. The people there seem so smart, funny, and most importantly good looking. I would very much like to hear more about Big Navi coming to compete with Intel Tiger and Rocket. I expect a very big speed improvement. Why not release the benchmarks now since everyone is so confident. Lets see what is in the tank for AMD to really win.
Great interview, Ian...;) Glad Papermaster was so forthcoming--AMD intends to be a moving target with respect to its competitors. Zen 4 is already in design at this time--Zen 5 is "a thing" already. Keeping the pedal to the metal and continuing its great record of execution will be necessary for AMD to continue its industry leadership on out into the future. It's kind of wild to consider the leadership role AMD is in at this time--it's really unique. Intel has never had this kind of *self-imposed* competitive pressure--what AMD has pulled off here is quite remarkable, imo. The folks at AMD know exactly what they've been doing for the past several years and it shows.
Wanted to add that Papermaster's detail on Zen 3 was very interesting--it's a ground-up redesign of the entire architecture--as opposed to an incremental change built on top of Zen 2. It will be interesting to see who if anyone will be able to keep up with AMD's pace of development in the next few years.
Very impressive how much they have managed to improve in one generation. Now that the 8C chiplets are monolithic in a sense, I wonder where the gains are next time around? The I/O die needs some love. Plus the eventual move to 5nm should help along with DDR5.
I'm guessing 5nm on the chiplets and 7nm on the IO die will be order of business... at least for the server parts. It'll be interesting to see what they do elsewhere, as unless I'm mistaken they're still on the hook for some 14/12nm wafer starts with GloFo through 2021.
My guess is that Papermaster's words about "redesign" can be interpreted in different ways. In Zen 3, it appears they've reworked every inch of the implementation, while the bird's-eye abstraction/schematic is largely the Zen architecture, with widening of the core across the board, and some new spices thrown in here and there. Just speculation on my part. Either way, brilliant work from AMD.
Yes, good questions. If only there were some inquisitive and resourceful journalist to find us some answers...
Since I've seen a couple OEM systems with these CPUs, one possibility is that they've completely soaked up AMD's supply. AMD seems to be bending over backwards to stay in OEMs' good graces, lately.
"We didn’t change technology nodes - we stayed in 7nm. So I think your readers would have naturally assumed therefore we went up significantly in power but the team did a phenomenal job of managing not just the new core complex but across every aspect of implementation and kept Zen 3 in the power envelope that we had been in Zen 2."
If tests can prove it I think that's the real achievement of Zen 3. Intel's Sunny Cove also had a significant IPC uplift, but at the cost of higher power consumption.
If Zen 3 is a complete redesign, why is it called Zen 3, and not, say, Painter 1.
I actually expect that Zen 3 is a substantial rework of the basic Zen microarchitecture, but not a from-scratch redesign of everything, like Bulldozer and Zen 1 were (and even there I expect that some good parts were reused to some extent). And sticking with a good microarchitecture and refining it has been a good strategy: Intel's Tiger Lake has lineage going back to the Pentium Pro released in 1995, and both Intel's from-scratch NetBurst line and AMD's from-scratch Bulldozer line turned out to be dead ends.
I think it's that from a block diagram level, Zen 3 looks like Zen. What they did is look deep at each part of each block and optimize for performance.
You're right, though, in that if the architecture works, use it, instead of making crazy assumptions of how computing "should work," like Intel's crazy long pipelines in NetBurst (... Or all of Itanium, for that matter...).
" If Zen 3 is a complete redesign, why is it called Zen 3, and not, say, Painter 1. " the same can be said about intel and anything gen 10 and higher. its been said that they are a new architecture, but intel sill calls it gen 10, gen 11 etc.
Intel's generations just tell you the year. They even have different microarchitectures in the same generation. Anyway, in their tech marketing (and in their optimization guides) Intel has revealed the microarchitectures of their products to some extent, and we know that Skylake-Comet Lake have the same microarchitecture, and we know that Sunny Cove is a widened variant with various changes (e.g., AVX512, and ROB size increased from 224 to 352 instructions). AMD has done that, too, and I expect that they will do it for Zen 3 at some later time.
Feel the same way too. Papermaster's "redesign" didn't really answer Ian's question right and is blurring things. I think the implementation has been reworked quite a bit, but the architecture is largely the old Zen design, the core widened and perhaps some new surprises here and there. Agree also that radical change is a bad idea in CPU design.
ALWAYS keep in mind that competition is good and drives perf/price to good values. Ofc price you see now also drives future innovations to a certain degree, still it would be interesting to see the profit margin (I have a feeling it will blow your mind, both intel and amd I think have plenty of headroom). Now maybe AMD has taken the lead, but a lead far ahead will just drive prices up and promote non-innovation. At least there is the ARM ISA (ecosystem whatever) with interesting options.
Great interview there. I wasn't expecting particularly illuminating answers, but the questions were top-notch and probably got as much out of him as we were ever going to get at this stage. Cheers, Ian!
Is it just me? I love the work AMD has put out and greatly respect the engineering team behind it, but I'm kind of disappointed after reading this interview. I walk away with nothing technically concrete. This reads like what a PR staffer would put out, not the CTO of the company. IC asked some great technical questions, but the answers given by MP felt more or less like standard PR marketing points to me.
What I'd like to know, if anyone cares and wants to find out, is when we are supposed to get Zen 3 Threadripper Pro CPUs. I care about properly working ECC RAM and octo-channel memory, which is probably a must for that many cores... But there's nothing on the horizon about it. One can barely get one's hands on a Zen 2 Threadripper Pro; it's more like a unicorn, and the motherboards are like a unicorn's child. Nobody has seen or heard of them....
> One can barely get one's hands on a Zen 2 Threadripper Pro; it's more like a unicorn, and the motherboards are like a unicorn's child. Nobody has seen or heard of them....
Like their other Pro-branded CPUs, the Threadripper Pros are only available to OEMs. I'm not happy about it, either.
Has the Zen 3 Threadripper become the new 'Fight Club'? Or the biggest elephant in the room? Or is AMD not having such good yields on their higher core count CPUs? Obviously our good Doctor is not able / not allowed / too afraid of losing advertising bucks to ask the question. AnandTech used to be about asking the hard questions.
Hulk - Friday, October 16, 2020 - link
Great interview. AMD has really "found" itself. It seems to finally know who it is and where it is going.
Wreckage - Friday, October 16, 2020 - link
It's a shame they overpriced this generation. AMD fans are more frugal and prefer a less expensive option. I wouldn't recommend these chips at their launch price.
eva02langley - Friday, October 16, 2020 - link
Better Price/Performance ratio than the past generations. Stop begging for 50 bucks, don't do like the clowns at WCCF.
Unashamed_unoriginal_username_x86 - Friday, October 16, 2020 - link
I agree the complaints are annoying, but you don't need to pretend perf/$ has gone up, because it makes sense that it didn't. For the 5600x 1.25
Unashamed_unoriginal_username_x86 - Friday, October 16, 2020 - link
Didn't mean to do that. 1.25x relative performance ÷ 1.5x price = 5/6th the perf/$ of 3600 at launch MSRP
Qasar - Friday, October 16, 2020 - link
Wreckage " AMD fans are more frugal and prefer a less expensive option " and that option, is now intel :-)
evernessince - Sunday, October 18, 2020 - link
@Qasar No, AMD ryzen 3000 series is going to provide more value than anything Intel has right now and you can expect the value to get better after the 5000 series launches.
Qasar - Sunday, October 18, 2020 - link
as i said a few posts down, what happens when the 3000 stops being produced ? then what ?
The_Countess - Monday, October 19, 2020 - link
They are still selling Ryzen 1000! There is basically zero risk of Ryzen 3000 no longer being available in the coming few years, let alone before the release of a 5600.
pugster - Monday, October 19, 2020 - link
You tell me where they sell the Ryzen 1000, like the 1600 AE not in some jacked up price for less than $100. They disappeared off the shelves and AMD won't make more of them. I was lucky enough to get the Ryzen 2600x for cheap at microcenter earlier this year and the 2000 series disappeared off the shelves. The Ryzen 3100 and 3300x? Nowhere to be found.
Qasar - Monday, October 19, 2020 - link
they are ? BS one local store here doesnt have ANY ryzen 1000 or 2000 cpus for sale. nice try
nandnandnand - Saturday, October 17, 2020 - link
Wrong. 3600X MSRP was $250, 5600X is $300. 1.25x performance for 1.2x price.
If you want to compare 5600X to $200 discounted 3600X, then congrats, you discovered that AMD Zen 2 is the budget option, not Intel.
Qasar - Saturday, October 17, 2020 - link
and what happens when ryzen 3000 is no longer available ? guess what, intel will be the budget option. get used to it. intel looks to be the budget option for the next while. not amd
RobATiOyP - Sunday, October 18, 2020 - link
New performance product: you announce the high end parts first, the mid and value range items feed in later.
There's going to be a supply of used 3000 series and probably budget chips with integrated gfx.
But you have to factor in mobo costs too and for desktop gaming enthusiasts, the Intel competition are not budget parts.
A good budget part will be on 7-10nm, designed for mass production with a lower TDP
Spunjji - Monday, October 19, 2020 - link
@Qasar:
1) AMD release lower-priced 5000 series parts, and
2) The prices on the whole range gradually move down as supply increases and demand is met.
You're right in one way, though - Intel will have to present themselves as the budget option to compete. That will require some price realignment on their part, though. Right now they're barely competing with the 3000 series on value. 😬
Qasar - Monday, October 19, 2020 - link
spunjji yep, but if those out perform anything by intel, intel will have to lower its prices, and may have to go even lower than what the amd equiv. are selling for. i still doubt amd is the budget option any more, i think for the time being, intel will be.
Spunjji - Tuesday, October 20, 2020 - link
@Qasar - They already outperform plenty of things by Intel. Intel don't really have to lower their prices, though, because demand for their products remains high regardless of their competitiveness. The joys of a duopoly.
We'll see, of course - you could be right - I just don't think you are. Intel like their margins too much.
Qasar - Tuesday, October 20, 2020 - link
i think intel would, at least a little, after all why spend $500 on intel cpu X when for approx the same $500 AMD cpu X you get better performance across the board ? and before any one says " but the cost of the intel board is cheaper. " rog strix E gaming, in X570 and Z490 for example, at one store i go to, has a 30 buck difference, with the Z490 being less, 30 bucks for the board isnt much.
eek2121 - Saturday, October 17, 2020 - link
I read this article earlier, but I didn't read the comments because of crap like this. The 5xxx Ryzen series actually offers much higher performance per dollar and per watt than both the previous gen as well as competitors. I am not an AMD employee or investor, rather I am just a guy that is passionate about tech without an axe to grind. If anything, AMD has made its chips worth more, not less.
OMGWhyMe - Saturday, October 17, 2020 - link
Performance per watt is really more of a consideration in laptops and servers, not in home desktop gaming machines.
The Zen 3s offer a better value at the top end, that is to say the Ryzen 9s.
RobATiOyP - Sunday, October 18, 2020 - link
That's real nonsense!! Any one wanting the fastest gaming part always paid too much, the market leader's premium.
Ryzen 9 3900XT is slower than the silly-expensive, power-hungry, lower-core-count Intel part.
The R 9 59[05]0X is faster than the previous champ which has been Intel since Core Duo
AshlayW - Saturday, October 17, 2020 - link
Passionate about tech - yet defends price increases across the board?
Okay.
RobATiOyP - Sunday, October 18, 2020 - link
MSRP doesn't mean a lot. I checked prices this week and lower end SKUs can cost more, with the bargains being chips like the Ryz 5 5500X at 75% of the cost of a 5600, at €145. The Ryz 3 3000X actually sells at €223, more than an R5 5600X
RobATiOyP - Sunday, October 18, 2020 - link
But perf/CPU cost isn't the real factor, because +10% system performance is achieved with < +10% system cost, even if +19% CPU perf was gained with a +25% part.
Spunjji - Monday, October 19, 2020 - link
Shhhh, you're bringing logic to a shill-fight 😏 It's an excellent point, of course.
Bytales - Wednesday, October 21, 2020 - link
Yeah, that's what it is effectively, begging for a couple of bucks. i dont mind AMD getting those 50 extra bucks. They did a good job, and they earned it. Anything else means being a cheap bastard. Take a look at how much the 64 core Zen 2 Threadripper Pro costs, mister "50 bucks". Then you'll see how cheap these CPUs really are.
just4U - Saturday, October 17, 2020 - link
I am not sure why you think that it's overpriced Wreckage. Anyone currently sitting on a higher end 3000 series either won't buy these CPUs or will pay the same premiums they do on intel processors as early adopters.
gruffi - Saturday, October 17, 2020 - link
It's just $50 more for every model. I wouldn't call this overpriced. Given the fact that Zen had very aggressive prices so far and that Zen 3 will have no competition at all at launch, this seems fair pricing. It's funny, people had no problem with $305 for a 4-core 7700K in 2017. And now they think $299 for 6 cores is overpriced. Before Zen, Intel's top Broadwell-E 6950X was over $1700. So, compared to what Intel offered before Zen, current prices are still fine.
Spunjji - Monday, October 19, 2020 - link
The bitching makes more sense if you assume that the most vocal posters are working from the logic that whatever Intel do is okay and whatever AMD do is bad.
I've seen a few people making genuine complaints, but for the most part it's just the standard bad-faith critiques from the usual suspects.
bigi - Saturday, October 17, 2020 - link
Overpriced? Well, I am sorry. A pair of good skates is $1,000. Digital cameras sell for 1-6K. Phones are ~1,200. "Normal" cars are 40-120K.
In any case, I get your moaning if you are 12 or so, but please stop.
flyingpants265 - Saturday, October 17, 2020 - link
We can decide if they're overpriced once real benchmarks come out. 3600 did not match 9600k in gaming.
They also seem to have removed stock coolers from most models except 5600x.
I'll wait for the 5600(non-X) at $250 or less, and it better have a stock cooler. I paid $80 used for my 2600.
RobATiOyP - Sunday, October 18, 2020 - link
The R5 5600X with 65W TDP cooler will be available at $250 soon enough; launch MSRPs being low would just cause a debacle like Nvidia's, with true market prices being higher.
Alexvrb - Saturday, October 17, 2020 - link
They are starting at MSRP... they have a lot of Zen 2 inventory to move. After that, maybe they'll release cheaper models and/or sell below MSRP. But as of right now the only reason Zen 2 is a better value is the discounts! Early adopter fee.
RobATiOyP - Sunday, October 18, 2020 - link
That's bllx!! I bought AMD often because they were technically the best, from dual SMP 32-bit, Thunderbird and Opteron to consumer 64-bit parts .. and in gfx cards, they had compelling smaller-die, lower-wattage options every few years.
Intel milked consumers, they did shady deals and tried to force through crap like the Itanic and P4 .. so yes, often AMD did offer better value, but AMD actually shaped the x86 64-bit market
Spunjji - Monday, October 19, 2020 - link
Amazing how the language of shills has changed from "AMD are second-class products" to "AMD consumers want second-class products". Have you considered that maybe they're aiming to *expand their market share* and that having premium offerings at an appropriate price is part of that process? If not, why not?
Gondalf - Saturday, October 17, 2020 - link
I don't know. I suspect the IPC boost is smaller than claimed, especially in integer; likely they have some major increase of performance in a particular segment and some less stellar in the other ones. This "19%" number looks like marketing, they claimed something in the 10/15% range for ages, this can not change suddenly in a month or two. I suspect they include the frequency boost in their calculations.
Best we wait for a real review to know the truth. Moreover, funny enough, they do not mention Milan; from the revenue point of view this is the real launch.....but they are late. Focusing on games only is not a good strategy, someone have a very bad boy on arrival.
Qasar - Saturday, October 17, 2020 - link
" I have suspect the IPC boost is smaller than claimed expecially in integer, " of course you would Gondalf, so far, amd as been pretty accurate with what it says about prev Zen cpus, i doubt they would start to fudge the numbers now."This "19%" number looks like marketing, " when was the last time intel claimed ANY performance increases gen over gen that were above 10% ? i sure cant recall any.
" Focusing on games only is not a good strategy, someone have a very bad boy on arrival. " um isnt this EXACTLY what intel and its fans have been touting for the last while ?? " sure amd are better in multi core and power usage, but intel is still the fastest for games " isnt that what was being said ??
Spunjji - Monday, October 19, 2020 - link
"...isn't that what was being said?"It is, in fact, exactly what Gondalf has been saying 😏
RobATiOyP - Sunday, October 18, 2020 - link
There are figures around and indeed media ops improve a little more than integer ops, but the story is 6 core parts performing at Zen 2 8 core level.
Gaming performance was the focus because AMD led already in every other segment but the single thread perf dominated gaming benchmarks, so AMD claiming that was NEWS.
Everything else improving all around is not news and MSRP is meaningless, as the prices on SKUs fluctuate according to supply/demand.
The_Countess - Monday, October 19, 2020 - link
"they claimed something in the 10/15% range from ages"And with 'they' you mean the rumor mill, not actually AMD.
So why are you holding AMD to things the rumor mills said?
Zingam - Saturday, October 17, 2020 - link
No mobile - no win! There are still Intel and NVIDIA/Arm (did that happen already?) and now even Apple!
nandnandnand - Saturday, October 17, 2020 - link
AMD has plenty of mobile APUs now and will have more next year.
Nvidia/ARM will take years for regulators to approve, and Apple switched from Intel to its own ARM chips, which was widely expected years before it was announced.
Spunjji - Monday, October 19, 2020 - link
Oh look, there go the goalposts.
Again.
prophet001 - Friday, October 16, 2020 - link
Get this thing to market already so I can do a vaporware build with a 3080.
Okendor - Friday, October 16, 2020 - link
I really love this kind of content. Many sites can put up good reviews but Ian's sit down interviews are always top notch.
dietzi96 - Friday, October 16, 2020 - link
N1 job, AMD tech and press department!
Machinus - Friday, October 16, 2020 - link
I was hoping for more detail on how this 7nm version is actually different from the previous process. Some of that 19% must be due to manufacturing improvements, right?
ArcadeEngineer - Friday, October 16, 2020 - link
That's not a blanket performance improvement figure, that's specifically IPC; how does a node improvement on its own help with that?
sing_electric - Friday, October 16, 2020 - link
They've talked about this - and AnandTech has covered it - that it's from a variety of sources, as Dr. Papermaster says in the interview. It's the same process with the same libraries, but better tolerances as it's now more mature.
The uplift comes from a variety of things, but better cache design and an overall focus on latency seem to be one of the bigger drivers.
haukionkannel - Saturday, October 17, 2020 - link
Yep, the manufacturing part is very minor here! Most comes from architectural improvements. Most likely the clockspeed increase of about 100 MHz is mainly manufacturing getting better.
sing_electric - Friday, October 16, 2020 - link
Interesting how Papermaster basically dodged the (artfully put) question on future roadmaps. We knew about Zen what, 2 years before it launched? And Zen 3 and 4 were on the slides from at least the Zen+ launch, but now we've just got one more generation and then "there be dragons here."
Ditto with AMD's graphics lineup - we'd been hearing about Navi and Navi 2 from when Vega launched (I think?) and now we're almost there and don't know what their next plans are after that.
Guspaz - Friday, October 16, 2020 - link
So we're not quite there yet with graphics, RDNA2 isn't quite out yet, and they've already shown RDNA3 in roadmaps.
5080 - Friday, October 16, 2020 - link
He does touch on PCIe 5, available in the next chipset, just no details or timelines.
eva02langley - Friday, October 16, 2020 - link
It never was a secret.
nandnandnand - Saturday, October 17, 2020 - link
It's not clear what products will get PCIe 5.0 and when. It could come to Epyc a year or two before mainstream desktop CPUs.
mode_13h - Saturday, October 17, 2020 - link
I don't believe mainstream desktops will get PCIe 5.0, in the foreseeable future. This is going to be for servers and maybe workstations.
The most cost-effective way to improve I/O bandwidth (if desktops even needed it) would be to backport features from PCIe 6.0, like PAM4 and FEC, to create something like PCIe 4.1. And even that isn't likely to happen any time soon.
AMDSuperFan - Friday, October 16, 2020 - link
After Navi 2, I would expect Navi #3 and Navi #4. After the 5000 series, I feel confident that there will be a 6000 series. It is easy to talk roadmap. What makes the rubber hit the road is when you promise the performance and meet the performance. Look for up to 200 cores and not cuda cores. I am excited for the new announcements.
Carmen00 - Monday, October 19, 2020 - link
...what the heck are you talking about? Before vomiting all over the comments section, at least have the decency to read the actual article!
Spunjji - Monday, October 19, 2020 - link
It's a troll.
eek2121 - Saturday, October 17, 2020 - link
It is important to note that Vermeer is the last consumer chip that is officially on the roadmap. Zen 4 and 5 nm were mentioned for server applications only. That is not to say that AMD does not have plans, but rather that those plans have not been made public.
haukionkannel - Saturday, October 17, 2020 - link
I still expect to see Zen 3+ next year for desktops, and Zen 4 only for the servers... and Zen 4 for desktops in 2022... The other possibility is that AMD does not release a new CPU at all next year for the normal consumer market... But Intel is not staying still, so that seems unlikely...
sing_electric - Sunday, October 18, 2020 - link
That's sort of been my guess. They want AM5 to be another long-lived socket, but that probably means DDR5 modules will have to be widely available at desktop-friendly prices at launch, which may be happening later than we'd have thought a couple of years ago.
So Zen 4 and AM5 will come together, but AMD might release a Zen 3+ in the interim - possibly with an improved I/O die (7nm? As Ian notes, idle power consumption on Zen 2's IO was pretty high), and depending on timing, maybe even with 5nm for the compute chiplets.
As a company, AMD has been pretty willing to spend the money to do a process shrink on existing designs - though we've mostly seen it on the graphics side (eg Radeon VII, using Vega, or even going from 14 to 12nm for the RX 590, which really was kind of a niche product).
GeoffreyA - Friday, October 16, 2020 - link
Nice interview, thank you. Though a bit too much on the marketing side, very interesting and makes one excited to read the microarchitecture article. Some lines of Mark's I enjoyed were: "Frequency is a tide that raises all boats," and "A great idea overused can become a bad idea."
GeoffreyA - Friday, October 16, 2020 - link
After looking at the diagrams, thinking about it, and what Mark said, I'm guessing this: besides the usual widening of structures (retire queue, register files, and schedulers), AMD has likely added another decoder, up from 4-way, and increased dispatch width from the decoders (while keeping the later 6-wide integer and 4-wide FP), increased the size of the micro-op cache (+25/50%), and increased load-store bandwidth to 64 bytes/cycle for the different links, along with another AGU. And the retire width has perhaps gone up to 10 (from 8).
Just sheer speculation and innocent fun ;)
scineram - Friday, October 16, 2020 - link
Micro-op cache was already massive with Zen 2, so doubt that.
4everalone - Friday, October 16, 2020 - link
Mine was:
best defense is in fact a strong offence - we're not letting up!
TheReason8286 - Friday, October 16, 2020 - link
I had hoped you would've asked "why not just push the 5.0GHz" just to get it out of the way lol
jack836547 - Friday, October 16, 2020 - link
Great interview, thanks! Real journalism here
mode_13h - Friday, October 16, 2020 - link
> There are some differences between one ISA and another, but that's not fundamental
That's good PR spin, but not intellectually honest. x86 is more work to decode, much more heavily dependent on register renaming (due to a smaller register file), has more implicit memory-ordering constraints than ARM, and other things.
Like it or not, ISA matters. AMD just doesn't want to acknowledge this, until they announce another ARM CPU or maybe RISC V (increasingly likely, given ARM's new master).
lmcd - Friday, October 16, 2020 - link
ISA might matter but I really don't think AMD has to be too worried right now. ARM might be designing N1 cores, but Microsoft has only really bothered partnering with Qualcomm. Qualcomm seems not to care enough to build a large enough die to compete at the desktop level, and, until every developer has an ARM desktop at their cubicle, we won't see ARM threaten AMD's server and consumer desktop/laptop products.
patel21 - Friday, October 16, 2020 - link
Wouldn't Apple's ARM Macs help with it?
Wilco1 - Friday, October 16, 2020 - link
Arm is already threatening AMD and Intel in the server space. For example, Graviton is now 10% of AWS and might be 15% by the end of the year.
The success of Graviton also shows it's perfectly feasible to use a cloud server without having a dedicated developer box. A fast server outperforms your desktop many times over, particularly when building large software projects.
deltaFx2 - Sunday, October 18, 2020 - link
@Wilco1: Please provide a reference to this claim re. AWS ARM share.
It's perfectly feasible to use a cloud server without a dev box, granted. But what it means is that most development will happen on the dev box (x86) and then be ported to ARM. Of course, virtual desktops change that argument, but I'm not yet aware of a virtual desktop environment running Arm (not that it can't be done, just not aware of AWS or someone offering one).
Wilco1 - Sunday, October 18, 2020 - link
See http://pdf.zacks.com/pdf/JY/H5669536.PDF
Arm development has been done on x86 boxes for decades - cross-compilation, ISA simulation and platform emulation is a thing. Having native developer boxes is nice but not as essential as some people are claiming. Arm developer boxes have existed for years, you can buy 32-core versions online: https://store.avantek.co.uk/arm-desktops.html
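For anyone who hasn't tried it, a minimal sketch of that cross-development flow on an x86 box - the package names and commands are Debian/Ubuntu assumptions on my part, not something from the links:

```cpp
// hello.cpp - build for AArch64 on an x86 host and run it under user-mode emulation:
//
//   sudo apt install g++-aarch64-linux-gnu qemu-user
//   aarch64-linux-gnu-g++ -O2 -static hello.cpp -o hello_arm64
//   qemu-aarch64 ./hello_arm64
#include <iostream>

int main() {
    std::cout << "Hello from an AArch64 binary built on an x86 box\n";
    return 0;
}
```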
deltaFx2 - Sunday, October 18, 2020 - link
Thank you for the link. So the take-away is that AMD and ARM are taking market share from Intel. At present AMD is taking more share. It's unclear if the pie is growing (probably), so either ARM and/or AMD are taking a large chunk of the incremental gains or they're replacing Intel (I suspect the former). Anyway, I'll read it in more detail later.
Re. cross development, sure, it's a thing - on client and embedded. I've never heard of anyone doing it for server. Sounds crazy to me.
Wilco1 - Monday, October 19, 2020 - link
People don't use expensive loud servers as their desktop - for server development you would log in to a remote server already. Big software projects build so much faster on a proper server.
mode_13h - Monday, October 19, 2020 - link
> It's perfectly feasible to use a cloud server without a dev box, granted. But what it means is that most development will happen on the dev box (x86) and ported to ARM.
Yeah, not what I meant.
mode_13h - Friday, October 16, 2020 - link
That's a weird argument. A lot of development is already cloud-based, meaning it no longer matters what you have on your desk or in your lap.
Everyone already has ARM in their pocket, and a lot of kids and techies have ARM on their desktops in the form of a Raspberry Pi. ARM is making inroads into the laptop market and cloud, as well - key markets for AMD! While a company AMD's size could survive comfortably off just gaming PCs and consoles, investors would leave in droves, if AMD ceased to be a credible player in the cloud market.
Your point about Qualcomm is also strange, since they killed off their server core design group, and their mobile core design efforts were effectively extinguished even before that. The fact that MS used Qualcomm in laptops and Hololens says nothing about what they're doing towards cloud.
Kjella - Friday, October 16, 2020 - link
Like it or not, we've heard the same tired arguments from RISC vs CISC proponents for decades without a shred of proof that one side is superior. If you're a ballet dancer, "do a pirouette" is more efficient than "get up on one toe, spin around your own axis and contract your extremities to increase the speed of your rotation", and there's no reason to think computers are any different. Like when you create instructions to do "one round of AES encryption", as in AES-NI: obviously it's less flexible and requires more dedicated logic, but it's a lot more efficient than doing it with basic math. Just because you can do everything with simple instructions doesn't make it a good idea.
mode_13h - Friday, October 16, 2020 - link
It's weird that you map x86 vs ARM to a CISC vs RISC argument. I look at it as a modern, planned city vs an old medieval town that slowly evolved into the modern era. There's no question which is more efficient, even if the latter has more lore, character, and is more picturesque.
ARM is winning on efficiency, dominating mobile and now cloud. That's the point Ian made, and it's the right one. Once you can get x86-levels of performance with ARM-levels of efficiency, the choice becomes obvious and x86 is quickly relegated to only markets with lots of legacy software.
Your example of crypto acceleration is odd, as mobile chips have long had crypto accelerators. A better example would be AI, which ARM has actively been adding ISA extensions to address.
mode_13h - Friday, October 16, 2020 - link
BTW, since open source support for ARM is already top notch, I mean specifically PROPRIETARY legacy software - much of which could comfortably run in its own VM instance, in the cloud. Once you do that, you untether your primary platform from what's needed by an ever-shrinking number of legacy apps.
deltaFx2 - Sunday, October 18, 2020 - link
"Once you can get x86-levels of performance with ARM-levels of efficiency" Except you don't, do you? See Ampere. They claim to match Rome SPECrate perf @210W, which is pretty much the same TDP as Rome. They need to operate the Arm cores at the limit (3GHz+), which is terrible for efficiency as those cores are designed for a slightly lower frequency. They have more efficient SKUs at lower frequencies, and I have seen the claim that AWS runs at 2.5GHz, 110W. Efficient? Absolutely. Performant? Not so much. (Yes, we've all seen the Skylake SMT-thread vCPU vs Graviton full-core CPU. Apples to oranges comparisons, because AWS can offer SMT-off instances and they chose not to.)
So ARM winning on efficiency is a questionable claim if performance (say, SPECrate score per CPU) matters. But no question that Arm's designs are efficient when operated at the optimal voltage and frequency (2.5-2.8GHz)
Bottom line: You can be efficient or performant, not both.
Wilco1 - Sunday, October 18, 2020 - link
The 3+GHz Altras give up efficiency to get more throughput. And that's a choice, since a halo part that beats Rome will get more interest. However, cloud companies are unlikely to use it; stepping 10-20% down in frequency increases efficiency considerably without losing much performance, and reduces cooling costs.
Graviton 2 proves you can be both efficient and performant at the same time. You only lose out if you push too much on one aspect.
mode_13h - Monday, October 19, 2020 - link
> You can be efficient or performant, not both.
Of course those goals are at odds, but you seem to presume that the latest x86 designs are at least on the same perf/efficiency curve as leading ARM implementations, in which case I'm guessing you missed this:
https://www.anandtech.com/show/15967/nuvia-phoenix...
Also, ARM is not standing still. Previously, they've been focused on energy- and area- efficiency. It looks like they're finally beginning to stretch their legs, lending credence to some of Nuvia's claims:
https://www.anandtech.com/show/16073/arm-announces...
The writing is on the wall, regardless of whether you choose to see it or AMD to acknowledge it. But I trust AMD knows what's up, and is surely cooking something in a back room - perhaps a successor to Jim Keller's fabled K12.
Qasar - Monday, October 19, 2020 - link
i guess that means the writing is on the wall for intel as well then, too??
mode_13h - Monday, October 19, 2020 - link
Oh, for sure. Check out that first link (the one about Nuvia).
But it doesn't necessarily spell doom for Intel or AMD. They are as well-positioned as anyone to pivot and start making ARM or RISC-V CPUs. For them, it's just a question of getting the timing right, as they don't want to undermine their cash cows while demand is still strong and uncertainty still surrounds any potential successors.
That said, Intel clearly has the most to lose from any shift away from x86.
eva02langley - Friday, October 16, 2020 - link
Problem is, ARM isn't there; it needs to give up its efficiency and can still barely compete with x86 compute performance.
Wilco1 - Saturday, October 17, 2020 - link
Graviton 2 actually beats x86 cloud instances on performance and cost while using a fraction of the power. Basically x86 is totally uncompetitive, and as a result Arm will take over the cloud in the next few years.
deltaFx2 - Sunday, October 18, 2020 - link
I've only seen AWS comparisons with Intel, hyperthreads vs. Graviton full cores. (1) The x86 top dog is Rome, not Skylake. (2) SMT naturally reduces per-thread performance, so it's apples to oranges. AWS chooses to run with SMT; it can be turned off, but they chose not to. (3) Pricing is AWS's choice to drive adoption (would you pay the same for having to do extra work porting to a new ISA? Devs have to be paid and it's not zero work. In fact it's not zero work going from Intel -> AMD due to uarch differences needing different perf tuning. AWS/Azure also price AMD instances lower for the same reason. Porting to a new uarch and a new ISA is even more work and the price reflects this reality.)
Wilco1 - Sunday, October 18, 2020 - link
Single-threaded performance of Graviton 2 is similar to the latest Intel and AMD CPUs, the low-clocked Graviton 2 is within 6% of a 3.2GHz Xeon Platinum on SPECINT. You could turn off SMT but you'd lose 30% throughput. How does that make x86 look more competitive? Adding more cores isn't an option either since x86 cores are huge. 128-core single-die Arm server CPUs are no problem however. With much faster Arm servers on the horizon, things are looking rather bleak even if Milan manages 20% extra performance over Rome. This slide sums it up: https://images.anandtech.com/doci/16073/Neoverse-c...
deltaFx2 - Monday, October 19, 2020 - link
"Graviton 2 is within 6% of a 3.2GHz Xeon Platinum on SPECINT. " I'm assuming you mean Spec int rate. yeah sure, but Rome is ~2x Xeon platinum. Turning of SMT is a much easier tradeoff.Chiplets allow at least AMD to add cores without worrying about yields. Intel might also go that route with Sapphire rapids. The link you shared is a marketing fluff slide with no X and Y axis. I was surprised at the claim that ARM could have higher IPC cores and more of them and see an increase in throughput (assuming it's spec int rate, it's well known that 8-channel DDR4 bandwidth gets maxed out in many cases with 64 cores). Sure enough, DDR5, HBM2e show up for Zeus: https://images.anandtech.com/doci/16073/Neoverse-c... In other words, they're comparing a system from last year to a system from next year. DDR5 is not really available in production volume until next year. Yeah, better memory subsystems improve both 1T and socket performance! Who knew? The real competition is then Sapphire Rapids from intel and Genoa from AMD both probably next year? And yes, in both cases expect a bump in core-count because DDR5 allows it. If history is any judge, server DDR5 will remain expensive until intel launches their server.
Meanwhile, if Vermeer is launching in a few weeks, you may be sure that hyperscalers have had early revisions for months already, and it's likely that they will be receiving shipments of Milan later this year if not already. "Bleak", "uncompetitive" etc. are not even gross exaggerations. They just come across as fanboyish.
mode_13h - Monday, October 19, 2020 - link
> Chiplets allow at least AMD to add cores without worrying about yields.
Even ARM's balanced cores (the N-series) are tiny, by comparison. You can stuff a heck of a lot of those on a single die and still end up with something smaller than mass market GPUs.
Of course, as the market grows and chiplet technologies continue to mature, it's only natural to expect some ARM CPUs to utilize this technology, as well.
No doubt x86 will dominate the market for at least a couple more years. It can still win some battles, but it won't win the war.
deltaFx2 - Monday, October 19, 2020 - link
GPUs and CPUs yield differently. GPUs have lots of redundancy and are therefore tolerant to defects. CPUs can't handle faults in logic. At best, SRAM defects can be mitigated using redundancy, and core defects by turning them off and selling them as a low core count part. Btw, 7nm is yet to see volume production on large GPUs. Navi might be the first. Xilinx has done large FPGAs but those are even more redundant. I don't care much for wars and battles and cults. I seem to have a low tolerance for fanboyism and taking marketing slides as the final word. Arm in the cloud provides options to hyperscalers and for sure it will crash Intel's fat DC margins. I doubt Arm "winning the war" is good for anyone except ARM/Nvidia if it means a situation similar to the last decade with Intel. Time will tell. CPUs have not been this competitive in a while.
mode_13h - Monday, October 19, 2020 - link
At 331 mm^2, Vega 20 launched on 7 nm nearly 2 years ago! And let's not get into dismissive talk of fanboyism and cults, because that cuts both ways. I've tried to make my case on facts and reason. I have no vested interest in ARM or anything else, but I would like to see the chapter of x86's dominance in computing come to a close, because I think it's really starting to hold us back.
You cannot make a credible claim that the current dominance of x86-64 is due to any technical merits of the ISA, itself. It's only winning due to inertia and the massive resources Intel and AMD have been funneling into tweaking the hell out of it. The same resources would be better spent on an ISA with more potential and less legacy.
deltaFx2 - Tuesday, October 20, 2020 - link
Ok, let me rephrase: a volume large GPU. Those Vegas were sold in datacenters or HPC, iirc, and AMD is no Nvidia. I made no claim re the superiority of the x86 ISA. I happen to agree with Papermaster that a high performance CPU is largely ISA-agnostic. The area difference you see in ARM vs x86 today is mostly x86's need for speed, i.e. the desktop market. x86 succeeded because it could push the same design in all markets and, until very recently, the same silicon too. This is true for AMD today: the Vermeer CCD is the same as Milan's. Ultimately economy of scale made the difference. Intel's process dominance also played a huge role. The PC market is still massive and as long as it exists as x86, the ISA will sustain if not thrive. It's just plain economics, but money makes the world go around and all that.
Spunjji - Tuesday, October 20, 2020 - link
I'm mostly with deltaFx2 on this - the truth is that vendors basing products on ARM architectures will have to show a dramatic advantage in two or more areas (cost / power / performance) over x86 to stand a chance of toppling x86 anywhere outside of the datacentre. There's too much legacy and gaming software in the PC market to make the change palatable for a good while yet.
mode_13h - Tuesday, October 20, 2020 - link
I never said anything about desktop computing, but I think the ARM-based laptop segment will be interesting to watch. Now that Nvidia owns ARM, we could start to see them push some real GPU horsepower into that segment. If those start to gain traction with some gamers, then it's not inconceivable to see inroads into ARM-based desktop gaming. Heck, aren't there already ARM-based Chrome boxes - low-cost, NUC-like mini-PCs?
mode_13h - Tuesday, October 20, 2020 - link
Within 5 months of their release as the MI50/MI60, AMD started selling them for $700 as Radeon VII. Those had 60 of 64 CUs enabled and 16 GB of HBM2 - exactly the same specs as MI50. So, it sure sounds like yields were decent. Availability of those cards remained strong, until they were discontinued (after the launch of Navi). I also didn't say you made such a claim - just that it's not there to be made. However, based on what you did just say, it seems you need to spend more time examining this article.
https://www.anandtech.com/show/15967/nuvia-phoenix...
ISA has consequences. Some of the same factors that helped ARM trounce x86 in mobile are the same ones that give it more potential in performance-optimized computing. You can't solve all problems with scale and by throwing money at them. A bad design ultimately has less potential than a good one, and the moribund ISA is the one constant of all x86-64 CPUs.
deltaFx2 - Tuesday, October 20, 2020 - link
I did see Nuvia's claim. They haven't normalized for the memory subsystem, so those results are being... erm... economical with the truth. (I can see *why* they did it, they want series B funding.) Mobile parts support very specific memories and sizes, soldered right on the motherboard or maybe even PoP. The PC market usually supports much larger memories, slotted, etc. (Apple solders, but I don't believe this is standard; corporate notebooks want the interchangeability). These sorts of things have a fixed cost even when not being used. Phones don't have to support 128GB memory; laptops do. Another difference is big.LITTLE in mobile, which helps idle power. big.LITTLE is a bad idea in servers because if your server is idling, you're doing it wrong. There's other fun stuff that x86 systems have to support (PCIe, USB in various flavors, DisplayPort, etc.) that is not in mobile. This is an intentional choice. Counterexample to this is Ampere Altra. In their own testing (not third party or SPEC submission), they claim to equal Rome 64c/128t using 80c at roughly the same TDP. I.e., when you throw in large memory controllers, 128 PCIe lanes, and all the paraphernalia that a server needs, the difference is in the noise at the same perf levels.
x86 CPUs and ARM CPUs look nearly identical outside the decode unit and maybe parts of the LSU. Heck, modern ARM CPUs even have a uop cache. The main tax for x86 is variable length decode, which is a serial process and hard to scale. The two tricks used for that are the uop cache (Sandy Bridge onwards) or two parallel decode pipes (Tremont). Beyond that, it's exactly the same in the core. Any difference in area is down to using larger library cells for higher current drive, duplicated logic to meet frequency goals, etc. The x86 tax is (1) high frequency design (2) supporting a wide variety of standard interfaces. #2 is also why x86 is successful. You can design your device and plug it into *any* x86 system from server to laptop using standard interfaces like PCIe. The x86 ISA may be closed, but the x86 ecosystem is wide open (before that you had vertically integrated servers like Sun/HP/DEC/IBM with their own software and hardware). Even ARM can't claim such interoperability in mobile. They're adopting x86 standards in servers.
mode_13h - Tuesday, October 20, 2020 - link
Wow. I won't impugn your motives, but you sure seem invested in your skepticism.
> I can see *why* they did it, they want series B funding.
They're not some kids, fresh out of uni, looking to draw a regular paycheck. If they weren't convinced they had something, I really believe they wouldn't be trying to build it. What they clearly want is to get acquired, and that's going to require a high burden of evidence.
> These sort of things have a fixed cost even when not being used.
> ...
> There's other fun stuff that x86 systems have to support (PCIe, USB in various flavors, display port, etc) that are not in mobile.
The power figures quoted are labelled "Idle-Normalized", by which they must mean that idle power is subtracted off.
> Another difference is big.Little
Except they specify they're talking about single-core performance and explicitly label exactly which Apple cores they're measuring.
The most glaring point you're missing is the Y-axis. Take a good look at that. For better legibility, right-click those images and save or view them in another tab - they're higher-res than shown in that article.
> Counterexample to this is Ampere Altra.
That's Neoverse N1-based, which clearly hasn't achieved the same level of refinement as Apple's cores, and is also balancing against area-efficiency. According to ARM's announcements, the V1 and N2 should finally turn the tables. In the meantime, we can admire what Apple has already achieved. And you needn't be content with one Geekbench benchmark - you can also find various specbench scores for their recent SoCs, elsewhere on this site.
> x86 CPUs and ARM cpus look nearly identical outside the decode unit and maybe parts of the LSU.
Since I don't like to repeat myself, I'll just point out that I already made my case against x86 in my first post in this thread. It's tempting to think of CPUs as just different decoders slapped on the same guts, but you need more than such a superficial understanding to build a top-performing CPU.
> supporting a wide variety of standard interfaces.
A few ARM-based PCs (beyond Raspberry Pi) do exist and also use PCIe, USB, SATA, etc.
> They're adopting x86 standards in servers.
USB and PCIe aren't x86 standards. I'm less sure about SATA, but whatever. This point is apropos of nothing.
deltaFx2 - Tuesday, October 20, 2020 - link
I don't like repeating myself either, same pinch! Different memory subsystems, different uncore, different design targets lead to different results. I have seen the Nuvia presentation in detail, and they sure know what they're doing, and I don't think it's an accident. What they present is likely true data, but misleading. The Qualcomm design discussed in Nuvia's data is the same as the one used in N1 but with smaller caches, so unless you're saying ARM's uncore is a raging fireball, I don't understand your argument. Let's say the Qualcomm core is 1.8W for 900 points. For the same performance, the graph claims AMD/Intel burn 6W. So the Qualcomm core consumes 1/3rd the power of the AMD core for the same perf, and somehow when put on a server SoC, all that supposed efficiency goes to naught? Or is the more likely explanation that Nuvia is comparing apples to oranges?
Are you referring to this "x86 is more work to decode, much more heavily-dependent on register-renaming (due to smaller register file), has more implicit memory-ordering constraints than ARM, and other things." ?
* x86 is more work to decode in the absence of uop caches. That's the whole point of uop caches, to not have to do that work. Turns out ARM is complex enough that it too needs uop caches.
* "much more heavily-dependent on register-renaming (due to smaller register file)" That's just a mish-mash of terminology you don't seem to understand. ALL OoO machines are dependent on register renaming. ARM has to rename 31 architected registers, x86 has to rename 16. That's not the register file, that's architected registers. Skylake/Zen2 have 180E integer register file to hold renamed physical registers (rob is 224/256 can't remember). I can't find what A77's PRF size is, but the ROB is 128 entries, so its going to be ~100ish entries. PRF size minus architectural register size is usually what's available for OoO execution. So which design has a bigger instruction window? And what's ARM's advantage here? If it's 6-wide issue, it has to rename 6 destinations and 6*2 sources. You have to build that hardware whether it's ARM or x86 or MIPS or SPARC or Alpha or Power or IBM Z.
* "has more implicit memory-ordering constraints than ARM, and other things" yes it does. ARM has some interesting memory ordering constraints too, like loads dependent on other loads cannot issue out of order. Turns out that these constraints are solved using speculation because most of the time it doesn't matter. x86 needs a larger tracking window but again, it's in the noise for area and power.
Are you sure you know what you're talking about, because it doesn't look like it.
mode_13h - Wednesday, October 21, 2020 - link
> That's the whole point of uop caches, to not have to do that work.
uOP caches are a hack. They work well for loops, but not highly-branchy code. Here's an idea: why not use main memory for a uOP cache? In fact, maybe involve the OS, so it can persist the decoded program and save the CPU from having to repeat the decoding process. Kinda GPU-like. You could even compress it, to save on memory bandwidth.
> Turns out ARM is complex enough that it too needs uop caches.
I wonder how much of that has to do with supporting both ARMv7-A and ARMv8-A, as all of their in-house 64-bit cores do. However, at least a couple custom cores don't bother with ARMv7-A compatibility.
> That's just a mish-mash of terminology you don't seem to understand.
*sigh* I meant ISA registers, obviously. And it's not 16 - ESP is used for stack, frame pointers usually occupy EBP, and some instructions are hard-wired to use a couple of the others, from what I dimly recall. Anyway, the number of software-visible registers restricts the types and extent of optimizations a compiler can do -- at best, forcing the CPU to repeatedly replicate them through speculative execution and out-of-order execution -- at worst, causing register spilling, which generates extra loads and stores.
> x86 needs a larger tracking window but again, it's in the noise for area and power.
You don't think it generates more stalls, as well?
Something else that comes to mind as a bit silly is AVX. Perhaps you'll insist it's trivial, but all the upper-register zeroing in scalar and 128-bit operations seems like wasted effort.
> Are you sure you know what you're talking about, because it doesn't look like it.
I'm not a CPU designer, but you'd do well to remember that the performance of a CPU is the combination of hardware *and* software. The compiler (or JIT engine) is part of that system, and the ISA is a bottleneck between them.
GeoffreyA - Wednesday, October 21, 2020 - link
I'm only a layman in all this, but if ARM is aiming for high performance, I think its cores are going to get increasingly complex. On a side note, Apple didn't speak about the power efficiency of the A14 on the recent keynote, as far as I'm aware. On the x86 length-decoding bottleneck, I was reading that there used to be another interesting trick (used by the K8/10/Bulldozer and Pentium MMX): marking instruction boundaries in the cache. Not sure if that was particularly effective or not, but I suppose the micro-op cache turned out to be a better tradeoff on the whole (removing the fetch and pre/decoding steps altogether when successful, saving power and raising speed).
Who knows, perhaps AMD and Intel could re-introduce the marking of instruction boundaries in cache, if it were feasible and made good sense, further toning down x86's length-decoding troubles, while keeping the micro-op cache as well.
mode_13h - Wednesday, October 21, 2020 - link
> if ARM is aiming for high performance, I think its cores are going to get increasingly complex.
Yes. A couple years ago, ARM announced the N-series of cores that would target server and infrastructure applications. Recently, they announced the V-series of cores that would be even larger and further performance-optimized.
https://www.anandtech.com/show/16073/arm-announces...
> perhaps AMD and Intel could re-introduce the marking of instruction boundaries in cache, if it were feasible and made good sense
If it still makes sense, I would assume they're still doing it. Do you know any differently?
That said, I don't really get why CPUs wouldn't basically expose their uOPS and decoder, so the OS can just decode a program once, instead of continually decoding the same instructions many times during a single run. Sure, it'd require a significant amount of work for the CPU designers and OS, but seems like it'd be a significant win on both the performance and efficiency fronts. And perhaps you could even get rid of the in-core uOP cache, with corresponding simplifications in core design and area-efficiency. Again, I think GPUs have already largely proven the concept, though I'm not necessarily proposing to ditch a hardware decoder, entirely.
GeoffreyA - Wednesday, October 21, 2020 - link
An intriguing idea, definitely, but probably too difficult to implement and a maintenance nightmare. The OS would have to maintain multiple decoders/compilers for different CPU brands, each of which likely has its own internal micro-op format, along with whatever quirks are involved. There might be a delay when the program is first decoded. And likely, more bugs would be stirred into the soup.
Exposing the internal format also means going against abstraction and tying things to an implementation, and the CPU designer wouldn't be able to change much of that "interface" in the future. I think an abstract ISA, like x86 or ARM, would be a better choice, but that's just my view.
> If it still makes sense, I would assume they're still doing it. Do you know any differently?
Perhaps it was no good, or a micro-op cache was a better tradeoff. According to Agner Fog, Zen abandoned it and the P6 (and later) never used it. Pentium MMX, yes. Not sure about Pentium classic, 486, etc. Marking instruction boundaries in the cache alleviated a critical bottleneck of x86 decoding: working out the instruction length, which varies.
https://www.agner.org/optimize/microarchitecture.p...
mode_13h - Wednesday, October 21, 2020 - link
> The OS would have to maintain multiple decoders/compilers for different CPU brands,
Again, GPUs already do this. They each ship their own shader compilers as part of their driver. However, CPUs needn't follow that model, exactly. CPUs would still need some hardware decoding capability, in order to boot and support legacy OSes, if nothing else. But the CPU can continue to do the primary ISA -> uOP translations with hw-accelerated engines - just in a way that's potentially decoupled from the rest of the execution pipeline.
> Exposing the internal format also means going against abstraction and tying things to an implementation,
As with GPUs, the actual uOP format could continue to be opaque and unknown to the OS. I'm talking about this as distinct from actually changing the ISA.
GeoffreyA - Thursday, October 22, 2020 - link
I see what you're saying now. Something like this: At the OS level, things are still x86, for example. Then, going one step down, we've got the driver which takes all that and translates it into, say, AMD-format micro-ops, using the hw-accelerated units on the CPU, caching the decoded results permanently (so we only compile once), and sending them to the pipeline to be executed. Quite an interesting idea. Roughly equivalent to what CPUs already do, but with the critical difference of stashing the results permanently, which should bring about massive gains in efficiency. Come to think of it, it's a bit like a persistent micro-op cache.
It would come down to this then. Will the performance of this method be a drastic improvement over the current model (which, owing to op cache, might be able to come close)? Will the better performance or lower power consumption justify the extra development work involved? Or is it just better to keep things as-is by paying that extra power?
mode_13h - Friday, October 23, 2020 - link
Yes, it seems @deltaFx2 was on point with both the Transmeta and Denver citations. However, I hadn't considered self-modifying code, though I thought that had somewhat recently fallen out of favor, due to security concerns and counter-measures against things like buffer-overrun exploits. I guess we're not talking about functions that literally modify themselves, but rather JIT-type cases where you have something like a JavaScript engine acting as an in-memory compiler. That strikes me as a slightly more manageable problem, especially if the CPU is capable of supporting a giant uOPS cache in main memory.
GeoffreyA - Saturday, October 24, 2020 - link
Read about Denver and Crusoe this morning and was surprised to see this sort of thing had been done before. Faintly, I do recall reading about the Crusoe years ago but forgot all about it.
GeoffreyA - Saturday, October 24, 2020 - link
And thank you for an engaging discussion as well. I've enjoyed it :)
deltaFx2 - Wednesday, October 21, 2020 - link
"uOP caches are a hack. They work well for loops, but not highly-branchy code."That's not true at all. The uop cache is a decoded instruction cache. It basically translates variable length x86 instructions to fixed length micro-instructions or micro-ops, so that next time around, you need to do less work to decode it so you can supply more instructions/cycle with a much shorter pipeline. The hard part of x86 is not the number of instructions (even arm has thousands) but the fact that it has variable length, prefixes, etc. So you can think of it as a copy of the instruction cache if you like. Caches naturally cache repetitive accesses better than random access. If you mean uopcache is bad for large instruction cache footprint workloads, yes, but that's true of the instruction cache as well. Uopcaches cut down the branch mispredict penalty which is one reason ARM uses it, I'd guess. ARM, despite what you might have heard, is not a 'simple' ISA, so it also benefits from the slightly shorter pipeline out of the opcache.
>> Here's an idea: why not use main memory for a uOP cache?
That was Transmeta / Nvidia Denver. Both had binary translators that rewrote x86 to VLIW (Transmeta) or ARM -> VLIW ARM (Denver). The software+VLIW approach has its drawbacks (predication to achieve the large basic blocks necessary for VLIW to extract parallelism). However, it's certainly possible to rewrite it in uop format and have the CPU suck it up. I've been told doing this in a static binary is hard, but it might be possible to do the Denver thing on an x86 CPU and cut out all the fancy compiler tricks. Not infeasible, but the question always arises whether it's just fewer points of failure if you did it in hardware (self-modifying code, for example - it's a thing for VMs, JITs, etc. That's why ARM now has a coherent I-cache for server CPUs).
>> *sigh* I meant ISA registers, obviously.<snip>
Ok, so x86 32-bit had a problem with 8 arch regs with ax/bx/cx/dx reserved for specific uses. x86-64 largely does away with that although the older forms still exist. A problem with few registers is that you have to spill and fill a lot but it's not as big a deal in x86-64 and newer x86 cpus do memory renaming, which cuts out that latency if it occurs. The argument for many architected registers is strong for in-order cores. For OoO, it's debatable.
>> You don't think it generates more stalls, as well?
No. It generates more pipeline flushes, but they're so rare that it's not worth worrying about. The extra work x86 cores have to do is to wait to make sure that out-of-order loads don't see shared data being written to by another core in the wrong order (TSO). So they're kept around for long enough to make sure this is not the case. It's just bookkeeping. Most of the time, you're not accessing shared data and when you are, another core isn't accessing it most of the time so you toss it away. When it happens, the CPU goes oops, lets flush and try again. ARM claimed to do barrier elision to fix the problem created by weak memory models (DSBs everywhere) and it may be that they are doing the same thing on a smaller scale. I could be wrong though, I haven't seen details.
>> Something else that comes to mind as a bit silly is AVX. Perhaps you'll insist it's trivial, but all the upper-register zeroing in scalar and 128-bit operations seems like wasted effort.
Ah, but there's a reason for that. ARM used to not do that in 32-bit Neon and quickly realized it's a bad idea. x86 learned from 16-bit that it was a bad idea. Preserving the upper bits when you are only using the lower bits of a register means that you have to merge the results from the previous computation that wrote the full 256-bit register (say) with your 64-bit result. It creates false dependencies that hurt performance. 32-bit Neon had a similar thing where a register could be accessed as quad, or double hi, double lo, or single 0/1/2/3. It sucks, and in one implementation of ARM (A57?), they stalled dispatch when this happens. AArch64 zeros out the upper bits just like x86-64.
>> I'm not a CPU designer, but you'd do well to remember that the performance of a CPU is the combination of hardware *and* software.
Agreed. I'm saying there's nothing inherently lacking in x86 as an ISA. It's not 'pretty' but neither is C/C++. Python is pretty until you start debugging someone else's code. Arguably neither is linux if you ask the microkernel crowd. But it works well.
mode_13h - Friday, October 23, 2020 - link
First, I'd like to thank you for your time and sharing your insights. Also, I have much respect for how long Intel and AMD have kept x86 CPUs dominant. Further, I acknowledge that you might indeed be right about everything you claim.
> The uop cache is a decoded instruction cache.
I understand what they are and why they exist. It's probably an overstatement to call them a hack, but my point is that (like pretty much all caches) rather than truly solving a problem in every case, they are an optimization of most cases, at the expense of (hopefully) a few. Even with branch-prediction and prefetching, you still have additional area and power overhead, so there's no free lunch.
> ARM, despite what you might have heard, is not a 'simple' ISA
As I'm sure you know, ARM isn't only one thing. ARMv8-A does away with much of the legacy features from ARMv7-A. Not much is yet known about ARMv9. Of course, ARMv8-A is probably going to be the baseline that all ARM-based servers and notebooks will always have to support.
> That was transmeta/nvidia denver. Both had binary translators that rewrote x86 to VLIW
So, the problem with those examples is that you have all the downsides of VLIW clouding the picture. Nvidia's use of VLIW is the most puzzling, since it really only excels on DSP-type workloads that are much better suited to the GPUs they integrated in the same SoC!
Interestingly, I guess what I was talking about is somewhere in between that and Itanium, which had a hardware x86 decoder. Of course, we know all too well about Itanic's tragic fate, but I was bemused to realize that I'd partially and unintentionally retread that path. And I still wonder if EPIC didn't have some untapped potential, like if they'd added OoO (which is actually possible, with it not being true VLIW). Late in its life, the product line really suffered from the lack of any vector instructions.
> ARM used to not do that in Neon 32-bit and quickly realized it's a bad idea.
SSE also seems to leave the upper elements unchanged, for scalar operations. My concern is that zeroing 224 bits (in case of AVX) or 480 bits (for AVX-512) will impart a slight but measurable cost on the energy-efficiency of scalar operations.
Finally, I should say that this has been one of the best and most enlightening exchanges I've had on this or probably any internet forum. So, I thank you for your patience with my thoughts & opinions, as well as with the 1990's-era commenting system.
deltaFx2 - Saturday, October 24, 2020 - link
Eh, this isn't about whether x86 designers are good or not. My point was that there wasn't anything inherent in the ARM ISA that made it power- or area-efficient, but rather implementation choices by ARM to achieve that. Not sure why it bothers me, but this canard about ISAs making a significant difference to power has to die, and it just doesn't.
>> ARMv8-A does away with much of the legacy features from ARMv7-A.
AArch64 is not simple. The RISC-V folks called ARM CISCy in some paper (can't find it; you'll have to take my word for it - they said something to the effect of "look how CISCy ARM can be"). RISC has morphed to mean whatever people want it to mean, but originally it was about eschewing microcode. ARM has microcode. No, not a microcode ROM; that's just one way to implement microcode. Microcode is a translation layer between the ISA and the hardware. The RISC philosophy was to expose the microarchitecture in the ISA to software so as to have no microcode (and simple instructions that execute in a single cycle, to achieve highly efficient pipelining and 1 instruction per cycle; cutting out microcode was a way of achieving that).
ARM has common instructions that write 3 GPRs. I don't think x86 has 3-GPR-writing instructions (even 2 are rare and special: 128-bit multiply and division, iirc). See LDP with autoincrement, where the CPU loads 2 GPRs from memory and increments the base pointer. Common loads (LDR) support autoincrement, so they write two destinations.
ARM has instructions that can be considered load-execute operations (i.e. operations with memory sources/destinations). This was a huge no-no in RISC. Consider LD4 (single structure) which reads a vector from memory and updates a particular vector lane of 4 different registers (say you wanted to update the 3rd element in 4 packed byte vector registers). The most likely implementation is going to be microcoded with operations that load, read-modify-write a register.
See: https://developer.arm.com/docs/ddi0596/f/simd-and-...
There's other weirdness if you look closely... adds with shifts, loads with scaled index, flag registers (oh my!) etc. just like x86 and in some instances more capable than x86. Perfectly fine but the RISC folks get an aneurysm thinking about it. OoO machines I think benefit from having more information sent per instruction rather than less.
What Aarch64 did was to get rid of variable length (thumb/arm32), super-long-microcoded sequences (check out ldmia/stmia/push/pop), predicated instructions, shadow registers and some other stuff I can't remember. They did not make it RISC-y, it very much retained the information density. Now unlike x86, ARM can't add prefixes (yet) so you have only 4 bytes that blow up into thousands of instructions. So while it doesn't have the variable length problem of x86, it does have the 'more work to decode' problem and hence its pipeline is longer than say MIPS. Hence the uopcache for power savings (those 3-dest/2-dest instructions are going to get cracked, so why do it over and over?).
>> Nvidia's use of VLIW is the most puzzling,
Nvidia hired the same Transmeta people. If you have a hammer, every problem looks like a nail. Also, some people are like the Monty Python black knight: 'tis only a flesh wound. In fairness, they wanted to do x86 and that needs a license from Intel (not happening), but binary translation would work. I have no idea why VLIW was chosen again. I had heard that around the same time, Intel was looking into x86->x86 binary translation to help Atom. Probably went nowhere.
>> we know all too well about Itanic's tragic fate,
Tragic is not the adjective I'd choose. Itanium not only sank but dragged a bunch of other viable ISAs with it (PA-RISC, Alpha, possibly even SPARC, as Sun considered switching to Itanium, IIRC). Itanium should've been torpedoed before it left the harbour. It belongs to the same school of thinking (hopefully dead, but who knows) that all problems in hardware can be solved by exporting them to software. RISC was that (branch delay slots, anyone?), VLIW was that, Itanium was that (see ALAT). If only compilers would do our bidding. And maybe they do in 80% of the cases, but it's the 20% that gets you. Itanium has 128 architectural registers plus some predicate registers. Going out-of-order would be impossible. Too many to rename, needs an enormous RAT, PRF, etc. while x86 would match it with fewer resources. They went too far down the compiler route to be able to back off.
You're right about SSE. I forgot. Nice discussing this with you too; one often gets zealots on forums like this, so it's a welcome change. Hope it helped.
mode_13h - Sunday, October 25, 2020 - link
> this canard about ISAs making a significant difference to power has to die and it just doesn't.
You make compelling points, but somehow it's not enough for me.
An interesting experiment would be to rig a compiler to use a reduced set of GP registers and look at the impact it has on the benchmarks of a couple leading ARM core designs. That should be trivial, for someone who knows the right parts of LLVM or GCC.
I don't know of an easy way to isolate the rest. Maybe a benchmark designed to stress-test the memory-ordering guarantees of x86 could at least put an upper bound on its performance impact. But, the rest of the points would seem to require detailed metrics on area, power-dissipation, critical path, etc. that only the CPU designers probably have access to.
> Aarch64 is not simple. The RISC-V folks called ARM Ciscy in some paper
Thanks for the very enlightening details. I don't have much to say about this subject, and it seems to me that many discussions about ISAs and uArchs veer off into unproductive debates about orthodoxies and semantics.
Again, I appreciate your specific examples, and many of us probably learned a few things, there. I definitely see the relevance to my earlier speculation about decoding cost.
> OoO machines I think benefit from having more information sent per instruction rather than less.
If information-density is the issue, is it not solvable by a simple compression format that can be decoded during i-cache fills? Perhaps it would be smaller and more energy-efficient than adding complexity to the decoder, and not add much latency in comparison with an i-cache miss.
> Itanium not only sank but dragged a bunch of other viable ISAs
We can agree that the outcome was tragic. At the time, I very much drank the EPIC cool-aide, but I was also programming VLIW DSPs and very impressed with the performance. One explanation I heard of its failure is that Intel's legal team had patented so much around it that an unlicensed competing implementation was impossible, and big customers & ISVs were therefore wary of vendor lock-in.
> It belongs to the same school of thinking (hopefully dead but who knows) that all problems in hardware can be solved by exporting it to software.
As for the school of thought being dead, this is worth a look (with a number of the more interesting details hiding in the comments thread):
https://www.anandtech.com/show/15823/russias-elbru...
Also this comes to mind:
https://www.anandtech.com/show/10025/examining-sof...
I'd imagine Google would somehow be involved in the next iteration of software-heavy ISAs.
> Itanium has 128 architectural registers plus some predicated registers. Going out-of-order would be impossible.
At the ISA level, my understanding is that EPIC allows for OoO and speculative execution - all the compiler does is make the data-dependencies explicit, leaving the hardware to do the scheduling (which is required for binary backwards-compatibility). Also, I'm not clear why they'd require renaming for smaller levels of OoO - it seems to me more an issue in cases of extensive reordering, or for speculative execution. Perhaps the compiler would need to encode an additional set of dependencies on availability of the destination registers?
> You're right about SSE.
Something about the way AVX shoehorns scalar arithmetic into those enormous vector registers just feels inefficient.
deltaFx2 - Thursday, October 29, 2020 - link
Here's a datapoint: Rome, 64c @ 225W, or ~3.5W per core (SMT on, mind you). Ampere Altra, 80c @ 250W (3.3GHz SKU), ~3.1W. Ampere claims this SKU performs the same as Rome. https://www.servethehome.com/ampere-altra-80-arm-c... , https://www.servethehome.com/ampere-altra-max-targ... Rome will struggle, I expect, at ~100W TDP due to the MCM design (inefficient). However, from a TCO standpoint, high performance at (reasonably) higher power generally wins because of consolidation effects (fewer racks and whatnot for the same throughput) - unless you are power constrained. Anyway, I'll leave it at that.
>If information-density is the issue, is it not solvable by a simple compression format that can be decoded during i-cache fills?
Let's say you have a jump to target 0xFEED for the first time. How would you find the target instruction if it were compressed? You'd need some large table to tell you where to find it and someone would have to be responsible for maintaining it (like the OS, because otherwise it's a security issue). And for large I-cache footprint workloads, this could happen often enough that it would make things worse.
The ideal ISA would be one that studies the frequencies of various instructions and huffman-encodes them down for Icache density. ISAs are never designed that way, though.
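For what it's worth, here's a minimal, self-contained sketch of that idea: a toy Huffman coder run over invented opcode frequencies (none of the opcode names or counts below come from a real ISA), just to show how frequent operations would end up with short encodings and rare ones with long encodings.

```cpp
#include <cstdio>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy Huffman coder over made-up opcode frequencies: common operations end up
// with short codes, rare ones with long codes. Everything here is invented
// purely for illustration.
struct Node {
    long freq;
    std::string opcode;  // empty string marks an internal node
    Node* left;
    Node* right;
};

struct ByFreq {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

static void print_code_lengths(const Node* n, int depth) {
    if (!n) return;
    if (!n->opcode.empty())
        std::printf("%-7s count=%7ld  code length=%d bits\n", n->opcode.c_str(), n->freq, depth);
    print_code_lengths(n->left, depth + 1);
    print_code_lengths(n->right, depth + 1);
}

int main() {
    std::vector<std::pair<std::string, long>> counts = {
        {"mov", 350000}, {"add", 180000}, {"load", 160000}, {"branch", 120000},
        {"store", 90000}, {"mul", 30000},  {"div", 4000},    {"sqrt", 1500}};

    // Min-heap ordered by frequency.
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& [op, f] : counts) pq.push(new Node{f, op, nullptr, nullptr});

    // Repeatedly merge the two least frequent nodes, Huffman-style.
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, "", a, b});
    }
    print_code_lengths(pq.top(), 0);  // the tree is deliberately leaked; throwaway sketch
    return 0;
}
```

Running it prints a code length per opcode; a real ISA would of course also have to weigh decode complexity, operand encoding, and alignment, which is presumably part of why nobody designs one this way.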
The fundamental problem with compiler-based solutions to OoO is that they cannot deal with unpredictable latencies. Cache latency is the most common case. OoO machines deal with them fine. Nvidia's Denver was particularly strange in that regard, as they should have known why Transmeta didn't work out, yet went with the same solution without addressing that problem (static scheduling can't solve it; oracle prefetching can, but it doesn't exist yet).
VISC: Pay attention to the operating frequency in addition to IPC. If you run your machine at 200MHz, for example, you can get spectacular IPC because memory latency is (almost) constant and your main memory is only (say) 20 cycles away instead of 200 cycles away. The article says their prototype was 500MHz. Intel acquired them for next to nothing ($200M?) so it wasn't like they had something extraordinary. Likely an acquihire. Can't say much about Elbrus as I can't tell what they're doing or how well it performs. If I had to bet, I'd bet against it amounting to much. Too much history pointing in the opposite direction.
>> At the ISA level, my understanding is that EPIC allows for OoO and speculative execution -
Oh yeah, you can probably do OoO on VLIW ISAs too. I'm saying it has too many architected registers. You can solve it by having a backing store for the architectural registers and copying things into the main PRF when needed for OoO execution and all, but it's not efficient and will be trounced by an x86 or ARM design. EPIC only made sense if it reduced the number of transistors spent on speculative execution and gave that to caches and other things outside the core. Otherwise one might as well stick to an ISA with a large install base (x86). As a concept, EPIC was worth a shot (you never know until you try), but HP/Intel should've known well in advance that this wouldn't pan out and killed it. Intel wanted to get in on big iron and thought Itanium was the ticket, plus it didn't have to compete with AMD and Cyrix and whoever else was around then in x86.
mode_13h - Sunday, November 1, 2020 - link
> How would you find the target instruction if it were compressed?
I'm not familiar with the current state of the art, but it does seem to me that you'd need some sort of double-indirection. I'd probably compress each I-cache line into a packet, and have some index you can use to locate it, for a given offset.
You could do some optimizations, though. Like, what about having the index store the first line, uncompressed, and then actually encode the location of the next line? That would avoid the latency hit from double-indirection, only adding the overhead of one memory offset, which would be amortized in fetches of subsequent lines. Interleaving offsets in with the code (or at least all of the branch targets) would bloat it slightly and complicate indexing, but I think not much.
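To make that a bit more concrete, here is a purely hypothetical sketch of the data structures involved - the names, sizes, and layout are all invented for illustration, and nothing here models what real hardware would actually do.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

// Hypothetical index entry for a branch target: the target's own cache line is
// stored uncompressed, plus the location of the *next* line's compressed
// packet, so a cold jump costs one index lookup instead of two dependent ones.
constexpr std::uint64_t kLineBytes = 64;

struct LineEntry {
    std::array<std::uint8_t, kLineBytes> first_line;  // branch target's line, uncompressed
    std::uint64_t next_packet_addr;                   // where line+1's compressed packet lives
    std::uint32_t next_packet_len;                    // its compressed size in bytes
};

// Index keyed by the line-aligned address of each known branch target.
using LineIndex = std::unordered_map<std::uint64_t, LineEntry>;

const LineEntry* lookup(const LineIndex& index, std::uint64_t target) {
    auto it = index.find(target & ~(kLineBytes - 1));
    return it == index.end() ? nullptr : &it->second;
}

int main() {
    LineIndex index;
    index[0x40000] = LineEntry{{}, 0x7f000000, 48};  // one made-up entry
    const LineEntry* e = lookup(index, 0x40018);     // a jump into that line
    return e ? 0 : 1;
}
```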
> The ideal ISA would be one that studies the frequencies of various instructions and huffman-encodes them down for Icache density.
I know, but if you're only compressing the opcodes, that still won't give you optimal compression.
> The fundamental problem with compiler based solutions to OoO are that they cannot deal with unpredictable latencies.
Yes, we're agreed that some runtime OoO is needed (unless you have a huge amount of SMT, like GPUs). I never meant to suggest otherwise - just that compilers (or their optimizers and instruction schedulers) could play a bigger role.
> Can't say much about Elbrus as I can't tell what they're doing or how well it performs.
If you're interested, check out the comments thread in that article. Some interesting tidbits, in there. Plus, about as much (or maybe even a little less) politics as one would expect.
Thanks, again, for the discussion. Very enlightening for myself and doubtlessly a few others.
mode_13h - Sunday, November 1, 2020 - link
I should add that, as manufacturing process technology runs out of steam, I see it as an inevitability that the industry will turn towards more software-heavy approaches to wring more speed and efficiency out of CPUs. It's mainly a question of exactly what shape it takes, who is involved, and when a level of success is achieved that forces everyone else to follow.
deltaFx2 - Tuesday, October 20, 2020 - link
"USB and PCIe aren't x86 standards. " They are Intel IP, standardized. Because unfortunately that's what's needed for wide adoption and intel has a strong not-invented-here culture. There's committees and all that for this but intel is the big dog there. Just like qualcomm is for wireless IP standards. PCIe came about because AMD was pushing non-coherent Hypertransport. So intel decided to nip that in the bud with PCI/PCIe. A long time ago, AMD pushed coherent HT as well, which was adopted by some lke Cray (CXL/CCIX, 20 yrs ago) but AMD's shoddy execution after Barcelona killed it as well. CXL came about because there's no way intel would do CCIX (modified AMBA)Wilco1 - Monday, October 19, 2020 - link
Graviton 2 is close to the fastest EPYC 7742 on both single-threaded performance and throughput despite running at a much lower frequency. Turning off SMT means losing your main advantage. Without SMT, Rome would barely match Graviton 2. Now how will it look compared to 80 cores in Ampere Altra? Again, how does that make x86 look more competitive? Milan maxes out at 64 cores (you need to redesign the IO die to increase cores, so chiplets are no magic solution) and will still use the basic 7nm process, so it doesn't improve much over Rome - rumours say 10-15%.
The facts say 2021 will be the year of the 80-128 core Arm servers. If you believe AMD/Intel can somehow keep up with 64 cores (or less) then that would be fanboyism... DDR5 will help the generation after that and enable even higher core counts (Arm talks about 192 cores using DDR5), but that's going to be 2022.
deltaFx2 - Monday, October 19, 2020 - link
What good is 128 cores if DDR4 bandwidth is insufficient? Niche use cases where data fits in caches? Ampere's 80 cores "beat" Rome in their marketing slides. Where's the official SPEC submission? Third party tests? Nope. Can it be produced in volume? If it were, we'd see general availability. A Silicon Valley startup looking for an exit is prone to gross exaggeration.
Rumors are just that. Rumors were that Zen 3 had a 50% FP IPC uplift. Wait and see.
Wilco1 - Monday, October 19, 2020 - link
Would you say the same about Rome? It has 128 threads too - everybody is hitting the same bandwidth limits, but it simply means scaling is non-linear (rather than non-existent). Right now it looks like all the Altra chips are being gobbled up by their initial customers - Oracle announced Altra availability early next year. So benchmarks should come out soon.
deltaFx2 - Tuesday, October 20, 2020 - link
You said it yourself. If your rumour is true that the 64c Milan gains only 10-15% over Rome, then that only makes sense if DDR bandwidth is constrained. Otherwise, how is it that IPC is up 19% (measured at 4GHz, so maybe higher at typical server frequencies) but socket perf is up less? And you said it yourself too: SMT buys only a 20-30% uplift, whereas 2x cores should scale nearly linearly in an unconstrained system. If 128 SMT threads are starved, would 128 full cores be fed?
mode_13h - Tuesday, October 20, 2020 - link
If we're talking in general terms, SMT buys a heck of a lot more than 20-30%, in many workloads.
Spunjji - Monday, October 19, 2020 - link
That's a marketing slide with no stated parameters. All it sums up are ARM's ambitions. NB: I'm not arguing against the idea that they're an increasingly-relevant player in the market, I'm arguing against the doomsaying.
Wilco1 - Monday, October 19, 2020 - link
It's not an ambition, it shows where Arm designs will be next year. One such design is already known: https://www.anandtech.com/show/16072/sipearl-lets-... As for doom, IIRC nobody claimed AMD or Intel will go bust. However you can't deny there is a shift happening across the industry towards Arm. And it's happening fast, as the Graviton share shows.
Spunjji - Tuesday, October 20, 2020 - link
@Wilco1 - It indicates - vaguely - where ARM predicts their designs will be next year. Pointing to one of the haziest marketing slides I've ever seen significantly weakens your argument. I didn't say anybody claimed they're going bust, either. I simply don't believe the shift is happening "across the industry", and I don't believe x86 is at some massive disadvantage outside of mobile and datacentres, where power-per-core is king once a certain level of performance is reached.
mode_13h - Tuesday, October 20, 2020 - link
Don't forget about laptops. Apple and Qualcomm see a real opportunity, there.
Wilco1 - Wednesday, October 21, 2020 - link
And HPC - the #1 supercomputer is now Arm-based, and uses 512-bit SVE rather than GPUs like most other supercomputers.
mode_13h - Wednesday, October 21, 2020 - link
Showing Xeon Phi how it's done! However, GPUs still make for more efficient vector-processing engines than ARM + SVE. They'll reclaim the top spot, before long.
AntonErtl - Saturday, October 17, 2020 - link
The weaker memory ordering of ARM is a disadvantage for ARM. This means that porting multi-threaded applications from Intel/AMD is hard in general. Of course, ARM (and ARM implementors like Nuvia) can fix this by strengthening the memory ordering in their implementations. Will they do it?
mode_13h - Saturday, October 17, 2020 - link
Not to put too fine a point on it, but that's basically misinformed nonsense. There's absolutely no problem running code written according to standard and well-specified threading APIs. The problems only occur when programmers try to outsmart the OS by doing things like userspace spinlocks -- a practice best summed up by none other than Linus Torvalds:
"Do not use spinlocks in user space, unless you actually know what you're doing.
And be aware that the likelihood that you know what you are doing is basically nil."
https://www.realworldtech.com/forum/?threadid=1897...
Userspace locking has all kinds of pitfalls and is usually only a win in very narrow and unreliable circumstances. There are very many reasons not to do it, with portability problems being only one.
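For readers who want the contrast spelled out, here's a minimal sketch of my own (not taken from Linus' post): the spinlock class is the kind of userspace construct being warned about, while the std::mutex path is the boring, portable route that behaves the same on x86, ARM, or POWER, because the library emits whatever barriers the ISA needs.

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// The sort of userspace spinlock Torvalds warns about: its memory ordering is
// correct, but a waiting thread burns CPU and fights the scheduler instead of
// sleeping. Shown only for contrast; main() below sticks to std::mutex.
class NaiveSpinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag_.clear(std::memory_order_release); }
};

std::mutex mtx;     // the portable, scheduler-friendly option
long counter = 0;

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([] {
            for (int i = 0; i < 100000; ++i) {
                std::lock_guard<std::mutex> guard(mtx);  // library emits whatever
                ++counter;                               // barriers the ISA requires
            }
        });
    }
    for (auto& w : workers) w.join();
    std::printf("counter = %ld\n", counter);  // always 400000, on any ISA
    return 0;
}
```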
mode_13h - Saturday, October 17, 2020 - link
Also, as most modern ISAs are considerably more weakly-ordered than x86 (including POWER and RISC-V), code depending on x86's strong memory ordering is fundamentally non-portable. Various ordering guarantees of several ISAs are conveniently summarized here:
https://en.wikipedia.org/wiki/Memory_ordering#Runt...
AntonErtl - Sunday, October 18, 2020 - link
All code that does not communicate or synchronize memory accesses through API calls may be affected, in particular code that employs lockless or waitless techniques (and much of that code probably is). Yes, such code would not be portable to ARM and Power as they are, which means that ARM and Power will not be able to serve as replacements for Intel and AMD in general. If you want to replace a competitor, you have to give at least all the guarantees that the competition gives, not fewer; otherwise your product will be perceived as incompatible. Your "blame the programmer for not following standards" strategy won't help if the program runs fine on the incumbent system. Yes, giving more guarantees has its costs, but the benefit is worth it: your chips are compatible with existing code. OpenPOWER has understood this for byte ordering and switched to little-endian; I have not followed it enough to know whether they have done it for memory ordering.
Vast amounts of cell phone code running on ARM does not mean anything for servers.
Wilco1 - Sunday, October 18, 2020 - link
In order to work correctly, you have to use some API (even if you define a custom one, as e.g. Linux does). Some primitives are NOPs on some ISAs and are simply required to stop the compiler from reordering memory accesses. If you don't use the API, or don't use it correctly, your code is simply broken and will fail even on x86. I've seen various cases where memory ordering bugs (from not using the APIs) caused easily reproducible crashes on Arm while the code appeared to work on x86. You just needed many more threads and to wait longer for the crash on x86. So a stronger memory ordering does not fix memory ordering bugs, it just makes them a bit harder to trigger.
Personally I much prefer an immediate crash so the bug can be fixed quickly.
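As a concrete illustration of that failure mode (my own sketch, not anything from the linked material): the classic data-plus-flag handoff is a data race if the flag is a plain bool, and while x86's strong hardware ordering often hides the bug, a weakly-ordered core exposes it quickly. Written with a release/acquire atomic, it's correct on every ISA, and on x86 those orderings cost essentially nothing.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

int payload = 0;

// Broken variant (don't do this): a plain bool flag. The compiler or a
// weakly-ordered CPU may reorder the payload store past the flag store, so the
// reader can see ready==true while payload is still 0. On x86 the hardware's
// strong ordering usually hides the bug, but it is still a data race and
// undefined behaviour everywhere.
// bool ready = false;

// Correct variant: an atomic flag with release/acquire ordering, which maps to
// the barriers each ISA needs (and to essentially free operations on x86).
std::atomic<bool> ready{false};

int main() {
    std::thread producer([] {
        payload = 42;
        ready.store(true, std::memory_order_release);  // publishes payload
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) { /* wait */ }
        std::printf("payload = %d\n", payload);         // guaranteed to print 42
    });
    producer.join();
    consumer.join();
    return 0;
}
```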
AntonErtl - Monday, October 19, 2020 - link
If a primitive that is a nop on AMD64 is not used in the program, but would be required on architectures with weaker ordering, the program works as intended on AMD64, but not on the weakly ordered architectures. And I expect that there are a significant number of such programs, enough to make people avoid architectures with weaker ordering. Of course, if the offers for the weaker ordered architectures are substantially cheaper, that might be enough to make some customers organize their computing to use these offers for the programs that appear to work on them (and where the consequences of running a program that does not work as intended are recognizable or harmless enough), and run only the rest on AMD64. But that would mean that weak ordering lowers the value of the architecture substantially. We all would like it if bugs would show up right away, but with memory ordering bugs, that often does not happen. Having more possibilities for making bugs tends to lead to some bug showing up earlier; that does not mean that you eliminate all bugs earlier (probably to the contrary).
mode_13h - Monday, October 19, 2020 - link
Check the Wikipedia link - there are formal guarantees in x86 that do not exist in most modern ISAs. That means you can write robust, lock-free code that will ALWAYS work on x86 and break everywhere else. However, before you actually attempt to do such a thing, you should really spend some time reading Linus' posts in that RealWorldTech forum I linked, above. He outlines a number of performance and efficiency pitfalls that make it inadvisable even on x86. Instead of simply avoiding locks, a better approach to minimizing lock contention and overhead is to minimize thread interaction.
For more on memory ordering, this seems pretty-well written. I've only skimmed it and don't necessarily endorse the author's opinions, but it lays out the issues pretty well:
https://preshing.com/20120930/weak-vs-strong-memor...
https://preshing.com/20121019/this-is-why-they-cal...
Apparently, Microsoft had a significant need to educate game programmers on the subject, in the XBox 360 era (as it's PowerPC-based):
https://docs.microsoft.com/en-us/windows/win32/dxt...
mode_13h - Monday, October 19, 2020 - link
> code that does not communicate or synchronize memory accesses through API calls may be affected, in particular code that employs lockless or waitless techniques
It basically boils down to spinlocks, and enough has been said about that, if you'd care to read it.
> If you want to replace a competitor, you have to give at least all the guarantees that the competition gives, not fewer
It sounds like you're in denial. This turns out not to be the deal-breaker issue you're casting it as.
> Vast amounts of cell phone code running on ARM does not mean anything for servers.
That distinction largely disappeared about a decade ago. Moreover, since you've apparently been living under a rock, it will surprise you to know that Amazon is on their second generation of ARM-based cloud servers, and enterprise Linux distros have supported ARM for more than two years already.
We're no longer talking about the future - this is happening!
AntonErtl - Wednesday, October 21, 2020 - link
Wait-free code avoids all locks, including spin-locks. Lock-free code may wait, but unlike Torvalds' user-mode spinlocks, at least one thread always makes progress. No, these don't boil down to spinlocks at all. In 1990 many (including me) believed that IA-32 would be replaced by RISCs eventually. IA-32 survived and morphed into AMD64 (and also survived IA-64), and most RISCs of 1990 are now dead. These RISCs had more standing in server space than ARM does now. So we will see how it goes.
However, one thing to note wrt the present discussion is that SPARC offered stronger (TSO) and weaker memory models (e.g., RMO). If weak memory orders were as unproblematic for software and as beneficial for hardware and performance as you suggest, they should have dropped TSO and only offered RMO. The reverse has happened.
mode_13h - Wednesday, October 21, 2020 - link
> Wait-free code avoids all locks, including spin-locks. Lock-free code may wait, but ... at least one thread always makes progress.
Most who think they need to write wait-free code are probably wrong (see: premature optimization; optimizing the wrong problem). But, if you really can't avoid significant lock-contention by any other means, then (like I already said) you can do it portably, if you just use standard and well-specified APIs.
https://en.cppreference.com/w/cpp/atomic/atomic_th...
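Since that link is about std::atomic_thread_fence specifically, here is the fence-based flavour of the earlier handoff sketch (again my own illustration, with made-up values): relaxed atomic accesses paired with explicit fences, which compile to almost nothing on x86 and to real barriers on weaker ISAs.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Fence-based release/acquire pairing: a release fence before a relaxed store
// synchronizes with a relaxed load followed by an acquire fence.
std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 1234;
    std::atomic_thread_fence(std::memory_order_release);  // order payload before the flag
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { /* wait */ }
    std::atomic_thread_fence(std::memory_order_acquire);  // order the flag before payload
    std::printf("payload = %d\n", payload);               // prints 1234 on any ISA
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```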
> In 1990 many (including me) believed that IA-32 would be replaced by RISCs eventually. IA-32 survived and morphed into AMD64 (and also survived IA-64), and most RISCs of 1990 are now dead.
I remember that, and I didn't know what they were smoking. Maybe it was my youthful inexperience, but I could hardly imagine a world in which PCs weren't x86-based. And PCs were taking over everything!
I remember that, and I didn't know what they were smoking. Maybe it was my youthful inexperience, but I could hardly imagine a world in which PCs weren't x86-based. And PCs were taking over everything!
> These RISCs had more standing in server space than ARM does now.
So did mainframes!
So did mainframes!
> one thing to note wrt the present discussion is that SPARC
Eh, SPARC was basically the weakest of the lot. MIPS and POWER have way more going on than SPARC. About the only one deader than SPARC is Alpha, and that was killed off precisely because of how *good* it was! I guess PA-RISC is almost as dead, also for largely anti-competitive reasons. Anyway, I guess we can take it as a cautionary tale that changing to a stronger memory order is *not* going to save your bacon!
mode_13h - Saturday, October 17, 2020 - link
BTW, why would ARM or Nuvia do any differently, when vast amounts of code are already running perfectly on ARM? Do you think strong memory ordering is without costs?
mode_13h - Friday, October 16, 2020 - link
Ian, it would've been great if you'd gotten in a question about the Ryzen 3 3300X! Why'd they even offer it, only to withdraw it so quickly? Was it just a PR move?
scineram - Friday, October 16, 2020 - link
Obviously a supply issue they did not expect in Q1.
haukionkannel - Saturday, October 17, 2020 - link
Just look how good the 7nm is! AMD does not get too many chips that are so bad they have to be used in 3300 chips! Most are good for the 3600 and most likely even more for the 3700! Very few chips have only 4 to 5 working CPU cores! So they cannot sell something they don't have!
mode_13h - Saturday, October 17, 2020 - link
There is nothing preventing them from taking chips from higher bins, unless supply of those bins is tight. There's a long history of both Intel and AMD selling more functional chips into lower bins, which some would take advantage of by overclocking or even unlocking disabled cores (not sure if that's possible on any Ryzens).
Spunjji - Monday, October 19, 2020 - link
I've not heard of core unlocking on Ryzen, sadly, as if it were a thing I'd have the right kind of CPU to play with it.
Given that supplies of the 3100 are still available, I'd hazard that it's a combination of factors: too many functional 6+ core chips that they can sell without difficulty, not enough defective chips with 4 working cores entirely on one CCX, and competition with their own discounted 14/12nm products making the required binning less profitable.
TeXWiller - Friday, October 16, 2020 - link
I can't wait to read about the ROP protections and eventually their independent tests for both Intel and AMD.
Drazick - Friday, October 16, 2020 - link
Great interview. Actually I think the most important note is:
> MP: We do - we have math kernel libraries that optimize around Zen 3. That will be all part of the roll-out as the year continues.
Intel's major advantage is its libraries: Intel MKL, Intel IPP, the Intel Compiler. They are widely used in the industry and create a major advantage for Intel. Once AMD has something similar (ideally open source) that gets widely adopted, we'll see its real-world advantage grow even bigger.
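The API side of that story mostly exists already, for what it's worth. A rough sketch, under the assumption that whatever AMD ships keeps exposing the standard CBLAS interface (as OpenBLAS and AMD's open-source BLIS already do): code written against cblas.h doesn't change at all when you swap the library underneath, only the link line does, so the competition is really about how well-tuned the kernels behind the interface are.

    #include <cblas.h>   // standard CBLAS header - provided by MKL, OpenBLAS and BLIS alike
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 512;
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

        // C = 1.0 * A * B + 0.0 * C; the linked library selects its
        // architecture-specific GEMM kernel (AVX2, AVX-512, Zen-tuned, ...).
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A.data(), n,
                         B.data(), n,
                    0.0, C.data(), n);

        std::printf("C[0] = %.1f\n", C[0]);  // expect 1024.0 (= 2 * 512)
    }

Link it against -lopenblas today, or against whatever Zen 3-optimized BLAS AMD rolls out later, and the source stays identical.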
Kishoreshack - Friday, October 16, 2020 - link
New-found love for Ian Cutress; I was missing such excellent interviews.
smilingcrow - Friday, October 16, 2020 - link
Good questions but many of the answers were pure PR fluff. Unsurprisingly, as they plan on releasing more info in the future, after release.
So the timing of this shows it to be a PR stunt.
I'd have respected AMD if they'd declined to take the interview until they could remove the gag from Puppetmaster.
eva02langley - Friday, October 16, 2020 - link
He's the CTO you genius... go do an interview with Jensen Huang... and just swallow it...
smilingcrow - Saturday, October 17, 2020 - link
Makes zero difference, usual PR-gagged BS. Tedious.
Qasar - Saturday, October 17, 2020 - link
yea ok, sure. and you have proof of this?
Spunjji - Monday, October 19, 2020 - link
What were you expecting? There were plenty of hints, we'll get more later, at least Ian asked the good questions in the first place. No need to be personal about it.
Oxford Guy - Friday, October 16, 2020 - link
Very very. Great great. Very excited. Etc.
One thing I miss about the 1950s is that a technical interview wasn’t filled with the vapid ‘eternal sugar rush’ hype that fills everything involving corporate money these days.
I, for instance, can’t stomach listening to top tennis players like Federer speak. It’s all “amazing” and “unbelievable” all the time. This is one area in which culture is continuing to deteriorate.
GeoffreyA - Saturday, October 17, 2020 - link
Agreed that language has gone downhill. As for "amazing," it's become one of those words over-used as a shorthand for the good x 10 idea behind it (I am as guilty as anybody). "Stunning" for home improvement and "decadent" for cooking also come to mind.
icoreaudience - Friday, October 16, 2020 - link
The amount of bullshit non-answers in this interview is mind-blowing.
Might as well read a press release, it's as devoid of information.
What have we learned beyond "Yeah, of course We are the best" + _nothing_ ??
carewolf - Friday, October 16, 2020 - link
That is what you get for talking to a manager instead of an engineer. Wondering what is up with the low-information strategy.
Arbie - Friday, October 16, 2020 - link
He has an MSEE. Hence CTO. Tired of the "engineers >> managers" meme.
GreenReaper - Saturday, October 17, 2020 - link
Sure, but it sounds like a Microsoft qualification, even if it isn't. ;-p
smilingcrow - Saturday, October 17, 2020 - link
If they had held this in November after the launch he'd have had more freedom to answer questions.
Although based on his fluency in PR speak I'm not sure exactly how much more interesting it would be.
Spunjji - Monday, October 19, 2020 - link
Your mind was blown by non-answers from a CTO in a pre-release interview? Seems like it's not just Papermaster that's into hyperbole here 🤭
AMDSuperFan - Friday, October 16, 2020 - link
This article proves technical superiority of AMD. This Papermaster really will only tell us truths and without contempt. I appreciate hearing about my favorite company and their plans. They never fail in their plans. The people there seem so smart, funny, and most importantly good looking. I would very much like to hear more about Big Navi coming to compete with Intel Tiger and Rocket. I expect a very big speed improvement. Why not release the benchmarks now since everyone is so confident? Let's see what is in the tank for AMD to really win.
eva02langley - Friday, October 16, 2020 - link
Nice trolling...
Spunjji - Monday, October 19, 2020 - link
You can almost feel the person running that account getting less and less sure that they have anything remotely funny to say.
WaltC - Friday, October 16, 2020 - link
Great interview, Ian...;) Glad Papermaster was so forthcoming--AMD intends to be a moving target with respect to its competitors. Zen 4 is already in design at this time--Zen 5 is "a thing" already. Keeping the pedal to the metal and continuing its great record of execution will be necessary for AMD to continue its industry leadership on out into the future. It's kind of wild to consider the leadership role AMD is in at this time--it's really unique. Intel has never had this kind of *self-imposed* competitive pressure--what AMD has pulled off here is quite remarkable, imo. The folks at AMD know exactly what they've been doing for the past several years and it shows.
WaltC - Friday, October 16, 2020 - link
Wanted to add that Papermaster's detail on Zen 3 was very interesting--it's a ground-up redesign of the entire architecture--as opposed to an incremental change built on top of Zen 2. It will be interesting to see who, if anyone, will be able to keep up with AMD's pace of development in the next few years.
haukionkannel - Saturday, October 17, 2020 - link
True. But based on the IPC gains... it was quite obvious... Small tinkering does not give you 19% higher IPC, you need architecture changes!
Qasar - Saturday, October 17, 2020 - link
which they have done.
smilingcrow - Saturday, October 17, 2020 - link
Very impressive how much they have managed to improve in one generation.
Now that the 8C chiplets are monolithic in a sense, I wonder where the gains are next time around?
The I/O die needs some love.
Plus the eventual move to 5nm should help along with DDR5.
Spunjji - Monday, October 19, 2020 - link
I'm guessing 5nm on the chiplets and 7nm on the IO die will be the order of business... at least for the server parts. It'll be interesting to see what they do elsewhere, as unless I'm mistaken they're still on the hook for some 14/12nm wafer starts with GloFo through 2021.
GeoffreyA - Sunday, October 18, 2020 - link
My guess is that Papermaster's words about "redesign" can be interpreted in different ways. In Zen 3, it appears they've reworked every inch of the implementation, while the bird's-eye abstraction/schematic is largely the Zen architecture, with widening of the core across the board, and some new spices thrown in here and there. Just speculation on my part. Either way, brilliant work from AMD.
shabby - Friday, October 16, 2020 - link
Pity his last name isn't Zenmaster 😂
Kenkyee - Friday, October 16, 2020 - link
Wish they'd asked when the AM5 socket is coming...
eva02langley - Friday, October 16, 2020 - link
With Zen 4... it is like asking when the 2019 Chevy is going to be released...
haukionkannel - Saturday, October 17, 2020 - link
AM5 in 2022.
realbabilu - Friday, October 16, 2020 - link
AMD needs to optimize its math kernel libraries for Windows too, optimizing for the L1/L2/L3 caches like OpenBLAS does, to fight Intel MKL.
zodiacfml - Saturday, October 17, 2020 - link
What happened to the availability of the 3300X? Is it that they're having such good yields that they're not getting enough defective chips?
mode_13h - Saturday, October 17, 2020 - link
Yes, good questions. If only there were some inquisitive and resourceful journalist to find us some answers...
Since I've seen a couple of OEM systems with these CPUs, one possibility is that they've completely soaked up AMD's supply. AMD seems to be bending over backwards to stay in OEMs' good graces lately.
Spunjji - Monday, October 19, 2020 - link
That would make sense. Failing to gain traction with OEMs is what choked them back in their pre-Dozer heyday.
gruffi - Saturday, October 17, 2020 - link
"We didn’t change technology nodes - we stayed in 7nm. So I think your readers would have naturally assumed therefore we went up significantly in power but the team did a phenomenal job of managing not just the new core complex but across every aspect of implementation and kept Zen 3 in the power envelope that we had been in Zen 2."If tests can prove it I think that's the real achievement of Zen 3. Intel's Sunny Cove also had a significant IPC uplift, but at the cost of higher power consumption.
Zingam - Saturday, October 17, 2020 - link
OK! Now show me the chips for tablets, notebooks and laptops!
Spunjji - Monday, October 19, 2020 - link
You'll need to wait at least 6 months for that. Right now it's Renoir. Maybe focus on asking OEMs why they aren't using it.
Zingam - Saturday, October 17, 2020 - link
You do not record an interview with a clicky keyboard ever!AntonErtl - Sunday, October 18, 2020 - link
If Zen 3 is a complete redesign, why is it called Zen 3, and not, say, Painter 1?
I actually expect that Zen 3 is a substantial rework of the basic Zen microarchitecture, but not a from-scratch redesign of everything, like Bulldozer and Zen 1 were (and even there I expect that some good parts were reused to some extent). And sticking with a good microarchitecture and refining it has been a good strategy: Intel's Tiger Lake has a lineage going back to the Pentium Pro released in 1995, and both Intel's from-scratch NetBurst line and AMD's from-scratch Bulldozer line turned out to be dead ends.
sing_electric - Sunday, October 18, 2020 - link
I think it's that from a block-diagram level, Zen 3 looks like Zen. What they did is look deep at each part of each block and optimize for performance.
You're right, though, in that if the architecture works, use it, instead of making crazy assumptions of how computing "should work," like Intel's crazy long pipelines in NetBurst (... or all of Itanium, for that matter...).
Qasar - Sunday, October 18, 2020 - link
" If Zen 3 is a complete redesign, why is it called Zen 3, and not, say, Painter 1. " the same can be said about intel and anything gen 10 and higher. its been said that they are a new architecture, but intel sill calls it gen 10, gen 11 etc.AntonErtl - Monday, October 19, 2020 - link
Intel's generations just tell you the year. They even have different microarchitectures in the same generation. Anyway, in their tech marketing (and in their optimization guides) Intel has revealed the microarchitectures of their products to some extent, and we know that Skylake through Comet Lake have the same microarchitecture, and we know that Sunny Cove is a widened variant with various changes (e.g., AVX-512, and ROB size increased from 224 to 352 instructions). AMD has done that, too, and I expect that they will do it for Zen 3 at some later time.
Qasar - Monday, October 19, 2020 - link
yep, not a new architecture, just an update to an existing oneGeoffreyA - Sunday, October 18, 2020 - link
Feel the same way too. Papermaster's "redesign" didn't really answer Ian's question and is blurring things. I think the implementation has been reworked quite a bit, but the architecture is largely the old Zen design, with the core widened and perhaps some new surprises here and there. Agree also that radical change is a bad idea in CPU design.
Kalibr - Sunday, October 18, 2020 - link
ALWAYS keep in mind that competition is good and drives perf/price to good values.
Ofc the price you see now also drives future innovations to a certain degree; still, it would be interesting to see the profit margins (I have a feeling they will blow your mind - both Intel and AMD, I think, have plenty of headroom).
Now maybe AMD has taken the lead, but a lead far ahead will just drive prices up and promote non-innovation. At least there is the ARM ISA (ecosystem whatever) with interesting options.
SIDtech - Sunday, October 18, 2020 - link
'At near x86 IPC the new ARM Neoverse V1' I believe somewhere Andrei F would throw a fit hearing about this 😂
Carmen00 - Monday, October 19, 2020 - link
Another fantastic AnandTech interview, in-depth and beautifully edited. Thank you, Dr Cutress!
Spunjji - Monday, October 19, 2020 - link
Great interview there. I wasn't expecting particularly illuminating answers, but the questions were top-notch and probably got as much out of him as we were ever going to get at this stage. Cheers, Ian!
abufrejoval - Monday, October 19, 2020 - link
For me the most interesting bits in this interview were:
1. Control Flow Integrity (and shadow stacks) seems to be included in Zen 3 (hopefully 100% Intel compatible)
2. Per-VM (or multi-key) memory encryption seems to be supported also on the client side, even notebooks
Both are seriously interesting improvements, IMHO
DigitalFreak - Monday, October 19, 2020 - link
I believe his name is actually spelled Mark Paperlaunchmaster.
Qasar - Monday, October 19, 2020 - link
and why would you say that?
Spunjji - Tuesday, October 20, 2020 - link
Presumably this one doesn't understand the difference between a product announcement and a launch.
quiet-cheese - Tuesday, October 20, 2020 - link
Is it just me?
I love the work AMD has put out and greatly respect the engineering team behind it.
but i'm kind of disappointed after reading this interview. i walk away with nothing technically concrete. this reads like what a PR staff put out, not the CTO of the company. IC asked some great technical questions but the answers given by MP to me felt more or less just like standard PR marketing points.
Bytales - Wednesday, October 21, 2020 - link
What I'd like to know, if anyone cares about it and wants to find out: when are we supposed to get Zen 3 Threadripper PRO CPUs? I care about properly working ECC RAM and octo-channel memory, which is probably a must for that many cores... But nothing on the horizon about it. One can barely get their hands on a Zen 2 Threadripper PRO, it's more like a unicorn, and the motherboards are like a unicorn's child. Nobody has seen or heard about them...
mode_13h - Wednesday, October 21, 2020 - link
> One can barely get their hands on a Zen 2 Threadripper PRO, it's more like a unicorn, and the motherboards are like a unicorn's child. Nobody has seen or heard about them...
Like their other Pro-branded CPUs, the Threadripper Pros are only available to OEMs. I'm not happy about it, either.
mode_13h - Wednesday, October 21, 2020 - link
Luckily, you can still build an Epyc-based machine. I expect that's the next part of their product line AMD will snatch away from DIYers.
croc - Monday, March 29, 2021 - link
Has the Zen 3 Threadripper become the new 'Fight Club'? Or the biggest elephant in the room? Or is AMD not having such good yields on their higher core count CPUs? Obviously our good Doctor is not able / allowed / too afraid of losing advertising bucks to ask the question. AnandTech used to be about asking the hard questions.