> We’ve known for many years that having two threads per core is not the same as having two cores
True, and I still read this as an argument against SMT in forums. IMO it should be pointed out clearly that the cost of implementing either also differs drastically: +100% core size for another core and ~5% for SMT.
Intel began its HT journey in order to pull more efficiency from each core, basically because performance was being left on the table. Interestingly enough, after the Athlon and A64, AMD roundly criticized Intel because the SMT thread was not run by a "real core"... and then proceeded to ship cores with two integer units, which AMD then labeled as "cores"...;) Intel's HT approach proved superior, obviously. IIRC. It's been a while so the memories are vague...;)

The only problem with this article is that it tries to make calls about SMT hardware design without really looking hard at the software, and the case for SMT is a case for SMT software. Games will not use more than 4-8 threads simultaneously, so of course there is little difference between SMT on and off when running most games on a 5950X. You would likely see nearly the same results on a 5600X in terms of gaming. SMT on or off when running these games leaves most of the CPU's resources untouched. Programs designed and written to utilize a lot of threads, however, show robust, healthy scaling with SMT on versus no SMT.

So, without a doubt, SMT CPU design is superior to no SMT from the standpoint of the hardware's performance. The outlier is the software, not the hardware. And the hardware should never be judged strictly by the software one arbitrarily decides to run on it. We learn a lot more about the limits of the software tested here than we learn about SMT, which is a solid performance design in CPU hardware.
Very valid point about how games won't see a difference between 16 and 32 threads when they only use 6. Do you know if this type of analysis has been done at the lower end of the market?
It's been common knowledge, established a few years ago when AMD started pushing 8-core (and greater) CPUs, that games don't require that many cores and that 6 cores is optimal for gaming right now. And if you occasionally do more than game and need more than 6 threads, then SMT is there for you. As the new consoles are 8-core CPU designs, over time the number of cores required for optimal game performance will increase.
Those Jaguar cores were more like a 4c/8t processor, to be fair. And they weren't that much better than Intel's Atom cores, a far cry from Intel's Skylake Core architecture. Current-gen consoles were also very light on the OS, so maybe using one full core (or two shared threads), leaving only 3 cores for games, but that's still much better than the 2-core-optimised games of the PS3/360 era.

The new-gen consoles will be somewhat similar, with only one full core (two threads) reserved for the OS. But this time we have an architecture that's on par with Intel's Skylake cores, with a modern, full 8-core processor (SMT/HT optional), leaving a healthy 7 cores dedicated to games. Optimisations should come sooner rather than later, and we'll see the effects on PC ports by 2022. So we should see a widening gap between 4 vs 6 cores, and to a lesser extent 6 vs 8 cores, in the future. I wouldn't future-proof my rig by going for a 5700X instead of a 5600X; I would do that for the next round (i.e. 2022 Zen 4).
The 8 Jaguar cores are in no way like 4c/8t CPUs; if you use only half of them, you get half the performance (unless your application is memory/L2-bandwidth-limited). Their predecessor Bobcat is about twice as fast as a Bonnell core (Atom proper), and a little slower than Silvermont (the core that replaced Bonnell), about half as fast as Goldmont+ (all at the clock rates at which they were available in fanless mini-ITX boards), one third as fast as a 3.5GHz Excavator core, and one sixth as fast as a 4.2GHz Skylake.
Worse IPC than Bulldozer as far as I know. Certainly worse than Piledriver.
Really sad. The "consoles" should have used something better than Jaguar. It's bad enough that the "consoles" are a parasitic drain on PC gaming in the first place. It's worse when they not only drain life with their superfluous walled gardens but also by foisting such a low-grade CPU onto the art.
The Jaguar cores share a lot of DNA with Bulldozer, but they aren't the same. It's like Intel's Atom chips compared to Intel's Core i chips. With that said, 2015's Puma+ was a slight improvement over 2013's Jaguar, which was a modest improvement over the initial 2011 Bobcat lineup. All this started in 2006 with AMD choosing to evolve their earlier Phenom II cores, which are derivatives of the AMD Athlon 64.

So just by their history, we can see they're in line with Intel's Atom architecture evolution, and basically a direct competitor. Intel had slightly less performance but much lower power draw, making them the obvious winner, and leaving AMD to fill in the budget segments of the market.
As for the core arrangement, they don't have full, proper cores as people expect them. Like the Bulldozer architecture, each core had to share resources such as the decoder and floating-point unit. So in many instances, one core would have to wait for the other core. This boosts multithreaded performance with simple calculations in orderly patterns. However, with more complex calculations and erratic/dynamic patterns (i.e. regular PC use), it causes a hit to single-thread performance and notable hiccups. So my statement was true: this is more like a 4c/8t chip, and it is less like a Core i and much more like an Atom. But don't take my word for it, take Dr. Ian Cutress's. He said the same thing during the deep dive into the Jaguar microarchitecture, and recently in the Chuwi Aerobox (Xbox One S) article. https://www.anandtech.com/show/16336/installing-wi...
Now, there have been huge benefits to the gaming PC industry, and game ports, due to the PS4/XB1. The first being the direct x86-64 compatibility. The second was the cross-compatibility thanks to Vulkan and DirectX (more so with the PS4 Pro and XB1X). The third being that it forced game developers to innovate their game engines, so that they're less narrow and more multi-threaded. With the PS5/XSX we now see a second huge push with this philosophy, plus the improvements of fast single-thread performance and fast flash-storage access. So I think that while we had legitimate reasons to groan about the architecture (especially in the PS4) upon release, we do have to recognize the conveniences that they also brought (especially in the XB1X). This is just to show that my stance wasn't about console bashing.
@Kangal, Jaguar APUs in consoles are definitely not "like a 4c/8t processor" because they don't use CMT. They are full 8 cores. Their IPC may be comparable with some newer Atoms although it's hard to benchmark how the later "Evolved Jaguar" cores in the mid generation console refresh compares against the regular Jaguar or Atom.
Xbox360 came out in 2005. 3C/6T. Even the PS3 had a 1C/2T PowerPC PPE and 6 SPEs, so a total of 8T. PS4/XO is 8C/8T. Though I guess we could blame lack of CPU utilization still on this last generation using pretty weak cores from the get go. IIRC 8 core Jaguar would be on par with an Intel i3 at the time of these console releases.
Though, the only other option AMD had was Piledriver. Piledriver was still a poor performer and a power hog, and it would likely only have been worth it over 8 Jaguar cores if they had gone with a 3- or 4-module chip.
It is nice that this generation MS and Sony both went all out on the CPU. Just too bad they aren't Zen 3 based. :(
It should be kept in mind that, at the time AMD criticized Intel for that, AMD had actual dual-cores (A64 X2) and Intel still had single-cores with HT, which makes the criticism rather fair.
Intel's approach wasn't that much superior. In fact, in the early days of Intel's HTT processors, many applications, even ones that were supposed to be optimised with a multi-core code path, were getting lower scores with HTT enabled than with it disabled.

The main culprit was that applications were designed to handle each thread on a real core, not two threads on a single core; the threads were fighting for resources that weren't there.

Intel knew this and worked hard with developers to make them understand the difference and apply the change to their code paths. It actually took some time until multi-core applications were SMT-aware and had a code path for this.

In AMD's case, AMD couldn't work with developers as hard as Intel did to get them to add a new code path just for AMD CPUs. Not to mention that Intel was playing dirty, starting with their famous compiler, which was, and still is, used by many developers to compile applications. The compiler will optimise the code for Intel's CPUs and have an optimised code path for every CPU and CPU feature Intel has, but when the application detects a non-Intel CPU, including AMD's, it will select the slowest code path and will not even test for the features to pick a better one.

This also hurt AMD's CPUs. Sure, the CPUs lacked FPU performance and were not competitive enough (even when the software was optimised), but the whole optimisation situation made AMD's CPUs look inefficient. The idea should have worked better than Intel's, because there's actual real hardware there (at least for integer), but developers didn't work harder, and the Intel compiler also played a major role for smaller developers.

TL;DR: the main issues were the Intel compiler and a lack of developer interest, and then the actual cores were also not that much stronger than Intel's (IPC-wise). AMD's idea should have worked, but things weren't on their side.

And by the time AMD came out with their design, they were already late; applications were already optimised for Intel HTT, which had become very good as almost all applications became SMT-aware. AMD acknowledged this and knew they had to take what developers already had and work with it. They also worked hard on their SMT implementation, to the point that their SMT is now touted as better than Intel's own implementation (HTT).
Urm, no, Intel's compiler isn't used that often these days unless you're doing really heavy maths. Microsoft's compiler is used much more often, though Clang is taking off.
During the P4 era, HT gave no performance difference compared to the Athlon 64, but in the Core 2 Duo era it showed better performance. Probably because we only had 2-4 cores, not enough for our multitasking needs. Now we have 4-32 cores that are much more powerful and efficient, so SMT may not be that significant anymore, which is why most tests show no big performance lift.
5%? I think more than 5% is needed for a whole second set of registers plus the logic needed to properly handle context switching. Everything in between the cache and pipeline needs to be doubled.
Register renaming means they already have more registers, which don't need to be copied: there are more physical registers in hardware than logical registers exposed to the programmer. Say you have 16 logical registers exposed to the coder per thread and 128 rename registers in hardware; with SMT (2 threads/core) you get the same 16 logical registers, but each thread has 64 rename registers instead of 128. Compare mixed workloads, e.g. 8 int/branch-heavy threads with 8 FP-heavy threads on 8 cores, or OS background tasks like indexing/search/antivirus.
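To make that arithmetic concrete, here is a tiny Python sketch (the 16 logical / 128 physical figures are the same assumed numbers as above, not any specific CPU's real counts; real cores typically share the rename pool dynamically or with watermarks rather than a strict 50/50 split):

# Illustration only: assumed register counts.
ARCH_REGS = 16    # architectural (logical) registers visible to each thread
PHYS_REGS = 128   # physical rename registers in hardware

for threads in (1, 2):                   # SMT off vs. SMT2
    pool = PHYS_REGS // threads          # naive equal split per thread
    in_flight = pool - ARCH_REGS         # what is left for speculative, in-flight results
    print(f"{threads} thread(s)/core: {pool} physical registers each, ~{in_flight} free for renaming")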
The 5% is from Intel for the original Pentium 4. At some point in the last 10 years I think I read a comparable number, probably here on AT, regarding a more modern chip.
There is little accurate info about it, but the fact is that x86 cores are many times larger than Arm cores with similar performance, so it must be a lot more than 5%. Graviton 2 gives 75-80% of the performance of the fastest Rome at less than a third of the area (and half the power).
That doesn't mean SMT is mainly responsible for that. The x86 decoders are a lot more complex. And at the top end you get diminishing performance returns for additional die area.
I didn't say all the difference comes from SMT, but it can't be the x86 decoders either. A Zen 2 without L2 is ~2.9 times the size of a Neoverse N1 core in 7nm. That's a huge factor. So 2 N1 cores are smaller and significantly faster than 1 SMT2 Zen 2 core. Not exactly an advertisement for SMT, is it?
> Graviton 2 gives 75-80% of the performance of the fastest Rome at less than a third of the area

To be honest, it wouldn't surprise me one bit if 90% of the area gives 10% of the performance. Wringing out that extra 1% single-threaded performance here or there is the name of the game nowadays.
Also there are many other differences that probably cost a fair bit of silicon, like wider vector units (NEON is still 128-bit, and exceedingly few ARM cores implement SVE yet).
It's nowhere near as bad with Arm designs showing large gains every year. Next generation Neoverse has 40+% higher IPC.
Yes, Arm designs opt for smaller SIMD units. It's all about getting the most efficient use out of transistors. Huge 512-bit SIMD units add a lot of area, power and complexity with little performance gain in typical code. That's why 512-bit SVE is used in HPC and nowhere else.
So with Arm you get many different designs that target specific markets. That's more efficient than one big complex design that needs to address every market and isn't optimal in any.
The extra 20% of performance is difficult to achieve. You can already see it on Zen CPUs, where 16-core designs running at around 3.4 GHz are dramatically more efficient per core in multithreaded work than 8-core designs running at 4.8 GHz. I've always hated these comparisons with Arm for this reason... you need parts at 1:1 watt parity to make a fair comparison; otherwise, 80% of the performance at half the power can also be accomplished on x86 by just reducing frequency and upping the core count.
Graviton clocks low to conserve power, and still gets close to Rome. You can easily clock it higher - Ampere Altra does clock the same N1 core 32% higher. So that 20-25% gap is already gone. We also know about the next generation (Neoverse N2 and V1) which have 40+% higher IPC.
Yes adding more cores and clocking a bit lower is more efficient. But that's only feasible when your core is small! Altra Max has 128 cores on a single die, and I don't think we'll see AMD getting anywhere near that in the next few years even with chiplets.
It is obviously a lot LESS than 5%. Nothing that matters in terms of transistor count (caches and vector units) increases. Even doubling the registers would add a few hundred or a few thousand transistors on a chip with tens of billions of transistors, less than 0.000001%.

They could double all the scalar units and it would still be below a 1% increase.
I agree. Adding SMT/HT requires something like a +10% increase in the silicon budget and a +5% increase in power draw, but increases performance by around +30%, generally speaking. So it's worth the trade-off for daily tasks, and for those on a budget.

What I was curious to see is: if you disabled SMT on the 5950X, which has lots of cores, leaving each thread slightly more resources, and used the extra thermal headroom to overclock the processor, how would that affect games?

My hunch? Thread-happy games like Ashes of the Singularity would perform worse, since they are optimised to take advantage of SMT. Unoptimized games like Fallout 76 should see an increase in performance. Actually well-optimised games like Metro Exodus should be roughly equal between the overclock and SMT.
I guess you didn't understand my point. Think of a modern game that is well optimised and is both GPU-intensive and CPU-intensive, such as Far Cry 5 or Metro Exodus. These games scale well anywhere from 4 physical cores to 8 physical cores.

So using the 5950X with its 16 physical cores, you really don't need the extra threads. In fact, it's possible to see a performance uplift without SMT, dropping from 32 shared threads down to 16 full threads, as each core gets better utilisation. Now add some overclocking (+0.2 GHz?) thanks to the extra thermal headroom, and you may legitimately get more performance in these titles. Though I suspect they wouldn't see any substantial increase or decrease in frame rates.

In horribly optimised games, like Fallout 76, Mafia 3, or even AC Odyssey, anything could happen (though they would probably see some increases). Whereas we already know that games which aren't GPU-intensive but are CPU-intensive (e.g. practically all RTS games) were designed to scale up much, much better. So even with the full cores and the overclock, we know those games will actually show a decrease in performance from losing the extra SMT threads.
With an R5 1600 it makes about a 5-6% difference in usable clock speed (200-250 MHz), and also in temperature, if you reduce background operations while gaming; with an R7 3800X it is not as noticeable. I don't know about recent game releases, but older ones only use 2-4 cores (threads), so clocking the R5 1600 at 3750 MHz (SMT on) vs 3975 MHz (SMT off) does make a difference in frame rates.
I think that 5% area cost for SMT is marketing. If you only count the logic that is essential for SMT, then it might be 5%. However, many resources need to be increased or doubled. Even if that helps single-threaded performance, it still adds a lot of area that you wouldn't need without SMT.
Graviton 2 proves that 2 small non-SMT cores will beat one big SMT core on multithreaded workloads using a fraction of the silicon and power.
Several compute applications do not need hyper-threading. A couple of official references: 1. Wolfram Mathematica: "Mathematica's Parallel Computing suite does not necessarily benefit from hyper-threading, although certain kernel functionality will take advantage of it when it provides a speedup." [source: https://support.wolfram.com/39353?src=mathematica]. Indeed, Mathematica automatically sets up a number of threads equal to the number of physical cores of the CPU. 2. Intel MKL library: "Hyper-Threading Technology (HT Technology) is especially effective when each thread is performing different types of operations and when there are under-utilized resources on the processor. Intel MKL fits neither of these criteria as the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance when using Intel MKL without HT Technology enabled." [source: https://software.intel.com/content/www/us/en/devel...].
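If you want to follow that guidance in practice, a rough Python sketch (assuming a numpy build backed by MKL or OpenBLAS; the environment variables are the standard MKL/OpenMP knobs and must be set before the library is imported):

import os
import psutil  # third-party: pip install psutil

physical = psutil.cpu_count(logical=False)        # real cores, SMT siblings excluded
os.environ["MKL_NUM_THREADS"] = str(physical)
os.environ["OMP_NUM_THREADS"] = str(physical)

import numpy as np                                # import after setting the knobs
a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)
c = a @ b                                         # the GEMM now runs one worker per physical core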
Apparently Intel MKL, and MATLAB which uses Intel MKL, only allow AMD CPUs to use the non-AVX2 library. Only on Linux, with a preloaded library that fakes the CPU vendor, can you get around this. https://www.google.com/amp/s/simon-martin.net/2020...
The "if not GenuineIntel CPU" check disabled all optimisations. This rubbish has been going on for years; only around 2019 or 2020 did they actually start fixing their code to detect whether AVX is available. Even Btrfs had this problem: it wouldn't use hardware acceleration if it wasn't running on an Intel CPU. Again, lazy coding.
As usage for modern users changes I wonder how this could be better tested/visualized.
I am not looking at a 5900X to run any advanced tools. I am looking to game, run multiple browsers with a few dozen tabs open, stream, download, run Plex (transcoding), security tools, a VPN, and the million other applications a normal user would have running at any given point in time. While no two users will have the same workload at any given time, how could we quantify SMT versus no SMT for the average user?

In the not too distant future we could be seeing the average PC running 32 cores. I am talking about your run-of-the-mill office machine from Dell that costs $800. Or will we? Is there a point where it does not matter anymore?
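One rough way to quantify the "average user" question above: sample per-logical-CPU utilisation during a normal gaming/streaming/browsing session, once with SMT on and once with it off, and compare. A sketch using psutil (the 5-minute duration and the 50% threshold are arbitrary choices):

import psutil  # third-party: pip install psutil

samples = []
for _ in range(300):                                             # ~5 minutes at 1 s per sample
    samples.append(psutil.cpu_percent(interval=1.0, percpu=True))

per_cpu_avg = [sum(col) / len(samples) for col in zip(*samples)]
busy = sum(1 for v in per_cpu_avg if v > 50)
print("average utilisation per logical CPU:", [round(v, 1) for v in per_cpu_avg])
print(f"{busy} logical CPUs were more than 50% busy on average")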
Simple. The average user's 4-core 8th-gen U-series chip already has more cores than the generation before. It has more strength, but it rarely hits 100 percent CPU utilization for the normal things you do. To get 8 threads or 4 cores working at 100 percent, you need killer applications programmed by people who know how to extract every bit of juice from the processor, know how to program multithreaded code, or use an optimized math kernel library / optimized compiler switches: FEM, rendering, applied math and science. Other than those apps, maybe you could offload the work to the GPU, as with gaming.
Or you just have multiple tabs open. I regularly hit 100% usage on my work i5-6400 with 4c/4t having 10-12 tabs open. It gets quite annoying as on a normal day I might need up to double that open at any given time. That means that 20 tabs would peg a 4c/8t CPU pretty easily.
You need an ad blocker, unless those tabs are all very busy doing something. I mean, it sounds like they're mining Monero for somebody else instead of doing what they're *supposed* to be doing for you.
I use an ad blocker and nothing is being mined. However, ads are an example of things that will destroy your performance in web browsing quite quickly and suck up a lot of CPU cycles. While right now 4c/8t is enough for an office machine, it will not be long before 6c/12t is the standard.
Wouldn't high SMT performance be an indication of bad software design rather than bad core design? While SMT performance is changing in these tests the core is not. Only the software is changing. It seems as though an Intel CPU in this comparison would have provided additional insights to these questions.
The situations that create high SMT gains are generally outside the control of the software in question. For example, a program might have one thread that's doing all divides and another that's doing all multiplies. A thread that only does multiplies or divides isn't poorly designed; it just isn't using the units on the chip that don't help its particular workload.
There are also cache effects. If you have two threads working on data bigger than the CPU's caches, then while one is waiting for data to come back from memory the other can make unrelated progress, and vice versa; but the data being big isn't necessarily an indicator of poor software design. Some problem domains just have big data sets; there's no way around that.
Exactly. Some software is written to utilize a lot of threads simultaneously, some is not. Running software that does not make use of a lot of simultaneous threads tells us really nothing much about SMT CPU hardware, imo, other than "this software doesn't support it very well."
That all depends. There could be a unit that switches between threads to dispatch instructions into the pipeline, but instructions from all the threads are simultaneously working on calculations in the pipeline. I'd call that a way to implement SMT.
Guys, I've got bad news for you. The difference between a barrel processor ("temporal multithreading") and SMT is all about the backend, not the frontend. I.e. whether the processor is superscalar or not. Otherwise there is no difference. They duplicate hardware resources and switch between them. And the frontend (a.k.a. the decoder) switches temporally between hardware threads. There are NOT multiple frontends/decoders simultaneously feeding one backend pipeline.
For example the "SMT4" Intel Xeon Phi has a design weakness where three running threads per core get decoded as if four threads were running. (And yes, just one or two running threads per core get decoded efficiently.)
Will be interesting to see if this looks different with quad-channel Threadripper or octo-channel EPYC/TR Pro CPUs, since 16 cores/32 threads with 2 channels of memory doesn't seem very compute-friendly. Though it's good to see that "SMT On" is still the reasonable default it has pretty much always been, except in very specific circumstances.
To be fair, "x86" and "security conscious" are already incompatible on anything newer than a Pentium 1/MMX. Spectre affects everything starting with the Pentium Pro, and newer processors have blackboxes in the form of Intel ME or AMD PSP. You can reduce the security risk by turning off some performance features (and get CPUs without Intel ME if you're the US government), but this is still just making an inherently insecure product slightly less insecure.
Sure, there are numerous examples of non-core silicon in x86 environments that are insecure; however, this article is about SMT, yet it avoids mentioning the ever-growing list of security problems with Intel's implementation of SMT. The Intel implementation of SMT is so badly flawed from a security perspective that the only way to secure Intel CPUs is to completely disable SMT, and that's the bottom-line recommendation of many kernel and distribution developers that have been trying to "fix" Intel CPUs for the past few years.
I use an Intel i7 in my pfSense firewall appliance, which is based on BSD.

BSD tends to remind you that you should run it with SMT disabled because of these side-channel exposure security issues.

Yet with the only workload being pfSense, and no workload under an attacker's control able to sniff data, I just don't see why SMT should be a risk there, while the extra threads for deep inspection with Suricata should help keep that deeper analysis from becoming a bottleneck on the uplink.
You need to be aware of the architectural risks that exist on the CPUs you use, but to argue that SMT should always be off is a bit too strong.
Admittedly, when you have 16 real cores to play with, disabling SMT hurts somewhat less than on an i3-7350K (2C/4T), a chip I purchased once out of curiosity, only to have it replaced by an i5-7600K (4C/4T) just before Kaby Lake left the shelves and became temptingly cheap.
It held up pretty well, actually, especially because it did go to 4.8 GHz with no more effort than giving it permission to turbo. But I'm also pretty sure the 4 real cores of the i5-7600K will let the system live longer as my daughter's gaming rig.
> to argue that SMT should always be off is a bit too strong.
Not really. If you're a kernel or distro developer, then Intel SMT "off" is the only sane recommendation you can give to end users, given the state of Intel CPUs released over the last 10 years (note: the recommendation doesn't relate to SMT in general, nor even to AMD SMT; it is only about Intel SMT).
However if end users with Intel CPUs choose to ignore the recommendation then that's their choice, as presumably they are doing so while fully informed of the risks.
This article wasn't talking about Intel's implementation, only SMT performance on Zen 3. If this were about SMT on Intel then it would make sense, otherwise no.
The start of the article is discussing the pros and cons of SMT *in general* and then discusses where SMT is used, and where it is not used, giving examples of Intel x86 CPUs. Why not then mention the SMT security concerns with Intel CPUs too? That's a rhetorical question btw, we all know the reason.
Since this is an AMD-focused article, there isn't the side-channel attack vector for SMT. So why would you mention side-channel attacks for Intel CPUs? That doesn't make any sense, since Intel CPUs are only mentioned for who uses SMT and for Intel's SMT marketing name. Hence bringing up Intel and its side-channel attack vectors would add extraneous information to the article and take away from the stated goal: "In this review, we take a look at AMD's latest Zen 3 architecture to observe the benefits of SMT."
> The Intel implementation of SMT is so badly flawed from a security perspective that
> the only way to secure Intel CPUs is to completely disable SMT
That's not true. The security problems with SMT are all related to threads from different processes sharing the same physical core. To this end, the Linux kernel now has an option not to do that, since it's the kernel which decides what threads run where. So, you can still get most of the benefits of SMT, in multithreaded apps, but without the security risks!
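For anyone curious what that looks like in practice, here is a small Linux-only Python sketch (the sysfs paths are standard, but availability depends on kernel version) that shows whether SMT is active and which logical CPUs are siblings on the same physical core, which is exactly the sharing the kernel-level options are about keeping separated across processes:

from pathlib import Path

control = Path("/sys/devices/system/cpu/smt/control")
print("SMT control:", control.read_text().strip() if control.exists() else "unknown")

for cpu_dir in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    siblings = cpu_dir / "topology" / "thread_siblings_list"
    if siblings.exists():
        print(f"{cpu_dir.name}: shares a core with -> {siblings.read_text().strip()}")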
A lot of people forget to install the AMD chipset drivers, which can result in a small loss of performance. The BIOS also needs to be kept up to date to complete the CCX-group and "best cores" support that gets advertised to Windows.
For me the main question is not whether SMT is good or bad for heavily multithreaded loads, but how good or bad it is for 2/4/6-thread loads on, for example, a 12-core Ryzen, where Windows may or may not schedule threads one per physical core, or pack them onto SMT siblings in sequence.
IIRC, Windows knows which cores are real vs virtual and which virtual core maps to which real core. It shouldn't matter whether a thread is scheduled on a real or a virtual core, though. If a thread is scheduled on a virtual core that maps to a real core that's not utilized, it still has access to the resources of the full core.
SMT doesn't come into play until you need more threads than cores.
That's not *quite* true. Some elements are statically partitioned, notably some instruction/data queues. See 20.19, Simultaneous multithreading, in https://www.agner.org/optimize/microarchitecture.p... : "The queueing of µops is equally distributed between the two threads so that each thread gets half of the maximum throughput."
This partitioning is set on boot. So, where each thread might get 128 queued micro-ops with SMT off, you only get 64 with it on. This might have little or no impact, but it depends on the code.
The article itself says: "In the case of Zen3, only three structures are still statically partitioned: the store queue, the retire queue, and the micro-op queue. This is the same as Zen2."
Honestly it looks like you provided a third viewpoint. As these are general-purpose processors, it really depends on the workload and how the code is optimized for a given target workload.
Hmmm, if you have a *very* specific workload, yes, 'it depends', but we're really talking HPC here. Pretty much nothing you do at home makes it worth rebooting-and-disabling-SMT for on an AMD Zen 3.
The confusion comes in because these are consumer processors; they are not technically HPC. The lines are being blurred as these things make $10k CPUs from 5-10 years ago look like trash in a lot of workloads.
I imagine compiler optimization these days is tuned for SMT. Perhaps this could have been discussed in the article? I wonder how much of a difference this makes to SMT on/off.
This article ignores the important viewpoint from the server side. If I have several independent, orthogonal workloads scheduled on a computer I can greatly benefit from SMT. For example if one workload is a pointer-chasing database search type of thing, and one workload is compressing videos, they are not going to contend for backend CPU resources at all, and the benefit from SMT will approach +100%, i.e. it will work perfectly and transparently. That's the way you exploit SMT in the datacenter, by scheduling orthogonal non-interfering workloads on sibling threads.
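A minimal Linux sketch of that scheduling idea, pinning two unrelated jobs onto the two SMT siblings of one physical core. The sibling IDs 0 and 16 and the two binaries are placeholders; read the real sibling pairs from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list on your machine:

import os
import subprocess

def run_pinned(cmd, logical_cpu):
    def pin():
        os.sched_setaffinity(0, {logical_cpu})    # restrict the child to one hardware thread
    return subprocess.Popen(cmd, preexec_fn=pin)

db_job    = run_pinned(["./pointer_chasing_search"], 0)   # hypothetical workloads
video_job = run_pinned(["./compress_videos"], 16)
db_job.wait()
video_job.wait()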
On the one hand, if one program does a lot of integer calculations, and the other does a lot of floating-point calculations, putting them on the same core would seem to make sense because they're using different execution resources. On the other hand, if you use two threads from the same program on the same core, then you may have less contention for that core's cache memory.
One of the key ways in which SMT helps keep the execution resources of the core active, is cache misses. When one thread is waiting 100+ clocks on reads from DRAM, the other thread can happily keep running out of cache. Of course, this is a two-edged sword. Since the second thread is consuming cache, the first thread was more likely to have a cache miss. So for the very best per-thread latency you're best off to disable SMT. For the very best throughput, you're best off to enable SMT. It all depends on your workload. SMT gives you that flexibility. A core without SMT lacks that flexibility.
Now, I only glanced through the article, but would it make more sense to use a lower-core-count CPU to see the benefits of SMT, as a higher core count might mean the scheduler uses the real cores before the SMT threads?
Starting to think the entire point of the article is that this subject is so damn complicated and hard to quantify for the average user that there is no point in trying, unless you are running workloads in a lab environment to find the best possible outcome for the work you plan on doing. Who is going to bother doing that unless you work for a company where it makes sense to do so?
I wonder what the charts would look like with a dual-core SMT processor. I think the game tests would have a good chance of changing. I'm a little surprised the author didn't think to test CPUs with more limited resources.
Too bad you could not include more real-life tests. I think that a processor with far fewer threads (like a 5600X) is easier to saturate with real-life software that actually uses threads. The ratio of cache size and memory-channel bandwidth to cores is quite different for those 'smaller' processors. That will probably result in different benchmark results... So it would be interesting to see what those processors do, SMT on versus SMT off. I don't think the end result would be different, but it could be an even bigger victory for SMT on.

Another interesting area is virtualization. As already mentioned in other comments, it is very important that the operating system assigns threads to the right core or SMT-sibling combinations; that is even more important in virtualization scenarios...
Determining the usefulness of SMT with 16 cores on tap is not quite as relevant as when this experiment is done with, say, a 5600X or 5800X... naturally, 16 cores without SMT might still be plenty (as even 8 non-SMT cores on the 9700K proved).
This would be a better test on a CPU that doesn't have 16 base cores. If you could try it on a 4C/8T part I think the difference would be more pronounced.
The basic benefit of SMT is that it allows the processor to hide the impact of long-latency instructions on average IPC, since it can switch to another thread and execute its instructions. In this way it is similar to OoO execution (which leverages speculative execution to do the same) and is also more flexible than fine-grained multithreading. There is an overhead and cost (area/power) due to the duplicated structures in the core that will impact the perf/watt of purely single-threaded workloads, and I don't think disabling SMT removes all of this impact...
Perhaps not. But at the same time, it is likely that any non-SMT chip that has an SMT variant actually *is* an SMT chip with SMT disabled in firmware, either because it is broken on that chip or because the non-SMT variant sold better.
It's hard to imagine a transistor defect that would break *only* SMT. As you say all non-SMT chips are really SMT chips internally and the decision to disable SMT doesn't really result in huge chunks of transistors going dark (the potential target area for physical defects).
I'd say most of the SMT vs. no-SMT decisions on individual CPUs are binning-related: SMT can create significantly more heat because there is less idle time in which the chip can cool. So if you have a chip with higher resistance in critical vias that requires higher voltage to function, you need to sacrifice clocks, TDP or utilization (or some permutation of those).
Why are people still testing SMT in 2020? Cache coherency and hierarchy design are mature enough to offset the possible instruction bottleneck issues. I don't even know the purpose of this article at all... Anyway, perhaps falling back to 2008? Come on...
Well, instead of testing the concept of SMT, which has been around for a while, perhaps one could think of it as testing the implementation of SMT found on the chips we can get in 2020.
Thanks Ian! I always thought of SMT as a way of using whatever compute capacity a core has, but isn't being used in the moment. Hence it's efficient if many tasks need doing that each don't take a full core most of the time. However, that hits a snag if the cores get really busy. Hence (for desktop or laptop), 6 or 8 real cores are usually better than 4 cores that pretend to be 8.
I found the "Is SMT a good thing?" discussion (and the later discussion of the same topics) strange, because it seemed to take the point of view of someone who wants to optimize some efficiency or utilization metric and who can choose the amount of resources in the core. If you are in that situation, then the take of the EV8 designers was: we build a wide machine so that single-threaded applications can run fast, even though we know the wideness leads to low utilization; we also add SMT so that multi-threaded applications can increase utilization. Now, 20 years later, such wide cores are becoming reality, although interestingly Apple and Arm do not add SMT.
Anyway, buyers and users of off-the-shelf CPUs are not in that situation, and for them the questions are: For buyers: How much benefit does the SMT capability provide, and is it worth the extra money? For users: Does disabling SMT on this SMT-capable CPU increase the performance or the efficiency?
The article shows that the answers to these questions depend on the application (although for the Zen3 CPUs available now the buyer's question does not pose itself).
It would be interesting to see whether the wider Zen3 design gives significantly better SMT performance than Zen or Zen2 (and maybe also a comparison with Intel), but that would require also testing these CPUs.
I did not find it surprising that the 5950X runs into the power limit with and without SMT. The resulting clock rates are mentioned in the text, but might be more interesting graphically than the temperature. What might also be interesting is the power consumed at the same clock frequency (maybe with fewer active cores and/or the clock locked at some lower clock rate).
If SMT is so efficient (+91%) for 3DPMavx, why do the graphs only show a small difference?
Anand, while I value your in-depth articles, you guys really need to drop the 95th-percentile frame times and get on board with 1% and 0.1% lows. What disrupts gaming the most is the hiccups, not a statistically smooth chart. SMT/HT affects these the most, especially in heavily single-threaded games. If you aren't testing what it influences, why test it at all? YouTube reviews are also having problems with tests that don't reflect real-world scenarios. Sometimes it's a lot more disagreeable than others.
Completely invalid testing methodology at this point.
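For reference, a rough sketch of how those 1% / 0.1% lows can be computed from logged frame times (assuming a plain text export with one frame time in milliseconds per line; the "average FPS of the slowest 1% / 0.1% of frames" convention used here is one common definition, and tools differ slightly):

import numpy as np

frame_times_ms = np.loadtxt("frametimes.csv")               # assumed: one frame time (ms) per line
fps_worst_first = 1000.0 / np.sort(frame_times_ms)[::-1]    # slowest frames first

def low(fps_sorted, fraction):
    n = max(1, int(len(fps_sorted) * fraction))
    return fps_sorted[:n].mean()                            # average FPS over the worst slice

print(f"avg FPS : {1000.0 / frame_times_ms.mean():.1f}")
print(f"1% low  : {low(fps_worst_first, 0.01):.1f}")
print(f"0.1% low: {low(fps_worst_first, 0.001):.1f}")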
My advice, based on my own testing: turn off SMT/HT except in scenarios where you become CPU-bound across all cores, not just one. This improved 0.1% and 1% frame times, i.e. stutters. Turn it on once you reach 90%+ utilization, as it helps a lot when your CPU is maxed out. Generally speaking, anything with fewer than 6 (and soon 8) cores should always have it on.
You didn't even test where this helps the most, and that's low-end CPUs versus high-end CPUs, where you find the Windows scheduler messes things up.
Also, if you're testing this on your own, always turn it off in the BIOS. If you use something like Process Lasso or manually change affinity, Windows will still put protected services and processes onto those extra virtual cores, causing contention issues that lead to the stuttering.
The most obvious games that benefit from SMT/HT off are heavily single-threaded games, such as MOBAs.
"Most modern processors, when in SMT-enabled mode, if they are running a single instruction stream, will operate as if in SMT-off mode and have full access to resources."
Which would have access to the whole microinstruction cache (L0I) in SMT mode?
Sun's UltraSPARC T1 has in-order cores that run several threads in alternation on the functional units. This is probably the closest thing to SMT that makes sense on an in-order core. Combining this with SMT proper makes no sense; if you can execute instructions from different threads in the same cycle, there is no need for an additional mechanism for processing them in alternate cycles. Instruction fetch on some SMT cores does process threads in alternate cycles, though.
The AMD Bulldozer and family have pairs of cores that share more than cores in other designs share (but less than with SMT): They share the I-cache, front end and FPU. As a result, running code on both cores of a pair is often not as fast as when running it on two cores of different pairs. You can combine this scheme with SMT, but given that it was not such a shining success, I doubt anybody is going to do it.
Looking at roughly contemporary CPUs (Athlon X4 845, 3.5 GHz Excavator, and Core i7-6700K, 4.2 GHz Skylake), when running the same application twice, one after the other on the same core/thread, versus running it on two cores of the same pair or two threads of the same core: using two cores was faster by a factor of 1.65 on the Excavator (so IMO calling them cores is justified), and using two threads was faster by a factor of 1.11 on the Skylake. But Skylake with two threads was faster than Excavator with two cores by a factor of 1.28, and by a factor of 1.9 when running only a single core/thread, so even on multi-threaded workloads a 4c/8t Skylake can beat an 8c Excavator (though AFAIK Excavators were not built in 8c configurations). The benchmark was running LaTeX.
AMD's design was very inefficient in large part because the company didn't invest much into improving it. The decision was made, for instance, to stall high-performance development at Piledriver in favor of a very, very long wait for Zen. Excavator was made on a low-quality process and was designed to be cheap to make.
Comparing a 2011/2012 design that was bad when it came out with Skylake is a bit of a stretch, in terms of what the basic architectural philosophy is capable of.
I couldn't remember that fourth type (the first being standard multi-die CPU multiprocessing) so thanks for mentioning it (Sun's).
Many congratulations to Dr. Ian Cutress for the excellent analysis carried out.
If possible, it would be extremely interesting to repeat a similar rigorous analysis (at least on the multi-threaded subsection of chosen benchmarks) on the following platforms:
- 5900X (Zen 3, but fewer cores per chiplet, maybe with more thermal headroom)
- 5800X (Zen 3, only a single compute chiplet, so no inter-CCX latency troubles)
- 3950X (same cores and configuration, but with Zen 2, to check if the new, beefier core improved SMT support)
- 2950X (Threadripper 2, same number of cores but Zen+, with 4 memory channels; useful especially for tests such as AIBench, which got worse with SMT)
- 3960X (Threadripper 3, more cores, but Zen 2 and with 4 memory channels)
Obviously, it would be interesting to check Intel HyperThreading impact on recent Comet Lake, Tiger Lake and Cascade Lake-X.
For the time being, Apple has decided not to use any form of SMT on its own CPUs, so it is useful to fully understand the usefulness of SMT technologies for notebooks, high-end PCs and prosumer platforms.
Thanks Ian! With some of your comments about memory access limiting performance in some cases, how much additional performance would a quad-channel memory setup give compared to the dual-channel consumer setups (like these, or mine)? Now, I know that servers and actual workstations usually have 4 or more memory channels, and for good reason. So, in the age of 12- and 16-core CPUs, is it time for quad-channel memory for the rest of us, or would that break the bank?
That's a good question. As time moves on and we keep getting more cores, with people doing more things that make use of them (such as gaming and streaming at the same time, with browser tabs open, live chat, perhaps an encode too), perhaps the plethora of cores does indeed need better memory bandwidth and parallelism, but maybe the end user would not yet tolerate the cost.
Something I noticed about certain dual-socket S2011 motherboards on AliExpress is that they don't have as many memory channels as they claim, which with two CPUs does hurt performance in even consumer-grade tasks such as video encoding:
Thanks for these interesting tests! Perhaps SMT is something that could drastically improve performance on more budget-oriented CPUs? Your CPU has more than enough shiny cores for these games, but what if you take a Ryzen 3 3100? I believe the percentages would be different, as they were in my real-world case :) Back then I had a 6600K @ 4500 and in some FPS games with huge maps and a lot of players (Heroes & Generals; PlanetSide 2) I started to get stutters in tight fights, but when I switched to a 6700 @ 4500 that wasn't the case anymore. So I do believe Hyper-Threading worked in my case, because my CPUs were identical aside from the virtual threads on the latter.

Would be super interesting to have this post updated with results from a cheaper sample 😇
The slide states that Zen 3 decodes 4 instructions/cycle. Are there two independent decoders which each decode those 4 instruction for a thread? Or is there a single decoder that switches between the program counters of both threads but only decodes instructions of one thread per cycle?
Right, I found that article as well and from that slide it looks like the decoder would be shared. But then that slide was from 2017, so that might have changed.
It looks though as if the decoder could decode those 4 instructions from a single program counter only, right? It's not like the decoder could decode e.g. 2 instructions from program counter 1 and another 2 instructions from program counter 2?
I'm not too sure how the implementation works, but I expect they're shuffling both threads through the decoder at roughly the same time. The decoder has four units (I think 1 complex and 3 simple). As far as I'm aware, that has stayed the same in both Zen 2 and 3.
Ian, a question about Handbrake, though it may not apply to the type of test you used. I've read that Handbrake doing an h264 encode can only use 16 threads max. Does this mean that in theory one could run two separate h264 encodes on a 5950X and thus obtain a good overall throughput speedup? Have you tried such a thing? Or might this only work if it were possible to force one encode to only use the 16 threads of one 8c block (CCX?), and the other encode to use the rest? ie. so that the separate encodes are not fighting over the same cores or indeed the same CCX-shared L3? Is it possible to force this somehow?

Also, if the claimed 16 thread limit for h264 is true, is there a performance difference for a single h264 encode between SMT on vs. off just in general? ie. with it on, is the OS smart enough to ensure that the 16 threads are spread across all the cores evenly rather than being scrunched onto fewer cores because reasons? If not, then turning SMT off might speed it up. Note that I'm using Windows for all this.
I don't know if any of this applies to h265, but atm the encoding I do is still 1080p. I did an analysis of all available Ryzen CPUs based on performance, power consumption and cost (I ruled out Intel partly due to the latter two factors but also because of a poor platform upgrade path) and found that although the 5900X scored well, overall it was beaten by the 2700X, mainly because the latter is so much cheaper. However, the 5950X would look a lot better if one could run two encodes on it at the same time without clashing, but review articles naturally never try this. I wish I could test it, but the only 16c system I have is a dual-socket S2011 setup with two 2560 v2s, so the separate CPUs introduce all sorts of other issues (NUMA and suchlike).
I found something similar a long time ago when I noticed one could run six separate Maya frame renders on a 24-CPU SGI rack Onyx (essentially one render per CPU board), compared to running a single render on a quad-CPU (single board) deskside Onyx, giving a good overall throughput increase (the renderer being limited to 4 CPUs per job). See:
Funny actually, re what you say about an overly good speedup perhaps implying a less-than-optimal core design. Something odd about SGIs is how often, on a multi-CPU system, one can obtain better results by using more threads than there are CPUs, bearing in mind MIPS CPUs from that era did not have SMT; i.e. the CPUs kind of behave as if they have SMT even though they don't. I found this behaviour occurred most with Blender and C-Ray.
So anyway, it would be great if it were possible to run two h264 encodes on a 5950X at the same time, but there's probably no point if the OS doesn't spread out the loads in a sensible manner, or if in that circumstance there isn't a way to force each encode to use a separate CCX.
All very specific to my use case of course, but I have hundreds of hours of material to convert, so the ability to get twice the throughput from a 5950X would make that CPU a lot more interesting; so far reviews I've read show it to be about 2x faster than the 2700X for h264 Handbrake (just one encode of course), but it costs 4.4x more, rather ruining the price/performance angle. And if it does work then I guess one could ask the same question of TR - could one run eight separate h264 encodes on a future Zen3 TR without the thread management being a total mess? :D I'm assuming it probably wouldn't be so good with the older Zen2 design given the split L3.
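Not a tested recipe, just a sketch of the experiment being described: launch two encodes and pin each one to a different half of the chip (e.g. one CCD of a 5950X each). psutil's cpu_affinity works on both Windows and Linux; the logical-CPU ranges, file names and encoder settings are placeholders, and the sibling numbering differs between OSes:

import subprocess
import psutil  # third-party: pip install psutil

def launch(cmd, cpus):
    proc = subprocess.Popen(cmd)
    psutil.Process(proc.pid).cpu_affinity(cpus)   # restrict this encode to the given logical CPUs
    return proc

job1 = launch(["ffmpeg", "-i", "in1.mkv", "-c:v", "libx264", "-preset", "slow", "out1.mkv"],
              list(range(0, 16)))     # assumed: first CCD and its SMT siblings
job2 = launch(["ffmpeg", "-i", "in2.mkv", "-c:v", "libx264", "-preset", "slow", "out2.mkv"],
              list(range(16, 32)))    # assumed: second CCD
job1.wait()
job2.wait()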
Interesting question. Would be nice if someone could give this a test on 16-core Ryzen or TR, and see what happens. Yesterday, I was able to take both FFmpeg and Handbrake up to 128 threads, and it does work; but, having only a 4-core, 4-thread CPU, can't comment.*
As for x264's performance limit, I'm not sure at what number of threads it begins to flag; but, quality wise, using too many (say, over 16 at 1080p) is not advisable. According to the x264 developers, vertical resolution / threads shouldn't fall below 40-50 and certainly not below 30.
* As far as I know, Windows schedules threads all right. From Windows 10 1903 on, on Zen 2, one CCX is supposed to be filled up, then another. I imagine 16 threads will be spread across two CCXs in the 5950X. FFmpeg's -threads switch could prove useful too.
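Quick arithmetic on that x264 guideline (rows of vertical resolution per thread should stay above roughly 40-50 and never drop below 30), purely illustrative:

HEIGHT = 1080   # 1080p example

for threads in (8, 16, 24, 32, 36):
    rows = HEIGHT / threads
    verdict = "fine" if rows >= 45 else ("marginal" if rows >= 30 else "too many threads")
    print(f"{threads:>2} threads -> {rows:5.1f} rows/thread ({verdict})")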
I wonder when SMT4 will hit the market: a model with 3 copies of most things on the die, in a ring configuration (fp/int/fp/int, with cache inside the ring), where a single thread would have a chance to use 2 FP modules for one integer part (when the others aren't using them, of course). This kind of setup would have very interesting performance numbers, at least. I am not saying it's a good idea, but it's an interesting one for sure.
This article omits one of the basic considerations in any manually configured and custom-cooled desktop system: achieving uniform, predictable thermal behavior. Unless you are building servers to perform only one or two specific types of mathematical operations, and can build, configure, and stress test on those instruction types alone, you need high confidence that the chip will never exceed the thermal flux densities of the cooling system you built. Fixed-clock systems with a static number of available cores have much more consistent thermal behavior than chips whose clocks, and number of threads, are free-floating. This reduces your peak FLOPS, but it significantly extends system lifetime. HEDT and HPC systems have double- or triple-digit core counts per socket in 2020; SMT is not worth paying the price of reduced hardware lifetime unless you are building extremely specialized calculation servers.
> When SMT is enabled, depending on the processor, it will allow two, four,
> or eight threads to run on that core
Intel's HD graphics GPUs win the oddball award for supporting 7 threads per EU, at least up through Gen 11, I think.
IIRC, AMD supports 12 threads per CU, on GCN. I don't happen to know how many "warps" Nvidia simultaneously executes per SM, in any of their generations.
Thanks for looking at this, although I was disappointed in the testing methodology. You should be separately measuring how the benchmarks respond to simply having more threads, without introducing the additional variable of SMT on/off. One way to do this would be to disable half of the cores (often an option you see in BIOS) and disable SMT. Then separately re-test with SMT on, and then with SMT off but all cores on. This way, we could compare SMT on/off with the same number of threads. Ideally, you'd also do this on a single-die/single-CCX CPU, to ensure no asymmetry in which cores were disabled.
Even better would be to disable any turbo, so we could just see the pipeline behavior. Although controlling for more variables poses a tradeoff between shedding more insight into the ALU behavior and making the test less relevant to real-world usage.
The reason to hold the number of threads constant is that software performance doesn't scale linearly with the number of threads. Due to load-balancing issues and communication overhead (e.g. lock contention), performance of properly designed software always scales sub-linearly with the number of threads. So, by keeping the number of threads constant, you'd eliminate that variable.
Of course, in real-world usage, users would be deciding between the two options you tested (SMT on/off; always using all cores). So, that was most relevant to the decision they face. It's just that you're limited in your insights into the results, if you don't separately analyze the thread-scaling of the benchmarks.
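To put a rough number on the sub-linear scaling mentioned above, a tiny Amdahl's-law illustration (p is an assumed parallel fraction; real workloads also add the communication and scheduling overheads discussed here):

def amdahl_speedup(threads, p=0.95):
    # 1 / (serial fraction + parallel fraction spread over n threads)
    return 1.0 / ((1.0 - p) + p / threads)

for n in (2, 4, 8, 16, 32):
    print(f"{n:>2} threads -> {amdahl_speedup(n):4.1f}x speedup")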
Oops, I also intended to mention OS scheduling as another source of overhead when running more threads. We tend not to think of the additional work that more threads create for the OS, but each thread the kernel has to manage and schedule has a nonzero cost.
As for the article portion, I also thought too little consideration was given to the relative amounts of ILP in different code. Something like a zip-file compressor should have relatively little ILP, since each symbol in the output tends to have a variable length in the input, meaning decoding of the next symbol can't really start until the current one is mostly done. Text parsing and software compilation also tend to fall into this category.
So, I was disappointed not to see some specific cases of low-ILP (but high-TLP) highlighted, such as software compilation benchmarks. This is also a very relevant use case for many of us. I spend hours per week compiling software, yet I don't play video games or do 3D photo reconstruction.
A final suggestion for any further articles on the subject: rather than speculate about why certain benchmarks are greatly helped or hurt by SMT, use tools that can tell you!! To this end, Intel has long provided VTune and AMD has a tool called μProf.
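In the same spirit, even without VTune or uProf, a first-order look on Linux is just wrapping the benchmark in perf stat and comparing IPC with SMT on vs. off. A sketch (the benchmark command is a placeholder; the event names are standard perf events):

import subprocess

cmd = ["perf", "stat", "-e", "cycles,instructions,cache-misses", "--", "./my_benchmark"]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stderr)   # perf stat prints its counter summary to stderr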
> We’ve known for many years that having two threads per core is not the same as having two coresTrue, and I still read this as an argument against SMT in forums. IMO it should be pointed out clearly that the cost of implementing either also differs drastically: +100% core size for another core and ~5% for SMT.
WaltC - Thursday, December 3, 2020 - link
Intel began its HT journey in order to pull more efficiency from each core--basically, as performance was being left on the table. Interestingly enough, after Athlon and A64, AMD roundly criticized Intel because the SMT thread was not done by a "real core"...and then proceeded to drop cores with two integer units--which AMD then labeled as "cores"...;) Intel's HT approach proved superior, obviously. IIRC. It's been awhile so the memories are vague...;) The only problem with this article is that it tries to make calls about SMT hardware design without really looking hard at the software, and the case for SMT is a case for SMT software. Games will not use more than 4-8 threads simultaneously so of course there is little difference between SMT on and off when running most games on a 5950. You would likely see near the same results on a 5600 in terms of gaming. SMT on or off when running these games leaves most of the CPU's resources untouched. Programs designed and written to utilize a lot of threads, however, show a robust, healthy scaling with SMT on versus no SMT. So--without a doubt--SMT CPU design is superior to no SMT from the standpoint of the hardware's performance. The outlier is the software--not the hardware. And of course the hardware should never, ever be judged strictly by the software one arbitrarily decides to run on it. We learn a lot more about the limits of the software tested here than we learn about SMT--which is a solid performance design in CPU hardware.WarlockOfOz - Friday, December 4, 2020 - link
Very valid point about how games won't see a difference between 16 and 32 threads when they only use 6. Do you know if this type of analysis has been done at the lower end of the market?WaltC - Friday, December 4, 2020 - link
It's been common knowledge established a few years ago when AMD started pushing 8 core (and greater) CPUs that games don't require that many cores and that 6 cores is optimal for gaming right now. And if you do more than game, occasionally, and need more than 6 threads then SMT is there for you. As the new consoles are 8-core CPU designs, over time the number of cores required for optimal game performance will increase.

Flying Aardvark - Friday, December 4, 2020 - link
Consoles are 8-core now, with 2 reserved for the OS. Count on 6 cores being optimal for gaming for quite some time.

Kangal - Friday, December 4, 2020 - link
Those Jaguar cores were more like a 4c/8t processor, to be fair. And they weren't that much better than Intel's Atom cores, a far cry from Intel's Core-i SkyLake architecture. And current gen consoles were very light on the OS, so maybe using 1 full core (or 2 shared threads), leaving only 3 cores for games, but much better than the 2-core optimised games from the PS3/360 era.

The new gen consoles will be somewhat similar, using only 1 full core (2 threads) reserved for the OS. But this time we have an architecture that's on par with Intel's Core-i SkyLake, with a modern full 8-core processor (SMT/HT optional). This time leaving a healthy 7 cores dedicated to games. Optimisations should come sooner than later, and we'll see the effects on PC ports by 2022. So we should see a widening gap between 4 vs 6 cores, and to a lesser extent 6 vs 8 cores, in the future. I wouldn't future-proof my rig by going for a 5700x instead of a 5600x, I would do that for the next round (ie 2022 Zen4).
AntonErtl - Sunday, December 6, 2020 - link
The 8 Jaguar cores are in no way like 4c/8t CPUs; if you use only half of them, you get half the performance (unless your application is memory/L2-bandwidth-limited). Their predecessor Bobcat is about twice as fast as a Bonnell core (Atom proper), and a little slower than Silvermont (the core that replaced Bonnell), about half as fast as Goldmont+ (all at the clock rates at which they were available in fanless mini-ITX boards), one third as fast as a 3.5GHz Excavator core, and one sixth as fast as a 4.2GHz Skylake.

Oxford Guy - Sunday, December 6, 2020 - link
Worse IPC than Bulldozer as far as I know. Certainly worse than Piledriver. Really sad.

The "consoles" should have used something better than Jaguar. It's bad enough that the "consoles" are a parasitic drain on PC gaming in the first place. It's worse when they not only drain life with their superfluous walled gardens but also by foisting such a low-grade CPU onto the art.
Kangal - Thursday, December 24, 2020 - link
The Jaguar cores share a lot of DNA with Bulldozer, but they aren't the same. It's like Intel's Atom chips compared to Intel Core-i chips. With that said, 2015 Puma+ was a slight improvement over 2013 Jaguar, which was a modest improvement over the initial 2011 Bobcat lineup. All this started in 2006 with AMD choosing to evolve their earlier Phenom2 cores, which are derivatives of the AMD Athlon-64.

So just by their history, we can see they're in line with Intel's Atom architecture evolution, and basically a direct competitor. Where Intel had slightly less performance, but had much lower power draw... making them the obvious winner. Leaving AMD to fill in the budget segments of the market.
As for the core arrangement, they don't have full proper cores as people expect them. Like the Bulldozer architecture, each core had to share resources like the decoder and floating-point unit. So in many instances, one core would have to wait for the other core. This boosts multithreaded performance with simple calculations in orderly patterns. However, with more complex calculations and erratic/dynamic patterns (ie regular PC use), it causes a hit to the single-thread performance and notable hiccups. So my statement was true. This is more like a 4c/8t chip, and it is less like a Core-i and much more like an Atom. But don't take my word for it, take Dr Ian Cutress's. He said the same thing during the deep dive into the Jaguar microarchitecture, and recently in the Chuwi Aerobox (Xbox One S) article.
https://www.anandtech.com/show/16336/installing-wi...
Now, there have been huge benefits to the Gaming PC industry, and game ports, due to the PS4/XB1. The first being the x86-64bit direct compatibility. Second was the cross-compatibility thanks to Vulkan and DirectX (moreso with PS4 Pro and XB1X). The third being that it forced game developers to innovate their game engines, so that they're less narrow and more multi-threaded. With PS5/XseX we now see a second huge push with this philosophy, and the improvements of fast single-thread performance and fast-flash storage access. So I think while we have legitimate reasons to groan about the architecture (especially in the PS4) upon release, we do have to recognize the conveniences that they also brought (especially in the XB1X). This is just to show that my stance wasn't about console bashing.
at_clucks - Monday, December 7, 2020 - link
@Kangal, Jaguar APUs in consoles are definitely not "like a 4c/8t processor" because they don't use CMT. They are full 8 cores. Their IPC may be comparable with some newer Atoms, although it's hard to benchmark how the later "Evolved Jaguar" cores in the mid-generation console refresh compare against the regular Jaguar or Atom.

Bomiman - Saturday, December 5, 2020 - link
That common knowledge is a few years old now. It was once common knowledge that games only used one thread.

Consoles now have 3 times as many threads as before, and that's in a situation where 4t CPUs are barely usable and 4c/8t CPUs are obsolete.
MrPotatoeHead - Tuesday, December 15, 2020 - link
Xbox 360 came out in 2005. 3C/6T. Even the PS3 had a 1C/2T PowerPC PPE and 6 SPEs, so a total of 8T. PS4/XO is 8C/8T. Though I guess we could still blame the lack of CPU utilization this last generation on it using pretty weak cores from the get-go. IIRC 8-core Jaguar would be on par with an Intel i3 at the time of these console releases.

Though, the only other option AMD had was Piledriver. Piledriver was still a poor performer and a power hog, and it would likely only have been worth it over 8 Jaguar cores if they went with a 3 or 4 module chip.
It is nice that this generation MS and Sony both went all out on the CPU. Just too bad they aren't Zen 3 based. :(
Dolda2000 - Friday, December 4, 2020 - link
It should be kept in mind that, at the time when AMD criticized Intel for that, that was when AMD had actual dual-cores (A64x2) and Intel still had single-cores with HT, which makes the criticism rather fair.

Xajel - Sunday, December 6, 2020 - link
"Intel's HT approach proved superior".Intel's approach wasn't that much superior. In fact, in the early days of Intel's HTT processors, many Applications, even ones which supposed to be optimised for MC code path was getting lower scores with HTT enabled than when HTT was disabled.
The main culprit was that Applications were designed to handle each thread in a real core, not two threads in a single core, the threads were fighting for resources that weren't there.
Intel knew this and worked hard with developers to make them know the difference and apply this change to the code path. This actually took sometime till Multi-Core applications were SMT aware and had a code path for this.
For AMD's case, AMD's couldn't work hard enough like Intel with developers to make them have a new code path just for AMD CPU's. Not to mention that intel was playing it dirty starting with their famous compiler which was -and still- used by most developers to compile applications, the compilers will optimise the code for intel's CPU's and have an optimised code path for every CPU and CPU feature intel have, but when the application detect a non-Intel CPU, including AMD's it will select the slowest code path, and will not try to test the feature and choose a code path.
This applied also to AMD's CPU's, while sure the CPU's lacked FPU performance, and was not competitive enough (even when the software was optimised), but the whole optimisation thing made AMD's CPU inefficient, the idea should work better than Intel, because there's an actual real hardware there (at least for Integer), but developers didn't work harder, and the intel compiler played a major role for smaller developers also.
TL'DR, the main issue was the intel compiler and lack of developers interest, then the actual cores were also not that much stronger than intel's (IPC side), AMD's idea should have worked, but things weren't in their side.
And by the time AMD came with their design, they were already late, applications were already optimised for Intel HTT which became very good as almost all applications became SMT aware. AMD acknowledged this and knew that they must take what developers already have and work on it, they also worked hard on their SMT implementation that it is touted now that their SMT is better intel's own SMT implementation (HTT).
Keljian - Sunday, January 10, 2021 - link
Urm no, Intel's compiler isn't used often these days unless you're doing really heavy maths. Microsoft's compiler is used much more often, though clang is taking off.

pogsnet - Tuesday, December 29, 2020 - link
During the P4 era, HT gave no difference in performance compared to AMD64, but on Core2Duo it showed better performance. Probably because we had only 2-4 cores, not enough for our multitasking needs. Now we have 4-32 cores, plus much more powerful and efficient cores; hence SMT may not be that significant any more, which is why on most tests it shows no big performance lift.

willis936 - Thursday, December 3, 2020 - link
5%? I think more than 5% is needed for a whole second set of registers plus the logic needed to properly handle context switching. Everything in between the cache and pipeline needs to be doubled.

tygrus - Thursday, December 3, 2020 - link
Register renaming means they already have more registers that don't need to be copied. With register renaming they have more physical registers than the logical registers exposed to the programmer. Say you have 16 logical registers exposed to the coder per thread and 128 rename registers in HW; with SMT at 2 threads/core you get the same 16 logical registers, but each thread has 64 rename registers instead of 128.

Compare mixing the workloads, eg. 8 int/branch-heavy with 8 FP-heavy threads on 8 cores, or OS background tasks like indexing/search/antivirus.
MrSpadge - Thursday, December 3, 2020 - link
The 5% is from Intel for the original Pentium 4. At some point in the last 10 years I think I read a comparable number, probably here on AT, regarding a more modern chip.

Wilco1 - Friday, December 4, 2020 - link
There is little accurate info about it, but the fact is that x86 cores are many times larger than Arm cores with similar performance, so it must be a lot more than 5%. Graviton 2 gives 75-80% of the performance of the fastest Rome at less than a third of the area (and half the power).

MrSpadge - Friday, December 4, 2020 - link
That doesn't mean SMT is mainly responsible for that. The x86 decoders are a lot more complex. And at the top end you get diminishing performance returns for additional die area.

Wilco1 - Friday, December 4, 2020 - link
I didn't say all the difference comes from SMT, but it can't be the x86 decoders either. A Zen 2 without L2 is ~2.9 times the size of a Neoverse N1 core in 7nm. That's a huge factor. So 2 N1 cores are smaller and significantly faster than 1 SMT2 Zen 2 core. Not exactly an advertisement for SMT, is it?

Dolda2000 - Friday, December 4, 2020 - link
> Graviton 2 gives 75-80% of the performance of the fastest Rome at less than a third of the area

To be honest, it wouldn't surprise me one bit if 90% of the area gives 10% of the performance. Wringing out that extra 1% single-threaded performance here or there is the name of the game nowadays.
Also there are many other differences that probably cost a fair bit of silicon, like wider vector units (NEON is still 128-bit, and exceedingly few ARM cores implement SVE yet).
Wilco1 - Saturday, December 5, 2020 - link
It's nowhere near as bad, with Arm designs showing large gains every year. The next-generation Neoverse has 40+% higher IPC.

Yes, Arm designs opt for smaller SIMD units. It's all about getting the most efficient use out of transistors. Having huge 512-bit SIMD units adds a lot of area, power and complexity with little performance gain in typical code. That's why 512-bit SVE is used in HPC and nowhere else.
So with Arm you get many different designs that target specific markets. That's more efficient than one big complex design that needs to address every market and isn't optimal in any.
whatthe123 - Saturday, December 5, 2020 - link
The extra 20% of performance is difficult to achieve. You can already see it on Zen CPUs, where 16-core designs running at around 3.4GHz are dramatically more efficient per core in multithread than 8-core designs running at 4.8GHz. I've always hated these comparisons with ARM for this reason... you need a part with 1:1 watt parity to make a fair comparison, otherwise 80% performance at half the power can also be accomplished even on x86 by just reducing frequency and upping core count.

Wilco1 - Saturday, December 5, 2020 - link
Graviton clocks low to conserve power, and still gets close to Rome. You can easily clock it higher - Ampere Altra does clock the same N1 core 32% higher. So that 20-25% gap is already gone. We also know about the next generation (Neoverse N2 and V1), which have 40+% higher IPC.

Yes, adding more cores and clocking a bit lower is more efficient. But that's only feasible when your core is small! Altra Max has 128 cores on a single die, and I don't think we'll see AMD getting anywhere near that in the next few years even with chiplets.
peevee - Monday, December 7, 2020 - link
It is obviously a lot LESS than 5%. Nothing that matters in terms of transistors (caches and vector units) increases. Even doubling of registers would add a few hundreds/thousands of transistors on a chip with tens of billions of transistors, less than 0.000001%.

They can double all scalar units and it still would be below 1% increase.
Kangal - Friday, December 4, 2020 - link
I agree.

Adding SMT/HT requires something like a +10% increase in the silicon budget and a +5% increase in power draw, but increases performance by +30%, generally speaking. So it's worth the trade-off for daily tasks, and for those on a budget.
What I was curious to see is: if you disabled SMT on the 5950X, which has lots of cores, leaving each thread with slightly more resources, and used the extra thermal headroom to overclock the processor, how would that affect games?
My hunch?
Thread-happy games like Ashes of the Singularity would perform worse, since that game is optimised and can take advantage of SMT. Unoptimized games like Fallout 76 should see an increase in performance. Whereas in properly optimised games like Metro Exodus, results should be roughly equal between the overclock and SMT.
Dolda2000 - Friday, December 4, 2020 - link
> What I was curious to see, is if you disabled SMT on the 5950X, which has lots of cores.

That is exactly what he did in this article, though.
Kangal - Saturday, December 5, 2020 - link
I guess you didn't understand my point.

Think of a modern game which is well optimised and is both GPU intensive and CPU intensive, such as Far Cry 5 or Metro Exodus. These games scale well anywhere from 4 physical cores to 8 physical cores.
So using the 5950X with its 16-physical-cores, you really don't need extra threads. In fact, it's possible to see a performance uplift without SMT, dropping it from 32-shared-threads down to 16-full-threads, as each core gets better utilisation. Now add to that some overclocking (+0.2GHz ?) due to the extra thermal headroom, and you may legitimately get more performance from these titles. Though I suspect they wouldn't see any substantial increases or decreases in frame rates.
In horribly optimised games, like Fallout 76, Mafia 3, or even AC Odyssey, anything could happen (though probably they would see some increases). Whereas we already know that in games that aren't GPU intensive, but CPU intensive (eg practically all RTS games), these were designed to scale up much much better. So even with the full-cores and overclock, we know these games will actually show a decrease in performance from losing those extra threads/SMT.
warpuck - Friday, December 25, 2020 - link
With an R5 1600 it makes about a 5-6% difference in usable clock speed (200-250 MHz), and also in temperature. With an R7 3800X it is not as noticeable, if you reduce the background operations while gaming with either CPU.

I don't know about recent game releases, but older ones only use 2-4 cores (threads), so clocking the R5 1600 at 3750 MHz (SMT on) vs 3975 MHz (SMT off) does make a difference in frame rates.
whatthe123 - Saturday, December 5, 2020 - link
It doesn't make much of a difference unless you go way past the TDP and have exotic cooling.

These CPUs are already boosting close to their limits at stock settings to maintain high gaming performance.
29a - Saturday, December 5, 2020 - link
There are a lot of different scenarios that would be interesting to see. I would like to see some testing with a dual-core 2c/4t chip.

Netmsm - Thursday, December 3, 2020 - link
good point

Wilco1 - Friday, December 4, 2020 - link
I think that 5% area cost for SMT is marketing. If you only count the logic that is essential for SMT, then it might be 5%. However, many resources need to be increased or doubled. Even if that helps single-threaded performance, it still adds a lot of area that you wouldn't need without SMT.

Graviton 2 proves that 2 small non-SMT cores will beat one big SMT core on multithreaded workloads using a fraction of the silicon and power.
peevee - Monday, December 7, 2020 - link
Except they are not faster, but whatever.

RickITA - Thursday, December 3, 2020 - link
Several compute applications do not need hyper-threading. A couple of official references:

1. Wolfram Mathematica: "Mathematica’s Parallel Computing suite does not necessarily benefit from hyper-threading, although certain kernel functionality will take advantage of it when it provides a speedup." [source: https://support.wolfram.com/39353?src=mathematica]. Indeed Mathematica automatically sets up a number of threads equal to the number of physical cores of the CPU.
2. Intel MKL library. "Hyper-Threading Technology (HT Technology) is especially effective when each thread is performing different types of operations and when there are under-utilized resources on the processor. Intel MKL fits neither of these criteria as the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance when using Intel MKL without HT Technology enabled." [source: https://software.intel.com/content/www/us/en/devel...].
BTW Ian: Wolfram Mathematica has a benchmark mode [source: https://reference.wolfram.com/language/Benchmarkin...], please consider adding it to your test suite. Or something with Matlab.
realbabilu - Thursday, December 3, 2020 - link
Apparently Intel MKL, and Matlab which uses Intel MKL, only allow AMD CPUs to use the non-AVX2 library path. Only on Linux, with a preloaded library that fakes the CPU vendor, can you get around this.

https://www.google.com/amp/s/simon-martin.net/2020...
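To make the preloaded-library workaround concrete, here is a minimal sketch of the kind of shim being described (build it as a shared object and LD_PRELOAD it). The exported symbol name is the one widely reported for MKL builds of that era; treat it as an assumption and verify it against the MKL version actually installed - newer MKL and MATLAB releases have reduced the need for this trick.

    /* fakeintel.c - minimal sketch of the "fake CPU vendor" preload shim.
     * MKL calls an internal vendor check; overriding it to return "true"
     * lets MKL pick its optimized (AVX2) code path on AMD CPUs.
     * The symbol name is an assumption tied to MKL 2019/2020-era builds.
     *
     * Build: gcc -shared -fPIC -o libfakeintel.so fakeintel.c
     * Use:   LD_PRELOAD=./libfakeintel.so ./your_mkl_linked_app
     */
    int mkl_serv_intel_cpu_true(void)
    {
        return 1;   /* pretend the CPU identifies as "GenuineIntel" */
    }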
RickITA - Thursday, December 3, 2020 - link
Not a matlab user, but this is no longer true since version 2020a. Source: https://www.extremetech.com/computing/308501-cripp...

leexgx - Saturday, December 5, 2020 - link
The "if not GenuineIntel CPU" check disabled all optimisations (this rubbish has been going on for years; only around 2019 or 2020 did they actually fix their code to detect whether AVX is available). Even BTRFS had this problem: it wouldn't use hardware acceleration if it wasn't on an Intel CPU. Again, lazy coding.

Holliday75 - Thursday, December 3, 2020 - link
As usage for modern users changes, I wonder how this could be better tested/visualized.

I am not looking at a 5900X to run any advanced tools. I am looking to game, run multiple browsers with a few dozen tabs open, stream, download, run Plex (transcoding), security tools, VPN, and the million other applications a normal user would have running at any given point in time. While no two users will have the same workload at any given time, how could we quantify SMT versus no SMT for the average user?

In the not too distant future we could be seeing the average PC running 32 cores. I am talking your run-of-the-mill office machine from Dell that costs $800. Or will we? Is there a point where it does not matter anymore?
realbabilu - Thursday, December 3, 2020 - link
Simple. For the average user, a 4-core 8th-gen U-series part already has more cores than the generation before. It has more strength, but it rarely gets to 100 percent CPU utilization for the normal things you do. To get 8 threads or 4 cores working at 100 percent you need killer applications programmed by people who know how to extract every bit of juice from the processor: who know how to write multithreaded code, or who use an optimized math kernel library / optimized compiler switches, as in FEM, rendering, and applied-science math.

Other than those apps, maybe you could spend the money on a GPU for gaming instead.
schujj07 - Thursday, December 3, 2020 - link
Or you just have multiple tabs open. I regularly hit 100% usage on my work i5-6400 with 4c/4t having 10-12 tabs open. It gets quite annoying as on a normal day I might need up to double that open at any given time. That means that 20 tabs would peg a 4c/8t CPU pretty easily.

evilpaul666 - Friday, December 4, 2020 - link
You need an ad blocker, unless those tabs are all very busy doing something. It sounds like they're mining Monero for somebody else instead of doing what they're *supposed* to be doing for you.

schujj07 - Friday, December 4, 2020 - link
I use an ad blocker and nothing is being mined. However, ads are an example of things that will destroy your performance in web browsing quite quickly and suck up a lot of CPU cycles. While right now 4c/8t is enough for an office machine, it will not be long before 6c/12t is the standard.

marrakech - Tuesday, December 15, 2020 - link
15 cores are the futureeeeee

Hulk - Thursday, December 3, 2020 - link
Wouldn't high SMT performance be an indication of bad software design rather than bad core design?

While SMT performance is changing in these tests, the core is not. Only the software is changing. It seems as though an Intel CPU in this comparison would have provided additional insights to these questions.
BillyONeal - Thursday, December 3, 2020 - link
The situations that create high SMT performance are generally outside the software in question's control. For example, a program might have 1 thread that's doing all divides and another that's doing all multiplies. The thread that only has multiplies, or only divisions, isn't poorly designed; it just isn't using units on the chip that don't help its workload.

There are also cache effects. If you have 2 threads working on data bigger than the CPU's caches, while one is waiting for that data to come back from memory the other can make unrelated progress, and vice versa; but the data being big isn't necessarily an indicator of poor software design. Some problem domains just have big data sets that there's no way around.
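A toy sketch of that idea (my own illustration, not anything from the article): pair a thread that hammers the integer divider with a thread that does dependent loads from a buffer much larger than the caches, pin them to the two hardware threads of one core, and compare the combined wall time against running them back to back. The assumption that CPUs 0 and 16 are SMT siblings is machine-specific; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list first.

    /* smt_mix.c - two deliberately different workloads that tend to coexist
     * well on SMT siblings: one is divide-latency-bound, one is DRAM-bound.
     * Build: gcc -O2 -pthread smt_mix.c -o smt_mix   (measure with `time`) */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdlib.h>

    enum { N = 1 << 24 };                 /* 16M entries = 64 MiB chase array */

    static void pin(int cpu)              /* pin the calling thread to one logical CPU */
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *divider(void *arg)       /* long-latency integer divides, no memory traffic */
    {
        (void)arg;
        pin(0);                           /* assumed SMT sibling #1 */
        volatile uint64_t x = 0x123456789abcdefULL;
        for (uint64_t i = 1; i <= 400000000ULL; i++)
            x = x / i + i;
        return NULL;
    }

    static void *chaser(void *arg)        /* dependent loads, mostly waiting on DRAM */
    {
        uint32_t *next = arg;
        pin(16);                          /* assumed SMT sibling #2 */
        uint32_t idx = 0;
        for (long i = 0; i < 50000000L; i++)
            idx = next[idx];
        return (void *)(uintptr_t)idx;    /* keep the chase result live */
    }

    int main(void)
    {
        uint32_t *next = malloc((size_t)N * sizeof *next);
        for (uint32_t i = 0; i < N; i++) next[i] = i;
        for (uint32_t i = N - 1; i > 0; i--) {          /* shuffle to defeat the prefetcher */
            uint32_t j = (uint32_t)rand() % (i + 1);
            uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        pthread_t a, b;
        pthread_create(&a, NULL, divider, NULL);
        pthread_create(&b, NULL, chaser, next);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        free(next);
        return 0;
    }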
WaltC - Thursday, December 3, 2020 - link
Exactly. Some software is written to utilize a lot of threads simultaneously, some is not. Running software that does not make use of a lot of simultaneous threads tells us really nothing much about SMT CPU hardware, imo, other than "this software doesn't support it very well."

Elstar - Thursday, December 3, 2020 - link
SMT24? Ha. Try SMT128: https://en.wikipedia.org/wiki/Cray_XMT#Threadstorm...

dotjaz - Thursday, December 3, 2020 - link
Do you understand what "S(imultaneous)" in SMT means? Barrel processors are by definition NOT simultaneous. They switch between threads.

quadibloc - Friday, December 4, 2020 - link
That all depends. There could be a unit that switches between threads to dispatch instructions into the pipeline, but instructions from all the threads are simultaneously working on calculations in the pipeline. I'd call that a way to implement SMT.

Elstar - Friday, December 4, 2020 - link
Guys, I've got bad news for you. The difference between a barrel processor ("temporal multithreading") and SMT is all about the backend, not the frontend. I.e. whether the processor is superscalar or not. Otherwise there is no difference. They duplicate hardware resources and switch between them. And the frontend (a.k.a. the decoder) switches temporally between hardware threads. There are NOT multiple frontends/decoders simultaneously feeding one backend pipeline.

Elstar - Friday, December 4, 2020 - link
For example the "SMT4" Intel Xeon Phi has a design weakness where three running threads per core get decoded as if four threads were running. (And yes, just one or two running threads per core get decoded efficiently.)

dotjaz - Thursday, December 3, 2020 - link
You nailed 2 letters out of 3, gj.

Luminar - Thursday, December 3, 2020 - link
Talk about being uninformed.

MenhirMike - Thursday, December 3, 2020 - link
Will be interesting to see if this looks different with quad-channel Threadripper or octo-channel EPYC/TR Pro CPUs, since 16 cores/32 threads with 2 channels of memory doesn't seem very compute-friendly. Though it's good to see that "SMT On" is still the reasonable default it pretty much always has been, except in very specific circumstances.

schujj07 - Thursday, December 3, 2020 - link
Also would be interesting to see this on a 6c/12t or 8c/16t CPU.

CityBlue - Thursday, December 3, 2020 - link
In your list of "Systems that do not use SMT" you forgot:

* All x86 from Intel with CPU design vulnerabilities used in security conscious environments
MenhirMike - Thursday, December 3, 2020 - link
To be fair, "x86" and "security conscious" are already incompatible on anything newer than a Pentium 1/MMX. Spectre affects everything starting with the Pentium Pro, and newer processors have blackboxes in the form of Intel ME or AMD PSP. You can reduce the security risk by turning off some performance features (and get CPUs without Intel ME if you're the US government), but this is still just making an inherently insecure product slightly less insecure.

CityBlue - Thursday, December 3, 2020 - link
Sure, there are numerous examples of non-core silicon in x86 environments that are insecure; however, this article is about SMT, yet it avoids mentioning the ever-growing list of security problems with Intel's implementation of SMT. The Intel implementation of SMT is so badly flawed from a security perspective that the only way to secure Intel CPUs is to completely disable SMT, and that's the bottom-line recommendation of many kernel and distribution developers that have been trying to "fix" Intel CPUs for the past few years.

abufrejoval - Thursday, December 3, 2020 - link
I use an Intel i7 in my pfSense firewall appliance, which is based on BSD. BSD tends to remind you that you should run it with SMT disabled because of these side-channel exposure security issues.

Yet, with the only workload being pfSense and no workload under an attacker's control able to sniff data, I just don't see why SMT should be a risk there, while the extra threads for deep inspection with Suricata should help keep that deeper analysis from becoming a bottleneck on the uplink.
You need to be aware of the architectural risks that exist on the CPUs you use, but to argue that SMT should always be off is a bit too strong.
Admittedly, when you have 16 real cores to play with, disabling SMT hurts somewhat less than on an i3-7350K (2C/4T), a chip I purchased once out of curiosity, only to have it replaced by an i5-7600K (4C/4T) just before Kaby Lake left the shelves and became temptingly cheap.
It held up pretty well, actually, especially because it did go to 4.8 GHz without more effort than giving it permission to turbo. But I'm also pretty sure the 4 real cores of the i5-7600K will let the system live longer as my daughter's gaming rig.
CityBlue - Thursday, December 3, 2020 - link
> to argue that SMT should always be off is a bit too strong.

Not really - if you're a kernel or distro developer then Intel SMT "off" is the only sane recommendation you can give to end users, given the state of Intel CPUs released over the last 10 years (note: the recommendation doesn't relate to SMT in general, nor even to AMD SMT, only to Intel SMT).
However if end users with Intel CPUs choose to ignore the recommendation then that's their choice, as presumably they are doing so while fully informed of the risks.
leexgx - Saturday, December 5, 2020 - link
The SMT risk is more a server issue than a consumer issue.

schujj07 - Thursday, December 3, 2020 - link
This article wasn't talking about Intel's implementation, only SMT performance on Zen 3. If this were about SMT on Intel then it would make sense, otherwise no.

CityBlue - Thursday, December 3, 2020 - link
The start of the article is discussing the pros and cons of SMT *in general* and then discusses where SMT is used, and where it is not used, giving examples of Intel x86 CPUs. Why not then mention the SMT security concerns with Intel CPUs too? That's a rhetorical question btw, we all know the reason.

schujj07 - Friday, December 4, 2020 - link
Since this is an AMD focused article, there isn't the side channel attack vector for SMT. Therefore why would you mention side channel attacks for Intel CPUs? That doesn't make any sense since Intel CPUs are only stated for who uses SMT and Intel's SMT marketing name. Hence bringing up Intel and side channel attack vectors would be including extraneous data/information to the article and take away from the stated goal. "In this review, we take a look at AMD’s latest Zen 3 architecture to observe the benefits of SMT."

mode_13h - Sunday, June 6, 2021 - link
> The Intel implementation of SMT is so badly flawed from a security perspective that> the only way to secure Intel CPUs is to completely disable SMT
That's not true. The security problems with SMT are all related to threads from different processes sharing the same physical core. To this end, the Linux kernel now has an option not to do that, since it's the kernel which decides what threads run where. So, you can still get most of the benefits of SMT, in multithreaded apps, but without the security risks!
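The mechanism being referred to is core scheduling, merged around Linux 5.14 (CONFIG_SCHED_CORE): tasks only share an SMT sibling if they carry the same cookie. A rough sketch of the opt-in call follows; the prctl constant names and values mirror what linux/prctl.h defines, but verify them against your own kernel headers, as this is written from memory.

    /* core_cookie.c - give the current process a core-scheduling cookie so the
     * kernel will not co-schedule its threads on an SMT sibling next to threads
     * from other processes.  Needs a 5.14+ kernel with CONFIG_SCHED_CORE.
     * Fallback #defines mirror <linux/prctl.h>; verify on your system. */
    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SCHED_CORE
    #define PR_SCHED_CORE                    62
    #define PR_SCHED_CORE_CREATE             1
    #define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
    #endif

    int main(void)
    {
        /* pid 0 = the calling process; scope = its whole thread group, so its
         * own threads may still share a core with each other. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) != 0) {
            perror("PR_SCHED_CORE_CREATE");
            return 1;
        }
        puts("cookie created; spawn the workload from this process");
        return 0;
    }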
dotjaz - Thursday, December 3, 2020 - link
Windows knows how to allocate threads to the same CCX (after patches of course). It not only knows the physical core, it also knows the topology.

leexgx - Saturday, December 5, 2020 - link
A lot of people forget to install the AMD chipset drivers, which can result in some small loss of performance (the BIOS also needs to be kept up to date to complete the CCX group support and the best-cores support advertised to Windows).

Machinus - Thursday, December 3, 2020 - link
Can you sell me yours so I can try one?Marwin - Thursday, December 3, 2020 - link
For me the main question is not whether SMT is bad or good in multithread, but how good or bad it is for 2/4/6-thread loads on, for example, a 12-core Ryzen, when Windows may or may not schedule threads to real cores (one thread per core) or to SMT cores in series.

Duraz0rz - Thursday, December 3, 2020 - link
IIRC, Windows knows which cores are real vs virtual and which virtual core maps to which real core. It shouldn't matter if a thread is scheduled on a real or virtual core, though. If a thread is scheduled on a virtual core that maps to a real core that's not utilized, it still has access to the resources of the full core.

SMT doesn't come into play until you need more threads than cores.
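On Linux, the same real-core-to-sibling mapping can be read straight out of sysfs; a small sketch follows (the topology paths are the standard kernel interface, and Windows exposes the equivalent through GetLogicalProcessorInformationEx):

    /* siblings.c - print which logical CPUs share a physical core, e.g.
     * "cpu0 siblings: 0,16" on a 16c/32t part.  Stops at the first missing CPU. */
    #include <stdio.h>

    int main(void)
    {
        char path[128], buf[64];
        for (int cpu = 0; ; cpu++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                break;                                     /* no such CPU */
            if (fgets(buf, sizeof buf, f))
                printf("cpu%-3d siblings: %s", cpu, buf);  /* buf keeps its newline */
            fclose(f);
        }
        return 0;
    }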
GreenReaper - Thursday, December 3, 2020 - link
That's not *quite* true. Some elements are statically partitioned, notably some instruction/data queues. See 20.19 Simultaneous multithreading in https://www.agner.org/optimize/microarchitecture.p...

"The queueing of µops is equally distributed between the two threads so that each thread gets half of the maximum throughput."
This partitioning is set on boot. So, where each thread might get 128 queued micro-ops with SMT off, you only get 64 with it on. This might have little or no impact, but it depends on the code.
The article itself says: "In the case of Zen3, only three structures are still statically partitioned: the store queue, the retire queue, and the micro-op queue. This is the same as Zen2."
jeisom - Thursday, December 3, 2020 - link
Honestly it looks like you provided a 3rd viewpoint. As these are general purpose processors it really depends on the workload/code optimization and how they are optimized for a given targeted workload.

jospoortvliet - Thursday, December 3, 2020 - link
Hmmm, if you have a *very* specific workload, yes, 'it depends', but we're really talking HPC here. Pretty much nothing you do at home makes it worth rebooting-and-disabling-SMT for on an AMD Zen 3.

Holliday75 - Thursday, December 3, 2020 - link
The confusion comes in because these are consumer processors. These are not technically HPC. Lines are being blurred as these things make $10k CPUs from 5-10 years ago look like trash in a lot of workloads.

GeoffreyA - Thursday, December 3, 2020 - link
Interesting article. Thank you. Would be nice to see the Intel side of the picture.

idealego - Thursday, December 3, 2020 - link
I imagine compiler optimization these days is tuned for SMT. Perhaps this could have been discussed in the article? I wonder how much of a difference this makes to SMT on/off.

bwj - Thursday, December 3, 2020 - link
This article ignores the important viewpoint from the server side. If I have several independent, orthogonal workloads scheduled on a computer I can greatly benefit from SMT. For example if one workload is a pointer-chasing database search type of thing, and one workload is compressing videos, they are not going to contend for backend CPU resources at all, and the benefit from SMT will approach +100%, i.e. it will work perfectly and transparently. That's the way you exploit SMT in the datacenter, by scheduling orthogonal non-interfering workloads on sibling threads.

quadibloc - Friday, December 4, 2020 - link
On the one hand, if one program does a lot of integer calculations, and the other does a lot of floating-point calculations, putting them on the same core would seem to make sense because they're using different execution resources. On the other hand, if you use two threads from the same program on the same core, then you may have less contention for that core's cache memory.

linuxgeex - Thursday, December 3, 2020 - link
One of the key ways in which SMT helps keep the execution resources of the core active is cache misses. When one thread is waiting 100+ clocks on reads from DRAM, the other thread can happily keep running out of cache. Of course, this is a two-edged sword: since the second thread is consuming cache, the first thread was more likely to have a cache miss. So for the very best per-thread latency you're best off disabling SMT. For the very best throughput, you're best off enabling SMT. It all depends on your workload. SMT gives you that flexibility. A core without SMT lacks that flexibility.

Dahak - Thursday, December 3, 2020 - link
Now I only glanced through the article, but would it make more sense to use a lower core count CPU to see the benefits of SMT, as using a higher core count might mean it will use the real cores over the SMT cores?

Holliday75 - Thursday, December 3, 2020 - link
Starting to think the entire point of the article was that this subject is so damn complicated and hard to quantify for the average user that there is no point in trying, unless you are running workloads in a lab environment to find the best possible outcome for the work you plan on doing. Who is going to bother doing that unless you work for a company where it makes sense to do so?

29a - Thursday, December 3, 2020 - link
I wonder what the charts would look like with a dual-core SMT processor. I think the game tests would have a good chance of changing. I'm a little surprised the author didn't think to test CPUs with more limited resources.

Klaas - Thursday, December 3, 2020 - link
Nice article. And nice benchmarks. Too bad you could not include more real-life tests.

I think that a processor with far fewer threads (like a 5600X) is easier to saturate with real-life software that actually uses threads. The cache sizes and memory-channel bandwidth ratio per core are quite different for those 'smaller' processors. That will probably result in different benchmark results... So it would be interesting to see what those processors will do, SMT ON versus SMT OFF. I don't think the end result will be different, but it could even be a bigger victory for SMT ON.

Another interesting area is virtualization. And as already mentioned in other comments, it is very important that the operating system assigns threads to the right core or SMT-core combinations. That is even more important in virtualization situations...
MDD1963 - Thursday, December 3, 2020 - link
Determining the usefulness of SMT with 16 cores on tap is not quite as relevant as when this experiment might be done with, say, a 5600X or 5800X...; naturally 16 cores without SMT might still be plenty (as even 8 non-SMT cores on the 9700K proved).

thejaredhuang - Thursday, December 3, 2020 - link
This would be a better test on a CPU that doesn't have 16 base cores. If you could try it on a 4C/8T part I think the difference would be more pronounced.

dfstar - Thursday, December 3, 2020 - link
The basic benefit of SMT is that it allows the processor to hide the impact of long-latency instructions on average IPC, since it can switch to a new thread and execute that thread's instructions. In this way it is similar to OoO (which leverages speculative execution to do the same) and also more flexible than fine-grained multi-threading. There is an overhead and cost (area/power) due to the duplicated structures in the core that will impact the perf/watt of pure single-threaded workloads; I don't think disabling SMT removes all this impact...

GreenReaper - Thursday, December 3, 2020 - link
Perhaps not. But at the same time, it is likely that any non-SMT chip that has an SMT variant actually *is* an SMT chip, it is just disabled in firmware - either because it is broken on that chip, or because the non-SMT variant sold better.

abufrejoval - Thursday, December 3, 2020 - link
It's hard to imagine a transistor defect that would break *only* SMT. As you say, all non-SMT chips are really SMT chips internally, and the decision to disable SMT doesn't really result in huge chunks of transistors going dark (the potential target area for physical defects).

I'd say most of the SMT vs. no-SMT decisions on individual CPUs are binning related: SMT can create significantly more heat because there is less idle time which allows the chip to cool. So if you have a chip with higher resistance in critical vias that requires higher voltage to function, you need to sacrifice clocks, TDP or utilization (and permutations thereof).
leexgx - Saturday, December 5, 2020 - link
With HT off I have definitely noticed less smoothness in Windows, as with HT it can keep the CPU active when a thread is slightly stuck.

iranterres - Thursday, December 3, 2020 - link
Why are people still testing SMT in 2020? Cache coherency and hierarchy design is mature enough to offset the possible instruction bottleneck issues. I don't even know the purpose of this article at all... Anyways, perhaps falling back to 2008? Come on...

quadibloc - Friday, December 4, 2020 - link
Well, instead of testing the concept of SMT, which has been around for a while, perhaps one could think of it as testing the implementation of SMT found on the chips we can get in 2020.

eastcoast_pete - Friday, December 4, 2020 - link
Thanks Ian! I always thought of SMT as a way of using whatever compute capacity a core has, but isn't being used in the moment. Hence it's efficient if many tasks need doing that each don't take a full core most of the time. However, that hits a snag if the cores get really busy. Hence (for desktop or laptop), 6 or 8 real cores are usually better than 4 cores that pretend to be 8.

AntonErtl - Friday, December 4, 2020 - link
I found the "Is SMT a good thing?" discussion (and later discussion of the same topics) strange, because it seemed to take the POV of someone who wants to optimize some efficiency or utilization metric, i.e. of someone who can choose the number of resources in the core. If you are in that situation, then the take of the EV8 designers was: we build a wide machine so that single-threaded applications can run fast, even though we know the wideness leads to low utilization; we also add SMT so that multi-threaded applications can increase utilization. Now, 20 years later, such wide cores are becoming reality, although interestingly Apple and ARM do not add SMT.

Anyway, buyers and users of off-the-shelf CPUs are not in that situation, and for them the questions are: For buyers: How much benefit does the SMT capability provide, and is it worth the extra money? For users: Does disabling SMT on this SMT-capable CPU increase the performance or the efficiency?
The article shows that the answers to these questions depend on the application (although for the Zen3 CPUs available now the buyer's question does not pose itself).
It would be interesting to see whether the wider Zen3 design gives significantly better SMT performance than Zen or Zen2 (and maybe also a comparison with Intel), but that would require also testing these CPUs.
I did not find it surprising that the 5950X runs into the power limit with and without SMT. The resulting clock rates are mentioned in the text, but might be more interesting graphically than the temperature. What might also be interesting is the power consumed at the same clock frequency (maybe with fewer active cores and/or the clock locked at some lower clock rate).
If SMT is so efficient (+91%) for 3DPMavx, why does the graph only show a small difference?
Bensam123 - Friday, December 4, 2020 - link
Anand, while I value your in-depth articles, you guys really need to drop the 95th percentile frame times and get on board with 1% and 0.1% lows. What disrupts gaming the most is the hiccups, not looking at a statistically smooth chart. SMT/HT affects these THE most, especially in heavily single-threaded games. If you aren't testing what it influences, why test it at all? Youtube reviews are also having problems with tests that don't reflect real world scenarios as well. Sometimes it's a lot more disagreeable than others.

Completely invalid testing methodology at this point.

My advice, based on my own testing: turn off SMT/HT except in scenarios in which you become CPU bound across all cores, not one. This improved 0.1% and 1% frame times... i.e. stutters. Turn it on when you reach a point of 90%+ utilization, as it helps, and a lot, when your CPU is maxed out. Generally speaking, CPUs with fewer than 6 (and soon to be 8) cores should always have it on.
You didn't even test where this helps the most and that's low end CPUs vs high end CPUs where you find the Windows scheduler messes things up.
Also, if you're testing this on your own, always turn it off in the BIOS. If you use something like Process Lasso or manually change affinity, Windows will still put protected services and processes onto those extra virtual cores, causing contention issues that lead to the stuttering.

The most obvious games that get a benefit from SMT/HT off are heavily single-threaded games, such as MOBAs.
Gloryholle - Friday, December 4, 2020 - link
Testing Zen3 with 3200CL16?

peevee - Friday, December 4, 2020 - link
"Most modern processors, when in SMT-enabled mode, if they are running a single instruction stream, will operate as if in SMT-off mode and have full access to resources."Which would have access to the whole microinstruction cache (L0I) in SMT mode?
Arbie - Friday, December 4, 2020 - link
Another excellent AT article, which happens to hit my level of knowledge and interest; thanks!

Oxford Guy - Friday, December 4, 2020 - link
Suggestions:

Compare with Zen 2 and Zen 1, particularly in games.
Explain SMT vs. CMT. Also, is SMT + CMT possible?
AntonErtl - Sunday, December 6, 2020 - link
CMT has at least two meanings.

Sun's UltraSparc T1 has in-order cores that run several threads alternatingly on the functional units. This is probably the closest thing to SMT that makes sense on an in-order core. Combining this with SMT proper makes no sense; if you can execute instructions from different threads in the same cycle, there is no need for an additional mechanism for processing them in alternate cycles. Instruction fetch on some SMT cores processes instructions in alternate cycles, though.
The AMD Bulldozer and family have pairs of cores that share more than cores in other designs share (but less than with SMT): They share the I-cache, front end and FPU. As a result, running code on both cores of a pair is often not as fast as when running it on two cores of different pairs. You can combine this scheme with SMT, but given that it was not such a shining success, I doubt anybody is going to do it.
Looking at roughly contemporary CPUs (Athlon X4 845 3.5GHz Excavator and Core i7 6700K 4.2GHz Skylake), when running the same application twice one after the other on the same core/thread vs. running it on two cores of the same pair or two threads of the same core, using two cores was faster by a factor 1.65 on the Excavator (so IMO calling them cores is justified), and using two threads was faster by a factor 1.11 on the Skylake. But Skylake was faster by a factor 1.28 with two threads than Excavator with two cores, and by a factor 1.9 when running only a single core/thread, so even on multi-threaded workloads a 4c/8t Skylake can beat an 8c Excavator (but AFAIK Excavators were not built in 8c configurations). The benchmark was running LaTeX.
Oxford Guy - Sunday, December 6, 2020 - link
AMD's design was very inefficient in large part because the company didn't invest much into improving it. The decision was made, for instance, to stall high-performance development at Piledriver in favor of a very very long wait for Zen. Excavator was made on a low-quality process and was designed to be cheap to make.

Comparing a 2011/2012 design that was bad when it came out with Skylake is a bit of a stretch, in terms of what the basic architectural philosophy is capable of.
I couldn't remember that fourth type (the first being standard multi-die CPU multiprocessing) so thanks for mentioning it (Sun's).
USGroup1 - Saturday, December 5, 2020 - link
So yCruncher is far away from real world use cases and 3DPMavx isn't.pc8086 - Sunday, December 6, 2020 - link
Many congratulations to Dr. Ian Cutress for the excellent analysis carried out.

If possible, it would be extremely interesting to repeat a similar rigorous analysis (at least on the multi-threaded subsection of chosen benchmarks) on the following platforms:
- 5900X (Zen 3, but fewer cores for each chiplet, maybe with more thermal headroom)
- 5800X (Zen 3, only a single computational chiplet, so no inter-CCX latency troubles)
- 3950X (same cores and configuration, but with Zen 2, to check if the new, beefier core improved SMT support)
- 2950X (Threadripper 2, same number of cores but Zen+, with 4 memory channels; useful especially for tests such as AIBench, which have gotten worse with SMT)
- 3960X (Threadripper3, more cores, but Zen2 and with 4 memory ch.)
Obviously, it would be interesting to check Intel HyperThreading impact on recent Comet Lake, Tiger Lake and Cascade Lake-X.
For the time being, Apple has decided not to use any form of SMT on its own CPUs, so it is useful to fully understand the usefulness of SMT technologies for notebooks, high-end PCs and prosumer platforms.
Thank you very much.
eastcoast_pete - Sunday, December 6, 2020 - link
Thanks Ian! Given some of your comments about memory access limiting performance in some cases, how much additional performance would a quad-channel memory setup give compared to the dual-channel consumer setups (like these, or mine)? Now, I know that servers and actual workstations usually have 4 or more memory channels, and for good reason. So, in the time of 12 and 16 core CPUs, is it time for quad-channel memory access for the rest of us, or would that break the bank?
That's a good question. As time moves on and we keep getting more cores, with people doing more things that make use of them (such as gaming and streaming at the same time, with browser/tabs open, livechat, perhaps an encode too), perhaps indeed the plethora of cores does need better memory bandwidth and parallelism, but maybe the end user would not yet tolerate the cost.

Something I noticed about certain dual-socket S2011 mbds on Aliexpress is that they don't have as many memory channels as they claim, which with two CPUs does hurt performance of even consumer grade tasks such as video encoding:
http://www.sgidepot.co.uk/misc/kllisre_analysis.tx...
bez5dva - Monday, December 7, 2020 - link
Hi Dr. Cutress!

Thanks for these interesting tests!
Perhaps SMT is something that could drastically improve the performance of more budget-oriented CPUs? Your CPU has more than enough shiny cores for these games, but what if you take a Ryzen 3 3100? I believe the percentage would be different, as it was in my real-world case :)

Back then I had a 6600K @ 4500 MHz, and in some FPS games with huge maps and a lot of players (Heroes and Generals; Planetside 2) I started to get stutters in tight fights, but when I switched to a 6700 @ 4500 MHz that wasn't the case anymore. So I do believe that Hyper-Threading worked in my case, because my CPUs were identical aside from the virtual threads in the latter one.

Would be super interesting to have this post updated with results from a cheaper sample 😇
peevee - Monday, December 7, 2020 - link
It is clear that 16-core Ryzen is power, memory and thermally limited. I bet SMT results on 8-core Ryzen 7 5800x would be much better for more loads.naive dev - Tuesday, December 8, 2020 - link
The slide states that Zen 3 decodes 4 instructions/cycle. Are there two independent decoders which each decode those 4 instructions for a thread? Or is there a single decoder that switches between the program counters of both threads but only decodes instructions of one thread per cycle?
There's a single set of 4 decoders. In SMT mode, I believe some sharing is going in. This is from the original Zen design:

https://images.anandtech.com/doci/10591/HC28.AMD.M...
GeoffreyA - Tuesday, December 8, 2020 - link
* going onnaive dev - Wednesday, December 9, 2020 - link
Right, I found that article as well and from that slide it looks like the decoder would be shared. But then that slide was from 2017, so that might have changed.

It looks though as if the decoder could decode those 4 instructions from a single program counter only, right? It's not like the decoder could decode e.g. 2 instructions from program counter 1 and another 2 instructions from program counter 2?
GeoffreyA - Thursday, December 10, 2020 - link
I'm not too sure how the implementation works, but I expect they're shuffling both threads through the decoder at roughly the same time. The decoder has four units (I think 1 complex and 3 simple). As far as I'm aware, that has stayed the same in both Zen 2 and 3.

mapesdhs - Thursday, December 10, 2020 - link
Ian, a question about Handbrake, though it may not apply to the type of test you used. I've read that Handbrake doing an h264 encode can only use 16 threads max. Does this mean that in theory one could run two separate h264 encodes on a 5950X and thus obtain a good overall throughput speedup? Have you tried such a thing? Or might this only work if it were possible to force one encode to only use the 16 threads of one 8c block (CCX?), and the other encode to use the rest? ie. so that the separate encodes are not fighting over the same cores or indeed the same CCX-shared L3? Is it possible to force this somehow? Also, if the claimed 16 thread limit for h264 is true, is there a performance difference for a single h264 encode between SMT on vs. off just in general? ie. with it on, is the OS smart enough to ensure that the 16 threads are spread across all the cores evenly rather than being scrunched onto fewer cores because reasons? If not, then turning SMT off might speed it up. Note that I'm using Windows for all this.

I don't know if any of this applies to h265, but atm the encoding I do is still 1080p. I did an analysis of all available Ryzen CPUs based on performance, power consumption and cost (I ruled out Intel partly due to the latter two factors but also because of a poor platform upgrade path) and found that although the 5900X scored well, overall it was beaten by the 2700X, mainly because the latter is so much cheaper. However, the 5950X would look a lot better if one could run two encodes on it at the same time without clashing, but review articles naturally never try this. I wish I could test it, but the only 16c system I have is a dual-socket S2011 setup with two 2560 v2s, so the separate CPUs introduce all sorts of other issues (NUMA and suchlike).
I found something similar a long time ago when I noticed one could run six separate Maya frame renders on a 24-CPU SGI rack Onyx (essentially one render per CPU board), compared to running a single render on a quad-CPU (single board) deskside Onyx, giving a good overall throughput increase (the renderer being limited to 4 CPUs per job). See:
http://www.sgidepot.co.uk/perfcomp_RENDER4_maya1.h...
Funny actually, re what you say about an overly good speedup perhaps implying a less than optimal core design. Something odd about SGIs is how many times on a multi-CPU system one can obtain better results by using more threads than there are CPUs, bearing in mind MIPS CPUs from that era did not have SMT, ie. the CPUs kinda behave as if they do have SMT even though they don't. I found this behaviour occurred most for Blender and C-Ray.
So anyway, it would be great if it were possible to run two h264 encodes on a 5950X at the same time, but there's probably no point if the OS doesn't spread out the loads in a sensible manner, or if in that circumstance there isn't a way to force each encode to use a separate CCX.
All very specific to my use case of course, but I have hundreds of hours of material to convert, so the ability to get twice the throughput from a 5950X would make that CPU a lot more interesting; so far reviews I've read show it to be about 2x faster than the 2700X for h264 Handbrake (just one encode of course), but it costs 4.4x more, rather ruining the price/performance angle. And if it does work then I guess one could ask the same question of TR - could one run eight separate h264 encodes on a future Zen3 TR without the thread management being a total mess? :D I'm assuming it probably wouldn't be so good with the older Zen2 design given the split L3.
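For what it's worth, a hedged sketch of how the two-encodes experiment could be forced on Windows: launch each HandBrakeCLI job suspended, set its affinity mask to one half of the machine, then let it run. The command lines and the assumption that logical processors 0-15 sit on one CCD and 16-31 on the other are mine, not anything HandBrake or the article documents; check the mapping on the actual machine (Task Manager or `start /affinity` can do the same thing interactively).

    /* launch2.c - run two encodes at once, each confined to half the CPU.
     * CreateProcess with CREATE_SUSPENDED lets us set the affinity mask
     * before the target starts.  Masks assume LPs 0-15 = CCD0, 16-31 = CCD1.
     * Build: cl launch2.c   (or x86_64-w64-mingw32-gcc launch2.c) */
    #include <windows.h>
    #include <stdio.h>

    static void run_with_affinity(const char *cmdline, DWORD_PTR mask,
                                  PROCESS_INFORMATION *pi)
    {
        STARTUPINFOA si = { sizeof si };
        char cmd[1024];
        lstrcpynA(cmd, cmdline, sizeof cmd);    /* CreateProcess may modify the buffer */
        if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE,
                            CREATE_SUSPENDED, NULL, NULL, &si, pi)) {
            fprintf(stderr, "CreateProcess failed (%lu)\n", GetLastError());
            ExitProcess(1);
        }
        SetProcessAffinityMask(pi->hProcess, mask);  /* pin before it starts running */
        ResumeThread(pi->hThread);
    }

    int main(void)
    {
        PROCESS_INFORMATION p1, p2;
        run_with_affinity("HandBrakeCLI.exe -i in1.mkv -o out1.mkv", 0x0000FFFF, &p1); /* CCD0 */
        run_with_affinity("HandBrakeCLI.exe -i in2.mkv -o out2.mkv", 0xFFFF0000, &p2); /* CCD1 */
        WaitForSingleObject(p1.hProcess, INFINITE);
        WaitForSingleObject(p2.hProcess, INFINITE);
        return 0;
    }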
GeoffreyA - Sunday, December 13, 2020 - link
Interesting question. Would be nice if someone could give this a test on 16-core Ryzen or TR, and see what happens. Yesterday, I was able to take both FFmpeg and Handbrake up to 128 threads, and it does work; but, having only a 4-core, 4-thread CPU, can't comment.*

As for x264's performance limit, I'm not sure at what number of threads it begins to flag; but, quality wise, using too many (say, over 16 at 1080p) is not advisable. According to the x264 developers, vertical resolution / threads shouldn't fall below 40-50 and certainly not below 30.
https://forum.doom9.org/showthread.php?p=1213185#p...
forum.doom9.org/showthread.php?p=1646307#post1646307
More posts on high core counts:
forum.doom9.org/showthread.php?t=173277
forum.doom9.org/showthread.php?t=175766
* As far as I know, Windows schedules threads all right. From 1903, on Zen 2, one CCX is supposed to be filled up, then another. I imagine 16 threads will be spread across two CCXs in the 5950X. FFmpeg's --threads switch could prove useful too.
GeoffreyA - Sunday, December 13, 2020 - link
-threads, not --threads

Here are links set out better (thought they'd link in the comment):
https://forum.doom9.org/showthread.php?p=1213185#p...
https://forum.doom9.org/showthread.php?p=1646307#p...
https://forum.doom9.org/showthread.php?t=173277
https://forum.doom9.org/showthread.php?t=175766
karthikpal - Friday, December 11, 2020 - link
Nice content bro
deil - Sunday, December 13, 2020 - link
I wonder when SMT4 will hit the market: a model with 3 copies of most things on the die, in a ring configuration fp/int/fp/int with cache inside the ring, so a single thread would have a chance to use 2 FP modules for a single int processor part (when others don't use it ofc).

This kind of setup would have very interesting performance numbers at least. I am not saying it's a good idea, but an interesting one for sure.
Machinus - Sunday, December 13, 2020 - link
This article omits one of the basic considerations in any manually-configured and custom-cooled desktop system: achieving uniform, predictable thermal behavior. Unless you are building servers to perform only one or two specific types of mathematical operations, and can build, configure, and stress test on those instruction types alone, you need high confidence that the chip will never exceed the thermal flux densities of the cooling system you built. Fixed-clock systems with a static number of available cores have much more consistent thermal performance than chips whose clocks, and number of threads, are free-floating. This reduces your peak flops, but it significantly extends system lifetime. HEDT and HPC systems have double- or triple-digit core counts per socket in 2020; SMT is not worth paying the price of reduced hardware lifetime unless you are building extremely specialized calculation servers.

quadibloc - Monday, December 14, 2020 - link
The SPARC chips used SMT a lot, even going beyond 2-way, so I'm surprised they weren't mentioned as examples.

mode_13h - Sunday, June 6, 2021 - link
> When SMT is enabled, depending on the processor, it will allow two, four, or eight threads to run on that core
Intel's HD graphics GPUs win the oddball award for supporting 7 threads per EU, at least up through Gen 11, I think.
IIRC, AMD supports 12 threads per CU, on GCN. I don't happen to know how many "warps" Nvidia simultaneously executes per SM, in any of their generations.
mode_13h - Sunday, June 6, 2021 - link
Thanks for looking at this, although I was disappointed in the testing methodology. You should be separately measuring how the benchmarks respond to simply having more threads, without introducing the additional variable of SMT on/off. One way to do this would be to disable half of the cores (often an option in the BIOS) and disable SMT, then separately re-test with SMT on, and then with SMT off but all cores on. This way, we could compare SMT on/off with the same number of threads. Ideally, you'd also do this on a single-die/single-CCX CPU, to ensure no asymmetry in which cores were disabled.

Even better would be to disable any turbo, so we could just see the pipeline behavior. Although, controlling for more variables poses a tradeoff between shedding more insight into the ALU behavior and making the test less relevant to real-world usage.
The reason to hold the number of threads constant is that software performance doesn't scale linearly with it. Due to load-balancing issues and communication overhead (e.g. lock contention), even properly-designed software scales sub-linearly with the number of threads. So, by keeping the thread count constant, you'd eliminate that variable.
Of course, in real-world usage, users would be deciding between the two options you tested (SMT on/off; always using all cores). So, that was most relevant to the decision they face. It's just that you're limited in your insights into the results, if you don't separately analyze the thread-scaling of the benchmarks.
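One way to run that constant-thread-count comparison without touching the BIOS, sketched below assuming Linux, 2-way SMT enabled, and a hypothetical ./benchmark workload that takes a --threads argument: pin the same number of threads either one per physical core, or packed as SMT pairs onto half as many cores.

    # smt_vs_cores.py - sketch: same thread count, with and without SMT sharing
    import os
    import subprocess

    def sibling_groups():
        """Return one set of logical CPUs per physical core, read from sysfs."""
        groups = {}
        for entry in os.listdir("/sys/devices/system/cpu"):
            if not (entry.startswith("cpu") and entry[3:].isdigit()):
                continue
            path = f"/sys/devices/system/cpu/{entry}/topology/thread_siblings_list"
            try:
                with open(path) as f:
                    key = f.read().strip()
            except FileNotFoundError:
                continue
            groups.setdefault(key, set()).add(int(entry[3:]))
        return sorted(groups.values(), key=min)

    cores = sibling_groups()
    n = len(cores) // 2

    # Config A: one thread on each of 2n physical cores (no SMT sharing).
    config_a = {min(g) for g in cores[:2 * n]}
    # Config B: both SMT siblings on only n physical cores (same thread count).
    config_b = set().union(*cores[:n])

    for name, cpus in (("cores-only", config_a), ("smt-pairs", config_b)):
        print(name, sorted(cpus))
        subprocess.run(["./benchmark", f"--threads={len(cpus)}"],
                       preexec_fn=lambda c=cpus: os.sched_setaffinity(0, c))

With 2-way SMT, both configurations run the same number of threads, so any remaining performance gap comes from sharing a core rather than from thread-scaling overhead.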
mode_13h - Sunday, June 6, 2021 - link
Oops, I also intended to mention OS scheduling overhead as another source of overhead when running more threads. We tend not to think of the additional work that more threads creates for the OS, but each thread the kernel has to manage and schedule has a nonzero cost.

mode_13h - Sunday, June 6, 2021 - link
As for the article portion, I also thought too little consideration was given to the relative amounts of ILP in different code. Something like a zip file compressor should have relatively little ILP, since each symbol in the output tends to have a variable length in the input, meaning decoding of the next symbol can't really start until the current one is mostly done. Text parsing and software compilation also tend to fall in this category.

So, I was disappointed not to see some specific cases of low-ILP (but high-TLP) code highlighted, such as software compilation benchmarks. This is also a very relevant use case for many of us: I spend hours per week compiling software, yet I don't play video games or do 3D photo reconstruction.
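A toy illustration of that serial dependency (not real DEFLATE/zip code, just a made-up 2-bit length prefix): the position of the next symbol is only known after the current one has been decoded, so consecutive iterations can't overlap much.

    # vlc_chain.py - toy variable-length decode loop: one long dependency chain
    def decode(buf):
        out, pos = [], 0
        while pos < len(buf):
            length = (buf[pos] & 0b11) + 1      # symbol length is stored in the symbol itself...
            out.append(bytes(buf[pos:pos + length]))
            pos += length                       # ...so the next start depends on decoding this one
        return out

    print(decode(bytes([0b10, 7, 9, 0b01, 3, 0b00])))

Each iteration's pos depends on the previous iteration's decoded length, so the hardware can't overlap iterations the way it could with independent work; that's the low-ILP shape being described.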
mode_13h - Sunday, June 6, 2021 - link
A final suggestion for any further articles on the subject: rather than speculate about why certain benchmarks are greatly helped or hurt by SMT, use tools that can tell you! To this end, Intel has long provided VTune, and AMD has a tool called μProf.

* https://software.intel.com/content/www/us/en/devel...
* https://developer.amd.com/amd-uprof/