5 Comments
r3loaded - Tuesday, October 16, 2018 - link
> Among these items is also “threading architecture”. While I’m not keen on guessing Arm’s intents here, I do wonder if this means we’ll be seeing SMT implemented in Arm’s Neoverse IPs?

Yes.
ztrouy - Wednesday, October 17, 2018 - link
Honestly, I would love to see SMT become a standard feature of the Cortex IP line. I know that it would probably take a few generations for it to trickle down to the lower end. However, I would still hope that it would push ARM and other chip designers to focus on creating larger, more powerful cores more similar to Apple's, even if they couldn't quite reach the girth of Apple's silicon.
ZolaIII - Wednesday, October 17, 2018 - link
Why shouldn't they? After all, Apple uses the same ARM instruction set, and although it has wider OoO cores for mainstream consumer products, it's also two steps behind the reference ARM design (predictor & task scheduler). That's the usual price for a custom design. If you throw in the available server-grade ARM designs, Apple ain't leading in anything. If ARM starts introducing server IP designs, and if those are good ones, it will cut the steps that custom cores lag behind. I don't believe ARM is about to go the SMT route, which the Cortex-A76's unique design approach bears out. At some point vertical SMT makes sense, as it's a much less costly strategy than SMT, large TLBs or cache multiplication. Imagination Technologies went the same way and came to the same conclusion. One thing is clear: the wider the core, the harder it is to feed optimally, and this is even more the case for ultra-wide SIMD extension blocks.
edzieba - Wednesday, October 17, 2018 - link
With a heterogeneous core layout now the de facto standard for ARM, is SMT even desirable? SMT helps you utilise cores that are partially idle and shedding power for no performance benefit, but on ARM the solution to this is already to switch to a smaller, more efficient core to match the current workload. SMT is great when you have a lot of workload and are limited by how big you can make your die, but on mobile (ARM's bread and butter, despite multiple abortive attempts to break into the datacentre) you don't really have those big chunks of high workload to chew through.
jdiamond - Thursday, February 21, 2019 - link
SMT operates at a much lower scale than you imply: when a core has a lot of functional units, a single instruction stream may not have enough instruction-level parallelism to leverage all of them every cycle, leading to lots of empty instruction slots. SMT effectively gives a massive boost to the ILP created by a large out-of-order instruction window, by providing multiple completely independent instruction streams to utilize all those lost ALU slots. SMT is particularly helpful with covering the latency of cache misses when the OoO window is insufficient. SMT is the more efficient choice when (1) your application actually has parallel threads, and (2) your core has the potential to issue many instructions per cycle.
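
To put rough numbers on the empty-slot argument above, here is a toy Python sketch. It is not a pipeline simulator: the 4-wide issue width, the per-cycle ready-instruction counts, and the name `slot_utilization` are all made up for illustration, and it assumes that instructions not issued in a cycle simply vanish rather than queueing up.

```python
def slot_utilization(ready_per_cycle, issue_width):
    """Return the fraction of issue slots filled over the given cycles.

    ready_per_cycle: one tuple per cycle, holding the number of ready
    instructions for each hardware thread in that cycle.
    Toy assumption: leftover instructions do not carry over to later cycles.
    """
    issued = 0
    for ready in ready_per_cycle:
        # The core issues at most `issue_width` instructions per cycle,
        # drawn from whichever threads have work ready.
        issued += min(issue_width, sum(ready))
    return issued / (issue_width * len(ready_per_cycle))


ISSUE_WIDTH = 4  # hypothetical 4-wide core

# Made-up per-cycle ready counts; 0 models a stall such as a cache miss.
thread_a = [2, 1, 0, 3, 1, 0, 2, 1]
thread_b = [1, 2, 3, 0, 2, 2, 0, 1]

# One hardware thread: only thread A can fill slots.
single = slot_utilization([(a,) for a in thread_a], ISSUE_WIDTH)
# Two hardware threads: both streams compete for the same slots.
smt = slot_utilization(list(zip(thread_a, thread_b)), ISSUE_WIDTH)

print(f"one thread:  {single:.0%} of issue slots used")
print(f"two threads: {smt:.0%} of issue slots used")
```

With these invented inputs, thread A alone fills 10 of the 32 available slots, while the two threads together fill 21 of 32. The second thread mostly lands in slots the first one leaves empty, including the cycles where thread A is stalled. Real gains depend on the workload and on contention for caches and other shared resources, which this sketch ignores.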