Comments for AMD Comments on Threadripper 2 Performance and Windows Scheduler

AMD Comments on Threadripper 2 Performance and Windows Scheduler

by Ian Cutress on 1/14/2019 9:00 AM EST

39 Comments

colonelclaw - Monday, January 14, 2019 - link
FWIW I've tested CorePrio with 3DS Max 2019 and V-Ray 4.1, and I can't see a meaningful difference, so it looks like V-Ray is unaffected. Furthermore the 32-core TR scales almost exactly as expected from the 16-core TR (taking into account frequency differences).
Conclusion: If you render with V-Ray and TR everything's good.
IGTrading - Monday, January 14, 2019 - link
Problem is that Microsoft did this on purpose!

Please don't go crazy about anti-AMD conspiracies, because that is not the case at all (there are clearly some other cases where reasonable doubt is more than reasonable :) but not here, not today, please).

What I mean is this : every project has a PM.

When Intel told Microsoft they want a patch to fix/improve Xeon 18-core, Microsoft was probably happy to oblige, but when that development was ready, testing is ABSOLUTELY MANDATORY @Microsoft.

Of course Microsoft WHQL will test the patch on most x86 platforms, including AMD EPYC and AMD ThreadRipper.

My question is: how the hell did the PM of that project decide that the side effects the new Microsoft patch for Intel Xeon is causing are acceptable, and how did several higher-ups in the approval chain decide to move this into production and release the patch?!

How the heck did they decide to force the update on AMD machines as well, clearly knowing the side effects it has on them ?!

I don't believe this is an anti-AMD Microsoft+Intel coalition at all.

What I DO suspect might be possible is that some guy/guys from Microsoft who were involved with this patch received some nice team meetings in beautiful expensive settings or other sorts of benefits from their Intel counterparts ... to quicken the release of the new patch and overlook the AMD side effects and any decent minimal mitigation measures.

Otherwise, moving such a patch into production is absolutely impossible to explain in a company like Microsoft for their most important core product, Windows.
HStewart - Monday, January 14, 2019 - link
It is ridiculous to think Microsoft did this on purpose when the Xbox One uses AMD APUs, unless you think Microsoft is switching processors in future generations.

It is also possible that there is a problem with the design of AMD Zen processors that is causing this issue. What is so different about their processors that would make it a problem?
IGTrading - Monday, January 14, 2019 - link
@HStewart please mate, don't go there. :) This is not "Microsoft" the company, but somebody @Microsoft made the decision to go forward with this crap patch for Intel Xeon.

It has nothing to do with AMD ThreadRipper architecture.

It has to do with ignoring the internal quality testing Microsoft has surely done and pushing out a patch which trashes Microsoft clients with AMD ThreadRipper-based and EPYC-based systems.
HStewart - Monday, January 14, 2019 - link
People love to blame Microsoft for chip problems. I would expect that if Microsoft has dependencies on AMD, they work with AMD to write specific drivers to support them. So why did Threadripper or previous EPYC systems not have this problem? I guess it's simpler just to blame Microsoft for bad design in TR 2.
rahvin - Monday, January 14, 2019 - link
The Scheduler is not a driver.
tamalero - Tuesday, January 15, 2019 - link
1) this is a windows scheduler issue, not a driver issue
2) this does NOT happen on linux.
lobz - Tuesday, January 15, 2019 - link
@HStewart man you just love taking sucker punches, don't you? Chip problems? Bad design? Man... you're killing it today :)
boeush - Monday, January 14, 2019 - link
If it was a Zen design issue, it wouldn't affect exclusively Windows. Since Linux is unaffected, this is clearly Microsoft's problem, not AMD's.
HStewart - Monday, January 14, 2019 - link
Well, AMD must have worked with Linux on drivers to overcome issues, since Linux is open source. But why would a 16-core Intel not have the problem? Shouldn't scheduling be treated the same?
FreckledTrout - Monday, January 14, 2019 - link
Or MS just messed up and Linux driver developers did not.
Alistair - Monday, January 14, 2019 - link
Read the article:

It turns out that Microsoft has a hotfix in place in Windows for dual-NUMA environments that disables this ‘best NUMA node’ situation. Ultimately at some point there were enough dual-socket workstation platforms on the market that this made sense, pushing the ‘best NUMA node’ implementation down the road to 3+ NUMA environments. This is why we see it in quad-die Threadripper and EPYC, and not dual-die Threadripper.

So clearly since Intel's 16 core CPU's are not treated as 3 or 4 sockets, it won't affect them. It only affects 4 processor systems, or processors treated as such like the 32 core Threadripper.
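To make the threshold in the quoted passage concrete, here is a toy sketch. It only encodes the rule as the article describes it (the 'best NUMA node' behaviour is switched off for dual-NUMA systems and applies to 3+ NUMA environments). The function name and the node counts assigned to each example system are assumptions for illustration, and nothing here reflects actual Windows scheduler code.

```python
def best_node_policy_active(numa_nodes: int) -> bool:
    """Toy rule taken from the quoted passage: the 'best NUMA node'
    behaviour only applies in 3+ NUMA environments."""
    return numa_nodes >= 3


# Node counts below are illustrative assumptions, not queried from hardware.
systems = {
    "monolithic 16-core CPU": 1,
    "dual-die Threadripper": 2,
    "quad-die Threadripper / EPYC": 4,
}

for name, nodes in systems.items():
    state = "applies" if best_node_policy_active(nodes) else "does not apply"
    print(f"{name} ({nodes} node(s)): best-node policy {state}")
```

Under this rule only the quad-die parts cross the threshold, which matches the observation that dual-die Threadripper is unaffected.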
KaiserWilly - Tuesday, January 15, 2019 - link
You're not understanding how AMD Threadripper is made compared to Intel.

Intel uses a monolithic die approach, where it's architecturally and visually a single CPU. Everything works because there's no NUMA fit at all. There's no NUMA. It just doesn't apply.

AMD uses a multi-die solution that appears to Windows as multiple CPUs (note the graphic in the article showcasing Node0, Node1, etc.), which requires this extra scheduling to operate properly.

The idea that because AMD requires this means they are inferior is also false, it's merely a different implementation that trades occasional bugs like this for an architecture that's highly flexible, not requiring the LCC, HCC, and XCC designs that Intel uses to cope with the extra cores. Microsoft erred in their scheduling, creating a large proportion of the performance decrease when using >2 Nodes. The fact that it can be readily fixed and is not present on Linux lends credence to the suggestion that this is a bug, and not "AMD being terrible."

And god I hope you're relying on more than just drivers to run a CPU, that'd be a nightmare...
PeachNCream - Monday, January 14, 2019 - link
Please step away from your keyboard, sir.
FullmetalTitan - Wednesday, January 16, 2019 - link
Gonna wait here while you figure out how a Windows scheduler problem is not an option because the company also sells game consoles with AMD chips. As if Playstation consoles don't have nearly identical core designs.
Bonus: detail how exactly XBone or PS4 are running into 3+ NUMA environment issues to justify that first point
cheshirster - Wednesday, January 16, 2019 - link
You are so 2007.
Windows is not important for MS any more.
W10 is plagued with violent bugs and no one cares.
FreckledTrout - Monday, January 14, 2019 - link
Do we think it will be solved before Zen2 variants hit the shelves? Anandtech, can we hold Microsoft accountable on this now? It's clearly a MS issue now.
oleyska - Monday, January 14, 2019 - link
I do not know how Zen2 will handle NUMA.
But it will be very different, as every core will share a common DDR controller rather than a distributed one, so NUMA node 1 will never go to NUMA node 2 for DDR access.

For L3 cache, it will be interesting to see if they will ever access cache across dies.
rns.sr71 - Monday, January 14, 2019 - link
'For L3 cache, it will be interesting to see if they will ever access cache across dies.'-
Very good point.
I haven't read any direct confirmation from AMD (so anyone that has, please provide links), but apparently there are two possibilities on this. (1) The 8-core chiplets will have to hop through the IO die to the other chiplet(s), as AMD has said that there is no direct link between chiplets, OR (2) the IO die will have adequate cache of its own to copy all L3 data from all 8-core chiplets on package. The 2nd option would provide the lowest latency but would take up a tremendous amount of die area on the IO hub.
Either way, it sounds like the only NUMA scenario ROME will have is when going from one socket to another.
FreckledTrout - Monday, January 14, 2019 - link
You have a good point. I guess we wait for more details. Having shared L3 would be crazy.
Xajel - Monday, January 14, 2019 - link
Zen2 Epyc doesn't have NUMA in a single socket platform, so the OS will not bother.
NUMA stands for Non-Uniform Memory Access, which is when Core 6, for example, has a different latency to access a specific part of memory than Core 2. It started with multi-socket platforms after the memory controller was moved into the CPU, where a core in any socket has higher latency when it tries to access memory connected to the other socket, so the OS scheduler and even the software must be aware of this (NUMA-aware scheduling). But 1st gen EPYC has this topology within the socket itself, where each 8 cores have their own IMC, and when a die needs access to memory connected to another die it sees higher latency.

Zen2 Epyc doesn't have NUMA in the same socket, as the IMC is in the IO die and all cores in the same socket have the same latency accessing the memory. Though multi-socket Zen2 Epyc will have NUMA.
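A toy model may help make the local-versus-remote distinction concrete. The sketch below is illustrative only: the two latency constants and the function names are made-up placeholders, not measurements of any EPYC or Threadripper part. It simply shows why a thread whose memory is spread over other nodes sees a higher average access cost than one whose memory is all local.

```python
# Placeholder latencies in nanoseconds (assumptions, not measured values).
LOCAL_NS = 80    # memory attached to the core's own node
REMOTE_NS = 140  # memory attached to any other node


def access_latency_ns(core_node: int, memory_node: int) -> int:
    """Latency for one access from a core on core_node to memory on memory_node."""
    return LOCAL_NS if core_node == memory_node else REMOTE_NS


def average_latency_ns(core_node: int, memory_nodes: list) -> float:
    """Average latency when a thread's pages live on the given nodes."""
    total = sum(access_latency_ns(core_node, m) for m in memory_nodes)
    return total / len(memory_nodes)


# Thread on node 0 with all of its pages on node 0.
print(average_latency_ns(0, [0, 0, 0, 0]))  # 80.0
# Same thread with its pages spread evenly across four nodes.
print(average_latency_ns(0, [0, 1, 2, 3]))  # 125.0
```

In a uniform design such as the single-socket Zen2 layout described above, every access would cost the same, so the two cases would collapse to one number.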
lightningz71 - Tuesday, January 15, 2019 - link
While you've got the layout right and the general idea correct, you are missing one key point: NUMA isn't just about access to MAIN memory, it's about access to memory in general. Each chiplet is expected to have 32MB of L3 cache (16 per CCX if they keep a pair of quads). Access to the local L3 will be cheap. Access to a remote L3 will be expensive, likely quite expensive. You need to tell the scheduler to make its best effort to keep related threads together on each chiplet. How would you do that? Using the NUMA architecture. You define each group of cores on a chiplet as a single NUMA node. ROME will likely have 8 NUMA nodes at max, one for each chiplet. If they kept the CCX units as quads, you could realize some performance gains in some corner cases by defining each CCX as a NUMA node. Threadripper 3 should have just as many nodes as it will likely share the same package topology.

Looking at the I/O chiplet of the Matisse sample, if we compare the area of that chip with the area of a Ryzen 1xxx chip, we see that it roughly matches the area of the uncore from Ryzen 1. This doesn't leave any area at 14nm for an L3 cache. I don't expect Matisse to have an L4 cache on the IO chiplet. For ROME and TR3, we can see that the IO die is roughly 4X the Matisse one. Again, that has to account for 4X the uncore of each Ryzen 1 chip, leaving very little room for an L4 cache. There might be some substantial buffering for the memory controllers, but I don't see where it would have a big L4 cache in this generation.

Going forward, once 7nm production scales out, I could see a day where the IO die is moved to an older 7nm process while the chiplets are moved to a newer, higher performance node like 7nm+ or even 5nm. In that case, if they keep the chips the same relative size, there will be plenty of room on the chip for a massive L4 cache. Just keep in mind that, for it to be effective, it not only needs to be at least as large as the combined total of the L3 caches on all eight core chiplets (256MB in the current generation, 8 X 32MB), but it needs to be roughly 4X that size to provide a useful amount of memory bandwidth reduction to perform its main job of being a cache. That would mean that each EPYC and TR3 processor should have an L4 of 1GB in size on package. Now, while that's not impossible using 7nm, that is absolutely massive for an on chip cache in general. It's also not likely to make a huge increase in performance in general use cases. Where it would make the biggest improvement is if the sum of all the active processes' current working sets is able to stay comfortably in that I/O die's L4 cache AND there's a lot of cross chiplet data snooping.
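The sizing argument in this comment is simple arithmetic, spelled out here as a quick check. All three inputs (eight chiplets, 32MB of L3 each, a roughly 4X multiplier) are the commenter's own expectations rather than confirmed specifications; the variable names are just for illustration.

```python
CHIPLETS = 8             # core chiplets per package, per the comment
L3_PER_CHIPLET_MB = 32   # expected L3 per chiplet, per the comment
SIZE_FACTOR = 4          # "roughly 4X" rule of thumb from the comment

combined_l3_mb = CHIPLETS * L3_PER_CHIPLET_MB  # 8 x 32 = 256 MB
l4_target_mb = combined_l3_mb * SIZE_FACTOR    # 256 x 4 = 1024 MB
l4_target_gb = l4_target_mb / 1024             # 1.0 GB

print(f"Combined L3: {combined_l3_mb} MB")
print(f"L4 target:   {l4_target_mb} MB ({l4_target_gb:.0f} GB)")
```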
Dodozoid - Thursday, January 17, 2019 - link
Are you sure that different distances to L3 necessitate NUMA mode? Even Zeppelin has two CCXs with cache in each, and accessing the other CCX's L3 meant a trip through IF. And I don't think that it is transparent to the OS rather than being managed by the chip itself (or does it have a special setting in the Windows scheduler?). Heck, even Intel CPUs with older multi-ring buses had 3 different latencies to different cores' L3 (own, same ring, other ring), and that's before they started with the mesh... I am by no means a programmer or IC engineer or whatever, so I might be completely wrong here...
peevee - Thursday, May 30, 2019 - link
Keeping threads on the same core is good anyway, regardless of memory and L3 architecture, simply because of very expensive L1 and L2 thrashing. All properly written performance-critical software sets thread affinities itself (or at least the Ideal Processor on Windows), not relying on the scheduler. And it must be done before memory allocation, which each thread does for itself, and that relies on a good memory manager allocating memory on the same NUMA node as the thread. The scheduler has no idea which memory each thread uses, and without completely unrealistic interpretation of the code, there is no way for it to learn.
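As a rough illustration of the "set affinity first, allocate afterwards" ordering described above, here is a minimal Python sketch. It is an assumption-laden stand-in: it uses `os.sched_setaffinity`, which Python exposes only on some Unix platforms such as Linux, not the Windows affinity or Ideal Processor APIs the comment refers to, and the helper name `pin_current_process` is hypothetical.

```python
import os


def pin_current_process(cpu: int) -> set:
    """Restrict the calling process to one logical CPU and return the new mask."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available on this platform")
    allowed = os.sched_getaffinity(0)  # pid 0 means the calling process
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):
    # Step 1: pin to the lowest-numbered CPU we are allowed to run on.
    first_cpu = min(os.sched_getaffinity(0))
    print("pinned to:", pin_current_process(first_cpu))
    # Step 2: only now allocate the working buffer, so the allocation
    # happens while the process is already running on its chosen CPU.
    work_buffer = bytearray(1 << 20)
```

Whether the buffer actually ends up on the local NUMA node depends on the operating system's memory manager, which is the dependency the comment itself points out.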
IGTrading - Monday, January 14, 2019 - link
@FreckledTrout well, it was a Microsoft-generated issue from the beginning.

Microsoft is accountable, and I guess that discussing this and making it a popular topic will prompt a quick and effective move from Microsoft.

I'm confident the move will be both to fix this and to investigate those responsible (from within the company) no matter if it was intentional or not, because Microsoft cannot afford such blunders, especially on their core products such as Windows OS.
FreckledTrout - Monday, January 14, 2019 - link
That was what I meant. Just keep this in the news cycle every now and then.
jimjamjamie - Monday, January 14, 2019 - link
Obligatory "just use Linux" comment
IGTrading - Monday, January 14, 2019 - link
@jimjamjamie : You kinda got a point there, mate :) Linux FTW.
FunBunny2 - Monday, January 14, 2019 - link
"Wormer, he's a dead man! Marmalard, dead! NEIDERMEYER......!!!!!!!!!"
-- D-Day
Lord of the Bored - Monday, January 14, 2019 - link
But I like computers that work in the general sense over computers that properly support high-end hardware but stumble over wifi and bluetooth.
edwaleni - Monday, January 14, 2019 - link
Phoronix already tested the Coreprio app and they found small improvements in some apps, some worse. The only app showing a gain was Indigo, which oddly is the "only" app that Wendell can get appreciable improvement in.

Microsoft needs to fix their NUMA support and stop kicking the can down the road. If Linux can support it "out of the box", then Microsoft needs to get up to speed.
Byyo - Monday, January 14, 2019 - link
Those results at Phoronix were suspect. He didn't see the performance regression from Indigo on Win10, for unknown reasons (maybe lingering impact from Coreprio), so there was nothing for Coreprio to fix. He *did* see the perf regression in Win2019, where it had only 50% of the performance, though he didn't go into this discrepancy. Everyone else gets the same results from Indigo.

7-Zip is also impacted the same, though the fix has been more challenging to consistently apply in NUMA mode (though does work): https://bitsum.com/forum/index.php/topic,8526.msg2...
edwaleni - Monday, January 14, 2019 - link
I am not sure what at Phoronix is "suspect". He ran the tool, he ran the tests. Other than Indigo, nothing caught or overtook Linux's performance in any appreciable way.

The NumaPref detail you link to was just posted yesterday.

For those wondering what AMD is doing about it, they opened the ticket, they elevated it at MSFT. Not aware of the timelines involved, but it really is in MSFT's hands.
PeachNCream - Monday, January 14, 2019 - link
*buys 32-core processor for compute-intensive workloads*
*disables multiple cores to achieve acceptable performance*

Makes sense to me. Now fix your junk Microsoft.
coder543 - Monday, January 14, 2019 - link
After all these years, I really wish that AnandTech would do some of their benchmarks on Linux as well. Windows just has really bad performance at certain things, like listing directories with lots of small files, or apparently NUMA scheduling. Intrinsic issues like these make me question the value of the Chrome compilation benchmark and others that I want to care about, since those numbers could be wildly different on Linux. For consumer hardware, perhaps Windows benchmarks are fine, but for reviews of professional-grade desktop hardware, Linux should absolutely have a place in the benchmark results.

At a minimum, on issues like the article addresses, it would make it easier for the reviewer to differentiate hardware problems from software problems when they can look at the benchmark results from two operating systems, instead of only one.
PeachNCream - Monday, January 14, 2019 - link
Your wish is about to be granted. Per Ian in the recent $60 CPU review article located here:

https://www.anandtech.com/show/13660/amd-athlon-20...

"Linux (when feasible)

When in full swing, we wish to return to running LinuxBench 1.0. This was in our 2016 test, but was ditched in 2017 as it added an extra complication layer to our automation. By popular request, we are going to run it again."
Dragonstongue - Monday, January 14, 2019 - link
They should just leave Core 0 out of the equation in the first place for anything but the primary task the user is demanding. For example: I launch a game to play, and Core 0 takes this game as priority; I launch a media player, and it gets assigned to Core 0 by default, etc. The Windows stuff, background processes, auto-launch programs, etc. get sent to all other cores except Core 0. There, problem solved. It should be easy enough to do via a Windows KB update for Vista all the way through to Win 10.
Lakados - Wednesday, January 16, 2019 - link
I have an OLD quad-socket Intel server that is being decommissioned. It has run RHEL 5.x its whole life. I would be half tempted to stick Windows on it just to see if this can be replicated there.
Ozymankos - Saturday, February 9, 2019 - link
oh you ran explorer in windows with all 16 cores?
well that is a true achievement
you shall post it together with
-7 flies in a single strike:))
