
  • blanarahul - Thursday, August 19, 2021 - link

    That’s Intel Thread Director. Not Intel Threat Detector, which is what I keep calling it all day, or Intel Threadripper, which I have also heard.

    never thought my nerdy interest would make me laugh
  • Hifihedgehog - Thursday, August 19, 2021 - link

    One takeaway as far as the rumors we have been operating on are concerned is that Raichu (who has a 90% accuracy rating for his rumors/leaks) would now appear to have been wrong with his recent leak. Sad panda face, I know. I did some analysis elsewhere and here is what I had shared:

    ===

    The information released from Intel seems to invalidate this previous rumor above that I shared some weeks ago.

    The Core i9 11900K operates at a 5.3 GHz single-core boost and gets a score of 623 in Cinebench R20.

    Intel claims a 19% IPC gain with Golden Cove over Cypress Cove (i.e. Rocket Lake's core microarchitecture). If we see the same single-core boost clock speed of 5.3 GHz, that would equate to a score of roughly 741. Let's take a huge moment to stare at this astounding achievement. This is nothing to sneeze at! This puts AMD in a very distant position as far as single-threaded performance is concerned and puts the onus on them to deliver a similar gain with Zen 4. However, switching hats from performance analyst to fact checker, this is nowhere close to the ">810" claim as stated above. To achieve a score of >810, they would need a clock speed of roughly 5.8 GHz (623 points * 1.19 IPC improvement / 5.3 GHz * 5.8 GHz). That, quite frankly, I highly doubt.

    ===

    Link:

    http://forum.tabletpcreview.com/threads/intel-news...

    That said, though, getting roughly two-thirds of the way to the rumored performance is still a colossal jump for Intel and will, at minimum, put AMD in a rather painful position until Zen 4 comes around.
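
    For anyone who wants to check the arithmetic, here's the same projection as a quick sketch (the 623-point baseline, the 5.3 GHz boost and the hypothetical 5.8 GHz clock are the figures assumed in the quoted analysis, not measurements):

        #include <stdio.h>

        int main(void) {
            /* Figures assumed from the quoted analysis, not measurements:
               i9-11900K scores 623 in Cinebench R20 1T at a 5.3 GHz boost,
               and Intel claims a ~19% IPC gain for Golden Cove. */
            double rkl_score = 623.0;
            double ipc_gain  = 1.19;
            double rkl_clock = 5.3;   /* GHz */
            double hyp_clock = 5.8;   /* GHz needed to reach the rumored >810 */

            printf("Projected 1T score at 5.3 GHz: %.0f\n", rkl_score * ipc_gain);                        /* ~741 */
            printf("Projected 1T score at 5.8 GHz: %.0f\n", rkl_score * ipc_gain / rkl_clock * hyp_clock); /* ~811 */
            return 0;
        }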
  • arayoflight - Thursday, August 19, 2021 - link

    The IPC improvement won't be the same across the board, as you can see clearly in the Intel-provided chart. Some workloads will run slower than Cypress Cove and some gain as much as 60%.

    The 19% is an average, indicative figure and is not going to be the same in every benchmark. The >810 might well be true.
  • Wrs - Thursday, August 19, 2021 - link

    The 19% is a median figure. Based on the graphic from Intel, there are tasks that go over 50% faster and tasks that do about the same if not slower than Cypress Cove. How do we know for certain Cinebench is anywhere close to the median?
  • mattbe - Thursday, August 19, 2021 - link

    Where did you see median? It says right on the graph that the 19% is a geometric mean.
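
    To make the distinction concrete, a 19% geometric mean is perfectly compatible with a wide spread of per-workload results. A toy comparison (the speedup numbers are made up purely for illustration):

        #include <stdio.h>
        #include <math.h>   /* link with -lm */

        int main(void) {
            /* Hypothetical per-workload speedups vs. Cypress Cove, sorted ascending. */
            double speedup[] = { 0.98, 1.05, 1.12, 1.19, 1.27, 1.60 };
            int n = sizeof(speedup) / sizeof(speedup[0]);

            double log_sum = 0.0;
            for (int i = 0; i < n; i++)
                log_sum += log(speedup[i]);

            double geomean = exp(log_sum / n);                        /* what the "19%" slide reports */
            double median  = (speedup[n/2 - 1] + speedup[n/2]) / 2.0; /* middle of the sorted list */

            printf("geomean: %.3f  median: %.3f\n", geomean, median); /* ~1.186 vs 1.155 */
            return 0;
        }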
  • mode_13h - Friday, August 20, 2021 - link

    Aside from that, the same point applies. We don't know where Cinebench sits in the range.
  • Hifihedgehog - Friday, August 20, 2021 - link

    As a general rule, though, Cinebench (especially the latest R20 and R23 iterations) has tracked very closely to the mean IPC gains that Intel and AMD have advertised for the last few years.
  • mode_13h - Saturday, August 21, 2021 - link

    Does R20 use AVX-512, though?
  • Spunjji - Monday, August 23, 2021 - link

    @mode_13h: Yes, Cinebench R20 makes use of AVX-512. It's why some of the more AMD-flavoured commentators around the interwebs insisted that R15 was the /correct/ Cinebench to use for comparisons.

    Personally I think it was a good example of the AVX-512 benefit in the "real world", i.e. it's there but not as substantial as Intel's pet benchmarks imply.

    That aside, the lack of AVX-512 on ADL is another data point that suggests Raichu's R20 leak may have been highly optimistic. I always thought it sounded a bit much.
  • mode_13h - Tuesday, August 24, 2021 - link

    > Yes, Cinebench R20 makes use of AVX-512.

    Good to know. Thanks, as always!
  • mode_13h - Friday, August 20, 2021 - link

    All of this focus on IPC seems to miss the fact that we don't know how much power the P-cores burn. So far, Intel's 10 nm nodes haven't enabled it to surpass Ryzen 5000 in terms of perf/W, so it remains an unknown whether "Intel 7" will change that.

    Most people don't use water cooling. To really assess the typical experience of Alder Lake, we'll have to see how well it holds up on a more standard air-cooled setup.
  • Spunjji - Monday, August 23, 2021 - link

    This is my main area of interest, too. I mostly use laptops these days, so I care a lot more about performance at/around TDP than I do about the absolute peaks; especially as what is allowed as an absolute peak varies so much from vendor to vendor. I guess we'll soon see for ourselves how much better "7" is in that regard.
  • Lezmaka - Thursday, August 19, 2021 - link

    Still better than Steam Deck/Stream Deck
  • mode_13h - Thursday, August 19, 2021 - link

    In its i7/i9 incarnation, this CPU will cost more than an entire entry-level Steam Deck!
  • JayNor - Friday, October 15, 2021 - link

    According to a new Tom's Hardware article, "Intel Shows Game Developers How to Optimize CPU Performance for Alder Lake", you can enable AVX-512 on the Alder Lake Golden Cove cores by disabling the efficiency cores in the BIOS.

    So, it looks like they didn't fuse off AVX-512 in hardware...
  • Hulk - Thursday, August 19, 2021 - link

    Yeah! Finally some ADL info from Intel. Settles the debate about ADL needing or not needing a different scheduler than what is currently in Windows.
  • 5j3rul3 - Thursday, August 19, 2021 - link

    It's a big step for Intel against AMD's Ryzen 5000.
  • nico_mach - Monday, August 23, 2021 - link

    I have to say it raises some question marks. No Windows 10 for 'full' value. No upgrading the memory to the next standard. And no AVX-512, which someone will probably care about.
  • SarahKerrigan - Thursday, August 19, 2021 - link

    Golden Cove looks like one heck of a jump.
  • MDD1963 - Thursday, August 19, 2021 - link

    After the last generation's pre-release hype ended with almost 'laughing stock' results at release, at least for gains in gaming, I will withhold judgement until I see some comparisons at/near launch day. (I fear we indeed get an 18-19% gain in IPC and then lose it all to clock speed reductions for a net 'wash' in gaming performance... or, worse yet, a regression!)
  • WaltC - Thursday, August 19, 2021 - link

    Ditto. Wake me when the CPUs ship. Until then, ZZZZ-Z-Z-zzzz. You would think that most people would have grown tired by now of seeing advance info from Intel that somehow never accurately describes the products that do ship.

    Nice to see Anandtech using Intel PR marketing description instead of describing the process node in nm--just because Intel decides that accuracy in advertising really isn't important. Every time I see "Intel's process 7" I cringe... ;) It indicates the extent to which Intel is rattled these days, I guess.
  • SarahKerrigan - Thursday, August 19, 2021 - link

    "The process node in nm" - which structure size should determine this? What structure's geometry in TSMC is 7nm?
  • kwohlt - Thursday, August 19, 2021 - link

    There's nothing inaccurate at all, considering "TSMC 7nm" and "Intel 10nm" and its extensions are product names and not measurements. If the next, yet-to-be-released node known as Intel 7 offers a 20% performance-per-watt improvement over Intel 10nm SuperFin, then that's an improvement large enough to justify lowering the number in the name, just as all the other fabs do.
  • mode_13h - Thursday, August 19, 2021 - link

    > Nice to see Anandtech using Intel PR marketing description instead of
    > describing the process node in nm

    More goes into a fab process than just density. Also, because density can be computed in different ways and Intel doesn't exactly release the raw data you'd need to compute it properly, they have no real choice but to report the manufacturing process as Intel has named it.

    Of course, they should always do so with a link to their article describing what's known about Intel 7.
  • DannyH246 - Thursday, August 19, 2021 - link

    Completely agree. We've had so many articles like this over the last 5 years it's not even funny. Never fear, though: www.IntelTech.com will be here to dutifully report on it as the next best thing.
  • MetaCube - Thursday, August 19, 2021 - link

    Cringe take
  • Wereweeb - Thursday, August 19, 2021 - link

    This will finally use 10nm, so I doubt it. I'm worried about the memory bandwidth and latency tho.

    I'm hopeful that IBM manages to improve upon MRAM until it's a suitable SRAM replacement. DRAM isn't keeping up, so we need more, cheaper L3 cache as a buffer.
  • TheinsanegamerN - Friday, August 20, 2021 - link

    Well, it's a good thing that DDR5, with clock speeds in excess of double what DDR4 can offer, and with promising results of triple that, is arriving as we speak.
  • mode_13h - Saturday, August 21, 2021 - link

    DDR5 will only help with bandwidth. Every time a new DDR standard comes along, latency (measured in ns) ends up being about the same or worse.

    Bigger L3 helps with both bandwidth and latency, but at a cost (in both $ and W).
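
    A quick way to see the "same latency in ns" point, using typical (assumed) timings rather than anything specific to Alder Lake:

        #include <stdio.h>

        /* First-word CAS latency in ns: CL cycles at half the data rate (DDR = 2 transfers/clock). */
        static double cas_ns(double data_rate_mt_s, double cl) {
            return cl * 2000.0 / data_rate_mt_s;
        }

        int main(void) {
            printf("DDR4-3200 CL16: %.1f ns\n", cas_ns(3200, 16));  /* 10.0 ns */
            printf("DDR5-6400 CL32: %.1f ns\n", cas_ns(6400, 32));  /* 10.0 ns, at double the bandwidth */
            return 0;
        }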
  • Spunjji - Monday, August 23, 2021 - link

    If their claims about 15% better power characteristics for 7 are true - and they're not based on some cherry-picked measurements at some unspecified mid-power-level - then they might have the headroom to maintain clocks even with the expanded structures.

    With Ice Lake having been such a flop in this regard, though - and Tiger taking as much as it gave away, depending on power level - I'm with you on waiting to see what they deliver before I get excited. That's in shipping products, too - not some tweaked trial notebook with an unlocked TDP and 100% fan speeds...
  • Unashamed_unoriginal_username_x86 - Thursday, August 19, 2021 - link

    When you say "similar in magnitude to what skylake did" on the Golden Cove page, are you sure you don't mean something like Sandy Bridge? I vaguely remember Skylake being a pretty nominal improvement on the order of 10-15%
  • mode_13h - Thursday, August 19, 2021 - link

    > I vaguely remember Skylake being a pretty nominal improvement on the order of 10-15%

    That would NOT be a nominal improvement! Fortunately, the real number isn't hard to find:

    https://www.anandtech.com/show/9483/intel-skylake-...

    "In our IPC testing, ... we saw a 5.7% increase in performance over Haswell. That value masks the fact that between Haswell and Skylake, we have Broadwell, marking a 5.7% increase for a two generation gap."

    "In our discrete gaming benchmarks, at 3GHz Skylake actually performs worse than Haswell at an equivalent clockspeed, giving up an average of 1.3% performance."
  • Wereweeb - Thursday, August 19, 2021 - link

    Funny how Ryzen got people used to thinking of generational improvements as "10-15%" again. Thankfully, EUV and GAAFETs will make sure the next few generations keep advancing at that pace.
  • mode_13h - Thursday, August 19, 2021 - link

    > are you sure you don't mean something like Sandy Bridge?

    Exactly. The timeframe of "a decade" and the magnitude of the changes they're describing line up with Sandy Bridge.
  • Spunjji - Monday, August 23, 2021 - link

    Definitely referring to Sandy Bridge, as that was a 2011 architecture.
  • zzzxtreme - Thursday, August 19, 2021 - link

    been waiting 10 years for this, assuming this is a breakthrough x86 cpu
  • cheshirster - Thursday, August 19, 2021 - link

    Not this time.
  • shabby - Thursday, August 19, 2021 - link

    Finally we'll see how good intels 10nm is...
  • AdrianBc - Thursday, August 19, 2021 - link

    "E-core will be at ‘Haswell-level’ AVX2 support" seems to be contradicted by the slides from the Intel presentation, which imply that Gracemont does not have FMA, but only separate FADD and FMUL.

    If the Intel slides are correct Gracemont cannot support the complete Haswell instruction set.

    Maybe Gracemont supports only the 256-bit integer instructions added by Haswell over Sandy Bridge and it might also not support the BMI Haswell instructions.

    Also weird is that the original Intel presentation does not contain the terms AVX or AVX2, but only some vague "support for Advanced Vector Instructions".

    So, unless Intel has purposely muddled the presentations for now, it looks like Gracemont and Golden Cove do not have compatible instruction sets, even with AVX-512 disabled.

    If that is true, then disabling AVX-512 must have only one reason: decreasing the manufacturing cost for Alder Lake, by using all the defect chips and reserving the good ones for Sapphire Rapids.
  • AdrianBc - Thursday, August 19, 2021 - link

    After writing the comment above, I looked again at the Gracemont presentation and only now noticed that the same slide that does not show any FMA unit states that it does indeed support the FMA instructions.

    I do not know why 2 FADD + 2 FMUL are shown instead of 2 FMA, like in the slide for Golden Cove.

    Because each FADD is on the same port as an FMUL, this means that Gracemont cannot function like AMD Zen, which can do a separate FADD alongside an FMUL when the full FMA is not needed. If the FADD and the FMUL on a port cannot be executed simultaneously, then they should have drawn it as just an FMA unit, like in the Golden Cove slide.

    In any case, even if Gracemont were compatible with Haswell + the SHA extensions, that still would not make it instruction-compatible with Golden Cove, because there are important additional instructions introduced in Broadwell, Ice Lake and Tiger Lake.
  • TristanSDX - Thursday, August 19, 2021 - link

    "decreasing the manufacturing cost for Alder Lake, by using all the defect chips and reserving the good ones for Sapphire Rapids."
    Alder Lake and Sapphire Rapids are two totally different chips.
  • mode_13h - Thursday, August 19, 2021 - link

    > Designed as its third generation of vector instructions

    Depends on how you're counting. First is definitely MMX. That was extended in a few subsequent CPUs, but they didn't call those extensions MMX2 or anything. MMX was strictly integer, however, and the total vector width was 64 bits. MMX had the annoying feature of reusing the FPU registers, which complicated mixing it with x87 code and basically required a state reset when going from MMX -> x87 code.

    Then, SSE came along and added single-precision floating-point. It also added a distinct set of vector registers, which were 128 bits. Finally, it included scalar single-precision arithmetic operations, beginning the era of x87's obsolescence.

    SSE2 followed with double-precision and integer operations, making MMX obsolete and further replacing x87 functionality.

    SSE3, the wonderfully-named SSSE3, and a couple rounds of SSE4 came along, but all were basically just rounds of various additions to flesh out what SSE/SSE2 introduced.

    Then, AVX was introduced as something of a replacement for SSE. AVX registers are 256 bits. Like SSE, AVX initially just included single-precision floating-point support. And like SSE2, AVX2 added double-precision and integer operations.

    Then, Xeon Phi (2nd gen) and Skylake-SP introduced the first variations on AVX-512 support. You can see what a mess AVX-512 is, here:

    https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AV...

    Anyway, AVX-512 should be considered Intel's FOURTH family of vector computing instructions, in x86. I think the first time they dabbled with vector instructions was in the venerable i860 - a very cool, but also fairly problematic step in the history of computing.

    > (AVX is 128-bit, AVX2 is 256-bit, AVX512 is 512-bit),

    No, not at all. The register width for AVX and AVX2 is 256 bits, as I explained above.

    However, even that is a slight simplification. AVX introduced some refinements in vector programming, such as a more compiler-friendly 3-operand format. Therefore, it was meant to subsume SSE usage, and included support for 128-bit operations. Similarly, AVX-512 introduced further refinements and the capability to use it on 128-bit and 256-bit operands.

    For more, see: https://en.wikipedia.org/wiki/AVX-512#Encoding_and...
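
    To put the register widths in concrete terms, here is roughly how the families line up in intrinsics form (a minimal sketch; compile with the appropriate -msse/-mavx/-mavx512f flags):

        #include <immintrin.h>

        /* SSE: 128-bit registers, 4 floats per operation. */
        __m128 add128(__m128 a, __m128 b) { return _mm_add_ps(a, b); }

        /* AVX/AVX2: 256-bit registers, 8 floats per operation.
           The VEX encoding is also non-destructive: dst, src1, src2 (the "3-operand format"). */
        __m256 add256(__m256 a, __m256 b) { return _mm256_add_ps(a, b); }

        /* AVX-512: 512-bit registers, 16 floats per operation, plus masking,
           32 architectural vector registers, and 128/256-bit forms via AVX-512VL. */
        __m512 add512(__m512 a, __m512 b) { return _mm512_add_ps(a, b); }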
  • mode_13h - Thursday, August 19, 2021 - link

    One more correction:

    > Some workloads can be vectorised – multiple bits of consecutive data all require
    > the same operation, so you can pack them into a single register and perform it
    > all at once with a single instruction.

    Intel's vector instruction extensions aren't strictly SIMD. They include horizontal operations that you don't see in classical SIMD processors or most GPUs.
  • mode_13h - Thursday, August 19, 2021 - link

    > One could argue that if the AVX-512 unit was removed from the desktop
    > cores that they would be a lot smaller

    That's what I thought, but the area overhead it added to a Skylake-SP core was estimated at a mere 11%.

    https://www.realworldtech.com/forum/?threadid=1932...

    Of course, we can't yet know how much of Golden Cove it occupies, but still probably somewhere in that ballpark.
  • mode_13h - Thursday, August 19, 2021 - link

    > Intel isn’t even supporting AVX-512 with a dual-issue

    Perhaps because AVX-512 doubled the number and size of vector registers. So, just the vector register file alone would grow 4x in size.
  • Schmide - Thursday, August 19, 2021 - link

    64-bit packed doubles are in AVX, as are some 64-bit ints. AVX2 filled in a lot of gaps, such as full vector operands and reorders. So as much as AVX2 finished off the 32- and 64-bit int (epi) functions, there was already a fair amount in AVX.
  • Schmide - Thursday, August 19, 2021 - link

    Not to be misleading: there were really no usable int functions in AVX other than load and store.
  • maroon1 - Thursday, August 19, 2021 - link

    Gracemont beats Skylake???? Really? Am I reading the article correctly?

    So these small cores are actually very powerful!!
  • vegemeister - Thursday, August 19, 2021 - link

    The hypothetical 8% increase in peak performance seems like wishful thinking to me. The chart looks like "graphic design" marketing wank, not plotted data. I would only go by the printed numbers. That is, at an operating point that matches Skylake peak performance, Gracemont cores use less than 60% of Skylake's power, and if you ran Skylake at that same power, it would have less than 60% of Gracemont's performance.
  • mode_13h - Thursday, August 19, 2021 - link

    > I would only go by the printed numbers.

    Okay, so are those numbers you used hypothetical, or where did you see 60%?

    Also, there's no fundamental reason why the ISO-power and ISO-performance deltas should match.
  • mode_13h - Thursday, August 19, 2021 - link

    Indeed. But, remember that it's a Skylake from 2015, fabbed on Intel's original 14 nm node, and it's an integer workload they measured. If they measured vector or FPU workloads, the results would probably be rather different.
  • Spunjji - Monday, August 23, 2021 - link

    Indeed. Based on how Intel usually do their marketing, I'm not expecting anything revolutionary from those cores. Maybe I'll be surprised, but I'm expecting mild disappointment.
  • mode_13h - Tuesday, August 24, 2021 - link

    Having already bought into the "Atom" series at Apollo Lake, for a little always-on media streaming server, I'm already thrilled! Tremont was already a bigger step up than I expected.
  • Spunjji - Tuesday, August 24, 2021 - link

    Fair - I've just been a bit burned! Last time I used an Atom device was Bay Trail, and at the time there was a big noise about its performance being much better than previous Atom processors. The actual experience was not persuasive!
  • Silver5urfer - Thursday, August 19, 2021 - link

    Too many changes to the x86 CPU topology. They are making this CPU heavily dependent on the OS side with such insane changes to the scheduler system - P cores, E cores, and then Hyper-Threading only on the P cores. On top of all this the DRAM system must be good, or else all of that big 19% IPC boost will be wasted just like on Rocket Lake. Finally, Windows 11 only? dafaq.

    I have my doubts about this Intel Thread Director and the whole ST performance story, along with gaming SMT/HT performance. Until the CPU is out it's hard to predict these things. Also funny they are simply adding the older Skylake cores to the processor in a small format without HT, while claiming this ultra hybrid nonsense; it seems mostly tuned for a mobile processor rather than a desktop system, which is why there are no trash cores on the HEDT Sapphire Rapids Xeon. And which enterprise wants to shift to this new nonsense of an x86 landscape? On top of that we have Zen 4 peaking with the 96-core/192-thread hyperbeast Genoa, which also packs AVX-512.

    I'm waiting on Intel, and also on AMD for their 3D V-Cache Zen 3 refresh. Plus, whoever owns any recent processor from Intel or AMD should avoid this hardware like the plague; it's too much of a beta product, with its OS requirements, there will be DRAM pricing problems for mobos and RAM kits, and PCIe 5.0 is just damn new and has no usage at all right now. It all feels just like Zen 1, when AMD came out with the NUMA design and refined it well by Zen 3. I doubt AMD will have any issue with this design. But one small bit of good news is some competition?
  • Silver5urfer - Thursday, August 19, 2021 - link

    Also "scalable", lol. This design peaks out at 8C/16T plus 8 small cores, while Sapphire Rapids is at 56 cores/112 threads. AMD's Zen 4? 96C/192T; lmao, that battle is going to be good. Intel is really done with x86 is what I get from this, copying everything from AMD and ARM: memory interconnects, big.LITTLE nonsense. Just release the CPU and let it rip, Intel; we want to see how it works against 10900Ks and 5900Xs.
  • mode_13h - Friday, August 20, 2021 - link

    > Also funny they are simply adding the older Skylake cores
    > to the processor in a small format without HT

    They're not Skylake cores, of course. They're smaller & more power-efficient, but also a different uArch. 3+3-wide decode, instead of 4-wide, and no proper uop cache. Plus, the whole thing about 17 dispatch ports.

    If you back-ported these to 14 nm, they would lose their advantages over Skylake. If they forward-ported Skylake to "Intel 7", it would probably still be bigger and more power-hungry. So, these are different, for good reasons.
  • vyor - Friday, August 20, 2021 - link

    I believe they have a uOP cache though?
  • mode_13h - Saturday, August 21, 2021 - link

    No, Tremont and Gracemont don't have a uop cache. And if Goldmont didn't, then it's probably safe to say that none of the "Atom" cores did.

    The article does mention that some aspects of the instruction cache make it sound a bit like a uop cache.
  • Silver5urfer - Saturday, August 21, 2021 - link

    I see only one reason - Intel was forced to shrink SKL and shove it into this design because their fabs are busted. Their Rocket Lake is a giant power hog - insane power draw. Intel really shined up until the 10900K: super fast cores, an ultra-strong IMC that can handle even 5000 MHz and any DRAM, solid SMT. High power, but a worthwhile trade-off.

    With the RKL backport Intel lost IMC leadership, SMT performance, ST performance (due to memory latency) AND efficiency PLUS overclockability. That was when I saw Intel's armor cracking. HEDT was dead and so was Xeon, but the one thing that kept mainstream LGA1200 standing was the super strong ring bus, even on RKL.

    Now fast-forward to 10SF, or Intel 7, or whatever they call it. No more high-speed IMC - now it's even a double whammy due to the dual ring system, with the ring shared by all the connected cores. I doubt these SKL cores can manage high-speed DDR4 RAM over 3800 MHz, which is why they are mentioning dynamic clocking for memory; this will have a geared memory system for sure. A heavy efficiency focus due to the laptop market and pressure from Apple and AMD. No more big-core SMT/HT performance. Copying ARM's technology onto x86 is pathetic. ARM processors never did SMT; x86 had this advantage. But Intel is losing it because their 10nm is a dud. Look at the leaked PL1/PL2/PL4 numbers. They don't change at all; they crammed in 8 phone cores and still it keeps going higher and higher.

    Look at HEDT: Sapphire Rapids, a tile approach - they literally copied everything they could from AMD and tacked on HBM for HPC money. And I bet the power consumption will be insanely high, since there are no phone cores to cheat with, only big real x86 cores. Still, they are coming back. By this point Intel would normally have released "Highest Gaming Performance" marketing for ADL; so far there's none, and release is just 2 months away. RKL had that campaign 2 months before launch, and CFL and CML all had it. This one doesn't, and they are betting on I/O this time.

    Intel has to show the performance. And it's not like AMD doesn't know this, which is why Lisa Su showed off a strong 15% gaming boost. And remember how AMD showcases its CPUs? Direct benchmarks against Intel's top parts - 9900Ks and 10900Ks all over the place. No sign of 5900X or 5950X comparisons from Intel.
  • mode_13h - Sunday, August 22, 2021 - link

    > Intel was forced to shrink SKL and shove it into this design because
    > their fabs are busted.

    Well, no. Fabs aside, there's no SKL core in there.

    > ARM processors never did SMT

    Actually, a couple of them did. Not the mainstream phone cores, but their Cortex-A65 is one.

    https://developer.arm.com/ip-products/processors/c...
  • Yojimbo - Thursday, August 19, 2021 - link

    "For users looking for more information about Thread Director on a technical, I suggest reading this document and going to page 185, reading about EHFI – Enhanced Hardware Frequency Interface. It outlines the different classes of performance as part of the hardware part of Thread Director."

    Which document? I think you meant to include a link.
  • eastcoast_pete - Thursday, August 19, 2021 - link

    Hey, Intel's Marketing department: instead of Thread Director, how about "Thread Mender"? AMD rips threads, so...
    If you end up using it, I want a freebie!
  • GeoffreyA - Thursday, August 19, 2021 - link

    "The Weaver" or "Threadweaver"
  • Duncan Macdonald - Thursday, August 19, 2021 - link

    Still a monolithic design, which limits the total number of cores. Intel seems to have given up on the HEDT market as it cannot match the 32- or 64-core models from AMD (let alone the 96-core models expected later this year).
  • kwohlt - Thursday, August 19, 2021 - link

    Sapphire Rapids will be up to 56 cores, so I imagine they'd just repurpose Sapphire Rapids (with AVX-512) for the HEDT space.
  • drothgery - Thursday, August 19, 2021 - link

    Intel's HEDT CPUs have always been derived from server chips (even if in some generations they weren't quite repurposed versions of the same silicon with different power and frequency bins and different features enabled), not desktop chips. And as noted by the previous reply to your comment, Intel is using a multi-chip module setup there.
  • GNUminex_l_cowsay - Thursday, August 19, 2021 - link

    Windows has been discriminating between different kinds of tasks for scheduling purposes since Vista and has had big.LITTLE support since Windows. So, I was wondering why there were rumors of Microsoft putting lots of work into scheduling for Alder Lake. This Thread Director thing explains it.
  • Oxford Guy - Thursday, August 26, 2021 - link

    Doing it and doing it well...

    Some claimed, for instance, that Piledriver is much more efficient under optimized Linux — in terms of the role of the scheduler.
  • Wereweeb - Thursday, August 19, 2021 - link

    "Windows only"

    Yeah sure, they say this because they're 69'ing Microsoft, but they can't exactly prevent open source software from developing, so in a couple of months Linux will probably have better support than Windows itself, especially given Microsoft's incompetence.
  • mode_13h - Friday, August 20, 2021 - link

    > they can't exactly prevent open source software from developing

    Well, they don't have to release all of the information that would be needed for Linux to do the same thing.

    > Linux will probably have better support than Windows itself

    Intel is one of the largest contributors to the Linux kernel. No doubt, any support for their Thread Director will be developed by them.
  • Obi-Wan_ - Thursday, August 19, 2021 - link

    Is it likely that Alder Lake will consume noticeably less power when near idle or during video playback/streaming, or are existing CPUs already quite efficient in these cases?

    I'm thinking of an HTPC that should be as silent as possible when idling and streaming, but also has a high power budget (effectively a noise budget) when gaming, for example.
  • mode_13h - Friday, August 20, 2021 - link

    If you really care about minimizing idle power, then I think you probably need to use LPDDR memory. Integrated graphics should also be a priority.

    Another thing people miss is that the PSU should not only be a high-efficiency model, but also not heavily over-spec'd. Power supplies lose a lot of efficiency when you run them well below peak load.
  • mode_13h - Thursday, August 19, 2021 - link

    > The desktop processor will have sixteen lanes of PCIe 5.0

    I'll believe it when I see it. Let's not forget that it took Intel 2 generations to get PCIe 4.0 working! They had to reverse course on enabling it in Comet Lake, and that was years after POWER, Ryzen, and some ARM CPUs had it.

    I also don't see the value of having it now, given that we know DG2 is going to be only PCIe 4, nor are we aware of any other upcoming GPUs that will support 5.0.
  • Bp_968 - Sunday, August 22, 2021 - link

    Pcie 5 is twice as fast and backwards compatible. Why would you *not* want it in your system if possible? It's not like you can add it later.

    Pcie4 is mostly useless for GPUs already (regardless of whether they "support" it or not), so pcie5 isn't going to improve anything GPU-wise, just like nothing improved with pcie4.

    But where it *will* be an improvement is in peripherals (the x4 channel to the chipset just got twice as fast) and support for pcie5 storage. Oh, and easier support for high speed interconnects like USB 3-4 and 10gb ethernet.

    Personally I'd prefer the layout be different. x16 or x8/x8 (plus 2 x4 slots for NVMe, I think) for the pcie5 is OK, but on the pcie4 side I'd like to see x16 and x8/x8 as options as well. That way you could use the pcie5 slots for other stuff and put the GPU in an x16 or x8 pcie4 slot (and it would perform just as well). Support for 4+ NVMe drives will be nice in the future: one or two pcie5 high-speed units and then spots for slower SSDs for mass storage (x2 or x4 pcie4 being the "slow" slots).
  • mode_13h - Monday, August 23, 2021 - link

    > Pcie 5 is twice as fast and backwards compatible.
    > Why would you *not* want it in your system if possible?

    Mainly due to board and peripheral costs, I think. Beyond that, power dissipation should be well above PCIe 4.0 and it could also be a source of stability issues.

    > But where it *will* be an improvement is in peripherals
    > (the x4 channel to the chipset just got twice as fast)

    This is actually the one place where it makes sense to me. The chipset can be located next to the CPU, so that hopefully no retimers will be needed. And the additional power needed to run a short x4 link @ 5.0 speeds hopefully shouldn't be too bad. When leaks first emerged about Alder Lake having PCIe 5.0, I suspected it was just for the chipset link.

    > support for pcie5 storage.

    By the time there are any consumer SSDs that exceed PCIe 4.0 x4 speeds, we'll already be on a new platform. It took over a year for PCIe 4.0 SSDs to finally surpass PCIe 3.0 x4 speeds, and many still don't.

    > easier support for high speed interconnects like USB 3-4

    The highest-rated speed for USB4 is PCIe 3.0 x4. However, even a chipset link of PCIe 4.0 x4 will mean you can support it with bandwidth to spare. That said, I think the highest-speed USB links are typically CPU-direct, in recent generations.

    > 10gb ethernet.

    You can already do that with a PCIe 4.0 x1 link.
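
    For reference, the rough per-lane numbers behind all of the above (128b/130b line coding for gens 3/4/5; packet overhead ignored, so real-world figures are a bit lower):

        #include <stdio.h>

        /* Raw per-lane throughput in GB/s: GT/s * 128/130 line coding / 8 bits per byte. */
        static double lane_gb_s(double gt_s) { return gt_s * (128.0 / 130.0) / 8.0; }

        int main(void) {
            printf("PCIe 3.0 x4 : %5.2f GB/s  (USB4's ceiling for tunneled PCIe)\n",        4 * lane_gb_s(8));
            printf("PCIe 4.0 x1 : %5.2f GB/s  (~15.8 Gb/s, more than enough for 10GbE)\n",  1 * lane_gb_s(16));
            printf("PCIe 4.0 x4 : %5.2f GB/s  (a 4.0 x4 chipset-class link)\n",             4 * lane_gb_s(16));
            printf("PCIe 5.0 x16: %5.2f GB/s  (the headline '64 GB/s', minus coding overhead)\n", 16 * lane_gb_s(32));
            return 0;
        }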
  • Spunjji - Monday, August 23, 2021 - link

    > But where it *will* be an improvement is in peripherals
    > (the x4 channel to the chipset just got twice as fast)

    It doesn't use PCIe 5.0 for the chipset link, so it doesn't even have that advantage. I genuinely think it's premature. I guess we'll have to see what motherboard costs look like to know whether it was worth it for future-proofing, or whether it's just spec wankery.
  • mode_13h - Tuesday, August 24, 2021 - link

    > It doesn't use PCIe 5.0 for the chipset link

    I'm pretty sure they didn't specify that, one way or another. I'm pessimistic, though. Then again, didn't Rocket Lake have a PCIe 4.0 x8 link to the chipset? If so, moving up to PCIe 5.0 x4 is plausible.

    > or whether it's just spec wankery.

    It's definitely wankery. I'm just waiting for them either to walk it back, a la Comet Lake's PCIe 4.0 support, or for users to encounter a raft of issues, once some PCIe 5.0 GPU is finally released and people try to actually *use* the capability.
  • Spunjji - Friday, August 27, 2021 - link

    All the resources I'm finding online say it's a DMI 4.0 x8 link to the chipset, so the same as Rocket Lake. Personally I think that's going to be plenty for the vast majority of their users, assuming they follow up at some point in the not-too distant future with an up-to-date HEDT platform for the users who need more.
  • mode_13h - Saturday, August 28, 2021 - link

    That's a shame, because the DMI link is the one place where Intel could've gotten practical benefits from using PCIe 5.0, right away.
  • mode_13h - Thursday, August 19, 2021 - link

    > On the E-core side, Gracemont will be Intel’s first Atom processor to support AVX2.

    Finally. It's about f'ing time, Intel.

    > desktop processors and mobile processors will now have AVX-512 disabled in all scenarios.
    > ...
    > If AMD’s Zen 4 processors plan to support some form of AVX-512 ... we might be in
    > some dystopian processor environment where AMD is the only consumer processor
    > on the market to support AVX-512.

    LOL! Exactly! I wouldn't call it "dystopian", exactly. Just paradoxical.

    And now that Intel has been pushing AVX-512 adoption for the past 5 years, there should actually be a fair amount of software & libraries that can take advantage of it, making this the worst possible time for Intel to step back from AVX-512! Oh, the irony would be only too delicious!

    > Intel is also integrating support for VNNI instructions for neural network calculations.
    > In the past VNNI (and VNNI2) were built for AVX-512, however this time around Intel
    > has done a version of AVX2-VNNI for both the P-core and E-core designs in Alder Lake.

    Wow. That really says a lot about what a discombobulated mess the development of Alder Lake must've been! They thought the E-cores would be a good area-efficient way to add performance, but then AVX-512 probably would've spoiled that. So, then they had to disable AVX-512 in the P-cores. But, since that would hurt deep learning performance too much, they had to back-port VNNI to AVX2!

    And then, we're left to wonder how much software is going to bother supporting it, just for this evolutionary cul-de-sac of a CPU (presumably, Raptor Lake or Meteor Lake will finally enable AVX-512 in the E-cores).
  • Gondalf - Thursday, August 19, 2021 - link

    Have you realized this SKU was designed for 7nm?? ...and then backported to 10nm???
    Rocket Lake number two.
  • TomWomack - Thursday, August 19, 2021 - link

    VNNI is four very straightforward instructions (8-bit and 16-bit packed dot-product, with/without saturation), so the backport is unlikely to have been difficult.
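
    As a rough scalar model of what the 8-bit variant (VPDPBUSD) computes per 32-bit lane - my reading of the instruction, so check the SDM for the exact widening/saturation rules:

        #include <stdint.h>

        /* One 32-bit lane of a VPDPBUSD-style operation: dot product of 4 unsigned bytes
           with 4 signed bytes, accumulated into a 32-bit integer. The real instruction
           does this for every 32-bit lane of a vector register in one go. */
        static int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
            int32_t sum = 0;
            for (int i = 0; i < 4; i++)
                sum += (int32_t)a[i] * (int32_t)b[i];
            return acc + sum;
        }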
  • mode_13h - Thursday, August 19, 2021 - link

    Yeah, but it implies some chaos in the design process.

    Also, my question about how well-supported it will be stands. I think a lot of people aren't going to go back and optimize their AVX2 path to use it. Any effort spent on new instructions is likely to go toward AVX-512.
  • Spunjji - Monday, August 23, 2021 - link

    If they kill AVX-512 in consumer with ADL only to bring it back in the next generation, I shall be laughing a hearty laugh. Another round of "developer relations" funding will be needed...

    Personally I think they never should have brought it to consumer.
  • mode_13h - Tuesday, August 24, 2021 - link

    > I think they never should have brought it to consumer.

    I have my gripes against AVX-512 (mostly, with regard to the 14 nm implementation), but it's not all bad. I've read estimates that it only adds 11% to the core size of Skylake-SP (excluding the L3 cache slice & such). It was estimated at about 5% of a Skylake-SP compute tile. So, that means less than 5% of the total die size. So, it's probably not coming at too high a price.
  • Spunjji - Friday, August 27, 2021 - link

    That's fair - my reasons for thinking they shouldn't have done it are more related to marketing and engineering effort than die space, though.

    They put in a lot of time and money to bring a feature to a market that didn't really need it, including doing a load of "developer relations" stuff to develop some cringe-worthy edge-case benchmark results, alongside a bunch of slightly embarrassing hype (including the usual sponsored posters on comment sections), all to lead up to this quiet little climb-down.

    Seems to me like it would have made more sense to designate it as an Enterprise Grade feature - an excuse to up-sell from the consumer-grade "Xeon" processors - and then trickle it down to consumer products later.
  • mode_13h - Saturday, August 28, 2021 - link

    > Seems to me like it would have made more sense to designate it as an Enterprise
    > Grade feature ... and then trickle it down to consumer products later.

    Yeah, that's basically what they did. They introduced it in Skylake-SP (if we're not counting Xeon Phi - KNL), and kept it out of consumers' hands until Ice Lake (laptop) and Rocket Lake (desktop). It seems pretty clear they didn't anticipate having to pull it back, in Alder Lake, when the latter two were planned.
  • mode_13h - Saturday, August 28, 2021 - link

    BTW, you know the Skylake & Cascade Lake HEDT CPUs had it, right? So, the whole up-sell scheme is what they *actually* did!
  • TristanSDX - Thursday, August 19, 2021 - link

    If ADL has features disabled, like part of the L2 cache or AVX-512, then it is interesting whether the presented 19% IPC growth applies to ADL or to SPR.
    AMD Zen 3 will definitely have AVX-512. BIG shame on you, Intel, for disabling it, even for SKUs without small cores.
  • Gondalf - Thursday, August 19, 2021 - link

    You mean 5nm Zen 4 will have AVX-512.
    Anyway, wait and see whether it's only in server cores or also in consumer parts; sure, 5nm will give room for AVX-512 in desktop CPUs, but it is not certain.
    The funny thing is that we are facing a tight situation for both Intel and AMD.
    AMD cannot go all-in on 5nm because there are not enough wafers around, and Intel has to wait for 7nm for new designs.
    Interesting times. If roadmaps are true, both AMD and Intel will be on a 5nm-class process at around the same time, sometime at the end of 2022.
    We'll see which of the two contenders comes out best.
  • JayNor - Thursday, August 19, 2021 - link

    Looks like a Sapphire Rapids HEDT part would be Intel's solution for pro consumers who want AVX-512. It would include bfloat16 support and AMX tiled matrix operations, which have not been available previously.

    An eight-core Golden Cove HEDT chip with its dual AVX-512 and tiled-matrix bfloat16 units enabled sounds like a decent upgrade from Ice Lake HEDT.
  • mode_13h - Friday, August 20, 2021 - link

    Does SPR have BFloat16 in AVX-512, or just via AMX? I thought its AVX-512 still hadn't fully caught up with Cooper Lake.
  • Kamen Rider Blade - Thursday, August 19, 2021 - link

    The desktop processor will have sixteen lanes of PCIe 5.0, which we expect to be split as x16 for graphics or as x8 for graphics and x4/x4 for storage. This will enable a full 64 GB/s bandwidth. Above and beyond this are another four PCIe 4.0 lanes for more storage. As PCIe 5.0 NVMe drives come to market, users may have to decide if they want the full PCIe 5.0 to the discrete graphics card or not

    Why won't they allow bifurcation of PCIe 5.0 into x12 + x4 as an option?

    x12 PCIe lanes are part of the PCIe spec; it should be better supported.

    Same with PCIe Gen 5.0 x12 + x2 + x2.

    That can offer a lot of flexibility in end-user setups.
  • mode_13h - Friday, August 20, 2021 - link

    The reality is that consumers don't need PCIe 5.0 x16. The benefits of even 4.0 x16 are small (but certainly real, in several cases).

    IMO, the best case for PCIe 5.0 would be x8 + x8 for multi-GPU setups. This lets you run dual-GPU, with each getting the same bandwidth as if it had a 4.0 x16 link.

    Unfortunately, they seem to have overlooked that obvious win, and all for the sake of supporting a use case we certainly won't see within the life of this platform: an SSD that can actually exceed 4.0 x4 speeds.
  • Dug - Friday, August 20, 2021 - link

    As long as I can have 3 SSDs that can run at full PCIe 4.0 speed, I'll be happy.
  • mode_13h - Thursday, August 19, 2021 - link

    The Thread Director is intriguing. I wonder how much of the same information can be gleaned from the performance counter registers, although having an embedded microcontroller analyze it saves the OS from the chore of doing so.

    Can it raise interrupts, though? If not, then I don't see much point to enabling performance characterization in 30 microseconds, as that's way shorter than an OS timeslice.

    It should be an interesting target for new sidechannel attacks, as well.
  • Jorgp2 - Thursday, August 19, 2021 - link

    Isn't it hardware feedback interface, not hardware frequency interface?
  • eastcoast_pete - Thursday, August 19, 2021 - link

    To me, the star of the CPU cores is the "little" one, Gracemont. Question about ADL in ultrabooks: why not an SoC with 4, 6, or 8 Gracemont cores plus some Xe graphics, at least for the lower end? For most regular business use cases, that'll do just fine. The addition of AVX/AVX2 also means that certain effects for video conferencing, such as virtual backgrounds (Teams, others), are now possible with these beefed-up Atoms.
    And, on the other end of the spectrum, I agree with Ian that a CPU with 32 or more Gracemont cores would work well if you want to run a lot of threads within a reasonable power envelope. @Ian: any chance you can get your hands on one of the CPUs specified for 5G base stations? Even the current, Tremont-based ones are exactly that: many Atom cores in one specialized server CPU. Would be nice to see how those go.
  • eastcoast_pete - Thursday, August 19, 2021 - link

    To be very precise: I meant an SoC without any "Cove" cores, just 4 or more Gracemonts. It'll do for many, especially business uses.
  • mode_13h - Friday, August 20, 2021 - link

    Yeah, we get it. That's their standard chromebook-level SoC, actually. Check out Jasper Lake:

    https://ark.intel.com/content/www/us/en/ark/produc...
  • mode_13h - Friday, August 20, 2021 - link

    > Why not an SoC with 4, 6, or 8 Gracemont cores plus some Xe Graphics

    If history is any guide, Intel will *certainly* release low-end CPUs with only Gracemont cores. I'm not sure if they'll do the 6-core or 8-core variants, but definitely 4-core + probably a 32-EU Xe iGPU.
  • ifThenError - Friday, August 20, 2021 - link

    +1 for a high-core-count-small-core chip!

    It's a shame the current Snow Ridge chips are not available to consumers. That leaves only Ryzen U-series for multithreaded energy aware computing. Not saying these are bad, but an alternative wouldn't hurt.
  • mode_13h - Saturday, August 21, 2021 - link

    > +1 for a high-core-count-small-core chip!

    They already do that. The Atom-branded processors currently ship in up to 24-core configurations, targeted at applications like 5G basestations. The Snow Ridge you mention are based on Tremont cores:

    https://ark.intel.com/content/www/us/en/ark/produc...

    Interestingly, it's limited to just dual-channel DDR4, but it has 32-lanes of PCIe 3.0 support. Presumably, a refresh with Gracemont would use DDR5 and at least PCIe 4.0.

    > It's a shame the current Snow Ridge chips are not available to consumers.

    Well, they're BGA. So, you probably don't want a bare CPU. What you'd need to find is someone selling them on boards in a standard PC form factor. I already checked Supermicro, but their Atom boards feature older models. Didn't find anything from ASRock Rack or Gigabyte, either.
  • ifThenError - Saturday, August 21, 2021 - link

    And that is exactly the point. I know there are CPUs with more Atom cores out there, and I'd be perfectly fine with a small mainboard with a soldered chip, or a NUC-type barebone. But they are simply not sold to consumers, and this is one major disappointment.

    The consumer line just ends at 4 cores, and everything else is big-cores, bigger-coolers junk.
  • mode_13h - Sunday, August 22, 2021 - link

    Well, you can find boards with higher core-count Atoms, but I just didn't see any with Snow Ridge. Maybe it's just a matter of time, or maybe we'd need to look a little harder.
  • ifThenError - Sunday, August 22, 2021 - link

    Do you have an example? The only ones I'm aware of are with totally outdated cores like Apollo Lake...
  • mode_13h - Monday, August 23, 2021 - link

    Well, here are some boards with Atom C-series, but I think they're still featuring Goldmont/Goldmont+ cores.

    https://www.supermicro.com/products/motherboard/At...

    The biggest they have is based on the 16-core C3958. I'd keep an eye out, to see if they release a new generation with the newer Tremont-based P-series.
  • ifThenError - Monday, August 23, 2021 - link

    Yes, these are the ones I meant. I mixed up the names, though: they are not Apollo Lake but Denverton. Still, it's all Goldmont cores (not even Goldmont Plus), so they are already 2 generations behind and on a much inferior process node.

    Unfortunately I don't have hopes for a Tremont alternative. The current P-series are all without iGPU, so it's only for headless servers. I guess there's little chance we'll see any systems available with these chips any time soon...
  • mode_13h - Tuesday, August 24, 2021 - link

    > The current P-series are all without iGPU

    So are the C-series!

    If you want more than 4 cores, you're talking about a server SoC. So, that means getting it on a server board that will have a BMC anyhow.

    If you're cool with a consumer version, then we can hope they up the core count of their Chromebook SoCs to 6 or 8, for Gracemont.
  • name99 - Thursday, August 19, 2021 - link

    "Intel’s Thread Director controller puts an embedded microcontroller inside the processor such that it can monitor what each thread is doing and what it needs out of its performance metrics. It will look at the ratio of loads, stores, branches, average memory access times, patterns, and types of instructions."

    People might be interested to know that Apple has done this for years (I don't know about ARM).

    The Apple scheme has many moving parts but these include
    - tracking how much work is done by Fetch, Decode and Execute. The first two can estimate based on number of instructions, the third takes account of the type of instruction.

    - the scheme is even sophisticated enough (at least the patent talks about this) that the weights given to each of these pieces are variable to match the characteristics of the manufactured chip. Each SoC is tested and the precise weights are fused into the chip after testing.

    - this means that the SoC can calculate things like instantaneous power usage. This is used at the overall SoC level (to limit battery current draw) and at the per-execution-unit level (eg to halt the SIMD pipeline for a cycle every few cycles if some thermal or power constraint is being exceeded). You will notice this is the equivalent of Intel's frequency throttling for AVX-512, but much nicer because it is done on demand, purely to the level needed, and without slowing down the rest of the core or requiring a slow transition between faster and slower frequencies.

    - there is also tracking of where L1 cache fills come from. If a lot come from the E-cores, the E-core frequency is boosted. If a lot come from DRAM, then the fabric frequency and DRAM frequency are boosted.

    - behind everything, for *most purposes* the primary metric is nJ/instruction. The scheduler does many things in the obvious way you would expect (background threads on E-cores, track progress vs deadline and ramp core performance up or down depending on how that is doing); but one non-obvious thing is that code that is inefficient (ie nJ/instruction is too high) and that is not otherwise protected by the OS will be pushed to a lower frequency or to an E-core. This might sound bad, but mainly what it's saying is
    + if you're constantly waiting on DRAM, then running the core at high frequency does you no good anyway
    + if you're not running very wide (hard to predict branches, or long dependency chains) you can't take advantage of the big core anyway, so why waste power keeping you there?

    Presumably Intel's scheme at least covers all these sorts of bases.
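
    (To make the nJ/instruction idea concrete, a toy demotion heuristic might look like the sketch below - purely illustrative, with made-up thresholds; it is not Apple's or Intel's actual logic.)

        #include <stdint.h>
        #include <stdbool.h>

        /* Toy model: a thread that burns a lot of energy per retired instruction and
           isn't extracting much ILP gains little from a P-core, so demote it.
           The counters and thresholds here are invented for illustration. */
        typedef struct {
            uint64_t instructions_retired;
            uint64_t cycles;
            double   energy_nj;   /* energy attributed to the thread over the sample window */
        } sample_t;

        static bool prefer_e_core(const sample_t *s) {
            double ipc         = (double)s->instructions_retired / (double)s->cycles;
            double nj_per_inst = s->energy_nj / (double)s->instructions_retired;
            return ipc < 1.0 && nj_per_inst > 2.0;   /* made-up cutoffs */
        }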

    One complication Apple has, that I assume Intel/Windows will not have (but it's not clear), is the use of clustering. Clustering sounds great, as does that huge low-latency shared cache. But it comes at the cost of, as far as I can tell, a common frequency for the entire cluster. (If CPUs were at different frequencies, there'd have to be a cross-frequency-domain stage when communicating with the shared L2, and that would add noticeable latency.)
    So the OS scheduler doesn't just have the job of scheduling each thread to the optimal core at optimal DVFS, it also has to pack 4 optimal [as a unit] threads to a cluster...
    I can't tell if Intel's scheme runs their small cores that way, as a cluster of 4 sharing an L2 (and thus sharing frequency). If so, how the OS scheduler handles this is something to keep an eye on for both Windows and Linux.

    BTW there are very recently published patents that suggest Apple may be moving away from this, to a scheme of private L2s and a shared per-cluster L3!
    https://patents.google.com/patent/US10942850B2
    That's something to keep an eye on for the A15 and M2...
  • mode_13h - Friday, August 20, 2021 - link

    Thanks for the info.

    What do you mean by "nJ/instruction" ? Is that the ratio of branches vs. non-branch instructions? If not, then what does it have to do with DRAM latency? Or was that a reference to the prior paragraph?

    Where do you read this stuff?
  • name99 - Friday, August 20, 2021 - link

    nanoJoules/instruction. ie energy per instruction

    This info is acquired from reading massive numbers of Apple patents, validated as much as possible by experiments run on M1.
  • mode_13h - Saturday, August 21, 2021 - link

    Wow. My eyes glaze over, trying to read patents. I'm sure there are better and worse ones, but they're often written in ways that drain the joy out of the most interesting ideas.

    Thanks for sharing!
  • jospoortvliet - Sunday, August 22, 2021 - link

    Indeed super interesting!
  • mode_13h - Thursday, August 19, 2021 - link

    I wonder if they did anything to the decoder around SMT or multiple instruction streams. In Tremont, it seemed like the way they used a 6-wide decoder was as two 3-wide decoders, where each would work on a separate branch target.

    > the L2 BTB (branch target buffer) has grown to well over double with the
    > structure increased from 5K entries to 12K entries

    Can someone refresh us on the function of a BTB? Is it like a cache that stores the target address of each recent branch instruction, so that speculative execution doesn't have to wait for the target to be computed (if not a fixed target)?

    > actually eliminating instructions that otherwise would have to actually
    > emitted to the back-end execution resources.

    Huh? Seems like an editing error. Can anyone elaborate?

    > Intel still continues to use merged execution port / reservation station design

    Someone please remind us what a reservation station is?

    > On the integer side of things, there’s now a fifth execution port and pipeline with
    > simple ALU and LEA capabilities

    In this case, I presume LEA means "load effective address" and is used to compute memory addresses (potentially involving a multiply, an add, a constant offset?). Is that correct? And does the above statement mean that each of those ports can do simple ALU *or* LEA operations?

    > Intel has improved the prefetchers

    Yes, and the article text didn't even mention the bullet point in the slide about feedback-based prefetch-throttling! I'm reminded of how ARM's N2 can throttle back prefetching, during periods of memory contention. Perhaps Intel came to the same conclusion that overzealous prefetchers can starve cores running memory-intensive routines, in highly-threaded workloads.

    > full-line-write predictive bandwidth optimisation ... where the core can greatly improve
    > bandwidth by avoiding RFO reads of cache lines that are going to be fully rewritten

    Yes, I've been wanting this for about 2 decades.
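
    (For context, the way software has had to get that effect explicitly until now is with non-temporal stores, which write full cache lines around the cache and skip the RFO. A minimal sketch, assuming an aligned buffer whose size is a multiple of 32 bytes:)

        #include <immintrin.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Fill a buffer with non-temporal (streaming) stores. Because the core writes
           whole cache lines and never reads them first, no RFO traffic is generated.
           dst must be 32-byte aligned and n a multiple of 32 for this sketch. */
        static void fill_streaming(void *dst, int32_t value, size_t n) {
            __m256i v = _mm256_set1_epi32(value);
            char *p = (char *)dst;
            for (size_t i = 0; i < n; i += 32)
                _mm256_stream_si256((__m256i *)(p + i), v);
            _mm_sfence();   /* make the streaming stores globally visible before reuse */
        }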

    > We can see in the graph ... low outliers where the new design doesn’t improve thing
    > much or even sees regressions, which is odd.

    Maybe those were affected by the disabling of AVX-512? Or were those benchmarks performed on a fully-enabled core?

    > +25% wider µOP output

    If this is referring to expanding uOP cache outputs from 6 -> 8, that's a 33% improvement!
  • name99 - Thursday, August 19, 2021 - link

    "Can someone refresh us on the function of a BTB? "
    It's hard to be sure because I can never tell the extent to which Intel is doing things the old comfortable way, or the most sensible new way. I'll tell you what Apple do.
    Intel presumably implements a subset of these features, but I don't know how good a subset. You need ALL the pieces to run sustained "random" code 8-wide as Apple does.

    (a) You want a predictor for indirect branches (think things like virtual function or procPtr calls). How to construct such a predictor is interesting but we will just assume it exists. This may have been what the original Branch Target Buffer was say in the early 1990s, but it is NOT what we have in mind today.

    (b) To run a modern wide OoO machine optimally, you want to be able to process a TAKEN branch per cycle. (Code has a branch every ~6 instructions, a taken branch every ~10 instructions. If you want to run 8 wide...)
    - This means you need to pull a new run of instructions (ie loaded from a new address) every cycle.
    - This, in turn, means that you really need to run Fetch as an asynchronous process. A Fetch Engine every cycle predicts the next Fetch Address and the number of instructions to Fetch. (On Apple this can be at least as wide as 16 instructions in one cycle if everything lines up correctly.) These instructions are placed in the Fetch Queue and at the other end Decode reads 8/cycle from this queue. Making Fetch async from the rest of the machine means that you can sometimes pull in 16 instructions into the queue, sometimes you just pull in three or four instructions, sometimes none while you wait for a cache miss. But hopefully the queue between Fetch and Decode buffers much of this variation.

    - BUT asynchronous Fetch means Fetch is on its own regarding how it proceeds. Basically what you want is
    + a very fast (single cycle!) Next Fetch Predictor that produces the next fetch address and (ideally) also a Fetch Width
    But such a fast predictor is of limited accuracy.
    So the second essential you need is very high quality predictors that correct the Next Fetch Predictor. As long as you correct a misFetch before the instruction stream hits Rename life is fairly easy. Correcting after Rename is tough (you have to undo resource allocations), correcting after Issue is hopeless and you have to flush.
    The Apple numbers are that their high quality predictors (Branch Prediction and Indirect Branch Prediction) are TAGE based, phenomenally accurate, and take up to 5 cycles to generate a result. That just works out (of course!)

    So the idea is that the Next Fetch Predictor generates a stream of Fetch's which results in a stream of, let's call them cars of instructions, proceeding from I-cache, through the Fetch Queue, through Decode. At any point one of the better quality predictors can step in and derail all the cars after a certain point, so that Fetch restarts. Obviously this isn't great, you've lost up to five cycles of work, but it's a lot better than a full machine flush!

    OK, within this framework, I believe that what Intel today calls the BTB is essentially doing the same job as what I am calling the Next Fetch Predictor.

    BTW there are an insane number of tweaks and improvements Apple have made to the above scheme over the years. These include
    - a separate Return stack used by the Next Fetch predictor to deal with extremely rapid call/return pairs (eg call, work, return is three cycles; all done before the code has even hit decode, so totally out of sync with the "full accuracy" Return stack)
    - Decode (ie the earliest stage possible) does what it can to keep the machinery on track. Decode detects any sort of mismatch between decoded branches and the next car of instructions and, if so, gets Fetch to resteer. This is easily done for unconditional branches, and can also be done for a few other weird cases (like some mismatched call/return pairs). Decode also updates the Return stack.
    - pre-decode (ie when an instruction line is moved from L2 to L1) does a bunch of stuff to mark significant points (eg where branches are) in a cache line. This in turn is referenced the first time the Next Fetch Predictor encounters these new lines.
    - for certain cases (most obviously when the Next Fetch Predictor has an indirect branch marked as low confidence) Fetch pauses until some of the upstream machinery can suggest an address. The idea is that for low confidence indirect branches, you're so unlikely to guess correctly why even waste energy trying?

    Apart from all these, there's a whole other set of machinery that handles loops and the transition from "random" code to loops. These include both an L0 cache and a trace cache. (That's right kids, a trace cache!)
    There's also a whole set of ideas for saving power at every stage of this process. For example the Next Fetch Predictor, along with the obvious things it is recording (next fetch address and fetch width), also records two items obvious in retrospect -- the physical address (so no TLB lookup is necessary) and even the cache way (so no way prediction is necessary, and the correct way -- and only that way -- can be fired up on cache access). The loop buffer, L0, and the trace cache are additional ways to run Fetch on energy fumes for code that meets the specific requirements, so that the TLB, way prediction, multi-way lookup, branch predictor, etc etc can all be powered down.
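
    To make the "cars of instructions" picture above concrete, here is a toy software model of the decoupled front end (my own simplification, nothing like the real hardware, and all names are invented for illustration): the fast Next Fetch Predictor pushes (address, width) pairs into a queue, Decode drains the other end, and a slower-but-better predictor can derail everything queued after a bad fetch instead of flushing the whole machine.

        /* toy model of a decoupled fetch/decode front end (illustrative only) */
        #include <stdint.h>
        #include <stdbool.h>

        #define FETCH_Q_DEPTH 32

        struct fetch_car {
            uint64_t addr;    /* predicted fetch address                    */
            uint8_t  width;   /* number of instructions in this "car"       */
        };

        struct fetch_queue {
            struct fetch_car car[FETCH_Q_DEPTH];
            int head, tail;   /* Decode reads at head, Fetch writes at tail */
        };

        /* one Next-Fetch-Predictor prediction per cycle */
        static bool fetch_cycle(struct fetch_queue *q, uint64_t addr, uint8_t width)
        {
            int next = (q->tail + 1) % FETCH_Q_DEPTH;
            if (next == q->head)
                return false;                        /* queue full: Fetch stalls */
            q->car[q->tail] = (struct fetch_car){ addr, width };
            q->tail = next;
            return true;
        }

        /* the slower, higher-quality predictor disagrees with car 'bad': drop it
         * and everything younger, then Fetch restarts from the corrected address */
        static void resteer(struct fetch_queue *q, int bad)
        {
            q->tail = bad;
        }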
  • mode_13h - Friday, August 20, 2021 - link

    Thanks for the info. I wonder where you find such detailed descriptions!

    > L0 cache

    Just another name for a uop cache?

    > That's right kids, a trace cache!

    So, a trace cache stores an entire string of instructions, even across one or more branches? Does it hold instructions before or after decode?

    > and even the cache way

    You mean the cache set?

    > The loop buffer

    What's a loop buffer? Sort of like a trace cache, for loops?
  • name99 - Friday, August 20, 2021 - link

    Think of the steps required to run normal code, as I described above. Then consider various simple loops.

    Suppose you have a straight line loop, say 40 instructions in the loop body, no branches. Then you can omit branch prediction, TLB, cache -- just repeatedly run the same code from a straight buffer. That's essentially a loop buffer.

    Now suppose that your loop body has a few branches in it, but they are predictable, maybe something like
    if(loop counter even){path1} else {path2}
    Now what you want is something like a trace cache that's holding the two paths path1 and path2, and a very simple predictor that's telling which of these to choose each iteration. You can still avoid the costs of a real branch predictor and a real cache.

    Now suppose you have a loop with moderately complicated branches, not terrible but not that easy to predict either. You can't avoid the cost of branch prediction now (as I said, to validate the guess of the Next Fetch Predictor) but you can avoid much of the cost of the cache by moving the loop body into an L0 cache which will be essentially a small direct-mapped cache. Being smaller, and direct-mapped, it will use less energy/access than the full I-cache. (And you probably will also access it virtually rather than physically, so also avoid TLB costs.)
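
    As a concrete (purely illustrative) example of the middle case, here is the kind of loop where a trace-cache-plus-trivial-predictor scheme shines: the branch alternates with the counter, so holding both paths and picking between them needs none of the full branch-prediction or I-cache machinery.

        /* illustrative loop: perfectly predictable alternating branch */
        static long process(const long *data, int n)
        {
            long acc = 0;
            for (int i = 0; i < n; i++) {
                if ((i & 1) == 0)           /* "loop counter even" */
                    acc += data[i] * 3;     /* path1 */
                else
                    acc -= data[i];         /* path2 */
            }
            return acc;
        }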

    cache way:
    Recall that a DIRECT-MAPPED cache has only a single place where a line can go -- grab some bits from the middle of an address, they define an index, the line goes at that index. This is fast and cheap, but means you have a problem if you frequently want to access two addresses with the same index (ie same middle bits in their addresses).
    An n-way set-associative cache means you now have n slots (maybe 2, 4, 8 or some other number) associated with a given index. So if you have 8 slots, you can hold 8 lines with that same index, ie 8 addresses with those same middle bits.
    BUT how do you know WHICH of those 8 lines you want? Ahh.

    That gets into the business of matching tags, way prediction and a whole mess of other stuff that you need to read in a textbook. But the relevance to what I was saying is that which of these 8 possible lines is of interest is called a WAY. So by storing the cache way, you can access a cache with the speed (avoid cache tag lookup) and energy (no need to precharge the tags) of a direct-mapped cache.
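
    A minimal software sketch of that lookup, with made-up parameters (not any particular CPU's), in case it helps: the "middle bits" pick the set, the remaining upper bits are the tag, and the way is simply which of the slots in that set matched.

        /* toy 8-way set-associative lookup (parameters purely illustrative) */
        #include <stdint.h>
        #include <stdbool.h>

        #define LINE_SIZE 64
        #define NUM_SETS  128
        #define NUM_WAYS  8

        struct line { uint64_t tag; bool valid; };
        static struct line cache[NUM_SETS][NUM_WAYS];

        /* returns the way that hit, or -1 on a miss; a direct-mapped cache is
         * just NUM_WAYS == 1, where the index alone decides everything        */
        static int lookup(uint64_t addr)
        {
            uint64_t index = (addr / LINE_SIZE) % NUM_SETS;   /* middle bits   */
            uint64_t tag   = (addr / LINE_SIZE) / NUM_SETS;   /* upper bits    */

            for (int way = 0; way < NUM_WAYS; way++)   /* done in parallel in HW */
                if (cache[index][way].valid && cache[index][way].tag == tag)
                    return way;   /* remembering this value is what "storing the way" buys you */
            return -1;
        }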
  • GeoffreyA - Saturday, August 21, 2021 - link

    Great information. I believe on the Intel side, Nehalem added something like that, the LSD.
  • name99 - Saturday, August 21, 2021 - link

    As always the devil is in the details :-)

    Basic loop buffers, as in the LSD (introduced one gen before Nehalem, with Core2) have been with us forever, including early ARM chips and the early PA Semi chips, going on to Apple Swift.

    But the basic loop buffer cannot deal with branches (because part of the system is to switch off branch prediction!). Part of what makes the Apple scheme interesting and exceptional is that it's this graduated scheme that manages to extract much of the energy win from the repetition of loops while being able to cover a much wider variety of loops including those with (not too awful) patterns of function calls and branches.

    Comparing details is usually unhelpful because different architectures have different concerns; obviously x86 has decode + variable-length concerns, which are probably THE prime concern for how they structure their attempts to extract performance and energy savings from loops.

    On the Apple side, I would guess that Mapping (specifically detecting dependencies within a Decode group of 8 instructions, ie what register written by instruction A is immediately read by successor instruction B) is a high-energy task, and a future direction for all these loop techniques on the Apple side might be to somehow save these inter-instruction-dependencies in the loop storage structure? This is, obviously, somewhat different from Intel or AMD's prime concern with their loops, given that even now they max out at only 5 (perhaps soon 6) wide in the mapping stage, and don't need to know as much for mapping because they don't do as much zero-cycle work in the stage right after Mapping.
  • GeoffreyA - Sunday, August 22, 2021 - link

    Thanks. I suppose storing the dependency information would be of use even in non-loop cases, because of the amount of work it takes. Then again, it might add greater complexity, which is always a drawback.
  • mode_13h - Sunday, August 22, 2021 - link

    > I would guess that Mapping (specifically detecting ... what register written by
    > instruction A is immediately read by successor instruction B) is a high-energy task

    So, do you foresee some future ISA trying to map these out at compile-time, like Intel's ill-fated EPIC tried to do? On the one hand, it bloats the instruction size with redundant information. On the other, it would save some expensive lookups, at runtime. I guess you could boil it down to a question of whether it takes more energy for those bits to come in from DRAM and traverse the cache hierarchy vs. doing more work inside the core.

    The other idea I have is that if the CPU stores some supplemental information in its i-cache, then why not actually flush that out to L3 & DRAM, rather than recompute it each time? The OS would obviously have to provide the CPU with a private data area, sort of like a shadow code segment, but at least the ISA wouldn't have to change.
  • mode_13h - Saturday, August 21, 2021 - link

    Thanks. Very nice incremental explanation of a loop buffer, trace cache, and L0.

    > n-way set-associative cache means you now have, n slots associated with a given index.
    > So if you have 8 slots, you can hold 8 lines with that same index,
    > ie 8 addresses with those same middle bits.
    > BUT how do you know WHICH of those 8 lines you want?

    Yeah, I know how a set-associative cache works. The simplistic explanation is that there's a n-entry CAM (Content-Addressable Memory) which holds the upper bits of the addresses (I think what you're calling Tags) for each cache line in a set. So, a cache lookup involves up to n (I suppose 8, in this case) comparisons against those upper bits, to find out if any of them match the requested address. And, ideally, we want the hardware to do all those comparisons in parallel, so it can give a quicker answer where our data is (or fetch it, if the cache doesn't have it).

    Even at that level, cache is something not enough software developers know about. It's really easy to thrash a normal set-associative cache. Just create a 2D array with a width that's a factor or an integral multiple of a cache set size and do a column-traversal. If you're lucky, your entire array fits in L3. If not... :(
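
    For anyone who wants to see that in code, here's the classic pattern (sizes made up for the example; the exact stride that hurts depends on the cache's set count and line size):

        /* column walk over a row-major array with a large power-of-two stride:
         * consecutive accesses keep landing in the same few sets and evict
         * each other long before any reuse happens                            */
        #define ROWS 4096
        #define COLS 4096   /* 4096 * sizeof(float) = 16 KiB row stride */

        static float sum_by_column(float (*a)[COLS])
        {
            float s = 0.0f;
            for (int col = 0; col < COLS; col++)
                for (int row = 0; row < ROWS; row++)
                    s += a[row][col];      /* roughly one cache miss per element */
            return s;
        }
        /* swapping the two loops walks consecutive addresses instead and turns
         * most of those misses into hits                                       */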

    > which of these 8 possible lines is of interest is called a WAY.

    Where I first learned about CPU caches, they called it a set. So, an 8-way set-associative cache had 8 sets.

    > by storing the cache way, you can access a cache with the speed ...
    > and energy ... of a direct-mapped cache.

    Yup. That's what I thought you were saying. So, the way/set + offset is basically an absolute pointer into the cache. And a bonus is that it only needs as many bits as the cache size, rather than a full address. So, a 64k cache would need only 16 bits to uniquely address any content it holds.
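
    Worked out with made-up but typical numbers (64 KiB cache, 64-byte lines, 8 ways), the bit budget does indeed land on 16:

        /* illustrative parameters only */
        enum {
            CACHE_BYTES = 64 * 1024,
            LINE_BYTES  = 64,
            WAYS        = 8,
            LINES       = CACHE_BYTES / LINE_BYTES,  /* 1024 lines          */
            SETS        = LINES / WAYS,              /* 128 sets            */
            OFFSET_BITS = 6,                         /* log2(LINE_BYTES)    */
            SET_BITS    = 7,                         /* log2(SETS)          */
            WAY_BITS    = 3,                         /* log2(WAYS)          */
            TOTAL_BITS  = OFFSET_BITS + SET_BITS + WAY_BITS   /* = 16       */
        };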
  • GeoffreyA - Thursday, August 19, 2021 - link

    I believe the reservation station is that portion which contains the scheduler and physical register files. In Intel, it's been unified since P6, compared to AMD's split/distributed scheduler design and, I think, Netburst.
  • name99 - Thursday, August 19, 2021 - link

    "Intel is noting that they’re doing an increased amount of dependency resolution at the allocation stage, actually eliminating instructions that otherwise would have to actually emitted to the back-end execution resources."

    Again presumably this means "executing" instructions at Rename (or earlier) rather than as actual execution.
    Apple examples are
    - handling some aspects of branches (for unconditional branches) at Decode
    - zero cycle move. This means you treat a move from one register to another by creating a second reference to the underlying physical register. Sounds obvious, the trick is tracking how many references now exist to a given physical register and when it can be freed. It's tricky enough that Apple have gone through three very different schemes so far.
    - zero cycle immediates. The way Apple handle this is a separate pool of ~40 integer registers dedicated to handling MOV xn, # (ie load xn with immediate value #), and the instruction is again handled at Rename.
    Intel could do the same. They already do this for zero idioms, of course.

    - then there are weirder cases like value prediction where again you insert the value into the target register at Rename. The instruction still has to be validated (hence executed in some form) but the early insertion improves performance. Apple does this for certain patterns of loads, but the details are too complicated for here.
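
    A very rough sketch of the zero-cycle-move idea at Rename (my own simplification, with invented names; as noted above, the genuinely hard part is deciding when a physical register can be freed):

        /* hypothetical rename-stage move elimination via reference counting */
        #include <stdint.h>

        #define ARCH_REGS 32
        #define PHYS_REGS 192

        static uint16_t rename_table[ARCH_REGS];  /* arch reg -> phys reg        */
        static uint16_t refcount[PHYS_REGS];      /* mappings sharing a phys reg */

        /* "mov xd, xs" never reaches an execution port: xd simply starts
         * sharing xs's physical register                                        */
        static void rename_mov(int dst, int src)
        {
            rename_table[dst] = rename_table[src];
            refcount[rename_table[src]]++;
            /* the previous mapping of dst is recorded (ROB / history file) and
             * its refcount is only dropped when this mov retires -- tracking
             * that correctly is exactly the tricky part described above        */
        }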
  • mode_13h - Friday, August 20, 2021 - link

    Thanks again!
  • name99 - Thursday, August 19, 2021 - link

    "Someone please remind us what a reservation station is?"

    After an instruction is decoded it passes through Rename where resources it will need later (like a destination register, or a load/store queue entry) are allocated.
    Then it is placed in a Scheduling Queue. It sits in the queue until its dependencies are resolved (ie ADD x0, x1, x2 cannot execute until the values of x1 and x2 are known).
    This Scheduling Queue is also called a Reservation Station.

    There is a huge dichotomy in design here. Intel insists on using a single such queue, everyone else uses multiple queues. Apple have a queue per execution unit, ala https://twitter.com/dougallj/status/13739734787312... (this is not 100% correct, but good enough).
    The problem with a large queue is meeting cycle time. The problem with multiple queues is that they can get unbalanced. It's sad if you are executing integer only code and can't use the FP queue slots. Even worse is if you have one of your integer queues totally full, and the others empty, so only that one queue dispatches work per cycle.
    Apple solve these in a bunch of ways.
    First note the Dispatch Buffer in front of the Queues. This accepts instructions up to 8-wide per cycle from Rename, and sends as many as possible to the Scheduling Queues. It engages in load balancing to make sure that queues are always as close to evenly filled as possible.
    Secondly the most recent Apple designs pair the Scheduling Queues so that, ideally, each queue issues one instruction, but if a Queue cannot find a runnable instruction, it will accept the second choice runnable candidate from its paired queue.

    Queues and scheduling are actually immensely complicated. You have hundreds of instructions, all of which could depend in principle on any earlier instruction, so how do you track (at minimal area and energy) all these dependencies? Apple appears to use a Matrix Scheduler with a TRULY ASTONISHINGLY CLEVER dependency scheme. A lot about the M1 impresses me, but if I had to choose one thing it might be this.
    It's way too complicated to describe here, but among the things you need to bear in mind are
    - you want to track which instructions depend on which
    - you want to track the age of instructions (so that when multiple instructions are runnable, earliest go first)
    - you need to handle Replay. I won't describe this here, but for people who know what it is, Apple provide
    + cycle-accurate Replay (no randomly retrying every four cycles till you finally succeed!) AND, most amazingly
    + perfect DEMAND Replay: only the instructions (down a chain of dependencies) that depended on a Replay are in turn Replayed (and, again, at the cycle-accurate time)
    + if you think that's not amazing enough, think about what this implies for value prediction, and how much more aggressive you can be if the cost of a mispredict is merely a cycle-accurate on-demand Replay rather than a Flush!!!
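
    For readers who haven't met one, here is a toy matrix-style wakeup/select loop (nothing like Apple's actual scheme, just the textbook idea, with invented names): each entry keeps a bit per producer it waits on, completions clear a column, and the oldest entry with an empty row wins.

        /* toy matrix scheduler: wakeup by column clear, select oldest ready */
        #include <stdint.h>

        #define Q_SIZE 32

        struct rs_entry {
            uint32_t wait_on;  /* bit i set => still waiting on entry i     */
            uint16_t age;      /* smaller = older                           */
            uint8_t  valid;
        };
        static struct rs_entry q[Q_SIZE];

        static void broadcast_complete(int producer)
        {
            for (int i = 0; i < Q_SIZE; i++)          /* one column clear,  */
                q[i].wait_on &= ~(1u << producer);    /* parallel in HW     */
        }

        static int select_oldest_ready(void)
        {
            int best = -1;
            for (int i = 0; i < Q_SIZE; i++)
                if (q[i].valid && q[i].wait_on == 0 &&
                    (best < 0 || q[i].age < q[best].age))
                    best = i;
            return best;                              /* -1: nothing ready  */
        }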
  • mode_13h - Friday, August 20, 2021 - link

    Wow, you're on a roll!
  • name99 - Thursday, August 19, 2021 - link

    "> full-line-write predictive bandwidth optimisation ... where the core can greatly improve
    > bandwidth by avoiding RFO reads of cache lines that are going to be fully rewritten"

    Of course this is one of those "about time" optimizations :-)
    Apple (it's SO MUCH EASIER with a decent memory model, so I am sure also ARM) have had this for years of course. But improvements to it include
    - treat all-zero lines as special cases that are tagged in L2/SLC but don't require transferring data on the NoC. Intel had something like this in IceLake that, after some time, they switched off with microcode update.

    - store aggregation is obvious and easy if your memory model allows it. But Apple also engages in load aggregation (up to a cache line width) for uncachable data. I'm not sure what the use cases of this are (what's still uncachable in a modern design? reads rather than DMA from PCIe?) but apparently uncachable loads and stores remain a live issue; Apple is still generating patents about them even now.

    - Apple caches all have a bit per line that indicates whether this line should be treated as streaming vs LRU. Obviously any design that provides non-temporal loads/stores needs something like that, but the Apple scheme also allows you to mark pages (or range registers, which are basically BATs -- yes PPC BATs are back, baby!) as LRU or streaming, then the system will just do the right thing whether that data is accessed by load/stores, prefetch or whatever else.

    BTW, just as an aside, Apple's prefetchers start at the load-store unit, not the L1 (meaning they see the VIRTUAL address stream). This in turn means they can cross page boundaries (and prefetch TLB entries for those boundary crossings). They're also co-ordinated so that each L1 is puppeting what it wants the L2 prefetcher to do for it, rather than having L1 and L2 prefetchers working independently and hoping that it kinda sorta results in what you want. And yes, of course, tracking prefetching efficiency and throttling when appropriate have always been there.
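
    Software can already dodge the RFO by hand with non-temporal full-line stores; the new predictive optimisation is essentially the core doing the same thing automatically for ordinary stores that cover a whole line. A small sketch using the standard AVX intrinsics (alignment assumed, function name mine):

        #include <immintrin.h>
        #include <stddef.h>

        static void fill_buffer(void *dst, __m256i value, size_t bytes)
        {
            __m256i *p = (__m256i *)dst;             /* assumes 32-byte alignment     */
            for (size_t i = 0; i < bytes / 32; i++)
                _mm256_stream_si256(p + i, value);   /* write-combining store, no RFO */
            _mm_sfence();                            /* make the WC stores visible    */
        }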
  • mode_13h - Friday, August 20, 2021 - link

    > - treat all-zero lines as special cases that are tagged in L2/SLC but don't require
    > transferring data on the NoC. Intel had something like this in IceLake that, after
    > some time, they switched off with microcode update.

    I heard about that. Sad to see it go, but certainly one of those micro-optimizations that's barely measurable.
  • name99 - Thursday, August 19, 2021 - link

    " This is over double that of AMD’s Zen3 µarch, and really only second to Apple’s core microarchitecture which we’ve measured in at around 630 instructions. "

    Apple's ROB is in fact around 2300 entries in size. But because it is put together differently than the traditional ROB, you will get very different numbers depending on exactly what you test.

    The essential points are
    (a)
    - the ROB proper consists of about 330 "rows" where each row holds 7 instructions.
    - one of these instructions can be a "failable", ie something that can force a flush. In other words branches or load/stores
    - so if you simply count NOPs, you'll get a count of ~2300 entries. Anything else will hit earlier limits.

    (b) The most important of these limits, for most purposes, is the History File which tracks changes in the logical to physical register mapping. THIS entity has ~630 entries and is what you will bump into first if you test somewhat varied code.
    Earlier limits are ~380 int physical registers, ~420 or so FP registers, ~128 flag registers. But if you balance code across fp and int you will hit the 630 History File limit first.

    (c) If you carefully balance that against code that does not touch the History File (mainly stores and branches) then you can get to almost but not quite 1000 ROB entries.
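
    For the curious, the usual way such numbers are measured (heavily simplified here, names mine) is two DRAM-missing pointer chases separated by N filler instructions; the time per iteration jumps once N no longer fits in whichever structure the fillers consume, which is exactly why NOPs, integer ops and FP ops each expose a different limit.

        /* sketch only: sweep FILLERS and look for the step in time/iteration */
        #define ITERS   1000000
        #define FILLERS 384    /* in the real benchmark these are generated/unrolled */

        static void probe(long **chase_a, long **chase_b)
        {
            for (int i = 0; i < ITERS; i++) {
                chase_a = (long **)*chase_a;   /* pointer chase #1: misses to DRAM */
                /* N filler instructions emitted here (NOPs, int adds, FP adds,
                 * stores... depending on which structure you want to probe)    */
                chase_b = (long **)*chase_b;   /* pointer chase #2                */
            }
            (void)chase_a; (void)chase_b;
        }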

    The primary reason Apple looks so different from x86 is that (this is a common pattern throughout Apple's design)
    - what has traditionally been one object (eg a ROB that tracks instruction retirement AND tracks register mappings) is split into two objects each handling a single task.
    The ROB handles in-order retiring, including flush. The History File handles register mapping (in case of flush and revert to an earlier state) and marking registers as free after retire.

    This design style is everywhere. Another, very different, example, is the traditional Load part of the Load/Store queue is split into two parts, one tracking overlap with pending/unresolved stores, the second part tracking whether Replay might be required (eg because of missing in TLB or in the L1).

    - even a single object is split into multiple of what Apple calls "slices", but essentially a way to share rare cases with common cases, so the ROB needs to track some extra state for "failable" instructions that may cause a flush, but not every instruction needs that state. So you get this structure where you have up to six "lightweight" instructions with small ROB slots, and a "heavyweight" instruction with a larger ROB slot. Again we see this sort of thing everywhere, eg in the structures that hold branches waiting to retire which are carefully laid out to cover lots of branches, but with less storage for various special cases (taken branches need to preserve the history/path vectors, non-taken branches don't; indirect branches need to store a target, etc etc)
  • GeoffreyA - Friday, August 20, 2021 - link

    Thanks for all the brilliant comments on CPU design!
  • mode_13h - Friday, August 20, 2021 - link

    Go go go!
  • GeoffreyA - Thursday, August 19, 2021 - link

    I think Intel did a great job at last. Golden Cove, impressive. But the real star's going to be Gracemont. Atom's come of age at last. Better than Skylake, while using less power, means it's somewhere in the region of Zen 2. Got a feeling it'll become Intel's chief design in the future, the competitor to Zen.

    As for Intel Thread Director, interesting and impressive; but the closer tying of hardware and scheduler, not too sure about that. Name reminded me of the G-Man, strangely enough. AVX512, good riddance. And Intel Marketing, good job on the slides. They look quite nice. All in all, glad to see Intel's on the right track. Keep it up. And thanks for the coverage, Ian and Andrei.
  • Silver5urfer - Friday, August 20, 2021 - link

    Lol. That is no star. The small puny SKL-level cores are not going to render your high FPS nor do your Zip compression. They are admitting themselves these are efficiency cores. Why? Because 10SF is busted on power consumption and Intel cannot really make any more big cores on their Desktop platform without getting power throttled. On top of that, their Ring bus cannot scale like SKL anymore.
  • GeoffreyA - Friday, August 20, 2021 - link

    Not as it stands, but mark my words, the Atom design is going to end up the main branch, on the heels of Zen in p/w. Interesting ideas are getting poured into this thing, whereas the bigger cores, they're just making it wider for the most part.
  • ifThenError - Friday, August 20, 2021 - link

    Totally understand your point and I'd personally welcome such a development!

    Anyway, the past years have shown a rather opposite way. Just take ARM as an example. There once was an efficiency line of cores that got the last update years ago with the A35. Now it's labelled as "super efficient" and hardly has any implementations aside from devices sitting idle most of the time. You can practically consider it abandoned.
    The former mid tier with the A55 is now marketed as efficient cores, while the former top tier A7x more and more turns into the new midrange. Meanwhile people go all crazy about the new X1 top tier processors even though the growth of power consumption and heat is disproportionate to the performance. Does this sound reasonable in a power- and heat-constrained environment? Yeah, I don't think so either! ;-)

    For that reason I perfectly understand Ian's demand for a 64 core Gracemont CPU. Heck, even a 16 core would still be #1 on my wishlist.
  • GeoffreyA - Saturday, August 21, 2021 - link

    Yes, performance/watt is the way to go, and I reckon a couple more rounds of iteration will get Atom running at the competition's level. The designs are similar enough. It's ironic, because Atom had a reputation for being so slow.
  • mode_13h - Saturday, August 21, 2021 - link

    > Atom had a reputation for being so slow.

    With Tremont, Intel really upped their Atom game. It added a lot of complexity and grew significantly wider.

    However, it's not until Gracemont's addition of AVX/AVX2 that Intel is clearly indicating it wants these cores to be taken seriously.

    I wonder if Intel will promote their Atom line of SoCs as the new replacement for Xeon D. Currently, I think they're just being marketed for embedded servers and 5G Basestations, but they seem to have the nous to take on the markets Xeon D was targeting.
  • GeoffreyA - Sunday, August 22, 2021 - link

    I'm impressed, and more so because of its humble roots. Gone are the days when Atom was something Jaguar made a laughing stock of, though that stigma still appears to be clinging to it. In a way, it reminds me of the Pentium M, though the M was solid from the word go. Golden Cove, is this a repeat of the past, and is your energy-efficient brother going to take over the show?
  • ifThenError - Sunday, August 22, 2021 - link

    Would it be bad if Golden Cove turns out the new Netburst in the long run? ;-)
  • GeoffreyA - Sunday, August 22, 2021 - link

    That's what I'm thinking. It's going to dim the lights in the house when it fires up on Prime95.
  • mode_13h - Friday, August 20, 2021 - link

    > Better than Skylake, while using less power, means it's somewhere in the region of Zen 2.

    But remember, it's not ISO-process. Still, the comparison with Zen 2 is apt.

    The main caveat, and I can't believe Ian completely missed this (regarding his Cinebench & Gracemont HEDT comments), is that Intel only quoted integer performance. I think it won't compare as favorably on floating-point workloads.
  • GeoffreyA - Saturday, August 21, 2021 - link

    Yes, it's going to fall behind even Zen 1 on floating-point performance. While there are two FADDs and FMULs, same as Zen, they're being shared across two ports (20 and 21).
  • hechacker1 - Thursday, August 19, 2021 - link

    Ok INTC. For the record I entered 300 shares at an average cost of 52.39 on 8/19/21. I sold 3 covered calls for downside protection (you bastard).

    RIP me.
  • mode_13h - Friday, August 20, 2021 - link

    I'm not sure I follow. I'd be interested in hearing your rationale, if you want to share it.
  • hechacker1 - Friday, August 20, 2021 - link

    I think INTC is fairly priced right now with these performance improvements. They only have to be competitive.

    They still have fabs, and importantly, in the US. It's a long term play that ANY cpu/gpu/ai fab will print money because the demand is high.

    A 1- or 5-year chart on Intel suggests it probably won't go that much lower.

    I sold covered calls (and guess what, Intel was down today 1.8%) which netted me $24 in "profit." I then closed the calls. Then the stock ended back up slightly where I put more covered calls. Rinse and repeat and hope it starts trending up (or I lose).

    Basically, I think it's undervalued compared to the run-up in AMD and NVDA. It's taken a beating. It's still cash flow positive. But I admit, Intel needs to show proof. No more bullshit. Actual yields.
  • mode_13h - Saturday, August 21, 2021 - link

    Thanks. I stopped buying individual stocks a while ago. I'm just not as interested in finance as I am in tech.
  • bwj - Thursday, August 19, 2021 - link

    Wait a sec, does this mean that the P-cores will gain the Tremont features UMWAIT, TPAUSE, and UMONITOR? Because if so, that's going to be absolutely dank for us system programmers.
  • mode_13h - Friday, August 20, 2021 - link

    With a launch date in just a few months, their reference documentation should already be updated with info about the supported instructions on Golden Cove.

    I'd be interested in hearing why these features are of interest to you.
  • mode_13h - Thursday, August 19, 2021 - link

    > look for Cinebench R20 scores for one Gracemont thread around 478 then
    > (Skylake 6700K scored 443).

    No, because the 8% greater perf is on spec int, and Cinebench is not an integer workload!

    > If that’s the case, where is our 64-core Atom version for HEDT?

    That market will probably be better-served by a Sapphire Rapids-derived chip with much more floating-point horsepower per core.

    However, a 128-core Gracemont-based server CPU could perhaps be interesting for mostly integer workloads.

    > along with a 5000-entry branch target cache

    Is that not the same as a BTB?

    > So it’s a bit insane to have 17 execution ports here

    But they're more special-purpose, right? If you make each less versatile, the natural consequence is that you need more of them.

    > in the E-core there are two separate schedulers

    A little like Bulldozer, no? Except they're both being fed by just one thread.

    > if each core is fully loaded, there is only 512 KB of L2 cache per core before
    > making the jump to main memory

    Unless I'm missing something, 4 MB / 4 cores = 1 MB / core
  • GeoffreyA - Friday, August 20, 2021 - link

    "in the E-core there are two separate schedulers"

    I think that's related to the ports being divided into integer and vector sides. Possibly, one scheduler is handling the integer work, and one the FP. It's interesting that this design is following in the tracks of Zen (and even the Athlon). Also, its ROB is 256 entries, which is the same as Zen 3's. Intel is slowly working Atom up to be their main design.
  • mode_13h - Saturday, August 21, 2021 - link

    > Intel is slowly working Atom up to be their main design.

    I don't know about being their main design, but perhaps they're worried their P-cores are too uncompetitive in perf/W with ARM server cores. So, I could imagine them making a bigger push for server CPUs built around E-cores.
  • dullard - Thursday, August 19, 2021 - link

    "For users looking for more information about Thread Director on a technical, I suggest reading this document and going to page 185, reading about EHFI – Enhanced Hardware Frequency Interface."

    Ian, can you please link that document? Thanks.
  • TeXWiller - Thursday, August 19, 2021 - link

    I had a strong impression that Windows 11 will be "published" well (months) before Alder Lake based products come to market.
  • drothgery - Friday, August 20, 2021 - link

    Intel has an event scheduled at the end of October that seems to be strongly rumored as the official launch event for Alder Lake. That's only two months from now. How it actually performs is TBD, but I'd bet a lot on working systems in reviewers' hands before November.
  • Dr_b_ - Thursday, August 19, 2021 - link

    looking really good, sad to lose AVX512 though
    concerns:
    -no real substantive availability on release date
    -high prices for motherboards/RAM/CPU and scalping, we are still in a pandemic with global supply chain disruptions
    -buggy motherboard/OS
    -no Linux support
  • mode_13h - Friday, August 20, 2021 - link

    They didn't say "no Linux support". Rather, they said the thread director wouldn't initially have explicit Linux support. I'm sure there will be workarounds, probably along the lines of how Windows 10 sees the Big.Little cores. Remember that Linux has scheduled workloads on Big.Little configurations in Android, for at least a decade.
  • flensr - Thursday, August 19, 2021 - link

    That on-chip embedded microprocessor that tracks threading... It gets to talk to the OS. How cool is that as a target for hackers?
  • GreenReaper - Friday, August 20, 2021 - link

    Intel Threat Detected!
  • ifThenError - Friday, August 20, 2021 - link

    LOL!
    Underrated comment
  • mode_13h - Saturday, August 21, 2021 - link

    :D
  • diediealldie - Thursday, August 19, 2021 - link

    I'm quite curious how their E-cores are designed. They somehow use a 6-wide decoder, which is the same width as the P-cores, and twice as big an I-cache, yet use 1/4 of the area.

    Maybe it's related to design philosophy? Or is the Atom team the true trump card of Intel's design group?
  • name99 - Thursday, August 19, 2021 - link

    That "6-way decoder" is typical Intel double-talk. What is done is that you have two decoders that can each decode three instructions. This works IF there is a branch between the two sets of instructions, because the branch landing point provides a resync point for the second decoder, so the two can run in parallel.

    You could obviously extend this, in a decent Next Fetch Predictor system, to have the NFP store the lengths of each instruction in the run of instructions to be decoded, and get trivial parallel decode. And Andy Glew (I think it was him, either him or Jim K) submitted a patent for this around 2000. But in true Intel fashion, nothing seems to have been done with the idea...
  • GeoffreyA - Saturday, August 21, 2021 - link

    If I'm not mistaken, Tremont or Goldmont, can't remember which, began marking the instruction boundaries in cache.
  • name99 - Saturday, August 21, 2021 - link

    Doing it in the cache is more difficult. Of course it makes the most sense! But it hits the problem that, *in theory*, you can have stupid code that jumps into the middle of an instruction and starts decoding the alternative version of the byte stream that results.
    This is, of course, absolutely insane, but it's part of the joy that is supporting x86.

    Now one way to handle this is to tag the boundaries not in the I-cache (where you can jump to any byte) but in structures that are already set up to deal with instruction streams (as opposed to byte streams). Hence the Next Fetch Predictor, as I described, or in a trace cache if you are using that.
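
    A contrived illustration of why the byte stream can't just be trusted: the same bytes decode as two different instruction streams depending on where you jump in.

        /* the same x86 bytes, two decodings (illustrative) */
        static const unsigned char code[] = {
            0xB8, 0x90, 0x90, 0x90, 0x90,  /* entered at byte 0: mov eax, 0x90909090 */
                                           /* entered at byte 1: nop; nop; nop; nop  */
            0xC3                           /* ret (either way)                       */
        };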

    Another solution would be yet another predictor! Assume most code will be sane, but have a separate pool of decoders that are validating the byte stream as opposed to the high-speed instruction stream going through the main path of the CPU. In the event of a mismatch
    - treat it like a branch misprediction, flush and restart AND
    - train the predictor that for this particular cache line, we have to decode slowly and surely

    Now why doesn't Intel do any of these things? You and I can think of them, just as people like Andy Glew were thinking of variants of them 20 years ago.
    My primary hypothesis is that Intel has become locked into the idea that GHz is everything. Sure they occasionally say they don't believe this, or even claim to have reformed after a disaster (*cough* Pentium4 *cough*) but then they head right back to the crack house.
    I suspect it's something like the same mentality as the US Air Force -- when pilots form the highest levels of command, they see pilots as the essence of what the Air Force IS; drones and UAV's are a cute distraction but will never be the real thing.
    Similarly, if you see GHz as the essence of what Intel is, that smarts are cute but real men work on GHz, then you will always be sacrificing smarts if they might cut into GHz. But GHz costs the problems we see in the big cores: the crazy power draws, and the ridiculously low density...

    Well, this is getting into opinion, not technology, so interpret it as you wish!
  • GeoffreyA - Sunday, August 22, 2021 - link

    Looking at the article again, I see their on-demand instruction length decoder is where this is happening. Seems to be caching lengths after they're worked out. I also wonder if this is why Atom hasn't had a uop cache as yet. It's either that or the length caching, because the uop cache will indirectly serve that purpose as well (decoded instructions don't need their lengths worked out). So it's perhaps a matter of die area here that Intel chose that instead of a uop cache.
  • GeoffreyA - Sunday, August 22, 2021 - link

    It's been said that K7 to Bulldozer also did a similar thing, marking instruction boundaries in the cache. And the Pentium MMX, but I need to double-check this one.
  • mode_13h - Sunday, August 22, 2021 - link

    > In the event of a mismatch - treat it like a branch misprediction, flush and restart

    Yes, because even assembly language doesn't make it easy to jump into the middle of another instruction. IMO, any code which does that *deserves* to run slowly, so that it will get replaced with newer software that's written in an actual programming language.

    > My primary hypothesis is that Intel has become locked into the idea that GHz is everything.

    I think they just got lulled into thinking it was enough to deliver modest generational gains. Anything more ambitious probably jeopardized the schedule or risked their profit margins due to the cores getting too big and expensive. And when the time comes for more performance, they reach into their old playbook and go with a "sure" win, like wider vectors. I wonder if the example of TSX reveals anything about their execution on the more innovative stuff. Because that doesn't build a lot of confidence for taking on bold, new ideas.

    > when pilots form the highest levels of command,
    > they see pilots as the essence of what the Air Force IS

    Not just pilots, but specifically fighter pilots. So, they also don't care much about bombers or Space Command (now Space Force). The only way to change that would be to make them care, by making them more accountable for the other programs, until they realize they need them to be run by someone who knows about that stuff. Either that or just reorg the whole military. That would probably also help rein in defense spending.
  • zamroni - Friday, August 20, 2021 - link

    Those low-power cores for desktop are a waste of transistors.
    The area would be better used for more cache or more performance cores
  • mode_13h - Friday, August 20, 2021 - link

    This is what I thought, until I realized that they have better perf/area than the big cores. Not to mention perf/W.

    So, in highly-threaded workloads, their 8+8 core configuration should out-perform 10 cores of Golden Cove. And, when thermally-limited, the little cores will also more than pull their weight.

    It's an interesting experiment they're trying. I'm interested in seeing how it plays out, in the real world.
  • nevcairiel - Friday, August 20, 2021 - link

    > Designed as its third generation of vector instructions (AVX is 128-bit, AVX2 is 256-bit, AVX512 is 512-bit)

    SSE is 128-bit. AVX is 256-bit FP, AVX2 is 256-bit INT.
    And MMX was 64-bit before that. So doesn't this make it the 4th generation, assuming you don't count all the SSE versions separately? (The big ones were SSE1 with 128-bit FP, and SSE2 with 128-bit INT, SSE3/SSSE3/SSE4.1 are only minor extensions)
  • mode_13h - Saturday, August 21, 2021 - link

    Yeah, I came to the same conclusion. It's the 4th major family of vector instructions. Or, another way to clearly demarcate it would be the 4th vector width.
  • abufrejoval - Friday, August 20, 2021 - link

    I wonder how many side channel attacks the power director will enable.

    Also wonder if the lack of details is due to Intel stepping awfully close to some of Apple's patents.

    The battles between the Big little and AVX-512 teams inside Intel must have been epic: I imagine frothing red faces all around...
  • mode_13h - Saturday, August 21, 2021 - link

    > The battles between the Big little and AVX-512 teams inside Intel must have been epic

    : )

    Although, the AVX-512 folks have some egg on their faces from a problematic implementation in Skylake-SP and its derivatives.
  • abufrejoval - Friday, August 20, 2021 - link

    Does Big-little make any sense on a "desktop"?

    And then: Are there actually still any desktops around?

    All around my corporate workplaces, notebooks have become the de-facto desktop for many depreciation cycles, mostly because personal offices got replaced by open space and home-office days became a regular thing far before the pandemic. Since then even 'workstations' just became bigger notebooks.

    Anywhere else I look it's becoming hard to detect desktops, even for big-screen & multi-monitor setups, it's mostly NUCs or in-screen devices these days.

    Those latter machines rarely seem to get turned off any more and I guess many corporate laptops will remain 'turned on' (= stay in some sort of slumber) most of the time, too, so there the overall Big-little power consumption might drop vs. Big-only when both no longer sleep deeply.

    Supposedly that makes all these voice commands possible, but try as I might, I can see no IT admin turning that on in an office, nor would I want that in my living room.

    The only place I still see 'desktops' are really gamer machines and for those it's hard to see how those small cores might have any significant energy budget impact, even while they are used for ordinary 2D stuff.

    For micro-servers Big-little seems much more useful, but Intel typically has gone a long way to ensure that 'desktop' CPUs were not used for that.

    Intel's desire for market differentiation seems the major factor behind this and many other features since MMX, but given an equal price choice, I cannot imagine preferring the use of AVX-512 for dark silicon and two P-core tiles for eight E-cores over a fully enabled ten P-core chip.

    And I'd believe that most 'desktop' users would prefer the same.
  • mode_13h - Saturday, August 21, 2021 - link

    > The only place I still see 'desktops' are really gamer machines

    We still use traditional desktops for software development and VMs for testing. Our software takes long enough to build and the test environment needs to boot a full image. So, a proper desktop isn't hard to justify.
  • abufrejoval - Saturday, August 21, 2021 - link

    Our developers are encouraged to use build servers and the automatic testing pipelines behind them. Those run on machines with hundreds of GB of RAM and dozens of CPU cores, where loads get distributed via the framework. The QA tests will use containers or VMs as required, which are built and torn down to match by the pipeline. With thousands of developers in the company, that tends to give both better performance to any developer and much better economy to the company, while (home-)offices stay cool and quiet. We still give them laptops with what used to be "desktop" specs (32GB RAM, i7 quads), because, well, they're cheap enough, and it allows them to play with VMs locally, even offline, should they want to e.g. for education/self-study.

    These days when you're running a build farm on your "desktop", that may really be more of a workstation. It may be the "economy" model, which means from a price point it's what used to be a desktop, in my home-lab case a Ryzen 7 5800X 8-core with an RTX 2080ti and 128GB ECC RAM that runs whisper quiet even at full load. It would have been a 16-Core 5950X today, but when I built it, those were impossible to get. It's still an easy upgrade and would get you 16 "P-cores" on the cheap. It's also pretty much a gamer's rig, which is why I also use it after hours.

    My other home-lab workstation is what used to be a "real workstation" some years ago, an 18-core Haswell E5-2696 v3, which has exactly the same performance as the Ryzen 7 5800X on threaded jobs, even uses the same 110 Watts of power, but much lower clocks (2.7 vs. 4.4 GHz all-cores). Also 128GB of ECC RAM and thankfully just as quiet. It's not so great at gaming, because it only clocks to 4 GHz for single/dual core loads with Haswell IPC and I've yet to find a game that's capable of using 18-cores for profit to balance that out.

    Today you would use a Threadripper in that ballpark, with an easy 64 "P-Cores" and matching RAM, pretty much the same computing capacity as a mid-range server, but much quieter and typically tolerable in a desktop/office setup.

    If threaded software builds were all you do, you'd want to use 64 E-Cores on the "economy" variant and 256 E-Cores on the "premium", much like Ian hinted, because as long as you can fully load those 256 cores for your builds, they would be faster overall. But the chances for that happening are vastly bigger on a shared server than on a dedicated desktop, which is why we see all these ARM servers going for extra cores at the price of max single threaded performance.

    As a thought experiment imagine a machine where tiles can be switched dynamically between being a single P-core or four E-cores. For embarrassingly parallel workloads, the E-Cores would give you both better Watt economy (if you can maintain full load or your idle power consumption is perfect) and faster finish times. But as soon as your workload doesn't go beyond the number of P-cores you can configure, finishing times will be better on P-cores, while power efficiency very much gets lost in idle power demands.

    The only way to get that re-configurability is to use shared servers, cloud or DC, while a fixed allocation of P vs E cores on a desktop has a much harder time to match your workload.

    I can tell you that I much prefer working on the 5800X workstation these days, even if it's no faster for the builds. Because it's twice as fast on all those scalar workloads. And no matter how much most stuff tries to go wide and thready, Amdahl's law still holds true and that's where P-Cores help.
  • mode_13h - Sunday, August 22, 2021 - link

    > Our developers are encouraged to use build servers

    We use VM servers, but they're old and the VMs are spec'd worse than desktops. So, there's no real incentive to use them for building. And if you're building on a desktop in your home, then testing on a server-based VM means copying the image over the VPN. So, almost nobody does that, either.

    VM servers are a nice idea, but companies often balk at the price tag. New desktops every 4-5 years is an easier pill to swallow, especially because upgrades are staggered.

    > I much prefer working on the 5800X workstation these days,
    > even if it's no faster for the builds. Because it's twice as fast on all those scalar workloads.

    Exactly. Most incremental compilation involves relatively few files. I do plenty of other sequential tasks, as well.
  • mode_13h - Saturday, August 21, 2021 - link

    > micro-servers Big-little seems much more useful, but Intel typically has gone
    > a long way to ensure that 'desktop' CPUs were not used for that.

    Huh? Their E-series Xeons are simply desktop CPUs with a few less features fused-off.
  • abufrejoval - Saturday, August 21, 2021 - link

    We all know that that's what they are technically. But that didn't keep Intel from selling them, and the required chipsets, which had the same magical snake oil, at a heavy markup, before AMD came along and offered ECC and some RAS for free.

    And that is going to come back, as soon as Intel sees a chance to make an extra buck.
  • mode_13h - Sunday, August 22, 2021 - link

    > that didn't keep Intel from selling them, and the required chipsets, ... at a heavy markup

    Except for maybe the top-end models, I tended to observe E-series (previously E3-series) selling for similar prices as the desktop equivalents. However, workstation motherboards generally have commanded a higher price.
  • mode_13h - Saturday, August 21, 2021 - link

    > given an equal price choice, I cannot imagine preferring the use of AVX-512 for
    > dark silicon and two P-core tiles for eight E-cores over a fully enabled ten P-core chip.

    Aside from the AVX-512 part, the math is quite easy. If you just take what they showed in the Gracemont vs. Skylake comparison, it's clear that 8 E-cores is going to provide more performance than 2 more P-cores. And anything well-threaded enough to fully-load 10 P-cores should probably scale well to at least 16 (or 24) threads.

    As for the AVX-512 part, its absence is irrelevant if your workload doesn't utilize it, as most don't. Ryzen 5000 has been very competitive without it. I'm sure folks at Intel were keen to cite that.

    > And I'd belive that most 'desktop' users would prefer the same.

    I don't love the E-cores, in a desktop, but that's more out of apprehension about how well-scheduled they'll be. If the scheduling is good, then I'm fine with having them instead of 2 more P-cores.
  • Spunjji - Tuesday, August 24, 2021 - link

    "If the scheduling is good, then I'm fine with having them instead of 2 more P-cores"
    It's all going to come down to this. Lakefield wasn't great in that regard; presumably anybody running Windows 10 on ADL will get a slightly more refined version of that experience. Hopefully the Windows 11 + Thread Director combo will be what's needed!
  • Timur Born - Friday, August 20, 2021 - link

    My current experience is that anything based on older Lua versions (like 5.1) does not seem to benefit from IPC gains at all, only clock-rate matters.
  • abufrejoval - Saturday, August 21, 2021 - link

    That's interesting.

    If IPC gains were "uniform", that should not happen, which then means they aren't uniform enough for your workloads.

    But a bit more data would help... especially if a newer version of Lua doesn't show this behavior?
  • mode_13h - Sunday, August 22, 2021 - link

    I've never used it, but it seems to be dynamically-typed and table-based. So, I'd assume it's doing lots of hashtable lookups, which seem harder for a CPU to optimize. Maybe newer versions have some optimizations to reduce the frequency of table lookups, which would also be more OoO-friendly.
  • TristanSDX - Friday, August 20, 2021 - link

    As for the disabled AVX-512, I suspect they found a last-minute bug in the P cores. ADL is in mass production now, and the release can't be postponed, and not many apps use it currently, so they disabled it completely. For Sapphire Rapids AVX-512 is mandatory; that's why they delayed it half a year, from Q4'21 to Q2'22. An HPC product without AVX-512, which much HPC software uses, is just a brick.
  • mode_13h - Saturday, August 21, 2021 - link

    That doesn't explain the E-core situation, though. As the article explains, enabling it on only the P-cores would create a real headache for the OS' thread scheduler.

    Plus, a lot of multi-threaded software naively spawns one worker thread per hardware thread, so you could end up with a situation where 24 software threads are fighting for execution time on 16 hardware threads, leading to more context switches and higher software latencies.

    I'm just saying that the stated explanation of disabling it because it's lacking in the E-cores is a suitable reason.

    As for Sapphire Rapids' delays, it's not hard to imagine they're having yield problems with such big chips on their new "Intel 7" process. Also, they're behind schedule for the software support for it, with AMX still being in really rough shape.
  • abufrejoval - Saturday, August 21, 2021 - link

    Since AVX-512 isn't new, I'm somewhat doubtful on the bug theory.

    And since Intel doesn't do chiplets yet, they can't be reusing that silicon for server CPUs either.

    It really has me think that the AVX-512 guys tried to push their baby through into production until the bloody final battle, when the E/P-Core symmetry team shut them down (for now, it's all fuses, right?).

    It's really very much a matter of how you want to use these resources and educating both operating systems and users about their potential and limitations. If all you see in E-cores is a way to run a P-core task on less energy budget, that symmetry is critical. If you see E-cores as an add-on resource that is somewhat functionally limited (but might have better side-channel resilience or run special-purpose VMs etc.), yet available at low silicon real-estate cost, it's another story.

    On notebooks on batteries, the symmetric view wins out. For anything on a powerline, the E-cores may make some sense as functionally constrained extra resources, but I can't see the power savings vs. good idle there (well, perhaps a single E-core, like the Tegra 3 had against its quad P-cores).

    It's very hard to maintain real flexibility when things get baked into silicon.

    I'd say product managers got the better of the engineers and what you get is a compromise, which is hardly ever ideal nor easy to understand without the context of its creation.
  • mode_13h - Sunday, August 22, 2021 - link

    > It really has me think that the AVX-512 guys tried to push their baby through into
    > production until the bloody final battle,

    That doesn't explain the backport of VNNI to AVX2, unless that was already being done for other reasons.

    Intel went through this once, already, with Lakefield. That was like 2 years ago, and forced the same situation of the P-core being kneecapped. So, this thing can't have been a surprise.

    Now, wouldn't it be cool if BIOS gave you a choice between enabling the E-cores and having AVX-512 on the P-cores? I know it'd create more headaches for the customer support teams at Intel and many OEMs, but that would at least be a more customer-centric way to make the tradeoff.
  • Spunjji - Tuesday, August 24, 2021 - link

    Giving customers more choice for no additional cost is not the Intel way!
  • Oxford Guy - Thursday, August 26, 2021 - link

    Some here fervently believe enthusiasts who build their own PCs aren’t going to enter BIOS to turn on XMP...
  • Spunjji - Friday, August 27, 2021 - link

    @Oxford Guy - only ever seen people argue the majority of users won't do that, not enthusiasts specifically.
  • SystemsBuilder - Friday, August 20, 2021 - link

    Breaking out VNNI from AVX512 and keeping it in Alder Lake is to accelerate Neural Net inference. Many other parts of AVX512 (i.e. AVX512F etc) are necessary to sufficiently accelerate NN learning.
    Intel probably thought that Alder Lake CPUs would only be used in inference scenarios and therefore reserved AVX512 and AMX for the Sapphire Rapids server, workstation and hopefully HEDT platform road maps.
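
    For reference, what the VNNI dot-product instruction (VPDPBUSD) computes per 32-bit lane, written out as scalar C (the real instruction does this across the whole vector; function name here is mine):

        #include <stdint.h>

        /* u8 x s8 dot product accumulated into s32 -- the int8 inference primitive */
        static int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4])
        {
            for (int i = 0; i < 4; i++)
                acc += (int32_t)a[i] * (int32_t)b[i];
            return acc;
        }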

    Intel forgot (or more likely did not care) that companies have, after 5 years of AVX512 with implementations as far down into the consumer stack as Ice Lake and Tiger Lake laptops, tuned libraries to take advantage of AVX512 in OTHER scenarios than deep learning. Those libraries are now going to be regressing to AVX2 when run on Alder Lake CPUs, effectively kneecapped, executed on P and crap cores -- oops, sorry, I meant E cores.
  • mode_13h - Saturday, August 21, 2021 - link

    To be fair, I think Intel had further motives for porting VNNI to AVX2. They sell Atom processors into applications where inferencing is a useful capability. Skylake CPUs are already pretty good at inferencing, with just baseline AVX2, so VNNI can only help.

    Still, the situation is something of an own-goal. I'll bet Intel will be nursing that wound for the next few years. I don't expect they'll make the same decision/mistake in Raptor Lake.
  • StoykovK - Friday, August 20, 2021 - link

    Intel stated that ADL goes to 6 decoders from 4, but didn't Skylake have 5 (4 simple + 1 complex)?

    I'm a little bit confused. It looks like, from an architecture point of view, Golden Cove compared to Willow Cove is a bigger update than Willow Cove to Skylake, but both result in ~20% IPC.

    E-cores: Really good idea to get a high score in multi-core benchmarks. Golden Cove looks like ~33% faster than the E-cores, but takes a lot more power. Does anybody have an idea how wide the E-cores' AVX units are -- 128-bit or 256-bit?
  • TristanSDX - Friday, August 20, 2021 - link

    SSE - 128 bit, AVX - 256 bit, AVX-512 - 512 bit
  • StoykovK - Friday, August 20, 2021 - link

    Zen/Zen+, Sandy Bridge, Ivy Bridge fuses 2x128bit units in order to execute single 256bit AVX.
  • NikosD - Wednesday, August 25, 2021 - link

    Don't get confused... Sandy Bridge and Ivy Bridge have the exact same implementation of AVX1 execution as Haswell. They both support full 256-bit throughput (like Haswell) for all 256-bit AVX1 execution units, on ports 0, 1 and 5 (for Sandy Bridge).
  • SystemsBuilder - Friday, August 20, 2021 - link

    E cores are also good for marketing purposes.
    I can envision the sticker on the "latest" PC/laptops at BestBuy in front of me: bla, bla, bla ... "Intel 12th gen CPU with 16 cores"... more bla, bla.
    In fact, thinking a bit about this, Alder Lake might be a pure marketing-department-driven product as a reaction to AMD's core-count superiority -> # of cores sells.
  • Oxford Guy - Friday, August 20, 2021 - link

    I don’t doubt that that’s part of it.
  • mode_13h - Saturday, August 21, 2021 - link

    I had the same thought, initially: that the E-cores were mainly a ploy to inflate their core counts.

    However, as mentioned above, the E-cores are a very area-efficient way to increase performance on multithreaded workloads. So, another cynical take on it would be that they're just a way to gin up some of the benchmark numbers.
  • Oxford Guy - Friday, August 20, 2021 - link

    ‘Intel did confirm that the highest client power, presumably on the desktop processor, will be 125 W.’

    Will that be 125 actual watts or the traditional fantasy watts of claimed TDP for Intel desktop CPUs?
  • StoykovK - Saturday, August 21, 2021 - link

    I think this is the TDP as PL1. In my opinion, PL2 should be quite a bit higher.
  • Spunjji - Tuesday, August 24, 2021 - link

    Around 228W, apparently
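    For anyone curious what their own board actually programs: on Linux, the powercap RAPL driver exposes the package limits in sysfs, where constraint_0 is conventionally the long-term limit (PL1) and constraint_1 the short-term one (PL2). A minimal sketch under those assumptions (the paths and the intel-rapl:0 package domain may differ per system):

        #include <stdio.h>

        // Read a power limit (in microwatts) from the powercap sysfs tree.
        static long read_uw(const char *path) {
            long v = -1;
            FILE *f = fopen(path, "r");
            if (f) {
                if (fscanf(f, "%ld", &v) != 1) v = -1;
                fclose(f);
            }
            return v;
        }

        int main(void) {
            const char *base = "/sys/class/powercap/intel-rapl/intel-rapl:0";
            char path[128];

            snprintf(path, sizeof(path), "%s/constraint_0_power_limit_uw", base);
            long pl1 = read_uw(path);   // long-term limit, i.e. the "125 W TDP" figure
            snprintf(path, sizeof(path), "%s/constraint_1_power_limit_uw", base);
            long pl2 = read_uw(path);   // short-term turbo limit (PL2)

            if (pl1 < 0 || pl2 < 0) {
                fprintf(stderr, "powercap RAPL interface not available here\n");
                return 1;
            }
            printf("PL1: %.1f W\nPL2: %.1f W\n", pl1 / 1e6, pl2 / 1e6);
            return 0;
        }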
  • taisingera - Friday, August 20, 2021 - link

    My takeaway from this is that Microsoft is dictating that to use Windows 11 you need at least an 8th-gen Intel CPU, and Intel is dictating that to use Alder Lake you need Windows 11 for Intel Thread Director.
  • Gradius2 - Friday, August 20, 2021 - link

    WHEN will this be out? September?
  • Jp7188 - Sunday, August 22, 2021 - link

    Re: Thread Director, I'm cringing at the security implications of a microcontroller that can spy on every thread and is OS-accessible. I'm thinking this is likely the reason it's closed-source/Windows-only for now. When/if it gets open-sourced, it's going to be a hacker's field day.
  • mode_13h - Monday, August 23, 2021 - link

    Yeah, it'll be an exploit gold mine, if someone works out how to access it from userspace, particularly the profiling information on other running threads.
  • mode_13h - Monday, August 23, 2021 - link

    > I'm thinking this is likely to be the reason it's closed-source/Windows only for now.

    Security by obscurity didn't protect the ME. No reason to think it'll save the TD, either.
  • iranterres - Monday, August 23, 2021 - link

    LOL another socket. Another pinout....
  • mode_13h - Tuesday, August 24, 2021 - link

    And what's surprising about that? LGA1200 was introduced with Comet Lake and supported by Rocket Lake. Intel's standard socket lifespan is 2 generations. This just continues the trend.

    That said, I've heard rumors that LGA1700 could be supported for 3 generations. I wouldn't count on it, but it'd be nice to see them opt for a bit more longevity.
  • MDD1963 - Tuesday, August 24, 2021 - link

    "... if Intel wants to push AVX-512 again, it will have a *Sisyphean* task to convince everyone it’s what the industry needs."

    <sigh...!>
  • Timoo - Tuesday, August 24, 2021 - link

    So here comes the clusterf*ck.
    Intel still has serious leverage over Microsoft:

    "Intel Threadripper Thread Director Technology

    This new technology is a combined hardware/software solution that Intel has engineered with Microsoft focused on Windows 11."

    Which means that you have to upgrade to 11 in order to use the full potential of Alder Lake. Which in turn means that, if AMD comes up with a similar solution down the line, it is obliged to go to 11 as well.

    Having enough trouble as it is with 10 upgrades, I have a hard time thinking about upgrading to 11. And I do not want to buy new hardware just for the sake of 11. I want to buy new hardware to make 10 run more smoothly, with those 50+ browser threads open all the time.
  • MobiusPizza - Wednesday, August 25, 2021 - link

    "Intel has improved the prefetchers, nothing things such as “better stride prefetching in L1”, though beyond that the company hasn’t divulged much other details. "

    nothing -> noting
  • vikas.sm - Friday, August 27, 2021 - link

    8.8-24 is more intuitive than 8C8c24T.
    M4rK3tin8 FaIL🤪
  • vikas.sm - Friday, August 27, 2021 - link

    Suggestion for SFF motherboard vendors:
    1. Include on-board RAM.
    2. Also include tilted/flush SO-DIMM slots.
  • mode_13h - Saturday, August 28, 2021 - link

    > Include on-board RAM.

    No. It'll probably be something cheap, and maybe not as much or as fast as you'd like. And if it goes bad, you basically have to toss the whole board.

    I hate it when laptops have soldered RAM.
  • MichuR - Thursday, November 3, 2022 - link

    "AVX is 128-bit, AVX2 is 256-bit"
    I'm sorry, but at all. Both AvX and AVX2 are 256bit. And no, AVX2 is not 'better' AVX. AVX2 just supports different set of instructions - in huge simplification AVX floats, AVX2 integers. 128bit operations were introduced by SSE in late 90s
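    A quick illustration of that split, assuming GCC/Clang intrinsics (build with -mavx2): the 256-bit float add needs only AVX, while the 256-bit integer add did not exist until AVX2.

        #include <immintrin.h>

        // 256-bit float math: already part of the original AVX (Sandy Bridge, 2011).
        __m256 add_floats(__m256 a, __m256 b) {
            return _mm256_add_ps(a, b);     // vaddps ymm - AVX
        }

        // 256-bit integer math: only arrived with AVX2 (Haswell, 2013).
        __m256i add_ints(__m256i a, __m256i b) {
            return _mm256_add_epi32(a, b);  // vpaddd ymm - AVX2
        }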
