Original Link: https://www.anandtech.com/show/11464/intel-announces-skylakex-bringing-18core-hcc-silicon-to-consumers-for-1999



There are days in this profession in which I am surprised. The longer I stay in the technology industry, the further apart those days become. There are several reasons to be surprised: someone comes out of the blue with a revolutionary product and the ecosystem/infrastructure to back it up, or a company goes above and beyond a recent mediocre pace to take on the incumbents (with or without significant financial backing). Another reason is confusion, as to why such a product would ever be thought of, and another still is seeing how one company reacts to another.

We’ve been expecting the next high-end desktop version of Skylake for almost 18 months now, and fully expected it to be an iterative update over Broadwell-E: a couple more cores, a few more dollars, a new socket, and done. Intel has surprised us with at least two of the reasons above: Skylake-X will increase the core count of Intel’s HEDT platform from 10 to 18.

The Skylake-X announcement is a lot to unpack, and there are several elements to the equation. Let’s start with familiar territory: the first half of the processor launch.

Announcement One: Low Core Count Skylake-X Processors

The last generation, Broadwell-E, offered four processors: two six-core parts, an eight-core part, and a top-tier 10-core processor. The main difference between the two six-core parts was the PCIe lane count, and aside from the hike in pricing for the top-end SKU, these were iterative updates over Haswell-E: two more cores for the top processor.

This strategy from Intel is derived from what the company internally calls its ‘LCC’ core, standing for ‘low core count’. The enterprise line from Intel has three designs for its silicon – a low core count, a high core count, and an extreme core count: LCC, HCC, and XCC respectively. All the processors in the enterprise line are typically made from these three silicon designs: a 10-core LCC silicon die, for example, can have two cores disabled to become an 8-core part. Or a 22-core XCC die can have all but four cores disabled, but still retain access to all the L3 cache, giving an XCC processor with a massive cache structure. For the consumer HEDT platforms, such as Haswell-E and Broadwell-E, the processors made public were all derived from the LCC silicon.

The first half of the Skylake-X processor lineup follows this trend. Intel will launch four Skylake-X processors based on the LCC die, which for this platform will have a maximum of 12 cores. All processors will have hyperthreading.

Skylake-X Processors (Low Core Count Chips)

|                 | Core i7-7800X | Core i7-7820X | Core i9-7900X | Core i9-7920X |
|-----------------|---------------|---------------|---------------|------------------------|
| Cores/Threads   | 6/12          | 8/16          | 10/20         | 12/24                  |
| Base Clock      | 3.5 GHz       | 3.6 GHz       | 3.3 GHz       | TBD                    |
| Turbo Clock     | 4.0 GHz       | 4.3 GHz       | 4.3 GHz       | TBD                    |
| TurboMax Clock  | N/A           | 4.5 GHz       | 4.5 GHz       | TBD                    |
| L3              | 8.25 MB       | 11 MB         | 13.75 MB      | TBD (likely 13.75 MB)  |
| PCIe Lanes      | 28            | 28            | 44            | TBD (likely 44)        |
| Memory Channels | 4             | 4             | 4             | 4                      |
| Memory Freq     | DDR4-2400     | DDR4-2666     | DDR4-2666     | TBD                    |
| TDP             | 140W          | 140W          | 140W          | TBD                    |
| Price           | $389          | $599          | $999          | $1199                  |

The bottom processor is the Core i7-7800X, running at 3.5 GHz with a 4.0 GHz turbo. This design will not feature Intel’s new ‘favored core’ Turbo 3.0 technology (more on that below), but will have six cores, support quad-channel memory at DDR4-2400, come in at a TDP of 140W, have 28 PCIe lanes, and retail for around $400. This processor will be the entry level model, for any user who needs the benefit of quad-channel memory but perhaps doesn’t need a two-digit number of cores or has a more limited budget.

Next up is the Core i7-7820X, which hits a potential sweet spot in the LCC design. This is an eight-core processor, with the highest LCC base clock of 3.6 GHz and the joint-highest turbo settings: 4.3 GHz for regular turbo and 4.5 GHz for favored core. Unlike the previous processor, this CPU gets support for DDR4-2666 memory.

However, in another break from Intel’s regular strategy, this CPU will only support 28 PCIe lanes. Normally only the lowest CPU of the HEDT stack would be adjusted in this way, but Intel is using the PCIe lane allocation as another differentiator as a user considers which processor in the stack to go for. This CPU also carries a 140W TDP, and comes in at $600. At this price, we would expect it to compete directly against AMD’s Ryzen 7 1800X, which will be the equivalent of a generation behind in IPC but $100 cheaper.

Comparison: Core i7-7820X vs. Ryzen 7 1800X

| Intel Core i7-7820X          | Features        | AMD Ryzen 7 1800X |
|------------------------------|-----------------|-------------------|
| 8 / 16                       | Cores/Threads   | 8 / 16            |
| 3.6 / 4.3 GHz (4.5 GHz TMax) | Base/Turbo      | 3.6 / 4.0 GHz     |
| 28                           | PCIe 3.0 Lanes  | 16                |
| 11 MB                        | L3 Cache        | 16 MB             |
| 140 W                        | TDP             | 95 W              |
| $599                         | Price (MSRP)    | $499              |

The third processor is also a change for Intel. Here is the first processor bearing the new Core i9 family. Previously we had Core i3, i5 and i7 for several generations. This time out, Intel deems it necessary to add another layer of differentiation in the naming, so the Core i9 naming scheme was the obvious choice. If we look at what the Core i9 name brings to the table, the obvious improvement is PCIe lanes: Core i7 processors will have 28 PCIe lanes, while Core i9 processors will have 44 PCIe lanes. This makes configuring an X299 motherboard a little difficult: see our piece on X299 to read up on why.

Right now the Core i9-7900X is the only Core i9 with any details: this is a ten-core processor, running with a 3.3 GHz base, a 4.3 GHz turbo and a 4.5 GHz favored core turbo. Like the last processor, it will support DDR4-2666 and has a TDP of 140W. At this level, Intel is now going to charge $100/core, so this 10-core part comes in at a $999 tray price ($1049 retail likely).

One brain cell to twitch when reading this specification is the price. For Ivy Bridge-E, the top SKU was $999 for six cores. For Haswell-E, the top SKU was $999 for eight cores. For Broadwell-E, we expected the top 10-core SKU to be $999, but Intel pushed the price up to $1721, due to the way the enterprise processors were priced. For Skylake-X, that enterprise-derived pricing scheme has been scrapped again: this 10-core part is now $999, which is what we expected the Broadwell-E based Core i7-6950X to be. It isn’t the top SKU, but the pricing comes back down to reasonable levels.

Meanwhile for the initial launch of Skylake-X, it is worth noting that this 10-core CPU, the Core i9-7900X, will be the first one available to purchase. More on that later.

Still covering the LCC core designs, the final processor in this stack is the Core i9-7920X. This processor will be coming out later in the year, likely during the summer, but it will be a 12-core processor on the same LGA2066 socket for $1199 (retail ~$1279), being part of the $100/core mantra. We are told that Intel is still validating the frequencies of this CPU to find a good balance of performance and power, although we understand that it might be 165W rather than 140W, as Intel’s pre-briefing explained that the whole X299 motherboard set should be ready to support 165W processors.

In the enterprise space, or at least in previous generations, Intel has always had that processor that consumed more power than the rest. This was usually called the ‘workstation’ processor, designed to be in a single or dual socket design but with a pumped up frequency and price to match. In order for Intel to provide this 12-core processor to customers, as the top end of the LCC silicon, it has to be performant, power efficient, and come in at reasonable yields. There’s a chance that not all the factors are in place yet, especially if they come out with a 12-core part that is clocked high and could potentially absorb some of their enterprise sales.

Given the expected timing and launch for this processor, as mentioned we were expecting mid-summer, that would have normally put the crosshairs into Intel’s annual IDF conference in mid-August, although that conference has now been canned. There are a few gaming events around that time to which Intel may decide to align the launch to.



Announcement Two: High Core Count Skylake-X Processors

The twist in the story of this launch comes with the next batch of processors. In our pre-briefing came something unexpected: Intel is bringing the high core count silicon from the enterprise side down to consumers. I’ll cover the parts and then discuss why this is happening.

The HCC die for Skylake is set to be either 18 or 20 cores. I say or, because there’s a small issue with what we had originally thought. If you had asked me six months ago, I would have said that the upcoming HCC core, based on some information I had and a few sources, would be an 18-core design. As with other HCC designs in previous years, while the LCC design is a single ring bus around all the cores, the HCC design would offer a dual ring bus, potentially lopsided, but designed to keep the average L3 cache latency reasonable with so many cores, without becoming one big racetrack (insert joke about Honda race engines). Despite this, Intel shared a die image of the upcoming HCC implementation, as in this slide:

It is clear that there are repeated segments: four rows of five, indicating the presence of a dual ring bus arrangement. A quick glance might suggest a 20 core design, but if we look at the top and bottom segments of the second column from the left: these cores are designed slightly differently. Are these actual cores? Are they different because they support AVX-512 (a topic discussed later), or are they non-cores, providing die area for something else? So is this an 18-core silicon die or a 20-core silicon die? We’ve asked Intel for clarification, but we were told to await more information when the processor is launched. Answers on a tweet @IanCutress, please.

So with the image of the silicon out of the way, here are the three parts that Intel is planning to launch. As before, all processors support hyperthreading.

Skylake-X Processors (High Core Count Chips)

|               | Core i9-7940X   | Core i9-7960X   | Core i9-7980XE  |
|---------------|-----------------|-----------------|-----------------|
| Cores/Threads | 14/28           | 16/32           | 18/36           |
| Clocks        | TBD             | TBD             | TBD             |
| L3            | TBD             | TBD             | TBD             |
| PCIe Lanes    | TBD (likely 44) | TBD (likely 44) | TBD (likely 44) |
| Memory Freq   | TBD             | TBD             | TBD             |
| TDP           | TBD             | TBD             | TBD             |
| Price         | $1399           | $1699           | $1999           |

As before, let us start from the bottom of the HCC processors. The Core i9-7940X will be a harvested HCC die, featuring fourteen cores, running in the same LGA2066 socket, and will have a tray price of $1399, mimicking the $100/core strategy as before, but likely being around $1449-$1479 at retail. No numbers have been provided for frequencies, turbo, power, DRAM or PCIe lanes, although we would expect DDR4-2666 support and 44 PCIe lanes, given that it is a member of the Core i9 family.

Next up is the Core i9-7960X, which is perhaps the name we would have expected from the high-end LCC processor. As with the 14-core part, we have almost no information except the cores (sixteen for the 7960X), the socket (LGA2066) and the price: $1699 tray ($1779 retail?). Reiterating, we would expect this to support at least DDR4-2666 memory and 44 PCIe lanes, but unsure on the frequencies.

The Core i9-7980XE sits atop of the stack as the halo part, looking down on all those beneath it. Like an unruly dictator, it gives nothing away: all we have is the core count at eighteen, the fact that it will sit in the LGA2066 socket, and the tray price at a rather cool $1999 (~$2099 retail). When this processor will hit the market, no-one really knows at this point. I suspect even Intel doesn’t know.

Analysis: Why Offer HCC Processors Now?

The next statement shouldn’t be controversial, but some will see it this way: AMD and ThreadRipper.

ThreadRipper is AMD’s ‘super high-end desktop’ processor, going above the eight cores of the Ryzen 7 parts with a full sixteen cores of their high-end microarchitecture. Where Ryzen 7 competed against Broadwell-E, ThreadRipper has no direct competition, unless we look at the enterprise segment.

Just to be clear, Skylake-X as a whole is not a response to ThreadRipper. Skylake-X, as far as we understand, was expected to be LCC only: up to 12 cores and sitting happy. Compared to AMD’s Ryzen 7 processors, Intel’s Broadwell-E had an advantage in the number of cores, the size of the cache, the instructions per clock, and enjoyed high margins as a result. Intel had the best, and could charge more. (Whether you thought paying $1721 for a 10-core BDW-E made sense compared to a $499 8-core Ryzen with fewer PCIe lanes, is something you voted on with your wallet). Pretty much everyone in the industry, at least the ones I talk to, expected more of the same. Intel could launch the LCC version of Skylake-X, move up to 12-cores, keep similar pricing and reap the rewards.

When AMD announced ThreadRipper at the AMD Financial Analyst Day in early May, I fully suspect that the Intel machine went into overdrive (if not before). If AMD had a 16-core part in the ecosystem, even at an IPC deficit of 5-15% to Intel, it would be likely that Intel’s 12-core part would no longer be the halo product. Other factors come into play of course, as we don’t know all the details of ThreadRipper, such as frequencies, and Intel has a much wider ecosystem of partners than AMD. But Intel sells A LOT of its top-end HEDT processors. I wouldn’t be surprised if the 10-core $1721 part was the bestselling Broadwell-E processor. So if AMD took that crown, Intel would lose a position it has held for a decade.

So imagine the Intel machine going into overdrive. What would be going through their heads? Competing on performance-per-dollar? Pushing frequencies? Back in the days of the frequency race, you could just slap a new TDP on a processor and bin harder. In a core count race, you need physical cores to provide that performance, unless you have a 33%+ IPC advantage. I suspect the only way to provide a competing product was to bring the HCC silicon to consumers.

Of course, I would suspect that inside Intel there was push back. The HCC (and XCC) silicon is the bread and butter of the company’s server line. By offering it to consumers, there is a chance that the business Intel normally gets from small and medium businesses, or those that buy single or double-digit numbers of systems, might decide to save a lot of money by going the consumer route. There would be no feasible way for Intel to sell HCC-based processors to end-users at enterprise pricing and expect everyone to be happy.

Knowing what we know about working with Intel for many years, I suspect that the HCC was the most viable option. They could still sell a premium part, and sell lots of them, but the revenue would shift from enterprise to consumer. It would also knock back any threat from AMD if the ecosystem comes into play as well.

As it stands, Intel has two processors lined up to take on ThreadRipper: the sixteen-core Core i9-7960X at $1699, and the eighteen-core Core i9-7980XE at $1999. A ThreadRipper design is two eight-core Zeppelin silicon designs in the same package – a single Zeppelin has a TDP of 95W at 3.6 GHz to 4.0 GHz, so two Zeppelin dies together could have a TDP of 190W at 3.6 GHz to 4.0 GHz, though we know that AMD’s top silicon is binned heavily, so it could easily come down to 140W at 3.2-3.6 GHz. This means that Intel is going to have to compete with those sorts of numbers in mind: if AMD brings ThreadRipper out to play at around 140W at 3.2 GHz, then the two Core i9s I listed have to be there as well. Typically Intel doesn’t clock all the HCC processors that high, unless they are the super-high end workstation designs.

So despite an IPC advantage and an efficiency advantage in the Skylake design, Intel has to press every button here. Another unknown is AMD’s pricing. What would happen if ThreadRipper comes out at $999-$1099?

But I ask our readers this:

Do you think Intel would be launching consumer grade HCC designs for HEDT if ThreadRipper didn’t exist?

For what it is worth, kudos all around. AMD for shaking things up, and Intel for upping the game. This is what we’ve missed in consumer processor technology for a number of years.

(To be fair, I predicted AMD’s 8-core to be $699 or so. To see one launched at $329 was a nice surprise).

I’ll add another word that is worth thinking about. AMD’s ThreadRipper uses two Zeppelin dies, each with two CCXes of four cores apiece. As observed in Ryzen, the cache-to-cache latency when a core needs data in other parts of the cache is not consistent. Intel’s HCC silicon designs, if they implement a dual-ring bus, have similar issues due to the way that cores are grouped. For users that have heard of NUMA (non-uniform memory access), it is a tricky thing to code for and even trickier to code well for, and almost all the software that supports NUMA is enterprise grade. With both of these designs coming to consumers, and next-to-zero NUMA-aware code in consumer applications (including games), there might be a learning period in performance. Either that, or we will see software pinning itself to particular groups of cores in order to evade the issue entirely.
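On Linux, for example, a process can restrict itself to a subset of logical CPUs so its threads stay within one core group and avoid cross-group cache traffic. Here is a minimal sketch using Python’s standard library (Linux-only; picking the "first two CPUs" as the local group is an arbitrary illustration, not a rule for finding a real CCX or ring segment):

```python
import os

# Restrict this process to (up to) the first two logical CPUs it is
# currently allowed to run on - a stand-in for "one core cluster".
# os.sched_setaffinity is Linux-specific; other OSes need other APIs.
allowed = sorted(os.sched_getaffinity(0))   # CPUs we may currently use
group = set(allowed[:2])                    # pick a small, local group
os.sched_setaffinity(0, group)              # pin to that group

print(sorted(os.sched_getaffinity(0)))      # now limited to the group
```

Tools like `taskset` and `numactl` do the same thing from the command line, which is how enterprise NUMA-aware deployments often handle it today.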



Announcement Three: Skylake-X's New L3 Cache Architecture

(AKA I Like Big Cache and I Cannot Lie)

SKU madness aside, there's more to this launch than just the number of cores at what price. Deviating somewhat from their usual pattern, Intel has made some interesting changes to several elements of Skylake-X that are worth discussing. Next is how Intel is implementing the per-core cache.

In previous generations of HEDT processors (as well as the Xeon processors), Intel implemented a three-stage cache before hitting main memory. The L1 and L2 caches were private to each core, while the L3 cache was a last-level cache covering all cores, and was inclusive of the L2. This, at a high level, means that any data in L2 is duplicated in L3, such that if a cache line is evicted from L2 it will still be present in the L3 if it is needed, rather than requiring a trip all the way out to DRAM. The sizes of the caches are important as well: with an L3 inclusive of the L2, the L3 is usually several multiples of the L2 in order to store all the L2 data plus more besides. Intel typically had 256 KB of L2 cache per core, and anywhere between 1.5 MB and 3.75 MB of L3 per core, which gave both caches plenty of room and performance. It is worth noting at this point that the L2 cache is closer to the logic of the core, where space is at a premium.

With Skylake-X, this cache arrangement changes. When Skylake-S was originally launched, we noted that the L2 cache had a lower associativity as it allowed for more modularity, and this is that principle in action. Skylake-X processors will have their private L2 cache increased from 256 KB to 1 MB, a four-fold increase. This comes at the expense of the L3 cache, which is reduced from ~2.5MB/core to 1.375MB/core.

With such a large L2 cache, the L3 is no longer inclusive of the L2, but ‘non-inclusive’. Intel is using this terminology rather than ‘exclusive’ or ‘fully-exclusive’, as the L3 will still have some features that aren’t present in a pure victim cache, such as prefetching. What this will mean, however, is more work snooping and keeping track of where cache lines are: cores will snoop other cores’ L2 to find updated data, with the DRAM as a backup (which may be out of date). In previous generations the L3 cache was always a backup, but now this changes.
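As a toy illustration of the difference (my own sketch, not Intel’s implementation): in an inclusive hierarchy, evicting a line from the L3 must also back-invalidate any copy in a private L2, whereas a non-inclusive L3 can drop the line while the owning core keeps its L2 copy – which is exactly why other cores must then snoop that L2 to find the data.

```python
def evict_from_l3(l3, l2, line, inclusive):
    """Model an L3 eviction; return whether the private L2 still holds the line."""
    l3.discard(line)
    if inclusive:
        l2.discard(line)        # inclusive L3: back-invalidate the L2 copy
    return line in l2

# Inclusive: the private copy is lost along with the L3 line.
print(evict_from_l3({"A"}, {"A"}, "A", inclusive=True))   # False
# Non-inclusive: the core keeps its L2 copy; peers must snoop for it.
print(evict_from_l3({"A"}, {"A"}, "A", inclusive=False))  # True
```

The trade-off in one line: inclusion simplifies coherence (the L3 always knows what the L2s hold) at the cost of duplicated capacity; non-inclusion recovers capacity at the cost of extra snoop traffic.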

The good element of this design is that a larger L2 will increase the hit rate and decrease the miss rate. Depending on the level of associativity (which has not been disclosed yet, at least not in the basic slide decks), a general rule I have heard is that doubling the cache size decreases the miss rate by a factor of sqrt(2), and is good for a 3-5% IPC uplift in a regular workflow. Thus here’s a conundrum for you: if the quadrupled L2 roughly halves the miss rate, leading to an 8-13% IPC increase, it’s not the same performance as Skylake-S. It may be the same microarchitecture outside the caches, but we get a situation where performance will differ.
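Running the rule of thumb through the numbers (the 10% starting miss rate is an arbitrary assumption for illustration; the rule itself is only an approximation, so the result lands near, not exactly on, the range quoted above):

```python
from math import sqrt

# Rule of thumb: each doubling of cache size cuts the miss rate by a
# factor of sqrt(2) and is worth roughly 3-5% IPC in a typical workload.
def miss_rate_after(miss_rate, doublings):
    return miss_rate / (sqrt(2) ** doublings)

# Skylake-X grows the private L2 from 256 KB to 1 MB: two doublings.
base_miss = 0.10                        # assume a 10% L2 miss rate
new_miss = miss_rate_after(base_miss, 2)
print(round(new_miss, 3))               # two doublings halve the miss rate

# Compounding the per-doubling IPC uplift twice:
low, high = 1.03 ** 2 - 1, 1.05 ** 2 - 1
print(f"{low:.1%} to {high:.1%}")       # roughly 6% to 10% IPC
```

Of course, real workloads rarely follow a square-root curve exactly, which is why the in-depth testing mentioned below matters.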

Fundamental Realisation: Skylake-S IPC and Skylake-X IPC will be different.

This is something that fundamentally requires in-depth testing. Combine this with the change in the L3 cache, and it is hard to predict the outcome without being a silicon design expert. I am not one of those, but it's something I want to look into as we approach the actual Skylake-X launch.

More things to note on the cache structure. There are many ways to implement it, one of which, as I initially imagined, is a partitioned cache strategy. The cache layout could be the same as previous generations, but with partitions of the L3 designated as L2. This makes life difficult, because then you have a partition of the L2 at the same latency as the L3, and that brings a lot of headaches if the L2 latency has a wide variation. This method would be easy for silicon layout, but hard to implement well. Looking at the HCC silicon representation in our slide deck, it’s clear that there is no monolithic L3 covering all the cores – each core has its own partition. That being the case, we now have an L2 at approximately the same size as the L3, at least per core. Given these two points, I fully suspect that Intel is running a physical L2 at 1MB, which will give the design the high hit rate and consistent low latency it needs. This will be one feather in the cap for Intel.



Announcement Four: AVX-512 & Favored Core

To complete the set, there are a couple of other points worth discussing. First up is the AVX-512 support coming to Skylake-X. Intel has implemented AVX-512 (or at least a variant of it) in the last generation of Xeon Phi processors, Knights Landing, but this will be the first implementation in a consumer/enterprise core.

Intel hasn’t given many details on AVX-512 yet, regarding whether there is one or two units per CPU, or if it is more granular and is per core. We expect it to be enabled on day one, although I have a suspicion there may be a BIOS flag that needs enabling in order to use it.

As with AVX and AVX2, the goal here is to provide a powerful set of hardware to solve vector calculations. The silicon that does this is dense, so sustained calculations run hot: we’ve seen processors that support AVX and AVX2 offer decreased operating frequencies when these instructions come along, and AVX-512 will be no different. Intel has not clarified at what frequency the AVX-512 instructions will run, although if each core can support AVX-512 we suspect that the reduced frequency will only affect that core.

With the support of AVX-512, Intel is calling the Core i9-7980XE ‘the first TeraFLOP CPU’. I’ve asked for details as to how this figure is calculated (software, or theoretical), but it does mark a milestone in processor design. We are muddying the waters a bit here though: an AVX unit does vector calculations, as does a GPU. We’re talking about parallel compute processes completed by dedicated hardware – the line between a general purpose CPU and anything else is getting blurred.
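A back-of-the-envelope estimate shows how a theoretical figure like this could be reached. To be clear, the FMA unit count and the AVX-512 clock below are my assumptions, since Intel has disclosed neither; only the core count and vector width are known:

```python
# Theoretical peak FLOPS = cores x FMA units x FLOPs per FMA
#                          x vector lanes x clock (GHz)
cores = 18                # Core i9-7980XE core count (announced)
fma_units = 2             # ASSUMED AVX-512 FMA units per core
flops_per_fma = 2         # a fused multiply-add counts as 2 FLOPs
fp64_lanes = 8            # a 512-bit vector holds 8 doubles
freq_ghz = 2.0            # ASSUMED all-core AVX-512 frequency

gflops = cores * fma_units * flops_per_fma * fp64_lanes * freq_ghz
print(gflops)             # ~1.15 TFLOPS double precision
```

Under those assumptions the chip clears one TFLOP at double precision even at a heavily throttled AVX-512 clock; with single-precision (16 lanes) the bar is lower still, which is presumably how the marketing figure works.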

Favored Core

For Broadwell-E, the last generation of Intel’s HEDT platform, we were introduced to the term ‘Favored Core’, which was given the title of Turbo Boost Max 3.0. The idea here is that each piece of silicon that comes off of the production line is different (which is then binned to match to a SKU), but within a piece of silicon the cores themselves will have different frequency and voltage characteristics. The one core that is determined to be the best is called the ‘Favored Core’, and when Intel’s Windows 10 driver and software were in place, single threaded workloads were moved to this favored core to run faster.

In theory, it was good – a step above the generic Turbo Boost 2.0 and offered an extra 100-200 MHz for single threaded applications. In practice, it was flawed: motherboard manufacturers didn’t support it, or they had it disabled in the BIOS by default. Users had to install the drivers and software as well – without the combination of all of these at work, the favored core feature didn’t work at all.

Intel is changing the feature for Skylake-X, with an upgrade and for ease of use. The driver and software are now part of Windows Update, so users will get them automatically (if you don’t want them, you have to disable them manually). With Skylake-X, instead of one favored core there are two, so two apps can run at the higher frequency, or one app that needs two fast cores can use both.

Availability

Last but not least, let's talk about availability. Intel will likely announce availability during the keynote at Computex, which is going on at the same time as this news post goes live. The launch date should be sooner rather than later for the LCC parts, although the HCC parts are unknown. But no matter what, I think it's safe to say that by the end of this summer, we should expect a showdown over the best HEDT processor around.
