Well, you wouldn't be able to access the foreign data in your cache simply because it won't match any of the physical addresses you have access to after address translation. Really, it's no more strange than the fact that you're sharing DRAM with other cloud customers.
> Are you saying the L2 is associative to every core in the system?
This is already the case for asserting exclusive ownership of a cacheline, which involves a broadcast to all other cache blocks to force them to evict any copy they might have. This is done upon the first attempt to write into a cacheline. ISA-specific memory ordering rules potentially prevent the write from commencing before the evictions are performed.
Similarly, when a cacheline is fetched, even for reading, the requestor must ensure that no other core has exclusive ownership of it.
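To make that "broadcast before the first write" idea concrete, here is a minimal MESI-style sketch in C++. It only illustrates the general coherence pattern being described, not IBM's actual protocol; the state names, core count and function names are assumptions for the example.

```cpp
#include <array>
#include <cstddef>

// Simplified MESI states; Invalid is the default (zero) state.
enum class State { Invalid, Shared, Exclusive, Modified };

constexpr std::size_t kNumCores = 8;

// Tracks one cacheline's state in every core's private cache.
struct LineStates {
    std::array<State, kNumCores> s{};   // value-initialized: all Invalid
};

// First write from `core`: broadcast an invalidation so every other copy is
// dropped, then take the line as Modified.  ISA memory-ordering rules decide
// whether the store may proceed before the acknowledgements return; here we
// model the simple case where it must wait.
void request_exclusive(LineStates& line, std::size_t core) {
    for (std::size_t i = 0; i < kNumCores; ++i) {
        if (i != core && line.s[i] != State::Invalid) {
            // A Modified copy elsewhere would be written back before dropping.
            line.s[i] = State::Invalid;
        }
    }
    line.s[core] = State::Modified;
}

// Read miss: fetch a shared copy, but only after any exclusive/modified
// holder has downgraded (writing back its data if it was Modified).
void request_shared(LineStates& line, std::size_t core) {
    for (std::size_t i = 0; i < kNumCores; ++i) {
        if (line.s[i] == State::Exclusive || line.s[i] == State::Modified) {
            line.s[i] = State::Shared;
        }
    }
    line.s[core] = State::Shared;
}

int main() {
    LineStates line;
    request_shared(line, 0);      // core 0 reads
    request_exclusive(line, 1);   // core 1's first write invalidates core 0's copy
}
```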
More so because some clients will be accessing more than a single core, or perhaps a single CPU block. So for those more demanding cloud tasks of the new age, this is going to save some latency. And that's the biggest hurdle in cloud computing.
However, this thing is going to be a security nightmare: to code it "just right" and look for thousands of edge cases for possible attacks. Difficult, but not impossible.
The thing that will be impossible is efficiency. Since the L2 caches are all unified, they must always be on and firing together, in a sense, and possibly running at a locked, high frequency. So for even the lightest form of processing, there won't be any speed increases, but the energy use would be much, much higher. In other words, this is a "true" server/desktop. This concept won't work in a DC-battery device like a thick gaming laptop, let alone a thin and light x86 tablet.
PS: This article doesn't mention it, but what ISA/architecture are these cores based on? x86-64, or ARMv9, or RISC-V, or is it the (abandoned?) PowerPC architecture that we saw in the GameCube...
The architecture dates back to the S/360 from the 1960s: true mainframe heritage. The power consumption angle isn't much of a concern for this market. Clock speed is going to be 5.5 GHz at the top, matching some previous mainframe chips.
The security aspect may not be much of a concern due to the system's memory encryption capabilities. The real question is *when* contents get decrypted in the pipeline. Evicting cachelines out of the local L2 to a neighboring CPU's cache may require the contents to be encrypted. So even if the local L2 cache mapping boundaries were broken to gain access to what is being used as a virtual L3, the contents would be obscured. To use the virtual L3 cache data you'd more than likely have to get the encryption key being used by the other CPU core.
> In other words, this is a "true" server/desktop.
A *single* machine can occupy the entirety of a pair of specialized rack cabinets, offer liquid cooling as an option, can hold 40TB of memory (in the previous z15 version), and use multiple 3-phase power cords, each about the size of a fat garden hose. Oh, and cost several $M. So yes, they are "true" servers, whatever that means.
Let's just say that nobody is losing sleep about the power bill when the single box is enough to process the transactions for an entire national stock exchange or a megabank.
> PS: This article doesn't mention it, but what ISA/architecture are these cores based on? x86-64, or ARMv9, or RISC-V, or is it the (abandoned?) PowerPC architecture that we saw in the GameCube...
Welcome to the Big Leagues. At the core it uses the same instruction set as the S/360, which was released in 1964, and it's the one that has pretty much run world commerce since then.
"At the core it uses the same instruction set as the S/360, which was released in 1964"
sort of. like X86, the 360 ISA has gained many extensions since 1965, but yes, it's a godzilla CISC machine. not sure when the 'machine' morphed to a 'microprocessor' implementation. Big Blue's mini-ish machines were micro-based no later than 1990 (RS/6000, and maybe AS/400). it's been said, haven't confirmed, that 360/DOS applications will still execute on z machines, which is much more than a hardware issue, but largely one of OS support.
Modern x86 designs are RISC-like internally, and have been for decades - for Intel since the Pentium Pro (the P6 design later used in the Pentium II), and for AMD since the K5. They use a CISC-to-RISC decoder, along with many extensions and more modern compilers. The CISC opcode overhead is for the most part minimal and doesn't negatively affect performance, though the opcode decode for the CISC-to-RISC translation does take up die space that otherwise wouldn't be there.
I wouldn't be surprised if z/Architecture has also gone down the route of being a RISC unit with opcode decoders, but I'm not going to dig into it.
Yeah I don't think we'll see this implemented in x86 in the near future. If we do, they probably will build L2-cache-heavy server-focused designs. Doesn't seem ideal for other applications, and wasting power in the cache means less power and thermal overhead elsewhere on the chip. But who knows, maybe a future version of this which is less aggressive might see wider adoption.
Also I'm pretty sure all the Z chips are z/Architecture. PPC was for consumer/prosumer systems mainly IIRC.
One thing to add is that as a Z-series mainframe, the platform supports memory encryption. The question I have is where the contents are decrypted before usage. My personal *guess* is that the L1 caches are decrypted for use while the L2 cache remains fully encrypted, making many of the recent side-channel attacks (some of which IBM's Z series was also vulnerable to) ineffective.
Depends on the point of the memory encryption. If you just want to avoid someone snooping memory by attaching a logic analyzer to the board, then it's good enough to do it in the memory controllers.
If the point is for guests to have private storage that even the kernel cannot see, then I guess it could only be decrypted into L1, as you say. Because, if L2 is shared, then any thread with privileges to access that memory address would see it as it appears in the shared cache.
Side channels through caches reveal the accessed address, not the cache contents. So if you want to protect against that with encryption, you have to encrypt the address.
When I read you talking about "virtual L3/L4 caches", I guessed that this is what the "virtual L3 cache" might mean, but I had no idea what the "virtual L4 cache" would be.
Is there any situation in which a CPU might have to access the DRAM of another socket? Even if accessing another module's SRAM is close in latency to accessing local DRAM, the L4 cache might prevent a CPU from having to pull data from another socket's DRAM.
If that's not relevant, then DRAM's inherent latency must be more of a bother than the latency of pulling the data all the way from the depths of Hades.
Oops, wrong place, fucking hell. I must've clicked the reply button without realizing it, and it just brought the comment I was writing. That's... annoying
Ignore this, was supposed to be its own thing and in the back of the room with the other latecomer comments.
Virtualization security layers will tag each instruction to its origin. This is effectively how Intel fixed Meltdown and Spectre in recent hardware. I assume every modern CPU design going forward will use metadata and instruction tagging to fix branch prediction hijacking and cache intrusion.
I'm not sure you're understanding the difference between cloud computing and renting a distinct machine. Even if cloud vendors were to bind a core to a process at the O/S level (which I strongly doubt they could do as routine and remain economically viable), you'd still have cache lines overlapping the memory of other users / processes whether there's a separate shared L3 or not.
The security risk you highlight is already there regardless of architecture, e.g. side-channel cache snooping attacks like Spectre.
This sounds more brittle than the z15. Yes, things look nice from a single-core perspective, but system robustness depends on how it behaves in the worst case, not the best case. Now we have a core with only 32MB of cache at all, a clogged-up ring bus trying to steal data from other cores' L2 caches, plus chip-to-chip links also clogged with similar traffic--with no benefit, as all the L2 is busy with the local processor.
And these L4 numbers start to look like main memory levels of latency. The path from core to L1 then L2, then L3, and finally to L4 only to find that "the data is in another castle" seems like a horrible failure mode.
Sounds a lot like bad programming, don't you think? In my field we have a problem between application-thinking and service-thinking. Monoliths versus agiles. Monoliths should be at the end-point, not in the middle layer.
I think it is comparable; if you program this in a service-minded way, with only the end-product being broadcast, there is no "clogging the bus".
> With that in mind, you might ask why we don’t see 1 GB L1 or L2 caches on a processor
With Genoa (that might have up to 96 cores according to some leaks) and stacked 3D V-Cache (which effectively triples amount of L3 cache) we might see 1.125 GiB of L3 cache in the not so distant future.
I've always wondered (link anyone?) how these designs come to be.
1 - it's a staggering bunch of maths understood only by the Einsteins of today
2 - it's a staggering bunch of spitballing on white boards by Joe Sixpack engineers
I think it starts with ideas, followed by models (i.e. spreadsheets or similar), followed by simulations.
Until you simulate it, you can't truly know how it will perform and where the bottlenecks will be. With a simulation, you can play with different parameters and make a more balanced design.
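As a toy illustration of that workflow, here is a small trace-driven cache model in C++ that sweeps capacity and reports hit rate. The geometry, trace and parameters are made up purely for demonstration; a real design team's simulators are vastly more detailed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <list>
#include <random>
#include <vector>

// Toy set-associative cache with LRU replacement; counts hits for a trace.
struct Cache {
    std::size_t sets, ways, line = 64;
    std::vector<std::list<std::uint64_t>> tags;  // per-set LRU list of tags
    std::size_t hits = 0, accesses = 0;

    Cache(std::size_t bytes, std::size_t ways_)
        : sets(bytes / 64 / ways_), ways(ways_), tags(sets) {}

    void access(std::uint64_t addr) {
        ++accesses;
        std::uint64_t block = addr / line;
        auto& set = tags[block % sets];
        std::uint64_t tag = block / sets;
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (*it == tag) {            // hit: move to MRU position
                set.erase(it);
                set.push_front(tag);
                ++hits;
                return;
            }
        }
        set.push_front(tag);             // miss: insert, evict LRU if full
        if (set.size() > ways) set.pop_back();
    }
};

int main() {
    std::mt19937_64 rng(42);
    std::vector<std::uint64_t> trace(1'000'000);
    for (auto& a : trace) a = (rng() % (1u << 22)) & ~63ull;   // ~4 MiB footprint

    for (std::size_t kib : {256, 1024, 8192, 32768}) {         // 256 KiB .. 32 MiB
        Cache c(kib * 1024, 8);
        for (auto a : trace) c.access(a);
        std::printf("%6zu KiB cache: hit rate %.1f%%\n",
                    kib, 100.0 * c.hits / c.accesses);
    }
}
```

Sweeping the capacity (or associativity, line size, replacement policy) against a realistic trace is exactly where the bottlenecks show up before anything is committed to silicon.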
I have always been fascinated by the hardware in IBM's big iron, it's on a whole other level. More content along this line would be appreciated.
Question: I know mainframes traditionally execute instructions twice to ensure accuracy/integrity. When you say 'core' in this context, does that truly represent a single core, or is that really a pair that execute the same instructions in parallel?
We used to have lockstep execution, but that's many generations ago. Nowadays we use error checking across the entire chip, where we apply parity checking on memory elements (SRAMs, registers etc) and data buses. We also perform either parity or legal state checking on control logic. When we detect an error, the core can go through a transparent recovery action, which is a complete core reset almost like a mini-reboot of the core, but the program state (program location, all the register content etc) gets recovered. So after recovery, we continue running wherever we were in the program. It's completely transparent to the software layers. When a core goes through recovery repeatedly we can even take "the brain dump" of the software state and dynamically and transparently transfer it to a spare core. That way we achieve outstanding resilience and availability against random bit flips or wear-out effects.
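A heavily simplified software analogy of that "mini-reboot with state restore", sketched in C++. The real machine does this in hardware for the full core state; the struct fields, the parity helper and the recovery flow here are illustrative assumptions only.

```cpp
#include <array>
#include <cstdint>

// A toy "architectural state": a few registers plus a program counter.
struct ArchState {
    std::array<std::uint64_t, 16> regs{};
    std::uint64_t pc = 0;
};

// Even parity over a value, standing in for the real parity/ECC protection
// on register files, SRAMs and buses.
bool even_parity(std::uint64_t v) {
    bool even = true;
    for (; v; v &= v - 1) even = !even;   // flip once per set bit
    return even;
}

struct Core {
    ArchState live;        // state the core is currently executing with
    ArchState checkpoint;  // last known-good architectural state

    void commit_instruction() {
        // ... execute, update live.regs / live.pc ...
        checkpoint = live;                 // kept transparently by hardware
    }

    // Detected error: reset internal (non-architectural) state, restore the
    // checkpoint, and resume at checkpoint.pc -- invisible to software.
    void recover() {
        live = checkpoint;
    }
};

int main() {
    Core c;
    c.commit_instruction();
    bool stored_parity = even_parity(c.live.regs[3]);  // protection bit
    c.live.regs[3] ^= 1;                               // a random bit flip
    if (even_parity(c.live.regs[3]) != stored_parity)  // detection on access
        c.recover();                                   // transparent recovery
}
```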
fallacy: 1 a deceptive, misleading, or false notion, belief, etc.: That the world is flat was at one time a popular fallacy. 2 a misleading or unsound argument. 3 deceptive, misleading, or false nature; erroneousness.
for the most part, that sounds more like every one of Oxford guy's posts... 1 sounds like his console scam BS; 2 and 3, the rest of his posts.
> Never give up that cocky attitude, that you have something to teach the designer of the system
FFS! Why do you assume my reply was aimed only at cjacobi? I was trying to relate the mechanism to something far more familiar. This had two goals:
1. Grounding the explanation in something other readers are familiar with. Part of understanding something exotic is demystifying the parts that are actually commonplace, so that we can focus on those which aren't.
2. If there's an important distinction (such as maybe the thread migration being completely hardware-driven, rather than OS-managed), cjacobi is welcome to correct me and enlighten us all.
Further, I might add that it's all part of participating in the comments. This is not a one-way communication mechanism, like articles or lectures. Since anyone can reply, you shouldn't post comments if you can't deal with someone offering a clarification or asking a dumb follow-up question, etc.
Dear Sir, given your nickname (and how you talk about z/) I guess you're the chief architect of my favorite platform. Let me congratulate you for the amazing job you (and those before you) did on S/3x0 and z/xx. I think you should mention, for those who are interested, the literature IBM published for those who want to get started: https://www.ibm.com/docs/en/zos-basic-skills?topic... Not exactly new, but it's a great read. Best regards from a guy who's been a COBOL/assembler/Easytrieve programmer for over 30 years (last year I had to switch to more "mundane" platforms!)
Thank you. My favorite platform as well. My HotChips presentation was representing the work of a large team of brilliant engineers. And we are standing on the shoulders of giants, almost 60 years of innovation from S/360 to the IBM Z Telum chip. -- Christian Jacobi
But that process of "trying to find space when evicted, across the whole chip, and failing that, across the whole system" cannot be cheap. Stuff is getting evicted constantly, right?
In other words, this sounds fine in a 5.3GHz fire breathing monster, but that *has* to burn too much power for a phone, or maybe even a laptop.
Yeah, I don't see why it has to physically move, just because it got demoted from L2 to L3. If all cache blocks contain a mix of L2 and L3, then why not leave it in place and simply re-tag it?
Maybe you have to evict something else from L3, in order to make room. Even then, I don't quite see why you'd move it to another L2/L3 block. If another core is actively or recently using it, then it wouldn't be getting evicted.
Generally data is only evicted if you need to make space for new/other data. So if the core is trying to pull in new L2 data then it needs to evict something else that is no longer needed. When evicting from *private* cache you generally push it up one layer -> so it’s going to L3. Putting it into another cores L3 is really the only spot to put it.
> Putting it into another cores L3 is really the only spot to put it.
That still doesn't wash. As far as the detail that I've seen, the distinction between L2 and L3 is mainly one of quota. If a line in a cache block is currently at L2 status, and then gets demoted to L3 status, then all that needs to happen is for something else to get evicted from that block's L3 quota to make room for the newly demoted line. There's no need to actually *move* anything!
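For what it's worth, here is what that "just re-tag it" idea looks like as a C++ sketch. This models the quota argument made above, not Telum's actual mechanism (which isn't described at that level of detail); the field names and quota policy are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Level { L2, L3 };   // logical status of a line within one cache block

struct Line {
    std::uint64_t tag = 0;
    bool valid = false;
    bool dirty = false;
    Level level = Level::L2;
};

struct CacheBlock {
    std::vector<Line> lines;
    std::size_t l3_quota = 0;   // how many lines may hold L3 status at once
    std::size_t l3_used = 0;

    // Pick some L3-tagged line to give up its slot (first found here;
    // real hardware would use LRU or similar).
    Line* pick_l3_victim() {
        for (auto& l : lines)
            if (l.valid && l.level == Level::L3) return &l;
        return nullptr;
    }

    // Demote a line from L2 to L3 status *in place*: the data never leaves
    // this block's SRAM arrays.  If the L3 quota is already full, some other
    // L3 line gives up its slot first (written back to memory if dirty).
    void demote_to_l3(Line& victim) {
        if (l3_used == l3_quota) {
            if (Line* old = pick_l3_victim()) {
                if (old->dirty) { /* write back to DRAM */ }
                old->valid = false;
                --l3_used;
            }
        }
        victim.level = Level::L3;   // re-tag only; no data movement
        ++l3_used;
    }
};

int main() {
    CacheBlock block;
    block.lines.resize(4);
    block.l3_quota = 2;
    block.lines[0].valid = true;
    block.demote_to_l3(block.lines[0]);
}
```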
For shared portions of memory running across different cores, simply retagging the local L2 as the virtual L3 for the remote core would work rather well. One of the topics that has been slowly creeping into the spotlight has been the amount of energy needed to move data around. Granted, this would still require the tagging information to be refreshed, but it should conceptually be less than moving the full amount of data across the chip and updating the tagging there.
For shared memory this does bring up an interesting scenario: can a piece of data be tagged as both local L2 and as remote L3 on a remote core, or L4 on a remote socket? From a user-space program, this would be an elegant solution to where the single source of truth lies in a coherent system. I wonder what coherency black magic IBM pulled off to accomplish this at speed in hardware.
> can a piece of data be tagged as both local L2 and as remote L3 on a remote core
This gets to the very definition of L2 and L3. Under a quota-based scheme, where L2 is defined as the amount of a cache block controlled by the associated core, then the line would enter L2 "status" if being written by that core. It wouldn't then *also* be L3, as long as it's within the core's L2 quota. Once it get demoted from L2, then it should go to L3 status, before getting completely evicted.
> or L4 on a remote socket?
I wonder if they're just being cute, by calling it L4. It seems to me that L4 is a functional definition of what happens when you're doing a cache snoop and find the data of some remote DRAM in a remote core. In that case, it'd be a little faster to get it out of the remote cache. However, I wonder if and why a remote core would ever *put* something in a chip's cache. I don't really see it, but it could make sense for the chip which hosts the attached DRAM.
If your L2 and L3 are sharing the same local physical memory I think the only distinction between what is L2 and L3 is your eviction priority. Latency to both should be the same. But we already have finer grained methods of determining what ought to be evicted than a single-bit L2/3 indicator.
So - I suspect L3 for this architecture is necessarily non-local.
Or more generally, a specific local cache line is considered L2 for that local processor, whereas that same line is considered virtual L3 for non-local processors.
I thought it was pretty clear that the line is being evicted to make space for new data. In such a scenario it has to be moved, either to L3 or back to DRAM (it was private, so there might still be stale data; we can't just drop this line completely). If you want one core to have useful access to the full chip's L3 capacity, you must move from your private L2 into the chip-wide virtual L3. This way, code operating on problems with a working set >32MB will actually get access to it.
It could also be this does not occur 100% of the time. There could be a heuristic that determines if it gets demoted to L3 locally or moved to another cores L3 or flushed straight to DRAM depending on various performance counters. I assume this type of heuristic is exactly what they would have simulated in advance and determined to be optimal for their expected workloads.
Very interesting. It seems like whether it will be beneficial in operation has a lot to do with how much of the core cache capacity is usually vacant (or occupied by low value data) and how often a piece of cached data is "reused" by the same core or another core in the system.
The second seems less likely in a cloud system with multiple customers using the same resources but seems likely to be substantial in a single purpose big iron environment.
Will data errors be more prone to propagate if all cores "reuse" data copies rather than independently going "out" to the source? (all sharing a bad copy) or less likely because there are fewer total "copy" operations taking place and resulting in fewer points for error?
I should have gone into Engineering - these are much more interesting questions to me than the human problems I deal with and I think there are actual right answers to be found.
There's very strong error correction codes on the L2 cache, which can even recover data if an entire cache memory segment fails on the chip. We always check the correctness of the data on every access using those codes, and if we find a bit (or multiple) have flipped we fix it before sending the data out.
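For readers who want a feel for how a correcting code can put a flipped bit back, here is the textbook Hamming(7,4) example in C++. It is far simpler and weaker than the codes IBM actually uses on the L2 (which, as described above, can survive a whole array failure), but it shows the check-and-correct-on-every-access idea.

```cpp
#include <cstdint>
#include <cstdio>

// Hamming(7,4): encode 4 data bits into 7 bits; any single flipped bit can
// later be located and corrected.  Bit positions are 1..7; positions 1, 2
// and 4 hold parity, positions 3, 5, 6, 7 hold data bits d0..d3.
std::uint8_t encode(std::uint8_t d) {
    std::uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    std::uint8_t p1 = d0 ^ d1 ^ d3;   // covers positions 1,3,5,7
    std::uint8_t p2 = d0 ^ d2 ^ d3;   // covers positions 2,3,6,7
    std::uint8_t p4 = d1 ^ d2 ^ d3;   // covers positions 4,5,6,7
    // pack as bits 1..7 of a byte (bit 0 unused)
    return (p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4) | (d1 << 5) | (d2 << 6) | (d3 << 7);
}

// Re-check the parity groups, locate a single-bit error by its syndrome,
// flip it back, and return the 4 data bits.
std::uint8_t decode(std::uint8_t code) {
    auto bit = [&](int pos) { return (code >> pos) & 1; };
    int s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    int s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    int s4 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    int syndrome = s1 + 2 * s2 + 4 * s4;        // 0 = clean, else bad position
    if (syndrome) code ^= (1u << syndrome);     // flip the bad bit back
    return bit(3) | (bit(5) << 1) | (bit(6) << 2) | (bit(7) << 3);
}

int main() {
    std::uint8_t word = 0b1011;            // 4 data bits
    std::uint8_t stored = encode(word);
    stored ^= (1u << 6);                   // a cosmic ray flips one bit
    std::printf("recovered %#x (expected %#x)\n",
                (unsigned)decode(stored), (unsigned)word);
}
```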
IBM's small team, representing a tiny part of their revenue, has long led the industry in performance innovation. It reminds me of Broadcom, which consistently has the lowest-noise communication chips for mobile phones, even though that is a tiny part of their revenue.
Both keep a clear lead by virtue of having ridiculously amazing engineers like Dr. Christian Jacobi, even against companies -- or countries -- that have much larger R&D budgets.
This architecture seems almost familiar to me: like a big, global whiteboard publish/subscribe system. Tagged items, content-addressable memory.
Those sort of things seem to work out ok when you need massive throughput. I think I've only seen this done on systems where the data processing operations are fixed, deterministic-time instructions.
Like in network routers.
Wild: looks like IBM took "Smart NIC" (network interface controller) idea all the way up the stack.
It's amazing what you can do with a processor architecture when it doesn't have to be all things to all people. The non-core power usage of these systems should be quite significant, to the level of massive. That's OK if you don't have to also live in laptops and tablet sized convertibles, nor meet things like the California desktop PC power consumption limits.
As for how this works in practice? I feel like this would work very well for general-purpose computing, but would have issues when dealing with objects that have very large data sets. Things that overflow the L2 cache will start to chew through ring bus capacity and power, but still have reasonable performance. Things that exceed 50% of the CPU's total L3 vCache capacity are going to start to eat up package power and bandwidth and take a performance hit compared to an on-chip L3, which might be smaller than that amount. But it looks like it gets ugly when it gets above 256MB and starts hitting the module neighbors for information. That starts impacting the performance of every core in the module as their power budgets and available cache amounts start to diminish, and they start to fight for resources.
There's a niche where I am sure this will work well, but, it's not something that can scale down very well. That's fine for IBM, but this strategy isn't going to work well everywhere.
No can do. There's no actual "interactive" mode - even when using CICS (transaction handler) using full conversational mode is not a best practice. Online processing is done pseudo-conversationally - transaction stops when output is sent to terminal/caller and then resumed when terminal/caller sends input again. BUT you may find a way to run some Linux benchmark in USS (Unix System Services). I would gladly help you, but I was "recycled" and no longer work on mainframes. Sorry.
Ever since AMD and Intel moved to the large L3 designs, I've felt that it wasn't ideal. There has to be a better way. I mean, how many levels do we need for the same socket? It's good that IBM is at least trying something different, although I don't know if this is the best solution either...
"although I don't know if this is the best solutions either "
mostly because I worked on them decades ago, even after they were antiques, the TI-990 machines had a really radical design - no instruction or data registers in the cpu, everything done in memory. the justification at the time was that processor and memory cycle times were close enough that load/store (no recollection if that term even existed then) was sub-optimal.
the next logical step, of course, is to eliminate memory as temporary store altogether. Nirvana.
while Optane, either in absolute performance or marketing, hasn't gotten traction as a direct single-level datastore (i.e. all reads and writes are to durable storage), a baby step of consolidating caches may hint that some folks in the hardware side are looking in that direction.
OSes will need to be modified, perhaps heavily, to work with such hardware. another olden days machine that provides some guidance is the OS/400 as originally designed. it operated a SQL-ish database as datastore without a filesystem protocol (a filesystem was later grafted on). an 'object' datastore without the filesystem protocol eliminates more impedance matching.
Memory-to-memory architectures made more sense when CPUs ran so slowly that DRAM accesses only took a handful of clock cycles. These days, memory-to-memory would be completely uncompetitive, unless your CPU is doing some kind of memory-renaming and internally remapping them to registers.
You can't just forego DRAM and use Optane instead. It would need to be several orders of magnitude more durable than it currently is.
However, Intel has been working on using Optane from userspace (i.e. without kernel filesystem overhead). But, that's in *addition* to DRAM - not as a complete substitute for it.
> since AMD and Intel moved to the large L3 designs, i've felt that it wasn't ideal.
Do you want lots of cores, with lots of IPC, running at high clocks? If so, then you need bandwidth. And you need it to scale faster than DRAM has been. Scaling the speed and size of caches is the way to do that.
If not, AMD wouldn't have been able to tout such impressive gains, in real-world apps, by simply scaling up L3 from 32 MB to 96 MB.
Of course, another way to do it is with in-package HBM-type memory, which comes at some latency savings and increased bandwidth vs. DDR memory sitting on DIMMs.
A yet more radical approach is to reduce the burden on caches, by using a software-managed on-chip memory. This is something you'll find in some GPUs and more specialty processors, but places a lot more burdens and assumptions on the software. Going to a private, direct-mapped memory avoids the latency and energy tax of cache lookups and maintaining cache coherency.
Question for Ian (or anyone at least as close to Big Blue):
In the olden days, IBM monitored its installs, seeking the most used instructions, data flows, and the like; mostly to optimize the machine for COBOL (the 360 meme, one machine for science and business, died almost immediately) applications.
is this 'radical' cache structure the result of customer monitoring, or (like Apple) IBM is telling the customer base 'we don't care what you think, this is what you need'?
"In the olden days, IBM monitored its installs, seeking the most used instructions, data flows, and the like; mostly to optimize the machine for COBOL (the 360 meme, one machine for science and business, died almost immediately) applications.
is this 'radical' cache structure the result of customer monitoring, or (like Apple) IBM is telling the customer base 'we don't care what you think, this is what you need'?"
I suspect Apple has far greater insight into the codepaths* its customers run than IBM ever did. The iOS phones are absolute miracles of OS and hardware working together in lockstep to achieve more with less resources (or consume less battery power) than any other phone maker on the planet.
Apple looks to be repeating this achievement with their M1 Macs.
*staying well away from the whole CSAM issue. We're just talking about IBM / Apple tweaking their OS/ hardware to maximise the efficiency of their customers' most highly used codepaths / dataflows.
Miraculous voicemail that won’t delete and auto-defect that can’t be disabled. Want to turn off voicemail transcription? You may be able to use a kludge but there is no actual normal setting to give users direct control. This is normal for Apple in recent times. Strip away user control and be hailed for efficient program CPU utilization. Stripping features from a program is one way to speed it up but what about the efficiency of the user’s workflow?
I’d gladly trade a bit of battery life for a system that has more respect for the user. But Apple has other plans. It even wants to play cop with people’s phone. Warrantless surveillance that Scientific American warns will lead to the abuse of a vulnerable minority. It also warns that this is a brainwashing of young people to believe they have no right to privacy. Big brother Apple is going to monitor you, not just government agencies. The corporate-government complex’s dissolving of the line between corporation and government continues at speed. A $2 trillion valuation doesn’t happen without gifts to those in power.
Oh, yes. Lost faith in Microsoft a long time ago, though they have pulled up their socks in the last few years, somewhat. As for 11, I'm just going to stick with 10 as long as I can. If I've got to use it for another decade, I've got no problem with that.
You know, we always eat our words. I had another look at 11 just now, and have got to admit, it doesn't look half bad. Indeed, seems like sense is at work once again: lots of rubbish has been removed, such as Cortana, One Drive, and the Ribbon. And from an appearance point of view, there's a restrained steering away from Metro's plainness. Then, the part that wins me: the command bar, replacing the Ribbon, is looking suspiciously like XP's. Microsoft, what has got into you? Common sense? October, I'm going to give it a go.
Fiddling with UI appears to be enough to placate most, keeping the entropy flowing. Slap a superficial coat of paint on it and we can ignore the badness of the deal.
Apart from the telemetry, which can be toned down using tools like ShutUp10, the problem with Windows today is largely one of appearance. Repeatedly, Microsoft has tried to tack on mobile rubbish, and it is just that, tacked on; the OS has resisted attempts for it to sink deeper. It's still the Windows of yore all the way through.
Windows is very like a house built on a solid foundation, with a sound plan. Its finish---the plaster, paint, and furniture---used to be excellent in the days of XP and 7. After 7, it's as if new owners bought the place and, modern folk that they are, have been repainting the walls, putting up gaudy ornaments, and adding extensions that don't square with the original plan. I'm sorry to say they even got rid of some beautiful antique furniture, adding, "Our synthetic wood is what everyone's going for nowadays. It's the in thing."
When I visited the house, I saw that it had been defaced considerably; but looking closer, realised the Windows I knew and loved was still there, beneath all the make-up. I smiled.
Aye, DOS all the way to the bone. Bill recently admitted on Reddit AMA, "We were runnin' out o' time, so I told the NT team, *Dave, just copy and paste the DOS code, throw some up-to-date modules in, smack on the GUI, and no one will know the difference.*"
I like how the added latency to access another chip's cache is a ?
First thing that jumped out at me was "it might be slower than DRAM at that point". Either they have an enormous broadcast/response fabric to do all of this cache state coordination traffic (and they might) or virtual L4 ends up having some major glass jaws compared to just going to local DRAM.
An on-drawer cache hit is still significantly faster than a memory access, even for chip-local memory. And given the memory size of those systems (many Terabytes, z15 can have up to 40TB in a shared memory config), the data you need is often hanging off a different chip so you need to traverse that chip-to-chip distance anyway. We do however access memory speculatively in some cases to further reduce latency, while broadcasting in parallel for cache coherency across a number of chips.
> virtual L4 ends up having some major glass jaws compared to just going to local DRAM.
Keep in mind that all cache-coherent systems have the scaling problem of needing to ensure that no other core has an exclusive copy of the cacheline you want.
However, whether it's worth fetching a non-exclusive copy on another chip vs. reading from local DRAM is still a decision that can be made on a case-by-case basis. Usually, a cache line would get fetched into L1 or L2. And if L3/L4 is inclusive, then there's no problem with a given cacheline simultaneously existing in another chip's L3 and the local chip's L1 or L2.
The traditional distinction between L2 and L3 is that L2 is private and L3 is shared. So, what you're saying is that you effectively want to do away with L2. The disadvantage of doing so is that one core that's running a memory-intensive workload could crowd out the L3 sets of others.
IBM's mainframes were called System/360 because the 360 could handle any workload (not niche.) They found out pretty fast it couldn't compete economically with minicomputers for some users so they regretfully introduced a large number of architectures that weren't "compatible".
They were handing out "less than a full core" to users in the 1970s, long before the advent of "cloud computing". I remember being in the Computer Explorers and spinning up instances of VM/CMS which was basically a single-user operating system a lot like MS-DOS inside a VM to do software development tasks.
IBM had demands to make a "Baby 370" for software devs but it never caught on because it was more cost effective to give them a little slice of a big one.
"IBM had demands to make a "Baby 370" for software devs but it never caught on because it was more cost effective to give them a little slice of a big one."
yeah, but... Multics did the most to move time-sharing forward. TSO was IBM's version, as a temporary patch, which became the product.
"The Sole of a New Machine" chronicles not just one machine, but a broad overview of minis in the late 70s, when a mini was multiple boards of, mostly, discrete parts. and each company had its own version of what an OS was.
You're assuming the DRAM holding the cacheline you want is directly connected to your core. However, what if it's actually hosted by a 3rd chip that's even farther?
Kinda curious how QoS is managed between cores in a single chip and between different drawers. Even in a single chip, there can be tons of different threads with different memory usages. If there are memory-intensive workloads in different cores, maybe cores will be trying to 'steal' cache from others. This looks like overprovisioning in virtual machines (pretend that there's enough space to evict something). So I expect that there will be similar problems and am quite curious how IBM handled this. From each core's perspective, I don't see any reason why there should be extra unused(?) L2 which can be used for another core's L3.
I wonder how the new cache architecture was also designed to fully integrate the AI cores that Telum has? IBM stressed that being able to do "AI on the fly" (my words, not theirs) is a key feature of the new mainframe CPU, so maybe some of these changes are to make that easier and (especially) faster? Any words from IBM on that?
> when it comes time for a cache line to be evicted from L2, ... rather than simply disappearing it tries to find space somewhere else on the chip.
Why would it necessarily have to get moved? If the same physical cache is shared between L2 and L3, and the partitioning is merely logical instead of physical, why couldn't it (sometimes) just get re-tagged as L3 and stay put?
I think that'd save a lot of energy, by eliminating pointless data movement. Of course, if the cache block has no available space in its L3 quota, then I suppose the line might indeed have to be relocated. Or maybe it could evict a L3 line from that cache block that could potentially find a home elsewhere, if it were still in sufficiently high demand.
It seems to me that what should scale most efficiently is to have blocks of L3 coupled to each memory channel. This way, you know precisely which L3 to check and your L3 bandwidth will scale linearly with your memory bandwidth. And while load-balancing could be an issue, software already has an incentive to load-balance its usage of different memory channels.
What am I missing? Is energy-efficiency just not a priority, at that level?
No. Energy-efficiency is not a priority at all. In the mid 90s the mainframe I used to work on was HEAVILY water cooled. On 3 sides of the building (about 40m x 40m - so 120m) there was an uninterrupted line of 50-60 cm fans that provided cooling for the beast. If my memory serves me well it was a 9000 series, don't remember the specific model. As you can guess those 200ish fans dissipated a lot of heat, therefore shedloads of power were consumed to keep that baby on 24/7.
I was speaking in very broad terms about computer architecture, not simply restricted to mainframes. Sorry not to be clear about that.
The reason I went there is that the article seems written with an eye towards broader trends in cache hierarchies. So, it was those broader trends that I was attempting to question.
To couple a cache to each DRAM controller and have it store local data, you’d want it to be physically indexed. I’m not aware of any physically indexed caches. You’d have to do all your conversion from virtual to physical addressing before lookup, and then you’ll need to store the whole virtual address as the tag. To make matters worse you need to store virtual tags for all processes that are accessing that data, which is an arbitrary number. This is trivial in a virtually indexed cache as the same physical index can be held in multiple cachelines as long as they are in a non-exclusive mode, with the TLBs ultimately protecting usage. If you had very long cachelines that were the size of an entire DRAM transfer or bigger then maybe the overhead would be worth it. Otherwise it’s a lot of gates for tracking a small amount of information.
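For anyone following the indexing discussion: a set-associative lookup just slices an address (physical or virtual) into offset, set index and tag. A small C++ illustration with made-up geometry; the line size, associativity and capacity below are placeholders, not Telum's actual parameters.

```cpp
#include <cstdint>
#include <cstdio>

// Made-up geometry: 32 MiB, 8-way, 256-byte lines.
constexpr std::uint64_t kLineBytes = 256;
constexpr std::uint64_t kWays      = 8;
constexpr std::uint64_t kSets      = (32ull << 20) / kLineBytes / kWays;  // 16384

struct Lookup {
    std::uint64_t set;   // which row of the cache to read
    std::uint64_t tag;   // value compared against the stored tags in that row
};

Lookup decompose(std::uint64_t addr) {
    std::uint64_t block = addr / kLineBytes;   // drop the byte-offset bits
    return { block % kSets, block / kSets };
}

int main() {
    std::uint64_t addr = 0x1234'5678'9ABCull;
    Lookup l = decompose(addr);
    std::printf("addr %#llx -> set %llu, tag %#llx\n",
                (unsigned long long)addr, (unsigned long long)l.set,
                (unsigned long long)l.tag);
}
```

Whether `addr` here is a virtual or a physical address is exactly the design choice being debated above: index with the physical address and the lookup has to wait for translation; index with the virtual address and you need the TLB and tag checks to sort out aliasing.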
> You’d have to do all your conversion from virtual to physical addressing before lookup
Ah, good point. That would add latency and energy expenditure to the lookup.
> then you’ll need to store the whole virtual address as the tag.
Why? If the cache is dealing in physical addresses and the OS is ensuring that no two virtual addresses overlap (or, any that do have its explicit approval), then why would the cache also need to know the virtual address?
> Why? If the cache is dealing in physical addresses and the OS is ensuring that no two virtual addresses overlap (or, any that do have its explicit approval), then why would the cache also need to know the virtual address?
You're right. As long as you stored the entire 64-bits of the address it shouldn't be an issue. That would boost its density too
I forgot in my first reply that if you don't have virtual tagging then you can't tell a core making a request where the data may be above it - that information could currently be in another cache. You'd have to have resolved probes and whatever already, so it's not just another level in the cache hierarchy, but a dumb latency reducer for DRAM accesses. It'd have to be quite large, I'd have thought, to give much benefit if you'd already traversed all other caches.
They have my applause on L2 size, but other than that it's a victim of a victim. If they went with a much larger memory cube (HMC) and an interesting bus (which would still be in the same role) I would have said fine. Would really like to see consumer-grade (mobile/tablet/laptop...) SoCs where it can serve (either cube or HBM) as the final RAM (as neither the DDR latency nor new buffer levels make much sense anymore).
Isn’t Apple already doing something like that with their shared L2 cache? From benchmarks done by Anandtech it seems like every M1 core has "priority" access to a portion of L2 and the rest is used as some sort of a virtual L3.
Unfortunately, the article just reaffirmed what I understood from the little bit that the slides shown at HotChips revealed already. The interesting pieces are missing: When is a line in a different L2 considered "unused"? How does that information get communicated? Is the system using directories, or is it just snooping?
One idea that comes to my mind is a directory at each memory controller that tells which lines for this memory controller are in which cache(s); this would keep traffic down, but may increase latency.
Another bunch of marketing nonsense. I'm sure IBM chose the model that fits THEIR use patterns the best. Which is usually far from what others in x86_64 land might see.
What is interesting is how easily they have managed to sell it to "doctor" Ian Cutress... Doctor of what?
19 cycles is not that great for L2. Above that, 12ns is not that great for on-chip communication. Above that, this L3/L4 built from many L2 tiles doesn't seem to bring anything significant at the inter-chip level. Latencies are so big anyway that a couple of cycles more or less don't mean much.
Also, L1,L2 and L3 are different beasts. One can't compare and translate their logic 1:1.
Love your use of quotes. The answer to your question is simply 'heresy'. I studied and wrote a thesis on the mystic arts. I can identify magic when I see it.
Not sure exactly what your issue is. However, on the semantic front, those who disdain use of the title "doctor" for Ph.D. recipients would do well to note that the distinction pre-dated Medical Doctorates by at least 5 centuries.
"distinction pre-dated Medical Doctorates by at least 5 centuries."
it wasn't until well into the 20th century that an MD actually had to attend a Medical School; it was not much more, if that, than an apprenticeship in blacksmithing.
So you start off with "the future of caches"... then you admit "IBM Z... is incredibly niche". And at 530mm2 die size versus Zen 3's 84mm2, of course they can fit in a stupid amount of L2 cache and virtualise it.
So no, this is not the future of anything except chips with stupidly large dies.
"So no, this is not the future of anything except chips with stupidly large dies."
well, maybe large relative to current Intel/AMD/ARM chips, but what used to occupy hundreds (even thousands) of square feet of raised-floor, liquid-cooled machines is now a couple of chips.
as to niche: web apps are really the niche, in that they all do the same thing. the mainframe does the heavy lifting in the real world, but it's largely invisible to the likes of AT readers.
TA's comparison is apt, in that they are both 8-core dies made on a similar process node!
By using so much L2/L3 cache, essentially what IBM has done is taken an overbuilt 4 cylinder engine and strapped an enormous supercharger to it, as a way to buy a little more performance for a lot of $$$. The only reason they can get away with such a large disparity between cores and cache is that most of their customers will buy 8+ node systems, whereas most x86 & ARM servers are single-CPU or dual-CPU.
The reason I call it a 4-cylinder engine is that if they would reveal more details about the micro-architecture, I think it would appear fairly simple by today's standards. Most of the complexity in their design is probably its RAS features, and much of their verification resources probably go towards testing that they all work properly and robustly. So, its IPC is probably comparatively low. And when you're trying to eke out a little more performance, in order for the product to stay relevant, enlarging cache is a pretty reliable (if expensive) way to do it.
> the mainframe does the heavy lifting in the real world
This is BS, of course. HPC doesn't use mainframes and nor do the hyperscalers. Corporate back offices run in the cloud or on conventional servers, not mainframes. Mainframes are just too expensive and not necessary, even for most "real work". They haven't kept up with the perf/$ improvements of commodity computing hardware, for a long time.
There are really only a few niche areas where mainframes still dominate. Some of those niches are surely starting to drift away from mainframes, which explains why they added in the machine learning acceleration.
And you can be pretty sure mainframes aren't picking up new niches. The computing world is much more invested in building resiliency atop commodity server hardware. Case in point: I know some big crypto currency exchanges use AWS, rather than mainframes.
"I'm not attacking the very idea of ultra-reliable hardware, just that it's worth the TCO in any case where downtime isn't extremely costly."
what folks who've never worked in a MegaCorp mainframe shop desperate to have an 'innterTubes' presence (I have) don't know is that 100% of that brand new web app from MegaCorp (your bank, insurance, grocery store...) is really just some lipstick on a COBOL pig from 1980. no one, but no one, re-makes the 3270 app soup to nuts. just puts new paint on the front door.
AMD designed Zen to be scaled down to low-power devices and to be higher-margin.
If x86 were to have had more competition than a duopoly in which one of the two players is still hobbled by 14nm, AMD may not have been able to get away with such a small die.
People have been conditioned to see what the duopoly produces as being the pinnacle of what's possible at a given time. Apple's M1 shed a bit of light on that falsity, and it's not even a high-performance design. It's mainly designed for power efficiency (and die size/profit margin, too).
Thanks, Ian, for the well-written article. I don't know much about caches, but if I may venture an opinion from the armchair of ignorance, I'd say this is going to take a lot more work/complexity for minimal gain, if any. Seems to me the classic L2/L3 cache will win the day. But 10/10 for innovative thinking.
It's difficult to understand why people insist on talking about x86 CPUs these days because there aren't any and haven't been any "x86" CPUs in many years. The x86 instruction set is but a tiny segment of today's Intel and AMD CPUs that are very advanced risc-cisc hybrid OOOP designs that don't resemble 80286/386/486 & even 586 CPUs at all. "x86" software compatibility is maintained in the CPUs merely for the sake of backwards compatibility with older software designed to run on real x86 CPUs, but these CPUs haven't been "x86" in a long time.
Back in the 90's when real x86 CPUs were shipping, the scuttlebutt was that RISC was going to leave "x86" behind and become the new paradigm--yes, that long ago. That never happened because "x86" moved on far beyond what it was while maintaining the backwards software compatibility that the markets wanted. That's why you still hear that "x86" is not long for the world--because it's terrifically oversimplified. Apple customers, especially, think "x86" is the same thing today as it was 30 years ago...;) All of that is just marketing spiel.
AMD's (and Intel's) "x86" CPUs will continue to change and improve and push ahead--they aren't going to be sitting still. But when people say "x86" today, that's what some think--that x86 hasn't changed in all these years and so it "must" at some point be superseded by something better. They keep forgetting that the CPUs that have superseded the old x86 CPUs of the 80's/90's are themselves "x86"...that's the part you don't read much about. It should be well understood. x86 has little trouble today with 64-bits, for instance, and many other things that were never a part of the old x86 hardware ISA's.
It's fair to say these are x86 CPUs, because that's what they are, despite the decoding to an internal format that happens in the front end. But I agree with the gist of your comment. It's a pop-culture commonplace that x86 is dead, x86 is going down, or almost done for. Why? Well, according to popular wisdom, old is bad and new is good. But watch how fickle a thing our allegiances are. As a fanciful example, if Apple were to switch to RISC-V, watch how opinion would quickly swerve round to denouncing ARM and vindicating its successor.
x86 + Windows has been the dominant non-handheld consumer platform for a long time. Not only that, x86 is the hardware of all the so-called consoles. Not only that, even today Apple is still selling x86 equipment. x86 certainly has plenty of focus for Linux developers and users, too.
The technical implementation of those instructions isn't very important so long as the restriction on who can build chips with them remains so relevant.
Or, is your argument that any company can begin to sell x86-compatible chips, chips that run the instructions 'natively' rather than in some sort of Rosetta-like emulation? My understanding is that only Intel, AMD, and VIA have had the ability to produce x86 CPUs for many years.
Except for AMD’s licensing of Zen 1 to China which somehow was approved. I’m not sure how AMD managed to enable another company to manufacture x86. Are all the patents Intel held that Zen 1 used expired — so anyone can make unlicensed (by Intel) x86 chips so long as they don’t run afoul of newer patents?
All that text and yet no suggestions for an alternative name. Interesting. Seems like your goal wasn't to move the conversation needle forward, but just to complain.
Dr. Cutress, IBM's cache approach isn't a preview of the future of caches. It's something that I described in my Cache Memory Book back in 1993. It's called Direct Data Intervention, and has been around for some time. It's on Page 154.
It's still cool!
Evictions aren't as complicated as all that, either. If a line's clean then it's simply over-written. On the other hand, if it's been written to (Dirty), then a process is followed to write it back to main memory before it's over-written, although, in some cases, it's simply written back into the next-slower cache level.
It's nice to see IBM using this approach to squeeze more out of its caches.
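A minimal C++ sketch of that eviction rule; the write-back destination (next-slower cache vs. main memory) is a policy choice, as noted above, and the names here are illustrative.

```cpp
#include <cstdint>

struct CacheLine {
    std::uint64_t tag   = 0;
    bool          valid = false;
    bool          dirty = false;   // set on the first write after the fill
    // ... data payload lives in the SRAM arrays ...
};

// Stand-in for the machinery below this cache level.
void write_back_to_next_level_or_memory(const CacheLine&) { /* ... */ }

// Classic eviction: a clean victim can simply be overwritten; a dirty one
// must be written back somewhere (next-slower cache, or DRAM) first.
void evict(CacheLine& victim) {
    if (victim.valid && victim.dirty) {
        write_back_to_next_level_or_memory(victim);
    }
    victim.valid = false;   // slot is now free for the incoming line
    victim.dirty = false;
}

int main() {
    CacheLine line;
    line.valid = true;
    line.dirty = true;
    evict(line);   // dirty line: written back before the slot is reused
}
```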
PS, in the sign-up screen I had to check a box saying that I read the ToS & Privacy Policy, but the links to those don't work.
It's interesting how the world is moving away from general-purpose computing and the issues associated with it. A 40% improvement? Holy crap!
If anything, it shows that you can gain performance by making things more complicated, which is the opposite of the conventional wisdom these days. Simpler != smarter.
Sounds like absolute genius. It's almost like AI caching; let dinosaur caching die. The hardest part, past the engineering, is the firmware to align everything to work as intended: OK, you engineered the possibility, but you still have to make it work in the real world. IBM at least has a rather targeted audience in mind, which will help immensely. In fact, I would love it if the AI package on CPUs helped with caching; it would validate their existence.
coburn_c - Thursday, September 2, 2021 - link
So you could have another cloud customer's data in your cache... that doesn't sound like a security risk at all.
Ian Cutress - Thursday, September 2, 2021 - link
...you've heard of a shared cache, right?
FreckledTrout - Thursday, September 2, 2021 - link
Like every other CPU.
coburn_c - Thursday, September 2, 2021 - link
This isn't a shared cache; this is an exclusive cache with someone else's data in it. Explain that access control model to me, oh wise ones.
xenol - Thursday, September 2, 2021 - link
Same way any other ACL or similar mechanism works on storage systems?
coburn_c - Thursday, September 2, 2021 - link
If the core can't access it, how does it get in or out?
coburn_c - Thursday, September 2, 2021 - link
Are you saying the L2 is associative to every core in the system?
coburn_c - Friday, September 3, 2021 - link
> To all other cache blocks
What? I don't... are you saying it is snooping all the caches across the entire system? So it's slow too, huh?
mode_13h - Thursday, September 2, 2021 - link
> This concept won't work in a DC-battery device like a thick gaming laptop, let alone a thin and light x86 tablet.
I know you're only talking about applying the concept, but my mind went right to imagining a laptop with a huge mainframe CPU in it!
:D
mode_13h - Thursday, September 2, 2021 - link
> what ISA/architecture are these cores based on?
A little more is said about the processor (but not the core design or ISA), midway through this liveblog:
https://www.anandtech.com/show/16901/hot-chips-202...
Here's a starting point for learning more about IBM's Z-series:
https://en.wikipedia.org/wiki/IBM_Z
Dolda2000 - Friday, September 3, 2021 - link
Whether the *core* can access the cache is very different from what parts of the cache the *program* running on that core can access.
coburn_c - Saturday, September 4, 2021 - link
The *program* is invariably weaponized. The *hypervisor* attempts isolation, but we've all learned if memory is accessed it can be *leaked*.
Kevin G - Thursday, September 2, 2021 - link
One thing to add is that as a Z-series mainframe, the platform supports memory encryption. The question I have is where the contents are decrypted before usage. My personal *guess* is that the L1 caches are decrypted for use while the L2 cache remains fully encrypted, making many of the recent side-channel attacks (some of which IBM's Z series was also vulnerable to) ineffective.
mode_13h - Thursday, September 2, 2021 - link
> where the contents are decrypted before usage.
Depends on the point of the memory encryption. If you just want to avoid someone snooping memory by attaching a logic analyzer to the board, then it's good enough to do it in the memory controllers.
If the point is for guests to have private storage that even the kernel cannot see, then I guess it could only be decrypted into L1, as you say. Because, if L2 is shared, then any thread with privileges to access that memory address would see it as it appears in the shared cache.
AntonErtl - Friday, September 3, 2021 - link
Side channels through caches reveal the accessed address, not the cache contents. So if you want to protect against that with encryption, you have to encrypt the address.
mode_13h - Friday, September 3, 2021 - link
> if you want to protect against that with encryption, you have to encrypt the address.
That won't work, unless everyone is using the same key. Otherwise, you'd have address collisions and what appears to be data corruption.
And having everyone use the same key would only protect from physical eavesdropping.
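To make the collision problem concrete, here's a toy sketch in Python (8-bit line addresses and XOR standing in for the cipher -- nothing like real hardware, purely illustrative):
    # Toy model: a shared cache indexed by *encrypted* line addresses.
    # XOR with a per-tenant key stands in for the cipher; 8-bit address space.
    cache = {}                      # encrypted line address -> data
    KEY_A, KEY_B = 0x5C, 0x3E       # two tenants, two different keys

    def encrypt(addr, key):
        return (addr ^ key) & 0xFF  # stand-in "address encryption"

    addr_a = 0x10                    # a line belonging to tenant A
    addr_b = addr_a ^ KEY_A ^ KEY_B  # a *different* line belonging to tenant B

    cache[encrypt(addr_a, KEY_A)] = "tenant A data"
    cache[encrypt(addr_b, KEY_B)] = "tenant B data"          # silently clobbers A's entry

    print(addr_a != addr_b)                                   # True: distinct physical lines
    print(encrypt(addr_a, KEY_A) == encrypt(addr_b, KEY_B))   # True: same encrypted index
    print(cache[encrypt(addr_a, KEY_A)])                      # "tenant B data" -- looks like corruption
Two distinct lines from two tenants land on the same encrypted index and the cache can no longer tell them apart; a single shared key avoids that, but then the encryption only defeats physical eavesdropping.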
Wereweeb - Thursday, September 2, 2021 - link
When I read you talking about "virtual L3/L4 caches", I guessed that this is what the "virtual L3 cache" might mean, but I had no idea what the "virtual L4 cache" would be.
Is there any situation in which a CPU might have to access the DRAM of another socket? Even if accessing another module's SRAM is close in latency to accessing local DRAM, the L4 cache might prevent a CPU from having to pull data from another socket's DRAM.
If that's not relevant, then DRAM's inherent latency must be more of a bother than the latency of pulling the data all the way from the depths of Hades.
Wereweeb - Thursday, September 2, 2021 - link
Oops, wrong place, fucking hell. I must've clicked the reply button without realizing it, and it just brought the comment I was writing. That's... annoying.
Ignore this, was supposed to be its own thing and in the back of the room with the other latecomer comments.
Samus - Friday, September 3, 2021 - link
Virtualization security layers will tag each instruction to its origin. This is effectively how Intel fixed Meltdown and Spectre in recent hardware. I assume every modern CPU design going forward will use metadata and instruction tagging to fix branch prediction hijacking and cache intrusion.
Threska - Friday, September 3, 2021 - link
The future resembles the past. Now all we need is to start programming in LISP.
FunBunny2 - Saturday, September 4, 2021 - link
"The future resembles the past. Now all we need is to start programming in LISP."
damn Modernist! Autocoder all the way.
hob196 - Friday, September 3, 2021 - link
I'm not sure you're understanding the difference between cloud computing and renting a distinct machine.
Even if cloud vendors were to bind a core to a process at the O/S level (which I strongly doubt they could do as routine and remain economically viable), you'd still have cache lines overlapping the memory of other users / processes whether there's a separate shared L3 or not.
The security risk you highlight is already there regardless of architecture, e.g. side-channel cache snooping attacks like Spectre.
dwillmore - Thursday, September 2, 2021 - link
This sounds more brittle than the Z15. Yes, things look nice from a single-core perspective, but system robustness depends on how it behaves in the worst case, not the best case. Now we have a core with only 32MB of cache at all, and a clogged-up ring bus trying to steal data from others' L2 caches, plus chip-to-chip links also clogged with similar traffic--with no benefit, as all L2 is busy with the local processor.
And these L4 numbers start to look like main memory levels of latency. The path from core to L1, then L2, then L3, and finally to L4, only to find that "the data is in another castle", seems like a horrible failure mode.
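To put rough numbers on that path (every latency and hit rate below is an invented placeholder, not a figure from IBM or the article):
    # Back-of-envelope average access time for a serial L1 -> L2 -> vL3 -> vL4 -> DRAM lookup.
    # All latencies (ns) and hit rates are made-up placeholders, purely to show the shape of the math.
    levels = [                 # (name, lookup latency in ns, hit rate at that level)
        ("L1",  0.6, 0.95),
        ("L2",  3.7, 0.80),    # e.g. ~19 cycles at ~5 GHz
        ("vL3", 12.0, 0.50),   # some other core's L2 on the same die
        ("vL4", 50.0, 0.50),   # an L2 on another chip in the drawer
    ]
    dram_ns = 100.0

    def amat(levels, dram_ns):
        total, p_reach = 0.0, 1.0
        for name, lat, hit in levels:
            total += p_reach * lat      # every request that gets this far pays the lookup
            p_reach *= (1.0 - hit)      # only the misses continue onward
        return total + p_reach * dram_ns

    print(f"average access = {amat(levels, dram_ns):.1f} ns")
The average only stays low if the deeper levels are rarely reached; a workload that keeps striking out at every level pays the whole stacked path, which is exactly the failure mode described above.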
FreckledTrout - Thursday, September 2, 2021 - link
It very much depends how often the data is in the other "castle", as you put it.
Timoo - Monday, September 6, 2021 - link
Sounds a lot like bad programming, don't you think?
In my field we have a problem between application-tinking and service-thinking.
Monoliths versus agiles. Monoliths should be at the end-point, not in the middle-layer.
I think it is comparable; if you program thise service-minded, with only broadcasting the end-product, there is no "clogging the bus".
Timoo - Monday, September 6, 2021 - link
*application-thinking and *these, instead of application-tinking and thise.
eSyr - Thursday, September 2, 2021 - link
> With that in mind, you might ask why we don't see 1 GB L1 or L2 caches on a processor
With Genoa (which might have up to 96 cores according to some leaks) and stacked 3D V-Cache (which effectively triples the amount of L3 cache), we might see 1.125 GiB of L3 cache in the not so distant future.
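The arithmetic behind that figure, assuming the rumoured 12-CCD / 96-core configuration with 32 MiB of base L3 plus 64 MiB of stacked V-Cache per CCD (leaked numbers, nothing confirmed):
    # Rumoured-config arithmetic only: 12 CCDs x (32 MiB base L3 + 64 MiB stacked V-Cache).
    ccds = 96 // 8                               # 96 cores, 8 cores per CCD
    per_ccd_mib = 32 + 64                        # "tripled" L3: 32 MiB -> 96 MiB per CCD
    total_mib = ccds * per_ccd_mib
    print(total_mib, "MiB =", total_mib / 1024, "GiB")   # 1152 MiB = 1.125 GiB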
will- - Thursday, September 2, 2021 - link
The flexible L2 / L3 cache is genius.
Drkrieger01 - Thursday, September 2, 2021 - link
Indeed. It's like making the best of the silicon you have, not letting any of it go to waste.
FunBunny2 - Thursday, September 2, 2021 - link
I've always wondered (link anyone?) how these designs come to be.
1 - it's a staggering bunch of maths understood only by the Einsteins of today
2 - it's a staggering bunch of spitballing on white boards by Joe Sixpack engineers
mode_13h - Thursday, September 2, 2021 - link
I think it starts with ideas, followed by models (i.e. spreadsheets or similar), followed by simulations.
Until you simulate it, you can't truly know how it will perform and where the bottlenecks will be. With a simulation, you can play with different parameters and make a more balanced design.
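As a deliberately crude example of the kind of experiment a simulator lets you run (a toy direct-mapped model over a synthetic trace, nowhere near a real design-space study):
    import random

    # Hit rate of a direct-mapped cache with 64-byte lines over a synthetic, skewed trace.
    def hit_rate(cache_bytes, trace, line=64):
        sets = cache_bytes // line
        tags = [None] * sets
        hits = 0
        for addr in trace:
            idx, tag = (addr // line) % sets, addr // (line * sets)
            if tags[idx] == tag:
                hits += 1
            else:
                tags[idx] = tag     # miss: fill the set with the new line
        return hits / len(trace)

    random.seed(0)
    hot = [random.randrange(1 << 24) for _ in range(50_000)]       # roughly 3 MiB hot working set
    trace = [random.choice(hot) if random.random() < 0.9           # 90% of accesses reuse it
             else random.randrange(1 << 30) for _ in range(200_000)]

    for kib in (256, 1024, 4096, 32768):
        print(f"{kib:>6} KiB: {hit_rate(kib * 1024, trace):.1%}")
Only once you sweep the parameters against a trace do the knees in the curve show up; a real methodology does the same thing with far better traces and models.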
drw392772 - Thursday, September 2, 2021 - link
I have always been fascinated by the hardware in IBM's big iron; it's on a whole other level. More content along this line would be appreciated.
Question: I know mainframes traditionally execute instructions twice to ensure accuracy/integrity. When you say 'core' in this context, does that truly represent a single core, or is that really a pair that execute the same instructions in parallel?
SarahKerrigan - Thursday, September 2, 2021 - link
Mainframes, generally, aren't lockstepped but include other RAS measures.
HPE's NonStop family and Stratus's ftServer family, otoh, are fully lockstepped systems.
cjacobi - Thursday, September 2, 2021 - link
We used to have lockstep execution, but that's many generations ago. Nowadays we use error checking across the entire chip, where we apply parity checking on memory elements (SRAMs, registers etc) and data buses. We also perform either parity or legal-state checking on control logic. When we detect an error, the core can go through a transparent recovery action, which is a complete core reset almost like a mini-reboot of the core, but the program state (program location, all the register content etc) gets recovered. So after recovery, we continue running wherever we were in the program. It's completely transparent to the software layers. When a core goes through recovery repeatedly we can even take "the brain dump" of the software state and dynamically and transparently transfer it to a spare core. That way we achieve outstanding resilience and availability against random bit flips or wear-out effects.
mode_13h - Thursday, September 2, 2021 - link
> we can even take "the brain dump" of the software state and
> dynamically and transparently transfer it to a spare core.
Usually referred to as a context switch. The main difference would be if this happens entirely in hardware vs. with OS intervention.
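As a software analogy only (the real mechanism is pure hardware, and this is not IBM's design, just the shape of the idea):
    # Toy analogy: checkpoint the architectural state, reset, restore, and after repeated
    # recoveries migrate the "brain dump" to a spare core. Purely illustrative.
    class Core:
        def __init__(self, cid):
            self.cid, self.recoveries = cid, 0
            self.state = {"pc": 0, "regs": [0] * 16}

        def recover(self):
            checkpoint = dict(self.state)              # preserve program location and registers
            self.state = {"pc": 0, "regs": [0] * 16}   # "mini-reboot" of the core
            self.state = checkpoint                    # resume exactly where the program was
            self.recoveries += 1

    def maybe_migrate(core, spare, threshold=3):
        if core.recoveries >= threshold:    # core looks worn out: move the work off it
            spare.state = dict(core.state)  # transparent to the software layers
            return spare
        return core

    c, spare = Core(0), Core(1)
    c.state = {"pc": 0x4242, "regs": list(range(16))}
    for _ in range(3):
        c.recover()
    running_on = maybe_migrate(c, spare)
    print(running_on.cid, hex(running_on.state["pc"]))   # 1 0x4242: same program state, spare core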
name99 - Friday, September 3, 2021 - link
Oh mode-13h!
Never give up that cocky attitude, that you have something to teach the designer of the system :-)
Oxford Guy - Friday, September 3, 2021 - link
‘It’s not a fallacy if it’s true.’I’m still giggling over that whopper.
mode_13h - Saturday, September 4, 2021 - link
Nice troll, OG. At least *pretend* to be a decent human?
Oxford Guy - Tuesday, September 7, 2021 - link
Another fallacy.
You need to learn not only what a fallacy is but also that it's bad-faith argumentation.
Qasar - Tuesday, September 7, 2021 - link
fallacy:
1 a deceptive, misleading, or false notion, belief, etc.: That the world is flat was at one time a popular fallacy.
2 a misleading or unsound argument.
3 deceptive, misleading, or false nature; erroneousness.
for the most part, that sounds more like every one of oxford guys posts.....
1, sounds like his console scam BS
2 and 3, the rest of his posts.
mode_13h - Saturday, September 4, 2021 - link
> Never give up that cocky attitude, that you have something to teach the designer of the system
FFS! Why do you assume my reply was aimed only at cjacobi? I was trying to relate the mechanism to something far more familiar. This had two goals:
1. Grounding the explanation in something other readers are familiar with. Part of understanding something exotic is demystifying the parts that are actually commonplace, so that we can focus on those which aren't.
2. If there's an important distinction (such as maybe the thread migration being completely hardware-driven, rather than OS-managed), cjacobi is welcome to correct me and enlighten us all.
Further, I might add that it's all part of participating in the comments. This is not a one-way communication mechanism, like articles or lectures. Since anyone can reply, you shouldn't post comments, if you can't deal with someone offering a clarification or asking a dumb follow-up question, etc.
GeoffreyA - Sunday, September 5, 2021 - link
Reminiscent of a context switch.
Zio69 - Friday, September 3, 2021 - link
Dear Sir, given your nickname (and how you talk about z/) I guess you're the chief architect of my favorite platform. Let me congratulate you for the amazing job you (and those before you) did on s/3x0 and z/xx.
I think you should mention, for those who are interested, the literature IBM published for those who want to get started:
https://www.ibm.com/docs/en/zos-basic-skills?topic...
Not exactly new, but it's a great read.
Best regards from a guy who's been a cobol/assembler/easytrieve programmer for over 30 years (last year I had to switch to more "mundane" platforms!)
cjacobi - Friday, September 3, 2021 - link
Thank you. My favorite platform as well. My HotChips presentation was representing the work of a large team of brilliant engineers. And we are standing on the shoulders of giants, almost 60 years of innovation from S/360 to the IBM Z Telum chip.
-- Christian Jacobi
brucethemoose - Thursday, September 2, 2021 - link
Fascinating.
But that process of "trying to find space when evicted, across the whole chip, and failing that, across the whole system" cannot be cheap. Stuff is getting evicted constantly, right?
In other words, this sounds fine in a 5.3GHz fire breathing monster, but that *has* to burn too much power for a phone, or maybe even a laptop.
Ian Cutress - Thursday, September 2, 2021 - link
IBM said that each package (so each dual die) is ~400W TDP or so.
FunBunny2 - Thursday, September 2, 2021 - link
"IBM said that each package (so each dual die) is ~400W TDP or so."
Tim Cook: "ah we can handle that."
brucethemoose - Friday, September 3, 2021 - link
Sounds like a great chip for an IBM air fryer.
Oxford Guy - Friday, September 3, 2021 - link
Less than overclocked Rocket Lake.
mode_13h - Thursday, September 2, 2021 - link
Yeah, I don't see why it has to physically move, just because it got demoted from L2 to L3. If all cache blocks contain a mix of L2 and L3, then why not leave it in place and simply re-tag it?
Maybe you have to evict something else from L3, in order to make room. Even then, I don't quite see why you'd move it to another L2/L3 block. If another core is actively or recently using it, then it wouldn't be getting evicted.
schuckles - Friday, September 3, 2021 - link
Generally data is only evicted if you need to make space for new/other data. So if the core is trying to pull in new L2 data then it needs to evict something else that is no longer needed. When evicting from a *private* cache you generally push it up one layer -> so it's going to L3. Putting it into another core's L3 is really the only spot to put it.
> Putting it into another core's L3 is really the only spot to put it.
That still doesn't wash. As far as the detail that I've seen, the distinction between L2 and L3 is mainly one of quota. If a line in a cache block is currently at L2 status, and then gets demoted to L3 status, then all that needs to happen is for something else to get evicted from that block's L3 quota to make room for the newly demoted line. There's no need to actually *move* anything!
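A minimal sketch of that quota idea -- to be clear, this is my reading of how it *could* work, not a description of IBM's actual implementation:
    # Hypothetical quota-based block: "demotion" from L2 to L3 is just a tag flip.
    class CacheBlock:
        def __init__(self, l2_quota, l3_quota):
            self.lines = {}            # line address -> level ("L2" or "L3")
            self.l2_quota, self.l3_quota = l2_quota, l3_quota

        def count(self, level):
            return sum(1 for lvl in self.lines.values() if lvl == level)

        def insert_l2(self, addr):
            if self.count("L2") >= self.l2_quota:
                victim = next(a for a, lvl in self.lines.items() if lvl == "L2")
                self.demote(victim)    # L2 quota full: demote some L2 line in place (naive FIFO pick)
            self.lines[addr] = "L2"

        def demote(self, addr):
            if self.count("L3") >= self.l3_quota:
                l3_victim = next(a for a, lvl in self.lines.items() if lvl == "L3")
                del self.lines[l3_victim]   # evict from the L3 quota (to DRAM / elsewhere)
            self.lines[addr] = "L3"         # re-tag; the data itself never moves

    blk = CacheBlock(l2_quota=2, l3_quota=2)
    for a in (0x100, 0x140, 0x180, 0x1C0, 0x200):
        blk.insert_l2(a)
    print(blk.lines)   # older lines are now tagged L3 but still resident in the same block
Demotion is just a tag flip plus an eviction within the block's L3 quota; nothing gets copied across the chip.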
Kevin G - Friday, September 3, 2021 - link
For shared portions of memory running across different cores, simply retagging the local L2 as the virtual L3 for the remote core would work rather well. One of the topics that has been slowly creeping into the spotlight has been the amount of energy needed to move data around. Granted, this would still require the tagging information to be refreshed, but it should conceptually be less than moving the full amount of data across the chip and updating the tagging there.
For shared memory this does bring up an interesting scenario: can a piece of data be tagged as both local L2 as well as remote L3 on a remote core or L4 on a remote socket? From a user space program, this would be an elegant solution to where the single source of truth lies in a coherent system. I wonder what coherency black magic IBM pulled off to accomplish this at speed in hardware.
mode_13h - Friday, September 3, 2021 - link
> can a piece of data be tagged as both local L2 as well as remote L3 on a remote core
This gets to the very definition of L2 and L3. Under a quota-based scheme, where L2 is defined as the amount of a cache block controlled by the associated core, the line would enter L2 "status" if being written by that core. It wouldn't then *also* be L3, as long as it's within the core's L2 quota. Once it gets demoted from L2, then it should go to L3 status, before getting completely evicted.
> or L4 on a remote socket?
I wonder if they're just being cute, by calling it L4. It seems to me that L4 is a functional definition of what happens when you're doing a cache snoop and find the data of some remote DRAM in a remote core. In that case, it'd be a little faster to get it out of the remote cache. However, I wonder if and why a remote core would ever *put* something in a chip's cache. I don't really see it, but it could make sense for the chip which hosts the attached DRAM.
jim bone - Friday, September 3, 2021 - link
If your L2 and L3 are sharing the same local physical memory I think the only distinction between what is L2 and L3 is your eviction priority. Latency to both should be the same. But we already have finer grained methods of determining what ought to be evicted than a single-bit L2/3 indicator.
So - I suspect L3 for this architecture is necessarily non-local.
jim bone - Friday, September 3, 2021 - link
Or more generally, a specific local cache line is considered L2 for that local processor, whereas that same line is considered virtual L3 for non-local processors.
jim bone - Friday, September 3, 2021 - link
in which case the coherence algorithm doesn't really need to change - just that evictions from local L2 can go either to DRAM or someone else's L2.
jim bone - Friday, September 3, 2021 - link
unless the cache is fully associative (?) something may *have* to move. the eviction is to free up something in a physically specific section of SRAM.
schuckles - Friday, September 3, 2021 - link
I thought it was pretty clear that the line is being evicted to make space for new data. In such a scenario it has to be moved, either to L3 or back to DRAM (it was private, so there might still be stale data; you can't just drop this line completely). If you want one core to have useful access to the full chip's L3 capacity you must move from your private L2 into the chip-wide virtual L3. This way code operating on problems with a working set >32MB will actually get access to it.
It could also be that this does not occur 100% of the time. There could be a heuristic that determines whether it gets demoted to L3 locally, moved to another core's L3, or flushed straight to DRAM, depending on various performance counters. I assume this type of heuristic is exactly what they would have simulated in advance and determined to be optimal for their expected workloads.
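Something like this invented decision function is what I have in mind; the inputs, ordering and thresholds are made up purely for illustration, since IBM hasn't published theirs:
    # Invented heuristic for where a line displaced from the local L2 should go.
    def place_evicted_line(dirty, local_l3_free, remote_l2_spare, reuse_likely):
        if local_l3_free:
            return "retag as local L3"                        # cheapest: no data movement
        if remote_l2_spare and reuse_likely:
            return "move to another core's L2 (virtual L3)"
        if dirty:
            return "write back to DRAM"                       # must not lose modified data
        return "drop (clean line, low expected reuse)"

    print(place_evicted_line(dirty=True,  local_l3_free=False,
                             remote_l2_spare=True,  reuse_likely=True))
    print(place_evicted_line(dirty=True,  local_l3_free=False,
                             remote_l2_spare=False, reuse_likely=False))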
brucethemoose - Friday, September 3, 2021 - link
Moving it is the point, right? Data gets evicted to some *other* less congested L2 to make room for something new in the local L2.
If it just gets retagged as L3, it's not making room for anything new.
COtech - Thursday, September 2, 2021 - link
Very interesting. It seems like whether it will be beneficial in operation has a lot to do with how much of the core cache capacity is usually vacant (or occupied by low value data) and how often a piece of cached data is "reused" by the same core or another core in the system.
The second seems less likely in a cloud system with multiple customers using the same resources but seems likely to be substantial in a single purpose big iron environment.
Will data errors be more prone to propagate if all cores "reuse" data copies rather than independently going "out" to the source? (all sharing a bad copy) or less likely because there are fewer total "copy" operations taking place and resulting in fewer points for error?
I should have gone into Engineering - these are much more interesting questions to me than the human problems I deal with and I think there are actual right answers to be found.
cjacobi - Thursday, September 2, 2021 - link
There's very strong error correction codes on the L2 cache, which can even recover data if an entire cache memory segment fails on the chip. We always check the correctness of the data on every access using those codes, and if we find a bit (or multiple) have flipped we fix it before sending the data out.
Sivar - Thursday, September 2, 2021 - link
IBM's small team, representing a tiny part of their revenue, has long led the industry in performance innovation.
It reminds me of Broadcom, who consistently has the lowest-noise communication chips for mobile phones, even though it is a tiny part of their revenue.
Both keep a clear lead by virtue of having ridiculously amazing engineers like Dr. Christian Jacobi, even against companies -- or countries -- that have much larger R&D budgets.
COtech - Friday, September 3, 2021 - link
In my experience, organizations that consistently punch above their weight do so not on the strength of recruiting but by developing those they recruit.
Freakie - Thursday, September 2, 2021 - link
"On the face of it, each L2 cache is indeed a private cache for each core, and 32 MB is stonkingly huge."
Heh, nice meme usage there!
watersb - Thursday, September 2, 2021 - link
This architecture seems almost familiar to me: like a big, global whiteboard publish/subscribe system. Tagged items, content-addressable memory.
Those sorts of things seem to work out ok when you need massive throughput. I think I've only seen this done on systems where the data processing operations are fixed, deterministic-time instructions.
Like in network routers.
Wild: looks like IBM took "Smart NIC" (network interface controller) idea all the way up the stack.
lightningz71 - Thursday, September 2, 2021 - link
It's amazing what you can do with a processor architecture when it doesn't have to be all things to all people. The non-core power usage of these systems should be quite significant, to the level of massive. That's OK if you don't have to also live in laptops and tablet-sized convertibles, nor meet things like the California desktop PC power consumption limits.
As for how this works in practice? I feel like this would work very well for general purpose computing, but would have issues when dealing with objects that have very large data sets. Things that overflow the L2 cache will start to chew through ring bus capacity and power, but still have reasonable performance. Things that exceed more than 50% of the CPU's total L3 vCache capacity are going to start to eat up package power and bandwidth and start taking a performance hit as compared to an on-chip L3, which might be smaller than that amount. But it looks like it gets ugly when it gets above 256MB and starts hitting the module neighbors for information. That starts impacting the performance of every core in the module as their power budget and available cache amounts start to diminish, and they start to fight for resources.
There's a niche where I am sure this will work well, but, it's not something that can scale down very well. That's fine for IBM, but this strategy isn't going to work well everywhere.
Kamen Rider Blade - Thursday, September 2, 2021 - link
I wonder when Ian can get a chance to test one of these systems.
Ian Cutress - Thursday, September 2, 2021 - link
I'll run some minesweeper benchmarks.
Kamen Rider Blade - Thursday, September 2, 2021 - link
=D
Zio69 - Friday, September 3, 2021 - link
No can do. There's no actual "interactive" mode - even when using CICS (the transaction handler), full conversational mode is not a best practice. Online processing is done pseudo-conversationally - the transaction stops when output is sent to the terminal/caller and then resumes when the terminal/caller sends input again.
BUT you may find a way to run some Linux benchmark in USS (Unix System Services).
I would gladly help you, but I was "recycled" and no longer work on mainframes. Sorry.
Soulkeeper - Thursday, September 2, 2021 - link
Ever since AMD and Intel moved to the large L3 designs, I've felt that it wasn't ideal. There has to be a better way. I mean, how many levels do we need for the same socket? It's good that IBM is at least trying something different, although I don't know if this is the best solution either ...
FunBunny2 - Thursday, September 2, 2021 - link
"although I don't know if this is the best solution either"
mostly because I worked on them decades ago, even after they were antiques, the TI-990 machines had a really radical design - no instruction or data registers in the cpu, everything done in memory. the justification at the time was that processor and memory cycle times were close enough that load/store (no recollection if that term even existed then) was sub-optimal.
the next logical step, of course, is to eliminate memory as temporary store altogether. Nirvana.
while Optane, either in absolute performance or marketing, hasn't gotten traction as a direct single-level datastore (i.e. all reads and writes are to durable storage), a baby step of consolidating caches may hint that some folks in the hardware side are looking in that direction.
OSes will need to be modified, perhaps heavily, to work with such hardware. another olden days machine that provides some guidance is the OS/400 as originally designed. it operated a SQL-ish database as datastore without a filesystem protocol (a filesystem was later grafted on). an 'object' datastore without the filesystem protocol eliminates more impedance matching.
the future may look a lot different.
Threska - Thursday, September 2, 2021 - link
I imagine Pmem is going to have to deal with security.
https://www.snia.org/education/what-is-persistent-...
mode_13h - Thursday, September 2, 2021 - link
Memory-to-memory architectures made more sense when CPUs ran so slowly that DRAM accesses only took a handful of clock cycles. These days, memory-to-memory would be completely uncompetitive, unless your CPU is doing some kind of memory-renaming and internally remapping them to registers.
You can't just forego DRAM and use Optane instead. It would need to be several orders of magnitude more durable than it currently is.
However, Intel has been working on using Optane from userspace (i.e. without kernel filesystem overhead). But, that's in *addition* to DRAM - not as a complete substitute for it.
mode_13h - Thursday, September 2, 2021 - link
> since AMD and Intel moved to the large L3 designs, I've felt that it wasn't ideal.
Do you want lots of cores, with lots of IPC, running at high clocks? If so, then you need bandwidth. And you need it to scale faster than DRAM has been. Scaling the speed and size of caches is the way to do that.
If not, AMD wouldn't have been able to tout such impressive gains, in real-world apps, by simply scaling up L3 from 32 MB to 96 MB.
Of course, another way to do it is with in-package HBM-type memory, which comes at some latency savings and increased bandwidth vs. DDR memory sitting on DIMMs.
A yet more radical approach is to reduce the burden on caches, by using a software-managed on-chip memory. This is something you'll find in some GPUs and more specialty processors, but places a lot more burdens and assumptions on the software. Going to a private, direct-mapped memory avoids the latency and energy tax of cache lookups and maintaining cache coherency.
FunBunny2 - Thursday, September 2, 2021 - link
Question for Ian (or anyone at least as close to Big Blue):
In the olden days, IBM monitored its installs, seeking the most used instructions, data flows, and the like; mostly to optimize the machine for COBOL (the 360 meme, one machine for science and business, died almost immediately) applications.
is this 'radical' cache structure the result of customer monitoring, or (like Apple) IBM is telling the customer base 'we don't care what you think, this is what you need'?
Tomatotech - Thursday, September 2, 2021 - link
"In the olden days, IBM monitored its installs, seeking the most used instructions, data flows, and the like; mostly to optimize the machine for COBOL (the 360 meme, one machine for science and business, died almost immediately) applications.
is this 'radical' cache structure the result of customer monitoring, or (like Apple) IBM is telling the customer base 'we don't care what you think, this is what you need'?"
I suspect Apple has far greater insight into the codepaths* its customers run than IBM ever did. The iOS phones are absolute miracles of OS and hardware working together in lockstep to achieve more with less resources (or consume less battery power) than any other phone maker on the planet.
Apple looks to be repeating this achievement with their M1 Macs.
*staying well away from the whole CSAM issue. We're just talking about IBM / Apple tweaking their OS/ hardware to maximise the efficiency of their customers' most highly used codepaths / dataflows.
Oxford Guy - Friday, September 3, 2021 - link
Miraculous voicemail that won’t delete and auto-defect that can’t be disabled. Want to turn off voicemail transcription? You may be able to use a kludge but there is no actual normal setting to give users direct control. This is normal for Apple in recent times. Strip away user control and be hailed for efficient program CPU utilization. Stripping features from a program is one way to speed it up but what about the efficiency of the user’s workflow?I’d gladly trade a bit of battery life for a system that has more respect for the user. But Apple has other plans. It even wants to play cop with people’s phone. Warrantless surveillance that Scientific American warns will lead to the abuse of a vulnerable minority. It also warns that this is a brainwashing of young people to believe they have no right to privacy. Big brother Apple is going to monitor you, not just government agencies. The corporate-government complex’s dissolving of the line between corporation and government continues at speed. A $2 trillion valuation doesn’t happen without gifts to those in power.
GeoffreyA - Sunday, September 5, 2021 - link
No matter what Apple does, people will still worship before the Fruit Shrine.
Oxford Guy - Tuesday, September 7, 2021 - link
The same goes for MS.
Windows 11 offers consumers entropy rather than value. It will be a success for MS nonetheless.
GeoffreyA - Tuesday, September 7, 2021 - link
Oh, yes. Lost faith in Microsoft a long time ago, though they have pulled up their socks in the last few years, somewhat. As for 11, I'm just going to stick with 10 as long as I can. If I've got to use it for another decade, I've got no problem with that.
GeoffreyA - Tuesday, September 7, 2021 - link
You know, we always eat our words. I had another look at 11 just now, and have got to admit, it doesn't look half bad. Indeed, seems like sense is at work once again: lots of rubbish has been removed, such as Cortana, OneDrive, and the Ribbon. And from an appearance point of view, there's a restrained steering away from Metro's plainness. Then, the part that wins me: the command bar, replacing the Ribbon, is looking suspiciously like XP's. Microsoft, what has got into you? Common sense? October, I'm going to give it a go.
Oxford Guy - Tuesday, September 7, 2021 - link
Fiddling with UI appears to be enough to placate most, keeping the entropy flowing. Slap a superficial coat of paint on it and we can ignore the badness of the deal.
GeoffreyA - Wednesday, September 8, 2021 - link
Apart from the telemetry, which can be toned down using tools like ShutUp10, the problem with Windows today is largely one of appearance. Repeatedly, Microsoft has tried to tack on mobile rubbish, and it is just that, tacked on; the OS has resisted attempts for it to sink deeper. It's still the Windows of yore all the way through.
Windows is very like a house built on a solid foundation, with a sound plan. Its finish---the plaster, paint, and furniture---used to be excellent in the days of XP and 7. After 7, it's as if new owners bought the place and, modern folk that they are, have been repainting the walls, putting up gaudy ornaments, and adding extensions that don't square with the original plan. I'm sorry to say they even got rid of some beautiful antique furniture, adding, "Our synthetic wood is what everyone's going for nowadays. It's the in thing."
When I visited the house, I saw that it had been defaced considerably; but looking closer, realised the Windows I knew and loved was still there, beneath all the make-up. I smiled.
FunBunny2 - Thursday, September 9, 2021 - link
"Windows is very like a house built on a solid foundation, with a sound plan."
wha....??? do you mean it's still DOS deep down??? :)
GeoffreyA - Thursday, September 9, 2021 - link
Aye, DOS all the way to the bone. Bill recently admitted on a Reddit AMA, "We were runnin' out o' time, so I told the NT team, *Dave, just copy and paste the DOS code, throw some up-to-date modules in, smack on the GUI, and no one will know the difference.*"
mode_13h - Thursday, September 2, 2021 - link
> is this 'radical' cache structure the result of customer monitoring
Most likely, they asked what apps customers use most, then profiled and analyzed them on their own.
The_Assimilator - Saturday, September 4, 2021 - link
It isn't the 1960s anymore. Consumers have data privacy rights now.
FunBunny2 - Saturday, September 4, 2021 - link
"It isn't the 1960s anymore. Consumers have data privacy rights now."
at least officially, customers signed up for the instrumentation.
GeoffreyA - Sunday, September 5, 2021 - link
Paradoxically, I find it hard to believe that.
Oxford Guy - Tuesday, September 7, 2021 - link
'Consumers have data privacy rights now.'
Yes. They're told when their SSNs and such are leaked by everyone under the sun.
metafor - Thursday, September 2, 2021 - link
I like how the added latency to access another chip's cache is a "?"
First thing that jumped out at me was "it might be slower than DRAM at that point". Either they have an enormous broadcast/response fabric to do all of this cache state coordination traffic (and they might) or virtual L4 ends up having some major glass jaws compared to just going to local DRAM.
cjacobi - Thursday, September 2, 2021 - link
An on-drawer cache hit is still significantly faster than a memory access, even for chip-local memory. And given the memory size of those systems (many Terabytes, z15 can have up to 40TB in a shared memory config), the data you need is often hanging off a different chip so you need to traverse that chip-to-chip distance anyway. We do however access memory speculatively in some cases to further reduce latency, while broadcasting in parallel for cache coherency across a number of chips.
mode_13h - Thursday, September 2, 2021 - link
> virtual L4 ends up having some major glass jaws compared to just going to local DRAM.
Keep in mind that all cache-coherent systems have the scaling problem of needing to ensure that no other core has an exclusive copy of the cacheline you want.
However, whether it's worth fetching a non-exclusive copy on another chip vs. reading from local DRAM is still a decision that can be made on a case-by-case basis. Usually, a cache line would get fetched into L1 or L2. And if L3/L4 is inclusive, then there's no problem with a given cacheline simultaneously existing in another chip's L3 and the local chip's L1 or L2.
goldman1337 - Thursday, September 2, 2021 - link
The idea sounds nice, but I'm wondering why this is better than just a big shared L2 cache?
mode_13h - Thursday, September 2, 2021 - link
The traditional distinction between L2 and L3 is that L2 is private and L3 is shared. So, what you're saying is that you effectively want to do away with L2. The disadvantage of doing so is that one core that's running a memory-intensive workload could crowd out the L3 sets of others.
PaulHoule - Thursday, September 2, 2021 - link
IBM's mainframes were called System/360 because the 360 could handle any workload (not niche). They found out pretty fast it couldn't compete economically with minicomputers for some users, so they regretfully introduced a large number of architectures that weren't "compatible".
They were handing out "less than a full core" to users in the 1970s, long before the advent of "cloud computing". I remember being in the Computer Explorers and spinning up instances of VM/CMS, which was basically a single-user operating system a lot like MS-DOS inside a VM, to do software development tasks.
IBM had demands to make a "Baby 370" for software devs but it never caught on because it was more cost effective to give them a little slice of a big one.
FunBunny2 - Thursday, September 2, 2021 - link
"IBM had demands to make a "Baby 370" for software devs but it never caught on because it was more cost effective to give them a little slice of a big one."
yeah, but...
Multics did the most to move time-sharing forward
TSO was IBM's version, as a temporary patch, which became the product
"The Soul of a New Machine" chronicles not just one machine, but a broad overview of minis in the late 70s, when a mini was multiple boards of, mostly, discrete parts, and each company had its own version of what an OS was.
Teckk - Thursday, September 2, 2021 - link
Interesting concept and a fantastic article with a solid explanation. Thanks, Dr. Ian.
Kamen Rider Blade - Thursday, September 2, 2021 - link
I wonder what kind of security vulnerabilities this design of Virtual Cache would present?
Jorgp2 - Thursday, September 2, 2021 - link
Wouldn't latency to memory be lower than latency to another chip?
mode_13h - Thursday, September 2, 2021 - link
You're assuming the DRAM holding the cacheline you want is directly connected to your core. However, what if it's actually hosted by a 3rd chip that's even farther?
mode_13h - Thursday, September 2, 2021 - link
Uh, I mean "the DRAM holding the data you want."
diediealldie - Thursday, September 2, 2021 - link
Kinda curious how QoS is managed between cores in a single chip and between different drawers. Even in a single chip, there can be tons of different threads with different memory usages. If there are memory-intensive workloads in different cores, maybe cores will be trying to 'steal' cache from others.
This looks like overprovisioning in virtual machines (pretend that there's enough space to evict something). So I expect that there will be similar problems and am quite curious how IBM handled this. From each core's perspective, I don't see any reason why there should be extra unused(?) L2 which can be used for other cores' L3.
eastcoast_pete - Thursday, September 2, 2021 - link
I wonder how the new cache architecture was also designed to fully integrate the AI cores that Telum has? IBM stressed that being able to do "AI on the fly" (my words, not theirs) is a key feature of the new mainframe CPU, so maybe some of these changes are to make that easier and (especially) faster? Any words from IBM on that?
mode_13h - Thursday, September 2, 2021 - link
> when it comes time for a cache line to be evicted from L2, ... rather than
> simply disappearing it tries to find space somewhere else on the chip.
Why would it necessarily have to get moved? If the same physical cache is shared between L2 and L3, and the partitioning is merely logical instead of physical, why couldn't it (sometimes) just get re-tagged as L3 and stay put?
I think that'd save a lot of energy, by eliminating pointless data movement. Of course, if the cache block has no available space in its L3 quota, then I suppose the line might indeed have to be relocated. Or maybe it could evict a L3 line from that cache block that could potentially find a home elsewhere, if it were still in sufficiently high demand.
mode_13h - Thursday, September 2, 2021 - link
It seems to me that what should scale most efficiently is to have blocks of L3 coupled to each memory channel. This way, you know precisely which L3 to check and your L3 bandwidth will scale linearly with your memory bandwidth. And while load-balancing could be an issue, software already has an incentive to load-balance its usage of different memory channels.
What am I missing? Is energy-efficiency just not a priority, at that level?
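For what I mean by channel-coupled slices, something like this (bit positions are arbitrary examples, not any real controller's mapping):
    # The physical address alone tells you which memory channel -- and therefore which
    # co-located L3 slice -- to check. Simple line-interleave, purely illustrative.
    LINE_BITS = 6            # 64-byte lines
    CHANNELS  = 8

    def channel_of(phys_addr):
        return (phys_addr >> LINE_BITS) % CHANNELS

    def l3_slice_for(phys_addr):
        return channel_of(phys_addr)     # one L3 slice per channel, same index

    addr = 0x1234_5678
    print("channel", channel_of(addr), "-> only L3 slice", l3_slice_for(addr), "needs checking")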
Zio69 - Friday, September 3, 2021 - link
No. Energy-efficiency is not a priority at all. In the mid 90s the mainframe I used to work on was HEAVILY water cooled. On 3 sides of the building (about 40m x 40m - so 120m) there was an uninterrupted line of 50-60 cm fans that provided cooling for the beast. If my memory serves me well it was a 9000 series; I don't remember the specific model. As you can guess, those 200ish fans dissipated a lot of heat, so shedloads of power were consumed to keep that baby on 24/7.
I was speaking in very broad terms about computer architecture, not simply restricted to mainframes. Sorry not to be clear about that.
The reason I went there is that the article seems written with an eye towards broader trends in cache hierarchies. So, it was those broader trends that I was attempting to question.
LightningNZ - Saturday, September 4, 2021 - link
To couple a cache to each DRAM controller and have it store local data, you'd want it to be physically indexed. I'm not aware of any physically indexed caches. You'd have to do all your conversion from virtual to physical addressing before lookup, and then you'll need to store the whole virtual address as the tag. To make matters worse, you need to store virtual tags for all processes that are accessing that data, which is an arbitrary number. This is trivial in a virtually indexed cache as the same physical index can be held in multiple cachelines as long as they are in a non-exclusive mode, with the TLBs ultimately protecting usage. If you had very long cachelines that were the size of an entire DRAM transfer or bigger then maybe the overhead would be worth it. Otherwise it's a lot of gates for tracking a small amount of information.
mode_13h - Sunday, September 5, 2021 - link
> You’d have to do all your conversion from virtual to physical addressing before lookupAh, good point. That would add latency and energy expenditure to the lookup.
> then you’ll need to store the whole virtual address as the tag.
Why? If the cache is dealing in physical addresses and the OS is ensuring that no two virtual addresses overlap (or, that any which do have its explicit approval), then why would the cache also need to know the virtual address?
LightningNZ - Sunday, September 5, 2021 - link
> Why? If the cache is dealing in physical addresses and the OS is ensuring that no two virtual addresses overlap (or, that any which do have its explicit approval), then why would the cache also need to know the virtual address?
You're right. As long as you stored the entire 64 bits of the address it shouldn't be an issue. That would boost its density too.
LightningNZ - Monday, September 6, 2021 - link
I forgot in my first reply that if you don't have virtual tagging then you can't tell a core making a request where the data may be above it - that information could currently be in another cache. You'd have to have resolved probes and whatever already, so it's not just another level in the cache hierarchy, but a dumb latency reducer for DRAM accesses. It'd have to be quite large, I'd have thought, to give much benefit if you'd already traversed all other caches.
ZolaIII - Friday, September 3, 2021 - link
They have my applause on the L2 size, but other than that it's a victim of/to a victim. If they went with a much larger memory cube (HMC) and an interesting bus (which would still be in the same role) I would have said fine. I would really like to see consumer grade (mobile/tablet/laptop...) SoC's where it can serve (either cube or HBM) as the final RAM (as neither the DDR latency nor new buffer levels make much sense anymore).
misan - Friday, September 3, 2021 - link
Isn’t Apple already doing something like that with their shared L2 cache? From benchmarks done by Anandtech it seems like every M1 core has „priority“ access to a portion of L2 and the rest is used as some sort of a virtual L3.AntonErtl - Friday, September 3, 2021 - link
Unfortunately, the article just reaffirmed what I understood from the little bit that the slides shown at HotChips revealed already. The interesting pieces are missing: When is a line in a different L2 considered "unused"? How does that information get communicated? Is the system using directories, or is it just snooping?
One idea that comes to my mind is a directory at each memory controller that tells which lines for this memory controller are in which cache(s); this would keep traffic down, but may increase latency.
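A sketch of that directory idea, with an entirely invented structure, just to make it concrete:
    # Per-memory-controller directory: which caches hold which of "my" lines.
    from collections import defaultdict

    class HomeDirectory:
        def __init__(self):
            self.sharers = defaultdict(set)   # line address -> set of cache ids
            self.owner = {}                   # line address -> cache id with an exclusive copy

        def read_miss(self, line, requester):
            if line in self.owner:                              # someone has it exclusive:
                msgs = [("downgrade", self.owner.pop(line))]    # one targeted message, no broadcast
            else:
                msgs = []
            self.sharers[line].add(requester)
            return msgs

        def write_miss(self, line, requester):
            msgs = [("invalidate", c) for c in self.sharers[line] if c != requester]
            self.sharers[line] = {requester}
            self.owner[line] = requester
            return msgs

    d = HomeDirectory()
    print(d.read_miss(0x80, requester=3))    # [] -- nobody to notify
    print(d.write_miss(0x80, requester=7))   # [('invalidate', 3)] -- targeted, not broadcast
Reads and writes then turn into targeted forward/invalidate messages instead of a broadcast snoop; the price is the directory storage and the extra hop through the home node.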
Brane2 - Friday, September 3, 2021 - link
Another bunch of marketing nonsense.
I'm sure IBM chose the model that fits THEIR use patterns the best.
Which is usually far from what others in x86_64 land might see.
What is interesting is how easily they have managed to sell it to "doctor" Ian Cutress...
Doctor of what ?
19 cycles is not that great for L2. Above that, 12ns is not that great for on-the-chip communication.
Above that, this L3/L4 from many L2 tiles doesn't seem to bring anything significant at the inter-chip level. Latencies are so big anyway that a couple of cycles more or less don't mean much.
Also, L1,L2 and L3 are different beasts. One can't compare and translate their logic 1:1.
Oxford Guy - Friday, September 3, 2021 - link
You should read the article.
Ian Cutress - Saturday, September 4, 2021 - link
Love your use of quotes.
The answer to your question is simply 'heresy'. I studied and wrote a thesis on the mystic arts. I can identify magic when I see it.
mode_13h - Saturday, September 4, 2021 - link
Not sure exactly what your issue is. However, on the semantic front, those who disdain use of the title "doctor" for Ph.D. recipients would do well to note that the distinction pre-dated Medical Doctorates by at least 5 centuries.
FunBunny2 - Saturday, September 4, 2021 - link
"distinction pre-dated Medical Doctorates by at least 5 centuries."
it wasn't until well into the 20th century that an MD actually had to attend a Medical School; it was not much more, if that, than an apprenticeship in blacksmithing.
GeoffreyA - Sunday, September 5, 2021 - link
Yet, fast forward to today, and I doubt whether even a theoretical physicist would get the amount of fawning your everyday, run-of-the-mill MD gets.
The_Assimilator - Saturday, September 4, 2021 - link
So you start off with "the future of caches"... then you admit "IBM Z... is incredibly niche". And at 530mm2 die size versus Zen 3's 84mm2, of course they can fit in a stupid amount of L2 cache and virtualise it.
So no, this is not the future of anything except chips with stupidly large dies.
FunBunny2 - Saturday, September 4, 2021 - link
"So no, this is not the future of anything except chips with stupidly large dies."
well, maybe large relative to current Intel/AMD/ARM chips, but what used to occupy hundreds (even thousands) of square feet of raised floor in liquid-cooled machines is now a couple of chips.
as to niche: web apps are really the niche, in that they all do the same thing. the mainframe does the heavy lifting in the real world, but it's largely invisible to the likes of AT readers.
mode_13h - Saturday, September 4, 2021 - link
TA's comparison is apt, in that they are both 8-core dies made on a similar process node!
By using so much L2/L3 cache, essentially what IBM has done is taken an overbuilt 4 cylinder engine and strapped an enormous supercharger to it, as a way to buy a little more performance for a lot of $$$. The only reason they can get away with such a large disparity between cores and cache is that most of their customers will buy 8+ node systems, whereas most x86 & ARM servers are single-CPU or dual-CPU.
The reason I call it a 4-cylinder engine is that if they would reveal more details about the micro-architecture, I think it would appear fairly simple by today's standards. Most of the complexity in their design is probably its RAS features, and much of their verification resources probably go towards testing that they all work properly and robustly. So, its IPC is probably comparatively low. And when you're trying to eke out a little more performance, in order for the product to stay relevant, enlarging cache is a pretty reliable (if expensive) way to do it.
mode_13h - Saturday, September 4, 2021 - link
> the mainframe does the heavy lifting in the real world
This is BS, of course. HPC doesn't use mainframes and nor do the hyperscalers. Corporate back offices run in the cloud or on conventional servers, not mainframes. Mainframes are just too expensive and not necessary, even for most "real work". They haven't kept up with the perf/$ improvements of commodity computing hardware, for a long time.
There are really only a few niche areas where mainframes still dominate. Some of those niches are surely starting to drift away from mainframes, which explains why they added in the machine learning acceleration.
And you can be pretty sure mainframes aren't picking up new niches. The computing world is much more invested in building resiliency atop commodity server hardware. Case in point: I know some big crypto currency exchanges use AWS, rather than mainframes.
FunBunny2 - Saturday, September 4, 2021 - link
"Case in point: I know some big crypto currency exchanges use AWS, rather than mainframes."
don't sign on.
mode_13h - Sunday, September 5, 2021 - link
I wonder if anyone can find a niche, where mainframes dominate, that doesn't fall under the category of "critical infrastructure"?
I'm not attacking the very idea of ultra-reliable hardware, just the claim that it's worth the TCO in any case where downtime isn't extremely costly.
FunBunny2 - Thursday, September 9, 2021 - link
"I'm not attacking the very idea of ultra-reliable hardware, just the claim that it's worth the TCO in any case where downtime isn't extremely costly."
what folks who've never worked in a MegaCorp mainframe shop desperate to have an 'innterTubes' presence (I have) don't know is that 100% of that brand new web app from MegaCorp (your bank, insurance, grocery store...) is really just some lipstick on a COBOL pig from 1980. no one, but no one, re-makes the 3270 app soup to nuts. just puts new paint on the front door.
Oxford Guy - Tuesday, September 7, 2021 - link
'versus Zen 3's 84mm2'
AMD designed Zen to be scaled down to low-power devices and to be higher-margin.
If x86 were to have had more competition than a duopoly in which one of the two players is still hobbled by 14nm, AMD may not have been able to get away with such a small die.
People have been conditioned to see what the duopoly produces as being the pinnacle of what's possible at a given time. Apple's M1 shed a bit of light on that falsity, and it's not even a high-performance design. It's mainly designed for power efficiency (and die size/profit margin, too).
GeoffreyA - Sunday, September 5, 2021 - link
Thanks, Ian, for the well-written article. I don't know much about caches, but if I may venture an opinion from the armchair of ignorance, I'd say this is going to take a lot more work/complexity for minimal gain, if any. Seems to me the classic L2/L3 cache will win the day. But 10/10 for innovative thinking.
WaltC - Sunday, September 5, 2021 - link
It's difficult to understand why people insist on talking about x86 CPUs these days because there aren't any and haven't been any "x86" CPUs in many years. The x86 instruction set is but a tiny segment of today's Intel and AMD CPUs that are very advanced risc-cisc hybrid OOOP designs that don't resemble 80286/386/486 & even 586 CPUs at all. "x86" software compatibility is maintained in the CPUs merely for the sake of backwards compatibility with older software designed to run on real x86 CPUs, but these CPUs haven't been "x86" in a long time.
Back in the 90's when real x86 CPUs were shipping, the scuttlebutt was that RISC was going to leave "x86" behind and become the new paradigm--yes, that long ago. That never happened because "x86" moved on far beyond what it was while maintaining the backwards software compatibility that the markets wanted. That's why you still hear that "x86" is not long for the world--because it's terrifically oversimplified. Apple customers, especially, think "x86" is the same thing today as it was 30 years ago...;) All of that is just marketing spiel.
AMD's (and Intel's) "x86" CPUs will continue to change and improve and push ahead--they aren't going to be sitting still. But when people say "x86" today, that's what some think--that x86 hasn't changed in all these years and so it "must" at some point be superseded by something better. They keep forgetting that the CPUs that have superseded the old x86 CPUs of the 80's/90's are themselves "x86"...that's the part you don't read much about. It should be well understood. x86 has little trouble today with 64-bits, for instance, and many other things that were never a part of the old x86 hardware ISA's.
GeoffreyA - Monday, September 6, 2021 - link
It's fair to say these are x86 CPUs, because that's what they are, despite the decoding to an internal format that happens in the front end. But I agree with the gist of your comment. It's a pop-culture commonplace that x86 is dead, x86 is going down, or almost done for. Why? Well, according to popular wisdom, old is bad and new is good. But watch how fickle a thing our allegiances are. As a fanciful example, if Apple were to switch to RISC-V, watch how opinion would quickly swerve round to denouncing ARM and vindicating its successor.
Oxford Guy - Tuesday, September 7, 2021 - link
Software compatibility, licensing.
x86 + Windows has been the dominant non-handheld consumer platform for a long time. Not only that, x86 is the hardware of all the so-called consoles. Not only that, even today Apple is still selling x86 equipment. x86 certainly has plenty of focus for Linux developers and users, too.
The technical implementation of those instructions isn't very important so long as the restriction on who can build chips with them remains so relevant.
Or, is your argument that any company can begin to sell x86-compatible chips, chips that run the instructions 'natively' rather than in some sort of Rosetta-like emulation? My understanding is that only Intel, AMD, and VIA have had the ability to produce x86 CPUs for many years.
Oxford Guy - Tuesday, September 7, 2021 - link
Except for AMD’s licensing of Zen 1 to China which somehow was approved. I’m not sure how AMD managed to enable another company to manufacture x86. Are all the patents Intel held that Zen 1 used expired — so anyone can make unlicensed (by Intel) x86 chips so long as they don’t run afoul of newer patents?GeoffreyA - Thursday, September 9, 2021 - link
I vaguely remember from the article that it was through some sort of convoluted legal trickery.Ian Cutress - Tuesday, September 7, 2021 - link
All that text and yet no suggestions for an alternative name. Interesting. Seems like your goal wasn't to move the conversation needle forward, but just to complain.Jim Handy - Friday, September 10, 2021 - link
Dr. Cutress, IBM's cache approach isn't a preview of the future of caches. It's something that I described in my Cache Memory Book back in 1993. It's called Direct Data Intervention, and has been around for some time. It's on Page 154.It's still cool!
Evictions aren't as complicated as all that, either. If a line's clean then it's simply over-written. On the other hand, if it's been written to (Dirty), then a process is followed to write it back to main memory before it's over-written, although, in some cases, it's simply written back into the next-slower cache level.
It's nice to see IBM using this approach to squeeze more out of its caches.
PS, in the sign-up screen I had to check a box saying that I read the ToS & Privacy Policy, but the links to those don't work.
mannyvel - Friday, September 17, 2021 - link
It's interesting how the world is moving away from general-purpose computing and the issues associated with it. A 40% improvement? Holy crap!If anything, it shows that you can gain performance by making things more complicated, which is the opposite of the conventional wisdom these days. Simpler != smarter.
ericore - Thursday, October 14, 2021 - link
Sounds like absolute genius. It's almost like AI caching, let dinosaur caching die. The hardest part past the engineering is the firmware to align everything to work as intended because ok you engineered the possibility, but you still have to make it work in the real world, but IBM at least has a rather targeted audience in mind which will help immensely. In fact, I would love if the AI package on CPUs helped with caching; it would validate their existence.