17 Comments

  • Sahrin - Monday, August 5, 2019 - link

    Didn’t Intel already try to do this and it failed miserably?
  • ats - Monday, August 5, 2019 - link

    Yes, FB-DIMM, with lower latency and less power...
  • Kevin G - Monday, August 5, 2019 - link

    FB-DIMM was not low power and the latency was pretty bad (especially with two or more FB-DIMMs per channel). Though we'll have to see what the real-world latencies of this IBM technology are.

    However, it was a JEDEC standard that leveraged standard parallel DRAM chip configurations with a parallel-to-serial buffer, which is what this IBM technology is also doing.
  • ats - Monday, August 5, 2019 - link

    The quoted device power and the quoted latency numbers are both higher than the AMB chips achieved. I'm not arguing that FB-DIMM was low power or low latency, just that it was lower power and lower latency than this. And yes, if you chained FB-DIMMs the latency went up, but this device doesn't even support that functionality, so it's appropriate to compare only against single AMB channels.
  • SarahKerrigan - Monday, August 5, 2019 - link

    I don't think so. This is basically the successor to IBM Centaur - a combination of the buffered memory that many scale-up servers already use and a fast generic expansion interconnect; I'm not aware of anything else quite like it.
  • ats - Monday, August 5, 2019 - link

    It's basically a perfect analog for FB-DIMM.
  • close - Monday, August 5, 2019 - link

    Are you talking about Optane? Because they're quite different in what they're trying to achieve.

    This is a way of cramming more RAM into a server and allowing for easy generational upgrades by decoupling the RAM type from the IMC in the CPU.
  • Kevin G - Monday, August 5, 2019 - link

    Intel has done this... twice. As has IBM.

    The first, as already pointed out, is the FB-DIMM standard. Intel was the popular advocate for this, but the standard itself was part of JEDEC and a handful of smaller players leveraged it as well. The DIMMs ran hot and had significantly higher latency than their traditional DDR2 counterparts of that era. Technically they could have also used DDR3-based DRAM with an appropriate buffer chip, but no such configuration ever existed to my knowledge.

    The FB2-DIMM spec was proposed but never adopted by JEDEC. Both Intel and IBM leveraged the concepts from this design for their high-end systems (Xeon E7 and POWER7, respectively). Instead of putting the memory buffer on the DIMM itself, the serial-to-parallel conversion chip was placed on the motherboard or a daughter card, which then backed traditional DDR3 DIMMs in most cases (IBM still used a proprietary DIMM format for their really, really high-end systems, which had features like Chipkill etc.).

    IBM followed up their initial memory buffer design in POWER8 by incorporating a massive amount of eDRAM (32 MB) on the buffer chip to serve as an L4 cache. This bulk caching effectively hid much of the memory buffer latency, as the cache's contents could only exist on that memory channel. The buffer chip here did a few other clever things since it had a massive cache like re-ordering read/write operations for more continued burst operations on the DRAM.
  • name99 - Monday, August 5, 2019 - link

    "The buffer chip here did a few other clever things since it had a massive cache like re-ordering read/write operations for more continued burst operations on the DRAM."

    This feature is called "Virtual Write Queue". It's described in this paper:
    https://lca.ece.utexas.edu/pubs/ISCA_2010.pdf

    It seems like IBM tried to patent it, but the patent is abandoned?
    https://patents.google.com/patent/US20150143059

    The technique should be feasible on any SoC where the memory controller and LLC are integrated and designed together, so basically anything modern (including eg phones). Whether it's that valuable in different environments (eg on phones with the very different traffic patterns of GPUs), well who knows? But certainly everyone should be trying to reduce time wasted on DRAM turnaround whenever you have to switch from read to write then back.
    My guess is that a company like Apple, that's already engaged in every performance trick known, probably does something similar, though perhaps more sophisticated to distinguish between different types of traffic.
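
    As a rough illustration of why holding writes back and draining them in bursts helps, here's a toy sketch (this is just the general idea of amortizing read/write bus turnaround, not IBM's actual Virtual Write Queue logic, and the cycle counts are made-up assumptions):

# Toy model: servicing a read/write stream on a DRAM bus where every
# switch between reads and writes costs a turnaround penalty.
# ACCESS_CYCLES and TURNAROUND_CYCLES are illustrative assumptions.
ACCESS_CYCLES = 4
TURNAROUND_CYCLES = 12

def bus_cycles(ops):
    """Total cycles for a sequence of 'R'/'W' ops, charging a penalty
    every time the bus switches direction."""
    cycles, last = 0, None
    for op in ops:
        if last is not None and op != last:
            cycles += TURNAROUND_CYCLES
        cycles += ACCESS_CYCLES
        last = op
    return cycles

# Interleaved traffic: a write sneaks in after every couple of reads.
interleaved = ['R', 'R', 'W'] * 200

# Write-queue style: hold the writes back and drain them in one batch.
batched = ['R'] * 400 + ['W'] * 200

print("interleaved:", bus_cycles(interleaved))  # hundreds of turnarounds
print("batched:    ", bus_cycles(batched))      # a single turnaround

    Same 400 reads and 200 writes in both cases; only the ordering differs, and the interleaved schedule burns roughly three times as many bus cycles in this toy model.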
  • azfacea - Monday, August 5, 2019 - link

    Yes and no.
    It was many years ago and targeted at different use cases. I think the latencies were bad and the IMC ran hot at the time.

    Now this could be useful for very large densities for data science and analytics. I don't recall Intel making server CPUs with 6- or 8-channel IMCs back then. Am I wrong? Was there demand for quad-socket servers just for the extra IMCs rather than for actually high core counts? LTT had a video of a supercomputer in Canada built from such servers.
  • Kevin G - Monday, August 5, 2019 - link

    Intel only had four SMI buses on the Xeon E7 that would go to the memory buffer chips, but from there each memory buffer would fan out to two traditional DDR3 or DDR4 channels. So the result was effectively an 8-channel DDR3/DDR4 setup. A fully decked-out quad-socket server of that era with 128 GB DIMMs could support 12 TB of RAM. These are still desirable today as they don't incur the memory capacity tax that Intel has artificially placed on Xeon Scalable chips.
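
    For anyone checking the math on that, a quick back-of-the-envelope sketch (the three-DIMMs-per-channel figure is my assumption of the common configuration, not something from the article):

# Buffered Xeon E7 capacity, roughly as described above.
smi_links_per_socket = 4      # SMI buses per socket to the buffer chips
ddr_channels_per_buffer = 2   # each buffer fans out to two DDR3/DDR4 channels
dimms_per_channel = 3         # assumed: common 3-DPC config of that era
dimm_gb = 128
sockets = 4

dimms = sockets * smi_links_per_socket * ddr_channels_per_buffer * dimms_per_channel
print(dimms, "DIMMs,", dimms * dimm_gb / 1024, "TB")  # 96 DIMMs, 12.0 TB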
  • Kevin G - Monday, August 5, 2019 - link

    Being able to fan out to traditional DIMMs may be the higher-capacity option if each of those chips can support two DDR4 LR-DIMMs. If a board maker wanted to go for pure capacity, I'd expect the host POWER9+ to have something like 64 OMI memory channels; if these SMC 1000 chips are each able to operate across a single OMI link, that'd be 128 traditional DIMMs per socket and, at 256 GB per LR-DIMM, a 32 TB per-socket capacity. Sixteen sockets like that would permit a 0.5 PB capacity in a single logical system. Lots of what-ifs to get there and the physical layout would be a logistical nightmare, but IBM could just be aiming to be the first to such capacities regardless of whether they'd be realistically obtainable.*

    *Though for those who only care about memory capacity, have money growing on trees, and no regard for performance, a 1.5 PB system might be possible on the x86 side through a custom system via HPE. They bought SGI and their NUMAlink architecture for inclusion in future SuperDome systems. That scaled up to 256 sockets under the SGI banner, but the newer models under HPE are only listed up to 32. However, at 256 sockets with Xeon SP, 12 DIMM slots each, and only 512 GB Optane DIMMs (bye-bye performance), that'd get you a 1.5 PB capacity. Again, lots of what-ifs and speculation to make such a box happen.
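
    Putting those what-ifs into numbers (every channel and DIMM count below is speculation on my part, not anything IBM or HPE has announced):

# Speculative POWER9+/OMI build: 64 OMI channels per socket, one SMC 1000
# per channel, two 256 GB LR-DIMMs behind each buffer (all assumptions).
omi_channels = 64
dimms_per_buffer = 2
lrdimm_gb = 256
per_socket_tb = omi_channels * dimms_per_buffer * lrdimm_gb / 1024
print("per socket:", per_socket_tb, "TB")              # 32.0 TB
print("16 sockets:", per_socket_tb * 16 / 1024, "PB")  # 0.5 PB

# The x86 what-if: 256 NUMAlink-connected Xeon SP sockets, 12 DIMM slots
# each, every slot filled with a 512 GB Optane DIMM.
print("SuperDome what-if:", 256 * 12 * 512 / 1024**2, "PB")  # 1.5 PB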
  • rbanffy - Wednesday, August 21, 2019 - link

    IBM plays the long game. They've been playing the 360+ mainframe game since the '60s and profiting wonderfully from it. AFAIK, current z boxes (more like "fridges") already use something like this, and it would allow them to build machines where the memory doesn't need to be so tightly coupled to a CPU socket. Think processor and memory in separate drawers, allowing for logical partitioning into multiple smaller "machines" or a single humongous consolidated monster.
  • PeachNCream - Monday, August 5, 2019 - link

    A multiple lane serial link is a parallel link.
  • anonomouse - Monday, August 5, 2019 - link

    Not exactly - different requirements on inter-lane skew vs. a true parallel link like DDR, which has 64 data lines that must all clock together. The 8 serial lanes probably each have their own independent clocks, and as serial links there's not much worry about skew within each lane.
  • azfacea - Monday, August 5, 2019 - link

    72 with ECC
  • MojArch - Monday, August 5, 2019 - link

    Hi,
    I'm a newbie to this stuff. Can someone point me to what exactly they're trying to do?
    Is it like conventional RAM, or something else?
