17 Comments
mikegrok - Wednesday, April 3, 2024 - link
I worked with a weather analysis company and kept trying to talk them into setting up a computer with 1.6TB of RAM for their primary analysis instead of using a NAS. They had about 10 physical servers running 24/7, spending 97% of their time on I/O wait.
BvOvO - Wednesday, April 3, 2024 - link
That weather analysis company's name? Albert Einstein.
ballsystemlord - Wednesday, April 3, 2024 - link
If it helps at all, since Firefox switched to its new Quantum rendering engine, Firefox now happily uses over 1GB of RAM for a single webpage. Yes, I have a screenshot, but there's no way to post a picture here.
Likewise, GCC happily eats over 4GB *per source file* when compiling modern C++.
PeachNCream - Friday, April 5, 2024 - link
I usually don't observe Firefox consuming 1GB per tab. Spikes up to 250MB are typical, but it seems more like 100MB is the running average per thing open on Linux. On Win11 I'm seeing roughly the same consumption as well (just checked). Running an adblocker and NoScript on both OSes, so fairly lean on addons. With that said, the browser upon launch allocates close to 1GB to itself including a single page (common for Edge also - I cannot test Chrome as it's not installed, and as a Google data collection platform it may not be a near-identical thing to compare against anyway), but I'm not sure it's fair to say it's 1GB per page or per tab. I'm not seeing that play out in day-to-day usage.
ballsystemlord - Saturday, April 6, 2024 - link
Sorry, I should have been clearer. This occurs only on some webpages, not every webpage.
BigT383 - Wednesday, April 3, 2024 - link
I wonder how this appears to developers. Is it managed by the OS (in which case you'd want it as a NUMA node)? Or is it managed by a driver and some sort of API where you have to specifically talk to the device?
ballsystemlord - Thursday, April 4, 2024 - link
Upvote!
The Von Matrices - Friday, April 5, 2024 - link
CXL memory expanders are designed to appear as NUMA node(s) - just ones with no CPU cores. So it should require no changes to NUMA-aware software, assuming the OS supports CXL, which was introduced in the latest versions of popular OSes.
back2future - Thursday, April 4, 2024 - link
[ PCIe 5.0-era mainboard processors top out at roughly ~30 - ~64GB/s(?) of memory bandwidth per SDRAM DIMM socket (DDR4-DDR5, no OC), within a ~3-7" distance from the main CPU.
A PCIe adapter can request (system bus) cycles for DMA from the memory controller (the arbiter for shared memory) within the main CPU; depending on settings there are preferred devices (CPU #, peripherals) and (possibly restricted) access to memory regions. ]
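That ~30-64GB/s per-DIMM range is easy to sanity-check with peak-rate arithmetic; a small sketch (the speed grades below are illustrative JEDEC examples, not figures from the article):

```python
# Back-of-envelope check of the ~30-64GB/s per-DIMM range quoted above.
# Peak DIMM bandwidth = transfer rate (MT/s) x 8-byte (64-bit) data bus.
def dimm_bandwidth_gbs(megatransfers_per_s: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth of a single DIMM in GB/s."""
    return megatransfers_per_s * 1e6 * bus_bytes / 1e9

# Illustrative speed grades (not tied to any specific platform):
print(dimm_bandwidth_gbs(3200))  # DDR4-3200 -> 25.6 GB/s
print(dimm_bandwidth_gbs(4800))  # DDR5-4800 -> 38.4 GB/s
print(dimm_bandwidth_gbs(8000))  # DDR5-8000 -> 64.0 GB/s
```

So the low end of the quoted range corresponds to mainstream DDR4/DDR5 grades and the high end to the fastest DDR5 DIMMs.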
Dolda2000 - Thursday, April 4, 2024 - link
596 ns is the first concrete latency figure I've seen for CXL devices, so that is very interesting, and also higher than I was expecting. It's not quite, but not far from, an order of magnitude slower than directly attached DRAM, and roughly 2000 clock cycles for server CPUs.
Is that really usable? Do CPUs really have enough reordering capacity to work around such massive latencies? Surely these aren't supposed to be used over some sort of asynchronous DMA-based transfer scheme (what would be the point of CXL then)?
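The arithmetic behind those two estimates, as a quick sketch (the clock speed and local-DRAM latency here are assumptions, not figures from the article):

```python
# Rough check of the "~2000 clock cycles" figure. The clock speed is an
# assumption; server parts commonly run somewhere around 2-3.5 GHz.
cxl_latency_ns = 596     # figure reported in the article
dram_latency_ns = 80     # typical local-DRAM load latency, rough value
clock_ghz = 3.4          # assumed core clock

cycles = cxl_latency_ns * clock_ghz          # ns * (cycles per ns)
slowdown = cxl_latency_ns / dram_latency_ns

print(round(cycles))       # ~2026 cycles
print(round(slowdown, 1))  # ~7.5x - close to, but under, an order of magnitude
```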
back2future - Friday, April 5, 2024 - link
[ If a memory access request is for blocks larger than ~32kB, then the data transfer time (maybe independent of CPU cycles; additional latency because of checking cache coherency?) is already higher than the initial request latency?
Another number seen mentioned is a typical addition of ~200ns for the CXL memory controller ]
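A rough breakeven sketch for that ~32kB intuition (the link bandwidth is an assumed figure, not one from the article; the 596 ns latency is):

```python
# Where does transfer time overtake the initial access latency?
# Bandwidth is an assumed ~64 GB/s, roughly a PCIe 5.0 x16 link.
bandwidth_bytes_per_s = 64e9
latency_s = 596e-9  # CXL access latency reported in the article

def transfer_time_ns(block_bytes: int) -> float:
    """Time to move a block at the assumed link bandwidth, in ns."""
    return block_bytes / bandwidth_bytes_per_s * 1e9

# Block size whose transfer time equals the access latency:
breakeven_bytes = latency_s * bandwidth_bytes_per_s
print(round(breakeven_bytes / 1024))       # ~37 KiB
print(round(transfer_time_ns(32 * 1024)))  # ~512 ns for a 32 KiB block
```

At these assumed numbers the crossover indeed lands in the low tens of kilobytes, consistent with the ~32kB figure above.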
The Von Matrices - Friday, April 5, 2024 - link
There is an insightful article on SemiAnalysis regarding this concern. The main part of the article is behind a paywall, but the free section gets their general arguments across well.
https://www.semianalysis.com/p/cxl-is-dead-in-the-...
Elstar - Sunday, April 7, 2024 - link
I think when your dataset is that big, you’ll take any improvement in latency/throughput you can get, so CXL memory is a huge improvement over, say, a highly parallel NVMe flash array - or over not tackling a problem at all because current hardware is too slow.
Dolda2000 - Saturday, April 13, 2024 - link
Perhaps I'm mistaken, but my impression is that the main (though not the only) purpose of CXL is configurable infrastructure rather than capacity.
thomasjkenney - Thursday, April 4, 2024 - link
In a way, this is a kind of reversion to the dawn of electric computing. Each device or module had a separate rack, or even room, and all were cabled together with spaghetti. I remember my mentor teaching me how to wire-wrap the back of a Unibus backplane... Jiminy Christmas!
back2future - Friday, April 5, 2024 - link
[ Read that CXL (version? 3.x theoretically saturates a PCIe 6.0 x16 connection) is limited to ~4"(?) of reach; through the merger with Gen-Z there's Ethernet support (CXL version?, multiple ~100Gbps links, reduced latency compared to RDMA?, optimized ~400ns) up to a few tens of meters(?) ]
mode_13h - Monday, April 15, 2024 - link
It almost feels demeaning how they put "memory module box" on the front, as if anyone who has any business touching it might not understand a more conventional name!