meacupla - Monday, May 9, 2022 - link
Has WD CMR technology been substantially improved or something? How does the CMR drive manage 2.2TB/platter when SMR only does 2.6TB/platter? If CMR can be packed this dense, there's almost no need for SMR.
DrReD - Monday, May 9, 2022 - link
SMR has about a 20% advantage over CMR, and it's just because it leaves less space between tracks. So 2.6TB/platter SMR vs. 2.2TB/platter CMR (2.2 × 1.2 ≈ 2.64) is exactly what you'd expect. :)
WaltC - Monday, May 9, 2022 - link
Yes, the whole story of hard drive technology has been one of packing more data onto a platter.
Arbie - Monday, May 9, 2022 - link
Well that's just amazing to learn. Thanks!
name99 - Wednesday, May 11, 2022 - link
SMR can go higher; it's just a question of how the timing works out.
If you think about it, what determines how many bits you can pack is the granularity of the underlying magnetic medium, the minimum grain sizes that can magnetize in one direction or another. Now these grains:
- are random in size/shape/orientation
- are spread over the platter in a 2D fashion, not along 1D tracks.
This means that, as you start approaching serious density, you need extremely aggressive signal processing (so that wherever the grains don't line up correctly and bits are not stored, the missing bits can be reconstructed from correction bits stored elsewhere).
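As a very simplified illustration of that last point, here is a minimal sketch of recovering data the medium failed to record from redundancy stored elsewhere. It uses a single XOR parity block and made-up byte strings; real read channels use far stronger codes (e.g. LDPC) operating on the analog read-back signal, so treat this purely as an illustration of the principle, not how drive firmware works.

```python
# Toy erasure recovery: one XOR parity block protects a group of data blocks.
# This only illustrates "missing bits rebuilt from correction bits stored
# elsewhere"; real drives use much stronger codes than plain parity.

def xor_parity(blocks):
    """Parity block = bytewise XOR of all data blocks (equal lengths assumed)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover_missing(blocks, parity):
    """Rebuild the single missing block (marked None) from the rest + parity."""
    rebuilt = bytearray(parity)
    for block in blocks:
        if block is not None:
            for i, b in enumerate(block):
                rebuilt[i] ^= b
    return bytes(rebuilt)

data = [b"SECTOR_A", b"SECTOR_B", b"SECTOR_C"]   # hypothetical 8-byte sectors
parity = xor_parity(data)
damaged = [data[0], None, data[2]]               # middle sector unreadable
assert recover_missing(damaged, parity) == data[1]
print("recovered:", recover_missing(damaged, parity))
```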
But we have hit the limit of signal processing. So where do we go?
(a) we can go to patterned media, where you replace the random grains of current platters with "magnetic dots" laid down in a perfect pattern on a platter. Obviously that sounds great (and has great potential). Just as obviously, that does not sound cheap... So everyone's delaying it for as long as possible.
(b) the most obvious signal processing described above is one-dimensional -- you take the analog signal read by the head as it moves along a track and extract info from it. But in principle you could engage in 2D signal processing, using not just the signal along a linear track, but the full 2D signal over some fairly wide range (width of at least a few grains).
SMR is a very limited step in this direction, kinda 1.5D signal processing, using the signal along a track with some information about what's happening on the sides, but not yet fully treating the problem as one of 2D signal processing.
Going full 2D will buy a nice additional boost in capacity beyond SMR -- but probably at the cost of even slower writes...
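For intuition about why cross-track information helps, here is a toy read-back model. Everything in it (the coupling factor, noise levels, and the simple threshold detector) is an invented assumption, not how real read channels are built: each bit's signal is polluted by the adjacent track, and a detector that also reads the neighboring track and cancels its estimated contribution makes noticeably fewer errors than one that only sees its own track.

```python
# Toy comparison: 1D detection (own-track signal only) versus a detector that
# also subtracts estimated interference from the adjacent track, a crude
# stand-in for "1.5D/2D" signal processing.
import numpy as np

rng = np.random.default_rng(0)
n = 20000
track = rng.choice([-1.0, 1.0], size=n)       # data track, bits as +/-1
neighbor = rng.choice([-1.0, 1.0], size=n)    # adjacent (interfering) track

# Read-back signal: own bit + cross-track interference + media/electronics noise.
itc = 0.6                                     # inter-track coupling (assumed)
readback = track + itc * neighbor + rng.normal(0.0, 0.5, size=n)

# 1D detector: threshold the raw signal.
errors_1d = np.mean(np.sign(readback) != track)

# "2D-aware" detector: also read the neighbor track and cancel its estimated
# contribution before thresholding.
neighbor_readback = neighbor + rng.normal(0.0, 0.5, size=n)
cancelled = readback - itc * np.sign(neighbor_readback)
errors_2d = np.mean(np.sign(cancelled) != track)

print(f"1D detector bit error rate:       {errors_1d:.2%}")
print(f"2D-aware detector bit error rate: {errors_2d:.2%}")
# With these toy numbers the 2D-aware detector makes far fewer errors, which
# is the intuition behind squeezing tracks closer together (SMR/TDMR).
```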
So it's essentially more economics than tech at this point:
- do we go down the expensive (but high-performing) patterned media path OR
- do we go down the 2D signal processing path for a few years, and hope that integrated flash can hide most of the performance loss?
My (mostly uninformed!) guess is that energy-assisted writes will help with a more 2D-style solution, and that this, combined with flash, will be the more compelling path over the next ten years, until it absolutely can't be pushed any further.
If this stuff interests you, about the most approachable paper on 2D recording is:
The Feasibility of Magnetic Recording at 10 Terabits Per Square Inch on Conventional Media
https://sci-hub.hkvisa.net/10.1109/tmag.2008.20106...
(Conventional media means exactly the random granular media today, as opposed to patterned media.)
Kamen Rider Blade - Wednesday, May 11, 2022 - link
The sooner we get to Patterned Media with Hexagonally arranged magnetic grains, the better we'll be.
Samus - Tuesday, August 30, 2022 - link
I wanted to mention the same thing. WD said a while back that they could get substantial gains in capacity with SMR, but they have to draw the line somewhere to balance performance, reliability, and capacity. This is why WD and Seagate ship SMR drives of varying capacities using the same platter count (and, for that matter, the same Showa platters): Seagate appears more aggressive than WD at packing data, possibly due to superior error correction or something. But in my experience Seagate SMR drives are slower than the equivalent WD SMR drives.
The Von Matrices - Monday, May 9, 2022 - link
More platters = same max transfer rate but even more time to write the entire disk. It's ridiculous rebuilding a RAID with modern high-capacity magnetic disks.
f00f - Monday, May 9, 2022 - link
This is why more serious array vendors implement non-traditional RAID protection schemes, i.e. distributed RAID where all drives participate in the rebuild process for both reads and writes (spare capacity is distributed).
The Von Matrices - Monday, May 9, 2022 - link
How would that make rebuilding onto an empty drive any faster, when it involves rewriting the entire disk? Unless you're saying that it excludes sectors that are marked as empty in the filesystem.
f00f - Monday, May 9, 2022 - link
In such schemes there is no empty spare drive. Instead there is distributed spare capacity reserved across all drives. This concept has been used for a decade or so.
https://www.youtube.com/watch?v=nPAwq9uFTGg
Doug_S - Monday, May 9, 2022 - link
MUCH longer than a decade. HP's AutoRAID product did this nearly 30 years ago.
phoenix_rizzen - Monday, May 9, 2022 - link
"Normal" RAID setup has X data disks and Y spare disks. The spare disks just sit there, empty, waiting to be used. When a data disk dies, the RAID array starts copying data to the spare disk. This is limited by the write IOps/throughput of the single spare disk. When you physically replace the dead drive and mark the new one as a spare, then it sits there idle until it's needed."Distributed spare" RAID setups (like draid on ZFS) has X data disks and Y spare disks. But, the "spare disk" isn't sitting there empty, it's actually used in the RAID array, and the "spare space" from that drive is spread across all the other drives in the array. When a data disk dies, the array starts writing data into the "spare disk" space. As this space is distributed across X data disks ... it's not limited in write IOps/throughput (it has the same IOps/throughput as the whole array). Once you physically replace the dead drive and mark the new one as the spare, then a background process runs to distribute data from the "spare space" onto the new drive (or something like that) to bring the array back up to optimal performance/redundancy/spares/
Think of it as the difference between RAID3 (separate parity disk) and RAID5 (distributed parity data). It's the different between a separate spare disk and distributed spare disk space across the array.
Read up on the draid vdev type for ZFS for more information on how one implementation for it works.
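To put rough numbers on the difference, here is a back-of-the-envelope sketch. All figures in it are assumptions for illustration (a 22 TB drive, ~150 MB/s sustained average per disk, a 24-disk group, rebuild throttled to half bandwidth), not measurements of any particular array: a classic hot-spare rebuild funnels everything through one disk's write speed, while a distributed-spare rebuild spreads the work across the surviving disks.

```python
# Back-of-the-envelope rebuild-time comparison. All numbers are illustrative
# assumptions; real arrays and workloads will differ.

TB = 1e12

def hours(seconds):
    return seconds / 3600.0

drive_capacity_bytes = 22 * TB
avg_throughput_bps = 150e6      # assumed sustained average per disk
n_disks = 24                    # disks in the (d)RAID group
rebuild_share = 0.5             # assume rebuild gets half of each disk's bandwidth

# Classic hot spare: the failed disk's worth of data funnels into ONE drive.
classic_seconds = drive_capacity_bytes / avg_throughput_bps
print(f"hot-spare rebuild:         ~{hours(classic_seconds):.0f} h "
      f"(~{hours(classic_seconds) / 24:.1f} days)")

# Distributed spare: the same data is rewritten into spare space spread over
# the surviving disks, so writes proceed on roughly (n-1) disks in parallel,
# throttled so user I/O still gets serviced.
distributed_seconds = drive_capacity_bytes / (
    (n_disks - 1) * avg_throughput_bps * rebuild_share)
print(f"distributed-spare rebuild: ~{hours(distributed_seconds):.1f} h "
      f"until the array is fully redundant again")

# Note: rebalancing data back onto the physical replacement drive still takes
# roughly the single-disk time; the win is how quickly the array can again
# tolerate another failure.
```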
hescominsoon - Monday, May 9, 2022 - link
I do not use raidz 1,2,3. I use mirrored videos. When a disk fails you can copy the data from the mirror drive... which is much faster and less stressful than pounding the entire array of disks.
hescominsoon - Monday, May 9, 2022 - link
Vdevs, not videos... sorry, autocorrect got me.
The Von Matrices - Tuesday, May 10, 2022 - link
Thanks for the explanation. So it doesn't solve the problem of slow rebuilds; it just enhances degraded-array performance by making active use of disks that would otherwise be blank hot spares. It will still take ~1.5 days to write an entire 22TB replacement disk and reintroduce it to the array.
tomatotree - Tuesday, May 10, 2022 - link
Depends on what you consider to be a problem about slow rebuilds. In terms of raw time to get the array back to the state it was in before a disk died, there's no speedup, and in fact it's probably a bit slower. But in terms of the time until you get back to full parity protection, where you can afford to have another disk die without data loss, the distributed rebuild scheme is dramatically faster.
Drkrieger01 - Monday, May 9, 2022 - link
I built a CEPH cluster for an employer once... it's quite something to use. CEPH is one of these non-traditional RAID schemes: basically block storage distributed across multiple hosts. However, it needs either many, many spindles or a few high-performance enterprise SSDs. It performs sync transfers, which pretty much bypass all the cache on the drives, so you get the slowest possible speeds, really. More spindles matter in this case!
mode_13h - Tuesday, May 10, 2022 - link
Thanks for that post. I've long been intrigued by it. How long ago was this? I'm curious why you chose CEPH rather than other options, like GlusterFS. It seems a big disadvantage of CEPH is the amount of memory it requires (1 GB per TB of storage, IIRC).
mode_13h - Tuesday, May 10, 2022 - link
They ought to make an "Advanced SMR" or "ASMR" hard drive that makes soothing, whisper-like noises during operation.
kn00tcn - Thursday, May 12, 2022 - link
it's whisper quiet! *machine loudly grinds oranges*
Samus - Tuesday, May 10, 2022 - link
This feels like a new Gillette razor announcement.