Jammrock - Monday, August 5, 2013 - link
Great write-up, Johan.
The Fusion-IO ioDrive Octal was designed for the NSA. These babies are probably why they could spy on the entire Internet without ever running low on storage IO. Unsurprisingly, that bit about the Octal being designed for the US government is no longer on their site :)
Seemone - Monday, August 5, 2013 - link
I find the lack of ZFS disturbing.
Guspaz - Monday, August 5, 2013 - link
Yeah, you could probably get pretty far throwing a bunch of drives into a well-configured ZFS box (striped raidz2/3? Mirrored stripes? Balance performance versus redundancy and take your pick) and throwing some enterprise SSDs in front of the array as SLOG and/or L2ARC drives.
In fact, if you don't want to completely DIY, as many enterprises don't, there are companies selling enterprise solutions doing exactly this. Nexenta, for example (who also happen to be one of the lead developers behind modern open-source ZFS), sell enterprise software solutions for this. There are other companies that sell hardware solutions based on this and other software.
blak0137 - Monday, August 5, 2013 - link
Another option for this would be to go directly to Oracle with their ZFS Storage Appliances. This gives companies the very valuable benefit of having hardware and software support from the same entity. They also tend to undercut the entrenched storage vendors on price.
davegraham - Tuesday, August 6, 2013 - link
*cough* it may undercut them on the front end, but maintenance is a typical Oracle "grab you by the chestnuts" type of thing.
Frallan - Wednesday, August 7, 2013 - link
More like "grab you by the chestnuts - pull until they rip loose and shove em up where they don't belong" - type of thing...
davegraham - Wednesday, August 7, 2013 - link
I was being nice. ;)
equals42 - Saturday, August 17, 2013 - link
And perhaps lock you into Larry's platform so he can extract his tribute for Oracle software? I think I've paid for a week of vacation on Ellison's Hawaiian island.
Everybody gets their money to appease shareholders somehow, whether through maintenance, software, hardware, or whatever.
Brutalizer - Monday, August 5, 2013 - link
Disks have grown bigger, but not faster. They are also no safer nor more resilient to data corruption. Large amounts of data will have data corruption: the more data, the more corruption. NetApp has some studies on this. You need new solutions that are designed from the ground up to combat data corruption. Research papers show that NTFS, ext, etc. and hardware RAID are vulnerable to data corruption, and that ZFS does protect against it. You can find all of the papers in the Wikipedia article on ZFS, including the ones from NetApp.
Guspaz - Monday, August 5, 2013 - link
It's worth pointing out, though, that enterprise use of ZFS should always use ECC RAM and disk controllers that properly report when data has actually been written to the disk. For home use, neither is really required.
jhh - Wednesday, August 7, 2013 - link
And Advanced/SDDC/Chipkill ECC, not the old-fashioned single-bit correct/multiple bit detect. The RAM on the disk controller might be small enough for this not to matter, but not on the system RAM.
tuxRoller - Monday, August 5, 2013 - link
Amplidata's DSS seems like a better, more forward-looking alternative.
Sertis - Monday, August 5, 2013 - link
The AmpliStor design seems a bit better than ZFS. ZFS has a hash to detect bit rot within blocks, while this stores FEC coding that can potentially recover the data within that block without recalculating it from parity on the other drives in the stripe and the I/O that involves. It also seems to be a bit smarter about how it distributes data, by allowing you to cross between storage devices to provide recovery at the node level, while ZFS is really just limited to the current pool. It has various out-of-band data rebalancing features that aren't really present in ZFS. For example, add a second vdev to a zpool when it's 90% full and there really isn't a process to automatically rebalance the data across the two vdevs as you add more data. The original data stays on that first vdev, and new data basically sits in the second vdev. It seems very interesting, but I certainly can't afford it, so I'll stick with raidz2 for my puny little server until something open source comes out with a similar feature set.
Seemone - Tuesday, August 6, 2013 - link
Are you aware that with ZFS you can specify the number of replicas each data block should have on a per-filesystem basis? ZFS is indeed not very flexible on pool layout and does not rebalance things (as of now), but there's nothing in the on-disk data structure that prevents this. This means it can be implemented and would be applicable to old pools in a non-disruptive way. ZFS is also open source; its license is simply not compatible with the GPLv2, hence ZFS on Linux's separate distribution.
Brutalizer - Tuesday, August 6, 2013 - link
If you want to rebalance ZFS, you just copy the data back and forth and rebalancing is done. Assume you have data on some ZFS disks in a ZFS raid, and then you add new empty disks, so all the data will sit on the old disks. To spread the data evenly across all disks, you need to rebalance it. Two ways:
1) Move all the data to another server, and then move it back to your ZFS raid. Now all the data is rebalanced. This requires another server, which is a pain. Instead, do this:
2) Create a new ZFS filesystem on your raid. This filesystem is spread out on all disks. Move the data to the new ZFS filesystem. Done.
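For anyone who wants to script option 2, here is a minimal sketch of that copy-based rebalance. The pool/dataset names and the use of rsync are illustrative assumptions, not anything prescribed above; a snapshot-based zfs send/receive would work just as well.

```python
import subprocess

def run(*cmd):
    """Run a command, echo it, and fail loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1) A freshly created dataset stripes newly written blocks across every
#    vdev in the pool, including the one that was just added.
run("zfs", "create", "tank/data-new")

# 2) The copy itself is what spreads the data out.
run("rsync", "-aHAX", "/tank/data/", "/tank/data-new/")

# 3) Swap the datasets once the copy has been verified.
run("zfs", "rename", "tank/data", "tank/data-old")
run("zfs", "rename", "tank/data-new", "tank/data")
run("zfs", "destroy", "-r", "tank/data-old")
```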
Sertis - Thursday, August 8, 2013 - link
I'm definitely looking forward to these improvements, if they eventually arrive. I'm aware of the multiple-copy solution, but if you read the Intel and AmpliStor whitepapers, you will see they have very good arguments that their model of spreading FEC blocks across nodes works better than creating additional copies. I have used ZFS for years, and while you can work around the issues, it's very clear that it's no longer evolving at the same rate since Oracle took over Sun. Products like this keep things interesting.
Brutalizer - Tuesday, August 6, 2013 - link
Theory is one thing, real life another. There are many bold claims and wonderful theoretical constructs from companies, but do they hold up to scrutiny? Researchers injected artificially constructed errors into different filesystems (NTFS, ext3, etc.), and only ZFS detected all of them. Researchers have verified that ZFS does combat data corruption. Is there any research on AmpliStor's ability to combat data corruption, or do they only have bold claims? Until I see research from an independent third party, I will continue with the free, open-source ZFS. CERN is now switching to ZFS for tier-1 and tier-2 long-term storage, because vast amounts of data _will_ have data corruption, CERN says. Here are research papers on data corruption covering NTFS, hardware RAID, ZFS, NetApp, CERN, etc.:
http://en.wikipedia.org/wiki/ZFS#Data_integrity
Tegile, Coraid, GreenBytes, etc., for instance, are all storage vendors that offer petabyte enterprise servers using ZFS.
JohanAnandtech - Tuesday, August 6, 2013 - link
Thanks, very helpful feedback. I will check the paper out.
mikato - Thursday, August 8, 2013 - link
And Isilon OneFS? Care to review one? :)
bitpushr - Friday, August 9, 2013 - link
That's because ZFS has had a minimal impact on the professional storage market.
Brutalizer - Sunday, August 11, 2013 - link
bitpushr, "That's because ZFS has had a minimal impact on the professional storage market."
That is ignorant. If you had followed the professional storage market, you would have known that ZFS is the most widely deployed storage system in the enterprise. ZFS systems manage 3-5x more data than NetApp, and more data than NetApp and EMC Isilon combined. ZFS is the future and is eating the others' cake:
http://blog.nexenta.com/blog/bid/257212/Evan-s-pre...
blak0137 - Monday, August 5, 2013 - link
The Amplidata Bitspread data protection scheme sounds a lot like the OneFS filesystem on Isilon.
A note on the NetApp section: the NVRAM does not store the hottest blocks; rather, it is only used for coalescing writes so that entire RAID-group-wide stripes can be destaged to disk at once. This use of NVRAM, along with the write characteristics of the WAFL filesystem, allows RAID-DP (NetApp's slightly customized version of RAID-6) to have write performance similar to RAID-10 with a much smaller usable-space penalty, up to approximately 85-90% space utilization. Read cache is always held in RAM on the controller, and the FlashCache (formerly PAM) cards supplement that RAM-based cache. A thing to remember about the size of the FlashCache cards is that the space still benefits from the data efficiency features of Data ONTAP, such as deduplication and compression, and as such applications such as VDI get a massive boost in performance.
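As a rough back-of-the-envelope illustration of why stripe-wide destaging matters (my own simplified model and made-up group geometry, not NetApp's figures), compare the disk I/Os consumed per small host write when parity is updated read-modify-write style versus when buffered writes are destaged as full stripes:

```python
def ios_per_host_write(data_disks, parity_disks, full_stripe):
    """Disk I/Os per small host write for a parity RAID group.

    full_stripe=False: classic read-modify-write -- read the old data chunk
    and each old parity, then write the new data chunk and each new parity.
    full_stripe=True:  writes are buffered (e.g. in NVRAM-protected memory)
    until a whole stripe can be written, so the parity cost is amortized
    over every data chunk in the stripe and nothing has to be read back.
    """
    if full_stripe:
        return (data_disks + parity_disks) / data_disks
    return (1 + parity_disks) * 2

raid10 = 2.0  # a mirrored write always costs two disk writes
print(f"RAID-10 mirror:               {raid10:.2f} I/Os per host write")
print(f"14+2 RAID-6, RMW updates:     {ios_per_host_write(14, 2, False):.2f} I/Os per host write")
print(f"14+2 RAID-6, full-stripe CPs: {ios_per_host_write(14, 2, True):.2f} I/Os per host write")
```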
enealDC - Monday, August 5, 2013 - link
I think you also need to discuss the effect of OSS or very low-cost solutions that can be built on white-box hardware. Those cause far greater disruptions than anything else I can think of!
SCST and COMSTAR, to name a few.
Ammohunt - Monday, August 5, 2013 - link
One thing I didn't see mentioned is that in the good old days you spread the I/O out across many spindles, which was a huge advantage of SCSI, which was geared towards such a configuration. As drive sizes have increased, the number of spindles has shrunk, adding more latency. The fact is that expensive SSD-type storage systems are not needed in most medium-sized businesses. Their data needs can in most cases be served spectacularly well by a well-architected tiered storage model.
mryom - Monday, August 5, 2013 - link
There's something missing - take a look at PernixData - that's disruptive, and vSphere 5.5 is also going to be a game changer. Software Defined Storage is the way forward - we just need space for more disks in blade servers.
davegraham - Tuesday, August 6, 2013 - link
SDS is an EMC-marchitecture discussion (a la ViPR). I'd suggest that you avoid conflating what a marketing talking head discusses with what the technology can actually do. :)
Kevin G - Monday, August 5, 2013 - link
My understanding is that the value of enterprise storage isn't necessarily the hardware but rather the software interface and support that come with it. NetApp, for example, will dial home and order replacements for failed hard drives for you. Various interfaces I've used allow for the logical creation of multiple arrays across multiple controllers, each using a different RAID type. I can think of no sane reason why someone would want to do that, but the option is there and supported for the crazies.
As far as performance goes, NVMe and SATA Express are clearly the future. I'm surprised that we haven't seen any servers with hot-swap mini-PCIe slots. With two lanes going to each slot, a single-socket Sandy Bridge-E chip could support twenty of those small-form-factor cards in the front of a 1U server. At 500 GB apiece, that is 10 TB of preformatted storage, not far off the 16 TB preformatted possible today using hard drives. It will of course cost more than disk, but the speeds are ludicrous.
Going with standard PCIe form factors for storage only makes sense if there are tons of channels connected to the controller and the controller is PCIe native. So far the majority of offerings stick a hardware RAID chip with several SATA SSD controllers onto a PCIe card and call it a day.
Also, for the enterprise market, it would be nice for a PCIe SSD to have an out-of-band management port that communicates via Ethernet and can function fully if the switch on the other end supports Power over Ethernet. The entire host could be fried, but the data could still potentially be recovered. It also works great for hardware configuration, like on some Areca cards.
youshotwhointhewhatnow - Monday, August 5, 2013 - link
The first link on "Cloudfounders: No More RAID" appears to be broken (http://www.amplidata.com/pdf/The-RAID Catastrophe.pdf).
I read through the second link on that page (the Intel paper). I wouldn't consider that paper unbiased, considering Intel is clearly trying to use it to sell more Xeon chips. Regardless, I don't think your statement "mathematically proven that the Reed-Solomon based erasure codes of RAID 6 are a dead end road for large storage systems" is justified. Sure, RAID6 will eventually give way to RAID7 (or RAIDZ3 in ZFS terms), but that still uses Reed-Solomon codes. The Intel paper just shows that RAID6+1 has much worse efficiency with slightly worse durability compared to Bitspread. The same could be said for RAID7 (instead of Bitspread), which really should have been part of the comparison.
Another strange statement in the Intel paper is "Traditional erasure coding schemes implemented by competitive storage solutions have limited device-level BER protection (e.g., 4 four bit errors per device)". Umm, with non-degraded RAID6 you could have as many UREs as you like, provided fewer than three occur on the same stripe (or fewer than two for a degraded array). Again, RAID7 allows even more UREs in the same stripe.
This is not to say that the Bitspread technique isn't interesting, but you seem to be a little too quick to drink the kool-aid.
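To put the URE argument into numbers, here is a small sketch of the usual back-of-the-envelope model; the group geometry, chunk size, and bit-error rate are my own assumptions, not figures from the Intel paper. It estimates how many stripes are expected to be unrecoverable during a rebuild when the remaining parity is one (degraded RAID6) versus two (degraded triple parity):

```python
from math import comb, expm1, log1p

def p_chunk_unreadable(chunk_bytes, ber):
    """Chance that reading one chunk hits at least one unrecoverable bit error."""
    return -expm1(chunk_bytes * 8 * log1p(-ber))

def p_stripe_lost(surviving_chunks, remaining_parity, chunk_bytes, ber):
    """Chance one stripe has more unreadable chunks than the remaining parity covers."""
    p = p_chunk_unreadable(chunk_bytes, ber)
    n, t = surviving_chunks, remaining_parity
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1, n + 1))

# Hypothetical 16-drive group of 4 TB drives, 256 KiB chunks, 1e-15 BER,
# evaluated mid-rebuild (one drive already failed, 15 survivors).
CHUNK = 256 * 1024
STRIPES = 4_000_000_000_000 // CHUNK   # stripes read per surviving drive
for total_parity in (2, 3):            # RAID-6-style vs. triple parity
    p = p_stripe_lost(15, total_parity - 1, CHUNK, ber=1e-15)
    print(f"{total_parity}-parity group: ~{p * STRIPES:.1e} expected unrecoverable stripes per rebuild")
```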
name99 - Tuesday, August 6, 2013 - link
I imagine the reason people are quick to drink the koolaid is that convolutional FEC codes have proven how well they work through a great deal of wireless experience. Loss of some Amplidata data is no different from puncturing, and puncturing just works --- we experience it every time we use WiFi or cell data.
I also wouldn't read too much into Intel's support here. Obviously running a Viterbi algorithm to cope with a punctured convolutional code is more work than traditional parity-type recovery --- a LOT more work. And obviously, the first round of software you write to prove to yourself that this all works, you're going to write for a standard CPU. Intel is the obvious choice, and they're going to make a big deal about how they were the obvious choice.
BUT the obvious next step is to go to Qualcomm or Broadcom and ask them to sell you a Viterbi cell, which you put on a SOC along with an ARM front-end, and hey presto --- you have a $20 chip you can stick in your box that's doing all the hard work of that $1500 Xeon.
The point is, convolutional FEC is operating on a totally different dimension from block parity --- it is just so much more sophisticated, flexible, and powerful. The obvious thing that is being trumpeted here is destruction of one or more blocks in the storage device, but that's not the end of the story. FEC can also handle point bit errors. Recall that a traditional drive (HD or SSD) has its own FEC protecting each block, but if enough point errors occur in the block, that FEC is overwhelmed and the device reports a read error. NOW there is an alternative --- the device can report the raw bad data up to a higher level which can combine it with data from other devices to run the second layer of FEC --- something like a form of Chase combining.
Convolutional codes are a good start for this, of course, but the state of the art in WiFi and telco is LDPCs, and so the actual logical next step is to create the next device based not on a dedicated convolutional SOC but on a dedicated LDPC SOC. Depending on how big a company grows, and how much clout they eventually have with SSD or HD vendors, there's scope for a whole lot more here --- things like using weaker FEC at the device level precisely because you have a higher level of FEC distributed over multiple devices --- and this may allow you a 10% or more boost in capacity.
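For readers who have never seen puncturing in action, here is a toy sketch of the idea: the textbook rate-1/2, constraint-length-3 convolutional code, with every third coded bit discarded (an erasure standing in for a lost block) and a hard-decision Viterbi decoder that simply skips the missing positions. It is purely illustrative -- it has nothing to do with Amplidata's actual codec, and a real system would use far stronger codes (or LDPCs, as the comment above notes).

```python
G = (0b111, 0b101)      # generator polynomials of the classic (7,5) rate-1/2 code
N_STATES = 4            # 2^(K-1) encoder states for constraint length K=3

def parity(x):
    return bin(x).count("1") & 1

def encode(bits):
    """Return the two coded bits produced for each input bit."""
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state
        out.extend(parity(reg & g) for g in G)
        state = ((b << 1) | (state >> 1)) & 0b11
    return out

def viterbi(received):
    """Hard-decision Viterbi decode; None marks a punctured (erased) coded bit."""
    INF = float("inf")
    metrics = [0] + [INF] * (N_STATES - 1)    # start in the all-zero state
    paths = [[] for _ in range(N_STATES)]
    for i in range(0, len(received), 2):
        pair = received[i:i + 2]
        new_metrics = [INF] * N_STATES
        new_paths = [None] * N_STATES
        for s in range(N_STATES):
            if metrics[s] == INF:
                continue
            for b in (0, 1):
                reg = (b << 2) | s
                expected = [parity(reg & g) for g in G]
                # punctured positions contribute nothing to the branch metric
                dist = sum(1 for e, r in zip(expected, pair)
                           if r is not None and e != r)
                ns = ((b << 1) | (s >> 1)) & 0b11
                if metrics[s] + dist < new_metrics[ns]:
                    new_metrics[ns] = metrics[s] + dist
                    new_paths[ns] = paths[s] + [b]
        metrics, paths = new_metrics, new_paths
    best = min(range(N_STATES), key=lambda s: metrics[s])
    return paths[best]

message = [1, 0, 1, 1, 0, 0, 1, 0] + [0, 0]      # two tail bits flush the encoder
coded = encode(message)
punctured = [None if i % 3 == 2 else bit for i, bit in enumerate(coded)]
print("recovered despite the missing bits:", viterbi(punctured) == message)
```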
meorah - Monday, August 5, 2013 - link
You forgot another implication of scale-out software design: namely, the ability to bypass flash completely and store your most performance-intensive workloads that use your most expensive software licensing directly in memory. 16 gigs to run the host, the other 368 gigs as a nice RAM drive.
pjkenned - Tuesday, August 6, 2013 - link
Hi Johan, one thing to be clear about is that the dollars you are quoting in this article are off by a huge margin. Enterprise storage is one of the most highly discounted areas of technology. Happy to chat more on this subject. Patrick @STH
name99 - Tuesday, August 6, 2013 - link
Ahh, the never-ending whine of the enterprise salesman, who desperately wants to have it both ways --- to be able to charge a fortune and simultaneously to claim that he's not charging a fortune. Good luck with that --- there is, after all, a sucker born every minute.
But let's get one thing straight here. If your organization refuses to publish the actual prices at which it sells, then you STFU when people report the prices that ARE published, not the magic secret prices that you claim exist but that neither you nor anyone else is ever allowed to actually mention. You don't get to have it both ways.
AnandTech and similar blogs are not in the business of sustaining your obsolete business model and its never-ending lies about price...
enealDC - Tuesday, August 6, 2013 - link
Thank you!!! lol
JohanAnandtech - Tuesday, August 6, 2013 - link
You cannot blame a company for doing whatever it can to protect its business model, but your comment is on target. The list-versus-street-price model reminds me of the techniques of street salesmen in tourist areas: they charge 3 times too much, and you end up with a 50% discount. The end result is that you are still ripped off unless you have intimate knowledge.
nafhan - Tuesday, August 6, 2013 - link
My experience with stuff like this is that the low prices are geared towards locking you into their products and getting themselves in the door. As soon as these companies feel certain that changing to a different storage tech would be prohibitively expensive for you, the contract renewal price will go through the roof.
In other words, that initial price may actually be very good and very competitive. Just don't expect to get the same deal when things come up for contract renewal.
equals42 - Saturday, August 17, 2013 - link
You shouldn't be making enterprise purchasing decisions unless you have intimate knowledge and have done the necessary research.
pjkenned - Tuesday, August 6, 2013 - link
So, three perspectives:
First - I have been advocating open storage projects for years. I do think we are moving to 90%+ of the market being 4TB drives and SSDs, and SDS is a clear step in this direction. I don't sell storage but have been using open platforms for years, precisely because of the value I can extract through the effort of sizing the underlying hardware.
Second - Most of the big vendors are public companies. It isn't hard to look at gross margin and figure out, ballpark, what average discounts are. Most organizations purchasing this type of storage have other needs. The market could push for lower margins, so my sense is that the companies buying this class of storage are not just paying for raw storage.
Third - vendors are moving in the direction of lower discounts at the low end. Published list prices there are much closer to actual, as the discounting trend in the industry is towards lower list prices.
Not to say that pricing is just or logical, but then again, it is a large industry that is poised for a disruptive change. One key thing here is that I believe you can get pricing if you just get a quote. This is the same as other enterprise segments such as the ERP market.
equals42 - Saturday, August 17, 2013 - link
I'll not ask you to STFU, as you eloquently abbreviated it. Though in general I believe people ultimately charge what they believe the market will pay.
Yes, list prices are generously inflated in the IT industry. But to ask EMC or IBM to tell you how much they really charge for things is stupid. That's a negotiated rate between them and their customer. BofA or WalMart isn't going to disclose how much they pay for services. Their low negotiated price helps drive efficiencies to better compete with rivals. Heck, ask Kelloggs how much Target pays per box for cereal vs WalMart. No way in hell they're going to tell you. You think an SMB is going to get the same price as Savvis or Bank of America? They can ask for it, but good luck. I sense some naiveté in your response.
In essence you're complaining about how inflated the list is vs what the average customer pays. That's a game played out based on supply and demand, market expectations and the blended costs of delivering products.
prime2515103 - Tuesday, August 6, 2013 - link
"Note that the study does not mention the percentage customer stuck in denial :-)."I don't mean to be a jerk or anything, but I can't believe I just read that on Anandtech. It's not the grammar either. A smiley? Good grief...
JohanAnandtech - Tuesday, August 6, 2013 - link
I fixed the sentence, but left the smiley in there. My prerogative ;-)
prime2515103 - Wednesday, August 7, 2013 - link
I know, it just seems unprofessional. It's a tech article, not a chat room.FunBunny2 - Tuesday, August 6, 2013 - link
What is always missing from such essays (and this one reads more like a 'Seeking Alpha' pump piece) is a discussion of datastore structure. If you want speed and fewer bytes and your data isn't just videos, then an industrial-strength RDBMS with an Organic Normal Form™ schema gets rid of the duplicate bytes. Too bad.
DukeN - Tuesday, August 6, 2013 - link
But have there actually been any disruptions to the top dogs?
EMC, NetApp, storage from HP/Dell/IBM, and Hitachi have all had significant earnings increases yet again.
So maybe a couple of new startups as well as Fusion-IO are making money now, but some of the big guys can probably just buy them out and shelve them.
davegraham - Tuesday, August 6, 2013 - link
Look at EMC's acquisition of XtremIO... that's a viable competitor that EMC has already been able to integrate as a mainstream product. Oh, and they're also using Virident PCIe cards for server-side flash. ;)
DukeN - Wednesday, August 7, 2013 - link
But is that really disruptive, or business as usual? These guys usually buy up smaller technologies and integrate them as needed. Most of their core business (spinning disks) has remained the same.
bitpushr - Friday, August 9, 2013 - link
XtremIO is still not a shipping product. It is not generally available. So I do not think this qualifies as "integrate as a mainstream product".
Likewise, their server-side flash sales (Project Lightning) have been extremely slow.
phoenix_rizzen - Tuesday, August 6, 2013 - link
If you ditch Windows on the desktop, you can do a lot more for a lot less.
$22,000 for a Nutanix node to support a handful of virtual desktops? And you still need the VDI client systems on top of that? Pffft, for $3000 CDN we can support 200-odd diskless Linux workstations (diskless meaning they boot off the network, mount all their filesystems via NFS, and run all programs on the local system using local CPU, local RAM, local GPU, local sound, etc). The individual desktops are all under $200 (AMD Athlon-II X3 and X4, 2 GB of RAM, onboard everything; CPU fan is the only moving part) and treated like appliances (when one has issues, just swap it out for a spare).
No licensing fees for the OS, no licensing fees for 90+% of the software in use, no exorbitant markup on the hardware. And all staff and students are happy with the system. We've been running this setup in the local school district for just shy of 10 years now. Beats any thin-client/VDI setup, that's for sure.
turb0chrg - Tuesday, August 6, 2013 - link
Another vendor doing hybrid storage is Nimble Storage (http://www.nimblestorage.com/). I've looked at their solution and it is quite impressive. It's not cheap though.
They also claim to be the fastest growing storage vendor!
dilidolo - Tuesday, August 6, 2013 - link
I have 2 of them for VDI; they work fine, but I wouldn't call it enterprise storage.
equals42 - Saturday, August 17, 2013 - link
It's only iSCSI, so you'd better like that protocol.
WeaselITB - Tuesday, August 6, 2013 - link
Fascinating perspective piece. I look forward to the CloudFounders review -- that stuff seems pretty interesting.
Thanks,
-Weasel
shodanshok - Tuesday, August 6, 2013 - link
Very interesting article. It basically matches my personal opinion of the SAN market: it is an overpriced one, with much less performance per $$$ than DAS.
Anyway, with the advent of thin pools / thin volumes in RHEL 6.4 and dm-cache in RHEL 7.0, cheap commodity Linux distributions (CentOS costs 0, by the way) basically match the feature set exposed by most low/mid-end SANs. This means that a cheap server with 12-24 2.5'' bays can be converted to SAN-like duty, with very good results.
From this point of view, the recent Intel S3500 / Crucial M500 disks are very interesting: the first provides enterprise-certified, high-performance, yet (relatively) low-cost storage, and the second, while not explicitly targeted at the enterprise market, is available at an outstanding capacity/cost ratio (the 1TB version is about 650 euros). Moreover, it also has a capacitor array to prevent data loss in the case of power failure.
Bottom line: for high-performance, low-cost storage, use a Linux server with loads of SATA SSDs. The only drawback is that you _have_ to know the LVM (vgs/lvs) CLI, because good GUIs tend to be commercial products and, anyway, for data recovery the CLI remains your best friend.
A note on the RAID level: while most sysadmins continue to use RAID5/6, I think it is really wrong in most cases. The read/modify/write penalty is simply too much on mechanical disks. I've done some tests here: http://www.ilsistemista.net/index.php/linux-a-unix...
Maybe on SSDs the results are better for RAID5, but the low-performance degraded state (and the very slow/dangerous reconstruction process) remains.
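For anyone who wants to see the R/M/W penalty in numbers, here is the classic rule-of-thumb calculation (the 12-spindle array and the 150 IOPS per disk are assumptions for illustration, not numbers from the linked tests):

```python
def random_write_iops(disks, iops_per_disk, write_penalty):
    """Classic rule of thumb: array random-write IOPS = raw IOPS / write penalty."""
    return disks * iops_per_disk / write_penalty

DISKS, PER_DISK = 12, 150   # hypothetical 12 mechanical spindles at ~150 IOPS each
for name, penalty in (("RAID-10", 2), ("RAID-5", 4), ("RAID-6", 6)):
    print(f"{name}: ~{random_write_iops(DISKS, PER_DISK, penalty):.0f} random write IOPS")
```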
Kyrra1234 - Wednesday, August 7, 2013 - link
The enterprise storage market is about the value-add you get from buying from the big-name companies (EMC, NetApp, HP, etc.). All of those will come with support contracts for replacement gear and to help you fix any problems you may run into with the storage system. I'd say the key reasons to buy from some of these big players:
* Let someone else worry about maintaining the systems (this is helpful for large datacenter operations where the customer has petabytes of data).
* The data reporting tools you get from these companies will out-shine any home grown solution.
* When something goes wrong, these systems will have extensive logs about what happened, and those companies will fly out engineers to rescue your data.
* Hardware/Firmware testing and verification. The testing that is behind these solutions is pretty staggering.
For smaller operations, rolling out an enterprise SAN is probably overkill. But if your data and uptime are important to you, enterprise storage will be less of a headache than JBOD setups.
Adul - Wednesday, August 7, 2013 - link
We looked at the Fusion-IO ioDrive and decided not to go that route, as the workloads presented by the virtualized desktops we offer would have killed those units in a heartbeat. We opted instead for a product by GreenBytes for our VDI offering.
Adul - Wednesday, August 7, 2013 - link
See if you can get one of these devices for review :)
http://getgreenbytes.com/solutions/vio/
we have hundreds of VDI instances running on this.
Brutalizer - Sunday, August 11, 2013 - link
These GreenBytes servers are running ZFS and Solaris (illumos):
http://www.virtualizationpractice.com/greenbytes-a...
Brutalizer - Sunday, August 11, 2013 - link
GreenBytes:
http://www.theregister.co.uk/2012/10/12/greenbytes...
Also, Tegile is using ZFS and Solaris:
http://www.theregister.co.uk/2012/06/01/tegile_zeb...
Who said ZFS is not the future?
woogitboogity - Sunday, August 11, 2013 - link
If there is one thing I absolutely adore about real capitalism, it is these moments where the establishment goes down in flames. Just the thought of their jaws dropping and stammering "but that's not fair!" when they themselves were making a mockery of fair prices with absurd profit margins... priceless. Working with computers gives you so very many of these wonderful moments of truth...
On the software end it is almost as much fun as watching plutocrats and dictators alike try to "contain" or "limit" TCP/IP's ability to spread information.
wumpus - Wednesday, August 14, 2013 - link
There also seems to be a disconnect between what Reed-Solomon can do and what they are concerned about (while RAID 6 uses Reed-Solomon, it is a specific application and not a general limitation).
It is almost impossible to scale rotating disks (presumably magnetic, but don't ignore optical forever) to the point where Reed-Solomon becomes an issue. The basic algorithm scales (easily) to 256 disks (or whatever you are striping across), of which typically you want about 16 (or fewer) parity disks. Any panic over "some byte of data was mangled while a drive died" just means you need to use more parity disks. Somehow using up all 256 is silly (for rotating media), as few applications access data in groups of 256 sectors at a time (currently 1 MB, possibly more by the time somebody might consider it).
All this goes out the window if you are using flash (and can otherwise deal with the large page-erase requirement), but I doubt that many are up to such large sizes yet. If extreme multilevel optical disks ever take over, things might get more interesting on this front (I will still expect Reed-Solomon to do well, but eventually things might reach the tipping point).
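To make the "just add more parity disks" point concrete, here is a toy calculation (drive count, failure rate, and exposure window are all hypothetical) of the chance that more drives fail together than the parity can absorb:

```python
from math import comb

def p_group_loss(total_drives, parity_drives, p_fail_window):
    """P(more concurrent drive failures in the window than parity can cover)."""
    n, m = total_drives, parity_drives
    return sum(comb(n, k) * p_fail_window**k * (1 - p_fail_window)**(n - k)
               for k in range(m + 1, n + 1))

# 48 drives striped together, 4% annual failure rate, 24-hour exposure window.
P_WINDOW = 0.04 * 24 / (365 * 24)
for parity in (1, 2, 3, 4):
    print(f"{parity} parity drive(s): P(loss in window) ~ {p_group_loss(48, parity, P_WINDOW):.1e}")
```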
equals42 - Saturday, August 17, 2013 - link
The author misunderstands how NetApp uses NVRAM. NVRAM is not a cache for the hottest data. Writes always go to DRAM. The writes are committed to NVRAM (which is mirrored to another controller) before being acknowledged to the host, but the write I/O and its commitment to disk or SSD via WAFL's sequential CP writes all happen from DRAM. While any data remains in DRAM it can be considered cached, but the contents of NVRAM do not constitute a cache, nor is it used for caching host reads.
NVRAM is only there to make sure that no writes are ever lost due to a controller loss. This is important to recognize, since most mid-range systems (and all the low-end ones I've investigated) do NOT protect against write losses in the event of failure. Data loss like this can lead to corruption in block-based scenarios and database corruption in nearly any scenario.
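A tiny sketch of the write path described above (my own toy model, not NetApp code): the host write is acknowledged once it is journaled in NVRAM, reads and destaging are served from DRAM, and the journal is only replayed if the controller dies before the next consistency point.

```python
class Controller:
    """Toy model of an NVRAM-journaled write path (illustrative only)."""

    def __init__(self):
        self.dram = {}        # volatile buffer cache; lost if the controller dies
        self.nvram_log = []   # battery/flash-backed journal (mirrored to the partner)
        self.disk = {}        # stable storage; WAFL-style CP writes land here

    def host_write(self, block, data):
        self.dram[block] = data               # the working copy lives in DRAM
        self.nvram_log.append((block, data))  # journal it before answering
        return "ack"                          # acknowledged only after the journal entry

    def host_read(self, block):
        # Reads are served from DRAM (or disk); NVRAM is never a read cache.
        return self.dram.get(block, self.disk.get(block))

    def consistency_point(self):
        # Destage the DRAM contents as large sequential writes, then retire
        # the journal entries that are now safely on disk.
        self.disk.update(self.dram)
        self.nvram_log.clear()

    def recover_after_crash(self):
        # DRAM is gone; replay the journal so no acknowledged write is lost.
        self.dram = dict(self.nvram_log)
```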