You're not doing your readers any favors by conflating the terms NAS and SAN. NAS devices (such as what you've described here) are Network Attached Storage, accessed over Ethernet, and usually via fileshares (NFS, CIFS, even AFP) with file-level access. SAN is Storage Area Network, nearly always implemented with Fibre Channel, and offers block-level access. About the only gray area is that iSCSI allows block-level access to a NAS, but that doesn't magically turn it into a SAN with a storage fabric.
Honestly, given the problems I've seen with NAS devices and the burden a well-designed one will put on a switch backplane, I just don't see the point for anything outside the smallest installations where the storage is tied to a handful of servers. By the time you have a NAS set up *well* you're inevitably going to start taxing your switches, which leads to setting up dedicated storage switches, which means... you might as well have set up a real SAN with 8Gbps fibre channel and been done with it.
NAS is great for home use - no special hardware and cabling, and options as cheap as you want to go - but it's a pretty poor way to handle centralized storage in the datacenter.
The terms NAS and SAN have become rightfully mixed, because modern storage appliances can do the jobs of both. Add some FC HBAs to the above ZFS storage system and create some FC Targets using Comstar in OpenSolaris or Nexenta and guess what? You've got a "SAN" box. Nexenta can even do active/active failover and everything else that makes it worthy of being called a true "Enterprise SAN" solution.
I like our FC SAN here, but holy cow is it expensive, and it's not getting any cheaper as time goes on. I foresee iSCSI via plain 10G Ethernet, and also FCoE (which is 10G Ethernet and FC sharing the same physical HBA and data link), completely taking over the Fibre Channel market within the next decade, which will only serve to completely erase the line between "NAS" and "SAN".
The systems as configured in this article are block level storage devices accessed over a gigabit network using iSCSI. I would strongly consider that a SAN device over a NAS device. Also, the storage network is segregated onto a separate network already, isolated from the primary network.
We also backed this device with 20Gbps InfiniBand, but had issues getting the IB network stable, so we did not include it in the article.
iSCSI is block-based storage, NAS is file-based. The transport used is irrelevant. We could use iSCSI over 10GbE, or over InfiniBand, which would increase the performance significantly, and probably exceed what is available on the most expensive 8Gb FC available.
You are confusing the NAS vs. SAN terminology with the interconnects terminology and vice versa.
SAN, NAS, DAS ... are abstract methods of how a data client accesses the stored data.
-- Network Attached Storage (NAS), by definition, is a file/entity-based data storage solution. It is usually, but not necessarily, connected to a general-purpose data network.
-- Storage Area Network (SAN), by definition, is a block-access-based data storage solution. It is usually, but not necessarily, the dedicated data network itself.
Ethernet, FC, InfiniBand, ... are physical data conduits; they define which PERFORMANCE class a solution belongs to.
iSCSI, SAS, FC, NFS, CIFS ... are logical conduits; they define which FEATURE class a solution belongs to.
Today, most storage appliances allow for multiple ways to access the data, many of them simultaneously.
Therefore, presently:
Calling a storage appliance, of whatever type, a "SAN" is pure jargon. It has nothing to do with the device "being" a SAN per se. Calling an appliance, of whatever type, a "NAS" means it is/will be used in the NAS role. It has nothing to do with the device "being" a NAS per se.
My iSCSI Datacore SAN pushes 20k IOPS for the same reason that their ZFS box does it (RAM caching).
Fibre Channel SANs will always outperform iSCSI run over crappy switching. Currently Fibre Channel maxes out at 8Gbps in most arrays. Even with MPIO, you're better off with an iSCSI system and 10/40Gbps Ethernet if you do it right. It's much cheaper, and you don't have to learn an entire new networking model (Fibre Channel or InfiniBand).
Indeed you can, which is one of the most exciting parts about using software based storage appliances. Nexenta really excels in this area, offering iSCSI, NFS, SMB, and WebDAV with simple mouse clicks.
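For reference, the same sharing can be flipped on from a shell on a plain OpenSolaris/ZFS box; the pool and dataset names here are just placeholders:

svcadm enable -r smb/server          # make sure the in-kernel CIFS service is running
zfs create tank/shares
zfs set sharesmb=on tank/shares      # serve the dataset over SMB/CIFS
zfs set sharenfs=on tank/shares      # export the same dataset over NFS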
Would be really nice to see how ZoL (ZFS on Linux) compares. It's in no way optimized yet (current work is on getting the core functionality stable, which IMHO it is), so it would have no chance against OpenSolaris or Nexenta, but hopefully it's comparable to the Promise rack.
NAS is extremely cost effective in a data center if a large majority of NFS/CIFS users are more interested in capacity, not performance. NDMP can be very efficient for backups, and the snapshot/multi-protocol aspects of NAS systems are fairly easy to manage. Some of the larger vendor NAS systems can support 100+ TB per NAS fairly effectively.
Actually, OpenSolaris and Nexenta can act as a SAN device using COMSTAR. You can attach to them with iSCSI, FC, InfiniBand, etc. and use any zvols as raw SCSI targets.
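To sketch the COMSTAR route from the command line (all names and sizes below are placeholders, and the STMF and iSCSI target services have to be enabled first):

svcadm enable stmf
svcadm enable -r svc:/network/iscsi/target:default
zfs create -V 200G tank/vm01                 # carve a zvol out of the pool
sbdadm create-lu /dev/zvol/rdsk/tank/vm01    # register the zvol as a SCSI logical unit
stmfadm add-view 600144f0...                 # expose the LU, using the GUID sbdadm printed
itadm create-target                          # create an iSCSI target for initiators to log into

FC or InfiniBand target ports can be layered on the same logical units with the appropriate COMSTAR port providers.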
This is similar to the NAS<>SAN argument. They are used in a similar manner, but have very different purposes.
Testing: you are checking to see if the item's performance meets your needs and looking for bugs or other problems, including documentation and support.
Benchmarking: you are running a series of test sets to measure the performance. Bugs and poor documentation/support may abort some of the measuring tools, but that simply goes into the report of what the benchmarks measured.
Or in short: Test==does it work? Benchmark==What does it score on standard performance measures?
I am no networking expert so please bear with me. What are the benefits of a SAN over local drives and/or a NAS? I would expect a NAS to have better performance since it would send less data over the wire than a SAN if they both had the same physical connection. A local drive/array I would expect to be faster than a SAN since it will not need to go through a network. Does it all come down to management? I can see the benefit of having your servers boot over the network and having all your drives in one system. If you set up the servers to boot over the network it would be really easy to replace a server. Am I missing something, or are the gains all a matter of management?
A NAS most of the time has worse performance than a similar SAN since there is a file system layer on the storage side. A SAN only manages blocks and thus has fewer layers, and is more efficient.
A local drive array is faster, but is less scalable and depending on the setup, it is harder to give a large read/write cache: you are limited by the amount of RAM your cache controller supports. In a software SAN you can use block based caches in the RAM of your storage server.
Management advantages over Local drives are huge: for example you can plug a small ESXi/Linux flash drive which only contains the hypervisor/OS, and then boot everything else from a SAN. That means that chances are good that you never have to touch your server during its lifetime and handle all storage and VM needs centrally. Add to that high availability, flexibility to move VMs from one server to another and so on.
But that layer must be executed somewhere; I thought the decrease in data sent over the physical wire would make up for the extra software cost on the server side. Besides, you would still want a NAS even with a SAN for shared data. I am guessing that you could have a NAS serve data from the SAN if you needed shared directories.
I also assume that since most SANs are on a separate storage network, the SAN is mainly used to provide storage to servers, and then the servers provide data to clients on the LAN.
The rest of it seems very logical to me in a large setup. I am guessing that if you have a really high-performance database server, one might use a DAS instead of a SAN, or dedicate a SAN server just to the database server.
Thanks, I am just trying to educate myself on SAN vs NAS vs DAS. Since I work at a small software development firm our server setup is much simpler than the average data center, so I don't get to deal with this level of hardware often. However, I am thinking that maybe we should build a SAN and storage network just for our rack.
I've been working on getting the additional parts necessary to build a similar system out of a slightly used HP DL380 G5 with a bunch of 15K SAS drives and an MSA20 shelf full of 750GB SATA drives. Here's what I'm going to be doing a little differently from what you've done:
1) More CPU (already there, it has dual Xeon X5355 if I recall correctly)
2) Two mirrored OCZ Vertex2 EX 50GB drives for the SLOG device (the ZIL write cache). Even though the Vertex2 claims a highly impressive 50,000 random-write IOPS, the ZIL is written sequentially, and the Vertex2 EX claims to sustain 250MB/sec writes, so it should make a very good SLOG device (see the zpool command sketch after this list).
3) Two OCZ Vertex2 100G (the cheaper MLC models) for L2ARC.
4) The SSDs will be put on a separate SAS HBA card from the HDDs to prevent I/O starvation due to the HBA I/O queue filling up because of the relatively slow I/O service-times of the HDDs.
5) Quad Gigabit Ethernet or 10G Ethernet link. The latter will require an upgrade to our datacenter switches, which is probably going to happen soon anyway.
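For items 2 and 3 above, attaching the SSDs to an existing pool is a couple of one-liners; the pool and device names below are made up:

zpool add tank log mirror c2t0d0 c2t1d0    # mirrored SLOG for the ZIL
zpool add tank cache c2t2d0 c2t3d0         # L2ARC devices (no redundancy needed, they only hold clean copies)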
I would love to see performance results for your setup. The IOMeter ICF file linked in the article would let you run exactly the same tests we ran, if you're interested in trying them.
I forgot to mention it might also be running FreeBSD (which I'm very familiar with) rather than Nexenta or OpenSolaris, but I'm just kind of playing it by ear. I may try all three. The goal is for it to eventually become a production storage server, but I'm going to do a bit of experimentation first. I still haven't gotten around to ordering the SSDs and the extra SAS HBAs, so it'll be a while before I have any benchmarks for you.
ZFS on Linux is terrible; ZFS on FreeBSD is decent, but recent ZFS features such as deduplication and iSCSI are not available on FreeBSD. Just grab a copy of the latest build of OpenSolaris (134), compile it yourself from build 157, use Solaris 10 (you've got to pay now), or use one of the mentioned Nexenta distros.
From personal experience, use fast SSD drives. I made the mistake of using a pair of the Intel 40GB Value drives for a home box with 8 x 1.5 TB drives. Terrible performance. Yes, it is cool for latency, but I can't get more than 40MB/s from it. I have tried using them just for ZIL or just for L2ARC and performance is abysmal. Get the fastest possible drives you can afford.
Matt, have you tested with, for example, Realtek NICs (don't, pain in the ass), Intel desktop NICs (stable), or the more fancy server-grade NICs that have reported iSCSI offload? Also, have you tried using dedup/compression for increased performance/space savings? This will use up lots of memory for indexes, but if your CPUs are fast enough along with the network, less IO hits the disks. I hear it has worked, assuming you have the memory, CPU, and network. One last bit: try using the Sun 40Gb/s InfiniBand cards? I know they will work with Solaris 10 and OpenSolaris, and thus I would assume Nexenta. You might want to check the hardware compatibility list for your IB card.
We have not tested with any NICs other than the Intel gigabit NICs onboard the blade. We considered using an iSCSI offload NIC for the ZFS system, but given the cost of such cards we could not justify using them.
As for deduplication - we have recently tested using deduplication on Nexenta and the results were abysmal. Most tests were showing above 90% CPU utilization while delivering far lower IOPS. I believe that deduplication could help performance, but only if you have an insane amount of CPU available. With the checksumming and deduplication running, our 5504 was simply not able to keep up. By increasing the core count, adding a second processor, and increasing the clock speed, it may be able to keep up, but after you spend that much additional capital on CPUs and better motherboards, you could increase your spindle count, switch to SAS drives, or simply add another storage unit for marginally more money.
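For anyone repeating that experiment, dedup is a per-dataset property, and the table overhead can be inspected before committing to it; the pool and dataset names here are examples:

zfs set dedup=on tank/vmstore    # enable only on the dataset expected to benefit
zpool list tank                  # the DEDUP column shows the achieved ratio
zdb -DD tank                     # DDT histogram, to estimate table size against available ARC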
From my personal experience I could not agree more about the deduplication. 33% on each core of my Phenom II for a home setup is insane. For some things like Exchange Server, it is best to let the application decide what should be cached, but deduplication really makes sense for tier-three storage, nightly backups, or maybe a small dev box. Also, the drives themselves matter; you want to use the ones that are geared for RAID setups, as it allows the system to better communicate with them. I won't name a particular vendor, but the current 'green' 5400 rpm 2TB drives are terrible for ZFS: http://pastebin.com/aS9Zbfeg (not my setup) - that is a nightly backup array used at a webhosting facility. Sure, they have great throughput, but all those errors after a few hours.
I use WD green drives in my home OpenSolaris NAS. I have 2 raidz vdevs of 4 drives each (initially I used mirrors, but wanted more space). I can serve 720p content to two laptops and my Xstreamer simultaneously without a hiccup...I guess it depends on your needs, but for a home media server, I have absolutely no complaints with the 'green' drives. Weekly scrubs for 1 yr plus with no issues. I did have to replace a scorpio on my mirrored rpool after 6 months. I am quite happy with my setup.
As a Nexenta partner, we see these issues all the time. Deduplication is not an apples-to-apples feature. The system build-out and deduplication set (affecting DDT size) are both unique factors.
With ZFS' deduplication, RAM/ARC and L2ARC become critical components for performance. Deduplication tables that spill to disk (will not fit into memory) will cause serious performance issues. Likewise, the deduplication hash function and verify options will impact performance.
For each application, doing the math on spindle count (power, cost, space, etc.) versus effective deduplication is always best. Note that deduplication does not need to be enabled pool-wide, and that - like in compression where it is wasteful to compress pre-compressed data - data with low deduplication rates should not be allowed to dominate a deduplication-enabled pool/folder.
Deduplication of 15K primary storage seems contradictory, but that type of storage has the highest $/TB factor and spindle count for any given capacity target. By allocating deduplication to target folders/zvols, performance and capacity can be optimized for most use cases. Obviously, data sets that are write-heavy and sensitive to storage latency are not good candidates for deduplication or inline compression.
If you do the math, the cost of SSD augmentation of 7200 RPM SAS pools is very competitive against similar-capacity 15K pools. The benefits of SSD augmentation (i.e. L2ARC and ZIL->SLOG where synchronous writes dominate performance profiles) are higher IOPS potential for random IO workloads (where the 7200 disks suffer most). In fact, contrasting 600GB SAS 15K to 2TB SAS 7200, you approach an economic factor where 7200 RPM disks favor mirror groups over 15K raidz groups - again, given the same capacity goals.
The real beauty of ZFS storage - whether it be Opensolaris/Illumos or Nexenta/Stor/Core - is that mixing 15K and 7200 RPM pools within the same system is very easy/effective to do. With the proper SAS controllers and JBOD/RBOD combinations, you can limit 15K applications to a small working set and commit bulk resources to augmented 7200 RPM spindles in robust raidz2 groups (i.e. watch your MTTDL versus raidz).
It is important to note that ZFS was not designed with the "home user" in mind. It can be very memory and CPU/thread hungry and easily out-strip a typical hobbyist's setup. A proper enterprise setup will include 2P quad core and RAM stores suited to the target workload. Since ZFS was designed for robust threading, the more "hardware" threads it has at its disposal, the more efficient it is. While snapshots are "free" in ZFS (i.e. the copy-on-write nature of ZFS means writes are the same with or without snapshots), data integrity (checksums) and compression/deduplication are not.
As you noted, we found deduplication to be beyond the reaches of our system. With proper tuning and component selection, I think it could be used very well (and have talked to several people who have had very good experiences with it). For the average home user it's probably beyond the scope of what they would want to use for their storage.
2) I wanted to say this earlier, but I'm quite confident that SLC is NOT required for a SLOG device, as with current wear leveling, unless you actually write more than <MLC disk capacity> / day there is no way you'll ever need the SLC's extended durability.
3) Again, MLC SSD's, good stuff
4) Yes again
5) not too shabby
6) Why use 15k or 7k2 rpm drives in the first place?
All in all nice project, just too bad you have to start from used equipment.
In my view, you can easily trash both your similar system and AnandTech's test system and simply go for what the future is going to be anyway: RAID-10 MLC drives, 48GB+ RAM, 4 CPUs (yes, those MLCs are going to perform so much faster that you will need this; quite a fair chance you'll need AMD for that, as 4-socket is their place), and mainly (this is the hardest part) many SATA 6Gb/s ports with a controller that can actually handle the bandwidth.
Overall you'd get a much simpler, faster and cleaner solution (might need to upgrade your networking though to match with the rest).
Actually, I was just hoping to see a ZFS vs HFS+ comparison for the higher-end Macs. But with the given players (Oracle, Apple), I don't know if the drivers will ever be officially released.
I have to say, kudos to you Anand for featuring an article about ZFS! It is truly the killer app for filesystems right now, and nothing else is going to come close to it for quite some time. What use is performance if you can't automatically verify that your data (and the system files that tell your system how to manipulate that data) is what it was the last time you checked?
You picked up on the benefits of the SSD (low latency) before anyone else, it is no wonder you've figured out the benefits of ZFS too earlier than most of your compatriots as well. Well done.
Hi Matt, thank you for the extensive report. In your testing there are a few unexpected results. I find the difference between Nexenta and OpenSolaris hard to understand, unless it is due to misalignment of the IO in the case of Nexenta. A zvol (the basis for an iSCSI volume) is created on top of the ZFS pool with a certain block size; I believe the default is 8kB. Next you initialize the volume and format it with NTFS. By default the NTFS structure starts at sector 63 (sixty three, not a typo!), which means that every other 4kB cluster (the NTFS allocation size) falls over a zvol block boundary. That has a serious impact on performance; I saw a report of a 70% improvement after proper alignment. Is it possible that the OpenSolaris and Nexenta pools were different in this respect, either because of different zvol block sizes (e.g. 8kB for Nexenta, 128kB for OpenSolaris - larger blocks mean fewer boundary cases) or differences in how the volumes were initialized and formatted?
It's possible that the sector alignment could be a problem, but I believe that in the build we tested, the default block size was set to 128kB, which was identical to OpenSolaris. If that has changed, then we should re-test with the newest build to see if that makes any difference.
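One note for anyone re-testing: the zvol block size is fixed at creation time, so comparing alignment effects means creating fresh volumes, along these lines (pool and dataset names are hypothetical):

zfs create -V 100G -o volblocksize=8K tank/iscsi-8k
zfs create -V 100G -o volblocksize=64K tank/iscsi-64k
zfs get volblocksize tank/iscsi-8k tank/iscsi-64k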
I haven't tried this myself yet, but how about using 8kB blocks and jumbo frames on your network? Possibly lower throughput from padding to fill the 9kB frame, in exchange for lower latency? I have no idea, as this is just a theory. Dudes in the #opensolaris IRC channel have always recommended 128K or 64K depending on the data.
One easy way to check this would be to export the pool from OpenSolaris and directly import it to NexentaStor and re-test. I think you'll find that the differences - as your benchmarks describe - are more linked to write caching at the disk level than partition alignment.
NexentaStor is focused on data integrity, and tunes for that very conservatively. Since SATA disks are used in your system, NexentaStor will typically disable the disk write cache (a write performance hit) and OpenSolaris may typically disable device cache flush operations (a write performance benefit). These two feature differences can account for the benchmark differences you're seeing.
Also, some "workstation" tuning includes the disabling of ZIL (performance benefit). This is possible - but not recommended - in NexentaStor but has the side effect of risking application data integrity. Disabling the ZIL (in the absence of SLOG) will result in synchronous writes being committed only with transaction group commits - similar performance to having a very fast SLOG (lots of ARC space helpful too).
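For what it's worth, on OpenSolaris builds of that era the "workstation" ZIL disable was a kernel tunable, and newer ZFS builds expose it per dataset; neither is recommended for data you care about, and the dataset name below is a placeholder:

set zfs:zil_disable=1                 # in /etc/system, takes effect after a reboot
zfs set sync=disabled tank/scratch    # the newer per-dataset equivalent, where the sync property exists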
We have benchmarked FreeNAS's implementation of ZFS on the same hardware, and the performance was abysmal. We've considered looking into the latest releases of FreeBSD but have not completed any of that testing yet.
There was a lot of work on this article just prior to the official announcement. The development of the Illumos foundation and subsequent OpenIndiana has been so rapidly paced that we wanted to get this article out the door before diving in to OpenIndiana and any other OpenSolaris derivatives. We will probably add more content talking about the demise of OpenSolaris and the Open Source alternatives that have started popping up at a later date.
Not to mention that projects like Illumos are currently not recommended for production; currently it's only meant as a base for other distros (OpenIndiana). Then there is Solaris 11, due soon. I'll try out the Express version when it's released.
FreeNAS 0.7.x is still using FreeBSD 7.x, and the ZFS code is a bit dated. FreeBSD 8.x has newer ZFS code (v15). Hopefully very soon FreeBSD 9.x will have the latest ZFS code (v24).
You say that all writes go to a log in ZFS. That's just not true. Only synchronous writes below a certain size go into the log (either built into the pool, or a dedicated log device). All writes are held in memory in a transaction group, and that transaction group is written to the main pool at least every 10 seconds by default (in OpenSolaris - it used to be 30 seconds, and still is in Solaris 10 U9). That's tunable, and commits will happen more frequently if required, based on available ARC and data churn rate. Note that _all_ writes go into the transaction group - the log is only ever used if the box crashes after a synchronous write and before the txg commits.
Now for the caution - you have chosen SSDs for your SLOG that don't have a backup power source for their onboard caches. If you suffer power loss, you may lose data. Several SLC SSDs have recently been released that have a supercapacitor or other power source sufficient to write cached data to flash on power loss, but the current Intel line-up doesn't have it. I believe the next generation of Intel SSDs will.
As far as using the X25-E's as ZIL devices - when we built the box initially, the X25-E's were the best choice at the time. Future builds will probably include a capacitor-backed SSD.
I would be curious to know how the performance compares to traditional fs caching on Linux w/ ext3 or ext4 with same amount of memory and a few SSD drives.
There are a few options within Linux that would be pretty interesting to see. FS caching and the different schedulers that are available within Linux. Also I would throw out ext3 and replace that with ext4 and xfs. Redhat is now supporting xfs and there are just tons of tunables for xfs compared to the other file systems.
Thanks Matt, I've been following the build over at your blog and this is an excellent article to tie it all together. I hope you follow up with your "things we'd do differently" in future articles. I would also love to see some more benchmarking against more alternatives, e.g. Open-E, or even an off-the-shelf EqualLogic.
Well, I know at least for Solaris 10.... I would suspect that OpenSolaris has it as well by now, since it has been out for at least 4 years that I know of...
You can install the ZFS Web GUI from the Solaris toolkit, but it isn't bundled into OpenSolaris. It is binary compatible, but it doesn't give any good options for iSCSI setup, as it only supported the old iSCSI target rather than the new COMSTAR target.
How can you spend a page talking about how you aren't really worried about the future of Opensolaris, and then have half a paragraph mentioning "oh, btw, it's cancelled"? The project is clearly dead. They stopped releasing source almost a month ago. Oracle has made absolutely no guarantees about when or how source would be released in the future. For all we know, they could release only portions of Solaris Express, and do it months to years after the binaries drop.
OpenSolaris is indeed dead as far as development goes, but it's still viable if you want to use the last build released, which is what all of our performance figures are based on. I will be writing some companion articles to this one talking about not only the death of OpenSolaris, but also its alternative, OpenIndiana, and the Promise M610i used as a comparison in this article.
The OpenSolaris project may be dead but ZFS and all the CDDL licensed code is still out there. Illumos, OpenIndiana and a few other distros are still out there and available. Oracle has stated they will continue to release source code after Solaris releases and will also provide binary preview releases in the form of Solaris Express. To say Solaris and ZFS are dead is pretty premature.
Whatever happens, the existing code is out there. To call it dead is a bit premature. Sure the project that had the name 'OpenSolaris' has been canceled, but everything that made it up (minus a small few closed bits that have already been replaced) lives on.
Along the lines of the "OpenSolaris is kind of dead" threads, I'd really like to see an article like this for BTRFS. It's about to become the standard filesystem for Fedora and Ubuntu in the near future, and I'd love to get some in-depth AnandTech articles about it: what it can do, what it can't, how it compares to existing Linux filesystems, how it compares to ZFS, etc.
When btrfs is ready for production use, let me know. From what I have seen it is still very much experimental. When it's as stable and proven as ZFS, I would love to give it a try. I have severe doubts that Oracle will continue to invest in its development now that it owns ZFS.
I do not believe that Windows Storage Server is an end-user product. I believe that it is only released to OEMs to ship on their systems. At this time we have no route to obtain Windows Storage Server.
True, it's OEM-only and not public, but an "evaluation" version is available with TechNet and MSDN. Without a license key you can run it for 180 days (like all new MS OSes, BTW).
"We decided to spend some time really getting to know OpenSolaris and ZFS."
OpenSolaris is a dead operating system, killed off by Oracle. Points for testing Nexenta, since they're the ones driving the fork that seems to be the successor to OpenSolaris, but basing your article around a dead-end OS isn't very helpful to your readers...
When this project was started, OpenSolaris was far from dead. We decided to keep using OpenSolaris to finish the article because a viable alternative wasn't available until three weeks ago. If we were to start this article today, it would be based on OpenIndiana. Some of our preliminary testing of OpenIndiana indicates that it performs even better than OpenSolaris in most tests.
And a viable alternative still isn't available. How are Nexenta and the community supposed to get driver support and support for new hardware when Oracle has closed the development kernel (SXDE is closed source)? Maybe, just maybe, they can use the retail Solaris 11 kernel, if it's released in a functioning form that can be piped in with the existing software and distro. They aren't going to develop it themselves, and the vendors have no reason to give the code/drivers to anybody but Oracle. Continuing the OpenSolaris kernel means creating a new operating system. It means you won't get the latest ZFS updates and tools any more, at least not until they land in the normal S11 release. It means you can't expect the latest driver updates and so on either. You can continue to use it on today's hardware, but tomorrow it might be useless; you might not find working configurations.
It's not clear that Nexenta can actually develop their own operating system rather than just a distro; it means they would eventually have to create their own OS with their own kernel, their own drivers and so on. And it's not clear how much code Oracle will let slip out - it's just clear that they will keep it under wraps until official releases. It is clear, however, that there won't be any distro for them to base it on, and any and all forks would be totally dependent on what Nexenta (Illumos) manages to do. It will quickly get outdated without updates flowing all the time, and those came from Sun.
OpenIndiana/Illumos runs the same latest and greatest pool/zfs versions as the most recent Solaris 10 update.
Work continues on porting newer pool/ZFS versions to FreeBSD which has plenty of driver support (better than OpenSolaris ever did).
A stated goal of the Illumos project is to maintain 100% binary compatibility with Solaris. If Oracle decides to break that compatibility, intentionally or not, it will truly become a fork. Development will still continue.
Even if no further development is made on ZFS, it's still an absolutely phenomenal filesystem. How many years now has Apple been using HFS+? FAT is still around in everything. If all development on ZFS stopped today, it would still remain an absolutely viable filesystem for many years to come. There is nothing else currently out there that even comes close to its feature set.
I don't see how ZFS being under Oracle's control makes it any worse than any other open source filesystem. The source is still out there, and people are free to do what they want with it within the CDDL terms.
This idea that just because the OpenSolaris DISTRO has been discontinued, that everything that went into it is no longer viable is silly. It is like calling Linux dead because Mandriva is dead.
Thanks for mentioning OpenIndiana. I've been eagerly awaiting IllumOS to be built into an actual distribution to give me an upgrade path for my home OpenSolaris file server, and I look forward to upgrading to the first stable build of OpenIndiana.
I'm currently running a dev build of OpenSolaris since the realtek network driver was broken in the latest stable build of OpenSolaris (for my chipset, at least).
I believe all of the current Hypervisors support this. Hyper-V does, as does XenServer. I have not done extensive testing with ESXi, but I would imagine that it supports it also.
It would be great to see how FreeBSD performs (8.1 and 9-CURRENT) on that hardware. I can help you configure FreeBSD for these tests if you would like; for example, by default FreeBSD does not enable AHCI mode for SATA drives, which increases random performance a lot.
Anyway, great article about ZFS performance on nice piece of hardware.
In Hyper-V it is called a differencing disk - you have a parent disk that you build and do not modify. You then create a "differencing disk". That disk uses the parent disk as its source and writes any changes out to the differencing disk. This way you can maintain all core OS files in one image and write any changes out to child disks. This allows the storage system to cache any core OS components once, and any access to those core components comes directly from the cache.
I believe that Xen calls it a differencing disk also, but I do not currently have a Xen Hypervisor running anywhere that I can check quickly.
new: Version 0.323 napp-it ZFS appliance with Web-UI and online-installer for NexentaCore and Openindiana
Napp-it, a project to build a free "ready to run" ZFS web and NAS appliance with a web UI and online installer, now supports NexentaCore and OpenIndiana (free successor of OpenSolaris) as of version 0.323. With its online installer, you will have your ZFS server running with all services and tools within minutes.
Features:
NAS fileserver with AFP (incl. Time Machine and Zero Config), SMB with ACLs, AD support and users/groups
SAN server with iSCSI (Comstar) and NFS for XEN or VMware ESXi
Web server, FTP server, database server, backup server
Newest ZFS features (highest security with parity and copy-on-write, deduplication, RAID-Z3, unlimited snapshots via Windows Previous Versions, working ACLs, online pool test with data refresh, hybrid pools, expandable data pools: simply add controllers or disks, ...)
Howto with NexentaCore:
1. Insert the NexentaCore CD and install
2. Login as root and enter:
wget -O - www.napp-it.org/nappit | perl
During first installation you have to enter a MySQL password and select Apache with the space key.
Howto with OpenIndiana (free successor of OpenSolaris):
1. Insert the OpenIndiana CD and install
2. Login as admin, open a terminal, enter su to get root permissions, and enter:
wget -O - www.napp-it.org/nappit | perl
AFP-Server is currently installed only on Nexenta.
That's all, no step 3! You can now remotely manage this Mac/PC NAS appliance via browser.
I think you identified the strong issue between SATA and SAS drives, but there's no real reason you can't do both: in fact, this is common practice. I don't know what the distribution for AT is so I may be wrong, but often a relatively small amount of your data is accountable for a large portion of your random writes. Why not store that data permanently on the SSDs?
For everything else, the cost-per-GB difference between SATA and SAS is too much to ignore. Once you start talking about adding SAS drives to this, you're moving out of the same class as the Promise device. I've used the Promise vTrak M series (and actually, the M610i specifically) and it's about the cheapest iSCSI SAN device you can get while still being a "real" iSCSI device. It's also at least a 5-year-old product and is growing long in the tooth; I don't know that it's appropriate to compare it with a brand new, performance-tuned monster.
But once you introduce SAS into the equation, the chassis itself becomes a much smaller percentage of cost. You go from $140 a drive to close to $400. You also start competing with EqualLogic, HP, etc. and given the need you expressed to add more RAM and CPU, there's definitely some stiff competition from higher-end, more modern products than the M610i.
I guess at the end of the day, while the performance numbers are impressive compared to the M610i, I don't know that the M610i is the device I would use if I was interested in performance. The Promise M610i's strength is price and capacity. Given that the M610i is INFINITELY easier to set up and maintain, that has to factor in to the cost as well. The M610i is often used as a staging target for disk-disk-tape backups; it actually has some throughput issues in a number of scenarios so it's not appropriate for all situations. It just depends on where your needs and bottlenecks are.
I'd rather have seen a comparison with a device such as an EqualLogic or StorageWorks array; because once you upgrade the ZFS box, add labor and support costs into the equation, they do become more appealing in the $10k range (and the fact that you can rather easily add more spindles to an existing array.)
1 - our storage system is not used at Anandtech in any way - I am involved in an entirely separate entity whose only affiliation with Anandtech is that we've written an article reviewing our hardware in our environment. As such, I have no idea what Anandtech's storage needs look like. In our environment we currently use fixed-size VHDs for our VM storage. As such there is no real way to put small writes on SSDs and static content on slower spindles. We need to maintain performance across the entire data set.
2 - The Vtrak M610i is about 3 years old from what I can gather from their press releases. We purchased our first Vtrak M610i at about that time. http://www.promise.com/news_room/news.aspx?m=615&a... While it may be getting a bit older, it is still available for purchase, and is still a relatively inexpensive way to build a high-capacity SAN device. The reason that it was compared in this article is because that is what we are currently using and replacing. While the controller and chassis is different from our ZFS monster, the drives in the chassis are identical, and the price points are very similar.
3 - We would have loved to compare it to a current-generation EqualLogic unit, but we did not have one on hand to test. If we ever happen to get one we will definitely run the numbers against it.
4 - The Promise system has a lot going for it in the ease of setup and use department, and I am currently working on an article that goes in depth on that. Promise also has several new products available that lower the price point (VessRAID) and expand the options that you have available. I hope to get one of those units to test and possibly deploy in the near future also. They also have an enterprise-grade head end (Vtrak S3000) that looks promising also.
Overall, this article was mainly about the ZFS system, what is possible, and how it performed against our current infrastructure. I am hopeful that we can expand what we have on hand to test with and provide broader comparisons in the future, but there is only so far a budget will stretch for getting hardware to simply test.
I know it's at least 4 years old -- I purchased one at least that long ago. But point taken; I haven't kept up with Promise beyond the vTrak M after getting a budget to higher-end units (I still used the vTrak Ms for cheap storage.)
And if your data set is large enough to require this many spindles, you might benefit from optimizing it a bit on the front-end... for example, build your VMs to split the VHDs so the high-write data is stored elsewhere. No idea if this would be of benefit for your environment (that's what test labs are for) but it's a strategy most shops with high-volume, high-transaction datasets have to periodically look at as the performance gulf between big, cheap drives and small, fast ones keeps increasing.
Given the size of your environment, EqualLogic or StorageWorks would probably be willing to let you use a demo unit for a little while. Don't know that they wouldn't make you sign an NDA regarding the benchmarks, but you'd at least be able to do it internally... Plus, IMO, there's a massive benefit to having an enterprise support contract when you have a controller failure (which I'm actually surprised hasn't been an issue with the single controller design of the Promise M610)
All told; still a good article -- you generally don't see stuff this thorough posted on the Internet. There are just so many possibilities in this space that it's hard not to nitpick. :)
Management of the drive LEDs for faulty drives, etc. is available with the right hardware; it's unfortunate that's it's not well supported on a wide variety of systems, but it does exist.
As for SMTP notification (and other kinds) of faulty hardware, etc. that should be available depending on the build of OpenSolaris you're using and whether fault management aware drivers are available for your hardware. See 'man fmadm', 'man fmd' and 'man smtp-notify' for more information.
Ultimately, users looking for a polished storage system with graphical management tools, etc. are encouraged to look at Oracle's Sun Open Storage servers which address many of the complaints listed in the article. Yes, I'm aware you're trying to build your own systems here, but it should be obvious why all of the nice tools aren't given away for free.
I haven't installed OpenSolaris yet, but when I am using Solaris 10 with ZFS, it does come with a web-based manager for many of the Sun/Oracle applications. Did you try https://localhost:6789?
First of all, there is only ONE single reason to use ZFS: it protects your data whereas other storage solutions might corrupt your data (including enterprise storage solutions)!
All the rest of the ZFS features such as snapshots, easy administration, etc are just icing on the cake. If ZFS had only protection of data and no other features, I would still use ZFS..
See here how Raid-5 does not protect your data. In fact, Raid-6 is not better and also may corrupt your precious data. Google "data corruption raid-6" http://www.baarf.com/
ZFS has end-to-end checksumming! That means ZFS will compare the data in RAM with the data on disk - are they equal? All other storage solutions do not do that - they only check data within a realm. But when data passes between realms it may become corrupted (RAM down to disk controller down to disk). There may be bugs in hardware or software within a realm. And the data is never compared - "the data XYZ in RAM, is it still XYZ on disk?" - this check is never made (unless you use ZFS).
Regarding dedup. If you get slow performance of dedup, it is only because dedup requires huge amounts of RAM. You need something like 2GB RAM for each TB disk. If you have less RAM, dedup will be sloooow. If you have much RAM, dedup will be fast.
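As a rough sanity check on that rule of thumb: assuming the commonly cited figure of roughly 320 bytes per DDT entry and the default 128kB record size, 1 TB of unique data is about 8 million blocks, and 8 million x 320 bytes is roughly 2.5GB of dedup table per TB. Smaller block sizes push that number up quickly, which is where guidance like 2GB of RAM per TB of disk comes from.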
Another advantage of ZFS (there are many) is that ZFS is OS agnostic! You can insert your zfs raid into another OS or computer without any problems! Try that with a hardware raid - impossible.
Another advantage of ZFS is that there is no "fsck"! Instead you do a "zpool scrub" every week, while your raid is alive and running. fsck requires you to shut down the raid to validate it.
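A minimal example of that routine, with the pool name as a placeholder:

zpool scrub tank                        # kick off an online scrub
zpool status -v tank                    # watch progress and any checksum errors
0 3 * * 0 /usr/sbin/zpool scrub tank    # e.g. a weekly cron entry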
Hardware raid is just a cpu with some software running on it. It is better to move that software to the CPU where you have many cores and GB of RAM and you can easily patch it. In the future, hardware raid will die. Software raid like ZFS will rule.
Regarding BTRFS: if you read the mailing lists, you see that people lose data all the time with BTRFS. In the future it might be good, but it will take at least another 5 years until we reach that stage. By then ZFS will have developed even further.
Regarding ZFS and ubiquity: ZFS is only version compatible. As ZFS' capabilities are updated, the blanket statement that "any ZFS-speaking OS can mount a ZFS volume" just isn't going to ring true. In fact, many distributions porting ZFS are still behind in ZFS version.
As in most "backward compatible" entities, newer versions of ZFS will almost always be compatible with older versions, but the older version will not be able to mount a more recent version. Therefore, you could have a Mac port that can't read a BSD port for instance.
Also, since ZFS is modular, one OS vendor could include a "highly proprietary" inline encryption or compression algorithm that is not (or not strictly) open in nature. This leads to subsequent OS-based divergence if other OSes fail to include the necessary libraries that are not part of ZFS itself.
However, and for the most part, ZFS should be regarded as version compatible regardless of the OS. Another great reason to use JBOD or discrete disk setups: complete portability of storage pools.
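A quick sketch of what that portability looks like in practice, and how to check the version story discussed above (pool name hypothetical):

zpool export tank        # on the old host
zpool import             # on the new host: list pools visible on the attached disks
zpool import tank        # bring the pool in
zpool get version tank   # the pool version currently on disk
zpool upgrade -v         # the versions this OS build understands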
Why is ZFS not the only file system in use today? I completely forgot about this until this article. I remember first reading about it and thinking "this'll probably be in everything in a couple years" so I put it out of my mind. I am upset this is not the file system everything uses.
I'm just wondering about SAS bandwidth. If you connect the backplane via 4 SAS lanes you have a theoretical peak throughput of around 1,200MB/s. The RE3 has an average read/write speed of around 90MB/s, so you could already saturate the backplane connection with about thirteen RE3s at average speed. Given the fact that you also connect the SSDs, this seems to be a bottleneck you may wish to consider on your "areas where we could have improved the original build" list.
By the way: really great article! Thanks for it...
While pure sequential reads (from all drives at the same time) would yield a bottleneck, I don't know of any instances where you would actually encounter that in our environment. Throw in one random read, or one random write, and suddenly the heads in the drives are seeking and delivering substantially lower performance than in a purely sequential read situation.
If this were purely a staging system for disk-to-tape backups, and the reads were 100% sequential, I would consider more options for additional backplane bandwidth. Since this isn't a concern at this time, and this system will be used primarily for VM storage where our workloads show a pretty substantial random write access pattern (67% write / 33% read is pretty much the norm, fully random), the probability of saturating the SAS bus is greatly reduced.
Concerning random IO you are surely right, and the impressive numbers of your box prove this. But even if you don't have a sequential workload, there is still "zpool scrub" or the possible need to resilver one or more drives, which will fill your bandwidth.
I've checked the options at Supermicro, and besides the SC846E1 they are offering E16, E2 and E26 versions with improved backplane bandwidth. The difference in price tag isn't that huge and should not have much impact if you are thinking of 15k SAS or SSD drives.
The E2 and E26 are both dual-controller designs, which are meant for dual SAS controllers so that you can have failover capabilities.
The E16 is the same system, but with SAS 2.0 support, which doubles the available bandwidth. I can definitely see the E16 or the E26 being a very viable option for anyone needing more bandwidth.
Actually (perhaps you meant to say this), the E1 and E16 are single SAS expander models, with the E16 supporting SAS2/6G. The E2 and E26 as dual SAS expander models, with the E26 supporting SAS2/6G.
The dual expander design allows for MPxIO to SAS disks via the second SAS port on those disks. The single expander version is typical of SATA-only deployments. Each expander has auto-sensing SAS ports (typically SFF8088) that can connect to HBA or additional SAS expanders (cascade.)
With SAS disks, MPxIO is a real option: allowing for reads and writes to take different SAS paths. Not so for SATA - I know of no consumer SATA disk with a second SATA port.
As for the 90MB/s average bandwidth of a desktop drive: you're not going to see that in a ZFS application. When ZIL writes happen without an SLOG device, they are written to the pool immediately looking much like small block, random writes. Later, when the transaction group commits, the same ZIL data is written again with the transaction group (but never re-read from the original ZIL pool write since it's still in ARC). For most SATA mechanisms I've tested, there is a disproportionate hit on read performance in the presence of these random writes (i.e. 10% random writes may result in 50%+ drop in sequential read performance).
Likewise, (and this may be something to stress in a follow-up), the behavior of the ZFS transaction group promises to create a periodic burst of sequential write behavior when committing transactions groups. This has the effect of creating periods of very little activity - where only ZIL writes to the pool take place - followed by a large burst of writes (about every 20-30 seconds). This is where workload determines the amount of RAM/ARC space your ZFS device needs.
In essence, you need 20-30 seconds of RAM. Writing target 90MB/s (sequential)? You need 2GB additional RAM to do that. Want to write 1200MB/s (assume SAS2 mirror limit)? You'll need 24GB of additional RAM to do that (not including OS footprint and other ARC space for DDT, MRU and MFU data). Also, the ARC is being used for read caching as well, so you'll want enough memory for the read demand as well.
There are a lot of other reasons why your "mythical" desktop sequential limits will rarely be seen: variable block size, raid level (raidz/z2/z3/mirror) and metadata transactions. SLOG, L2ARC and lots of RAM can reduce the "pressure" on the disks, but there always seem to be enough pesky, random reads and writes to confound most SATA firmware from delivering its "average" rated performance. On average, I expect to see about 30-40% of "vendor specified average bandwidth" in real world applications without considerable tuning; and then, perhaps 75-80%.
It's still early Sunday morning over here, but I'm missing something. You have 26 disks in your setup, yet your mainboard has only 14 SATA connectors. How are the other disks connected to the mainboard?
The 24 drives in the front of the enclosure are connected via a SAS expander. That allows you to add additional ports without having to have a separate cable for each individual drive.
I know this is old, but it wasn't mentioned that you can choose between gzip and lz type compression. The lz was particularly interesting to us because we hardly noticed the cpu increase, while performance improved slightly and we got almost as good compression as the fastest gzip option.
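For reference, compression is a per-dataset property, so the two can be mixed on one pool; the dataset names here are examples:

zfs set compression=lzjb tank/vmstore      # the cheap "lz" style compression (lz4 on newer pools)
zfs set compression=gzip-9 tank/backups    # much better ratio, much more CPU
zfs get compressratio tank/vmstore tank/backups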
Thanks, guys, for an excellent post on your ZFS SAN/NAS testing. I am in the process of building my own as well. I was wondering if there has been any further testing, or if you have invested in new hardware and run the benchmarks again?
Also, do you think this would be a good solution for disk backup? Do you think backup software would make use of the ZIL when writing to the NAS/SAN?
I have read many great articles at Anandtech, but this is the best so far! I loved the way you have presented it. It's very natural and you have mentioned most of the pitfalls. It's a splendid article; keep more like these coming!
PS: I wanted to congratulate the author for this great work. Just for thanking you, I joined Anandtech ;) Though I wanted to share a thought or two previously, I was just compelled enough to go through the boring process of signing up :D
Great post and really easy to understand language even a newbie like me could understand.
Could you shed some more light on why a "reverse breakout cable" was needed for this configuration? Is it a limitation of the motherboard or the backplane? If I use a different motherboard with an HBA, can I directly connect an SFF-8087 to SFF-8087 cable to the backplane and use all 24 drives?
diamondsw2 - Tuesday, October 5, 2010 - link
You're not doing your readers any favors by conflating the terms NAS and SAN. NAS devices (such as what you've described here) are Network Attached Storage, accessed over Ethernet, and usually via fileshares (NFS, CIFS, even AFP) with file-level access. SAN is Storage Area Network, nearly always implemented with Fibre Channel, and offers block-level access. About the only gray area is that iSCSI allows block-level access to a NAS, but that doesn't magically turn it into a SAN with a storage fabric.Honestly, given the problems I've seen with NAS devices and the burden a well-designed one will put on a switch backplane, I just don't see the point for anything outside the smallest installations where the storage is tied to a handful of servers. By the time you have a NAS set up *well* you're inevitably going to start taxing your switches, which leads to setting up dedicated storage switches, which means... you might as well have set up a real SAN with 8Gbps fibre channel and been done with it.
NAS is great for home use - no special hardware and cabling, and options as cheap as you want to go - but it's a pretty poor way to handle centralized storage in the datacenter.
cdillon - Tuesday, October 5, 2010 - link
The terms NAS and SAN have become rightfully mixed, because modern storage appliances can do the jobs of both. Add some FC HBAs to the above ZFS storage system and create some FC Targets using Comstar in OpenSolaris or Nexenta and guess what? You've got a "SAN" box. Nexenta can even do active/active failover and everything else that makes it worthy of being called a true "Enterprise SAN" solution.I like our FC SAN here, but holy cow is it expensive, and its not getting any cheaper as time goes on. I foresee iSCSI via plain 10G Ethernet and also FCoE (which is 10G Ethernet + FC sharing the same physical HBA and data link) completely taking over the Fibre Channel market within the next decade, which will only serve to completely erase the line between "NAS" and "SAN".
mbreitba - Tuesday, October 5, 2010 - link
The systems as configured in this article are block level storage devices accessed over a gigabit network using iSCSI. I would strongly consider that a SAN device over a NAS device. Also, the storage network is segregated onto a separate network already, isolated from the primary network.We also backed this device with 20Gbps InfiniBand, but had issues getting the IB network stable, so we did not include it in the article.
Maveric007 - Tuesday, October 5, 2010 - link
I find iscsi is closer to a NAS then a SAN to be honest. The performance difference between iscsi and san are much further away then iscsi and nas.Mattbreitbach - Tuesday, October 5, 2010 - link
iSCSI is block based storage, NAS is file based. The transport used is irrelevent. We could use iSCSI over 10GbE, or over InfiniBand, which would increase the performance significantly, and probably exceed what is available on the most expensive 8Gb FC available.mino - Tuesday, October 5, 2010 - link
You are confusing the NAS vs. SAN terminology with the interconnects terminology and vice versa.SAN, NAS, DAS ... are abstract methods how a data client accesses the stored data.
--Network Attached Storage (NAS), per definition, is an file/entity-based data storage solution.
- - - It is _usually_but_not_necessarily_ connected to a general-purpose data network
--Storage Area Network(SAN), per definition, is a block-access-based data storage solution.
- - - It is _usually_but_not_necessarily_THE_ dedicated data network.
Ethernet, FC, Infiniband, ... are physical data conduits, they are the ones who define in which PERFORMANCE class a solution belongs
iSCSI, SAS, FC, NFS, CIFS ... are logical conduits, they are the ones who define in which FEATURE CLASS a solution belongs
Today, most storage appliances allow for multiple ways to access the data, many of the simultaneously.
Therefore, presently:
Calling a storage appliance, of whatever type, a "SAN" is pure jargon.
- It has nothing to do with the device "being" a SAN per se
Calling an appliance, of whatever type, a "NAS" means it is/will be used in the NAS role.
- It has nothing to do with the device "being" a NAS per se.
mkruer - Tuesday, October 5, 2010 - link
I think there needs to be a new term called SANNAS or snaz short for snazzy.mmrezaie - Wednesday, October 6, 2010 - link
Thanks, I learned a lot.signal-lost - Friday, October 8, 2010 - link
Depends on the hardware sir.My iSCSI Datacore SAN, pushes 20k iops for the same reason that their ZFS does it (Ram cacheing).
Fibre Channel SANs will always outperform iSCSI run over crappy switching.
Currently Fibre Channel maxes out at 8Gbps in most arrays. Even with MPIO, your better off with an iSCSI system and 10/40Gbps Ethernet if you do it right. Much cheaper, and you don't have to learn an entire new networking model (Fibre Channel or Infiniband).
MGSsancho - Tuesday, October 5, 2010 - link
while technically a SAN you can easily make it a NAS with a simple zfs set sharesmb=on as I am sure you are aware.Mattbreitbach - Tuesday, October 5, 2010 - link
Indeed you can, which is one of the most exciting parts about using software based storage appliances. Nexenta really excels in this area, offering iSCSI, NFS, SMB, and WebDAV with simple mouse clicks.MGSsancho - Tuesday, October 5, 2010 - link
or a single command!FransUrbo - Wednesday, January 11, 2012 - link
Would be really nice to see how ZoL compares. It's in no way optimized yet (current work is on getting the core functionality stable - which IMHO it is) so it would have no chanse against OpenSolaris or Nexenta, but hopfully it's comparative to the Promise rack.http://zfsonlinux.org/
gfg - Tuesday, October 5, 2010 - link
NAS is extremely cost effective in a data center if a large majority of NFS/CIFS users are more interested in capacity, not performance. NDMP can be very efficent for backups, and the snapshots/multi-protocol aspects of NAS systems are fairly easy to manage. Some of the larger Vendor NAS systems can support 100+TB's per NAS fairly effectively.bhigh - Wednesday, October 6, 2010 - link
Actually, OpenSolaris and Nexenta can act as a SAN device using COMSTAR. You can attach to them with iSCSI, FC, InfiniBand, etc. and use any zvols as raw SCSI targets.
JGabriel - Wednesday, October 6, 2010 - link
Also, "Testing and Benchmarking"?Doesn't that mean the same thing and isn't it redundant? See what I did there?
Fritzr - Thursday, October 7, 2010 - link
This is similar to the NAS<>SAN argument. They are used in a similar manner, but have very different purposes.
Testing. You are checking to see if the item's performance meets your need and looking for bugs or other problems, including documentation and support.
Benchmarking. You are running a series of test sets to measure the performance. Bugs & poor documentation/support may abort some of the measuring tools, but that simply goes into the report of what the benchmarks measured.
Or in short:
Test==does it work?
Benchmark==What does it score on standard performance measures?
lwatcdr - Friday, October 8, 2010 - link
I am no networking expert, so please bear with me. What are the benefits of a SAN over local drives and/or a NAS?
I would expect a NAS to have better performance since it would send less data over the wire than a SAN if they both had the same physical connection.
A local drive/array I would expect to be faster than a SAN since it will not need to go through a network.
Does it all come down to management? I can see the benefit of having your servers boot over the network and having all your drives in one system. If you set up the servers to boot over the network it would be really easy to replace a server.
Am I missing something or are the gains all a matter of management?
JohanAnandtech - Sunday, October 10, 2010 - link
A NAS has, most of the time, worse performance than a similar SAN since there is a file system layer on the storage side. A SAN only manages blocks and thus has fewer layers, and is more efficient.
A local drive array is faster, but is less scalable and, depending on the setup, it is harder to give it a large read/write cache: you are limited by the amount of RAM your cache controller supports. In a software SAN you can use block-based caches in the RAM of your storage server.
Management advantages over Local drives are huge: for example you can plug a small ESXi/Linux flash drive which only contains the hypervisor/OS, and then boot everything else from a SAN. That means that chances are good that you never have to touch your server during its lifetime and handle all storage and VM needs centrally. Add to that high availability, flexibility to move VMs from one server to another and so on.
lwatcdr - Monday, October 11, 2010 - link
But that layer must be executed somewhere; I thought the decrease in data sent over the physical wire would make up for the extra software cost on the server side. Besides, you would still want a NAS even with a SAN for shared data. I am guessing that you could have a NAS serve data from the SAN if you needed shared directories.
I also assume that since most SANs are on a separate storage network, the SAN is mainly used to provide storage to servers, and then the servers provide data to clients on the LAN.
The rest of it seems very logical to me in a large setup. I am guessing that if you have a really high performance data base server that one might use a DAS instead of SAN or dedicate a SAN server just to the database server.
Thanks I am just trying to educate myself on SANs vs NAS vs DAS.
Since I work at a small software development firm, our server setup is much simpler than the average data center, so I don't get to deal with this level of hardware often.
However I am thinking that maybe we should build a SAN and storage network just for our rack.
cdillon - Tuesday, October 5, 2010 - link
I've been working on getting the additional parts necessary to build a similar system out of a slightly used HP DL380 G5 with a bunch of 15K SAS drives and an MSA20 shelf full of 750GB SATA drives. Here's what I'm going to be doing a little differently from what you've done:
1) More CPU (already there, it has dual Xeon X5355 if I recall correctly)
2) Two mirrored OCZ Vertex2 EX 50GB drives for the SLOG device (the ZIL write cache). Even though the Vertex2 claims a highly impressive 50,000 random-write IOPS, the ZIL is written sequentially, and the Vertex2 EX claims to sustain 250MB/sec writes, so it should make a very good SLOG device.
3) Two OCZ Vertex2 100GB (the cheaper MLC models) for L2ARC (see the command sketch after this list for how the log and cache devices attach).
4) The SSDs will be put on a separate SAS HBA card from the HDDs to prevent I/O starvation due to the HBA I/O queue filling up because of the relatively slow I/O service-times of the HDDs.
5) Quad Gigabit Ethernet or 10G Ethernet link. The latter will require an upgrade to our datacenter switches, which is probably going to happen soon anyway.
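As a reference point, attaching the log and cache devices from items 2 and 3 above is only a couple of commands once the pool exists; a rough sketch, with hypothetical pool and device names:

zpool add tank log mirror c2t0d0 c2t1d0   # mirrored SSDs as the dedicated ZIL (SLOG)
zpool add tank cache c2t2d0 c2t3d0        # SSDs as L2ARC; no mirror needed, L2ARC contents are disposable
zpool status tank                         # confirm the log and cache vdevs show up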
mbreitba - Tuesday, October 5, 2010 - link
I would love to see performance results for your setup. The IOMeter ICF file that we have linked to in the article would help you run the exact same tests as we ran, if you would be interested in running them.
cdillon - Tuesday, October 5, 2010 - link
I forgot to mention it might also be running FreeBSD (which I'm very familiar with) rather than Nexenta or OpenSolaris, but I'm just kind of playing it by ear. I may try all three. The goal is for it to eventually become a production storage server, but I'm going to do a bit of experimentation first. I still haven't gotten around to ordering the SSDs and the extra SAS HBAs, so it'll be a while before I have any benchmarks for you.
Maveric007 - Tuesday, October 5, 2010 - link
You should throw Linux into the mix. You'll find your performance will increase over the other selections ;)
MGSsancho - Tuesday, October 5, 2010 - link
ZFS on Linux is terrible. ZFS on FreeBSD is decent, but recent ZFS features such as deduplication and iSCSI are not available on FreeBSD. Just grab a copy of the latest build of OpenSolaris (134), compile it from build 157, use Solaris 10 (you have to pay now), or use one of the mentioned Nexenta distros.
From personal experience, use fast SSD drives. I made the mistake of using a pair of the Intel 40GB Value drives for a home box with 8 x 1.5TB drives. Terrible performance. Yes, it is cool for latency, but I can't get more than 40MB/s from it. I have tried using them just for ZIL or just for L2ARC and performance is abysmal. Get the fastest possible drives you can afford.
Matt, have you tested with, for example, Realtek NICs (don't - pain in the ass), Intel desktop NICs (stable), or the fancier server-grade NICs that have reported iSCSI offload? Also, have you tried using dedup/compression for increased performance/space savings? This will use up lots of memory for indexes, but if your CPUs are fast enough along with the network, less IO hits the disks. I hear it has worked, assuming you have the memory, CPU, and network. One last bit: try using the Sun 40Gbps InfiniBand cards? I know they will work with Solaris 10 and OpenSolaris, and thus I would assume Nexenta. Might want to check the hardware compatibility list for your IB card.
Cheers
Mattbreitbach - Tuesday, October 5, 2010 - link
We have not tested with any NICs other than the Intel GbE NICs onboard the blade. We considered using an iSCSI offload NIC for the ZFS system, but given the cost of such cards we could not justify using them.
As for deduplication - we have recently tested using deduplication on Nexenta and the results were abysmal. Most tests were reading above 90% CPU utilization while delivering far lower IOPS. I believe that deduplication could help performance, but only if you have an insane amount of CPU available. With the checksumming and deduplication running, our 5504 was simply not able to keep up. By increasing the core count, adding a second processor, and increasing the clock speed, it may be able to keep up, but after you spend that much additional capital on CPUs and better motherboards, you could increase your spindle count, switch to SAS drives, or simply add another storage unit for marginally more money.
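For anyone who wants to experiment with this, dedup does not have to be pool-wide, and some builds can estimate the dedup table before you commit to it; a hedged sketch with hypothetical pool and dataset names:

zdb -S tank                      # simulate dedup on an existing pool and print the DDT histogram (if your build supports it)
zfs set dedup=on tank/backups    # enable dedup only on a dataset where the ratio justifies the CPU/RAM cost
zpool get dedupratio tank        # check the achieved ratio afterwards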
MGSsancho - Tuesday, October 5, 2010 - link
From my personal experience I could not agree more about the deduplication. 33% on each core of my Phenom II for a home setup is insane. For some things, like Exchange Server, it is best to let the application decide what should be cached, but deduplication really makes sense for tier-three storage, nightly backups, or maybe a small dev box. Also, the drives themselves matter; you want to use the ones that are geared for RAID setups, which allows the system to better communicate with them. I won't name a particular vendor, but the current 'green' 5400 RPM 2TB drives are terrible for ZFS: http://pastebin.com/aS9Zbfeg (not my setup) - that is a nightly backup array used at a webhosting facility. Sure, they have great throughput, but all those errors after a few hours.
andersenep - Tuesday, October 5, 2010 - link
I use WD green drives in my home OpenSolaris NAS. I have 2 raidz vdevs of 4 drives each (initially I used mirrors, but wanted more space). I can serve 720p content to two laptops and my Xstreamer simultaneously without a hiccup... I guess it depends on your needs, but for a home media server, I have absolutely no complaints with the 'green' drives. Weekly scrubs for a year plus with no issues. I did have to replace a Scorpio on my mirrored rpool after 6 months. I am quite happy with my setup.
solori - Wednesday, October 20, 2010 - link
As a Nexenta partner, we see these issues all the time. Deduplication is not an apples-to-apples feature. The system build-out and deduplication set (affecting DDT size) are both unique factors.
With ZFS' deduplication, RAM/ARC and L2ARC become critical components for performance. Deduplication tables that spill to disk (will not fit into memory) will cause serious performance issues. Likewise, the deduplication hash function and verify options will impact performance.
For each application, doing the math on spindle count (power, cost, space, etc.) versus effective deduplication is always best. Note that deduplication does not need to be enabled pool-wide, and that - like in compression where it is wasteful to compress pre-compressed data - data with low deduplication rates should not be allowed to dominate a deduplication-enabled pool/folder.
Deduplication of 15K primary storage seems contradictory, but that type of storage has the highest $/TB factor and spindle count for any given capacity target. By allocating deduplication to targeted folders/zvols, performance and capacity can be optimized for most use cases. Obviously, data sets that are write-heavy and sensitive to storage latency are not good candidates for deduplication or inline compression.
If you do the math, the cost of SSD augmentation of 7200 RPM SAS pools is very competitive against similar capacity 15K pools. The benefits to SSD augmentation (i.e. L2ARC and ZIL->SLOG where synchronous writes dominate performance profiles) is in higher IOP potential for random IO workloads (where the 7200 disks suffer most). In fact, contrasting 600GB SAS 15K to 2TB SAS 7200, you approach an economic factor where 7200 RPM disks favor mirror groups over 15K raidz groups - again, given the same capacity goals.
The real beauty of ZFS storage - whether it be Opensolaris/Illumos or Nexenta/Stor/Core - is that mixing 15K and 7200 RPM pools within the same system is very easy/effective to do. With the proper SAS controllers and JBOD/RBOD combinations, you can limit 15K applications to a small working set and commit bulk resources to augmented 7200 RPM spindles in robust raidz2 groups (i.e. watch your MTTDL versus raidz).
It is important to note that ZFS was not designed with the "home user" in mind. It can be very memory and CPU/thread hungry and easily outstrip a typical hobbyist's setup. A proper enterprise setup will include 2P quad core and RAM stores suited to the target workload. Since ZFS was designed for robust threading, the more "hardware" threads it has at its disposal, the more efficient it is. Snapshots are "free" in ZFS (the copy-on-write nature of ZFS means writes are the same with or without snapshots), but data integrity (checksums) and compression/deduplication are not.
Mattbreitbach - Wednesday, October 20, 2010 - link
Excellent comments! Thank you for your input.
As you noted, we found deduplication to be beyond the reaches of our system. With proper tuning and component selection, I think it could be used very well (and I have talked to several people who have had very good experiences with it). For the average home user it's probably beyond the scope of what they would want to use for their storage.
L. - Wednesday, March 16, 2011 - link
Too bad you already have the 15k drives.
2) I wanted to say this earlier, but I'm quite confident that SLC is NOT required for a SLOG device: with current wear leveling, unless you actually write more than <MLC disk capacity> / day, there is no way you'll ever need the SLC's extended durability.
3) Again, MLC SSD's, good stuff
4) Yes again
5) not too shabby
6) Why use 15k or 7k2 rpm drives in the first place
All in all nice project, just too bad you have to start from used equipment.
In my view, you can easily trash both your similar system and Anandtech's test system and simply go for what the future is going to be anyway :
RAID-10 MLC drives, 48GB+ RAM, 4 CPUs (yes, those MLCs are going to perform so much faster that you will need this - quite a fair chance you'll need AMD stuff for that, as 4-socket is their place) and, mainly - this is the hardest part - many SATA 6Gb/s ports with a controller that can actually handle the bandwidth.
Overall you'd get a much simpler, faster and cleaner solution (might need to upgrade your networking though to match with the rest).
L. - Wednesday, March 16, 2011 - link
Of course, 6 months later... it's not the same equation ;) Sorry for the necro
B3an - Tuesday, October 5, 2010 - link
I like seeing stuff like this on Anand. It's a shame it doesn't draw as much interest as even the poor Apple articles.
Tros - Tuesday, October 5, 2010 - link
Actually, I was just hoping to see a ZFS vs HFS+ comparison for the higher-end Macs. But with the given players (Oracle, Apple), I don't know if the drivers will ever be officially released.
Taft12 - Wednesday, October 6, 2010 - link
Doesn't it? This interests me greatly, and judging by the number of comments it is as popular as any article about the latest video or desktop CPU tech.
greenguy - Wednesday, October 6, 2010 - link
I have to say, kudos to you Anand for featuring an article about ZFS! It is truly the killer app for filesystems right now, and nothing else is going to come close to it for quite some time. What use is performance if you can't automatically verify that your data (and the system files that tell your system how to manipulate that data) was what it was the last time you checked?
You picked up on the benefits of the SSD (low latency) before anyone else; it is no wonder you've figured out the benefits of ZFS earlier than most of your compatriots as well. Well done.
elopescardozo - Tuesday, October 5, 2010 - link
Hi Matt,
Thank you for the extensive report. There are a few unexpected results in your testing. I find the difference between Nexenta and OpenSolaris hard to understand, unless it is due to misalignment of the IO in the case of Nexenta.
A zvol (the basis for an iSCSI volume) is created on top of the ZFS pool with a certain block size. I believe the default is 8kB. Next you initialize the volume and format it with NTFS. By default the NTFS structure starts at sector 63 (sixty-three, not a typo!), which means that every other 4kB cluster (the NTFS allocation size) falls over a zvol block boundary. That has a serious impact on performance. I saw a report of 70% improvement after proper alignment.
Is it possible that the OpenSolaris and Nexenta pools were different in this respect, either because of different zvol block sizes (e.g. 8kB for Nexenta, 128kB for OpenSolaris - larger blocks mean fewer "boundary cases") or differences in how the volumes were initialized and formatted?
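For what it's worth, the zvol block size is fixed at creation time, so one way to test this theory would be to create test volumes with explicit block sizes and benchmark each; a rough sketch with hypothetical names:

zfs create -V 200G -o volblocksize=8k tank/iscsi-8k
zfs create -V 200G -o volblocksize=64k tank/iscsi-64k
zfs get volblocksize tank/iscsi-8k tank/iscsi-64k   # confirm what each zvol actually uses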
mbreitba - Tuesday, October 5, 2010 - link
It's possible that the sector alignment could be a problem, but I believe that in the build we tested, the default block size was set to 128kB, which was identical to OpenSolaris. If that has changed, then we should re-test with the newest build to see if that makes any difference.
cdillon - Tuesday, October 5, 2010 - link
Windows Server 2008 aligns all newly created partitions at 1MB, so his NTFS block access should have been properly aligned by default.
Mattbreitbach - Tuesday, October 5, 2010 - link
I was unaware that Windows 2008 correctly aligned NTFS partitions now. Thanks for the info!
MGSsancho - Tuesday, October 5, 2010 - link
I haven't tried this myself yet, but how about using 8kB blocks and using jumbo frames on your network? Possibly lower throughput from padding to fill the 9K frame, in exchange for lower latency? I have no idea, as this is just a theory. Folks in the #opensolaris IRC channel have always recommended 128K or 64K depending on the data.
solori - Wednesday, October 20, 2010 - link
One easy way to check this would be to export the pool from OpenSolaris and directly import it to NexentaStor and re-test. I think you'll find that the differences - as your benchmarks describe - are more linked to write caching at the disk level than partition alignment.
NexentaStor is focused on data integrity, and tunes for that very conservatively. Since SATA disks are used in your system, NexentaStor will typically disable disk write cache (write hit) and OpenSolaris may typically disable device cache flush operations (write benefit). These two feature differences can provide the benchmark differences you're seeing.
Also, some "workstation" tuning includes the disabling of ZIL (performance benefit). This is possible - but not recommended - in NexentaStor but has the side effect of risking application data integrity. Disabling the ZIL (in the absence of SLOG) will result in synchronous writes being committed only with transaction group commits - similar performance to having a very fast SLOG (lots of ARC space helpful too).
fmatthew5876 - Tuesday, October 5, 2010 - link
I'd be very interested to see how FreeBSD ZFS benchmark results would compare to Nexenta and OpenSolaris.
mbreitba - Tuesday, October 5, 2010 - link
We have benchmarked FreeNAS's implementation of ZFS on the same hardware, and the performance was abysmal. We've considered looking into the latest releases of FreeBSD but have not completed any of that testing yet.
jms703 - Tuesday, October 5, 2010 - link
Have you benchmarked FreeBSD 8.1? There were a huge number of performance fixes in 8.1.
Also, when was this article written? OpenSolaris was killed by Oracle on August 13th, 2010.
mbreitba - Tuesday, October 5, 2010 - link
There was a lot of work on this article just prior to the official announcement. The development of the Illumos foundation and subsequent OpenIndiana has been so rapidly paced that we wanted to get this article out the door before diving into OpenIndiana and any other OpenSolaris derivatives. We will probably add more content talking about the demise of OpenSolaris and the open-source alternatives that have started popping up at a later date.
MGSsancho - Tuesday, October 5, 2010 - link
Not to mention that projects like Illumos are currently not recommended for production; currently it is only meant as a base for other distros (OpenIndiana). Then there is Solaris 11, due soon. I'll try out the Express version when it's released.
cdillon - Tuesday, October 5, 2010 - link
FreeNAS 0.7.x is still using FreeBSD 7.x, and the ZFS code is a bit dated. FreeBSD 8.x has newer ZFS code (v15). Hopefully very soon FreeBSD 9.x will have the latest ZFS code (v24).
piroroadkill - Tuesday, October 5, 2010 - link
This is relevant to my interests, and I've been toying with the idea of setting up a ZFS-based server for a while.
It's nice to see the features it can use when you have the hardware for it.
cgaspar - Tuesday, October 5, 2010 - link
You say that all writes go to a log in ZFS. That's just not true. Only synchronous writes below a certain size go into the log (either built into the pool, or a dedicated log device). All writes are held in memory in a transaction group, and that transaction group is written to the main pool at least every 10 seconds by default (in OpenSolaris - it used to be 30 seconds, and still is in Solaris 10 U9). That's tunable, and commits will happen more frequently if required, based on available ARC and data churn rate. Note that _all_ writes go into the transaction group - the log is only ever used if the box crashes after a synchronous write and before the txg commits.
Now for the caution - you have chosen SSDs for your SLOG that don't have a backup power source for their on-board caches. If you suffer power loss, you may lose data. Several SLC SSDs have recently been released that have a supercapacitor or other power source sufficient to write cached data to flash on power loss, but the current Intel lineup doesn't have it. I believe the next generation of Intel SSDs will.
mbreitba - Tuesday, October 5, 2010 - link
Thanks for the comment on the ZIL.
As far as using the X25-Es as ZIL devices - when we built the box initially, the X25-Es were the best choice at the time. Future builds will probably include a capacitor-backed SSD.
James5mith - Tuesday, October 5, 2010 - link
For what it's worth, we are currently using roughly 16 of the Supermicro 846-E1 chassis in our storage solutions.
Drive numbering is from bottom to top, left to right. Don't know if this helps or not.
5 11 17 23
4 10 16 22
3 9 15 21
2 8 14 20
1 7 13 19
0 6 12 18
badhack - Tuesday, October 5, 2010 - link
I would be curious to know how the performance compares to traditional fs caching on Linux with ext3 or ext4, with the same amount of memory and a few SSD drives.
Maveric007 - Tuesday, October 5, 2010 - link
There are a few options within Linux that would be pretty interesting to see: FS caching and the different schedulers that are available within Linux. Also, I would throw out ext3 and replace that with ext4 and XFS. Red Hat is now supporting XFS, and there are just tons of tunables for XFS compared to the other file systems.
badnews - Tuesday, October 5, 2010 - link
Thanks Matt, I've been following the build over at your blog and this is an excellent article to tie it all together. I hope you follow up with your "things we'd do differently" in future articles. I would also love to see some more benchmarking against more alternatives, e.g. Open-E, or even an off-the-shelf EqualLogic.
Keep up the good work :)
Fallen Kell - Tuesday, October 5, 2010 - link
Well, I know at least for Solaris 10.... I would suspect that OpenSolaris has it as well by now, since it has been out for at least 4 years that I know of...
https://<host>:6789
mbreitba - Tuesday, October 5, 2010 - link
You can install the ZFS Web GUI from the Solaris toolkit, but it isn't bundled into OpenSolaris. It is binary compatible, but it doesn't give any good options for iSCSI setup, as it only supported the old iSCSI target rather than the new COMSTAR target.
sfc - Tuesday, October 5, 2010 - link
How can you spend a page talking about how you aren't really worried about the future of OpenSolaris, and then have half a paragraph mentioning "oh, btw, it's cancelled"? The project is clearly dead. They stopped releasing source almost a month ago. Oracle has made absolutely no guarantees about when or how source would be released in the future. For all we know, they could release only portions of Solaris Express, and do it months to years after the binaries drop.
http://opensolaris.org/jive/thread.jspa?messageID=...
I love ZFS/OpenSolaris and I use it at home, but OpenSolaris is dead.
Mattbreitbach - Tuesday, October 5, 2010 - link
OpenSolaris is indeed dead as far as development goes, but it's still viable if you want to use the last build released, which is what all of our performance figures are based on. I will be writing some companion articles to this one talking about not only the death of OpenSolaris, but its alternative, OpenIndiana, and the Promise M610i used as a comparison in this article.
andersenep - Tuesday, October 5, 2010 - link
The OpenSolaris project may be dead, but ZFS and all the CDDL-licensed code is still out there. Illumos, OpenIndiana and a few other distros are still out there and available. Oracle has stated they will continue to release source code after Solaris releases and will also provide binary preview releases in the form of Solaris Express. To say Solaris and ZFS are dead is pretty premature.
Whatever happens, the existing code is out there. To call it dead is a bit premature. Sure, the project that had the name 'OpenSolaris' has been canceled, but everything that made it up (minus a few closed bits that have already been replaced) lives on.
vla - Tuesday, October 5, 2010 - link
Along the lines of the "OpenSolaris is kind of dead" threads, I'd really like to see an article like this for BTRFS. It's about to become the standard filesystem for Fedora and Ubuntu in the near future, and I'd love to get some AnandTech-depth articles about it: what it can do, what it can't, how it compares to existing Linux filesystems, how it compares to ZFS, etc.
andersenep - Tuesday, October 5, 2010 - link
When btrfs is ready for production use, let me know. From what I have seen it is still very much experimental. When it's as stable and proven as ZFS, I would love to give it a try. I have severe doubts that Oracle will continue to invest in its development now that it owns ZFS.
Khyron320 - Wednesday, October 6, 2010 - link
I have never heard of any caching feature mentioned for BTRFS and it is not mentioned on the wiki anywhere. Is this a planned feature?
http://en.wikipedia.org/wiki/Btrfs#Features
Sabbathian - Wednesday, October 6, 2010 - link
The only site where you can find articles like these... thank you guys ... ;)
lecaf - Wednesday, October 6, 2010 - link
Hi,
Why not do some extra testing with Windows Storage Server R2 (just released a few days ago)?
I'm sure it would lag behind but it could be interesting to see how much.
Mattbreitbach - Wednesday, October 6, 2010 - link
I do not believe that Windows Storage Server is an end-user product. I believe that it is only released to OEMs to ship on their systems. At this time we have no route to obtain Windows Storage Server.
lecaf - Wednesday, October 6, 2010 - link
True, it's OEM only and not public, but an "evaluation" version is available with TechNet and MSDN.
Without a license key you can run it for 180 days (like all new MS OSes, BTW),
but you can also try this
http://www.microsoft.com/specializedservers/en/us/...
Just a registration and you get the software. (Read license because benchmarking is sometimes prohibited)
Sivar - Wednesday, October 6, 2010 - link
BSD supports ZFS as well, and it is far from dead.
Of course, it's also far from popular.
Guspaz - Wednesday, October 6, 2010 - link
"We decided to spend some time really getting to know OpenSolaris and ZFS."OpenSolaris is a dead operating system, killed off by Oracle. Points for testing Nexenta, since they're the ones driving the fork that seems to be the successor to OpenSolaris, but basing your article around a dead-end OS isn't very helpful to your readers...
Mattbreitbach - Wednesday, October 6, 2010 - link
When this project was started, OpenSolaris was far from dead. We decided to keep using OpenSolaris to finish the article because a viable alternative wasn't available until three weeks ago. If we were to start this article today, it would be based on OpenIndiana. Some of our preliminary testing of OpenIndiana indicates that it performs even better than OpenSolaris in most tests.
Penti - Wednesday, October 6, 2010 - link
And a viable alternative still isn't available. How are Nexenta and the community supposed to get driver support and support for new hardware when Oracle has closed the development kernel (SXDE is closed source)? That means they can maybe, just maybe, use the retail Solaris 11 kernel if it's released in a functioning form that can be piped in with the existing software and distro. They aren't going to develop it themselves, and the vendors have no reason to give the code/drivers to anybody but Oracle. Continuing the OpenSolaris kernel means creating a new operating system. It means you won't get the latest ZFS updates and tools any more, at least not until they are in the normal S11 release. It means you can't expect the latest driver updates and so on either. You can continue to use it on today's hardware, but tomorrow it might be useless; you might not find working configurations.
It's not clear that Nexenta can actually develop their own operating system rather than just a distro; it means they would eventually have to create their own OS with their own kernel, with their own drivers and so on. And it's not clear how much code Oracle will let slip out; it's just clear that they will keep it under wraps until official releases. It is clear, however, that there won't be any distro for them to base it on, and any and all forks would be totally dependent on what Nexenta (Illumos) manage to do. It will quickly get outdated without updates flowing all the time, and those came from Sun.
andersenep - Wednesday, October 6, 2010 - link
OpenIndiana/Illumos runs the same latest and greatest pool/ZFS versions as the most recent Solaris 10 update.
Work continues on porting newer pool/ZFS versions to FreeBSD, which has plenty of driver support (better than OpenSolaris ever did).
A stated goal of the Illumos project is to maintain 100% binary compatibility with Solaris. If Oracle decides the break that compatibility, intentionally or not, it will truly become a fork. Development will still continue.
Even if no further development is made on ZFS, it's still an absolutely phenomenal filesystem. How many years now has Apple been using HFS+? FAT is still around in everything. If all development on ZFS stopped today, it would still remain an absolutely viable filesystem for many years to come. There is nothing else currently out there that even comes close to its feature set.
I don't see how ZFS being under Oracle's control makes it any worse than any other open source filesystem. The source is still out there, and people are free to do what they want with it within the CDDL terms.
This idea that just because the OpenSolaris DISTRO has been discontinued, that everything that went into it is no longer viable is silly. It is like calling Linux dead because Mandriva is dead.
Guspaz - Wednesday, October 6, 2010 - link
Thanks for mentioning OpenIndiana. I've been eagerly waiting for Illumos to be built into an actual distribution to give me an upgrade path for my home OpenSolaris file server, and I look forward to upgrading to the first stable build of OpenIndiana.
I'm currently running a dev build of OpenSolaris, since the Realtek network driver was broken in the latest stable build of OpenSolaris (for my chipset, at least).
Mattbreitbach - Wednesday, October 6, 2010 - link
I believe all of the current hypervisors support this. Hyper-V does, as does XenServer. I have not done extensive testing with ESXi, but I would imagine that it supports it also.
joeribl - Wednesday, October 6, 2010 - link
"Nexenta is to OpenSolaris what OpenFiler or FreeNAS is to Linux."FreeNAS has always been FreeBSD based, not Linux. It does however provide ZFS support.
Mattbreitbach - Wednesday, October 6, 2010 - link
I should have caught that - thanks for the info. I've edited the article to reflect as such.
vermaden - Wednesday, October 6, 2010 - link
... with deduplication and other features. You can grab an ISO build or a VirtualBox appliance here: http://blog.vx.sk/archives/9-Pomozte-testovat-ZFS-...
It would be great to see how FreeBSD performs (8.1 and 9-CURRENT) on that hardware. I can help you configure FreeBSD for these tests if you would like to; for example, by default FreeBSD does not enable AHCI mode for SATA drives, which increases random performance a lot.
Anyway, great article about ZFS performance on a nice piece of hardware.
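For reference, the AHCI change vermaden mentions is a one-liner on FreeBSD 8.x; a rough sketch (it takes effect at the next boot, and disk device names change from adX to adaX, so labels or fstab entries may need updating):

# /boot/loader.conf
ahci_load="YES"   # attach SATA controllers with the ahci(4) driver instead of the legacy ata(4) driver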
Mattbreitbach - Wednesday, October 6, 2010 - link
In Hyper-V it is called a differencing disk - you have a parent disk that you build, and do not modify. You then create a "differencing disk". That disk uses the parent disk as its source, and writes any changes out to the differencing disk. This way you can maintain all core OS files in one image, and write any changes out to child disks. This allows the storage system to cache any core OS components once, and any access to those core components comes directly from the cache.
I believe that Xen calls it a differencing disk also, but I do not currently have a Xen hypervisor running anywhere that I can check quickly.
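For reference, on Windows Server 2008 R2 a differencing VHD can also be created from the command line with diskpart; a rough sketch with made-up paths:

diskpart
DISKPART> create vdisk file="D:\VMs\web01-child.vhd" parent="D:\VMs\base-parent.vhd"
DISKPART> exit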
gea - Wednesday, October 6, 2010 - link
New: Version 0.323
napp-it ZFS appliance with Web-UI and online installer for NexentaCore and OpenIndiana
Napp-it, a project to build a free "ready to run" ZFS web and NAS appliance with Web-UI and online installer, now supports NexentaCore and OpenIndiana (the free successor of OpenSolaris) as of version 0.323. With its online installer, you will have your ZFS server running with all services and tools within minutes.
Features
NAS fileserver with AFP (incl. Time Machine and Zero Config), SMB with ACLs, AD support and users/groups
SAN server with iSCSI (COMSTAR) and NFS for Xen or VMware ESXi
Web-Server, FTP
Database-Server
Backup-Server
newest ZFS features (highest security with parity and copy-on-write, deduplication, RAID-Z3, unlimited snapshots via Windows Previous Versions, working ACLs, online pool test with data refresh, hybrid pools, expandable data pools - simply add controllers or disks, ...)
included Tools:
bonnie Pool-Performancetest
iperf Net-Performancetest
midnight commander
ndmpcopy Backup
rsync
smartmontools
socat
unzip
Management:
remote via Web-UI and Browser
Howto with NexentaCore:
1. insert NexentaCore CD and install
2. login as root and enter:
wget -O - www.napp-it.org/nappit | perl
During the first installation you have to enter a MySQL password and select Apache with the space key.
Howto with OpenIndiana (free successor of OpenSolaris):
1. Insert OpenIndiana CD and install
2. login as admin, open a terminal and enter su to get root permissions and enter:
wget -O - www.napp-it.org/nappit | perl
AFP-Server is currently installed only on Nexenta.
That's all - no step 3!
You can now remotely manage this Mac/PC NAS appliance via Browser
Details
www.napp-it.org
running Installation
www.napp-it.org/pop_en.html
Mattbreitbach - Wednesday, October 6, 2010 - link
Very neat - I am installing OpenIndiana on our hardware right now and will test out the Napp-it application.
Exelius - Wednesday, October 6, 2010 - link
I think you identified the strong issue between SATA and SAS drives, but there's no real reason you can't do both; in fact, this is common practice. I don't know what the distribution for AT is so I may be wrong, but often a relatively small amount of your data is accountable for a large portion of your random writes. Why not store that data permanently on the SSDs?
For everything else, the cost-per-GB difference between SATA and SAS is too much to ignore. Once you start talking about adding SAS drives to this, you're moving out of the same class as the Promise device. I've used the Promise vTrak M series (and actually, the M610i specifically) and it's about the cheapest iSCSI SAN device you can get while still being a "real" iSCSI device. It's also at least a 5-year-old product and is growing long in the tooth; I don't know that it's appropriate to compare it with a brand new, performance-tuned monster.
But once you introduce SAS into the equation, the chassis itself becomes a much smaller percentage of cost. You go from $140 a drive to close to $400. You also start competing with EqualLogic, HP, etc. and given the need you expressed to add more RAM and CPU, there's definitely some stiff competition from higher-end, more modern products than the M610i.
I guess at the end of the day, while the performance numbers are impressive compared to the M610i, I don't know that the M610i is the device I would use if I was interested in performance. The Promise M610i's strength is price and capacity. Given that the M610i is INFINITELY easier to set up and maintain, that has to factor in to the cost as well. The M610i is often used as a staging target for disk-disk-tape backups; it actually has some throughput issues in a number of scenarios so it's not appropriate for all situations. It just depends on where your needs and bottlenecks are.
I'd rather have seen a comparison with a device such as an EqualLogic or StorageWorks array; because once you upgrade the ZFS box, add labor and support costs into the equation, they do become more appealing in the $10k range (and the fact that you can rather easily add more spindles to an existing array.)
Mattbreitbach - Wednesday, October 6, 2010 - link
You make some strong points.
1 - Our storage system is not used at AnandTech in any way - I am involved in an entirely separate entity whose only affiliation with AnandTech is that we've written an article reviewing our hardware in our environment. As such, I have no idea what AnandTech's storage needs look like. In our environment we currently use fixed-size VHDs for our VM storage. As such there is no real way to put small writes on SSDs and static content on slower spindles. We need to maintain performance across the entire data set.
2 - The Vtrak M610i is about 3 years old from what I can gather from their press releases. We purchased our first Vtrak M610i at about that time. http://www.promise.com/news_room/news.aspx?m=615&a...
While it may be getting a bit older, it is still available for purchase, and is still a relatively inexpensive way to build a high-capacity SAN device. The reason that it was compared in this article is because that is what we are currently using and replacing. While the controller and chassis is different from our ZFS monster, the drives in the chassis are identical, and the price points are very similar.
3 - We would have loved to compare it to a current-generation EqualLogic unit, but we did not have one on hand to test. If we ever happen to get one we will definitely run the numbers against it.
4 - The Promise system has a lot going for it in the ease of setup and use department, and I am currently working on an article that goes in depth on that. Promise also has several new products available that lower the price point (VessRAID) and expand the options that you have available. I hope to get one of those units to test and possibly deploy in the near future also. They also have an enterprise-grade head end (Vtrak S3000) that looks promising also.
Overall, this article was mainly about the ZFS system, what is possible, and how it performed against our current infrastructure. I am hopeful that we can expand what we have on hand to test with and provide broader comparisons in the future, but there is only so far a budget will stretch for getting hardware to simply test.
Exelius - Thursday, October 7, 2010 - link
I know it's at least 4 years old -- I purchased one at least that long ago. But point taken; I haven't kept up with Promise beyond the vTrak M after getting a budget for higher-end units (I still used the vTrak Ms for cheap storage.)
And if your data set is large enough to require this many spindles, you might benefit from optimizing it a bit on the front-end... for example, build your VMs to split the VHDs so the high-write data is stored elsewhere. No idea if this would be of benefit for your environment (that's what test labs are for) but it's a strategy most shops with high-volume, high-transaction datasets have to periodically look at as the performance gulf between big, cheap drives and small, fast ones keeps increasing.
Given the size of your environment, EqualLogic or StorageWorks would probably be willing to let you use a demo unit for a little while. Don't know that they wouldn't make you sign an NDA regarding the benchmarks, but you'd at least be able to do it internally... Plus, IMO, there's a massive benefit to having an enterprise support contract when you have a controller failure (which I'm actually surprised hasn't been an issue with the single controller design of the Promise M610)
All told; still a good article -- you generally don't see stuff this thorough posted on the Internet. There are just so many possibilities in this space that it's hard not to nitpick. :)
JonBendtsen - Thursday, October 7, 2010 - link
I think it could be interesting to see performance benchmarks without the L2ARC to see how much value it really has.
binarycrusader - Thursday, October 7, 2010 - link
Management of the drive LEDs for faulty drives, etc. is available with the right hardware; it's unfortunate that it's not well supported on a wide variety of systems, but it does exist.
As for SMTP notification (and other kinds) of faulty hardware, etc., that should be available depending on the build of OpenSolaris you're using and whether fault-management-aware drivers are available for your hardware. See 'man fmadm', 'man fmd' and 'man smtp-notify' for more information.
Ultimately, users looking for a polished storage system with graphical management tools, etc. are encouraged to look at Oracle's Sun Open Storage servers which address many of the complaints listed in the article. Yes, I'm aware you're trying to build your own systems here, but it should be obvious why all of the nice tools aren't given away for free.
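For anyone following up on that, the fault manager can be queried directly from the shell; a brief sketch:

fmadm faulty               # list components the fault manager has diagnosed as faulty
fmdump -v                  # show the fault event log
fmdump -v -u <event-uuid>  # details for one specific fault event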
pburdine - Friday, October 8, 2010 - link
I haven't installed OpenSolaris yet, but when I am using Solaris 10 with ZFS, it does come with a web-based manager for many of the Sun/Oracle applications. Did you try https://localhost:6789?
murdmath - Friday, October 8, 2010 - link
Great article. Very informative. I am excited for your review of the Promise M610i SAN.
Mat
Brutalizer - Monday, October 11, 2010 - link
First of all, there is only ONE single reason to use ZFS: it protects your data, whereas other storage solutions might corrupt your data (including enterprise storage solutions)!
See here how common file systems such as Ext3, JFS, XFS, ReiserFS, NTFS, etc. might corrupt your data:
http://www.zdnet.com/blog/storage/how-microsoft-pu...
All the rest of the ZFS features, such as snapshots, easy administration, etc., are just icing on the cake. If ZFS had only protection of data and no other features, I would still use ZFS.
See here how RAID-5 does not protect your data. In fact, RAID-6 is not better and also may corrupt your precious data. Google "data corruption raid-6"
http://www.baarf.com/
See here how ZFS does protect your data:
http://www.zdnet.com/blog/storage/zfs-data-integri...
http://queue.acm.org/detail.cfm?id=1317400
There is a reason ZFS eats CPU (it does checksumming and protects your data), whereas all the other filesystems do not protect your data (rudimentary checksumming).
ZFS has end-to-end checksumming! That means ZFS will compare the data in RAM with the data on disk - are they equal? All other storage solutions do not do that - they only check data within a realm. But when data passes between realms it may become corrupted (RAM down to disk controller down to disk). There may be bugs in hardware or software within a realm. And the data are never compared - "the data XYZ in RAM, is it still XYZ on disk?" - this check is never ever made (unless you use ZFS).
Regarding dedup. If you get slow performance of dedup, it is only because dedup requires huge amounts of RAM. You need something like 2GB RAM for each TB disk. If you have less RAM, dedup will be sloooow. If you have much RAM, dedup will be fast.
Another advantage of ZFS (there are many) is that ZFS is OS agnostic! You can insert your zfs raid into another OS or computer without any problems! Try that with a hardware raid - impossible.
Another advantage of ZFS is that there is no "fsck"! Instead you do a "zpool scrub" every week, while your RAID is alive and running. fsck requires you to shut down the RAID to validate it.
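For readers new to ZFS, the scrub workflow mentioned above is just (pool name hypothetical):

zpool scrub tank       # kicks off an online scrub; the pool stays fully usable while it runs
zpool status -v tank   # shows scrub progress and any checksum errors that were found and repaired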
Hardware raid is just a cpu with some software running on it. It is better to move that software to the CPU where you have many cores and GB of RAM and you can easily patch it. In the future, hardware raid will die. Software raid like ZFS will rule.
Regarding BTRFS: if you read the mailing lists, you see that people lose data all the time with BTRFS. In the future it might be good, but it will take at least another 5 years until we reach that stage. By then ZFS will have developed even further.
solori - Friday, October 22, 2010 - link
Regarding ZFS and ubiquity: ZFS is only version compatible. As ZFS' capabilities are updated, the blanket statement that "any ZFS-speaking OS can mount a ZFS volume" just isn't going to ring true. In fact, many distributions porting ZFS are still behind in ZFS version.
As with most "backward compatible" entities, newer versions of ZFS will almost always be compatible with older versions, but the older version will not be able to mount a more recent version. Therefore, you could have a Mac port that can't read a BSD port, for instance.
Also, since ZFS is modular, one OS vendor could include a "highly proprietary" inline encryption or compression algorithm that is not (or not strictly) open in nature. This leads to subsequent OS-based divergence if they fail to include the necessary libraries that are not a part of ZFS itself.
However, and for the most part, ZFS should be regarded as version compatible regardless of the OS. Another great reason to use JBOD or discrete disk setups: complete portability of storage pools.
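A practical consequence: before moving a pool between operating systems, it is worth checking which on-disk versions each side supports; a short sketch (pool name hypothetical):

zpool get version tank   # the version this pool is currently at
zpool upgrade -v         # the pool versions this OS's ZFS implementation understands
zfs upgrade -v           # the same for the filesystem (zfs) version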
Hrel - Monday, October 11, 2010 - link
Why is ZFS not the only file system in use today? I completely forgot about this until this article. I remember first reading about it and thinking "this'll probably be in everything in a couple years" so I put it out of my mind. I am upset this is not the file system everything uses.
sfw - Wednesday, October 13, 2010 - link
I'm just wondering about SAS bandwidth. If you connect the backplane via 4 SAS lanes you have a theoretical peak throughput of around 1,200MB/s. The RE3 has an average read/write speed of around 90MB/s, so you could already saturate the backplane connection with about thirteen RE3s at average speed. Given the fact that you also connect the SSDs, this seems to be a bottleneck you may wish to consider on your "areas where we could have improved the original build" list.
By the way: really great article! Thanks for it...
Mattbreitbach - Wednesday, October 13, 2010 - link
While pure sequential reads (from all drives at the same time) would yield a bottleneck, I don't know of any instances where you would actually encounter that in our environment. Throw in one random read, or one random write, and suddenly the heads in the drives are seeking and delivering substantially lower performance than in a purely sequential read situation.
If this were purely a staging system for disk-to-tape backups, and the reads were 100% sequential, I would consider more options for additional backplane bandwidth. Since this isn't a concern at this time, this system will be used primarily for VM storage, and our workloads show a pretty substantial random write access pattern (67% write / 33% read is pretty much the norm, fully random), the probability of saturating the SAS bus is greatly reduced.
sfw - Thursday, October 14, 2010 - link
Concerning random IO you are surely right, and the impressive numbers of your box prove this. But even if you don't have a sequential workload, there is still "zpool scrub" or the possible need to resilver one or more drives, which will fill your bandwidth.
I've checked the options at Supermicro, and besides the SC846E1 they are offering E16, E2 and E26 versions with improved backplane bandwidth. The difference in price tag isn't that huge and should not have much impact if you are thinking of 15k SAS or SSD drives.
Mattbreitbach - Thursday, October 14, 2010 - link
The E2 and E26 are both dual-controller designs, which are meant for dual SAS controllers so that you can have failover capabilities.
The E16 is the same system, but with SAS 2.0 support, which doubles the available bandwidth. I can definitely see the E16 or the E26 as being a very viable option for anyone needing more bandwidth.
solori - Thursday, October 21, 2010 - link
Actually (perhaps you meant to say this), the E1 and E16 are single SAS expander models, with the E16 supporting SAS2/6G. The E2 and E26 are dual SAS expander models, with the E26 supporting SAS2/6G.
The dual expander design allows for MPxIO to SAS disks via the second SAS port on those disks. The single expander version is typical of SATA-only deployments. Each expander has auto-sensing SAS ports (typically SFF-8088) that can connect to an HBA or additional SAS expanders (cascade).
With SAS disks, MPxIO is a real option: allowing for reads and writes to take different SAS paths. Not so for SATA - I know of no consumer SATA disk with a second SATA port.
As for the 90MB/s average bandwidth of a desktop drive: you're not going to see that in a ZFS application. When ZIL writes happen without an SLOG device, they are written to the pool immediately looking much like small block, random writes. Later, when the transaction group commits, the same ZIL data is written again with the transaction group (but never re-read from the original ZIL pool write since it's still in ARC). For most SATA mechanisms I've tested, there is a disproportionate hit on read performance in the presence of these random writes (i.e. 10% random writes may result in 50%+ drop in sequential read performance).
Likewise, (and this may be something to stress in a follow-up), the behavior of the ZFS transaction group promises to create a periodic burst of sequential write behavior when committing transactions groups. This has the effect of creating periods of very little activity - where only ZIL writes to the pool take place - followed by a large burst of writes (about every 20-30 seconds). This is where workload determines the amount of RAM/ARC space your ZFS device needs.
In essence, you need 20-30 seconds of RAM. Writing target 90MB/s (sequential)? You need 2GB additional RAM to do that. Want to write 1200MB/s (assume SAS2 mirror limit)? You'll need 24GB of additional RAM to do that (not including OS footprint and other ARC space for DDT, MRU and MFU data). Also, the ARC is being used for read caching as well, so you'll want enough memory for the read demand as well.
There are a lot of other reasons why your "mythical" desktop sequential limits will rarely be seen: variable block size, raid level (raidz/z2/z3/mirror) and metadata transactions. SLOG, L2ARC and lots of RAM can reduce the "pressure" on the disks, but there always seem to be enough pesky, random reads and writes to confound most SATA firmware from delivering its "average" rated performance. On average, I expect to see about 30-40% of "vendor specified average bandwidth" in real world applications without considerable tuning; and then, perhaps 75-80%.
dignus - Sunday, October 17, 2010 - link
It's still early Sunday morning over here, but I'm missing something. You have 26 disks in your setup, yet your mainboard has only 14 SATA connectors. How are the other disks connected to the mainboard?
Mattbreitbach - Sunday, October 17, 2010 - link
The 24 drives in the front of the enclosure are connected via a SAS expander. That allows you to add additional ports without having to have a separate cable for each individual drive.
sor - Sunday, February 20, 2011 - link
I know this is old, but it wasn't mentioned that you can choose between gzip and LZ-type compression. The LZ option was particularly interesting to us because we hardly noticed the CPU increase, while performance improved slightly and we got almost as good compression as the fastest gzip option.
jwinsor566 - Wednesday, February 23, 2011 - link
Thanks, guys, for an excellent post on your ZFS SAN/NAS testing. I am in the process of building my own as well. I was wondering if there has been any further testing, or if you have invested in new hardware and run the benchmarks again?
Also, do you think this would be a good solution for disk backup? Would backup software make use of the ZIL, do you think, when writing to the NAS/SAN?
Thanks
shriganesh - Thursday, February 24, 2011 - link
I have read many great articles at AnandTech, but this is the best so far! I loved the way you have presented it. It's very natural and you have mentioned most of the pitfalls. It's a splendid article - keep more like these coming!
PS: I wanted to congratulate the author for this great work. Just to thank you, I joined AnandTech ;) Though I wanted to share a thought or two previously, I was just compelled enough to go through the boring process of signing up :D
prattyy - Tuesday, September 11, 2012 - link
Great post, and in really easy-to-understand language that even a newbie like me could follow.
Could you shed some more light on why a "reverse breakout cable" was needed for this configuration?
Is it a limitation of the motherboard or the backplane?
If I use a different motherboard with an HBA, can I directly connect an SFF-8087-to-SFF-8087 cable to the backplane and use all 24 drives?
rc.srimurugan - Friday, March 1, 2013 - link
Hi all,
I am new to Nexenta. Can anyone please explain the architecture of Nexenta, and what is the back end?
Thanks in advance