Yes, solid-state storage has completely changed the game for where you would need to use RAID. As for a NAS... ZFS is fast becoming the filesystem to use. But nobody talks about SnapRAID. I've been using that to store my media files for years. Never had any issues with it. And best of all, it only spins up the drive where the data is stored, rather than every drive in the array.
@Andy-fd5fg Yea, SnapRAID is flexible and does well in home-media-server-like environments. It struggles with lots of changing files and only operates at the speed of a single disk. Unfortunately there is no perfect storage solution, so it's a matter of picking your compromises when setting up a RAID or RAID-like solution.
You should _always_ use some form of SSD as a log device for a raidz1 or raidz2 if you want decent performance. An alternative is to force the pool to be asynchronous, but then you can lose up to 5 seconds of data. Some of the best log devices you can use are smallish Optane drives, just avoid the 16GB ones since their sustained write performance is too low.
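For reference, a minimal sketch of what that looks like, assuming a hypothetical pool named tank and placeholder NVMe device names; a mirrored log avoids losing in-flight sync writes if one SSD dies:

    # Add a mirrored SLOG to the pool (device names are placeholders):
    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
    # The alternative mentioned above: treat all writes as async
    # (risks losing roughly the last few seconds of writes on power loss):
    zfs set sync=disabled tank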
Very pertinent. I switched from a 4th-gen Celeron-powered build with MDADM (and an extra PCI SATA card) with 6 disks, to a PowerEdge R510 with an H700 Perc card. MDADM was especially slow with writes, much slower than the Perc. However, Raid-6 performance isn't impressive either.
I really liked your video, but... what happens if the volume or array runs out of space? Can I just add another drive and keep going? Unraid can do that.
With parity RAID, most hardware RAID cards support expanding an array, MDADM has supported adding disks to arrays for a long time, and ZFS has added this feature recently. In all of these examples the new drive has to be the same size as the existing drives (larger drives should work, but the extra space won't be used).
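As a rough sketch of both expansion paths (array, pool, and device names are hypothetical; raidz expansion needs a recent OpenZFS release, 2.3 or newer):

    # mdadm: add the new disk, then grow the array onto it
    mdadm --add /dev/md0 /dev/sdf
    mdadm --grow /dev/md0 --raid-devices=6
    # OpenZFS raidz expansion: attach a new disk to an existing raidz vdev
    zpool attach tank raidz1-0 /dev/sdf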
I didn't test the journal device for this video (I already spent a long time setting up these arrays and rebuilding them). I might look into mdadm journal devices later if people are interested.
@@ElectronicsWizardry ZFS can have a special vdev on NVMe, which helps a lot in certain cases. Metadata is always stored on the NVMe (crazy fast directory index loading!) and you can choose (per dataset if you want) up to which block size shall be stored on NVMe instead of HDD. Attention: if that block size equals the recordsize, everything will be stored on NVMe.
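A minimal sketch of that setup, assuming a hypothetical pool tank, a dataset tank/projects, and placeholder NVMe devices:

    # Mirror the special vdev, since losing it loses the pool:
    zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
    # Send blocks up to 64K to the special vdev for this dataset:
    zfs set special_small_blocks=64K tank/projects
    # Note: if special_small_blocks matches the recordsize, everything lands on NVMe.
    zfs get recordsize,special_small_blocks tank/projects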
MDADM doesn't have a journal... the filesystem you use on it has the journal. I'd suggest you do some tests using different filesystems... BTRFS, EXT4, XFS... there may be others you want to look at... is JFS still kicking around? And as others have pointed out, you could move the journal to SSDs, perhaps a mirrored pair. Also look at the MDADM "chunk" size. Many years ago when I was playing with MDADM and XFS I had to do some calculations for what I would set for the XFS "sunit" and "swidth" values. I expect BTRFS has something similar. (Sorry, can't remember the exact details of those calculations.)
I think I understand now how MDADM does its journal. I thought (incorrectly, it seems) that it always uses one like the ZFS log, but it seems to only use the journal when a journal device is connected. I think I tried a few other filesystems and didn't see a performance difference. Since I was mostly trying to test RAID performance I stuck with XFS, as I didn't see a big difference between filesystems in fio performance and wanted to keep the test rounds down (it took ~3 days for each RAID type to be tested, as I had to wait for the initialization, then write a 15TB test file, then do a rebuild). I should check MDADM chunk sizes, that easily could have been the issue here.
@@ElectronicsWizardry Sounds like you need to acquire some smaller drives, 1TB perhaps. I know they aren't good for price per TB, but it would save a considerable amount of time for tests like these. Even if you don't, a follow-up video testing just XFS with a separate journal drive, and tweaks to the chunk size to get better performance out of MDADM, would be a good topic.
I do have a pile of 1TB drives. I should have remembered to use them instead, as the whole drive can be overwritten faster. It seems like looking into MDADM would make a good video and I'll work on that in the future. It will be a bit of time though, as I have some other videos in the pipeline.
MDADM has a "bitmap" which allows it to only resync the recently changed data when a device fails and then comes back, but it's not a journal. But Device Mapper has dm-era module that does something like this, I think.
My hardware RAID card acts funky with my consumer-grade motherboard. Add the fact that it's a 10-year-old motherboard and you get all sorts of funny behavior. But it works.
VROC, to my knowledge, is for NVMe-only drives on specific platforms. I'm not sure how it does performance-wise, but I think it uses a bit of silicon on the CPU to help with putting the array together for booting from, while much of the calculation is still done on the CPU cores with no dedicated cache. I might make a video about it if I can get my hands on the hardware needed for VROC.
I should take another look at Storage Spaces. It's been a bit since I've done a video on Storage Spaces, and I think Server 2025 changes some things. I decided to skip it here as I am more familiar with Linux, and adding a second OS to testing adds a lot of variables.
The problem with hardware RAID is that when something goes bad, there are no tools... RAID 5 and 6 with mdadm are terrible, especially with fast storage. Some work is being done to fix that, but when it will be included in the kernel is anyone's guess. For now, disable bitmaps and try to use a power-of-2 number of data disks (4+2 for RAID 6, for example). That should fix some of the issues you are seeing.
I don't think it's been announced what Debian 18's code name will be. The latest codename I think they have public is for 14, with Forky. I'm guessing they still have a lot of Toy Story characters to go through.
I am running Unraid with two ZFS pools. I can get almost full read and write speed out of my Exos HDDs. So software RAID isn't a bottleneck nowadays if there is enough CPU power available.
I stuck with HDDs here as they're most common in home server and NAS use. SSDs change up a lot of the performance calculations; as they are so much faster, things like the CPU and bus speeds are much more likely to be the storage limit than the disks themselves.
As you are using a PERC card, which is a rebranded OEM LSI MegaRAID 9361-8i, if the RAID controller fails and you are using Linux, you can use an HBA + MDADM to import the array and run it like normal to recover any data you need until you get another RAID controller with the same firmware.
Haha, I've actually been booting my Windows Server off a RAID 1, 2-drive array for like 9 years. This is on a Dell R510 with an H700. I've never heard you can't boot from RAID, but then again I've also never tried using software RAID.
There is so much to ZFS that I don't understand that it seems too dangerous to me. I also don't like how they don't focus one bit on performance. You can now expand RAIDz, but doing so is nothing like expanding traditional RAID. Performance drops with each drive added. It is not a complete resilvering. Dumb.
All the tests were done over a ~3-minute period, so the writes couldn't just be dumped to cache without testing the whole array's performance. The cache helps with sustained write performance, so I'm far from just testing the cache.
I skipped BTRFS as its RAID 5 and 6 solution isn't listed as fully stable yet. I like how BTRFS RAID is very flexible with adding and removing drives and mixed drive size configs.
So HW HBA cards are only used nowadays in servers, in something like a multi-cluster JBOD storage setup where you might want redundant data paths... The reason md RAID was slow on your dual socket is probably the fact that it's old. Everything uses md RAID in the server world now.
Hardware RAID is really firmware RAID, it's software “burned” onto a chip. Does that sound like it's easy to update? It's software and it will need to be updated. Remember when firmware was a joke? It's software written to the "EEPROM", so between software and hardware is firmware. LOL.
Level1Techs made a great video on why you should not use this. It is not a viable solution due to it not detecting write errors and bitrot. The current hardware RAID cards are not the RAID cards from the past: th-cam.com/video/l55GfAwa8RI/w-d-xo.html
I watched that video when it came out, and probably should have talked about this issue more in the video. While this is a potential issue, it generally seems to be pretty rare in practice due to the error correction on HDDs/SSDs. I have used many hardware RAID cards with data checksumming in software and almost never get data corruption/bitrot. ZFS and other checksummed filesystems are nice and help to keep data from changing and notify the user when there is an issue. My general experience is that checksum errors on HDDs are extremely rare on a drive that doesn't have other issues. Keeping bitrot away is one reason why I generally stick with ZFS as my default storage solution, and try to use ECC if possible.
@@ElectronicsWizardry ECC will not do anything against bitrot. Bitrot refers to the gradual corruption of data stored on disks over time, often due to magnetic or physical degradation. ECC memory primarily protects against bit flips in system memory, not storage, so it does not prevent bitrot. ECC memory for ZFS is recommended so that bit flips are not written to disk, which is an even more extreme measure, since bit flips in memory occur even less often than write errors. So being very lax about data integrity by using hardware RAID cards on one hand, but then suggesting ECC memory for ZFS against an even less likely issue, is very strange. HDDs don't have any error correction in this sense: they just write the data they get, and they have no way of even knowing what the correct data would be, so how could they even start to do error correction on it? The filesystem is responsible for that by doing checksums. (Yes, a HDD has ECC, but that is for reading data given the instability of reading such tiny magnetic fluxes.) You should rewatch the video, especially at the 6-minute timestamp: there are no cards that do any checksums anymore. The statement "I have used many hardware raid cards with data checksumming in software and almost never get data corruption/bitrot" is one-to-one equivalent to your grandma saying "I got to the age of 90 while smoking a pack of cigarettes a day". Good on you, but those empirical statements are kinda useless, and especially when "informing" people they only hurt, since I bet you at least one person is going to buy a RAID card in 2025 because of this video.
@@ElectronicsWizardry After Advanced Format drives (512e included) came out, all the different DIF-formatted drives became much less attractive because of how powerful the new ECC algorithm AF drives use is (LDPC on 100 bytes per 4k physical sector). This is why most modern hardware RAID controllers don't use DIF formats: it's been made redundant by the checksumming that is running on the drives themselves.
Yea ecc doesn’t help with bitrot but you want all memory and interfaces in the data storage system to be error correcting to prevent corruption. Ecc and disk checksuming help with different parts of the data storage pipeline. Hdds do have error correcting that is hidden from the user available space. That’s how drives know if there reading the data correctly or not. Also interfaces like sata and sas have error checking as well. The assumption raid cards make is the interfaces used by drives and the data on the disks is checksummed so the drives only return correct data or none at all. There are edge cases where this can occur but it’s rare in my experience and I’ve used a lot of drive so if this was common I’d guess I would have seen it. Also a huge amount of business servers are running of these raid cards. If there was a significant data bitrot risk I’d guess this would have been changed a while ago.
@@ElectronicsWizardry I definitely agree that it would be ideal to have ECC throughout the entire pipeline of subsystems (all the SerDes structures and all memory levels) the data goes through, but adding complexity to a design can also introduce new sources of errors; there is probably some ideal balance between simplicity and error checking, and I'd be willing to bet we're pretty close to it in most systems. I suppose bitrot is kind of an all-encompassing term, but the ECC that runs on HDDs is used to checksum every sector that is read, so it would protect against some kinds of surface defects or magnetic bit problems below the LDPC threshold on the HDD causing bitrot, though it'd be narrowed to correcting only those sources of bitrot and not some kind of wider memory or cabling issue. This is all mostly transparent to the user, like you said; you'd have to go into SMART and look for reallocated sectors to discover this was happening.
Great video! My few cents on this topic:
1) Hardware RAID with writeback cache and a single card is pure evil (TL;DR just don't do that) - whatever redundancy you get from the RAID geometry is swept away by reliance on a single piece of hardware. When that controller fails, your data is in an unknown state, because some writes might be in the cache and thus lost. Also, when the battery fails, performance drops off a cliff because you can no longer use the writeback cache (if you value your data), at which point, if you plan for _not_ having the cache, then why rely on that piece of hardware at all? There are ways to do this properly with multiple cards that provide redundancy and synchronize their cache, but that's usually vendor specific and I haven't had much experience dealing with them; those make sense mostly if the setup is large, like multiple SAS expanders with tens or hundreds of drives (and usually hosting a huge Oracle instance or something like that).
2) MDADM is great for OS partitions because of its simplicity, but also for performance-sensitive workloads like databases, and is extremely flexible nowadays when it gets to expansion or even changing between RAID levels (if you are ready for the performance penalty while it's rebuilding). But it lacks the features that make stuff like ZFS great, like snapshots (yes, I know you can use DM for those, but has anybody actually tried using that productively?).
3) ZFS is what you use if you actually value your data, because of all the features, like streaming replication, snapshots, checksums... but you lose a lot of performance because of the copy-on-write nature of the filesystem. This is what I use as long as I can live with that performance penalty. This should just be the default for everyone :)
4) BTRFS - just don't (unless you are an adrenaline junkie). Every single BTRFS filesystem with more than light desktop use I ever encountered broke sooner or later. A very common scenario is colleagues claiming BTRFS works just fine for them and me discovering I/O errors in kernel.log that were simply overlooked or ignored, and resulting damaged data that was never checked.
5) There are many seldom-used but very powerful options for gaining performance, for example putting your ext4 filesystem journal on a faster SSD or adding an NVDIMM SLOG device to your ZFS pool; it might be interesting to see a video on those :)
Some more notes on your testing:
1) FIO is complicated, IO engines are a mess. I spent a good few months trying to get numbers that made sense and correlated to the actual workload out of it. Sometimes you're not even testing the bottom line ("what if everything was a synchronous transaction") but just stress-testing weird fsync()+NCQ behaviour in your HBA's firmware. Not sure I have a solution for this; testing is hard, testing for "all workloads" is impossible, and you can get very misleading numbers just because you used an LSI HBA with an unsupported drive, which breaks NCQ+FUA, which makes all testing worthless (ask how I know). Btw, an LSI HBA in IT mode is not as transparent as an ordinary SATA controller.
2) CPU usage ("load") might not seem that important when what you're storing is media files, but when you're running a database it translates into latency... except when it doesn't, which is what I suspect happened to you, because FIO wasn't waiting for writes to finish (so not blocked on I/O), which is what "load" actually measures. On the other hand, if you throw NVMe drives in there, then that CPU overhead will start making a huge impact inside your database/application, which could otherwise use that thread immediately, etc.
3) HW RAID controllers are capable of rebuilding just the actual data in an array, but that requires a supported hardware stack (drives and their FW revisions) and functional TRIM/discard in your OS.
4) ZFS performance testing is usually non-reproducible with faster drives or when measuring IOPS, for a couple of reasons like fragmentation, allocating unused space vs. rewriting data, meta(data/slab) caching, recordsize or volblocksize vs. the actual workload, and weird double-caching behaviour on Linux. Just try creating a ZVOL (let's say 2 times larger than your testing writeset) and then run the same benchmark multiple times on it; you'll start seeing different numbers on every run, trending downwards :)
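A sketch of the zvol benchmark suggested in point 4, with hypothetical pool and size values and placeholder fio parameters; running it repeatedly shows the downward trend described:

    # Create a zvol roughly 2x the write set you plan to test:
    zfs create -V 100G tank/bench
    # Random-write test against the zvol:
    fio --name=zvol-randwrite --filename=/dev/zvol/tank/bench \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --iodepth=32 --numjobs=4 --runtime=180 --time_based --group_reporting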
Thanks for the feedback. I noticed the card in IT mode seems to use its cache to help single disks. This affected my testing and trying to get good results, but in the real world this should only help (unless there is a bug with the cache, but I guess this could affect any disk controller, and the cache on the card is error corrected). The more storage performance tests I do, the more complex it becomes and the more variables I realize I have to control. I know these results are far from perfect and can't be extrapolated to every use case, but I hope these numbers are better than the often vague descriptions of performance I see online, like slow writes, or no practical CPU usage difference. I hope others can use these tests as a jumping-off point for running more tests specific to their workloads. Load averages likely weren't the best way of seeing the CPU usage. Looking back, running a CPU benchmark at the same time might have been a better metric. It was pretty reproducible in my testing, though.
@@ElectronicsWizardry IT mode shouldn't use the cache at all. In the past I used IR mode with RAID0 of single drives to get controller-based writeback cache (backed up by the battery), but in IT mode this is not possible. Just beware that IT mode is not transparent at all. I am biased against LSI adapters based on my past experience, but they have sort of become the standard for home NAS/homelab setups. For small setups, I think it's better to get a cheap SATA controller. (And lots of AMD boards can switch U.2 ports to SATA mode directly to the CPU, which is awesome.)
@@zviratko I might test it again, but it definitely seems to use the RAM as a read cache on that card when set to IT mode. I should look into this more and might do some testing in the future. Do you have a model of SATA controller you suggest?
Be careful with the results of this video. The system he's using has a very weak and underpowered processor in terms of single-core IPC, and without any additional information his performance problems are most likely localized to his system configuration. I've been using mdadm for years on a variety of different configurations, and the only caveats were slightly lower random read speeds and in some cases higher disk latency than hardware RAID arrays; other than that, as long as you have a decent CPU (a 10-year-old Opteron is hardly a good test candidate), it's a viable solution for budget-conscious individuals. Fun fact: Intel maintains the entire software stack for mdadm, and it's implemented as part of their Intel VROC solution on high-end server-grade motherboards. One other caveat is that nearly all modern hardware RAID solutions do not scale with high-end solid-state storage devices and other NVMe drives. Using these types of cards will bottleneck pretty quickly and experience diminishing returns. I've worked with distributed storage systems for years and stopped using hardware RAID arrays almost a decade ago.
That Opteron server might not have been the best choice here, as it was probably getting to be too slow for a realistic test bench. I was trying to stick with a slow CPU and slow HDDs, as that seems more common in NAS units and home servers. SSDs scale differently in performance, but I stuck with HDDs for this test. Unfortunately no single platform can be used to extrapolate results for every use case, but I try to get the best results I can with the hardware I have. Hopefully I can get something like an LSI 9560 and a few NVMe drives one day to see how a modern NVMe RAID card compares to software RAID on a high-speed CPU. During my testing I didn't notice the CPU threads being maxed out, while the disk usage was maxed out.
Err... Windows has no problem booting from a RAID-like volume, and has been able to do so since the Windows XP days... So that's just plain misinformation you're spreading there. It's a bit fiddly to set up, but it's very much possible.
How can Windows be set up to boot from a software RAID volume? I've checked a lot of sources over the years and have yet to see a good way to do it. Storage Spaces isn't bootable to my knowledge (I think it was set up in a bootable manner on a Surface system, but I don't think that counts).
@@ElectronicsWizardry You're right that Storage Spaces isn't usable for booting yet. Windows does, however, have an older RAID-like tech that was introduced in XP, named Dynamic Disks, which can be. It supports simple, spanned, striped, mirror and RAID5 volumes, and most importantly here, booting from it is supported. But for all usages except booting from a mirror set of dynamic disks, it's also so old that it's deprecated, though not yet removed either. It's even still there in the current previews of Win12, and probably isn't going to actually be removed until Storage Spaces drives actually are bootable. As for how: just install on one drive. Once in Windows, convert to dynamic, initialize another drive as dynamic (but no partitions), and extend the system drive over to that drive. It does however NOT work with dual booting, because it needs both the MBR and the boot sector of the partitions to work. Normally you set GRUB to overwrite the MBR and chainload the boot-sector version, but that does not work with dynamic disks.
Yea, you're right, there are ways to do a mirrored boot, but compared to Linux the options are limited, it's mirrored only, and it uses a deprecated RAID method in Windows. I think it's reasonable to say that a RAID card can be a good option if you want to boot Windows from a RAID array. Thanks for the reply.
@@ElectronicsWizardry No no, you misunderstand. Only mirrored boot is still what MS considers current with this method. You CAN boot any dynamic disk type, including RAID5 style, and the method as a whole is not deprecated. It's only for OTHER USES that it's deprecated. Like you, I'd still recommend a RAID card, that's not the point. But recommended is different from possible. My point is simply that you shouldn't claim it can't do it when it's very much possible and has been for almost a quarter of a century.
Hardware RAID is dead! Ceph/ZFS leads everything. I moved away from both software and hardware RAID about 4 years ago. I'm happy with my ZFS in my homelab!
Yea, I'm a big fan of ZFS/Ceph; for much of my use I go with ZFS. For many of my uses, squeezing a bit more performance out of an array is less important than features like ZFS send/receive, and I am very used to managing ZFS. Also, next week's video is gonna be using Ceph in a small cluster, so stay tuned.
@ElectronicsWizardry The redundancy and ease of management that ZFS offers are truly unmatched compared to traditional RAID. If you're replicating something like RAID0 or RAID5 and want to avoid sacrificing performance by skipping a ZRAID setup, there's absolutely no reason to stick with RAID. With ZFS, you get almost the same speed but with far better security! For me, proper error correction is non-negotiable! I refuse to rely on systems that don't offer it properly. While I do like BTRFS, it's still a no-go because I prioritize both security and performance. Unfortunately, BTRFS RAID setups, as I would need them, are still too unstable. Honestly, there's no reason to move away from ZFS anyway, even if BTRFS eventually becomes stable for certain RAID configurations. Btw, I like your recent style change with this fancy beard! :-D
Yea, that can be annoying on a lot of cards. Dealing with licensing is a pain in IT. I often try to stick with free, open-source products if I can, even if I have the money available, just to save me the hassle of licensing.
Thank you. I had planned to do something similar in the future, but this video saved me time. Love your videos, keep it up.
Glad you like my videos and I was able to help answer some of your questions. Thanks for the support!
My man coming through with exactly the info I seek right now. Thank you sir!
seriously.
Another reason I use ZFS is the utilities that it provides, such as snapshots and send and receive. They make it easy to back up and restore. Even with hardware RAID solutions, I would still put ZFS on top of the logical volume.
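For anyone new to that workflow, a hedged sketch with hypothetical pool, dataset, and host names:

    # Snapshot, then send the full dataset to a backup machine:
    zfs snapshot -r tank/data@2025-01-01
    zfs send -R tank/data@2025-01-01 | ssh backuphost zfs receive -u backup/data
    # Later, send only the changes between two snapshots:
    zfs snapshot -r tank/data@2025-02-01
    zfs send -R -I tank/data@2025-01-01 tank/data@2025-02-01 | ssh backuphost zfs receive -u backup/data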
Great video bro! Thanks as always for doing the tests and sharing the results!
Great to see you on my feed!
Great discussion topic. Thanks for a great video.
These tests are fascinating and such a good tool when considering the different RAID options in a system. The analysis is very much appreciated!
One issue I'm running into in the wild as an MSP is clients wanting gig or multi-gig networking, but their drive arrays just aren't fast enough to give them that performance for internal resources. Higher internet bandwidth than local is certainly not the example case I was expecting, but I've been trying to push clients towards better drive-performance choices for a while now.
Thanks for helping me to be more prepared! The only luck you get is the luck you build in preparation.
For MD RAID you seem to have the journal enabled (possibly the write-intent bitmap enabled as well, as that avoids having to do a full rebuild when a drive is removed and re-plugged, or after an unclean shutdown).
Unsure if it's a feature of mdadm, but on Synology it can skip unallocated space on a rebuild.
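On plain mdadm, the write-intent bitmap side of that is a one-liner (the array name is a placeholder); with it enabled, a briefly removed drive or an unclean shutdown only resyncs the dirty regions:

    # Add an internal write-intent bitmap to an existing array:
    mdadm --grow /dev/md0 --bitmap=internal
    # Watch resync progress and bitmap state:
    cat /proc/mdstat
    mdadm --detail /dev/md0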
I've worked with storage in a professional and hobby capacity for years. This guy knows his stuff
Most newer RAID cards use NAND and supercaps (when power is lost, the card dumps the RAID RAM cache into the NAND on the RAID card).
Also, some support swappable cache, so if the RAID card fails while it was online you can move the cache to the new RAID card.
This is assuming you have enabled the write-cache mode, which uses the cache on the RAID card.
Yea, I was debating going into more detail about how supercaps have gotten much more common recently, but decided not to include that to try to keep the video shorter.
Do you know the model of one of those cards with removable cache? I've seen some that use a little RAM-like board for swapping out the cache, but it seemed to mostly be for upgrading to more cache, not keeping cache from a failed card.
@ElectronicsWizardry You need to specifically look for the ones that have NAND on them. The configuration of the hardware RAID array (it should be stored on it) and uncommitted data are stored on it as well, so when the card is replaced and the NAND module is plugged into the new card it should just work, or it will ask to restore the previous/unknown configuration.
The only thing you must do is make sure both cards are using the same firmware (update both cards once and don't update them anymore; this is so you know both cards have the same firmware).
The issue with a hardware RAID or HBA card not working on consumer hardware will be due to the SMBus pins (buy cards that don't use the SMBus pins, or put Kapton tape over the pins; flashing the firmware for IT mode doesn't disable the SMBus, as it's an optional hardware feature that can be implemented by the RAID/HBA card; typically Dell and HP RAID/HBA cards will have the SMBus wired on the card for BMC/lights-out/iDRAC use).
The issue stems from the fact that SMBus is supported by UEFI on consumer boards but the UEFI module is usually missing, so the system just hangs on boot or you have random issues with the computer (or it works because the motherboard manufacturer didn't actually wire the SMBus pins on the PCIe slot itself, so the SMBus issue doesn't happen).
@ElectronicsWizardry (Might have to respond to you on Discord or email, as TH-cam might automod/delete my post), which it looks like it has.
@leexgx I should put more of my socials on TH-cam but my email is probably the best way to message me.
@@ElectronicsWizardry Hmm, a RAID card with a built-in NVMe socket just for cache/recovery would be nice to see. I wonder when/if that's available now? I've got an old RocketRAID 2740 (PCIe v2) x16 card that I really want to upgrade, but every time I do a search for "16 port raid pci-e 3" or 4, I don't seem to see many, and most are $500 or something crazy.
this was very detailed and informative. thanks for your work
Glad you liked the video and found it useful.
The main reason why you don't use hardware RAID is reliability, especially with modern filesystems, which are designed to work with native raw drives. You cannot depend on hardware RAID to communicate properly, especially on consumer hardware. Unless the RAID card is certified for use with the filesystem, like ZFS and BTRFS, stay far away or use it in IT mode. With hardware RAID cards you also have to worry about drivers and software support for your OS.
No RAID card is compatible with ZFS, as far as I am aware. It needs raw access to the disks, and a RAID card will not allow that. Not sure about BTRFS. A RAID card without the RAID ability (it just connects disks) is just called an HBA.
There are indeed some raid controllers which can be put in HBA mode and therefore pass raw disks to the system.
@olefjord85 That is specifically called IT mode or "HBA" mode. Any means of using a RAID card in the mode it was designed for (RAID mode) means trouble with ZFS.
@@industrial-wave also JBOD mode
@@SakuraChan00 that's the same thing as hba/it mode, just another different name for it
You can put the journal for mdadm on a faster drive (NVMe) with the --write-journal option as well.
That requires a separate drive right? I might look into that in the future, but when planning this video I already was doing a lot of testing and decided to skip the mdadm journal drive testing. I might look into this for a future video though.
@@ElectronicsWizardry I'd love to see some results in the case of a ssd/nvme journal drive!
@@ElectronicsWizardry Definitely a separate drive.
It's in the official man pages but often not really known to a lot of people. Thanks for doing the RAID testing as well!
It's called a bitmap, not a journal.
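Following up on the --write-journal suggestion earlier in this thread: mdadm has both a write-intent bitmap and an optional journal device, and they are different features. A hedged sketch of the journal, which is normally given at array-creation time (device names are placeholders):

    # RAID 5 array with an NVMe partition as the write journal (closes the write hole,
    # at the cost of writing everything twice):
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          --write-journal=/dev/nvme0n1p1 /dev/sd[b-e]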
So helpful! Thanks a lot!
My question is: what happens if you introduce a one-bit error on a drive attached to the hardware RAID controller? You have to put an error on every drive in different locations to be sure that the controller does not always overwrite the wrong bit on the parity drive, for example. ZFS can handle these errors with its checksums.
When I tested Qnap QTS it could not do that and the NAS returned faulty data.
I think all the hardware RAID cards I know of will assume the data from the drives is correct, so you would see an error in this instance. Since drives have checksumming of the data, along with the SATA/SAS interface, the assumption is that the risk of reading incorrect data is low.
I should do a video looking into this in the future.
@@ElectronicsWizardry There are RAID controllers + SAS drives that have their own checksumming, but it's not something you'll realistically be able to set up in a home lab.
I already had data corruption problems with several files being unreadable. I did not recognize it for a long time, so my backups were also corrupted. I still do not know what the reason was: a bad SATA cable, a faulty driver, a faulty CPU or RAM? As I used hardware RAID at that time, it might also have been the RAID controller. I have valued ZIP files since that problem, because they do a CRC32 checksum, so you know immediately which data is correct.
It's really not easy to compare because there's so many ways to configure a ZFS volume...
The best way is to use XFS.
@@JodyBruchon XFS has redundancy options?
@@frederichardy8844 That's the point of md, duh.
@ So we're back to comparing ZFS to MD, with a lot more configuration options for ZFS than for MD. It's not easy to compare, because depending on the files (size, content) and the usage (sequential read or not, concurrent access, read or write, access only to recent files or totally random), the best choice is not obvious, and obviously not XFS in every case.
I would like to see object storage like MinIO tested
Very good overview of the various implementations.
I would like to stress that the comments at 1:58 are what really makes hardware RAID unattractive nowadays: hardware RAID controllers have gotten expensive. In addition to the make and model needing to be correct to move data in case of hardware failure, firmware versions are also important here. Back in my data center support days, we always had a cold spare of it laying around, and it never got updated firmware until it was ready to replace the failed unit. There have been times where downgrading firmware was also the path of least resistance to bring back a storage system. Another interesting feature of some select RAID cards is that they offer their own out-of-band management network ports with independent PoE power. This permits setting up and accessing the RAID controller even if the host system is offline. The write cache can be recovered directly from the device, archived safely and restored when host functionality returns, in a recovery mode. Automations can be set up to copy that write cache on detection of host issues to quickly protect that cached data externally. Lastly, a huge feature for enterprises is hardware RAID card support for virtualization. This permits a thinner hypervisor by not needing to handle the underlying software storage system for guest machines. All great enterprise-class features that are some of the reasoning behind why RAID controllers are so expensive. (That and Broadcom's unfettered greed.)
ZFS has something similar to journaling, called the ZFS Intent Log (ZIL), which can be separated from the main disk system. Similarly, there are options for separate drives to act as read and write caches. Leveraging these features can further accelerate ZFS pools if used in conjunction with high-IOPS drives (Intel Optane is excellent for these tasks). Redundancy support for the ZIL, read cache and write cache is fully supported. CPU utilization here is presumably higher, but I haven't explicitly tested it. I have seen the performance results and they do speed things up. A common setup is using spinning disks for the bulk storage with these extras as fast NVMe SSDs. The cost of using these ZFS features used to be quite high and comparable to the extra cost of a RAID card, but storage prices of NVMe drives have dropped significantly over the past few years, changing the price/performance dynamic. (A short zpool sketch of the cache setup follows this comment.)
5:27 Both hardware RAID and ZFS can be set up with external monitoring solutions like Zabbix for monitoring and alerts as well. In the enterprise world these are preferred, as they're just the storage aspect of a more centralized monitoring system. Think watching CPU temperatures, fan speeds and the like. They don't alter the disks like the vendor-supplied utilities or software tools, but they do the critical work of letting admins know if anything has gone wrong.
One last thing with ZFS is that it can leverage some newer CPU features to accelerate the parity computations, which your older Opteron may not support. With ZFS's portability, you could take that disk array and move it to a newer, faster platform and speed things up that way. I would also look into the various EFI settings for legacy BIOS support on that i7 12900K board. Ditto for SR-IOV, which that card should also support.
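A brief sketch of adding the separate read cache (L2ARC) mentioned above, with hypothetical pool and device names:

    # Add an NVMe device as L2ARC (read cache):
    zpool add tank cache /dev/nvme2n1
    # Check per-vdev activity, including the cache device:
    zpool iostat -v tank 5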
BTRFS raid6 user here, 8x8TB. Using raid1c4 for metadata and raid6 for data to mitigate the write-hole bugs until the raid stripe tree is fully released. Performance is OK, and it requires very little RAM (2GB-4GB of RAM is fine).
I use a secondhand 8-port hardware RAID card with the RAID BIOS removed, so it just passes through the disks.
I've been using ZFS for the last few years and have almost maxed out my 5 drives, so I can't wait for single-disk expansion, which should be here soon.
There was no mention of BTRFS (as suggested in the title and description), and you missed the most important feature of ZFS: checksums! A performance comparison is nice, but if you care about your data, there is no way around checksums.
Oops, edited the title with MDADM instead.
As far as checksums, that is one big advantage of ZFS + BTRFS + others with that support. I will mention that HDDs have internal checksums to prevent data from being read incorrectly, which typically results in a low enough error rate.
That hardware RAID card is freaking old. It's a really bad comparison for 2025; the difference between PERC 11 and 12 is huge in itself. Should do this again with an H965i to show what 2025 hardware can do.
Yeah! I would never use older than H730 cards.
It's an OK comparison since ZFS and MDADM are running on an ancient platform. Also because it can be purchased for $20. Makes it easier to compare since ZFS is free.
Performance is important, of course, but for some of us the power usage might be another important factor to consider.
Some time ago I was using HP P410 RAID controller and it was increasing my server's idle power consumption by 20W. That's why I decided to switch to software RAID based on ZFS.
that thing is an abomination.
I went the ZFS route back when I set up my Proxmox multi-VM combo server in 2021. So far (knock on wood), after 2 separate drive failures, no data was lost. Before, I was using either proprietary NAS solutions (QNAP) or built-in motherboard RAID configurations, and this always ended up with partial or complete data loss. Thankfully, after my first disk failure I always kept some sort of external backup, so even though QNAP and RAID solutions failed me I had some way to restore my data. I'm not saying ZFS is bulletproof (look at the LTT situation from a couple of years prior), but if you do regular pool scrubs and extended smartctl tests, so basically you don't let your ZFS pool "rot", then I'm pretty sure ZFS is the best there is (so far).
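The scrub/SMART routine mentioned there boils down to a couple of commands (pool and device names are placeholders); some distro packages also ship a periodic scrub job that does the first part automatically:

    # Scrub the pool and check the result:
    zpool scrub tank
    zpool status -v tank
    # Kick off an extended SMART self-test on each member disk, then review it later:
    smartctl -t long /dev/sda
    smartctl -a /dev/sda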
Surprising results! I was of the belief that software RAID is just as fast as hardware RAID. I was wrong!
The Opteron server he was running these tests on is very slow. Modern CPUs are so fast now that software RAID would probably show much lower utilization. The comparison with old hardware RAID cards is tricky because they likely won't run on modern UEFI motherboards, especially consumer ones. He kind of said this at the beginning of the video.
Have an Adaptec ASR-8805 with BBU in a consumer ASUS motherboard in UEFI mode, running for several years with Proxmox, no problem. Bought 2 of these controllers, they were so cheap, in case of issues.
Really interesting video, thank you!
Glad you enjoyed the video!
Do we even have any decent hardware RAID cards these days?
I don't know if it was posted already or not, but if you are able to set the "Storage OpROM" option to Legacy on consumer boards, that will allow you to use the built-in managers of the different RAID cards. But given the move to UEFI on everything, it's becoming rare to see legacy features on newer mainboards.
HW RAID is only good if I have another card on hand when one fails. And even then, can they be easily swapped in?
Typically a replacement RAID card will detect an array and import it. Generally, if it's from the same manufacturer with the same model or newer, you can import the array. But the hardware requirements for importing an array are much stricter than with software RAID solutions.
I generally like to rely on backups in case something goes wrong with the whole array, but easy recovery is still a nice feature if things go wrong.
From what I understand, a hardware RAID generally doesn't allow you to use, for example, an NVMe SSD as a cache. So wouldn't a software RAID with such a cache generally surpass the performance of hardware RAID? Especially for random reads/writes?
I think a few hardware RAID cards supported an SSD cache, but this feature has since been removed in newer product lines to my knowledge. You can still add a cache in software with something like bcache in Linux (rough sketch at the end of this reply).
The annoying part of a cache is that it can help a lot in some workloads, but almost none in others. If you're doing random IO across the whole drive, expect almost no improvement from adding a cache, as it would be nearly impossible to predict what blocks are needed next. If you're accessing some files more than others, a cache may help a lot.
I'm generally a fan of a separate SSD-only pool if you know some files are going to be accessed more often than others. Like an SSD pool for current projects, and an HDD pool for archived projects. But this can add complexity and depends on your exact workload.
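A minimal bcache sketch for the SSD-cache idea mentioned above, assuming an NVMe drive caching an mdadm array (device names are examples, not a recommendation for any specific setup):
# register the NVMe as the cache device and the md array as the backing device
make-bcache -C /dev/nvme0n1 -B /dev/md0
# the combined device shows up as /dev/bcache0; put the filesystem on that
mkfs.xfs /dev/bcache0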
Your results agree with my experience. HW RAID, for small and medium systems, is not needed. I still see it for large enterprise SAN nodes, but not in this classic form. And, at least for desktop-type machines, the move to solid-state storage has changed the performance equations yet again. But assuming we're talking bulk storage, AKA spinning rust, I would use ZFS over just about any other choice, especially in a NAS application.
Yea, SSDs change the performance calculation a good amount. With HDDs it's much more common to be IO limited than with SSDs. I chose HDDs for testing here as they're much more common in home server and NAS use, and since HDDs do much worse in some workloads like random IO, I thought it would be best to test with random IO. With how well this video is doing, I might look at SSD arrays, and try to get one of the NVMe RAID cards to see how they work.
@@ElectronicsWizardry Yeah, that might be fun. One of my servers has an LSI card that does NVMe. No RAID in hardware for that, but the Kioxia U.2 drives I have it hooked to give insane I/O throughput. Perhaps you could do a ZFS comparison using the SSDs in conjunction with the HDDs, either as special devices, or just cache. (And why cache drives aren't what most people think when it comes to ZFS)
Yeah, ZFS (even a stripe) with NVMes and 4 fio threads with fsync=0 just maxes out the CPUs, or for a practical case, a couple of VM guests maxing out their IO on the same ZFS pool - same result on the host CPU. For HDDs I would choose ZFS any day.
Yes, solid-state storage has completely changed the game for where you would need to use RAID.
As for a NAS..... ZFS is fast becoming the filesystem to use.
But nobody talks about SNAPRAID. I've been using that to store my media files for years. Never had any issues with it. And best of all, it only spins up the drive where the data is stored, rather than every drive in the array.
@Andy-fd5fg Yea, SnapRAID is flexible and does well in home media server environments. It struggles with lots of changing files and only operates at the speed of a single disk. Unfortunately there is no perfect storage solution, so it's a pick-your-compromises situation when setting up RAID or a RAID-like solution.
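For anyone curious, a minimal SnapRAID setup is just a small config file plus periodic runs, roughly like this (all paths are placeholders):
# /etc/snapraid.conf
parity /mnt/parity1/snapraid.parity
content /var/snapraid/snapraid.content
content /mnt/disk1/snapraid.content
data d1 /mnt/disk1
data d2 /mnt/disk2
# then run periodically (e.g. from cron)
snapraid sync
snapraid scrub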
You should _always_ use a form of SSD as a log device for a raidz1 or raidz2 if you want decent performance. An alternative is to force the pool to be asynchronous, but then you can lose up to 5 seconds of data. Some of the best log devices you can use are smallish Optane drives, just avoid the 16GB ones since their sustained write performance is too low.
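For reference, the two options described above look something like this (pool and device names are placeholders; note a SLOG only accelerates synchronous writes):
# add a small SSD/Optane partition as a separate log (SLOG) device
zpool add tank log /dev/nvme0n1p1
# or trade up to ~5 seconds of potential data loss on power failure for speed
zfs set sync=disabled tank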
Very pertinent. I switched from a 4th-gen Celeron-powered build with MDADM (and an extra PCI SATA card) with 6 disks, to a PowerEdge R510 with an H700 PERC card. MDADM was especially slow with writes, much slower than the PERC. However, RAID 6 performance isn't impressive either.
I really liked your video but ... what happens if the volume or array runs out of space? Can I just add another drive and keep going? Unraid can do that.
With parity RAID, most hardware RAID cards support adding a drive to an array, MDADM has supported adding disks to arrays for a long time, and ZFS has added this feature recently. In all of these cases the new drive has to be the same size as the existing drives (larger drives should work, but the extra space won't be used). Rough commands below.
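Roughly what expansion looks like on each (device, pool, and vdev names are examples):
# MDADM: add the new disk, then grow the array onto it
mdadm --add /dev/md0 /dev/sdf
mdadm --grow /dev/md0 --raid-devices=5
# ZFS (OpenZFS 2.3+): attach one more disk to an existing raidz vdev
zpool attach tank raidz1-0 /dev/sdf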
How does a HW RAID card stand against MDADM with a dedicated NVMe (or ramdisk) journal and bitmap device?
I didn't test a journal device for this video (I already spent a long time setting up these arrays and rebuilding them). I might look into mdadm journal devices later if people are interested.
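For anyone who wants to experiment with it themselves, the journal device is given when the array is created; something like this (device names are examples):
# RAID 5 across four disks with an NVMe partition as the write journal
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      --write-journal /dev/nvme0n1p1 /dev/sd[b-e]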
@@ElectronicsWizardry ZFS can have a special vdev on NVMe, which helps a lot in certain cases. Metadata is always stored on the NVMe (crazy fast directory index loading!) and you can choose (per dataset if you want) up to which block size gets stored on NVMe instead of HDD. Attention: if that small-block cutoff equals the recordsize, everything will be stored on NVMe.
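Roughly what that setup looks like (pool/dataset names are placeholders; the special vdev should be mirrored, since losing it loses the pool):
# mirrored NVMe special vdev for metadata (and optionally small blocks)
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
# also store blocks up to 64K for this dataset on the special vdev
zfs set special_small_blocks=64K tank/projects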
I have one suggestion for you about this video. Add time stamps.
MDADM doesn't have a journal... the filesystem you use on it has the journal.
I'd suggest you do some tests using different filesystems..... BTRFS, EXT4, XFS.... there may be others you want to look at... is JFS still kicking around?
And as others have pointed out, you could move the journal to SSDs, perhaps a mirrored pair.
Also look at the MDADM "chunk" size. Many years ago when I was playing with MDADM and XFS I had to do some calculations for what I would set for the XFS "sunit" and "swidth" values (see the example after this comment).
I expect BTRFS has something similar.
(Sorry, can't remember the exact details of those calculations.)
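For anyone wanting to try the tuning described above, the general shape is something like this (a 5-disk RAID 5 with a 256K chunk, purely as an example; newer mkfs.xfs usually detects md geometry on its own):
# create the array with an explicit chunk size
mdadm --create /dev/md0 --level=5 --raid-devices=5 --chunk=256 /dev/sd[b-f]
# align XFS manually if needed: su = chunk size, sw = number of data disks
mkfs.xfs -d su=256k,sw=4 /dev/md0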
I think I misunderstood how MDADM does its journal. I thought (incorrectly, it seems) that it always used one like the ZFS log, but it only uses a journal when a journal device is attached.
I think I tried a few other filesystems and didn't see a performance difference. Since I was mostly trying to test RAID performance I stuck with XFS, as I didn't see a big difference between filesystems in fio performance and wanted to keep the number of test rounds down (it took ~3 days for each RAID type to be tested, as I had to wait for the initialization, then write a 15TB test file, then do a rebuild).
I should check MDADM chunk sizes, that easily could have been the issue here.
@@ElectronicsWizardry Sounds like you need to acquire some smaller drives, 1TB perhaps.
I know they aren't good for price per TB, but it would save a considerable amount of time for tests like these.
Even if you don't, a follow-up video testing just XFS with a separate journal drive, and tweaks to the chunk size to get better performance out of MDADM, would be a good topic.
I do have a pile of 1TB drives. I should have remembered to use them instead as the whole drive can be overwritten faster.
It seems like looking into MDADM would make a good video and I'll work on that in the future. It will be a bit of time though, as I have some other videos in the pipeline.
@@ElectronicsWizardry Make it whenever you can..... until then we will all look forward to your other videos.
MDADM has a "bitmap" which allows it to only resync the recently changed data when a device fails and then comes back, but it's not a journal. But Device Mapper has a dm-era module that does something like this, I think.
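For reference, the write-intent bitmap is easy to toggle on an existing array (device name is an example):
# internal write-intent bitmap: a briefly removed disk only resyncs changed blocks
mdadm --grow /dev/md0 --bitmap=internal
# (use --bitmap=none to remove it, as suggested elsewhere in the comments for fast arrays)
cat /proc/mdstat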
Thanks Brandon.
My hardware RAID card works funky with my consumer-grade motherboard. Add in that it's a 10-year-old motherboard and you get all sorts of funny behavior. But it works.
How about Intel VROC?
Isn't that only for NVMe? I'd say SSD performance in a RAID is high enough, even in a pure software solution.
VROC, to my knowledge, is for NVMe-only drives on specific platforms. I'm not sure how it does performance-wise, but I think it uses a bit of silicon on the CPU to help with assembling the array for booting, while much of the calculation is still done on CPU cores with no dedicated cache. I might make a video about it if I can get my hands on the hardware needed for VROC.
It would be interesting to see how Microsoft Storage Spaces performs in this test 🤔
I should take another look at Storage Spaces. It's been a while since I've done a video on Storage Spaces, and I think Server 2025 changes some things. I decided to skip it here as I am more familiar with Linux, and adding a second OS to the testing adds a lot of variables.
Been using it for years on a workstation as an extra backup point. No issues. Works great.
Linux md + XFS is the only way to go. All other solutions are inferior. ZFS sucks and is slow.
The problem with hardware RAID is that when something goes bad, there are no tools... RAID 5 and 6 with mdadm are terrible, especially with fast storage. Some work is being done to fix that, but when it will be included in the kernel is anyone's guess. For now, disable bitmaps and try to use a power-of-2 number of data disks (4+2 for RAID 6, for example). That should fix some of the issues you are seeing.
Does anybody know which codename Debian 18 will have, and what codenames will be chosen when they are through all the Toy Story characters? 🤪🤔
I don't think it's been announced what Debian 18's codename will be. The latest codename I think they have made public is for 14, with Forky. I'm guessing they still have a lot of Toy Story characters to go through.
I am running Unraid with two ZFS pools. I can get almost full read and write speed out of my Exos HDDs. So software RAID isn't a bottleneck nowadays if there is enough CPU power available.
Bro, where are the SSDs?
I stuck with HDDs here as they're most common in home server and NAS use. SSDs change a lot of the performance calculations; since they are so much faster, things like the CPU and bus speeds are much more likely to be the storage limit than the disks themselves.
Look at the NAS Minisforums has coming out! It's legit.
As you are using a PERC card, which is a rebranded OEM LSI MegaRAID 9361-8i, if the RAID controller fails and you are using Linux, you can use an HBA + MDADM to import the array and run it like normal to recover any data you need until you get another RAID controller with the same firmware.
Haha, I've actually been booting my Windows Server off a RAID 1, 2-drive array for like 9 years. This is on a Dell R510 with an H700. I've never heard that you can't boot from RAID, but then again I've also never tried using software RAID.
another great vid
Thanks! Glad you enjoyed the video.
There is so much to ZFS that I don't understand that it seems too dangerous to me. I also don't like how they don't focus one bit on performance. You can now expand RAIDZ, but doing so is nothing like expanding traditional RAID. Performance drops with each drive added. It is not a complete resilvering. Dumb.
This is nonsense; the charts clearly show he is testing the controller cache, not the disk performance. That's why writes are so much better with the battery.
All the tests were run over a ~3 minute period, so the writes couldn't just be dumped to the cache without exercising the whole array. The cache helps with sustained write performance, so I'm far from just testing the cache.
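To give an idea of the shape of a sustained, time-based fio write test (this is an illustrative command, not the exact job file from the video; the device path is an example):
fio --name=seqwrite --filename=/dev/md0 --rw=write --bs=1M \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=180 --time_based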
Or just use SeaweedFS and stop wasting HDDs in RAIDs. Also, where's the data for BTRFS RAIDs?
I skipped BTRFS as its RAID 5 and 6 solution isn't listed as fully stable yet. I like how BTRFS RAID is very flexible with adding and removing drives and mixed drive size configs.
So HW HBA cards are only used nowadays in servers for something like a multi-cluster JBOD storage setup where you might want redundant data paths... The reason MD RAID was slow on your dual socket is probably due to the fact that it's old. Everything uses MD RAID in the server world now.
Hardware RAID is really firmware RAID; it's software "burned" onto a chip. Does that sound like it's easy to update? It's software and it will need to be updated. Remember when firmware was a joke? It's software written to the "EEPROM", so between software and hardware is firmware. LOL.
lvl1techs made a great video on why you should not use this. It is not a viable solution due to it not detecting write errors and bitrot. The current hardware RAID cards are not the RAID cards from the past: th-cam.com/video/l55GfAwa8RI/w-d-xo.html
I watched that video when it came out, and probably should have talked about this issue more in the video. While this is a potential issue, it generally seems to be pretty rare in practice due to the error correction on HDDs/SSDs. I have used many hardware RAID cards with data checksumming in software and almost never get data corruption/bitrot. ZFS and other checksummed filesystems are nice and help to keep data from changing, notifying the user when there is an issue. My general experience is that checksum errors on HDDs are extremely rare on a drive that doesn't have other issues. Keeping bitrot away is one reason why I generally stick with ZFS as my default storage solution, and try to use ECC if possible.
@@ElectronicsWizardry ECC will not do anything against bitrot. Bitrot refers to the gradual corruption of data stored on disks over time, often due to magnetic or physical degradation. ECC memory primarily protects against bit flips in system memory, not storage, so it does not prevent bitrot.
ECC memory for ZFS is recommended so that bit flips don't get written to disk. This is an even more extreme measure, since bit flips in memory occur even less often than write errors, so on the one hand being very lax about data integrity by using hardware RAID cards, but then suggesting ECC memory for ZFS for an even less likely issue, is very strange.
HDDs don't have any end-to-end error correction; they just write the data they get. A drive has no way to even know what the correct data would be, so how can it even start to do error correction on it? The filesystem is responsible for that by doing checksums. (Yes, an HDD has ECC, but this is for coping with the instability of reading such tiny magnetic fluxes.)
You should rewatch the video, especially around the 6-minute timestamp; there are no cards that do any checksums anymore.
" I have used many hardware raid cards with data checksumming in software and almost never get data corruption/bitrot" this statement is 1 to 1 equivalent to your grandma saying "I got the age of 90 with smoking a pack of cigrates a day" Good on you, but those empirical statements are kinda useless and specially as "informing" people it only hurts, since i bet you, atleast one person is going to buy a raid card in 2025 because of this video
@@ElectronicsWizardry After Advanced Format drives (512e included) came out, all the different DIF-formatted drives became much less attractive because of how powerful the new ECC algorithm AF drives use is (LDPC, with around 100 bytes per 4K physical sector). This is why most modern hardware RAID controllers don't use DIF formats: it's been made redundant by the checksumming running on the drives themselves.
Yea, ECC doesn't help with bitrot, but you want all memory and interfaces in the data storage system to be error correcting to prevent corruption. ECC and disk checksumming help with different parts of the data storage pipeline.
HDDs do have error correction that is hidden from the user-available space. That's how drives know whether they're reading the data correctly or not. Interfaces like SATA and SAS have error checking as well. The assumption RAID cards make is that the interfaces used by the drives and the data on the disks are checksummed, so the drives return either correct data or none at all. There are edge cases where this breaks down, but it's rare in my experience, and I've used a lot of drives, so if it were common I'd guess I would have seen it.
Also, a huge number of business servers are running off these RAID cards. If there were a significant bitrot risk, I'd guess this would have been changed a while ago.
@@ElectronicsWizardry I definitely agree that it would be ideal to have ECC throughout the entire pipeline of subsystems (all the SerDes structures and all memory levels) the data goes through, but adding complexity to a design can also introduce new sources of errors; there is probably some ideal balance between simplicity and error checking, and I'd be willing to bet we're pretty close to it in most systems.
I suppose bitrot is kind of an all-encompassing term, but the ECC that runs on HDDs is used to check every sector that is read, so it would protect against some kind of surface defect or magnetic bit problem below the LDPC threshold on the HDD causing bitrot, but it'd be narrowed to correcting only those sources of bitrot and not some kind of wider memory or cabling issue. This is all mostly transparent to the user like you said; you'd have to go into SMART and look for reallocated sectors to discover this was happening.
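For anyone wanting to check their own drives, those counters are visible with something like this (drive path is an example):
# look for Reallocated_Sector_Ct, Current_Pending_Sector and the CRC error counts
smartctl -A /dev/sda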
Great video! My few cents on this topic
1) Hardware RAID with writeback cache and single card is pure evil (TL;DR just don't do that) - whatever redundancy you get from RAID geometry is swept away by reliance on a single piece of hardware. When that controller fails, your data is in an unknown state, because some writes might be in the cache and thus lost. Also, when that battery fails, performance drops off the cliff because you can no longer use writeback cache (if you value your data), at which point if you plan for _not_ having the cache, then why rely on that piece of hardware at all? There are ways to do this properly with multiple cards that provide redundancy and synchronize their cache, but that's usually vendor specific and I haven't had much experience dealing with them, those make sense mostly if the setup is large, like multiple SAS expanders with tens or hundreds of drives (and usually hosting a huge Oracle instance or something like that).
2) MDADM is great for OS partitions because of its simplicity, but also for performance-sensitive workloads like databases and is extremely flexible nowadays when it gets to expansion or even changing between RAID levels (if you are ready for the performance penalty when it's rebuilding). But it lacks the features that make stuff like ZFS great, like snapshots (yes, I know you can use DM for those, but has anybody actually tried using that productively?)
3) ZFS is what you use if you actually value your data because of all the features, like streaming replication, snapshots, checksums... but you lose a lot of performance because of CopyOnWrite nature of the filesystem. This is what I use as long as I can live with that performance penalty. This should just be default for everyone :)
4) BTRFS - just don't (unless you are an adrenaline junkie). Every single BTRFS filesystem with more than light desktop use I ever encountered broke sooner or later. A very common scenario is colleagues claiming BTRFS works just fine for them, and me discovering I/O errors in kernel.log that were simply overlooked or ignored, and the resulting damaged data that was never checked.
5) there are many seldom-used but very powerful options for gaining performance, for example putting your ext4 filesystem journal on a faster SSD or adding an NVDIMM SLOG device to your ZFS pool. It might be interesting to see a video on those :)
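A rough sketch of the ext4 external-journal idea from point 5 (device names are placeholders):
# create a dedicated journal device on the SSD
mke2fs -O journal_dev /dev/nvme0n1p1
# create ext4 on the array with its journal on that SSD device
mkfs.ext4 -J device=/dev/nvme0n1p1 /dev/md0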
Some more notes on your testing
1) FIO is complicated, IO engines are a mess. I spent a good few months trying to get numbers out of it that made sense and correlated to the actual workload. Sometimes you're not even testing the bottom line ("what if everything was a synchronous transaction") but just stress-testing weird fsync()+NCQ behaviour in your HBA's firmware. Not sure I have a solution for this; testing is hard, testing for "all workloads" is impossible, and you can get very misleading numbers just because you used an LSI HBA with an unsupported drive, which breaks NCQ+FUA, which makes all testing worthless (ask how I know). Btw, an LSI HBA in IT mode is not as transparent as an ordinary SATA controller.
2) CPU usage ("load") might not seem that important when what you're storing is media files, but when you're running a database it translates into latency... except when it doesn't, which is what I suspect happened to you, because FIO wasn't waiting for writes to finish (so not blocked on I/O), which is what "load" actually measures. On the other hand, if you throw NVMe drives in there, then that CPU overhead will start making a huge impact inside your database/application, which can just throw away that thread immediately, etc.
3) HW RAID controllers are capable of rebuilding just the actual data in an array, but that requires a supported hardware stack (drives and their FW revisions) and functional TRIM/discard in your OS.
4) ZFS performance testing is usually non-reproducible with faster drives or when measuring IOPS, for a couple of reasons like fragmentation, allocating unused space vs. rewriting data, meta(data/slab) caching, recordsize or volblocksize vs. actual workload, and weird double-caching behaviour on Linux. Just try creating a ZVOL (let's say 2 times larger than your testing writeset) and then run the same benchmark multiple times on it; you'll start seeing different numbers on every run, trending downwards :)
If you need the reliability of drives connected to two RAID controllers, you need a real dual-controller storage array.
Thanks for the feedback.
I noticed the card in IT mode seems to use its cache to help single disks. This affected my testing and made it harder to get good results, but in the real world this should only help (unless there is a bug with the cache, but I guess that could affect any disk controller, and the cache on the card is error corrected).
The more storage performance tests I do, the more complex it becomes and the more variables I realize I have to control. I know these results are far from perfect and can't be extrapolated to every use case, but I hope these numbers are better than the often vague descriptions of performance I see online, like "slow writes" or "no practical CPU usage difference". I hope others can use these tests as a jumping-off point for running more tests specific to their workloads.
Load averages likely weren't the best way to see the CPU usage. Looking back, running a CPU benchmark at the same time might have been a better metric. It was pretty reproducible in my testing though.
@@ElectronicsWizardry IT mode shouldn't use cache at all. In the past I used IR mode with RAID0 of single drives to get controller-based writeback cache (backed up by the battery), but in IT mode this is not possible.
Just beware that IT mode is not transparent at all. I am biased against LSI adapters based on my past experience, but they have sort of become the standard for home NAS/homelab setups. For small setups, I think it's better to get a cheap SATA controller. (And lots of AMD boards can switch U.2 ports to SATA mode directly to the CPU, which is awesome.)
@@zviratko I might test it again, but it definitely seems to use the RAM as a read cache on that card when set to IT mode. I should look into this more and might do some testing in the future.
Do you have a model of SATA controller you'd suggest?
Be careful with the results of this video. The system he's using has a very weak and underpowered processor in terms of single-core IPC, and without any additional information his performance problems are most likely localized to his system configuration. I've been using mdadm for years on a variety of different configurations and the only caveats were slightly lower random read speeds and, in some cases, higher disk latency than hardware RAID arrays. Other than that, as long as you have a decent CPU (a 10-year-old Opteron is hardly a good test candidate), it's a viable solution for budget-conscious individuals. Fun fact: Intel maintains the entire software stack for mdadm and it's implemented as part of their Intel VROC solution on high-end server-grade motherboards. One other caveat is that nearly all modern hardware RAID solutions do not scale with high-end solid-state storage devices and other NVMe drives. Using these types of cards will bottleneck pretty quickly and hit diminishing returns. I've worked with distributed storage systems for years and stopped using hardware RAID arrays almost a decade ago.
That Opteron server might not have been the best choice here, as it was probably getting too slow for a realistic test bench. I was trying to stick with a slow CPU and slow HDDs, as that seems more common in NAS units and home servers. SSDs scale differently in performance, but I stuck with HDDs for this test. Unfortunately no single platform can be used to extrapolate results for every use case, but I try to get the best results I can with the hardware I have.
Hopefully I can get something like an LSI 9560 and a few NVMe drives one day to see how a modern NVMe RAID card compares to software RAID on a high-speed CPU.
During my testing I didn't notice the CPU threads being maxed out, while the disk usage was maxed out.
Err... Windows has no problem booting from a RAID-like volume, and has been able to do so since the Windows XP days... So that's just plain misinformation you're spreading there. It's a bit fiddly to set up, but it's very much possible.
How can Windows be set up to boot from a software RAID volume? I've checked a lot of sources over the years and have yet to see a good way to do it. Storage Spaces isn't bootable to my knowledge (I think it was set up in a bootable manner on a Surface system, but I don't think that counts).
@@ElectronicsWizardry You're right that Storage Spaces isn't usable for booting yet. Windows does however have the older RAID-like tech introduced in XP, named Dynamic Disks, which can be. It supports simple, spanned, striped, mirror and RAID 5 volumes, and most importantly here, booting from it is supported. But for all usages except booting from a mirror set of dynamic disks, it's also so old that it's deprecated, though not yet removed either. It's even still there in the current previews of Win12, and probably isn't going to actually be removed until Storage Spaces drives are bootable.
As for how, just install on one drive. Once in Windows, convert to dynamic, initialize another drive as dynamic (but no partitions), and extend the system drive over to that drive. It does however NOT work with dual booting, because it needs both the MBR and the boot sector of the partitions to work. Normally you set GRUB to overwrite the MBR and chainload the boot-sector version, but that does not work with dynamic disks.
Yea, you're right, there are ways to do a mirrored boot, but compared to Linux the options are limited, it's mirrored only, and it uses a deprecated RAID method in Windows. I think it's reasonable to say that a RAID card can be a good option if you want to boot Windows from a RAID array.
Thanks for the reply
@@ElectronicsWizardry No no. You misunderstand. Only mirrored boot is still what MS considers current with this method. You CAN boot any dynamic disk type, including RAID 5 style, and the method as a whole is not deprecated. It's only for OTHER USES that it's deprecated. Like you, I'd still recommend a RAID card; that's not the point. But recommendations are different from what's possible. My point is simply that you shouldn't claim it can't do it when it's very much possible and has been for almost a quarter of a century.
@danieljonsson8095 Let me give that a try then. I ran into issues, tbh, the last time I tried this method. Thanks for the correction.
I mean no disrespect by this but it's time to shave your head bro. I promise It'll look better than what you've got going on now.
I am glad that the YT algorithm recommended my this channel. Nevertheless you should hit a gym or at least do 20 pushups a day.
2 minutes ago 😅
Hardware RAID is dead! Ceph/ZFS leads everything. I moved away from both software and hardware RAID about 4 years ago. I'm happy with my ZFS in my homelab!
Yea, I'm a big fan of ZFS/Ceph; for much of my use I go with ZFS. For many of my uses, squeezing a bit more performance out of an array is less important than features like ZFS send/receive, and I am very used to managing ZFS.
Also, next week's video is gonna be about using Ceph in a small cluster, so stay tuned.
@ElectronicsWizardry The redundancy and ease of management that ZFS offers are truly unmatched compared to traditional RAID. If you're replicating something like RAID0 or RAID5 and want to avoid sacrificing performance by skipping a ZRAID setup, there’s absolutely no reason to stick with RAID. With ZFS, you get almost the same speed but with far better security!
For me, proper error correction is non-negotiable! I refuse to rely on systems that don't offer it properly. While I do like BTRFS, it's still a no-go because I prioritize both security and performance. Unfortunately, BTRFS RAID setups, as I would need them, are still too unstable. Honestly, there's no reason to move away from ZFS anyway, even if BTRFS eventually becomes stable for certain RAID configurations.
Btw, I like your recent style change with this fancy beard! :-D
MinIO is also awesome. I have all my media stored on a MinIO cluster
One of the biggest problems with hardware RAID is that more sophisticated RAID modes are locked behind an extra license purchase.
Yea, that can be annoying on a lot of cards. Dealing with licensing is a pain in IT. I often try to stick with free, open-source products if I can, even if I have the money available, just to save myself the hassle of licensing.