Synology NAS FAIL Adventure
- Published 28 Nov 2024
- My Synology DS418 NAS failed again, but with some really bizarre symptoms.
Previous drive fail: • Synology NAS Western D...
The C2000 bug: • EEVblog #1288 - Synolo...
NAS thermal measurements: • WD Red NAS Thermal Mea...
Western Digital Red NAS HDD teardown: • EEVblog 1398 - Western...
If you find my videos useful you may consider supporting the EEVblog on Patreon: / eevblog
Web Site: www.eevblog.com
Main Channel: / eevblog
EEVdiscover: / eevdiscover
AliExpress Affiliate: s.click.aliexpr...
Buy anything through that link and Dave gets a commission at no cost to you.
T-Shirts: teespring.com/s...
#ElectronicsCreators #Synology #NAS
After years of running RAID arrays, from cheap ones to enterprise systems, this will always happen when a drive partially fails: not enough for the system to say “crap, you’re dead”, disconnect it and flag a warning to the user, so it just sits there trying to write to the one bad drive and bogs the system down. Only when a drive completely fails does the array finally realise it and shut it down. Stupid but it happens :)
That can happen when the disk "works" but is super slow. The system never gets an error from the disk, so it keeps waiting for the operation to complete.
Since SATA drives do not have a big command queue, the operations still complete before the system timeout, so the system never drops the drive.
So if the timeout is 15 seconds, at the maximum of 32 commands at 4k each, that is 128kB in 15 seconds, or roughly an 8kB/sec trip limit. That is if the drive supports the maximum of 32 NCQ commands and the system uses them. In reality it can be less.
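Putting rough numbers on that worst case (a sketch only; it assumes the full 32-slot NCQ queue with 4 KiB commands, and real queue depths are often lower):
# 32 queued commands x 4 KiB each, with the drive only just beating a 15 s timeout on every batch
echo $(( 32 * 4 ))        # 128 KiB serviced per timeout window
echo $(( 32 * 4 / 15 ))   # ~8 KiB/s effective throughput floor before the drive would ever be dropped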
Yep, that was my thought too. Broken enough to take forever but not enough to be a proper "fail". Had a fair few drives do that over the years; they never quite died, but past a point they never quite worked properly either.
That is where buying the "SAN" drives can help, they have TLER (Time Limited Error Recovery). They give up and error out earlier, which tends to be what you want in a RAID array.
@@DuskHorizon TLER would not have helped. The commands get executed within the time frame. IIRC TLER is 7 seconds. If the command executes in 6, it passes, but you get a few kB/sec of transfer.
I have a Synology NAS and this happened to me, the drive was flagged as bad with 'timeout' errors.
As an enterprise storage guy, I have seen this with other storage arrays where a disk goes bad and starts throwing garbage on the bus, which took a whole disk shelf offline. (Higher end storage shelves can detect this and cut power to a slot.)
When this happens you can get very weird behavior.
I kinda like the pissed off Dave mode
I enjoyed the pre-whining 🙂 "Go Ahead, Make My Day. Write your stupid comment" Lol
That's the default mode.
Cracked up when he said he threatened it with physical violence. Time for a beer I think Dave :D
Just bear in mind if you are using WD Reds, that they changed the design of them a few years back, and the new design is not suitable for use in NAS devices with more than one drive. After a lot of push-back, they re-introduced the old design as "Red Plus".
So if you are replacing the drive, make sure you get a Red Plus, Red Pro, or Gold; or a Seagate Iron Wolf, Iron Wolf Pro, or Exos.
I've listed the product tiers for WD and Seagate from minimum acceptable to best. Last time I was looking for drives, the more premium tier ones were actually cheaper. Red Plus is the same model as the Reds you currently have in there.
Exos was cheapest when I filled my NAS. It varies.
I wonder how bad that bad drive's SMART readings are. Glad nothing was permanently unrecoverable.
I have an HDD from 1998 and it still works perfectly... for the last 10 years or so I've been changing HDDs at least once every year or two... the quality of electronics has declined so badly lately, it seems like the "cold war" era engineers have retired...
There were duds back then too. My Dad once (around that 1998 timeframe) came home with a proverbial spring in his step; he'd just bought a new computer (Mom was livid, it cost a month's salary). The HDD let out literal smoke during the Windows installation.
Your 1998 hard drive probably isn't trying to cram multiple terabytes into the same sized platters and using increasingly more advanced signal processing to get the data back off. remember - hard disk storage has increased by orders of magnitude but the physical size of the devices hasn't changed since the 80s. If you think modern hard drives are less robust, go learn how they work, you'll come away surprised they work at all 😄
@@ncot_tech I know how HDDs work, I am a firmware dev for embedded devices... the problem is that precision and materials engineering has come a long way since the 90s, and with that it's natural to expect that even with higher data density and transfer rates, an important device such as an HDD would be way more robust and better than the HDDs back then. The real issue here might be "planned obsolescence" and "cutting corners" to stay competitive...
@@MLeoDaalder True that, but again, failures nowadays are more common.
@@Hobypyrocom My experience is completely the opposite. You used to have to do scandisk ALL the time back 20+ years ago. Can't remember the last time I had a HD failure.
It used to be that you had to use drives with Time Limited Error Recovery (TLER) in RAID configurations to avoid this exact problem. Not sure if your drives support/are enabled for TLER.
The number of times that people actually *do* confuse a RAID or similar storage solution for a "backup" is alarming though. So I wouldn't be too hard on the commenters pointing that out - I even made a video about it way, way back in the days...
Your login web page may have just been cached, not actually loading the site from the NAS CPU. Always good to check it in incognito.
This isn't a Synology-specific thing; it can happen in Windows as well, where it bogs down to a crawl because of an edge-case drive that can't be read properly, even if it's not running as your main (operating system) drive.
Thanks for taking the time to make this video.
In ancient times some drives "failed" nastily, in a way where they wouldn't signal an error but would do things very, very slowly, so if there is no timeout the system will keep trying to work, just 1000x slower.
My DS918 was acting bizarre the other day. After thorough diagnosing I found that the power brick was failing (indicator light was flickering). I ordered an off brand power supply and was back in business the next day. Lots of anxiety having a NAS down that you rely on for work.
What I've found is that modern hard drives do their darndest to hide any of their pain from the user. Sometimes even hiding it from diagnostic software. Your only clue is excessive time to respond. Why Synology doesn't have a time out - you'll have to ask them. They probably also have a threshold where it takes a number of read failures to trigger a fault.
Yeah that's the main difference with enterprise/data center drives, NAS drives and consumer/desktop drives. Not so much the hardware itself but the firmware. Enterprise drives have firmware that fails fast to allow for the RAID controller to deal with it, desktop drives spend ages trying to re-read bad blocks.
Synology will just tell you to buy Synology brand drives to be sure it is fully supported.
I've got a four drive TrueNAS machine and have had to replace three drives over the past 8 years. Only 1 of those drives ever gave me a warning before it happened. And the only way I noticed something was wrong was when file transfers took forever. Two of the drives failed because their spindles stopped spinning so at least the machine could tell there was a fault with them. The last drive just started giving SMART errors so it was me who decided to replace it.
Home NAS setups need to make it really really obvious when things are going wrong. We stick these boxes in cupboards, under a pile of wires under our desks and then forget about them. They're not in a data centre being monitored 24/7. I think they should make angry beeping noises or something persistently irritating so their owners can figure out there's a problem. Blinking LEDs and email status messages that get lost in spam aren't the way.
@@KarlBaron On most NAS drives you can adjust TLER. I had to set TLER for my disks to 7 seconds, but needed to add smartctl commands to the /etc/init/enableTLER.conf file. But yeah, this should be default behaviour on Synology NAS devices.
@@htwingnut yep NAS drives are typically tuned closer to the data center drives with their firmware so that you don’t screw up RAID rebuilds and such, but since they’re still consumer oriented you can’t quite rely on them without researching the models because occasionally you get someone like WD selling you SMR drives
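For reference, on drives that support SCT Error Recovery Control the TLER-style timeout can usually be queried and set with smartmontools; a hedged sketch (/dev/sdX is a placeholder, and not every drive accepts these values or keeps them across a power cycle):
smartctl -l scterc /dev/sdX          # show the current read/write error-recovery timeouts
smartctl -l scterc,70,70 /dev/sdX    # set both to 7.0 seconds (values are in tenths of a second)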
For the failed drive, see if you can run something like SpinRite on it. Often if a drive runs into a few errors that cause the "Reallocated sector count" to increase, it will cause some NAS appliances to freak out, because the drive will just hang rather than immediately returning an error, and thus it can take a while for a NAS to give up. But if the drive works through those sector issues and the counter stops going up, then you can often get the drive to run again. Though at that point it is best not to trust the drive, but I have recovered arrays where 2 drives failed with 1-drive redundancy.
One of the drives with that issue, which a NAS had rejected, I used for a number of years (after completing a level 5 scan using SpinRite on the 2TB drive) to basically test fresh Windows installs on various systems: if a strange issue could be software related, I would just disconnect the original drive, connect the old rejected 2TB drive and test a fresh install of Windows.
I largely did that for a while until I could easily get cheap SSDs; now I use a cheap 240GB SSD for that purpose.
Thanks Dave, sometimes there are just strange errors which we never experienced before. You fixed it. That's the important part. Take care.
Well, the OS itself is actually stored on the HDDs, so if a drive fails, the OS comes along with it.
All Synology NAS units usually come with only a 1GB DOM, just enough to load their proprietary bootloader.
Some drives, when they're failing, will fail into a state where they work, but not really. It's nothing Synology could do - some drives lie and tell the host controller that it's got the data and please wait, it's coming. Of course, it hits the bad spot on the disk and never quite finishes reading the disk, so it retries and retries and hangs the system. And while it's hanging the system, it's hanging the controller, because it's supposed to have the data ready. This is highly dependent on the drive firmware - you never state what drives they are. Most of the "NAS" drives will not do this - they will return quickly, with an error or not, while desktop drives often will hang the system trying to get at the data. You can't really implement a timeout as it's a hardware failure - the only way to fix it would be a complete reset of the hardware itself as it's effectively dead. This means even if you hot swap the drive, nothing will happen.
Yeah, that’s a drawback of software raid.
Lost my whole photo collection last month - 2TB gone after my WD Red drives died in my RAID 5 setup. What a punch to the gut.
Been in IT over 20 years and I can confirm this is a thing that happens. Sometimes drives fail in a way where they don't outright die but instead take ages trying to seek or write, timing out, and repeating. It's just one of those issues you have to either just be familiar with or be lucky enough to have some sort of log you can get to. It's annoying but not Synology's fault. It's also expected behavior that you would have a hard time logging in because I'm pretty sure Synology installs their OS to the drives themselves, so if the RAID array is unresponsive the OS would be, too.
I've got no excuse for the dumb third party locator program. I'd rather just log into the router or whatever is handing out DHCP and get the info straight from the horse's mouth. It's also good practice to do static reservations for important equipment and even better practice to just do static IPs outside of the dynamic range.
Just my 2c here. IT engineer of 34 years. I have used NAS units at home over the years and have never really been happy with them, plus have had unexplained instances like yours. I have lots of data to store (music mainly), so I built my own server. Dell T110 tower server, these are dirt cheap on eBay. They are quiet and easy to work on. It has an 8-channel disk controller and can use regular SATA drives. So I maxxed it out with drives, installed ESXi and then built a Windows Server 2012 R2 file server. So I use the file server for my storage. The added bonus of this is that I can run Google Drive desktop and select specific folders for live backup to the cloud. Plus I use the server for other VMs, a pfSense router, Pi-hole for blocking ads, and so on. I have a small APC UPS to filter its power and give some backup time, and the whole setup has been sweet for the last 6 years.
I admin a TrueNAS machine at work, which is why I use a Synology at home. I don't want to have to deal with admin stuff in my free time. The Synology works fine, it's just Linux with a pretty UI. I've seen the exact problem Dave had here with both Windows and Linux boxen as well, a drive goes into recovery instead of reporting the error to the machine and the OS freezes up.
Synology does everything you describe as well - they have a built-in Google Drive client (along with a Dropbox client that actually works better than the official Windows desktop client with a large number of files). You just log in, check the directories to sync and it does it automatically.
Synology also has everything else you described like built-in docker support, VM support etc etc, all super user friendly. I run a linux VM for experimentation as well as some random docker stuff like homebridge, the unifi controller, etc. My Synology is from 2018 and has been rock-solid.
I guess you have never seen Dave use a computer if you think he can setup a server.
No more ESXi, the personal-use licence is canned now.
I recommend with the Synologys you use Hyper Backup to back up to another file share, an external drive, another NAS, or even Google Drive.
I never thought of working straight off a NAS like that. I might have to once I get a better life situation, because I have more than 1 PC.
I was also thinking of a NAS like that, which hopefully would be easier than maintaining a PC to share with a family, but I don't think the others would be able to fix it if it goes down.
Working off a NAS is fine if you aren't trying to move giant files. They are definitely easier to maintain than a regular PC though, the web interfaces are designed for end users, and when drives fail you just pop the drive bay open and swap the disk. It doesn't require completely disassembling a PC and fixing it afterwards.
It's ideal for media and documents. Plus the Plus series can back computers up at a full level.
@@QuickQuips @ncot_tech Thanks for the perspective!
To really work off a NAS productively will require a good NAS, ideally with SSDs or enough HDDs, a good fast network card, and a good LAN, ideally 10GbE. Otherwise it may be slow enough to cause aggravation.
Drive-related problems on regular PCs would always lock everything up and make the system unusable. This being a dedicated piece of hardware, you'd think they would have that kind of issue under control and well managed. Also, I'd be a bit wary of running the check if there's anything on the array that's crucial. Those file system check programs typically just take an axe to the data and chop out all the bad parts...
I've worked with Synology drives for many years and I do rather love them. In your case, for disk 1, I'd guess the drive had not completely failed. After some number (probably a large number) of failed read attempts, it would actually respond with the data. For whatever reason that delay was not sufficient to trigger a failed state in Synology's DSM OS. It might even be a tuneable parameter.
This is something I came across on a laptop just the other day (and have encountered before). A failing disk drive takes longer and longer to respond but may not throw actual errors at all for a while -- then you start getting timeouts.
My first was a FreeNAS (before the name change). My second was a QNAP, I figured something designed for it would be better.
Big mistake. I got so much trouble from a proprietary consumer solution. From now on, any future ones will be TrueNAS. That one is still running perfectly.
The upside to Synology is that underneath it's nothing proprietary - it's just Linux mdraid and BTRFS with a pretty UI. Synology even have detailed instructions on their website on how to put the drives from a Synology into a Linux PC and mount them if you need to do recovery. One of the reasons I'm happy to use Synology at home even if I use TrueNAS at work.
@@KarlBaron With TrueNAS, you can pull the drives out and recover them in any FreeBSD system. Linux systems that have zfs support will probably also work.
@@KarlBaron Same as QNAP. I've had the displeasure of trying both.
Never had issues with the OS. It's the proprietary bits that cripple things and are unstable whenever you try to do anything more difficult than an SMB share with a local user.
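For anyone curious what mounting Synology drives on a plain Linux PC actually looks like (as mentioned above), a rough sketch only; device names are examples, the exact procedure in Synology's knowledge base is the authoritative one:
sudo apt-get install mdadm lvm2      # tools for Linux software RAID and LVM (SHR pools sit on LVM over md)
sudo mdadm -Asf                      # scan the attached drives and force-assemble any arrays found
sudo vgchange -ay                    # activate LVM volume groups, if the pool is SHR
cat /proc/mdstat                     # confirm which md device holds the data volume
sudo mount -o ro /dev/md2 /mnt       # mount read-only to be safe; md2 is typical for the data volume, not guaranteed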
I do like that the glass front door breaking warrants a full evacuation of the building
People just hear an alarm and think it's a fire alarm. First time it's ever happened here. People did say that it sounded different to the normal fire alarm though.
Years ago I got a 4-bay WD RAID 5 thing… two weeks before ThioJoe made a video about why RAID 5 won't cut it anymore. About a year later it did what he'd warned about: It worked when a drive failed but then while rebuilding encountered a read error and completely gave up. On the bright side, I'd paid $26 for the extended warranty and they refunded the full purchase price, and by then high-capacity drives could fit into my new case anyway. In any case, best of luck to you.
RAID5 is fine as long as you're using ZFS or BTRFS with checksumming and scheduling regular scrubs. The scrubs act the exact same way as a rebuild does and exercise the drives in the same way, so that any errors or weaknesses are detected before you've lost redundancy.
I admin a ZFS pool with 22 drives, the drives are scrubbed monthly, never lost a sector of data - the UBE rates reported on the white papers for drives that are quoted in the "RAID is dead" videos/blog posts are worst case scenarios so they don't have to honor the drive warranty.
@@KarlBaron ZFS isn't RAID5 though, it is RAIDZ1 (or Z2 etc), which, unlike RAID5, gives you protection from non-catastrophic disk failures. That is where the drive responds, but gives the wrong answer. RAID5 will know there is a problem, but won't know which drive is giving the wrong answer. RAIDZ1 will be able to figure that out.
You should not use BTRFS's RAID5 equivalent as it is not stable.
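A minimal sketch of the monthly-scrub idea mentioned above on ZFS (the pool name "tank" is a placeholder; TrueNAS exposes the same thing through its built-in scheduler):
zpool scrub tank      # read every block and verify it against its checksum, repairing from redundancy where possible
zpool status tank     # shows scrub progress plus any repaired or unrecoverable errors
# example root crontab entry for a scrub on the 1st of each month at 03:00:
# 0 3 1 * * /sbin/zpool scrub tank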
It's because it relies on SMART data. Steve Gibson from GRC has a good write-up on the failings of SMART-enabled drives.
I'm saving that comment in case your cloud "backup" gets vaporized by big pharma.
I've had this same issue with two separate Netgear ReadyNAS units. Failed drive knocked them offline but all indicators showed fine. Learned to pull one drive at a time to find the bad one.
I believe the Synology NASes store their OS on the HDDs, so if the drive is having trouble but hasn't failed completely you can have issues exactly like this.
Yeah, which is also why Synology asks if you want to set up a new NAS after hot-swapping in HDDs (with the Synology OS installed). It probably boots into a small "initial setup OS" if it can't boot into the OS from the HDDs.
I have referenced your older Syno clock degradation video many times in the QNAP forums...very useful (QNAP units had the same issue)
Sorry to hear about your hassles! Am I right thinking that Synology install the OS across the drives? I thought they loaded it on a separate NVMe or dedicated drive?
The Synology partition is what holds the operating system; it usually installs on drive 1, which is your failed drive - if that fails then it can't run the operating system properly. When you remove drive 1, you are ALWAYS required to reinstall the Synology partition. It's a hidden partition on the RAID. Removing drive 1 removes the operating system; that is why it tries to install the operating system again. All the data is still available, it just needs the operating system reinstalled.
That's why on my NAS, the OS is on a 4GB disk-on-module. All the HDDs could fail and it would still start up fine.
Even Synology aren't daft enough to have a single point of failure like that. The OS lives on a RAID1 across all drives.
The HDD S.M.A.R.T. status sometimes will not report the correct fault, because the faulty part is the read/write head itself, not the disk. The system will sit waiting for the faulty read/write head to read the faulty disk. If the system doesn't have a timeout, it may just keep waiting forever, and the system freezes with no response.
Dave, here is your answer from a software engineer's perspective: this software-assisted RAID implementation is probably "incomplete" to some extent - RAID can recover from many failure states, but a few key software steps in the overall failure-detection algorithm seem to have been unaccounted for. Basically, this specific failure scenario probably was not included in all of their test cases. The reason being: someone was lazy, inexperienced or completely overburdened with work. The last option is actually somewhat common - sometimes entire, major products rest on the shoulders of one or two developers, and that's probably not a good thing. It should always be a whole team - and it should've been done in an "agile" way, even if non-software-dev people seem to hate that term.
I've been doing computers for 20+ years professionally. This is not at all the first time I've heard of an issue like this. I've had a few computers that would not boot to BIOS if a particular broken drive was plugged in. This is the same situation. The driver chip on the logic board is likely broken in such a way that it works enough to send firmware information, but won't process further. This causes the CPU to expect data, but either not get it and wait for it, or be interrupted by the drive locking up the system. I've also had this issue a LOT with laptop wireless cards ( mPCI/mPCIe ).
PS: Good documentation (telling and details) of your story!
Dave, configure a static lease for your NAS.
And you could look into repurposing a PC and running TrueNAS on it, it's fairly easy to setup basic stuff, it's set and forget just like many off the shelf stuff, works great, and gives you the option of easily upgrading it in the future.
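A minimal example of the static-lease idea, assuming a dnsmasq-based router or DHCP server (the MAC address, IP and hostname here are made up; most consumer routers expose the same thing as a "DHCP reservation" page):
# /etc/dnsmasq.conf
dhcp-host=aa:bb:cc:dd:ee:ff,192.168.1.50,synology-nas   # always hand this MAC the same address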
I've had a whole bunch of Seagates from 2010 that wouldn't give a SMART report and chugged along until they died, inside an HP StorageWorks enclosure. They just reported back false positives. I opened a few of them and the platters were totally trashed.
I have an HP server at home, and I was testing some old hard drives to see if there was some data left, and one of the drives made the whole server freeze. There is something to do with S.M.A.R.T. that works just enough to make the computer wait for a response that never arrives. When I hard reset the system it would hang on boot while detecting the drives during array initialisation. And it only started working again after I removed the bad hard drive.
Seriously though, TrueNAS :) lol sorry Dave, had to do it.
Back everything up before you do anything else... especially a rebuild. The most likely time for another drive to go is during the rebuild (in which case you'll lose everything).
Be sure to back everything up to a good quality cassette tape!
You would think something as potentially important as a NAS would have a more real-time alert popup, something like an SNMP window. Used Synology for years and so far (touch wood) only issue has been after a Windows update (well documented resolution).
I had a similar thing happen to me with my Dell PowerEdge T420, a failing WD purple drive would cause the whole system to lock up. I thought it was bad RAM, but a few days later the RAID controller told me the drive had a SMART failure and it automatically powered it down.
I'd be getting ready to replace the other drives, too
That's why I always have a backup disk outside the NAS; it's a Seagate Enterprise 8TB. Last month the power brick also died, so I replaced the damn thing and was back online.
Also, don't forget to update the software on your Synology NAS; it looks like you're using a fairly old version. I have the exact same NAS and the interface looks totally different, but I keep mine updated even though mine spends most of its time off.
Ok so your system reported "Bad Sectors". This happens when the drive fails. The reason your NAS can boot and operate is because it's mostly only reading data. It does NOT matter if it's your computer, this NAS or any other system. You need to have your Synology scrub all contents regularly so it knows when bad sectors start to form. It only knows this when files are accessed; scrubbing does this. In Synology it's under Storage Manager -> Storage Pool -> Data Scrubbing. I personally have it set up for the first of the month. It reads all data front to back, and when it runs into bad sectors it will "repair" the data somewhere else on the disk and flag the sector. You then get a notification that bad sectors have "increased", meaning the drive is on its way out, giving you a heads-up to begin replacing that drive before issues like this occur.
For more info I recommend a Tekzilla video from many years ago about 'Are green drives killing your NAS'. Don't worry about the title of the video, but they explain the scrubbing in detail.
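For what it's worth, DSM's Data Scrubbing is believed to boil down to the standard Linux mechanisms underneath; on a generic mdraid/Btrfs box the equivalent looks roughly like this (md2 and /volume1 are typical Synology names, not guaranteed):
echo check > /sys/block/md2/md/sync_action   # ask mdraid to read and compare every stripe
cat /proc/mdstat                             # watch the check progress
btrfs scrub start /volume1                   # on a Btrfs volume: verify all data against its checksums
btrfs scrub status /volume1                  # report errors found and repaired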
Time to upgrade that consumer NAS to enterprise server equipment... Time to check the dumpster room again. Love the rant on future comments about 'Truenas' and 'Nas is not a backup'. 😂😂 👍
Sadly, I ran into a very similar problem quite recently.
In my case the external power brick failed, and with no spare power brick available I moved the HDDs into another NAS with a built-in power supply.
Initially it looked good, but quite soon it gave 1x orange LED on one drive, 2x red LEDs on two drives and one drive green.
I also noticed the slow response of the NAS system; I believe the NAS is waiting on the respective HDDs to respond.
Even worse, my drives are the infamous WD Red with SMR, known to cause performance issues when things do go south.
Although I also suspect the CPU of the NAS is not that powerful to deal with the many requests (errors) from the HDDs.
I do wonder what drive (brand & model) failed in your case; it would not surprise me if it were (also) an SMR type.
Odd thing is, in my situation, before the power brick failed all the drives were green and perfectly operational, and they only gave problems after the power brick failed and the HDDs were moved to another NAS.
I wonder if the old(er) NAS was not checking the health of the HDDs that "properly"?
Crazy, I have the exact same model and had this same problem last month.
Sounds like the drive has a bunch of bad sectors it was trying to recover, one after the other. I've seen PCs running very slowly for this reason, the drives had hundreds of bad sectors in SMART but weren't completely dead. Maybe a dying head/preamp or something.
Don't know if you were using a NAS-grade drive with TLER or not; it may have helped in such a situation, or maybe it won't if there's a failing head and it's just getting error after error.
When mechanical HDDs were the norm (Win XP days) I saw several computers with bad drives that would run incredibly slow. Like take an hour to boot. Yet they never actually crashed or had any read errors. Running the OEM diagnostics (like Seatools) would eventually turn up some errors after running for like multiple days. Weird.
If your NAS is using RAID 5 with a parity drive, then the CPU has to constantly compute the real data using the parity info.
I'd suggest buying WD Easystore drives (or similar ones) instead. You can take the drive out and use it in a NAS; all you need to do is cover a couple of pins on the power connector to get it to work. The drives themselves are basically the same ones they use in datacenters, more reliable than most. Much cheaper too.
On Synology the OS is saved to the drives. So when drive 1 failed it corrupted it. I love the simplicity of Synology; I own a 918+ and a 920+.
The OS works without any drives in it. The "slowness" caused by the failed disk not responding is what the problem is in this case.
@@EEVblog2 DSM is on the drives. There is a timeout and it should've failed the drive without any issues. NAS drives have lower timeouts (sub 10sec) than desktop equivalents, what drives do you use?
@EEVblog2 DSM is mostly on the hard drives. I have a DS1621+ and had a similar issue lately due to 1 drive having an intermittent fault that Synology didn't detect, but which caused the NAS to run like an absolute dog. Took me 3 days and the bad drive failing completely to figure out the issue, I was not amused.
@@EEVblog2 Depends on what you call "OS": Without drives, it can only show the setup screen. Everything beyond that is installed to the drives.
That's why FreeNAS boots from a USB drive. Easy to backup and replace.
If you asked Synology, they'd probably say they won't bother to look at it because it's out of warranty and/or your disks aren't on their approved disk list. I'd hope they'd look into something like this. I love my Synology NAS boxes, but I hate Synology as a company. I had a critical error with their DSM but they just ignored me because my disk wasn't on their "approved" disk list. It had nothing to do with the disks in the system.
My DS415+ was plagued by the famous C2000 issue. It was barely one month out of warranty and all their support had to say to me was "we can help you in selecting a new Synology device". Fixed the NAS myself and it's still running fine to this day. But you can bet that my next NAS won't be a Synology again.
I had an issue a bit like this just the other day - this drive was one in a ZFS mirror pair in Proxmox but I think the issue (in both my and your case) is likely in the Linux SATA driver. It seems to lock up if it gets these kind of errors, but in my case I could still get in to the server (although it did fail to boot twice too with a kernel bug, but came back up on reset), it’s just certain things wouldn’t load like the ‘Disks’ page in the Proxmox web UI. The dmesg log was full of SATA errors. Seems in certain cases it just keeps trying and trying, slowing stuff down a lot, and never gives up and marks the drive as bad.
It may be that Synology’s software is trying to query something and the kernel is just blocking - perhaps their software is more monolithic so it stops anything from working, unlike Proxmox where only parts of the web UI wouldn’t load for me.
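If you can get a shell on the box while it's in that state, a couple of standard Linux checks usually show this failure mode (sda is a placeholder; the 30-second figure is the usual kernel default, not a Synology-specific value):
dmesg | grep -iE 'ata|i/o error|link reset'   # a sick drive shows endless command timeouts and link resets
cat /sys/block/sda/device/timeout             # per-command SCSI/SATA timeout, typically 30 seconds
# the kernel retries after each timeout, so a half-dead drive can stall I/O for minutes without ever being failed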
I have an older DS1515+ and from messing with it, drive 1 runs my OS and configs, as I don't have built-in memory to run the OS from, unfortunately. So if drive 1 dies on mine it needs a reset, then I can pull my config back from the Synology cloud config backup and it's back to normal again. Does yours have built-in storage like an M.2 at all? I didn't look your model up, but throwing this out there to try and help.
Nope, doesn't have that. Has external USB 3 only.
In a Synology all drives run the OS. This is not configurable.
As requested, I prefer to run a box that can do TrueNAS or similar so I'm never really bound to a vendor like Synology. I'll still have failures, but can recover/rebuild on any hardware I choose.
Ditched my Synology; it just used to eat drives. I was replacing 2 a year on a 6-bay unit, and I wasn't buying cheap drives. Moved to TrueNAS with SSDs, no problems since.
I have a 6-bay QNAP which threw 2 drives. I kept them and eventually, for fun, put all 6 in a home-built TrueNAS box, and it's still going strong without error to this day!
I put 4 3Gig drives in my DS412+ in 2010/2011 and even though it's up 24/7, I haven't touched it since. Probably time to swap in some new drives and blow out the fans. Maybe they really don't make drives like they used to.
Thanks Dave for sharing, appreciate it. Yeah, I did the clock signal modification on the same device... with the manufacturer defect in the chipset... this is pretty interesting.
Wow, that sounds like a massive hassle! I've been looking at upgrading the drives in my NAS for the last few months since it's pretty much full, but now I'm thinking it might be a good idea to replace the box with a new one too. Only problem is that I've been putting it off because replacing the drives is pretty expensive; replacing the NAS as well is just going to make it even more expensive!
Why replace a nas? It'd make more sense to get a new one and keep the old box for backup of the new one.
Had a similar thing happen to a 3-month-old Fujitsu W5010. Total lockup after reboot at the Fujitsu logo. One 2TB WD drive of the data RAID 1 was dying.
Never had this happen before with a similar setup on dozens of W510/W520/W550/W580s with a boot SSD and data HDD RAID 1.
Thanks Dave I look forward to seeing your videos
I have one DS211+ which kept failing disks every few months. I always replaced the disk, before noticing that it was always slot 2 that failed. All those disks were healthy, but the SATA slot is somehow cursed; it generates errors itself.
My Synology NAS is very talkative. Sends me emails about everything. HDD status once a month, backup information once a week and these things.
These bad boys are notorious for wacky LED indications. If a drive is more than 3 years old the LED will go yellow, etc.
Dave, you're doing it wrong:
1) Why do you not have your NAS set to a static IP outside of your DHCP range?
2) Do you have a backup other than your NAS, I am sure you are aware that RAID is not a backup but a means to protect against individual disk failure. What's your plan if your NAS corrupts the entire array when writing data?
3) It's a good idea to keep an eye on the SMART data of all the drives as reported by the NAS and replace a drive when errors start showing.
11:11
Fair enough about point number 2, but if you had your NAS set to a static IP, you would always know the IP offhand to access it by. @@EEVblog2
@@ElliottVeares Never had the need to do it.
IME the SMART data always shows zero problems, and it automatically runs every week and is supposed to alert me anyway if any issues. My drives have just suddenly failed.
Did you set up scrubbing of the RAID at regular intervals?
The errors in the log clearly show it's a physical disk error.
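When the scheduled SMART tests say everything is fine, the raw attribute counters are still worth eyeballing over SSH; a hedged sketch (sda is a placeholder, and attribute names vary a bit between vendors):
smartctl -a /dev/sda                                             # full identity, health and attribute dump
smartctl -A /dev/sda | grep -iE 'realloc|pending|uncorrect|crc'  # the counters that usually creep up before a failure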
I hope you know you have to perform backups on NASes just like on a desktop PC, and keep multiple copies so you can, at the very least, restore or use one or more files from your backup, and at best, restore the entire file system. Hope you didn't totally give up on your Synology. Replace your WD drives with those listed on Synology's website for your NAS.
When drives are marginal, I try to use SpinRite to see if it can help the drive recover itself. I've had drives that have errors, but not enough errors for the OS to complain.
This has prompted me to backup the config of my Synology!
Also if one drive fails and the rest are the same model with the same mileage it's appropriate to expect the others to start cr*pping out soon.
Your login screen seems to be from an old software version. Did you not install all the updates? There has been some strange behaviour like this with older software versions when a drive partially failed. With the newest software version this problem does not occur.
Also, this behaviour only happens when the problems are in the system partition area of the first drive.
Thanks for sharing this Dave. I use a Synology NAS as a backup solution and also install them at client's facilities. While I did wonder about the location of the primary OS partition, I have never actually looked much at the OS-level allocations across the disk drives. This implies that the system partition may not be mirrored and if corruption or hardware failure impacts the drive, one would be SOL. So tomorrow, going to re-evaluate how we integrate these devices into our businesses. So, THANKS DAVE.
All drives have a copy of the system volume, and they're mounted raid 1. So you can boot the system from just a single drive, should you need to. The boot loader, kernel, and some various other things reside in flash - enough to download and install a new system to a set of bare drives, but not all of DSM. (Not sure if that's even DSM or some stripped-down version of it.)
Particularly if the drives aren't NAS drives. If it hits a bad sector and the drives aren't NAS drives the drive will do extended recovery which can take minutes to complete, which can either break the raid or make it hang if it doesn't respond for an extended period. A NAS drive will fail quickly on the bad sector and map it out and not try the extended recovery. (This is why NAS drives are required)
@@DigitalDependance Also, the definition of what makes a drive a "NAS drive" seems to be a bit shady nowadays. I'd say go enterprise instead of that; those are more likely to report an error instead of pretending to be fine when they're not.
@@BoraHorzaGobuchul It's the firmware, and you also presume they use better components for longer MTBF when run always-on.
I would love an investigation of the SATA port multiplier issue that causes a second disk to fail when a first one fails (first and second have no relation to their position or number - the first fails from natural causes and that causes the second to fail as well). Thanks!
On some NASes, like Iomega, Lenovo, EMC and Dell, all the system config is saved on disk 1; when disk 1 fails, all data in the RAID 5, 0 or 1 is lost.
Yeah that's "fun"... I recall having a similar thing happening on a HP-UX mainframe with 40 drives... So yeah, I got to spend some time playing whack-a-mole. Basically the drive mostly fails, but is marginally below the failure threshold.
BTW... urhg you use True-NAS. And urgh... NAS isn't backup urgh!
Would love to see the SMART data read with the manufacturers tool, then you can also see if Windows can access it.
Handling of disk errors is simply too poor on these systems. They have been like this all my time with them, and I think it is not just the software on the NAS.
It seems many HDDs never time out, as if it's their job alone to decide when to give up, and the NAS controller just waits for the drive to either complete or error out.
The drives recover soft errors on their own by retrying; if successful they don't return an error status for the operation, but do bump the soft error SMART counts on the drive. As far as DSM is concerned the operations succeeded, but it may have taken a huge number of retries. Some drives can be configured to record soft error counts above a certain threshold as hard errors; not sure if DSM has any way to do this (it usually requires drive-specific utilities). When I/O operations take long to complete, processes that perform disk I/O (including paging something in or out) will block waiting, and this includes web server threads. I'd suggest checking the error counts on all your drives every six months or so (they're under the SMART status for each drive in DSM as I recall).
The NAS runs a SMART check every week.
@@EEVblog2 Which is about useless - not only is the information not standardised, but on top of the numerous bugs in drive firmware, they also flat out lie.
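One practical way to catch the "operations succeed, but only after huge retry delays" case described above is to watch per-disk latency rather than SMART; a sketch assuming sysstat is installed (standard on most Linux boxes, not guaranteed to be present on a stock Synology):
iostat -x 5    # extended per-device stats every 5 seconds
# a drive that is retrying internally shows await (and %util) hundreds of times higher
# than its siblings in the same array, even while SMART still reports it as healthy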
Always remember: RAID is NOT a backup!
I’ve seen drives fail like this before. It would even allow windows to attempt booting. Somewhere along that boot process it would just go incredibly slow and never fully boot.
Heck I’ve got a drive right now that works perfectly but has always thrown a caution since new.
Sometimes a drive is in denial about being in good condition and thus causes issues further down the line 😂
You missed one - "ooh ooh Dave but cloud backup isn't real backup".
When the last place I worked in was being renovated the builders accidentally cut through the cable all the fire alarm break glasses are wired to and set the alarm off. There was this massive swearing shout and "no no no it's not real no!" as the alarm started going off. Sorry mate, you set the alarm off, we all have to evacuate now, then search room by room to check the site is empty, then verify you being an idiot was the only reason the alarm went off. It's not like the cable wasn't bright orange or anything...
That's the thing. If you don't instill alarm compliance into your staff, then it's a safety issue. If a fire alarm goes off, depending on the building, in 5 minutes it could be impossible to escape from the building at all. Used to work in a company that had an auto-body shop on the ground floor, One of the vehicles burst into flames, and lit up all of the fuel in ALL of the vehicles in the shop. The building was gone in 30 minutes.
I'd be interested to see the drive health status using something like crystal disk info.
Curious as to why the Synology NAS didn't pick it up.
Sounds really iffy for data. X drives missing and still OK? I saw one NAS server just die, and I used foremost for Linux to get as many pictures off it as possible.
I ditched these stupid proprietary boxes ages ago. Got a PC case with 24 drive bays, a couple of SAS controller cards and some extender cards which allow more drives on the cards. FreeNAS as the OS and a relatively modest 16GB RAM and an AMD ryzen. It also had a 10GB ethernet card in there, using copper not fibre.
I had the same problem at a customer location. I replaced it after all the tricks failed. Seems to be an old synology thing.
Synology runs loads of necessary processes, indexing, creating thumbnails etc, which probably don't help drive life. They don't make it easy to disable these features, although you can with SSH and some command-line stuff. Try a ps -ef.......
Wasn't there news a few months ago about faulty PSUs in Synology NASes killing HDD arrays?
The Synology OS is stored on the drives (?), so it's having trouble booting the OS?
Take that disk 1 and run a full SMART test on it; it will likely come up with the disk having had to reallocate sectors. GSmartControl works.
Just as a further test: reseat the drive several times and see if it starts working...if it does, I would think of replacing the NAS.
I built one myself from a used HP Xeon server/workstation. That one all of a sudden started regularly reporting errors (maybe every 2-3 months) on one bay. BTW: SMART testing did not show any problem, and the SAS controller suggested problems somewhere on the bus, not on the drives. My best guess is that this bay's connector went bad... or the SAS cable... or some internal connector... who knows... Anyway, the hardware is 10+ years old and had a 2nd life with me, so now's the time to retire it and migrate to something new :)
Geez Dave did someone salt your coffee today 😅
Western Digital Red SMR drive by chance?