02-22-2017 12:12 PM - edited 02-22-2017 12:14 PM
I have a pair of MX100 drives that I suspect are failing. I've already removed them from service, but I want to confirm my suspicions. The drives were in a mirror, but were managing sequential reads of only 8 to 13 MB/s. The system has also hard-locked a few times over the last month, which prompted the replacement.
I've attached a screenshot from Storage Executive for each drive. It still reports good health for both drives, but I'm concerned about the Raw Read Error Rate, Average Block-Erase Count, Reported Uncorrectable Errors, and Ultra-DMA CRC Error Count.
02-22-2017 02:26 PM
Wow - they've seen some use! Both are almost 3 times over the NAND's rated life.
But yeah, both drives have a scattering of errors. Attributes 1, 5, 187, 196 and 197 all indicate past errors. Given the drives' wear level and the actual issues you're experiencing in use, it probably is time to replace them. I assume the safety of your data is important to you if you're mirroring.
02-22-2017 04:59 PM
Thanks for the confirmation. Someone prior to me just threw them in a db server that likes to hammer disks. So to confirm, 1 is actually an error indicator? I've seen mention that no one really uses that, and that it occasionally resets based on some internal status. Or is it just ignored when it's in the low thousands, not the millions that both of these drives are reporting?
02-22-2017 11:17 PM
Most SMART attributes are cumulative stats. They show the drive's lifetime record of errors. 197 is a notable exception - it logs sectors that are going to be mapped out but haven't been yet.
Anyway, when errors occur, drives should be able to map out the bad areas and replace them with spares. So if those stats are all stable (i.e. not increasing) and you're not experiencing any problems in use, then it's a bit of a non-issue. But if the numbers are increasing, then ongoing errors are occurring. Coupled with the issues you are experiencing in use, and the extreme wear the drives have taken (they're consumer-level drives and not suitable for use in a DB server), it's probably time to get rid of them.
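If you want to check the "stable vs. increasing" point yourself, a rough way is to note the raw values on two different days and compare them. Here is a minimal Python sketch of that comparison. The function name `increased_attributes` and all the numbers are made up for illustration; they are not taken from the screenshots, and you would type in the raw values from whatever SMART tool you use.

```python
# Minimal sketch: compare two snapshots of SMART raw values.
# All values below are hypothetical examples, keyed by attribute ID.

def increased_attributes(earlier, later):
    """Return {attribute_id: (old, new)} for attributes whose raw value rose."""
    changed = {}
    for attr_id, new_value in later.items():
        old_value = earlier.get(attr_id)
        # Only compare attributes present in both snapshots.
        if old_value is not None and new_value > old_value:
            changed[attr_id] = (old_value, new_value)
    return changed


# Hypothetical readings taken some days apart.
earlier = {1: 1000, 5: 12, 187: 40, 196: 12, 197: 0, 199: 3}
later = {1: 1500, 5: 12, 187: 46, 196: 12, 197: 2, 199: 3}

growing = increased_attributes(earlier, later)
for attr_id in sorted(growing):
    old, new = growing[attr_id]
    print(f"attribute {attr_id}: {old} -> {new}")
```

Anything that shows up in the result is still accumulating errors between the two readings. Keep in mind that 197 is not cumulative, so it can also go back down once pending sectors are dealt with.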
There has been a firmware bug on one of the drive models where it showed incorrect stats on one of the error attributes, and the value was reset during a firmware upgrade, but I don't recall which model or firmware - and it was a much lower value than that.
Of the stats you mentioned, the one that's probably not a drive error is 199 (Ultra-DMA CRC Error Count). It's typically a communication error with the host computer (a cabling issue or similar).
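To make that split concrete, here is a small Python sketch that sorts non-zero attributes into "drive side" and "link side" using only the IDs discussed in this thread. The grouping and the names `classify` and `summarize` are my own illustration, not anything Storage Executive does, and the sample values are hypothetical.

```python
# Minimal sketch: group the attribute IDs from this thread by where the
# error most likely originates. The grouping is an illustrative assumption.

DRIVE_SIDE = {1, 5, 187, 196, 197}  # errors recorded by the drive itself
LINK_SIDE = {199}                   # CRC errors on the link to the host


def classify(attr_id):
    """Return 'link', 'drive', or 'other' for a SMART attribute ID."""
    if attr_id in LINK_SIDE:
        return "link"
    if attr_id in DRIVE_SIDE:
        return "drive"
    return "other"


def summarize(raw_values):
    """Group attribute IDs with a non-zero raw value by likely origin."""
    summary = {"drive": [], "link": [], "other": []}
    for attr_id in sorted(raw_values):
        if raw_values[attr_id] > 0:
            summary[classify(attr_id)].append(attr_id)
    return summary


# Hypothetical raw values, not read from the screenshots.
print(summarize({1: 5, 197: 0, 199: 2}))
```

If only the "link" list is populated, I'd look at the cable and port before blaming the drive.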