SSD Failing Rant

Kilobyte Kid

SSD Failing Rant

I have a mix of MX300 and MX500 SSDs in service in a variety of customer systems. In this month's SMART script check I noticed that about eight are reporting 1-5 reallocated sectors. I've worked with Crucial previously and they really seem to downplay SMART reports. Even on systems where I have the Executive software installed and read the SMART data from it, they still argue that the drive is likely not dying and that I should try unplugging the data cable and letting it sit powered on overnight to run its repair/rebuild cycle. I've been on chat with the RMA people and they pretty well argue that the SMART data I'm giving them is not indicative of a failing drive, even when I've explained that the system is unhappy, won't boot, etc.

 

My question is: why is there so much hostility towards a return? And why, when I have shown that two or three SMART reporting tools all report problems and the system is unhappy, do they continue to say the drive is likely not the problem, yet ultimately give in and process the return? I wouldn't normally chat regarding a failure except for the stupid Material Number they ask for. Thankfully I now have them printed and hanging on my wall, because really, who keeps the F*&%ing box to refer back to?

5 Replies
JEDEC Jedi

Re: SSD Failing Rant

As far as I know, some program/write errors and reallocations can happen, but they are typically compensated for by write retries and the use of reserve NAND blocks, so the data should be secure. But when the system is unhappy, won't boot, etc., that is not a good sign, imho. Under any circumstances this kind of storage device should work reliably, shouldn't it?

 

As for the Material Number requirement, it is a bit of an odd one, but forum users can always try to help with it. We have some Crucial drives in service and some boxes lying here and there sometimes ;)


JEDEC Jedi

Re: SSD Failing Rant

You should read this article series, as it contains a lot of good and interesting information about SSDs being pushed to their limits by running 24/7 over the course of several years. It shows that Reallocated Blocks do not necessarily mean a failing drive. Things may be a little different with the MX300 & MX500 series as they use TLC NAND, but in general I believe this still holds.

 

With hard drives I would be concerned about a bad sector, because most times it means you will get more bad sectors, which interfere with performance until they are reallocated. With SSDs this is a bit different: some bad blocks are expected from time to time, which is why the drive keeps reserved blocks in place. If you see a lot of bad blocks in a short period of time, then that is a cause for concern, and I believe Crucial would most likely replace the drive (implied from a chat I had with Crucial tech support about a similar issue). Keep an eye on the Unused Reserved Block Count to make sure you still have spare blocks. When this count gets low, I would start getting concerned.
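To make "a lot of bad blocks in a short period" and "reserve getting low" a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names, the one-week window, the limit of 5 new blocks and the 20% reserve cutoff are made up, not anything Crucial publishes, and the readings would come from whatever your existing SMART script already logs.

```python
from datetime import datetime, timedelta


def blocks_added_within(history, window):
    """Count reallocated blocks added during the most recent `window`.

    history: list of (timestamp, reallocated_count) tuples, oldest first.
    """
    if not history:
        return 0
    latest_time, latest_count = history[-1]
    cutoff = latest_time - window
    # Baseline is the oldest reading that still falls inside the window.
    for when, count in history:
        if when >= cutoff:
            return latest_count - count
    return 0


def reserve_is_low(unused_reserve, initial_reserve, min_fraction=0.2):
    """True when fewer than `min_fraction` of the spare blocks remain."""
    return unused_reserve < initial_reserve * min_fraction


# Made-up readings for one drive (hypothetical numbers).
history = [
    (datetime(2024, 1, 1), 2),
    (datetime(2024, 1, 20), 3),
    (datetime(2024, 1, 25), 9),
    (datetime(2024, 1, 27), 10),
]
recent = blocks_added_within(history, timedelta(days=7))
if recent >= 5 or reserve_is_low(unused_reserve=900, initial_reserve=1000):
    print(f"Worth a closer look: {recent} new bad blocks in the last week")
```

The point is only that the rate of change and the remaining reserve matter more than the absolute count; pick limits that match how hard your customers' drives are worked.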

 

I have a bunch of MX300s with a few bad blocks each and they are still going fine. With the MX300 series, make sure you are using the latest firmware (M0CR60) on the SSD, as older versions did develop problems where a Sanitize/Secure Erase was necessary to correct issues.

 

As for your script, you may want it to trigger only when the Reallocated Block Count increases, so that you know when new bad blocks appear but it doesn't keep alerting you about the existing ones.
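Something along these lines might work for the "only alert on an increase" idea. It is a minimal Python sketch, not a drop-in replacement for your script: the function names, the JSON state file and the serial-number keys are all hypothetical, and it assumes your script already collects one Reallocated Block Count per drive.

```python
import json
from pathlib import Path


def new_reallocations(current, previous):
    """Return {serial: (old_count, new_count)} for drives whose count went up."""
    changes = {}
    for serial, count in current.items():
        old = previous.get(serial, 0)  # a drive never seen before counts from 0
        if count > old:
            changes[serial] = (old, count)
    return changes


def check_and_save(current, state_path):
    """Compare this run against the saved state, then save the new state."""
    path = Path(state_path)
    previous = json.loads(path.read_text()) if path.exists() else {}
    changes = new_reallocations(current, previous)
    # Merge so drives missing from this run keep their last known count.
    path.write_text(json.dumps({**previous, **current}))
    return changes
```

Calling `check_and_save({"SERIAL1": 3}, "smart_state.json")` on each run would return an empty dict unless a count has grown since the previous run. Note that a drive seen for the first time with a nonzero count will alert once, which is probably what you want.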

 

The MX300 & MX500 use TLC NAND, which is not as robust as the MLC NAND in older SSDs. TLC NAND is used in common consumer SSDs because the average end user reads from the drive more than they write to it. If your customers are writing tons of data to the drive, then an MLC NAND SSD would probably be a better fit; I believe the BX300 series uses MLC. TLC NAND and newer are meant to bring out larger SSDs for a lower price, but doing so does have some unfortunate tradeoffs.

 

The "hostility" to returning the item is because they consider it to be working as intended, which seems to be the case at the moment. This is common practice for most companies. I am just a regular Crucial customer like you, and personally I wouldn't be concerned yet, but I would keep an eye on them. Make sure the customers are backing up their data. While this is always recommended, I think it is even more important with SSDs, since data recovery may not be possible without professional services because of the way some SSDs fail.

Kilobyte Kid

Re: SSD Failing Rant

I guess my problem is: how is this considered a drive in good health? The OS was having problems booting and I couldn't make a backup using Acronis or Macrium due to the number of failing sectors. Thankfully I was able to put it in a USB tray and copy out what I needed, but still. If it wasn't for a monitoring script, nobody would have known until it was catastrophic, apparently.

MX300.png

 

HWtech -> Thanks for the link, I'll take a look. 

 

I checked the firmware and it is on the latest. I also talked to Crucial chat support this morning and confirmed that sitting idle overnight should have been enough time to trigger and run Trim/Garbage Collection on the drive. 

Historically I haven't gotten excited over a few reallocated sectors, but when the system won't back up and will hardly boot, I'm thinking that at any sign of reallocation I'm gonna have it brought into the shop ASAP.

 

I didn't mean to imply Crucial hasn't replaced drives for me. I just don't like the hoops, nor the stance of "well, there really isn't a problem, but we will go ahead and replace it for you" when SMART clearly shows errors and the system is having problems. Maybe I'm just spoiled by the WD or Seagate RMA process, where you log in, fill out a form and you're done. It's in warranty, no real questions asked, and away we go.

JEDEC Jedi

Re: SSD Failing Rant

I think the health will only drop from good in a SMART diagnostic program if an attribute hits its SMART threshold value.
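As a simplified illustration of that threshold logic in Python: each SMART attribute carries a normalized value and a vendor-set threshold next to the raw counter, and the attribute only counts as failed once the normalized value drops to the threshold or below. The class, the function names and the numbers here are made up for the example, and individual tools may layer their own extra rules on top.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int      # normalized value, higher is healthier
    threshold: int  # vendor-set failure threshold
    raw: int        # raw counter, e.g. number of reallocated blocks


def attribute_failed(attr):
    # A threshold of 0 means the attribute never trips the health status.
    return attr.threshold > 0 and attr.value <= attr.threshold


def overall_health(attrs):
    return "FAILED" if any(attribute_failed(a) for a in attrs) else "GOOD"


# Hypothetical drive: 5 reallocated blocks, normalized value still far above threshold.
drive = [SmartAttribute("Reallocated blocks", value=100, threshold=10, raw=5)]
print(overall_health(drive))
```

That is why a drive can show a handful of reallocated blocks in the raw column and still report good health: the raw count has not yet pulled the normalized value down to the threshold.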

JEDEC Jedi

Re: SSD Failing Rant

The SMART attributes in your screenshot are similar to those of several of our MX300s. I don't recall exact numbers, but I believe I had 26 Reallocated Blocks and I forget how many thousand Uncorrectable Errors, but I was shocked. The Uncorrectable Errors accumulated because the SSD got stuck reallocating the bad blocks while it was using the older firmware. One SSD was part of a software RAID that broke, which is how we discovered the issue initially. After updating the firmware and performing a Secure Erase, we haven't had any more issues with the RAID. The other MX300s we use are in laptops, and I don't recall anyone reporting issues to us, but I'm not always made aware of the reports either. We caught the next one when a SMART monitoring utility told us about the bad block. The others I proactively upgraded and in some cases Secure Erased just to be safe. We've also disabled power saving options in the OS for the MX300, and I think the latest firmware did the same, IIRC.

 

Keep an eye on the Uncorrectable Errors attribute. I've only noticed this attribute increase while there are Pending Blocks to reallocate. I actually watched the Uncorrectable Errors increase while the SSD was sitting idle and unmounted when it had several pending blocks to reallocate. Perhaps the issue you have is that the bad blocks are not being reallocated quickly enough, which is causing your performance and other issues. Monitoring when a bad block occurs and how soon afterwards it is actually reallocated might help your cause in getting an RMA if it is affecting booting and performance. If you still accumulate errors after the Garbage Collection, I would Sanitize/Secure Erase the SSD and see if that solves your problems. Personally I think the SSD needs that reset to overcome whatever occurred with the old firmware, but that's just my opinion.
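If you want to put numbers on "how soon after it is actually reallocated", a sketch like this could turn periodic SMART samples into a list of pending-block episodes. It is only an illustration in Python: the function name, the tuple layout and the sample readings are assumptions, and the three values per sample would have to come from your own SMART polling.

```python
from datetime import datetime


def pending_episodes(samples):
    """Summarize each stretch of time during which blocks were pending.

    samples: (timestamp, pending_blocks, uncorrectable_errors), oldest first.
    """
    episodes = []
    start = None  # (timestamp, uncorrectable) when pending first became nonzero
    for when, pending, uncorrectable in samples:
        if pending > 0 and start is None:
            start = (when, uncorrectable)
        elif pending == 0 and start is not None:
            began, errors_at_start = start
            episodes.append({
                "began": began,
                "cleared": when,
                "duration": when - began,
                "new_uncorrectable": uncorrectable - errors_at_start,
            })
            start = None
    return episodes


# Made-up samples taken every few hours (hypothetical numbers).
samples = [
    (datetime(2024, 1, 1, 0), 0, 5),
    (datetime(2024, 1, 1, 6), 3, 5),
    (datetime(2024, 1, 1, 12), 3, 40),
    (datetime(2024, 1, 2, 0), 0, 60),
]
for episode in pending_episodes(samples):
    print(episode["began"], episode["duration"], episode["new_uncorrectable"])
```

A log showing blocks sitting pending for many hours while Uncorrectable Errors climb is the kind of evidence that might carry more weight in an RMA chat than a single reallocated count. An episode that is still open at the last sample is not reported until it clears.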

 

I think everyone feels the same way about bad blocks, since we've been trained to fear them after dealing with hard drives. With hard drives it almost always meant an imminent failure, and it would severely affect performance. I keep a close eye on our SSDs' Reallocated Blocks to study how SSDs behave and hopefully catch an issue before it is too late. I had a lot of sudden and complete SSD failures (from a competitor) during the early days, so my trust in the technology is not that great, and I do understand your concerns.