Should chkdsk be run regularly as part of SSD maintenance?

SOLVED
Kilobyte Kid

Should chkdsk /R be run regularly as part of SSD maintenance?


The reason I'm asking is that my MX100 suddenly failed with 1633 bad blocks/clusters as reported by Ariolic Disk Scanner. chkdsk approximately confirmed the count and listed all the bad files. Symptoms were similar to what was reported in Is-my-mx100-almost-dead? I was trying to do a backup when I got a read error.


I wonder if I could have gotten a warning somehow? I run Storage Exec every quarter or so, looking for new updates and checking the SMART data for problems. I sometimes program flash for a living (Micron no less!) and know that bad blocks are to be expected, and that reallocation and wear leveling should take care of everything behind the scenes. So it was really surprising to see all these bad blocks pop up suddenly.

If HWTech happens by, here is my SMART info:

CrucialMX100SMART.PNG

I didn't keep records so I don't know if there were any trends in the counts.  65152 Reported Uncorrectable Errors looks bad.  But only 1 retired NAND block and 1 reallocation event.  So where did all these bad blocks come from?

 

Most of the files with errors were Windows update files or DLLs or other files that weren't written frequently and weren't backed up.  There were a few working files that had bad blocks and I'm going to have to try to recover those off of backups.  I used Acronis with "ignore all errors" turned on to move everything to a spinning disk (oh, the slowness!) and sfc to verify the core OS (Win 8.1) was intact.

 

So maybe I should take snapshots of the SMART every time I run Storage Exec to look for trends? I'm nervous now and running chkdsk on all my Crucial SSDs (I have about 6 here).  And of course, keep up with the backups.
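
The snapshot idea above could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the attribute values are typed in by hand (or copied from whatever tool shows the SMART table), the file naming scheme is made up for illustration, and the attribute IDs 5, 187 and 196 are assumed to correspond to the retired NAND block, Reported Uncorrectable Errors and reallocation event counts. The "earlier" reading is hypothetical, since no earlier records were kept.

```python
import json
from datetime import date
from pathlib import Path


def save_snapshot(raw_values, folder, day=None):
    """Write one snapshot (attribute ID -> raw value) to a dated JSON file."""
    day = day or date.today()
    path = Path(folder) / f"smart-{day.isoformat()}.json"
    # JSON object keys must be strings, so the integer IDs are converted.
    path.write_text(json.dumps({str(k): v for k, v in raw_values.items()}, indent=2))
    return path


def load_snapshot(path):
    """Read a snapshot back, restoring integer attribute IDs."""
    return {int(k): v for k, v in json.loads(Path(path).read_text()).items()}


def changed_attributes(old, new):
    """Return {attribute ID: (old raw, new raw)} for every value that differs."""
    changes = {}
    for attr_id in sorted(new):
        before = old.get(attr_id)  # None if the attribute was not recorded before
        after = new[attr_id]
        if before != after:
            changes[attr_id] = (before, after)
    return changes


if __name__ == "__main__":
    earlier = {5: 0, 187: 0, 196: 0}      # hypothetical earlier reading
    latest = {5: 1, 187: 65152, 196: 1}   # raw counts quoted in this post
    for attr_id, (before, after) in changed_attributes(earlier, latest).items():
        print(f"attribute {attr_id}: {before} -> {after}")
```

Keeping one small file per check makes it easy to compare any two dates later, which is the trend information that was missing this time.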

 

9 Replies
JEDEC Jedi

Re: Should chkdsk be run regularly as part of SSD maintenance?

You shouldn't need to run it regularly, no. I've probably never run it on any of my SSDs. You'd normally run it only if you were having storage-related issues.

 

I guess the problem is (and this applies to magnetic storage as well as SSDs) that you wouldn't know about a bad block until you tried to use it. Chkdsk picked it up because it checks the whole disk. But trying to access files on that block, or any other disk-wide activity such as a virus scan, would probably have found it too. And presumably, over time, wear levelling would have picked it up as well. I guess the only difference is that chkdsk is the only one of those processes actually qualified to repair the file system afterwards. So typically, some other process would find a bad block and then chkdsk would have to be run to repair the file system.

_______________________________________
How do I know what memory to buy?
Shop for your region: US | UK | EU | France |
I think my memory is bad. What do I do now?
FAQs and Top Forum Solutions
Did a user help you? Say thanks by giving Kudos!
Still need help? Contact Customer Service
Want to be a Super User?
JEDEC Jedi

Re: Should chkdsk be run regularly as part of SSD maintenance?

Sorry to hear about your troubles.  Assuming you are not getting any new Uncorrectable Errors on the SSD, then you should be fine.   If the Uncorrectable Errors are still increasing and you have no Pending or new Reallocated Blocks, then your SSD has a problem.   

 

Each block on an SSD is able to automatically correct for x number of errors before it will be reallocated.  My guess is the errors accumulated faster than the block could be reallocated.  Because of this you received 65k Uncorrectable Errors.  This does not mean you have 65k individual actual errors.  Most likely most of these 65k Uncorrectable Errors are the result of the same bad bits alerting over & over. 

 

 

I would suggest using smartmontools to monitor the health of your drives since you can customize how it reports changes in the various SMART attributes.  You will definitely want to monitor increases in attribute 187 since that is the one which can affect the filesystem & data.  If this attribute increases then you will want to check the filesystem for errors.  While you don't need to run "chkdsk /r" for checking for bad physical blocks, it may detect soft errors in the filesystem blocks and of course any other filesystem corruption.
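
As a rough illustration of watching attribute 187, here is a minimal Python sketch that reads the attribute table printed by `smartctl -A` and reports how much the raw value has grown since the last check. It assumes the usual ten-column ATA attribute layout with the raw value in the last column. The sample text is illustrative only: the raw counts are the ones quoted in this thread, while the attribute names and the normalized values are made up, and the function names are hypothetical.

```python
# Illustrative text in the shape of `smartctl -A` output (not a real capture).
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   000    Old_age   Always       -       1
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       65152
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
"""


def parse_smart_attributes(report):
    """Map attribute ID -> (name, raw value) from `smartctl -A` style text."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw = int(fields[9])
        except ValueError:
            continue  # skip rows whose raw value is not a plain integer
        attributes[int(fields[0])] = (fields[1], raw)
    return attributes


def raw_increase(previous_raw, report, attr_id=187):
    """Return how much the attribute's raw value grew since previous_raw."""
    attributes = parse_smart_attributes(report)
    if attr_id not in attributes:
        raise KeyError(f"attribute {attr_id} not found in report")
    return attributes[attr_id][1] - previous_raw


if __name__ == "__main__":
    growth = raw_increase(65152, SAMPLE_REPORT)
    if growth > 0:
        print(f"attribute 187 grew by {growth}: check the filesystem")
    else:
        print("attribute 187 unchanged")
```

In practice the report text would come from running `smartctl -A` against the drive and saving the output. smartmontools also ships a monitoring daemon (smartd) that can do this kind of change tracking itself, so a script like this is mainly useful for ad hoc checks.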

 

I would not think the bad block reporting & discovering would be the same for an SSD as it would be for a hard drive.  All of the SSD's blocks can be accessed or monitored at all times (in theory) unlike a hard drive that can only discover bad blocks when accessed.   Unfortunately technical details are hard to come by, but I would hope the SSD would be proactive on reallocating bad blocks without the need for the OS to access them first.

 

If one block on the NAND chip behaved this way once, it's possible that when another block fails on the same NAND chip it could happen in a similar manner. It will depend on how large an area the "weak" NAND occupies. Personally I would be monitoring increases in SMART attribute 187 to know about it as early as possible. Luckily I've only ever seen Uncorrectable Errors on the MX300 series when it got stuck on reallocating bad blocks. I don't recall seeing these errors on any other SSDs when bad blocks were reallocated.

Kilobyte Kid

Re: Should chkdsk be run regularly as part of SSD maintenance?

Thanks both of you for your comments. smartmontools looks like it would be a good addition to my regular maintenance toolbox. Right now, I'm in the process of setting up a new MX500 as my main drive. I'll relegate the MX100 to usage on a not so critical system and keep a careful eye on it. I'll try the instructions by HWTech in another thread for recovering the drive.

 

What has me curious is why so many of the bad blocks were in files that rarely get read or written by Windows. Around 40% of the bad blocks were under the windows\install directory. Another big bunch were under the windows\winSxS directory. That makes me wonder if there is a bug in the wear leveling on the MX100. Such files would be good candidates for relocation since the blocks they are in would have low erase counts. ECC is supposed to be applied after the read, before the data is written to a new block, but I was wondering if this could have been skipped.

JEDEC Jedi

Re: Should chkdsk be run regularly as part of SSD maintenance?

I guess it depends how much you're writing to the rest of the disk.  Moving data about for the sake of it would just increase drive wear.  It would only need to do it when the NAND wear was getting uneven.

Kilobyte Kid

Re: Should chkdsk be run regularly as part of SSD maintenance?

"Moving data about for the sake of it would just increase drive wear. It would only need to do it when the NAND wear was getting uneven."

 

Well that is the very definition of wear leveling! :-)  The MX100 controller should be doing it, not the Windows OS.  What I was speculating is that there might be some glitch/bug in the MX100 controller code that was causing it not to apply ECC when moving data around as part of the wear leveling process.

JEDEC Jedi

Re: Should chkdsk be run regularly as part of SSD maintenance?

My guess is that those rarely used items, along with some more recently updated files, were put onto a possibly more worn block because of the wear leveling. At the time of the transfer everything was fine. However, when the block began failing, it failed faster than the SSD controller could reallocate it, resulting in damage to those files due to the Uncorrectable Errors after the ECC limit was reached for that block. I would suspect a defective area of the NAND was exposed rather than any issue with the wear leveling, since the SSD's controller couldn't reallocate the bad block before losing data due to the Uncorrectable Errors.

 

JEDEC Jedi

Re: Should chkdsk be run regularly as part of SSD maintenance?


@kwarner wrote:

"Moving data about for the sake of it would just increase drive wear. It would only need to do it when the NAND wear was getting uneven."

 

Well that is the very definition of wear leveling! :-)  The MX100 controller should be doing it, not the Windows OS.  What I was speculating is that there might be some glitch/bug in the MX100 controller code that was causing it not to apply ECC when moving data around as part of the wear leveling process.


 

My point was... unless you've caused wear to the rest of the drive, wear levelling won't have occurred.

Kilobyte Kid

Re: Should chkdsk be run regularly as part of SSD maintenance?


@HWTech wrote:

...when the block began failing, it failed faster than the SSD controller could reallocate it resulting in damage to those files due to the Uncorrectable Errors after the ECC limit was reached for that block.  I would suspect a defective area of the NAND was exposed..

 


Yes, that would fit what I observed. Interestingly, if the failures were slow and occurred over time, your theory would argue for scanning all files with chkdsk occasionally to give the controller a chance to detect failing blocks.

JEDEC Jedi

Re: Should chkdsk be run regularly as part of SSD maintenance?

Actually, if the NAND block failure built up slowly, then the block should be reallocated without any filesystem problems at all since the ECC or other internal safety mechanisms should be sufficient to prevent it (assuming the controller doesn't get stuck reallocating the bad block).  Monitoring the Uncorrectable Errors attribute is your best bet to know when to run those checks.   Of course you can get "soft" filesystem failures caused by other system issues, software issues or unexpected shutdowns.  Another good attribute to monitor is the "Pending" attribute.  Once it becomes non-zero, see how long it takes before the controller reallocates the bad block and this value goes back to zero.  Might provide some useful information for you next time since I don't know how long a pending block reallocation should take for an SSD.
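
The timing of the "Pending" attribute suggested above could be worked out from periodic readings with something like the Python sketch below. It is only a sketch under assumptions: the readings are (timestamp, raw value) pairs collected by some other means, the pending count is often reported as attribute 197 but that ID is an assumption here, and the sample readings are entirely hypothetical.

```python
from datetime import datetime


def pending_spells(samples):
    """Return (first seen non-zero, first seen back at zero) pairs.

    samples: iterable of (datetime, raw value) readings in chronological order.
    A spell still open at the last reading is reported with None as its end.
    """
    spells = []
    started = None
    for when, raw in samples:
        if raw > 0 and started is None:
            started = when                   # pending count just became non-zero
        elif raw == 0 and started is not None:
            spells.append((started, when))   # count returned to zero
            started = None
    if started is not None:
        spells.append((started, None))
    return spells


if __name__ == "__main__":
    # Hypothetical hourly readings of the pending attribute.
    readings = [
        (datetime(2020, 1, 1, 0), 0),
        (datetime(2020, 1, 1, 1), 2),
        (datetime(2020, 1, 1, 2), 2),
        (datetime(2020, 1, 1, 3), 0),
    ]
    for start, end in pending_spells(readings):
        if end is None:
            print(f"pending since {start}, not yet cleared")
        else:
            print(f"pending from {start} to {end} ({end - start})")
```

The measured duration can only be as precise as the polling interval, so a reallocation that completes between two readings would not show up at all.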

 

BTW, depending on the size of the defective area of the NAND where the last error occurred, it is possible it could happen again on a physically adjacent area.