11-07-2018 11:00 AM - edited 11-07-2018 11:06 AM
We're running a system with two arrays: a RAID 1 of 2 drives and a RAID 5 of 6 drives, all 1 TB Crucial SSDs. When we run certain operations that access some sectors in the RAID, we get these errors at the OS level:
MR_MONITOR: <MRMON111> Controller ID: 0 Unrecoverable medium error during recovery: **bleep** Port 0 - 3:0:2 Location 0xee48ef2#012Event ID:111
The RAID software also reports errors periodically (see image above).
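If you need to script over the controller log, the event line above can be parsed to pull out the affected drive and address. A minimal sketch, assuming the MegaRAID event text format shown (and assuming the `Location` field is the failing LBA, which is how these medium-error events are usually read):

```python
import re

# Sample event line as logged by the controller. Note: the "**bleep**" in the
# post above is the forum's profanity filter mangling part of the original
# message, so it is omitted from this sample.
event = ("MR_MONITOR: <MRMON111> Controller ID: 0 Unrecoverable medium "
         "error during recovery: Port 0 - 3:0:2 Location 0xee48ef2")

# Assumption: "3:0:2" identifies the physical drive (channel:enclosure:slot)
# and "Location" is the failing LBA in hex.
m = re.search(r"Port \d+ - (\d+:\d+:\d+) Location (0x[0-9a-fA-F]+)", event)
if m:
    slot, lba_hex = m.groups()
    lba = int(lba_hex, 16)
    print(f"drive {slot}: failing LBA {lba} ({lba_hex})")
```

Logging which drive and LBA each event hits can tell you whether the errors cluster on a single member of the array.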
We have also run the Micron Storage Executive tool, and all of its checks report that the drives are OK (SMART passes).
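A "SMART OK" overall status can coexist with nonzero error counters, so it can be worth reading the raw attribute table as well (e.g. `smartctl -A` from smartmontools on Linux). A minimal sketch of pulling the relevant counters out of that table; the sample excerpt and its values here are hypothetical:

```python
# Hypothetical excerpt of `smartctl -A /dev/sdX` output; the layout follows
# the standard SMART attribute table, but the raw values are made up.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
"""

def parse_smart(text):
    """Map attribute name -> raw value (last column of each attribute row)."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            counters[fields[1]] = int(fields[9])
    return counters

c = parse_smart(SAMPLE)
# Reallocations alone are routine wear; uncorrectable errors are the red flag,
# and can be present even when the drive's overall SMART status is "PASSED".
if c.get("Reported_Uncorrect", 0) > 0:
    print("drive shows uncorrectable errors despite an overall SMART pass")
```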
We wonder where the problem is: the RAID controller reports problems with the drives, but the drives themselves seem fine. The drives' firmware is at MU01; an MU02 update is available.
What do you think the problem could be? Can the firmware update solve it? (It seems that this firmware update can "
Thanks and regards
11-09-2018 03:50 PM
Hello, and thank you for your question. Updating the firmware is a good idea, as it calls out fixes for SMART data readings that might not be reporting correctly. That is where I would start, to see if it resolves the issue.
11-11-2018 04:48 AM
Thanks for your answer! Yes, this seems to be the first thing to try. I think we will give it a go; I'll keep you posted.
11-11-2018 10:17 AM
Your mdraid0:8 SSD has some Uncorrectable Errors listed. These Uncorrectable Errors are most likely the source of your problem. Ideally, bad blocks are reallocated before any Uncorrectable Errors show up, so they never affect your filesystem or data, but it appears too many sectors failed too quickly for the controller to reallocate the blocks in time.
If you continue having problems after the firmware update, I would suggest pulling this SSD and testing it outside the array:
1. Perform a Secure Erase on it to reset it.
2. Write to the whole drive to see whether more errors occur or more blocks get reallocated. Reallocated blocks are fine as long as they are not accompanied by Uncorrectable Errors.
3. If things look fine, perform another Secure Erase to reset the SSD once more, then add it back into your RAID and see if that resolves your problem.
Otherwise, it looks like you may need to replace this SSD. If your RAID is not backed up, I would install a new drive instead, so you don't risk the remaining drives failing while you test the suspect one.
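The "write to the whole drive, then verify" check can be scripted. A minimal sketch of the pattern-write-and-verify pass, using an ordinary file as a stand-in target; a real run would point `DEVICE` at the raw device (e.g. `/dev/sdX`) only after the drive is out of the array, and expect all data on it to be destroyed:

```python
import os

DEVICE = "testfile.img"      # stand-in for the raw device in a real run
SIZE = 4 * 1024 * 1024       # bytes to exercise; a real test covers the drive
CHUNK = 1024 * 1024
PATTERN = bytes(range(256)) * (CHUNK // 256)   # 1 MiB known pattern

# Pass 1: write the known pattern across the target.
with open(DEVICE, "wb") as f:
    for _ in range(SIZE // CHUNK):
        f.write(PATTERN)
    f.flush()
    os.fsync(f.fileno())

# Pass 2: read it back and verify. On a real drive, a mismatch or I/O error
# here is the kind of failure that should push blocks into the spare pool.
bad = 0
with open(DEVICE, "rb") as f:
    for _ in range(SIZE // CHUNK):
        if f.read(CHUNK) != PATTERN:
            bad += 1
print(f"{bad} bad chunk(s) out of {SIZE // CHUNK}")
```

The Secure Erase steps themselves are done with the drive's ATA security commands (e.g. via hdparm on Linux) or a vendor tool such as Storage Executive, not from a script like this.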
11-12-2018 04:20 PM
@HWTech's suggestion of secure erasing the drive is a good one. My guess is the errors are the result of the RAID controller timing out on requests for new writes to the SSD. A Secure Erase wipes the drive clean of all data at the controller level, so it should theoretically perform as if it were fresh out of the box.
You need to realize the M550, and even the Micron 1100, are client-grade storage devices. They're not designed to run around the clock in 24/7 enterprise environments. If they're subjected to continuous erase operations, the controller can get backed up. In a normal single-drive scenario that simply means a loss in performance while it tries to catch up, but when you put the drive into a RAID array it can mean timeouts and more serious issues as write requests lag behind the other drives.
If you continue to have problems with the drive, it's probably time to retire it from mission-critical or write-intensive duty. It should theoretically still work great in a normal desktop workload or as static read storage. If you end up buying a replacement, I would highly recommend looking at enterprise-class SSDs like the Micron 5200. They're designed for, and much better suited to, 24/7 operation and high-erase environments, and they aren't that much more expensive than consumer SATA devices.
11-13-2018 11:10 AM
Hi Bennym, thanks for your update. Yes, when they were installed, the previous sysadmin figured that even though they're not intended for a critical 24/7 environment, they could still run smoothly. In hindsight, I don't think that was a good decision. We'll think it over; upgrading the drives seems like the right thing to do!