MX500, very high write amplification

Kilobyte Kid

Re: MX500, very high write amplification

Intel RST (the AHCI driver provided by Intel) versions other than 12.9 would not get installed on Windows 10 on the previously mentioned Intel H77 chipset.

 

After installing them, I disabled the "Link Power Management" option in their control panel and rebooted as instructed there. This could possibly be the reason why Power On hours didn't rise as real-time hours, and a possible indirect reason why I was having issues with the drive's write amplification and consequently excessive NAND wear for the amount of writes performed by the Host.

 

30781.6_image.png

 

I have not made other changes for now (the P2P program is still active in "seeding" mode), but I've disabled the benchmarking program which was keeping the SSD actively working. In the next hours it should become clearer whether this solves the problem. Even if it does, this is still something that should be taken care of by the SSD (firmware), though.

 

EDIT: Unfortunately, that option did not result in any improvement in the behavior of the SSD.

Kilobyte Kid

Re: MX500, very high write amplification

Unfortunately, so far I haven't been able to determine any clear software cause for the very high write amplification observed.

 

On the other hand, after collecting SMART attributes at a rate of 1/minute for a few days, it appears there is a correlation between write amplification and the Current Pending ECC Count. It seems unlikely to me that this is something that would be directly software-caused.
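For anyone wanting to collect similar data, here is a minimal sketch of a once-per-minute logger. It assumes smartmontools is installed and that `smartctl -A` prints the attribute table in the layout shown in the dumps in this thread. The device path, the CSV file name and the function names are hypothetical placeholders, not what was actually used for the graphs here.

```python
import csv
import subprocess
import time
from datetime import datetime


def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} parsed from `smartctl -A` table output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # column 10 is RAW_VALUE (any trailing "(Min/Max ...)" text is ignored).
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = fields[9]
    return attrs


def log_once(device, csv_path, ids):
    """Append one timestamped row with the raw values of the chosen attribute IDs."""
    # smartctl uses its exit status as a bitmask, so it is not treated as an error here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    attrs = parse_smart_attributes(result.stdout)
    row = [datetime.now().isoformat(timespec="seconds")]
    row += [attrs.get(i, "") for i in ids]
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow(row)


def run_logger(device, csv_path, ids, interval_s=60):
    """Log forever at the given interval (stop with Ctrl+C)."""
    while True:
        log_once(device, csv_path, ids)
        time.sleep(interval_s)


# Hypothetical usage (device path and file name are assumptions):
# run_logger("/dev/sda", "smart_log.csv", [9, 173, 197, 246, 247, 248])
```

Attributes 197, 247 and 248 are the ones needed to compare pending ECC events with host and FTL page writes over time.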

 

CT500WAF.png

 

So, summing up what I have observed so far:

 

  • Power On Hours Count appears to increase faster (closer to real time) when a certain amount of load (even read-only load) is put on the SSD.
  • A continuous high read load appears to immediately defer this write amplification phenomenon for a while, but does not completely eliminate it.
  • In the long term, the high write amplification seems to be correlated with the appearance of pending ECC errors.
  • Turning off Aggressive Link Power Management (DIPM/HIPM) from Windows or the chipset AHCI driver does not seem to affect the behavior of the SSD.
  • This might or might not be related, but my Crucial MX500 came from the factory with M3CR020 firmware, which I understand was not publicly released on the Crucial website. The currently installed firmware is the latest (M3CR023).

As a bonus, here are the latest SMART attributes from Crucial Storage Executive:

 

109350566.6_image.png

 

Wear leveling count increase for the past few logged days:

 

Timestamp            Block wear-leveling count  Host GiB written  GiB delta  Days delta  GiB/day
2019-03-15 01:36:06  78                         4643.4
2019-03-17 08:49:13  79                         4675.7            32.31      2.30        14.041
2019-03-18 16:11:52  80                         4706.2            30.46      1.31        23.296
Kilobyte Kid

Re: MX500, very high write amplification

Contrary to how it initially appeared, several days after secure erasing the SSD and cloning onto it a relatively fresh Windows 10 installation on a slightly different hardware configuration, I think I can say that the high write amplification issue (although it's not really an immediate concern) still persists to some extent, together with Power On Hours from SMART parameters progressing slower than real-time hours even though the system is on 24/7.

 

image.png  ecc.png  hostw.png

 

Curiously, I noticed that the high write amplification appears to be mostly due to the SMART parameter "FTL program page count" increasing in discrete 1 GiB chunks without a corresponding "Host program page count" increase. Most Host-initiated writes appear to be within the 1-2x write amplification range. Furthermore, these 1 GiB-equivalent increases in FTL program page count appear to be closely associated with pending ECC count events from SMART parameters. So, to me it would seem this is (again) somehow related to the SSD firmware rather than the operating system.

 

The above data assumes that the size of one program page is equal to [cumulative host sectors written (in bytes) / host program page count]. The result is roughly 29300 bytes/page (it varies slightly with time).

 

If it can be of any interest, I have been logging all SMART parameters at a 1-minute interval with smartctl, in CSV format, but these cannot be easily uploaded here. These are current statistics:

 

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   ---    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       1166
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       15
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   095   095   ---    Old_age   Always       -       87
174 Unexpect_Power_Loss_Ct  0x0032   100   100   ---    Old_age   Always       -       1
180 Unused_Reserve_NAND_Blk 0x0033   000   000   ---    Pre-fail  Always       -       38
183 SATA_Interfac_Downshift 0x0032   100   100   ---    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   062   050   ---    Old_age   Always       -       38 (Min/Max 0/50)
196 Reallocated_Event_Count 0x0032   100   100   ---    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   095   095   ---    Old_age   Offline      -       5
206 Write_Error_Rate        0x000e   100   100   ---    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   ---    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   ---    Old_age   Always       -       11504590976
247 Host_Program_Page_Count 0x0032   100   100   ---    Old_age   Always       -       200738389
248 FTL_Program_Page_Count  0x0032   100   100   ---    Old_age   Always       -       941287241
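
The page-size estimate and the lifetime write amplification can be reproduced from the raw values of attributes 246, 247 and 248 above. This is only a sketch under two assumptions that the thread does not confirm: attribute 246 counts 512-byte sectors, and write amplification is taken as (host pages + FTL pages) / host pages.

```python
SECTOR_BYTES = 512  # assumption: attribute 246 counts 512-byte sectors


def bytes_per_page(host_sectors, host_pages):
    """Estimated bytes per program page: host bytes written / host program pages."""
    return host_sectors * SECTOR_BYTES / host_pages


def write_amplification(host_pages, ftl_pages):
    """Assumed definition: (host pages + FTL pages) / host pages."""
    return (host_pages + ftl_pages) / host_pages


# Raw values from the smartctl output above
host_sectors = 11504590976  # 246 Total_Host_Sector_Write
host_pages = 200738389      # 247 Host_Program_Page_Count
ftl_pages = 941287241       # 248 FTL_Program_Page_Count

print(f"bytes/page:   {bytes_per_page(host_sectors, host_pages):.0f}")
print(f"lifetime WAF: {write_amplification(host_pages, ftl_pages):.2f}")
```

With these inputs the page size comes out close to the 29300 bytes mentioned above, and the lifetime write amplification is a little under 6.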
Kilobyte Kid

Re: MX500, very high write amplification

I did more testing under more diverse conditions, to try finding out what could be causing this unusual write activity that increases write amplification.

 

In short, I tried disabling TRIM and filling up the drive. As a result, the write amplification of the "base" (user-caused) activity dropped to values very close to 1.0x, but internal SSD write activity (occurring in discrete 1 GiB steps) did not stop. Re-enabling TRIM and performing a TRIM pass on the free space (for example by emptying the Recycle Bin on Windows) made this "base" write amplification increase back to previous values and possibly increased the internal write activity that apparently is causing the WAF to increase more than necessary.

 

Again, values sampled at 1/minute.

 

trim_onoff.png

Latest raw SMART attributes from smartmontools:

 

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   ---    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       1212
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       15
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   094   094   ---    Old_age   Always       -       90
174 Unexpect_Power_Loss_Ct  0x0032   100   100   ---    Old_age   Always       -       1
180 Unused_Reserve_NAND_Blk 0x0033   000   000   ---    Pre-fail  Always       -       38
183 SATA_Interfac_Downshift 0x0032   100   100   ---    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   044   ---    Old_age   Always       -       36 (Min/Max 0/56)
196 Reallocated_Event_Count 0x0032   100   100   ---    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   094   094   ---    Old_age   Offline      -       6
206 Write_Error_Rate        0x000e   100   100   ---    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   ---    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   ---    Old_age   Always       -       12424196259
247 Host_Program_Page_Count 0x0032   100   100   ---    Old_age   Always       -       215878800
248 FTL_Program_Page_Count  0x0032   100   100   ---    Old_age   Always       -       965800162
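
To see what happened between two snapshots rather than over the drive's lifetime, the counters can be differenced. The sketch below compares the two smartctl dumps posted in this thread, with the same unconfirmed assumptions as before: 512-byte sectors for attribute 246, and write amplification defined as (host pages + FTL pages) / host pages.

```python
SECTOR_BYTES = 512  # assumption: attribute 246 counts 512-byte sectors

# (246 host sectors, 247 host pages, 248 FTL pages) from the two dumps in this thread
earlier = (11504590976, 200738389, 941287241)
later = (12424196259, 215878800, 965800162)


def interval_stats(a, b):
    """Return (host GiB written, WAF) for the interval between snapshots a and b."""
    d_sectors = b[0] - a[0]
    d_host_pages = b[1] - a[1]
    d_ftl_pages = b[2] - a[2]
    host_gib = d_sectors * SECTOR_BYTES / 2**30
    waf = (d_host_pages + d_ftl_pages) / d_host_pages  # assumed WAF definition
    return host_gib, waf


host_gib, waf = interval_stats(earlier, later)
print(f"host writes in interval: {host_gib:.1f} GiB")
print(f"WAF in interval:         {waf:.2f}")
```

The interval includes the drive-fill test with TRIM disabled, so its write amplification is expected to be lower than the lifetime figure.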

 

Kilobyte Kid

Re: MX500, very high write amplification

I, too, have this 'issue' on an MX500 500GB SATA drive. It is used in a Mac Mini under OS X 10.11. The drive has been installed for about 27 days, and the computed Write Amplification is over 17. I have been monitoring the Flash page write statistics like the OP. As in the OP's case, I have periods of low ('normal') WA followed by periods where over 1 GiB is written to the FTL in a 15 minute period, when less than 10 MiB was actually written to the device by the host. Again like the OP, periods of significant read activity lower the occurrence of these excess write episodes. I would posit that this simply delays the internal garbage collection because the device is "busy".
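
A rough way to flag such episodes in a per-minute log is sketched below. The thresholds are the ones described above (more than 1 GiB of FTL writes in 15 minutes with less than 10 MiB of host writes). The page size of 29300 bytes is the OP's estimate and may not apply to every drive, and the function name and input layout are hypothetical.

```python
def find_bursts(samples, window=15, page_bytes=29300,
                ftl_min_bytes=2**30, host_max_bytes=10 * 2**20):
    """Return start indices of windows with large FTL writes but few host writes.

    samples: list of (host_pages, ftl_pages) cumulative counters (SMART 247 and
    248), one entry per minute. page_bytes is an assumed page size.
    """
    bursts = []
    for i in range(len(samples) - window):
        host_start, ftl_start = samples[i]
        host_end, ftl_end = samples[i + window]
        host_bytes = (host_end - host_start) * page_bytes
        ftl_bytes = (ftl_end - ftl_start) * page_bytes
        if ftl_bytes > ftl_min_bytes and host_bytes < host_max_bytes:
            bursts.append(i)
    return bursts


# Synthetic example: light host activity, one ~1.1 GiB FTL jump after minute 10
samples = []
host, ftl = 0, 0
for minute in range(30):
    samples.append((host, ftl))
    host += 10
    ftl += 40000 if minute == 10 else 10

print(find_bursts(samples))
```

Every 15-minute window that contains the jump is reported, so consecutive indices belong to the same episode.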

 

I have no pending ECC counts, I think that is a red herring.

 

For example from today's log (F: is delta FTL pages/smart 248 and H: is delta host pages/smart 247)

Screen Shot 2019-04-04 at 1.03.23 PM.jpg

waf.aengus-day.jpg

Kilobyte Kid

Re: MX500, very high write amplification

Great to know that somebody else had a look at his own MX500 on a different operating system and configuration and found a similar behavior.

 

Yes, it's not really an "immediate issue". My main point is that it makes SSD wear much faster than it would otherwise be, since the usual workloads do not comprise server-like usage patterns.

 

More importantly, it is clear that at this rate the media wearout indicator would reach 100% at a significantly lower total TBW than what Crucial specifies for this model (180 TBW). I'm at 91 average erase counts at just 5.85 TiB written (6% wear), which would mean reaching 100% (1500 erase counts) at less than 100 TBW.
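
The projection above is a simple linear extrapolation, sketched here. It assumes wear keeps growing in proportion to host writes, and that the host writes so far are about 5.85 TiB, as the SMART data in this thread implies. The function name is hypothetical.

```python
def projected_host_writes_at_wearout(avg_erase_count, host_written,
                                     rated_erase_cycles=1500):
    """Linearly extrapolate host writes at 100% wear (same unit as host_written)."""
    return host_written * rated_erase_cycles / avg_erase_count


# 91 average erase counts after about 5.85 TiB of host writes
projected_tib = projected_host_writes_at_wearout(91, 5.85)
print(f"projected host writes at 100% wear: {projected_tib:.1f} TiB")
```

The result is a little over 96 TiB, well short of the 180 TBW rating even allowing for the difference between TiB and TB.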

 


@coldcanuck wrote

I have no pending ECC counts, I think that is a red herring

This is more easily observed by reading SMART attributes at a quicker rate. Then, a correlation between FTL write bursts and these transient pending ECC count events becomes apparent. You can see in the longer-term graph below (covering the period since I installed the SSD on a different Windows 10 PC after performing a secure erase) that at least in my case such counts tend to cluster in the periods where the internal garbage collection-induced (?) WAF is higher.

 

image.png

Of course, this could depend on the actual specimen tested and not be a general behavior of all Crucial MX500 drives produced.

Crucial Employee

Re: MX500, very high write amplification

@s12a 


We're unable to reproduce any of these results on several blind MX500 samples with a basic Windows OS and a few programs installed.

My guess is it is specific to unique scenarios like you said. As always, having a drive with less and less free space will hamper wear leveling, and depending on how much 24/7 activity the drive has, combined with erase activity, you may neuter Garbage Collection's ability to do its job as well.






Crucial_Benny, Micron CPG Support, US


Kilobyte Kid

Re: MX500, very high write amplification

@Crucial_Benny

For what it's worth, upon asking other MX500 users on a different forum, I haven't observed this to be a widespread issue either. It seems it only affects certain people, and @coldcanuck and I, as reported above, happen to be among those.

 

Again, I don't consider my write activity to be high (roughly 1 GiB/hour on average), intensive or comprising server loads, but my SSD is installed on a PC that is turned on 24/7 (implying about 24 GiB/day; usually less than this). At the moment TRIM is enabled and reportedly working, and the OS and data are on two different partitions that currently (besides the short test I did days ago) have plenty of free space:

 

image.png

 

To be clear, I'm not asking Crucial to replace my drive with a different one, but if there is some correctable firmware issue that depends on some (dated, although not really "old") hardware configurations or unusual usage conditions (I realize that not everybody might keep their PC on 24/7) then I'll be happy to share any information and perform any test within reasonable limits that could help solve it for the next release.

 

I still believe that, among other things, the fact that the SMART "Power On Hours" attribute does not seem to increase as fast as real-time hours (unless continuous SSD read/write activity is present) even though the computer is always on and never goes into suspension or hibernation mode could be indicative of something on the SSD firmware side.

Crucial Employee

Re: MX500, very high write amplification


@s12a wrote:

 

I still believe that, among other things, the fact that the SMART "Power On Hours" attribute does not seem to increase as fast as real-time hours (unless continuous SSD read/write activity is present) even though the computer is always on and never goes into suspension or hibernation mode could be indicative of something on the SSD firmware side.


When the drive goes into devsleep mode it will stop tracking power on hours. The drive may go into devsleep throughout the day if it isn't getting any activity from the system. Even with Windows sitting idle with no sleep mode set, the drive may go into devsleep provided there aren't any new read/write requests for data on the drive. Windows doesn't have any direct control over the power saving features of the drive, except that features like Link Power Management can override these settings. We typically don't recommend software power management, since drives are almost always better off managing their own power parameters.





Crucial_Benny, Micron CPG Support, US


Kilobyte Kid

Re: MX500, very high write amplification


@Crucial_Benny wrote:

When the drive goes into devsleep mode it will stop tracking power on hours. The drive may go into devsleep throughout the day if it isn't getting any activity from the system. Even with Windows sitting idle with no sleep mode set, the drive may go into devsleep provided there aren't any new read/write requests for data on the drive. Windows doesn't have any direct control over the power saving features of the drive, except that features like Link Power Management can override these settings. We typically don't recommend software power management, since drives are almost always better off managing their own power parameters.


In my case power on hours are about 25-30% of real-time hours on average, and in the long term this seems relatively constant throughout the day, slightly increasing during actual user activity. Needless to say, I have never encountered a similar behavior before with other SSDs, including another that is currently installed as a secondary drive and which is supposed to support DevSleep as well.

 

The Windows power plan is set to "High Performance", and after enabling the bits for showing it in advanced options, AHCI link state power management when plugged in appeared to be already set to not initiate DevSleep. On battery mode it was set to HIPM+DIPM; I've set it to active as well just in case. The PC is always plugged in, however. Note that I get battery settings due to the presence of a USB-enabled uninterruptible power supply.

 

image.png  image.png

 

Anyway, this could be something separate from, or only loosely correlated with, the high write amplification encountered (since high drive activity, which makes power on hours increase as fast as real-time hours, seems to prevent or delay it).

 

Earlier, at about 2019-04-11 12:30 local time, I tried power cycling the SSD (turning the PC off and on again), and apparently this had an effect on the ongoing 1 GiB FTL (internal) write spikes that seem to be the culprit of the issue, without any other change in system or user activity patterns. As I mentioned earlier, I rarely turn the PC off.

 

situation.jpg

 

About 14 hours later I noticed another such spike, and about 30 minutes after that I power cycled the PC again. I'm wondering whether doing so whenever the phenomenon starts to arise could prevent it.