Dead NVMe drive
22 October 2025
Yesterday, as I was working, an application that had just updated itself refused
to start. A quick glance at journalctl sent shivers down my spine:
nvme0n1: Read(0x2) @ LBA 362395648, 2048 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) MORE DNR
critical medium error, dev nvme0n1, sector 362395648 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
"Wait a minute, that's not harmless -- is my NVMe dying on me?"
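For orientation, the sector number in the kernel message can be turned into a byte offset on the device. This is only a minimal sketch: it assumes 512-byte sectors (the unit the Linux block layer uses for sector numbers) and that the "2048 blocks" in the NVMe message are also 512 bytes each. The constant and function names are made up for illustration.

```python
# Values copied from the two kernel log lines above.
SECTOR = 362395648   # first sector of the failed read
BLOCKS = 2048        # length of the failed read, in blocks
SECTOR_SIZE = 512    # bytes per sector (assumption, see lead-in)


def sector_to_offset(sector: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return the byte offset on the device at which the given sector starts."""
    return sector * sector_size


offset = sector_to_offset(SECTOR)
length = BLOCKS * SECTOR_SIZE

print(f"failed read starts at byte {offset} ({offset / 2**30:.1f} GiB into the drive)")
print(f"failed read length: {length} bytes ({length // 2**20} MiB)")
```

Under those assumptions the failed read is a single 1 MiB request roughly a third of the way into the 500 GB drive.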
My important files, like photos, are properly backed up, but suddenly losing your home directory is never fun. So I started copying files to another drive (an older SATA SSD), and was immediately faced with more of those scary errors.
Oh no.
How far has the rot spread? Suddenly the drive was unable to read a small number of seemingly random files -- some that I update constantly and some that I haven't touched in over five years.
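One way to get a list of affected files is to read every file end to end and note which ones fail. The following is a rough sketch of that idea, not the procedure actually used here; `find_unreadable` is a hypothetical helper, and files that are already in the page cache may read fine even if the blocks underneath have gone bad.

```python
import os


def find_unreadable(root: str, chunk_size: int = 1 << 20) -> list[tuple[str, str]]:
    """Read every regular file under root; return (path, error) for each failure."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, FIFOs and other non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    # Read in chunks and discard the data; only errors matter.
                    while f.read(chunk_size):
                        pass
            except OSError as exc:
                # A failing drive typically surfaces here as an I/O error.
                failures.append((path, str(exc)))
    return failures


if __name__ == "__main__":
    for path, error in find_unreadable(os.path.expanduser("~")):
        print(f"{path}: {error}")
```

Permission errors end up in the same list, so the error text is worth reading before concluding that a file is damaged.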
Taking a closer look after the very first round of rescue operations:
~ % sudo smartctl -a /dev/nvme0n1
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.17.3-arch2-1] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 960 EVO 500GB
Serial Number: S3EUNX0J703219V
Firmware Version: 2B7QCXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 500 107 862 016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 2
NVMe Version: 1.2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500 107 862 016 [500 GB]
Namespace 1 Utilization: 500 027 408 384 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5771b122cb
Local Time is: Tue Oct 21 15:50:50 2025 EEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 77 Celsius
Critical Comp. Temp. Threshold: 79 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.04W - - 0 0 0 0 0 0
1 + 5.09W - - 1 1 1 1 0 0
2 + 4.08W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1500
4 - 0.0050W - - 4 4 4 4 2200 6000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 47 Celsius
Available Spare: 99%
Available Spare Threshold: 10%
Percentage Used: 17%
Data Units Read: 24 212 342 [12,3 TB]
Data Units Written: 53 312 310 [27,2 TB]
Host Read Commands: 429 606 470
Host Write Commands: 560 745 499
Controller Busy Time: 3 088
Power Cycles: 3 231
Power On Hours: 4 542
Unsafe Shutdowns: 141
Media and Data Integrity Errors: 57
Error Information Log Entries: 14 457
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 47 Celsius
Temperature Sensor 2: 60 Celsius
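Note that the overall self-assessment says PASSED even though the log records 57 media and data integrity errors, so the individual counters deserve a look of their own. Below is a small sketch that pulls integer counters out of text like the output above and flags a few of them. The helper names and the choice of what counts as a warning sign are assumptions made for this example, not anything smartctl itself defines.

```python
import re

# A few lines copied from the SMART data section above.
SAMPLE = """\
SMART overall-health self-assessment test result: PASSED
Available Spare: 99%
Available Spare Threshold: 10%
Data Units Written: 53 312 310 [27,2 TB]
Media and Data Integrity Errors: 57
Error Information Log Entries: 14 457
"""


def parse_health(text: str) -> dict[str, int]:
    """Extract the leading integer from each 'Name: value' line, if there is one."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        # Drop the bracketed human-readable suffix and the digit-group spaces.
        digits = re.sub(r"\s", "", value.split("[")[0])
        match = re.match(r"\d+", digits)
        if match:
            fields[name.strip()] = int(match.group())
    return fields


def warning_signs(health: dict[str, int]) -> list[str]:
    """Return human-readable notes for counters that look worrying (assumed rules)."""
    signs = []
    media_errors = health.get("Media and Data Integrity Errors", 0)
    if media_errors > 0:
        signs.append(f"{media_errors} media and data integrity errors")
    spare = health.get("Available Spare")
    threshold = health.get("Available Spare Threshold")
    if spare is not None and threshold is not None and spare <= threshold:
        signs.append(f"available spare {spare}% is at or below threshold {threshold}%")
    return signs


print(warning_signs(parse_health(SAMPLE)))
```

For this drive only the media error count trips a rule; the spare capacity is still far above its threshold.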
I bought this drive in 2017, so I guess it's a bit on the older side, but I didn't imagine it would fail this fast, especially with so little data written. It is rated for 200 TBW, but sadly it only had three years of warranty, so I'm far out of that window.
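To put "so little data written" into numbers, here is a back-of-the-envelope calculation. It assumes the NVMe convention that one data unit is 1000 blocks of 512 bytes, and it takes the 200 TBW rating at face value with 1 TB = 10^12 bytes; the variable names are just for illustration.

```python
# "Data Units Written" from the SMART output above.
DATA_UNITS_WRITTEN = 53_312_310
# Assumption: one NVMe data unit is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512
# Rated endurance mentioned above, in terabytes written (1 TB = 10**12 bytes).
RATED_TBW = 200

written_bytes = DATA_UNITS_WRITTEN * BYTES_PER_DATA_UNIT
written_tb = written_bytes / 10**12
fraction_of_rating = written_tb / RATED_TBW

print(f"written: {written_tb:.1f} TB")
print(f"share of rated endurance: {fraction_of_rating:.1%}")
```

That works out to roughly 14% of the rated endurance, in the same ballpark as the 17% the drive reports as "Percentage Used".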
I believe I got incredibly lucky to spot this so soon. The failure mode is scary in that random files have become unreadable, but at least it is not the worst-case scenario, in which files would get silently corrupted.
Here is a tabular look at all the drives attached to my workstation:
| Manufacturer / Model | Capacity | Power-On Hours | Power Cycles | Unsafe Shutdowns | Data Read | Data Written |
|----------------------|----------|----------------|--------------|------------------|-----------|--------------|
| OCZ Vertex 3         | 240 GB   | 64245 h        | 6406         | 360              | 13.2 TB   | 8.9 TB       |
| Crucial BX100        | 500 GB   | 1338 h         | 5217         | 111              | 25.1 TB   | 22.0 TB      |
| Samsung SSD 850 EVO  | 500 GB   | 27958 h        | 5129         | 148              |           | 14.8 TB      |
| Samsung SSD 860 EVO  | 1 TB     | 23084 h        | 3764         | 111              |           | 21.2 TB      |
| Samsung SSD 960 EVO  | 500 GB   | 4543 h         | 3240         | 144              | 12.6 TB   | 27.2 TB      |
| Samsung SSD 980 PRO  | 1 TB     | 7777 h         | 2897         | 127              | 36.3 TB   | 28.4 TB      |
All the other drives report very healthy SMART stats, although I don't for a second believe the numbers reported by the SandForce-based OCZ Vertex 3 -- that was my main drive for years around 2011.
I used to think that Samsung's EVO line was a decent choice for consumer NVMe drives, but now I'm no longer sure. I have a newer Samsung SSD 980 PRO 1TB as my dual-boot & gaming drive; fingers crossed that it won't give up so easily.
I was reminded that the 980 PRO famously suffered from poor firmware that needed to be updated: https://www.tomshardware.com/news/samsung-980-pro-ssd-failures-firmware-update -- although the problem may have mostly affected 2TB drives.
Regardless, I immediately ordered a Western Digital WD_BLACK SN850X 2TB as a replacement after a very brief roundup of reviews. It's also a great excuse to do a clean install of Arch Linux, as the current installation had been collecting cruft since early 2020 or so. I'll write a separate post about that.
I didn't lose anything vitally important thanks to taking regular backups. I recommend that every one of you make sure to do the same -- it's easier than ever with companies like https://www.backblaze.com/ and self-hosted options like https://syncthing.net/.
Also, test that your backups actually work. Just like at work.
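One simple way to test a backup is to restore it somewhere and compare checksums against the originals. The sketch below assumes the backup has already been restored (or mounted) as a plain local directory; `sha256_of` and `compare_trees` are hypothetical helpers and have nothing to do with how Backblaze or Syncthing verify data internally.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source: Path, backup: Path) -> dict[str, list[str]]:
    """Check that every regular file under source exists, unchanged, under backup."""
    report = {"missing": [], "different": []}
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = backup / rel
        if not dst.is_file():
            report["missing"].append(str(rel))
        elif sha256_of(src) != sha256_of(dst):
            report["different"].append(str(rel))
    return report
```

Files that changed after the last backup run will show up as "different" too, so the report needs a human eye rather than blind trust.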
