Predicting hard drive failure using SMART stats

Backblaze stores a ton of data for their clients, managing more than 40,000 hard drives and more than one hundred million gigabytes of data. Detecting imminent drive failure is important to them.

In Backblaze's own words: "Every disk drive includes Self-Monitoring, Analysis, and Reporting Technology (SMART, see https://en.wikipedia.org/wiki/S.M.A.R.T.), which reports internal information about the drive. Initially, we collected a handful of stats each day, but at the beginning of 2014 we overhauled our disk drive monitoring to capture a daily snapshot of all of the SMART data for each of the 40,000 hard drives we manage. We used Smartmontools to capture the SMART data."
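
To make the idea of a SMART snapshot concrete, here is a minimal Python sketch of reading one drive's attributes with the smartctl tool from Smartmontools. It is not Backblaze's collection code, which the post does not show. The function names, the dictionary layout, and the sample text are assumptions made for illustration, and the sample values are invented rather than taken from a real drive. The parser expects the usual ten-column ATA attribute table printed by `smartctl -A`; drives that report in a different layout (NVMe, for example) would need different handling.

```python
import subprocess

# Invented sample in the layout `smartctl -A` typically prints for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1825
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (Min/Max 20/45)
"""


def _to_int(field):
    """Return the field as an int, or None if it is not purely numeric."""
    return int(field) if field.isdigit() else None


def parse_smart_attributes(text):
    """Parse an attribute table into {attribute_id: {...}} (hypothetical layout)."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header row and any surrounding banner text.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attrs[int(parts[0])] = {
            "name": parts[1],
            "value": _to_int(parts[3]),    # normalized value
            "worst": _to_int(parts[4]),    # worst normalized value seen
            "thresh": _to_int(parts[5]),   # vendor failure threshold
            "raw": " ".join(parts[9:]),    # raw value may contain spaces
        }
    return attrs


def snapshot(device):
    """Run smartctl on one device (usually needs root) and parse its attributes."""
    # check=False: smartctl's exit status is a bit mask and can be nonzero
    # even when the attribute table was printed successfully.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_smart_attributes(result.stdout)


if __name__ == "__main__":
    for attr_id, attr in sorted(parse_smart_attributes(SAMPLE).items()):
        print(attr_id, attr["name"], attr["raw"])
```

Calling `snapshot("/dev/sda")` once a day per drive and storing the result with a date and serial number would give a daily history of the kind described above. The device path here is only an example.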

Read the linked blog post to learn about hard drive failure, what SMART does and does not do well, and the decisions Backblaze makes to minimize drive failure and to maximize their chances of catching a drive failure before it happens.