[AusNOG] [AUSNOG] Disk wear & Foucault Period
Matthew Moyle-Croft
mmc at mmc.com.au
Thu Aug 22 14:12:43 EST 2019
I preface this by saying - “I recommend that all my competitors spend their engineering time and effort on solving for this problem”.
> 3 - If we figure a drive is good for 1M restarts, then you'd expect precession to cause 0.2% of disks to fail over a 5 year lifespan
So, let’s look at the Backblaze numbers (https://www.backblaze.com/blog/2018-hard-drive-failure-rates/). Their DCs, in LA and AZ, are approximately as far north as Adelaide and Sydney are south.
They see a 1.27% annual failure rate across their fleet. If you look at their numbers, the variation is due far more to the drive model than to anything else. If you’re claiming 0.2% over FIVE years (i.e. not annually but across 5 years) then, as before, this isn’t going to have a significant impact on the fleet. It’s also worth noting that they don’t seem to distinguish controller failure from mechanical failure.
0.2% over 5 years is 0.04% per year, which for their fleet is an extra 40 drives per year against 4455 failures per annum. 40 drives per annum is around USD$8k ($200/drive at scale). If you said to Backblaze, “I’d like to save you $8k, but you’ve got to spend a few million to realign the racks”, then, well, it’s not a business case I’d defend. (Most of the cost would be rebuilding the DCs so the cooling was arranged properly, as well as all the cabling, hot aisle containment etc etc.)
At 0.2% across 5 years you’d have to have more than 1000 drives to make this something worth caring about (1000 drives means 2 extra failed drives over 5 years).
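For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch. The 100,000-drive fleet size is an assumption, back-derived from "40 drives per year" at 0.2% over 5 years rather than taken from the Backblaze report, and the function names are made up for illustration.

```python
# Back-of-envelope check of the figures quoted in this post.
PRECESSION_FAILURE_5Y = 0.002  # claimed 0.2% of drives lost to precession over 5 years
LIFESPAN_YEARS = 5
DRIVE_COST_USD = 200           # "$200/drive at scale", from the post


def extra_failures_per_year(fleet_size):
    """Expected extra drive failures per year attributed to precession."""
    return fleet_size * PRECESSION_FAILURE_5Y / LIFESPAN_YEARS


def extra_cost_per_year(fleet_size):
    """Replacement cost in USD of those extra failures, per year."""
    return extra_failures_per_year(fleet_size) * DRIVE_COST_USD


# ASSUMPTION: fleet size implied by 40 extra drives/year at 0.04%/year.
IMPLIED_FLEET = 100_000

if __name__ == "__main__":
    for fleet in (1_000, IMPLIED_FLEET):
        per_year = extra_failures_per_year(fleet)
        print(
            f"{fleet:>7} drives: {per_year:.1f} extra failures/year, "
            f"{per_year * LIFESPAN_YEARS:.0f} over {LIFESPAN_YEARS} years, "
            f"about ${extra_cost_per_year(fleet):,.0f}/year"
        )
```

Plug in your own fleet size and drive cost; the conclusion only changes if either is orders of magnitude larger than the numbers here.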
> 4 - Whether this shows up in MTBF depends on measurement techniques, and whether the effect is above the random noise
See above.
Feel free to argue how I’ve butchered probability etc, but I doubt it makes the business case “better”.
MMC