/proc/mdstat gives the current status of all software raid arrays. Here’s an example of its format from a system with four correctly functioning raid arrays:
# cat /proc/mdstat
Personalities : [raid1]
md15 : active raid1 sda5[0] sdb5[1]
      534586816 blocks [2/2] [UU]
md14 : active raid1 sda3[0] sdb3[1]
      8096640 blocks [2/2] [UU]
md13 : active raid1 sda2[0] sdb2[1]
      81915328 blocks [2/2] [UU]
md12 : active raid1 sda1[0] sdb1[1]
      530048 blocks [2/2] [UU]
unused devices: <none>
I’m glad to say I don’t have any software raid arrays with failing drives here, but the Linux raid wiki ( https://raid.wiki.kernel.org/index.php/Mdstat ) has some examples of how that file looks when there are problems.
From that page we learn that there are two things to look out for. First, if a drive is missing from an array, one of the “U”s will be an underscore (“_”). Second, if a drive is failing, the normal “[0]” will become “[0](F)”. Since that 0 could be a 1 or some other digit, we’ll just ignore the number entirely and look for “\](F)” (we need the backslash again to signal that we’re interested in the literal right square bracket character). Here are our new checks:
/proc/mdstat Does not contain _
/proc/mdstat Does not contain \](F)
Note that this only detects problems with software raid; hardware raid cards need to be handled separately. Since we deal mostly with cloud servers that don’t have direct access to hardware, software raid will be more prevalent among the systems that use raid at all. Because these two rules both alert on bad things in /proc/mdstat, they’ll gracefully handle the case where you have no software raid arrays at all.