Author Topic: Every Verify Sync I get some bad blocks - does that mean my array is unstable?  (Read 273 times)

Offline gwpt

  • Full Member
  • ***
  • Posts: 112
  • Karma: +0/-0
    • View Profile
Hi all,
About once every couple of weeks, I run a Verify Sync, which takes about 24 hours (I have a 55TB array) and always returns a few bad blocks; not many, but I would say a minimum of 7-10.
Does that mean that if I ever had a bad drive, I would not be able to restore my data correctly?
My understanding of RAID is that any errors mean a restore is not possible...

I only have 1 PPU at the moment, would adding a second help with this?
Thanks
Guy

Offline Skirge01

  • Full Member
  • ***
  • Posts: 203
  • Karma: +5/-0
    • View Profile
Two suggestions, but first, read up on the different verify functions, just to ensure you're familiar with what each one does.  Otherwise, this may not make sense.

1.  Do you have drive monitoring software like HD Sentinel keeping an eye on the condition of your drives?  If not, I strongly recommend you install one.  This is the first line of defense against drive failure.  If you have it set to the strictest monitoring, you're unlikely to ever have a drive fail before you can swap it out, meaning you'll always be able to transfer the data off the failing drive and not lose any data.

2.  The way I monitor the array itself is I run a Verify+ every day on a portion of my array, which means the entire array is checked about every week or so.  If that task ever encounters errors, I run the Forensic plugin to see WHAT had issues.  What files?  What drive?  This allows me to consider why that data may have failed the check.  Was a backup running during the Verify+?  Was a file transfer in progress?  Was a Windows Update going on?  Are the files important?  If they are important, I can examine them individually and ensure they're still okay.  Assuming they are, then I can run the Verify Sync to update the parity to reflect the "good" status of the live data.  However, if there's an issue with the files, then I may want to restore FROM parity.
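To make the rotation in #2 concrete, here's a toy sketch of the idea in Python. This is purely illustrative: FlexRAID's scheduler isn't scripted this way, and the names (TOTAL_BLOCKS, PORTION_DAYS) are made up. It just shows how verifying a fixed fraction each day covers the whole array over the cycle.

```python
# Hypothetical sketch of a daily partial-verify rotation.
# Not FlexRAID's real scheduler; block counts are invented for illustration.

TOTAL_BLOCKS = 55_000  # pretend the array is split into this many verify blocks
PORTION_DAYS = 7       # check 1/7th of the array per day

def daily_range(day: int) -> range:
    """Return the block range to verify on a given (0-indexed) day."""
    per_day = -(-TOTAL_BLOCKS // PORTION_DAYS)  # ceiling division
    start = (day % PORTION_DAYS) * per_day
    return range(start, min(start + per_day, TOTAL_BLOCKS))

# Over any 7 consecutive days, every block gets checked exactly once.
covered = set()
for day in range(PORTION_DAYS):
    covered.update(daily_range(day))
assert covered == set(range(TOTAL_BLOCKS))
```

The same arithmetic applies whatever range units the verify UI exposes; the point is that the day-to-range mapping wraps around, so the schedule repeats cleanly.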

I know this didn't directly answer your question, but I got the feeling the direct answer wouldn't have helped until you understood #2.  Assuming everything I wrote was clear, then the answer to your question is:  If your parity is correct, then, yes, you could restore your data.

A 2nd PPU would help you in the event that you had a 2nd DRU fail before you could get the first DRU replaced and restored.  You may also want to read over the article on dealing with a dropped disk, so that you're clear on what the array is doing at that point, as well as what the restore process would be.  It's critical you understand both of those because doing the wrong thing or doing things in the wrong sequence could absolutely cause a loss of (more) data.

Offline gwpt

  • Full Member
  • ***
  • Posts: 112
  • Karma: +0/-0
    • View Profile
Hi, thanks a lot for the tips and info! I've been trying to study up on all the points you made.

Just bought HD Sentinel, looks like an awesome app. I've set it up with notifications. Seems like a great first line of defense.
I've also read those two pages you linked to. I had skimmed them before, but I think a deeper understanding is a good idea. I am actually going to set up a tiny RAID and test failing a drive.

I do have a couple of questions regarding your comments:

* I assume when you say "run a Verify+ every day on a portion of my array" you mean the 'specific range operation'. I can work out how to do that manually, but how do you schedule it?

* I got the forensic plugin working and ran a Verify+. There were a few bad blocks. I must admit, I am not quite sure what the output is telling me. (I've included the JSON.)
  - multiple disks seem to have an issue with: \\$LogFile::$DATA - is this bad?
  - and six files had an issue; what should I do with them? I checked the files, and they seem OK...

The thing I am trying to understand is this: if I had tried to restore a failed disk with the drives in this state, would only those 6 files come back corrupt? Or could I not trust that any file would restore correctly, since there were issues with the array?

Thanks for your help! :)
« Last Edit: January 28, 2019, 04:42:47 am by gwpt »

Offline Skirge01

  • Full Member
  • ***
  • Posts: 203
  • Karma: +5/-0
    • View Profile
It might not be that obvious where the range quantity is, so the first attachment shows exactly where that is.  Any time a scheduled operation runs, it will run on that amount of the array.  You schedule it via the interface section in the 2nd attachment.  Hope that helps.

Quote
  - multiple disks seem to have an issue with: \\$LogFile::$DATA - is this bad?
Probably not.  I never worry about that kind of stuff, assuming they're just log files which were actively being used while the verify was happening.

Quote
  - and six files had an issue; what should I do with them? I checked the files, and they seem OK...
Perfect!  That's exactly how you handle it!  ::good job::   ;D  Do keep in mind that the forensic plugin only works on the results of the latest verify.  So, if you run a verify and get errors, but don't run a forensic before the next verify, you'll never know what the errors were for that first verify.  Hope that wasn't too confusing.

You're clearly understanding things now because you're asking a VERY good question.  If the drive containing those 6 files you mentioned had failed and been dropped by the array, there's a chance that the parity could also be corrupted.  Reason?  Since this is a theoretical, we don't know what caused the drive to be dropped.  If the read/write head of the drive was causing issues, then the parity could certainly have issues now.  The thing is, if the head was bad, it could have gone bad after the last verify was done (i.e. showing those 6 files with issues) and now there could be MORE data, both on the failed drive AND in the parity, which is corrupted from that head issue.  At the same time, the verify could have found errors BECAUSE of the failing drive head.  In other words, the parity could have been correct (i.e. written before the drive head started failing), but it didn't match because the now-failing head couldn't read the data correctly to make a match.
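One thing worth seeing concretely: with single parity, corruption is local to the stripe it sits in. Here's a toy XOR-parity example in Python (this is a generic single-parity illustration, not FlexRAID's actual engine) showing that a bad parity block only corrupts the data rebuilt from that one stripe, while other stripes still restore correctly.

```python
# Toy single-parity (XOR) demo: one corrupt parity block corrupts only the
# stripe it belongs to on restore.  Generic illustration, not FlexRAID code.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# Three data drives (DRUs) and one parity drive (PPU), two stripes each.
d1 = [b"AAAA", b"aaaa"]
d2 = [b"BBBB", b"bbbb"]
d3 = [b"CCCC", b"cccc"]
parity = [xor_blocks([d1[s], d2[s], d3[s]]) for s in range(2)]

# Drive 2 fails; rebuild each stripe from the survivors plus parity.
rebuilt = [xor_blocks([d1[s], d3[s], parity[s]]) for s in range(2)]
assert rebuilt == d2  # with good parity, the restore is exact

# Flip one byte of parity in stripe 0 only, then rebuild again.
bad = bytearray(parity[0])
bad[0] ^= 0xFF
parity[0] = bytes(bad)
rebuilt = [xor_blocks([d1[s], d3[s], parity[s]]) for s in range(2)]
assert rebuilt[0] != d2[0]  # the stripe with bad parity comes back corrupt
assert rebuilt[1] == d2[1]  # the other stripe still restores correctly
```

So in the clean theoretical case, only files touching the bad stripes would be affected. The catch, as above, is that a failing head can keep writing new corruption, which is why you can't always trust the rest of the restore in practice.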

So, could you trust the parity?  It depends what the problem is with the drive.  That's why it's so important to not depend on RAID as your backup solution.  RAID ≠ Backup!  If you have data you can't afford to lose if a drive fails, ensure it is backed up.  RAID is meant to keep your server's storage running/available while you replace a failed/failing drive, maybe a speed boost, and even a nice drive pool.

Most of my array is filled with TV shows and movies, so I can easily replace them if a drive dies.  However, family photos and other important data are backed up in multiple locations on the array (there are free tools to automate such tasks if you don't want to do it manually).  Computer backups are performed nightly, so if one backup is lost from a failed drive, I have plenty of others to pull from.  It's unlikely that losing even a week's (or more) worth of computer backups would cause a loss of any critical data in my setup.

It's amazing the peace of mind you have when there's a solid backup plan in place before a drive fails.  Okay, I've written a long enough novel here.  LOL!

Offline cogliostrio

  • Jr. Member
  • **
  • Posts: 65
  • Karma: +2/-0
    • View Profile
Have a look at the CQ salt value according to this thread as well.

http://forum.flexraid.com/index.php?topic=4812.0

Offline Skirge01

  • Full Member
  • ***
  • Posts: 203
  • Karma: +5/-0
    • View Profile
Quote
  Have a look at the CQ salt value according to this thread as well.

  http://forum.flexraid.com/index.php?topic=4812.0

Another good tip!  I completely forgot about that parameter.  If you're frequently getting verify issues, increasing the salt could help alleviate them.  Considering I run a partial verify every day with the expectation of getting the entire array verified over a 2 week period, "frequently" for me meant several times a month.  Your definition of frequently will likely be different.

Offline gwpt

  • Full Member
  • ***
  • Posts: 112
  • Karma: +0/-0
    • View Profile
Thanks everyone. So far, so good. No verify errors for a while now :)