Author Topic: Background Verify & Repair  (Read 2637 times)

Offline dscline

  • Sr. Member
  • ****
  • Posts: 272
  • Karma: +6/-0
Background Verify & Repair
« on: February 07, 2012, 08:57:14 am »
"Data health" topics have come up in the past in the old forum, but this is something I think is very important, and worth bringing up again.  I have had at least one case in the past where my parity was invalid, but I didn't know it until doing a verify.  The problem is, a verify is a very expensive operation on a large data set, so I very rarely do them.  It takes well over 24hrs, of which the server is pretty much pegged, making it unusable for streaming.  Whenever I have any little issues with an update pop up, I always have this nagging concern that an error could have crept in, but I never want to take the server down long enough to verify it.  Even if the parity is ok, there's always the concern of bit-rot.

For me, the best new feature for FlexRAID would be a better way to monitor the health of the parity and the data.  It would be great if we could have some kind of "background verify", that could be scheduled to run during certain times, optionally be throttled to run at a slow enough pace such that it doesn't bog down the server or impede other data access, and provide a report of any issues and an easy way to fix them.
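To make the "throttled" part of that request concrete, here is a minimal sketch of rate-limited reading in Python. It is not FlexRAID code; the function name `throttled_chunks` and its parameters are hypothetical. It reads a stream in fixed-size chunks and sleeps whenever it gets ahead of a target byte rate, so a scan leaves I/O headroom for streaming.

```python
import time


def throttled_chunks(stream, chunk_size, max_bytes_per_sec,
                     sleep=time.sleep, clock=time.monotonic):
    """Yield chunks from `stream`, pausing to stay at or under the given rate.

    `sleep` and `clock` are injectable so the pacing can be tested without
    real delays. All names here are illustrative, not a FlexRAID API.
    """
    if chunk_size <= 0 or max_bytes_per_sec <= 0:
        raise ValueError("chunk_size and max_bytes_per_sec must be positive")
    start = clock()
    sent = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        sent += len(chunk)
        yield chunk
        # Time at which `sent` bytes are allowed to have been read.
        allowed_at = sent / max_bytes_per_sec
        elapsed = clock() - start
        if allowed_at > elapsed:
            sleep(allowed_at - elapsed)
```

A caller would feed each yielded chunk to whatever comparison or checksum routine it uses. The pacing is cumulative rather than per-chunk, so a slow disk read "uses up" part of the delay instead of adding to it.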

Currently, I don't know of any way to do a verify other than a full verify... you have to run the whole thing.  It would be great if FlexRAID could keep some kind of internal database of what it's checked and when, allowing it to be run in small batches, picking up where it left off.  So if I had it set to run background validation from 4AM to 6AM, at 6AM it stops, then just picks up where it left off at 4AM the next day.  It might take a couple of weeks to check all the data in small bites like that, but by then it'd be worthwhile to start over again, constantly ensuring that the data and parity are healthy.
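The resume-where-it-left-off idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the array is treated as `total_blocks` numbered blocks, `check_block` is a hypothetical callback that returns True when a block verifies, and the "internal database" is reduced to a small JSON file holding the next block to check. None of these names come from FlexRAID.

```python
import json
import os
from datetime import time as dtime


def in_window(now, start=dtime(4, 0), end=dtime(6, 0)):
    """True if time-of-day `now` falls inside [start, end), e.g. 4AM to 6AM."""
    return start <= now < end


def load_cursor(path):
    """Return the next block to check, or 0 if no state has been saved yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_block"]


def save_cursor(path, next_block):
    """Persist the cursor via a temp file so a crash cannot leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_block": next_block}, f)
    os.replace(tmp, path)


def run_batch(state_path, total_blocks, check_block, keep_going):
    """Check blocks from the saved cursor until keep_going() returns False.

    Wraps around to block 0 after the last block, so successive batches
    keep cycling through the whole array. Returns the blocks that failed.
    """
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    bad = []
    block = load_cursor(state_path) % total_blocks
    while keep_going():
        if not check_block(block):
            bad.append(block)
        block = (block + 1) % total_blocks
        save_cursor(state_path, block)
    return bad
```

In use, `keep_going` could be something like `lambda: in_window(datetime.now().time())`, so the batch stops at 6AM and the next night's run resumes from the saved cursor.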

The second part of that is having the ability to fix any issues it finds.  If there's a discrepancy, then we should be able to either fix the data based on parity, or fix the parity.  Perhaps it exists, but I don't currently know of any way to "fix" the parity if something goes awry, other than rebuilding the entire set.
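The two repair directions can be illustrated with plain single-parity XOR. This is an assumption made for the sake of the example, not a description of FlexRAID's parity engine, and the function names are hypothetical. Note that a mismatch by itself only says that data and parity disagree; deciding which side to trust needs extra information, such as a per-block checksum or knowing which drive misbehaved.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_matches(data_blocks, parity):
    """True if the stored parity agrees with the current data blocks."""
    return xor_blocks(data_blocks) == parity


def rebuild_parity(data_blocks):
    """'Fix the parity': trust the data and recompute parity from it."""
    return xor_blocks(data_blocks)


def rebuild_data_block(data_blocks, parity, bad_index):
    """'Fix the data': trust the parity and the other blocks, and
    reconstruct the block at bad_index from them."""
    others = [b for i, b in enumerate(data_blocks) if i != bad_index]
    return xor_blocks(others + [parity])
```

Either repair only needs to touch the stripe that failed, which is why a targeted fix can be much cheaper than rebuilding the whole parity set.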

 :)
« Last Edit: February 07, 2012, 09:07:23 am by dscline »
WHS 2011
tRAID final 23 DRUs 2PPUs
Supermicro C2SEA, Q9505s (stock), 4GB
Supermicro AOC-SASLP-MV8
IBM m1015 flashed to LSI 9211-8i/IT + HP SAS Expander
Generic SiI3132

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,547
  • Karma: +204/-16
Re: Background Verify & Repair
« Reply #1 on: February 16, 2012, 02:14:10 pm »
Indeed, Verify is an expensive process.

Note however that there are workarounds.

1. Use Validate instead of Verify
2. Verify in small batches (by using the VerifyStart and VerifyCount properties)
3. Verify the processed size of each update operation (this requires parsing the logs to capture those values)


Offline dscline

  • Sr. Member
  • ****
  • Posts: 272
  • Karma: +6/-0
Re: Background Verify & Repair
« Reply #2 on: February 18, 2012, 09:09:31 am »
Quote
1. Use Validate instead of Verify
Yes, that's faster than Verify, but it's not as thorough, is it?  I thought the rough general recommendation was to run a validate weekly, and a verify monthly.  Since Verify is a bit-for-bit comparison, that would be the most definitive "yes/no, everything is/isn't healthy".  Regardless, my comments apply to Validate as well.  It would also be nice to have a "background validate" that could be stopped/resumed as needed.  :)
Quote
2. Verify in small batches (by using the VerifyStart and VerifyCount properties)
This could be an interim solution (and something I've inquired about before), but how would one accomplish this?  Currently I have my updates scheduled to occur every night along with several other server maintenance tasks, while we're all asleep and not needing the server.  I could see how I could reasonably break the array down into 30 separate chunks, and do one chunk each night of the month after an update, resulting in a full verify being done every month.  But the VerifyStart and VerifyCount properties are part of the config.  Can this be specified with the Expression Language?  I haven't dabbled in that yet, I need to see if I can find some examples.
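The 30-chunk arithmetic could look like the following Python sketch. It assumes VerifyStart and VerifyCount take a starting offset and a length in some unit (bytes, blocks, or whatever the properties actually expect; the thread does not say), and the helper names are hypothetical. How the resulting values get passed to FlexRAID is not covered here.

```python
from datetime import date


def nightly_range(total_units, chunks, day_index):
    """Return (start, count) for chunk `day_index` out of `chunks`.

    The chunks are contiguous and together cover exactly `total_units`;
    any remainder is spread one unit at a time over the first chunks.
    """
    if chunks <= 0 or not 0 <= day_index < chunks:
        raise ValueError("day_index must be in range(chunks)")
    base, extra = divmod(total_units, chunks)
    start = day_index * base + min(day_index, extra)
    count = base + (1 if day_index < extra else 0)
    return start, count


def range_for_date(total_units, today, chunks=30):
    """Pick tonight's chunk from the day of the month.

    Day 1 maps to chunk 0; on the 31st the index wraps back to chunk 0.
    """
    return nightly_range(total_units, chunks, (today.day - 1) % chunks)
```

For example, `range_for_date(total, date.today())` gives the pair to use as start and count for tonight's partial verify. In months shorter than 30 days the last chunks would be skipped, so a persistent counter may be a better index than the calendar day.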

Quote
3. Verify the processed size of each update operation (this requires parsing the logs to capture those values)
To what end?

I still think some kind of automated background scan would be a great feature to have.  We are all here because we want to protect our data.  I've had a case before where quick-validate said everything was OK, when in fact, it wasn't.  Giving FlexRAID a way to automatically monitor the data and parity, and confirm that everything really is and remains OK, would provide a lot of confidence that the data truly is protected.  It's not a bug, so it's not critical, but would certainly be a great feature to have.  :)

Offline dscline

  • Sr. Member
  • ****
  • Posts: 272
  • Karma: +6/-0
Re: Background Verify & Repair
« Reply #3 on: February 18, 2012, 02:17:56 pm »
Quote
Can this be specified with the Expression Language?  I haven't dabbled in that yet, I need to see if I can find some examples.

Ok, after searching, I've determined the answer is yes.  :)  However, the wiki and the few examples I can find are rather vague for someone who isn't familiar with it.  I'll create a new thread about doing this.