I’ve been continuing to think about this and have a couple more ideas for you.
* Continuous validation.
If I’m reading the updated task list correctly, I think you already have this included, but just want to be sure since our previous discussion of continuous/progressive operations focused on verify tasks.
* Hidden Files
I may be in the minority for desiring this feature, but my confidence that I have everything on my array protected would definitely be enhanced with an option to protect hidden files. When facing a failed DRU, I’d much rather be able to click that “restore” button and know that everything will be back to normal than wonder what will go missing because it wasn’t protected.
* Move on while the retry timer runs
When a file can’t be read, instead of stopping and waiting out the “retry” timer before trying again, would it be possible to have FlexRAID move on to the next file and “come back” to it after the timer is up? This might reduce the time it takes to do an operation if another program is accessing/changing a file when FlexRAID tries to access it.
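For what it's worth, here's roughly how I picture the "move on and come back" behavior — a minimal Python sketch, not FlexRAID's actual code, with made-up names (`read_file`) and arbitrary delay/attempt limits purely for illustration:

```python
import time
from collections import deque

def process_files(files, read_file, retry_delay=30, max_attempts=3):
    """Walk the file list, deferring busy files instead of blocking on them."""
    pending = deque(files)
    retry_queue = deque()            # (ready_time, path, attempts_so_far)
    processed, failed = [], []

    while pending or retry_queue:
        if retry_queue and retry_queue[0][0] <= time.monotonic():
            _, path, attempts = retry_queue.popleft()   # timer is up: come back to it
        elif pending:
            path, attempts = pending.popleft(), 0
        else:
            # Nothing left except deferred files: wait for the next timer to expire.
            time.sleep(max(0.0, retry_queue[0][0] - time.monotonic()))
            continue

        try:
            read_file(path)
            processed.append(path)
        except OSError:
            if attempts + 1 >= max_attempts:
                failed.append(path)                     # give up and log it
            else:
                retry_queue.append((time.monotonic() + retry_delay, path, attempts + 1))

    return processed, failed
```

The point is just that the operation keeps making progress on other files while the retry timer runs, instead of sitting idle.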
** The next couple ideas are more complicated and I’m not sure of the best terminology to use for a couple of the concepts, so bear with me a little here…
My understanding is that RAID-F calculates parity at the file level instead of the block level. So, acknowledging that this is GROSSLY oversimplified, is this a fair description of the concept…
DRU1 DRU2 PPU
Document1 + Document2 = 001.flxr
Movie1.iso + Movie2.mpg = 002.flxr
TVshow.divx + 1000 JPEGs = 003.flxr
50 MP3s + 75MB ZIP = 004.flxr
And so on…
…where parity data is calculated/based on groupings of files? Put another way, if a change occurs to Document2 above, the parity data for that line would be compromised until the next Update (so Document1 would no longer be protected if DRU1 were to fail), but the other groupings would still be safe. In this sense, RAID-F is more like a bunch of individual micro-arrays put together and one could become unhealthy without the others being damaged. Does that make sense?
(I’m sure the actual engine is much more complicated and efficient, but this mental picture should be good enough for our discussion as long as I’m right that a single bit mismatch would compromise only a small portion of the parity, as opposed to block-level RAID where it would have a bigger impact).
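To make that mental picture concrete, here's a toy version of one "micro array" using plain byte-wise XOR parity (the way RAID-4/5 does it — I have no idea what math RAID-F actually uses, so treat this purely as an illustration):

```python
from functools import reduce

def group_parity(files):
    """Byte-wise XOR parity across one group of equal-length files."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*files))

def recover_missing(parity, survivors):
    """Rebuild the one lost file in a group (XOR is its own inverse)."""
    return group_parity([parity] + survivors)

# One "micro array": two files on different DRUs, one parity blob on the PPU.
doc1, doc2 = b"Document1", b"Document2"
flxr_001 = group_parity([doc1, doc2])
assert recover_missing(flxr_001, [doc2]) == doc1   # DRU1 died: Document1 comes back
```

The key property for this discussion is independence: each group's parity depends only on its own files, so a mismatch in Movie2.mpg would invalidate only 002.flxr while 001, 003, and 004 could still restore their files.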
If that’s true, this leads me to two ideas:
* Don’t halt on verify failure.
This is especially important for continuous or progressive operations, but I can see it being a big help even with the current system. It’s been a while since I’ve had a bit mismatch, but my recollection is that the entire verify operation comes to a grinding halt as soon as one happens. While it makes sense not to waste time continuing a verify that has already failed, the problem is that you could solve one bit mismatch and then have another pop up later in the array on the next verify (by the time this happens 3-4 times, you’ve lost a lot of time re-verifying the files at the start of the array).
Instead, what if the verify operation were set to continue even after a failure? To put this in terms of my description above: let’s say there was a bit mismatch in the Movie2.mpg file on the second line. While running Verify, FlexRAID would see that the first line (001.flxr) is healthy, enter an alert in the log that there was a bit mismatch on Movie2.mpg, and then continue to verify that lines 3 and 4 were healthy. At the end of the Verify task or at the end of the time window, an email alert would be sent detailing the problem files with the relevant sections of the log attached.
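In pseudo-Python (with a hypothetical `check_group` function standing in for whatever the engine actually does per parity group), the change is just swapping an early exit for a log entry:

```python
def verify_all(groups, check_group):
    """Verify every parity group; record failures instead of halting."""
    failures = []
    for flxr_name, files in groups.items():
        bad_files = check_group(flxr_name, files)    # -> list of mismatched files
        if bad_files:
            failures.append((flxr_name, bad_files))  # alert the log, keep going
    return failures   # one email report at the end, not an abort at the first hit
```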
One specific example of how this would help is the case of a hard drive that recently failed on me: the drive was beginning to slow down and get a little flaky, but all of the SMART parameters were fine and it continued to pass Extended Self Test operations right up until the day it failed completely (it wouldn’t even mount in Windows)! If this drive had been part of my array and this “don’t halt” feature were in place, there’s a chance the log would have reflected a number of bit mismatches all coming from the same drive, which could serve as a hint that a failure might be coming.
The second idea builds on this one and would represent a massive leap forward in the user-friendliness of the product (but would also be a big enough task to implement that I’m thinking of it in near-“pipe dream” terms):
* Create a “Conflict Resolution” GUI to deal with files that have failed Verify.
Let’s continue with my description above, where Movie2.mpg has a bit mismatch…
Since we can’t assume that the user will be able to resolve the problem before the next scheduled Update, the verify failure could also trigger a flag to be placed on 002.flxr that marks it as unhealthy and would prevent it from being affected by future updates until the user addresses the problem. So, if an Update ran the next morning, it would see the flag on 002.flxr and exclude that entire line from the Update process. To go back to my terminology of “micro arrays,” the 002.flxr array would become static while all the other micro arrays would be updated as scheduled (this would ensure that Movie2.mpg could be restored and that the bit mismatch wouldn’t become part of the parity going forward).
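The "freeze the unhealthy micro array" part could be as simple as a skip list that the Update task consults — again a hypothetical sketch, not real FlexRAID internals, with `update_group` standing in for the actual parity recomputation:

```python
def run_update(groups, unhealthy, update_group):
    """Recompute parity for every group except the flagged ones."""
    updated, frozen = [], []
    for flxr_name in groups:
        if flxr_name in unhealthy:
            frozen.append(flxr_name)   # old parity kept, so the pre-mismatch
            continue                   # version of the file stays restorable
        update_group(flxr_name)
        updated.append(flxr_name)
    return updated, frozen
```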
Here’s where the GUI comes in: once 002.flxr has been marked “unhealthy” and an email has been sent, FlexRAID would automatically restore the original Movie2.mpg to a hidden folder on a different DRU (using the function Sparky requested to restore individual files). When the user opens the FlexRAID interface, there would be an additional icon on the pseudo-desktop labeled “Conflict Resolution.” Opening that icon would present a list of problem files (bit mismatches, missing files, etc.) and a comparison between the “current” version of the file (the mismatched one) and the “original” version (restored from parity), in a fashion very similar to the “do you want to overwrite…” dialogs you see when copying/pasting in Windows or TeraCopy (in the example below, { } represent buttons in the GUI).
Name: Size: Location: Date Modified:
----Bit Mismatch----
Movie2.mpg 1.5GB D:\Movies 3/9/2014 {Open File} {Open Location}
Movie2.mpg 1.5GB [restored from parity] 3/9/2014 {Open File} {Open Location}
{Yes, I changed it: keep the current version} {Restore the earlier version}
---File Missing---
Document3.doc 200KB D:\Documents 2/15/2014 [missing] {Open Location}
Document3.doc 200KB [restored from parity] 2/15/2014 {Open File} {Open Location}
{Yes, I deleted it} {Restore the earlier version}
…etc (the list would be scroll-able)
Especially if we assume that most verify problems are due to non-datarot changes that have happened since the last Update (which would become more common with continuous/progressive operations), this sort of GUI would allow users to resolve any conflicts in a matter of minutes. No need to dig through the logs, access the file system to analyze the problem files, wonder, “Did I make that change?”, or manually command the restoration of an individual file to another location to make a comparison. If the size/location/modification metadata aren’t useful for comparison (I imagine they’d all be the same in datarot situations), users could open both files up to compare the contents manually.
Once the user resolves the conflict, FlexRAID would either replace the damaged version with the restored original or delete the restored file and run a “mini update” to make 002.flxr healthy again.
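The two buttons would then map to something like this — the paths and the `mark_healthy` callback are made up, with `mark_healthy` standing in for the “mini update” that re-syncs the group’s parity:

```python
import os
import shutil

def resolve_conflict(choice, current_path, restored_path, mark_healthy):
    """Apply the user's choice from the Conflict Resolution screen."""
    if choice == "restore":
        shutil.copy2(restored_path, current_path)  # put the parity copy back
    # "keep" needs no copy: the current version becomes the new truth
    os.remove(restored_path)   # the hidden restore copy is no longer needed
    mark_healthy()             # re-sync parity for this group ("mini update")
```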
Like I said, I know this sort of feature would be a big undertaking to implement, but I really think it would represent a HUGE leap forward in usability and greatly reduce the number of posts in the forums asking for help with verify failures. What do you think? Is it worth considering?