Author Topic: RAID over File System (RAID-F): The re-design  (Read 17307 times)

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,547
  • Karma: +204/-16
RAID over File System (RAID-F): The re-design
« on: February 28, 2014, 01:01:26 pm »
The initiative
With version 2.0 having been out for a while, it is time for us to reflect on what we would have done differently knowing what we know now - and of course, execute on those reflections.
If you have been around long enough, you know that innovation does not stop around here. ;)

So, we are pleased to announce that RAID-F is forging forward.
Before adding any new feature, we are currently focusing on re-architecture. So, the first updates will focus on some of the redesign aspects. Nonetheless, we have some features on the roadmap that will put a smile on many faces. We promise you will not be disappointed. :)

Planned activities
  • Storage pooling engine port from tRAID as well as implementing several feature enhancements (started)
  • Snapshot RAID redesign coupled with new more efficient multi-PPU engines (started)
  • UI redesign (improve on the usability of the UI and make better usage of the screen real estate)
  • Real-time RAID redesign (develop a more resilient architecture or drop it as a feature :P)
  • Enhanced RAID management features:
    • Incremental operations on tasks such as Verify and Validate - including:
      - Continuous and progressive execution (with execution during idle times)
      - Hours of operation feature
    • Selective file operations such as re-Validating select files and restoring specific files from UI selection
    • Better control over Email and SMS notifications - including:
      - Configurable message templates
      - Adding task logs as attachment or log summary in the messages
  • Redesign/Update Scheduler:
    • Add ability to edit jobs
    • Redesign UI for simpler configuration

Status Update

2014-02-28
  • The first effort we have embarked on and completed is the port of the Transparent RAID pooling engine to RAID-F. We did not just port the engine; we also improved upon it. The improvements will be discussed once we start talking about new features. Nonetheless, look forward to less CPU and memory usage during storage pool activities.
  • The RAID-F managed folder (where disks used to be mounted for Cruise Control configurations) is gone! We know this will make many of you happy. :)
    What we have done is make all RAID-F operations access the needed data at a much lower level and without needing to mount the disks.
  • We have improved the implementation of the metadata database to support easier and faster RAID re-configuration.
  • There is more to come. So, keep an eye out for the next updates. :)
« Last Edit: March 05, 2014, 01:45:37 pm by Brahim »

Offline facke02

  • Full Member
  • ***
  • Posts: 106
  • Karma: +0/-1
Re: RAID over File System (RAID-F): The re-design
« Reply #1 on: February 28, 2014, 01:05:39 pm »
Great news...  Looking forward to the new release.
Ken

Offline bigbob

  • Newbie
  • *
  • Posts: 36
  • Karma: +0/-0
Re: RAID over File System (RAID-F): The re-design
« Reply #2 on: February 28, 2014, 01:12:32 pm »
Will this be free or for fee for people who paid for version 2.0 license?

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,547
  • Karma: +204/-16
Re: RAID over File System (RAID-F): The re-design
« Reply #3 on: February 28, 2014, 01:50:25 pm »
Will this be free or for fee for people who paid for version 2.0 license?
Well, glad that we can address this early on so we can focus on the effort at hand. :)

We haven't really thought of that aspect as we have bigger concerns: delivering on the initiative.
Even if we decide to charge an upgrade fee, it will either be the case of a superior product for a small upgrade fee or a marginal improvement for no fee. Either way, you will lose nothing. ;)

We have some interesting features planned. It is those features that will really determine any fee strategy.
So, let's focus this thread on generating ideas and feature requests and worry about money when we get there.

Offline _mulder

  • Newbie
  • *
  • Posts: 12
  • Karma: +0/-0
Re: RAID over File System (RAID-F): The re-design
« Reply #4 on: February 28, 2014, 03:46:23 pm »
Great to read that RAID-F is being improved upon :)

Looking forward to future announcements!

Offline Spark

  • Newbie
  • *
  • Posts: 8
  • Karma: +0/-0
Re: RAID over File System (RAID-F): The re-design
« Reply #5 on: March 05, 2014, 09:27:40 am »
I have been using FlexRAID for 6 months, and I have to say I really dig the product.
I am a software developer myself - so ideas/potential features come easily to me by nature.
Here they are - not considering development time.

Keep in mind that I love the product - this is in no way a criticism. Just suggestions.

* The WebUI. You already mentioned this in your potential feature list - and I simply want to reinforce it. When I use FlexRAID (RAID-F), I feel that the current UI isn't harnessing much of the underlying engine's power. The windowing-system-on-a-webpage doesn't work so well, and windows expand without their content following. I wish I had a simple web page with some menus that would maximize the usage based on available page size. Have a version that is well laid out for mid-size mobile devices like tablets. I love being able to administer FlexRAID from a tablet - but it's a real pain from an iPad.

* Scheduling. I can schedule basic Updates/Validates/Verify - but I don't have much control over it. Once a job is created, I can't edit it - I have to delete and reschedule. When I do run some jobs, like an update, I'd like the system to keep track of it. I don't always run UPDATE daily - but I have often found myself wondering when was the last time I ran an update or a validate.

* RAID health. Validates and verifies are great - but they get to be heavier, longer-running tasks as the array fills up. I wish I didn't have to rely on scripting to be able to run partial validates/verifies. A lot of people aren't that savvy when it comes to programming/scripting - so it would definitely be useful. I'd like to be able to easily split a validate into, say, 7 smaller ones over a week (as opposed to one weekly 15-hour validate). Same for verify (which takes even longer). I typically like to do one per month - and being able to schedule 1/30 verifies of my array on a daily basis without writing a script would be great to me. Even better would be a simple health management system that would basically ask me what periods the computer isn't used and would manage the scheduling automatically (I know, I am dreaming - I'm just saying this would be best case). Or some idle-based checking. Computer not touched for 15 minutes and CPU idle? Start verify/validate. Used again? Take note of where we were at, and pause. In my case, the machine is a big NAS running Win7 and connected over HDMI to my home theater. God knows it is spending a LOT of time idle - but can be used frequently on some days. An idle-based health check would be perfect for me.

* RAID recovery per file. When corruption is found in a file, I'd like easy recovery. The UI should simply tell me a corrupted file was found on the last verify/validate and ask me if I would like to attempt to recover it. Right now I have to locate which DRU the file is on, move it away from there, and initiate recovery on that particular DRU (at least, this is the way I found it to be working - if there is an easier way, let me know!). I understand it is necessary to be able to recover a whole DRU - but in the case of one file, I wish I could just have the option to tell FlexRAID to attempt to recover it and be done with it.

* Better control over email notifications. Unless I use scripting for my updates/validates, I don't get much feedback on the array. I have to check the logs. I wish I could tell FlexRAID what information I am interested in (free space, new/changed/deleted files, health, etc).


I have to run. I hope these are the kind of suggestions you were looking for as input.
Maybe I'll add more later if something comes to mind.

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,547
  • Karma: +204/-16
Re: RAID over File System (RAID-F): The re-design
« Reply #6 on: March 05, 2014, 10:27:44 am »
@Spark

Excellent! :)

I will be updating the planned activities with these, if I got it right from your feedback:

1. Update the scheduler to be able to edit jobs

2. Incremental operations on tasks such as Verify and Validate. Possibly:
- make it continuous and progressive (with execution during idle times)

3. Selective file operations such as re-Validating select files and restoring specific files

4. Better control over Email and SMS notifications. Possibly:
- configurable templates
- including task logs as attachment or log summary in the message

Offline FlyingShawn

  • Newbie
  • *
  • Posts: 17
  • Karma: +0/-0
Re: RAID over File System (RAID-F): The re-design
« Reply #7 on: March 05, 2014, 12:37:25 pm »
I want to 2nd a lot of Spark's ideas!

Two ideas to add:

* An easier, more capable scheduling interface.  Not just the ability to edit events (which is my biggest request!), but a GUI that is easier for users who aren't familiar with cron.  Maybe something similar to how Acronis True Image does scheduling...

[Screenshots: Acronis True Image scheduling screens]

(I couldn't find a picture with it, but on the weekly screen above you can select multiple days at the same time.  The same goes for the "monthly" screen: it shows a simplified days of the month calendar and you could choose, for example, to run an operation on the 1st, 10th, and 20th of each month).

The way I'm imagining it would be a list of scheduled events (similar to what we have now) and an interface with capability like these screenshots would be what you see when you click to add or edit an event.

* The second idea builds on Sparky's ideas for validate/verify operations, maybe as a way to implement them.

Under the current interface, if I want to schedule a verification on 1/30th of my array each day (to use Sparky's example), I'd have to figure out what 1/30th is and write a script for each day of the month to run that portion.  If I increase the size of my array, I'd have to redo all of those scripts to change the schedule to accommodate it.

Instead, imagine if I could simply tell the scheduler to run "continuous verify" from 2am-6am every day: at the end of the time window, it'd simply pause and pick up where it left off the next morning (and when it gets to the end of the array it'd start back at the beginning).  If a user wanted to use system idle time for these tasks, there could be a checkbox to "verify only when system idle" (similar to the Acronis screenshot above) and the user could just set a much wider time window to use (even 12am-12am if they wanted it all the time).
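
A minimal sketch of the "continuous verify" window described above, written in Python. Everything in it is hypothetical: `in_window`, `ContinuousVerify`, and the injected `clock` are illustrative names, not FlexRAID APIs, and a real tool would persist the cursor to disk between runs.

```python
from datetime import time


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` is inside [start, end).

    start == end is treated as "always on" (the 12am-12am case), and a
    window may wrap past midnight (for example 22:00 to 06:00).
    """
    if start == end:
        return True
    if start < end:
        return start <= now < end
    return now >= start or now < end


class ContinuousVerify:
    """Walks a list of items and remembers where it paused (assumed design)."""

    def __init__(self, items):
        self.items = list(items)
        self.cursor = 0  # index of the next item to verify

    def run(self, clock, start: time, end: time):
        """Verify items while clock() is inside the window; return what was done."""
        done = []
        while self.items and in_window(clock(), start, end):
            done.append(self.items[self.cursor])
            # Wrap around to the beginning once the end of the array is reached.
            self.cursor = (self.cursor + 1) % len(self.items)
        return done
```

The clock is passed in so the pause-and-resume behaviour can be exercised without waiting for real time to pass. An "only when system idle" checkbox could be modelled as a second condition next to `in_window`.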

I also want to echo Sparky in saying I love FlexRAID and the concept of snapshot RAID, so I'm excited about the potential for this re-design! 

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,547
  • Karma: +204/-16
Re: RAID over File System (RAID-F): The re-design
« Reply #8 on: March 05, 2014, 01:25:57 pm »
@FlyingShawn

Excellent, excellent! :)

1. Adding "Redesign of the scheduler interface" to the task list. I think we will need to refine what the interface should look like.
I think if we make the first drop down of the FlexRAID scheduler a selection panel, this might be a win. We can get more detailed once that piece gets tackled.

2. I agree that adding an "hours of operation" feature would be great.

Offline FlyingShawn

  • Newbie
  • *
  • Posts: 17
  • Karma: +0/-0
Re: RAID over File System (RAID-F): The re-design
« Reply #9 on: March 08, 2014, 07:21:56 pm »
I’ve been continuing to think about this and have a couple more ideas for you.

* Continuous validation.

If I’m reading the updated task list correctly, I think you already have this included, but just want to be sure since our previous discussion of continuous/progressive operations focused on verify tasks.

* Hidden Files

I may be in the minority for desiring this feature, but my confidence that I have everything on my array protected would definitely be enhanced with an option to protect hidden files.   When facing a failed DRU, I’d much rather be able to click that “restore” button and know that everything will be back to normal than wonder what will go missing because it wasn’t protected.

* Move on while the retry timer runs

When a file can’t be read, instead of stopping and waiting out the “retry” timer before trying again, would it be possible to have FlexRAID move on to the next file and “come back” to it after the timer is up?  This might reduce the time it takes to do an operation if another program is accessing/changing a file when FlexRAID tries to access it.
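
A rough Python sketch of the "move on and come back" idea above. The function name, the `try_read` callback, and `max_passes` are all assumptions made for illustration; nothing here reflects how FlexRAID actually handles its retry timer. A real tool would also wait for the timer to elapse between passes, which the sketch leaves out.

```python
from collections import deque


def process_with_deferred_retry(files, try_read, max_passes=3):
    """Process every file; unreadable files are deferred to a later pass.

    Returns (processed, still_failing). `try_read(name)` is a hypothetical
    callback that returns True when the file could be read.
    """
    pending = deque(files)
    processed = []
    for _ in range(max_passes):
        deferred = deque()
        while pending:
            name = pending.popleft()
            if try_read(name):
                processed.append(name)
            else:
                deferred.append(name)  # move on; come back on the next pass
        if not deferred:
            break
        pending = deferred
    return processed, list(pending)
```

With files `a`, `b`, `c` where `b` is locked on the first attempt only, the run handles `a` and `c` first and picks up `b` on the second pass instead of stalling on it.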


** The next couple ideas are more complicated and I’m not sure of the best terminology to use for a couple of the concepts, so bear with me a little here…

My understanding of how RAID-F works is that it calculates parity on a file-level instead of block-level.  So, except for being GROSSLY oversimplified, is this a fair description of the concept…

DRU1              DRU2              PPU
Document1      +  Document2      =  001.flxr
Movie1.iso     +  Movie2.mpg     =  002.flxr
TVshow.divx    +  1000 JPEGs     =  003.flxr
50 MP3s        +  75MB ZIP       =  004.flxr
And so on…

…where parity data is calculated/based on groupings of files?  Put another way, if a change occurs to Document2 above, the parity data for that line would be compromised until the next Update (so Document1 would no longer be protected if DRU1 were to fail), but the other groupings would still be safe.  In this sense, RAID-F is more like a bunch of individual micro-arrays put together and one could become unhealthy without the others being damaged.  Does that make sense?

(I’m sure the actual engine is much more complicated and efficient, but this mental picture should be good enough for our discussion as long as I’m right that a single bit mismatch would compromise only a small portion of the parity, as opposed to block-level RAID where it would have a bigger impact).
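
To make the "micro-array" picture concrete, here is a small Python sketch. It assumes simple XOR parity over zero-padded byte strings, which is only a stand-in: the thread does not say how RAID-F actually computes parity, and the group names and file contents below are made up for the example.

```python
def xor_parity(*blocks: bytes) -> bytes:
    """XOR the blocks together, treating shorter blocks as zero-padded."""
    size = max(len(b) for b in blocks)
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover(parity_block: bytes, surviving: bytes, length: int) -> bytes:
    """Rebuild the lost member of a two-member group from parity + survivor."""
    return xor_parity(parity_block, surviving)[:length]


# Two independent groups, each with one member per DRU (illustrative data).
groups = {
    "001": (b"Document1", b"Document2"),
    "002": (b"Movie1.iso", b"Movie2.mpg"),
}
parity = {name: xor_parity(*members) for name, members in groups.items()}
```

Under this model, recovering `Document1` works only while `Document2` still matches what the parity was computed from. If `Document2` changes before the next Update, group 001 can no longer rebuild `Document1` correctly, while group 002 is unaffected.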

If that’s true, this leads me to two ideas:

* Don’t halt on verify failure. 

This is especially important for continuous or progressive operations, but I can see it being a big help even with the current system.  It’s been a while since I’ve had a bit mismatch, but my recollection is that the entire verify operation comes to a grinding halt as soon as it happens.  While it makes sense to not waste time continuing a verify that has already failed, the problem is that you could solve the bit mismatch problem and then have another one later in the array pop up on the next verify (by the time this happens 3-4 times, you’ve lost a lot of time re-verifying the files at the start of the array).

Instead, what if the verify operation was set to continue even after a failure?  To put this in terms of my description above: let’s say there was a bit mismatch in the Movie2.mpg file on the second line.  While running Verify, FlexRAID would see that the first line (001.flxr) is healthy, enter an alert in the log that there was a bit mismatch on Movie2.mpg, and then continue to verify that lines 3 and 4 were healthy.  At the end of the Verify task or at the end of the time window, an email alert would be sent detailing the problem files with the relevant sections of the log attached.
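
A minimal Python sketch of a verify loop that records failures instead of halting. The function and its arguments are hypothetical, and `compute_parity` stands in for whatever check the real engine performs.

```python
def verify_all(groups, stored_parity, compute_parity):
    """Check every group and return the names of those that mismatch.

    `groups` maps a group name to its members, `stored_parity` maps a group
    name to the parity recorded at the last Update, and `compute_parity` is
    any function that recomputes parity from the members (all assumed names).
    """
    failures = []
    for name, members in groups.items():
        if compute_parity(members) != stored_parity.get(name):
            failures.append(name)  # log it and keep going instead of aborting
    return failures
```

The returned list is what an end-of-run email or a log summary would be built from, so one pass surfaces every mismatch rather than only the first.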

One specific example for how this would help would be the case of a hard drive that recently failed on me: the drive was beginning to slow down and get a little flaky on me, but all of the SMART parameters were fine and it continued to pass Extended Self Test operations right up until the day it failed completely (wouldn’t even mount in Windows)!  If this drive had been part of my array and this “Don’t halt” feature were in place, there’s a chance the log would reflect a number of bit mismatches all coming from the same drive and could serve as a hint that a failure might be coming.

The second idea builds on this one and would represent a massive leap forward in the user-friendliness of the product (but would also be a big enough task to implement that I’m thinking of it in near-“pipe dream” terms):

* Create a “Conflict Resolution” GUI to deal with files that have failed Verify.

Let’s continue with my description above, where Movie2.mpg has a bit mismatch…

Since we can’t assume that the user will be able to resolve the problem before the next scheduled Update, the verify failure could also trigger a flag to be placed on 002.flxr that marks it as unhealthy and would prevent it from being affected by future updates until the user addresses the problem.  So, if an Update ran the next morning, it would see the flag on 002.flxr and exclude that entire line from the Update process.  To go back to my terminology of “micro arrays,” the 002.flxr array would become static while all the other micro arrays would be updated as scheduled (this would ensure that Movie2.mpg could be restored and that the bit mismatch wouldn’t become part of the parity going forward).
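
The "unhealthy flag" could be sketched like this in Python. The names are illustrative only and `compute_parity` is a placeholder for the real parity calculation; the point is just that a flagged group keeps its old parity while the others are refreshed.

```python
def run_update(groups, unhealthy, compute_parity, stored_parity):
    """Refresh parity for every group except those flagged unhealthy.

    Returns the names of the groups that were skipped. `stored_parity` is
    updated in place (an assumption made to keep the sketch short).
    """
    skipped = []
    for name, members in groups.items():
        if name in unhealthy:
            # Frozen: keep the old parity so the original file can still be restored.
            skipped.append(name)
            continue
        stored_parity[name] = compute_parity(members)
    return skipped
```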

Here’s where the GUI comes in: once 002.flxr has been marked “unhealthy” and an email has been sent, FlexRAID would automatically restore the original Movie2.mpg to a hidden folder on a different DRU (using the function Sparky requested to restore individual files).  When the user opens the FlexRAID interface, there would be an additional icon on the pseudo-desktop labeled “Conflict Resolution.”  Opening that icon would present a list of problem files (bit mismatches, missing files, etc) and a comparison between the “current” version of the file (the mismatch one) and the “original” version (restored from parity) in a fashion very similar to “do you want to overwrite…” GUIs when doing copy/paste in Windows or TeraCopy (in the example below, { } represent buttons in the GUI).


Name                              Size:                   Location:                   Date Modified:

----Bit Mismatch----
Movie2.mpg                 1.5GB                  D:\Movies                        3/9/2014      {Open File} {Open Location}
         Movie2.mpg        1.5GB             [restored from parity]          3/9/2014    {Open File} {Open Location}
      {Yes, I changed it: keep the current version}            {Restore the earlier version}

---File Missing---
Document3.doc             200KB              D:\Documents                 2/15/2014        [missing]   {Open Location}
        Document3.doc     200KB           [restored from parity]        2/15/2014      {Open File} {Open Location}
                        {Yes, I deleted it}            {Restore the earlier version}

…etc (the list would be scroll-able)

Especially if we assume that most verify problems are due to non-datarot changes that have happened since the last Update (which would become more common with continuous/progressive operations), this sort of GUI would allow users to resolve any conflicts in a matter of minutes.  No need to dig through the logs, access the file system to analyze the problem files, wonder “did I make that change?”, or manually command the restoration of an individual file to another location to make a comparison.  If the size/location/modification metadata aren’t useful for comparison (I imagine they’d all be the same in datarot situations), users could open both files up to compare the contents manually.
 
Once the user resolves the conflict, FlexRAID would either replace the damaged version with the restored original or delete the restored file and run a “mini update” to make 002.flxr healthy again.

Like I said, I know this sort of feature would be a big undertaking to implement, but I really think it would represent a HUGE leap forward in usability and greatly reduce the number of posts in the forums asking for help with verify failures.  What do you think?  Is it worth considering?

Offline Quaraxkad

  • Sr. Member
  • ****
  • Posts: 381
  • Karma: +24/-1
Re: RAID over File System (RAID-F): The re-design
« Reply #10 on: March 08, 2014, 10:52:27 pm »
* Create a “Conflict Resolution” GUI to deal with files that have failed Verify.

This is *exactly* what I was trying to think of how to describe for a suggestion, but I couldn't quite figure out how to explain it understandably in plain English! I want this feature more than anything. It will make RAID-F far more transparent (no pun intended re: t-RAID), by letting the user know exactly what is going on with their files and parity creation. The way it stands now, we just basically have to have faith that RAID-F is doing what we want it to, and in the event of a conflict in a verify/validate failure we are practically blind as to what went wrong and what we need to check to proceed (and let's be honest, the guides in the wiki are vague, incomplete, and uninformative). I envision some sort of UI that shows me, perhaps in a folder tree view, what files are *for sure* protected, shows me what files exist in my array that are not in the parity, and potentially shows me any conflicts between existing files and mismatched parity. This would go hand-in-hand with the continuous verify/validate feature proposal.

Another note: The language used in the log files is very often extremely misleading and is seemingly only useful to the coder, and means little or nothing to the end user. For example, when a file is in use by another process during a snapshot operation, the logfile describes the file saying "no longer exist". I have seen many posts about this where people see that in the log and are confused because the file is still there. On a few occasions you (Brahim) insisted to the poster that the file was no longer there, that it was perhaps moved to another DRU. I have run across it many times with files that are being used in torrent seeds: they are open at the time of the nightly update/verify task and the log file complains. Also sometimes image files are not properly closed by my HTPC frontend; they remain locked and the logfile complains. The specified files are most definitely, without a doubt, still exactly where the logfile says that they are not! It's not missing by any definition of the word, and it makes it sound like that file has been deleted. And due to the lack of transparency, it's unclear whether that file is still protected.

Also: The expression language could be greatly improved by adding even the most basic variable and math capabilities. For example, when setting up a partial verify, instead of creating one for every day of the month, I'd like to create one with a line like: @param verifyStart=<arraysize> / 28 * <dayofmonth> - 1073741824. The result would be on the 1st of the month, it would verify the first 1/28th of the array (minus 1GB for a little overlap), on the second of the month it would verify the second 1/28th, etc.
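
For illustration, the arithmetic behind that line could look like this in Python. The `@param` syntax above is the poster's proposal, and `verify_start` is a made-up name. One caveat: as written, the formula gives arraysize/28 minus 1 GB on the 1st, which is the end of the first slice rather than its start. The sketch uses (day - 1) so that day 1 begins at offset 0, which is an assumption about the intent.

```python
GIB = 1024 ** 3  # 1073741824 bytes, the overlap used in the post


def verify_start(array_size: int, day_of_month: int,
                 slices: int = 28, overlap: int = GIB) -> int:
    """Byte offset where the partial verify for `day_of_month` would start.

    Days beyond `slices` (29 to 31 with the default) have no slice of their
    own and are rejected here; a real schedule would need to decide what to
    do on those days.
    """
    if not 1 <= day_of_month <= slices:
        raise ValueError("day_of_month must be between 1 and slices")
    start = array_size * (day_of_month - 1) // slices - overlap
    return max(start, 0)  # never start before the beginning of the array
```

Because the offset is derived from the array size at run time, growing the array would not require rewriting one script per day.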
« Last Edit: March 08, 2014, 11:02:48 pm by Quaraxkad »

Offline FlyingShawn

  • Newbie
  • *
  • Posts: 17
  • Karma: +0/-0
Re: RAID over File System (RAID-F): The re-design
« Reply #11 on: March 09, 2014, 05:52:37 am »
This is *exactly* what I was trying to think of how to describe for a suggestion, but I couldn't quite figure out how to explain it understandably in plain English!

Thanks.

Another note: The language used in the log files is very often extremely misleading and is seemingly only useful to the coder, and means little or nothing to the end user. For example, when a file is in use by another process during a snapshot operation, the logfile describes the file saying "no longer exist". I have seen many posts about this where people see that in the log and are confused because the file is still there. On a few occasions you (Brahim) insisted to the poster that the file was no longer there, that it was perhaps moved to another DRU. I have run across it many times with files that are being used in torrent seeds: they are open at the time of the nightly update/verify task and the log file complains. Also sometimes image files are not properly closed by my HTPC frontend; they remain locked and the logfile complains. The specified files are most definitely, without a doubt, still exactly where the logfile says that they are not! It's not missing by any definition of the word, and it makes it sound like that file has been deleted. And due to the lack of transparency, it's unclear whether that file is still protected.

I think this was part of my reasoning behind the idea for the change in the “retry timer” logic.  Instead of just pausing the whole operation for a relatively short period of time, FlexRAID could “skip and come back to” the “missing” file after a much longer period of time, say, an hour or two?  After that much time, there’s a much better chance your HTPC will have released the file and FlexRAID will be able to access it (and since it kept going, that hour or two won’t be wasted just waiting around for the retry timer).

Also: The expression language could be greatly improved by adding even the most basic variable and math capabilities. For example, when setting up a partial verify, instead of creating one for every day of the month, I'd like to create one with a line like: @param verifyStart=<arraysize> / 28 * <dayofmonth> - 1073741824. The result would be on the 1st of the month, it would verify the first 1/28th of the array (minus 1GB for a little overlap), on the second of the month it would verify the second 1/28th, etc.

You certainly won’t find me arguing against increasing the capability of the language, especially since I’m sure there are specialized use cases that can take advantage of it.  That being said, I think the single biggest change to push for in this re-design, over and above all of the other changes we’ve discussed, is that common tasks should never require user scripting.  In fact, the underlying assumption of the interface should even be that most users don’t know how to write scripts!  The expression language should definitely still be around and even improved as Quaraxkad suggests, but I think it should be reserved for uncommon tasks and for advanced users (such as if Quaraxkad were to decide his 1/28 verify plan were a better fit for his needs than a progressive one).

For the most part, FlexRAID is already pretty good at this: array creation, update/validate/verify tasks, scheduling, and restores can all be performed through the GUI, but there are two notable exceptions:

-Partial verifies.  As I said, I’d never argue against your idea for improving the expression language, but I think the continuous/progressive idea that has already been discussed would solve this for the majority of users.

-Verify failures.  Let’s be honest: every user will encounter these at one time or another.  It’s not Brahim’s fault or a problem with FlexRAID, it’s simply the nature of the beast when talking about any sort of Snapshot implementation: at some point the RAID will be out of sync when an operation occurs.  Brahim is certainly under no obligation to adopt my particular idea for a Conflict Resolution GUI, but I really think that one way or another this should be a GUI issue.

Offline facke02

  • Full Member
  • ***
  • Posts: 106
  • Karma: +0/-1
Re: RAID over File System (RAID-F): The re-design
« Reply #12 on: April 02, 2014, 05:36:25 am »
Brahim, any updates on the progress?
Ken

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,547
  • Karma: +204/-16
Re: RAID over File System (RAID-F): The re-design
« Reply #13 on: April 02, 2014, 08:19:19 am »
Brahim, any updates on the progress?
Busy, redesigning. ;)
I am currently focused on the UI. The main effort is modernizing the code base and porting a few UI features done in tRAID back to RAID-F.

Offline webs0r

  • Full Member
  • ***
  • Posts: 107
  • Karma: +4/-1
  • Hello, world.
Re: RAID over File System (RAID-F): The re-design
« Reply #14 on: April 03, 2014, 02:55:01 am »
Yay! Exciting times ahead. Really liking all the suggestions so far.

The 'conflict' GUI would be particularly useful. It is really a verify/validate UI.

Couple of complications to bear in mind when designing: (I haven't put too much into thinking about this yet but some quick thoughts)

1. With the incremental running of verify, checks could be spread across long periods of time
Would be useful to timestamp each conflict when it is logged.
Also future actions on the array might invalidate the past result, e.g. a restore or deletion/changes to other files including the parity.
Some kind of re-check feature at that file level might be needed.
An update may clear logged conflicts (e.g. where due to file change rather than bitrot)
Also if user hasn't reviewed the logs for ages, a 2nd (or 3rd or 4th) verify pass might cover the same file again, so old result should be removed.

2. If a whole DRU is down/missing, how to deal with that? User probably doesn't want to see a million files come up? Or maybe they do.
Probably just need some thinking here.

3. Feature to prevent an update if conflicts are outstanding (non-bitrot)?

4. Ability for user to do multi-select  (e.g. select an entire path) to do a restore on
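
One way to picture points 1, 3 and 4 is a small conflict log keyed by file path, sketched here in Python. The class and method names are invented for the example and say nothing about how FlexRAID stores its results.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Conflict:
    path: str
    kind: str            # for example "bit mismatch" or "missing"
    logged_at: datetime  # point 1: timestamp each conflict when it is logged


class ConflictLog:
    """Keeps at most one (the latest) result per file path."""

    def __init__(self):
        self._by_path = {}

    def record(self, path, kind, when):
        # A later verify pass over the same file replaces the old result.
        self._by_path[path] = Conflict(path, kind, when)

    def clear(self, path):
        # Called after a file-level re-check passes or an update absorbs a change.
        self._by_path.pop(path, None)

    def outstanding(self):
        return sorted(self._by_path.values(), key=lambda c: c.logged_at)

    def under(self, prefix):
        # Point 4: select everything below a path for a bulk restore.
        return [c for c in self.outstanding() if c.path.startswith(prefix)]

    def update_allowed(self):
        # Point 3: block updates while conflicts are unresolved.
        return not self._by_path
```

Keying by path is what keeps a second or third verify pass from piling up stale entries for the same file. The whole-DRU case in point 2 would still need separate thought, for example collapsing entries by drive.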

Hm it's actually quite a complex challenge.

Brahim what about that file metadata caching to reduce drive spin up - was that feasible at all, or not really?
FlexRAID expert/snapshot RAID/Storage Pool mode
Windows Server 2008 R2, 23 TB pool, Array1: 3TB redundancy, Array 2: 4TB redundancy, 11 drives total