Author Topic: WHO IS TESTING tRAID?  (Read 3896 times)

Offline NLS

WHO IS TESTING tRAID?
« on: July 16, 2013, 03:49:47 am »
Who is testing tRAID right now?




What tests do you do? Do you read/write/edit data?
Using pool or not?
Using shares or local?
Did you simulate any failure?
Did restores?
Expansion?
Contraction?
On-line? Off-line?


What?


Everything except performance issues (unless performance is EXTREMELY low) is interesting.


Check in here!

---
NLS
Production system: SBS2011 fully patched, intel Core2 Quad, 8GB, 12 disks (1 system IDE, 1 backup IDE, 10 for array and parity most SATA3), parity is 3TB, largest data disk is 3TB, millions of smaller files, common browser Chrome latest.

Offline Brahim

Re: WHO IS TESTING tRAID?
« Reply #1 on: July 16, 2013, 06:47:02 am »
Strangely, testing is still being done private beta style.
I am dealing with most users through PMs on specific issues.

Outside of that, most issues have been posted in the bug forum section for tRAID.

It is possible that we might have done too good of a job during private beta testing.  ;D

The RAID part is rock stable. So, I would still urge users to go through the advanced features and simulate them.
Things like Global Hot Spare restoring, online RAID Expansion in multi-PPU setups, etc.

What's key here is catching things we simply cannot see during development due to varying hardware setups.


Offline monkeysez

Re: WHO IS TESTING tRAID?
« Reply #2 on: July 16, 2013, 08:43:43 am »
I have expanded my array, but have yet to test the hot spare functionality. I will be receiving a new HD in a couple of weeks, so I'll make sure to do some contraction/hot spare testing at that time. I do have an issue that I briefly mentioned in another post.

Every time I start my pool (such as after a reboot), I receive a message from the Windows 8 Action Center telling me to scan my drive for errors and/or restart. When I do so, I receive the same error message. I attached a screen shot. Any ideas?

Edit: The functionality of the pool remains. I am able to open files and such, no issues.

Offline Brahim

Re: WHO IS TESTING tRAID?
« Reply #3 on: July 16, 2013, 09:10:17 am »
@monkeysez
Can you go to your Windows Event Viewer and copy out all disk-related events?

Offline monkeysez

Re: WHO IS TESTING tRAID?
« Reply #4 on: July 16, 2013, 09:27:04 am »
The recurring error:

Volume Shadow Copy Service error: Unexpected error DeviceIoControl(\\?\Volume{18b53ae6-05b2-4625-a8dd-b948f64c0a06} - 0000000000000168,0x0053c008,00000088F799AA80,0,00000088F799CA90,4096,[0]).  hr = 0x80070570, The file or directory is corrupted and unreadable.
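
For anyone decoding that message: the hr value is a standard Windows HRESULT, and its bit fields can be unpacked as in the rough sketch below (Python; decode_hresult is a hypothetical helper name, not part of any tRAID or Windows tool). For 0x80070570 the facility comes out as 7 (Win32) and the low 16 bits as 1392, which is the Win32 error code behind the "file or directory is corrupted and unreadable" text shown above.

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into its standard bit fields."""
    hr &= 0xFFFFFFFF  # treat the value as unsigned 32-bit
    return {
        "failure": bool(hr >> 31),       # severity bit: set means failure
        "facility": (hr >> 16) & 0x7FF,  # 11-bit facility; 7 is FACILITY_WIN32
        "code": hr & 0xFFFF,             # low 16 bits: the underlying error code
    }


fields = decode_hresult(0x80070570)
print(fields)  # {'failure': True, 'facility': 7, 'code': 1392}
```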
« Last Edit: July 17, 2013, 04:00:25 pm by monkeysez »

Offline SirMaster

Re: WHO IS TESTING tRAID?
« Reply #5 on: July 16, 2013, 09:38:06 am »
I've been mainly focusing my testing on the live reconstruction and restoring features.

I'm happy to say that my first problem, with disks that went missing while the array was offline or the server was off, was fixed in RC2 (thanks Brahim).  tRAID now checks the disks when you attempt to start the array, notifies you if it can't find one that it was expecting, and tells you to set it to failed if it has indeed been removed from the system (due to, say, a failure).

However, I still have two more-or-less open issues.

First: 
http://forum.flexraid.com/index.php/topic,2386.0.html

This one seems interesting.  Someone else seems to have the issue too.  Brahim says the last byte failure doesn't make sense (because it's lower?).  I've been running RC2r2 and that didn't change the behavior, so that issue is still an open question for me.

Also, I was testing multi-PPU configurations and ran into this issue: 
http://forum.flexraid.com/index.php/topic,2387.0.html

Can't really move forward with that one until Brahim reports back what he finds.


I haven't done any testing this past weekend, but I've been meaning to focus my testing on swapping out failed disks with new ones.  Perhaps I will try that tonight.

I've also been checking the data myself: writing a file to DRU1, computing its MD5, then failing DRU1 and computing its MD5 again from the live reconstruction. (Later I will also compute its MD5 from a newly created DRU that has been restored to.)  I've seen a few inconsistencies so far and need to spend more time seeing how often this happens and what seems to get the array into that state.  I'm wondering if it is linked to the first issue I posted, where the parity seems to get out of sync on its own after a reboot.
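
The check described above (hash a file, fail the DRU, hash it again from the live reconstruction) could be scripted roughly like this. It is only a sketch in Python; md5_of_file and the path in the note below are made up for illustration and are not part of tRAID.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, record md5_of_file(r"P:\dru1\sample.bin") before failing DRU1, call it again while the DRU is failed, and compare the two strings. The path is hypothetical. Any mismatch means the reconstructed bytes differ from what was written.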

I'm more than happy to send snapshots of my VMs and/or allow RDP access to them any time if that would help with anything.

To recap, I've been testing reading and writing data to the pool and to individual disks locally (not at the same time, of course).  I've been simulating failures.  I've done all of this both online and offline.

I still plan to test RAID expansion and restores.
« Last Edit: July 16, 2013, 09:42:27 am by SirMaster »

Offline Ramshackles

Re: WHO IS TESTING tRAID?
« Reply #6 on: July 16, 2013, 11:39:20 am »
I'm currently using it on my home system. I had the last byte failure already mentioned. I'm also having issues with the pooling functionality, but I'll research that more now that I have enabled more verbose logging.

Offline Brahim

Re: WHO IS TESTING tRAID?
« Reply #7 on: July 16, 2013, 12:58:30 pm »
@monkeysez
Volume Shadow Copy is not supported on the storage pool.

@SirMaster
Your first issue needs more detail on your setup.
I can confirm the second issue, with multi-PPU live data reconstruction when two drives have failed.
Essentially, the system drops both failed drives to protect the array. However, doing so is not necessary, and I will fix that.

Offline monkeysez

Re: WHO IS TESTING tRAID?
« Reply #8 on: July 16, 2013, 02:07:08 pm »
@Brahim,

Shadow copy is disabled on my system. See the attachment showing that the service is disabled :/

When I turn off the array, everything is fine.
« Last Edit: July 16, 2013, 02:15:29 pm by monkeysez »

Offline Brahim

Re: WHO IS TESTING tRAID?
« Reply #9 on: July 16, 2013, 02:22:24 pm »
Its driver service (volsnap) is still running.
It could be System Restore or NT Backup using it.

It is like tRAID: there is a kernel-mode service and a user-mode service.
Right-click on the pool drive and make sure it is not used for anything.
Likely, it is being used as the target for other drives' shadow copies. So, check the settings on other drives too.

Offline monkeysez

Re: WHO IS TESTING tRAID?
« Reply #10 on: July 16, 2013, 03:43:54 pm »
I disabled System Restore, defragged all drives, and ran chkdsk, so let's see what happens. The only thing I can think of is CrashPlan trying to access Shadow Copy? I don't have any scheduled backups or anything.

Offline SirMaster

Re: WHO IS TESTING tRAID?
« Reply #11 on: July 16, 2013, 04:38:07 pm »
Quote from: monkeysez
I disabled System Restore, defragged all drives, and ran chkdsk, so let's see what happens. The only thing I can think of is CrashPlan trying to access Shadow Copy? I don't have any scheduled backups or anything.

If you turn off "backup files in use" in CrashPlan settings, CrashPlan should in theory not access Shadow copy.

Offline monkeysez

Re: WHO IS TESTING tRAID?
« Reply #12 on: July 16, 2013, 05:58:21 pm »
Quote from: SirMaster
If you turn off "backup files in use" in CrashPlan settings, CrashPlan should in theory not access Shadow copy.

Bingo, you nailed it. I will look into that and see if I can get CrashPlan not to use VSS.

If you want to prevent CrashPlan from using VSS, you can turn off the “Backup Open Files” option in CrashPlan by going to Settings > Backup > Click on (Configure…) next to Advanced Backup Settings and uncheck the Backup Open Files box.

http://support.crashplan.com/doku.php/articles/vss

We should document this somewhere in the wiki along with the endpoint protection on Windows server.
« Last Edit: July 16, 2013, 06:01:13 pm by monkeysez »

Offline vletroye

Re: WHO IS TESTING tRAID?
« Reply #13 on: July 20, 2013, 03:59:37 am »
I am testing tRAID from five points of view:

- User interface. Is it user friendly/complete/consistent & coherent?
- Resilience. What occurs in case of disk/server failure while reading/writing in the pool?
- Expansion (Contraction not yet tested).
- Initialize RAID or "Do Nothing+Verify Sync".
- Verify/Verify Sync/Recreate. I did test them all many times, online and offline.

None of those tests was done with an "automatic start" of the pool!


1) Regarding the UI, I have already reported a few notes (bugs or suggestions). But I have more and will post them soon.
2) Regarding resilience, I am testing with a VM and have no means to simulate a disk failure while the VM is up and running (do you know how to do that? I mean: without using the FlexRAID feature to fail a disk).
3) Regarding Expansion: works fine so far... Except that I could expand with a placeholder instead of a "physical" UoR. Weird  :o
4) Regarding Initialization versus "Do Nothing"+"Verify Sync": I noticed indeed (as reported in other posts and analyzed by Brahim) that within a VM the throughput is sometimes really weird, but as performance is not the main topic now, I will come back to those tests later.
5) Works fine, i.e., Verify always succeeds immediately after "Initialization" or after "Verify Sync" (even if there were errors before the "Verify Sync").


About 2 - Resilience: So far, I can only simulate BSOD/power failure/... while reading or writing data in the pool. (Hereafter, all Verify and Verify Sync runs are done "offline".)


A) A BSOD always results in disk corruption if I was writing data to the pool (i.e., a "Verify" fails). It's OK if data were only being read when the BSOD occurred. But as soon as a disk is corrupted, I cannot fix it anymore: I do a "Verify Sync", reboot, and run "Verify" again, and this last one fails (disk access error). As Brahim explained to me, this is most probably related to VMware and could/should be different with real hardware... (That being said, the data in the pool can still all be accessed, but I'm not sure "where" the disk error is...)

B) A normal/clean reboot can also result in disk corruption if data were being written. For that test, I used the Windows restart menu while remotely writing and deleting files in the pool.

B.1) In most cases, the writing of the data stops because the server is rebooting, but after the reboot everything is fine, i.e., a "Verify" succeeds. Of course, the data are only partially written and are "corrupted" (they cannot be read because they are incomplete)... The error message received on the client side can be weird, e.g.: "Disk are write protected"

B.2) Twice, the server crashed with a real BSOD at the very end of the shutdown process (the BSOD message is something like NO_REFERENCES_POINTER or THREAD_EXCEPTION_NOT_HANDLED, but I didn't analyze the minidumps). After rebooting, a "Verify" failed. I did a "Verify Sync" and rebooted but was NOT in the same situation as above, i.e., the "Verify" succeeded after the last reboot and the disks did not appear to be corrupted. So clearly, there is a difference between a real BSOD and simulating a BSOD with a VM "hard" reset :)

B.3) In some cases, after the reboot, a "Verify" fails. In such cases, I do a "Verify Sync", reboot and redo a "Verify". As in B.2), it succeeds too.



To be complete, I really need to find out how to trigger a disk failure while the VM is running and without using tRAID (I want to do it while writing in the pool; as I have created one folder per DRU in the pool, I can write to a specific drive for testing purposes).
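
For the write-then-reset tests above, it can help to know afterwards exactly which files were fully written. A rough sketch of one way to do that in Python follows; write_test_files, check_test_files, the file names and manifest.json are all hypothetical and have nothing to do with tRAID itself. The idea is to record each file's MD5 in a manifest kept outside the pool before the write starts, then classify the files once the server is back up.

```python
import hashlib
import json
import os


def write_test_files(target_dir, count=5, size=1024 * 1024, manifest_path="manifest.json"):
    """Write random files into target_dir, recording each MD5 before its write starts."""
    os.makedirs(target_dir, exist_ok=True)
    manifest = {}
    for i in range(count):
        data = os.urandom(size)
        name = f"test_{i:04d}.bin"
        manifest[name] = hashlib.md5(data).hexdigest()
        # Persist the manifest first (ideally outside the pool) so it survives a crash mid-write.
        with open(manifest_path, "w") as m:
            json.dump(manifest, m)
            m.flush()
            os.fsync(m.fileno())
        with open(os.path.join(target_dir, name), "wb") as f:
            f.write(data)
    return manifest


def check_test_files(target_dir, manifest_path="manifest.json"):
    """Classify each expected file as 'ok', 'missing', or 'mismatch'."""
    with open(manifest_path) as m:
        manifest = json.load(m)
    results = {}
    for name, expected in manifest.items():
        path = os.path.join(target_dir, name)
        if not os.path.exists(path):
            results[name] = "missing"
            continue
        with open(path, "rb") as f:
            actual = hashlib.md5(f.read()).hexdigest()
        results[name] = "ok" if actual == expected else "mismatch"
    return results
```

Run write_test_files against a folder that maps to one DRU, reset the VM partway through, and call check_test_files after the reboot. A "mismatch" or "missing" entry for the file that was in flight is expected; the interesting cases are files that completed before the reset and no longer match.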
« Last Edit: July 20, 2013, 04:26:45 am by vletroye »

Offline Brahim

Re: WHO IS TESTING tRAID?
« Reply #14 on: July 20, 2013, 10:37:02 am »
@vletroye
Killing a disk in a live VM is near impossible.
I even tried getting VirtualBox to let me do an AHCI hot unplug, but that did not give me the intended result.

Now, I went back and tried to replicate your scenario.

I set up a VM under VirtualBox running Windows 8.
I initialized the array and ensured everything checked out.
With the pool running, I started a copy of a file from a network share to the pool. Then, I reset the VM to simulate a power failure.
When the VM came back up, I ran a Verify+ on the array, which came up clean.

So, what happens to the array on power failure is what we call in computer programming "undefined behavior". Meaning, anything could ensue from that. You might come out clean or you might not. There are just too many variables to know for sure.

Regardless of what happens, the Verify Sync task is designed to scrub and fix any issue, just as chkdsk fixes similar issues on normal disks.