Author Topic: Instability, Frequent Reconciliation in Realtime RAID-F  (Read 2718 times)

Offline jdezso

  • Newbie
  • *
  • Posts: 1
  • Karma: +0/-0
Instability, Frequent Reconciliation in Realtime RAID-F
« on: June 29, 2013, 12:20:40 pm »
I love the potential of Realtime RAID-F, but I've been having a lot of stability problems since I installed it that are making it cumbersome to use day-to-day. Although I haven't lost any data, the pool needs almost daily (sometimes more) reconciliations. I've also had more serious errors twice now. I've tried googling the problem to no avail, so I'm hoping someone here can shed some light on my situation, or at least give me some new things to try.

System Specifications:
Windows Home Server 2011
CPU: AMD Athlon 64 X2 Dual Core Processor 3800+ 2.00 GHz
RAM: 2.00 GB
2x2.0TB Samsung HD204 SATA
2x2.0TB Samsung HD204 SATA via PCIe drive controller (shown as SCSI in Drive Management)

Current Data occupies 2.23 TB with 165,628 Files within 15,954 Folders

The Main Issue:

When deleting specific files or folders, the first "delete" operation of the day will cause FlexRAID to unpredictably crash hours later. The pool will just disappear: networked computers no longer see the shared drives; media files currently streaming over the network suddenly cut off mid-stream; on the server the pool does not appear as a drive in Windows Explorer. When I check on the server, the FlexRAID user interface claims the drive pool is running. If you attempt to stop the drive pool or click on any of the other functions, a popup appears in the corner of the Chrome window with this error message:
com.google.gwt.user.client.rpc.statuscodeexception 0
If you open the log file at this point, the latest entries at the bottom are never current; they are usually from sometime the night before. There are no error messages in the log, just run-of-the-mill stuff.

I usually resort to restarting the computer. Upon restarting, the drive pool realizes something is wrong and says there is an error preventing it from starting up. The log file after a restart has been populated with all kinds of recent entries. Here is a usual error message:
  • ERROR: Invalid entry detected...
  • ERROR: Need for reconciliation detected! Please run the reconcile task...
  • java.io.IOException: Need for reconciliation detected! Please run the reconcile task... (this line is not always there)

And also, though I'm not sure this is important, another error message is always paired with this a few lines down:
  • WARN : /gwt/rpc
  • java.lang.RuntimeException: Login exception: unauthenticated user!

A few lines above all of what I've shared is this message:
  • WARN : Zero live size for C:\FlexRAID-Managed-Pool\class1_0\{c3342046-4f79-4f34-abdc-759d3cf52f9a}\folder\folder\filename.xxx

This is usually a single entry naming the first thing I deleted that day. The exceptions are a folder with items under it, which produces many of these warnings, or a set of files ctrl-selected together and then deleted. I may have deleted more files after this, but those later operations never appear in the log.

Backing up the filesystem database and then reconciling usually fixes the problem, at the cost of 60-90 minutes of downtime.

The very first time this issue cropped up the log file still showed an error after reconciling and the storage pool could not be restarted. The log file recommended a forced-sync-verify.

Strangely, the second-to-last time this issue cropped up, the reconcile completed and the storage pool started successfully, but Windows Home Server no longer remembered the network's shared folders, and the shares had to be created again.

This problem crops up almost every day. Deleting files is not the only operation that sets it off. Moving files from one place to another (i.e. the first move of the day) leads to the same symptoms described above. I haven't tested this rigorously, but creating a large number of files also seems to cause problems, as happened when I was first migrating all my data into the pool (using Windows Explorer per the Do's and Don'ts). The crashes during the data import were so frequent that I never finished moving everything into the pool.

Lastly, just yesterday, I got this error:
  • ERROR: Not able to find live info for C:\FlexRAID-Managed-Pool\class1_0\{2563839d-649f-476f-8f0d-1856d756b285}\System Volume Information\SPP\snapshot-2
  • WARN : Disabling all live operations! All future operations will get an access denied error...

I have yet to resolve this, and I have no idea what caused it, but I want to include it here as part of my saga of problems trying to run Realtime RAID-F in case it's somehow relevant.

Finally, I did try rebuilding the pool from scratch once when I first started experiencing problems during the initial data import. The problems continued with the new build.

Can someone please give me some guidance on what is going on here? I have TRACE level logs going back at least 2 weeks if that would help.

Thanks,
Jeff

Offline DrBlaze

  • Sr. Member
  • ****
  • Posts: 281
  • Karma: +14/-0
Re: Instability, Frequent Reconciliation in Realtime RAID-F
« Reply #1 on: June 29, 2013, 10:45:24 pm »
Sounds like you're having lots of fun    :-\

"The pool will just disappear: networked computers no longer see the shared drives; media files currently streaming over the network suddenly cut off mid-stream; on the server the pool does not appear as a drive in Windows Explorer. When I check on the server, the FlexRAID user interface claims the drive pool is running. If you attempt to stop the drive pool or click on any of the other functions, a popup appears in the corner of the Chrome window with this error message:
com.google.gwt.user.client.rpc.statuscodeexception 0"

These things all happen any time the service stops unexpectedly. From your error messages it sounds as though there may be one or more problem files that FlexRAID cannot cope with. Here are some steps you can take which should hopefully get you running again:

1- Through Windows Disk Management, mount your first drive (assign a temp drive letter). Right-click on the recycle bin on the desktop and make sure that it is disabled for this drive. Navigate to the "_flxr_\l" folder and delete anything in it. Find any files that the log file mentioned in a "Zero live size" warning and remove them from the drive. Run Windows chkdsk on the drive. Unmount. Repeat ALL steps for the other data drives.

2- Mount PPU, run chkdsk. Unmount.

3- Run Reconcile.
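
For step 1, here is a rough Python sketch for pulling the affected paths out of a log so you don't have to hunt for them by eye. It assumes the warning lines look like the excerpt quoted earlier in this thread ("WARN : Zero live size for <path>"); the function name is made up for illustration and this is not a FlexRAID tool.

```python
import re

# Assumed line shape, based only on the log excerpt quoted in this thread:
#   WARN : Zero live size for C:\FlexRAID-Managed-Pool\...\filename.xxx
ZERO_LIVE = re.compile(r"WARN\s*:\s*Zero live size for (?P<path>.+?)\s*$")


def zero_live_size_paths(log_lines):
    """Return the unique paths named in 'Zero live size' warnings, in log order."""
    paths = []
    for line in log_lines:
        match = ZERO_LIVE.search(line)
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths
```

To use it, open the log file and pass the file object in, e.g. `zero_live_size_paths(open(log_path, errors="replace"))`, where `log_path` is wherever your log lives. Note that the paths come back as pool paths, so you still have to locate the matching file on each mounted data drive.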


Stay in Trace mode and test the system. If you have more problems, post your logs. For future reference, just be aware that deletes tend to get you in more trouble than anything else with RT. Here is a post I made many months ago that still keeps me (mostly) out of trouble:

"I leave my Task Manager running when working with my pool. If FlexRAID is using more than 1-2% I know it is calculating parity, so I go easy on it.
I don't delete while writing to the pool / I don't write while deletes are completing.

When I have a lot of changes to make I don't delete things right away. I created my own recycle bin, a folder called "000 - Trash" that is the first folder in my pool. When I want to delete large files/dirs I drop them in the trash, to be deleted later (this way parity is not immediately affected, and I can continue with other changes)."
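
If you would rather script the trash-folder habit than drag things around in Explorer, something like this Python sketch would do it. The folder name "000 - Trash" is the one from my post; `move_to_trash` and the `pool_root` argument are just illustrative names, and I am assuming a move inside the pool behaves like the manual drag-and-drop.

```python
import shutil
from pathlib import Path

TRASH_NAME = "000 - Trash"  # folder name from the post above


def move_to_trash(item, pool_root):
    """Move a file or folder into <pool_root>/000 - Trash instead of deleting it."""
    item = Path(item)
    trash = Path(pool_root) / TRASH_NAME
    trash.mkdir(exist_ok=True)

    # Pick a free name so an earlier trashed item with the same name is kept.
    target = trash / item.name
    counter = 1
    while target.exists():
        target = trash / f"{item.stem} ({counter}){item.suffix}"
        counter += 1

    shutil.move(str(item), str(target))
    return target
```

The trash still has to be emptied at some point, so this only postpones the deletes to a quiet time when nothing else is writing to the pool.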

One thing I have also learned in the intervening months is not to select a combination of files and folders to delete all at once.  I do the files, then deal with the directories one at a time.  If the dirs contain a lot of files I will go so far as to delete the files within the dir first, then del the dir.  This is because the files in dir may be spread across several drives, and RT has a persistent bug that will sometimes interfere with deleting such a dir in one shot.
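
The "files first, then the directory" routine can be sketched in Python roughly as below. This is only an illustration of the order I do things in by hand; `staged_delete` and the optional `pause_seconds` delay are my own made-up names, and I have no evidence that any particular pause length is what RT needs.

```python
import os
import time


def staged_delete(directory, pause_seconds=0.0):
    """Delete files first, then the emptied directories, deepest level first.

    Returns the removed paths in the order they were removed.
    Assumes the tree contains no symlinked directories.
    """
    removed = []

    def record(path):
        removed.append(path)
        if pause_seconds > 0:
            time.sleep(pause_seconds)  # give the pool time to settle

    # topdown=False visits subdirectories before their parents.
    for root, dirs, files in os.walk(directory, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            os.remove(path)
            record(path)
        for name in dirs:
            path = os.path.join(root, name)
            os.rmdir(path)  # already emptied on an earlier iteration
            record(path)

    os.rmdir(directory)
    record(directory)
    return removed
```

Because every directory is empty by the time it is removed, no single operation ever has to delete a populated directory in one shot.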

RT does still have a few peculiarities (this is why it is still listed as experimental), but I use it with very few issues these days.  I am looking forward to the new release due in the coming days.  Brahim has gradually been making the Delete and Reconcile operations more robust.

Good luck :)