Author Topic: verify sync fails shortly after starting due to "disk error"  (Read 746 times)

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,341
  • Karma: +199/-15
« Reply #15 on: April 17, 2017, 08:43:46 am »
i am still getting some errors in windows events when i do copy/move with traid.
error 153. the IO operation at logical block address ... disk11 was retried.

it doesn't seem serious.  but it's always for that disk that i just replaced, and i replaced the cable, too.  it's the only disk giving errors, but i can't figure out why it's always the same disk.
It is serious. You definitely want to resolve that or you will end up with file system corruption.

Offline pooler1

  • Jr. Member
  • **
  • Posts: 73
  • Karma: +0/-0
« Reply #16 on: April 25, 2017, 01:06:01 pm »
i'm trying to identify the problem.  i just ran a regular verify, and it completed the whole thing 100%, but at the very end, there are a lot of red lines in the log...
error code= 9999999999
2 stripe block failure
first...
last...
operation aborted!
failed uor position = 6
failed uor id = 1000000020

also, since completing the verify, i have gotten no more of the 153 errors in the windows event log, previous to that i was getting them all the time like a couple times a minute.

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,341
  • Karma: +199/-15
« Reply #17 on: April 26, 2017, 04:28:24 am »
Again, try to resolve the underlying issue. The software will not resolve any hardware or driver issue you might have.
Verify & Sync should succeed consistently before you can rest easy that your data is safe.

I would delete the RAID configuration, regain control of the disks, and then run a series of disks tests as well as load tests on the system itself.

Offline adridolf

  • Jr. Member
  • **
  • Posts: 86
  • Karma: +0/-0
« Reply #18 on: April 27, 2017, 06:13:37 am »
i am still getting some errors in windows events when i do copy/move with traid.
error 153. the IO operation at logical block address ... disk11 was retried.

For which disk do you observe that?
Because I have the 153 regularly on a healthy system, but only after server restart and during verify/sync, and only for block 0x2 at the pool disk.

In this special case, I would consider it unrelated. In any other case, just ignore me. ;-)

Offline pooler1

  • Jr. Member
  • **
  • Posts: 73
  • Karma: +0/-0
« Reply #19 on: April 27, 2017, 08:26:18 am »
it's usually just this one disk.  but here's the latest, i think i'm ok.

i ran the verify sync again.  when it started, i got the 153 errors a lot, like a couple times a minute.  a few hours later, they stopped, verify sync is still running.  the task completed successfully (before it would get aborted with a bunch of red text).  And i haven't got any more of the 153 errors since it stopped.  good!  (i think)

so maybe brahim can confirm this or someone else...
my conclusion is that the disk cable that i replaced was bad, and i don't think when i replaced the cable and disk, that a verify sync ever completed until just now.  And so, things should be ok now.

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,341
  • Karma: +199/-15
« Reply #20 on: April 27, 2017, 09:25:19 am »
As a test:
1. Could you run the Verify task without the pool running? Do not stop the array - just the pool.

2. These errors are not normal. Please check your TCQ, SWO, and pool cache settings (turn them off and see if that makes any difference while the pool is running).

3. Check your disks' SMART data and see if there is an increase in error rates.
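For step 3, one way to spot an increase is to note the raw SMART values before a Verify run and compare them afterwards. The sketch below is only an illustration: the attribute IDs are the standard SMART ones, but the function name and the sample readings are made up, and reading the values from the disk (for example with a SMART tool) is left out. A rising UDMA CRC count is commonly associated with cabling or link problems, while reallocated or pending sectors point more toward the disk itself.

```python
# Hypothetical helper: compare two snapshots of SMART raw values for one disk.
# The attribute IDs are standard SMART IDs; the sample numbers are made up.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def smart_increases(before, after):
    """Return {attribute name: (old, new)} for watched counters that went up."""
    changes = {}
    for attr_id, name in WATCHED.items():
        old = before.get(attr_id, 0)
        new = after.get(attr_id, 0)
        if new > old:
            changes[name] = (old, new)
    return changes


# Made-up readings taken before and after a Verify run.
before = {5: 0, 197: 0, 198: 0, 199: 12}
after = {5: 0, 197: 0, 198: 0, 199: 57}
print(smart_increases(before, after))
```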

Offline pooler1

  • Jr. Member
  • **
  • Posts: 73
  • Karma: +0/-0
« Reply #21 on: April 27, 2017, 04:36:32 pm »
i'll do that and report back.  btw, what does it mean to have the pool running but not the array?  to me, it's not any different than not having either running because i can't access the data either way.  is there any practical purpose to having a pool running without the array?

Offline adridolf

  • Jr. Member
  • **
  • Posts: 86
  • Karma: +0/-0
« Reply #22 on: April 28, 2017, 05:57:17 am »
The array without the pool, not the other way around.

The ARRAY just means the "transparent" disks, so a virtualized disk for each physical disk while parity is maintained.

The POOL is the one single disk which displays you the content of all individual disks as a single drive.

Thus, you can just start the array without the pool, so you can access the transparent, parity-protected disks individually (by assigning drive letters to them). The pool then is a distinct feature, providing you this merged view of all files.
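To make the array/pool distinction concrete, here is a purely conceptual sketch. It is not how tRAID is implemented; the disk names, file names, and the pool_view function are invented for illustration. Each array disk is modelled as its own file listing, and the pool is simply the merged view over all of them.

```python
# Conceptual sketch only: not how tRAID is implemented.
# Each "array disk" is modelled as a mapping of relative path -> size.
array_disks = {
    "disk1": {"movies/a.mkv": 700, "docs/notes.txt": 1},
    "disk2": {"movies/b.mkv": 900},
    "disk3": {"music/c.flac": 30},
}


def pool_view(disks):
    """Merge per-disk listings into one listing of path -> owning disk."""
    merged = {}
    for disk_name in sorted(disks):
        for path in disks[disk_name]:
            # First disk wins if the same path exists on more than one disk.
            merged.setdefault(path, disk_name)
    return merged


for path, disk in sorted(pool_view(array_disks).items()):
    print(f"{path} (stored on {disk})")
```

With only the array started, you would browse disk1, disk2, and disk3 separately; the pool adds the single merged listing on top.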

Offline pooler1

  • Jr. Member
  • **
  • Posts: 73
  • Karma: +0/-0
« Reply #23 on: July 24, 2017, 08:04:15 pm »
so i've looked more into my errors, here's some more info, please help me if it sounds like something you know about...

i was looking into if the cables, or hardware was a problem.  I switched some cables around.  It didn't seem like cables were an issue.
One thing I noticed was that my WD black 2TB drives were the ones giving me errors regardless of the cables, even when i replaced one with a new one of the same model.  So the next thing i will try is replacing those drives with hitachi drives to see if that gets rid of some of the issues.

The other set of issues i noticed was coming from the motherboard's sata-attached drives.  most of the drives are attached by sas cables and breakouts, but 4 of them are connected directly to the mobo sata ports, and they all give 153 errors.  could it be a mobo driver issue or something like that?  it's a supermicro x10sat motherboard.  the rest of the drives are attached to m1015 cards, and other than the WD black drives, they don't seem to have problems.
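To check whether the 153 errors really cluster on one controller, it can help to tally them per disk from an exported event log. This is a rough sketch under assumptions: the regular expression is based only on the message text quoted in this thread, and the sample lines and the disk-to-controller mapping are invented, so adjust both to your own export and wiring.

```python
import re
from collections import Counter

# Assumed message shape, based on the event text quoted in this thread.
EVENT_153 = re.compile(
    r"logical block address\s+(\S+).*?disk\s*(\d+)", re.IGNORECASE
)


def tally_retries(lines, disk_to_controller):
    """Count retried-IO messages per disk and per controller."""
    per_disk = Counter()
    per_controller = Counter()
    for line in lines:
        match = EVENT_153.search(line)
        if not match:
            continue  # not a retried-IO message
        disk = int(match.group(2))
        per_disk[disk] += 1
        per_controller[disk_to_controller.get(disk, "unknown")] += 1
    return per_disk, per_controller


# Made-up sample lines and a made-up disk-to-controller mapping.
sample = [
    "The IO operation at logical block address 0x2 for Disk 11 was retried.",
    "The IO operation at logical block address 0x1a40 for Disk 11 was retried.",
    "The IO operation at logical block address 0x88 for Disk 3 was retried.",
    "Some unrelated event text.",
]
controllers = {3: "onboard SATA", 11: "M1015 (x8 slot)"}

per_disk, per_controller = tally_retries(sample, controllers)
print(per_disk.most_common())
print(per_controller.most_common())
```

If the counts follow the controller rather than the drive model, that points toward the driver or controller; if they follow the drive model across controllers, the drives are the more likely suspect.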

Offline Brahim

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8,341
  • Karma: +199/-15
« Reply #24 on: July 26, 2017, 05:43:10 am »
It could be a driver issue, or the SATA controller on the mobo being bad or flaking under load.

Offline TheKLF

  • Newbie
  • *
  • Posts: 24
  • Karma: +1/-0
« Reply #25 on: August 06, 2017, 09:41:29 pm »
I found in my setup that mixing disk controllers in the same tRaid was very unstable.  I had all kinds of problems with a setup that was using a dell perc card and the intel mobo onboard controller (8 drives on perc and 4 on intel).  I ended up getting an intel SAS expander so that I could put all 12 drives on my dell perc card and have had zero problems in the 3 years since.  I also found WD black drives to be very unstable.  They seem to start out fast but over time get really really slow, which I think contributes to timing problems.  I have been running a mix of Hitachi and Seagate 7200 rpm drives with no issues.  The SAS expander was expensive but worth it.

Offline pooler1

  • Jr. Member
  • **
  • Posts: 73
  • Karma: +0/-0
« Reply #26 on: August 07, 2017, 07:17:02 pm »
interesting.  thanks.
yes, I'm using 3 controllers essentially.  The motherboard SATA controllers, and two M1015 cards...one on a x16 slot, and one on a x8 slot.
Thanks for confirming the WD black thing.  I'm going to replace them with HGST and see if that helps, the ones I already use don't give me problems (except for the ones on the mobo sata).


I have two pools:
the first pool uses the x16 M1015 along with the mobo sata.  I tried an HP SAS expander, but I noticed the speeds got affected drastically (especially with the verify operations).  So I wanted to retain the speeds by not using an expander, but that means i can only have 8 drives per M1015.

the second pool is on the slower x8 M1015 card, no expander.  That one continually gave me the problems I've posted here, but I have it going to an external box with additional external cables, and the only drives really giving me problems on that pool are the WD blacks.  So hopefully, replacing them fixes that problem, and the pools work fine after that.

So I'm learning the pros and cons here. WD Black is bad for flexraid apparently, even though they are supposed to be good drives.  The HGST drives seem ok.  mixing the controllers is bad, at least mixing the mobo with the M1015.

What are people's experiences with SAS expanders on traid and speed?  I forgot the numbers but it made the verify operations extremely long, like it went from maybe half a day or one day to weeks even.

Offline TheKLF

  • Newbie
  • *
  • Posts: 24
  • Karma: +1/-0
« Reply #27 on: August 07, 2017, 10:38:16 pm »
So the cheap HP expanders you see on ebay are typically going to be 3Gbps, which will affect your throughput, but by how much will depend on other factors such as drive speed, bus, etc.  I paid more for the intel both because I wanted 6Gbps and because I needed a card that did not require using a pci-e slot (expanders only use them for power).  I ended up going with a model like this:  Intel RES2CV360 SG27535 RAID Expander, which at the time was about $250.

In my setup I have external JBOD enclosures running on 3Gbps expanders so I didn't need a 6Gbps controller, but for the "internal" 12 drives I did go for the 6Gbps controller and expander.  I get a very consistent 50Mbps write speed on the 12 drive 6Gbps array (using all matched 3TB ST3000DM001) where speed will vary between 30-40Mbps write speed on the external JBODs with a mixture of drives.  Write speed is also dependent on where on the disk the data is being written.  An empty disk will perform much faster, writing to the outer part of the platter, than later when it is full and writing to the inner part.

In general I read from other posts on this forum that with 1 parity drive, you should see about 50-66% of the normal write speed (maybe Brahim can correct me on this if I am wrong) - so on the 3Gbps I definitely take a performance hit.  But so far it has not been an issue for me.  The only times I have failed to synchronize since working out the bugs on my setup are when a drive has a lot of bad sectors.  I run a scheduled sync task every night and on that task result the throughput is usually reported as 1000Mbps on the 6Gbps controller and 700Mbps on the 3Gbps.  So I am not sure how long it would take to do a start to finish verify sync, but a 300GB chunk of a 30TB array takes about an hour at 6Gbps and 1.5 hours at 3Gbps (again with 7200 rpm on the 6Gbps and slower 5900 rpm drives on the 3Gbps).
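As a rough back-of-the-envelope check on the last numbers, a full pass can be extrapolated linearly from the timed 300GB chunk. This sketch assumes the rate stays constant over the whole array (it usually does not, since throughput varies across the disk) and that 1 TB = 1000 GB; the function name is made up for illustration.

```python
def full_pass_hours(array_tb, chunk_gb, chunk_hours):
    """Linearly extrapolate a full pass from one timed chunk (1 TB = 1000 GB)."""
    rate_gb_per_hour = chunk_gb / chunk_hours
    return array_tb * 1000 / rate_gb_per_hour


# Figures from the post: 300GB of a 30TB array in 1 hour or 1.5 hours.
for label, hours in (("6Gbps", 1.0), ("3Gbps", 1.5)):
    total = full_pass_hours(30, 300, hours)
    print(f"{label}: about {total:.0f} hours ({total / 24:.1f} days)")
```

Under those assumptions that works out to roughly 100 hours on the 6Gbps side and 150 hours on the 3Gbps side, so treat it as an order-of-magnitude estimate only.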
« Last Edit: August 08, 2017, 12:05:37 am by TheKLF »

Offline pooler1

  • Jr. Member
  • **
  • Posts: 73
  • Karma: +0/-0
« Reply #28 on: August 09, 2017, 01:25:55 pm »
in my effort to reduce system errors and get out of the testing phase for traid...here's what i'm getting.  thanks for your comments btw, very helpful.

i should move the mobo sata drives to their own separate traid pool.
have another traid pool for one of the m1015 controllers, and use the intel expander to retain speeds.
have a third pool for my second m1015 controller, which already is on its own pool.  however, that is the one with the wd black drives.  so i still have to replace those and see if that reduces the errors, because that pool is the most severe at this point.

the only tricky part for this now is splitting the one pool into two, which also means adding an extra parity drive.

off topic...i heard some of the supermicro chassis superserver combos are whisper quiet.  how can i run 20+ drives quiet enough and cool enough to have in the same room as the office computers?