Snap Server / NAS / Storage Technical Goodies: The Home for Snap Server Hacking, Storage and NAS info, and NAS / Snap Classifieds
#1
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
Hey guys,

For several years my organization has had a Snap Server 4100 (purchased as a Dell PowerVault 705N). Long story short: recently a major problem developed whereby our RAID 5 array (4x 40GB drives) refuses to mount its filesystem. On startup it tries to mount, but first runs an fsck and finds that a cylinder group has a bad magic number. Unfortunately, several runs of fsck in "Repair all errors" mode fail to yield any results (fsck exits fatally once it reaches that sector and finds the bad magic number error).

I was wondering if anyone had any ideas about how to go about getting our data back online. The RAID array assembles, and our data is there if I do a sector-by-sector dump of /dev/rraid0; the filesystem just won't mount, making the data inaccessible. We played for a while, to no avail, trying to find some way to either manually correct the data for that sector on the disk through Snap's debug command line, or somehow get an image of the assembled array off of the SnapServer (so we could put it on a hard drive, correct the sector, and then hook that drive up to the SnapServer).

Does anyone have any idea how we could try to rescue our data? Is there a tool available to get an image of the assembled array? Should we try to get hold of a v4 SnapOS to see if fsck is improved? Any ideas would be greatly appreciated.

Thanks so much,
Mike

PS: I'm attaching a recent boot log in case it helps.
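For reference, the "correct the sector" step we had in mind would be something like this on a Linux box, once we had an image of the array (just a sketch; the sector number and file names below are placeholders, not values from our log):

Code:
# Sketch only: patch a single 512-byte sector of an array image in place.
# SECTOR is a placeholder, not the sector our fsck actually complains about.
SECTOR=123456
dd if=repaired_sector.bin of=array.img bs=512 seek=$SECTOR count=1 conv=notrunc
# conv=notrunc keeps dd from truncating the rest of the image after the write.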
#2
Thermophile
Join Date: May 2006
Location: Yakima, WA
Posts: 1,282
Okay, call me dumb, but it sounds like you have a bad drive, and you're running RAID 5, so why not just replace the bad drive and let the array rebuild? Did I miss something?
#3
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Rule #1: do not do anything to compromise the array.

DO NOT DO ANY UPDATES. That will only compound the problem.

The log does not appear to be complete. Can you send the complete log? I'm with Phoenix on this one; I think you may have a bad drive or two. Do any of the lights indicate a problem? Do you have any spare drives that you can work with? Do you have a copy of SpinRite?
__________________
1 Snap 4500 - 1.0T (4 x 250gig WD2500SB RE), RAID5
1 Snap 4500 - 1.6T (4 x 400gig Seagates), RAID5
1 Snap 4200 - 4.0T (4 x 2gig Seagates), RAID5
Using SATA converters from Andy
Link to SnapOS FAQs: http://forums.procooling.com/vbb/showthread.php?t=13820
#4
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
It doesn't seem to be a physical disk error. The sector in question can be accessed just fine. At no point do we get a read/write error, or any other disk-level message. We're not getting any strange lights. It very much looks like the incorrect sector data is duplicated across all 4 drives, but the drives and the array assembly are functioning properly. I'm willing to consider anything, though.

I don't have SpinRite, but I can get it if it'll help. I do have a spare 160GB hard drive. What I attached is the complete log as far as I can tell. What's missing from it?
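By "accessed just fine" I mean a raw read of the sector returns data with no I/O error. On a cloned drive under Linux that check is a one-liner (the device name and sector number here are placeholders, not our real values):

Code:
# Sketch: read one 512-byte sector and discard it.
# dd reports an I/O error and stops if the read fails.
dd if=/dev/hda of=/dev/null bs=512 skip=123456 count=1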
#5
Cooling Savant
Join Date: Feb 2006
Location: South Bend, IN
Posts: 385
I don't think that it is a bad drive either. It seems more like a duplicated bad write, as you suggested. If you look at the array status in the web interface, does it show up as healthy, degraded, or something else?

I would be interested to see what would happen if you replaced drive 1 with a different drive (by drive 1 I mean physical drive 1; it would be drive 0 according to the Snap). I'm wondering whether, when the Snap re-installs the OS to the new drive and rebuilds the RAID, it corrects your error. That's what I'd do, but I'll admit I don't know for sure what is going on, and this is not a "fix" that I know will work. It's just what I would try on my own.
__________________
Snap Server 4100, 4x120GB Seagate Drives, RAID 5, version 3.4.803
#6
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
What do you get back when you use the command line to run info dev and info log t?

The info log t output is where I was able to watch what it was trying to repair when I ran the repair on my failing 705N. It would be informative to see if there are any more specific error messages in there that would help in diagnosing the problem.
#7
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
I'm not sure I would play with any drive swapping till we get more info. I do not want to do anything that will compromise the data. I am assuming that you do not have a current backup.

Any time you get a superblock error it is critical. I was thinking of powering down and using dd to clone the drives. Use a Sharpie pen to mark the drive positions; each drive must be re-installed into its original position. The target must be the same size to do a raw copy; otherwise a standard clone to a file will have to be done. Doing this for each drive will tell us if a drive has a problem. One drive is a different mfg than the others, but this should not cause a problem if the capacity is the same.

The reason I mentioned SpinRite is that it is not OS dependent. It uses the SMART system to verify each sector, byte by byte, reading multiple times to verify the bytes. If a sector won't read it starts shifting the timing so it can read slightly before and after, trying to determine what the data should be. If the sector is bad it will swap in a reserve sector to correct it.
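The dd run for each drive would look something like this on a Linux box (the device names are only examples; triple-check which disk is which before copying):

Code:
# Raw disk-to-disk copy; the target must be at least as large as the source.
dd if=/dev/hdc of=/dev/hdd bs=64k conv=noerror,sync
# Or a standard clone to an image file if the sizes do not match.
dd if=/dev/hdc of=/mnt/spare/drive0.img bs=64k conv=noerror,sync
# conv=noerror,sync keeps going past read errors and pads the bad blocks,
# so a failing drive shows up in dd's error output without aborting the copy.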
#8
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
The status of the RAID array through the web interface is:

RAID5 - Large data protection disk
Error. Fatal error during disk check.

I'm pretty sure the actual array is healthy though. Here's info dev:

Code:
Logical Device: 10006  Position: 0  JBOD  Size (KB): 32296  Free (KB): 22064  Private  Mounted
  Label: Private  Contains system files only
  Unique Id: 0x630A77496C09210C  Mount: /priv  Index: 12  Order: 0
  Partition: 10006  Physical: 10007  FS Size (KB): 32768  Starting Blk: 515  Private
    Physical: 10007  Drive Slot: 0  IDE  Size (KB): 39088640  Fixed

Logical Device: 1000E  Position: 0  JBOD  Size (KB): 32296  Free (KB): 19752  Private  Mounted
  Label: Private  Contains system files only
  Unique Id: 0x4B58D23D6E14EE1E  Mount: /pri2  Index: 13  Order: 1
  Partition: 1000E  Physical: 1000F  FS Size (KB): 32768  Starting Blk: 515  Private
    Physical: 1000F  Drive Slot: 1  IDE  Size (KB): 39088640  Fixed

Logical Device: 60000  Position: 1  RAID  Size (KB): 116175192  Free (KB): 0  Public  Unmounted
  Label: RAID5  Large data protection disk
  Unique Id: 0x0FEBE8223B4463B1  Mount: /0  Index: 0  Order: 255
  Partition: 10000  Physical: 10007  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public
    Physical: 10007  Drive Slot: 0  IDE  Size (KB): 39088640  Fixed
  Partition: 10008  Physical: 1000F  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public
    Physical: 1000F  Drive Slot: 1  IDE  Size (KB): 39088640  Fixed
  Partition: 10010  Physical: 10017  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public
    Physical: 10017  Drive Slot: 2  IDE  Size (KB): 39088640  Fixed
  Partition: 10018  Physical: 1001F  R 60000  Size (KB): 38725064  Starting Blk: 37830  Public
    Physical: 1001F  Drive Slot: 3  IDE  Size (KB): 40146432  Fixed

blue68f100, the error isn't in the superblock. It's in a cylinder group.
#9
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Well, except for the stuff about "Cylinder group 953: bad magic number" and "FSCK fatal error = 15" after Phase 5, your log looks much like my logs from when I was trying to repair my failing 705N. The only thing missing is the "File System : Logical set synchronization done on device 60000" line that should be at the very end.

I got nuthin'.
#10
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
The log is not indicating which drive has the cylinder group problem, just the RAID5 array, 60000.

I would remove 1 drive at a time and see if you can make a clone of it. The one with the problem should report an error. DO ALL DRIVES. DO NOT START THE 4100 WITH A DRIVE REMOVED. We want to locate the bad drive(s) first.
#11
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
I was able to successfully clone all of the drives with no problems whatsoever. I really think it's just that one sector of data on the RAID array that is corrupt (there's no physical corruption).

Any other ideas? Thanks
#12
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
If you were able to clone all the drives and the problem did not show up, that's odd. I'm at a loss if nothing showed up. Did you use dd or some other cloning program?

Does this happen at the same place every time?
#13
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
I used dd. Yes, the error is in the same place every time, which is making us think that it's an error in the data on the assembled array rather than on any one disk. fsck sees the error, but doesn't seem to make any attempt to fix it; it just exits fatally.
#14
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
You can use the "/force" switch or the "/fixfatal" one.
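For reference, the full command line with all the repair switches looks like this (this exact line shows up in the debug log quoted later in this thread):

Code:
fsck /dev/rraid0 /force /fix /fixfatal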
Quote:
#15
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
Been there, done that. No dice.
#16
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
/altsb will replace the superblock with an alternate one. That's not the problem I'm having.
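For anyone searching later: if /altsb takes the same form as the other fsck switches, it would presumably be run like this. That is an assumption on my part, not something shown in any log here:

Code:
fsck /dev/rraid0 /altsb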
#17
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Quote:
10/16/2006 16:53:19 70 I L01 | File System Check : Executing fsck /dev/rraid0 /force /fix /fixfatal
#18
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
mvastola, thanks for the meaning of "/altsb".

After talking with Snap-Tech last night, it is probably a timing problem. Clone that drive to a new one and install it into the set; the Snap should then rebuild the array. Snap-Tech may be able to elaborate more. It was a pretty in-depth technical discussion of how the Snap OS works at a hardware level.
Last edited by blue68f100; 10-29-2006 at 12:43 PM.
#19
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Quote:
#20
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
This is just a question I have, so don't do this just because I'm asking.

What would happen if he tried to mount the RAID? I mean using config devices mount [dev], where dev would be 60000. Would it attempt to mount the RAID even with the problem, or would it fail because the disk check hasn't completed?
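Spelled out with the device id from the info dev output above, that would be:

Code:
config devices mount 60000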
#21
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
We tried that at one point. It crashed the kernel and didn't mount the drive.
Mike
#22
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
rpmurray, I failed to recall that he was able to clone the drives with no errors.
Quote:
If he still has the cloned drives, install them all and see what happens.

The way Snap-Tech explained it to me: if a drive does not respond within a set time, the Snap reports a problem, even if all the data is good. The problem occurs when SMART relocates some sectors due to an error; the time required to get to the spot increases because of the detour, but the drive is still good. He has the tools required to adjust the drives to work in unison. I hope the Guardian OS is not this picky.....
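A quick way to check whether a drive has relocated sectors is to pull its SMART attributes on a Linux box with smartmontools (assuming you have smartctl available; the device name is an example):

Code:
# Attribute 5 (Reallocated_Sector_Ct) counts sectors SMART has remapped;
# a non-zero raw value means the drive is taking the "detour" described above.
smartctl -A /dev/hdc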
#23
Cooling Neophyte
Join Date: Mar 2002
Location: Washington State
Posts: 54
Unfortunately, after looking at the full debug log sent to me, there is nothing that can be done that is going to allow the Snap to remount the RAID.

The fsck error = 15, due to the bad magic number, is a fatal filesystem error that we at Snap support were never able to fix. At the time, our only suggestion was to send the Snap to a data recovery company.

Sorry for the bad news, mvastola12.

Douglas
#24
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Everything I hear and read about magic numbers is, "The magic is your data disappears." That seems to be a weakness of the XFS file system.

I hope I can simulate this type of failure when I start working on my 4500.
#25
Thermophile
Join Date: May 2006
Location: Yakima, WA
Posts: 1,282
Quote: