Snap Server 4100 Meltdown
Hey guys,
For several years my organization has had a Snap Server 4100 (purchased as a Dell PowerVault 705N). Long story short, a major problem recently developed whereby our RAID 5 array (4x 40 GB drives) refuses to mount its filesystem. On startup it tries to mount, but it runs an fsck and finds that a cylinder group has a bad magic number. Unfortunately, several runs of fsck in "Repair all errors" mode fail to yield any results (fsck exits fatally once it reaches that sector and finds the bad magic number error).

I was wondering if anyone had any ideas about how to get our data back online. The RAID array assembles, and our data is there if I do a sector-by-sector dump of /dev/rraid0; the filesystem just won't mount, making the data inaccessible. We played for a while, to no avail, trying to find some way to either manually correct the data for that sector on the disk through Snap's debug command line, or somehow get an image of the assembled array off the SnapServer (so we could put it on a hard drive, correct the sector, and then hook that drive up to the SnapServer).

Does anyone have any idea how we could try to rescue our data? Is there a tool available to get an image of the assembled array? Should we try to get hold of a v4 SnapOS to see if fsck is improved? Any ideas would be greatly appreciated.

Thanks so much,
Mike

PS: I'm attaching a recent boot log in case it helps. |
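For anyone following the thread, here is a rough sketch of the kind of offline image-and-patch workflow Mike is describing, assuming a raw copy of the assembled array could be attached to a Linux box. The device name, image file, and sector number below are hypothetical placeholders, and the Snap's own debug shell may not expose any of these tools.

Code:
# Hypothetical: /dev/sdb is a raw copy of the assembled array on a Linux box.
# Image it sector by sector, padding any unreadable sectors with zeros:
dd if=/dev/sdb of=array.img bs=512 conv=noerror,sync

# Inspect the suspect sector (sector number is a placeholder) in hex:
dd if=array.img bs=512 skip=123456 count=1 | hexdump -C

# After preparing a corrected 512-byte sector in fixed_sector.bin,
# write it back into the image without truncating the rest of the file:
dd if=fixed_sector.bin of=array.img bs=512 seek=123456 count=1 conv=notrunc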
Re: Snap Server 4100 Meltdown
Okay, call me dumb, but it sounds like you have a bad drive. You're on RAID 5, though, so why not just replace the bad drive and let it rebuild the array? Did I miss something?
|
Re: Snap Server 4100 Meltdown
Rule #1: Do not do anything to compromise the array.
DO NOT DO ANY UPDATES; this will only compound the problem. The log does not appear to be complete. Can you send the complete log? I'm with Phoenix: I think you may have a bad drive or two. Do any of the lights indicate a problem? Do you have any spare drives that you can work with? Do you have a copy of SpinRite? |
Re: Snap Server 4100 Meltdown
It doesn't seem to be a physical disk error. The sector in question can be accessed just fine. At no point do we get a read/write error or any other disk-level message. We're not getting any strange lights. It very much looks like the incorrect sector data is duplicated across all 4 drives, but the drives and the array assembly are functioning properly. I'm willing to consider anything, though.
I don't have SpinRite, but I can get it if it'll help. I do have a spare 160 GB hard drive. What I attached is the complete log as far as I can tell. What's missing from it? |
Re: Snap Server 4100 Meltdown
I don't think that it is a bad drive either. It seems more like a duplicated bad write, as you suggested. If you look at the array status in the web interface, does it show up as healthy, degraded, or something else?
I would be interested to see what would happen if you replaced drive 1 with a different drive (by drive 1 I mean physical drive 1; it would be drive 0 according to the Snap). I'm wondering whether, when it re-installs the OS to it and rebuilds the RAID, it corrects your error. That's what I'd do, but I will admit that I don't know for sure what is going on, and this is not a "fix" that I know for sure will work. It's just what I would try on my own. |
Re: Snap Server 4100 Meltdown
What do you get back when you use the command line to do an "info dev" and an "info log t"?
The "info log t" is where I was able to watch what it was trying to repair when I ran the repair on my failing 705N. It would be informative to see if there are any more specific error messages in there that would help in diagnosing the problem. |
Re: Snap Server 4100 Meltdown
I'm not sure I would play with any drive swapping until we get more info. I do not want to do anything that will compromise the data. I am assuming that you do not have a current backup.
Any time you get a superblock error it is critical. I was thinking of powering down and using dd to clone the drives. Use a Sharpie pen and mark the drive positions; they must be re-installed back into their original positions. The target must be the same size to do a raw copy; otherwise a standard clone to a file will have to be done. Doing this for each drive will tell us if a drive has a problem. If one drive is a different manufacturer than the others, that should not cause a problem as long as the capacity is the same. The reason I mentioned SpinRite is that it is not OS dependent. It uses the SMART system to verify each sector, byte by byte, reading multiple times to verify the bytes. If a sector is unknown, it starts shifting the timing so it can read slightly before and after, trying to determine what it should be. If the sector is bad, it will swap in a reserve sector to correct it. |
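For readers unfamiliar with the cloning step described above, here is a minimal sketch on a Linux box, assuming each Snap drive is attached one at a time and shows up as /dev/sdb (device and file names are placeholders only):

Code:
# Clone one drive to an image file. conv=noerror,sync keeps going past
# unreadable sectors (padding them with zeros), so a failing drive shows
# up as read errors in the output instead of aborting the copy.
dd if=/dev/sdb of=snap_drive1.img bs=64k conv=noerror,sync

# Raw drive-to-drive copy; the destination must be at least as large:
dd if=/dev/sdb of=/dev/sdc bs=64k conv=noerror,sync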
Re: Snap Server 4100 Meltdown
The status of the RAID array through the web interface is:

RAID5 - Large data protection disk Error. Fatal error during disk check.

I'm pretty sure the actual array is healthy, though. Here's info dev:

Code:
Logical Device: 10006  Position: 0  JBOD  Size (KB): 32296  Free (KB): 22064  Private  Mounted

blue68f100, the error isn't in the superblock. It's in a cylinder group. |
Re: Snap Server 4100 Meltdown
Well, except for the stuff about "Cylinder group 953: bad magic number" and "FSCK fatal error = 15" after Phase 5, your log looks much like my logs from when I was trying to repair my failing 705N. The only thing missing is "File System : Logical set synchronization done on device 60000", which should be at the very end.
I got nuthin' |
Re: Snap Server 4100 Meltdown
The log is not indicating which drive has the cylinder group problem, just the RAID 5 array, 60000.
I would remove one drive at a time and see if you can make a clone of each. The one with the problem should report an error. DO ALL DRIVES, and DO NOT START THE 4100 WITH A DRIVE REMOVED. We want to locate the bad drive(s) first. |
Re: Snap Server 4100 Meltdown
I was able to successfully clone all of the drives with no problems whatsoever. I really think it's just that one sector of data on the RAID array that is corrupt (there's no physical corruption).
Any other ideas? Thanks |
Re: Snap Server 4100 Meltdown
If you were able to clone all the drives and the problem did not show up, that's odd. I'm at a loss if nothing showed up. Did you use dd or some other cloning program?
Does this happen at the same place every time? |
Re: Snap Server 4100 Meltdown
I used dd. Yeah, the error is in the same place every time, which is making us think it's an error in the data on the assembled array rather than in the array itself. fsck sees the error but doesn't seem to make any attempt to fix it; it just exits fatally.
|
Re: Snap Server 4100 Meltdown
You can use the "/force" command or the "/fixfatal".
|
Re: Snap Server 4100 Meltdown
Been there, done that. No dice.
|
Re: Snap Server 4100 Meltdown
/altsb will replace the superblock with an alternate one. That's not the problem I'm having.
|
Re: Snap Server 4100 Meltdown
Quote:
10/16/2006 16:53:19 70 I L01 | File System Check : Executing fsck /dev/rraid0 /force /fix /fixfatal |
Re: Snap Server 4100 Meltdown
mvastola, thanks for the meaning of "/altsb".
After talking with Snap-tech last night, it is probably a timing problem. Clone that drive to a new one and install it into the set; the Snap should then rebuild the array. Snap-tech may be able to elaborate more. It was a pretty in-depth technical discussion on how the Snap OS works at a hardware level. |
Re: Snap Server 4100 Meltdown
This is just a question I have, so don't do this just because I'm asking.
What would happen if he tried to mount the RAID? I mean using config devices mount [dev], where dev would be 60000. Would it attempt to mount the RAID even with the problem, or would it fail because the disk check hasn't completed? |
Re: Snap Server 4100 Meltdown
We tried that at one point. It crashed the kernel and didn't mount the drive.
Mike |
Re: Snap Server 4100 Meltdown
rpmurray, I failed to recall that he was able to clone the drives with no errors.
If he still has the drives cloned, install them all and see what happens. The way Snap-Tech explained it to me is this: if a drive does not respond within a set time, the Snap reports a problem, even if all the data is good. The problem occurs if SMART relocates some sectors due to an error; then the time required to get to the spot has increased, due to the detour. The drive is still good. He has the tools required to adjust the drives to work in unison. I hope the Guardian OS is not this picky..... |
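As a side note, if the drives end up attached to a Linux box again for cloning, smartmontools can show whether any sectors have actually been reallocated, which is the kind of "detour" being described here. The device name is a placeholder:

Code:
# Print the SMART attribute table. Reallocated_Sector_Ct (attribute 5) and
# Current_Pending_Sector (attribute 197) reveal remapped or suspect sectors
# even when the drive still reads back all of its data correctly.
smartctl -A /dev/sdb

# Optionally kick off the drive's built-in extended self-test:
smartctl -t long /dev/sdb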
Re: Snap Server 4100 Meltdown
Unfortunately, after looking at the full debug log sent to me, there is nothing that can be done that is going to allow the Snap to remount the RAID.
The fsck error = 15, due to the bad magic number, is a fatal filesystem error that we at Snap support were never able to fix. At that time our only suggestion was to send the Snap to a data recovery company. Sorry for the bad news, mvastola12. Douglas |
Re: Snap Server 4100 Meltdown
Everything I hear and read about magic numbers is, "The magic is your data disappears", which seems to be a weakness of the XFS file system.
I hope I can simulate this type of failure when I start working on my 4500. |
Re: Snap Server 4100 Meltdown
It's on my list of things to do; trust me, I will get to it.
This time of year I get busy with my leather craft; you have one of my big sellers. Joe has 3 (of ?) for allowing us to use his servers. All are great little gifts. It's how I fund my toys. |
Re: Snap Server 4100 Meltdown
It took me a little while to find the info I was looking for, but here it is.
The issue that caused this unit to fail is the following. If you look at the info dev output, you will see that the last drive's starting block is not the same as the other 3 drives'. We discovered that this would ultimately cause the superblock error, but not until after a reboot once the RAID was rebuilt with a replacement drive. It had to do with what version the Snap was running when the drives were first formatted. The starting block was changed between versions 2.x and 3.x, and was increased by a factor of 1. To find the starting sector, the starting block is multiplied by 16; that's the actual sector where it is located on the disk.

This turned out to be a big issue at Snap, and from that point on all single drives that were shipped out to clients had to be pre-formatted at Snap under a certain OS version, depending on where the starting block was on the client's current Snap. Again, the RAID would build okay and the issue would not show up until after the Snap was rebooted the first time; then this error would appear, and if you tried to force a mount it would cause the Snap to panic. We never found a way to resolve it once it was in this condition, and therefore data recovery was the client's only option.

I hope this makes sense. I found this in my notes. I am still looking for the actual official document, given to the techs, which explains this in full detail. Douglas
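To make the block-to-sector arithmetic Douglas mentions concrete, here is a trivial illustration; the starting-block value is a made-up placeholder, not a number taken from this unit's logs:

Code:
# Multiply the starting block the Snap reports by 16 to get the actual
# on-disk sector (per the explanation above). Example value only.
START_BLOCK=63
echo $((START_BLOCK * 16))   # prints 1008, the sector for that block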
Re: Snap Server 4100 Meltdown
Good info...
|
Re: Snap Server 4100 Meltdown
Great info.
So the fix when you replace a drive is to do a "co de format xxxxx /reinit" on all drives? Or do all drives need to be totally clean so the Snap does it from scratch? |