Unread 10-19-2006, 01:11 AM   #1
mvastola12
Cooling Neophyte
 
Join Date: Oct 2006
Location: US
Posts: 8
Snap Server 4100 Meltdown

Hey guys,
For several years my organization has had a Snap Server 4100 (purchased as a Dell PowerVault 705N). Long story short, a major problem recently developed: our RAID 5 array (4x 40gig drives) refuses to mount its filesystem. On startup, it seems to try to mount, but it runs an fsck and finds that a cylinder group has a bad magic number. Unfortunately, several runs of fsck in "Repair all errors" mode have failed to yield any results (fsck exits fatally once it reaches that sector and finds the bad magic number error).

I was wondering if anyone had any ideas about how to go about getting our data back online. The RAID array assembles, and our data is there if I do a sector-by-sector dump of /dev/rraid0 - the filesystem just won't mount, making the data inaccessible.

We played around for a while, to no avail, trying to find some way to either manually correct the data for that sector on the disk through Snap's debug command line, or somehow get an image of the assembled array off of the SnapServer (so we could put it on a hard drive, correct the sector, and then hook that drive up to the SnapServer).

Does anyone have any idea how we could try to rescue our data? Is there a tool available to get an image of the assembled array? Should we try to get a hold of a v4 SnapOS to see if fsck is improved? Any ideas would be greatly appreciated.

Thanks soo much,
Mike

PS: I'm attaching a recent boot log in case it helps.
Attached Files
File Type: txt bootlog.txt (7.1 KB, 13 views)
mvastola12 is offline   Reply With Quote
Unread 10-19-2006, 07:03 AM   #2
Phoenix32
Thermophile
 
Phoenix32's Avatar
 
Join Date: May 2006
Location: Yakima, WA
Posts: 1,282
Default Re: Snap Server 4100 Meltdown

Okay, call me dumb, but it sounds like you have a bad drive, but you are in RAID 5, so why not just replace the bad drive and let it rebuild the array? Did I miss something?
Phoenix32 is offline   Reply With Quote
Unread 10-19-2006, 08:02 AM   #3
blue68f100
Thermophile
 
blue68f100's Avatar
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: Snap Server 4100 Meltdown

rule #1. Do not do anything to compromise the array.
DO NOT DO ANY UPDATES, This will only compound the problem.

The log does not appear to be complete. Can you send the complete log?

I'm like Phoenix: I think you may have a bad drive or 2. Do any of the lights indicate a problem?

Do you have any spare drives that you can work with?

Do you have a copy of spinrite?
__________________
1 Snap 4500 - 1.0T (4 x 250gig WD2500SB RE), Raid5,
1 Snap 4500 - 1.6T (4 x 400gig Seagates), Raid5,
1 Snap 4200 - 4.0T (4 x 2gig Seagates), Raid5, Using SATA converts from Andy

Link to SnapOS FAQ's http://forums.procooling.com/vbb/showthread.php?t=13820
blue68f100 is offline   Reply With Quote
Unread 10-19-2006, 11:40 AM   #4
mvastola12
Cooling Neophyte
 
Join Date: Oct 2006
Location: US
Posts: 8
Default Re: Snap Server 4100 Meltdown

It doesn't seem to be a physical disk error. The sector in question can be accessed just fine. At no point do we get a read/write error, or any other message at a disk level. We're not getting any strange lights. It very much looks like the incorrect sector data is duplicated across all 4 drives, but the drives and array assembly are functioning properly. I'm willing to consider anything though.

I don't have SpinRite, but I can get it if it'll help. I do have a spare 160gb hard drive.

What I attached is the complete log as far as I can tell. What's missing from it?
mvastola12 is offline   Reply With Quote
Unread 10-19-2006, 12:13 PM   #5
jontz
Cooling Savant
 
Join Date: Feb 2006
Location: South Bend, IN
Posts: 385
Default Re: Snap Server 4100 Meltdown

I don't think that it is a bad drive either. It seems more like a duplicated bad write, as you suggested. If you look at the array status in the web interface, does it show up as healthy, degraded, or something else?

I would be interested to see what would happen if you replaced drive 1 with a different drive (by drive 1 I mean physical drive 1...it would be drive 0 according to the snap). I'm wondering whether it would correct your error when it re-installs the OS to that drive and rebuilds the RAID. That's what I'd do, but I will admit that I don't know for sure what is going on and this is not a "fix" that I know for sure will work. It's just what I would try on my own.
__________________
Snap Server 4100, 4x120GB Seagate Drives, RAID 5, version 3.4.803
jontz is offline   Reply With Quote
Unread 10-19-2006, 12:33 PM   #6
rpmurray
Cooling Savant
 
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Default Re: Snap Server 4100 Meltdown

What do you get back when you use the command line to do an:

info dev

and

info log t

The info log t is where I was able to watch what it was trying to repair when I ran the repair on my failing 705N. It would be informative to see if there are any more specific error messages in there that would help in diagnosing the problem.
rpmurray is offline   Reply With Quote
Unread 10-19-2006, 12:47 PM   #7
blue68f100
Thermophile
 
blue68f100's Avatar
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: Snap Server 4100 Meltdown

I'm not sure I would play with any drive swapping till we get more info. I do not want to do anything that will compromise the data. I am assuming that you do not have a current backup.

Any time you get a SuperBlock error it is critical.

I was thinking of powering down and using dd to clone the drives. Use a Sharpie pen to mark the drive positions; each drive must be re-installed in its original position. The target drive must be the same size to do a raw copy; otherwise a standard clone to a file will have to be done. Doing this for each drive will tell us if a drive has a problem.
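
The clone-to-a-file idea can be sketched in Python as well, for anyone who wants a record of exactly which sectors failed to read. This is only a rough sketch under assumptions: the drive is attached to a separate machine where it shows up as a readable device or file, the function name `image_device` and the 512-byte sector size are illustrative, and dd itself remains the tool actually used in this thread.

```python
def image_device(src_path, dst_path, chunk_size=1 << 20, sector_size=512):
    """Copy src to dst and return the byte offsets of sectors that could not be read.

    Unreadable sectors are written as zeros so the image keeps its layout.
    """
    bad_offsets = []
    # buffering=0 gives raw, unbuffered reads, so a failure maps to a known offset.
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        offset = 0
        while True:
            src.seek(offset)
            try:
                data = src.read(chunk_size)
            except OSError:
                # The large read failed: retry this region one sector at a time.
                buf = bytearray()
                for off in range(offset, offset + chunk_size, sector_size):
                    src.seek(off)
                    try:
                        sector = src.read(sector_size)
                    except OSError:
                        sector = b"\x00" * sector_size
                        bad_offsets.append(off)
                    buf += sector
                    if len(sector) < sector_size:
                        break  # reached the end of the source
                data = bytes(buf)
            if not data:
                break  # end of the source
            dst.write(data)
            offset += len(data)
    return bad_offsets
```

An empty list back from every drive would point away from a physical read problem, which is the same thing a clean dd run tells you.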

1 drive is a different mfg than the others. This should not cause a problem if the capacity is the same.

The reason I mentioned SpinRite is that it is not OS dependent. It uses the SMART system to verify each sector, byte by byte, by reading multiple times to verify the bytes. If a byte is unknown, it starts shifting the timing so it can read slightly before and after, trying to determine what it should be. If the sector is bad, it will swap in a reserve sector to correct it.
blue68f100 is offline   Reply With Quote
Unread 10-19-2006, 01:11 PM   #8
mvastola12
Cooling Neophyte
 
Join Date: Oct 2006
Location: US
Posts: 8
Default Re: Snap Server 4100 Meltdown

The status of the RAID Array through the web interface is:
RAID5 - Large data protection disk
Error. Fatal error during disk check.

I'm pretty sure the actual array is healthy though.

Here's info dev:
Code:
Logical Device: 10006      Position: 0  JBOD     Size (KB):    32296  Free (KB):    22064  Private  Mounted
  Label:Private  Contains system files only
     Unique Id: 0x630A77496C09210C    Mount: /priv      Index: 12  Order: 0
     Partition: 10006  Physical: 10007  FS       Size (KB):    32768  Starting Blk:   515  Private
      Physical: 10007    Drive Slot: 0  IDE      Size (KB): 39088640  Fixed    

Logical Device: 1000E      Position: 0  JBOD     Size (KB):    32296  Free (KB):    19752  Private  Mounted
  Label:Private  Contains system files only
     Unique Id: 0x4B58D23D6E14EE1E    Mount: /pri2      Index: 13  Order: 1
     Partition: 1000E  Physical: 1000F  FS       Size (KB):    32768  Starting Blk:   515  Private
      Physical: 1000F    Drive Slot: 1  IDE      Size (KB): 39088640  Fixed    

Logical Device: 60000      Position: 1  RAID     Size (KB): 116175192  Free (KB):        0  Public   Unmounted
  Label:RAID5  Large data protection disk
     Unique Id: 0x0FEBE8223B4463B1    Mount: /0         Index: 0  Order: 255
     Partition: 10000  Physical: 10007  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public 
      Physical: 10007    Drive Slot: 0  IDE      Size (KB): 39088640  Fixed    
     Partition: 10008  Physical: 1000F  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public 
      Physical: 1000F    Drive Slot: 1  IDE      Size (KB): 39088640  Fixed    
     Partition: 10010  Physical: 10017  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public 
      Physical: 10017    Drive Slot: 2  IDE      Size (KB): 39088640  Fixed    
     Partition: 10018  Physical: 1001F  R 60000  Size (KB): 38725064  Starting Blk: 37830  Public 
      Physical: 1001F    Drive Slot: 3  IDE      Size (KB): 40146432  Fixed
info log t is attached on the first post.
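
As a quick sanity check on the listing above: a RAID 5 set gives up one member's worth of space to parity, so the usable size should be (members - 1) times the smallest member partition. The short sketch below just redoes that arithmetic with the partition sizes from the listing; the function name is made up for illustration.

```python
def raid5_usable_kb(member_sizes_kb):
    """Usable RAID 5 capacity: (n - 1) members' worth of the smallest partition."""
    n = len(member_sizes_kb)
    if n < 3:
        raise ValueError("RAID 5 needs at least 3 members")
    return (n - 1) * min(member_sizes_kb)

# Partition sizes (KB) of the four RAID members from the info dev output.
members = [38725064, 38725064, 38725064, 38725064]
print(raid5_usable_kb(members))  # 116175192
```

That matches the 116175192 KB reported for logical device 60000, so the array geometry itself looks the way it should.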

blue68f100, The error isn't in the superblock. It's in a cylinder group.
mvastola12 is offline   Reply With Quote
Unread 10-19-2006, 03:34 PM   #9
rpmurray
Cooling Savant
 
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Default Re: Snap Server 4100 Meltdown

Well, except for the stuff about "Cylinder group 953: bad magic number" and "FSCK fatal error = 15" after Phase 5 in your log this looks much like my logs when I was trying to repair my failing 705N. The only thing missing is "File System : Logical set synchronization done on device 60000" that should be at the very end.

I got nuthin'
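
For anyone who ends up working on an image of the assembled array, checking a magic number by hand is simple once you know where the cylinder group starts. The sketch below is a guess at what that looks like, not a description of SnapOS: it borrows the cylinder group magic from BSD FFS (CG_MAGIC = 0x090255, stored in the second 32-bit field of the cylinder group header), and nothing in this thread confirms that SnapOS uses the same value, field position, or byte order. The offset of cylinder group 953 also has to come from the superblock, which is not worked out here.

```python
import struct

# Assumption: value taken from the BSD FFS headers; SnapOS may differ.
CG_MAGIC_FFS = 0x090255

def read_magic(image_path, cg_offset, field_offset=4, byte_order="<"):
    """Read the 32-bit value at cg_offset + field_offset from an image file.

    field_offset=4 assumes the FFS header layout, where one 32-bit field
    comes before the magic. byte_order is "<" (little) or ">" (big).
    """
    with open(image_path, "rb") as f:
        f.seek(cg_offset + field_offset)
        raw = f.read(4)
    if len(raw) != 4:
        raise ValueError("offset is past the end of the image")
    return struct.unpack(byte_order + "I", raw)[0]

def magic_ok(image_path, cg_offset, expected=CG_MAGIC_FFS, **kwargs):
    """True if the value found at the cylinder group header matches expected."""
    return read_magic(image_path, cg_offset, **kwargs) == expected
```

Comparing what is found at a known-good cylinder group with what sits at the failing one would at least show whether the header was zeroed, overwritten with file data, or only slightly damaged.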
rpmurray is offline   Reply With Quote
Unread 10-19-2006, 05:19 PM   #10
blue68f100
Thermophile
 
blue68f100's Avatar
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: Snap Server 4100 Meltdown

The log does not indicate which drive has the cylinder group problem, just the RAID 5 array, 60000.

I would remove 1 drive at a time and see if you can make a clone of the drives. The one with the problem should report an error. DO ALL DRIVES

DO NOT START THE 4100 WITH A DRIVE REMOVED. We want to locate the bad drive/s first.
blue68f100 is offline   Reply With Quote
Unread 10-25-2006, 04:36 PM   #11
mvastola12
Cooling Neophyte
 
Join Date: Oct 2006
Location: US
Posts: 8
Default Re: Snap Server 4100 Meltdown

I was able to successfully clone all of the drives with no problems whatsoever. I really think it's just that one sector of data on the RAID array that is corrupt (there's no physical corruption).
Any other ideas? Thanks
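
With clean clones of all four drives in hand, one way to test the "bad data written consistently" theory is to check RAID 5 parity across the images: in standard RAID 5 the XOR of the corresponding blocks on all members is zero, no matter which member holds the parity for a given stripe. The sketch below assumes exactly that, plus two things this thread does not establish: that each member's RAID partition can be located as a byte offset in its image (the "Starting Blk" values from info dev would need a known block size), and that any RAID metadata areas are excluded from the range checked. The function name and parameters are made up for illustration.

```python
from contextlib import ExitStack

def parity_mismatches(member_paths, start_offsets, length, chunk_size=64 * 1024):
    """Return relative offsets of chunks whose XOR across all members is not zero.

    start_offsets are byte offsets of the RAID partition inside each image;
    length is the number of bytes to check from those starting points.
    """
    mismatches = []
    with ExitStack() as stack:
        files = [stack.enter_context(open(p, "rb")) for p in member_paths]
        for f, start in zip(files, start_offsets):
            f.seek(start)
        done = 0
        while done < length:
            want = min(chunk_size, length - done)
            chunks = [f.read(want) for f in files]
            if any(len(c) != want for c in chunks):
                raise ValueError("a member image is shorter than expected")
            # XOR the chunks together as big integers; zero means parity holds.
            acc = 0
            for c in chunks:
                acc ^= int.from_bytes(c, "little")
            if acc != 0:
                mismatches.append(done)
            done += want
    return mismatches
```

If parity holds everywhere, the RAID layer is consistent and the damage sits in the filesystem data itself, which is what the clean dd runs already hint at.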
mvastola12 is offline   Reply With Quote
Unread 10-25-2006, 05:07 PM   #12
blue68f100
Thermophile
 
blue68f100's Avatar
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: Snap Server 4100 Meltdown

If you were able to clone all the drives, and the problem did not show up, that's odd. I'm at a loss if nothing showed up. Did you use dd or some other cloning program?

Does this happen at the same place every time?
blue68f100 is offline   Reply With Quote
Unread 10-28-2006, 03:01 PM   #13
mvastola12
Cooling Neophyte
 
Join Date: Oct 2006
Location: US
Posts: 8
Default Re: Snap Server 4100 Meltdown

I used dd. Yea. The error is in the same place every time, which is making us think that it's an error with the data in the array, after the array is assembled. fsck sees the error, but doesn't seem to make any attempts to fix it - it just exits fatally.
mvastola12 is offline   Reply With Quote
Unread 10-28-2006, 07:17 PM   #14
blue68f100
Thermophile
 
blue68f100's Avatar
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: Snap Server 4100 Meltdown

You can use a "/force" cmd or the "/fixfatal"

Quote:
[fsck dev [[/fix /fixfatal /altsb]] (check device's filesystem)
Does anyone know what the "/altsb" cmd does?
blue68f100 is offline   Reply With Quote
Unread 10-28-2006, 08:18 PM   #15
mvastola12
Cooling Neophyte
 
Join Date: Oct 2006
Location: US
Posts: 8
Default Re: Snap Server 4100 Meltdown

Been there, done that. No dice.
mvastola12 is offline   Reply With Quote
Unread 10-28-2006, 08:24 PM   #16
mvastola12
Cooling Neophyte
 
Join Date: Oct 2006
Location: US
Posts: 8
Default Re: Snap Server 4100 Meltdown

/altsb will replace the superblock with an alternate one. That's not the problem I'm having.
mvastola12 is offline   Reply With Quote
Unread 10-29-2006, 11:11 AM   #17
rpmurray
Cooling Savant
 
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Default Re: Snap Server 4100 Meltdown

Quote:
Originally Posted by blue68f100
You can use a "/force" cmd or the "/fixfatal"
Looks like he already tried that. From the log:

10/16/2006 16:53:19 70 I L01 | File System Check : Executing fsck /dev/rraid0 /force /fix /fixfatal
rpmurray is offline   Reply With Quote
Unread 10-29-2006, 12:36 PM   #18
blue68f100
Thermophile
 
blue68f100's Avatar
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: Snap Server 4100 Meltdown

mvastola, Thanks for the meaning of "/altsb"

After talking with Snap-tech last night, it is probably a timing problem. Clone that drive to a new one and install it into the set. The snap then should rebuild the array.

Snap-tech may be able to elaborate more. It was a pretty in depth technical discussion on how the Snap OS works at a hardware level.

Last edited by blue68f100; 10-29-2006 at 12:43 PM.
blue68f100 is offline   Reply With Quote
Unread 10-29-2006, 02:12 PM   #19
rpmurray
Cooling Savant
 
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Default Re: Snap Server 4100 Meltdown

Quote:
Originally Posted by blue68f100
After talking with Snap-tech last night, it is probably a timing problem. Clone that drive to a new one and install it into the set. The snap then should rebuild the array.
Which drive do you want him to clone? None of the drives are giving a failure message and the error is not specific to any one drive.
rpmurray is offline   Reply With Quote
Unread 10-29-2006, 02:18 PM   #20
rpmurray
Cooling Savant
 
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Default Re: Snap Server 4100 Meltdown

This is just a question I have, so don't do this just because I'm asking.

What would happen if he tried to mount the raid? I mean using:

config devices mount [dev], where dev would be 60000.

Would it attempt to mount the raid even with the problem? Or would it fail because the disk check hasn't completed?
rpmurray is offline   Reply With Quote
Unread 10-29-2006, 02:36 PM   #21
mvastola12
Cooling Neophyte
 
Join Date: Oct 2006
Location: US
Posts: 8
Default Re: Snap Server 4100 Meltdown

We tried that at one point. It crashed the kernel and didn't mount the drive.

Mike
mvastola12 is offline   Reply With Quote
Unread 10-29-2006, 02:45 PM   #22
blue68f100
Thermophile
 
blue68f100's Avatar
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: Snap Server 4100 Meltdown

rpmurray, I failed to recall that he was able to clone the drives with no errors.

Quote:
The error is in the same place every time, which is making us think that it's an error with the data in the array, after the array is assembled. fsck sees the error, but doesn't seem to make any attempts to fix it - it just exits fatally.
This made me believe that it did fail at one point, but this was snap not dd.

If he still has the cloned drives, install them all and see what happens. The way Snap-Tech explained it to me: if a drive does not respond within a set time, the Snap reports a problem, even if all the data is good. The problem occurs if SMART relocates some sectors due to an error. The time required to reach that spot then increases because of the detour, even though the drive is still good. He has the tools required to adjust the drives to work in unison.

I hope the Guardian OS is not this picky.....
blue68f100 is offline   Reply With Quote
Unread 10-29-2006, 08:56 PM   #23
snap-tech
Cooling Neophyte
 
Join Date: Mar 2002
Location: Washington State
Posts: 54
Default Re: Snap Server 4100 Meltdown

Unfortunately, after looking at the full debug log sent to me, there is nothing that can be done that is going to allow the snap to remount the raid.

The fsck error = 15, due to the bad magic number, is a fatal filesystem error that we at Snap support were never able to fix. At that time our only suggestion was to send the Snap to a data recovery company.

Sorry for the bad news mvastola12

Douglas
snap-tech is offline   Reply With Quote
Unread 10-29-2006, 09:29 PM   #24
blue68f100
Thermophile
 
blue68f100's Avatar
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: Snap Server 4100 Meltdown

Everything I hear and read about magic numbers is, "The magic is your data disappears". Which seems to be a weakness with the XFS file system.

I hope I can simulate this type of failure when I start working on my 4500.
blue68f100 is offline   Reply With Quote
Unread 10-29-2006, 09:42 PM   #25
Phoenix32
Thermophile
 
Phoenix32's Avatar
 
Join Date: May 2006
Location: Yakima, WA
Posts: 1,282
Default Re: Snap Server 4100 Meltdown

Quote:
Originally Posted by blue68f100

I hope I can simulate this type of failure when I start working on my 4500.
So what's taking you so long? Get busy dang it!
Phoenix32 is offline   Reply With Quote