Snap Server / NAS / Storage Technical Goodies: The Home for Snap Server Hacking, Storage and NAS info, and NAS / Snap Classifieds
#1
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
Hey guys,

For several years my organization has had a Snap Server 4100 (purchased as a Dell PowerVault 705N). Long story short: recently a major problem developed whereby our RAID 5 array (4x 40GB drives) refuses to mount its filesystem. On startup it tries to mount, but first runs an fsck and finds that a cylinder group has a bad magic number. Unfortunately, several runs of fsck in "Repair all errors" mode fail to yield any results (fsck exits fatally once it reaches that sector and finds the bad magic number error).

I was wondering if anyone had any ideas about how to go about getting our data back online. The RAID array assembles, and our data is there if I do a sector-by-sector dump of /dev/rraid0; the filesystem just won't mount, making the data inaccessible. We played for a while, to no avail, trying to find some way to either manually correct the data for that sector on the disk through Snap's debug command line, or somehow get an image of the assembled array off of the SnapServer (so we could put it on a hard drive, correct the sector, and then hook that drive up to the SnapServer).

Does anyone have any idea how we could try to rescue our data? Is there a tool available to get an image of the assembled array? Should we try to get hold of a v4 SnapOS to see if fsck is improved? Any ideas would be greatly appreciated.

Thanks so much,
Mike

PS: I'm attaching a recent boot log in case it helps.
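For reference, the "correct the sector" step we had in mind would be something like this on a Linux box, once we had an image of the array (just a sketch; the sector number and file names below are placeholders, not values from our log):

Code:
# Sketch only: patch a single 512-byte sector of an array image in place.
# SECTOR is a placeholder, not the sector our fsck actually complains about.
SECTOR=123456
dd if=repaired_sector.bin of=array.img bs=512 seek=$SECTOR count=1 conv=notrunc
# conv=notrunc keeps dd from truncating the rest of the image after the write.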
#2
Thermophile
Join Date: May 2006
Location: Yakima, WA
Posts: 1,282
Okay, call me dumb, but it sounds like you have a bad drive, and you're running RAID 5, so why not just replace the bad drive and let the array rebuild? Did I miss something?
#3
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Rule #1: do not do anything to compromise the array.

DO NOT DO ANY UPDATES. That will only compound the problem.

The log does not appear to be complete. Can you send the complete log? I'm with Phoenix on this one; I think you may have a bad drive or two. Do any of the lights indicate a problem? Do you have any spare drives that you can work with? Do you have a copy of SpinRite?
__________________
1 Snap 4500 - 1.0T (4 x 250gig WD2500SB RE), RAID5
1 Snap 4500 - 1.6T (4 x 400gig Seagates), RAID5
1 Snap 4200 - 4.0T (4 x 2gig Seagates), RAID5
Using SATA converters from Andy
Link to SnapOS FAQs: http://forums.procooling.com/vbb/showthread.php?t=13820
#4
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
It doesn't seem to be a physical disk error. The sector in question can be accessed just fine. At no point do we get a read/write error, or any other disk-level message. We're not getting any strange lights. It very much looks like the incorrect sector data is duplicated across all 4 drives, but the drives and the array assembly are functioning properly. I'm willing to consider anything, though.

I don't have SpinRite, but I can get it if it'll help. I do have a spare 160GB hard drive. What I attached is the complete log as far as I can tell. What's missing from it?
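By "accessed just fine" I mean a raw read of the sector returns data with no I/O error. On a cloned drive under Linux that check is a one-liner (the device name and sector number here are placeholders, not our real values):

Code:
# Sketch: read one 512-byte sector and discard it.
# dd reports an I/O error and stops if the read fails.
dd if=/dev/hda of=/dev/null bs=512 skip=123456 count=1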
#5
Cooling Savant
Join Date: Feb 2006
Location: South Bend, IN
Posts: 385
I don't think that it is a bad drive either. It seems more like a duplicated bad write, as you suggested. If you look at the array status in the web interface, does it show up as healthy, degraded, or something else?

I would be interested to see what would happen if you replaced drive 1 with a different drive (by drive 1 I mean physical drive 1; it would be drive 0 according to the Snap). I'm wondering whether, when the Snap re-installs the OS to the new drive and rebuilds the RAID, it corrects your error. That's what I'd do, but I'll admit I don't know for sure what is going on, and this is not a "fix" that I know will work. It's just what I would try on my own.
__________________
Snap Server 4100, 4x120GB Seagate Drives, RAID 5, version 3.4.803
#6
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
What do you get back when you use the command line to run info dev and info log t?

The info log t output is where I was able to watch what it was trying to repair when I ran the repair on my failing 705N. It would be informative to see if there are any more specific error messages in there that would help in diagnosing the problem.
#7
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
I'm not sure I would play with any drive swapping till we get more info. I do not want to do anything that will compromise the data. I am assuming that you do not have a current backup.

Any time you get a superblock error it is critical. I was thinking of powering down and using dd to clone the drives. Use a Sharpie pen to mark the drive positions; each drive must be re-installed into its original position. The target must be the same size to do a raw copy; otherwise a standard clone to a file will have to be done. Doing this for each drive will tell us if a drive has a problem. One drive is a different mfg than the others, but this should not cause a problem if the capacity is the same.

The reason I mentioned SpinRite is that it is not OS dependent. It uses the SMART system to verify each sector, byte by byte, reading multiple times to verify the bytes. If a sector won't read it starts shifting the timing so it can read slightly before and after, trying to determine what the data should be. If the sector is bad it will swap in a reserve sector to correct it.
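The dd run for each drive would look something like this on a Linux box (the device names are only examples; triple-check which disk is which before copying):

Code:
# Raw disk-to-disk copy; the target must be at least as large as the source.
dd if=/dev/hdc of=/dev/hdd bs=64k conv=noerror,sync
# Or a standard clone to an image file if the sizes do not match.
dd if=/dev/hdc of=/mnt/spare/drive0.img bs=64k conv=noerror,sync
# conv=noerror,sync keeps going past read errors and pads the bad blocks,
# so a failing drive shows up in dd's error output without aborting the copy.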
#8
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
The status of the RAID array through the web interface is:

RAID5 - Large data protection disk
Error. Fatal error during disk check.

I'm pretty sure the actual array is healthy though. Here's info dev:

Code:
Logical Device: 10006  Position: 0  JBOD  Size (KB): 32296  Free (KB): 22064  Private  Mounted
  Label: Private  Contains system files only
  Unique Id: 0x630A77496C09210C  Mount: /priv  Index: 12  Order: 0
  Partition: 10006  Physical: 10007  FS Size (KB): 32768  Starting Blk: 515  Private
    Physical: 10007  Drive Slot: 0  IDE  Size (KB): 39088640  Fixed

Logical Device: 1000E  Position: 0  JBOD  Size (KB): 32296  Free (KB): 19752  Private  Mounted
  Label: Private  Contains system files only
  Unique Id: 0x4B58D23D6E14EE1E  Mount: /pri2  Index: 13  Order: 1
  Partition: 1000E  Physical: 1000F  FS Size (KB): 32768  Starting Blk: 515  Private
    Physical: 1000F  Drive Slot: 1  IDE  Size (KB): 39088640  Fixed

Logical Device: 60000  Position: 1  RAID  Size (KB): 116175192  Free (KB): 0  Public  Unmounted
  Label: RAID5  Large data protection disk
  Unique Id: 0x0FEBE8223B4463B1  Mount: /0  Index: 0  Order: 255
  Partition: 10000  Physical: 10007  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public
    Physical: 10007  Drive Slot: 0  IDE  Size (KB): 39088640  Fixed
  Partition: 10008  Physical: 1000F  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public
    Physical: 1000F  Drive Slot: 1  IDE  Size (KB): 39088640  Fixed
  Partition: 10010  Physical: 10017  R 60000  Size (KB): 38725064  Starting Blk: 45319  Public
    Physical: 10017  Drive Slot: 2  IDE  Size (KB): 39088640  Fixed
  Partition: 10018  Physical: 1001F  R 60000  Size (KB): 38725064  Starting Blk: 37830  Public
    Physical: 1001F  Drive Slot: 3  IDE  Size (KB): 40146432  Fixed

blue68f100, the error isn't in the superblock. It's in a cylinder group.
#9
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Well, except for the stuff about "Cylinder group 953: bad magic number" and "FSCK fatal error = 15" after Phase 5, your log looks much like my logs from when I was trying to repair my failing 705N. The only thing missing is the "File System : Logical set synchronization done on device 60000" line that should be at the very end.

I got nuthin'.
#10
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
The log is not indicating which drive has the cylinder group problem, just the RAID5 array, 60000.

I would remove 1 drive at a time and see if you can make a clone of it. The one with the problem should report an error. DO ALL DRIVES. DO NOT START THE 4100 WITH A DRIVE REMOVED. We want to locate the bad drive(s) first.
#11
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
I was able to successfully clone all of the drives with no problems whatsoever. I really think it's just that one sector of data on the RAID array that is corrupt (there's no physical corruption).

Any other ideas? Thanks
#12
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
If you were able to clone all the drives and the problem did not show up, that's odd. I'm at a loss if nothing showed up. Did you use dd or some other cloning program?

Does this happen at the same place every time?
#13
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
I used dd. Yes, the error is in the same place every time, which is making us think that it's an error in the data on the assembled array rather than on any one disk. fsck sees the error, but doesn't seem to make any attempt to fix it; it just exits fatally.
#14
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
You can use the "/force" switch or the "/fixfatal" one.
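For reference, the full command line with all the repair switches looks like this (this exact line shows up in the debug log quoted later in this thread):

Code:
fsck /dev/rraid0 /force /fix /fixfatal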
Quote:
#15
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
Been there, done that. No dice.
#16
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
/altsb will replace the superblock with an alternate one. That's not the problem I'm having.
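For anyone searching later: if /altsb takes the same form as the other fsck switches, it would presumably be run like this. That is an assumption on my part, not something shown in any log here:

Code:
fsck /dev/rraid0 /altsb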
#17
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Quote:
10/16/2006 16:53:19 70 I L01 | File System Check : Executing fsck /dev/rraid0 /force /fix /fixfatal
#18
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
mvastola, thanks for the meaning of "/altsb".

After talking with Snap-Tech last night, it is probably a timing problem. Clone that drive to a new one and install it into the set; the Snap should then rebuild the array. Snap-Tech may be able to elaborate more. It was a pretty in-depth technical discussion of how the Snap OS works at a hardware level.
Last edited by blue68f100; 10-29-2006 at 12:43 PM.
#19
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Quote:
#20
Cooling Savant
Join Date: Apr 2006
Location: Tennessee
Posts: 157
This is just a question I have, so don't do this just because I'm asking.

What would happen if he tried to mount the RAID? I mean using config devices mount [dev], where dev would be 60000. Would it attempt to mount the RAID even with the problem, or would it fail because the disk check hasn't completed?
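Spelled out with the device id from the info dev output above, that would be:

Code:
config devices mount 60000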
#21
Cooling Neophyte
Join Date: Oct 2006
Location: US
Posts: 8
We tried that at one point. It crashed the kernel and didn't mount the drive.
Mike
#22
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
rpmurray, I failed to recall that he was able to clone the drives with no errors.
Quote:
If he still has the cloned drives, install them all and see what happens.

The way Snap-Tech explained it to me: if a drive does not respond within a set time, the Snap reports a problem, even if all the data is good. The problem occurs when SMART relocates some sectors due to an error; the time required to get to the spot increases because of the detour, but the drive is still good. He has the tools required to adjust the drives to work in unison. I hope the Guardian OS is not this picky.....
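A quick way to check whether a drive has relocated sectors is to pull its SMART attributes on a Linux box with smartmontools (assuming you have smartctl available; the device name is an example):

Code:
# Attribute 5 (Reallocated_Sector_Ct) counts sectors SMART has remapped;
# a non-zero raw value means the drive is taking the "detour" described above.
smartctl -A /dev/hdc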
#23
Cooling Neophyte
Join Date: Mar 2002
Location: Washington State
Posts: 54
Unfortunately, after looking at the full debug log sent to me, there is nothing that can be done that is going to allow the Snap to remount the RAID.

The fsck error = 15, due to the bad magic number, is a fatal filesystem error that we at Snap support were never able to fix. At the time, our only suggestion was to send the Snap to a data recovery company.

Sorry for the bad news, mvastola12.

Douglas
#24
Thermophile
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Everything I hear and read about magic numbers is, "The magic is your data disappears." That seems to be a weakness of the XFS file system.

I hope I can simulate this type of failure when I start working on my 4500.
#25
Thermophile
Join Date: May 2006
Location: Yakima, WA
Posts: 1,282
Quote: