Snap Server 4100 Meltdown
Hey guys,
For several years my organization has had a Snap Server 4100 (purchased as a Dell PowerVault 705N). Long story short, a major problem recently developed whereby our RAID 5 array (4x 40 GB drives) refuses to mount its filesystem. On startup it tries to mount, but it runs an fsck and finds that a cylinder group has a bad magic number. Unfortunately, several runs of fsck in "Repair all errors" mode fail to yield any results (fsck exits fatally once it reaches that sector and finds the bad magic number error).

I was wondering if anyone had any ideas about how to get our data back online. The RAID array assembles, and our data is there if I do a sector-by-sector dump of /dev/rraid0; the filesystem just won't mount, making the data inaccessible. We played for a while, to no avail, trying to find some way to either manually correct the data for that sector on the disk through Snap's debug command line, or somehow get an image of the assembled array off the SnapServer (so we could put it on a hard drive, correct the sector, and then hook that drive up to the SnapServer).

Does anyone have any idea how we could try to rescue our data? Is there a tool available to get an image of the assembled array? Should we try to get hold of a v4 SnapOS to see if fsck is improved? Any ideas would be greatly appreciated.

Thanks so much,
Mike

PS: I'm attaching a recent boot log in case it helps. |
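For anyone following the thread, here is a rough sketch of the kind of offline image-and-patch workflow Mike is describing, assuming a raw copy of the assembled array could be attached to a Linux box. The device name, image file, and sector number below are hypothetical placeholders, and the Snap's own debug shell may not expose any of these tools.

Code:
# Hypothetical: /dev/sdb is a raw copy of the assembled array on a Linux box.
# Image it sector by sector, padding any unreadable sectors with zeros:
dd if=/dev/sdb of=array.img bs=512 conv=noerror,sync

# Inspect the suspect sector (sector number is a placeholder) in hex:
dd if=array.img bs=512 skip=123456 count=1 | hexdump -C

# After preparing a corrected 512-byte sector in fixed_sector.bin,
# write it back into the image without truncating the rest of the file:
dd if=fixed_sector.bin of=array.img bs=512 seek=123456 count=1 conv=notrunc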
Re: Snap Server 4100 Meltdown
Okay, call me dumb, but it sounds like you have a bad drive. You're on RAID 5, though, so why not just replace the bad drive and let it rebuild the array? Did I miss something?
|
Re: Snap Server 4100 Meltdown
Rule #1: Do not do anything to compromise the array.
DO NOT DO ANY UPDATES; this will only compound the problem. The log does not appear to be complete. Can you send the complete log? I'm with Phoenix: I think you may have a bad drive or two. Do any of the lights indicate a problem? Do you have any spare drives that you can work with? Do you have a copy of SpinRite? |
Re: Snap Server 4100 Meltdown
It doesn't seem to be a physical disk error. The sector in question can be accessed just fine. At no point do we get a read/write error or any other disk-level message. We're not getting any strange lights. It very much looks like the incorrect sector data is duplicated across all 4 drives, but the drives and the array assembly are functioning properly. I'm willing to consider anything, though.
I don't have SpinRite, but I can get it if it'll help. I do have a spare 160 GB hard drive. What I attached is the complete log as far as I can tell. What's missing from it? |
Re: Snap Server 4100 Meltdown
I don't think that it is a bad drive either. It seems more like a duplicated bad write, as you suggested. If you look at the array status in the web interface, does it show up as healthy, degraded, or something else?
I would be interested to see what would happen if you replaced drive 1 with a different drive (by drive 1 I mean physical drive 1; it would be drive 0 according to the Snap). I'm wondering whether, when it re-installs the OS to it and rebuilds the RAID, it corrects your error. That's what I'd do, but I will admit that I don't know for sure what is going on, and this is not a "fix" that I know for sure will work. It's just what I would try on my own. |
Re: Snap Server 4100 Meltdown
What do you get back when you use the command line to do an "info dev" and an "info log t"?
The "info log t" is where I was able to watch what it was trying to repair when I ran the repair on my failing 705N. It would be informative to see if there are any more specific error messages in there that would help in diagnosing the problem. |
Re: Snap Server 4100 Meltdown
I'm not sure I would play with any drive swapping until we get more info. I do not want to do anything that will compromise the data. I am assuming that you do not have a current backup.
Any time you get a superblock error it is critical. I was thinking of powering down and using dd to clone the drives. Use a Sharpie pen and mark the drive positions; they must be re-installed back into their original positions. The target must be the same size to do a raw copy; otherwise a standard clone to a file will have to be done. Doing this for each drive will tell us if a drive has a problem. If one drive is a different manufacturer than the others, that should not cause a problem as long as the capacity is the same. The reason I mentioned SpinRite is that it is not OS dependent. It uses the SMART system to verify each sector, byte by byte, reading multiple times to verify the bytes. If a sector is unknown, it starts shifting the timing so it can read slightly before and after, trying to determine what it should be. If the sector is bad, it will swap in a reserve sector to correct it. |
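For readers unfamiliar with the cloning step described above, here is a minimal sketch on a Linux box, assuming each Snap drive is attached one at a time and shows up as /dev/sdb (device and file names are placeholders only):

Code:
# Clone one drive to an image file. conv=noerror,sync keeps going past
# unreadable sectors (padding them with zeros), so a failing drive shows
# up as read errors in the output instead of aborting the copy.
dd if=/dev/sdb of=snap_drive1.img bs=64k conv=noerror,sync

# Raw drive-to-drive copy; the destination must be at least as large:
dd if=/dev/sdb of=/dev/sdc bs=64k conv=noerror,sync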
Re: Snap Server 4100 Meltdown
The status of the RAID array through the web interface is:

RAID5 - Large data protection disk Error. Fatal error during disk check.

I'm pretty sure the actual array is healthy, though. Here's info dev:

Code:
Logical Device: 10006  Position: 0  JBOD  Size (KB): 32296  Free (KB): 22064  Private  Mounted

blue68f100, the error isn't in the superblock. It's in a cylinder group. |
Re: Snap Server 4100 Meltdown
Well, except for the stuff about "Cylinder group 953: bad magic number" and "FSCK fatal error = 15" after Phase 5, your log looks much like my logs from when I was trying to repair my failing 705N. The only thing missing is "File System : Logical set synchronization done on device 60000", which should be at the very end.
I got nuthin' |
Re: Snap Server 4100 Meltdown
The log is not indicating which drive has the cylinder group problem, just the RAID 5 array, 60000.
I would remove one drive at a time and see if you can make a clone of each. The one with the problem should report an error. DO ALL DRIVES, and DO NOT START THE 4100 WITH A DRIVE REMOVED. We want to locate the bad drive(s) first. |
Re: Snap Server 4100 Meltdown
I was able to successfully clone all of the drives with no problems whatsoever. I really think it's just that one sector of data on the RAID array that is corrupt (there's no physical corruption).
Any other ideas? Thanks |
Re: Snap Server 4100 Meltdown
If you were able to clone all the drives and the problem did not show up, that's odd. I'm at a loss if nothing showed up. Did you use dd or some other cloning program?
Does this happen at the same place every time? |
Re: Snap Server 4100 Meltdown
I used dd. Yeah, the error is in the same place every time, which is making us think it's an error in the data on the assembled array rather than in the array itself. fsck sees the error but doesn't seem to make any attempt to fix it; it just exits fatally.
|
Re: Snap Server 4100 Meltdown
You can use the "/force" command or the "/fixfatal".
|
Re: Snap Server 4100 Meltdown
Been there, done that. No dice.
|
Re: Snap Server 4100 Meltdown
/altsb will replace the superblock with an alternate one. That's not the problem I'm having.
|
Re: Snap Server 4100 Meltdown
Quote:
10/16/2006 16:53:19 70 I L01 | File System Check : Executing fsck /dev/rraid0 /force /fix /fixfatal |
Re: Snap Server 4100 Meltdown
mvastola, thanks for the meaning of "/altsb".
After talking with Snap-tech last night, it is probably a timing problem. Clone that drive to a new one and install it into the set; the Snap should then rebuild the array. Snap-tech may be able to elaborate more. It was a pretty in-depth technical discussion on how the Snap OS works at a hardware level. |
Re: Snap Server 4100 Meltdown
This is just a question I have, so don't do this just because I'm asking.
What would happen if he tried to mount the RAID? I mean using config devices mount [dev], where dev would be 60000. Would it attempt to mount the RAID even with the problem, or would it fail because the disk check hasn't completed? |
Re: Snap Server 4100 Meltdown
We tried that at one point. It crashed the kernel and didn't mount the drive.
Mike |
Re: Snap Server 4100 Meltdown
rpmurray, I failed to recall that he was able to clone the drives with no errors.
If he still has the drives cloned, install them all and see what happens. The way Snap-Tech explained it to me is this: if a drive does not respond within a set time, the Snap reports a problem, even if all the data is good. The problem occurs if SMART relocates some sectors due to an error; then the time required to get to the spot has increased, due to the detour. The drive is still good. He has the tools required to adjust the drives to work in unison. I hope the Guardian OS is not this picky..... |
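As a side note, if the drives end up attached to a Linux box again for cloning, smartmontools can show whether any sectors have actually been reallocated, which is the kind of "detour" being described here. The device name is a placeholder:

Code:
# Print the SMART attribute table. Reallocated_Sector_Ct (attribute 5) and
# Current_Pending_Sector (attribute 197) reveal remapped or suspect sectors
# even when the drive still reads back all of its data correctly.
smartctl -A /dev/sdb

# Optionally kick off the drive's built-in extended self-test:
smartctl -t long /dev/sdb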
Re: Snap Server 4100 Meltdown
Unfortunately, after looking at the full debug log sent to me, there is nothing that can be done that is going to allow the Snap to remount the RAID.
The fsck error = 15, due to the bad magic number, is a fatal filesystem error that we at Snap support were never able to fix. At that time our only suggestion was to send the Snap to a data recovery company. Sorry for the bad news, mvastola12. Douglas |
Re: Snap Server 4100 Meltdown
Everything I hear and read about magic numbers is, "The magic is your data disappears", which seems to be a weakness of the XFS file system.
I hope I can simulate this type of failure when I start working on my 4500. |
Re: Snap Server 4100 Meltdown
It's on my list of things to do; trust me, I will get to it.
This time of year I get busy with my leather craft; you have one of my big sellers. Joe has 3 (of ?) for allowing us to use his servers. All are great little gifts. It's how I fund my toys. |
Re: Snap Server 4100 Meltdown
It took me a little while to find the info I was looking for, but here it is.
The issue that caused this unit to fail is the following. If you look at the info dev output, you will see that the last drive's starting block is not the same as the other 3 drives'. We discovered that this would ultimately cause the superblock error, but not until after a reboot once the RAID was rebuilt with a replacement drive. It had to do with what version the Snap was running when the drives were first formatted. The starting block was changed between versions 2.x and 3.x, and was increased by a factor of 1. To find the starting sector, the starting block is multiplied by 16; that's the actual sector where it is located on the disk.

This turned out to be a big issue at Snap, and from that point on all single drives that were shipped out to clients had to be pre-formatted at Snap under a certain OS version, depending on where the starting block was on the client's current Snap. Again, the RAID would build okay and the issue would not show up until after the Snap was rebooted the first time; then this error would appear, and if you tried to force a mount it would cause the Snap to panic. We never found a way to resolve it once it was in this condition, and therefore data recovery was the client's only option.

I hope this makes sense. I found this in my notes. I am still looking for the actual official document, given to the techs, which explains this in full detail. Douglas
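To make the block-to-sector arithmetic Douglas mentions concrete, here is a trivial illustration; the starting-block value is a made-up placeholder, not a number taken from this unit's logs:

Code:
# Multiply the starting block the Snap reports by 16 to get the actual
# on-disk sector (per the explanation above). Example value only.
START_BLOCK=63
echo $((START_BLOCK * 16))   # prints 1008, the sector for that block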
Re: Snap Server 4100 Meltdown
Good info...
|
Re: Snap Server 4100 Meltdown
Great info.
So the fix when you replace a drive is to do a "co de format xxxxx /reinit" on all drives? Or do all drives need to be totally clean so the Snap does it from scratch? |