Unread 03-29-2007, 08:07 AM   #1
handind
Cooling Neophyte
 
Join Date: Mar 2007
Location: indiana
Posts: 6
4100 fsck fatal error, please help

Hi, I'm hoping you can help me with a 4100 that is reporting fsck fatal error 39. It has 100 GB drives in it. I've read a lot of the info here on the forum, but haven't found what I need. The four disks are in a RAID5 array. When I run a disk check, I get the fsck fatal error and several partially allocated inode errors. In my opinion one of the drives is probably failing, but my problem is I don't know how to find out which one. All the log says is that the errors are on disk 60000. None of the LEDs on the unit are showing amber for a failed drive. This Snap unit has been in operation for a few years with no problems.
This is the second unit I've had with this exact problem. On the first unit that failed, I had previously replaced a bad drive (which did show an amber LED) and it ran fine for about a year, then the fsck fatal error and so on.
I'd sure like to get these units fixed. I have a fairly recent copy of the data backed up, so it is not too critical to worry about it, though I'd like to save it if possible.
So I guess my main question is how to determine which disk in the array is failing (hopefully it is just one). I have checked the server log and it doesn't show problems for any drive except 60000. I have run "in dev" from the debug command line, and it isn't helpful other than showing that the RAID5 array is unmounted. I am fairly familiar with the Snap 4100, but I don't know much of anything about dd, and I don't have SpinRite or a Unix system. Any help/advice is truly appreciated.
Unread 03-29-2007, 08:14 AM   #2
handind
Cooling Neophyte
 
Join Date: Mar 2007
Location: indiana
Posts: 6
Default Re: 4100 fsck fatal error, please help

UPDATE: after the most recent disk check, the error is fsck fatal error = 26
Unread 03-29-2007, 11:51 AM   #3
blue68f100
Thermophile
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: 4100 fsck fatal error, please help

OK, let's see if we can find what's causing the errors.

We're going to need more info.

Please post the results of these debug commands:
co de info
info log T
in Lo p -1

Save the results in HTML format. If they are not too large, you can post them here as HTML code; otherwise compress them and send them to my email.

I would also like the OS version, model number, and serial number of the unit, and whether the original hard drives have been replaced.

And please take a look at the sticky at the top of the thread. It documents some hardware problems with the early 4100s.
__________________
1 Snap 4500 - 1.0T (4 x 250gig WD2500SB RE), Raid5,
1 Snap 4500 - 1.6T (4 x 400gig Seagates), Raid5,
1 Snap 4200 - 4.0T (4 x 2gig Seagates), Raid5, Using SATA converts from Andy

Link to SnapOS FAQ's http://forums.procooling.com/vbb/showthread.php?t=13820
Unread 03-29-2007, 01:29 PM   #4
handind
Cooling Neophyte
 
Join Date: Mar 2007
Location: indiana
Posts: 6
Default Re: 4100 fsck fatal error, please help

Thanks for the response. This is a somewhat critical system here, so I have been on the phone with Adaptec support. They determined that all 4 drives are sporadically failing and are going to RMA my unit and send me a new one. Unfortunately, since all the drives are having issues, they did not think I would be able to save any of the data.

In case it helps in the future, the server I have is running Snap OS v 4.0.854
I have never replaced any of the drives in this unit. However, this unit is also a refurbished unit that was sent to us a few years ago to replace a different unit that failed.

The guys at the tech support said that many of the 4100s with certain Western Digital drives have a lot of problems.

Unfortunately, the second unit that I have, which is exhibiting the same failure issues, is out of warranty, so no new server for that one. The tech support individuals suggested that I buy 4 new drives for it and build a new array.

So the good news is I'll be getting a new refurbished system and the support call was free since it turned out to be a hardware issue. The bad news is I'll lose a few days worth of data, but it could have been a lot worse if I hadn't been backing things up weekly.

Now, all that said, I am going to try replacing one of the drives in my unit and rebuilding the array to see if I can recover any of the data since I've got nothing to lose.
Unread 03-29-2007, 02:27 PM   #5
handind
Cooling Neophyte
 
Join Date: Mar 2007
Location: indiana
Posts: 6
Default Re: 4100 fsck fatal error, please help

I checked out the post about modifications to the board of the 4100s. My 4100 does have the two modifications done.

Also, for reference, the drives in this unit are WD1000BB and have dates between August and October 2001
Unread 03-29-2007, 03:14 PM   #6
blue68f100
Thermophile
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: 4100 fsck fatal error, please help

Post the logs and maybe we can tell you which drive is actually causing the trouble. Six-year-old drives are overdue; at least they're WD and not the cheaper Maxtors. Since Adaptec bought SnapAppliance they do not have any really knowledgeable people; they let most of them go during the merger. We have a person on the forum who can assist you with file recovery if you want it. He goes by Snap-tech and runs a recovery service, "FrontLine Data recovery", and can be reached at 1-866-279-2985.

I have discovered that SpinRite can work on these drives, since it works at the controller level and does not interpret the data like most tools. I use it routinely on my units. The bad thing is that you have to remove the drives and install them in a desktop machine.
Unread 03-30-2007, 10:41 AM   #7
handind
Cooling Neophyte
 
Join Date: Mar 2007
Location: indiana
Posts: 6
Default Re: 4100 fsck fatal error, please help

Here are some of the errors that I've gotten:
FATAL ERROR PANIC : General Protection Fault (#13) at $002442D4 EAX=E99A8000 EBX=07F71B14 ECX=E99A8000 EDX=AA798DEC ESP=004123EC EBP=00412438 ESI=040F8DB4 EDI=ACB3ADAF

File System Check : FSCK fatal error = 32

FATAL ERROR PANIC : ASSERT((*dp)->dino_index == DSLOT2DINDEX(slot))

FATAL ERROR PANIC : ASSERT(fcbp->fcb_magic == FCB_MAGIC)

File System Check : FSCK fatal error = 26

File System Check : Bad state 0 for inode I=718848

File System Check : Partially allocated inode I=787993 (lots of these errors)

File System Check : Inode = 787976 - Bad direct addr[1]: 1050783

File System Check : FSCK fatal error = 27

FATAL ERROR PANIC : blkfree: freeing free block (this was the very first error.)

I'm glad to be getting a new server, but I think I'll always wonder about the reliability now. It sure seems like a bad thing that all 4 drives would go bad at the same time. What is especially bad is that there were no warning errors prior to the system going down about 48 hours ago. I looked at the system log as far back as it would go and there were no other errors. I also have the system set up to email me error messages so that I can keep on top of it. It sort of takes the point out of running a RAID5 system when all the drives fail.

I did try replacing the drive in slot 2 (with a new drive, also a WD1000BB) since one of the errors above references that slot. The system did rebuild but did not pass the file check after that, again returning fsck fatal errors.
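
For anyone sorting through a similar pile of messages, the errors listed above can be tallied with a short script. This is only a rough sketch: the sample lines are copied from this post, the `summarize` helper is a made-up name, and a real log export from the Snap would need to be saved to a file and read in instead. It only counts message types; it cannot tell which physical drive is at fault.

```python
import re
from collections import Counter

# Sample lines copied from the errors quoted in this thread. A real log
# export would be read from a saved file instead (assumption).
SAMPLE_LOG = """\
FATAL ERROR PANIC : General Protection Fault (#13) at $002442D4
File System Check : FSCK fatal error = 32
FATAL ERROR PANIC : ASSERT(fcbp->fcb_magic == FCB_MAGIC)
File System Check : FSCK fatal error = 26
File System Check : Partially allocated inode I=787993
File System Check : Inode = 787976 - Bad direct addr[1]: 1050783
File System Check : FSCK fatal error = 27
FATAL ERROR PANIC : blkfree: freeing free block
"""

FSCK_CODE = re.compile(r"FSCK fatal error = (\d+)")


def summarize(log_text):
    """Count panics, fsck fatal codes, and inode complaints in a log dump."""
    summary = {"panics": 0, "fsck_codes": Counter(), "inode_errors": 0}
    for line in log_text.splitlines():
        if line.startswith("FATAL ERROR PANIC"):
            summary["panics"] += 1
        match = FSCK_CODE.search(line)
        if match:
            summary["fsck_codes"][int(match.group(1))] += 1
        # Inode complaints come from the file system check, not from panics.
        if line.startswith("File System Check") and "inode" in line.lower():
            summary["inode_errors"] += 1
    return summary


result = summarize(SAMPLE_LOG)
print("panics:", result["panics"])
print("fsck fatal codes:", dict(result["fsck_codes"]))
print("inode errors:", result["inode_errors"])
```

Several different fatal codes plus panics, as here, suggests the damage is not confined to one spot, but that is an interpretation and not something the script proves.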
Unread 03-30-2007, 12:02 PM   #8
rpmurray
Cooling Savant
 
Join Date: Apr 2006
Location: Tennessee
Posts: 157
Default Re: 4100 fsck fatal error, please help

If you're not afraid of losing anything and want to give it a shot, you could try going into Disk Utilities, then Check or Repair Disk, changing the setting for how disk errors should be repaired, and then restarting.

On one of my units, I was able to get it to mount the RAID so I could back up most of it, by changing the setting to Repair all errors. This assumes that you don't already have it set to that, and that the error is repairable. In my case some of the files became corrupt after the repair, but most of the data was still there. Of course you'd want to back up as soon as possible because the whole thing could just up and die. But don't back up onto any media that already holds a good backup, so you won't lose anything if some (or all) of the files on the Snap are corrupted.
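
The "don't overwrite your good backup" advice can be sketched in a few lines. This is an illustration only: `copy_to_fresh_dir` and the `snap-salvage-` prefix are made-up names, and the demo uses throwaway temp directories standing in for the mounted Snap share and the backup disk. It relies on the fact that `shutil.copytree` refuses to copy into a directory that already exists.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def copy_to_fresh_dir(source, backup_root, label=None):
    """Copy `source` into a brand-new subdirectory of `backup_root`.

    shutil.copytree raises FileExistsError if the target already exists,
    so an earlier backup can never be overwritten by this call.
    """
    label = label or datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"snap-salvage-{label}"
    shutil.copytree(source, target)
    return target


# Demo with throwaway directories standing in for the share and backup disk.
with tempfile.TemporaryDirectory() as tmp:
    share = Path(tmp) / "share"
    share.mkdir()
    (share / "report.txt").write_text("possibly corrupt data")
    backups = Path(tmp) / "backups"
    backups.mkdir()

    first = copy_to_fresh_dir(share, backups, label="run1")
    copied_ok = (first / "report.txt").read_text() == "possibly corrupt data"

    # A second copy to the same label must be refused, not merged.
    try:
        copy_to_fresh_dir(share, backups, label="run1")
        overwrite_blocked = False
    except FileExistsError:
        overwrite_blocked = True

print("copied:", copied_ok, "overwrite blocked:", overwrite_blocked)
```

Keeping the salvage copy in its own directory means the possibly corrupt files can be compared against the last good backup later, file by file.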

edit: But looking over what you posted, the FATAL ERROR PANICs are something I didn't see on mine, so I'm going to assume that the data is hosed unless a recovery service can get it back for you.

Last edited by rpmurray; 03-30-2007 at 12:09 PM.
Unread 03-30-2007, 12:07 PM   #9
blue68f100
Thermophile
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: 4100 fsck fatal error, please help

Some of those errors I have not seen before, which makes me wonder whether the problem is hardware related. As you said, it is very odd for all the drives to fail at the same time. I have seen drives fail within months of each other. We did have a user through here a while back who had 2 drives fail less than 2 weeks apart, before he got the replacement.

You do not want to fill a drive beyond 90%; it needs the remaining 10% free for caching and temp files. The Guardian OS takes 10 GB of the drive's space for OS and temp/cache files.
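
As a rough worked example of the 90% guideline, assuming four 100 GB drives in RAID5 (one drive's worth of space goes to parity) and ignoring any space the OS reserves, which will make the real numbers somewhat smaller. The function names are made up for illustration.

```python
def raid5_usable_gb(drive_count, drive_gb):
    """Usable RAID5 capacity: one drive's worth of space is used for parity."""
    if drive_count < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    return (drive_count - 1) * drive_gb


def fill_ceiling_gb(usable_gb, max_fill_percent=90):
    """Most data to store if `max_fill_percent` of the array is the limit."""
    return usable_gb * max_fill_percent / 100


usable = raid5_usable_gb(4, 100)            # four 100 GB drives
ceiling = fill_ceiling_gb(usable)           # the 90% guideline
at_85_percent = fill_ceiling_gb(usable, 85)

print(f"usable: {usable} GB, 90% ceiling: {ceiling} GB, "
      f"85% full: {at_85_percent} GB")
```

On those assumptions the array holds about 300 GB, the ceiling is about 270 GB, and an array that is 85% full is holding about 255 GB.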

Phoenix32 has done the most testing on these units. We have discovered that you must set the drive jumpers correctly for the IDE to handle a drive failure without breaking the array.

The cooling fan on my 4100 sounds like it's trashing itself. Is your cooling fan running smoothly? I will be rewiring mine so it runs 24/7, with no temperature control.

I would also wager that Adaptec sends out its refurb units with Maxtor drives. Their only goal is to get the unit out of the warranty period. This is where SnapAppliance and Adaptec differ: Snap always used high-quality drives, whereas Adaptec is out to save a buck. They just want you to have a service contract. The 2200 (refurb) I bought for a friend came with Maxtors, and it did not make it through the initial RAID1 build without failing. I don't trust Maxtors in anything. If you open one of them up you will find plastic where other manufacturers use steel.

After that I decided to test all newly installed drives with SpinRite. Another thing we discovered (Phoenix32 and I) is that SpinRite found a lot of bad sectors and mapped them out, along with a high rate of seek errors. I had one drive that had used over 50% of its reserved sectors and had a high seek error count (a Seagate in my 4500). The other 3 drives were clean or had just minor errors. Since manufacturers no longer test the media, this can be expected. So it needs to be done on server drives, where seek times must be held to a minimum. You can't have the SMART technology doing this with consecutive bad sectors and seek times all over the place. Systems will complain. I now run SpinRite on any new install; I would prefer a drive to fail while I am testing it than in service. At least you will know which drives may present a problem. My new WD RE drives were clean as a whistle.

If you have a copy of SpinRite you may want to use it to check your drives. Since it runs at the controller level it will not damage or corrupt a RAID5 system. It may correct the problem you are having. SpinRite also has a reconditioning feature that refreshes the surface. I generally use my spare PCs for this. It takes close to 22 hours to run on my 400 GB drives.
Unread 03-30-2007, 02:44 PM   #10
handind
Cooling Neophyte
 
Join Date: Mar 2007
Location: indiana
Posts: 6
Default Re: 4100 fsck fatal error, please help

Yeah, I have run the file check in fix all error mode several times with no luck in the last 2 days. Blue68f100, I'd say the capacity of the RAID5 array on my 4100 was about 85% used. The fans inside the unit were a little dusty, but really not bad. The unit is mounted in a rack which has cooling fans going at all times.

I have also wondered whether the real problem is not just the drives, but possibly something on the circuit board. When the techs at Adaptec support said they would send me a whole new unit (instead of just new drives), I half wondered if they thought something else besides the drives might be bad too.

But, all in all, thanks to my strict backup policies, only about one day's worth of data/changes was lost. My users have already re-entered any necessary data changes and I've had everything back up and running since yesterday on a Dell NAS that the company has. So I am no longer at all concerned about recovering any data.

I may purchase a copy of SpinRite (I'm assuming it's not free) and see what I can do with it. If anyone has any good tips on using SpinRite, a link would be appreciated. Just for my own knowledge, I'd like to know if all the drives are really bad.
Unread 03-30-2007, 04:06 PM   #11
blue68f100
Thermophile
 
Join Date: Jul 2005
Location: Plano, TX
Posts: 3,135
Default Re: 4100 fsck fatal error, please help

SpinRite is made by Steve Gibson of GRC.com. It is written in assembly language and is a very small program that will fit on a floppy disk. When you buy and download the program and launch it from disk, you have the option to make a bootable floppy and/or CD image. You will notice that your name is embedded in the license agreement. Make sure you save your order number; it is used for upgrades. When you boot, you hit Enter a couple of times, and then you have options 2 (recovery) or 4 (maintenance). Then you select the drive to check. It works with IDE or SATA. If it detects a problem during maintenance, it automatically changes to recovery mode and then switches back. The only problem is that some drive setups seem to take an extremely long time to check. As I said earlier, it took 22 hours to scan my 400 GB hard drive.
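
To get a ballpark for smaller drives, the 22-hour figure can be scaled down. This is a back-of-the-envelope sketch that assumes scan time grows linearly with capacity, which is a big assumption: drive speed, interface, and the number of bad sectors found can change the result a lot. The function name is made up.

```python
def estimate_scan_hours(drive_gb, ref_gb=400, ref_hours=22):
    """Scale one measured scan time linearly with capacity (rough assumption)."""
    return ref_hours * drive_gb / ref_gb


per_drive = estimate_scan_hours(100)   # one 100 GB drive
all_four = 4 * per_drive               # four drives scanned one after another

print(f"about {per_drive} h per 100 GB drive, "
      f"about {all_four} h for all four in sequence")
```

On that assumption a 100 GB drive would take around 5.5 hours, so checking all four drives from a 4100 on one spare PC would be roughly a day's work.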
Unread 03-30-2007, 08:01 PM   #12
Phoenix32
Thermophile
 
Join Date: May 2006
Location: Yakima, WA
Posts: 1,282
Default Re: 4100 fsck fatal error, please help

Take the top off, clean out the heat sink and fan, leave the top off, and see how it works. The drives' data may all be corrupt now. If so, format, and see how it runs. This has indications of a hardware problem as well, but I would suspect overheating before anything else, and the 4100 is poorly cooled. I switched the HSF to run all the time on my 4100.