4100 constant rebuilding issue [Archive]

sleepysnap

06-22-2006, 07:39 AM

I software-rebooted my 4100 after changing a WINS entry. I had not rebooted the unit for well over 2 years. It is on UPS and also generator power. It is a mission-critical piece of a medical facility.
Upon reboot, it of course went through the check, and then started rebuilding the RAID. It never finishes this rebuild and hangs (all drive lights lit) at different points in the process, which each time takes 12 to 24 hours. The furthest I have gotten is 71%. I can view the files during the rebuild, but copying large amounts of files causes it to hang. I have gotten all the log files and some command output I had heard of from Snap back when I had another device fail. I actually have 2 of these boxes. I do NOT see any statement in these logs about any drive being faulty. Which is just a whole 'nother issue I guess.

My question is, is there any way, through button presses or commands, to NOT go through a rebuild upon reboot? I am willing to get just one shot to copy all the files somewhere else and then reset the box or let it melt down, but I NEED a certain subset of those files, and a large amount of them.

Thanks for any replies.

blue68f100

06-22-2006, 12:02 PM

Interesting question. I noticed none of the 4x00 users have jumped in.

Looking through the debug cmd list, I suspect a fsck is being run as well as a rebuild. There are a couple of cmds I came across. I hope one of these helps you.

fsckmode dev (read current mode) mode (set mode: 1-10)

fsck dev [[/fix /fixfatal /altsb]] (check device's filesystem)

dev = 60000 for raid 5

sleepysnap

06-22-2006, 01:53 PM

Thank you for your kind reply.
What I did do yesterday was to turn off the repair-all options, and the device went through the check, but ended with an error about forcing the resetting of the options.
In the drive logs it states as shown. Drive clean as stated in attached graphic.
Which setting of those commands actually tells the system NOT to perform a rebuild?
Sorry to be such a newbie, but the commands are a little cryptic, and I have not seen any documentation for these sys commands. Is there any?

Also, here is fsckmode dev run on machine right now:
"Current fsck mode for device DE: -1 (RAM-based fsck, 90% of RAM used for i-node cache)"

There also is the following "IN PAN" command that I once was told by Snaptec to run on the other machine. The word PANIC grabs my attention.

06/22/2006 14:52:32 Command: IN PAN

Current boot number is 51
FATAL ERROR During prior boot: 41
PANIC : ASSERT(fcbp->fcb_magic == FCB_MAGIC)

Call Stack:
$00247F81
$00242B34
$00240C06
$00240AD6
$0023918B
$0023652F
$001041C3

------------
Command executed without error.

sleepysnap

06-22-2006, 04:20 PM

Looking through the logs, I found an issue that started over a year ago when my domain was taken over by the central MIS dept of the facility. I never did replace the domain entry; since I was running a non-AD OS version (3.4.803), I did not want to bump into the issue of no domain being found. So I just created local users for all the users and turned the Snap into a locally derived user database. It has functioned well since then. What I noticed was entries stating
"Failed to get group names from domain controller"
So here I stand at 28% right now, and I figured that this MUST still be an issue. So I changed all the networking settings to reflect just a workgroup, and turned off NFS and FTP access, since I only had NFS on because it used to be a Novell house until the AD rollout. And now I figure that even if this latest boot fails again, when the reboot happens it will no longer be looking for a domain and hopefully it will rebuild faster and correctly. Or all will melt down and I will have failed.
Either way, whatever.

Any thoughts or comments?

blue68f100

06-22-2006, 04:28 PM

Panics are not good .......

Give me some more info on your 4100: model (PN) no., OS version, installed RAM, HD mfg and size, and whether you are running JVM.

Generally when these stack errors start showing up it means problems, usually hardware. WD EIDE drives have caused strange errors.

sleepysnap

06-22-2006, 04:41 PM

Model 4000 series (But is exactly like 4100 in pics and I remember buying 4100)
SW 3.4.803
HW 2.2.1
Server # 504418
Bios 2.4.437
RAID 5
HD Maxtor see log below
Disk status (size and free space) attached.
JAVA Yes see attached graphic also.

I apologize in advance for this length of cut and paste. I thought you might be able to glean something from the beginning of today's boot.

*******************************
Build Date: Jan 15 2003 18:04:19
Boot Count: 51
06/22/2006 10:41:37 51 D SYS | Executable built by KEVIN
06/22/2006 10:41:37 51 D SYS | Hardware platform:2.2.1 Model:2 (128 MBytes) S/N:504418
06/22/2006 10:41:37 51 D SYS | ETH: Reset- Eaddr set to 00 C0 B6 07 B2 62
06/22/2006 10:41:37 51 D SYS | Update IP...
06/22/2006 10:41:37 51 I NET | INIT: Setting IP address to 10.2.222.3
06/22/2006 10:41:37 51 D SYS | Update IP...
06/22/2006 10:41:38 51 D SYS | Initial file system BIO cache size is 13419264 bytes, 1616 buffers
06/22/2006 10:41:38 51 D SYS | DISK: Initial ARBs: 1616 Memory: 265024
06/22/2006 10:41:38 51 D SYS | Code Page set to 437
06/22/2006 10:41:38 51 D SYS | QDL System is DISABLED
06/22/2006 10:41:38 51 D SYS | Java: initial stack size 64Kb
06/22/2006 10:41:38 51 D SYS | Disk cache flush enabled
06/22/2006 10:41:38 51 D SYS | SNMP agent disabled.
06/22/2006 10:41:38 51 D SYS | AFP: not started
06/22/2006 10:41:38 51 D SYS | 'FTPD' started.
06/22/2006 10:41:38 51 D SYS | Update WorkGroup...
06/22/2006 10:41:43 51 D SYS | Intf: 0, dev: 0: Model: Maxtor 5T060H6
06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6H7EXDC
06/22/2006 10:41:43 51 D SYS | Intf: 1, dev: 0: Model: WDC WD800BB-50DKA0
06/22/2006 10:41:43 51 D SYS | Firmware Rev: 77.07W77 Serial #: WD-WMAHL2012860
06/22/2006 10:41:43 51 D SYS | Intf: 2, dev: 0: Model: Maxtor 5T060H6
06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6H5228C
06/22/2006 10:41:43 51 D SYS | Intf: 3, dev: 0: Model: Maxtor 5T060H6
06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6HMB9HC
06/22/2006 10:41:43 51 D SYS | RAID5Initialize: creating 32 worker threads
06/22/2006 10:41:43 51 D SYS | RAID5InitializeCache: allocated cache of 32732KB
06/22/2006 10:41:44 51 I L00 | File System Check : Executing fsck /dev/ride0g /fix /fixfatal
06/22/2006 10:41:44 51 W L00 | File System Check : partition is NOT clean.
06/22/2006 10:41:44 51 D SYS | Fsck - Using primary superblock
06/22/2006 10:41:44 51 D SYS | 62221632 bytes pre-allocated
06/22/2006 10:41:44 51 D SYS | Memory allocation for i-node cache: 90% of free RAM
06/22/2006 10:41:44 51 D SYS | -- RAM-based Fsck --
06/22/2006 10:41:44 51 D SYS | directories: 16, hash table: 19 entries
06/22/2006 10:41:44 51 D SYS | LRU caching percentage: 20%
06/22/2006 10:41:44 51 I L00 | File System Check : ** Phase 1 - Check blocks and sizes
06/22/2006 10:41:44 51 I L00 | File System Check : ** Phase 1b - Rescan for more duplicate blocks
06/22/2006 10:41:44 51 I L00 | File System Check : ** Phase 2 - Check pathnames
06/22/2006 10:41:44 51 I L00 | File System Check : ** Phase 3 - Check connectivity
06/22/2006 10:41:44 51 I L00 | File System Check : ** Phase 4 - Check reference counts
06/22/2006 10:41:44 51 W L00 | File System Check : Zero Length Dir I=2304 Owner= Mode=41200
/dev/ride0g: Size=0 (Cleared)
06/22/2006 10:41:44 51 I L00 | File System Check : ** Phase 4b - Check backlinks
06/22/2006 10:41:44 51 I L00 | File System Check : ** Phase 5 - Check cylinder groups
06/22/2006 10:41:44 51 W L00 | File System Check : Free blk count(s) wrong in superblk (Salvaged)
06/22/2006 10:41:44 51 W L00 | File System Check : Blk(s) missing in bit maps (Salvaged)
06/22/2006 10:41:44 51 W L00 | File System Check : Summary information bad (Salvaged)
06/22/2006 10:41:44 51 W L00 | File System Check : Modified flag set in superblock (Fixed)
06/22/2006 10:41:44 51 W L00 | File System Check : Clean flag not set in superblock (Fixed)
06/22/2006 10:41:44 51 D SYS | 62222796 bytes used during fsck()
06/22/2006 10:41:44 51 I L00 | File System Check : 709 files, 1045 used, 2992 free (0 frags, 2992 blocks, 0.0%% fragmentation)
06/22/2006 10:41:44 51 D SYS | Elapsed time: 0 s.
06/22/2006 10:41:44 51 D SYS | Fsck cache statistics:
06/22/2006 10:41:44 51 D SYS | cacheable directories: 15
06/22/2006 10:41:44 51 D SYS | total memory used for cache: 536 bytes
06/22/2006 10:41:44 51 D SYS | LRU-cacheable i-nodes: 3
06/22/2006 10:41:44 51 D SYS | non pre-allocated i-nodes: 0
06/22/2006 10:41:44 51 D SYS | LRU-cached i-nodes: 1
06/22/2006 10:41:44 51 D SYS | average cache search: 1.0 iterations
06/22/2006 10:41:44 51 D SYS | longest cache search: 3 iterations
06/22/2006 10:41:44 51 D SYS | successful searches: 264 (94%)
06/22/2006 10:41:44 51 D SYS | total insertions: 15
06/22/2006 10:41:44 51 D SYS | total replacements: 0
06/22/2006 10:41:44 51 D SYS | 1 items in 0-link list (28 bytes)
06/22/2006 10:41:44 51 I L00 | File System Check : Cleanup completed...
06/22/2006 10:41:44 51 D SYS | Update FDB 0x10006...
06/22/2006 10:41:44 51 I L00 | File System : Opened FDB for device 0x10006
06/22/2006 10:41:44 51 D SYS | Scheduled ACL Set and Propagate at /priv/os_private for FDB_ID_12
06/22/2006 10:41:44 51 I L00 | File System Check : Executing fsck /dev/ride1g /fix /fixfatal
06/22/2006 10:41:44 51 D SYS | Propagate on /priv/os_private: Success - 12 files, 0 dirs; Errors - 0 files, 0 dirs
06/22/2006 10:41:44 51 I L00 | File System Check : partition is clean.
06/22/2006 10:41:44 51 D SYS | Failed to copy (2), skipping tag.dat
06/22/2006 10:41:53 51 D SYS | Compared times file1Secs (4499F384) file2Secs (4499AA55)
06/22/2006 10:41:53 51 D SYS | Copy private FS /priv/tag.dat to /pri2/tag.dat = pass
06/22/2006 10:41:53 51 D SYS | Cloned private FS from 10006 to 1000E
06/22/2006 10:41:53 51 D SYS | Update FDB 0x1000E...
06/22/2006 10:41:53 51 I L00 | File System : Opened FDB for device 0x1000E
06/22/2006 10:41:53 51 D SYS | Scheduled ACL Set and Propagate at /pri2/os_private for FDB_ID_13
06/22/2006 10:41:53 51 I L01 | File System Check : Executing fsck /dev/rraid0 /fix
06/22/2006 10:41:53 51 D SYS | Propagate on /pri2/os_private: Success - 12 files, 0 dirs; Errors - 0 files, 0 dirs
06/22/2006 10:41:53 51 I L01 | File System Check : partition is clean.
06/22/2006 10:41:54 51 D SYS | Update FDB 0x60000...
06/22/2006 10:41:54 51 I L01 | File System : Opened FDB for device 0x60000
06/22/2006 10:41:54 51 D SYS | Scheduled ACL Set and Propagate at /0/os_private for FDB_ID_0
06/22/2006 10:41:54 51 D SYS | Java reserved 24Mb memory
06/22/2006 10:41:54 51 D SYS | Java: argv[0] = "/Java$/Jeode/bin/evmApJ2.exe"
06/22/2006 10:41:54 51 D SYS | Java: argv[1] = "-Djava.home=/Java$/Jeode"
06/22/2006 10:41:54 51 D SYS | Java: argv[2] = "-Djava.security.policy==/priv/Jeode/broker.policy"
06/22/2006 10:41:54 51 D SYS | Java: argv[3] = "com.snapserver.Jrc"
06/22/2006 10:41:54 51 D SYS | Propagate on /0/os_private: Success - 13 files, 2 dirs; Errors - 0 files, 0 dirs
06/22/2006 10:41:54 51 D SYS | NFS: The hash table has been initialized.
06/22/2006 10:41:54 51 D SYS | NFS: the NFSID <--->FDBID cache has been initialised.
06/22/2006 10:41:54 51 D SYS | NFS Server started.
06/22/2006 10:41:54 51 D SYS | suspend_factor = 71D0A
06/22/2006 10:41:54 51 D SYS | DISK: Additional ARBs: 5315 (Mem: 871660) Total Arbs: 6931 (Mem: 1136684)
06/22/2006 10:41:54 51 I SYS | System Initialization : Initialization Complete! Memory to be released: 44135760 bytes.
06/22/2006 10:41:54 51 D SYS | Restarted process timing
06/22/2006 10:41:54 51 D SYS | RAID5Resync on array 0: started
06/22/2006 10:41:54 51 D SYS | JVM: current IP address = 10.2.222.3
06/22/2006 10:41:54 51 D SYS | JeodeVersion: JeodeEVM SnapOS developer build
06/22/2006 10:41:54 51 D SYS | Java VM starting
06/22/2006 10:42:04 51 I JVM | Java: Native methods linked OK
06/22/2006 10:42:05 51 I JVM | installed security manager
06/22/2006 10:42:06 51 I JVM | building info for hello.jar
06/22/2006 10:42:07 51 I JVM | done
06/22/2006 10:42:07 51 I JVM | building info for ssl.jar
06/22/2006 10:42:08 51 I JVM | done
06/22/2006 10:42:08 51 I JVM | No startup snaplets
06/22/2006 10:42:08 51 I JVM | SnapExtension Framework 1.2 initialized
06/22/2006 10:42:08 51 D JVM | SnapOS.init() - server threads: 2
06/22/2006 10:42:08 51 I JVM | SnapOS.init() executed
06/22/2006 10:42:08 51 I JVM | Jrc created
06/22/2006 10:42:08 51 I JVM | Jrc.execute() starting
06/22/2006 10:42:08 51 I JVM | Jrc: exiting
06/22/2006 10:42:45 51 D SYS | DISK: req=0x54450A8 dev=0xC0000 fn=1 blk=0x100160 sts=18
06/22/2006 10:43:18 51 D SYS | DISK: req=0x543D838 dev=0xC0000 fn=1 blk=0x10AC60 sts=7
06/22/2006 10:43:19 51 D SYS | DISK: req=0x544F694 dev=0xC0000 fn=1 blk=0x10B660 sts=18
06/22/2006 10:44:38 51 D SYS | DISK: req=0x5440C1C dev=0xC0000 fn=1 blk=0x10FE60 sts=7
06/22/2006 10:45:18 51 D SYS | DISK: req=0x54421E4 dev=0xC0000 fn=1 blk=0x1194E0 sts=7
06/22/2006 10:45:58 51 D SYS | DISK: req=0x544C688 dev=0xC0000 fn=1 blk=0x11BAE0 sts=7
06/22/2006 10:46:38 51 D SYS | DISK: req=0x544DE3C dev=0xC0000 fn=1 blk=0x11CD60 sts=7
06/22/2006 10:47:18 51 D SYS | DISK: req=0x544990C dev=0xC0000 fn=1 blk=0x11FCE0 sts=7
06/22/2006 10:48:38 51 D SYS | DISK: req=0x5440510 dev=0xC0000 fn=1 blk=0x123760 sts=7
06/22/2006 10:49:19 51 D SYS | DISK: req=0x543C270 dev=0xC0000 fn=1 blk=0x12C360 sts=7
06/22/2006 10:49:59 51 D SYS | DISK: req=0x543DA24 dev=0xC0000 fn=1 blk=0x1341E0 sts=7
06/22/2006 10:50:39 51 D SYS | DISK: req=0x5448ECC dev=0xC0000 fn=1 blk=0x1404E0 sts=7
06/22/2006 10:51:19 51 D SYS | DISK: req=0x5444B88 dev=0xC0000 fn=1 blk=0x14E7E0 sts=7
06/22/2006 10:51:19 51 D NET | INET: sendit: send err 39
06/22/2006 10:51:59 51 D SYS | DISK: req=0x543E130 dev=0xC0000 fn=1 blk=0x158EE0 sts=7
06/22/2006 10:51:59 51 D NET | INET: sendit: send err 39
06/22/2006 10:52:39 51 D SYS | DISK: req=0x543C500 dev=0xC0000 fn=1 blk=0x17DBE0 sts=7
06/22/2006 10:53:19 51 D SYS | DISK: req=0x543DF44 dev=0xC0000 fn=1 blk=0x1885E0 sts=7
06/22/2006 10:54:04 51 D SYS | DISK: req=0x543D318 dev=0xC0000 fn=1 blk=0x191960 sts=18
06/22/2006 10:54:45 51 D SYS | DISK: req=0x543D64C dev=0xC0000 fn=1 blk=0x19F660 sts=18
06/22/2006 10:55:59 51 D SYS | DISK: req=0x5444C2C dev=0xC0000 fn=1 blk=0x1A32E0 sts=7
06/22/2006 10:56:02 51 D SYS | DISK: req=0x544A7C8 dev=0xC0000 fn=1 blk=0x1AA4E0 sts=18
06/22/2006 10:56:03 51 D SYS | DISK: req=0x544499C dev=0xC0000 fn=1 blk=0x1AB8E0 sts=18
06/22/2006 10:56:04 51 D SYS | DISK: req=0x54415B8 dev=0xC0000 fn=1 blk=0x1AEAE0 sts=18
06/22/2006 10:56:43 51 D SYS | DISK: req=0x54451F0 dev=0xC0000 fn=1 blk=0x1B1860 sts=18
06/22/2006 10:58:00 51 D SYS | DISK: req=0x54447B0 dev=0xC0000 fn=1 blk=0x1C8B60 sts=7
06/22/2006 10:58:02 51 D SYS | DISK: req=0x5441700 dev=0xC0000 fn=1 blk=0x1CF8E0 sts=18
06/22/2006 10:59:00 51 D SYS | DISK: req=0x543E08C dev=0xC0000 fn=1 blk=0x2016E0 sts=7
06/22/2006 10:59:01 51 D SMB | SMB : get_lanman2_dir_entry, with QDL, took 33 sec for file /0/EDF/EDF_file_format.pdf (m/c=SLEEPBED4-4792(IP))
06/22/2006 10:59:03 51 D SYS | RAID5Resync on array 0: 1% done
06/22/2006 10:59:40 51 D SYS | DISK: req=0x54427A8 dev=0xC0000 fn=1 blk=0x2251E0 sts=7
06/22/2006 10:59:40 51 D NET | INET: sendit: send err 39
06/22/2006 11:00:27 51 D SYS | DISK: req=0x543C128 dev=0xC0000 fn=1 blk=0x233560 sts=18
06/22/2006 11:01:00 51 D SYS | DISK: req=0x54438F4 dev=0xC0000 fn=1 blk=0x242FE0 sts=7
06/22/2006 11:01:40 51 D SYS | DISK: req=0x54405B4 dev=0xC0000 fn=1 blk=0x251760 sts=7
06/22/2006 11:02:20 51 D SYS | DISK: req=0x543D3BC dev=0xC0000 fn=1 blk=0x25B2E0 sts=7
06/22/2006 11:03:40 51 D SYS | DISK: req=0x5441D68 dev=0xC0000 fn=1 blk=0x2641E0 sts=7
06/22/2006 11:03:47 51 D SYS | DISK: req=0x543CB68 dev=0xC0000 fn=1 blk=0x274260 sts=18
06/22/2006 11:05:00 51 D SYS | DISK: req=0x544D91C dev=0xC0000 fn=1 blk=0x27BB60 sts=7
06/22/2006 11:06:08 51 W SMB | SMB : Can't resolve master browser IP address for domain PULMONARY.
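
For anyone trying to read a log like the one above, a small script can tally the DISK lines by device and status code. This is only a sketch in Python: the regular expression is inferred from the lines pasted in this thread, the name tally_disk_status is made up for illustration, and nothing in the thread says what sts=7 or sts=18 mean in SnapOS or which physical drive dev=0xC0000 is.

```python
import re
from collections import Counter

# Pattern inferred from log lines in this thread, e.g.
# "... D SYS | DISK: req=0x54450A8 dev=0xC0000 fn=1 blk=0x100160 sts=18"
DISK_RE = re.compile(
    r"DISK: req=0x[0-9A-Fa-f]+ dev=(0x[0-9A-Fa-f]+) fn=(\d+) "
    r"blk=(0x[0-9A-Fa-f]+) sts=(\d+)"
)


def tally_disk_status(lines):
    """Count DISK status lines per (device, status code) pair."""
    counts = Counter()
    for line in lines:
        match = DISK_RE.search(line)
        if match:
            dev, _fn, _blk, sts = match.groups()
            counts[(dev, int(sts))] += 1
    return counts


# A few lines copied from the boot 51 log; non-DISK lines are ignored.
sample = [
    "06/22/2006 10:42:45 51 D SYS | DISK: req=0x54450A8 dev=0xC0000 fn=1 blk=0x100160 sts=18",
    "06/22/2006 10:43:18 51 D SYS | DISK: req=0x543D838 dev=0xC0000 fn=1 blk=0x10AC60 sts=7",
    "06/22/2006 10:43:19 51 D SYS | DISK: req=0x544F694 dev=0xC0000 fn=1 blk=0x10B660 sts=18",
    "06/22/2006 10:59:03 51 D SYS | RAID5Resync on array 0: 1% done",
]

for (dev, sts), n in sorted(tally_disk_status(sample).items()):
    print(f"dev={dev} sts={sts}: {n} line(s)")
```

If every DISK line in a full log points at one device, as in this excerpt, that at least narrows down where to look, even without knowing what the status codes stand for.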

blue68f100

06-22-2006, 07:42 PM

2 things stand out.

1. 3 Maxtor drives and 1 Western Digital. What is the age of the drives? On Snap RAID systems it's never a good idea to mix manufacturers. If I recall, one of these models (the WD) is an EIDE drive. They don't play well with Snaps. Another thing is that the physical specs of the drives (heads, cylinders) are different, which can lead to problems too.

2. It appears you only have 64 meg of RAM. This is not good; yes, it will work, but it is constrained with JVM running and taking 24 meg.

How long was this running in this configuration?
It's never good to mix drives in RAIDs.

sleepysnap

06-22-2006, 08:27 PM

Thank you very much for your replies, and the sharing of your knowledge.

This is the way this system came; it was a refurb replacement for an original one. It has been in this config for about 4-5 years. I have another 4100, with I think Maxtors, that was the original unit that never got picked up, etc., etc., long story. It was configured with dual RAID mirrors, but now one drive has a yellow LED, so it has crapped out. I can cannibalize one of those Maxtors to bring this one into what you consider compliance. And I have a lot of DIMMs and SIMMs that might fit the bill for upgrades to that also.

I have never needed to peer into this machine's workings. If I can just get the thing to stop rebuilding and get the files I need off of it, I can do all the things I am reading about here. Knoppix (DL'd today) on another computer (got tons of those) to write an image, then trash this setup and get one well-configured box, and I will be a happy camper. I might even donate the old box to all of you guys here looking for a 4100 I keep reading about. Hmmm, now if I could get a V4 too and make my MIS dept allow me to join the AD domain I will be ecstatic.

Should I also turn off JVM at this time?
How much RAM should I be looking for?

blue68f100

06-22-2006, 09:13 PM

JVM takes up 24 meg of your 64 meg system if I read everything correctly. The 4100 uses PC66-spec DIMM memory, so PC100 will/should work. V4 takes more RAM; I have been recommending a min of 128. If you use the JVM for SSL or web functions, more is better. The max is 256.

Snap uses a modified XFS file system; the OS is a custom FreeBSD. DD is popular here because it can do a RAW copy.

Since you said this is a mission-critical system, this is what I would do. I have not been able to tell with 100% accuracy that your WD is failing. Since you have some spare Maxtor drives of the same capacity, I would 1st mark the position of the drives in the 4100.

What I would suggest is to see if you can copy (clone) the Snap drives. I suspect the bad one will give errors. Once the bad drive is located, install a new/clean drive (non-Unix format). The Snap will auto-format the drive. You then need to set it as an active spare. Then the unit will recognize it and start the array rebuild.

A word of caution: if you pick the wrong one you're SOL. So clone them all, timing them as a kind of checksum. One may take longer if it's having problems.
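
The "clone them all and time them" idea can be sketched in Python for anyone who would rather script it than run dd by hand. This is a rough illustration, not a tested recovery tool: clone_with_timing is a made-up name, and the device and image paths in the usage comment are hypothetical. The source is only ever opened read-only, and the first read error stops the copy, which is itself a hint about which drive is in trouble.

```python
import hashlib
import time

CHUNK = 1024 * 1024  # read 1 MiB at a time


def clone_with_timing(src_path, dst_path, chunk_size=CHUNK):
    """Copy src to dst; return (bytes_copied, seconds, sha256_hex).

    The source is opened read-only. An OSError from a bad sector
    propagates to the caller and ends the copy.
    """
    digest = hashlib.sha256()
    copied = 0
    start = time.monotonic()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk_size)
            if not block:
                break
            dst.write(block)
            digest.update(block)
            copied += len(block)
    return copied, time.monotonic() - start, digest.hexdigest()


# Hypothetical usage, one call per drive pulled from the Snap:
#   size, secs, sha = clone_with_timing("/dev/sdb", "/mnt/backup/slot0.img")
#   print(f"slot0: {size} bytes in {secs:.0f} s, sha256 {sha}")
```

A drive that takes much longer than its siblings, or that stops with an I/O error, is the first suspect. If the goal is to keep going past bad sectors rather than stop, dd with conv=noerror,sync is the usual tool for that.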

Phoenix32

06-22-2006, 09:33 PM

I need to chime in here on one "possible" issue about the WD drives and/or EIDE drives issues etc... My apologies for not picking up sooner on this issue some of you have had. While the problems may in fact be from EIDE drives and the RAID controller, this sounds to me more like a common issue you can find on PCs with PATA RAID. There has been an issue where hard disks get dropped from a RAID array on PCs. This quite often has to do with timeouts caused by idle acoustic features. I suspect this is the problem within the Snap Servers as well when drives keep getting dropped from a RAID array.

Go here for a little more information from WD

http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=913&p_created=1047068027&p_sid=DDw8jKai&p_lva=&p_sp=cF9zcmNoPTEmcF9zb3J0X2J5PSZwX2dyaWRzb3J0PSZwX3Jvd19jbnQ9MTcmcF9wcm9kcz05MSwwJnBfY2F0cz0wJnBfcHY9MS45MSZwX2N2PSZwX3NlYXJjaF90eXBlPWFuc3dlcnMuc2VhcmNoX2ZubCZwX3BhZ2U9MSZwX3NlYXJjaF90ZXh0PVJBSUQ*&p_li=&p_topview=1

If this massive link does not work, go to WD Knowledge Base Answer ID 913

The solution is to go in and turn off the acoustic feature of the drive. This has also been seen in other manufacturers drives. Sometimes those fancy features to make things "better" can cause issues with RAID arrays where timing is important.

I hope this helps and clears up some people's issues with their Snap Servers.

Now with that said, this does not mean there are no other issues or that this is sleepysnap's problem. But it was info that needed to be brought to light. Also, as Dave has said here, in ANY RAID ARRAY in ANY EQUIPMENT, they always work better when they are using matched drives.

sleepysnap

06-22-2006, 10:02 PM

Thanks y'all. I never even looked at a log entry on this machine. I figured since it came from them it was good to go, so I plugged it in, set it up and went to work.
Go figure. As I write I am at 56% complete on boot 51, so I guess in the AM it will either hang at 71% like last time or somewhere else, or maybe, just maybe, get all the way through, and then reboot to accept all the networking changes I had made.

Should I turn JVM off at this time so it does not start upon reboot?

Phoenix, should I be DL'ing the "For all configurations other than 3Ware controller cards, download the IDE Upgrade Utility for the Desktop PC" http://support.wdc.com/download/index.asp?cxml=n&pid=3&swid=12 ? I assume the Snap is NOT a 3Ware controller? Also:
"The problem is a result of a feature that reduces idle acoustic noise in desktop drives. This feature may cause a timeout likely (though not exclusively) in an IDE RAID environment. To disable the feature, you can run a simple Western Digital utility to turn off a single bit in the drive’s run-time configuration. Disabling of this feature will NOT impact normal system operations. No firmware or hardware changes are required."
So I would have to take it out and run the software from a PC? Wouldn't taking it out kill the signature? I know that unplugging the drive from its cable is like RAID death, right?

Also, blue, you are speaking in tongues, so I need to play catch-up with the DD (Dig dolly? mentioned in wiki?)

I have one visual piece of info I maybe can add. The DISK 1 LED is the ONLY one that keeps stopping the progress of the rebuild: it stays lit, and then they all flash quickly, until it hangs again. IT HAS ALWAYS BEEN THE ONLY DRIVE TO BE LIT SOLID.
I would think that is the culprit. And if I read the Log correctly:

06/22/2006 10:41:43 51 D SYS | Intf: 0, dev: 0: Model: Maxtor 5T060H6
06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6H7EXDC
06/22/2006 10:41:43 51 D SYS | Intf: 1, dev: 0: Model: WDC WD800BB-50DKA0
06/22/2006 10:41:43 51 D SYS | Firmware Rev: 77.07W77 Serial #: WD-WMAHL2012860
06/22/2006 10:41:43 51 D SYS | Intf: 2, dev: 0: Model: Maxtor 5T060H6
06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6H5228C
06/22/2006 10:41:43 51 D SYS | Intf: 3, dev: 0: Model: Maxtor 5T060H6
06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6HMB9HC

It is a Maxtor. Yes?
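
One way to keep the slot-to-drive mapping straight is to pull it out of the boot log with a few lines of Python. This is just a sketch: the patterns are inferred from the eight log lines quoted above, and drive_inventory is a made-up name. Note the log numbers interfaces from 0, and nothing in this thread confirms whether the front-panel "DISK 1" LED means Intf 0 or Intf 1.

```python
import re

# Patterns inferred from the boot log lines quoted in this thread.
INTF_RE = re.compile(r"Intf:\s*(\d+), dev:\s*(\d+): Model:\s*(.+?)\s*$")
FW_RE = re.compile(r"Firmware Rev:\s*(\S+)\s+Serial #:\s*(\S+)")


def drive_inventory(lines):
    """Pair each 'Intf' line with the 'Firmware Rev' line that follows it."""
    drives = []
    for line in lines:
        match = INTF_RE.search(line)
        if match:
            drives.append({
                "intf": int(match.group(1)),
                "dev": int(match.group(2)),
                "model": match.group(3),
                "firmware": None,
                "serial": None,
            })
            continue
        match = FW_RE.search(line)
        if match and drives:
            drives[-1]["firmware"], drives[-1]["serial"] = match.groups()
    return drives


log = [
    "06/22/2006 10:41:43 51 D SYS | Intf: 0, dev: 0: Model: Maxtor 5T060H6",
    "06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6H7EXDC",
    "06/22/2006 10:41:43 51 D SYS | Intf: 1, dev: 0: Model: WDC WD800BB-50DKA0",
    "06/22/2006 10:41:43 51 D SYS | Firmware Rev: 77.07W77 Serial #: WD-WMAHL2012860",
    "06/22/2006 10:41:43 51 D SYS | Intf: 2, dev: 0: Model: Maxtor 5T060H6",
    "06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6H5228C",
    "06/22/2006 10:41:43 51 D SYS | Intf: 3, dev: 0: Model: Maxtor 5T060H6",
    "06/22/2006 10:41:43 51 D SYS | Firmware Rev: TAH71DP0 Serial #: T6HMB9HC",
]

for drive in drive_inventory(log):
    print(drive["intf"], drive["model"], drive["serial"])
```

If the LEDs count from 1, DISK 1 would be Intf 0 (a Maxtor); if they count from 0, it would be Intf 1 (the WD). Matching the serial number on the drive label against this list settles it without guessing.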

blue68f100

06-22-2006, 10:07 PM

Running for 4-5 yrs, it's not likely the problem this time, but a good point Phoenix. My original Hitachi drives had the acoustics turned off, but had FDB.

These drives were built with real bearings not FDB.

Looking at the actual physical specs of the drives: the WD had 5 heads (odd) where the Maxtors had 6; both had 3 platters. Capacity was slightly different, but well within the 2-3% tolerance. 3 platters today = 300-400 gigs capacity. But it worked for a long period of time, go figure.

Which brings up another interesting point. Could this be used to locate a bad drive?

I used to use acoustic monitoring equipment on big pumps and motors to catch problems before it was too late (PM).

sleepysnap

06-22-2006, 10:53 PM

FDB?

Should I turn the JVM off now, or wait till after the reboot?

I saw in other threads that you guys were discussing GRC's SpinRite. I have used this product for years, especially when I had a whopping 20MB Winchester HD. I did buy the new version when it came out and it has now saved colleagues twice. GREAT software. I can't emphasize GREAT enough.

sleepysnap

06-23-2006, 08:14 AM

0920 EDT
Finished, WOOOHOOOO!
What should I do first? Turn off JVM?
Start ripping the files off AFAP?
Something else?

*******Snippet of end of present LOG FILE ********
06/23/2006 8:53:43 51 D SYS | DISK: req=0x54440A4 dev=0xC0000 fn=1 blk=0x72640E0 sts=7
06/23/2006 8:54:23 51 D SYS | DISK: req=0x544990C dev=0xC0000 fn=1 blk=0x72704E0 sts=7
06/23/2006 8:55:03 51 D SYS | DISK: req=0x5441A34 dev=0xC0000 fn=1 blk=0x727A0E0 sts=7
06/23/2006 8:55:43 51 D SYS | DISK: req=0x544EF88 dev=0xC0000 fn=1 blk=0x7284660 sts=7
06/23/2006 8:55:46 51 D SYS | RAID5Resync on array 0: 100% done
06/23/2006 8:55:47 51 D SYS | RAID5Resync on array 0: completed OK
06/23/2006 8:55:47 51 I L01 | File System : Logical set synchronization done on device 60000
06/23/2006 9:02:00 51 I L01 | File System : Extended Rights Backup for device 0x60000 has begun
06/23/2006 9:02:56 51 D SYS | DISK: req=0x5432954 dev=0xC0000 fn=1 blk=0x662F60 sts=18
06/23/2006 9:03:51 51 I L01 | File System : Extended Rights Backup for device 0x60000 has completed successfully

eventLogGetRecord() printed 6271 records at 06/23/2006 9:08:04

------------
Command executed without error.

Can we glean which drive is the new backup? None of the LEDs are yellow, or steady green. How can I test drives to see which is the flawed one?

A graphic of the IN DE command is also attached.
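
For reference, the resync time on boot 51 can be worked out from three timestamps in the logs posted in this thread. The short Python sketch below does the arithmetic; seconds_between is just a helper name made up here.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"  # timestamp layout used in the log lines


def seconds_between(start, end):
    """Whole seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


STARTED = "06/22/2006 10:41:54"      # RAID5Resync on array 0: started
ONE_PERCENT = "06/22/2006 10:59:03"  # RAID5Resync on array 0: 1% done
COMPLETED = "06/23/2006 8:55:47"     # RAID5Resync on array 0: completed OK

total = seconds_between(STARTED, COMPLETED)
first_percent = seconds_between(STARTED, ONE_PERCENT)

print(f"actual resync time: {total / 3600:.1f} h")
print(f"naive estimate from the first 1%: {first_percent * 100 / 3600:.1f} h")
```

The resync took about 22.2 hours in total. Extrapolating from the first 1% would have predicted about 28.6 hours, so the early percentages are only a rough guide to when it will finish.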

sleepysnap

06-23-2006, 08:50 AM

BTW, I attach a graphic of the second 4100 drive config.
Of course I have no idea what I am looking at.
The IN LO p -1 command shows the drives to be IBMs. OY!

blue68f100

06-23-2006, 12:22 PM

sleepysnap, on Unit 1 make sure you have all of your data backed up. If you don't need the SSL function or any function of the JVM, turn it off for now.

Resyncs take a long time, almost as long as building an array with 300 gig drives. If this unit is 4-5 yrs old you are at the average end of life for the drives. It may be a good time to replace them, once you get the data backed up. I've seen it so many times with drives that run 24/7: when you shut them down they may or may not come back. Even though it's a RAID 5, you still need to have some kind of backup program.

Your second 4100 indicates a span is broken on 40000. This is a RAID 0 striped set. Missing drives in slot 0 and 2. IBM makes a solid drive. That's what Apple used to use in their desktop computers; I have one that's been running for 10+ yrs now (SCSI).

What does the disk status show?

I don't think the snap supports RAID 0+1 config.

Phoenix32

06-23-2006, 12:58 PM

So I would have to take it out and run the software from a PC? Wouldn't taking it out kill the signature? I know that unplugging the drive from its cable is like RAID death, right?

Yes, you would have to take the drive out, connect it to a PC, run their utility, then put it back in the snap server. DO NOT WRITE ANY DATA TO THE DRIVE.

No, as long as you do not power up the Snap while the drive is out, and do not write to the drive while it is connected to the PC, it should not harm the RAID. That utility does not actually write any data that is stored on the magnetic area of the drive. It just goes in and flips an electronic switch on the drive's controller.

Phoenix32

06-23-2006, 01:08 PM

Running for 4-5 yrs not likely the problem this time, but a good point Phoenix.

My mistake, I thought he said he changed the drives in it recently. But it is still something others should think about if they had problems with their newly installed drives randomly dropping out of the RAID. This is very common in the hardware PC world with IDE drives and RAID arrays.

Which brings up another interesting point. Could this be used to locate a bad drive?

I used to use acoustic monitoring equipment on big pumps and motors to catch problems before it was too late (PM).

Yes, I have used acoustics for monitoring large pumps and motors as well (submariner remember :cool: ). But in this case, that acoustics stuff on the hard disk's controller is for something else (not the same type of monitoring you want). So I would doubt it honestly. The S.M.A.R.T. function of the drive would yield much more useful data for locating a failing drive than the acoustic circuit I would think.
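
As a rough illustration of the S.M.A.R.T. route, the Python sketch below parses the attribute table that smartctl -A (from smartmontools) prints and flags a few attributes often associated with failing drives. It assumes the drive has been moved to a PC with smartmontools installed, since nothing in this thread suggests SnapOS exposes S.M.A.R.T. data itself. The function names, the watch list, and the device path are illustrative choices, and the sample table is invented, not taken from any drive discussed here.

```python
import subprocess

# Attributes commonly watched when hunting for a failing drive
# (an illustrative selection, not an official list).
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def read_smart_text(device):
    """Run `smartctl -A` on a device (hypothetical path, e.g. /dev/sdb)."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return result.stdout


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value_string} from smartctl -A output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows start with a numeric ID and have 10 columns.
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9].strip()
    return attrs


def suspicious(attrs):
    """Names from WATCH whose raw value starts with a nonzero integer."""
    flagged = []
    for name in WATCH:
        parts = attrs.get(name, "0").split()
        if parts and parts[0].isdigit() and int(parts[0]) > 0:
            flagged.append(name)
    return flagged


# Invented sample in the usual smartctl -A layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

print(suspicious(parse_smart_attributes(SAMPLE)))
```

Running this against each drive in turn and comparing the flagged attributes is one way to single out the odd one, with the usual caveat that a clean S.M.A.R.T. report does not prove a drive is healthy.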

Phoenix32

06-23-2006, 01:10 PM

FDB?

Fluid Dynamic Bearing...

Phoenix32

06-23-2006, 01:13 PM

I don't think the snap supports RAID 0+1 config.

Not that I have ever seen listed, but if it does, I want to know ASAP please please please....

jontz

06-23-2006, 01:16 PM

I don't think the snap supports RAID 0+1 config.

Nope, it doesn't.

blue68f100

06-23-2006, 02:31 PM

I was looking at an 8ch SATA controller (raidcore BC4852) for my FreeNAS box that has some nice functions, plus some other options I haven't seen before.

jontz

06-23-2006, 03:30 PM

I was looking at an 8ch SATA controller (raidcore BC4852) for my FreeNAS box that has some nice functions, plus some other options I haven't seen before.

Does it slice, dice, and make julienne fries?

blue68f100

06-23-2006, 03:38 PM

Almost. Check it out: http://www.broadcom.com/products/Enterprise-Small-Office/Storage-Solutions/BC4852

TomsHardware did an article http://www.tomshardware.com/2005/10/31/sata_spells_trouble_for_scsi_raid/ on these units; it was quite impressive.

sleepysnap

06-23-2006, 04:48 PM

So, should I take the 4 IBM drives out of snap2 and DD them in a PC and put them into snap1? I am confused by this whole mess.
Should I look to buy 4 new HDs? What kind? There is the 137GB limit, so I cannot get any bump for all this work.
I need to bump the RAM; what about that? What do I need there too?
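
A side note on where a figure like 137GB comes from: it matches the addressing limit of 28-bit LBA on ATA drives with 512-byte sectors. Whether that is the exact reason for the Snap's limit is an assumption here, but the arithmetic is easy to check in Python:

```python
SECTOR_BYTES = 512  # classic ATA sector size
LBA_BITS = 28       # address width of the original ATA LBA scheme

limit_bytes = (2 ** LBA_BITS) * SECTOR_BYTES

print(limit_bytes)          # total addressable bytes
print(limit_bytes / 10**9)  # in decimal gigabytes, as drive makers count
print(limit_bytes / 2**30)  # in binary gigabytes (GiB)
```

That works out to 137,438,953,472 bytes, which is about 137.4 GB in drive-maker units or exactly 128 GiB.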

"Oh the humanity."

blue68f100

06-23-2006, 05:40 PM

IBM makes good drives. I was surprised to see a RAID 0 on a 4-drive unit. Most everyone uses RAID 5 on 4x00 units.

I was told 4100s don't have JBOD (individual disks). Change it to RAID 5, build a 3-disk array with a hot spare, and see what happens. I suspect at least one drive is bad, but with 2 MIA you may have lost a controller. With a hot spare, if one drive fails it will auto-roll. Do that until you know what kind of shape the drives are in. I am assuming all of the disk LEDs show good.

There is no easy way to rebuild a RAID 0; it requires a recovery service.

If you have a copy of SpinRite by GRC, you may pull the drives and let it check them. The latest v6 is supposed to support XFS file systems and RAID drives. I have not tested it on any Snap drives.
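
To put numbers on the 3-disk RAID 5 plus hot spare suggestion: the usual RAID 5 rule is that usable space is the smallest member's size times the number of members minus one, and a hot spare adds nothing until it is rolled in. The Python sketch below uses made-up 60 GB sizes purely for illustration, and raid5_usable is a hypothetical helper name, not a Snap command.

```python
def raid5_usable(member_sizes_gb):
    """Usable capacity of a RAID 5 set: (n - 1) * smallest member."""
    if len(member_sizes_gb) < 3:
        raise ValueError("RAID 5 needs at least 3 members")
    return (len(member_sizes_gb) - 1) * min(member_sizes_gb)


# Illustrative sizes only: three members in the array, fourth drive as hot spare.
print(raid5_usable([60, 60, 60]))      # 3-disk array, spare held back
print(raid5_usable([60, 60, 60, 60]))  # all four drives in the array
```

With these example sizes the 3-disk array with a spare gives 120 GB usable, against 180 GB if all four drives are members. The trade is capacity for an automatic rebuild when a drive drops out.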

jontz

06-23-2006, 07:06 PM

I was told 4100s don't have JBOD (individual disks).

That is correct. The only options in disk config are RAID 5, RAID 0, or hot spare.

Italo2

09-03-2006, 06:23 PM

I have a similar problem.

After some 'strange' noises coming from the unit, possibly a read error on one of the standard IBM 60GB drives, I decided to let the Snap check and possibly correct errors on the drive.

To do so it had to restart the unit. Since then it keeps rebuilding all the drives and flagging that they are all fsck'd, but worse, it comes to a complete stop after a few minutes: PANIC, system and disk LEDs blinking rapidly!

Drives 1 and 2 are in a mirror and 3 and 4 are JBOD.

By disconnecting 1, 2 and 3, I successfully finished the check on disk 4 and I managed to save its data. Did the same for disk 3.

The mirror doesn't want to come online, so I mounted them in my Snap2000, thinking my 4000 unit is (partially) broken. What do you think? Yes! Exactly the same problem! PANIC :doh:

So maybe it is a problem with the hard disk(s). I used SpinRite recovery mode to check them; some unrecoverable errors were found, but not on one of the mirror drives.
I mounted them all back in the 4000 unit and decided to format/reinit the one with errors, and it successfully formatted it. :)

After yet another reboot the problems started all over again! Even the freshly formatted drive was flagged as having a fatal error. And again PANIC!

I am about to Panic as well now, because (as always) there's important business data on this disk.

Backup? Yes, I had a backup on my 2000, which I accidentally formatted in the process of salvaging my 4000 data :bawling:

Does anybody have any suggestions left?

BTW: I have the standard 128MB of memory installed; could it be insufficient?

Thanks in advance for any hints,

Roberto

blue68f100

09-04-2006, 03:44 PM

Panic errors are usually caused by a MAJOR problem, and this one seems to be hardware related. If it is hardware, installing the drive in another 4000 may work.

If you have some more RAM you can try it; you have nothing to lose at this point.

Italo2

09-04-2006, 07:47 PM

Thanks for your reply.

It turned out the Snap is very picky about which memory it will operate with.
I took 256MB of memory out of a working PC, but the Snap didn't like it.

I've relocated the mirrored disks to a Snap2000 with the same version of the operating system.

It acts up in exactly the same manner. It boots the disks and within a few seconds it stalls in Panic (flashing the system and disk LEDs very fast).

During the boot sequence I can get into the web interface for about 20 seconds, then it hangs, whether it is the 4100 or the 2000 unit; no difference.

Could a file-system corruption lead to a hang of the whole server?

I really don't like that we then have no way of getting to our data on a Snap Server. Or did I miss something?

Any suggestions?

Both units work fine with freshly formatted disks, even the same disks!

Ciao4now,

Roberto

blue68f100

09-05-2006, 12:09 PM

The 4100 and 2000 RAID setups are different. The 4100 uses hardware RAID, whereas the 2000 uses software RAID. So swapping drives was probably not a good thing.

You can reload the OS and it may help. The only way to do it when it is acting up like this is to put the unit in "flup mode". This is done like resetting, but with 5 blinks instead of 4. It will only work if the unit has a valid IP address. You must use the same OS version as loaded, or newer; preferably the same.

jontz

09-05-2006, 08:40 PM

The 4100 and 2000 RAID setups are different. The 4100 uses hardware RAID, whereas the 2000 uses software RAID. So swapping drives was probably not a good thing.

I think he has a 4000...he referenced it as a 4000 in his post a couple of times and talked about drives 3 and 4 being JBOD, which the 4100 doesn't do.

Italo2

09-06-2006, 01:17 AM

I think he has a 4000...he referenced it as a 4000 in his post a couple of times and talked about drives 3 and 4 being JBOD, which the 4100 doesn't do.

jontz might be right! I am a bit confused, because I have a 1U high box, but I have JBOD and I suffered no ill effects after putting the drives in my 2000 unit.

I managed to get the data from disks 3 and 4, both JBOD.
Disk 3 especially was cumbersome: the Snap froze very frequently and I had to restart, mount the disk manually and then, after the disk check failed, start the disk check manually again via the menu. And that x times (I've lost count).

The Cracked Mirror is giving me the real problems.
On the mirrored set I had my (small) business data, to be 'protected' :ha:

I tried to format one half of the mirror and make it a spare, in the hope Snap would grab it.

Instead I'm now left with a cracked mirror and a disk which makes my Snap freeze in a snap every time it boots.

The disks used are IBM Deathstar 60GB. Ironically the mirror disk doesn't contain one bad spot on its surface. I've checked that with SpinRite 6, Hitachi's Drive Fitness Test and Ontrack's Data Advisor.

After sucking the existing disks dry, I've successfully built a new mirror, using both RAID controllers (i.e. disks 1 and 4), and formatted disk 3, after trying to use it as a spare for the original mirror disk 2.

Any more suggestions?

I REALLY appreciate the help after one week of getting just 40 hours sleep.
My (small) business data is on the disk and I'm pretty lost without it.

blue68f100

09-06-2006, 07:33 AM

Are the 4000 and 2000 using the same version of the OS?

With the single good drive (from RAID 1) in the 2000 (master), does it go into panic mode? YES.

In order for the procedure below to work, it must be able to mount on the network.

Make sure it is set as master, with NO other drive installed. If it goes into panic mode, reload the OS. To do this you need to put the 2000 in flup mode, which is 5 blinks using the reset procedure. Then, using Assist, reload the OS. This has worked for me in the past.

Italo2

09-06-2006, 10:52 AM

Are the 4000 and 2000 using the same version of the OS?

:hammer: Ehhhh, now that you mention it, I checked the 2000 unit and upgraded it to the same version as the 4000 unit.

How bad could that have been for the data on my drive?

With the single good drive (from RAID 1) in the 2000 (master), does it go into panic mode? YES.

In order for the procedure below to work, it must be able to mount on the network.

Make sure it is set as master, with NO other drive installed. If it goes into panic mode, reload the OS. To do this you need to put the 2000 in flup mode, which is 5 blinks using the reset procedure. Then, using Assist, reload the OS. This has worked for me in the past.

Well I did just that, more or less.

I did the following:

Booted Snap with all IDE cables connected, bar one
Put it in FLUP mode
Ran OSUPGRADE.EXE to upload an almost 30MB OS file (v3.4.803)
co de automount disable
Turned it off
Reconnected the defective disk and disconnected all other ones
Rebooted
Mounted the drive manually
Panic!

I just got a copy of OnTrack's EasyRecovery Pro 6.1.
It supposedly has a RawRecovery mode where it scans the whole disk using a signature file of known data formats, building a location database along the way.
This database is then used to save what's left of the files (BTW: no filenames, because it's a non-FAT/NTFS filesystem); then you can use WordRecovery or ExcelRecovery to try to make it a true file again. I haven't seen anything like it on the market. I'll report back with my outcome soon!

Thanks for the ideas and support so far.

P.S.: OnTrack Remote DataRecovery (www.ontrack.com) just quoted me € 350 for diagnostics and € 2.000 - € 5.000 for recovery!
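
As a rough idea of what a signature-based scan involves, here is a minimal Python sketch. It is not how EasyRecovery works internally, which is not documented in this thread; it only shows the general technique of searching raw bytes for well-known file signatures and recording where they occur. The `scan_for_signatures` function and the sample `image` bytes are hypothetical.

```python
# Well-known file signatures ("magic numbers"); a real tool knows many more.
SIGNATURES = {
    "ole2 (old Word/Excel)": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
    "pdf": b"%PDF-",
    "jpeg": b"\xff\xd8\xff",
    "zip": b"PK\x03\x04",
}

def scan_for_signatures(raw, signatures=SIGNATURES):
    """Return a sorted list of (offset, file type) for every signature hit."""
    hits = []
    for name, magic in signatures.items():
        pos = raw.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = raw.find(magic, pos + 1)
    return sorted(hits)

# Hypothetical stand-in for a raw disk image: padding, a PDF header,
# more padding, then an OLE2 header.
image = (b"\x00" * 16 + b"%PDF-1.4 ..." + b"\x00" * 8
         + b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 4)

for offset, kind in scan_for_signatures(image):
    print(offset, kind)
```

The list of offsets plays the role of the "location database". A signature only marks where a file probably starts, so names and exact lengths are not recovered, which matches the "no filenames" caveat above. A real disk would have to be read in chunks rather than loaded into memory at once.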

A local Dutch company (http://www.rse.nl) does it for € 75 diag and up-to € 1.500 recovery.

blue68f100

09-06-2006, 12:32 PM

Not familiar with OSUPGRADE.EXE. I have always used the web or, in your case, the Assist utility. I do know that, due to the age of these units, a lot of the utilities would not work on XP. I would use Win98. This is a requirement if you are upgrading the BIOS on a V1 model.

If this is a version 2 2000 (DIMM), change the drive config to cable select and make sure it's on the end of the cable. Sometimes this makes a difference.

Normally a RAID 1 drive can be in any position and boot. Since this came from a 4000, put it in the same position as the original master drive of the set.

re3dyb0y

09-06-2006, 02:57 PM

blue68f100

09-06-2006, 03:35 PM

If that's the case, it means it is for Win95, Win98 and NT4.0, so more than likely the update was not done correctly. XP is not smart enough to do it, even under emulation.

Italo2

09-07-2006, 03:07 AM

If that's the case, it means it is for Win95, Win98 and NT4.0, so more than likely the update was not done correctly. XP is not smart enough to do it, even under emulation.

This comes from the on-line manual UpgradeNotes2.html:

Upgrading with command-line utilities
The OSUpdate utility (and the UGUpdate utility) can be installed by downloading Util_Zip.exe from http://www.snapappliance.com/download. Unzip the file to a computer running Windows 95, 98, Me, 2000, NT 4.0, or XP that has network access to the Snap Server.
OSUpdate — Use the OSUpdate program to (1) update the SnapOS from a command line; or, (2) write a script or batch file to update multiple Snap Servers

I used it successfully to upload Snap OS v3.4.803 to both my 2000 & 4000 units.

But... maybe I misunderstand the concept of Flup-mode...

Putting the server into Flup-mode:

Turn power on and hold the reset button until the System & Disk LEDs blink in sync
Press reset five times in a row and watch the Disk LED blink in groups of 5 blinks
Press and hold reset once more and wait for the System & Disk LEDs to sync again.

If I do that I can't connect to my Snap anymore, although it has a fixed IP address.
What am I doing wrong?

blue68f100

09-07-2006, 01:11 PM

The flup mode works but requires assist.

Using v3 or v4 Assist, you access the Snap and then follow the update procedure instructions. It will instruct you to put the server in flup mode and tell you to wait for 30 seconds; this is a minimum, so use 45 seconds. Time yourself; it takes that long for the Snap to be in the proper mode.

And yes, sometimes it does not work if the unit cannot retrieve an IP. I always use a fixed IP, because it will ask for the IP if it does not find it.
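
If you want to take the guesswork out of the timing, a small script can do the waiting and then check whether the unit answers on its fixed IP. This is only a sketch under assumptions: it is not part of Assist, the `wait_for_snap` name and the address in the comment are placeholders, and whether the Snap answers on TCP port 80 while in flup mode is a guess, so change the port if it does not.

```python
import socket
import time

def wait_for_snap(host, port=80, settle_seconds=45, attempts=5, retry_delay=5,
                  connect=socket.create_connection, sleep=time.sleep):
    """Wait for the unit to settle, then try to open a TCP connection to it.

    Returns True once a connection succeeds, False if every attempt fails.
    """
    sleep(settle_seconds)  # the 45 seconds suggested above
    for attempt in range(attempts):
        try:
            connect((host, port), timeout=3).close()
            return True
        except OSError:
            # Unreachable or refused; pause before the next try, if any.
            if attempt < attempts - 1:
                sleep(retry_delay)
    return False

# Example call (placeholder address; use the Snap's fixed IP):
#     wait_for_snap("192.168.1.50")
```

The `connect` and `sleep` parameters exist only so the function can be exercised without a real unit on the network.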