Mark Holtz (mholtz@intermag.com)
Thu, 13 Jan 2000 10:39:08 -0800
>So, yesterday morning we replaced the 150 MHz CPU card with the 200 MHz one.
>The server went OK for about 8 hours. And then started crashing like hell.
>Crashed 3 times in less than one hour. I finally removed the 200 MHz cpu,
>and put back the old one.
We're presently running 4 ANS machines: two 700s and two 500s. We also
upgraded all of them to the 200 MHz cards, and 3 of them run like a
charm, not a care in the world.
However, one of them has been -very- temperamental, crashing/kernel
panicking on everything from initial fsck's to running netstat to a
highly-parallel kernel build. I can't say, however, that it was directly
related to putting the 200 MHz card in the machine, and all our other
machines are running without incident, both prototype and production
alike.
Here's what we've been seeing:
- Generally, the machine cleanly kernel panics and reboots. The error is
typically something akin to 'illegal memory access'.
- Occasionally, the machine fails fsck and refuses to come up without
user intervention.
- Increasing the network load on the machine significantly caused it to
fail more often. The load in question was a high number of hits on
Apache, though just for basic HTML, no CGI. However, the machine was
running at about 120-140 simultaneous connections consistently.
- Crashing frequency does not seem -directly- tied to uptime, swap-usage,
or even CPU usage. The machine can run anywhere from a few minutes to a
week without rolling over. It can also run at 99% usage and be fine,
then crash an hour later while the machine is doing nothing.
- The 'oops' dump isn't terribly informative, unfortunately. Does anyone
know how to get more detail out of it, a la x86 dumps?
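For anyone who wants to approximate the network-load condition described
above on a test box, here is a minimal Python sketch. It is not the tool we
used; the function name hammer() and its parameters (connections,
requests_each, timeout) are made up for illustration, and the default of 130
workers is only a stand-in for the 120-140 simultaneous connections mentioned
above. Each worker holds at most one connection open at a time and fetches a
static page repeatedly.

```python
import http.client
import threading


def hammer(host, port=80, path="/", connections=130, requests_each=10, timeout=10.0):
    """Fetch `path` from `connections` concurrent workers.

    Each worker issues `requests_each` plain GET requests, one connection
    per request. Returns a dict counting successful (HTTP 200) and failed
    requests. All names here are illustrative assumptions.
    """
    counts = {"ok": 0, "failed": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(requests_each):
            conn = http.client.HTTPConnection(host, port, timeout=timeout)
            try:
                conn.request("GET", path)
                resp = conn.getresponse()
                resp.read()  # drain the body so the request fully completes
                key = "ok" if resp.status == 200 else "failed"
            except (OSError, http.client.HTTPException):
                # Refused, reset, or timed-out connections count as failures.
                key = "failed"
            finally:
                conn.close()
            with lock:
                counts[key] += 1

    threads = [threading.Thread(target=worker) for _ in range(connections)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Calling hammer() with the test machine's host name and a static page path
returns the ok/failed tally once all workers finish. A sudden run of failures
would at least give a rough timestamp for when the box went down.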
We've tried running the same motherboard with a variety of installed
drives (all stable on other machines in similar conditions) and have
tried turning off networking while running intensive processes, etc. The
motherboard itself is the only consistent element in a failing machine.
We're using kernels 2.2.11 and 2.2.13 so far, though it makes no
noticeable difference.
The motherboard in question has been moved to a non-production
environment, in the hope that some of the diagnostics we have would tell
us something, but they all claim the device is fine.
Any thoughts would be greatly appreciated!
Thanks,
_MH
________________________________________________________________
Mark Holtz mholtz@intermag.com
intermag.com Internet Services http://www.intermag.com/
Now part of HT Networks!
vox +1.510.476.1397 fax +1.650.299.0791