I have a question that occurred to me last night: what kind of DISASTER RECOVERY methods do people use, and what is the best way to implement them?
These are my thoughts. I would like to see other folks comment on them, point out anything I might have missed, or tell me how they deal with a given failure type. I didn't find anything useful when I searched on the subject here; if there are threads on it already that I missed, please let me know.
I intend to have the system I'm building running 24/7 (I think it's better for the hardware, and I don't like waiting for bootups). This naturally causes me concern about how to handle different system failures, especially the ones that might occur when nobody is at the console. I'm doing a plain WC setup: no pelts, extreme tech, etc. All of my cooling hardware will be in or on my case. I will be running Linux, but the functions would be similar on that other OS, even if the software is different (i.e. Linux uses lm_sensors, MS stuff uses MBM or equivalent).
As many folks have pointed out, there are risks in WC'ing a system. I think there are also risks in an AC'd system, just different ones. However, I would think that advance planning and care in system design can manage those risks so as to minimize the potential for damage.
DEFINITION: A properly managed risk is one where a failure causes no hardware damage to the system beyond the failed part itself. Data corruption or system downtime is undesirable but totally acceptable if it prevents hardware damage.
Failures in the PC hardware itself I don't see as an issue, because they will be handled (or not) the same way no matter how the system is cooled. (Note that WC'ing is supposed to reduce the odds of hardware failures by giving better temps, but that's a different issue.)
Five failure modes exist for a WC system that I can see. Each would have a different set of symptoms, detection methods, damage potential, and optimum handling to avoid damage to data or hardware.[list=1][*]Partial pump failure / flow restriction[*]TOTAL pump failure / flow blockage[*]Radiator cooling (fan) failure[*]LEAKS[*]Excess temperature rise[/list=1]
(Combinations of the above might happen, but I suspect that one would actually happen first, and trigger others if not handled)
To take them one at a time:
1. Partial pump failure or flow restriction -
Symptom - Temperature rises to new equilibrium point, how high depends on level of failure.
Detection - Flowmeter and/or normal temperature monitoring, mobo or Digidoc based.
Handling - Varies and depends on severity. Certainly generate an alarm (can't fix it if you don't know it's broke, and can't count on the user checking). A severe case should be treated as a total failure. A less severe case might turn on fans or increase their speeds to raise cooling levels and bring temps back down.
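To make case 1 a bit more concrete, here is a rough Python sketch of what I have in mind. It assumes a kernel that exposes the sensors as temp*_input files under /sys/class/hwmon (values in millidegrees C); older setups put them elsewhere. The function names, the 5 and 15 degree cutoffs, and the action labels are all my own placeholders, not anything lm_sensors defines.

```python
from pathlib import Path


def read_temps_c(hwmon_root="/sys/class/hwmon"):
    """Return {sensor file path: degrees C} for every temp*_input found."""
    temps = {}
    for f in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            # The kernel reports millidegrees Celsius as a plain integer.
            temps[str(f)] = int(f.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor vanished or returned junk; skip it
    return temps


def partial_failure_response(temp_c, baseline_c,
                             mild_rise_c=5.0, severe_rise_c=15.0):
    """Map a rise above the normal equilibrium temperature to actions.

    The cutoffs are made-up numbers; tune them for your own loop.
    """
    rise = temp_c - baseline_c
    if rise < mild_rise_c:
        return []                                   # within normal drift
    if rise < severe_rise_c:
        return ["alarm", "boost_fans"]              # less severe case
    return ["alarm", "treat_as_total_failure"]      # severe case
```

The idea would be to poll read_temps_c() every few seconds and feed the CPU reading into partial_failure_response(); which sensor file is actually the CPU varies from board to board.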
2. Total pump failure or flow blockage.
Symptom - Temperature rises to MELTDOWN levels - I'm not sure how fast... (opinions anyone?)
Detection - Optimal would be some kind of flow detector (flow meter, pressure switch, etc.). If that doesn't exist, look for temperature increases.
Handling - Trigger alarm? Shut system down.
Opinions requested: how fast to shut down would depend on the speed of the temperature rise and the detection method. (Can a Digidoc talk to a mobo, BTW? Also, can a Digidoc and a mobo share temp or fan speed sensors?) If there is enough time, shut down gracefully ('shutdown -h now'); otherwise trigger a relay to interrupt power and slam the system off (deal with any resulting data loss / corruption later).
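One way to frame the "is there enough time" question is to estimate how long until the CPU hits its limit from the current temperature and how fast it is climbing. This is only a sketch of that arithmetic; the 30 second allowance for a graceful shutdown and the function names are my guesses, and I don't have real numbers for the rate of rise.

```python
def seconds_until(temp_c, rate_c_per_s, limit_c):
    """Estimate seconds before temp_c reaches limit_c at a constant rate."""
    if temp_c >= limit_c:
        return 0.0                 # already at or past the limit
    if rate_c_per_s <= 0:
        return float("inf")        # not rising, so no deadline
    return (limit_c - temp_c) / rate_c_per_s


def choose_shutdown(temp_c, rate_c_per_s, limit_c, graceful_needs_s=30.0):
    """Pick 'graceful' if a clean shutdown should finish in time, else 'hard'.

    graceful_needs_s is an assumed allowance for how long a clean
    shutdown takes on this box; measure your own.
    """
    if seconds_until(temp_c, rate_c_per_s, limit_c) > graceful_needs_s:
        return "graceful"
    return "hard"
```

So at 50 C, climbing 0.5 C per second toward an 80 C limit, there would be about 60 seconds left and 'shutdown -h now' should make it. At 70 C and 1 C per second there are only about 10, and the relay is the safer bet.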
3. Radiator cooling (fan) failure.
Symptom - Similar to 1 or 2 above, but slower (more water to heat)
Detection - Fan speed monitoring optimal, or look for temp increase.
Handling - Trigger alarm, turn on / increase speed of any redundant fans, other case fans. If temp rises beyond 'comfortable' levels, shut system down, preferably gracefully (there should be time)
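The fan case could be sketched along the same lines. This assumes you already have fan speeds in RPM from whatever monitoring you use; the 500 RPM floor, the 50 C 'comfortable' limit, and the names are placeholders of mine.

```python
def fan_response(fan_rpms, temp_c, min_rpm=500, comfortable_c=50.0):
    """Return (failed fan names, actions) for the radiator fan check.

    fan_rpms maps a fan name to its current RPM reading.
    """
    failed = [name for name, rpm in fan_rpms.items() if rpm < min_rpm]
    actions = []
    if failed:
        actions.append("alarm")
        actions.append("boost_other_fans")
    if temp_c > comfortable_c:
        # Temperature is the fallback check even if no fan reads as dead.
        actions.append("graceful_shutdown")
    return failed, actions
```

For example, fan_response({"rad1": 0, "rad2": 1800}, 42.0) flags rad1 and asks for an alarm plus a boost on the other fans, without shutting anything down yet.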
4. LEAKS!!!
Symptom - FLOOD, Possible major component damage!
Detection - Nothing I know of as standard PC equipment. I would consider modding a basement water detector, with the sensor(s) in the bottom of the case and other vulnerable spots. I would also consider a level sensor in the res, if you use one.
Mods to detector -
- Multiple sensors (a sensor is just two contacts separated by a space; I see no reason you couldn't have multiple sets wired in parallel)
- Add output to trigger shutdown relay
Handling - Two parts here, one beforehand, one after it happens:
Beforehand - Consider putting conformal coating or other waterproofing on all boards. Questions -
- Would this potentially void warranties?
- Could it have any thermal consequences for mildly hot parts?
- Could it have any functional effect (changing capacitive/inductive properties of traces for instance)?
- Would this influence any other mods on the board?
After the leak happens
SHUT DOWN ALL POWER IMMEDIATELY!!! That should minimize any damage from water shorts. (Most electronics can deal with being soaked if they aren't powered and are dried well before being powered back on; it's power and water together that do things in...)
5. Excessive temperature rise - This is kind of a last ditch defense intended to catch anything that didn't trip another alarm.
Symptom - Temperatures trending towards potential meltdown. CPU temperatures exceeding normal operation range significantly.
Detection - I see two points needed, 'Concern' and 'Panic'. 'Concern' I would trigger at, say, 10°C over normal maximum. 'Panic' I would trigger at 10°C over 'Concern' (or well below the CPU damage threshold). 'Concern' level triggering I would do with on-board monitoring. Ideally, 'Panic' level I would trigger with an off-board monitor, so that a process on the machine itself couldn't keep it from triggering.
Handling - Concern - start a graceful shutdown...
Handling - Panic - Shut down power immediately!
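The two trip points work out to something like the following. The 10 degree steps are the ones I suggested above; the function name and the idea of passing in a 'normal maximum' are just for illustration. This only shows the threshold logic, and the 'Panic' check would ideally live on a separate monitor rather than in software on the same box.

```python
def classify(temp_c, normal_max_c, step_c=10.0):
    """Return 'ok', 'concern', or 'panic' for a CPU temperature reading."""
    concern_at = normal_max_c + step_c   # 10 C over normal maximum
    panic_at = concern_at + step_c       # 10 C over 'Concern'
    if temp_c >= panic_at:
        return "panic"      # cut power immediately
    if temp_c >= concern_at:
        return "concern"    # start a graceful shutdown
    return "ok"
```

With a normal maximum of 50 C, that puts 'Concern' at 60 C and 'Panic' at 70 C.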
I know this was long, but tell me what you think...
Gooserider