Some recent work has got me thinking about how KODA goes about designing systems. More to the point: how do systems perform when things go wrong? This week's post contains a few notes on what you can do.

Redundancy

There are a few ways of interpreting the idea of 'redundancy'. Often, fully redundant systems are cost-prohibitive, so it's worthwhile targeting the areas that are exposed to the most risk, or that matter most to your purpose.

We've tried using redundant sensors on a roadheader. The main lesson was that the redundant sensors were exposed to exactly the same vibration, dust, dirt and water as the sensors they were backing up. Over the course of a project, the cost of maintaining redundant systems was much higher than the cost of not having a redundant system in the first place. So, ask yourself why you're doing it, and what risks you are protecting yourself against.

Also, when you're communicating with a client about your redundant systems, be careful not to promise a system that is 'infallible'. Many people hear 'redundant' but think 'unbreakable'. They're not the same thing, as there are often residual risks.

Residual Risks (and New Risks)

After you've put in place some systems to protect yourself against failures, be sure to look critically at what you've done. Firstly, look at what risks you've left unaddressed: these are the Residual Risks. Try to address them if you can, but if you can't, at least document them and tell others what they are and the likely circumstances in which they might occur.

The other things to note are New Risks, i.e. what risks have you just created by implementing these protection mechanisms?

A recent design process illustrates this nicely. It was determined that a distributed UPS was required to increase the availability of a monitoring system. Fair enough. One Residual Risk worth noting was that the UPS would last for only 2 days. Manageable. But a New Risk emerged when the batteries had dropped below 50% charge and power was restored. Multiple UPS charge circuits, in their eagerness to restore the batteries to 100%, would all start drawing currents well above what the cabling and main circuit were designed to cope with. The result was a system that could protect itself but could not recover.
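To make the recovery problem concrete, here is a rough back-of-envelope sketch. Every figure in it (number of UPS units, per-unit charge current, base load, circuit rating) is a hypothetical value chosen for illustration, not a number from the project described above. It simply adds up the current drawn if every unit recharges at once, and works out how many units the circuit could recharge at the same time.

```python
def total_charge_current(n_units: int, charge_current_a: float) -> float:
    """Current drawn if every UPS recharges at full rate at the same time."""
    return n_units * charge_current_a


def max_simultaneous_chargers(circuit_rating_a: float,
                              base_load_a: float,
                              charge_current_a: float) -> int:
    """How many units can recharge at once without exceeding the circuit rating."""
    headroom = circuit_rating_a - base_load_a
    if headroom <= 0 or charge_current_a <= 0:
        return 0
    return int(headroom // charge_current_a)


# Hypothetical figures, for illustration only.
N_UNITS = 12
CHARGE_CURRENT_A = 3.0   # per-unit recharge current
BASE_LOAD_A = 6.0        # normal running load of the monitoring system
CIRCUIT_RATING_A = 20.0  # what the cabling and main circuit are rated for

demand = BASE_LOAD_A + total_charge_current(N_UNITS, CHARGE_CURRENT_A)
allowed = max_simultaneous_chargers(CIRCUIT_RATING_A, BASE_LOAD_A, CHARGE_CURRENT_A)

print(f"Simultaneous recharge demand: {demand:.1f} A (rating {CIRCUIT_RATING_A:.1f} A)")
print(f"Units that can recharge at once: {allowed}")
```

A check like this at design time would flag the New Risk early. Whether you then stagger the recharge, limit the charge current or upsize the cabling is a separate design decision.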

Repairability

With the roadheader system above, we started to design systems that were easy to repair. For certain aspects of the systems it became necessary to accept that damage was inevitable. Yes, of course, do everything you can to lengthen the time between failures, but also invest some time in minimizing the time to repair.
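The trade-off between time between failures and time to repair can be put into numbers with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below uses made-up hours purely to show the shape of the comparison; they are not measurements from any KODA system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is working."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
baseline = availability(mtbf_hours=500.0, mttr_hours=24.0)
longer_life = availability(mtbf_hours=1000.0, mttr_hours=24.0)   # failures half as often
faster_repair = availability(mtbf_hours=500.0, mttr_hours=2.0)   # same failures, quick swap

for name, value in [("baseline", baseline),
                    ("longer life", longer_life),
                    ("faster repair", faster_repair)]:
    print(f"{name:>13}: {value:.2%}")
```

With these assumed numbers, cutting the repair time improves availability more than doubling the time between failures does. The real balance depends on your own failure and repair data.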

Failure Mechanism

One last point worth making is about 'understanding the failure mechanism' of your systems. The main question you're asking yourself here is whether your systems remain accurate/safe when something goes wrong. Systems that fail dramatically are easy to detect, but it is those that fail in more subtle ways that can challenge you.

One tricky scenario occurs with EL tilt sensors. Zero volts is a valid reading from these sensors, but you also get zero volts when the cable is cut or power is lost. A conundrum.
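One common way around this kind of ambiguity, borrowed from 4-20 mA current loops rather than taken from the sensors described above, is a 'live zero': the signal is offset so that a healthy sensor never outputs zero volts. The sketch below assumes a hypothetical sensor interface scaled to 0.5-4.5 V over a range of plus or minus 10 degrees. The voltage limits, range and fault margin are all assumptions for illustration.

```python
from typing import Optional

# Hypothetical scaling, for illustration only.
V_MIN = 0.5             # volts at full negative tilt
V_MAX = 4.5             # volts at full positive tilt
TILT_RANGE_DEG = 10.0   # sensor measures -10 to +10 degrees
FAULT_MARGIN_V = 0.25   # tolerance outside the valid band before declaring a fault


def read_tilt(voltage: float) -> Optional[float]:
    """Return tilt in degrees, or None if the voltage indicates a fault."""
    if voltage < V_MIN - FAULT_MARGIN_V or voltage > V_MAX + FAULT_MARGIN_V:
        return None  # cut cable, lost power or shorted signal
    # Clamp small overshoots inside the margin back into the valid band.
    clamped = min(max(voltage, V_MIN), V_MAX)
    mid = (V_MIN + V_MAX) / 2
    half_span = (V_MAX - V_MIN) / 2
    return (clamped - mid) / half_span * TILT_RANGE_DEG


print(read_tilt(2.5))   # level sensor
print(read_tilt(0.0))   # broken cable or no power
```

Here a level sensor sits at mid-scale, so zero volts can only mean a fault. The Residual Risk is any failure that leaves a plausible in-range voltage on the line, which this check will not catch.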

Final Word

This discussion is by no means exhaustive, but it is intended to highlight a few ideas above and beyond the obvious. KODA would like to hear your thoughts on the topic.

So Like, Learn and Contribute! Until next time...

KODA