V1.03 30-Jun-03
Even though this tragic event did not involve a radio control robot, it goes to show that even after a failsafe unit is fitted, it is imperative that users do not assume that everything is now safe and that they can do whatever they want. In this case, the failsafe was not adequate for the job. In our case, with 100kg robots with lethal weapons fitted it is even more important.
The failsafes used on many of the robots that compete in the Robot Wars competition are anything but that. The fundamental property of a failsafe is that if anything fails in the circuit, then the machine shall be rendered into a safe condition. The failsafes used in Robot Wars often require a separate channel of the radio control system, and operate a servo to disable the electrical power to the robot. However, any failure of this servo mechanism, which is not at all unlikely, could cause the robot to enter a dangerous condition where remote control commands are ignored, and the robot is still fully operational.
The failsafe circuits presented here are true failsafes. If any one of the components in the circuit fails, then the system will disconnect power from the robot. The circuit may survive multiple component failures and still return the robot to a safe condition, but it is only guaranteed to do so for any single component failure. Since there are so many options available to robot builders, I haven't presented a single circuit here but a variety of circuits. Any of them could be used, but most builders will need only a selection, since not every robot incorporates every feature. For example, not everyone will have a microcontroller onboard. Those that do may not be using serial communications as the control mechanism. Some people may be using PPM radio systems, and some PCM systems. The circuits below generally assume a PPM system since PCM systems will often have self-checking facilities built in.
Note that possible shorts to ground or Vcc on the tracks of the circuit could make the circuit malfunction (i.e. the solenoid could remain powered up during a fault condition), but faults on tracks like this are orders of magnitude less likely than component faults, especially if the failsafe circuit board is mounted inside its own enclosure. You could also conformal coat the PCB so that moisture and metallic swarf getting onto the board cannot cause short circuit faults.
I’ll start with the mechanical device. This can be built from scratch, using a solenoid and your own contactor arrangement, or a commercial relay may be used. It is important that the relay is not latching, i.e. there is a mechanical force permanently pulling the relay contacts open. It is only the correct operation of all the circuitry on the robot which enables the relay or solenoid to be energised to overcome the mechanical force:
The advantage of building your own is that multiple springs may be used to pull the contacts open. This ensures that if one spring breaks or becomes unattached, there is at least one more capable of opening the contacts.
The solenoid is powered by the circuitry and overcomes the spring tension to close the contacts. It may not be necessary for the solenoid to be able to close the contacts, just to keep them closed once they are closed, which is a much easier requirement. The closure can be performed manually before a bout, to "arm" the robot.
The circuitry that powers the solenoid depends upon how the robot is designed. In its simplest form, it simply monitors the pulses on each of the RC channels, and if these ever disappear for more than a set time, then the solenoid power is turned off. If the robot incorporates a microcontroller, then a more sophisticated system can be built, where an output line from the micro is also required to pulse regularly. If this pulse ever stops (because of a micro fault or a software crash) then the solenoid will also be de-energised and the contacts will open.
The first thing you will notice in this circuit is that there are two power Darlington transistors in series with the solenoid. If there was only one, and that failed short circuit, then the solenoid would be permanently powered on. This configuration prevents that.
The line from the detection circuits must be always high for the solenoid to stay powered and the power contacts to remain closed. The 10k resistors to ground from the base of the transistors ensure that if the line from the detector circuits becomes open circuit, then the transistors will definitely turn off.
Vpp is the positive power rail and may be 10v, 12v or a voltage generated from the electronics/receiver battery. This circuit should not be powered from the main power battery due to the possible effects of interference from the motors. Powering from the receiver battery also means that if this starts to run low, then the failsafe will kick in before any other electronics can start to misbehave.
Darlington power transistors are used because the solenoid may require a fairly large current to keep powered on. The LED must be a low current type (2mA) and will light when the transistors turn off, indicating that the failsafe has operated.
This is the detector chain. In this configuration, if any of the signals from the detection circuits goes low, then the output to the solenoid driver will go low.
The circuit must be designed such that if any of the components in it fail open or short circuit, then the signal powering the solenoid is disabled. If the signal ever goes permanently high or permanently low, then the circuit must disable the line to the solenoid power circuit. This is achieved by high-pass filtering the signal from the RC receiver. This will allow the pulse edges to pass, whilst blocking any DC signal:
The signals at various points in the circuit are shown in the timing diagram below:
The C1-R1 pair form the high pass filter. This strips the DC component off the signal to give a series of odd-shaped pulses as shown in waveform B. The opamp simply buffers this waveform so that it can source current into the much larger smoothing capacitor C2. The diode and C2 form a standard half-wave rectifying circuit. The voltage on the capacitor is bled through R2 so that the voltage will return to zero fast enough when a fault occurs. The R2-C2 combination has a time constant of 47 milliseconds, which means the voltage will be bled down to 1% within 0.22 seconds. The Robot Wars rule is that the robot should disable within 1 second of the radio signal stopping.
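The 0.22 second figure follows from the standard exponential discharge of a capacitor through a resistor. As a worked check, using only the 47 ms time constant quoted above (the individual R2 and C2 values are not needed):

```latex
V(t) = V_0 \, e^{-t/(R_2 C_2)}, \qquad
\frac{V(t)}{V_0} = 0.01
\;\Rightarrow\;
t = R_2 C_2 \ln 100 \approx 4.6 \times 47\,\text{ms} \approx 0.22\,\text{s}
```

This leaves a comfortable margin inside the 1 second limit, even allowing for component tolerances.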
If the opamp in this circuit fails such that its output rises high, then this part of the circuit would always send an “OK” signal to the detector transistor chain. This is why it is important that all the radio channels are connected to these detector circuits. It would then require all the opamps to fail.
If you have a four channel radio control system, you may be tempted to use a quad opamp IC like an LM324 or TL084, and use one amplifier in the package for each radio channel. This should be avoided due to the fact that a single fault within the IC may cause all four opamps to fail and present a high voltage output. Remember, the whole circuit should be capable of exhibiting a single circuit fault and still be able to shut down the solenoid and break the power contact.
This problem can be rectified by adding a circuit that monitors the width of the pulses, shown below. To reduce the size of the circuit somewhat, this monitoring section does not need to be repeated for every RC channel. It may be possible to use just one monitoring circuit that analyses the pulses from every RC channel, since the pulses arrive in series and not concurrently. However the circuit shown uses two detectors that monitor alternate pulses. This is done because the detection circuit has a 2ms pulse generator beyond which it is detecting faulty pulses, but the next pulse from the RC receiver may arrive before this 2ms checking time. This is due to the encoding scheme used in the Tx to Rx section of the radio link, which is slightly different from the servo encoding scheme. See this document for details of this.
A pulse width detection circuit for use on an 8-channel RC system is shown below:
Circuit description
The eight channels go to alternate 74HC373 latch inputs (so that no two adjacent channels end up supplying the same pulse detector section). The 74HC373 is there simply to buffer the logic inputs, not to work as a latch. When the pulse arrives, two monostables produce pulses of length 0.9ms and 2.1ms. A fault is deemed to have occurred if the input pulse goes back low before the 0.9ms pulse finishes or if the input pulse is still high when the 2.1ms pulse finishes. Note that a 0.1ms safety margin is used since the pulse may not be exactly within the 1ms - 2ms limits. This functionality can be represented by the following logic equation:
A = 0.9ms pulse
B = 2.1ms pulse
P = Input servo pulse
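The equation itself is not reproduced in this copy of the page. Reading the two fault conditions straight from the description, one plausible form is FAULT = (A AND NOT P) OR (NOT B AND P). The sketch below expresses that in C so the truth table can be checked. It is a reconstruction from the prose, not necessarily the author's exact equation. The function name `pulse_fault` is hypothetical, and propagation delays between P and the monostable outputs are ignored.

```c
#include <stdbool.h>

/* a: output of the 0.9 ms monostable (high while its window is open)
 * b: output of the 2.1 ms monostable (high while its window is open)
 * p: the input servo pulse
 * Returns true when the pulse is outside the accepted width range. */
bool pulse_fault(bool a, bool b, bool p)
{
    bool too_short = a && !p;  /* pulse fell while the 0.9 ms window was still open */
    bool too_long  = !b && p;  /* pulse still high after the 2.1 ms window closed   */
    return too_short || too_long;
}
```

With all three signals low (the idle state between pulses) the function reports no fault, which matches the requirement that the detector only objects to badly sized pulses.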
Since there are two separate pulse checking sections, the results of the two are ANDed together so that if either goes low (indicating a fault condition) then the following latch circuit (U9:A and U9:B) will be triggered. The latch is reset on power-up by C5/R5, and can be reset manually by momentarily activating the switch (or by taking U9:B pin 5 low momentarily with a logic signal line).
Software method
If a microcontroller is being used in the robot, and these servo control signals are also being used, then it would be a good idea to perform this pulse width checking in software, since this hardware solution uses a lot of ICs. See section 4 for details of software failsafes.
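As a rough illustration of what the software check might look like, the sketch below applies the same 0.9ms to 2.1ms window used by the hardware detector. It assumes the pulse width has already been measured in microseconds (for example by a timer capture peripheral). The names `pulse_width_ok`, `all_channels_ok`, `PULSE_MIN_US` and `PULSE_MAX_US` are hypothetical and not taken from the original pages.

```c
#include <stdbool.h>
#include <stdint.h>

#define PULSE_MIN_US  900u   /* 1 ms nominal minimum less 0.1 ms margin */
#define PULSE_MAX_US 2100u   /* 2 ms nominal maximum plus 0.1 ms margin */

/* True if one measured servo pulse lies inside the accepted window. */
bool pulse_width_ok(uint16_t width_us)
{
    return width_us >= PULSE_MIN_US && width_us <= PULSE_MAX_US;
}

/* True only if every channel's most recent pulse is acceptable. */
bool all_channels_ok(const uint16_t *widths_us, unsigned channels)
{
    for (unsigned i = 0; i < channels; i++) {
        if (!pulse_width_ok(widths_us[i])) {
            return false;
        }
    }
    return true;
}
```

A result of false would simply stop the software from pulsing the failsafe I/O line, so the hardware detector removes the solenoid power.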
The I/O line should be set to an output, and should be pulsed by the software from within the main control loop. This rather depends on the method of control software that is being used. Refer to the Embedded Software page in section 3.4 for descriptions of various strategies for control loops. For the simple main control loop, timed control loop, state machine, and counter controlled sequencer methods, the pulsing should be placed at the end of the control loop. This ensures that if the software crashes for any reason throughout the loop, the pulses will not be sent and the solenoid power will be disabled. The I/O line pulsing should never be placed in an Interrupt Service Routine (unless the whole operation is triggered from an ISR) since these may continue to operate even if the software has crashed.
An example using the Timed Control Loop method (where the whole operation is triggered from an ISR) is shown below:
void main(void)
{
    /* Setup the timer interrupt */
    SetupTimerInterrupt();
    /* All further work is done from the timer ISR */
    GoToSleep();
}

void interrupt [T1] TimerInterruptServiceRoutine()
{
    int RxError;

    RxError = GetRadioCommands();
    if (RxError == TRUE)
    {
        EmergencyPowerDown();
    }
    else
    {
        ControlMovement();
        ControlWeapons();
        DoOtherStuff();
    }
    /* Only reached if everything above ran to completion */
    PulseFailsafeLine();
}
The I/O output can be fitted onto a detector circuit just the same as in section 2.2.1. If the function is visited less often than the 20ms of the RC receiver line pulses, then R1 should be increased according to the equation:
There are several methods of performing error detection. The simplest is a checksum. This involves adding together all the bytes in the data message into a single byte. The sum is highly likely to overflow, but that does not matter since only the low eight bits are kept. The checksum byte is then transmitted at the end of the data message. The receiving microcontroller does the same, adding up all the bytes that it received, and compares this with the checksum byte that was transmitted. If they do not agree then there must have been some error in the transmitted data, and that data should not be acted upon. After a second or so of error-strewn data, the failsafe power-down can be triggered.
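A minimal sketch of the additive checksum described above is given here. The message layout (payload bytes followed by a single checksum byte) and the function names `checksum8` and `message_ok` are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sum all bytes modulo 256; overflow is deliberately discarded. */
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}

/* msg holds the payload followed by one checksum byte; len is the total
 * length including that checksum byte. */
bool message_ok(const uint8_t *msg, size_t len)
{
    if (len < 2) {
        return false;            /* need at least one data byte plus checksum */
    }
    return checksum8(msg, len - 1) == msg[len - 1];
}
```

A message that fails `message_ok` would be discarded, and a run of failures lasting a second or so would trigger the failsafe as described above.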
A more secure scheme is the Cyclic Redundancy Check. This performs a more complex logical operation on the data to produce a two byte CRC value which is transmitted in a similar fashion to the checksum. Example code using a CRC is shown in the "Commands.C" file on the Embedded pages.
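For readers without access to that file, the sketch below shows one common two byte CRC: the variant usually called CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF, no bit reflection). The choice of variant is an assumption, and the author's "Commands.C" may well use a different polynomial or initial value. The function name `crc16_ccitt` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 with polynomial 0x1021 and initial value 0xFFFF.
 * Slow but compact, which suits a small microcontroller. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);   /* bring next byte into the top bits */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) {
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

The transmitter and receiver must of course use the same variant. The receiver recomputes the CRC over the received data bytes and rejects the message if it differs from the two bytes that were sent.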
Even more complex schemes allow not only error detection, but error correction also. Some articles covering these schemes are listed in the Links section below.
The microcontroller UART that is used to receive the signal will also have error detection facilities. Typically, Overrun Error (byte received before software could read it), Framing Error (didn't detect a stop bit), and Parity Error (data corruption). These can be used also to increment an error counter that can cause the failsafe to activate when a threshold is reached.
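One way the error counter might be arranged is sketched below. It counts consecutive bad receptions and clears the count whenever a clean one arrives. Both that reset policy and the threshold of 50 are assumptions chosen for illustration, as are the names `link_monitor_t` and `link_monitor_update`. How the three error flags are read from the UART status register is device specific and is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define ERROR_TRIP_THRESHOLD 50u   /* assumed value; tune to about 1 s of bad data */

typedef struct {
    uint16_t error_count;          /* consecutive receptions with an error */
} link_monitor_t;

/* Call once per received byte with the UART error flags for that byte.
 * Returns true when the failsafe should be activated. */
bool link_monitor_update(link_monitor_t *m, bool overrun, bool framing, bool parity)
{
    if (overrun || framing || parity) {
        if (m->error_count < ERROR_TRIP_THRESHOLD) {
            m->error_count++;      /* saturate so the counter cannot wrap */
        }
    } else {
        m->error_count = 0;        /* clean byte: start counting again */
    }
    return m->error_count >= ERROR_TRIP_THRESHOLD;
}
```

The same counter could also be incremented by checksum or CRC failures, so that every kind of link error feeds one trip decision.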
The horizontal axis shows the received signal strength. This is a logarithmic axis, and the strength is in deciBels. The vertical axis is the output voltage from the RSSI pin of the module.
A circuit to perform the failsafe trip using this module is shown below:
Note the use of two preset resistors and two separate comparators so that either may pull the line low if the RSSI signal goes below the threshold. This allows one to fail and the circuit still to work. Comparators have an open collector output (they do not pull-up the output signal), so their outputs can be wired together like this to perform an OR operation. The preset resistors are tweaked until the voltage at their outputs is at a value which represents "no signal present" on the RSSI graph, about 0.5 Volts. The output of this circuit goes to one of the transistor base terminals in the detector chain of section 2.1.
Note that this weakness is only on systems which use the standard RC kit, not those which use microcontroller communication as described in sections 2.2.2 and 2.2.3.
How can we cope with this problem? I can envisage two possible solutions:
The first method requires hacking into the RC receiver, which may be difficult or impossible, and will certainly be different for every different type of receiver. The latter wastes a channel.
The software should be written in a "defensive failsafe" manner. The function that performs the failsafe checking must be written such that for the function to return an OK result requires that the process flows all the way through the function, and at any point within the function it can fail returning FALSE. A TRUE result from the function can toggle an I/O line as explained in section 2.2.2.
The use of interrupts must be carefully considered. Remember that even if your software has crashed and is running around executing random code, the interrupt service routines are still likely to be called, so putting the I/O toggle function inside an ISR is a very bad idea! The I/O toggle function should be placed inline with the main loop, at as high a level as possible so that virtually all the code must run correctly for that function to get called.
The actual I/O toggle command should be embedded at the end of a chain of "if" statements guaranteeing correct operation, so that it can only be executed if all the preceding conditions are correct. The following C/pseudo code demonstrates this:
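int io_toggle(void)
{
    /* The condition macros and IO_LINE are assumed to be defined elsewhere */
    if (RADIO_COMMS_CHECKSUM_CORRECT)
        if (RADIO_COMMANDS_ARE_SENSIBLE)
            if (WATCHDOG_HAS_NOT_TRIGGERED)
                if (OTHER_TEST_INPUTS_ARE_OK)
                {
                    /* Ok, we can toggle the failsafe IO line */
                    IO_LINE = !IO_LINE;
                    return TRUE;
                }
    /* Any failed test falls through to here without toggling */
    return FALSE;
}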
The design of reliable software is a large subject. Some insight into
it may be gleaned from the following article (in three parts):
Predicting Software Reliability. Part 1
Predicting Software Reliability. Part 2
Predicting Software Reliability. Part 3
| Manufacturer | Device |
| Texas Instruments | TL081 single opamp |
| Fairchild Semiconductor | BC549 NPN transistor |
| | TIP122 NPN power darlington |
| National Semiconductor | LM311 voltage comparator |
| | 74HC00 Quad NAND gate |
| | 74HC02 Quad NOR gate |
| | 74HC123 Dual monostable |
| | 74HC373 Octal latch |
| SGS Thomson | 74HC4072 Dual 4-input OR gate |
An example of the use of failsafe design in machine protection
http://www.engineeringtalk.com/news/mat/mat119.html
A short article on designing for failure
http://archives.e-insite.net/archives/ednmag/reg/1998/061898/13ed.htm
Links to articles on reliability engineering
http://www.chipcenter.com/eexpert/rpoltz/archive.html
Powertrac
http://www.powertrac.fsnet.co.uk/failsafes.htm
GWS Electronics
http://www.ukmodelshop.com/GWS_Electronics.htm
Reliability design articles by Robert Poltz (from Chipcenter)
Part 1
Part 2
Part 3
Part 4
Articles on reliability in software design by the same author
Part 1
Part 2
Part 3
The Golay code:
Article Part 1
Article Part 2
Article Part 3
Implementation Part 1
Implementation Part 2
Implementation Part 3
The BCH (Bose-Chaudhuri-Hochquenghem) code:
Part 1
Part 2
Other error correction article links:
http://www.epanorama.net/tele_datacom.html#ecc