A true failsafe

V1.03 30-Jun-03

1. Introduction

In my personal opinion, which is shared by some other significant members of the robot building community, the safety of our hobby is not taken as seriously as it should be. I would like to refer all readers to this document:

INQUEST TOUCHING THE DEATH OF ADAM KIRBY

Even though this tragic event did not involve a radio control robot, it goes to show that even after a failsafe unit is fitted, it is imperative that users do not assume that everything is now safe and that they can do whatever they want. In this case, the failsafe was not adequate for the job. In our case, with 100kg robots fitted with lethal weapons, it is even more important.

The failsafes used on many of the robots that compete in the Robot Wars competition are anything but that. The fundamental property of a failsafe is that if anything fails in the circuit, then the machine shall be rendered into a safe condition. The failsafes used in Robot Wars often require a separate channel of the radio control system, and operate a servo to disable the electrical power to the robot. However, any failure of this servo mechanism, which is not at all unlikely, could cause the robot to enter a dangerous condition where remote control commands are ignored, and the robot is still fully operational.

The failsafe circuits presented here are true failsafes. If any one of the components in the circuit fails, then the system will disconnect power from the robot. The circuit may be capable of sustaining multiple component failures and still return the robot to a safe condition, but it is only guaranteed to work for any single component failure. Since there are so many options available to robot builders, I haven't presented one single circuit here but a variety of circuits. Any of them could be used, but most builders will only need a selection, since not every robot incorporates all of the features. For example, not everyone will have a microcontroller onboard. Those that do may not be using serial communications as the control mechanism. Some people may be using PPM radio systems, and some PCM systems. The circuits below generally assume a PPM system since PCM systems will often have self-checking facilities built in.

Note that shorts to ground or Vcc on the tracks of the circuit could make the circuit malfunction (i.e. the solenoid could remain powered up during a fault condition), but track faults like this are orders of magnitude less likely than component faults, especially if the failsafe circuit board is mounted inside its own enclosure. You could also conformally coat the PCB to stop moisture ingress and metallic swarf from causing short circuit faults on the board.

2. Circuit diagram and description

It is important that each stage of the design is analysed to answer the question, “what would happen if that component went short circuit, or open circuit”. This results in a system where all the active components are in a series chain such that any one of them failing results in the chain failing. These components are all permanently overcoming a mechanical device which is trying to disconnect the power.

I’ll start with the mechanical device. This can be built from scratch, using a solenoid and your own contactor arrangement, or a commercial relay may be used. It is important that the relay is not latching, i.e. there is a mechanical force permanently pulling the relay contacts open. It is only the correct operation of all the circuitry on the robot which enables the relay or solenoid to be energised to overcome the mechanical force:

The advantage of building your own is that multiple springs may be used to pull the contacts open. This ensures that if one spring breaks or becomes unattached, there is at least one more capable of opening the contacts.

The solenoid is powered by the circuitry and overcomes the spring tension to close the contacts. It may not be necessary for the solenoid to be able to close the contacts, just to keep them closed once they are closed, which is a much easier requirement. The closure can be performed manually before a bout, to "arm" the robot.

The circuitry that powers the solenoid depends upon how the robot is designed. In its simplest form, it simply measures the pulses on each of the RC channels, and if these ever disappear for more than a set time, then the solenoid power is turned off. If the robot incorporates a microcontroller, then a more sophisticated system can be built, where an output line from the micro is also required to regularly pulse. If this pulse ever stops (because of a micro fault or a software crash) then the solenoid will also be opened.

2.1. Powering the solenoid

The solenoid is powered by a power transistor, whose control comes from the detection circuits:

The first thing you will notice in this circuit is that there are two power Darlington transistors in series with the solenoid. If there was only one, and that failed short circuit, then the solenoid would be permanently powered on. This configuration prevents that.

The line from the detection circuits must always be high for the solenoid to stay powered and the power contacts to remain closed. The 10k resistors to ground from the base of the transistors ensure that if the line from the detector circuits becomes open circuit, then the transistors will definitely turn off.

Vpp is the positive power rail and may be 10V, 12V or a voltage generated from the electronics/receiver battery. This circuit should not be powered from the main power battery due to the possible effects of interference from the motors. Powering from the receiver battery also means that if this starts to run low, then the failsafe will kick in before any other electronics can start to misbehave.

Darlington power transistors are used because the solenoid may require a fairly large current to keep it powered on. The LED must be a low current type (2mA) and will light when the transistors turn off, indicating that the failsafe has operated.

2.2. The detection circuits

Each of the detection circuits will have an output signal that must be high to keep the solenoid powered up. All these signals are logically ANDed together using a series set of NPN transistors as shown below:

This is the detector chain. In this configuration, if any of the signals from the detection circuits goes low, then the output to the solenoid driver will go low.

2.2.1. Lines from the RC receiver

The lines from the RC receiver carry regular pulses with a width between 1ms and 2ms, repeating at intervals of no more than 30ms. There is a document here which describes these signals in some detail, but I’ll reproduce the timing diagram here for convenience:

The circuit must be designed such that if any of the components in it fail open or short circuit, then the signal powering the solenoid is disabled. If the signal ever goes permanently high or permanently low, then the circuit must disable the line to the solenoid power circuit. This is achieved by high-pass filtering the signal from the RC receiver. This will allow the pulse edges to pass, whilst blocking any DC signal:

The signals at various points in the circuit are shown in the timing diagram below:

The C1-R1 pair form the high pass filter. This strips the DC component off the signal to give a series of odd-shaped pulses as shown in waveform B. The opamp simply buffers this waveform so that it can source current into the much larger smoothing capacitor C2. The diode and C2 form a standard half-wave rectifying circuit. The voltage on the capacitor is bled through R2 so that the voltage will return to zero fast enough when a fault occurs. The R2-C2 combination has a time constant of 47 milliseconds, which means the voltage will be bled down to 1% within 0.22 seconds. The Robot Wars rule is that the robot should disable within 1 second of the radio signal stopping.
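The 0.22 second figure follows from the exponential discharge of C2 through R2. As a worked check, using only the 47ms time constant quoted above and a target of 1% of the starting voltage:

```latex
V(t) = V_0 \, e^{-t/\tau}, \qquad
t_{1\%} = -\tau \ln(0.01) = 0.047\,\mathrm{s} \times 4.605 \approx 0.216\,\mathrm{s}
```

This leaves a wide margin inside the 1 second limit, even allowing for component tolerances.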

If the opamp in this circuit fails such that its output rises high, then this part of the circuit would always send an “OK” signal to the detector transistor chain. This is why it is important that all the radio channels are connected to these detector circuits. It would then require all the opamps to fail.

If you have a four channel radio control system, you may be tempted to use a quad opamp IC like an LM324 or TL084, and use one amplifier in the package for each radio channel. This should be avoided due to the fact that a single fault within the IC may cause all four opamps to fail and present a high voltage output. Remember, the whole circuit should be capable of exhibiting a single circuit fault and still be able to shut down the solenoid and break the power contact.

2.2.1.1 Checking the servo pulse widths

The method of detecting the pulses from the RC receiver as explained above in section 2.2.1 has a weakness: It does not validate the pulses themselves. If a fault exists in the RC receiver such that random pulses are transmitted, this will go undetected.

This problem can be rectified by adding a circuit that monitors the width of the pulses, shown below. To reduce the size of the circuit somewhat, this monitoring section does not need to be repeated for every RC channel. It may be possible to use just one monitoring circuit that analyses the pulses from every RC channel, since the pulses arrive in series and not concurrently. However the circuit shown uses two detectors that monitor alternate pulses. This is done because the detection circuit has a 2ms pulse generator beyond which it is detecting faulty pulses, but the next pulse from the RC receiver may arrive before this 2ms checking time. This is due to the encoding scheme used in the Tx to Rx section of the radio link, which is slightly different from the servo encoding scheme. See this document for details of this.

A pulse width detection circuit for use on an 8-channel RC system is shown below:


Circuit description
The eight channels go to alternate 74HC373 latch inputs (so that no two adjacent channels end up supplying the same pulse detector section). The 74HC373 is there simply to buffer the logic inputs, not to work as a latch. When the pulse arrives, two monostables produce pulses of length 0.9ms and 2.1ms. A fault is deemed to have occurred if the input pulse goes back low before the 0.9ms pulse finishes or if the input pulse is still high when the 2.1ms pulse finishes. Note that a 0.1ms safety margin is used since the pulse may not be exactly within the 1ms - 2ms limits. This functionality can be represented by the following logic equation:

A = 0.9ms pulse
B = 2.1ms pulse
P = Input servo pulse
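The fault condition described in the paragraph above can be written out in C as a rough check of the logic. This is a reconstruction from the prose description, not a copy of the original equation, and the function and variable names are illustrative. It also ignores propagation delays, which the real gates have to tolerate.

```c
#include <stdbool.h>

/* a: output of the 0.9ms monostable (true while it is timing)
   b: output of the 2.1ms monostable (true while it is timing)
   p: the incoming servo pulse (true while it is high)
   Returns true when the combination indicates a faulty pulse. */
bool pulse_fault(bool a, bool b, bool p)
{
    bool too_short = a && !p;   /* pulse went low before 0.9ms elapsed */
    bool too_long  = !b && p;   /* pulse still high after 2.1ms elapsed */
    return too_short || too_long;
}
```

In the hardware the result is active low, so the signal passed on to the latch would be the inverse of this function's return value.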

Since there are two separate pulse checking sections, the results of the two are ANDed together so that if either goes low (indicating a fault condition) then the following latch circuit (U9:A and U9:B) will be triggered. The latch is reset on power-up by C5/R5, and can be reset by momentarily activating the switch (or by taking U9:B pin 5 low momentarily with a logic signal line).

Software method

If a microcontroller is being used in the robot, and these servo control signals are also being used, then it would be a good idea to perform this pulse width checking in software, since this hardware solution uses a lot of ICs. See section 4 for details of software failsafes.
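As a minimal sketch of what that software check might look like, the function below applies the same 0.9ms and 2.1ms limits as the hardware circuit. It assumes the pulse width has already been measured in microseconds, for example by a timer capture input; that measurement step, the function name and the constant names are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PULSE_MIN_US  900u   /* 1ms nominal minimum less the 0.1ms margin */
#define PULSE_MAX_US 2100u   /* 2ms nominal maximum plus the 0.1ms margin */

/* Returns true if a measured servo pulse width is within the valid window. */
bool servo_pulse_width_ok(uint16_t width_us)
{
    return width_us >= PULSE_MIN_US && width_us <= PULSE_MAX_US;
}
```

A failed check would then be treated like any other fault, i.e. the failsafe output line is not toggled on that pass.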

2.2.2. Microcontroller I/O lines

If a microcontroller is used to control the wheel motors or weaponry, then it should be failsafe protected also. The best way to do this is for it to generate a series of pulses that can be treated in the same way as the RC receiver pulses in section 2.2.1. The pulses need not be the same width as the RC pulses; it is easier to make the output roughly a square wave by inverting the output line from the microcontroller each time the part of the code that controls it is visited.

The I/O line should be set to an output, and should be pulsed by the software from within the main control loop. This rather depends on the method of control software that is being used. Refer to the Embedded Software page in section 3.4 for descriptions of various strategies for control loops. For the simple main control loop, timed control loop, state machine, and counter controlled sequencer methods, the pulsing should be placed at the end of the control loop. This ensures that if the software crashes for any reason throughout the loop, the pulses will not be sent and the solenoid power will be disabled. The I/O line pulsing should never be placed in an Interrupt Service Routine (unless the whole operation is triggered from an ISR) since these may continue to operate even if the software has crashed.

An example using the Timed Control Loop method (where the whole operation is triggered from an ISR) is shown below:

    void main(void)
    {
        /* Setup the timer interrupt */
        SetupTimerInterrupt();
        GoToSleep();
    }
    
    void interrupt [T1] TimerInterruptServiceRoutine()
    {
        int RxError;
    
        RxError = GetRadioCommands();
    
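        if (RxError == TRUE)
        {
            /* Bad radio data: make safe and skip the failsafe pulse, so
               that repeated errors also let the hardware failsafe trip */
            EmergencyPowerDown();
        }
        else
        {
            ControlMovement();
            ControlWeapons();
            DoOtherStuff();
            /* Pulse only at the very end of a successful pass */
            PulseFailsafeLine();
        }
    }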
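The body of PulseFailsafeLine() is not given in the listing. One plausible form, which simply inverts the output bit on every call to give the rough square wave described earlier, is sketched here. The variable failsafe_port is an ordinary variable standing in for whatever port register your microcontroller provides, and FAILSAFE_BIT is an assumed bit position.

```c
#include <stdint.h>

/* Stand-in for the real output port register (hypothetical) */
static volatile uint8_t failsafe_port;

#define FAILSAFE_BIT 0x01u   /* assumed bit driving the detector circuit */

/* Invert the failsafe output so each call produces one edge */
void PulseFailsafeLine(void)
{
    failsafe_port ^= FAILSAFE_BIT;
}
```

Because the high-pass filter in the detector only passes edges, a line stuck at either level produces no output and the solenoid drops out.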

The I/O output can be fitted onto a detector circuit just the same as in section 2.2.1. If the function is visited less often than the 20ms of the RC receiver line pulses, then R1 should be increased according to the equation:

2.2.3. Microcontroller data communication checksums

If you have an onboard microcontroller and you are transmitting data over the radio link to control the robot, then this data should be error checked. If an error repeatedly occurs in this data, then the failsafe circuit should be activated. The easiest way to activate it is to make it stop sending the regular pulse output described in section 2.2.2 above.

There are several methods of performing error detection. The simplest is a checksum. This involves adding together all the bytes in the data message into a single byte. This is highly likely to overflow but we don't worry about that. The checksum byte is then transmitted at the end of the data message. The receiving microcontroller does the same, adding up all the bytes that it received, and compares this with the checksum byte that was transmitted. If they do not agree then there must have been some error in the transmitted data, and that data should not be acted upon. After a second or so of error-strewn data, the failsafe power-down can be enacted.
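The checksum scheme just described could be sketched as follows. The message layout (payload bytes followed by one checksum byte) is taken from the description above; the function names are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Add all bytes into a single byte; overflow simply wraps modulo 256 */
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}

/* msg holds the payload followed by the transmitted checksum byte.
   Returns true if the recomputed checksum matches the received one. */
bool message_ok(const uint8_t *msg, size_t len_with_checksum)
{
    if (len_with_checksum < 2) {
        return false;   /* need at least one data byte plus the checksum */
    }
    size_t payload_len = len_with_checksum - 1;
    return checksum8(msg, payload_len) == msg[payload_len];
}
```

Note that a simple sum cannot detect bytes arriving in the wrong order, or two errors that cancel each other out, which is one reason a CRC is the more secure choice.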

A more secure scheme is the Cyclic Redundancy Check. This performs a complex logical operation on the data to produce a two byte CRC value which is transmitted in a similar fashion to the checksum. Example code using a CRC is shown in the Embedded pages "Commands.C" file here
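For readers who want something to experiment with, here is a small bitwise CRC-16 routine. It is not the code from "Commands.C"; the choice of polynomial 0x1021 with an initial value of 0xFFFF (commonly called CRC-16/CCITT-FALSE) is an assumption made for this sketch, and any two byte CRC would serve provided both ends of the link use the same one.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16, polynomial 0x1021, initial value 0xFFFF, MSB first,
   no final XOR. Processes one bit at a time for clarity, not speed. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

The transmitter appends the two CRC bytes to the message, and the receiver recomputes the CRC over the payload and discards the message if the values differ. A table-driven version is faster if processor time is tight.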

Even more complex schemes allow not only error detection, but error correction also. Some articles covering these schemes are listed in the Links section below.

The microcontroller UART that is used to receive the signal will also have error detection facilities. Typically, Overrun Error (byte received before software could read it), Framing Error (didn't detect a stop bit), and Parity Error (data corruption). These can be used also to increment an error counter that can cause the failsafe to activate when a threshold is reached.
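One way such an error counter might be arranged is sketched below. It counts consecutive bad frames and reports the link as unhealthy once a threshold is reached. Clearing the count on every good frame, and the threshold of 50 (roughly one second at an assumed 50 frames per second), are design assumptions for this example and not values from the circuits above.

```c
#include <stdbool.h>
#include <stdint.h>

#define ERROR_THRESHOLD 50u   /* assumed: about 1s of bad frames at 50 frames/s */

typedef struct {
    uint8_t count;            /* run of consecutive bad frames */
} error_counter_t;

/* Call once per received frame, passing true if a checksum, overrun,
   framing or parity error was seen. Returns true while the failsafe
   pulse may continue to be sent. */
bool link_healthy(error_counter_t *ec, bool frame_had_error)
{
    if (frame_had_error) {
        if (ec->count < ERROR_THRESHOLD) {
            ec->count++;      /* saturate instead of wrapping round */
        }
    } else {
        ec->count = 0;        /* a good frame clears the run of errors */
    }
    return ec->count < ERROR_THRESHOLD;
}
```

When link_healthy() returns false the software simply stops toggling the failsafe output, and the hardware detector does the rest.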

2.2.4. Radio receiver RSSI signal

Some radio receivers have an RSSI signal output. This is the "Received Signal Strength Indicator". It is generally an analogue signal that increases in voltage as the received signal strength increases. It can be used as part of the failsafe by comparing the voltage with a preset limit. When it falls below this limit, indicating a very weak or non-existent signal, the failsafe can be activated. The RSSI of the Circuit Design CDP-02 459MHz telemetry receiver module is shown in the graph below. This module may be used in Robot Wars (with prior permission from Mentorn) for radio control data transmission. The graph shows the relationship between the received signal strength and the RSSI analogue voltage.

The horizontal axis shows the received signal strength. This is a logarithmic axis, and the strength is in decibels. The vertical axis is the output voltage from the RSSI pin of the module.

A circuit to perform the failsafe trip using this module is shown below:

Note the use of two preset resistors and two separate comparators so that either may pull the line low if the RSSI signal goes below the threshold. This allows one to fail and the circuit still to work. Comparators have an open collector output (they do not pull up the output signal), so their outputs can be wired together like this and either one can pull the shared line low. The preset resistors are tweaked until the voltage at their outputs is at a value which represents "no signal present" on the RSSI graph, about 0.5 volts. The output of this circuit goes to one of the transistor base terminals in the detector chain of section 2.2.

3. Circuit weaknesses

Dave Gamble of the Tornado team pointed out a significant weakness in the design presented above. This weakness is actually built into the radio control receiver module that is supplied with cheaper RC kits. This module accepts a waveform from the transmitter, but if that waveform is corrupted, many RC receivers will present garbage signals to the servos. A true failsafe RC receiver would only present signals to the servos when a valid signal was detected from the transmitter.

Note that this weakness is only on systems which use the standard RC kit, not those which use microcontroller communication as described in sections 2.2.2 and 2.2.3.

How can we cope with this problem? I can envisage two possible solutions:

The first method requires hacking into the RC receiver, which may be difficult or impossible, and will certainly be different for every different type of receiver. The latter wastes a channel.

4. Failsafe software

Due to the possible complexity of the hardware in the failsafe, it is tempting to use a software solution. However, this method must be treated with at least as much care as a hardware solution, if not more, since software is a lot more unpredictable than hardware.

The software should be written in a "defensive failsafe" manner. The function that performs the failsafe checking must be written such that for the function to return an OK result requires that the process flows all the way through the function, and at any point within the function it can fail returning FALSE. A TRUE result from the function can toggle an I/O line as explained in section 2.2.2.

The use of interrupts must be carefully considered. Remember that even if your software has crashed and is running around executing random code, the interrupt service routines are still likely to be called, so putting the I/O toggle function inside an ISR is a very bad idea! The I/O toggle function should be placed inline with the main loop, at as high a level as possible so that virtually all the code must run correctly for that function to get called.

The actual I/O toggle command should be embedded at the end of a chain of "if" statements guaranteeing correct operation, so that it can only be executed if all the preceding conditions are correct. The following C/pseudo code demonstrates this:

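    /* The upper case condition names are placeholders for your own tests.
       Returns TRUE only if every check passed and the line was toggled. */
    int io_toggle(void)
    {
      if (RADIO_COMMS_CHECKSUM_CORRECT)
      {
        if (RADIO_COMMANDS_ARE_SENSIBLE)
        {
          if (WATCHDOG_HAS_NOT_TRIGGERED)
          {
            if (OTHER_TEST_INPUTS_ARE_OK)
            {
              /* Ok, we can toggle the failsafe IO line */
              IO_LINE = !IO_LINE;
              return TRUE;
            }
          }
        }
      }
      /* Any failed check falls through to here without toggling */
      return FALSE;
    }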

The design of reliable software is a large subject. Some insight into it may be gleaned from the following article (in three parts):
Predicting Software Reliability. Part 1
Predicting Software Reliability. Part 2
Predicting Software Reliability. Part 3

5. Devices used in the circuits

The following devices were used in these circuits.

Manufacturer               Device
Texas Instruments          TL081 single opamp
Fairchild Semiconductor    BC549 NPN transistor
                           TIP122 NPN power Darlington
National Semiconductor     LM311 voltage comparator
Philips Semiconductors     74HC00 Quad NAND gate
                           74HC02 Quad NOR gate
                           74HC123 Dual monostable
                           74HC373 Octal latch
SGS Thomson                74HC4072 Dual 4-input OR gate

6. Links

6.1. Failsafe articles and documents

Recommendations of the BMFA (British Model Flying Association) with regard to failsafes:
http://home.clara.net/wbruce.ogilvy/text/beware/b5.html

An example of the use of failsafe design in machine protection
http://www.engineeringtalk.com/news/mat/mat119.html

A short article on designing for failure
http://archives.e-insite.net/archives/ednmag/reg/1998/061898/13ed.htm

Links to articles on reliability engineering
http://www.chipcenter.com/eexpert/rpoltz/archive.html

6.2. Commercial failsafe units

Note that none of these commercial units are true failsafe units – they require parts to operate correctly in order for the robot to become safe, which is inherently unsafe. However, they are approved by the Robot Wars technical team.

Powertrac
http://www.powertrac.fsnet.co.uk/failsafes.htm

GWS Electronics
http://www.ukmodelshop.com/GWS_Electronics.htm

6.3. Quality and reliability engineering links

Reliability design articles by Robert Poltz (from Chipcenter)
Part 1 Part 2 Part 3 Part 4

Articles on reliability in software design by the same author
Part 1 Part 2 Part 3

6.4. Data communication error detection and correction articles

The Golay code:
Article Part 1 Article Part 2 Article Part 3
Implementation Part 1 Implementation Part 2 Implementation Part 3

The BCH (Bose-Chaudhuri-Hochquenghem) code:
Part 1 Part 2

Other error correction article links:
http://www.epanorama.net/tele_datacom.html#ecc

