Focused Improvement – Strategies for Reducing Failures

The approach to zero failures incorporates the following four phases, each of which is described below:

Phase 1 – Eliminate forced deterioration
Phase 2 – Extend lifetimes through corrective maintenance
Phase 3 – Monitor and control deterioration
Phase 4 – Carry out predictive maintenance

(1) Eliminate forced deterioration

1 Classify failures

It is important to reduce the overall failure rate by eliminating the easy problems first when tackling failures. It is sometimes difficult to know where to start when faced with a mixture of different types of failure, such as simple ones (e.g. out-of-position sensors or broken wires), complex ones (e.g. broken gears or breakdown of control systems from unknown causes) and repetitive ones (occurring mainly in hydraulic systems, drive systems and other vital equipment systems).
According to a survey on the nature of breakdowns covering a large number of factories, however, about 70% of the total are simple failures while only 30% are complex ones, as Figure “Causes and the Equipment Elements where they Arise” illustrates. It is important to start by reducing the level of these simple failures to create enough time for maintenance personnel to reduce complex and recurring failures through corrective maintenance.

Causes and the Equipment Elements Where They Arise

By simple failures, we mean the type of failure that can be prevented through good Autonomous Maintenance. Some examples are:

Bearings seizing up because of insufficient lubrication.
Broken wires due to contact with equipment or excessive bending.
Malfunctions due to misalignment resulting from incorrect fixing of sensors.
Malfunctions due to ingress of coolant or other liquid into limit switches.
Bearings seizing up in hydraulic pumps.
Damaged V-belts.

Failures like these can be prevented if operators are trained to lubricate their equipment and check it using their five senses, so that they can detect problems such as loose or broken V-belts, overheating and abnormal noise in bearings. It is essential to make maximum use of operators’ abilities to reduce failures.
As mentioned earlier, it is difficult to identify failure trends on a single machine, because successive failures tend to occur randomly, but it is sometimes possible to identify trends if failures are classified by equipment group and location of occurrence.

The aims of classifying failures are:

To identify equipment weaknesses.
To highlight deficiencies in equipment management.
To clarify priorities for implementing solutions.
To identify the support that operators require for carrying out Autonomous Maintenance (e.g. training in inspection and lubrication) and the kinds of things they should be asked to do (e.g. checking and lubricating their equipment, and detecting problems at an early stage).

Failures can be classified in various ways. As explained, doing so helps to identify weaknesses in groups of similar equipment, locations of related causes, and weaknesses in equipment management. Six useful bases for classifying failures are described below.

(a) By production line or equipment group

If the output and OEE of particular lines or equipment groups have declined as a result of frequent breakdowns, it is first necessary to identify which types of equipment are failing most often. The Autonomous Maintenance activities of the operators and the specialized maintenance activities of the maintenance department should then be concentrated on these.

(b) By location

Failures can also be classified according to which part of the equipment they occur in. The categories specified for this might include fasteners (nuts, bolts and other fixing devices), lubrication systems, drive systems, pneumatic systems, hydraulic systems, electrical systems, control systems, sensors, and jigs/tools. This system of classification will reveal weaknesses in particular equipment groups and lines and enable the maintenance activities to be prioritized.
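Classification by equipment group and by location can be sketched as a simple tally. The following is a minimal illustration only: the failure log, the line names and the function name `rank_by` are hypothetical and not taken from any real survey.

```python
from collections import Counter

# Hypothetical failure log: one (equipment_group, location) pair per breakdown.
FAILURE_LOG = [
    ("press line", "hydraulic system"),
    ("press line", "sensors"),
    ("press line", "fasteners"),
    ("welding line", "sensors"),
    ("welding line", "electrical system"),
    ("press line", "sensors"),
    ("welding line", "fasteners"),
    ("press line", "hydraulic system"),
]


def rank_by(records, index):
    """Count failures by one field of each record, most frequent first."""
    counts = Counter(record[index] for record in records)
    return counts.most_common()


by_group = rank_by(FAILURE_LOG, 0)     # which lines fail most often
by_location = rank_by(FAILURE_LOG, 1)  # which equipment elements fail most often

print(by_group[0])     # ('press line', 5)
print(by_location[0])  # ('sensors', 3)
```

In this made-up log the press line and its sensors would be the first candidates for concentrated maintenance activity.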

(c) By mode of occurrence

Classifying failures according to their mode of occurrence (cracking, breaking, deformation, wear, corrosion, leaking, loosening, etc.) is a useful way of sorting out what type of corrective maintenance (extending service life, preventing recurrence, etc.) needs to be undertaken for each.

(d) By cause

Failures may also be classified according to their cause (inadequate basic conditions, improper use, failure to reverse deterioration, faulty design, lack of skill, etc.) to pinpoint equipment management weaknesses and decide what needs to be done next.

(e) By history

Another method is to classify failures according to whether or not they have happened before. Classifying them in this way will shed light on the following points:

The MTBF of recurring failures.
The effectiveness of any steps taken to prevent failures recurring.
If there are no plans to extend the lifetimes of particular components at present, has the checking of these components been incorporated into the periodic maintenance calendar?
The way in which similar equipment should be inspected.
The most important issues that need to be addressed through Autonomous Maintenance.
The most important issues that need to be addressed through specialized maintenance.
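The MTBF of a recurring failure is conventionally calculated as operating time divided by the number of failures in that period, and comparing it before and after a countermeasure gives a rough indication of whether the countermeasure worked. The sketch below assumes that convention; the function name and the figures are hypothetical.

```python
def mtbf_hours(operating_hours, failure_count):
    """Mean time between failures = operating time / number of failures."""
    if failure_count == 0:
        return None  # no failures observed, so MTBF is undefined for the period
    return operating_hours / failure_count


# Hypothetical figures for one recurring failure, over equal operating periods.
before = mtbf_hours(600, 6)  # 100.0 hours before the countermeasure
after = mtbf_hours(600, 2)   # 300.0 hours after the countermeasure

print(before, after)
```

A longer MTBF after the countermeasure suggests it was effective; an unchanged figure suggests the true cause has not yet been found.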

(f) By relation to Autonomous Maintenance

Failures can also be classified according to whether they could be prevented through Autonomous Maintenance (‘simple failures’), or whether they could only be prevented by the maintenance department (‘complex failures’). The distinction between simple and complex failures is not clear-cut, but experience tells us that operators are capable of preventing the following types of failure through good Autonomous Maintenance:

(i) Failures whose warning signs can easily be detected through the five senses:

those associated with faults that can be noticed by looking at, touching or moving the equipment (e.g. limit switches, sensors, wires and bearings).
those associated with parts of the equipment that can be seen from the outside, without having to take it apart.

(ii) Failures whose warning signs can be detected through partial disassembly:

those associated with faults that can be noticed by looking at, touching or moving the equipment after partially disassembling it (faults such as deformation, wear, slackness, play, etc.), e.g. wear of speed reducer gears, backlash in key slots.

(iii) Failures whose warning signs can easily be detected through the use of simple measuring instruments:

those associated with faults that can be detected through the use of dial gauges, feeler gauges, spirit levels, etc. (e.g. misalignment, eccentricity, tilting, etc.).

The situation is bound to change as the standard of Autonomous Maintenance rises, but it is important to draw a line at a certain level and classify the failures in this way because doing so will clarify the following:

The priorities in carrying the Autonomous Maintenance programme forward.
The order in which the Autonomous Maintenance tasks (inspection techniques, etc.) should be taught.
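Drawing the line between simple and complex failures can be expressed as a small rule. This is only a sketch of the three categories listed above; the category labels and the function name are illustrative assumptions, and the set would be widened as the standard of Autonomous Maintenance rises.

```python
# Detection methods assumed to be within operators' reach, following
# categories (i) to (iii) above. The labels themselves are hypothetical.
OPERATOR_DETECTABLE = {
    "five senses",
    "partial disassembly",
    "simple measuring instrument",
}


def classify_failure(detection_method):
    """Return 'simple' if operators could have detected the warning signs."""
    if detection_method in OPERATOR_DETECTABLE:
        return "simple"
    return "complex"


print(classify_failure("five senses"))         # simple
print(classify_failure("vibration analysis"))  # complex
```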

Table “Responsibility of Autonomous Maintenance for Failure Causes” indicates the aspects of equipment management that fall within the remit of Autonomous Maintenance.

2 Analyze failures

(a) Why do failures happen?

All production equipment is subjected to stress of one type or another. It may be operational (mechanical or electrical) stress applied to make the equipment work, or environmental stress (temperature, humidity, vibration, dust, etc.) coming from its surroundings. When the stress applied exceeds the equipment’s strength, it breaks down (see Figure “A Stress-Strength Model”).
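The stress-strength idea can be illustrated with a small numerical sketch. It assumes, purely for illustration, that stress is constant and that strength declines linearly with operating hours; the function name, units and figures are hypothetical and are not part of the model in the figure.

```python
def hours_until_failure(initial_strength, stress, deterioration_per_hour):
    """Operating hours until strength has fallen to the applied stress.

    Assumes constant stress and linear loss of strength (an illustrative
    simplification, not a physical law).
    """
    if stress >= initial_strength:
        return 0.0  # too weak to begin with: fails immediately
    if deterioration_per_hour <= 0:
        return None  # strength never falls, so no failure is predicted
    return (initial_strength - stress) / deterioration_per_hour


natural = hours_until_failure(100.0, 60.0, 0.5)  # 80.0 hours
forced = hours_until_failure(100.0, 60.0, 1.0)   # 40.0 hours

print(natural, forced)
```

Under these assumptions, doubling the rate of deterioration halves the time to failure, while raising the stress or starting with lower strength shortens it in the same way.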

In other words, equipment breaks down when excessive stress combines with insufficient strength (the equipment may have been too weak to begin with, or it may have been allowed to weaken through deterioration). This situation can be caused by one or more of the following five factors:

Failure to maintain basic conditions
Failure to observe correct operating conditions
Failure to reverse deterioration
Failure to correct design weaknesses
Lack of skill

All of these are human responsibilities. Equipment does not want to break down; people make it break down. The corollary of this is that it is only people who can prevent it from breaking down, and they can only do so if they change their attitudes and behaviours. Figure “Relationship between Causes and Effects of Breakdowns” summarises the causes of different types of failure.

Relationship between Causes and Effects of Breakdowns

(b) The need for failure analysis

In most factories, maintenance personnel are assigned to a particular line or piece of equipment; when it breaks down, they repair it and try to find the cause of the problem. However, as explained earlier, about 70% of failures can be prevented through good Autonomous Maintenance, and, in any case, failure rates will never go down to zero if all the maintenance work is left up to the maintenance personnel. Every operator needs to take personal responsibility for failures of their equipment and learn as much as possible from them. Although they cannot undo a failure once it has happened, they should resolve never to allow the same failure to happen twice.
It is possible to prevent failures from recurring if each breakdown is treated like a real-life case study and a careful analysis is made of the causes of the problem, whether or not there were any warning signs, the quality of the inspections carried out, and what kind of remedial action was taken. Operators must conduct this kind of failure analysis in conjunction with their Autonomous Maintenance activities to identify the causes of the routine failures that they see almost daily, to highlight the areas where there are weaknesses or where something is lacking, to improve their knowledge and skills, and to develop themselves into experts on their own equipment.

(c) How to carry out a good failure analysis

(i) Pinpoint the phenomenon

Start by visiting the scene of the crime immediately after the breakdown has happened; examine the equipment and materials there and interview the operators to find out how the equipment stopped, which components are broken and in what way they are broken, when the last similar breakdown occurred, and whether or not there were any warning signs. All this information should be recorded on a form provided for the purpose.
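The recording form mentioned above could be modelled as a simple record that flags any question not yet answered. This is a hypothetical sketch: the class name, the field names and the example entries are assumptions based on the questions listed in the paragraph, not an actual form.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    """One sheet per failure; an empty string means 'not yet filled in'."""
    equipment: str
    how_it_stopped: str = ""
    broken_components: str = ""
    how_they_broke: str = ""
    last_similar_breakdown: str = ""
    warning_signs: str = ""

    def unanswered(self):
        """Return the names of the fields that are still blank."""
        return [name for name, value in vars(self).items() if not value.strip()]


report = FailureReport(equipment="conveyor 3", how_it_stopped="drive motor tripped")
print(report.unanswered())
```

Listing the blank fields makes it harder to close a report before the scene has been examined and the operators interviewed.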

(ii) Take interim action

If a part is broken, it should of course be replaced with a new one to restart production as soon as possible, but it hardly needs to be said that this in itself is not a true countermeasure. Many people, however, do mistakenly treat this kind of stopgap action as a complete solution.

(iii) Prepare to investigate the causes

Owing to a lack of knowledge about how the equipment works, how it is constructed and how it should be correctly used, problems are often dealt with randomly. The upshot is that the problem repeats itself or other problems occur close to the part that was repaired. This happens when the people involved replace parts or strengthen the equipment without really understanding the causes of the problem. To determine the causes of the problem correctly and take effective action, it is essential to understand the equipment functions, internal structure and correct method of operation using systems diagrams and sketches of the equipment and the failed parts.
The next step is to work out what condition the functional parts should ideally be in, based on engineering principles and parameters, and draw up a list of items that need to be checked to ensure that both the minimum necessary and the optimal conditions are satisfied. The equipment must then be checked thoroughly using this list to identify and correct all deficiencies.

(iv) Track down the causes

The reasons why the deficiencies identified in the previous step have occurred should then be tracked down using Why-Why Analysis (see Figure “Example of Why-Why Analysis Sheet”). The human aspect should be investigated particularly carefully because, as mentioned earlier, failures are due to inadequate human behaviours. When multiple independent causes or complex combinations of interacting causes are at work, P-M Analysis should be used.
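A Why-Why chain can be represented as a sequence in which each answer becomes the subject of the next question. The sketch below is a minimal illustration; the function name and the example chain are hypothetical and do not reproduce the analysis sheet in the figure.

```python
def why_why(phenomenon, answers):
    """Pair each 'why?' question with its answer and return the final cause."""
    chain = []
    current = phenomenon
    for answer in answers:
        chain.append((f"Why: {current}?", answer))
        current = answer  # the answer becomes the next thing to question
    return chain, current


# Hypothetical example starting from a seized bearing.
chain, final_cause = why_why(
    "bearing seized",
    [
        "insufficient lubrication",
        "lubricator was empty",
        "no refill check in the daily routine",
    ],
)

for question, answer in chain:
    print(question, "->", answer)
print("Final cause:", final_cause)
```

In this made-up example the chain ends at a gap in the daily routine, which is the kind of human factor the analysis is meant to expose.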

(v) Take corrective action

The causes identified in the previous step should be dealt with promptly through restoration or improvement. If left untreated, minor equipment defects will become worse and begin to affect other parts of the equipment, and they may simply be forgotten if too much time is allowed to pass before any action is taken. If the problem is too difficult to rectify right away because of technical, financial or time constraints, a plan should be drawn up to ensure that it is dealt with at an appropriate time. It is also important to roll out any action taken or checks implemented to similar equipment and other equipment with similar mechanisms.

(vi) Ensure that the problem cannot recur

Find out why the minor equipment defects that eventually led to the failure were not spotted in time, and work out what needs to be done to improve everyone’s ability to spot them in the future. Check whether the necessary inspection schedules are in place and whether the standards are adequate. It is also important to find some suitable predictive maintenance techniques for detecting deterioration.

Any failure analysis should follow the steps described above, paying particular attention to the following points:

Be sure to report each failure on a separate sheet.
Give people repeated practice in analysing failures while teaching them about their equipment.
Work closely with the maintenance department, exchanging information and conducting analyses together.
Supervisors and managers must coach and advise operators painstakingly.
Promote understanding by using broken parts, analysis sheets and one point lessons to explain equipment mechanisms, potential problems and the correct inspection methods.

(2) Extend lifetimes through corrective maintenance

1 Establish basic conditions

The three basic conditions for reliable equipment operation are cleanliness, tightness and correct lubrication. Establishing and maintaining basic conditions means preventing the equipment from deteriorating and is the most important way of avoiding the causes of breakdowns.

(a) Cleaning

As the word implies, cleaning means keeping the equipment free of dust, dirt, oil stains, spilt product, and other forms of contamination. Production equipment is extremely sensitive to contamination of this sort, and many sporadic failures or product defects are due to it getting into sliding parts, hydraulic systems, electrical control systems and so on, causing problems such as wear, clogging, leaks, defective operation, short-circuits, and inaccuracy. Thorough, regular cleaning is essential to eliminate this kind of forced deterioration.

Cleaning does not mean simply getting the equipment looking nice. When operators clean their machines, they naturally have to look at and touch every part of the equipment including all the little nooks and crannies that they never normally see. This makes it much more likely that they will spot potential problems in machinery, dies, jigs and tools; not only dust and dirt but also wear, play, scratches, slackness, deformation, leaks, cracks, overheating, excessive vibration and noise. Cleaning should not be done for its own sake; it should be cleaning with meaning. It is usually possible to find between two hundred and five hundred potential defects when cleaning a single machine that has been neglected for a long time. This is why the slogan ‘cleaning is inspection’ is so common in Autonomous Maintenance circles.

(b) Lubrication

It goes without saying that equipment cannot perform satisfactorily unless it is properly lubricated. Despite this, however, empty, dirty, blocked or leaking reservoirs, grease nipples, lubricators, oil tubes and other lubrication devices are a common sight in many production areas.
Neglecting to lubricate can lead directly to sporadic failures such as bearing seizures. It can also accelerate equipment deterioration by causing wear or overheating, and the effects can spread out to all of the equipment’s units, giving rise to a huge range of different types of failure. Inadequate lubrication can be cited as a typical example of what might be called a psychological latent defect because it arises from insufficient attention and interest on the part of the people responsible for doing the job.

(c) Tightening

Many failures are due to nuts, bolts and other fasteners breaking, working loose or falling off. Even a single loose bolt can be a source of failure if it is used to attach an important part such as a bearing unit, die, jig, cutting tool, limit switch, coupling, or flange.
Fastener problems, however, do not usually lead directly to failure but start a chain reaction, which eventually results in a breakdown. When one bolt works loose, for example, the part it is supposed to hold may begin to vibrate, causing another bolt to work loose and create further vibration. Vibration breeds vibration, backlash breeds backlash, and the upshot is a serious breakdown. When one company investigated the causes of its breakdowns, it discovered that 60% of them were due to some form of nut or bolt problem. These kinds of problems account for a surprisingly high proportion of latent defects.

(3) Monitor and control deterioration

1 Observe correct operating conditions

If equipment is to perform its required functions, it must be operated under the correct conditions. In hydraulic systems, for example, the hydraulic fluid must be kept at the correct temperature, volume, pressure, acidity and level of cleanliness, while electrical control systems and measuring instruments must be operated under certain conditions of ambient temperature, humidity, dust level and vibration level. Switches and other devices must be fitted correctly in the right position and satisfy certain parameters (limit switches, for example, must have a dog of the correct shape, together with roller lever and dog contacts of the correct angle and strength). It is essential to set and observe the correct operating, handling and loading conditions for each piece of equipment in use.

Attempting improvements is pointless if the correct operating conditions are not being followed because the equipment’s accuracy of movement and processing conditions will be unstable and any problems will simply repeat themselves. To eliminate these problems, it is essential to specify the correct operating conditions for each equipment unit and component and ensure that they are followed.
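Specifying operating conditions and checking that they are followed can be sketched as a table of limits compared against readings. The condition names and limit values below are hypothetical placeholders; real limits must come from the equipment's own specifications.

```python
# Hypothetical limits for a hydraulic unit: (minimum, maximum) per condition.
HYDRAULIC_LIMITS = {
    "fluid_temperature_c": (30.0, 55.0),
    "pressure_mpa": (6.5, 7.5),
}


def out_of_range(readings, limits):
    """Return the conditions whose reading is missing or outside its limits."""
    problems = {}
    for name, (low, high) in limits.items():
        value = readings.get(name)
        if value is None or not (low <= value <= high):
            problems[name] = value
    return problems


readings = {"fluid_temperature_c": 62.0, "pressure_mpa": 7.0}
print(out_of_range(readings, HYDRAULIC_LIMITS))  # {'fluid_temperature_c': 62.0}
```

A missing reading is treated as a problem in this sketch, on the view that an unchecked condition cannot be assumed to be correct.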

2 Reverse deterioration

When dealing with failures, attempts are often made to introduce improvements while neglecting to restore deteriorated machines, jigs and tools, or only partially restoring them simply by replacing the broken parts. This will not work. Machines, dies, jigs and tools can only function effectively when the strength and accuracy of their components are properly balanced. If it is clear that a machine’s strength and accuracy are unbalanced from the start because of poor design or fabrication, it may be necessary to remodel it. In other cases, however, if only the broken parts of the machine are remodelled or restored, while other relevant parts are ignored, the problems will merely repeat themselves. In fact, they will go on repeating themselves as long as the deteriorated parts that ultimately cause the failures remain undetected.
For example, if a drive shaft has broken off at a notched section, we should make sure that any defects such as play due to a worn or badly-fitting bearing, or backlash due to worn gears, are eliminated before replacing the shaft or remodelling it to increase the notched section’s radius of curvature.
Equipment deteriorates slowly over time and its parts eventually begin to fail, starting with the weakest. Simply restoring or remodelling a broken part will not be very successful, because the next weakest part will fail soon after. The quickest way to achieve zero breakdowns is to go back to the drawings, identify the deteriorated parts by checking and testing, and restore the overall balance of the equipment’s strength and precision before thinking about changing its design.
To correct deterioration properly in this way, methods of accurately discovering, predicting and correcting deterioration must be found. Deterioration is detected and predicted by periodic checking and inspecting and through the use of diagnostic techniques and is corrected by overhauling based on standards. This of course requires a high level of skill on the part of those responsible for maintenance. It also requires the implementation of a preventive maintenance system.

3 Correct design weaknesses

To eliminate breakdowns, it is sometimes necessary to redesign the equipment, changing the materials, dimensions and shapes of its components. If a machine frequently breaks down despite being looked after carefully, and it is impossible to keep it going for long even with regular checks, inspections and overhauls, the maintenance costs become too great, and it may be necessary to eliminate the weaknesses by redesigning. However, it is better not to remodel equipment unless necessary. There are countless examples of serious mistakes being committed through making hasty decisions, inappropriately copying improvements done on other equipment, or being seduced by attractive new technologies presented in catalogues.

If a machine’s parts are not considered durable enough, the first thing to do is to decide whether it is a design fault. If it is, the weakness should be identified accurately and a plan for remodelling the equipment should be put in place. To do this, the following procedure should be adopted:

(a) Find out exactly what happened before and after the breakdown, and identify the phenomenon precisely.
(b) Check the equipment’s structure and functions.
(c) Check to see whether basic conditions are being maintained, correct operating conditions are being followed, and the equipment has been properly restored.
(d) Identify the mechanism by which the phenomenon occurs.
(e) Find the causes (design weakness, some other reason, or both).
(f) Plan an improvement.
(g) Implement the improvement.
(h) Follow up on the improvement to see whether it worked or not.
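Step (c) acts as a gate: a design weakness should only be investigated once the other conditions have been ruled out. A minimal sketch of that ordering follows; the function name and the wording of the returned actions are illustrative assumptions.

```python
def next_action(basic_conditions_ok, operating_conditions_ok, deterioration_restored):
    """Suggest what to do before concluding that the design is at fault."""
    if not basic_conditions_ok:
        return "establish basic conditions"
    if not operating_conditions_ok:
        return "observe correct operating conditions"
    if not deterioration_restored:
        return "restore deteriorated parts"
    # Only when all three checks pass is a design weakness worth investigating.
    return "investigate design weakness"


print(next_action(True, True, False))  # restore deteriorated parts
print(next_action(True, True, True))   # investigate design weakness
```

Ordering the checks this way reflects the advice above not to remodel equipment unless it is necessary.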

(4) Carry out predictive maintenance

1 Improve operating and maintenance skills

When thinking about how to eliminate breakdowns, we often make the mistake of focusing our attention exclusively on the machines, jigs, tools, materials being processed and other hardware while forgetting about the operating and maintenance skills involved. If the cause of the problem is in fact lack of skill, looking for the causes in the hardware can lead us to repeatedly change the design of a machine or the specifications of the materials used while still failing to reduce the number of breakdowns. If a problem is known to be due to operating or maintenance error, at least we can do something about it; but people are often convinced that the methods they are using are correct when in fact they are not. In such cases, a solution is not easily found. Problems like this can only be solved by working out exactly what skills the operators and maintenance people need to look after their particular equipment, and ensuring that they acquire those skills through comprehensive education and training.