Is your RCA Effort Trigger Happy?

Mark A. Latino

Most work environments are reactive in nature. There is always more work to be done in a day than there is time to do it. This is because unexpected changes in the work environment force the workforce to respond immediately, without preparation, to return the environment to the status quo.

Because of this, some companies have elected to do root cause analysis on these unexpected events. When operating from a reactive point of view, management sets a trigger that launches a root cause analysis based on vibration level, hours of downtime, financial impact, etc. Trigger placement is a GOOD and necessary first step.

It is a good first step because the natural progression is to first realize that the facility is reactive to an excessive level. This discovery usually comes through daily or weekly downtime reporting. Once it is determined that there is a problem, measures to control the situation are implemented. Triggers are almost always the first response.

Some companies measure employee problem-solving performance based on a weighted system of problem types. The more times a problem recurs, the more points the employee accumulates, and the employee is scored at the year's end. Other companies measure employee problem-solving performance by assigning 10 or 12 failure investigations to be completed by the end of each year.


Figure 1. Process flow diagram for a sulfuric acid plant.

All this effort is based on the activation of some undesired event. The events can range from an electrical fault shutting down an entire section of a facility to a critical pump breaking shafts every two months. Why wait for triggers to trip and incur downtime and asset damage? It is much more difficult to do root cause analysis when severe secondary damage is incurred. The fact is that triggers are a reactive means to control unscheduled events.

The natural progression from this new knowledge is to stop waiting for triggers to be activated and to get proactive. When this step is achieved, the facility can move to the next level, GREAT. This also relieves the pressure on employees to deliver scores for performance appraisals, pressure that can lead to analyses done in haste just to meet requirements.

Performing a failure modes and effects analysis (FMEA) is a way to replace triggers and inform management that the root cause analysis effort is based on sound monetary results. Each root cause analysis completed will have a predetermined value that has been identified using the FMEA. This is going from GOOD to GREAT.

Proaction is the insight to look at operating areas with a structured approach designed to uncover potential events that would cause a trigger to activate. This can be accomplished using FMEA. FMEA is a term used often, but it means different things to different people. The common thread is this: FMEA provides focus and points to the opportunities that will deliver a premeasured improvement to a facility.

FMEA is a proactive approach to uncovering what you don't know about your operation. This is important because there is an assumption that we already know the identity of our manufacturing problems. This is, for the most part, not true.

Some may know what the worst problem is, but it is very likely that facilities don't know what the second-worst problem is or the third, fourth and so on. In many cases, we don't know what the problems are truly costing us because they have been below the radar and have become a part of doing business.


Figure 2. An example of a data collection worksheet.

An example of this would be a piece of equipment that makes a tangible good, such as a cigarette-making machine or paper-converting machine. This type of equipment can be turned on and off many times during a shift for various reasons.

Operators sometimes shut down equipment because of quality defects, or run it at reduced rates because the full-capacity rate causes excessive startups and shutdowns, which in turn make the operators work harder than if the equipment were run at a reduced rate.

Let us use an example from the cigarette industry. This example could just as easily apply to making candy, bolts or paper clips. In the cigarette industry, there is an electronically generated downtime called a rod break. When this condition occurs, the operator responds by collecting the paper part of the cigarette rod and disposing of it in the waste can.

The operator then returns the tobacco lost from the rod into the rework container, rethreads the cigarette paper and pushes the start button to return the equipment to the producing mode. This process takes the operator three to four minutes.

The operator's response is a tasked action learned during his or her training cycle. This response can take place 40 to 50 times a shift, which reduces the machine's end-of-year productivity by more than 20 million cigarettes. This wasn't on the radar screen because it was a task done regularly; it was considered a part of doing the job.
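The arithmetic behind this example can be sketched as follows. The stop counts and recovery times are the figures quoted above; the number of shifts per year and the machine rate are hypothetical values chosen only for illustration and are not data from this article. Only the downtime per shift (120 to 200 minutes) follows directly from the quoted figures.

```python
# Rough estimate of the loss caused by rod breaks.
# From the article: 40 to 50 stops per shift, 3 to 4 minutes per stop.
# ASSUMPTIONS (not from the article): shifts per year and machine rate.

def rod_break_loss(stops_per_shift, minutes_per_stop,
                   shifts_per_year, cigarettes_per_minute):
    """Return (downtime minutes per shift, downtime hours per year,
    cigarettes not made per year)."""
    minutes_per_shift = stops_per_shift * minutes_per_stop
    hours_per_year = minutes_per_shift * shifts_per_year / 60
    lost_units = minutes_per_shift * shifts_per_year * cigarettes_per_minute
    return minutes_per_shift, hours_per_year, lost_units

SHIFTS_PER_YEAR = 250          # hypothetical value
CIGARETTES_PER_MINUTE = 1000   # hypothetical value

# Low end: 40 stops of 3 minutes. High end: 50 stops of 4 minutes.
low = rod_break_loss(40, 3, SHIFTS_PER_YEAR, CIGARETTES_PER_MINUTE)
high = rod_break_loss(50, 4, SHIFTS_PER_YEAR, CIGARETTES_PER_MINUTE)

print("Downtime per shift (min):", low[0], "to", high[0])
print("Cigarettes lost per year:", low[2], "to", high[2])
```

Substituting a facility's real shift count and machine rate turns a routine three-minute task into an annual figure that can be compared with other losses.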

Many small losses occur daily in facilities and are considered "the way we do business." When these occurrences are exposed and calculated for annual loss in hours and dollars, the financial impact to the facility becomes clear.

Where do you get the data to perform an FMEA? Managers and others often say that employees are the greatest asset. However, because of technology, analysts prefer information from the maintenance management system. This is a fast way to get downtime data, parts usage data, etc.

Most often, the data provided by the maintenance management system is what's on the radar screen or what we already know. When below-the-radar data is sought, it is collected from the most likely source of undetectable or below-the-radar information - the employees.

It is not a stretch to say that the people who operate and maintain a facility know things about their environment that will never be made known unless someone asks them. Most employees find a way around problems that cause them pain or extra exertion to perform work.

This may include bypassing alarms that go off for no apparent reason, running at reduced rates, changing filters prematurely, adding set-screws to loose couplings, pinning bearings so that they won't move, and tack-welding cracked impellers. The list can go on and on. These kinds of activities affect productivity and most likely will never show up in the maintenance management system.


Figure 3. An example of an electronic data collection worksheet.

Management can open this door and learn from employees by following a three-step method for performing a successful FMEA.

  1. Create a process flow diagram for the system you want to analyze.
  2. Create a failure definition to be communicated from the top tier of management to the hands-at-work level.
  3. Create an FMEA data collection worksheet that reflects the issues that are of concern, such as material waste, defect rates, downtime, safety incidents, etc.


Create a process flow diagram: The process flow diagram reflects the routing of the process. This is usually from the raw material input to the point of storage or shipping (see Figure 1).

The reason for doing this is to give the FMEA facilitator and employees a visual for reference during the interview process.

Create a failure definition: A clear and concise failure definition is needed to make sure the employees and management have the same understanding of what is considered a failure. Without this understanding, confusion results and your analysis is compromised. Failure definitions are usually tainted by the business climate, a sold-out condition or slow sales cycle. Failure definitions can also surround a current problem an area is experiencing, such as a high rate of rework, high defect rate, high hand injury rate, etc.

Some examples of failure definitions are:

  1. Failure is when secondary defects are incurred.

  2. Failure is any adverse happening that has human roots.

  3. Failure is when the asset becomes inoperable.

  4. Failure is when the asset can no longer perform its intended function.

  5. Failure is any event or condition that interferes with production.

  6. Failure is any event or condition that causes the expenditure of unexpected budget money.


When consensus is gained on a failure definition, you are ready to compile an FMEA data collection worksheet.

Create a data collection worksheet: The data collection worksheet captures the data needed to separate the significant few failures from all the rest. To do this, a very simple rule is used: frequency multiplied by impact. The worksheet configuration does a number of things for the analysis. It identifies the event, the modes that cause the event, the frequency of each mode and the impact of the event-mode combination on the analyzed system (see Figures 2 and 3).
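As a minimal sketch of the frequency-multiplied-by-impact rule, the worksheet can be modeled as a list of rows that are ranked by annual loss. Everything here is illustrative: the class and function names, the sample rows and their numbers, and the 80 percent cutoff used to pick the "significant few" are assumptions, not values or methods taken from this article or its figures.

```python
from dataclasses import dataclass

@dataclass
class WorksheetRow:
    event: str                    # the undesired event
    mode: str                     # the mode that causes the event
    frequency_per_year: float     # occurrences per year
    impact_per_occurrence: float  # loss per occurrence, in dollars

    @property
    def annual_loss(self):
        # The rule from the text: frequency multiplied by impact.
        return self.frequency_per_year * self.impact_per_occurrence

def rank_by_loss(rows):
    """Sort worksheet rows from largest to smallest annual loss."""
    return sorted(rows, key=lambda r: r.annual_loss, reverse=True)

def significant_few(rows, share=0.8):
    """Top-ranked rows whose combined loss reaches `share` of the total.
    The 0.8 default is an assumed cutoff, not one given in the article."""
    ranked = rank_by_loss(rows)
    total = sum(r.annual_loss for r in ranked)
    selected, running = [], 0.0
    for row in ranked:
        if total == 0 or running >= share * total:
            break
        selected.append(row)
        running += row.annual_loss
    return selected

# Made-up sample rows, for illustration only.
rows = [
    WorksheetRow("Conveyor stops", "False alarm", 200, 20.0),
    WorksheetRow("Maker stops", "Rod break", 10000, 5.0),
    WorksheetRow("Filter change", "Premature change", 50, 40.0),
    WorksheetRow("Pump failure", "Broken shaft", 6, 4000.0),
]

for row in significant_few(rows):
    print(row.event, "/", row.mode, "->", row.annual_loss)
```

In this made-up data, the frequent but cheap rod break outranks the rare but expensive broken shaft, which is the kind of below-the-radar loss the worksheet is meant to expose.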

The findings allow you to uncover what you don't know. This allows you to change the outcome because you know your current cost of doing business.

This puts you in the catbird seat. You can see what others can't because you took the time to look. Now you can make decisions according to solid information, giving you the business advantage.

With this advantage, you can pick the project on which to perform root cause analysis according to the loss it causes the facility over a year's time. This will not be the case when reacting to a triggered root cause analysis project. Triggered projects may in some cases tie up valuable human assets that could be better used on projects with a greater return to the organization.

Mark Latino is the vice president of operations for Reliability Center Inc. He came to RCI after spending 19 years in corporate America (Weyerhaeuser, Allied Chemical, Philip Morris). For more information, visit or call 804-458-0645.

