Root Cause Analysis: What's the Point?

Brian Hughes
Tags: root cause analysis, maintenance and reliability, continuous improvement

RCA: Current State

Maintenance and reliability people fix failed equipment regularly, and when the impact of a failure is substantial, a root cause analysis (RCA) is often required. But what’s the point? Once we’ve fixed the issue, shouldn’t we leave well enough alone and get on with our lives? We’ve got enough going on. Why do we need to take multiple people away from their jobs to spend several hours analyzing something that’s in the past?

A root cause analysis (or any analysis) requires an investment of our scarcest resource: time. Why should completing an RCA be how we spend our precious time? The answer to this question might seem obvious, but it’s always worth asking.

We invest in RCAs because, as with any investment, we believe they will yield positive returns. But that’s not always the mindset of those performing these analyses. Many times, we do an RCA because it’s mandated by leadership. Those investigating don’t believe it’s worth the time, but it must be done because, you know, those are the rules. So, the standard becomes the “minimum viable product” that gets the boss to sign off that the RCA was completed.

This is not the way to run an effective RCA program. RCA done this way achieves very few results and can do more harm than good. It becomes a tax on time that adds little value and generates considerable ill will. But it doesn’t have to be this way. A trusted process and set of tools are necessary for a successful RCA program. But we first need to establish why anyone would want or need to invest their time and money in RCA.

Let’s talk for a minute about our current reality. At this moment, we are on an elevator of technological progress, exponentially accelerating upward. Take the progress of the wheel, for example; the wheel was first invented around 5,200 B.C., and it was likely used for pottery, not carts. It took nearly a thousand years for the wheel to be used for transportation, and it wasn’t until the 19th century that we started seeing powered vehicles. Now, as we approach the end of the first quarter of the 21st century, we find that cars can almost drive themselves.

Most of the advancements in technology happened within the last 150 years or so. If the 7,000 years since the first wheel equated to one hour, nearly all the serious progress would have occurred in the final minute or so. And that progress continues to accelerate.

Figure 1. Vehicle Complexity Over Time

So, what does that mean? Problems, and lots of them. Our drive to advance technology is spawning more problems of greater complexity that require solutions in a shorter amount of time. Of course, people in different industries will experience this phenomenon to different degrees; it’s not the same for everyone. But it is happening universally, and if we don’t find a way to become great at solving these problems quickly, there may come a time in our not-so-distant future when these problems overwhelm us.

So, what’s the point behind root cause analysis? RCA processes and tools allow us to solve difficult or “wicked” problems better and faster. These problems are often bigger than any single person. None of us has a monopoly on knowledge. We need to bring others into the mix to overcome gaps in what we think we know. At its best, RCA allows groups of diverse experts to quickly learn from each other to explain how the problem happened and what should be done to prevent future incidents.

We need to develop cultures that learn faster than machines fail. To do this, we need root cause analysis that is the right size (scalable) for the problem at hand and that works, delivering consistent value for the time invested. When RCA is done correctly, those performing it can recognize the value in the process rather than performing it simply because the boss asked for it.

Root Cause Analysis — Five Steps

There are several variations of root cause analysis, and typically all involve some mixture of the following five steps:

Gather evidence and data to be used to draw and support conclusions.
Document important problem information, such as what the specific problem is, when it happened, where it happened and what the impact of the problem is.
Identify the causes of the problem. What happened that led up to the problem?
Determine what will be done to solve the problem.
Share what was learned with others.

Gathering Evidence and Data

Imagine you’re cooking dinner for a large group, and you want it to be terrific. Every chef knows that buying the best ingredients is the first step to a great meal. Evidence and data are the “ingredients” for an RCA. Usually, when a failure occurs, all energy is directed toward bringing the asset back online.

In the rush to recover, data and evidence can wind up being discarded or destroyed. This is always a mistake. We need to gather broken parts and equipment, samples of fluids, system data, documentation and witness or expert statements as soon as possible to facilitate future learning.

You can find evidence from a variety of different sources:

People: Witnesses, operators, maintenance technicians, design engineers, safety engineers, OEM representatives and outside experts all are excellent sources of information. Remember, no single source knows everything about the problem, so it’s crucial to diversify.

Procedures and Documentation: Look for documented evidence. This includes:

Preventative and predictive maintenance task records
Maintenance, operation and installation manuals
Operations and maintenance reports
Design diagrams
Piping, instrumentation and electrical diagrams
Documentation of past failures

Photos, Video, Audio: Many facilities are under 24-hour video surveillance. If available, try to get these videos or photos, or tour the scene and take photos and video yourself, making sure to adhere to all organizational guidelines. If audio files exist, such as from two-way radio communication, these can also be good sources of information. While online videos can be helpful resources, don’t become reliant upon them, and always make sure they’re from a reputable source.

Hardware, Software, Systems: Different systems can offer unique insights. To draw them out, ask questions such as:

What equipment failed? Was it a specific component or multiple?
What was the design intent?
How does each component fit into the overall system?
How was it being operated compared to how it was designed to operate?

Environment: Environmental causes are also important considerations. Ask yourself:

Was it hot or cold? Wet or dry?
Was it inside or outside?
What was the business environment at the time? Was it seasonally busy or slow?

Gathering evidence from diverse sources as soon as possible after the event is one of the most important parts of a high-quality RCA.
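
For teams that track evidence electronically, the source categories above can double as a simple coverage checklist. The following is only a minimal sketch: the category labels, the EvidenceItem record and the sample entries are hypothetical and are not part of any particular RCA tool.

```python
from dataclasses import dataclass
from datetime import datetime

# Labels for the five source categories described above (names are assumptions).
SOURCE_CATEGORIES = (
    "people",
    "procedures_and_documentation",
    "photos_video_audio",
    "hardware_software_systems",
    "environment",
)


@dataclass
class EvidenceItem:
    category: str          # one of SOURCE_CATEGORIES
    description: str       # what was collected
    collected_at: datetime  # when it was collected


def missing_categories(items):
    """Return the source categories with no evidence logged yet."""
    covered = {item.category for item in items}
    return [c for c in SOURCE_CATEGORIES if c not in covered]


# Hypothetical log entries, made up for illustration only.
log = [
    EvidenceItem("people", "Operator statement, night shift",
                 datetime(2024, 1, 1, 6, 30)),
    EvidenceItem("hardware_software_systems", "Failed bearing, bagged and tagged",
                 datetime(2024, 1, 1, 7, 0)),
]

print(missing_categories(log))
```

A list like this does not judge the quality of the evidence. It only shows which kinds of sources nobody has consulted yet.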

State the Problem

The problem needs to be accurately stated, and key information should be documented. When developing a problem statement for a reliability issue, it’s helpful to use the following formula:

“Asset ABC Unavailable + XX Hours to Recover”

Writing the problem statement this way ensures that the causal analysis covers both the story of why and how the asset experienced downtime and how much time was required to bring the asset back online.

We also need to document the time and date, where the problem happened and the actual and potential impacts of the problem. It’s particularly important to document both actual and potential impacts because leaders need to know how bad the problem was as well as how bad it could have been. Finally, it’s useful to include how often this type of problem has happened in the past.
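
The items listed in this section can be captured in a small record so that no field is forgotten. This is a sketch under stated assumptions: the ProblemStatement class, its field names and the sample values are hypothetical, and only the headline wording follows the formula above.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemStatement:
    asset: str                      # the asset that was unavailable
    hours_to_recover: float         # time needed to bring it back online
    occurred_at: str                # time and date of the event
    location: str                   # where the problem happened
    actual_impacts: list = field(default_factory=list)     # how bad it was
    potential_impacts: list = field(default_factory=list)  # how bad it could have been
    prior_occurrences: int = 0      # how often this has happened before

    def headline(self) -> str:
        """Format the statement as 'Asset Unavailable + XX Hours to Recover'."""
        return f"{self.asset} Unavailable + {self.hours_to_recover:g} Hours to Recover"


# Sample values are invented for illustration.
statement = ProblemStatement(
    asset="Asset ABC",
    hours_to_recover=12,
    occurred_at="2024-01-01 05:45",
    location="Line 2",
    actual_impacts=["12 hours of lost production"],
    potential_impacts=["Secondary damage to the gearbox"],
    prior_occurrences=2,
)

print(statement.headline())
```

Keeping actual and potential impacts as separate fields mirrors the point above that leaders need to see both.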

Analyze the Causes

There are several ways of analyzing causes. But what’s the “right” way? The truth is, there is no one right way, only the way that works best for you. "All models are wrong, but some are useful" is a saying attributed to George Box, and it’s appropriate here. Ask yourself, does your cause-and-effect analytical model:

Help you manage input from the group?
Help you tell the complete story of the event?
Accomplish these things in a way that doesn’t add to the burden of the analysis?

If so, then your model is useful.

At Sologic, we like to create a model by starting with the problem and then working backward in time, identifying the cause-and-effect relationships that led up to the problem. It’s like playing a movie backward, frame by frame. When you analyze causes in this way, the group can clearly see how they resulted in the problem.
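
One way to picture working backward from the problem is as a walk over a map of effects and their direct causes. The sketch below is illustrative only and is not Sologic's software or notation; the causes_of map, the walk_backward function and the pump example are all hypothetical.

```python
# Hypothetical cause-and-effect map: each effect points to its direct causes.
causes_of = {
    "Pump P-1 unavailable": ["Bearing seized"],
    "Bearing seized": ["Lubricant degraded", "Bearing ran above rated load"],
    "Lubricant degraded": ["Lubrication interval extended"],
    "Bearing ran above rated load": ["Throughput increased"],
}


def walk_backward(problem, causes_of):
    """Yield (depth, event) pairs, starting at the problem and moving back in time."""
    stack = [(0, problem)]
    seen = set()
    while stack:
        depth, event = stack.pop()
        if event in seen:
            continue  # an event shared by two branches is reported once
        seen.add(event)
        yield depth, event
        # Push causes in reverse so they are visited in the order listed.
        for cause in reversed(causes_of.get(event, [])):
            stack.append((depth + 1, cause))


# Print the chain as an indented outline, problem first.
for depth, event in walk_backward("Pump P-1 unavailable", causes_of):
    print("  " * depth + event)
```

Each level of indentation is one step further back in time, much like one more frame of the movie played in reverse.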

Model templates can be extremely helpful. For instance, the template below works for most reliability issues.

Figure 2. Reliability Problem-Solving Chart

This template starts with the formula described in the “State the Problem” section: Asset Unavailable + XX Hours to Recover. The top branch prompts the investigation team to discover the story of the fault and what brought the asset down. The bottom branch asks them to account for the hours required to achieve recovery: some of that time was used to make the system safe, some to diagnose the problem and the rest to repair it. This template can be scaled to fit the complexity and severity of the problem to help ensure we don’t waste time over-investigating.
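
The bottom branch of the template amounts to simple arithmetic on the recovery time. As a rough sketch, with phase names and hours that are invented for illustration rather than taken from the figure:

```python
# Hypothetical recovery log for one event; the hours are made up.
recovery_hours = {"make_safe": 2.0, "diagnose": 4.0, "repair": 6.0}


def recovery_breakdown(hours_by_phase):
    """Return total recovery hours and each phase's share of that total."""
    total = sum(hours_by_phase.values())
    if total <= 0:
        raise ValueError("total recovery time must be positive")
    shares = {phase: hours / total for phase, hours in hours_by_phase.items()}
    return total, shares


total, shares = recovery_breakdown(recovery_hours)

print(f"Asset Unavailable + {total:g} Hours to Recover")
for phase, share in shares.items():
    print(f"  {phase}: {recovery_hours[phase]:g} h ({share:.0%})")
```

A breakdown like this can show at a glance whether most of the downtime went to diagnosis or to the repair itself, which may point the team toward different solutions.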

Note that a method like this won’t lead the team to a single root cause. In fact, the farther back in time you go, the more the branches will diverge from each other. It’s important to remember that there is no single root cause for any given event. Therefore, searching for one is futile. What’s more important is to understand how the causes work in conjunction with each other to result in the problem.

Solve the Problem

Solutions control causes. When you use a causal model like the one above, you can easily identify solutions that control individual causes. As noted earlier, no event has a single root cause. This fact is liberating in that it frees the team to identify any number of solutions that control the causes identified in the model. Ultimately, a diversified “basket” of solutions is desired.
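
The idea that solutions control causes can be checked mechanically once both are written down. The sketch below assumes a hypothetical list of causes and a hypothetical basket of solutions, and simply reports which causes no proposed solution addresses.

```python
# Hypothetical causes taken from a causal model.
causes = [
    "Lubricant degraded",
    "Lubrication interval extended",
    "Bearing ran above rated load",
    "Throughput increased",
]

# Hypothetical basket of solutions, each mapped to the causes it controls.
solutions = {
    "Restore the original lubrication interval": ["Lubrication interval extended"],
    "Add oil analysis to the inspection route": ["Lubricant degraded"],
    "Review bearing rating against current load": ["Bearing ran above rated load"],
}


def uncontrolled(causes, solutions):
    """Return the causes that none of the proposed solutions controls."""
    controlled = {cause for targets in solutions.values() for cause in targets}
    return [cause for cause in causes if cause not in controlled]


print(uncontrolled(causes, solutions))
```

An uncontrolled cause is not necessarily a gap. The team may decide that a cause is acceptable or outside its influence, but the list makes that decision explicit.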

Report Findings

Once the team has gathered evidence, thoroughly defined the problem, analyzed its causes and identified solutions, the final step is to share what was learned. The investigation team knows more about the problem than anyone else. Therefore, they are in the best position to help teach the rest of the organization. The best way to do this is by creating a thorough and thoughtful incident report. An incident report doesn’t need to be long — it just needs to tell the story in a way that helps others learn.

Putting It All Together

Technological advances are spawning an ever-greater number of complex problems. Success in such a world requires that we employ organizational learning techniques, including root cause analysis, in ways that help us leverage the diverse knowledge and brainpower at our disposal. Ultimately, success results in an organization that learns faster than it fails.