Failures Tell a Story—RCFA Teaches You How to Read It

Matt Sadler


It was an idle Tuesday when I received a call that a location in South America had finally figured out why the paper shredders in that facility kept failing. A very excited voice on the other end of the phone said they had used root cause failure analysis (RCFA) and found that the night-shift cleaning crew had been feeding binders into the shredders with the covers still on. The shredders would jam, the cleaning crew would override the safety features, and eventually the shredders would be damaged beyond repair.

At another plant, we faced a recurring failure. We thought we had fixed it, thought we had it under control, and then it would come back. Over and over again. Every time it returned, it chipped away at us, not just in downtime or cost, but in morale. People started feeling like they were spinning their wheels.

Not only was morale impacted, but the constant failures started to affect the bottom line and the company’s reputation. But what really went wrong?

RCFA is a structured method for identifying the true underlying cause of a failure to prevent recurrence. In the maintenance industry, most teams are very good at identifying the problems. What is lacking is putting a solution in place so the event is less likely to recur.

That is why I set out to change the RCFA landscape. Treating the symptoms and not addressing the root causes started to burn out everyone on my team, including me. We would gather in a room, where a very complex Microsoft Excel document would be displayed, and for hours, symptoms would be identified. When it came time to identify actions, less than 10 minutes was dedicated to this portion.

Many times, a functional or catastrophic failure occurs, and due to cultural norms in the industry, an RCFA is not completed. The downtime matters only until the next big event happens. The number of times something has failed is often lost in the daily battle to keep that very piece of equipment running. The amount of wasted resources? You guessed it: not given even a moment's thought.

Early in my career, I was taught that a reliable plant is a safer plant. It took many years to realize how true this statement was, and it remains true today. The number of unsafe conditions that employees are exposed to because RCFAs are not completed brings to mind the quote: “A reliable plant isn’t built on luck; it’s built on learning from every failure.”

Reliability engineering foundations are built on using RCFA tools that fit into the framework of continuous improvement (e.g., Six Sigma, TPM). By fully utilizing the RCFA process, any company, regardless of industry or size, can expect to improve safety results, reduce downtime, and strengthen its decision-making culture.

Would the aerospace industry be as reliable without using the RCFA process? It is comforting to know that aviation incidents are investigated in great detail to avoid repeat events for a specific failure mode. What about the energy industry? Healthcare? Manufacturing? Many, if not all, of these industries have developed leading practices that improve safety, reliability, quality, and cost performance by taking the time to properly complete an RCFA.

RCFA is built on cause-and-effect logic coupled with evidence-based investigation and human factors. The majority of failures are not due to a single cause but to a chain of contributing factors. RCFA seeks to find the root cause, but what does that really mean? We have all of these contributing factors, and as we start to eliminate them with evidence, we get to a point where a root cause is identified. The word “root” makes me think of something planted in the ground. We can only “see” the top, but we know that there are roots that travel underground, and until we dig in, we will not be able to find those roots.

Symptoms are the effect (what you notice), while the cause is the reason (why it happened). Symptoms can also be thought of as contributing factors: conditions that feed into the overall failure and serve as clues that guide the RCFA process. From my experience, this is where we typically stop during the process; we address a symptom, but we don’t solve the underlying problem.

One important point during the RCFA process is to be hard on the process and not the people. Participants in the RCFA have to feel they are working in a “no-blame” culture. Blame is a natural reaction to a failure. Asking “what made this possible?” instead of “who did this?” is the step needed to get to the root cause. Be curious, not judgmental. This change is not loud, but it will be visible: less finger-pointing and more collaboration.

I remember being in a reliability meeting when a frontline operator stood up and said, “I don’t think it’s the pump. I think it’s the vibration sensor, and here’s why.” He had no title, but he had the floor. That’s culture change.

Root cause analysis can use many different tools, and each has its own place in the reliability journey.

5 Whys: Simple and quick, but it can oversimplify complex problems.

Fishbone Diagram (Ishikawa): Categorizes possible causes (people, methods, machines, materials, environment).

Fault Tree Analysis: Provides a logical breakdown of how failures occur.

Failure Modes and Effects Analysis (FMEA): Provides a systematic evaluation of risks before failure.

Cause and Effect Mapping / Timeline Analysis: Integrates multiple data points and perspectives.
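To make one of these tools concrete, here is a minimal sketch of the logic behind a fault tree, loosely based on the shredder story from the opening. The event names, the gate structure, and the "too_many_sheets" branch are assumptions made for illustration, not findings from that investigation. The sketch only shows how AND and OR gates combine basic events into a top event.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Gate:
    """An intermediate or top event that combines its inputs with AND or OR logic."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", str]] = field(default_factory=list)


def occurs(node: Union[Gate, str], basic_events: Dict[str, bool]) -> bool:
    """Return True if the event occurs, given the state of each basic event."""
    if isinstance(node, str):
        # A plain string is a basic event: look up whether it was observed.
        return basic_events[node]
    results = [occurs(child, basic_events) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate type: {node.kind}")


# Hypothetical tree: a jam alone does not destroy the shredder;
# the damage requires a jam AND an overridden safety feature.
jam = Gate("shredder jams", "OR", ["binder_cover_fed", "too_many_sheets"])
top = Gate("shredder damaged beyond repair", "AND", [jam, "safety_overridden"])

observed = {
    "binder_cover_fed": True,
    "too_many_sheets": False,
    "safety_overridden": True,
}
print(occurs(top, observed))  # True
```

Setting "safety_overridden" to False makes the top event False, which is the kind of question a fault tree helps a team ask: which branch, if removed, breaks the chain?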

Selecting the right method depends on problem complexity, available data, and resources, and it can be the most challenging part of the process. I have found that starting the process as early and informally as possible is far better than doing nothing. Many times, using a piece of paper and meeting an operator where the work is happening provides more value and real results than using a fancy tool in a conference room.

The RCFA process, and it is a process, is defined below:

Problem Definition: What happened? Where? When? Who was involved?

Data Collection: Data, logs, statements, inspections, and evidence.

Analysis: What contributing factors can be identified? Do we need to use one or more root cause tools?

Root Cause Identification: What are the true underlying reasons?

Corrective Actions: Implement solutions that prevent recurrence.

Verification: Confirm that the solution eliminated the issue.

Documentation and Communication: Capture lessons learned and share them internally.
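The steps above can be sketched as a simple record that tracks how far an investigation has progressed. This is an illustrative sketch only: the class name RcfaRecord and the strict step ordering are assumptions, and real investigations often loop back to earlier steps as new evidence appears.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The seven steps, in the order listed in the article.
STEPS = [
    "Problem Definition",
    "Data Collection",
    "Analysis",
    "Root Cause Identification",
    "Corrective Actions",
    "Verification",
    "Documentation and Communication",
]


@dataclass
class RcfaRecord:
    """Hypothetical tracker for one RCFA investigation."""
    title: str
    completed: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next outstanding step, or None when all are done."""
        if len(self.completed) == len(STEPS):
            return None
        return STEPS[len(self.completed)]

    def complete(self, step: str) -> None:
        """Mark a step complete, refusing to skip ahead."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)


record = RcfaRecord("Recurring pump failure")
for step in STEPS[:3]:
    record.complete(step)
print(record.next_step())  # Root Cause Identification

# Jumping straight to actions without naming a root cause is rejected.
try:
    record.complete("Corrective Actions")
except ValueError as err:
    print(f"Blocked: {err}")
```

The ordering check mirrors the failure pattern described earlier: teams that jump from symptoms to actions without identifying a root cause.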

The RCFA process can seem heavily focused on machinery, but there is a human element that influences it. Every one of us has biases; for example, people tend to look for, remember, and believe information that supports what they already think. Consider a mechanic who believes a certain centrifugal pump “always has electric motor problems.” When that pump fails again, the mechanic immediately assumes it is the motor, even though the real issue is a plugged strainer. The mechanic noticed every past motor issue because it confirmed that belief and overlooked evidence that pointed elsewhere.

Individual failures can become organizational learning opportunities. If one location in a complex organization is having a multitude of failures, are other locations seeing the same pattern? Is it related to a supply issue or component defect, or can other locations take action to prevent failures before a functional failure occurs?
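As a rough illustration of that cross-location question, the sketch below groups failure records by component and failure mode and flags any combination seen at more than one site. The site names, components, and record layout are invented for the example; a real version would read from whatever system the organization uses to log failures.

```python
from collections import defaultdict

# Hypothetical failure records: (site, component, failure_mode)
records = [
    ("Plant A", "feed pump", "seal leak"),
    ("Plant B", "feed pump", "seal leak"),
    ("Plant A", "conveyor", "belt tear"),
    ("Plant C", "feed pump", "seal leak"),
    ("Plant B", "conveyor", "bearing wear"),
]


def shared_failures(records, min_sites=2):
    """Return failures reported at min_sites or more distinct sites."""
    sites_by_failure = defaultdict(set)
    for site, component, mode in records:
        sites_by_failure[(component, mode)].add(site)
    return {
        failure: sorted(sites)
        for failure, sites in sites_by_failure.items()
        if len(sites) >= min_sites
    }


print(shared_failures(records))
# {('feed pump', 'seal leak'): ['Plant A', 'Plant B', 'Plant C']}
```

A match across sites does not prove a common cause, but it is a prompt to check for a shared supplier, component, or practice.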

Implementing RCFA in an organization is not an easy task. While the training, methodology, and tools are fairly easy to understand and apply, the culture change is not an overnight task. Here are some tips to help you along your implementation journey:

Start small: Select one recurring issue as a pilot.

Train facilitators and develop standard templates for consistency.

Integrate RCFA into maintenance and reliability processes (e.g., a computerized maintenance management system, or CMMS).

Track metrics: mean time between failures (MTBF), recurrence rates, and downtime.

Build management support by showing measurable benefits.
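For the metrics tip above, here is a minimal sketch of how the three numbers might be computed from a failure log for a single asset. The log entries and the observation window are made up. MTBF is taken as operating time divided by the number of failures, and the recurrence rate as the share of failures that repeat an already-seen failure mode; both are assumed conventions, and organizations define them in different ways.

```python
from collections import Counter

# Hypothetical failure log for one asset: (failure_mode, downtime_hours)
failures = [
    ("plugged strainer", 4.0),
    ("motor overload", 6.0),
    ("plugged strainer", 3.0),
    ("plugged strainer", 5.0),
]
observation_hours = 1000.0  # assumed length of the observation window


def reliability_metrics(failures, observation_hours):
    """Compute downtime, MTBF, and recurrence rate under the stated assumptions."""
    downtime = sum(hours for _, hours in failures)
    uptime = observation_hours - downtime
    count = len(failures)
    # MTBF: operating (up) time divided by the number of failures.
    mtbf = uptime / count if count else float("inf")
    # Every occurrence of a mode after its first counts as a repeat.
    mode_counts = Counter(mode for mode, _ in failures)
    repeats = sum(n - 1 for n in mode_counts.values())
    recurrence_rate = repeats / count if count else 0.0
    return {
        "downtime_hours": downtime,
        "mtbf_hours": mtbf,
        "recurrence_rate": recurrence_rate,
    }


print(reliability_metrics(failures, observation_hours))
# {'downtime_hours': 18.0, 'mtbf_hours': 245.5, 'recurrence_rate': 0.5}
```

Tracked before and after a corrective action, a falling recurrence rate is one way to show management that the root cause, and not just a symptom, was addressed.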

Looking ahead in the maintenance and reliability industry, data analytics and AI will continue to make the RCFA process more efficient by organizing work-order histories, sensor data, inspection notes, failure codes, and operator observations into clearer patterns for review. Where the process once depended on a very seasoned RCFA facilitator, it will be supported by AI tools that can summarize evidence, surface recurring failure modes, suggest likely contributing factors, and help teams validate corrective actions without replacing human judgment.

Root causes hide in plain sight—but only evidence reveals them. While technology helps, it will take the curiosity and discipline of an RCFA facilitator to shift a culture from reactive to proactive problem-solving.

In a world where time is tight and teams are stretched, RCFA might be the most valuable tool we have. Transformation is not a distant dream. It’s something that happens every day, with every decision we make, with every failure we choose to learn from, with every standard we raise, and with every person who dares ask, “Why does this keep happening?”

Imagine a day when you go into work and failures don’t just get fixed—they are understood, dissected, and transformed into lessons that make your team stronger. Stand up. Take action. Every failure has a story—RCFA helps you find the real ending.