What is the real cost of a failure? Unfortunately, we don't know until after the failure has occurred - and reliability is about avoiding the failure. So, here is our quandary: How much is a non-event worth? The manufacturing reliability engineering world is replete with highly competent and well-trained people who are accustomed to defining their world in parametric and deterministic terms.

To predict the future under current conditions, and then justify the changes that create a new and more profitable future, manufacturing reliability professionals must get comfortable defining their world in probabilistic and non-parametric terms.

We know two things: every manufacturing process will fail, and the failure will have some impact on the organization. Likewise, there are two things we don't know: when the process will fail and how serious the impact on the organization will be.

In my experience, reliability pros tend to gravitate to one of two extremes when defining the impact of a failure. At one extreme, the conservative reliability engineer claims only the avoided cost for parts, presuming that the labor cost is sunk, that there is no downtime cost because we can't assume that the plant's production capacity will be sold out, and that we can't assume any risk-based costs, such as personal injury, environmental impact, etc. At the other extreme, overzealous reliability engineers claim that had they not detected the potential bearing failure and scheduled it for repair, the pump would have failed, creating a total loss of sold-out production and leading to a fire that blew the plant up, killed everyone, created a major environmental waste zone and caused the earth to discontinue rotating on its axis! I jest, but the point is that the truth lies somewhere in between.

Figure 1. The true cost of failure can't be estimated deterministically until after the fact. For planning purposes, a probabilistic approach must be utilized.

At its core, reliability is akin to risk management. Risk managers view the world in a probabilistic, non-parametric way because that is the only reasonable way in which to attempt to predict the future. Borrowing a chapter from risk management 101 and several associated standards relating to the topic, I've created a model to illustrate for you how to estimate the cost of a functional failure of a manufacturing process. This model is illustrated in Figure 1. Here are step-by-step instructions for creating and using risk-adjusted failure cost models.

Figure 2. Estimated annual failure cost after deploying monitoring, planning and reliability improvements

1) Create failure severity-based cost estimates: As Figure 1 illustrates, each functional failure has associated costs. These might include parts, labor, downtime, risk-based costs, etc. The key is to create failure severity-based cost models. A high-severity event may cause you to incur significant downtime costs and/or collateral costs, whereas moderate and minor events produce a lesser impact on the organization. In my example model, a high-severity failure costs $15,000 per event, a moderate-severity failure costs $4,500 and a low-severity failure costs $2,200. This is not to say that every event judged to be moderate in impact will cost exactly $4,500 (remember, we have our non-parametric thinking caps on); each figure is a weighted average within its failure severity category.

I've opted for three severity classifications, which is my typical approach in working with clients. You may create as many categories as you wish. However, there is a diminishing marginal return in usefulness for each additional category. Also, I suggest that you resist the temptation to discount labor costs based on the logic that they are sunk. The truth is that labor is a variable cost. If manufacturing processes become more automated and reliable, we simply require fewer people to operate and maintain them, period.

2) Create probability weighting factors: In my example, I've assumed that 10 percent of my failure events are high severity, 20 percent are moderate severity and 70 percent are low severity. Multiply the total failure cost for each severity category by its associated likelihood estimate and sum the products to produce the weighted average total cost for a failure event. In my example, high-severity events contribute $1,500 to the weighted average, while moderate- and low-severity events contribute $900 and $1,540, respectively, for a total of $3,940 per event. Does this suggest that our next failure will cost precisely $3,940? Of course not. Again, we're thinking probabilistically and non-parametrically.
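The weighting in step 2 can be sketched in a few lines of Python. This is a minimal illustration using the example figures above, not a prescribed tool; the names BASELINE, weighted_cost_per_event and contributions are hypothetical, and likelihoods are entered as whole percentages (an assumption made here to keep the arithmetic exact).

```python
# Severity categories from the example model:
# name -> (cost per event in dollars, likelihood in whole percent).
BASELINE = {
    "high": (15_000, 10),
    "moderate": (4_500, 20),
    "low": (2_200, 70),
}


def weighted_cost_per_event(model):
    """Return the likelihood-weighted average cost of one failure event."""
    total_percent = sum(percent for _, percent in model.values())
    if total_percent != 100:
        raise ValueError(f"likelihoods must sum to 100 percent, got {total_percent}")
    return sum(cost * percent for cost, percent in model.values()) / 100


def contributions(model):
    """Return each severity category's contribution to the weighted average."""
    return {name: cost * percent / 100 for name, (cost, percent) in model.items()}


if __name__ == "__main__":
    for name, dollars in contributions(BASELINE).items():
        print(f"{name:>8}: ${dollars:,.0f}")
    print(f"   total: ${weighted_cost_per_event(BASELINE):,.0f}")
```

Run against the example inputs, the contributions are $1,500, $900 and $1,540, for a weighted average of $3,940 per event. The check that the likelihoods sum to 100 percent is a guard added for this sketch; it catches the most common data-entry slip when you add or remove severity categories.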

3) Estimate the number of events per year: In the financial world, cost benefit analysis is based upon annualized costs and benefits. So, we need to estimate how many failure events of this type we might expect in a year. In my example, we expect two events per year. So, our estimated average annual failure cost for this functional failure mode is $7,880 (two events at $3,940 each). Any mitigating actions you take will change the severity distribution, reduce the number of failure events per year (increasing the mean time between failures [MTBF] or mean time to failure [MTTF]), or both.
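The annualization in step 3 is a single multiplication, sketched below with the example figures. The function names are hypothetical. The MTBF conversion is my own simplification, not part of the model above: it assumes failures are spread evenly through the year, so treat it as a rough cross-check rather than a reliability calculation.

```python
def annual_failure_cost(cost_per_event, events_per_year):
    """Estimated average annual cost for one functional failure mode."""
    return cost_per_event * events_per_year


def mtbf_months(events_per_year):
    """Approximate mean time between failures, in months.

    Assumes events are spread evenly across the year (a simplification).
    """
    if events_per_year <= 0:
        raise ValueError("events_per_year must be positive")
    return 12 / events_per_year


if __name__ == "__main__":
    print(f"annual cost: ${annual_failure_cost(3_940, 2):,.0f}")
    print(f"approximate MTBF: {mtbf_months(2):.0f} months")
```

With $3,940 per event and two events per year, the annual cost is $7,880 and the implied MTBF is about six months.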

4) Changing the likelihood distribution: By and large, planning tools tend to modify the likelihood distribution. For example, protective monitoring, inspections and predictive monitoring help us to detect problems in their incipient stages, before they're given a chance to escalate to severe or catastrophic levels. Likewise, effective planning, scheduling and work management processes ensure that the detected problems are dealt with. These measures don't affect the base failure rate, but do tend to affect the likelihood distribution, decreasing the likelihood that an event will be high severity while increasing the likelihood that it will be a low-severity event.

In our example, if we improve our ability to detect and manage failures, we're estimating that the likelihood of a high-severity event reduces from 10 percent to 2 percent, the likelihood of a moderate-severity event decreases from 20 percent to 5 percent, while the likelihood of a low-severity event increases from 70 percent to 93 percent. Redistributing the severity likelihoods reduces the estimated weighted average cost per event from $3,940 to $2,796 (Figure 2).

5) Changing the failure rate: Proactive measures, on the other hand, influence the reliability of the manufacturing process, decreasing the failure rate. Proactive condition control and monitoring to improve lubrication, contamination control, balance and alignment, as well as precision operation and maintenance actions guided by documented standard operating procedures (SOPs) and standard maintenance procedures (SMPs), reduce the rate at which failures occur. In our example, we're estimating that we can reduce the failure rate from two failures per year to one. Combining the lower cost per event from better detection and management with the lower failure rate from our reliability improvement initiatives, we expect to reduce our annualized failure cost from $7,880 to $2,796 (Figures 1 and 2).

So, in our example, improving our ability to detect and manage failures produces a net benefit of $1,144 per event ($3,940 less $2,796) for the specified functional failure mode, or $2,288 per year at the original rate of two events per year. Combined with our reliability improvement initiatives, which cut the rate to one event per year, the net gain is $5,084 per year ($7,880 less $2,796). As long as the reliability investments required to make the changes produce an appropriate rate of return to the organization, the initiatives should be a go.
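To compare the scenarios side by side, here is a small sketch that takes the per-event costs from Figures 1 and 2 as given inputs rather than recomputing them. The Scenario class, its field names and the annual_saving helper are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    cost_per_event: int   # dollars, weighted average taken from the figures
    events_per_year: int  # estimated failure events per year

    @property
    def annual_cost(self):
        """Annualized failure cost for this scenario."""
        return self.cost_per_event * self.events_per_year


def annual_saving(before, after):
    """Reduction in annualized failure cost when moving from one scenario to another."""
    return before.annual_cost - after.annual_cost


baseline = Scenario("baseline", 3_940, 2)
detection_only = Scenario("better detection and planning", 2_796, 2)
detection_plus_reliability = Scenario("detection plus reliability improvements", 2_796, 1)

if __name__ == "__main__":
    for scenario in (baseline, detection_only, detection_plus_reliability):
        saving = annual_saving(baseline, scenario)
        print(f"{scenario.name}: ${scenario.annual_cost:,} per year "
              f"(saves ${saving:,} versus baseline)")
```

With these inputs, the annualized cost falls from $7,880 to $5,592 with better detection and planning alone (a saving of $2,288 per year), and to $2,796 once the failure rate also drops to one event per year (a saving of $5,084). Setting these savings against the cost of the initiatives is what tells you whether the rate of return is appropriate.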

In this column, we've adapted risk management models to help us quantify the cost of a functional failure mode. In future issues, we'll dive into the field of decision making under uncertainty to discuss models for making estimates when there is little empirical data available. Then we'll bring it together to show you how to perform cost benefit analysis and present your proposed reliability improvement projects in a form that will cause your approval rating to dramatically improve.