MTBF is a calculation used to predict the time between failures of a piece of machinery. Below, we'll discuss the MTBF calculation, MTBF traps to be aware of and how to improve your MTBF.
Mean time between failures (MTBF) is a prediction of the time between the innate failures of a piece of machinery during normal operating hours. In other words, MTBF is a maintenance metric, represented in hours, showing how long a piece of equipment operates without interruption. It's important to note that MTBF is only used for repairable items and as one tool to help plan for the inevitability of key equipment repair.
Before you calculate MTBF, you need to understand how it affects reliability and availability. High reliability and high availability usually go together, but the terms are not interchangeable. Reliability is the ability of an asset or component to perform its required functions under certain conditions for a predetermined period of time. Put another way, it's the likelihood that a piece of machinery will do what it's meant to do with no failures. Think of an airplane; its mission is to safely complete a flight and get passengers to their destination with no catastrophic failures.
Availability is the time an asset or component is operational and accessible when it's needed for use. In other words, it's the likelihood that a piece of machinery is in a state to perform its intended function at any given time. Availability is determined by the reliability of a system and its recovery time when a failure does occur. Availability is usually looked at in tandem with reliability because, once a failure occurs, the critical variable switches to getting the asset up and running as quickly as possible.
MTBF is a basic measure of a system's reliability; the higher the MTBF, the higher the reliability of a product. This relationship is illustrated in the equation: Reliability = e^(-time/MTBF).
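To make the reliability equation concrete, here is a minimal Python sketch. It assumes a constant failure rate, and the 100-hour run time and the two MTBF values are made-up figures for illustration only.

```python
import math


def reliability(run_hours: float, mtbf_hours: float) -> float:
    """Likelihood of running for run_hours with no failure.

    Implements Reliability = e^(-time/MTBF), which assumes the asset
    fails at a constant rate.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return math.exp(-run_hours / mtbf_hours)


# Hypothetical figures: a 100-hour run on two assets with different MTBFs.
for mtbf_hours in (500, 1000):
    r = reliability(100, mtbf_hours)
    print(f"MTBF {mtbf_hours} h -> reliability over 100 h: {r:.3f}")
```

Doubling the MTBF raises the likelihood of a failure-free 100-hour run from roughly 0.82 to roughly 0.90. Note also that when the run time equals the MTBF, the equation gives about 0.37, so an asset is more likely than not to fail before it reaches its MTBF.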
There are a few variations of MTBF you may encounter. They are mean time between system aborts (MTBSA), mean time between critical failures (MTBCF) and mean time between unscheduled removal (MTBUR). You'll most likely see these variations when differentiating between critical and non-critical failures.
MTBF is calculated by taking the total time an asset is running (uptime) and dividing it by the number of breakdowns that happened over that same period of time.
MTBF = Total uptime / # of Breakdowns
Broken down, the MTBF calculation might look like this:
So, what does this tell us? In this example, the MTBF isn't suggesting that each widget should last 800 hours. It's saying if you run a group of widgets, the average time between failures within the tested group is 800 hours. In other words, MTBF isn't meant to predict the behavior of a single component; it predicts the behavior of a group of components.
It's important to understand that "time" may not always mean clock time; it could be the time in which the system is actually being used. For example, a machine that runs eight hours a day might last three times as long on the calendar as the exact same machine running 24 hours a day. The MTBF for both machines is the same because they both endure the same number of operating hours before failing.
Let's look at another example of the MTBF calculation. Let's say you have a bottling machine designed to operate for 12 hours a day. The bottling machine breaks down after operating normally for 10 days. The MTBF in this example is 120 hours.
MTBF = (12 hours per day x 10 days) / 1 breakdown = 120 hours
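The basic formula can be sketched as a small Python function. The function and variable names here are illustrative choices, not part of any standard library.

```python
def mtbf(total_uptime_hours: float, breakdowns: int) -> float:
    """MTBF = total uptime / number of breakdowns."""
    if breakdowns <= 0:
        raise ValueError("MTBF needs at least one breakdown in the period")
    return total_uptime_hours / breakdowns


# Bottling machine: 12 hours a day for 10 days, then one breakdown.
bottling_uptime_hours = 12 * 10
print(mtbf(bottling_uptime_hours, 1))  # 120.0
```

The guard against zero breakdowns reflects a practical limit of the formula: with no failures in the observation period, there is nothing to divide by, so no MTBF can be computed from that data alone.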
The MTBF calculation requires more steps when you have longer periods of time with more failures. For example, say the bottling machine that operates for 12 hours a day fails twice in 10 days, a period of 120 scheduled operating hours. The first failure occurred 20 hours from the start time and took two hours to repair. The second failure happened 60 hours from the start time and took three hours to repair. Calculating the total uptime for the MTBF equation requires adding 20 hours (the initial uptime period), 38 hours (the start of the second downtime period at hour 60 minus the end of the first downtime period at hour 22) and 57 hours (the end of the 120-hour period minus the end of the second downtime period at hour 63).
So, now the MTBF calculation looks like this: MTBF = (20 hours + 38 hours + 57 hours) / 2 breakdowns, or 115 hours / 2 breakdowns = 57.5 hours.
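The uptime bookkeeping in this example can be sketched in Python as follows. This is a minimal illustration, assuming failure times and repair times are logged on the same operating-hours clock as the 120-hour period; the `uptime_segments` helper and the `(failure_hour, repair_hours)` log format are hypothetical.

```python
def uptime_segments(period_hours, failures):
    """Split an observation period into stretches of uptime.

    failures: list of (failure_hour, repair_hours) tuples, sorted by
    failure_hour and measured on the same clock as period_hours.
    """
    segments = []
    back_in_service = 0  # hour at which the asset last (re)started
    for failure_hour, repair_hours in failures:
        segments.append(failure_hour - back_in_service)
        back_in_service = failure_hour + repair_hours
    # Uptime from the end of the last repair to the end of the period.
    segments.append(period_hours - back_in_service)
    return segments


failure_log = [(20, 2), (60, 3)]  # two failures from the bottling example
period_hours = 12 * 10            # 12 hours a day for 10 days

segments = uptime_segments(period_hours, failure_log)
total_uptime = sum(segments)
mtbf_hours = total_uptime / len(failure_log)

print(segments)      # [20, 38, 57]
print(total_uptime)  # 115
print(mtbf_hours)    # 57.5
```

Keeping the individual segments, rather than only subtracting total repair time from the period, makes it easier to check each uptime stretch against the maintenance log.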
One of the biggest misconceptions about MTBF is that it is the same thing as the number of operating hours before failure, or "service life." If you get an extremely high MTBF number (not uncommon), you might think there's no way the system can operate that long without a failure. MTBF numbers run high because they are mostly based on the asset's rate of failure while that asset is still in its "normal" or "useful" life, and they assume it will keep failing at that rate forever. For this reason, you shouldn't expect service life and MTBF to correlate. You can have a piece of equipment with a very high MTBF but a low expected service life.
A good example of this, using human beings, is laid out by Wendy Torell and Victor Avelar in their whitepaper Mean Time Between Failure: Explanation and Standards. Say you have 500,000 25-year-olds in a sample population. Over the span of one year, data is collected on failures (deaths) for this population. The population's operational life is 500,000 x 1 year = 500,000 people-years. Over the course of the year, 625 people failed (died). This brings the failure rate to 625 failures / 500,000 people-years = 0.125% / year. So, our MTBF is 1 / 0.00125 = 800 years.
This shows us that, even though 25-year-old humans have high MTBF values, their life expectancy (service life) is a lot shorter and doesn't correlate.
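The arithmetic behind the 800-year figure can be reproduced with a few lines of Python. The numbers come straight from the example above; the variable names are illustrative.

```python
population = 500_000       # 25-year-olds in the sample
observation_years = 1      # length of the study
failures = 625             # deaths during the study

# Total exposure of the group, in people-years.
exposure_people_years = population * observation_years

# Failures per person per year, valid only for this age band.
failure_rate = failures / exposure_people_years

# MTBF is the reciprocal of the failure rate (exposure / failures).
mtbf_years = exposure_people_years / failures

print(f"{failure_rate:.3%} per year")  # 0.125% per year
print(mtbf_years)                      # 800.0
```

The result is an average over the whole group during one year of its "useful life." It says nothing about how long any one member lasts, which is why it can sit so far above a realistic service life.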
Humans, like machines, don't exhibit a constant failure rate. As humans age, more failures occur (our bodies wear out). Since this is the case, the only way to calculate MTBF so it correlates with service life would be to wait for the whole population of 25-year-olds to reach the end of their life; then the average lifespans can be calculated. This puts that number at around 75-80 years.
So, is the MTBF for 25-year-olds 80 or 800? Torell and Avelar explain that it's all about assumptions. In this case, the MTBF of 80 years more accurately reflects the life of the product (humans). When it comes to things like tracking products from machinery, you have many more variables, the biggest of which is time.
The impacts of machine failure can be significant. It leads to lost production and increased time spent on maintenance. Getting to the root cause of failures is the best way to find, mitigate or even prevent future occurrences, all while increasing your MTBF in the process. There are a few ways you can increase MTBF.
It's important to know the potential issues that could arise from an MTBF calculation when using it for reliability analysis. MTBF can differ depending on how you define certain things like "failure" and "operation time" as well as whether you measure individual pieces of equipment or a whole process.
Along those same lines, should you consider operating time for your equipment as any time the equipment is turned on or only when it's operating under normal workloads? If you choose the former for your MTBF calculation, your MTBF value would be higher, but that value wouldn't be representative of machinery that runs continually under normal workloads and hardly ever idles. That's why it's important to define operating time for all assets you intend to use with MTBF.
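A short Python sketch shows how much the choice of definition can move the result. All of the figures below are hypothetical, chosen only to illustrate the effect.

```python
def mtbf(operating_hours: float, breakdowns: int) -> float:
    """MTBF = operating time / number of breakdowns."""
    if breakdowns <= 0:
        raise ValueError("MTBF needs at least one breakdown in the period")
    return operating_hours / breakdowns


# Hypothetical log for one machine over the same observation period.
powered_on_hours = 200  # any time the machine was switched on
loaded_hours = 150      # only time spent under normal workload
breakdowns = 3

mtbf_powered_on = mtbf(powered_on_hours, breakdowns)
mtbf_loaded = mtbf(loaded_hours, breakdowns)

print(f"Counting powered-on time: {mtbf_powered_on:.1f} h")
print(f"Counting loaded time only: {mtbf_loaded:.1f} h")
```

With the same three breakdowns, counting idle time lifts the MTBF from 50 hours to about 67 hours, so two sites using different definitions could report very different numbers for identical equipment.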
Those who choose to measure an entire process for an MTBF calculation often find they can't achieve a high MTBF value due to "bad actors." It's recommended to test each piece of equipment to eliminate this issue.
If you consider these potential issues ahead of time, MTBF can still be a useful tool when evaluating the reliability of your assets.