Root Cause Analysis and Implementation: A Case Study

Ashley Troyer, Lockheed Martin

In 2008, Lockheed Martin’s facilities maintenance and operations team had a counter-flow, mechanical draft cooling tower designed and installed onsite. The cooling tower contains four subcells, each consisting of a motor, gearbox and fan assembly.

This equipment is the heat sink for a chiller plant that provides cooling to 90 percent of the site’s buildings, including top-priority production equipment and comfort cooling for a majority of employees. The cooling tower has a strict run-time schedule depending on the time of year.

Due to the facility’s location in Orlando, Florida, cooling capability ranks as the No. 1 critical function, tied with electrical power to the site.

Upon installation, an unidentified gearbox experienced infant mortality caused by an existing production defect. Four years later, in 2012, a second unidentified gearbox failed. In December 2016, gearbox No. 2 failed.

The next year, in July 2017, gearbox No. 4 failed. All four of these failures were met with a quick, reactive response: the gearboxes were replaced immediately in order to return to 100 percent cooling capability.

In October 2018, the maintenance team was called to cooling tower cell No. 3 because the building management system software showed a demand for cooling but was receiving zero output. During this inspection, it was discovered that the motor-to-gearbox coupling had completely sheared, leaving behind a pile of rubber dust.

While the team replaced the coupling, one of the technicians noticed that the fan for cell No. 1 was sitting lopsided. The fan shaft could be rocked from side to side by hand. Gearbox No. 1 had catastrophically failed, and although the original trouble cell was repaired, cooling capabilities dropped to 75 percent.

The Turning Point

The unexpected gearbox failure led to a nightmare scenario for both operations and maintenance. A replacement part and crane rental were placed on rush order. Due to the chilled-water valve placement, half of the cooling tower had to be locked out, dropping cooling capabilities even further to 50 percent.

The chiller plant relied heavily on the previously cooled expansion tank while outdoor temperatures reached more than 80 degrees F. This decrease in cooling operations created extra pressure and urgency to complete the project.

Throughout the entire maintenance task, the team was constantly asked for status updates by customers. If the team did not act fast enough, the possibility existed for site-wide downtime.

Once the rush was over, the new coupling and gearbox were installed, and cooling returned to normal. However, the failures would soon be confronted by a new group of maintenance managers, reliability engineers and technicians. This team asked, “Why is this happening?”

This was a question that seemingly had never been asked before. Regardless, enough was enough, and we all wanted to get to the bottom of the situation.

The Investigation

For the first time, a root cause analysis and forensic investigation were conducted on the failed gearbox. Over a span of three days, the component was taken apart piece by piece, cleaned and closely inspected. Although resources were limited, there was enough evidence to paint a picture of the failure.

Symptoms of misalignment were found in multiple locations, including the gearbox casing and shaft roller bearings. The misalignment also caused the upper roller bearing to break apart. The rolling elements that fell to the bottom of the gearbox led to severe damage of the gear teeth.

A secondary issue was also evident on the component parts: black markings that indicated water in the oil and gearbox. When the gearbox was taken out of the cooling tower, an oil sample was sent for analysis. The results supported the physical evidence.

The report revealed 2,984 parts per million of iron particles, while the Karl Fischer test indicated 19.6 parts per million of water.

Once the gearbox teardown was complete, a walkdown of the cooling tower was conducted, and conditions contributing to the failure were identified. Each of the gearboxes was installed with a breather tube that extended from the top of the gearbox to the top of the cooling tower, near the fan outlet.

It was discovered that none of the breather tubes had desiccants and that the lines were either crushed or broken off completely. The placement of the tubes and the lack of desiccant breathers indicated the gearbox was exposed to the humid air discharged by the fan.

We also realized that the oil indicators provided by the original equipment manufacturer (OEM) were inadequate for inspection while the tower was in operation. It was even difficult to see the oil condition up close.

Our team contacted the gearbox manufacturer for missing information. We discussed the design life expectancy and learned that the limiting components were the bearings. Given the scheduled operating times, the bearings should last approximately 11.5 years, while the gear teeth should have an infinite design lifetime.

We also requested the gearbox’s natural frequency and were told that it had a band of 31 to 36 Hertz. With this information, we pulled the cooling towers’ operation log and found that our variable-frequency drives (VFDs) ran the gearboxes in the natural frequency band regularly and for long periods of time. This meant the gearboxes were exposed to excessive vibration due to operations.
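The log review described above can be sketched in a few lines of Python. This is an illustration only: the function name, the log layout (a list of VFD output frequencies sampled at a fixed interval) and the sample values are assumptions, not the site's actual data or tooling. Only the 31 to 36 Hertz band comes from the manufacturer.

```python
# Natural frequency band reported by the gearbox manufacturer (Hz).
RESONANCE_BAND_HZ = (31.0, 36.0)


def fraction_in_band(frequencies_hz, band=RESONANCE_BAND_HZ):
    """Return the share of equally spaced log samples that fall inside the band."""
    if not frequencies_hz:
        return 0.0
    low, high = band
    in_band = sum(1 for f in frequencies_hz if low <= f <= high)
    return in_band / len(frequencies_hz)


# Hypothetical VFD output log, one sample per interval (made-up values).
sample_log_hz = [28.0, 31.5, 33.0, 35.9, 40.0, 45.0, 34.2, 60.0]
share = fraction_in_band(sample_log_hz)
print(f"{share:.0%} of samples fall inside the resonance band")
```

Because the samples are assumed to be equally spaced, the share of samples in the band approximates the share of run time spent there.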

Actions Taken

Once the causes of the failures were identified, our team immediately decided to have the new gearbox aligned. The plant mechanical engineer verified the cooling operations and allowed the maintenance team to shut down cooling tower cell No. 1 until a precision state was achieved.

We contacted several companies for alignment pricing and concluded that we could purchase the alignment equipment and provide training to our technicians for the same price as one third-party alignment. Not only did the technicians learn how to perform laser alignments, but they also learned how to take vibration readings and balance all the rotating equipment, including the tower fans.

The technicians were able to laser align the motors and gearboxes in all four subcells to within 0.002 inches, which resulted in vibration readings of 0.05 inches per second or less. These measurements fell well within the precision standard of 0.07 inches per second.
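As a rough illustration, readings like these can be screened against the 0.07 inches-per-second precision standard with a simple check. The function name and the per-subcell values below are hypothetical; only the threshold is taken from the text.

```python
# Precision standard cited in this article, in inches per second.
PRECISION_LIMIT_IPS = 0.07


def exceeds_precision_limit(readings_ips, limit=PRECISION_LIMIT_IPS):
    """Return the names of measurement points whose velocity exceeds the limit."""
    return [name for name, value in readings_ips.items() if value > limit]


# Hypothetical overall velocity readings per subcell (illustrative values only).
readings = {"cell 1": 0.05, "cell 2": 0.04, "cell 3": 0.20, "cell 4": 0.03}
flagged = exceeds_precision_limit(readings)
print(flagged)
```

A check of this kind only flags a point for follow-up; diagnosing the cause still requires looking at the frequency content of the vibration.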

Our team has since upgraded the oil sight glasses from flat windows to versions that extend outside the gearbox. The new sight glasses have red and green bands that make it easy to confirm the oil level is within tolerance, and they allow technicians to pull oil samples from the glass and drain any water.

This upgrade enables technicians to verify the oil level according to the preventive maintenance plans without shutting down the equipment unnecessarily.

The design and engineering team is currently in the process of developing a new breathing tube layout that will extend out the sides of the cooling tower with a heavy-duty desiccant that can last for several months under the extreme conditions.

This new layout will keep the desiccant away from the moisture being blown out by the fans and minimize the presence of water in the gearbox.

Knowing the gearboxes’ natural frequency has enabled us to create a prohibited operation band on the cooling tower VFDs, preventing the tower from running in the 31 to 36 Hertz frequency band. This will help avert any damage caused by excessive vibration.
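The idea behind a prohibited operation band can be sketched as follows. Many drives offer this as a built-in skip-frequency setting, so the code is not how the site's VFDs were configured; the function name and the nearest-edge rule are assumptions made for illustration.

```python
def apply_skip_band(command_hz, band=(31.0, 36.0)):
    """Move a speed command that lands inside the band to the nearest band edge.

    Commands outside the band pass through unchanged. A command exactly at
    the midpoint is sent to the upper edge.
    """
    low, high = band
    if low < command_hz < high:
        midpoint = (low + high) / 2
        return low if command_hz < midpoint else high
    return command_hz


# Illustrative speed commands in Hz (made-up values).
for requested in (25.0, 32.0, 35.0, 40.0):
    print(requested, "->", apply_skip_band(requested))
```

In this sketch the drive may still sit at the edge of the band; a real configuration might add a margin on either side or choose the edge based on ramp direction.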

All the gearboxes have been placed on a vibration analysis route, allowing their behavior to be tracked. The failed coupling on gearbox No. 3 led to a condition assessment of the subcell. The oil was sent for analysis and returned with a high particle count. Vibration readings were as high as 0.2 inches per second at the misalignment frequency.

The readings also indicated severe bearing damage. These results raised a red flag for a potential failure like gearbox No. 1. Using these predictive measures gave our team the ability to plan for a gearbox replacement during a planned site outage over a weekend in December 2018. The new gearbox in subcell No. 3 was aligned and balanced to a precision state and given a new oil sight glass.

Finally, the preventive maintenance plan for the cooling tower has been rewritten to target and correct the appropriate equipment parameters and needs. Specific measurements and points have been clearly identified, and the maintenance plan frequencies have been developed to minimize equipment intrusion while painting a clear picture of current conditions.

Moving Forward

The root cause analysis and process improvement have caused a significant change in the maintenance culture. Prior to the catastrophic failure, the cooling tower was considered the worst actor and a money pit. During the analysis and training phase of the cooling tower improvements, there was confusion about and resistance to the additional, more challenging steps.

Technicians were not sure if the alignments would actually improve the tower operations. The day they completed the first alignment on gearbox No. 1, removed their red locks and turned the controller to “auto,” they became instant believers.

The subcell ran so quietly that you could only hear the water flowing over the fill. People kept opening the side-access door to see if the fan was indeed running. It was at that moment our entire culture shifted from “that’s how we’ve always done it” to “let’s get to precision maintenance.”

Since then, all the technicians have continued to drive the “why” and “how” questions until a maintenance project is completed with all the appropriate actions taken. Now that the technicians are trained in vibration, alignment and balancing, these practices have been implemented with all critical equipment, and we have begun to move to the next level of priority equipment.

Our team has calculated tens of thousands of dollars in savings and reported them to both customers and management. Other Lockheed Martin sites are beginning to see the progress we’ve made and are following our example.

The cooling tower issues that were resolved have highlighted the importance of a reliability engineer analyzing maintenance data, the purpose and need for root cause analysis, and the benefits of predictive and precision maintenance with condition monitoring.

These changes have also given us a larger window to prepare for failures and are preventing unexpected downtime scenarios, which continues to increase customer satisfaction.

This article was previously published in the Reliable Plant 2019 Conference Proceedings.