Reliability Considerations and Fault Handling Strategies for Multi-MW Modular Drive Systems
Shunt-interleaved electrical drive systems consisting of several parallel medium-voltage back-to-back converters enable power ratings of tens of MVA, low current distortion and a very smooth airgap torque. To meet stringent reliability and availability goals despite the large part count, the modularity of the drive system needs to be exploited, and a suitable fault handling strategy is required that allows the exclusion and isolation of faulted threads. This avoids shutting down the complete system and enables the drive system to continue operating. If full power capability is also required in degraded mode operation, redundancy on a thread level has to be added. Experimental results confirm that thread exclusion allows the isolation of the majority of the faults without affecting the mechanical load. While the drive system continues to run, faulted threads can be repaired and added on-the-fly to the running system by thread inclusion. As a result, the downtime of such a modular drive system is expected not to exceed a few hours per year.
One way to achieve very high power levels in the range of tens of MVA is to operate several back-to-back converters (threads) in parallel, with each thread being a three-level medium-voltage Neutral Point Clamped (NPC) converter using Integrated Gate Commutated Thyristors (IGCTs) as switching devices. When a very smooth machine current and airgap torque are required, the threads can be coupled with inductors in the range of a quarter pu when referred to the thread. This allows one to apply thread-specific switching patterns. Shifting the PWM carrier signals by a certain phase shift among the threads gives rise to the concept of interleaving. As a result, the system's machine and grid voltages effectively resemble those of a 2N+1 level inverter with N times the power rating [1]. Compared to the standard application with one thread, the part count of the N-thread system is increased by approximately a factor of N. This adversely affects the system reliability and availability if no appropriate countermeasures are taken. On the other hand, the parallel thread arrangement can be exploited to add redundancy. Moreover, in the event of severe faults, the coupling inductors limit the instantaneous currents, confine the fault to one thread, and thus allow the isolation and removal of the faulted thread by an appropriate fault handling strategy.
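The 2N+1 level count can be made plausible with a small enumeration. The sketch below is an idealisation, not the authors' modulation scheme: it assumes identical coupling inductors, so that the voltage at the common coupling point is the average of the thread phase voltages, and it uses normalised NPC output levels of -1, 0 and +1. The function name `interleaved_levels` is hypothetical.

```python
from itertools import product


def interleaved_levels(n_threads):
    """Distinct values of the averaged phase voltage of n three-level threads.

    Assumption: equal coupling inductors, so the combined voltage is the
    arithmetic mean of the individual thread voltages.
    """
    thread_levels = (-1, 0, 1)  # normalised NPC phase output levels
    sums = {sum(combo) for combo in product(thread_levels, repeat=n_threads)}
    return sorted(s / n_threads for s in sums)


for n in (1, 2, 3):
    levels = interleaved_levels(n)
    # The number of distinct levels equals 2N+1 for every N.
    print(f"N={n}: {len(levels)} levels -> {levels}")
```

Which of these levels are actually visited at a given instant depends on the carrier phase shifts, which the sketch does not model.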
Reliability considerations and an optimized fault handling strategy for shunt-interleaved drive systems are the focus of this paper. The modular structure of the drive system with its N threads is shown in Fig. 1.
II. Reliability Modelling
The drive system comprises a common part (synchronous machine, master control, etc.) and N parallel threads. For our reliability calculations, the drive is conveniently modeled using the two aggregated blocks "common part" and "thread", as shown in Fig. 2. A bottom-up analysis using failure rates from the relevant literature [3], [4], from manufacturers and from field experience yields the common part failure rate $\lambda_{com}$ = 12'920 FIT and the thread failure rate $\lambda_{thr}$ = 23'300 FIT. Here, FIT means Failures In Time, where 1 FIT is equal to one failure within one billion ($10^9$) hours. The detailed derivation of $\lambda_{com}$ and $\lambda_{thr}$ will be included in the final paper.
III. Reliability Analysis
For a drive system with N threads and without redundancy, every failure shuts down the complete system. The sum of the component failure rates yields the system failure rate $\lambda_{sys} = \lambda_{com} + N \lambda_{thr}$, and its inverse is the system Mean Time Between Failures (MTBF). Assuming the above failure rates, the resulting system MTBF of a standard drive with one thread amounts to 27'609 h or 3.15 years. This is close to the 31'000 h reported in the literature for a one-thread 8-10 MW drive system. As the number of threads is increased, the system MTBF quickly drops to low values. For three threads, for example, the system is expected to fail on average every 1.4 years.
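The non-redundant figures can be reproduced with a few lines of arithmetic. The sketch below uses the two failure rates quoted in Section II; the function and constant names are illustrative, and a year is taken as 8760 hours.

```python
FIT_TO_PER_HOUR = 1e-9   # 1 FIT = one failure per 1e9 hours
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year

LAMBDA_COM_FIT = 12_920  # common part failure rate from Section II
LAMBDA_THR_FIT = 23_300  # thread failure rate from Section II


def system_mtbf_hours(n_threads,
                      lambda_com_fit=LAMBDA_COM_FIT,
                      lambda_thr_fit=LAMBDA_THR_FIT):
    """MTBF of a non-redundant drive: inverse of the summed failure rates."""
    lambda_sys_fit = lambda_com_fit + n_threads * lambda_thr_fit
    return 1.0 / (lambda_sys_fit * FIT_TO_PER_HOUR)


for n in (1, 2, 3):
    mtbf_h = system_mtbf_hours(n)
    print(f"N={n}: MTBF = {mtbf_h:,.0f} h = {mtbf_h / HOURS_PER_YEAR:.2f} years")
```

For N=1 this gives the 27'609 h stated above, and for N=3 roughly 1.4 years.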
To boost the reliability, redundancy can be introduced. On a thread level, this is achieved by adding one redundant thread to the existing N threads. The results in [2], where a Markov model is developed for a system with N+1 identical components of which at least N are required for the whole system to work, together with the thread repair rate $\mu_{thr} = 1/MTTR_{thr}$, give the system MTBF, referred to as (1) in the following.
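A form of this expression that is consistent with the numbers quoted in this section is given below. It is an assumption rather than a quotation: it combines the standard mean-time-to-failure result of a repairable Markov chain for an N-out-of-(N+1) group with a series connection to the common part, and the symbol $MTBF_{red}$ for the redundant thread group is introduced here for illustration only.

```latex
\begin{align}
MTBF_{red} &= \frac{(2N+1)\,\lambda_{thr} + \mu_{thr}}{N\,(N+1)\,\lambda_{thr}^{2}} \\
MTBF_{sys} &= \frac{1}{\lambda_{com} + 1/MTBF_{red}}
\end{align}
```

The first line follows from the chain with the transition rates $(N+1)\lambda_{thr}$ from the healthy state to the state with one failed thread, $\mu_{thr}$ back to the healthy state, and $N\lambda_{thr}$ to the failed state.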
Using the above example, a three-thread system with one additional redundant thread and a Mean Time To Repair of the threads ($MTTR_{thr}$) of 7 days yields an MTBF of 8.2 years. To better understand (1), the impact of redundancy and the influence of a varying thread MTBF, consider N=3 threads and refer to Fig. 3. Considering first a non-redundant system, one can see that the overall system MTBF (red line) is clearly dominated by the thread MTBF (magenta line). Increasing the thread MTBF significantly increases the system MTBF. However, even for a thread MTBF of 100'000 hours, the overall system MTBF is limited to about 23'000 hours.
Next, redundancy is added on a thread level. Consider a system that comprises N+1 parallel threads without the common part. In this system, one thread is redundant. As shown in Fig. 3, the additional thread substantially boosts the MTBF (cyan vs. magenta line). This effect is further enhanced by reducing the MTTR from seven to two days, and by increasing the thread MTBF.
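The effect of the redundant thread and of the repair time can be checked numerically. The sketch below is not the authors' calculation: it assumes the standard Markov result for an N-out-of-(N+1) repairable group, MTBF = ((2N+1)λ + μ) / (N(N+1)λ²), and treats the common part as a series element. All function names are hypothetical; the failure rates are those of Section II.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year


def redundant_group_mtbf(n, thread_mtbf_h, thread_mttr_h):
    """MTBF of N+1 threads of which N are required (assumed Markov result)."""
    lam = 1.0 / thread_mtbf_h  # thread failure rate per hour
    mu = 1.0 / thread_mttr_h   # thread repair rate per hour
    return ((2 * n + 1) * lam + mu) / (n * (n + 1) * lam ** 2)


def redundant_system_mtbf(n, thread_mtbf_h, thread_mttr_h, common_mtbf_h):
    """Redundant thread group in series with the non-redundant common part."""
    lam_red = 1.0 / redundant_group_mtbf(n, thread_mtbf_h, thread_mttr_h)
    return 1.0 / (1.0 / common_mtbf_h + lam_red)


THREAD_MTBF_H = 1e9 / 23_300  # from the thread failure rate of 23'300 FIT
COMMON_MTBF_H = 1e9 / 12_920  # from the common part failure rate of 12'920 FIT

for mttr_days in (7, 2):
    mttr_h = 24.0 * mttr_days
    group = redundant_group_mtbf(3, THREAD_MTBF_H, mttr_h)
    system = redundant_system_mtbf(3, THREAD_MTBF_H, mttr_h, COMMON_MTBF_H)
    print(f"MTTR {mttr_days} d: threads only {group / HOURS_PER_YEAR:.0f} years, "
          f"with common part {system / HOURS_PER_YEAR:.1f} years")
```

Under these assumptions the seven-day case rounds to the 8.2 years quoted for the three-thread system with one redundant thread.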
However, as discussed before, the real system also comprises the common part failures (black line). Even though the MTBF of the common part is expected to be higher than the MTBF of an individual thread, it is still comparatively low, thus limiting the overall system MTBF of a redundant system (blue line). For the drive system considered here, the system MTBF essentially saturates at modest thread MTBFs of 30'000 h. Further improvements of the thread MTBF or a reduction of the MTTR do not noticeably improve the system MTBF. Instead, to enhance the system MTBF, the common part needs to be made more reliable.
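The saturation can be read off directly if the redundant thread group is lumped into one effective failure rate in series with the common part. The symbol $\lambda_{red}$ is introduced here for illustration and does not appear in the original text:

```latex
MTBF_{sys} = \frac{1}{\lambda_{com} + \lambda_{red}}
           < \frac{1}{\lambda_{com}} = MTBF_{com} = 77'400\ \mathrm{h}
```

Since $\lambda_{red} > 0$ for any finite thread MTBF and any nonzero repair time, the common part sets an upper bound that thread-level measures can approach but not exceed.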
Fig. 3: Various MTBFs as a function of the thread MTBF for N=3. The solid lines refer to an $MTTR_{thr}$ of 7 days, whereas the dash-dotted lines refer to an $MTTR_{thr}$ of 2 days. The $MTBF_{com}$ is 77'400 h. Note the logarithmic scale on the vertical axis.
Fig. 4: Summary of the three fault handling approaches A, B and C, and the evolution of the system states in the event of a fault. Light gray refers to the state of normal operation (whole system runs), hatched to the degraded mode operation (system runs, but one redundant thread failed) and black to the failed state (system failed).
IV. Failure Modes and Effects Analysis
A Failure Mode and Effects Analysis (FMEA) was carried out to analyze and understand the drive system's failure modes, their associated effects and their propagation paths. For a three-thread system, several hundred faults emerge, which can be classified into three categories (cf. Table 1). Severe (or destructive) thread faults necessitate adequate and quick corrective actions to limit cascading and/or escalating faults, while non-severe thread faults are less critical. The latter include pump failures, most thread control failures and trips due to the protection scheme (thermal, over-current, under-voltage, etc.). Faults in the common part of the drive system are by definition single points of failure, thus leading to a complete system shut-down. These faults could be further partitioned into severe and non-severe faults.
V. Fault Handling Approaches
Fig. 4 provides a pictorial summary of three viable fault handling approaches, whose characteristics and suitability are outlined hereafter.
1) Approach A: Traditional Non-Redundant System
Approach A applies to the standard non-redundant drive system, where the whole drive system needs to be shut down if any component fails. To boost the reliability, redundancy on a component level is often introduced, such as a second cooling water pump or a redundant thyristor in an LCI stack. Today, the vast majority of drives are installed and operated according to this cost-effective solution, which is suitable for applications where a slight reduction in availability is acceptable. Note that the availability is defined as MTBF / (MTBF + MTTR).
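The availability definition translates directly into an expected downtime per year. In the sketch below, the MTBF is the one-thread value from Section III, whereas the system-level MTTR of 24 h is purely hypothetical, since no repair time for the complete system is stated here.

```python
HOURS_PER_YEAR = 8760.0  # assumption: non-leap year


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Expected unavailable hours per year for the given MTBF and MTTR."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


MTBF_ONE_THREAD_H = 27_609.0  # one-thread system MTBF from Section III
ASSUMED_MTTR_H = 24.0         # hypothetical system repair time

print(f"availability = {availability(MTBF_ONE_THREAD_H, ASSUMED_MTTR_H):.5f}")
print(f"downtime = "
      f"{downtime_hours_per_year(MTBF_ONE_THREAD_H, ASSUMED_MTTR_H):.1f} h/year")
```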
2) Approach B: Redundant System with Automatic Restart in Degraded Mode
In many applications, an availability close to 100% is required. Yet, occasional drive system outages can be tolerated provided that they are short compared to the mechanical and/or thermal inertia of the driven process, thus ensuring that such short outages do not lead to a trip of the whole process. For such an application, a drive system with an automatic restart capability in degraded mode operation is a suitable choice. If sufficient redundancy on a thread level is available (i.e. the number of redundant threads is equal to or larger than the number of faulty threads), then full power operation is also available during degraded mode operation. Otherwise, the drive remains operational, yet at a reduced power level.
Fault category | Examples | Percentage of faults | Fault handling approach | Time of system shut-down
Non-severe thread faults | Pump failures, thread control failures, trips due to thermal protection, etc. | | C | Zero, since the system keeps running
Severe thread faults | Device shorts such as IGCTs, diodes, dc-link capacitors, etc. | | B | In the range of seconds
Common part faults | Master control failures, machine shorts, etc. | | A | Some seconds up to the MTTR
Table 1: Classification of faults and related fault handling approaches for a drive system with 3 parallel threads
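The redundancy condition of approach B can be written as a small helper. The sketch assumes that all threads are identical and share the load equally, and that rated power needs exactly `n_required` healthy threads; the function name and signature are illustrative.

```python
def degraded_mode_power_pu(n_required, n_redundant, n_faulty):
    """Available power in pu of rated power after excluding faulty threads.

    Assumption: identical threads with equal load sharing.
    """
    n_installed = n_required + n_redundant
    if not 0 <= n_faulty <= n_installed:
        raise ValueError("n_faulty must lie between 0 and the installed threads")
    n_healthy = n_installed - n_faulty
    # Full power as long as the redundant threads cover the faulty ones.
    return min(1.0, n_healthy / n_required)


print(degraded_mode_power_pu(3, 1, 1))  # one redundant thread covers one fault
print(degraded_mode_power_pu(3, 0, 1))  # no redundancy: reduced power level
```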
3) Approach C: Redundant System with Thread Exclusion
A few processes do not tolerate even very short drive outages in the range of seconds or below. For these, an on-the-fly thread exclusion capability is required that avoids shutting down the complete drive system in the event of a thread fault. Special attention needs to be paid to the bumpless torque transfer from normal operation to degraded mode operation, meaning that the torque transient should be as smooth as possible when a thread is excluded.
VI. Proposed Fault Handling Strategy
A fault handling strategy that combines all three approaches A, B and C seems to be an attractive choice. With modest software complexity, the majority of the thread faults (the non-severe faults) can be addressed by approach C, while the remaining few severe thread faults are handled by approach B. The common part faults require approach A. Since a large fraction of the common part faults requires no maintenance (grid faults, trips due to thermal protection of the machine, etc.), the system can be restarted shortly after a trip, thus limiting the system shut-down time. Moreover, from a cost-benefit point of view, this combined A-B-C strategy seems to be the optimum, since it maximizes the number of faults addressed by approach C, minimizes the complexity of the fault handling scheme, and to a large extent rules out the residual risk of cascading faults. This strategy is summarized in Table 1.
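The combined strategy amounts to a fixed mapping from fault category to approach. The sketch below only encodes that mapping as described above; the enum and function names are hypothetical and do not come from an actual drive controller.

```python
from enum import Enum


class FaultCategory(Enum):
    NON_SEVERE_THREAD = "non-severe thread fault"
    SEVERE_THREAD = "severe thread fault"
    COMMON_PART = "common part fault"


class Approach(Enum):
    A = "shut down the complete system"
    B = "automatic restart in degraded mode"
    C = "on-the-fly thread exclusion"


# Mapping of the combined A-B-C strategy as described in the text.
STRATEGY = {
    FaultCategory.NON_SEVERE_THREAD: Approach.C,
    FaultCategory.SEVERE_THREAD: Approach.B,
    FaultCategory.COMMON_PART: Approach.A,
}


def select_approach(category):
    """Return the fault handling approach for a classified fault."""
    return STRATEGY[category]


for category in FaultCategory:
    approach = select_approach(category)
    print(f"{category.value}: approach {approach.name} ({approach.value})")
```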
VII. Experimental Results
In the following, we provide selected experimental results. Specifically, thread exclusion during non-severe thread faults was tested on a scaled-down drive system with three threads and a synchronous machine rated at 20 kW and 480 V. The threads comprise NPC converters in a back-to-back arrangement. The coupling inductors and transformer stray inductances are roughly a quarter pu when referred to the thread power. We investigated different operating points with different speed and torque settings. The results shown here are for a torque setting of 0.5 pu. This implies a machine current of 0.5 pu, as shown in the lower plot of Fig. 5. Each of the three threads provides one third of the 0.5 pu machine current, i.e. 0.167 pu.
Consider a non-severe fault (e.g. a pump failure or a thermal protection fault) in thread 1 at time zero. The active front end (AFE) and the inverter (INV) of thread 1 are stopped within a few control cycles. The thread 1 current drops to zero with a fast transient, which is determined by the coupling inductor. To compensate for the lost thread, the thread current setpoints of the remaining two threads are increased to 0.25 pu. As can be seen in Fig. 5, the thread controllers almost instantaneously ramp up the thread currents from 0.167 pu to 0.25 pu without an overshoot. As a result, the machine remains almost unaffected by the fault. Specifically, the machine currents exhibit only a small perturbation in one of the phases at time zero, while the glitch in the torque (cf. Fig. 6) has a magnitude of less than 20% and is shorter than 5 ms, thus resulting in a virtually bumpless transfer. In a last step, the thread breakers are opened to isolate the thread and enable maintenance.
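The setpoint redistribution in this experiment can be illustrated as follows. This is a minimal sketch of the arithmetic only, assuming equal current sharing among the healthy threads; it does not model the thread current controllers, and the function name is hypothetical.

```python
def redistribute_setpoints(machine_current_pu, thread_healthy):
    """Split the machine current equally among the healthy threads.

    thread_healthy: list of booleans, one per thread (True = in operation).
    Excluded threads receive a zero setpoint.
    """
    n_healthy = sum(thread_healthy)
    if n_healthy == 0:
        raise RuntimeError("no healthy thread left to carry the machine current")
    share = machine_current_pu / n_healthy
    return [share if healthy else 0.0 for healthy in thread_healthy]


before = redistribute_setpoints(0.5, [True, True, True])
after = redistribute_setpoints(0.5, [False, True, True])  # thread 1 excluded
print([round(x, 3) for x in before])
print([round(x, 3) for x in after])
```

With a machine current of 0.5 pu, each thread carries about 0.167 pu before the fault and 0.25 pu once thread 1 is excluded, matching the setpoints stated above.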
Thread exclusion at full load and thread inclusion perform equally well. This is confirmed by further experimental results, which are not included here due to space limitations.
Even though modular high-power drive systems have a large part count, their modularity can be exploited to provide an overall system reliability and availability that well exceeds that of standard drives with one thread. To achieve this, a customized fault handling strategy is mandatory that is preferably easy to implement and maintain, while avoiding system shut-downs to a large extent.
The majority of the faults are non-severe thread faults that can be isolated relatively easily. By excluding and isolating the faulted thread, the overall system can continue its operation in degraded mode. If redundancy on a thread level is available, full power can also be provided in degraded mode operation. Experiments on a scaled-down drive system show that thread exclusion in the face of non-severe thread faults yields promising results. Specifically, the torque is reduced smoothly (if required), thus limiting the impact on the mechanical load (this will be shown in the final paper). While the drive system continues its operation, the faulted thread(s) can be repaired and added on-the-fly to the running system by thread inclusion. As a result, the downtime of such a modular drive system is expected not to exceed a few hours per year.
[1] S. Schröder, P. Tenca, T. Geyer, P. Soldi, L. Garces, R. Zhang, T. Toma, P. Bordignon: “Modular high-power shunt-interleaved drive system: a realization up to 35MW for Oil & Gas applications”, IEEE Industry Applications Society Annual Meeting, Edmonton, Alberta, Canada, October 2008
[2] H.H. Kari: “Latent sector faults and reliability of disk arrays”, PhD thesis, Helsinki University of Technology, Espoo, Finland, May 1997
[3] P. Wikström, L. A. Terens, H. Kobi: “Reliability, availability and maintainability of high-power variable-speed drive systems”, IEEE Transactions on Industry Applications, Vol. 36, No. 1, pp. 231-241, January/February 2000
[4] R.-D. Klug, A. Mertens: “Reliability of Megawatt drive concepts”, IEEE International Conference on Industrial Technology, Maribor, Slovenia, December 2003