# Simulation

### Simulation on Group Deleted Potentials for Diagnostis of Censored Survival Regression

### Abstract:

The identification of multiple high leverage points (HLP) can not be successfully carried out by using a method developed for handling single HLP. The situation getting worse when handling Weibull distributed censored response data. We proposed method for identifying multiple HLPs. The proposed method is developed based on the similar problem for handling multiple HLPs in multiple linear regression setting. Simulation study is conducted for assessing its performance across a wide range of scenarios. We found that the proposed method had good capability to identify multiple HLP. As a general rule, the proposed method perform better in higher covariate space dimension, larger number of outlying regressor variables, larger outlying leverage distance, larger magnitude of unusualness in response-space, larger number of multiple points clouds and higher percentage of censoring.

### Key words:

Multiple High Leverage Points; Simulation Study, Weibull Regression;

### 1 Introduction

The statistical analysis of survival time data has become a topic of considerable interest to statisticians and workers in areas such as engineering, medicine and biological sciences. Such data may show structural complexity, but it is the presence of censoring which sets its analysis apart from traditional statistical techniques.

The Cox proportional hazards (PH) model is usually applied to analyze the relationship between survival time and explanatory variables. Kudus and Ibrahim [7] extended the model for competing risks survival data. Whereas, Kudus et al. [8] proposed the regression tree method based on Cox proportional hazards for analyzing competing risks survival data. The regression tree is further extended by using proportional hazards model for subdistribution and applied on breast cancer study [4].

Weibull regression model is a special case of Cox proportional hazards. Kalbfleisch and Prentice [6] showed that Weibull regression model is not only having form of proportional hazards but also log-linear model. The problem arose when the model is fitted to data contained unusual observations, such as high leverage points (HLP). This problem can cause invalidity of the use of standard inferential procedure. It is thus important for the data analyst to be able to identify such observations.

If the dataset contain more than one HLP, which is likely to be the case in most data sets, the problem of identifying such observations become more difficult. There is evidence [9] that the single HLP detection techniques have been proved to be ineffective to detect potential observations in the presence of multiple HLP in linear regression problem. Wisnowski et al. [11] stated that many standard least-squares regression diagnostics quantities and plots have been shown to fail in the presence of multiple outliers (including multiple HLP), particularly if the observations are clustered in an outlying cloud. Imon [5] introduced generalized potentials (pii*) and using it for identifying multiple HLP in linear regression setting.

Since there is no existing method for identifying multiple HLP in Weibull regression model, then it is thus important to develop a method for such problem. The proposed method must have a closed relation to the existing method for linear regression problem due to the advantage of log-linear form of Weibull regression model. Hence, it would be a generalization from the existing method. We assesed its performance through a series of Monte Carlo simulation in various multiple HLP configurations likely to be encountered in practice.

In Section 2 we review Weibull regression modeling through proportional hazards or log-linear models and their parameter estimations by means of maximum likelihood method. Section 3 presented single and multiple HLP identifications for Weibull regression model. Extensive Monte Carlo simulation to evaluate multiple HLP identification method is designed and conducted in Section 4 and last section presents a conclusion of the paper.

### 2 The Weibull Regression Modeling

### 2.1 The PH model

Consider survival time T > 0, sometime censored, and suppose that a vector of basic covariates x' = (x1, x2, …) is available on each individual, their measurements having been taken at or before time 0. Aspect of x are expected to be predictive of subsequent survival time. The principal problem is that of modeling and determining the relationship between T and covariates x.

Recall that Weibull distribution with scale parameter l and shape parameter g has hazard function

(1)

To include the covariate vector xi of the ith individual, the hazard for a given xi can be expressed as

(2)

Now, Weibull distribution has scale parameter and shape parameter g. The survivor function turns out to be

(3)

and probability density function is

(4)

Given covariate xi, let , i = 1, …, n, be the true failure times of a sample of size n, assumed to be independent identically distributed with a Weibull distribution with hazard function (2). Assuming that the observations are subject to arbitrary right censoring, the period of follow-up for the ith individual is limited to a value ci. Then, the observed failure time of the ith individual is given by ti = min(, ci). Define di such that di = 0 if ³ ci (a censored observation) and di = 1 if < ci (an observed failure of some kind).

The Weibull proportional hazards model is fitted by constructing the likelihood function of the n observations

(5)

where f(.) is (4) and S(.) is (3). It is easier to maximize logarithm of this function instead of likelihood function itself for obtaining estimate of the unknown parameters l, g and b .

### 2.2 Log-linear model

Consider a log-linear model for the random variable Ti associated with the survival time of the ith individual in a survival study, according to which

(6)

In this model , , …, are the unknown coefficients of the value of k explanatory variables X1, X2, …, Xk, and , s are two further parameters, known as the intercept and scale parameter, respectively. The quantity ei is a random variable used to model departure of the values of ln Ti from the linear part of the model. Suppose that e follows standard extreme value distribution with probability density function given by

(7)

Let , then the probability density function of x is which is exponential distribution with unit mean.

Now consider the survivor function of Ti

(8)

Since has a unit exponential distribution, so . It then follows that

(9)

There is a direct correspondence between equation (3) and equation (9), in the sense that

, and (10)

With log-linear form of Weibull regression model, the likelihood function has a simple form. Since probability density function of e is (7), then density function of Yi = ln Ti is given by

(11)

and its survivor function is

(12)

where

(14)

The likelihood function based on y1, y2, …, yn which are logarithm of the n observed survival times t1, t2, …, tn is then

(15)

### 3 The Identification of HLP for Weibull Regression Model

### 3.1 Single HLP Identification

HLP is observation which is isolated in the covariate space (i.e., far removed from the main body of points in the X space). They can be thought of as outliers in the covariate space [1]. HLP need not be influential, and influential observations are not necessarily HLP.

There are several ways to understand the characteristics of leverage in the linear regression model yi = a0 + Sajxji + ei, one of them is which measure the amount of leverage of the response value yj on the predicted value . wii most directly reflects the influence of yi on the fit. The generalization of leverage from linear regression to more general models can be based on this viewpoint.

Wei et al. [10] showed that generalized leverage derived by the above approach is

(16)

where

, and

For log-linear specification of Weibull regression model (6)

, and l(q) is logarithm of likelihood function (15).

If we only concern with regression coefficient , then the generalized leverage is

(17)

where and zi is (14).

Next, let , it is then

(18)

Thus the generalized leverage of the ith observation is given by

(19)

In linear regression setting, points with wii greater than 2(k+1)/n (twice the average value) are generally regarded as HLP [2]. Since, the situation is different for log-linear model, then we use 0.2 as a calibration point which stated by Huber (1981) as risky to dangerous points.

Another statistic for HLP identification is called potential. Potential is leverage of the ith point which is based on a fit to the data with the ith case deleted, namely

(20)

where is matrix with ith case deleted. Simple relationship between generalized leverage (19) and potential (20) is

(21)

with cut-off point: Median(pii) + 3 MAD(pii), where MAD(pii) = Median{| pii – Median(pii)|}/0.6745.

### 3.2 Multiple HLPs Identification

Multiple HLPs identification method for Weibull regression model will be proposed in this section. It is motivated by the anticipation that the single case deleted measure discussed in the previous section may be ineffective for the identification of multiple HLPs because of masking and/or swamping effects. Let R be a set of cases 'remaining' in the analysis and D be a set of cases 'deleted'. Hence R contains (n–d) cases after d < (n–k) cases in D are deleted. We assume that the last d rows of is D set. Group deleted leverage based on group deleted cases D is

(22)

with cut-off point: Median(wii(R)) + 3 MAD(wii(R)). By adopting generalized potential proposed by Imon [5], then its corresponding group deleted potential turns out to be

(23)

with cut-off point:

### 4 Simulation on Identification of Multiple HLP for Weibull Regression Model

### 4.1 Scenario

We use Monte Carlo simulation to examine the performance of the multiple HLPs detection procedures across a wide range of scenarios. The simulation generated 90% of clean observations and plant HLPs at location specified by the scenario and factor settings. The regressor variables levels for clean observations are generated from a multivariate normal distribution with a mean of mx=7.5 and standard deviation of sx=4.0. The choice of these parameters does not affect the result of the simulations. The survival time response for the ith clean observation is generated from Weibull, where l = 1, bj = –3 and g = 3/2 which correspond to , and s = 2/3 in log-linear form of Weibull regression. For the planted HLPs, the jth regressor variable value for the ith observations is where is the average of the clean value for the jth regressor, dL is the magnitude of the outlying shift distance in X-space and is a random variate from a U(0,0.25). We use the term to separate multiple observations in a cloud. If the ith observation is both HLP and regression outlier, then response value ti is , where is generated from , dR is the magnitude of the outlying distance off the regression plane in standard deviation units, st.

Beside all the above factors, we also consider the level of censoring percentage (pc). The percentage of censoring is defined by letting C which follow Uniform(0,ac), where ac will be chosen such that it results in an overall probability of censoring pc = P(T > C|x).

Our simulation studies aim to characterize the effects of specific factors on two primary measures of performance: detection capability and false alarm rate. The false alarm rate is the probability that a clean observation is swamped and the complement of detection probability is the masking probability. The factors considered are:

- Magnitude of unusualness in X-space, dL (3 and 5)

- Number of clouds (1 and 2)

- Magnitude of unusualness in response-space, dR (5)

- Percentage of censoring, pc = (0%, 5% and 10%)

- Dimension of data ((n,# x variables) = (40,2) and (60,6))

- Proportion of x variables with extreme values (all k variables, 1 out of k variables and 3 out of 6 variables)

Each procedure's performance is evaluated on its ability to detect the planted HLPs and avoid false alarms. Both are reported for 500 replications.

### 4.2 Result

Table 1 and 2 showed the simulation results for one cloud of HLPs with 3 and 5 unit magnitude of usualness in X-space, respectively. Multiple HLPs detection based on wii(R) and showed better result. The same result is also obtained from simulation with dL = 5. We also conduct simulation for 2 clouds of HLPs and 2 clouds of HLPs located at different response-space magnitude. The results still showed better performance for wii(R) and .

### 5 Conclusion

The group deleted procedure (wii(R) and ) have the better detection probability and have false alarm rate about the nominal 5% level. As a general rule, the proposed method perform better in higher covariate space dimension, larger number of outlying regressor variables, larger outlying leverage distance, larger magnitude of unusualness in response-space, larger number of multiple points clouds and higher percentage of censoring.

### Table 1. Scenario with detection and false alarm probabilities (in parenthesis) for 1 cloud multiple HLPs with 3sx of magnitude of usualness in X-space

dL

pc

(n,k)

x variables

(1)*

(2)

(3)

(4)

3sx

0%

40,2

2

0.035(0.001)

0.417(0.098)

0.663(0.071)

0.639(0.099)

(1 cloud)

1

0.020(0.001)

0.354(0.098)

0.483(0.077)

0.453(0.103)

60,6

6

0.019(0.007)

0.172(0.101)

0.708(0.032)

0.667(0.079)

3

0.025(0.006)

0.227(0.095)

0.568(0.035)

0.523(0.082)

1

0.020(0.006)

0.198(0.095)

0.314(0.042)

0.271(0.091)

5%

40,2

2

0.030(0.001)

0.444(0.094)

0.686(0.070)

0.665(0.095)

1

0.017(0.001)

0.377(0.097)

0.512(0.076)

0.493(0.102)

60,6

6

0.014(0.005)

0.110(0.076)

0.805(0.024)

0.783(0.059)

3

0.032(0.005)

0.212(0.089)

0.600(0.032)

0.563(0.077)

1

0.018(0.006)

0.197(0.093)

0.338(0.041)

0.283(0.088)

10%

40,2

2

0.028(0.001)

0.452(0.092)

0.701(0.069)

0.677(0.095)

1

0.016(0.001)

0.377(0.098)

0.515(0.075)

0.490(0.102)

60,6

6

0.014(0.004)

0.099(0.071)

0.826(0.022)

0.805(0.054)

3

0.029(0.005)

0.196(0.083)

0.645(0.029)

0.610(0.071)

1

0.018(0.006)

0.197(0.091)

0.336(0.040)

0.278(0.087)

Average probabilities

0.022(0.004)

0.268(0.092)

0.580(0.049)

0.547(0.086)

Note: * Four detection methods compared: (1) generalized leverage, (2) potential, (3) group deleted leverage, (4) group deleted potential

### Table 2. Scenario with detection and false alarm probabilities (in parenthesis) for 1 cloud multiple HLPs with 5sx of magnitude of usualness in X-space

dL

pc

(n,k)

x variables

(1)*

(2)

(3)

(4)

5sx

0%

40,2

2

0.086(0.001)

0.523(0.093)

0.861(0.068)

0.849(0.095)

(1 cloud)

1

0.071(0.001)

0.497(0.090)

0.757(0.069)

0.740(0.096)

60,6

6

0.030(0.007)

0.200(0.100)

0.867(0.029)

0.850(0.075)

3

0.049(0.005)

0.274(0.094)

0.787(0.030)

0.760(0.076)

1

0.045(0.006)

0.278(0.091)

0.550(0.034)

0.506(0.082)

5%

40,2

2

0.078(0.001)

0.577(0.089)

0.871(0.065)

0.863(0.090)

1

0.067(0.001)

0.527(0.089)

0.749(0.069)

0.731(0.095)

60,6

6

0.014(0.003)

0.091(0.058)

0.937(0.018)

0.926(0.045)

3

0.041(0.004)

0.230(0.074)

0.847(0.024)

0.821(0.061)

1

0.038(0.005)

0.273(0.087)

0.586(0.032)

0.545(0.079)

10%

40,2

2

0.068(0.001)

0.617(0.088)

0.886(0.062)

0.878(0.086)

1

0.063(0.001)

0.546(0.086)

0.763(0.068)

0.751(0.093)

60,6

6

0.016(0.003)

0.088(0.057)

0.939(0.018)

0.928(0.044)

3

0.040(0.004)

0.219(0.068)

0.854(0.022)

0.835(0.057)

1

0.034(0.005)

0.269(0.083)

0.618(0.031)

0.572(0.076)

Average probabilities

0.049(0.003)

0.347(0.083)

0.791(0.043)

0.770(0.077)

Note: * Four detection methods compared: (1) generalized leverage, (2) potential, (3) group deleted leverage, (4) group deleted potential

### References

[1] Chatterjee, S. and Hadi, A. S. (1986). Influential observations, high leverage points, and outliers in linear regression. Stat. Sci. Vol. 1(3), pp: 379-416.

[2] Hoaglin, D. C. and Welsch, R. E. (1978). The hat matrix in regression and ANOVA. Amer. Statist. 32. 17-22.

[3] Huber, P. (1981). Robust Statistics. New York: Wiley.

[4] Ibrahim, Kudus, A., Daud, I. and Abu Bakar, M. R. (2008). Decision tree fro competing risks survival probability in breast cancer study. Int. J. of Biomed. Sci. Vol. 3(1).

[5] Imon, A. H. M. R. (1996). Subsample methods in regression residual prediction and diagnostics. PhD thesis, School of Mathematics and Statistics, University of Birmingham, UK.

[6] Kalbfleisch, J. D. and Prentice, R. L. (2002). The statistical analysis of failure time data. Wiley Interscience. New Jersey.

[7] Kudus, A. and Ibrahim, N. A. (2005). Competing Risks Cox Proportional Hazard Modeling Using SAS. SAS User Malaysia Forum 2005. http://www.sas.com/offices/asiapacific/malaysia/events/sum2005docs/19.pdf

[8] Kudus, A., Ibrahim, N. A., Abu Bakar, M. R. and Daud, I. (2005). Regression Trees for Competing Risks Survival Data. CD Proceedings of International Conference on Applied Mathematics. Bandung, August 22-26.

[9] Rousseeuw, P. J., and Leroy, A. (1987). Robust regression and outlier detection. Wiley, New York.

[10] Wei, B-C., Hu, Y-Q and Fung, W-K. (1998). Generalized leverage and its application. Scand. J. Stat. Vol. 25, pp:25-37.

[11] Wisnowski, J. W., Montgomery, D. C. and Simpson, J. R. (2001). A comparative analysis of multiple outlier detection procedures in the linear regression model. Comput. Statist. Data Anal. 36:351-382