Rough set based runtime estimator

Abstract Efficient scheduling of jobs in a grid environment is a challenging task. To achieve better resource utilization and proper resource allocation, knowledge of job runtime is essential. Accurate estimation of runtime helps to reserve resources in advance and to provide user-level QoS. However, it is difficult to estimate the runtime of data intensive applications. Users are required to provide a runtime estimate for each job, but user given estimates are inaccurate, leading to poor scheduling. In this paper, we use rough set techniques to analyse the history of jobs and estimate the runtime of a newly submitted job. This requires maintaining a history of executed jobs along with their respective runtimes. Our proposed rough set engine groups similar jobs and identifies the group to which the newly submitted job belongs; based on the group identified, the runtime is estimated. Often users are not aware of resource details and submit incomplete job requirements. These missing job requirements affect data analysis, so the missing values should be accurately predicted. A missing value handler designed using rough sets fills in the most probable value for each missing attribute, after which the runtime is estimated.

Keywords History based approach, Attribute reduction, Degree of dependency, Reduct, Accuracy of approximation, Missing value handler, Rough sets.

Introduction

Grid computing is concerned with "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations" [1]. A grid is a decentralized heterogeneous system in which resources belong to multiple organizations. Grid scheduling is defined as the process of making scheduling decisions involving resources over multiple administrative domains. It is essential in order to make coherent and coordinated use of distributed and heterogeneous grid resources. Most grid schedulers use backfilling, which means that short jobs are allowed to run ahead of their queue position provided they do not delay previously queued jobs. Backfilling improves resource utilization, and it requires job runtimes as input.

In a grid environment, several execution sites are available and the resource scheduler has to select the optimum site among them for job execution. Predicting runtime helps to determine how long a request will wait for a resource. There is also a need to meet user deadlines in order to fulfil quality of service requirements [2]. User given estimates are inaccurate, leading to poor scheduling [5]. Jobs are killed when the requested time expires, resulting in user dissatisfaction.

Estimators help the scheduler make intelligent decisions about selecting the optimum site capable of meeting user deadlines, by estimating job runtime, queue wait time, file transfer time and access time at each individual execution site. They are not core components of the scheduler [12]; they only provide input to it. This input helps to improve the efficiency of scheduling. Some estimators make use of historical information to make their estimations.

Databases are rich with hidden information that can be used for making intelligent decisions, but the available data is often large, vague, uncertain, inconsistent and imperfect. Several techniques are available to extract this hidden, useful information from such data. Fuzzy set theory and rough set theory are both approaches to uncertainty and vagueness that handle imperfect knowledge. Rough set theory is not an alternative to classical set theory but is embedded in it, as fuzzy set theory is [4]. The degree of uncertainty is reduced in rough sets compared to fuzzy sets. In the rough set approach, imprecision is expressed by a boundary region of a set, whereas in fuzzy set theory it is expressed by partial membership.

Rough set theory, introduced by Pawlak [14] and discussed in greater detail in Refs. [39,40], is a technique for dealing with uncertainty and for identifying cause-effect relationships in databases as a form of data mining and database learning [41]. It has also been used for improved information retrieval [42] and for uncertainty management in relational databases [6,7]. The theory has found applications in many domains, such as decision support, engineering, banking, medicine and others [10]. A rough set based intelligent agent has been used to manage grid data [6].

The main advantage of rough set theory in data analysis is that it does not need any preliminary or additional information about the data, such as probability in statistics, basic probability assignment in Dempster-Shafer theory, or grade of membership and possibility values in fuzzy set theory.

In our proposed approach, rough set analysis is performed on the history of jobs and the resource properties used by those jobs for execution. The principle behind the history based approach is that jobs having similar features have similar runtimes [2],[3],[7],[8]. The issue lies in choosing which features of a job should be considered when grouping similar jobs: a technique must be developed to identify similar jobs, and proper attributes must be chosen to group them. If the estimated runtime is close to the actual runtime, the prediction is good.

Rough Sets Background

Rough set theory is an intelligent tool that deals with uncertainty and vagueness [10]. It is mathematically relatively simple yet handles complex problems. It offers mathematical tools to discover patterns hidden in data and can be used for feature selection, feature extraction, data reduction, decision rule generation, and pattern extraction [11]. It also helps to identify partial or total dependencies in data, eliminates redundant data, and provides approaches to handle null values, missing data, dynamic data and more.

Based on this theory one can propose a formal framework for the automated transformation of data into knowledge. It supports learning from examples and simplifies the search for the dominating attributes leading to specific properties. The analysis is based on implicit relationships between the attributes and does not need any further information, such as probability theory or other existing approaches require.

The basic foundations of rough set techniques are described below.

Information Systems/Tables

An information system is a pair I(U, A), where U is the universal set representing the objects/instances/cases: a non-empty finite set of objects. A represents the attribute set, a non-empty finite set of attributes/features. An information system is thus more or less the same as a relational database: the rows represent objects, while the columns represent the attribute values belonging to these objects.

Table I shows the first ten jobs taken from the history. The universal set U is given by U = { J1, J2, J3, J4, J5, J6, J7, J8, J9, J10 }. The attribute set A includes all attributes in the table, divided into condition and decision attributes. The condition attribute set C is given by C = { RAM, SM, NC, OS, CPU_Speed }. The decision attribute set D contains job_Runtime.
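This information system can be sketched as a simple Python structure. The attribute values below are illustrative placeholders only, since Table I itself is not reproduced here:

```python
# Minimal sketch of the information system I(U, A).
# Rows are objects (jobs); columns are attribute values.
# Values are invented for illustration, not taken from Table I.
jobs = {
    "J1": {"RAM": "2GB", "SM": "high", "NC": 4, "OS": "linux", "CPU_Speed": 2.4, "job_Runtime": "short"},
    "J2": {"RAM": "2GB", "SM": "high", "NC": 4, "OS": "linux", "CPU_Speed": 2.4, "job_Runtime": "short"},
    "J3": {"RAM": "4GB", "SM": "low",  "NC": 8, "OS": "linux", "CPU_Speed": 3.0, "job_Runtime": "long"},
}
condition_attrs = ["RAM", "SM", "NC", "OS", "CPU_Speed"]  # the set C
decision_attr = "job_Runtime"                             # the set D
```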

Indiscernibility

Indiscernibility is similarity: similar objects are collected into a set. The objects are thus classified into groups based on their attributes/features. Classification helps to express human knowledge about a domain; knowledge is treated as an ability to classify the observed objects into categories. Objects belonging to the same category are considered indistinguishable from each other.

The rough set engine inside the runtime estimator performs certain computations in advance. Discretization, a data pre-processing step, can be applied to specific numerical attributes in the information table. It is applied when small differences between attribute values do not make two jobs meaningfully different. Categories are created for each attribute based on its values. For example, here CpuSpeed < 2 is categorized as low, between 2 and 2.5 as medium, and > 2.5 as high.
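The CpuSpeed discretization just described can be written as a small function, using exactly the thresholds given in the text:

```python
# Discretize the numeric CPU_Speed attribute into the categories from
# the text: < 2 is "low", 2 to 2.5 is "medium", > 2.5 is "high".
def discretize_cpu_speed(speed):
    if speed < 2.0:
        return "low"
    elif speed <= 2.5:
        return "medium"
    return "high"

print(discretize_cpu_speed(1.8))  # low
print(discretize_cpu_speed(2.2))  # medium
print(discretize_cpu_speed(3.0))  # high
```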

After discretization, attribute reduction is performed to obtain the significant attributes that most strongly affect the decision attribute runtime, called the reduct. This is performed by the reduct generator algorithm shown in Table III. The input to the reduct generator algorithm is the core, which is computed as explained in Section C. The reduct generator algorithm calls the degree of dependency function explained in Section B.

Compute Total degree of dependency

The steps in computing the degree of dependency are as follows:

Step 1. Compute condition relative equivalence classes: Considering all the condition attributes, jobs with identical values on all condition attributes are grouped together.

Step 2. Compute decision relative equivalence classes: Considering the decision attribute, jobs with the same decision attribute value are grouped together.

Step 3. Compute the positive boundary region: Choose all condition relative equivalence classes that are subsets of some decision relative equivalence class. The union of these resultant condition relative equivalence classes is called the positive boundary region.

Step 4. Compute the degree of dependency: the degree of dependency is the ratio of the number of jobs in the positive boundary region to the total number of jobs in U.
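These steps can be sketched in Python. The toy table below uses invented values (not the paper's Table I), and the final ratio |positive region| / |U| is the standard rough-set definition of the degree of dependency, which we assume is the one intended here:

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Group objects that share identical values on the given attributes."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def degree_of_dependency(table, condition_attrs, decision_attr):
    cond_classes = equivalence_classes(table, condition_attrs)   # condition relative classes
    dec_classes = equivalence_classes(table, [decision_attr])    # decision relative classes
    positive = set()
    for c in cond_classes:                                       # positive boundary region:
        if any(c <= d for d in dec_classes):                     # condition classes contained
            positive |= c                                        # in some decision class
    return len(positive) / len(table)                            # assumed ratio |POS| / |U|

# Toy history with illustrative values:
table = {
    "J1": {"CPU": "high", "OS": "linux", "RT": "short"},
    "J2": {"CPU": "high", "OS": "linux", "RT": "short"},
    "J3": {"CPU": "low",  "OS": "linux", "RT": "long"},
    "J4": {"CPU": "low",  "OS": "linux", "RT": "short"},
    "J5": {"CPU": "low",  "OS": "linux", "RT": "long"},
}
print(degree_of_dependency(table, ["CPU", "OS"], "RT"))  # 0.4
```

Here only {J1, J2} falls entirely inside one decision class, so the positive region has 2 of 5 jobs.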

Reduct Generator Algorithm
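The paper's exact algorithm is given in Table III, which is not reproduced here. A common greedy formulation, sketched under that assumption, starts from the core and keeps adding the attribute that most increases the degree of dependency until it matches that of the full condition set:

```python
from collections import defaultdict

def gamma(table, attrs, decision_attr):
    """Degree of dependency of the decision on the given attributes."""
    def classes(attrs_):
        groups = defaultdict(set)
        for obj, row in table.items():
            groups[tuple(row[a] for a in attrs_)].add(obj)
        return list(groups.values())
    dec = classes([decision_attr])
    pos = set()
    for c in classes(attrs):
        if any(c <= d for d in dec):
            pos |= c
    return len(pos) / len(table)

def greedy_reduct(table, condition_attrs, decision_attr, core):
    """Greedy reduct sketch (assumed, not the paper's Table III): extend the
    core until its degree of dependency equals that of all condition attributes."""
    target = gamma(table, condition_attrs, decision_attr)
    reduct = list(core)
    while gamma(table, reduct, decision_attr) < target:
        best = max(
            (a for a in condition_attrs if a not in reduct),
            key=lambda a: gamma(table, reduct + [a], decision_attr),
        )
        reduct.append(best)
    return reduct

# Toy table in which CPU alone determines the runtime class RT:
table = {
    "J1": {"CPU": "high", "OS": "linux", "RT": "short"},
    "J2": {"CPU": "high", "OS": "win",   "RT": "short"},
    "J3": {"CPU": "low",  "OS": "linux", "RT": "long"},
}
print(greedy_reduct(table, ["CPU", "OS"], "RT", []))  # ['CPU']
```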

On arrival of a new job, its group is identified based on the reduct obtained. The mean runtime of all jobs in that group is computed; this gives the runtime estimate for the new job. If the job executes successfully, it is added to the history with its actual runtime. The limitation of the runtime estimator is that if no similar group is found, the runtime cannot be estimated. Thus, the larger the number of jobs in the history, the better the estimator's performance. Moreover, some assumptions are made:
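The estimation step described above can be sketched as follows; the history records and attribute names are illustrative assumptions:

```python
def estimate_runtime(history, reduct, new_job):
    """Estimate a new job's runtime as the mean runtime of history jobs that
    match it on all reduct attributes; None if no similar group exists (the
    limitation noted in the text)."""
    group = [
        rec["runtime"]
        for rec in history
        if all(rec[a] == new_job[a] for a in reduct)
    ]
    if not group:
        return None
    return sum(group) / len(group)

# Illustrative history records (runtime in seconds):
history = [
    {"CPU": "high", "OS": "linux", "runtime": 120.0},
    {"CPU": "high", "OS": "linux", "runtime": 100.0},
    {"CPU": "low",  "OS": "linux", "runtime": 400.0},
]
print(estimate_runtime(history, ["CPU", "OS"], {"CPU": "high", "OS": "linux"}))  # 110.0
```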

  • The type of job should be similar in nature.
  • System type (mainframe/pc/cluster) is assumed to be the same.

Handling of Missing Job Requirements

When a user submits a job in a grid environment, the user may not be aware of some of the job requirements. Here requirements refer to the attributes of the job, taken as the condition attributes, which may be needed to estimate its runtime. These missing attributes affect data analysis and are a hindrance to making correct decisions regarding the decision attribute runtime. The missing values should therefore be accurately predicted: the most probable value should be substituted so that the runtime can be estimated correctly.

Using the indiscernibility and accuracy of approximation concepts of rough sets, the missing values are replaced. Determining similar objects with respect to defined attribute values is very hard when some attribute values are missing. The attribute value with the highest possibility is found, the missing values are replaced by this most probable value, and the runtime of the job is estimated.
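One plausible reading of this handler, sketched here as an assumption (the paper's exact scoring via accuracy of approximation may differ), is to take the most frequent value of the missing attribute among the history jobs indiscernible from the new job on its known attributes:

```python
from collections import Counter

def fill_missing(history, job, missing_attr):
    """Fill a missing attribute with its most probable value: the most
    frequent value among history jobs that agree with the new job on all
    known attributes. A hedged sketch, not the paper's exact procedure."""
    known = {a: v for a, v in job.items() if v is not None and a != missing_attr}
    candidates = [
        rec[missing_attr]
        for rec in history
        if all(rec.get(a) == v for a, v in known.items())
    ]
    if not candidates:  # no indiscernible jobs: fall back to the global mode
        candidates = [rec[missing_attr] for rec in history]
    value, _count = Counter(candidates).most_common(1)[0]
    return value

# Illustrative history; the new job's RAM requirement is missing:
history = [
    {"CPU": "high", "OS": "linux", "RAM": "2GB"},
    {"CPU": "high", "OS": "linux", "RAM": "2GB"},
    {"CPU": "high", "OS": "linux", "RAM": "4GB"},
]
print(fill_missing(history, {"CPU": "high", "OS": "linux", "RAM": None}, "RAM"))  # 2GB
```

After the fill, the completed job can be passed to the runtime estimator as usual.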

Conclusions

Analysis is thus performed on the job history using the rough set approach. The proposed runtime estimator algorithm has been implemented, and the runtime of a job is calculated even with incomplete job requirements. Our results show that the history based approach for runtime estimation works well with good accuracy. Using rough sets, we approximately predict the runtime of the job, and we are also able to predict the resource requirements of the job. Runtime estimation helps to improve scheduling performance.

Future Work

Many data intensive applications require manipulation of very large databases, which are difficult to maintain. Using the predicted resource requirements, matching resources can be found at the various sites. Predicting transfer times from sites where resources are available will also be done in future work. The queue wait time of jobs and site access time can also be predicted if suitable attributes are chosen. This could greatly help grid scheduling.

Acknowledgment

We wish to express our thanks to the Centre for Advanced Computing Research and Education (CARE) team members in the research laboratory for their helpful and constructive suggestions, which considerably improved the quality of the paper. This work has been supported by CARE, for integrating the runtime estimator into a grid scheduler and for learning more about grid scheduling in practice.

References

  1. Foster. I, C. Kesselman, and S.Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," Int'l J. Supercomputer Applications,vol.15, no.3, 2001.
  2. Downey,A,B., (1997), "Predicting Queue Times on Space-Sharing Parallel Computers", Proc. of the 11th Intl. Parallel Processing Symposium (IPPS), Geneva,Switzerland, April.
  3. Gibbons,R., (1997), "A Historical Application Profiler for Use by Parallel Schedulers", Lecture Notes in Computer Science (LNCS), 1291, Springer-Verlag, pp.58-75
  4. Rogerio Luis de Carvalho Costa and Pedro Furtado, "Scheduling in Grid Databases", IEEE, 2008.
  5. Dan Tsafrir, Yoav Etsion, Dror Feitelson, "Backfilling Using System-Generated Predictions Rather than User Runtime Estimates", IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL.18,NO.6,JUNE 2007.
  6. Asef AL-Khateeb, Rosni Abdullah and Nur'Aini Abdul Rashid, "Job Type Approach for Deciding Job Scheduling in Grid Computing Systems" , Journal of Computer Science 5 (10): 745-750, 2009.
  7. Smith,W., Foster,I., and Taylor,V., (1998), "Predicting Application Run Times Using Historical Information", in Proc. of the IPPS/SPDP '99 Workshop on Job Scheduling Strategies for Parallel Processing.
  8. Smith,W., Taylor,V., and Foster,I.,(1999), "Using run-time predictions to estimate queue wait times and improve scheduler performance", Lecture Notes in Computer Science (LNCS), 1659, Springer-Verlag, pp.202-229
  9. C C Tsang,Degang Chen,Daniel S.Yeung,Xi-Zhao Wang and John W.T. Lee, "Attributes Reduction Using Fuzzy Rough Sets", IEEE Transaction on Fuzzy Systems, Vol. 16, pp. 1130-1141, oct-2008.
  10. Jia Chen, Di Liu, "Rough Set-based Intelligent Agent Grid Data Management", International Conference On Communications ,Circuits and Systems,pp 937-941,2007.
  11. Asgarali Bouyer, Mohammadbagher Karimi and Mansour Jalali, "An Online and Predictive Method for Grid Scheduling based on Data Mining and Rough Set", Computational Science and Its Applications, vol. 5592, pp. 775-787, 2009.
  12. Meiqun Liu, Kun Gao, Zhong Wan, "A novel architecture for data mining grid scheduler", WSEAS Transactions on Systems, Vol. 7, Issue 4, pp. 373-383, 2008.
  13. Shonali Krishnaswamy,Seng Wai Loke and Arkady Zaslavsky, "Estimating Computation Times of Data-Intensive Applications", IEEE DISTRIBUTED SYSTEMS ONLINE 1541-4922 2004 Published by the IEEE Computer Society Vol. 5, No. 4; April 2004.
  14. Zdzislaw Pawlak, "Rough set theory and its applications," Journal of Telecommunications and Information Technology,Vol-3,pp.468-473,2002.
  15. Komorowski,J., Pawlak,Z., Polkowski,L., and Skowron, A., (1998), "Rough sets: A Tutorial", in Rough-Fuzzy Hybridization: A New Trend in Decision Making, (eds) S.K.Pal and A.Skowron, Springer- Verlag, pp. 3-98.
  16. Jan Komorowski, Lech Polkowski and Andrzej Skowron, "Rough Sets: A Tutorial".
  17. A. E. Hassanien, "Rough set approach for attribute reduction and rule generation: a case of patients with suspected breast cancer," Journal of the American Society for Information Science and Technology, Vol. 55, no. 11, pp. 954-962, 2004.
  18. Jerzy W.Grzymala Busse, "Rough Set Theory with Applications to Data Mining".
  19. Fulufhelo V.Nelwamondo and Tshilidzi Marwala, "Rough Set Theory for the Treatment of Incomplete Data" ,IEEE 2007.
  20. M. Nakata and H. Sakai, "Rough sets approximations to possibilistic information," in IEEE International Conference on Fuzzy Systems,(Vancouver, BC, Canada), pp. 10804-10811, 2006.
  21. Ricardo Hirata Okamoto, Tatsuo Nishino, Mitsuo Nagamachi, "Comparison between statistical and lower / upper approximations rough sets models for beer can design and prototype evaluation.", Research paper.
  22. K. Thangavel, A. Pethalakshmi and P. Jaganathan, "A Novel Reduct Algorithm for Dimensionality Reduction with Missing Values Based on Rough Set Theory", International Journal of Soft Computing 1(2): 111-117, 2006.
  23. Greco, S., Matarazzo, B., and Slowinski, R. [1999] Handling Missing Values in Rough Set Analysis of Multi-attribute and Multi-criteria Decision Problem, Lecture Notes in Artificial Intelligence, 1711, pp. 146-157.
  24. Grzymala-Busse, J. W. [2004] Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction, Transactions on Rough Sets I, pp. 78-95.
  25. Nakata, N. and Sakai, H. [2006] Applying Rough Sets to Data Tables Containing Imprecise Information Under Probabilistic Interpretation, Lecture Notes in Artificial Intelligence, 4259, pp. 213-223.
  26. Rafalv Latkowski, "On Indiscernibility Relations for Missing Attribute Values".
  27. Fulufhelo Vincent Nelwamando and Tshilidzi Marwala , "Rough Sets Computations to Impute Missing Data" ,2008.
