The recent explosion of publicly available biology gene sequences and chemical compounds o?ers an unprecedented opportunity for data mining. To make data analysis feasible for such vast volume and high-dimensional scienti?c data, we apply high performance dimension reduction algorithms. It facilitates the investigation of unknown structures in a three dimensional visualization. Among the known dimension reduction algorithms, we utilize the multidimensional scaling and generative topographic mapping algorithms to con?gure the given high-dimensional data into the target dimension. However, both algorithms require large physical memory as well as computational resources. Thus, the authors propose an interpolated approach to utilizing the mapping of only a subset of the given data. This approach e?ectively reduces computational complexity. With minor trade-o? of approximation, interpolation method makes it possible to process millions of data points with modest amounts of computation and memory requirement. Since huge amount of data are dealt, we represent how to parallelize proposed interpolation algorithms, as well. For the evaluation of the interpolated MDS by STRESS criteria, it is necessary to compute symmetric all pairwise computation with only subset of required data per process, so we also propose a simple but e?cient parallel mechanism for the symmetric all pairwise computation when only a subset of data is available to each process. Our experimental results illustrate that the quality of interpolated mapping results are comparable to the mapping
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for pro?t or commercial advantage and that copies bear this notice and the full citation on the ?rst page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior speci?c permission and/or a fee. HPDC ?10 Chicago, Illinois USA Copyright 2010 ACM X-XXXXX-XX-X/XX/XX ...$10.00.
results of original algorithm only. In parallel performance aspect, those interpolation methods are well parallelized with high e?ciency. With the proposed interpolation method, we construct a con?guration of two-million out-of-sample data into the target dimension, and the number of out-of-sample data can be increased further.
Due to the advancements in science and technologies for last several decades, every scienti?c and technical ?elds generates a huge amount of data in every minute in the world. We are really in the data deluge era. In re?ection of data deluge era, data-intensive scienti?c computing  has been emerging in the scienti?c computing ?elds and getting more interested by many people. To analyze those incredible amount of data, many data mining and machine learning algorithms have been developed. Among many data mining and machine learning algorithms that have been invented, we focus on dimension reduction algorithms, which reduce data dimensionality from original high dimension to target dimension, in this paper.
Among many dimension reduction algorithms, such as principle component analysis (PCA), generative topographic mapping (GTM) [3,4], self-organizing map (SOM) , multidimensional scaling (MDS) [5, 17], we discuss about MDS and GTM in this paper since those are popular and theoretically strong. Previously, we parallelize those two algorithms to utilize multicore clusters and to increase the computational capability with minimal overhead for the purpose of investigating large data, such as 100k data . However, parallelization of those algorithms, whose computational complexity and memory requirement is upto O(N2)where N is the number of points, is still limited by the memory requirement for huge data, e.g. millions of points, although it utilize distributed memory environments, such as clusters, for acquiring more memory and computational resources. In this paper, we try to solve the memory-bound problem by interpolation based on pre-con?gured mappings of the sample data for both MDS and GTM algorithms, so that we can provide con?guration of millions points in the target space.
In this paper, ?rst we will brie?y discuss about existed methods of out-of-sample problem in various dimension reduction algorithms in Section 2. Then, the proposed interpolation methods and how to parallelize them for MDS and GTM algorithms are described in Section 3 and Section 4, correspondingly. The quality comparison between interpolated results and full MDS or GTM running results and parallel performance evaluation of those algorithms are shown in Section 5 followed by our conclusion and future works in Section 6.
Embedding new points with respect to previously con?gured points, or known as out-of-sample problem, has been actively researched for recent years, aimed at extending the capability of various dimension reduction algorithms, such as LLE, Isomap, multidimensional scaling (MDS), generative topographic mapping (GTM), to name a few. Among many e?orts, a recent study by S. Xiang et. all in  provides a generalized out-of-sample solutions for non-linear dimension reduction problems by using coodinate propagation. In ,
M. Carreira-Perpi?n?asn and Z. Lu provides an out-of-sample extension for the algorithms based on the latent variable model, such as generative topographic mapping (GTM), by adapting spectral methods used for Laplacian Eigenmaps.
In sensor network localization ?eld, when there are only a subset of pairwise distances between sensors and a subset of anchor locations are available, people try to ?nd out the locations of the remaining sensors. For instance, semi-de?nite programming relaxation approaches and its extended approaches has been proposed to solve it .  and  proposed out-of-sample extension for the classical multidimensional scaling (CMDS) , which is based on spectral decomposition of a symmetric positive semide?nite matrix (or the approximation of positive semide?nite matrix), and the embeddings in the con?gured space are represented in terms of eigenvalues and eigenvectors of it.  projected the new point x onto the principal components, and  extends the CMDS algorithm itself to the out-of-sample problem.
In contrast to applying out-of-sample problem to CMDS, we extends out-of-sample problem to general MDS results with STRESS criteria in Eq. (1), which ?nds embeddings of approximating to the distance (or dissimilarity) rather than the inner product as in CMDS, with an EM-like optimization method, called iterative majorizing. The proposed iterative majorizing interpolation approach for the MDS problem will be explained in Section 3.1.
MULTIDIMENSIONAL SCALING (MDS)
Multidimensional scaling(MDS) [5, 17] is a general term for the techniques of con?guration of the given high dimensional data into target dimensional space based on the pair-wise proximity information of the data, while each Euclidean distance between two points becomes as similar to the corresponding pairwise dissimilarity as possible. In other words, MDS is a non-linear optimization problem with respect to mapping in the target dimension and original proximity information.
Majorizing Interpolation MDS
One of the main limitation of most MDS applications is that it requires O(N2) memory as well as O(N2)computation. Thus, though it is possible to run them with small data size without any trouble, it is impossible to execute it with large number of data due to memory limitation, so it could be considered as memory-bound problem. For instance, Scaling by MAjorizing of COmplicated Function (SMACOF) [9,10], a well-known MDS application via Expectation-Maximization (EM)  approach, uses six N , N matrices. If N = 100,000, then one N , N matrix of 8-byte double-precision numbers requires 80 GB of main memory, so the algorithm needs to acquire at least 480 GB of memory to store six N , N matrices. It is possible to run parallel version of SMACOF with MPI in Cluster-II in Table 1 with N = 100,000. If the data size is increased only twice, however, then SMACOF algorithm should have 1.92 TB of memory, which is bigger than total memory of Cluster-II in Table 1 (1.536 TB), so it is impossible to run it within the cluster. Increasing memory size will not be a solution, even though it could increase the runnable number of points. It will encounter the same problem as the data size increases.
To solve this obstacle, we develop a simple interpolation approach based on pre-mapped MDS result of the sample of the given data. Our interpolation algorithm is similar to k nearest neighbor (k-NN) classi?cation , but we approximate to new mapping position of the new point based on the positions of k-NN, among pre-mapped subset data, instead of classifying it. For the purpose of deciding new mapping position in relation to the k-NN positions, iterative majorization method is used as in SMACOF [9, 10] algorithm, with modi?ed majorization equation, as shown in below. The algorithm proposed in this section is called Majorizing Interpolation MDS (MI-MDS).
The proposed algorithm is implemented as follows. We are given N data in high-dimensional space, say D-dimension, and proximity information (? =[?ij ]) of those data as in Section 3. Among N data, the con?guration of the nsample points in L-dimensional space, x1,...,xn ? RL, called X, are already constructed by an MDS algorithm, here we use SMACOF algorithm. Then, we select k nearest neighbors, p1,...,pk ?P , of the given new point among npre-mapped points with respect to corresponding ?ix,where x represents the new point. Finally, the new mapping of the given new point x ?RL is calculated based on the pre-mapped position of selected k-NN and corresponding proximity information ?ix. The ?nding new mapping position is considered as a minimization problem of STRESS (1) as similar as normal MDS problem with m points, where m = k + 1. However, only one point x is movable among m points, so we can summarize STRESS (1) as belows, and we set wij =1, for ?i,j in order to simplify.
Parallel MI-MDS Algorithm
Suppose that, among N points, mapping results of nsample points in the target dimension, say L-dimension, are given so that we could use those pre-mapped results of n points via MI algorithm which is described above to embed the remaining points (M = N ?n). Though interpolation approach is much faster than full running MDS algorithm, i.e. O(Mn+ n 2)vs. O(N2), implementing parallel MI algorithm is essential, since M can be still huge, like millions. In addition, most of clusters are now in forms of multicore-clusters after multicore-chip invented, so we are using hybrid-model parallelism, which combine processes and threads together.
In contrast to the original MDS algorithm that the mapping of a point is in?uenced by the other points, interpolated points are totally independent one another, except selected k-NN in the MI-MDS algorithm, and independency of among interpolated points let the MI-MDS algorithm to be pleasingly-parallel. In other words, there must be minimum communication overhead and load-balance can be achieved by using modular calculation to assign interpolated points to each parallel unit, either between processes or between threads, as the number of assigned points are di?erent at most one each other.
Although interpolation approach itself is in O(Mn), ifwe want to evaluate the quality of the interpolated results by STRESS criteria in Eq. (1) of overall N points, it requires O(N2) computation. Note that we implement our hybrid-parallel MI-MDS algorithm as each process has access to only a subset of M interpolated points, without loss of generality M/p points, as well as the information of all pre-mapped n points. It is natural way of using distributed-memory system, such as cluster systems, to access only subset of huge data which spread to over the clusters, so that each process needs to communicate each other for the purpose of accessing all necessary data to compute STRESS.
Parallel Pairwise Computation with Subset of Data
In this section, we illustrate how to calculate symmetric pairwise computation in parallel e?ciently with the case that only subset of data is available for each process. In fact, general MDS algorithm utilize pairwise dissimilarity information, but suppose we are given N original vectors in D-dimension, yi,...,yN ? Y and yi ? RD, instead of given dissimilarity matrix, as PubChem ?nger print data that we used for our experiments. Thus, In order to calculate ?ij = ?yi ?yj ? in Eq. (1), it is necessary to communicate messages between each process to get required original vector, say yi and yj . Here, we used the proposed pair-wise computation to measure STRESS criteria in Eq. (1), but the proposed parallel pairwise computation will be used for general parallel pairwise computation whose computing components are independent, such as generating distance (or dissimilarity) matrix of all data, in condition that each process can access only a subset of required data.
GENERATIVE TOPOGRAPHIC MAPPING
The Generative Topographic Mapping (GTM) algorithm has been developed to ?nd an optimal representation of high-dimensional data in the low-dimensional space, or also known as latent space. Unlike the well-known PCA-based dimension reduction which ?nds linear embeddings in the target space, the GTM algorithm seeks a non-linear mappings in order to provide more improved separations than PCA . Also, in contrast to Self-Organized Map (SOM) which ?nds lower dimensional representations in a heuristic approach with no explicit density model for data, the GTM algorithm ?nds a speci?c probability density based on Gaussian noise model. For this reason, GTM is often called as a principled alternative to SOM .
The core of GTM algorithm is to ?nd the best K representations for N data points, which makes the problem complexity is O(KN). Since in general K ?N, the problem is sub O(N2) which is the case in MDS problem. However, for large N the computations in GTM algorithm is still challenging. For example, to draw a GTM map for 0.1 million 166-dimensional data points in a 20x20x20 latent grid space, in our rough estimation it took about 30 hours by using 64 cores.
To reduce such computational burden, we can use interpolation approach in GTM algorithm, in which one can ?nd the best K representations from a portion of N data points, known as samples, instead of processing full N data points and continue to process remaining or out-of-sample data by using the information learned from previous samples. Typically, since the latter (interpolation process) doesn?t involve computationally expensive learning processes, one can reduce overall execution time compared to the time spent for the full data processing approach.
Although many researches have been performed to solve various non-linear manifold embeddings, including GTM algorithm, with this out-of-sample approach, we have chosen a simple approach since it works well for our dataset used in this paper. For more sophisticated and complexed data, one can see [6, 14, 23].
Parallel GTM Interpolation
The core of parallel GTM interpolation is how to parallelize the computations for the responsibility matrix R as in (22) since its computation is the most time and memory consuming task. This can be done by the same approach for the general parallel matrix multiplication methods known as Fox algorithm  and SUMMA  but with extra tasks.
Assuming that we have P ? Q virtual compute grids with total p = PQ processes, we can decompose the responsibility matrix R into P ? Q sub blocks so that (u, v)-th sub block of R, denoted by Ruv for u =1, ..., P and v =1, ..., Q,can be computed by one process. For this computation, we also need to decompose the matrix Y for K cluster centers into P sub blocks such that each block contains approximately K/P cluster centers and divide the M out-of-sample data T into Q sub blocks that one block contains approximately
ANALYSIS OF EXPERIMENTAL RESULTS
To measure the quality and parallel performance of our MDS and GTM with interpolation approach discussed in this paper, we have used 166-dimensional chemical dataset obtained from PubChem project database1,which is a NIH-funded repository for over 60 million chemical molecules and provides their chemical structures and biological activities, for the purpose of chemical information mining and exploration. In this paper we have used randomly selected up to 4 million chemical subsets for our testing. The computing clusters we have used in our experiments are summarized in Table 1.
Mapping Quality Comparison
MDS vs. MI-MDS
Generally, the quality of k-NN (k-nearest neighbor) classi?cation (or regression) is related to the number of neighbors. For instance, if we choose larger number for the k, then the algorithm shows higher bias but lower variance. On the other hands, the k-NN algorithms show lower bias but higher variance with respect to smaller number of neighbors. The purpose of the MI algorithm is to ?nd appropriate embeddings for the new points based on the given mappings of the sample data, so it is better to be sensitive to the mappings of the k-NN of the new point than to be stable with respect to the mappings of whole sample points. Thus, in this paper, the authors use 2-NN for the MI algorithm.
In the above section, we discussed about the quality of constructed con?guration of MI-MDS based on the STRESS value of the interpolated con?guration. Here, we would like to investigate the MPI communication overhead and parallel performance of the proposed parallel MI-MDS implementation in terms of e?ciency with respect to the running results within Cluster-I and Cluster-II.
First of all, we prefer to investigate the parallel overhead, specially MPI communication overhead which could be major parallel overhead for the parallel MI-MDS in Section 3.2. Parallel MI-MDS consists of two di?erent computations, MI part and STRESS calculation part. MI part is pleasingly parallel and its computational complexity is O(M), where M = N ? n, if the sample size n is considered as a constant. Since the MI part uses only two MPI primitives, MPI_GATHER and MPI_BROADCAST, at the end of interpolation to gather all the interpolated mapping results and spread out the combined interpolated mapping result to all the processes for the further computation. Thus, the communicated message amount through MPI primitives is O(M), so it is not dependent on the number of processes but the number of whole out-of-sample points.
CONCLUSION AND FUTURE WORK
We have proposed and tested interpolation algorithms for extending MDS and GTM dimension reduction approaches to very large datasets. We have shown that our interpolation approach gives results of good quality with high parallel performance. Future research includes application of these ideas to di?erent areas including metagenomics and other DNA sequence visualization. We will study present more detailed parallel performance results for basic MDS and GTM algorithms in both traditional and deterministically annealed variations.
- D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
- Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet. Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems, pages 177?184. MIT Press, 2004.
- C. Bishop, M. Svens?en, and C. Williams. GTM: A principled alternative to the self-organizing map. Advances in neural information processing systems, pages 354-360, 1997.
- C. Bishop, M. Svens?en, and C. Williams. GTM: The generative topographic mapping. Neural computation, 10(1):215-234, 1998.
- I. Borg and P. J. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, U.S.A., 2005.
- M. Carreira-Perpin?an and Z. Lu. The laplacian eigenmaps latent variable model. In Proc. of the 11th Int. Workshop on Arti?cial Intelligence and Statistics (AISTATS 2007). Citeseer, 2007.
- J.Y. Choi, S.-H. Bae,X.Qiu, and G. Fox. High performance dimension reduction and visualization for large high-dimensional data analysis. In To be published in the proceedings of CCGRID 2010, 2010.
- T. M. Cover and P. E. Hart. Nearest neighbor pattern classi?cation. IEEE Transaction on Information Theory, 13(1):21-27, 1967.
- J. de Leeuw. Applications of convex analysis to multidimensional scaling. Recent Developments in Statistics, pages 133-145, 1977.
- J. de Leeuw. Convergence of the majorization method for multidimensional scaling. Journal of Classi?cation, 5(2):163-180, 1988.
- A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B, pages 1-38, 1977.
- G. Fox, S. Bae, J. Ekanayake, X. Qiu, and H. Yuan. Parallel data mining from multicore to cloudy grids. In Proceedings of HPC 2008 High Performance Computing and Grids workshop, Cetraro, Italy, July 2008.
- G. Fox, S. Otto, and A. Hey. Matrix algorithms on a hypercube I: Matrix multiplication. Parallel computing, 4(1):17-31, 1987.
- A. Kaban. A scalable generative topographic mapping for sparse data sequences. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC?05)-Volume I, pages 51-56. Citeseer, 2005.
- T. Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1-6, 1998.
- J. B. Kruskal. Multidimensional scaling by optimizing goodness of ?t to a nonmetric hypothesis. Psychometrika, 29(1):1-27, 1964.
- J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications Inc., Beverly Hills, CA, U.S.A., 1978.
- Y. Takane, F. W. Young, and J. de Leeuw. Nonmetric individual di?erences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika, 42(1):7-67, 1977.
- W. S. Torgerson. Multidimensional scaling: I. theory and method. Psychometrika, 17(4):401-419, 1952.
- M. W. Trosset and C. E. Priebe. The out-of-sample problem for classical multidimensional scaling. Computational Statistics and Data Analysis, 52(10):4635-4642, 2008.
- R. Van De Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency Practice and Experience, 9(4):255-274, 1997.
- Z. Wang, S. Zheng, Y. Ye, and S. Boyd. Further relaxations of the semide?nite programming approach to sensor network localization. SIAM Journal on Optimization, 19(2):655-673, 2008.
- S. Xiang, F. Nie, Y. Song, C. Zhang, and C. Zhang. Embedding new data points for manifold learning via coordinate propagation. Knowledge and Information Systems, 19(2):159-184, 2009.