Analysis of High Dimensional Data Using Subspace Clustering
Principal Investigator: Wei Wang
Funding Agency: National Science Foundation
Agency Number: DMS-0406361
Abstract
An important and visible trend in empirical science today is the increasing prevalence and prominence of large data sets containing from thousands to hundreds of millions of measurements. This prevalence can be attributed to a number of factors, including new and improved measurement techniques, increasingly inexpensive computer memory, faster processors, and the possibility of obtaining and exchanging data over high speed Internet links. With the shift towards larger data sets there has been an accompanying shift in the type of data they contain. Whereas small to moderate data sets typically have more samples than measurements, in large data sets it is common to have more measurements than samples, a setting statisticians refer to as "high dimension and low sample size". The central premise of this proposal is that a relatively new development in the field of Data Mining, known as subspace clustering, has the potential to improve the exploratory analysis of high dimensional data. Conversely, ideas from Statistics and Probability can inform the development of improved subspace clustering methods, and can provide a rigorous basis for interpreting their results. Research will be carried out in the context of ongoing collaborations with disciplinary scientists on the analysis of high dimensional data.

