
    Collaborative Systems: Visualizing and Exploring High-dimensional Data

    Principal Investigator: Leonard McMillan, Wei Wang
    Funding Agency: National Science Foundation
    Agency Number: IIS-0534580

    Abstract
    High-throughput experiments have revolutionized many areas of scientific endeavor. In contrast to traditional experimental methods, they generate vast amounts of data, which is approachable only through computer-aided data analysis. The analysis is complicated by the data's high dimensionality and large volume. We propose to develop a new tool for interactively exploring inter-data relationships within such data sets. Our tool aids scientists before they apply traditional offline analysis techniques such as clustering, segmentation, and classification. We avoid the problems of representing high-dimensional data directly by considering only the relationships between data points, expressed as multiple scalar-valued dissimilarity measures. The user can interactively mix various measures and visualize how they influence the clustering of data points. This aids in selecting appropriate weighting factors between incompatible features and noisy measurements. Furthermore, it allows scientists to explore hypotheses and incorporate their own knowledge to drive traditional unsupervised data-mining algorithms in sensible directions.

    A key component of our tool is its ability to interactively explore parameter spaces that combine various attributes of high-dimensional data points. We provide two alternate views of the high-dimensional data sets. Our dissimilarity-matrix view offers insights into the size, compactness, separation, and relative proximity of clusters. An alternate point-cloud view provides a 3-D projection of the high-dimensional source data that best preserves the distances between points. This view excels in communicating the flow and migration of points from one cluster to another as weighting parameters are tuned. It also allows the user to probe and interact with the data, including such tasks as hand-clustering the data and examining particular points.
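As a rough illustration of the two ideas above (mixing several dissimilarity measures with user-chosen weights, then projecting to 3-D while preserving distances), the sketch below uses NumPy. It is not the project's implementation: the function names `combine_dissimilarities` and `classical_mds` are hypothetical, the mix is assumed to be a normalized weighted sum, and textbook classical (Torgerson) MDS stands in for whatever projection the tool actually uses.

```python
import numpy as np


def combine_dissimilarities(matrices, weights):
    """Mix several n-by-n dissimilarity matrices with non-negative weights.

    Assumption: the mix is a weighted sum with weights normalized to sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()
    stacked = np.stack([np.asarray(m, dtype=float) for m in matrices])
    # Contract the weight vector against the first (measure) axis.
    return np.tensordot(weights, stacked, axes=1)


def classical_mds(dissimilarity, n_components=3):
    """Embed points in n_components dimensions with classical (Torgerson) MDS."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip negatives caused by non-Euclidean input.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    top_values = np.clip(eigenvalues[order], 0.0, None)
    return eigenvectors[:, order] * np.sqrt(top_values)


# Usage with made-up data: two measures computed on different feature subsets.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 10))


def pairwise(x):
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)


measure_a = pairwise(features[:, :5])
measure_b = pairwise(features[:, 5:])
mixed = combine_dissimilarities([measure_a, measure_b], weights=[0.7, 0.3])
points_3d = classical_mds(mixed, n_components=3)  # shape (50, 3)
```

Re-running the last two lines with different weights gives a new point cloud, which is the kind of cluster migration the point-cloud view is meant to show. This dense version needs memory and time that grow at least quadratically with the number of points, so it is suitable only for small data sets.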
The problem with both dissimilarity matrices and multidimensional scaling (MDS) is that they do not easily scale to large data sets. We have developed a prototype of an approximate dissimilarity-matrix visualization and a fast MDS algorithm that are targeted at interactive display rates and large data sets. Our methods are orders of magnitude smaller and faster than previously published methods on similarly sized data sets. Our goal is to provide visualizations of cluster formation and migration as the contributions of various data set features are interactively modified. We have conducted pilot studies applying our approach to multi-experiment gene-expression data and SNP-phenotype correlation assays. In the future, we plan to add more visualization features to our system and support alternative non-linear dimensionality-reduction methods.

Intellectual merits of the proposed research: The proposed visualization tools will assist scientists in many disciplines, including biologists studying gene function, medical doctors comprehending disease susceptibility, chemists developing candidate drugs, and high-energy physicists analyzing the data generated by particle accelerators. Furthermore, applying newly developed dimensionality-reduction methods to complex real-world problems will help to evaluate and refine these methods, making them more effective and efficient.

Broader impacts of the proposed research: The proposed research requires knowledge of computer science, biology, chemistry, and physics. Such interdisciplinary integration is crucial for the scientific development of future computer scientists. This research effort will be integrated into our graduate and undergraduate instruction. The PIs will offer classes in the related topics of visualization and bioinformatics, where these tools will aid students in comprehending abstract concepts and data relations.
We will engage in outreach programs to attract underrepresented minorities to the field of computer science, and provide research opportunities for undergraduate minority students to stimulate an interest in graduate study.
