CAREER: Mining Salient Localized Patterns in Complex Data
Principal Investigator: Wei Wang
Funding Agency: National Science Foundation
Agency Number: IIS-0448392
Abstract
One of the greatest challenges in modern data analysis is to find significant and non-obvious patterns within immense and complex data sets. The detection of such salient patterns is an indispensable tool for comprehending the trends and meaning of data. Such tools are required by scientists, economists, marketing analysts, and all other data analysts. In this research initiative I plan to develop new algorithms, methods, and tools for identifying the salient patterns within complex data sets. My five-year CAREER plan includes the following research objectives:
• Evaluate the significance of mined patterns in the face of complex and noisy data
• Design robust and scalable algorithms for mining the most salient patterns
• Integrate and correlate heterogeneous data sets based on corresponding patterns
Solving these problems will make it possible to analyze huge data sets that are currently intractable in their raw form. I have chosen to focus my efforts on problems related to bioinformatics in order to take advantage of the broad range of expertise available within the University of North Carolina at Chapel Hill's Carolina Center for Genomics Sciences (CCGS), of which I am a member. I have selected the following four driving applications:
• Integrative Genetics of Cancer Susceptibility
• HIV Salivary Gland Disease (SGD) Pathogenesis
• Discovering Family Specific Residue Packing Patterns of Proteins
• Integrative Functional Annotation
All of these problems produce massive quantities of data that exemplify the kind of input the salient feature selection algorithms developed in this project are intended to handle.
While the data collected from these applications are all complex, each has different characteristics. The analysis of gene expression levels from DNA microarrays is a central component for understanding the genetics of cancer susceptibility and HIV SGD pathogenesis. In finding frequent residue packing patterns of proteins, the data sets are composed of graphs representing the structures of proteins in the Protein Data Bank (PDB). In integrative functional annotation, the goal is to find significant relationships and patterns between different, but related, sources of biological information. We will cross-link gene expression patterns and protein sequence and structural motifs with existing knowledge from structural (e.g., SCOP) and functional (e.g., Gene Ontology) classifications.
The intellectual merits of this proposal include a new class of data analysis tools for analyzing the huge data sets generated by modern quantitative genetics technologies. These tools will assist biologists in their study of functional proteomics, aid in their understanding of disease progression, and assist in the search for effective treatments. To be useful, the data mining techniques must also be accurate, computationally efficient, and able to operate autonomously.
The broader impacts include immediate applications to fields other than genetics, a multitude of educational impacts, and outreach to underrepresented minorities in the sciences. Our pattern mining methods will be applied to analyze the administrative paperwork of child welfare cases from the North Carolina Department of Health and Human Services (NCDHHS) in an effort to improve services and achieve better outcomes for children in the welfare system. Educational impacts include new curriculum development, support for multidisciplinary educational experiences, and support for the research community.

