III-Core: Discovering and Exploring Patterns in Subspaces (IIS0812464)
September 1, 2008 to August 31, 2011
High-throughput experimental methods have revolutionized scientific
inquiry. In contrast to the hypothesis-driven scientific method,
data-driven science seeks to discover and explore hypotheses supported
by the huge volume of data generated in high-throughput experiments.
Such datasets are large and high-dimensional: they consist of a
multitude of samples and many measured attributes for each sample. A
typical hypothesis corresponds to a subspace of this dataset: a subset
of samples that share similar values on a subset of attributes.
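As a minimal sketch of this definition, and not the project's own method, the Python snippet below tests whether a chosen subset of samples agrees, within a tolerance, on a chosen subset of attributes. The function name `is_subspace_pattern`, the `tolerance` parameter, and the toy data are all hypothetical and introduced only for illustration.

```python
def is_subspace_pattern(data, samples, attributes, tolerance):
    """Return True if the chosen samples have similar values on every chosen attribute.

    "Similar" is an assumption made for this sketch: on each attribute, the
    spread (max - min) over the chosen samples must not exceed `tolerance`.
    """
    if not samples or not attributes:
        return False
    for a in attributes:
        values = [data[s][a] for s in samples]
        if max(values) - min(values) > tolerance:
            return False
    return True


# Toy dataset: 5 samples (rows) x 4 attributes (columns); values are made up.
data = [
    [1.0, 5.1, 9.0, 2.0],
    [7.5, 5.0, 3.2, 2.1],
    [4.2, 4.9, 6.6, 1.9],
    [0.3, 8.8, 1.1, 7.4],
    [9.9, 0.2, 4.5, 5.5],
]

# Samples 0-2 agree on attributes 1 and 3, so this pair of subsets is a pattern.
print(is_subspace_pattern(data, samples=[0, 1, 2], attributes=[1, 3], tolerance=0.3))

# Adding attribute 0, on which the same samples differ widely, breaks the pattern.
print(is_subspace_pattern(data, samples=[0, 1, 2], attributes=[0, 1, 3], tolerance=0.3))
```

With n samples and m attributes there are on the order of 2^n x 2^m candidate pairs of subsets, so checking each one this way is impractical for large datasets.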
The goal of this project is to develop a series of new data mining
methods that can effectively discover these subspaces, the embedded
patterns among the values, and the relationships between patterns. The
underlying problems are highly combinatorial, and efficient algorithms
are required to enable users to mine and explore subspace patterns in
large and complex datasets. The proposed methods combine the advantages
of efficient matrix decomposition, effective sampling techniques, and
advanced graph algorithms. Solutions to these research problems will be
integrated into an interactive and visual interface to explore subspace
patterns mined from experimental data.
While the proposed methods are applicable across a wide range of
domains, the focus of this project is the analysis of gene regulatory
networks and the analysis of protein structure, in collaboration
with geneticists and pharmacologists, respectively.
Personnel:
Principal Investigators:
Students:
Publications:
- Efficient genome ancestry inference in complex pedigrees with inbreeding,
by Eric Yi Liu, Qi Zhang, Leonard McMillan, Fernando Pardo-Manuel de Villena, and Wei Wang.
Proceedings of the 18th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), Special Issue of Bioinformatics, vol. 26, no. 12, pp. 199-207, 2010.
- TEAM: efficient two-locus epistasis tests in human genome-wide association study,
by Xiang Zhang, Shunping Huang, Fei Zou, and Wei Wang.
Proceedings of the 18th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), Special Issue of Bioinformatics, vol. 26, no. 12, pp. 217-227, 2010.
- GAIA: Graph classification using evolutionary computation,
by Ning Jin, Calvin Young, and Wei Wang.
Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 879-890, 2010.
- Graph Classification Based on Pattern Co-occurrence,
by Ning Jin, Calvin Young, and Wei Wang.
Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), pp. 573-582, 2009.
- Split-order distance for clustering and classification hierarchies,
by Qi Zhang, Eric Yi Liu, Abhishek Sarkar, and Wei Wang.
Proceedings of the 21st International Conference on Scientific and Statistical Database Management (SSDBM), pp. 517-534, 2009.
- COE: a general approach for efficient genome-wide two-locus epistasis test in disease association study,
by Xiang Zhang, Feng Pan, Yuying Xie, Fei Zou, and Wei Wang.
Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 253-269, 2009.
- TreeQA: quantitative genome wide association mapping using local perfect phylogeny trees,
by Feng Pan, Leonard McMillan, Fernando Pardo-Manuel de Villena, David Threadgill, and Wei Wang.
Proceedings of the 14th Pacific Symposium on Biocomputing (PSB), pp. 415-426, 2009.
- Inferring genome-wide mosaic structure,
by Qi Zhang, Wei Wang, Leonard McMillan, Fernando Pardo-Manuel de Villena, and David Threadgill.
Proceedings of the 14th Pacific Symposium on Biocomputing (PSB), pp. 150-161, 2009.
- FastChi: an efficient algorithm for analyzing gene-gene interactions,
by Xiang Zhang, Fei Zou, and Wei Wang.
Proceedings of the 14th Pacific Symposium on Biocomputing (PSB), pp. 528-539, 2009.
- Quantitative association analysis using tree hierarchies,
by Feng Pan, Lynda Yang, Leonard McMillan, Fernando Pardo-Manuel de Villena, David Threadgill, and Wei Wang.
Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), pp. 971-976, 2008.
- Functional neighbors: relationships between non-homologous protein families inferred using family-specific fingerprints,
by Deepak Bandyopadhyay, Luke Huan, Jinze Liu, Jan Prins, Jack Snoeyink, Wei Wang, and Alexander Tropsha.
Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2008.
- REDUS: finding reducible subspaces in high dimensional data,
by Xiang Zhang, Feng Pan, and Wei Wang.
Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), pp. 961-970, 2008.
- Mining non-redundant high order correlations in binary data,
by Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel.
Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), pp. 1178-1188, 2008.
- FastANOVA: an efficient algorithm for genome-wide association study,
by Xiang Zhang, Fei Zou, and Wei Wang.
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 821-829, 2008. (Best Research Paper)
- CRD: a general framework for fast co-clustering on large datasets utilizing sample-based matrix decomposition,
by Feng Pan, Xiang Zhang, and Wei Wang.
Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 173-184, 2008.
- CARE: finding local linear correlations in high dimensional data,
by Xiang Zhang, Feng Pan, and Wei Wang.
Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE), pp. 130-139, 2008. (Best Student Paper)
- Poclustering: lossless clustering of dissimilarity data,
by Jinze Liu, Qi Zhang, Wei Wang, Leonard McMillan, and Jan Prins.
Proceedings of the 7th SIAM Conference on Data Mining (SDM), 2007.
- Clustering pair-wise dissimilarity data into partially ordered sets,
by Jinze Liu, Qi Zhang, Wei Wang, Leonard McMillan, and Jan Prins.
Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 637-642, 2006.
Sponsor: National Science Foundation