Mining Structural Patterns in Protein Families for Classification and Annotation

 
 

I have been working with Luke Huan on incorporating graph representations that use Delaunay and Almost-Delaunay edges into his Fast Frequent Subgraph Mining (FFSM) algorithm. We showed that the almost-Delaunay representation is sparse like the Delaunay, while maintaining robustness to perturbations (by definition) and high classification accuracy (as reported in our paper).

I used the above picture as my logo at 3DSIG (the Structural Bioinformatics special interest group at ISMB-ECCB 2004). At top left, the logo shows a frequent subgraph motif that occurs in more than 90% of the Serine Proteases and in no more than 5% of other proteins. It contains the Catalytic Triad and a nearby geometrically conserved Alanine. The middle of the top row shows the subgraph motif mapped to the structure of serine protease 1LO6 (human kallikrein). Such motifs serve as family-specific fingerprints based on local structure, since they are enriched only in proteins of the family. They can be used with high accuracy to classify proteins as belonging to the family, and to annotate the functional family and functional regions of a protein of unknown function.
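The selection rule described above (more than 90% of the family, no more than 5% of other proteins) can be sketched in a few lines. This is only an illustration under stated assumptions, not the FFSM implementation: each protein is assumed to be already reduced to the set of motif identifiers found in it, and the names `support`, `select_fingerprints`, and the toy motif labels are hypothetical.

```python
def support(motif, proteins):
    """Fraction of proteins (each a set of motif ids) that contain the motif."""
    if not proteins:
        return 0.0
    return sum(1 for found in proteins if motif in found) / len(proteins)


def select_fingerprints(motifs, family, background,
                        min_family=0.90, max_background=0.05):
    """Keep motifs that are frequent in the family and rare elsewhere.

    Thresholds follow the text: support strictly above 90% in the family
    and at most 5% in the background set.
    """
    return [m for m in motifs
            if support(m, family) > min_family
            and support(m, background) <= max_background]


# Toy data (hypothetical): 10 family members, 20 background proteins.
family = [{"triad", "common"}] * 5 + [{"triad", "common", "helix_cap"}] * 5
background = [{"triad", "common"}] + [{"common"}] * 9 + [set()] * 10

fingerprints = select_fingerprints(["triad", "common", "helix_cap"],
                                   family, background)
print(fingerprints)  # ['triad']
```

Here `common` is rejected because it appears in half of the background proteins, and `helix_cap` because it covers only half of the family.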




Here we show the intuition behind using almost-Delaunay (AD) edges as a graph representation for subgraph mining. If we measure density by average vertex degree, the AD edges with increasing cutoff (0.5, 1.0, 1.5 shown) smoothly interpolate between the sparse Delaunay edges and the very dense (quadratic in n) contact map edges. Additionally, AD edges are more robust than Delaunay edges, and not much slower to compute.
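As a small illustration of the density measure, the sketch below builds contact-map edges from a distance cutoff and reports the average vertex degree. It covers only the contact-map end of the comparison; Delaunay and almost-Delaunay edges need a computational geometry library and are not shown. The function names and the toy coordinates are assumptions made for the example.

```python
import math
from itertools import combinations


def contact_edges(coords, cutoff):
    """All index pairs (i, j), i < j, whose points lie within `cutoff` of each other."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(coords), 2)
            if math.dist(a, b) <= cutoff]


def average_degree(n_vertices, edges):
    """Average vertex degree of an undirected graph: 2 * |E| / |V|."""
    return 2 * len(edges) / n_vertices if n_vertices else 0.0


# Toy coordinates (hypothetical): four points on a line, 4 Å apart.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]

for cutoff in (4.5, 8.5):
    edges = contact_edges(coords, cutoff)
    print(cutoff, len(edges), average_degree(len(coords), edges))
# 4.5 3 1.5
# 8.5 5 2.5
```

Raising the cutoff adds edges between ever more distant pairs, which is why the contact map becomes dense much faster than the Delaunay-based representations.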

We show the residues covered by one large fingerprint in Eukaryotic Serine Proteases, as purple space-filling regions in the model of human Kallikrein (PDB: 1lo6). This fingerprint was found by mining graphs of almost-Delaunay edges with cutoff 0.1 Å and edge length prune 8.5 Å.



Structure-based Functional Annotation  
Fingerprints highlighted in 1nfg (Metallo-dependent hydrolase) and 1m65 (structural genomics, Ycdx protein, CASP5 T0147). Red = acidic, Blue = basic, Purple = polar, White = hydrophobic. The Metallo-dependent Hydrolase fingerprints occur in active/binding site regions.

Above we show an example of functional annotation using fingerprints. The structure on the left represents the SCOP family 51556, Metallo-dependent hydrolases, which adopt a TIM-barrel 8-stranded beta-alpha fold. The structure on the right is a structural genomics protein that was a target in the last CASP. It was suspected to have Metallo-dependent hydrolase function based on crystallographic observations, but has a different overall fold (7-stranded beta-alpha barrel). Thus it is classified by SCOP as a PHP domain, and is annotated as "Unknown Function" in the Protein Data Bank. Using our subgraph-based annotation techniques, we find that 1m65 is indeed classified as a Metallo-dependent Hydrolase even though it has a different fold. Many of the fingerprints are concentrated in its active site and metal-binding site towards the center and top of the barrel, as shown above. This supports conclusions made by other programs and databases that compare active site residues (such as Roman Laskowski's ProFunc). Our approach has the advantage that we automatically mine family-specific features, whereas active sites are often specified in advance from known structures and are not unique to a family. Also, the use of a rigid geometric or residue template for the active site may lead to misses in cases with distorted active site geometry or residue substitution. We can still recover most of the family-specific fingerprints even in families where functional sites are not geometrically conserved.



Sequence Annotation  

Above, we used subgraph fingerprints to annotate orphan structures. To tackle the much larger problem of annotating protein sequences with these techniques, we may either derive sequence patterns from the structure fingerprints, if their sequence order and spacing are conserved in most members of a family (described below), or apply the structure-based fingerprints to predicted structures, which is work in progress.

Shown on the left is a schematic of finding family-specific tetrahedra or subgraphs, along with corresponding sequence patterns, for use in annotation of protein sequences. This is joint work with Ruchir Shah and Luke Huan. We have applied these methods to several protein families including toxins and WD40 beta-propellers (in collaboration with Najl Valeyev). More results will be here soon.
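To make the idea of deriving a sequence pattern from a structure fingerprint concrete, here is a minimal sketch. It assumes each occurrence of a fingerprint is given as a sequence string plus the 0-based positions of the fingerprint residues, and it emits a regular expression only when residue identity, sequence order, and spacing agree in every occurrence. That is stricter than the "most members" criterion in the text, and the function name and toy sequences are hypothetical.

```python
import re


def sequence_pattern(occurrences):
    """Derive a regex from fingerprint occurrences, or None if not conserved.

    occurrences: non-empty residue position lists, one per family member,
    given as (sequence, positions) with positions in motif-vertex order.
    """
    signatures = set()
    for seq, positions in occurrences:
        # Order of the motif vertices along the sequence.
        order = sorted(range(len(positions)), key=positions.__getitem__)
        ordered = [positions[k] for k in order]
        residues = tuple(seq[p] for p in ordered)
        # Number of residues between consecutive fingerprint residues.
        gaps = tuple(b - a - 1 for a, b in zip(ordered, ordered[1:]))
        signatures.add((tuple(order), residues, gaps))
    if len(signatures) != 1:
        return None  # order, residues, or spacing differ between members
    _, residues, gaps = signatures.pop()
    parts = [residues[0]]
    for gap, residue in zip(gaps, residues[1:]):
        if gap:
            parts.append(".{%d}" % gap)
        parts.append(residue)
    return "".join(parts)


# Toy occurrences (hypothetical sequences and positions).
occurrences = [("AAHGGDKKKSA", [2, 5, 9]), ("HTTDLLLSGG", [0, 3, 7])]
pattern = sequence_pattern(occurrences)
print(pattern)                                   # H.{2}D.{3}S
print(bool(re.search(pattern, "MMHAADQQQSM")))   # True
```

A pattern obtained this way can then be scanned against sequences of unknown structure; relaxing the exact-spacing requirement to gap ranges would be a natural extension, but is not attempted here.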

Publications on subgraph mining:



Project Members  
Deepak Bandyopadhyay, Jun (Luke) Huan, Jinze Liu
Graduate Students, UNC Chapel Hill
debug, huan, liuj at cs.unc.edu

Wei Wang
Assistant Professor of Computer Science, UNC Chapel Hill
weiwang at cs.unc.edu

Jack Snoeyink, Jan Prins
Professors of Computer Science, UNC Chapel Hill
snoeyink, prins at cs.unc.edu

Alexander Tropsha
Professor and Director, Laboratory of Molecular Modeling, School of Pharmacy, UNC Chapel Hill
alex_tropsha at unc.edu