%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Documentation file (text) included with Deepak %% Bandyopadhyay's MATLAB code for computing the %% almost-Delaunay tetrahedra and their thresholds %% UNDER CONSTRUCTION: this file is almost-Done(epsilon) %% %% Code Version: 0.03 (alpha) %% Revision Date: Feb 6, 2003 %% Contact: debug at cs.unc.edu %% RESEARCH CODE: probably not well-enough documented %% to modify, though we will assist you in setting up %% or in conversion to other languages (a C++ port is %% envisioned in the future, or we could make a binary %% using MATLAB Compiler and put it up). %% %% The plan is to put this code up on the web as a CGI script %% using one of the MATLAB web server tools. Until that %% happens, here is the way I run the programs on the %% MATLAB command line, followed by a description of each. %% >> % bootstrap, loads pts degree cutoff prune %% >> load 'ADstart.mat' %% >> % input phase, can skip since defaults already loaded at bootstrap %% >> C=pdb2cell('1xyz.pdb'); %% >> pts=Calphas(C); %% >> % computation %% >> [an,ap,adp,a3,at,adt,a4,aq,adq]=aDsimpfast2(pts,degree,cutoff,prune); %% >> res; %% >> % visualization %% >> plotAD4; % plots AD tetra. upto cutoff, in translucent shades of R/Y %% >> plotADsphere(pts,res4,adc,i); % plots the min. width annulus for a tet. %% % adc is opt. last output arg to aDsimp* %% >> patternhistd; % pattern histogram showing alpha patterns %% >> betahistd; % same for the best beta patterns we have so far %% >> helixcount; % produces "helices" a set of indices of individual helices %% >> betacount; % produces "sheets" a similar set for beta sheets %% % and "bonds", pairs of bonded neighbors in adj. strands %% The programs are divided into three groups: %% 1) MATLAB core (AD and Delaunay probability) code %% 2) MATLAB visualization and motif-computation code %% 3) Bala Krishnamoorthy's C++ code modified for AD-SNAPP and %% Delaunay probability SNAPP computation. %% groups 1 and 2 are packaged in this distribution. group 3 is %% distributed separately on request. It is not much different from %% the code of Krishnamoorthy & Tropsha which they do not distribute %% on the web, but is again available on request. %% Comments for individual files follow: %% files under group 1 (MATLAB core) %% ================================= %% %% ADstart.mat %% Data file with initial values of global variables stored thus: %% pts % 98x3 C-alphas of 2acy %% degree = 12 ; % calculate triangles (3) * tetrahedra (4) %% cutoff = 2.0; % default threshold cutoff %% prune = 10.0; % default edge length prune %% %% ADsimpfast1.m %% Invocation: %% [annuli,allpairs,adpairs,a3,alltri,adtri,a4,allquads,adquads,{ADcenters}] %% =aDsimpfast1(pts,{degree,cutoff,prune,outfile}); %% Parameters (Input): %% pts = n x 3 matrix of point coordinates, assumed to be sequence-numbered %% from 1 to n. Thus our data input from PDB files normalizes the %% sequence numbers: see pdb2cell(), Calphas() and SideCentroids(). %% %% degree = degree of the simplex/simplices to be output. Possible %% values: 2 (calculate only AD edges), 3 (edges and triangles), %% 4 (edges and tetrahedra), 12 (edges, triangles and tetrahedra). %% 12 is the default; this convention is there since in earlier %% versions of the code I used modulo arithmetic to check and %% multiplex degrees, until I saw how slow MATLAB's %% implementation of mod() was during profiling. %% %% cutoff = threshold cutoff, a maximum for the value of the AD threshold %% to be output, as described in the paper. Default is 2.0 %% %% prune = edge length prune, a maximum on the length of edges in any %% AD edge, triangle or tetrahedron whose threshold is to be %% evaluated. The default is 10.0 (angstrom), though we have %% tried values from 8.0 to 15.0. Setting it to a large number %% or inf causes all AD edges, triangles and tetrahedra to be %% considered; this is the worst case O(n^5 log n) behavior that %% we mention in the paper and is not of interest for proteins. %% %% outfile = file to write output to as text; output consists of AD %% tetrahedra with epsilon>0, written to , and these %% in addition to the Delaunay tetrahedra, written to %% +".dad", which stands for Delaunay and AD. This %% argument was specifically designed to process and output %% these two kinds of tetrahedra for a large set of proteins %% as input to SNAPP potential computation (training set), %% SNAPP scoring (a protein & its decoys), or Delaunay %% probability computation (both sets). The output consists %% of m rows of 4 indices, one for each tetrahedron. Threshold %% is not output; in our specialized applications, it is just %% required that it be small, and the actual value is not imp. %% Parameters (output) %% %% annuli = all AD edge thresholds that are >=0, calculated as half the %% minimum-width annuli for these edges as described in the paper. %% %% allpairs = rows of 2 indices of points on AD edges. The rows in %% allpairs correspond to the annuli, and so the two are often %% clubbed together. %% %% adpairs = similar structure, but with zero threshold (Delaunay) edges %% filtered out. These correspond to annuli(annuli>0) and %% can be clubbed with them. %% %% a3, alltri, adtri have the same meanings as annuli, allpairs and adpairs, %% but for AD triangles, thus there are 3 indices per row. The same holds %% for a4, allquads, adquads for AD tetrahedra. %% %% ADcenters = optional output, stores 3D coordinates of the centers of %% minimum-width annuli for each AD tetrahedron. This is %% needed only if you want to visualize the min-width annuli %% for some tetrahedron using the program PLOTADSPHERE, %% and verify visually that the four points lie within or %% on the annulus and that the AD threshold %% computation is accurate. %% %% Description: %% "Safe but slow" version of almost-Delaunay main program. %% It does not do any angle pruning and calculates the "AD tetrahedra %% with all Delaunay edges" by considering all centers, instead of using %% our two heuristics mentioned in the documentation of aDsimpfast2. %% However it takes at least twice the time, so we use adsimpfast1 only %% when absolute accuracy in the thresholds is critical; adsimpfast2 %% is good enough in practice. %% %% ADsimpfast2.m %% Description: %% %% This version takes on the average half the time of the %% previous version since it uses two heuristics. %% %% Heuristic 1 (AD tetrahedra with all Delaunay edges) %% In proteins these seem to come from a specific situation of five %% points {a,b,c,d,e} that form three Delaunay tetrahedra {abde,bcde,cade} %% sharing one edge (de) and pairwise sharing four more edges; the %% non-shared edges {ab,bc,ca} together form an AD triangle abc, and %% adding the points of the shared edge gives two AD tetrahedra, abcd and %% abce. NOTE: In general there are other cases possible not using %% this configuration, but we haven't observed them in proteins since they %% need a high degree of symmetry. %% The minimum threshold for these can be found without searching %% through all candidate Voronoi centers. Just considering these five points %% is sufficient, with two of the points of triangle abc on the outer %% sphere and the other one along with de on the inner sphere, or else %% abc on the outer sphere and the shared edge de on the inner. The %% tetrahedra abcd and abce have the same threshold as abc since de %% lie on the inner circle for four all cases above. %% This heuristic has not been known to fail for any case that we've tested %% - in a later version of the paper we will attempt to formally %% prove this; for now, it works & works very well, since there are only %% a constant number of cases (4) for such triangles and tetrahedra instead %% of being proportional to the number of points. %% Heuristic 2 (angle pruning) %% limits the candidate centers for AD edges to those that pass the %% "cone condition" mentioned in the paper, and the candidate edges between %% pairs of centers considered for AD triangles & tetrahedra, to those %% that use at least one center that passed the cone condition. %% This heuristic has been known to fail (and return a threshold value %% for a few tetrahedra that is 0.1 to 0.2 angstrom off since the correct %% minimum lies on an edge between two non-local-minimum centers). However %% the error has not been seen to be more than about 1.5 angstrom over 1-5 %% out of thousands of tetrahedra, for typical proteins. When one is %% trying to statistically compare the AD distributions or use them to %% score proteins or find motifs, this error is observed to be %% insignificant. So we use this code most of the time, using the %% ADsimpfast1 code when absolute accuracy is critical; it has been checked %% against a slow, naive but correct implementation in a previous version %% of the code. %% NEWS FLASH (Feb 1, 2003): I have fixed the heuristic so that now it %% works in all cases - I just include in the edges considered, those %% whose one point has a low cutoff and is nearer than the prune distance %% from the center of the bisector plane, though it does not pass the %% angle pruning condition. I also remove vertices that pass the angle %% pruning condition but whose threshold is too high. The heuristic seems %% to work as well as or better than before, though the speed gained depends %% on choosing a low value of cutoff, which isn't a problem for proteins. ------------------------------------------------------------------------- %% delprob.m (Delaunay probability code) %% result = delprob(pts, ntrials, tets) %% pts is n x 3 matrix of point coordinates, ntrials is the number of %% steps for Monte Carlo integration, by default 1000; ideally it should %% be higher but 1000 is fast enough to be practical for long lists of %% protein points. %% tets are the indices to verices of tetrahedra, 4 per rows %% result is a column vector as long as tets, with the probability of %% each tetrahedron being Delaunay after perturbation of all points %% from a Gaussian with radius 0.1 angstrom. %% Other core programs that use or modules used by ADsimpfast* & delprob: %% (shown by typical use rather than explanation of all parameters) %% * List processing: %% createADlist sc Read a file called "sc_list" and "process" all PDB %% files in there or lists of them, by loading the %% points from .PDB and taking SC centroids (or %% loading the condensed points from .pdb.sc), then: %% aDsimpfast(pts,degree=12,cutoff=0.3,prune=12,outfile) %% and writing _sc_tet.txt (AD tet with epsilon>0) %% and _sc_tet.txt.dad (Delaunay and AD tet) %% createADlist ca Same for C-alphas and "ca_list" %% createproblist ca Compute Delaunay probabilities for a list of files %% stored in ca_list, and their AD tetrahedra in %% the same directory as _ca_tet.txt.dad, writing %% output to _ca_prob.txt.dad %% %% createproblist sc Same using sc_list, _sc_tet.txt.dad, and %% _sc_prob.txt.dad %% %% res, rec These convert the output from aDsimp* into a more %% compact form. The relevant outputs for AD tet are: %% res4=[adq a4(a4>0)] and resd4=[aq a4] %% The first encodes indices & thresholds of all AD %% tetrahedra with positive thresholds, while resd4 adds %% the Delaunay tetrahedra with threshold zero. %% Point coordinate data input and preprocessing: %% C = pdb2cell('foo.pdb'); % read all ATOM entries in PDB file %% % (all chains present) into cell %% % array C %% or C = pdb2cell('foo.pdb','A'); % optional, to grab chain only %% %% pts = Calphas(C); % extract Calphas, one per residue %% or pts = SideCentroids(C); % extract sidechain centroids, one per residue %% points can also be saved and loaded once they have been computed %% Structures and Functions to store AD triangles & tetrahedra %% min_lookup_store3 %% Kind of a SLT (set if less than) instruction vectorized to %% handle indices into a 3D array and corresponding values. %% For efficiency reasons, two of the dimensions of the array %% are clubbed together to make a 2D array %% min_lookup_store_edge4 %% A similar instruction to min_lookup_store3 except that its indices %% are quadruples and are converted to 2D by indexing quads by short %% edges. For example, if (1,3) is short edge #1 and (5,6) is #7, %% quad (1 3 5 6) is stored at location [1,7]. Also, we have modified %% this function to concurrently store centers of annuli. %% %% Utility modules called by ADsimpfast*.m %% %% fatten(A,rowfactor,colfactor) %% Builds a new matrix A' that sequentially repeats each row of A %% rowfactor times and each column, colfactor times. For example, %% fatten([2 3 ; 1 4],1,2) returns [2 2 3 3 ; 1 1 4 4] %% %% modsqr(A) is shorthand for dot(A,A,2) %% %% normalize(A) is shorthand for A divided rowwise by sqrt(modsqr(A)) %% %% NDTriTetWithDelEdges(T), given the short Delaunay tetrahedra in T, finds %% the AD triangles and tetrahedra with all Delaunay edges using the %% heuristic in the documentation of ADsimpfast2.m. %% %% NDintersect(pts,pairindices, tripleindices) returns the centers of %% minimum-width annuli containing the points indexed by the rows of %% pairindices on one of the spheres and the points indexed by the rows %% of tripleindices on the other. Thus it works for both the cases as %% described in the documentation; 2 points outside and 3 inside, or %% 2 inside and 3 outside. Though it is written for the case described, %% it is in fact a completely general utility to find the centers of %% annuli given the two points on one sphere and the three points on the %% other. %% Group 2: Visualization and Motif Detection tools in MATLAB %% ========================================================== %% %% helixtable.m %% ht = helixtable(res4) %% This function generates a table of the patterns of sequence interval %% (eg. tetrahedra with indices 1 2 3 5 and 3 4 5 7 both can fall into %% i+(1 2 3 5) (parametric), 1-1-2 (sequence gap) or ooo.o (vertex/gap) %% patterns. The function reports only patterns that are "statistically %% significant" (defined as occurring in at least 0.5% of tetrahedra) %% The output has 7 columns as below: %% %% i1 i2 i3 i4 mean std dev probability %% 1 2 3 4 0.4471 0.2562 0.0123 %% 1 2 3 5 0.3364 0.1841 0.0101 %% . . . . . . . . %% %% Patterns that are within 0.1 Angstrom standard deviation %% show up as sharp peaks in the plot, and patterns with over 0.25 %% Angstrom have no peaks and are spread out over a range of %% AD thresholds. This tool helped us show the existence of peaks %% in alpha, 310 and pi helices; for beta sheets and other motifs, %% it found no characteristic peaks and we had to write tools that %% looked for other patterns. %% patternhist, patternhistd, betahist, betahistd, aDhistd.m %% These tools are invoked after thresholds are calculated and stored %% compactly as res4 (AD) and resd4(AD+Delaunay) tetrahedra. %% They use pts, res(d)4 and cutoff, which currently need to be declared %% globally, as input. %% They plot the histogram distribution of AD tetrahedral thresholds %% as a stacked bar chart, broken up by patterns. {*histd} operate %% on resd4 and {*hist} on res4, so {*histd} create a separate bar for %% the Delaunay tetrahedra at epsilon=0. I have also written a program %% to create a regular histogram with Delaunay bar colored differently, %% and that is aDhistd.m %% helixcount, betacount %% These programs, using the knowledge and patterns developed for making %% Pattern histograms for alpha-helices and beta-sheets, apply some %% filtering on AD threshold and a few heuristics to try to predict which %% residues lie in a helix or strand. The output of helixcount, a *x2 %% matrix "helices" in the global namespace, contains pairs of start and %% end sequence numbers for helices. Betacount produces the same kind %% of structure called "sheets", and also "bonds" which encodes which %% residues in adjacent sheets are determined to be connected. The total %% number of residues in sheet or helix can be counted thus: %% sum(helices(:,2)-helices(:,1)+1) %% plotAD4 Given a cutoff for epsilon, and assuming the output of ADsimpfast* is summarized by "res", it plots the tetrahedra with threshold less than the specified cutoff in translucent shades of red and yellow, so you can see the ones in the interior and the self-intersections. If additionally the variable "colorbar" exists and is set to 1, it will append a legend of the color scheme. %% plotADsphere(pts,list,adc,i) %% Visualize the annulus (pair of spheres) corresponding to the AD %% threshold for edge, triangle or tetrahedron "i" from the list %% "list" of indices and thresholds (typically res4 computed by the %% postprocessing function "res"). This is helpful when debugging %% the computed thresholds to see if the points indeed lie within %% that pair of spheres.