List of software developed during the course of the AD project
by Deepak Bandyopadhyay

This list was prompted by my discovering that I had rewritten some small MATLAB
scripts a couple of times, not remembering that I'd written them earlier. So I
thought I'd catalog them, send the catalog to you, and include it in my thesis.
In case anyone in Compgeom needs to use something, the files are available under
CompGeom/people/debug/AlmDel/ADMatlab, and I'll be cleaning up and documenting
each file after this submission.

1. C++ software - ADCGAL (edges) and ADsimpCGAL (edges, triangles, tetrahedra
   in 3D), touched on in earlier emails sent in December, and to be described
   in the CGAL workshop paper.

2. MATLAB software (only the newer programs that I still use regularly, not the
   obsolete/superseded ones)

BCH222/Kinemage conversion

voltet.m - Volume of a tetrahedron
kintet*.m - Various visualizations of the almost-Delaunay tetrahedra classified
    into groups based on their thresholds, Calphas/Sidechains, tetrahedron
    types, helix and sheet patterns, volume increase/decrease between Calpha
    tetrahedra and SideCentroid tetrahedra, etc.
bchfinal.m - This started out as a catch-all final project for BCH222: it
    processes a protein, assigns secondary structure to it using AD, DSSP and
    a visual-geometric method, and writes kinemages that illustrate the
    comparison.
visualgeomstruct.m - Visual-geometric assignment: implementation of Kahn's
    method for helices; beta-strands from GASP (Drennan et al.); and beta-turns
    using distance and tetrahedrality alone (a superset of AD).
kincomparelist - The list form of bchfinal, which processes a list of PDBs and
    chains
hackPDB.m - Produce a PDB header record from a secondary structure assignment
writePDBheader.m - This is useful as an "implant" into a PDB file to make
    Prekin use our definitions of secondary structure while making kinemages.

Secondary structure assignment

helixcount.m ] - The three methods to assign helix, sheet and turn residues.
bturncount.m ]   They are scripts, not functions, and they work on the
betacount.m  ]   variables pts and resd4 (AD tetrahedra and thresholds) from
                 the global namespace. They should always be run in the order
                 helix, then turn, then sheet, since that is the order of
                 decreasing accuracy, and the later assignments assume the
                 earlier assignments have been made and exclude those residues.
createmotiflist.m - automate assignment of secondary structures (and storage
    in mat files) for a list of PDB files. Superseded by later programs.
tetrahedrality.m - a measure of disparity in tetrahedron edge lengths, used by
    bturncount
findnextinseq.m - matches up two lists of potential helix start and end
    residues, to give maximal-length helices consisting of regions from a
    start to its nearest end. Works even when there is more than one signal
    for a helix; blanks out the false starts. Used by helixcount.
chaincontacts.m - Find AD tetrahedra that lie between two parts of a sequence,
    making a certain minimum number of contacts in each chain (typically 1)
    and a certain number in both chains combined.
betaneighbors.m - Group neighbors in adjoining strands that are adjacent in
    sequence, when deciding whether to keep or prune a beta bond across two
    chains
dssp2cell.m - Parse DSSP and DSSPcont files into a cell array
dsspstruct.m    ] - read in the DSSP secondary structure assignment for a
dsspcompare.m   ]   (PDB,chain) in a form that can be compared to AD and other
compareDSSPlist ]   assignments: ranges of start and stop residues for helices
                    and sheets, and lists of all 4 residues in each beta-turn.
                    Type G (3_10 helices) are converted to beta-turns, since
                    AD and geom methods cannot reliably identify them.
count_assigned_SSE - given ranges of Secondary Structure Elements returned by
    any of the methods, count how many residues are in each type of secondary
    structure.
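The counting step described for count_assigned_SSE can be illustrated with a minimal Python sketch. This is not the MATLAB code itself: the function name, the argument layout (inclusive start/stop ranges for helices and strands, 4-residue lists for turns) and the H/E/T keys are assumptions made for illustration, based only on the data format described in the entries above.

```python
def count_assigned_sse(helices, strands, turns):
    """Count residues per secondary-structure type.

    helices, strands: lists of (start, stop) residue ranges, inclusive.
    turns: lists of the 4 residue numbers in each beta-turn.
    A residue covered by several elements of one type is counted once
    (an assumption; the original script may count differently).
    """
    def residues_in_ranges(ranges):
        covered = set()
        for start, stop in ranges:
            covered.update(range(start, stop + 1))
        return covered

    helix_res = residues_in_ranges(helices)
    strand_res = residues_in_ranges(strands)
    turn_res = {r for turn in turns for r in turn}
    return {"H": len(helix_res), "E": len(strand_res), "T": len(turn_res)}


counts = count_assigned_sse(
    helices=[(3, 10), (20, 25)],
    strands=[(30, 34)],
    turns=[(12, 13, 14, 15), (14, 15, 16, 17)],  # overlapping turns
)
print(counts)
```

Using sets means two overlapping turns do not count their shared residues twice.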
HETstring - convert from Secondary Structure Element data (in the above
    mentioned format) to alphabet strings that can be printed to files, and
    vice versa
readGASPstrands - parse the GASP (Drennan et al.) strand assignment file
    format. The GASP strand parameters are also returned, in case we want to
    revalidate or change the assignment. Strand residues are renumbered from
    the PDB numbering (used by GASP) to the sequential numbering used by AD.
promotif_struct.m ] - read in all assignments made by the PROMOTIF program.
read_promotif.m   ]   Like GASP, it also uses ranges in the original PDB
                      residue number space, and we remap them to a sequential
                      numbering for comparison with AD.
compareDSSPvisuallist.m - run residue-by-residue and cumulative comparisons of
    three assignment methods: AD, DSSP or PROMOTIF, and the visual-geometric
    method.
compare_pairwise_SSE_assignments - a comparison of all n methods (AD, DSSP,
    geom) that assign a list of proteins into m discrete states (LHET), as
    well as a pairwise comparison, done over the entire assignment as well as
    for individual assignments. The score for each comparison is the % of
    residues that match across the assignments.
helices_missed - enumerate contiguous runs of residues assigned as helix (H),
    of length min_length (4) or longer, that are assigned by one of the two
    methods being compared and not by the other. Useful to find irregular
    helices only assigned by AD and not DSSP, as well as entire helices missed
    by AD (usually short ones of 4 or 5 residues).

Delaunay and AD calculation (the core is now in ADCGALwrap.m, which calls
ADsimpCGAL.exe)

pruneddelaunayedges.m - given points and a prune value, returns indices of the
    short Delaunay edges. Superseded by DT_tet_tri_edge.
DT_tet_tri_edge.m - returns pruned Delaunay tetrahedra (T), triangles (t) and
    edges (e) in the given points, with all edges shorter than prune.
ADsimpfast2.m - The latest "correct" version of the AD MATLAB code. Edges are
    fast, but triangles and tetrahedra are slow.
    Also there is a bug in the code that was detected while writing the C++
    version. I now use ADCGALwrap instead.
ADCGALwrap.m - Implements a MATLAB wrapper for the AD CGAL code, including
    transparently reading in the files written to an "AD repository" the first
    time AD is run on a (PDB file,chain,threshold), and conversion of the
    indices from 0-based to 1-based.
ADCGALlist.m - ADCGALlist(cell array of pts,row matrix of labels,start,end,
    cutoff) was used to quickly parcel out parts of the AD computation for a
    huge list of proteins among available computers, by giving them different
    ranges in the start and end parameters.
load_or_compute_AD.m - Used to multiplex between loading precomputed AD
    thresholds stored in 1xyz_ca.mat and computing (or loading) them using
    ADCGALwrap if that file doesn't exist.
ADcompare.m   ] - Compare exhaustively and robustly whether the edge, triangle
testADerror.m ]   and tetrahedra thresholds returned by two different
                  implementations of AD (in particular the MATLAB and C++/CGAL
                  implementations) match up for a large set of proteins.
AD_change_working_directory - All AD tets up to a given threshold are cached in
    a repository called e.g. "data2.0" if the threshold is 2.0. Given the
    cutoff value, AD_change_working_directory goes to the appropriate
    directory, storing the previous directory in a persistent variable. The
    next time this function is called, it reverts to the stored directory.
    Note that the directory caching is only one level deep.

Delaunay probability

cdelprob.dll - Delaunay probability computation in a C-MEX file
testprob.m  ] - check that the Delaunay probabilities of tetrahedra in a set
testprob2.m ]   of proteins sum up to 100+/-2% of the number of Delaunay
                tetrahedra.

SNAPP score related

readscores.m       ] - returns the ensemble of residue score profiles from a
loadsnappprofile.m ]   scores file output by my AD SNAPP scoring code.
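The one-level directory toggle described for AD_change_working_directory above can be sketched in Python as follows. This is a hedged illustration, not the MATLAB function: the module-level variable stands in for MATLAB's persistent variable, the `root` parameter is hypothetical, and formatting the directory name with one decimal place is an assumption based only on the "data2.0" example.

```python
import os

_previous_dir = None  # stands in for the MATLAB persistent variable


def ad_change_working_directory(cutoff, root="."):
    """Toggle between the current directory and the repository for `cutoff`.

    First call: remember the current directory, then enter root/data<cutoff>.
    Next call: return to the remembered directory and forget it, so the
    caching is only one level deep.
    """
    global _previous_dir
    if _previous_dir is None:
        _previous_dir = os.getcwd()
        # Assumed naming: a cutoff of 2.0 maps to the directory "data2.0".
        os.chdir(os.path.join(root, f"data{cutoff:.1f}"))
    else:
        os.chdir(_previous_dir)
        _previous_dir = None
    return os.getcwd()
```

Calling the function twice in a row therefore leaves the caller where it started; a nested second repository change would need a stack rather than a single stored directory.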

Work with Wei/Luke

buildtables.m    ] - These files built a 3-dimensional matrix of the average
tabulatecphist.m ]   vertex degree as a function of cutoff (0.0:0.1:2.0),
lukecount.m      ]   prune (8.0:0.5:16.0) and the protein, from a list of
makedadtable.m   ]   Eukaryotic Serine Proteases and Nuclear Binding Domains.
luketet.m        ]   This was done for the complete, Delaunay and Delaunay+AD
                     tetrahedra. Results were saved in mat files luke16.mat
                     and the figures wei1.fig...wei4.fig.
writelukefiles.m ] - Programs for converting AD and Delaunay edges into
writelukegraph.m ]   Luke Huan's graph representation
writedelgraph.m  ]
loadlukechains.m - Load multiple chains from PDB files stored in Luke's
    list/out format, and store Calpha coords in the cell array ptsarr, one
    for each protein; residue labels in another cell array, labels; and the
    names of each protein and chain in the character arrays protnames and
    chains.
appendlukegraph.m ] - similar to writelukegraph, but allows resume after crash
labeledges.m      ] - divide edges into proximity and peptide edges
edgeseparator.m   ] - ...and into rigid and flexible edges. Given a threshold
    $\epsilon$, the rigid edges are initially the Delaunay edges, and the
    flexible edges are either AD($\epsilon$)-Delaunay or the Delaunay edges
    within the annulus of some $e' \in$ AD($\epsilon$) (i.e. with stability
    $<\epsilon$).

Tetrahedral sequence motifs

bghifreqtets.m - find and store tetrahedra with high frequency in the
    background
tetcomposition.m - given indices of an AD tetrahedron, find the compositional
    quad consisting of the 4 AA's in sequence order or unordered
compute_tetrahedra.m - Wrapper for Delaunay (cutoff==0, calls
    DT_tet_tri_edge()) and AD (cutoff>0, calls ADCGALwrap) tetrahedra
    computation
hist_tet_compositions.m - for all the proteins in the background (PDBselect)
    dataset, histogram the tetrahedra compositions (without storing them all)
    to derive the ones with high frequency in the background. Called by
    bghifreqtets.
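The composition and histogram steps described for tetcomposition.m and hist_tet_compositions.m can be sketched in Python as below. This is an illustration under stated assumptions, not the MATLAB code: the function names are hypothetical, indices are 0-based (MATLAB's would be 1-based), residues are given as one-letter codes, and "unordered" is interpreted as sorting the four codes so that permutations compare equal.

```python
from collections import Counter


def tet_composition(tet, labels, ordered=True):
    """Return the 4 amino-acid codes of a tetrahedron as a tuple.

    tet: 4 residue indices (0-based in this sketch).
    labels: one-letter amino-acid code per residue, in sequence order.
    ordered=True keeps sequence order; ordered=False sorts the codes
    (assumed meaning of "unordered").
    """
    quad = [labels[i] for i in sorted(tet)]
    if not ordered:
        quad.sort()
    return tuple(quad)


def hist_compositions(tets, labels, ordered=True):
    """Histogram compositions without storing every tetrahedron's quad."""
    return Counter(tet_composition(t, labels, ordered) for t in tets)


labels = "GAVLGAVK"
tets = [(0, 1, 2, 3), (4, 5, 6, 7), (3, 2, 1, 0), (0, 1, 2, 7)]
hist = hist_compositions(tets, labels, ordered=False)
print(hist.most_common())
```

A Counter keeps only one entry per distinct composition, which mirrors the "without storing them all" remark in the entry above.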
all_tet_compositions.m - for all the proteins in a family, find the
    almost-Delaunay tetrahedra, convert to compositions, and find the frequent
    ones with some minimum support (occurring in some % of the family)
filesuffix.m - Encodes the naming convention for background and family data.
    Motifs found from Delaunay tetrahedra have no number in the suffix, and
    the AD motifs have the threshold stored as x.yz. The motifs found in
    sequence order have no further suffix, while those out of sequence order /
    in compositional order have _sr (sorted) appended to the suffix.
AA3to1.m - convert row vectors of 3-letter AA codes to one-letter equivalents,
    and vice versa. Also return unique numeric codes that can be used to store
    compositions
hash_composition.m - A composition consisting of the codes for 4 AA's can be
    better stored and manipulated as a single index (formed using the array
    subscript operation sub2ind)
tetfamilymotifs.m - The main function, which produces tetrahedral motifs
    in/not in sequence order, frequent in a family but not in the background,
    using Delaunay or AD($\epsilon$) tetrahedra.
merge_motifs.m - This takes an existing set of motifs and groups together the
    ones that share three residues, to get a regular expression with higher
    support than individual tetrahedral motifs.

PDB Utilities

vertcount.m - Counts for each vertex in a subset of AD tetrahedra
parsePDBname.m - Tokenize a PDB file name (either 1xyz.PDB or 1xyz.PDB.ca/sc
    or 1xyzA) to get the 4-letter PDB ID and either the chain or the suffix.
pdb2cell.m - Parse PDB files into a cell array. Modification of Jack's original
    code
Calphas.m - Extract Calphas from the cell array we made from the PDB file.
    Originally Jack's code, with plenty of fixes to handle real PDB files and
    also (regrettably) to get other information -- residue labels, missing
    sequence numbers and the mapping from sequential numbers starting at 1 to
    residue numbers in the PDB -- that should ideally be put in separate
    functions.
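The three name forms listed for parsePDBname.m above (1xyz.PDB, 1xyz.PDB.ca/sc, 1xyzA) can be tokenized with a regular expression. The following Python sketch is an illustration only: the function name, the returned pair, the empty string for "no chain or suffix" and the case-insensitive matching are assumptions, not the behavior of the MATLAB function.

```python
import re

_PDB_NAME = re.compile(
    r"^(?P<id>[0-9A-Za-z]{4})"          # 4-character PDB ID, e.g. 1xyz
    r"(?:\.pdb(?:\.(?P<suffix>\w+))?"   # 1xyz.PDB, optionally .ca or .sc
    r"|(?P<chain>[0-9A-Za-z]))?$",      # or one chain character, e.g. 1xyzA
    re.IGNORECASE,
)


def parse_pdb_name(name):
    """Split a PDB file name into (pdb_id, chain_or_suffix).

    The second element is "" when the name has neither a chain
    nor a suffix (an assumed convention for this sketch).
    """
    m = _PDB_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognized PDB name: {name!r}")
    return m.group("id"), m.group("chain") or m.group("suffix") or ""


for example in ("1xyz.PDB", "1xyz.PDB.ca", "1xyz.PDB.sc", "1xyzA"):
    print(example, "->", parse_pdb_name(example))
```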
SideCentroids.m - Again Jack's code, this time robustified but not multiplexed.
getPointsFromPDB.m - Encapsulate the entire process of reading a PDB file
    stored locally or in a PDB mirror on a network path/in AFS space, and
    getting Calphas, residue labels, etc. from it.
map_to_PDB_seqnum.m - Convert tetrahedra (or edges) from the sequential
    numbering to the PDB sequence number space using the array mapping, which
    may be passed in as a numerical array of indices or a char array of
    residue labels.
findgaps_remdups.m ] - Chains in PDB structures can have gaps as well as
remgapduptets.m    ]   alternate positions for the same residues. These are
renumber_tets.m    ]   detected by examining consecutive Calpha-Calpha
                       distances and removed, and tets containing or spanning
                       them are renumbered.
loadchains.m - Load multiple chains from a PDB file, and store them in a cell
    array; not possible/efficient using pdb2cell

Other Utilities

normalize.m - Row vectors in a matrix are scaled to unit length
modsqr.m - Squared Euclidean norm of the rows of a matrix
lenfun.m - given indices of some edges and a set of points in row vectors,
    return all edge lengths.
fatten.m - Elegant solution to a problem posed by Leo on the newsgroups; I use
    it a lot. Basically it replicates each row or column of a matrix a given
    number of times in succession, in contrast to repmat, which would tile the
    whole matrix.

Visualization and graph plotting tools

plottet - Plot some AD tetrahedra in a specified color
plottet_alt - Another version of this that draws the tetrahedra as lines with
    any given style...?
plotAD4 - Plot semi-transparent tetrahedra corresponding to AD($\epsilon$) on a
    protein backbone.
plotADsphere - Verify that a minimum annulus that is found really contains a
    simplex, by visually drawing transparent spheres. Useful in debugging AD
    code.
patternhistd - alpha-helix pattern histograms
betahistd  ] - a sequence gap histogram of AD tetrahedra that shows up ranges
tetgaphist ]   in parallel and antiparallel beta-sheets.
makehist - Histogram distribution of a set of AD simplices and thresholds

Helix packing (very old and not very successful work, just here for
completeness)

helixtet.m
helixpacklist.m
helixproblist.m
createADhelixlist.m

Detection of hinges and conformational change

replacePDBcoords.m - replace coordinates of several chains in a PDB file of a
    multimeric protein with a different set of coordinates, e.g. calculated
    from a structural alignment, and stored in a cell array. An optional
    parameter is the list of IDs of atoms to be erased from the PDB file, e.g.
    if they are duplicates or not common across all the chains.
kinmulti*.m ] - Make kinemages given multiple chains and the hinge assignments
commontet.m ]   made using the AD tetrahedra common to these chains that
                differ in threshold value in a few chains.
kinallatom.m ] - Make a kinemage as above but using all atoms instead of
allatoms.m   ]   Calphas.
eraseExtraAtoms.m - Mark atoms or residues that are missing from some of the
    chains being compared, so that they may be erased when AD
    tetrahedra/kinemages