FFSM Software (V. 1)
|
| Home | Publications | MotifSpace | FFSM | Family |
|---|
Introduction: The FFSM (Fast
Frequent Subgraph Mining) algorithm is part of our
on-going effort to develop effective and efficient algorithms for
knowledge discovery in complex data. The current executable FFSM V. 1
solves the frequent subgraph mining
problem,
which is, given a collection of graphs D and a threshold f between 0 (exclusive) and 1
(inclusive),
to enumerate subgraphs that occur in at least a fraction f of graphs in D. In searching for frequent
subgraphs, we make no assumption about the size, topology, or
location of these patterns, and hence our result is an exclusive
enumeration of all frequent
patterns. Further algorithmic details about FFSM can be found in
[1]; the application of FFSM in protein structural comparison, one of
our driving applications, can be found in [2].
Download & Install: Please send email to Dr. Wei Wang and cc to Dr. Jan Prins (prins@cs.unc.edu) and Jun Huan (jhuan@ku.edu). The executable, together with two sample data sets and a README file, is distributed as a zip file (ffsm.tar.gz). After saving the zip file into your local directory, please run commands: gunzip ffsm.tar.gz; tar -xvf ffsm.tar. Command Line Parameters: $FFSM <node file> <edge file> <frequency> Example: FFSM d1n d1e 27 or FFSM d2n d2e 36 Input Format: There are two input files: node file and edge file with the following formats: node file: node <g_index> <n_index> <n_label> <g_index> : an integer from 0 to n-1 where n is the size of the graph database. <n_index> : the index of a node in a graph <n_label> : a positive integer label of the related node edge file: edge <g_index> <n1_index> <n2_index> <e_label> <g_index> : an integer from 0 to n-1 where n is the size of the graph database. <n1/2_indices>: the indices of the nodes which the edge connects <e_label> : a positive integer label of the edge <frequency> is a positive integer. Sample Output: =========Welcome using FFSM: Fast Frequent Subgraph Mining========= Developed by the University of North Carolina at Chapel Hill graph database size 1083 threshold 27 ...... Total frequent patterns: 348514 Search Time 94.5196 seconds Sample Datasets: D1: Confirmed moderate active chemicals from NIH anti-HIV virus drug screen. D2: SCOP serine proteases. The list of the protein identifiers, which can be used to retrieve the related proteins from the Protein DataBank (PDB, http://www.pdb.org), is included. Platform: The FFSM executable was built in Red Hat Linux. Limitations & Disclaimer: FFSM supports graph databases that
contain maximal 32k graphs;
each graph may have up to 32k nodes. There may be no more than a total of 80 distinct labels for nodes and edges combined together. Single-node patterns are not
included in
results. The current software is confined to performance
comparison
only; full
functional one is available upon request. The
executable is
provided "as it is" and we assume no responsibility for any damage that
it
may cause to your system/files.
References: Algorithm/Dataset D1:
|
|
|
|
|