< Semantic Routing > 1. Introduction Search engines such as "Lycos", "Infoseek", or "Yahoo" help users locate useful information amoung the mass of information on the WWW. However, these centralized search engines are not able to scale up to accomodate the phenomial growth rates of either the volume of data or the number of users on the WWW today. In fact, most of these centralized search engines are already satuarated. One approach to this problem of locating information without the bottleneck is to build a distributed search system whereby the index of the data is distributed over multiple search engines. This allows for a scalable search system in which the data and the user interface are distributed over the multiple search engines. Flexible local search engines, such as "Harvest", can be used to build the elements of such a distributed search system. However, there are no good protocol (or system) that can tie in the multiple local search engines. For a distributed search system to be most useful, the multiple search engines must be able to cooperate with each other as well as efficiently find the relavent local search engines to a query. As one solution to this problem, we are proposing a semantic routing scheme using topical orgainization of the global information space. In this document we propose an architecture and protocol similar to the DNS system to do such semantic routing. In the rest of this paper, I will refer to this semantic routing system as the Topical Semantic Routing System (TSR). 1.2 Global Information Tree (GIT) This is a tree structure that represents the global inforamtion space on the WWW. In another words the global information space will be organized into a a directory of multilevels. We have chosen to adopt the Infoseek Directory struture which has 4 levels. The root of the tree is "General" and it is a very flat and wide tree structure. It is assumed that all servers participating in the TSR will know the structure of this GIT. The GIT will basically be a static structure with very gradual change. We do want the GIT to reflect the WWW information space so as the information available on the WWW changes, or increases we will want to reflect this. However, we do not expect this to be frequent, and any such changes can be done manually. Once a new GIT is formed, it will be broadcasted to all servers on the TSR system accomanied by acknowledgement. The following is an example subtree of the GIT: (Level 1)General / : Topics (Level 2)General / Computing : Topics ====================== ====================== Arts Access Providers Business Authoring Computing The Biz Education Computer Magazines Entertainment (see /Entertainment/Magazines/Computer Magazines) Health & Medicine Computer Science Hobbies Events Life & Style FAQs Money & Investing Graphics News & Reference Internet Personal Home Pages Multimedia Politics & Law Networks Regional PCs & OSs Science Security & Encryption Shopping Software Sports World Wide Web (Level 3)General / Computing / Computer Science : Topic ===================== Artificial Intelligence Complex Systems Courses & Information Resources Graphics & Vision Hardware Human-Computer Interaction Indexes Languages OOP Parallel Processing Robotics Software Software Engineering Supercomputers (Level 3)General / Computing / Graphics : Topics ====================================== Computer Art & Artistry Electronic Artists Information Resources Service Providers (Level 3)General / Computing / Networks : Topics ======================================= Messaging Networking Research Groups Suppliers Telecomm Tools Wireless 1.2 Domain Content Descriptor (DCD) The Domain Content Descriptor is description of the local information available for particular Search Server(SS). Each SS will have domain of information that it is responsible for indexing and keeping track of. Basical for answering queries related to the documents available in its own domain. This domain can be describe as some subtree of the GIT explained above. However just knowing what topics your local information space covers is not sufficient for good semantic routing. Thus for each topic in the subtree we keep a weight that reflects the amount of information on that topic in its local domain. The weight will be a number from 0 to 100 represented as percentile. This will have to be some kind of a relative weight based on the TS tree explained later. There are two issues that need to be worked out in terms the weight. We need find an algorithm that will give you the absolute wieght of the topic that reflect how much information you have on the topic. Then we need to find an algorithm to convert this absolute weight into a relative weigth compared with other sites that have information on the same topic. The exact details of how to construct a good DCD need to be thought out more. For now we do a very simple algorithm where we just use the number of documents on a topic as its wieght(Not taking into account the relativity). So to construct a DCD, A. A server builds an index of its local documents using Harvest (of the combination of the Gatherer and Glimpse) B. Does a search on each word in the GIT, using this index. C. If there are any documents on the topic, insert the tuple (Topic, # of documents) into the DCD. Note that step B and C are repeated for all the topics in the GIT. The resulting tree of (topic, # of documents) is the DCD for the SS. And it is obvious that it will be a subtree of the GIT. Changes in the DCD will be done by each SS constrtucting the DCD periodically. We expect to construct the DCD during the off peak hours every night. When there are any changes, the SS will broadcast the change to the Topical Server Tree (TST) explained below. Note that a new SS will have to build a local index, build a DCD, then register with the TST if you want to share your local information. If you only want to be able to do queries, and have no local information to share you can to so by registering as a null DCD. 1.3 Query Content Descriptor (QCD) The Query Content Descriptor basically is a description of the user query. Users can query on any word, or combination of words. To be able to locate the proper Search Server that holds information the user is looking for, the TSR system needs to be able to catagorize/generalize the user query into the GIT. This is an area of information science that needs more thought. We have decided to go with an interactive algorithm where the user is asked to help the system in doing this mapping. For now we have decided to go with a simple scheme, where we dynamically build a thesurus. When a user does a query, we will look in the thesurus to see if the word was ever queried before. If so there should be at least one entry in the thesurus, if not there wont be any. In either case, we will return to the user and either confirm our mapping(if only one entry was found in the thesurus), ask the user to choose one of the many mapping(if more than one entry was found), or ask the user to describe a QCD for their search(if no entries were found or if the user was not satisfied with any that were found). In the last case, the user will be given the GIT to work from. Any new query to QCD mapping will be entered into the thesurus. We might want to limit the total number of entries in the thesurus later on, but how to trim a thesurus after it grows too big is another issue that needs more thought. Note that the QCD does not have to be a leaf node in the GIT, it can be any node. Some examples: Query: 3D Rendering -> General/Computing/Computer Science/Graphics & Vision Query: Operating Systems -> General/Computing/Computer Science 2. Overview Before I talk about the architecture of the TSR, I would like to look into more detail on how a user query is "resolved". There are basically three smaller steps that must take place to "resolve" a user query to a document on WWW. A. The user query has to be mapped into a Query Content Descriptor(QCD). The QCD is basically a subtree of the Global Information Tree(GIT) explained above. For simplicity I will say that the user queries get mapped into a subset of the "topics" from a globally predetermined set of topics. B. The "topics" must be resolved into Search Servers that have actual documents and indexes for such documents on a certain topic. C. These Search Servers will have to "resolve" the actual user query to relevant documents. The C functionality is similar to the final resolution on the DNS, while the B functionality is similar to the referrals on the DNS which help to locate the ultimate authority Name Server. The A&C functionality is performed by the Search Server, and the B functionality is performed by the Topic Servers in the TSR. The Topic Servers(TS) are structured as a wide and flat tree like in a DNS model, but TS can only give referrals. This tree is call the Topic Server Tree (TST) and is based on the GIT. All the final resolution (C functionality) is done by a SS. So in a sense, the Topic Server Tree is the same tree as the GIT, with the leaf nodes of the TST pointing to Search Servers. 2.1 The Architecture Now I will explain the architecture of the system by comparing it to the DNS system. Fig 1 is the diagram of a tyipcal configuration of the DNS system(from RFS1035), and Fig 2 is a typical configuration of the TSR system. Local Host | Foreign | +---------+ +----------+ | +--------+ | | user queries | |queries | | | | User |-------------->| |---------|->|Foreign | | Program | | Resolver | | | Name | | |<--------------| |<--------|--| Server | | | user responses| |responses| | | +---------+ +----------+ | +--------+ | A | cache additions | | references | V | | +----------+ | | cache | | +----------+ | Fig 1 DNS configuration Local Host | Foreign | +---------+ +----------+queries | +--------+ | | user queries | HTTPd |on a Topic | | | | HTTP |-------------->| Server |------------|->|Foreign | | Client | | Program |<-----------|--| Topic | | Program |<--------------| running |references | | Server | | | user responses| a Search |to the | | (TS) | +---------+ |Server(SS)|Foreign | +--------+ +----------+HTTPd Server| | A A | | | | | | queries | | | | +-------------|--> +----------+ cache additions | | +---------------|--- | Foreign | | | responses | | HTTPd | | | | | Server | | | references | | with a SS| V | | | Holding | +----------+ | | the info | | cache | | +----------+ | & | | | Routing | | | Table | +----------+ Fig 2 TSRS configuration After mapping the user query to a topic(QCD), the Search Servers(SS) will resolve the topic to a relavent SS using the Topic Servers(similar to the Name Server), and send the user query to the new resolved SS returned by the Topic Servers. This new SS will process the user's original request to any relative documents and return it to the original SS. In the DNS model, the final resolution is also done by a foriegn Name server, and there is no distiction between intermediate nodes and the leaf nodes. However in the TSR there is a clear distinction. Hence I included the final SS in the diagram above. Below is the basic flow of TSR compared with DNS. -> user sends query - user sends a query -> the local resolver receives a referral to an authority on the domain of the query using the Name Servers - the local SS maps the query to a topic, and receives a referral to an authoritative SS on that topic using the Topic Servers. Note that unlike the DNS, there can be multiple SS's that are authorities on the topic which are not mere replicates of one another. Thus, the Topic Servers should return all athorities on the queried topic. To rank the usefulness of each SS on a topic, each SS is returned with a weight as explained above. -> forwards the query to the athoritive Name Server - Original SS forwards the user query to the anthoritive Search Servers -> The athoritive Name Server returns the resolution - the response to the query is returned to the original SS. This resonpse will be the pointers to the acutal documents that contain the information the user is looking for. 2.3 The Search Server(SS) The Search Server(SS) has a couple of important functionalities in the TSR. It serves the function of the local resolver in the DNS model in that it is the direct interface of our architecture to the users and it is one of the athorities on some topics. This means that it will resolves any queries for a topic it specailizes in, to its local documents. We expect that one Search Server will be running per every HTTPd server to handle query requests from its local users, and maintain local document indexes to resolve queries to its documents. The other main functionality of the SS is to map user queries to topics, called QCD. This is explained in later sections. 2.4 The Topic Server Tree (TST) The Topic Server Tree(TST) is multiple Topic Servers structured as a flat and wide tree based on the GIT as in the DNS model. The collections of TS is essentially a distributed directory of all the SS's in the Global Information Space, organized by topics. We expect to eventually have a replicated dedicated server per topic. However, there does not need to be any restrictions on this. So as new topic areas comeup, they can be added to existing topic servers, and when that particular TS is over burdened we can designate a new TS for the new topic. This is just like the DNS model. 2.4.1 Structure The GIT was explained briefly above. The TST has the following basic structure. The root of the tree, "General" has pointers to the major Topic servers in level 2 (as well as their replicated versions). The level 2 and 3 Topic Servers will have pointers to the next nevel Topic servers (and its replicates) as well as Search Servers that have a rich informtaion space on a partuicluar topic. Each This is to ensure, that queries do not have map to the leaf nodes of the GIT, but rather anywhere on the GIT. For example, if you were looking for information on a new field in computer science: Level 3(there will be no topic for this field under computer science yet since it is a new field) or a specialized topic in computer science that overlap in many of the subtopics of computer science you can map your query to "COMPUTER SICENCE". Than the TS will return reference to SS which are major computer science sites. Thus you can try to do your search on these computer science sites, without being forced to pick a subtopic in computer science. Note that with each Search Server there will be a wieght which reflects how significant the is. The algorithm for determining the major computer science sites and there weight will be discussed in a later section. The leaf levels of the TST will have pointers to all SS which have information on the topic. As in the upper levels entries all SS will have a weight associated with it to reflect the amount of information held at the SS about that particular topic. 2.4.2 Construction The TST will be built using the DCD's of each Search Server in the TSR system. As the Search Servers register with the TST, it is a trival problem to enter the SS into the proper leaf nodes in the TSR. Again note that unlike the DNS model there will be multiple entries of a single SS in the TST since all leaf nodes of the GIT which is in the DCD of the SS will be an entry. Furthermore, note that it is the tuple of (SS, wieght) that get entered in the leaf Topic Server. The question of finding the major sites of intermediate nodes(level 2 and 3) will need more experimentation. Weights of intermediate nodes will be determined by some function of the weight of the intermediate node(reflect in the DCD from the SS), the number of next level topics and their weights in the DCD. For example if a SS had a DCD that has the following strucure: /General, Wg/ /Computing, Wc/ /Computer Science, Wcs/ Courses & Information Resources, Wcir Graphics & Vision, Wgv Hardware, Wh OOP, Wo /Education, We/ (Level 3)General / Computing / Computer Science : Topic ===================== Artificial Intelligence Complex Systems Courses & Information Resources Graphics & Vision Hardware Human-Computer Interaction Indexes Languages OOP Parallel Processing Robotics Software Software Engineering Supercomputers Here /XXX/ reflects the intermediate nodes of the GIT and the rest are the leaf nodes in the GIT. The actual weights used for "Computer Science" will be f(Wcs, Wcir, Wgv, Wh, Wo, 3/14) and not just Wcs. For now we believe that unlike the leaf nodes, the intermediate nodes will have a limit on the number of major sites for that topic. It can either be limited by some constant value where we take the top 10, or by a minimum weight requirement. We will do a combination scheme where the main principle is to have intermediate nodes keep references to SS with weights greater than WMIN(Topic). This will be augmented by saying that if there are no SS in an intermediate node that satisfy this requirement, we will take the top 5 SS with the highest W(Topic), or if there are more than 20 SS that satisfy this condition, we will only keep the top 20 SS. So basically, after the Search Server A in the above example registers with the TST, a part of the TST will look as follows assuming that it is only a major site for "Computer Science": TSg(General) -> TSc TSc(Computing) -> TScs TScs(Computer Science) -> TScir -> TSgv -> TSh -> TSo => (SS(A), RWcs) TScir(Courses & Information Resources) => (SS(A), Wcir) TSgv(Graphics & Vision) => (SS(A), Wgv) TSh(Hardware) => (SS(A),Wh) TSo(OOP) => (SS(A), Wo) TSe(Education) -> TS I have indented the Topic Servers to reflect the tree structure. The -> are pointers to other TS in the TST, while the => are pointer to acutal Search Servers. I have used RWcs = f(Wcs, Wcir, Wgv, Wh, Wo, 3/14). 2.4.3 Registeration The actual registeration of a DCD with the TST needs more thought. There are two ways in which this can occur: 1) register with the root node and the DCD is trickled down the TST, 2) register with the leaf nodes, and the DCD is propagated up the tree as needed. It would seem that doing the registeration through the leaf will generate multiple copies in the TST since a SS will have more then one leaf node in the GIT. However, going through the root could become a centralization point and a bottleneck. A good compromise might be to let the initial registeration go through the root, but let any subseqent changes in the DCD go through the leaf. This way since there will be only one registeration for a new SS, after the intial setup of the system there will be no problem of a bottleneck. Furthermore, since changes in DCD only need to propagate the changes to the changed nodes of the DCD and not all of the DCD, there wont be the problem of multiple copies flooding the TST. Also now the SS only needs to know about the root of the TST (as in the DNS model) to bootstarp. 2.4.4 Query Propogation All SS will know about the root of the TST initially. After the bootstraping and the registeration with the TST, the SS will also know certain other Topic Servers in the TST. This information will be kept in the Semantic Routing Table explained later. Using this information, the SS will send a QCD of a query to the appropriate TS. Here appropriate TS will depend on what type of options the user set on the search, but basically it will be the TS you know about cloest to the leaf node you want references to. Here are two examples: Query: 3D Rendering -> QCD: General/Computing/Computer Science/Graphics & Vision you want TSgv, if not known TScs, if not known TSc, if not known TSg Query: Operating Systems -> QCD:General/Computing/Computer Science, you want TScs, if not known TSc, if not known TSg At the TS side, when a QCD is received from SS, it will send the request to a lower level TS if it is one of the itermediate nodes in the QCD, otherwise if it is the leaf node of the QCD it will send back tht tuple of references to the SS and tis weights from its table. For the leaf TSs who have a huge table of SS, it might want to send only the first 30. Note that the SS table will be sorted by the weights. More work needed in the following sections. 2.5 Cache The cache will work very similar to the cache mechanism in the DNS model. We will cache both the resolution of topics to SS, and the resolution of queries to actual documents. In addition to the these cacheing there will be a log of the cache hits and misses for the resolution of topic to SS, so that we can track the general characteristics of the user group.(To see which topics are of most interest to the user group.) We want to use this information to dyanmically build a part of the Semantic Routing Table. 2.6 Semantic Routing Table The Semantic Routing Table is an extension to the cache which will improve the performance of the TSR system. Unlike the DNS model where most of the queries can be resolved locally or by the cache, the TSR system will only be able to do this for resolving topics to SS. It will be unlikely that two different users actually do the exact same query, but it is very likely that the user group will tend to do queries on certain topics. So , even though you get a hit in the cache on the topic you will still have to forward the query to the proper SS to resolve the query to the actual documents. In another words, if a person did a query on "3D rendering" which was resolved to the QCD "General/Computing/Computer Science/Graphics & Vision" there should be an entry in the cache for references to SS specializing on "Graphics & Vision". Now if the next user did a query on "Resterization" which also resolves to the QCD "General/Computing/Computer Science/Graphics & Vision", this user will get a cache hit and have references to the SS specializing on "Graphics & Vision". However, this only resolves the first two steps mentioned in the overview section. You still have the last resolving step which is very unlikely to have a cache hit. This is the main difference between the DNS model and the TSR system. This is caused by the fact that there are two levels of resolving that has to happen in the TSR system through the Topic Servers(not including the mapping to the QCD), while there are only 1 level of resolving to do in the DNS system. Basically, the first level of resolving(Topic to SS) is what we are trying to model from the DNS system. To facilitate this resolution we are maintaining a Semantic Routing Table. The Semantic Routing Table will be able to route your query based on the topic area to a SS. If the topic area your query falls under is not in the Semantic Routing Table the server will first check the cache. If the topic is not in the cache either the server will contact a TS to get the SS. Basically the TS Tree is the distributed version of the master copy of the directory of Topics and Search Servers, and the Semantic Routing Table is the partial directory kept locally. The Semantic Routing Table will have two parts. First the static part will have pointers to the TS and SS in the topics of your DCD. This was based on the assumption that the user group will tend to be interested in topics that are residing in your SS. Second there will be a dynamic part of the table which will be more tailored to the actual usage by monitoring the cache. We will keep the dynamic part of the table set to a certain size, and delete obsolete entries as new ones are entered. Resource Record General, General/Computing etc SS = search server TS = topic server SOA = start of topic authority IN = internet