Knowledge Management
Information Organization and Retrieval
 

We are currently working on three aspects of information organization and retrieval.
 

  • Automatic Taxonomy Generation
  • Taxonomy Maintenance
  • Query Refinement and Disambiguation

  • Automatic Taxonomy Generation: People cannot easily comprehend large amounts of information. Concept hierarchies group related concepts into a tree-like structure that provides a useful level of abstraction. For example, individual car models can be placed under 'cars', while cars and trucks can both be placed under 'vehicles'. At IRL we are developing algorithms for automatic taxonomy generation (ATG). Our ATG tools scale to a few million documents with a few hundred thousand concepts. They can export the hierarchical taxonomy as XML for use by other tools. ATG algorithms have many applications, such as easy-to-browse summaries of search results and the creation of automated help desks. We are working on algorithms that create intuitively appealing concept hierarchies from corpora and generate labels that aptly describe the concepts.
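The XML export mentioned above can be pictured with a minimal sketch. The element and attribute names (`taxonomy`, `concept`, `label`) and the nested-dictionary representation are assumptions made for illustration; the actual schema used by the ATG tools is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory taxonomy: each concept maps to its child concepts.
taxonomy = {
    "vehicles": {
        "cars": {"sedan": {}, "hatchback": {}},
        "trucks": {},
    }
}


def to_xml(label, children):
    """Recursively convert one concept and its descendants to an XML element."""
    node = ET.Element("concept", {"label": label})
    for child_label, grandchildren in children.items():
        node.append(to_xml(child_label, grandchildren))
    return node


def export_taxonomy(tree):
    """Wrap all top-level concepts in a single <taxonomy> root and serialize."""
    root = ET.Element("taxonomy")
    for label, children in tree.items():
        root.append(to_xml(label, children))
    return ET.tostring(root, encoding="unicode")


xml_text = export_taxonomy(taxonomy)
print(xml_text)
```

Any tool that can parse XML can then walk the nested `concept` elements to recover the hierarchy.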
     

    Taxonomy Maintenance: Traditionally, taxonomies have been built and maintained manually. However, as they grow in size and complexity, maintaining and updating them by hand becomes extremely difficult. Manual maintenance is also cumbersome because taxonomies evolve over time, and nodes may be merged or split. 

    For example, we might initially place "humans" and "apes" under "mammals". Later, as many more mammals are added, we may decide to add another level within the "mammals" group and place "humans" and "apes" under "primates", which in turn falls under "mammals". 

    At IRL, we are developing algorithms that automatically populate and maintain taxonomies. Document classification is a simple application of automatic taxonomy population. Our approach is to define the notion of a state for the taxonomy and to minimize the entropy associated with that state while adding new items and concepts. We attach a confidence measure to each insertion (classification) so that a human can intervene whenever the classification is ambiguous.
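As a rough illustration of entropy-guided insertion, the sketch below makes several assumptions that are not taken from the text: the taxonomy "state" is a flat set of nodes, each holding term counts; the state entropy is the size-weighted sum of each node's Shannon entropy; and confidence is the relative gap between the best and second-best node. The function names and the review threshold are hypothetical, and at least two nodes are required.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a term-count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def state_entropy(nodes):
    """Assumed state measure: size-weighted sum of per-node term entropies."""
    return sum(sum(counts.values()) * entropy(counts)
               for counts in nodes.values())


def insert_with_confidence(nodes, doc_terms):
    """Choose the node that absorbs the document with the smallest increase
    in state entropy, and report a margin-based confidence in [0, 1]."""
    doc = Counter(doc_terms)
    base = state_entropy(nodes)
    costs = {}
    for name, counts in nodes.items():
        trial = dict(nodes)
        trial[name] = counts + doc  # tentatively add the document here
        costs[name] = state_entropy(trial) - base
    ranked = sorted(costs, key=costs.get)
    best, runner_up = ranked[0], ranked[1]
    gap = costs[runner_up] - costs[best]
    # A confidence of 0 means the top two candidates are tied.
    confidence = gap / costs[runner_up] if costs[runner_up] > 0 else 0.0
    return best, confidence


nodes = {
    "primates": Counter({"ape": 5, "human": 4, "monkey": 3}),
    "rodents": Counter({"mouse": 6, "rat": 5, "squirrel": 2}),
}
REVIEW_THRESHOLD = 0.2  # hypothetical cut-off for asking a human

node, conf = insert_with_confidence(nodes, ["ape", "monkey", "ape"])
print(node, round(conf, 2), "needs review:", conf < REVIEW_THRESHOLD)

# A document whose terms match neither node is a near tie.
node2, conf2 = insert_with_confidence(nodes, ["whale"])
print(node2, round(conf2, 2), "needs review:", conf2 < REVIEW_THRESHOLD)
```

In this toy setup a document about apes and monkeys lands in "primates" with a wide margin, while a document with unseen vocabulary yields a low confidence and would be routed to a person.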
     

    Query Refinement and Disambiguation: To answer a query, a system must navigate large hierarchies or taxonomies such as Internet directories, library catalogues, and product catalogues. Query refinement and disambiguation tools determine which part of the hierarchy is relevant to the user's query by seeking relevance feedback. Our approach to query disambiguation is to generate a compact representation of all contexts of the query, drawn from all documents that are possibly relevant to it. The user can then choose a particular context, thereby clarifying the query, and the system continues the search within that context.
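The disambiguation loop described above can be sketched as follows. This is a simplification under stated assumptions: contexts are approximated by the words that most often co-occur with the query within a fixed token window, and the toy corpus, stopword list, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus; a real system would draw on an indexed collection.
documents = [
    "the jaguar is a big cat found in the amazon rainforest",
    "jaguar unveiled a luxury car with a powerful engine",
    "a jaguar cat hunts at night in the rainforest",
    "the car dealer sold a used jaguar with a rebuilt engine",
]
STOPWORDS = {"the", "a", "is", "in", "at", "with", "of", "and"}


def query_contexts(query, docs, window=5, top_k=2):
    """Count words that co-occur with the query within `window` tokens and
    return the most frequent ones as candidate contexts."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok != query:
                continue
            lo, hi = max(0, i - window), i + window + 1
            neighbours = set(tokens[lo:i] + tokens[i + 1:hi]) - STOPWORDS
            counts.update(neighbours)
    return [word for word, _ in counts.most_common(top_k)]


def refine(query, context, docs):
    """Restrict the search to documents mentioning the chosen context."""
    return [d for d in docs
            if query in d.split() and context in d.split()]


contexts = query_contexts("jaguar", documents)
print("Did you mean:", contexts)

# Simulate the user choosing the "car" context.
for doc in refine("jaguar", "car", documents):
    print(doc)
```

The list of candidate contexts plays the role of the compact representation shown to the user, and `refine` stands in for continuing the search within the chosen context.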
     

    Web Mining