USC Annenberg Online Journalism Review


Taxonomy Software to the Rescue

The task of managing increasing volumes of information is already beyond the scope of individuals. It also consumes inordinate amounts of human and technological resources.

'Search tools are ineffective 'find' tools because their focus is on retrieving a specific reference and not on discovery,' writes Delphi Group analyst Carl Frappaolo in the white paper 'Connecting To Your Knowledge Nuggets.'

Knowledge workers are having a tough time categorizing data and finding what they need. One way to manage problems of information overload is by using taxonomy software.

Simply put, taxonomy can be described as an effort to incorporate a 'categorization' mechanism that allows functional search and retrieval to extend beyond keyword results.

Frappaolo observes, 'While not usually thought of as a 'search' tool, categorization practice and technology is the foundation for an effective information retrieval experience in a data-rich environment, [the] basis for transforming searching into finding because it helps users navigate within collections of documents that are related through hierarchical categories to the specific destination they seek. Once located, users must understand how the information is differentiated and contextualized -- they must find what they need.'
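To make the idea of navigating "hierarchical categories" concrete, here is a minimal sketch of a category tree in Python. The `Category` class, its method names and the sample file names are all hypothetical illustrations, not taken from any vendor's product; the sketch only assumes that documents are filed under a path of nested categories and that a user browses downward from broad to specific.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node in a hypothetical category hierarchy."""
    name: str
    documents: list[str] = field(default_factory=list)
    children: dict[str, "Category"] = field(default_factory=dict)

    def add(self, path: list[str], document: str) -> None:
        """File a document under the category path, creating nodes as needed."""
        node = self
        for part in path:
            node = node.children.setdefault(part, Category(part))
        node.documents.append(document)

    def browse(self, path: list[str]) -> "Category":
        """Walk down the hierarchy one category at a time."""
        node = self
        for part in path:
            node = node.children[part]
        return node

    def all_documents(self) -> list[str]:
        """Collect documents filed here and in every subcategory."""
        found = list(self.documents)
        for child in self.children.values():
            found.extend(child.all_documents())
        return found


# Sample data, invented for illustration only.
root = Category("root")
root.add(["Technology", "Search"], "keyword-engines.txt")
root.add(["Technology", "Taxonomy"], "categorization-tools.txt")
root.add(["Business"], "quarterly-report.txt")

tech_docs = root.browse(["Technology"]).all_documents()
print(tech_docs)
```

Browsing to "Technology" surfaces documents from both of its subcategories, including ones a user might never have found with a keyword query.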

Most stored data is now unstructured and improperly 'meta-tagged.' Besides the usual difficulties associated with automated search and retrieval, the way such information is categorized has until recently been determined mainly by the personal interests and priorities of individuals.

In addition, most of the knowledge contained in a document is tacit, and very few retrieval systems offer 'concept extraction' facilities. Corporate repositories face the further difficulty that data is stored in different parts of a system, sometimes in geographically dispersed locations.

Several companies, including Autonomy and Semio, offer different approaches to categorization software.

Autonomy's engine identifies naturally recurring patterns in text, based on the usage and frequency of words, and the corresponding framework. Using pattern-matching algorithms and probability theory, the engine employs a probability score to extract a document's digital 'signature.' The signature is based on a word-context match.
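The general idea of scoring a document against a word-frequency profile can be sketched in a few lines of Python. This is not Autonomy's algorithm, which the article does not detail: cosine similarity is swapped in here as a simple stand-in for a probability score, and the function names and sample texts are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def signature(text: str) -> Counter:
    """Word-frequency profile standing in for a document 'signature'."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two frequency profiles (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(text: str, examples: dict[str, str]) -> tuple[str, float]:
    """Return the best-matching category label and its score."""
    doc = signature(text)
    scores = {label: similarity(doc, signature(sample))
              for label, sample in examples.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented training snippets, one per category.
examples = {
    "finance": "bank loan interest rate bank credit market",
    "sports": "team game score season player coach game",
}
label, score = categorize("the bank raised its interest rate on every loan", examples)
print(label, round(score, 2))
```

Note that nothing in the returned score tells the user which words drove the match, which is the sense in which such statistical approaches are called a 'black box.'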

Autonomy's software helps automate the processing of unstructured information. Here, automation is key. Because the categorization logic is not visible to the user, this is referred to as a 'black box' approach, and the process requires users to have substantial system training.

Verity, which is known for advanced search technology, recently launched a K2 enterprise solution that combines machine-based reasoning with human input. The company calls this a 'white box' approach.

Verity has pioneered a concept called social networking. The concept matches the profiles and behaviors of similar groups of users using the same knowledge base. Users are able to tap into experts' thought processes and choice patterns because social networking optimizes the search through information sharing.

'Verity's classification approach puts humans in control with the 'ABC' of intelligent classification: automatic rules, business rules and concept extraction,' said Verity's CTO Prabhakar Raghavan.
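As a rough illustration of what human-authored "business rules" might look like, the Python sketch below assigns categories from keyword rules and reports which keywords fired. It is not Verity's API or rule syntax; the `Rule` class, the `classify` function and the sample rules are hypothetical, and real products combine such rules with statistical methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical human-authored rule: any keyword present assigns the category."""
    category: str
    keywords: frozenset[str]


def classify(text: str, rules: list[Rule]) -> list[tuple[str, list[str]]]:
    """Return each matching category with the keywords that triggered it."""
    words = set(text.lower().split())
    results = []
    for rule in rules:
        hits = sorted(rule.keywords & words)
        if hits:
            results.append((rule.category, hits))
    return results


# Invented rules for illustration only.
rules = [
    Rule("Legal", frozenset({"contract", "liability"})),
    Rule("Sales", frozenset({"invoice", "quote", "contract"})),
]
matches = classify("Draft contract and invoice for review", rules)
print(matches)
```

Because every result carries the keywords that triggered it, an editor can see why a document landed in a category and adjust the rule, which is the visibility a 'white box' approach aims for.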

Software developed by Inxight, a technology company that sprang from the Xerox Palo Alto Research Center (PARC), adopts a mixed approach. It can recognize and understand more than 16 languages by breaking down each sentence, analyzing statistics and then putting the parts back together. The platform is based on techniques such as automatic language identification, stemming, part-of-speech tagging and noun-phrase extraction.
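Two of the techniques named here, language identification and stemming, can be approximated with a toy Python sketch. It makes strong simplifying assumptions: the stopword lists are tiny, the suffix list is a crude English-only invention, and part-of-speech tagging and noun-phrase extraction are left out entirely. None of it reflects Inxight's actual implementation.

```python
import re

# Tiny illustrative stopword lists; real systems use far larger resources.
STOPWORDS = {
    "english": {"the", "and", "of", "is", "to", "in"},
    "french": {"le", "la", "et", "de", "est", "les"},
    "german": {"der", "die", "und", "ist", "das", "von"},
}
SUFFIXES = ("ing", "ed", "es", "s")  # crude, English-only, order matters


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def identify_language(tokens: list[str]) -> str:
    """Pick the language whose stopwords occur most often in the tokens."""
    counts = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    return max(counts, key=counts.get)


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = tokenize("The engine is categorizing the documents and indexing pages")
language = identify_language(tokens)
stems = [stem(t) for t in tokens if t not in STOPWORDS[language]]
print(language, stems)
```

Suffix stripping this blunt produces some non-words, which is one reason production systems rely on per-language dictionaries and morphological rules instead.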

Semio offers a different, possibly complementary, approach. Vice President Jim Nasbet says the company's neural network technology uses a predefined directory-style structure to display information based on mathematical and statistical pattern matching (fingerprint analysis). But it does not give the user a clear understanding of how and why a document has been categorized.

Semio's taxonomy technology works by analyzing and extracting concepts from unstructured, text-based data. Its core engine is based on Semiolex, a patented technology developed by Dr. Claude Vogel, Semio's founder.

Semio is perceived to be in direct competition with other vendors offering categorization technology. The software uses a 'knowledge matrix' that defines the location of the documents, the relationships among concepts and, in the background, details such as language, acronyms and replacement files.

The categorization challenge extends further when the content in question is multilingual.

As the software industry rapidly moves towards an age of 'structured data,' technology currently being developed tends to adopt a hybrid approach that combines both machine and human inputs to optimize a taxonomy.

This summer a new player officially entered the market. Quiver offers a taxonomy product that includes an element of workflow to automate the processes of finding, screening and categorizing content. It provides visibility into categorization decisions through an intuitive editorial toolset.

Several emerging companies are working to refine their tools for the future. Mohomine in San Diego has developed an automatic classification product called mohoPlatform, based on a set of knowledge-mining tools that separate data retrieval, indexing and directories.

'A taxonomy should be built around users' requirements,' explains Bob Ainsbury, president and CEO of EoExchange, a company that specializes in configuring corporate taxonomies. 'A well-designed end product will demand the expertise of different professional figures within the organization, such as a domain expert, search and portal expert, classification expert as well as IT and business managers.'

Ainsbury adds, 'Taxonomy developers should have a solid understanding of not only the semantic meaning and context of the content but also the applications and technologies environment used to build and support that taxonomy.'

Taxonomy software is likely to be expensive, since it will initially be targeted at corporate enterprises. But as more portals and corporate intranet products begin incorporating content categorization by default, more entry-level 'personal' categorization tools will become available for users to download.

 


Autonomy

EoExchange

Inxight

Mohomine

Quiver

Semio

Verity