Classification

LIMBER project

On ia-cms, Brendan pointed out the LIMBER project. LIMBER stands for Language Independent Metadata Browsing of European Resources. The project, concerned with the exchange of multilingual metadata, particularly in the Social Sciences, has proposed an RDF schema for thesauri.

A Thesaurus Interchange Format in RDF (delivered at the Semantic Web conference 2002)
http://www.limber.rl.ac.uk/External/SW_conf_thes_paper.htm

RDF Schema for ISO compliant multi-lingual thesauri
http://www.limber.rl.ac.uk/External/thesaurus-iso.rdf
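
To give a feel for what an RDF representation of thesaurus terms looks like, here is a minimal sketch in Python using the rdflib library (assumed installed). The namespace and the property names (prefLabel, broaderTerm, relatedTerm) are illustrative placeholders, not the actual vocabulary defined in the LIMBER schema linked above.

    # Minimal sketch: one multilingual thesaurus term described in RDF.
    # The "thes" vocabulary below is hypothetical, not the LIMBER schema.
    from rdflib import Graph, Literal, Namespace

    THES = Namespace("http://example.org/thesaurus#")   # placeholder vocabulary
    EX = Namespace("http://example.org/terms/")

    g = Graph()
    g.bind("thes", THES)

    term = EX["unemployment"]
    g.add((term, THES.prefLabel, Literal("unemployment", lang="en")))
    g.add((term, THES.prefLabel, Literal("chômage", lang="fr")))
    g.add((term, THES.broaderTerm, EX["labour-market"]))
    g.add((term, THES.relatedTerm, EX["social-security"]))

    # Recent versions of rdflib return a Turtle string here.
    print(g.serialize(format="turtle"))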

The Importance of Being Granular

Roy Tennant has a pretty good article in Library Journal on how granularity affects retrieval and impacts person-hours in Digital Library collections. Don't get turned off by the library lingo. The message is applicable to non-library collections.

The difficulty of categorization

Peter V pointed to Philip C. Murray's KM Connection article, The difficulty of categorization, which discusses the implications of using categorization in the enterprise. In it, he cites Bella Hass Weinberg's 1996 article from the ASIS Conference Proceedings, Complexity In Indexing Systems -- Abandonment And Failure: Implications For Organizing The Internet, to raise the issue of how difficult it is to classify documents from a large corpus of data. Weinberg's article discusses the issues in classifying the Internet. Murray's position is that a corporate body's "three ring binder of knowledge" is not a massive data source, so it is not necessarily subject to all of the difficulties that Weinberg mentions. He states,

    I also wonder whether classification experts simply cultivate the perception that classification is extremely difficult. Even manual classification can be done quickly, if my experience with professional indexers is any indicator. It's not unusual for a professional indexer to generate a comprehensive, high-quality back-of-the-book index for a new title in less than three weeks.
He goes on to discuss the advantages of faceted knowledge access at a high level. What I find problematic with arguments that essentially state that classification is not so hard is that there are so many variables at play when we're talking about classification of any kind. These variables include the definition of the domain, the size and scope of the indexable corpus, and the specificity of indexing, to name just a few. Providing facets of classification is another level of complexity that calls for some definition of guidelines as well.

But I wonder, are most organizations just concerned with indexing a "three ring binder of knowledge," or are they also concerned with indexing all of the published material -- technical documents, memos, press releases, etc. -- of the organization? Are they concerned with indexing at the level of the document, or at a more granular level, indexing concepts within the document? There are a lot of high-level articles floating around lately that pay lip service to the value of classification. What I'm interested in are articles that actually discuss the pain of implementing classification processes within large corporations. If you have citations for any good examples/case studies, please share them!

As part of an information services organization in a large corporation, I've seen the great distances my colleagues have had to go to make an enterprise-level taxonomy work for our customers, who have been the catalysts and partners in its development and use. Over the 4 years that I've used our taxonomy on the back end as an indexer and as a site developer -- but not as a subject matter expert creating/defining the terms and relationships of the taxonomy -- I have to say that there is not much about classification at the enterprise level that seems very simple to me. It is very clear that representing knowledge (automatically or manually) is never simple to do, and even when done right it will never be right all of the time, and it will never serve everyone. Concepts change, indexers represent knowledge differently, and environmental factors affect priorities and sometimes shift the language and understanding of your subject matter. It's all very slippery. That being said, however, without classification it is clear that knowledge retrieval is hampered and the bottom line is affected. And I guess that is what creates the need for information professionals and information retrieval systems.

The Semantic Web: Taxonomies vs. ontologies

"The Semantic Web: Differentiating Between Taxonomies and Ontologies." Online. 26 n4 (July/August 2002): 20.

    Computer scientists--along with librarians--are working to solve problems of information retrieval and the exchange of knowledge between user groups. Ontologies or taxonomies are important to a number of computer scientists by facilitating the sharing and reuse of digital information.
Katherine Adams' article in ONLINE (ironically, not available online) talks about the Semantic Web and the subtle difference in the approaches that computer science and library information science have taken toward making information findable using structured hierarchical vocabularies -- ontologies for CS and taxonomies for LIS.

The article generalizes one difference between CS and LIS by saying that "software developers focus on the role ontologies play in the reuse and exchange of data while librarians construct taxonomies to help people locate and interpret information". Both hopefully remain focussed on the end result of making data findable and usable.

    Some of the traditional skills of librarianship--thesaurus construction, metadata design, and information organization--dovetail with this next stage of Web development. Librarians have the skills that computer scientists, entrepreneurs, and others are looking for when trying to envision the Semantic Web. However, fruitful exchange between these various communities depends on communication.
    Commonalities exist--as do differences--between librarians who create taxonomies and computer scientists who build ontologies. Mapping concepts, skills, and jargon between computer scientists and librarians encourages collaboration.
I'm quoting a few large blocks from the article because they're probably important for us to read (fair use!). One of the sections discusses differing views on inheritance and the last discusses topic maps.

    DIFFERENT POINTS OF EMPHASIS: INHERITANCE

    In general, those in computer science (CS) are concerned with how software and associated machines interact with ontologies. Librarians are concerned with how patrons retrieve information with the aid of taxonomies. Software developers and artificial intelligence scholars see hierarchies as logical structures that help machines make decisions, but for library science workers these information structures are about mapping out a topic for the benefit of patrons. For librarians, taxonomies are a way to facilitate certain types of information-seeking behavior. It would be a mistake to overemphasize this point since one can point to usability experts in the CS camp who advocate user-centered Web design or librarians who are fascinated with cataloging theory to the exclusion of flesh-and-blood patrons. Yet, as an overarching generalization, software developers focus on the role ontologies play in the reuse and exchange of data while librarians construct taxonomies to help people locate and interpret information.

    This difference is illustrated by the concept of inheritance. Computer scientists build hierarchies with an eye toward inheritance, one of the most powerful concepts in software development. Machines can correctly understand a number of relationships among entities by assigning properties to top classes and then assuming subclasses inherit these properties. For example, if Ricky Martin is a type of "Pop Star" in a hierarchy marked "Singers," then a software program can make assumptions about Mr. Martin even if the details of his biography are not explicitly known. An ontology may express the rule, "If an entertainer has an agent or a business manager and released an album last year, then assume he or she has a fan club." A program could then readily deduce, for example, that Ricky Martin has a fan club and process information accordingly. Inference rules give ontologies a lot of power. Software doesn't truly understand the meaning of any of this information, but inference rules allow computers to effectively use language in ways that are significant to the human users.

    By contrast, librarians think of inheritance in terms of hierarchical relationships and information retrieval for patrons. Taking the example above, the importance of the taxonomy rests in its ability to educate patrons. Someone who's been tuned out of popular culture might use the Pop Star hierarchy to learn the identities of singers who are currently in vogue. A searcher could also uncover the various types of Pop Stars that exist in mass culture: Singers, Movie Stars, Television Stars, Weight-Loss Gurus, Talk Show Hosts, etc. Finally, a patron could hop from one synonym to another--from "Singer" to "Warbler" to "Vocalist"--and discover associative relationships that exist within this category.

    TOPIC MAPS AS NEW WEB INFRASTRUCTURE

    Topic maps are closely related to the Semantic Web and point the way to the next stage of the Web's development. Topic maps hold out the promise of extending nimble-fingered distinctions to large collections of data. Topic maps are navigational aids that stand apart from the documents themselves. While topic maps do not include intelligent agents, other aspects of this technology--metadata, vocabularies, and hierarchies--fit well within the Semantic Web framework. According to Steve Pepper, senior information architect for Infostream in Oslo, Norway, in "The TAO of Topic Maps: Find the Way in the Age of Infoglut", his presentation at IDEAlliance's XML Europe 2000 conference, topic maps are important because they represent a new international standard (ISO 13250). Topic maps function as a super-sophisticated system of taxonomies, defining a group of subjects and then providing hypertext links to texts about these topics. Topic maps lay out a structured vocabulary and then point to documents about those topics. Even OCLC is looking to topic maps to help its project of organizing the Web by subject.

    An important advantage of topic maps is that Web documents do not have to be amended with metadata. While HTML metatags are embedded in the documents described, topic maps are information structures that stand apart from information resources. Topic maps can, therefore, be reused and shared between various organizations or user groups and hold great promise for digital libraries and enhanced knowledge navigation among diverse electronic information sources.
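
The inference-rule example quoted above (an entertainer who has an agent or a business manager and released an album last year is assumed to have a fan club) can be sketched in a few lines of Python. This is only a toy illustration of how a rule lets software derive a fact that was never explicitly asserted; it is not tied to any particular ontology language or tool.

    # Toy illustration of the quoted inference rule. Facts about the
    # entertainer are asserted explicitly; "has a fan club" is derived.
    from dataclasses import dataclass

    @dataclass
    class Entertainer:
        name: str
        has_agent: bool = False
        has_business_manager: bool = False
        released_album_last_year: bool = False

    def has_fan_club(e: Entertainer) -> bool:
        """Rule: an agent or business manager, plus an album released
        last year, implies a fan club."""
        return (e.has_agent or e.has_business_manager) and e.released_album_last_year

    ricky = Entertainer("Ricky Martin", has_agent=True, released_album_last_year=True)
    print(has_fan_club(ricky))   # True, though a fan club was never asserted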

Other articles mentioned:
Tim Berners-Lee, "The Semantic Web," Scientific American, May 2001.

Natalya Fridman Noy and Deborah L. McGuinness, "Ontology Development 101: A Guide to Creating Your First Ontology," Knowledge Systems Laboratory, Stanford University, March 2001.

Tom Gruber, "What is an Ontology," [September 2001].

Steve Pepper, "The TAO of Topic Maps: Find the Way in the Age of Infoglut," XML Europe 2000.

Thesaurus::RDF

I posted this link about Thesaurus::RDF, The RDF Thesaurus descriptor standard, under the OPML thread, so I thought I'd surface it here in case it gets missed.

    This document describes an RDF implementation of a representation of terms of a thesaurus. The definition of a thesaurus follows that of the NISO specification z39.19. This specification is intended as a method for thesaurus servers to transfer all or part of a thesaurus to an application.
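
For a sense of what such a thesaurus record carries, here is a small Python sketch of a z39.19-style term entry, using the standard relationship abbreviations (BT = broader term, NT = narrower term, RT = related term, UF = used for, SN = scope note). The dictionary layout and the sample terms are illustrative only, not the format defined by the Thesaurus::RDF specification.

    # Sketch of z39.19-style thesaurus entries keyed by preferred term.
    thesaurus = {
        "content management": {
            "SN": "Processes and tools for creating, storing, and publishing content.",
            "BT": ["information management"],
            "NT": ["web content management", "document management"],
            "RT": ["knowledge management", "taxonomy"],
            "UF": ["CM"],   # non-preferred synonym that points to this term
        },
        "information management": {
            "SN": "", "BT": [], "NT": ["content management"], "RT": [], "UF": [],
        },
    }

    def broader_chain(term):
        """Walk BT (broader term) links upward until a top term is reached."""
        chain = []
        while term in thesaurus and thesaurus[term]["BT"]:
            term = thesaurus[term]["BT"][0]
            chain.append(term)
        return chain

    print(broader_chain("content management"))   # ['information management']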

Content Organization Methods Comparison

I wrote a short comparison of a few IA tools: authority lists, thesauri, and faceted approaches. I kept it pretty simple so that it could be given out to clients. It also includes "full-text search" as a method, since my client was in favor of using just search as an interface into 4000-5000 content items. I was trying to make the case that additional work on developing a thesaurus (at least) would improve the site in a number of ways.

Any comments are welcome. Word .doc, about 90k.

Content Organization Methods
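
As a toy illustration of the argument above, the sketch below contrasts plain full-text matching with a query expanded through a small thesaurus: the plain match misses an item that uses a synonym of the query term, while the expanded query recovers it. All of the sample data is made up.

    # Toy comparison of plain full-text matching vs. thesaurus-expanded search.
    documents = {
        1: "Guidelines for automobile maintenance",
        2: "Choosing a family car",
        3: "Public transit schedules",
    }

    # A one-concept synonym ring standing in for a real thesaurus.
    thesaurus = {"car": {"car", "automobile", "auto"}}

    def full_text_search(query, docs):
        return [i for i, text in docs.items() if query.lower() in text.lower()]

    def thesaurus_search(query, docs):
        terms = thesaurus.get(query.lower(), {query.lower()})
        return [i for i, text in docs.items()
                if any(t in text.lower() for t in terms)]

    print(full_text_search("car", documents))   # [2] -- misses the "automobile" item
    print(thesaurus_search("car", documents))   # [1, 2]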

Taxonomies: An Eye for the Needle

In Intelligent Enterprise, an article on business taxonomies.

    Knowledge workers want content management applications to impose order on document chaos. The order imposed must model the business domain they work in. They see the taxonomy of a corporate portal as the key mechanism for managing content according to domain-relevant topics. The taxonomy (a structure for categorizing text content by topic) is the piece of the content management application that knowledge workers depend on most and, therefore, the piece they use for measuring its success.

Classifying web content

In New Thinking, Gerry McGovern talks about classifying web site content. He offers some tips on implementing and testing a simple classification scheme.

    Classification is to content as mapping is to geography. It is an essential tool that allows the person visiting a website to navigate it quickly and efficiently. Without professional classification a website becomes a jumble yard of content that is confusing and time wasting. Before the Web, classification was some peripheral activity that happened deep in the bowels of the library. But the Web is a library. It is a place where people come to quickly find content. Quality classification facilitates them in doing that.
