The difficulty of categorization

Peter V pointed to Philip C. Murray's KM Connection article, The difficulty of categorization, which discusses the implications of using categorization in the enterprise. In it, he cites Bella Hass Weinberg's 1996 article from the ASIS Conference Proceedings, Complexity In Indexing Systems -- Abandonment And Failure: Implications For Organizing The Internet, to raise the difficulty of classifying documents from a large corpus of data. Weinberg's article discusses the issues in classifying the Internet. Murray's position is that a corporate body's "three ring binder of knowledge" is not a massive data source, so it is not necessarily subject to all of the difficulties that Weinberg mentions. He states,

    I also wonder whether classification experts simply cultivate the perception that classification is extremely difficult. Even manual classification can be done quickly, if my experience with professional indexers is any indicator. It's not unusual for a professional indexer to generate a comprehensive, high-quality back-of-the-book index for a new title in less than three weeks.
He goes on to discuss the advantages of faceted knowledge access at a high level. What I find problematic about arguments that essentially say classification is not so hard is that there are so many variables at play in classification of any kind. These variables include the definition of the domain, the size and scope of the indexable corpus, and the specificity of the indexing, to name just a few. Providing facets of classification is another level of complexity, one that calls for guidelines of its own.
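
To make the facet point concrete, here is a minimal sketch in Python -- the facet names and vocabulary terms are invented for illustration -- of what faceted classification looks like as a data structure: each facet is an independent dimension with its own controlled list, and every facet multiplies the judgment calls an indexer has to make.

    # A minimal sketch of faceted classification. The facet names and
    # terms are invented for illustration; a real scheme would need
    # agreed-upon guidelines for each facet.
    FACETS = {
        "topic": {"network security", "billing", "provisioning"},
        "document_type": {"memo", "press release", "technical document"},
        "audience": {"engineering", "marketing", "executive"},
    }

    def classify(assignments: dict) -> dict:
        """Validate a document's facet assignments against the controlled lists."""
        for facet, term in assignments.items():
            if facet not in FACETS:
                raise ValueError("unknown facet: %r" % facet)
            if term not in FACETS[facet]:
                raise ValueError("%r is not an approved term for facet %r" % (term, facet))
        return assignments

    # One document, three independent classification decisions.
    print(classify({
        "topic": "network security",
        "document_type": "memo",
        "audience": "engineering",
    }))

Even in this toy form, each facet needs its own guidelines for when a term applies, which is exactly the extra work that high-level arguments tend to gloss over.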

But I wonder: are most organizations just concerned with indexing a "three ring binder of knowledge," or are they also concerned with indexing all of the published material -- technical documents, memos, press releases, etc. -- of the organization? Are they concerned with indexing at the level of the document, or at a more granular level, indexing concepts within the document? There are a lot of high-level articles floating around lately that pay lip service to the value of classification. What I'm interested in are articles that actually discuss the pain of implementing classification processes within large corporations. If you have citations for any good examples or case studies, please share them!

As part of an information services organization in a large corporation, I've seen the great lengths my colleagues have gone to in making an enterprise-level taxonomy work for our customers, who have been the catalysts and partners in its development and use. Over the four years that I've used our taxonomy on the back end as an indexer and as a site developer -- but not as a subject matter expert creating and defining the taxonomy's terms and relationships -- I have to say that not much about classification at the enterprise level seems very simple to me. Representing knowledge, whether automatically or manually, is never simple, and even when done right, it can never be right all the time or serve everyone. Concepts change, indexers represent knowledge differently, and environmental factors shift priorities and sometimes the language and understanding of your subject matter. It's all very slippery. That said, without classification it is clear that knowledge retrieval is hampered and the bottom line suffers. And I guess that is what creates the need for information professionals and information retrieval systems.


Theory vs. Practice

By coincidence I was chatting with Bella the other night (she's an LIS professor in NYC). She says she's doing a book index now, taking on one every once in a while to keep her skills and ideas fresh. This time she said it reminded her of how hard creating an index really is!

I don't doubt it

I don't doubt that she would say that. I was on a team that provided indexing services on news feeds for our corporation -- it took about two hours of my day, and I did it for about a year and a half. The controlled vocabulary we used then was simple compared to the complex, enormous one the team uses now -- one that has been sought after by some large organizations in the telecom industry -- but the task of indexing was still challenging for me every day. The difficulty, for me, comes in how you represent subject matter. First, it really helps to be an expert in the subject field; without experience in a subject area, your indexing will provide little value. Second, there is just so much variation in the representation of knowledge from person to person. You sometimes wonder, because of that, whether you're doing the best job you can.
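
To illustrate what a controlled vocabulary does about that person-to-person variation, here is a toy sketch in Python -- the vocabulary is entirely invented -- of the classic use-for mapping: indexers' free-text candidate terms get pulled toward a single preferred term.

    # Toy controlled vocabulary: entry (non-preferred) terms map to a
    # preferred term. All terms are invented for illustration.
    PREFERRED = {"mobile telephony", "fiber optics", "broadband access"}
    USE_FOR = {
        "cell phone": "mobile telephony",
        "wireless phone": "mobile telephony",
        "fiber": "fiber optics",
        "fibre": "fiber optics",
    }

    def normalize(candidate: str) -> str:
        """Map an indexer's candidate term to the vocabulary's preferred term."""
        term = candidate.lower().strip()
        if term in PREFERRED:
            return term
        if term in USE_FOR:
            return USE_FOR[term]
        raise KeyError("%r is not in the vocabulary; propose a new term" % candidate)

    # Two indexers describe the same article differently; the vocabulary
    # pulls their choices together.
    print(normalize("cell phone"))      # mobile telephony
    print(normalize("wireless phone"))  # mobile telephony

The mapping reduces the variation, but it doesn't eliminate the judgment call of deciding which concept a document is actually about.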

There is no doubt in my mind, after seeing how automated indexing performs, that manual indexing boosts the signal-to-noise ratio. But it gets difficult when you think about your responsibility as an indexer. Someone's search results will be affected by how you tag a document, and in an indirect way, that affects your company's bottom line. Relevant information that goes unfound can make a big difference in dollars, in terms of lost opportunities or time lost to information seeking. The responsibility is great. The way you index greatly affects someone's information use. Imagine if Medline had a sucky indexing and information retrieval system. That would do more than just affect some company's bottom line.
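
One rough way to put numbers on that signal-to-noise point is precision and recall. The sketch below, in Python with invented document IDs and relevance judgments, shows how relevant documents an indexer fails to surface show up directly as lost recall -- the "unfound information" cost described above.

    # Back-of-the-envelope retrieval quality. Document IDs and relevance
    # judgments are invented for illustration.
    def precision_recall(retrieved, relevant):
        """Precision: fraction of retrieved docs that are relevant.
        Recall: fraction of relevant docs that were retrieved."""
        hits = retrieved & relevant
        return len(hits) / len(retrieved), len(hits) / len(relevant)

    relevant = {"d1", "d2", "d3", "d4"}        # what the searcher needs

    auto_indexed = {"d1", "d5", "d6", "d7"}    # noisier tagging
    manual_indexed = {"d1", "d2", "d3", "d8"}  # more careful tagging

    print(precision_recall(auto_indexed, relevant))    # (0.25, 0.25)
    print(precision_recall(manual_indexed, relevant))  # (0.75, 0.75)

The figures are made up; the point is only that every missed relevant document costs recall, and downstream that means lost time or lost opportunities.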