Wednesday, March 03, 2010

Large Scale Content Classification With Little Time and Few Resources

What would you do if faced with the daunting task of building multiple industry taxonomies for large-scale automatic classification, given very little time and even fewer resources? This is exactly the dilemma zibb.com came face-to-face with shortly after our inception. Of course, being an enthusiastic bunch, a co-worker and I said yes, of course we can do this, and set off to tackle the problem with full abandon and a prayer (an ecumenical prayer, of course, with a few deleted expletives thrown in for good measure).

Luckily, we had a few tricks up our sleeves, plus the luxury of being able to focus almost exclusively on precision (accuracy) without too much regard for recall (getting it all). This was largely thanks to our huge index of over 2 billion documents: with that much content, missing some relevant documents for a topic still left plenty of accurate ones to show.

Our first trick was that we already had Teragram’s rules-based classifier available to us, and the second trick was that our content was already organized into industries (sectorized). These two factors greatly reduced the difficulty of our task.

Utilizing a rules-based classifier, like Teragram, eliminates the need to collect training documents for each topic, which can take an exceptional amount of time and, if not done carefully, lead to inaccurate results.

Secondly, sectorizing content significantly reduces the variety of meanings that words tend to have. For example, the term MMIC in the Teragram rule below (taken from our Electronics industry taxonomy) can mean all sorts of things outside the Electronics industry context.

(OR,"mmics","mmic","monolithic microwave")

MMIC = Medical Marijuana Identification Card
MMIC = Medical Mutual Insurance Company
MMIC = Mobile Medical International Corporation
MMIC = Motorcycle & Moped Industry Council

However, within the electronics industry it means “Monolithic Microwave Integrated Circuit”. The bottom line is that having sectorized content allowed us to write very simple rules without having to worry about language ambiguity, which significantly reduced the effort required to write each rule.
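To make the idea concrete, here is a minimal Python sketch of how an OR rule like the one above could be evaluated once content is already restricted to one sector. This is not Teragram's engine or API; `or_rule`, `classify`, and the `ELECTRONICS_RULES` table are hypothetical names made up for illustration, and the matching is a plain case-insensitive whole-word search.

```python
import re

def or_rule(*terms):
    """Build a matcher that fires when any term occurs as a whole word or phrase.

    Hypothetical helper: it imitates the intent of an (OR, ...) rule,
    not Teragram's actual rule semantics.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    # Return the character offsets of every hit; an empty list means no match.
    return lambda text: [m.start() for m in pattern.finditer(text)]

# Hypothetical rule table for a single sector. Because only Electronics
# documents are ever run through it, "mmic" needs no disambiguation.
ELECTRONICS_RULES = {
    "MMIC": or_rule("mmics", "mmic", "monolithic microwave"),
}

def classify(text, rules):
    """Return the topics whose rule matched at least once, sorted by name."""
    return sorted(topic for topic, rule in rules.items() if rule(text))

print(classify("New GaAs MMIC amplifier announced", ELECTRONICS_RULES))
```

The same rule applied to unsectorized content would also fire on insurance or motorcycle industry documents, which is exactly the ambiguity that sectorization removes.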

The final piece of this puzzle involved implementing a ranking algorithm which allowed us to set a taxonomy-wide aboutness threshold. For this we worked closely with the Teragram team to implement a ranking algorithm which looks at the density of term hits and hit position, along with a number of other factors. This allowed us to set a ranking cutoff score below which we drop classifications. This ranking algorithm is called "zone ranking" and is now part of the standard Teragram classification tool (by the way, we really enjoyed working with the Teragram folks, as they were eager to collaborate and really worked as a team with us).
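As a rough illustration of the cutoff idea, the sketch below scores a document from just two of the signals mentioned above, hit density and the position of the first hit, and then drops topics that fall under a threshold. It is not the zone ranking algorithm itself, whose details are not given here; the function names, the equal weights, and the scoring formula are all assumptions chosen for the example.

```python
def aboutness_score(tokens, terms, density_weight=0.5, position_weight=0.5):
    """Toy aboutness score in [0, 1]; the formula and weights are assumptions."""
    hits = [i for i, tok in enumerate(tokens) if tok.lower() in terms]
    if not hits:
        return 0.0
    # Density: fraction of tokens that are rule hits.
    density = len(hits) / len(tokens)
    # Earliness: 1.0 when the first hit is the first token, lower the later it appears.
    earliness = 1.0 - hits[0] / len(tokens)
    return density_weight * density + position_weight * earliness

def apply_cutoff(scores, cutoff):
    """Keep only the classifications scoring at or above the cutoff."""
    return {topic: score for topic, score in scores.items() if score >= cutoff}

terms = {"mmic", "mmics"}
strong = aboutness_score("mmic design notes for mmic amplifiers".split(), terms)
weak = aboutness_score("annual report mentions one mmic".split(), terms)
print(apply_cutoff({"strong doc": strong, "weak doc": weak}, cutoff=0.5))
```

A single taxonomy-wide cutoff is attractive because it is one number to tune rather than one per topic, and raising it trades recall for precision.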

In the end, this combination of sectorized content + simple rules + ranking cutoff allowed us to build out almost forty industry taxonomies in English, plus twenty of those in Dutch, and do it on time and within budget. Some of the results of this work can be seen on http://www.zibb.com/ - e.g. MMIC

Incidentally, this formula allowed us to run automatic content classification for large entity files without having to write a single rule by hand (we auto-generated the rules from the entity names). These entities include companies, airplanes, airports, chefs, semiconductor equipment and more.
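Generating rules from names can be sketched in a few lines of Python. The output below imitates the shape of the (OR,...) rule shown earlier, but the exact rule syntax, the lowercasing, and the idea of an alias list are assumptions for illustration; `rule_from_name` and the entity names are hypothetical.

```python
def rule_from_name(name, aliases=()):
    """Build an (OR,...) rule string from an entity name and optional aliases.

    Terms are lowercased, whitespace is collapsed, and duplicates are dropped
    while preserving order. Names containing a double quote are not handled.
    """
    terms = []
    for term in (name, *aliases):
        normalized = " ".join(term.lower().split())
        if normalized and normalized not in terms:
            terms.append(normalized)
    quoted = ",".join('"{}"'.format(t) for t in terms)
    return "(OR,{})".format(quoted)

# Hypothetical entity file: (name, aliases) pairs.
entities = [
    ("Acme Avionics", ["Acme  avionics", "Acme Avionics BV"]),
    ("Example Field Airport", []),
]
for name, aliases in entities:
    print(rule_from_name(name, aliases))
```

Rules this simple only work because the sector context and the ranking cutoff filter out most accidental matches.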

So if you are ever faced with the task of implementing classification for a large number of topics and documents in a short period of time and with few resources, you may want to consider a rules-based classifier combined with some sort of overall content segmentation (like sectorization).

It should be noted that a great disadvantage of a rules-based classifier is that it is language dependent, meaning that the rules must be rewritten for each language. As such, if you need to classify content in a large variety of languages, a language-independent statistical approach might prove to be more scalable.
