Wednesday, April 11, 2007

Topic Detection (and Tracking)

One of the items on my list of inquiries is Topic Detection. What is it, really?

Topic Detection (and Tracking), which I will abbreviate as TDT from now on, is a kind of classification task. What does it classify? Textual documents. How is TDT different from other classification tasks? In TDT, a new class may appear when new documents arrive. One potential application of TDT is classifying a news stream.

In the environment we live in now, textual news comes from many sources. It can arrive in mailboxes, in RSS news feeds, on homepages, and on weblogs. In the conventional classification task, someone (usually an expert in the relevant domain) assigns a class to each document in the collection. A document might be assigned to more than one class. The main point, though, is that the classes are defined in advance by the experts and usually do not grow: the number of classes is fixed.

In TDT, the number of classes may grow. A new class may be formed when a new document arrives, for example through an RSS news feed. A paper published by J. Allan in 1998 describes the state of the art in TDT at that time.
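To make the idea of a growing set of classes concrete, here is a minimal sketch in Python. It is not taken from the paper mentioned above; it assumes a simple single-pass clustering approach, where each incoming document is compared to the existing topics by cosine similarity over word counts, and a new topic is opened when nothing is similar enough. The names TopicTracker, vectorize, and cosine, and the threshold value of 0.3, are my own choices for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TopicTracker:
    """Hypothetical single-pass topic detector (illustration only)."""

    def __init__(self, threshold=0.3):
        # Assumed cut-off: below this similarity a document starts a new topic.
        self.threshold = threshold
        # One summed term vector per topic; the list grows over time.
        self.topics = []

    def add(self, text):
        """Assign a document to a topic; return (topic_id, is_new_topic)."""
        vec = vectorize(text)
        best_id, best_sim = None, 0.0
        for topic_id, centroid in enumerate(self.topics):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = topic_id, sim
        if best_id is None or best_sim < self.threshold:
            # Detection: no existing topic is close enough, so open a new one.
            self.topics.append(vec)
            return len(self.topics) - 1, True
        # Tracking: fold the document into the closest existing topic.
        self.topics[best_id].update(vec)
        return best_id, False
```

Feeding it "earthquake hits coastal city", then "rescue teams reach earthquake city", then "election results announced tonight" puts the first two documents into topic 0 and opens topic 1 for the third. The threshold controls how readily new topics are created: a higher value splits the stream into more, narrower topics, and a lower value merges more documents into existing ones.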

