Wednesday, April 11, 2007

Topic Detection (and Tracking)

One of the items on my list of inquiries is Topic Detection. What is it, really?

Topic Detection (and Tracking), which I will abbreviate as TDT from now on, is a kind of classification task. What does it classify? Textual documents. How is TDT different from other classification tasks? In TDT, a new class may appear when new documents arrive. One potential application of TDT is classifying a news stream.

In an environment like the one we live in now, textual news comes from many sources: mailboxes, RSS news feeds, homepages, and weblogs. In the original classification task, someone (usually an expert in the relevant domain) assigns a class to each document in the collection. A document may be assigned to more than one class. The main idea, however, is that the classes are determined by the experts in advance and usually do not grow; the number of classes is fixed.

In TDT, the number of classes may grow. A new class may be formed when a new document arrives, for example through an RSS news feed. A paper published by Allan J. in 1998 describes the state of the art in TDT.
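To make the idea of a growing set of classes concrete, here is a minimal sketch of single-pass clustering over a stream of documents. It is only an illustration, not the method of any particular TDT system: the function names, the bag-of-words representation, and the similarity threshold of 0.3 are all assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_topics(stream, threshold=0.3):
    """Assign each incoming document to an existing topic or open a new one.

    Returns one topic id per document, in arrival order.
    The threshold is an arbitrary value picked for this sketch.
    """
    topics = []  # one accumulated term vector per topic
    labels = []
    for text in stream:
        vec = vectorize(text)
        scores = [cosine(vec, topic) for topic in topics]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            topics[best].update(vec)     # fold the document into that topic
            labels.append(best)
        else:
            topics.append(Counter(vec))  # nothing is close enough: new class
            labels.append(len(topics) - 1)
    return labels


news = [
    "earthquake hits city center",
    "earthquake damages city buildings",
    "football team wins final match",
    "football final match goes to penalties",
]
print(detect_topics(news))
```

The number of topics is not fixed in advance: it starts at zero and grows whenever a document is not similar enough to anything seen so far, which is the key difference from classification with a fixed set of classes.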

Wednesday, April 04, 2007

A Testbed for Indonesian Text Retrieval

A breakthrough in Indonesian Text Retrieval research has been made by a group of researchers at RMIT University, Melbourne, Australia. They developed a TREC-like corpus for objective evaluation of retrieval on Indonesian-language documents.

What is a corpus? A corpus is a collection of documents that is used for evaluating the performance and effectiveness of an Information Retrieval system. These evaluations include measures such as precision and recall.
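Two standard measures in this kind of evaluation are precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that were retrieved). The sketch below computes both for a single query; the function name and the document ids are made up for illustration and are not taken from the RMIT testbed.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant in the testbed
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: the system returned four documents, three are relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(precision, recall)
```

A testbed of this kind is what supplies the `relevant` side: without relevance judgments for each query, neither number can be computed objectively.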

Monday, April 02, 2007

Keyword Extraction for Indonesian Language

In my previous posts, I explored many technical terms related to Data or Text Mining (see [1] and [2]). One of them is Text Extraction. This term refers to the art and science of extracting parts of textual documents. More specifically, what is extracted may take the form of phrases, keywords, or concepts underlying the individual documents in a corpus.

Many scientists have tried many techniques to extract textual knowledge from documents. One of them is KEA (Keyphrase Extraction Algorithm). In their paper, the authors say that KEA can be used to extract textual parts in any language, as long as the documents are stored in text format. KEA uses Unicode, which covers virtually all of the world's writing systems. I was curious about this statement, so I started to search the net for any research that has used KEA to extract from documents in the Indonesian language. So far, I have found no one who has done this kind of research. Instead, my search led me to a paper on Indonesian stemming by a team in Melbourne.

Therefore I am now doing this research myself. I am trying to find out whether KEA's claim still holds for Indonesian-language extraction.
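As a rough illustration of what language-independent extraction over plain Unicode text can look like, here is a minimal sketch that ranks the words of one document by TF x IDF against a small collection. This is plain TF x IDF ranking, not KEA itself, and it extracts single words rather than phrases; the function names and the three Indonesian sample sentences are assumptions made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # \w is Unicode-aware for Python 3 strings, so nothing here is
    # specific to Indonesian or to any other language.
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(documents, index, top_n=3):
    """Rank the words of documents[index] by TF x IDF over the collection."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    tokens = tokenized[index]
    counts = Counter(tokens)
    scores = {
        word: (count / len(tokens)) * math.log(len(documents) / doc_freq[word])
        for word, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Made-up sample sentences (rice prices, oil prices, an earthquake report).
docs = [
    "harga beras naik di pasar",
    "harga minyak turun di pasar",
    "gempa di padang, gempa susulan di mentawai",
]
print(tfidf_keywords(docs, 2, top_n=3))
```

Note that the word "di" scores zero because it occurs in every document, without any Indonesian stopword list being supplied. Whether this kind of language independence is enough for good keyphrases in Indonesian, where stemming matters, is exactly the open question.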

This is an introduction to my planned scientific paper on Text Extraction.