Wednesday, April 11, 2007

Topic Detection (and Tracking)

One of the items on my list of inquiries is Topic Detection. What is it, really?

Topic Detection (and Tracking), which I will abbreviate as TDT from now on, is a kind of classification task. What is classified? Textual documents. How does TDT differ from other classification tasks? In TDT, one may expect a new class to appear when new documents arrive. One potential application of TDT is classifying a news stream.

In the environment we live in now, textual news comes from many sources: mailboxes, RSS news feeds, homepages, and weblogs. In the original classification task, someone (usually an expert in the related domain) assigns a class to each document in the collection. A document might be assigned to more than one class. But the main idea is that the designated classes are determined by the experts, and the set of classes usually doesn't grow (the number of classes is fixed).

In TDT, the number of classes may grow. A new class may be formed when a new document comes in through an RSS news feed. The paper published by Allan, J. et al. in 1998 describes the state of the art of TDT.
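To make the idea of a growing set of classes concrete, here is a minimal sketch in Python. It is not the method from the Allan et al. paper; it is a simple single-pass clustering that I am assuming for illustration. Each incoming document is compared with the existing topics by cosine similarity over word counts, and a new topic is opened when nothing is similar enough. The function names and the threshold value of 0.3 are my own hypothetical choices.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_topics(stream, threshold=0.3):
    """Label each document in arrival order; open a new topic when needed.

    threshold is an assumed value; a real system would tune it.
    """
    topics = []  # one summed word-count vector per topic
    labels = []
    for text in stream:
        vec = vectorize(text)
        best_id, best_sim = None, 0.0
        for i, centroid in enumerate(topics):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id is None or best_sim < threshold:
            # No existing topic is close enough: a new class is formed.
            topics.append(Counter(vec))
            labels.append(len(topics) - 1)
        else:
            # Close enough: add the document to the existing topic.
            topics[best_id].update(vec)
            labels.append(best_id)
    return labels


news = [
    "earthquake hits java island",
    "java island earthquake damage reported",
    "football final score tonight",
]
print(assign_topics(news))
```

The first two stories share most of their words and end up in the same topic, while the third has nothing in common with them and starts a new one. In a fixed-class classifier, that third story would have to be forced into an existing class.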

Wednesday, April 04, 2007

A Testbed for Indonesian Text Retrieval

A breakthrough in Indonesian Text Retrieval research has been made by a group of researchers at RMIT University, Melbourne, Australia. They developed a TREC-like corpus for objective evaluation on Indonesian-language documents.

What is a corpus? A corpus is a collection of documents that is used for evaluating the performance and effectiveness of an Information Retrieval system. These evaluations include precision and recall.
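Two measures commonly computed on such a corpus are precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that were retrieved). Below is a minimal sketch in Python; the function name and the document IDs are hypothetical and are not taken from the RMIT testbed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the system returned
    relevant:  IDs of the documents judged relevant in the corpus
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returns four documents; three documents are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant, so precision is 2/4, and two of the three relevant documents were found, so recall is 2/3. The relevance judgments that come with a testbed are what make these numbers computable.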

Monday, April 02, 2007

Keyword Extraction for Indonesian Language

In my previous posts, I tried to explore many technical terms related to Data or Text Mining (see [1] and [2]). One of them is Text Extraction. This term refers to the art and science of extracting parts of textual documents. In a more specific sense, what is extracted may take the form of phrases, keywords, or concepts underlying the individual documents in the corpus.

Many scientists have tried many techniques to extract textual knowledge from documents. One of them is KEA (Keyphrase Extraction Algorithm). According to its authors' paper, KEA can be used to extract textual parts of documents in any language, as long as the documents are stored in text format. KEA uses Unicode, which covers virtually all of the world's writing systems. I was curious about this statement, so I started to search the net for any research that has used KEA to extract from documents in the Indonesian language. So far, I have found no one who has done this kind of research. Instead, my exploration led me to a paper on Indonesian stemming by a team in Melbourne.

Therefore, I am now doing this research myself. I am trying to find out whether KEA's claim still holds for extraction from Indonesian-language documents.
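As a first feel for what keyword extraction on Indonesian text looks like, here is a minimal sketch in Python. To be clear, this is not KEA; it is a plain TF-IDF ranking of single words that I am assuming as a stand-in, with no stemming or stopword list. The function name and the three sample sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by TF-IDF and return the top_k."""
    tokenized = [d.lower().split() for d in docs]

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tokens = tokenized[doc_index]
    tf = Counter(tokens)
    n_docs = len(docs)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


docs = [
    "harga beras naik di pasar",
    "harga minyak naik di pasar",
    "gempa bumi di jawa",
]
print(tfidf_keywords(docs, 0, top_k=1))
```

The word "beras" appears only in the first document, so it gets the highest score there, while "di" appears in every document and scores zero. Nothing in the code is specific to English, but whether a trained extractor such as KEA behaves equally well on Indonesian is exactly the open question.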

This is an introduction to my planned scientific paper on Text Extraction.

Thursday, March 29, 2007

What is Ontology?

Ontology is, in its simplest form, the science of representing knowledge.

Many years ago, when computers were gaining broader use, Ontology was discussed in terms of Knowledge Representation, that is, how knowledge is stored in machines (i.e. computers).

I googled "ontology" as the keyword, and from what Google returned, I picked five results that I plan to explore further.

  1. OWL Web Ontology Language Overview

  2. Ontology. Based on John F. Sowa's well-known book.

  3. Ontology (computer science) - Wikipedia, the free encyclopedia

  4. What is an Ontology? by Tom Gruber of Stanford University's Knowledge Systems Lab.

  5. Ontology. A resource guide for philosophers

In recent years, as XML has been gathering more attention, Ontology and the Semantic Web have become linked together. My research interest is, in some sense, related to Semantic Web research.

Friday, March 23, 2007

Criticize!

Try taking 2 to 5 papers you have downloaded, read them, and write a review of them. Analyze, Compare, Contrast, and Criticize.

What are these all about?

Why.. what is it about Association Rules in Text Mining?

Can we use them to find the semantics underlying the corresponding paragraph?

What about Inference Rules?

What are the problems in Text Understanding?