Monday, April 02, 2007

Keyword Extraction for the Indonesian Language

In my previous posts, I explored several technical terms related to Data and Text Mining (see [1] and [2]). One of them is Text Extraction. This term refers to the art and science of extracting parts of textual documents. More specifically, what is extracted may take the form of phrases, keywords, or the concepts underlying the individual documents in a corpus.

Many researchers have tried many techniques to extract textual knowledge from documents. One of them is KEA (Keyphrase Extraction Algorithm). According to its authors' paper, KEA can be used to extract from text in any language, as long as the documents are stored in plain-text format. KEA uses Unicode, which covers virtually all of the world's writing systems. I was curious about this claim, so I searched the net for any research that has used KEA to extract keywords from documents in the Indonesian language. So far, I have found no one who has done this kind of research. Instead, my search led me to a paper on Indonesian stemming by a team in Melbourne.

Therefore, I am now doing this research myself. I want to find out whether KEA's claim still holds for Indonesian-language extraction.
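
To make the question a bit more concrete, here is a minimal sketch of scoring candidate keywords in Unicode text. It is not KEA itself: it only ranks single words by a smoothed TF×IDF score, which is my own simplification. The function names (tokenize, tfidf_keywords) and the tiny Indonesian sample corpus are invented for illustration. Tokenizing works on any Unicode text, but any stemming or stopword step would be language-specific, and that is where Indonesian may behave differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # For str patterns in Python 3, \w matches Unicode word characters,
    # so this works for Indonesian as well as accented or non-Latin text.
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(doc, corpus, top_n=3):
    """Rank the words of `doc` by a smoothed TF x IDF score.

    Hypothetical helper for illustration only; KEA's real pipeline
    (candidate phrases, stemming, a trained model) is not reproduced here.
    """
    tokens = tokenize(doc)
    term_freq = Counter(tokens)
    doc_vocabularies = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    scores = {}
    for word, count in term_freq.items():
        doc_freq = sum(1 for vocab in doc_vocabularies if word in vocab)
        # Smoothed IDF (an assumption, chosen to avoid division by zero).
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[word] = (count / len(tokens)) * idf

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Invented sample sentences, not real research data.
corpus = [
    "penambangan teks adalah bagian dari penambangan data",
    "data disimpan dalam dokumen",
    "dokumen ini adalah contoh",
]

top_keywords = tfidf_keywords(corpus[0], corpus, top_n=3)
print(top_keywords)
```

In this toy corpus, common words such as "adalah" are pushed down only because they appear in several documents. A real experiment would need an Indonesian stopword list and stemmer to see whether that is enough.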

This post is an introduction to my planned scientific paper on Text Extraction.

