Monday, June 15, 2009

Important Words

Suppose you are given a bunch of texts. Each one can be as small as a short sentence or a simple paragraph, or as large as a complex document. From these texts, you are asked to determine the most important words representing each text. What is your strategy?

One simple way to approach this problem is to count the occurrences of every word in each text. The words with the highest number of occurrences can reasonably be assumed, as a first approximation, to be the most important ones.
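Here is a minimal sketch of the counting idea in Python. The function name top_words and the rule for what counts as a word (runs of letters and apostrophes, lowercased) are my own assumptions for illustration, not the only way to do it.

```python
import re
from collections import Counter

def top_words(text, n=3):
    """Return the n most frequent words in text, with their counts."""
    # Lowercase first so that "The" and "the" count as the same word.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

sample = "The cat sat on the mat. The cat slept."
print(top_words(sample, 2))  # [('the', 3), ('cat', 2)]
```

Notice that the most frequent word in the sample is "the", which tells us little about the text. Raw counts alone are a rough signal.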

Another approach is to consider the first and the last sentence of each paragraph. This strategy is applicable to texts that have at least one paragraph. Most people write the main idea of a paragraph in its first sentence, or sometimes in its last, so it is worth considering these two positions in a paragraph.

You have to know how to identify which characters start and end a sentence, though. For example, most sentences end with a period, like this one. But sometimes a period is used to denote an abbreviation as well, as in "Ph.D." or "Washington, D.C.". Your algorithm has to take these appearances of periods in the middle of a sentence into consideration.
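A rough sketch of this in Python might look like the following. The ABBREVIATIONS set is a tiny hypothetical list I made up for the example, and the names split_sentences and first_and_last are illustrative. A real splitter would need a much longer list and smarter rules.

```python
# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"ph.d.", "d.c.", "dr.", "mr.", "mrs.", "e.g.", "i.e."}

def split_sentences(paragraph):
    """Split a paragraph into sentences, skipping periods in known abbreviations."""
    sentences, current = [], []
    for token in paragraph.split():
        current.append(token)
        ends_with_terminator = token.endswith((".", "!", "?"))
        # Strip surrounding quotes, commas, and parentheses before the lookup.
        is_abbreviation = token.lower().strip('",()') in ABBREVIATIONS
        if ends_with_terminator and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:  # leftover words with no closing punctuation
        sentences.append(" ".join(current))
    return sentences

def first_and_last(paragraph):
    """Return the first and last sentence of a paragraph (one item if they coincide)."""
    sentences = split_sentences(paragraph)
    if len(sentences) <= 1:
        return sentences
    return [sentences[0], sentences[-1]]

text = "She earned her Ph.D. in Washington, D.C. last year. Then she moved abroad. She teaches now."
print(first_and_last(text))
```

One known weakness of this sketch: if a sentence really does end with an abbreviation, such as "I live in Washington, D.C.", it will not be split there. Telling those two cases apart needs more context than a simple lookup.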

Another approach I can think of right now is to throw out the stop words. These are words that carry little meaning on their own; they support the sentence as connectors, such as "and" and "or". In other words, stop words are words that you can drop from a sentence and still get its meaning. Try this example:

I was going to market with my wife yesterday, and we bought a kilo of apple.

If we drop "was", "to", "with", "and", "a", and "of" from that sentence, we can still get what I wanted to say:

I going market my wife yesterday, we bought kilo apple.

To use this approach, your algorithm needs to be able to identify which words are stop words.
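As a minimal sketch in Python, assuming the stop words are simply looked up in a hand-written set, the example above could be reproduced like this. The STOP_WORDS set holds only the six words dropped in the example, and remove_stop_words is an illustrative name. A practical list would be far longer.

```python
# Only the six words dropped in the example sentence (an assumption for this sketch).
STOP_WORDS = {"was", "to", "with", "and", "a", "of"}

def remove_stop_words(sentence):
    """Drop stop words from a sentence, keeping punctuation on the remaining words."""
    kept = []
    for token in sentence.split():
        # Compare without surrounding punctuation and without regard to case.
        word = token.strip(".,!?;:\"'").lower()
        if word not in STOP_WORDS:
            kept.append(token)
    return " ".join(kept)

sentence = "I was going to market with my wife yesterday, and we bought a kilo of apple."
print(remove_stop_words(sentence))
# I going market my wife yesterday, we bought kilo apple.
```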

Any other ideas?
