Tuesday, June 23, 2009

Extracting Meaning from Millions of Pages (KDnuggets News 09:12, item 23, Briefs)

Wednesday, June 17, 2009

The Ability to Maintain Things and Level of Education

Less than a month after its inauguration, the Suramadu bridge has already fallen victim to vandalism by our own people. It's sad to hear.

It was built with so much effort, fought for since the 1960s, all for their own benefit. As soon as the bridge was finished, it got stripped for parts. Instead of saying thank you, they damage a gift that was given to them. How on earth do they think?

The Suramadu bridge is not the only case. In Jabodetabek, just look at the KRL commuter trains and other public transport. We received buses and trains donated by Japan in good condition, and a year later they were already dusty and many no longer worked properly.

How can we build public facilities of our own making, if we can't even maintain the ones we already have?

Monday, June 15, 2009

Important Words

Suppose you are given a bunch of texts. Each can be as short as a single sentence, or as long as a paragraph or a complex document. You are asked to determine the most important words representing each text. What is your strategy?

One simple way to answer this problem is to count the occurrences of every word in each text. The words that occur most often can reasonably be assumed to be the most important ones.
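Here is a minimal sketch of the counting idea in Python. The function name `top_words` and the rule for what counts as a word (runs of letters and apostrophes, case ignored) are my own assumptions for illustration, not the only way to do it.

```python
import re
from collections import Counter

def top_words(text, n=3):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Assumption: a "word" is a run of letters or apostrophes, case ignored.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

sample = "The cat sat on the mat. The cat slept."
print(top_words(sample, 2))  # [('the', 3), ('cat', 2)]
```

Notice that a very common word like "the" comes out on top, so raw counts alone can be misleading.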

Another approach is to consider the first and last sentences of each paragraph. This strategy applies to texts that have at least one paragraph. Most people state the main idea of a paragraph in the first sentence, or sometimes in the last, so it is worth considering these two positions.
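A rough sketch of this idea, assuming paragraphs are separated by blank lines and that a sentence ends at ".", "!" or "?" followed by whitespace. Both rules are simplifications, and `key_sentences` is just a name I made up for the example.

```python
import re

def key_sentences(text):
    """Return a (first, last) sentence pair for each paragraph in text."""
    result = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", paragraph)
        sentences = [s.strip() for s in parts if s.strip()]
        if sentences:
            result.append((sentences[0], sentences[-1]))
    return result

text = "Cats are popular pets. They sleep a lot. Many people love them.\n\nDogs are loyal."
for first, last in key_sentences(text):
    print(first, "|", last)
```

For a one-sentence paragraph, the first and last sentence are the same. The words in these sentences could then be given extra weight when ranking.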

You have to know how to identify which characters start and end a sentence, though. For example, most sentences end with a period, like this one. But sometimes a period is used to denote an abbreviation as well, as in "Ph.D." or "Washington, D.C.". Your algorithm has to take into account these periods that appear in the middle of a sentence.
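One simple way to handle this is to keep a list of known abbreviations and refuse to end a sentence on them. The sketch below assumes a small hand-made list (`ABBREVIATIONS` is illustrative and far from complete), and it will get things wrong when an abbreviation really is the last word of a sentence.

```python
# Assumption: a tiny hand-made abbreviation list, for illustration only.
ABBREVIATIONS = {"ph.d.", "d.c.", "mr.", "mrs.", "dr.", "e.g.", "i.e."}

def split_sentences(text):
    """Split text into sentences, skipping periods that belong to abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        # A token closes the sentence only if it is not a known abbreviation.
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final punctuation
        sentences.append(" ".join(current))
    return sentences

text = "She has a Ph.D. in physics. She lives in Washington, D.C. with her family."
for sentence in split_sentences(text):
    print(sentence)
```

With this input the function returns two sentences, even though the text contains six periods.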

Another approach I can think of right now is to throw out the stop words. These are words that carry little meaning of their own; they support the sentence as connectors, such as "and" and "or". In other words, stop words are words that you can drop from a sentence and still get its meaning. Try this example:

I was going to market with my wife yesterday, and we bought a kilo of apple.

If we drop "was", "to", "with", "and", "a", and "of" from that sentence, we can still get what I want to say:

I going market my wife yesterday, we bought kilo apple.

Using this approach, your algorithm should be able to identify which words are stop words.
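As a minimal sketch, here is the example above in Python. The stop word set contains only the six words dropped in the example; a real list would be much longer, and the name `remove_stop_words` is just something I picked for illustration.

```python
import string

# Assumption: only the six words dropped in the example sentence.
STOP_WORDS = {"was", "to", "with", "and", "a", "of"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop stop words from a sentence, leaving the other tokens untouched."""
    kept = []
    for token in sentence.split():
        # Compare without surrounding punctuation and without case.
        word = token.strip(string.punctuation).lower()
        if word not in stop_words:
            kept.append(token)
    return " ".join(kept)

sentence = "I was going to market with my wife yesterday, and we bought a kilo of apple."
print(remove_stop_words(sentence))
# I going market my wife yesterday, we bought kilo apple.
```

The remaining words can then be counted or ranked without the connectors getting in the way.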

Any other ideas?