Abstract
The advancement of information technology in the modern world
has contributed towards enhancing the quality of life of people around the
world. To keep pace with this rapid development, it is important to have
links with the enormous network called the Internet. This enables people
to have access to information resources, keep abreast of news, send
timely E-mail, and have interactive remote conferences. However, a
tremendous task facing information consumers today is to identify
relevant news items speedily. Hence, designing an information filter for
users of Internet news bulletins is a dire need of the day. The main thrust
of this study is, therefore, focused on designing an algorithm to identify
news items available on the Internet and categorizing them according to
their degree of similarity to each other.
The main concept exploited in obtaining a metric for computing the
degree of similarity of two news items is based on calculating and
comparing the percentage of proper nouns common to both news items.
In order to extract proper nouns from a news item, a filtering process is
employed to eliminate pronouns, articles, prepositions, "be" verbs,
determiners and other function words. Subsequently, the frequencies with
which the extracted words occur in the news item are calculated
and analyzed. Statistical methods are used to confirm the above results
before they are presented to the user. The proposed algorithm was tested
and favourable results were obtained by using the news items downloaded
from the LAcNet Sri Lankan news archives available on the Internet.
Values of the degree of similarity obtained for the test data were
compared with human classification. Based on these results, it is
demonstrated that the proposed algorithm is able to categorize news items
into two classes, 'similar' and 'different', successfully. This achievement
makes a significant contribution towards achieving the automatic
categorization of news items available on the Internet into various topics.
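
The similarity metric outlined above can be illustrated with a minimal Python sketch. It is not the study's implementation. The stop list, the use of capitalisation to spot proper nouns, and the shared-over-total percentage formula are all assumptions made here for illustration, since the abstract does not specify them. The names FUNCTION_WORDS, proper_noun_counts and similarity are hypothetical.

```python
import re

# Hypothetical stop list: the abstract names the word categories to be
# filtered out (pronouns, articles, prepositions, "be" verbs, determiners)
# but does not list the actual words.
FUNCTION_WORDS = {
    "a", "an", "the", "he", "she", "it", "they", "we", "i", "you",
    "in", "on", "at", "of", "to", "by", "for", "from", "with",
    "is", "are", "was", "were", "be", "been", "being", "am",
    "this", "that", "these", "those", "and", "or", "but",
}


def proper_noun_counts(text):
    """Return {word: frequency} for capitalised words not in the stop list.

    Assumption: a capitalised word that survives the filter is treated as
    a proper noun. Sentence-initial common nouns will be miscounted.
    """
    counts = {}
    for token in re.findall(r"[A-Za-z]+", text):
        if token[0].isupper() and token.lower() not in FUNCTION_WORDS:
            counts[token] = counts.get(token, 0) + 1
    return counts


def similarity(text_a, text_b):
    """Percentage of distinct proper nouns that occur in both news items.

    Assumption: shared proper nouns divided by all distinct proper nouns
    found in either item, expressed as a percentage.
    """
    nouns_a = set(proper_noun_counts(text_a))
    nouns_b = set(proper_noun_counts(text_b))
    all_nouns = nouns_a | nouns_b
    if not all_nouns:
        return 0.0
    return 100.0 * len(nouns_a & nouns_b) / len(all_nouns)
```

Under these assumptions, two items would be labelled 'similar' when similarity(item_a, item_b) exceeds some cut-off. The abstract does not state the cut-off or the statistical checks used, so they are left out of the sketch.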