Hierarchical Tag-set for Rule-based Processing of Tamil Language

Journal: International Journal of Multidisciplinary Studies, Volume 01 Issue 02 - 2014

Sarveswaran, Kengatharaiyer; Mahesan, Sinnathamby

URI: http://dr.lib.sjp.ac.lk/handle/123456789/3307

Date: 2016-10-25

Abstract:

Corpora are fundamental tools for Natural Language Processing. Part-of-speech tagging adds meaning to a corpus by annotating its words. A tag-set used to annotate a corpus should be chosen so that it represents the grammatical structure of the language concerned. Tag-sets can be flat or hierarchical in structure. Several efforts have been made to identify a tag-set for the Tamil language. However, existing tag-sets have many shortcomings, including the inability to tag all words, the inability to capture required syntactic information such as divisibility, an excessive number of tags, a flat tag structure, and a lack of extensibility. The scholarly works Tolkāppiyam and Naṉṉūl clearly show the grammatical classification of words. This paper proposes a new hierarchical tag-set with 10 labels for the Tamil language, designed with the development of a morphological analyser in view, by considering the limitations of existing tag-sets and drawing on Tamil grammar. The morphological analyser can then be used to extend the proposed tag-set easily with more grammatical information.
