Abstract:
Corpora are fundamental tools for Natural Language Processing. Part-of-Speech (POS) tagging adds
information to a corpus by annotating its words. The tag-set used to annotate a corpus should be chosen so
that it represents the grammatical structure of the respective language. Such tag-sets can be flat or
hierarchical in structure. Several efforts have been made to define a tag-set for the Tamil language.
However, existing tag-sets have many shortcomings, including the inability to tag all words, the inability to
capture required syntactic information such as divisibility, an excessive number of tags, a flat tag
structure, and a lack of extensibility. The scholarly works Tolkāppiyam and Naṉṉūl clearly show the grammatical
classification of words. Considering these limitations and drawing on Tamil grammar, this paper proposes a
new hierarchical tag-set with 10 labels for the Tamil language, with a view to developing a morphological analyser.
The morphological analyser can be used to extend the proposed tag-set easily with more grammatical
information.