St. Mary's University Institutional Repository

Please use this identifier to cite or link to this item: http://hdl.handle.net/123456789/7695
Title: DEVELOPING PART-OF-SPEECH TAGGER MODEL FOR AFAAN OROMO LANGUAGE
Authors: MEKURIA, ABEL
Keywords: Rule-based taggers, Hidden Markov Model taggers, Afaan Oromo, unigram, bigram, trigram.
Issue Date: Jun-2023
Publisher: ST. MARY’S UNIVERSITY
Abstract: For a computer to manipulate, analyze, and process human language, the language must be organized and structured in a form the computer can interpret. Part-of-Speech (POS) tagging, one of the applications of Natural Language Processing (NLP), is the task of labeling each word with its appropriate Part-of-Speech tag. Several studies have been conducted on Part-of-Speech tagging for Afaan Oromo, but none has compared approaches to determine which is best suited to the language. In this study, a Part-of-Speech tagger for Afaan Oromo was developed using a Hidden Markov Model approach and a rule-based approach. The Viterbi algorithm was used for the Hidden Markov Model, and Brill's transformation-based error-driven learning for the rule-based approach, each with slight modifications to its modules based on the nature of the language. Natural Language Toolkit version 3.4.5 and Python 2.7 were used to implement the tagger models and conduct the experimental analysis. Discussions with linguists and a review of the literature were used to understand the morphological and grammatical structure of the language and to identify a suitable tagset; as a result, 27 tags were identified. A corpus of 1,196 sentences, comprising 30,165 words of which 8,366 are unique, was collected from BBC Afaan Oromo, VOA Afaan Oromo, and the Afaan Oromo Bible. The corpus was split into training and testing sets: 80% was used to train the tagger models and the remaining 20% to test their performance. Both the Hidden Markov Model and rule-based taggers were trained and tested on the same data. The Hidden Markov Model unigram, bigram, and trigram taggers achieved accuracies of 87.3%, 88.4%, and 89.3%, respectively, and the rule-based taggers, which use the unigram, bigram, and trigram taggers as initial-stage taggers, achieved accuracies of 88.6%, 89.3%, and 89.9%, respectively. The performance analysis shows that the rule-based taggers outperform the Hidden Markov Model taggers. To improve tagger performance, a pre-prepared, standard, balanced corpus and a standard tagset are recommended for future work.
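To make the evaluation setup in the abstract concrete, the following is a minimal sketch of a bigram Hidden Markov Model tagger decoded with the Viterbi algorithm, trained on 80% of a corpus and scored on the remaining 20%. It is not the author's implementation: the thesis used Natural Language Toolkit 3.4.5 with Python 2.7, whereas this sketch is plain Python 3 with no external libraries. The function names, the add-one smoothing, the unshuffled split, and the toy English sentences with three tags are all assumptions made for illustration; they do not come from the thesis corpus or its 27-tag tagset.

```python
import math
from collections import Counter, defaultdict


def split_corpus(sentences, train_fraction=0.8):
    """Split tagged sentences into (train, test) in order, without shuffling."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]


def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocabulary = set()
    for sentence in tagged_sentences:
        previous = "<s>"                # sentence-start marker
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            vocabulary.add(word)
            previous = tag
    return transitions, emissions, vocabulary


def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1.0) / (sum(counter.values()) + num_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for words under the bigram HMM."""
    if not words:
        return []
    transitions, emissions, vocabulary = model
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocabulary) + 1       # one extra slot for unknown words
    # best[i][tag] = (log score of best path ending in tag at i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][prev][0] + log_prob(transitions[prev], tag, n_tags) + emit, prev)
                for prev in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    path.reverse()
    return path


def accuracy(model, tagged_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        gold = [tag for _, tag in sentence]
        predicted = viterbi(words, model)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total if total else 0.0


# Toy placeholder data (hypothetical, not from the thesis corpus).
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

train_set, test_set = split_corpus(TOY_CORPUS)
toy_model = train_bigram_hmm(train_set)
toy_accuracy = accuracy(toy_model, test_set)
```

The accuracy function mirrors the token-level measure implied by the reported percentages, under the assumption that accuracy there means correctly tagged tokens divided by all test tokens. A rule-based stage in the style of Brill's learner would take the output of such a tagger as its initial tagging and then learn correction rules, which this sketch does not attempt.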
URI: http://hdl.handle.net/123456789/7695
Appears in Collections: Master of computer science

Files in This Item:
File: DEVELOPING PART-OF-SPEECH TAGGER MODEL FOR AFAAN OROMO LANGUAGE.pdf
Size: 1.44 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.