DESIGNING A STEMMING ALGORITHM FOR KAMBAATA TEXT: A RULE BASED APPROACH

SAMUEL, JONATHAN

St. Mary's University Institutional Repository

Please use this identifier to cite or link to this item: http://hdl.handle.net/123456789/4460

Title:	DESIGNING A STEMMING ALGORITHM FOR KAMBAATA TEXT: A RULE BASED APPROACH
Authors:	SAMUEL, JONATHAN
Keywords:	stemming algorithm; Kambaata stemmer; rule-based stemmer; longest-match stemmer; Kambaata language
Issue Date:	Mar-2018
Publisher:	St. Mary's University
Abstract:	Stemming is the process of reducing the inflectional and derivational variants of a word to its stem, and it is important in several natural language processing applications. In this research, a rule-based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is single-pass, context-sensitive, and longest-match, and was designed by adapting the rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology whose word formation relies mostly on suffixation, although it also involves infixation, compounding, and reduplication. The output artefact of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. The stemmer's effectiveness was evaluated with the error counting method on two test sets of 1385 and 1040 distinct words. The combined output of the first version of the stemmer shows that, out of 2425 words, 2271 (93.65%) were stemmed correctly, 138 (5.69%) were over-stemmed, and 16 (0.66%) were under-stemmed. To reduce the problems identified in the first version, additional affixes and rules were identified. As a result, over-stemming and under-stemming errors fell to 2.60% (63 words) and 0.54% (13 words), respectively, and the overall accuracy of the stemmer rose to 96.87%. In addition, a dictionary reduction of 67.52% was achieved for correctly stemmed words in the evaluation. The main source of stemming errors for Kambaata words is the language's rich and complex morphology. A number of errors can therefore be corrected by exploring more rules, but it is difficult to eliminate them completely because the morphology makes use of concatenated suffixes and irregularities arising from infixation, compounding, blending, and reduplication of affixes.
URI:	http://hdl.handle.net/123456789/4460
Appears in Collections:	Master of computer science
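
The percentages reported in the abstract can be checked from the word counts it gives. The following is a minimal sketch of the error counting arithmetic only, not the author's evaluation code. The function name `error_counting` and the result keys are hypothetical. The count of correctly stemmed words for the improved version is not stated in the abstract and is derived here as 2425 - 63 - 13 = 2349.

```python
def error_counting(correct, over_stemmed, under_stemmed):
    """Return the total word count and the percentage share of each outcome."""
    total = correct + over_stemmed + under_stemmed
    return {
        "total": total,
        "correct_pct": round(100 * correct / total, 2),
        "over_pct": round(100 * over_stemmed / total, 2),
        "under_pct": round(100 * under_stemmed / total, 2),
    }


# Counts taken from the abstract: two test sets of 1385 and 1040 distinct words.
TOTAL_WORDS = 1385 + 1040

# First version of the stemmer, counts as reported.
first = error_counting(correct=2271, over_stemmed=138, under_stemmed=16)

# Improved version: 63 over-stemmed and 13 under-stemmed words are reported;
# the correct count is an assumption derived by subtraction from the total.
improved = error_counting(
    correct=TOTAL_WORDS - 63 - 13, over_stemmed=63, under_stemmed=13
)

print(first)
print(improved)
```

Under these assumptions the computed shares agree with the figures in the abstract (93.65%, 5.69%, and 0.66% for the first version; 96.87%, 2.60%, and 0.54% for the improved version). The 67.52% dictionary reduction cannot be reproduced this way, because the abstract does not give the number of distinct stems.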

Files in This Item:

File	Description	Size	Format
Designing a Stemming Algorithm for Kambaata Text - A Rule Based Approach_Print Version3.pdf		1.66 MB	Adobe PDF
