Ethiopian Language Identification from Text Data Using Hybrid Approach

Sitotaw, Saba

Full metadata record

DC Field	Value	Language
dc.contributor.author	Sitotaw, Saba	-
dc.date.accessioned	2024-04-29T12:03:09Z	-
dc.date.available	2024-04-29T12:03:09Z	-
dc.date.issued	2024-02	-
dc.identifier.uri	http://hdl.handle.net/123456789/7885	-
dc.description.abstract	Language identification is the automatic task of determining the language in which a given text is written. The problem has been studied thoroughly for well-resourced languages, such as European languages, but not for low-resourced languages, especially Ethiopian languages. To mitigate this, many researchers have proposed language identification systems, and the topic has become a major research focus. This work proposes a language identifier system and evaluates it in three experiments: the first uses unigram, bigram, and mixed unigram and bigram features; the second uses analyzer='char' with n-gram range = (1, 3); and the third uses twenty feature sets as columns. In the first experiment, all classifiers were trained on a unigram (n=1) feature set with four target language classes: Hadiyya, Wolaytta/Wolaytegna, Somali, and Sidama. With this feature set, the average classification accuracy across all languages was 81% for Naïve Bayes, and 85%, 90%, 79%, and 89% for the Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting classifiers, respectively. With a 1% mixture of unigrams and bigrams, the average classification accuracy of the Naïve Bayes, Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting classifiers was 95.25%, 96.7%, 97.56%, 91%, and 96.6%, respectively. With a 60% mixture of unigram and bigram features and the same four target language classes, the Naïve Bayes, Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting classifiers achieved average classification accuracies of 91%, 94%, 95.96%, 88.36%, and 94.87%, respectively. In the second experiment, using analyzer='char' with n-gram range = (1, 3), Logistic Regression reached an overall average accuracy of 98.9%, the highest of all the classifiers, with per-language accuracies of 99%, 98%, 100%, and 99% for Hadiyya, Sidama, Somali, and Wolaytta, respectively.
In the third experiment, twenty feature sets were used as columns for each model; the average classification accuracy of Naïve Bayes was 59.71%, whereas the rates listed for Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting are 70.41%, 78.11%, and 76.69%.	en_US
dc.language.iso	en	en_US
dc.publisher	St. Mary's University	en_US
dc.subject	Language Identification, Multinomial Naïve Bayes, Decision Tree, Random Forest, Gradient Boosting	en_US
dc.title	Ethiopian Language Identification from Text Data Using Hybrid Approach	en_US
dc.type	Thesis	en_US
Appears in Collections:	Master of computer science
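
The abstract describes classifying text by language from character n-grams in the range (1, 3). The sketch below illustrates that general idea with a small multinomial Naïve Bayes classifier written in plain Python. It is not the thesis's implementation: the names char_ngrams and CharNgramNaiveBayes are hypothetical, the labels lang_a and lang_b are placeholders, and the training strings are synthetic toy data, not samples of Hadiyya, Wolaytta, Somali, or Sidama.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (lowercased)."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts.

    Hypothetical illustration only; uses add-one (Laplace) smoothing.
    """

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter(labels)         # label -> number of documents
        self.vocab = set()                        # all n-grams seen in training
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        grams = char_ngrams(text)
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior of the class plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / self.total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic toy data (assumption): two made-up "languages" with distinct letters.
texts = ["zaka zuzu zaki", "zuza zika zazu", "mono momo moon", "nomo mona moom"]
labels = ["lang_a", "lang_a", "lang_b", "lang_b"]

clf = CharNgramNaiveBayes().fit(texts, labels)
print(clf.predict("zuka ziza"))
print(clf.predict("moma noon"))
```

Character n-grams need no word segmentation or language-specific tokenizer, which may be one reason they suit low-resourced languages; a real system would train on a labeled corpus for each target language and evaluate on held-out text.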