A coherent knowledge-driven deep learning model for idiomatic-aware sentiment analysis of unstructured text using BERT transformer

Bashar M. A., Tahayna (2023) A coherent knowledge-driven deep learning model for idiomatic-aware sentiment analysis of unstructured text using BERT transformer. PhD thesis, UTAR.

Abstract

People express their feelings and views on online social media platforms such as Twitter. Many fields, including business, government, public health, and social welfare, may benefit from recognizing and evaluating the sentiments expressed in social media content. Sentiment analysis, also known as opinion mining, is the task of automatically extracting and classifying the sentiments conveyed in written content. However, this task is not trivial, especially when the text is ambiguous or uses figurative language that shifts the meaning of words beyond the literal in order to convey a more complex meaning. Idioms are important in every natural language, and people tend to use them as shorthand to express themselves concisely. An idiom, or idiomatic expression, is a fixed or nearly fixed sequence of two or more co-occurring, though not necessarily contiguous, words with a unified meaning or purpose. Idiomatic expressions may have both literal and metaphorical meanings, and native speakers typically recognize them in their usual context. However, the overall meaning of an idiom often cannot be inferred from the literal meanings of its constituent words. The research in this thesis is motivated by the fact that idioms are underutilized in sentiment analysis, even though they typically reflect an expressive sentiment about an object or an event. Sentiment analysis algorithms that classify tweets face challenges when dealing with idiomatic expressions and figurative language. These expressions often deviate from the typical meaning and sequence of words, making it difficult for sentiment classifiers to classify the sentiment of a tweet accurately. Existing methods rely on manually generated sentiment lexicons for idiomatic expressions, which require painstaking labeling of large quantities of data and are therefore limited in scalability and accuracy.
Machine learning and deep neural networks have shown promise in accurately representing and classifying sentiment, but they require large amounts of labeled data for training. In this context, the proposed strategy aims to eliminate the need for human labeling of the idiomatic lexicon and to fine-tune the classifier to handle the sentiment classification of tweets containing idiomatic expressions. We hypothesized that revealing the implicit meaning of an idiom and using it as a feature may improve sentiment classification results. Therefore, we proposed an idiom expansion and tweet enrichment method that integrates idioms as features in two tasks: the automatic annotation of an idiomatic lexicon and the sentiment classification of tweets that contain idioms. To evaluate the effectiveness of including idioms as features in sentiment analysis, we used deep transfer learning techniques, including variants of the BERT (Bidirectional Encoder Representations from Transformers) model. In doing so, we sought to investigate to what extent incorporating idioms as features could improve the results of conventional sentiment analysis. To begin, we selected and compiled a list of idiomatic expressions that may be assigned a certain sentiment. Traditionally, crowdsourcing is used to annotate idioms manually in order to build a gold-standard sentiment lexicon of idiomatic expressions. Although our preliminary experiment gave promising results, the key constraint was the substantial knowledge-engineering cost of manually creating the sentiment lexicon of idiomatic expressions that was used to provide idiom-based features. Therefore, we automated the development of such resources at scale to reduce the lag time and cost normally associated with their procurement.
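As a rough illustration of the tweet enrichment idea described above, the following Python sketch appends the plain-language meaning of any idiom found in a tweet. It is not the thesis implementation: the toy lexicon `IDIOM_MEANINGS`, the function name `enrich_tweet`, and the choice to append the meaning in brackets (rather than, say, replacing the idiom) are all assumptions made for this example.

```python
import re

# Hypothetical toy lexicon: idiom -> plain-language meaning (not taken from the thesis).
IDIOM_MEANINGS = {
    "over the moon": "extremely happy",
    "under the weather": "feeling ill",
}

def enrich_tweet(tweet: str, meanings: dict = IDIOM_MEANINGS) -> str:
    """Append the explicit meaning of each idiom found in the tweet."""
    found = []
    for idiom, meaning in meanings.items():
        # Case-insensitive whole-phrase match; inflected or split idioms are not handled.
        pattern = r"\b" + re.escape(idiom) + r"\b"
        if re.search(pattern, tweet, flags=re.IGNORECASE):
            found.append(meaning)
    if not found:
        return tweet  # no known idiom: leave the tweet unchanged
    return tweet + " [" + "; ".join(found) + "]"
```

The enriched text could then be passed to a sentiment classifier in place of the raw tweet, so that the classifier sees the idiom's meaning explicitly.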
We compared the automatically annotated sentiment lexicon against the manually annotated lexicon and obtained a precision of 90%. We then collected a dataset of tweets containing idioms and manually labeled each tweet with a sentiment polarity to serve as a benchmark dataset. Enriching the tweets with the explicit meaning of idioms led to an increase of approximately 35% in classification accuracy in the sentiment analysis of the tweet dataset.
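One plausible way to compare an automatically annotated lexicon with a manually annotated one is per-label precision, sketched below in Python. The abstract does not state how its precision figure was computed, so the definition used here, the function name `precision_for_label`, and the example labels are assumptions for illustration only.

```python
def precision_for_label(auto: dict, gold: dict, label: str) -> float:
    """Of the idioms automatically tagged `label`, return the share whose gold tag agrees."""
    # Only idioms present in the gold lexicon can be checked.
    predicted = [idiom for idiom, tag in auto.items() if tag == label and idiom in gold]
    if not predicted:
        return 0.0  # convention chosen here when nothing was predicted as `label`
    correct = sum(1 for idiom in predicted if gold[idiom] == label)
    return correct / len(predicted)

# Hypothetical example data, not from the thesis.
auto_lexicon = {"over the moon": "positive", "break a leg": "negative", "down in the dumps": "negative"}
gold_lexicon = {"over the moon": "positive", "break a leg": "positive", "down in the dumps": "negative"}
```

With this example data, two idioms are automatically tagged negative and one of them matches the gold lexicon, so the precision for the negative label is 0.5.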

Item Type:	Final Year Project / Dissertation / Thesis (PhD thesis)
Subjects:	Q Science > Q Science (General); T Technology > T Technology (General)
Divisions:	Institute of Postgraduate Studies & Research > Faculty of Information and Communication Technology (FICT) - Kampar Campus > Doctor of Philosophy (Computer Science)
Depositing User:	ML Main Library
Date Deposited:	08 Sep 2023 21:48
Last Modified:	20 Sep 2023 20:37
URI:	http://eprints.utar.edu.my/id/eprint/5620