UTAR Institutional Repository

Deep learning for hate speech detection on X (Twitter) with different word embedding techniques

Thong, Wei Xin (2024) Deep learning for hate speech detection on X (Twitter) with different word embedding techniques. Final Year Project, UTAR.

    Abstract

    This project developed hate speech detection models that pair several deep learning techniques with different word embedding techniques to detect English hate speech tweets on X (Twitter), with the goal of improving the online communication environment and reducing the suicide rate attributable to cyberbullying. The deep learning techniques used include CNN, BiLSTM, a pretrained DistilBERT model named 'distilbert/distilbert-base-uncased', and a pretrained RoBERTa model named 'facebook/roberta-hate-speech-dynabench-r4-target'. The word embedding techniques fall into two groups: those using a single word embedding technique, such as GloVe (Global Vectors for Word Representation), Word2Vec, or the word embedding vectors provided by DistilBERT and RoBERTa themselves, and those combining two different word embedding techniques by stacking them, averaging them, or taking their root mean square. Unlike earlier models that relied on word-based tokenisation during data preprocessing, this project uses subword tokenisation to tokenise the tweets in the dataset. Several papers on cyberbullying or hate speech detection using deep learning were reviewed, outlining the strengths and weaknesses of the models developed by their authors. In addition to detailing the architectures of the models used in this project, the paper explains the model development process and the techniques employed to address class imbalance and to tune hyperparameters; these are visualised or explained to give newcomers to text classification a comprehensive understanding of how the models were developed. The main focus is the performance evaluation and analysis of the DistilBERT and RoBERTa transformer models, as well as the CNN and BiLSTM models that use either a single word embedding technique or a combination of different word embedding techniques.
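
    The three ways of combining two word embeddings named in the abstract (stacking, averaging, and root mean square) can be illustrated with a minimal sketch. The abstract does not give the project's implementation, so everything here is an assumption: the function name combine_embeddings is hypothetical, "stacking" is interpreted as concatenation along the feature axis, and averaging and root mean square are taken element-wise over two vectors of equal dimension. The toy vectors are not real GloVe or Word2Vec values.

```python
import numpy as np


def combine_embeddings(a, b, method):
    """Combine two embedding vectors (or matrices of shape [tokens, dim]).

    Hypothetical helper for illustration only; not taken from the project.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)

    if method == "stack":
        # Assumed meaning of "stacking": concatenate along the feature axis,
        # so the output dimension is dim(a) + dim(b).
        return np.concatenate([a, b], axis=-1)

    # The element-wise methods only make sense when the shapes match.
    if a.shape != b.shape:
        raise ValueError("average and rms need embeddings of equal shape")

    if method == "average":
        # Element-wise arithmetic mean; output dimension is unchanged.
        return (a + b) / 2.0

    if method == "rms":
        # Element-wise root mean square; note that the result is always
        # non-negative, so the sign of each component is lost.
        return np.sqrt((a ** 2 + b ** 2) / 2.0)

    raise ValueError(f"unknown method: {method}")


# Toy 2-dimensional vectors standing in for one token's two embeddings.
glove_vec = np.array([3.0, -4.0])
w2v_vec = np.array([1.0, 0.0])

stacked = combine_embeddings(glove_vec, w2v_vec, "stack")
averaged = combine_embeddings(glove_vec, w2v_vec, "average")
rms = combine_embeddings(glove_vec, w2v_vec, "rms")
```

    Under these assumptions, stacking doubles the embedding dimension that a downstream CNN or BiLSTM layer would receive, while averaging and root mean square keep it fixed but require both source embeddings to share the same dimension.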

    Item Type: Final Year Project / Dissertation / Thesis (Final Year Project)
    Subjects: L Education > L Education (General)
    L Education > LA History of education
    T Technology > T Technology (General)
    Divisions: Faculty of Information and Communication Technology > Bachelor of Computer Science (Honours)
    Depositing User: ML Main Library
    Date Deposited: 23 Oct 2024 14:46
    Last Modified: 23 Oct 2024 14:46
    URI: http://eprints.utar.edu.my/id/eprint/6684
