Audio Sentiment Analysis with Spectrogram Representations and Transformer Models
- Yang Liu (b) (Author),
- Mohd Anwar (a) (Author)
- (a) North Carolina Agricultural and Technical State University,
- (b) North Carolina Central University
Abstract
Recent advancements in computer vision have enabled the analysis of audio spectrograms. In this paper, we present a novel approach that combines spectrogram representations with Transformer architectures for multilingual audio sentiment analysis. We investigate the suitability of Transformer models for this task and evaluate their effectiveness relative to convolutional neural network (CNN) models. Our findings show that Transformer models outperform CNN models on audio sentiment analysis. We built a Transformer model with multiple encoders, each using multi-head attention and a feed-forward neural network, to analyze audio sentiment. We then applied layer normalization and dropout, followed by hyperparameter tuning, which yielded the best performance in our experiments. The Transformer model achieved the highest accuracy, 87%, in an experiment classifying sentiment in multilingual film clips from the EmoFilms dataset. In conclusion, our research presents a novel approach to audio sentiment analysis that uses spectrograms derived from a multilingual dataset, a method that has not been extensively explored in this field.
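The pipeline described in the abstract (audio to spectrogram, then stacked encoders with multi-head attention, feed-forward layers, layer normalization, and dropout) can be illustrated with a minimal NumPy sketch of the forward pass. This is not the authors' implementation: the frame length, hop size, model width, number of heads, number of layers, and the three-class output are all assumptions chosen for illustration, the weights are random rather than trained, and positional encodings and learnable normalization parameters are omitted for brevity.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude STFT: rows are time frames, columns are frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

def layer_norm(x, eps=1e-5):
    # Normalize each time step over the feature axis (no learnable gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dropout(x, rate, rng, training):
    # Inverted dropout; a no-op at inference time.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def init_encoder(rng, d_model, d_ff):
    shapes = {"wq": (d_model, d_model), "wk": (d_model, d_model),
              "wv": (d_model, d_model), "wo": (d_model, d_model),
              "w1": (d_model, d_ff), "w2": (d_ff, d_model)}
    return {name: rng.normal(0.0, 0.02, shape) for name, shape in shapes.items()}

def encoder_layer(x, p, n_heads, rng, rate=0.1, training=False):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    # Scaled dot-product attention, computed independently per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = (attn @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    x = layer_norm(x + dropout(heads @ p["wo"], rate, rng, training))
    # Position-wise feed-forward network with ReLU.
    ff = np.maximum(0.0, x @ p["w1"]) @ p["w2"]
    return layer_norm(x + dropout(ff, rate, rng, training))

def classify(spec, params, n_heads, rng, training=False):
    x = spec @ params["embed"]              # project frequency bins to d_model
    for layer in params["layers"]:
        x = encoder_layer(x, layer, n_heads, rng, training=training)
    logits = x.mean(axis=0) @ params["head"]  # mean-pool over time frames
    return softmax(logits)

# Hypothetical sizes; none of these values come from the paper.
rng = np.random.default_rng(0)
sample_rate, d_model, d_ff, n_heads, n_layers, n_classes = 8000, 64, 128, 4, 2, 3

# One second of a synthetic 440 Hz tone stands in for a film clip.
t = np.arange(sample_rate) / sample_rate
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))

params = {
    "embed": rng.normal(0.0, 0.02, (spec.shape[1], d_model)),
    "layers": [init_encoder(rng, d_model, d_ff) for _ in range(n_layers)],
    "head": rng.normal(0.0, 0.02, (d_model, n_classes)),
}
probs = classify(spec, params, n_heads, rng)
print(spec.shape, probs.shape)
```

Each spectrogram time frame is treated here as one token, so attention operates across time. With untrained weights the class probabilities carry no meaning; the sketch only shows how the components named in the abstract fit together.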
