Abstract
Music emotion recognition (MER) deals with classifying music by emotion using signal processing and machine learning techniques. The emotion ontology for music is not yet well established. Musical emotion can be conceptualized through various emotional models: categorical, dimensional, or domain-specific. Emotion can be represented with a single label, multiple labels, or
probability distributions. The time scale to which an emotion label can be applied ranges from half a second to a complete musical piece. Describing musical audio with emotional labels is an inherently subjective task, and the MER field relies on ground truth data from human labelers. The quality of the ground truth labels is crucial to the performance of the algorithms trained on these data: a lack of agreement between annotators leads to conflicting cues and poor discriminative ability of the algorithms. Conceptualizing musical emotion in a way that is most natural for the listener is therefore crucial both for creating better ground truth and for building intuitive music retrieval systems. In this thesis we mainly deal with the problem of representing musical emotion. The thesis consists of three parts.

Part I. In this part we model induced musical emotion. We create a game with a purpose, Emotify, to collect emotional labels using the Geneva Emotional Music Scales (GEMS) as an emotional model. The game is able to produce high-quality ground truth, but some modifications to the GEMS model are suggested. We use the data from the game to create a computational model. We show that the performance of the model can be improved substantially by developing better features, and that this step is more crucial than finding a more sophisticated learning algorithm. We suggest new features that describe the harmonic content of music. A much bigger improvement in performance is expected once high-level musical concepts such as rhythmic complexity, articulation, or tonalness can be modeled.

Part II. In this part (in collaboration with M. Soleymani and Y.H. Yang) we create a benchmark for Music Emotion Variation Detection (MEVD) algorithms, which track per-second changes in musical emotion. We describe the steps taken to improve the quality of the ground truth and the benchmark evaluation metrics, and we conduct a systematic evaluation of the algorithms and feature sets submitted to the benchmark. The best approach is to develop separate feature sets for the Valence and Arousal dimensions and to incorporate local context, either through algorithms capable of learning from time series (LSTM-RNNs) or through smoothing.

Part III. In this part we build on the experience gained in organizing the benchmark and suggest that a better approach to MEVD is to view music as a succession of emotionally stable segments and unstable transitional segments. We list the reasons why the established MEVD approach is flawed and cannot produce good-quality ground truth, and we propose an approach based on a CNN combined with MER-informed filtering.

Three public data sets, corresponding to the three parts of the thesis, are released.
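As a rough illustration of the kind of audio feature extraction Part I relies on, the sketch below computes chroma statistics with librosa as a generic stand-in for a harmony-related descriptor. The file name is a placeholder, and these are ordinary chroma features, not the specific harmonic features proposed in the thesis.

    import librosa
    import numpy as np

    # Load audio and compute 12 pitch-class energy bands over time.
    y, sr = librosa.load("example_track.wav")        # placeholder file name
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)  # shape: (12, frames)

    # Summarize the time axis into a fixed-length descriptor:
    # per-pitch-class mean and standard deviation.
    feature_vector = np.concatenate([chroma.mean(axis=1), chroma.std(axis=1)])
    print(feature_vector.shape)  # (24,)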
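One of the two context mechanisms named in Part II is smoothing the per-second predictions. Below is a minimal sketch of such post-hoc smoothing, assuming a 1-D array of per-second Valence or Arousal predictions and a hypothetical window length.

    import numpy as np

    def smooth_predictions(preds, window=5):
        # Centered moving average over a per-second emotion time series.
        # `window` (in seconds) is a hypothetical choice, not a value
        # prescribed by the benchmark.
        kernel = np.ones(window) / window
        # mode="same" keeps the output aligned with the input time axis.
        return np.convolve(preds, kernel, mode="same")

    # Example: smooth noisy per-second Arousal predictions.
    arousal = np.array([0.1, 0.4, 0.2, 0.5, 0.3, 0.6, 0.4])
    print(smooth_predictions(arousal, window=3))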
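The segment-based view of Part III can be illustrated with a toy rule: mark a second as transitional when the local change of the emotion curve exceeds a threshold. The threshold and the example curve here are made up for illustration; the method proposed in the thesis is a CNN combined with MER-informed filtering, not this heuristic.

    import numpy as np

    def label_segments(curve, threshold=0.05):
        # Absolute per-second change of the emotion curve; `prepend` keeps
        # the output the same length as the input.
        change = np.abs(np.diff(curve, prepend=curve[0]))
        # Seconds changing faster than the (hypothetical) threshold are
        # transitional, the rest belong to emotionally stable segments.
        return np.where(change > threshold, "transitional", "stable")

    valence = np.array([0.20, 0.21, 0.22, 0.50, 0.52, 0.53])
    print(label_segments(valence))
    # ['stable' 'stable' 'stable' 'transitional' 'stable' 'stable']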