Thesis by Christian Di Maio at the University of Siena for the “Artificial Intelligence and Automation Engineering, Intelligent Systems curriculum” course, supervised by Prof. Maggini
The experimental thesis, conducted in collaboration with QuestIT, initially aimed to create a model that synthesizes emotional audio from a transcript and a representative emotion label, a task known as Emotional TTS. However, during the initial project planning meetings, company requirements led to a shift towards a more general transformation. Instead of taking text as input, the model converts audio signals directly, a task known as Emotional Voice Conversion. This is an end-to-end transformation between audio signals in which the informational content and the speaker's phonetic characteristics are preserved while the emotional characteristics are altered.
First Research Stage
In the first research stage, suitable datasets for the task were identified. Since QuestIT operates primarily in the Italian market, one of the predefined constraints was to focus on the Italian language. Italian datasets for this task are scarce, and only two were identified: Demos and Emovo. The limited number of examples led to exploring other languages, which also demonstrated the effectiveness of cross-lingual approaches for this specific task. Five English-language datasets and one French-language dataset were introduced. During this phase, the representation of the audio signals was also specified, adopting the common mel spectrogram representation. The method for inverting the spectrogram back into the time domain, in place of an inverse Fourier transform, was established as well: a neural approach based on the HiFi-GAN vocoder model.
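To make the mel spectrogram representation mentioned above more concrete, the following is a minimal sketch of a triangular mel filterbank in plain Python. It is not the thesis code: the function names and the parameter values (80 mel bands, a 1024-point FFT, a 22050 Hz sample rate) are assumptions chosen for illustration, and the HTK-style mel formula is one of several conventions in use.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    # n_mels + 2 edge points, equally spaced on the mel scale from 0 Hz to Nyquist.
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(i * max_mel / (n_mels + 1)) for i in range(n_mels + 2)]
    # Centre frequency of each FFT bin in Hz.
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)    # slope up to the peak at `mid`
            falling = (hi - f) / (hi - mid)   # slope down after the peak
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def apply_filterbank(bank, power_frame):
    """Project one power-spectrum frame onto the mel bands."""
    return [sum(w * p for w, p in zip(row, power_frame)) for row in bank]

# Illustrative values only; the source does not state the actual settings.
bank = mel_filterbank(n_mels=80, n_fft=1024, sample_rate=22050)
mel_frame = apply_filterbank(bank, [1.0] * (1024 // 2 + 1))
```

Applying the filterbank to every frame of a short-time power spectrum, and usually taking the logarithm afterwards, yields the mel spectrogram. Because the mel projection discards phase and merges frequency bins, it cannot be inverted exactly, which is why a vocoder is needed to return to the time domain.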
Second Research Stage
In this stage, various approaches to Emotional Voice Conversion were analyzed. In summary, these approaches included:
StarGANv2-VC based approach: Utilizing a model originally designed for non-emotional Voice Conversion, various experiments were conducted by modifying both the internal structure of the model and the training methodology.
Transformer approach: The concept of a sequence-to-sequence approach was considered for the spectrogram, treating each time frame as a token, similar to a token generated by a text tokenizer. This approach aimed to consider the temporal evolution of the spectrogram.
ConvAutoEncoder with Recurrency approach: The experiments showed the transformer-based approach to be unsustainable, because the number of available examples was too small for a data-hungry model like the transformer. To address this issue, a classic LSTM encoder-decoder network was employed as a bottleneck between pre-trained convolutional encoders and decoders, allowing the LSTM to work on high-level abstractions and focus solely on the temporal aspects of the information.
CycleAdversarial-UNet-2D1D2D approach: Since no recurrent model produced satisfactory results, the focus shifted to the frequency components of the spectrogram. This approach explored various adaptive normalizations of the source spectrogram used in audio-based generative style conversion models (AdaIN, TFAN, etc.), comparing their strengths and weaknesses. As an alternative to standard normalizations, the use of a U-Net was considered, adapting its skip connections between encoder and decoder to inject source spectrogram features during decoding and output generation. To cope with the limited number of examples and to avoid the constraint of requiring parallel examples (same speaker, same transcript), the training scheme was adapted from CycleGAN to this specific context. Among all the models analyzed, this approach showed the most promise and was selected for a statistical analysis.
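The cycle-based training idea in the last approach can be illustrated with the standard CycleGAN cycle-consistency term, which is what allows training without parallel examples. The sketch below is not the thesis implementation: the generators are hypothetical stand-in functions on lists of floats, whereas in the actual work they would be neural networks operating on spectrograms, and the adversarial losses are omitted.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x_a, x_b, g_ab, g_ba):
    """CycleGAN-style cycle term.

    x_a, x_b : unpaired samples from emotion domains A and B
    g_ab     : generator mapping domain A to domain B
    g_ba     : generator mapping domain B to domain A
    Converting to the other domain and back should reconstruct the input.
    """
    forward_cycle = l1_distance(g_ba(g_ab(x_a)), x_a)   # A -> B -> A
    backward_cycle = l1_distance(g_ab(g_ba(x_b)), x_b)  # B -> A -> B
    return forward_cycle + backward_cycle

# Hypothetical toy "generators": a shift and its exact inverse.
def shift_up(x):
    return [v + 1.0 for v in x]

def shift_down(x):
    return [v - 1.0 for v in x]

def identity(x):
    return list(x)

x_a = [0.5, 1.5, 2.5]   # stand-in for a sample of emotion A
x_b = [4.0, 5.0, 6.0]   # stand-in for an unrelated sample of emotion B

perfect = cycle_consistency_loss(x_a, x_b, shift_up, shift_down)
broken = cycle_consistency_loss(x_a, x_b, shift_up, identity)
```

When the two generators invert each other the term is zero. When one of them fails to undo the other, the term grows, which penalizes conversions that lose the content of the source signal. Note that `x_a` and `x_b` never need to share a speaker or a transcript.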
Third Research Stage
In the third and final stage, a sample-based statistical analysis was conducted on a population of users using the Mean Opinion Score (MOS) methodology. No objective metrics were used for two reasons:
There are no objective metrics to evaluate emotional transfer.
Pre-trained models for emotion understanding from audio do not function correctly.
Despite the limited sample size of the population that completed the questionnaire, the obtained results are promising and have allowed the validation of some initial hypotheses.
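As an illustration of how a Mean Opinion Score is typically computed from questionnaire ratings, here is a minimal sketch. The ratings are invented for the example and are not results from the thesis. The 1-to-5 rating scale and the normal-approximation confidence interval are assumptions, since the source does not describe the exact procedure.

```python
import math
import statistics

def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    ratings : individual listener scores, assumed here to lie on a 1-5 scale
    z       : normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a spread")
    mos = statistics.fmean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical ratings for one converted utterance.
mos, half_width = mean_opinion_score([4, 5, 3, 4, 4])
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With a small number of respondents, as in this study, the normal approximation understates the uncertainty. A Student's t quantile in place of `z` would give a wider and more cautious interval.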