Sue Hyun Park April 21, 2021

<aside> 🔗 We have reposted this blog on our Medium publication. Read this on Medium.

</aside>

<aside> 🎵 Music is a type of art that we hear. Music, especially the vocals in it, is powerful in that it evokes strong emotional responses, but it is hard to describe exactly what catches our ear. Can artificial intelligence create singing voices people love? In this post, we introduce our research on a single- and multi-singer singing voice synthesis system. See how this system generates hyper-realistic and expressive human singing voices, and what its potential impact on the media ecosystem could be.

</aside>

The feelings we get when listening to music can be very personal and quite indescribable. Think of a moment when you listened to a powerful and moving song. You may have felt a cascading sensation because of the singer's emotional vocals, or the nostalgic atmosphere of the piece, or some other mixed source that you can neither define nor feel the need to. Among the various forms of art, it seems reasonable to say that music has an abstract modality: one's perception of it relies on individual preferences and feelings, and is not simply determined by consensus.

Meanwhile, the music industry keeps evolving, transforming not only the way people consume music but also the tools used to record and distribute content. With the shift from records to MP3 players and streaming platforms, we can now easily explore a myriad of songs that entertain us with layers of creative sounds. The main catalyst in this transformation was the emergence of the Musical Instrument Digital Interface (MIDI). With virtual instruments, producers could widen the scope of musical expression. As such, we can expect digital innovation to drive yet another revolution in music and recording production.

Our team investigates the potential of this new era, starting by handling the abstract modality of music. Specifically, we notice an unresolved task in song-making: finding the right vocals with ease. MIDI is undoubtedly helpful in getting your creativity out in instrumental form, but there is still a long search ahead for the right person with the right voice to complete your picture. It is hard to communicate your inspiration unless you can hear the voice and capture the feeling.

Why not make the voice you dreamed of?

In this post, we explain our research on singing voice synthesis (SVS), a task that generates a natural singing voice from given sheet music and lyrics information. We propose an end-to-end Korean singing voice synthesis system that can output realistic and expressive singing voices, from inputs of original single- or multi-singer voices and some information about the desired music the voices will be incorporated into. Through visual evaluation and a user study including listening tests, we experimentally verify that the proposed system can synthesize singing voices that cannot be distinguished from those of real humans. Using this system, a music producer can easily make a sample voice sing a totally different song and find a voice that fits their work. Through this research, we hope to solve the problem of abstractness in the art of music, and thereby boost content creation.

Our Approach


The singing voice synthesis (SVS) task is challenging in that it requires control over the duration and pitch of each syllable. A singing voice varies more than the natural speech produced by a plain text-to-speech (TTS) system, and multiple singers as inputs add even more color and dimension. To address these complexities, we set the goals of our research as follows:

We first devise a single-singer SVS system to stabilize the performance of pronunciation and pitch in the generated voice. This serves as a backbone of a multi-singer system, which in turn integrates and flexibly controls the features of singer identity.
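To make the requirement of per-syllable duration and pitch control more concrete, here is a minimal Python sketch of how a score might be represented and expanded to frame level. It is an illustration only, not the input format of our system: the `Note` class, its field names, the `expand_to_frames` helper, and the 10 ms frame length are all assumptions made for this example. The only standard piece is the equal-temperament conversion from a MIDI note number to a frequency in Hz.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Note:
    """One lyric syllable with its target pitch and length (hypothetical format)."""
    syllable: str      # lyric syllable to be sung
    midi_pitch: int    # MIDI note number, where 69 is A4
    duration_s: float  # how long the syllable is held, in seconds


def midi_to_hz(midi_pitch: int) -> float:
    """Standard equal-temperament conversion (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def expand_to_frames(notes: List[Note], frame_s: float = 0.01) -> List[Tuple[str, float]]:
    """Repeat each syllable and its target frequency once per frame.

    The frame length (10 ms by default) is an assumed value for illustration.
    """
    frames: List[Tuple[str, float]] = []
    for note in notes:
        n_frames = max(1, round(note.duration_s / frame_s))
        frames.extend([(note.syllable, midi_to_hz(note.midi_pitch))] * n_frames)
    return frames


# Two syllables: A4 held for 0.5 s, then C5 held for 0.25 s.
score = [Note("la", 69, 0.5), Note("la", 72, 0.25)]
frames = expand_to_frames(score)
print(len(frames), frames[0])
```

Changing `duration_s` or `midi_pitch` for a single `Note` changes only the frames of that syllable, which is the kind of fine-grained control a plain TTS input does not need to expose.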