AI, deep learning, and the future of audio processing

Author: Christian Steinmetz | @csteinmetz1

The Deep Learning Revolution

Deep learning has brought significant advancements across nearly all fields and sectors of industry, and audio has been no exception. However, most modern deep learning work on audio processing revolves around speech for telecommunications. This means lower sample rates (8 kHz to 16 kHz) and a much higher tolerance for artifacts, since intelligibility is the goal. More general audio production is far less tolerant of artifacts and often requires much higher sample rates as well. There have been many recent research advancements in music signal processing using deep learning, but there has not been much focus on audio engineering applications thus far. Below are some popular examples from 2020 that focus on music applications, which are certainly promising for the future of audio and music.

Music source separation (Spleeter)
Synthesis and timbre transfer (DDSP)
Complete music synthesis (Jukebox)

Signal Processing 2.0

Audio signal processing 2.0 will likely come as a result of integrating modern deep learning techniques into our audio processing. This means replacing our current set of DSP tools with learnable/trainable components that potentially provide more intuitive interfaces.
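To make "learnable/trainable component" concrete, here is a minimal sketch, assuming the simplest possible processor: a single gain whose value is fitted by gradient descent so that the output matches a reference signal. The function name `fit_gain`, the test signal, and the "hidden" gain of 0.5 are all illustrative assumptions; real systems would use a deep learning framework and far richer processors.

```python
import numpy as np

def fit_gain(x, target, steps=200, lr=0.5):
    """Learn one gain value so that gain * x approximates target (mean squared error)."""
    gain = 1.0  # initial guess
    for _ in range(steps):
        error = gain * x - target
        grad = 2.0 * np.mean(error * x)  # derivative of mean(error**2) w.r.t. gain
        gain -= lr * grad                # plain gradient descent step
    return gain

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 48000)  # one second of noise at 48 kHz (assumed test input)
target = 0.5 * x                   # reference produced with a "hidden" gain of 0.5
learned_gain = fit_gain(x, target)
print(round(learned_gain, 4))
```

The same idea, a parameter adjusted automatically to match a desired sound rather than set by hand, is what would extend to more complex effects.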

Learnable components of this kind have clear applications for audio engineering. In music production we are always trying to achieve some sonic goal. Currently, the best way to achieve that goal is to develop a deep understanding of, and intuition for, how to use a wide array of audio effects (for example, learning how to control the attack and release times of a compressor, or how to set the Q factor of a parametric EQ).

My research asks:

<aside> 💡 Is this paradigm optimal to help audio engineers achieve their sonic goals?

My feeling is that it's likely not optimal, and that there exists a set of tools that enables us to more easily and more intuitively achieve our sonic goals. Currently, it seems that deep learning may be one route towards building those tools.

</aside>

There are likely three main results of the deployment of deep learning in music production:

Elevate the baseline level of audio quality for productions of any skill level. (We are here)
Expedite the workflow of professionals through automation.
Unlock previously unavailable creative avenues for sonic exploration.

Intelligent Music Production

Now let's consider some specific technologies that could aid music production as a result of this so-called Signal Processing 2.0. In my work, I consider three main stages within the music production process: Fix, Fit, and Feature, a nice alliteration created by Alex Case, a well-known audio engineering educator. In his book Mix Smart, Case describes the process where audio engineers "Fix problematic areas, Fit all those tracks together, and Feature the parts of the music that really make the tune sing". The goal of my research is to investigate how deep learning can aid audio engineers in this three-step process.

For the Fix stage, this could mean building a system that removes significant background noise that has corrupted an otherwise perfect vocal take. It may also be able to remove the annoying room tone that comes from recording in a home studio with untreated surfaces, transforming nearly any space into a quality recording studio. In the case of live recordings, such a system could also suppress bleed from other instruments, improving the separation. The ability to Fix recordings with this level of precision will enable anyone to record nearly anywhere without significantly degrading the audio quality.

While Fix addresses more straightforward tasks, Fit considers the more challenging task of creating a cohesive mixture. This involves understanding the complex interactions between the different elements in the mix, and searching an extremely large space of possibilities that covers all of the different plugins, and their settings, that could be applied to each track. Here the goal is to build a system that lets us search this space and home in on the regions that are likely of interest to us: points that create high-quality, interesting, and engaging mixes that serve the music. You could imagine having just three controls to adjust a mix, where turning them causes the system to swap out plugins and change their settings across all elements in the mix automatically. In this case, every position of those three knobs results in an amazing mix. The role of the audio engineer then shifts more towards that of a curator, who uses the system to find a mix that excites them. No more fiddling with attack and release times to get the sound you desire.
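The three-knob idea can be sketched as a mapping from a few macro controls to many per-track effect parameters. The following is only an illustration under stated assumptions: the parameter names, their ranges, and the function `knobs_to_params` are hypothetical, and the random `weights` and `bias` stand in for a model that would have to be learned from data so that every knob position yields a good mix.

```python
import numpy as np

# Hypothetical per-track parameters and their allowed ranges (assumptions).
PARAM_NAMES = ["gain_db", "eq_q", "comp_attack_ms", "comp_release_ms"]
PARAM_MIN = np.array([-24.0, 0.3, 0.1, 10.0])
PARAM_MAX = np.array([6.0, 10.0, 100.0, 1000.0])

def knobs_to_params(knobs, weights, bias):
    """Map a few macro knobs in [0, 1] to bounded parameters for every track."""
    knobs = np.asarray(knobs, dtype=float)
    raw = weights @ knobs + bias       # shape: (num_tracks, num_params)
    unit = 1.0 / (1.0 + np.exp(-raw))  # sigmoid squashes values into (0, 1)
    return PARAM_MIN + unit * (PARAM_MAX - PARAM_MIN)

rng = np.random.default_rng(0)
num_tracks, num_knobs = 4, 3
# Random placeholders for a trained mapping (not a real model).
weights = rng.normal(size=(num_tracks, len(PARAM_NAMES), num_knobs))
bias = rng.normal(size=(num_tracks, len(PARAM_NAMES)))

params = knobs_to_params([0.2, 0.8, 0.5], weights, bias)
for track_index, track_params in enumerate(params):
    print(track_index, dict(zip(PARAM_NAMES, np.round(track_params, 2))))
```

Moving any one knob changes settings on every track at once, which is the behavior described above; the hard research problem is learning a mapping whose outputs are consistently good mixes.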

In the final stage, there is also significant potential for deep learning to aid the Feature process. Neural networks can unlock an entirely new space of signal processing possibilities. In the future, we may not be restricted to the current paradigm of thinking about audio effects as reverb, delay, EQ, or compression. Instead, neural networks enable a new space of audio effects formed by interpolating or mixing these effects and others, which audio engineers and musicians can explore to suit their artistic goals. Just as the misuse of the electric guitar amplifier led to rock and roll, it's quite possible that neural networks will lead to an entirely new genre of music with unique effects and timbres.
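As a rough intuition for what "interpolating between effects" could mean, the sketch below crossfades the outputs of two conventional effects with a single control. This is a deliberately simple stand-in, not a neural network: a learned model could interpolate in a much richer way than a linear blend. The functions `soft_clip`, `echo`, and `blend_effects`, and all parameter values, are assumptions chosen for illustration.

```python
import numpy as np

def soft_clip(x, drive=5.0):
    """Simple tanh distortion, normalized so an input of 1.0 maps to 1.0."""
    return np.tanh(drive * x) / np.tanh(drive)

def echo(x, delay_samples=4800, echo_gain=0.5):
    """Feed-forward echo: the input plus one delayed, attenuated copy."""
    y = x.copy()
    y[delay_samples:] += echo_gain * x[:-delay_samples]
    return y

def blend_effects(x, alpha):
    """Linear crossfade: alpha=0 is pure distortion, alpha=1 is pure echo."""
    return (1.0 - alpha) * soft_clip(x) + alpha * echo(x)

sample_rate = 48000
t = np.arange(sample_rate) / sample_rate
x = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # one second of a 220 Hz sine
y = blend_effects(x, alpha=0.3)          # mostly distortion, with some echo
```

Sweeping `alpha` from 0 to 1 moves continuously between the two effects, giving one control over a small family of sounds that neither effect produces alone.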