Modulate uses model distillation to run our voice skin networks more efficiently on a variety of devices. In distillation, we take a large "master" model capable of converting any voice into any other voice, and use its input/output pairs to train a smaller model to duplicate that behavior. By distilling our voice skins before they're shipped, we're able to decouple the difficulty of adversarial training from the final voice skin capacity. During unsupervised adversarial training, we want the models to have as much capacity as possible, since the loss signals can be weak, unclear, and vary over time; allocating a large capacity at the start of training provides more pathways through the network for finding good combinations of weights. However, once the final master model is done training, it's difficult to tell whether all of that capacity is needed for inference, or whether it was merely important for training.
Using distillation, we can separately experiment with how much capacity is needed for voice skin inference. The distilled models undergo supervised training, taking the master model's inputs and outputs and trying to duplicate them. This is a training regime with far clearer signals: the objective function is stationary over time, and the targets are themselves the outputs of a neural network, so they're clearly representable by a network with roughly the target architecture. These strong training signals let us pare down the network capacity that was allocated purely for training, to something approaching the final capacity required for inference. There are still efficiency gains to be made after training through pruning, quantization, and so on, but the effect is not nearly as strong as it is with the master model.
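To make this concrete, here is a minimal sketch of such a supervised distillation step in PyTorch. The `TinyVoiceModel` architecture, the channel counts, and the plain L1 waveform loss are illustrative stand-ins rather than our production networks or objectives; the point is only the shape of the loop: a frozen master provides targets, and a smaller student is trained to match them.

```python
# Minimal sketch of the supervised distillation loop: a frozen master model
# produces target audio, and a smaller student is trained to reproduce it.
import torch
import torch.nn as nn

class TinyVoiceModel(nn.Module):
    """Stand-in 1-D convolutional audio-to-audio network (not our real architecture)."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=9, padding=4),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        return self.net(audio)

master = TinyVoiceModel(channels=256).eval()   # large, adversarially trained
student = TinyVoiceModel(channels=32)          # small, for on-device inference
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distillation_step(input_audio: torch.Tensor) -> float:
    with torch.no_grad():                      # the master's weights stay frozen
        target_audio = master(input_audio)     # "ground truth" for the student
    prediction = student(input_audio)
    loss = nn.functional.l1_loss(prediction, target_audio)  # stationary objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a batch of eight 1-second, 16 kHz mono clips.
loss_value = distillation_step(torch.randn(8, 1, 16000))
```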
DISTILLED MODELS FOR PRODUCT
Introducing distillation provides a significant win for product capabilities, letting us easily adapt our neural network capacity to accommodate a variety of devices. Traditionally, deploying a single model to game clients running on desktop computers, laptops, consoles, and phones would force every platform to settle for a model sized for the lowest common denominator of performance. With distillation, we can use the same master model to distill out several different runtime models, each with varying capacity and therefore compute efficiency. These can be provided to end users as different "quality settings", analogous to graphics settings in games.
By distilling out several versions of a voice skin with different numbers of parameters, we can ship multiple "quality settings" to end users with different hardware capacities, so that they can trade quality for performance on the fly. The performance gains from decreasing the number of parameters are better than linear, as smaller weight matrices fit better into caches.
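As a rough illustration of how such tiers might be expressed, reusing the `TinyVoiceModel` stand-in from the sketch above (the tier names and channel counts are invented for the example):

```python
# Hypothetical quality tiers: each distilled student reuses the same training
# loop; only the capacity (here, the channel count) changes per tier.
QUALITY_TIERS = {
    "low":    {"channels": 16},   # phones, older hardware
    "medium": {"channels": 32},   # consoles, laptops
    "high":   {"channels": 64},   # desktop GPUs
}

students = {
    name: TinyVoiceModel(channels=cfg["channels"])
    for name, cfg in QUALITY_TIERS.items()
}
# Each student is distilled against the same master and shipped as a separate
# "quality setting" that the game client can switch between at runtime.
```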
Often, we can see significant performance gains by drastically reducing the number of distilled model parameters without sacrificing much, if any, audio quality. This is because we can additionally trade off model capacity for development-time complexity by training each distilled model to target only a single output voice. The problem the master model solves, converting any voice into any other voice, is drastically reduced to the problem of converting any voice into a single target voice. Therefore, a large amount of the capacity used by the master model to represent the vocal characteristics of different target speakers is no longer required. This process increases development time and cost compared to simply shipping the master model, since it introduces a new training step for distilling each target voice down to a more efficient model, and it means that in order to ship multiple voice skins to our customers we must distill out one model per voice. Fortunately, distilling several voices from a single master model can be done in parallel, and given the appropriate tools and scripts, requires little additional engineering effort.
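A sketch of how per-voice distillation might be structured, again reusing the illustrative `TinyVoiceModel` from above and assuming, hypothetically, a master model that takes a target-speaker embedding as a conditioning input which each single-voice student bakes into its weights:

```python
# Hypothetical per-voice distillation: the any-to-any master is conditioned on a
# target-speaker embedding, while each student converts to exactly one voice and
# therefore needs neither the conditioning input nor the capacity to handle it.
import torch

def distill_single_voice(master, speaker_embedding, clips, channels=16):
    """Train one small student to mimic the master for a single target voice."""
    student = TinyVoiceModel(channels=channels)   # stand-in from earlier sketch
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    for input_audio in clips:
        with torch.no_grad():
            target = master(input_audio, speaker_embedding)  # conditioned master
        loss = torch.nn.functional.l1_loss(student(input_audio), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return student

# One independent job per shipped voice skin, so the jobs can run in parallel:
# students = {name: distill_single_voice(master, emb, training_clips)
#             for name, emb in voice_embeddings.items()}
```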
Finally, distillation can help performance by separating the runtime architecture from the ideal adversarial training architecture. When we train our master model, the losses, dataset, architectures, and training parameters are all optimized towards producing the best-sounding voice outputs, regardless of how efficiently they run on real hardware. With the cleaner training signals from distillation, we can separately explore new network architectures and layer types to optimize for performance, without hurting the training capability of the master model. This commonly involves changing the receptive field of the distilled model to reduce the amount of memory taken up by neural network state, reducing the number of layers in the distilled model, or combining the functionality of multiple separate layers into one, such as by fusing a pre-processing network directly into the voice skin network and training them all together. Since distillation is a single supervised training objective, any boundaries between parts of the network that were kept separate for training reasons (such as shared embedding layers between the generator and adversary) can be removed, and that capacity repurposed entirely towards improving the performance of the distilled network.
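As a simplified illustration of that kind of restructuring (the layer shapes and the pre-processing stage below are invented for the example), a fused student might fold what was a separate pre-processing network in the master pipeline into the first layers of a single small stack, with small kernels and dilations keeping the receptive field, and hence the streaming state, modest:

```python
# Illustrative fused student: what were two separately trained networks in the
# master pipeline (pre-processing + voice skin) become one small stack, with
# small kernels/dilations to shrink the receptive field and the runtime state.
import torch.nn as nn

class FusedStudent(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            # Formerly a standalone pre-processing network, now simply the
            # first layers of the student, trained jointly with the rest.
            nn.Conv1d(1, channels, kernel_size=5, padding=2),
            nn.ReLU(),
            # Voice skin layers proper, with a modest dilation so the total
            # receptive field (and any cached state for streaming) stays small.
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, audio):
        return self.net(audio)
```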
DISTILLED MODELS FOR INFERENCE FRAMEWORKS
In addition to directly benefiting runtime performance, distillation provides an indirect benefit by stabilizing the structure of the inference model. By separating the inference model's architecture from the master model's, we avoid constantly rewriting the distilled model inference code any time the master model changes, giving our core engineering team time to optimize the distilled model's performance across a variety of platforms and to get a thorough understanding of its bottlenecks.
The vast majority of machine learning research at Modulate goes towards improving the master model's performance, which directly determines the highest quality voice skins that our models are capable of producing. Of that research, much goes into iterating on loss functions, hyperparameters, and the structure of supporting neural networks that provide conditioning information or adversarial losses. While some research goes into optimizing the master model architecture directly, significant gains can often be found instead by adjusting the dataset or training environment. And when the master model's architecture does change significantly, its inputs and outputs are invariably still raw audio, meaning the new model can slot painlessly into the distilled model training framework simply by providing fresh input/output training pairs.
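One way to picture that interface (the function and type names here are ours, for illustration only): anything that maps raw audio to raw audio can serve as the source of distillation targets, so upstream architecture changes never touch the framework itself.

```python
# Hypothetical interface: any callable mapping raw audio to raw audio can act as
# the "master" that supplies distillation targets, so changes to the master's
# architecture require no changes to the distillation framework.
from typing import Callable, Iterable, Iterator, Tuple
import torch

AudioToAudio = Callable[[torch.Tensor], torch.Tensor]

def generate_training_pairs(
    master: AudioToAudio,
    clips: Iterable[torch.Tensor],
) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
    """Yield (input, target) training pairs for the student from any master."""
    for clip in clips:
        with torch.no_grad():   # targets only; no gradients through the master
            target = master(clip)
        yield clip, target
```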
The distilled model architecture receives very little iteration, and for good reason: it's easy to measure when it is fulfilling its training objective! The distilled models have both a meaningful training loss (the degree to which they match the master model's output) and ground truth targets to compare against during QA (the master model's output). As a result, it was easy to iterate on the distilled model early on, to find which tradeoffs keep the distilled model entirely faithful in reproducing the master model's output and which tradeoffs begin to degrade quality in exchange for computational efficiency. At Modulate, distilled model design, training, evaluation, and iteration took comparatively little total time; and while distilled model performance is revisited often to ensure that the distilled models are faithfully reproducing the master model's output, no substantial changes to the distilled model architecture have been needed in over a year. During that time, the core engineering team has been able to keep optimizing the distilled model inference code without interruption, leading to better performance overall and no time wasted on costly code rewrites!
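For instance, a QA check against the master's output might look something like the sketch below; the spectral-distance metric and the acceptance threshold are illustrative, not our actual criteria.

```python
# Illustrative QA check: because the master's output is the ground truth, the
# distilled model can be scored directly against it on a held-out set.
import torch

def spectral_distance(a: torch.Tensor, b: torch.Tensor, n_fft: int = 1024) -> float:
    """Mean log-magnitude STFT distance between two batches of mono waveforms."""
    window = torch.hann_window(n_fft)
    spec_a = torch.stft(a, n_fft, window=window, return_complex=True).abs()
    spec_b = torch.stft(b, n_fft, window=window, return_complex=True).abs()
    return (torch.log1p(spec_a) - torch.log1p(spec_b)).abs().mean().item()

def qa_check(student, master, heldout_clips, threshold: float = 0.05) -> bool:
    """Flag the distilled model if it drifts too far from the master's output."""
    with torch.no_grad():
        scores = [
            spectral_distance(student(clip).squeeze(1), master(clip).squeeze(1))
            for clip in heldout_clips
        ]
    return max(scores) <= threshold  # illustrative acceptance criterion
```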
CONCLUSION
Distillation is a key technique enabling Modulate to deliver high quality, high performance voice skins to customers. Not only is distillation a valuable product tool for trading quality against performance depending on device constraints, it can also be used to trade development time for runtime performance by splitting out different capabilities into separate networks. Finally, distillation benefits collaboration between core engineering and ML research by explicitly separating runtime performance from model quality into distinct training steps with their own tradeoffs. This separation also avoids rewriting inference code as models change, giving engineers more time to optimize its performance across different devices.