Voice Moderation Tools: To Build or Not to Build

Over the years, we've talked with many companies and organizations looking to learn more about voice moderation technology. Many of the folks we chat with are actively considering moderating voice conversations on their platform, and one question comes up again and again: "Why can't we just build our own tool?" A fair question indeed! The biggest reason our customers ultimately "buy" is simple: building really is that hard. Let's walk through some of the core reasons why building your own voice moderation tool ends up being impractical for most platforms.

The Dual Challenge of Machine Learning Expertise

Creating a voice moderation tool like ToxMod requires two very different, and in some ways opposing, types of machine learning expertise.

First, you need to build a model that's exceptionally good at understanding the nuance of spoken communication. This involves training on hundreds of millions of hours of carefully curated, labeled audio to instill a deep understanding of audio characteristics that off-the-shelf models lack. Moreover, that labeled audio needs to be domain-specific: if you're building a tool to decipher voice conversations happening in games, your models should be trained on gaming audio.
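To make that concrete, here's a minimal sketch of what one supervised training step on labeled, in-domain audio can look like in PyTorch. The tiny CNN, the one-second clip length, and the random stand-in batch are all illustrative assumptions on our part, not ToxMod's actual architecture; a production model would be far larger and would train on curated gaming audio rather than noise.

```python
# A minimal sketch of domain-specific supervised training, assuming you have
# labeled in-domain audio clips. The model and data below are stand-ins.
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 16_000
mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=64)

class ToxicityClassifier(nn.Module):
    """Small CNN over log-mel spectrograms; stands in for a real acoustic model."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        spec = torch.log1p(mel(waveform)).unsqueeze(1)  # (batch, 1, mels, time)
        return self.head(self.conv(spec).flatten(1))

def train_step(model, optimizer, waveforms, labels) -> float:
    """One supervised step on a batch of labeled, in-domain audio clips."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(waveforms), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

model = ToxicityClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Stand-in batch: one second of random audio per clip; real training would
# stream curated, labeled gaming audio here.
waveforms = torch.randn(8, SAMPLE_RATE)
labels = torch.randint(0, 2, (8,))
print(train_step(model, optimizer, waveforms, labels))
```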

Unfortunately, models accurate enough for this job tend to be large, and therefore slow and costly to run. So, second, you must figure out how to run your model cost-efficiently. And given the sheer volume of voice chat many platforms end up processing, when we say cost-efficient, we mean really, really cost-efficient. The cheapest off-the-shelf models are still roughly ten times more expensive than what we offer through ToxMod. This focus on cost-efficiency originally came from our CTO's experience building hyper-efficient machine learning models to fly on spacecraft for NASA JPL. Since then, Modulate has built on that expertise and spent years refining and optimizing models, a process that isn't easily replicated.
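For a flavor of what that optimization work involves, here's a hedged sketch of one standard cost lever: post-training dynamic quantization in PyTorch, applied to a hypothetical stand-in model. Real inference-cost engineering (distillation, batching, cascaded models, custom kernels) goes far beyond this single step.

```python
# A minimal sketch of one common cost lever: dynamic quantization, so the
# Linear layers of a (hypothetical) scoring model run in int8 on CPU.
import time
import torch
import torch.nn as nn

model = nn.Sequential(  # stand-in for a large speech-understanding model
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 2),
).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def latency(m: nn.Module, runs: int = 50) -> float:
    """Average wall-clock seconds per forward pass on a single input."""
    x = torch.randn(1, 1024)
    with torch.inference_mode():
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
    return (time.perf_counter() - start) / runs

print(f"fp32: {latency(model) * 1e3:.2f} ms/call")
print(f"int8: {latency(quantized) * 1e3:.2f} ms/call")
```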

The Complexity of Real-Time Audio Processing

Getting the audio to the machine learning models is another massive challenge. Most platforms use off-the-shelf VoIP solutions, which means they need to copy the audio within their client application in order to send it off for processing. Platforms that host their own VoIP infrastructure can grab the audio from their servers instead (avoiding the need to update their client application directly), but even then, they won't want to run the processing on the same servers that are transmitting the audio: that would risk voice chat lag or glitches, which nobody wants! So any platform seeking to conduct voice moderation will need to build some kind of real-time "record-and-upload" solution for the audio… and then they'll need cloud architecture to store that audio in a low-cost, privacy-safe way, with enough resilience to keep operating at all times while scaling up and down as needed. This sort of highly available, efficient cloud architecture requires an entirely different skill set from machine learning, and, if mismanaged, it can lead to inflated costs and significant delays in real-time actioning.
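To illustrate the "record-and-upload" shape of the problem, here's a minimal Python sketch: voice frames are copied off the hot path into a bounded queue, and a background worker batches and ships them so the audio pipeline itself never blocks. The frame sizes and the upload_chunk stub are our assumptions; a real system would encrypt the chunks, write them to object storage with retries, and handle backpressure and monitoring.

```python
# A minimal sketch of a record-and-upload pipeline. upload_chunk() is a
# hypothetical stand-in for an object-store client (e.g. S3 or GCS).
import queue
import threading
import time

CHUNK_FRAMES = 50  # e.g. 50 x 20 ms frames = 1 second per uploaded chunk
frame_queue: "queue.Queue[bytes]" = queue.Queue(maxsize=1000)

def on_voice_frame(frame: bytes) -> None:
    """Called from the VoIP path; must never block the audio thread."""
    try:
        frame_queue.put_nowait(frame)
    except queue.Full:
        pass  # drop rather than lag the call; log and alert in production

def upload_chunk(chunk: bytes) -> None:
    """Hypothetical upload; replace with an encrypted object-store write."""
    time.sleep(0.01)  # simulate network latency

def uploader() -> None:
    """Background worker: batch frames into chunks and ship them."""
    buffer: list = []
    while True:
        buffer.append(frame_queue.get())
        if len(buffer) >= CHUNK_FRAMES:
            upload_chunk(b"".join(buffer))
            buffer.clear()

threading.Thread(target=uploader, daemon=True).start()

# Simulate the audio thread delivering 20 ms frames of silence.
for _ in range(200):
    on_voice_frame(b"\x00" * 640)  # 320 samples x 2 bytes at 16 kHz
time.sleep(0.5)  # let the worker drain before the demo exits
```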

Moderator Tools and Mental Health Considerations

Finally, you need to build tools for moderators to review the audio. Have you considered how to normalize the volume of audio clips so your moderators don't develop tinnitus? What about the significantly greater mental health toll of listening to toxic audio compared to reading toxic text? The review process for audio is also inherently slower than for text: are you prepared to surface the immediately relevant text and other metadata to keep moderators moving swiftly? These are not trivial challenges, and addressing them effectively requires deep UX and UI design expertise.
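On the volume question specifically, here's a minimal sketch of the basic idea, assuming float audio samples in [-1, 1]: nudge every clip toward a common RMS level and clamp peaks before a moderator ever hits play. The target levels here are illustrative assumptions, not ToxMod's actual processing chain.

```python
# A minimal sketch of loudness normalization for moderator review clips:
# bring quiet and loud clips to one level, and hard-limit peaks so no one
# gets a scream at full scale through their headphones.
import numpy as np

def normalize_clip(samples: np.ndarray, target_rms: float = 0.1,
                   peak_ceiling: float = 0.9) -> np.ndarray:
    """samples: float32 audio in [-1, 1]; returns a level-matched copy."""
    rms = np.sqrt(np.mean(samples ** 2))
    if rms > 0:
        samples = samples * (target_rms / rms)  # match a common RMS target
    return np.clip(samples, -peak_ceiling, peak_ceiling)  # limit peaks

# A very quiet clip and a painfully loud one end up at comparable levels.
quiet = 0.01 * np.random.randn(16_000).astype(np.float32)
loud = np.clip(2.0 * np.random.randn(16_000), -1, 1).astype(np.float32)
for name, clip in [("quiet", quiet), ("loud", loud)]:
    out = normalize_clip(clip)
    print(name, f"rms={np.sqrt(np.mean(out ** 2)):.3f}")
```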

So… Do You Really Want to Reinvent the Wheel?

Given the complexity, cost, and specialized expertise required to build an effective voice moderation tool, the question isn't just whether you can build it, but whether you should. The idea of building your own tool may seem appealing, but the reality is far harder than it appears: it requires a rare blend of machine learning expertise, real-time audio processing capabilities, and comprehensive moderator-support tooling, not to mention ongoing upkeep, including tracking all the new phrases, behaviors, and trends your users might exploit to offend or ostracize others. Instead of reinventing the wheel, consider leveraging the proven solutions and expertise available with ToxMod.
