Ever since the early days of the internet, text moderation has been a staple for community managers and others invested in developing safe and inclusive spaces online. Over time, this process has evolved: from humans reading every post, which quickly fell to the sheer scale of the online world; to automated detection of keywords and phrases; and then to more sophisticated ideas like natural language processing for sentiment analysis, intent extraction, and behavioral profiling. Today, the best companies in this space have intricate, complex tools which have been honed over many languages and communities and tested rigorously against trolls and predators who try to avoid detection or punishment. They are imperfect, because even humans can’t always detect whether someone really meant to do harm, but they are battle-tested, well-understood, and reliable.
So, with online discussions moving more and more to voice chat, the natural question is - can we combine these many years of text moderation expertise with a speech-to-text system to perform voice moderation?
It’s a reasonable question to ask, but sadly, the answer is a resounding no. Text and voice are simply too different - the high-level knowledge about what bad behavior looks like carries over, but in practice, moderating voice poses fundamentally different challenges than moderating text.
The good news is that Modulate is here for you! Our ToxMod service is a fully voice-native moderation tool, which transcends the idea of simply transcribing what’s said and actually takes advantage of voice as a medium to perform moderation more effectively than text moderation could do on its own.
To understand this better, let’s break down a few of the key differences between text and voice, and how ToxMod is designed to make use of these differences.
Spelling vs Timbre
In text chat, you don’t have the ability to explicitly convey your emotion or tone. Your only degrees of freedom are your word choice and, importantly, your spelling. “Gr8” and “great” may be interpreted as the same word, but they carry different nuance, and multiple studies have shown that analyzing these sorts of spellings can predict, with extraordinarily high accuracy, characteristics of a user like their age and demographic - and even, to some extent, their intent. Modern text moderation tools rely on these cues to infer context beyond the words themselves when deciding whether a conversation is high risk and/or whether to penalize someone for potentially offensive speech.
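To make that concrete, here’s a toy sketch of how a text pipeline might treat non-standard spellings as features rather than noise. The variant list and feature names are invented for illustration - they aren’t from any particular moderation product.

```python
# Toy sketch: treating non-standard spellings as signals rather than noise.
# The variant map and feature names are illustrative assumptions.

VARIANTS = {"gr8": "great", "u": "you", "thx": "thanks"}

def featurize(message: str) -> dict:
    tokens = message.lower().split()
    normalized = [VARIANTS.get(t, t) for t in tokens]
    variant_count = sum(1 for t in tokens if t in VARIANTS)
    return {
        "normalized_text": " ".join(normalized),                # what a keyword filter would score
        "variant_ratio": variant_count / max(len(tokens), 1),   # spelling style as a demographic/intent cue
    }

print(featurize("gr8 game u two"))
```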
Voice chat, of course, doesn’t allow you to change your spelling, but it does have other powerful cues. First and foremost is the timbre of the speaker’s voice, which can be used to make estimates of age, gender, ethnicity, and even body type. Modulate designs ToxMod to take advantage of these signals, though we’re acutely aware that such classification methods are fallible and can result in substantially problematic stereotyping - for instance, the misclassification of the gender of a trans gamer. In order to mitigate this, we incorporate a variety of other signals - such as the word choice of the speakers as well as behavioral characteristics like their volume and cadence - to keep our predictions extremely accurate. In addition, these predictions are all internally modeled as probability distributions which can be handled with significant nuance - we only present specific demographic predictions to human moderators when we believe it’s materially relevant to the risk. (For instance, flagging that someone is likely a child when monitoring for the risk of online grooming.)
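To make the “probability distributions, surfaced only when relevant” idea concrete, here’s a minimal sketch of how several independent estimates might be fused into one distribution and then gated by risk category. This is an illustrative toy, not ToxMod’s actual code - the bucket names, signal sources, fusion rule, and thresholds are all assumptions.

```python
# Minimal sketch: fusing independent signals into one probability distribution
# and only surfacing it when it's material to the risk being assessed.

AGE_BUCKETS = ["child", "teen", "adult"]

def fuse(*distributions):
    """Naive product-of-experts fusion of independent probability estimates."""
    fused = [1.0] * len(AGE_BUCKETS)
    for dist in distributions:
        fused = [f * p for f, p in zip(fused, dist)]
    total = sum(fused)
    return [f / total for f in fused]

# Hypothetical per-signal estimates: timbre model, word-choice model, cadence model.
timbre = [0.55, 0.30, 0.15]
word_choice = [0.40, 0.35, 0.25]
cadence = [0.50, 0.30, 0.20]

age_dist = fuse(timbre, word_choice, cadence)

def maybe_flag(age_dist, risk_category):
    """Only surface a demographic estimate when it's relevant to the risk."""
    p_child = age_dist[AGE_BUCKETS.index("child")]
    if risk_category == "grooming" and p_child > 0.7:
        return f"likely child (p={p_child:.2f})"
    return None  # otherwise the estimate stays internal

print(age_dist)                          # fused distribution over the buckets
print(maybe_flag(age_dist, "grooming"))  # surfaced only because the risk makes it material
```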
And the above hasn’t even scratched the surface of the emotion in a voice! The way someone expresses something significantly changes the odds that it’s toxic - there’s a big difference between an “F*** yeah!” and an “F*** you!” In text, this is difficult to infer, and requires a deep understanding of the exact nature of the conversation. But in voice, we can actually hear the emotion - both in the original speaker, and in the voices of others as they respond to the statement. This gives ToxMod the incredibly powerful ability to ignore typically toxic triggers if we see that the participants in a chat are friends or otherwise are more comfortable with the particular language than other folks might be - massively reducing our false positive rate, and ensuring that moderation teams are only focusing their time on the transgressions that are creating real problems.
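As a rough illustration of how that plays out, here’s a toy scoring function where the speaker’s detected emotion and the listeners’ reactions scale the weight of a trigger phrase. The labels, weights, and thresholds are invented for the example - ToxMod’s real models are considerably more nuanced.

```python
# Illustrative sketch only: how prosody and listener reaction might scale a trigger's weight.

TRIGGER_BASE_SCORE = 0.8  # e.g. a profanity that text-only moderation would always flag

def contextual_toxicity(base_score, speaker_emotion, listener_reactions):
    # Celebration and laughter among friends drag the score down;
    # anger from the speaker or distress from listeners pushes it up.
    emotion_scale = {"anger": 1.3, "neutral": 1.0, "excitement": 0.5}.get(speaker_emotion, 1.0)
    if listener_reactions and all(r in ("laughter", "cheering") for r in listener_reactions):
        emotion_scale *= 0.5
    elif "distress" in listener_reactions:
        emotion_scale *= 1.5
    return min(base_score * emotion_scale, 1.0)

print(contextual_toxicity(TRIGGER_BASE_SCORE, "excitement", ["laughter", "cheering"]))  # ~0.2: likely ignored
print(contextual_toxicity(TRIGGER_BASE_SCORE, "anger", ["distress"]))                   # 1.0: escalate
```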
Group Size and Noise Levels
In text, you might have hundreds or even thousands of participants. That’s a fair amount of text to process, but it comes in a fundamentally serial way. Each speaker’s sentence appears in a single clear block - you don’t have to worry about two speakers’ words getting mixed together in a way where you need to pick them apart!
This has a lot of consequences, but the most important is that it’s really hard for trolls to make so much noise in text chat that people can’t just continue the conversation around them. Since there are so many other speakers in the channel, and the troll’s posts are separated from all the others, there’s very little such a troll can do structurally to mess with text chat. (Though they can certainly still say something offensive and hope it sets off a larger flame war.)
In voice chat, though, the channels tend to be smaller - and more importantly, you’re listening to every other speaker all at once, in a way that’s fundamentally mixed. What this means is that one troll can play a sufficiently loud or annoying noise and effectively seal off anyone else’s ability to speak. Attempting to dominate a voice chat channel in this way - typically called a “voice raid”, especially when multiple coordinated trolls are involved - is thus a much more significant concern than it would be in text.
ToxMod manages this situation smoothly, though, because it’s natively running inside the game client for each individual speaker. So when a troll appears and starts speaking, ToxMod identifies that their content - whether it’s a stream of racial slurs or simply death metal played at max volume - is hurting everyone else’s experience. In text chat, this would just mean deleting the text, and ToxMod can indeed simply mute the speaker automatically at this point, but it also has subtler ways to intervene if it’s unclear whether the troll is acting intentionally. For instance, ToxMod can renormalize the volume of their input to tolerable levels, play an auditory or visual warning, or actively modify (think f*** -> fork) or bleep certain words while letting the rest through (though this last intervention requires some training in your specific game before it can be done fully automatically.)
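Here’s a simplified sketch of what graduated interventions like these might look like - picking an action based on how confident the system is that the disruption is intentional, and renormalizing an overly loud buffer. The thresholds, action names, and target level are assumptions made for the example, not ToxMod’s actual logic.

```python
import numpy as np

# Illustrative sketch: choosing a graduated intervention for a disruptive speaker.

TARGET_RMS = 0.1  # "tolerable" loudness, on a [-1, 1] float audio scale

def renormalize_volume(audio: np.ndarray) -> np.ndarray:
    """Scale an overly loud buffer down toward a tolerable RMS level."""
    rms = np.sqrt(np.mean(audio ** 2))
    if rms > TARGET_RMS:
        audio = audio * (TARGET_RMS / rms)
    return audio

def choose_intervention(intent_confidence: float, disruption_score: float) -> str:
    if intent_confidence > 0.9 and disruption_score > 0.8:
        return "mute"                 # clearly intentional and harmful
    if disruption_score > 0.8:
        return "renormalize_volume"   # loud, but maybe just a bad microphone
    if intent_confidence > 0.5:
        return "warn"                 # auditory/visual warning before anything stronger
    return "monitor"

loud_clip = np.random.uniform(-1.0, 1.0, 16000)  # one second of very loud noise at 16 kHz
print(choose_intervention(intent_confidence=0.3, disruption_score=0.9))  # -> renormalize_volume
print(np.sqrt(np.mean(renormalize_volume(loud_clip) ** 2)))              # ~0.1 after renormalization
```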
Data Size and Complexity
Audio data is simply larger than text data, and this has a lot of important consequences for moderation. Firstly, text data can be logged and stored for posterity with ease, while audio data is typically ephemeral. Secondly, the cost of processing audio is substantially greater than the cost of processing text, making real-time moderation much harder.
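Some back-of-envelope numbers make the gap vivid. Assuming uncompressed 16 kHz, 16-bit mono audio and roughly 150 spoken words per minute at about 6 bytes per word:

```python
# Back-of-envelope comparison (assumptions: 16 kHz, 16-bit mono PCM audio;
# ~150 spoken words per minute at ~6 bytes per word).

audio_bytes_per_hour = 16_000 * 2 * 3600   # sample rate * bytes per sample * seconds
text_bytes_per_hour = 150 * 60 * 6         # words/min * minutes * bytes/word

print(f"audio: {audio_bytes_per_hour / 1e6:.0f} MB/hour")   # ~115 MB/hour
print(f"text:  {text_bytes_per_hour / 1e3:.0f} KB/hour")    # ~54 KB/hour
print(f"ratio: ~{audio_bytes_per_hour / text_bytes_per_hour:,.0f}x")
```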
Modulate solved both these problems by designing ToxMod with a novel triaging process. The first layer of the triaging runs on the user’s device, within their game client. (Wondering how we can possibly run neural nets on-device while only using the thin sliver of compute and memory the game can afford us? Check out this blog post about how we solved this exact problem while initially developing voice skins!)
This initial triage layer screens all speech coming from the player, making a quick and rough determination as to whether or not there’s anything to worry about. If everything looks fine, the process ends here - without ever sending any audio off the player’s device (other than normal VoIP, of course). What this means is that, even though ToxMod’s algorithms do technically hear everything, no human will ever hear any audio that’s been moderated by it unless ToxMod detects a high probability that something toxic or problematic is happening. This is another big difference between text and audio - people feel a lot more comfortable having a bot moderate their text than their speech - so making sure we can give as strong a privacy guarantee as possible is a hugely important element of ToxMod’s design philosophy.
That initial layer makes its assessment through a combination of classic tools like voice activity detection (no need to moderate silence) as well as substantially more sophisticated ones such as emotion classification based on intonation, the demographic predictions discussed earlier in this post, and a quick and dirty on-device transcription model we developed to specifically focus on the sorts of things which might indicate toxicity.
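Here’s a hypothetical sketch of what such a first pass could look like - combining voice activity detection, an emotion score, keyword hits, and a demographic estimate into a single escalate-or-not decision. The signal names, weights, and threshold are invented for the example; they aren’t ToxMod’s actual model.

```python
# Hypothetical sketch of an on-device first triage pass.

from dataclasses import dataclass

@dataclass
class TriageSignals:
    speech_detected: bool      # voice activity detection: nothing to do for silence
    anger_probability: float   # intonation-based emotion classifier, 0..1
    risky_keyword_hits: int    # hits from a small toxicity-focused transcription model
    likely_minor: float        # demographic estimate, relevant for grooming risk

def needs_closer_look(s: TriageSignals, threshold: float = 0.6) -> bool:
    if not s.speech_detected:
        return False  # silence never leaves the device
    score = (
        0.5 * s.anger_probability
        + 0.4 * min(s.risky_keyword_hits / 3, 1.0)
        + 0.1 * s.likely_minor
    )
    return score >= threshold  # True means audio is escalated to the server-side layer

print(needs_closer_look(TriageSignals(True, 0.9, 2, 0.1)))   # True: escalate off-device
print(needs_closer_look(TriageSignals(True, 0.2, 0, 0.1)))   # False: audio stays local
```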
If that first layer determines that something needs a closer look, it sends the audio off-device to Modulate’s secure servers, which house the second triaging layer. This layer mostly just involves using larger, higher-accuracy, but also more compute-heavy models which would have been too complex to run on the player’s own device. These models double-check the initial estimate made by the first layer of the triaging system. The second layer also has a bit of additional context about player histories. It doesn’t know anything about who you are - we don’t have access to the game’s records about your name, address, or anything else - but it does have records of whether previous games you’ve been in have involved toxicity, what sorts of punishments you’ve received in the past, and other related info. Thus, the first layer might have been uncertain if some mildly gender-related offenses (say, “Get back to the kitchen”) warranted immediate action, but the second layer might know that this particular player has already received multiple strikes for this sort of offense, in which case it can immediately deem it a substantial violation.
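To illustrate that logic, here’s a toy version of how a server-side layer might combine a heavier model’s score with pseudonymous strike history. Again, the categories, numbers, and thresholds are assumptions for the example, not ToxMod’s real rules.

```python
# Illustrative sketch only: combining a heavier model's score with pseudonymous player history.

def second_layer_verdict(model_score: float, category: str, strike_history: dict) -> str:
    """model_score: re-scored probability of a violation from the larger server-side model."""
    prior_strikes = strike_history.get(category, 0)
    # Repeat offenders get less benefit of the doubt in the same category.
    adjusted = min(model_score + 0.15 * prior_strikes, 1.0)
    if adjusted >= 0.85:
        return "violation"        # act immediately, or hand to moderators with a recommendation
    if adjusted >= 0.5:
        return "human_review"     # goes to the third, human layer
    return "dismiss"

history = {"gender_harassment": 3}   # strikes recorded against a pseudonymous player ID
print(second_layer_verdict(0.55, "gender_harassment", history))  # -> violation
print(second_layer_verdict(0.55, "gender_harassment", {}))       # -> human_review
```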
The third and final triaging layer isn’t an algorithm at all. Instead, if the second layer needs further validation, it will expose all the relevant details of the conversation (including a transcript, but again, not including any personally identifiable information) to a page within Modulate’s secure Admin Console, where your studio’s moderation team can view the conversation and pass judgement. Any decision, including a variety of possible interventions or determining that there was actually no serious offense, is immediately taken as training data by both earlier triage phases, allowing them to improve themselves automatically to make decisions which are even more aligned with how your moderation team approaches things.
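Conceptually, the feedback loop looks something like the sketch below: each moderator verdict is captured as a labeled example that both automated layers can learn from. The field names here are illustrative assumptions, not ToxMod’s actual schema.

```python
# Sketch of the feedback loop: a moderator's verdict becomes a labeled training example.

import json, time

def record_moderator_decision(incident_id: str, verdict: str, category: str,
                              transcript: str, training_log: list) -> None:
    example = {
        "incident_id": incident_id,
        "label": verdict,            # e.g. "violation", "no_offense", "warning_issued"
        "category": category,
        "transcript": transcript,    # no personally identifiable information stored
        "timestamp": time.time(),
        "consumers": ["on_device_triage", "server_triage"],  # both earlier layers retrain on it
    }
    training_log.append(json.dumps(example))

log = []
record_moderator_decision("inc-123", "no_offense", "profanity",
                          "f*** yeah, nice clutch!", log)
print(log[0])
```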
Since this triaging process rules out most non-toxic audio at the first layer, Modulate only needs to store a very small fraction of the audio actually going through your system (which we also typically delete after a set time anyway). It also allows us to offer ToxMod at a massively reduced cost compared to simply combining a speech-to-text system (which would need to run on all the audio) with a text moderation system.
Wrapping Up
Hopefully this blog post has helped illuminate why it’s so important to have a voice-first moderation solution for voice channels. The combination of ToxMod’s ability to understand emotional nuance, its triaging system that protects privacy and focuses resources on the most problematic speech, and its ability to deal with voice-specific problem areas like voice raids means that it can provide greater accuracy at a lower cost than any text-native moderation tool. (And of course, there are many more doors voice chat opens, for both good and bad kinds of new behaviors - and several additional ToxMod features we’ve fleshed out to manage them! Expect future blog posts on topics like differentiating honest frustration - expressed in a toxic way - from malicious intentions; using emotion and sentiment analysis to advise matchmaking algorithms towards creating more stable groups; and much more.)
If you’d like to learn more about Modulate’s ToxMod tool, we recommend you check out its product page here!