Voice Intelligence for Fraud Detection

These days, there is a lot of discussion about “general artificial intelligence” - AI systems capable of understanding such a wide array of topics that they can truly mimic the capabilities of people.

As fascinating as these generalized models are, though, they face real challenges to implementation in the business world. The unpredictability of hallucinations and other misunderstandings makes these systems difficult to trust in high-impact or heavily regulated industries. The very sophistication of these models also makes their behavior difficult to guardrail and understand. And fundamentally, most business applications don’t need a model that understands color theory, how to build a boat, the history of Rome, and a thousand other things - and they certainly don’t want to pay for capabilities they don’t need.

For all these reasons, Modulate took a different approach when designing ToxMod. Rather than a single massive, unpredictable, expensive black-box model, ToxMod uses our proprietary Voice Intelligence Model, which combines and layers a vast collection of narrower models, each targeted to a specific analysis such as detecting a particular emotion or understanding a particular type of back-and-forth in dialogue. This design allows us to achieve substantially reduced costs while matching or exceeding the accuracy of the top all-in-one models out there - and additionally ensures ToxMod can always offer clear, explainable (and true) justifications for why it comes to a certain conclusion.
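To make the idea of layering narrow models concrete, here is a minimal Python sketch. It is not ToxMod’s implementation: the `Signal` type, the `keyword_detector` helper, the detector names, the scores, and the threshold are all hypothetical, and simple phrase matching on text stands in for real voice models. What it illustrates is only the shape of the design: each narrow detector reports its own evidence, so the final verdict can always be traced back to specific reasons.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Signal:
    detector: str   # name of the narrow detector that fired
    score: float    # confidence between 0 and 1 (illustrative values)
    evidence: str   # human-readable reason for the score


# A detector inspects one utterance and returns a Signal, or None if it saw nothing.
Detector = Callable[[str], Optional[Signal]]


def keyword_detector(name: str, phrases: List[str], score: float) -> Detector:
    """Build a toy narrow detector that fires when any phrase appears (hypothetical)."""
    def detect(utterance: str) -> Optional[Signal]:
        lowered = utterance.lower()
        for phrase in phrases:
            if phrase in lowered:
                return Signal(name, score, f'matched phrase "{phrase}"')
        return None
    return detect


def analyze(utterance: str, detectors: List[Detector], threshold: float = 0.5) -> dict:
    """Run every narrow detector and keep the evidence behind the verdict."""
    signals = [s for d in detectors if (s := d(utterance)) is not None]
    top = max((s.score for s in signals), default=0.0)
    return {"flagged": top >= threshold, "score": top, "reasons": signals}


# Two made-up detectors; a real system would configure many more.
detectors = [
    keyword_detector("urgency", ["right now", "immediately"], 0.4),
    keyword_detector("credential_request", ["confirm your ssn"], 0.8),
]
result = analyze("I need you to confirm your SSN right now.", detectors)
```

In this toy run, both detectors fire and `result["reasons"]` lists each one with the phrase that triggered it, which is the kind of traceable justification a single opaque model would struggle to provide.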

As with all things, though, this design comes with a tradeoff - in order for ToxMod to work, we need to know in advance which sorts of behaviors or conversational ‘events’ the user is looking for, and configure ToxMod specifically to search for these. In ToxMod’s initial deployment with top games like Call of Duty, we’d configured it to search for harmful behaviors such as hate speech, harassment, and exploitation. If someone wanted to use ToxMod to find other types of behaviors, we knew it could be done, but it would take a fair bit of up-front work to reconfigure ToxMod accordingly.

At least, that was the theory.

The Journey to Supporting Gig Platforms

In early 2024, we began working with a few gig platforms in the delivery and rideshare space. ToxMod is excellent at detecting harmful behaviors in online voice chats, so we were excited to see how our technology could also help protect drivers, riders, and gig workers alike - especially in situations where a phone call precedes physical interaction. Unfortunately, it’s all too common for a customer to verbally (or even physically) assault a driver who is just trying to do their job, and our clients wanted to know when this was happening so they could take action to mitigate the damage and get problematic users off their platform. We certainly did help these platforms protect their workers from this kind of abuse, but we also discovered a new way to help - detecting fraud. Let’s look at our findings from this first foray into fraud detection, and at what AI tools can and can’t detect when it comes to scams and fraud happening in voice chats.

The first thoughts about fraud came in a client meeting with one of our newer customers. We were about a week into a trial, and gathered for our first meeting to discuss the initial results. The client’s team arrived one by one, and as we began talking through our findings, they nodded along, but it appeared their minds were elsewhere.

When we asked them what was most interesting to them so far, their eyes lit up, and they told us the big news:

Without even trying to look for fraud, we’d caught more than 5x as much attempted fraud as their dedicated fraud-detection systems!

This caught us all by surprise - though in retrospect, it makes sense that many of ToxMod’s signals around negative emotions, manipulation, and distrust were also picking up on attempted scams. The client was ecstatic, and promptly asked us if we could build a true “fraud detection” capability within ToxMod.

This was our moment to put ToxMod’s design to the test! How long would it take for us to do that up-front configuration, to ensure that the myriad models inside it are all working in tandem to search for this new thing? This is analogous in concept (though totally different in technical execution) to “training a new model”, which takes even top AI companies months or years - how long would it take a startup like Modulate? Three months? Six months? Longer?

The answer: just under one month. That’s the power of targetable voice intelligence.

What exactly counts as “fraud”?

There are, in fact, a number of different types of harmful behavior that could fall under this umbrella. Most of that month was spent not on technical research, but on our data experts working with clients and outside researchers to understand the range of harms we might wish to detect. Ultimately, we put together a simple framework for the types of fraud we could focus on.

The first type we prioritized is scams. Scams are lies designed to extract information or money right now, without relying on any kind of relationship between the scammer and the target. In 2023, the Federal Trade Commission estimated that 1 in 5 people lost money to scams, resulting in losses of over $2 billion. Even financial experts aren’t immune to phone call scams.

This is in contrast to extortion, which relies on leverage (rather than lies alone) to force certain actions, or exploitation, which relies on building a trusted relationship in order to manipulate the behavior of the target.
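The three-way framework above can be written down as a small decision rule. The Python sketch below is purely illustrative: the `ConversationTraits` fields and the `classify` function are hypothetical names, the order in which the traits are checked is our assumption, and the sketch presumes the conversation has already been judged fraudulent by some other means.

```python
from dataclasses import dataclass
from enum import Enum


class FraudType(Enum):
    SCAM = "scam"                  # lies aimed at an immediate payoff
    EXTORTION = "extortion"        # leverage used to force an action
    EXPLOITATION = "exploitation"  # trusted relationship used to manipulate


@dataclass
class ConversationTraits:
    """Hypothetical traits of a conversation already judged fraudulent."""
    uses_leverage: bool           # pressure beyond lies alone
    relies_on_relationship: bool  # trust built up with the target


def classify(traits: ConversationTraits) -> FraudType:
    """Map traits onto the framework; the check order is an assumption."""
    if traits.uses_leverage:
        return FraudType.EXTORTION
    if traits.relies_on_relationship:
        return FraudType.EXPLOITATION
    # Neither leverage nor a relationship: a lie aimed at a payoff right now.
    return FraudType.SCAM


one_off_lie = classify(ConversationTraits(uses_leverage=False, relies_on_relationship=False))
```

Here `one_off_lie` comes out as `FraudType.SCAM`, matching the definition of a scam as a lie that does not depend on leverage or on a relationship with the target.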

(There’s a clear analogy here to in-game toxicity, which breaks down into toxic utterances - one-off comments which are immediately harmful; aggressive behaviors like coordinated bullying campaigns; and finally manipulative relationships used by groomers and extremists to cultivate targets.)

We decided to prioritize scams for a number of reasons - they are by far the highest-scale type of fraud, and are much more common in enterprise contexts than extortion or exploitation, which tend to be more focused on individuals. (We will, of course, build out support for extortion and exploitation over time as well!) And even with scams alone, we were massively outperforming the existing fraud detection tools used by our clients, so we knew this approach would maximize the value we could deliver in the short term.

Scams break down into many more subcategories. That topic deserves a blog post in and of itself, but at a high level, most scams are either imposter scams or false proprietor scams. Imposter scams involve pretending to be someone you are not; false proprietor scams involve pretending to have something you do not. In typical enterprise settings (including retail call centers, gig platforms, and many retail financial institutions), the most common conversation-based frauds of these kinds include:

  • Tip scamming: “Hey, I accidentally over-tipped in the app, can you give me back $X in cash to make it up?”
  • Posing as a company representative: “Yes, this is Larry from corporate, I need you to confirm your SSN to resolve a glitch in our payroll software.”
  • Posing as a regulator, the IRS, or another empowered outside agency: “We’re investigating your company, you need to comply to protect yourself.”
  • Posing as a customer: “Hi, I am definitely Customer A, and I’d like to withdraw $Y from my account.”

ToxMod accurately reports detected instances of all these types of fraud and more.

Disclaimer about Deepfakes

What about deepfaked voices, where someone literally sounds like another person during their call? Can Modulate detect this?

Well, yes and no. We have a lot of expertise in this area (in fact, Modulate’s earliest work was in building real-time speech changers and detection/watermarking tools for them). But we don’t believe that detecting synthetic voices is the best approach. Why?

First off, your voice is not your password, and should not be trusted as such. We would much rather move to more secure systems overall than attempt to preserve a system with as many security flaws as voice authentication.

But separately, there are legitimate reasons to use a synthetic voice. Perhaps you simply cannot speak due to a medical procedure or injury. Maybe you experience voice dysphoria and choose to use a voice changer to express yourself more authentically. Heck, maybe you just value your privacy! 

As such, Modulate does not consider “are they using a synthetic voice” to be, in and of itself, a harm worth detecting. We do of course detect “are they pretending to be someone they are not” - but we do that by also looking at what they are saying, not merely how they sound.

To Infinity and Beyond

Right now, we have tuned ToxMod to look for socially harmful behaviors in applications like games, and for economically harmful behaviors in enterprise applications. But this still barely scratches the surface of what ToxMod’s conversational-intelligence capabilities unlock. What else can ToxMod do?

Call centers are relying more and more on AI agents, or may simply worry that some calls are above the pay grade of the frontline workers answering them. In 2025, we’ll be helping these platforms identify caller frustration and other signs of a high-impact call that should be escalated to a manager, helping them use their people as efficiently as possible while resolving critical problems cleanly.

AI companions are growing more and more popular, but they sometimes go off the rails and can cause serious mental health damage. By listening to the human side of the conversation, ToxMod can recognize at-risk users and prompt the platform to step in and steer the conversation in a healthier direction (or, depending on the urgency, provide support links to the user or even call law enforcement directly).

Even in games, there’s a huge opportunity to grow. Why focus only on detecting harmful behaviors, when you could also find your best users and encourage them to stick around and teach more players how to engage more healthily with the community? This is yet another area where Modulate has already begun to work with our clients.

These might all seem like very different applications - and indeed, Modulate invests heavily in our product experts and customer success team to ensure that every Modulate client gets the support they need to truly apply ToxMod’s intelligence to their specific needs. But the unifying thread is ToxMod itself - the unique design that allowed us to build a conversational intelligence with real nuance and expertise, coupled with an explainable, reliable, directable focus on the specific behaviors that matter most to you.