Modulate’s Model Benchmarks

Compare audio-native Velma to LLMs

Conversation Understanding Benchmark — Accuracy vs. Cost
Evaluates a model's ability to identify conversation types, topics, speaker roles and key behaviors.
Chart: accuracy score (axis ticks 0 to 10) plotted against inference cost (axis ticks $0.01 to $1.50). The best region is highest accuracy at lowest cost.
Models plotted: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2

Compare Transcribe to the competition

Transcription Benchmark (Accuracy vs. Price)
Average Word Error Rate (WER) across Earnings-22 and VoxPopuli datasets
Chart: average Word Error Rate (axis ticks 8% to 13%) plotted against cost per hour (axis ticks $0.00 to $0.40). The best region is lowest WER at lowest cost.
Models plotted: modulate-transcribe, scribe-v2, assemblyai-universal-2, assemblyai-universal-3-pro, speechmatics-enhanced, google-gemini-2.5-pro, gpt-4o-transcribe, google-chirp-2, deepgram-nova-3, openai-whisper-large-v3
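For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between a reference transcript and a model's transcript, divided by the number of reference words. The sketch below is a generic illustration, not the scoring pipeline used for this benchmark: the function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions, and real evaluations typically apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; this is illustrative, not the benchmark's own scoring code.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{example_wer:.1%}")
```

An average WER across datasets, as reported in the chart, would then be the mean of the per-dataset values computed this way.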
Speech-to-Text Transcription Pricing (Batch)
Modulate: $0.03 / hr
xAI (grok-stt): $0.10 / hr
AssemblyAI (universal-3 Pro): $0.21 / hr
ElevenLabs (scribe v2): $0.22 / hr
Speechmatics (enhanced): $0.24 / hr
Deepgram (nova-3): $0.31 / hr
OpenAI (gpt-4o-transcribe): $0.36 / hr
Speech-to-Text Transcription Pricing (Streaming)
Modulate: $0.06 / hr
xAI (grok): $0.20 / hr
Speechmatics (enhanced): $0.24 / hr
Deepgram (nova-3): $0.35 / hr
OpenAI (gpt-4o-transcribe): $0.36 / hr
ElevenLabs (scribe-v2): $0.39 / hr
AssemblyAI (universal-3-pro): $0.45 / hr
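To make the per-hour prices easier to compare, the sketch below projects the cost of a workload and expresses each batch price as a multiple of the cheapest one. The prices are copied from the batch pricing list on this page. The helper names and the 10,000-hour workload are hypothetical, chosen only for illustration.

```python
# Batch prices in USD per audio hour, copied from the batch pricing list.
BATCH_PRICE_PER_HOUR = {
    "Modulate": 0.03,
    "xAI grok-stt": 0.10,
    "AssemblyAI universal-3 Pro": 0.21,
    "ElevenLabs scribe v2": 0.22,
    "Speechmatics enhanced": 0.24,
    "Deepgram nova-3": 0.31,
    "OpenAI gpt-4o-transcribe": 0.36,
}


def projected_cost(price_per_hour: float, audio_hours: float) -> float:
    """Total cost in USD for a given number of audio hours."""
    return price_per_hour * audio_hours


def price_multiples(prices: dict, baseline: str) -> dict:
    """Each price expressed as a multiple of the baseline provider's price."""
    base = prices[baseline]
    return {name: round(price / base, 1) for name, price in prices.items()}


# Hypothetical workload: 10,000 hours of audio per month.
HOURS_PER_MONTH = 10_000
for name, price in BATCH_PRICE_PER_HOUR.items():
    print(f"{name}: ${projected_cost(price, HOURS_PER_MONTH):,.2f} per month")

multiples = price_multiples(BATCH_PRICE_PER_HOUR, baseline="Modulate")
```

The same two helpers work unchanged with the streaming prices.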

Hugging Face’s Deepfake Speech Leaderboard

Modulate is the top-ranked deepfake detection model on Hugging Face's Speech Deepfake Arena, the leading independent benchmark.

Compare Deepfake Detect to the competition

Modulate is #1 on 🤗 Hugging Face

Modulate is the top-ranked deepfake detection model on Hugging Face's Speech Arena Leaderboard, the leading independent benchmark. With an Equal Error Rate of just 1.1%, Modulate catches 133% more deepfakes than the next best.
System | Date Added | Num Params (M) | Pooled EER | Average EER ↓
🥇Modulate-VELMA-2-Syntheti | 11/03/2026 | 316.000 | 1.586 | 1.104
🥈Resemble-Detect-3B-Omni | 14/10/2025 | 3000.000 | 2.099 | 2.570
🥉Hiya-Authenticity-Verific | 13/02/2026 | 1000.000 | 2.324 | 2.113
DLMSL-SpeakSure-v0.1 | 27/10/2025 | 658.630 | 6.142 | 3.954
Whispeak | 20/08/2025 | 98.900 | 8.060 | 3.049
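EER (Equal Error Rate) is the foundational performance metric used to evaluate how accurately a model can distinguish between genuine human speech and AI-generated audio. It is the error rate at the decision threshold where the false positive rate equals the false negative rate, so lower is better.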
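As a rough illustration of how an EER can be computed from detector scores, the sketch below sweeps a decision threshold and reports the error rate where the false positive and false negative rates are closest. It is a generic textbook-style calculation, not the leaderboard's own evaluation code. The function name `equal_error_rate`, the score convention (higher means more likely synthetic), and the toy data are all assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a threshold over the observed scores.

    Assumptions: a higher score means "more likely synthetic";
    labels use 1 for synthetic audio and 0 for genuine speech.
    """
    fake_scores = [s for s, y in zip(scores, labels) if y == 1]
    real_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not fake_scores or not real_scores:
        raise ValueError("need at least one genuine and one synthetic sample")

    best_gap = float("inf")
    eer = None
    # Candidate thresholds: every observed score, plus one above the maximum
    # so that "flag nothing" is also considered.
    for threshold in sorted(set(scores)) + [max(scores) + 1.0]:
        # False positive: genuine speech flagged as synthetic.
        fpr = sum(s >= threshold for s in real_scores) / len(real_scores)
        # False negative: synthetic audio that goes unflagged.
        fnr = sum(s < threshold for s in fake_scores) / len(fake_scores)
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap = gap
            eer = (fpr + fnr) / 2
    return eer


# Toy data: one genuine clip scores high and one synthetic clip scores low.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(equal_error_rate(toy_scores, toy_labels))
```

With finite data the two error rates rarely cross exactly, which is why the sketch picks the threshold where they are closest and averages them. Production evaluations usually interpolate along the ROC curve instead.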

Modulate Catches 99% of all Deepfakes

Catch 2x more deepfakes and flag 48% fewer false positives than the next-best model. Source: 🤗 Hugging Face Leaderboard.
Accuracy by provider (chart axis: 92% to 100%)
Modulate (velma-deepfake-detect): 98.9%
Hiya (authenticity-verific): 97.9%
Resemble AI (resemble-detect-3b): 97.4%
Whispeak (whispeak): 96.9%
Deep Learning (dlmsl-speaksure-v0.1): 96.0%
DF Arena (df-arena-500m-v1): 94.2%
DF Arena (df-arena-1b-v1): 94.1%
Syntra (syntra-detector): 93.9%
Momenta (momenta): 92.9%

Detect Deepfakes for just $0.25 / hr

Fraud protection at scale, at a price that levels the playing field vs. scammers.
Modulate Deepfake-Detect: $0.25 / hr
Resemble AI Enterprise: $29 / hr
Other Providers: $30 to $120 / hr
Resemble AI Self-Serve: $144 / hr