Industry-leading Voice AI models. Straight from the source.
Modulate's Velma is the #1 model for transcription accuracy, deepfake detection, and conversation understanding. Now available as direct APIs — so you can build on the same intelligence that powers our enterprise platform.
No credit card required.
The most accurate speech-to-text API for real-world conversations. Handles interruptions, accents, overlapping speech, and noise that breaks typical systems.
#1 accuracy on real-world benchmarks including Earnings-22 and AMI
Up to 10× cheaper than Deepgram
Batch + real-time streaming
Diarization, emotion, accent detection included free
PII/PHI redaction (including redacted audio) for +$0.02/hr
400 hours in free credits

The #1 ranked deepfake detection model on 🤗 Hugging Face's Speech Deepfake Arena. Catches what others miss — including mid-call voice switches that gate-check systems are blind to.
1.1% equal error rate, less than half that of the next-best model
120× lower cost than closest competitor
Works with just 3 seconds of audio
Segment-level scores, updated every 2 seconds
1,000 free credits
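
To illustrate why segment-level scores matter for mid-call voice switches, here is a minimal sketch that compares a check on the first segment only with a check on every segment. The `SegmentScore` shape, the 0-to-1 score range, and the 0.5 threshold are assumptions made for illustration; they are not the documented response format.

```python
from dataclasses import dataclass


@dataclass
class SegmentScore:
    """Hypothetical per-segment result; field names are illustrative only."""
    start_s: float
    end_s: float
    synthetic_score: float  # assumed: 0.0 = likely genuine, 1.0 = likely synthetic


def gate_check_only(scores, threshold=0.5):
    """Mimic a gate-check system that inspects only the first segment."""
    return bool(scores) and scores[0].synthetic_score >= threshold


def flag_segments(scores, threshold=0.5):
    """Return every segment whose score meets or exceeds the threshold."""
    return [s for s in scores if s.synthetic_score >= threshold]


# Made-up scores for a call where the voice changes after four seconds.
call = [
    SegmentScore(0.0, 2.0, 0.03),
    SegmentScore(2.0, 4.0, 0.05),
    SegmentScore(4.0, 6.0, 0.91),
    SegmentScore(6.0, 8.0, 0.88),
]

flagged = flag_segments(call)
```

With these made-up scores, the gate check passes the call because the opening segment looks genuine, while the per-segment check flags everything from the four-second mark onward.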

Full voice intelligence via API — intent, emotion, fraud signals, compliance risk, policy violations, and more. Built on the same Ensemble Listening Model that powers Modulate's enterprise platform.

Classify music and speech at the frame level, including tricky content like music-with-vocals that single-label detectors get wrong.
Independent music + speech probabilities per frame — handles music-with-vocals, jingles, and background music under dialogue.
Sub-200ms time-to-first-result on streaming
Built for ad-break detection, podcast segmentation, broadcast monitoring, and UGC moderation
Batch (REST) + streaming
1,000 hours / month within base quota
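
The value of independent probabilities is easiest to see in a small example. The sketch below assumes each frame arrives as a `(music_probability, speech_probability)` pair with a fixed frame duration; that layout, the function names, and the 0.5 threshold are all hypothetical and chosen only to show the idea.

```python
def label_frame(music_p, speech_p, threshold=0.5):
    """Label a frame with zero, one, or both classes.

    Because the two probabilities are independent, a frame can be both
    music and speech, which a single-label detector cannot express.
    """
    labels = set()
    if music_p >= threshold:
        labels.add("music")
    if speech_p >= threshold:
        labels.add("speech")
    return labels


def music_segments(frames, frame_s, threshold=0.5):
    """Merge consecutive music frames into (start_s, end_s) segments."""
    segments = []
    start = None
    for i, (music_p, _speech_p) in enumerate(frames):
        if music_p >= threshold:
            if start is None:
                start = i  # a music run begins at this frame
        elif start is not None:
            segments.append((start * frame_s, i * frame_s))
            start = None
    if start is not None:
        # The audio ended while music was still playing.
        segments.append((start * frame_s, len(frames) * frame_s))
    return segments


# Made-up frames: dialogue, music under dialogue, a jingle, silence, a song with vocals.
frames = [(0.1, 0.9), (0.8, 0.9), (0.9, 0.2), (0.1, 0.1), (0.7, 0.6)]
segments = music_segments(frames, frame_s=0.5)
```

A pipeline for ad-break detection or podcast segmentation could start from segments like these; the speech probabilities can be grouped the same way without affecting the music result.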
Transcribe audio and automatically redact sensitive personal and health information — replacing detected spans with entity-type tags in the transcript and silencing the matching audio ranges.
Replaces names, SSNs, PHI, and more with entity tags (e.g. [FIRSTNAME], [SSN], [PHI])
Returns both redacted transcript and silenced audio in one call
Multilingual, with speaker diarization included
Batch + real-time streaming
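
As a rough picture of what the redaction step does, the sketch below applies the two operations described above to toy data: swapping detected spans for entity-type tags in the text, and zeroing the matching audio samples. The `Span` structure, the helper names, and the sample data are hypothetical; the API performs this work server-side, and its actual response format is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical detected span; field names are illustrative only."""
    char_start: int
    char_end: int
    entity: str
    t_start: float  # seconds
    t_end: float    # seconds


def redact_text(text, spans):
    """Replace each detected character range with its [ENTITY] tag."""
    out = []
    cursor = 0
    for s in sorted(spans, key=lambda s: s.char_start):
        out.append(text[cursor:s.char_start])
        out.append(f"[{s.entity}]")
        cursor = s.char_end
    out.append(text[cursor:])
    return "".join(out)


def silence_audio(samples, sample_rate, spans):
    """Return a copy of the samples with each span's time range set to zero."""
    out = list(samples)
    for s in spans:
        lo = max(0, int(s.t_start * sample_rate))
        hi = min(len(out), int(s.t_end * sample_rate))
        for i in range(lo, hi):
            out[i] = 0
    return out


text = "My name is Dana and my SSN is 123-45-6789."
name_at = text.index("Dana")
ssn_at = text.index("123-45-6789")
spans = [
    Span(name_at, name_at + len("Dana"), "FIRSTNAME", 0.25, 0.5),
    Span(ssn_at, ssn_at + len("123-45-6789"), "SSN", 0.75, 1.0),
]

redacted = redact_text(text, spans)
# Toy audio: one second at 8 samples per second, all ones.
silenced = silence_audio([1] * 8, 8, spans)
```

Keeping character offsets and timestamps on the same span is what lets the transcript and the audio stay in sync after redaction.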
Not just another LLM wrapper.
Most voice AI APIs transcribe audio and hand the text to a language model. Context, tone, and everything that makes a voice conversation meaningful gets discarded at step one.
Velma is built differently. Our Ensemble Listening Model (ELM) processes audio natively — understanding conversations the way a human listener would, with full awareness of how something is said, not just the words.
The result is an API that's more accurate, more cost-efficient, and capable of outputs that text-first systems simply can't produce.
Built for developers shipping production systems
Velma Transcribe is designed to integrate cleanly into modern infrastructure.
REST endpoints for batch transcription
Streaming endpoints for real-time transcription
Predictable structured output for downstream pipelines
Built for scalable high-throughput workloads
The Velma API is designed to work well with analytics stacks, search systems, and LLM-based workflows.

Start free. Scale when you're ready.
Our API includes free credits to get you started — no card required, no sales call necessary. When you're ready to go to production, usage-based pricing means you pay for what you use.
Need volume, SLAs, or custom endpoints? We can help with that too.