Build Your Own Voice Stack with Deepgram and PlayHT: A Practical Guide

TL;DR

Most voice stacks fail because STT and TTS operate independently—you get latency jitter, buffer misalignment, and audio cutoffs mid-sentence. This guide builds a real-time conversational pipeline: Deepgram handles streaming speech-to-text with partial transcripts, PlayHT generates low-latency audio responses, and a Node.js server orchestrates the handoff. Result: sub-500ms round-trip latency, proper barge-in handling, and no audio overlap.

Prerequisites

API Keys & Credentials

You’ll need active accounts with Deepgram and PlayHT. Generate API keys from both platforms’ dashboards—Deepgram’s key enables real-time speech-to-text streaming via WebSocket, while PlayHT’s key handles text-to-speech …

System & Runtime Requirements

Node.js 18+ (for the built-in `fetch`). Install dependencies with `npm install dotenv axios ws`: `ws` handles WebSocket streaming, `axios` handles HTTP requests, and `dotenv` loads API keys from a `.env` file. You'll also need a modern browser with Web Audio API support if building a client-side component.
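Since every later snippet reads keys from `process.env`, a small startup check can catch a missing key before any connection is opened. This is a minimal sketch: `requireEnv` is a hypothetical helper name, and the variable names are the ones used later in this guide.

```javascript
// Sketch: fail fast at startup if a required key is missing.
// In a real project you would call require('dotenv').config() first so that
// values from a local .env file are loaded into process.env.
function requireEnv(names, env = process.env) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  // Return only the requested keys as a plain object
  return Object.fromEntries(names.map((name) => [name, env[name]]));
}

// Usage:
// const keys = requireEnv(['DEEPGRAM_API_KEY', 'PLAYHT_API_KEY', 'PLAYHT_USER_ID']);
```

Passing `env` as a parameter keeps the check easy to test without touching the real environment.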

Network & Audio Setup

Ensure your development environment supports WebSocket connections (firewalls sometimes block these). Have a microphone available for testing real-time STT. For production, you’ll need HTTPS endpoints and a domain for webhook callbacks—ngrok works for local testing.

Knowledge Assumptions

Familiarity with async JavaScript, REST APIs, and JSON payloads. Understanding of audio formats (PCM 16kHz) helps but isn’t mandatory.


Step-by-Step Tutorial

Most voice stacks break because developers treat STT and TTS as separate batch operations. Real-time audio requires streaming both directions simultaneously while managing buffer states. Here’s how to build it correctly.

Architecture & Flow

Your voice stack chains five stages that must run concurrently: audio capture → Deepgram STT → LLM processing → PlayHT TTS → audio playback. The critical part is managing the bidirectional streams without blocking.

```mermaid
flowchart LR
    A[Microphone] -->|WebSocket| B[Deepgram STT]
    B -->|Transcript| C[LLM Processing]
    C -->|Response Text| D[PlayHT TTS]
    D -->|Audio Stream| E[Speaker]
    E -.->|Barge-in Signal| B
```

## Configuration & Setup

**Deepgram Configuration** - Enable interim results for low-latency partial transcripts:

```javascript
const deepgramConfig = {
  model: 'nova-2',
  language: 'en-US',
  encoding: 'linear16',
  sample_rate: 16000,
  channels: 1,
  interim_results: true,
  endpointing: 300, // 300ms silence = end of utterance
  vad_events: true, // Voice activity detection
  punctuate: true
};
```

**PlayHT Configuration** - Stream audio in chunks for immediate playback:

```javascript
const playhtConfig = {
  voice: 'larry', // Or your cloned voice ID
  output_format: 'mp3',
  sample_rate: 24000,
  quality: 'medium', // Balance latency vs quality
  speed: 1.0,
  seed: null // Randomize for natural variation
};
```

## Step-by-Step Implementation

**Step 1: Initialize WebSocket Connections**

Open persistent connections to both services. Deepgram uses WebSocket for bidirectional audio streaming. PlayHT uses HTTP streaming with chunked transfer encoding.

```javascript
// `WebSocket` here is the Node.js `ws` package: the browser WebSocket
// constructor does not accept custom headers.
const deepgramWs = new WebSocket(
  `wss://api.deepgram.com/v1/listen?${new URLSearchParams(deepgramConfig)}`,
  { headers: { 'Authorization': `Token ${process.env.DEEPGRAM_API_KEY}` } }
);

// PlayHT uses HTTP streaming, not WebSocket.
// `responseText` is the LLM reply to speak; run this inside an async function.
const playhtStream = await fetch('https://api.play.ht/api/v2/tts/stream', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
    'X-User-ID': process.env.PLAYHT_USER_ID,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    text: responseText,
    voice: playhtConfig.voice,
    output_format: playhtConfig.output_format,
    sample_rate: playhtConfig.sample_rate
  })
});
```

**Step 2: Stream Audio to Deepgram**

Capture microphone input and send audio chunks as they are produced. Do NOT buffer entire utterances; stream immediately. Note that `MediaRecorder` emits containerized WebM/Opus rather than raw PCM, so the `encoding: 'linear16'` and `sample_rate` settings apply only if you capture raw PCM yourself (for example with an AudioWorklet).

```javascript
let isProcessing = false; // Race condition guard

navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    // Containerized WebM/Opus output, not raw PCM
    const mediaRecorder = new MediaRecorder(stream, {
      mimeType: 'audio/webm;codecs=opus'
    });

    mediaRecorder.ondataavailable = (event) => {
      if (deepgramWs.readyState === WebSocket.OPEN) {
        deepgramWs.send(event.data);
      }
    };

    mediaRecorder.start(250); // Send chunks every 250ms
  });
```

**Step 3: Handle Partial Transcripts**

Process interim results for responsiveness. Only trigger the LLM on final transcripts to avoid duplicate responses.

```javascript
deepgramWs.onmessage = async (message) => {
  const data = JSON.parse(message.data);

  if (data.is_final) {
    const transcript = data.channel.alternatives[0].transcript;
    if (!transcript) return; // Ignore empty finals (silence)

    if (isProcessing) return; // Prevent race condition
    isProcessing = true;

    try {
      const llmResponse = await processWithLLM(transcript);
      await streamTTSResponse(llmResponse);
    } finally {
      isProcessing = false;
    }
  }
};
```

**Step 4: Stream TTS Audio Back**

PlayHT returns audio chunks as they're generated. Play immediately; don't wait for the complete response.

```javascript
async function streamTTSResponse(text) {
  const response = await fetch('https://api.play.ht/api/v2/tts/stream', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
      'X-User-ID': process.env.PLAYHT_USER_ID,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ text, ...playhtConfig })
  });
  if (!response.ok) throw new Error(`PlayHT API error: ${response.status}`);

  const reader = response.body.getReader();
  const audioContext = new AudioContext({ sampleRate: 24000 });

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Decode and play chunk immediately. `value` is a Uint8Array view, so copy
    // out exactly its bytes before handing them to decodeAudioData.
    const bytes = value.buffer.slice(value.byteOffset, value.byteOffset + value.byteLength);
    const audioBuffer = await audioContext.decodeAudioData(bytes);
    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(audioContext.destination);
    // Starting immediately can overlap chunks; see "Race Conditions in Audio Playback"
    source.start();
  }
}
```

## Error Handling & Edge Cases

**WebSocket Reconnection** - Deepgram closes a connection that receives no audio (or KeepAlive message) for about 10 seconds. Implement exponential backoff:

```javascript
let reconnectAttempts = 0;
const maxReconnectDelay = 30000;

deepgramWs.onclose = () => {
  const delay = Math.min(1000 * Math.pow(2, reconnectAttempts), maxReconnectDelay);
  setTimeout(() => {
    reconnectAttempts++;
    // Your function that reopens the socket and rebinds handlers; it should
    // reset reconnectAttempts to 0 in the new socket's onopen handler.
    initializeDeepgram();
  }, delay);
};
```
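Idle timeouts can often be avoided rather than recovered from. The sketch below assumes Deepgram's streaming endpoint accepts a `{"type": "KeepAlive"}` text message, which is how its keep-alive feature is documented; verify the message shape and recommended interval against the current docs. `startKeepAlive` and the injectable `timers` argument are hypothetical names used for illustration.

```javascript
// Sketch: send a periodic KeepAlive text frame while the socket is open.
// Assumption: the server treats {"type":"KeepAlive"} as a keep-alive message.
function startKeepAlive(
  ws,
  intervalMs = 5000,
  timers = {
    setInterval: (fn, ms) => setInterval(fn, ms),
    clearInterval: (handle) => clearInterval(handle)
  }
) {
  const OPEN = 1; // Same value as WebSocket.OPEN in browsers and the `ws` package
  const handle = timers.setInterval(() => {
    if (ws.readyState === OPEN) {
      ws.send(JSON.stringify({ type: 'KeepAlive' }));
    }
  }, intervalMs);

  // Call the returned function when the session ends
  return () => timers.clearInterval(handle);
}
```

Keep-alives complement backoff rather than replace it: network drops still close the socket, so the reconnection handler stays in place.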

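**Barge-in Handling** - Stop TTS playback when the user interrupts. Flush audio buffers to prevent old audio playing after the interrupt.

```javascript
// Shared playback context; must be reassignable so it can be replaced on barge-in
let audioContext = new AudioContext({ sampleRate: 24000 });

// addEventListener keeps the transcript handler from Step 3 in place;
// assigning onmessage again would replace it.
deepgramWs.addEventListener('message', (message) => {
  const data = JSON.parse(message.data);

  if (data.type === 'SpeechStarted') {
    // User started speaking - cancel TTS immediately
    audioContext.close(); // Stops all audio sources
    audioContext = new AudioContext({ sampleRate: 24000 });
    isProcessing = false; // Allow new processing
  }
});
```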

**Rate Limiting** - PlayHT enforces 100 requests/minute. Queue requests and implement backoff on 429 errors.
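One way to implement the backoff part is sketched below. Nothing here is PlayHT-specific: `sendRequest` is a hypothetical caller-supplied function that returns a fetch-style response, and the base delay, cap, and attempt count are illustrative defaults rather than recommended values.

```javascript
// Sketch: exponential backoff with "full jitter" for 429 responses.
function backoffDelay(attempt, baseMs = 500, capMs = 30000, random = Math.random) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  // Pick a random delay in [0, exponential) so clients do not retry in lockstep
  return Math.floor(random() * exponential);
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// `sendRequest` is an assumed async function returning an object with a
// numeric `status`, such as a fetch Response.
async function withRateLimitRetry(sendRequest, maxAttempts = 5, sleep = defaultSleep) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await sendRequest();
    if (response.status !== 429) return response;
    // Wait before the next attempt, but not after the final one
    if (attempt < maxAttempts - 1) await sleep(backoffDelay(attempt));
  }
  throw new Error(`Still rate limited after ${maxAttempts} attempts`);
}
```

A call site might wrap the TTS request as `withRateLimitRetry(() => fetch(url, options))`. Queueing requests is a separate concern and is not covered by this sketch.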

## Testing & Validation

Test with 200-500ms network jitter. Real mobile networks have variable latency. Your endpointing threshold (300ms) must account for this or you'll get false turn-taking triggers.

Validate audio format compatibility: Deepgram expects PCM 16kHz, PlayHT outputs 24kHz MP3. Resample if needed to prevent playback speed issues.
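If you capture raw Web Audio samples (floats in the range -1 to 1, usually at 44.1 or 48kHz), they need to be downsampled and converted to 16-bit PCM before being sent as `linear16`. The sketch below uses plain linear interpolation with no low-pass filter, which is simple but can alias; `downsample` and `floatToPcm16` are hypothetical helper names.

```javascript
// Sketch: reduce the sample rate by linear interpolation.
// No anti-aliasing filter is applied, so treat this as a starting point.
function downsample(samples, fromRate, toRate) {
  if (toRate >= fromRate) return Float32Array.from(samples);
  const ratio = fromRate / toRate;
  const out = new Float32Array(Math.floor(samples.length / ratio));
  for (let i = 0; i < out.length; i++) {
    const pos = i * ratio;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, samples.length - 1);
    const frac = pos - left;
    out[i] = samples[left] * (1 - frac) + samples[right] * frac;
  }
  return out;
}

// Convert floats in [-1, 1] to signed 16-bit integers, clamping out-of-range input.
function floatToPcm16(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  }
  return pcm;
}

// Usage: const pcm = floatToPcm16(downsample(floatSamples, 48000, 16000));
// pcm.buffer can then be sent over the WebSocket as binary audio.
```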

### System Diagram

Speech recognition pipeline from audio input to transcript output.

```mermaid
graph LR
    AudioInput[Audio Input]
    PreProcessor[Pre-Processor]
    FeatureExtractor[Feature Extraction]
    AcousticModel[Acoustic Model]
    LanguageModel[Language Model]
    Decoder[Decoder]
    PostProcessor[Post-Processor]
    Transcript[Transcript Output]
    ErrorHandler[Error Handler]
    Log[Logging]

    AudioInput --> PreProcessor
    PreProcessor --> FeatureExtractor
    FeatureExtractor --> AcousticModel
    AcousticModel --> LanguageModel
    LanguageModel --> Decoder
    Decoder --> PostProcessor
    PostProcessor --> Transcript

    PreProcessor -- Error --> ErrorHandler
    FeatureExtractor -- Error --> ErrorHandler
    AcousticModel -- Error --> ErrorHandler
    LanguageModel -- Error --> ErrorHandler
    Decoder -- Error --> ErrorHandler

    ErrorHandler --> Log
    Log --> PreProcessor
```

## Testing & Validation

Most voice stacks break in production because devs skip local testing. Here's how to catch issues before they hit users.

### Local Testing

Test the full pipeline locally before deploying. This catches most integration bugs.

```javascript
const fs = require('fs');

// Test STT → LLM → TTS pipeline with mock audio
async function testVoicePipeline() {
  const testAudioFile = './test_audio.wav'; // 16kHz PCM audio
  const audioData = fs.readFileSync(testAudioFile);

  // 1. Test Deepgram STT
  deepgramWs.send(audioData);
  deepgramWs.on('message', async (data) => {
    const result = JSON.parse(data);
    const transcript = result.channel?.alternatives?.[0]?.transcript;
    if (!result.is_final || !transcript) return; // Skip interim and non-transcript messages
    console.log('STT Output:', transcript);

    // 2. Test LLM response
    const llmResponse = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'gpt-4', // Assumption: substitute whichever model you use
        messages: [{ role: 'user', content: transcript }]
      })
    });
    const completion = await llmResponse.json();

    // 3. Test PlayHT TTS
    await streamTTSResponse(completion.choices[0].message.content);
  });
}
```

Run this with 5-10 test audio files covering edge cases: background noise, fast speech, accents, interruptions.

### Webhook Validation

If using webhooks for async processing, validate signatures to reject forged requests, and check a timestamp or nonce to prevent replay attacks. Check response codes: 200 = success, 429 = rate limit hit (back off exponentially), 503 = service down (retry with jitter).
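As an illustration of signature plus replay checking, the sketch below assumes a generic scheme in which the sender signs `timestamp.rawBody` with HMAC-SHA256 and a shared secret, and sends the hex digest alongside a millisecond timestamp. That scheme, the field names, and `verifyWebhook` itself are assumptions for the example; use whatever your provider actually documents.

```javascript
const crypto = require('crypto');

// Sketch: verify an HMAC-SHA256 webhook signature and reject stale requests.
// Assumed scheme: signature = hex(HMAC_SHA256(secret, `${timestamp}.${rawBody}`)).
function verifyWebhook({
  rawBody,
  timestamp,
  signature,
  secret,
  now = Date.now(),
  toleranceMs = 5 * 60 * 1000
}) {
  // Reject requests outside the tolerance window to limit replay
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(now - ts) > toleranceMs) return false;

  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${rawBody}`)
    .digest('hex');

  const a = Buffer.from(expected, 'utf8');
  const b = Buffer.from(String(signature), 'utf8');
  // timingSafeEqual throws on length mismatch, so compare lengths first
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```

The signature must be computed over the raw request body, so read the body before any JSON parsing middleware rewrites it.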

## Real-World Example

## Barge-In Scenario

User interrupts the AI mid-sentence while it's explaining a 3-step process. Most implementations break here because they don't flush the TTS buffer, so the old audio keeps playing after the interrupt.

```javascript
let currentAudioSource = null;
let isPlaying = false;
let interrupted = false;

// Barge-in detection from Deepgram STT
deepgramWs.on('message', (message) => {
  const data = JSON.parse(message);
  const transcript = data.channel?.alternatives?.[0]?.transcript ?? '';
  if (transcript.length === 0) return;

  // User spoke while AI was talking = barge-in. Interim results count, so
  // playback stops before the user finishes the sentence.
  if (isPlaying) {
    console.log('[BARGE-IN] User interrupted:', transcript);

    // CRITICAL: Stop current audio immediately
    if (currentAudioSource) {
      currentAudioSource.stop(0);
      currentAudioSource = null;
    }

    // Flush PlayHT stream buffer
    if (playhtStream && !playhtStream.destroyed) {
      playhtStream.destroy();
    }

    isPlaying = false;
    interrupted = true;
  }

  // Process new user input once the interrupting utterance is complete
  if (interrupted && data.is_final && data.speech_final) {
    interrupted = false;
    handleUserInput(transcript);
  }
});

// Track audio playback state
function playAudioChunk(audioBuffer) {
  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);

  currentAudioSource = source;
  isPlaying = true;

  source.onended = () => {
    isPlaying = false;
    currentAudioSource = null;
  };

  source.start(0);
}
```

## Event Logs

**Timestamp: 14:32:18.234** - AI starts TTS: "To complete the setup, first navigate to..."

**Timestamp: 14:32:19.891** - Deepgram detects speech: `is_final: false, transcript: "wait"`

**Timestamp: 14:32:20.103** - Barge-in triggered, audio source stopped

**Timestamp: 14:32:20.156** - PlayHT stream destroyed, buffer flushed

**Timestamp: 14:32:20.421** - New STT final: "wait, can you repeat that?"

## Edge Cases

**Multiple rapid interrupts**: User says "wait... no... actually..." within 500ms. Without debouncing, you'll trigger 3 separate LLM calls. Add a 300ms debounce window before processing the final transcript.

**False positives from background noise**: Breathing, keyboard clicks, or ambient sound trigger barge-in at Deepgram's default `endpointing: 10` (10ms silence). Increase to `endpointing: 300` for noisy environments. This prevents phantom interrupts but adds 290ms latency to legitimate barge-ins; tune based on your use case.

**Partial audio playback**: If you don't track `isPlaying` state, the system can't distinguish between "AI is speaking" vs "silence between responses." Result: user's normal speech gets treated as an interrupt, breaking turn-taking logic.
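The debounce window for rapid interrupts can be sketched as follows. `createTranscriptDebouncer` is a hypothetical helper: it collects final transcripts that arrive within the window and hands them on as one joined string. Joining (rather than keeping only the last fragment) is a design choice for this example, and the injectable `timers` argument exists only to make the behavior testable.

```javascript
// Sketch: merge final transcripts that arrive within `waitMs` of each other
// into a single callback, so a burst of fragments triggers one LLM call.
function createTranscriptDebouncer(
  onSettled,
  waitMs = 300,
  timers = {
    setTimeout: (fn, ms) => setTimeout(fn, ms),
    clearTimeout: (handle) => clearTimeout(handle)
  }
) {
  let parts = [];
  let handle = null;

  return function push(transcript) {
    parts.push(transcript);
    // Restart the window every time a new fragment arrives
    if (handle !== null) timers.clearTimeout(handle);
    handle = timers.setTimeout(() => {
      const merged = parts.join(' ');
      parts = [];
      handle = null;
      onSettled(merged);
    }, waitMs);
  };
}

// Usage:
// const pushFinal = createTranscriptDebouncer((text) => handleUserInput(text));
// pushFinal('wait'); pushFinal('no'); pushFinal('actually');
```

The trade-off is that every response now waits at least `waitMs` after the last final transcript, so the window adds directly to perceived latency.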

## Common Issues & Fixes

## Race Conditions in Audio Playback

Most voice stacks break when TTS chunks arrive faster than they can be played. You'll hear overlapping audio or the bot talking over itself.

**The Problem:** PlayHT streams audio chunks at ~50ms intervals, but Web Audio API scheduling isn't instant. If you queue chunks without tracking playback state, they pile up.

```javascript
// BROKEN: Chunks overlap because we don't track playback
playhtStream.on('data', async (chunk) => {
  const audioBuffer = await audioContext.decodeAudioData(chunk);
  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start(0); // ❌ Always starts immediately = overlap
});

// FIXED: Track playback timing
let nextStartTime = audioContext.currentTime;

playhtStream.on('data', async (chunk) => {
  const audioBuffer = await audioContext.decodeAudioData(chunk);
  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);

  // Schedule this chunk to begin when the previous one finishes,
  // or right now if playback has already caught up
  const startAt = Math.max(audioContext.currentTime, nextStartTime);
  source.start(startAt);
  nextStartTime = startAt + audioBuffer.duration;
});
```

## WebSocket Reconnection Failures

Deepgram closes WebSocket connections that receive no audio (or KeepAlive message) for about 10 seconds, and network hiccups drop them too. Without exponential backoff, you'll spam reconnect attempts and hit rate limits (429 errors).

```javascript
// Exponential backoff with jitter
async function reconnectDeepgram() {
  const delay = Math.min(1000 * Math.pow(2, reconnectAttempts), maxReconnectDelay);
  const jitter = Math.random() * 1000; // Prevent thundering herd

  await new Promise(resolve => setTimeout(resolve, delay + jitter));
  reconnectAttempts++;

  // Reuse the original query string so the new connection keeps the same settings
  deepgramWs = new WebSocket(
    `wss://api.deepgram.com/v1/listen?${new URLSearchParams(deepgramConfig)}`,
    { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } }
  );
  deepgramWs.onopen = () => { reconnectAttempts = 0; }; // Reset after success
}
```

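**Real-world trigger:** Mobile networks cause 200-500ms jitter. Raising `endpointing` (for example to 1500, up from the 300 used in this guide) avoids false end-of-utterance triggers at the cost of slower turn-taking. It does not affect connection lifetime, so it will not prevent idle disconnects on its own.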

## Barge-In Audio Corruption

When users interrupt mid-sentence, you must **flush the TTS buffer** and cancel queued chunks. Otherwise, old audio plays after the interruption.

```javascript
deepgramWs.on('message', (data) => {
  const transcript = JSON.parse(data);

  if (transcript.is_final && isPlaying) {
    // Stop current audio immediately
    if (currentAudioSource) {
      currentAudioSource.stop();
      currentAudioSource = null;
    }

    // Clear queued chunks. Sources already scheduled with start() keep their
    // slot, so track them in an array and stop each one if you schedule ahead.
    nextStartTime = audioContext.currentTime;
    isPlaying = false;
  }
});
```

## Complete Working Example

Most voice stack tutorials show isolated snippets. Here's the full server that actually runs: WebSocket handlers, audio streaming, error recovery, and graceful shutdown. This is what you deploy.

## Full Server Code

This implementation handles the complete real-time speech-to-text to text-to-speech pipeline. The server manages concurrent WebSocket connections, buffers audio chunks to prevent jitter, and implements exponential backoff for reconnection failures.

```javascript
// server.js - Production voice stack server
const WebSocket = require('ws');
const express = require('express');
const { createClient } = require('@deepgram/sdk');
// Node.js 18+ ships a global fetch whose response.body supports getReader(),
// so node-fetch is not required.

const app = express();
const server = require('http').createServer(app);
const wss = new WebSocket.Server({ server });

// Configuration from previous sections
const deepgramConfig = {
  model: 'nova-2',
  language: 'en-US',
  encoding: 'linear16',
  sample_rate: 16000,
  channels: 1,
  endpointing: 300,
  interim_results: true
};

const playhtConfig = {
  voice: 'jennifer',
  output_format: 'mp3',
  quality: 'high',
  speed: 1.0
};

// Session state management
const sessions = new Map();
const SESSION_TTL = 300000; // 5 minutes
const MAX_RECONNECT_DELAY = 30000;

// Audio buffer management to prevent jitter
class AudioBuffer {
  constructor() {
    this.chunks = [];
    this.isPlaying = false;
    this.nextStartTime = 0;
  }

  add(chunk) {
    this.chunks.push(chunk);
    if (!this.isPlaying) this.play();
  }

  async play() {
    this.isPlaying = true;
    while (this.chunks.length > 0) {
      const chunk = this.chunks.shift();
      const now = Date.now();
      const delay = Math.max(0, this.nextStartTime - now);
      await new Promise(resolve => setTimeout(resolve, delay));
      // Send chunk to client
      this.nextStartTime = Date.now() + (chunk.duration || 100);
    }
    this.isPlaying = false;
  }

  clear() {
    this.chunks = [];
    this.isPlaying = false;
  }
}

// Deepgram connection with reconnection logic
function createDeepgramConnection(sessionId) {
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
  const connection = deepgram.listen.live(deepgramConfig);

  // The session must already be registered in `sessions` at this point
  const session = sessions.get(sessionId);

  connection.on('open', () => {
    console.log(`[${sessionId}] Deepgram connected`);
    session.reconnectAttempts = 0; // Reset backoff only after a successful connect
  });

  connection.on('Results', async (data) => {
    if (!data.channel?.alternatives?.[0]) return;

    const transcript = data.channel.alternatives[0].transcript;
    if (!transcript || data.is_final === false) return;

    // Prevent race condition during TTS playback
    if (session.isProcessing) {
      console.log(`[${sessionId}] Dropped transcript (already processing)`);
      return;
    }
    session.isProcessing = true;

    try {
      // Generate LLM response (simplified - use your LLM here)
      const llmResponse = await generateResponse(transcript);

      // Stream TTS from PlayHT
      await streamTTSResponse(sessionId, llmResponse);
    } catch (error) {
      console.error(`[${sessionId}] Processing error:`, error);
      session.ws.send(JSON.stringify({ type: 'error', message: 'Processing failed' }));
    } finally {
      session.isProcessing = false;
    }
  });

  connection.on('error', (error) => {
    console.error(`[${sessionId}] Deepgram error:`, error);
  });

  connection.on('close', () => {
    console.log(`[${sessionId}] Deepgram closed`);
    if (sessions.has(sessionId)) {
      reconnectDeepgram(sessionId);
    }
  });

  return connection;
}

// Exponential backoff reconnection
function reconnectDeepgram(sessionId) {
  const session = sessions.get(sessionId);
  if (!session) return;

  session.reconnectAttempts++;
  const delay = Math.min(
    1000 * Math.pow(2, session.reconnectAttempts),
    MAX_RECONNECT_DELAY
  );

  console.log(`[${sessionId}] Reconnecting in ${delay}ms (attempt ${session.reconnectAttempts})`);

  setTimeout(() => {
    if (sessions.has(sessionId)) {
      session.deepgramWs = createDeepgramConnection(sessionId);
    }
  }, delay);
}

// PlayHT TTS streaming with buffer management
async function streamTTSResponse(sessionId, text) {
  const session = sessions.get(sessionId);
  if (!session) return;

  // Cancel any ongoing playback (barge-in handling)
  session.audioBuffer.clear();

  try {
    const response = await fetch('https://api.play.ht/api/v2/tts/stream', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
        'X-User-ID': process.env.PLAYHT_USER_ID,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        text: text,
        voice: playhtConfig.voice,
        output_format: playhtConfig.output_format,
        quality: playhtConfig.quality,
        speed: playhtConfig.speed
      })
    });

    if (!response.ok) {
      throw new Error(`PlayHT API error: ${response.status}`);
    }

    const reader = response.body.getReader();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // Buffer audio to prevent jitter
      session.audioBuffer.add({
        data: value,
        // Rough estimate only: compressed MP3 bytes do not map linearly to time
        duration: (value.length / 16000) * 1000
      });

      // Send to client WebSocket
      if (session.ws.readyState === WebSocket.OPEN) {
        session.ws.send(value, { binary: true });
      }
    }
  } catch (error) {
    console.error(`[${sessionId}] TTS streaming error:`, error);
    throw error;
  }
}

// Simplified LLM response (replace with your LLM)
async function generateResponse(transcript) {
  // This is where you'd call OpenAI, Anthropic, etc.
  return `You said: ${transcript}. This is a test response.`;
}

// WebSocket connection handler
wss.on('connection', (ws) => {
  const sessionId = `session_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  console.log(`[${sessionId}] Client connected`);

  // Initialize session
  const session = {
    ws: ws,
    deepgramWs: null,
    audioBuffer: new AudioBuffer(),
    isProcessing: false,
    reconnectAttempts: 0,
    createdAt: Date.now()
  };
  // Register the session first: createDeepgramConnection looks it up by ID
  sessions.set(sessionId, session);
  session.deepgramWs = createDeepgramConnection(sessionId);

  // Auto-cleanup after TTL
  setTimeout(() => {
    if (sessions.has(sessionId)) {
      console.log(`[${sessionId}] Session expired (TTL)`);
      cleanupSession(sessionId);
    }
  }, SESSION_TTL);

  ws.on // … (listing truncated in the source)
```

## FAQ

## Technical Questions

**How do I handle WebSocket reconnection when Deepgram drops mid-stream?**

Implement exponential backoff with a maximum delay cap. Track `reconnectAttempts` and increment after each failed connection. Set `maxReconnectDelay` to 30 seconds to prevent runaway retry loops. When the WebSocket closes unexpectedly, calculate `delay = Math.min(1000 * Math.pow(2, reconnectAttempts), maxReconnectDelay)`, then reconnect. Store partial transcripts in memory before reconnecting so you don't lose context. Most production failures happen because developers retry immediately without backoff, which will exhaust your connection pool.

**What's the latency difference between batch STT and streaming STT?**

Batch processing (send entire audio file) adds 500ms–2s overhead for queueing and processing. Streaming STT with Deepgram returns partial transcripts within 100–300ms of audio arrival, depending on network jitter and VAD (voice activity detection) settings. For real-time voice stacks, streaming is non-negotiable. Batch is only acceptable for post-call analysis or asynchronous transcription jobs.

**How do I prevent PlayHT from speaking over Deepgram's STT?**

Gate the pipeline on state. The snippets in this guide set `isProcessing = true` while a response is being generated and spoken, and drop new final transcripts until it returns to `false`. Use a state machine, not boolean flags alone; this prevents race conditions where both streams try to output simultaneously. If the user interrupts mid-sentence, flush the `playhtStream` buffer immediately and set `interrupted = true` to signal cancellation downstream.
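A state machine for turn-taking can be quite small. The sketch below is one possible shape, not the guide's own implementation: the state names (`listening`, `thinking`, `speaking`) and event names are hypothetical, and events that are invalid in the current state are simply ignored.

```javascript
// Sketch: a minimal turn-taking state machine.
// State and event names are illustrative assumptions.
const TRANSITIONS = {
  listening: { finalTranscript: 'thinking' },
  thinking: { responseReady: 'speaking', userSpeech: 'listening' },
  speaking: { playbackEnded: 'listening', userSpeech: 'listening' }
};

function createTurnMachine(onTransition = () => {}) {
  let state = 'listening';
  return {
    get state() {
      return state;
    },
    // Returns true if the event caused a transition, false if it was ignored
    dispatch(event) {
      const allowed = TRANSITIONS[state];
      if (!Object.prototype.hasOwnProperty.call(allowed, event)) return false;
      const previous = state;
      state = allowed[event];
      onTransition(previous, state, event);
      return true;
    }
  };
}

// Usage:
// const turn = createTurnMachine((from, to) => console.log(`${from} -> ${to}`));
// if (turn.dispatch('finalTranscript')) { /* call the LLM */ }
```

Here a `userSpeech` event during `speaking` is the barge-in path, and the `onTransition` hook is where audio would be stopped and buffers flushed.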

**Why does my voice stack have 2–3 second latency spikes?**

Check three things: (1) LLM response time (usually 500–1500ms for GPT-4), (2) TTS synthesis delay (PlayHT typically 200–800ms depending on text length), (3) network jitter on WebSocket. Profile each component separately using `console.time()`. Most developers blame the API when the bottleneck is their own LLM integration.
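For per-stage profiling, a small wrapper avoids the label collisions that `console.time()` can produce when calls overlap. This is a sketch: `timed` is a hypothetical helper, and it uses the global `performance.now()` available in Node.js 16+ and browsers.

```javascript
// Sketch: measure how long an async stage takes and report it with a label.
async function timed(label, fn, report = console.log) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    // Runs whether fn resolved or threw, so failures are timed too
    report(`${label}: ${(performance.now() - start).toFixed(1)}ms`);
  }
}

// Usage:
// const reply = await timed('llm', () => processWithLLM(transcript));
// await timed('tts', () => streamTTSResponse(reply));
```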

## Performance

**What sample rate should I use for Deepgram?**

Use 16kHz for voice conversations (standard for telephony). 8kHz works but reduces accuracy by 3–5%. Higher rates (48kHz) waste bandwidth without meaningful accuracy gains for speech. Set `sample_rate: 16000` in `deepgramConfig` and ensure your `mediaRecorder` or audio source matches this rate. Mismatched sample rates cause audio artifacts and transcription errors.

**How do I reduce PlayHT synthesis latency?**

Lower the `quality` setting from "high" to "medium" (saves 200–400ms). Keep `speed` around 0.9–1.0; faster speech means shorter audio, but clarity suffers. Pre-warm the connection by sending a test request during initialization. Batch multiple short sentences into one TTS call instead of calling PlayHT for every single response chunk; this reduces API overhead by 40–60%.
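Batching short sentences can be sketched as a pure function that groups sentences up to a character budget. `batchSentences` and the 200-character default are assumptions for illustration, and the sentence splitter is deliberately naive: it breaks on `.`, `!`, and `?`, so decimals and abbreviations will be split incorrectly.

```javascript
// Sketch: group sentences into batches no longer than `maxChars`, so several
// short sentences share one TTS request. A single sentence longer than
// `maxChars` is kept whole in its own batch.
function batchSentences(text, maxChars = 200) {
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  const batches = [];
  let current = '';

  for (const raw of sentences) {
    const sentence = raw.trim();
    if (!sentence) continue;

    if (current && current.length + 1 + sentence.length > maxChars) {
      batches.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }

  if (current) batches.push(current);
  return batches;
}

// Usage:
// for (const batch of batchSentences(llmResponse)) await streamTTSResponse(batch);
```

A smaller budget starts playback sooner but issues more requests, so the right value depends on how the latency and rate-limit constraints trade off in your setup.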

## Platform Comparison

**Should I use Deepgram or Google Cloud Speech-to-Text?**

Deepgram is 2–3x faster for streaming (100–200ms latency vs. 300–500ms for Google). Deepgram's pricing is predictable (per-minute). Google charges per request plus storage. For real-time voice stacks, Deepgram wins on latency and cost. Google wins if you need multi-language support across 100+ languages out-of-the-box.

**PlayHT vs. ElevenLabs for TTS?**

PlayHT has lower latency (200–600ms) and better cost efficiency for high-volume applications. ElevenLabs has superior voice quality and emotional expressiveness, but adds 400–1000ms latency. For conversational AI, PlayHT is the practical choice. For branded voice experiences, ElevenLabs justifies the latency trade-off.

## Resources

**Deepgram Speech-to-Text API**

- [Official Documentation](https://developers.deepgram.com/docs) – Real-time STT, WebSocket streaming, VAD configuration
- [API Reference](https://developers.deepgram.com/reference) – Endpoint specs, authentication, error codes

**PlayHT Text-to-Speech API**

- [Official Documentation](https://docs.playht.com) – Low-latency TTS streaming, voice selection, output formats
- [API Reference](https://docs.playht.com/api-reference) – Endpoint specs, authentication, streaming protocols

**Voice AI Stack Architecture**

- [Deepgram GitHub Examples](https://github.com/deepgram-devs) – Production WebSocket implementations, audio streaming patterns
- [PlayHT GitHub Examples](https://github.com/playht) – End-to-end conversational AI pipeline examples
