Streaming (WebSocket)

Transcribe audio in real time using the WebSocket streaming API.

The streaming API lets you send audio data in real time and receive transcription results as they are produced. This is ideal for live transcription, voice interfaces, and real-time captioning.

Connection Lifecycle

A streaming session follows this lifecycle:

  • Connect — Open a WebSocket connection and authenticate with the Authorization: Bearer header
  • Ready — Server sends a ready message confirming the session is active
  • Stream — Send audio chunks and receive transcription results
  • End — Send an END command to finalize and receive remaining results
  • Close — Server closes the connection after final results are sent
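
The lifecycle above can be sketched as a simple state machine. This is a minimal illustration; the state names are descriptive labels, not part of the protocol:

```python
# Each lifecycle stage leads to the next one, as described above.
# State names are illustrative; the protocol itself has no named states.
TRANSITIONS = {
    "connect": "ready",   # server confirms with a ready message
    "ready": "stream",    # client starts sending audio chunks
    "stream": "end",      # client sends the END command
    "end": "close",       # server sends final results, then closes
}

def next_state(state: str) -> str:
    """Return the lifecycle stage that follows `state`."""
    if state not in TRANSITIONS:
        raise ValueError(f"no transition from {state!r}")
    return TRANSITIONS[state]
```

During the Stream stage the client loops (sending chunks and receiving results) until it decides to send END.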

Connecting

Connect to the WebSocket endpoint and authenticate using the Authorization: Bearer header.

text
wss://asr.lesan.ai/v1/ws/transcribe?language=am&format=pcm_s16le

Note: Replace wss://asr.lesan.ai with your own WebSocket server URL if you are self-hosting. Most browsers cannot set custom headers (including Authorization) on WebSocket connections; for browser apps, use a server-side proxy that injects the header.

Query parameters:

  • language (optional) — Language code: am, ti, so, or en
  • format (optional) — Audio format. Default: pcm_s16le. See Audio Formats
  • turn_detection (optional) — Set to server_vad to enable server-side turn detection
  • vad_type (optional) — VAD algorithm: energy (default) or silero

Client Commands

Send these text commands to control the streaming session:

FORMAT

Change the audio format mid-session. Send as a text message before the next audio chunk:

text
FORMAT:webm_opus

TRANSCRIBE

Force transcription of buffered audio without ending the session. Useful for getting intermediate results:

text
TRANSCRIBE

END

Signal that audio streaming is complete. The server will process any remaining audio and send final results before closing:

text
END

CLEAR

Discard all buffered audio without transcribing. Useful for cancelling or resetting:

text
CLEAR

PING

Keep the connection alive. The server will respond with a pong message:

text
PING

Server Messages

The server sends JSON messages with a type field:

ready

Sent when the connection is established and the session is ready to receive audio:

json
{
  "type": "ready",
  "session_id": "sess_abc123",
  "format": "pcm_s16le",
  "sample_rate": 16000
}

chunk_received

Acknowledgement that an audio chunk was received and buffered:

json
{
  "type": "chunk_received",
  "bytes_received": 32000,
  "buffer_duration_ms": 2000
}
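
For the default pcm_s16le format, duration follows directly from the byte count: 16,000 samples/s × 2 bytes per sample = 32,000 bytes per second of audio, assuming mono. A quick sanity check:

```python
# pcm_s16le at 16 kHz: 2 bytes per sample, mono assumed.
BYTES_PER_SECOND = 16_000 * 2

def pcm_duration_ms(num_bytes: int) -> float:
    """Duration of a pcm_s16le byte count, in milliseconds."""
    return num_bytes / BYTES_PER_SECOND * 1000

# 8,000-byte chunks (as used in the examples below) are 250 ms of audio.
```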

transcription

A transcription result, either partial or final:

json
{
  "type": "transcription",
  "transcription": "ሰላም እንዴት ነህ",
  "language": "am",
  "is_final": false,
  "is_turn": false,
  "duration_seconds": 2.5,
  "processing_time_seconds": 0.3,
  "audio_size_bytes": 80000,
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
  • is_final — false for partial results that may change, true for final results
  • is_turn — true if the server considers this transcription a completed VAD turn
  • duration_seconds — Amount of audio included in this transcription event
  • processing_time_seconds — Server processing time for this event
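
A minimal dispatcher over the documented type field might look like this (a sketch; the describe helper is illustrative, and the message shapes are those shown in this section):

```python
import json

def describe(raw: str) -> str:
    """Summarize a server message using its documented `type` field."""
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "transcription":
        label = "Final" if msg.get("is_final") else "Partial"
        return f"{label}: {msg.get('transcription', '')}"
    if kind == "turn_detected":
        return f"Turn: {msg.get('transcription', '')}"
    if kind == "error":
        return f"Error: {msg.get('error')}"
    return kind  # ready, chunk_received, pong, ...
```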

turn_detected

Sent when Voice Activity Detection (VAD) detects a speaker turn boundary:

json
{
  "type": "turn_detected",
  "transcription": "how are you",
  "language": "en",
  "turn_start_ms": 0,
  "turn_end_ms": 1500,
  "duration_seconds": 1.5,
  "processing_time_seconds": 0.2,
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}

pong

Response to a PING command:

json
{ "type": "pong" }

error

An error occurred during the session:

json
{
  "type": "error",
  "error": "Unsupported format: flac",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}

Session Results & Audio Playback

When the server sends a ready message, it includes a job_id. After the session ends (via the END command), the server finalizes the job and persists both the audio recording and transcript to cloud storage. You can retrieve the completed results by polling the transcription endpoint:

text
GET /v1/transcriptions/{job_id}

Once the job status is completed, the response includes storage URLs:

json
{
  "id": "txn_550e8400-e29b-41d4-a716-446655440000",
  "object": "transcription",
  "status": "completed",
  "language": "am",
  "text": "ሰላም እንዴት ነህ",
  "segments": [...],
  "duration_seconds": 10.5,
  "audio_url": "lesan://streaming/audio/550e8400.opus",
  "audio_download_url": "/v1/transcriptions/txn_550e8400-.../audio",
  "result_url": "https://storage.../streaming/550e8400.json",
  "url": "/v1/transcriptions/txn_550e8400-..."
}
  • audio_url — Canonical lesan:// URI. A stable, storage-agnostic identifier for the recorded audio. Never expires. Used internally by the API and by native SDKs.
  • audio_download_url — Stable API path for downloading the audio. Returns a 302 redirect to a fresh signed URL (1-hour TTL) on every request. Web clients should use this for playback.
  • result_url — URL to the full transcript JSON artifact in cloud storage.
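
Polling until the job completes can be sketched as follows. The fetch_job callable is an assumption standing in for an authenticated GET /v1/transcriptions/{job_id}, which keeps the loop itself transport-agnostic:

```python
import time

def wait_for_completion(fetch_job, job_id: str, interval_s: float = 1.0,
                        max_attempts: int = 30) -> dict:
    """Poll `fetch_job(job_id)` until the job reports status "completed"."""
    for attempt in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") == "completed":
            return job
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not complete "
                       f"after {max_attempts} attempts")
```

In production you would also want to stop early on a failed status and use a longer interval or backoff.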

Audio playback in web apps

The audio_download_url works directly as an audio source. The browser follows the 302 redirect transparently:

html
<!-- Use audio_download_url directly -->
<audio src="/v1/transcriptions/txn_550e8400-.../audio" controls></audio>
javascript
// Fetch the transcription, then play audio
const res = await fetch(`/v1/transcriptions/${jobId}`, {
  headers: { Authorization: `Bearer ${apiKey}` }
});
const data = await res.json();


// audio_download_url is stable — store it, use it anytime
audioElement.src = data.audio_download_url;

Unlike signed cloud storage URLs, which expire after 1 hour, audio_download_url itself never expires. Each request generates a fresh signed URL server-side, so clients can store and reuse it indefinitely.

Turn Detection (VAD)

You can enable server-side turn detection (VAD) using query parameters. This helps split speech into turns and can trigger turn-level events.

  • turn_detection — Set to server_vad to enable server VAD
  • vad_type — Choose the VAD algorithm: energy (default) or silero

Streaming Audio Formats

The streaming API supports a subset of audio formats optimized for low-latency transmission. See the Audio Formats reference for full details.

  • pcm_s16le (default) — Raw PCM, 16-bit signed little-endian. Lowest latency.
  • wav — WAV container. Send the header with the first chunk.
  • webm_opus — WebM with Opus codec. Best for browser-based streaming.
  • opus_raw_16k — Raw Opus frames at 16kHz. Low bandwidth.

Complete Examples

Node.js (JavaScript)

javascript
// Node.js example (Authorization header). For browsers, use a server-side proxy.
// npm i ws


import WebSocket from "ws";
import fs from "fs";


const WS_URL = "wss://asr.lesan.ai/v1/ws/transcribe?language=am&format=pcm_s16le";
const ws = new WebSocket(WS_URL, {
  headers: {
    Authorization: "Bearer YOUR_API_KEY"
  }
});


ws.on("open", () => {
  console.log("Connected");


  // Send PCM audio bytes in chunks
  const chunkSize = 8000;
  const audio = fs.readFileSync("recording.pcm");
  for (let i = 0; i < audio.length; i += chunkSize) {
    ws.send(audio.subarray(i, i + chunkSize));
  }


  ws.send("END");
});


ws.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.type === "ready") {
    console.log("Session:", msg.session_id);
  }
  if (msg.type === "transcription") {
    const text = msg.text || msg.transcription;
    if (msg.is_final) console.log("Final:", text);
    else console.log("Partial:", text);
  }
  if (msg.type === "error") {
    console.error("Error:", msg.message || msg.error);
  }
});


ws.on("close", (code, reason) => {
  console.log("Closed:", code, reason.toString());
});

Python

python
import asyncio
import websockets
import json


async def stream_audio(file_path, language="am"):
    url = "wss://asr.lesan.ai/v1/ws/transcribe"
    params = f"?language={language}&format=pcm_s16le"


    async with websockets.connect(
        url + params,
        # Note: newer versions of the websockets library (>= 14) call
        # this keyword argument additional_headers instead.
        extra_headers={"Authorization": "Bearer YOUR_API_KEY"}
    ) as ws:
        # Wait for ready message
        ready = json.loads(await ws.recv())
        assert ready["type"] == "ready"
        print(f"Session: {ready['session_id']}")


        # Send audio in chunks
        chunk_size = 8000  # 250ms of 16kHz 16-bit audio
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                await ws.send(chunk)


                # Check for messages (non-blocking)
                try:
                    msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=0.01))
                    if msg["type"] == "transcription" and msg["is_final"]:
                        text = msg.get("transcription") or msg.get("text")
                        print(f"Transcription: {text}")
                except asyncio.TimeoutError:
                    pass


        # Signal end of audio
        await ws.send("END")


        # Collect remaining results
        async for message in ws:
            msg = json.loads(message)
            if msg["type"] == "transcription" and msg["is_final"]:
                text = msg.get("transcription") or msg.get("text")
                print(f"Final: {text}")


asyncio.run(stream_audio("recording.pcm"))

Connection Limits

  • Max concurrent connections — 5 per API key
  • Max session duration — 30 minutes
  • Idle timeout — 300 seconds (5 minutes) without audio data or PING
  • Max audio data rate — 1 MB/s

See the Rate Limits guide for full quota details.

Close Codes

The server uses these WebSocket close codes:

  • 1000 — Normal closure after END command
  • 1008 — Policy violation (invalid API key, insufficient permissions)
  • 1011 — Internal server error
  • 4000 — Invalid request (bad query parameters)
  • 4001 — Authentication failed
  • 4008 — Idle timeout (no data or PING for 300 seconds)
  • 4029 — Rate limit exceeded (too many concurrent connections)
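
For logging and error handling, the codes above can be kept in a small lookup table (values copied from the list; the helper name is illustrative):

```python
# WebSocket close codes used by the server, as documented above.
CLOSE_CODES = {
    1000: "Normal closure after END command",
    1008: "Policy violation (invalid API key, insufficient permissions)",
    1011: "Internal server error",
    4000: "Invalid request (bad query parameters)",
    4001: "Authentication failed",
    4008: "Idle timeout (no data or PING for 300 seconds)",
    4029: "Rate limit exceeded (too many concurrent connections)",
}

def explain_close(code: int) -> str:
    """Return a human-readable reason for a server close code."""
    return CLOSE_CODES.get(code, f"Unknown close code {code}")
```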

Troubleshooting

  • No transcription results — Check that the audio format matches the format parameter. Mismatched formats produce silence.
  • Connection closes immediately — Verify your API key has the write scope. Check the close code.
  • High latency — Use pcm_s16le format for lowest latency. Send smaller, more frequent chunks (100-250ms).
  • Idle timeout — Send PING commands during silence to keep the connection alive.
  • Garbled results — Ensure your audio bytes match the declared format (and its expected sample rate). Mismatches can cause pitch/speed issues.

See the Audio Formats reference for format details, or the Error Codes reference for error handling.