Streaming (WebSocket)

Transcribe audio in real time using the WebSocket streaming API.

The streaming API lets you send audio data in real time and receive transcription results as they are produced. This is ideal for live transcription, voice interfaces, and real-time captioning.

Connection Lifecycle

A streaming session follows this lifecycle:

  • Connect — Open a WebSocket connection and authenticate with the Authorization: Bearer header
  • Ready — Server sends a ready message confirming the session is active
  • Stream — Send audio chunks and receive transcription results
  • End — Send an END command to finalize and receive remaining results
  • Close — Server closes the connection after final results are sent
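
The lifecycle above can be sketched as a simple state machine. This is a minimal illustration; the state names are descriptive labels, not part of the protocol:

```python
# Each lifecycle stage leads to the next one, as described above.
# State names are illustrative; the protocol itself has no named states.
TRANSITIONS = {
    "connect": "ready",   # server confirms with a ready message
    "ready": "stream",    # client starts sending audio chunks
    "stream": "end",      # client sends the END command
    "end": "close",       # server sends final results, then closes
}

def next_state(state: str) -> str:
    """Return the lifecycle stage that follows `state`."""
    if state not in TRANSITIONS:
        raise ValueError(f"no transition from {state!r}")
    return TRANSITIONS[state]
```

During the Stream stage the client loops (sending chunks and receiving results) until it decides to send END.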

Connecting

Connect to the WebSocket endpoint and authenticate using the Authorization: Bearer header.

text
wss://asr.lesan.ai/v1/ws/transcribe?language=am&format=pcm_s16le

Note: Replace wss://asr.lesan.ai with your own WebSocket server URL if you are self-hosting. Most browsers cannot set custom headers (including Authorization) on WebSocket connections; for browser apps, use a server-side proxy that injects the header.

Query parameters:

  • language (optional) — Language code: am, ti, so, or en
  • format (optional) — Audio format. Default: pcm_s16le. See Audio Formats
  • turn_detection (optional) — Set to server_vad to enable server-side turn detection
  • vad_type (optional) — VAD algorithm: energy (default) or silero

Client Commands

Send these text commands to control the streaming session:

FORMAT

Change the audio format mid-session. Send as a text message before the next audio chunk:

text
FORMAT:webm_opus

TRANSCRIBE

Force transcription of buffered audio without ending the session. Useful for getting intermediate results:

text
TRANSCRIBE

END

Signal that audio streaming is complete. The server will process any remaining audio and send final results before closing:

text
END

CLEAR

Discard all buffered audio without transcribing. Useful for cancelling or resetting:

text
CLEAR

PING

Keep the connection alive. The server will respond with a pong message:

text
PING

Server Messages

The server sends JSON messages with a type field:

ready

Sent when the connection is established and the session is ready to receive audio:

json
{
  "type": "ready",
  "session_id": "sess_abc123",
  "format": "pcm_s16le",
  "sample_rate": 16000
}

chunk_received

Acknowledgement that an audio chunk was received and buffered:

json
{
  "type": "chunk_received",
  "bytes_received": 32000,
  "buffer_duration_ms": 2000
}
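
For the default pcm_s16le format, duration follows directly from the byte count: 16,000 samples/s × 2 bytes per sample = 32,000 bytes per second of audio, assuming mono. A quick sanity check:

```python
# pcm_s16le at 16 kHz: 2 bytes per sample, mono assumed.
BYTES_PER_SECOND = 16_000 * 2

def pcm_duration_ms(num_bytes: int) -> float:
    """Duration of a pcm_s16le byte count, in milliseconds."""
    return num_bytes / BYTES_PER_SECOND * 1000

# 8,000-byte chunks (as used in the examples below) are 250 ms of audio.
```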

transcription

A transcription result, either partial or final:

json
{
  "type": "transcription",
  "transcription": "ሰላም እንዴት ነህ",
  "language": "am",
  "is_final": false,
  "is_turn": false,
  "duration_seconds": 2.5,
  "processing_time_seconds": 0.3,
  "audio_size_bytes": 80000,
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
  • is_final — false for partial results that may change, true for final results
  • is_turn — true if the server considers this transcription a completed VAD turn
  • duration_seconds — Amount of audio included in this transcription event
  • processing_time_seconds — Server processing time for this event
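
A minimal dispatcher over the documented type field might look like this (a sketch; the describe helper is illustrative, and the message shapes are those shown in this section):

```python
import json

def describe(raw: str) -> str:
    """Summarize a server message using its documented `type` field."""
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "transcription":
        label = "Final" if msg.get("is_final") else "Partial"
        return f"{label}: {msg.get('transcription', '')}"
    if kind == "turn_detected":
        return f"Turn: {msg.get('transcription', '')}"
    if kind == "error":
        return f"Error: {msg.get('error')}"
    return kind  # ready, chunk_received, pong, ...
```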

turn_detected

Sent when Voice Activity Detection (VAD) detects a speaker turn boundary:

json
{
  "type": "turn_detected",
  "transcription": "how are you",
  "language": "en",
  "turn_start_ms": 0,
  "turn_end_ms": 1500,
  "duration_seconds": 1.5,
  "processing_time_seconds": 0.2,
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}

pong

Response to a PING command:

json
{ "type": "pong" }

error

An error occurred during the session:

json
{
  "type": "error",
  "error": "Unsupported format: flac",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}

Session Results & Audio Playback

When the server sends a ready message, it includes a job_id. After the session ends (via the END command), the server finalizes the job and persists both the audio recording and transcript to cloud storage. You can retrieve the completed results by polling the transcription endpoint:

text
GET /v1/transcriptions/{job_id}

Once the job status is completed, the response includes storage URLs:

json
{
  "id": "txn_550e8400-e29b-41d4-a716-446655440000",
  "object": "transcription",
  "status": "completed",
  "language": "am",
  "text": "ሰላም እንዴት ነህ",
  "segments": [...],
  "duration_seconds": 10.5,
  "audio_url": "lesan://streaming/audio/550e8400.opus",
  "audio_download_url": "/v1/transcriptions/txn_550e8400-.../audio",
  "result_url": "https://storage.../streaming/550e8400.json",
  "url": "/v1/transcriptions/txn_550e8400-..."
}
  • audio_url — Canonical lesan:// URI. A stable, storage-agnostic identifier for the recorded audio. Never expires. Used internally by the API and by native SDKs.
  • audio_download_url — Stable API path for downloading the audio. Returns a 302 redirect to a fresh signed URL (1-hour TTL) on every request. Web clients should use this for playback.
  • result_url — URL to the full transcript JSON artifact in cloud storage.
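
Polling until the job completes can be sketched as follows. The fetch_job callable is an assumption standing in for an authenticated GET /v1/transcriptions/{job_id}, which keeps the loop itself transport-agnostic:

```python
import time

def wait_for_completion(fetch_job, job_id: str, interval_s: float = 1.0,
                        max_attempts: int = 30) -> dict:
    """Poll `fetch_job(job_id)` until the job reports status "completed"."""
    for attempt in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") == "completed":
            return job
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not complete "
                       f"after {max_attempts} attempts")
```

In production you would also want to stop early on a failed status and use a longer interval or backoff.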

Audio playback in web apps

The audio_download_url works directly as an audio source. The browser follows the 302 redirect transparently:

html
<!-- Use audio_download_url directly -->
<audio src="/v1/transcriptions/txn_550e8400-.../audio" controls></audio>
javascript
// Fetch the transcription, then play audio
const res = await fetch(`/v1/transcriptions/${jobId}`, {
  headers: { Authorization: `Bearer ${apiKey}` }
});
const data = await res.json();


// audio_download_url is stable — store it, use it anytime
audioElement.src = data.audio_download_url;

Unlike signed cloud storage URLs, which expire after 1 hour, audio_download_url itself never expires. Each request generates a fresh signed URL server-side, so clients can store and reuse it indefinitely.

Turn Detection (VAD)

You can enable server-side turn detection (VAD) using query parameters. This helps split speech into turns and can trigger turn-level events.

  • turn_detection — Set to server_vad to enable server VAD
  • vad_type — Choose the VAD algorithm: energy (default) or silero

Streaming Audio Formats

The streaming API supports a subset of audio formats optimized for low-latency transmission. See the Audio Formats reference for full details.

  • pcm_s16le (default) — Raw PCM, 16-bit signed little-endian. Lowest latency.
  • wav — WAV container. Send the header with the first chunk.
  • webm_opus — WebM with Opus codec. Best for browser-based streaming.
  • opus_raw_16k — Raw Opus frames at 16kHz. Low bandwidth.

Complete Examples

Node.js (JavaScript)

javascript
// Node.js example (Authorization header). For browsers, use a server-side proxy.
// npm i ws


import WebSocket from "ws";
import fs from "fs";


const WS_URL = "wss://asr.lesan.ai/v1/ws/transcribe?language=am&format=pcm_s16le";
const ws = new WebSocket(WS_URL, {
  headers: {
    Authorization: "Bearer YOUR_API_KEY"
  }
});


ws.on("open", () => {
  console.log("Connected");


  // Send PCM audio bytes in chunks
  const chunkSize = 8000;
  const audio = fs.readFileSync("recording.pcm");
  for (let i = 0; i < audio.length; i += chunkSize) {
    ws.send(audio.subarray(i, i + chunkSize));
  }


  ws.send("END");
});


ws.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  if (msg.type === "ready") {
    console.log("Session:", msg.session_id);
  }
  if (msg.type === "transcription") {
    const text = msg.text || msg.transcription;
    if (msg.is_final) console.log("Final:", text);
    else console.log("Partial:", text);
  }
  if (msg.type === "error") {
    console.error("Error:", msg.message || msg.error);
  }
});


ws.on("close", (code, reason) => {
  console.log("Closed:", code, reason.toString());
});

Python

python
import asyncio
import websockets
import json


async def stream_audio(file_path, language="am"):
    url = "wss://asr.lesan.ai/v1/ws/transcribe"
    params = f"?language={language}&format=pcm_s16le"


    async with websockets.connect(
        url + params,
        # Note: newer versions of the websockets library (>= 14) call
        # this keyword argument additional_headers instead.
        extra_headers={"Authorization": "Bearer YOUR_API_KEY"}
    ) as ws:
        # Wait for ready message
        ready = json.loads(await ws.recv())
        assert ready["type"] == "ready"
        print(f"Session: {ready['session_id']}")


        # Send audio in chunks
        chunk_size = 8000  # 250ms of 16kHz 16-bit audio
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                await ws.send(chunk)


                # Check for messages (non-blocking)
                try:
                    msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=0.01))
                    if msg["type"] == "transcription" and msg["is_final"]:
                        text = msg.get("transcription") or msg.get("text")
                        print(f"Transcription: {text}")
                except asyncio.TimeoutError:
                    pass


        # Signal end of audio
        await ws.send("END")


        # Collect remaining results
        async for message in ws:
            msg = json.loads(message)
            if msg["type"] == "transcription" and msg["is_final"]:
                text = msg.get("transcription") or msg.get("text")
                print(f"Final: {text}")


asyncio.run(stream_audio("recording.pcm"))

Connection Limits

  • Max concurrent connections — 5 per API key
  • Max session duration — 30 minutes
  • Idle timeout — 300 seconds (5 minutes) without audio data or PING
  • Max audio data rate — 1 MB/s

See the Rate Limits guide for full quota details.

Close Codes

The server uses these WebSocket close codes:

  • 1000 — Normal closure after END command
  • 1008 — Policy violation (invalid API key, insufficient permissions)
  • 1011 — Internal server error
  • 4000 — Invalid request (bad query parameters)
  • 4001 — Authentication failed
  • 4008 — Idle timeout (no data or PING for 300 seconds)
  • 4029 — Rate limit exceeded (too many concurrent connections)
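
For logging and error handling, the codes above can be kept in a small lookup table (values copied from the list; the helper name is illustrative):

```python
# WebSocket close codes used by the server, as documented above.
CLOSE_CODES = {
    1000: "Normal closure after END command",
    1008: "Policy violation (invalid API key, insufficient permissions)",
    1011: "Internal server error",
    4000: "Invalid request (bad query parameters)",
    4001: "Authentication failed",
    4008: "Idle timeout (no data or PING for 300 seconds)",
    4029: "Rate limit exceeded (too many concurrent connections)",
}

def explain_close(code: int) -> str:
    """Return a human-readable reason for a server close code."""
    return CLOSE_CODES.get(code, f"Unknown close code {code}")
```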

Troubleshooting

  • No transcription results — Check that the audio format matches the format parameter. Mismatched formats produce silence.
  • Connection closes immediately — Verify your API key has the write scope. Check the close code.
  • High latency — Use pcm_s16le format for lowest latency. Send smaller, more frequent chunks (100-250ms).
  • Idle timeout — Send PING commands during silence to keep the connection alive.
  • Garbled results — Ensure your audio bytes match the declared format (and its expected sample rate). Mismatches can cause pitch/speed issues.

See the Audio Formats reference for format details, or the Error Codes reference for error handling.