Lesan AI Documentation

Lesan AI supports a wide range of audio formats for file-based transcription and a smaller set of formats optimized for real-time WebSocket streaming.

File Upload Formats

The following formats are supported for file uploads via the /transcribe and /transcribe/batch endpoints:

MP3 (.mp3) — Most common format. Good compression with acceptable quality.
WAV (.wav) — Uncompressed audio. Best quality but largest file size.
M4A (.m4a) — AAC-encoded audio in MP4 container. Common on Apple devices.
FLAC (.flac) — Lossless compression. Best quality-to-size ratio.
OGG (.ogg) — Vorbis or Opus encoded. Common in web applications.
WebM (.webm) — Opus encoded in WebM container. Native browser recording format.
AAC (.aac) — Raw AAC audio. Common in mobile applications.
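
A client can reject unsupported files before uploading by checking the extension against the list above. The following is a minimal sketch: is_supported_upload_ext is a hypothetical helper name, not part of the Lesan AI API, and it inspects only the file name, not the actual codec inside the file.

```shell
# Hypothetical helper: exit status 0 if the file name ends in one of the
# upload extensions listed above, 1 otherwise.
is_supported_upload_ext() {
  # A name without any dot has no extension to check.
  case "$1" in
    *.*) ;;
    *) return 1 ;;
  esac
  # Take the text after the last dot and lowercase it.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    mp3|wav|m4a|flac|ogg|webm|aac) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_supported_upload_ext "talk.MP3" succeeds, while is_supported_upload_ext "video.mp4" fails, so the file would need converting first.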

File Size Limits

Maximum file size — 500 MB per file
Maximum duration — 4 hours per file
Minimum duration — 0.1 seconds

For files exceeding these limits, consider splitting the audio into smaller segments before uploading.
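
The limits above can be checked locally before an upload, and an oversized recording can be cut into pieces with ffmpeg's segment muxer. This is a sketch under stated assumptions: the function names are hypothetical, and 500 MB is treated as 500,000,000 bytes (the stricter reading), since the documentation does not say whether the limit is decimal or binary.

```shell
# Limits from the list above.
MAX_BYTES=$((500 * 1000 * 1000))   # assumption: decimal megabytes
MAX_MS=$((4 * 60 * 60 * 1000))     # 4 hours in milliseconds
MIN_MS=100                         # 0.1 seconds in milliseconds

# within_upload_limits SIZE_BYTES DURATION_MS
# Exit status 0 if both the size and the duration fit the limits.
within_upload_limits() {
  [ "$1" -le "$MAX_BYTES" ] && [ "$2" -ge "$MIN_MS" ] && [ "$2" -le "$MAX_MS" ]
}

# split_for_upload INPUT SEGMENT_SECONDS OUTPUT_PATTERN
# Cuts INPUT into fixed-length pieces without re-encoding.
split_for_upload() {
  ffmpeg -i "$1" -f segment -segment_time "$2" -c copy "$3"
}
```

A call such as split_for_upload long.wav 3600 'part_%03d.wav' would produce one-hour segments named part_000.wav, part_001.wav, and so on. Because -c copy avoids re-encoding, cut points in compressed formats may land slightly off the requested time.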

Recommendations

Best quality — FLAC or WAV at 16kHz or higher sample rate
Best compression — MP3 at 128kbps or OGG at 96kbps
Browser recordings — WebM with Opus codec (native MediaRecorder output)
Mobile recordings — M4A (iOS) or OGG (Android)

WebSocket Streaming Formats

The streaming API supports these formats, optimized for low-latency real-time audio:

pcm_s16le (default) — Raw PCM, 16-bit signed, little-endian. No encoding overhead, lowest latency. Requires setting the correct sample_rate parameter.
wav — WAV container format. Send the WAV header with the first chunk; subsequent chunks are raw PCM.
webm_opus — WebM container with Opus codec. Best for browser-based streaming using MediaRecorder.
opus_raw_16k — Raw Opus frames at 16kHz. Low bandwidth, good for mobile applications.

Format Selection

Choose the right streaming format for your use case:

Server-side processing — Use pcm_s16le for lowest latency and simplest implementation
Browser applications — Use webm_opus for native MediaRecorder compatibility
Mobile / low bandwidth — Use opus_raw_16k for compressed audio with minimal overhead
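
The selection rules above can be captured in a small lookup. This is only an illustrative sketch: pick_stream_format and the use-case keywords (server, browser, mobile) are hypothetical labels, and the fallback to pcm_s16le simply follows the documented default.

```shell
# pick_stream_format USE_CASE
# Prints the streaming format suggested above for the given use case.
pick_stream_format() {
  case "$1" in
    server)  echo "pcm_s16le" ;;     # lowest latency, simplest to produce
    browser) echo "webm_opus" ;;     # matches MediaRecorder output
    mobile)  echo "opus_raw_16k" ;;  # compressed, low bandwidth
    *)       echo "pcm_s16le" ;;     # fall back to the documented default
  esac
}
```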

Sample Rate Guidelines

The sample rate affects transcription quality and bandwidth:

16000 Hz (recommended) — Optimal for speech recognition. Best accuracy-to-bandwidth ratio.
8000 Hz — Telephony quality. Acceptable for voice calls but lower accuracy.
44100 Hz / 48000 Hz — High-quality audio. Automatically downsampled to 16kHz internally. Uses more bandwidth with no accuracy benefit.
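
For raw pcm_s16le audio, the bandwidth cost of each sample rate follows directly from the format: 2 bytes per sample per channel. The sketch below works through that arithmetic; the helper names are hypothetical, and the figures apply to uncompressed PCM only, not to Opus-based formats.

```shell
# pcm_bytes_per_second SAMPLE_RATE [CHANNELS]
# 16-bit PCM uses 2 bytes per sample per channel (mono by default).
pcm_bytes_per_second() {
  echo $(( $1 * 2 * ${2:-1} ))
}

# pcm_chunk_bytes SAMPLE_RATE CHUNK_MS [CHANNELS]
# Size in bytes of a chunk holding CHUNK_MS milliseconds of audio.
pcm_chunk_bytes() {
  echo $(( $(pcm_bytes_per_second "$1" "${3:-1}") * $2 / 1000 ))
}

# Compare the sample rates discussed above for mono audio.
for rate in 8000 16000 48000; do
  echo "$rate Hz mono: $(pcm_bytes_per_second "$rate") bytes/s"
done
```

At 16000 Hz mono this comes to 32000 bytes per second, so a 100 ms chunk is 3200 bytes. At 48000 Hz the same audio costs 96000 bytes per second, three times the bandwidth, for audio that is downsampled to 16 kHz anyway.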

Converting Audio Formats

Use ffmpeg to convert audio to a supported format:

# Convert any format to 16kHz WAV (best for upload)
ffmpeg -i input.mp4 -ar 16000 -ac 1 output.wav

# Convert to MP3 (good compression)
ffmpeg -i input.wav -ar 16000 -ac 1 -b:a 128k output.mp3

# Convert to FLAC (lossless compression)
ffmpeg -i input.wav -ar 16000 -ac 1 output.flac

# Extract raw PCM for streaming
ffmpeg -i input.wav -f s16le -ar 16000 -ac 1 output.pcm
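
When many files need converting, the same WAV command can be wrapped in a loop. This is a rough sketch with hypothetical helper names; it assumes the inputs are .mp3 and .m4a files in a single directory and writes each output next to its source.

```shell
# wav_name INPUT
# Prints INPUT with its extension replaced by .wav.
wav_name() {
  echo "${1%.*}.wav"
}

# convert_all_to_wav DIR
# Converts every .mp3 and .m4a file in DIR to 16 kHz mono WAV.
convert_all_to_wav() {
  for f in "$1"/*.mp3 "$1"/*.m4a; do
    [ -e "$f" ] || continue   # skip patterns that matched no files
    ffmpeg -i "$f" -ar 16000 -ac 1 "$(wav_name "$f")"
  done
}
```

Running convert_all_to_wav ./recordings would leave the originals in place and create a .wav file beside each one.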

See the ASR guide for file upload examples, or the Streaming guide for real-time audio streaming.