Audio Formats
Supported audio formats for file upload and real-time streaming.
Lesan AI supports a wide range of audio formats for file-based transcription and a subset of formats optimized for real-time WebSocket streaming.
File Upload Formats
The following formats are supported for file uploads via the /transcribe and /transcribe/batch endpoints:
- MP3 (.mp3) — Most common format. Good compression with acceptable quality.
- WAV (.wav) — Uncompressed audio. Best quality but largest file size.
- M4A (.m4a) — AAC-encoded audio in MP4 container. Common on Apple devices.
- FLAC (.flac) — Lossless compression. Best quality-to-size ratio.
- OGG (.ogg) — Vorbis or Opus encoded. Common in web applications.
- WebM (.webm) — Opus encoded in WebM container. Native browser recording format.
- AAC (.aac) — Raw AAC audio. Common in mobile applications.
File Size Limits
- Maximum file size — 500 MB per file
- Maximum duration — 4 hours per file
- Minimum duration — 0.1 seconds
For files exceeding these limits, consider splitting the audio into smaller segments before uploading.
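The limits above can be checked on the client before uploading. The following is a minimal sketch rather than part of any Lesan AI SDK: the function names `check_upload_limits` and `plan_segments` are hypothetical, and it assumes "500 MB" means 500 × 1024 × 1024 bytes, which the limits do not specify.

```python
import math

# Limits taken from the list above. Treating "MB" as binary megabytes
# is an assumption; the limits do not say which unit is meant.
MAX_FILE_BYTES = 500 * 1024 * 1024
MAX_DURATION_S = 4 * 60 * 60
MIN_DURATION_S = 0.1


def check_upload_limits(size_bytes, duration_s):
    """Return a list of limit violations (empty when the file is acceptable)."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 500 MB")
    if duration_s > MAX_DURATION_S:
        problems.append("duration exceeds 4 hours")
    if duration_s < MIN_DURATION_S:
        problems.append("duration is under 0.1 seconds")
    return problems


def plan_segments(duration_s, max_segment_s=MAX_DURATION_S):
    """Split a recording into equal (start, end) windows, each within the limit."""
    count = max(1, math.ceil(duration_s / max_segment_s))
    step = duration_s / count
    return [(i * step, (i + 1) * step) for i in range(count)]
```

For a five-hour recording, `plan_segments(5 * 3600)` yields two windows of 2.5 hours each. The start and end times could then be passed to a tool such as ffmpeg to cut the audio.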
Recommendations
- Best quality — FLAC or WAV at 16kHz or higher sample rate
- Best compression — MP3 at 128kbps or OGG at 96kbps
- Browser recordings — WebM with Opus codec (native MediaRecorder output)
- Mobile recordings — M4A (iOS) or OGG (Android)
WebSocket Streaming Formats
The streaming API supports these formats, optimized for low-latency real-time audio:
- pcm_s16le (default) — Raw PCM, 16-bit signed, little-endian. No encoding overhead, lowest latency. Requires setting the correct sample_rate parameter.
- wav — WAV container format. Send the WAV header with the first chunk; subsequent chunks are raw PCM.
- webm_opus — WebM container with Opus codec. Best for browser-based streaming using MediaRecorder.
- opus_raw_16k — Raw Opus frames at 16kHz. Low bandwidth, good for mobile applications.
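Raw pcm_s16le audio is usually sent as a sequence of fixed-duration chunks. The sketch below shows one way to slice a PCM buffer. The 100 ms chunk duration and the helper names are assumptions for illustration; this page does not specify a required chunk size.

```python
def chunk_size_bytes(sample_rate, chunk_ms, channels=1, bytes_per_sample=2):
    """Bytes in one chunk of 16-bit PCM covering chunk_ms milliseconds."""
    samples = sample_rate * chunk_ms // 1000
    return samples * channels * bytes_per_sample


def iter_pcm_chunks(pcm, sample_rate=16000, chunk_ms=100):
    """Yield consecutive slices of a mono pcm_s16le buffer.

    The final chunk may be shorter than the others.
    """
    size = chunk_size_bytes(sample_rate, chunk_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

At 16000 Hz mono, a 100 ms chunk is 3200 bytes, so one second of audio produces ten chunks. Each chunk would then be written to the WebSocket as a binary message.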
Format Selection
Choose the right streaming format for your use case:
- Server-side processing — Use pcm_s16le for lowest latency and simplest implementation
- Browser applications — Use webm_opus for native MediaRecorder compatibility
- Mobile / low bandwidth — Use opus_raw_16k for compressed audio with minimal overhead
Sample Rate Guidelines
The sample rate affects transcription quality and bandwidth:
- 16000 Hz (recommended) — Optimal for speech recognition. Best accuracy-to-bandwidth ratio.
- 8000 Hz — Telephony quality. Acceptable for voice calls but lower accuracy.
- 44100 Hz / 48000 Hz — High-quality audio. Automatically downsampled to 16kHz internally. Uses more bandwidth with no accuracy benefit.
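To make the bandwidth difference concrete, the bitrate of uncompressed pcm_s16le audio can be computed directly from the sample rate. This is a small sketch assuming mono audio; `pcm_bitrate_kbps` is an illustrative helper, not part of the API, and the figures do not apply to compressed formats such as Opus.

```python
def pcm_bitrate_kbps(sample_rate, channels=1, bits_per_sample=16):
    """Bitrate of uncompressed PCM in kilobits per second."""
    return sample_rate * channels * bits_per_sample / 1000


for rate in (8000, 16000, 48000):
    print(f"{rate} Hz -> {pcm_bitrate_kbps(rate):.0f} kbps")
# 8000 Hz -> 128 kbps
# 16000 Hz -> 256 kbps
# 48000 Hz -> 768 kbps
```

Streaming at 48000 Hz therefore uses three times the bandwidth of 16000 Hz, and the audio is downsampled to 16kHz anyway.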
Converting Audio Formats
Use ffmpeg to convert audio to a supported format:
# Convert any format to 16kHz WAV (best for upload)
ffmpeg -i input.mp4 -ar 16000 -ac 1 output.wav
# Convert to MP3 (good compression)
ffmpeg -i input.wav -ar 16000 -ac 1 -b:a 128k output.mp3
# Convert to FLAC (lossless compression)
ffmpeg -i input.wav -ar 16000 -ac 1 output.flac
# Extract raw PCM for streaming
ffmpeg -i input.wav -f s16le -ar 16000 -ac 1 output.pcm
See the ASR guide for file upload examples, or the Streaming guide for real-time audio streaming.
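A converted WAV file can be sanity-checked before upload using Python's standard `wave` module. The sketch below reads the header and reports whether the file is 16 kHz, mono, 16-bit, which is what the WAV conversion command above normally produces. The helper names `describe_wav` and `is_speech_ready` are hypothetical.

```python
import wave


def describe_wav(path):
    """Read basic properties from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / rate,
        }


def is_speech_ready(info):
    """True when the file is 16 kHz, mono, 16-bit PCM."""
    return (
        info["sample_rate"] == 16000
        and info["channels"] == 1
        and info["bits_per_sample"] == 16
    )
```

Calling `is_speech_ready(describe_wav("output.wav"))` returns `True` for a 16 kHz mono 16-bit file. The `wave` module handles only uncompressed PCM WAV files, so other formats need a different tool such as ffprobe.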