Audio Formats

Supported audio formats for file upload and real-time streaming.

Lesan AI accepts seven common audio formats for file-based transcription and a separate set of four formats optimized for real-time WebSocket streaming.

File Upload Formats

The following formats are supported for file uploads via the /transcribe and /transcribe/batch endpoints:

  • MP3 (.mp3) — Most common format. Good compression with acceptable quality.
  • WAV (.wav) — Uncompressed audio. Best quality but largest file size.
  • M4A (.m4a) — AAC-encoded audio in MP4 container. Common on Apple devices.
  • FLAC (.flac) — Lossless compression. Best quality-to-size ratio.
  • OGG (.ogg) — Vorbis or Opus encoded. Common in web applications.
  • WebM (.webm) — Opus encoded in WebM container. Native browser recording format.
  • AAC (.aac) — Raw AAC audio. Common in mobile applications.
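
A client can reject obviously unsupported files before uploading them. The following is a minimal sketch, assuming a check on the file extension alone is good enough for a first pass; the helper name is_supported_upload is hypothetical and not part of any Lesan AI SDK.

```python
from pathlib import Path

# Extensions copied from the list above.
SUPPORTED_UPLOAD_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac",
}


def is_supported_upload(filename: str) -> bool:
    """Return True if the file extension is in the supported list.

    The comparison is case-insensitive. This looks at the name only;
    it does not inspect the audio data inside the file.
    """
    return Path(filename).suffix.lower() in SUPPORTED_UPLOAD_EXTENSIONS


if __name__ == "__main__":
    for name in ("interview.MP3", "meeting.flac", "clip.mp4"):
        print(name, is_supported_upload(name))
```

A file with a supported extension can still contain an unexpected codec, so the server response remains the final authority.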

File Size Limits

  • Maximum file size — 500 MB per file
  • Maximum duration — 4 hours per file
  • Minimum duration — 0.1 seconds

For files exceeding these limits, consider splitting the audio into smaller segments before uploading.
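
One way to plan the split is to divide the recording into equal-length segments, which avoids leaving a very short final segment. This is a sketch under that assumption; plan_segments is a hypothetical helper, and the constants simply restate the duration limits listed above.

```python
import math

MAX_DURATION_S = 4 * 60 * 60  # 4 hours per file, from the limits above
MIN_DURATION_S = 0.1          # minimum duration, from the limits above


def plan_segments(total_s: float, max_segment_s: float = MAX_DURATION_S):
    """Return (start, end) pairs in seconds that cover the whole recording.

    Segments are equal in length and none is longer than max_segment_s.
    """
    if total_s < MIN_DURATION_S:
        raise ValueError("audio is shorter than the minimum duration")
    count = math.ceil(total_s / max_segment_s)
    length = total_s / count
    segments = []
    for i in range(count):
        start = i * length
        # Pin the last boundary to total_s to avoid rounding drift.
        end = total_s if i == count - 1 else (i + 1) * length
        segments.append((start, end))
    return segments


if __name__ == "__main__":
    # A 10-hour recording needs three segments of 12000 seconds each.
    print(plan_segments(10 * 60 * 60))
```

Each pair can then be handed to a cutting tool, for example ffmpeg with -ss for the start time and -t for the segment duration.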

Recommendations

  • Best quality — FLAC or WAV at 16kHz or higher sample rate
  • Best compression — MP3 at 128kbps or OGG at 96kbps
  • Browser recordings — WebM with Opus codec (native MediaRecorder output)
  • Mobile recordings — M4A (iOS) or OGG (Android)

WebSocket Streaming Formats

The streaming API supports these formats, optimized for low-latency real-time audio:

  • pcm_s16le (default) — Raw PCM, 16-bit signed, little-endian. No encoding overhead, lowest latency. Requires setting the correct sample_rate parameter.
  • wav — WAV container format. Send the WAV header with the first chunk; subsequent chunks are raw PCM.
  • webm_opus — WebM container with Opus codec. Best for browser-based streaming using MediaRecorder.
  • opus_raw_16k — Raw Opus frames at 16kHz. Low bandwidth, good for mobile applications.
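
The difference between pcm_s16le and wav on the wire can be sketched as follows. The 100 ms chunk duration, the helper names, and the use of a standard 44-byte PCM header are assumptions made for illustration; the list above does not specify a chunk size or which length fields the server expects in a streamed WAV header.

```python
import struct

BYTES_PER_SAMPLE = 2  # pcm_s16le is 16-bit, so 2 bytes per sample


def pcm_chunks(pcm: bytes, sample_rate: int = 16000,
               chunk_ms: int = 100, channels: int = 1):
    """Yield consecutive raw PCM chunks of chunk_ms each (last may be shorter)."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE * channels
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


def wav_header(sample_rate: int = 16000, channels: int = 1,
               data_bytes: int = 0) -> bytes:
    """Build a standard 44-byte header for 16-bit PCM WAV data."""
    byte_rate = sample_rate * channels * BYTES_PER_SAMPLE
    block_align = channels * BYTES_PER_SAMPLE
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_bytes, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate, byte_rate, block_align, 16,
        b"data", data_bytes,
    )


def wav_stream_chunks(pcm: bytes, sample_rate: int = 16000,
                      chunk_ms: int = 100, channels: int = 1):
    """Yield chunks for the wav format: header plus PCM first, then raw PCM."""
    header = wav_header(sample_rate, channels, len(pcm))
    first = True
    for chunk in pcm_chunks(pcm, sample_rate, chunk_ms, channels):
        yield (header + chunk) if first else chunk
        first = False


if __name__ == "__main__":
    one_second = bytes(16000 * BYTES_PER_SAMPLE)  # 1 s of silence at 16 kHz mono
    print(len(list(pcm_chunks(one_second))))      # number of 100 ms chunks
    print(len(next(wav_stream_chunks(one_second))))
```

For a live source the total data length is not known in advance, so the value to put in the header's size fields is something to confirm against the Streaming guide.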

Format Selection

Choose the right streaming format for your use case:

  • Server-side processing — Use pcm_s16le for lowest latency and simplest implementation
  • Browser applications — Use webm_opus for native MediaRecorder compatibility
  • Mobile / low bandwidth — Use opus_raw_16k for compressed audio with minimal overhead

Sample Rate Guidelines

The sample rate affects transcription quality and bandwidth:

  • 16000 Hz (recommended) — Optimal for speech recognition. Best accuracy-to-bandwidth ratio.
  • 8000 Hz — Telephony quality. Acceptable for voice calls but lower accuracy.
  • 44100 Hz / 48000 Hz — High-quality audio. Automatically downsampled to 16kHz internally. Uses more bandwidth with no accuracy benefit.
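
The bandwidth cost of each sample rate is easy to estimate for uncompressed pcm_s16le audio: bits per second equal sample rate times channels times 16. The sketch below is plain arithmetic; pcm_bandwidth_kbps is a hypothetical helper, and the figures do not apply to compressed formats such as Opus.

```python
def pcm_bandwidth_kbps(sample_rate: int, channels: int = 1,
                       bits_per_sample: int = 16) -> float:
    """Return the bitrate of uncompressed PCM in kilobits per second."""
    return sample_rate * channels * bits_per_sample / 1000


if __name__ == "__main__":
    for rate in (8000, 16000, 44100, 48000):
        print(rate, "Hz ->", pcm_bandwidth_kbps(rate), "kbps")
    # 8000 Hz -> 128.0 kbps
    # 16000 Hz -> 256.0 kbps
    # 44100 Hz -> 705.6 kbps
    # 48000 Hz -> 768.0 kbps
```

Mono 48000 Hz PCM therefore uses three times the bandwidth of 16000 Hz, and the audio is downsampled to 16kHz on arrival anyway.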

Converting Audio Formats

Use ffmpeg to convert audio to a supported format:

# Convert any format to 16kHz WAV (best for upload)
ffmpeg -i input.mp4 -ar 16000 -ac 1 output.wav

# Convert to MP3 (good compression)
ffmpeg -i input.wav -ar 16000 -ac 1 -b:a 128k output.mp3

# Convert to FLAC (lossless compression)
ffmpeg -i input.wav -ar 16000 -ac 1 output.flac

# Extract raw PCM for streaming
ffmpeg -i input.wav -f s16le -ar 16000 -ac 1 output.pcm
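
The same conversions can be driven from a script. This is a minimal sketch that assumes ffmpeg is installed and on the PATH; ffmpeg_command and convert are hypothetical helper names, and the flags are the ones used in the commands above.

```python
import subprocess


def ffmpeg_command(src: str, dst: str, sample_rate: int = 16000,
                   raw_pcm: bool = False) -> list:
    """Build the ffmpeg argument list for a mono conversion.

    With raw_pcm=True the output is headerless 16-bit little-endian PCM,
    matching the pcm_s16le streaming format.
    """
    cmd = ["ffmpeg", "-i", src]
    if raw_pcm:
        cmd += ["-f", "s16le"]
    cmd += ["-ar", str(sample_rate), "-ac", "1", dst]
    return cmd


def convert(src: str, dst: str, **options) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_command(src, dst, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so no input file is needed.
    print(" ".join(ffmpeg_command("input.wav", "output.pcm", raw_pcm=True)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.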

See the ASR guide for file upload examples, or the Streaming guide for real-time audio streaming.