Speech recognition, voice activity detection, punctuation restoration, speaker diarization, emotion detection, and audio event recognition — one unified Python API handles it all. 50+ languages, self-hosted, production-ready.
A complete speech understanding pipeline from raw audio to structured output — all in one call
End-to-end ASR supporting 50+ languages, including 7 Chinese dialects and 26 regional accents, with automatic language detection
Millisecond-precision VAD with adaptive silence thresholds, accurately segmenting speech from silence
Automatically adds punctuation and applies inverse text normalization, producing readable formatted text
Identifies who said what, labeling each sentence with a speaker ID — ideal for meetings and interviews
Recognizes emotional states — happy, sad, angry, neutral — for customer service QA and sentiment analysis
Detects background music, applause, laughter, crying, and other acoustic events for full scene understanding
Three steps: Install → Choose your scenario → Call
Python 3.8+ · GPU 8GB+ · Linux / macOS
Ideal for meeting recordings, video subtitles, and batch processing. Automatically includes VAD segmentation, punctuation, timestamps, and speaker labels.
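For the video subtitle use case, timestamped and speaker-labeled segments can be turned into an SRT file with a few lines of Python. The sketch below is only an illustration: the segment fields `start_ms`, `end_ms`, `speaker`, and `text` are hypothetical names chosen for this example, not the documented output schema, so adapt them to whatever the offline call actually returns.

```python
from typing import Dict, List


def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Render segments as SRT cues, prefixing each line with its speaker ID.

    Each segment is assumed (hypothetically) to carry the keys
    start_ms, end_ms, speaker, and text.
    """
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start_ms'])} --> {srt_timestamp(seg['end_ms'])}\n"
            f"[Speaker {seg['speaker']}] {seg['text']}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Made-up sample data, used only to exercise the formatter.
sample_segments = [
    {"start_ms": 0, "end_ms": 1500, "speaker": 0, "text": "Hello, everyone."},
    {"start_ms": 1800, "end_ms": 4200, "speaker": 1, "text": "Thanks for joining."},
]
srt_text = to_srt(sample_segments)
```

Writing `srt_text` to a `.srt` file next to the video is enough for most players to pick it up.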
For live captions, broadcast transcription, and voice assistants. The WebSocket-based protocol separates confirmed text, which never changes once emitted, from in-progress text, which keeps updating as more audio arrives.
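On the client side, the split between confirmed and still-changing text maps naturally onto a small buffer. The following is a minimal sketch under stated assumptions: the message shape (`{"type": "final" | "partial", "text": ...}`) is hypothetical and stands in for whatever the streaming service actually sends over the WebSocket.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CaptionBuffer:
    """Keeps confirmed text append-only and the in-progress text replaceable."""

    confirmed: List[str] = field(default_factory=list)
    pending: str = ""

    def apply(self, message: Dict[str, str]) -> str:
        """Apply one message (hypothetical schema) and return the caption to display."""
        kind = message["type"]
        if kind == "final":
            # Confirmed text is appended once and never rewritten.
            self.confirmed.append(message["text"])
            self.pending = ""
        elif kind == "partial":
            # In-progress text replaces the previous hypothesis entirely.
            self.pending = message["text"]
        else:
            raise ValueError(f"unknown message type: {kind!r}")
        return self.render()

    def render(self) -> str:
        """Join confirmed sentences, followed by the current hypothesis if any."""
        parts = self.confirmed + ([self.pending] if self.pending else [])
        return " ".join(parts)


# Simulated message stream, standing in for frames read from the socket.
buffer = CaptionBuffer()
for msg in [
    {"type": "partial", "text": "hel"},
    {"type": "partial", "text": "hello wor"},
    {"type": "final", "text": "Hello world."},
    {"type": "partial", "text": "how"},
]:
    caption = buffer.apply(msg)
```

Because only `pending` is ever overwritten, a UI can render confirmed text in a stable style and the hypothesis in a lighter one without flicker in the confirmed part.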
Standard /v1/audio/transcriptions endpoint. LangChain, AutoGen, Dify, and Coze can connect directly without any code modifications.
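A request to this endpoint can be assembled with the Python standard library alone. The sketch below assumes the server accepts the same multipart form fields as the OpenAI transcription API (`file` and `model`); the base URL `http://localhost:8000` and the model name passed in are placeholders, not values confirmed by this page.

```python
import urllib.request
import uuid


def build_transcription_request(
    base_url: str, audio_bytes: bytes, filename: str, model: str
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to /v1/audio/transcriptions.

    The field names "file" and "model" follow the OpenAI-style convention and
    are an assumption about what the server expects.
    """
    boundary = uuid.uuid4().hex
    delimiter = f"--{boundary}\r\n".encode()
    body = b"".join(
        [
            delimiter,
            b'Content-Disposition: form-data; name="model"\r\n\r\n',
            model.encode() + b"\r\n",
            delimiter,
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
            b"Content-Type: application/octet-stream\r\n\r\n",
            audio_bytes + b"\r\n",
            f"--{boundary}--\r\n".encode(),
        ]
    )
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder host, payload, and model name, for illustration only.
request = build_transcription_request(
    "http://localhost:8000/", b"RIFF-fake-audio", "meeting.wav", "Fun-ASR-Nano"
)
```

Passing `request` to `urllib.request.urlopen` sends it to a running server; clients that already speak this endpoint format typically only need their base URL changed.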
184 files / 11,541 seconds / Fun-ASR-Nano
| Model | Engine | RTFx | CER | Notes |
|---|---|---|---|---|
| Fun-ASR-Nano | PyTorch | 21 | 8.06% | Baseline |
| Fun-ASR-Nano | vLLM batch | 340 | 8.20% | 16x speedup |
| Fun-ASR-Nano | Offline service | 102 | 8.14% | Incl. VAD + timestamps |
| GLM-ASR-Nano | vLLM batch | 265 | 12.93% | Community model |
Accuracy stays within 0.2 percentage points of the PyTorch baseline (CER 8.06% vs. 8.14–8.20%), while throughput rises from RTFx 21 to RTFx 102–340, a roughly 5–16x speedup. Full report →
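The Fun-ASR-Nano rows of the table can be cross-checked with a little arithmetic. This sketch assumes RTFx means audio duration divided by processing time (the usual inverse real-time factor; the page itself does not define it) and derives wall-clock time, speedup over the PyTorch baseline, and CER difference from the published figures.

```python
AUDIO_SECONDS = 11_541  # total benchmark audio, from the summary line

# (RTFx, CER in percent) for the Fun-ASR-Nano rows of the table.
RESULTS = {
    "PyTorch": (21, 8.06),
    "vLLM batch": (340, 8.20),
    "Offline service": (102, 8.14),
}


def summarize(results, baseline="PyTorch", audio_seconds=AUDIO_SECONDS):
    """Derive wall-clock time, speedup, and CER delta relative to the baseline."""
    base_rtfx, base_cer = results[baseline]
    summary = {}
    for engine, (rtfx, cer) in results.items():
        summary[engine] = {
            # Assumed definition: RTFx = audio seconds / processing seconds.
            "wall_clock_s": audio_seconds / rtfx,
            "speedup": rtfx / base_rtfx,
            # Difference in percentage points, not relative percent.
            "cer_delta": cer - base_cer,
        }
    return summary


summary = summarize(RESULTS)
for engine, row in summary.items():
    print(
        f"{engine:16s} {row['wall_clock_s']:7.1f} s  "
        f"{row['speedup']:5.1f}x  {row['cer_delta']:+.2f} pt"
    )
```

Under that definition the whole set takes about 550 seconds with PyTorch and about 34 seconds with vLLM batching, which is where the 16x figure comes from; the offline service lands near 5x because it also runs VAD and produces timestamps.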
Watch FunASR real-time speech recognition in action