Industrial-Grade
Speech Recognition

🚀 Flagship: Fun-ASR-Nano · zh/en/ja + Chinese dialects/accents
🌍 31 languages: Fun-ASR-MLT-Nano · separate checkpoint

Speech recognition, voice activity detection, punctuation restoration, speaker diarization, emotion detection, and audio event recognition — one unified Python API handles it all. 50+ languages, self-hosted, production-ready.


50+ Languages

170x Realtime Speed

1 API, Unified Interface

# Start speech recognition service
$ pip install torch torchaudio
$ pip install funasr vllm fastapi uvicorn python-multipart
$ funasr-server --device cuda

# Call with OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8899/v1", api_key="x")
with open("meeting.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="fun-asr-nano",
        file=audio,
    )
print(result.text)

Core Capabilities

A complete speech understanding pipeline from raw audio to structured output — all in one call

🎙️

Speech Recognition

End-to-end ASR supporting 50+ languages, including 7 Chinese dialects and 26 regional accents, with automatic language detection

📍

Voice Activity Detection

Millisecond-precision VAD with adaptive silence thresholds, accurately segmenting speech from silence

✍️

Punctuation Restoration

Automatically adds punctuation and applies inverse text normalization, producing readable formatted text

👥

Speaker Diarization

Identifies who said what, labeling each sentence with a speaker ID — ideal for meetings and interviews

😊

Emotion Detection

Recognizes emotional states — happy, sad, angry, neutral — for customer service QA and sentiment analysis

🔔

Audio Event Recognition

Detects background music, applause, laughter, crying, and other acoustic events for full scene understanding

How to Use

Three steps: Install → Choose your scenario → Call

$ pip install torch torchaudio
$ pip install funasr vllm fastapi uvicorn python-multipart

Python 3.8+ · GPU 8GB+ · Linux / macOS

File Transcription — Upload audio, get complete results

Ideal for meeting recordings, video subtitles, and batch processing. Automatically includes VAD segmentation, punctuation, timestamps, and speaker labels.

# Start offline transcription service (works after pip install)
$ funasr-server --device cuda --port 8899

# Call (curl)
$ curl -X POST http://localhost:8899/v1/audio/transcriptions \
    -F "file=@meeting.wav" -F "model=fun-asr-nano" -F "response_format=verbose_json"

Output
[00:01.7 → 00:05.5] Speaker 0: Let's discuss the three topics today.
[00:05.8 → 00:08.2] Speaker 1: Sounds good. First one is the Q3 plan.
[00:08.5 → 00:12.1] Speaker 0: Go ahead, we have 30 minutes.

Real-time Recognition — Speak and see results instantly

For live captions, broadcast transcription, and voice assistants. The protocol is WebSocket-based: confirmed text is final and never changes, while live text for the sentence in progress keeps updating until it is confirmed.

# Streaming requires source code (not in pip yet)
$ git clone https://github.com/modelscope/FunASR.git && cd FunASR
$ python examples/industrial_data_pretraining/fun_asr_nano/serve_realtime_ws.py --port 10095 --language 中文

# Open the built-in browser client
$ open client_mic.html

# Or connect via Python
$ python client_python.py --server ws://localhost:10095 --mic

Real-time output (updates progressively)
[live] Let's discuss the...
[confirmed] Let's discuss the three topics today.
[live] Sounds good first...
[confirmed] Sounds good. First one is the Q3 plan.
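
The confirmed/live behaviour above can be sketched on the client side as a small buffer. This is only an illustration: the `("live", text)` / `("confirmed", text)` event tuples and the `CaptionBuffer` class are hypothetical names, and the actual WebSocket message format of the server is not described on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaptionBuffer:
    """Holds finalized sentences plus the current, still-changing partial."""
    confirmed: List[str] = field(default_factory=list)
    live: str = ""

    def apply(self, kind: str, text: str) -> None:
        # Hypothetical event kinds; the real server messages may differ.
        if kind == "confirmed":
            self.confirmed.append(text)  # final text is never rewritten
            self.live = ""               # the partial it replaces is dropped
        elif kind == "live":
            self.live = text             # each partial overwrites the last
        else:
            raise ValueError(f"unknown event kind: {kind!r}")

    def render(self) -> str:
        # Confirmed sentences first, then the in-progress partial (if any).
        parts = self.confirmed + ([self.live] if self.live else [])
        return " ".join(parts)


# Replay the first three events from the sample output above.
events = [
    ("live", "Let's discuss the..."),
    ("confirmed", "Let's discuss the three topics today."),
    ("live", "Sounds good first..."),
]
buf = CaptionBuffer()
for kind, text in events:
    buf.apply(kind, text)
print(buf.render())
```

Keeping confirmed text in an append-only list means a UI only ever has to redraw the trailing live portion.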

API Integration — OpenAI-compatible, zero-code changes for AI frameworks

Standard /v1/audio/transcriptions endpoint. LangChain, AutoGen, Dify, and Coze can connect directly without any code modifications.

# Start OpenAI-compatible API
$ funasr-server --device cuda

# Python (identical to OpenAI Whisper API)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8899/v1", api_key="x")
with open("audio.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="fun-asr-nano",
        file=audio,
        response_format="verbose_json",
    )
print(result.text)

JSON Response
{"text": "Let's discuss the three topics today.", "segments": [{"start": 1.7, "end": 5.5, "text": "..."}], "duration": 12.1}
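
As a rough sketch of consuming this response, the snippet below turns the `segments` array into timestamped lines in the style of the file transcription output. It assumes only the fields visible in the sample (`start`, `end`, `text`); the helper names `fmt_ts` and `format_segments` are made up for this example, and speaker labels are left out because the sample JSON does not show how they are encoded.

```python
import json


def fmt_ts(seconds: float) -> str:
    """Render seconds as MM:SS.s with one decimal place."""
    tenths = round(seconds * 10)          # work in integer tenths of a second
    minutes, rem = divmod(tenths, 600)    # 600 tenths per minute
    return f"{minutes:02d}:{rem // 10:02d}.{rem % 10}"


def format_segments(payload: dict) -> list:
    """Build one '[start → end] text' line per segment."""
    return [
        f"[{fmt_ts(s['start'])} → {fmt_ts(s['end'])}] {s['text']}"
        for s in payload.get("segments", [])
    ]


# The sample response from this page, kept verbatim (segment text is elided there).
raw = """{"text": "Let's discuss the three topics today.",
 "segments": [{"start": 1.7, "end": 5.5, "text": "..."}],
 "duration": 12.1}"""

payload = json.loads(raw)
lines = format_segments(payload)
for line in lines:
    print(line)
```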

Explore More

🔥 Hotword Boosting · 👥 Speaker Diarization · 😊 Emotion Detection · 🌍 Multilingual · 🎯 Fine-tuning · ⚡ vLLM Acceleration

Performance

Model	Engine	RTFx	CER	Notes
Fun-ASR-Nano	PyTorch	21	8.06%	Baseline
Fun-ASR-Nano	vLLM batch	340	8.20%	16x speedup over PyTorch baseline
Fun-ASR-Nano	Offline service	102	8.14%	Incl. VAD segment timing
GLM-ASR-Nano	vLLM batch	265	12.93%	Community model
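
To make the RTFx column concrete, here is a small back-of-the-envelope calculation. It assumes RTFx means audio duration divided by processing time (the page does not define it), so a higher value means faster transcription; the function name `processing_seconds` is just for illustration.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated wall-clock time, assuming RTFx = audio duration / processing time."""
    return audio_seconds / rtfx


# RTFx values for Fun-ASR-Nano, copied from the table above.
rtfx_by_engine = {"PyTorch": 21, "vLLM batch": 340, "Offline service": 102}

one_hour = 3600  # seconds of audio
for engine, rtfx in rtfx_by_engine.items():
    secs = processing_seconds(one_hour, rtfx)
    print(f"{engine}: about {secs:.1f} s per hour of audio")

# Ratio behind the "16x speedup" note in the table.
speedup = rtfx_by_engine["vLLM batch"] / rtfx_by_engine["PyTorch"]
print(f"vLLM batch vs PyTorch: {speedup:.1f}x")
```

Under that assumption, the batch figure is a throughput number; the latency of a single short file may look quite different.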
