ComfyUI Qwen3 ASR
A state-of-the-art "All-in-One" speech intelligence integration for ComfyUI, delivering SOTA accuracy across 52 languages and dialects.
Project Link: https://github.com/kaushiknishchay/ComfyUI-Qwen3-ASR
Overview
ComfyUI-Qwen3-ASR integrates the Qwen3-ASR model family into ComfyUI, providing high-performance, multilingual speech recognition. Built on the Qwen3-Omni foundation, it offers unified Language Identification (LID) and Automatic Speech Recognition (ASR) with support for both real-time streaming and high-throughput batch processing.
Problem
Existing ASR solutions often struggled with complex acoustic environments (noise, music, singing) and lacked the efficiency required for local high-concurrency workflows.
Constraints
- Must maintain low latency (under 100ms for streaming)
- Support for 52+ languages including regional dialects
- Run on consumer-grade NVIDIA GPUs (CUDA)
Approach
Developed a high-performance ComfyUI wrapper for the Qwen3-ASR-1.7B and 0.6B models. Utilized the Whisper-style encoder and Qwen3-based decoder architecture to provide robust transcription with FlashAttention 2 optimization.
Key Decisions
Unified LID and ASR
Integrating Language Identification and Transcription into a single pass significantly reduces overhead and simplifies multi-modal workflows.
Group Sequence Policy Optimization (GSPO)
Leveraging GSPO reinforcement learning enhances transcription stability and noise robustness, ensuring reliability in non-ideal recording environments.
NAR Forced Aligner Integration
Incorporating the non-autoregressive (NAR) Qwen3-ForcedAligner-0.6B enables precise word-level timestamps, critical for automated subtitling and animation sync.
Tech Stack
- Python
- PyTorch
- Qwen3 LLM
- FlashAttention 2
- vLLM
- ComfyUI
Result & Impact
- 52Supported Languages
- 92msStreaming Latency
- 2000xMax Throughput
Outperforms proprietary commercial APIs in multilingual accuracy and noise robustness, enabling professional-grade local transcription workflows.
Learnings
- Unified "All-in-One" models significantly outperform pipelined LID+ASR systems in both speed and accuracy.
- Reinforcement learning (GSPO) is transformative for speech model stability.
Detailed case study on integrating the January 2026 Qwen3 speech stack into ComfyUI.