ComfyUI Qwen3 ASR


An "All-in-One" speech intelligence integration for ComfyUI, delivering state-of-the-art accuracy across 52 languages and dialects.

Project Link: https://github.com/kaushiknishchay/ComfyUI-Qwen3-ASR


Overview

ComfyUI-Qwen3-ASR integrates the Qwen3-ASR model family into ComfyUI, providing high-performance, multilingual speech recognition. Built on the Qwen3-Omni foundation, it offers unified Language Identification (LID) and Automatic Speech Recognition (ASR) with support for both real-time streaming and high-throughput batch processing.

Problem

Existing ASR solutions often struggled with complex acoustic environments (noise, music, singing) and lacked the efficiency required for local high-concurrency workflows.

Constraints

Maintain low latency (under 100 ms for streaming)
Support 52 languages and dialects, including regional dialects
Run on consumer-grade NVIDIA GPUs (CUDA)
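
One way to keep the streaming constraint honest during development is to time each audio chunk against the 100 ms budget. The sketch below is illustrative only: it assumes latency is measured as per-chunk processing time, and `process_chunk` is a hypothetical stand-in for whatever streaming call the node makes, not a function from the project.

```python
import time
from typing import Callable, List, Sequence

LATENCY_BUDGET_MS = 100.0  # streaming budget stated in the constraints


def chunk_samples(samples: Sequence[float], chunk_size: int) -> List[Sequence[float]]:
    """Split a flat sequence of samples into consecutive fixed-size chunks."""
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def measure_chunk_latencies(
    chunks: List[Sequence[float]],
    process_chunk: Callable[[Sequence[float]], object],  # hypothetical model call
) -> List[float]:
    """Return the wall-clock processing time of each chunk in milliseconds."""
    latencies_ms = []
    for chunk in chunks:
        start = time.perf_counter()
        process_chunk(chunk)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def within_budget(latencies_ms: List[float], budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True only if every chunk finished within the latency budget."""
    return all(latency <= budget_ms for latency in latencies_ms)
```

Checking the worst chunk rather than the average matters here, because a single slow chunk is what a listener notices in a live transcript.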

Approach

Developed a high-performance ComfyUI wrapper for the Qwen3-ASR-1.7B and Qwen3-ASR-0.6B models. Used the models' Whisper-style encoder and Qwen3-based decoder architecture, with FlashAttention 2 optimization, to provide robust transcription.
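
For readers unfamiliar with ComfyUI custom nodes, the wrapper pattern can be sketched as follows. This is a minimal skeleton, not the project's actual code: the class name, the `language` input, the category, and `_placeholder_transcriber` are all hypothetical, and the real model loading and inference are deliberately left out. Only the node conventions (`INPUT_TYPES`, `RETURN_TYPES`, `FUNCTION`, `NODE_CLASS_MAPPINGS`) and the `AUDIO` dict keys follow ComfyUI's standard interface.

```python
from typing import Any, Dict, Tuple


def _placeholder_transcriber(audio: Dict[str, Any], model_name: str, language: str) -> Tuple[str, str]:
    """Hypothetical stand-in for the real model call; returns (language, text)."""
    return (language if language != "auto" else "unknown", "")


class Qwen3ASRTranscribeSketch:
    """Illustrative ComfyUI node: AUDIO in, detected language and text out."""

    # Model names come from the text; how the project loads them is not shown.
    MODELS = ["Qwen3-ASR-1.7B", "Qwen3-ASR-0.6B"]

    # Swappable hook so the real inference function can replace the placeholder.
    transcriber = staticmethod(_placeholder_transcriber)

    @classmethod
    def INPUT_TYPES(cls) -> Dict[str, Any]:
        return {
            "required": {
                "audio": ("AUDIO",),
                "model_name": (cls.MODELS,),  # rendered as a dropdown
                "language": ("STRING", {"default": "auto"}),
            }
        }

    RETURN_TYPES = ("STRING", "STRING")
    RETURN_NAMES = ("language", "text")
    FUNCTION = "transcribe"
    CATEGORY = "audio/asr"  # assumed menu location

    def transcribe(self, audio: Dict[str, Any], model_name: str, language: str) -> Tuple[str, str]:
        # ComfyUI passes AUDIO as a dict holding a waveform and its sample rate.
        if "waveform" not in audio or "sample_rate" not in audio:
            raise ValueError("expected an AUDIO dict with 'waveform' and 'sample_rate'")
        detected, text = self.transcriber(audio, model_name, language)
        return (detected, text)


# ComfyUI discovers nodes through these module-level mappings.
NODE_CLASS_MAPPINGS = {"Qwen3ASRTranscribeSketch": Qwen3ASRTranscribeSketch}
NODE_DISPLAY_NAME_MAPPINGS = {"Qwen3ASRTranscribeSketch": "Qwen3 ASR Transcribe (sketch)"}
```

Returning the language and the text as two outputs of one node mirrors the single-pass LID and ASR design, so downstream nodes can branch on language without a second model call.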

Key Decisions

Unified LID and ASR

Reasoning:

Integrating Language Identification and Transcription into a single pass significantly reduces overhead and simplifies multi-modal workflows.

Group Sequence Policy Optimization (GSPO)

Reasoning:

Leveraging GSPO reinforcement learning enhances transcription stability and noise robustness, ensuring reliability in non-ideal recording environments.

NAR Forced Aligner Integration

Reasoning:

Incorporating the non-autoregressive (NAR) Qwen3-ForcedAligner-0.6B enables precise word-level timestamps, critical for automated subtitling and animation sync.
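
To show why word-level timestamps matter for subtitling, here is a small sketch that turns aligned words into SRT cues. It assumes the aligner output has already been converted to `(text, start_seconds, end_seconds)` tuples; that tuple layout and the `max_words` grouping rule are assumptions for illustration, not the aligner's actual output format.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds), assumed layout


def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words and emit SRT text."""
    cues = []
    for index, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]  # first word start, last word end
        text = " ".join(word[0] for word in group)
        cues.append(f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n")
    return "\n".join(cues)
```

Example usage:

```python
words = [("hello", 0.0, 0.4), ("world", 0.5, 0.9), ("again", 1.0, 1.5)]
print(words_to_srt(words, max_words=2))
```

A production subtitle writer would also split on pauses and punctuation, but the cue boundaries would still come directly from the per-word start and end times.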

Tech Stack

Python
PyTorch
Qwen3 LLM
FlashAttention 2
vLLM
ComfyUI

Result & Impact

Supported Languages: 52
Streaming Latency: 92 ms
Max Throughput: 2000x

Outperforms proprietary commercial APIs in multilingual accuracy and noise robustness, enabling professional-grade local transcription workflows.

Learnings

Unified "All-in-One" models significantly outperform pipelined LID+ASR systems in both speed and accuracy.
Reinforcement learning (GSPO) is transformative for speech model stability.

Detailed case study on integrating the January 2026 Qwen3 speech stack into ComfyUI.
