ComfyUI Qwen3 ASR

An "All-in-One" speech intelligence integration for ComfyUI, delivering state-of-the-art accuracy across 52 languages and dialects.

Project Link: https://github.com/kaushiknishchay/ComfyUI-Qwen3-ASR

Overview

ComfyUI-Qwen3-ASR integrates the Qwen3-ASR model family into ComfyUI, providing high-performance, multilingual speech recognition. Built on the Qwen3-Omni foundation, it offers unified Language Identification (LID) and Automatic Speech Recognition (ASR) with support for both real-time streaming and high-throughput batch processing.

Problem

Existing ASR solutions often struggled with complex acoustic environments (noise, music, singing) and lacked the efficiency required for local high-concurrency workflows.

Constraints

  • Maintain low streaming latency (under 100 ms)
  • Support 52 languages and regional dialects
  • Run on consumer-grade NVIDIA GPUs (CUDA)
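One way to check the streaming latency constraint is to time each audio chunk against the 100 ms budget. The sketch below assumes latency means wall-clock time per chunk; `process_chunk` is a hypothetical stand-in for the real streaming call, and the names are illustrative rather than part of the project.

```python
import time
from typing import Callable, Iterable, List

LATENCY_BUDGET_MS = 100.0  # streaming budget taken from the constraint list


def chunk_latencies_ms(
    chunks: Iterable[bytes],
    process_chunk: Callable[[bytes], str],  # hypothetical streaming ASR call
) -> List[float]:
    """Return the wall-clock time, in milliseconds, spent on each chunk."""
    latencies = []
    for chunk in chunks:
        started = time.perf_counter()
        process_chunk(chunk)
        latencies.append((time.perf_counter() - started) * 1000.0)
    return latencies


def within_budget(latencies_ms: List[float], budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True when every measured chunk stayed under the budget."""
    return all(latency < budget_ms for latency in latencies_ms)
```

A real measurement would feed recorded audio chunks through the model on the target GPU and look at the worst case, not the average.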

Approach

Developed a high-performance ComfyUI wrapper for the Qwen3-ASR-1.7B and Qwen3-ASR-0.6B models. The wrapper relies on the models' Whisper-style encoder and Qwen3-based decoder architecture, with FlashAttention 2 optimization, to provide robust transcription.
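To make the wrapper idea concrete, here is a minimal sketch of what a ComfyUI node around an ASR backend could look like. It follows ComfyUI's custom-node conventions (`INPUT_TYPES`, `RETURN_TYPES`, `FUNCTION`, `NODE_CLASS_MAPPINGS`), but the class name, the category, and the `backend` callable are assumptions made for illustration and are not taken from the project's source.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical backend signature: (waveform, sample_rate, model_name) -> (language, text).
TranscribeFn = Callable[[Any, int, str], Tuple[str, str]]


def _placeholder_backend(waveform: Any, sample_rate: int, model_name: str) -> Tuple[str, str]:
    """Stand-in for real inference; replace with the actual model call."""
    raise NotImplementedError("wire this to the real Qwen3-ASR inference code")


class Qwen3ASRTranscribeSketch:
    """Illustrative ComfyUI node: AUDIO in, (language, text) out."""

    CATEGORY = "audio/asr"
    RETURN_TYPES = ("STRING", "STRING")
    RETURN_NAMES = ("language", "text")
    FUNCTION = "transcribe"

    # Swappable so the node can be exercised without model weights.
    backend: TranscribeFn = staticmethod(_placeholder_backend)

    @classmethod
    def INPUT_TYPES(cls) -> Dict[str, Any]:
        return {
            "required": {
                "audio": ("AUDIO",),
                "model": (["Qwen3-ASR-1.7B", "Qwen3-ASR-0.6B"],),
            }
        }

    def transcribe(self, audio: Dict[str, Any], model: str) -> Tuple[str, str]:
        # ComfyUI AUDIO values are dicts holding a waveform and its sample rate.
        language, text = self.backend(audio["waveform"], audio["sample_rate"], model)
        return (language, text)


# ComfyUI discovers nodes through these module-level mappings.
NODE_CLASS_MAPPINGS = {"Qwen3ASRTranscribeSketch": Qwen3ASRTranscribeSketch}
NODE_DISPLAY_NAME_MAPPINGS = {"Qwen3ASRTranscribeSketch": "Qwen3 ASR Transcribe (sketch)"}
```

Keeping the inference call behind a replaceable `backend` is a design choice of this sketch: it lets the node wiring be tested with a fake function before any model is loaded.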

Key Decisions

Unified LID and ASR

Reasoning:

Integrating Language Identification and Transcription into a single pass significantly reduces overhead and simplifies multi-modal workflows.

Group Sequence Policy Optimization (GSPO)

Reasoning:

Leveraging GSPO reinforcement learning enhances transcription stability and noise robustness, ensuring reliability in non-ideal recording environments.

NAR Forced Aligner Integration

Reasoning:

Incorporating the non-autoregressive (NAR) Qwen3-ForcedAligner-0.6B enables precise word-level timestamps, critical for automated subtitling and animation sync.
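As an illustration of the subtitling use case, the sketch below turns word-level timestamps into SubRip (SRT) cues. The `Word` structure and the fixed words-per-cue grouping are assumptions for the example; the aligner's actual output format is not described here and would need to be mapped into this shape.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group consecutive words into numbered cues of at most max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        index = len(cues) + 1
        span = f"{srt_time(group[0].start)} --> {srt_time(group[-1].end)}"
        line = " ".join(w.text for w in group)
        cues.append(f"{index}\n{span}\n{line}\n")
    # Cues are separated by a blank line, as SRT expects.
    return "\n".join(cues)
```

A production subtitler would more likely break cues on pauses or punctuation than on a fixed word count; the fixed count keeps the example short.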

Tech Stack

  • Python
  • PyTorch
  • Qwen3 LLM
  • FlashAttention 2
  • vLLM
  • ComfyUI

Result & Impact

  • 52 supported languages
  • 92ms streaming latency
  • 2000x max throughput

Outperforms proprietary commercial APIs in multilingual accuracy and noise robustness, enabling professional-grade local transcription workflows.
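To give the 2000x throughput figure some intuition, the short calculation below assumes it means 2000 seconds of audio processed per second of wall-clock time, aggregated across concurrent requests. That reading is an assumption; the figure is not defined more precisely here.

```python
def wall_clock_seconds(audio_seconds: float, speedup: float) -> float:
    """Time needed to process audio_seconds of audio at a given speedup over real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


one_hour = 3600.0
print(wall_clock_seconds(one_hour, 2000.0))  # 1.8 (seconds for one hour of audio)
```

Under that assumption, an hour of audio would take about 1.8 seconds at peak throughput, although a single request would not necessarily see that speed.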

Learnings

  • Unified "All-in-One" models significantly outperform pipelined LID+ASR systems in both speed and accuracy.
  • Reinforcement learning (GSPO) is transformative for speech model stability.

Detailed case study on integrating the January 2026 Qwen3 speech stack into ComfyUI.