SmartAttentionDispatcher — ComfyUI node that patches model attention with SageAttention
# 1. What is it and why
A node that replaces PyTorch SDPA with SageAttention kernels (SA2 / SA3) without restarting ComfyUI and without launch flags. Automatically detects GPU architecture, installed libraries, and available kernels. Shows active mode, GPU tier, SA2/SA3 availability, and model architecture in the node status panel after each run.
Inspired by Kijai's node, SmartAttentionDispatcher extends it with additional capabilities: specific kernel selection, dynamic combine mode, and support for models that import attention locally (ErnieImage, Qwen, ACE-Step).
https://preview.redd.it/5b7moef2th0h1.png?width=804&format=png&auto=webp&s=2c68bfffbd5d9b070532ad3d96634b28a77edb05
Recommended launch flag: `--fast`
⚠️ Do not use `--use-sage-attention` together with this node — it conflicts with the patching mechanism.
# 2. Model patching specifics
Most DiT models (Flux, SD3.5, Z-Image, LTX, Wan) are patched through the standard ComfyUI `transformer_options` mechanism. However, some models import `optimized_attention` locally at module load time, so a regular patch does not reach them. For these models the node additionally scans `sys.modules` and patches every reference it finds. Confirmed for ErnieImage, Qwen-Image/Edit, and ACE-Step.
SDXL (UNet architecture) is also supported via SA2, though the speed gain is minimal — SDXL sequences are too short for SageAttention to provide an advantage.
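For models that bind `optimized_attention` at import time, a `sys.modules` scan of this general shape can rebind the stale references. This is an illustrative sketch only, not the node's actual code; `patch_local_attention` is a hypothetical name:

```python
import sys
import types

def patch_local_attention(original_fn, replacement_fn):
    """Scan sys.modules and rebind every module-level reference to
    original_fn so it points at replacement_fn instead. Returns the
    list of patched 'module.attribute' names."""
    patched = []
    for mod_name, module in list(sys.modules.items()):
        if not isinstance(module, types.ModuleType):
            continue  # sys.modules may hold None placeholders
        for attr, value in list(vars(module).items()):
            if value is original_fn:
                setattr(module, attr, replacement_fn)
                patched.append(f"{mod_name}.{attr}")
    return patched
```

The identity check (`is`) only rewrites references to the exact original function object, so unrelated attributes are never touched.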
⚠️ Qwen 2512 in SA3 mode produces results that do not match the prompt — unstable FP4 math at long sequences (seq > 7000). SA2 on Qwen works correctly.
# 3. Modes
When `sdpa=False` and all other parameters are disabled, the node changes nothing — standard PyTorch SDPA runs. When `sdpa=True`, SDPA also runs, but all other node settings are forcibly ignored.
SA2 — SageAttention2 on all steps. Kernels: `auto`, `fp16`, `fp8`, `fp8++`, `triton`. `auto` selects the best kernel for your GPU automatically.
SA3 — SageAttention3 on all steps. Blackwell only (RTX 50xx), CUDA 12.8+, and the separate sageattn3 package. Requires Python 3.10+.
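The SA3 requirements above boil down to a capability check along these lines (a hypothetical helper, not the node's actual detection logic; the SM 12.x value for consumer Blackwell matches the benchmark GPU below):

```python
def sa3_available(compute_capability, cuda_version, has_sageattn3_pkg):
    """SA3 needs a Blackwell GPU (SM 12.x, i.e. RTX 50xx),
    CUDA 12.8 or newer, and the separate sageattn3 package."""
    sm_major, _ = compute_capability
    return sm_major == 12 and cuda_version >= (12, 8) and has_sageattn3_pkg
```

At runtime the inputs would typically come from `torch.cuda.get_device_capability()` and a parsed `torch.version.cuda` string.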
Combine (dynamic mode) — switches between SA2 and SA3 depending on the diffusion step. First and last step — SA2 (or SDPA if SA2 is also disabled), middle steps — SA3. Displayed in the node as `SA2-SA3-SA2` or `SDPA-SA3-SDPA`.
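The combine schedule amounts to a per-step backend choice; a minimal sketch (illustrative names, not the node's API):

```python
def pick_backend(step, total_steps, sa2_enabled=True):
    """Combine mode: first and last step use SA2 (or SDPA when SA2 is
    disabled), all middle steps use SA3."""
    edge = "SA2" if sa2_enabled else "SDPA"
    if step == 0 or step == total_steps - 1:
        return edge
    return "SA3"
```

Over a 30-step run this produces the `SA2-SA3-SA2` (or `SDPA-SA3-SDPA`) pattern shown in the node status panel.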
How to connect in workflow: The node is placed directly before KSampler — after model loading, after applying LoRA, after any nodes that shift or modify the model. Input `model` → output `model`. The node detects the architecture and applies the patch automatically.
# 4. Tested models
|Model|SA2|SA3|Patch|Notes|
|:-|:-|:-|:-|:-|
|SDXL 1.0|✅|—|transformer\_options|SA3 not tested on UNet, minimal gain|
|SD3.5|✅|✅|transformer\_options|cross-attn layers auto-fallback to SDPA|
|Flux.1 dev (Kontext, Krea)|✅|✅|transformer\_options|—|
|Flux.2 dev (Klein)|✅|✅|transformer\_options|—|
|Z-Image turbo|✅|✅|transformer\_options|—|
|Qwen-Image 2512 / Edit 2511|✅|⚠️|sys.modules|SA3 unstable at long sequences|
|ERNIE-Image turbo|✅|✅|sys.modules|—|
|LTX 2.3 (dev, distilled)|✅|✅|transformer\_options|—|
|Wan2.2|✅|⚠️|transformer\_options|SA3 OOM at 1280x720 on 16GB VRAM|
|HunyuanVideo 1.5|✅|—|transformer\_options|not fully tested|
|ACE-Step 1.5|—|—|sys.modules|may work, not tested|
# 5. Image generation benchmark
Model: `flux-2-klein-base-9b-fp8` \+ `qwen_3_8b_fp8mixed` text encoder
Settings: 896×1152, 30 steps, dpmpp\_2m\_sde, cfg=5
GPU: RTX 5060 Ti 16GB | PyTorch 2.11.0+cu130 | Python 3.14.4 | SM 12.0 Blackwell
Why this model: at 9GB it fits entirely in VRAM, attention is the real bottleneck, and the numbers are clean, with no RAM/VRAM swap overhead.
18 images split into rows:
Row SDPA
https://preview.redd.it/si9nwf08th0h1.png?width=896&format=png&auto=webp&s=1a12c88246dced527d48353c25d6740102aa9ef4
Row SA2: fp8, fp8++
https://preview.redd.it/2pocu859th0h1.jpg?width=1822&format=pjpg&auto=webp&s=ce642ac994a89f96a6ba301e8cc73a239aaf1f83
Row SA3: standard, per_block_mean
https://preview.redd.it/396ct36ath0h1.jpg?width=1822&format=pjpg&auto=webp&s=fb49bd85b2632e5a2c83de438f84a7914c691717
Row combine: SA2-SA3-SA2 and SDPA-SA3-SDPA with different kernel combinations
https://preview.redd.it/d8ct5gbbth0h1.jpg?width=2728&format=pjpg&auto=webp&s=ea0f499a320b1becf511efe4c715c4c2a8ada066
https://preview.redd.it/8el7yqbhth0h1.jpg?width=2728&format=pjpg&auto=webp&s=7d1509d4a573c02be7284506cb2cab00fa60d572
Row without node: `--fast`, `--use-sage-attention`, `--fast --use-sage-attention`
https://preview.redd.it/qnwccz7kth0h1.jpg?width=2728&format=pjpg&auto=webp&s=c1a0650562757c14f1a7b914a32923bb7f39a641
https://preview.redd.it/b8rrp37lth0h1.jpg?width=3634&format=pjpg&auto=webp&s=1527b8f451167cfb9feb7890f657fe48a06c54b2
|Mode|Flags|s/it|Total|vs SDPA|
|:-|:-|:-|:-|:-|
|SDPA (baseline)|vanilla|2.42|73.70s|0.0%|
|SA2 fp8|vanilla|2.22|67.48s|\+8.3%|
|SA2 fp8++|vanilla|2.20|66.81s|\+9.1%|
|SA3 standard|vanilla|2.22|67.50s|\+8.3%|
|SA3 per_block_mean|vanilla|2.20|67.00s|\+9.1%|
|SDPA-SA3-SDPA standard|vanilla|2.24|68.36s|\+7.4%|
|SDPA-SA3-SDPA per_block_mean|vanilla|2.24|68.26s|\+7.4%|
|SA2-SA3-SA2 fp8 + standard|vanilla|2.24|68.10s|\+7.4%|
|SA2-SA3-SA2 fp8 + per_block_mean|vanilla|2.24|68.06s|\+7.4%|
|SA2-SA3-SA2 fp8++ + standard|vanilla|2.23|67.74s|\+7.9%|
|SA2-SA3-SA2 fp8++ + per_block_mean|vanilla|2.24|68.03s|\+7.4%|
|SA2 fp8|\--fast --force-channels-last --fp16-intermediates|2.13|64.87s|\+12.0%|
|SA2 fp8++|\--fast --force-channels-last --fp16-intermediates|2.13|64.93s|\+12.0%|
|SA3 standard|\--fast --force-channels-last --fp16-intermediates|2.17|66.26s|\+10.3%|
|SDPA|\--fast|2.39|72.55s|\+1.2%|
|\--use-sage-attention|vanilla|2.11|64.43s|\+12.8%|
|\--use-sage-attention|\--fast|2.08|63.45s|\+14.0%|
|\--use-sage-attention|\--fast --force-channels-last --fp16-intermediates|2.08|63.48s|\+14.0%|
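The "vs SDPA" column matches the percent reduction in s/it against the 2.42 s/it baseline (rather than a throughput ratio); a quick check:

```python
def gain_vs_sdpa(baseline_s_it, mode_s_it):
    """Percent reduction in seconds-per-iteration vs the SDPA baseline."""
    return round(100 * (baseline_s_it - mode_s_it) / baseline_s_it, 1)

# SA2 fp8:   gain_vs_sdpa(2.42, 2.22) -> 8.3
# SA2 fp8++: gain_vs_sdpa(2.42, 2.20) -> 9.1
```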
⚠️ `--force-channels-last` causes crashes with Wan. `--fp16-intermediates` breaks audio in LTX video+audio pipelines. For universal use, only `--fast` is recommended.
# 6. Video models benchmark
|Model|Resolution|SDPA s/it|SA2 fp8++ s/it|Gain|Notes|
|:-|:-|:-|:-|:-|:-|
|ltx-2.3-22b-distilled bf16|1280x720|Ph1: 12.83 / Ph2: 63.75|Ph1: 11.07 / Ph2: 46.89|\+14% / +26%|—|
|Wan2.2 (VAE from Wan2.1)|960x544|Ph1: 126.82 / Ph2: 126.08|Ph1: 60.28 / Ph2: 58.81|\+52% / +53%|—|
|Wan2.2 (VAE from Wan2.1)|1280x720|—|—|—|SA3 per_block_mean OOM (740MB), requires >16GB VRAM + 64GB RAM|
|HunyuanVideo 1.5|1280x720|184s/it|73s/it|\+60%|stopped — unrealistic time for 5s video on 16GB|
# 7. Links
GitHub: https://github.com/Rogala/ComfyUI-rogala
All nodes available via ComfyUI Manager.
Google Drive with test images, videos, workflow and LogicIfElse node:
https://drive.google.com/drive/folders/17jy3g_FTlM09YfM-Fwh5KWNIlvX0UCyc?usp=sharing
LogicIfElse — helper node for conditional model or parameter selection in workflow, not yet in the main repository as it is still being refined.
Built with the assistance of Claude.
https://redd.it/1ta0ewm