LTX 2.3 audio as standalone speech model.

User @wildmindai from X posted about this new model. Has anyone here tried it yet?

Emotional TTS with Scenema Audio.

- Zero-shot expressive voice cloning and speech generation
- 8-step distilled with Gemma 3 12B text encoding
- stage directions via `<action>` tags
- runs at 1.5x real-time on an RTX 4090
- fits in 16GB VRAM
- 13 languages, 48kHz stereo output

It also generates matching environment sounds.

https://huggingface.co/ScenemaAI/scenema-audio

https://redd.it/1tab0tb
@rStableDiffusion
I have to pretend I hate image generation AI to avoid getting banned or insulted on 99% of Reddit or the internet, even though Stable Diffusion is actually what I like and am most excited about right now. Why do people hate AI so much, especially image generation AI?

I'm not even saying I care whether they know the difference between open-source and closed-source image-generation AI, or whether they insult me.

What I want to know is why so many people hate AI, especially image-generating AI.

At first, I thought it only bothered artists. Then I thought it might also bother those who are afraid of not being able to distinguish AI from reality.

But it's practically 99% of people who hate AI, and I just can't understand why.

For example, I've been using Blender for years. I learned to model, sculpt, and animate as an amateur. Thanks to AI, things that used to take me months now take me seconds. Isn't that supposed to be a good thing?

I don't feel bad or like I've wasted my time using Blender; I simply feel fortunate to have found a better tool for what I needed.

EDIT 1: When I say "Stable Diffusion," I mean the open-source model community as a whole (all models), not "SD" specifically.

https://redd.it/1tahphc
@rStableDiffusion
Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline

https://redd.it/1tamqbf
@rStableDiffusion
Why MXFP8 and NVFP4 Actually Matter for Your Home GPU Setup

>**Note:** I'm not a developer or someone working in the AI field professionally. I'm just a regular home user who, like many others here, is trying to understand all these new AI technologies, quantization formats, and terms that are starting to appear everywhere around local image and video generation.
>
>I wanted to collect the information scattered across different articles and technical posts into one practical overview focused on consumer GPUs, ComfyUI, and local AI workflows.
>
>I also used AI assistance to help organize and summarize information from the sources linked below.

# A practical breakdown for ComfyUI, FLUX, WAN Video, and similar workflows

Modern image and video generation models don't hit a wall because of shader count anymore. The real bottlenecks are:

* VRAM capacity
* memory bandwidth
* tensor movement
* cache efficiency

That's exactly why FP8, MXFP8, and NVFP4 were created.

Relevant if you're running locally:

* ComfyUI
* FLUX
* SDXL
* WAN Video
* LTX Video
* Hunyuan
* Qwen Image

# 1. Hardware Support

|Format|RTX 20/30 (Turing/Ampere)|RTX 40 (Ada Lovelace)|RTX 50 (Blackwell)|Server GPUs|
|:-|:-|:-|:-|:-|
|FP16/BF16|Yes|Yes|Yes|Yes|
|FP8|Software fallback only|Yes, native (compute cap. 8.9+)|Yes|Hopper+ (compute cap. 9.0+)|
|MXFP8|No|No|Yes, native|Blackwell|
|NVFP4|No|No native hardware|Yes, native|Blackwell|

The difference between "supported" and "emulated" actually matters a lot here:

* **RTX 20/30 (Turing/Ampere)** — no FP8 tensor cores at all. You can load FP8 models through a software fallback where weights are stored in FP8 but all compute still runs in FP16/BF16. This saves VRAM but gives you zero speedup — and sometimes runs slower due to conversion overhead.
* **RTX 40 (Ada Lovelace)** — native FP8 via 4th-gen Tensor Cores (compute capability 8.9). Real hardware acceleration, real speedup. MXFP8 and NVFP4 have no hardware support here, so running them is just emulation with no meaningful benefit.
* **RTX 50 (Blackwell)** — the first consumer GPU generation with native MXFP8 and NVFP4 through 5th-gen Tensor Cores.
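
To make the "native vs emulated" distinction easy to check on your own machine, here is a small Python sketch (assuming PyTorch is installed) that maps the reported compute capability to the formats from the table above. The mapping is just an illustration of the table, not an official NVIDIA API.

```python
import torch

def native_low_precision_formats(device: int = 0) -> list[str]:
    """Rough mapping from CUDA compute capability to natively accelerated formats."""
    if not torch.cuda.is_available():
        return []
    major, minor = torch.cuda.get_device_capability(device)
    formats = ["FP16/BF16"]           # every generation in the table above
    if (major, minor) >= (8, 9):      # Ada Lovelace (RTX 40) and newer
        formats.append("FP8")
    if major >= 10:                   # Blackwell: cc 10.x (server) / 12.x (consumer)
        formats += ["MXFP8", "NVFP4"]
    return formats

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0), native_low_precision_formats())
```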

# 2. Software Support

|Technology|FP16|FP8|MXFP8|NVFP4|
|:-|:-|:-|:-|:-|
|CUDA 12.6|Stable|Stable|No|No|
|CUDA 12.8|Stable|Stable|Partial|No|
|CUDA 13.0|Stable|Stable|Yes|Yes|
|PyTorch|Full|Broadly usable since 2.2|Stable since 2.10+cu130|Stable since 2.10+cu130|
|TorchAO|Yes|Yes|Yes|Yes|
|TensorRT|Yes|Yes|Partial|Partial|
|Diffusers|Yes|Yes|Emerging|Emerging|
|ComfyUI|Full|Common|Rare|Experimental|

**The honest picture on PyTorch versions:**

* **FP8** — `float8_e4m3fn` and `float8_e5m2` dtype landed in PyTorch **2.1** (experimental), broadly usable from **2.2**, mature from **2.3+**.
* **MXFP8 / NVFP4** — available as a **stable pip install since PyTorch 2.10 + cu130** (January 21, 2026). CUDA 13.0 was promoted to stable in that release with full Blackwell (compute cap. 10.0, 12.0) support. The April 2026 PyTorch blog post used nightlies (`2.12.0.dev+cu130`) only because the authors wanted the latest TorchAO kernels — not because cu130 itself required nightlies.

What MXFP8/NVFP4 workflows need:

* CUDA 13.0 (`pip install torch --index-url https://download.pytorch.org/whl/cu130`)
* Driver 570+
* PyTorch 2.10+ (2.11+ recommended for better TorchAO kernels, 2.12 stable drops May 13)
* TorchAO (latest stable)
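
A quick sanity check (assuming the cu130 wheel from the command above is installed) that your environment actually matches these requirements:

```python
import torch

print("torch:", torch.__version__)       # expect 2.10 or newer
print("cuda :", torch.version.cuda)      # expect 13.0
if torch.cuda.is_available():
    print("gpu  :", torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
```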

# 3. How They Actually Work

>**Quick note on scope:** most home users will encounter these formats during **inference**, not training. Training support for MXFP8/NVFP4 is still much more experimental and enterprise-oriented — that's the context of most NVIDIA benchmark posts. Everything below is written from an inference perspective.

# FP16/BF16

Traditional high-precision inference:

* highest image quality
* largest VRAM usage
* slowest memory throughput

Still the gold standard for:

* photorealism
* upscaling
* inpainting
* cinema-quality video

# FP8

Lower precision floating point:

* faster tensor operations
* lower VRAM usage
* higher throughput

The problem is a single global scale per tensor. Diffusion models are very sensitive to outlier values, and one outlier drags the whole scale with it (see the toy sketch after the list below).

Possible issues:

* washed textures
* unstable lighting
* detail degradation
* temporal instability in video
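
A toy sketch of why that single scale hurts (plain PyTorch, not the actual ComfyUI/TorchAO kernels): one outlier value stretches the per-tensor scale and wastes most of FP8's already limited precision on range nobody uses.

```python
import torch

def fake_fp8_per_tensor(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().max() / 448.0               # 448 = max finite value of float8_e4m3
    q = (x / scale).to(torch.float8_e4m3fn)     # quantize with one global scale
    return q.to(torch.float32) * scale          # dequantize

torch.manual_seed(0)
w = torch.randn(4096)
print("mean abs error without outlier:", (fake_fp8_per_tensor(w) - w).abs().mean().item())

w[0] = 100.0                                    # a single outlier value
print("mean abs error with outlier:   ", (fake_fp8_per_tensor(w) - w).abs().mean().item())
```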

# MXFP8

Microscaling FP8.

Instead of one global scale for the whole tensor, MXFP8 splits it into small blocks of 32 values, each with its own independent scale factor.

Benefits:

* much better dynamic range
* lower quantization error
* significantly more stable diffusion inference

Especially good for:

* attention layers
* residual connections
* video generation
* large latent spaces

This is why MXFP8 is becoming the preferred FP8 variant for diffusion models specifically.
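
Continuing the toy example from the FP8 section, here is the same quantization with one scale per block of 32 values. The outlier now only degrades its own block. (Real MXFP8 additionally restricts the block scales to powers of two; that detail is skipped here for brevity.)

```python
import torch

def fake_fp8_per_tensor(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().max() / 448.0
    return (x / scale).to(torch.float8_e4m3fn).to(torch.float32) * scale

def fake_fp8_per_block(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    blocks = x.reshape(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True) / 448.0   # one scale per block of 32
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return (q.to(torch.float32) * scale).reshape(x.shape)

torch.manual_seed(0)
w = torch.randn(4096)
w[0] = 100.0                                                 # same outlier as before
print("per-tensor scale error:", (fake_fp8_per_tensor(w) - w).abs().mean().item())
print("per-block scale error: ", (fake_fp8_per_block(w) - w).abs().mean().item())
```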

# NVFP4

Very aggressive 4-bit floating point format from NVIDIA.

Uses:

* microscaling (blocks of 16 values)
* FP8 scale factors
* FP32 tensor scaling

Advantages:

* extremely low VRAM (~3.5x smaller than BF16)
* maximum throughput
* up to 1.68x speedup on Blackwell vs BF16 (benchmarked on B200 with FLUX.1-Dev)

Disadvantages:

* visible quality degradation on some layers (mean LPIPS 0.44 vs 0.11 for MXFP8 on FLUX.1-Dev)
* artifacts
* temporal instability
* not suitable for every layer equally

NVFP4 works best with:

* selective quantization (skip sensitive layers)
* hybrid precision pipelines
* video workloads
* very large models
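
What "selective quantization" looks like in practice, as a minimal sketch: walk the model, convert only the weight-heavy linear layers, and leave name-matched sensitive layers in higher precision. The `quantize_to_nvfp4` helper and the `SENSITIVE` name patterns below are placeholders rather than a real library API; substitute whatever entry point your quantization toolkit (e.g., TorchAO) provides.

```python
import torch.nn as nn

SENSITIVE = ("embed", "norm", "final_layer", "proj_out")   # assumed layer-name patterns

def quantize_to_nvfp4(linear: nn.Linear) -> nn.Module:
    # Placeholder: swap in the real NVFP4 conversion from your quantization library.
    return linear

def selectively_quantize(model: nn.Module) -> nn.Module:
    targets = [name for name, m in model.named_modules()
               if isinstance(m, nn.Linear) and not any(p in name for p in SENSITIVE)]
    for name in targets:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, quantize_to_nvfp4(getattr(parent, child_name)))
    return model

# Tiny dummy model just to exercise the traversal logic:
model = selectively_quantize(nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 8)))
```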

# 4. FP16 vs FP8 vs MXFP8 vs NVFP4

|Format|Quality|VRAM Usage|Speed|Stability|Best Use Case|
|:-|:-|:-|:-|:-|:-|
|FP16/BF16|Excellent|Highest|Baseline|Excellent|Maximum quality|
|FP8|Good|Lower|Fast|Medium|General acceleration|
|MXFP8|Near-FP16|Very low|Very fast|High|Best overall balance|
|NVFP4|Lower|Lowest|Fastest|Lower|Maximum throughput|

Real numbers from the PyTorch blog (April 2026, FLUX.1-Dev on B200, batch size 1, selective quantization):

|Mode|Latency|Memory|Speedup vs BF16|
|:-|:-|:-|:-|
|BF16 (baseline)|2.10s|38.34 GB|1.00x|
|MXFP8|1.75s|26.90 GB|1.21x|
|NVFP4|1.41s|21.33 GB|1.50x|

# 5. Why This Matters for Home Users

Consumer GPUs are no longer gaining performance purely through raw shader count.

Modern AI workloads are bottlenecked by:

* VRAM capacity
* bandwidth
* memory movement
* memory latency
* tensor cache efficiency

That is why NVIDIA created Tensor Cores, FP8, MXFP8, and NVFP4.

Without low-precision formats:

* future video models would become impossible on consumer GPUs
* VRAM requirements would explode
* local AI generation would become impractical

Quantization is not just optimization anymore — it's the core technology that keeps local generation possible at all.

# 6. Practical Recommendations for ComfyUI Users

|Goal|Recommended Format|
|:-|:-|
|Maximum image fidelity|BF16|
|Best daily-driver (RTX 50 only)|MXFP8|
|Best compatibility|FP8|
|Lowest VRAM (RTX 50 only, use carefully)|NVFP4|
|Video generation|MXFP8 / NVFP4 hybrid (RTX 50)|
|RTX 20/30 users|FP16/BF16, or FP8 weight-only (saves VRAM, no speedup)|
|RTX 40 users|FP8 (native hardware)|
|RTX 50 users|MXFP8|

# Final Thoughts

FP8 was the first real step toward low-precision inference on consumer hardware — native on RTX 40, weight-only fallback on older cards.

MXFP8 is the next step up: near-FP16 quality, FP8-level speed, and much better stability for diffusion models thanks to per-block scaling.

NVFP4 pushes efficiency even further but trades image quality for throughput — best used selectively, not across the whole model.

For Blackwell GPUs, MXFP8 is becoming the default format for local AI image and video generation. And for the first time, consumer GPUs are being designed specifically around AI quantization formats rather than traditional graphics workloads.

Software support has been moving fast — MXFP8/NVFP4 became accessible via stable pip with PyTorch 2.10 + cu130 back in January 2026. PyTorch 2.12 (May 13) shifts focus heavily toward CUDA 13 and Blackwell support — official wheels drop cu128, though the broader ecosystem will keep running on 12.8 for a long time yet.

# Sources

* [https://cursor.com/blog/kernels](https://cursor.com/blog/kernels)
Optimizing LTX-2.3 Inference Speed: from 300s to 45s on an RTX 3080Ti

[Background]

I’m currently building an entertainment app powered by video generation AI. My hardware setup consists of an RTX 5090 on my local PC for training and an RTX 3080Ti on a private server for serving. My goal was to train LTX-2.3 LoRAs on the 5090 and serve the model efficiently on the 3080Ti.

[Training\]

For LoRA training, I went with musubi-tuner based on community recommendations, and I was impressed. The optimization is top-notch. Using FP8 and NF4 options saved a significant amount of VRAM, making the whole training process very smooth.

[Inference & Optimization in ComfyUI]

I used ComfyUI for the backend. Initially, the default workflow took about 300 seconds per generation, which was too slow for my app.

Here’s what I found while trying to shave off that time:

1. Resolution is Key: Unless you absolutely need high-res, lowering it helps significantly. Switching from 1080x1920 to 720x1280 dropped the generation time from 300s to the 120s range.
2. Spatial Upscaler Tweaks: Changing the Spatial Upscaler from x2 to x1.5 further reduced the time from 120s to 80s. However, if you combine this with the resolution drop in step 1, the quality loss is noticeable, so use it with caution.
3. Stage 2 Step Reduction: LTX-2.3 consists of Stage 1 and Stage 2 (Upsampling). Stage 2 defaults to 3 steps, but I tried cutting it down to 2 steps by modifying the sigma list from [0.85, 0.7250, 0.4219, 0.0] to [0.85, 0.4219, 0.0]. This provides a proportional speed boost, and I found the quality remains perfectly acceptable (a small sketch of this change follows the list below).
4. Sage Attention: I didn't see much improvement here. Since the RTX 3080Ti is Ampere-based, it follows the standard Triton logic rather than Sage-specific optimizations. I suspect RTX 50xx users might see different results—definitely worth testing on newer hardware.
5. The Power of INT8: This was the biggest surprise. The 3080Ti seems to handle INT8 much better than NVFP4. Switching to an INT8 model cut the time from 80s to 45s.
6. GGUF vs. INT8: In my environment, INT8 with VRAM offloading outperformed GGUF. While GGUF is great for running without offloading, my tests showed Stage 1 took 40s on GGUF vs. 29s on INT8.
7. Custom Nodes: Since there weren't many INT8 models or specific ComfyUI nodes for the new v1.1 yet, I used an AI agent to help me write a custom INT8 conversion script and a Custom Loader Node.
8. LoRA Latency: Adding a LoRA (Rank 16) adds about 4 seconds of overhead.
9. Warm-up Run: As expected, the first inference takes much longer due to model loading and caching. The ~50s speeds I mentioned are consistent from the second run onwards.
10. Frame Count: If your project allows for shorter clips, reducing the frames from 121 to 49 drastically cuts down the processing time.
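
The Stage 2 change from item 3, as a standalone sketch (plain PyTorch, outside ComfyUI): dropping one intermediate sigma turns the default 3-step upsampling schedule into a 2-step one. Inside ComfyUI the shortened list can be fed to a custom-sigmas style sampler node; the exact node names depend on your setup.

```python
import torch

default_sigmas = torch.tensor([0.85, 0.7250, 0.4219, 0.0])           # 4 boundaries -> 3 steps
fast_sigmas = torch.cat([default_sigmas[:1], default_sigmas[2:]])    # drop 0.7250  -> 2 steps

print("default steps:", default_sigmas.numel() - 1, "| fast steps:", fast_sigmas.numel() - 1)
```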

[Final Results]

Using these optimizations on my RTX 3080Ti:

832x1024 @ 121 frames: 73 seconds

832x1024 @ 49 frames: 45 seconds

https://preview.redd.it/vl2vyy386o0h1.png?width=2112&format=png&auto=webp&s=0906069b50ac57175abb740086bad5aafc57bb8a

https://reddit.com/link/1tavvnj/video/4nllka5u9o0h1/player



Hope this helps anyone trying to squeeze more performance out of their mid-to-high end setups!

https://redd.it/1tavvnj
@rStableDiffusion
Qwen Image 2 papers - does that mean anything?

https://huggingface.co/papers/2605.10730

https://preview.redd.it/cmg25rw5ro0h1.png?width=1990&format=png&auto=webp&s=94f7e04f28fbaaccd504dd2502af38b798e59aae

https://preview.redd.it/vyloqa9nro0h1.png?width=1618&format=png&auto=webp&s=175ee402bff154bca8d691e5ef4c2102d5c8f5a3

"We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios.

Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities.

The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models."

https://redd.it/1taxowh
@rStableDiffusion
INT8 in the age of MXFP8. An investigation into the quality of various quantization types, and their speed.

I've seen some MXFP8 posts recently, so I've been wondering how it compares against other quant types.

Most interesting to me is the comparison against INT8, which, unlike MXFP8, has been hardware accelerated since the RTX 20 series.

So I've spent the past week testing how INT8 via my comfy node "INT8-Fast" compares.

PS: All of the text here is human written, and reflects my own conclusions, with the exception of a single clearly marked paragraph.

TLDR: The rough ranking for the quantization quality tested is GGUF Q8 > INT8 ConvRot > MXFP8 > FP8 >= INT8 Row.


# Quick glossary:

INT8: A data type storing numbers from -128 to 127. Like FP8 but using integers.

INT8 Row-wise: A slightly fancier way to store INT8 weights and activations with more granularity.

INT8 Tensor-Wise: The easiest and lowest quality way to do INT8.

INT8 ConvRot: It's row-wise INT8, but the model and activations are rotated in a way that removes outliers before quantization. Reference paper here
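
To make the granularity difference concrete, here is a toy sketch of tensor-wise vs row-wise INT8 scaling (the rotation step that ConvRot adds on top is omitted):

```python
import torch

def int8_tensor_wise(w: torch.Tensor):
    scale = w.abs().max() / 127.0                        # one scale for the whole tensor
    return (w / scale).round().clamp(-128, 127).to(torch.int8), scale

def int8_row_wise(w: torch.Tensor):
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0    # one scale per output row
    return (w / scale).round().clamp(-128, 127).to(torch.int8), scale

torch.manual_seed(0)
w = torch.randn(256, 256)
for name, (q, s) in {"tensor-wise": int8_tensor_wise(w), "row-wise": int8_row_wise(w)}.items():
    print(f"{name}: mean abs error {(q.float() * s - w).abs().mean():.5f}")
```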

Explaining what the measurements do (AI):

SNR dB: "How loud is the real signal compared to the static/noise the quantization added?"

Cosine Similarity (Cos-sim): "Are the quantized latents pointing in the same direction as the originals, even if they're a slightly different size?"

Rel-RMSE: "On average, how wrong is each value, as a percentage of how big the values actually are?"

/end of AI explanation

# Methodology:

What I did was capture the cond/uncond latents at every step of the inference process with a modified KSampler node, then compare them against the latents from the unquantized BF16 baseline model.

These tests were run with the ~latest ComfyUI on an RTX 3090.
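
For reference, here is a small sketch (with random stand-in latents, not my actual node code) of how the three metrics above can be computed between a quantized latent and the BF16 reference. The Rel-RMSE normalization shown is one common definition; the exact formula used in the tables may differ slightly.

```python
import torch

def latent_metrics(ref: torch.Tensor, quant: torch.Tensor) -> dict:
    ref32, q32 = ref.float().flatten(), quant.float().flatten()
    err = q32 - ref32
    rel_rmse = err.pow(2).mean().sqrt() / ref32.pow(2).mean().sqrt()
    snr_db = 10 * torch.log10(ref32.pow(2).mean() / err.pow(2).mean())
    cos_sim = torch.nn.functional.cosine_similarity(ref32, q32, dim=0)
    return {"rel_rmse": rel_rmse.item(), "snr_db": snr_db.item(), "cos_sim": cos_sim.item()}

# Example with random stand-in latents:
ref = torch.randn(1, 4, 128, 128)
quant = ref + 0.05 * torch.randn_like(ref)   # pretend quantization noise
print(latent_metrics(ref, quant))
```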

# Results:

Anima, 100 samples at 1MP resolution, 25 steps.

| Metric | INT8 ConvRot | INT8 Row | INT8 Row Bedovyy | INT8 Tensor Silver | FP8 | GGUF_Q8 |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Rel-RMSE ↓ | 0.09032 ±0.00626 ★ | 0.13396 ±0.00720 | 0.13084 ±0.00920 | 0.23802 ±0.01011 | 0.14523 ±0.00679 | 0.12124 ±0.00714 |
| SNR dB ↑ | 24.05 ±0.53 ★ | 19.68 ±0.39 | 20.24 ±0.52 | 14.48 ±0.36 | 19.66 ±0.35 | 21.98 ±0.46 |
| Cos-sim ↑ | 0.992165 ±0.001113 ★ | 0.984617 ±0.001780 | 0.984765 ±0.002368 | 0.957751 ±0.003461 | 0.981587 ±0.001878 | 0.985553 ±0.001704 |
----

Z-Image turbo, 64 samples, 0.5MP resolution, 8 steps:

| Metric | GGUF_Q8 | INT8 ConvRot | INT8 Row | MXFP8 |
| :--- | ---: | ---: | ---: | ---: |
| Rel-RMSE ↓ | 0.16740 ±0.00628 ★ | 0.19634 ±0.00660 | 0.35659 ±0.00968 | 0.30729 ±0.00645 |
| SNR dB ↑ | 16.42 ±0.29 ★ | 14.86 ±0.26 | 9.27 ±0.23 | 10.59 ±0.18 |
| Cos-sim ↑ | 0.978215 ±0.001696 ★ | 0.971225 ±0.001920 | 0.916394 ±0.004070 | 0.935860 ±0.002428 |
---

HiDream O1, 16 samples, 0.5MP resolution, 24 steps

FP8 Naive refers to using a BF16 checkpoint with the dtype set to FP8, which naively casts most weights to FP8.

| Metric | FP8 Naive | [FP8 Scaled](https://huggingface.co/Comfy-Org/HiDream-O1-Image/blob/main/checkpoints/hidreamo1imagedevfp8scaled.safetensors) | INT8 ConvRot | INT8 Row | MXFP8 |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Rel-RMSE ↓ | 0.23140 ±0.03353 | 0.08793 ±0.01196 | 0.06738 ±0.00849 ★ | 0.40533 ±0.03865 | 0.09269 ±0.00912 |
| SNR dB ↑ | 14.86 ±1.00 | 22.98 ±0.91 | 25.65 ±0.85 ★ | 8.77 ±0.76 | 22.65 ±0.79 |