LTX 2.3 audio as standalone speech model.

User @wildmindai from X posted about this new model. Has anyone here tried it yet?

LTX 2.3 audio as standalone speech model.

Emotional TTS with Scenema Audio.

- Zero-shot expressive voice cloning and speech generation
- 8-step distilled with Gemma 3 12B text encoding
- stage directions via `<action>` tags
- runs at 1.5x real-time on an RTX 4090
- fits in 16GB VRAM
- 13 languages, 48kHz stereo output

It also generates matching environment sounds.

https://huggingface.co/ScenemaAI/scenema-audio

https://redd.it/1tab0tb
@rStableDiffusion
I have to pretend I hate image generation AI to avoid getting banned or insulted on 99% of Reddit or the internet, even though Stable Diffusion is actually what I like and am most excited about right now. Why do people hate AI so much, especially image generation AI?

I'm not even saying I care whether they know the difference between open-source and closed-source image-generation AI, or whether they insult me.

What I want to know is why so many people hate AI, especially image-generating AI.

At first, I thought it only bothered artists. Then I thought it might also bother those who are afraid of not being able to distinguish AI from reality.

But it's practically 99% of people who hate AI, and I just can't understand why.

For example, I've been using Blender for years. I learned to model, sculpt, and animate as an amateur. Thanks to AI, things that used to take me months now take me seconds. Isn't that supposed to be a good thing?

I don't feel bad or like I've wasted my time using Blender; I simply feel fortunate to have found a better tool for what I needed.

EDIT 1: When I say "Stable Diffusion" I mean the open source model community, all models, not "SD" specifically.

https://redd.it/1tahphc
@rStableDiffusion
Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline

https://redd.it/1tamqbf
@rStableDiffusion
Why MXFP8 and NVFP4 Actually Matter for Your Home GPU Setup

>**Note:** I'm not a developer or someone working in the AI field professionally. I'm just a regular home user who, like many others here, is trying to understand all these new AI technologies, quantization formats, and terms that are starting to appear everywhere around local image and video generation.
>
>I wanted to collect the information scattered across different articles and technical posts into one practical overview focused on consumer GPUs, ComfyUI, and local AI workflows.
>
>I also used AI assistance to help organize and summarize information from the sources linked below.

# A practical breakdown for ComfyUI, FLUX, WAN Video, and similar workflows

Modern image and video generation models don't hit a wall because of shader count anymore. The real bottlenecks are:

* VRAM capacity
* memory bandwidth
* tensor movement
* cache efficiency

That's exactly why FP8, MXFP8, and NVFP4 were created.

Relevant if you're running locally:

* ComfyUI
* FLUX
* SDXL
* WAN Video
* LTX Video
* Hunyuan
* Qwen Image

# 1. Hardware Support

|Format|RTX 20/30 (Turing/Ampere)|RTX 40 (Ada Lovelace)|RTX 50 (Blackwell)|Server GPUs|
|:-|:-|:-|:-|:-|
|FP16/BF16|Yes|Yes|Yes|Yes|
|FP8|Software fallback only|Yes, native (compute cap. 8.9+)|Yes|Hopper+ (compute cap. 9.0+)|
|MXFP8|No|No|Yes, native|Blackwell|
|NVFP4|No|No native hardware|Yes, native|Blackwell|

The difference between "supported" and "emulated" actually matters a lot here:

* **RTX 20/30 (Turing/Ampere)** — no FP8 tensor cores at all. You can load FP8 models through a software fallback where weights are stored in FP8 but all compute still runs in FP16/BF16. This saves VRAM but gives you zero speedup — and sometimes runs slower due to conversion overhead.
* **RTX 40 (Ada Lovelace)** — native FP8 via 4th-gen Tensor Cores (compute capability 8.9). Real hardware acceleration, real speedup. MXFP8 and NVFP4 have no hardware support here, so running them is just emulation with no meaningful benefit.
* **RTX 50 (Blackwell)** — the first consumer GPU generation with native MXFP8 and NVFP4 through 5th-gen Tensor Cores.
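
If you want to see which column of the table applies to your own card, a minimal PyTorch check like the one below works. The mapping from compute capability to format support is just this post's summary, not something PyTorch reports directly.

```python
# Rough check of which low-precision formats your GPU accelerates natively.
# Assumes a CUDA build of PyTorch and at least one NVIDIA GPU.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

if (major, minor) >= (10, 0):    # Blackwell: cc 10.0 (server) / 12.0 (consumer)
    print("Native FP8, MXFP8 and NVFP4 tensor cores")
elif (major, minor) >= (8, 9):   # Ada Lovelace (and Hopper): native FP8 only
    print("Native FP8; MXFP8/NVFP4 would only be emulated")
else:                            # Turing/Ampere and older
    print("No FP8 tensor cores; FP8 weights load via software fallback only")
```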

# 2. Software Support

|Technology|FP16|FP8|MXFP8|NVFP4|
|:-|:-|:-|:-|:-|
|CUDA 12.6|Stable|Stable|No|No|
|CUDA 12.8|Stable|Stable|Partial|No|
|CUDA 13.0|Stable|Stable|Yes|Yes|
|PyTorch|Full|Broadly usable since 2.2|Stable since 2.10+cu130|Stable since 2.10+cu130|
|TorchAO|Yes|Yes|Yes|Yes|
|TensorRT|Yes|Yes|Partial|Partial|
|Diffusers|Yes|Yes|Emerging|Emerging|
|ComfyUI|Full|Common|Rare|Experimental|

**The honest picture on PyTorch versions:**

* **FP8** — `float8_e4m3fn` and `float8_e5m2` dtype landed in PyTorch **2.1** (experimental), broadly usable from **2.2**, mature from **2.3+**.
* **MXFP8 / NVFP4** — available as a **stable pip install since PyTorch 2.10 + cu130** (January 21, 2026). CUDA 13.0 was promoted to stable in that release with full Blackwell (compute cap. 10.0, 12.0) support. The April 2026 PyTorch blog post used nightlies (`2.12.0.dev+cu130`) only because the authors wanted the latest TorchAO kernels — not because cu130 itself required nightlies.
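
As a concrete illustration of the FP8 dtypes named above, here is a minimal weight-only sketch: weights are stored as `float8_e4m3fn` to save memory and upcast to BF16 before compute, which is essentially the software-fallback path described for RTX 20/30 cards.

```python
# FP8 weight-only storage: 1 byte per weight instead of 2, upcast before compute.
import torch

w = torch.randn(4096, 4096, dtype=torch.bfloat16)

scale = w.abs().max().float() / 448.0            # 448 = largest finite value of e4m3fn
w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)

w_restored = w_fp8.to(torch.bfloat16) * scale    # upcast happens before the actual matmul
print(w.element_size(), "->", w_fp8.element_size(), "bytes per weight")   # 2 -> 1
print("mean abs error:", (w.float() - w_restored.float()).abs().mean().item())
```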

What MXFP8/NVFP4 workflows need:

* CUDA 13.0 (`pip install torch --index-url https://download.pytorch.org/whl/cu130`)
* Driver 570+
* PyTorch 2.10+ (2.11+ recommended for better TorchAO kernels, 2.12 stable drops May 13)
* TorchAO (latest stable)
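
A quick sanity check, runnable in the same Python environment as ComfyUI, that the toolchain above is actually in place:

```python
# Report the versions the MXFP8/NVFP4 path depends on.
import torch

print("PyTorch:", torch.__version__)        # expect 2.10 or newer
print("CUDA runtime:", torch.version.cuda)  # expect "13.0"
print("GPU:", torch.cuda.get_device_name(0))
# Driver version (570+) is easiest to confirm from the `nvidia-smi` header.
```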

# 3. How They Actually Work

>**Quick note on scope:** most home users will encounter these formats during **inference**, not training. Training support for MXFP8/NVFP4 is still much more experimental and enterprise-oriented — that's the context of most NVIDIA benchmark posts. Everything below is written from an inference perspective.

# FP16/BF16

Traditional high-precision inference:

* highest image quality
* largest VRAM usage
* slowest memory throughput

Still the gold standard for:

* photorealism
* upscaling
* inpainting
* cinema-quality video

# FP8

Lower precision floating point:

* faster tensor operations
* lower VRAM usage
* higher throughput

The problem is that FP8 uses a single global scale factor per tensor, and diffusion models are very sensitive to outlier values: one extreme value stretches the scale for everything else in the tensor (the toy example after the list below shows the effect).

Possible issues:

* washed textures
* unstable lighting
* detail degradation
* temporal instability in video
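
A toy illustration of that scaling problem, assuming nothing beyond stock PyTorch: one outlier forces a large per-tensor scale, and the many small values in the tensor fall below FP8's usable range.

```python
# Per-tensor FP8 with one outlier: small values underflow or lose most of their precision.
import torch

x = torch.full((4096,), 5e-4)     # mostly small activations
x[0] = 448.0                      # a single outlier

scale = x.abs().max() / 448.0     # one global scale, driven entirely by the outlier
x_fp8 = (x / scale).to(torch.float8_e4m3fn)
x_back = x_fp8.to(torch.float32) * scale

rel_err = ((x - x_back).abs() / x.abs()).mean()
print("mean relative error:", rel_err.item())   # close to 1.0: the small values are wiped out
```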

# MXFP8

Microscaling FP8.

Instead of one global scale for the whole tensor, MXFP8 splits it into small blocks of 32 values, each with its own independent scale factor.

Benefits:

* much better dynamic range
* lower quantization error
* significantly more stable diffusion inference

Especially good for:

* attention layers
* residual connections
* video generation
* large latent spaces

This is why MXFP8 is becoming the preferred FP8 variant for diffusion models specifically.
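
To make the per-block idea concrete, the toy sketch below reuses the FP8 dtype from the example above but scales independent blocks of 32 values. Real MXFP8 stores a shared power-of-two (E8M0) scale per block in hardware, so this only illustrates why finer-grained scaling cuts the error, not the actual kernel path.

```python
# Per-tensor vs per-block (32-value) scaling on a tensor whose magnitudes vary wildly.
import torch

def quant_dequant(x, block_size):
    blocks = x.view(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True) / 448.0   # one scale per block
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return (q.to(torch.float32) * scale).view_as(x)

x = torch.logspace(-5, 1, 4096)   # magnitudes spanning six orders (signs omitted for simplicity)

rel_err = lambda ref, approx: ((ref - approx).abs() / ref.abs()).mean().item()
print("per-tensor:", rel_err(x, quant_dequant(x, 4096)))   # whole tensor = plain FP8 scaling
print("per-block: ", rel_err(x, quant_dequant(x, 32)))     # MXFP8-style blocks of 32
```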

# NVFP4

Very aggressive 4-bit floating point format from NVIDIA.

Uses:

* microscaling (blocks of 16 values)
* FP8 scale factors
* FP32 tensor scaling

Advantages:

* extremely low VRAM (~3.5x smaller than BF16)
* maximum throughput
* up to 1.68x speedup on Blackwell vs BF16 (benchmarked on B200 with FLUX.1-Dev)

Disadvantages:

* visible quality degradation on some layers (mean LPIPS 0.44 vs 0.11 for MXFP8 on FLUX.1-Dev)
* artifacts
* temporal instability
* not suitable for every layer equally

NVFP4 works best with:

* selective quantization (skip sensitive layers; see the sketch after this list)
* hybrid precision pipelines
* video workloads
* very large models
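
As a sketch of what "selective quantization" could look like in practice, the idea is simply to leave quality-critical layers in higher precision and only drop large matmul-heavy layers to NVFP4. The layer-name skip-list below is an illustrative assumption, not a rule from NVIDIA or the PyTorch post.

```python
# Hypothetical per-layer precision plan: keep sensitive layers in BF16, mark the rest for NVFP4.
import torch.nn as nn

SENSITIVE_KEYWORDS = ("norm", "embed", "time_emb", "final_layer", "proj_out")  # assumed names

def choose_precision(name: str, module: nn.Module) -> str:
    if not isinstance(module, nn.Linear):
        return "bf16"                                   # only big matmuls benefit from FP4
    if any(k in name.lower() for k in SENSITIVE_KEYWORDS):
        return "bf16"                                   # keep quality-critical layers intact
    return "nvfp4"

def quantization_plan(model: nn.Module) -> dict:
    """Map each named module to the precision it should run in."""
    return {name: choose_precision(name, m) for name, m in model.named_modules() if name}

# Usage: plan = quantization_plan(my_diffusion_transformer)
```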

# 4. FP16 vs FP8 vs MXFP8 vs NVFP4

|Format|Quality|VRAM Usage|Speed|Stability|Best Use Case|
|:-|:-|:-|:-|:-|:-|
|FP16/BF16|Excellent|Highest|Baseline|Excellent|Maximum quality|
|FP8|Good|Lower|Fast|Medium|General acceleration|
|MXFP8|Near-FP16|Very low|Very fast|High|Best overall balance|
|NVFP4|Lower|Lowest|Fastest|Lower|Maximum throughput|

Real numbers from the PyTorch blog (April 2026, FLUX.1-Dev on B200, batch size 1, selective quantization):

|Mode|Latency|Memory|Speedup vs BF16|
|:-|:-|:-|:-|
|BF16 (baseline)|2.10s|38.34 GB|1.00x|
|MXFP8|1.75s|26.90 GB|1.21x|
|NVFP4|1.41s|21.33 GB|1.50x|

# 5. Why This Matters for Home Users

Consumer GPUs are no longer gaining performance purely through raw shader count.

Modern AI workloads are bottlenecked by:

* VRAM capacity
* bandwidth
* memory movement
* memory latency
* tensor cache efficiency

That is why NVIDIA created Tensor Cores, FP8, MXFP8, and NVFP4.

Without low-precision formats:

* future video models would become impossible on consumer GPUs
* VRAM requirements would explode
* local AI generation would become impractical

Quantization is not just optimization anymore — it's the core technology that keeps local generation possible at all.

# 6. Practical Recommendations for ComfyUI Users

|Goal|Recommended Format|
|:-|:-|
|Maximum image fidelity|BF16|
|Best daily-driver (RTX 50 only)|MXFP8|
|Best compatibility|FP8|
|Lowest VRAM (RTX 50 only, use carefully)|NVFP4|
|Video generation|MXFP8 / NVFP4 hybrid (RTX 50)|
|RTX 20/30 users|FP16/BF16, or FP8 weight-only (saves VRAM, no speedup)|
|RTX 40 users|FP8 (native hardware)|
|RTX 50 users|MXFP8|

# Final Thoughts

FP8 was the first real step toward low-precision inference on consumer hardware — native on RTX 40, weight-only fallback on older cards.

MXFP8 is the next step up: near-FP16 quality, FP8-level speed, and much better stability for diffusion models thanks to per-block scaling.

NVFP4 pushes efficiency even further but trades image quality for throughput — best used selectively, not across the whole model.

For Blackwell GPUs, MXFP8 is becoming the default format for local AI image and video generation. And for the first time, consumer GPUs are being designed specifically around AI quantization formats rather than traditional graphics workloads.

Software support has been moving fast — MXFP8/NVFP4 became accessible via stable pip with PyTorch 2.10 + cu130 back in January 2026. PyTorch 2.12 (May 13) shifts focus heavily toward CUDA 13 and Blackwell support — official wheels drop cu128, though the broader ecosystem will keep running on 12.8 for a long time yet.

# Sources

* [https://cursor.com/blog/kernels](https://cursor.com/blog/kernels)