not recommended|Quality suffers significantly from heavy quantization|
|**Flux.2 DEV**| Cannot run|Base FP16 model is \~60GB — no quantization makes this practical|
|**Flux.2 Klein 4B**|⚠️ Runs stably|Decent quality, but tiny community and very limited model selection|
|**Flux.2 Klein 9B**|⚠️ Runs with caveats|\~20GB native — needs quantization or interlaced mode, both reduce quality|

**Bottom line on Flux:** It can technically run in quantized form, but the quality trade-off is significant enough that it is not worth pursuing on 6GB VRAM. Z-Image Turbo delivers superior results on this hardware.

# 🧠 RAM Planning for Z-Image Turbo — A Hidden Pitfall

Z-Image Turbo has a RAM requirement that is easy to underestimate. Unlike Illustrious, whose text encoders are small, Z-Image Turbo uses **Qwen 3 4B as its text encoder, and it stays permanently in RAM**.

**Full RAM breakdown for Z-Image Turbo:**

|Component|RAM Usage|Notes|
|:-|:-|:-|
|**Qwen 3 4B Text Encoder (FP16)**|\~7.5 GB|Permanent — never unloaded|
|**Z-Image Turbo model**|\~12 GB|Staged dynamically|
|**ComfyUI + latents + overhead**|\~2-3 GB|Varies|
|**Windows OS**|\~4-6 GB|Background processes|
|**Total**|**\~25-28 GB**|With 32GB RAM: only \~4-7GB headroom|

**The danger with 32GB RAM:** When the model unload doesn't run cleanly — which can happen — Z-Image Turbo ignores Windows Shared Memory settings and aggressively accumulates RAM. Observed peak usage: **20GB+ for the model alone**, pushing total system RAM to the absolute limit. Windows will then start swapping to SSD, causing severe slowdowns or freezes.

**64GB RAM is strongly recommended for Z-Image Turbo.**

**The Qwen Q8 workaround:** A quantized Q8 version of the Qwen encoder reduces RAM usage from \~7.5GB to \~4.5GB — saving \~3GB. However, there is an important trade-off:

* Z-Image Turbo already struggles with prompt following compared to tag-based models
* Natural Language prompting requires the encoder to correctly interpret complex sentence structures
* Any quality loss in the encoder hits harder on Z-Image Turbo than on simpler tag-based models
* Only consider Q8 Qwen if RAM pressure is severe and you are willing to accept potentially weaker prompt adherence
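
The RAM figures in this section can be added up with a short sketch. It only uses the estimates from the table and the Q8 encoder size quoted above; the dictionary keys and function names are made up for illustration and are not part of ComfyUI.

```python
# Rough RAM budget for Z-Image Turbo, using the estimates from this guide.
# Each entry is a (low, high) range in GB; the key names are illustrative only.
COMPONENTS_GB = {
    "qwen3_4b_text_encoder": (7.5, 7.5),   # FP16, stays resident
    "z_image_turbo_model": (12.0, 12.0),   # staged dynamically
    "comfyui_latents_overhead": (2.0, 3.0),
    "windows_os": (4.0, 6.0),
}


def ram_budget(components, installed_gb):
    """Return (total_low, total_high, headroom_min, headroom_max) in GB."""
    total_low = sum(low for low, _ in components.values())
    total_high = sum(high for _, high in components.values())
    return total_low, total_high, installed_gb - total_high, installed_gb - total_low


def with_q8_encoder(components, q8_gb=4.5):
    """Copy of the budget with the text encoder swapped for the Q8 build."""
    swapped = dict(components)
    swapped["qwen3_4b_text_encoder"] = (q8_gb, q8_gb)
    return swapped


if __name__ == "__main__":
    budgets = (("FP16 encoder", COMPONENTS_GB),
               ("Q8 encoder", with_q8_encoder(COMPONENTS_GB)))
    for label, parts in budgets:
        for installed in (32, 64):
            low, high, head_min, head_max = ram_budget(parts, installed)
            print(f"{label}, {installed} GB RAM: uses {low:.1f}-{high:.1f} GB, "
                  f"headroom {head_min:.1f}-{head_max:.1f} GB")
```

This budget assumes model unloading works cleanly. The 20GB+ peak described above is not modeled, and it is the reason the headroom on 32GB can disappear.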

# FP8 on Pascal — Surprising Results

The GTX 1060 (Pascal) is often said to have no FP8 support. This is partially true but misleading.

ComfyUI's eager backend reports these FP8 capabilities on Pascal:

```
capabilities: ['dequantize_per_tensor_fp8', 'quantize_per_tensor_fp8',
'quantize_mxfp8', 'dequantize_mxfp8', ...]
```

**Practical results with `--fp8_e4m3fn-unet` + `--fast fp16_accumulation`:**

|Metric|FP16|FP8 (e4m3fn\_fast)|
|:-|:-|:-|
|Model staged in VRAM|11,739 MB|5,869 MB|
|Generation speed (steps)|Baseline|Slightly faster|
|Load time|Faster|Slightly slower (conversion on load)|
|Image quality (normal view)|Excellent|Excellent|
|Image quality (300% zoom, eyes)|Sharper fine detail|Slightly softer|

**Conclusion:** FP8 nearly halves VRAM usage with minimal quality difference at normal viewing distances. For drafts and exploration, FP8 is the better choice. For final renders where fine detail matters, use FP16.

**Important:** FP8 works for Z-Image Turbo (Flow Matching architecture) but NOT for Illustrious/SDXL (UNet architecture). Illustrious will silently fail to generate with `--fp8_e4m3fn-unet` on Pascal.
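
The "nearly halves VRAM" figure in the table follows from storage size alone: FP16 stores each weight in 2 bytes, FP8 (e4m3fn) in 1 byte. Below is a minimal sketch of that arithmetic. It assumes every staged weight is converted, which is a simplification, and the function name is hypothetical.

```python
# Bytes used to store one weight in each format.
BYTES_PER_WEIGHT = {"fp16": 2, "fp8_e4m3fn": 1}


def staged_size_mb(fp16_size_mb, target_dtype):
    """Scale a measured FP16 footprint to another dtype by bytes per weight."""
    return fp16_size_mb * BYTES_PER_WEIGHT[target_dtype] / BYTES_PER_WEIGHT["fp16"]


if __name__ == "__main__":
    measured_fp16_mb = 11_739  # "Model staged in VRAM" for FP16, from the table
    measured_fp8_mb = 5_869    # same row for FP8, from the table
    estimate = staged_size_mb(measured_fp16_mb, "fp8_e4m3fn")
    print(f"estimated FP8 footprint: {estimate:.1f} MB "
          f"(measured: {measured_fp8_mb} MB)")
```

The estimate lands within 1 MB of the measured value, so in this test the saving appears to come almost entirely from the weight storage format.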

# 🚀 Recommended Startup BAT Files

# BAT 1: FP16 Quality Mode (for Illustrious XL + Z-Image quality renders)

```bat
@echo off
echo ComfyUI Start - FP16 Fast Mode + Force Model Unload
echo.
.\python_embeded\python.exe -s ComfyUI\main.py ^
--windows-standalone-build ^
--fast fp16_accumulation ^
--disable-smart-memory
pause
```

# BAT 2: FP8 Draft Mode (for Z-Image Turbo only — drafts & exploration)

```bat
@echo off
echo ComfyUI Start - FP8 Fast Mode + Force Model Unload
echo NOTE: FP8 works for Z-Image Turbo. Use FP16 BAT for Illustrious!
echo.
.\python_embeded\python.exe -s ComfyUI\main.py ^
--windows-standalone-build ^
--fast fp16_accumulation ^
--fp8_e4m3fn-unet ^
--disable-smart-memory
pause
```

# Why --disable-smart-memory?

This flag changes how ComfyUI handles memory between generations:

**Without flag (default behavior):**

* Models stay cached in VRAM after use
* VRAM usage accumulates with each image you generate, causing later images to take longer to finish

**With `--disable-smart-memory`:**

* After each use, modules are offloaded from VRAM → RAM
* The model stays in RAM (loaded once from SSD at startup)
* VRAM stays clean and constant between individual generations
* RAM→VRAM transfer is fast (DDR3: \~15-25 GB/s vs SSD: \~500 MB/s) — overhead is negligible

**⚠️ Batch Generation Reality Check**

Batch generation with Illustrious XL on 6GB VRAM was tested extensively. Here is what actually happens:

ComfyUI processes all batch images **simultaneously** — every denoising step is computed for all images at once. This sounds efficient but on 6GB VRAM it has a severe cost:

|Method|Time per image|10 images total|Notes|
|:-|:-|:-|:-|
|**Sequential (recommended)**|\~131 seconds|\~22 minutes|Stable, consistent|
|**Batch 10 parallel**|\~1193 seconds|**3h 19min**|\~10x slower than sequential!|

The reason: each step must process the latent data of all 10 images at once, which quickly exhausts VRAM. On top of that, the GPU lacks the compute throughput to process that much work quickly. The per-step time explodes from \~4.68s/it to \~463s/it.

**Recommendation: Always generate sequentially on 6GB VRAM.** Run images one by one — it is dramatically faster than batch mode. `--disable-smart-memory` helps keep VRAM clean between sequential generations, which is its real value here.
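
The totals in the table can be reproduced from the per-image and per-step timings. This sketch only does that arithmetic on the measured numbers quoted above; the helper names are illustrative.

```python
def total_seconds(seconds_per_image, image_count):
    """Total wall time when each image costs seconds_per_image on average."""
    return seconds_per_image * image_count


def format_duration(seconds):
    """Format seconds as 'Hh MMmin', or 'Mmin' below one hour (rounded to minutes)."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}min" if hours else f"{minutes}min"


if __name__ == "__main__":
    images = 10
    sequential = total_seconds(131, images)   # measured: ~131 s per image
    batched = total_seconds(1193, images)     # measured: ~1193 s per image
    print("sequential:", format_duration(sequential))
    print("batch of 10:", format_duration(batched))

    # Slowdown seen two ways: per image, and per denoising step per image.
    per_image_slowdown = 1193 / 131
    per_step_slowdown = (463 / images) / 4.68
    print(f"slowdown per image: {per_image_slowdown:.1f}x")
    print(f"slowdown per step and image: {per_step_slowdown:.1f}x")
```

Both ratios come out between 9x and 10x, which matches the rounded figure in the table.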

# 🎯 Z-Image Turbo — Recommended Settings

Z-Image Turbo uses **Qwen 3 4B** as text encoder and requires **natural language prompts** — NOT Danbooru tags.

|Parameter|Value|Notes|
|:-|:-|:-|
|Sampler|`euler_ancestral`|Official recommendation — model trained on this|
|Scheduler|`beta`|Best for Z-Image Turbo|
|Steps|8-10|More steps = diminishing returns|
|CFG|1.0-1.5|Must be low — higher values cause artifacts|
|Negative prompt|Leave empty|Has no effect on Turbo models|

**Prompt style:**

Write like a film director's script, not a keyword list.

Good (natural-language description):

"A young woman in a black maid uniform standing on a rooftop at sunset,
fox ears and a fluffy tail, warm golden light from behind,
looking directly at the viewer with a calm expression."

Bad (tag list):

"1girl, maid, fox ears, sunset, masterpiece, best quality, 8k"

# 🔧 Illustrious XL — Recommended Settings

|Parameter|Value|Notes|
|:-|:-|:-|
|Sampler|`dpmpp_2m_cfg_pp`|Best quality/speed ratio|
|Scheduler|`karras`|Standard recommendation|
|Steps|20-28|Sweet spot for Illustrious|
|CFG|5.0-7.0|Illustrious is CFG-sensitive|
|Resolution|1024×1024 or 896×1152|Must be multiples of 64|

**Quality tags for Illustrious (NOT Pony tags!):**

masterpiece, best quality, very aesthetic, absurdres

Do NOT use `score_9`, `score_8_up` — those are Pony-specific and have no effect on Illustrious.
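
The resolution row in the settings table says both sides must be multiples of 64. If you start from an arbitrary size, a small helper can snap it to the nearest valid value. This is a generic rounding sketch, not a ComfyUI function, and the names are hypothetical.

```python
def snap_to_multiple(value, multiple=64):
    """Round value to the nearest positive multiple (ties round up)."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    snapped = ((value + multiple // 2) // multiple) * multiple
    return max(multiple, snapped)  # never return 0 for very small inputs


def snap_resolution(width, height, multiple=64):
    """Snap both sides of a resolution to the nearest multiple."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)


if __name__ == "__main__":
    for size in ((1024, 1024), (896, 1152), (1000, 1500)):
        print(size, "->", snap_resolution(*size))
```

The two resolutions recommended in the table pass through unchanged. An odd size such as 1000×1500 becomes 1024×1472.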

# 💡 Key Insights Summary

1. **ComfyUI is mandatory** — Forge/A1111 cannot do what ComfyUI does with limited VRAM
2. **Illustrious XL fits on 6GB** because the UNet (\~4.5GB) fits in VRAM — text encoders go to CPU
3. **Z-Image Turbo (12GB model) runs** due to Single-Stream architecture enabling efficient layer streaming
4. **Flux.1 FP16 does not run** — Dual-Stream architecture requires too much simultaneous VRAM. Heavily quantized versions (Q4-Q8) technically run but quality suffers too much to be worthwhile.
5. **Flux.2 Klein 4B** runs stably but has a tiny community.
6. **FP8 works on Pascal** for Z-Image Turbo via the eager backend — nearly halves VRAM with minimal quality loss
7. **FP8 does NOT work** for Illustrious/SDXL on Pascal — silently fails
8. **CPU** — even the Qwen 3 4B (4B parameter LLM) runs acceptably fast on CPU as an encoder because it only does a single forward pass (encoding), not token-by-token generation
9. **VAE is critical for Flow Matching models** (Z-Image, Flux): wrong VAE = broken output. For Z-Image use flux1-vae, NOT flux2-vae
10. **Newer SDXL and all Illustrious models have the VAE fix built in** — external VAE fix is only needed for older SDXL models

# 🖥️ Tested Hardware

* **GPU:** NVIDIA GeForce GTX 1060 6GB (Pascal architecture, GP106)
* **RAM:** 32GB DDR3
* **Storage:** Fast SSD recommended
* **ComfyUI version:** Windows portable cu128 build
* **Driver:** Current NVIDIA drivers (May 2026)

# ⚙️ Minimum & Recommended System Requirements

Running modern models on a 6GB VRAM GPU shifts the bottleneck from VRAM to **RAM and storage**. ComfyUI's Dynamic VRAM Management offloads aggressively to RAM — this only works if you have enough of it and can transfer it fast enough.

|Component|Minimum|Recommended|Why|
|:-|:-|:-|:-|
|**GPU VRAM**|6GB|6GB|GTX 1060 target|
|**RAM**|32GB|64GB|Models offload to RAM — 32GB works but gets tight with large models + OS overhead|
|**Storage**|Fast SATA SSD|NVMe M.2 SSD|Initial model load from disk — slower SSD = longer cold start per session|
|**CPU**|Any modern|Any modern|Text encoders run on CPU — but only for a single forward pass, not a bottleneck|

**Why RAM matters so much:**

* A 12GB Z-Image Turbo model staged in RAM needs \~12GB just for the model
* OS + ComfyUI + other background processes easily add another 8-10GB
* With 16GB RAM: constant disk swapping, extremely slow or unstable
* With 32GB RAM: workable, tight on very large models
* With 64GB RAM: comfortable headroom for multiple large models and batch operations

**Why SSD speed matters:** ComfyUI loads the model from disk once per session into RAM. With `--disable-smart-memory`, it then transfers from RAM→VRAM as needed (fast). But that initial disk load:

* Slow HDD: potentially minutes per model load
* SATA SSD: acceptable, 10-30 seconds
* NVMe M.2: near-instant, 2-5 seconds

**Bottom line:** A fast GPU with slow RAM or HDD will be severely bottlenecked. The GTX 1060 6GB setup only works well when RAM and storage can keep up.
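
The cold-start times listed above are roughly what you get by dividing model size by sequential read speed. In this sketch, the SATA figure is the \~500 MB/s quoted earlier in this guide. The HDD and NVMe speeds are assumed typical values, not measurements from this setup.

```python
# Sequential read speeds in MB/s. Only the SATA value comes from the guide;
# the HDD and NVMe values are assumptions for illustration.
ASSUMED_READ_MB_S = {"hdd": 120, "sata_ssd": 500, "nvme_ssd": 3500}


def cold_load_seconds(model_size_gb, read_mb_per_s):
    """Best-case time to read a model file from disk (1 GB = 1000 MB)."""
    return model_size_gb * 1000 / read_mb_per_s


if __name__ == "__main__":
    model_gb = 12  # Z-Image Turbo, per the RAM breakdown
    for drive, speed in ASSUMED_READ_MB_S.items():
        seconds = cold_load_seconds(model_gb, speed)
        print(f"{drive}: about {seconds:.0f} s for a {model_gb} GB model")
```

These are best-case numbers. Real loads add file system overhead and any dtype conversion, so treat them as lower bounds.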

*This guide was written based on hands-on testing. All benchmarks are real measurements, not theoretical estimates. If your experience differs, please share — community knowledge benefits everyone.*

*The goal of this guide is simple: don't let hardware limitation myths stop you from experimenting. Test first, assume nothing.*

https://redd.it/1tfs3ee
@rStableDiffusion
Best Way to Prompt Qwen, Klein, Zit...You're Welcome

This is the best way to prompt images for Flux Klein, Qwen or Wan. These models were trained on .json in such a way that they understand hierarchical structure, but there is no need to waste your time on all the punctuation.

The parts of an image include: a basic concept or summary, a subject or subjects, attire, expression, pose, hair/makeup/accessories, and a background.

So break your prompt into sections. Each concept on its own line, single returns.

Generate your image, and if you want to tweak the prompt you can see at a glance what you need to edit, instead of digging through a messy paragraph to find what you want to change.

\--

professional glamour photography (put LORA Trigger and Medium at top)

Concept
Modern office portrait of woman seated on stool, polished professional workspace aesthetic

pose
Seated on round stool with legs crossed at knees and extended slightly forward
Torso angled slightly toward camera with upright posture
One arm folded across body, other resting on thigh
Head slightly tilted with direct gaze toward viewer

attire
White fitted button-up blouse
Red high-waisted mini skirt
Black sheer pantyhose
Red pointed-toe high heels
secretary glasses worn low on nose, eyes looking over glasses top
gold ankle bracelet on left ankle
gold bangle bracelet
gold stud earrings

hair/makeup/nails
Long straight black hair with blunt bangs
Smooth, sleek styling
Defined brows with eyeliner and mascara
Soft blush with red-toned lip color
Neatly manicured nails in neutral tone

expression
Soft confident smile with direct eye contact
Composed, slightly playful demeanor
Calm and self-assured presence

background
White brick wall backdrop
Desk with computer monitor behind subject
Printer/copier unit on side cabinet
Light-colored tiled floor with blue accent tiles
Bright, even indoor lighting creating clean office look

\--

This was a one-shot generation with Klein, just to show prompt adherence. I wasn't trying to make anything fancy. This is the format I use with Qwen2512 as well. I use LORA files to control my style and avoid using any stylization words like "masterpiece, trending, best quality, highly realistic, 4k, etc." I let the LORA do all the work and only describe the objects.
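
If you keep prompts in code or a notes file, the sectioned layout above is easy to generate. Here is a minimal sketch, assuming a plain dict of section name to lines. `build_prompt` is a made-up helper, not part of any tool mentioned here.

```python
def build_prompt(header, sections):
    """Join a header line and titled sections into a one-concept-per-line prompt."""
    blocks = [header.strip()] if header.strip() else []
    for title, lines in sections.items():
        cleaned = [line.strip() for line in lines if line.strip()]
        if cleaned:  # skip sections that have no content
            blocks.append("\n".join([title, *cleaned]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    prompt = build_prompt(
        "professional glamour photography",
        {
            "pose": [
                "Seated on round stool with legs crossed at knees",
                "Head slightly tilted with direct gaze toward viewer",
            ],
            "attire": [
                "White fitted button-up blouse",
                "Red high-waisted mini skirt",
            ],
            "background": ["White brick wall backdrop"],
        },
    )
    print(prompt)
```

To tweak one aspect of the image, edit only that section's list and rebuild the prompt.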

https://preview.redd.it/iwwsg89evq1h1.png?width=1280&format=png&auto=webp&s=2d3a5c99a7110560b567c18875047215fbd9cb15



https://redd.it/1tfya25
@rStableDiffusion
Is something like this possible?
https://redd.it/1tg058j
@rStableDiffusion
Generated 1000 liminal/dreamcore images with GPT Image 2 and put them in a dataset - could be useful for training

Was playing around with GPT Image 2 on 2K medium and ended up with about 1000 images that all have this liminal space / dreamcore feel. Empty indoor pools, weird corridors, foggy parking lots at night, that sort of thing.

Instead of letting them sit on my drive I packaged everything up and put it on Hugging Face. Could be decent for fine-tuning SD models or just as a reference set for this aesthetic.

https://huggingface.co/datasets/LukaDev13/Liminal-Dreamcore-1K

If anyone uses it for training I'd be curious how it turns out.

https://redd.it/1tg3rym
@rStableDiffusion
Tried using HY-Pano 2.0 and WorldMirror 2.0 together to create some rooms

https://redd.it/1tg3dq9
@rStableDiffusion
Captivating Chroma

I'm a huge fan of lodestones/Chroma, as it's very good at realism, creativity, and overall freedom. It is based on FLUX.1-schnell, so it's a bit of an older architecture by now.

One thing I like the most about Chroma is the incredible team and community behind it, especially the legendary Lodestones and the almighty, wise Silver. It must take a monumental amount of time, effort, and work to create a good model like this.

There is also the work-in-progress lodestones/Zeta-Chroma on the horizon, which is based on the great Z-Image/turbo.

In a world where companies are closed-sourcing their models, it's amazing to see the great independent work that creators like these are doing to keep the open-source community alive.

Well done to the Chroma team and the community behind it. You are all legends.

https://redd.it/1tg5nit
@rStableDiffusion
LoRA Training Auto-caption generator recommendation?

Hi guys! I'm kinda new to image generation. I'm trying to train a character LoRA with multiple reference images, and I believe each image needs a caption, right? If I have, let's say, 30 or more images, it'll be tiring to write a caption for each one. Would you recommend a good LoRA auto-caption generator that is free to use and handles multiple images at once? By the way, I'm training for the ZIT model.

Thank you in advance!

https://redd.it/1tgd3zo
@rStableDiffusion
Training a Portrait LoRA on AMD RX 9060 XT (RDNA4 / gfx1200) on Native Linux

This is a full account of getting LoRA training working on an AMD RX 9060 XT (Navi 44, RDNA4) on native Kubuntu 24.04.4. It covers everything tried, what failed and why, what had to be fixed, and what ended up working. Written for anyone with the same or similar hardware who wants to skip the trial-and-error.

---

## Hardware

- **GPU:** AMD RX 9060 XT — Navi 44, RDNA4, gfx1200, 16GB GDDR6, 150W TDP
- **CPU:** AMD Ryzen 5 5600G
- **RAM:** 32GB
- **OS:** Kubuntu 24.04.4, kernel 6.17.0-23-generic
- **Primary SSD:** Samsung 990 1TB M.2 (ext4, Linux)
- **ROCm:** 7.2.3

**Important architecture note:** Native Linux ROCm and `amd-smi` report this GPU as **gfx1200**. If you have WSL2 experience with this card, you may have seen gfx1201 — that was the WSL2 librocdxg bridge reporting incorrectly. The correct arch ID on native Linux is **gfx1200**. This matters for cmake flags and any arch-specific builds.

---

## Goal

Train a LoRA on portrait photos and use it with ComfyUI to generate lifestyle portrait photos. Models tested: SDXL (completed), Flux.1 Dev (completed, 1500 steps).

The article covers both models in sequence. The SDXL sections document a fully working pipeline and are useful standalone — but if you only care about Flux, you can skip ahead. The SDXL sections are not a prerequisite for Flux.

---

## Why Native Linux, Not WSL2

I started on Windows 10 + WSL2 (Ubuntu 24.04). Short version: **don't bother with WSL2 for RDNA4 training as of May 2026.**

### What happens in WSL2

WSL2 GPU passthrough for AMD goes through the DXG bridge — a closed-source component (`libthunk_proxy.a`) inside the AMD Adrenalin driver. On RDNA4, there is a confirmed bug in this library that breaks GPU kernel dispatch for large workloads.

Symptom: training appears to start, pipeline loads successfully, GPU VRAM fills to 8-10GB — but then nothing. CPU climbs to 30%, RAM to 28GB, GPU compute stays at 0%. The first training step either runs entirely on CPU (~50 minutes for one SDXL step at batch size 1) or the process hangs indefinitely.

The error that appears in logs:
```
[GetSegmentId] Failed to get segment id for type 1
```

This is librocdxg Issue #22 (opened April 2026, unfixed as of May 2026). Root cause is in `libthunk_proxy.a` which is closed source — librocdxg cannot fix it, only AMD can by shipping an updated driver.

**What was ruled out through testing:**
- bitsandbytes (same hang with plain adamw)
- bf16 precision (same hang with fp16)
- accelerate config (explicit single-GPU config made no difference)
- model loading (all 7 pipeline components load fine, VRAM fills correctly)
- Proof: MIOpen kernel cache after a 50-minute "run" was 180KB — essentially empty. If GPU kernels had been compiling for 50 minutes, the cache would be hundreds of MB. The work was running on CPU the whole time.

**Do not try float32 in WSL2 either.** `dtype: float32` doubles VRAM to ~26-30GB, exceeds 16GB, OOM crashes the GPU driver, and on Windows this causes a BSOD. Use bf16 or fp16 always.

### Native Linux

Bypasses the DXG bridge entirely. ROCm accesses the GPU natively via `/dev/kfd` and `/dev/dri`. The same training config that hung indefinitely in WSL2 ran at 3-4 seconds per step on native Linux. First 50-step test: 8 minutes total. The difference is dramatic.

---

## ROCm Installation on Native Linux

```bash
# Download the installer .deb — the package is not in the default Ubuntu repos
wget https://repo.radeon.com/amdgpu-install/7.2.3/ubuntu/noble/amdgpu-install_7.2.3.70203-1_all.deb
sudo apt install -y ./amdgpu-install_7.2.3.70203-1_all.deb
sudo amdgpu-install --usecase=rocm --no-dkms -y
```

`--no-dkms` skips kernel module installation — not needed if the amdgpu module is already loaded (which it is in current kernels).

**Critical: amdgpu-install does NOT add your user to the required groups.** You must do this manually:

```bash
sudo usermod -aG render,video $USER
# Then log out and log back in — groups don't apply to existing sessions
```

Without `render` and