wrong VAE = broken output. For Z-Image use flux1-vae, NOT flux2-vae
10. **Newer SDXL and all Illustrious models have the VAE fix built in**: an external VAE fix is only needed for older SDXL models
# 🖥️ Tested Hardware
* **GPU:** NVIDIA GeForce GTX 1060 6GB (Pascal architecture, GP106)
* **RAM:** 32GB DDR3
* **Storage:** Fast SSD recommended
* **ComfyUI version:** Windows portable cu128 build
* **Driver:** Current NVIDIA drivers (May 2026)
# ⚙️ Minimum & Recommended System Requirements
Running modern models on a 6GB VRAM GPU shifts the bottleneck from VRAM to **RAM and storage**. ComfyUI's Dynamic VRAM Management offloads aggressively to RAM, which only works if you have enough RAM and can move data in and out of it fast enough.
|Component|Minimum|Recommended|Why|
|:-|:-|:-|:-|
|**GPU VRAM**|6GB|6GB|GTX 1060 target|
|**RAM**|32GB|64GB|Models offload to RAM; 32GB works but gets tight with large models + OS overhead|
|**Storage**|Fast SATA SSD|NVMe M.2 SSD|Initial model load from disk; slower SSD = longer cold start per session|
|**CPU**|Any modern|Any modern|Text encoders run on CPU, but only for a single forward pass, so not a bottleneck|
**Why RAM matters so much:**
* A 12GB Z-Image Turbo model staged in RAM needs \~12GB just for the model
* OS + ComfyUI + other background processes easily add another 8-10GB
* With 16GB RAM: constant disk swapping, extremely slow or unstable
* With 32GB RAM: workable, tight on very large models
* With 64GB RAM: comfortable headroom for multiple large models and batch operations
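The arithmetic behind those tiers can be sketched in a few lines. This is a rough estimate rather than a measurement: the function name `ram_headroom_gb` is made up for illustration, and the 9GB overhead is simply the midpoint of the 8-10GB range quoted above.

```python
def ram_headroom_gb(total_ram_gb: float, model_gb: float = 12.0, overhead_gb: float = 9.0) -> float:
    """Return the RAM left after staging one model plus OS/ComfyUI overhead."""
    return total_ram_gb - model_gb - overhead_gb


for total in (16, 32, 64):
    left = ram_headroom_gb(total)
    status = "swapping likely" if left < 0 else "fits"
    print(f"{total}GB RAM: {left:+.0f}GB headroom ({status})")
```

With these assumed numbers, 16GB comes out about 5GB short, while 32GB leaves roughly 11GB spare, which matches the "workable but tight" experience.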
**Why SSD speed matters:** ComfyUI loads the model from disk once per session into RAM. With `--disable-smart-memory`, it then transfers from RAM to VRAM as needed (fast). But that initial disk load depends on the drive:
* Slow HDD: potentially minutes per model load
* SATA SSD: acceptable, 10-30 seconds
* NVMe M.2: near-instant, 2-5 seconds
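Those load times are roughly what you get from dividing model size by drive throughput. The sketch below assumes typical sustained read speeds (100, 500 and 3000 MB/s); these figures are illustrative assumptions, not values measured on the test system, and `cold_load_seconds` is a hypothetical helper name.

```python
# Assumed sustained read speeds in MB/s (illustrative, not measured).
ASSUMED_READ_MBPS = {"HDD": 100, "SATA SSD": 500, "NVMe M.2": 3000}


def cold_load_seconds(model_gb: float, read_mbps: float) -> float:
    """Estimate the time to read a model file from disk once."""
    return model_gb * 1000 / read_mbps


for drive, speed in ASSUMED_READ_MBPS.items():
    print(f"{drive}: ~{cold_load_seconds(12.0, speed):.0f}s for a 12GB model")
```

Real load times also depend on file system caching and on how the loader reads the file, so treat the result as an order-of-magnitude guide.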
**Bottom line:** A fast GPU with slow RAM or HDD will be severely bottlenecked. The GTX 1060 6GB setup only works well when RAM and storage can keep up.
*This guide was written based on hands-on testing. All benchmarks are real measurements, not theoretical estimates. If your experience differs, please share. Community knowledge benefits everyone.*
*The goal of this guide is simple: don't let hardware limitation myths stop you from experimenting. Test first, assume nothing.*
https://redd.it/1tfs3ee
@rStableDiffusion
Best Way to Prompt Qwen, Klein, Zit...You're Welcome
This is the best way to prompt images for Flux Klein, Qwen or Wan. These models were trained on .json in such a way that they understand hierarchical structure, but there is no need to waste your time on all the punctuation.
The parts of an image include: a basic concept or summary, a subject or subjects, attire, expression, pose, hair/makeup/accessories and a background.
So break your prompt into sections. Each concept goes on its own line, separated by single returns.
Generate your image, and if you want to tweak the prompt you can see at a glance what you need to edit, instead of digging through a messy paragraph to find what you want to change.
\--
professional glamour photography (put LORA Trigger and Medium at top)
Concept
Modern office portrait of woman seated on stool, polished professional workspace aesthetic
pose
Seated on round stool with legs crossed at knees and extended slightly forward
Torso angled slightly toward camera with upright posture
One arm folded across body, other resting on thigh
Head slightly tilted with direct gaze toward viewer
attire
White fitted button-up blouse
Red high-waisted mini skirt
Black sheer pantyhose
Red pointed-toe high heels
secretary glasses worn low on nose, eyes looking over glasses top
gold ankle bracelet on left ankle
gold bangle bracelet
gold stud earrings
hair/makeup/nails
Long straight black hair with blunt bangs
Smooth, sleek styling
Defined brows with eyeliner and mascara
Soft blush with red-toned lip color
Neatly manicured nails in neutral tone
expression
Soft confident smile with direct eye contact
Composed, slightly playful demeanor
Calm and self-assured presence
background
White brick wall backdrop
Desk with computer monitor behind subject
Printer/copier unit on side cabinet
Light-colored tiled floor with blue accent tiles
Bright, even indoor lighting creating clean office look
\--
This was a one-shot generation with Klein, just to show prompt adherence. I wasn't trying to make anything fancy. This is the format I use with Qwen2512 as well. I use LORA files to control my style and avoid stylization words like "masterpiece, trending, best quality, highly realistic, 4k, etc." I let the LORA do all the work and only describe the objects.
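If you keep prompts in a script, the same layout is easy to generate. This is only a sketch of the format described above: `build_prompt` is a made-up helper, and the section contents are shortened from the example prompt.

```python
def build_prompt(header: str, sections: dict[str, list[str]]) -> str:
    """Join a header line and labelled sections, one detail per line."""
    lines = [header]
    for label, details in sections.items():
        lines.append(label)
        lines.extend(details)
    return "\n".join(lines)


prompt = build_prompt(
    "professional glamour photography",
    {
        "Concept": ["Modern office portrait of woman seated on stool"],
        "attire": ["White fitted button-up blouse", "Red high-waisted mini skirt"],
        "background": ["White brick wall backdrop"],
    },
)
print(prompt)
```

Editing one detail then means changing one list entry rather than hunting through a paragraph.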
https://preview.redd.it/iwwsg89evq1h1.png?width=1280&format=png&auto=webp&s=2d3a5c99a7110560b567c18875047215fbd9cb15
https://redd.it/1tfya25
@rStableDiffusion
Generated 1000 liminal/dreamcore images with GPT Image 2 and put them in a dataset - could be useful for training
Was playing around with GPT Image 2 on 2K medium and ended up with about 1000 images that all have this liminal space / dreamcore feel. Empty indoor pools, weird corridors, foggy parking lots at night, that sort of thing.
Instead of letting them sit on my drive I packaged everything up and put it on Hugging Face. Could be decent for fine-tuning SD models or just as a reference set for this aesthetic.
https://huggingface.co/datasets/LukaDev13/Liminal-Dreamcore-1K
If anyone uses it for training I'd be curious how it turns out.
https://redd.it/1tg3rym
@rStableDiffusion
Tried using HY-Pano 2.0 and WorldMirror 2.0 together to create some rooms
https://redd.it/1tg3dq9
@rStableDiffusion
Captivating Chroma
I'm a huge fan of lodestones/Chroma, as it's very good at realism, creativity, and overall freedom. It is based on FLUX.1-schnell, so it's a bit of an older architecture by now.
One thing I like most about Chroma is the incredible team and community behind it, especially the legendary Lodestones and the almighty, wise Silver. It must take a monumental amount of time, effort, and work to create a good model like this.
There is also the work-in-progress lodestones/Zeta-Chroma on the horizon, which is based on the great Z-Image/turbo.
In a world where companies are closed-sourcing their models, it's amazing to see the great independent work that creators like these are doing to keep the open-source community alive.
Well done to the Chroma team and the community behind it. You are all legends.
https://redd.it/1tg5nit
@rStableDiffusion
LoRA Training Auto-caption generator recommendation?
Hi guys! I'm kind of new to image generation. I'm trying to train a character LoRA from multiple reference images, and I believe each image needs a caption, right? With, say, 30 or more images, writing a caption for each one gets tiring. Can you recommend a good LoRA auto-caption generator that is free to use and can process multiple images at once? By the way, I'm training for the ZIT model.
Thank you in advance!
https://redd.it/1tgd3zo
@rStableDiffusion
RealTime character swap
https://preview.redd.it/rm7xko8hsu1h1.png?width=2054&format=png&auto=webp&s=e01c06ce7224ce6590bd63714cdfae3b40946aef
Updated app from the DeluluStream team using Lucy 2.1
https://reddit.com/link/1tgfq9y/video/zozqmbjjqu1h1/player
https://redd.it/1tgfq9y
@rStableDiffusion
Training a Portrait LoRA on AMD RX 9060 XT (RDNA4 / gfx1200) on Native Linux
This is a full account of getting LoRA training working on an AMD RX 9060 XT (Navi 44, RDNA4) on native Kubuntu 24.04.4. It covers everything tried, what failed and why, what had to be fixed, and what ended up working. Written for anyone with the same or similar hardware who wants to skip the trial-and-error.
---
## Hardware
- **GPU:** AMD RX 9060 XT (Navi 44, RDNA4, gfx1200, 16GB GDDR6, 150W TDP)
- **CPU:** AMD Ryzen 5 5600G
- **RAM:** 32GB
- **OS:** Kubuntu 24.04.4, kernel 6.17.0-23-generic
- **Primary SSD:** Samsung 990 1TB M.2 (ext4, Linux)
- **ROCm:** 7.2.3
**Important architecture note:** Native Linux ROCm and `amd-smi` report this GPU as **gfx1200**. If you have WSL2 experience with this card, you may have seen gfx1201; that was the WSL2 librocdxg bridge reporting incorrectly. The correct arch ID on native Linux is **gfx1200**. This matters for cmake flags and any arch-specific builds.
---
## Goal
Train a LoRA on portrait photos and use it with ComfyUI to generate lifestyle portrait photos. Models tested: SDXL (completed), Flux.1 Dev (completed, 1500 steps).
The article covers both models in sequence. The SDXL sections document a fully working pipeline and are useful standalone, but if you only care about Flux, you can skip ahead. The SDXL sections are not a prerequisite for Flux.
---
## Why Native Linux, Not WSL2
I started on Windows 10 + WSL2 (Ubuntu 24.04). Short version: **don't bother with WSL2 for RDNA4 training as of May 2026.**
### What happens in WSL2
WSL2 GPU passthrough for AMD goes through the DXG bridge, a closed-source component (`libthunk_proxy.a`) inside the AMD Adrenalin driver. On RDNA4, there is a confirmed bug in this library that breaks GPU kernel dispatch for large workloads.
Symptom: training appears to start, the pipeline loads successfully, and GPU VRAM fills to 8-10GB, but then nothing happens. CPU climbs to 30%, RAM to 28GB, GPU compute stays at 0%. The first training step either runs entirely on CPU (~50 minutes for one SDXL step at batch size 1) or the process hangs indefinitely.
The error that appears in logs:
```
[GetSegmentId] Failed to get segment id for type 1
```
This is librocdxg Issue #22 (opened April 2026, unfixed as of May 2026). The root cause is in `libthunk_proxy.a`, which is closed source: librocdxg cannot fix it, only AMD can by shipping an updated driver.
**What was ruled out through testing:**
- bitsandbytes (same hang with plain adamw)
- bf16 precision (same hang with fp16)
- accelerate config (explicit single-GPU config made no difference)
- model loading (all 7 pipeline components load fine, VRAM fills correctly)
- Proof: the MIOpen kernel cache after a 50-minute "run" was 180KB, essentially empty. If GPU kernels had been compiling for 50 minutes, the cache would be hundreds of MB. The work was running on CPU the whole time.
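The cache-size check from the last bullet is easy to repeat. The sketch below assumes the MIOpen kernel cache lives in `~/.cache/miopen`; that location is an assumption about the default and may differ on your install. `dir_size_bytes` is a hypothetical helper.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below root (0 if root is missing)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Assumed default location of the MIOpen kernel cache; adjust if yours differs.
cache_dir = Path.home() / ".cache" / "miopen"
print(f"{cache_dir}: {dir_size_bytes(cache_dir) / 1024:.0f} KB")
```

A cache that stays in the kilobyte range after a long "training" run suggests the GPU never compiled or ran kernels.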
**Do not try float32 in WSL2 either.** `dtype: float32` doubles VRAM to ~26-30GB, exceeds 16GB, OOM crashes the GPU driver, and on Windows this causes a BSOD. Use bf16 or fp16 always.
### Native Linux
Bypasses the DXG bridge entirely. ROCm accesses the GPU natively via `/dev/kfd` and `/dev/dri`. The same training config that hung indefinitely in WSL2 ran at 3-4 seconds per step on native Linux. First 50-step test: 8 minutes total. The difference is dramatic.
---
## ROCm Installation on Native Linux
```bash
# Download the installer .deb (the package is not in the default Ubuntu repos)
wget https://repo.radeon.com/amdgpu-install/7.2.3/ubuntu/noble/amdgpu-install_7.2.3.70203-1_all.deb
sudo apt install -y ./amdgpu-install_7.2.3.70203-1_all.deb
sudo amdgpu-install --usecase=rocm --no-dkms -y
```
`--no-dkms` skips kernel module installation, which is not needed if the amdgpu module is already loaded (which it is in current kernels).
**Critical: amdgpu-install does NOT add your user to the required groups.** You must do this manually:
```bash
sudo usermod -aG render,video $USER
# Then log out and log back in; groups don't apply to existing sessions
```
Without `render` and
`video` group membership, ROCm cannot access `/dev/kfd` and `/dev/dri`. Training will fail silently or with permission errors.
**Verify:**
```bash
rocminfo | grep -E "gfx|Marketing"
# Should show: gfx1200 and "AMD Radeon RX 9060 XT"
amd-smi static | grep -i "gfx\|market"
```
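If `rocminfo` reports nothing or fails with a permission error, it can help to test the device nodes directly. This is a minimal sketch using only the standard library; `device_access` is a made-up helper, and it assumes the render nodes appear as `/dev/dri/renderD*`, which is the usual naming but not something the steps above confirm.

```python
import glob
import os


def device_access(paths):
    """Map each device path to True if this user can open it read/write."""
    return {p: os.access(p, os.R_OK | os.W_OK) for p in paths}


# /dev/kfd is the ROCm compute interface; render nodes live under /dev/dri.
nodes = ["/dev/kfd"] + sorted(glob.glob("/dev/dri/renderD*"))
for path, ok in device_access(nodes).items():
    print(f"{path}: {'ok' if ok else 'no access (check render/video groups)'}")
```

Run it after logging back in; a "no access" result in a fresh session usually points at the group membership step.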
**Additional packages needed that are not in default Kubuntu 24.04:**
```bash
sudo apt install python3.12-venv cmake radeontop
```
`python3.12-venv` is required before you can create any Python venv. `cmake` is required for bitsandbytes compilation.
---
## Training Tool Selection
All major training tools were evaluated. The main blocker for most is ROCm version compatibility:
| Tool | Verdict | Reason |
|------|---------|--------|
| **cupertinomiranda/ai-toolkit-amd-rocm-support** | **Use this** | Explicitly mentions gfx1200/gfx1201, tested on ROCm 7.1 (7.2 works), bitsandbytes instructions included |
| ostris/ai-toolkit (main) | May work | Civitai guide used it on RX 9070 + ROCm 7.2; no confirmed end-to-end results |
| daMustermann/ai-toolkit-rocm | Do not use | Targets ROCm 6.2, incompatible with RDNA4 |
| Kohya_ss / sd-scripts | Do not use | requirements_linux_rocm.txt targets ROCm 6.3, incompatible |
| FluxGym | Do not use | Wraps Kohya internally, same incompatibility |
| SimpleTuner | Avoid | Explicitly states "AMD and Apple GPUs do not work for training Flux" |
| OneTrainer | Possibly | Needs manual ROCm version edit in requirements; AMD support "may be outdated" per maintainers |
**Use cupertinomiranda/ai-toolkit-amd-rocm-support.** It's the only fork that explicitly documents gfx1200/gfx1201 support, ROCm 7.x compatibility, and provides working bitsandbytes build instructions.
Clone and install:
```bash
cd ~
git clone https://github.com/cupertinomiranda/ai-toolkit-amd-rocm-support
cd ai-toolkit-amd-rocm-support
python3 -m venv venv
source venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2
pip install -r requirements-amd.txt
```
**Do not use PyTorch nightly** (`https://download.pytorch.org/whl/nightly/rocm7.2`). Nightly 2.13.0.dev crashes due to a rocprofiler fatal error. Use stable: `https://download.pytorch.org/whl/rocm7.2` which gives 2.11.0+rocm7.2.
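To confirm which build actually landed in the venv, you can inspect the version string. The helper below is a hypothetical sketch that only looks at the string format: release wheels look like `2.11.0+rocm7.2`, while nightlies carry a `.dev` segment. The nightly date used in the example is invented for illustration.

```python
def is_stable_rocm_build(version: str) -> bool:
    """True for a release wheel built against ROCm, False for nightlies or other backends."""
    base, _, local = version.partition("+")
    return local.startswith("rocm") and ".dev" not in base


print(is_stable_rocm_build("2.11.0+rocm7.2"))              # stable ROCm wheel
print(is_stable_rocm_build("2.13.0.dev20260501+rocm7.2"))  # nightly (date is made up)
```

Inside the training venv you would pass `torch.__version__` instead of a literal string.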
**Important: do not clone or install on NTFS mounts** (`/media/`, `/mnt/`). NTFS does not support Linux file permissions, so `chmod` operations will fail with "Operation not permitted". Always install in `~/` (ext4).
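If you are unsure what file system a directory sits on, `/proc/mounts` can tell you. The sketch below is an illustration, not part of the toolkit: `filesystem_type` is a made-up helper that picks the longest matching mount point, and it ignores the escaping `/proc/mounts` uses for paths containing spaces.

```python
import os


def filesystem_type(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the longest mount point that contains path."""
    path = os.path.normpath(path)
    best_mount, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


if os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        home = os.path.realpath(os.path.expanduser("~"))
        print(filesystem_type(home, f.read()))
```

NTFS volumes typically show up as `ntfs`, `ntfs3` or `fuseblk` depending on the driver; `ext4` is what you want to see for the install directory.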
---
## bitsandbytes: Must Compile From Source
`pip install bitsandbytes` installs a CUDA version that does not work on AMD. You must compile from source for gfx1200.
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
cd ~
git clone https://github.com/bitsandbytes-foundation/bitsandbytes -b 0.48.2
cd bitsandbytes
cmake \
  -DCMAKE_HIP_COMPILER="/opt/rocm/lib/llvm/bin/clang++" \
  -DBNB_ROCM_ARCH="gfx1200" \
  -DCOMPUTE_BACKEND=hip \
  .
make -j$(nproc)
pip install .
```
**Key flags:**
- `-DBNB_ROCM_ARCH="gfx1200"`: use gfx1200, not gfx1201. Native Linux ROCm reports gfx1200. Building for the wrong arch produces a binary that silently falls back to CPU.
- `-DCMAKE_HIP_COMPILER`: full path required; ROCm's clang++ is not always in PATH.
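Because the build must match the architecture ROCm reports, it can help to read that value from `rocminfo` rather than typing it by hand. This is a rough sketch, assuming `rocminfo` prints names such as `gfx1200` somewhere in its output; `gfx_targets` and the sample text are hypothetical.

```python
import re


def gfx_targets(rocminfo_text: str) -> list[str]:
    """Return the unique gfx architecture names, in order of first appearance."""
    seen: list[str] = []
    for name in re.findall(r"\bgfx[0-9a-f]{3,5}\b", rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


# Illustrative excerpt only; real rocminfo output is much longer
SAMPLE = (
    "  Name:                    gfx1200\n"
    "  Marketing Name:          AMD Radeon RX 9060 XT\n"
    "      Name:                    amdgcn-amd-amdhsa--gfx1200\n"
)
print(gfx_targets(SAMPLE))
```

On a real machine the input would come from `subprocess.run(["rocminfo"], capture_output=True, text=True).stdout`. A system with an iGPU may return more than one name, in which case the discrete card's entry is the one to pass to `-DBNB_ROCM_ARCH`.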
Verify after install (run inside the training venv):
```python
import bitsandbytes as bnb
print(bnb.__version__)
# Should print 0.48.x or similar, no errors
```
---
## SDXL Model Download
ai-toolkit uses diffusers format (separate component folders). Download the fp16 weights only; the full repo is 25-30GB, but you need just ~6.7GB:
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
huggingface-cli download stabilityai/stable-diffusion-xl-base-1.0 \
--include "*.fp16.safetensors" "*.json" "*.txt" \
--local-dir ~/models/sdxl/
```
diffusers looks for the plain filenames (for example `diffusion_pytorch_model.safetensors`), but this download contains only the `.fp16` variants. Create symlinks:
```bash
cd ~/models/sdxl
# Symlink targets resolve relative to the link's own directory, so the target has no folder prefix
ln -sf diffusion_pytorch_model.fp16.safetensors unet/diffusion_pytorch_model.safetensors
ln -sf diffusion_pytorch_model.fp16.safetensors vae/diffusion_pytorch_model.safetensors
ln -sf model.fp16.safetensors text_encoder/model.safetensors
ln -sf model.fp16.safetensors text_encoder_2/model.safetensors
```
Without these symlinks, the pipeline load fails with a missing file error.
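After creating the links, a quick check can confirm that each one resolves to a real file. This is a minimal sketch under the folder layout used above; `missing_weights` and `EXPECTED_WEIGHTS` are illustrative names, not part of diffusers or ai-toolkit.

```python
from pathlib import Path

EXPECTED_WEIGHTS = [
    "unet/diffusion_pytorch_model.safetensors",
    "vae/diffusion_pytorch_model.safetensors",
    "text_encoder/model.safetensors",
    "text_encoder_2/model.safetensors",
]


def missing_weights(model_dir: str) -> list[str]:
    """List expected weight files that are absent or are dangling symlinks."""
    root = Path(model_dir).expanduser()
    # Path.exists() follows symlinks, so a link with a missing target counts as absent
    return [rel for rel in EXPECTED_WEIGHTS if not (root / rel).exists()]
```

Calling `missing_weights("~/models/sdxl")` should return an empty list before training starts; any entry it returns is a link to recreate.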
---
## GPU Monitoring
`rocm-smi` works on native Linux (unlike WSL2 where it was broken):
```bash
watch -n1 rocm-smi   # text monitor, refreshes every second
radeontop            # AMD-specific graphical TUI (recommended)
```
**Do not use nvtop 3.0.2**: it crashes on this ROCm/AMD setup. Use radeontop instead.
If your system has both a discrete GPU and an integrated GPU (e.g. Ryzen with Vega iGPU), radeontop defaults to bus 0, which may be the iGPU. Find your discrete GPU's PCI bus number with `lspci | grep -iE "vga|display"` (the leading `03` in an entry like `03:00.0`) and pass it with `-b`: `radeontop -b 03` (the number varies by system).
---
## Photo Captioning with JoyCaption
**JoyCaption Beta One** (`fancyfeast/llama-joycaption-beta-one-hf-llava`) produces high-quality captions specifically designed for LoRA training. It's a Llama 3.1 base with a SigLIP vision encoder.
Download (~16GB):
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
huggingface-cli download fancyfeast/llama-joycaption-beta-one-hf-llava \
--local-dir ~/models/joycaption/
```
Performance on RX 9060 XT: ~5 sec/photo, ~82% GPU load, ~11.7GB VRAM peak.
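At about 5 seconds per photo, a whole dataset is best captioned in one batch. Below is a minimal sketch of such a loop, assuming captions are stored as a `.txt` file next to each image. `caption_folder` is a hypothetical helper, and `caption_fn` stands in for whatever function wraps the JoyCaption call.

```python
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def caption_folder(folder: str, caption_fn: Callable[[Path], str]) -> int:
    """Write <image>.txt next to each image lacking a caption; return the count written."""
    written = 0
    for image in sorted(Path(folder).expanduser().iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        caption_file = image.with_suffix(".txt")
        if caption_file.exists():
            continue  # skip finished photos so an interrupted run can resume
        caption_file.write_text(caption_fn(image).strip() + "\n", encoding="utf-8")
        written += 1
    return written
```

Skipping existing `.txt` files means a crash partway through a large folder costs only the photo that was in flight.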
### Three bugs to know about
**Bug 1: Use local path, not HF repo ID**
```python
import os

# Wrong: ignores the local copy and pulls the 16GB model from Hugging Face again
MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"
# Correct: point at the directory downloaded above
MODEL_NAME = os.path.expanduser("~/models/joycaption")
```
**Bug 2: apply_chat_template with multimodal list content fails**
The Jinja2 sandbox in this version of transformers cannot call `.replace()` on list content. The multimodal format `[{"type": "image"}, {"type": "text", ...}]` throws:
```
UndefinedError: 'list object' has no attribute 'replace'
```
Fix: use a plain string with the image token embedded:
```python
conversation = [{"role": "user", "content": f"<image>\n{PROMPT}"}]
text_input = processor.tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = processor(images=image, text=text_input, return_tensors="pt").to(model.device)
```
**Bug 3: 4-bit quantization breaks SigLIP vision tower**
`BitsAndBytesConfig(load_in_4bit=True)` quantizes all linear layers including SigLIP's `MultiheadAttention.out_proj`. SigLIP calls `F.multi_head_attention_forward` with raw weight tensors, bypassing bitsandbytes' override, causing:
```
RuntimeError: self and mat2 must have the same dtype, but got Half and Byte
```
Fix: use 8-bit with vision modules excluded:
```python
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

# Quantize the LLM to 8-bit but leave the vision modules in fp16
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["vision_tower", "multi_modal_projector"],
)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
This keeps the LLM at 8-bit (~8GB) and the vision tower at fp16 (~1-2GB), for roughly 10-11GB of VRAM in total, which fits comfortably on a 16GB card.
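The memory figures follow from bytes per parameter. Here is a back-of-the-envelope sketch, assuming the language model has roughly 8 billion parameters (an assumption based on the ~16GB download size); `weight_gb` is an illustrative helper and counts weights only, not activations or cache.

```python
def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


LLM_PARAMS = 8e9  # assumption: nominal 8B-parameter language model

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"LLM at {label}: {weight_gb(LLM_PARAMS, bits):.1f} GB")
```

Under that assumption, fp16 weights alone would fill a 16GB card before the vision tower is loaded, while 8-bit leaves room for the fp16 vision modules and runtime overhead.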
---
## The Trigger Word Problem
If you generate captions with JoyCaption (or any captioner), the captions are plain descriptive text. **The model has no trigger word unless you explicitly add one to every caption.**
Example: if you train with JoyCaption captions and then generate with prompt `"ohwx man, portrait photo..."`, the token `ohwx man` was never in the training data and is ignored by the LoRA. It is not harmful but it does nothing.
Options:
1. Prepend a trigger word to all captions before training: `"ohwx man, [joycaption text]"`. This requires a script to add the prefix to every `.txt` file.
2. Use the `trigger_word` or `caption_prefix` setting in the training config if the tool supports it. cupertinomiranda/ai-toolkit does not currently expose this for Flux.
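Option 1 can be scripted so that it is safe to run twice. This is a minimal sketch, assuming the captions are `.txt` files in a single folder; `prepend_trigger` is a hypothetical helper, and `ohwx man` is just the example trigger used above.

```python
from pathlib import Path


def prepend_trigger(folder: str, trigger: str = "ohwx man") -> int:
    """Prefix every .txt caption with the trigger word; return how many files changed."""
    prefix = f"{trigger}, "
    changed = 0
    for caption_file in sorted(Path(folder).expanduser().glob("*.txt")):
        text = caption_file.read_text(encoding="utf-8")
        if text.startswith(prefix):
            continue  # already tagged, so re-running does not stack prefixes
        caption_file.write_text(prefix + text, encoding="utf-8")
        changed += 1
    return changed
```

The `startswith` check is the main difference from a plain in-place substitution: a second run leaves already-prefixed captions untouched.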
**Recommendation:** For option 1, a one-liner to prepend to all captions: `for f in /path/to/photos/*.txt; do sed -i "1s/^/ohwx man, /" "$f"; done`. Include the trigger word