Captivating Chroma
I'm a huge fan of lodestones/Chroma: it's very good at realism, creativity, and overall freedom. It is based on FLUX.1-schnell, so the architecture is a bit older by now.
What I like most about Chroma is the incredible team and community behind it, especially the legendary Lodestones and the almighty, wise Silver. It must take a monumental amount of time, effort, and work to create a model this good.
There is also the work-in-progress lodestones/Zeta-Chroma on the horizon, which is based on the great Z-Image/turbo.
In a world where companies are closed-sourcing their models, it's amazing to see the great independent work that creators like these are doing to keep the open-source community alive.
Well done to the Chroma team and the community behind it. You are all legends.
https://redd.it/1tg5nit
@rStableDiffusion
LoRA Training Auto-caption generator recommendation?
Hi guys! I'm fairly new to image generation. I'm trying to train a character LoRA from multiple reference images, and I believe each image needs its own caption, right? With 30 or more images, captioning each one by hand gets tiring. Can you recommend a good, free auto-captioning tool for LoRA training that can process multiple images at once? By the way, I'm training for the ZIT model.
Thank you in advance!
https://redd.it/1tgd3zo
@rStableDiffusion
RealTime character swap
https://preview.redd.it/rm7xko8hsu1h1.png?width=2054&format=png&auto=webp&s=e01c06ce7224ce6590bd63714cdfae3b40946aef
Updated app from the DeluluStream team using Lucy 2.1
https://reddit.com/link/1tgfq9y/video/zozqmbjjqu1h1/player
https://redd.it/1tgfq9y
@rStableDiffusion
Training a Portrait LoRA on AMD RX 9060 XT (RDNA4 / gfx1200) on Native Linux
This is a full account of getting LoRA training working on an AMD RX 9060 XT (Navi 44, RDNA4) on native Kubuntu 24.04.4. It covers everything tried, what failed and why, what had to be fixed, and what ended up working. Written for anyone with the same or similar hardware who wants to skip the trial-and-error.
---
## Hardware
- **GPU:** AMD RX 9060 XT — Navi 44, RDNA4, gfx1200, 16GB GDDR6, 150W TDP
- **CPU:** AMD Ryzen 5 5600G
- **RAM:** 32GB
- **OS:** Kubuntu 24.04.4, kernel 6.17.0-23-generic
- **Primary SSD:** Samsung 990 1TB M.2 (ext4, Linux)
- **ROCm:** 7.2.3
**Important architecture note:** Native Linux ROCm and `amd-smi` report this GPU as **gfx1200**. If you have WSL2 experience with this card, you may have seen gfx1201 — that was the WSL2 librocdxg bridge reporting incorrectly. The correct arch ID on native Linux is **gfx1200**. This matters for cmake flags and any arch-specific builds.
---
## Goal
Train a LoRA on portrait photos and use it with ComfyUI to generate lifestyle portrait photos. Models tested: SDXL (completed), Flux.1 Dev (completed, 1500 steps).
The article covers both models in sequence. The SDXL sections document a fully working pipeline and are useful standalone — but if you only care about Flux, you can skip ahead. The SDXL sections are not a prerequisite for Flux.
---
## Why Native Linux, Not WSL2
I started on Windows 10 + WSL2 (Ubuntu 24.04). Short version: **don't bother with WSL2 for RDNA4 training as of May 2026.**
### What happens in WSL2
WSL2 GPU passthrough for AMD goes through the DXG bridge — a closed-source component (`libthunk_proxy.a`) inside the AMD Adrenalin driver. On RDNA4, there is a confirmed bug in this library that breaks GPU kernel dispatch for large workloads.
Symptom: training appears to start, pipeline loads successfully, GPU VRAM fills to 8-10GB — but then nothing. CPU climbs to 30%, RAM to 28GB, GPU compute stays at 0%. The first training step either runs entirely on CPU (~50 minutes for one SDXL step at batch size 1) or the process hangs indefinitely.
The error that appears in logs:
```
[GetSegmentId] Failed to get segment id for type 1
```
This is librocdxg Issue #22 (opened April 2026, unfixed as of May 2026). Root cause is in `libthunk_proxy.a` which is closed source — librocdxg cannot fix it, only AMD can by shipping an updated driver.
**What was ruled out through testing:**
- bitsandbytes (same hang with plain adamw)
- bf16 precision (same hang with fp16)
- accelerate config (explicit single-GPU config made no difference)
- model loading (all 7 pipeline components load fine, VRAM fills correctly)
- Proof: MIOpen kernel cache after a 50-minute "run" was 180KB — essentially empty. If GPU kernels had been compiling for 50 minutes, the cache would be hundreds of MB. The work was running on CPU the whole time.
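The cache-size check in the last bullet is easy to repeat on your own machine. A minimal sketch, assuming MIOpen keeps its compiled-kernel cache under `~/.cache/miopen` (the usual default, but it can be relocated, so treat the path as an assumption); `dir_size_bytes` is a hypothetical helper name:

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below root (0 if root is missing)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so nothing outside the tree is counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # Assumed default location of the MIOpen kernel cache.
    cache_dir = os.path.expanduser("~/.cache/miopen")
    size_kb = dir_size_bytes(cache_dir) / 1024
    print(f"{cache_dir}: {size_kb:.0f} KB")
```

A cache that stays in the kilobyte range after a long "training" run points the same way as the observation above: no GPU kernels were being compiled.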
**Do not try float32 in WSL2 either.** `dtype: float32` doubles VRAM to ~26-30GB, exceeds 16GB, OOM crashes the GPU driver, and on Windows this causes a BSOD. Use bf16 or fp16 always.
### Native Linux
Bypasses the DXG bridge entirely. ROCm accesses the GPU natively via `/dev/kfd` and `/dev/dri`. The same training config that hung indefinitely in WSL2 ran at 3-4 seconds per step on native Linux. First 50-step test: 8 minutes total. The difference is dramatic.
---
## ROCm Installation on Native Linux
```bash
# Download the installer .deb — the package is not in the default Ubuntu repos
wget https://repo.radeon.com/amdgpu-install/7.2.3/ubuntu/noble/amdgpu-install_7.2.3.70203-1_all.deb
sudo apt install -y ./amdgpu-install_7.2.3.70203-1_all.deb
sudo amdgpu-install --usecase=rocm --no-dkms -y
```
`--no-dkms` skips kernel module installation — not needed if the amdgpu module is already loaded (which it is in current kernels).
**Critical: amdgpu-install does NOT add your user to the required groups.** You must do this manually:
```bash
sudo usermod -aG render,video $USER
# Then log out and log back in — groups don't apply to existing sessions
```
Without `render` and `video` group membership, ROCm cannot access `/dev/kfd` and `/dev/dri`. Training will fail silently or with permission errors.
**Verify:**
```bash
rocminfo | grep -E "gfx|Marketing"
# Should show: gfx1200 and "AMD Radeon RX 9060 XT"
amd-smi static | grep -i "gfx\|market"
```
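Since a missing group can fail silently, it may be worth checking membership from inside the session you will train in. A minimal sketch using only the standard library; `missing_groups` and `current_group_names` are hypothetical helper names, and checking `/dev/dri/renderD*` is my assumption about which device nodes matter:

```python
import glob
import grp
import os


def missing_groups(required, current):
    """Return the required group names that are absent from current."""
    return sorted(set(required) - set(current))


def current_group_names():
    """Names of the groups the running process belongs to."""
    names = []
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A GID without a name entry; skip it.
            continue
    return names


if __name__ == "__main__":
    missing = missing_groups(["render", "video"], current_group_names())
    print("missing groups:", missing or "none")
    # Check the compute node and any render nodes for read/write access.
    for dev in ["/dev/kfd"] + sorted(glob.glob("/dev/dri/renderD*")):
        ok = os.access(dev, os.R_OK | os.W_OK)
        print(f"{dev}: {'accessible' if ok else 'NOT accessible'}")
```

If a group still shows as missing right after `usermod`, the session has most likely not been restarted yet.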
**Additional packages needed that are not in default Kubuntu 24.04:**
```bash
sudo apt install python3.12-venv cmake radeontop
```
`python3.12-venv` is required before you can create any Python venv. `cmake` is required for bitsandbytes compilation.
---
## Training Tool Selection
All major training tools were evaluated. The main blocker for most is ROCm version compatibility:
| Tool | Verdict | Reason |
|------|---------|--------|
| **cupertinomiranda/ai-toolkit-amd-rocm-support** | **Use this** | Explicitly mentions gfx1200/gfx1201, tested on ROCm 7.1 (7.2 works), bitsandbytes instructions included |
| ostris/ai-toolkit (main) | May work | Civitai guide used it on RX 9070 + ROCm 7.2; no confirmed end-to-end results |
| daMustermann/ai-toolkit-rocm | Do not use | Targets ROCm 6.2 — incompatible with RDNA4 |
| Kohya_ss / sd-scripts | Do not use | requirements_linux_rocm.txt targets ROCm 6.3 — incompatible |
| FluxGym | Do not use | Wraps Kohya internally, same incompatibility |
| SimpleTuner | Avoid | Explicitly states "AMD and Apple GPUs do not work for training Flux" |
| OneTrainer | Possibly | Needs manual ROCm version edit in requirements; AMD support "may be outdated" per maintainers |
**Use cupertinomiranda/ai-toolkit-amd-rocm-support.** It's the only fork that explicitly documents gfx1200/gfx1201 support, ROCm 7.x compatibility, and provides working bitsandbytes build instructions.
Clone and install:
```bash
cd ~
git clone https://github.com/cupertinomiranda/ai-toolkit-amd-rocm-support
cd ai-toolkit-amd-rocm-support
python3 -m venv venv
source venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2
pip install -r requirements-amd.txt
```
**Do not use PyTorch nightly** (`https://download.pytorch.org/whl/nightly/rocm7.2`). Nightly 2.13.0.dev crashes due to a rocprofiler fatal error. Use stable: `https://download.pytorch.org/whl/rocm7.2` which gives 2.11.0+rocm7.2.
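To confirm which build actually landed in the venv, the version string is enough. A minimal sketch, assuming nightly wheels carry `.dev` in their version (as in the `2.13.0.dev` build mentioned above) and ROCm wheels carry a `+rocm` suffix; `is_stable_rocm_build` is a hypothetical helper:

```python
def is_stable_rocm_build(version: str) -> bool:
    """True for strings like '2.11.0+rocm7.2'; False for nightly or non-ROCm builds."""
    base, _, local = version.partition("+")
    return local.startswith("rocm") and ".dev" not in base


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print(torch.__version__, "stable ROCm build:", is_stable_rocm_build(torch.__version__))
        # torch.version.hip is set on ROCm builds and None otherwise.
        print("HIP runtime:", torch.version.hip)
        print("GPU visible:", torch.cuda.is_available())
```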
**Important: do not clone or install on NTFS mounts** (`/media/`, `/mnt/`). NTFS does not support Linux file permissions — `chmod` operations will fail with "Operation not permitted". Always install in `~/` (ext4).
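If you are unsure what filesystem a directory lives on, `/proc/mounts` will tell you. A minimal sketch; `fs_type_for` is a hypothetical helper, and the set of type names treated as NTFS (`ntfs`, `ntfs3`, and `fuseblk` for ntfs-3g mounts) is my assumption about how such mounts are usually reported. Mount points containing spaces are not handled.

```python
import os


def fs_type_for(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the longest mount point that contains path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so /mnt/win does not match /mnt/windows.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


if __name__ == "__main__":
    with open("/proc/mounts") as fh:
        kind = fs_type_for(os.path.expanduser("~"), fh.read())
    warn = kind in {"ntfs", "ntfs3", "fuseblk"}
    print(kind, "(avoid for installs)" if warn else "(ok)")
```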
---
## bitsandbytes: Must Compile From Source
`pip install bitsandbytes` installs a CUDA version that does not work on AMD. You must compile from source for gfx1200.
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
cd ~
git clone https://github.com/bitsandbytes-foundation/bitsandbytes -b 0.48.2
cd bitsandbytes
cmake \
-DCMAKE_HIP_COMPILER="/opt/rocm/lib/llvm/bin/clang++" \
-DBNB_ROCM_ARCH="gfx1200" \
-DCOMPUTE_BACKEND=hip \
.
make -j$(nproc)
pip install .
```
**Key flags:**
- `-DBNB_ROCM_ARCH="gfx1200"` — use gfx1200, not gfx1201. Native Linux ROCm reports gfx1200. Building for the wrong arch produces a binary that silently falls back to CPU.
- `-DCMAKE_HIP_COMPILER` — full path required; ROCm's clang++ is not always in PATH.
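Because the arch flag has to match what ROCm reports, it can help to read the ID from `rocminfo` instead of typing it from memory. A minimal sketch that simply collects every `gfx…` token in the output; `gfx_targets` is a hypothetical helper, and on a system with an integrated GPU you should expect a second ID in the list and pick the discrete card's.

```python
import re
import shutil
import subprocess


def gfx_targets(rocminfo_text: str) -> list[str]:
    """Return the unique gfx architecture IDs mentioned in rocminfo output."""
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text)))


if __name__ == "__main__":
    if shutil.which("rocminfo") is None:
        print("rocminfo not found on PATH")
    else:
        out = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=True
        ).stdout
        print(gfx_targets(out))
```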
Verify after install (run inside the training venv):
```python
import bitsandbytes as bnb
print(bnb.__version__)
# Should print 0.48.x or similar, no errors
```
---
## SDXL Model Download
ai-toolkit uses diffusers format (separate component folders). Download fp16 only — the full repo is 25-30GB and you only need ~6.7GB:
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
huggingface-cli download stabilityai/stable-diffusion-xl-base-1.0 \
--include "*.fp16.safetensors" "*.json" "*.txt" \
--local-dir ~/models/sdxl/
```
diffusers looks for `diffusion_pytorch_model.safetensors` but only the fp16 versions exist. Create symlinks:
```bash
cd ~/models/sdxl
# A relative link target is resolved from the directory that holds the link,
# so the target is just the file name inside each component folder.
ln -sf diffusion_pytorch_model.fp16.safetensors unet/diffusion_pytorch_model.safetensors
ln -sf diffusion_pytorch_model.fp16.safetensors vae/diffusion_pytorch_model.safetensors
ln -sf model.fp16.safetensors text_encoder/model.safetensors
ln -sf model.fp16.safetensors text_encoder_2/model.safetensors
```
Without these symlinks, the pipeline load fails with a missing file error.
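The same links can be created from Python, which avoids typing one command per component. A minimal sketch; `link_fp16_weights` is a hypothetical helper, and it links every `*.fp16.safetensors` file it finds, which may be more components than the four listed above. Each link points at the bare file name so that it resolves inside its own component folder.

```python
from pathlib import Path


def link_fp16_weights(model_dir: str) -> list[Path]:
    """For every '*.fp16.safetensors' file, create a sibling symlink without '.fp16'."""
    created = []
    for src in sorted(Path(model_dir).expanduser().rglob("*.fp16.safetensors")):
        link = src.with_name(src.name.replace(".fp16.safetensors", ".safetensors"))
        if link.exists() or link.is_symlink():
            continue  # leave existing files and links untouched
        # Target is the bare file name, resolved relative to the link's directory.
        link.symlink_to(src.name)
        created.append(link)
    return created


if __name__ == "__main__":
    for link in link_fp16_weights("~/models/sdxl"):
        print("linked", link)
```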
---
## GPU Monitoring
`rocm-smi` works on native Linux (unlike WSL2 where it was broken):
```bash
watch -n1 rocm-smi # text monitor, refreshes every second
radeontop # AMD-specific graphical TUI — recommended
```
**Do not use nvtop 3.0.2** — it crashes on this ROCm/AMD setup. Use radeontop instead.
If your system has both a discrete GPU and an integrated GPU (e.g. Ryzen with Vega iGPU), radeontop defaults to bus 0 which may be the iGPU. Find your discrete GPU's bus ID with `radeontop -l` and pass it with `-b`: `radeontop -b 03` (the number varies by system).
---
## Photo Captioning with JoyCaption
**JoyCaption Beta One** (`fancyfeast/llama-joycaption-beta-one-hf-llava`) produces high-quality captions specifically designed for LoRA training. It's a Llama 3.1 base with a SigLIP vision encoder.
Download (~16GB):
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
huggingface-cli download fancyfeast/llama-joycaption-beta-one-hf-llava \
--local-dir ~/models/joycaption/
```
Performance on RX 9060 XT: ~5 sec/photo, ~82% GPU load, ~11.7GB VRAM peak.
### Three bugs to know about
**Bug 1: Use local path, not HF repo ID**
```python
import os

# Wrong: re-downloads 16GB from HuggingFace every run:
MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"
# Correct: point at the local copy downloaded above
MODEL_NAME = os.path.expanduser("~/models/joycaption")
```
**Bug 2: apply_chat_template with multimodal list content fails**
The Jinja2 sandbox in this version of transformers cannot call `.replace()` on list content. The multimodal format `[{"type": "image"}, {"type": "text", ...}]` throws:
```
UndefinedError: 'list object' has no attribute 'replace'
```
Fix: use a plain string with the image token embedded:
```python
conversation = [{"role": "user", "content": f"<image>\n{PROMPT}"}]
text_input = processor.tokenizer.apply_chat_template(
conversation, tokenize=False, add_generation_prompt=True
)
inputs = processor(images=image, text=text_input, return_tensors="pt").to(model.device)
```
**Bug 3: 4-bit quantization breaks SigLIP vision tower**
`BitsAndBytesConfig(load_in_4bit=True)` quantizes all linear layers including SigLIP's `MultiheadAttention.out_proj`. SigLIP calls `F.multi_head_attention_forward` with raw weight tensors, bypassing bitsandbytes' override, causing:
```
RuntimeError: self and mat2 must have the same dtype, but got Half and Byte
```
Fix: use 8-bit with vision modules excluded:
```python
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration
bnb_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_skip_modules=["vision_tower", "multi_modal_projector"],
)
model = LlavaForConditionalGeneration.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
torch_dtype=torch.float16,
device_map="auto",
)
```
This keeps the LLM at 8-bit (~8GB) and the vision tower at fp16 (~1-2GB), totalling ~10-11GB VRAM. Fits comfortably on 16GB.
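The ~8GB figure follows from simple arithmetic on the weights. A rough sketch, assuming the language model has about 8 billion parameters (my assumption; the text above only says "Llama 3.1 base") and ignoring activations, KV cache, and quantization overhead:

```python
def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (10**9 bytes) for n_params parameters."""
    return n_params * bits_per_param / 8 / 1e9


LLM_PARAMS = 8e9  # assumed parameter count of the language model

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_gb(LLM_PARAMS, bits):.0f} GB")
```

Under that assumption, fp16 weights alone would already take the whole 16GB card, which is why some form of quantization is needed here.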
---
## The Trigger Word Problem
If you generate captions with JoyCaption (or any captioner), the captions are plain descriptive text. **The model has no trigger word unless you explicitly add one to every caption.**
Example: if you train with JoyCaption captions and then generate with prompt `"ohwx man, portrait photo..."`, the token `ohwx man` was never in the training data and is ignored by the LoRA. It is not harmful but it does nothing.
Options:
1. Prepend a trigger word to all captions before training: `"ohwx man, [joycaption text]"` — requires a script to add the prefix to every `.txt` file
2. Use the `trigger_word` or `caption_prefix` setting in the training config if the tool supports it — cupertinomiranda/ai-toolkit does not currently expose this for Flux
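Option 1 can be sketched in Python as a rerunnable script. This is a minimal sketch, not part of ai-toolkit: `add_trigger` is a hypothetical helper, the directory is a placeholder, and `ohwx man` is just the example trigger used above. It skips captions that already start with the trigger, so running it twice does not double the prefix.

```python
from pathlib import Path


def add_trigger(caption_dir: str, trigger: str) -> int:
    """Prepend 'trigger, ' to every .txt caption that lacks it; return files changed."""
    changed = 0
    for txt in sorted(Path(caption_dir).expanduser().glob("*.txt")):
        caption = txt.read_text(encoding="utf-8").strip()
        if caption.startswith(trigger):
            continue  # already prefixed, so reruns are harmless
        txt.write_text(f"{trigger}, {caption}\n", encoding="utf-8")
        changed += 1
    return changed


if __name__ == "__main__":
    # Placeholder path: point this at the folder holding your photos and captions.
    print(add_trigger("/path/to/photos", "ohwx man"), "captions updated")
```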
**Recommendation:** For option 1, a one-liner to prepend to all captions: `for f in /path/to/photos/*.txt; do sed -i "1s/^/ohwx man, /" "$f"; done`. Include the trigger word
ln -sf text_encoder_2/model.fp16.safetensors text_encoder_2/model.safetensors
```
Without these symlinks, the pipeline load fails with a missing file error.
---
## GPU Monitoring
`rocm-smi` works on native Linux (unlike WSL2 where it was broken):
```bash
watch -n1 rocm-smi # text monitor, refreshes every second
radeontop # AMD-specific graphical TUI — recommended
```
**Do not use nvtop 3.0.2** — it crashes on this ROCm/AMD setup. Use radeontop instead.
If your system has both a discrete GPU and an integrated GPU (e.g. Ryzen with Vega iGPU), radeontop defaults to bus 0 which may be the iGPU. Find your discrete GPU's bus ID with `radeontop -l` and pass it with `-b`: `radeontop -b 03` (the number varies by system).
---
## Photo Captioning with JoyCaption
**JoyCaption Beta One** (`fancyfeast/llama-joycaption-beta-one-hf-llava`) produces high-quality captions specifically designed for LoRA training. It's a Llama 3.1 base with a SigLIP vision encoder.
Download (~16GB):
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
huggingface-cli download fancyfeast/llama-joycaption-beta-one-hf-llava \
--local-dir ~/models/joycaption/
```
Performance on RX 9060 XT: ~5 sec/photo, ~82% GPU load, ~11.7GB VRAM peak.
### Three bugs to know about
**Bug 1: Use local path, not HF repo ID**
```python
# Wrong — re-downloads 16GB from HuggingFace every run:
MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"
# Correct:
MODEL_NAME = os.path.expanduser("~/models/joycaption")
```
**Bug 2: apply_chat_template with multimodal list content fails**
The Jinja2 sandbox in this version of transformers cannot call `.replace()` on list content. The multimodal format `[{"type": "image"}, {"type": "text", ...}]` throws:
```
UndefinedError: 'list object' has no attribute 'replace'
```
Fix: use a plain string with the image token embedded:
```python
conversation = [{"role": "user", "content": f"<image>\n{PROMPT}"}]
text_input = processor.tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = processor(images=image, text=text_input, return_tensors="pt").to(model.device)
```
**Bug 3: 4-bit quantization breaks SigLIP vision tower**
`BitsAndBytesConfig(load_in_4bit=True)` quantizes all linear layers including SigLIP's `MultiheadAttention.out_proj`. SigLIP calls `F.multi_head_attention_forward` with raw weight tensors, bypassing bitsandbytes' override, causing:
```
RuntimeError: self and mat2 must have the same dtype, but got Half and Byte
```
Fix: use 8-bit with vision modules excluded:
```python
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

# Quantize only the LLM; keep the SigLIP vision tower and projector in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["vision_tower", "multi_modal_projector"],
)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
This keeps the LLM at 8-bit (~8GB) and the vision tower at fp16 (~1-2GB), totalling ~10-11GB VRAM. Fits comfortably on 16GB.
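To caption a whole folder in one pass, the model call can be wrapped in a simple loop that writes one `.txt` next to each image. The sketch below is only an outline under assumptions: `caption_folder` and `caption_fn` are hypothetical names (not part of JoyCaption or transformers), and `caption_fn` stands in for whatever function runs the processor and `model.generate` on a single image path and returns the caption string.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def caption_folder(folder, caption_fn, overwrite=False):
    """Write <image stem>.txt beside each image; return the list of files written."""
    written = []
    for image_path in sorted(Path(folder).iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTS:
            continue
        caption_path = image_path.with_suffix(".txt")
        if caption_path.exists() and not overwrite:
            continue  # keep captions that were already generated or hand-edited
        caption = caption_fn(image_path).strip()
        caption_path.write_text(caption + "\n", encoding="utf-8")
        written.append(caption_path)
    return written
```

Skipping existing `.txt` files makes the loop safe to re-run after an interruption. At the ~5 sec/photo measured above, a 30-image dataset should take roughly 2.5 minutes.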
---
## The Trigger Word Problem
If you generate captions with JoyCaption (or any captioner), the captions are plain descriptive text. **The model has no trigger word unless you explicitly add one to every caption.**
Example: if you train with JoyCaption captions and then generate with the prompt `"ohwx man, portrait photo..."`, the token `ohwx man` never appeared in the training data, so the LoRA has learned no association with it. The token is harmless, but it does nothing.
Options:
1. Prepend a trigger word to all captions before training: `"ohwx man, [joycaption text]"` — requires a script to add the prefix to every `.txt` file
2. Use the `trigger_word` or `caption_prefix` setting in the training config if the tool supports it — cupertinomiranda/ai-toolkit does not currently expose this for Flux
**Recommendation:** For option 1, a one-liner to prepend to all captions: `for f in /path/to/photos/*.txt; do sed -i "1s/^/ohwx man, /" "$f"; done`. Include the trigger word in your generation prompts.
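Running the `sed` one-liner twice would add the prefix twice, and it leaves empty caption files untouched. A minimal Python sketch of the same idea that can be re-run safely is shown below; `prepend_trigger` is a hypothetical helper name, and the `"ohwx man"` default simply mirrors the example above.

```python
from pathlib import Path

def prepend_trigger(caption_dir, trigger="ohwx man"):
    """Prefix every .txt caption with the trigger word; return how many files changed."""
    changed = 0
    for path in sorted(Path(caption_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text.startswith(trigger):
            continue  # already tagged, so a second run is a no-op
        new_text = f"{trigger}, {text}" if text else trigger
        path.write_text(new_text + "\n", encoding="utf-8")
        changed += 1
    return changed
```

Call it as `prepend_trigger("/path/to/photos")` after captioning and before the first training run, so the cached text embeddings include the trigger word.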
---
## SDXL Training Config
Save this as `~/ai-toolkit-amd-rocm-support/config/train_sdxl_full.yaml`. Minimum working config for 1500 steps, batch size 1, gfx1200:
```yaml
job: extension
config:
  name: "sdxl_ohwx_man"
  process:
    - type: 'sd_trainer'
      training_folder: "output"
      device: cuda:0
      network:
        type: "lora"
        linear: 32
        linear_alpha: 16
      save:
        dtype: float16
        save_every: 250
        max_step_saves_to_keep: 4
      datasets:
        - folder_path: "/path/to/your/photos"
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          shuffle_tokens: false
          cache_latents_to_disk: true
          resolution: [512, 1024]
      train:
        batch_size: 1
        steps: 1500
        gradient_accumulation_steps: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "ddpm"
        optimizer: "adamw8bit"
        lr: 1e-4
        disable_sampling: true
        dtype: bf16
      model:
        name_or_path: "~/models/sdxl"
        is_xl: true
meta:
  name: "[name]"
  version: '1.0'
```
**Critical config notes:**
- `name: "sdxl_ohwx_man"` — determines the output folder name and LoRA filename. Change this to whatever name you want.
- `dtype: bf16` — never use `float32`. Float32 doubles VRAM to ~26-30GB, causes OOM, GPU driver crash, and on Windows a BSOD.
- `disable_sampling: true` — skips sample image generation during training. Saves time and VRAM.
- `cache_latents_to_disk: true` — first run does two caching passes (preview resolution and training resolution), then saves to disk. Subsequent runs skip both passes.
- `optimizer: "adamw8bit"` — requires bitsandbytes compiled from source. Halves optimizer VRAM vs standard adamw.
- `linear: 32, linear_alpha: 16` — rank 32 LoRA. Higher rank captures more detail but risks overfitting with smaller datasets. For Flux, rank 16 is sufficient — Flux is architecturally more capable and lower rank achieves equivalent quality.
- `train_text_encoder: false` — optional for SDXL (CLIP encoder is ~500MB, you could train it). For Flux this becomes mandatory — T5 is 9.5GB and must stay on CPU.
- `noise_scheduler: "ddpm"` — SDXL-specific. Flux uses `"flowmatch"` instead — the two are not interchangeable.
- `resolution: [512, 1024]` — works for SDXL. For Flux, the 1024 bucket (832×1216 / 1216×832) OOMs even with 4-bit quantization because weights are dequantized to bf16 at compute time. Use `[512, 768]` for Flux.
### Training command
```bash
cd ~/ai-toolkit-amd-rocm-support
source venv/bin/activate
systemd-inhibit --what=sleep:idle --who="LoRA training" --why="Training in progress" \
bash -c 'HSA_ENABLE_SDMA=0 python run.py config/train_sdxl_full.yaml'
```
**Why `systemd-inhibit`:** Kubuntu's power manager will suspend the system after a period of inactivity. Training looks like an idle desktop to the power manager — there is no mouse or keyboard input. `systemd-inhibit` prevents sleep and idle suspension for the duration of training.
**Why `bash -c '...'` wrapper:** `systemd-inhibit` expects a command to execute, not a shell expression. `HSA_ENABLE_SDMA=0 python run.py ...` is an env variable assignment + command — that's shell syntax, not a standalone command. Without the `bash -c` wrapper, systemd-inhibit tries to execute `HSA_ENABLE_SDMA=0` as a binary and fails with "No such file or directory".
**`HSA_ENABLE_SDMA=0`:** Disables SDMA (system DMA) in the ROCm HSA runtime. Costs ~10-15% training speed but prevents random crashes and hangs that can occur on some RDNA4 configurations. Recommended for training runs you don't want to babysit.
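If you would rather launch training from a Python script than from the shell, the `bash -c` wrapper can be avoided by setting the variable in the process environment, which `systemd-inhibit` passes on to the command it runs. The following is a sketch under assumptions: `build_training_command` and `run_training` are hypothetical helper names, and `python` is assumed to resolve to the training venv's interpreter.

```python
import os
import subprocess

def build_training_command(config_path, python_exe="python"):
    """Return (argv, env) for a sleep-inhibited training run with SDMA disabled."""
    argv = [
        "systemd-inhibit",
        "--what=sleep:idle",
        "--who=LoRA training",
        "--why=Training in progress",
        python_exe, "run.py", config_path,
    ]
    # The variable lives in the inherited environment instead of the command
    # line, so systemd-inhibit never sees it as something to execute.
    env = dict(os.environ, HSA_ENABLE_SDMA="0")
    return argv, env

def run_training(config_path, cwd):
    argv, env = build_training_command(config_path)
    # check=True raises CalledProcessError if training exits non-zero.
    return subprocess.run(argv, env=env, cwd=cwd, check=True)
```

A call such as `run_training("config/train_sdxl_full.yaml", cwd=os.path.expanduser("~/ai-toolkit-amd-rocm-support"))` should behave like the shell command above, provided the venv is active.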
### Results on RX 9060 XT
- Steps: 1500
- Wall time: ~76 minutes
- Speed: 1.5-3.6 sec/step (variable; first steps slower due to caching passes)
- VRAM peak: ~10GB
- Final loss: 0.005
- Output: single `.safetensors` file, ~150MB
Two latent caching passes happen before training starts:
- Pass 1 (preview resolution ~416×608): ~40 seconds
- Pass 2 (training resolution ~832×1216): ~2.5 minutes
These only run once; subsequent training runs from the same dataset skip them.
---
## ComfyUI Installation
ComfyUI is the recommended generation UI — it has official ROCm Linux support and an AMD partnership.
```bash
cd ~
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv venv
source venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2
pip install -r requirements.txt
```
**Note on disk space:** This installs a second copy of PyTorch (~14GB). If you already have a training venv, you now have 28GB of PyTorch on disk. There is no simple way around this — the two venvs need different PyTorch versions in some cases, and sharing venvs across tools is fragile.
### Launch command
```bash
cd ~/ComfyUI
HSA_OVERRIDE_GFX_VERSION=12.0.0 ~/ComfyUI/venv/bin/python main.py --listen
```
Then open `http://localhost:8188`.
`HSA_OVERRIDE_GFX_VERSION=12.0.0` is required for some operations. Without it, some ROCm ops may not target the RDNA4 instruction set correctly, causing errors or silent CPU fallback.
### Model format: diffusers vs single-file
**This trips everyone up at least once.**
ai-toolkit downloads and uses SDXL in **diffusers format** — a folder structure with separate `unet/`, `vae/`, `text_encoder/`, `text_encoder_2/` subfolders.
ComfyUI requires a **single merged `.safetensors` file** (e.g. `sd_xl_base_1.0.safetensors`).
The weights are identical — just packaged differently. You cannot point ComfyUI at your training model folder. Download the single-file version separately:
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
huggingface-cli download stabilityai/stable-diffusion-xl-base-1.0 \
sd_xl_base_1.0.safetensors \
--local-dir ~/ComfyUI/models/checkpoints/
```
This is ~6.5GB. For Flux, `flux1-dev.safetensors` (the ComfyUI single-file) and `ae.safetensors` (VAE) are already in the download and can be symlinked directly. **Catch:** the T5 text encoder is stored sharded across two files in the HuggingFace download — ComfyUI needs a single merged file. See the ComfyUI Flux Setup section for the merge script and an fp8 alternative.
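A quick way to tell the two layouts apart before pointing a tool at a path is to look for `model_index.json`, which diffusers pipeline folders carry at their top level. The helper below is a small sketch; `model_format` is a hypothetical name, and the check is a heuristic rather than a full validation of the model.

```python
from pathlib import Path

def model_format(path):
    """Classify a model path as 'diffusers', 'single-file', or 'unknown'."""
    p = Path(path).expanduser()
    if p.is_dir() and (p / "model_index.json").is_file():
        return "diffusers"      # folder layout used for training
    if p.is_file() and p.suffix == ".safetensors":
        return "single-file"    # merged checkpoint of the kind ComfyUI loads
    return "unknown"
```

With the paths used in this guide, `model_format("~/models/sdxl")` should report `diffusers`, and the file under `~/ComfyUI/models/checkpoints/` should report `single-file`.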
### Workflow JSON format
ComfyUI 0.21.1 uses a specific flat JSON format for workflows. The blueprint files in `~/ComfyUI/blueprints/` use a different subgraph format — do not use those as a template for manually-authored workflows.
Classic flat format structure:
```json
{
  "nodes": [ { "id": 1, "type": "CheckpointLoaderSimple", ... }, ... ],
  "links": [
    [link_id, from_node_id, from_slot_index, to_node_id, to_slot_index, "TYPE"],
    ...
  ],
  "version": 0.4
}
```
Links are arrays, not objects. Each link: `[id, source_node, source_slot, dest_node, dest_slot, "TYPENAME"]`.
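When authoring a workflow by hand, it can help to check the link arrays before loading the file. The sketch below reads the six positional fields described above into named records and confirms that both ends of each link refer to an existing node. `Link` and `parse_links` are hypothetical names, and the check covers only the fields shown here, not ComfyUI's full schema.

```python
from collections import namedtuple

Link = namedtuple("Link", "id source_node source_slot dest_node dest_slot type")

def parse_links(workflow):
    """Turn each 6-element link array into a named record, validating node ids."""
    node_ids = {node["id"] for node in workflow["nodes"]}
    links = []
    for raw in workflow["links"]:
        if len(raw) != 6:
            raise ValueError(f"link must have 6 fields, got {raw!r}")
        link = Link(*raw)
        if link.source_node not in node_ids or link.dest_node not in node_ids:
            raise ValueError(f"link {link.id} references a missing node")
        links.append(link)
    return links
```

Load the file with `json.load` and pass the resulting dict to `parse_links`; a `ValueError` points at the first malformed link.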
### SDXL workflow
SDXL uses `CheckpointLoaderSimple` — one node loads the entire model from one file. Simpler than the Flux multi-loader setup.
Node graph:
- **CheckpointLoaderSimple** → loads `sd_xl_base_1.0.safetensors`
- **LoraLoader** → applies trained LoRA (strength 1.0)
- **CLIPTextEncode** (×2) → positive prompt + negative prompt
- **KSampler** → sampling loop
- **VAEDecode** → latent → pixel image
- **SaveImage** → saves to `~/ComfyUI/output/`
Symlink the LoRA output into ComfyUI (replace `sdxl_ohwx_man` with the `name` from your training config):
```bash
ln -s ~/ai-toolkit-amd-rocm-support/output/sdxl_ohwx_man/sdxl_ohwx_man.safetensors \
~/ComfyUI/models/loras/sdxl_ohwx_man.safetensors
```
Working settings for portrait generation on RX 9060 XT:
- Resolution: 832×1216 (matches the 1024-bucket training resolution)
- Steps: 30, CFG: 7.0, sampler: dpmpp_2m, scheduler: karras
- LoRA strength: 1.0
- Positive: `portrait photo of a man, smiling, outdoor park, natural light, bokeh background, sharp focus, photorealistic` — if you added a trigger word (Option 1 above), prepend it here
- Negative: `bad teeth, broken teeth, missing teeth, gaps in teeth, dental artifacts, blurry, watermark`
The node graph above is the complete workflow: wire it up in ComfyUI or save it as a JSON to reuse.
---
## Disk Space Reality Check
Before moving on to Flux — which adds another 54GB — here is the full storage picture after SDXL setup:
| Component | Size |
|-----------|------|
| ROCm 7.2.3 | 22GB |
| JoyCaption Beta One | 16GB |
| SDXL (diffusers format, for training) | 6.7GB |
| SDXL (single-file, for ComfyUI) | 6.5GB |
| SDXL LoRA output | ~150MB |
| Training venv (PyTorch + deps) | ~16GB |
| ComfyUI venv (PyTorch + deps) | ~16GB |
| ai-toolkit code | 1.2GB |
| ComfyUI code | ~130MB |
| **Total** | **~85GB** |
PyTorch alone is 28GB — 14GB per venv, downloaded twice because the two tools need separate environments. SDXL is downloaded twice in different formats.
Flux.1 Dev adds **54GB** on disk — not ~34GB as commonly estimated. The HuggingFace repo contains the transformer weights **twice in different formats**:
- `flux1-dev.safetensors` ~23.8GB — single-file format (ComfyUI)
- `transformer/diffusion_pytorch_model-*` ~23GB — diffusers format (training)
- T5 text encoder ~9.5GB
- CLIP, VAE, ae.safetensors ~0.8GB
The upside: the transformer and VAE are ready for both training and generation from one download — no separate 23GB checkpoint needed like SDXL. **Catch:** the T5 text encoder is sharded across two files — ComfyUI needs a single merged file. See the ComfyUI Flux Setup section for the merge script.
Plan for **~130GB+** total if you want both SDXL and Flux training and generation.
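Before starting a large download, the free space can be checked from Python with the standard library. This is a rough sketch: the two constants are copied from the figures above, while `free_gb`, `has_room`, and the 10GB safety margin are illustrative assumptions.

```python
import shutil

# Sizes in GB taken from the table and the Flux breakdown above.
SDXL_SETUP_GB = 85
FLUX_GB = 54

def free_gb(path="."):
    """Free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def has_room(required_gb, path=".", margin_gb=10):
    """True if the required size plus a safety margin fits in the free space."""
    return free_gb(path) >= required_gb + margin_gb
```

For example, `has_room(FLUX_GB, os.path.expanduser("~"))` (with `import os`) indicates whether the Flux download is likely to fit on the home partition once the SDXL setup is already in place.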
---
## Flux Model Download
Flux.1 Dev requires a HuggingFace account and license agreement (free). Accept the license at `black-forest-labs/FLUX.1-dev` on HuggingFace, then:
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
huggingface-cli download black-forest-labs/FLUX.1-dev \
--local-dir ~/models/flux/
```
This downloads ~54GB — not ~34GB as commonly estimated. The repo contains the transformer weights twice: `flux1-dev.safetensors` (~23GB, single-file for ComfyUI) and `transformer/` (~23GB, diffusers format for training), plus T5 (~9GB), CLIP, VAE, and ae.safetensors (~0.8GB). Both formats are needed; one download covers training and the transformer/VAE for generation. See the T5 catch below in the ComfyUI section.
---
## Flux Training on 16GB VRAM
The cupertinomiranda fork states 24GB minimum for Flux. This is based on loading the transformer in bf16 (~24GB alone). With quantization it fits comfortably on 16GB.
**VRAM is determined by bucket resolution, not photo count.** Training tools group images by aspect ratio into resolution buckets (e.g. 512×768, 768×512). Each training step processes one bucket at a time. The VRAM cost per step depends entirely on the pixel dimensions of that bucket — a dataset with 5 photos and one with 500 photos use identical VRAM per step if their resolution buckets are the same. This matters because some guides suggest reducing photo count to fix OOM — it doesn't help. The right lever is resolution.
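To see why resolution is the lever, it helps to count the image tokens the transformer processes per step. The arithmetic below assumes Flux's usual layout, which is not stated in this guide: an 8× VAE downsample followed by 2×2 latent patching, giving one token per 16×16 pixel block. `flux_tokens` is a hypothetical helper name.

```python
def flux_tokens(width, height, vae_factor=8, patch=2):
    """Number of image tokens the transformer sees for one image."""
    step = vae_factor * patch  # 16 pixels per token along each axis
    if width % step or height % step:
        raise ValueError("dimensions must be multiples of 16")
    return (width // step) * (height // step)

small = flux_tokens(512, 768)    # 32 * 48 = 1536 tokens
large = flux_tokens(832, 1216)   # 52 * 76 = 3952 tokens
ratio = large / small            # about 2.57
```

Under these assumptions the 832×1216 bucket carries about 2.57 times as many tokens as 512×768, and activation memory grows with the token count regardless of how many photos are in the dataset.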
**VRAM by quantization level:**
| Mode | Transformer | Total floor | Fits 16GB for training? |
|------|-------------|------------|-------------------------|
| bf16 | ~24GB | ~30GB+ | No |
| qfloat8/uint8 (8-bit) | ~12GB | ~14.87GB | No — only ~1GB for activations |
| uint4 torchao (4-bit) | ~6GB | ~7-8GB | Yes — with [512, 768] resolution |
Note: 8-bit sounds like it should fit on 16GB but doesn't. The 14.87GB floor leaves ~1GB for training activations, which is not enough for a Flux forward+backward pass. 4-bit is required. At 1024px training resolution, even 4-bit OOMs on forward/backward — use [512, 768] max resolution.
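The transformer column in the table follows from simple arithmetic on the parameter count. The sketch below assumes roughly 12 billion transformer parameters (the published figure for FLUX.1-dev, not stated in this guide) and ignores quantization metadata such as scales; `weight_gb` is a hypothetical helper name.

```python
FLUX_PARAMS = 12e9   # approximate transformer parameter count (assumption)
CARD_GB = 16

def weight_gb(params, bits):
    """Storage for `params` weights at `bits` bits each, in GB (10^9 bytes)."""
    return params * bits / 8 / 1e9

for name, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>5}: ~{weight_gb(FLUX_PARAMS, bits):.0f}GB")

# Headroom left on the card at the 8-bit floor measured above.
headroom_8bit = CARD_GB - 14.87
```

The loop reproduces the ~24GB, ~12GB, and ~6GB weight sizes, and the last line shows why 8-bit fails in practice: about 1.1GB remains for everything that is not a stored weight.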
The HuggingFace QLoRA blog documents ~9-10GB peak VRAM with 4-bit quantization on FLUX.1-dev. Multiple Civitai guides confirm Flux LoRA training on RTX 3060 (12GB), so 16GB is not a concern once quantization is enabled.
Save as `~/ai-toolkit-amd-rocm-support/config/train_flux_full.yaml`:
```yaml
job: extension
config:
  name: "[your-lora-name]"  # determines output folder name and LoRA filename
  process:
    - type: 'sd_trainer'
      training_folder: "output"
      device: cuda:0
      network:
        type: "lora"
        linear: 16  # rank 16: Flux needs less rank than SDXL's 32 for equivalent quality
        linear_alpha: 16
      save:
        dtype: float16
        save_every: 250
        max_step_saves_to_keep: 4
      datasets:
        - folder_path: "/path/to/your/photos"
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          shuffle_tokens: false
          cache_latents_to_disk: true
          cache_text_embeddings: true  # Flux only: T5 encodes captions once then fully unloads;
                                       # without this: training uses blank prompts, captions ignored
          resolution: [512, 768]       # Flux only: SDXL ran fine at [512, 1024]; Flux OOMs at the
                                       # 832×1216 bucket even with uint4 (bf16 dequantization at compute time)
          num_workers: 0               # Flux only: workers fork and inherit T5's ~15GB CPU footprint;
                                       # 2 workers × 15GB + main process = OOM on 32GB. SDXL has no such issue.
      train:
        batch_size: 1
        steps: 1500
        gradient_accumulation_steps: 1
        train_text_encoder: false     # mandatory for Flux (T5 is 9.5GB); was optional for SDXL (CLIP is ~500MB)
        unload_text_encoder: true     # Flux only: keeps T5 off GPU during training loop
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"  # Flux only: SDXL uses "ddpm"
        optimizer: "adamw8bit"
        lr: 1e-4
        disable_sampling: true
        dtype: bf16
      model:
        name_or_path: "~/models/flux"
        is_flux: true
        quantize: true   # not needed for SDXL; mandatory for Flux (24GB transformer)
        qtype: "uint4"   # torchao uint4, ROCm compatible. qint4 (optimum.quanto) is CUDA-only, won't work.
        low_vram: true
meta:
  name: "[your-lora-name]"
  version: '1.0'
```
**Use 4-bit (uint4 via torchao).** 8-bit (qfloat8) does not fit on 16GB for training — the model floor is 14.87GB leaving only ~1GB for activations. 4-bit reduces stored weight size to ~6GB.
**Important caveat:** uint4 means weights are *stored* in 4-bit, but they are dequantized to bf16 on the fly during the forward and backward pass. Activations, intermediate tensors, and gradients are still bf16. Compute-time VRAM is therefore higher than storage size suggests — if you include 1024 in the resolution list, the resulting 832×1216 bucket will still OOM even with uint4. This is why `[512, 768]` is recommended: it eliminates that bucket entirely.
**Important: qint4 (optimum.quanto) does NOT work on ROCm.** It uses TinyGEMM packing (`torch._convert_weight_to_int4pack`) which is a CUDA-only kernel. Use `qtype: "uint4"` (torchao) instead — confirmed working on gfx1200.
**Text encoders:** `train_text_encoder: false` is mandatory. Use `cache_text_embeddings: true` so T5 encodes all captions in a one-time caching pass, saves the embeddings to disk, then fully unloads from VRAM before training starts.
**Why `unload_text_encoder: true` is required:**
Without it, `get_train_sd_device_state_preset()` sets `text_encoder.device = cuda:0` even when `train_text_encoder: false` — meaning T5 gets moved to GPU at the start of the training loop, not just during model loading. This is a non-obvious flag that the fork does not set automatically.
**Required code patches to the cupertinomiranda fork:**
The fork's `low_vram: true` flag only affects transformer quantization — it does not prevent T5 (~9.5GB) from being loaded to GPU during model initialization. Five patches are needed:
**Patch 1 — `toolkit/stable_diffusion_model.py` ~line 795** (T5 initial load):
```python
# Before:
text_encoder_2.to(self.device_torch, dtype=dtype)
# After:
if not self.low_vram:
    text_encoder_2.to(self.device_torch, dtype=dtype)
```
**Patch 2 — `toolkit/stable_diffusion_model.py` ~line 838** (T5 move during pipe preparation):
```python
# Before:
text_encoder[1].to(self.device_torch)
# After:
if not self.low_vram:
    text_encoder[1].to(self.device_torch)
```
**Patch 3 — `toolkit/train_tools.py` ~line 564**
linear_alpha: 16
save:
dtype: float16
save_every: 250
max_step_saves_to_keep: 4
datasets:
- folder_path: "/path/to/your/photos"
caption_ext: "txt"
caption_dropout_rate: 0.05
shuffle_tokens: false
cache_latents_to_disk: true
cache_text_embeddings: true # Flux only — T5 encodes captions once then fully unloads;
# without this: training uses blank prompts, captions ignored
resolution: [512, 768] # Flux only — SDXL ran fine at [512, 1024]; Flux OOMs at the
# 832×1216 bucket even with uint4 (bf16 dequantization at compute time)
num_workers: 0 # Flux only — workers fork and inherit T5's ~15GB CPU footprint;
# 2 workers × 15GB + main process = OOM on 32GB. SDXL has no such issue.
train:
batch_size: 1
steps: 1500
gradient_accumulation_steps: 1
train_text_encoder: false # mandatory for Flux (T5 is 9.5GB); was optional for SDXL (CLIP is ~500MB)
unload_text_encoder: true # Flux only — keeps T5 off GPU during training loop
gradient_checkpointing: true
noise_scheduler: "flowmatch" # Flux only — SDXL uses "ddpm"
optimizer: "adamw8bit"
lr: 1e-4
disable_sampling: true
dtype: bf16
model:
name_or_path: "~/models/flux"
is_flux: true
quantize: true # not needed for SDXL; mandatory for Flux (24GB transformer)
qtype: "uint4" # torchao uint4 — ROCm compatible. qint4 (optimum.quanto) is CUDA-only, won't work.
low_vram: true
meta:
name: "[your-lora-name]"
version: '1.0'
```
**Use 4-bit (uint4 via torchao).** 8-bit (qfloat8) does not fit on 16GB for training — the model floor is 14.87GB leaving only ~1GB for activations. 4-bit reduces stored weight size to ~6GB.
**Important caveat:** uint4 means weights are *stored* in 4-bit, but they are dequantized to bf16 on the fly during the forward and backward pass. Activations, intermediate tensors, and gradients are still bf16. Compute-time VRAM is therefore higher than storage size suggests — if you include 1024 in the resolution list, the resulting 832×1216 bucket will still OOM even with uint4. This is why `[512, 768]` is recommended: it eliminates that bucket entirely.
**Important: qint4 (optimum.quanto) does NOT work on ROCm.** It uses TinyGEMM packing (`torch._convert_weight_to_int4pack`) which is a CUDA-only kernel. Use `qtype: "uint4"` (torchao) instead — confirmed working on gfx1200.
**Text encoders:** `train_text_encoder: false` is mandatory. Use `cache_text_embeddings: true` so T5 encodes all captions in a one-time caching pass, saves the embeddings to disk, then fully unloads from VRAM before training starts.
**Why `unload_text_encoder: true` is required:**
Without it, `get_train_sd_device_state_preset()` sets `text_encoder.device = cuda:0` even when `train_text_encoder: false` — meaning T5 gets moved to GPU at the start of the training loop, not just during model loading. This is a non-obvious flag that the fork does not set automatically.
**Required code patches to the cupertinomiranda fork:**
The fork's `low_vram: true` flag only affects transformer quantization — it does not prevent T5 (~9.5GB) from being loaded to GPU during model initialization. Five patches are needed:
**Patch 1 — `toolkit/stable_diffusion_model.py` ~line 795** (T5 initial load):
```python
# Before:
text_encoder_2.to(self.device_torch, dtype=dtype)
# After:
if not self.low_vram:
    text_encoder_2.to(self.device_torch, dtype=dtype)
```
**Patch 2 — `toolkit/stable_diffusion_model.py` ~line 838** (T5 move during pipe preparation):
```python
# Before:
text_encoder[1].to(self.device_torch)
# After:
if not self.low_vram:
    text_encoder[1].to(self.device_torch)
```
**Patch 3 — `toolkit/train_tools.py` ~line 564** (device mismatch when T5 is on CPU):
```python
# Before:
prompt_embeds = text_encoder[1](text_input_ids.to(device), output_hidden_states=False)[0]
# After:
t5_device = next(text_encoder[1].parameters()).device
prompt_embeds = text_encoder[1](text_input_ids.to(t5_device), output_hidden_states=False)[0]
```
**Patch 4 — `extensions_built_in/sd_trainer/SDTrainer.py` ~line 317** (T5 moved to GPU for embedding caching before unload):
```python
# Before:
self.sd.text_encoder_to(self.device_torch)
# After:
if getattr(self.sd, 'low_vram', False) and isinstance(self.sd.text_encoder, list):
    self.sd.text_encoder[0].to(self.device_torch)
else:
    self.sd.text_encoder_to(self.device_torch)
```
With `unload_text_encoder: true`, the code caches the text embeddings and then fully unloads T5 before training starts. Before caching, however, the stock code tries to move every text encoder to the GPU, which OOMs on T5. With this patch only CLIP (`text_encoder[0]`) moves to the GPU and T5 stays on CPU for the caching step. Patch 3 ensures `encode_prompt` works correctly with T5 on CPU.
**Patch 5 — `toolkit/data_loader.py` ~line 674** (DataLoader crashes when num_workers=0):
```python
# Before:
dataloader_kwargs['num_workers'] = dataset_config_list[0].num_workers
dataloader_kwargs['prefetch_factor'] = dataset_config_list[0].prefetch_factor
# After:
dataloader_kwargs['num_workers'] = dataset_config_list[0].num_workers
if dataloader_kwargs['num_workers'] > 0:
    dataloader_kwargs['prefetch_factor'] = dataset_config_list[0].prefetch_factor
```
The default `num_workers: 2` causes system RAM OOM — each worker forks the main process and inherits the full ~15GB RAM footprint (T5 on CPU). On 32GB: 2 workers × ~15GB = ~30GB + main process = OOM. The kernel OOM killer terminates the workers and can kill the terminal window. Setting `num_workers: 0` avoids forking entirely, but `prefetch_factor` must not be set when `num_workers=0` — hence this patch.
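The guard in Patch 5 and the RAM estimate behind it can be written as two small helpers. This is a sketch under stated assumptions: the ~15GB per-process footprint is the figure quoted above, and both function names are hypothetical rather than part of ai-toolkit.

```python
# Sketch of the Patch 5 guard plus the RAM estimate that motivates num_workers: 0.
# Assumptions: ~15 GB per process (T5 on CPU), as quoted above.
# Function names are hypothetical, not part of ai-toolkit.

def build_dataloader_kwargs(num_workers: int, prefetch_factor: int = 2) -> dict:
    """Only pass prefetch_factor when worker processes actually exist."""
    kwargs = {"num_workers": num_workers}
    if num_workers > 0:
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs

def estimated_ram_gb(num_workers: int, per_process_gb: float = 15.0) -> float:
    """Worst case: each forked worker ends up with its own copy of the footprint."""
    return (num_workers + 1) * per_process_gb

print(build_dataloader_kwargs(0))  # no prefetch_factor key
print(build_dataloader_kwargs(2))
print(estimated_ram_gb(2), "GB worst case with 2 workers (32 GB installed)")
print(estimated_ram_gb(0), "GB with num_workers: 0")
```

Forked workers share pages copy-on-write at first, so the worst case is not reached immediately, but on a 32GB machine there is no margin once the copies diverge.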
After all 5 patches, T5 runs on CPU for embedding caching (only happens once with `cache_text_embeddings: true`), then fully unloads before training starts. With `resolution: [512, 768]`, training runs with zero OOM skips — confirmed on 5 photos × 50 steps.
### Flux training confirmed working on gfx1200
After all 5 patches and the correct config flags:
- Transformer quantized and loaded (uint4 torchao) ✓
- T5 runs on CPU, encodes captions once, fully unloads ✓
- Training runs with zero OOM skips at [512, 768] resolution ✓
- Loss moves, gradient updates confirmed ✓
- Step speed: ~15 sec/step on RX 9060 XT
Full 1500-step run on 38 photos: ~7 hours, VRAM 13.7GB at step 1 → 14.4GB at step 1500, final loss 0.369.
**Training command (use this for actual training):**
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
systemd-inhibit --what=sleep:idle --who="LoRA training" --why="Training in progress" \
bash -c 'HSA_ENABLE_SDMA=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python run.py config/train_flux_full.yaml'
```
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` reduces memory fragmentation — not needed for SDXL but important for Flux where the quantized model sits close to the VRAM limit. For quick test runs without the sleep inhibitor:
```bash
HSA_ENABLE_SDMA=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python run.py config/train_flux_full.yaml
```
**iGPU display note:** KDE desktop on the training GPU wastes ~1.3GB VRAM (framebuffer). If your CPU has integrated graphics (Ryzen 5 5600G has Vega 7), plug the monitor into the motherboard output instead. No BIOS change needed — Linux detects both GPUs on boot and uses the iGPU for display automatically, freeing the full 16GB for training. Confirmed working on this setup. Caveat: automatic only on a full restart — after sleep/wake the system may revert to the discrete GPU for display; restart to recover.
**Auto-resume:** ai-toolkit automatically resumes from the latest checkpoint if training is interrupted. It reads step metadata from the safetensors files in the output folder and loads both the weights and the optimizer state — mathematically identical to never having stopped. If you kill the process (accidentally or on purpose),
just run the same command again. It will print `#### IMPORTANT RESUMING FROM step XXXX ####` and continue from there. For a 7-hour run this is essential — checkpoints save every 250 steps as configured in `save_every`.
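To see which checkpoint a resumed run would most likely pick up, you can list the output folder by step number. This is an illustration only, assuming checkpoints are named `<name>_<9-digit step>.safetensors` as in the LoRA symlink command in this guide. ai-toolkit's own resume logic reads metadata from the files and may differ, and `latest_checkpoint` is a hypothetical helper.

```python
# Illustration only: pick the newest checkpoint by the step number in its filename.
# Assumption: files are named <name>_<9-digit step>.safetensors.
# ai-toolkit's own resume logic may work differently.
import re
import tempfile
from pathlib import Path
from typing import Optional, Tuple

STEP_RE = re.compile(r"_(\d{9})\.safetensors$")

def latest_checkpoint(folder: Path) -> Optional[Tuple[int, Path]]:
    """Return (step, path) for the highest-numbered checkpoint, or None."""
    found = []
    for path in folder.glob("*.safetensors"):
        match = STEP_RE.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0], default=None)

# Small demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for step in (250, 500, 750):
        (Path(tmp) / f"demo_{step:09d}.safetensors").touch()
    print(latest_checkpoint(Path(tmp)))
```

With `save_every` at 250 steps and roughly 15 sec/step, an interruption costs at most about an hour of work (250 × 15 s ≈ 62 minutes).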
**Resolution tradeoff:** SDXL trained without issue at [512, 1024]. Flux cannot — the 1024 bucket (832×1216 / 1216×832) OOMs during the forward/backward pass even with uint4, because weights are dequantized to bf16 at compute time. Training at [512, 768] means the LoRA sees a maximum of 768px. Flux can still generate at 1024px or higher at inference time — the LoRA extrapolates. For portrait and social media use (viewed on phones at 1080px or less), the quality difference is negligible compared to the alternative of skipping ~30% of training batches due to OOM.
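A rough way to see why the 1024 bucket is so much heavier is to compare image token counts. This sketch assumes Flux maps each 16×16 pixel area to one token (8× VAE downsampling followed by 2×2 latent patches), and it uses a square 768×768 image as a stand-in for the 768 bucket, since the exact bucket shapes are not listed here.

```python
# Rough comparison of sequence lengths per resolution.
# Assumption: one Flux image token per 16x16 pixel area
# (8x VAE downsampling followed by 2x2 latent patches).

def flux_tokens(width: int, height: int) -> int:
    """Number of image tokens the transformer sees for one image."""
    return (width // 16) * (height // 16)

for w, h in ((768, 768), (832, 1216), (1024, 1024)):
    print(f"{w}x{h}: {flux_tokens(w, h)} image tokens")

ratio = flux_tokens(832, 1216) / flux_tokens(768, 768)
print(f"832x1216 carries about {ratio:.2f}x the tokens of 768x768")
```

Activation memory grows at least linearly with the token count, so the 832×1216 bucket needs well over one and a half times the activation memory of a 768px image, on a card that is already close to its limit.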
---
## ComfyUI Flux Setup
After training, you need to point ComfyUI at your Flux models. The HuggingFace download already has everything — just symlink rather than copy.
### Model symlinks
ComfyUI expects models in specific subdirectories under `~/ComfyUI/models/`. Create symlinks from those locations into `~/models/flux/`:
```bash
# Flux transformer (single-file, 23GB)
ln -s ~/models/flux/flux1-dev.safetensors ~/ComfyUI/models/diffusion_models/flux1-dev.safetensors
# VAE
ln -s ~/models/flux/ae.safetensors ~/ComfyUI/models/vae/ae.safetensors
# CLIP text encoder
ln -s ~/models/flux/text_encoder/model.safetensors ~/ComfyUI/models/clip/clip_l.safetensors
```
### T5 text encoder: merging shards
The HuggingFace Flux download stores T5 sharded across two files (`model-00001-of-00002.safetensors` and `model-00002-of-00002.safetensors` in `text_encoder_2/`). ComfyUI needs a single file. The merge is straightforward — the shards are the same format, just split by size, with no key remapping needed:
```python
import os
from safetensors.torch import load_file, save_file
home = os.path.expanduser("~")
shard1 = load_file(f"{home}/models/flux/text_encoder_2/model-00001-of-00002.safetensors")
shard2 = load_file(f"{home}/models/flux/text_encoder_2/model-00002-of-00002.safetensors")
merged = {**shard1, **shard2}
save_file(merged, f"{home}/ComfyUI/models/clip/t5xxl_fp16_merged.safetensors")
```
Result: 219 tensors, 9.5GB, keys in standard T5 format (`encoder.block.0.layer.0.SelfAttention.k.weight`). No key conflicts. Original shards are untouched — to revert: `rm ~/ComfyUI/models/clip/t5xxl_fp16_merged.safetensors`.
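If you would rather have the merge fail loudly than silently overwrite a repeated key, a variant with an explicit check looks like this. It is a sketch: `merge_shards` and `merge_t5` are hypothetical helpers, while `load_file` and `save_file` are the same safetensors functions used in the snippet above.

```python
# Variant of the merge with an explicit duplicate-key check.
# merge_shards and merge_t5 are hypothetical helpers; load_file and save_file
# are the safetensors functions used in the original snippet.
import os

def merge_shards(*shards: dict) -> dict:
    """Combine shard dicts, refusing to silently overwrite a repeated key."""
    merged: dict = {}
    for shard in shards:
        overlap = merged.keys() & shard.keys()
        if overlap:
            raise ValueError(f"duplicate tensor names: {sorted(overlap)[:5]}")
        merged.update(shard)
    return merged

def merge_t5(src_dir: str, out_path: str) -> int:
    """Load both T5 shards, merge them, write one file, return the tensor count."""
    from safetensors.torch import load_file, save_file  # imported lazily
    names = (
        "model-00001-of-00002.safetensors",
        "model-00002-of-00002.safetensors",
    )
    merged = merge_shards(*(load_file(os.path.join(src_dir, n)) for n in names))
    save_file(merged, out_path)
    return len(merged)
```

Calling `merge_t5(os.path.expanduser("~/models/flux/text_encoder_2"), os.path.expanduser("~/ComfyUI/models/clip/t5xxl_fp16_merged.safetensors"))` should return the tensor count, which you can compare against the 219 reported above.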
Alternative if you prefer not to merge: download the standalone `t5xxl_fp8_e4m3fn.safetensors` (~4.9GB, fp8 precision) from HuggingFace and place it in `~/ComfyUI/models/clip/`. Adjust the workflow to point to that filename.
### Workflow JSON
Flux uses a different node set from SDXL in ComfyUI. SDXL uses `CheckpointLoaderSimple` which loads everything from one file. Flux loads each component separately because the sources are separate files. The native node graph:
- **UNETLoader** → loads `flux1-dev.safetensors` (stored in bf16; ComfyUI quantizes to fp8_e4m3fn on load)
- **DualCLIPLoader** → loads `clip_l.safetensors` + `t5xxl_fp16_merged.safetensors`
- **VAELoader** → loads `ae.safetensors`
- **LoraLoader** → applies the trained LoRA to model and CLIP
- **CLIPTextEncode** → encodes the positive prompt
- **EmptyLatentImage** → creates the starting latent (1024×1024)
- **RandomNoise** → generates noise seed
- **BasicGuider** → combines model + conditioning (replaces CFGGuider for Flux)
- **KSamplerSelect** → selects sampler algorithm (euler)
- **BasicScheduler** → generates sigma schedule (simple, 25 steps)
- **SamplerCustomAdvanced** → runs the full sampling loop
- **VAEDecode** → latent → pixel image
- **SaveImage** → saves to `~/ComfyUI/output/`
No custom nodes required. The node graph above is the complete workflow.
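The list above names the nodes but not the links between them. The sketch below writes out the usual Flux wiring as a dependency map and checks that it forms a valid execution order. The edges are an assumption based on the common Flux layout, not taken from the author's workflow JSON.

```python
# Sketch of how the nodes listed above are typically wired for Flux.
# Assumption: these edges follow the usual Flux layout. They are not
# taken from the author's workflow JSON.
from graphlib import TopologicalSorter

# node -> set of nodes whose outputs it consumes
DEPENDS_ON = {
    "LoraLoader": {"UNETLoader", "DualCLIPLoader"},
    "CLIPTextEncode": {"LoraLoader"},
    "BasicGuider": {"LoraLoader", "CLIPTextEncode"},
    "BasicScheduler": {"LoraLoader"},
    "SamplerCustomAdvanced": {
        "RandomNoise", "BasicGuider", "KSamplerSelect",
        "BasicScheduler", "EmptyLatentImage",
    },
    "VAEDecode": {"SamplerCustomAdvanced", "VAELoader"},
    "SaveImage": {"VAEDecode"},
}

# static_order() raises CycleError if the wiring contains a loop.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(" -> ".join(order))
```

The LoRA sits between the loaders and everything that consumes the model or CLIP, so the prompt encoding, the guider and the scheduler all see the patched weights.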
After training completes, symlink the LoRA output (replace `[your-lora-name]` with the `name` from your training config):
```bash
ln -sf ~/ai-toolkit-amd-rocm-support/output/[your-lora-name]/[your-lora-name]_000001500.safetensors \
~/ComfyUI/models/loras/flux_portrait_lora.safetensors
```
(`-sf` forces the symlink
update — useful if you tested with an earlier checkpoint and are now pointing at the final one.)
### Generation speed
Flux is noticeably slower than SDXL in ComfyUI — 25 steps takes considerably longer due to the 23GB transformer size and fp8 dequantization at inference time.
---
## Face Restoration in ComfyUI
**Do not use ReActor on ROCm.** ReActor (`Gourieff/ComfyUI-ReActor`) uses ONNX Runtime for InsightFace face detection. The ROCm Execution Provider was removed from ORT 1.23 — on ROCm 7.1+ only the CPU EP is available via pip, so face detection runs on CPU.
**Use `facerestore_cf` instead** (https://github.com/mav-rik/facerestore_cf) — pure PyTorch, no ONNX Runtime, runs fully on GPU on ROCm.
### Install
```bash
cd ~/ComfyUI/custom_nodes
git clone https://github.com/mav-rik/facerestore_cf
source ~/ComfyUI/venv/bin/activate
pip install -r facerestore_cf/requirements.txt
```
**Watch out for `basicsr`** — an older package that breaks with modern PyTorch. If you get import errors after install: `pip uninstall basicsr`.
Restart ComfyUI to load the new nodes.
### Models
Download into `~/ComfyUI/models/facerestore_models/`:
```bash
# CodeFormer — better identity preservation, recommended for portraits (~359MB)
wget -P ~/ComfyUI/models/facerestore_models/ \
https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth
# GFPGAN v1.4 — faster, better for skin texture
wget -P ~/ComfyUI/models/facerestore_models/ \
https://github.com/TencentARC/GFPGAN/releases/download/v1.3.4/GFPGANv1.4.pth
```
Face detection models (RetinaFace) auto-download on first run.
### Workflow wiring
Place **after VAEDecode, before SaveImage**:
```
[sampler] → VAEDecode → FaceRestoreWithModel → SaveImage
```
For Flux the sampler node is `SamplerCustomAdvanced`; for SDXL it is `KSampler`.
---
## Summary: What Actually Works on RX 9060 XT (gfx1200) as of May 2026
| Task | Works? | Notes |
|------|--------|-------|
| ROCm 7.2.3 install | ✓ | Via amdgpu-install; manually add user to render/video groups |
| PyTorch 2.11.0+rocm7.2 | ✓ | Stable index only; nightly crashes |
| bitsandbytes (compiled) | ✓ | Must build from source with -DBNB_ROCM_ARCH=gfx1200 |
| JoyCaption captioning | ✓ | 3 bugs to fix (documented above); 5 sec/photo, 11.7GB VRAM |
| SDXL LoRA training | ✓ | 1500 steps in 76 min; 10GB VRAM peak; bf16 required |
| SDXL ComfyUI generation | ✓ | HSA_OVERRIDE_GFX_VERSION=12.0.0 required; dpmpp_2m karras, CFG 7.0 |
| Flux.1 Dev training (uint4) | ✓ | 5 code patches + cache_text_embeddings + num_workers: 0 + [512, 768] required; zero OOM skips confirmed; qint4 fails (CUDA-only) |
| Flux ComfyUI generation | ✓ | Symlinks + T5 shard merge confirmed working; slower than SDXL (expected) |
| Flux LoRA 1500-step training | ✓ | Completed — ~7 hours, 13.7–14.4GB VRAM, final loss 0.369, ~15 sec/step |
| WSL2 training | ✗ | DXG bridge bug (libthunk_proxy.a), unfixed as of May 2026 |
https://redd.it/1tgggrv
@rStableDiffusion
Lance by ByteDance: 3B Apache2 model for image and video understanding, generation, and editing
https://redd.it/1tgjrm2
@rStableDiffusion
I need this hairstyle as a prompt. No AI has managed to do it; they always give me different hair. Thanks in advance.
https://redd.it/1tgk08w
@rStableDiffusion