# rank 16 — Flux needs less rank than SDXL's 32 for equivalent quality
linear_alpha: 16
save:
dtype: float16
save_every: 250
max_step_saves_to_keep: 4
datasets:
- folder_path: "/path/to/your/photos"
caption_ext: "txt"
caption_dropout_rate: 0.05
shuffle_tokens: false
cache_latents_to_disk: true
cache_text_embeddings: true # Flux only — T5 encodes captions once then fully unloads;
# without this: training uses blank prompts, captions ignored
resolution: [512, 768] # Flux only — SDXL ran fine at [512, 1024]; Flux OOMs at the
# 832×1216 bucket even with uint4 (bf16 dequantization at compute time)
num_workers: 0 # Flux only — workers fork and inherit T5's ~15GB CPU footprint;
# 2 workers × 15GB + main process = OOM on 32GB. SDXL has no such issue.
train:
batch_size: 1
steps: 1500
gradient_accumulation_steps: 1
train_text_encoder: false # mandatory for Flux (T5 is 9.5GB); was optional for SDXL (CLIP is ~500MB)
unload_text_encoder: true # Flux only — keeps T5 off GPU during training loop
gradient_checkpointing: true
noise_scheduler: "flowmatch" # Flux only — SDXL uses "ddpm"
optimizer: "adamw8bit"
lr: 1e-4
disable_sampling: true
dtype: bf16
model:
name_or_path: "~/models/flux"
is_flux: true
quantize: true # not needed for SDXL; mandatory for Flux (24GB transformer)
qtype: "uint4" # torchao uint4 — ROCm compatible. qint4 (optimum.quanto) is CUDA-only, won't work.
low_vram: true
meta:
name: "[your-lora-name]"
version: '1.0'
```
**Use 4-bit (uint4 via torchao).** 8-bit (qfloat8) does not fit on 16GB for training — the model floor is 14.87GB leaving only ~1GB for activations. 4-bit reduces stored weight size to ~6GB.
**Important caveat:** uint4 means weights are *stored* in 4-bit, but they are dequantized to bf16 on the fly during the forward and backward pass. Activations, intermediate tensors, and gradients are still bf16. Compute-time VRAM is therefore higher than storage size suggests — if you include 1024 in the resolution list, the resulting 832×1216 bucket will still OOM even with uint4. This is why `[512, 768]` is recommended: it eliminates that bucket entirely.
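The storage figures above follow from simple arithmetic. A rough sketch, assuming the Flux.1 Dev transformer has about 12 billion parameters (an assumption not stated in this guide; quantization scales and any non-quantized layers are ignored):

```python
# Back-of-the-envelope weight storage for the Flux transformer.
# ASSUMPTION: ~12e9 parameters; scales/zero-points and unquantized layers are ignored.
PARAMS = 12e9
GB = 1e9  # decimal gigabytes

BITS_PER_WEIGHT = {"bf16": 16, "qfloat8": 8, "uint4": 4}


def weight_gb(bits: int, params: float = PARAMS) -> float:
    """Stored weight size in GB for a given bit width."""
    return params * bits / 8 / GB


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:8s} {weight_gb(bits):5.1f} GB")

# Headroom left on a 16GB card at the measured 8-bit floor quoted above.
headroom_8bit = 16.0 - 14.87
print(f"8-bit headroom: {headroom_8bit:.2f} GB")
```

Under that assumption the raw figures come out at roughly 24GB (bf16), 12GB (8-bit), and 6GB (4-bit), which lines up with the 24GB and ~6GB numbers in this guide. The measured 8-bit floor is higher than the raw 12GB because more than just the quantized weights sits in VRAM.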
**Important: qint4 (optimum.quanto) does NOT work on ROCm.** It uses TinyGEMM packing (`torch._convert_weight_to_int4pack`) which is a CUDA-only kernel. Use `qtype: "uint4"` (torchao) instead — confirmed working on gfx1200.
**Text encoders:** `train_text_encoder: false` is mandatory. Use `cache_text_embeddings: true` so T5 encodes all captions in a one-time caching pass, saves the embeddings to disk, then fully unloads from VRAM before training starts.
**Why `unload_text_encoder: true` is required:**
Without it, `get_train_sd_device_state_preset()` sets `text_encoder.device = cuda:0` even when `train_text_encoder: false` — meaning T5 gets moved to GPU at the start of the training loop, not just during model loading. This is a non-obvious flag that the fork does not set automatically.
**Required code patches to the cupertinomiranda fork:**
The fork's `low_vram: true` flag only affects transformer quantization — it does not prevent T5 (~9.5GB) from being loaded to GPU during model initialization. Five patches are needed:
**Patch 1 — `toolkit/stable_diffusion_model.py` ~line 795** (T5 initial load):
```python
# Before:
text_encoder_2.to(self.device_torch, dtype=dtype)
# After:
if not self.low_vram:
    text_encoder_2.to(self.device_torch, dtype=dtype)
```
**Patch 2 — `toolkit/stable_diffusion_model.py` ~line 838** (T5 move during pipe preparation):
```python
# Before:
text_encoder[1].to(self.device_torch)
# After:
if not self.low_vram:
    text_encoder[1].to(self.device_torch)
```
**Patch 3 — `toolkit/train_tools.py` ~line 564**
(device mismatch when T5 is on CPU):
```python
# Before:
prompt_embeds = text_encoder[1](text_input_ids.to(device), output_hidden_states=False)[0]
# After:
t5_device = next(text_encoder[1].parameters()).device
prompt_embeds = text_encoder[1](text_input_ids.to(t5_device), output_hidden_states=False)[0]
```
**Patch 4 — `extensions_built_in/sd_trainer/SDTrainer.py` ~line 317** (T5 moved to GPU for embedding caching before unload):
```python
# Before:
self.sd.text_encoder_to(self.device_torch)
# After:
# With low_vram, move only text_encoder[0] (CLIP) to the GPU; T5 (index 1) stays on CPU.
if getattr(self.sd, 'low_vram', False) and isinstance(self.sd.text_encoder, list):
    self.sd.text_encoder[0].to(self.device_torch)
else:
    self.sd.text_encoder_to(self.device_torch)
```
With `unload_text_encoder: true`, the code caches text embeddings and then fully unloads T5 before training starts. Before caching, however, the unpatched code tries to move T5 to the GPU, which triggers an OOM. This patch moves only the first text encoder (CLIP) to the GPU and leaves T5 on the CPU for the caching step. Patch 3 ensures `encode_prompt` works correctly with T5 on the CPU.
**Patch 5 — `toolkit/data_loader.py` ~line 674** (DataLoader crashes when num_workers=0):
```python
# Before:
dataloader_kwargs['num_workers'] = dataset_config_list[0].num_workers
dataloader_kwargs['prefetch_factor'] = dataset_config_list[0].prefetch_factor
# After:
dataloader_kwargs['num_workers'] = dataset_config_list[0].num_workers
if dataloader_kwargs['num_workers'] > 0:
    dataloader_kwargs['prefetch_factor'] = dataset_config_list[0].prefetch_factor
```
The default `num_workers: 2` causes system RAM OOM — each worker forks the main process and inherits the full ~15GB RAM footprint (T5 on CPU). On 32GB: 2 workers × ~15GB = ~30GB + main process = OOM. The kernel OOM killer terminates the workers and can kill the terminal window. Setting `num_workers: 0` avoids forking entirely, but `prefetch_factor` must not be set when `num_workers=0` — hence this patch.
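The RAM arithmetic above can be written out as a small estimate. This is a worst-case sketch that assumes every forked worker ends up holding its own full copy of the parent's ~15GB (copy-on-write can make the real number lower); `peak_ram_gb` is a hypothetical helper, not part of ai-toolkit:

```python
SYSTEM_RAM_GB = 32.0


def peak_ram_gb(num_workers: int, footprint_gb: float = 15.0) -> float:
    """Worst-case RAM: the main process plus one full copy per forked worker."""
    return (num_workers + 1) * footprint_gb


for workers in (2, 0):
    need = peak_ram_gb(workers)
    verdict = "OOM risk" if need > SYSTEM_RAM_GB else "fits"
    print(f"num_workers={workers}: ~{need:.0f} GB -> {verdict}")
```

With two workers the estimate is ~45GB against 32GB of system RAM; with zero workers only the main process's ~15GB remains.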
After all 5 patches, T5 runs on CPU for embedding caching (only happens once with `cache_text_embeddings: true`), then fully unloads before training starts. With `resolution: [512, 768]`, training runs with zero OOM skips — confirmed on 5 photos × 50 steps.
### Flux training confirmed working on gfx1200
After all 5 patches and the correct config flags:
- Transformer quantized and loaded (uint4 torchao) ✓
- T5 runs on CPU, encodes captions once, fully unloads ✓
- Training runs with zero OOM skips at [512, 768] resolution ✓
- Loss moves, gradient updates confirmed ✓
- Step speed: ~15 sec/step on RX 9060 XT
Full 1500-step run on 38 photos: ~7 hours, VRAM 13.7GB at step 1 → 14.4GB at step 1500, final loss 0.369.
**Training command (use this for actual training):**
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
systemd-inhibit --what=sleep:idle --who="LoRA training" --why="Training in progress" \
bash -c 'HSA_ENABLE_SDMA=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python run.py config/train_flux_full.yaml'
```
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` reduces memory fragmentation — not needed for SDXL but important for Flux where the quantized model sits close to the VRAM limit. For quick test runs without the sleep inhibitor:
```bash
HSA_ENABLE_SDMA=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python run.py config/train_flux_full.yaml
```
**iGPU display note:** KDE desktop on the training GPU wastes ~1.3GB VRAM (framebuffer). If your CPU has integrated graphics (Ryzen 5 5600G has Vega 7), plug the monitor into the motherboard output instead. No BIOS change needed — Linux detects both GPUs on boot and uses the iGPU for display automatically, freeing the full 16GB for training. Confirmed working on this setup. Caveat: automatic only on a full restart — after sleep/wake the system may revert to the discrete GPU for display; restart to recover.
**Auto-resume:** ai-toolkit automatically resumes from the latest checkpoint if training is interrupted. It reads step metadata from the safetensors files in the output folder and loads both the weights and the optimizer state — mathematically identical to never having stopped. If you kill the process (accidentally or on purpose),
just run the same command again. It will print `#### IMPORTANT RESUMING FROM step XXXX ####` and continue from there. For a 7-hour run this is essential — checkpoints save every 250 steps as configured in `save_every`.
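To see which checkpoint a resumed run would most likely start from, the output folder can be scanned for the highest step number. This is a minimal sketch, not ai-toolkit's own resume logic; it assumes checkpoints are named `<lora-name>_<zero-padded step>.safetensors`, as in the `_000001500.safetensors` example used later in this guide, and `latest_checkpoint` is a hypothetical helper:

```python
import re
from pathlib import Path
from typing import Optional

# ASSUMPTION: step checkpoints end in "_<digits>.safetensors".
STEP_RE = re.compile(r"_(\d+)\.safetensors$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if there is none."""
    best_step, best_path = -1, None
    for path in Path(output_dir).expanduser().glob("*.safetensors"):
        match = STEP_RE.search(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Calling `latest_checkpoint("~/ai-toolkit-amd-rocm-support/output/[your-lora-name]")` returns the newest step file, or `None` before the first save. Files without a step suffix are ignored.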
**Resolution tradeoff:** SDXL trained without issue at [512, 1024]. Flux cannot — the 1024 bucket (832×1216 / 1216×832) OOMs during the forward/backward pass even with uint4, because weights are dequantized to bf16 at compute time. Training at [512, 768] means the LoRA sees a maximum of 768px. Flux can still generate at 1024px or higher at inference time — the LoRA extrapolates. For portrait and social media use (viewed on phones at 1080px or less), the quality difference is negligible compared to the alternative of skipping ~30% of training batches due to OOM.
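One way to see why the 1024 bucket is so much heavier is to count image tokens per sample. The sketch below assumes the Flux VAE downsamples by 8x and that latents are packed into 2x2 patches, so one token covers a 16x16 pixel area; 768x768 is used as a stand-in for the 768 buckets. Activation memory does not scale with tokens alone, so treat the ratios as a rough guide:

```python
# ASSUMPTION: 8x VAE downsample and 2x2 latent packing, so 16 pixels per token side.
PIXELS_PER_TOKEN_SIDE = 16


def tokens(width: int, height: int) -> int:
    """Number of image tokens the transformer processes for one sample."""
    return (width // PIXELS_PER_TOKEN_SIDE) * (height // PIXELS_PER_TOKEN_SIDE)


buckets = {"512x512": (512, 512), "768x768": (768, 768), "832x1216": (832, 1216)}
base = tokens(768, 768)
for name, (w, h) in buckets.items():
    n = tokens(w, h)
    print(f"{name:9s} {n:5d} tokens ({n / base:.2f}x the 768 bucket)")
```

Under these assumptions the 832×1216 bucket carries 3952 tokens against 2304 for 768×768, about 1.7 times as many per training step.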
---
## ComfyUI Flux Setup
After training, you need to point ComfyUI at your Flux models. The HuggingFace download already has everything — just symlink rather than copy.
### Model symlinks
ComfyUI expects models in specific subdirectories under `~/ComfyUI/models/`. Create symlinks from those locations into `~/models/flux/`:
```bash
# Flux transformer (single-file, 23GB)
ln -s ~/models/flux/flux1-dev.safetensors ~/ComfyUI/models/diffusion_models/flux1-dev.safetensors
# VAE
ln -s ~/models/flux/ae.safetensors ~/ComfyUI/models/vae/ae.safetensors
# CLIP text encoder
ln -s ~/models/flux/text_encoder/model.safetensors ~/ComfyUI/models/clip/clip_l.safetensors
```
### T5 text encoder: merging shards
The HuggingFace Flux download stores T5 sharded across two files (`model-00001-of-00002.safetensors` and `model-00002-of-00002.safetensors` in `text_encoder_2/`). ComfyUI needs a single file. The merge is straightforward — the shards are the same format, just split by size, with no key remapping needed:
```python
import os
from safetensors.torch import load_file, save_file
home = os.path.expanduser("~")
shard1 = load_file(f"{home}/models/flux/text_encoder_2/model-00001-of-00002.safetensors")
shard2 = load_file(f"{home}/models/flux/text_encoder_2/model-00002-of-00002.safetensors")
merged = {**shard1, **shard2}
save_file(merged, f"{home}/ComfyUI/models/clip/t5xxl_fp16_merged.safetensors")
```
Result: 219 tensors, 9.5GB, keys in standard T5 format (`encoder.block.0.layer.0.SelfAttention.k.weight`). No key conflicts. Original shards are untouched — to revert: `rm ~/ComfyUI/models/clip/t5xxl_fp16_merged.safetensors`.
Alternative if you prefer not to merge: download the standalone `t5xxl_fp8_e4m3fn.safetensors` (~4.9GB, fp8 precision) from HuggingFace and place it in `~/ComfyUI/models/clip/`. Adjust the workflow to point to that filename.
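For the shard merge above, `{**shard1, **shard2}` silently lets the second shard win if a key appears twice. A slightly more defensive sketch is shown here on plain dictionaries so it runs without the model files; `merge_shards` is a hypothetical helper, and with real data you would pass it the two `load_file(...)` results:

```python
from typing import Dict, Iterable, TypeVar

T = TypeVar("T")


def merge_shards(shards: Iterable[Dict[str, T]]) -> Dict[str, T]:
    """Merge state-dict shards, refusing to silently overwrite a duplicate key."""
    merged: Dict[str, T] = {}
    for index, shard in enumerate(shards):
        overlap = merged.keys() & shard.keys()
        if overlap:
            raise ValueError(f"shard {index} repeats keys: {sorted(overlap)[:5]}")
        merged.update(shard)
    return merged


# Stand-in data; the integer values take the place of tensors.
demo = merge_shards([{"encoder.a": 1}, {"encoder.b": 2}])
print(len(demo), "tensors")
```

If the shards really are disjoint, as reported above, the result is identical to the one-line merge. The check only matters if a download is corrupted or mixed up.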
### Workflow JSON
Flux uses a different node set from SDXL in ComfyUI. SDXL uses `CheckpointLoaderSimple` which loads everything from one file. Flux loads each component separately because the sources are separate files. The native node graph:
- **UNETLoader** → loads `flux1-dev.safetensors` (stored in bf16; ComfyUI quantizes to fp8_e4m3fn on load)
- **DualCLIPLoader** → loads `clip_l.safetensors` + `t5xxl_fp16_merged.safetensors`
- **VAELoader** → loads `ae.safetensors`
- **LoraLoader** → applies the trained LoRA to model and CLIP
- **CLIPTextEncode** → encodes the positive prompt
- **EmptyLatentImage** → creates the starting latent (1024×1024)
- **RandomNoise** → generates noise seed
- **BasicGuider** → combines model + conditioning (replaces CFGGuider for Flux)
- **KSamplerSelect** → selects sampler algorithm (euler)
- **BasicScheduler** → generates sigma schedule (simple, 25 steps)
- **SamplerCustomAdvanced** → runs the full sampling loop
- **VAEDecode** → latent → pixel image
- **SaveImage** → saves to `~/ComfyUI/output/`
No custom nodes required. The node graph above is the complete workflow.
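The node list above can be written down as a ComfyUI API-format graph, where each link is a `[node_id, output_index]` pair. This is a sketch, not the author's workflow file: the input names reflect the stock nodes as I understand them, and the prompt text, seed, LoRA strengths, and filename prefix are placeholders. Compare it against a workflow exported from your own install before relying on it.

```python
import json

# ASSUMPTIONS: input names match the stock ComfyUI nodes; prompt, seed, strengths,
# and filename prefix are placeholders. File names are the ones used in this guide.
workflow = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "flux1-dev.safetensors", "weight_dtype": "fp8_e4m3fn"}},
    "2": {"class_type": "DualCLIPLoader",
          "inputs": {"clip_name1": "clip_l.safetensors",
                     "clip_name2": "t5xxl_fp16_merged.safetensors", "type": "flux"}},
    "3": {"class_type": "VAELoader", "inputs": {"vae_name": "ae.safetensors"}},
    "4": {"class_type": "LoraLoader",
          "inputs": {"model": ["1", 0], "clip": ["2", 0],
                     "lora_name": "flux_portrait_lora.safetensors",
                     "strength_model": 1.0, "strength_clip": 1.0}},
    "5": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["4", 1], "text": "portrait photo of a person"}},
    "6": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "7": {"class_type": "RandomNoise", "inputs": {"noise_seed": 0}},
    "8": {"class_type": "BasicGuider",
          "inputs": {"model": ["4", 0], "conditioning": ["5", 0]}},
    "9": {"class_type": "KSamplerSelect", "inputs": {"sampler_name": "euler"}},
    "10": {"class_type": "BasicScheduler",
           "inputs": {"model": ["4", 0], "scheduler": "simple",
                      "steps": 25, "denoise": 1.0}},
    "11": {"class_type": "SamplerCustomAdvanced",
           "inputs": {"noise": ["7", 0], "guider": ["8", 0], "sampler": ["9", 0],
                      "sigmas": ["10", 0], "latent_image": ["6", 0]}},
    "12": {"class_type": "VAEDecode", "inputs": {"samples": ["11", 0], "vae": ["3", 0]}},
    "13": {"class_type": "SaveImage",
           "inputs": {"images": ["12", 0], "filename_prefix": "flux_lora"}},
}


def dangling_links(graph: dict) -> list:
    """Return (node_id, input_name) pairs whose link points at a missing node."""
    return [(nid, key) for nid, node in graph.items()
            for key, value in node["inputs"].items()
            if isinstance(value, list) and value[0] not in graph]


assert not dangling_links(workflow)
payload = json.dumps(workflow, indent=2)  # text to save as a .json file
print(f"{len(workflow)} nodes, {len(payload)} bytes of JSON")
```

The `dangling_links` check only confirms the graph is internally consistent. It does not confirm that ComfyUI accepts every input name.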
After training completes, symlink the LoRA output (replace `[your-lora-name]` with the `name` from your training config):
```bash
ln -sf ~/ai-toolkit-amd-rocm-support/output/[your-lora-name]/[your-lora-name]_000001500.safetensors \
~/ComfyUI/models/loras/flux_portrait_lora.safetensors
```
(`-sf` forces the symlink
update — useful if you tested with an earlier checkpoint and are now pointing at the final one.)
### Generation speed
Flux is noticeably slower than SDXL in ComfyUI — 25 steps takes considerably longer due to the 23GB transformer size and fp8 dequantization at inference time.
---
## Face Restoration in ComfyUI
**Do not use ReActor on ROCm.** ReActor (`Gourieff/ComfyUI-ReActor`) uses ONNX Runtime for InsightFace face detection. The ROCm Execution Provider was removed from ORT 1.23 — on ROCm 7.1+ only the CPU EP is available via pip, so face detection runs on CPU.
**Use `facerestore_cf` instead** (https://github.com/mav-rik/facerestore_cf) — pure PyTorch, no ONNX Runtime, runs fully on GPU on ROCm.
### Install
```bash
cd ~/ComfyUI/custom_nodes
git clone https://github.com/mav-rik/facerestore_cf
source ~/ComfyUI/venv/bin/activate
pip install -r facerestore_cf/requirements.txt
```
**Watch out for `basicsr`** — an older package that breaks with modern PyTorch. If you get import errors after install: `pip uninstall basicsr`.
Restart ComfyUI to load the new nodes.
### Models
Download into `~/ComfyUI/models/facerestore_models/`:
```bash
# CodeFormer — better identity preservation, recommended for portraits (~359MB)
wget -P ~/ComfyUI/models/facerestore_models/ \
https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth
# GFPGAN v1.4 — faster, better for skin texture
wget -P ~/ComfyUI/models/facerestore_models/ \
https://github.com/TencentARC/GFPGAN/releases/download/v1.3.4/GFPGANv1.4.pth
```
Face detection models (RetinaFace) auto-download on first run.
### Workflow wiring
Place **after VAEDecode, before SaveImage**:
```
[sampler] → VAEDecode → FaceRestoreWithModel → SaveImage
```
For Flux the sampler node is `SamplerCustomAdvanced`; for SDXL it is `KSampler`.
---
## Summary: What Actually Works on RX 9060 XT (gfx1200) as of May 2026
| Task | Works? | Notes |
|------|--------|-------|
| ROCm 7.2.3 install | ✓ | Via amdgpu-install; manually add user to render/video groups |
| PyTorch 2.11.0+rocm7.2 | ✓ | Stable index only; nightly crashes |
| bitsandbytes (compiled) | ✓ | Must build from source with -DBNB_ROCM_ARCH=gfx1200 |
| JoyCaption captioning | ✓ | 3 bugs to fix (documented above); 5 sec/photo, 11.7GB VRAM |
| SDXL LoRA training | ✓ | 1500 steps in 76 min; 10GB VRAM peak; bf16 required |
| SDXL ComfyUI generation | ✓ | HSA_OVERRIDE_GFX_VERSION=12.0.0 required; dpmpp_2m karras, CFG 7.0 |
| Flux.1 Dev training (uint4) | ✓ | 5 code patches + cache_text_embeddings + num_workers: 0 + [512, 768] required; zero OOM skips confirmed; qint4 fails (CUDA-only) |
| Flux ComfyUI generation | ✓ | Symlinks + T5 shard merge confirmed working; slower than SDXL (expected) |
| Flux LoRA 1500-step training | ✓ | Completed — ~7 hours, 13.7–14.4GB VRAM, final loss 0.369, ~15 sec/step |
| WSL2 training | ✗ | DXG bridge bug (libthunk_proxy.a), unfixed as of May 2026 |
https://redd.it/1tgggrv
@rStableDiffusion