
text_encoder/model.fp16.safetensors text_encoder/model.safetensors
ln -sf text_encoder_2/model.fp16.safetensors text_encoder_2/model.safetensors
```

Without these symlinks, the pipeline load fails with a missing file error.
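If you would rather script this step, here is a minimal Python sketch of the same idea. The function name `link_fp16_weights` and the default subfolder list are my own choices for illustration, not part of any toolkit. It links by bare file name, because a relative symlink target is resolved against the directory the link lives in. Unlike `ln -sf`, it leaves a real (non-symlink) `model.safetensors` untouched.

```python
from pathlib import Path

def link_fp16_weights(model_dir, subfolders=("text_encoder", "text_encoder_2")):
    """Create model.safetensors -> model.fp16.safetensors symlinks where possible."""
    created = []
    for sub in subfolders:
        folder = Path(model_dir) / sub
        source = folder / "model.fp16.safetensors"
        link = folder / "model.safetensors"
        if not source.is_file():
            continue  # no fp16 weights in this subfolder, nothing to link
        if link.is_symlink():
            link.unlink()  # replace a stale link
        elif link.exists():
            continue  # a real weights file is already there, leave it alone
        link.symlink_to(source.name)  # relative target, resolved inside `folder`
        created.append(link)
    return created
```

Call it with the model's root folder, e.g. `link_fp16_weights("/path/to/model")`; it returns the links it created.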

---

## GPU Monitoring

`rocm-smi` works on native Linux (unlike WSL2 where it was broken):

```bash
watch -n1 rocm-smi # text monitor, refreshes every second
radeontop # AMD-specific graphical TUI — recommended
```

**Do not use nvtop 3.0.2** — it crashes on this ROCm/AMD setup. Use radeontop instead.

If your system has both a discrete GPU and an integrated GPU (e.g. Ryzen with Vega iGPU), radeontop defaults to bus 0 which may be the iGPU. Find your discrete GPU's PCI bus number with `lspci | grep -Ei 'vga|display|3d'` (the bus is the two hex digits before the first colon, so `03:00.0` is bus `03`) and pass it with `-b`: `radeontop -b 03` (the number varies by system).
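The PCI bus numbers can also be pulled out of `lspci` output programmatically. The sketch below is only an illustration: `list_gpu_buses` is a made-up helper name, and it assumes the usual `bus:device.function Class: description` line layout that `lspci` prints by default.

```python
import re

# Matches lines such as "03:00.0 VGA compatible controller: <description>",
# with an optional leading PCI domain ("0000:").
_GPU_LINE = re.compile(
    r"^(?:[0-9a-f]{4}:)?([0-9a-f]{2}):[0-9a-f]{2}\.[0-7]\s+"
    r"(VGA compatible controller|Display controller|3D controller):\s*(.+)$",
    re.IGNORECASE,
)

def list_gpu_buses(lspci_output):
    """Return (bus, description) pairs for every GPU-like device in lspci text."""
    gpus = []
    for line in lspci_output.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            gpus.append((match.group(1).lower(), match.group(3)))
    return gpus

# Usage (needs lspci installed):
#   import subprocess
#   text = subprocess.run(["lspci"], capture_output=True, text=True).stdout
#   for bus, description in list_gpu_buses(text):
#       print(f"radeontop -b {bus}  # {description}")
```

The description column usually makes it obvious which entry is the discrete card.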

---

## Photo Captioning with JoyCaption

**JoyCaption Beta One** (`fancyfeast/llama-joycaption-beta-one-hf-llava`) produces high-quality captions specifically designed for LoRA training. It's a Llama 3.1 base with a SigLIP vision encoder.

Download (~16GB):
```bash
source ~/ai-toolkit-amd-rocm-support/venv/bin/activate
huggingface-cli download fancyfeast/llama-joycaption-beta-one-hf-llava \
    --local-dir ~/models/joycaption/
```

Performance on RX 9060 XT: ~5 sec/photo, ~82% GPU load, ~11.7GB VRAM peak.

### Three bugs to know about

**Bug 1: Use local path, not HF repo ID**

```python
import os

# Wrong: re-downloads 16GB from HuggingFace every run
MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"

# Correct: point at the local download directory
MODEL_NAME = os.path.expanduser("~/models/joycaption")
```
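A cheap guard is to check that the local folder actually looks like a downloaded model before loading it. This is a sketch, not something from the JoyCaption scripts: `resolve_local_model` is a hypothetical helper, and it only assumes that the download contains a `config.json`, as Hugging Face model folders normally do.

```python
import os

def resolve_local_model(path):
    """Expand ~ and confirm the folder looks like a downloaded model."""
    model_dir = os.path.expanduser(path)
    config = os.path.join(model_dir, "config.json")
    if not os.path.isfile(config):
        raise FileNotFoundError(
            f"No config.json in {model_dir}; run the download step first"
        )
    return model_dir

# Example: MODEL_NAME = resolve_local_model("~/models/joycaption")
```

Passing `local_files_only=True` to `from_pretrained` is another way to make sure nothing is fetched from the network.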

**Bug 2: apply_chat_template with multimodal list content fails**

The Jinja2 sandbox in this version of transformers cannot call `.replace()` on list content. The multimodal format `[{"type": "image"}, {"type": "text", ...}]` throws:
```
UndefinedError: 'list object' has no attribute 'replace'
```

Fix: use a plain string with the image token embedded:
```python
conversation = [{"role": "user", "content": f"<image>\n{PROMPT}"}]
text_input = processor.tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = processor(images=image, text=text_input, return_tensors="pt").to(model.device)
```

**Bug 3: 4-bit quantization breaks SigLIP vision tower**

`BitsAndBytesConfig(load_in_4bit=True)` quantizes all linear layers including SigLIP's `MultiheadAttention.out_proj`. SigLIP calls `F.multi_head_attention_forward` with raw weight tensors, bypassing bitsandbytes' override, causing:
```
RuntimeError: self and mat2 must have the same dtype, but got Half and Byte
```

Fix: use 8-bit with vision modules excluded:
```python
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # Leave the vision tower and projector unquantized (fp16)
    llm_int8_skip_modules=["vision_tower", "multi_modal_projector"],
)

model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_NAME,  # local path from Bug 1
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```

This keeps the LLM at 8-bit (~8GB) and the vision tower at fp16 (~1-2GB), totalling ~10-11GB VRAM. Fits comfortably on 16GB.
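The ~8GB figure for the LLM follows from simple arithmetic on the weight size. Here is a rough back-of-the-envelope sketch; the 8 billion parameter count is my assumption for the language model, and the estimate covers weights only (no activations, KV cache, or framework overhead).

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_param / 8 / 1e9

LLM_PARAMS = 8e9  # assumption: roughly 8 billion parameters in the language model

for label, bits in (("fp16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: ~{weight_memory_gb(LLM_PARAMS, bits):.0f} GB")
```

At fp16 the LLM weights alone would take about 16 GB, which is why unquantized loading is not an option on a 16GB card, while 8-bit leaves room for the fp16 vision tower and working memory.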

---

## The Trigger Word Problem

If you generate captions with JoyCaption (or any captioner), the captions are plain descriptive text. **The model has no trigger word unless you explicitly add one to every caption.**

Example: if you train with JoyCaption captions and then generate with prompt `"ohwx man, portrait photo..."`, the token `ohwx man` was never in the training data and is ignored by the LoRA. It is not harmful but it does nothing.

Options:
1. Prepend a trigger word to all captions before training: `"ohwx man, [joycaption text]"` — requires a script to add the prefix to every `.txt` file
2. Use the `trigger_word` or `caption_prefix` setting in the training config if the tool supports it — cupertinomiranda/ai-toolkit does not currently expose this for Flux
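
The script for option 1 can be very small. Below is a minimal Python sketch; `prepend_trigger` is a name I made up, and it assumes captions are UTF-8 `.txt` files sitting in one folder. It skips files that already start with the trigger, so running it twice does not double the prefix.

```python
from pathlib import Path

def prepend_trigger(caption_dir, trigger="ohwx man"):
    """Prefix every .txt caption with the trigger word; skip files that already have it."""
    prefix = f"{trigger}, "
    changed = 0
    for caption_file in sorted(Path(caption_dir).glob("*.txt")):
        text = caption_file.read_text(encoding="utf-8")
        if text.startswith(prefix):
            continue  # already prefixed, leave it alone
        caption_file.write_text(prefix + text, encoding="utf-8")
        changed += 1
    return changed

# Example: prepend_trigger("/path/to/photos")
```

It returns the number of files it modified, which is a quick sanity check against the number of photos.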

**Recommendation:** For option 1, a one-liner to prepend to all captions: `for f in /path/to/photos/*.txt; do sed -i "1s/^/ohwx man, /" "$f"; done`. Include the trigger word
