r/StableDiffusion

Damn... did all of you who use Runpod have very low to 0 availability?
https://redd.it/1trzex3
@rStableDiffusion

6 views16:40

Presenting Stable Audio Studio: A dedicated app for running Stable Audio models locally
https://redd.it/1trzjgx
@rStableDiffusion

5 views17:40

NVIDIA PiD Preview Inside a Next-Gen Tiled Upscaler & Enhancer

https://redd.it/1ts3ofu
@rStableDiffusion

5 views18:40

r/StableDiffusion

What are the recommended resolutions for Anima? Why are all the CivitAI images vertical?
https://redd.it/1ts6e5t
@rStableDiffusion

4 views19:40

r/StableDiffusion

Attn: Black Forest Labs and other researchers: Perceptual (OKLab) color space models.

**TL;DR**

**Proposal: Training Flow Models in Perceptually Uniform Color Spaces to Simplify Latent Manifolds & Enable Disentangled Chromatic Control**

**What this means for you:** Faster generation (fewer steps needed for clean, stable color), instant palette steering that actually locks to your prompt from step 1, and an end to hue drift / "neon mud" when you push CFG or saturation sliders. For researchers: a mathematically cleaner latent manifold, straighter ODE trajectories, and a testable path toward orthogonal lightness/chroma control without architectural overhaul.

• Flow Matching geometry + Oklab uniformity → reduced trajectory curvature

• β-VAE disentanglement + ΔE(Oklab) loss → orthogonal lightness/chroma axes

• PaletteDiffusion/ColorCond precedents + harmonic rule embeddings → structured conditioning over text

---

**[SKIP IF NOT INTERESTED] COLOR SPACE BACKGROUND**

sRGB was engineered for 1990s CRT phosphor limits, not human perception or machine learning. It heavily entangles luminance and chrominance, meaning linear interpolation in sRGB crosses perceptually "dead" zones, forcing models to waste capacity learning correction curves. Perceptually uniform spaces like CIELAB and Oklab were explicitly designed so that Euclidean distance ≈ perceived color difference. Oklab (2020) fixes legacy issues with lightness scaling and hue linearity, making it ideal for gradient-based optimization.

[Oklab Technical Deep Dive](https://bottosson.github.io/posts/oklab/)

[CIE Color Spaces & Perceptual Uniformity](https://en.wikipedia.org/wiki/CIELAB_color_space)
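
To make "Euclidean distance ≈ perceived color difference" concrete, here is a minimal sketch of the sRGB to Oklab conversion and a plain Euclidean ΔE on top of it. The matrix coefficients are the ones published in Ottosson's Oklab post linked above; the function names are illustrative and not part of any library.

```python
import math


def srgb_to_linear(c: float) -> float:
    """Undo the sRGB transfer curve for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def srgb_to_oklab(r: float, g: float, b: float) -> tuple:
    """Convert gamma-encoded sRGB (0..1 per channel) to Oklab (L, a, b)."""
    r, g, b = (srgb_to_linear(c) for c in (r, g, b))
    # Linear sRGB -> approximate cone responses (LMS)
    l = 0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b
    m = 0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b
    s = 0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b
    # Cube-root non-linearity (sign-safe)
    l_, m_, s_ = (math.copysign(abs(v) ** (1 / 3), v) for v in (l, m, s))
    return (
        0.2104542553 * l_ + 0.7936177850 * m_ - 0.0040720468 * s_,
        1.9779984951 * l_ - 2.4285922050 * m_ + 0.4505937099 * s_,
        0.0259040371 * l_ + 0.7827717662 * m_ - 0.8086757660 * s_,
    )


def delta_e_oklab(srgb1, srgb2) -> float:
    """Euclidean distance in Oklab between two sRGB colors."""
    return math.dist(srgb_to_oklab(*srgb1), srgb_to_oklab(*srgb2))
```

A training loss would need a differentiable tensor version of the same arithmetic; this scalar version is only meant to show how small the transform is.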

---

**FULL PROPOSAL**

Dear Black Forest Labs, Hugging Face, and the generative AI research community,

State-of-the-art image generators are currently trained and conditioned on sRGB, a display-referred standard optimized for CRT phosphor response, not for perceptual consistency or machine learning efficiency. While sRGB remains necessary for output rendering, its perceptual non-uniformity introduces unnecessary curvature into the data manifold, forcing models to learn compensatory trajectories rather than intrinsic color structure.

I propose a focused research initiative: fine-tuning a VAE and subsequent Rectified Flow/Flow Matching pipeline using Oklab (or its polar counterpart, Oklch) as the internal color representation, paired with structured harmonic conditioning.

**Trajectory Simplification in Flow Matching:**

Rectified flow models approximate optimal transport by learning straight-line velocity fields from noise to data. In sRGB, linear interpolation between saturated hues traverses perceptually desaturated regions, forcing the vector field to learn non-linear corrections to maintain chromatic integrity. Oklab is constructed so that Euclidean distance correlates with perceptual difference (ΔE). Training in Oklab aligns the mathematical trajectories of flow matching with human perceptual geometry, reducing trajectory curvature, lowering effective manifold complexity, and potentially improving convergence and step efficiency.
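
For reference, the straight-line objective this argument leans on can be written as follows. This is the standard rectified flow / linear-path flow matching form, with $x_0$ a noise sample and $x_1$ a data sample (in practice a VAE latent); the only place the color space enters is in how $x_1$ is represented.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
  \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
```

Whether an Oklab-based representation of $x_1$ actually makes the learned field $v_\theta$ straighter is the open empirical question of this proposal, not something the formula itself guarantees.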

**Latent Compression & Disentangled Chromatic Subspaces:**

Current VAEs compress sRGB images using MSE or LPIPS, neither of which guarantees perceptual uniformity in the latent space. By training a VAE with a differentiable ΔE(Oklab) perceptual loss and optional orthogonal regularization, we can encourage separation of lightness (L) and chromaticity (a,b) within the latent subspace. This mitigates the "color bleed" and hue drift commonly observed under high CFG or during latent interpolation, as perturbations along lightness axes no longer inadvertently modulate chromatic dimensions.

**Structured Color Conditioning Pathways:**

Teaching harmonic relationships to the model doesn't require manual dataset retagging. Multiple scalable pathways exist:

• Automated Lexical Tagging: Cluster dominant colors in Oklab space, map to standardized color names, and attach LLM-derived mood/setting descriptors. This converts implicit palette constraints in real-world assets into explicit conditioning signals.

• Geometry-Locked Synthetic Pairs: Generate structural duplicates (via depth/Canny/structure maps) with systematically varied harmonic relationships (complementary, triadic, etc.) for clean ablation studies that isolate color logic from spatial priors.

• Vector-Based Rule Embeddings: Feed numerical Oklch coordinates + harmonic relationship vectors directly into cross-attention or lightweight adapters, bypassing the ambiguity of text tokens entirely.

Each approach trades off between data realism, compute overhead, and conditioning precision. We encourage community experimentation across all three, with shared benchmarking to determine which yields the strongest ΔE stability and palette adherence.
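
As a rough illustration of the vector-based pathway, the sketch below builds a harmonic palette by rotating hue in Oklch and flattens it into a numeric conditioning vector. Treating "complementary" and "triadic" as fixed hue offsets in Oklch is an assumption (classic harmony rules were defined on other color wheels), and `HARMONY_OFFSETS`, `harmonic_palette`, and `conditioning_vector` are hypothetical names, not an existing API.

```python
import math


def oklab_to_oklch(L: float, a: float, b: float) -> tuple:
    """Oklab -> Oklch: chroma is the radius, hue is the angle in degrees."""
    return L, math.hypot(a, b), math.degrees(math.atan2(b, a)) % 360.0


def oklch_to_oklab(L: float, C: float, h: float) -> tuple:
    """Oklch -> Oklab, hue given in degrees."""
    rad = math.radians(h)
    return L, C * math.cos(rad), C * math.sin(rad)


# Hypothetical rule table: hue offsets in degrees relative to the base color.
HARMONY_OFFSETS = {
    "complementary": (0.0, 180.0),
    "triadic": (0.0, 120.0, 240.0),
    "analogous": (-30.0, 0.0, 30.0),
}


def harmonic_palette(base_lch, rule: str) -> list:
    """Rotate the base hue by each offset, keeping lightness and chroma fixed."""
    L, C, h = base_lch
    return [(L, C, (h + off) % 360.0) for off in HARMONY_OFFSETS[rule]]


def conditioning_vector(palette) -> list:
    """Flatten to [L, C, cos h, sin h, ...] so hue wraps around smoothly."""
    vec = []
    for L, C, h in palette:
        rad = math.radians(h)
        vec.extend([L, C, math.cos(rad), math.sin(rad)])
    return vec
```

For example, `conditioning_vector(harmonic_palette((0.7, 0.15, 30.0), "triadic"))` yields a 12-number vector that an adapter could consume instead of a color word.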

**Expected Outcomes & Measurable Metrics:**

- Reduced Latent Trajectory Curvature: Quantifiable via ODE solver step count, velocity field smoothness, and latent interpolation linearity.

- Hue/Chroma Stability: Lower ΔE deviation under varying CFG scales, step counts, and latent perturbations.

- Linear Color Steering: Independent control over lightness, chroma, and hue via latent axis manipulation without cross-dimensional leakage.

- Palette Adherence Benchmarks: Standardized evaluation of spectral compliance using constrained Oklch injection and harmonic rule accuracy.
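
One simple way to put a number on "trajectory curvature" is a chord-to-path ratio over the intermediate states of a sampling run. The sketch below is an assumed metric for illustration (the proposal does not fix a specific formula), and `straightness` is a hypothetical helper name.

```python
import math


def straightness(trajectory) -> float:
    """Chord length divided by path length for a sequence of points.

    Returns 1.0 for a perfectly straight trajectory and smaller values
    the more the path bends between its first and last point.
    """
    path = sum(math.dist(p, q) for p, q in zip(trajectory, trajectory[1:]))
    chord = math.dist(trajectory[0], trajectory[-1])
    return chord / path if path > 0 else 1.0
```

Comparing this ratio for the same prompts and seeds between an sRGB-trained and an Oklab-trained pipeline would be one way to test Claim 1 below.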

This proposal advocates optimizing internal training and conditioning representation to match perceptual geometry, reducing representational overhead, and enabling precise, mathematically grounded chromatic control.

Sincerely,
crantob, A practitioner observing latent space geometry

---

**THREE KEY CHALLENGEABLE CLAIMS & SUPPORTING RESEARCH**

• *Claim 1: Perceptually uniform spaces reduce flow trajectory curvature & improve step efficiency.*

- **Why reviewers push back:** Flow models already approximate straight lines; skeptics argue color space choice won't meaningfully alter optimal transport paths or sampling speed.

- **Supporting theory:** Rectified Flow minimizes transport cost by enforcing straight trajectories. When data representation matches perceptual distance, the velocity field requires fewer non-linear corrections to maintain structural/color integrity along the path.

- **References:**

[Flow Matching for Generative Modeling (Lipman et al.)](https://arxiv.org/abs/2210.02747)

[Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (Liu et al.)](https://arxiv.org/abs/2209.03003)

• *Claim 2: ΔE(Oklab) + orthogonal regularization disentangles lightness/chroma in VAEs.*

- **Why reviewers push back:** Standard VAEs entangle features regardless of loss function; true disentanglement usually requires heavy architectural priors or explicit labels.

- **Supporting theory:** Capacity constraints (β-VAE) combined with perceptual losses have been empirically proven to isolate semantic axes. Using ΔE as the perceptual metric explicitly penalizes cross-axis gradient coupling between L and (a,b), making orthogonality a trainable prior rather than a statistical accident.

- **References:**

[Early Visual Concept Learning with Unsupervised Deep Learning (Higgins et al.)](https://arxiv.org/abs/1606.05579)

[Perceptual Losses for Real-Time Style Transfer and Super-Resolution (Johnson et al.)](https://arxiv.org/abs/1603.08155)

• *Claim 3: Vector-based harmonic conditioning outperforms textual color tokens.*

- **Why reviewers push back:** Text encoders already embed implicit color statistics; explicit vectors may add overhead without measurable gains over fine-tuned CLIP embeddings.

- **Supporting theory:** Text prompts encode statistical co-occurrence, while numerical Oklch vectors encode explicit spectral geometry. Prior work in image-to-image diffusion demonstrates that direct channel/histogram conditioning bypasses CLIP's semantic ambiguity, yielding stricter palette adherence and lower ΔE deviation under identical compute budgets.

- **References:**

[Palette: Image-to-Image Diffusion Models (Saharia et al.)](https://arxiv.org/abs/2111.05826)

[Oklab: A Perceptual Color Space (Ottosson)](https://bottosson.github.io/posts/oklab/)

---

Ideas: mine

Text: me + qwen + GLM fighting each other over it for a couple hours.

https://redd.it/1ts994w
@rStableDiffusion

4 views20:40

r/StableDiffusion

I ported Pixal3D to Apple Silicon
https://blog.chillaid.art/posts/porting-pixal3d-one-cursed-kernel-at-a-time

https://redd.it/1ts82da
@rStableDiffusion

4 views21:40

r/StableDiffusion

UPDATE v0.2.20 Nexus BTA My Web UI for Comfy with Predefined Workflow/template

https://redd.it/1tsa458
@rStableDiffusion

4 views22:40

r/StableDiffusion

Help Needed: How to create this type of art in Stable Diffusion? (Models, LoRA & settings)

https://redd.it/1tsfg8l
@rStableDiffusion

5 views00:40

r/StableDiffusion

PSA 5060ti 16GB for $300.99. 5070ti 16GB for $699.99. Best Buy in store clearance.

The 5060ti 16GB (SKU 6630626) has been on clearance for a couple of weeks in Best Buy stores for $419.99. A couple of days ago, it dropped to $300.99. The 5070ti 16GB (SKU 6620367) has been on clearance for $699.99. Not all stores will have these prices. Some still have the 5060ti for $419.99 and the 5070ti for $799. So YMMV. But a lot of stores do have the lower prices.

This is an in-store-only deal, but your local Best Buy doesn't have to have it in stock. Of course, it's best if it does. If it doesn't, you can order items in Best Buy stores for the same price the store sells them for. So instead of paying the Best Buy online price of $599.99 for the 5060ti, when you order it in store you pay $300.99. Just go into a store and give them those SKUs to look up the in-store price.

As of this post, both are still available online for shipping. As long as there is stock online, you should be able to order it at your local Best Buy for the in store clearance prices shipped to you. Of course, your local Best Buy has to have it on clearance at that price. It's not guaranteed all will.

Lastly, there's an Nvidia promo for a free copy of 007 First Light going on right now. So you will also get a key to redeem for that game. The game is like $70.

I hope this helps someone.

https://redd.it/1tse4rl
@rStableDiffusion

4 views01:40

r/StableDiffusion

Anima prompt skill system prompt

Anima prompt skill system prompt: Let the LLM understand both Danbooru tags and natural language while preserving wildcards without altering them

Why this?

Anima-style models have a unique advantage: **they accept both Danbooru tags (comma-separated keywords) and natural language (full sentences) as input.**

But here's the problem:

- If you feed pure tags, the image lacks spatial relationship descriptions (Where is the subject? Is the background in front or behind?)

- If you feed natural language, you waste the precise control that tags offer

- Even worse, LLMs often **arbitrarily expand wildcards** (turning `{A|B}` into `A or B`) or **delete tags they don't recognize**

So I wrote this System Prompt with a simple goal:

> **Turn the LLM into a "2D visual coordination specialist," not a novelist or a translator.**

---

## What does this System Prompt do?

| Input Type | Handling |
| --- | --- |
| Danbooru tags (e.g., `1girl, solo, classroom, desk`) | Preserve all tags, add "position within the frame" and "spatial relationships between elements" |
| Natural language (e.g., "a teacher teaching in front of a blackboard") | Transform into structured English descriptions, automatically derive appropriate Danbooru elements |
| Wildcards (e.g., `{standing, sitting}`) | **Preserve completely**, no expansion, no selection, no deletion |

---

## Core Rules (Simplified)

1. **No image generation** (text output only)

2. **Tag priority** (user's tags remain unchanged)

3. **Only reinforce position and spatial relationships** (no weather, lighting, or clothing texture details)

4. **Output as a single English paragraph** (no markdown, parentheses, or prefacing text)

5. **Full wildcard support** (original syntax untouched)

---

## Example

**Input (Danbooru tags + wildcard):**

`1girl, {standing, sitting}, classroom, desk, {morning, evening}`

**Output:**

> `masterpiece, 1girl, {standing,| sitting}, in the center of a classroom, positioned in front of a desk, with {morning,| evening} lighting implied by the scene context.`

---

## Who is this for?

- People using Anima / NovelAI / Stable Diffusion who are accustomed to mixing tags and natural language

- People tired of LLMs messing up wildcards or adding unnecessary novel-like details

- People who want LLM output that can be directly copy-pasted as image generation prompts

---

## Full System Prompt

## System Prompt

**Role & Goal**

You are a precise 2D visual coordination specialist. You handle two input types:

1. **Danbooru tag input** → Preserve all tags, reinforce spatial relationships and visual flow.

2. **Natural language input** (e.g., "a teacher teaching in front of a blackboard") → Convert description into structured English scene narrative, automatically inferring appropriate Danbooru-style elements.

**Input Detection**

- Comma-separated English terms → Danbooru tag input → follow tag preservation workflow.

- Chinese or full sentence description → Natural language input → follow language conversion workflow.

**Core Rules**

1. **Never generate images.**

2. **Tag priority:** User-provided Danbooru tags are absolute core — preserve all, never delete or arbitrarily replace.

3. **Spatial reinforcement only:** Add subject position (center, foreground, background) and spatial/interaction relationships (standing in front of, surrounded by).

4. **No over-expansion:** Do not add weather, lighting, or irrelevant fabric details unless originally mentioned. Keep concise.

5. **Format:** Output as a single smooth English paragraph (but split into two lines: line 1 = Danbooru tags, line 2 = natural language). No Markdown, parentheses, or prefixes.

6. **Wildcard handling:**

- Preserve raw wildcard syntax `{A,|B,|C}` or `{A,B}_noun` or `{1-3$$ A,|B,|C}`: never expand, never choose, never replace.

- For positional wildcards → use neutral descriptions (e.g., `on either side`, `relative position to be determined`).

- For attribute wildcards → process spatial relationships normally.

- Never rewrite `{A|B}` as `A or B`.

- Never delete or ignore wildcards.
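
Since LLMs do not always follow these wildcard rules, a quick post-check on the output can catch violations before the prompt reaches the image model. The sketch below is an assumed helper (`wildcards_preserved` is a hypothetical name, not part of Anima or any prompt tool); it only handles flat, non-nested `{...}` spans.

```python
import re
from collections import Counter

# Matches one flat {...} span; nested braces are not handled in this sketch.
WILDCARD = re.compile(r"\{[^{}]*\}")


def wildcards_preserved(user_input: str, llm_output: str) -> bool:
    """True if every {...} span from the input appears verbatim in the output
    at least as many times as it appeared in the input."""
    wanted = Counter(WILDCARD.findall(user_input))
    found = Counter(WILDCARD.findall(llm_output))
    return all(found[span] >= count for span, count in wanted.items())
```

If the check fails, the simplest recovery is to re-run the LLM or fall back to the original tag line.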

**Workflow A (Danbooru tags)**

Output two lines:

Line 1: Original quality + base + subject + action + background tags

Line 2: Natural language describing subject position + interaction + background relationship

**Workflow B (Natural language)**

Extract subject/action/scene → infer logical elements → output:

Line 1: Danbooru tags (masterpiece, best quality, 1girl/1boy, relevant clothing, expression, action, visible scene elements)

Line 2: Smooth English scene description with spatial clarity

---

## ANIMA Model Skill Profile

**Skill Name:** `spatial_tag_coordinator`

**Description:**

Converts Danbooru tag lists or natural language prompts into ANIMA‑friendly two‑line outputs: raw tags + spatial natural language. Preserves all user tags, adds only positional/interaction relationships. No image generation.

**Input Format Examples:**

```
1girl, knight, charging, riding horse, battlefield
```

```
a wizard casting a spell in a library
```

**Output Format (two lines, no markdown):**

```
[line1: Danbooru tags]
[line2: Natural language spatial description]
```

**Example Output for ANIMA:**

```
1girl, knight, armor, charging, riding_horse, horse, battlefield, dust, spear, shield, action
A young female knight in armor charges on horseback across a battlefield, holding a spear and shield, with dust rising around her as she rides forward through the center of the scene.
```

**Key Constraints for ANIMA Compatibility:**

- Flat text only (no JSON, no parentheses wrapping tags)

- First line = pure Danbooru comma list

- Second line = natural English, no tags inside

- Wildcards `{A,|B,|C,}` or `{1-3$$ A,|B,|C,}` passed through unchanged

- Never generate images; only transform text

**Use Case:**

Paste this skill into ANIMA's custom prompt or system field before generating. Feed it either tag lists or natural language — it will output clean, spatially explicit prompts that ANIMA's model understands easily.

---

Simple examples

Input: A female knight charges into battle

Output:

1girl, knight, armor, charging, riding_horse, horse, battlefield, dust, spear, shield, action

A young female knight in armor charges on horseback across a battlefield, holding a spear and shield, with dust rising around her as she rides forward through the center of the scene.

Input: A female young teacher in classroom

Output:

1girl, young, petite, short stature, female teacher, teacher uniform, blouse, skirt, glasses, stern expression, authoritative pose, teaching, standing in front of blackboard, classroom, chalkboard, holding chalk

A young short female teacher with full dignity stands authoritatively at the center foreground in the classroom, teaching confidently in front of the blackboard while maintaining a commanding presence despite her small height.

https://redd.it/1tsi95z
@rStableDiffusion

6 views02:40

r/StableDiffusion

Does RAM speed matter?

Here's my understanding: an image/video generation in ComfyUI goes through phases:

1. Take your prompt and convert it to math
2. Create random noise
3. Denoise using the model
4. VAE: convert the output into a human-viewable image

At each phase, ComfyUI needs to load the corresponding safetensors file.
Ideally, it loads everything into your GPU VRAM, which is the fastest.
However, if your VRAM is too small, it loads into regular RAM. If even that is not enough, it falls back to your SSD (really bad: it is slow and kills your SSD).

When each stage is done, it leaves the loaded data where it is (VRAM or RAM), but unloads it if it needs room to load the next thing. Having it already loaded speeds things up for the next generations.

Denoising (using the model to iterate on the latent image and remove the noise) takes the majority of the processing time.
This means VRAM and RAM speed doesn't really matter that much, right?
It only matters initially, when you load into RAM?

I'm just wondering whether it'll be worth upgrading to DDR6 when it comes out, or if it's better to stay on DDR5 and upgrade to a larger capacity.
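
A back-of-the-envelope way to reason about this: when weights are moved from system RAM to the GPU, the transfer is limited by the slowest link in the chain. The sketch below uses theoretical peak figures (dual-channel DDR5-5600 at 89.6 GB/s, PCIe 4.0 x16 at roughly 32 GB/s per direction) and a made-up 14 GB model; real throughput is lower, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(size_gb: float, link_gb_per_s: list) -> float:
    """Time to move size_gb through a chain of links; the slowest link dominates."""
    return size_gb / min(link_gb_per_s)


DDR5_5600_DUAL = 89.6  # GB/s, theoretical peak: 5600 MT/s * 8 bytes * 2 channels
PCIE4_X16 = 32.0       # GB/s, approximate theoretical peak per direction
MODEL_GB = 14.0        # hypothetical model size, for illustration only

# RAM -> GPU copy with current RAM, and with RAM assumed twice as fast
now = transfer_seconds(MODEL_GB, [DDR5_5600_DUAL, PCIE4_X16])
faster_ram = transfer_seconds(MODEL_GB, [2 * DDR5_5600_DUAL, PCIE4_X16])
```

Under these assumptions both cases give the same time, because the PCIe link rather than the RAM is the bottleneck; that would favor capacity over speed, but it is an estimate, not a benchmark.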

https://redd.it/1tsir6n
@rStableDiffusion

8 views03:40

r/StableDiffusion

Python Grid push for 1536x768 - can throw together simple storyboard rough draft, springboard for ideas, imho - simple script in comments. These images can hit 12000x8000 at 100MB+ scaled down for this post.

https://redd.it/1tsmh86
@rStableDiffusion

5 views06:40

r/StableDiffusion

What image model should I use as somebody who likes the aesthetic of Midjourney and diverse outputs? 16 GB VRAM, 64 GB RAM

I've been a little out of the image generation game (to be honest, locally I've never really been in it very much) and there are so damn many models out there that I don't know where to start. I've been very preoccupied with wan 2.2. What would you recommend these days for a high quality model (so probably nothing sdxl-like, it feels too unstable but maybe there are good versions I don't know of) but with a lot of diversity in its outputs on different seeds (so not like ZIT) and hopefully with not much bias like ZIT has with ethnicity. Flux always felt too plastic-y. I realize nothing quite reaches Midjourney of course but just so you know the direction, I like the artistic kinda stuff rather than "cookie-cutter" looking images if you understand what I mean

Thank you

https://redd.it/1tsq4jk
@rStableDiffusion

5 views09:40
