that looks better than any "monochrome high contrast" prompt I tried
* "Cinestill 800T" for night scenes with that halation glow around lights
* Adding "slightly asymmetrical features" or "faint laugh lines" to portraits kills the symmetry default
* "On-board flash falloff" gives you that candid snapshot look with the harsh foreground light and falling-off background

**Stuff I'm still figuring out:**

* LoRA weights feel different than SDXL. Anything above 0.85 tends to overcook. Anyone else seeing this?
* Text rendering is good but seems to tank if the prompt is too long. I think the model budgets attention between scene description and typography and long prompts starve the text encoder. Curious if others have tested this.
* Bilingual prompts (EN + CN in the same prompt) sometimes produce better English typography than pure EN prompts. No idea why. Might be a training data quirk.
* Hands are genuinely fixed but feet still look weird like 30% of the time. Haven't found a reliable fix yet.

https://preview.redd.it/zrkeynx1ndug1.jpg?width=1920&format=pjpg&auto=webp&s=6ca058e66cc4c7e174f2f07ce5f6499cb15694d7

https://preview.redd.it/v557bkw7pdug1.jpg?width=1920&format=pjpg&auto=webp&s=250b92caf4634f2e40cc588728bcfdb96ec1ad2d

https://preview.redd.it/jhtxz9ecpdug1.jpg?width=1920&format=pjpg&auto=webp&s=3ba407eb55529659d95e8aca043076eea025ce3f

https://preview.redd.it/4ezi3rmhpdug1.jpg?width=1920&format=pjpg&auto=webp&s=5df585e2ced71d89e5b826941155e62a046a7f1e

https://preview.redd.it/ymibzw0lpdug1.jpg?width=1920&format=pjpg&auto=webp&s=13a51528f6849298b25e69054e3335eb65bdf741

https://preview.redd.it/c740vz9ppdug1.jpg?width=1920&format=pjpg&auto=webp&s=078a0239cc2a424c27a9b75c5a35881310b22b54



https://redd.it/1shpbbb
@rStableDiffusion
Live AI video is doing too much lifting as a term. Here's a breakdown of what people actually mean.

The phrase is everywhere right now, but it's covering at least three meaningfully different things that keep getting conflated:




1. Faster post-production. The model still generates a discrete clip, it just does it quicker than it used to. Useful, but this is throughput improvement, not liveness.




2. Low-latency iteration. You can tweak and regenerate fast enough that it feels interactive. Still clip-based under the hood. Great UX, but the model still isn't responding to a continuous stream.




3. Actual real-time inference on a live stream. The model is continuously generating frames in response to incoming input, not producing clips at all. This is a fundamentally different architecture and a much harder problem.




The third category is where things get genuinely interesting from a technical standpoint. Decart is one of the few doing this for real, but because demos for all three can look superficially similar, the distinction gets lost. Vendors have every incentive to let it stay lost.Worth being precise about which one you're actually evaluating if you're building anything serious on top of this.

https://redd.it/1shogaz
@rStableDiffusion
Qwen3.5-4B-Base-ZitGen-V1

Hi,

I'd like to share a fine-tuned LLM I've been working on. It's optimized for image-to-prompt and is only 4B parameters.

Model: https://huggingface.co/lolzinventor/Qwen3.5-4B-Base-ZitGen-V1

I thought some of you might find it interesting. It is an image captioning fine-tune optimized for Stable Diffusion prompt generation (i.e., image-to-prompt). Is there a comfy UI custom node that would allow this to be added to a cui workflow? i.e. LLM based captioning.

# What Makes This Unique

What makes this fine-tune unique is that the dataset (images + prompts) were generated by LLMs tasked with using the ComfyUI API to regenerate a target image.

# The Process

The process is as follows:

1. The target image and the last generated image (blank if it's the first step) are provided to an LLM with a comparison prompt.
2. The LLM outputs a detailed description of each image and the key differences between them.
3. The comparison results and the last generated prompt (empty if it's the first step) are provided to an LLM with an SD generation prompt.
4. The output prompt is sent to the ComfyUI API using Z-Image Turbo, and the output image is captured.
5. Repeat N times.

# Training Details

The system employed between 4 and 6 rounds of comparison and correction to generate each prompt-image pair. In theory, this process adapts the prompt to minimize the difference between the target image and the generated image, thereby tailoring the prompt to the specific SD model being used.

The prompts were then ranked and filtered to remove occasional LLM errors, such as residuals from the original prompt or undesirable artifacts (e.g., watermarks). Finally, the prompts and images were formatted into the ShareGPT dataset format and used to train Qwen 3.5 4B.

https://redd.it/1shvuxa
@rStableDiffusion
Ace Step 1.5 XL ComfyUI automation workflow without lama for generating random tags using qwen, generate song and then give it a rating by using waveform analysis

The idea came to me after sorting trough a lot of Ace Step 1.5 XL outputs and trying to find best styles and tags for songs. Why not automate the generation process AND the review process, or at least make it easier. So as usual I used Qwen LM and Qwen VL (compared to something like olama these ones run directly in comfy and do not require a server) to randomize the tags on each run, but more importantly to try and rate the output. How ? By converting the audio output into a set of waveforms for 4 segments of the song that I feed into Qwen VL as an image and ask it to subjectively look at the waveform and give it feedback and rating, rating that is used then to also name the output file. Like this. I am not sure it works properly but the A+ rated songs were indeed better than B rated ones.
Workflow is here. Install the missing extensions and add the qwen models.
Here is part of the working flow, including output folder.

https://preview.redd.it/kpar4blijfug1.jpg?width=1280&format=pjpg&auto=webp&s=cf2b4e5491c8b237d29e9649d90d40c6172090a9

https://preview.redd.it/oxtxaf8kjfug1.jpg?width=1400&format=pjpg&auto=webp&s=643c100c7fe05bb5184551edd0b7a34d99476ddf

https://preview.redd.it/3old46smjfug1.jpg?width=1592&format=pjpg&auto=webp&s=07b366afe5ae259b11fbd86cf2332c56ab9192ea






https://redd.it/1shzm63
@rStableDiffusion
Just installed ForgeNeo and I'm facing this issue *failed to recognize model type*
https://redd.it/1si419g
@rStableDiffusion
[Release] ComfyUI Image Conveyor — sequential drag-and-drop image queue node
https://redd.it/1sibmrf
@rStableDiffusion
New nodes to handle/visualize bboxes

Hello community, I'd like to introduce my ComfyUI nodes I recently created, which I hope you find useful. They are designed to work with BBoxes coming from face/pose detectors, but not only that. I tried my best but didn't find any custom nodes that allow selecting particular bboxes (per frame) during processing videos with multiple persons present on the video. The thing is - face detector perfectly detects bboxes (BoundingBox) of people's faces, but, when you want to use it for Wan 2.2. Animation or other purposes, there is no way to choose particular person on the video to crop their face for animation, when multiple characters present on the video/image. Face/Pose detectors do their job just fine, but further processing of bboxes they produce jump from one person to another sometimes, causing inconsistency. My nodes allow to pick particular bbox per frame, in order to crop their faces with precision for Wan2.2 animation, when multiple persons are present in the frame. Hence, you can choose particular face(bbox) per frame.
I haven't found any nodes that allow that so I created these for this purpose.
Please let me know if they would be helpful for your creations.
https://registry.comfy.org/publishers/masternc80/nodes/bboxnodes
Description of the nodes is in repository:
https://github.com/masternc80/ComfyUI-BBoxNodes

https://redd.it/1sidcv5
@rStableDiffusion
Trying to inpaint using Z-image Turbo BF16; what am I doing wrong?

https://preview.redd.it/3krmmy345jug1.png?width=1787&format=png&auto=webp&s=359dfa4e2515bd33e40090f986e4a597a00d06d6

Fairly new to the SD scene. I've been trying to do inpainting for an hour or so with no luck. The model, CLIP and VAE are in the screenshot. The output image always looks incredibly similar to the input image, as if I had zero denoise. the prompt also seems to do nothing. Here, I tried to make LeBron scream by masking just his face. The node connections seem to be all correct too. Is there another explanation? Sampler? The model itself?

https://redd.it/1siefug
@rStableDiffusion