Which video model learns face likeness best when training LoRA?
Hey, I’m trying to train LoRAs for real human likeness and was wondering which video model currently does the best job at learning and preserving identity.
I’ve tried a bit with LTX and Wan, but still not sure which one is actually better for likeness. Would love to hear what people are getting the best results with right now
https://redd.it/1shbfra
@rStableDiffusion
ACE-Step 1.5 XL Base — BF16 version (converted from FP32)
I converted the ACE-Step 1.5 XL Base model from FP32 to BF16. The original weights were ~18.8 GB in FP32; this version is ~7.5 GB — same quality, lower VRAM usage.
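For anyone who wants to do the same with other checkpoints, here is a minimal sketch of this kind of conversion, assuming the model is a single safetensors file (the filenames below are made up, not the actual repo layout):

```python
import torch
from safetensors.torch import load_file, save_file

# Hypothetical filenames; adjust to the actual checkpoint layout.
state = load_file("acestep_v15_xl_base_fp32.safetensors")  # FP32 weights
state = {
    k: v.to(torch.bfloat16) if v.is_floating_point() else v  # leave int tensors alone
    for k, v in state.items()
}
save_file(state, "acestep_v15_xl_base_bf16.safetensors")  # roughly half the size on disk
```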
The Base model is the go-to starting point for fine-tuning (LoRA, etc.) — if you want to train your own style, this is the one to use. A great tool for that is Side Step.
🤗 https://huggingface.co/marcorez8/acestep-v15-xl-base-bf16
I also converted the XL Turbo variant yesterday: Reddit post | Model
https://redd.it/1shfihr
@rStableDiffusion
GitHub - koda-dernet/Side-Step: training scripts for ACE-Step 1.5, including a command-line interface, a terminal wizard, and a graphical user interface.
HappyHorse is from Alibaba ATH, not Grok / Veo 3.2 / Wan 2.7 / Seedance 2
I finally found what looks like the official clarification.
According to the verified HappyHorse Twitter account, HappyHorse is a product currently in internal testing under Alibaba's ATH innovation division. The account also says the product has not officially launched yet, and that the so-called "official websites" circulating online are fake.
https://preview.redd.it/s0yc372pjbug1.png?width=760&format=png&auto=webp&s=77cb530ff67fbb68537c0a7417fa782b88c3981a
https://preview.redd.it/zlpry4m0jbug1.png?width=1337&format=png&auto=webp&s=4756801907a9adcbcad4dc8c3c859615fcc6a208
https://redd.it/1shfzip
@rStableDiffusion
Happy Horse's deceptive practices
Kinda lame that Happy Horse was pushed as open weights early on, got people interested, and is now apparently becoming a closed-source, API-only product. They knew what they were doing.
Far fewer people are interested in closed video models, but promise open weights and you get way more traction… then close it up.
A paid, censored, data-harvesting, closed video model is way less useful for a lot of us. The whole appeal was being able to run it ourselves, experiment freely, fine-tune, make LoRAs, and build on top of it without being stuck behind someone else's rules and pricing.
Feels like they used the open-weights angle to build hype and traction, then pulled the ladder up, and I really believe that. Saying that the sources stating it's open weights are fake also seems super fishy.
At this point Alibaba is just using the name it built by releasing very good local models to promote closed models (that imo are not even close to other closed models).
https://redd.it/1shi6ca
@rStableDiffusion
Anyone interested in this, or did someone else make it already? LTX 2.3 Desktop - LoRA injector + my own prompt tool
https://redd.it/1shjyg8
@rStableDiffusion
ComfyUI - disappearing workflows
Gentlemen, what am I doing wrong? For some time now, whenever I launch ComfyUI, only one project is open, even though I had multiple tabs open when I closed it. That alone isn't the problem, but sometimes unclosed tabs overwrite one another for some reason...
I made a beautiful SDXL table workflow, and today an old workflow is saved over it, one I opened yesterday for literally five seconds just to copy a single element... What am I doing wrong? How can I protect myself against this uncontrolled overwriting?
https://redd.it/1shnqi4
@rStableDiffusion
After ~400 Z-Image Turbo gens I finally figured out why everyone's portraits look plastic
Been using Z-Image Turbo pretty heavily since it dropped and wanted to dump some notes here because I kept seeing the same complaints I had on day one and nobody was really answering them properly.
The thing I kept running into: every portrait looked like a skincare ad. Glossy skin, symmetrical face, that weird "influencer default" look. I tried every SDXL trick I knew. "Average person", "realistic", "not a model", "amateur photo", "candid". Basically nothing moved the needle. I was ready to write the model off as another Flux-lite.
Then I saw 90hex's post here a while back about using actual photography vocabulary and something clicked. I'd been prompting Z-Image like it was SDXL when the encoder is clearly trained on way more specific stuff. Once I started naming actual cameras and film stocks instead of emotional modifiers, the plastic problem basically evaporated.
**A few things that genuinely surprised me:**
1. **"Point-and-shoot film camera" is the single highest-leverage phrase I've found.** Drops the model out of beauty-default mode faster than any combination of "realistic/candid/amateur" ever did. "35mm film camera" works too. "iPhone snapshot with handheld imperfection" works. "Disposable camera" works. The common thread is naming a physical piece of gear with a real visual fingerprint.
2. **Words like "masterpiece, 8k, etc" do almost nothing.** I ran A/B tests on 20 prompts with and without the usual quality spam and the outputs were basically indistinguishable. The S3-DiT encoder clearly wasn't trained on that vocabulary the way SD1.5 was. Replace that whole block with one camera + one film stock and you get way more signal per token.
3. **Negative prompts are legitimately dead at cfg 0.** I know the docs say this but I didn't fully believe it until I tested. Putting "blurry, ugly, deformed, bad anatomy" in the negative field does absolutely nothing at the default cfg. If you bump cfg to 1.2-2.0 in Comfy some effect comes back but Turbo starts overcooking and the speed advantage evaporates. Just write constraints as presence instead. "Clean studio background, sharp focus, plain seamless backdrop" is way more effective than any negative prompt I tried.
4. **The bracket trick is the best-kept secret in this community.** 90hex mentioned it in passing and I don't think people realize how powerful it is for building character consistency without training a LoRA. Wrap alternatives in {this|that|the other} inside one prompt, batch 32, and you get an entire photoshoot of the same person across different cameras, lighting, poses, and moods. I've been using it to build reference libraries for characters I want to stay consistent across a short series. Zero training required. It's absurd. (See the sketch after this list for how the expansion works.)
5. **Attention cap is real.** Past about 75-100 effective tokens the model starts to drift. If you're writing 400-word prompts (I was) you're actively hurting yourself. 3-5 strong concepts, subject first, any quoted text second. The rest is gravy.
6. **Prefix/suffix style presets are a cheat code.** Saw DrStalker's 70-styles post a while back and started building my own table. Same base scene wrapped in different style prefix/suffix pairs gives you a pile of completely different looks with zero rewriting. Cinematic photo, medium format, analog film, Ansel Adams landscape, neon noir, dieselpunk, Ghibli-like, Moebius-like, pixel art, stained glass. Game changer for iteration speed.
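For reference, here is a minimal sketch of what the bracket expansion in point 4 does, written as plain Python; ComfyUI's dynamic-prompt handling does the equivalent once per queued generation, and the helper name here is made up:

```python
import random
import re

def expand_brackets(prompt: str, rng: random.Random) -> str:
    """Replace each {opt1|opt2|...} group with one randomly chosen option."""
    pattern = re.compile(r"\{([^{}]*)\}")
    while (m := pattern.search(prompt)):
        prompt = prompt[:m.start()] + rng.choice(m.group(1).split("|")) + prompt[m.end():]
    return prompt

base = ("portrait of the same woman, "
        "{point-and-shoot film camera|35mm film camera|disposable camera}, "
        "{window light|overcast daylight|on-board flash}, "
        "{neutral expression|mid-laugh}")
rng = random.Random(42)
for _ in range(4):  # a batch of variants = one "photoshoot"
    print(expand_brackets(base, rng))
```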
**The prompt that finally unstuck me:**
>
First time I got an output that looked like an actual person I'd see on the street and not a magazine cover. The trick is stacking "realistic ordinary everyday" (which does nothing alone) with a specific equipment spec (which does everything). The equipment word is the anchor. The ordinary words only work once the anchor is there.
**A few more things I've been testing that seem to work:**
* "Shot on Kodak Portra 400" for warm skin tones that don't look airbrushed
* "Ilford HP5 black and white" for actual film B&W grain
Been using Z-Image Turbo pretty heavily since it dropped and wanted to dump some notes here because I kept seeing the same complaints I had on day one and nobody was really answering them properly.
The thing I kept running into: every portrait looked like a skincare ad. Glossy skin, symmetrical face, that weird "influencer default" look. I tried every SDXL trick I knew. "Average person", "realistic", "not a model", "amateur photo", "candid". Basically nothing moved the needle. I was ready to write the model off as another Flux-lite.
Then I saw 90hex's post here a while back about using actual photography vocabulary and something clicked. I'd been prompting Z-Image like it was SDXL when the encoder is clearly trained on way more specific stuff. Once I started naming actual cameras and film stocks instead of emotional modifiers, the plastic problem basically evaporated.
**A few things that genuinely surprised me:**
1. **"Point-and-shoot film camera" is the single highest-leverage phrase I've found.** Drops the model out of beauty-default mode faster than any combination of "realistic/candid/amateur" ever did. "35mm film camera" works too. "iPhone snapshot with handheld imperfection" works. "Disposable camera" works. The common thread is naming a physical piece of gear with a real visual fingerprint.
2. **Words like "masterpiece, 8k, etc" do almost nothing.** I ran A/B tests on 20 prompts with and without the usual quality spam and the outputs were basically indistinguishable. The S3-DiT encoder clearly wasn't trained on that vocabulary the way SD1.5 was. Replace that whole block with one camera + one film stock and you get way more signal per token.
3. **Negative prompts are legitimately dead at cfg 0.** I know the docs say this but I didn't fully believe it until I tested. Putting "blurry, ugly, deformed, bad anatomy" in the negative field does absolutely nothing at the default cfg. If you bump cfg to 1.2-2.0 in Comfy some effect comes back but Turbo starts overcooking and the speed advantage evaporates. Just write constraints as presence instead. "Clean studio background, sharp focus, plain seamless backdrop" is way more effective than any negative prompt I tried.
4. **The bracket trick is the best-kept secret in this community.** 90hex mentioned it in passing and I don't think people realize how powerful it is for building character consistency without training a LoRA. Wrap alternatives in {this|that|the other} inside one prompt, batch 32, and you get an entire photoshoot of the same person across different cameras, lighting, poses, and moods. I've been using it to build reference libraries for characters I want to stay consistent across a short series. Zero training required. It's absurd.
5. **Attention cap is real.** Past about 75-100 effective tokens the model starts to drift. If you're writing 400-word prompts (I was) you're actively hurting yourself. 3-5 strong concepts, subject first, any quoted text second. The rest is gravy.
6. **Prefix/suffix style presets are a cheat code.** Saw DrStalker's 70-styles post a while back and started building my own table. Same base scene wrapped in different style prefix/suffix pairs gives you a pile of completely different looks with zero rewriting. Cinematic photo, medium format, analog film, Ansel Adams landscape, neon noir, dieselpunk, Ghibli-like, Moebius-like, pixel art, stained glass. Game changer for iteration speed.
**The prompt that finally unstuck me:**
>
First time I got an output that looked like an actual person I'd see on the street and not a magazine cover. The trick is stacking "realistic ordinary everyday" (which does nothing alone) with a specific equipment spec (which does everything). The equipment word is the anchor. The ordinary words only work once the anchor is there.
**A few more things I've been testing that seem to work:**
* "Shot on Kodak Portra 400" for warm skin tones that don't look airbrushed
* "Ilford HP5 black and white" for actual film B&W grain
that looks better than any "monochrome high contrast" prompt I tried
* "Cinestill 800T" for night scenes with that halation glow around lights
* Adding "slightly asymmetrical features" or "faint laugh lines" to portraits kills the symmetry default
* "On-board flash falloff" gives you that candid snapshot look with the harsh foreground light and falling-off background
**Stuff I'm still figuring out:**
* LoRA weights feel different than SDXL. Anything above 0.85 tends to overcook. Anyone else seeing this?
* Text rendering is good but seems to tank if the prompt is too long. I think the model budgets attention between scene description and typography and long prompts starve the text encoder. Curious if others have tested this.
* Bilingual prompts (EN + CN in the same prompt) sometimes produce better English typography than pure EN prompts. No idea why. Might be a training data quirk.
* Hands are genuinely fixed but feet still look weird like 30% of the time. Haven't found a reliable fix yet.
https://preview.redd.it/zrkeynx1ndug1.jpg?width=1920&format=pjpg&auto=webp&s=6ca058e66cc4c7e174f2f07ce5f6499cb15694d7
https://preview.redd.it/v557bkw7pdug1.jpg?width=1920&format=pjpg&auto=webp&s=250b92caf4634f2e40cc588728bcfdb96ec1ad2d
https://preview.redd.it/jhtxz9ecpdug1.jpg?width=1920&format=pjpg&auto=webp&s=3ba407eb55529659d95e8aca043076eea025ce3f
https://preview.redd.it/4ezi3rmhpdug1.jpg?width=1920&format=pjpg&auto=webp&s=5df585e2ced71d89e5b826941155e62a046a7f1e
https://preview.redd.it/ymibzw0lpdug1.jpg?width=1920&format=pjpg&auto=webp&s=13a51528f6849298b25e69054e3335eb65bdf741
https://preview.redd.it/c740vz9ppdug1.jpg?width=1920&format=pjpg&auto=webp&s=078a0239cc2a424c27a9b75c5a35881310b22b54
https://redd.it/1shpbbb
@rStableDiffusion
* "Cinestill 800T" for night scenes with that halation glow around lights
* Adding "slightly asymmetrical features" or "faint laugh lines" to portraits kills the symmetry default
* "On-board flash falloff" gives you that candid snapshot look with the harsh foreground light and falling-off background
**Stuff I'm still figuring out:**
* LoRA weights feel different than SDXL. Anything above 0.85 tends to overcook. Anyone else seeing this?
* Text rendering is good but seems to tank if the prompt is too long. I think the model budgets attention between scene description and typography and long prompts starve the text encoder. Curious if others have tested this.
* Bilingual prompts (EN + CN in the same prompt) sometimes produce better English typography than pure EN prompts. No idea why. Might be a training data quirk.
* Hands are genuinely fixed but feet still look weird like 30% of the time. Haven't found a reliable fix yet.
https://preview.redd.it/zrkeynx1ndug1.jpg?width=1920&format=pjpg&auto=webp&s=6ca058e66cc4c7e174f2f07ce5f6499cb15694d7
https://preview.redd.it/v557bkw7pdug1.jpg?width=1920&format=pjpg&auto=webp&s=250b92caf4634f2e40cc588728bcfdb96ec1ad2d
https://preview.redd.it/jhtxz9ecpdug1.jpg?width=1920&format=pjpg&auto=webp&s=3ba407eb55529659d95e8aca043076eea025ce3f
https://preview.redd.it/4ezi3rmhpdug1.jpg?width=1920&format=pjpg&auto=webp&s=5df585e2ced71d89e5b826941155e62a046a7f1e
https://preview.redd.it/ymibzw0lpdug1.jpg?width=1920&format=pjpg&auto=webp&s=13a51528f6849298b25e69054e3335eb65bdf741
https://preview.redd.it/c740vz9ppdug1.jpg?width=1920&format=pjpg&auto=webp&s=078a0239cc2a424c27a9b75c5a35881310b22b54
https://redd.it/1shpbbb
@rStableDiffusion
JoyAI-Image-Edit now has ComfyUI support
https://github.com/jd-opensource/JoyAI-Image
It's very good at spatial awareness.
It would be interesting to do a more detailed comparison with Qwen Image Edit.
https://redd.it/1show8s
@rStableDiffusion
GitHub - jd-opensource/JoyAI-Image: a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing.
"Live AI video" is doing too much lifting as a term. Here's a breakdown of what people actually mean.
The phrase is everywhere right now, but it's covering at least three meaningfully different things that keep getting conflated:
1. Faster post-production. The model still generates a discrete clip, it just does it quicker than it used to. Useful, but this is throughput improvement, not liveness.
2. Low-latency iteration. You can tweak and regenerate fast enough that it feels interactive. Still clip-based under the hood. Great UX, but the model still isn't responding to a continuous stream.
3. Actual real-time inference on a live stream. The model is continuously generating frames in response to incoming input, not producing clips at all. This is a fundamentally different architecture and a much harder problem.
The third category is where things get genuinely interesting from a technical standpoint. Decart is one of the few doing this for real, but because demos for all three can look superficially similar, the distinction gets lost. Vendors have every incentive to let it stay lost. Worth being precise about which one you're actually evaluating if you're building anything serious on top of this.
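To make the distinction concrete, here is a schematic sketch of the two interaction shapes; every function here is a hypothetical stand-in, not any vendor's API:

```python
from typing import Iterable, Iterator, List

Frame = bytes  # stand-in type; a real system would pass decoded tensors

def generate_clip(prompt: str, num_frames: int) -> List[Frame]:
    """Hypothetical clip model: one request in, one finished clip out."""
    return [b"frame"] * num_frames

def generate_next_frame(prompt: str, input_frame: Frame) -> Frame:
    """Hypothetical streaming model: one input frame in, one output frame out."""
    return input_frame  # identity stub

def categories_1_and_2(prompt: str) -> List[Frame]:
    # Clip-based: "faster" (1) or "lower-latency" (2) only changes how long
    # this call takes; the interaction is still request -> discrete clip.
    return generate_clip(prompt, num_frames=121)

def category_3(prompt: str, camera: Iterable[Frame]) -> Iterator[Frame]:
    # Stream-based: the model consumes a continuous input stream and must
    # emit each output frame under a per-frame deadline; no clips exist.
    for input_frame in camera:
        yield generate_next_frame(prompt, input_frame)
```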
https://redd.it/1shogaz
@rStableDiffusion
Qwen3.5-4B-Base-ZitGen-V1
Hi,
I'd like to share a fine-tuned LLM I've been working on. It's optimized for image-to-prompt and is only 4B parameters.
Model: https://huggingface.co/lolzinventor/Qwen3.5-4B-Base-ZitGen-V1
I thought some of you might find it interesting. It is an image-captioning fine-tune optimized for Stable Diffusion prompt generation (i.e., image-to-prompt). Is there a ComfyUI custom node that would allow this to be added to a ComfyUI workflow, i.e., LLM-based captioning?
# What Makes This Unique
What makes this fine-tune unique is that the dataset (images + prompts) was generated by LLMs tasked with using the ComfyUI API to regenerate a target image.
# The Process
The process is as follows (a rough code sketch follows the list):
1. The target image and the last generated image (blank if it's the first step) are provided to an LLM with a comparison prompt.
2. The LLM outputs a detailed description of each image and the key differences between them.
3. The comparison results and the last generated prompt (empty if it's the first step) are provided to an LLM with an SD generation prompt.
4. The output prompt is sent to the ComfyUI API using Z-Image Turbo, and the output image is captured.
5. Repeat N times.
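A rough sketch of that loop in Python; the three helpers are hypothetical stand-ins for the two LLM calls and the ComfyUI API request, not real functions from the repo:

```python
def compare_images(target, generated):
    """LLM call: describe both images and list their key differences."""
    ...

def write_prompt(diff_report, last_prompt):
    """LLM call: revise the SD prompt from the report and the last prompt."""
    ...

def comfy_generate(prompt):
    """POST the prompt to the ComfyUI API (Z-Image Turbo) and fetch the image."""
    ...

def refine(target_image, rounds=5):
    prompt, generated = "", None          # empty prompt / blank image on step 1
    for _ in range(rounds):               # the post used 4-6 rounds
        diff_report = compare_images(target_image, generated)  # steps 1-2
        prompt = write_prompt(diff_report, prompt)             # step 3
        generated = comfy_generate(prompt)                     # step 4
    return prompt, generated              # final prompt-image training pair
```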
# Training Details
The system employed between 4 and 6 rounds of comparison and correction to generate each prompt-image pair. In theory, this process adapts the prompt to minimize the difference between the target image and the generated image, thereby tailoring the prompt to the specific SD model being used.
The prompts were then ranked and filtered to remove occasional LLM errors, such as residuals from the original prompt or undesirable artifacts (e.g., watermarks). Finally, the prompts and images were formatted into the ShareGPT dataset format and used to train Qwen 3.5 4B.
https://redd.it/1shvuxa
@rStableDiffusion
LTX 2.3 - Image + Audio + Video ControlNet (IC-LoRA) to Video
https://redd.it/1shxv8n
@rStableDiffusion
Ace Step 1.5 XL ComfyUI automation workflow (no Ollama needed) that generates random tags with Qwen, generates a song, and then rates it via waveform analysis
The idea came to me after sorting through a lot of Ace Step 1.5 XL outputs, trying to find the best styles and tags for songs. Why not automate the generation process AND the review process, or at least make them easier? So, as usual, I used Qwen LM and Qwen VL (unlike something like Ollama, these run directly in Comfy and don't require a server) to randomize the tags on each run, but more importantly to try to rate the output. How? By converting the audio output into a set of waveforms for 4 segments of the song, which I feed into Qwen VL as an image, asking it to subjectively look at the waveform and give feedback and a rating; the rating is then also used to name the output file. Like this. I am not sure it works properly, but the A+ rated songs were indeed better than the B rated ones.
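Outside Comfy, the waveform-image step could look something like this; a minimal sketch assuming soundfile and matplotlib, with a made-up filename (the actual workflow does this with ComfyUI nodes):

```python
import soundfile as sf
import matplotlib.pyplot as plt

audio, sr = sf.read("song.wav")          # hypothetical output file
seg = len(audio) // 4                    # split the song into 4 segments
fig, axes = plt.subplots(4, 1, figsize=(12, 8))
for i, ax in enumerate(axes):
    ax.plot(audio[i * seg:(i + 1) * seg])  # waveform for one quarter of the song
    ax.set_title(f"segment {i + 1}")
    ax.set_axis_off()
fig.tight_layout()
fig.savefig("waveforms.png")             # this image goes to Qwen VL for rating
plt.close(fig)
```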
Workflow is here. Install the missing extensions and add the qwen models.
Here is part of the working flow, including output folder.
https://preview.redd.it/kpar4blijfug1.jpg?width=1280&format=pjpg&auto=webp&s=cf2b4e5491c8b237d29e9649d90d40c6172090a9
https://preview.redd.it/oxtxaf8kjfug1.jpg?width=1400&format=pjpg&auto=webp&s=643c100c7fe05bb5184551edd0b7a34d99476ddf
https://preview.redd.it/3old46smjfug1.jpg?width=1592&format=pjpg&auto=webp&s=07b366afe5ae259b11fbd86cf2332c56ab9192ea
https://redd.it/1shzm63
@rStableDiffusion