Built a Character Portrait Generator that reads books, identifies characters, and generates consistent portraits using ComfyUI (full RAG pipeline, local LLM, open-source)
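For readers curious how the pieces could fit together, here is a minimal sketch of a book-to-portrait pipeline in this spirit. It assumes a local OpenAI-compatible LLM server (e.g. Ollama) and a ComfyUI instance on its default API port; the endpoints, node IDs, and prompt wording are illustrative assumptions, not the author's actual code.

```python
# Hypothetical sketch: book text -> character descriptions -> ComfyUI portraits.
# Assumes a local OpenAI-compatible LLM server and a default ComfyUI API endpoint.
import json
import requests

LLM_URL = "http://localhost:11434/v1/chat/completions"   # assumed local LLM endpoint
COMFY_URL = "http://localhost:8188/prompt"                # default ComfyUI API route

def extract_characters(chapter_text: str) -> list[dict]:
    """Ask the local LLM for character names and physical descriptions."""
    resp = requests.post(LLM_URL, json={
        "model": "local-model",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": "List the characters in this text as JSON "
                       '[{"name": ..., "description": ...}]:\n' + chapter_text,
        }],
    })
    # Simplification: assumes the model returns valid JSON with no extra text.
    return json.loads(resp.json()["choices"][0]["message"]["content"])

def queue_portrait(workflow: dict, description: str) -> None:
    """Inject the character description into a ComfyUI workflow and queue it."""
    # "6" is a placeholder node ID for the positive-prompt text node in your workflow.
    workflow["6"]["inputs"]["text"] = f"portrait of {description}, detailed face"
    requests.post(COMFY_URL, json={"prompt": workflow})
```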

https://redd.it/1sy3c62
@rStableDiffusion
Got early access to LingBot-World-Fast at 17 FPS! Here's what I found.

https://redd.it/1sy80zc
@rStableDiffusion
Ernie VS Qwen and ZiT - Big Test

A large test of 100 images in a gallery

https://www.deviantart.com/slide3d/gallery/100815775/ernie-vs-qwen-and-zit-big-test

Big image generator showdown: 100 prompts, 3 models, 1 winner.
This comparison brings together three open image models with very different strengths. ERNIE-Image-Turbo from Baidu is an 8B distilled text-to-image model built on the same single-stream Diffusion Transformer family as ERNIE-Image. It is designed for fast generation in just 8 inference steps, with a strong focus on prompt fidelity, text rendering, and structured compositions such as posters, comics, infographics, and multi-panel layouts. Baidu also says it can run on consumer GPUs with 24 GB of VRAM, which makes it one of the more practical high-speed contenders in this test.

Qwen-Image-2512 is the December update of Qwen’s image model. According to its official model card, this version improves human realism, reduces the typical “AI-generated” look, adds finer natural detail, and strengthens text rendering and layout quality compared with the base Qwen-Image release. Qwen also states that after more than 10,000 blind evaluation rounds on AI Arena, Qwen-Image-2512 ranked as the strongest open-source model while remaining competitive with closed-source systems.

Z-Image-Turbo from Tongyi-MAI takes a different route: it is a 6B distilled model optimized for efficiency and speed. Its official release highlights generation in only 8 NFEs, sub-second latency on H800 GPUs, and deployment on 16 GB consumer GPUs. The team positions it as especially strong in photorealistic image generation, bilingual English/Chinese text rendering, and instruction following. Tongyi-MAI also reports that Z-Image-Turbo ranked 8th overall on the Artificial Analysis text-to-image leaderboard and was the top open-source model there at the time of that announcement.

Why this test matters:
This is not just a simple side-by-side comparison; it is a clash of priorities. ERNIE-Image-Turbo looks like the speed-and-structure specialist, Qwen-Image-2512 the realism-and-overall-quality contender, and Z-Image-Turbo the efficiency-focused challenger with strong photorealism and bilingual text capabilities. On paper, all three have a strong case. The point of a 100-image test is to see which one actually holds up across the same prompts, under the same conditions, once the marketing claims are stripped away.
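A fair way to run such a head-to-head is to fix the prompts, seed, and step count across all models. Below is a rough sketch of that kind of harness using Hugging Face diffusers; the repo IDs are placeholders, and the original gallery may well have been produced with ComfyUI instead, so treat this as an illustration of the setup rather than the tester's actual workflow.

```python
# Hedged sketch of a fixed-seed, fixed-prompt comparison harness.
# The repo IDs below are placeholders; swap in the real checkpoints.
import torch
from diffusers import DiffusionPipeline

MODELS = {
    "ernie-image-turbo": "baidu/ERNIE-Image-Turbo",   # placeholder repo id
    "qwen-image-2512":   "Qwen/Qwen-Image-2512",      # placeholder repo id
    "z-image-turbo":     "Tongyi-MAI/Z-Image-Turbo",  # placeholder repo id
}

def run_comparison(prompts: list[str], seed: int = 42, steps: int = 8):
    # Note: 8 steps suits the two distilled "turbo" models; the non-distilled
    # model may need its own recommended step count for a fair run.
    for name, repo in MODELS.items():
        pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.bfloat16).to("cuda")
        for i, prompt in enumerate(prompts):
            generator = torch.Generator("cuda").manual_seed(seed)  # same seed per prompt
            image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
            image.save(f"{name}_{i:03d}.png")
        del pipe
        torch.cuda.empty_cache()  # free VRAM before loading the next model
```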

https://preview.redd.it/fob69nizjyxg1.png?width=3080&format=png&auto=webp&s=0d76e8f6058f2499b32ff2ab45e19e628d695e5b

https://preview.redd.it/5nt47nizjyxg1.png?width=3080&format=png&auto=webp&s=f406fb2344bc6e328e44c536e84e4fd0d0379fc4

https://preview.redd.it/6qqsgnizjyxg1.png?width=3080&format=png&auto=webp&s=d17754f33623310f102b0658cd0ac543e569d347

https://preview.redd.it/aslnenizjyxg1.png?width=3080&format=png&auto=webp&s=bfeb63aa26ecf7975c5af778e48e94aab9533e82

https://preview.redd.it/r81ghnizjyxg1.png?width=3080&format=png&auto=webp&s=da0747feb07e52465055a65c1d71a2d7ec994807

https://preview.redd.it/envwbnizjyxg1.png?width=3080&format=png&auto=webp&s=c1b31e18a457cb17086d1f52d7d19c29e2c32204

https://preview.redd.it/plk7gnizjyxg1.png?width=3080&format=png&auto=webp&s=f261f623451ee626de536e8ce33c4edb89d8abf6

https://preview.redd.it/wisfgnizjyxg1.png?width=3080&format=png&auto=webp&s=19d9e5bc7f37bda73fe986c14d788ba301b1b99c

https://preview.redd.it/m2t1jnizjyxg1.png?width=3080&format=png&auto=webp&s=081cf58cf87ed471cba809e897877c90a7ab98fa

https://preview.redd.it/7qru0oizjyxg1.png?width=3080&format=png&auto=webp&s=5db25c45617a575686342e8c3968e805f1bfd023

https://redd.it/1sy6a9k
@rStableDiffusion
Update: I'm going to fully fine-tune LTX 2.3 for 2D animation, and I'm looking for people who want to help with the dataset/training (all kinds of help are welcome.)

This is a follow-up to my previous post (for context): https://www.reddit.com/r/StableDiffusion/comments/1svrzzt/is_anyone_else_interested_in_buildingfinetuning/

Hi people of Reddit.

A few days ago I decided to try a full fine-tuning run of LTX 2.3. In a previous post, I talked about the problems LTX 2.3 has with 2D animation, and recently I had the chance to talk with people from the LTX team. They basically confirmed what I was already suspecting.

LTX did not receive that much 2D animation training, mainly because licensing this kind of data is difficult.

So after struggling with LoRA training, I decided that I wanted to do a full finetune of the model, with the goal of adding more 2D animation data into it. More specifically, I want to focus on high quality eastern 2D animation, since that is usually where the motion, acting, timing, compositing, and detail are strongest.

But while studying the architecture and trying to figure out the best way to do this full finetuning run, I realized that LTX is kind of a monster, and building a good and big dataset is much harder than it sounds.

So I'm making this post to ask if anyone wants to help with this process.

The main goal is to create a curated, high-quality dataset for a full fine-tune of LTX 2.3. From what I'm seeing, the minimum target for this kind of run should be around 5k clips. With a small dataset, the learning rate has to be kept low to avoid catastrophic forgetting and damaging the model; but if the dataset is both small and weak, the model will not learn enough, and the full fine-tune will probably not be very useful.

My current plan is to collect clips from some of the best animated works and build a dataset of around 5k clips, separated into three groups (see the curation sketch after this list).

1 - Less curated clips
These are clips that are probably good enough, but still need to be reviewed or filtered better.

2 - Highly curated clips
These are the best clips. Strong motion, clean composition, useful character acting, good animation timing, good effects, good line consistency, and generally high training value.

3 - Filtered or augmented clips
These would either be clips that pass some kind of quality filter, or high-quality clips modified with AI tools to make them slightly different while still helping the model learn useful motion and animation patterns.
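
One way to keep these three groups manageable is a simple manifest plus a filter script. The sketch below only illustrates that idea; the column names, tier labels, and motion-score threshold are assumptions, not the project's actual tooling.

```python
# Hypothetical manifest format and filter for the three curation tiers.
# Field names and thresholds are illustrative assumptions, not the project's spec.
import csv
from pathlib import Path

VALID_TIERS = {"less_curated", "highly_curated", "filtered_augmented"}

def load_manifest(path: str) -> list[dict]:
    """Read a clips.csv with columns: file, tier, caption, motion_score."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    clips = []
    for row in rows:
        if row["tier"] not in VALID_TIERS:
            continue  # skip clips that have not been sorted into a tier yet
        if not Path(row["file"]).exists():
            continue  # drop dead paths before training
        clips.append(row)
    return clips

def training_subset(clips: list[dict], min_motion: float = 0.5) -> list[dict]:
    """Keep highly curated clips plus anything above an (assumed) motion threshold."""
    return [c for c in clips
            if c["tier"] == "highly_curated" or float(c["motion_score"]) >= min_motion]
```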

The goal is not just to make the model “look anime.” That is not enough. The real goal is to improve its understanding of 2D animation in general.

Things like timing, spacing, pose changes, limited animation, smear frames, hair and clothing movement, water, smoke, impact effects, character acting, mouth shapes, and stylized camera movement.

With or without help, I'm planning to do this full fine-tuning run and release the result to the open-source community.

But if more people help, whether with GPUs, dataset curation, clip selection, captioning, or testing, the final result will probably be much better for everyone.

Right now, the most useful help would be dataset curation. Finding clips is easy. Finding clips that are actually useful for training is the hard part. (And I was also thinking about adding 2D "sexual" animation, but I haven't decided yet.)

I already have around 2k clips collected, and I also recently trained an experimental LoRA. I still need to organize the files and check which checkpoint is best before posting it on Civitai.

If anyone is interested in helping build a serious 2D animation fine-tune for LTX 2.3, you can join this Discord: https://discord.gg/MG2yUntvh

https://redd.it/1syczqo
@rStableDiffusion