🧭 Why I’m especially interested in this paper
The main reason I wanted to highlight this work is that moment matching did not stop at continuous diffusion models. The same authors later developed this direction further for Discrete Diffusion Models in:
Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD
This is particularly relevant for us because it is closely connected to our recent work:
IDLM: Inverse-distilled Diffusion Language Models — ICML 2026
Both lines of work are trying to solve a similar bottleneck: discrete diffusion models and diffusion language models can be high-quality, but inference is still expensive because generation usually requires many iterative sampling steps. So the key question is how to distill these models into much faster few-step generators without collapsing quality or diversity.
We will discuss this broader direction at the Popular Reading Group meeting devoted to Discrete Diffusion Models on Monday, May 18.
Please join the discussion chat
The meeting link will be shared later in our Telegram chat
Hi everyone!
I’m David Li, a first-year PhD student at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). My research interests are mainly in generative models, diffusion models, optimal transport, and related areas of machine learning.
I created LiSearch to share short notes about new papers, interesting ideas, and possible research directions in generative modeling and ML.
The main motivation for this channel is discussion. I don’t want it to be just a list of paper summaries; I’d like it to become a place where researchers and ML enthusiasts can exchange thoughts, ask questions, criticize ideas, suggest papers, and discuss what may be worth exploring next.
I’ll be very glad to see any activity here: comments, questions, opinions, links to papers, or your own research ideas.
A bit more about me:
Google Scholar: https://scholar.google.com/citations?hl=en&user=L88Qc4YAAAAJ
LinkedIn: https://www.linkedin.com/in/david-li-ab07b332b
Telegram: @kekchpek
Welcome to LiSearch!
I’ve published a video where I explain the paper “Multistep Distillation of Diffusion Models via Moment Matching”.
I also made a LinkedIn post where I explain the overall idea behind this channel.
I don’t want to repeat everything here, but the main point is this: for all the “beauty” parts, such as YouTube video icons, thumbnails, and similar visuals, I’ll use neural networks. I don’t want to spend too much time on that manually, so that’s why you may see some funny faces or weird-looking icons in the YouTube videos😁
I’ll try to keep the meetings weekly and post new videos every week. If something doesn’t work out, I’ll announce it separately.
I hope this video helps you understand the paper better. I’ll be happy to discuss any ideas, questions, or thoughts in the comments!
IDLM: Inverse-distilled Diffusion Language Models (ICML 2026, our recent work)
Paper | Code | Checkpoints
⚡️ Can a language model generate a 1024-token sequence in just 16 forward passes?
That would mean producing 1024/16=64 tokens per forward pass.
For today’s standard language models, this sounds almost impossible. They are autoregressive, meaning they generate text token by token: first token, then the next, then the next…
So generating 1024 tokens usually requires 1024 forward passes.
This is one of the biggest bottlenecks in LLM inference.
A promising alternative is Diffusion Language Models. Instead of generating tokens one by one, they try to generate or refine the whole sequence in parallel, potentially removing the need for strict autoregressive decoding.
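To make the forward-pass arithmetic concrete, here is a minimal sketch that only counts model calls. `CountingModel`, `MASK`, and both decode functions are hypothetical stand-ins written for this note: the "model" returns random tokens, and the parallel decoder commits positions in a fixed left-to-right order, which is an assumption made for simplicity (real samplers pick positions differently).

```python
import random

MASK = -1  # hypothetical placeholder id for a not-yet-generated position


class CountingModel:
    """Stand-in for a language model: it only counts forward passes."""

    def __init__(self, vocab_size=100, seed=0):
        self.vocab_size = vocab_size
        self.calls = 0
        self.rng = random.Random(seed)

    def forward(self, tokens):
        # One forward pass yields a (random) prediction for every position.
        self.calls += 1
        return [self.rng.randrange(self.vocab_size) for _ in tokens]


def autoregressive_decode(model, length):
    """Generate one token per forward pass, left to right."""
    tokens = []
    for _ in range(length):
        prediction = model.forward(tokens + [MASK])
        tokens.append(prediction[-1])  # keep only the newest position
    return tokens


def parallel_decode(model, length, steps):
    """Fill length // steps masked positions per forward pass."""
    assert length % steps == 0, "sketch assumes steps divides length"
    per_step = length // steps
    tokens = [MASK] * length
    for step in range(steps):
        prediction = model.forward(tokens)
        start = step * per_step
        # Commit a block of positions from this single pass.
        tokens[start:start + per_step] = prediction[start:start + per_step]
    return tokens


ar_model = CountingModel()
ar_tokens = autoregressive_decode(ar_model, 1024)

par_model = CountingModel()
par_tokens = parallel_decode(par_model, 1024, steps=16)

print("autoregressive forward passes:", ar_model.calls)
print("parallel forward passes:", par_model.calls)
```

The sketch says nothing about quality: committing 64 positions from one pass is only useful if the model captures how those positions depend on each other.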
In theory, this could make generation much faster.
But in practice, diffusion-based language models often turn out to be slower, not faster, than autoregressive models.
The main challenge is the dimensionality of the output space.
If we want the model to generate the next 64 tokens at once, it is not enough to predict one token 64 times independently. Ideally, the model should approximate the joint distribution over all possible 64-token continuations.
But the number of such continuations is enormous.
Even for a relatively small vocabulary, like GPT-2’s 50,257 tokens, the number of possible 64-token sequences is:
50,257⁶⁴ ≈ 10³⁰¹
That is an astronomically large number. We cannot enumerate these possibilities, store them, or explicitly simulate such a distribution.
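The count of length-L sequences over a vocabulary of size V is V^L, so the easiest way to check the order of magnitude is to work in log10. The helper name `log10_num_sequences` and the vocabulary sizes in the loop are illustrative choices for this sketch.

```python
import math


def log10_num_sequences(vocab_size: int, length: int) -> float:
    """Return log10(vocab_size ** length) without building the huge integer."""
    return length * math.log10(vocab_size)


# Illustrative vocabulary sizes; 64 is the block length discussed above.
for vocab_size in (10_000, 50_257, 60_000):
    exponent = log10_num_sequences(vocab_size, 64)
    print(f"V = {vocab_size}: about 10^{exponent:.1f} sequences of length 64")
```

For comparison, the exponent only grows linearly in the block length, but the count itself grows exponentially, which is why a table over all continuations is out of reach.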
So the real question becomes:
Can we generate many tokens in parallel while keeping the model’s complexity only linear in sequence length?
The answer is yes, and the key idea is a mixture of distributions.
Instead of trying to explicitly model all possible 64-token continuations, we introduce a latent variable ε sampled from a simple latent space.
Then the model generates tokens independently conditioned on ε.
At first glance, this looks too simple: independent tokens cannot capture complex text structure, right?
But the important part is that tokens are independent only after conditioning on ε.
The final text distribution is obtained by averaging over all latent variables (see the first attached image).
So the model is actually a mixture of many factorized distributions.
And this is powerful: the generator Gθ can encode global structure, style, topic, dependencies, and correlations between tokens through the latent variable ε. As a result, the marginal distribution over text can still be highly expressive, even though each conditional distribution is factorized.
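A tiny worked example of why this works, under assumptions chosen purely for illustration: two binary tokens, a binary latent ε with a uniform prior, and each token agreeing with ε with probability 0.9. Given ε the tokens are independent, yet the marginal p(x) = Σ_ε p(ε) Π_i p(x_i | ε) is clearly not a product of per-token marginals. All names below (`P_EPS`, `AGREE`, `marginal`, and so on) are hypothetical.

```python
from itertools import product

P_EPS = {0: 0.5, 1: 0.5}  # toy prior over the latent variable
AGREE = 0.9               # toy P(token == eps | eps), same for each position


def token_prob(token, eps):
    return AGREE if token == eps else 1.0 - AGREE


def conditional(x, eps):
    """p(x | eps): a product over positions, i.e. fully factorized."""
    p = 1.0
    for token in x:
        p *= token_prob(token, eps)
    return p


def marginal(x):
    """p(x): average of the factorized conditionals over the latent."""
    return sum(P_EPS[eps] * conditional(x, eps) for eps in P_EPS)


def position_marginal(i, value):
    """p(x_i = value), obtained by summing the joint over the other token."""
    return sum(marginal(x) for x in product((0, 1), repeat=2) if x[i] == value)


joint = {x: marginal(x) for x in product((0, 1), repeat=2)}
independent = {
    x: position_marginal(0, x[0]) * position_marginal(1, x[1]) for x in joint
}

for x in joint:
    print(x, "mixture:", round(joint[x], 2), "product of marginals:", round(independent[x], 2))
```

The mixture puts most of its mass on the two "agreeing" pairs, while a model without ε that only knows the per-token marginals spreads the mass evenly.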
This direction has already been explored in recent works such as Di4C (ICML 2025) and VADD (ICLR 2026).
The last image is a great illustration from VADD. Without a latent variable, a factorized model fails to capture dependencies; this is what happens with MDLM. But with a latent variable, VADD can recover structured distributions like checkerboards and spirals.
However, there is still a major problem.
In VADD, these mixtures are trained with a VAE-style objective. And in practice, VAE losses can be fragile: they require balancing reconstruction quality against the regularization term. If this balance is not right, the model can learn poor latent representations and produce weak samples.
So the real question becomes:
Can we design a better loss function for training mixtures of distributions?
🚀 How do we train this mixture of distributions?
To do that, we use the idea of Inverse Distillation for Discrete Diffusion Models, introduced in our previous work, IBMD (ICML 2025).
The idea is simple.
Usually, diffusion models are trained in the forward direction:
we have samples from the real data distribution p*,
and we train a diffusion model f* by minimizing some loss:
f* = argmin_f L(f, p*)
This is the standard diffusion training pipeline.
Data → train diffusion model.
But in Inverse Distillation, we flip the problem.
Instead of starting from data, we start from a pretrained diffusion model f*.
Now we want to learn a generator distribution pθ such that:
if we trained a diffusion model on samples from pθ,
the optimal diffusion model would be the given pretrained model f*.
Formally:
f* = argmin_f L(f, pθ)
So we are asking:
What distribution could have produced this pretrained diffusion model?
In the ideal case, the generator Gθ should produce samples from the same distribution that the teacher diffusion model was originally trained on.
In other words:
pθ = p*
This is the core intuition behind Inverse Distillation.
The generator does not just imitate individual samples.
It learns to produce samples whose induced diffusion training process matches the teacher model.
But there is a problem.
We cannot directly optimize this objective, because it contains an inner optimization:
argmin_f L(f, pθ)
This “train a diffusion model to optimality” step is not something we can directly differentiate through in practice.
So we introduce a practical loss: the IDLM loss (third image).
The intuition is the following:
The first term measures how well the pretrained teacher diffusion model f* fits the samples generated by pθ.
The second term asks:
what is the best possible diffusion model f̂ we could train on these generated samples?
Then the generator minimizes the gap between them.
So Gθ is trained to produce samples for which the teacher model f* is already optimal.
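In the notation above, the gap described in words can be written as L(f*, pθ) − min_f L(f, pθ). It is non-negative by construction and equals zero exactly when f* is itself a minimizer. Below is a toy scalar analogue, not the actual IDLM loss: the "diffusion model" is a single number f, L is a mean squared error, and the generator just shifts fixed noise by θ. The alternating student/generator updates are an assumption of this sketch about how one might handle the inner minimization, and all names are hypothetical.

```python
import random


def loss(f, samples):
    """Toy stand-in for L(f, p): mean squared error of a scalar 'model' f."""
    return sum((f - x) ** 2 for x in samples) / len(samples)


def generate(theta, noise):
    """Toy generator: p_theta is the fixed noise shifted by theta."""
    return [theta + n for n in noise]


def mean(values):
    return sum(values) / len(values)


def train(f_star, steps=200, lr_student=0.25, lr_generator=0.1, seed=0):
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(256)]
    theta, f_hat = 0.0, 0.0
    for _ in range(steps):
        mean_x = mean(generate(theta, noise))
        # Student step: gradient descent on L(f_hat, p_theta),
        # whose minimizer in this toy problem is the sample mean.
        f_hat -= lr_student * 2.0 * (f_hat - mean_x)
        # Generator step on gap = L(f_star, p_theta) - L(f_hat, p_theta)
        # with f_hat held fixed. Each term has derivative 2 * (mean_x - f)
        # with respect to theta, so the difference is 2 * (f_hat - f_star).
        theta -= lr_generator * 2.0 * (f_hat - f_star)
    samples = generate(theta, noise)
    gap = loss(f_star, samples) - loss(mean(samples), samples)
    return mean(samples), f_hat, gap


final_mean, final_student, final_gap = train(f_star=3.0)
print("mean of generated samples:", round(final_mean, 4))
print("student estimate:", round(final_student, 4))
print("remaining gap:", final_gap)
```

In this toy problem the gap is zero only when the generated samples have the mean that the "teacher" 3.0 would be optimal for, which mirrors the statement that Gθ should produce samples for which f* is already optimal.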
🧪 Experiments: how much can we accelerate diffusion language models?
All models were trained on the OWT dataset.
For evaluation, we tracked two metrics:
GenPPL — generation perplexity
Entropy — diversity of generated text
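For readers unfamiliar with these two metrics, here is a minimal sketch of the underlying formulas. It assumes GenPPL is the perplexity of generated text under a separate pretrained evaluator model (the evaluator is not part of this sketch, so the per-token log-probabilities are passed in directly), and that Entropy is the Shannon entropy of the empirical token distribution within a sample. Both definitions, the use of natural logarithms, and the function names are assumptions of this note, not details confirmed above.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def token_entropy(tokens):
    """Shannon entropy in nats of the empirical token distribution of one sample."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# A sample that repeats one token has zero entropy, however "likely" it is.
repetitive = [7, 7, 7, 7, 7, 7, 7, 7]
varied = [0, 1, 2, 3, 4, 5, 6, 7]
print("entropy (repetitive):", token_entropy(repetitive))
print("entropy (varied):", token_entropy(varied))

# Hypothetical evaluator log-probabilities, each token assigned probability 0.25.
print("perplexity:", perplexity([math.log(0.25)] * 8))
```

Reporting both numbers together matters because a low perplexity on its own can be reached by repetitive text, which the entropy would expose.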
The goal was simple:
⚡️ reduce the number of diffusion steps as much as possible
while keeping GenPPL and Entropy almost unchanged.
In other words, we wanted acceleration without sacrificing sample quality or diversity.
And here is what we got 👇
1. Masked diffusion distillation
For masked diffusion, our IDLM-MDLM model achieved:
Result: 🔥 64× acceleration
with no noticeable degradation in GenPPL and Entropy.
(See the first image)
2. Uniform diffusion distillation
For uniform diffusion, our IDLM-DCD model achieved:
Result: 🚀 256× acceleration
again, without sacrificing the measured metrics.
(See the second image)
So the takeaway is:
Inverse Distillation allows us to compress many diffusion steps into just a few, while preserving the behavior of the original model.
This is exactly what we need for practical parallel generation:
fewer steps, faster sampling, same quality signals.
P.S. Feel free to destroy the whole GenPPL + Entropy evaluation protocol in the comments 😄
🎤 Last but not least, let’s discuss the details live!
If you are interested in the technical details behind our work, feel free to ask questions in the comments. I’ll be happy to discuss them.
Also, I’m excited to share that my colleague Nikita Gushchin and I have been invited to the DLLM Reading Group.
There, we will dive deeper into the practical side of our method:
how the objective works,
how the distillation is implemented,
and what happens in experiments.
I would be really glad to see you there!
📢 Announcement.
🔗 Meeting link.
▶️ DLLM Reading Group
P.S. Huge thanks to the organizers of the DLLM Reading Group for the invitation! 🙌
📢 Missed the talk? Check out the recording on YouTube: https://youtu.be/RZ6_huata1Y
Huge thanks to the organizers for the recordings!
YouTube
S17 | IDLM: Inverse-distilled Diffusion Language Models
Diffusion Language Models (DLMs) have recently achieved strong results in text generation, but their multi-step sampling makes inference slow and limits practical use. This work extends Inverse Distillation, a technique originally developed for accelerating…