AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 Qwen-Image-VAE-2.0 Technical Report

💡 The Qwen-Image-VAE-2.0 technical report presents a suite of high-compression Variational Autoencoders (VAEs) that improves both reconstruction fidelity and diffusability. It targets the reconstruction bottleneck that VAEs hit at high compression ratios. The authors propose an improved architecture with Global Skip Connections and expanded latent channels, scale training to billions of images, and add a synthetic rendering engine to improve performance in text-rich scenarios.
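
To make "high compression" and "expanded latent channels" concrete, here is a small arithmetic sketch. The downsampling factor and channel counts are hypothetical illustrations, not figures taken from the report.

```python
def latent_shape(height, width, downsample, latent_channels):
    """Shape (channels, rows, cols) of the latent grid for one image."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample, latent_channels, image_channels=3):
    """Number of input pixel values stored per latent value."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


# Hypothetical settings: the same 16x spatial downsampling, two channel widths.
narrow = compression_ratio(256, 256, downsample=16, latent_channels=16)
wide = compression_ratio(256, 256, downsample=16, latent_channels=64)
```

Under these assumed numbers, widening the latent from 16 to 64 channels lowers the ratio from 48 to 12 input values per latent value, which is the kind of trade-off between compression and reconstruction capacity the summary refers to.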

The method adds an enhanced semantic alignment strategy that makes the latent space highly amenable to diffusion modeling, and it uses an asymmetric, attention-free encoder-decoder backbone to minimize encoding overhead. Qwen-Image-VAE-2.0 is evaluated on public reconstruction benchmarks and on a new benchmark, OmniDoc-TokenBench, a collection of real-world documents with specialized OCR-based evaluation metrics.

The results show that Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance in both general domains and text-rich scenarios at high compression ratios. Downstream DiT experiments show that the models have superior diffusability, converging significantly faster than existing high-compression baselines. Overall, the report positions Qwen-Image-VAE-2.0 as a leading model that combines high compression, strong reconstruction, and diffusability.


📅 Published on May 13

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.13565
• PDF: https://arxiv.org/pdf/2605.13565
• GitHub: https://github.com/alibaba/OmniDoc-TokenBench 26

📊 Datasets citing this paper:
https://huggingface.co/datasets/alibabagroup/OmniDoc-TokenBench

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#VariationalAutoencoders #ImageCompressionTechniques #DeepLearningArchitectures #DiffusionModeling #LatentSpaceRepresentation
🔥 Asymmetric Flow Models

💡 The paper introduces Asymmetric Flow Modeling, a method for efficient high-dimensional flow-based generation. The problem with existing flow-based generation methods is that they require modeling high-dimensional noise, which is difficult even when the data has a strong low-rank structure. To address this, the authors propose a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. This approach allows for the analytical recovery of the full-dimensional velocity without changing the network architecture or training procedures.
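
The summary does not spell out the parameterization, so the following is only one possible reading, sketched under stated assumptions: a linear path x_t = (1 - t) x + t eps, a velocity v = eps - x, and a training target u = P(eps) - x where P keeps a low-rank part of the noise. The axis-aligned projection (first `rank` coordinates) and all function names are hypothetical stand-ins.

```python
def interpolate(x, eps, t):
    """Linear path x_t = (1 - t) * x + t * eps (an assumed convention)."""
    return [(1 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def asymmetric_target(x, eps, rank):
    """Target u = P(eps) - x, where P keeps only the first `rank` coordinates.

    The axis-aligned subspace is a stand-in for a learned low-rank basis.
    """
    return [(ei if i < rank else 0.0) - xi
            for i, (xi, ei) in enumerate(zip(x, eps))]


def recover_velocity(u, x_t, t, rank):
    """Recover the full velocity v = eps - x from u and x_t.

    Inside the subspace, u already equals eps - x.
    Outside it, u = -x, and eps = (x_t - (1 - t) * x) / t,
    so v = (x_t - x) / t = (x_t + u) / t.
    """
    if not 0 < t <= 1:
        raise ValueError("t must be in (0, 1]")
    return [ui if i < rank else (xti + ui) / t
            for i, (ui, xti) in enumerate(zip(u, x_t))]
```

With an exact target, `recover_velocity(asymmetric_target(x, eps, r), interpolate(x, eps, t), t, r)` returns eps - x in every coordinate, even though noise is only represented in the first r coordinates.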

The method, called AsymFlow, enables effective fine-tuning from latent models to pixel-space models by aligning the low-rank pixel subspace to the latent space. This provides a seamless initialization that preserves the latent model's high-level semantics and structure, allowing fine-tuning to mainly improve low-level mismatches rather than relearning pixel generation.

The results show that AsymFlow achieves leading performance on ImageNet 256x256, outperforming prior pixel diffusion models by a large margin. AsymFlow also provides a route for fine-tuning pretrained latent flow models into pixel-space models, establishing a new state of the art for pixel-space text-to-image generation. The pixel AsymFlow model fine-tuned from a latent base model achieves better performance on several benchmarks, including HPSv3, DPG-Bench, and GenEval, and shows substantially improved visual realism. Overall, the paper enables efficient high-dimensional flow-based generation and pixel-space fine-tuning of latent models.


📅 Published on May 13

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.12964
• PDF: https://arxiv.org/pdf/2605.12964
• Project Page: https://hanshengchen.com/asymflow/
• GitHub: https://github.com/Lakonik/LakonLab 324

🤖 Models citing this paper:
https://huggingface.co/Lakonik/AsymFLUX.2-klein-9B
https://huggingface.co/Lakonik/AsymFlow-ImageNet
https://huggingface.co/OJ-1/AsymFLUX.2-klein-9B

🚀 Spaces citing this paper:
https://huggingface.co/spaces/Lakonik/AsymFLUX.2-klein

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#AsymmetricFlowModels #FlowBasedGeneration #HighDimensionalModeling #RankAsymmetricVelocity #FlowBasedDeepLearning
🔥 Self-Distilled Agentic Reinforcement Learning

💡 The paper introduces Self-Distilled Agentic Reinforcement Learning, a method that improves reinforcement learning for multi-turn agent training. Standard reinforcement learning provides only coarse supervision over long-horizon interaction, which can make multi-turn agents unstable. On-Policy Self-Distillation complements reinforcement learning with dense token-level guidance from a teacher branch, but in multi-turn settings it suffers from compounding instability and negative teacher rejections.

The proposed method addresses these limitations by treating On-Policy Self-Distillation as a gated auxiliary objective while keeping reinforcement learning as the primary optimization backbone. A sigmoid gate selectively strengthens positive token-level guidance and mitigates negative teacher rejections, which stabilizes supervision and improves the performance of multi-turn agents.
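
A minimal sketch of how such a gate could behave, assuming the gate is driven by the per-token log-probability gap between teacher and student. The gap definition, the `sharpness` and `aux_weight` parameters, and the way the gated term is added to the reinforcement learning signal are all hypothetical; the summary does not give the paper's exact formula.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def combined_token_signal(rl_advantage, student_logp, teacher_logp,
                          aux_weight=0.5, sharpness=5.0):
    """Per-token signal: RL advantage plus a gated distillation term.

    gap > 0 means the teacher endorses the sampled token more than the
    student does; gap < 0 is treated as a teacher rejection and damped.
    """
    signals = []
    for s_lp, t_lp in zip(student_logp, teacher_logp):
        gap = t_lp - s_lp
        gate = sigmoid(sharpness * gap)  # near 1 for positive gaps, near 0 for negative
        signals.append(rl_advantage + aux_weight * gate * gap)
    return signals
```

In this sketch the reinforcement learning advantage is always present, so the distillation term can only adjust it: positive guidance passes almost unchanged, while a rejection contributes close to nothing.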

The results show that Self-Distilled Agentic Reinforcement Learning substantially improves over GRPO and avoids the instability of naively combining GRPO with On-Policy Self-Distillation. It consistently outperforms hybrid reinforcement learning and On-Policy Self-Distillation baselines across model scales, covering the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA. Overall, the paper contributes a method that improves both the performance and the stability of multi-turn agents trained with reinforcement learning.


📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15155
• PDF: https://arxiv.org/pdf/2605.15155

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#AgenticReinforcementLearning #MultiTurnAgentTraining #OnPolicySelfDistillation #ReinforcementLearningMethods #SelfDistilledLearning
🔥 Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

💡 The paper proposes a novel approach called Warp-as-History for camera-controlled video generation. Existing methods for this task typically require large-scale camera-annotated videos for post-training or rely on test-time optimization, which can be time-consuming and costly. The proposed method addresses this problem by transforming camera-induced warps into pseudo-history representations, which enables a frozen video generation model to follow camera trajectories without any training or test-time optimization.

The Warp-as-History method works by constructing camera-warped pseudo-history from past observations and feeding it through the model's visual-history pathway. The positional encoding is aligned with the target frames being denoised, and warped-history tokens without valid source observations are removed. This simple interface reveals a non-trivial zero-shot capability of the model to follow camera trajectories.
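
The token-filtering step can be sketched as follows. This is an illustrative assumption about the interface, not the authors' code: tokens are plain strings standing in for latent vectors, and the function name and arguments are hypothetical.

```python
def build_pseudo_history(warped_tokens, valid_mask, target_positions):
    """Keep warped tokens that have a valid source observation.

    Each kept token is paired with the position of the target frame token
    it corresponds to, so positional encoding follows the frames being
    denoised rather than the original history frames.
    """
    if not (len(warped_tokens) == len(valid_mask) == len(target_positions)):
        raise ValueError("all inputs must have the same length")
    return [(pos, tok)
            for tok, ok, pos in zip(warped_tokens, valid_mask, target_positions)
            if ok]


# Example: the second token was warped from outside the source view.
history = build_pseudo_history(
    ["tok_a", "tok_b", "tok_c"],
    [True, False, True],
    [10, 11, 12],
)
```

Dropping the invalid token instead of padding it means the model never attends to a placeholder that has no observation behind it.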

The results show that the proposed method can achieve good camera adherence, visual quality, and motion dynamics without requiring large-scale camera-annotated videos or test-time optimization. Furthermore, lightweight offline finetuning on only one camera-annotated video can further improve the model's capability and generalize to unseen videos. Extensive experiments on diverse datasets confirm the effectiveness of the Warp-as-History method, making it a promising approach for camera-controlled video generation.

Overall, the paper contributes a camera-control method that needs no test-time optimization and at most one camera-annotated training video, and it shows that video generation models have a non-trivial zero-shot ability to follow camera trajectories. This could simplify camera-controlled video generation and make it accessible to a wider range of applications.


📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15182
• PDF: https://arxiv.org/pdf/2605.15182
• Project Page: https://yyfz.github.io/warp-as-history/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#VideoGeneration #CameraControlledSynthesis #WarpAsHistory #PseudoHistoryRepresentations #CameraInducedWarps
🔥 Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

💡 The paper presents a systematic approach for turning post-trained reasoning models into rigorous olympiad-level solvers, with the goal of gold-medal-level performance in mathematics and physics competitions. The method is a simple and unified recipe with three main components: a reverse-perplexity curriculum, a two-stage reinforcement learning pipeline, and test-time scaling. The reverse-perplexity curriculum instills rigorous proof-search and self-checking behaviors in the model. The two-stage reinforcement learning pipeline progresses from reinforcement learning with verifiable rewards to more delicate proof-level reinforcement learning, allowing the model to scale these behaviors. Finally, test-time scaling boosts solving performance.
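
The summary does not define "reverse-perplexity", so the sketch below rests on a labeled assumption: examples are scored by perplexity under the current model and presented highest perplexity first. The ordering direction, the example format, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(examples):
    """Order examples from highest to lowest perplexity (assumed reading)."""
    return sorted(examples, key=lambda ex: perplexity(ex["logprobs"]), reverse=True)


# Hypothetical trajectories with per-token log-probabilities.
examples = [
    {"id": "easy", "logprobs": [math.log(0.9)] * 4},
    {"id": "hard", "logprobs": [math.log(0.2)] * 4},
    {"id": "medium", "logprobs": [math.log(0.5)] * 4},
]
ordered_ids = [ex["id"] for ex in reverse_perplexity_order(examples)]
```

If the paper orders examples the other way, only the `reverse=True` flag changes; the perplexity computation itself is standard.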

The authors apply this recipe to a 30B-A3B backbone, using supervised fine-tuning (SFT) on around 340K sub-8K-token trajectories followed by 200 reinforcement learning steps. The resulting model, SU-01, reasons stably on difficult problems with trajectories exceeding 100K tokens. It achieves gold-medal-level performance on mathematics and physics olympiads, including the International Mathematical Olympiad and the International Physics Olympiad, and its scientific reasoning generalizes to domains beyond mathematics and physics. Overall, the paper contributes a simple and unified approach to gold-medal-level olympiad reasoning, with implications for long-horizon mathematical and scientific problem solving.


📅 Published on May 13

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.13301
• PDF: https://arxiv.org/pdf/2605.13301
• Project Page: https://simplified-reasoning.github.io/SU-01

🤖 Models citing this paper:
https://huggingface.co/Simplified-Reasoning/SU-01

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#OlympiadReasoning #MathematicalCompetitions #PhysicsCompetitions #ReinforcementLearning #ArtificialIntelligence
🔥 RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

💡 The paper introduces RAVEN, a real-time autoregressive video extrapolation network, and CM-GRPO, a consistency model-based reinforcement learning approach. The problem addressed is the gap between the history distributions encountered during training and those arising at inference in causal autoregressive video diffusion models, which constrains generation quality over long horizons.

To solve this problem, RAVEN repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states, aligning training attention with inference-time extrapolation. This formulation allows downstream chunk losses to supervise the history representations on which future predictions depend.

Additionally, CM-GRPO reformulates a consistency sampling step as a conditional Gaussian transition and applies online reinforcement learning directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations.
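
As a rough sketch of the two ingredients named here, assume a consistency sampling step re-noises a predicted clean endpoint, x_next = mean + sigma * z with z standard normal, so the transition is an isotropic Gaussian whose log-density can be scored directly. The group-relative advantage is the standard GRPO normalization. Function names are hypothetical, and how the paper combines the two is not reproduced.

```python
import math


def gaussian_log_prob(x_next, mean, sigma):
    """Log-density of N(mean, sigma^2 I) evaluated at x_next."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    d = len(x_next)
    sq_dist = sum((a - m) ** 2 for a, m in zip(x_next, mean))
    return (-0.5 * sq_dist / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2 * math.pi))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

A policy-gradient update would then weight each transition's log-probability by its rollout's advantage, with no auxiliary stochastic process needed to define the probabilities.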

The results demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations. Furthermore, CM-GRPO provides further gains when combined with RAVEN, indicating the effectiveness of the proposed methods in improving real-time video generation.

Overall, the paper presents a novel approach to real-time video generation through causal autoregressive extrapolation with improved training alignment and consistency model-based reinforcement learning, achieving state-of-the-art results in video generation quality and performance.


📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15190
• PDF: https://arxiv.org/pdf/2605.15190
• Project Page: https://yanzuo.lu/raven/

🤖 Models citing this paper:
https://huggingface.co/mvp-lab/RAVEN

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#AutoregressiveVideoExtrapolation #VideoDiffusionModels #ReinforcementLearningForVideo #ConsistencyModelBasedRL #RealTimeVideoGeneration
🔥 SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

💡 The paper introduces SANA-Video, a small diffusion model designed for efficient video generation. The problem addressed is the high cost and slow speed of existing video generation models. To solve this, the authors propose two core designs: Linear DiT, which leverages linear attention as the core operation, and a constant-memory KV cache for block linear attention. This cache provides global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient long video generation.

The method is a block-wise autoregressive approach to long video generation that maintains a constant-memory state derived from the cumulative properties of linear attention. The authors also explore effective data filters and model training strategies, which reduce the training cost to 12 days on 64 H100 GPUs, a significant reduction compared to other models.
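
The constant-memory idea can be illustrated with a generic causal linear attention state. This is a minimal sketch, not SANA-Video's implementation: the ReLU feature map, the per-token (rather than per-block) update, and the class name are assumptions made for illustration.

```python
def relu_features(vec):
    """Non-negative feature map applied to queries and keys (assumed)."""
    return [max(0.0, v) for v in vec]


class LinearAttentionState:
    """Running sums S = sum phi(k) v^T and z = sum phi(k).

    The state has a fixed size (key_dim x value_dim plus key_dim),
    no matter how many tokens have been absorbed.
    """

    def __init__(self, key_dim, value_dim):
        self.S = [[0.0] * value_dim for _ in range(key_dim)]
        self.z = [0.0] * key_dim

    def update(self, key, value):
        phi_k = relu_features(key)
        for i, ki in enumerate(phi_k):
            self.z[i] += ki
            for j, vj in enumerate(value):
                self.S[i][j] += ki * vj

    def read(self, query, eps=1e-6):
        phi_q = relu_features(query)
        denom = sum(q * z for q, z in zip(phi_q, self.z)) + eps
        value_dim = len(self.S[0])
        return [sum(phi_q[i] * self.S[i][j] for i in range(len(phi_q))) / denom
                for j in range(value_dim)]
```

Because every past token is folded into `S` and `z`, a new block can attend to the whole history by reading the state, whereas a softmax KV cache grows with the number of tokens.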

The results show that SANA-Video is competitive with modern state-of-the-art small diffusion models while being 16 times faster in measured latency. It generates high-quality videos at resolutions up to 720x1280 and minute-length durations. SANA-Video can also be deployed on RTX 5090 GPUs, where generating a 5-second 720p video drops from 71 seconds to 29 seconds, a 2.4 times speedup. Overall, SANA-Video enables low-cost, high-quality video generation.


📅 Published on Sep 29, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.24695
• PDF: https://arxiv.org/pdf/2509.24695
• Project Page: https://nvlabs.github.io/Sana/Video

🤖 Models citing this paper:
https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_720p
https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_480p
https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_480p_diffusers

🚀 Spaces citing this paper:
https://huggingface.co/spaces/helenai/check-optimum-intel-support

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#VideoGenerationModels #DiffusionTransformer #BlockLinearAttention #EfficientVideoProcessing #AutoregressiveVideoGeneration
🔥 Transformer Explainer: Interactive Learning of Text-Generative Models

💡 The paper introduces Transformer Explainer, an interactive visualization tool that helps non-experts understand the inner workings of the GPT-2 model. The problem addressed is that Transformers, despite being a revolutionary machine learning technology, are often opaque to those without extensive expertise. To tackle this issue, the authors developed a tool that provides a model overview and allows users to smoothly transition across different abstraction levels of mathematical operations and model structures.

The method used to create the tool involves integrating a live GPT-2 instance that runs locally in the user's browser, enabling users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together to predict the next tokens. This approach allows users to gain hands-on experience and intuition about complex Transformer concepts without requiring installation or special hardware.
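
The final step of next-token prediction, turning logits into probabilities, can be sketched in a few lines. This is a generic illustration in Python rather than the tool's own browser code; the logits are hypothetical and the temperature parameter is included as a common sampling control.

```python
import math


def next_token_probs(logits, temperature=1.0):
    """Softmax over logits, with temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
default = next_token_probs(logits)
sharper = next_token_probs(logits, temperature=0.5)
```

Lowering the temperature concentrates probability on the highest-scoring token, and raising it flattens the distribution.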

The results of this work are a publicly available, open-sourced tool that broadens access to education on modern generative AI techniques. The tool is accessible at a provided website and a video demo is also available, showcasing the tool's capabilities. Overall, the paper contributes to making Transformers more accessible and understandable to a wider audience, including non-experts, by providing an interactive and intuitive learning experience.


📅 Published on Aug 8, 2024

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2408.04619
• PDF: https://arxiv.org/pdf/2408.04619
• Project Page: https://poloclub.github.io/transformer-explainer/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#TransformerModels #GPT2Explained #NaturalLanguageProcessing #TextGenerationModels #ExplainableAI
🔥 Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

💡 The paper introduces Orthrus, a dual-architecture framework that combines the strengths of autoregressive large language models and diffusion models to achieve fast parallel token generation while maintaining exact inference fidelity. Standard autoregressive decoding is sequential, which is a fundamental bottleneck for high-throughput inference. Diffusion language models address this with parallel generation, but they suffer from performance degradation, high training costs, and a lack of convergence guarantees.

Orthrus resolves this by augmenting a frozen large language model with a lightweight trainable module that creates a parallel diffusion view alongside the standard autoregressive view. Both views attend to the same high-fidelity key-value (KV) cache: the autoregressive head performs context prefilling to construct accurate key-value representations, and the diffusion head performs parallel generation. An exact consensus mechanism between the two views guarantees lossless inference.
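
The summary does not describe the consensus rule, so the sketch below assumes it resembles greedy draft verification: the diffusion view proposes a block of tokens, the autoregressive view reports its own choice at each position, and tokens are accepted up to and including the first disagreement. The function name and token values are hypothetical.

```python
def accept_by_consensus(draft_tokens, verifier_tokens):
    """Accept the agreed prefix, then the verifier's token at the first mismatch.

    verifier_tokens[i] is the autoregressive choice at position i given
    draft_tokens[:i], so every accepted token is one the autoregressive
    view would have produced on its own.
    """
    accepted = []
    for draft, verified in zip(draft_tokens, verifier_tokens):
        accepted.append(verified)
        if draft != verified:
            break
    return accepted


# The views agree on two tokens, then diverge at the third.
accepted = accept_by_consensus([5, 7, 9, 2], [5, 7, 3, 8])
```

Under this assumption the output matches sequential decoding exactly, and the speedup depends on how many drafted tokens are accepted per step.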

The results show that Orthrus delivers a speedup of up to 7.8 times with only a constant-memory cache overhead and minimal added parameters. Sharing the key-value cache and applying the consensus mechanism lets the framework keep exact inference fidelity while generating tokens in parallel. Overall, Orthrus offers a simple and efficient remedy for slow sequential decoding in autoregressive large language models, with the potential to integrate seamlessly into existing transformer architectures.


📅 Published on May 12

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.12825
• PDF: https://arxiv.org/pdf/2605.12825

🤖 Models citing this paper:
https://huggingface.co/chiennv/Orthrus-Qwen3-8B
https://huggingface.co/chiennv/Orthrus-Qwen3-4B
https://huggingface.co/chiennv/Orthrus-Qwen3-1.7B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#DiffusionLanguageModels #ParallelTokenGeneration #AutoregressiveDecoding #DualViewDiffusion #LargeLanguageModels
🔥 Long Context Pre-Training with Lighthouse Attention

💡 The paper proposes a new attention algorithm called Lighthouse Attention that enables efficient training of causal transformers on long sequences. The main problem addressed is the quadratic time and memory complexity of scaled dot-product attention, which makes it difficult to train models on extremely long sequences. To solve this, the authors introduce a hierarchical selection-based attention algorithm that reduces computational complexity while maintaining model performance.

The Lighthouse Attention algorithm has three key contributions. First, it uses a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence, reducing the computational cost. Second, it employs a symmetrical compression strategy that pools queries, keys, and values simultaneously while preserving left-to-right causality, which improves parallelism. Third, it uses a two-stage training approach, where the model is pre-trained with Lighthouse Attention for most of the time and then recovered to a full attention model with a short training phase.
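
One way to picture pooling that preserves left-to-right causality is shown below. It is a simplified, single-level illustration with scalar tokens and mean pooling; the paper's hierarchical selection and its exact pooling operator are not reproduced, and both function names are hypothetical.

```python
def causal_mean_pool(sequence, block_size):
    """Mean-pool complete blocks of the sequence.

    Returns (last_index_covered, pooled_value) pairs so callers can tell
    which positions are allowed to see each pooled summary.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    pooled = []
    for start in range(0, len(sequence) - block_size + 1, block_size):
        block = sequence[start:start + block_size]
        pooled.append((start + block_size - 1, sum(block) / block_size))
    return pooled


def visible_blocks(pooled, query_index):
    """Pooled summaries whose tokens all lie at or before query_index."""
    return [value for end, value in pooled if end <= query_index]


pooled = causal_mean_pool([1, 2, 3, 4, 5, 6, 7], block_size=2)
```

A query at position 3 sees only the first two summaries, so no information from later tokens leaks backward through the compressed sequence.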

The authors evaluate their method through small-scale pre-training experiments and show that it achieves faster total training time and lower final loss compared to full attention training with matched settings. The results demonstrate the effectiveness of Lighthouse Attention in reducing the computational complexity of training causal transformers on long sequences. The full code for the algorithm is available online, allowing others to implement and build upon the method. Overall, the paper presents a novel attention algorithm that can efficiently train causal transformers on long sequences, making it a useful contribution to the field of natural language processing.


📅 Published on May 7

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.06554
• PDF: https://arxiv.org/pdf/2605.06554
• Project Page: https://nousresearch.com/lighthouse-attention

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#CausalTransformers #LighthouseAttention #EfficientAttentionMechanisms #LongSequenceModeling #HierarchicalAttentionAlgorithms