AI & ML Papers

The AI community building the future. Hugging Face has 421 repositories available. Follow their code on GitHub.

🔥 Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

💡 The paper introduces Orthrus, a dual-architecture framework that combines the strengths of autoregressive large language models and diffusion models to achieve fast parallel token generation while maintaining exact inference fidelity. Standard autoregressive decoding is sequential, which is a fundamental bottleneck for high-throughput inference. Diffusion language models address this with parallel generation, but they suffer from performance degradation, high training costs, and a lack of convergence guarantees.

The Orthrus framework resolves this by augmenting a frozen large language model with a lightweight trainable module that creates a parallel diffusion view alongside the standard autoregressive view. Both views attend to the same high-fidelity key-value (KV) cache: the autoregressive head prefills the context to construct accurate key-value representations, and the diffusion head performs parallel generation. An exact consensus mechanism between the two views guarantees lossless inference.
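The summary does not spell out the consensus rule, so the following is only a minimal sketch of how a parallel draft can stay lossless. It assumes that consensus means "keep a drafted token only if the autoregressive view would have produced the same token greedily". `ar_next` and `draft_block` are toy stand-ins invented for illustration, not the paper's models.

```python
from typing import List, Tuple

Token = int


def ar_next(context: List[Token]) -> Token:
    """Toy autoregressive view: a deterministic rule standing in for greedy decoding."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context: List[Token], k: int) -> List[Token]:
    """Toy parallel view: proposes k tokens at once and is deliberately wrong sometimes."""
    out, ctx = [], list(context)
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:          # inject a disagreement at some positions
            tok = (tok + 1) % 50
        out.append(tok)
        ctx.append(tok)
    return out


def greedy_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Reference: plain sequential decoding, one token per step."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(ar_next(seq))
    return seq


def consensus_decode(prompt: List[Token], n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the agreeing prefix, fall back to the AR token on mismatch."""
    seq, rounds = list(prompt), 0
    target = len(prompt) + n_new
    while len(seq) < target:
        rounds += 1                    # one draft-and-verify round
        for tok in draft_block(seq, k):
            if len(seq) == target:
                break
            expected = ar_next(seq)
            seq.append(expected)       # always emit what the AR view would emit
            if tok != expected:
                break                  # the rest of the draft is no longer valid
    return seq, rounds


tokens, rounds = consensus_decode([1, 2, 3], n_new=10)
print(tokens == greedy_decode([1, 2, 3], n_new=10), rounds)
```

Because every emitted token is the one the autoregressive rule would have chosen, the output matches sequential decoding by construction, and any speedup comes from needing fewer rounds than tokens. In this toy the verification still runs token by token; a real system would check a whole draft in one batched pass.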

The results show that Orthrus delivers a speedup of up to 7.8x with only a constant memory overhead for the cache and minimal added parameters. Sharing the key-value cache and applying the consensus mechanism let the framework maintain exact inference fidelity while generating tokens in parallel. Overall, Orthrus offers a simple and efficient answer to slow sequential decoding in autoregressive large language models, and it has the potential to integrate seamlessly into existing transformer architectures.

📅 Published on May 12

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.12825
• PDF: https://arxiv.org/pdf/2605.12825

🤖 Models citing this paper:
• https://huggingface.co/chiennv/Orthrus-Qwen3-8B
• https://huggingface.co/chiennv/Orthrus-Qwen3-4B
• https://huggingface.co/chiennv/Orthrus-Qwen3-1.7B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#DiffusionLanguageModels #ParallelTokenGeneration #AutoregressiveDecoding #DualViewDiffusion #LargeLanguageModels

🔥 Long Context Pre-Training with Lighthouse Attention

💡 The paper proposes a new attention algorithm called Lighthouse Attention that enables efficient training of causal transformers on long sequences. The main problem addressed is the quadratic time and memory complexity of scaled dot-product attention, which makes it difficult to train models on extremely long sequences. To solve this, the authors introduce a hierarchical selection-based attention algorithm that reduces computational complexity while maintaining model performance.

The Lighthouse Attention algorithm has three key contributions. First, it uses a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence, reducing the computational cost. Second, it employs a symmetrical compression strategy that pools queries, keys, and values simultaneously while preserving left-to-right causality, which improves parallelism. Third, it uses a two-stage training approach, where the model is pre-trained with Lighthouse Attention for most of the time and then recovered to a full attention model with a short training phase.
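To see why pooling queries together with keys matters, it helps to count the query-key pairs that causal attention has to score. The sketch below assumes a single fixed pooling factor, which is a simplification: the summary describes an adaptive, hierarchical scheme and gives no pooling factors, so `pool=16` is purely illustrative.

```python
def causal_pairs(n: int) -> int:
    """Number of (query, key) pairs with key position <= query position."""
    return n * (n + 1) // 2


def kv_only_pooled_pairs(n: int, pool: int) -> int:
    """Pairs when only keys are pooled: each query sees ceil((i + 1) / pool) blocks."""
    return sum(-(-(i + 1) // pool) for i in range(n))


def symmetric_pooled_pairs(n: int, pool: int) -> int:
    """Pairs when queries and keys are pooled by the same factor."""
    m = -(-n // pool)                  # ceil(n / pool) pooled positions
    return causal_pairs(m)


for n in (4096, 65536):
    full = causal_pairs(n)
    kv_only = kv_only_pooled_pairs(n, pool=16)
    symmetric = symmetric_pooled_pairs(n, pool=16)
    print(f"n={n}: full={full}, kv_only={kv_only}, symmetric={symmetric}")
```

Under this assumption, pooling only the keys cuts the pair count by roughly the pooling factor, while pooling both sides cuts it by roughly the factor squared. The pooled pass is still quadratic in the shortened length, so this counts only the attention step and says nothing about the compression and decompression stages.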

The authors evaluate the method in small-scale pre-training experiments and report faster total training time and lower final loss than full-attention training under matched settings. The full code is available online, so others can implement and build on the method. Overall, the paper presents an attention algorithm for efficiently training causal transformers on long sequences.

📅 Published on May 7

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.06554
• PDF: https://arxiv.org/pdf/2605.06554
• Project Page: https://nousresearch.com/lighthouse-attention

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#CausalTransformers #LighthouseAttention #EfficientAttentionMechanisms #LongSequenceModeling #HierarchicalAttentionAlgorithms

🔥 DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

💡 The paper introduces DataFlow, a framework for unified data preparation and workflow automation in the context of large language models. The problem addressed is the current lack of scalable and reliable data preparation pipelines, which are often dominated by ad-hoc scripts and loosely specified workflows, hindering reproducibility and model performance.

To address this challenge, the authors propose DataFlow, a framework that provides system-level abstractions for modular, reusable, and composable data transformations. It includes a PyTorch-style pipeline construction API and nearly 200 reusable operators, as well as six domain-general pipelines for various tasks such as text, mathematical reasoning, and code.
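As a rough picture of what a "PyTorch-style" pipeline can look like, here is a minimal sketch in which operators are declared in `__init__` and wired together in `forward`, mirroring how a `torch.nn.Module` is written. Every class and method name below (`Operator`, `run`, `TextCleaningPipeline`, and so on) is hypothetical and chosen for illustration; none of it is taken from DataFlow's actual API.

```python
from typing import Dict, List

Record = Dict[str, str]


class Operator:
    """Hypothetical base class: one reusable transformation over a list of records."""

    def run(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError


class StripWhitespace(Operator):
    def run(self, records: List[Record]) -> List[Record]:
        return [{**r, "text": r["text"].strip()} for r in records]


class MinLengthFilter(Operator):
    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def run(self, records: List[Record]) -> List[Record]:
        return [r for r in records if len(r["text"]) >= self.min_chars]


class Deduplicate(Operator):
    def run(self, records: List[Record]) -> List[Record]:
        seen, out = set(), []
        for r in records:
            if r["text"] not in seen:   # keep the first copy of each text
                seen.add(r["text"])
                out.append(r)
        return out


class TextCleaningPipeline:
    """Operators are declared once and composed in forward()."""

    def __init__(self) -> None:
        self.strip = StripWhitespace()
        self.min_len = MinLengthFilter(min_chars=5)
        self.dedup = Deduplicate()

    def forward(self, records: List[Record]) -> List[Record]:
        records = self.strip.run(records)
        records = self.min_len.run(records)
        return self.dedup.run(records)


raw = [{"text": "  hello world "}, {"text": "hi"}, {"text": "hello world"}]
cleaned = TextCleaningPipeline().forward(raw)
print(cleaned)
```

The appeal of this shape is that each operator can be tested and reused on its own, while the pipeline class remains a short, readable description of the order in which they run.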

The framework also includes DataFlow-Agent, which can automatically translate natural-language specifications into executable pipelines. This is achieved through operator synthesis, pipeline planning, and iterative verification.

The results show that DataFlow consistently improves downstream large language model performance across six representative use cases. The framework outperforms curated human datasets and specialized synthetic baselines, achieving significant gains in execution accuracy and average improvements on code benchmarks.

For example, the math, code, and text pipelines improve execution accuracy on Text-to-SQL by up to 3 percent, deliver 7 percent average improvements on code benchmarks, and yield 1-3 point gains on math benchmarks. Additionally, a unified dataset produced by DataFlow enables base models to surpass counterparts trained on larger datasets.

Overall, the paper demonstrates that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable large language model data preparation, and establishes a system-level foundation for future data-centric AI development.

📅 Published on Dec 18, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2512.16676
• PDF: https://arxiv.org/pdf/2512.16676
• Project Page: https://github.com/OpenDCAI/DataFlow

📊 Datasets citing this paper:
• https://huggingface.co/datasets/OpenDCAI/dataflow-demo-Text2SQL
• https://huggingface.co/datasets/OpenDCAI/dataflow-mm-context_vqa
• https://huggingface.co/datasets/OpenDCAI/dataflow-instruct-10k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#DataCentricAI #LLMDrivenFrameworks #UnifiedDataPreparation #WorkflowAutomation #LargeLanguageModels

🔥 Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

💡 The paper introduces FramePack, a neural network designed to improve video generation by enhancing next-frame prediction models. The main problem addressed is the limitation of transformer context length, which restricts the number of frames that can be processed. To overcome this, FramePack compresses input frames, allowing the transformer context length to be fixed regardless of the video length. This enables the processing of a large number of frames and increases the batch size, making it comparable to image diffusion training.
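A fixed context length regardless of video length can seem surprising, so here is a small worked example. It assumes a geometric compression schedule in which each step further into the past halves a frame's token budget; the summary does not give the schedule, so treat this as one possible way the bound can arise. The function name and the 1024-token frame size are illustrative.

```python
def packed_context_length(num_frames: int, tokens_per_frame: int) -> int:
    """Total tokens when the i-th most recent frame keeps tokens_per_frame // 2**i tokens."""
    return sum(tokens_per_frame >> i for i in range(num_frames))


def unpacked_context_length(num_frames: int, tokens_per_frame: int) -> int:
    """Total tokens when every frame is kept at full resolution."""
    return num_frames * tokens_per_frame


for frames in (4, 16, 256, 4096):
    packed = packed_context_length(frames, tokens_per_frame=1024)
    unpacked = unpacked_context_length(frames, tokens_per_frame=1024)
    print(f"{frames} frames: packed={packed}, unpacked={unpacked}")
```

Under this schedule the packed total is a geometric series, so it never reaches twice the budget of a single frame, while the uncompressed total grows linearly with the number of frames.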

The method proposed by FramePack involves compressing input frames and using an anti-drifting sampling method to generate frames in inverted temporal order. This approach helps to avoid exposure bias, which occurs when errors accumulate over iterations. Additionally, FramePack can be used to fine-tune existing video diffusion models, allowing for more balanced diffusion schedulers with less extreme flow shift timesteps.

The results show that FramePack improves the visual quality of video generation by supporting more balanced diffusion schedulers. The increased batch size and improved frame prediction also enhance the overall performance of video diffusion models. Overall, FramePack provides a novel approach to video generation by addressing the limitations of transformer context length and improving the efficiency of next-frame prediction models.

📅 Published on Apr 17, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2504.12626
• PDF: https://arxiv.org/pdf/2504.12626
• Project Page: https://lllyasviel.github.io/frame_pack_gitpage/

🤖 Models citing this paper:
• https://huggingface.co/URWAIFU/framepack-eichi-f1

📊 Datasets citing this paper:
• https://huggingface.co/datasets/agreeupon/wrkspace-backup-ttl

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/linoyts/FramePack-F1
• https://huggingface.co/spaces/makululinux/FramePack-F1
• https://huggingface.co/spaces/ObiJuanCodenobi/VidGen-Emilio

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#VideoGenerationModels #NextFramePrediction #TransformerContextLength #FrameCompressionTechniques #NeuralNetworkArchitecture

🔥 MMSkills: Towards Multimodal Skills for General Visual Agents

💡 The paper introduces MMSkills, a framework for representing and using reusable multimodal procedures for visual decision making in complex environments. The authors argue that current skill packages for visual agents are limited because they primarily rely on textual prompts or executable code, and do not account for the multimodal nature of procedural knowledge. To address this, the authors formalize the concept of multimodal procedural knowledge, which requires recognizing relevant state, interpreting visual evidence, and deciding what to do next.

The authors identify three practical challenges in developing multimodal skill packages: defining the contents of a package, deriving packages from public interaction experience, and consulting multimodal evidence at inference time. To overcome these challenges, the authors propose a framework that represents each skill as a compact package containing a textual procedure, runtime state cards, and multi-view keyframes.
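The three parts of a skill package map naturally onto a small data structure. The sketch below is an assumption about how such a package could be laid out; the field names, the `view` labels, and the `frames_for` helper are hypothetical and are not taken from the MMSkills release.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Keyframe:
    frame_id: str
    view: str                # hypothetical label, e.g. "full" or "crop"
    image_path: str


@dataclass
class StateCard:
    name: str
    description: str         # what the environment looks like when this card applies
    keyframe_ids: List[str] = field(default_factory=list)


@dataclass
class SkillPackage:
    skill_name: str
    procedure: List[str]     # textual steps
    state_cards: List[StateCard]
    keyframes: List[Keyframe]

    def frames_for(self, card_name: str) -> List[Keyframe]:
        """Return only the keyframes referenced by one state card."""
        wanted = {
            fid
            for card in self.state_cards
            if card.name == card_name
            for fid in card.keyframe_ids
        }
        return [k for k in self.keyframes if k.frame_id in wanted]


package = SkillPackage(
    skill_name="export_as_pdf",
    procedure=["Open the File menu", "Choose Export", "Select PDF and confirm"],
    state_cards=[
        StateCard("menu_open", "File menu is expanded", ["kf1"]),
        StateCard("export_dialog", "Export dialog is visible", ["kf2", "kf3"]),
    ],
    keyframes=[
        Keyframe("kf1", "full", "frames/menu.png"),
        Keyframe("kf2", "full", "frames/dialog.png"),
        Keyframe("kf3", "crop", "frames/dialog_format.png"),
    ],
)
print([k.frame_id for k in package.frames_for("export_dialog")])
```

Linking keyframes to state cards by id means a consumer can load just the images relevant to the state it is inspecting, rather than every screenshot in the package.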

The authors develop an agentic trajectory-to-skill generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. This generator enables the construction of multimodal skill packages from public interaction experience.

To use these packages, the authors introduce a branch-loaded multimodal skill agent that inspects selected state cards and keyframes in a temporary branch, aligns them with the live environment, and distills them into structured guidance for the main agent. This approach allows the agent to consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots.

The authors evaluate MMSkills on GUI and game-based visual-agent benchmarks and demonstrate that it consistently improves the performance of both frontier and smaller multimodal agents. The results suggest that external multimodal procedural knowledge complements model-internal priors, and that MMSkills provides an effective framework for representing and using reusable multimodal procedures for visual decision making. Overall, the paper contributes a new framework for multimodal skills, a method for generating these skills from public interaction experience, and an approach for using these skills in visual decision making.

📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.13527
• PDF: https://arxiv.org/pdf/2605.13527
• Project Page: https://deepexperience.github.io/MMSkills/

📊 Datasets citing this paper:
• https://huggingface.co/datasets/zhangkangning/mmskills

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#MultimodalProceduralKnowledge #VisualDecisionMaking #MultimodalSkills #GeneralVisualAgents #ProceduralKnowledgeRepresentation

🔥 FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

💡 The paper introduces FashionChameleon, a real-time and interactive framework for human-garment video customization in autoregressive video generation. The problem addressed is the inability of existing approaches to support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation.

To solve this problem, the authors propose a method that consists of three key techniques. First, they train a Teacher Model with In-Context Learning on a single reference-garment pair, which encourages the model to implicitly preserve coherence during single-garment switching. Second, they introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. Third, they propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence.

The results show that FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU. This is 30-180 times faster than existing baselines. The framework enables users to interactively switch garments during generation, making it a significant contribution to the field of human-centric video customization. Overall, the paper presents a novel approach to achieving real-time and interactive human-garment video customization, which has significant commercial value and potential applications in e-commerce and content creation.
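The headline numbers are easier to judge as per-frame latency. The arithmetic below assumes the "30-180 times faster" figure is a ratio of frame rates measured under comparable conditions, which the summary does not state; the variable names are illustrative.

```python
fps = 23.8                                   # reported generation speed
frame_latency_ms = 1000.0 / fps              # time budget per generated frame

# Implied baseline throughput if the speedup is a ratio of frame rates.
baseline_fps_slowest = fps / 180
baseline_fps_fastest = fps / 30

print(f"per-frame latency: {frame_latency_ms:.1f} ms")
print(f"implied baseline range: {baseline_fps_slowest:.2f} to {baseline_fps_fastest:.2f} FPS")
```

Under that reading, the baselines would need more than a second per frame, which is why they cannot support garment switching during playback.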

📅 Published on May 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15824
• PDF: https://arxiv.org/pdf/2605.15824
• Project Page: https://quanjiansong.github.io/projects/FashionChameleon/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#RealTimeVideoCustomization #HumanGarmentInteraction #AutoregressiveVideoGeneration #InteractiveGarmentControl #EcommerceVideoTechnology

🔥 ReactiveGWM: Steering NPC in Reactive Game World Models

💡 Current game world models are limited because they simulate environments from a player-centric perspective and treat non-player characters (NPCs) as background elements, so they fail to capture interactions between the player and the NPC. The resulting models lack physical understanding and cannot simulate action-induced NPC reactions.

The paper introduces ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and the non-player character by decoupling player controls from non-player character behaviors. It does so with diffusion models whose cross-attention modules learn a game-agnostic representation of interactive logic, allowing zero-shot strategy transfer across different games.

In the proposed method, player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level non-player character responses are grounded through cross-attention modules. Because the learned representation of interactive logic is game-agnostic, it can be transferred to other games without domain-specific retraining.
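The two conditioning paths can be sketched on a single hidden vector. This is a simplified illustration under stated assumptions, not the paper's architecture: it uses one attention head, omits the learned query, key, and value projections, and all function names and the tiny example tables are hypothetical.

```python
import math
from typing import List

Vec = List[float]


def add_action_bias(hidden: Vec, action_id: int, action_table: List[Vec]) -> Vec:
    """Player path: add a per-action vector (learned in practice) to the hidden state."""
    return [h + b for h, b in zip(hidden, action_table[action_id])]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vec, prompt_tokens: List[Vec]) -> Vec:
    """NPC path: attend from the hidden state to embeddings of the strategy prompt."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in prompt_tokens]
    weights = softmax(scores)
    return [
        sum(w * tok[d] for w, tok in zip(weights, prompt_tokens))
        for d in range(len(query))
    ]


def conditioned_block(hidden: Vec, action_id: int,
                      action_table: List[Vec], prompt_tokens: List[Vec]) -> Vec:
    """Apply the action bias first, then add the cross-attention output residually."""
    h = add_action_bias(hidden, action_id, action_table)
    context = cross_attend(h, prompt_tokens)
    return [a + c for a, c in zip(h, context)]


action_table = [[1.0, 0.0], [0.0, 1.0]]      # two toy player actions
prompt_tokens = [[1.0, 0.0], [1.0, 0.0]]     # toy prompt embeddings
print(conditioned_block([0.0, 0.0], 0, action_table, prompt_tokens))
```

Keeping the player signal as a plain additive term and the NPC signal as attention over prompt embeddings means the two controls enter through separate routes, which is the decoupling the summary describes.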

The results show that ReactiveGWM maintains fine-grained player controllability while achieving robust, prompt-aligned non-player character strategy adherence. The model is evaluated on two Street Fighter games, demonstrating that it can unlock steerable non-player character interactions without domain-specific retraining. Overall, the paper contributes a novel approach to simulating dynamic interactions between players and non-player characters in game worlds, paving the way for scalable, strategy-rich interactions with non-player characters.

📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15256
• PDF: https://arxiv.org/pdf/2605.15256
• Project Page: https://inv-wzq.github.io/ReactiveGWM/

🤖 Models citing this paper:
• https://huggingface.co/INV-WZQ/ReactiveGWM-Models

📊 Datasets citing this paper:
• https://huggingface.co/datasets/INV-WZQ/ReactiveGWM-Datasets

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#GameWorldModels #ReactiveGameDevelopment #NPCAI #GamePhysicsSimulation #ReactiveGameWorldModeling

🔥 DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

💡 The paper presents DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation, which aims to advance the capabilities of robotic hands in complex object interactions. The problem addressed is the lack of standardized benchmarks for evaluating dexterous manipulation, with existing benchmarks lacking tasks that reflect the unique capabilities of dexterous hands. To address this, the authors developed DexJoCo, which comprises 11 functional tasks that evaluate tool-use, bimanual coordination, long-horizon execution, and reasoning.

The method used to achieve this involves developing a low-cost data collection system, which collected 1.1K trajectories across these tasks, with support for domain randomization to assess robustness. The authors also benchmarked modern models under diverse settings, including visual and dynamics randomization, multi-task training, and action-head adaptation.
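Domain randomization generally means resampling nuisance parameters for each episode so that a policy cannot overfit to one rendering or one set of physics constants. The sketch below shows the idea with separate switches for visual and dynamics randomization. The parameter names and ranges are hypothetical, and nothing here calls the DexJoCo toolkit or MuJoCo.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeConfig:
    light_intensity: float      # visual
    camera_jitter_deg: float    # visual
    friction_scale: float       # dynamics
    object_mass_scale: float    # dynamics


def sample_episode_config(rng: random.Random, visual: bool, dynamics: bool) -> EpisodeConfig:
    """Draw one episode's settings; disabled groups stay at their nominal values."""
    return EpisodeConfig(
        light_intensity=rng.uniform(0.5, 1.5) if visual else 1.0,
        camera_jitter_deg=rng.uniform(-5.0, 5.0) if visual else 0.0,
        friction_scale=rng.uniform(0.8, 1.2) if dynamics else 1.0,
        object_mass_scale=rng.uniform(0.8, 1.2) if dynamics else 1.0,
    )


rng = random.Random(0)          # seeded so an evaluation run can be repeated
nominal = sample_episode_config(rng, visual=False, dynamics=False)
randomized = [sample_episode_config(rng, visual=True, dynamics=True) for _ in range(3)]
print(nominal)
for cfg in randomized:
    print(cfg)
```

Evaluating the same policy on the nominal setting and on the randomized settings gives a simple robustness gap to report.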

The paper identifies several important insights and common limitations of current policies in dexterous manipulation, highlighting key challenges for future research in dexterous-hand robot learning. Through extensive empirical analysis, the authors found that current policies struggle with tasks that require long-horizon execution, bimanual coordination, and tool use, and that domain randomization is essential for assessing the robustness of policies. Overall, the paper provides a comprehensive benchmark and toolkit for task-oriented dexterous manipulation, which can be used to evaluate and improve the capabilities of robotic hands in complex object interactions.

📅 Published on May 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.16257
• PDF: https://arxiv.org/pdf/2605.16257
• Project Page: https://dexjoco.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#DexterousManipulation #TaskOrientedRobotics #MuJoCoBenchmark #RoboticHandControl #BimanualCoordination

🔥 InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

💡 The paper proposes InsightTok, a new discrete visual tokenization framework that improves the quality of autoregressive image generation, particularly for text and face reconstruction. The problem addressed is that current discrete tokenization methods often discard the fine-grained structures needed to preserve readable text and distinctive facial features, because of aggressive downsampling and quantization. Standard discrete-tokenizer objectives are not well aligned with text legibility and facial fidelity: they optimize generic reconstruction while compressing diverse content uniformly.

To address this issue, the authors propose InsightTok, which uses localized, content-aware perceptual losses to enhance text and face fidelity. This approach allows the tokenizer to prioritize the preservation of important details in text and faces, resulting in better reconstruction quality. The InsightTok framework uses a compact 16k codebook and a 16x downsampling rate, which is relatively efficient compared to prior methods.
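The codebook size and downsampling rate determine how little information each image is squeezed into, which helps explain why small text and facial detail are at risk. The calculation below assumes that "16k" means 16,384 codebook entries and that 16x downsampling applies to each spatial side; both are assumptions, and the function name is illustrative.

```python
import math
from typing import Tuple


def token_budget(height: int, width: int,
                 downsample: int = 16, codebook_size: int = 16384) -> Tuple[int, float]:
    """Return (number of tokens, total bits) for one image under the stated assumptions."""
    tokens = (height // downsample) * (width // downsample)
    bits_per_token = math.log2(codebook_size)
    return tokens, tokens * bits_per_token


for side in (256, 512):
    tokens, bits = token_budget(side, side)
    raw_bits = side * side * 3 * 8            # 8-bit RGB pixels
    print(f"{side}x{side}: {tokens} tokens, {bits:.0f} bits, "
          f"about {raw_bits / bits:.0f}x smaller than raw pixels")
```

With each token standing in for a 16x16 pixel patch, a tokenizer trained on a uniform reconstruction objective has little incentive to spend that capacity on glyph strokes or facial features, which is the gap the localized losses are meant to close.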

The results show that InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. Furthermore, the gains achieved by InsightTok consistently transfer to autoregressive image generation, producing images with clearer text and more faithful facial details. The paper highlights the potential of specialized supervision in tokenizer training for advancing discrete image generation, demonstrating that a simple yet effective approach can lead to significant improvements in image generation quality. Overall, the InsightTok framework provides a new direction for improving the quality of autoregressive image generation, particularly for applications where text and face reconstruction are critical.

📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.14333
• PDF: https://arxiv.org/pdf/2605.14333

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://t.me/PaperNexus

#AutoregressiveImageGeneration #DiscreteTokenization #FaceReconstruction #TextReconstruction #VisualTokenization
