🪢 Compositional Learning Journal Club
Join us this week for an in-depth discussion on Compositional Learning for Visual Reasoning in modern vision–language models. We will explore recent breakthroughs and challenges, focusing on how these models perform compositional visual reasoning over complex scenes and where there is still room for improvement in robustness, faithfulness, and instruction following.
🌟 This Week's Presentation
📌 Title:
Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
🧠 Abstract:
Multimodal Large Language Models (MLLMs) have recently shown strong potential in visual reasoning, especially when combined with test-time scaling techniques. However, most current approaches keep the visual input fixed and only explore different textual reasoning paths, which limits their ability to exploit rich visual details—particularly in high-resolution images with many fine-grained elements. In such settings, vision-level reasoning becomes crucial: models need to dynamically zoom into informative regions of the image to gather the evidence required for accurate decisions.
In this session, we will discuss ZoomEye, a training-free, model-agnostic tree search algorithm for vision-level reasoning. ZoomEye treats an image as a hierarchical tree, where each node is a region and child nodes correspond to zoomed-in sub-regions. By navigating this tree, MLLMs can simulate human-like zooming behavior, selectively focusing on task-relevant areas. Experiments on high-resolution benchmarks show that ZoomEye substantially boosts the performance of multiple MLLMs (e.g., InternVL2.5-8B gains 15%–17% on HR-Bench) and even enables small 3–8B models to outperform larger systems such as GPT-4o.
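The tree navigation described above can be sketched as a best-first search over an image quadtree. The snippet below is a hypothetical illustration, not the authors' implementation: `score_region` stands in for an MLLM call that rates how promising a crop is, and regions are simple `(x, y, w, h)` tuples.

```python
import heapq

def children(region):
    """Split a region into its four zoomed-in quadrants."""
    x, y, w, h = region
    hw, hh = w / 2, h / 2
    return [(x, y, hw, hh), (x + hw, y, hw, hh),
            (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]

def zoom_search(root, score_region, min_size=64, budget=20):
    """Best-first zoom: repeatedly expand the most promising region.

    `score_region` is a stand-in for querying the MLLM about a crop;
    the search stops when the call budget runs out or all remaining
    regions would fall below `min_size`.
    """
    best, best_score = root, score_region(root)
    frontier = [(-best_score, root)]        # max-heap via negated scores
    while frontier and budget > 0:
        neg_score, region = heapq.heappop(frontier)
        budget -= 1
        if -neg_score > best_score:
            best, best_score = region, -neg_score
        _, _, w, h = region
        if w / 2 >= min_size and h / 2 >= min_size:
            for child in children(region):
                heapq.heappush(frontier, (-score_region(child), child))
    return best
```

In a real pipeline the returned region would be re-fed to the model at higher resolution; here the sketch only localizes the area the scorer prefers.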
📄 Paper:
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
🎙 Presenter: Amir Kasaei
Session Details:
- 📅 Date: Tuesday, November 25th
- 🕒 Time: 3:00 PM - 4:00 PM
- 🌐 Location: Online at vc.sharif.edu/ch/rohban
We look forward to your participation! ✌️
🔐 ML Security Journal Club
✅ This Week's Presentation:
🔹 Title: Unlearning diffusion models
🔸 Presenter: Arian Komaei
🌀 Abstract:
This paper takes a hard look at the real-world reliability of concept erasure in text-to-image models. While many erasure methods look clean in controlled demos, their behavior collapses when concepts get interconnected or ambiguous. The authors identify two major gaps in current practice: the lack of evaluation across diverse concept types and the absence of systematic analysis of failure modes after erasure. They examine how removing one concept unintentionally damages others—visually similar, binomial, or semantically linked concepts—revealing widespread spillover effects. To tackle this, they introduce EraseBench, a large benchmark containing 100+ curated concepts, targeted prompts, and metrics that capture both erasure effectiveness and unintended degradation. Their findings show consistent concept entanglement, where erasing a target concept suppresses non-target ones and reduces generation quality, exposing significant limitations in current erasure techniques.
📄 Paper: Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
Session Details:
- 📅 Date: Sunday
- 🕒 Time: 3:30 - 4:30 PM
- 🌐 Location: Online at vc.sharif.edu/ch/rohban
We look forward to your participation! ✌️
🪢 Compositional Learning Journal Club
Join us this week for a deep dive into how CLIP actually represents multiple objects in an image—and where it silently goes wrong. We’ll look at subtle biases in both text and image encoders, how they interact with caption structure and object size, and what this means for downstream multimodal models and text-to-image generation.
🌟 This Week's Presentation
📌 Title:
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
🧠 Abstract:
Contrastive Language–Image Pre-training (CLIP) has become a workhorse for zero-shot classification and many vision–language tasks, but its behavior in complex scenes with multiple objects is far from fully understood. This session focuses on a systematic study of CLIP in controlled multi-object setups using ComCO, a dedicated dataset designed to probe how CLIP’s encoders handle object combinations and compositional structure.
We will discuss evidence that:
- The text encoder tends to over-focus on the first-mentioned object in a caption.
- The image encoder tends to favor larger objects in the scene.
- Small changes such as swapping token order or resizing objects can cause sharp drops in image–text matching and retrieval performance across multiple CLIP variants.
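The order-sensitivity finding in the first bullet suggests a simple probe: embed a caption and a reordered variant, then measure how far apart they land. The sketch below is purely illustrative; `embed` is a toy position-weighted hash embedding standing in for CLIP's text encoder (which is what the ComCO study actually probes).

```python
import hashlib
import math

def embed(caption, dim=64):
    """Toy position-weighted text embedding standing in for CLIP's
    text encoder; earlier tokens get larger weights, mimicking the
    first-object bias described above."""
    vec = [0.0] * dim
    for i, tok in enumerate(caption.lower().split()):
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        weight = 1.0 / (1 + i)              # earlier tokens dominate
        for d in range(dim):
            vec[d] += weight * (((h >> d) & 1) * 2 - 1)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def order_sensitivity(cap_a, cap_b):
    """1 - cosine similarity between a caption and a reordered variant."""
    return 1.0 - cosine(embed(cap_a), embed(cap_b))
```

With a real CLIP text encoder, a sensitivity well above zero for object swaps would reproduce the first-object bias; the toy encoder just makes the probe runnable end to end.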
📄 Paper:
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
🎙 Presenter: Dr MH Rohban
Session Details:
- 📅 Date: Tuesday, December 2nd
- 🕒 Time: 3:00 PM - 4:00 PM
- 🌐 Location: Online at vc.sharif.edu/rohban
We look forward to your participation! ✌️
🔐 ML Security Journal Club
✅ This Week's Presentation:
🔹 Title: On the Impossibility of Retrain Equivalence in Machine Unlearning
🔸 Presenter: Arian Komaei
🌀 Abstract:
The paper argues that modern multi-stage training pipelines create a fundamental obstacle for machine unlearning. The ideal goal—Retrain Equivalence—is for an unlearned model to behave exactly like one retrained from scratch without the forgotten data. But the authors show, both theoretically and empirically, that this is often impossible: once training happens in multiple stages with different data and objectives, the model’s behavior becomes path-dependent. That means the order of training steps permanently affects how unlearning works, and “local” unlearning methods that only use gradients from the forget set can’t universally reach Retrain Equivalence. Experiments on Llama and Qwen models (1B–14B) confirm strong divergence: the same data but different training orders lead to very different unlearning outcomes, with accuracy dropping by 20% across paths. Some training paths also produce models that are inherently harder to unlearn. Because multi-stage training is now standard and training histories are often unavailable, the paper concludes that Retrain Equivalence is the wrong target and the field needs to rethink what machine unlearning should actually aim for.
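The path-dependence claim has a minimal illustration: even for simple quadratic objectives, running two training stages in different orders leaves the weights in different places, so no "local" post-hoc step can map both endpoints to one common retrained model. The toy below is our own sketch, not the paper's construction.

```python
def stage_step(w, target, lr=0.5):
    """One gradient step on the quadratic loss 0.5 * (w - target)**2."""
    return w - lr * (w - target)

# Two training stages with different data/objectives, modeled as
# gradient steps toward different targets a and b.
a, b = 1.0, -1.0
w_ab = stage_step(stage_step(0.0, a), b)   # stage A, then stage B
w_ba = stage_step(stage_step(0.0, b), a)   # stage B, then stage A
# Same stages, different order -> different final weights (path dependence).
```

Here `w_ab` ends at -0.25 while `w_ba` ends at 0.25: the training history, not just the data, determines where the model lands.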
📄 Paper: On the Impossibility of Retrain Equivalence in Machine Unlearning
Session Details:
- 📅 Date: Sunday
- 🕒 Time: 3:30 - 4:30 PM
- 🌐 Location: Online at vc.sharif.edu/ch/rohban
We look forward to your participation! ✌️
🪢 Compositional Learning Journal Club
Join us this week for a critical exploration of robustness in Visual Question Answering systems and the broader implications for visual–language model reliability. We’ll analyze how even subtle, meaning-preserving changes to inputs can destabilize model outputs and discuss what this means for future evaluation and model design.
🌟 This Week's Presentation
📄 Paper:
Questioning the Stability of Visual Question Answering
🧠 Abstract:
Modern Visual Language Models (VLMs) have achieved impressive performance on a wide range of visual reasoning tasks, yet fundamental questions remain about their robustness to benign input perturbations. This paper presents the first large-scale, systematic study of how VLMs respond to small, meaning-preserving changes—such as pixel shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites—that do not change the true semantics of an image–question pair.
Across multiple datasets and models, the authors find that minor visual or textual perturbations frequently lead to different predicted answers, even for state-of-the-art systems like GPT-4o and Gemini 2.0 Flash. They also show that stability under perturbations correlates strongly with correctness, and that the stability patterns of small open-source models can be used to predict when larger models will fail.
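The stability notion above reduces to a simple agreement score: the fraction of meaning-preserving perturbations under which the model's answer is unchanged. The function below is a generic sketch of that measurement, not the paper's exact protocol; `model` and the perturbation list are stand-ins.

```python
def stability(model, image, question, perturbations):
    """Fraction of perturbed inputs whose answer matches the original.

    `model(image, question)` returns an answer string; each perturbation
    maps (image, question) to a meaning-preserving variant (pixel shift,
    paraphrase, etc.).
    """
    reference = model(image, question)
    agree = sum(
        model(*perturb(image, question)) == reference
        for perturb in perturbations
    )
    return agree / len(perturbations)
```

A score of 1.0 means fully stable; given the correlation with correctness reported above, low scores would flag answers that are likely wrong.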
In this session, we’ll discuss:
• What kinds of input changes most disrupt VQA predictions.
• How stability can serve as a proxy for reliability and model confidence.
• Implications for evaluation benchmarks and future model development.
🎙 Presenter: Amir Kasaei
Session Details:
- 📅 Date: Tuesday, December 23rd
- 🕒 Time: 3:00 PM - 4:00 PM
- 🌐 Location: Online at vc.sharif.edu/ch/rohban
We look forward to your participation! ✌️
🪢 Compositional Learning Journal Club
Join us this week for a fascinating dive into how multimodal language models can think visually by drawing — mimicking a human’s use of sketches to guide reasoning and solve complex tasks.
🌟 This Week's Presentation
📄 Paper:
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
🧠 Abstract:
Multimodal LLMs are strong at visual reasoning, but they typically rely on text-only intermediate steps. This paper introduces Visual Sketchpad, which gives MLLMs a lightweight drawing interface (e.g., lines, boxes, marks) so they can create visual intermediate steps while reasoning—similar to how humans sketch when solving problems. By integrating these sketch actions (and optionally leveraging vision modules during sketching), the approach improves performance across a wide range of tasks, including math/geometry, graphs, and spatial reasoning.
🎙 Presenter: Amir Kasaei
Session Details:
- 📅 Date: Tuesday, December 30
- 🕒 Time: 3:00 - 4:00 PM
- 🌐 Location: Online at vc.sharif.edu/ch/rohban
We look forward to your participation! ✌️
Hey folks! 👋
Our team is working on foundation models, with a strong focus on interpretability and reliability. We’re opening applications for new team members who are excited to learn, take ownership of tasks, and contribute consistently to solid, hands-on research and experimentation.
Highly motivated bachelor’s students and junior undergraduates are especially encouraged to apply. Selection prioritizes motivation, dedication, perseverance, strong fundamentals, and consistent follow-through over grades. Experience with PyTorch is a major plus. Ideal candidates are comfortable implementing clean, reproducible experiments and communicating progress clearly. 👉 Apply Here.
Looking forward to doing deep, high-impact, and fun science together! 🚀🥰🔥
📢 Open RA Positions: Generative Models (RIML & TSAIL Labs)
Supervisors: Dr. Rohban & Dr. Sadeghzadeh (Sharif University of Technology)
Focus: Robustness, Interpretability, and Trustworthiness in Generative Models.
Goal: NeurIPS/ICLR submissions (4-month timeline).
✅ Requirements:
- Strong background in ML/AI and Generative Models
- Proficiency in Python/PyTorch
- Commitment: 20–30 hours/week
⚠️ Note: We cannot accept applicants who currently hold a full-time job, students who hold a part-time job, or anyone with another research position elsewhere.
📚 Background Reading
- http://arxiv.org/abs/2410.15618
- http://arxiv.org/abs/2305.10120
👉 Apply Here
We invite interested collaborators to join an ongoing research project that aims to redefine attention mechanisms in large language models, drawing inspiration from the concept of consciousness. The goal of this work is to produce results suitable for submission to NeurIPS 2026. This research is conducted under the supervision of Dr. Rohban and is led by his PhD student, Hamidreza Akbari. Motivated undergraduate students with a strong background or interest in deep learning/LLMs and related fields are encouraged to reach out to @hamidrakbari to explore potential collaboration opportunities.
Students interested in serving as teaching assistants for Dr. Rohban's Reinforcement Learning course this semester, please fill out the form below.
https://docs.google.com/forms/d/e/1FAIpQLSe2rPGlxxTQDzDnt6PB_hRAtBa64_OLmWLE_dUnPqQlAfpUqQ/viewform?usp=dialog
🔘 Open Research Positions: Shortcut Learning and Spurious Correlation
We are looking for motivated students to join our research projects.
🔍 Project Description
Shortcut learning occurs when models rely on spurious or overly simple patterns instead of learning the intended underlying task. Our research aims to understand why and how shortcuts emerge, and how they can be identified, analyzed, and mitigated.
We offer three research assistant positions aligned with the following directions, co-advised by Dr. Rohban and Dr. Soleymani:
🔹 1. Understanding the Implicit Effects of Inductive Biases on Shortcut Learning
This project investigates how training-related inductive biases, such as batch size, learning rate, and loss function, influence the emergence of shortcut learning and group robustness.
[1] The Silent Helper: How Implicit Regularization Enhances Group Robustness (HiLD Workshop ICLR 2025)
[2] On the Role of Implicit Regularization of Stochastic Gradient Descent in Group Robustness (ICLR 2026)
🔗 Apply here: Google Form
🔹 2. Understanding Shortcut Learning Through the Lens of Task Arithmetic
It has been observed that shortcut features are often learned early during training. By analyzing the trajectory of weight evolution, we aim to identify shortcut task directions in parameter space and leverage this understanding to mitigate shortcut reliance.
[3] Editing Models with Task Arithmetic (ICLR 2023)
🔗 Apply here: Google Form
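The task-arithmetic idea referenced in [3] can be sketched in a few lines: a task vector is the weight delta between a finetuned and a pretrained model, and subtracting (negating) it edits the corresponding behavior back out. The toy below is our own illustration of the mechanism on a 4-parameter "model", not a shortcut-removal result; real usage applies the negation to full network weights with a tuned scaling coefficient.

```python
# Toy task arithmetic (our own illustration; see [3] for the real
# formulation on full networks).
w_pretrained = [0.3, -1.2, 0.7, 0.05]
shortcut_delta = [0.5, 0.1, -0.4, 0.9]   # stands in for shortcut-driven finetuning

w_finetuned = [w + d for w, d in zip(w_pretrained, shortcut_delta)]

# Task vector: the direction finetuning moved the weights in parameter space.
task_vector = [f - p for f, p in zip(w_finetuned, w_pretrained)]

# Negating the task vector edits that update back out.
w_edited = [f - t for f, t in zip(w_finetuned, task_vector)]
```

In project 2 above, the interesting question is whether the *shortcut* component of such a delta can be isolated from the trajectory of weight evolution and negated on its own.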
🔹 3. Investigating Shortcut Learning in Large-Scale Models
Shortcut-based generalization failures were first studied in linear models and simple settings. In this project, we aim to understand how these biases propagate to large-scale models, including language models, and explore strategies to mitigate them.
[4] Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models (EMNLP 2024)
[5] Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful (NeurIPS 2025)
🔗 Apply here: Google Form
📌 Important Notes:
Please carefully read the Internship Ethics and Guidelines before submitting your application.
Submitting the application form does not guarantee acceptance.
Only shortlisted candidates will be contacted via email by March 15.