How Useful is Self-Supervised Pretraining for Visual Tasks?
A relatively old paper (CVPR 2020), by our fast-paced standards. Nevertheless, it offers a pair of practical takeaways.
The authors created a synthetic dataset with several degrees of freedom to vary difficulty: it ranges from almost monochrome objects to randomized textures and object placement in the image.
The goal was to compare how well different self-supervised approaches help fine-tuning for different downstream tasks, from classification to depth estimation.
The two practical takeaways are:
1. The utility of a self-supervised method depends wildly on the task, the amount of labeled data, and even the data complexity.
2. The linear evaluation score, so popular in papers, has almost no correlation with actual fine-tuning results (the two protocols are sketched below).
The authors found that self-supervised pre-training gives no improvement when lots of labeled data are available (which has become fairly well known since then). Based on this, they hypothesize that the benefit of SSL pre-training is a kind of regularization rather than optimization: SSL pre-training helps to find a wider optimum, not a better one. Though, to really support this claim, some kind of loss-landscape analysis would be more convincing.
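A minimal sketch of the two evaluation protocols from takeaway 2, written in PyTorch purely for illustration (the function name, learning rates, and the assumption that the backbone outputs flat features are mine, not the paper's): a linear probe freezes the pretrained backbone and trains only a linear head, while fine-tuning updates everything.

```python
import torch
import torch.nn as nn

def build_eval_setup(backbone: nn.Module, feat_dim: int, n_classes: int, mode: str):
    """mode="linear": freeze the pretrained backbone, train only a linear head.
    mode="finetune": train the backbone and the head jointly.
    Assumes the backbone maps an image batch to (N, feat_dim) features."""
    head = nn.Linear(feat_dim, n_classes)
    if mode == "linear":
        for p in backbone.parameters():
            p.requires_grad = False  # linear evaluation: the backbone stays fixed
    model = nn.Sequential(backbone, head)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.1 if mode == "linear" else 0.01, momentum=0.9)
    return model, optimizer
```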
Source: here
Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration
from NeurIPS 2021.
It has already been noted that the quality of contrastive learning may suffer from overly intense augmentations. In this paper, the authors go one step further and try to understand the source of this.
The main hypothesis: if the augmentations are too intense, the assumption that the image information is invariant to augmentation simply breaks. That is, we augment images so hard that it is no longer meaningful to ask the model to predict close embeddings for such different inputs.
To mitigate this, the authors propose to model the distribution of view embeddings (positive samples, i.e. different augmentations of the same image) as a normal distribution with a shared covariance matrix (experiments show the shared covariance matrix to be surprisingly effective), and then to weight each component of the loss by a normalized distance between the two views that this component pulls together. The distance here is the Mahalanobis distance defined by the fitted distribution.
To put it simply: if two positive samples are too far away from each other, maybe they are not so positive after all?
This keeps contrastive methods from over-relying on the augmentation-invariance assumption, and also makes them more aware of what happens in the embedding space itself.
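Below is a rough sketch of how such a weighting could look in PyTorch. It is my own illustration of the idea as I read it (a Gaussian with a shared covariance fitted over view differences, Mahalanobis-based down-weighting of suspicious positives), not the authors' exact formulation; the function name and the normalization choice are assumptions.

```python
import torch

def positive_pair_weights(z1, z2, eps=1e-6):
    # z1, z2: (N, D) embeddings of two augmented views of the same N images.
    diffs = z1 - z2                                    # per-pair view differences
    # Shared covariance of the view differences (one matrix for all pairs), plus a ridge term.
    cov = diffs.T @ diffs / diffs.shape[0] + eps * torch.eye(diffs.shape[1], device=diffs.device)
    cov_inv = torch.linalg.inv(cov)
    # Squared Mahalanobis distance between the two views of each positive pair.
    d2 = (diffs @ cov_inv * diffs).sum(dim=1)          # shape (N,)
    # Turn distances into normalized weights: pairs whose views are far apart count less.
    return torch.softmax(-d2, dim=0) * d2.shape[0]     # mean weight is 1

# Usage with any per-pair contrastive loss (e.g. per-sample InfoNCE terms of shape (N,)):
# w = positive_pair_weights(z1, z2).detach()
# loss = (w * per_pair_loss).mean()
```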
The authors demonstrate consistent improvements across different contrastive losses.
Source: here
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
from ICML 2020.
It was previously noted that if one swaps the contrastive loss for a tighter bound on mutual information, downstream quality decreases. The authors therefore propose to move from the InfoMax intuition to two rather simple concepts: alignment and uniformity. The former enforces that positive pairs stay as close as possible, while the latter enforces that all samples are spread as evenly as possible over the hypersphere.
Both components are empirically important for downstream performance. Moreover, optimizing them directly can outperform training with the classical contrastive loss.
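The two losses are simple enough to write out directly. The sketch below follows the paper's formulation as I understand it (squared Euclidean distance for alignment, the log of the average pairwise Gaussian potential for uniformity); alpha=2 and t=2 are, to my knowledge, the defaults used there.

```python
import torch

def align_loss(x, y, alpha=2):
    # Alignment: two views of the same image should map to nearby points.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # Uniformity: embeddings should spread evenly over the unit hypersphere
    # (log of the average pairwise Gaussian potential).
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

# x, y: L2-normalized embeddings of two views, shape (N, D)
# loss = align_loss(x, y) + (uniform_loss(x) + uniform_loss(y)) / 2
```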
With images and in a bit more detail: here
Source: here
Well, it has been more than three years since the last post here. In those three years a lot has changed: I finished my PhD at Heidelberg University and moved to JetBrains to lead a team working on AI agents. With all this on my hands, I will have even less time for writing the kind of reviews I'd like to read. On the other hand, I'd still like to share the papers I read.
So instead, I will post links to the papers that I read. You can view this experiment as copycatting @j_links, but with a bias towards LLMs and probably agents.