Forwarded from Gradient Dude
🦾 Main experiments
Pretrain on ImageNet -> finetune on COCO or PASCAL:
1. Pretrain on ImageNet in a self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretrained backbone to initialize Mask R-CNN and fine-tune it with GT labels for 12 epochs on COCO or 45 epochs on PASCAL (semantic segmentation).
3. Achieve SOTA results while using 5x fewer pretraining epochs than SimCLR.
Pretrain on COCO -> finetune on PASCAL for the semantic segmentation task:
1. Pretrain on COCO in a self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretrained backbone to initialize Mask R-CNN and fine-tune it with GT labels for 45 epochs on PASCAL (semantic segmentation).
3. Achieve SOTA results while using 4x fewer pretraining epochs than SimCLR.
4. For the first time, a self-supervised pretrained ResNet-50 backbone outperforms supervised pretraining on COCO.
📝 Paper: Efficient Visual Pretraining with Contrastive Detection
My additional bits on method here.
Contrast to Divide: a pretty practical paper on using self-supervision to tackle learning with noisy labels.
Self-supervision as a plug-and-play technique shows good results in more and more areas. As the authors show in this paper, simply replacing supervised pre-training with self-supervised pre-training improves both results and stability. This comes from removing label-noise memorisation (or the domain gap, in the transfer case) from the warm-up stage of training, thereby maintaining a better separation between classes.
Source here.
Self-supervised pre-training for brain cortex segmentation ("Improving Cytoarchitectonic Segmentation of Human Brain Areas with Self-supervised Siamese Networks"): a paper from MICCAI 2018.
Quite an old paper (for this boiling-hot area), but with an interesting take. The authors set up metric-learning pre-training, but instead of a 3D metric they estimate the geodesic distance along the brain surface between cuts taken orthogonal to the surface. Why? Because the brain cortex is a relatively thin structure along the curved brain surface, so areas are separated not as patches of 3D space but as patches on this surface. The authors demonstrate how the predicted distance between adjacent slices then aligns with the ground-truth borders of the areas.
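To make the setup concrete, here is a minimal sketch of how such a pre-training objective could look: a siamese encoder embeds two cortical patches and a small head regresses the precomputed geodesic distance along the surface between the corresponding cuts. The architecture, sizes and MSE objective are my assumptions for illustration, not the paper's exact model.

```python
# Hypothetical sketch of geodesic-distance pre-training for a siamese encoder.
# The encoder, head and MSE objective are assumptions for illustration; the
# geodesic distances themselves would come from the brain-surface geometry.
import torch
import torch.nn as nn

class SiameseGeodesic(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for the real encoder
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.head = nn.Linear(2 * feat_dim, 1)   # predicts distance for the pair

    def forward(self, patch_a, patch_b):
        za, zb = self.encoder(patch_a), self.encoder(patch_b)
        return self.head(torch.cat([za, zb], dim=1)).squeeze(1)

model = SiameseGeodesic()
patch_a, patch_b = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
geodesic_dist = torch.rand(8) * 50.0             # precomputed surface distances
loss = nn.functional.mse_loss(model(patch_a, patch_b), geodesic_dist)
loss.backward()
```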
Although the presented result is better than the naïve baseline, I wouldn't be astonished if other pre-training techniques that have emerged since then provided good results as well.
With a few more words and one formula here.
Original here.
Instance Localization for Self-supervised Detection Pretraining: a paper on the importance of task-specific pre-training.
The authors hypothesise about the problems of popular self-supervised pre-training frameworks w.r.t. the localisation task. They arrive at the observation that no loss term enforces localisation of object representations, so they propose a new one. To make two contrastive views of one image, they crop two random parts of it and paste those parts onto random images from the same dataset. The network then embeds the composited images, but the loss contrasts only the region of the embedding corresponding to the pasted crop, instead of the whole-image embedding as is usually done.
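A rough sketch of my reading of the view construction: paste a random crop onto two different background images, remember the paste boxes, and contrast only the RoI-pooled features of those boxes instead of the global embeddings. The toy backbone, the roi_align pooling and all sizes are assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch: paste a crop into background images and pool only the
# pasted region's features for the contrastive loss.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

def paste_crop(crop, background, x, y):
    """Paste `crop` (C,h,w) into `background` (C,H,W) at (x, y); return image and box."""
    img = background.clone()
    _, h, w = crop.shape
    img[:, y:y + h, x:x + w] = crop
    box = torch.tensor([[x, y, x + w, y + h]], dtype=torch.float32)
    return img, box

backbone = nn.Conv2d(3, 64, 3, stride=8, padding=1)   # stand-in backbone, stride 8

crop = torch.rand(3, 64, 64)
img1, box1 = paste_crop(crop, torch.rand(3, 224, 224), x=16, y=32)
img2, box2 = paste_crop(crop, torch.rand(3, 224, 224), x=96, y=80)

feats = backbone(torch.stack([img1, img2]))            # (2, 64, 28, 28)
regions = roi_align(feats, [box1, box2], output_size=(7, 7), spatial_scale=1 / 8)
z1, z2 = regions.flatten(1)                            # region embeddings to contrast
sim = nn.functional.cosine_similarity(z1, z2, dim=0)   # fed into an InfoNCE-style loss
```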
Interestingly, the proposed loss not only provides SotA pre-training for the localisation task but also degrades classification quality. This is a practically important finding: while general representations keep getting better, a task-specific pre-training may matter more than a SotA one tailored for another task.
More details here.
Source here.
SelfReg: a paper on contrastive learning for domain generalisation.
Domain generalisation methods focus on training models that work on new domains without any transfer. The authors propose to adapt the popular contrastive learning framework to this task.
To form a positive pair, they sample two examples of the same class from different domains. Compared to classical contrastive learning it is, roughly, different domains instead of different augmentations, and different classes instead of different samples.
To avoid the burden of mining good negative samples, the authors adapt the BYOL idea and employ a projection network to prevent representation collapse.
Suppose we have f as the neural network under training, g as a trainable linear layer giving a projection of the representation, and x_ck as a random sample of class c from domain k. As the loss itself the authors use two squared L2 distances:
1. |f(x_cj) - g(f(x_ck))|^2
2. |f(x_cj) - (l*g(f(x_cj)) + (1-l)*g(f(x_ck)))|^2, where l ~ Beta.
NB! In the second loss, the right part is a linear mixture of sample representations from different domains.
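A toy PyTorch version of the two distances above; the Beta parameters, the network shapes and the way the two terms are combined are my assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of the two SelfReg-style distances described above.
# x_cj, x_ck are same-class samples from two domains; f is the network,
# g a trainable linear projection.
import torch
import torch.nn as nn

dim = 128
f = nn.Sequential(nn.Linear(32, dim), nn.ReLU(), nn.Linear(dim, dim))  # stand-in network
g = nn.Linear(dim, dim)                                                # projection layer

x_cj, x_ck = torch.randn(16, 32), torch.randn(16, 32)   # same class c, domains j and k
z_j, z_k = f(x_cj), f(x_ck)

# 1) individual term: |f(x_cj) - g(f(x_ck))|^2
loss_ind = ((z_j - g(z_k)) ** 2).sum(dim=1).mean()

# 2) mixed term: |f(x_cj) - (l*g(f(x_cj)) + (1-l)*g(f(x_ck)))|^2, l ~ Beta
l = torch.distributions.Beta(0.5, 0.5).sample((16, 1))
mixed = l * g(z_j) + (1 - l) * g(z_k)
loss_mix = ((z_j - mixed) ** 2).sum(dim=1).mean()

loss = loss_ind + loss_mix   # added to the usual classification loss
```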
By minimising the presented loss alongside the classification loss itself, the authors achieve a well-separated latent-space representation and get close to SotA without additional tricks.
Source could be found here.
Exploring Visual Engagement Signals for Representation Learning: a recent arXiv paper from Facebook with an interesting source of supervision for training.
In this paper the authors propose to use comments and reactions (Facebook-style likes) as the source of supervision for pre-training. The proposed method is simple and therefore scales well. For each image the authors collect two pseudo-labels (a toy sketch follows the list):
1. All reactions are counted and normalised to sum to 1; this distribution is used as the target for a cross-entropy loss.
2. Each comment is converted to a bag of words, the embedding is weighted via TF-IDF, and a cluster id is assigned with kNN (where the "fitting" of the clustering is done on a random subset of comments from the same dataset). The cluster ids of all comments on the image are then used together as the target for a multi-label classification loss.
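A toy sketch of the two pseudo-labels; the choice of k-means for the comment clustering and the tiny sizes are my assumptions for illustration, not the paper's exact pipeline.

```python
# Hypothetical sketch of the two pseudo-labels described above.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# 1) Reaction pseudo-label: counts normalised to a distribution (soft CE target).
def reaction_target(reaction_counts):
    counts = np.asarray(reaction_counts, dtype=float)
    return counts / counts.sum()

print(reaction_target([12, 3, 0, 5]))  # -> [0.6, 0.15, 0.0, 0.25]

# 2) Comment pseudo-label: TF-IDF bag of words, clustering "fitted" on a random
#    subset of comments, cluster ids of an image's comments pooled into a
#    multi-hot target for multi-label classification.
comment_subset = ["so cute", "love this", "terrible take", "great photo",
                  "what a view", "not funny", "amazing", "old meme"]
vectorizer = TfidfVectorizer()
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(vectorizer.fit_transform(comment_subset))

def comment_target(image_comments, n_clusters=4):
    ids = kmeans.predict(vectorizer.transform(image_comments))
    target = np.zeros(n_clusters)
    target[np.unique(ids)] = 1.0
    return target

print(comment_target(["love this photo", "so so cute"]))
```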
This approach shows a slight-to-medium improvement on tasks closely related to multi-modal learning, e.g. meme intent classification or predicting the political leaning of an image.
While the method description is somewhat messy and the method itself requires enormous training time (10 days on 32 V100s; chances are additional annotation would be cheaper), this paper once again shows an interesting way of getting supervision for pre-training.
Variance-Invariance-Covariance Regularisation (VICReg): a fresh paper on self-supervised training. Kind of a follow-up to the idea raised by Barlow Twins.
In this paper the authors propose a three-fold loss function (a rough sketch follows the list) which:
1. Prevents representation collapse by enforcing high variance across different embedding vectors.
2. Decreases representation redundancy by decorrelating the dimensions of the embedding space.
3. Enforces invariance of the embedded vectors to different augmentations by pulling different embeddings of the same image together.
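A minimal sketch of the three terms as I understand them; the loss coefficients and epsilon are assumptions, not necessarily the paper's exact hyper-parameters.

```python
# Hypothetical sketch of a VICReg-style loss on two batches of embeddings
# z1, z2 (two augmented views of the same images).
import torch
import torch.nn.functional as F

def vicreg_like_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    n, d = z1.shape
    # invariance: pull the two views of the same image together
    sim = F.mse_loss(z1, z2)
    # variance: hinge on the per-dimension std to prevent collapse
    std1 = torch.sqrt(z1.var(dim=0) + eps)
    std2 = torch.sqrt(z2.var(dim=0) + eps)
    var = torch.mean(F.relu(1.0 - std1)) + torch.mean(F.relu(1.0 - std2))
    # covariance: penalise off-diagonal covariance to decorrelate dimensions
    z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)
    cov1 = (z1c.T @ z1c) / (n - 1)
    cov2 = (z2c.T @ z2c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = (off_diag(cov1) ** 2).sum() / d + (off_diag(cov2) ** 2).sum() / d
    return sim_w * sim + var_w * var + cov_w * cov

z1, z2 = torch.randn(256, 128), torch.randn(256, 128)
print(vicreg_like_loss(z1, z2))
```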
This loss avoids both the burden of negative-sample mining and the crafty tricks employed by other methods. The authors demonstrate that their method is roughly on par with SoTA while avoiding all that, and it has some additional benefits: e.g. it is free of explicit normalisations and largely free of batch-size dependency.
A bit longer overview here.
Source here.
Not much more to say about it. Building transformers for computer vision tasks is an emerging topic, and this is a more or less technical paper showing the progress of transformers towards replacing ResNet-like architectures wherever they are still used, this time (not the first attempt, though) in self-supervision.
The loss itself is a superposition of the MoCo v2 and BYOL approaches: it simultaneously has the negative-example queue and contrastive loss from MoCo v2 and the model-level asymmetry from BYOL, where one branch is momentum-updated.
Results are on par with SoTA in the linear evaluation scheme, but the method (1) needs fewer complex tricks and (2) is applicable to transfer learning towards detection and segmentation.
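For reference, a bare-bones sketch of the two ingredients being combined: a momentum-updated key branch (BYOL-style asymmetry) and a queue of negatives with an InfoNCE loss (MoCo-style). This is generic machinery with assumed sizes and temperature, not the paper's exact implementation.

```python
# Generic sketch of a momentum-updated key encoder plus a negative queue.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, queue_size, tau, m = 128, 4096, 0.2, 0.99

online = nn.Linear(512, dim)                       # stand-in for encoder + heads
target = nn.Linear(512, dim)
target.load_state_dict(online.state_dict())
for p in target.parameters():
    p.requires_grad = False

queue = F.normalize(torch.randn(queue_size, dim), dim=1)   # negatives from past batches

def step(x1, x2):
    global queue
    q = F.normalize(online(x1), dim=1)             # query from the online branch
    with torch.no_grad():
        k = F.normalize(target(x2), dim=1)         # key from the momentum branch
    pos = (q * k).sum(dim=1, keepdim=True) / tau
    neg = q @ queue.T / tau
    loss = F.cross_entropy(torch.cat([pos, neg], dim=1),
                           torch.zeros(len(q), dtype=torch.long))
    with torch.no_grad():                          # momentum update + queue refresh
        for po, pt in zip(online.parameters(), target.parameters()):
            pt.mul_(m).add_((1 - m) * po)
        queue = torch.cat([k, queue])[:queue_size]
    return loss

print(step(torch.randn(32, 512), torch.randn(32, 512)))
```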
Forwarded from Just links
Self-Supervised Learning with Swin Transformers https://arxiv.org/abs/2105.04553
Contrastive Conditional Transport for Representation Learning.
This paper tries to make a step similar to the GAN -> WGAN step, but for representation learning. At first the authors propose, instead of training with the more-or-less classical SimCLR loss, to simply minimise (C+) - (C-), where (C+) is the mean distance between the anchor and positive samples (different views of the anchor) and (C-) is the mean distance between the anchor and negative samples.
When this alone does not work out, the authors add a more complex weighting procedure: positive samples are weighted with respect to their distance to the anchor (larger distance, larger weight) and vice versa for the negative samples (smaller distance, larger weight).
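My reading of the weighted objective, as a toy sketch; the softmax-based weights and the temperature are assumptions about how "larger distance, larger weight" could be implemented, not the paper's exact form.

```python
# Toy sketch of a weighted (C+) - (C-) objective as I read the description.
import torch
import torch.nn.functional as F

def cct_like_loss(anchor, positives, negatives, tau=0.5):
    """anchor: (d,), positives: (P, d), negatives: (N, d)."""
    d_pos = ((positives - anchor) ** 2).sum(dim=1)   # distances to positive views
    d_neg = ((negatives - anchor) ** 2).sum(dim=1)   # distances to negatives
    w_pos = F.softmax(d_pos / tau, dim=0)            # farther positives weigh more
    w_neg = F.softmax(-d_neg / tau, dim=0)           # closer negatives weigh more
    return (w_pos * d_pos).sum() - (w_neg * d_neg).sum()

anchor = torch.randn(128)
loss = cct_like_loss(anchor, torch.randn(4, 128), torch.randn(64, 128))
```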
Although the description of the idea is somewhat chaotic, the reported results look good. One more positive side-effect: this loss easily works with multiple positive samples drawn in one minibatch.
Source here.
It was quite a vacation, huh. Now back to the matter.
Object-aware Contrastive Learning for Debiased Scene Representation, from current NeurIPS.
The authors propose to alter the Class Activation Map (CAM) method a bit to make it ready for contrastive learning. They name the thing ContraCAM. It's just the usual CAM with:
1. loss replaced with contrastive loss
2. negative gradients dropped
3. iterative accumulation of the masks.
And this by itself yields unsupervised object localization with SoTA IoU.
Based on this localization, the authors proposed two augmentations to reduce negative biases in contrastive learning:
1. guided random crop, to avoid having multiple objects on one image; this avoids over-reliance on co-occurring objects.
2. replacing the background (using a soft mask of the localization); this helps to avoid over-reliance on the typical background for the sample (a toy sketch of this augmentation is shown below).
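A tiny sketch of the background replacement, assuming a soft object mask in [0, 1] (in the method it would come from ContraCAM) and backgrounds borrowed from other samples in the batch:

```python
# Toy sketch of background replacement with a soft localization mask.
# The masks here are random only to keep the snippet self-contained.
import torch

images = torch.rand(8, 3, 224, 224)          # a batch of images
masks = torch.rand(8, 1, 224, 224)           # soft object masks in [0, 1]

perm = torch.randperm(images.size(0))        # borrow backgrounds from other samples
backgrounds = images[perm]

augmented = masks * images + (1 - masks) * backgrounds
```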
Since the localization is obtained without additional information, this is still a self-supervised approach and can therefore be directly compared with other self-supervised methods.
The authors compare these augmentations driven by the self-supervised localization and by ground-truth masks, and find that both variants give a notable boost over the MoCo v2 and BYOL results.
More and with images here.
Source here.
PiCIE: Unsupervised Semantic Segmentation using Invariance and Equivariance in Clustering
Paper from CVPR'21
There is a more or less classical approach to deep unsupervised segmentation: cluster your embeddings and use the clusters as pseudo-labels; add tricks, repeat multiple times. In this paper the authors go one step further and unite it with self-supervision: they design a loss function that enforces invariance of the clustered representations to color augmentations and equivariance to spatial augmentations.
The algorithm of the loss calculation is:
1. Get two representations of the same image, both disturbed with different color augmentations but with the same spatial augmentation. In the first case the image is disturbed before going through the network; in the second, the output of the network is disturbed. So ideally both outputs should be identical, which shows invariance and equivariance to the color and spatial augmentations respectively. I will name these representations z1 and z2.
2. For each of those outputs, run k-means clustering of the embeddings. I will name the obtained centroids µ1 and µ2.
3. The next step finally mixes those two spaces. Let's say that L(z, µ) is a loss that, for each vector in z, brings it closer to the nearest vector of µ (prototype learning waves). Then:
3.1. We enforce clustering within each representation with L(z1, µ1) + L(z2, µ2).
3.2. We enforce that this clustering itself holds across the representations with L(z1, µ2) + L(z2, µ1).
And that's it. Training with this approach achieves SoTA on unsupervised segmentation and shows qualitatively good object masks. The most improved part is "thing" (foreground object) segmentation, which is systematically problematic for unsupervised learning because of the huge imbalance in class sizes.
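To make the two terms concrete, here is a toy version of L(z, µ) and the losses from 3.1/3.2; assigning each embedding to its nearest centroid with a squared-distance penalty is my assumption, and the paper's exact formulation may differ.

```python
# Toy sketch of the within-view and cross-view clustering terms as I read them.
import torch

def L(z, mu):
    """For each vector in z (N, d), pull it towards the nearest centroid in mu (K, d)."""
    dists = torch.cdist(z, mu) ** 2          # (N, K) squared distances
    return dists.min(dim=1).values.mean()    # distance to the nearest centroid

z1, z2 = torch.randn(1024, 64), torch.randn(1024, 64)   # pixel embeddings of the two views
mu1, mu2 = torch.randn(27, 64), torch.randn(27, 64)     # k-means centroids per view

loss_within = L(z1, mu1) + L(z2, mu2)        # 3.1: clustering within each view
loss_cross = L(z1, mu2) + L(z2, mu1)         # 3.2: clustering consistent across views
loss = loss_within + loss_cross
```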
More here.
Source here.
Lilian Weng's "Contrastive Representation Learning" post on Lil'Log is a good way to get a quick intro to the area.