Interactive Weak Supervision paper from ICLR 2021.
In contrast to classical active learning, where experts are queried to assess individual samples, the idea of this paper is to have experts assess automatically generated labeling heuristics. The authors argue that since experts are good at writing such heuristics from scratch, they should also be able to judge auto-generated ones. To rank the heuristics that have not yet been assessed, the authors propose to train an ensemble of models that predicts the assessor's mark for a heuristic. As input for these models they use a fingerprint of the heuristic: its concatenated predictions on some subset of the data.
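A minimal sketch of the fingerprint idea as I read it (the function names, the random-forest ensemble, and the binary "useful" mark are my own assumptions, not the authors' code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fingerprint(lf, X_subset):
    # lf maps one sample to a label, e.g. {-1, 0, +1} with 0 meaning "abstain"
    return np.array([lf(x) for x in X_subset])

def rank_candidates(assessed_lfs, expert_marks, candidate_lfs, X_subset):
    # Fit an ensemble on fingerprints of expert-assessed heuristics,
    # then score the unassessed candidates by predicted usefulness.
    F_train = np.stack([fingerprint(lf, X_subset) for lf in assessed_lfs])
    F_cand = np.stack([fingerprint(lf, X_subset) for lf in candidate_lfs])
    model = RandomForestClassifier(n_estimators=100)
    model.fit(F_train, expert_marks)            # expert_marks: 1 = useful, 0 = not
    return model.predict_proba(F_cand)[:, 1]    # higher score = query the expert next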
There are no very fancy results, the reviewers raised some concerns, and there is some strange notation in this paper. Yet the idea looks interesting to me.
With a slightly deeper description (and one unanswered question) here.
Source (and rebuttal comments with important links) there.
Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling | Notion
Transferable Visual Words: a paper that exploits the assumption that medical images are well aligned for its pseudo-labeling procedure.
The authors exploit the fact that the structure of medical images is fixed due to the imaging procedures and anatomical semantics. They generate pseudo-labels under the assumption that the same spatial region of different images represents more or less the same semantic features. To enforce this assumption further, they train an autoencoder and select training samples that are close in its latent space.
They use the index of the cropping region as the pseudo-label and train a denoising autoencoder with a classification head on these crops as the model for further fine-tuning.
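A rough sketch of the pseudo-labeling step as I understand it (the grid layout and crop size below are illustrative assumptions, not the authors' exact settings):

import numpy as np

def crop_with_pseudolabel(image, grid=(4, 4), crop_size=64, rng=np.random):
    # Because anatomy is roughly aligned across scans, a fixed crop position
    # can act as a pseudo-class shared across images.
    H, W = image.shape[:2]
    gi, gj = rng.randint(grid[0]), rng.randint(grid[1])
    pseudo_label = gi * grid[1] + gj              # index of the grid cell
    y0 = int(gi * (H - crop_size) / max(grid[0] - 1, 1))
    x0 = int(gj * (W - crop_size) / max(grid[1] - 1, 1))
    crop = image[y0:y0 + crop_size, x0:x0 + crop_size]
    return crop, pseudo_label  # crop feeds the denoising AE + classification head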
Not only does this method surpass the presented self-supervised baselines, it is also beneficial when combined with them for pre-training.
A more precise description of the training and labeling procedure here.
Original paper here.
Transferable Visual Words: Exploiting the Semantics of Anatomical Patterns for Self-supervised Learning | Notion
A self-supervision paper from arXiv on histopathology computer vision.
The authors draw inspiration from how histopathologists review images and how those images are stored. Histopathology images are multiscale slides of enormous size (tens of thousands of pixels per side), and domain experts constantly move between magnification levels to keep in mind both fine and coarse structures of the tissue.
Therefore, the paper proposes a loss that captures the relation between different magnification levels. The authors train the network to order concentric patches by their magnification level. They frame it as a classification task: the network predicts the id of the order permutation instead of predicting the order itself.
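A small sketch of that pretext task (the number of magnification levels and patch handling are my assumptions for illustration):

import itertools
import random

MAGNIFICATIONS = [1, 2, 4, 8]                       # relative zoom levels (assumed)
PERMUTATIONS = list(itertools.permutations(range(len(MAGNIFICATIONS))))

def concentric_patches(read_region):
    # read_region: a callable returning a concentric crop at a given relative zoom;
    # in practice this would read from the slide pyramid.
    return [read_region(zoom) for zoom in MAGNIFICATIONS]

def make_training_example(read_region):
    patches = concentric_patches(read_region)
    perm_id = random.randrange(len(PERMUTATIONS))   # classification target
    order = PERMUTATIONS[perm_id]
    shuffled = [patches[i] for i in order]
    return shuffled, perm_id                        # the network predicts perm_id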
The authors also propose a specific architecture for this task and append a self-training procedure, as it was shown to boost results even after pre-training.
All this allows them to reach a quality increase even in the high-data regime.
My expanded description of the architecture and loss is here.
Source of the work here.
Self-supervised driven consistency training for annotation efficient histopathology image analysis | Notion
Forwarded from Gradient Dude
DetCon: The Self-supervised Contrastive Detection Method🥽
DeepMind
A new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations.
Object-based regions are identified with an approximate, automatic segmentation algorithm based on pixel affinity (bottom). These masks are carried through two stochastic data augmentations and a convolutional feature extractor, creating groups of feature vectors in each view (middle). The contrastive detection objective then pulls together pooled feature vectors from the same mask (across views) and pushes apart features from different masks and different images (top).
🌟Highlights
+ SOTA detection and Instance Segmentation (on COCO) and Semantic Segmentation results (on PASCAL) when pretrained in self-supervised regime on ImageNet, while requiring up to 5× fewer epochs than SimCLR.
+ It also outperforms supervised pretraining on Imagenet.
+ DetCon(SimCLR) converges much faster to reach SOTA: 200 epochs are sufficient to surpass supervised transfer to COCO, and 500 to PASCAL.
+ Linear increase in the number of model parameters (using ResNet-101, ResNet-152, and ResNet-200) brings a linear increase in the accuracy on downstream tasks.
+ Despite only being trained on ImageNet, DetCon(BYOL) matches the performance of Facebook's SEER model that used a higher capacity RegNet architecture and was pretrained on 1 Billion Instagram images.
+ For the first time, a ResNet-50 with self-supervised pretraining on COCO outperforms supervised pretraining for transfer to PASCAL.
+ The power of DetCon strongly correlates with the quality of the masks. The better the masks used during the self-supervised pretraining stage, the better the accuracy on downstream tasks.
⚙️ Method details
The authors introduce DetConS and DetConB, based on two recent self-supervised baselines, SimCLR and BYOL respectively, with a ResNet-50 backbone.
They adopt the data augmentation procedure and network architecture from these methods while applying the proposed Contrastive Detection loss to each.
Each image is randomly augmented twice, resulting in two images x, x'. In addition, they compute for each image a set of masks that segment the image into different components.
These masks can be computed using efficient, off-the-shelf, unsupervised segmentation algorithms. In particular, the authors use the Felzenszwalb-Huttenlocher algorithm, a classic segmentation procedure that iteratively merges regions using pixel-based affinity. This algorithm does not require any training and is available in scikit-image. If available, human-annotated segmentations can also be used instead of automatically generated ones. Each mask (represented as a binary image) is transformed using the same cropping and resizing as the underlying RGB image, resulting in two sets of masks {m}, {m'} which are aligned with the augmented images x, x'.
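For reference, a minimal scikit-image call (the parameter values here are a guess, not the paper's settings):

import numpy as np
from skimage.data import astronaut
from skimage.segmentation import felzenszwalb

image = astronaut()                                   # any RGB image
segments = felzenszwalb(image, scale=100, sigma=0.5, min_size=50)
masks = [segments == s for s in np.unique(segments)]  # one binary mask per region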
For every mask m associated with the image, the authors compute a mask-pooled hidden vector (i.e., similar to regular average pooling, but applied only to spatial locations belonging to the same mask).
Then a 2-layer MLP is used as a projection head on top of the mask-pooled hidden vectors. Note that if you replace mask-pooling with a single global average pooling, you get exactly the SimCLR or BYOL architecture.
A standard contrastive loss based on cross-entropy is used for learning. A positive pair is the latent representations of the same mask from the augmented views x and x'. Latent representations of different masks from the same image and from different images in the batch are used as negative samples. Moreover, negative masks are allowed to overlap with the positive one.
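A rough PyTorch-style sketch of mask pooling and a simplified version of this objective (single view pair, no cross-image negatives; shapes and temperature are my assumptions):

import torch
import torch.nn.functional as F

def mask_pool(features, masks):
    # features: (C, H, W) backbone feature map; masks: (K, H, W) binary masks
    # already resized to the feature resolution.
    masks = masks.float()
    pooled = torch.einsum('chw,khw->kc', features, masks)
    areas = masks.sum(dim=(1, 2)).clamp(min=1.0)
    return pooled / areas.unsqueeze(1)           # (K, C): one vector per mask

def contrastive_detection_loss(z, z_prime, temperature=0.1):
    # z, z_prime: (K, D) projected mask embeddings from the two views;
    # row k in both tensors corresponds to the same mask.
    z = F.normalize(z, dim=1)
    z_prime = F.normalize(z_prime, dim=1)
    logits = z @ z_prime.t() / temperature       # (K, K) similarities
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)      # other masks serve as negatives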
Forwarded from Gradient Dude
🦾 Main experiments
Pretrain on Imagenet -> finetune on COCO or PASCAL:
1. Pretrain on Imagenet in a self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 12 epochs on COCO or 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 5x fewer pretraining epochs than SimCLR.
Pretrain on COCO -> finetune on PASCAL for Semantic Segmentation task:
1. Pretrain on COCO in self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 4x fewer pretraining epochs than SimCLR.
4. The first time a self-supervised pretrained ResNet-50 backbone outperforms supervised pretraining on COCO.
📝 Paper: Efficient Visual Pretraining with Contrastive Detection
My additional bits on method here.
Contrast to Divide: a pretty practical paper on using self-supervision to tackle learning with noisy labels.
Self-supervision as a plug-and-play technique shows good results in more and more areas. As the authors show in this paper, simply replacing supervised pre-training with self-supervised pre-training can improve both results and stability. This is achieved by removing label-noise memorisation (or the domain gap, in the case of transfer) from the warm-up stage of training, thereby maintaining a better separation between classes.
Source here.
Self-supervised pre-training for brain cortex segmentation: a paper from MICCAI 2018.
A quite old paper (for this boiling-hot area), although with an interesting take. The authors set up metric-learning pre-training, but instead of a 3D metric they estimate the geodesic distance along the brain surface between cuts taken orthogonal to the surface. Why? Because the brain cortex is a relatively thin structure along the curved brain surface, so areas are separated not as patches in 3D space but as patches on this surface. The authors demonstrate how the predicted distance between adjacent slices then aligns with the ground-truth borders of the areas.
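A minimal sketch of that pre-training setup as I understand it (the Siamese regression head and the MSE objective are my assumptions; the paper has its own formula):

import torch
import torch.nn as nn

class GeodesicSiamese(nn.Module):
    def __init__(self, encoder, feat_dim):
        super().__init__()
        self.encoder = encoder                      # backbone to be reused for segmentation
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 128),
                                  nn.ReLU(),
                                  nn.Linear(128, 1))

    def forward(self, patch_a, patch_b):
        za, zb = self.encoder(patch_a), self.encoder(patch_b)
        return self.head(torch.cat([za, zb], dim=1)).squeeze(1)

def pretrain_step(model, patch_a, patch_b, geodesic_dist, optimizer):
    # geodesic_dist: distance along the cortical surface between the two cuts
    pred = model(patch_a, patch_b)
    loss = nn.functional.mse_loss(pred, geodesic_dist)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()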
Although the presented result is better than the naïve baseline, I wouldn't be astonished if other pre-training techniques that emerged since then provided good results as well.
With a few more words and one formula here.
The original is there.
Improving Cytoarchitectonic Segmentation of Human Brain Areas with Self-supervised Siamese Networks
Instance Localization for Self-supervised Detection Pretraining: a paper on the importance of task-specific pre-training.
The authors hypothesise about the problems of popular self-supervised pre-training frameworks w.r.t. the task of localisation. They come to the idea that there is no loss that enforces localisation of the object representations, so they propose a new one. To make two contrastive representations of one image, they crop two random parts of it and paste those parts onto random images from the same dataset. Then they embed those images with the network, but contrast only the region of the embedding related to the pasted patch, instead of contrasting the whole image embedding as is usually done.
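A rough sketch of the copy-paste step and the region-level feature extraction (RoIAlign usage, strides, and sizes are my illustration; the paper's exact pipeline may differ):

import torch
from torchvision.ops import roi_align

def paste(foreground, background, top, left):
    # foreground: (C, h, w); background: (C, H, W). Returns the composite
    # image and the bounding box of the pasted region.
    C, h, w = foreground.shape
    out = background.clone()
    out[:, top:top + h, left:left + w] = foreground
    box = torch.tensor([[left, top, left + w, top + h]], dtype=torch.float32)
    return out, box

def region_embedding(feature_map, box, stride=32, output_size=7):
    # feature_map: (1, C, H/stride, W/stride) backbone output for one image.
    pooled = roi_align(feature_map, [box], output_size=output_size,
                       spatial_scale=1.0 / stride, aligned=True)
    return pooled.mean(dim=(2, 3)).squeeze(0)   # (C,) vector for the pasted region only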
Interestingly, the proposed loss not only provides SotA pre-training for the localisation task, it also degrades classification quality. This is a somewhat important practical finding: while general representations keep getting better, it could be more important to have task-specific pre-training than a SotA one tailored for another task.
More details here.
Source here.
SelfReg: a paper on contrastive learning for domain generalisation.
Domain generalisation methods focus on training models that will not need any transfer to work on new domains. The authors propose to adapt the popular contrastive learning framework to this task.
To build a positive pair, they sample two examples of the same class from different domains. Compared to classical contrastive learning, this amounts to having different domains instead of different augmentations, and different classes instead of different samples.
To avoid the burden of mining good negative samples, the authors adapt the BYOL idea and employ a projection network to avoid representation collapse.
Suppose we have f as the neural network under training, g as a trainable linear layer that projects the representation, and x_ck as a random sample of class c from domain k.
As the loss itself, the authors use two squared L2 distances:
1. ||f(x_cj) - g(f(x_ck))||^2
2. ||f(x_cj) - (l*g(f(x_cj)) + (1-l)*g(f(x_ck)))||^2, where l ~ Beta.
NB! In the second loss, the right-hand part is a linear mixture of sample representations from different domains.
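A small PyTorch-style sketch of these two terms (the Beta parameters, batching, and pairing of same-class samples are my assumptions):

import torch
import torch.nn as nn

def selfreg_terms(f_cj, f_ck, g, beta=torch.distributions.Beta(0.5, 0.5)):
    # f_cj, f_ck: (B, D) features of same-class samples from two domains; g: projection layer.
    z_cj, z_ck = g(f_cj), g(f_ck)
    loss1 = ((f_cj - z_ck) ** 2).sum(dim=1).mean()        # term 1
    l = beta.sample((f_cj.size(0), 1)).to(f_cj.device)    # l ~ Beta
    mixed = l * z_cj + (1 - l) * z_ck                     # mixture across domains
    loss2 = ((f_cj - mixed) ** 2).sum(dim=1).mean()       # term 2
    return loss1 + loss2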
By minimising the presented loss alongside the classification loss itself, the authors achieve a fairly well-separated latent-space representation and get close to the SotA without additional tricks.
The source can be found here.
Exploring Visual Engagement Signals for Representation Learning: a recent arXiv paper from Facebook with an interesting source of supervision for training.
In this paper the authors propose to use comments and reactions (Facebook-style likes) as the source of supervision for pre-training. The proposed method is simple and therefore scales well. For each image the authors collect two pseudo-labels:
1. All reactions are counted and normalised to sum to 1. This is used as the label for a cross-entropy loss.
2. Each comment is converted to a bag of words, the embedding is weighted via TF-IDF, and a cluster id is assigned with kNN (the "fitting" of the clustering is done on a random subset of comments from the same dataset). The cluster ids of all comments on an image are used together as the target for a multi-label classification loss; a sketch of both pseudo-labels follows below.
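In the sketch below I use KMeans fitted on a comment subset and nearest-centroid assignment as a stand-in for the paper's clustering plus kNN step; the vectorizer settings and cluster count are my assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def reaction_label(reaction_counts):
    # reaction_counts: dict like {"like": 120, "love": 30, "angry": 2}
    counts = np.array(list(reaction_counts.values()), dtype=float)
    return counts / counts.sum()                # soft target for cross-entropy

def fit_comment_clusters(comment_subset, n_clusters=256):
    vec = TfidfVectorizer(max_features=20000)
    X = vec.fit_transform(comment_subset)       # "fitting" on a random subset
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    return vec, km

def comment_label(comments, vec, km, n_clusters=256):
    ids = km.predict(vec.transform(comments))   # one cluster id per comment
    target = np.zeros(n_clusters)
    target[np.unique(ids)] = 1.0                # multi-label classification target
    return target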
This approach shows a slight to medium increase on tasks closely related to multi-modal learning, e.g. meme intent classification or predicting the political leaning of an image.
While the method description is somewhat messy and the method itself requires enormous training time (10 days on 32 V100s; chances are additional annotation could be cheaper), this paper once again shows an interesting way of getting supervision for pre-training.
Variance-Invariance-Covariance Regularisation (VICReg): a fresh paper on self-supervised training, kind of a follow-up to the idea raised by Barlow Twins.
In this paper the authors propose a three-part loss function which:
1. Prevents representation collapse by enforcing high variance across the different embedding vectors.
2. Decreases representation redundancy by decorrelating the dimensions of the embedding space.
3. Enforces invariance of the embeddings to different augmentations by pulling different embeddings of the same image together.
This loss helps to avoid both the burden of negative-sample mining and the crafty tricks employed by other methods. The authors demonstrate that their method is roughly on par with the SotA while avoiding all that and having some additional benefits, e.g. being free of explicit normalisations and largely free of batch-size dependency.
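A sketch of the three terms (the loss weights and epsilon are typical values I assume, not necessarily the paper's exact settings):

import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    # z_a, z_b: (N, D) projected embeddings of two augmented views of a batch.
    N, D = z_a.shape
    inv = F.mse_loss(z_a, z_b)                              # invariance term

    def variance(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(1.0 - std).mean()                 # hinge on per-dimension std
    var = variance(z_a) + variance(z_b)

    def covariance(z):
        z = z - z.mean(dim=0)
        cov = (z.t() @ z) / (N - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / D                    # penalise off-diagonal entries
    cov = covariance(z_a) + covariance(z_b)

    return sim_w * inv + var_w * var + cov_w * cov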
A bit longer overview here.
Source here.
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning | Notion
Not much more to say about it. Building transformers for computer vision tasks is an emerging topic, and this is a more or less technical paper showing the progress of transformers towards replacing ResNet-like architectures wherever they are still used, this time (not the first attempt, though) in self-supervision.
The loss itself is a superposition of the MoCo v2 and BYOL approaches: it combines the queue of negative examples and the contrastive loss from MoCo v2 with the model-level asymmetry from BYOL, where one branch is momentum-updated.
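A compressed sketch of that combination (the momentum value, temperature, and head placement are assumptions for illustration, not the paper's settings):

import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(online, target, m=0.99):
    # The target branch slowly tracks the online branch, as in BYOL/MoCo.
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(m).add_(p_o.data, alpha=1 - m)

def queue_contrastive_loss(q, k, queue, temperature=0.2):
    # q: (N, D) online-branch outputs; k: (N, D) momentum-branch outputs for
    # the other view; queue: (K, D) past keys used as negatives.
    q, k, queue = (F.normalize(t, dim=1) for t in (q, k.detach(), queue))
    pos = (q * k).sum(dim=1, keepdim=True)              # (N, 1)
    neg = q @ queue.t()                                  # (N, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)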
The results are on par with the SotA in the linear evaluation scheme, but the method (1) needs fewer complex tricks and (2) is applicable to transfer learning towards detection and segmentation.
Forwarded from Just links
Self-Supervised Learning with Swin Transformers https://arxiv.org/abs/2105.04553
Contrastive Conditional Transport for Representation Learning.
This paper tries to make a step similar to the GAN -> WGAN step, but for representation learning. At first the authors propose that, instead of training with a more-or-less classical SimCLR loss, one can simply minimise (C+) - (C-), where (C+) is the mean distance between the anchor and positive samples (different views of the anchor) and (C-) is the mean distance between the anchor and negative samples.
Since this alone does not work out, the authors add a more complex weighting procedure: positive samples are weighted with respect to their distance to the anchor (larger distance, larger weight), and vice versa for the negative samples (smaller distance, larger weight).
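A sketch of the weighted (C+) - (C-) objective as described above; the cosine distance and softmax-based weights are my reading of the idea, not the paper's exact form.

import torch
import torch.nn.functional as F

def conditional_transport_loss(anchor, positives, negatives, temperature=0.5):
    # anchor: (D,); positives: (P, D); negatives: (K, D).
    a = F.normalize(anchor, dim=0)
    pos = F.normalize(positives, dim=1)
    neg = F.normalize(negatives, dim=1)
    d_pos = 1 - pos @ a                                   # distances to positives
    d_neg = 1 - neg @ a                                   # distances to negatives
    w_pos = torch.softmax(d_pos / temperature, dim=0)     # farther positive -> larger weight
    w_neg = torch.softmax(-d_neg / temperature, dim=0)    # closer negative -> larger weight
    return (w_pos * d_pos).sum() - (w_neg * d_neg).sum()  # weighted (C+) - (C-)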
Although the description of the idea is somewhat chaotic, the reported results look good. One more positive side effect: this loss easily works with multiple positive samples drawn in one minibatch.
Source here.