Casual GAN Papers
🔥 Popular deep learning & GAN papers explained casually!

📚 Main ideas & insights from papers to stay up to date with research trends

⭐️ New posts every Tue and Fri

Reading time <10 minutes

patreon.com/casual_gan

Admin/Ads:
@KirillDemochkin
​​#7.2: "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" (#NeRF) by Mildenhall et al.

📈Interesting Numbers:
Realistic Synthetic 360 (SSIM)
The best is NeRF @ 0.947 (next best is LLFF @ 0.911)

Realistic Forward-Facing (SSIM)
The best is NeRF @ 0.811 (next best is LLFF @ 0.798)

✏️My Notes:
- 9/10 for the name, it spawned many puns and funny variations in follow-up papers
- You have to check out the video samples of scenes for yourself, static images do not do them justice
- This is the OG paper that started the NeRF hype train
- The idea of using a coordinate-based MLP for implicit representations generalizes to other domains such as video, sound, and 2D images, as demonstrated by SIREN
- The main downsides are: 1) performance, it takes forever to fit a single scene; 2) no support for dynamic scenes and no generalization across scenes; 3) no way to change the lighting in the optimized scene

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#8: "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" (#ReStyle) by Alaluf et al.

🔑 Keywords:
#Iterative_inversion #latent_projection #StyleGAN_inversion #differentiable_rendering #latent_space_encoding #residual_encoding

🎯 At a glance:
The authors propose a fast iterative method for inverting images into the latent space of a pretrained StyleGAN generator that achieves SOTA quality at a lower inference time. The core idea is to start from the average latent vector in W+ and predict an offset that makes the generated image look more like the target, then repeat this step with the new image and latent vector as the starting point. With the proposed approach a good inversion can be obtained in about 10 steps.

🔍 Main Ideas:
1) Model architecture:
Interestingly, ReStyle is agnostic to the encoder architecture and can be used with any StyleGAN encoder. However, due to the self-correcting multi-step nature of the model, the authors sought to reduce the complexity of the base encoder and used a simplified version of the pSp and e4e encoders (only the 16x16 feature maps are used in the map2style blocks instead of the multiple scales of feature maps in the original encoders) in most of their experiments. The losses from the underlying encoder models are all used as-is.

2) Iterative refinement
Each training iteration consists of ~10 repeated steps. At each step the encoder receives the current image (initialized with the image produced from the average latent vector) concatenated along the channel dimension with the target image and produces a set of W+ vector offsets that are added to the latent code from the previous step (initialized with the average latent code). The resulting W+ vectors are given to the pretrained generator to produce an image that is used as the input in the next step.
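
A minimal sketch of this refinement loop in PyTorch-style pseudocode (here `encoder`, `generator`, and `avg_latent` are hypothetical stand-ins for the pretrained models, not the authors' actual API):

```python
import torch

def restyle_invert(target, encoder, generator, avg_latent, n_steps=10):
    # start from the average W+ code and its image, then repeatedly predict a
    # residual offset from (current image, target image) and re-render
    latent = avg_latent.clone()                               # e.g. (1, n_layers, 512)
    current = generator(latent)
    for _ in range(n_steps):
        delta = encoder(torch.cat([current, target], dim=1))  # 6-channel input
        latent = latent + delta                               # residual update in W+
        current = generator(latent)                           # next step's input
    return latent, current
```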

3) Insights from experiments:
Upon observing the magnitude of changes in the inverted images the authors concluded that their iterative scheme works in a coarse-to-fine manner, first focusing on low-frequency features and then refining the smaller details. In their experiments the authors show that the editability of e4e-encoded images remains intact with the proposed modifications.

4) Encoder bootstrapping:
The authors consider the task of creating a cartoon character from an input photo using ReStyle. The obvious way to do this is to initialize the starting latent code with the average latent vector of the cartoon generator and do ~10 ReStyle steps to obtain the final image. However, a smarter way is to first invert the input image into the real-face domain of StyleGAN using a pretrained encoder, and initialize ReStyle's starting latent code with the resulting W+ vector. Doing so makes it possible to obtain a high-quality cartoon reconstruction of the input image in just a few ReStyle steps.

📈Interesting Numbers:
It is hard to definitively judge the quantitative comparison of the methods since there is a tradeoff between inference speed and reconstruction quality. See Fig. 5 in the paper for more details.

CelebA-HQ (ID Similarity / Inference time)
The best (at an inference time of 0.5 seconds) is the authors' ReStyle pSp encoder @ 0.66 (next best is the authors' ReStyle e4e encoder @ 0.51)

✏️My Notes:
- The name is good, 8/10
- I played around with their demo, and while the inverted real-world images are very good, I am not sure about the editability of these inversions. For example, I could not get StyleCLIP to edit the inverted images at all, compared to images inverted with e4e (although those inversions do not look very much like the real-world inputs)
- I think the question is: how can we do a single nonlinear update to the latent code instead of 10 linear steps to invert the image in a single forward pass?
- Also, I wonder if it is possible to "prune" the encoder during inference and do just 2 or 3 big approximate steps that combine the smaller steps the encoder learns during training.

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#9.1: "Designing an Encoder for StyleGAN Image Manipulation" (#e4e) by Tov et al.

🔑 Keywords:
#latent_projection #StyleGAN_inversion #StyleGAN_image_editing #latent_space_encoding #StyleGAN_encoder

🎯 At a glance:
This architecture is currently the go-to for StyleGAN inversion and image editing. The authors build on the ideas proposed in pSp and generalize the method beyond the face domain. Moreover, the proposed method achieves a balance between the reconstruction quality of the images and the ability to edit them.

🔍 Main Ideas:
1) The inversion tradeoff:
One of the key ideas of the paper is that there exists a tradeoff between the quality of the reconstruction of an inverted image and the ability to coherently edit it by manipulating the corresponding latent vector. The authors state that better reconstruction quality is achieved by inverting images into W+, whereas edits of images inverted into the W space are more consistent, at the cost of perceived similarity to the source image. One possible reason for this behavior is that StyleGAN is trained in W, and while inverting into W+ has greater expressive power, the encoded latent codes lie further from the true distribution of latent vectors learned by the generator. To balance these aspects of image inversion a W^k space is proposed, which controls the tradeoff via its proximity to W.

2) Inverting images to W^k:
The authors propose two things to get W^k closer to W while retaining the expressiveness of W+.
First, the encoder predicts a single latent code and a set of offsets, one per layer of the StyleGAN generator, that are added to the latent code to obtain the final set of vectors passed to the pretrained generator. In order to keep those offsets small, and the latent codes in turn closer to W, the offsets are regularized with an L2 penalty.
Second, the authors introduce a progressive scheme: for the first 20k iterations the encoder predicts just the single latent vector that is used for each of the generator's layers. Then the model learns to predict the latent vector and an offset for the second layer of the generator (the predicted latent vector is used for all of the layers in the generator except the ones for which an offset is predicted). A new offset is added every 2k iterations.
It is important to note that the offsets are computed from different levels of the encoder's feature maps (taken from pSp), with the intuition that the network first learns to predict the coarse structure of the images and progressively learns to refine it with the learned offsets.
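
A rough sketch of how the progressive offset scheme could look in code (the schedule constants mirror the numbers above; this is an illustration, not the official implementation):

```python
import torch

def assemble_wk(w_base, offsets, step, start=20_000, every=2_000):
    # w_base: (B, 512) single latent, offsets: (B, n_layers, 512) per-layer deltas
    n_layers = offsets.shape[1]
    # before `start` iterations only the base latent is used; afterwards one more
    # layer's offset is unlocked every `every` iterations
    active = 0 if step < start else min(n_layers, (step - start) // every + 1)
    mask = torch.zeros_like(offsets)
    mask[:, :active] = 1.0
    codes = w_base.unsqueeze(1) + offsets * mask       # (B, n_layers, 512), i.e. W^k
    delta_reg = (offsets * mask).pow(2).mean()         # L2 term keeps codes near W
    return codes, delta_reg
```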

3) Latent discriminator:
To make sure the inverted latent codes do not fall outside the true distribution of the latent space learned by the generator, the authors employ a small discriminator that learns to distinguish between real latent vectors sampled from StyleGAN's mapping network and fake ones produced by the encoder.

4) Generalization beyond the faces domain:
For all domains other than the facial domain, the authors replace the ArcFace-based identity loss with a cosine distance between features extracted by a ResNet-50 trained with MoCo v2.

5) Latent editing consistency:
The metric measures how well-behaved the latent edits are and how well the image is reconstructed. Calculating LEC consists of 4 steps: inverting an image, performing an edit, inverting the edited image, and applying the inverse edit to it. Ideally the first and last images, along with their latent codes, are equal.
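
The 4 steps translate almost directly into code; a minimal sketch (the exact distance used in the paper may differ, and `encoder`, `generator`, `edit`, `edit_inverse` are hypothetical stand-ins):

```python
def lec(image, encoder, generator, edit, edit_inverse):
    w1 = encoder(image)                  # 1) invert the input image
    edited = generator(edit(w1))         # 2) apply the latent edit and render
    w2 = encoder(edited)                 # 3) invert the edited image
    w3 = edit_inverse(w2)                # 4) undo the edit in latent space
    return (w1 - w3).pow(2).mean()       # ideally zero (images should also match)
```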

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#9.2: "Designing an Encoder for StyleGAN Image Manipulation" (#e4e) by Omer et al.

📈Interesting Numbers:
The metrics provided are mostly inconclusive since there is a tradeoff between editability and distortion (quote from the paper): "However, the perceptual metrics contradict each other, resulting in no clear winner. As discussed, perceptual quality is best evaluated using human judgement"

✏️My Notes:
- Decent name, alas not a meme, 7/10.
- While the model does wonders for FFHQ/CelebA images, it showed some pretty lackluster results for my own selfies that I tried to invert. Not only did the backgrounds not get inverted, which is expected, but the identity loss on the inverted selfies was quite severe.
- I do believe that this is mostly a limitation of StyleGAN, not e4e, as the authors state that it is possible to improve reconstruction by sacrificing some editability.
- It works nicely as a plug-and-play module in a StyleGAN editing/inversion pipeline, as shown by ReStyle and StyleCLIP

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#10: "Spatially-Adaptive Pixelwise Networks for Fast Image Translation" (#ASAPNet) by Shaham et al.

🔑 Keywords:
#Pix2pix #image_to_image #model_speedup #MLP #coordinate_based

🎯 At a glance:
The authors propose a novel architecture for efficient high-resolution image-to-image translation. At the core of the method is a pixel-wise model with spatially varying parameters that are predicted by a convolutional network from a low-resolution version of the input. Reportedly, an 18x speedup is achieved over baseline methods with similar visual quality.

🔍 Main Ideas:
1) Spatially Adaptive Pixelwise Networks:
The output image is obtained independently pixel by pixel (which can be parallelized) by a multilayer perceptron with spatially varying weights. Spatial information is preserved by passing each pixel's coordinates along with its color value. Moreover, while the MLP architecture is shared for all pixels, its weights vary from pixel to pixel. In essence, this means the model learns to predict how to process each image, and then every image is processed by a different set of pixelwise networks based on the input.

2) Predicting network parameters from a low resolution input:
A convolutional network takes a downsampled version of the input image and predicts a low-resolution (16x16 for 1024x1024 images) grid of parameters for the pixelwise networks. The parameter grid is upsampled to the required resolution via nearest-neighbor interpolation.
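
Putting the two ideas together, a rough sketch of the forward pass (here `lowres_net` and `pixel_mlp` are hypothetical stand-ins, and the coordinate encoding from the next section is omitted, raw coordinates are passed instead):

```python
import torch
import torch.nn.functional as F

def asap_forward(img_hr, lowres_net, pixel_mlp, grid=16):
    # a convnet predicts a coarse grid of per-pixel MLP parameters from a
    # downsampled input; the grid is upsampled with nearest neighbor and every
    # pixel is processed independently by the spatially varying MLP
    B, C, H, W = img_hr.shape
    img_lr = F.interpolate(img_hr, size=(grid, grid), mode='bilinear',
                           align_corners=False)
    params = lowres_net(img_lr)                               # (B, P, grid, grid)
    params = F.interpolate(params, size=(H, W), mode='nearest')
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H),
                            torch.linspace(0, 1, W), indexing='ij')
    coords = torch.stack([xs, ys]).unsqueeze(0).expand(B, -1, -1, -1)  # (B, 2, H, W)
    # each pixel's output depends only on its own color, its coordinates, and
    # the parameters assigned to its grid cell
    return pixel_mlp(img_hr, coords, params)                  # (B, 3, H, W)
```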

3) Positional encoding:
To help the model learn to synthesize high-frequency features the authors project the pixels' coordinates into a higher-dimensional space. Namely, they use Fourier features: a set of sine and cosine functions of different frequencies evaluated at the pixel coordinates.
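
A common way to compute such Fourier features (a sketch; the paper's exact frequency schedule may differ):

```python
import math
import torch

def fourier_features(coords, n_freqs=6):
    # coords: (..., 2) pixel coordinates normalized to [0, 1]
    freqs = (2.0 ** torch.arange(n_freqs)) * math.pi      # pi, 2*pi, 4*pi, ...
    angles = coords.unsqueeze(-1) * freqs                 # (..., 2, n_freqs)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return feats.flatten(-2)                              # (..., 4 * n_freqs)
```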

4) Training and implementation details:
The training procedure, discriminator, and losses are the same as for Pix2PixHD and SPADE. Without positional encoding the network is unable to generate high frequency details. Without spatial variations the expressiveness of the model is severely limited. There is a tradeoff between faster inference for more aggressive downsampling and visual quality for less downsampling.

📈Interesting Numbers:
See the attached picture below!
The main takeaway is that the model works way faster than the competing methods while maintaining a similar visual quality.

✏️My Notes:
- The name sounds dope, but a bit on the nose with the words that make up the acronym, 8/10
- Interesting that there are no artifacts on the edges of the regions with different network parameters after parameter grid upsampling
- I wonder if it is possible to modify this approach for resolution-free image synthesis: instead of fixing the coordinates for each pixel, the coordinates could be sampled from regions with the same network parameters on the upsampled grid
- I have not tried their demo, feel free to comment on the results if you have played with the authors' code
- Finally a straightforward paper that can be explained in under 5 minutes 😁

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#11.1: "Training Generative Adversarial Networks with
Limited Data" (#StyleGAN-ada) by Karras et al.

🔑 Keywords:
#StyleGAN #differentiable_augmentations #limited_data_training #GAN_augmentations #discriminator_augmentations

🎯 At a glance:
The authors propose a novel method to train a StyleGAN on a small dataset (a few thousand images) without overfitting. They achieve high visual quality of the generated images by introducing a set of adaptive discriminator augmentations that stabilize training with limited data.

🔍 Main Ideas:
1) Stochastic discriminator augmentation
By default, any augmentation applied to the training images will leak and appear in the generated images, which is an undesirable side effect of using augmentations to artificially increase the size of the training set. One way to prevent leaking is to require, only when training the discriminator, that its output be consistent for an image under two different sets of augmentations. However, this approach makes the discriminator "blind" to augmentations, which is not good, since the generator can then create augmented images without any penalty. The proposed approach is similar in that it applies a set of augmentations to the images shown to the discriminator, but here only augmented images are ever shown to the discriminator, and the same augmentations are also applied when training the generator.

2) Designing augmentations that do not leak
It has been shown that GANs can implicitly undo corruptions when trained only on corrupted images, as long as the corruption process is invertible at the distribution level, i.e. one can tell whether two distributions of augmented images match without ever seeing the clean images. A simple example of a stochastic non-leaking augmentation is randomly rotating images by 0, 90, 180, or 270 degrees 10% of the time: this increases the relative occurrence of images at 0 degrees, so the only way for the generator to match the distribution of real images is to generate images in the correct orientation. Most deterministic augmentations (additive noise, flips, shifts, scaling, etc.) can be made non-leaking by applying them only p% of the time, with a safe value of p below 0.8.
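
A minimal sketch of applying an augmentation to each image independently with probability p (an illustration of the idea, not the paper's pipeline):

```python
import torch

def maybe_augment(images, aug_fn, p):
    # apply `aug_fn` to each image independently with probability p; keeping
    # p below ~0.8 lets the generator still infer the un-augmented distribution
    mask = torch.rand(images.shape[0], device=images.device) < p
    out = images.clone()
    if mask.any():
        out[mask] = aug_fn(images[mask])
    return out
```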

3) Augmentation pipeline
The authors use 18 augmentations in a predefined order, all applied independently with the same probability p. The large number of augmentations makes it extremely unlikely that the discriminator will ever see an image without any augmentation, yet the generator is still guided to produce only clean images as long as p remains under the safe value.

4) Adaptive discriminator augmentation
To avoid manual tuning of the strength of every augmentation, the authors develop two heuristics that indicate discriminator overfitting. The first expresses the discriminator's predictions on the validation set relative to the training set and generated images. The second is the portion of the training set that receives positive discriminator outputs. In practice p is initialized to 0, the heuristics are computed every couple of minibatches, and p is aggressively adjusted up or down to counteract overfitting.
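
A sketch of the adaptive adjustment based on the second heuristic (the target and step size are illustrative values, not necessarily the paper's):

```python
def update_aug_p(p, real_logits, target=0.6, step=0.005):
    # r_t: portion of real training images that get a positive discriminator
    # output (the second heuristic above); nudge p up when the discriminator
    # looks overconfident on real data, down otherwise
    r_t = (real_logits > 0).float().mean().item()
    p = p + step if r_t > target else p - step
    return min(max(p, 0.0), 1.0)
```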

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#11.2: "Training Generative Adversarial Networks with
Limited Data" (#StyleGAN-ada) by Karras et al.

📈Interesting Numbers:
MetFaces (FID)
The best is authors' ADA StyleGAN2 @ 18.22 for training from scratch and 0.81 for transferring from a pretrained StyleGAN2 (next best is default StyleGAN2 @ 57.26 for training from scratch and 3.16 for transferring from a pretrained StyleGAN2)

✏️My Notes:
- Boring but recognizable model name, 6/10
- The results are beyond impressive, the images have no business looking this good for how small the training dataset can be
- It is interesting that for whatever reason the proposed augmentations have not yet become a standard part of the GAN pipeline.
- This post is brought to you by LabelMe, who are making an open-ended library of free datasets for all sorts of computer vision tasks. Do you want to train your own StyleGAN-ada model, but just can't find the right dataset? The guys at LabelMe got you covered. Just fill out the form, and their specialists will go over all of the submissions, select the most popular ones, and create datasets for the winning submissions. The best part is that those datasets will be published in the public domain on their website.

Link to the dataset request form by LabelMe

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#12: "Generating Diverse High-Fidelity Images with VQ-VAE-2" (#VQ-VAE-2) by Razavi et al.

🔑 Keywords:
#discrete_representation #discrete_latents #VAE #VQVAE #image_generation #autoregressive

🎯 At a glance:
The authors propose a novel hierarchical encoder-decoder model with discrete latent vectors that uses an autoregressive prior (PixelCNN) to generate diverse, high-quality samples.

🔍 Main Ideas:
1) Vector Quantized Variational AutoEncoder:
The model is composed of three parts: an encoder that maps an image to a sequence of latent variables, a shared codebook that is used to quantize these continuous latent vectors into a set of discrete latent variables (each vector is replaced with the nearest vector from the codebook), and a decoder that maps the indices of the codebook vectors back to an image.

2) Learning the codebook:
Since the quantization operation is non-differentiable, a straight-through gradient estimator is used. As scary as that sounds, it simply means that the gradient from the first layer of the decoder is passed directly to the last layer of the encoder, skipping the codebook altogether.
The codebook itself is updated via exponential moving average of the encoder outputs. The encoder outputs are regularized so that they stay close to the chosen codebook without fluctuating too much.
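
A compact sketch of the quantization step with a straight-through gradient and a commitment-style regularizer (the EMA codebook update is omitted; this is an illustration, not the official implementation):

```python
import torch
import torch.nn.functional as F

def quantize(z_e, codebook):
    # z_e: (B, D, H, W) continuous encoder output; codebook: (K, D) code vectors
    B, D, H, W = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, D)                 # (B*H*W, D)
    idx = torch.cdist(flat, codebook).argmin(dim=1)               # nearest code
    z_q = codebook[idx].view(B, H, W, D).permute(0, 3, 1, 2)      # (B, D, H, W)
    # straight-through: the forward pass uses z_q, the gradient flows to z_e
    z_q_st = z_e + (z_q - z_e).detach()
    commit = F.mse_loss(z_e, z_q.detach())    # keeps encoder outputs near their codes
    return z_q_st, idx.view(B, H, W), commit
```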

3) Hierarchical VQ-VAE:
Inspired by the ideas of coarse-to-fine generation the authors propose a hierarchy of vector-quantized codes to model large images. The whole architecture resembles a 3-level U-Net with concatenating skip connections, except that each feature map is quantized before being passed to the decoder.

4) Learning priors over the latent codes:
In order to sample from the model, a separate autoregressive prior (PixelCNN) is learned over the sequences of latent codes at each resolution level. The top-level prior uses self-attention layers since it operates at a lower resolution, while the higher-resolution bottom-level prior instead relies on large conditioning stacks coming from the top prior, due to memory constraints. Each prior is trained separately. Sampling from the model amounts to passing a class label to the trained top-level PixelCNN to obtain the top-level codes, passing the class label along with the generated codes to the bottom level to generate the higher-resolution codes, and finally using the decoder to generate an image from the top- and bottom-level codes.

📈Interesting Numbers:
ImageNet (Top-1 Classification Accuracy Score - test score of a classifier trained only on samples from the generator)
The best is Real Data @ 91.47 (next best is authors' VQ-VAE @ 80.98)

✏️My Notes:
- 3/5 for the name, basic but cool
- Seems to work well for datasets with a ton of classes
- Wonder why this has not dethroned StyleGAN2 as the go-to method for image generation if it is really as good as the authors claim 🤔

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#13: "EigenGAN: Layer-Wise Eigen-Learning for GANs" (#EigenGAN) by Zhenliang He et al.

🔑 Keywords:
#GAN #GAN_editing #latent_editing #latent_directions #interpretable_generation #discovering_semantic_directions

🎯 At a glance:
The authors propose a novel generator architecture that intrinsically learns interpretable directions in the latent space in an unsupervised manner. Moreover, each direction can be controlled in a straightforward way with a strength coefficient that directly influences attributes such as gender, smile, and pose in the generated images.

🔍 Main Ideas:
1) Generator with layer-wise subspaces:
The generator architecture is simply a stack of transposed convolution layers that start from a latent noise vector processed by a fully-connected layer and increase the resolution of the image twofold. Before each transposed conv2d layer an orthogonal linear subspace is injected.

2) Linear subspace structure:
Each linear subspace is comprised of 3 elements (all of which are learned during training): a set of orthonormal vectors U that correspond to interpretable directions, a diagonal matrix L that controls the strength of each direction, and an offset vector μ that denotes the origin of the subspace. The orthogonality of U is achieved via regularization

3) Injection mechanism:
To inject this subspace a random noise vector z (coordinate) is sampled and projected into the subspace by left-multiplying it with the U and L matrices, and adding the offset μ. The resulting vector is either plainly added to the feature map from the previous layer or, alternatively, processed by a 1x1 convolution. This process is repeated for each layer with a new sampled coordinate.
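
A sketch of one layer's subspace injection and a typical orthogonality regularizer (assumed standard forms consistent with the description above, not the authors' exact code):

```python
import torch

def inject_subspace(feat, U, L_diag, mu, z):
    # feat: (B, D, H, W) feature map from the previous layer
    # U: (D, q) orthonormal directions, L_diag: (q,) strengths, mu: (D,) origin
    # z: (B, q) sampled coordinate for this layer
    phi = (z * L_diag) @ U.t() + mu          # point in the layer's subspace, (B, D)
    return feat + phi[:, :, None, None]      # plain add (a 1x1 conv is the alternative)

def orthogonality_reg(U):
    # encourages U to stay orthonormal; assumed ||U^T U - I||^2 penalty
    q = U.shape[1]
    return ((U.t() @ U - torch.eye(q, device=U.device)) ** 2).sum()
```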

4) Details:
For a single linear subspace the authors prove that the columns of U are the result of PCA of the space. When the subspaces are injected into a hierarchical nonlinear generator, the process can be seen as progressively adding new "straight" dimensions that bend and curve after each nonlinear transposed convolution layer. The authors limit each subspace to just 6 basis vectors and observe that the subspaces injected into the earlier layers correspond to low-level attributes such as pose and hue, while the later subspaces learn directions for more abstract attributes such as facial hair, hair side, and background texture orientation.


📈Interesting Numbers:
The main takeaway is that EigenGAN has better disentanglement than directions discovered with SeFa in StyleGAN. Additionally, the EigenGAN directions are closer to a PCA decomposition of the latent space than directions from other baselines.

✏️My Notes:
- 4/5 for the name - it's cool and straight to the point
- Important to note that higher level attributes are not very well disentangled, although the authors claim that they are still interpretable
- Interesting whether real images can be projected well into the latent space of the proposed generator
- I would like to see whether this method can be applied to other domains besides faces
- Do the learned directions depend strongly on the initialization, or are they close every time? (The authors note that some rarer attributes are not discovered every time)
- Is it possible to swap the learned eigen-dimensions between two generators trained on different datasets?

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
#13: "EigenGAN: Layer-Wise Eigen-Learning for GANs" (#EigenGAN) by Zhenliang He et al.

P.S.
Here is a really cool example of traversal along the learned directions.

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#14: "An Image Is Worth 16X16 Words: Transformers For Image Recognition At Scale" (#ViT) by Dosovitskiy et al.

🔑 Keywords:
#Transformers #ImageClassification #ImageTransformers #Pretraining #LargeScale

🎯 At a glance:
In this paper from late 2020 the authors propose a novel architecture that successfully applies transformers to the image classification task. The model is a transformer encoder that operates on flattened image patches. By pretraining on a very large image dataset the authors are able to show great results on a number of smaller datasets after finetuning the classifier on top of the transformer model.

🔍 Main Ideas:
1) Patch tokens:
The image is split into small patches (16x16), and each patch is flattened into a single vector of all of its pixel color values. The patches are projected with a linear layer into the model's hidden dimension, summed with a learnable 1D positional embedding (which, it turns out, implicitly learns to represent 2D coordinates), and a learnable "classification" token is prepended to the sequence. This sequence of embeddings serves as the input to the transformer.
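
A small sketch of the tokenization step (dimensions such as 512 are illustrative here, not the paper's exact configuration):

```python
import torch

def patchify(images, patch=16):
    # split into non-overlapping patches and flatten each into one vector
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x                                                     # (B, num_patches, C*p*p)

proj = torch.nn.Linear(3 * 16 * 16, 512)                 # linear patch projection
cls_token = torch.nn.Parameter(torch.zeros(1, 1, 512))   # learnable [class] token
pos_embed = torch.nn.Parameter(torch.zeros(1, 14 * 14 + 1, 512))  # learnable 1D positions

tokens = proj(patchify(torch.randn(2, 3, 224, 224)))             # (2, 196, 512)
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1) + pos_embed
```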

2) Vision Transformer:
The transformer part is just the standard transformer model: alternating layers of multiheaded self-attention and feedforward blocks with layernorm applied before every block and residual connections after every block. The feedforward blocks have two layers with GELU non-linearity.

3) Finetuning:
During inference it is possible to process images of higher resolution, since the transformer works with any sequence length. However, it is necessary to interpolate the positional encodings since they may lose their meaning at different resolutions than the one they have been trained for. It is also possible to finetune the pretrained model to any dataset by removing the pretrained classifier "head", and replacing it with a zero initialized feedforward layer that predicts probabilities for the classes in the new dataset.

📈Interesting Numbers:
The main takeaway is that ViT requires A LOT of data to "git gud", but when it gets the data it outperforms SOTA classifiers on downstream tasks, and on larger datasets it pretrains faster than other baseline methods.

✏️My Notes:
- 3/5 for the name: very utilitarian, lacking some pizzazz.
- I guess the main question is about data availability for training ViT
- There is a point in the appendix about self-supervised pretraining: the authors corrupt 50% of the patch embeddings and ask the model to predict the average RGB value of each missing patch (predicting a 4x4 downsampled version of the patch did not work as well). It works well enough, I suppose.

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#14: "StyleGAN2 Distillation for Feed-forward Image Manipulation" by Viazovetskyi et al.

🔑 Keywords:
#StyleGAN_editing #latent_image_editing #StyleGAN_directions_discovery #unsupervised

🎯 At a glance:
In this paper from October, 2020 the authors propose a pipeline to discover semantic editing directions in StyleGAN in an unsupervised way, gather a paired synthetic dataset using these directions, and use it to train a light Image2Image model that can perform one specific edit (add a smile, change hair color, etc) on any new image with a single forward pass.

🔍 Main Ideas:
1) Data collection:
The main contribution of this paper is a pipeline that enables an unsupervised creation of synthetic paired datasets that can be used to apply chosen edits to real images. This pipeline can be described in 7 steps:
- sample a lot of latent vectors and their corresponding images from a pretrained generator
- get attribute predictions from a pretrained classifier network
- select images with high classification scores for the desired attributes
- find the editing direction by subtracting the average latent vector of the images with the most negative scores for the attribute from the average latent vector of the images with the most positive scores (see the sketch after this list)
- sample sets of latent vectors, each made up of a random vector, the same vector with the edit direction subtracted, and one with the edit direction added
- predict the attribute scores for all sampled images
- from each set of images select a pair based on the classification scores such that the two images belong to opposite classes with high certainty
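
A sketch of the direction-finding step referenced above (a simple difference of class-conditional mean latents; variable names and the cutoff k are illustrative):

```python
import torch

def find_edit_direction(latents, scores, k=1000):
    # latents: (N, 512) sampled W vectors, scores: (N,) classifier scores
    order = scores.argsort()                 # ascending: most negative first
    w_neg = latents[order[:k]].mean(dim=0)   # mean latent of the negative class
    w_pos = latents[order[-k:]].mean(dim=0)  # mean latent of the positive class
    return w_pos - w_neg

# a candidate pair is then (G(w - d), G(w + d)) for a random w, kept only if the
# classifier confidently places the two images in opposite classes
```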

2) Distillation:
Once the dataset is gathered any Image2Image model can be trained with it. The authors use the vanilla Pix2PixHD framework without any modifications at 512x512 resolution.

3) Stylemixing distillation:
Besides showing impressive results for distilling attribute edits from a StyleGAN2 generator into a Pix2PixHD model, the authors also show that style-mixing operations can be distilled in the same way. To train such a model it is necessary to collect a dataset of triplets (two images and their mixture) and feed a concatenation of the two input images to the Pix2PixHD model. Another possibility is image blending, where the third image in the triplet corresponds to the average of the two input latent vectors.

📈Interesting Numbers:
FFHQ (FID)
The best is authors' approach @ 14.7 (next best is StarGANv2 @ 25.6) | real data is @ 3.3

✏️My Notes:
- No rating for the name, since the paper does not have a model/pipeline name 🤷‍♂️
- There is an identity gap between the original and edited images in the synthetic dataset, hence on real images the change in identity becomes even more apparent
- It is non-trivial to get disentangled directions in a StyleGAN generator, especially with a simple linear vector addition as proposed here. Some papers try to solve this by modeling the edit as a nonlinear transformation between two vectors in the latent space.

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#15: "MLP-Mixer: An all-MLP Architecture for Vision" by Tolstikhin et al.

🔑 Keywords:
#MLP #no_convolutions #MLP_Mixer #Transformers #ImageTransformers #ImageClassification #ViT

🎯 At a glance:
This paper is a spiritual successor to last year's Vision Transformer. The authors once again come up with an all-MLP (multi-layer perceptron) model for solving computer vision tasks, and this time no self-attention blocks are used either (!). Instead, two types of "mixing" layers are proposed: the first handles interactions between features inside each patch, and the second mixes features between patches.

🔍 Main Ideas:
1) Big picture:
There are no convolutions or spatial attention blocks used anywhere in the network architecture. As input it accepts a sequence of linearly projected image patches (tokens) and operates on them without changing the dimensionality (patches x channels table). All patches are projected using the same projection matrix. The overall structure goes like this: an image patch embedding layer followed by a number of Mixer layers that are connected to a global average pooling layer and a fully connected layer that predicts the resulting class.

2) Mixer layer:
The core idea of this novel layer is to separate feature mixing at spatial locations (channel-mixing) and across spatial locations (token-mixing). Each mixer layer consists of a layer norm, two MLP blocks (two fully-connected layers with a GELU nonlinearity in-between), and two skip connections between the input, and outputs of each MLP block.
The first block operates on transposed features (channels x tokens). For each channel it combines feature values across all tokens. The features are then transposed back to their original shape (tokens x channels), layer-normed, and passed to a second MLP that processes and reweights all channels for each token.
The same MLP is reused for each channel in the first case, and for each token in the second block.
No positional encoding is used anywhere in the model since token-mixing MLPs can become location aware through training.
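
A minimal PyTorch sketch of one Mixer layer as described above (hidden sizes are illustrative):

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    def __init__(self, tokens, channels, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = nn.Sequential(                 # mixes across tokens
            nn.Linear(tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, tokens))
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = nn.Sequential(               # mixes across channels
            nn.Linear(channels, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, channels))

    def forward(self, x):                               # x: (B, tokens, channels)
        y = self.norm1(x).transpose(1, 2)               # (B, channels, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)       # token-mixing + skip
        x = x + self.channel_mlp(self.norm2(x))         # channel-mixing + skip
        return x
```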

📈Interesting Numbers / Main takeaways:
- Mixer improves faster with more training data than other baselines
- Mixer runs faster than ViT
- Mixer achieves 87.94% top-1 accuracy on ImageNet when pretrained on JFT-300M
- Mixer overfits faster than ViT on smaller datasets

✏️My Notes:
- 5/5 - such a dope name!
- Same thing as with ViT, MLP-Mixer is very data-hungry
- As the authors point out, it is very interesting to see what inductive biases are hidden in the features learned by the model, and how they relate to generalization
- Interesting to see where else these ideas could be applied: NLP, music, generative tasks?

🔗Links:
Paper / Code

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#16: "Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing" (#StyleMapGAN) by Hyunsu Kim et al.

🔑 Keywords:
#GAN_editing #StyleGAN_inversion #StyleGAN_editing #spatial_styles #latent_projection #latent_image_editing

🎯 At a glance:
One more paper about inverting images into the latent spaces of generators, this time with the twist that it uses explicit spatial styles (style tensors instead of style vectors) in both the generator and the encoder, making it possible to perform local edits and smoothly swap parts of images. Overall the authors show that their approach outperforms other baselines on the aforementioned tasks as well as on image interpolation.

🔍 Main Ideas:
1) Spatial Styles:
The authors of StyleMapGAN propose to use latent tensors instead of latent vectors. To do that they reshape the output of their mapping network into a 64 by 8 by 8 tensor and then pass it through a resizer network, made up of convolution and upsampling layers, that outputs a set of style tensors whose spatial sizes match the intermediate feature maps in the generator. The style tensors are then used to predict modulation means and variances for each of the generator layers.
An interesting change from the StyleGAN generator is the absence of the per-pixel noise inputs (the "B" branch): they added spatial variation to the generated image, and that spatial variation is now baked into the style tensors.

2) Encoder:
The encoder network is pretty much the same thing as the patch discriminator used in StyleGAN except that it does not do minibatch discrimination, and instead predicts a single style tensor that later goes into the resizer network.

3) Training:
All networks are jointly trained with multiple losses: MSE and LPIPS between real and generated images for the generator and encoder networks, adversarial loss for all of the models, and MSE loss between the predicted/sampled style tensor and the style tensor that is obtained by passing the generated image back through the encoder (In-Domain loss).

4) Local editing:
The source and reference images are inverted with the encoder to obtain their style tensors that are then alpha blended in W+ according to the provided editing mask.
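
A sketch of the local editing step (with stand-in `encoder`/`generator` callables; the blending space and mask handling are simplified):

```python
import torch
import torch.nn.functional as F

def local_edit(src_img, ref_img, mask, encoder, generator):
    # invert both images into spatial style tensors, blend them with the mask,
    # and decode the blended style map
    w_src = encoder(src_img)                     # (B, C, h, w) style tensor
    w_ref = encoder(ref_img)
    m = F.interpolate(mask, size=w_src.shape[-2:], mode='nearest')
    w_mix = m * w_ref + (1 - m) * w_src          # reference styles inside the mask
    return generator(w_mix)
```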

📈Interesting Numbers / Main takeaways:
- The authors introduce an interesting new metric called FID-lerp to measure how well images can be interpolated in the latent space. It is computed by measuring FID on images that are randomly interpolated between pairs of images from the test set.
- To measure the effectiveness of local edits the authors compute two types of MSE: one between the source parts of the image, and one between the inserted parts of the reference image
- Quite obviously, increased style tensor resolution improves reconstruction drastically. However, it is not stated whether the style map overfits at higher resolutions, given the known tradeoff between reconstruction quality and editability.
- The authors claim the best FID for real image projections but, strangely, do not provide comparisons with pSp, e4e, or any other encoder-based architectures, only claiming a speedup over optimization-based methods.

✏️My Notes:
- 3/5 - ok name, nothing too interesting
- It is not mentioned in the paper whether there are any engineering tricks for fitting the whole pipeline into memory, as the total memory requirement seems considerable.
- The most surprising property of the model imho is that alignment is not required between images when editing them.

🔗Links:
Paper / Code

If you found this paper explanation useful, please share it with your friends and colleagues to support this channel!

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#17: "Improving Inversion and Generation Diversity in
StyleGAN using a Gaussianized Latent Space" by Wulff et al.

🔑 Keywords:
#latent_space_encoding #pretrained_generator #latent_interpolation #gaussian_prior #latent_PCA

🎯 At a glance:
In this paper about improving latent space inversion for a pretrained StyleGAN2 generator the authors propose to model the output of the mapping network as a Gaussian, which can be expressed as a mean and a covariance matrix. This prior is used to regularize images that are projected into latent space via optimization, which makes the inverted images lie in well conditioned regions of the generator's latent space, and allows for smoother interpolations and better editing.

🔍 Main Ideas:
1) Gaussian latent space:
It is possible to embed any image into the latent space of a generator, for example an image of a car into a face generator; however, such images lie in ill-defined regions of the latent space, and it is nearly impossible to interpolate between them. To make sure the inverted images stay within the learned distribution, the authors propose to use a Gaussian prior when inverting images.
It turns out that the distribution of latent codes in W is highly irregular and hard to describe analytically. However, after undoing the mapping network's last leaky ReLU (slope 0.2) with another leaky ReLU (slope 5), the distribution of latent vectors sampled from the mapping network becomes approximately a high-dimensional Gaussian. Hence its mean and covariance matrix can be computed empirically by sampling a set of vectors from the mapping network.

2) Improving inversion:
To improve the inversion the authors add a regularizer to the standard inversion objective, which finds a latent vector minimizing the distance between the input image and the image generated from the optimized vector. The regularizer is a Mahalanobis-style distance: the difference between the "gaussianized" (via leaky ReLU with slope 5.0) optimized latent vector and the empirical mean, multiplied on both sides of the inverse empirical covariance matrix.
It is possible to extend this approach to W+ by summing the regularization terms over each of the optimized vectors in W+.
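
A sketch of the Gaussian fit and the prior term (assuming the Mahalanobis form above; `mapping_network` is a stand-in for the pretrained StyleGAN2 mapping network):

```python
import torch
import torch.nn.functional as F

def gaussianize(w, slope=5.0):
    # undo the mapping network's final leaky ReLU (slope 0.2) with its inverse
    return F.leaky_relu(w, negative_slope=slope)

def fit_gaussian(mapping_network, n=10_000, dim=512):
    # empirical mean and inverse covariance of the gaussianized W space
    with torch.no_grad():
        v = gaussianize(mapping_network(torch.randn(n, dim)))    # (n, dim)
    return v.mean(dim=0), torch.linalg.inv(torch.cov(v.t()))

def gaussian_prior(w_opt, mu, cov_inv):
    # penalty added to the inversion objective
    d = gaussianize(w_opt) - mu
    return d @ cov_inv @ d
```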

3) Removing artifacts using the Gaussian prior:
The go-to method for increasing sample quality in generative models is the truncation trick, which pushes samples closer to the average vector and thereby decreases sample diversity, since all images start to look similar to the average image. The authors suggest an alternative approach to combat generation artifacts. They perform PCA on the Gaussianized latent space to find the main directions of variation and project the latent vectors onto the computed principal components. As it turns out, images with artifacts have very large values in certain components, especially the lower dimensions. The final step is to apply logarithmic compression to the large components, after which the resulting images become virtually artifact-free while retaining generated content diversity.

📈Interesting Numbers / Main takeaways:
- Truncation changes the identity of the person in the image, while PCA compression does not
- User study concluded that the interpolation quality for the proposed method is miles ahead of the baseline (2-8 times higher)
- The reconstruction errors on images are 3-4 times lower compared to the baseline

✏️My Notes:
- No name to rate here
- Kind of a counterintuitive idea, since the whole motivation for the mapping network was to let the generator learn a complex latent space and move away from the simple Gaussian prior, which was deemed not expressive enough.
- Cool that it is a trick that does not require training anything

🔗Links:
Paper / Code (not available at this time)

👋 If you found this paper explanation useful, please subscribe, and share it with your friends and colleagues to support this channel!

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#18.1: "Emerging Properties in Self-Supervised Vision Transformers" by Caron et al.

🔑 Keywords:
#vision_transformers #self_supervised #ImageTransformers #Transformers #DINO

🎯 At a glance:
In this paper from Facebook AI Research the authors propose a novel pipeline to train a ViT model in a self-supervised setup. Perhaps the most interesting consequence of this setup is that the learned features are good enough to achieve an 80.1% top-1 score on ImageNet. At the core of the pipeline is a pair of networks that learn to predict each other's outputs. The trick is that while the student network is trained via gradient descent on a cross-entropy loss, the teacher network is updated with an exponential moving average of the student network's weights. Several tricks such as centering and sharpening are employed to combat mode collapse. As a fortunate side effect, the self-attention maps of the final layer automatically learn class-specific features, leading to unsupervised object segmentations.

🔍 Main Ideas:
1) Self-supervised learning with knowledge distillation:
In DINO the student network tries to match the output of the teacher network, which is a probability distribution over the output dimensions. Both output vectors are normalized with a softmax whose temperature parameter controls the sharpness of the distribution: higher values give smoother distributions, lower values sharper ones. The teacher network is kept fixed during the update, and the student network is trained with a cross-entropy loss between the two output probability distributions. Both networks share the same architecture but have different sets of parameters.

2) Local to global correspondence:
The two networks actually see different transformed views of the same input image. The teacher network sees only 2 high resolution "global" views, while the student network additionally sees several smaller (less than 50% of full image) "local" views.

3) Teacher Network:
Since DINO is self-supervised there is no pretrained teacher network, so it is built directly during training. Empirically, the best results were obtained with a momentum encoder, meaning that the teacher's weights at each iteration are an exponential moving average of the student weights from previous iterations.

4) Architecture:
The backbone of DINO is a ViT, followed by a projection head made of 3 fully connected layers of size 2048, L2 normalization, and a weight-normalized fully connected layer. 8x8 patches are used as input to the ViT.

5) Avoiding Mode Collapse:
Two opposite operations are used: centering, which adds a bias term to the teacher outputs to stop one dimension from becoming too dominant, and sharpening that uses a low temperature in the teacher softmax to prevent collapse to a uniform distribution, which is a side effect of centering. The "center" is updated with an EMA, which works well for varying batch sizes.
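
The pieces above fit together roughly as follows (a sketch; the temperature and momentum values are illustrative defaults, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # cross-entropy between the centered+sharpened teacher distribution and the
    # student distribution for one (student view, teacher view) pair
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    # teacher weights follow an exponential moving average of the student's
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

@torch.no_grad()
def update_center(center, teacher_out, m=0.9):
    # the center is itself an EMA of the teacher outputs over the batch
    return center * m + teacher_out.mean(dim=0) * (1 - m)
```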

👋 If you found this paper explanation useful, please subscribe, and share it with your friends and colleagues to support this channel!

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
#18.2: "Emerging Properties in Self-Supervised Vision Transformers" by Caron et al.

📈Interesting Numbers / Main takeaways:
- DINO outperforms other baselines with a simple k-NN classifier on ImageNet
- DINO outperforms supervised baselines on ImageNet image retrieval benchmark
- DINO works well for copy detection tasks
- DINO's self-attention maps are competitive for video instance segmentation
- DINO's features transfer better to downstream tasks than comparable supervised methods

✏️My Notes:
- 4/5 DINO is a dope name but it is just random parts of words from the full title
- The authors decrease the spatial sizes of patches (compared to ViT), which obviously boosts accuracy, but harms memory and performance. Interesting to see what could be done to alleviate this problem.
- With the number of Vision Transformer models I am covering lately, this channel might as well be renamed to "Casual Transformers"
- How long do you think until vision transformers have their "StyleGAN/ImageNet" moment and it becomes obvious if and why they are superior to convnets for CV tasks? I think we will know by the end of the year for sure. Comment below, and let's discuss!

🔗Links:
Paper / Code

👋 If you found this paper explanation useful, please subscribe, and share it with your friends and colleagues to support this channel!

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#19: "GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds" by Zekun Hao et al.

🔑 Keywords:
#neural_rendering #unsupervised #unpaired #3d_computer_vision #3D

🎯 At a glance:
Did you ever want to quickly create a photorealistic 3D scene from scratch? Well, now you can! The authors from NVIDIA propose a new neural rendering model trained with adversarial losses ...WITHOUT a paired dataset. Yes, it only requires a 3D semantic block world as input, pseudo ground truth images generated by a pretrained image synthesis model, and real landscape photos to output a consistent photorealistic render of a 3D scene corresponding to the block world input.

🔍 Main Ideas:
1) Generating training data:
For each training iteration a random camera pose from the upper hemisphere of the scene is sampled along with a focal length value. The camera is used to project the block world to a 2d image to obtain a semantic segmentation mask. The mask along with a random latent code is passed to a pretrained SPADE network that generates the pseudo ground truth image corresponding to the sampled camera view of the Minecraft scene.

2) Voxel-based volumetric neural renderer:
In GANcraft the scene is represented by a set of voxels with corresponding semantic labels. There is a separate neural radiance field for each of the blocks, and for points in space where no block exists, a null feature vector with density 0 is returned. To model diverse appearances of the same underlying scene, the radiance fields are conditioned on a style code z. Interestingly, the MLP encoder that predicts the style vector is shared amongst all voxels. The location code is derived by computing Fourier features of the trilinear interpolation of learnable codes at the vertices of the voxels.

3) Neural sky dome:
The sky is assumed to be infinitely far away, hence it is rendered as a large dome, and its color is obtained from a single MLP that maps a ray direction and a style code to a color value.

4) Hybrid neural rendering:
The rendering is done in two phases, first feature vectors are aggregated along rays for each pixel, and then a shallow CNN with 9x9 receptive field is employed to convert the feature map to an RGB image of the same size.

5) Losses and regularizers:
The authors use L1, L2, perceptual, and adversarial losses on the pseudo ground truth images, and adversarial loss on the real images to train all of the models.

📈Interesting Numbers / Main takeaways:
- GANcraft has lower FID and KID than MUNIT and NSVF-W, and almost matches SPADE even though SPADE is not view-consistent
- GANcraft is the most view-consistent among the baselines


✏️My Notes:
- 5/5 Awesome paper/model name!
- No idea how they managed to fit all of this into memory. Even though it is mentioned in the paper that the two stage approach helps reduce the memory footprint, they would still need to keep the activations for every ray to compute the per image losses.
- Love how the paper brings together a bunch of ideas from unrelated papers into one cohesive narrative
- The results are far from perfect, but it is nevertheless a great start.
- Without GAN loss the model produces results that are blurrier, without pseudo ground truth images - less realistic, and without the two phase rendering - lacking in fine detail

🔗Links:
Paper / Project page (code not available at this time)

👋 If you found this paper explanation useful, please subscribe, and share it with your friends and colleagues to support this channel!

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
#19: "GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds" by Zekun Hao et al.

Some helpful figures from the paper.
​​#20: "Large Scale Image Completion via Co-Modulated Generative Adversarial Networks" (#CoModGAN) by Zhao Shengyu et al.

🔑 Keywords:
#ICLR2021 #generator #inpainting #modulated_convolutions #conditional_generation #image_synthesis #image2image

🎯 At a glance:
Is it true that all existing methods fail to inpaint large-scale missing regions? The authors of CoModGAN claim that it is impossible to complete an object that is missing a large part unless the model is able to generate a completely new object of that kind, and propose a novel GAN architecture that bridges the gap between image-conditional and unconditional generators, which enables it to generate very convincing complete images from inputs with large portions masked out.

🔍 Main Ideas:
1) Modulation approaches:
On the one hand, there is unconditional modulation from StyleGAN2, where a noise vector is passed through a fully-connected mapping network to obtain a style vector. The style vector is used to generate a modulation vector that multiplies the input feature maps channel-wise. The resulting feature maps are processed by a convolution and demodulated (normalized) to have unit variance.
On the other, there are image-conditional generators that use learned flattened features from an encoder as modulation parameters. Their main shortcoming is the lack of stochastic generative capability. The outputs are simply not diverse enough, when limited input information is available since the output should only be weakly conditioned on the input.

2) Co-modulation:
The authors propose to solve the aforementioned issues by combining the two types of style vectors (from encoder and from sampled noise) via a joint affine transformation to produce a single modulation parameter. It is noted that it is not necessary to use a nonlinear mapping to combine the two style vectors, and it is sufficient to assume that the two style vectors can be linearly correlated to improve the generated image quality, especially when a large portion of the input is missing.
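
A minimal sketch of the co-modulation idea (a single joint affine layer over the concatenated style vectors; layer names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CoModulation(nn.Module):
    # combine the encoder's flattened image features with the mapped noise
    # style vector through one joint affine layer to get per-layer modulation
    def __init__(self, enc_dim, w_dim, channels):
        super().__init__()
        self.affine = nn.Linear(enc_dim + w_dim, channels)

    def forward(self, enc_feat, w):
        s = self.affine(torch.cat([enc_feat, w], dim=1))   # (B, channels)
        return s   # scales the conv input feature maps channel-wise before demodulation
```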

3) Paired/Unpaired Inception Discriminative Score:
The authors additionally propose a novel metric to measure the linear separability of fake/real samples in a pretrained feature space.
For the paired case the idea is to sample pairs of real and fake images from a joint distribution, extract features from them with a pretrained Inception v3 model, and train a linear SVM on the extracted features. The P-IDS is the probability that a fake sample is considered more realistic than the corresponding real sample.
If there is only unpaired data available the images are sampled independently, and the U-IDS is the misclassification rate of the SVM instead.
The authors point out 3 major advantages of P-IDS/U-IDS: robustness to sampling size, effectiveness of capturing subtle differences, and good correlation to human preferences.
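
An illustrative formulation of the two metrics with a linear SVM on precomputed Inception features (a sketch, not the official evaluation code):

```python
import numpy as np
from sklearn.svm import LinearSVC

def ids_scores(feat_real, feat_fake, paired=True):
    # fit a linear SVM on Inception features of real vs. fake samples
    X = np.concatenate([feat_real, feat_fake])
    y = np.concatenate([np.ones(len(feat_real)), np.zeros(len(feat_fake))])
    svm = LinearSVC().fit(X, y)
    if paired:
        # P-IDS: how often the fake of a pair scores as more "real" than its real pair
        d_real = svm.decision_function(feat_real)
        d_fake = svm.decision_function(feat_fake)
        return float((d_fake > d_real).mean())
    # U-IDS: the SVM's misclassification rate on the same features
    return float(1.0 - svm.score(X, y))
```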

📈Interesting Numbers / Main takeaways:
- It is possible to tune the tradeoff between quality and diversity by adjusting the truncation parameter that amplifies the stochastic branch
- CoModGAN outperforms other baselines on inpainting tasks, and especially large-region inpainting
- CoModGAN can be effectively used for various image-to-image tasks such as edges-to-photo and labels-to-photo, and outperforms MUNIT and SPADE on the corresponding tasks

✏️My Notes:
- 4/5 for the name, CoModGAN sounds really funny in Russian
- The inpainted images look pretty insane
- Really simple yet effective idea
- Awesome that the model is applicable in many various settings
- Interesting whether the images can be composited of different input parts and edited in the latent space

🔗Links:
Paper / Code

👋 If you found this paper explanation useful, please subscribe, and share it with your friends and colleagues to support this channel!

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin
​​#21: "Endless Loops: Detecting and Animating Periodic Patterns in Still Images" by Tavi Halperin et al.

🔑 Keywords:
#video_generation #single_image #endless_motion #texture_animation #periodic_animation #cinemagraph_generation

🎯 At a glance:
Have you ever taken a still photo and later realized how cool it would have been to take a video instead? The authors of the "Endless Loops" paper have got you covered. They propose a novel method that creates seamless animated loops from single images. The algorithm detects periodic structures in the input image, uses them to predict a motion field for the region, and finally smoothly warps the image to produce a continuous animation loop.

🔍 Main Ideas:
1) Overview:
The algorithm takes as input an image to be animated, a binary mask indicating the pixels to be animated, and a general direction of motion. The authors observed that displacement fields suitable for cinemagraph creation can mostly be constructed from a main direction vector together with a small number of offsets. The dense CRF solver is quadratic in the number of displacement vectors, hence keeping that number small is vital to the inference time of the model.

2) Detecting repetitions:
The authors use a two-stage approach to generate the displacement field. They first solve a simpler problem in 1D. Specifically, they sample a wide band of pixels around the main direction line going through the center of mass of the masked region. The goal is to match successive occurrences of the repeating pattern (windows, tiles, etc.). The problem is solved by finding the shortest path spanning the main diagonal of the self-similarity matrix of the band of pixels via dynamic programming. The shortest path has successive, non-decreasing indices of points in both dimensions, and no more than two points can stay on the same matrix row. To avoid a trivial path along the main diagonal (1, 1) -> (2, 2) -> ... -> (n, n), the elements of the matrix on and around the main diagonal are given infinite weight. The authors found that the best results are obtained when reflecting the original sample width-wise before computing the shortest path and taking the middle part of the computed path.

3) Displacement assignment with CRF:
To compute the displacement field over every pixel in the region the authors choose to fit a CRF on an extended set of motion vectors obtained in the previous step. Since more angular freedom is wanted for assigning motion vectors to pixels, the set of vectors from the previous step is expanded with additional vectors up to 30 degree angular deviation. Each pixel is assigned a starting value extrapolated from the 1D displacement vectors.
The CRF solver uses feature vectors from the first two layers of a VGG-16 model to minimize the perceptual distance between each pixel and its displacement. The solution is regularized to stay close to the original guess. Cosine similarity is used as the label compatibility function.

4) Inverting the flow:
An invertible motion field is required to generate video frames via backward warping according to the flow, hence the predicted motion field is subsampled after a Gaussian kernel is applied to it. Subsequently, the sparse flow is interpolated to all pixels using a polyharmonic spline with a radial basis function. This operation ensures the dense field is smooth, has no holes or collisions, and serves as a good approximation of the inverse flow.

(Continues below 👇)

👋 If you found this paper explanation useful, please subscribe, and share it with your friends and colleagues to support this channel!

By: @casual_gan
P.S. Send me paper suggestions for future posts @KirillDemochkin