I'm back, people! After a pause, I've decided to continue posting in this channel. I promise to select the most interesting papers and write at least 1-2 posts per week.
Cheers,
Artsiom
🇨🇳 Chinese researchers take deepfakes to the next level by swapping the entire head
We have all seen deepfakes where faces are swapped. This paper goes further: it substitutes the entire head, not just the face. Miracles of Chinese engineering and a lot of losses do the job 🤓.
Compared to the usual "face swap", the new method transfers the identity from the target photo to the driving video better, preserving the hair, eyebrows, and other important attributes. Temporal stability could still use a slight improvement, though: the edges of the head are a little twitchy. There is no code yet, but the authors have promised to upload it soon.
❱❱ Few-Shot Head Swapping in the Wild (CVPR 2022, Oral)
Neural 3D Reconstruction in the Wild
[SIGGRAPH 2022]
When will neural-based approaches beat COLMAP in terms of both speed and quality of surface reconstruction? Here is a neural rendering method tackling the quality part.
The authors show that with a clever sampling strategy, neural-based 3D reconstruction of large scenes in the wild can be better and faster than COLMAP.
The method is built on top of NeuS (two MLPs predicting color and the SDF value for each (x, y, z) location). The main contribution is a hybrid voxel- and surface-guided sampling technique that concentrates ray samples around surfaces more efficiently and leads to significant improvements in reconstruction quality (see the attached figure).
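For intuition, here is a hypothetical toy sketch of what voxel- and surface-guided ray sampling can look like. This is not the authors' code: the sphere SDF, the dummy occupancy grid, the scene bounds, and the sample counts are made up purely for illustration.

```python
import torch

def sdf(points):
    # Placeholder SDF of a unit sphere; in the real method this comes from an MLP.
    return points.norm(dim=-1) - 1.0

def hybrid_sample(ray_o, ray_d, occupied, voxel_size=0.25, n_coarse=64, n_fine=16):
    """Sketch of hybrid sampling: coarse samples are kept only inside occupied voxels,
    then extra fine samples are placed around the estimated SDF zero-crossing."""
    t = torch.linspace(0.1, 4.0, n_coarse)                 # coarse depths along the ray
    pts = ray_o + t[:, None] * ray_d                       # (n_coarse, 3)

    # Voxel guidance: keep samples whose voxel is marked occupied (assume a [-2, 2]^3 scene).
    idx = ((pts + 2.0) / voxel_size).long().clamp(0, occupied.shape[0] - 1)
    keep = occupied[idx[:, 0], idx[:, 1], idx[:, 2]]
    t, pts = t[keep], pts[keep]
    if t.numel() < 2:
        return t

    # Surface guidance: find the first SDF sign change along the ray and densify there.
    d = sdf(pts)
    crossings = (d[:-1] * d[1:] < 0).nonzero()
    if crossings.numel() == 0:
        return t
    i = crossings[0, 0]
    t_fine = torch.linspace(t[i].item(), t[i + 1].item(), n_fine)
    return torch.sort(torch.cat([t, t_fine])).values

# Usage: a dummy fully-occupied 16^3 grid and a single ray shot at the sphere.
occupied = torch.ones(16, 16, 16, dtype=torch.bool)
ts = hybrid_sample(torch.tensor([0.0, 0.0, -3.0]), torch.tensor([0.0, 0.0, 1.0]), occupied)
```

The point is simply to spend ray samples only where the sparse voxel grid says there can be geometry and to concentrate extra samples around the SDF zero level set, which is where the surface lives.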
❱❱ Project page
Code is coming soon
@Artem Gradient
Image Inpainting: Partial Convolution vs Gated Convolution
Let’s talk about an essential component of image inpainting networks: convolutions. #fundamentals
It is common in image inpainting models to feed a corrupted image (with some parts masked out) to the generator network. But we don’t want the network layers to rely on the empty regions when computing features. There is a straightforward solution to this problem (partial convolutions) and a more elegant one (gated convolutions).
🔻Partial convolutions make the convolution depend only on valid pixels. They work like normal convolutions, but a hard binary mask is multiplied into each output feature map. The first mask is computed from the occluded image directly or is provided by the user. The mask for every subsequent partial convolution is computed by finding the non-zero elements in the input feature map (a minimal code sketch follows this list).
- Partial convolution heuristically classifies every spatial location as either valid or invalid. The mask in the next layer is set to one regardless of how many valid pixels fall inside the filter window in the previous layer (for a 3x3 conv, 1 valid pixel and 9 valid pixels update the current mask in exactly the same way).
- With partial convolutions, the invalid pixels progressively disappear in deeper layers, gradually turning all mask values into ones.
- Partial convolution is incompatible with additional user inputs, yet we would like to be able to utilize extra user inputs for conditional generation (for example, a sparse sketch inside the masked region).
- All channels in each layer share the same mask, which limits flexibility. Essentially, partial convolution can be viewed as un-learnable, single-channel, hard feature gating.
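To make the mechanics concrete, here is a minimal PyTorch sketch of a partial convolution with the hard mask update described above. It is a simplified illustration, not the reference implementation (the bias handling and renormalization details in the original paper are a bit more involved):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Minimal partial convolution: features are computed from valid pixels only,
    outputs are rescaled by the fraction of valid pixels under each window, and the
    hard mask is updated to 1 wherever at least one valid input pixel was seen."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        # Fixed all-ones kernel, used only to count valid pixels under each window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.padding = padding

    def forward(self, x, mask):
        # x: (B, C, H, W) features; mask: (B, 1, H, W), 1 = valid, 0 = hole.
        out = self.conv(x * mask)
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, padding=self.padding)
        # Renormalize by (window size / #valid pixels); zero out fully-invalid windows.
        out = out * (self.ones.numel() / valid.clamp(min=1.0)) * (valid > 0)
        # Hard mask update: 1 valid pixel and 9 valid pixels give the same new mask.
        return out, (valid > 0).float()

# Usage: a masked RGB image passes through two partial convolutions,
# and the hole in the mask shrinks after every layer.
img, mask = torch.randn(1, 3, 64, 64), torch.ones(1, 1, 64, 64)
mask[:, :, 16:48, 16:48] = 0                      # simulate a hole
pc1, pc2 = PartialConv2d(3, 16), PartialConv2d(16, 16)
x, mask = pc1(img, mask)
x, mask = pc2(x, mask)
```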
🔻Gated convolutions. Instead of a hard-gating mask updated with hand-crafted rules, gated convolutions learn a soft mask automatically from data. Each layer has a “soft gating” block (a single convolutional layer) that takes the input feature map and predicts a soft mask, which is then applied to the output of the convolution (a minimal sketch follows this list).
- Can take any extra user guidance (e.g., mask, sketch) as input: it can all be concatenated with the corrupted image and fed to the first gated convolution.
- Learns a dynamic feature selection mechanism for each channel and each spatial location.
- Interestingly, visualizations of intermediate gating values show that the network learns to select features not only according to the background, mask, and sketch, but in some channels it also takes semantic segmentation into account.
- Even in deep layers, gated convolution learns to highlight the masked regions and the sketch information in separate channels, which leads to better inpainting results.
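And here is the matching minimal sketch of a gated convolution, again a simplified illustration rather than the official implementation (the ELU activation and the way guidance is concatenated are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv2d(nn.Module):
    """Minimal gated convolution: a feature branch plus a parallel 'soft gating'
    branch that predicts a per-channel, per-pixel gate in [0, 1] for the features."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gating = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        # The soft mask is learned from data instead of being updated by hand-crafted rules.
        return F.elu(self.feature(x)) * torch.sigmoid(self.gating(x))

# Usage: the corrupted image, the mask, and an optional user sketch are simply
# concatenated and fed to the first gated convolution.
img, mask, sketch = torch.randn(1, 3, 64, 64), torch.ones(1, 1, 64, 64), torch.zeros(1, 1, 64, 64)
x = GatedConv2d(3 + 1 + 1, 32)(torch.cat([img * mask, mask, sketch], dim=1))
```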
@gradientdude
VisCo Grids: Surface Reconstruction with Viscosity and Coarea Grids
The goal of this work is to show that network-free, grid-based implicit representations can achieve INR (Implicit Neural Representation)-level 3D reconstructions when suitable priors are incorporated.
VisCo Grids is a grid-based surface reconstruction algorithm that incorporates well-defined geometric priors: Viscosity and Coarea.
The viscosity loss replaces the Eikonal loss used in INRs for optimizing Signed Distance Functions (SDFs). The Eikonal loss admits many bad minima that are avoided in the INR setting thanks to the network's inductive bias, but they are present with the grid parametrization (see the attached figure, middle). The viscosity loss regularizes the Eikonal loss and provides a well-defined smooth solution that converges to the "correct" viscosity SDF solution. However, the viscosity loss alone does not penalize excessive or "ghost" surface parts (see the figure, right).
Therefore, a second useful prior is the coarea loss, which directly controls the surface's area and encourages it to be smaller. The coarea loss applies a "squashing" function to the viscosity SDF, making it approximately an indicator function, and then integrates its gradient norm over the domain.
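For intuition, here is a rough LaTeX sketch of the two priors as I understand them; the generic squashing function ψ, the weighting, and the exact sign conventions are my assumptions and may differ from the paper's formulation:

```latex
% Standard Eikonal loss on the SDF f; under a grid parametrization it admits many bad minima:
\mathcal{L}_{\mathrm{eik}} = \int_{\Omega} \bigl( \lVert \nabla f(x) \rVert - 1 \bigr)^{2}\, dx

% Viscosity-regularized variant: a small Laplacian term \epsilon\,\Delta f smooths the problem,
% and the optimum converges to the viscosity SDF solution as \epsilon \to 0:
\mathcal{L}_{\mathrm{visc}} = \int_{\Omega} \bigl( \lVert \nabla f(x) \rVert - 1 - \epsilon\, \Delta f(x) \bigr)^{2}\, dx

% Coarea loss: squash f into an approximate indicator \psi(f) and integrate its gradient norm,
% which approximates the surface area and thereby penalizes excessive "ghost" surfaces:
\mathcal{L}_{\mathrm{coarea}} = \int_{\Omega} \lVert \nabla \psi\bigl( f(x) \bigr) \rVert \, dx
```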
VisCo Grids offer instant inference and, even with our current, rather naive implementation, are faster to train than INRs. Considerable training-time improvements are expected with a more efficient implementation.
We tested VisCo Grids on a standard 3D reconstruction dataset and achieved accuracy comparable to state-of-the-art INR methods.
❱❱ Paper
❱❱ Code is coming soon.
@gradientdude
I will be presenting our VisCo Grids paper at NeurIPS tomorrow at 11:00-13:00 in Hall J.
Feel free to come to poster #527 to learn more details if you are attending the conference!
@gradientdude
🦿Avatars Grow Legs
I'm thrilled to share with you my latest research paper (CVPR 2023)! This was a joint effort with my intern at Meta Reality Labs before our team transitioned to GenAI.
Our innovative method, dubbed Avatars Grow Legs (AGRoL), aims to control a 3D avatar's entire body in VR without the need for extra sensors. Typically in VR, your interaction is limited to a headset and two handheld joysticks, leaving out any direct input from your legs. This limitation persists despite the Quest's downward-facing cameras, as they rarely capture an unoccluded view of the legs.
To tackle this, we've introduced a novel solution based on a diffusion model. Our model synthesizes the 3D movement of the whole body conditioned solely on the tracking data from the hands and the head, circumventing the need for direct observation of the legs.
Moreover, we've designed AGRoL with an efficient architecture enabling 30 FPS synthesis on a V100 GPU.
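For intuition, here is a hypothetical, heavily simplified sketch of sampling a full-body pose from a diffusion model conditioned on head and wrist tracking. This is not the AGRoL architecture: the toy MLP denoiser, the dimensions, the per-frame (rather than per-sequence) setup, and the noise schedule are all made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 3 tracked joints (head + two wrists) with 6-DoF signals each,
# and a full body of 22 joints in a 6D rotation representation.
SPARSE_DIM, BODY_DIM, T_STEPS = 3 * 6, 22 * 6, 50

class Denoiser(nn.Module):
    """Toy stand-in for the real network: predicts the clean full-body pose from a noisy
    pose, the diffusion step, and the sparse tracking signal (the real model works on
    whole motion sequences, not single frames)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(BODY_DIM + SPARSE_DIM + 1, 512), nn.SiLU(),
            nn.Linear(512, BODY_DIM))

    def forward(self, x_t, t, tracking):
        t_emb = t.float().view(-1, 1) / T_STEPS
        return self.net(torch.cat([x_t, tracking, t_emb], dim=-1))

@torch.no_grad()
def sample_body(model, tracking, alphas_cumprod):
    """DDPM-style sampling that predicts the clean pose and re-noises it to the previous
    step, conditioning every step on the tracking signal."""
    x = torch.randn(tracking.shape[0], BODY_DIM)
    for t in reversed(range(T_STEPS)):
        x0_pred = model(x, torch.full((x.shape[0],), t), tracking)
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * noise
    return x0_pred

# Usage with a dummy tracking input and a simple linear noise schedule.
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
body = sample_body(Denoiser(), torch.randn(1, SPARSE_DIM), alphas_cumprod)
```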
❱❱ Code and weights
❱❱ Paper
❱❱ Project Page
@gradientdude
Demo of our Avatars Grow Legs model that synthesizes the full 3D body motion based on sparse tracking inputs from the head and the wrists.
More details are in the paper.
@gradientdude
Today I will be presenting our CVPR2023 poster "Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model".
Learn how to synthesize full body motion based on 3 known points only (head and wrists)!
❱❱ Detailed post about the paper.
Come to chat with me today at 10:30-12:30 PDT, poster #46 if you are at CVPR.
@gradientdude
Staff Research Scientist: Personal Update
I have some exciting news that I'd like to share with you! On Monday, I was promoted to E6, which means I am now a Staff Research Scientist at Meta GenAI.
This was made possible thanks to the significant impact and scope of a Generative AI project that I proposed, led, and completed last year. The project is not yet public, so I can't share details about it right now.
Before this, I was at the terminal level - Senior Research Scientist, a position many get stuck in forever. It takes extra effort and personal qualities to break out of this limbo and become a Staff member. But now, I've unlocked a new ladder, E6+, where leveling up is significantly more challenging than between Junior (E3) and Senior (E5) levels. However, this also presents a challenge and an opportunity for further development!
Exciting stuff!
@gradientdude
⚡️SD3-Turbo: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
Following Stable Diffusion 3, my ex-colleagues have published a preprint on distilling SD3 down to 4 sampling steps while maintaining quality.
The new method is Latent Adversarial Diffusion Distillation (LADD). It is similar to ADD (see the post about it in @ai_newz) but differs in several ways (a schematic training-step sketch follows at the end of this post):
↪️ Both the teacher and the student use the Transformer-based SD3 architecture here. The biggest and best model has 8B parameters.
↪️ Instead of a DINOv2 discriminator working on RGB pixels, this paper goes back to a latent-space discriminator, which is faster and uses less memory.
↪️ A copy of the teacher is used as the discriminator backbone (i.e., the backbone was pretrained generatively rather than discriminatively, as DINO was). After each attention block, a discriminator head with 2D conv layers classifies real/fake. This way the discriminator looks not only at the final result but at all intermediate features, which strengthens the training signal.
↪️ Trained on images with different aspect ratios rather than just 1:1 squares.
↪️ The L2 reconstruction loss between the teacher's and the student's outputs is removed. The authors find that the discriminator alone is enough if you choose the sampling distribution of the noise steps t wisely.
↪️ During training, they sample high-noise steps t more frequently so that the student learns to generate the global structure of objects better.
↪️ Distillation is performed on synthetic data generated by the teacher rather than on photos from a dataset, as was the case in ADD.
They also show that DPO-LoRA tuning is a pretty nice way to further improve the quality of the student's generations.
So we get an SD3-Turbo model producing nice pics in 4 steps. According to a small human eval (conducted on only 128 prompts), the student is comparable to the teacher in terms of image quality, but the student's prompt alignment is inferior, which is expected.
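To make the setup more concrete, here is a hypothetical, heavily simplified sketch of one LADD-style training step. The tiny stand-in modules below are not the SD3 networks, the text and timestep conditioning is omitted, and the noising is schematic; only the overall structure (latent-space discriminator built on the teacher, per-block heads, hinge losses, no L2 reconstruction term, teacher-generated data) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Stand-in for the teacher copy used as the discriminator feature extractor; the real
    backbone is an SD3 transformer and the features are taken after attention blocks."""
    def __init__(self, ch=8, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Conv2d(ch, ch, 3, padding=1) for _ in range(n_blocks))

    def features(self, z):
        feats = []
        for blk in self.blocks:
            z = F.silu(blk(z))
            feats.append(z)          # one feature map per "block"
        return feats

student = nn.Conv2d(8, 8, 3, padding=1)               # stand-in for the one-step SD3 student
disc_backbone = TinyBackbone().requires_grad_(False)  # teacher copy (kept frozen in this sketch)
disc_heads = nn.ModuleList(nn.Conv2d(8, 1, 1) for _ in range(3))  # real/fake head per block
opt_g = torch.optim.Adam(student.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc_heads.parameters(), lr=1e-4)

def ladd_step(z_teacher):
    """One schematic LADD step on latents z_teacher previously generated by the teacher."""
    # Bias the noise level towards high noise so the student learns global structure.
    sigma = torch.rand(z_teacher.shape[0], 1, 1, 1) * 0.5 + 0.5
    z_noisy = (1 - sigma) * z_teacher + sigma * torch.randn_like(z_teacher)
    z_fake = student(z_noisy)                         # one-step prediction of the clean latent

    def logits(z):
        return [head(f) for head, f in zip(disc_heads, disc_backbone.features(z))]

    # Hinge GAN losses on every per-block head; note: no L2 reconstruction loss to the teacher.
    d_loss = sum(F.relu(1 - r).mean() + F.relu(1 + f).mean()
                 for r, f in zip(logits(z_teacher), logits(z_fake.detach())))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    g_loss = sum(-f.mean() for f in logits(z_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

ladd_step(torch.randn(2, 8, 16, 16))
```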
📖 Paper
@gradientdude
The paper also shows that SD3-Turbo is better than Midjourney 6 in both image quality and prompt alignment, which is surprising 🫥. Looking forward to the weights release to do a reality check!
@gradientdude