Scaling Vision to 4K
PS3, by #Nvidia (+ UC Berkeley), scales CLIP-style vision pre-training up to 4K resolution at near-constant cost: it encodes a low-resolution global image and selectively processes only the informative high-resolution regions (sketch below). Impressive work. Code/weights & Hugging Face release announced.
Review: https://t.ly/WN479
Paper: https://lnkd.in/ddWq8UpX
Project: https://lnkd.in/dMkTY8-k
Repo: https://lnkd.in/d9YSB6yv
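To make the near-constant-cost claim concrete, here is a minimal sketch of the two-stage idea as described above: a cheap low-resolution global pass, then encoding only the top-k most informative high-resolution patches. The encoder callables and the variance-based saliency proxy are illustrative assumptions, not PS3's learned selection module.

```python
# Hedged sketch of selective high-res encoding; NOT the official PS3 code.
import torch
import torch.nn.functional as F

def selective_hires_encode(image_4k, lr_encoder, hr_patch_encoder,
                           lr_size=224, patch=256, top_k=16):
    """Encode an LR global view, then only the top-k most salient HR patches."""
    # 1) Global low-resolution pass: cheap, resolution-independent cost.
    lr = F.interpolate(image_4k.unsqueeze(0), size=(lr_size, lr_size),
                       mode='bilinear', align_corners=False)
    global_feat = lr_encoder(lr)                      # (1, D)

    # 2) Score HR patches; local variance stands in for the learned selector.
    C, H, W = image_4k.shape
    patches = image_4k.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C, patch, patch)
    saliency = patches.var(dim=(1, 2, 3))             # (N,)

    # 3) Encode only the k most informative HR patches.
    idx = saliency.topk(top_k).indices
    local_feats = hr_patch_encoder(patches[idx])      # (k, D)
    return global_feat, local_feats, idx

if __name__ == "__main__":
    enc = lambda x: x.mean(dim=(2, 3))                # stand-in "encoders"
    g, l, _ = selective_hires_encode(torch.rand(3, 2048, 2048), enc, enc)
    print(g.shape, l.shape)                           # (1, 3) and (16, 3)
```

Cost stays roughly flat with input resolution because k is fixed no matter how many HR patches the image contains.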
LATTE-MV: #3D Table Tennis
UC Berkeley unveils at #CVPR2025 a novel system that reconstructs monocular table-tennis video in 3D, with an uncertainty-aware controller that anticipates opponent actions (toy sketch after the links). Code & dataset announced, to be released.
Review: https://t.ly/qPMOU
Paper: arxiv.org/pdf/2503.20936
Project: sastry-group.github.io/LATTE-MV/
Repo: github.com/sastry-group/LATTE-MV
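A toy reading of what an uncertainty-aware controller can do with anticipated opponent actions: when sampled predictions of the ball's landing point disagree, back off toward a safe default. Every name and the blending rule here are hypothetical stand-ins, not the paper's controller.

```python
# Toy uncertainty-aware anticipation policy (illustrative stand-in).
import numpy as np

def plan_return(pred_landing_samples, safe_target=np.array([0.0, 1.0])):
    """pred_landing_samples: (N, 2) Monte-Carlo samples (x, y) on the table.
    High predictive spread -> fall back toward a conservative target."""
    mean = pred_landing_samples.mean(axis=0)
    spread = pred_landing_samples.std(axis=0).mean()   # scalar uncertainty
    confidence = np.exp(-spread)                       # 1 when certain, ->0 when not
    # Blend the anticipated point with a safe default by confidence.
    return confidence * mean + (1.0 - confidence) * safe_target

samples = np.random.default_rng(0).normal([0.3, 0.8], 0.05, size=(64, 2))
print(plan_return(samples))  # close to the predicted mean when spread is small
```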
MVSA: Zero-Shot Multi-View Stereo
Niantic unveils MVSA, a novel multi-view stereo architecture built to work anywhere by generalizing across diverse domains & depth ranges. Highly accurate & 3D-consistent depths. Code & models announced.
Review: https://t.ly/LvuTh
Paper: https://arxiv.org/pdf/2503.22430
Project: https://nianticlabs.github.io/mvsanywhere/
Repo: https://lnkd.in/ddQz9eps
Segment Any Motion in Video
From #CVPR2025, a novel approach to moving-object segmentation that combines DINO-based semantic features with SAM2 (pipeline sketch below). Code under MIT license.
Review: https://t.ly/4aYjJ
Paper: arxiv.org/pdf/2503.22268
Project: motion-seg.github.io/
Repo: github.com/nnanhuang/SegAnyMo
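A hedged sketch of such a pipeline: fuse a crude per-track motion cue with semantic (DINO-style) features to pick moving points, then hand those points to a promptable segmenter for dense masks. All components below are stubs I invented for illustration; see the repo for the actual interfaces.

```python
# Illustrative pipeline stub; not the authors' code.
import numpy as np

def motion_scores(tracks):
    """tracks: (N, T, 2) 2D trajectories. Crude cue: displacement magnitude."""
    disp = np.linalg.norm(tracks[:, -1] - tracks[:, 0], axis=-1)
    return disp / (disp.max() + 1e-8)

def select_moving_points(tracks, feats, bg_feat, w_motion=0.7):
    """Fuse motion with semantic dissimilarity to a background prototype."""
    sem = 1.0 - feats @ bg_feat            # feats, bg_feat are L2-normalized
    score = w_motion * motion_scores(tracks) + (1 - w_motion) * sem
    return tracks[score > 0.5, -1]         # last-frame (x, y) of moving tracks

# A promptable segmenter would then densify these sparse point prompts into
# masks, e.g. masks = sam2_predictor.predict(point_coords=pts, point_labels=1)
# (hypothetical call shape; see the repo for the real SAM2 interface).
rng = np.random.default_rng(0)
tracks = rng.uniform(0, 512, size=(100, 8, 2))
feats = rng.normal(size=(100, 32)); feats /= np.linalg.norm(feats, axis=1, keepdims=True)
bg = rng.normal(size=32); bg /= np.linalg.norm(bg)
print(select_moving_points(tracks, feats, bg).shape)
```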
Video Motion Graphs
#Adobe unveils a novel system for generating realistic human-motion videos: given a reference video and conditioning signals such as music or motion tags, it synthesizes impressive new videos (toy graph-walk sketch below). Code & models to be released.
Review: https://t.ly/r4EGF
Paper: https://lnkd.in/dK_tHyzh
Project: https://lnkd.in/dE6c_KYZ
Repo: TBA
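For readers unfamiliar with the classic "motion graph" idea the title alludes to: nodes are clips, edges connect clips whose boundary poses roughly match, and generation is a walk through the graph. A toy sketch under those assumptions; the paper's construction and conditioning are more sophisticated.

```python
# Toy motion graph: clips as nodes, pose-compatible transitions as edges.
import numpy as np

def build_motion_graph(clips, thresh=3.0):
    """clips: list of (T_i, J*3) pose sequences. Edge i->j if end(i) ~ start(j)."""
    edges = {i: [] for i in range(len(clips))}
    for i, a in enumerate(clips):
        for j, b in enumerate(clips):
            if i != j and np.linalg.norm(a[-1] - b[0]) < thresh:
                edges[i].append(j)
    return edges

def random_walk(edges, start, steps, rng):
    path = [start]
    for _ in range(steps):
        nxt = edges[path[-1]]
        if not nxt:
            break
        path.append(int(rng.choice(nxt)))  # conditioning (music/tags) would bias this
    return path

rng = np.random.default_rng(0)
clips = [rng.normal(scale=0.2, size=(30, 51)) for _ in range(6)]
print(random_walk(build_motion_graph(clips), 0, 5, rng))
```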
Compose Anything is out
Skywork AI unveils SkyReels-A2, a controllable video-generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts. Code, models & evaluation benchmark released.
Review: https://t.ly/MEjzL
Paper: https://arxiv.org/pdf/2504.02436
Project: skyworkai.github.io/skyreels-a2.github.io/
Repo: github.com/SkyworkAI/SkyReels-A2
Models: https://huggingface.co/Skywork/SkyReels-A2
VoRA: Vision as LoRA
#ByteDance unveils Vision as LoRA (VoRA), a novel paradigm that converts LLMs into multimodal LLMs (MLLMs) by integrating vision-specific LoRA layers (minimal building-block sketch below). All training data, code, and model weights available.
Review: https://t.ly/guNVN
Paper: arxiv.org/pdf/2503.20680
Repo: github.com/Hon-Wong/VoRA
Project: georgeluimmortal.github.io/vora-homepage.github.io/
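The generic LoRA building block, for concreteness: a frozen pretrained linear layer plus a trainable low-rank update. How VoRA wires such adapters into the LLM to absorb vision is the paper's contribution; this sketch only shows the primitive.

```python
# Generic LoRA linear layer; the vision-specific wiring is in the VoRA repo.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + (alpha/r) * B A x  -- only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

Only A and B train, so the overhead relative to full fine-tuning is tiny.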
TTT Long Video Generation
A novel architecture for video generation that adapts the CogVideoX 5B model with Test-Time Training (TTT) layers: inserting TTT layers into a pre-trained Transformer yields one-minute clips from text storyboards (toy layer sketch below). Videos, code & annotations released.
Review: https://t.ly/mhlTN
Paper: arxiv.org/pdf/2504.05298
Project: test-time-training.github.io/video-dit/
Repo: github.com/test-time-training/ttt-video-dit
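A toy test-time-training layer to convey the mechanism: the layer's fast weights take a few gradient steps on a self-supervised loss during the forward pass, so its state adapts to each sequence. Schematic only; the paper's TTT layers inside CogVideoX are substantially more refined.

```python
# Schematic TTT layer: inner-loop SGD on fast weights during the forward pass.
import torch

def ttt_layer_forward(x, W, inner_steps=1, lr=0.1):
    """x: (T, D) sequence; W: (D, D) fast weights, reset per sequence."""
    W = W.clone().requires_grad_(True)
    for _ in range(inner_steps):
        # Self-supervised inner objective: reconstruct the tokens through W.
        loss = ((x @ W - x) ** 2).mean()
        (g,) = torch.autograd.grad(loss, W)
        W = (W - lr * g).detach().requires_grad_(True)  # inner-loop SGD step
    return x @ W                                        # adapted output

x = torch.randn(16, 64)
W0 = torch.eye(64) + 0.1 * torch.randn(64, 64)
print(ttt_layer_forward(x, W0).shape)  # torch.Size([16, 64])
```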
Unified Scalable SVG Generator
OmniSVG is the first family of end-to-end multimodal generators that leverages pre-trained VLMs to create detailed SVGs (toy token-to-SVG decoder below). Code, models & dataset to be released under MIT.
Review: https://t.ly/JcR3I
Paper: https://arxiv.org/pdf/2504.06263
Project: https://omnisvg.github.io/
Repo: github.com/OmniSVG/OmniSVG
Dataset: https://huggingface.co/OmniSVG
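A toy decoder from a discrete command-token stream to SVG markup, to make "a VLM that emits SVGs" tangible. The token vocabulary below is invented; the paper defines its own SVG parameterization.

```python
# Toy command tokens -> SVG string; vocabulary is illustrative, not OmniSVG's.
TOKENS = [("M", 10, 10), ("L", 90, 10), ("L", 90, 90), ("Z",)]

def tokens_to_svg(tokens, size=100):
    d = " ".join(t[0] if len(t) == 1 else f"{t[0]} {t[1]} {t[2]}" for t in tokens)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'viewBox="0 0 {size} {size}"><path d="{d}"/></svg>')

print(tokens_to_svg(TOKENS))
# <svg ...><path d="M 10 10 L 90 10 L 90 90 Z"/></svg>  -- a triangle
```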
BoxDreamer Object Pose
BoxDreamer is a generalizable RGB-based approach for #3D object pose estimation in the wild, specifically designed to address the challenges of sparse-view settings. Code coming, demo released.
Review: https://t.ly/e-vX9
Paper: arxiv.org/pdf/2504.07955
Project: https://lnkd.in/djz8jqn9
Repo: https://lnkd.in/dfuEawSA
Demo: https://lnkd.in/dVYaWGcS
Pose in Combat Sports
A novel SOTA framework for accurate, physics-based #3D human pose estimation in combat sports with a sparse multi-camera setup. Dataset to be released soon.
Review: https://t.ly/EfcGL
Paper: https://lnkd.in/deMMrKcA
Project: https://lnkd.in/dkMS_UrH
Geo4D: VideoGen 4D Scene
The Oxford VGG unveils Geo4D: video diffusion for monocular 4D reconstruction. Trained on synthetic data only, yet generalizing strongly to the real world, it predicts point maps, depth & ray maps, setting the new SOTA in dynamic reconstruction (modality-conversion sketch below). Code released.
Review: https://t.ly/X55Uj
Paper: arxiv.org/pdf/2504.07961
Project: geo4d.github.io/
Code: github.com/jzr99/Geo4D
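A small sketch of how the predicted modalities relate: given a per-pixel world-space point map and a camera pose, per-pixel depth is just the z-coordinate after moving points into the camera frame. Shapes and names are assumptions for illustration.

```python
# Depth from a world-space point map, under an assumed camera convention.
import numpy as np

def depth_from_pointmap(points_world, R_wc, t_wc):
    """points_world: (H, W, 3); R_wc, t_wc: world-to-camera rotation/translation."""
    pts_cam = points_world @ R_wc.T + t_wc     # (H, W, 3) in camera coordinates
    return pts_cam[..., 2]                     # depth = z in the camera frame

pm = np.random.rand(4, 4, 3) + np.array([0, 0, 2.0])
print(depth_from_pointmap(pm, np.eye(3), np.zeros(3)).shape)  # (4, 4)
```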
4D Mocap Human-Object
#Adobe unveils HUMOTO, a high-quality dataset of human-object interactions for motion generation, computer vision, and robotics: 700+ sequences (7,875 seconds @ 30 FPS) covering interactions with 63 precisely modeled objects and 72 articulated parts.
Review: https://t.ly/lCof3
Paper: https://lnkd.in/dVVBDd_c
Project: https://lnkd.in/dwBcseDf
PartField: #3D Part Segmentation
#Nvidia unveils PartField, a feed-forward approach for learning part-based 3D features that captures the general concept of parts and their hierarchy. Suitable for single-shape decomposition, co-segmentation, correspondence & more (hierarchy-by-clustering sketch below). Code & models released under the Nvidia license.
Review: https://t.ly/fGb2O
Paper: https://lnkd.in/dGeyKSzG
Code: https://lnkd.in/dbe57XGH
Project: https://lnkd.in/dhEgf7X2
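One way a learned per-point feature field can expose a part hierarchy is agglomerative clustering at different granularities; a sketch of that post-processing idea, not PartField's exact procedure.

```python
# Hierarchy from per-point features via agglomerative clustering (illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(200, 32))        # stand-in for learned 3D features

Z = linkage(point_feats, method='ward')         # feature-space dendrogram
coarse = fcluster(Z, t=4, criterion='maxclust') # e.g. 4 coarse parts
fine = fcluster(Z, t=12, criterion='maxclust')  # 12 finer sub-parts
print(len(set(coarse)), len(set(fine)))         # nested part granularities
```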
UniAnimate-DiT: Human Animation
UniAnimate-DiT is a novel and effective framework, built on Wan2.1, for consistent human image animation. LoRAs fine-tune the model parameters, reducing memory while preserving the original model's generative ability (freeze-and-train sketch below). Training and inference code released.
Review: https://t.ly/1I50N
Paper: https://arxiv.org/pdf/2504.11289
Repo: https://github.com/ali-vilab/UniAnimate-DiT
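Why LoRA fine-tuning reduces memory, in miniature: freeze the backbone so no optimizer state is kept for it, and give the optimizer only the adapter parameters. The backbone below is a stand-in, not Wan2.1.

```python
# Freeze-and-train bookkeeping behind LoRA's memory savings (illustrative).
import torch
import torch.nn as nn
from torch.optim import AdamW

model = nn.TransformerEncoderLayer(d_model=256, nhead=8)   # stand-in backbone
for p in model.parameters():
    p.requires_grad = False                  # frozen: no Adam moments kept

lora_params = [nn.Parameter(torch.zeros(256, 8)) for _ in range(4)]  # adapters
opt = AdamW(lora_params, lr=1e-4)            # optimizer state only for adapters
print(sum(p.numel() for p in lora_params))   # 8192 trainable params vs. millions
```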
GATE3D: General Attention-Based 3D Detection
GATE3D is a novel framework designed specifically for generalized monocular 3D object detection via weak supervision. It bridges domain gaps by employing consistency losses between 2D and 3D predictions (minimal example below).
Review: https://t.ly/O7wqH
Paper: https://lnkd.in/dc5VTUj9
Project: https://lnkd.in/dzrt-qQV
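A minimal example of a 2D-3D consistency loss of the kind described: project the predicted 3D box center through the camera intrinsics and penalize its distance to the predicted 2D center. Box corners, dimensions, and robust weighting are simplified away; this is my illustration, not the paper's loss.

```python
# Simplified 2D-3D center consistency loss (illustrative).
import torch

def center_consistency_loss(center_3d, center_2d, K):
    """center_3d: (B, 3) camera-frame points; center_2d: (B, 2) pixels; K: (3, 3)."""
    proj = center_3d @ K.T                       # (B, 3) homogeneous pixels
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    return torch.nn.functional.l1_loss(proj, center_2d)

K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
c3d = torch.tensor([[0.5, 0.2, 5.0]])           # a point 5 m in front of the camera
c2d = torch.tensor([[370., 260.]])
print(center_consistency_loss(c3d, c2d, K))     # tensor(0.) -- predictions agree
```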
Event Blurry Super-Resolution
USTC unveils Ev-DeblurVSR, a novel event-enhanced network that feeds event-camera signals into Blurry Video Super-Resolution (BVSR), the task of generating HR videos from low-resolution, blurry inputs (event-voxelization sketch below). Pretrained models and test code released under Apache.
Review: https://t.ly/x6hRs
Paper: https://lnkd.in/dzbkCJMh
Repo: https://lnkd.in/dmvsc-yS
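For context, the standard preprocessing that makes an event stream consumable by a CNN is binning events into a spatio-temporal voxel grid; a sketch is below. How Ev-DeblurVSR then fuses this with blurry frames is the paper's contribution.

```python
# Standard event-to-voxel-grid preprocessing (generic, not paper-specific).
import numpy as np

def events_to_voxel(xs, ys, ts, ps, H, W, bins=5):
    """xs, ys: pixel coords; ts: timestamps; ps: polarity in {-1, +1}."""
    grid = np.zeros((bins, H, W), dtype=np.float32)
    t = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9)   # normalize to [0, 1]
    b = np.clip((t * bins).astype(int), 0, bins - 1)       # temporal bin index
    np.add.at(grid, (b, ys, xs), ps)                       # accumulate polarity
    return grid

rng = np.random.default_rng(0)
n = 1000
vox = events_to_voxel(rng.integers(0, 64, n), rng.integers(0, 48, n),
                      np.sort(rng.random(n)), rng.choice([-1, 1], n), 48, 64)
print(vox.shape)  # (5, 48, 64)
```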
#Apple Co-Motion is out!
Apple unveils a novel approach for detecting & tracking detailed 3D poses of multiple people from a single monocular stream: temporally coherent predictions in crowded scenes with hard poses & occlusions. New SOTA, 10x faster! Code & models released for research only.
Review: https://t.ly/-86CO
Paper: https://lnkd.in/dQsVGY7q
Repo: https://lnkd.in/dh7j7N89
TAP in Persistent 3D Geometry
TAPIP3D is the new SOTA for long-term 3D point tracking in monocular RGB/RGB-D. It represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth & motion to lift 2D video features into a 3D world space where camera motion is effectively canceled (unprojection sketch below). Code under Apache.
Review: https://t.ly/oooMy
Paper: https://lnkd.in/d8uqjdE4
Project: https://tapip3d.github.io/
Repo: https://lnkd.in/dsvHP_8u
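The geometric core of lifting 2D features into a camera-stabilized world space is standard: unproject pixels with depth and intrinsics, then apply the camera-to-world pose so all frames share one static frame. A sketch with assumed shapes, not TAPIP3D's code.

```python
# Standard pixel unprojection into a shared world frame (generic geometry).
import numpy as np

def lift_to_world(uv, depth, K, R_cw, t_cw):
    """uv: (N, 2) pixels; depth: (N,); K: (3, 3); R_cw, t_cw: camera-to-world."""
    uv1 = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)   # homogeneous
    rays = uv1 @ np.linalg.inv(K).T                             # camera-frame rays
    pts_cam = rays * depth[:, None]                             # scale by depth
    return pts_cam @ R_cw.T + t_cw                              # into world frame

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts = lift_to_world(np.array([[320., 240.], [400., 200.]]),
                    np.array([2.0, 3.0]), K, np.eye(3), np.zeros(3))
print(pts)  # the principal-axis pixel lands at (0, 0, 2) under this identity pose
```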
#Nvidia Describe Anything
Nvidia unveils the Describe Anything Model (DAM), the new SOTA for generating detailed descriptions of user-specified regions in images/videos, marked by points, boxes, scribbles, or masks. Repo under Apache, dataset available, and live demo on Hugging Face.
Review: https://t.ly/la4JD
Paper: https://lnkd.in/dZh82xtV
Project: https://lnkd.in/dcv9V2ZF
Repo: https://lnkd.in/dJB9Ehtb
Demo: https://lnkd.in/dXDb2MWU
Moving Points -> Depth
KAIST & Adobe propose Seurat, a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of 2D trajectories from off-the-shelf point-tracking models (toy cue sketched below). Repo & demo to be released.
Review: https://t.ly/qA2P5
Paper: https://lnkd.in/dpXDaQtM
Project: https://lnkd.in/d9qWYsjP
Repo: https://lnkd.in/dZEMDiJh
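A toy version of one depth cue recoverable from 2D tracks alone: under forward camera motion, the projected spread of a nearby neighborhood of points grows faster than that of a distant one. Seurat learns such cues rather than hand-coding them; this is purely illustrative.

```python
# Toy track-expansion depth cue (hand-coded stand-in for a learned cue).
import numpy as np

def relative_depth_cue(tracks):
    """tracks: (N, T, 2). Expansion rate of the neighborhood's projected spread:
    larger expansion => closer to the camera (under forward motion)."""
    spread_t0 = np.linalg.norm(tracks[:, 0] - tracks[:, 0].mean(0), axis=-1).mean()
    spread_t1 = np.linalg.norm(tracks[:, -1] - tracks[:, -1].mean(0), axis=-1).mean()
    return spread_t1 / (spread_t0 + 1e-8)     # > 1: approaching / near

rng = np.random.default_rng(0)
base = rng.normal(0, 10, (20, 1, 2))                        # base point pattern
tracks_near = np.concatenate([base, base * 1.5], axis=1)    # strong expansion
tracks_far = np.concatenate([base, base * 1.05], axis=1)    # weak expansion
print(relative_depth_cue(tracks_near), relative_depth_cue(tracks_far))  # ~1.5 ~1.05
```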