AI with Papers - Artificial Intelligence & Deep Learning
All the AI, with papers. Fresh daily updates on Deep Learning, Machine Learning, and Computer Vision (with papers).

Curated by Alessandro Ferrari | https://www.linkedin.com/in/visionarynet/
🪞Robo-Emulation via Video Imitation🪞

👉OKAMI (UT & #Nvidia) is a novel foundation method that generates a manipulation plan from a single RGB-D video and derives a policy for execution.

👉Review https://t.ly/_N29-
👉Paper arxiv.org/pdf/2410.11792
👉Project https://lnkd.in/d6bHF_-s
๐Ÿ‘4๐Ÿคฏ2๐Ÿ”ฅ1
🔥 "Nuclear" AI vs. Hyper-Cheap Inference 🔥

⭐ What do you expect in 2025 after the #Nvidia announcements at CES 2025? Feel free to comment :)
Anonymous Poll
23% 🤲Portable Training Workstation
35% ⚛️Nuclear energy for AI training
33% 🖲️Cheaper inference-only devices
9% 💰Cloud-intensive inference-only
๐Ÿ‘4โค1๐Ÿ”ฅ1๐Ÿคฏ1๐Ÿคฉ1
This media is not supported in your browser
VIEW IN TELEGRAM
๐Ÿงžโ€โ™‚๏ธOmni-RGPT: SOTA MLLM Understanding๐Ÿงžโ€โ™‚๏ธ

๐Ÿ‘‰ #NVIDIA presents Omni-RGPT, MLLM for region-level comprehension for both images & videos. New SOTA on image/video-based commonsense reasoning.

๐Ÿ‘‰Review https://t.ly/KHnQ7
๐Ÿ‘‰Paper arxiv.org/pdf/2501.08326
๐Ÿ‘‰Project miranheo.github.io/omni-rgpt/
๐Ÿ‘‰Repo TBA soon
๐Ÿ”ฅ10โค3๐Ÿพ2โšก1๐Ÿ‘1๐Ÿ‘1
This media is not supported in your browser
VIEW IN TELEGRAM
🌈 #Nvidia Foundation ZS-Stereo 🌈

👉Nvidia unveils FoundationStereo, a foundation model for stereo depth estimation with strong zero-shot generalization, together with a large-scale (1M stereo pairs) synthetic training dataset featuring high diversity and photorealism. Code, model & dataset to be released💙

👉Review https://t.ly/rfBr5
👉Paper arxiv.org/pdf/2501.09898
👉Project nvlabs.github.io/FoundationStereo/
👉Repo github.com/NVlabs/FoundationStereo/tree/master
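Stereo-matching networks typically predict a disparity map for a rectified image pair; turning disparity into metric depth uses the standard pinhole relation Z = f·B/d (focal length f in pixels, baseline B in meters, disparity d in pixels). A minimal sketch, assuming a rectified pair and known calibration; the function names are hypothetical and not part of the FoundationStereo repo:

```python
from typing import List, Optional


def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth in meters from disparity in pixels: Z = f * B / d (rectified pair assumed)."""
    if disparity_px <= 0:
        # Zero disparity means a point at infinity; negative values are invalid.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_map(disparities: List[List[float]], focal_px: float,
              baseline_m: float) -> List[List[Optional[float]]]:
    """Convert a whole disparity map; non-positive (invalid) pixels become None."""
    return [[focal_px * baseline_m / d if d > 0 else None for d in row]
            for row in disparities]


# Example: f = 700 px, B = 0.12 m, d = 35 px gives Z = 84 / 35, about 2.4 m.
print(disparity_to_depth(35.0, 700.0, 0.12))
```

Because depth is inversely proportional to disparity, a fixed disparity error turns into a depth error that grows with the square of the distance, which is why far-range stereo is the hard regime.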
โค6๐Ÿ”ฅ6๐Ÿคฉ1
This media is not supported in your browser
VIEW IN TELEGRAM
🥛HAMSTER: Hierarchical VLA Manipulation🥛

👉#Nvidia unveils HAMSTER, a novel hierarchical VLA architecture that enables robotic manipulation with semantic, visual & geometric generalization, trained on easy-to-collect, off-domain data. Source code announced💙

👉Review https://t.ly/2yXaY
👉Paper https://arxiv.org/pdf/2502.05485
👉Project https://hamster-robot.github.io/
👉Repo TBA
๐Ÿ”ฅ4โค1
This media is not supported in your browser
VIEW IN TELEGRAM
🌈Unified Low-Level 4D Vision🌈

👉#Nvidia L4P is a novel feedforward, general-purpose architecture that solves low-level 4D perception tasks in a unified framework. L4P combines a ViT-based backbone with lightweight per-task heads that do not require extensive training. One backbone, many SOTAs. Code announced 💙

👉Review https://t.ly/04DGj
👉Paper arxiv.org/pdf/2502.13078
👉Project research.nvidia.com/labs/lpr/l4p/
👉Repo TBA
๐Ÿ”ฅ5๐Ÿ‘2๐Ÿคฏ1๐Ÿคฉ1
This media is not supported in your browser
VIEW IN TELEGRAM
👽Neural-Free Sparse Voxels Rasterization👽

👉#Nvidia unveils a novel, efficient radiance-field rendering algorithm that rasterizes adaptive sparse voxels without neural networks or 3D Gaussians. Code released (custom license)💙

👉Review https://t.ly/Nh_ic
👉Paper https://lnkd.in/g8k8Zs6R
👉Project https://lnkd.in/gR-bD4Wx
👉Repo https://lnkd.in/gNHX-w4t
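Radiance-field renderers, whether built on Gaussians or on voxels, usually produce a pixel color by alpha-compositing primitives in depth order: C = Σ_i T_i·α_i·c_i, with transmittance T_i = Π_{j<i} (1 - α_j). A minimal per-pixel sketch, assuming each voxel hit by the pixel's ray has already been reduced to a (depth, alpha, color) sample; the paper's actual rasterizer (voxel sorting, in-voxel sampling, GPU kernels) is not reproduced, and the names are hypothetical:

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
Sample = Tuple[float, float, RGB]  # (depth, alpha, color) for one voxel on the ray


def composite_front_to_back(samples: List[Sample], early_stop: float = 1e-4) -> RGB:
    """Blend voxel samples nearest-first: C = sum_i T_i * alpha_i * c_i."""
    transmittance = 1.0  # fraction of light still unblocked in front of the current voxel
    out = [0.0, 0.0, 0.0]
    for _, alpha, color in sorted(samples, key=lambda s: s[0]):  # nearest voxel first
        weight = transmittance * alpha
        for k in range(3):
            out[k] += weight * color[k]
        transmittance *= 1.0 - alpha
        if transmittance < early_stop:  # everything behind is effectively occluded
            break
    return (out[0], out[1], out[2])


# A half-transparent red voxel in front of a half-transparent blue one.
print(composite_front_to_back([(2.0, 0.5, (0.0, 0.0, 1.0)), (1.0, 0.5, (1.0, 0.0, 0.0))]))
# (0.5, 0.0, 0.25)
```

Early termination is the usual reason front-to-back order is preferred: once transmittance is near zero, the remaining voxels can be skipped.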
๐Ÿ”ฅ14๐Ÿ‘4๐Ÿคฉ1
This media is not supported in your browser
VIEW IN TELEGRAM
🙀3D MultiModal Memory🙀

👉M3 is a novel framework by UCSD & #NVIDIA for rendering 3D scenes with RGB & foundation-model embeddings. It achieves rich spatial & semantic understanding via a novel memory system designed to retain multimodal information from videos.

👉Review https://t.ly/OrXZO
👉Paper arxiv.org/pdf/2503.16413
👉Project https://lnkd.in/dXAZ97KH
👉Repo https://lnkd.in/dWvunCET
๐Ÿ”ฅ10โค4๐Ÿ‘1๐Ÿ‘1
🦎 Scaling Vision to 4K🦎

👉PS3 by #Nvidia (+UC Berkeley) scales up CLIP-style vision pre-training to 4K resolution with *near-constant* cost. It encodes a low-resolution (LR) global image and selectively processes only the informative high-resolution (HR) regions. Impressive work. Code/weights & 🤗 announced💙

👉Review https://t.ly/WN479
👉Paper https://lnkd.in/ddWq8UpX
👉Project https://lnkd.in/dMkTY8-k
👉Repo https://lnkd.in/d9YSB6yv
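The selective high-res idea can be sketched as a top-k selection: score a coarse grid of regions from the cheap low-res pass, then crop and encode only the k best regions at full resolution, so the number of HR regions encoded is fixed by k instead of covering the whole frame. A minimal sketch, assuming the per-region scores are already given (how PS3 actually computes saliency, and its crop sizes, are not shown); all names are hypothetical:

```python
import heapq
from typing import List, Tuple

Cell = Tuple[int, int]            # (row, col) in the coarse grid
Box = Tuple[int, int, int, int]   # (top, left, bottom, right) in full-res pixels


def select_hr_regions(scores: List[List[float]], k: int) -> List[Cell]:
    """Pick the k grid cells with the highest (hypothetical, LR-derived) saliency score."""
    cells = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    top = heapq.nlargest(k, cells)
    return sorted((r, c) for _, r, c in top)


def cell_to_box(cell: Cell, image_hw: Tuple[int, int], grid_hw: Tuple[int, int]) -> Box:
    """Map a grid cell back to its pixel box in the full-resolution image."""
    (r, c), (h, w), (gh, gw) = cell, image_hw, grid_hw
    return (r * h // gh, c * w // gw, (r + 1) * h // gh, (c + 1) * w // gw)


# A 2x2 saliency grid over a 4K (2160x3840) frame; only k=2 HR crops get encoded.
scores = [[0.1, 0.9], [0.8, 0.2]]
for cell in select_hr_regions(scores, k=2):
    print(cell, cell_to_box(cell, (2160, 3840), (2, 2)))
# (0, 1) (0, 1920, 1080, 3840)
# (1, 0) (1080, 0, 2160, 1920)
```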
๐Ÿ”ฅ14โค4๐Ÿ‘2๐Ÿ‘1
This media is not supported in your browser
VIEW IN TELEGRAM
๐ŸPartField #3D Part Segmentation๐Ÿ

๐Ÿ‘‰#Nvidia unveils PartField, a FFW approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy. Suitable for single-shape decomposition, co-segm., correspondence & more. Code & Models released under Nvidia License๐Ÿ’™

๐Ÿ‘‰Review https://t.ly/fGb2O
๐Ÿ‘‰Paper https://lnkd.in/dGeyKSzG
๐Ÿ‘‰Code https://lnkd.in/dbe57XGH
๐Ÿ‘‰Project https://lnkd.in/dhEgf7X2
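Turning a per-point feature field into discrete parts is commonly done by clustering the features. The sketch below uses k-means as a simple, swapped-in stand-in (PartField's own decomposition and hierarchy procedure may differ), assuming each 3D point already carries a feature vector; the names are hypothetical:

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def _sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(features: Sequence[Point], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Assign each per-point feature to one of k clusters ("parts") with Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(features), k)]  # init from random points
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k), key=lambda j: _sq_dist(p, centers[j])) for p in features]
        # Update step: move each center to the mean of its members (keep it if empty).
        for j in range(k):
            members = [p for p, lab in zip(features, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Two well-separated groups of (toy, 2-D) point features should land in two parts.
print(kmeans_labels([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2))
```

Here k plays the role of the number of parts; sweeping k, or switching to hierarchical clustering, gives coarser or finer decompositions in the spirit of the part hierarchy the post mentions.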
โค2๐Ÿ”ฅ2๐Ÿคฏ2
This media is not supported in your browser
VIEW IN TELEGRAM
🦧 #Nvidia Describe Anything 🦧

👉Nvidia unveils the Describe Anything Model (DAM), the new SOTA in generating detailed descriptions for user-specified regions in images/videos, marked by points, boxes, scribbles, or masks. Repo under Apache license, dataset available, and live demo on 🤗

👉Review https://t.ly/la4JD
👉Paper https://lnkd.in/dZh82xtV
👉Project https://lnkd.in/dcv9V2ZF
👉Repo https://lnkd.in/dJB9Ehtb
🤗Demo https://lnkd.in/dXDb2MWU
๐Ÿ”ฅ10๐Ÿ‘5โค1
This media is not supported in your browser
VIEW IN TELEGRAM
๐Ÿ#Nvidia Dynamic Pose ๐Ÿ

๐Ÿ‘‰Nvidia unveils DynPose-100K, the largest dataset of dynamic Internet videos annotated with camera poses. Dataset released under Nvidia license๐Ÿ’™

๐Ÿ‘‰Review https://t.ly/wrcb0
๐Ÿ‘‰Paper https://lnkd.in/dycGjAyy
๐Ÿ‘‰Project https://lnkd.in/dDZ2Ej_Q
๐Ÿค—Data https://lnkd.in/d8yUSB7m
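Camera-pose annotations are typically consumed by projecting 3D points into each frame. A minimal pinhole sketch, assuming a world-to-camera pose (R, t) so that x_cam = R·x_world + t, plus intrinsics (fx, fy, cx, cy); DynPose-100K's actual file format and pose convention are not described in the post, so check the dataset card before reusing this. All names are hypothetical:

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Vec3 = Sequence[float]


def project(point_w: Vec3, R: Matrix3, t: Vec3,
            fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a world point to pixel coordinates (u, v) with a pinhole camera."""
    # World -> camera, ASSUMED world-to-camera convention: x_c = R @ x_w + t
    xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division, then apply the intrinsics.
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(project([1.0, 2.0, 4.0], identity, [0.0, 0.0, 0.0], 100.0, 100.0, 320.0, 240.0))
# (345.0, 290.0)
```

If a dataset stores camera-to-world poses instead, invert them first (R becomes its transpose and t becomes -Rᵀt) before using this function.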
๐Ÿ”ฅ4๐Ÿ‘2โค1๐Ÿคฏ1๐Ÿ˜1
This media is not supported in your browser
VIEW IN TELEGRAM
๐Ÿงžโ€โ™€๏ธGENMO: Generalist Human Motion ๐Ÿงžโ€โ™€๏ธ

๐Ÿ‘‰#Nvidia presents GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Conditioning on videos, 2D keypoints, text, music, and 3D keyframes. No code at the moment๐Ÿฅฒ

๐Ÿ‘‰Review https://t.ly/Q5T_Y
๐Ÿ‘‰Paper https://lnkd.in/ds36BY49
๐Ÿ‘‰Project https://lnkd.in/dAYHhuFU
๐Ÿ”ฅ13โค3๐Ÿ‘2๐Ÿ˜ข1๐Ÿ˜1
This media is not supported in your browser
VIEW IN TELEGRAM
🧤Diffusive Hand from Signs🧤

👉LIGM + #NVIDIA unveil a novel generative model of 3D hand motions learned from sign language data, capturing motion characteristics such as handshapes, locations, and finger, hand & arm movements. Code, models & data to be released 💙

👉Review https://t.ly/HonX_
👉Paper https://arxiv.org/pdf/2508.15902
👉Project https://imagine.enpc.fr/~leore.bensabath/HandMDM/
👉Data drive.google.com/drive/u/1/folders/1BLsu2hAqhAJ_gnGb9TNXW7MLiSuSEzEj
👉Repo TBA
๐Ÿ‘3๐Ÿ”ฅ2