AI with Papers - Artificial Intelligence & Deep Learning
15K subscribers
95 photos
235 videos
11 files
1.26K links
All the AI with papers. Every day fresh updates on Deep Learning, Machine Learning, and Computer Vision (with Papers).

Curated by Alessandro Ferrari | https://www.linkedin.com/in/visionarynet/
🦑 Hyper-Detailed Image Descriptions 🦑

👉#Google unveils ImageInWords (IIW), a carefully designed human-in-the-loop (HIL) annotation framework for curating hyper-detailed image descriptions, and the new dataset resulting from this process.

👉Review https://t.ly/engkl
👉Paper arxiv.org/pdf/2405.02793
👉Repo github.com/google/imageinwords
👉Project google.github.io/imageinwords
👉Data huggingface.co/datasets/google/imageinwords
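For a quick look at the data, a minimal loading sketch with the Hugging Face datasets library; the config name "IIW-400" and the field names are assumptions taken from the dataset card:

```python
from datasets import load_dataset

# Assumption: "IIW-400" is one of the configs listed on the dataset card.
ds = load_dataset("google/imageinwords", name="IIW-400", trust_remote_code=True)

split = next(iter(ds))   # take whatever split this config exposes
sample = ds[split][0]

# Field names are assumptions -- inspect sample.keys() for the real schema.
print(sample.get("image/key"))  # image identifier
print(sample.get("IIW"))        # hyper-detailed, human-curated description
```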
โค11๐Ÿ”ฅ3๐Ÿ‘2๐Ÿคฏ2๐Ÿพ1
🔫 Free-Moving Reconstruction 🔫

👉EPFL (+#MagicLeap) unveils a novel approach for reconstructing a free-moving object from a monocular RGB clip. It handles free interaction with objects in front of a moving camera, relies on no priors, and optimizes the sequence globally without splitting it into segments. Great, but no code announced 🥺

👉Review https://t.ly/2xhtj
👉Paper arxiv.org/pdf/2405.05858
👉Project haixinshi.github.io/fmov/
๐Ÿ‘6๐Ÿคฏ4โšก1โค1๐Ÿฅฐ1
💥 FeatUp: Any Model at Any Resolution 💥

👉FeatUp is a task- and model-agnostic framework to restore lost spatial information in deep features. It outperforms other methods in class activation map generation, in transfer learning for segmentation and depth estimation, and in end-to-end training for semantic segmentation. Source Code released 💙

👉Review https://t.ly/Evq_g
👉Paper https://lnkd.in/gweaN4s6
👉Project https://lnkd.in/gWcGXdxt
👉Code https://lnkd.in/gweq5NY4
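A minimal usage sketch following the repo README; the torch.hub entry point and the "dino16" backbone name come from there and may change:

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Load a FeatUp upsampler wrapping a DINO ViT-S/16 backbone (per the README).
upsampler = torch.hub.load("mhamilton723/FeatUp", "dino16", use_norm=True).eval()

transform = T.Compose([T.Resize(224), T.CenterCrop(224), T.ToTensor()])
image = transform(Image.open("input.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    hr_feats = upsampler(image)        # features restored to image resolution
    lr_feats = upsampler.model(image)  # the backbone's original low-res features

print(lr_feats.shape, "->", hr_feats.shape)
```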
๐ŸAniTalker: Universal Talking Humans๐Ÿ

👉SJTU (+AISpeech) unveils AniTalker, a framework that transforms a single static portrait and input audio into animated talking videos with naturally flowing movements.

👉Review https://t.ly/MD4yX
👉Paper https://arxiv.org/pdf/2405.03121
👉Project https://x-lance.github.io/AniTalker/
👉Repo https://github.com/X-LANCE/AniTalker
👻 3D Human Motion from Text 👻

👉Zhejiang University (+ANT) unveils a novel method to generate human motions containing accurate human-object interactions in 3D scenes, based on textual descriptions. Code announced, coming 💙

👉Review https://t.ly/eOZnU
👉Paper https://arxiv.org/pdf/2405.07784
👉Project https://zju3dv.github.io/text_scene_motion/
๐Ÿ‘3๐Ÿ”ฅ2โค1
🪬 UHM: Authentic Hand by Phone 🪬

👉META unveils UHM, a novel 3D high-fidelity avatarization of your own hand (yes, yours). An adaptation pipeline fits the pre-trained UHM via a phone scan. Source Code released 💙

👉Review https://t.ly/fU5rA
👉Paper https://lnkd.in/dyGaiAnq
👉Code https://lnkd.in/d9B_XFAA
๐Ÿ‘4โค1๐Ÿ”ฅ1๐Ÿคฏ1
🔥 EfficientTrain++: Efficient Foundation Visual Backbone Training 🔥

👉Tsinghua unveils EfficientTrain++, a simple, general, surprisingly effective, off-the-shelf approach to reduce the training time of various popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, CSWin, and CAFormer). Up to 3.0× faster training on ImageNet-1K/22K without sacrificing accuracy. Source Code released 💙

👉Review https://t.ly/D8ttv
👉Paper https://arxiv.org/pdf/2405.08768
👉Code https://github.com/LeapLabTHU/EfficientTrain
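The core trick is a curriculum that shows the model low-resolution (low-frequency) inputs early in training and the full resolution only later. A toy PyTorch sketch of that idea; the schedule below is illustrative, not the paper's exact one:

```python
import torch
import torch.nn.functional as F

def input_resolution(epoch: int, total_epochs: int,
                     min_res: int = 160, max_res: int = 224) -> int:
    """Linearly grow the input resolution over training (illustrative schedule)."""
    frac = epoch / max(total_epochs - 1, 1)
    res = int(min_res + frac * (max_res - min_res))
    return (res // 32) * 32  # keep it divisible by the backbone stride

def curriculum_batch(images: torch.Tensor, epoch: int, total_epochs: int) -> torch.Tensor:
    """Down-sample a batch to the resolution scheduled for this epoch."""
    res = input_resolution(epoch, total_epochs)
    if images.shape[-1] != res:
        images = F.interpolate(images, size=(res, res),
                               mode="bilinear", align_corners=False)
    return images

# Inside a standard training loop:
# for epoch in range(total_epochs):
#     for images, labels in loader:
#         loss = criterion(model(curriculum_batch(images, epoch, total_epochs)), labels)
```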
๐Ÿ‘9๐Ÿ”ฅ3๐Ÿคฏ3โค2๐Ÿฅฐ1
🫀 EchoTracker: Tracking Echocardiography 🫀

👉EchoTracker: a two-fold, coarse-to-fine model that facilitates tracking of queried points on a tissue surface across ultrasound image sequences. Source Code released 💙

👉Review https://t.ly/NyBe0
👉Paper https://arxiv.org/pdf/2405.08587
👉Code https://github.com/riponazad/echotracker/
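To make "queried points" concrete, a hypothetical I/O sketch: queries are (frame, x, y) coordinates and the tracker returns per-frame positions. The class and method names here are assumptions, not the repo's actual API:

```python
import torch

# Dummy echocardiography clip: (batch, frames, channels, height, width).
video = torch.randn(1, 32, 1, 256, 256)

# Query points on the tissue surface, each as (frame_idx, x, y).
queries = torch.tensor([[[0, 120.0, 130.0],
                         [0, 140.0, 150.0]]])

# Hypothetical call -- see github.com/riponazad/echotracker for the real API:
# tracker = EchoTracker.from_pretrained("weights.pt")
# tracks = tracker(video, queries)  # (batch, frames, num_points, 2) as (x, y)
```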
โค15๐Ÿ‘1๐Ÿฅฐ1
🦕 Grounding DINO 1.5 Pro/Edge 🦕

👉Grounding DINO 1.5: a suite of advanced open-set object detection models that pushes the "Edge" of open-set detection. Source Code released under Apache 2.0 💙

👉Review https://t.ly/kS-og
👉Paper https://lnkd.in/dNakMge2
👉Code https://lnkd.in/djhnQmrm
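The 1.5 models ship through the linked repo; as a stand-in to illustrate text-prompted open-set detection, here is the earlier Grounding DINO checkpoint via Hugging Face transformers (model ID and thresholds are assumptions):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Stand-in: Grounding DINO 1.0 from transformers, not the 1.5 release itself.
model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("street.jpg").convert("RGB")
text = "a person. a bicycle."  # open-set classes as a lower-case, dot-separated prompt

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
for box, label in zip(results["boxes"], results["labels"]):
    print(label, [round(v, 1) for v in box.tolist()])
```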
⚽ 3D Shot Posture in Broadcast ⚽

👉Nagoya University unveils 3DSP, built from soccer broadcast videos: the most extensive sports image dataset with 2D pose annotations to date.

👉Review https://t.ly/IIMeZ
👉Paper https://arxiv.org/pdf/2405.12070
👉Code https://github.com/calvinyeungck/3D-Shot-Posture-Dataset/tree/master
๐Ÿ–ผ๏ธ Diffusive Images that Sound ๐Ÿ–ผ๏ธ

👉The University of Michigan unveils a diffusion model able to generate spectrograms that look like images but can also be played as sound.

👉Review https://t.ly/ADtYM
👉Paper arxiv.org/pdf/2405.12221
👉Project ificl.github.io/images-that-sound
👉Code github.com/IFICL/images-that-sound
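To "play" such a spectrogram image you invert the magnitude spectrogram back to a waveform. A generic sketch with Griffin-Lim from torchaudio; the authors use a pretrained vocoder, so this is only a stand-in for the inversion step:

```python
import torch
import torchaudio

# A grayscale "image" interpreted as a magnitude spectrogram:
# (channels, freq_bins, time_frames), with freq_bins = n_fft // 2 + 1.
n_fft = 1024
spec_image = torch.rand(1, n_fft // 2 + 1, 512)  # dummy stand-in for a generated sample

# power=1.0 tells Griffin-Lim the input is magnitude, not power.
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=256, power=1.0)
waveform = griffin_lim(spec_image)               # (channels, samples)
torchaudio.save("played_image.wav", waveform, sample_rate=16000)
```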
👚 ViViD: Diffusion VTON 👚

👉ViViD is a novel framework that employs powerful diffusion models to tackle video virtual try-on (VTON). Code announced, not released yet 😢

👉Review https://t.ly/h_SyP
👉Paper arxiv.org/pdf/2405.11794
👉Repo https://lnkd.in/dT4_bzPw
👉Project https://lnkd.in/dCK5ug4v
๐Ÿ€OmniGlue: Foundation Matcher๐Ÿ€

👉#Google presents OmniGlue at #CVPR24: the first learnable image matcher powered by foundation models. Impressive out-of-distribution (OOD) results!

👉Review https://t.ly/ezaIc
👉Paper https://arxiv.org/pdf/2405.12979
👉Project hwjiang1510.github.io/OmniGlue/
👉Code https://github.com/google-research/omniglue/
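A usage sketch following the repo README; the checkpoint paths are the ones the README has you download, so treat them as placeholders:

```python
import numpy as np
from PIL import Image
import omniglue  # installed from github.com/google-research/omniglue

og = omniglue.OmniGlue(
    og_export="./models/og_export",                    # OmniGlue matcher weights
    sp_export="./models/sp_v6",                        # SuperPoint keypoint extractor
    dino_export="./models/dinov2_vitb14_pretrain.pth", # DINOv2 foundation features
)

image0 = np.array(Image.open("view0.jpg").convert("RGB"))
image1 = np.array(Image.open("view1.jpg").convert("RGB"))

match_kp0, match_kp1, confidences = og.FindMatches(image0, image1)
print(f"{len(match_kp0)} matches found")
```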
🔥 YOLOv10 is out 🔥

👉YOLOv10: novel real-time, end-to-end object detection. Code released under GNU AGPL v3.0 💙

👉Review https://shorturl.at/ZIHBh
👉Paper arxiv.org/pdf/2405.14458
👉Code https://github.com/THU-MIG/yolov10/
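The repo is an Ultralytics fork, so inference follows the familiar Ultralytics pattern; the class name and weight file below are taken from its README and should be double-checked:

```python
from ultralytics import YOLOv10  # exposed by the THU-MIG fork, not stock ultralytics

model = YOLOv10("yolov10n.pt")            # nano variant; s/m/b/l/x were also released
results = model.predict("street.jpg", conf=0.25)
results[0].show()  # the end-to-end head needs no NMS at inference
```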
โ›ˆ๏ธUnsupervised Neuromorphic Motionโ›ˆ๏ธ

👉Western Sydney University unveils a novel unsupervised event-based motion segmentation algorithm, employing the #Prophesee Gen4 HD event camera.

👉Review https://t.ly/UZzIZ
👉Paper arxiv.org/pdf/2405.15209
👉Project samiarja.github.io/evairborne
👉Repo (empty) github.com/samiarja/ev_deep_motion_segmentation
๐Ÿ‘5๐Ÿ”ฅ1๐Ÿฅฐ1๐Ÿ‘1
🦓 Zero-Shot Diffusive Segmentation 🦓

👉KAUST (+MPI) announced the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models. Source Code released under MIT 💙

👉Review https://t.ly/v_64K
👉Paper arxiv.org/pdf/2405.16947
👉Project https://lnkd.in/dcSt4dQx
👉Code https://lnkd.in/dcZfM8F3
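The gist: per-pixel features from a frozen pre-trained model are grouped into segments with no segmentation-specific training. A generic illustration of that clustering step on dummy features; the paper's actual pipeline uses diffusion features plus temporal correspondences:

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

# Dummy per-pixel features, e.g. from a frozen diffusion UNet: (B, C, h, w).
feats = F.normalize(torch.randn(1, 256, 32, 32), dim=1)
flat = feats[0].permute(1, 2, 0).reshape(-1, 256).numpy()  # (h*w, C)

# Cluster pixels into k segments -- zero-shot, nothing is trained.
labels = KMeans(n_clusters=5, n_init="auto").fit_predict(flat)
segmentation = labels.reshape(32, 32)  # coarse per-pixel segment ids
```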
🪰 Dynamic Gaussian Fusion via 4D Motion Scaffolds 🪰

👉MoSca: a novel 4D Motion Scaffold representation to reconstruct and synthesize novel views of dynamic scenes from monocular videos in the wild!

👉Review https://t.ly/nSdEL
👉Paper arxiv.org/pdf/2405.17421
👉Code github.com/JiahuiLei/MoSca
👉Project https://lnkd.in/dkjMVcqZ
🧤 Transformer-based 4D Hands 🧤

👉4DHands is a novel, robust approach to recovering interactive hand meshes and their relative movement from monocular inputs. Authors: Beijing NU, Tsinghua & Lenovo. No code announced 😢

👉Review https://t.ly/wvG-l
👉Paper arxiv.org/pdf/2405.20330
👉Project 4dhands.github.io/
🎭 New 2D Landmarks SOTA 🎭

👉Flawless AI unveils FaceLift, a novel semi-supervised approach that learns 3D landmarks by directly lifting (visible) hand-labeled 2D landmarks, ensuring better alignment with the landmark definitions, with no need for 3D landmark datasets. No code announced 🥹

👉Review https://t.ly/lew9a
👉Paper arxiv.org/pdf/2405.19646
👉Project davidcferman.github.io/FaceLift
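With no code announced, here is only a toy illustration of the "lifting" idea: a small MLP regressing 3D landmark positions from 2D ones. This is not FaceLift's architecture or training setup:

```python
import torch
import torch.nn as nn

class Lifter(nn.Module):
    """Toy 2D-to-3D landmark lifter (illustrative only)."""
    def __init__(self, n_landmarks: int = 68):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 2, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_landmarks * 3),
        )

    def forward(self, lm2d: torch.Tensor) -> torch.Tensor:
        # (B, N, 2) hand-labeled 2D landmarks -> (B, N, 3) lifted 3D landmarks
        b, n, _ = lm2d.shape
        return self.net(lm2d.reshape(b, -1)).reshape(b, n, 3)

lm3d = Lifter()(torch.rand(4, 68, 2))
print(lm3d.shape)  # torch.Size([4, 68, 3])
```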
๐Ÿณ MultiPly: in-the-wild Multi-People ๐Ÿณ

👉MultiPly: a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. It's the new SOTA on publicly available datasets and on in-the-wild videos. Source Code announced, coming 💙

👉Review https://t.ly/_xjk_
👉Paper arxiv.org/pdf/2406.01595
👉Project eth-ait.github.io/MultiPly
👉Repo github.com/eth-ait/MultiPly