OFA-Sys/ONE-PEACE
A general representation model across vision, audio, and language modalities.
Language: Python
#audio_language #foundation_models #multimodal #representation_learning #vision_language
Stars: 185 Issues: 2 Forks: 5
https://github.com/OFA-Sys/ONE-PEACE
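ONE-PEACE maps text, images, and audio into one shared embedding space, so cross-modal retrieval reduces to cosine similarity between feature vectors. A minimal sketch of that idea follows; the model object and its encode_* methods are placeholders I am assuming, not the repository's actual loading API (see its README for real usage).

```python
# Minimal sketch of cross-modal retrieval in a shared embedding space, in the spirit
# of ONE-PEACE. The model object and its encode_* methods are assumed placeholders;
# see the repository README for the actual loading and feature-extraction calls.
import torch
import torch.nn.functional as F

def rank_images_by_text(model, query: str, image_paths: list[str]):
    """Embed one text query and several images, then rank images by cosine similarity."""
    with torch.no_grad():
        text_emb = model.encode_text([query])          # assumed API, shape (1, d)
        image_emb = model.encode_images(image_paths)   # assumed API, shape (n, d)
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    scores = (image_emb @ text_emb.T).squeeze(-1)      # cosine similarity per image
    order = scores.argsort(descending=True)
    return [(image_paths[i], scores[i].item()) for i in order]
```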
google/break-a-scene
Official implementation for "Break-A-Scene: Extracting Multiple Concepts from a Single Image" [SIGGRAPH Asia 2023]
Language: Python
#deep_learning #diffusion_models #generative_ai #multimodal #text_to_image
Stars: 164 Issues: 1 Forks: 4
https://github.com/google/break-a-scene
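Break-A-Scene learns a separate text token for each concept found in a single source image; once those embeddings exist, they can be recombined in new prompts. The sketch below shows only the recombination step using 🤗 Diffusers' textual-inversion loader; the embedding file paths and the `<cat>`/`<vase>` tokens are hypothetical, and the repository's own scripts handle the actual concept extraction (which also touches model weights, not just embeddings).

```python
# Hedged sketch: composing learned per-concept tokens in a new prompt with 🤗 Diffusers.
# Break-A-Scene ships its own training/inference scripts; the file names and tokens
# ("<cat>", "<vase>") here are hypothetical placeholders, not outputs of this repo.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load embeddings for two concepts extracted from one source image (paths are placeholders).
pipe.load_textual_inversion("concepts/cat.bin", token="<cat>")
pipe.load_textual_inversion("concepts/vase.bin", token="<vase>")

# Recombine the extracted concepts in a novel scene.
image = pipe("a photo of <cat> sitting next to <vase> on a beach",
             num_inference_steps=30).images[0]
image.save("recombined_scene.png")
```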
lxe/llavavision
A simple "Be My Eyes" web app with a llama.cpp/llava backend
Language: JavaScript
#ai #artificial_intelligence #computer_vision #llama #llamacpp #llm #local_llm #machine_learning #multimodal #webapp
Stars: 284 Issues: 0 Forks: 7
https://github.com/lxe/llavavision
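Under the hood the web app captures a photo in the browser and forwards it to a local llama.cpp server running a LLaVA model, which returns a short description. A rough sketch of that round trip is below; the /completion payload fields (image_data, the [img-10] placeholder) reflect the llama.cpp server's multimodal API as of the LLaVA integration and may differ in newer builds, so treat them as assumptions.

```python
# Hedged sketch of calling a local llama.cpp server (started with a LLaVA model)
# to describe an image -- roughly what a "Be My Eyes"-style app does behind the scenes.
# The payload schema is assumed from the llava-era llama.cpp server; verify against your build.
import base64
import requests

def describe_image(path: str, server: str = "http://127.0.0.1:8080") -> str:
    with open(path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "prompt": "USER: [img-10] Describe what is in front of me in one sentence.\nASSISTANT:",
        "image_data": [{"data": img_b64, "id": 10}],
        "n_predict": 128,
        "temperature": 0.1,
    }
    resp = requests.post(f"{server}/completion", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["content"]

print(describe_image("photo.jpg"))
```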
A simple "Be My Eyes" web app with a llama.cpp/llava backend
Language: JavaScript
#ai #artificial_intelligence #computer_vision #llama #llamacpp #llm #local_llm #machine_learning #multimodal #webapp
Stars: 284 Issues: 0 Forks: 7
https://github.com/lxe/llavavision
GitHub
GitHub - lxe/llavavision: A simple "Be My Eyes" web app with a llama.cpp/llava backend
A simple "Be My Eyes" web app with a llama.cpp/llava backend - lxe/llavavision
LLaVA-VL/LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Language: Python
#agent #large_language_models #large_multimodal_models #multimodal_large_language_models #tool_use
Stars: 213 Issues: 7 Forks: 13
https://github.com/LLaVA-VL/LLaVA-Plus-Codebase
YangLing0818/RPG-DiffusionMaster
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
Language: Python
#image_editing #large_language_models #multimodal_large_language_models #text_to_image_diffusion
Stars: 272 Issues: 5 Forks: 14
https://github.com/YangLing0818/RPG-DiffusionMaster
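The RPG recipe is: have a multimodal LLM recaption the user prompt, plan the canvas as a set of regions with detailed subprompts, and let a diffusion backend compose those regions. The sketch below only illustrates the shape of such a plan; the dataclass and the hard-coded planner are stand-ins for the MLLM call, not the repository's interface.

```python
# Hedged sketch of the recaption-plan-generate idea: an (M)LLM turns a short prompt
# into per-region subprompts, which a region-controlled diffusion backend then composes.
# Everything below is illustrative, not RPG-DiffusionMaster's actual code.
from dataclasses import dataclass

@dataclass
class RegionPlan:
    bbox: tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)
    subprompt: str                            # detailed recaption for this region

def plan_from_prompt(prompt: str) -> list[RegionPlan]:
    """Stand-in for the MLLM planning step: split the canvas and recaption each part."""
    # A real planner would query a multimodal LLM; here we hard-code a two-region plan.
    return [
        RegionPlan((0.0, 0.0, 0.5, 1.0), f"{prompt}, left subject, rich detail"),
        RegionPlan((0.5, 0.0, 1.0, 1.0), f"{prompt}, right subject, rich detail"),
    ]

plan = plan_from_prompt("a green-haired girl and a red-haired boy in a cafe")
for region in plan:
    print(region.bbox, "->", region.subprompt)
# The plan would then drive region-controlled diffusion (e.g., attention masking per bbox).
```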
X-PLUG/MobileAgent
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Language: Python
#agent #gpt4v #mllm #mobile_agents #multimodal #multimodal_large_language_models
Stars: 246 Issues: 3 Forks: 21
https://github.com/X-PLUG/MobileAgent
BradyFU/Video-MME
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Language: Python
#large_language_models #large_vision_language_models #mme #multimodal_large_language_models #video #video_mme
Stars: 182 Issues: 1 Forks: 6
https://github.com/BradyFU/Video-MME
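Video-MME is a multiple-choice video QA benchmark split by video duration, so evaluation boils down to per-bucket accuracy over predicted option letters. A hedged scoring sketch follows; the JSON field names (question_id, answer, duration) are assumptions about the release format, not the benchmark's documented schema.

```python
# Hedged sketch of scoring multiple-choice predictions against a Video-MME-style
# benchmark file. Field names are assumed; check the actual release for its schema.
import json
from collections import defaultdict

def score(predictions: dict[str, str], benchmark_path: str) -> dict[str, float]:
    """predictions maps question_id -> chosen option letter ('A'..'D')."""
    per_bucket = defaultdict(lambda: [0, 0])  # duration bucket -> [correct, total]
    with open(benchmark_path) as f:
        for item in json.load(f):
            bucket = item.get("duration", "all")  # e.g., short / medium / long
            correct = predictions.get(item["question_id"]) == item["answer"]
            per_bucket[bucket][0] += int(correct)
            per_bucket[bucket][1] += 1
    return {b: c / t for b, (c, t) in per_bucket.items() if t}

# Example: score({"q1": "B", "q2": "D"}, "videomme_qa.json")
```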
ictnlp/LLaMA-Omni
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Language: Python
#large_language_models #multimodal_large_language_models #speech_interaction #speech_language_model #speech_to_speech #speech_to_text
Stars: 274 Issues: 1 Forks: 16
https://github.com/ictnlp/LLaMA-Omni
nv-tlabs/LLaMA-Mesh
Unifying 3D Mesh Generation with Language Models
Language: Python
#3d_generation #llm #mesh_generation #multimodal
Stars: 427 Issues: 7 Forks: 14
https://github.com/nv-tlabs/LLaMA-Mesh
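LLaMA-Mesh's central trick is to serialize meshes as plain text, OBJ-style "v x y z" vertex lines and "f i j k" face lines, so an LLM can read and emit geometry as ordinary tokens. The small parser below turns such text back into vertex and face lists; the repository's actual tokenization and coordinate quantization may differ, so this is only a sketch of the representation.

```python
# Hedged sketch: turn OBJ-style text (the kind of mesh serialization an LLM can emit)
# back into numeric vertices and faces. Quantization details of the real repo may differ.
def parse_obj_text(mesh_text: str):
    vertices, faces = [], []
    for line in mesh_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":                            # vertex: three coordinates
            vertices.append(tuple(float(x) for x in parts[1:4]))
        elif parts[0] == "f":                          # face: 1-based vertex indices
            faces.append(tuple(int(i.split("/")[0]) for i in parts[1:]))
    return vertices, faces

sample = "v 0 0 0\nv 1 0 0\nv 0 1 0\nf 1 2 3"
verts, tris = parse_obj_text(sample)
print(len(verts), "vertices,", len(tris), "faces")
```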
ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.
Language: Python
#efficient #gpt4o #gpt4v #large_language_models #large_multimodal_models #llama #llava #multimodal #multimodal_large_language_models #video #vision #vision_language_model #visual_instruction_tuning
Stars: 173 Issues: 7 Forks: 11
https://github.com/ictnlp/LLaVA-Mini
ByteDance-Seed/Seed1.5-VL
Seed1.5-VL is a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
Language: Jupyter Notebook
#cookbook #large_language_model #multimodal_large_language_models #vision_language_model
Stars: 404 Issues: 0 Forks: 3
https://github.com/ByteDance-Seed/Seed1.5-VL