GitHub Trends

#python #cuda #deepseek #deepseek_llm #deepseek_v3 #inference #llama #llama2 #llama3 #llama3_1 #llava #llm #llm_serving #moe #pytorch #transformer #vlm

SGLang is a serving framework that speeds up work with large language models and vision language models. Its backend runtime optimizes inference with features such as prefix caching, continuous batching, and quantization. The frontend language is flexible and supports complex tasks such as chained generation calls and multi-modal inputs. SGLang supports many different models and has an active community behind it, and the extensive documentation makes it easier to get started and resolve issues.
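To make the prefix caching idea concrete, here is a minimal pure-Python sketch. It is not SGLang's implementation or API: `ToyPrefixCache` and `process` are hypothetical names, and the "tokens" are plain words. It only counts how many tokens can be reused when a later request shares a prefix with an earlier one.

```python
class ToyPrefixCache:
    """Toy trie over token sequences; illustrative only, not SGLang code."""

    def __init__(self):
        self.root = {}

    def process(self, tokens):
        """Insert tokens and return (reused, newly_computed) token counts."""
        node = self.root
        reused = 0
        matching = True
        for tok in tokens:
            if matching and tok in node:
                reused += 1          # this token was already seen on this path
            else:
                matching = False     # from here on, everything is new work
                node[tok] = {}
            node = node[tok]
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
system = ["You", "are", "a", "helpful", "assistant", "."]
first = cache.process(system + ["What", "is", "CUDA", "?"])
second = cache.process(system + ["What", "is", "MoE", "?"])
print(first)   # (0, 10)
print(second)  # (8, 2)
```

The second request reuses the eight leading tokens it shares with the first, so only two tokens count as new work in this toy model.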

https://github.com/sgl-project/sglang

GitHub - sgl-project/sglang: SGLang is a high-performance serving framework for large language models and multimodal models.

#cplusplus #cpp #cuda #deep_learning #deep_learning_library #gpu #nvidia

CUTLASS is a collection of CUDA templates and Python DSLs for high-performance matrix operations on NVIDIA GPUs. It breaks matrix computations down into reusable components, which makes it easier to build custom kernels. CUTLASS supports a range of data types and architectures, including the new Blackwell SM100 architecture, so users can optimize their programs for different hardware. Support for hardware features such as Tensor Cores improves performance for workloads that need fast computation, such as AI and scientific computing.
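The idea of splitting a matrix multiplication into smaller reusable pieces can be sketched in plain Python. This is only an illustration of tiling on the CPU, not CUTLASS code; `matmul_tiled` and the tile size of 2 are assumptions chosen for the example.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), visiting the work one tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # tile rows of the output
        for j0 in range(0, n, tile):        # tile columns of the output
            for k0 in range(0, k, tile):    # slices of the shared dimension
                # Accumulate the contribution of one pair of input tiles.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [1, 1]]
product = matmul_tiled(a, b)
print(product)  # [[4.0, 5.0], [10.0, 11.0], [16.0, 17.0]]
```

On a GPU, each tile would typically map to a block of threads working out of fast on-chip memory; the loop structure above only shows the decomposition.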

https://github.com/NVIDIA/cutlass

GitHub - NVIDIA/cutlass: CUDA Templates and Python DSLs for High-Performance Linear Algebra

#cplusplus #cuda #cutlass #gpu #pytorch

Flux is a library that speeds up machine learning on GPUs by overlapping communication with computation. It supports tensor and expert parallelism in model training and inference, and it works with PyTorch and several NVIDIA GPU architectures. Instead of first sending data between GPUs (communication) and then doing calculations (computation), Flux lets the two steps run at the same time. This overlap reduces overall training time, which helps most with large or complex models.
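A rough way to see why overlap helps is an idealized timing model. The sketch below is not Flux code and does not reflect its measured performance; the chunk count and per-chunk times are made-up numbers in arbitrary units, and the function names are hypothetical.

```python
def sequential_time(n_chunks, comm, comp):
    """Each chunk is fully transferred, then computed, with no overlap."""
    return n_chunks * (comm + comp)


def overlapped_time(n_chunks, comm, comp):
    """Transfers run back to back while compute works on chunks already received."""
    comm_done = 0
    comp_done = 0
    for _ in range(n_chunks):
        comm_done += comm                             # next chunk finishes arriving
        comp_done = max(comp_done, comm_done) + comp  # compute waits for its data
    return comp_done


seq = sequential_time(4, comm=2, comp=3)
ovl = overlapped_time(4, comm=2, comp=3)
print(seq, ovl)  # 20 14
```

In this model only the first transfer is exposed; every later transfer is hidden behind the computation of an earlier chunk.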

https://github.com/bytedance/flux

GitHub - bytedance/flux: A fast communication-overlapping library for tensor/expert parallelism on GPUs.

#cplusplus #cuda #gpu #machine_learning #machine_learning_algorithms #nvidia

cuML - RAPIDS Machine Learning Library

https://github.com/rapidsai/cuml

GitHub - rapidsai/cuml: cuML - RAPIDS Machine Learning Library

#cplusplus #assembly #assembly_language #avx512 #benchmark #coroutines #cpp #cpp_programming #cpp17 #cpp20 #cuda #gcc #google_benchmark #hpc #io_uring #linux_kernel #llvm #ptx #ranges #tutorial #tutorials

This repository shows how to write faster, more efficient code, with examples in C++, CUDA, and Assembly that focus on performance optimization techniques. Developers can use it to learn how to spot and avoid common performance bottlenecks. It also provides benchmarks that compare different coding methods, helping users choose the best approach for their projects. This can lead to significant speed improvements and better use of computer resources.

https://github.com/ashvardanian/less_slow.cpp

Playing around "Less Slow" coding practices in C++ 20, C, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and use...

#cuda

DeepEP is a communication library built for Mixture-of-Experts (MoE) models and expert parallelism. It speeds these models up by improving how data is exchanged between different parts of the system. DeepEP supports low-precision operations and handles data transfer over different interconnects, such as NVLink and RDMA. This makes it useful for both training and inference, especially when speed matters.
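In expert parallelism, tokens are typically sent to wherever their chosen expert lives and the results are then gathered back in the original order. The sketch below shows that data movement in plain Python on a single process. It is not DeepEP's API: `dispatch`, `combine`, the routing table, and the two toy "experts" are all hypothetical.

```python
from collections import defaultdict


def dispatch(tokens, expert_ids):
    """Group (position, token) pairs by the expert each token is routed to."""
    buckets = defaultdict(list)
    for pos, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append((pos, tok))
    return dict(buckets)


def combine(expert_outputs, n_tokens):
    """Scatter per-expert results back into the original token order."""
    out = [None] * n_tokens
    for results in expert_outputs.values():
        for pos, value in results:
            out[pos] = value
    return out


tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [1, 0, 1, 0]                       # made-up routing decisions
experts = {0: lambda x: x * 10, 1: lambda x: x + 0.5}

buckets = dispatch(tokens, expert_ids)
outputs = {
    eid: [(pos, experts[eid](tok)) for pos, tok in items]
    for eid, items in buckets.items()
}
result = combine(outputs, len(tokens))
print(result)  # [1.5, 20.0, 3.5, 40.0]
```

In a real multi-GPU setting the two grouping steps become network transfers, which is the part a communication library is meant to accelerate.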

https://github.com/deepseek-ai/DeepEP

GitHub - deepseek-ai/DeepEP: DeepEP: an efficient expert-parallel communication library

#rust #cuda

ZLUDA is software that lets you run unmodified CUDA programs, originally written for NVIDIA GPUs, on AMD Radeon RX 5000 series and newer GPUs. It aims for near-native performance on non-NVIDIA hardware, making CUDA applications more accessible. ZLUDA is still under development and mainly supports Geekbench tests, so it may not work correctly with all applications yet. It runs on Windows and Linux but not on macOS. If you have an AMD GPU and want to try running CUDA apps without an NVIDIA card, ZLUDA could be very useful, as it opens up more hardware options for CUDA software[3][5].

https://github.com/vosen/ZLUDA

GitHub - vosen/ZLUDA: CUDA on non-NVIDIA GPUs

#c_lang #cuda #cuda_driver_api #cuda_kernels #cuda_opengl

You can use NVIDIA's CUDA Samples to learn and test CUDA Toolkit 12.9 features; they are available on GitHub as a repository or as a ZIP download. The samples show how to use CUDA for GPU programming, covering utilities, concepts, libraries, and performance optimization. You build them with CMake on Linux, Windows, or Tegra devices, and you can run tests automatically with the provided Python script. This helps you understand CUDA programming, debug GPU code, and optimize your applications for better performance on NVIDIA GPUs. It’s a practical way to develop and improve GPU-accelerated software.

https://github.com/NVIDIA/cuda-samples

GitHub - NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit

#typescript #ai #cuda #mlx #qwen3_tts #qwen3_tts_ui #voice_ai #voice_clone #whisper

Voicebox is a free, open-source voice synthesis studio that lets you clone voices, generate speech in 23 languages, and apply audio effects—all running privately on your computer. You can create realistic voice clones from just seconds of audio, use five different text-to-speech engines for different needs, add effects like reverb and pitch shift, and build multi-voice projects with a timeline editor. The key benefit is complete privacy: your voice data and AI models never leave your machine, unlike cloud-based alternatives. It also includes an API for building voice-powered applications and works across Mac, Windows, and Linux with GPU acceleration support.

https://github.com/jamiepine/voicebox

GitHub - jamiepine/voicebox: The open-source AI voice studio. Clone, dictate, create.

#cuda

DeepGEMM is a high-performance library that speeds up matrix calculations for large language models on NVIDIA GPUs. It combines key computation kernels, including FP8 and FP4 matrix operations, Mixture-of-Experts (MoE) support, and attention scoring, in one CUDA codebase. All kernels are compiled at runtime, so installation requires no manual CUDA compilation. The result is faster model training and inference: DeepGEMM's performance matches or exceeds specialized libraries across different matrix sizes, reaching up to 1550 TFLOPS on high-end GPUs, while the code stays simple and accessible for learning GPU optimization techniques.
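To put the 1550 TFLOPS figure in perspective, the short calculation below estimates how long one matrix multiplication would take at that rate. It uses the standard count of 2·M·N·K floating-point operations for a GEMM; the 8192-sized matrices are a made-up example, and the result is an idealized lower bound rather than a measured DeepGEMM number.

```python
def gemm_flops(m, n, k):
    """One multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


def ideal_time_ms(flops, tflops):
    """Time in milliseconds if the hardware sustained `tflops` throughout."""
    return flops / (tflops * 1e12) * 1e3


flops = gemm_flops(8192, 8192, 8192)
ms = ideal_time_ms(flops, tflops=1550)
print(flops)          # 1099511627776
print(round(ms, 3))   # 0.709
```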

https://github.com/deepseek-ai/DeepGEMM

GitHub - deepseek-ai/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
