tairov/llama2.mojo
Inference Llama 2 in one file of pure 🔥
#inference #llama #llama2 #modular #mojo #parallelize #performance #simd #tensor #vectorization
Stars: 200 Issues: 0 Forks: 7
https://github.com/tairov/llama2.mojo
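For flavor, a concept sketch of what a llama2.c-style single-file runtime boils down to (plain Python with a toy model, not the repo's Mojo code): run the forward pass, pick the next token, feed it back in.
```python
# Concept sketch only -- a toy stand-in for the single-file decode loop.
import numpy as np

VOCAB = 32  # toy vocabulary size
rng = np.random.default_rng(0)
W = rng.normal(size=(VOCAB, VOCAB))  # stand-in for the transformer weights

def forward(token: int) -> np.ndarray:
    """Stand-in for the real transformer forward pass: token -> logits."""
    return W[token]

def generate(prompt: list[int], steps: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(steps):
        logits = forward(tokens[-1])           # logits for the next position
        tokens.append(int(np.argmax(logits)))  # greedy next-token sampling
    return tokens

print(generate([1], steps=8))
```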
chengzeyi/stable-fast
An ultra-lightweight inference performance optimization library for HuggingFace Diffusers on NVIDIA GPUs.
Language: Python
#cuda #deep_learning #deeplearning #diffusers #inference #inference_engine #performance_optimization #pytorch #stable_diffusion #triton
Stars: 134 Issues: 3 Forks: 5
https://github.com/chengzeyi/stable-fast
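Typical usage is a one-call compile of an existing Diffusers pipeline. The sketch below follows the project README of the time, so treat the exact module path and config fields as assumptions:
```python
# Hedged sketch of stable-fast usage; import path and config fields are
# taken from the README of that era and may since have changed.
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import (
    compile, CompilationConfig,
)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

config = CompilationConfig.Default()
config.enable_triton = True      # Triton kernels, if triton is installed
config.enable_cuda_graph = True  # capture the UNet in a CUDA graph
pipe = compile(pipe, config)     # returns an optimized, drop-in pipeline

image = pipe("a photo of an astronaut riding a horse").images[0]
```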
Fuzzy-Search/realtime-bakllava
llama.cpp with the BakLLaVA model, describing what it sees
Language: Python
#bakllavva #cpp #demo_application #inference #llama #llamacpp #llm
Stars: 141 Issues: 1 Forks: 15
https://github.com/Fuzzy-Search/realtime-bakllava
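The general pattern looks like the sketch below (hypothetical paths and port, not the repo's demo code): send a captured frame to a llama.cpp server running BakLLaVA. The "image_data"/"[img-N]" payload follows the llama.cpp server docs of that era and may have changed:
```python
# Hedged sketch: describe a webcam frame via a llama.cpp server started
# with a BakLLaVA model and its --mmproj vision projector (assumed setup).
import base64
import requests

with open("frame.jpg", "rb") as f:  # e.g. a frame dumped by OpenCV
    frame_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/completion",  # assumed local server address
    json={
        "prompt": "USER: [img-1] Describe what you see.\nASSISTANT:",
        "image_data": [{"data": frame_b64, "id": 1}],
        "n_predict": 128,
    },
)
print(resp.json()["content"])
```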
hpcaitech/SwiftInfer
Efficient AI Inference & Serving
Language: Python
#artificial_intelligence #deep_learning #gpt #inference #llama #llama2 #llm_inference #llm_serving
Stars: 299 Issues: 3 Forks: 14
https://github.com/hpcaitech/SwiftInfer
arc53/llm-price-compass
LLM provider price comparison: price-per-token calculation from GPU benchmarks, plus a GPU benchmark table
Language: TypeScript
#benchmark #gpu #inference_comparison #llm #llm_comparison #llm_inference #llm_price
Stars: 138 Issues: 1 Forks: 5
https://github.com/arc53/llm-price-compass
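The core arithmetic behind the price-per-token column, as a worked example (illustrative numbers, not figures from the repo):
```python
# A GPU's hourly rental cost divided by its measured throughput gives
# cost per token. All numbers below are hypothetical.
gpu_cost_per_hour = 2.50   # USD/hour, hypothetical cloud price for one GPU
tokens_per_second = 1500   # hypothetical benchmarked generation throughput

tokens_per_hour = tokens_per_second * 3600
cost_per_token = gpu_cost_per_hour / tokens_per_hour
print(f"${cost_per_token * 1_000_000:.3f} per 1M tokens")  # -> $0.463 per 1M tokens
```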
zml/zml
High performance AI inference stack. Built for production. @ziglang / @openxla / MLIR / @bazelbuild
Language: Zig
#ai #bazel #hpc #inference #xla #zig
Stars: 691 Issues: 1 Forks: 19
https://github.com/zml/zml
AgibotTech/agibot_x1_infer
The inference module for AgiBot X1.
Language: C++
#inference #open_source #robotics
Stars: 455 Issues: 2 Forks: 152
https://github.com/AgibotTech/agibot_x1_infer
thu-ml/SageAttention
Quantized Attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Language: Python
#attention #inference_acceleration #llm #quantization
Stars: 145 Issues: 6 Forks: 3
https://github.com/thu-ml/SageAttention
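The idea behind the speedup, sketched in plain numpy (my illustration of the numerics, not SageAttention's fused GPU kernels, which also use per-block scales and K smoothing):
```python
# Concept sketch: quantize Q and K to INT8 with per-tensor scales, do the
# QK^T matmul in integer arithmetic, dequantize before softmax, keep PV
# in floating point.
import numpy as np

def int8_quantize(x: np.ndarray):
    scale = np.abs(x).max() / 127.0  # per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantized_attention(Q, K, V):
    qQ, sQ = int8_quantize(Q)
    qK, sK = int8_quantize(K)
    # integer matmul accumulated in int32, then rescaled back to float
    scores = (qQ.astype(np.int32) @ qK.astype(np.int32).T) * (sQ * sK)
    scores = scores / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # PV stays in floating point

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 64)) for _ in range(3))
print(quantized_attention(Q, K, V).shape)  # (8, 64)
```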
zhihu/ZhiLight
A highly optimized inference acceleration engine for Llama and its variants.
Language: C++
#cpm #cuda #gpt #inference_engine #llama #llm #llm_serving #minicpm #pytorch #qwen
Stars: 192 Issues: 1 Forks: 16
https://github.com/zhihu/ZhiLight
therealoliver/Deepdive-llama3-from-scratch
Implement llama3 inference step by step: grasp the core concepts, master the derivation of each step, and implement the code.
Language: Jupyter Notebook
#attention #attention_mechanism #gpt #inference #kv_cache #language_model #llama #llm_configuration #llms #mask #multi_head_attention #positional_encoding #residuals #rms #rms_norm #rope #rotary_position_encoding #swiglu #tokenizer #transformer
Stars: 388 Issues: 0 Forks: 28
https://github.com/therealoliver/Deepdive-llama3-from-scratch
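As a taste of the from-scratch style (my sketch, not the notebook's code), here is RMSNorm, one of the components the tags list and the normalization Llama models use in place of LayerNorm:
```python
# RMSNorm: rescale by the root-mean-square of the activations -- no mean
# subtraction -- then apply a learned per-channel gain.
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

x = np.random.default_rng(0).normal(size=(4, 16))  # (tokens, hidden)
g = np.ones(16)                                    # learned gain, init to 1
print(rms_norm(x, g).shape)  # (4, 16)
```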