In-Kernel Broadcast Optimization: Co-Designing Kernels for RecSys Inference
In-Kernel Broadcast Optimization (IKBO) eliminates redundant user-embedding replication by fusing broadcast logic directly into interaction kernels, which cuts wasted memory bandwidth and compute. This co-design approach delivers up to a two-thirds reduction in latency across Meta's recommendation stack and is optimized for high-performance hardware such as NVIDIA H100 and Meta's MTIA.
https://pytorch.org/blog/in-kernel-broadcast-optimization-co-designing-kernels-for-recsys-inference/
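
The summary above describes replacing materialized user-embedding copies with broadcast logic inside the interaction kernel. The NumPy sketch below illustrates only the general idea, not the actual IKBO kernels: the interaction (a linear projection followed by an elementwise product), the shapes, and every name in it are assumptions chosen for illustration. It compares a baseline that replicates each user row once per candidate before projecting with a variant that projects once per user and looks rows up by index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a few users, several candidate items per user.
num_users, cands_per_user, dim, out_dim = 4, 8, 16, 5
num_rows = num_users * cands_per_user

user_emb = rng.standard_normal((num_users, dim))
item_emb = rng.standard_normal((num_rows, out_dim))
weight = rng.standard_normal((dim, out_dim))

# Row i of item_emb is a candidate for user user_index[i].
user_index = np.repeat(np.arange(num_users), cands_per_user)


def interact_replicated(user_emb, item_emb, weight, user_index):
    """Baseline: copy each user row once per candidate, then project."""
    replicated = user_emb[user_index]      # (num_rows, dim) materialized copy
    return (replicated @ weight) * item_emb


def interact_broadcast(user_emb, item_emb, weight, user_index):
    """Project once per user, then look up the result by index.

    NumPy still materializes the gathered rows here; a fused kernel
    would instead read the user row by index inside the kernel.
    """
    projected = user_emb @ weight          # (num_users, out_dim)
    return projected[user_index] * item_emb


baseline = interact_replicated(user_emb, item_emb, weight, user_index)
fused = interact_broadcast(user_emb, item_emb, weight, user_index)
assert np.allclose(baseline, fused)
```

In this toy setup the baseline multiplies `num_rows` rows by `weight`, while the second variant multiplies only `num_users` rows, so the user-side projection work shrinks by a factor of `cands_per_user` and the results still match.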