Maximize LLM Inference Performance + Auto-Profile/Optimize PyTorch/CUDA Code

Published: 18 August 2025
on channel: AI Performance Engineering

Talk #1: Everything You Need to Know About Reducing Voice-Agent Latency (by Philip Kiely @ Baseten)
Rolling your own optimized voice agent introduces hard problems at each layer of the stack. In this talk, Philip will provide an overview of the runtime optimizations, infrastructure setup, and client code required to get consistently low latencies for voice at scale.

Talk #2: PyTorch Profiling That Actually Tells You What to Fix (by Emilio Andere @ Herdora)
Automate PyTorch profiler analysis: trace bottlenecks to their root causes (kernel memory patterns, tensor layouts, missing fusions) and map each one to a specific code fix.

Talk #3: Auto-Optimizing PyTorch and CUDA Code (by Chris Fregly)
Automate PyTorch and CUDA performance optimizations for all environments including GPUs.

Zoom link: https://us02web.zoom.us/j/82308186562

Related Links
GitHub Repo: http://github.com/cfregly/ai-performa...
O'Reilly Book: https://www.amazon.com/Systems-Perfor...
YouTube: @aiperformanceengineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm