Red Hat recently announced the launch of llm-d, a new open source project aimed at meeting the most critical requirement of generative AI's (gen AI) future: inference at scale.
The project was jointly initiated by founding contributors CoreWeave, Google Cloud, IBM Research, and NVIDIA, with participation from industry players such as AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, as well as academic institutions such as the University of California, Berkeley, and the University of Chicago. The goal is to make gen AI applications in production environments as ubiquitous as Linux.
llm-d combines breakthrough large-scale gen AI inference technologies, a Kubernetes-native architecture, vLLM-based distributed inference, and intelligent AI-aware network routing to create a robust large language model (LLM) inference cloud that can meet the most stringent production service-level objectives (SLOs).
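For context, the vLLM engine at llm-d's core can already serve a model on a single node with a few lines of Python; llm-d's contribution is distributing this pattern across servers. The sketch below uses vLLM's offline inference API; the model name and parallelism setting are illustrative assumptions, not part of the llm-d announcement.

```python
# Minimal sketch: single-node vLLM inference, the building block that
# llm-d scales out across a Kubernetes cluster. The model choice and
# tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any vLLM-supported model
    tensor_parallel_size=2,                      # shard across 2 GPUs on this node
)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Within a single server, vLLM already handles batching, KV cache management, and tensor parallelism; the features listed below are what llm-d adds to take this beyond one machine.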
"The launch of the llm-d community, backed by numerous AI leaders, signals a critical juncture in addressing the need for scalable generative AI inference, a significant challenge that enterprises must overcome to enable broader AI adoption," said Brian Stevens, senior vice president and chief technology officer, AI, at Red Hat. "By leveraging the innovative technology of vLLM and the proven capabilities of Kubernetes, llm-d helps enterprises more seamlessly implement distributed, scalable, and high-performance AI inference across extended hybrid cloud environments. It supports any model, any accelerator, and runs on any cloud, helping realize the promise of AI's limitless potential."
Meeting the Need for Scalable Generative AI Inference with llm-d
To address the challenges of scaling inference, Red Hat, in collaboration with industry partners, launched llm-d. This forward-thinking project extends vLLM beyond the limits of a single server and unlocks production AI inference at scale. Leveraging the proven scheduling capabilities of Kubernetes, llm-d integrates advanced inference capabilities seamlessly into an enterprise's existing IT infrastructure. IT teams can meet the diverse service requirements of mission-critical workloads on a unified platform while maximizing efficiency and significantly reducing the total cost of ownership (TCO) associated with high-performance AI accelerators.
llm-d offers a range of features; highlights include:
• vLLM, rapidly becoming the de facto standard open source inference server: it provides Day 0 support for emerging models and runs on a variety of accelerators, including Google Cloud Tensor Processing Units (TPUs).
• Prefill and decode disaggregation: separates the input-processing (prefill) and token-generation (decode) stages of inference into independent computing tasks that can be distributed across multiple servers.
• LMCache-based key-value (KV) cache offloading: shifts the memory burden of the KV cache from GPU memory to more cost-effective, abundant standard storage, such as CPU memory or network storage.
• Kubernetes-powered clusters and controllers: schedule compute and storage resources more efficiently as workload demands fluctuate, maintaining performance and keeping latency low.
• AI-aware network routing: schedules incoming requests to the servers and accelerators most likely to hold hot caches from previous inference operations (see the sketch after this list).
• High-performance communication APIs: enable faster and more efficient data transfer between servers, with support for the NVIDIA Inference Xfer Library (NIXL).
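To make the routing idea concrete, here is a hypothetical sketch of cache-affinity scheduling. None of these names or data structures come from llm-d itself; they are assumptions illustrating the general technique of preferring replicas whose KV caches likely already hold a request's prompt prefix, falling back to the shortest queue.

```python
# Hypothetical sketch of cache-aware request routing, in the spirit of
# llm-d's AI-aware network routing. The scoring heuristic and data
# structures are illustrative assumptions, not llm-d's actual code.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_depth: int  # requests currently waiting on this replica
    cached_prefixes: set[str] = field(default_factory=set)  # prompt-prefix hashes in KV cache

def prefix_hash(prompt: str, block: int = 256) -> str:
    # Hash the leading block of the prompt; real systems typically
    # hash per KV-cache block rather than a single fixed prefix.
    return str(hash(prompt[:block]))

def pick_replica(prompt: str, replicas: list[Replica]) -> Replica:
    """Prefer replicas whose KV cache likely holds this prompt's prefix
    (a "hot cache"), breaking ties by the shortest queue."""
    key = prefix_hash(prompt)
    def score(r: Replica) -> tuple[int, int]:
        cache_miss = 0 if key in r.cached_prefixes else 1  # hits sort first
        return (cache_miss, r.queue_depth)
    return min(replicas, key=score)

# Usage: route a request, then record its prefix so future requests
# sharing that prefix are steered toward the same replica.
replicas = [Replica("pod-a", queue_depth=3), Replica("pod-b", queue_depth=1)]
prompt = "Summarize the following document: ..."
target = pick_replica(prompt, replicas)
target.cached_prefixes.add(prefix_hash(prompt))
print(f"routed to {target.name}")
```

In llm-d this decision is made by the AI-aware router using Kubernetes-native primitives; the heuristic above only captures the intuition of cache-hit-first, load-second scheduling.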
llm-d Receives Support from Industry Leaders
This new open source project is backed by a strong coalition of leading gen AI model providers, AI accelerator pioneers, and major AI cloud platforms. Founding contributors include CoreWeave, Google Cloud, IBM Research, and NVIDIA, while partners include AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, underscoring deep industry collaboration on the future of large-scale LLM serving. The llm-d community also counts significant founding supporters from academia, including the Sky Computing Lab at the University of California, Berkeley (originators of vLLM) and the LMCache Lab at the University of Chicago (originators of LMCache).
Red Hat, committed to open collaboration, recognizes that a vibrant, accessible community is crucial to the rapid evolution of gen AI inference. Red Hat will actively cultivate the llm-d community, fostering an inclusive environment for new members and driving the project's continued development.