Inferact


Software Development

San Francisco, CA 2,456 followers


About us

Inferact is a startup founded by creators and core maintainers of vLLM, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

Industry
Software Development
Company size
11-50 employees
Headquarters
San Francisco, CA
Type
Privately Held
Founded
2025

Updates

  • Shoutout to our co-founder Kaichao You for making this fix and writing up the full story. From a 2024 hackathon bug → in-tree workarounds in vLLM → PyTorch Foundation TAC → fix landed in PyTorch 2.11.0. This kind of unglamorous, multi-org debugging makes the whole stack better. 👇

    PyTorch (320,033 followers)

    vLLM and PyTorch worked together to fix a long-standing aarch64 install headache: as of PyTorch 2.11.0, pip install torch on GB200 / GB300 / GH200 just works.

    What changed: PyTorch 2.11.0 now publishes CUDA-enabled aarch64 wheels to the default PyPI index. No more custom --index-url flags. No more transitive dependencies silently swapping your GPU build for the CPU wheel. New users on Grace Hopper and Grace Blackwell systems can follow the standard install instructions and have vLLM work the first time.

    In our latest blog, Kaichao You (co-founder Inferact, Lead Maintainer vLLM) shares the full story:

    🐛 A 2024 hackathon bug bringing up vLLM on GH200
    🔧 vLLM's in-tree workarounds (use_existing_torch.py and [tool.uv] build-isolation passthrough)
    🤝 From GitHub issue to PyTorch Foundation TAC discussion
    🚀 The fix landing in PyTorch 2.11.0, driven by NVIDIA and PyTorch core

    A great example of cross-project collaboration under the PyTorch Foundation umbrella, and a reminder that boring infrastructure wins compound.

    Read the full story: https://lnkd.in/gGc8mRm8

    Alban Desmaison (Meta), Nikita Shulga (Meta), Andrey Talman (Meta), Piotr Bialecki (NVIDIA)
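One way to confirm that an install picked up a GPU build rather than a CPU wheel is to look at the CUDA version the installed torch was compiled against. The sketch below is an illustration, not something taken from the post or the blog: `classify_torch_build` is a hypothetical helper name, and it relies only on `torch.version.cuda` (which is `None` for CPU-only builds), `torch.cuda.is_available()`, and `platform.machine()`.

```python
import platform
from typing import Optional


def classify_torch_build(cuda_version: Optional[str], machine: str) -> str:
    """Label a PyTorch build from its compiled CUDA version and CPU architecture.

    Hypothetical helper for illustration; not part of PyTorch or vLLM.
    """
    kind = "CPU-only" if cuda_version is None else f"CUDA {cuda_version}"
    return f"{kind} build on {machine}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        # torch.version.cuda is None when the wheel was built without CUDA.
        print(classify_torch_build(torch.version.cuda, platform.machine()))
        # A CUDA build can still report False here if no GPU or driver is visible.
        print("GPU visible:", torch.cuda.is_available())
```

On a Grace Hopper or Grace Blackwell machine, a "CPU-only build on aarch64" result would suggest the CPU wheel was installed instead of the CUDA-enabled one.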

  • Inferact (2,456 followers)

    We’re at MLSys 2026 in Bellevue this week! ⛴️ Come find the Inferact team at Booth #2 in the Evergreen Ballroom.

    Talks:
    • Roger Wang (co-founder at Inferact): “Rethinking Open Source Contribution in the Age of AI Agents”, Mon 5/18, 11:36 AM
    • Yifan Qiao (vLLM core contributor): YPS Sponsor Lightning Talk, Mon 5/18, 11:36 AM

    At the booth:
    • 20 Questions with vLLM: a game with vLLM running on DGX Spark, with prizes 🎯
    • vLLM + Inferact swag 🧢
    • Inferact team members, happy to talk inference and vLLM

    If you’re attending, come say hi, chat about inference, or learn what we’re building!

  • We're onto Inferact's second office this year! Yesterday, we finally broke it in with an office warming. It's amazing to see how far we've come. The vLLM ecosystem has been growing at lightning pace, and we've been lucky to scale alongside it: helping teams serve inference faster, cheaper, and at scale. Thank you to everyone who made it out yesterday — customers, partners, friends, and the whole Inferact team. It meant a lot to celebrate this milestone together. We're hiring across all teams. If you want to join one of the fastest-growing AI infra companies and work on the systems powering the next generation of AI, check out our careers page or DM us. Excited for many more office warmings to come!

  • Inferact reposted this

    THE MORE U BUY, THE MORE U SAVE: By ganging up multiple B200 8-GPU machines over RoCEv2 CX-7 Ethernet with Tomahawk switches, and applying an inference optimization called PD (prefill/decode) disaggregation, per-GPU token throughput increases by up to 7x. Raising per-GPU token throughput by up to 7x in turn cuts cost per million tokens by up to 7x. Great work to Inferact & vLLM for building this amazing OSS engine, and to NVIDIA Data Center Kyle Kranen for building the Dynamo inference orchestrator. More improvements to disagg B200 perf to come!
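The link between per-GPU throughput and cost per million tokens can be made concrete with a little arithmetic. The sketch below assumes the hourly price of a GPU stays fixed, so cost scales inversely with throughput. The $4.00 per GPU-hour price and the 1,000 tokens/s baseline are made-up illustrative numbers, not figures from the post, and `cost_per_million_tokens` is a hypothetical helper.

```python
import math


def cost_per_million_tokens(gpu_hourly_cost: float, tokens_per_sec_per_gpu: float) -> float:
    """Cost in dollars to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hourly_cost / tokens_per_hour * 1_000_000


# Illustrative assumptions only: not measured or quoted values.
GPU_HOURLY_COST = 4.00          # dollars per GPU-hour (assumed)
BASELINE_TOKENS_PER_SEC = 1000  # per-GPU throughput before the optimization (assumed)
SPEEDUP = 7                     # the "up to 7x" figure from the post

baseline = cost_per_million_tokens(GPU_HOURLY_COST, BASELINE_TOKENS_PER_SEC)
improved = cost_per_million_tokens(GPU_HOURLY_COST, BASELINE_TOKENS_PER_SEC * SPEEDUP)

# With a fixed hourly price, the cost ratio equals the throughput ratio.
assert math.isclose(baseline / improved, SPEEDUP)
print(f"baseline: ${baseline:.3f} per 1M tokens, improved: ${improved:.3f} per 1M tokens")
```

The 7x cost reduction holds only under the fixed-price assumption. Extra networking hardware or a larger minimum deployment would change the per-GPU hourly cost and therefore the ratio.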

  • Proud of what the team has shipped here. Huge thank you to the vLLM community, NVIDIA, Red Hat, and DigitalOcean for the partnership. All this work is in vLLM main or heading upstream 🚀

    vLLM (25,678 followers)

    vLLM tops the Artificial Analysis leaderboard 🎉

    vLLM tops Artificial Analysis on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source.

    How each result was built:
    🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
    🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
    🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.

    Every optimization is in vLLM main or on its way upstream. Huge thank you to Inferact, DigitalOcean, NVIDIA, Red Hat, and the vLLM community 🙏

    Full breakdown 👇 https://lnkd.in/gtRgxSFS

  • Inferact reposted this

    Today, we're proud to announce Inferact, a startup founded by creators and core maintainers of vLLM, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

    ## The Challenge

    Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize.

    The capability gap between models and the systems that serve them is widening. Left this way, the most capable models remain bottlenecked, with the full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities.

    And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data.

    We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

    ## Why Us

    vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users.

    Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale, in research and in production.

    ## Open Source

    vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

    ## Join Us

    Through the open source community, we are fortunate to work with some of the best people we know. For Inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us!

    - Simon Mo, Woosuk Kwon, Kaichao You, Roger Wang, Ion Stoica, and the rest of the founding members of Inferact.
