Inferact

Inferact · 2026-05-11T21:39:16.300Z

Proud of what the team has shipped here. Huge thank you to the vLLM community, NVIDIA, Red Hat, and DigitalOcean for the partnership. All this work is in vLLM main or heading upstream 🚀

Software Development

San Francisco, CA 2,456 followers


View all 28 employees

About us

Inferact is a startup founded by creators and core maintainers of vLLM, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

Website: https://inferact.ai
Industry: Software Development
Company size: 11-50 employees
Headquarters: San Francisco, CA
Type: Privately Held
Founded: 2025

Locations

Primary

San Francisco, CA, US


Employees at Inferact


Updates

Inferact

2,456 followers
1d
Shoutout to our co-founder Kaichao You for making this fix and writing up the full story. From a 2024 hackathon bug → in-tree workarounds in vLLM → PyTorch Foundation TAC → fix landed in PyTorch 2.11.0. This kind of unglamorous, multi-org debugging makes the whole stack better. 👇

PyTorch

320,033 followers
1d

vLLM and PyTorch worked together to fix a long-standing aarch64 install headache: as of PyTorch 2.11.0, pip install torch on GB200 / GB300 / GH200 just works.

What changed: PyTorch 2.11.0 now publishes CUDA-enabled aarch64 wheels to the default PyPI index. No more custom --index-url flags. No more transitive dependencies silently swapping your GPU build for the CPU wheel. New users on Grace Hopper and Grace Blackwell systems can follow the standard install instructions and have vLLM work the first time.

In our latest blog, Kaichao You (co-founder Inferact, Lead Maintainer vLLM) shares the full story:
🐛 A 2024 hackathon bug bringing up vLLM on GH200
🔧 vLLM's in-tree workarounds (use_existing_torch.py and [tool.uv] build-isolation passthrough)
🤝 From GitHub issue to PyTorch Foundation TAC discussion
🚀 The fix landing in PyTorch 2.11.0, driven by NVIDIA and PyTorch core.

A great example of cross-project collaboration under the PyTorch Foundation umbrella, and a reminder that boring infrastructure wins compound.

Read the full story: https://lnkd.in/gGc8mRm8
✍ Alban Desmaison (Meta), Nikita Shulga (Meta), Andrey Talman (Meta), Piotr Bialecki (NVIDIA)
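One way to confirm which kind of wheel actually got installed is to look at `torch.version.cuda`, which is `None` on CPU-only builds. The sketch below is illustrative only: the helper names `classify_build` and `describe_installed_torch` are hypothetical and are not part of vLLM or PyTorch.

```python
import importlib.util
import platform


def classify_build(machine, cuda_version):
    """Label a torch install from the CPU architecture and torch.version.cuda.

    Hypothetical helper for illustration; not part of vLLM or PyTorch.
    """
    # Some platforms report the same architecture as "arm64".
    arch = "aarch64" if machine.lower() in ("aarch64", "arm64") else machine
    if cuda_version is None:
        return f"CPU-only build on {arch}"
    return f"CUDA {cuda_version} build on {arch}"


def describe_installed_torch():
    """Describe the installed torch wheel, or report that none was found."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the script still runs without torch

    # torch.version.cuda is None when the wheel was built without CUDA support.
    return classify_build(platform.machine(), torch.version.cuda)


if __name__ == "__main__":
    print(describe_installed_torch())
```

Running this after a fresh install on a Grace Hopper or Grace Blackwell machine would show whether a dependency quietly replaced the GPU build with a CPU one.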

Inferact

2,456 followers
2d Edited
We're at MLSys 2026 in Bellevue this week! ⛴️ Come find the Inferact team at Booth #2 in the Evergreen Ballroom.

Talks:
• Roger Wang (co-founder at Inferact): “Rethinking Open Source Contribution in the Age of AI Agents”, Mon 5/18, 11:36 AM
• Yifan Qiao (vLLM core contributor): YPS Sponsor Lightning Talk, Mon 5/18, 11:36 AM

At the booth:
• 20 Questions with vLLM: a game with vLLM running on DGX Spark, with prizes 🎯
• vLLM + Inferact swag 🧢
• Inferact team members, happy to talk inference and vLLM

If you're attending, come say hi, chat about inference, or learn what we're building!
Inferact

2,456 followers
5d
We're onto Inferact's second office this year! Yesterday, we finally broke it in with an office warming. It's amazing to see how far we've come. The vLLM ecosystem has been growing at lightning pace, and we've been lucky to scale alongside it: helping teams serve inference faster, cheaper, and at scale. Thank you to everyone who made it out yesterday — customers, partners, friends, and the whole Inferact team. It meant a lot to celebrate this milestone together. We're hiring across all teams. If you want to join one of the fastest-growing AI infra companies and work on the systems powering the next generation of AI, check out our careers page or DM us. Excited for many more office warmings to come!
Inferact reposted this
SemiAnalysis

39,557 followers
1w
THE MORE U BUY, THE MORE U SAVE: Ganging up multiple B200 8-GPU machines over RoCEv2 CX-7 Ethernet with Tomahawk switches, together with an inference optimization called PD disaggregation, increases per-GPU token throughput by up to 7x. At a fixed GPU price, that also decreases cost per million tokens by up to 7x. Great work to Inferact & vLLM for building this amazing OSS engine & to NVIDIA Data Center Kyle Kranen for building the Dynamo inference orchestrator. More improvements to disagg B200 perf to come!
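The link between throughput and cost can be checked with a few lines of arithmetic. This is a minimal sketch, not SemiAnalysis's methodology: the GPU hourly price and the baseline tokens per second are made-up placeholder numbers, and the only input taken from the post is the "up to 7x" throughput factor.

```python
def cost_per_million_tokens(gpu_hour_price, tokens_per_second_per_gpu):
    """Dollars per one million generated tokens for a single GPU."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_hour_price / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only to make the arithmetic concrete.
GPU_HOUR_PRICE = 4.00        # assumed dollars per GPU-hour
BASELINE_TOKENS_PER_S = 1_000  # assumed per-GPU throughput without disaggregation
SPEEDUP = 7                  # the "up to 7x" figure quoted in the post

baseline = cost_per_million_tokens(GPU_HOUR_PRICE, BASELINE_TOKENS_PER_S)
disagg = cost_per_million_tokens(GPU_HOUR_PRICE, BASELINE_TOKENS_PER_S * SPEEDUP)

print(f"baseline: ${baseline:.4f} per million tokens")
print(f"disagg:   ${disagg:.4f} per million tokens")
print(f"cost reduction factor: {baseline / disagg:.2f}x")
```

Because cost per token is price divided by throughput, the reduction factor equals the speedup only if the per-GPU price stays constant; any added networking or switch cost would shrink it.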
8 Comments

Inferact

2,456 followers
1w
Proud of what the team has shipped here. Huge thank you to the vLLM community, NVIDIA, Red Hat, and DigitalOcean for the partnership. All this work is in vLLM main or heading upstream 🚀

vLLM

25,678 followers
1w

vLLM tops the Artificial Analysis leaderboard 🎉

vLLM tops Artificial Analysis on DeepSeek V3.2 and ranks among the top deployments of MiniMax-M2.5 and Qwen 3.5 397B. The leading deployments of these models are now open source.

How each result was built:
🔹 DeepSeek V3.2: Aggressive op fusion across the attention path collapsed ~33 per-layer kernels down toward ~10.
🔹 MiniMax-M2.5: Custom EAGLE3 draft trained against the target's own token distribution via TorchSpec, plus a custom QK-norm fusion for MiniMax's TP-aware attention.
🔹 Qwen 3.5 397B: Targeted fusions plus a QK-norm fix for Qwen's linear-attention path.

Every optimization is in vLLM main or on its way upstream. Huge thank you to Inferact, DigitalOcean, NVIDIA, Red Hat, and the vLLM community 🙏

Full breakdown 👇 https://lnkd.in/gtRgxSFS


Inferact reposted this
Simon Mo
3mo
Today, we're proud to announce Inferact, a startup founded by creators and core maintainers of vLLM, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

## The Challenge

Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize.

The capability gap between models and the systems that serve them is widening. Left this way, the most capable models remain bottlenecked, with the full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities.

And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data.

We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

## Why Us

vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users.

Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation.

We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale, in research and in production.

## Open Source

vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

## Join Us

Through the open source community, we are fortunate to work with some of the best people we know. For Inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us!

- Simon Mo, Woosuk Kwon, Kaichao You, Roger Wang, Ion Stoica, and the rest of the founding members of Inferact.
89 Comments



Employees at Inferact

Max Gazor

Joseph Spisak

Robin Cheng

Nick Hill


Similar pages

vLLM

RadixArk

Baseten

LiveKit

humans&

Railway

Ricursive Intelligence

Andreessen Horowitz

Parloa

Flapping Airplanes