We have shipped on-premise ChatGPT. We started our journey with a tiny node, the JOHNAIC 16, late last year. But our customers turned out to be pretty demanding: they're making us grow with ever more intense requirements. So we designed two new systems:

1. JOHNAIC 140: a 7 x 20 GB GPU machine built for on-prem AI inference. It can run OpenAI's GPT-OSS-120B with 100 concurrent users, and the model is as good as or better than GPT-4o. Even after deploying this model, we have GPU RAM left over for embeddings, speech-to-text and text-to-speech.

2. JOHNAIC DataBank 64: 64 TB of high-performance NVMe storage. The storage can reach in-memory performance levels: we've benchmarked 64 GB/s disk read speeds. We have deployed highly available Postgres across two of these nodes.

The idea is that the JOHNAIC 140 generates SQL queries for the data stored in the DataBank 64. A private coding assistant is also available through the OpenAI-compatible API deployed on the cluster (a quick sketch of how a client might use it is below).
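For readers curious how the pieces fit together, here is a minimal sketch of the text-to-SQL loop: a client hits the OpenAI-compatible endpoint on the JOHNAIC 140 to generate a query, then runs it against the Postgres on the DataBank 64. The endpoint URL, model name, schema, and connection string below are placeholders for illustration, not our actual deployment details.

```python
"""Hypothetical sketch of text-to-SQL over the on-prem cluster.
Endpoint URL, model name, schema, and DSN are assumptions, not real config."""
from openai import OpenAI
import psycopg2

# Point the standard OpenAI client at the on-prem, OpenAI-compatible endpoint (assumed URL/key).
client = OpenAI(base_url="http://johnaic-140.local:8000/v1", api_key="not-needed")

# Example schema the model is asked to query against (placeholder).
SCHEMA = "CREATE TABLE orders (id INT, customer TEXT, amount NUMERIC, placed_at DATE);"

def generate_sql(question: str) -> str:
    """Ask the locally hosted GPT-OSS-120B to translate a question into SQL."""
    response = client.chat.completions.create(
        model="gpt-oss-120b",  # assumed model name on the cluster
        messages=[
            {"role": "system",
             "content": f"Given this schema:\n{SCHEMA}\nReturn only a single SQL query."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

def run_query(sql: str):
    """Execute the generated SQL against the HA Postgres on the DataBank 64 (assumed DSN)."""
    with psycopg2.connect("host=databank-64.local dbname=analytics user=readonly") as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()

if __name__ == "__main__":
    sql = generate_sql("Total order amount per customer in 2024?")
    print(sql)
    print(run_query(sql))
```

The same OpenAI-compatible endpoint is what the private coding assistant talks to, so any tool that speaks the OpenAI API can be pointed at the cluster without code changes.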
Now I'm imagining what could be possible with this kind of power, and it's mind-boggling!
Quite a big milestone! Is this for someone in a regulated industry?
Amazing work! 👏 Truly impressive to see on-prem AI at this scale. Quick question: what strategies are you using to optimize memory and GPU utilization to support 100 concurrent users on GPT-OSS-120B?
Sasank Chilamkurthy this sounds crazy! What does this cost?
Impressive.
CTO, Urai - Building ML/AI products
This is amazing. It is unfair to just post this photo without the config. 🤓 Give us the deets. :) 7 x 20 GB is impressive, and particularly good for MoE models like GPT-OSS-120B. Is it a CUDA-enabled card? Some of the newer attention mechanisms use FP8 cores to drive amazing performance on vLLM. How do you cool this, and how much power does it draw? Are these blower-style cards? Isn't it better to have storage and compute separated? Wouldn't the NVMe drives and GPUs compete for the same PCIe lanes? The latency from the model would anyway be higher than the latency of a network connection, especially if you can colocate them on the same rack. 🤔