Run (almost) any language model

We use vLLM and a custom-built, autoscaling GPU scheduler to run (almost) any open-source large language model for you: just paste a link to the Hugging Face repo. You can use our chat UI, or our OpenAI-compatible API. We'll let you use up to eight Nvidia A100 80GB GPUs.
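Because the API is OpenAI-compatible, you can point any existing OpenAI client at it. Here's a minimal sketch using the official Python client; the base URL, API key, and model ID are placeholders, so substitute the values from your dashboard:

```python
# Minimal sketch: call the OpenAI-compatible endpoint with the openai client.
# The base_url, api_key, and model ID below are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://example.com/v1",  # hypothetical endpoint; use the one we give you
    api_key="YOUR_API_KEY",             # hypothetical key
)

response = client.chat.completions.create(
    # Any supported Hugging Face repo ID works as the model name.
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```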
Works with any full-weight or 4-bit AWQ repo on Hugging Face that vLLM supports, including:
  • Meta Llama 3.1 405b Instruct (and 70b, and 8b)
  • Qwen 2 72b
  • Mixtral 8x22b
  • Gemma 2 27b
  • Jamba 1.5 Mini (support for Jamba 1.5 Large is in the works)
  • Phi-3
And many more. We'll also run full-weight finetunes, like those from Nous Research, as well as uncensored, anti-refusal "abliterated" models.
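If you want to sanity-check ahead of time whether vLLM can load a given repo, you can try it locally. This is a minimal sketch assuming you have a GPU and vLLM installed; the repo ID is just an example:

```python
# Minimal local check that vLLM can load and run a repo.
# Requires a local GPU and `pip install vllm`; the repo ID is only an example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example 4-bit AWQ repo
    quantization="awq",                              # omit for full-weight repos
)

outputs = llm.generate(
    ["Write one sentence about GPUs."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```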
For the most popular models, we proxy to always-on inference providers for you automatically. For the more bespoke models, we'll spin up a cluster for you on-demand, and spin it down once you're done using it.
It's free during the beta period, while we work out the kinks and figure out how to price it. Once the beta is over, we expect to significantly beat the pricing of the major cloud GPU vendors, because we can run the models multi-tenant.