There's no subscription required to use GLHF. Instead, we charge based on usage: if you don't use the product, you don't get charged.
On-demand models are billed by GPU time, at the following rates:

GPU Type | Price (per GPU) |
---|---|
80GB | $0.03/min |
48GB | $0.015/min |
24GB | $0.012/min |
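For a rough sense of what a session costs, here's a minimal sketch (plain Python, not part of GLHF's API; the helper name is our own) that multiplies out the per-minute rates above:

```python
# Hypothetical helper: estimate the cost of an on-demand deployment
# from the per-minute rates in the table above.
RATES_PER_MIN = {
    "80GB": 0.030,   # $0.03/min per GPU
    "48GB": 0.015,   # $0.015/min per GPU
    "24GB": 0.012,   # $0.012/min per GPU
}

def on_demand_cost(gpu_type: str, gpu_count: int, minutes: float) -> float:
    """Estimated cost in dollars for running `gpu_count` GPUs of `gpu_type` for `minutes` minutes."""
    return RATES_PER_MIN[gpu_type] * gpu_count * minutes

# Example: a model that needs two 80GB GPUs, running for one hour.
print(f"${on_demand_cost('80GB', 2, 60):.2f}")  # -> $3.60
```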
Our on-demand GPU rates are competitive: for example, an 80GB GPU on GLHF costs roughly half of what comparable services like Replicate or Modal Labs charge.
We automatically calculate the type and number of GPUs a model repository requires. We don't quantize on-demand models: they're launched in whatever precision the underlying repo uses, typically BF16, with the exception of Jamba-based models, which are launched in FP8. Quantizing past FP8 can significantly harm model performance.
On-demand model context length is capped at 32k tokens.
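You don't need to think about the placement logic yourself, but as a rough intuition for how model size maps to GPU count, here's an illustrative back-of-the-envelope sketch. This is our own assumption for illustration, not GLHF's actual algorithm:

```python
import math

# Illustrative estimate only; not GLHF's actual placement logic.
# Assumes BF16 weights (2 bytes per parameter) plus ~20% headroom
# for the KV cache and activations.
def estimate_80gb_gpus(params_billion: float, gpu_gb: int = 80) -> int:
    weights_gb = params_billion * 2      # BF16: 2 bytes per parameter
    needed_gb = weights_gb * 1.2         # rough headroom assumption
    return math.ceil(needed_gb / gpu_gb)

# A 70B-parameter model is ~140GB of BF16 weights, so two to three
# 80GB GPUs depending on how much headroom you assume.
print(estimate_80gb_gpus(70))  # -> 3 under the 20% headroom assumption
```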
Always-on models are billed per token, at the following rates:

Model | Provider | Price (per million tokens) |
---|---|---|
deepseek-ai/DeepSeek-V3 | Together | $1.25/mtok |
google/gemma-2-27b-it | Together | $0.80/mtok |
google/gemma-2-9b-it | Together | $0.20/mtok |
meta-llama/Llama-3.1-405B-Instruct | Fireworks | $3.00/mtok |
meta-llama/Llama-3.1-70B-Instruct | Fireworks | $0.90/mtok |
meta-llama/Llama-3.1-8B-Instruct | Fireworks | $0.20/mtok |
meta-llama/Llama-3.2-11B-Vision-Instruct | Fireworks | $0.20/mtok |
meta-llama/Llama-3.2-3B-Instruct | Fireworks | $0.10/mtok |
meta-llama/Llama-3.2-90B-Vision-Instruct | Fireworks | $0.90/mtok |
meta-llama/Llama-3.3-70B-Instruct | Fireworks | $0.90/mtok |
mistralai/Mistral-7B-Instruct-v0.3 | Together | $0.20/mtok |
mistralai/Mixtral-8x22B-Instruct-v0.1 | Together | $1.20/mtok |
mistralai/Mixtral-8x7B-Instruct-v0.1 | Together | $0.60/mtok |
NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | Together | $0.60/mtok |
nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | Together | $0.90/mtok |
Qwen/Qwen2.5-72B-Instruct | Fireworks | $0.90/mtok |
Qwen/Qwen2.5-7B-Instruct | Together | $0.18/mtok |
Qwen/Qwen2.5-Coder-32B-Instruct | Fireworks | $0.90/mtok |
upstage/SOLAR-10.7B-Instruct-v1.0 | Together | $0.30/mtok |
Always-on models run in whatever precision the underlying API provider supports, typically either BF16 or FP8, and they support their full context length.
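To estimate what a single request costs on an always-on model, here's a minimal sketch based on the table above. It assumes the listed $/mtok figure applies to input and output tokens alike; the helper name is our own:

```python
# Hypothetical helper: estimate the cost of one request to an always-on
# model from its per-million-token rate.
def request_cost(price_per_mtok: float, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in dollars, assuming input and output tokens are billed at the same rate."""
    return price_per_mtok * (prompt_tokens + completion_tokens) / 1_000_000

# Example: meta-llama/Llama-3.3-70B-Instruct at $0.90/mtok,
# with a 2,000-token prompt and a 500-token completion.
print(f"${request_cost(0.90, 2000, 500):.5f}")  # -> $0.00225
```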