There's no subscription required to use GLHF. Instead, we charge based on usage: if you don't use the product, you don't get charged.
| GPU Type | Price |
| --- | --- |
| 80GB | 3 cents/min, per GPU |
| 48GB | 1.5 cents/min, per GPU |
| 24GB | 1.2 cents/min, per GPU |
Our on-demand GPU rates are very competitive: an 80GB GPU on GLHF costs roughly half of what you'd pay on competing services like Replicate or Modal Labs.
We automatically calculate the type and number of GPUs a model repository requires. We don't quantize on-demand models: they're launched in whatever precision the underlying repo uses, typically BF16, with the exception of Jamba-based models, which are launched in FP8. Quantizing more aggressively than FP8 can significantly harm output quality.
On-demand model context length is capped at 32k tokens.
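To get a feel for what those per-minute rates work out to, here's a rough back-of-the-envelope sketch in Python. The sizing heuristic (BF16 weights at 2 bytes per parameter plus an assumed ~20% overhead for KV cache and activations) is an illustration only; it is not the placement logic we actually use.

```python
# Back-of-the-envelope estimate of on-demand cost for a model repo.
# The sizing heuristic is illustrative; GLHF calculates GPU type and
# count automatically and may place models differently.

import math

# Per-GPU rates from the table above, in dollars per minute.
GPU_RATES = {80: 0.03, 48: 0.015, 24: 0.012}

def estimate_cost_per_hour(params_billions: float, overhead: float = 1.2) -> dict:
    """Estimate hourly cost assuming BF16 weights (2 bytes/param)
    plus an assumed ~20% overhead for KV cache and activations."""
    needed_gb = params_billions * 2 * overhead
    best = None
    for gpu_gb, rate_per_min in GPU_RATES.items():
        count = math.ceil(needed_gb / gpu_gb)
        cost_per_hour = count * rate_per_min * 60
        if best is None or cost_per_hour < best["cost_per_hour"]:
            best = {"gpu_gb": gpu_gb, "count": count, "cost_per_hour": cost_per_hour}
    return best

# Example: a hypothetical 70B-parameter model in BF16.
print(estimate_cost_per_hour(70))
# -> with these assumptions, four 48GB GPUs at ~$3.60/hr
#    (illustrative; actual placement may differ)
```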
| Base model | Input price (per million tokens) | Output price (per million tokens) |
| --- | --- | --- |
| meta-llama/Llama-3.2-1B-Instruct | $0.06/mtok | $0.06/mtok |
| meta-llama/Llama-3.2-3B-Instruct | $0.06/mtok | $0.06/mtok |
| meta-llama/Meta-Llama-3.1-8B-Instruct | $0.20/mtok | $0.20/mtok |
| meta-llama/Meta-Llama-3.1-70B-Instruct | $0.90/mtok | $0.90/mtok |
LoRA finetunes are small, efficient, quick-to-train finetunes of base models. Their size is measured in "rank," starting at rank 8; we support LoRAs up to rank 64 kept always-on, and we run them in FP8 precision. The rank is set during the finetuning process: if you create your own LoRA, you can choose exactly the rank you want through your training framework's standard configuration.
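If you train your LoRA with Hugging Face's peft library, the rank is the `r` parameter of `LoraConfig`; other frameworks expose an equivalent setting. A minimal sketch, with an illustrative base model, alpha, and target modules (these are example choices, not requirements):

```python
# Minimal sketch of setting LoRA rank with Hugging Face peft.
# Rank 64 is the largest we keep always-on; lora_alpha and
# target_modules below are illustrative choices, not requirements.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=64,                      # the LoRA rank
    lora_alpha=128,            # scaling factor (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# ...train as usual, then push the adapter to the Hugging Face Hub.
```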
For LoRAs of base models not listed in the table above, we support running them on-demand as long as vLLM supports the base model; however, since those base models aren't kept always-on, you'll be charged our standard on-demand pricing for the base model (with no additional charge for the LoRA).
| Model | Provider | Input price (per million tokens) | Output price (per million tokens) |
| --- | --- | --- | --- |
| deepseek-ai/DeepSeek-R1 | Fireworks | $0.55/mtok | $2.19/mtok |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Together | $0.90/mtok | $0.90/mtok |
| deepseek-ai/DeepSeek-V3 | Together | $1.25/mtok | $1.25/mtok |
| deepseek-ai/DeepSeek-V3-0324 | Fireworks | $1.20/mtok | $1.20/mtok |
| google/gemma-2-27b-it | Together | $0.80/mtok | $0.80/mtok |
| meta-llama/Llama-3.1-405B-Instruct | Fireworks | $3.00/mtok | $3.00/mtok |
| meta-llama/Llama-3.1-70B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| meta-llama/Llama-3.1-8B-Instruct | Fireworks | $0.20/mtok | $0.20/mtok |
| meta-llama/Llama-3.2-11B-Vision-Instruct | Fireworks | $0.20/mtok | $0.20/mtok |
| meta-llama/Llama-3.2-3B-Instruct | Fireworks | $0.10/mtok | $0.10/mtok |
| meta-llama/Llama-3.2-90B-Vision-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| meta-llama/Llama-3.3-70B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| mistralai/Mistral-7B-Instruct-v0.3 | Together | $0.20/mtok | $0.20/mtok |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Together | $1.20/mtok | $1.20/mtok |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Together | $0.60/mtok | $0.60/mtok |
| NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | Together | $0.60/mtok | $0.60/mtok |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | Together | $0.90/mtok | $0.90/mtok |
| Qwen/Qwen2.5-72B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| Qwen/Qwen2.5-7B-Instruct | Together | $0.18/mtok | $0.18/mtok |
| Qwen/Qwen2.5-Coder-32B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| upstage/SOLAR-10.7B-Instruct-v1.0 | Together | $0.30/mtok | $0.30/mtok |
Always-on models run in whatever precision the underlying API provider supports, typically BF16 or FP8, and they support their full context length.
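Always-on billing is straightforward token arithmetic: tokens divided by one million, multiplied by the per-mtok rate. A small sketch of that calculation, using two entries copied from the table above (this is just the arithmetic, not our billing code):

```python
# Sketch of per-request cost arithmetic for always-on models.
# Prices are dollars per million tokens, copied from the table above;
# this illustrates the math, it is not our actual billing code.

PRICES = {
    "meta-llama/Llama-3.3-70B-Instruct": {"input": 0.90, "output": 0.90},
    "deepseek-ai/DeepSeek-R1": {"input": 0.55, "output": 2.19},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request."""
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["input"] + (output_tokens / 1_000_000) * p["output"]

# Example: a 2,000-token prompt with a 500-token completion on DeepSeek-R1.
print(round(request_cost("deepseek-ai/DeepSeek-R1", 2_000, 500), 6))
# -> 0.002195, i.e. about a fifth of a cent
```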