Pricing

There's no subscription required to use GLHF. Instead, we charge based on usage: if you don't use the product, you don't get charged.

On-demand pricing

| GPU Type | Price |
| --- | --- |
| 80GB | 3 cents/min, per GPU |
| 48GB | 1.5 cents/min, per GPU |
| 24GB | 1.2 cents/min, per GPU |

Our on-demand GPU rates are very competitive: for example, an 80GB GPU costs roughly half as much on GLHF as on competing services like Replicate or Modal Labs.

We automatically calculate the type and number of GPUs required for a given model repository. We don't quantize on-demand models: they're launched in whatever precision the underlying repository uses (typically BF16), with the exception of Jamba-based models, which are launched in FP8. Quantizing past FP8 can significantly harm model performance.

On-demand model context length is capped at 32k tokens.
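As a rough illustration of how on-demand billing adds up, here's a sketch that estimates the GPU footprint of a BF16 model from its parameter count and the resulting per-minute cost. The sizing heuristic (2 bytes per parameter plus headroom) and the packing logic are illustrative assumptions, not our exact scheduler.

```python
# Rough, illustrative estimate of on-demand cost for a BF16 model.
# The sizing heuristic (2 bytes/param plus ~20% headroom for KV cache and
# activations) is an assumption for illustration, not GLHF's exact scheduler.
import math

PRICE_PER_GPU_MIN = {80: 0.03, 48: 0.015, 24: 0.012}  # dollars/min, from the table above

def estimate_on_demand_cost(params_billions: float, minutes: float, gpu_gb: int = 80) -> float:
    """Estimate the cost of running a BF16 model on-demand for `minutes` minutes."""
    weights_gb = params_billions * 2        # BF16 = 2 bytes per parameter
    usable_gb = gpu_gb * 0.8                # assume ~20% headroom per GPU
    num_gpus = math.ceil(weights_gb / usable_gb)
    return num_gpus * minutes * PRICE_PER_GPU_MIN[gpu_gb]

# e.g. a 70B-parameter BF16 model (~140GB of weights) fits on ~3 80GB GPUs,
# so an hour of usage is roughly 3 * 60 * $0.03 = $5.40.
print(f"${estimate_on_demand_cost(70, 60):.2f}")
```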

LoRA pricing

| Base model | Input price (per million tokens) | Output price (per million tokens) |
| --- | --- | --- |
| meta-llama/Llama-3.2-1B-Instruct | $0.06/mtok | $0.06/mtok |
| meta-llama/Llama-3.2-3B-Instruct | $0.06/mtok | $0.06/mtok |
| meta-llama/Meta-Llama-3.1-8B-Instruct | $0.20/mtok | $0.20/mtok |
| meta-llama/Meta-Llama-3.1-70B-Instruct | $0.90/mtok | $0.90/mtok |

LoRA finetunes are small, efficient, quick-to-train adaptations of base models. Their size is measured in "ranks," starting at rank-8; we keep LoRAs up to rank-64 always-on and run them in FP8 precision. The rank is set during the finetuning process: if you create your own LoRA, you can set exactly the rank you want using the standard configuration for your training framework.
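For example, if you train with Hugging Face PEFT (one common framework; the fields below are PEFT's, not anything GLHF-specific), the rank is just the `r` field of the LoRA config:

```python
# Minimal sketch of setting the LoRA rank with Hugging Face PEFT.
# Target modules and alpha are illustrative; tune them for your model.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                      # rank-64 is the largest we keep always-on
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```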

LoRAs of base models not listed in the table above can still be run on-demand, as long as vLLM supports the base model; however, since those base models aren't kept always-on, you'll be charged our standard on-demand pricing for the base model (with no additional charge for the LoRA).

Always-on pricing

| Model | Provider | Input price (per million tokens) | Output price (per million tokens) |
| --- | --- | --- | --- |
| deepseek-ai/DeepSeek-R1 | Fireworks | $0.55/mtok | $2.19/mtok |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Together | $0.90/mtok | $0.90/mtok |
| deepseek-ai/DeepSeek-V3 | Together | $1.25/mtok | $1.25/mtok |
| deepseek-ai/DeepSeek-V3-0324 | Fireworks | $1.20/mtok | $1.20/mtok |
| google/gemma-2-27b-it | Together | $0.80/mtok | $0.80/mtok |
| meta-llama/Llama-3.1-405B-Instruct | Fireworks | $3.00/mtok | $3.00/mtok |
| meta-llama/Llama-3.1-70B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| meta-llama/Llama-3.1-8B-Instruct | Fireworks | $0.20/mtok | $0.20/mtok |
| meta-llama/Llama-3.2-11B-Vision-Instruct | Fireworks | $0.20/mtok | $0.20/mtok |
| meta-llama/Llama-3.2-3B-Instruct | Fireworks | $0.10/mtok | $0.10/mtok |
| meta-llama/Llama-3.2-90B-Vision-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| meta-llama/Llama-3.3-70B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| mistralai/Mistral-7B-Instruct-v0.3 | Together | $0.20/mtok | $0.20/mtok |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Together | $1.20/mtok | $1.20/mtok |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Together | $0.60/mtok | $0.60/mtok |
| NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | Together | $0.60/mtok | $0.60/mtok |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | Together | $0.90/mtok | $0.90/mtok |
| Qwen/Qwen2.5-72B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| Qwen/Qwen2.5-7B-Instruct | Together | $0.18/mtok | $0.18/mtok |
| Qwen/Qwen2.5-Coder-32B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| upstage/SOLAR-10.7B-Instruct-v1.0 | Together | $0.30/mtok | $0.30/mtok |

Always-on models run in whatever precision the underlying API provider supports: typically, either BF16 or FP8. Always-on models support their full context length.
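To estimate what a request costs on an always-on model, multiply the input and output token counts by the per-million-token rates in the table above. A quick sketch (prices hard-coded from the table; the token counts are just an example):

```python
# Estimate the cost of a single request to an always-on model,
# using the per-million-token prices from the table above.

PRICES = {  # (input $/mtok, output $/mtok)
    "deepseek-ai/DeepSeek-R1": (0.55, 2.19),
    "meta-llama/Llama-3.3-70B-Instruct": (0.90, 0.90),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# e.g. 2,000 prompt tokens + 500 completion tokens on DeepSeek-R1:
# (2000 * 0.55 + 500 * 2.19) / 1e6 ≈ $0.0022
print(f"${request_cost('deepseek-ai/DeepSeek-R1', 2000, 500):.4f}")
```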