There's no subscription required to use GLHF. Instead, we charge based on usage: if you don't use the product, you don't get charged.
| GPU Type | Price |
| --- | --- |
| 80GB | 3 cents/min, per GPU |
| 48GB | 1.5 cents/min, per GPU |
| 24GB | 1.2 cents/min, per GPU |
Our on-demand GPU rates are very competitive: an 80GB GPU on GLHF costs roughly half of what you'd pay on competing services like Replicate or Modal Labs.
We automatically calculate the type and number of GPUs a model repository requires. We don't quantize on-demand models: they're launched in whatever precision the underlying repo uses, typically BF16, with the exception of Jamba-based models, which are launched in FP8. Quantizing more aggressively than FP8 can significantly harm output quality.
On-demand model context length is capped at 32k tokens.
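To get a feel for what those per-minute rates work out to, here's a rough back-of-the-envelope sketch in Python. The sizing heuristic (BF16 weights at 2 bytes per parameter plus an assumed ~20% overhead for KV cache and activations) is an illustration only; it is not the placement logic we actually use.

```python
# Back-of-the-envelope estimate of on-demand cost for a model repo.
# The sizing heuristic is illustrative; GLHF calculates GPU type and
# count automatically and may place models differently.

import math

# Per-GPU rates from the table above, in dollars per minute.
GPU_RATES = {80: 0.03, 48: 0.015, 24: 0.012}

def estimate_cost_per_hour(params_billions: float, overhead: float = 1.2) -> dict:
    """Estimate hourly cost assuming BF16 weights (2 bytes/param)
    plus an assumed ~20% overhead for KV cache and activations."""
    needed_gb = params_billions * 2 * overhead
    best = None
    for gpu_gb, rate_per_min in GPU_RATES.items():
        count = math.ceil(needed_gb / gpu_gb)
        cost_per_hour = count * rate_per_min * 60
        if best is None or cost_per_hour < best["cost_per_hour"]:
            best = {"gpu_gb": gpu_gb, "count": count, "cost_per_hour": cost_per_hour}
    return best

# Example: a hypothetical 70B-parameter model in BF16.
print(estimate_cost_per_hour(70))
# -> with these assumptions, four 48GB GPUs at ~$3.60/hr
#    (illustrative; actual placement may differ)
```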
| Base model | Input price (per million tokens) | Output price (per million tokens) |
| --- | --- | --- |
| meta-llama/Llama-3.2-1B-Instruct | $0.06/mtok | $0.06/mtok |
| meta-llama/Llama-3.2-3B-Instruct | $0.06/mtok | $0.06/mtok |
| meta-llama/Meta-Llama-3.1-8B-Instruct | $0.20/mtok | $0.20/mtok |
| meta-llama/Meta-Llama-3.1-70B-Instruct | $0.90/mtok | $0.90/mtok |
LoRA finetunes are small, efficient, quick-to-train finetunes of base models. Their size is measured in "rank," starting at rank 8; we support LoRAs up to rank 64 kept always-on, and we run them in FP8 precision. The rank is set during the finetuning process: if you create your own LoRA, you can choose exactly the rank you want through your training framework's standard configuration.
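If you train your LoRA with Hugging Face's peft library, the rank is the `r` parameter of `LoraConfig`; other frameworks expose an equivalent setting. A minimal sketch, with an illustrative base model, alpha, and target modules (these are example choices, not requirements):

```python
# Minimal sketch of setting LoRA rank with Hugging Face peft.
# Rank 64 is the largest we keep always-on; lora_alpha and
# target_modules below are illustrative choices, not requirements.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=64,                      # the LoRA rank
    lora_alpha=128,            # scaling factor (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# ...train as usual, then push the adapter to the Hugging Face Hub.
```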
For LoRAs of base models not listed in the table above, we support running them on-demand as long as vLLM supports the base model; however, since those base models aren't kept always-on, you'll be charged our standard on-demand pricing for the base model (with no additional charge for the LoRA).
| Model | Provider | Input price (per million tokens) | Output price (per million tokens) |
| --- | --- | --- | --- |
| deepseek-ai/DeepSeek-R1 | Fireworks | $0.55/mtok | $2.19/mtok |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | Together | $0.90/mtok | $0.90/mtok |
| deepseek-ai/DeepSeek-V3 | Together | $1.25/mtok | $1.25/mtok |
| deepseek-ai/DeepSeek-V3-0324 | Fireworks | $1.20/mtok | $1.20/mtok |
| google/gemma-2-27b-it | Together | $0.80/mtok | $0.80/mtok |
| meta-llama/Llama-3.1-405B-Instruct | Fireworks | $3.00/mtok | $3.00/mtok |
| meta-llama/Llama-3.1-70B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| meta-llama/Llama-3.1-8B-Instruct | Fireworks | $0.20/mtok | $0.20/mtok |
| meta-llama/Llama-3.2-11B-Vision-Instruct | Fireworks | $0.20/mtok | $0.20/mtok |
| meta-llama/Llama-3.2-3B-Instruct | Fireworks | $0.10/mtok | $0.10/mtok |
| meta-llama/Llama-3.2-90B-Vision-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| meta-llama/Llama-3.3-70B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| mistralai/Mistral-7B-Instruct-v0.3 | Together | $0.20/mtok | $0.20/mtok |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Together | $1.20/mtok | $1.20/mtok |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Together | $0.60/mtok | $0.60/mtok |
| NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | Together | $0.60/mtok | $0.60/mtok |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | Together | $0.90/mtok | $0.90/mtok |
| Qwen/Qwen2.5-72B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| Qwen/Qwen2.5-7B-Instruct | Together | $0.18/mtok | $0.18/mtok |
| Qwen/Qwen2.5-Coder-32B-Instruct | Fireworks | $0.90/mtok | $0.90/mtok |
| upstage/SOLAR-10.7B-Instruct-v1.0 | Together | $0.30/mtok | $0.30/mtok |
Always-on models run in whatever precision the underlying API provider supports, typically BF16 or FP8, and they support their full context length.
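Always-on billing is straightforward token arithmetic: tokens divided by one million, multiplied by the per-mtok rate. A small sketch of that calculation, using two entries copied from the table above (this is just the arithmetic, not our billing code):

```python
# Sketch of per-request cost arithmetic for always-on models.
# Prices are dollars per million tokens, copied from the table above;
# this illustrates the math, it is not our actual billing code.

PRICES = {
    "meta-llama/Llama-3.3-70B-Instruct": {"input": 0.90, "output": 0.90},
    "deepseek-ai/DeepSeek-R1": {"input": 0.55, "output": 2.19},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request."""
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["input"] + (output_tokens / 1_000_000) * p["output"]

# Example: a 2,000-token prompt with a 500-token completion on DeepSeek-R1.
print(round(request_cost("deepseek-ai/DeepSeek-R1", 2_000, 500), 6))
# -> 0.002195, i.e. about a fifth of a cent
```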