Published 2025-02-06
Thank you to everyone for supporting us since we launched billing last month! To help you run custom finetunes cheaply, we're launching per-token pricing for LoRAs on many base models: Llama 3.1 70B Instruct, Qwen 2.5 72B Instruct, and more. LoRA finetunes targeting these base models stay always-on once uploaded, so you never have to wait for a model boot! We've also added a bundle of other new features:
LoRAs (short for "Low-Rank Adaptations") are relatively small, efficient finetunes that modify a base model, like Llama 3.1 70B Instruct or Qwen 2.5 72B Instruct. Unlike full-parameter finetunes, LoRAs can be hot-swapped in and out of GPU VRAM quickly, which is why they stay always-on once they've been uploaded for the first time, and why we can offer cheap, per-token pricing for them. LoRAs are still very effective at changing a base model's behavior, so they're a great way to run custom models without paying full-parameter prices.
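For the curious, here's a minimal sketch of the LoRA idea (illustrative only, not our serving code; the sizes are made up and the usual scaling factor is omitted). Instead of updating a full weight matrix, a LoRA trains two small matrices whose product forms a low-rank update, so swapping finetunes only moves the small pair around:

```python
# Illustrative sketch of the LoRA idea: the base weight W stays frozen,
# and the finetune consists only of the small matrices A and B. Swapping
# LoRAs means swapping A/B pairs, not the big W, which is why they can
# stay hot in VRAM. Sizes below are hypothetical.
import numpy as np

d_model, rank = 1024, 8                     # rank << d_model

W = np.random.randn(d_model, d_model)       # frozen base-model weight
A = np.random.randn(rank, d_model) * 0.01   # trainable down-projection
B = np.zeros((d_model, rank))               # trainable up-projection (starts at 0,
                                            # so the LoRA is a no-op before training)

x = np.random.randn(d_model)                # an activation vector

base_out = W @ x                            # base forward pass
lora_out = W @ x + B @ (A @ x)              # adapted pass; only A and B differ per finetune

# A LoRA stores 2 * d_model * rank parameters instead of d_model ** 2:
print(f"full finetune params: {W.size:,}, LoRA params: {A.size + B.size:,}")
```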
We've added a new section to our pricing page to cover the LoRA pricing; basically:
We've bumped everyone's rate limits even higher! The old limit was 4 requests per minute; the new limit is 60 requests per minute (aka, one request per second). Both are averaged over an eight-hour period, so short spikes above the limit are fine. Have fun!
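To make the averaging concrete, here's the back-of-the-envelope arithmetic (just the budget math, not our actual limiter implementation):

```python
# 60 requests/minute averaged over 8 hours works out to a budget per
# rolling 8-hour window, so bursts well above 1 req/s are fine as long
# as the window total stays under budget.
RATE_PER_MINUTE = 60
WINDOW_MINUTES = 8 * 60

window_budget = RATE_PER_MINUTE * WINDOW_MINUTES   # 28,800 requests per 8 hours
print(window_budget)

# e.g. a burst of 1,000 requests in a single minute still leaves plenty:
remaining = window_budget - 1_000
print(remaining / (WINDOW_MINUTES - 1))            # ~58 req/min for the rest
```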
The API has always supported sampler parameters, but until now the UI hasn't. Following last month's launch of system prompt support, we've now added support for the temperature, presence penalty, and frequency penalty sampler parameters.
"Temperature" refers to the creativity of a model, vs its accuracy on the tasks it was trained on; a higher temperature makes the model more creative, but less accurate, whereas a lower temperature makes it more accurate but less creative.
Presence and frequency penalties are both ways to reduce the model's tendency to repeat itself: the presence penalty applies a flat, one-time penalty to any token that has already appeared, while the frequency penalty scales with repetition, penalizing a token more each time it's used.
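If you're curious what these knobs do under the hood, here's a toy sketch of the common OpenAI-style formulation applied to next-token logits; our serving stack may differ in the details, and all names here are illustrative:

```python
# Toy sketch of temperature, presence penalty, and frequency penalty
# applied to next-token logits (common OpenAI-style formulation; the
# exact order and details vary between implementations).
import numpy as np
from collections import Counter

def adjust_logits(logits, generated_tokens, temperature=1.0,
                  presence_penalty=0.0, frequency_penalty=0.0):
    logits = logits.copy()
    counts = Counter(generated_tokens)
    for token_id, count in counts.items():
        # presence penalty: flat, applied once if the token appeared at all;
        # frequency penalty: grows with each repeat of the token
        logits[token_id] -= presence_penalty + frequency_penalty * count
    # temperature: divide logits before softmax; <1 sharpens, >1 flattens
    return logits / max(temperature, 1e-6)

def sample(logits):
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

vocab_logits = np.array([2.0, 1.5, 0.5, 0.0])   # tiny 4-token vocabulary
history = [0, 0, 1]                             # token 0 used twice, token 1 once
adjusted = adjust_logits(vocab_logits, history,
                         temperature=0.8,
                         presence_penalty=0.5,
                         frequency_penalty=0.3)
print(sample(adjusted))
```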
Check all of them out in the per-thread settings UI!
As many of you have noticed, we now support DeepSeek V3 and R1, in addition to the distilled models we supported at launch. V3 and R1 are proxied to our partners at Together AI: your data stays in the U.S., and, as with all API usage, we delete API prompts and completions within 14 days.
You've probably noticed the site becoming smoother and easier on the eyes over the last couple of months. That's no accident, and we'll keep working on it!
If you made it here, thank you so much for your support and for being a part of our journey!
We're hard at work on more improvements, including:
If you have any thoughts or feedback, please continue to reach out at [email protected]. We appreciate all the emails we've gotten so far!
— Matt & Billy