Published 2024-11-19
After quite a few requests for an email list, we're launching the newsletter! We'll post updates every few weeks. Here's the latest news and what we've been shipping:
Cline is a popular coding assistant for VS Code. With the launch of Alibaba's Qwen2.5-Coder-32B-Instruct, a state-of-the-art open-source coding model that performs on par with OpenAI's gpt-4o, there's been a lot of demand for Cline support. Cline supported "OpenAI-compatible" API providers in theory, but it relied on very new extensions to the OpenAI API that most backends, like vLLM, don't yet support.
To support Cline with the new model, we updated our API to silently translate the newer OpenAI API into the older, more widely supported version, which means Cline now works with glhf.chat! We believe we may be the first Qwen2.5-Coder-32B provider to work with Cline's OpenAI-compatible mode.
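We won't walk through the whole translation layer here, but to make the idea concrete: one of the newer OpenAI API features is structured message content, where a message's content field is a list of typed parts rather than a plain string, and backends that predate that format will reject such requests. A minimal sketch of that one downgrade, with an illustrative helper name rather than our actual implementation, looks like:

# Illustrative sketch, not our production code: flatten newer OpenAI-style
# messages, whose "content" may be a list of typed parts, back into the
# plain-string form that older OpenAI-compatible backends expect.
def downgrade_messages(messages):
    downgraded = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            # Keep the text parts and join them into one string.
            text = "".join(
                part.get("text", "") for part in content if part.get("type") == "text"
            )
            message = {**message, "content": text}
        downgraded.append(message)
    return downgraded

# Example: the kind of request body a new-style client might send.
messages = [{"role": "user", "content": [{"type": "text", "text": "Hello!"}]}]
assert downgrade_messages(messages)[0]["content"] == "Hello!"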
CodeCompanion is another popular coding assistant, this time for Neovim. We worked with the maintainers to ship a bugfix that allows CodeCompanion to work out-of-the-box with GLHF using the OpenAI-compatible adapter. Make sure you have your GLHF_API_KEY exported as an environment variable in your shell, and then configure CodeCompanion with:
require("codecompanion").setup({
strategies = {
chat = {
adapter = "openai_compatible",
},
inline = {
adapter = "openai_compatible",
},
agent = {
adapter = "openai_compatible",
},
},
openai_compatible = function()
return require("codecompanion.adapters").extend("openai_compatible", {
env = {
url = "https://glhf.chat",
api_key = "GLHF_API_KEY",
chat_url = "/api/openai/v1/chat/completions",
},
schema = {
model = {
-- Or any GLHF model!
default = "hf:Qwen/Qwen2.5-Coder-32B-Instruct",
},
num_ctx = {
default = 32768,
},
},
})
end,
})
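Once that's in place, running :CodeCompanionChat in Neovim should open a chat buffer backed by the GLHF endpoint. If you'd like to sanity-check your key and the endpoint outside the editor first, here's a minimal sketch using the official openai Python client; the base URL is assembled from the url and chat_url values in the config above, and the model name is just the default from the snippet:

# Minimal sketch: point the official OpenAI Python client at glhf.chat's
# OpenAI-compatible endpoint. Assumes `pip install openai` and GLHF_API_KEY
# exported in your shell, as described above.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://glhf.chat/api/openai/v1",
    api_key=os.environ["GLHF_API_KEY"],
)

response = client.chat.completions.create(
    model="hf:Qwen/Qwen2.5-Coder-32B-Instruct",  # or any GLHF model
    messages=[{"role": "user", "content": "Say hello!"}],
)
print(response.choices[0].message.content)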
Running out of GPU capacity is a problem: new model launches get delayed until our infrastructure providers can bring new GPUs online or reclaim old, unused capacity. To better handle low-capacity situations, we've expanded our backend to support more GPU types, making it less likely that capacity constraints will delay model launches.
Larger models like Nous Research's Hermes 3 can be hundreds of gigabytes; if there isn't already a machine running the model, downloading it can be quite slow. Previously, we just showed a spinner... For a very, very long time: potentially 10 minutes or longer. We'll still show a spinner while the machine goes through its initial boot, since we can't start downloading the model until we have a machine to download it to, but once the machine boots, we now show a progress UI for the download to help you see how much time is left.
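For a sense of scale: at an assumed 1 GB/s of download throughput, a 500 GB model takes over eight minutes to fetch, before any boot or load time. The progress reporting itself is the easy part once a total size is known; here's a generic illustration (not our actual downloader) of streaming a file while printing percent complete:

# Generic illustration of streamed-download progress, not our actual
# downloader: uses the Content-Length header to compute percent complete.
import sys

import requests

def download_with_progress(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        total = int(response.headers.get("Content-Length", 0))
        done = 0
        with open(dest, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
                done += len(chunk)
                if total:
                    sys.stderr.write(f"\r{done / total:6.1%} of {total:,} bytes")
        sys.stderr.write("\n")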
We updated the version of vLLM we run in production (along with our CUDA version) for better model support. Models that are either newly supported or increasingly stable include:
You've probably noticed the site becoming smoother and easier on the eyes over the last couple of months. That's no accident, and we'll keep working on it!
If you made it here, thank you so much for your support and for being a part of our journey!
We're hard at work on more improvements, including but not limited to:
If you have any thoughts or feedback, please continue to reach out at [email protected]. We appreciate all the emails we've gotten so far!
— Matt & Billy