Published 2024-11-19
After quite a few requests for an email list, we're launching the newsletter! We'll post updates every few weeks. Here's the latest news and what we've been shipping:
Cline is a popular coding assistant for VS Code. With the launch of Alibaba's Qwen2.5-Coder-32B-Instruct, a state-of-the-art open-source coding model that performs on par with OpenAI's gpt-4o, there's been a lot of demand for Cline support. Cline supported "OpenAI-compatible" API providers in theory, but it relied on very new extensions to the OpenAI API that most backends, like vLLM, don't yet support.
To support Cline with the new model, we updated our API to silently translate the newer OpenAI API into the older, more widely supported version, which means Cline now works with glhf.chat! We believe we may be the first Qwen2.5-Coder-32B provider to work with Cline's OpenAI-compatible mode.
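We won't walk through the whole translation layer here, but to make the idea concrete: one of the newer OpenAI API features is structured message content, where a message's content field is a list of typed parts rather than a plain string, and backends that predate that format will reject such requests. A minimal sketch of that one downgrade, with an illustrative helper name rather than our actual implementation, looks like:

# Illustrative sketch, not our production code: flatten newer OpenAI-style
# messages, whose "content" may be a list of typed parts, back into the
# plain-string form that older OpenAI-compatible backends expect.
def downgrade_messages(messages):
    downgraded = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            # Keep the text parts and join them into one string.
            text = "".join(
                part.get("text", "") for part in content if part.get("type") == "text"
            )
            message = {**message, "content": text}
        downgraded.append(message)
    return downgraded

# Example: the kind of request body a new-style client might send.
messages = [{"role": "user", "content": [{"type": "text", "text": "Hello!"}]}]
assert downgrade_messages(messages)[0]["content"] == "Hello!"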
CodeCompanion is another popular coding assistant, this time for Neovim. We worked with the maintainers to ship a bugfix that allows CodeCompanion to work out-of-the-box with GLHF using the OpenAI-compatible adapter. Make sure you have your GLHF_API_KEY exported as an environment variable in your shell, and then configure CodeCompanion with:
require("codecompanion").setup({
strategies = {
chat = {
adapter = "openai_compatible",
},
inline = {
adapter = "openai_compatible",
},
agent = {
adapter = "openai_compatible",
},
},
openai_compatible = function()
return require("codecompanion.adapters").extend("openai_compatible", {
env = {
url = "https://glhf.chat",
api_key = "GLHF_API_KEY",
chat_url = "/api/openai/v1/chat/completions",
},
schema = {
model = {
-- Or any GLHF model!
default = "hf:Qwen/Qwen2.5-Coder-32B-Instruct",
},
num_ctx = {
default = 32768,
},
},
})
end,
})
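Once that's in place, running :CodeCompanionChat in Neovim should open a chat buffer backed by the GLHF endpoint. If you'd like to sanity-check your key and the endpoint outside the editor first, here's a minimal sketch using the official openai Python client; the base URL is assembled from the url and chat_url values in the config above, and the model name is just the default from the snippet:

# Minimal sketch: point the official OpenAI Python client at glhf.chat's
# OpenAI-compatible endpoint. Assumes `pip install openai` and GLHF_API_KEY
# exported in your shell, as described above.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://glhf.chat/api/openai/v1",
    api_key=os.environ["GLHF_API_KEY"],
)

response = client.chat.completions.create(
    model="hf:Qwen/Qwen2.5-Coder-32B-Instruct",  # or any GLHF model
    messages=[{"role": "user", "content": "Say hello!"}],
)
print(response.choices[0].message.content)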
Running out of GPU capacity is a problem: new model launches get delayed until our infrastructure providers can bring new GPUs online or reclaim old, unused capacity. To better handle low-capacity situations, we've expanded our backend to support more GPU types, making it less likely that capacity constraints will delay model launches.
Larger models like Nous Research's Hermes 3 can be hundreds of gigabytes; if there isn't already a machine running the model, downloading it can be quite slow. Previously, we just showed a spinner... For a very, very long time: potentially 10 minutes or longer. We'll still show a spinner while the machine goes through its initial boot, since we can't start downloading the model until we have a machine to download it to, but once the machine boots, we now show a progress UI for the download to help you see how much time is left.
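For a sense of scale: at an assumed 1 GB/s of download throughput, a 500 GB model takes over eight minutes to fetch, before any boot or load time. The progress reporting itself is the easy part once a total size is known; here's a generic illustration (not our actual downloader) of streaming a file while printing percent complete:

# Generic illustration of streamed-download progress, not our actual
# downloader: uses the Content-Length header to compute percent complete.
import sys

import requests

def download_with_progress(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        total = int(response.headers.get("Content-Length", 0))
        done = 0
        with open(dest, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
                done += len(chunk)
                if total:
                    sys.stderr.write(f"\r{done / total:6.1%} of {total:,} bytes")
        sys.stderr.write("\n")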
We updated the version of vLLM we run in production (along with our CUDA version) for better model support. Models that are either newly supported or increasingly stable include:
You've probably noticed the site becoming smoother and easier on the eyes over the last couple of months. That's no accident, and we'll keep working on it!
If you made it here, thank you so much for your support and for being a part of our journey!
We're hard at work on more improvements, including but not limited to:
If you have any thoughts or feedback, please continue to reach out at [email protected]. We appreciate all the emails we've gotten so far!
— Matt & Billy