TL;DR: Check it out at oss.zuericitygpt.ch.

There's a recurring discussion, both internally and with customers, about whether we really need to send all data to OpenAI (or commercial alternatives). Aren't there open-source alternatives that could be run entirely in Switzerland, or even in one's own data center? In certain cases, privacy and data protection are absolutely critical, and local hosting can simplify the legal side considerably.

The answer is yes, there are open-source options, but they're not as convenient to set up as just using OpenAI or one of its competitors.

Our framework has always been able to send the chat part to any large language model (LLM), not just OpenAI's. There are many providers offering alternative models, and we wrote about this earlier in the year: Running Mixtral Self-Hosted (in that case, on a GPU service provider).

But in a retrieval-augmented generation (RAG) chatbot, the "G" (generating the answer) is only one part. The "R" (retrieval) is just as important, if not more so, these days. For the semantic search part of the retrieval, you need embeddings (vectors), and unfortunately, most "independent" SaaS providers of open-source LLM services don't offer these capabilities, at least not in Switzerland.
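To make that retrieval step concrete: semantic search boils down to embedding the question and comparing it to pre-computed chunk vectors. Here's a minimal sketch using pgvector (which we use anyway, see the PS); the `chunks` table and the `embed()` helper are illustrative assumptions, not our actual code.

```typescript
import pg from "pg";

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical helper that turns a text into an embedding vector,
// using whichever embedding model you have chosen.
declare function embed(text: string): Promise<number[]>;

async function semanticSearch(query: string, limit = 5) {
  const queryVector = await embed(query);
  // pgvector's <=> operator is cosine distance; smaller means more similar.
  const { rows } = await pool.query(
    `SELECT id, content, embedding <=> $1 AS distance
       FROM chunks
      ORDER BY embedding <=> $1
      LIMIT $2`,
    [JSON.stringify(queryVector), limit],
  );
  return rows;
}
```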

Another key component—technically not required, but still important—is a reranker. Cohere offers a very good one, which we use by default, and it can also be hosted in the Azure cloud for added privacy. But it's not open source.
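For those unfamiliar with rerankers: they take the query and each retrieved chunk as a pair, score how well they actually match, and keep only the best chunks for the generation step. A conceptual sketch, with `scorePair()` standing in for whichever reranker you plug in (Cohere's API, or a self-hosted model):

```typescript
// scorePair() stands in for any cross-encoder reranker; a higher score
// means the chunk is more relevant to the query.
declare function scorePair(query: string, chunk: string): Promise<number>;

async function rerank(query: string, chunks: string[], topN = 5) {
  const scored = await Promise.all(
    chunks.map(async (chunk) => ({
      chunk,
      score: await scorePair(query, chunk),
    })),
  );
  // Sort by relevance and keep only the top candidates for the "G" step.
  return scored.sort((a, b) => b.score - a.score).slice(0, topN);
}
```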

With this in mind, we finally took the time to explore fully open-source options, and now we can offer a chatbot built entirely on OSS models, which you can host yourself (or on some rented GPUs).

You can test the result at oss.zuericitygpt.ch. Compared to the original, OpenAI- and Cohere-based ZüriCityGPT, it may give you different answers, but they are usually just as good.

The main challenge wasn't getting it running in the first place; that basically meant writing a few new connectors and running the models locally. But we also wanted to be able to compare the results, and therefore needed a way for our framework to store different vectors for the different models (and, in the end, also for different chunking methods).
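One way to model that is to key each stored embedding by both the embedding model and the chunking method. The schema below is a simplified illustration of the idea, not our actual migration:

```typescript
// Simplified illustration: each chunk embedding is keyed by the embedding
// model and the chunking method, so several variants can coexist in the
// same database and be compared against each other.
const createChunkEmbeddings = `
  CREATE TABLE chunk_embeddings (
    id          BIGSERIAL PRIMARY KEY,
    document_id BIGINT NOT NULL REFERENCES documents (id),
    chunker     TEXT   NOT NULL,  -- e.g. 'recursive-512'
    model       TEXT   NOT NULL,  -- e.g. 'bge-m3'
    content     TEXT   NOT NULL,
    embedding   VECTOR NOT NULL   -- no fixed dimension; it varies per model
  );
`;
```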

We can now test different models for embeddings, chunk sizes, reranking, and text generation in all combinations we want.

What did we use for the OSS version of ZüriCityGPT?

  • For the chat/text-generation part, we use Swiss-based Infomaniak's LLM offering with the Llama 3 70B model, which strikes a good balance between price and performance (see the sketch after this list).
  • For embeddings, we use the bge-m3 model.
  • For reranking, we use the bge-reranker-v2-m3 model.
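
Many hosted OSS-LLM offerings expose an OpenAI-compatible API, so switching the generation part can be as small as changing a base URL and a model name. A sketch using the openai npm package; the endpoint URL and the exact model identifier are provider-specific and assumptions here:

```typescript
import OpenAI from "openai";

// Point the standard OpenAI client at the provider's OpenAI-compatible
// endpoint; base URL and model name depend on the provider.
const client = new OpenAI({
  baseURL: process.env.LLM_BASE_URL,
  apiKey: process.env.LLM_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "llama-3-70b-instruct", // whatever identifier the provider uses
  messages: [{ role: "user", content: "Wie beantrage ich eine Identitätskarte?" }],
});

console.log(completion.choices[0].message.content);
```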

Currently, the embeddings and reranker are hosted using RunPod's serverless offering. It's easy to set up and cost-effective to keep available. However, depending on how "cold" the server is, the initial response may take up to 30 seconds—a fair trade-off for a demo. We will try to keep it warm over the next few days.
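Calling such a serverless endpoint is just an HTTP request; RunPod queues the job and spins up a worker if none is warm, which is where those cold-start seconds go. A client-side sketch; the input and output field names are defined by the worker's handler, so the ones below are assumptions:

```typescript
// Synchronously invoke a RunPod serverless endpoint. A cold worker may
// take a while to start, hence the slow first response.
async function embedViaRunpod(texts: string[]): Promise<number[][]> {
  const res = await fetch(
    `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID}/runsync`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.RUNPOD_API_KEY}`,
      },
      body: JSON.stringify({ input: { texts } }), // schema set by the worker
    },
  );
  const data = await res.json();
  return data.output.embeddings; // output field names are assumptions
}
```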

Embedders and rerankers don't require powerful GPUs with lots of VRAM (unlike text completion LLMs for chat). A GPU with 16GB of VRAM is more than sufficient. We even tried running them on a Mac Mini M1 in our office: the embedding performance was fast enough, but the reranker was a bit too slow, so we didn't go that route in the end.

We haven't tested many other OSS models for these tasks—for this proof of concept, we were satisfied with the ones we chose, which are often considered some of the best.

Enjoy it! If you have questions, want to know more, or would like to host such a chatbot yourself, don't hesitate to contact us.

PS: Our code itself isn't open source (yet), but it builds on a lot of open-source components like LangChain/LangGraph, pgvector, Cheerio, NestJS, and more. It's not totally out of the question that we'll open-source it one day, just not today.

PPS: I'm aware that in the LLM world, the term "open source" is handled much more loosely than the Open Source Initiative thinks it should be. Time will tell where this will lead.