Since the first release of ZĂŒriCityGPT, we have done numerous experiments and explorations together with clients and interested parties. We have learned a great deal about the technology and have expanded its capabilities. And what was meant as a little innovation experiment ended up being a big thing for us at Liip.
And the icing on the cake: The project is nominated for the Master of the Best of Swiss Web Awards. If you're at the Award Night this Thursday, we'd be thrilled if you vote for ZĂŒriCityGPT!
We also did the same for many other cities and cantons. The implementations are often not as advanced and refined as the ZĂŒriCityGPT one, but maybe your area is among them and you can try it out.
- BĂ€rnCityGPT
- FribourgGPT
- GenevaGPT
- LausanneGPT
- LozÀrnCantonGPT
- LozÀrnCityGPT
- WintiGPT
- ZĂŒriCantonGPT
The technical background described in the original blog post is still what we use today. We haven't made significant changes to the underlying technology used.
Our first prototype was a so-called "naive RAG". Take your content, split it into smaller chunks, get vectors of them, do the semantic search with a vector from the query (retrieval), create a prompt (augmentation) and get the answer from the Large Language Model (generation). This is not rocket science and can be implemented without a big budget in any database that supports vectors. The results are impressive at first. But the more you dig in and see what users actually ask, the more room for improvement shows up.
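To illustrate, a minimal naive RAG flow could look like the sketch below. It assumes the OpenAI Node SDK and a PostgreSQL database with the pgvector extension; the table and column names are made up for this example.

```ts
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI();
const db = new Pool();

async function answer(question: string): Promise<string> {
  // Retrieval: embed the query and find the closest chunks via pgvector
  const { data } = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: question,
  });
  const queryVector = JSON.stringify(data[0].embedding);
  const { rows } = await db.query(
    "SELECT content FROM chunks ORDER BY embedding <=> $1 LIMIT 5",
    [queryVector]
  );

  // Augmentation: put the retrieved chunks into the prompt
  const context = rows.map((r) => r.content).join("\n---\n");
  const prompt = `Answer the question based only on the following context:\n${context}\n\nQuestion: ${question}`;

  // Generation: let the LLM answer
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: prompt }],
  });
  return completion.choices[0].message.content ?? "";
}
```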
That's what we did over the last months, and here's a list of what we added to ZĂŒriCityGPT along with all the learnings.
Crawling the Data
Since we usually don't have access to the source data (the CMS database for example), we need to first crawl the whole site.
We respect robots.txt and exclude URLs that are not suitable for our chatbot (not just images and similar content, but also certain paths configured by us). We store all these sources in our database. PDFs are converted to text to save space.
Having all the sources in the database allows us to easily redo the relevant document extraction later without needing to download the whole site again.
We learned a lot about how redirects and canonical URLs work on different sites. This required special handling as well, since not every CMS handles them the same way.
Later Updates of the Data
When we update our sources, we fetch the sitemap.xml of the site and compare it with our database to identify which pages need an update. We follow the links of those updated pages as well to fetch, for example, added PDFs or sub-sites not listed in the sitemap.
Not all websites have a comprehensive sitemap.xml or reliable last modification dates, so this required some creativity to stay up to date.
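A minimal sketch of that comparison could look like this; the `sources` table and helper names are assumptions for illustration, using fast-xml-parser to read the sitemap.

```ts
import { Pool } from "pg";
import { XMLParser } from "fast-xml-parser";

const db = new Pool();

// Fetch sitemap.xml and return URL -> lastmod (if present)
async function fetchSitemap(url: string): Promise<Map<string, string | null>> {
  const xml = await (await fetch(url)).text();
  const parsed = new XMLParser().parse(xml);
  const entries: any[] = parsed.urlset?.url ? [].concat(parsed.urlset.url) : [];
  const map = new Map<string, string | null>();
  for (const u of entries) map.set(u.loc, u.lastmod ?? null);
  return map;
}

// Compare against what we already have and return the URLs to re-fetch
async function findOutdated(sitemapUrl: string): Promise<string[]> {
  const sitemap = await fetchSitemap(sitemapUrl);
  const { rows } = await db.query("SELECT url, last_modified FROM sources");
  const known = new Map(rows.map((r) => [r.url, r.last_modified]));

  const outdated: string[] = [];
  for (const [url, lastmod] of sitemap) {
    const stored = known.get(url);
    if (!stored || (lastmod && new Date(lastmod) > new Date(stored))) {
      outdated.push(url); // new page or changed since our last crawl
    }
  }
  return outdated;
}
```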
Rechecking for 404s
Checking the sitemap.xml doesn't reveal which pages have been deleted. We could compare our sources with the pages listed in the sitemap.xml and delete those no longer present. However, since many valid pages are not listed in the sitemap.xml, this is not typically the best approach.
Therefore, we regularly recheck all the pages recently used in a search for a 404 status code and remove them from the index if necessary.
And not all sites return a 404 when a page is gone. I've seen everything now.
From time to time, we just recheck all the pages to maintain a clean index.
Chunking / Indexing
Since the amount of data we can send in a prompt is limited and to get a more accurate retrieval, we need to split the source into smaller chunks.
HTML Data Extraction with Cheerio and Markdown
From the HTML sources, we use cheerio to get the relevant content, excluding headers, footers and much more. This is a manual process the first time we set up a new site and, depending on the structure of a website, sometimes more cumbersome than it should be.
Afterwards we convert the remaining HTML to markdown. Markdown keeps structured content like titles, and LLMs understand that pretty well.
We then chunk the markdown along titles to keep the semantic context.
Sometimes we need to add semantic context before converting it to markdown. Some sites, for example, do not use HTML title tags where they should (e.g. using bold text instead).
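Roughly, the extraction and chunking step can be sketched like this. We use cheerio as mentioned above; the HTML-to-Markdown conversion shown here uses turndown as an illustrative choice, and the selectors are of course site-specific.

```ts
import * as cheerio from "cheerio";
import TurndownService from "turndown";

const turndown = new TurndownService({ headingStyle: "atx" });

function htmlToChunks(html: string): string[] {
  // Keep only the relevant content, drop navigation, header, footer etc.
  const $ = cheerio.load(html);
  $("header, footer, nav, script, style, .cookie-banner").remove();
  const mainHtml = $("main").html() ?? $("body").html() ?? "";

  // Convert to markdown so headings and lists survive for the LLM
  const markdown = turndown.turndown(mainHtml);

  // Chunk along headings to keep the semantic context together
  return markdown
    .split(/\n(?=#{1,3} )/)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);
}
```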
PDF to Text
A significant amount of information is often contained in PDFs, which convinced us to begin crawling them as well.
For text extraction from PDFs, we utilise the pdf-parse library. This process is far from perfect, as retrieving the original text from PDFs is always challenging, no matter which tool you use. Content such as footnotes, header lines, page numbers, a lack of semantic titles, and particularly data in tables can be difficult. We have implemented some "best case" assumptions in our code to extract better context.
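A minimal extraction with pdf-parse looks roughly like this; the cleanup steps are simplified examples of the kind of "best case" assumptions mentioned above.

```ts
import { readFile } from "node:fs/promises";
import pdf from "pdf-parse";

async function pdfToText(path: string): Promise<string> {
  const buffer = await readFile(path);
  const { text } = await pdf(buffer);

  // Example cleanup heuristics: drop bare page numbers and collapse blank lines
  return text
    .split("\n")
    .filter((line) => !/^\s*\d+\s*$/.test(line))
    .join("\n")
    .replace(/\n{3,}/g, "\n\n");
}
```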
Metadata
We extract additional metadata from the pages, including last modification date, description, keywords, language, and images from the metadata in the HTML head or other indicators. Sometimes, we also enhance the title with data from the breadcrumb navigation. This metadata helps in retrieval later and provides valuable information to users in the search results.
Summaries
Recently, we began generating brief summaries using an LLM for pages lacking a proper og:description and for PDFs. These summaries are displayed alongside the search results, offering a clearer overview of the page contents.
Retrieval
Retrieval is the part where you get the most relevant chunks to answer a question from the database. It's also the most crucial and difficult one to get right.
Language Detection
We start the retrieval part with language detection on the query using the cld library to assist the LLM in responding in the appropriate language. The library is not perfect, especially not for short texts. We also do crude string matching to identify the most commonly used languages.
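With the cld library, the detection itself is short; the fallback string matching shown here is just an illustration of the idea, not our actual word lists.

```ts
import cld from "cld";

async function detectLanguage(query: string): Promise<string> {
  try {
    const result = await cld.detect(query);
    if (result.reliable && result.languages.length > 0) {
      return result.languages[0].code; // e.g. "de", "fr", "en"
    }
  } catch {
    // cld throws on very short or ambiguous input
  }

  // Crude fallback for short queries in the most common languages (illustrative)
  if (/\b(wie|wo|wann|welche)\b/i.test(query)) return "de";
  if (/\b(comment|oĂč|quand|quel)\b/i.test(query)) return "fr";
  return "de"; // sensible default for a Zurich chatbot
}
```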
Named Entity Recognition
We perform Named Entity Recognition (NER) using an Azure service. This identifies specific words for later use in full-text searches. Traditional semantic searches often struggle with people's names, prompting us to extract and search for these terms via full text. People love to search for their own or other people's names.
Today, we also extract other keywords through NER to improve page retrieval relevance.
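The sketch below uses the Azure Text Analytics SDK (@azure/ai-text-analytics) as one possible way to do this; treat the filtered categories as an example rather than our exact configuration.

```ts
import { TextAnalyticsClient, AzureKeyCredential } from "@azure/ai-text-analytics";

const client = new TextAnalyticsClient(
  process.env.AZURE_LANGUAGE_ENDPOINT!,
  new AzureKeyCredential(process.env.AZURE_LANGUAGE_KEY!)
);

// Extract entities (names, places, ...) to feed into the full-text search
async function extractEntities(query: string, language: string): Promise<string[]> {
  const [result] = await client.recognizeEntities([query], language);
  if (result.error) return [];
  return result.entities
    .filter((e) => ["Person", "Location", "Organization"].includes(e.category))
    .map((e) => e.text);
}
```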
Query Routing
Sometimes our semantic search does not find the correct documents. For such queries, we want to provide hints or take different actions (see "Functions" below).
To do this, we classify common questions via vectors, perform a semantic search on these classifications, and if they match certain categories ("routes"), we modify the prompt, prioritise certain documents, or invoke a function.
For example, questions like "How much taxes do I have to pay for a certain income?" do not have direct answers from the City of Zurich but do from the Canton of Zurich. We have included the canton's tax calculator page in our database and prioritise it for such inquiries.
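A simplified sketch of such a route check: the example questions of each route are embedded once, and incoming queries are matched against them with the same similarity measure used for retrieval. Names, URLs and thresholds below are illustrative.

```ts
interface Route {
  name: string;
  exampleQuestions: string[];
  priorityUrls?: string[]; // pages to prioritise when this route matches
  extraPromptInstructions?: string;
}

const routes: Route[] = [
  {
    name: "taxes",
    exampleQuestions: [
      "How much taxes do I have to pay for a certain income?",
      "Wie viel Steuern zahle ich bei meinem Einkommen?",
    ],
    priorityUrls: ["<url of the cantonal tax calculator page>"], // placeholder
    extraPromptInstructions: "Point the user to the cantonal tax calculator.",
  },
];

// The example questions are embedded once; an incoming query embedding is
// compared against them and the best route above a threshold wins.
function matchRoute(
  queryEmbedding: number[],
  routeEmbeddings: { route: Route; embedding: number[] }[]
): Route | null {
  let best: { route: Route; score: number } | null = null;
  for (const { route, embedding } of routeEmbeddings) {
    const score = cosineSimilarity(queryEmbedding, embedding);
    if (!best || score > best.score) best = { route, score };
  }
  return best && best.score > 0.85 ? best.route : null;
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((s, v) => s + v * v, 0));
  const normB = Math.sqrt(b.reduce((s, v) => s + v * v, 0));
  return dot / (normA * normB);
}
```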
Prompt Injection
Similarly, we monitor for potential prompt injection attacks and halt the process if detected. While our prompt is not highly confidential, monitoring these attempts is crucial for projects where the prompt may be more sensitive.
Functions
Questions like "When is cardboard collection in Kreis 3?" are very common. The city of Zurich's website has a form for this inquiry (and some PDFs for each area), which a "naive" RAG cannot use to answer directly. Using the function calling approach from OpenAI, we can detect these questions, extract the necessary information, and call the OpenERZ API for the next collection dates. We also translate street names, city quarter names, and "Kreis" numbers into ZIP codes before calling the API.
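A reduced sketch of that function-calling flow is shown below. The tool schema, the zipForArea helper and the exact OpenERZ query parameters are illustrative assumptions; see openerz.metaodi.ch for the real API.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "get_collection_dates",
      description: "Get the next waste collection dates for an area of Zurich",
      parameters: {
        type: "object",
        properties: {
          wasteType: { type: "string", enum: ["cardboard", "paper", "cargotram"] },
          area: { type: "string", description: "Street, quarter or Kreis" },
        },
        required: ["wasteType", "area"],
      },
    },
  },
];

// Hypothetical mapping, in reality this covers streets, quarters and Kreis numbers
function zipForArea(area: string): string {
  const kreisToZip: Record<string, string> = { "3": "8003", "4": "8004", "5": "8005" };
  const kreis = area.match(/kreis\s*(\d+)/i)?.[1];
  return kreis ? kreisToZip[kreis] ?? "8001" : "8001";
}

async function maybeCallFunction(question: string) {
  const res = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: question }],
    tools,
  });

  const toolCall = res.choices[0].message.tool_calls?.[0];
  if (!toolCall) return null;

  const { wasteType, area } = JSON.parse(toolCall.function.arguments);
  const zip = zipForArea(area);
  // Illustrative call, the real OpenERZ parameters may differ
  const api = `https://openerz.metaodi.ch/api/calendar.json?types=${wasteType}&zip=${zip}&sort=date`;
  return (await fetch(api)).json();
}
```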
Full Text Search
As mentioned above, we have implemented a full-text search for certain terms to improve search results, which is particularly useful for names and special terms. We leverage PostgreSQL's full-text search capabilities for this purpose.
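The underlying query is plain PostgreSQL full-text search, roughly like this (table and column names assumed; in production you would store a pre-computed tsvector column with a GIN index):

```ts
import { Pool } from "pg";

const db = new Pool();

// Full-text search with PostgreSQL, useful for names and special terms
// that semantic search tends to miss ("german" config handles stemming).
async function fullTextSearch(terms: string[], limit = 10) {
  const { rows } = await db.query(
    `SELECT url, title,
            ts_rank(to_tsvector('german', content), query) AS rank
       FROM chunks, plainto_tsquery('german', $1) AS query
      WHERE to_tsvector('german', content) @@ query
      ORDER BY rank DESC
      LIMIT $2`,
    [terms.join(" "), limit]
  );
  return rows;
}
```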
Metadata/Keyword Search
Though not widespread nowadays, the city of Zurich's website utilises keywords. We search for these keywords, derived from NER, and combine them with full-text search results, improving search rankings for specific terms.
Another example is the search for StadtratsbeschlĂŒsse (city council resolutions) numbers. We even built a dedicated site for searching only those StadtratsbeschlĂŒsse at strb.zuericitygpt.ch.
Searching for city council resolution numbers is enhanced by adding these numbers to our metadata. This ensures that searches for specific resolutions directly influence the prompt used later.
Re-ranking with Cohere and Other Criteria
After identifying relevant documents, we use Cohere's Re-ranking API to improve their ranking. Since we typically send only about 3,500 tokens (~10,000 characters) to the LLM, this is crucial for prioritising the best documents.
We also apply additional heuristics, such as down-ranking older documents and PDFs, which are often less relevant.
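With the cohere-ai Node SDK, the re-ranking plus our extra heuristics can be sketched like this; the model name, weights and chunk shape are illustrative.

```ts
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

interface Chunk {
  url: string;
  content: string;
  isPdf: boolean;
  lastModified: Date;
}

// Re-rank the retrieved chunks against the query, then apply our own
// heuristics (down-rank PDFs and older documents) before building the prompt.
async function rerank(query: string, chunks: Chunk[], topN = 10): Promise<Chunk[]> {
  const response = await cohere.rerank({
    model: "rerank-multilingual-v3.0",
    query,
    documents: chunks.map((c) => c.content),
    topN,
  });

  return response.results
    .map((r) => ({ chunk: chunks[r.index], score: r.relevanceScore }))
    .map((r) => {
      let score = r.score;
      if (r.chunk.isPdf) score *= 0.9; // PDFs are often less relevant
      const ageYears = (Date.now() - r.chunk.lastModified.getTime()) / 3.15e10;
      if (ageYears > 2) score *= 0.8; // down-rank older documents
      return { ...r, score };
    })
    .sort((a, b) => b.score - a.score)
    .map((r) => r.chunk);
}
```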
Augmentation
After having identified the most relevant documents, we need to generate the prompt to be sent to the LLM.
Prompt Variation by Language
Depending on the detected language in the query, we choose a different prompt. It helps to make sure the LLM answers in the correct language if the prompt is in that same language.
If the question is in a language we don't have templates for, we instruct the LLM to answer in that language and hope for the best.
Route / Mode
Depending on the detected route, we may add/extend the prompt with more instructions to get the desired results.
We also added a mode switch: for example, we can choose how the bot should answer (answer the question vs. summarise the pages) depending on the selected mode.
Generation
The last part in the RAG acronym generates an answer by sending the augmented prompt to the LLM and returning the result to the client.
Multiple Model Support
Our backend is capable of supporting multiple models, facilitating comparisons or switching between them. Most models' API endpoints are compatible with the OpenAI API, which simplifies integration a lot.
An example can be found in this blog post about self-hosting mixtral or by directly comparing three models at mixtral.zuericitygpt.ch.
We still use GPT-3.5 by default, as we did in the beginning, mainly for cost reasons for such a demo, and it's often good enough. GPT-4 and other modern models can give more consistent results; depending on the use case and budget, it is worth switching to them.
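Because most endpoints speak the OpenAI protocol, switching models is mostly a matter of changing the base URL and model name. A sketch (the self-hosted URL is a placeholder):

```ts
import OpenAI from "openai";

// The same client code works for OpenAI or a self-hosted Mixtral server,
// as long as the endpoint is OpenAI-compatible.
const providers = {
  openai: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  mixtral: new OpenAI({
    apiKey: "not-needed", // self-hosted servers often ignore the key
    baseURL: "http://localhost:8000/v1", // placeholder URL for a self-hosted model
  }),
};

async function generate(provider: keyof typeof providers, model: string, prompt: string) {
  return providers[provider].chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
}
```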
Streaming to Client
From the beginning, we have streamed answers from the LLM and backend to the client. This is vital for a satisfactory chatbot experience, as users do not want to wait 30+ seconds for a response. We utilise Server-Sent Events (SSE), a well-supported and established technique.
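A condensed sketch of the streaming path with Express and the OpenAI SDK; the route and event format are illustrative.

```ts
import express from "express";
import OpenAI from "openai";

const app = express();
const openai = new OpenAI();

app.get("/api/answer", async (req, res) => {
  // SSE headers: keep the connection open and flush token by token
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const stream = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: String(req.query.q) }],
    stream: true,
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }
  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```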
Displaying Search Results right away
Sometimes the relevant pages are all you need instead of a long chatbot answer. A few weeks ago, we started displaying them on the left side of the UI as soon as we have them, together with AI-generated summaries.
The user doesn't need to wait until the end of the answer anymore, speeding up some use cases. It also shows the potential power of a semantic search engine compared to a "traditional" full text search. For such an approach, you could just not generate the chatbot answer and only show the search results.
Citing sources
The first version already had this, but it is still super important: citing the sources the answer comes from, so that the chatbot's answer can be validated. Especially in a government setting.
After
When the RAG part is done, we do some additional work for metrics and quality assurance.
History Logging
We log all questions and answers (and related data) in our database. This allows us to analyse, cluster, and evaluate the most frequently asked questions and their answers, ensuring relevant pages are identified.
Quality Evaluation
We have been attempting to crudely rate answer quality for some time. We ask the LLM to evaluate the plausibility of its response on a scale from 1 to 6, helping us identify and filter out some of the less accurate answers.
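That self-evaluation is essentially one extra LLM call. A sketch of the idea (the wording is illustrative, not our production prompt):

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Ask the LLM to grade its own answer from 1 (implausible) to 6 (very plausible).
async function rateAnswer(question: string, answer: string, context: string): Promise<number> {
  const res = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "user",
        content:
          `Given the context below, rate on a scale from 1 to 6 how plausible ` +
          `and well-supported the answer to the question is. Reply with the number only.\n\n` +
          `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer: ${answer}`,
      },
    ],
  });
  const rating = parseInt(res.choices[0].message.content ?? "", 10);
  return Number.isNaN(rating) ? 0 : rating;
}
```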
Recently, we began collaborating with fore ai, a Zurich startup in the LLM space, to obtain more accurate benchmark numbers, even in near real-time. See their webpage for more info on what they have to offer.
User Rating
We have implemented a simple thumbs up/down rating system. Although not widely used by our users, a positive rating brings us joy, and a negative one prompts further investigation. Sometimes, the issue lies just with the chatbot's inability to access certain information (e.g., questions that are a cantonal rather than a city responsibility, or requests for non-public data).
For an upcoming project, we are introducing more metrics for users to provide feedback, aiming to make it easier to identify and address the most significant issues.
Feedback
Written feedback is even rarer than ratings, but when received, it is invaluable. Especially when an answer does not meet expectations. Understanding the reasons behind a negative rating can be challenging without specific feedback.
What We Don't Do (Yet?)
Fine-Tuning
Although we have experimented with fine-tuning an LLM, we have concluded that it is not currently beneficial for our use case. However, it could become useful in the future with our ongoing and planned projects.
CMS Plugins
It would be preferable to have a plugin in the website's CMS that notifies us when something changes, or that uploads/exports the entire data set. The chatbot would then have the current data within seconds, without the need for recrawling. Our backend has an API endpoint for this, but we haven't had the opportunity to use it directly in a CMS yet.
Knowledge Graph RAG
While we did a Proof of Concept with Linked Data back in August, we haven't experimented with a graph-database-based RAG yet. It is an interesting concept with promising ideas.
Conclusion
A simple, naive RAG-based chatbot can be built within a short time, and there are a lot of tools and libraries out there that make it even easier. But to get the most out of this approach, a lot more has to be done. The most important thing is to monitor and understand what users ask, check the chatbot's results and adjust your source content and retrieval accordingly. This takes time, experience and a deep understanding of how LLMs work.