At Akkio, we run a large-scale machine learning platform with tens of thousands of users. The end result is polished, but there is considerable complexity behind the scenes, with data flowing between many services and coming together for a smooth end-user experience.
We’ve been able to use OpenAI’s LLMs to great effect. For instance, we use both GPT-3.5 and GPT-4 in our recently launched Chat Explore feature, which lets data analysts ask questions about their dataset in plain English and get answers and suggested prompts back, with the LLMs doing the heavy lifting.
The behind-the-scenes sequence diagram looks something like this.
However, delegating so much work to OpenAI leads to two main problems that we’re gradually improving our approach to: latency for end users, and slow, expensive automated tests.
This post covers one technique that helps with both, especially in automated testing: caching responses.
The simplest way to accomplish this is with key-based caching, a concept which is essentially database-agnostic. We use Firestore, a NoSQL, document-oriented database, for a lot of things, so we initially adopted a data model where the key is the original message and the value is the response. Here’s a rough idea of what this looked like.
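As a rough sketch of that data model (names are illustrative; a plain dict stands in for the Firestore document store, and `call_openai` is a placeholder for the real API call):

```python
from typing import Callable, Dict

def cached_completion(
    prompt: str,
    call_openai: Callable[[str], str],
    store: Dict[str, str],
) -> str:
    """Return a cached response for `prompt`, calling the API on a miss."""
    if prompt in store:           # cache hit: skip the network round-trip
        return store[prompt]
    response = call_openai(prompt)
    store[prompt] = response      # cache the response, keyed by the raw prompt
    return response
```

The same shape works with any key-value store; in Firestore terms, the prompt would be the document ID and the response a field on the document.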
One important note here - we only cache zero-temperature queries. If we’re expecting to get varied outputs, caching doesn’t really make sense.
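In code, that guard might look like the following sketch, where `temperature` mirrors the OpenAI sampling parameter of the same name:

```python
from typing import Callable, Dict

def completion_with_cache(
    prompt: str,
    temperature: float,
    call_openai: Callable[[str], str],
    store: Dict[str, str],
) -> str:
    # Sampled (temperature > 0) outputs are intentionally varied, so a
    # cached answer would defeat the purpose: bypass the cache entirely.
    if temperature != 0:
        return call_openai(prompt)
    if prompt not in store:
        store[prompt] = call_openai(prompt)
    return store[prompt]
```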
This worked well, but we eventually hit a limitation: Firestore caps the length of document IDs, which is a poor fit when the key is a very long prompt - for example, a long chat thread with GPT-3.5. So, how did we fix this?
Hashing is the missing piece here. Hash functions have exactly the attributes we’re looking for: they’re deterministic, they produce fixed-length output regardless of input length, and collisions are vanishingly unlikely in practice.
So, our solution hashes the prompt and uses the digest as the document key. We don’t need the key to be human-readable, just to reliably map a given input to its cached output, so hashes work fine. Specifically, we use SHA-256, which is more than sufficient for our purposes.
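A minimal sketch of such a key derivation (the function name is illustrative):

```python
import hashlib

def cache_key(prompt: str) -> str:
    """Map a prompt of any length to a fixed-length, Firestore-safe ID.

    SHA-256 is deterministic, so the same prompt always yields the same
    key, and its 64-hex-character digest stays well under Firestore's
    document ID size limit.
    """
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()
```

The digest then simply replaces the raw prompt as the lookup key in the caching code above.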
So, what’s the point of all this if we don’t actually know how much time it saves us? Well, we ran some rudimentary benchmarks!
Our benchmark file looks like this:
Nothing complex here: we run the same input against OpenAI twice and time both calls, with the second run served from the cache. Our time comparison looks like this:
We cut our response time by nearly two orders of magnitude! This translates into big time savings for our end users.
Now, given that most of our prompts vary significantly, we don’t get cache hits very often in production - more on that in the next section. Where this really shines, though, is automated tests. Our integration tests often have to wait on an OpenAI round trip; the second time a given test runs, it gets a cache hit and the associated time savings. You might worry that this hides potential errors, but we cache as close as possible to the actual OpenAI call itself, so everything else in the pipeline is still exercised on every run.
We’ve seen great success from even just the hashed key-based caching outlined in this article, and we plan to expand into semantic caching with GPTCache (https://github.com/zilliztech/GPTCache) in the near future. We hope this gave you some ideas on how to apply these techniques to your own infrastructure!