At Akkio, we run a large-scale machine learning platform with tens of thousands of users. The end result is polished, but there is considerable complexity behind the scenes, with data flowing between many services and coming together for a smooth end-user experience.
We’ve been able to use OpenAI’s LLMs to great effect. For instance, we use both GPT-3.5 and GPT-4 in our recently launched Chat Explore feature, which lets data analysts ask questions about their dataset in plain English and get answers and suggested prompts back, with the LLMs doing the heavy lifting.
The behind-the-scenes sequence diagram looks something like this.
However, delegating so much work to OpenAI leads to two main problems that we’re gradually improving our approach to: latency for end users, and slow, expensive automated tests.
This post covers one technique that helps with both, especially in automated testing: caching responses.
The simplest way to accomplish this is with key-based caching, a concept which is essentially database-agnostic. We use Firestore, a NoSQL, document-oriented database, for a lot of things, so we initially adopted a data model where the key is the original message and the value is the response. Here’s a rough idea of what this looked like.
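As a rough sketch of that data model (names are illustrative; a plain dict stands in for the Firestore document store, and `call_openai` is a placeholder for the real API call):

```python
from typing import Callable, Dict

def cached_completion(
    prompt: str,
    call_openai: Callable[[str], str],
    store: Dict[str, str],
) -> str:
    """Return a cached response for `prompt`, calling the API on a miss."""
    if prompt in store:           # cache hit: skip the network round-trip
        return store[prompt]
    response = call_openai(prompt)
    store[prompt] = response      # cache the response, keyed by the raw prompt
    return response
```

The same shape works with any key-value store; in Firestore terms, the prompt would be the document ID and the response a field on the document.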
One important note here - we only cache zero-temperature queries. If we’re expecting to get varied outputs, caching doesn’t really make sense.
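In code, that guard might look like the following sketch, where `temperature` mirrors the OpenAI sampling parameter of the same name:

```python
from typing import Callable, Dict

def completion_with_cache(
    prompt: str,
    temperature: float,
    call_openai: Callable[[str], str],
    store: Dict[str, str],
) -> str:
    # Sampled (temperature > 0) outputs are intentionally varied, so a
    # cached answer would defeat the purpose: bypass the cache entirely.
    if temperature != 0:
        return call_openai(prompt)
    if prompt not in store:
        store[prompt] = call_openai(prompt)
    return store[prompt]
```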
This worked well, but we eventually hit a limitation: Firestore caps the length of document IDs, which is a poor fit when the key is a very long prompt - for example, a long chat thread with GPT-3.5. So, how did we fix this?
Hashing is the missing piece here. Hash functions have exactly the attributes we’re looking for: they’re deterministic, they produce fixed-length output regardless of input length, and collisions are vanishingly unlikely in practice.
So, our solution hashes the prompt and uses the digest as the document key. We don’t need the key to be human-readable, just to reliably map a given input to its cached output, so hashes work fine. Specifically, we use SHA-256, which is more than sufficient for our purposes.
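A minimal sketch of such a key derivation (the function name is illustrative):

```python
import hashlib

def cache_key(prompt: str) -> str:
    """Map a prompt of any length to a fixed-length, Firestore-safe ID.

    SHA-256 is deterministic, so the same prompt always yields the same
    key, and its 64-hex-character digest stays well under Firestore's
    document ID size limit.
    """
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()
```

The digest then simply replaces the raw prompt as the lookup key in the caching code above.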
So, what’s the point of all this if we don’t actually know how much time it saves us? Well, we ran some rudimentary benchmarks!
Our benchmark file looks like this:
Nothing complex here: we run the same input against OpenAI twice and time both calls, with the second run served from the cache. Our time comparison looks like this:
We cut our response time by nearly two orders of magnitude! This translates into big time savings for our end users.
Now, given that most of our prompts vary significantly, we don’t get cache hits very often in production - more on that in the next section. Where this really shines, though, is automated tests. Our integration tests often have to wait on an OpenAI round trip; the second time a given test runs, it gets a cache hit and the associated time savings. You might worry that this hides potential errors, but we cache as close as possible to the actual OpenAI call itself, so everything else in the pipeline is still exercised on every run.
We’ve seen great success from even just the hashed key-based caching outlined in this article, and we plan to expand into semantic caching with GPTCache (https://github.com/zilliztech/GPTCache) in the near future. We hope this gave you some ideas on how to apply these techniques to your own infrastructure!