
The Economics of LLMs

How to avoid bankruptcy scaling up large language models?



As AI startups grow, there’s a trend of sharing memes on Twitter about massive bills from OpenAI. Some companies are posting about receiving bills of $8,000 or even $25,000, which can amount to about 10% of a startup’s monthly recurring revenue.

In the past decade, we’ve seen similar situations with cloud service bills. Back then, teams didn’t worry too much because if their services gained popularity, they had access to almost unlimited venture capital. However, in today’s climate, with the end of the zero interest rate policy era, companies need to be much more mindful of costs right from the start.

So, the big question is, how can we reduce costs? Naturally, the main solutions include developing more efficient models and improving hardware. However, we can also apply software engineering or prompt engineering techniques to cut expenses. This article explores the following strategies:

  • Trimming prompts and responses to minimize token usage
  • Implementing caching, including both exact matches and semantic caching for approximate matches
  • Optimizing models through fine-tuning and deploying smaller models through the AI router design pattern

This post leans more towards the technical side. I’m deeply interested in practical implementation techniques. I want to ensure Before Growth doesn’t turn into a purely theoretical business blog disconnected from real-world practices.

Let’s dive in.

Condensing prompts and responses

We’ll begin with the basics. Since LLM cloud providers charge primarily for the tokens used, reducing the number of tokens in each request lowers expenses. Because we can’t always control user input, it makes sense to look for efficiencies in system prompts and in the model’s responses.

  • System prompts can be manually shortened, or we can ask a tool like ChatGPT to do it for us. As I explained in Corrections, the LLM itself can often rephrase its own prompts to make them more compliant, and the same technique is effective at reducing their length.
  • We can also use summaries. For example, we can summarize a document once, incurring the full cost, and then use the summary for all further processing. This cuts the number of tokens in every subsequent request while preserving the most important information (see the sketch below).
  • For the model’s responses, we can ask it to be less verbose or to follow instructions such as replying in just a single sentence.
Corrections
Uncovering the quirks and capabilities of ChatGPT using Ruby on Rails.
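
Here’s what the summarize-once approach from that list might look like in practice. This is a minimal sketch assuming the openai Python SDK’s v1-style client; the model choice and the 200-word limit are illustrative, not requirements.

# Summarize a long document once, then reuse the short summary for every
# follow-up question so each request consumes far fewer input tokens.
from openai import OpenAI

client = OpenAI()

def summarize(document: str) -> str:
    # Pay the full token cost a single time to produce a compact summary.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize the document in at most 200 words."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

def answer(question: str, summary: str) -> str:
    # Follow-up requests work off the summary instead of the raw document.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"Answer based on this summary:\n{summary}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

summary = summarize(open("report.txt").read())
print(answer("What are the key risks mentioned?", summary))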

While these strategies might seem simple, they’re not trivial. If you look into the leaked system prompt for ChatGPT, you’ll discover that its developers have explicitly instructed it to conserve computing resources. This includes directives to avoid verbosity, such as the guideline to “never write a summary with more than 80 words” in the prompt. If OpenAI sees savings opportunities in commands like this, you can benefit from them as well.

If you’re really looking for something more advanced, there’s LLMLingua by Microsoft. This tool uses a compact, well-trained language model such as GPT-2-small or LLaMA-7B to pinpoint and remove non-essential tokens in prompts. This allows for efficient processing, achieving up to 20x compression while keeping performance loss to a minimum.
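
In code, the basic usage is only a few lines. The sketch below follows the examples in LLMLingua’s documentation; argument names and return fields may differ between versions, so treat it as an outline rather than a drop-in snippet.

# Compress a long prompt with LLMLingua before sending it to an expensive model.
# Based on the project's published examples; details may vary by release.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small model (LLaMA-7B by default)

retrieved_chunks = ["...long context, e.g. documents retrieved for RAG..."]
instruction = "Answer the question using the context."
question = "What does the report conclude?"

result = compressor.compress_prompt(
    retrieved_chunks,
    instruction=instruction,
    question=question,
    target_token=500,  # token budget for the compressed prompt
)

# Send the compressed prompt to GPT-4 instead of the full context.
print(result["compressed_prompt"])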

To me, investing in such frameworks really pays off when you’re handling highly complex prompts or when doing stuff like retrieval-augmented generation. However, as the tech evolves, we’re seeing new features, like Google Gemini’s 1 million token context window, enabling users to literally put entire books into these models. If history from the past decade has shown us anything, it’s that people will continue to push the boundaries in unexpected ways with these technologies. So, approaches like these could become increasingly valuable as well.

Exact caching

Caching is a technique familiar to programmers across many fields, not just those working with AI. If you’re using a framework like LangChain, which is optimized for developing applications powered by language models, you might find caching features already built in. This means you can easily incorporate it into your app without much hassle.

Here’s an example.

from langchain.globals import set_llm_cache
from langchain_openai import OpenAI

# To make the caching really obvious, let's use a slower model
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", n=2, best_of=2)
%%time
from langchain.cache import InMemoryCache

set_llm_cache(InMemoryCache())

# The first time, the input is not yet in the cache, so the request should take longer
llm.predict("What's GitHub?")
CPU times: user 13.7 ms, sys: 6.54 ms, total: 20.2 ms
Wall time: 330 ms
%%time
# The second time it is, so we go faster
llm.predict("What's GitHub?")
CPU times: user 436 µs, sys: 921 µs, total: 1.36 ms
Wall time: 1.36 ms

When the framework accesses the cache for the second time, it skips connecting to your provider’s API and fetches the same answer from the data store. This not only reduces costs but also speeds things up: CPU time drops by almost 15 times, and wall-clock time by even more.

However, there are downsides, such as increased complexity, but I won’t go into more detail on that—every engineer knows the problems caching can create. And to be fair, you don’t necessarily need LangChain to set up exact caching. It’s easy to implement in any programming language or framework. For example, the effort would be similar even in Ruby on Rails, which is my usual coding environment.
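
To make that concrete, here’s a bare-bones exact cache with no framework at all. It’s a sketch in Python rather than Ruby, assuming the openai SDK; a production version would add expiry and a shared store such as Redis.

# The simplest possible exact cache: a dictionary keyed by the prompt string.
from openai import OpenAI

client = OpenAI()
_cache: dict[str, str] = {}

def cached_completion(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    if prompt in _cache:
        return _cache[prompt]  # cache hit: no API call, no tokens billed
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    _cache[prompt] = answer  # cache miss: store the answer for next time
    return answer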

But there are some unique downsides to caching with LLMs that many might find new. One major issue is that the response from the model will remain unchanged until the cache expires. This might work well for certain AI products, but it’s less than ideal for others—particularly those focused on content generation. For example, if you ask an LLM to write a blog post and it produces the same one every time, it clearly is not very good at its job. However, in the case of a customer support chatbot, this might not be a concern at all.

Semantic caching

The second issue becomes visible soon after implementing exact caching. One user might say “Tell me a joke,” while another asks “Do you know any jokes?” Because these sentences don’t match exactly, the cache will be bypassed.

This is where semantic caching and tools like GPTCache become valuable. GPTCache uses embedding algorithms to transform queries into embeddings, employing a vector store for similarity searches on these embeddings. Through this method, GPTCache can recognize and fetch similar or related queries from the cache, enhancing efficiency.
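
Under the hood, the mechanism looks roughly like the sketch below. It hand-rolls the idea with OpenAI embeddings and a brute-force similarity check; GPTCache replaces the loop with a proper vector store and tuned thresholds, and the 0.9 cutoff here is an arbitrary choice.

# A hand-rolled semantic cache: embed each query and reuse a stored answer
# whenever a new query's embedding is close enough to a cached one.
import numpy as np
from openai import OpenAI

client = OpenAI()
_entries: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def _embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def semantic_lookup(query: str, threshold: float = 0.9) -> str | None:
    q = _embed(query)
    for vector, answer in _entries:
        similarity = np.dot(q, vector) / (np.linalg.norm(q) * np.linalg.norm(vector))
        if similarity >= threshold:
            return answer  # close enough in meaning: treat it as a hit
    return None

def semantic_store(query: str, answer: str) -> None:
    _entries.append((_embed(query), answer))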

We can integrate GPTCache with LangChain to enhance our previous example.

import hashlib

from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain.cache import GPTCache
from langchain.globals import set_llm_cache

def get_hashed_name(name):
    # Hash the model name so each LLM gets its own cache directory
    return hashlib.sha256(name.encode()).hexdigest()

def init_gptcache(cache_obj: Cache, llm: str):
    # Use similarity-based (semantic) caching instead of exact matching
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")

set_llm_cache(GPTCache(init_gptcache))
%%time
# This is an exact match, so it finds it in the cache
llm("What's GitHub?")
"GitHub is a developer platform that allows developers to create, store, manage and share their code."
%%time
# This is not an exact match, but semantically within distance so it hits!
llm("Explain what GitHub is.")
"GitHub is a developer platform that allows developers to create, store, manage and share their code."

This time, even though our second query wasn’t identical to the first, we still managed to hit the cache successfully.

This solution has its drawbacks, too. With a semantic cache, you might face false positives during cache hits and false negatives during cache misses. So, not only have we added a caching system that increases complexity, but we’ve also introduced a particularly complex type of cache. Hopefully, when we weigh these challenges against potential savings, they will justify the effort involved.

🤔
Now you can see why opting for a dedicated framework like LangChain might be more optimal than querying external APIs directly. Both GPTCache and LLMLingua, which we discussed earlier, are available as LangChain integrations, so they can be chained together seamlessly. The more complex your required chains are, the more it makes sense to invest in a solid foundation to support them.

Fine-tuning and model-swapping

If you prefer not to use caching, there’s another strategy to consider. We’re in the middle of the AI boom; with the tech improving quickly, everyone wants to use the latest, state-of-the-art models. However, it can sometimes be more practical to opt for a less advanced LLM and tailor it to your specific needs through fine-tuning.

Fine-tuning is a method where a pre-trained model undergoes additional training on a smaller, specialized dataset. This process adjusts the model’s parameters to improve its performance on tasks related to this new data. It’s like an experienced chef refining a new recipe by tweaking their methods. This approach enables the model to become more specialized, boosting its effectiveness on specific tasks without having to be developed from the ground up.

For example, if we assign a task to GPT-4, it might perform well 80% of the time, while GPT-3.5 might only succeed in 60% of cases for the same task. However, by fine-tuning GPT-3.5 with sufficient specific examples demonstrating how to complete that task, it can eventually match the performance of its newer counterpart.

Research shows that fewer than 1000 data points can be enough for effective fine-tuning. Just 100 data points led to a 96% improvement in GPT-3.5’s ability to answer questions in JSON format, and 1000 data points were enough to surpass GPT-4 in generating raw responses. While GPT-4’s pricing is $0.03 per 1000 tokens for inputs and $0.06 per 1000 tokens for outputs, GPT-3.5’s costs are much lower, at only $0.0005 per 1000 tokens for inputs and $0.0015 per 1000 tokens for outputs. That’s a 60x cost reduction on input tokens and a 40x reduction on outputs!
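
If you go this route with OpenAI, starting the job itself takes very little code. The sketch below assumes the openai Python SDK’s v1-style client and a file called training.jsonl containing chat-format examples collected from GPT-4.

# Upload training examples and start a fine-tuning job for the cheaper model.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# Poll the job until it completes, then call the fine-tuned model by the
# name OpenAI assigns to it.
print(job.id)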


If you're interested, here’s a 4-step playbook you can follow.

Step 1. Begin with the most advanced model required for your application’s needs. For 95% of companies, this would be GPT-4, but probably not Turbo, as you’re aiming for the highest quality outputs. These will serve as the basis for fine-tuning a smaller model.

Step 2. Keep a record of your requests and responses in a format that allows for easy export.
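
A simple way to do this is to append every exchange to a JSONL file in the chat format that OpenAI’s fine-tuning endpoint expects. The sketch below is illustrative; in production you’d more likely log to a database and export to this format later.

# Append each prompt/response pair as one JSON line in chat fine-tuning format.
import json

def log_example(system_prompt: str, user_prompt: str, model_response: str,
                path: str = "training.jsonl") -> None:
    record = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": model_response},
        ]
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")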

