If you've ever used ChatGPT and received an error message or an inaccurate response, you might have wondered if a better alternative is available. After all, developers are currently flooding the large language model (LLM) market with new and updated models. Even as machine learning developers ourselves, keeping up with the capabilities of each new LLM is arduous.
In this article, we'll present a detailed comparison of three key players in the competitive LLM landscape: Anthropic's Claude 3.5 Sonnet, OpenAI's GPT-4o and Google's Gemini 1.5 Pro. Our machine learning team has worked with each of these models and will provide a robust, referenced analysis of each. Exploring price, explainability and more, we'll compare each LLM to crown a winner. Skip doing your own research: let's find out which LLM you should be using.
We’ll compare these current LLM models:
Each model is heavyweight or medium-weight, meaning it has billions of parameters and can handle a wide range of tasks (e.g. creative text generation and image interpretation).
Of course, each model has a lightweight counterpart with fewer parameters that is cheaper and faster: GPT-4o has GPT-4o Mini, Claude 3.5 Sonnet has Claude 3 Haiku, and Gemini 1.5 Pro has Gemini 1.5 Flash.
Though the size of the training dataset and overall model architecture would help compare each LLM’s performance, OpenAI, Anthropic and Google will likely never make this information publicly available. The reason? Doing so would make it easier for developers to replicate and outperform their models.
GPT-4o, Gemini 1.5 Pro and Claude 3.5 Sonnet all behave similarly when their capacity limits are exceeded, i.e. when you send too many requests per minute (RPM).
All three LLMs return a 429 error when this happens. So, if you're looking for the most generous rate limits, there is no standout model here.
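If you do hit these limits, the standard remedy is the same for all three providers: back off and retry. Here's a minimal sketch in Python, assuming a generic HTTP endpoint (the URL and payload shape are placeholders, not any vendor's real schema):

```python
import time

import requests

API_URL = "https://api.example.com/v1/chat"  # placeholder, not a real vendor endpoint
MAX_RETRIES = 5

def call_with_backoff(payload: dict, api_key: str) -> dict:
    """POST to an LLM API, backing off exponentially on 429 responses."""
    delay = 1.0
    for _ in range(MAX_RETRIES):
        response = requests.post(
            API_URL,
            json=payload,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=60,
        )
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Honour the Retry-After header if the provider sends one;
        # otherwise double the wait on every retry.
        time.sleep(float(response.headers.get("Retry-After", delay)))
        delay *= 2
    raise RuntimeError(f"Still rate-limited after {MAX_RETRIES} attempts")
```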
The rate limits of each LLM on a free tier are:
The rate limits for each paid version vary greatly depending on which model you select. For example, Gemini offers the most generous requests-per-day (RPD) limit at two million requests. In contrast, Claude allows one million, and GPT-4o has no RPD limit listed.
A context window is the amount of information an LLM can accept as input when comprehending or generating text. It's also where the differences between these LLMs become clear, as each has a different context window, measured in tokens. Tokens are small units of text such as words, subwords or characters. Here's a breakdown of our contenders.
For context, 200,000 tokens represent about 500 pages of text. In theory, this means that Gemini should retain more information for longer during conversations with the user, or process more information in a single request.
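To see how much of a context window a prompt will consume, you can count its tokens before sending it. Below is a rough sketch using OpenAI's tiktoken library; the window sizes are the publicised figures at the time of writing, and since Claude and Gemini use their own tokenisers, treat their counts as approximations:

```python
import tiktoken  # pip install tiktoken

# GPT-4o's tokeniser; Anthropic and Google use different ones, so these
# counts are only approximate for Claude and Gemini.
encoding = tiktoken.get_encoding("o200k_base")

CONTEXT_WINDOWS = {  # publicised sizes, in tokens, at the time of writing
    "gpt-4o": 128_000,
    "claude-3-5-sonnet": 200_000,
    "gemini-1.5-pro": 2_000_000,
}

def tokens_remaining(model: str, prompt: str, reserve_for_output: int = 4_096) -> int:
    """How many input tokens are left in the window after this prompt."""
    return CONTEXT_WINDOWS[model] - len(encoding.encode(prompt)) - reserve_for_output

print(tokens_remaining("claude-3-5-sonnet", "Summarise the attached report."))
```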
Of course, bigger isn’t necessarily better. It’s also important to consider the usefulness of the context window. For example, generative AI company Galileo’s new hallucination report crowned Claude 3.5 Sonnet’s context window the best-performing across short, medium and long-context scenarios, producing the fewest hallucinations (i.e. fictitious pieces of information).
Therefore, while the context window is a consideration for comparing LLMs, it doesn’t yield a clear winner among them.
The pricing for each LLM is somewhat complex. Each model's pricing has two components: user interface (UI) access with premium features at a flat rate per account, and a price per token for API access. Let's start by analysing the differences between the premium pricing plans.
Claude offers pricing plans starting from $20 per person per month, giving users higher usage limits and access to new features. Note: the plan description is somewhat vague, so the best way to gauge its benefits is to try it out.
ChatGPT offers a similar plan, with its Plus option clocking in at $20 per person monthly. However, this plan comes with image generation capabilities and access to GPT-4o, giving subscribers up to five times more messages.
Gemini Advanced (the package that combines Google One and Gemini Pro) positions its one-million token context window at the centre of its plan. It also offers benefits like uploading Google Docs and spreadsheets, extra Google One storage, Gemini for emails and more. Gemini Advanced is available for $19.99 per month.
UI access pricing is primarily suitable for individual users or small teams. For enterprises or businesses building products or ongoing projects, pricing per token is the critical factor. To compare LLMs, let's focus on the price per token offered by each model.
Gemini 1.5 Pro is, therefore, the cheapest overall per token. However, its output quality is more limited across major benchmarks like code generation, problem-solving and maths (more on this below).
Opting for a more lightweight model will restrict performance across multimodal tasks but will prove more pocket-friendly. Per one million tokens, the lightweight models cost, from cheapest up: Gemini 1.5 Flash at $0.125, GPT-4o Mini at $0.15 and Claude 3 Haiku at $0.25.
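At these rates, estimating a bill is simple arithmetic. Here's a quick calculator using the per-million-token prices quoted above (note that providers bill input and output tokens at different rates, which we gloss over here, and prices change often, so check each provider's pricing page):

```python
# Per-million-token prices quoted above, in USD.
PRICE_PER_MILLION = {
    "gemini-1.5-flash": 0.125,
    "gpt-4o-mini": 0.15,
    "claude-3-haiku": 0.25,
}

def token_cost(model: str, num_tokens: int) -> float:
    """USD cost of num_tokens at the quoted per-million-token rate."""
    return PRICE_PER_MILLION[model] * num_tokens / 1_000_000

# A 50,000-token prompt to each lightweight model:
for model in PRICE_PER_MILLION:
    print(f"{model}: ${token_cost(model, 50_000):.4f}")
```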
In each of the following cases, finding the right benchmark (i.e. a standard test to compare model performance) is essential.
Massive Multitask Language Understanding (MMLU) is one of the most popular and versatile benchmarks. MMLU measures an LLM’s understanding across various subjects, including law, philosophy, medicine and maths.
Vellum's comparison shows GPT-4o and Claude 3.5 Sonnet to have equal MMLU scores. Therefore, regardless of the specific subject, you might opt for GPT-4o or Claude 3.5 Sonnet for general reasoning tasks. However, Claude seems to generate the most natural-sounding language for text-generation tasks.
Code generation is another popular basis for comparison: by giving an LLM multiple programming problems, a benchmark like HumanEval measures how often it produces correct code.
Claude 3.5 Sonnet emerged as the winner with 92%, followed by GPT-4o at 90.2% and Gemini 1.5 Pro weighing in at 71.9%.
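These scores reflect HumanEval-style evaluation, which executes the model's code against unit tests. Here's a toy illustration; the problem and candidate solution are invented, and real harnesses run this step in a sandbox:

```python
# Toy HumanEval-style check: run model-generated code against unit tests.
candidate_code = '''
def add(a, b):
    return a + b
'''  # stand-in for an LLM's response to a programming problem

tests = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]

namespace: dict = {}
exec(candidate_code, namespace)  # never exec untrusted output outside a sandbox
add = namespace["add"]

passed = all(add(*args) == expected for args, expected in tests)
print(f"passed: {passed}")  # the benchmark score averages this over many problems
```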
For generating code, you might go straight to Claude 3.5 Sonnet.
Explainability is important for accuracy, transparency and more when working with an LLM. Users (and stakeholders) need to understand what the LLM’s response means and why (and how) it came to that decision.
Most LLMs, however, are built as black boxes, making their inner workings inaccessible. This lack of transparency is problematic for users who need to trace how an LLM arrived at its output (e.g. when applying LLMs to sensitive financial data).
Unfortunately, there is no accurate way to compare the explainability of LLMs, as there are no universally agreed-upon benchmarks. Researchers and other experts have proposed several promising evaluation approaches, but no formal comparison has yet taken place.
The good news is that when testing models, you can perform a lightweight comparison of the LLMs' explainability with a few simple tricks. One is to sample the same prompt several times and measure how often the answers agree: if a model can't give the same answer twice, its explanations are hard to trust.
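Here's a minimal sketch of that consistency probe, assuming a hypothetical ask() wrapper around whichever API you're testing:

```python
from collections import Counter

def ask(prompt: str) -> str:
    """Hypothetical wrapper: call your chosen LLM API and return its answer."""
    raise NotImplementedError  # wire this to the API you're evaluating

def consistency_score(prompt: str, runs: int = 5) -> float:
    """Fraction of runs that agree with the most common answer."""
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    return Counter(answers).most_common(1)[0][1] / runs

# e.g. consistency_score("In one word: is revenue recognised when invoiced or when earned?")
# A score well below 1.0 on a factual question is a red flag.
```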
We can expect to see further examinations of explainability in the future, as it is fundamental to the development of responsible AI.
LLMs are ethically complex, raising questions about usage, bias and potential job displacement. It's important that an LLM's developer attempts to address these risks: by proactively identifying and addressing them, developers can prevent harm caused by the misuse or malfunction of their models, and avoid legal and regulatory compliance issues.
Currently, the only LLM provider that has fully addressed the ethics of their creation is Anthropic.
Anthropic has built a 'constitution', which it used to train its LLM. The constitution consists of an extensive compilation of ethical guidelines. Alongside the Universal Declaration of Human Rights, Anthropic also trained its model on Apple's guidelines for app developers, which prohibit 'content that is:
Instead, the model self-trains to choose responses that promote ‘freedom of thought, conscience, opinion, expression, assembly, and religion’.
If you want to support the most ethical AI model, choose Claude 3.5 Sonnet.
Though an LLM's cultural impact isn't a key factor when deciding which LLM suits your purposes, you may be interested in the issues each LLM's development and deployment have raised.
Claude has done wonders for Anthropic, a startup founded by ex-OpenAI employees in 2021 to create responsible AI.
It’s a gamble that seems to have paid off, with Claude receiving nearly 70 million visits monthly. The fact that experts and users consider Claude to be in the same league as Google, Amazon, OpenAI and Meta AI’s LLMs reflects its exceptional capabilities. Anthropic now sits at over a $4 billion valuation.
Anthropic's success has also intensified AI competition, likely sparking a surge in new AI startups. With approximately 70,000 AI companies already in the market, many of which are startups, experts say the industry will likely experience meteoric growth. Research from Statista supports this, predicting that the AI industry’s net worth will have quadrupled by 2030.
ChatGPT's explosive introduction to the world accrued one million users in just five days. Until Threads debuted in July 2023, it was the fastest-growing application in history. In comparison, Netflix took 3.5 years to reach 1 million users.
ChatGPT’s popularity unleashed a wave of FOBO (Fear of Becoming Obsolete) by demonstrating automation capabilities for roles like:
ChatGPT's successful launch also pressured other companies, like Google, to release their own LLMs. However, as Gemini's introduction has proven, it may also have encouraged developers to release rival models prematurely. For LLMs, capabilities matter more than speed of release.
Gemini had a disastrous introduction. Formerly known as Bard, it went viral in February 2023 when, in a Twitter promotion, it incorrectly answered a simple question about astronomy. The mistake wiped $100 billion off the market value of Google's parent company, Alphabet.
The incident may have motivated Google's choice to rebrand Bard as Gemini, although CEO Sundar Pichai offered a somewhat convoluted explanation for the change:
“For us, Gemini is our approach overall in terms of how we are building our most capable and safe AI model and Bard was the most direct way that people could interact with our models, so it really made sense to just evolve it to be Gemini because you are talking directly to the underlying Gemini model when you use it."
Either way, rebranding Bard seems like a smart move, as it avoids lexical similarities with Claude’s latest version, Sonnet.
Despite its delayed and rocky entrance, Gemini’s development has ignited an ongoing conversation. Some highlights include:
As machine learning developers, we can say that Claude 3.5 Sonnet is the most developer-friendly LLM due to its consistent performance. It ranks highest on HumanEval and ties for the top MMLU score, so, though not perfect, it is currently the highest-performing model overall. Lowering an LLM's temperature generally reduces creativity and increases output consistency, and Claude tends to produce more consistent results at lower temperatures than GPT-4o.
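As an example, with Anthropic's Python SDK you can pin the temperature low for repeatable outputs (the model ID below was current when we wrote this; check Anthropic's documentation for the latest):

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Low temperature trades creativity for repeatability: useful when the
# same prompt should yield near-identical answers across runs.
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # model ID current at the time of writing
    max_tokens=512,
    temperature=0.2,  # Anthropic accepts 0.0-1.0; lower = more deterministic
    messages=[{"role": "user", "content": "List three risks of LLM hallucination."}],
)
print(message.content[0].text)
```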
Yet, this isn’t a definitive status as LLM developers constantly release new and improved models, and each LLM excels at different tasks.
The best way to find the right LLM is to experiment and log your findings. Every time a new LLM comes out, try it: it will only take a few minutes. As benchmarks like MMLU and HumanEval show, there will be disparities between models' performance that could affect their overall usefulness.
A bit about us: Evolution AI uses sophisticated LLM technologies to build solutions like Financial Statements AI (our financial extraction & analysis tool).
Interested in keeping up to date regarding our work with LLMs? Follow us on LinkedIn and X.