Claude vs. GPT-4.5 vs. Gemini: A Comprehensive Comparison

SECTIONS

Note: this post was last updated in March 2025 to reflect the new, exciting developments in the LLM landscape - Claude 3.7 Sonnet, etc.!

‍

Introduction: Why Compare LLMs?

If you've ever used ChatGPT and received an error message or an inaccurate response, you might have wondered if a better alternative is available. After all, developers are currently flooding the large language model (LLM) market with new and updated models. Even as machine learning developers ourselves, keeping up with the capabilities of each new LLM is arduous.

In this article, we'll present a detailed comparison of three key players in the competitive landscape of LLMs - Anthropic's Claude 3.5 Sonnet, OpenAI's GPT-4o and Google Gemini. Our machine learning team has worked with each of these models and will provide a robust, referenced analysis of each model. Exploring price, explainability, and more, we'll compare each LLM to crown a winner. Skip doing your own research - let's find out which LLM you should be using.

‍

‍

Current Models

We’ll compare these current LLM models:

OpenAI's GPT-4.5 (released February 2025)

Anthropic's Claude 3.7 Sonnet (released March 2025)

Google's Gemini 2.0 Flash (released January 2025)

‍

Model Weight

Each model is heavy or medium-weight, meaning that it has billions of parameters and can handle a range of tasks (e.g., creative text generation, image interpretation, etc.).

Of course, lightweight counterparts exist for each model, meaning they have fewer parameters but are cheaper and faster. The more lightweight version of GPT-4.5 is GPT-4o Mini; Claude 3.7 Sonnet’s is 3 Haiku. Finally, Gemini’s is 2.0 Flash-Lite.

‍

Training Data & Model Size

Though the size of the training dataset and overall model architecture would help compare each LLM’s performance, OpenAI, Anthropic and Google will likely never make this information publicly available. The reason? Doing so would make it easier for developers to replicate and outperform their models.

‍

Rate Limits

GPT-4.5, Gemini 2.0 Flash and Claude 3.7 Sonnet all have similar capacity limits if their limits are exceeded. In other words, if you send too many requests per minute (RPMs).

All three LLMs will display a 429 Error when this occurs. So, if you’re looking for the most generous rate limits, there is no standout model here.

‍

LLM Rate Limits: Free vs. Paid Versions

The rate limits of each LLM on a free tier are (in Requests Per Minute):

Claude 3.7 Sonnet: 50 RPM
GPT-4.5 - isn’t available on a free tier, but it is generally speculated that the rate limit is 50 messages per 7 days.
Gemini 2.0 Flash: 15 RPM

‍

Context Windows

A context window is the amount of information an LLM can accept into an input when comprehending or generating information. It’s also the point where the differences between these LLMs become clear, as they each have a different context window measurable in tokens. Tokens are smaller units like words, subwords and characters. Here’s a breakdown of our contenders.

Claude 3.7 Sonnet: 200k tokens
GPT-4.5: 128k tokens
Gemini 2.0 Flash: 1 million tokens

For context, 200 tokens represent about 500 pages of text. In theory, this means that Gemini should retain more information for longer during conversations with the user, or process more information in one request.

Of course, bigger isn’t necessarily better. It’s also important to consider the usefulness of the context window. For example, generative AI company Vectara's hallucination leaderboard noted that Gemini 2.0-Flash produced the least hallucinations, then GPT-4.5 before Claude 3.7. Based on the size of the context window, you might expect Claude 3.7 Sonnet to produce fewer hallucinations than Gemini 2.0 Flash.

Therefore, while the context window is a consideration for comparing LLMs, it doesn’t yield a clear winner among them.

‍

Pricing

The pricing for each LLM is somewhat complex. Each model’s pricing consists of two systems. The first is user interface (UI) access with premium features at a flat rate per account and then a price per token for API access. Let’s start by analysing the differences between the premium pricing plans.

‍

Claude

Claude offers pricing plans starting from $20 per person per month. Users get higher usage limits and access to new features. Note: The plan description is somewhat vague, so the best way to explore its benefits would be for the user to try it out.

‍

ChatGPT

ChatGPT offers a similar plan, with its Plus option clocking in at $20 per person monthly. However, this plan comes with video generation capabilities and access to GPT-4.5.

‍

Gemini Advanced

Gemini Advanced (the package that combines Google One and Gemini Pro) positions its one-million token context window at the centre of its plan. It also offers benefits like integration with Google Docs and spreadsheets, extra Google One storage, Gemini for emails and more. Gemini Advanced is available for £18.99 per month.

‍

UI Pricing - Claude vs. ChatGPT vs. Gemini

UI access pricing is primarily suitable for individual users or small teams. For enterprises or businesses building products or ongoing projects, pricing per token is the critical factor. To compare LLMs, let's focus on the price per token offered by each model.

‍

‍

Gemini 2.0. Flash is, therefore, (by far) the cheapest overall per token, with GPT-4.5's research preview amassing a startling cost. However, its output quality is more limited across major benchmarks like code generation, problem-solving and maths – more on this below.

Opting for a more lightweight model will restrict performance across multimodal tasks but will prove more pocket-friendly. Per one million tokens, the cheapest lightweight model is Gemini 1.5 Flash-8B at $0.0375, GPT-4o Mini at $0.15 and Claude 3 Haiku at $0.25.

‍

Testing the LLMs on Typical Use Cases

In each of the following cases, finding the right benchmark (i.e. a standard test to compare model performance) is essential.

‍

Massive Multitask Language Understanding

Massive Multitask Language Understanding (MMLU) is one of the most popular and versatile benchmarks. MMLU measures an LLM’s understanding across various subjects, including law, philosophy, medicine and maths.

Vellum's leaderboard currently crowns Claude 3.7 as having the best MMLU score, at 85%. Claude 3.7 integrates an AI mechanism known as 'long thinking', which is designed to take a more detailed and logical approach to complex challenges. Note that these types of models are likely to generate responses slower than in typical LLM interactions.

‍

Code Generation (HumanEval)

One of the most popular benchmarks is code generation. By giving an LLM multiple programming problems, you can measure how often it produces the correct code. The newest models haven't been tested, so let's compare the earlier versions of each model.

Claude 3.5 Sonnet emerged as the winner with 93.7%, followed by GPT-4o at 90.2% and Gemini 1.5 Pro weighing in at 71.9%.

For generating code, you might go straight to Claude.

‍

Explainability

Explainability is important for accuracy, transparency and more when working with an LLM. Users (and stakeholders) need to understand what the LLM’s response means and why (and how) it came to that decision.

Most LLMs, however, are built as black boxes, making their inner calculations inaccessible. An LLM’s lack of transparency is problematic for users who want to attribute the LLM’s output (e.g. using LLMs for sensitive financial data).

Unfortunately, there is no accurate way to compare the explainability of LLMs, as there are no universally agreed-upon benchmarks. Researchers and other experts have proposed several interesting ideas for surveying, but no formal comparison has occurred.

The good news is that when testing models, you can perform a lightweight comparison of the LLMs’ explainability with a few simple tricks:

Ask the model, ‘Why did you choose this answer?’

Slightly vary the wording of the prompt and measure how this changes the output. The variations will indicate how the LLM reached its answer.

Likewise, experiment with the sentiment of your query. For example, you could change the perspective of the prompt and ask the LLM to answer as an analyst/advisor and so on.

We can expect to see further examinations of explainability in the future, as it is fundamental to the development of responsible AI.

‍

Ethics

LLMs are ethically complex – raising questions about their usage, bias and potential job displacement. It’s important that the LLM’s developer attempts to address the ethical risks of their LLM. By proactively identifying and addressing ethical risks, developers can prevent harm caused by the misuse or malfunction of their models. It’s also essential on the developing end to avoid legal and regulatory compliance issues.

Currently, the only LLM provider that has fully addressed the ethics of their creation is Anthropic.

Anthropic has built a ‘constitution’, one they’ve used to train their LLM. The constitution consists of an extensive compilation of ethical guidelines. Though they’ve trained their LLM on the Universal Declaration of Human Rights, Anthropic also trained it on Apple’s guidelines for app developers, which prohibits ‘content that is:

Offensive
Insensitive
Upsetting
Intended to disgust
In exceptionally poor taste
Just plain creepy.’

Instead, the model self-trains to choose responses that promote ‘freedom of thought, conscience, opinion, expression, assembly, and religion’.

If you want to support the most ethical AI model, choose Claude.

‍

Bonus: Cultural Impact

Though an LLM’s cultural impact isn’t a key factor you’d consider when deciding which LLM might suit your purposes, you may be interested in what issues each LLM's development and deployment have evoked.

‍

1. Claude/Anthropic

Claude has done wonders for Anthropic, a startup founded by ex-OpenAI employees in 2021 to create responsible AI.

It’s a gamble that seems to have paid off, with Claude receiving over 100 million visits monthly. The fact that experts and users consider Claude to be in the same league as Google, Amazon, OpenAI and Meta AI’s LLMs reflects its exceptional capabilities. Anthropic now sits at over a $6 billion valuation.

Anthropic's success has also intensified AI competition, likely sparking a surge in new AI startups. With approximately 70,000 AI companies already in the market, many of which are startups, experts say the industry will likely experience meteoric growth. Research from Statista supports this, predicting that the AI industry’s net worth will have quadrupled by 2030.

‍

2. ChatGPT/OpenAI

ChatGPT's explosive introduction to the world accrued one million users in just five days. Until Threads debuted in July 2023, it was the fastest-growing application in history. In comparison, Netflix took 3.5 years to reach 1 million users.

ChatGPT’s popularity unleashed a wave of FOBO (Fear of Becoming Obsolete) by demonstrating automation capabilities for roles like:

Copywriter
Data entry clerk
Coder
Legal assistant
Tutor

ChatGPT’s successful launch also pressured other companies like Google to release their (own) LLMs. Although, as Gemini’s introduction has proven, it may have also encouraged developers to release their rival models prematurely. For LLMs, capabilities matter more than the speed of release.

‍

3. Gemini/Google

Gemini had a disastrous introduction. Formerly Bard, it went viral in February 2023 during a Twitter promotion. Bard had incorrectly answered a simple question about astronomy. The LLM’s mistake then wiped away $100 billion off the market value of its parent company, Alphabet.

The incident may have motivated Google’s choice to rebrand Bard as Gemini, although CEO Sundar Pichai offered a somewhat more incomprehensible explanation for the change:

“For us, Gemini is our approach overall in terms of how we are building our most capable and safe AI model and Bard was the most direct way that people could interact with our models, so it really made sense to just evolve it to be Gemini because you are talking directly to the underlying Gemini model when you use it."

Either way, rebranding Bard seems like a smart move, as it avoids lexical similarities with Claude’s model, Sonnet.

Despite its delayed and rocky entrance, Gemini’s development has ignited an ongoing conversation. Some highlights include:

Controversy over whether LLMs could scan documents on your computer, with one user complaining that Gemini had scanned his Google Drive.
Next on Gemini’s agenda is the ability to edit AI-generated images, which could be useful for users tired of tweaking prompts to generate their desired results.
And, who could forget when Gemini advised a user to put glue on pizza? In response, Sundar Pichai suggested that Gemini lacks ‘factuality’. Make of that what you will.

‍

Conclusion: Comparing LLMs Is Difficult

As machine learning developers, we can say that Claude 3.7 Sonnet is the most developer-friendly LLM due to its consistent performance. Lowering an LLM's temperature generally reduces creativity and increases output consistency. Claude tends to produce more consistent results at lower temperatures than GPT-4.5. Also, Claude 3.7 Sonnet ranks the highest on HumanEval and MMLU. So, overall, though not perfect, Claude 3.5 Sonnet is currently the highest-performing model.

Yet, this isn’t a definitive status as LLM developers constantly release new and improved models, and each LLM excels at different tasks.

The best way to find the right LLM is to experiment and log your findings. Every time a new LLM comes out, try it – it will only take a few minutes. As benchmarks like Reasoning and HumanEval prove, there will be disparities between the performance of models that could affect their overall usefulness.

‍

A bit about us: Evolution AI uses sophisticated LLM technologies to build solutions like Financial Statements AI (our financial extraction & analysis tool).

Interested in keeping up to date regarding our work with LLMs? Follow us on LinkedIn and X.

‍