Every week, new and more advanced Large Language Models (LLMs) are released, each claiming to be better than the last. But how can we keep up with all these new developments? The answer is the LMSYS Chatbot Arena.
The LMSYS Chatbot Arena is an innovative platform created by the Large Model Systems Organization, a group of students and faculty from UC Berkeley, UCSD, and CMU. The platform makes it easy to compare and evaluate different LLMs by letting users test and rate them. It’s a place where anyone interested in these models can find out about the latest releases and see how they stack up against each other.
This leaderboard ranks LLMs using a Bradley-Terry model, with the results displayed on an Elo scale, and it is continuously updated. The rankings are derived from human pairwise comparisons: as of April 26, 2024, the leaderboard includes 91 different models and has collected more than 800,000 such comparisons. Models are also ranked within categories, such as coding and long user queries.
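The Bradley-Terry fit behind the leaderboard can be sketched on toy data. Everything below is made up for illustration (the model names, the win counts, the 1000-point anchor); LMSYS fits the real model on hundreds of thousands of votes, but the mechanics are the same: estimate a strength for each model from pairwise wins, then display the strengths on an Elo-like scale.

```python
import math

# Hypothetical pairwise results: wins[(a, b)] = times model a beat model b.
wins = {
    ("gpt-4", "llama-3"): 60, ("llama-3", "gpt-4"): 40,
    ("gpt-4", "mistral"): 70, ("mistral", "gpt-4"): 30,
    ("llama-3", "mistral"): 55, ("mistral", "llama-3"): 45,
}
models = ["gpt-4", "llama-3", "mistral"]

# Bradley-Terry assumes P(a beats b) = p_a / (p_a + p_b).
# Fit the strengths p with the classic MM (minorization-maximization) update.
p = {m: 1.0 for m in models}
for _ in range(200):
    new_p = {}
    for a in models:
        total_wins = sum(wins.get((a, b), 0) for b in models if b != a)
        denom = sum(
            (wins.get((a, b), 0) + wins.get((b, a), 0)) / (p[a] + p[b])
            for b in models if b != a
        )
        new_p[a] = total_wins / denom
    norm = sum(new_p.values())
    p = {m: v / norm for m, v in new_p.items()}  # normalize for stability

# Display on an Elo-like scale (400 points per factor of 10 in strength),
# anchored so the first model sits at 1000.
elo = {m: 1000 + 400 * math.log10(p[m] / p[models[0]]) for m in models}
```

With these toy counts, the model that wins most of its battles ends up with the highest Elo-scale score, which is all the leaderboard number is: a readable transform of the fitted strengths.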
Click here to start the live testing of LLMs.
The leaderboard shows the top and trending models based on Arena Elo ratings. As of this writing, OpenAI is clearly leading the race for the best LLM.
Now, if you’re like me and wondering why the term “preview” appears in front of some models, here is the answer: a “preview” is a version of a large language model (LLM) made available for testing, feedback, or experimental use before its official release. This stage lets developers and users explore the model’s capabilities, identify issues, and provide feedback that can be incorporated into further refinements. Essentially, it’s like a beta version of software: mostly functional and showcasing new features or improvements, but possibly still carrying bugs or limitations that need addressing before a full, stable release.
The rankings take into account the 95% confidence interval when determining a model’s ranking, and models with fewer than 500 votes are removed from the rankings.
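These two rules can be sketched in a few lines. The table rows and the exact CI-aware ranking formula below are illustrative assumptions, not LMSYS’s published implementation; the point is that low-vote models are dropped, and models whose confidence intervals overlap cannot be confidently separated.

```python
# Hypothetical leaderboard rows: (model, Elo rating, 95% CI half-width, votes).
# All numbers are made up for illustration.
rows = [
    ("model-a", 1250, 5, 42000),
    ("model-b", 1245, 8, 31000),
    ("model-c", 1180, 30, 450),   # fewer than 500 votes: excluded
]

MIN_VOTES = 500
eligible = [r for r in rows if r[3] >= MIN_VOTES]

# One CI-aware ranking rule (an illustration, not necessarily the exact
# LMSYS formula): a model's rank is 1 plus the number of models whose CI
# lower bound lies strictly above its CI upper bound.
def rank(row, table):
    _, elo, ci, _ = row
    upper = elo + ci
    return 1 + sum(1 for (_, e, c, _) in table if e - c > upper)

ranks = {row[0]: rank(row, eligible) for row in eligible}
# model-a and model-b share rank 1 here because their intervals overlap.
```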
You might have heard that Llama 3 is the best open source Large Language Model (LLM) so far. However, if you check the overall rankings, GPT-4 Turbo is at the top. Why is that? It’s because the rankings include both open source and closed source LLMs.
Look at the last column of the leaderboard: it shows the type of license each LLM has. This is important because it divides the models into two main groups: open source and closed source.
- Open source LLMs: the code behind these models is publicly available, allowing anyone to inspect, understand, and even improve them. This fosters a collaborative development environment (e.g., Meta’s Llama 3).
- Closed source LLMs: not publicly available and requiring permission or licensing to use. These are typically developed by commercial entities (e.g., OpenAI’s GPT-4 series, Google’s Gemini series, Anthropic’s Claude series).
In short, open source LLMs offer transparency and foster collaboration, while closed-source LLMs prioritize control and potentially deliver a more polished user experience.
The LMSYS platform works by collecting user dialogue data to evaluate large language models (LLMs). Users can compare two different LLMs side-by-side on a given task and then vote on which LLM provided a better response. The LMSYS platform uses these votes to rank the different LLMs.
Here’s a step-by-step breakdown of how LMSYS works:

1. A user enters a prompt, and two anonymous LLMs each generate a response.
2. The user compares the two responses side by side and votes for the better one (or declares a tie).
3. The models’ identities are revealed only after the vote is cast.
4. The collected votes are aggregated into the leaderboard rankings.
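The voting loop described above can be sketched as a small function. All names here are illustrative, not the actual LMSYS API: two anonymous models answer the same prompt, the user votes, and the vote is logged as one pairwise comparison.

```python
import random

def run_battle(models, respond, judge, prompt):
    """One Arena-style 'battle': returns a pairwise-comparison record."""
    model_a, model_b = random.sample(models, 2)  # two anonymous contestants
    answer_a = respond(model_a, prompt)
    answer_b = respond(model_b, prompt)
    winner = judge(answer_a, answer_b)           # user's vote: "a", "b", or "tie"
    # Identities are revealed only after the vote, alongside the logged result.
    return {"model_a": model_a, "model_b": model_b, "winner": winner}
```

Each returned record is one row of the pairwise-comparison data that the rating systems below consume.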
The LMSYS leaderboard uses two main ways to rate Large Language Models (LLMs): the Elo rating system and the Bradley-Terry model.
In the LMSYS Chatbot Arena, LLMs are like players in a game, where they interact with users and compete against each other. Each LLM starts with a basic score, and this score changes based on whether they win or lose matches. Winning against a stronger LLM gives more points, and losing to a weaker one takes away more points. This way, the ratings always reflect the current strengths of the LLMs accurately.
The Elo system is great for keeping track of how LLMs perform over time, helping to understand which models are doing well and predicting how they might do in the future. This makes it a very useful tool for seeing how new and existing models stack up against each other in the ever-changing world of AI development.
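The scoring intuition above is the standard Elo update rule, sketched below. The K-factor of 32 is a common default, not LMSYS’s actual choice.

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update. score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    # Expected score for A given the current rating gap.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    # A gains what B loses; upsets (low expected_a) move ratings the most.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Beating a higher-rated model moves ratings more than beating a weaker one: with K=32, `elo_update(1000, 1200, 1.0)` awards roughly 24 points, while `elo_update(1200, 1000, 1.0)` awards roughly 8.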
If you’re interested in reading more about the evaluation process, check out their paper: https://arxiv.org/abs/2403.04132
I hope this article has helped you understand how the LMSYS leaderboard works and where you can keep track of the latest developments in large language models.
The LMSYS Chatbot Arena combines crowdsourced human votes with rigorous statistical methods to score and rank models, making it a great place to see how these models really perform. Understanding them better helps everyone use them more effectively in real-life situations.
If you know of any other resources that can help stay up-to-date in the field of Generative AI, please share them in the comments section below. Your input can help us all keep pace with this rapidly evolving technology!