So, you are a company taking the big leap. You want to bring large language models (LLMs) into your operations. Perhaps you want a nifty system that can summarize your sales reports, or a smart tool that can handle customer service inquiries. The problem is that there are hundreds of LLMs to choose from, each distinguished by subtle variations, and picking the right one can feel like searching for a needle in a haystack.
Well, don’t sweat it too much. LLM ranking platforms are designed to make your decision-making process a bit easier. They gather feedback from users and score different models based on how well they perform tasks such as coding, visual understanding, or natural language processing. If a model is top-ranked, it’s assumed to be the best fit for a given application. Easy enough, right?
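To make that concrete, arena-style leaderboards typically turn pairwise “which answer was better?” votes into scores using a rating system such as Elo or Bradley-Terry. Below is a minimal, hypothetical sketch of that idea; the model names, vote log, and K-factor are all made up for illustration.

```python
# Simplified, illustrative Elo-style scoring of models from pairwise user
# votes, in the spirit of arena-style leaderboards. All data is hypothetical.

def expected_score(r_a, r_b):
    """Probability that a model rated r_a beats one rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Nudge two models' ratings after a single user vote."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_win)
    ratings[loser] -= k * (1 - e_win)

# Hypothetical vote log: each entry records (winner, loser) of one comparison.
votes = [("model_a", "model_b"), ("model_b", "model_c"), ("model_a", "model_c")]

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(sorted(ratings, key=ratings.get, reverse=True))
# ['model_a', 'model_b', 'model_c']
```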
Not so fast. Researchers at MIT have published findings that undercut our trust in these ranking platforms. Their study shows that removing a tiny fraction of user interactions, sometimes as few as two or three votes, can drastically alter the rankings. That casts doubt on whether the top-ranked model is actually the most reliable or the most effective choice for real-world use.
“The results surprised us,” says Tamara Broderick, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS) and senior author of the study. “It raises the question of whether a top-ranked LLM that rests on two or three pieces of feedback from tens of thousands can consistently outperform all the other LLMs in practice.”
The research team at MIT, which includes EECS graduate students Jenny Huang and Yunyi Shen along with Dennis Wei of IBM Research, examined how easily ranking platforms can be swayed. They developed a fast, efficient way to test the stability of LLM ranking platforms by identifying which individual pieces of feedback significantly influence the overall ranking. Strikingly, they discovered that altering a mere 0.0035 percent of the data (just two votes out of 57,000) could change which model ranked first.
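Their actual method relies on an efficient approximation, but the question it answers can be illustrated with a naive leave-votes-out search. The sketch below, which reuses update_elo and the toy vote log from the earlier example, brute-forces a small set of dropped votes that dethrones the top-ranked model; it is a conceptual illustration, not the researchers’ algorithm.

```python
# Naive illustration of the stability test's core idea: how few votes must
# be removed before the #1 model changes? The real study uses a far more
# efficient approximation; this toy loop just demonstrates the concept.

def rank_models(votes, models):
    """Recompute the leaderboard from scratch on a given vote set."""
    ratings = {m: 1000.0 for m in models}
    for winner, loser in votes:
        update_elo(ratings, winner, loser)
    return sorted(ratings, key=ratings.get, reverse=True)

def votes_to_flip_leader(votes, budget=3):
    """Greedily search for a small set of dropped votes that dethrones
    the current top-ranked model; returns None if none is found."""
    models = sorted({m for vote in votes for m in vote})
    original_leader = rank_models(votes, models)[0]
    remaining, dropped = list(votes), []
    for _ in range(budget):
        # Try every single-vote removal first.
        for i, vote in enumerate(remaining):
            trial = remaining[:i] + remaining[i + 1:]
            if rank_models(trial, models)[0] != original_leader:
                return dropped + [vote]
        # No single removal flips the top spot yet: permanently drop one
        # of the leader's wins and search again within the budget.
        leader_wins = [v for v in remaining if v[0] == original_leader]
        if not leader_wins:
            return None
        dropped.append(leader_wins[0])
        remaining.remove(leader_wins[0])
    return None

print(votes_to_flip_leader(votes))
# [('model_a', 'model_b')] -- dropping this one vote flips the leader
```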
After analyzing several platforms with their method, they found one where removing just two evaluations out of thousands flipped the top model. Even on a more robust platform that used expert annotators, discarding just 3 percent of 2,575 evaluations changed the rankings.
Beyond revealing a surprising sensitivity in these systems, their analysis showed that a good chunk of the influential feedback appeared to be erroneous: users sometimes voted for the less accurate model, likely because of misclicks or lapses in attention. It’s a wake-up call about the reliability of crowdsourced feedback when choosing an LLM.
The researchers believe these issues could be mitigated by gathering more detailed user feedback. For instance, recording how confident users are in each vote could offer better context. They also propose having human moderators verify crowdsourced responses, minimizing the impact of noisy or erroneous inputs.
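As a purely hypothetical illustration of the first idea (not something the study implements), a platform could scale each vote’s influence by the voter’s stated confidence, so that a hesitant click barely moves the leaderboard. This sketch reuses the expected_score function from the first example.

```python
# Hypothetical mitigation sketch: weight each Elo update by a self-reported
# confidence in [0, 1], so a likely misclick has little effect. All data
# and the weighting scheme itself are assumptions for illustration.

def update_elo_weighted(ratings, winner, loser, confidence, k=32):
    """Elo update scaled by the voter's stated confidence."""
    e_win = expected_score(ratings[winner], ratings[loser])
    delta = k * confidence * (1 - e_win)
    ratings[winner] += delta
    ratings[loser] -= delta

# Each vote is (winner, loser, confidence); values here are made up.
weighted_votes = [
    ("model_a", "model_b", 0.9),  # deliberate, confident choice
    ("model_b", "model_a", 0.1),  # hesitant vote, possibly a misclick
]

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner, loser, conf in weighted_votes:
    update_elo_weighted(ratings, winner, loser, conf)

print(ratings)  # model_a keeps its lead; the low-confidence vote barely counts
```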
While the study does not offer a complete solution, it highlights the need for more rigorous evaluation methods for LLMs. The team hopes its findings will spur improvements in how LLMs are assessed and ranked.
Broderick and her group now aim to explore similar issues in other areas of machine learning while refining their techniques to expose even subtler forms of instability. Jessica Hullman, a computer science professor at Northwestern University who was not involved in the study, commented on its wider implications: “Seeing how few preferences can so dramatically change the functioning of a fine-tuned model may push for more thoughtful data collection methods.”
The research was supported, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the National Science Foundation, Amazon, and a CSAIL seed award. To dive deeper into the study, you can find the original article on MIT News.