
Unveiling Hidden Biases and Personalities in Large Language Models

In the world of artificial intelligence, large language models (LLMs) like ChatGPT and Claude have evolved well beyond basic answer generation. Their capacities have grown to include the representation of complex, abstract ideas: tones, personalities, biases, and even moods. Still, the question remains: just how do these advanced models represent such abstract concepts?

Shining a Light on LLMs

Setting the scene for discovery, a pioneering team from MIT and the University of California San Diego has devised an innovative approach: a tool that probes an LLM for hidden representations of abstract concepts such as biases, personalities, and moods. The tool can decode the connections within a model that encode a specific concept. What's even more fascinating, it can then manipulate those connections, a process called "steering," to enhance or lessen the concept in the model's responses.

The researchers put their method to the test by identifying and steering more than 500 general concepts in some of the largest LLMs available today. Each representation could then be causally amplified or diminished in the model's generated answers. Picture being able to isolate a model's persona of a "social influencer" or even a "conspiracy theorist," then tweaking that facet in any given AI interaction!
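The steering operation itself is conceptually simple: add a learned "concept direction" to a model's hidden activations during generation. Below is a minimal sketch of that idea using a PyTorch forward hook on a small open model; the layer index, the random placeholder direction, and the scale `alpha` are illustrative assumptions, not the team's released tool.

```python
# A minimal sketch of activation steering, assuming a Hugging Face causal LM.
# The layer index, the random placeholder direction, and the scale `alpha`
# are illustrative assumptions, not values from the study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the study worked with much larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

steering_vector = torch.randn(model.config.n_embd)
steering_vector /= steering_vector.norm()  # unit direction; the real method learns this
alpha = 4.0  # positive amplifies the concept, negative diminishes it

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * steering_vector
    return (hidden,) + output[1:]

# Hook one mid-network block so the shift propagates through later layers.
handle = model.transformer.h[6].register_forward_hook(add_steering)
ids = tok("The Blue Marble photo of Earth was", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```

Flipping the sign of `alpha` pushes generations away from the concept instead of toward it, which matches the amplify-or-diminish behavior described above.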

Illustrating a real-world application, the team identified a representation of the "conspiracy theorist" concept within a large vision-language model. By amplifying this representation, they made the model respond with the tone and perspective of a conspiracy theorist when asked about the origin of the famed "Blue Marble" image of Earth, photographed during the Apollo 17 mission.

The method's potential for misuse is not lost on the scientific team. They warn against misapplication of their work but also point to its benefits: by surfacing hidden concepts and potential vulnerabilities, they can improve model safety and performance.

A Deeper Dive into LLMs

Adityanarayanan "Adit" Radhakrishnan, assistant professor of mathematics at MIT, explains that while these models inherently harbor such concepts, the concepts are not always actively exposed. "[Our] method presents ways to extract these different concepts and activate them in ways that prompting cannot give you answers to," he explains.

As AI assistants like OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude continue their rise, scientists are trying to understand how these models represent abstract concepts. Radhakrishnan compares previous methods of uncovering such concepts to fishing with a vast net: they haul in far more than the species of interest.

Their approach, by comparison, is far more precise, like spearfishing for a single species. This targeted method identifies and "steers" any concept of interest within an LLM based on specific queries.

They developed their method by training recursive feature machines (RFMs) to recognize the numerical patterns in an LLM's internal activations that represent particular concepts. The methodology proved versatile, capable of searching for and manipulating essentially any concept within an LLM. They could steer a model to answer in a specific tone or perspective, or even amplify the concept of "anti-refusal," causing it to answer queries it would typically dismiss!
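Recursive feature machines are a published kernel-learning technique: fit a kernel predictor, compute the average gradient outer product (AGOP) of the fit, and refit with the AGOP as the kernel's metric. The sketch below is a bare-bones NumPy version on synthetic data standing in for labeled LLM activations; it uses a Gaussian kernel for simplicity (published RFM work typically uses a Laplace kernel), and all parameters are illustrative, not the team's released code.

```python
# A bare-bones sketch of a recursive feature machine (RFM) used as a concept
# probe. Synthetic data stands in for LLM hidden activations labeled for a
# target concept; all parameters are illustrative.
import numpy as np

def kernel(X, Z, M, sigma):
    # Mahalanobis Gaussian kernel: K(x, z) = exp(-(x-z)^T M (x-z) / (2 sigma^2))
    XM, ZM = X @ M, Z @ M
    d2 = (np.sum(XM * X, axis=1)[:, None] - 2.0 * XM @ Z.T
          + np.sum(ZM * Z, axis=1)[None, :])
    return np.exp(-np.clip(d2, 0.0, None) / (2.0 * sigma**2))

def rfm(X, y, iters=3, reg=1e-3, sigma=5.0):
    n, d = X.shape
    M = np.eye(d)  # start from the ordinary Euclidean metric
    for _ in range(iters):
        K = kernel(X, X, M, sigma)
        coef = np.linalg.solve(K + reg * np.eye(n), y)  # kernel ridge fit
        # Average gradient outer product (AGOP): gradients of the learned
        # predictor show which directions in feature space carry the concept.
        G = np.zeros((d, d))
        for j in range(n):
            diff = X[j] - X                                # rows: x_j - x_i
            grad = -(K[j] * coef) @ (diff @ M) / sigma**2  # grad of fit at x_j
            G += np.outer(grad, grad)
        M = G / n  # recurse with the AGOP as the new metric
    return M, coef

# Toy demo: a single "concept direction" hidden in 20-dimensional noise.
rng = np.random.default_rng(0)
w = np.zeros(20); w[3] = 1.0
X = rng.normal(size=(200, 20))
y = np.tanh(X @ w)
M, _ = rfm(X, y)
print("dominant coordinate of learned metric:", np.argmax(np.diag(M)))  # expect 3
```

Applied to real activations, the dominant direction of the learned metric is a natural candidate for the kind of steering vector sketched earlier.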

Radhakrishnan suggests the approach could swiftly identify and mitigate vulnerabilities in LLMs. Beyond the capability to custom-tailor AI responses, the team has made their underlying code publicly available. As Radhakrishnan sums it up: "[There are ways where] we can build highly specialized LLMs that are still safe to use but really effective at certain tasks." The research was supported by the National Science Foundation, the Simons Foundation, the TILOS institute, and the U.S. Office of Naval Research.

For more fascinating details, see the original article.

Max Krawiec
