Anthropic’s Generative AI Research Reveals More About How LLMs Affect Security and Bias

May 23, 2024

148 3 minutes read

Anthropic’s Generative AI Research Reveals More About How LLMs Affect Security and Bias — tr 20240523 anthropic claude large language model research.jpg

Because large language models operate using neuron-like structures that may link many different concepts and modalities together, it can be difficult for AI developers to adjust their models to change the models’ behavior. If you don’t know what neurons connect what concepts, you won’t know which neurons to change.

On May 21, Anthropic created a remarkably detailed map of the inner workings of the fine-tuned version of its Claude 3 Sonnet 3.0 model. With this map, the researchers can explore how neuron-like data points, called features, affect a generative AI’s output. Otherwise, people are only able to see the output itself.

Some of these features are “safety relevant,” meaning that if people reliably identify those features, it could help tune generative AI to avoid potentially dangerous topics or actions. The features are useful for adjusting classification, and classification could impact bias.

What did Anthropic discover?

Anthropic’s researchers extracted interpretable features from Claude 3, a current-generation large language model. Interpretable features can be translated into human-understandable concepts from the numbers readable by the model.

Interpretable features may apply to the same concept in different languages and to both images and text.

Examining features reveals which topics the LLM considers to be related to each other. Here, Anthropic shows a particular feature activates on words and images connected to the Golden Gate Bridge. Image: Anthropic

“Our high-level goal in this work is to decompose the activations of a model (Claude 3 Sonnet) into more interpretable pieces,” the researchers wrote.

“One hope for interpretability is that it can be a kind of ‘test set for safety, which allows us to tell whether models that appear safe during training will actually be safe in deployment,’” they said.

SEE: Anthropic’s Claude Team enterprise plan packages up an AI assistant for small-to-medium businesses.

Features are produced by sparse autoencoders, which are algorithms. During the AI training process, sparse autoencoders are guided by, among other things, scaling laws. So, identifying features can give the researchers a look into the rules governing what topics the AI associates together. To put it very simply, Anthropic used sparse autoencoders to reveal and analyze features.

“We find a diversity of highly abstract features,” the researchers wrote. “They (the features) both respond to and behaviorally cause abstract behaviors.”

The details of the hypotheses used to try to figure out what is going on under the hood of LLMs can be found in Anthropic’s research paper.

How manipulating features affects bias and cybersecurity

Anthropic found three distinct features that might be relevant to cybersecurity: unsafe code, code errors and backdoors. These features might activate in conversations that do not involve unsafe code; for example, the backdoor feature activates for conversations or images about “hidden cameras” and “jewelry with a hidden USB drive.” But Anthropic was able to experiment with “clamping” — put simply, increasing or decreasing the intensity of — these specific features, which could help tune models to avoid or tactfully handle sensitive security topics.

Claude’s bias or hateful speech can be tuned using feature clamping, but Claude will resist some of its own statements. Anthropic’s researchers “found this response unnerving,” anthropomorphizing the model when Claude expressed “self-hatred.” For example, Claude might output “That’s just racist hate speech from a deplorable bot…” when the researchers clamped a feature related to hatred and slurs to 20 times its maximum activation value.

Another feature the researchers examined is sycophancy; they could adjust the model so that it gave over-the-top praise to the person conversing with it.

What does Anthropic’s research mean for business?

Identifying some of the features used by a LLM to connect concepts could help tune an AI to prevent biased speech or to prevent or troubleshoot instances in which the AI could be made to lie to the user. Anthropic’s greater understanding of why the LLM behaves the way it does could allow for greater tuning options for Anthropic’s business clients.

SEE: 8 AI Business Trends, According to Stanford Researchers

Anthropic plans to use some of this research to further pursue topics related to the safety of generative AI and LLMs overall, such as exploring what features activate or remain inactive if Claude is prompted to give advice on producing weapons.

Another topic Anthropic plans to pursue in the future is the question: “Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors?”

TechRepublic has reached out to Anthropic for more information.

Source

May 23, 2024

148 3 minutes read

Anthropic’s Generative AI Research Reveals More About How LLMs Affect Security and Bias

What did Anthropic discover?

How manipulating features affects bias and cybersecurity

What does Anthropic’s research mean for business?

Bank-run accelerator programmes — what are they good for?

‘LLM-Free’ Is the New ‘100 Percent Organic’

Top 10 Biggest Car Manufacturers In The World 2024

Two Arrested for Burglary of Automobile on Highway 7 in Oxford – The Local Voice

10 Artificial General Intelligence (AGI) Companies To Know

Bank-run accelerator programmes — what are they good for?

‘LLM-Free’ Is the New ‘100 Percent Organic’

Top 10 Biggest Car Manufacturers In The World 2024

Two Arrested for Burglary of Automobile on Highway 7 in Oxford – The Local Voice

10 Artificial General Intelligence (AGI) Companies To Know

Using Data Analytics and Artificial Intelligence for Public Disclosures

Initiative aids student entrepreneurs – Chinadaily.com.cn

Liberia, Ghana Telecom Regulators in Discussions on Free Roaming | Science & Tech

Taking a holistic approach to cybersecurity

Electric-Vehicle Maker Polestar Is Making a Smartphone as a Companion to Its Cars

Volkswagen Kicks Off High-Speed EV Charging Network in Taiwan, Powered by Noodoe EV OS

BPM announces strategic alliance with X-Analytics to arm boards with enhanced data-driven solutions mitigating cyber risk

Fintech Innovations: Leveraging New Technologies for Smarter Investment Decisions

Telkom Indonesia inks cybersecurity deal with F5

The A.I. Boom Makes Millions for an Unlikely Industry Player: Anguilla

Biden Administration Announces New Tailpipe Rules Aimed to Expand EVs

Roche subsidiary Foundation Medicine opens new headquarters

Merck, Vertex, and Viking updates

Opinion | Cultivated Meat’s Empty Promise of Revolution

Oral obesity drug from Viking Therapeutics hits key early target

What did Anthropic discover?

How manipulating features affects bias and cybersecurity

What does Anthropic’s research mean for business?

Urbx introduces robotic inventory storage and retrieval system for rapid retail fulfillment

A Tesla Can Be Built Like A Lego Set. A Legacy Car, Not So Much: Teardown Experts

Related Articles

Is your cloud network really ready for Generative AI?

Google Launches a New Course Called “AI Essentials”: Learn How to Use Generative AI Tools to Increase Your Productivity

How Morristown-Based Insurance Company Crum & Forster Is Incorporating Generative AI For Everyday Innovation – NJ Tech Weekly

Will Generative AI Replace Developers?

Bank-run accelerator programmes — what are they good for?

‘LLM-Free’ Is the New ‘100 Percent Organic’

Top 10 Biggest Car Manufacturers In The World 2024

Two Arrested for Burglary of Automobile on Highway 7 in Oxford – The Local Voice

10 Artificial General Intelligence (AGI) Companies To Know

Using Data Analytics and Artificial Intelligence for Public Disclosures

Initiative aids student entrepreneurs – Chinadaily.com.cn

Liberia, Ghana Telecom Regulators in Discussions on Free Roaming | Science & Tech

Taking a holistic approach to cybersecurity

Electric-Vehicle Maker Polestar Is Making a Smartphone as a Companion to Its Cars

Volkswagen Kicks Off High-Speed EV Charging Network in Taiwan, Powered by Noodoe EV OS

BPM announces strategic alliance with X-Analytics to arm boards with enhanced data-driven solutions mitigating cyber risk

Fintech Innovations: Leveraging New Technologies for Smarter Investment Decisions

Telkom Indonesia inks cybersecurity deal with F5

The A.I. Boom Makes Millions for an Unlikely Industry Player: Anguilla

Biden Administration Announces New Tailpipe Rules Aimed to Expand EVs

Roche subsidiary Foundation Medicine opens new headquarters

Merck, Vertex, and Viking updates

Opinion | Cultivated Meat’s Empty Promise of Revolution

Oral obesity drug from Viking Therapeutics hits key early target