Artificial Intelligence Is a 'Black Box.' Maybe Not For Long

Billy Perrigo

May 20, 2024 at 2:39 PM·5 min read


Today’s artificial intelligence is often described as a “black box.” AI developers don’t write explicit rules for these systems; instead, they feed in vast quantities of data and the systems learn on their own to spot patterns. But the inner workings of the AI models remain opaque, and efforts to peer inside them to check exactly what is happening haven’t progressed very far. Beneath the surface, neural networks—today’s most powerful type of AI—consist of billions of artificial “neurons” represented as decimal-point numbers. Nobody truly understands what they mean, or how they work.

For those concerned about risks from AI, this fact looms large. If you don’t know exactly how a system works, how can you be sure it is safe?

On Tuesday, the AI lab Anthropic announced it had made a breakthrough toward solving this problem. Researchers developed a technique for essentially scanning the “brain” of an AI model, allowing them to identify collections of neurons, called “features,” corresponding to different concepts. And for the first time, they successfully used this technique on a frontier large language model: Anthropic’s Claude Sonnet, the lab’s second-most powerful system.

In one example, Anthropic researchers discovered a feature inside Claude representing the concept of “unsafe code.” By stimulating those neurons, they could get Claude to generate code containing a bug that could be exploited to create a security vulnerability. But by suppressing the neurons, the researchers found, Claude would generate harmless code.
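As a rough illustration of what "stimulating" and "suppressing" a feature could mean, the toy sketch below treats a feature as a direction in a small vector of neuron activations and shifts the activations along that direction. This is an assumption made purely for illustration, not a description of Anthropic's actual method; the function names `feature_strength` and `steer` and the four-neuron example values are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def feature_strength(activations: List[float], direction: List[float]) -> float:
    """How strongly the (hypothetical) feature is present:
    the projection of the activations onto the unit-length direction."""
    norm = dot(direction, direction) ** 0.5
    return dot(activations, direction) / norm


def steer(activations: List[float], direction: List[float], target: float) -> List[float]:
    """Return new activations whose feature strength equals `target`.

    A target of 0.0 suppresses the feature; a large target stimulates it.
    Components orthogonal to the direction are left unchanged.
    """
    norm = dot(direction, direction) ** 0.5
    unit = [d / norm for d in direction]
    current = dot(activations, unit)
    return [a + (target - current) * u for a, u in zip(activations, unit)]


if __name__ == "__main__":
    # Made-up numbers: four "neurons" and one assumed feature direction.
    activations = [1.0, 2.0, 0.5, -1.0]
    unsafe_code_direction = [3.0, 0.0, 4.0, 0.0]

    suppressed = steer(activations, unsafe_code_direction, 0.0)
    stimulated = steer(activations, unsafe_code_direction, 10.0)

    print("before:    ", feature_strength(activations, unsafe_code_direction))
    print("suppressed:", feature_strength(suppressed, unsafe_code_direction))
    print("stimulated:", feature_strength(stimulated, unsafe_code_direction))
```

In a real model the activations would have far more dimensions and the direction would have to be learned from data; the sketch only shows the arithmetic of turning one feature up or down while leaving the rest of the activations alone.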

The findings could have big implications for the safety of both present and future AI systems. The researchers found millions of features inside Claude, including some representing bias, fraudulent activity, toxic speech, and manipulative behavior. And they discovered that by suppressing each of these collections of neurons, they could alter the model’s behavior.

As well as helping to address current risks, the technique could also help with more speculative ones. For years, the primary method available to researchers trying to understand the capabilities and risks of new AI systems has simply been to chat with them. This approach, sometimes known as “red-teaming,” can help catch a model being toxic or dangerous, allowing researchers to build in safeguards before the model is released to the public. But it doesn’t help address one type of potential danger that some AI researchers are worried about: the risk of an AI system becoming smart enough to deceive its creators, hiding its capabilities from them until it can escape their control and potentially wreak havoc.

“If we could really understand these systems—and this would require a lot of progress—we might be able to say when these models actually are safe, or whether they just appear safe,” Chris Olah, the head of Anthropic’s interpretability team who led the research, tells TIME.

“The fact that we can do these interventions on the model suggests to me that we're starting to make progress on what you might call an X-ray, or an MRI [of an AI model],” Anthropic CEO Dario Amodei adds. “Right now, the paradigm is: let's talk to the model, let's see what it does. But what we'd like to be able to do is look inside the model as an object—like scanning the brain instead of interviewing someone.”

The research is still in its early stages, Anthropic said in a summary of the findings. But the lab struck an optimistic tone that the findings could soon benefit its AI safety work. “The ability to manipulate features may provide a promising avenue for directly impacting the safety of AI models,” Anthropic said. By suppressing certain features, it may be possible to prevent so-called “jailbreaks” of AI models, a type of vulnerability where safety guardrails can be disabled, the company added.

Researchers in Anthropic’s “interpretability” team have been trying to peer into the brains of neural networks for years. But until recently, they had mostly been working on far smaller models than the giant language models currently being developed and released by tech companies.

One of the reasons for this slow progress was that individual neurons inside AI models would fire even when the model was discussing completely different concepts. “This means that the same neuron might fire on concepts as disparate as the presence of semicolons in computer programming languages, references to burritos, or discussion of the Golden Gate Bridge, giving us little indication as to which specific concept was responsible for activating a given neuron,” Anthropic said in its summary of the research.

To get around this problem, Olah’s team of Anthropic researchers zoomed out. Instead of studying individual neurons, they began to look for groups of neurons that would all fire in response to a specific concept. This technique worked—and allowed them to graduate from studying smaller “toy” models to larger models like Anthropic’s Claude Sonnet, which has billions of neurons.
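The difference between reading a single neuron and reading a group of neurons can be shown with a toy example. The sketch below is a deliberately simplified assumption, not the researchers' technique: the table `CONCEPT_PATTERNS`, the four-neuron layout, and both function names are hypothetical, and the concepts are borrowed from Anthropic's quoted examples.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: which of four neurons fire (1) for each concept.
# Neuron 0 fires for every concept, so on its own it is ambiguous.
CONCEPT_PATTERNS: Dict[str, Tuple[int, ...]] = {
    "semicolons": (1, 1, 0, 0),
    "burritos": (1, 0, 1, 0),
    "Golden Gate Bridge": (1, 0, 0, 1),
}


def concepts_for_neuron(index: int) -> List[str]:
    """All concepts that make a single neuron fire."""
    return sorted(
        name for name, pattern in CONCEPT_PATTERNS.items() if pattern[index] == 1
    )


def concepts_for_pattern(activations: Tuple[int, ...]) -> List[str]:
    """All concepts whose full firing pattern matches the observed one."""
    return sorted(
        name for name, pattern in CONCEPT_PATTERNS.items() if pattern == activations
    )


if __name__ == "__main__":
    # One neuron alone cannot tell the three concepts apart.
    print(concepts_for_neuron(0))
    # The pattern across the whole group picks out a single concept.
    print(concepts_for_pattern((1, 0, 1, 0)))
```

Looking at neuron 0 alone returns all three concepts, while the joint pattern across the group identifies exactly one. Real models have no such lookup table; the groupings have to be discovered from the model's activations.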

Although the researchers said they had identified millions of features inside Claude, they cautioned that this number was nowhere near the true number of features likely present inside the model. Identifying all the features, they said, would be prohibitively expensive using their current techniques, because doing so would require more computing power than it took to train Claude in the first place (which cost somewhere in the tens or hundreds of millions of dollars). The researchers also cautioned that although they had found some features they believed to be related to safety, more study would still be needed to determine whether those features could reliably be manipulated to improve a model’s safety.

For Olah, the research is a breakthrough that proves the utility of his esoteric field, interpretability, to the broader world of AI safety research. “Historically, interpretability has been this thing on its own island, and there was this hope that someday it would connect with [AI] safety—but that seemed far off,” Olah says. “I think that’s no longer true.”

Write to Billy Perrigo at billy.perrigo@time.com.
