Google DeepMind launches new framework to assess the dangers of AI models

The Scoop

Preparing for a time when artificial intelligence is so powerful that it can pose a serious, immediate threat to people, Google DeepMind on Friday released a framework for peering inside AI models to determine if they’re approaching dangerous capabilities.

The paper released Friday describes a process in which DeepMind’s models will be reevaluated every time the compute used to train them increases six-fold, or after every three months of fine-tuning. Early warning evaluations are meant to cover the gaps between those checkpoints.
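To make that cadence concrete, here is a minimal sketch of the trigger logic in Python. The six-fold compute threshold and three-month window are the ones described in the paper; the function and variable names, and the example figures, are hypothetical and not part of DeepMind’s framework.

```python
# Illustrative sketch only: checks whether a model is due for reevaluation under
# the cadence described in the paper (a six-fold increase in training compute,
# or roughly three months of fine-tuning). All names here are hypothetical.
from datetime import datetime, timedelta

COMPUTE_GROWTH_TRIGGER = 6              # six-fold increase in training compute
FINE_TUNE_WINDOW = timedelta(days=90)   # roughly three months of fine-tuning


def needs_reevaluation(current_train_compute: float,
                       compute_at_last_eval: float,
                       last_eval_date: datetime,
                       now: datetime) -> bool:
    """Return True if either reevaluation trigger has fired."""
    compute_trigger = current_train_compute >= COMPUTE_GROWTH_TRIGGER * compute_at_last_eval
    time_trigger = (now - last_eval_date) >= FINE_TUNE_WINDOW
    return compute_trigger or time_trigger


# Example: training compute has grown eight-fold since the last evaluation.
print(needs_reevaluation(8e24, 1e24, datetime(2024, 3, 1), datetime(2024, 5, 17)))  # True
```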

DeepMind will work with other companies, academia and lawmakers to improve the framework, according to a statement shared exclusively with Semafor. It plans to start implementing its auditing tools by 2025.

Today, evaluating powerful frontier AI models is a largely ad hoc process that is constantly evolving as researchers develop new techniques. “Red teams” spend weeks or months testing them by trying out different prompts that might bypass safeguards. Then companies apply various techniques, from reinforcement learning to special prompts, to corral the models into compliance.

That approach works for today’s models because they aren’t powerful enough to pose much of a threat, but researchers believe a more robust process will be needed as models gain capabilities. Critics worry that, by the time people realize the technology has gone too far, it will be too late.

The Frontier Safety Framework released by DeepMind looks to address that issue. It’s one of several methods announced by major tech companies, including Meta, OpenAI, and Microsoft, to mitigate concerns about AI.

“Even though these risks are beyond the reach of present-day models, we hope that implementing and improving the framework will help us prepare to address them,” the company said.

Know More

DeepMind has been working on “early warning” systems for AI models for over a year. And it’s published papers on new methods to evaluate models that go far beyond the ones used by most companies today.

The Frontier Safety Framework incorporates those advances into a succinct set of protocols, including ongoing evaluation of models and mitigation steps researchers should take if they discover what it calls “critical capability levels.” That could be a model able to complete long-term, complex tasks, a capability DeepMind terms “exceptional agency,” or one able to write sophisticated malware.

DeepMind has set specific critical capability levels for four domains: autonomy, biosecurity, cybersecurity, and machine learning research and development.
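As a loose illustration of how those domains and thresholds might be organized, the sketch below maps each domain to example capability levels. The four domain names, “exceptional agency,” and the malware example come from the framework as reported above; the data structure, the mapping of examples to domains, and the placeholder entries are assumptions for illustration only.

```python
# Hypothetical sketch: only the four domain names and the two example capabilities
# mentioned above come from the framework; the mapping and placeholders are assumed.
CRITICAL_CAPABILITY_LEVELS = {
    "autonomy": ["exceptional agency: completes long-term, complex tasks"],
    "biosecurity": ["<threshold defined by the framework>"],
    "cybersecurity": ["writes sophisticated malware"],
    "machine_learning_r_and_d": ["<threshold defined by the framework>"],
}


def reached_levels(observed: set[str]) -> dict[str, list[str]]:
    """Return, per domain, any critical capability levels an evaluation has flagged."""
    return {
        domain: [level for level in levels if level in observed]
        for domain, levels in CRITICAL_CAPABILITY_LEVELS.items()
    }
```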

“Striking the optimal balance between mitigating risks and fostering access and innovation is paramount to the responsible development of AI,” the company said.

DeepMind will discuss the framework at an AI safety summit in Seoul next week, where other industry leaders will be in attendance.

Reed’s view

It’s encouraging that AI researchers at Google DeepMind are making progress on more scientific methods to determine what’s happening inside AI models, though they still have a way to go.

It’s also helpful for AI safety to see that, as researchers make breakthroughs on capabilities, they are also improving their ability to understand and ultimately control this software.

However, the paper released today is light on technical details about how these evaluations will be conducted. I’ve read several research papers DeepMind has put out on this subject and it appears that AI models will be used to evaluate emerging models.

I hope to learn more about how, exactly, this will be done and will write more about that in the future. For now, I think it’s fair to say we don’t really know if the technology is currently there to make this framework a success.

There’s an interesting regulatory component to this, as well. A new comprehensive AI bill in California, sponsored by state Sen. Scott Wiener, would require AI companies to evaluate the dangers of models before they are trained. This framework is the first I’ve seen that might make compliance with that law possible. But again, it’s not clear whether it’s technically possible today.

Building these techniques could be useful in another way: helping companies forecast how AI model capabilities will change in the coming months and years. That knowledge could help product teams design new offerings more quickly, giving an advantage to Google and other companies with the ability to do these evaluations.

Room for Disagreement

Critics of AI, such as Eliezer Yudkowsky, have expressed skepticism that humans will be able to determine whether an AI model has gained “superintelligence” fast enough to make a difference.

Their argument is that the very nature of the technology means that it will be able to outsmart anything humans can come up with.
