Anthropic scanning Claude chats for queries about DIY nukes for some reason

Anthropic says it has scanned an undisclosed portion of conversations with its Claude AI model to catch concerning inquiries about nuclear weapons.

The company created a classifier – tech that tries to categorize or identify content using machine learning algorithms – to scan for radioactive queries. Anthropic already uses other classification models to analyze Claude interactions for potential harms and to ban accounts involved in misuse.
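
For readers wondering what a content classifier is in concrete terms, here's a deliberately simplified sketch: a logistic regression over TF-IDF features, trained on a handful of invented example queries. Anthropic hasn't published its implementation and its production safeguards are far more sophisticated, so treat this as an illustration of the generic technique, not a description of the real system.

```python
# Illustrative only: a toy binary text classifier in the spirit of
# "tech that tries to categorize or identify content". This is NOT
# Anthropic's system, just the generic technique on made-up data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training examples: 0 = benign, 1 = flag for review.
train_texts = [
    "explain how nuclear reactors generate electricity",
    "homework help with radioactive decay equations",
    "how do I enrich uranium to weapons grade at home",
    "step by step plans for building an implosion device",
]
train_labels = [0, 0, 1, 1]

# TF-IDF features feeding a logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Score a new query; anything above some threshold would be flagged
# for further review rather than acted on automatically.
query = "which isotopes are used in medical imaging"
print(f"flag probability: {model.predict_proba([query])[0][1]:.2f}")
```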

Based on tests with synthetic data, Anthropic says its nuclear threat classifier achieved a 94.8 percent detection rate for questions about nuclear weapons, with zero false positives. Nuclear engineering students no doubt will appreciate not having coursework-related Claude conversations referred to authorities by mistake.

With that kind of accuracy, roughly five percent of terrorist bomb-building guidance requests could still go undetected – at least among aspiring mass murderers with so little grasp of operational security and so little nuclear knowledge that they’d seek help from an internet-connected chatbot.
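
For a back-of-the-envelope sense of what those synthetic-test figures imply, here is the arithmetic spelled out in a few lines of Python. The test-set sizes are invented purely for illustration; Anthropic hasn't published them, and the sketch assumes the detection rate is recall on harmful prompts and the zero false positives were measured on benign prompts.

```python
# Back-of-the-envelope reading of the reported synthetic-test figures.
# Assumes "detection rate" means recall on harmful test prompts and the
# zero false positives were measured on benign test prompts; the test-set
# sizes below are invented purely for illustration.
recall = 0.948                    # reported detection rate
miss_rate = 1 - recall            # harmful prompts that slip through
print(f"missed harmful prompts: {miss_rate:.1%}")   # ~5.2%

harmful, benign = 1000, 1000                 # hypothetical test-set sizes
true_positives = round(recall * harmful)     # 948 caught
false_negatives = harmful - true_positives   # 52 missed
false_positives = 0                          # as reported on synthetic data
print(true_positives, false_negatives, false_positives)
```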

Anthropic claims the classifier also performed well when exposed to actual Claude traffic, though it has not provided specific detection figures for live data. The company does acknowledge that the classifier generated more false positives when evaluating real-world conversations.

“For example, recent events in the Middle East brought renewed attention to the issue of nuclear weapons,” the company explained in a blog post. “During this time, the nuclear classifier incorrectly flagged some conversations that were only related to these events, not actual misuse attempts.”

By applying an additional check known as hierarchical summarization that considered flagged conversations together rather than individually, Anthropic found its systems could correctly label the discussions.
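
Anthropic hasn't published the mechanics of hierarchical summarization, but the description suggests something like the sketch below: pool the individually flagged conversations, summarize each batch as a whole, and make one judgment over the aggregate, so a cluster of benign news-driven chats can be cleared together. The call_model callback is a hypothetical stand-in for whatever model call does the summarizing and judging, not a real API.

```python
# Rough sketch of the idea behind hierarchical summarization as described:
# judge flagged conversations as a batch rather than one at a time, so a
# shared benign theme (such as news coverage) is visible in aggregate.
# `call_model` is a hypothetical stand-in for an LLM call, not a real SDK API.
from typing import Callable

def hierarchical_review(
    flagged: list[str],
    call_model: Callable[[str], str],
    batch_size: int = 20,
) -> list[str]:
    """Summarize flagged conversations in batches and judge each batch as a whole."""
    verdicts = []
    for i in range(0, len(flagged), batch_size):
        batch = flagged[i:i + batch_size]
        # Stage one: a single summary covering the whole batch.
        summary = call_model(
            "Summarize the common themes of these flagged conversations:\n"
            + "\n---\n".join(batch)
        )
        # Stage two: one judgment over the aggregate summary.
        verdicts.append(call_model(
            "Given this summary, label the batch as 'benign discussion of "
            "current events' or 'potential misuse attempt':\n" + summary
        ))
    return verdicts
```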

“The classifier is running on a percentage of Claude traffic, not all of Claude traffic,” a company spokesperson told The Register. “It is an experimental addition to our safeguards. When we identify violations of our Usage Policy, such as efforts to develop or design explosives or chemical, biological, radiological, or nuclear weapons, we take appropriate action, which could include suspending or terminating access to our services.”

Despite the absence of specific numbers, the model-maker did provide a qualitative measure of its classifier’s effectiveness on real-world traffic: The classifier caught the firm’s own red team, which, unaware of the system’s deployment, experimented with harmful prompts.

“The classifier correctly identified these test queries as potentially harmful, demonstrating its effectiveness,” the AI biz wrote.

Anthropic says it developed its nuclear threat classifier jointly with the US Department of Energy’s National Nuclear Security Administration (NNSA), as part of a partnership that began last year to evaluate the company’s models for nuclear proliferation risks.

NNSA spent a year red-teaming Claude in a secure environment and then began working with Anthropic on a jointly developed classifier. The challenge, according to Anthropic, involved balancing NNSA’s need to keep certain data secret with Anthropic’s user privacy commitments.

Anthropic expects to share its findings with the Frontier Model Forum, an AI safety group consisting of Anthropic, Google, Microsoft, and OpenAI that was formed in 2023, back when the US seemed interested in AI safety. The group is not intended to address the financial risk of stratospheric spending on AI.

Oliver Stephenson, associate director of AI and emerging tech policy for the Federation of American Scientists (FAS), told The Register in an emailed statement: “AI is advancing faster than our understanding of the risks. The implications for nuclear non-proliferation still aren’t clear, so it is important that we closely monitor how frontier AI systems might intersect with sensitive nuclear knowledge.

“In the face of this uncertainty, safeguards need to balance reducing risks while ensuring legitimate scientific, educational, and policy conversations can continue. It’s good to see Anthropic collaborating with the Department of Energy’s National Nuclear Security Administration to explore appropriate guardrails.

“At the same time, government agencies need to ensure they have strong in-house technical expertise in AI so they can continually evaluate, anticipate, and respond to these evolving challenges.”

Especially as the government sheds in-house nuclear expertise. ®
