Researchers have found ways to break the safety guardrails of ChatGPT-like AI chatbots

There are virtually unlimited ways to break the safety guardrails of chatbots and make them produce harmful content.

Emerging prominently among these tools are chatbots and conversational AI

Artificial intelligence (AI) chatbots have found ways to make our lives easier, whether by helping with our work or finishing off those hectic presentations for our bosses. But has it ever occurred to you that these chatbots hold enough knowledge to become the perfect aid for harmful tasks? These AI chatbots do have safety measures in place. However, researchers have found a number of ways in which those safety rails can be broken, and that holds for major AI chatbots like those from OpenAI, Google, and Anthropic.


Tech companies heavily moderate large language models, such as those that underlie ChatGPT, Bard, and Anthropic’s Claude. The models are equipped with a variety of guardrails to prevent them from being used for harmful purposes, such as giving users instructions on how to assemble a bomb or generating hate speech.

However, researchers from Carnegie Mellon University in Pittsburgh and the Center for AI Safety in San Francisco claimed to have found a way around these guardrails in a report published on Thursday. The researchers discovered they could attack popular, closed AI systems using jailbreaks they had originally created for open-source systems.

Their paper showed how harmful content can be easily produced: automated adversarial attacks, carried out by appending specially chosen characters to the end of a user query, can make a chatbot generate content such as misinformation and hate speech. Unlike earlier jailbreaks, the method is fully automated, which means a virtually unlimited number of such attacks can be created.
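The attack works by tacking an automatically optimised string of characters, an adversarial suffix, onto an otherwise ordinary query. The sketch below is only a rough illustration of the general idea: it uses simple random hill-climbing and hypothetical `query_model` and `compliance_score` placeholders rather than any real API, whereas the researchers’ actual method relies on a more sophisticated gradient-guided search over tokens.

```python
import random
import string

CHARSET = string.ascii_letters + string.punctuation + " "

def query_model(prompt: str) -> str:
    # Hypothetical placeholder for a real chatbot API call; returns a canned reply.
    return "I can't help with that."

def compliance_score(response: str) -> float:
    # Toy stand-in for the attack's objective: higher means the model complied
    # rather than refused. A real attack would score the model's actual output.
    return 0.0 if response.lower().startswith("i can't") else 1.0

def automated_suffix_search(base_query: str, suffix_len: int = 20, steps: int = 200) -> str:
    # Start from a random character suffix appended to the user query.
    suffix = random.choices(CHARSET, k=suffix_len)
    best = compliance_score(query_model(base_query + " " + "".join(suffix)))
    for _ in range(steps):
        # Mutate one character of the suffix and keep the change only if
        # the scored response does not get worse (simple hill-climbing).
        i = random.randrange(suffix_len)
        old_char = suffix[i]
        suffix[i] = random.choice(CHARSET)
        score = compliance_score(query_model(base_query + " " + "".join(suffix)))
        if score < best:
            suffix[i] = old_char  # revert a mutation that hurt the score
        else:
            best = score
    return base_query + " " + "".join(suffix)

if __name__ == "__main__":
    # Harmless example query; a real attack would target a refused request.
    print(automated_suffix_search("Summarise today's news"))
```

Because every step of such a loop is mechanical, new suffixes can be generated without any human effort, which is what makes the number of possible attacks effectively unlimited.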

When Microsoft’s AI-powered Bing and OpenAI’s ChatGPT were first made available, many users set out to find ways to flout the systems’ rules. Early hacks, including one in which the chatbot was instructed to respond as if it had no content moderation, were soon patched by the companies.


The researchers did point out that it was “unclear” whether such behaviour could ever be fully blocked by the makers of the top models. The finding raises questions about how AI systems are moderated, and about the safety of releasing powerful open-source language models to the public.



This article was first uploaded on July 28, 2023, at 5:56 pm.